Evaluating the numerical instability in fuzzy clustering validation of high-dimensional data

Fernanda Eustáquio*, Tatiane Nogueira

Computer Science Department, Federal University of Bahia at Salvador, BA 40170-110, Brazil
Article history: Received 30 January 2019; Received in revised form 20 September 2019; Accepted 25 October 2019.

Keywords: Fuzzy clustering validation; Fuzzy cluster validity indices; Fuzzy c-means; Monotonic tendency; Number of clusters; High-dimensional data
Abstract

Fuzzy clustering validation of high-dimensional datasets is only possible using a reliable cluster validity index (CVI). A good CVI must correctly recognize a data structure, and its validation must be independent of any parameter of a clustering algorithm or data property. However, some classical fuzzy CVIs, such as Partition Coefficient (PC), Partition Entropy (PE) and Fukuyama-Sugeno (FS), have a monotonic tendency as a function of the number of clusters. Although the literature presents extensive investigations of this tendency, they were conducted on low-dimensional data, in which this data property does not affect the clustering behavior. In order to investigate how such aspects affect the fuzzy clustering results of high-dimensional data, in this work we have clustered objects of thirteen real datasets using the Fuzzy c-Means algorithm. The fuzzy partitions were validated by PC, PE, FS and some improvements proposed to deal with their monotonic tendency, totaling eight fuzzy CVIs analyzed. Besides the analysis of the number of clusters selected by the CVIs, the Mann-Kendall test was performed to verify statistically the monotonic trend of the CVI results. From these two analyses, the Modified Partition Coefficient and Scaled Partition Entropy indices were successful in improving the PC and PE indices, respectively.
1. Introduction

High-dimensional datasets are often encountered in many areas, such as medicine, where DNA microarrays can produce a large number of measurements; images, where each pixel within an image is a dimension; and text documents, where the number of dimensions is equal to the size of the vocabulary when a word-frequency vector is used. On the other hand, clustering datasets with hundreds or thousands of dimensions is still a challenging problem [1].

Due to the complexity of a high-dimensional dataset, its visualization is very difficult to interpret and only possible by applying a feature transformation technique such as Principal Component Analysis (PCA). In this type of transformation there is loss of data information, and the complete enumeration of all subspaces becomes intractable. However, clustering methods are still used to reveal meaningful patterns even in high-dimensional data, and may be essential when seeking new information about a dataset. For example, in microarray datasets, cell samples can reveal tumors; in image datasets, some clusters can reveal image segmentations that extract regions of interest from the background; and in text collections, clustering of documents can reveal topics of interest.
* Corresponding author.
E-mail addresses: [email protected] (F. Eustáquio), [email protected] (T. Nogueira).
https://doi.org/10.1016/j.tcs.2019.10.039
Besides the importance of clustering such datasets, the introduction of fuzzy set theory into the clustering of high-dimensional data provides a favorable mechanism to capture the uncertainty and imprecision inherent to any data. Fuzzy set theory is a natural computing paradigm [2], since it can capture these real-world features, while crisp or hard clustering could be considered an artificial paradigm. Fuzzy clustering is suitable for high-dimensional data because it is useful when the boundaries between clusters are ambiguous and not well separated [3]. Overlapping boundaries among clusters are measured by means of fuzzy memberships, i.e., the degrees of belonging of objects with respect to clusters. In that sense, the Fuzzy c-Means (FCM) [4] clustering algorithm has been widely studied and applied in a variety of key areas, and Cluster Validity Indices (CVIs) play a very important role in fuzzy clustering [5], since they have been designed to study the performance of a clustering algorithm.

It is well known that CVIs that use a relative criterion choose their best partition by comparing each index value for different numbers of clusters and, because of that, their results must be independent of any parameter of a clustering algorithm. For datasets with hundreds or thousands of dimensions, the validation made by a fuzzy CVI is the most reliable and effective way to find the optimal number of clusters that best describes the data structure [5]. However, some classical fuzzy CVIs, such as Partition Coefficient (PC) [6,7], Partition Entropy (PE) [8], Fukuyama-Sugeno (FS) [9] and Xie-Beni (XB) [10], are sensitive to parameters of FCM, such as the fuzzification factor m and the number of clusters c, presenting the following problem: PC, FS and XB have a decreasing monotonic tendency with c, i.e., their values become smaller as c increases, and the opposite occurs with PE. Because of that, they present limitations concerning the lack of computationally fast methods to set optimal values of FCM parameters. According to [11], wrong cluster parameter values may either lead to the inclusion of random fluctuations in the results or ignore potentially important data.

Besides that, several comparative studies of such validity indices have been proposed to evaluate FCM partitions, but most of them were performed by analyzing datasets with low dimensionality. For example, among the most cited works about CVIs, [12] examined the role of the fuzzification factor m in the validation of FCM clustering of 4-dimensional datasets; in [5] the biggest dataset had thirty dimensions; and in [13], the biggest image had 296 dimensions. Therefore, such studies are inconclusive about the suitability of these indices for high-dimensional data.

Many fuzzy CVIs have been proposed since the classical indices PC, PE, FS and XB from the 1970s to the 1990s. Because of that, the monotonic tendency of these indices is well known, and it is natural that indices were first proposed to improve them. Among these indices, XB and its improvements, such as Kwon's [14] and Tang's [15] indices, were not used in this work because it is well known that the XB tendency is conditioned on the number of clusters being close to the number of objects in the dataset (c → n), which is unusual in the literature ([12,4,16–18,5]). The monotonic tendency of the classical fuzzy CVIs PC, PE and XB is well known in the literature [21,5,17,19,13], but it is more difficult to find a work that describes the FS monotonic problem [22,23,16].
Moreover, it is not well known whether the indices that were proposed to correct or reduce the monotonic tendency of PC (Improved Partition Coefficient (IPC) [24], Modified Partition Coefficient (MPC) [21]) and PE (Normalized Partition Entropy (NPE) [25]) can correct it for different values of the exponent m and in a high-dimensional context. It is also not well known which of the indices proposed to improve the same original index performs better, or whether a recent index that addresses this problem in its own definition, such as the Robust index of Yang et al. (MPO) [19], really works well. To overcome these drawbacks, in [20] we made an exploratory investigation of the monotonic tendency of seven fuzzy clustering validity indices: PC, MPC, IPC, PE, NPE, FS and MPO. In that investigation, we used ten different real high-dimensional datasets to check whether or not the indices select correct partitions of a given dataset: three of the datasets were sets of documents and seven were microarray datasets.

Although the previous work [20] analyzed the monotonic problem of some fuzzy CVIs for FCM partitions of high-dimensional datasets, it is possible to enumerate significant differences in relation to this current extended work. First, in this work we have added the Scaled Partition Entropy index (PEB) [8], which is the first normalized version of the PE index. Besides that, we have used seven sets of documents and six microarray datasets, yielding a more balanced investigation concerning the type of data. This improvement allowed us to observe the numerical instability of fuzzy validation indices when clustering datasets whose number of dimensions ranges from 456 to 22,926; the datasets used in [20] had between 456 and 11,026 dimensions. Second, in this current work, the fuzzy CVI results were analyzed separately for the textual and microarray datasets because, although both are high-dimensional, they have different properties: DNA microarray datasets usually are not sparse, and their number of objects is much smaller than their number of features. Third, we check, in the current work, that using just one or two stopping criteria for the Fuzzy c-Means algorithm can affect the FS notion of compact and well-separated clusters. Finally, in [20] it was not verified whether an index had the same optimal value for different pseudo-partitions, resulting in an indefiniteness about the optimal number of clusters, as was done in the current work. Therefore, the scope of this current work goes beyond checking the monotonic tendency of the classical indices in a high-dimensional context. Neither the studies [26,20], in which FCM was used to cluster high-dimensional datasets, nor the papers on the improved versions of the indices ([8,25,21,24]) made similar analyses or showed the same results.

This current study has the following objectives: (i) to verify whether the monotonic problem of some indices is more common in FCM partitions of high-dimensional data than of low-dimensional data, since it was shown in [26] that fuzzy CVIs perform poorly for the first type of data even with an appropriate dissimilarity measure; (ii) to evaluate whether the improved versions of the
indices that have the monotonic tendency really work in the context of FCM clustering of high-dimensional data, where they are still only slightly explored in the literature; (iii) to check whether a recent index that addresses the monotonic problem in its definition fails in a high-dimensional context; (iv) to compare the performances of indices that were proposed to improve the same original index.

To present such analysis, this paper is organized as follows. In Section 2, the theoretical background of the Fuzzy c-Means clustering technique is presented and, in Section 3, we define the fuzzy CVIs analyzed. In Section 4, the methodology, experimental setup and results are presented. In Section 5, we present a numerical analysis of the fuzzy CVI results, and in Section 6 we discuss the results of each index. Finally, in Section 7, the conclusion and future directions of this research are presented.

2. Fuzzy clustering

The Fuzzy c-Means (FCM) algorithm is one of the most widely used fuzzy clustering models [27]. Developed by Dunn in [28], the FCM algorithm was generalized in [29] by allowing any value greater than 1 for the fuzzification factor m (Dunn's function is obtained by setting m = 2), which will be detailed later. FCM is a pseudo-partitioning algorithm that, differently from k-Means [30], assigns membership degrees to an object in each cluster. This pseudo-partition is defined as follows [31]:

Definition 2.1. Let X = {x_1, x_2, ..., x_n} be a set of n objects and c be the number of clusters (2 ≤ c < n). A fuzzy pseudo-partition is a family of fuzzy subsets of X, {A_1, A_2, ..., A_c}, denoted by an n × c matrix U = [A_i(x_k)], where A_i(x_k) ∈ [0, 1] is the membership degree of an object x_k (1 ≤ k ≤ n) in a cluster A_i (1 ≤ i ≤ c), defined in Eq. (1).
$$A_i(x_k) = \frac{1}{\sum_{j=1}^{c}\left(\dfrac{\|x_k - v_i\|}{\|x_k - v_j\|}\right)^{\frac{2}{m-1}}}, \quad 1 \le i \le c;\ 1 \le k \le n, \tag{1}$$
where v_i is the center of the cluster A_i, and the value of A_i(x_k) reveals the proximity of x_k to v_i, i.e., if x_k is closer to v_i than to the other cluster centers v_j, then A_i(x_k) will be the maximum membership degree of x_k. The membership degrees in an FCM partition must satisfy the following constraints:
$$\sum_{i=1}^{c} A_i(x_k) = 1, \tag{2}$$

$$0 < \sum_{k=1}^{n} A_i(x_k) < n. \tag{3}$$
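For instance (an illustrative example, not taken from the experiments), for n = 3 objects and c = 2 clusters, the matrix

$$U = \begin{bmatrix} 0.9 & 0.1 \\ 0.4 & 0.6 \\ 0.2 & 0.8 \end{bmatrix}$$

satisfies both constraints: each row sums to 1 (Eq. (2)), and each column sum, here 1.5 for both clusters, lies strictly between 0 and n (Eq. (3)).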
FCM is an iterative process that is initialized randomly with a pseudo-partition matrix U or with a set V = {v_1, ..., v_c} of cluster centers v_i (1 ≤ i ≤ c) and, in the following steps and iterations, updates U by Eq. (1) and V by Eq. (4):
$$v_i = \frac{\sum_{k=1}^{n} [A_i(x_k)]^m\, x_k}{\sum_{k=1}^{n} [A_i(x_k)]^m}, \quad 1 \le i \le c. \tag{4}$$
These updates try to minimize the dissimilarity between an object x_k and a cluster center v_i through the following objective function:
$$J_m(U, V; X) = \sum_{i=1}^{c}\sum_{k=1}^{n} [A_i(x_k)]^m\, \|x_k - v_i\|^2, \tag{5}$$
where m (1 < m < ∞) is the fuzzification factor that determines the degree of fuzziness of the membership degrees in the clusters; m is usually chosen equal to 2. During the clustering procedure, the cluster centers and the pseudo-partition matrix are updated until a stopping criterion is satisfied. This criterion can be a maximum number of iterations T and/or a convergence value E. If this last criterion is satisfied, the current cluster centers can be considered the most representative points of the clusters. The FCM process initialized by a pseudo-partition matrix U can be summarized in the following steps [5]:

Step 1: Given a preselected number of clusters c and chosen values of m, E and/or T, initialize U^(0) such that the constraint defined in Eq. (2) is satisfied. Then, at iteration t = 1, 2, ...;
Step 2: Calculate the fuzzy cluster centers v_i for i = 1, 2, ..., c using Eq. (4);
Step 3: Employ Eq. (1) to update U^(t);
Step 4: If the improvement in J_m(U, V; X) is less than the convergence value E, i.e., if ||U^(t−1) − U^(t)|| ≤ E, or the maximum number of iterations T was reached, then halt; otherwise go to Step 2.

In the case of the FCM algorithm being initialized by the set of cluster centers V, the steps shown above are followed by swapping the roles of the pseudo-partition U and the set V.

A main problem in fuzzy clustering is that the number of clusters c must be specified beforehand. Selections of different numbers of initial clusters result in different clustering partitions. Thus, it is necessary to validate each fuzzy c-pseudo-partition once it is found [5]. This validation is performed by functions named fuzzy cluster validity indices, which will be shown in the next section.

Table 1
Notations used in this work.

| Notation | Meaning |
| c | number of clusters |
| n | number of objects to be clustered |
| X | {x_1, x_2, ..., x_n} is a set of n objects |
| x_k | one object to be clustered, 1 ≤ k ≤ n |
| A_i(x_k) | membership degree of x_k in cluster A_i, 1 ≤ i ≤ c |
| U | fuzzy pseudo-partition defined as U = [A_i(x_k)] |
| m | fuzzification factor |
| v_i | cluster center of cluster A_i |
| v̄ | Σ_{i=1}^{c} v_i / c is the mean of all cluster centers |
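To make the procedure above concrete, the following minimal Python sketch illustrates one FCM run initialized by a random pseudo-partition, under our reading of Eqs. (1), (2), (4) and (5); it is not the authors' implementation, and it uses the Euclidean norm of Eq. (1), whereas the experiments in Section 4 swap in the Cosine metric.

```python
import numpy as np

def fcm(X, c, m=2.0, E=1e-5, T=100, seed=0):
    """Minimal FCM sketch: X has shape (n, p); returns (U, V), U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)              # enforce Eq. (2): rows sum to 1
    V = None
    for _ in range(T):                             # stopping criterion T
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # Eq. (4): cluster centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                      # guard against zero distances
        U_new = d ** (-2.0 / (m - 1.0))            # Eq. (1), up to row normalization
        U_new /= U_new.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() <= E   # stopping criterion E
        U = U_new
        if converged:
            break
    return U, V
```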
3. Fuzzy validity indices

In this section, we present the description of each index investigated. They are associated with, but not specifically designed for, the FCM clustering algorithm. For each index description, consider the notations presented in Table 1. In this work, we selected the three classical indices Partition Coefficient (PC) [6,7], Partition Entropy (PE) [8] and Fukuyama-Sugeno (FS) [9], which are well known to have the monotonic tendency, and four extensions proposed to improve them: (i) Modified Partition Coefficient (MPC) [21] and Improved Partition Coefficient (IPC) [24] for the PC index; (ii) Scaled Partition Entropy (PEB) [8,32] and Normalized Partition Entropy (NPE) [25,32] for PE. In addition to these indices, we use the Robust index of Yang et al. [19] because it adds a modification of MPC as part of its definition and because it has a term intended to avoid the monotonic tendency with the number of clusters.

3.1. Partition Coefficient (PC)

Partition Coefficient (PC) calculates the relative average of the fuzzy intersection between pairs of fuzzy subsets in U by their algebraic product. PC is a maximization index and its value ranges in [1/c, 1]. It is defined as

$$PC(U) = \frac{1}{n}\sum_{i=1}^{c}\sum_{k=1}^{n} [A_i(x_k)]^2. \tag{6}$$
3.2. Modified Partition Coefficient (MPC)

Modified Partition Coefficient (MPC) was proposed in an effort to correct the PC monotonic tendency. MPC is also a maximization index and its value ranges in [0, 1]. This index is defined as

$$MPC(U) = 1 - \frac{c}{c-1}\,\big(1 - PC(U)\big). \tag{7}$$
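To see why this linear transformation pins down the lower bound (a short derivation from Eqs. (6) and (7), not spelled out in this form in the text): at the maximally fuzzy partition A_i(x_k) = 1/c, PC attains its minimum 1/c, and then

$$MPC = 1 - \frac{c}{c-1}\left(1 - \frac{1}{c}\right) = 1 - \frac{c}{c-1}\cdot\frac{c-1}{c} = 0,$$

so the minimum of MPC is 0 for every c, whereas the minimum of PC, 1/c, itself shrinks as c grows.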
3.3. Improved Partition Coefficient (IPC)

Improved Partition Coefficient (IPC) was proposed to overcome two disadvantages of PC: its monotonic tendency and its sensitivity to the fuzzification factor m. IPC is a maximization index defined as the difference between two successive variance rates of PC(U_c), r[(c−1) → c] and r[c → (c+1)]. The IPC index is calculated for 2 ≤ c ≤ C_max − 1, where PC(U_1) = 1. It is defined as

$$IPC(U_c) = r[(c-1) \to c] - r[c \to (c+1)] = 100 \cdot \left( \frac{PC(U_{c-1}) - PC(U_c)}{PC(U_{c-1})} - \frac{PC(U_c) - PC(U_{c+1})}{PC(U_c)} \right). \tag{8}$$
3.4. Partition Entropy (PE)

Partition Entropy (PE) is a minimization index that uses Shannon's entropy function $\left(-\sum_{i=1}^{c} A_i(x_k)\log_a A_i(x_k)\right)$ to describe the fuzzy uncertainty in each object x_k [8]. Thus, to measure the fuzzy uncertainty in a pseudo-partition U, PE calculates the average fuzzy entropy as follows:

$$PE(U) = -\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} A_i(x_k)\log_a A_i(x_k), \tag{9}$$
where a ∈ (1, ∞) and the PE value ranges in [0, log_a c].

3.5. Scaled Partition Entropy (PEB)

Scaled Partition Entropy (PEB) is the first normalized version of the Partition Entropy index. This minimization index was proposed to validate fuzzy clustering by evaluating the relative fuzziness incorporated in different pseudo-partitions. PEB was proposed as an improvement of the PE minimization process by sharpening its lower bound [8]. The PEB value ranges in [0, 1]. It is defined as

$$PEB(U) = \frac{PE(U)}{\log_a c}. \tag{10}$$
3.6. Normalized Partition Entropy (NPE)

Normalized Partition Entropy (NPE) is Dunn's normalized version of PE, proposed to counter the tendency of PE to decrease towards zero when c → n: from his observation in [25], for c = 1 or c = n, PE will be equal to zero [4]. Suppose that X_0 is a set of n objects in a p-dimensional space. X_0 only has a cluster structure for c = 1 or c = n, and Dunn conjectured that the Partition Entropy value of X_0 for 2 ≤ c < n follows the equation PE_0(U) = 1 − (c/n). So, by comparing PE with the expected PE_0, which represents the null hypothesis that X (see Table 1) does not have a cluster structure, if a significant deviation of PE from PE_0 is verified, then the likelihood that X has cluster substructure increases [4]. The Normalized Partition Entropy index is defined as

$$NPE(U) = \frac{PE(U)}{1 - (c/n)}, \tag{11}$$
where the minimization of its value corresponds to maximizing the likelihood that X has a cluster substructure [4].

3.7. Fukuyama-Sugeno index (FS)

The Fukuyama-Sugeno index (FS) is a minimization index that measures the difference between the compactness, calculated by the intra-cluster distance, and the separation between clusters, calculated by the inter-cluster distance. FS is defined as

$$FS(U, V; X) = \sum_{i=1}^{c}\sum_{k=1}^{n} [A_i(x_k)]^m \left( \|x_k - v_i\|^2 - \|v_i - \bar{v}\|^2 \right). \tag{12}$$
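The indices of Subsections 3.1–3.7 reduce to a few array reductions over U (and, for FS, over V and X). The sketch below is our hedged reading of Eqs. (6)–(12), with a = 10 for the entropies as in the experimental setup of Section 4; it is illustrative, not the authors' code.

```python
import numpy as np

def pc(U):
    """Partition Coefficient, Eq. (6); U has shape (n, c)."""
    return (U ** 2).sum() / U.shape[0]

def mpc(U, c):
    """Modified Partition Coefficient, Eq. (7)."""
    return 1.0 - (c / (c - 1.0)) * (1.0 - pc(U))

def ipc(pc_prev, pc_cur, pc_next):
    """Improved Partition Coefficient, Eq. (8), from PC at c-1, c and c+1."""
    return 100.0 * ((pc_prev - pc_cur) / pc_prev - (pc_cur - pc_next) / pc_cur)

def pe(U, a=10.0):
    """Partition Entropy, Eq. (9), with logarithm base a."""
    P = np.clip(U, 1e-300, 1.0)                    # avoid log(0)
    return -(U * (np.log(P) / np.log(a))).sum() / U.shape[0]

def peb(U, c, a=10.0):
    """Scaled Partition Entropy, Eq. (10)."""
    return pe(U, a) / (np.log(c) / np.log(a))

def npe(U, c, a=10.0):
    """Normalized Partition Entropy, Eq. (11)."""
    return pe(U, a) / (1.0 - c / U.shape[0])

def fs(U, V, X, m=2.0):
    """Fukuyama-Sugeno index, Eq. (12)."""
    v_bar = V.mean(axis=0)                                     # mean of the centers
    d_xv = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # ||x_k - v_i||^2
    d_vv = ((V - v_bar) ** 2).sum(axis=1)                      # ||v_i - v_bar||^2
    return ((U ** m) * (d_xv - d_vv[None, :])).sum()
```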
3.8. Robust index of Yang et al. (MPO)

The Robust index of Yang et al. (MPO) is a maximization index proposed to decrease the effect of noise and outliers. This index is defined as the difference between the compactness (Eq. (14)) and the separation of clusters, which is given by the sum of the separation degrees of all clusters over all objects (Eq. (15)):

$$MPO(U) = MPC(U) - Sep(U), \tag{13}$$

$$MPC(U) = \left(\frac{c+1}{c-1}\right)^{1/2} \frac{\sum_{k=1}^{n}\sum_{i=1}^{c} [A_i(x_k)]^2}{\min_{1\le i\le c}\sum_{k=1}^{n} [A_i(x_k)]^2}, \tag{14}$$

$$Sep(U) = \frac{1}{n}\sum_{k=1}^{n}\left(\sum_{i=1}^{c-1}\sum_{j=i+1}^{c} O_{ijk}(U)\right), \tag{15}$$
where

$$O_{ijk}(U) = \begin{cases} 1 - |A_i(x_k) - A_j(x_k)| & \text{if } |A_i(x_k) - A_j(x_k)| \ge T_o,\ i \ne j, \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

The compactness was defined as a modification of the Modified Partition Coefficient index (Eq. (7)). In Eq. (14), the term in the denominator is used to deal with different scales, and the term ((c+1)/(c−1))^{1/2} is used to adjust the value of MPC(U) to avoid the monotonic tendency with the number of clusters [19]. In Eq. (15), O_{ijk} can reduce the effect of noise and outliers because it uses the threshold T_o to remove the objects that are scattered on the borders of clusters. Eq. (16) indicates the degree of separation of x_k with respect to the clusters A_i and A_j.

In the next section, we present the experiments, with datasets, experimental setup, results and discussion.

4. Experimental results

4.1. Datasets

To evaluate the monotonic tendency of some fuzzy CVIs, thirteen real high-dimensional datasets of two types were clustered by Fuzzy c-Means (FCM). These datasets are summarized in Table 2.

Table 2
Datasets with their related number of dimensions (p), objects (n), expected number of clusters (#classes), domain and matrix sparsity.

| Dataset | p | n | #classes | Domain | Matrix sparsity (%) |
| Sorlie | 456 | 85 | 5 | Breast Cancer | 1.19 |
| Christensen | 1,413 | 217 | 3 | N/A | 0.46 |
| Khan | 2,308 | 63 | 4 | Small Round Blue Cell Tumors | 1.59 |
| Su | 5,563 | 102 | 4 | N/A | 13.63 |
| Yeoh | 12,625 | 248 | 6 | Leukemia | 0.40 |
| Burczynski | 22,283 | 127 | 3 | Crohn's Disease | 0.79 |
| CSTR | 1,725 | 299 | 4 | Scientific | 96.86 |
| SyskillWebert | 4,339 | 333 | 4 | Web pages | 97.85 |
| Hitech | 6,593 | 600 | 6 | Newspaper | 97.92 |
| Irish-Sentiment | 8,658 | 1,660 | 3 | Sentiment Analysis | 98.70 |
| 20Newsgroups | 11,015 | 2,000 | 4 | E-mails | 99.11 |
| La1s | 13,195 | 3,204 | 6 | News articles | 98.90 |
| Reviews | 22,926 | 4,069 | 5 | News articles | 99.20 |

The first type of data is text collections, which are naturally sparse and of high dimensionality. In a text collection, the objects are documents and the features are the terms present in them; usually, text collections form a document-term matrix. In this work, seven text collections were used: (i) Computer Science Technical Reports (CSTR), SyskillWebert [33], Irish-Sentiment [34], La1s [35] and Reviews [36] were obtained from the repository1 of the Laboratory of Computer Intelligence of the Institute of Mathematics and Computer Sciences, University of São Paulo (ICMC-USP) and are described in [37]; (ii) Hitech and 20Newsgroups [38] are well-known text collections that were reduced to subset versions.

The second type of commonly used high-dimensional data is DNA microarray datasets. In this type of data, objects are genes and features represent a condition under which a gene is developed [39]. Each cell of such data is a gene expression and, because of that, microarray data are not sparse, differently from a document-term matrix, as can be seen in Table 2. In this work, six microarray2 datasets were used: Sorlie, Christensen, Khan, Su, Yeoh and Burczynski. In the next subsection, the values of the parameters used in this work are detailed.

4.2. Experimental setup

We have made experiments using the following values of the FCM parameters: fuzzification factor m ranging from 1.5 to 2.5 (m = {1.5, 1.8, 2.0, 2.2, 2.5}); convergence value E = 1e−5 and maximum number of iterations T = 100 as stopping criteria; the Cosine metric as the distance function [40]; and the number of clusters varying from C_min = 2 up to three values defined for C_max (C_max = {C_classes, C_classes + C_min, 10}). For the PE and PEB indices, a = 10 was used (Eq. (9)), and the value of the MPO threshold T_o was set to 1/c (Eq. (16)).

Differently from the minimum number of clusters with which Fuzzy c-Means should be experimented, the maximum number of clusters (C_max) can be any value in the range [2; n − 1], where n is the number of objects in a dataset. However, it would not be appropriate, in this work, to cluster high-dimensional data using a value of C_max much higher than the expected one (C_classes), besides being computationally intensive [22].
1 http://sites.labic.icmc.usp.br/text_collections.
2 https://github.com/ramhiser/datamicroarray.
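The setup above amounts to a loop over m values, C_max strategies and candidate numbers of clusters for each dataset. A hedged sketch of this grid, reusing the fcm and index functions sketched in Sections 2 and 3 (dataset loading omitted, names hypothetical), could look as follows; note again that the paper's runs use the Cosine metric inside FCM.

```python
M_VALUES = [1.5, 1.8, 2.0, 2.2, 2.5]
C_MIN = 2

def cmax_strategies(c_classes, c_min=C_MIN):
    # the three values of C_max described in Subsection 4.2
    return [c_classes, c_classes + c_min, 10]

def run_grid(X, c_classes):
    """Fit FCM for every (m, C_max, c) combination and collect index values."""
    results = {}
    for m in M_VALUES:
        for c_max in cmax_strategies(c_classes):
            scores = {}
            for c in range(C_MIN, c_max + 1):
                U, V = fcm(X, c, m=m, E=1e-5, T=100)   # sketched in Section 2
                scores[c] = {"PC": pc(U), "MPC": mpc(U, c), "PE": pe(U),
                             "PEB": peb(U, c), "NPE": npe(U, c),
                             "FS": fs(U, V, X, m)}
            results[(m, c_max)] = scores               # later scanned for optima
    return results
```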
Table 3
Number of clusters selected for each index when C_max = C_classes ("-" marks an index that had the same result for all values of c).

| Dataset | m = 1.5: PC | MPC | IPC | PE, NPE | PEB | FS | MPO | m = 1.8: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2 | 2 | 2 | 2 | 3    | 5 | 5 | 2 | 2 | 2    | 5 | 5 |
| Chris.  | 2 | 3 | 2 | 2 | 3    | 3 | 3 | 2 | 2 | 2    | 3 | 3 |
| Khan    | 2 | 2 | 2 | 2 | 2    | 4 | 4 | 2 | 2 | 2    | 4 | 4 |
| Su      | 2 | 3 | 2 | 2 | 4    | 4 | 4 | 2 | 2 | 2    | 4 | 4 |
| Yeoh    | 2 | 2 | 2 | 2 | 3    | 6 | 6 | 2 | 2 | 2    | 6 | 6 |
| Burcz.  | 2 | 2 | 2 | 2 | 2    | 3 | 3 | 2 | 2 | 2    | 3 | 3 |
| CSTR    | 2 | - | 2 | 2 | -    | 4 | 4 | - | 2 | -    | 4 | 4 |
| Syskill | 2 | - | 2 | 2 | 2, 4 | 4 | 4 | - | 2 | 2    | 4 | 4 |
| Hitech  | 2 | - | 2 | 2 | -    | 6 | 6 | - | 2 | -    | 6 | 6 |
| Irish   | 2 | - | 2 | 2 | 2    | 3 | 3 | - | 2 | 2    | 3 | 3 |
| 20ng    | 2 | - | 2 | 2 | -    | 4 | 4 | - | 2 | -    | 4 | 4 |
| La1s    | 2 | - | 2 | 2 | 5, 6 | 6 | 6 | - | 2 | 5, 6 | 6 | 6 |
| Reviews | 2 | - | 2 | 2 | 5    | 5 | 5 | - | 2 | 5    | 5 | 5 |
Table 4
Number of clusters selected for each index when C_max = C_classes.

| Dataset | m = 2.0: MPC | IPC | PEB | FS | MPO | m = 2.2: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2 | 2 | 2    | 5 | 5 | 2    | 2 | 2       | 5 | 5 |
| Chris.  | 2 | 2 | 2    | 3 | 3 | 2    | 2 | 2       | 3 | 3 |
| Khan    | 2 | 2 | 2    | 4 | 4 | 2    | 2 | 2       | 4 | 4 |
| Su      | 2 | 2 | 2    | 4 | 4 | 2    | 2 | 2       | 4 | 4 |
| Yeoh    | 2 | 2 | -    | 6 | 6 | 2, 3 | 2 | 2, 3, 5 | 6 | 6 |
| Burcz.  | 2 | 2 | -    | 3 | 3 | 3    | 2 | -       | 3 | 3 |
| CSTR    | - | 2 | -    | 4 | 4 | -    | 2 | -       | 4 | 4 |
| Syskill | - | 2 | -    | 4 | 4 | -    | 2 | -       | 4 | 4 |
| Hitech  | - | 2 | -    | 6 | 6 | -    | 2 | -       | 6 | 6 |
| Irish   | - | 2 | 2    | 3 | 3 | -    | 2 | 2       | 3 | 3 |
| 20ng    | - | 2 | -    | 4 | 4 | -    | 2 | -       | 4 | 4 |
| La1s    | - | 2 | 5, 6 | 6 | 6 | -    | 2 | 5, 6    | 6 | 6 |
| Reviews | - | 2 | 5    | 5 | 5 | -    | 2 | 5       | 5 | 5 |
As happened in [20], the rule of thumb C_max ≤ √n, presented by Bezdek and Pal in [12], was not used because the maximum number of clusters would still be much higher than the expected one, as would happen, for example, with the Reviews dataset, where C_max ≤ √4,069 ≈ 64. To evaluate the indices that could have a monotonic tendency towards selecting the maximum number of clusters, we use three values of C_max: (i) C_max = C_classes is a value that can induce a user to consider as good an index that often selected c = C_classes when, in fact, it is choosing c = C_max; (ii) C_max = C_classes + C_min avoids the problem previously reported; (iii) C_max = 10 is a common value used in the literature [41,16,13], in addition to being greater than the expected number of clusters of the datasets (see Table 2).

4.3. Results

The numbers of clusters selected by each index while varying the FCM parameters are shown in Tables 3–11. In these tables, cells were respectively highlighted in blue, bold and red (in the original) when: (i) the CVIs that have the monotonic tendency to the minimum number of clusters, or were proposed to mitigate it (PC, MPC, IPC, PE, PEB, NPE), selected c = C_min; (ii) the fuzzy CVIs selected the real number of clusters (c = C_classes); (iii) the CVIs that have the monotonic tendency to the maximum number of clusters (FS, MPO) selected c = C_max. The case in which an index selected every number of clusters as the optimal c, i.e., it had the same result for all values of c, is shown with "-" in the corresponding cell.

For all datasets, values of C_max and exponent m, the PC, PE and NPE indices invariably recognized c = C_min = 2 and FS recognized c = C_max. Because of that, the PC, PE and NPE results are only shown for m = 1.5 in Tables 3–11. With the exception of FS, the PC, PE and NPE indices had the same results shown in [20].

4.3.1. Results with maximum number of clusters C_max = C_classes

In Tables 3–5, we can check that MPO selected c = C_max = C_classes and IPC selected c = C_min for all datasets and values of the exponent m. In this case, the IPC index had the same result shown in [20], and it did not improve the PC results. For all m values, and only for the text collections, the MPC index obtained the same value, equal to zero, for C_min ≤ c ≤ C_classes and, because of that, it selected all values of c as the optimal number of clusters. Moreover, we have to highlight
Table 5
Number of clusters selected for each index when C_max = C_classes.

| Dataset | m = 2.5: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2 | 2 | 2    | 5 | 5 |
| Chris.  | 2 | 2 | 2    | 3 | 3 |
| Khan    | 4 | 2 | 4    | 4 | 4 |
| Su      | 2 | 2 | 2    | 4 | 4 |
| Yeoh    | 4 | 2 | 2, 4 | 6 | 6 |
| Burcz.  | 2 | 2 | -    | 3 | 3 |
| CSTR    | - | 2 | -    | 4 | 4 |
| Syskill | - | 2 | -    | 4 | 4 |
| Hitech  | - | 2 | -    | 6 | 6 |
| Irish   | - | 2 | 2    | 3 | 3 |
| 20ng    | - | 2 | -    | 4 | 4 |
| La1s    | - | 2 | 5, 6 | 6 | 6 |
| Reviews | - | 2 | 5    | 5 | 5 |
Table 6
Number of clusters selected for each index when C_max = C_classes + C_min.

| Dataset | m = 1.5: PC | MPC | IPC | PE, NPE | PEB | FS | MPO | m = 1.8: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2 | 2     | 2 | 2 | 3     | 7 | 7 | 2     | 2 | 2     | 7 | 7 |
| Chris.  | 2 | 4     | 2 | 2 | 4     | 5 | 3 | 2     | 3 | 2     | 5 | 5 |
| Khan    | 2 | 2     | 2 | 2 | 2     | 6 | 6 | 2     | 2 | 2     | 6 | 6 |
| Su      | 2 | 3     | 2 | 2 | 4     | 6 | 6 | 2     | 2 | 2     | 6 | 6 |
| Yeoh    | 2 | 2     | 2 | 2 | 3     | 8 | 8 | 2     | 2 | 2     | 8 | 8 |
| Burcz.  | 2 | 2     | 2 | 2 | 2     | 5 | 5 | 2     | 2 | 2     | 5 | 5 |
| CSTR    | 2 | 6     | 2 | 2 | -     | 6 | 6 | 6     | 2 | -     | 6 | 6 |
| Syskill | 2 | 5,6   | 2 | 2 | 2, 4  | 6 | 6 | 5     | 2 | 2     | 6 | 6 |
| Hitech  | 2 | 2-6,8 | 2 | 2 | -     | 8 | 8 | -     | 2 | -     | 8 | 8 |
| Irish   | 2 | -     | 2 | 2 | 2,4,5 | 5 | 5 | -     | 2 | 2,4,5 | 5 | 5 |
| 20ng    | 2 | -     | 2 | 2 | -     | 6 | 6 | -     | 2 | -     | 6 | 6 |
| La1s    | 2 | 2-6,8 | 2 | 2 | 5, 6  | 8 | 8 | 2-6,8 | 2 | 5, 6  | 8 | 8 |
| Reviews | 2 | 6     | 2 | 2 | 5,6   | 7 | 7 | 6     | 2 | 5,6   | 7 | 7 |
that MPC = 0 is the minimum value within the range of this index, and it means that the clustering was not good (see Section 3).

A similar situation happened with PEB, but for fewer datasets. It failed to select the number of clusters for some text collections and, among the microarray datasets, just for Yeoh and Burczynski (Tables 3–5). This situation was more common with m ≥ 2.0. In these undefined cases, this index obtained for all numbers of clusters its maximum value, equal to one, which means that the clustering was not good (see Section 3).

4.3.2. Results with maximum number of clusters C_max = C_classes + C_min

In Tables 6–8, we can check that MPO and IPC failed to select c = C_max and c = C_min, respectively, only for a few values of the exponent m and only for the Christensen microarray dataset. As in [20], Christensen was the only dataset for which the IPC index did not always select c = C_min. As occurred for C_max = C_classes, for C_max = C_classes + C_min the indefiniteness about the number of clusters was more common with the PEB index for text collections and m ≥ 2.0.

In Table 6, for the Hitech clustering with m = 1.8, MPC had the same result, equal to zero, for all numbers of clusters, while for the remaining values of the exponent m, MPC selected all clusters as the optimal c with the exception of c = 7. This last situation also occurred with the La1s collection, where MPC = 0 for c ≠ 7 and MPC = −2.22e−16 for c = 7. Besides obtaining its minimum value, equal to zero, for c ≠ 7, MPC obtained a value outside its range [0, 1].

All the fuzzy CVIs selected the same numbers of clusters that were selected previously with C_max = C_classes for the Sorlie, Su, Yeoh and 20Newsgroups datasets. The same situation occurred with MPC for Irish-Sentiment and Burczynski, and PEB had the same results shown in Tables 3–5 for the CSTR, SyskillWebert, Hitech and La1s collections.

4.3.3. Results with maximum number of clusters C_max = 10

Tables 9–11 show that only for the Christensen dataset do MPO and IPC present different results. In [20], besides Christensen, MPO recognized c = C_max for the Sorlie and Su datasets and IPC recognized c = C_min for Sorlie with m = 2.2. As occurred with C_max = C_classes + C_min, for C_max = 10 the indefiniteness about the number of clusters was more common with the PEB index for text collections and m ≥ 2.0. On the other hand, MPC reduced the number of selections of all clusters as the optimal c for the Irish-Sentiment collection (c = 6) and for 20Newsgroups, which, similarly to what happened previously with La1s,
Table 7
Number of clusters selected for each index when C_max = C_classes + C_min.

| Dataset | m = 2.0: MPC | IPC | PEB | FS | MPO | m = 2.2: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2     | 2 | 2     | 7 | 7 | 2     | 2 | 2       | 7 | 7 |
| Chris.  | 2     | 3 | 2     | 5 | 5 | 2     | 4 | 2       | 5 | 4 |
| Khan    | 2     | 2 | 2     | 6 | 6 | 2     | 2 | 2       | 6 | 6 |
| Su      | 2     | 2 | 2     | 6 | 6 | 2     | 2 | 2       | 6 | 6 |
| Yeoh    | 2     | 2 | 2-6   | 8 | 8 | 2, 3  | 2 | 2, 3, 5 | 8 | 8 |
| Burcz.  | 2     | 2 | 5     | 5 | 5 | 3     | 2 | 2, 3    | 5 | 5 |
| CSTR    | 6     | 2 | -     | 6 | 6 | 6     | 2 | -       | 6 | 6 |
| Syskill | 5     | 2 | -     | 6 | 6 | 5     | 2 | -       | 6 | 6 |
| Hitech  | 2-6,8 | 2 | -     | 8 | 8 | 2-6,8 | 2 | -       | 8 | 8 |
| Irish   | -     | 2 | 2,4,5 | 5 | 5 | -     | 2 | 2,4,5   | 5 | 5 |
| 20ng    | -     | 2 | -     | 6 | 6 | -     | 2 | -       | 6 | 6 |
| La1s    | 2-6,8 | 2 | 5,6   | 8 | 8 | 2-6,8 | 2 | 5, 6    | 8 | 8 |
| Reviews | 6     | 2 | 5,6   | 7 | 7 | 6     | 2 | 5       | 7 | 7 |
Table 8
Number of clusters selected for each index when C_max = C_classes + C_min.

| Dataset | m = 2.5: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2     | 2 | 2     | 7 | 7 |
| Chris.  | 2     | 2 | 2     | 5 | 5 |
| Khan    | 5     | 2 | 5     | 6 | 6 |
| Su      | 2     | 2 | 2     | 6 | 6 |
| Yeoh    | 4     | 2 | 2, 4  | 8 | 8 |
| Burcz.  | 2     | 2 | 5     | 5 | 5 |
| CSTR    | 6     | 2 | -     | 6 | 6 |
| Syskill | 5     | 2 | -     | 6 | 6 |
| Hitech  | 2-6,8 | 2 | -     | 8 | 8 |
| Irish   | -     | 2 | 2,4,5 | 5 | 5 |
| 20ng    | -     | 2 | -     | 6 | 6 |
| La1s    | 2-6,8 | 2 | 5, 6  | 8 | 8 |
| Reviews | 6     | 2 | 5     | 7 | 7 |
Table 9
Number of clusters selected for each index when C_max = 10.

| Dataset | m = 1.5: PC | MPC | IPC | PE, NPE | PEB | FS | MPO | m = 1.8: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2 | 2        | 2 | 2 | 3        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Chris.  | 2 | 4        | 5 | 2 | 4        | 10 | 3  | 2        | 3 | 2        | 10 | 10 |
| Khan    | 2 | 2        | 2 | 2 | 2        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Su      | 2 | 3        | 2 | 2 | 4        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Yeoh    | 2 | 2        | 2 | 2 | 3        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Burcz.  | 2 | 2        | 2 | 2 | 2        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| CSTR    | 2 | 6        | 2 | 2 | 2-7,9,10 | 10 | 10 | 6        | 2 | 2-7,9,10 | 10 | 10 |
| Syskill | 2 | 5,6      | 2 | 2 | 2, 4     | 10 | 10 | 5        | 2 | 2        | 10 | 10 |
| Hitech  | 2 | 2-6,8-10 | 2 | 2 | -        | 10 | 10 | -        | 2 | -        | 10 | 10 |
| Irish   | 2 | 6        | 2 | 2 | 2,4-8,10 | 10 | 10 | 6        | 2 | 2,4-8,10 | 10 | 10 |
| 20ng    | 2 | 2-6,8-10 | 2 | 2 | -        | 10 | 10 | 2-6,8-10 | 2 | -        | 10 | 10 |
| La1s    | 2 | 2-6,8-10 | 2 | 2 | 5, 6     | 10 | 10 | 2-6,8-10 | 2 | 5, 6     | 10 | 10 |
| Reviews | 2 | 6        | 2 | 2 | 5,6      | 10 | 10 | 6        | 2 | 5,6      | 10 | 10 |
did not include c = 7 in its results because, for this c, MPC = −2.22e−16. The Hitech and La1s collections had the same results shown previously for C_max = C_classes + C_min.

All the fuzzy CVIs selected the same numbers of clusters shown in Tables 6–8 for: (i) all microarray datasets with the exception of Christensen, where MPO and IPC had different results; (ii) all text collections with the exception of Irish-Sentiment, where only MPO and IPC had different results, and 20Newsgroups, where only MPC had a different result.

5. Numerical analysis

To summarize the results introduced in Tables 3–11, Tables 12–13 show the proportion of pseudo-partitions with an undefined number of clusters in the PEB and MPC validations.
Table 10
Number of clusters selected for each index when C_max = 10.

| Dataset | m = 2.0: MPC | IPC | PEB | FS | MPO | m = 2.2: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2        | 2 | 2        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Chris.  | 2        | 3 | 2        | 10 | 7  | 2        | 4 | 2        | 10 | 10 |
| Khan    | 2        | 2 | 2        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Su      | 2        | 2 | 2        | 10 | 10 | 2        | 2 | 2        | 10 | 10 |
| Yeoh    | 2        | 2 | 2-6      | 10 | 10 | 2,3      | 2 | 2,3,5    | 10 | 10 |
| Burcz.  | 2        | 2 | 5        | 10 | 10 | 3        | 2 | 2,3      | 10 | 10 |
| CSTR    | 6        | 2 | 2-7,9,10 | 10 | 10 | 6        | 2 | 2-7,9,10 | 10 | 10 |
| Syskill | 5        | 2 | -        | 10 | 10 | 5        | 2 | -        | 10 | 10 |
| Hitech  | 2-6,8-10 | 2 | -        | 10 | 10 | 2-6,8-10 | 2 | -        | 10 | 10 |
| Irish   | 6        | 2 | 2,4-8,10 | 10 | 10 | 6        | 2 | 2,4-8,10 | 10 | 10 |
| 20ng    | 2-6,8-10 | 2 | -        | 10 | 10 | 2-6,8-10 | 2 | -        | 10 | 10 |
| La1s    | 2-6,8-10 | 2 | 5, 6     | 10 | 10 | 2-6,8-10 | 2 | 5, 6     | 10 | 10 |
| Reviews | 6        | 2 | 5,6      | 10 | 10 | 6        | 2 | 5        | 10 | 10 |
Table 11
Number of clusters selected for each index when C_max = 10.

| Dataset | m = 2.5: MPC | IPC | PEB | FS | MPO |
| Sorlie  | 2        | 2 | 2        | 10 | 10 |
| Chris.  | 2        | 2 | 2        | 10 | 10 |
| Khan    | 5        | 2 | 5        | 10 | 10 |
| Su      | 2        | 2 | 2        | 10 | 10 |
| Yeoh    | 4        | 2 | 2,4      | 10 | 10 |
| Burcz.  | 2        | 2 | 5        | 10 | 10 |
| CSTR    | 6        | 2 | 2-7,9,10 | 10 | 10 |
| Syskill | 5        | 2 | -        | 10 | 10 |
| Hitech  | 2-6,8-10 | 2 | -        | 10 | 10 |
| Irish   | 6        | 2 | 2,4-8,10 | 10 | 10 |
| 20ng    | 2-6,8-10 | 2 | -        | 10 | 10 |
| La1s    | 2-6,8-10 | 2 | 5, 6     | 10 | 10 |
| Reviews | 6        | 2 | 5        | 10 | 10 |
Table 12
Proportion (%) of pseudo-partitions with an undefined number of clusters in the PEB validations.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 0.00  | 71.43 | 38.46 | 0.00  | 100.00 | 53.85 | 0.00  | 100.00 | 53.85 |
| 1.8 | 0.00  | 57.14 | 30.77 | 0.00  | 85.71  | 46.15 | 0.00  | 85.71  | 46.15 |
| 2.0 | 33.33 | 71.43 | 53.85 | 16.67 | 100.00 | 61.54 | 16.67 | 100.00 | 61.54 |
| 2.2 | 33.33 | 71.43 | 53.85 | 33.33 | 85.71  | 61.54 | 33.33 | 85.71  | 61.54 |
| 2.5 | 33.33 | 71.43 | 53.85 | 16.67 | 85.71  | 53.85 | 16.67 | 85.71  | 53.85 |
| AVG | 20.00 | 68.57 | 46.15 | 13.33 | 91.43  | 55.38 | 13.33 | 91.43  | 55.38 |
In the following tables, for the rows with values of m, the proportions in the Gene and Text columns were calculated over the total numbers of microarray datasets and text collections, respectively. The WAVG column shows the average weighted by the type of dataset (microarray and text have weight equal to the number of datasets for which an index selected one number of clusters or for which C_min is not included in its result). For instance, for m = 1.5 and C_max = C_classes in Table 12, WAVG = (0.00 × 6 + 71.43 × 7)/13 = 38.46. The last row of these tables shows the final average of the percentage values presented in the m rows.

5.1. Undefined optimal number of clusters

We can check, in Table 12, that the indefiniteness of PEB about the optimal number of clusters was more common for: (i) the textual collections, with lowest AVG = 68.57% for C_max = C_classes and a total average of 83.31% of the 105 PEB validations (7 textual collections × 5 m values × 3 C_max values); (ii) m = 2.0 and m = 2.2, in at least 61.54% of the textual collections for C_max = {C_classes + C_min, 10} and a total average of 58.97% of the 78 PEB validations (13 datasets × 2 m values × 3 C_max values).

As happened with PEB, Table 13 shows that the MPC indefiniteness about c was more common for: (i) the textual collections, with lowest AVG = 45.71% for C_max = 10 and a total average of 68.57% of the MPC validations; (ii) m = 2.2, in at
Table 13
Proportion (%) of pseudo-partitions with an undefined number of clusters in the MPC validations.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 0.00  | 100.00 | 53.85 | 0.00  | 71.43 | 38.46 | 0.00  | 57.14 | 30.77 |
| 1.8 | 0.00  | 100.00 | 53.85 | 0.00  | 57.14 | 30.77 | 0.00  | 42.86 | 23.08 |
| 2.0 | 0.00  | 100.00 | 53.85 | 0.00  | 57.14 | 30.77 | 0.00  | 42.86 | 23.08 |
| 2.2 | 16.67 | 100.00 | 61.54 | 16.67 | 57.14 | 38.46 | 16.67 | 42.86 | 30.77 |
| 2.5 | 0.00  | 100.00 | 53.85 | 0.00  | 57.14 | 30.77 | 0.00  | 66.67 | 35.90 |
| AVG | 3.33  | 100.00 | 55.38 | 3.33  | 60.00 | 33.85 | 3.33  | 45.71 | 26.15 |
Table 14
Proportion (%) of selection of c = C_min by the PEB index.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 33.33  | 33.33 | 33.33 | 33.33  | 0.00  | 25.00 | 33.33  | 0.00  | 25.00 |
| 1.8 | 100.00 | 50.00 | 80.00 | 100.00 | 33.33 | 77.78 | 100.00 | 33.33 | 77.78 |
| 2.0 | 100.00 | 33.33 | 71.43 | 80.00  | 0.00  | 57.14 | 80.00  | 0.00  | 57.14 |
| 2.2 | 100.00 | 33.33 | 71.43 | 100.00 | 0.00  | 66.67 | 100.00 | 0.00  | 66.67 |
| 2.5 | 75.00  | 33.33 | 57.14 | 60.00  | 0.00  | 42.86 | 60.00  | 0.00  | 42.86 |
| AVG | 81.67  | 36.67 | 62.67 | 74.67  | 6.67  | 53.89 | 74.67  | 6.67  | 53.89 |
Table 15
Proportion (%) of selection of c = C_min by the MPC index.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 66.67  | - | 66.67  | 66.67  | 0.00 | 50.00 | 66.67  | 0.00 | 44.44 |
| 1.8 | 100.00 | - | 100.00 | 100.00 | 0.00 | 66.67 | 100.00 | 0.00 | 60.00 |
| 2.0 | 100.00 | - | 100.00 | 100.00 | 0.00 | 66.67 | 100.00 | 0.00 | 60.00 |
| 2.2 | 80.00  | - | 80.00  | 80.00  | 0.00 | 50.00 | 80.00  | 0.00 | 44.44 |
| 2.5 | 66.67  | - | 66.67  | 66.67  | 0.00 | 44.44 | 66.67  | 0.00 | 40.00 |
| AVG | 82.67  | - | 82.67  | 82.67  | 0.00 | 55.56 | 82.67  | 0.00 | 49.78 |
least 30.77% of the textual collections for C_max = 10 and a total average of 43.59% of the 39 validations (13 datasets × 3 C_max values).

5.2. The monotonic tendency in the fuzzy CVI validations

To summarize the results in Tables 3–11, Tables 14–17 show the proportion of selections of c = C_min by PEB, MPC and IPC, and of c = C_max by the MPO index. According to the PC, PE, NPE and FS results shown in Tables 3–11, the monotonic tendency of these indices was verified in 100% of the 195 validated pseudo-partitions (13 datasets × 5 m values × 3 C_max values).

We verified, in Table 14, that PEB kept selecting c = C_min in an average of 56.81% of the validations with a defined optimal c. The PEB index failed more frequently for: (i) the microarray datasets, with lowest AVG = 74.67% for C_max = {C_classes + C_min, 10} and a total average of 77.00% of the PEB validations; (ii) m = 1.8 and m = 2.2, in 100% of the validated microarray pseudo-partitions; (iii) m = 1.8, with lowest AVG = 77.78% for C_max = {C_classes + C_min, 10} and a total average of 78.52% of the PEB validations.

Table 15 shows that MPC kept selecting c = C_min in an average of 62.67% of the validations with a defined optimal c. The MPC index failed more frequently for: (i) the microarray datasets, as happened with PEB, with AVG = 82.67% for all C_max values; (ii) m = 1.8 and m = 2.0, in 100% of the validated microarray pseudo-partitions, with lowest AVG = 60% for C_max = 10 and a total average of 75.56% of the MPC validations.

We check, in Table 16, that IPC kept selecting c = C_min in an average of 96.41% of the 195 validations, very close to the 100% of the PC index. The IPC index failed more frequently for: (i) the textual collections, differently from PEB and MPC, in 100% of the validated textual pseudo-partitions. This result was not much different from what IPC obtained for the microarray datasets, with lowest AVG = 86.67% for C_max = 10 and a total average of 92.22% of its validations; (ii) C_max = C_classes, in 100% of the pseudo-partitions with this value of C_max; (iii) m = 2.5, in 100% of the IPC validations with this value of the exponent.

From the MPO results, summarized in Table 17, it is possible to clearly verify its increasing monotonic tendency as a function of the number of clusters. MPO selected c = C_max in an average of 97.95% of the 195 pseudo-partitions, very close to the 100% obtained by the FS index. The MPO index failed more frequently for: (i) the textual collections, in 100% of the validated textual pseudo-partitions, as happened in the IPC validations. This result was not much different from what MPO obtained for the microarray datasets, with lowest AVG = 93.33% for C_max = {C_classes + C_min, 10} and a total average of 95.56% of
Table 16
Proportion (%) of selection of c = C_min by the IPC index.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  |
| 1.8 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  | 83.33  | 100.00 | 92.31  |
| 2.0 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  | 83.33  | 100.00 | 92.31  |
| 2.2 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  | 83.33  | 100.00 | 92.31  |
| 2.5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| AVG | 100.00 | 100.00 | 100.00 | 90.00  | 100.00 | 95.38  | 86.67  | 100.00 | 93.85  |
Table 17
Proportion (%) of selection of c = C_max by the MPO index.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  | 83.33  | 100.00 | 92.31  |
| 1.8 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| 2.0 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  |
| 2.2 | 100.00 | 100.00 | 100.00 | 83.33  | 100.00 | 92.31  | 100.00 | 100.00 | 100.00 |
| 2.5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| AVG | 100.00 | 100.00 | 100.00 | 93.33  | 100.00 | 96.92  | 93.33  | 100.00 | 96.92  |
Table 18
Proportion (%) of validations in which the PEB index no longer selected c = C_min while PE did.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 66.67 | 28.57 | 46.15 | 66.67 | 28.57 | 46.15 | 66.67 | 28.57 | 46.15 |
| 1.8 | 0.00  | 28.57 | 15.38 | 0.00  | 28.57 | 15.38 | 0.00  | 28.57 | 15.38 |
| 2.0 | 0.00  | 28.57 | 15.38 | 16.67 | 28.57 | 23.08 | 16.67 | 28.57 | 23.08 |
| 2.2 | 0.00  | 28.57 | 15.38 | 0.00  | 28.57 | 15.38 | 0.00  | 28.57 | 15.38 |
| 2.5 | 16.67 | 28.57 | 23.08 | 33.33 | 28.57 | 30.77 | 33.33 | 28.57 | 30.77 |
| AVG | 16.67 | 28.57 | 23.08 | 23.33 | 28.57 | 26.15 | 23.33 | 28.57 | 26.15 |
its validations; (ii) C_max = C_classes, in 100% of the pseudo-partitions with this value of C_max; (iii) m = 1.8 and m = 2.5, in 100% of the MPO validations with these values of the exponent.

5.3. Numerical improvement

To summarize the results in Tables 3–11, Tables 18–20 show the proportion of selections of c = C_min by (i) PEB in the validations where PE simultaneously selected c = C_min (Table 18); (ii) MPC (Table 19) and IPC (Table 20) in the validations where PC selected c = C_min. The NPE proportion of selection is not shown because this index did not improve the PE index results, i.e., its proportion results were equal to zero for all values of the exponent m and types of dataset.

It is possible to check, from Table 18, that PEB improved its original index, PE, as occurred in [42]. For all values of C_max, PEB no longer selected c = C_min, with the highest average for m = 1.5. The best performance of PEB was obtained for the gene (microarray) datasets with m = 1.5, where it improved the PE results by 66.67% for C_max = C_classes. On the other hand, PEB improved the results of the text collections more, in comparison with the total improvement achieved on the microarray datasets. The average results of PEB for C_max = C_classes + C_min and C_max = 10 were the same, and the total average improvement was bigger than the average obtained for C_max = C_classes. To improve the PE results, the PEB index was much better than NPE in this setup of C_max ≪ n: PEB had a result different from PE in an average of 25.12% of the validations, while NPE always had the same result as PE.

Table 19 shows that MPC improved its original index, PC. For C_max = C_classes, no average value was calculated for the text collections because MPC selected all numbers of clusters as the optimal c. As happened with PEB, MPC no longer selected c = C_min, with the highest averages for m = 1.5 and m = 2.5. The MPC average results for the microarray datasets were smaller than those obtained by PEB (see Table 18) and were the same for all C_max. On the other hand, the MPC average results for the text collections were bigger than those obtained by PEB. MPC improved the results of the text collections more, in comparison with the total improvement achieved on the microarray datasets.

We check, from Table 20, that IPC also improved its original index (PC), but in a much smaller proportion than MPC and PEB. For the text collections, and for all validations when C_max = C_classes, no improvement was verified, i.e., IPC had the same result and monotonic tendency as PC. Differently from what occurred with PEB (see Table 18) and MPC (see Table 19), the proportions of selection of c = C_min for the microarray datasets were more equally distributed among the values of m. IPC improved the results of the text collections more, in comparison with the total improvement achieved on the microarray datasets.
Table 19
Proportion (%) of validations in which the MPC index no longer selected c = C_min while PC did.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 33.33 | - | 15.38 | 33.33 | 42.86 | 38.46 | 33.33 | 57.14 | 46.15 |
| 1.8 | 0.00  | - | 0.00  | 0.00  | 42.86 | 23.08 | 0.00  | 57.14 | 30.77 |
| 2.0 | 0.00  | - | 0.00  | 0.00  | 42.86 | 23.08 | 0.00  | 57.14 | 30.77 |
| 2.2 | 16.67 | - | 7.69  | 16.67 | 42.86 | 30.77 | 16.67 | 57.14 | 38.46 |
| 2.5 | 33.33 | - | 15.38 | 33.33 | 42.86 | 38.46 | 33.33 | 57.14 | 46.15 |
| AVG | 16.67 | - | 7.69  | 16.67 | 42.86 | 30.77 | 16.67 | 57.14 | 38.46 |
Table 20
Proportion (%) of validations in which the IPC index no longer selected c = C_min while PC did.

| m | C_max = C_classes: Gene | Text | WAVG | C_max = C_classes + C_min: Gene | Text | WAVG | C_max = 10: Gene | Text | WAVG |
| 1.5 | 0.00 | 0.00 | 0.00 | 0.00  | 0.00 | 0.00 | 16.67 | 0.00 | 7.69 |
| 1.8 | 0.00 | 0.00 | 0.00 | 16.67 | 0.00 | 7.69 | 16.67 | 0.00 | 7.69 |
| 2.0 | 0.00 | 0.00 | 0.00 | 16.67 | 0.00 | 7.69 | 16.67 | 0.00 | 7.69 |
| 2.2 | 0.00 | 0.00 | 0.00 | 16.67 | 0.00 | 7.69 | 16.67 | 0.00 | 7.69 |
| 2.5 | 0.00 | 0.00 | 0.00 | 0.00  | 0.00 | 0.00 | 0.00  | 0.00 | 0.00 |
| AVG | 0.00 | 0.00 | 0.00 | 10.00 | 0.00 | 4.62 | 13.33 | 0.00 | 6.15 |
Table 21
Mann-Kendall Trend test results.

| Dataset | C_max | m | Index | p-value | Z | S | Var(S) |
| Chris.  | C_classes + C_min | 1.5        | MPC | 1.00e+00 | 0.00e+00 | 0.00e+00 | 8.67e+00 |
| Su      | C_classes + C_min | 1.5        | PEB | 1.00e+00 | 0.00e+00 | 0.00e+00 | 1.67e+01 |
| Irish   | 10                | [1.5; 2.5] | PEB | NaN      | NaN      | 0.00e+00 | 0.00e+00 |
| Reviews | 10                | 1.5, 2.2   | MPC | 1.00e+00 | 0.00e+00 | 0.00e+00 | 2.67e+01 |
To correct the PC monotonic problem, the MPC index was more successful than IPC: the first had a result different from PC in an average of 25.64% of its validations, while IPC differed in only 3.59%. It is a consensus, from Tables 18–20, and it was also verified in [20], that PEB, MPC and IPC improved the results of their original indices the most when C_max = 10. Moreover, as the value defined as C_max was increased, the total average improvement also increased. The PEB and MPC indices were the best options to improve the PE and PC results, respectively.

5.4. The Mann-Kendall Trend test

To verify whether the PEB and MPC indices really produce results without a monotonic tendency, we performed the Mann-Kendall Trend test [43,44]. This is a non-parametric test, i.e., the observations do not have to be normally distributed, commonly used to detect monotonic trends. As a statistical test, the Mann-Kendall test has a null hypothesis H_0, that the observations of a variable over time are independent and identically distributed, and an alternative hypothesis H_A, that the observations follow an increasing or decreasing monotonic trend. In this work, the observations over time consist of the index results over the number of clusters in the range C_min ≤ c ≤ C_max. The null hypothesis H_0 is true, i.e., there is no monotonic trend, when the test statistic S = 0.

The Mann-Kendall Trend test was performed using the mk.test function of the R package trend [45] with its default parameters: (i) a two-sided test for the alternative hypothesis; (ii) continuity correction applied to the statistic Z. The null hypothesis H_0 was only true for PEB and MPC, and only in the cases where these indices selected all numbers of clusters as the optimal c, i.e., when they obtained the same value for all numbers of clusters in the range C_min ≤ c ≤ C_max. Exceptions to this situation are summarized in Table 21, which shows the p-value, the Z quantile of the standard normal distribution, the statistic S and its variance Var(S).

In Table 21, PEB would have had the same Mann-Kendall result for the Irish-Sentiment collection if this index had obtained the same value for all numbers of clusters. Although PEB only produced a different value for c = 3 and c = 9, as shown in Table 22, the difference of 2e−16 between the PEB results for the other cluster numbers is too small and, because of that, the variance calculated in the Mann-Kendall test was equal to zero. It is possible to check, in Table 22, that when C_max = 10 the difference between the PEB and MPC values along the numbers of clusters was not significant. This short difference between the index values, in addition to the repetition of the same value for many other numbers of clusters, is the reason for the absence of a monotonic trend. Only when C_max = C_classes + C_min did MPC and PEB avoid the monotonic tendency while having different values for each number of clusters.
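The Mann-Kendall statistics in Table 21 follow the standard definitions, S = Σ_{i<j} sgn(x_j − x_i) with a tie-corrected variance; the following Python sketch (our own illustration, not the R trend::mk.test implementation used in the paper) reproduces, for example, S = 0 and Var(S) ≈ 8.67 for the Christensen MPC sequence of Table 22.

```python
import numpy as np

def mann_kendall(x):
    """Mann-Kendall trend statistic S, tie-corrected Var(S) and the
    continuity-corrected Z, following the standard textbook definitions."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    _, t = np.unique(x, return_counts=True)        # tie group sizes
    var_s = (n * (n - 1) * (2 * n + 5) - (t * (t - 1) * (2 * t + 5)).sum()) / 18.0
    if var_s == 0:
        return s, var_s, float("nan")              # degenerate case
    z = (s - np.sign(s)) / np.sqrt(var_s)          # continuity correction
    return s, var_s, z

# Christensen MPC values for c = 2..5 (Table 22): S = 0, Var(S) = 8.67, Z = 0
print(mann_kendall([5.46e-01, 5.94e-01, 6.12e-01, 4.48e-01]))
```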
Table 22
Index results for each number of clusters.

| c | C_max = C_classes + C_min: Chris. MPC | Su PEB | C_max = 10: Irish PEB | Reviews MPC |
| 2  | 5.46e-01 | 9.77e-01 | 1.00e+00         | 0.00e+00 |
| 3  | 5.94e-01 | 9.07e-01 | 1.00e+00 + 2e-16 | 0.00e+00 |
| 4  | 6.12e-01 | 9.04e-01 | 1.00e+00         | 0.00e+00 |
| 5  | 4.48e-01 | 9.16e-01 | 1.00e+00         | 0.00e+00 |
| 6  | -        | 9.27e-01 | 1.00e+00         | 1.11e-16 |
| 7  | -        | -        | 1.00e+00         | 0.00e+00 |
| 8  | -        | -        | 1.00e+00         | 0.00e+00 |
| 9  | -        | -        | 1.00e+00 + 2e-16 | 0.00e+00 |
| 10 | -        | -        | 1.00e+00         | 0.00e+00 |
Therefore, we cannot evaluate the improvements of an index over its original version, which has the monotonic tendency with c, based only on a monotonic trend test. It is necessary to verify whether the index values are significantly different.

6. Discussion

In the next subsections we discuss the results of each index separately.

6.1. The Partition Coefficient and Partition Entropy indices

PC and PE recognized c = C_min = 2 for all datasets and values of the weighting exponent m. This can be explained by their limits when m → ∞: lim_{m→∞} PC = 1/c and lim_{m→∞} PE = log_a c. This happens because c = 2 maximizes the value of 1/c and minimizes the value of log_10 c [12].

Although the PC decreasing and PE increasing monotonic trends are already well known in the literature [8,32,12,21,4,46,24], this problem appears to be more common when validating real datasets. In [12,21,5,47,19,46,17,42], all datasets have low dimensionality, and these indices only selected c = 2 for some of the synthetic data, while for all real data PC and PE indicated c = 2, with the exception of one dataset in [42]. In the current work it was not different, since no dataset used is synthetic. According to [46], the reason why PC selects c = 2 for real-world datasets is that they must contain some noise points that influence this index.

6.2. The Modified Partition Coefficient index

The MPC index was designed to alleviate the PC dependency on c by applying a linear transformation to it. In Table 19, it was verified that the MPC index improved the PC results by not selecting c = C_min in an average of 25.64% of its validations. Its best performance was obtained for the textual collections, improving the PC results by 57.14% when C_max = 10. At the same time, we checked, in Table 15, that MPC kept selecting c = C_min in an average of 62.67% of the validations with a defined optimal c. Besides that, the MPC indefiniteness about the optimal number of clusters was attested in an average of 38.46% of its validations. From the MPC results, we can check that this modification works well in improving the results of its original index.

6.3. The Improved Partition Coefficient index

The IPC index tries to minimize the PC tendency, and it improves the PC results from the value of C_max = C_classes + C_min on. Despite this, it is still dependent on c, and it only allows choosing the number of clusters from c = C_min up to c = C_max − 1 (Eq. (8)), i.e., the maximum c defined for FCM is not considered. Even though the IPC index did not present the same indefiniteness problem as PEB and MPC, it only had a result different from PC in an average of 3.59% of its validations.

6.4. The Scaled Partition Entropy index

In Table 18, it was verified that PEB improved its original index, PE, as occurred in [42]. PEB was different from PE in an average of 25.13% of its validations, and its best performance was obtained for the gene (microarray) datasets with m = 1.5, where it improved the PE results by 66.67% for C_max = C_classes. At the same time, we checked, in Table 14, that PEB kept selecting c = C_min in an average of 56.81% of the validations with a defined optimal c. Besides that, the PEB indefiniteness about the optimal number of clusters was attested in an average of 52.30% of its validations. After all, PEB was the most successful fuzzy CVI, i.e., it was the index that most often correctly recognized the number of clusters.
Fig. 1. FS result of the Su dataset split into its separate terms, together with its complete value.
6.5. Dunn's Normalized Partition Entropy index

The NPE index did not improve the PE results, and this happened because its improvement is conditioned on the number of clusters c being close to the number of objects n in a dataset. The values calculated by PE and NPE were very close to each other, which can be explained by the NPE definition (see Subsection 3.6, Eq. (11)). Because the maximum number of clusters used in this work, and in the literature, is much smaller than n, being usually set lower than √n or equal to ten clusters ([12,4,16–18]), the ratio (c/n) ≪ 1 and then 1 − (c/n) → 1, resulting in NPE → PE (see Eq. (11)). As (c/n) → 1, the PE and NPE behaviors will diverge [4]. The problem is to know the threshold of c at which the two indices become significantly different, to the point that the NPE results do not monotonically increase with the number of clusters.

6.6. The Fukuyama-Sugeno index

The FS index showed the same decreasing monotonic trend as PC, but FS tends to choose c = C_max, since it is a minimization index. In [12], Bezdek and Pal showed that FS is sensitive to the fuzzification factor m and, because of that, it may be unreliable. Besides that, in [22,23] the authors considered the FS results unexpected and unreasonable because of this tendency.

Although the FS tendency has been described in the literature [23,22,16], this problem appears to be more common when validating high-dimensional datasets. In [47], the maximum c was defined as C_max = 8 and, in [13], C_max = 10. In both studies, FCM was performed using m = 2.0 and FS did not recognize C_max as the optimal number of clusters for any dataset. In [12], Bezdek and Pal performed FCM with C_max = 10, using thirteen values of m for two datasets, and FS selected c = 10 in 38% of its executions. In [16,5,23], FS only selected c = C_max in 18%, 25% and 43% of its executions, respectively. These proportions are much smaller than the one obtained in this work, where FS selected c = C_max in 100% of its executions. In these situations, the difference between the other studies and this one is in the dimensionality of the data: while in the others the dimensionality of the datasets is not greater than 296, in this paper the datasets have at least 456 dimensions (see Table 2).

In [22] it is explained that FS works well when the clusters are highly separable. When cluster centers are sufficiently close to one another, the inter-cluster distance (separation) plays a less important role in Eq. (12). This is because the FS second term, ||v_i − v̄||, is always significantly small, and this can lead to the index favoring more clusters than desired [22]. Therefore, the maximum number of iterations may have been the reason why, in this work, the FS index selected c = C_max for the Su dataset, which did not happen in [20] for any FS validation of the same data. Fig. 1 shows the FS result of the Su dataset obtained in [20] for m = 2.0, with its compactness, separation and complete value. We check, in Fig. 1, that the separation term of FS was bigger than the compactness one and, because of that, it selected c = C_min instead of c = C_max, as often occurs. Note that, in this situation where the clusters are too separable, FS did not work well, because it keeps a monotonic trend, but now it is an increasing monotonic trend that made it select c = C_min.
In [20], only the convergence value E = 1e−5 was used as the stopping criterion of FCM and, in that case, the cluster centers moved far enough apart for the FS separation term to be bigger than the compactness one. In the current work, the number of iterations was not enough to separate the cluster centers and, because of that, the separation term was smaller than the compactness one, which led FS to favor the maximum number of clusters as the optimal c. In this work, for all datasets and values of m, the compactness term was bigger than the separation one.
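The role of the stopping criterion can be made explicit with a minimal FCM loop (a sketch under the standard FCM update rules; the function below is illustrative, not the implementation used here). Stopping only when the membership change falls below E, as in [20], lets the centers keep drifting apart, while a tight iteration cap can stop the algorithm while the centers are still close together, shrinking the FS separation term:

    import numpy as np

    def fcm(X, c, m=2.0, E=1e-5, max_iter=None, seed=0):
        # Minimal fuzzy c-means with the two stopping criteria discussed above.
        rng = np.random.default_rng(seed)
        U = rng.random((c, X.shape[0]))
        U /= U.sum(axis=0)                                   # columns sum to one
        it = 0
        while True:
            W = U ** m
            V = (W @ X) / W.sum(axis=1, keepdims=True)       # center update
            D = np.maximum(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
            p = 1.0 / (m - 1.0)
            U_new = 1.0 / (D ** p * (1.0 / D ** p).sum(axis=0))  # membership update
            it += 1
            if np.abs(U_new - U).max() < E:                  # convergence criterion, as in [20]
                return V, U_new
            U = U_new
            if max_iter is not None and it >= max_iter:
                return V, U                                  # early stop: centers may remain close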
6.7. Robust index of Yang et al.

MPO showed the same increasing tendency as PE, but MPO tends to choose c = C_max, since it is a maximization index. Moreover, differently from what was shown in [19], MPO was dependent on c for high-dimensional data, and its term ((c + 1)/(c − 1))^(1/2) (see Subsection 3.8, Eq. (14)), introduced to avoid the monotonic tendency with respect to the number of clusters, did not work as expected. As occurred with FS, in most cases the separation term of MPO is much smaller than its compactness term, which leads MPO to select C_max as the optimal number of clusters. The FS and MPO tendencies are more difficult to identify because the value of the C_max parameter is not as uniformly fixed as the minimum number of clusters (C_min = 2). Depending on C_max, FS and MPO may falsely appear to recognize the correct number of clusters when C_max = C_classes, as shown in Tables 3–5; because of that, C_max must be set greater than C_classes.

7. Conclusion

Cluster validity indices (CVIs) that use a relative criterion to choose the best partition of a dataset are usually applied while varying the number of clusters and, because of that, their results must be independent of any parameter of a clustering algorithm. Indices that have a monotonic tendency as a function of the number of clusters, or that are sensitive to another algorithm parameter, are considered unreliable. For that reason, in this work we have investigated the critical monotonic tendency problem of the classical indices (PC, PE and FS), as well as of their extensions (MPC, IPC, PEB and NPE), over high-dimensional data. For such investigation, we clustered high-dimensional datasets varying the weighting exponent m and the maximum number of clusters C_max, and performed the Mann-Kendall trend test to verify statistically the monotonic trend of the CVIs results.

With the FCM clustering setup used in this work, obtained by varying the fuzzification exponent m and the maximum number of clusters, fifteen pseudo-partitions (5 values of m × 3 values of C_max) were validated and selected by each index for each dataset. This number of validations per dataset is larger than what is usually performed. Moreover, from Tables 14–17, we verified that the indices results were largely determined by their monotonic tendency, as occurred with the PC, PE, NPE and FS indices in 100% of the pseudo-partitions. The instability related to the indefiniteness about the optimal number of clusters was also expressive for PEB and MPC, reaching a total average of 52.30% of the PEB validations and 38.46% of the MPC ones. From these results, we can conclude that the sample used in this work was representative, since the likelihood of the indices showing the same or a similar proportion of indefiniteness or monotonic tendency in a bigger sample is considerably high.

From the CVIs results, it was verified that MPC and PEB indeed improved the PC and PE indices, respectively, in relation to their monotonic trend. On the other hand, NPE had the same result as its original index because its improvement is conditioned to an unusual setting of C_max ≈ n. The IPC index improved the PC results much less than MPC did, and only for C_max > C_classes. These indices showed a lower frequency of selection of c = C_min when FCM was performed with C_max = 10. Moreover, for m = 1.5, MPC and PEB showed their best results.
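Before turning to the test outcomes, we note that the Mann-Kendall statistic used throughout can be computed as in the following sketch (a minimal Python version, without the tie correction implemented by the trend R package [45]):

    import numpy as np
    from scipy.stats import norm

    def mann_kendall(x, alpha=0.05):
        # S = sum of sign(x_j - x_i) over all pairs i < j
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
        var_s = n * (n - 1) * (2 * n + 5) / 18.0     # Var(S), no-ties case
        if s > 0:
            z = (s - 1) / np.sqrt(var_s)
        elif s < 0:
            z = (s + 1) / np.sqrt(var_s)
        else:
            z = 0.0
        p = 2.0 * (1.0 - norm.cdf(abs(z)))           # two-sided p-value
        return s, z, p, p < alpha                    # last value: reject H_0?

Applied to an index curve over c = 2, ..., C_max, the null hypothesis H_0 of no monotonic trend is rejected when the two-sided p-value falls below alpha; under our criterion, an index is successful only when H_0 is retained.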
In the analysis detailed above, we considered that MPC, IPC and PEB were successful in improving their original versions when they did not simply select c = C_min, as their original indices do. In the case of the Mann-Kendall trend test, an index is considered successful if the null hypothesis H_0 holds, i.e., there is no monotonic trend in the index results. This test confirmed the improvements of MPC and PEB, but in just four cases, shown in Table 21. The remaining cases occurred when MPC and PEB obtained the same value for every number of clusters and, because of that, no monotonic trend could be verified. Therefore, although MPC and PEB had good results in recognizing the expected number of clusters and in improving their original indices, these two indices have the problem of returning the same result for different numbers of clusters. This makes the selection of the optimal number of clusters by these indices unreliable. It becomes a major problem for high-dimensional datasets because, without a precise indication of one or at most two candidate numbers of clusters by an index, this information cannot be retrieved through visualization of the dataset.

As future work, we intend to verify whether other indices have some tendency as a function of the parameters of FCM or of any other clustering algorithm. In addition, we could test the level of significance of these indices results across the numbers of clusters that could disturb the clustering validation of high-dimensional data. Given the recurrent problem of monotonic tendency, how should one choose an appropriate index to validate fuzzy clustering of high-dimensional data? Eliminating those that have some tendency as a function of any parameter is already a first big step!

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Coordination for the Improvement of Higher Education Personnel (CAPES) [grant number 88882.453917/2019-01], which granted the scholarship during the period of this master's degree at the Federal University of Bahia.
References

[1] M. Steinbach, L. Ertöz, V. Kumar, The Challenges of Clustering High Dimensional Data, in: New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition, 2004, pp. 273–309.
[2] R. Jensen, C. Cornelis, Fuzzy-rough nearest neighbour classification and prediction, Theor. Comput. Sci. 412 (42) (2011) 5871–5884, https://doi.org/10.1016/j.tcs.2011.05.040.
[3] R. Xu, D. Wunsch, Clustering, IEEE Series on Computational Intelligence, Wiley, 2009, https://books.google.com.br/books?id=XC4nAQAAIAAJ.
[4] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[5] W. Wang, Y. Zhang, On fuzzy cluster validity indices, Fuzzy Sets Syst. 158 (19) (2007) 2095–2117.
[6] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. 3 (3) (1973) 58–73, https://doi.org/10.1080/01969727308546047.
[7] J.C. Bezdek, Numerical taxonomy with fuzzy sets, J. Math. Biol. 1 (1) (1974) 57–71, https://doi.org/10.1007/BF02339490.
[8] J.C. Bezdek, Mathematical models for systematics and taxonomy, in: 8th Int. Conf. Numerical Taxonomy, San Francisco, 1975, pp. 143–166.
[9] Y. Fukuyama, M. Sugeno, A new method of choosing the number of clusters for the fuzzy c-means method, in: Fuzzy Systems Symposium, 1989, pp. 247–250.
[10] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 841–847, https://doi.org/10.1109/34.85677.
[11] V. Schwämmle, O.N. Jensen, A simple and fast method to determine the parameters for fuzzy c-means cluster analysis, Bioinformatics 26 (22) (2010) 2841–2848.
[12] N.R. Pal, J.C. Bezdek, On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst. 3 (3) (1995) 370–379.
[13] H. Li, S. Zhang, X. Ding, C. Zhang, P. Dale, Performance evaluation of cluster validity indices (CVIs) on multi/hyperspectral remote sensing datasets, Remote Sens. 8 (2016).
[14] S.H. Kwon, Cluster validity index for fuzzy clustering, Electron. Lett. 34 (22) (1998) 2176–2177, https://doi.org/10.1049/el:19981523.
[15] Y. Tang, F. Sun, Z. Sun, Improved validation index for fuzzy clustering, in: Proceedings of the 2005 American Control Conference, vol. 2, 2005, pp. 1120–1125.
[16] H.L. Capitaine, C. Frelicot, A cluster-validity index combining an overlap measure and a separation measure based on fuzzy-aggregation operators, IEEE Trans. Fuzzy Syst. 19 (3) (2011) 580–588.
[17] K. Zhou, S. Ding, C. Fu, S. Yang, Comparison and weighted summation type of fuzzy cluster validity indices, Int. J. Comput. Commun. Control 9 (3) (2014) 370–378.
[18] Y. Tang, X. Hu, W. Pedrycz, X. Song, Possibilistic fuzzy clustering with high-density viewpoint, Neurocomputing 329 (2019) 407–423, https://doi.org/10.1016/j.neucom.2018.11.007.
[19] Y. Hu, C. Zuo, Y. Yang, F. Qu, A robust cluster validity index for fuzzy c-means clustering, in: International Conference on Transportation, Mechanical, and Electrical Engineering, 2011, pp. 448–451.
[20] F. Eustáquio, T. Nogueira, On monotonic tendency of some fuzzy cluster validity indices for high-dimensional data, in: 2018 7th Brazilian Conference on Intelligent Systems (BRACIS), 2018, pp. 558–563.
[21] R.N. Dave, Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognit. Lett. 17 (6) (1996) 613–623.
[22] A. Chong, T.D. Gedeon, L.T. Koczy, A hybrid approach for solving the cluster validity problem, in: International Conference on Digital Signal Processing Proceedings, vol. 2, 2002, pp. 1207–1210.
[23] M.-S. Yang, K.-L. Wu, A new validity index for fuzzy clustering, in: IEEE International Conference on Fuzzy Systems, vol. 1, 2001, pp. 89–92.
[24] C. Lisheng, The improved partition coefficient, in: International Conference on Advances in Engineering 24 (2011) 534–538.
[25] J.C. Dunn, Indices of partition fuzziness and the detection of clusters in large data sets, in: Fuzzy Automata and Decision Processes, Elsevier, New York, 1977, pp. 271–284.
[26] F. Eustáquio, H. Camargo, S. Rezende, T. Nogueira, On fuzzy cluster validity indexes for high dimensional feature space, in: Advances in Fuzzy Logic and Technology 2017: Proceedings of the 10th Conference of the European Society for Fuzzy Logic and Technology, Warsaw, Poland, vol. 2, 2018, pp. 12–23.
[27] N.R. Pal, K. Pal, J.M. Keller, J.C. Bezdek, A possibilistic fuzzy c-means clustering algorithm, IEEE Trans. Fuzzy Syst. 13 (4) (2005) 517–530, https://doi.org/10.1109/TFUZZ.2004.840099.
[28] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybern. 3 (3) (1973) 32–57, https://doi.org/10.1080/01969727308546046.
[29] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[30] J.A. Hartigan, Clustering Algorithms, John Wiley and Sons, Inc., New York, NY, USA, 1975.
[31] G.J. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.
[32] J.C. Bezdek, M.P. Windham, R. Ehrlich, Statistical parameters of cluster validity functionals, Int. J. Comput. Inf. Sci. 9 (4) (1980) 323–336, https://doi.org/10.1007/BF00978164.
[33] M. Pazzani, Syskill and Webert web page ratings data set, http://archive.ics.uci.edu/ml/datasets/Syskill+and+Webert+Web+Page+Ratings, 1998.
[34] M.L. Group, Irish economic sentiment dataset, http://mlg.ucd.ie/sentiment, 2009.
[35] G. Forman, 19MclassTextWc dataset, http://sourceforge.net/projects/weka/files/datasets/text-datasets/19MclassTextWc.zip/download, 2006.
[36] G. Karypis, CLUTO - software for clustering high-dimensional datasets, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download, 2006.
[37] R.G. Rossi, R.M. Marcacini, S.O. Rezende, Benchmarking Text Collections for Classification and Clustering Tasks, Tech. Rep. 395, Institute of Mathematics and Computer Sciences, Federal University of São Carlos, 2013.
[38] J. Rennie, 20 Newsgroups dataset, http://qwone.com/~jason/20Newsgroups/, 2008.
[39] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann Publishers, San Francisco, CA, 2006.
[40] R. Subhashini, V.J.S. Kumar, Evaluating the performance of similarity measures used in document clustering and information retrieval, in: International Conference on Integrated Intelligent Computing, 2010, pp. 27–31.
[41] D. Kumar, J. Bezdek, M. Palaniswami, S. Rajasegarar, C. Leckie, T. Havens, A hybrid approach to clustering in big data, IEEE Trans. Cybern. 99 (2015) 1.
[42] J.C. Bezdek, M. Moshtaghi, T. Runkler, C. Leckie, The generalized c index for internal fuzzy cluster validity, IEEE Trans. Fuzzy Syst. 24 (6) (2016) 1500–1512, https://doi.org/10.1109/TFUZZ.2016.2540063.
[43] H.B. Mann, Nonparametric tests against trend, Econometrica 13 (3) (1945) 245–259.
[44] M. Kendall, Rank Correlation Methods, C. Griffin, 1948.
[45] T. Pohlert, trend: Non-Parametric Trend Tests and Change-Point Detection, R package version 1.1.1, 2018.
[46] K.-L. Wu, An analysis of robustness of partition coefficient index, in: IEEE International Conference on Fuzzy Systems, 2008, pp. 372–376.
[47] R.X. Valente, A.P. Braga, W. Pedrycz, A new fuzzy clustering validity index based on fuzzy proximity matrices, in: Brazilian Congress on Computational Intelligence, 2013, pp. 489–494.