Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data


Accepted Manuscript

Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data

R.J. Kuo, Thi Phuong Quyen Nguyen

PII: S0925-2312(18)31344-4
DOI: https://doi.org/10.1016/j.neucom.2018.11.016
Reference: NEUCOM 20149

To appear in: Neurocomputing

Received date: 19 March 2018
Revised date: 1 July 2018
Accepted date: 10 November 2018

Please cite this article as: R.J. Kuo, Thi Phuong Quyen Nguyen, Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.11.016

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

 Employ the intuitionistic fuzzy set theory in fuzzy clustering for categorical attributes.
 Use the new similarity measure for categorical data, which is based on the frequency probability-based distance metric, to calculate the dissimilarity measure.
 Consider the importance of each categorical attribute differently by updating the weight for each categorical attribute in the clustering process iteratively.
 Exploit the global optimal solution by genetic algorithm (GA).
 Provide the unsupervised feature selection process to remove the redundant features of the original dataset prior to performing the GA process.

Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data

R. J. Kuo & Thi Phuong Quyen Nguyen*

Department of Industrial Management, National Taiwan University of Science and Technology, Taiwan
No. 43, Section 4, Keelung Rd., Da-an District, Taipei City, Taiwan (ROC)

[email protected]

* Corresponding author. Email: [email protected]

Abstract

Data clustering with categorical attributes has been widely used in many real-world applications. Most of the existing clustering algorithms proposed for categorical data face two major drawbacks: termination at a local optimal solution and treating all attributes equally. Thus, this study proposes a novel clustering method, named the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm, based on the conventional fuzzy k-modes algorithm and the genetic algorithm (GA). The proposed method firstly introduces the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the intuitionistic fuzzy set in the clustering process together with a new similarity measure for categorical data based on the frequency probability-based distance metric. Then, the GIWFKM algorithm, which integrates the IWFKM algorithm and the GA, is proposed to pursue the global optimal solution. Moreover, the GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, which can both improve the clustering performance and reduce the computational time. To evaluate the clustering result, a series of experiments on different categorical datasets is conducted to compare the performance of the proposed algorithms with that of other benchmark algorithms, including the fuzzy k-modes, weighted fuzzy k-modes, genetic fuzzy k-modes, space structure-based clustering, and many-objective fuzzy centroids clustering algorithms. The experimental results on datasets collected from the UCI machine learning repository show that the GIWFKM algorithm outperforms the other benchmark algorithms in terms of the adjusted Rand index (ARI) and clustering accuracy (CA).

Keywords: Categorical data, fuzzy k-modes, genetic algorithm, intuitionistic fuzzy set, frequency probability-based distance, weighted features.

1. Introduction

Data clustering is an unsupervised learning technique that partitions a given dataset into multiple clusters in which objects in a cluster are similar to each other and distinct from the objects that belong to other clusters [1]. The clustering process aims to reveal the hidden structure of the unlabeled data instances in various applications, such as pattern recognition, market research, decision making, medical applications, and so on. In general, clustering algorithms are usually reserved for numerical data, for which a standard distance measure can be used to compute the distance between any pair of data instances straightforwardly. Clustering of categorical data has received less attention than that of numerical data because of the challenging nature of the data: categorical attributes lack an inherent order, which makes it difficult to define a proximity measure between two data objects [2].

The classic approach to categorical data clustering is to extend an existing clustering algorithm for numerical data with a suitable dissimilarity measure designed for categorical attributes. For instance, the first conventional algorithm for categorical data, the k-modes algorithm proposed by Huang [3], is an extended version of the k-means algorithm that uses the Hamming distance instead of the Euclidean distance and the cluster mode instead of the cluster mean to represent the cluster center. Similarly, the fuzzy k-modes algorithm [4] is an extended version of the fuzzy k-means algorithm for categorical data. Thereafter, clustering algorithms for categorical data have received progressively more attention due to the variety of categorical data in real-world problems. These algorithms cover both single and multiple objectives and include ROCK [5], CACTUS [6], COOLCAT [7], LIMBO [8], wk-modes [9], MOGA [10], NSGA-FMC [11], SBC [12], MOFC [13], and so on. However, most of the existing algorithms face two major drawbacks that can reduce the clustering performance: some consider all attributes equally when calculating the dissimilarity between two objects, while others may terminate at a local optimal solution.

Recently, the intuitionistic fuzzy set (IFS), first introduced by Atanassov [14] based on the concept of fuzzy set theory, has been used in data clustering to enhance the clustering performance. The IFS is known as a generalization of the fuzzy set and is usually used for handling uncertainty. An IFS is described by three parameters: membership, non-membership, and hesitation degrees. Xu et al. [15] reported a clustering algorithm for IFSs which classifies the IFSs by constructing the association and equivalent association matrices. Xu [16] appended the IFS to hierarchical clustering to deal with uncertain data based on the distance measure between IFSs and the intuitionistic fuzzy aggregation operator. Similarly, some studies developed clustering techniques by combining the IFS with the fuzzy c-means algorithm, such as the intuitionistic fuzzy c-means algorithm [15] and the intuitionistic fuzzy possibilistic c-means clustering algorithm [17]. Besides, Xu et al. [18] integrated the IFS with spectral clustering to improve the clustering performance as well as to obtain the global optimal solution. The existing methods are generally based on either distance measures or intuitionistic fuzzy information; however, some of them cannot guarantee the global optimal solution [18]. Moreover, they are all reserved for numerical datasets.

To overcome the aforementioned drawbacks of the existing algorithms, as well as to exploit the application prospects of the IFS for improving the clustering performance, this study proposes a novel clustering algorithm for categorical data, i.e., the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm. This algorithm is a combination of the conventional fuzzy k-modes algorithm [4] and the IFS. We firstly introduce the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the IFS in the clustering process. The IWFKM algorithm considers the importance of each attribute differently by updating the weight vector for the categorical attributes in each iteration. In addition, the IWFKM algorithm replaces the Hamming distance with a new similarity measure, the frequency probability-based distance metric, which has been shown to improve the clustering result [19]. Then, the proposed GIWFKM algorithm integrates the IWFKM algorithm and the genetic algorithm (GA) to pursue the global optimal solution. The GA is chosen because it is a search and optimization technique that has been used to solve problems in various domains due to its extensive applicability [20]. Moreover, the GA has been applied in many clustering approaches for both numerical and categorical data to improve the clustering performance, e.g., the genetic k-means algorithm [21], genetic fuzzy c-means [22], and genetic fuzzy k-modes (GFKM) [23]. Besides, the proposed GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove some redundant features and, therefore, improve the clustering performance and reduce the computational time.

The rest of this paper is organized as follows. Section 2 reviews the related literature, including the fuzzy k-modes algorithm, the weighted fuzzy k-modes algorithm, and the IFS theory. The proposed algorithms are introduced in Section 3, while Section 4 presents a series of experiments and results. Finally, the conclusion and future research directions are summarized in Section 5.

2. Literature review

This section firstly reviews the fuzzy k-modes and weighted fuzzy k-modes algorithms. Then the IFS theory with two generating functions is also described.

2.1 Fuzzy k-modes and weighted fuzzy k-modes algorithms

Conventionally, the fuzzy k-modes (FKM) algorithm, investigated by Huang [4], is one of the most popular algorithms designed for categorical data. Let X be a set of n categorical

objects. Each object x_i can be characterized by a set of m categorical attributes, so that x_i = {x_i1, x_i2, …, x_im}. The FKM algorithm partitions X into k clusters by finding U and Z to minimize the following objective function:

F(U, Z) = Σ_{j=1}^{k} Σ_{i=1}^{n} u_ji^α d(x_i, z_j),   (1)

subject to

0 ≤ u_ji ≤ 1,  1 ≤ j ≤ k, 1 ≤ i ≤ n,   (2)

Σ_{j=1}^{k} u_ji = 1,  1 ≤ i ≤ n,   (3)

0 < Σ_{i=1}^{n} u_ji < n,  1 ≤ j ≤ k,   (4)

where k is a pre-defined number of clusters, α is a fuzziness component, U = (u_ji) is a k × n fuzzy membership matrix, Z = {z_1, z_2, …, z_k} is the set of cluster modes, and d(x_i, z_j) is the distance between object x_i and its corresponding cluster mode z_j. d(x_i, z_j) is measured by the simple matching dissimilarity measure, or Hamming distance, as follows:

d(x_i, z_j) = Σ_{l=1}^{m} δ(x_il, z_jl),   (5)

δ(x_il, z_jl) = { 0 if x_il = z_jl;  1 if x_il ≠ z_jl }.   (6)
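Eqs. (1), (5), and (6) can be sketched in a few lines of Python. This is only an illustration of the definitions (the function names are ours, not from the paper):

```python
def matching_dissimilarity(x, z):
    """Hamming distance of Eqs. (5)-(6): count the attributes on
    which the two categorical vectors disagree."""
    return sum(1 for xl, zl in zip(x, z) if xl != zl)

def fkm_objective(X, Z, U, alpha=1.1):
    """FKM objective of Eq. (1): fuzzified memberships u_ji**alpha
    weighted by the object-to-mode dissimilarities."""
    return sum(
        (U[j][i] ** alpha) * matching_dissimilarity(X[i], Z[j])
        for j in range(len(Z))
        for i in range(len(X))
    )
```

With a crisp membership matrix (all entries 0 or 1), `fkm_objective` reduces to the plain k-modes cost, i.e., the total number of attribute mismatches between objects and their assigned modes.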

However, the FKM algorithm suffers from several major drawbacks. First, the clustering performance is sensitive to the initial choice of the cluster modes. Next, the clustering process may terminate at a local optimal solution. Moreover, the FKM algorithm considers all attributes equally, even though some attributes may not contribute to discriminating the clusters. Therefore, Saha and Das [24] presented a weighted fuzzy k-modes (WFKM) algorithm which uses a weight factor for each categorical attribute. The WFKM algorithm minimizes:

F(U, Z, W) = Σ_{j=1}^{k} Σ_{i=1}^{n} u_ji^α · d_W(x_i, z_j),   (7)

where W = (w_1, w_2, …, w_m) is a weight vector for the categorical attributes,

d_W(x_i, z_j) = Σ_{l=1}^{m} δ_W(x_il, z_jl),   (8)

δ_W(x_il, z_jl) = { 0 if x_il = z_jl;  w_l^β if x_il ≠ z_jl },   (9)

where β is the coefficient of weight, which is selected excluding 1. If β = 0, the WFKM algorithm becomes the conventional FKM algorithm. The procedure of the WFKM algorithm is described as follows:

Step 1: Randomly select cluster modes Z^1, fix the fuzziness value α and the number of iterations T, generate an initial weight vector W^1, and identify the membership matrix U^1 such that the cost function F(U^1, Z^1, W^1) is minimized. Set the iteration counter t = 1.

Step 2: Fix Z^t and W^t and update U^{t+1}. If F(U^{t+1}, Z^t, W^t) = F(U^t, Z^t, W^t), then stop; else go to Step 3.

Step 3: Fix W^t and U^{t+1} and update Z^{t+1}. If F(U^{t+1}, Z^{t+1}, W^t) = F(U^{t+1}, Z^t, W^t), then stop; else go to Step 4.

Step 4: Fix U^{t+1} and Z^{t+1} and update W^{t+1}. If F(U^{t+1}, Z^{t+1}, W^{t+1}) = F(U^{t+1}, Z^{t+1}, W^t) or t = T, then stop; else set t = t + 1 and go to Step 2.
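The membership and mode updates used inside Steps 2 and 3 can be sketched as follows. This is a minimal Python illustration, not the authors' Matlab code: the closed-form updates shown are the standard fuzzy k-modes ones (the paper states the corresponding formulas for its intuitionistic variant in Section 3), and the tie-handling for objects that coincide with a mode is our own simplification:

```python
def weighted_dissimilarity(x, z, w, beta):
    # Eqs. (8)-(9): each mismatched attribute l contributes w_l**beta.
    return sum(w[l] ** beta for l in range(len(x)) if x[l] != z[l])

def update_memberships(X, Z, w, alpha, beta):
    # u_ji is the inverse of sum_h (d(x_i,z_j)/d(x_i,z_h))**(1/(alpha-1));
    # objects at zero distance from a mode get crisp membership,
    # shared equally over ties.
    U = [[0.0] * len(X) for _ in Z]
    for i, x in enumerate(X):
        d = [weighted_dissimilarity(x, z, w, beta) for z in Z]
        zeros = [j for j, dj in enumerate(d) if dj == 0.0]
        for j in range(len(Z)):
            if zeros:
                U[j][i] = 1.0 / len(zeros) if j in zeros else 0.0
            else:
                U[j][i] = 1.0 / sum((d[j] / dh) ** (1.0 / (alpha - 1.0)) for dh in d)
    return U

def update_modes(X, U, m, alpha):
    # Each attribute of mode j takes the category carrying the largest
    # fuzzified membership mass over the dataset.
    Z = []
    for j in range(len(U)):
        mode = []
        for l in range(m):
            mass = {}
            for i, x in enumerate(X):
                mass[x[l]] = mass.get(x[l], 0.0) + U[j][i] ** alpha
            mode.append(max(mass, key=mass.get))
        Z.append(mode)
    return Z
```

Alternating these two updates (plus the weight update of Step 4) until the objective stops improving is exactly the loop described in Steps 1–4 above.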

2.2 Intuitionistic fuzzy sets

Atanassov [14] introduced the concept of the intuitionistic fuzzy set (IFS), which uses membership values and non-membership values to evaluate uncertainty. An IFS is defined as:

A = {(x, u_A(x), v_A(x)) | x ∈ X},   (10)

where X is a universe of discourse, and u_A(x) ∈ [0, 1] and v_A(x) ∈ [0, 1] are the membership and non-membership degrees, with the condition u_A(x) + v_A(x) ≤ 1 ∀ x ∈ X. The degree of hesitation of x to A, π_A(x), is defined as:

π_A(x) = 1 − u_A(x) − v_A(x),   (11)

where 0 ≤ π_A(x) ≤ 1 ∀ x ∈ X. If π_A(x) = 0, the IFS reduces to an ordinary fuzzy set. On the contrary, the IFS is totally intuitionistic if π_A(x) = 1. Therefore, an IFS is completely explained by

three elements: 1) the membership degree u_A(x), 2) the non-membership degree v_A(x), and 3) the hesitation degree π_A(x).

A parametric fuzzy complement is used to construct the IFS, and there are two common methods to create the intuitionistic fuzzy complement. According to Yager's generating function [25], the IFS is obtained as:

A = {(x, u_A(x), (1 − u_A(x)^α)^{1/α}) | x ∈ X},   (12)

where α ∈ (0, ∞) is a control parameter of the non-membership and hesitation degrees. The hesitation degree can then be calculated as:

π_A(x) = 1 − u_A(x) − (1 − u_A(x)^α)^{1/α}.   (13)

Considering Sugeno's generating function [26], the IFS and the hesitation degree can be written as:

A = {(x, u_A(x), (1 − u_A(x))/(1 + α u_A(x))) | x ∈ X},   (14)

π_A(x) = 1 − u_A(x) − (1 − u_A(x))/(1 + α u_A(x)).   (15)
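The two hesitation degrees of Eqs. (13) and (15) are direct one-liners. The default parameter values below are our own illustrative choices, not prescribed by the paper (note that in the Yager form the hesitation degree stays non-negative only for α ≤ 1):

```python
def yager_hesitation(u, a=0.5):
    # Eqs. (12)-(13): non-membership v = (1 - u**a)**(1/a); the
    # hesitation is what membership and non-membership leave over.
    v = (1.0 - u ** a) ** (1.0 / a)
    return 1.0 - u - v

def sugeno_hesitation(u, a=2.0):
    # Eqs. (14)-(15): non-membership v = (1 - u) / (1 + a*u).
    v = (1.0 - u) / (1.0 + a * u)
    return 1.0 - u - v
```

At u = 0 and u = 1 both hesitation degrees vanish, as required: the IFS then degenerates to an ordinary (crisp-complement) fuzzy set at those points.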

3. Proposed genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed GIWFKM algorithm is described in this section. We firstly introduce the IWFKM algorithm, which integrates the IFS with the WFKM algorithm. Moreover, the IWFKM algorithm uses the frequency probability-based distance metric instead of the Hamming distance to calculate the dissimilarity between data instances. The proposed GIWFKM algorithm, which combines the IWFKM algorithm and the GA, is then expected to pursue the global optimal solution of the clustering process. In the proposed GIWFKM algorithm, unsupervised feature selection based on the correlation coefficient is performed prior to the GA procedure. In addition, the proposed GIWFKM algorithm uses the cluster discrimination index (CDI), a clustering indicator based on the average intra-cluster and inter-cluster distances, as the fitness function of the GA. The genetic operators, including crossover and mutation, are implemented through the updating rules of the IWFKM algorithm. Fig. 1 illustrates the algorithm framework of this study. The following sub-sections clarify the details of the proposed IWFKM and GIWFKM algorithms.

3.1 The intuitionistic weighted fuzzy k-modes (IWFKM) algorithm

The IWFKM algorithm integrates the IFS into the WFKM algorithm to improve the clustering performance. Herein, the hesitation degree is added to the fuzzy membership degree to obtain the intuitionistic fuzzy membership value. This idea is inspired by the studies of Lin [27] and Shang et al. [28], which also combined the membership degree and the hesitation degree in the clustering procedure to obtain the intuitionistic fuzzy membership value, and thereby obtained more accurate results with the fuzzy c-means algorithm. The fuzzy membership degree after appending the hesitation degree becomes:

u*_ji = u_ji + π_ji.   (16)

Therefore, the objective function of the IWFKM algorithm becomes:

F(U, Z, W) = Σ_{j=1}^{k} Σ_{i=1}^{n} (u*_ji)^α d_W(x_i, z_j).   (17)

Instead of using the Hamming distance with weighted attributes as in Eq. (8), this study uses a new distance which computes the proximity between two categorical data instances based on the frequency probability [19]. Herein, the frequency probability-based distance metric for categorical attributes is defined as follows:

d(x_i, z_j) = Σ_{l=1}^{m} δ(x_il, z_jl) · p(x_il = z_jl),   (18)

where δ(x_il, z_jl) is defined in Eq. (5). Given two categorical values x_il and z_jl of attribute A_l, p(x_il = z_jl) is the frequency probability that x_il and z_jl take the same categorical value. p(x_il = z_jl) is calculated based on the frequency of the event x_il = z_jl in the whole dataset, as follows [19]:

p(x_il = z_jl) = p(A_l = x_il | X) · p⁻(A_l = x_il | X) + p(A_l = z_jl | X) · p⁻(A_l = z_jl | X),   (19)

where

p(A_l = x_il | X) = σ_{A_l = x_il}(X) / σ_{A_l ≠ Null}(X),   (20)

p⁻(A_l = x_il | X) = (σ_{A_l = x_il}(X) − 1) / (σ_{A_l ≠ Null}(X) − 1).   (21)

In Eq. (19), 𝑝(𝐴𝑙 = 𝑥𝑖𝑙 |𝑋). 𝑝− (𝐴𝑙 = 𝑥𝑖𝑙 |𝑋) indicates the situation that both 𝑥𝑖𝑙 and 𝑧𝑗𝑙 take value 𝑥𝑖𝑙 , 𝑝(𝐴𝑙 = 𝑥𝑖𝑙 |𝑋) is the frequency probability which is calculated by the frequency of the instances that take value 𝑥𝑖𝑙 for attribute 𝐴𝑙 in the given dataset X, 𝑝− (𝐴𝑙 = 𝑥𝑖𝑙 |𝑋) is the 9

ACCEPTED MANUSCRIPT frequency probability that the event 𝑥𝑖𝑙 = 𝑧𝑗𝑙 (both takes value 𝑥𝑖𝑙 ) occurs. Herein, 𝜎𝐴𝑙=𝑥𝑖𝑙 (𝑋) indicates the number of instances that have value 𝑥𝑖𝑙 in the whole dataset X. Similarly, the 𝑝(𝐴𝑙 = 𝑧𝑗𝑙 |𝑋). 𝑝− (𝐴𝑙 = 𝑧𝑗𝑙 |𝑋) expresses the situation that both 𝑥𝑖𝑙 and 𝑧𝑗𝑙 take value 𝑧𝑗𝑙 . Eq. (20) and (21) are also applied to calculate the frequency probability of this situation.
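Eqs. (18)–(21) can be sketched directly from their definitions. A minimal Python illustration (the helper names are ours; `None` stands in for a Null value):

```python
def freq_prob_same(column, a, b):
    # Eqs. (19)-(21): probability that two values of this attribute
    # agree on a (or on b), estimated from the non-null frequencies;
    # the "-1" terms reflect sampling without replacement.
    vals = [v for v in column if v is not None]
    n = len(vals)

    def p(v):          # Eq. (20)
        return vals.count(v) / n

    def p_minus(v):    # Eq. (21)
        return (vals.count(v) - 1) / (n - 1)

    return p(a) * p_minus(a) + p(b) * p_minus(b)

def freq_prob_distance(x, z, columns):
    # Eq. (18): mismatch indicator delta weighted by p(x_l = z_l),
    # with columns[l] holding attribute l over the whole dataset.
    return sum(freq_prob_same(columns[l], x[l], z[l])
               for l in range(len(x)) if x[l] != z[l])
```

Unlike the plain Hamming distance, a mismatch on a rare pair of categories contributes less than a mismatch on two frequent categories, which is the intended frequency-probability weighting.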

Finally, the distance d_W in the objective function of the IWFKM algorithm is identified as:

d_W(x_i, z_j) = Σ_{l=1}^{m} δ_W(x_il, z_jl) · p(x_il = z_jl),   (22)

where δ_W(x_il, z_jl) is defined in Eq. (9).

Updating rules

The intuitionistic fuzzy membership degrees and the cluster modes are updated based on the following formulations:

u*_ji = 1 if x_i = z_j;
u*_ji = 0 if x_i = z_h, h ≠ j;
u*_ji = [ Σ_{h=1}^{k} ( d_W(x_i, z_j) / d_W(x_i, z_h) )^{1/(α−1)} ]^{−1} otherwise,   (23)

z_jl = a_lr ∈ DOM(A_l),   (24)

where

r = arg max_{1 ≤ t ≤ n_l} Σ_{i, x_il = a_lt} u*_ji^α,   (25)

i.e., Σ_{i, x_il = a_lr} u*_ji^α ≥ Σ_{i, x_il = a_lt} u*_ji^α for 1 ≤ t ≤ n_l.   (26)

The weight vector is updated as follows:

w_l = 0 if Δ_l = 0;
w_l = [ Σ_{g=1}^{s} ( Δ_l / Δ_g )^{1/(β−1)} ]^{−1} if Δ_l ≠ 0,   (27)

where s is the number of attributes with Δ_l ≠ 0. According to the frequency probability-based distance metric, Δ_l is defined as:

Δ_l = Σ_{j=1}^{k} Σ_{i=1}^{n} (u*_ji)^α · δ(x_il, z_jl) · p(x_il = z_jl).   (28)
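The weight update of Eqs. (27)–(28) is a short closed form; given the per-attribute costs Δ_l, it can be sketched as:

```python
def update_weights(delta, beta=2.0):
    # Eq. (27): attributes whose cost Delta_l (Eq. (28)) is zero get
    # weight 0; the remaining weights are inversely related to their
    # relative costs, and the nonzero weights sum to 1.
    weights = []
    for dl in delta:
        if dl == 0.0:
            weights.append(0.0)
        else:
            weights.append(1.0 / sum((dl / dg) ** (1.0 / (beta - 1.0))
                                     for dg in delta if dg != 0.0))
    return weights
```

For β = 2 this reduces to weights proportional to 1/Δ_l: attributes that accumulate little mismatch cost are judged more discriminative and receive larger weights.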

Next, the IWFKM algorithm is integrated with the GA to propose the GIWFKM algorithm based on the updating rules of the IWFKM algorithm.

3.2 Genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed GIWFKM algorithm employs the GA to obtain the global optimal solution for the IWFKM algorithm. Moreover, the proposed GIWFKM algorithm also performs feature selection to remove some redundant features and retain the important features prior to the GA procedure. The details of the proposed GIWFKM algorithm are described in this section.

Feature selection

The correlation coefficient is a simple method to measure the relation between two variables. The correlation coefficient of two variables x and y is calculated as follows:

ρ(x, y) = covariance(x, y) / √(variance(x) · variance(y)),   (29)

ρ(x, y) = { 1 or −1 if x and y are completely correlated;  0 if x and y are totally uncorrelated }.   (30)

This study uses the correlation coefficient to select which features should be used for clustering. The feature selection (FS) aims to identify and remove the redundant features which are highly correlated with the other features. Firstly, the pairwise correlation coefficients of the given dataset are calculated. Two attributes are completely correlated if the correlation value ρ is 1 or −1 and totally uncorrelated if ρ = 0. Then, the attribute that is most highly correlated with the others is removed each time, and the procedure is repeated until the termination condition is met. In this study, the removed features account for at most 20% of the total number of features.
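The greedy filter described above can be sketched as follows. This is our own minimal reading of the procedure, assuming the categorical columns have already been label-encoded as numbers so that Eq. (29) applies:

```python
import math

def pearson(x, y):
    # Eq. (29): covariance over the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(vx * vy)

def select_features(data, max_removed_frac=0.2):
    """Repeatedly drop the attribute whose summed absolute correlation
    with the remaining attributes is largest, removing at most 20% of
    the attributes. `data` is a list of numeric-encoded columns."""
    keep = list(range(len(data)))
    budget = int(len(data) * max_removed_frac)
    for _ in range(budget):
        totals = {l: sum(abs(pearson(data[l], data[g]))
                         for g in keep if g != l) for l in keep}
        keep.remove(max(totals, key=totals.get))
    return sorted(keep)
```

The returned index list is then used to project the dataset before the GA procedure starts, which is where the claimed runtime saving comes from.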

Chromosome representation

A chromosome is a k × n matrix, where k is the number of clusters and n is the number of data instances [10]. The chromosome can be illustrated as follows:

U = [ a_11  a_12  …  a_1n ;
      a_21  a_22  …  a_2n ;
      ⋮ ;
      a_k1  a_k2  …  a_kn ],   (31)

where U is an intuitionistic fuzzy membership matrix defined according to Eq. (16). The initialization process generates the initial population according to this chromosome setting and the population size N.

Fitness evaluation

The fitness evaluation is an indispensable process for evaluating how good the chromosomes are. This study uses the cluster discrimination index (CDI) as the fitness function. The CDI is a criterion that measures clustering performance based on the average intra-cluster and inter-cluster distances; smaller CDI values indicate better results. Thus, chromosomes with small CDI values are selected to reproduce the next generation. The CDI is calculated by [19]:

CDI = (1/k) Σ_{r=1}^{k} { AAD(C_r) / Σ_{t ≠ r} AED(C_r, C_t) }.   (32)

The average intra-cluster distance for cluster C_r with n_r data instances is calculated by [29]:

AAD(C_r) = Σ_{x_i ∈ C_r} Σ_{x_j ∈ C_r} d(x_i, x_j) / n_r²,   (33)

and the average inter-cluster distance between two clusters C_r and C_t, with n_r and n_t data instances respectively, is formulated as [29]:

AED(C_r, C_t) = Σ_{x_i ∈ C_r} Σ_{x_j ∈ C_t} d(x_i, x_j) / (n_r n_t).   (34)

Selection process

Selection picks good chromosomes from the population so that the genetic operators can reproduce the next generation. There are two popular selection methods in the GA, i.e., roulette wheel selection and tournament selection. In this study, roulette wheel selection is used to select the chromosomes for reproduction, since every chromosome has a chance of being picked with a probability based on its fitness value. Chromosomes with smaller fitness values (smaller CDI) have a higher probability of being picked.
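Because smaller CDI must mean a larger slice of the wheel, the roulette needs a decreasing transform of the fitness. The inverse transform used below is one common choice; the paper does not spell out its exact transform, so treat this as an illustrative assumption:

```python
def roulette_select(population, fitnesses, rng):
    # Roulette wheel for a minimization objective: each chromosome's
    # slice is proportional to 1/fitness, so smaller CDI values are
    # picked more often. `rng` is a callable returning a float in [0, 1).
    weights = [1.0 / f for f in fitnesses]
    total = sum(weights)
    r = rng() * total
    acc = 0.0
    for individual, w in zip(population, weights):
        acc += w
        if r <= acc:
            return individual
    return population[-1]
```

Passing a deterministic `rng` (as in the usage below) makes the selection reproducible for testing; in the GA loop one would pass `random.random`.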

Crossover

The proposed GIWFKM algorithm uses an approach similar to the GFKM algorithm, which applies one step of the fuzzy k-modes algorithm in the crossover process. However, the proposed GIWFKM algorithm updates the chromosomes based on the IWFKM algorithm's updating rules. The crossover process is described as follows:

For s = 1 : N
    Each chromosome s is an intuitionistic fuzzy membership matrix U_s as defined in Eq. (16). An initial weight vector W_s is randomly generated.
    Update the cluster modes Ẑ_s according to Eq. (24) with the given U_s and W_s.
    Update the weight vector Ŵ_s according to Eq. (27) with the given Ẑ_s and U_s.
    Update the intuitionistic fuzzy membership degrees Û_s based on Eq. (23) with the given Ẑ_s and Ŵ_s.
    Obtain chromosome s after crossover from the updated Û_s.
End for

Mutation

The mutation process changes each gene of a chromosome with the mutation probability p_m. Due to the membership constraint in Eq. (3), a change in one gene leads to a change of all membership degrees of the corresponding object. The mutation process is described as follows:

For s = 1 : N
    Each gene in chromosome s is denoted as a_ji, j = 1, …, k and i = 1, …, n, which is a membership degree in the U_s matrix as illustrated in Eq. (31).
    For i = 1 : n
        Generate a random number r_i ∈ [0, 1].
        If r_i < p_m, change the genes (a_ji, j = 1, …, k) of the corresponding object i by:
             Randomly generating v_ji ∈ [0, 1], j = 1, …, k.
             Calculating â_ji = v_ji / Σ_{j=1}^{k} v_ji.
             Replacing a_ji with â_ji.
        End if
    End for
End for

Termination condition

The stopping condition is set using the number of generations.
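The mutation loop above can be sketched compactly; the renormalization step is what keeps constraint (3) satisfied after a gene changes:

```python
import random

def mutate(U, pm, rng=random):
    # With probability pm per object i, resample that object's whole
    # membership column and renormalize so that the column still sums
    # to 1 across the k clusters (constraint (3)).
    k, n = len(U), len(U[0])
    for i in range(n):
        if rng.random() < pm:
            v = [rng.random() for _ in range(k)]
            s = sum(v)
            for j in range(k):
                U[j][i] = v[j] / s
    return U
```

Setting p_m near 0.1 (the value used in the experiments of Section 4) perturbs roughly one object in ten per chromosome per generation.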

3.3 Time complexity

Firstly, the complexities of the benchmark algorithms are investigated. The time complexity of the classical FKM algorithm is O(kn(m + M)), where M is the total number of attribute values over all m attributes [4]. Regarding the WFKM algorithm, the computational cost is higher since it needs to update the weight vector in each iteration; thus, the time complexity of the WFKM algorithm is O(kn(2m + M)) [24]. The GA is usually implemented in polynomial time with O(n²). However, the GFKM algorithm employs the one-step fuzzy k-modes algorithm to update the chromosomes in the genetic operators; therefore, the GFKM algorithm's time complexity becomes O(n²k(m + M)), which is slower than the FKM and WFKM algorithms.

The proposed IWFKM algorithm integrates the IFS with the WFKM algorithm. As shown in Eq. (16), the intuitionistic fuzzy membership function is updated in each iteration with complexity O(2n). Therefore, the computational cost of the IWFKM algorithm becomes O(2kn(2m + M)); that is, the time complexity of the IWFKM algorithm is greater than that of the WFKM algorithm. Similarly, the time complexity of the proposed GIWFKM algorithm without feature selection prior to the GA procedure is O(2kn²(2m + M)), since it combines the IWFKM algorithm and the GA. However, the feature selection removes up to 20% of the redundant categorical attributes, so the final time complexity of the GIWFKM algorithm becomes O(2kn²(2m′ + M′)), where m′ and M′ are the number of categorical attributes and the total number of categorical values over all attributes after feature selection, respectively. The proposed GIWFKM algorithm is therefore expected to be faster than the GFKM algorithm. In the next section, experiments are conducted to compare the clustering results of the proposed IWFKM and GIWFKM algorithms with other benchmark algorithms on various categorical datasets.

4. Experimental results

4.1 Datasets and parameter setting

In this study, the experimental datasets are collected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Twelve categorical datasets are selected with a variety of dimensionalities. For instance, the Lung dataset has the largest dimensionality with 56 attributes, while the two smallest ones, the Breast Cancer and Tic-tac-toe datasets, have only 9 attributes. Table 1 provides a brief description of the datasets used in this study.

<Table 1. Benchmark datasets.>
ED

In addition, several benchmark algorithms are used to compare with the proposed IWFKM and GIWFKM algorithms. First of all, the FKM algorithm is selected since it is the

PT

most popular and conventional method for the categorical data. The WFKM algorithm is given second place due to its advantages in the literature review. In addition, the GFKM

CE

algorithm, which used GA to obtain the global optimal solution for the FKM algorithm, is counted as a benchmark algorithm. Another one, the SBC algorithm, which employed the

AC

group structure inherent in a set of categorical instances, is also selected. Finally, the MOFC algorithm, which combines fuzzy centroids algorithm and genetic operation, is used to compare with the proposed algorithms. The FKM, WFKM, GFKM, IWFKM, and GIWFKM algorithms were coded in Matlab programming language and run on a processor Intel Core i7-3770 CPU, 16GB RAM, and Windows 10 operation system. Each algorithm was implemented 30 times and then the results were taken by calculating the average. In general, after conducting several experiments with different values of GA’s parameters, the number of generations, population size, crossover rate, and mutation rate were set at 100, 100, 0.8, and 15

ACCEPTED MANUSCRIPT 0.1, respectively, for the GA-based approaches. The IFS uses Yager’s generating function to update the hesitation degree. Besides, the results of the SBC and MOFC algorithms are adopted from the original papers [12, 13]. To evaluate the clustering performance, two external clustering validation indices are selected, i.e., adjusted rank index (ARI), and clustering accuracy (CA). The ARI measures the agreement between the partitions based on the contingency table. The CA calculates the

CR IP T

percentage of correctly classified data instances in the clustering result of the proposed algorithm compared with the pre-determined class label. The ARI and CA are defined as follows [30, 31]: 2(𝑎𝑑−𝑏𝑐) (𝑎+𝑏)(𝑏+𝑑)+(𝑎+𝑐)(𝑐+𝑑)

𝐴𝑅𝐼(𝑇, 𝐶) =

,

(35)

AN US

where 𝑇 is the pre-determined or true class label, 𝐶 is the result of clustering algorithm, 𝑎, 𝑏, 𝑐, 𝑎𝑛𝑑 𝑑 are the number of pairs of objects that are placed: 1) in the same class in both T and C, 2) in the same class in T but different class in C, 3) in the same class in C but different class in T, and 4) in the different class in both T and C, respectively. 1 𝑛

∑𝑘𝑖=1 𝑎𝑖 ,

(36)

M

𝐶𝐴 =

determined class.

PT

4.2 Experimental results

ED

where 𝑎𝑖 is the maximum number of objects that have the same class label with the pre-

4.2.1 Evaluate the effect of FS on the proposed GIWFKM

CE

To evaluate the contribution of the FS in the proposed GIWFKM algorithm, the experiment firstly conducts the comparison of performing clustering between the proposed

AC

GIWFKM with FS prior and the proposed GIWFKM without FS on 12 tested datasets. The compared results based on the CDI (i.e., objective function in the proposed GIWFKM algorithm), ARI, AC, and computational time are shown in Table 2. As shown in Table 2, the proposed GIWFKM algorithm with FS outperforms the one without FS in terms of three clustering validation indicators: CDI, ARI, and AC. For instance, the proposed GIWFKM algorithm with FS performs a better result than that of the one without FS on: 1) 10 datasets (excluding Soybean and Zoo) regarding the CDI comparison, 2) 11 datasets (excluding Tictac-toe dataset) regarding the ARI and CA comparison. Moreover, the computational time of 16

ACCEPTED MANUSCRIPT the proposed algorithm with FS is also faster than that of the one without FS. Therefore, it can be concluded that the FS contributes to the performance of the proposed GIWFKM algorithm in both the clustering performance and time complexity.

4.2.2 Result comparison with the benchmark algorithms

CR IP T

Comparison of the proposed GIWFKM algorithm with FS and without FS.>

This section considers evaluating the proposed GIWFKM algorithm in comparison with the benchmark algorithms in terms of the ARI and AC. Note that 12 datasets are used to conduct the experiment in the proposed GIWFKM algorithm with and without FS in section

AN US

4.2.1. However, only 6 datasets are selected to compare with the benchmark algorithms because these are the mutual datasets which were used to conduct the experiment on both the proposed and the benchmark algorithms.

Table 3 shows the computational results of all algorithms on the 6 tested datasets in terms of the ARI. It is not difficult to see that the proposed GIWFKM algorithm outperforms its rivals, since it achieves better results on 5 of the 6 tested datasets (i.e., Voting, Mushroom, Zoo, Lung, and Dermatology). For the Soybean dataset, the best ARI is obtained by the MOFC algorithm. However, the results shown in Table 3 are the average values of multiple runs. To verify that an average result is significantly different from those of the benchmark methods, a hypothesis test with significance level α = 0.05 is conducted. Moreover, this study compares the performance of both the IWFKM and GIWFKM algorithms with the benchmark algorithms. Thus, the IWFKM algorithm is first compared with the FKM, WFKM, GFKM, SBC, and MOFC algorithms to analyze the effect of the IFS on the clustering result. Thereafter, the proposed GIWFKM algorithm is compared with the other algorithms to examine the improvement obtained by using GA to search for the global optimal solution.
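The excerpt states only the significance level (α = 0.05), not the test statistic behind the sign tables. As an illustration, a two-sample t-test over repeated runs can be mapped to the "+"/"="/"-" symbols; the run counts and score values below are invented toy data.

```python
import numpy as np
from scipy import stats

def sign_of_difference(runs_a, runs_b, alpha=0.05):
    """Return '+', '=', or '-' for algorithm A versus algorithm B.

    runs_a, runs_b: ARI (or CA) values over repeated runs.
    A two-sample t-test is one plausible choice of test; the paper
    only specifies alpha = 0.05.
    """
    _, p = stats.ttest_ind(runs_a, runs_b)
    if p >= alpha:
        return '='          # no significant difference
    return '+' if np.mean(runs_a) > np.mean(runs_b) else '-'

rng = np.random.default_rng(0)
a = rng.normal(0.70, 0.01, size=20)   # hypothetical ARI runs of A
b = rng.normal(0.40, 0.01, size=20)   # hypothetical ARI runs of B
print(sign_of_difference(a, b))       # '+'
```

Applying this to every (algorithm pair, dataset) cell reproduces the structure of Tables 4, 5, 7, and 8.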


To compare the proposed IWFKM algorithm with the benchmark algorithms, Table 4 shows the result of the hypothesis test on each dataset. The symbol "+" indicates that the proposed IWFKM algorithm performs better; "=" indicates no significant difference between the two algorithms; and "-" indicates that the IWFKM algorithm performs worse. According to the statistical results in Table 4, the IWFKM algorithm performs worse than all benchmark algorithms on the Soybean dataset in terms of the ARI. Compared with the FKM algorithm, the IWFKM algorithm yields better results on 4 datasets (i.e., Voting, Mushroom, Zoo, and Lung), and there is no difference between the two algorithms on the Dermatology dataset. Similarly, the IWFKM algorithm also performs better on 4 datasets (i.e., Voting, Mushroom, Lung, and Dermatology) and comparably on the Zoo dataset when compared with the WFKM algorithm. Regarding the GFKM and SBC algorithms, the performance of the IWFKM algorithm is comparable: 1) against the GFKM algorithm, it achieves 3 significantly better results (on the Voting, Mushroom, and Lung datasets) and 2 worse results (on the Soybean and Dermatology datasets); and 2) against the SBC algorithm, it achieves 2 better results (Voting and Zoo), 2 similar results (Mushroom and Lung), and 2 worse results (Soybean and Dermatology). In contrast, the statistical test results for the comparison between the IWFKM and MOFC algorithms are quite different: the IWFKM algorithm performs worse on 4 of the 6 tested datasets (i.e., Soybean, Mushroom, Zoo, and Dermatology), better on only one dataset (i.e., Voting), and with no significant difference on the remaining dataset (i.e., Lung). Overall, the IWFKM algorithm outperforms the FKM and WFKM algorithms since it exploits the advantages of the IFS as well as the frequency probability-based distance metric. However, the IWFKM algorithm only achieves results comparable with those of the GFKM and SBC algorithms, and even worse results than the MOFC algorithm. This is because the proposed IWFKM method still suffers from the same major drawback as the FKM and WFKM algorithms: the clustering result may terminate at a local optimal solution. The statistical result is also illustrated in Fig. 2.
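The frequency probability-based distance metric mentioned above weights category comparisons by how often each category occurs in the data. The sketch below illustrates the general idea with a deliberately simplified mismatch weight (the average frequency probability of the two observed categories); it is not the exact metric adopted in the paper.

```python
from collections import Counter

def frequency_distance(X, i, j):
    """Simplified frequency-weighted dissimilarity between rows i and j.

    A mismatch on an attribute contributes the average frequency
    probability of the two observed categories, so disagreeing on
    common categories counts more than disagreeing on rare ones.
    This weighting is an illustration only, not the authors' formula.
    """
    n = len(X)
    total = 0.0
    for a in range(len(X[0])):                 # loop over attributes
        freq = Counter(row[a] for row in X)    # category counts of attribute a
        x, y = X[i][a], X[j][a]
        if x != y:
            total += (freq[x] + freq[y]) / (2 * n)
    return total

X = [["red", "s"], ["red", "m"], ["blue", "s"], ["blue", "m"]]
print(frequency_distance(X, 0, 3))   # 1.0: both attributes mismatch
```

Replacing the simple Hamming (0/1 mismatch) count with such a frequency-aware weight is the kind of refinement the IWFKM algorithm relies on.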

Similar to the IWFKM comparison, a hypothesis test between the GIWFKM algorithm and the other algorithms in terms of the ARI is conducted; the results are shown in Table 5. The proposed GIWFKM algorithm exhibits better performance on most of the tested datasets. For better visualization, Fig. 3 presents the results of the statistical test. The proposed GIWFKM algorithm yields significantly better performance than the FKM, WFKM, GFKM, and SBC algorithms on all 6 tested datasets in terms of the ARI. In the comparison between the proposed GIWFKM and MOFC algorithms, the GIWFKM algorithm performs better than the MOFC algorithm on 4 datasets: Voting, Mushroom, Zoo, and Lung. On the two remaining datasets, the MOFC algorithm achieves the best ARI on the Soybean dataset, while there is no difference between the GIWFKM and MOFC algorithms on the Dermatology dataset. Regarding the IWFKM algorithm, the proposed GIWFKM algorithm yields a significant improvement on 5 datasets (i.e., Soybean, Mushroom, Zoo, Lung, and Dermatology), and no significant difference is found between the two algorithms on the Voting dataset. Consequently, the GIWFKM algorithm outperforms all the benchmark algorithms. The GIWFKM algorithm not only inherits the advantages of the IFS and the frequency probability-based distance from the IWFKM algorithm, but also overcomes the drawback of the FKM, WFKM, and IWFKM algorithms, whose clustering results may terminate at a local optimal solution.
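The ARI used throughout these comparisons measures pair-counting agreement between the predicted partition and the ground-truth classes, corrected for chance. A self-contained computation from the contingency table looks like this (the label vectors are toy data, not the paper's):

```python
import numpy as np
from scipy.special import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the pair-counting contingency table."""
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)), dtype=np.int64)
    for i, j in zip(class_idx, cluster_idx):
        table[i, j] += 1
    sum_comb = comb(table, 2).sum()               # pairs in same class and cluster
    sum_rows = comb(table.sum(axis=1), 2).sum()   # pairs in same class
    sum_cols = comb(table.sum(axis=0), 2).sum()   # pairs in same cluster
    n_pairs = comb(len(labels_true), 2)
    expected = sum_rows * sum_cols / n_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_comb - expected) / (max_index - expected)

# ARI is invariant to cluster relabeling: a perfect partition scores 1.0
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```

This invariance to cluster relabeling is what makes the ARI suitable for comparing clustering algorithms whose cluster indices are arbitrary.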


Next, the clustering performance is evaluated based on the CA index. Table 6 displays the experimental results on the various datasets in terms of the CA index. As shown in Table 6, the proposed GIWFKM algorithm achieves the best results on 5 of the 6 tested datasets (i.e., Voting, Mushroom, Zoo, Lung, and Dermatology). On the remaining dataset (i.e., Soybean), the best result is obtained by the MOFC algorithm. As in the ARI comparison, the results on the CA index shown in Table 6 are average values of multiple runs; therefore, hypothesis tests are also needed.
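The CA index is the fraction of correctly assigned instances after the cluster labels are optimally matched to the class labels. One common way to compute it (a sketch, not the authors' code) uses the Hungarian assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """CA: accuracy under the best one-to-one cluster-to-class match."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    k = max(labels_true.max(), labels_pred.max()) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        count[p, t] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(-count)  # maximize matched counts
    return count[row, col].sum() / len(labels_true)

# cluster ids are arbitrary; after matching, 5 of 6 points are correct
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 1]))  # 5/6
```

Like the ARI, the CA is therefore unaffected by how the algorithm happens to number its clusters.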



The statistical tests in terms of the CA are conducted in the same manner as those for the ARI. The IWFKM algorithm is first compared with the FKM, WFKM, GFKM, SBC, and MOFC algorithms to examine the contribution of the IFS and the new similarity measure. Then, the proposed GIWFKM algorithm is compared with the other algorithms. Table 7 and Table 8 are the sign tables exhibiting the statistical results of these comparisons in terms of the CA index. As shown in Table 7, the IWFKM algorithm performs better than the FKM and WFKM algorithms: it yields significantly better results on 3 datasets (i.e., Soybean, Mushroom, and Zoo) compared with the FKM algorithm and on 4 datasets (i.e., Voting, Mushroom, Zoo, and Dermatology) compared with the WFKM algorithm. However, the proposed IWFKM algorithm only achieves results comparable with the SBC algorithm, since it obtains better results on 3 datasets (i.e., Soybean, Mushroom, and Zoo), worse results on 2 datasets (i.e., Lung and Dermatology), and no significant difference on 1 dataset (i.e., Voting). Regarding the comparison with the GFKM algorithm, the IWFKM algorithm is slightly worse, yielding better results than the GFKM algorithm on only 2 datasets (i.e., Voting and Mushroom). In contrast, the IWFKM algorithm shows no improvement in terms of the CA index over the MOFC algorithm, since its CA results are worse than those of the MOFC algorithm on 5 of the tested datasets. The statistical results of the IWFKM algorithm in terms of the CA are also shown in Fig. 4 for better visualization.




Table 8 shows the result of the statistical hypothesis test between the proposed GIWFKM algorithm and the other algorithms in terms of the CA; the summary results are displayed in Fig. 5. It is not difficult to recognize that the proposed GIWFKM algorithm dominates the other algorithms, since its results are better than those of the benchmark algorithms on most of the tested datasets. No significant difference is found between the proposed GIWFKM and MOFC algorithms on the Soybean and Dermatology datasets, nor between the GIWFKM and GFKM algorithms on the Soybean dataset.



AN US

In summary, the proposed IWFKM, which takes the advantage of IFS and the new distance metric (frequency probability-based distance) for categorical data, can obtain better results than some existing clustering algorithms such as the FKM and WFKM algorithm. However, the IWFKM algorithm still suffers on terminating at a local optimal solution. Its performances may not have a significant improvement compared with those of the GFKM,

M

SBC, and MOFC algorithms. Therefore, the proposed GIWFKM algorithm is necessary and expected to obtain a better performance since it not only employs GA to obtain the global

ED

optimal solution, but also selects the crucial features to perform clustering. The experimental results on the UCI datasets and the comparison of the proposed GIWFKM algorithm and the

5. Conclusion

PT

benchmark algorithms confirm the achievement of the proposed GIWFKM algorithm.

CE

First, the proposed IWFKM algorithm, which integrates the IFS and WFKM algorithm, is investigated experimentally in this study. The proposed IWFKM algorithm provides some

AC

novel enhancements, for instance, employing the IFS to improve clustering result, considering each categorical attribute differently according to the weight vector, and using the frequency probability-based distance metric to estimate the distance between data instances instead of using the Hamming distance. The results conducted on the UCI datasets show that the IWFKM algorithm obtains a better performance than that of FKM and WFKM algorithms in terms of the ARI and CA. However, the performance of IWFKM algorithm cannot achieve a better result than that of the GFKM, SBC, and MOFC algorithms since it

21

ACCEPTED MANUSCRIPT still suffers from the major drawback of some existing categorical data clustering algorithms, i.e., the clustering result may terminate at a local optimal solution. Consequently, the second algorithm-GIWFKM, which combined the IWFKM and GA, is proposed. The GIWFKM algorithm uses the CDI as the fitness value in the GA procedure. Moreover, the GIWFKM algorithm employs the updating rules of the IWFKM algorithm in crossover and mutation process. The redundant features are also removed prior to

CR IP T

implementing the GA in the proposed GIWFKM algorithm by the feature selection. The experimental results on the UCI datasets show that the proposed GIWFKM algorithm outperforms the FKM, WFKM, IWFKM, GFKM, SBC, and MOFC algorithms in terms of the ARI and CA.
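The GA procedure summarized above (CDI as the fitness value, IWFKM updating rules inside the operators) follows the usual generational loop. A generic skeleton is sketched below; the operators and the toy one-dimensional demo are placeholders for illustration, not the paper's actual chromosome encoding or updating rules.

```python
import random

def genetic_clustering(fitness, init_population, crossover, mutate,
                       generations=50, keep=10):
    """Generic elitist GA loop of the kind the GIWFKM algorithm uses.

    fitness: maps a chromosome to a score to minimize
             (the paper uses the CDI on fuzzy membership chromosomes).
    crossover/mutate: abstract operators; in the paper these apply
             the IWFKM updating rules.
    """
    population = init_population()
    for _ in range(generations):
        population.sort(key=fitness)          # lower fitness is better
        parents = population[:keep]           # elitist selection
        children = []
        while len(children) < len(population) - keep:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return min(population, key=fitness)

# toy demo: chromosomes are single numbers, fitness is distance to 3
random.seed(1)
best = genetic_clustering(
    fitness=lambda c: abs(c - 3.0),
    init_population=lambda: [random.uniform(-10, 10) for _ in range(20)],
    crossover=lambda a, b: (a + b) / 2,
    mutate=lambda c: c + random.gauss(0, 0.1),
    generations=40,
)
print(best)   # converges near 3.0
```

Because the elite parents survive each generation, the best fitness value is monotonically non-increasing, which is why such a scheme can escape the local optima that trap the plain alternating-update FKM-style algorithms.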

There are several ways to extend this study in future research. First, the algorithm can consider the situation where the number of clusters is unknown. Second, the IFS can be fully integrated into the GA procedure to improve the clustering result through a novel chromosome design in which each chromosome contains the membership and non-membership degrees simultaneously. Besides, the interval-valued IFS can be employed in an appropriate form for categorical data clustering. Moreover, instead of the GA, the investigated IWFKM algorithm can be combined with other metaheuristic approaches to search for the global optimal solution.

Acknowledgment

This study was financially supported by the Ministry of Science and Technology of the Taiwanese Government under contracts MOST 105-2410-H-011-017-MY3 and MOST 106-2811-H-011-002. This support is gratefully acknowledged.


Biography of the Authors

R. J. Kuo received the M.S. degree in Industrial and Manufacturing Systems Engineering from Iowa State University, Ames, IA, in 1990 and the Ph.D. degree in Industrial and Management Systems Engineering from the Pennsylvania State University, University Park, PA, in 1994. Currently, he is a Distinguished Professor in the Department of Industrial Management at National Taiwan University of Science and Technology, Taiwan. He has published almost 100 papers in international journals, such as Information Sciences, Neural Networks, Decision Support Systems, European Journal of Operational Research, and Applied Soft Computing. His research interests include architecture issues of computational intelligence and their applications in data mining, electronic business, production management, supply chain management, and decision support systems.


Thi Phuong Quyen Nguyen received the B.S. degree in industrial systems engineering from the Hochiminh City University of Technology, Vietnam, in 2008, and the M.S. and Ph.D. degrees in industrial management from the National Taiwan University of Science and Technology, Taiwan, in 2013 and 2016, respectively. She is currently a Postdoctoral Research Fellow with the Department of Industrial Management, National Taiwan University of Science and Technology. Her research interests include data mining, machine learning, and meta-heuristic approaches.


[Flowchart: Start → feature selection using the correlation coefficient → intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, combining the weighted fuzzy k-modes (WFKM) algorithm, the frequency probability-based distance, and the intuitionistic fuzzy set (IFS) → initialize chromosomes by fuzzy membership values → evaluate fitness using the CDI → GA operations (selection, crossover, mutation) using the updating rules of the IWFKM → repeat until the termination criteria are met → End.]

Fig. 1. Algorithm framework of the proposed method.


[Bar chart: for each pairwise comparison (IWFKM vs. FKM, WFKM, GFKM, SBC, MOFC), the number of datasets on which the IWFKM algorithm is better, equal, or worse in terms of ARI.]

Fig. 2. The comparison result of the IWFKM algorithm in terms of ARI.

[Bar chart: for each pairwise comparison (GIWFKM vs. FKM, WFKM, GFKM, SBC, MOFC, IWFKM), the number of datasets on which the GIWFKM algorithm is better, equal, or worse in terms of ARI.]

Fig. 3. The comparison result of the proposed GIWFKM algorithm in terms of ARI.

[Bar chart: for each pairwise comparison (IWFKM vs. FKM, WFKM, GFKM, SBC, MOFC), the number of datasets on which the IWFKM algorithm is better, equal, or worse in terms of CA.]

Fig. 4. The comparison result of the IWFKM algorithm in terms of CA.

[Bar chart: for each pairwise comparison (GIWFKM vs. FKM, WFKM, GFKM, SBC, MOFC, IWFKM), the number of datasets on which the GIWFKM algorithm is better, equal, or worse in terms of CA.]

Fig. 5. The comparison result of the proposed GIWFKM algorithm in terms of CA.

Table 1 Benchmark datasets.

Dataset        # of instances   # of attributes   # of classes
Breast Cancer  286              9                 2
Soybean        47               35                4
Spect Heart    267              22                2
Mushroom       8124             22                2
Voting         435              16                2
Zoo            101              17                7
Tic-tac-toe    958              9                 2
Lymphography   148              18                4
Chess          3196             36                2
Primary        339              17                15
Lung           32               56                3
Dermatology    366              34                6

Table 2 Comparison of the proposed GIWFKM algorithm with FS and without FS.

               CDI              ARI              CA               Time
Dataset        Non-FS  FS       Non-FS  FS       Non-FS  FS       Non-FS  FS
Breast Cancer  0.850   0.818    0.177   0.388    0.503   0.692    134     119
Soybean        0.133   0.188    0.936   0.967    0.960   0.985    109     101
Voting         0.483   0.383    0.268   0.649    0.671   0.910    290     260
Mushroom       0.767   0.657    0.607   0.703    0.750   0.932    8081    6599
Zoo            0.043   0.134    0.815   0.930    0.850   0.927    298     282
Lymphography   0.230   0.218    0.158   0.454    0.434   0.688    243     201
Chess          0.829   0.814    0.087   0.215    0.352   0.550    2405    2332
Primary Tumor  0.802   0.126    0.152   0.213    0.428   0.594    2441    511
Spect          0.686   0.673    0.057   0.106    0.366   0.574    206     119
Tic-tac-toe    0.880   0.865    0.152   0.110    0.663   0.508    509     430
Lung           0.225   0.149    0.105   0.292    0.488   0.696    122     107
Dermatology    0.313   0.113    0.563   0.624    0.774   0.828    1241    543

Table 3 Experimental results on the tested datasets in terms of ARI.

Dataset        FKM     WFKM    GFKM    SBC     MOFC    IWFKM   GIWFKM
Soybean        0.770   0.788   0.893   0.850   1.000   0.719   0.967
Voting         0.481   0.577   0.489   0.564   0.578   0.644   0.649
Mushroom       0.078   0.238   0.334   0.387   0.593   0.376   0.703
Zoo            0.391   0.719   0.735   0.404   0.894   0.711   0.930
Lung           0.167   0.140   0.142   0.216   0.243   0.232   0.292
Dermatology    0.419   0.305   0.536   0.545   0.593   0.391   0.624

Table 4 The result of the statistical test for the IWFKM algorithm in terms of ARI.

Dataset        vs. FKM   vs. WFKM   vs. GFKM   vs. SBC   vs. MOFC
Soybean        -         -          -          -         -
Voting         +         +          +          +         +
Mushroom       +         +          +          =         -
Zoo            +         =          =          +         -
Lung           +         +          +          =         =
Dermatology    =         +          -          -         -
Note: "+": better; "=": equal; "-": worse.

Table 5 The result of the statistical test for the proposed GIWFKM algorithm in terms of ARI.

Dataset        vs. FKM   vs. WFKM   vs. GFKM   vs. SBC   vs. MOFC   vs. IWFKM
Soybean        +         +          +          +         -          +
Voting         +         +          +          +         +          =
Mushroom       +         +          +          +         +          +
Zoo            +         +          +          +         +          +
Lung           +         +          +          +         +          +
Dermatology    +         +          +          +         =          +
Note: "+": better; "=": equal; "-": worse.

Table 6 Experimental results in terms of CA.

Dataset        FKM     WFKM    GFKM    SBC     MOFC    IWFKM   GIWFKM
Soybean        0.766   0.893   0.971   0.936   1.000   0.894   0.985
Voting         0.850   0.820   0.858   0.876   0.881   0.899   0.910
Mushroom       0.640   0.644   0.778   0.798   0.885   0.825   0.932
Zoo            0.733   0.720   0.874   0.579   0.910   0.821   0.927
Lung           0.615   0.580   0.563   0.635   0.639   0.597   0.696
Dermatology    0.702   0.635   0.764   0.793   0.822   0.695   0.828

Table 7 The result of the statistical test for the IWFKM algorithm in terms of CA.

Dataset        vs. FKM   vs. WFKM   vs. GFKM   vs. SBC   vs. MOFC
Soybean        +         =          -          +         -
Voting         =         +          +          =         =
Mushroom       +         +          +          +         -
Zoo            +         +          -          +         -
Lung           =         =          =          -         -
Dermatology    =         +          -          -         -
Note: "+": better; "=": equal; "-": worse.

Table 8 The result of the statistical test for the proposed GIWFKM algorithm in terms of CA.

Dataset        vs. FKM   vs. WFKM   vs. GFKM   vs. SBC   vs. MOFC   vs. IWFKM
Soybean        +         +          =          +         =          +
Voting         +         +          +          +         +          +
Mushroom       +         +          +          +         +          +
Zoo            +         +          +          +         +          +
Lung           +         +          +          +         +          +
Dermatology    +         +          +          +         =          +
Note: "+": better; "=": equal; "-": worse.