Accepted Manuscript
Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data

R.J. Kuo, Thi Phuong Quyen Nguyen

PII: S0925-2312(18)31344-4
DOI: https://doi.org/10.1016/j.neucom.2018.11.016
Reference: NEUCOM 20149
To appear in: Neurocomputing

Received date: 19 March 2018
Revised date: 1 July 2018
Accepted date: 10 November 2018
Please cite this article as: R.J. Kuo, Thi Phuong Quyen Nguyen, Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.11.016
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

- Employ the intuitionistic fuzzy set theory in fuzzy clustering for categorical attributes.
- Use the new similarity measure for categorical data, which is based on the frequency probability-based distance metric, to calculate the dissimilarity measure.
- Consider the importance of each categorical attribute differently by updating the weight for each categorical attribute iteratively in the clustering process.
- Exploit the global optimal solution by genetic algorithm (GA).
- Provide an unsupervised feature selection process to remove the redundant features of the original dataset prior to performing the GA process.
Genetic Intuitionistic Weighted Fuzzy K-modes Algorithm for Categorical Data
R. J. Kuo & Thi Phuong Quyen Nguyen*

Department of Industrial Management, National Taiwan University of Science and Technology,
No. 43, Section 4, Keelung Rd., Da-an District, Taipei City, Taiwan (ROC)
Email: [email protected]

* Corresponding author. Email: [email protected]
Abstract

Data clustering with categorical attributes has been widely used in many real-world applications. Most of the existing clustering algorithms proposed for categorical data face two major drawbacks: termination at a local optimal solution and treating all attributes equally. Thus, this study proposes a novel clustering method, named the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm, based on the conventional fuzzy k-modes algorithm and the genetic algorithm (GA). The proposed method first introduces the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the intuitionistic fuzzy set in the clustering process together with a new similarity measure for categorical data based on the frequency probability-based distance metric. Then, the GIWFKM algorithm, which integrates the IWFKM algorithm and GA, is proposed to obtain the global optimal solution. Moreover, the GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove redundant features, which can both improve the clustering performance and reduce the computational time. To evaluate the clustering results, a series of experiments on different categorical datasets is conducted to compare the performance of the proposed algorithms with that of other benchmark algorithms, including the fuzzy k-modes, weighted fuzzy k-modes, genetic fuzzy k-modes, space structure-based clustering, and many-objective fuzzy centroids clustering algorithms. The experimental results on datasets collected from the UCI machine learning repository show that the GIWFKM algorithm outperforms the other benchmark algorithms in terms of the adjusted Rand index (ARI) and clustering accuracy (CA).
Keywords: Categorical data, fuzzy k-modes, genetic algorithm, intuitionistic fuzzy set, frequency probability-based distance, weighted features.

1. Introduction
Data clustering is an unsupervised learning technique that partitions a given dataset into multiple clusters such that objects in a cluster are similar to each other and distinct from the objects that belong to other clusters [1]. The clustering process aims to reveal the hidden structure of unlabeled data instances in various applications, such as pattern recognition, market research, decision making, medical applications, and so on. In general, clustering algorithms are usually designed for numerical data, for which a standard distance measure can compute the distance between any pair of data instances straightforwardly. Clustering of categorical data has received less attention than that of numerical data because of the challenge
and difficulty inherent in the nature of the data. Categorical attributes lack an inherent order, which makes it difficult to identify a proximity measure between two data objects [2].

The classic approach to categorical data clustering is to extend an existing clustering algorithm for numerical data with a suitable dissimilarity measure that is particular to categorical attributes. For instance, the first conventional algorithm for categorical data, the k-modes algorithm proposed by Huang [3], is an extended version of the k-means algorithm that uses the Hamming distance and the cluster mode to represent the cluster center instead of the Euclidean distance and the cluster mean. Similarly, the fuzzy k-modes algorithm [4] is an extended version of the fuzzy k-means algorithm for categorical data. Thereafter, clustering algorithms for categorical data have received progressively more attention due to the variety of categorical data in real-world problems. These algorithms cover both single and multiple objectives, e.g., ROCK [5], CACTUS [6], COOLCAT [7], LIMBO [8], wk-modes [9], MOGA [10], NSGA-FMC [11], SBC [12], MOFC [13], and so on. However, most of the existing algorithms face two major drawbacks that can reduce the clustering performance: some algorithms consider all attributes equally when calculating the dissimilarity between two objects, while some algorithms may terminate at a local optimal solution.
Recently, the intuitionistic fuzzy set (IFS), first introduced by Atanassov [14] based on fuzzy set theory, has been used in data clustering to enhance the clustering performance. The IFS is a generalization of the fuzzy set and is usually used for handling uncertainty. An IFS is described by three parameters: membership, non-membership, and hesitation degrees. Xu et al. [15] reported a clustering algorithm for IFSs which classifies the IFSs by constructing the association and equivalent association matrices. Xu [16] applied the IFS to hierarchical clustering to deal with uncertain data, based on the distance measure between IFSs and the intuitionistic fuzzy aggregation operator. Similarly, some studies developed clustering techniques by combining the IFS with the fuzzy c-means algorithm, such as the intuitionistic fuzzy c-means algorithm [15] and the intuitionistic fuzzy possibilistic c-means clustering algorithm [17]. Besides, Xu et al. [18] integrated the IFS with spectral clustering to improve the clustering performance as well as to obtain the global optimal solution. The existing methods are generally based on either distance measures or intuitionistic fuzzy information; however, some of them cannot guarantee the global optimal solution [18]. Moreover, they are all designed for numerical datasets.
To overcome the aforementioned drawbacks of the existing algorithms, as well as to exploit the application prospects of the IFS for improving the clustering performance, this study proposes a novel clustering algorithm for categorical data, i.e., the genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm. This algorithm is a combination of the conventional fuzzy k-modes algorithm [4] and the IFS. We first introduce the intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, which employs the IFS in the clustering process. The IWFKM algorithm considers the importance of each attribute differently by updating the weight vector for the categorical attributes in each iteration. In addition, the IWFKM algorithm replaces the Hamming distance with a new similarity measure, the frequency probability-based distance metric, which has been shown to improve the clustering result [19]. Then, the proposed GIWFKM algorithm integrates the IWFKM algorithm and the genetic algorithm (GA) to exploit the global optimal solution. The GA is chosen because it is a search and optimization technique that has been used to solve various problem domains due to its extensive applicability [20]. Moreover, the GA has been applied in many clustering approaches for both numerical and categorical data to improve the clustering performance, e.g., the genetic k-means algorithm [21], genetic fuzzy c-means [22], and genetic fuzzy k-modes (GFKM) [23]. Besides, the proposed GIWFKM algorithm performs unsupervised feature selection based on the correlation coefficient to remove some redundant features and therefore improve the clustering performance and reduce the computational time.
The rest of this paper is organized as follows. Section 2 reviews the related literature, including the fuzzy k-modes algorithm, the weighted fuzzy k-modes algorithm, and IFS theory. The proposed algorithms are introduced in Section 3, while Section 4 presents a series of experiments and results. Finally, the conclusions and future research directions are summarized in Section 5.

2. Literature review

This section first reviews the fuzzy k-modes and weighted fuzzy k-modes algorithms. Then the IFS theory with two generating functions is described.

2.1 Fuzzy k-modes and weighted fuzzy k-modes algorithms

The fuzzy k-modes (FKM) algorithm, investigated by Huang [4], is one of the most popular algorithms for categorical data. Let X be a set of n categorical
objects. Each object x_i is characterized by a set of m categorical attributes, so that x_i = {x_{i1}, x_{i2}, ..., x_{im}}. The FKM algorithm partitions X into k clusters by finding U and Z to minimize the following objective function:

F(U, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ji}^{\alpha} d(x_i, z_j),   (1)

subject to

0 \le u_{ji} \le 1,   1 \le j \le k, 1 \le i \le n,   (2)

\sum_{j=1}^{k} u_{ji} = 1,   1 \le i \le n,   (3)

0 < \sum_{i=1}^{n} u_{ji} < n,   1 \le j \le k,   (4)

where k is a pre-defined number of clusters, \alpha is a fuzziness component, U = (u_{ji}) is a k \times n fuzzy membership matrix, Z = {z_1, z_2, ..., z_k} is the set of cluster modes, and d(x_i, z_j) is the distance between object x_i and its corresponding cluster center z_j. d(x_i, z_j) is measured using the simple matching dissimilarity measure, or Hamming distance, as follows:

d(x_i, z_j) = \sum_{l=1}^{m} \delta(x_{il}, z_{jl}),   (5)

\delta(x_{il}, z_{jl}) = \begin{cases} 0, & \text{if } x_{il} = z_{jl}, \\ 1, & \text{if } x_{il} \neq z_{jl}. \end{cases}   (6)
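As a concrete illustration of Eqs. (5)-(6), the simple matching dissimilarity just counts the attributes on which two objects disagree. A minimal Python sketch (the paper's implementation is in Matlab; the function name here is ours):

```python
def simple_matching(x, z):
    """Hamming distance for categorical objects, Eqs. (5)-(6):
    each attribute contributes 1 if the values differ, 0 otherwise."""
    return sum(1 for xl, zl in zip(x, z) if xl != zl)

# Two objects described by m = 4 categorical attributes;
# they disagree on the 2nd and 4th attributes, so d(x, z) = 2.
x = ["red", "small", "round", "smooth"]
z = ["red", "large", "round", "rough"]
```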
However, the FKM algorithm suffers from several major drawbacks. First, the clustering performance is sensitive to the initial choice of the cluster modes. Next, the clustering process may terminate at a local optimal solution. Moreover, the FKM algorithm considers all attributes equally, although some attributes may not contribute to discriminating the clusters. Therefore, Saha and Das [24] presented a weighted fuzzy k-modes (WFKM) algorithm which uses a weight factor for each categorical attribute. The WFKM algorithm minimizes:
F(U, Z, W) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ji}^{\alpha} \, d^W(x_i, z_j),   (7)

where W = (w_1, w_2, ..., w_m) is a weight vector for the categorical attributes,

d^W(x_i, z_j) = \sum_{l=1}^{m} \delta^W(x_{il}, z_{jl}),   (8)

\delta^W(x_{il}, z_{jl}) = \begin{cases} 0, & \text{if } x_{il} = z_{jl}, \\ w_l^{\beta}, & \text{if } x_{il} \neq z_{jl}, \end{cases}   (9)
where \beta is the weight exponent, which is selected such that \beta \neq 1. If \beta = 0, the WFKM reduces to the conventional FKM algorithm. The procedure of the WFKM algorithm is described as follows:

Step 1: Randomly select cluster modes Z^1, fix the fuzziness value \alpha and the number of iterations T, generate an initial weight vector W^1, and identify the membership matrix U^1 such that the cost function F(U^1, Z^1, W^1) is minimized. Set the iteration counter t = 1.

Step 2: Fix Z^t and W^t and update U^{t+1}. If F(U^{t+1}, Z^t, W^t) = F(U^t, Z^t, W^t), then stop; else go to Step 3.

Step 3: Fix W^t and U^{t+1} and update Z^{t+1}. If F(U^{t+1}, Z^{t+1}, W^t) = F(U^{t+1}, Z^t, W^t), then stop; else go to Step 4.

Step 4: Fix U^{t+1} and Z^{t+1} and update W^{t+1}. If F(U^{t+1}, Z^{t+1}, W^{t+1}) = F(U^{t+1}, Z^{t+1}, W^t) or t = T, then stop; else set t = t + 1 and go to Step 2.
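As a numerical illustration of the weighted dissimilarity in Eqs. (8)-(9): a mismatch on attribute l costs w_l raised to the power \beta rather than a flat 1, so unimportant attributes barely contribute. A hedged Python sketch (function name ours):

```python
def weighted_matching(x, z, w, beta=2.0):
    """Weighted Hamming distance, Eqs. (8)-(9): a mismatch on
    attribute l contributes w_l ** beta instead of 1."""
    return sum(wl ** beta for xl, zl, wl in zip(x, z, w) if xl != zl)

# With w = (0.9, 0.1) and beta = 2, a mismatch on the first (important)
# attribute costs 0.81, while one on the second costs only 0.01.
```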
2.2 Intuitionistic fuzzy sets

Atanassov [14] introduced the concept of the intuitionistic fuzzy set (IFS), which uses membership values and non-membership values to evaluate uncertainty. An IFS is defined as:

A = {(x, u_A(x), v_A(x)) | x \in X},   (10)

where X is a universe of discourse, and u_A(x) \in [0, 1] and v_A(x) \in [0, 1] are the membership and non-membership degrees, with the condition u_A(x) + v_A(x) \le 1 for all x \in X. The degree of hesitation of x to A, \pi_A(x), is defined as:

\pi_A(x) = 1 - u_A(x) - v_A(x),   (11)

where 0 \le \pi_A(x) \le 1 for all x \in X. If \pi_A(x) = 0, the IFS becomes a fuzzy set. On the contrary, the IFS is totally intuitionistic if \pi_A(x) = 1. Therefore, an IFS is completely described by three elements: 1) the membership degree u_A(x), 2) the non-membership degree v_A(x), and 3) the hesitation degree \pi_A(x).

A parametric fuzzy complement is used to construct the IFS, and there are two common generating functions for the intuitionistic fuzzy complement. According to Yager's generating function [25], the IFS is obtained as:

A = {(x, u_A(x), (1 - u_A(x)^\alpha)^{1/\alpha}) | x \in X},   (12)

where \alpha \in (0, \infty) is a control parameter of the non-membership and hesitation degrees. The hesitation degree can then be calculated as:

\pi_A(x) = 1 - u_A(x) - (1 - u_A(x)^\alpha)^{1/\alpha},   (13)

Considering Sugeno's generating function [26], the IFS and the hesitation degree can be written as:

A = {(x, u_A(x), (1 - u_A(x))/(1 + \alpha u_A(x))) | x \in X},   (14)

\pi_A(x) = 1 - u_A(x) - (1 - u_A(x))/(1 + \alpha u_A(x)),   (15)
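The two generating functions can be illustrated numerically. The sketch below (function names ours) computes the hesitation degrees of Eqs. (13) and (15); note that for Yager's complement with \alpha = 1, and for Sugeno's with \alpha = 0, the construction reduces to an ordinary fuzzy set (\pi = 0):

```python
def yager_hesitation(u, alpha):
    """Eq. (13): pi = 1 - u - (1 - u**alpha) ** (1 / alpha)."""
    return 1.0 - u - (1.0 - u ** alpha) ** (1.0 / alpha)

def sugeno_hesitation(u, alpha):
    """Eq. (15): pi = 1 - u - (1 - u) / (1 + alpha * u)."""
    return 1.0 - u - (1.0 - u) / (1.0 + alpha * u)
```

For example, with u = 0.6, Sugeno's function with \alpha = 2 and Yager's with \alpha = 0.5 both yield a positive hesitation degree.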
3. Proposed genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed GIWFKM algorithm is described in this section. We first introduce the IWFKM algorithm, which integrates the IFS with the WFKM algorithm. Moreover, the IWFKM algorithm uses the frequency probability-based distance metric instead of the Hamming distance to calculate the dissimilarity between data instances. The proposed GIWFKM algorithm, which combines the IWFKM algorithm and the GA, is then expected to obtain the global optimal solution of the clustering process. In the proposed GIWFKM algorithm, unsupervised feature selection based on the correlation coefficient is performed prior to the GA procedure. In addition, the proposed GIWFKM algorithm uses the cluster discrimination index (CDI), a clustering indicator based on the average intra-cluster and inter-cluster distances, as the fitness function of the GA. The genetic operators, including crossover and mutation, are implemented using the updating rules of the IWFKM algorithm. Fig. 1 illustrates the algorithm framework of this study. The following sub-sections present the full details of the proposed IWFKM and GIWFKM algorithms.

3.1 The intuitionistic weighted fuzzy k-modes (IWFKM) algorithm
The IWFKM algorithm integrates the IFS into the WFKM algorithm to improve the clustering performance. Herein, the hesitation degree is added to the fuzzy membership degree to obtain the intuitionistic fuzzy membership value. This idea is inspired by the studies of Lin [27] and Shang et al. [28], who also combined the membership degree and hesitation degree in the clustering procedure to obtain the intuitionistic fuzzy membership value, and thereby obtained more accurate results with the fuzzy c-means algorithm. The fuzzy membership degree after adding the hesitation degree becomes:

u*_{ji} = u_{ji} + \pi_{ji},   (16)

Therefore, the objective function of the IWFKM algorithm becomes:

F(U, Z, W) = \sum_{j=1}^{k} \sum_{i=1}^{n} (u*_{ji})^{\alpha} d^W(x_i, z_j),   (17)
Instead of using the Hamming distance with weighted attributes in Eq. (8), this study uses a new distance which computes the proximity between two categorical data instances based on the frequency probability [19]. Herein, the frequency probability-based distance metric for categorical attributes is defined as follows:

d(x_i, z_j) = \sum_{l=1}^{m} \delta(x_{il}, z_{jl}) \, p(x_{il} = z_{jl}),   (18)

where \delta(x_{il}, z_{jl}) is defined in Eq. (5). Given two categorical values x_{il} and z_{jl} of attribute A_l, p(x_{il} = z_{jl}) is the frequency probability that x_{il} and z_{jl} take the same categorical value. p(x_{il} = z_{jl}) is calculated based on the frequency of the situation that x_{il} = z_{jl} in the whole dataset, as follows [19]:

p(x_{il} = z_{jl}) = p(A_l = x_{il} | X) \, p^-(A_l = x_{il} | X) + p(A_l = z_{jl} | X) \, p^-(A_l = z_{jl} | X),   (19)

where

p(A_l = x_{il} | X) = \sigma_{A_l = x_{il}}(X) / \sigma_{A_l \neq Null}(X),   (20)

p^-(A_l = x_{il} | X) = (\sigma_{A_l = x_{il}}(X) - 1) / (\sigma_{A_l \neq Null}(X) - 1),   (21)

In Eq. (19), p(A_l = x_{il} | X) \, p^-(A_l = x_{il} | X) indicates the situation in which both x_{il} and z_{jl} take the value x_{il}; p(A_l = x_{il} | X) is the frequency probability calculated from the frequency of the instances that take value x_{il} for attribute A_l in the given dataset X; and p^-(A_l = x_{il} | X) is the frequency probability that the event x_{il} = z_{jl} (both take the value x_{il}) occurs. Herein, \sigma_{A_l = x_{il}}(X) indicates the number of instances that have the value x_{il} in the whole dataset X. Similarly, p(A_l = z_{jl} | X) \, p^-(A_l = z_{jl} | X) expresses the situation in which both x_{il} and z_{jl} take the value z_{jl}; Eqs. (20) and (21) are also applied to calculate the frequency probability of this situation.
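Eqs. (19)-(21) can be sketched directly from attribute-value counts. The helper below (names and data ours, not from the paper) computes p(x_il = z_jl) for one attribute column:

```python
def freq_prob_same(val_x, val_z, column):
    """Eq. (19): frequency probability that two instances agree on this
    attribute, i.e. both take val_x or both take val_z, from value counts."""
    n = sum(1 for v in column if v is not None)   # sigma_{A_l != Null}(X)

    def p(v):        # Eq. (20)
        return sum(1 for c in column if c == v) / n

    def p_minus(v):  # Eq. (21)
        return (sum(1 for c in column if c == v) - 1) / (n - 1)

    return p(val_x) * p_minus(val_x) + p(val_z) * p_minus(val_z)

# Attribute column where 'a' appears 3 times and 'b' once (n = 4):
# p('a' = 'b') = (3/4)(2/3) + (1/4)(0/3) = 0.5
column = ["a", "a", "a", "b"]
```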
Finally, the distance d^W(x_i, z_j) in the objective function of the IWFKM algorithm is defined as:

d^W(x_i, z_j) = \sum_{l=1}^{m} \delta^W(x_{il}, z_{jl}) \, p(x_{il} = z_{jl}),   (22)

where \delta^W(x_{il}, z_{jl}) is defined in Eq. (9).

Updating rules

The intuitionistic fuzzy membership degree and the cluster modes are updated based on the following formulations:

u*_{ji} = \begin{cases} 1, & \text{if } x_i = z_j, \\ 0, & \text{if } x_i = z_h, h \neq j, \\ \left( \sum_{h=1}^{k} \left[ \frac{d^W(x_i, z_j)}{d^W(x_i, z_h)} \right]^{1/(\alpha - 1)} \right)^{-1}, & \text{otherwise}, \end{cases}   (23)

z_{jl} = a_l^r \in DOM(A_l),   (24)

where

r = \arg\max_{1 \le t \le n_l} \sum_{i, x_{il} = a_l^t} (u*_{ji})^{\alpha},   (25)

so that

\sum_{i, x_{il} = a_l^r} (u*_{ji})^{\alpha} \ge \sum_{i, x_{il} = a_l^t} (u*_{ji})^{\alpha},   (26)

The weight vector is updated as follows:

w_l = \begin{cases} 0, & \text{if } \Delta_l = 0, \\ \left( \sum_{g=1}^{s} \left[ \frac{\Delta_l}{\Delta_g} \right]^{1/(\beta - 1)} \right)^{-1}, & \text{if } \Delta_l \neq 0, \end{cases}   (27)
where s is the number of attributes for which \Delta_l \neq 0. According to the frequency probability-based distance metric, \Delta_l is defined as:

\Delta_l = \sum_{j=1}^{k} \sum_{i=1}^{n} (u*_{ji})^{\alpha} \, \delta(x_{il}, z_{jl}) \, p(x_{il} = z_{jl}),   (28)
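Eqs. (23) and (27) share the same inverse-power form and can be sketched as follows. This is a minimal illustration, assuming the weighted distances d^W(x_i, z_j) and the per-attribute costs \Delta_l have already been computed; the function names are ours:

```python
def update_membership(dists, alpha):
    """Eq. (23): membership of one object given its weighted distances
    d^W(x_i, z_j) to the k cluster modes."""
    if any(d == 0.0 for d in dists):          # object coincides with a mode
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    p = 1.0 / (alpha - 1.0)
    return [1.0 / sum((dj / dh) ** p for dh in dists) for dj in dists]

def update_weights(deltas, beta):
    """Eq. (27): attribute weights from the per-attribute costs of Eq. (28)."""
    p = 1.0 / (beta - 1.0)
    nz = [d for d in deltas if d != 0.0]      # the s attributes with Delta_l != 0
    return [0.0 if dl == 0.0 else 1.0 / sum((dl / dg) ** p for dg in nz)
            for dl in deltas]
```

Both vectors are normalized by construction: the memberships of one object sum to 1 across the clusters, and attributes with a smaller cost \Delta_l receive a larger weight.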
Next, the IWFKM algorithm is integrated with the GA, based on the updating rules of the IWFKM algorithm, to form the proposed GIWFKM algorithm.
3.2 Genetic intuitionistic weighted fuzzy k-modes (GIWFKM) algorithm

The proposed GIWFKM algorithm employs the GA to obtain the global optimal solution for the IWFKM algorithm. Moreover, the proposed GIWFKM algorithm performs feature selection to remove some redundant features and retain the important ones prior to the GA procedure. The details of the proposed GIWFKM algorithm are described in this section.

Feature selection
The correlation coefficient is a simple method to measure the relation between two variables. The correlation coefficient of two variables x and y is calculated as follows:

\rho(x, y) = covariance(x, y) / \sqrt{variance(x) \cdot variance(y)},   (29)

\rho(x, y) = \begin{cases} 1 \text{ or } -1, & x \text{ and } y \text{ are completely correlated}, \\ 0, & x \text{ and } y \text{ are totally uncorrelated}, \end{cases}   (30)

This study uses the correlation coefficient to select which features should be used for clustering. The feature selection (FS) aims to remove the redundant features which are highly correlated with the other features. First, the pairwise correlation coefficients of the given dataset are calculated. Two attributes are completely correlated if the correlation value \rho is 1 or -1, and totally uncorrelated if \rho = 0. Then, the attribute that is most highly correlated with the others is removed each time. The procedure is repeated until the terminal condition is met. In this study, the removed features account for at most 20% of the total number of features.
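The FS step above can be sketched as follows. The paper does not specify how the categorical values are encoded before computing \rho, so this sketch assumes the attribute columns have already been numerically encoded; all function names are ours:

```python
def pearson(xs, ys):
    """Eq. (29): correlation coefficient of two encoded attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    return 0.0 if vx == 0.0 or vy == 0.0 else cov / (vx * vy) ** 0.5

def select_features(columns, max_drop_ratio=0.2):
    """Repeatedly drop the attribute most correlated with the others,
    removing at most 20% of the attributes; returns kept column indices."""
    keep = list(range(len(columns)))
    for _ in range(int(len(columns) * max_drop_ratio)):
        score = {l: sum(abs(pearson(columns[l], columns[g]))
                        for g in keep if g != l) for l in keep}
        keep.remove(max(score, key=score.get))
    return keep
```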
Chromosome representation

A chromosome is a k \times n matrix, where k is the number of clusters and n is the number of data instances [10]. The chromosome can be illustrated as follows:

U = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kn} \end{bmatrix},   (31)

where U is an intuitionistic fuzzy membership matrix defined according to Eq. (16). The initialization process generates the initial population according to the chromosome setting and the population size N.
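Initialization can then be sketched as drawing each column of Eq. (31) at random and normalizing it so that constraint (3) holds. This is a hypothetical helper, not the paper's code:

```python
import random

def random_chromosome(k, n, seed=None):
    """Random k x n membership matrix whose n columns each sum to 1."""
    rng = random.Random(seed)
    cols = []
    for _ in range(n):
        v = [rng.random() for _ in range(k)]
        s = sum(v)
        cols.append([x / s for x in v])
    # transpose: row j holds the memberships of all n objects in cluster j
    return [[cols[i][j] for i in range(n)] for j in range(k)]
```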
Fitness evaluation

The fitness evaluation is an indispensable process for assessing how good the chromosomes are. This study uses the cluster discrimination index (CDI) as the fitness function. The CDI is a criterion that measures clustering performance based on the average intra-cluster and inter-cluster distances. Smaller CDI values indicate better results than larger ones; thus, chromosomes with small CDI values are selected to reproduce the next generation. The CDI is calculated by [19]:

CDI = \frac{1}{k} \sum_{r=1}^{k} \left\{ AAD(C_r) / \sum_{r \neq t} AED(C_r, C_t) \right\},   (32)

The average intra-cluster distance for cluster C_r with n_r data instances is calculated by [29]:

AAD(C_r) = \sum_{x_i \in C_r} \sum_{x_j \in C_r} d(x_i, x_j) / n_r^2,   (33)

The average inter-cluster distance between two clusters C_r and C_t, with n_r and n_t data instances respectively, is formulated as follows [29]:

AED(C_r, C_t) = \sum_{x_i \in C_r} \sum_{x_j \in C_t} d(x_i, x_j) / (n_r n_t),   (34)

Selection process

Selection is a process that picks good chromosomes from the population so that the genetic operators can reproduce the next generation. There are two popular selection methods in the GA, i.e., roulette wheel selection and tournament selection. In this study, roulette wheel selection is used to select the chromosomes for reproduction, since every chromosome has a chance of being picked with a probability determined by its fitness value. Chromosomes with smaller fitness values (smaller CDI) have a higher probability of being picked.
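The fitness function of Eqs. (32)-(34) can be sketched as below, with any categorical distance d plugged in; here we use the simple matching count, and the function names are ours:

```python
def mismatches(x, y):
    """Simple matching distance used as d(., .) in Eqs. (33)-(34)."""
    return sum(a != b for a, b in zip(x, y))

def aad(cluster):
    """Eq. (33): average intra-cluster distance."""
    return sum(mismatches(x, y) for x in cluster for y in cluster) / len(cluster) ** 2

def aed(c_r, c_t):
    """Eq. (34): average inter-cluster distance."""
    return sum(mismatches(x, y) for x in c_r for y in c_t) / (len(c_r) * len(c_t))

def cdi(clusters):
    """Eq. (32): mean over clusters of AAD / (sum of AEDs to the other clusters)."""
    k = len(clusters)
    return sum(aad(c_r) / sum(aed(c_r, c_t) for t, c_t in enumerate(clusters) if t != r)
               for r, c_r in enumerate(clusters)) / k
```

Two perfectly tight, well-separated clusters give CDI = 0, and the index grows as clusters become internally scattered relative to their separation.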
Crossover

The proposed GIWFKM algorithm uses an approach similar to that of the GFKM algorithm, which applies the one-step fuzzy k-modes algorithm in the crossover process. However, the proposed GIWFKM algorithm updates the chromosomes based on the IWFKM algorithm's updating rules. The crossover process is described as follows:

For s = 1 : N

  Each chromosome s is an intuitionistic fuzzy membership matrix U_s as defined in Eq. (16). An initial weight vector W_s is randomly generated.

  Update the cluster modes Ẑ_s according to Eq. (24) with the given U_s and W_s.

  Update the weight vector Ŵ_s according to Eq. (27) with the given Ẑ_s and U_s.

  Update the intuitionistic fuzzy membership matrix Û_s based on Eq. (23) with the given Ẑ_s and Ŵ_s.

  Obtain chromosome s after crossover from the updated Û_s.

End for
Mutation

The mutation process makes a change in each gene of a chromosome with the mutation probability p_m. Due to the constraint on the membership degrees in Eq. (3), a change in one gene leads to a change of all membership degrees of the corresponding object. The mutation process is described as follows:

For s = 1 : N

  Each gene in chromosome s is denoted as a_{ji}, j = 1, ..., k and i = 1, ..., n, which is a membership degree in the U_s matrix as illustrated in Eq. (31).

  For i = 1 : n

    Generate a random number r_i ∈ [0, 1].

    If r_i < p_m

      Change the genes (a_{ji}, j = 1, ..., k) of the corresponding object i by:

        Randomly generating v_{ji} ∈ [0, 1], j = 1, ..., k.

        Calculating â_{ji} = v_{ji} / \sum_{j=1}^{k} v_{ji}.

        Replacing a_{ji} with â_{ji}.

    End if

  End for

End for

Termination condition

The stopping condition is set by the number of generations.
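The mutation loop above can be sketched as follows, assuming the chromosome is stored as a list of k rows as in Eq. (31); the names are ours:

```python
import random

def mutate(U, p_m, seed=None):
    """With probability p_m per object, redraw that object's whole membership
    column and renormalize it so that constraint (3) still holds."""
    rng = random.Random(seed)
    k, n = len(U), len(U[0])
    for i in range(n):
        if rng.random() < p_m:
            v = [rng.random() for _ in range(k)]
            s = sum(v)
            for j in range(k):
                U[j][i] = v[j] / s
    return U
```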
3.3 Time complexity

First, the complexities of the benchmark algorithms are investigated. The time complexity of the classical FKM algorithm is O(kn(m + M)), where M is the total number of attribute values over all m attributes [4]. The WFKM algorithm is computationally more complex, since it needs to update the weight vector in each iteration; thus, the time complexity of the WFKM algorithm is O(kn(2m + M)) [24]. The GA is usually implemented in polynomial time, O(n^2). However, the GFKM algorithm employs the one-step fuzzy k-modes algorithm to update the chromosomes in the genetic operators. Therefore, the GFKM algorithm's time complexity becomes O(n^2 k(m + M)), which is slower than the FKM and WFKM algorithms.

The proposed IWFKM algorithm integrates the IFS with the WFKM algorithm. As shown in Eq. (16), the intuitionistic fuzzy membership function is updated in each iteration, and its complexity becomes O(2n). Therefore, the computational cost of the IWFKM algorithm becomes O(2kn(2m + M)); hence, the time complexity of the IWFKM algorithm is greater than that of the WFKM algorithm. Similarly, the time complexity of the proposed GIWFKM algorithm without feature selection prior to the GA procedure is O(2kn^2(2m + M)), since it combines the IWFKM algorithm and the GA. However, the feature selection removes up to 20% of the redundant categorical attributes. Therefore, the final time complexity of the GIWFKM algorithm becomes O(2kn^2(2m' + M')), where m' and M' are the number of categorical attributes and the total number of categorical values over all attributes after feature selection, respectively. The proposed GIWFKM algorithm is thus expected to be faster than the GFKM algorithm. In the next section, experiments are conducted to compare the clustering results of the proposed IWFKM and GIWFKM algorithms with those of other benchmark algorithms on various categorical datasets.
4. Experimental results

4.1 Datasets and parameter setting

In this study, the experimental datasets are collected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Twelve categorical datasets are selected with a variety of dimensionalities. For instance, the Lung dataset has the largest dimensionality with 56 attributes, while the two smallest, the Breast Cancer and Tic-tac-toe datasets, have only 9 attributes. Table 1 provides a brief description of the datasets used in this study.

<Table 1. Benchmark datasets.>
In addition, several benchmark algorithms are used for comparison with the proposed IWFKM and GIWFKM algorithms. First, the FKM algorithm is selected, since it is the most popular conventional method for categorical data. The WFKM algorithm is included second due to its advantages reported in the literature. In addition, the GFKM algorithm, which uses the GA to obtain the global optimal solution for the FKM algorithm, is counted as a benchmark algorithm. The SBC algorithm, which employs the group structure inherent in a set of categorical instances, is also selected. Finally, the MOFC algorithm, which combines the fuzzy centroids algorithm and genetic operations, is used for comparison with the proposed algorithms. The FKM, WFKM, GFKM, IWFKM, and GIWFKM algorithms were coded in the Matlab programming language and run on an Intel Core i7-3770 CPU with 16GB RAM under the Windows 10 operating system. Each algorithm was run 30 times and the average results were taken. In general, after conducting several experiments with different values of the GA's parameters, the number of generations, population size, crossover rate, and mutation rate were set at 100, 100, 0.8, and
0.1, respectively, for the GA-based approaches. The IFS uses Yager's generating function to update the hesitation degree. Besides, the results of the SBC and MOFC algorithms are adopted from the original papers [12, 13].

To evaluate the clustering performance, two external clustering validation indices are selected, i.e., the adjusted Rand index (ARI) and the clustering accuracy (CA). The ARI measures the agreement between the partitions based on the contingency table. The CA calculates the percentage of correctly classified data instances in the clustering result of the proposed algorithm compared with the pre-determined class labels. The ARI and CA are defined as follows [30, 31]:

ARI(T, C) = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)},   (35)

where T is the pre-determined or true class label, C is the result of the clustering algorithm, and a, b, c, and d are the numbers of pairs of objects that are placed: 1) in the same class in both T and C, 2) in the same class in T but different classes in C, 3) in the same class in C but different classes in T, and 4) in different classes in both T and C, respectively.

CA = \frac{1}{n} \sum_{i=1}^{k} a_i,   (36)

where a_i is the maximum number of objects that have the same class label as the pre-determined class.

4.2 Experimental results
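The two validation indices, Eqs. (35)-(36), can be computed directly from the true and predicted label vectors. A Python sketch implementing the pair-counting form of Eq. (35) and the majority-label accuracy of Eq. (36) (function names ours):

```python
def ari(T, C):
    """Eq. (35): pair counts a (same/same), b (same in T only),
    c (same in C only), d (different in both)."""
    a = b = c = d = 0
    n = len(T)
    for i in range(n):
        for j in range(i + 1, n):
            same_t, same_c = T[i] == T[j], C[i] == C[j]
            if same_t and same_c:
                a += 1
            elif same_t:
                b += 1
            elif same_c:
                c += 1
            else:
                d += 1
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

def clustering_accuracy(T, C):
    """Eq. (36): sum, over clusters, of the count of the majority true label."""
    total = 0
    for cl in set(C):
        members = [T[i] for i in range(len(T)) if C[i] == cl]
        total += max(members.count(lbl) for lbl in set(members))
    return total / len(T)
```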
4.2.1 Evaluation of the effect of FS on the proposed GIWFKM

To evaluate the contribution of FS to the proposed GIWFKM algorithm, the experiment first compares the clustering performance of the proposed GIWFKM with FS and the proposed GIWFKM without FS on the 12 tested datasets. The compared results based on the CDI (i.e., the objective function of the proposed GIWFKM algorithm), ARI, CA, and computational time are shown in Table 2. As shown in Table 2, the proposed GIWFKM algorithm with FS outperforms the one without FS in terms of the three clustering validation indicators: CDI, ARI, and CA. For instance, the proposed GIWFKM algorithm with FS performs better than the one without FS on: 1) 10 datasets (excluding Soybean and Zoo) for the CDI comparison, and 2) 11 datasets (excluding the Tic-tac-toe dataset) for the ARI and CA comparisons. Moreover, the computational time of the proposed algorithm with FS is also shorter than that of the one without FS. Therefore, it can be concluded that the FS contributes to the proposed GIWFKM algorithm in terms of both clustering performance and time complexity.
<Table 2. Comparison of the proposed GIWFKM algorithm with FS and without FS.>

4.2.2 Result comparison with the benchmark algorithms

This section evaluates the proposed GIWFKM algorithm in comparison with the benchmark algorithms in terms of the ARI and CA. Note that 12 datasets are used to conduct the experiments on the proposed GIWFKM algorithm with and without FS in Section 4.2.1. However, only 6 datasets are selected for the comparison with the benchmark algorithms, because these are the mutual datasets that were used to conduct the experiments on both the proposed and the benchmark algorithms.
Table 3 shows the computational results of all algorithms on the 6 tested datasets in terms of the ARI. The proposed GIWFKM algorithm clearly outperforms its rivals, achieving better results on 5 of the 6 tested datasets (i.e., Voting, Mushroom, Zoo, Lung, and Dermatology). For the Soybean dataset, the best ARI is obtained by the MOFC algorithm. However, the results shown in Table 3 are average values over multiple runs. To verify that the differences from the benchmark methods are significant, a hypothesis test with significance level α = 0.05 is conducted. Moreover, this study compares the performance of both the IWFKM and GIWFKM algorithms with the benchmark algorithms. Thus, the IWFKM algorithm is first compared with the FKM, WFKM, GFKM, SBC, and MOFC algorithms to analyze the effect of the IFS on the clustering result. Thereafter, the proposed GIWFKM algorithm is compared with the other algorithms to assess the improvement gained by using GA to pursue a globally optimal solution.
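The "+", "=", and "-" entries reported in Tables 4-8 come from pairwise tests of this kind. The paper does not restate the exact test here, so the sketch below uses Welch's t-test on per-run scores as one plausible choice; the critical value is a hypothetical default that should be matched to the actual run counts:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def compare_runs(scores_a, scores_b, t_crit=2.447):
    """Sign for a comparison table: '+' if A is significantly better than B,
    '-' if significantly worse, '=' otherwise. t_crit is the two-sided
    critical value at alpha = 0.05 (2.447 corresponds to df = 6; adjust
    for the actual degrees of freedom)."""
    t, _ = welch_t(scores_a, scores_b)
    if abs(t) < t_crit:
        return '='
    return '+' if t > 0 else '-'
```

For instance, `compare_runs(ari_giwfkm_runs, ari_fkm_runs)` would yield one cell of a sign table.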
To compare the proposed IWFKM algorithm with the other benchmark algorithms, Table 4 shows the result of the hypothesis test on each dataset. The symbol "+" indicates that the IWFKM algorithm performs better, "=" indicates no significant difference between the two algorithms, and "-" indicates that the IWFKM algorithm performs worse. According to the statistical results in Table 4, the IWFKM algorithm performs worse than all benchmark algorithms on the Soybean dataset in terms of the ARI. Compared with the FKM algorithm, the IWFKM algorithm yields better results on 4 datasets (i.e., Voting, Mushroom, Zoo, and Lung), with no significant difference between the two algorithms on the Dermatology dataset. Similarly, compared with the WFKM algorithm, the IWFKM algorithm performs better on 4 datasets (i.e., Voting, Mushroom, Lung, and Dermatology) and comparably on the Zoo dataset. Against the GFKM and SBC algorithms, the performance of the IWFKM algorithm is comparable: 1) versus the GFKM algorithm, it achieves 3 significantly better results (Voting, Mushroom, and Lung) and 2 worse results (Soybean and Dermatology); and 2) versus the SBC algorithm, it achieves 2 better results (Voting and Zoo), 2 similar results (Mushroom and Lung), and 2 worse results (Soybean and Dermatology). In contrast, the statistical results against the MOFC algorithm are quite different: the IWFKM algorithm performs worse on 4 of the 6 tested datasets (Soybean, Mushroom, Zoo, and Dermatology), better on only one dataset (Voting), and no significant difference is found on the remaining dataset (Lung). Overall, the IWFKM algorithm outperforms the FKM and WFKM algorithms because it exploits the advantages of the IFS and the frequency probability-based distance metric. However, the IWFKM algorithm only achieves results comparable with those of the GFKM and SBC algorithms, and even worse results than the MOFC algorithm, because, like the FKM and WFKM algorithms, it may still terminate at a locally optimal solution. The statistical results are also illustrated in Fig. 2.
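For reference, the ARI used throughout these comparisons measures the chance-corrected agreement between the predicted partition and the ground-truth labels; it can be computed directly from pair counts:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts: 1 = identical partitions (up to relabeling),
    values near 0 = chance-level agreement, and negative values are possible."""
    pairs = Counter(zip(labels_true, labels_pred))
    sum_comb = sum(comb(n, 2) for n in pairs.values())          # agreeing pairs
    sum_rows = sum(comb(n, 2) for n in Counter(labels_true).values())
    sum_cols = sum(comb(n, 2) for n in Counter(labels_pred).values())
    total = comb(len(labels_true), 2)
    expected = sum_rows * sum_cols / total                       # chance term
    max_index = (sum_rows + sum_cols) / 2
    return (sum_comb - expected) / (max_index - expected)
```

Note that a prediction that merely permutes the cluster labels still scores 1.0, which is why ARI (rather than raw label agreement) is used for clustering comparison.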
As in the IWFKM comparison, a hypothesis test between the GIWFKM algorithm and the other algorithms in terms of the ARI is conducted; the results are shown in Table 5. The proposed GIWFKM algorithm exhibits better performance on most of the tested datasets. For better visualization, Fig. 3 presents the results of the statistical test. The proposed GIWFKM algorithm performs significantly better than the FKM, WFKM, GFKM, and SBC algorithms on all 6 tested datasets in terms of the ARI. In the comparison with the MOFC algorithm, the GIWFKM algorithm is better on 4 datasets (Voting, Mushroom, Zoo, and Lung); on the two remaining datasets, the MOFC algorithm achieves the best ARI on Soybean, while there is no significant difference between the two algorithms on Dermatology. Compared with the IWFKM algorithm, the proposed GIWFKM algorithm yields a significant improvement on 5 datasets (Soybean, Mushroom, Zoo, Lung, and Dermatology), with no significant difference on the Voting dataset. Consequently, the GIWFKM algorithm outperforms all the benchmark algorithms. It not only inherits the advantages of the IFS and the frequency probability-based distance from the IWFKM algorithm, but also overcomes the drawback of the FKM, WFKM, and IWFKM algorithms, whose clustering results may terminate at a locally optimal solution.
Next, the clustering performance is evaluated based on the CA index. Table 6 displays the experimental results on the various datasets in terms of the CA index. As shown in Table 6, the proposed GIWFKM algorithm achieves the best results on 5 of the 6 tested datasets (Voting, Mushroom, Zoo, Lung, and Dermatology). On the remaining dataset (Soybean), the best result is achieved by the MOFC algorithm. As in the ARI comparison, the CA results shown in Table 6 are average values over multiple runs, so a hypothesis test is again needed.
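The CA index is the fraction of correctly assigned instances after the predicted clusters are optimally matched, one-to-one, to the true classes. A minimal sketch (exhaustive search over cluster-to-class mappings, which is adequate for the small class counts in these datasets):

```python
from collections import Counter
from itertools import permutations

def clustering_accuracy(labels_true, labels_pred):
    """CA: best fraction of matches over all one-to-one cluster-to-class
    mappings. Exhaustive search over permutations -- fine for small k;
    a Hungarian-algorithm matching would scale better for large k."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    counts = Counter(zip(labels_pred, labels_true))
    best = 0
    for perm in permutations(classes, len(clusters)):
        hits = sum(counts.get((c, perm[i]), 0) for i, c in enumerate(clusters))
        best = max(best, hits)
    return best / len(labels_true)
```

As with ARI, a prediction that only permutes cluster labels still scores a CA of 1.0.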
The statistical tests in terms of the CA follow the same approach as those for the ARI. The IWFKM algorithm is first compared with the FKM, WFKM, GFKM, SBC, and MOFC algorithms to examine the contribution of the IFS and the new similarity measure; then the proposed GIWFKM algorithm is compared with the other algorithms. Tables 7 and 8 are the sign tables exhibiting the statistical results of these comparisons in terms of the CA index. As shown in Table 7, the IWFKM algorithm performs better than the FKM and WFKM algorithms: of the 6 tested datasets, it yields significantly better results on 3 datasets (Soybean, Mushroom, and Zoo) against the FKM algorithm and on 4 datasets (Voting, Mushroom, Zoo, and Dermatology) against the WFKM algorithm. However, the proposed IWFKM algorithm achieves only comparable results against the SBC algorithm, with better results on 3 datasets (Soybean, Mushroom, and Zoo), worse results on 2 datasets (Lung and Dermatology), and no significant difference on 1 dataset (Voting). Against the GFKM algorithm, the IWFKM algorithm is slightly worse, yielding better results on only 2 datasets (Voting and Mushroom). In contrast, the IWFKM algorithm shows no improvement in terms of the CA index over the MOFC algorithm, performing worse on 5 of the tested datasets. The statistical results of the IWFKM algorithm in terms of the CA are also shown in Fig. 4 for better visualization.
Table 8 shows the results of the statistical hypothesis test between the proposed GIWFKM algorithm and the other algorithms in terms of the CA, with a summary displayed in Fig. 5. The proposed GIWFKM algorithm clearly dominates the other algorithms, achieving better results on most of the tested datasets. The only exceptions, where no significant difference is found, are the Soybean and Dermatology datasets in the comparison with the MOFC algorithm, and the Soybean dataset in the comparison with the GFKM algorithm.
In summary, the proposed IWFKM algorithm, which takes advantage of the IFS and the new distance metric (the frequency probability-based distance) for categorical data, obtains better results than some existing clustering algorithms such as the FKM and WFKM algorithms. However, the IWFKM algorithm still risks terminating at a locally optimal solution, and its performance shows no significant improvement over the GFKM, SBC, and MOFC algorithms. The proposed GIWFKM algorithm is therefore necessary and is expected to perform better, since it not only employs GA to pursue a globally optimal solution but also selects the crucial features before clustering. The experimental results on the UCI datasets and the comparison between the proposed GIWFKM algorithm and the benchmark algorithms confirm the achievement of the proposed GIWFKM algorithm.

5. Conclusion
First, the proposed IWFKM algorithm, which integrates the IFS into the WFKM algorithm, is investigated experimentally in this study. The proposed IWFKM algorithm provides several enhancements: it employs the IFS to improve the clustering result, treats each categorical attribute differently according to a weight vector, and uses the frequency probability-based distance metric instead of the Hamming distance to estimate the distance between data instances. The results on the UCI datasets show that the IWFKM algorithm performs better than the FKM and WFKM algorithms in terms of the ARI and CA. However, the IWFKM algorithm cannot outperform the GFKM, SBC, and MOFC algorithms, since it still suffers from the major drawback of some existing categorical data clustering algorithms: the clustering result may terminate at a locally optimal solution.

Consequently, the second algorithm, GIWFKM, which combines the IWFKM algorithm with GA, is proposed. The GIWFKM algorithm uses the CDI as the fitness value in the GA procedure and employs the updating rules of the IWFKM algorithm in the crossover and mutation processes. Redundant features are also removed by feature selection before the GA is executed. The experimental results on the UCI datasets show that the proposed GIWFKM algorithm outperforms the FKM, WFKM, IWFKM, GFKM, SBC, and MOFC algorithms in terms of the ARI and CA.
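To illustrate the intuition behind a frequency probability-based dissimilarity, the sketch below shows a simplified, Goodall-style measure for a single categorical attribute. This is only an illustration of the idea, not the exact metric defined in the paper: a match on a rare category is stronger evidence of similarity than a match on a dominant one, while mismatches cost 1.

```python
from collections import Counter

def freq_dissimilarity(x, y, column):
    """Illustrative frequency-based dissimilarity for one categorical
    attribute (Goodall-style sketch, not the paper's exact formula):
    matching on a rare category gives a small dissimilarity, matching on a
    dominant category gives a larger one, and mismatches cost 1."""
    if x != y:
        return 1.0
    p = Counter(column)[x] / len(column)  # frequency probability of the match
    return p * p  # rare match -> near 0, dominant match -> near 1
```

For a column of nine 'a' values and one 'b', matching on 'b' (dissimilarity 0.01) counts as far more similar than matching on 'a' (0.81), unlike the Hamming distance, which treats both matches identically.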
There are several ways to expand this study in future research. First, the algorithm can address the situation in which the number of clusters is unknown. Second, the IFS can be fully integrated into the GA procedure through a novel chromosome design in which each chromosome carries membership and non-membership degrees simultaneously. In addition, the interval-valued IFS can be employed in an appropriate form for categorical data clustering. Finally, instead of the GA, the investigated IWFKM algorithm can be combined with other metaheuristic approaches to pursue a globally optimal solution.

Acknowledgment

This study was financially supported by the Ministry of Science and Technology of the Taiwanese Government under contracts MOST 105-2410-H-011-017-MY3 and MOST 106-2811-H-011-002. This support is gratefully appreciated.
Biography of the Authors

R. J. Kuo received the MS degree in Industrial and Manufacturing Systems Engineering from Iowa State University, Ames, IA, in 1990 and the PhD degree in Industrial and Management Systems Engineering from the Pennsylvania State University, University Park, PA, in 1994. Currently, he is a Distinguished Professor in the Department of Industrial Management at National Taiwan University of Science and Technology, Taiwan. He has published almost 100 papers in international journals, such as Information Sciences, Neural Networks, Decision Support Systems, European Journal of Operational Research, and Applied Soft Computing. His research interests include architecture issues of computational intelligence and their applications in data mining, electronic business, production management, supply chain management, and decision support systems.
Thi Phuong Quyen Nguyen received the B.S. degree in industrial systems engineering from the Ho Chi Minh City University of Technology, Vietnam, in 2008, and the M.S. and Ph.D. degrees in industrial management from the National Taiwan University of Science and Technology, Taiwan, in 2013 and 2016, respectively.

She is currently a Postdoctoral Research Fellow with the Department of Industrial Management, National Taiwan University of Science and Technology. Her research interests include data mining, machine learning, and meta-heuristic approaches.
[Fig. 1 flowchart: Start → feature selection using the correlation coefficient → intuitionistic weighted fuzzy k-modes (IWFKM) algorithm, built from the weighted fuzzy k-modes (WFKM) algorithm, the intuitionistic fuzzy set (IFS), and the frequency probability-based distance → initialize chromosomes by fuzzy membership values → evaluate fitness using the CDI → GA operations (selection, crossover, mutation) using the updating rules of the IWFKM → repeat until the termination criteria are met → End.]

Fig. 1. Algorithm framework of the proposed method.
[Bar chart: number of datasets on which the IWFKM algorithm is better than, equal to, or worse than each benchmark algorithm.]

Fig. 2. The comparison result of the IWFKM algorithm in terms of ARI.
[Bar chart: number of datasets on which the GIWFKM algorithm is better than, equal to, or worse than each benchmark algorithm.]

Fig. 3. The comparison result of the proposed GIWFKM algorithm in terms of ARI.
[Bar chart: number of datasets on which the IWFKM algorithm is better than, equal to, or worse than each benchmark algorithm.]

Fig. 4. The comparison result of the IWFKM algorithm in terms of CA.
[Bar chart: number of datasets on which the GIWFKM algorithm is better than, equal to, or worse than each benchmark algorithm.]

Fig. 5. The comparison result of the proposed GIWFKM algorithm in terms of CA.
Table 1 Benchmark datasets.

Dataset          # of instances   # of attributes   # of classes
Breast Cancer          286               9                2
Soybean                 47              35                4
Spect Heart            267              22                2
Mushroom              8124              22                2
Voting                 435              16                2
Zoo                    101              17                7
Tic-tac-toe            958               9                2
Lymphography           148              18                4
Chess                 3196              36                2
Primary Tumor          339              17               15
Lung                    32              56                3
Dermatology            366              34                6
Table 2 Comparison of the proposed GIWFKM algorithm with FS and without FS.

                       CDI               ARI               AC                Time
Dataset           Non-FS   FS       Non-FS   FS       Non-FS   FS       Non-FS   FS
Breast Cancer      0.850   0.818     0.177   0.388     0.503   0.692      134     119
Soybean            0.133   0.188     0.936   0.967     0.960   0.985      109     101
Voting             0.483   0.383     0.268   0.649     0.671   0.910      290     260
Mushroom           0.767   0.657     0.607   0.703     0.750   0.932     8081    6599
Zoo                0.043   0.134     0.815   0.930     0.850   0.927      298     282
Lymphography       0.230   0.218     0.158   0.454     0.434   0.688      243     201
Chess              0.829   0.814     0.087   0.215     0.352   0.550     2405    2332
Primary Tumor      0.802   0.126     0.152   0.213     0.428   0.594     2441     511
Spect              0.686   0.673     0.057   0.106     0.366   0.574      206     119
Tic-tac-toe        0.880   0.865     0.152   0.110     0.663   0.508      509     430
Lung               0.225   0.149     0.105   0.292     0.488   0.696      122     107
Dermatology        0.313   0.113     0.563   0.624     0.774   0.828     1241     543
Table 3 Experimental results on the tested datasets in terms of ARI.

ARI            FKM     WFKM    GFKM    SBC     MOFC    IWFKM   GIWFKM
Soybean        0.770   0.788   0.893   0.850   1.000   0.719   0.967
Voting         0.481   0.577   0.489   0.564   0.578   0.644   0.649
Mushroom       0.078   0.238   0.334   0.387   0.593   0.376   0.703
Zoo            0.391   0.719   0.735   0.404   0.894   0.711   0.930
Lung           0.167   0.140   0.142   0.216   0.243   0.232   0.292
Dermatology    0.419   0.305   0.536   0.545   0.593   0.391   0.624
Table 4 The result of the statistical test for the IWFKM algorithm in terms of ARI.

               IWFKM v.s   IWFKM v.s   IWFKM v.s   IWFKM v.s   IWFKM v.s
               FKM         WFKM        GFKM        SBC         MOFC
Soybean        -           -           -           -           -
Voting         +           +           +           +           +
Mushroom       +           +           +           =           -
Zoo            +           =           =           +           -
Lung           +           +           +           =           =
Dermatology    =           +           -           -           -
Note: "+": better; "=": equal; "-": worse
Table 5 The result of the statistical test for the proposed GIWFKM algorithm in terms of ARI.

               GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s
               FKM         WFKM        GFKM        SBC         MOFC        IWFKM
Soybean        +           +           +           +           -           +
Voting         +           +           +           +           +           =
Mushroom       +           +           +           +           +           +
Zoo            +           +           +           +           +           +
Lung           +           +           +           +           +           +
Dermatology    +           +           +           +           =           +
Note: "+": better; "=": equal; "-": worse
Table 6 Experimental results in terms of CA.

               FKM     WFKM    GFKM    SBC     MOFC    IWFKM   GIWFKM
Soybean        0.766   0.893   0.971   0.936   1.000   0.894   0.985
Voting         0.850   0.820   0.858   0.876   0.881   0.899   0.910
Mushroom       0.640   0.644   0.778   0.798   0.885   0.825   0.932
Zoo            0.733   0.720   0.874   0.579   0.910   0.821   0.927
Lung           0.615   0.580   0.563   0.635   0.639   0.597   0.696
Dermatology    0.702   0.635   0.764   0.793   0.822   0.695   0.828
Table 7 The result of the statistical test for the IWFKM algorithm in terms of CA.

               IWFKM v.s   IWFKM v.s   IWFKM v.s   IWFKM v.s   IWFKM v.s
               FKM         WFKM        GFKM        SBC         MOFC
Soybean        +           =           -           +           -
Voting         =           +           +           =           =
Mushroom       +           +           +           +           -
Zoo            +           +           -           +           -
Lung           =           =           =           -           -
Dermatology    =           +           -           -           -
Note: "+": better; "=": equal; "-": worse
Table 8 The result of the statistical test for the proposed GIWFKM algorithm in terms of CA.

               GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s  GIWFKM v.s
               FKM         WFKM        GFKM        SBC         MOFC        IWFKM
Soybean        +           +           =           +           =           +
Voting         +           +           +           +           +           +
Mushroom       +           +           +           +           +           +
Zoo            +           +           +           +           +           +
Lung           +           +           +           +           +           +
Dermatology    +           +           +           +           =           +
Note: "+": better; "=": equal; "-": worse