Machine learning integrated credibilistic semi supervised clustering for categorical data


Jnanendra Prasad Sarkar (a,c,1), Indrajit Saha (b,*,1), Sinjan Chakraborty (c), Ujjwal Maulik (c)

a Larsen & Toubro Infotech Ltd., Pune, India
b Department of Computer Science and Engineering, National Institute of Technical Teachers' Training and Research, Kolkata, India
c Department of Computer Science and Engineering, Jadavpur University, Kolkata, India

Article history: Received 8 November 2018; Received in revised form 3 October 2019; Accepted 13 October 2019; Available online xxxx

Keywords: Categorical data; Credibilistic clustering; Friedman test; Fuzzy set; Machine learning; Possibilistic measure; Semi supervised clustering; Statistical significance

Abstract

In real life, the availability of correctly labeled data and the handling of categorical data are often acknowledged as two major challenges in pattern analysis. Thus, clustering techniques are employed on unlabeled data to group them according to homogeneity. However, clustering techniques fail to make a decision when data are uncertain, ambiguous, vague, coincidental and overlapping in nature. Hence, in this case, the use of a semi supervised technique can be useful. On the other hand, real life datasets are largely categorical in nature, where a natural ordering of attribute values is missing. This special property of categorical values, together with inherent characteristics like uncertainty, ambiguity and vagueness, makes clustering more complicated than for numerical data. In recent times, the credibilistic measure has shown better performance than fuzzy and possibilistic measures under similar inherent characteristics in numerical data. These facts motivated us to propose a semi supervised clustering technique using the credibilistic measure, with the integration of machine learning techniques, to address the above mentioned challenges of clustering categorical data. This semi supervised technique first clusters the dataset into K subsets with the proposed Credibilistic K-Mode, where the credibilistic measure helps to determine homogeneity by avoiding the coincident clustering problem and finds the points that certainly belong to the clusters. Thereafter, in the second part of the semi supervised technique, the clustered dataset is used to build a supervised model for classification of the remaining unlabeled or uncertain data. This technique not only handles unlabeled data better, but also yields improved results for uncertain or ambiguous data, e.g., when the credibilistic measure of a data point is the same for multiple classes. The results of the proposed technique are demonstrated quantitatively and visually in comparison with widely used state-of-the-art methods for eight synthetic and four real life datasets. Finally, statistical tests have been conducted to judge the statistical significance of the results produced by the proposed technique.

© 2019 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: [email protected] (I. Saha).
1 Equally contributed.

1. Introduction

In recent decades, data scientists have faced two major practical challenges in mining real life data. Generally, in real life, only a handful of observations are available where data are labeled. On the other hand, labeling data is also costly due to limited resources and time. Therefore, unsupervised clustering techniques play an important role in discovering interesting patterns in advanced analytical applications such as pattern recognition [1], medical systems [2], mobile performance marketing [3], customer segmentation [4], etc. However, clustering techniques find it challenging to assign a data point to a cluster in uncertain,

ambiguous, vague, coincidental and overlapping situations. In this regard, a semi supervised clustering technique is suitable. Moreover, most of the literature on data clustering methods focuses mainly on numerical data. This is because the geometric structure of numerical data can easily be exploited using defined distance functions. In the case of categorical data, clustering is challenging due to the special inherent properties of the data within the attributes. For example, the categorical attribute color can take different values like red, green and blue, where a natural ordering is missing. Therefore, the clustering of categorical data has become popular in the research community. Some challenges of clustering categorical data are discussed in [5]. For this reason, traditional algorithms such as K-Means, Fuzzy C-Means and their variants fail to cluster categorical data, because these algorithms compute the mean of the clusters. In recent times, a few methods for clustering categorical data have been proposed in [6–25]. Among those, K-Modes (KMd) [7] and


Fuzzy K-Modes (FKMd) [8] are widely used methods. In [22], Kim et al. improved FKMd by proposing a fuzzy centroid based clustering approach, while in [26], another fuzzy approach was introduced by Umayahara et al. for document classification. Partitioning Around Medoids (PAM), or K-Medoid (KMdd) [6], is also popular for categorical clustering. In recent times, a few more methods have been proposed, which are discussed in Section 2.

Clustering can broadly be divided into two categories: crisp clustering, where each data point belongs to exactly one cluster, and fuzzy clustering, where each data point has some degree of belongingness to all the clusters. Fuzzy clustering is the more popular and is advantageous over traditional crisp clustering because it handles overlapping clusters by using a fuzzy objective function with probabilistic membership values. However, in a noisy environment, it does not always correspond correctly to the intuitive concept of degree of belongingness, due to its probabilistic constraints. To overcome this problem, the Possibilistic Clustering Algorithm (PCA) was proposed in [27,28] and subsequently improved in [29,30]. Although the possibilistic method is widely used, it suffers from a serious problem called the coincident cluster problem, where a dataset having K clusters may, with certain probability, fall into fewer than K clusters under PCA. The details of the coincident problem are discussed in [31]. To address these issues of fuzzy and possibilistic clustering, Liu et al. [32] proposed the credibilistic measure; subsequently, credibilistic clustering was proposed in [33,34] and modified further in [35] using the concept of alternating cluster estimation [36]. The variant of credibilistic clustering in [35] used the same membership function as in [29]. However, all the variants of credibilistic clustering methods [33–35] focus mainly on clustering numerical data, while clustering of categorical data using the credibilistic measure has not yet been done, to the best of our knowledge.

Hence, to address the clustering of categorical data and the problems of uncertainty, ambiguity, vagueness, and the coincidental and overlapping inherent characteristics of categorical data, a semi supervised clustering technique using the credibilistic measure with the integration of machine learning techniques is proposed. In this regard, we first propose the Credibilistic K-Mode (CrKMd) clustering technique, where the credibilistic measure helps to identify certain and uncertain data by solving the coincident problem while performing clustering. However, in some cases, uncertain data may have the same credibility for multiple classes. Thus, CrKMd is integrated separately with various machine learning (ML) techniques such as K-Nearest Neighbor (K-NN) [37], Support Vector Machine (SVM) [38], Artificial Neural Network (ANN) [39], Decision Tree (DT) [40] and Random Forest (RF) [41], referred to as CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT and CrKMd-RF, in order to classify the uncertain data and yield better results. In general, the proposed machine learning integrated semi supervised clustering technique is abbreviated as MLCrKMd. The effectiveness of the proposed technique is demonstrated visually, quantitatively and statistically in comparison with other state-of-the-art methods for eight synthetic and four real life datasets.
In the remainder of the article, Section 2 describes the background, Section 3 explains the proposed technique, Section 4 presents the convergence analysis, Section 5 analyzes the worst case time and space complexities of the proposed technique, Section 6 describes the experimental results and, finally, Section 7 concludes the article.

2. Background

This section briefly describes various categorical clustering algorithms, fuzzy clustering, possibilistic clustering, the credibility measure and machine learning techniques. To explain the various mathematical relationships, we define a set of n categorical

data points as $X = \{x_1, x_2, \ldots, x_n\}$, where each data point $x_i$, $i = 1, 2, \ldots, n$, is described by a set of m attributes $A = \{A_1, A_2, \ldots, A_m\}$. $DOM(A_j)$, $1 \le j \le m$, denotes the domain of the jth attribute and consists of $q_j$ categories, $DOM(A_j) = \{a_j^1, a_j^2, \ldots, a_j^{q_j}\}$. Hence, the ith categorical point is defined as $x_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$, where $x_{ij} \in DOM(A_j)$, $1 \le j \le m$. The K-Modes (KMd) [7] and Fuzzy K-Modes (FKMd) [8] algorithms are the most popular clustering methods in the research community. However, a few other algorithms have also been developed, as explained in [6–24]. The KMd algorithm is inspired by the K-Means [42] paradigm, while the Fuzzy K-Modes [8] algorithm is the extension of the Fuzzy C-Means [43] algorithm for clustering categorical data. KMd uses a matching dissimilarity measure for categorical data and calculates the mode of each cluster using a frequency based method to optimize the objective function defined in Eq. (1):

$$\mathcal{G}(K) = \sum_{l=1}^{K} \sum_{x_i \in C_l,\; 1 \le i \le n} D(c_l, x_i) \tag{1}$$

Here, $D(c_l, x_i)$ denotes the dissimilarity measure between the mode of the cluster, $c_l$, and the data point, $x_i$. The jth attribute value of the mode is computed as the most frequent value of the jth attribute over the data in the lth cluster, $C_l$. If there are multiple most frequent values, one is chosen randomly. The algorithm iterates until $\mathcal{G}(K)$ no longer changes. On the other hand, FKMd partitions the dataset X into K clusters by minimizing $J_F$ in Eq. (2):

$$J_F = \sum_{i=1}^{n} \sum_{l=1}^{K} \mu_{li}^{\eta}\, D(c_l, x_i) \tag{2}$$

where $\eta$ is the fuzzy exponent, $\mu = [\mu_{li}]$ denotes the $K \times n$ fuzzy partition matrix and $\mu_{li}$ denotes the membership degree of the ith categorical data point to the lth cluster. The FKMd algorithm starts by randomly initializing K modes. Subsequently, at every iteration, it finds the fuzzy membership [8] of each data point to every cluster using Eq. (3):

$$\mu_{li} = \frac{1}{\sum_{h=1}^{K} \left( \frac{D(c_l, x_i)}{D(c_h, x_i)} \right)^{\frac{1}{\eta - 1}}}, \quad \text{for } 1 \le l \le K;\; 1 \le i \le n, \tag{3}$$
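For concreteness, the matching dissimilarity and the membership update of Eq. (3) can be written compactly. The following is a minimal NumPy sketch (the paper's experiments were implemented in Matlab, and all function names here are ours); it also applies the zero-distance convention noted next in the text:

```python
import numpy as np

def matching_dissimilarity(c, x):
    """Simple-matching (Hamming) dissimilarity: number of mismatched attributes."""
    return int(np.sum(np.asarray(c) != np.asarray(x)))

def fkmd_memberships(modes, X, eta=2.0):
    """Fuzzy membership matrix of Eq. (3): shape (K, n), columns sum to 1."""
    D = np.array([[matching_dissimilarity(c, x) for x in X] for c in modes],
                 dtype=float)
    K, n = D.shape
    mu = np.zeros((K, n))
    for i in range(n):
        zeros = np.where(D[:, i] == 0)[0]
        if zeros.size:
            # Zero-distance convention (see the note below): full membership
            # to the coinciding mode, zero elsewhere.
            mu[zeros[0], i] = 1.0
        else:
            # ratio[l, h] = (D(c_l, x_i) / D(c_h, x_i)) ** (1 / (eta - 1))
            ratio = (D[:, i][:, None] / D[:, i][None, :]) ** (1.0 / (eta - 1.0))
            mu[:, i] = 1.0 / ratio.sum(axis=1)
    return mu
```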

Note that while computing $\mu_{li}$ using Eq. (3), if $D(c_h, x_i)$ is equal to zero for some h, then $\mu_{li}$ is set to zero for all $l = 1, \ldots, K$, $l \ne h$, while $\mu_{hi}$ is set to one. After each iteration, based on the membership values, the cluster centers (modes) are recomputed as $c_l = [c_{l1}, c_{l2}, \ldots, c_{lm}]$ so as to minimize the objective function in Eq. (2). Here $c_{lj} = a_j^r \in DOM(A_j)$ and

$$\sum_{i,\; x_{ij} = a_j^r} \mu_{li}^{\eta} \;\ge\; \sum_{i,\; x_{ij} = a_j^t} \mu_{li}^{\eta}, \quad 1 \le t \le q_j,\; r \ne t. \tag{4}$$

Here, $q_j$ denotes the number of categorical values of the jth attribute. The algorithm terminates when there is no improvement in the $J_F$ value of Eq. (2). Finally, each data point is assigned to the cluster for which it has the maximum membership.

Subsequently, CACTUS [9], STIRR [10], ROCK [11], Squeezer [13], TSFKMd [14], COOLCAT [12], CLOPE [44], LIMBO [15], CORE [45], TCSOM [16], cdByEnsemble [17] and ANMI [17] were also developed for handling categorical data. These methods do not all work equally well for all kinds of datasets, owing to their own inherent strengths and weaknesses. Even the same method with different parameters produces distinct solutions. It is also observed that these methods fail to work equally well for uncertain, vague and overlapping datasets. In the real world, this is an important issue because a sharp boundary between the clusters is not found. For this purpose, the Min–Min-Roughness (MMR) [18] based clustering


method was proposed. However, MMR also fails to remove outliers, which might affect the size of the clusters. Recently, a mutual information based categorical data clustering (CDC) method, called k-ANMI, was developed in [19]; it is quite similar to the K-Means algorithm. However, this method may also get trapped in a locally optimal solution. Hence, Genetic Algorithm based Average Normalized Mutual Information (G-ANMI) was developed in [20]. The ANMI is computed on a set of given partitions under the assumption that a good combined partition should share as much information as possible. In recent times, a few soft subspace clustering methods [46–48] have been proposed with the goal of discovering a set of subspace clusters. All these subspace clustering methods can eventually be viewed as extensions of KMd: they assign a weight to each categorical attribute to compute the modes. In this process, there is a chance of attributes having biased importance to clusters. To overcome this limitation, a new Subspace Clustering of Categories (SCC) has been proposed in [49]. Moreover, the recently published Partition and merge based fuzzy genetic clustering algorithm (PM-FGCA) [50] for categorical data claims to outperform other multiobjective based methods such as the Multiobjective genetic algorithm (MOGA) [51] and the Non-dominated sorting genetic algorithm with fuzzy membership chromosome (NSGA-FMC) [52]. However, PM-FGCA has the limitations that the number of clusters must be known in advance and that it uses fuzzy theory as the underlying clustering method, which in turn fails to address the coincident problem. Note that G-ANMI has already been found to perform better than the k-ANMI [19], TCSOM [16] and Squeezer [13] methods. Moreover, SCC outperforms the other soft subspace clustering methods [46–48], and PM-FGCA performs better than the rest of the multiobjective based categorical clustering methods [51,52]. Therefore, in this article, the proposed technique (MLCrKMd) is compared with KMd [7], FKMd [8], TSFKMd [14], MMR [18], G-ANMI [20], ccdByEnsemble [17], Possibilistic K-Modes (PKMd) [53], SCC [49] and PM-FGCA [50].

3. Machine learning integrated credibilistic semi supervised clustering


In this section, we explain the fundamentals of the credibility measure, how it is extended to develop Credibilistic K-Modes (CrKMd) clustering for categorical data and, subsequently, how machine learning techniques are integrated to improve the final results generated by CrKMd.

3.1. Mathematical foundation of credibility measure

Zadeh [54] proposed the concept of fuzzy set theory using a membership function, and it has been applied to a wide variety of real problems. However, fuzzy based clustering is sensitive to noise, and the membership value does not always correspond to the degree of belongingness to the correct cluster. To deal with this fact, a possibilistic approach was proposed by Zadeh [55] in order to measure a fuzzy event through a possibility measure. Subsequently, possibility theory and its variants were studied by many scientists. Among those, the most widely used algorithm is Possibilistic C-Means (PCM) [27], where a possibilistic type of membership function is used to compute the degree of belongingness by relaxing the constraint of Fuzzy C-Means (FCM). It therefore allows data to be clustered independently of other clusters. However, this introduced the coincident problem for datasets with close clusters. To handle the limitations of both FCM and PCM, the credibility measure was developed by Liu et al. [32] with the concept of self-duality, which is missing in both the fuzzy and possibilistic measures. For better understanding, a few important mathematical properties of the credibility measure are discussed in this section, which are later used in CrKMd. The necessary and sufficient condition of the credibility measure is described in [56]. Let S be a nonempty set and $\mathcal{P}(S)$ the power set of S, where each element of $\mathcal{P}(S)$ is called an event, represented as A. To define the credibility measure, it is important to assign a number $C\{A\}$ to each event $A \in \mathcal{P}(S)$; $C\{A\}$ indicates the credibility of the occurrence of the event A, and the triplet $(S, \mathcal{P}(S), C)$ is called a credibility space. The following five axioms define the mathematical properties of $C\{A\}$:

• $C\{S\} = 1$;
• C is increasing, i.e., $C\{A\} \le C\{B\}$ whenever $A \subset B$;
• C is self-dual, i.e., $C\{A\} + C\{A^c\} = 1$ for any $A \in \mathcal{P}(S)$;
• $C\{\cup_i A_i\} \wedge 0.5 = \sup_i C\{A_i\}$ for any $\{A_i\}$ with $C\{A_i\} \le 0.5$;
• Let $S_i$ be nonempty sets on which $C_i$ satisfies the first four axioms for $i = 1, 2, \ldots, n$, respectively, and let $S = S_1 \times S_2 \times \cdots \times S_n$; then $C\{(S_1, S_2, \ldots, S_n)\} = C_1\{S_1\} \wedge C_2\{S_2\} \wedge \cdots \wedge C_n\{S_n\}$.

The set function C is called a credibility measure when it satisfies the first four axioms [32]. It is easy to see that $C\{\phi\} = 0$. Therefore, the credibility measure can take any value in the range 0 to 1. The above axioms ensure that the credibility measure is increasing in nature and has the property of self-duality. To establish the relationship among the fuzzy membership value, the possibilistic measure and the credibility measure, let us briefly describe certain mathematical properties. Let X denote the dataset of n data points distributed among K clusters, where $\mu_{li}$, $P_{li}$ and $C_{li}$ indicate the fuzzy membership value, possibilistic measure and credibility measure of the ith point, $x_i \in X$, in the lth cluster. According to the possibility theory proposed in [55], it follows that

$$P_{li} = \mu_{li}, \quad \forall i = 1, 2, \ldots, n \tag{5}$$

Moreover, according to the mathematical rules on possibility theory described in [57], it follows that

$$0 \le P_{li} \le 1, \quad \forall i = 1, 2, \ldots, n \text{ and } \forall l = 1, 2, \ldots, K \tag{6}$$

$$\sup_i \{P_{li}\} = 1, \quad \forall l = 1, 2, \ldots, K \tag{7}$$

Therefore, from Eqs. (5)–(7), it can be derived that

$$0 \le \mu_{li} \le 1, \quad \forall i = 1, 2, \ldots, n \text{ and } \forall l = 1, 2, \ldots, K \tag{8}$$

$$\sup_i \{\mu_{li}\} = 1, \quad \forall l = 1, 2, \ldots, K \tag{9}$$

The triplet $(S, \mathcal{P}(S), P)$ is called a possibility space for an event $A \in \mathcal{P}(S)$, as defined in [58], and the necessity measure of A is defined as

$$N\{A\} = 1 - P\{A^c\} \tag{10}$$

whereas, in the same possibility space, the relationship of the credibility measure for an event $A \in \mathcal{P}(S)$ with the necessity and possibility measures is defined in [32] as

$$C\{A\} = \frac{1}{2}\left(P\{A\} + N\{A\}\right) \tag{11}$$

Hence, from Eqs. (5), (10) and (11) and the mathematical rules on possibility theory described in [33,57], it can be derived that

$$C_{li} = \frac{1}{2}\left(\mu_{li} + 1 - \sup_{k \ne l} \mu_{ki}\right), \quad \forall i, l \tag{12}$$

where

$$\mu_{li} = \frac{\hat{\mu}_{li}}{\sup_k \hat{\mu}_{ki}}, \quad \forall i, l \tag{13}$$

$$\hat{\mu}_{li} = \frac{1}{1 + D(c_l, x_i)}, \quad \forall i, l \tag{14}$$

It is explained and proved in [33] that $\mu_{li}$ has been normalized by Eq. (13) so that it satisfies the condition of Eq. (9).
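To make Eqs. (12)–(14) concrete, the credibility matrix can be computed from a $K \times n$ dissimilarity matrix in a few lines. The following is a minimal NumPy sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def credibility_matrix(D):
    """Credibility matrix of Eq. (12) from a (K, n) dissimilarity matrix D."""
    mu_hat = 1.0 / (1.0 + D)              # Eq. (14): possibilistic-style memberships
    mu = mu_hat / mu_hat.max(axis=0)      # Eq. (13): normalize so sup_l mu_li = 1
    K, _ = mu.shape
    C = np.empty_like(mu)
    for l in range(K):
        rest = np.delete(mu, l, axis=0)   # memberships to every cluster k != l
        C[l] = 0.5 * (mu[l] + 1.0 - rest.max(axis=0))  # Eq. (12)
    return C
```

The properties used in Section 3.2 (e.g., $\sup_l C_{li} \ge 0.5$ for every point and $0 \le C_{li} \le 1$) can be verified numerically on the returned matrix.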


Fig. 1. Block diagram of credibilistic semi supervised clustering with machine learning technique.

3.2. Credibilistic K-Modes

In Credibilistic K-Modes (CrKMd), the above mentioned mathematical properties are used; the steps of CrKMd are outlined diagrammatically in Fig. 1. The proposed algorithm starts with K randomly selected modes ($c_l = [c_{l1}, c_{l2}, \ldots, c_{lm}]$, where $1 \le l \le K$ and $c_l \in X$) from the dataset itself and then computes the credibility matrix $[C_{li}]$ for the entire dataset using Eq. (12), with the following properties [56]:

• $\sup_{1 \le l \le K} C_{li} \ge 0.5, \; \forall i$
• $C_{li} + \sup_{h \ne l} C_{hi} = 1$, for any l, i with $C_{li} \ge 0.5$
• $0 \le C_{li} \le 1, \; \forall l, i$

The credibility measure removes the FCM constraint $\sum_{l=1}^{K} \mu_{li} = 1$, and its self-dual property ensures that the ith point belongs to the lth cluster if $C_{li} = 1$, whereas the point is absolutely outside the cluster if $C_{li} = 0$. On the other hand, it differs from PCM, which does not have the self-dual property; thus, in PCM, a fuzzy event may fail even though its possibility attains 1. This fact creates an ambiguity in PCM, where the data point $x_i$ might not belong to cluster $C_l$ even when the possibilistic measure $P_{li}$ of $x_i$ is 1. Moreover, PCM considers the possibilistic measure $P_{li}$ only in the cluster l for data point $x_i$, causing coincident clustering. Therefore, the Credibilistic K-Modes algorithm using the credibilistic measure not only ensures the compactness of the clustering, but also overcomes the coincident problem of PCM by considering the credibility with respect to both clusters, i.e., the credibility $C_{li}$ in cluster l as well as the credibility $C_{hi}$ ($l \ne h$) in another cluster h, in order to assign $x_i$ to the proper cluster. The steps for updating the cluster center described in [33,35] are suitable for numeric data and do not fit categorical data. Therefore, in this paper we have adopted a new strategy for updating the cluster mode, mathematically defined in Eq. (15), to propose CrKMd, which is outlined in Algorithm 1. The mode of each cluster $C_l$ is represented by $c_l$. Using the credibility matrix, the mode is computed iteratively by Eq. (15) in such a way that

the objective function defined in Eq. (16) is optimized. The value of each attribute of the cluster mode is determined as $c_{lj} = a_j^r \in DOM(A_j)$, where

$$r = \arg\max_{1 \le t \le q_j} \left\{ \sum_{1 \le i \le n,\; x_i \in C_l,\; x_{ij} = a_j^t} (C_{li})^{\eta} \right\} \tag{15}$$

Here, $q_j$ is the number of categorical values of attribute $A_j$. The computed mode $c_l$ of cluster $C_l$ need not necessarily be equal to an existing data point in the dataset.

$$J_C = \sum_{i=1}^{n} \sum_{l=1}^{K} C_{li}^{\eta}\, D(c_l, x_i) \tag{16}$$

Algorithm 1 Steps of CrKMd
Input: X, the dataset; K, the number of clusters; ϵ, a small real threshold value, e.g., 0.0001
Output: $[C_{li}]$, where $1 \le l \le K$ and $1 \le i \le n$
1: Select K random data points from the dataset as the K cluster modes
2: repeat
3:   Compute $C_{li}$ for all n data points using Eq. (12)
4:   Compute the new modes using Eq. (15)
5:   Compute the objective, $J_C$, using Eq. (16)
6: until $|Current\ J_C - Previous\ J_C| \le \epsilon$
7: return $[C_{li}]$, where $1 \le l \le K$ and $1 \le i \le n$

The algorithm terminates when there is no significant improvement in $J_C$ compared with its previous value, determined by the difference between the current and previous values of $J_C$ being less than or equal to a threshold value, ϵ. Algorithm 1 describes the steps of CrKMd. The final credibility matrix $[C_{li}]$ returned by Algorithm 1 is used to determine the credibility of a data point $x_i \in X$ for a particular cluster: $C_{li}$ refers to the credibility value of the data point $x_i$ belonging to the lth cluster, where $1 \le l \le K$ and $1 \le i \le n$.
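As an illustration of Algorithm 1, a compact Python sketch of the CrKMd loop is given below. It assumes integer-coded categorical data, reuses the credibility_matrix helper sketched in Section 3.1, and is a simplified reading of the algorithm rather than the authors' Matlab code:

```python
import numpy as np

def crkmd(X, K, eta=2.0, eps=1e-4, seed=0):
    """Sketch of Algorithm 1 for integer-coded categorical data X of shape (n, m).

    Returns the final (K, n) credibility matrix [C_li]."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    modes = X[rng.choice(n, size=K, replace=False)].copy()  # random initial modes
    prev_J = np.inf
    while True:
        # Hamming dissimilarity of every point to every mode, then Eqs. (12)-(14)
        D = np.array([(X != c).sum(axis=1) for c in modes])
        C = credibility_matrix(D)
        labels = C.argmax(axis=0)
        # Mode update, Eq. (15): per attribute, pick the category with the
        # largest credibility-weighted frequency among the cluster's points
        for l in range(K):
            members = np.where(labels == l)[0]
            for j in range(m):
                cats = np.unique(X[:, j])
                weights = [((C[l, members[X[members, j] == a]]) ** eta).sum()
                           for a in cats]
                modes[l, j] = cats[int(np.argmax(weights))]
        J = ((C ** eta) * D).sum()            # objective, Eq. (16)
        if abs(prev_J - J) <= eps:            # termination test of Algorithm 1
            return C
        prev_J = J
```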


3.3. Integration of machine learning techniques with credibilistic K-Modes

In some cases, Algorithm 1 might produce the same credibility value of a data point $x_i \in X$ for multiple classes, making it difficult to determine a unique cluster for that point. In this regard, a particular cluster is selected randomly from those multiple clusters and the point is assigned accordingly. However, this is a limitation of CrKMd. Therefore, motivated to overcome this limitation of CrKMd and to improve the final clustering result, we have investigated further and proposed a strategy to integrate machine learning techniques with CrKMd. For this purpose, various well known machine learning techniques are integrated with CrKMd separately, abbreviated as CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT and CrKMd-RF respectively, while in general the technique is named MLCrKMd. Thereafter, the best result is selected as the final clustering result.

Algorithm 2 describes the steps of machine learning integrated CrKMd (MLCrKMd). It first divides the dataset X into two datasets, referred to as training (certain) and testing (uncertain) data. For this purpose, it uses the credibility matrix $[C_{li}]$ obtained from Algorithm 1. The training data T contains every $x_i \in X$ for which there exists a unique cluster at which it attains the maximum credibility value. However, as described earlier, there are cases where a data point has the same credibility for multiple classes. The data points that satisfy the condition $\sup_{1 \le l \le K} C_{li} > 0.5$ have such a unique maximum credibility value determining a unique cluster. Therefore, these certain data are considered as training data, and the remaining uncertain data are considered as testing data. The testing data $T^{*}$ contains all $x_i \in \{X - T\}$. The training data are used to train each of the machine learning techniques. Thereafter, each trained machine is used separately to predict the cluster labels of the testing data. Thus, each trained machine produces its own label vector for the test dataset. Each label vector is evaluated using a cluster validity index, which is the Minkowski Score (MS) [59] in our case. Based on the best MS value, a label vector is selected to compute the final clustering result. For real-life datasets where cluster labels are unknown, the Xie–Beni (XB) index [60] should be used instead of the Minkowski Score in Algorithm 2.

Algorithm 2 Steps of MLCrKMd
Input: X, the dataset; K, the number of clusters; $[C_{li}]$, the credibility matrix
Output: $T_F$, the final class label vector of X
1: Using Algorithm 1, compute $[C_{li}]$ for $1 \le l \le K$ and $1 \le i \le n$
2: repeat  // Build the training dataset
3:   If $\sup_{1 \le l \le K} C_{li} > 0.5$ for $x_i$, then put $x_i$ into the training data T and the corresponding label into the class label vector $T_L$
4: until $i \le n$
5: for each j in {K-NN, SVM, ANN, DT, RF} do
6:   Classify $T^{*} = (X - T)$ using the jth machine learning technique, trained on T and $T_L$, to get the label vector $T_L^{j}$
7:   Compute $MS_j \leftarrow$ MS value of $T_L^{j}$ against the true class labels
8: end for
9: $best \leftarrow \arg\min_j \{MS_j\}$, $\forall j \in$ {K-NN, SVM, ANN, DT, RF}
10: Select $T_L^{best}$
11: Combine $T_L$ and $T_L^{best}$ to get the final label vector $T_F$, where $T_F$ should be in the order of the data points in X
12: return $T_F$
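For illustration, the core certain/uncertain split of Algorithm 2 and one of the five classifier integrations can be sketched as follows. scikit-learn's RandomForestClassifier is used here as a stand-in (an assumption on our part; the paper's implementation is in Matlab, and in practice all five classifiers are tried and the best label vector is kept by MS, or by XB when true labels are unknown):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mlcrkmd_labels(X, C, n_trees=1000, seed=0):
    """Sketch of Algorithm 2's core step with a single stand-in classifier.

    X : (n, m) encoded categorical data (ordinal or one-hot encoding assumed)
    C : (K, n) credibility matrix returned by CrKMd
    """
    certain = C.max(axis=0) > 0.5      # unique maximum credibility -> training set T
    labels = C.argmax(axis=0).copy()   # tentative labels from CrKMd
    if (~certain).any():
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        clf.fit(X[certain], labels[certain])          # train on certain data T, T_L
        labels[~certain] = clf.predict(X[~certain])   # classify uncertain data T*
    return labels
```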

4. Convergence analysis of CrKMd

The convergence of K-Type algorithms and Fuzzy K-Modes has been explained in [8,61], while a few other techniques for convergence analysis are explained in [62–65]. A concept similar to that in [8,61] has been adopted to derive the convergence of CrKMd. The objective function defined in Eq. (16) is minimized iteratively by updating the cluster modes using Eq. (15). In a single iteration, the credibility value $C_{li}$ of any data point $x_i$ is constant and positive. Therefore, the minimization of Eq. (16) essentially depends on the minimization of the dissimilarity measure $D(c_l, x_i)$, which is defined by the Hamming distance between $c_l$ and $x_i$. Hence, the minimization of $D(c_l, x_i)$ directly depends on the evaluation of the cluster mode $c_l$ in every iteration. Eq. (15) selects the most frequent attribute value over the data points of a cluster, to maximize the similarity in attribute values between the data points and the cluster mode. Therefore, it directly minimizes the value of $D(c_l, x_i)$, and hence the objective function as well. A finite number of combinations of attribute values is possible in the cluster modes, equal to $\prod_{j=1}^{m} q_j$, where $q_j = |DOM(A_j)|$ is the number of attribute values of the jth attribute. Over all iterations, a particular combination of attribute values appears at most once in CrKMd. Let us assume that this is not true. Then the specification of the problem can be stated as follows:

• $it_1$, $it_2$ are two separate iterations.
• $c_l^{it_1}$ and $c_l^{it_2}$ correspond to the cluster modes of the lth cluster in the two iterations, where $c_l^{it_1} = c_l^{it_2}$.
• $J_C^{it_1}$ and $J_C^{it_2}$ are the objective values of those corresponding iterations, where $J_C^{it_1} \ne J_C^{it_2}$; otherwise the method terminates.
• $[C_{li}]^{it_1}$ and $[C_{li}]^{it_2}$ are the credibility matrices of the data points in the lth cluster for $it_1$, $it_2$.

We assumed that $c_l^{it_1} = c_l^{it_2}$. Therefore, the dissimilarity measures of the data points from the cluster mode in the two iterations are also the same. This implies that $[C_{li}]^{it_1} = [C_{li}]^{it_2}$ and, in turn, $J_C^{it_1} = J_C^{it_2}$, which contradicts the original assumption. Therefore, the assumption that a possible mode may appear more than once in CrKMd over all iterations is wrong. Hence, it can be concluded that CrKMd converges when the modes become identical. On the other hand, all the machine learning algorithms used in the various machine learning integrated CrKMd variants classify the data that are not crisply clustered by CrKMd. Such data are finite. Hence, it is safe to conclude that CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT, CrKMd-RF and, finally, MLCrKMd converge in finite time.


5. Complexity analysis

The space and time complexities of MLCrKMd depend on CrKMd integrated with each individual machine learning technique. Therefore, each of the individual techniques, CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT and CrKMd-RF, is analyzed in terms of space and time complexity for n data points having m attributes with K modes.

5.1. Space complexity

Space is required to store the initial dataset, the cluster modes, the distance matrix, the $[\mu_{li}]$ and $[C_{li}]$ matrices, and the training and testing data. Table 1 shows that the overall space complexity of CrKMd is O(2nm + Km + 3Kn). Moreover, the additional space complexities of the different machine learning techniques are given in Table 3. For example, CrKMd-RF needs additional space to maintain τ trees. Therefore, the worst case space complexity of CrKMd-RF is O(2nm + Km + 3Kn + τnm).

Table 1: Space complexity.

| Element to store | Space complexity |
| --- | --- |
| Initial dataset | O(nm) |
| Cluster modes | O(Km) |
| Distance matrix | O(Kn) |
| $[\mu_{li}]$ | O(Kn) |
| $[C_{li}]$ | O(Kn) |
| Training and testing data | O(nm) |
| Overall space complexity | O(2nm + Km + 3Kn) |

5.2. Worst case time complexity

Algorithm 1 is dominated by its inner loop, where each iteration computes the distance matrix, the fuzzy membership matrix, the credibility matrix, the cluster modes and the objective function value to obtain the final credibility matrix. Table 2 shows the time complexity analysis of CrKMd. Here, the total number of attribute values is denoted by $M = \sum_{j=1}^{m} q_j$. Moreover, the additional time required when integrating the machine learning techniques is given in Table 3. For example, the overall time required for CrKMd-SVM is O(nm + KMn + K²n + n³).

Table 2: Worst case time complexity of CrKMd.

| Element to compute | Worst case time complexity |
| --- | --- |
| Distance matrix | O(Kn) |
| Updating cluster modes | O(KMn) |
| Fuzzy membership matrix | O(Kn) |
| Credibility matrix | O(K²n) |
| Objective function value | O(nm) |
| Overall time complexity | O(nm + KMn + 2Kn + K²n) ≈ O(nm + KMn + K²n) |

Table 3: Additional space and time complexities due to the integration of machine learning techniques.

| Algorithm | Additional space complexity | Additional time complexity |
| --- | --- | --- |
| CrKMd-KNN | O(nm) | O(nm) |
| CrKMd-SVM | O(n²) | O(n³) |
| CrKMd-ANN | O(nm + m(m + K)) | O(n) |
| CrKMd-DT | O(nm) | O(log(n)) |
| CrKMd-RF | O(τnm) | O(τnm log(n)) |

6. Experimental results

It has been explained earlier that, due to noisy situations and the coincident clustering problem, fuzzy and possibilistic clustering methods fail to produce good clustering results. Moreover, the categorical nature of the data creates additional complexity in clustering. To address such limitations of the existing methods, MLCrKMd has been proposed in this article. The efficiency of the technique can be evaluated by measuring the quality of the clustering results on various datasets using different cluster validity indices. In order to judge the performance of the proposed technique against state-of-the-art methods, eight synthetic datasets (Cat_100_8_3, Cat_250_15_5, Cat_300_8_3, Cat_300_15_5, Cat_500_20_10, Cat_1000_7_7, Cat_50000_10_2, Cat_100000_10_2) and four real life datasets (Soybean, Zoo, Dermatology, Mushroom) are used. The Minkowski Score (MS) [59], Percentage of Correct Pair (%CP) [23], Adjusted Rand Index (ARI) [67] and Xie–Beni (XB) [60] index are used to evaluate the performance of the clustering techniques. The use of four different measures, with their different ways of computation, gives assurance of how good the clustering solutions are.

6.1. Datasets

Six small and medium-sized synthetic datasets were generated at the datgen portal,² while four real life benchmark datasets are taken from the UCI machine learning repository.³ Although the available datasets of the UCI machine learning repository have a large number of attributes, serving as benchmarks for comparison in categorical clustering research, their sizes are medium. Therefore, two additional large-sized datasets are taken from [66]. Table 5 describes the details of the synthetic datasets, while Table 6 describes the real life datasets.

6.2. Input parameters and performance metrics

The inputs used in CrKMd are the dataset (X), the number of clusters (K) and a small threshold, ϵ = 0.0001. A sensitivity analysis has been performed to select the value of the important parameter η used in Eq. (16) of CrKMd. As found in Table 4, the average XB value is best for η = 2; thus this value is used in our experiments. The other input parameters, the number of trees in CrKMd-RF, the kernel in CrKMd-SVM and the number of neighbors K in CrKMd-KNN, are set to 1000, RBF (Radial Basis Function) and 3, respectively. Moreover, CrKMd-ANN uses one hidden layer, with the number of hidden neurons set to 2/3 of the number of inputs, and resilient backpropagation with weight backtracking to train the neural network. All these parameters are set either experimentally or following the literature. The input parameters for TSFKMd, MMR, G-ANMI and PKMd are the same as those used in [14,17,18,28]. Note that the methods are executed until they converge to a final solution. To validate the performance, we have used the external cluster validity indices Minkowski Score (MS) [59], Percentage of Correct Pair (%CP) [23] and Adjusted Rand Index (ARI) [67], and the internal cluster validity index Xie–Beni (XB) [60]. Lower MS and XB values signify better results. Similarly, a higher ARI value indicates better results, while %CP values of 0% and 100% specify the worst and best results, respectively. The supplementary⁴ of the article provides the details of MS, %CP, ARI, XB, the dissimilarity measure and the visual assessment of tendency (VAT) [68] used to further validate the results. All the algorithms are implemented in Matlab and executed on an Intel Core i5-2410M CPU of 2.30 GHz with 4 GB RAM and the Windows 7 operating system.

Table 4: Average values of XB over 20 runs of CrKMd for different values of η.

| Datasets | η = 1.5 | η = 2 | η = 2.5 |
| --- | --- | --- | --- |
| Cat_100_8_3 | 0.2517 | 0.2173 | 0.2294 |
| Cat_250_15_5 | 0.4196 | 0.3851 | 0.3929 |
| Cat_300_8_3 | 0.4421 | 0.4283 | 0.4388 |
| Cat_300_15_5 | 0.4521 | 0.4004 | 0.4118 |
| Cat_500_20_10 | 0.3943 | 0.3427 | 0.3695 |
| Cat_1000_7_7 | 0.3595 | 0.3175 | 0.3300 |
| Cat_50000_10_2 | 0.6395 | 0.5830 | 0.6029 |
| Cat_100000_10_2 | 0.5639 | 0.5167 | 0.5320 |
| Soybean | 0.2530 | 0.1911 | 0.2099 |
| Zoo | 0.1295 | 0.0905 | 0.1100 |
| Dermatology | 0.8320 | 0.7679 | 0.7826 |
| Mushroom | 0.4693 | 0.4031 | 0.4184 |

² Synthetic datasets were generated at http://www.datgen.com.
³ Real life datasets are taken from http://www.ics.uci.edu/~mlearn/MLRepository.html.
⁴ http://www.nitttrkol.ac.in/indrajit/projects/CatProject/CrKMdsupp.pdf
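The exact definitions of the indices are deferred to the supplementary. For orientation, a minimal NumPy sketch of the Minkowski Score under its standard pairwise definition (an assumption here, since the paper's precise formula is in the supplementary) is:

```python
import numpy as np

def minkowski_score(true_labels, pred_labels):
    """Pairwise Minkowski Score: sqrt((n01 + n10) / (n11 + n10)).

    n11: pairs clustered together in both solutions; n10: together only in
    the true solution; n01: together only in the predicted solution.
    Lower is better; 0 means a perfect match."""
    t = np.asarray(true_labels)
    s = np.asarray(pred_labels)
    iu = np.triu_indices(t.size, k=1)              # distinct pairs only
    t_pair = (t[:, None] == t[None, :])[iu]
    s_pair = (s[:, None] == s[None, :])[iu]
    n11 = np.sum(t_pair & s_pair)
    n10 = np.sum(t_pair & ~s_pair)
    n01 = np.sum(~t_pair & s_pair)
    return float(np.sqrt((n01 + n10) / (n11 + n10)))
```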


Table 5: Specification of the synthetic datasets.

| Datasets | Description |
| --- | --- |
| Cat_100_8_3 | This has a one-layer clustering structure with 8 attributes and 100 points, grouped into 3 clusters. Each cluster has random categorical values selected from {0; 1; 2; 3; 4; 5} in a distinct continuous set of 5 attributes, while the remaining attributes are set to 0. |
| Cat_250_15_5 | The dataset contains 250 points with 15 attributes. The points are clustered into 5 clusters. |
| Cat_300_8_3 | It consists of 300 points, each having 8 attributes. The dataset is grouped into 3 clusters. |
| Cat_300_15_5 | This has 300 points and 15 attributes with 5 clusters. For each cluster, 8 random attributes of the points of that cluster are set to zero, while the remaining attributes have values in the range {0; 1; 2; 3; 4; 5}. |
| Cat_500_20_10 | It consists of 500 points, each having 20 attributes; the dataset is grouped into 10 clusters. |
| Cat_1000_7_7 | This dataset has 1000 points, each having 7 attributes, grouped into 7 clusters. |
| Cat_50000_10_2 | This dataset [66] has 50,000 points, each having 10 attributes, grouped into 2 clusters. |
| Cat_100000_10_2 | This dataset [66] has 100,000 points, each having 10 attributes, grouped into 2 clusters. |

Table 6: Specification of the real life datasets.

| Datasets | Description |
| --- | --- |
| Soybean | The Soybean dataset contains 47 data points on diseases in soybeans. Each data point has 35 categorical attributes and is classified as one of four diseases, so the number of clusters in the dataset is 4. |
| Zoo | The Zoo data consist of 101 instances of animals in a zoo with 17 features. The name of the animal constitutes the first attribute; this attribute is ignored. There are 15 Boolean attributes corresponding to the presence of hair, feathers, eggs, milk, backbone, fins and tail, and to whether the animal is airborne, aquatic, a predator, toothed, breathes, venomous, domestic and catsize. The character attribute corresponds to the number of legs, lying in the set {0; 2; 4; 5; 6; 8}. |
| Dermatology | The Dermatology data contain 34 attributes, 33 of which are linear valued and one nominal. There are 366 instances of differential diagnoses of erythemato-squamous diseases. The attributes consist of clinical and histopathological attributes of the patient. The dataset is distributed among six classes: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis and pityriasis rubra pilaris. |
| Mushroom | The Mushroom data have 8124 instances of mushrooms with 22 attributes to determine whether a mushroom is definitely edible, definitely poisonous, or of unknown edibility and not recommended. |

Fig. 2. Average values of MS for CrKMd integrated ML techniques for different datasets.

6.3. Results

In this section, first the results of CrKMd and, subsequently, the results of MLCrKMd are discussed. For this purpose, eight synthetic and four real life datasets are used.


Table 7: Average values of MS, %CP, ARI and XB for the synthetic datasets.

Cat_100_8_3:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.88202 | 76.18283 | 0.73107 | 0.49857 |
| FKMd | 0.79713 | 79.91267 | 0.78020 | 0.38548 |
| AL | 0.76040 | 81.13110 | 0.79115 | 0.37838 |
| TSFKMd | 0.73736 | 83.11590 | 0.79410 | 0.36647 |
| MMR | 0.55900 | 88.29480 | 0.85610 | 0.36129 |
| G-ANMI | 0.50471 | 89.44180 | 0.86610 | 0.35846 |
| MoDEFCCD | 0.36240 | 94.55710 | 0.94360 | 0.20536 |
| PKMd | 0.39101 | 93.82249 | 0.91160 | 0.22157 |
| SCC | 0.38982 | 94.00623 | 0.92234 | 0.21846 |
| PM-FGCA | 0.38979 | 94.00623 | 0.92235 | 0.21846 |
| CrKMd | 0.38353 | 94.00894 | 0.92385 | 0.21733 |
| MLCrKMd | 0.19512 | 98.69802 | 0.97814 | 0.11057 |

Cat_250_15_5:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.77524 | 80.55590 | 0.78620 | 0.90637 |
| FKMd | 0.62713 | 87.21280 | 0.84370 | 0.61401 |
| AL | 0.64680 | 86.40430 | 0.83010 | 0.62431 |
| TSFKMd | 0.55319 | 88.60240 | 0.86040 | 0.40829 |
| MMR | 0.52321 | 89.09280 | 0.86590 | 0.40254 |
| G-ANMI | 0.49988 | 90.29160 | 0.86670 | 0.39556 |
| MoDEFCCD | 0.34070 | 94.93020 | 0.94520 | 0.30732 |
| PKMd | 0.43137 | 93.01345 | 0.90380 | 0.38911 |
| SCC | 0.43009 | 93.02561 | 0.90450 | 0.38821 |
| PM-FGCA | 0.43008 | 93.02562 | 0.90480 | 0.38821 |
| CrKMd | 0.42690 | 93.60990 | 0.91152 | 0.38508 |
| MLCrKMd | 0.17754 | 98.70707 | 0.97867 | 0.16015 |

Cat_300_8_3:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.77440 | 80.63140 | 0.79000 | 0.92764 |
| FKMd | 0.37300 | 94.27590 | 0.93710 | 0.40619 |
| AL | 0.75890 | 81.31090 | 0.79133 | 0.42616 |
| TSFKMd | 0.34283 | 94.90220 | 0.94380 | 0.38320 |
| MMR | 0.31699 | 96.15320 | 0.96370 | 0.35164 |
| G-ANMI | 0.20322 | 98.10197 | 0.97784 | 0.30339 |
| MoDEFCCD | 0.44170 | 92.02560 | 0.89321 | 0.48354 |
| PKMd | 0.40416 | 93.72930 | 0.91154 | 0.44245 |
| SCC | 0.40183 | 93.73001 | 0.91154 | 0.44196 |
| PM-FGCA | 0.40180 | 93.73001 | 0.91154 | 0.44195 |
| CrKMd | 0.39124 | 93.74056 | 0.91156 | 0.42830 |
| MLCrKMd | 0.10888 | 99.19732 | 0.98030 | 0.11919 |

Cat_300_15_5:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.84465 | 77.51644 | 0.75260 | 0.89795 |
| FKMd | 0.74558 | 82.68290 | 0.79260 | 0.60884 |
| AL | 0.73550 | 83.51650 | 0.79550 | 0.51572 |
| TSFKMd | 0.69895 | 84.76120 | 0.80990 | 0.50332 |
| MMR | 0.64432 | 86.63930 | 0.83160 | 0.49115 |
| G-ANMI | 0.57669 | 87.91220 | 0.84760 | 0.47340 |
| MoDEFCCD | 0.50160 | 89.80470 | 0.86620 | 0.40832 |
| PKMd | 0.68641 | 85.00500 | 0.81558 | 0.55876 |
| SCC | 0.66329 | 86.00300 | 0.82993 | 0.51942 |
| PM-FGCA | 0.68729 | 84.83950 | 0.81431 | 0.51570 |
| CrKMd | 0.49191 | 90.42932 | 0.87001 | 0.40043 |
| MLCrKMd | 0.32400 | 96.13601 | 0.96121 | 0.26374 |

Cat_500_20_10:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.94438 | 75.28247 | 0.71026 | 0.96647 |
| FKMd | 0.81305 | 78.92458 | 0.76100 | 0.71037 |
| AL | 0.81350 | 78.85420 | 0.76090 | 0.65284 |
| TSFKMd | 0.75597 | 81.42737 | 0.79135 | 0.61349 |
| MMR | 0.71532 | 84.20744 | 0.79830 | 0.55395 |
| G-ANMI | 0.67155 | 85.82147 | 0.82670 | 0.36421 |
| MoDEFCCD | 0.48210 | 90.87090 | 0.87122 | 0.38792 |
| PKMd | 0.65582 | 86.31600 | 0.82997 | 0.52770 |
| SCC | 0.62032 | 87.22440 | 0.84530 | 0.49173 |
| PM-FGCA | 0.63483 | 87.19580 | 0.84302 | 0.51629 |
| CrKMd | 0.42590 | 93.71924 | 0.91153 | 0.34270 |
| MLCrKMd | 0.20322 | 98.10197 | 0.97784 | 0.16352 |

Cat_1000_7_7:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.94989 | 74.73140 | 0.71008 | 0.73542 |
| FKMd | 0.81150 | 78.99450 | 0.76210 | 0.69765 |
| AL | 0.80320 | 79.17520 | 0.76621 | 0.69648 |
| TSFKMd | 0.74440 | 82.77160 | 0.79280 | 0.65431 |
| MMR | 0.72522 | 83.55489 | 0.79612 | 0.61277 |
| G-ANMI | 0.66527 | 85.98743 | 0.82991 | 0.59413 |
| MoDEFCCD | 0.37550 | 94.01120 | 0.92751 | 0.32030 |
| PKMd | 0.72522 | 83.55489 | 0.79612 | 0.52511 |
| SCC | 0.58359 | 87.85910 | 0.84757 | 0.48246 |
| PM-FGCA | 0.65829 | 86.30100 | 0.82997 | 0.58393 |
| CrKMd | 0.37222 | 94.35636 | 0.94138 | 0.31751 |
| MLCrKMd | 0.20452 | 98.06146 | 0.97313 | 0.17446 |

Cat_50000_10_2:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.94438 | 75.28247 | 0.71026 | 0.69900 |
| FKMd | 0.81305 | 78.92458 | 0.76100 | 0.68693 |
| AL | 0.81350 | 78.85420 | 0.76090 | 0.69698 |
| TSFKMd | 0.75597 | 81.42737 | 0.79135 | 0.67615 |
| MMR | 0.71532 | 84.20744 | 0.79830 | 0.65688 |
| G-ANMI | 0.67155 | 85.82147 | 0.82670 | 0.64887 |
| MoDEFCCD | 0.48210 | 90.87090 | 0.87122 | 0.63959 |
| PKMd | 0.65582 | 86.31600 | 0.82997 | 0.61950 |
| SCC | 0.64001 | 87.19540 | 0.84298 | 0.61330 |
| PM-FGCA | 0.65295 | 86.32300 | 0.82998 | 0.61664 |
| CrKMd | 0.42590 | 93.71924 | 0.91153 | 0.58302 |
| MLCrKMd | 0.20322 | 98.10197 | 0.97784 | 0.48538 |

Cat_100000_10_2:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.94989 | 74.73140 | 0.71008 | 0.62330 |
| FKMd | 0.81150 | 78.99450 | 0.76210 | 0.61221 |
| AL | 0.80320 | 79.17520 | 0.76621 | 0.62145 |
| TSFKMd | 0.74440 | 82.77160 | 0.79280 | 0.60231 |
| MMR | 0.72522 | 83.55489 | 0.79612 | 0.58460 |
| G-ANMI | 0.66527 | 85.98743 | 0.82991 | 0.57723 |
| MoDEFCCD | 0.37550 | 94.01120 | 0.92751 | 0.56871 |
| PKMd | 0.72522 | 83.55489 | 0.79612 | 0.55024 |
| SCC | 0.61995 | 87.63160 | 0.84546 | 0.53295 |
| PM-FGCA | 0.65053 | 86.33520 | 0.82999 | 0.57799 |
| CrKMd | 0.37222 | 94.35636 | 0.94138 | 0.51672 |
| MLCrKMd | 0.20452 | 98.06146 | 0.97313 | 0.42703 |

Fig. 3. Average values of XB for CrKMd integrated ML techniques for different datasets.

It is observed that, in certain cases, CrKMd produces the same credibility for a data point in multiple clusters. In this regard, a cluster is selected randomly from those multiple clusters for which the data point has the same credibility, and the data point is then assigned to the selected cluster in order to compute the cluster validity indices. Fig. 2 shows the average Minkowski Score (MS) values produced by CrKMd for all the datasets. For example, the average MS values for the datasets Cat_100_8_3, Cat_250_15_5, Cat_300_8_3, Cat_300_15_5, Cat_500_20_10, Cat_1000_7_7, Cat_50000_10_2, Cat_100000_10_2, Soybean, Zoo, Dermatology and Mushroom are 0.38353, 0.42690, 0.39124, 0.49191, 0.42590, 0.37222, 0.42590, 0.37222, 0.20416, 0.20530, 0.58436 and 0.58734, respectively. Similarly, Fig. 3 shows the average Xie–Beni index (XB) of CrKMd for all the datasets. Tables 7 and 8 report the average values of the MS, %CP, ARI and XB scores in comparison with other state-of-the-art methods.


Fig. 4. Boxplot of MS values of different clustering techniques for (a) Cat_100_8_3 (b) Cat_250_15_5 (c) Cat_300_8_3 (d) Cat_300_15_5 (e) Cat_500_20_10 (f) Cat_1000_7_7 (g) Cat_50000_10_2 (h) Cat_100000_10_2 (i) Soybean (j) Zoo (k) Dermatology (l) Mushroom.


Table 8: Average values of MS, %CP, ARI and XB for the real life datasets.

Soybean:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.64363 | 86.79560 | 0.83610 | 0.57577 |
| FKMd | 0.39077 | 93.96750 | 0.92000 | 0.28765 |
| AL | 0.44980 | 91.72570 | 0.89224 | 0.30922 |
| TSFKMd | 0.36806 | 94.45920 | 0.94220 | 0.41140 |
| MMR | 0.33104 | 95.34930 | 0.94820 | 0.37002 |
| G-ANMI | 0.25899 | 96.95324 | 0.96620 | 0.27496 |
| MoDEFCCD | 0.20070 | 98.30110 | 0.97785 | 0.18783 |
| PKMd | 0.29713 | 96.32843 | 0.96462 | 0.27808 |
| SCC | 0.27035 | 96.72941 | 0.96560 | 0.24782 |
| PM-FGCA | 0.27032 | 96.72985 | 0.96560 | 0.23529 |
| CrKMd | 0.20416 | 98.08240 | 0.97363 | 0.19107 |
| MLCrKMd | 0.09323 | 99.57212 | 0.99103 | 0.08725 |

Zoo:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.68839 | 84.82920 | 0.81370 | 0.30697 |
| FKMd | 0.43895 | 92.29820 | 0.89680 | 0.17982 |
| AL | 0.45840 | 91.49410 | 0.88682 | 0.23293 |
| TSFKMd | 0.42744 | 93.18530 | 0.91140 | 0.17439 |
| MMR | 0.39099 | 93.89980 | 0.91160 | 0.17033 |
| G-ANMI | 0.37686 | 94.00950 | 0.92500 | 0.16500 |
| MoDEFCCD | 0.32920 | 95.62740 | 0.95998 | 0.14513 |
| PKMd | 0.38187 | 94.00945 | 0.92491 | 0.16835 |
| SCC | 0.32226 | 96.13783 | 0.96243 | 0.13937 |
| PM-FGCA | 0.39639 | 93.73856 | 0.91155 | 0.17121 |
| CrKMd | 0.20530 | 98.02162 | 0.97308 | 0.09051 |
| MLCrKMd | 0.07228 | 99.59643 | 0.99347 | 0.03187 |

Dermatology:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 1.02256 | 71.08340 | 0.66835 | 1.34380 |
| FKMd | 0.92276 | 75.86546 | 0.72835 | 1.21265 |
| AL | 0.88536 | 76.12840 | 0.73006 | 1.16350 |
| TSFKMd | 0.84113 | 78.12498 | 0.75864 | 1.10537 |
| MMR | 0.73970 | 83.00900 | 0.79390 | 0.97208 |
| G-ANMI | 0.73221 | 83.53410 | 0.79561 | 0.96224 |
| MoDEFCCD | 0.70654 | 84.62750 | 0.80810 | 0.92850 |
| PKMd | 0.67673 | 85.76540 | 0.82657 | 0.88933 |
| SCC | 0.63286 | 87.20030 | 0.84315 | 0.83258 |
| PM-FGCA | 0.65035 | 86.33950 | 0.83000 | 0.87384 |
| CrKMd | 0.58436 | 87.85600 | 0.84755 | 0.76794 |
| MLCrKMd | 0.42847 | 93.03084 | 0.91120 | 0.56308 |

Mushroom:
| Method | MS | %CP | ARI | XB |
| --- | --- | --- | --- | --- |
| KMd | 0.81998 | 78.79645 | 0.76005 | 0.56270 |
| FKMd | 0.69724 | 84.80500 | 0.81120 | 0.47847 |
| AL | 0.71355 | 84.35100 | 0.80160 | 0.48966 |
| TSFKMd | 0.68428 | 85.52400 | 0.81960 | 0.46958 |
| MMR | 0.66143 | 86.01700 | 0.82996 | 0.45390 |
| G-ANMI | 0.65269 | 86.32400 | 0.82998 | 0.44790 |
| MoDEFCCD | 0.64258 | 86.98240 | 0.83920 | 0.44096 |
| PKMd | 0.61429 | 87.70590 | 0.84599 | 0.42155 |
| SCC | 0.60327 | 87.79140 | 0.84713 | 0.41038 |
| PM-FGCA | 0.62394 | 87.21590 | 0.84450 | 0.44006 |
| CrKMd | 0.58734 | 87.83300 | 0.84721 | 0.40305 |
| MLCrKMd | 0.50856 | 89.35645 | 0.86609 | 0.34899 |

Fig. 5. VAT plot of (a) True clusters and the clusters produced by (b) FKMd (c) CrKMd (d) MLCrKMd for Cat_100_8_3 dataset.

Fig. 6. VAT plot of (a) True clusters and the clusters produced by (b) FKMd (c) CrKMd (d) MLCrKMd for Cat_1000_7_7 dataset.

External cluster validity indices measure the quality of a clustering solution against the known cluster labels. On the other hand, an internal clustering metric evaluates a clustering solution in terms of the geometrical properties of the clusters. For all the datasets used in our experiments, the true clusters are known. Thus, we have validated the performance of the methods by evaluating both external and internal cluster validity indices. It is evident from the reported results that CrKMd outperforms the other methods for all the datasets. However, due to the anomaly described above, where CrKMd fails to determine a unique cluster for a data point because of the same credibility for multiple clusters, there is a chance of misclassification. To handle this issue and to improve the final clustering result, various machine learning techniques are integrated with CrKMd, referred to as CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT and CrKMd-RF. In this case, each individual machine learning technique is trained with the data that are uniquely clustered by CrKMd, and the trained machine is used to classify the remaining data for which CrKMd failed to uniquely determine a cluster. Thereafter, based on the computed values of the various cluster validity indices, the best result produced by any one of the machine learning integrated CrKMd techniques is reported as MLCrKMd.


Fig. 7. VAT plot of (a) True clusters and the clusters produced by (b) FKMd (c) CrKMd (d) MLCrKMd for Soybean dataset.

Fig. 8. VAT plot of (a) True clusters and the clusters produced by (b) FKMd (c) CrKMd (d) MLCrKMd for Zoo dataset.

The comparative study of the integrated techniques in terms of average MS and XB values is reported in Figs. 2 and 3, while the average %CP and ARI values are reported in Figs. S1 and S2 of the supplementary material. It has been observed that CrKMd-RF consistently produces better results for the different synthetic and real life datasets used in our experiments. For example, the average MS values of CrKMd, CrKMd-KNN, CrKMd-SVM, CrKMd-ANN, CrKMd-DT and CrKMd-RF for the dataset Cat_100_8_3 are 0.38353, 0.37724, 0.35173, 0.34256, 0.20406 and 0.19512, respectively. It has also been observed that the performance of CrKMd-DT and CrKMd-RF on the Mushroom dataset is the same, while the results of CrKMd-SVM and CrKMd-ANN are close for almost all the datasets. However, CrKMd-RF outperformed the other integrated techniques in our experiments. Subsequently, the result of CrKMd-RF is reported as the final result of MLCrKMd. Similarly, in the real world, if any other integrated technique performs best for a different dataset, then MLCrKMd will report that result accordingly.

The performance of MLCrKMd is compared with other widely used state-of-the-art methods. For this purpose, we have selected popular methods of different kinds: traditional hard clustering (KMd), its fuzzy version (FKMd), hierarchical clustering (AL), Tabu search based FKMd (TSFKMd), a rough set based algorithm (MMR), G-ANMI, Modified DEFCCD (MoDEFCCD), the possibilistic version of KMd (PKMd), Subspace Clustering of Categories (SCC) and the Partition-and-Merge based Fuzzy Genetic Clustering Algorithm (PM-FGCA). Tables 7 and 8 report the average values of the MS, %CP, ARI and XB scores over 20 runs for the eight synthetic and four real life datasets. For example, the average MS values of KMd, FKMd, AL, TSFKMd, MMR, G-ANMI, MoDEFCCD, PKMd, SCC, PM-FGCA and MLCrKMd on Cat_250_15_5 are 0.77524, 0.62713, 0.64680, 0.55319, 0.52321, 0.49988, 0.34070, 0.43137, 0.43009, 0.43008 and 0.17754, respectively. Comparative results have also been demonstrated visually by boxplots as well as VAT plots. Fig. 4 shows the boxplots of the MS values of the various methods. Similarly, VAT plots are shown in Figs. 5–8 for the datasets Cat_100_8_3, Cat_1000_7_7, Soybean and Zoo, while Figs. S3–S8 in the supplementary represent the VAT plots of the other datasets. However, owing to the very large dataset sizes, VAT plots have not been generated for the Cat_50000_10_2 and Cat_100000_10_2 datasets. As a comparative study, the VAT plot of the true cluster labels of each dataset is shown along with the computed cluster labels of FKMd, CrKMd and MLCrKMd. For example, Fig. 5(a), (b), (c) and (d) represent the true cluster labels and the computed cluster labels of FKMd, CrKMd and MLCrKMd for Cat_100_8_3.

Additionally, parametric and non-parametric statistical tests, the paired t-test [69] and the Friedman test [70], have been conducted to judge the statistical significance of the results produced by MLCrKMd. For the paired t-test, MLCrKMd is paired individually with the other ten methods, KMd, FKMd, AL, TSFKMd, MMR, G-ANMI, MoDEFCCD, PKMd, SCC and PM-FGCA, considering the 20 MS values of each method for every dataset. Subsequently, the t-test [69] has been performed at the 5% significance level. Here, the null hypothesis assumes that there is no significant difference between the paired groups of 20 MS values, whereas the alternative hypothesis assumes that there is a significant difference if the p-value is less than 0.05. Moreover, the p-values of the t-test have been adjusted using FDR (False Discovery Rate) [71] and are reported in Table 9. It is evident from Table 9 that, as the p-values are less than 0.05 in all cases, the MS values produced by MLCrKMd are statistically significant and have not occurred by chance. For the Friedman test, the average rank ($R_j$) of all the methods and the chi-square ($\chi^2$) value are computed as follows:

$$R_j = \frac{1}{N} \sum_{i} r_i^j \tag{17}$$

and

$$\chi^2 = \frac{12N}{Q(Q+1)} \left[ \sum_{j} R_j^2 - \frac{Q(Q+1)^2}{4} \right] \tag{18}$$

Here, $r_i^j$ is the rank of the jth method for the ith dataset, where the numbers of datasets and methods are denoted by N and Q, respectively. The Friedman statistic is distributed as $\chi^2$ with $(Q - 1)$ degrees of freedom.
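For reference, both tests are available off the shelf. A minimal SciPy sketch (our own naming; the FDR adjustment of Table 9 can be applied to the collected p-values, e.g. with statsmodels' multipletests using method='fdr_bh') is:

```python
import numpy as np
from scipy.stats import ttest_rel, rankdata

def paired_p_value(ms_mlcrkmd, ms_other):
    """Paired t-test over the 20 MS values of MLCrKMd vs. one competitor;
    the p-values collected over all pairs are then FDR-adjusted as in Table 9."""
    return ttest_rel(ms_mlcrkmd, ms_other).pvalue

def friedman_statistic(scores):
    """Average ranks and chi-square of Eqs. (17)-(18).

    scores : (N datasets, Q methods) array of average MS values; lower MS is
    better, so ranks are taken per dataset with rank 1 for the smallest value."""
    ranks = np.apply_along_axis(rankdata, 1, scores)   # r_i^j; ties -> average rank
    R = ranks.mean(axis=0)                             # Eq. (17)
    N, Q = scores.shape
    chi2 = 12.0 * N / (Q * (Q + 1)) * (np.sum(R ** 2) - Q * (Q + 1) ** 2 / 4.0)  # Eq. (18)
    return R, chi2
```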


Table 9: FDR adjusted p-values on the synthetic and real life datasets, comparing MLCrKMd with the other methods.

| Datasets | KMd | FKMd | AL | TSFKMd | MMR | G-ANMI | MoDEFCCD | PKMd | SCC | PM-FGCA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cat_100_8_3 | 1.13e−18 | 6.72e−21 | 1.38e−24 | 2.14e−37 | 4.27e−21 | 5.05e−28 | 6.78e−26 | 4.24e−27 | 1.42e−24 | 3.01e−29 |
| Cat_250_15_5 | 2.64e−20 | 4.06e−18 | 6.55e−26 | 1.09e−34 | 1.78e−23 | 9.06e−20 | 1.98e−27 | 1.32e−27 | 6.99e−18 | 1.03e−31 |
| Cat_300_8_3 | 3.23e−16 | 1.93e−08 | 1.14e−30 | 2.48e−23 | 3.86e−22 | 6.96e−19 | 4.55e−25 | 3.52e−30 | 9.32e−28 | 4.55e−25 |
| Cat_300_15_5 | 2.22e−21 | 2.63e−19 | 2.21e−27 | 4.33e−37 | 1.55e−25 | 7.81e−20 | 2.07e−23 | 1.86e−26 | 3.61e−22 | 4.33e−37 |
| Cat_500_20_10 | 8.87e−31 | 2.68e−21 | 2.28e−26 | 6.80e−36 | 5.74e−25 | 4.76e−32 | 1.19e−39 | 1.06e−30 | 3.50e−31 | 5.15e−34 |
| Cat_1000_7_7 | 1.07e−31 | 1.85e−22 | 1.70e−27 | 1.07e−35 | 1.87e−26 | 5.00e−28 | 3.83e−37 | 3.29e−33 | 1.52e−26 | 3.87e−34 |
| Cat_50000_10_2 | 2.49e−24 | 2.52e−22 | 1.26e−30 | 2.98e−40 | 2.64e−29 | 8.13e−25 | 4.01e−27 | 2.20e−28 | 2.49e−24 | 7.51e−39 |
| Cat_100000_10_2 | 3.95e−17 | 1.12e−14 | 1.36e−30 | 3.88e−30 | 1.28e−29 | 2.59e−31 | 8.26e−20 | 1.48e−34 | 1.23e−30 | 1.04e−28 |
| Soybean | 4.35e−15 | 1.31e−11 | 1.25e−20 | 2.31e−24 | 2.17e−17 | 4.22e−15 | 3.13e−17 | 7.94e−27 | 1.44e−15 | 6.26e−21 |
| Zoo | 1.69e−14 | 1.98e−14 | 5.58e−27 | 7.18e−33 | 1.70e−25 | 1.91e−24 | 9.98e−86 | 2.46e−25 | 7.01e−23 | 2.72e−32 |
| Dermatology | 6.96e−13 | 9.73e−18 | 9.19e−28 | 2.33e−16 | 6.67e−25 | 4.46e−08 | 7.99e−21 | 5.07e−20 | 1.07e−05 | 1.33e−11 |
| Mushroom | 8.40e−09 | 5.20e−08 | 5.26e−16 | 1.32e−07 | 8.64e−14 | 1.76e−04 | 2.26e−23 | 5.28e−18 | 6.11e−03 | 3.56e−05 |

Table 10
Average rank of the different methods for synthetic and real life datasets. For every entry, the first value indicates the rank and the average MS is given within parentheses.

Datasets        | KMd      | FKMd      | AL        | TSFKMd   | MMR       | G-ANMI    | MoDEFCCD | PKMd      | SCC       | PM-FGCA   | MLCrKMd
----------------|----------|-----------|-----------|----------|-----------|-----------|----------|-----------|-----------|-----------|--------
Cat_100_8_3     | 11(0.88) | 10(0.80)  | 9(0.76)   | 8(0.74)  | 7(0.56)   | 6(0.50)   | 2(0.36)  | 3.5(0.39) | 3.5(0.39) | 3.5(0.39) | 1(0.20)
Cat_250_15_5    | 11(0.78) | 9(0.63)   | 10(0.65)  | 8(0.55)  | 7(0.52)   | 6(0.50)   | 2(0.34)  | 3.5(0.43) | 3.5(0.43) | 3.5(0.43) | 1(0.18)
Cat_300_8_3     | 11(0.77) | 5(0.37)   | 10(0.76)  | 4(0.34)  | 3(0.32)   | 2(0.20)   | 9(0.44)  | 6.5(0.40) | 6.5(0.40) | 6.5(0.40) | 1(0.11)
Cat_300_15_5    | 11(0.84) | 10(0.75)  | 9(0.74)   | 8(0.70)  | 4(0.64)   | 3(0.58)   | 2(0.50)  | 6.5(0.69) | 5(0.66)   | 6.5(0.69) | 1(0.32)
Cat_500_20_10   | 11(0.94) | 9.5(0.81) | 9.5(0.81) | 8(0.76)  | 7(0.72)   | 6(0.67)   | 2(0.48)  | 5(0.66)   | 3(0.62)   | 4(0.63)   | 1(0.20)
Cat_1000_7_7    | 11(0.95) | 10(0.81)  | 9(0.80)   | 8(0.74)  | 7(0.73)   | 6(0.67)   | 2(0.38)  | 4(0.62)   | 3(0.58)   | 5(0.66)   | 1(0.20)
Cat_50000_10_2  | 11(0.94) | 9.5(0.81) | 9.5(0.81) | 8(0.76)  | 7(0.72)   | 6(0.67)   | 2(0.48)  | 5(0.66)   | 3(0.64)   | 4(0.65)   | 1(0.20)
Cat_100000_10_2 | 11(0.95) | 10(0.81)  | 9(0.80)   | 8(0.74)  | 6.5(0.73) | 5(0.67)   | 2(0.38)  | 6.5(0.73) | 3(0.62)   | 4(0.65)   | 1(0.20)
Soybean         | 11(0.64) | 9(0.39)   | 10(0.45)  | 8(0.37)  | 7(0.33)   | 3(0.26)   | 2(0.20)  | 6(0.30)   | 4.5(0.27) | 4.5(0.27) | 1(0.09)
Zoo             | 11(0.69) | 9(0.44)   | 10(0.46)  | 8(0.43)  | 6(0.39)   | 4.5(0.38) | 3(0.33)  | 4.5(0.38) | 2(0.32)   | 7(0.40)   | 1(0.07)
Dermatology     | 11(1.02) | 10(0.92)  | 9(0.89)   | 8(0.84)  | 7(0.74)   | 6(0.73)   | 5(0.71)  | 4(0.68)   | 2(0.63)   | 3(0.65)   | 1(0.43)
Mushroom        | 11(0.82) | 9(0.70)   | 10(0.71)  | 8(0.68)  | 7(0.66)   | 6(0.65)   | 5(0.64)  | 3(0.61)   | 2(0.60)   | 4(0.62)   | 1(0.51)
Average Rank    | 11       | 9.17      | 9.5       | 7.67     | 6.29      | 4.96      | 3.17     | 4.83      | 3.42      | 4.63      | 1
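As a quick arithmetic check on Eq. (18), the χ² statistic reported below can be recomputed directly from the Average Rank row of Table 10. The following is a minimal sketch using NumPy and SciPy, not the authors' original code.

```python
# Recompute the Friedman chi-square of Eq. (18) from the Average Rank row
# of Table 10, where N = 12 datasets and Q = 11 methods.
import numpy as np
from scipy.stats import chi2

R = np.array([11, 9.17, 9.5, 7.67, 6.29, 4.96, 3.17, 4.83, 3.42, 4.63, 1])
N, Q = 12, 11

chi_sq = 12 * N / (Q * (Q + 1)) * (R.dot(R) - Q * (Q + 1) ** 2 / 4)
p_value = chi2.sf(chi_sq, df=Q - 1)  # (Q - 1) = 10 degrees of freedom

print(round(chi_sq, 3))  # 98.012, matching the value reported in the text
print(p_value)           # ~1e-16, the same order as the reported p-value
```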

Here, the null hypothesis states that there is no significant difference among the results produced by the different methods, whereas the alternative hypothesis states the opposite. Using Eqs. (17) and (18), the average rank of each method is computed and reported in Table 10, and χ² is computed as 98.012 with a corresponding p-value of 1.1102E−16 at the 5% significance level. Hence, the null hypothesis is rejected, showing that the results of the different methods differ significantly, while the average ranks suggest that MLCrKMd is superior to the others. However, it has been observed that MLCrKMd needs additional computation power and space compared to fuzzy and possibilistic clustering methods, owing to the computation and storage of the credibilistic measure. This is certainly a limitation of MLCrKMd, and improving it is a scope of our future research.

7. Conclusion

Fuzzy clustering and its variants have been widely used in many real life problems over the last decades. However, fuzzy clustering does not perform correctly in a noisy environment. Possibilistic theory was introduced to overcome this problem of the fuzzy approach; however, the possibilistic approach itself suffers from the coincident clustering problem in the case of close clusters. Moreover, categorical data add further complexity because of the special inherent properties of their attribute values. In addition, the real world challenge of the limited availability of sufficient and correctly labeled data has motivated us to propose a semi supervised clustering technique using the credibilistic measure with the integration of machine learning techniques for categorical data. For this purpose, in this article, we have explained the mathematical properties of the credibilistic measure and how certain properties like self-duality can eliminate the drawbacks of both the fuzzy and possibilistic approaches to clustering. To leverage this benefit of the credibilistic measure, we have first developed the Credibilistic K-Modes (CrKMd) clustering technique for categorical data. The technique generates higher credibility for non-outlier data, whereas it assigns low credibility to outlier data; therefore, it can cluster non-outlier data better than FCM and PCM. Based on the maximum credibility, a data point is assigned to a cluster. However, in some cases data having the same credibility values for multiple clusters still cannot be classified uniquely. To eliminate such anomalies, the semi supervised clustering technique MLCrKMd has been developed, in which CrKMd is integrated with various machine learning techniques to yield a better final result. The performance of MLCrKMd has been demonstrated quantitatively, visually and statistically in comparison with ten existing state-of-the-art methods in terms of various cluster validity indices for eight artificial and four real life categorical datasets. The categorical data clustering results have shown that MLCrKMd is superior to the other methods. We have also described the convergence analysis and the space and worst case time complexities of the developed technique. It has been observed that MLCrKMd requires comparatively more computation power and space. As a future scope of research, we are working to improve the time and space complexity. Moreover, we are also working on handling uncertain data better by extending the credibilistic measure with other mathematical theories like rough sets.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105871.

Acknowledgment

This work was partially supported by grants from the Department of Science and Technology, India (DST/INT/POL/P36/2016).



References

[1] U. Maulik, I. Saha, Modified differential evolution based fuzzy clustering for pixel classification in remote sensing imagery, Pattern Recognit. 42 (9) (2009) 2135–2149.
[2] U. Maulik, Medical image segmentation using genetic algorithms, IEEE Trans. Inf. Technol. Biomed. 13 (2) (2009) 166–173.
[3] S. Silva, P. Cortez, R. Mendes, P.J. Pereira, L.M. Matos, L. Garcia, A categorical clustering of publishers for mobile performance marketing, in: Proceedings of the 13th International Conference on Soft Computing Models in Industrial and Environmental Applications, Vol. 771, pp. 145–154.
[4] D.S. Boone, M. Roehm, Retail segmentation using artificial neural networks, Int. J. Res. Mark. 19 (3) (2002) 287–301.
[5] H.L. Chen, K.T. Chuang, M.S. Chen, On data labeling for clustering categorical data, IEEE Trans. Knowl. Data Eng. 20 (11) (2008) 1458–1472.
[6] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, NY, USA, 1990.
[7] Z. Huang, Extension of k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Discov. 2 (3) (1998) 283–304.
[8] Z. Huang, M.K. Ng, A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst. 7 (4) (1999) 446–452.
[9] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS – clustering categorical data using summaries, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 73–83.
[10] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approach based on dynamical systems, Very Large Data Bases J. 8 (3–4) (2000) 222–236.
[11] S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Inf. Syst. 25 (5) (2000) 345–366.
[12] D. Barbara, Y. Li, J. Couto, COOLCAT: an entropy-based algorithm for categorical clustering, in: Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, pp. 582–589.
[13] Z. He, X. Xu, S. Deng, Squeezer: an efficient algorithm for clustering categorical data, J. Comput. Sci. Technol. 17 (5) (2002) 611–624.
[14] M.K. Ng, J.C. Wong, Clustering categorical data sets using tabu search techniques, Pattern Recognit. 35 (12) (2002) 2783–2790.
[15] P. Andritsos, P. Tsaparas, R.J. Miller, K.C. Sevcik, LIMBO: scalable clustering of categorical data, in: Proceedings of the Ninth International Conference on Extending Database Technology, Vol. 2992, 2004, pp. 123–146.
[16] Z. He, X. Xu, S. Deng, TCSOM: clustering transactions using self-organizing map, Neural Process. Lett. 22 (3) (2005) 249–262.
[17] Z. He, X. Xu, S. Deng, A cluster ensemble method for clustering categorical data, Inf. Fusion 6 (2) (2005) 143–151.
[18] D. Parmar, T. Wu, J. Blackhurst, MMR: an algorithm for clustering categorical data using rough set theory, Data Knowl. Eng. 63 (3) (2007) 879–893.
[19] Z. He, X. Xu, S. Deng, k-ANMI: a mutual information based clustering algorithm for categorical data, Inf. Fusion 9 (2) (2008) 223–233.
[20] S. Deng, Z. He, X. Xu, G-ANMI: a mutual information based genetic clustering algorithm for categorical data, Knowl.-Based Syst. 23 (2) (2010) 144–149.
[21] I. Saha, J.P. Sarkar, U. Maulik, Rough set based fuzzy k-modes for categorical data, in: Swarm, Evolutionary, and Memetic Computing, Vol. 7677, SEMCCO 2012, 2012, pp. 323–330.
[22] D.W. Kim, K.H. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroids, Pattern Recognit. Lett. 25 (11) (2004) 1263–1271.
[23] U. Maulik, S. Bandyopadhyay, I. Saha, Integrating clustering and supervised learning for categorical data analysis, IEEE Trans. Syst. Man Cybern. A 40 (4) (2010) 664–675.
[24] I. Saha, J.P. Sarkar, U. Maulik, Ensemble based rough fuzzy clustering for categorical data, Knowl.-Based Syst. 77 (2015) 114–127.
[25] I. Saha, J.P. Sarkar, U. Maulik, Integrated rough fuzzy clustering for categorical data analysis, Fuzzy Sets and Systems 361 (2019) 1–32.
[26] K. Umayahara, S. Miyamoto, Y. Nakamori, Formulations of fuzzy clustering for categorical data, Int. J. Innov. Comput. Inf. Control 1 (1) (2005) 83–94.
[27] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (2) (1993) 98–110.
[28] R. Krishnapuram, J.M. Keller, The possibilistic c-means algorithm: insights and recommendations, IEEE Trans. Fuzzy Syst. 4 (3) (1996) 385–393.
[29] M.S. Yang, K.L. Wu, Unsupervised possibilistic clustering, Pattern Recognit. 39 (1) (2006) 5–21.
[30] J.P. Sarkar, I. Saha, U. Maulik, Rough possibilistic type-2 fuzzy c-means clustering for MR brain image segmentation, Appl. Soft Comput. 46 (2016) 527–536.
[31] W.C. Tjhi, L. Chen, Possibilistic fuzzy co-clustering of large document collections, Pattern Recognit. 40 (12) (2007) 3452–3466.


[32] B. Liu, Y.K. Liu, Expected value of fuzzy variable and fuzzy expected value models, IEEE Trans. Fuzzy Syst. 10 (4) (2002) 445–450.
[33] J. Zhou, Q. Wang, C.C. Hung, X. Yi, Credibilistic clustering: the model and algorithms, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 23 (4) (2015) 545–564.
[34] M.R.N. Kalhori, M.H.F. Zarandi, Interval type-2 credibilistic clustering for pattern recognition, Pattern Recognit. 48 (11) (2015) 3652–3672.
[35] J. Zhou, Q. Wang, C.C. Hung, F. Yang, Credibilistic clustering algorithms via alternating cluster estimation, J. Intell. Manuf. 28 (3) (2017) 727–738.
[36] T.A. Runkler, J.C. Bezdek, Alternating cluster estimation: a new tool for clustering and function approximation, IEEE Trans. Fuzzy Syst. 7 (4) (1999) 377–393.
[37] N.S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, Amer. Statist. 46 (3) (1992) 175–185.
[38] R. Collobert, S. Bengio, SVMTorch: support vector machines for large-scale regression problems, J. Mach. Learn. Res. 1 (2001) 143–160.
[39] D. Graupe, Principles of Artificial Neural Networks, World Scientific, 2013.
[40] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[41] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[42] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, NY, USA, 1973.
[43] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic, MA, USA, 1981.
[44] Y. Yang, S. Guan, J. You, CLOPE: a fast and effective clustering algorithm for transactional data, in: Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, 2002, pp. 682–687.
[45] M. Chen, K. Chuang, Clustering categorical data using the correlated force ensemble, in: Proceedings of the Fourth SIAM International Conference on Data Mining, 2004, pp. 269–278.
[46] L. Bai, J. Liang, C. Dang, F. Cao, A novel attribute weighting algorithm for clustering high-dimensional categorical data, Pattern Recognit. 44 (12) (2011) 2843–2861.
[47] T. Xiong, S. Wang, A. Mayers, E. Monga, DHCC: divisive hierarchical clustering of categorical data, Data Min. Knowl. Discov. 24 (1) (2012) 103–135.
[48] F. Cao, J. Liang, D. Li, X. Zhao, A weighting k-modes algorithm for subspace clustering of categorical data, Neurocomputing 108 (2013) 23–30.
[49] L. Chen, S. Wang, K. Wang, J. Zhu, Soft subspace clustering of categorical data with probabilistic distance, Pattern Recognit. 51 (2016) 322–332.
[50] T.P.Q. Nguyen, R.J. Kuo, Partition-and-merge based fuzzy genetic clustering algorithm for categorical data, Appl. Soft Comput. 75 (2019) 254–264.
[51] A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, Multiobjective genetic algorithm based fuzzy clustering of categorical attributes, IEEE Trans. Evol. Comput. 13 (2009) 991–1005.
[52] C.L. Yang, R.J. Kuo, C.H. Chien, N.T.P. Quyen, Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering, Appl. Soft Comput. 30 (2015) 113–122.
[53] A. Ammar, Z. Elouedi, P. Lingras, The k-modes method under possibilistic framework, Adv. Artif. Intell. 7884 (2013) 211–217.
[54] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (3) (1965) 338–353.
[55] L.A. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1) (1978) 3–28.
[56] L. Liu, Y. Li, L. Yang, The maximum fuzzy weighted matching models and hybrid genetic algorithm, Appl. Math. Comput. 181 (1) (2006) 662–674.
[57] S. Nahmias, Fuzzy variables, Fuzzy Sets and Systems 1 (2) (1978) 97–110.
[58] L.A. Zadeh, A theory of approximate reasoning, in: J. Hayes, D. Michie, R. Thrall (Eds.), Mathematical Frontiers of the Social and Policy Sciences, Westview Press, 1979, pp. 69–129.
[59] N. Jardine, R. Sibson, Mathematical Taxonomy, John Wiley & Sons, NY, USA, 1971.
[60] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 841–847.
[61] S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1) (1984) 81–87.
[62] R. Shang, Z. Zhang, L. Jiao, W. Wang, S. Yang, Global discriminative-based nonnegative spectral clustering, Pattern Recognit. 55 (2016) 172–182.
[63] R. Shang, W. Wang, R. Stolkin, L. Jiao, Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection, IEEE Trans. Cybern. 48 (2) (2017) 793–806.
[64] Y. Meng, R. Shang, L. Jiao, W. Zhang, Y. Yuan, S. Yang, Feature selection based dual-graph sparse non-negative matrix factorization for local discriminative clustering, Neurocomputing 290 (2018) 87–99.
[65] Y. Meng, R. Shang, L. Jiao, W. Zhang, S. Yang, Dual-graph regularized non-negative matrix factorization with sparse and orthogonal constraints, Eng. Appl. Artif. Intell. 69 (2018) 24–35.
[66] Y. Xiao, C. Huang, J. Huang, I. Kaku, Y. Xu, Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering, Pattern Recognit. 90 (2019) 183–195.




[67] K.Y. Yeung, W.L. Ruzzo, An empirical study on principal component analysis for clustering gene expression data, Bioinformatics 17 (9) (2001) 763–774.
[68] J.C. Bezdek, R.J. Hathaway, VAT: a tool for visual assessment of (cluster) tendency, in: Proceedings of the International Joint Conference on Neural Networks, Vol. 3, 2002, pp. 2225–2230.

[69] G.A. Ferguson, Y. Takane, Statistical Analysis in Psychology and Education, sixth ed., McGraw-Hill Ryerson Limited, 2005.
[70] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. 11 (1940) 86–92.
[71] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B Stat. Methodol. 57 (1) (1995) 289–300.
