Robust Heterogeneous C-means


Journal Pre-proof

Robust Heterogeneous C-means
Atieh Gharib, Hadi Sadoghi Yazdi, Amir Hossein Taherinia

PII: S1568-4946(19)30666-0
DOI: https://doi.org/10.1016/j.asoc.2019.105885
Reference: ASOC 105885
To appear in: Applied Soft Computing Journal
Received date: 2 March 2018; Revised date: 23 September 2019; Accepted date: 17 October 2019

Please cite this article as: A. Gharib, H.S. Yazdi and A.H. Taherinia, Robust Heterogeneous C-means, Applied Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105885. This is a PDF file of an article that has undergone enhancements after acceptance, but it is not yet the definitive version of record; during production, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.


Highlights:
- A new perspective on the FCM method, based on the loss function (minimum risk), is presented; it can generate different kinds of FCM, including robust versions.
- Our methods cover different kinds of robustness: robustness to outliers, to the volume of clusters, and robustness in noisy environments, even during initialization.
- The Robust Heterogeneous C-Means (RHCM) method has very low dependency on the values of the kernel parameters.
- RHCM solves some drawbacks of the Direct Clustering (DC) method (DC is a kind of weighted fuzzy c-means).
- Initializing with the Direct Clustering (DC) method makes our method fast.
- A real dataset of images and their corresponding tags from the 500px social network was gathered and used in this paper.


Robust Heterogeneous C-means

Atieh Gharib, Hadi Sadoghi Yazdi∗, Amir Hossein Taherinia

Dept. of Computer Engineering, Ferdowsi University of Mashhad, Iran

Abstract


Fuzzy c-means (FCM) is one of the most popular clustering algorithms, but it has drawbacks such as sensitivity to outliers. Although many correntropy-based works have been proposed to improve the robustness of FCM, a properly chosen error function is fundamentally required. In this paper, we present a new perspective on the FCM method based on the expected loss (or risk) that provides different kinds of robustness: robustness to outliers, to the volume of clusters, and robustness in noisy environments. First, we propose the Robust FCM method (RCM) by defining a loss function as a least-squares problem and exploiting correntropy to make FCM robust to outliers; furthermore, we utilize half-quadratic (HQ) optimization as the problem-solving method. Second, inspired by the Bayesian perspective, we define a new loss function based on correntropy as a distance metric to present Robust Heterogeneous C-Means (RHCM), utilizing the direct clustering (DC) method. DC gives RHCM a robust initialization. Besides, RHCM produces robust cluster centers in noisy environments and is capable of accurately clustering elliptical or spherical shaped data, regardless of the volume of each cluster. The results are shown visually on several synthetic datasets, including noisy ones, on the UCI repository, and on a real image dataset gathered manually from the 500px social network. For evaluation of the clustering results, several validity indices are also calculated. Experimental results indicate the superiority of our proposed method over the base FCM, DC, KFCM, two recent methods called GPFCM and GEPFCM, and a method called DC-KFCM that we created for comparison purposes.

Keywords: Robust, heterogeneous, clustering, half-quadratic, correntropy, loss function.

1. Introduction




∗Corresponding author. Email addresses: [email protected] (Atieh Gharib), [email protected] (Hadi Sadoghi Yazdi), [email protected] (Amir Hossein Taherinia). Preprint submitted to Applied Soft Computing, September 23, 2019.

Fuzzy C-Means (FCM) [1] is known as one of the most powerful clustering algorithms. Since it uses the Euclidean norm as its similarity measure, it is effective in clustering


spherical shaped objects. Due to several drawbacks of FCM, such as sensitivity to initialization, sensitivity to outliers, and the limitation to equal-sized, spherical-shaped clusters, different versions of FCM have been developed to improve its performance [2, 3, 4].

Krishnapuram and Keller [4] proposed the possibilistic c-means (PCM) algorithm to address the weakness of FCM in noisy environments. PCM determines a possibilistic partition in which a possibilistic type of membership function illustrates the degree of belonging [4]. PCM relaxes the constraint that the memberships of each data point sum to 1, so that it can ignore noisy and outlier data; but it depends heavily on the initial parameters and suffers from the coincident clusters problem [5, 6, 7, 8]. Pal et al. [9] proposed the fuzzy-possibilistic c-means (FPCM) model, which combines the characteristics of both fuzzy and possibilistic c-means to overcome their problems, but it constrains the sum of typicalities to a cluster over all data points to one. This row-sum constraint may produce unrealistic results for large datasets. So the possibilistic-fuzzy c-means (PFCM) model introduced in [10] simultaneously produces memberships, possibilities and cluster centers for each cluster, and solves the noise-sensitivity defect of FCM, the coincident clusters problem of PCM, and the row-sum constraint problem of FPCM [10, 5]. PCM was also extended by Yang and Wu [11]. Ferraro et al. [12] focused on robust analysis of non-precise data on the basis of a fuzzy and possibilistic clustering method in which the parameter values are determined based on a generalized form of the Xie-Beni validity index. In PCM, each object has equal importance in the clustering process, so the weighted PCM (WPCM) model [13, 14] defines weights for each object according to its relative importance in the clustering solution, which can minimize the effect of noisy data.
In [15], Wu and Yang presented AFCM, which replaces the Euclidean norm in the c-means clustering algorithm with an exponential metric to make it robust to noise and outliers. It can also tolerate unequal-sized clusters; however, it does not guarantee homogeneous clusters. To address this problem, Wu and Yang [16] introduced a possibilistic version of AFCM as an unsupervised alternating clustering method called UACM. UACM can create denser regions and more accurate cluster centers. Also, a new alternative weighted FCM (AWFCM) proposed by Xiang et al. [17] suggests a new distance metric and a weight matrix based on sample-dot density to overcome the shortcomings of FCM in dealing with mass-shaped partitions and to prevent cluster-center errors for noisy samples [18].

Another development is the kernel version of FCM proposed by Girolami [19], which maps the non-separable input data into a higher-dimensional feature space via a nonlinear data transformation, so that more general shapes of objects can be clustered [2, 19, 20]. KFCM has better outlier and noise robustness than FCM, so it is used in [21, 22] to deal with incomplete data. In KFCM, the values of the kernel parameters affect the final clustering results [2, 23]. In [23], Yang and Tsai proposed a Gaussian kernel-based FCM (GKFCM) with a spatial bias correction that solves this drawback. Ferreira and Carvalho [24] proposed a variable-wise kernel fuzzy c-means that uses adaptive distances: the relevance weight of each variable to each cluster may differ, and dissimilarity measures are obtained as a sum of kernel functions applied to each variable.


In [25], Shanmugapriya and Punithavalli proposed a Kernelized Fuzzy Possibilistic C-Means (KFPCM) algorithm in which FPCM combines the advantages of both FCM and PCM, and a kernel-induced distance measure is used to obtain better clustering results for high-dimensional data. It is also robust to outliers and noise. FCM, moreover, has the ability to overcome the curse-of-dimensionality problem by forming information granules, as used in [26]. In [6], Zhang and Chen presented a weighted kernel PCM (WkPCM) algorithm combining the WPCM algorithm and KPCM [27] to reduce the corruption caused by noisy data and to cluster big data effectively. In addition, a generalized form of possibilistic fuzzy c-means (GPFCM) [28] and a Generalized Entropy based Possibilistic Fuzzy C-Means algorithm (GEPFCM) [29] have been developed for clustering noisy data. Although these methods can find accurate cluster centers in noisy environments, they fail when the sizes of the clusters are considerably different.

Since it is hard to find a classifier that simultaneously considers the uncertainty in both the membership values and the distance measure while the initialization matters, the data environment is noisy, and the clusters have different shapes and sizes, we present models for defining a fuzzy classifier that covers all of these issues. In fact, in this paper we present a new perspective on the FCM method based on the expected loss (or risk); this perspective can generate different kinds of FCM, including robust versions. We first present a new robust FCM, called RCM, by defining an expected loss according to the error. Our definition is based on the correntropy-induced loss function and utilizes the half-quadratic optimization method to make RCM robust to outliers.
Then, utilizing this correntropy-based method and inspired by the Bayesian perspective, we define another expected loss, according to the data and parameters, to show another view of robustness. The newly generated robust method, called RHCM, uses the Direct Clustering (DC) algorithm [30], an equivalent method to weighted FCM, to be robust in the initialization step. Inspired by DC, we add the concept of heterogeneity as a constraint to our problem, to also make clusters robust to the volume of data in noisy environments.

The rest of the paper is structured as follows. Traditional FCM is reviewed in Section 2, and then, to develop our models, FCM from the expected loss perspective is considered. In Section 3, the RCM method is presented. In Section 4, the direct clustering algorithm and some of its drawbacks are reviewed; then the RHCM method is introduced to solve the previously mentioned problems. Section 5 offers some experimental results, both visually and in terms of cluster validity indices. Finally, conclusions are drawn in Section 6.

2. FCM from the expected loss perspective


The traditional FCM method is first reviewed in this section. Then, the expected loss perspective on the FCM method is presented.

2.1. Traditional Fuzzy c-means (FCM) clustering
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ data samples in $d$-dimensional space. The FCM algorithm [31] tries to minimize the objective function defined in (1) to cluster the data


into $c$ subsets. In FCM, each data point belongs to more than one cluster with different membership degrees ranging in $[0,1]$. So, the following objective function should be minimized subject to some conditions:

$$J_m(u, v) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \|x_i - v_j\|^2 \qquad (1)$$

$$\text{s.t.} \quad \sum_{j=1}^{c} u_{ij} = 1 \;\; \forall i, \qquad 0 < \sum_{i=1}^{n} u_{ij} < n$$

where $v_j$ is the center of the $j$-th cluster that needs to be found and $u_{ij}$ is the membership degree of the $i$-th data sample to the $j$-th cluster.
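As a concrete sketch of the alternating updates that minimize (1), one FCM round can be written as follows (a minimal NumPy illustration; the function name and toy data are ours, not from the paper):

```python
import numpy as np

def fcm_step(X, V, m=2.0):
    """One alternating FCM round for objective (1):
    update memberships u_ij for fixed centers, then recompute centers."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # ||x_i - v_j||^2
    d2 = np.maximum(d2, 1e-12)                               # guard zero distances
    # u_ij = 1 / sum_k (d2_ij / d2_ik)^(1/(m-1)); rows sum to 1 by construction
    U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
    W = U ** m                                               # u_ij^m
    V_new = (W.T @ X) / W.sum(axis=0)[:, None]               # u^m-weighted means
    return U, V_new

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
U, V = fcm_step(X, np.array([[0.0, 0.2], [5.0, 5.2]]))
```

Iterating `fcm_step` until the centers stop moving gives the usual FCM fixed point.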


2.2. FCM via the expected loss
In this section, we present a new perspective on the FCM method based on the expected loss (or risk). This perspective can generate different kinds of FCM, including the robust versions. So, we briefly review the role of the loss function. To develop our models, we can take two views of the loss function. First, we define an expected loss according to the error as

$$E\{l(e)\} \simeq \frac{1}{c} \sum_{i} l(e_i) \qquad (2)$$

If $l(e_i) = e_i^2$, then:

$$E\{l(e)\} = \sum_{i=1}^{n} e_i^2 \qquad (3)$$

Second, we define an expected loss according to the data and parameters as

$$E\{l(x,v)\} = \int_{v} \int_{x} l(x,v) f(x,v)\,dx\,dv$$

or

$$E\{l(x,v)\} = \frac{1}{nc} \sum_{x} \sum_{v} l(x,v) f(x,v) \qquad (4)$$

where $c$ is the number of classes and $n$ is the number of samples; $f(x,v)$ is a distribution function and $l(x,v)$ can be any user-defined loss. Since we know the ability of the Bayesian approach both to integrate expert prior knowledge into clustering and to evaluate a noisy model with different kinds of noise, we want to define a proper classifier that handles the uncertainty in both the membership values ($u$) and the distances ($d$) while considering noisy data. So, we propose a fuzzy classifier based on the Bayesian approach by developing (4) as

$$E\{l(x,v)\} = \sum_{i=1}^{n} \sum_{j=1}^{c} l(x,v) f(x_i|v_j) f(v_j) \qquad (5)$$


If we let $l(x,v) = \|x - v\|^2$, then:

$$E\{l(x,v)\} = \sum_{i=1}^{n} \sum_{j=1}^{c} \|x_i - v_j\|^2 \, u_{ij}^{m} f(v_j) \qquad (6)$$

or, if $l(x,v) = 1 - \exp\left(\frac{-\|x-v\|^2}{2\sigma^2}\right)$, then:

$$E\{l(x,v)\} = \sum_{i=1}^{n} \sum_{j=1}^{c} \left(1 - \exp\left(\frac{-\|x_i - v_j\|^2}{2\sigma_j^2}\right)\right) u_{ij}^{m} f(v_j) \qquad (7)$$

As shown in Eqs. (6) and (7), we define $u_{ij}^m = f(x_i|v_j)$, i.e., the degree to which data point $i$ belongs to cluster $j$, given the center of the $j$-th cluster. Also, we extend the definition of per-data weights in [32] to per-cluster weights: we define the prior knowledge $f(v_j)$ as the weight of each cluster. As a result, one can assign a lower weight to noisy clusters as prior knowledge.


3. Robust FCM method (RCM)

Starting from the definition of a proper loss function, we develop the RCM model based on the mentioned expected loss perspective in the following.


3.1. Formulation of RCM
The development of our RCM model is based on the expected loss defined in Eq. (3) of Section 2.2. If the loss is the squared error, we have the following minimization problem:

$$\min_{u,v} J(u, v) = \sum_{i=1}^{n} e_i^2 \qquad (8)$$

$$\text{s.t.} \quad e_i^2 = \sum_{j=1}^{c} u_{ij}^m d(x_i, v_j), \quad i = 1, \cdots, n, \qquad \sum_{j=1}^{c} u_{ij} = 1$$

Problem (8) can be seen as a constrained least-squares problem that is made robust by applying the correntropy loss function $l(e_i) = 1 - \exp(-\eta_i^2 e_i^2)$, where $\eta$ is a scaling constant.


So the minimization of (8) is equivalent to the maximization of the following function, which is applied to FCM:

$$\max_{u,v} J(u, v) = \sum_{i=1}^{n} \exp(-\eta_i^2 e_i^2) \qquad (9)$$

$$\text{s.t.} \quad e_i^2 = \sum_{j=1}^{c} u_{ij}^m d(x_i, v_j), \quad i = 1, 2, \cdots, n, \qquad \sum_{j=1}^{c} u_{ij} = 1$$

Using the half-quadratic problem-solving method (see Appendix A), one can rewrite the cost function $J(u,v)$ in (9) as

$$J(u, v) = \sum_{i=1}^{n} \sup_{p_i < 0} \{\eta_i^2 e_i^2 p_i - \varphi(p_i)\} = \sup_{p < 0} \sum_{i=1}^{n} (\eta_i^2 e_i^2 p_i - \varphi(p_i)) \qquad (10)$$

The second equality in (10) holds because the terms $\eta_i^2 e_i^2 p_i - \varphi(p_i)$ are independent functions of $p_i$. So, we can derive that (9) is equivalent to

$$\max_{u,v,\,p<0} J(u, v, p) = \sum_{i=1}^{n} (\eta_i^2 e_i^2 p_i - \varphi(p_i)) \qquad (11)$$
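The supremum form above can be checked numerically. Assuming the conjugate function $\varphi(p) = p - p\ln(-p)$ (a standard choice giving $\exp(-z) = \sup_{p<0}\{zp - \varphi(p)\}$; the paper's Appendix A may state it in a different but equivalent form), the maximizer is exactly $p = -\exp(-z)$, matching the analytic solution of (12) below:

```python
import numpy as np

def hq_bound(z, p):
    # z*p - phi(p) with phi(p) = p - p*ln(-p), defined for p < 0
    phi = p - p * np.log(-p)
    return z * p - phi

z = 1.7                                          # stands for eta_i^2 e_i^2
p_grid = -np.exp(np.linspace(-6.0, 2.0, 20001))  # a dense grid of negative p
best = hq_bound(z, p_grid).max()                 # numeric sup over p < 0
p_star = -np.exp(-z)                             # analytic maximizer
```

Numerically, `best` agrees with `exp(-z)`, and the bound is tight exactly at `p_star`.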

Now, we can optimize (11) using the alternating optimization method: iteratively, given $(u, v)$ we optimize over $p$, and then given $p$ we optimize over $(u, v)$. If we define the subscript $t$ as the $t$-th iteration, then given $(u_t, v_t)$, (11) is equivalent to

$$\max_{p_t < 0} \sum_{i=1}^{n} (\eta_i^2 e_{i,t}^2 p_{i,t} - \varphi(p_{i,t})) \qquad (12)$$

whose analytic solutions are

$$p_{i,t} = -\exp(-\eta_i^2 e_{i,t}^2) < 0, \quad i = 1, 2, \cdots, n$$

Then, after obtaining $p_{i,t}$, we can get $(v_{t+1}, u_{t+1})$ by solving the following problem:

$$\max_{u_{t+1}, v_{t+1}} \sum_{i=1}^{n} \eta_i^2 e_{i,t+1}^2 p_{i,t} \qquad (13)$$


For clarity, we omit the subscripts $t$, $t+1$. By defining $q_i = -p_i = \exp(-\eta_i^2 e_i^2)$, the equivalent problem of (13) is:

$$\min_{u,v} \sum_{i=1}^{n} \eta_i^2 e_i^2 q_i = \sum_{i=1}^{n} \sum_{j=1}^{c} \eta_i^2 u_{ij}^m d(x_i, v_j) q_i = \sum_{i} \sum_{j} \eta_i^2 u_{ij}^m \|v_j - x_i\|^2 q_i \qquad (14)$$

$$\text{s.t.} \quad \sum_{j=1}^{c} u_{ij} = 1, \quad i = 1, 2, \cdots, n$$

where $\|v_j - x_i\|^2$ is the squared Euclidean distance. After producing the Lagrange relation for (14), the centers and memberships are obtained as follows:

$$v_j = \frac{\sum_i \eta_i^2 u_{ij}^m q_i x_i}{\sum_i \eta_i^2 u_{ij}^m q_i} \qquad (15a)$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d(x_i, v_j)}{d(x_i, v_k)} \right)^{\frac{1}{m-1}}} \qquad (15b)$$

As shown in Eq. (15a), RCM presents a new relation for the centers. It will be shown in Section 5 that this is a good solution in noisy environments and in the presence of outliers. The proposed RCM procedure is shown in Algorithm 1.
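Before the formal listing, the loop can be sketched in NumPy as follows (deterministic initialization for reproducibility, whereas the paper initializes randomly; η is kept fixed here rather than adapted as in Section 3.3):

```python
import numpy as np

def rcm(X, V0, m=2.0, eta=1.0, t_max=50):
    """Sketch of the RCM loop: FCM whose center update (15a) weights each
    sample by q_i = exp(-eta_i^2 e_i^2), so large-error points (outliers)
    are effectively ignored."""
    V = np.array(V0, dtype=float)
    n = len(X)
    eta2 = np.full(n, eta ** 2)
    for _ in range(t_max):
        d2 = np.maximum(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
        # memberships via Eq. (15b), with d(x_i, v_j) the squared distance
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        e2 = (U ** m * d2).sum(axis=1)            # e_i^2 = sum_j u_ij^m d(x_i, v_j)
        q = np.exp(-eta2 * e2)                    # correntropy weight q_i = -p_i
        w = (eta2 * q)[:, None] * U ** m
        V = (w.T @ X) / np.maximum(w.sum(axis=0)[:, None], 1e-300)   # Eq. (15a)
    return U, V

blobs = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
                  [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]])
X = np.vstack([blobs, [[50.0, 50.0]]])            # one far outlier
U, V = rcm(X, V0=[[0.0, 0.0], [5.0, 5.0]])
```

With the plain FCM update the outlier would drag one center away; here its weight `q` collapses to (numerically) zero and the centers settle near the blob means.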


Algorithm 1: The proposed RCM method
Input: c, n, m, ε, η, t_max (iteration number)
Output: u, v
1 - Initialize the cluster centers and membership values randomly;
2 - Set t = 0;
3 - while t < t_max do
4 -   Compute e_{i,t}^2 = Σ_{j=1}^{c} u_{ij}^m d(x_i, v_j), i = 1, 2, ..., n;
5 -   Compute q_{i,t} = exp(−η_i^2 e_{i,t}^2), i = 1, 2, ..., n;
6 -   Update all cluster centers v_{j,t} by Eq. (15a), j = 1, 2, ..., c;
7 -   Update membership values u_{ij,t} by Eq. (15b), i = 1, 2, ..., n, j = 1, 2, ..., c;
8 -   Set t = t + 1;
9 - end while

3.2. Convergence of RCM
Proposition. The sequence J(u_t, v_t, p_t), t = 1, 2, ..., generated by the RCM method converges.
Proof. According to (10), we have J(u, v, p) ≤ J(u, v). Then, resulting from (12) and (13), we


have J(u_t, v_t, p_t) ≤ J(u_{t+1}, v_{t+1}, p_t) ≤ J(u_{t+1}, v_{t+1}, p_{t+1}). It means that the sequence {J(u_t, v_t, p_t), t = 1, 2, ...} generated by RCM is nondecreasing; since it is also bounded above (each summand in (9) is at most 1, so J ≤ n), this sequence converges.

3.3. Optimizing the η parameter
Utilizing the alternating optimization method, in this section we introduce an adaptive η parameter; i.e., just as in Section 3.1, given $(u, v, p)$ we can also optimize over η. We define a constraint on the η parameter over the minimization problem of (14) as follows:

$$\min_{\eta} \sum_{i=1}^{n} \eta_i^2 e_i^2 q_i \qquad (16)$$

$$\text{s.t.} \quad \sum_{i=1}^{n} \eta_i = \rho, \qquad \eta_i \in [0, 1]$$

The parameter ρ is a user-defined parameter. Producing the Lagrange relation and imposing the penalty term, we have:

$$\eta_i = \frac{\lambda}{2 e_i^2 q_i} \;\;\xrightarrow{\;\sum_i \eta_i = \rho\;}\;\; \eta_i = \frac{\rho}{\sum_{j=1}^{n} \frac{e_i^2 q_i}{e_j^2 q_j}} \qquad (17)$$

Then, to deal with the divide-by-zero problem, we add a ξ parameter to Eq. (17):

$$\eta_i = \frac{\rho}{\sum_{j=1}^{n} \frac{e_i^2 q_i}{e_j^2 q_j} + \xi}, \qquad i = 1, 2, \cdots, n \qquad (18)$$
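A small sketch of this weighting rule (function name ours; ξ only guards the division). Samples with a large weighted error $e_i^2 q_i$ receive a small $\eta_i$, and the weights sum to ρ up to the ξ term:

```python
import numpy as np

def eta_weights(e2, q, rho=1.0, xi=1e-12):
    # Eq. (18): eta_i = rho / (sum_j (e_i^2 q_i)/(e_j^2 q_j) + xi)
    s = np.maximum(e2 * q, 1e-300)          # s_i = e_i^2 q_i
    return rho / ((s[:, None] / s[None, :]).sum(axis=1) + xi)

e2 = np.array([0.1, 0.2, 0.9])              # per-sample errors e_i^2
q = np.exp(-e2)                             # correntropy weights from the RCM step
eta = eta_weights(e2, q)                    # largest weighted error -> smallest eta
```

By construction, `eta.sum()` is (up to ξ) exactly ρ, which is what the Lagrange constraint enforces.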


In fact, $\eta_i$ is the weight of the error: it determines how much a data sample can affect the clustering result. For samples with large errors (e.g., outliers), the value of $\eta_i$ becomes small to decrease the effect of the outlier data in our minimization problem.

4. Robust Heterogeneous C-Means method (RHCM)


In this section we define a new form of robustness over the membership function. To obtain more compact and homogeneous clusters for spherical or elliptical shaped data, we apply a cluster fit function based on the direct clustering method; in other words, we define a new restriction based on the membership function obtained by the DC algorithm. Besides, since FCM is an iterative algorithm, it depends on the initial point. Determining the initial point is another issue that leads us to search for methods that are robust to noise. DC is a method that satisfies this goal.


4.1. Direct Clustering (DC) algorithm
Given a dataset $X = \{x_1, x_2, \cdots, x_n\}$, the DC algorithm divides the dataset into $c$ subsets (note that $c < n$) by minimizing the following cluster fit function:

$$F(v_1, v_2, \cdots, v_c) = \sum_{i=1}^{n} \left\{ \prod_{j=1}^{c} \|v_j - x_i\|^2 \right\}, \qquad v_j, x_i \in \mathbb{R}^d$$

where $d$ denotes the number of dimensions in the feature space. DC can be viewed as an analytic solution to the 2D minimization problem. By minimizing $\prod_{j=1}^{c} \|v_j - x_i\|^2$, each $x_i$ will tend to associate with its nearest cluster center, and therefore will lead to compact and homogeneous clusters [30]. The definition of the above fit function matches the general definition of weighted fuzzy c-means by rewriting the fit function as follows:

$$F(v_1, v_2, \ldots, v_c) = \frac{1}{c} \sum_{i=1}^{n} \left\{ c \prod_{j=1}^{c} \|v_j - x_i\|^2 \right\} = \frac{1}{c} \sum_{i,r} \left\{ \|v_r - x_i\|^2 \prod_{j \neq r} \|v_j - x_i\|^2 \right\} = \sum_{i,r} u_{i,r} \|v_r - x_i\|^2, \qquad u_{i,r} = \frac{1}{c} \prod_{j \neq r} \|v_j - x_i\|^2 \qquad (19)$$


where $u_{i,r}$ measures how well all cluster centers other than $v_r$ approximate point $x_i$. A high value of $u_{i,r}$ for some $x_i$ expresses that no $v_j$, $j \neq r$, could approximate this $x_i$ closely, and the remaining $v_r$ is needed to reduce the value of $F(\cdot)$. In other words, each cluster center is used to approximate the points that no other cluster center can approximate efficiently. Therefore, the greater the distance of data point $x_i$ from the other cluster centers $v_j$, $j \neq r$, the higher the degree of belonging of $x_i$ to the $r$-th cluster.
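The fit function and the DC weights of (19) can be sketched directly (illustrative code, our names). The leave-one-out product form makes the identity $F = \sum_i \prod_j \|v_j - x_i\|^2$ easy to verify:

```python
import numpy as np

def dc_fit(X, V):
    """F = sum_i prod_j ||v_j - x_i||^2, rewritten per Eq. (19) as
    sum_{i,r} u_{i,r} ||v_r - x_i||^2, where u_{i,r} is the product of the
    squared distances to all OTHER centers, divided by c."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    c = d2.shape[1]
    prod_all = d2.prod(axis=1)
    U = prod_all[:, None] / np.maximum(d2, 1e-300) / c   # u_{i,r}
    F = (U * d2).sum()
    return U, F

X = np.array([[1.0, 0.0]])
V = np.array([[0.0, 0.0], [2.0, 0.0]])   # the point is equidistant from both centers
U, F = dc_fit(X, V)                       # U == [[0.5, 0.5]], F == 1.0
```

Note that, unlike FCM memberships, the $u_{i,r}$ need not sum to 1 over the clusters.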


4.2. The problem of the DC algorithm
Although the DC algorithm results in homogeneous clusters, when the number of data points in at least one cluster is significantly larger than in the others, it fails to produce an accurate clustering: the centers of the other clusters are attracted towards the larger-volume cluster's center. Here we provide an example to show this problem.

Example: Without loss of generality, suppose that the clusters are spherical, c = 2 (c is the number of clusters), and the points are in 2D space (x1, x2). The distribution of data points in Fig. 1 shows that cluster 1 has larger volume (more data). The data points are


Figure 1: The distribution of data points in two clusters

symmetric with respect to line L1 that is parallel to x1 -axis.


First, we prove that the center of each cluster is located on this line.
Proof 1: We use proof by contradiction. Consider the situation in which the cluster centers are not on line L1. Since the data distribution in each cluster is symmetric with respect to L1, there should be another center configuration with the same cost, located in a symmetric position with respect to L1. This is a contradiction, because the DC cost function has only one global optimum (as proved in [30]). Therefore, the initial assumption that the cluster centers are not located on line L1 must be false.


Then, under this condition, we must show how the DC algorithm fails to produce an accurate clustering. We consider two cases:
Case I: Suppose that there are two clusters. The total number of data points is n, and the number of data points in one cluster is k times that of the other. First, the cluster centers are located in the centroid region of each cluster, illustrated in Fig. 1 as $v_1$ and $v_2$. One can rewrite the DC cost function as follows:

$$F(v_1, v_2) = \sum_{i=1}^{n} \left\{ \prod_{j=1}^{2} \|v_j - x_i\|^2 \right\} = \sum_{i=1}^{n} \|v_1 - x_i\|^2 \|v_2 - x_i\|^2, \qquad v_j, x_i \in \mathbb{R}^2 \qquad (20)$$

In this case, the distance of the $i$-th data point from the $j$-th cluster center is denoted as $k_{i,j}$.
Case II: If the center of the cluster with fewer data points (cluster 2) approaches the center of the other cluster (cluster 1) by a distance $h$ ($h > 0$) from its previous position


($v_2$) - let us call this point $v_2'$ - one can rewrite the DC cost function as follows:

$$F(v_1, v_2') = \sum_{i=1}^{n} \left\{ \prod_{j=1}^{2} \|v_j - x_i\|^2 \right\} = \sum_{i=1}^{n} \|v_1 - x_i\|^2 \|v_2' - x_i\|^2, \qquad v_j, x_i \in \mathbb{R}^2 \qquad (21)$$

It is obvious that the distance of each data point in cluster 1 from $v_2'$ is less than in the previous case, being at most $k_{1,2} - h$, while the distance of most of the data points in the second cluster from $v_2'$ is greater than in the previous case. Since the number of data points in the first cluster is significantly larger, the value of the objective function in Eq. (21) is less than in the previous case (Eq. (20)). As the objective function is to be minimized, point $v_2'$ will be chosen as the center of the cluster. Since the DC algorithm is independent of the shape of the clusters, this proof can be modified for general cases.


4.3. Formulation of RHCM
The goal of the proposed method is to cluster unequal spherical or elliptical shaped data accurately and efficiently. So, we use a cluster fit function inspired by the KFCM method. Also, we need a formula for $u_{ij}$ that expresses how well each data point is approximated by at least one cluster center. To reach this goal, we use the previously mentioned direct clustering formula to evaluate $u_{ij}$ in each iteration as

$$u_{i,j} = \frac{1}{c} \cdot \frac{\prod_{r=1}^{c} \|v_r - x_i\|^2}{\|v_j - x_i\|^2} \qquad (22)$$


The development of our RHCM method is also based on the expected loss defined in Eq. (7) of Section 2.2. For simplicity we put $f(v_j) = 1$. Using Eq. (22), the following constrained optimization problem is produced:

$$\min_{u,v} J_m(u, v) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m (1 - k(x_i, v_j)) \qquad (23)$$

$$\text{s.t.} \quad \prod_{j=1}^{c} u_{ij} = \frac{\left( \prod_{r=1}^{c} \|v_r - x_i\|^2 \right)^{c-1}}{c^c}$$

where $k$ is an RBF kernel that helps us define a correntropy-based loss function as

$$k(x_i, v_j) = \exp\left( -\frac{\|x_i - v_j\|^2}{2\sigma_j^2} \right) \qquad (24)$$


Now, using Lagrange multipliers ($\lambda_i$), we have:

$$L(u, v, \lambda) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m (1 - k(x_i, v_j)) - \sum_{i=1}^{n} \lambda_i \left( \prod_{l=1}^{c} u_{il} - \frac{\left( \prod_{r=1}^{c} \|v_r - x_i\|^2 \right)^{c-1}}{c^c} \right) \qquad (25)$$

First, we obtain $\lambda_i$ as

$$\frac{\partial L}{\partial u_{ij}} = 0 \;\Rightarrow\; \lambda_i = m u_{ij}^m \, \frac{1 - k(x_i, v_j)}{\prod_{l=1}^{c} u_{il}}, \quad i = 1, 2, \cdots, n$$

with

$$d_2(x_i, v_j) = \frac{m (1 - k(x_i, v_j)) (c-1)}{\|v_j - x_i\|^2} + k(x_i, v_j)$$

Then, by substituting $\lambda_i$ into Eq. (25) and taking the partial derivatives with respect to $v_j$, we get the centers as follows:

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^m d_2(x_i, v_j) x_i}{\sum_{i=1}^{n} u_{ij}^m d_2(x_i, v_j)}, \quad j = 1, 2, \cdots, c \qquad (26)$$

The proposed RHCM procedure is shown in Algorithm 2. Utilizing DC as the initialization method gives RHCM robust initial points in noisy environments.


Algorithm 2: The proposed RHCM method
Input: c, n, m, ε, σ (σ is the kernel parameter)
Output: u, v
1 - Initialize the cluster centers by the DC algorithm;
2 - Set t = 0 and initialize the membership values randomly;
3 - do
4 -   Update memberships u_{ij,t+1} by Eq. (22), i = 1, 2, ..., n, j = 1, 2, ..., c;
5 -   Update all cluster centers v_{j,t+1} by Eq. (26), j = 1, 2, ..., c;
6 -   Set t = t + 1;
7 - until |J(u_{ij,t}, v_{j,t}) − J(u_{ij,t−1}, v_{j,t−1})| < ε;
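A NumPy sketch of this loop (the argument `V0` stands in for the DC initialization, which is not re-implemented here; names are ours):

```python
import numpy as np

def rhcm(X, V0, m=2.0, sigma=2.0, t_max=100, eps=1e-9):
    """Sketch of the RHCM loop: DC-style memberships (22) with the
    RBF-kernel loss (24) and the center update (26)."""
    V = np.array(V0, dtype=float)
    c = len(V)
    prev_J = np.inf
    for _ in range(t_max):
        d2 = np.maximum(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
        U = d2.prod(axis=1)[:, None] / d2 / c        # Eq. (22); rows need not sum to 1
        K = np.exp(-d2 / (2.0 * sigma ** 2))         # Eq. (24)
        D2 = m * (1.0 - K) * (c - 1) / d2 + K        # d_2(x_i, v_j) used in Eq. (26)
        W = (U ** m) * D2
        V = (W.T @ X) / np.maximum(W.sum(axis=0)[:, None], 1e-300)   # Eq. (26)
        J = ((U ** m) * (1.0 - K)).sum()             # objective (23)
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return U, V

# a small cluster and a 2x larger one; each center should stay with its own cluster
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
              [5.0, 5.0], [5.2, 5.2]])
U, V = rhcm(X, V0=[[0.0, 0.0], [5.0, 5.0]])
```

Because the membership of a point to cluster $j$ grows with its distance to the *other* centers, the large cluster exerts almost no pull on the small cluster's center, which is the heterogeneity property discussed below.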


Since the constraint on our cost function in (23) is the product of all membership values of a data point over all clusters, each cluster center is used to approximate the data points that are not approximated closely enough by the other cluster centers. Therefore, this algorithm results in compact clusters that are heterogeneous from each other. Clearly, this definition is reasonable.


By using an RBF kernel as a similarity measure, this algorithm can effectively cluster spherical or elliptical shaped data, as well as more general shapes. As we proved in this section, the DC algorithm does not guarantee accurate cluster centers when the sizes of the clusters are unequal; as the test results indicate, the RHCM algorithm overcomes this drawback.


5. Experimental Results

In this section, we first show the results of our RCM method on both synthetic and real datasets from the UCI machine learning repository, compared with the basic FCM. Then we show the superiority of our RHCM method in clustering unequal-sized spherical or elliptical shaped data on synthetic datasets, synthetic noisy datasets and real datasets. Finally, the clustering quality of the algorithms is compared using several cluster validity indices.


5.1. Evaluation of the RCM algorithm

To show the differences between FCM and our RCM method, consider the distribution of data points in Fig. 2. With two clusters, the outlier point in the FCM method attracts one of the cluster centers, but the RCM method resists the effect of the outlier data.

Figure 2: A synthetic dataset with 2 classes. The centers of the two clusters are shown in red. In FCM, the outlier point forms a separate cluster; RCM does not allow the outlier data to noticeably affect the centers.


In Fig. 3, the movement of the centers is shown on a synthetic dataset as well as on UCI repository datasets. For visualizing high-dimensional data, the Andrews plot method [33] is used. As can be seen, the RCM algorithm exhibits much smaller changes in the cluster centers in the presence of outliers compared with FCM, which properly demonstrates the robustness of our proposed method.


(a) A synthetic dataset containing 600 datapoints and 6 attributes. 20 outliers have been added to the samples. Data are clustered in 3 classes.


(b) Iris dataset: movement of two of the cluster centers is shown. 10% of the 3rd class have been added as outliers.


(c) wine dataset: movement of one of the cluster centers is shown. 10% of the 3rd class have been added as outliers.


(d) contraceptive dataset: movement of one of the cluster centers is shown. 10% of the 3rd class have been added as outliers.

Figure 3: Movement of the cluster centers in the presence of outliers. The blue curve shows the real center of the clusters. The green curve shows the centers achieved by the RCM and FCM methods. The red curve shows the centers obtained by the two algorithms after adding outliers. To avoid interference of the curves, in (c) and (d) the results for one of the classes are shown. The centers are presented in 2D space using the andrewsplot function in MATLAB. t is a continuous dummy variable created by the Andrews plot method, and f(t) represents each observation.


(a) data1 (b) data2

5.2. Evaluation of the RHCM algorithm

In this section, the experimental results for the RHCM algorithm are compared with the DC algorithm, KFCM (initialized randomly), DC-KFCM (a KFCM method that we initialized with the DC results), GPFCM and GEPFCM. Since the DC algorithm has an exact solution (i.e., it does not iterate), it is very fast; so, instead of initializing randomly, our RHCM cluster centers are initialized with the DC results. Since DC works in 2D space, we show the results in 2D space. Fig. 4 shows the clustering results on synthetic datasets having clusters of different volumes. Different values of the kernel parameter (σ) were tested and the best ones were selected for each method. Since KFCM, GPFCM and GEPFCM are initialized randomly, we repeated each of these methods 50 times and report the best results (those with the minimum entropy). Red arrows in Fig. 4 show the wrong movement of the centers; RHCM centers are shown with red circles.

(c) data3 (d) data4 (e) data5 (f) data6


Figure 4: Movement of the class centers for clusters of different volumes (in almost all cases σ_RHCM = 2, but different parameter values are used for the other methods in each figure). In KFCM, DC-KFCM, GPFCM and GEPFCM, the centers of smaller-volume clusters move towards the much larger-volume clusters; in RHCM, the cluster volumes hardly affect the positions of the other cluster centers.

In Fig. 4(a) to Fig. 4(d), the KFCM, DC-KFCM, GPFCM, GEPFCM and RHCM methods behave almost similarly; but as shown in Fig. 4(e), when the volume of a few clusters dramatically increases and the clusters overlap, the centers produced by KFCM, GPFCM and GEPFCM cannot find their correct positions, whereas in DC-KFCM, initializing by DC helps the algorithm find more reasonable centers. However, DC-KFCM can also be sensitive to the volume of clusters, as shown in Fig. 4(f). Since the shapes and volumes of the clusters differ considerably and the clusters overlap, the larger-volume cluster pulls the center of the smaller one towards itself in all of KFCM, DC-KFCM, GPFCM and GEPFCM. But the results show the superiority of RHCM in robustness to the shape and the volume of clusters in all the figures. The RHCM algorithm not only addresses the drawbacks of the DC algorithm but also results in compact clusters that are heterogeneous from each other.

Fig. 5 shows six synthetic noisy datasets having various numbers and volumes of clusters. Uniform noise has been added to the data. The number of noise and outlier points increases from noisyData1 to noisyData6 to study the impact of noise on the performance of the algorithms.

(a) noisyData1, No. of noise points = 50 (b) noisyData2, No. of noise points = 100

[Figure 5 panels: (c) noisyData3, 300 noise points; (d) noisyData4, 1000 noise points; (e) noisyData5, 1500 noise points; (f) noisyData6, 1500 noise points]

Figure 5: Movement of the class centers with clusters of different volumes in a noisy environment. In KFCM, DC-KFCM, GPFCM and GEPFCM the centers of smaller clusters move towards the much larger clusters; in RHCM the cluster volumes hardly affect the positions of the other cluster centers.

[Figure 6 panel: (a) σ = 2]

Since GPFCM and GEPFCM were designed to give good clustering results especially in noisy environments, their overall performance on nearly equal-sized clusters is reasonable (Figs. 5(d) and 5(e)). But when the clusters are of different sizes, misbehavior similar to that described for Fig. 4 appears again. To be clearer, as shown in Figs. 5(a)-5(c) and Fig. 5(f), the larger clusters pull the centers of the smaller ones for all of DC, DC-KFCM, GPFCM and GEPFCM. But on all of these datasets, our RHCM method is robust both to the noise and to the cluster volumes and hence gives far more reasonable clustering results than the others. Fig. 6 shows the effect of the σ parameter on the mentioned methods. As can be seen, KFCM, DC-KFCM, GPFCM and GEPFCM are very dependent on the value of σ: by changing σ, the positions of the centers in these methods change considerably. Due to the definition of the membership function (u), the role of σ in RHCM is smaller than in the other methods, so RHCM shows almost the same behavior as σ changes. In other words, RHCM is nearly robust to changes of the σ parameter.

[Figure 6 panels: (b) σ = 10, (c) σ = 100]

Figure 6: The effect of different values of the σ parameter.

To generalize our method to nonspherical and non-elliptical data, we also tested RHCM on a real dataset that contains 80,000 images from the 500px social media platform. We used the 500px API to search for and gather the uploaded images on different concepts, and grouped the images into four categories: 'sport', 'food', 'cold' and 'war'. The data contain the images and other information such as the tags users created for each uploaded image, the latitude and longitude, the rating, etc. We use the latitude and longitude of each image for clustering and then show the results of our clustering method for a desired number of clusters (Fig. 7(a)). Given the shape of the overlapping clusters and the noisy environment, the RHCM and DC-KFCM cluster centers are more satisfactory than those of the other methods (shown in the right picture of Fig. 7(a)). After obtaining the centers, the real data nearest to the center of each cluster are found (according to their latitude and longitude) to be shown on the map (Fig. 7(b)). The images found at the centers of the clusters are shown in Fig. 7(c).

Figure 7: Clustering of images uploaded by 500px social media users on the 'cold' concept, with 3 desired clusters. (a) The distribution of the image data according to their latitude and longitude; the clustering result is shown in the right picture, where two clusters overlap. Just like the synthetic datasets shown previously, RHCM tends to produce compact clusters with inter-cluster heterogeneity. (b) The 3 cluster centers obtained by RHCM are marked on the map to specify the regions. (c) The images obtained at the center of each cluster.

The results of clustering the images uploaded by 500px social media users on the concepts 'food', 'sport' and 'war' are shown in Fig. 8.

As mentioned before, RHCM tends to create compact clusters with heterogeneity among them, so it tends to find cluster centers that are far from each other and lie in denser regions. When the cluster volumes differ dramatically, as in Fig. 8(b), the centers obtained by DC, DC-KFCM, GPFCM and GEPFCM in the smaller cluster (red circle) tend to move toward the larger clusters, but RHCM remains robust. Even when different clusters overlap, RHCM is able to find reasonable cluster centers, as in Fig. 8(c).

[Figure 8 panels: (a) food, (b) sport, (c) war]

Figure 8: Clustering results of images uploaded by 500px social media users on 3 concepts: 'food', 'sport' and 'war'.


Fig. 9 shows the cluster centers for the last 3 concepts ('food', 'sport' and 'war') on a real map along with their corresponding images. Red markers show the regions found as cluster centers by the RHCM method (we chose to cluster the data into 3 groups). The distribution of the data reflects the number of images uploaded by 500px social media users; most of the uploaded images belong to 500px users in the United States, so in almost all concepts one of the cluster centers is located in the United States. Although users may upload images unrelated to a concept, some cluster centers produced meaningful results, such as ice hockey in Indiana in the second picture of Fig. 9(b) or the Vietnam war in the second picture of Fig. 9(c).

Figure 9: The centers of each cluster for 3 concepts obtained by RHCM, shown to specify the regions on the map, along with the corresponding images uploaded by the users: (a) 'food', (b) 'sport', (c) 'war'.

Journal Pre-proof

5.3. Clustering quality

In this section, we first describe a number of cluster validity indices and then use them to measure the clustering quality of the algorithms.

5.3.1. Validity indices

Partition Coefficient (PC): The PC index [34] measures the amount of overlap between clusters [35, 36]. It is defined as:

$$PC = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}^{2} \qquad (27)$$

The index values range in [1/c, 1], where c is the number of clusters. If all membership grades are equal, the PC attains its lowest value; that is, a PC value close to 1/c indicates the failure of the clustering algorithm. The closer the value is to 1, the crisper the clustering; hence, the larger the PC value, the better.

Partition Entropy (PE): The PE index [37] measures the amount of fuzziness in the given cluster partitions (for c greater than 1) [35, 36]. PE is defined as:

$$PE = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}\,\log_a u_{ij} \qquad (28)$$
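Both indices can be computed directly from the membership matrix; a minimal sketch (our own illustration, with U of shape (n, c) and the logarithm base a as a parameter):

```python
import numpy as np

def partition_coefficient(U):
    # Eq. (27): PC = (1/n) * sum_{i,j} u_ij^2, for U of shape (n, c)
    return float(np.sum(U ** 2) / U.shape[0])

def partition_entropy(U, a=np.e):
    # Eq. (28): PE = -(1/n) * sum_{i,j} u_ij * log_a(u_ij), with 0 * log 0 = 0
    mask = U > 0
    return float(-np.sum(U[mask] * np.log(U[mask])) / (np.log(a) * U.shape[0]))
```

A crisp partition gives PC = 1 and PE = 0, while a totally fuzzy one (all u_ij = 1/c) gives PC = 1/c and PE = log_a c, matching the ranges stated above.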

The index values range in [0, log_a c], where a is the base of the logarithm. A PE value close to log_a c shows the inability of the algorithm to extract the clustering structure of the dataset; the closer the entropy is to zero, the better.

Partition Index (SC): This validity index [38] is obtained by summing the ratio of the compactness of cluster i to its separation over all clusters:

$$SC = \sum_{i=1}^{c} \frac{\sum_{j=1}^{n} u_{ij}^{m}\,\|x_j - v_i\|^2}{n_i \sum_{k=1}^{c} \|v_k - v_i\|^2} \qquad (29)$$

where the compactness uses the variance of the data within a cluster to indicate the degree of closeness of the cluster members, and the separation of cluster i is defined as the sum of the distances from its center to the other cluster centers. The lower the value of SC, the better the clustering result.

Xie and Beni's Index (XB): The XB index [39] also focuses on compactness and separation [36] and is defined as:

$$XB = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\,\|x_j - v_i\|^2}{n \cdot \min_{i,j} \|v_i - v_j\|^2} \qquad (30)$$

For a good partition, the value of compactness is small, and for well-separated clusters the value of separation is high. Therefore, for compact and well-separated clusters the value of XB is expected to be small.
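Both indices can be sketched from the data X (n × d), the memberships U (n × c) and the centers V (c × d). This is an illustrative implementation of Eqs. (29) and (30), not the authors' code, with fuzzifier m = 2 as an assumed default:

```python
import numpy as np

def sc_index(X, U, V, m=2.0):
    # Eq. (29): sum over clusters of compactness / separation, where
    # n_i is the fuzzy cardinality sum_j u_ij of cluster i.
    d2 = np.sum((X[:, None, :] - V[None, :, :]) ** 2, axis=2)   # ||x_j - v_i||^2
    compact = np.sum((U ** m) * d2, axis=0)                     # per-cluster compactness
    n_i = np.sum(U, axis=0)
    sep = np.array([np.sum((V - v) ** 2) for v in V])           # sum_k ||v_k - v_i||^2
    return float(np.sum(compact / (n_i * sep)))

def xb_index(X, U, V, m=2.0):
    # Eq. (30): total fuzzy compactness over n times the minimum
    # squared distance between any two cluster centers.
    d2 = np.sum((X[:, None, :] - V[None, :, :]) ** 2, axis=2)
    vv = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=2)
    min_sep = np.min(vv[np.triu_indices(len(V), k=1)])
    return float(np.sum((U ** m) * d2) / (len(X) * min_sep))
```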

Dunn's Index (DI): DI [40] attempts to identify compact and well-separated clusters [35]. It is defined as:

$$DI = \min_{i \in c}\left\{ \min_{j \in c,\, j \neq i} \left\{ \frac{d(c_i, c_j)}{\max_{k \in c} \operatorname{diam}(c_k)} \right\} \right\}, \quad \text{where } d(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} d(x, y) \text{ and } \operatorname{diam}(c_i) = \max_{x, y \in c_i} d(x, y) \qquad (31)$$
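With crisp clusters represented as arrays of points, DI can be sketched as follows (an illustration of Eq. (31), using the single-linkage set distance and the complete diameter):

```python
import numpy as np

def dunn_index(clusters):
    # clusters: list of (n_i, d) arrays of points.
    def pair_d(A, B):
        # Euclidean distances between every point of A and every point of B
        return np.sqrt(np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2))
    diam = max(np.max(pair_d(C, C)) for C in clusters)      # max_k diam(c_k)
    sep = min(np.min(pair_d(A, B))                          # min_{i != j} d(c_i, c_j)
              for i, A in enumerate(clusters)
              for j, B in enumerate(clusters) if i < j)
    return float(sep / diam)
```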

For a good partitioning, the distances among clusters are expected to be large and the diameters of the clusters small, so a larger value of DI indicates a better clustering result.

Rand Index (RI): This index [41] measures the degree of similarity between the set of discovered clusters C and a predefined partition P of the data [36, 42]. It is given by:

$$RI = \frac{a + d}{a + b + c + d} \qquad (32)$$

where a is the number of pairs (x, x′) of data points that belong to the same cluster of C and to the same group of partition P; b is the number of pairs that belong to the same cluster of C and to different groups of P; c is the number of pairs that belong to different clusters of C and to the same group of P; and d is the number of pairs that belong to different clusters of C and to different groups of P. The value of RI varies in [0, 1]; a higher value indicates greater agreement between the two partitions, and a value of 1 states that the two partitions are exactly the same.
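The four pair counts translate directly into code; a small sketch over two label vectors (our own illustration):

```python
from itertools import combinations

def rand_index(labels_c, labels_p):
    # Count the pairs (a, b, c, d) of Eq. (32) over all pairs of points.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]   # same cluster in C?
        same_p = labels_p[i] == labels_p[j]   # same group in P?
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)
```

Identical partitions give RI = 1, as stated above.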

5.3.2. Comparing the algorithms in terms of validity indices

The clustering quality of the algorithms on three groups of datasets (synthetic datasets, synthetic noisy datasets and the real dataset) is reported in Tables 1-8. As mentioned before, for KFCM, GPFCM and GEPFCM the experiments were repeated 50 times and the best results, those with the minimum entropy, are reported. For a dataset like data5 (shown in Fig. 4), with clusters of different volumes, Table 1 indicates that our RHCM method has a larger PC, lower PE, SC and XB, and larger DI and RI than the others. Table 2 reports the mean values of the validity indices on data1 to data6 and shows the better performance of the RHCM algorithm. Table 3 shows the validity indices on noisyData3 (shown in Fig. 5), in which the clusters are of different sizes, and Table 4 reports the mean values of the validity indices on noisyData1 to noisyData6. Although GPFCM and GEPFCM have good clustering results on noisy data, they fail to find accurate cluster centers when the cluster sizes are considerably different. So, RHCM outperforms the other methods in terms of all indices and produces compact and well-separated clusters. Tables 5-8 show the clustering quality of the algorithms on the real dataset. Since the true labels of the real data are not available, RI values are not reported. It can be inferred that the overall performance of RHCM is better than that of the other methods.


Table 1: Cluster validity indices on synthetic datasets for data5

Category   PC    PE    SC     XB     DI      RI
DC         0.31  1.27  0.179  0.014  0.0046  0.59
KFCM       0.64  0.65  0.081  0.011  0.0032  0.81
DC-KFCM    0.31  1.27  0.129  0.008  0.0064  0.92
GPFCM      0.27  1.33  0.572  0.043  0.0024  0.73
GEPFCM     0.41  1.05  0.412  0.038  0.0019  0.73
RHCM       0.75  0.49  0.059  0.006  0.0064  0.93

Table 2: The mean value of cluster validity indices on synthetic datasets for data1-data6

Category   PC    PE    SC     XB      DI     RI
DC         0.31  1.27  0.181  0.0164  0.039  0.79
KFCM       0.76  0.47  0.058  0.0080  0.044  0.90
DC-KFCM    0.70  0.58  0.066  0.0075  0.044  0.93
GPFCM      0.44  0.98  0.180  0.0131  0.042  0.88
GEPFCM     0.70  0.57  0.117  0.0125  0.040  0.89
RHCM       0.79  0.44  0.049  0.0063  0.044  0.98

Table 3: Cluster validity indices on synthetic noisy datasets for noisyData3

Category   PC    PE    SC     XB      DI     RI
DC         0.32  1.24  0.15   0.0109  0.001  0.89
KFCM       0.64  0.68  0.068  0.0053  0.002  0.68
DC-KFCM    0.71  0.53  0.057  0.0060  0.004  0.71
GPFCM      0.34  1.07  0.108  0.0056  0.002  0.68
GEPFCM     0.69  0.58  0.059  0.0056  0.002  0.69
RHCM       0.75  0.48  0.047  0.0051  0.006  0.94

Table 4: The mean value of cluster validity indices on synthetic noisy datasets for noisyData1-noisyData6

Category   PC    PE    SC        XB        DI     RI
DC         0.34  1.24  0.23      0.027     0.004  0.87
KFCM       0.70  0.60  0.10      0.013     0.008  0.88
DC-KFCM    0.75  0.49  0.10      0.015     0.008  0.91
GPFCM      0.30  1.28  9.1*10^5  1.2*10^5  0.003  0.73
GEPFCM     0.60  0.70  5.5*10^4  7.5*10^3  0.003  0.76
RHCM       0.78  0.44  0.07      0.010     0.008  0.98


Table 5: Cluster validity indices on real datasets for the concept of "cold"

Category   PC    PE    SC     XB         DI
DC         0.52  0.82  0.006  0.0011     0.005
KFCM       0.79  0.39  0.006  0.0010     7.99*10^-4
DC-KFCM    0.56  0.75  0.005  8*10^-4    0.0025
GPFCM      0.51  0.75  0.012  0.0014     0.0014
GEPFCM     0.82  0.34  0.008  0.0014     0.0015
RHCM       0.83  0.32  0.002  6*10^-4    0.0019

Table 6: Cluster validity indices on real datasets for the concept of "food"

Category   PC    PE    SC     XB         DI
DC         0.51  0.83  0.006  0.0011     0.050
KFCM       0.79  0.40  0.003  7.6*10^-4  0.034
DC-KFCM    0.57  0.75  0.005  8.9*10^-4  0.034
GPFCM      0.43  0.86  0.049  0.0065     0.001
GEPFCM     0.58  0.69  0.044  0.0074     0.002
RHCM       0.85  0.29  0.002  6.5*10^-4  0.014

Table 7: Cluster validity indices on real datasets for the concept of "sport"

Category   PC    PE    SC      XB          DI
DC         0.53  0.79  0.005   8.2*10^-4   0.01
KFCM       0.81  0.35  0.0032  6.75*10^-4  0.013
DC-KFCM    0.61  0.68  0.005   8.4*10^-4   0.019
GPFCM      0.40  0.99  1.5     0.21        3.04*10^-4
GEPFCM     0.39  1.01  2.69    0.45        6.4*10^-4
RHCM       0.87  0.24  0.001   4.08*10^-4  0.004

Table 8: Cluster validity indices on real datasets for the concept of "war"

Category   PC    PE   SC     XB          DI
DC         0.50  0.8  0.005  0.001       0.001
KFCM       0.79  0.4  0.006  0.001       0.002
DC-KFCM    0.56  0.8  0.005  9.35*10^-4  0.007
GPFCM      0.59  0.7  0.033  0.005       6.09*10^-5
GEPFCM     0.83  0.3  0.007  0.001       0.002
RHCM       0.82  0.3  0.003  6.9*10^-4   0.002

6. Conclusion

In this paper, we presented a new perspective on the FCM method based on the expected loss (or risk). As mentioned, FCM is sensitive to initialization, noise and outliers. The new perspective helps to generate different kinds of FCM, including robust versions. First, we proposed an expected loss based on the error to obtain a robust FCM called RCM. Using the correntropy loss function and the half-quadratic (HQ) problem-solving method, we showed experimentally that RCM is far more robust to outliers than FCM. Then, we proposed another kind of expected loss, based on the data and the parameters, to obtain other kinds of robustness. Besides, to find initial points in noisy environments we needed a fast robust method like DC, which requires neither initial center guesses nor iterations. However, DC does not produce accurate cluster centers when at least one of the clusters in the dataset has a larger volume than the others. To address this drawback, we introduced the iterative RHCM, which utilizes the direct clustering (DC) method and uses the correntropy-induced metric, similar to RCM. The goal is a method robust to noise, outliers and clusters of different volumes whose initialization is also robust. We tested our method on three groups of datasets: synthetic datasets, synthetic noisy datasets and a real dataset consisting of 4 groups of data gathered from the 500px social media platform. The results were reported both visually and in terms of cluster validity indices. Based on the experimental results, the RHCM algorithm, in contrast with the KFCM, DC, DC-KFCM, GPFCM and GEPFCM methods, provides competitive clustering results and produces homogeneous clusters with heterogeneity among themselves.

Appendix A. Half-quadratic (HQ) optimization

The description of the HQ modeling is based on conjugate function theory [43, 44]. For a function $\varphi(p)$, the conjugate function $\varphi^{*}(s)$ is defined as:

$$\varphi^{*}(s) = \sup_{p \in \operatorname{dom} \varphi} \left( s^{T} p - \varphi(p) \right) \qquad (A.1)$$

By differentiating $\varphi$ with respect to $p$, the supremum is achieved at $s = \varphi'(p)$. If we define $\varphi(p) = -p \log(-p) + p$ for $p < 0$, which is a convex function [45], then using the definition of the conjugate function one finds the supremum at $p = -\exp(-s) < 0$. So, by replacing $p$ in (A.1), the conjugate function of $\varphi(p)$ is $\varphi^{*}(s) = \exp(-s)$.
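The closed form ϕ*(s) = exp(−s) can be checked numerically by maximizing s·p − ϕ(p) over a grid of negative p (a verification sketch, not part of the paper):

```python
import numpy as np

def phi(p):
    # phi(p) = -p * log(-p) + p, defined for p < 0
    return -p * np.log(-p) + p

def conjugate_value(s):
    # Approximate phi*(s) = sup_{p < 0} (s*p - phi(p)) on a log-spaced grid
    p = -np.exp(np.linspace(-20.0, 5.0, 200001))
    return float(np.max(s * p - phi(p)))
```

The grid maximum agrees with exp(−s); the supremum is attained near p = −exp(−s), consistent with s = ϕ′(p) = −log(−p).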

Appendix B. The proof of Equation (23) (the constraint of the RHCM optimization problem)

As mentioned in (19), for the DC method we had:

$$U_{i,j} = \frac{1}{c} \prod_{r \neq j} \|v_r - x_i\|^2$$

Then, one can rewrite $U_{i,j}$ as

$$U_{i,j} = \frac{1}{c}\, \frac{\prod_{r=1}^{c} \|v_r - x_i\|^2}{\|v_j - x_i\|^2}$$

Consequently, the following constraint is obtained:

$$\prod_{j=1}^{c} U_{i,j} = \left( \frac{1}{c}\, \frac{\prod_{r=1}^{c} \|v_r - x_i\|^2}{\|v_1 - x_i\|^2} \right) \left( \frac{1}{c}\, \frac{\prod_{r=1}^{c} \|v_r - x_i\|^2}{\|v_2 - x_i\|^2} \right) \cdots \left( \frac{1}{c}\, \frac{\prod_{r=1}^{c} \|v_r - x_i\|^2}{\|v_c - x_i\|^2} \right) = \frac{1}{c^c}\, \frac{\left( \prod_{r=1}^{c} \|v_r - x_i\|^2 \right)^{c}}{\|v_1 - x_i\|^2\, \|v_2 - x_i\|^2 \cdots \|v_c - x_i\|^2} = \frac{\left( \prod_{r=1}^{c} \|v_r - x_i\|^2 \right)^{c-1}}{c^c}$$
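The constraint can be verified numerically for random centers and a random point; a self-check sketch (the setup here is hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
c, dim = 4, 2
V = rng.normal(size=(c, dim))      # cluster centers v_r
x = rng.normal(size=dim)           # one data point x_i

sq = np.sum((V - x) ** 2, axis=1)  # ||v_r - x_i||^2 for r = 1..c

# DC memberships from Eq. (19): U_j = (1/c) * prod_{r != j} ||v_r - x_i||^2
U = np.array([np.prod(np.delete(sq, j)) / c for j in range(c)])

# The product over j must equal (prod_r ||v_r - x_i||^2)^(c-1) / c^c
lhs = np.prod(U)
rhs = np.prod(sq) ** (c - 1) / c ** c
assert np.isclose(lhs, rhs)
```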

Appendix C. The proof of Equation (26) (the RHCM optimization problem)

According to (25) we have:

$$L(u, v, \lambda) = \sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}^{m}\,(1 - k(x_i, v_j)) - \sum_{i=1}^{n} \lambda_i \left( \prod_{l=1}^{c} u_{il} - \frac{\left(\prod_{r=1}^{c} \|v_r - x_i\|^2\right)^{c-1}}{c^c} \right) \qquad (C.1)$$

By taking the partial derivatives with respect to $u_{ij}$ and $v_j$ we have:

$$\frac{\partial L}{\partial u_{ij}} = 0 \;\Rightarrow\; m u_{ij}^{m-1}\,(1 - k(x_i, v_j)) - \lambda_i \frac{\prod_{l=1}^{c} u_{il}}{u_{ij}} = 0 \;\Rightarrow\; \lambda_i = m u_{ij}^{m} \cdot \frac{1 - k(x_i, v_j)}{\prod_{l=1}^{c} u_{il}}, \qquad i = 1, 2, \cdots, n$$

$$\frac{\partial L}{\partial v_j} = 0 \;\Rightarrow\; \sum_{i} 2 u_{ij}^{m}(v_j - x_i)\,k(x_i, v_j) + \sum_{i} \lambda_i\, 2(c-1)(v_j - x_i)\, \frac{\left(\prod_{r \neq j} \|v_r - x_i\|^2\right)^{c-1} \left(\|v_j - x_i\|^2\right)^{c-1}}{c^c\, \|v_j - x_i\|^2} = 0$$

Then, by substituting $\lambda_i$:

$$\sum_{i} 2 u_{ij}^{m}(v_j - x_i)\,k(x_i, v_j) + \sum_{i} m u_{ij}^{m}\, \frac{1 - k(x_i, v_j)}{\prod_{l=1}^{c} u_{i,l}}\, 2(c-1)(v_j - x_i)\, \frac{\left(\prod_{r \neq j} \|v_r - x_i\|^2\right)^{c-1} \left(\|v_j - x_i\|^2\right)^{c-1}}{c^c\, \|v_j - x_i\|^2} = 0$$

and by substituting

$$\prod_{l=1}^{c} U_{i,l} = \frac{\left(\prod_{r=1}^{c} \|v_r - x_i\|^2\right)^{c-1}}{c^c}$$

we have:

$$\Rightarrow\; \sum_{i} u_{ij}^{m}(v_j - x_i)\,k(x_i, v_j) + \sum_{i} m u_{ij}^{m}\,(1 - k(x_i, v_j))\,\frac{(c-1)(v_j - x_i)}{\|v_j - x_i\|^2} = 0$$

$$\Rightarrow\; \sum_{i} u_{ij}^{m}(v_j - x_i)\left( k(x_i, v_j) + \frac{m\,(1 - k(x_i, v_j))(c-1)}{\|v_j - x_i\|^2} \right) = 0$$

$$\Rightarrow\; v_j = \frac{\sum_{i=1}^{n} u_{ij}^{m}\, d_2(x_i, v_j)\, x_i}{\sum_{i=1}^{n} u_{ij}^{m}\, d_2(x_i, v_j)}, \qquad j = 1, 2, \cdots, c, \qquad \text{where } d_2(x_i, v_j) = \frac{m\,(1 - k(x_i, v_j))(c-1)}{\|v_j - x_i\|^2} + k(x_i, v_j)$$
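The resulting center update can be sketched as one iteration step. This is our own illustrative implementation, assuming a Gaussian kernel k(x, v) = exp(−‖x − v‖²/σ²) and a given membership matrix U (both assumptions; the paper couples this update with DC-based memberships):

```python
import numpy as np

def update_centers(X, U, V, m=2.0, sigma=2.0):
    # One RHCM center update per Eq. (26):
    #   v_j = sum_i u_ij^m d2(x_i, v_j) x_i / sum_i u_ij^m d2(x_i, v_j)
    # with d2 = m (1 - k)(c - 1) / ||v_j - x_i||^2 + k.
    # Assumes no center coincides exactly with a data point.
    c = V.shape[0]
    V_new = np.empty_like(V)
    for j in range(c):
        diff = X - V[j]
        sq = np.sum(diff ** 2, axis=1)       # ||v_j - x_i||^2
        k = np.exp(-sq / sigma ** 2)         # Gaussian kernel values
        d2 = m * (1.0 - k) * (c - 1) / sq + k
        w = (U[:, j] ** m) * d2              # per-point weights
        V_new[j] = w @ X / w.sum()
    return V_new
```

Each new center is a positive weighted average of the data, so it stays inside the convex hull of the points that have nonzero membership in its cluster.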

References


[1] R. L. Cannon, J. V. Dave, J. C. Bezdek, Efficient implementation of the fuzzy c-means clustering algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8 (2) (1986) 248-255. doi:10.1109/TPAMI.1986.4767778.
[2] D. Graves, W. Pedrycz, Fuzzy c-means, Gustafson-Kessel FCM, and kernel-based FCM: A comparative study, in: Analysis and Design of Intelligent Systems using Soft Computing Techniques (IFSA 2007), 2007, pp. 140-149. doi:10.1007/978-3-540-72432-2_15.
[3] G. Heo, P. Gader, An extension of global fuzzy c-means using kernel methods, in: International Conference on Fuzzy Systems, 2010, pp. 1-6. doi:10.1109/FUZZY.2010.5584271.
[4] R. Krishnapuram, J. M. Keller, A possibilistic approach to clustering, IEEE Transactions on Fuzzy Systems 1 (2) (1993) 98-110. doi:10.1109/91.227387.
[5] O. A. M. Jafar, R. Sivakumar, A study on possibilistic and fuzzy possibilistic c-means clustering algorithms for data clustering, in: 2012 International Conference on Emerging Trends in Science, Engineering and Technology (INCOSET), 2012, pp. 90-95. doi:10.1109/INCOSET.2012.6513887.
[6] Q. Zhang, Z. Chen, A weighted kernel possibilistic c-means algorithm based on cloud computing for clustering big data, International Journal of Communication Systems 27 (9) (2014) 1378-1391. doi:10.1002/dac.2844.
[7] Y. Hu, F. Qu, C. Wen, An unsupervised possibilistic c-means clustering algorithm with data reduction, in: 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2013, pp. 29-33. doi:10.1109/FSKD.2013.6816161.
[8] H. Q. Truong, L. T. Ngo, W. Pedrycz, Granular fuzzy possibilistic c-means clustering approach to DNA microarray problem, Knowledge-Based Systems 133 (2017) 53-65.
[9] N. R. Pal, K. Pal, J. C. Bezdek, A mixed c-means clustering model, in: Proceedings of the 6th International Fuzzy Systems Conference, Vol. 1, 1997, pp. 11-21. doi:10.1109/FUZZY.1997.616338.
[10] N. R. Pal, K. Pal, J. M. Keller, J. C. Bezdek, A possibilistic fuzzy c-means clustering algorithm, IEEE Transactions on Fuzzy Systems 13 (4) (2005) 517-530. doi:10.1109/TFUZZ.2004.840099.
[11] M.-S. Yang, K.-L. Wu, Unsupervised possibilistic clustering, Pattern Recognition 39 (2006) 5-21.
[12] M. B. Ferraro, P. Giordani, Possibilistic and fuzzy clustering methods for robust analysis of non-precise data, International Journal of Approximate Reasoning 88 (2017) 23-38. doi:10.1016/j.ijar.2017.05.002.
[13] B. Liu, S.-X. Xia, Y. Zhou, X.-D. Han, A sample-weighted possibilistic fuzzy clustering algorithm 40 (2012) 371-375.


[14] A. Schneider, Weighted possibilistic c-means clustering algorithms, in: Ninth IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2000), Vol. 1, 2000, pp. 176-180. doi:10.1109/FUZZY.2000.838654.
[15] K.-L. Wu, M.-S. Yang, Alternative c-means clustering algorithms, Pattern Recognition 35 (2002) 2267-2278.
[16] M.-S. Yang, K.-L. Wu, A possibilistic type of alternative fuzzy c-means, in: Proceedings of the 2002 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'02), Vol. 2, 2002, pp. 1456-1459. doi:10.1109/FUZZ.2002.1006719.
[17] W. Xiang, G. Rui, L. Jizhong, G. Xiaoying, W. Lina, L. Wei, L. Zhiying, Z. Chi, Z. Ke, A novel alternative weighted fuzzy c-means algorithm and cluster validity analysis, in: 2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, Vol. 2, 2008, pp. 130-134. doi:10.1109/PACIIA.2008.286.
[18] L. Gang, Y. Chunlei, Z. Qianqian, Z. Dan, Neighborhood weight fuzzy c-means kernel clustering based infrared image segmentation, in: 2014 IEEE International Conference on Information and Automation (ICIA), 2014, pp. 451-454. doi:10.1109/ICInfA.2014.6932698.
[19] M. Girolami, Mercer kernel-based clustering in feature space, IEEE Transactions on Neural Networks 13 (3) (2002) 780-784. doi:10.1109/TNN.2002.1000150.
[20] D. Zhang, S. Chen, Fuzzy clustering using kernel method, in: 2002 International Conference on Control and Automation (ICCA), 2002, pp. 162-163.
[21] D.-Q. Zhang, S.-C. Chen, Clustering incomplete data using kernel-based fuzzy c-means algorithm, Neural Processing Letters 18 (3) (2003) 155-162. doi:10.1023/B:NEPL.0000011135.19145.1b.
[22] T. Li, L. Zhang, W. Lu, H. Hou, X. Liu, W. Pedrycz, C. Zhong, Interval kernel fuzzy c-means clustering of incomplete data, Neurocomputing 237 (2017) 316-331. doi:10.1016/j.neucom.2017.01.017.
[23] M.-S. Yang, H.-S. Tsai, A Gaussian kernel-based fuzzy c-means algorithm with a spatial bias correction, Pattern Recognition Letters 29 (2008) 1713-1725.
[24] M. R. P. Ferreira, F. D. A. T. De Carvalho, Kernel fuzzy c-means with automatic variable weighting, Fuzzy Sets and Systems 237 (2014) 1-46. doi:10.1016/j.fss.2013.05.004.
[25] B. Shanmugapriya, M. Punithavalli, A new kernelized fuzzy possibilistic c-means for high dimensional data clustering based on kernel-induced distance measure, in: 2013 International Conference on Computer Communication and Informatics, 2013, pp. 1-5. doi:10.1109/ICCCI.2013.6466319.
[26] W. Huang, S.-K. Oh, W. Pedrycz, Design of hybrid radial basis function neural networks (HRBFNNs) realized with the aid of hybridization of fuzzy clustering method (FCM) and polynomial neural networks (PNNs), Neural Networks 60 (2014) 166-181.
[27] M. Filippone, F. Masulli, S. Rovetta, Applying the possibilistic c-means algorithm in kernel-induced spaces, IEEE Transactions on Fuzzy Systems 18 (3) (2010) 572-584. doi:10.1109/TFUZZ.2010.2043440.
[28] S. Askari, N. Montazerin, M. Zarandi, Generalized possibilistic fuzzy c-means with novel cluster validity indices for clustering noisy data, Applied Soft Computing 53. doi:10.1016/j.asoc.2016.12.049.
[29] S. Askari, N. Montazerin, M. F. Zarandi, E. Hakimi, Generalized entropy based possibilistic fuzzy c-means for clustering noisy data and its convergence proof, Neurocomputing 219 (2017) 186-202. doi:10.1016/j.neucom.2016.09.025.
[30] O. S. Pianykh, Analytically tractable case of fuzzy c-means clustering, Pattern Recognition 39 (1) (2006) 35-46. doi:10.1016/j.patcog.2005.06.005.
[31] J. C. Bezdek, R. Ehrlich, W. E. Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (1984) 191-203.
[32] R. J. Hathaway, Y. Hu, Density-weighted fuzzy c-means clustering, IEEE Transactions on Fuzzy Systems 17 (1) (2009) 243-252. doi:10.1109/TFUZZ.2008.2009458.
[33] D. F. Andrews, Plots of high-dimensional data, Biometrics 28 (1) (1972) 125-136.
[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[35] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems 17 (2) (2001) 107-145. doi:10.1023/A:1012801612483.
[36] W. Wang, Y. Zhang, On fuzzy cluster validity indices, Fuzzy Sets and Systems 158 (19) (2007) 2095-2117. doi:10.1016/j.fss.2007.03.004.
[37] J. Bezdek, Cluster validity with fuzzy sets, Journal of Cybernetics 3 (1973) 58-73. doi:10.1080/01969727308546047.
[38] A. M. Bensaid, L. O. Hall, J. C. Bezdek, L. P. Clarke, M. L. Silbiger, J. A. Arrington, R. F. Murtagh, Validity-guided (re)clustering with applications to image segmentation, IEEE Transactions on Fuzzy Systems 4 (2) (1996) 112-123. doi:10.1109/91.493905.
[39] X. L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (8) (1991) 841-847. doi:10.1109/34.85677.
[40] J. Dunn, Well-separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4 (1974) 95-104. doi:10.1080/01969727408546059.
[41] W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66 (336) (1971) 846-850.
[42] Z. Ansari, M. Azeem, W. Ahmed, A. Vinaya Babu, Quantitative evaluation of performance and validity indices for clustering the web navigational sessions, World of Computer Science and Information Technology Journal 1.
[43] R. He, W. S. Zheng, T. Tan, Z. Sun, Half-quadratic-based iterative minimization for robust sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2) (2014) 261-275. doi:10.1109/TPAMI.2013.102.
[44] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.
[45] G. Xu, Z. Cao, B.-G. Hu, J. C. Principe, Robust support vector machines based on the rescaled hinge loss function, Pattern Recognition 63 (2017) 139-148. doi:10.1016/j.patcog.2016.09.045.


Conflict of Interest and Authorship Confirmation

Authors: Atieh Gharib; Hadi Sadoghi Yazdi; Amir Hossein Taherinia


[email protected]


- We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
- We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
- We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
- We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from

Affiliation: Ferdowsi University of Mashhad (all three authors)