Fuzzy C-Means clustering of incomplete data based on probabilistic information granules of missing values


Accepted Manuscript

Fuzzy C-Means Clustering of Incomplete Data Based on Probabilistic Information Granules of Missing Values

Liyong Zhang, Wei Lu, Xiaodong Liu, Witold Pedrycz, Chongquan Zhong

PII: S0950-7051(16)00078-2
DOI: 10.1016/j.knosys.2016.01.048
Reference: KNOSYS 3420

To appear in: Knowledge-Based Systems

Received date: 3 August 2015
Revised date: 30 November 2015
Accepted date: 31 January 2016

Please cite this article as: Liyong Zhang, Wei Lu, Xiaodong Liu, Witold Pedrycz, Chongquan Zhong, Fuzzy C-Means Clustering of Incomplete Data Based on Probabilistic Information Granules of Missing Values, Knowledge-Based Systems (2016), doi: 10.1016/j.knosys.2016.01.048

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Fuzzy C-Means Clustering of Incomplete Data Based on Probabilistic Information Granules of Missing Values

Liyong Zhang a,b,*, Wei Lu a, Xiaodong Liu a, Witold Pedrycz c,d,e, Chongquan Zhong a

a School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China
b Department of Engineering Mechanics, Dalian University of Technology, Dalian 116024, China
c Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6R 2V4 AB, Canada
d Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, Saudi Arabia
e Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Abstract: Missing values are a common phenomenon when dealing with real-world data sets, and the analysis of incomplete data has become an active area of research. In this paper, we focus on the problem of clustering incomplete data, with the aim of introducing prior distribution information of the missing values into the fuzzy clustering algorithm. First, non-parametric hypothesis testing is employed to describe the missing values that adhere to a Gaussian distribution as probabilistic information granules, based on the nearest neighbors of the incomplete data. Second, we propose a novel clustering model in which the probabilistic information granules of missing values are incorporated into the Fuzzy C-Means clustering of incomplete data through the maximum likelihood criterion. Third, the clustering model is optimized by a tri-level alternating optimization derived with the method of Lagrange multipliers. The convergence and the time complexity of the clustering algorithm are also discussed. Experiments reported on both synthetic and real-world data sets demonstrate that the proposed approach can effectively realize clustering of incomplete data.

Keywords: Fuzzy clustering; Incomplete data; Missing value; Probabilistic information granules; Alternating optimization.

*Corresponding author. Tel.: +86 411 84707335. E-mail address: [email protected] (L. Zhang).


1. Introduction

In real-world problems, data often suffer from missing values due to deficiencies of data collection or other reasons. It is important to analyze incomplete data with missing values and to extract useful features by means of clustering and, more specifically, fuzzy clustering. The Fuzzy C-Means (FCM) algorithm, as a clustering technique, has not only been widely applied to clustering complete data, but can also be adapted to cope with incomplete data.

With the ultimate objective of dealing with incomplete data clustering, a number of approaches have been discussed in the literature, among which the most influential are the four strategies proposed by Hathaway and Bezdek that make FCM clustering capable of coping with incomplete data [1]. 1) The whole data strategy (WDS) discards the data with missing values and partitions the remaining complete data using the standard FCM algorithm. 2) The partial distance strategy (PDS) ignores missing attribute values and introduces the partial distance, as proposed by Dixon [2], into the FCM algorithm to realize incomplete data clustering. 3) The optimal completion strategy (OCS) regards the missing values as additional variables, so that estimation of missing values and data partitioning are carried out simultaneously by optimizing the FCM clustering objective function. 4) The nearest prototype strategy (NPS) can be regarded as a simple modification of the OCS, in which the missing values are replaced by the corresponding attribute values of the nearest cluster prototype in each step of the iterative calculation.

PDS is a simple and effective vehicle for clustering incomplete data. On the basis of the PDS, Furukawa et al. proposed an FCM clustering algorithm that can handle mixed numerical and categorical incomplete data [3]. In order to solve the clustering problem with incomplete nominal and numerical data, Guan and Feng presented a new hierarchical clustering model by improving the measure for the numerical values in the mixed attribute data in terms of the PDS [4]. Zhang and Chen designed a weighted possibilistic c-means clustering algorithm for incomplete data, in which the PDS is applied to calculate the distance between any two objects in the incomplete data set, and low weight values are assigned to incomplete data objects to reduce the corruption caused by missing values [5].

OCS represents a very different incomplete data clustering approach, in which missing values are regarded as variables and handled not before clustering but during the clustering process itself. Zhang and Chen extended the OCS algorithm to the kernel-based Fuzzy C-Means and achieved better clustering performance for incomplete data [6]. Himmelspach and Conrad presented an extension of existing OCS clustering algorithms for incomplete data, in which a new membership degree is determined for estimating missing values by taking information about the dispersion of clusters into account [7]. Lim et al. proposed a hybrid neural network comprising fuzzy ARTMAP and Fuzzy C-Means clustering for pattern classification with incomplete training and test data, in which the estimates of missing values are obtained by OCS or NPS and then applied to the retraining of the network with the normal FAM algorithm [8]. Balkis and Yahia proposed a fuzzy self-organizing map clustering method for handling incomplete data through the use of OCS for missing value estimation during the learning process; the method can estimate the number of clusters and simultaneously partition the data [9].

The expectation-maximization (EM) algorithm is another important method that provides a model-based tool for clustering incomplete data and estimating missing values. On the basis of the EM algorithm, Adhikari and Hollmen proposed fast training of a series of mixture models using progressive merging of mixture components to facilitate model selection and make an appropriate choice of the model [10]. Lin et al. developed a novel EM algorithm assisted by auxiliary indicator matrices, which can handle mixtures of multivariate Gaussian distributions and perform supervised clustering of incomplete data with high computational efficiency [11]. Subsequently, Lin designed fourteen members of eigen-decomposed t mixture models with missing information and developed another computationally flexible EM-type algorithm, which can be applied to the model-based clustering of incomplete data [12]. Zhou et al. presented a novel framework for clustering distributed data streams containing noisy or incomplete data records, which combines the advantages of the Gaussian mixture model based EM algorithm with a test-and-cluster strategy [13]. Zhang et al. combined the EM algorithm with a multiple kernel density estimator based on the inverse probability weighting technique for solving classification tasks with incomplete data, and achieved good performance with high classification accuracies [14].

A number of other methods for clustering incomplete data have also been proposed. Honda et al. proposed hybrid techniques of fuzzy clustering and principal component analysis for partitioning incomplete data, which provide useful tools for interpretation of the local structures of incomplete data [15-17]. Chou et al. developed an approach based on combining FCM and Dempster-Shafer theory that makes the final decision on which class an incomplete datum should belong to [18]. Siminski preprocesses missing data using both marginalisation and imputation, and presents his clustering results in the form of a type-2 fuzzy set/rough fuzzy set [19]. Ghannad-Rezaie et al. proposed a two-stage classification approach for incomplete data, in which a set of classifiers are trained on different feature spaces (subsets) in the selection stage and the results of these classifiers are then combined in the fusion stage; in effect, the method is a classification approach into which the idea of information fusion is injected [20]. Hathaway and Bezdek proposed an approach for clustering on the basis of incomplete dissimilarity data, in which the incomplete data are first estimated using simple triangle inequality-based approximation schemes and then clustered using the non-Euclidean relational FCM algorithm [21]; subsequently, Yamamoto et al. further modified this clustering model [22]. Timm et al., motivated by the fact that the very absence of values provides additional information for the classification of the data set, extended the model of the fuzzy maximum likelihood estimation algorithm by introducing a class-specific probability of missing values in order to appropriately assign incomplete data points to clusters [23]. He et al. modified the objective function of the original FCM with a regularizing function in the form of total variation for solving the segmentation of images with noisy and incomplete data [24]. Wu et al. proposed a design of initial prototypes based on the idea of the nearest neighbors, and then improved the k-prototypes algorithm to partition incomplete data sets with mixed numeric and categorical attributes [25].

The nearest neighbor scheme is an effective approach to obtain estimates of missing values that aid the clustering of incomplete data, since the nearest neighbors are expected to have similar characteristics to the corresponding incomplete datum [26]. The nearest neighbor scheme finds q nearest neighbors of an incomplete datum among all complete and incomplete data in a given data set, and then fills in the missing value with the mean of the neighbors or the most frequent value occurring in the neighbors [27]. For instance, Van and Khoshgoftaar used complete and incomplete data to find the q nearest neighbors of incomplete data, and the missing values are replaced by the mean values of the corresponding attributes of these neighbors [28]; Li et al. imputed the missing values with the most frequent values and the geometric center values in the ranges of the neighbors, respectively, and then clustered the incomplete data using the generic FCM algorithm [29].

In the above nearest neighbor methods, the missing values are represented by numerical values. The interval, as an important data type, has been widely studied in fuzzy clustering [30]. In our previous research [31-32], the q nearest neighbors of incomplete data are used to determine nearest neighbor intervals that describe the missing values; we consider the missing values as variables constrained by these intervals and propose hybrid optimization approaches combining particle swarm optimization with the FCM for incomplete data clustering. The proposed interval representation helps to integrate the uncertainty of missing values into the clustering algorithm for incomplete data, yet these methods still exhibit deficiencies in two respects. 1) The interval descriptions of missing values only utilize the minimum and the maximum of the corresponding attribute values of the q nearest neighbors to indicate the uncertainty of the missing values, while the actual distributions of the missing values are not taken into account. 2) The interval descriptions constrain missing values to bounded closed sets; in this case, optimization algorithms such as GA, PSO and the like usually need to be involved in further optimization, which leads to relatively low computational efficiency.

In this study, we view the missing values as variables and investigate their description as probabilistic information granules. In view of the fact that Gaussian distributions are commonly encountered and easy to process, non-parametric hypothesis testing is employed, on the basis of the nearest neighbor information of incomplete data, to determine whether missing values obey Gaussian distributions. If this holds, the missing values are described as probabilistic information granules; otherwise, they are regarded as ordinary variables. Then, in order to integrate the local distribution information of missing values into incomplete data clustering, we design a novel clustering model hybridized with the maximum likelihood criterion and derive a tri-level alternating optimization of cluster prototypes, memberships and missing values by using a gradient-based method. The proposed approach improves the clustering performance and ensures overall computational efficiency. It can be viewed as an effective extension of the OCS clustering for incomplete data in which the local distribution information of missing values is taken into account.

The paper is organized as follows. Section 2 presents a short description of the FCM clustering for complete and incomplete data. Section 3 discusses the formation of probabilistic information granule descriptions of missing values based on the q nearest neighbors. The novel clustering approach for incomplete data based on probabilistic information granules is proposed in Section 4. Section 5 presents clustering results produced for several synthetic and real-world incomplete data sets. Finally, Section 6 draws conclusions.

2. Fuzzy C-Means clustering of complete data and incomplete data

2.1. Fuzzy C-Means algorithm

The Fuzzy C-Means (FCM) algorithm is one of the basic clustering techniques. FCM partitions a set of object data $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^s$ into $c$ fuzzy clusters by minimizing the distance-based objective function

$$J(\mathbf{U}, \mathbf{V}) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \, \| x_k - v_i \|_2^2, \quad (1)$$

where $x_k = (x_{1k}, x_{2k}, \ldots, x_{sk})^T$ is an object datum and $x_{jk}$ is the $j$th attribute value of $x_k$; $v_i \in \mathbb{R}^s$ is the $i$th cluster prototype, and the matrix of cluster prototypes is $\mathbf{V} = [v_{ji}] = [v_1, v_2, \ldots, v_c] \in \mathbb{R}^{s \times c}$; $u_{ik}$ is the membership value that represents the degree to which $x_k$ belongs to the $i$th cluster, $\forall i,k: u_{ik} \in [0,1]$, and satisfies the following condition

$$\sum_{i=1}^{c} u_{ik} = 1 \quad \text{for } k = 1, 2, \ldots, n, \quad (2)$$

and the partition matrix is $\mathbf{U} = [u_{ik}] \in \mathbb{R}^{c \times n}$; $m$ is a fuzzification parameter, $m \in (1, +\infty)$; $\| \cdot \|_2$ denotes the Euclidean norm in $\mathbb{R}^s$.

The necessary conditions for minimizing (1) under the constraint (2) result in the following iterative update formulas for the prototypes and the partition matrix [33]:

$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^m x_k}{\sum_{k=1}^{n} u_{ik}^m} \quad \text{for } i = 1, 2, \ldots, c, \quad (3)$$

and

$$u_{ik} = \left( \sum_{t=1}^{c} \left( \frac{\| x_k - v_i \|_2^2}{\| x_k - v_t \|_2^2} \right)^{\frac{1}{m-1}} \right)^{-1} \quad \text{for } i = 1, 2, \ldots, c \text{ and } k = 1, 2, \ldots, n. \quad (4)$$

The iterations are carried out until the changes in the values of the partition matrix reported in consecutive iterations are lower than a certain predetermined threshold.
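As an illustration of the alternating updates (3) and (4), a minimal NumPy sketch is given below; it is only an illustrative sketch rather than a reference implementation, and the function and variable names are chosen for exposition.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-6, max_iter=300, seed=0):
    """Illustrative FCM sketch: X is an (n, s) array of complete data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)                 # constraint (2)
    V = None
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # prototype update (3)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances, shape (c, n)
        d2 = np.fmax(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)  # membership update (4)
        if np.abs(U_new - U).max() < eps:             # stopping rule on the partition matrix
            U = U_new
            break
        U = U_new
    return U, V
```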


2.2. Fuzzy C-Means clustering of incomplete data

In this subsection, four strategies based on the FCM algorithm, proposed by Hathaway and Bezdek for handling incomplete data clustering, are introduced. First, we introduce some required terminology and notation, which are consistent with those discussed in [1]. Let

$$X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^s \quad (5)$$

be an incomplete data set in s-dimensional real space,

$$X_W = \{x_k \in X \mid x_k \text{ is a complete datum}\} \quad (6)$$

be the whole-data subset of $X$,

$$X_P = \{x_{jk} \mid \text{the value of } x_{jk} \text{ is present in } x_k, \ 1 \le j \le s, \ 1 \le k \le n\} \quad (7)$$

be the set of the available attribute values, and

$$X_M = \{x_{jk} \mid \text{the value of } x_{jk} \text{ is missing from } x_k, \ 1 \le j \le s, \ 1 \le k \le n\} \quad (8)$$

be the set of the missing attribute values.

1) Whole Data Strategy

The whole data strategy (WDS) is a simple method for clustering incomplete data. When the proportion of incomplete data is small, WDS simply deletes all incomplete data and applies the standard FCM directly to the remaining complete data subset $X_W$. In the WDS algorithm, the prototypes and the memberships of the data vectors in $X_W$ can be calculated by using the alternating optimization of (3) and (4), and the memberships of the data vectors in $X \setminus X_W$ can be estimated by using a nearest-prototype classification scheme based on the following partial distance from each incomplete datum to each of the computed cluster prototypes:

$$D_{ik} = \frac{s}{\sum_{j=1}^{s} I_{jk}} \sum_{j=1}^{s} (x_{jk} - v_{ji})^2 I_{jk}, \quad (9)$$

where

$$I_{jk} = \begin{cases} 0, & \text{if } x_{jk} \in X_M \\ 1, & \text{if } x_{jk} \in X_P \end{cases} \quad \text{for } j = 1, 2, \ldots, s \text{ and } k = 1, 2, \ldots, n. \quad (10)$$

2) Partial Distance Strategy

Unlike the WDS algorithm, the partial distance strategy (PDS) only ignores the missing attribute values in $X_M$ and makes full use of all available attribute values in $X_P$ to cluster incomplete data. In the PDS algorithm, the partial distance (9) between an incomplete datum and a cluster prototype is introduced directly into (4), i.e., the membership update formula of the FCM algorithm, and the cluster prototypes are updated in the following manner:

$$v_{ji} = \frac{\sum_{k=1}^{n} u_{ik}^m I_{jk} x_{jk}}{\sum_{k=1}^{n} u_{ik}^m I_{jk}} \quad \text{for } j = 1, 2, \ldots, s \text{ and } i = 1, 2, \ldots, c. \quad (11)$$

In contrast to the WDS algorithm, PDS is applicable to cases where the proportion of missing values in the incomplete data is larger, owing to the smaller amount of discarded information.
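The partial distance (9) is easy to vectorize. The following sketch, with names of our choosing and a placeholder convention for missing entries that is our assumption, computes the partial distances from all data to all prototypes; PDS then feeds these distances into the membership update (4) and uses (11) for the prototypes.

```python
import numpy as np

def partial_distances(X, V, present):
    """Partial distances D_ik of (9).

    X: (n, s) data with an arbitrary placeholder (e.g. 0) at missing positions,
    V: (c, s) cluster prototypes,
    present: (n, s) boolean mask, True where x_jk is available (I_jk = 1).
    Returns a (c, n) array of partial distances.
    """
    n, s = X.shape
    diff2 = (X[None, :, :] - V[:, None, :]) ** 2      # (c, n, s)
    diff2 = diff2 * present[None, :, :]                # ignore missing attributes
    counts = present.sum(axis=1)                       # available attributes per datum
    return (s / np.fmax(counts, 1))[None, :] * diff2.sum(axis=2)
```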


3) Optimal Completion Strategy

The optimal completion strategy (OCS) is one of the most influential approaches for clustering incomplete data. In the OCS algorithm, the missing values in $X_M$ are viewed as additional variables. The clustering objective function (1) of the FCM is revised as

$$J(\mathbf{U}, \mathbf{V}, X_M) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^m \, \| x_k - v_i \|_2^2. \quad (12)$$

Zeroing the gradient of (12), we derive (3), (4), and obtain the following expression

$$x_{jk} = \frac{\sum_{i=1}^{c} u_{ik}^m v_{ji}}{\sum_{i=1}^{c} u_{ik}^m} \quad \text{for } j = 1, 2, \ldots, s \text{ and } k = 1, 2, \ldots, n. \quad (13)$$

Given the available attribute set $X_P$, the OCS algorithm is a tri-level alternating optimization of (3), (4) and (13). Formula (13) indicates that the unknown data can be replaced by an estimate corresponding to a fuzzy mean, so the OCS algorithm belongs to the class of imputation methods.

4) Nearest Prototype Strategy

The nearest prototype strategy (NPS) can be regarded as a simple modification of the OCS. The missing value is replaced by the corresponding attribute value of the nearest cluster prototype in each iteration, that is, the expression

$$x_{jk}^{(l)} = v_{ji}^{(l)}, \quad D_{ik} = \min\{D_{1k}, D_{2k}, \ldots, D_{ck}\}, \quad (14)$$

can be used to update missing values based on the partial distance (9) between the incomplete datum and the prototypes. Here $l$ ($l = 1, 2, \ldots$) is the iteration index.
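For concreteness, the fuzzy-mean imputation (13) used by OCS can be sketched as follows; again this is only an illustrative sketch with our own naming conventions.

```python
import numpy as np

def ocs_impute(U, V, missing, m=2.0):
    """OCS-style fuzzy-mean imputation of (13).

    U: (c, n) partition matrix, V: (c, s) prototypes,
    missing: (n, s) boolean mask, True where x_jk is missing.
    Returns an (n, s) array whose missing positions hold the estimates.
    """
    Um = U ** m                                        # (c, n)
    est = (Um.T @ V) / Um.sum(axis=0)[:, None]         # (n, s) fuzzy means
    return np.where(missing, est, 0.0)
```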

3. Probabilistic information granules of missing values

For incomplete data clustering, how the missing values are handled is a basic step that influences the clustering performance. In this study, besides regarding the missing values as variables, we describe them as probabilistic information granules based on the nearest neighbors of the incomplete data.

3.1. The q nearest neighbors of incomplete data

The missing values are usually associated with the corresponding attribute values of their closely positioned neighbors. The nearest neighbor method has become one of the most commonly used techniques to deal with missing values by introducing a degree of similarity between data.

Here we adopt the partial Euclidean distance to calculate the distance between data in an incomplete data set. Let an s-dimensional incomplete data set $X = \{x_1, x_2, \ldots, x_n\}$ contain at least one incomplete datum with some (but not all) missing attribute values; the partial distance between $x_b$ and $x_p$ is calculated in the form [31]:

$$D_{pb} = \frac{s}{\sum_{j=1}^{s} I_{jpb}} \sum_{j=1}^{s} (x_{jp} - x_{jb})^2 I_{jpb}, \quad (15)$$

where $x_{jb}$ and $x_{jp}$ are the $j$th attributes of $x_b$ and $x_p$, respectively, and

$$I_{jpb} = \begin{cases} 1, & \text{if } x_{jb}, x_{jp} \in X_P \\ 0, & \text{otherwise} \end{cases} \quad \text{for } p, b = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, s. \quad (16)$$

According to the distance (15), the q nearest neighbors of an incomplete datum $x_k$ can be selected, whose respective $j$th attribute values are given superscripts $x_{jk}^{(1)}, x_{jk}^{(2)}, \ldots, x_{jk}^{(q)}$ according to the non-descending order, that is,

$$x_{jk}^{(1)} \le x_{jk}^{(2)} \le \cdots \le x_{jk}^{(q)}. \quad (17)$$
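A possible implementation of the neighbor search based on (15) and (16) is sketched below; the masking convention and function names are our assumptions rather than part of the original description.

```python
import numpy as np

def q_nearest_neighbors(X, present, k, q=8):
    """Find the q nearest neighbors of datum k using the partial distance (15)-(16).

    X: (n, s) data with any placeholder in the missing positions,
    present: (n, s) boolean mask of available values, k: index of an incomplete datum.
    Returns the indices of the q nearest neighbors of x_k.
    """
    n, s = X.shape
    shared = present & present[k]                      # I_jpb = 1 only if both values are present
    diff2 = ((X - X[k]) ** 2) * shared                 # (n, s)
    counts = shared.sum(axis=1)
    d = (s / np.fmax(counts, 1)) * diff2.sum(axis=1)   # partial distance to x_k
    d[k] = np.inf                                      # exclude the datum itself
    d[counts == 0] = np.inf                            # no shared attributes: not comparable
    return np.argsort(d)[:q]
```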

3.2. Probabilistic information granules based on the q nearest neighbors

Information granules have become important constructs of knowledge representation. Intervals, fuzzy sets, rough sets and uncertain variables are examples of information granules captured with the use of different formalisms [34-35]. Here we describe missing values as probabilistic information granules based on the above q nearest neighbors.

General probabilistic information granules can be represented by probability density functions

$$p(x_{jk}) \quad \text{for } x_{jk} \in X_M, \quad (18)$$

or by numerical characteristics of the missing values as follows:

$$\mu_{jk} = \frac{1}{q} \sum_{i=1}^{q} x_{jk}^{(i)}, \quad (19)$$

$$\sigma_{jk}^2 = \frac{1}{q-1} \sum_{i=1}^{q} \left( x_{jk}^{(i)} - \mu_{jk} \right)^2, \quad (20)$$

$$\mathrm{Kurtosis}_{jk} = \frac{\sum_{i=1}^{q} \left( x_{jk}^{(i)} - \mu_{jk} \right)^4}{(q-1)\,\sigma_{jk}^4}, \quad (21)$$

where $\mu_{jk}$, $\sigma_{jk}^2$ and $\mathrm{Kurtosis}_{jk}$ denote the mean value, variance, and kurtosis, respectively.

A general probability density function is difficult to incorporate into incomplete data clustering. Therefore, we only consider the case when missing values adhere to Gaussian distributions, viz.

$$p(x_{jk}) = \frac{1}{\sqrt{2\pi}\,\sigma_{jk}} \, e^{-\frac{(x_{jk}-\mu_{jk})^2}{2\sigma_{jk}^2}}, \quad \sigma_{jk}^2 > 0, \quad \text{for } x_{jk} \in X_M. \quad (22)$$
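The numerical characteristics (19)-(21) of a probabilistic information granule can be computed directly from the neighbor attribute values, for example as in the following illustrative sketch.

```python
import numpy as np

def granule_statistics(neighbor_values):
    """Mean, variance and kurtosis of the neighbor attribute values, as in (19)-(21)."""
    x = np.asarray(neighbor_values, dtype=float)
    q = x.size
    mu = x.mean()                                          # (19)
    var = ((x - mu) ** 2).sum() / (q - 1)                  # (20)
    kurt = ((x - mu) ** 4).sum() / ((q - 1) * var ** 2)    # (21)
    return mu, var, kurt
```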

Non-parametric hypothesis tests can be used to determine whether missing values obey Gaussian distributions. The commonly used methods for testing normality mainly include the $\chi^2$ goodness-of-fit test, the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the Shapiro-Francia test [36]. Each method realizes the testing by invoking a different test statistic. Here we adopt the Shapiro-Wilk test and the Shapiro-Francia test to check the normality of the attribute values corresponding to the missing values in the q nearest neighbors of incomplete data, because these two tests are more suitable for relatively small sample sizes. When the kurtosis value of the tested samples is low, we select the Shapiro-Wilk test; otherwise, the Shapiro-Francia test is used.

, x(jkq )

of the q nearest neighbors have been sorted according to the non-descending order shown as (17), so we

11

ACCEPTED MANUSCRIPT

directly arrive at the Shapiro-Wilk statistic [37-38]

W

(aT x jk ) 2 q

 (x i 1

(i ) jk

 ik )

,

(23)

2

T

(2) x jk   x(1) jk , x jk ,

, x(jkq )  ,

CR IP T

where

mT V 1  , aq   T 1 1 1/ 2 , (m V V m)

a   a1 , a2 , T

(24)

(25)

AN US

and m and V are the mean vector and the covariance matrix of the order statistics for a standard Gaussian sample.

The Shapiro-Francia statistic W  is the modification of the Shapiro-Wilk statistic W and reads as follows [37, 39]:

(mT x jk )2

m m ( x  ik ) T

(26)

2

, x(jkq ) are the samples drawn from population with Gaussian distribution, that is, the

PT

(2) H 0 : x(1) jk , x jk ,

(i ) jk

ED

i 1

We test the hypothesis

.

q

M

W 

missing value x jk adheres to the Gaussian distribution.

CE

Choose significant level  , if the value of test statistics is less than the  quantile, then reject the null

AC

hypothesis H 0 ; otherwise, accept the null hypothesis H 0 , and the missing value x jk is considered to adhere to Gaussian distribution.

4. Optimal Completion Strategy with probabilistic information granules In this section, we first present an overall framework of the proposed algorithm, which is outlined in Fig. 1. Here the complete data and the incomplete data are represented by symbols + and ○, respectively, and the cluster prototypes are represented by stars. 12

ACCEPTED MANUSCRIPT

CR IP T

Input: Incomplete data set

Normality hypothesis testing p( x jk ) ……

AN US

x jk

Probabilistic information granules of missing values

Clustering objective function

Model

Gradient-based optimization calculation

PT

ED

M

Algorithm

Cluster 2

CE

Cluster 1

Output: Clustering results

Fig. 1. An overal framework of the algorithm.

AC

From Fig. 1, it can be seen that the overall algorithm framework consists of two parts, which are

grouped into thick-line boxes. The first part concerns a way of building of probabilistic information granules of missing values, which has been addressed by using the normality hypothesis testing technique covered in Section 3. The second part concerns a proposed clustering objective function as well as the optimization of this function using gradient-based method. The combination of the two parts, i.e., a sound introduction of probabilistic information granules of missing values into the clustering objective function, 13

ACCEPTED MANUSCRIPT

is realized on the basis of maximum likelihood principle, which we will discuss in more details. 4.1. Clustering objective function and tri-level alternating optimization For a set of incomplete data X  x1 , x2 ,

, xn  , under the assumption that the missing values

x jk ( X M ) are random variables adhering to Gaussian distributions, i.e., x jk ~ N ( jk ,  jk ) , we

objective function c

n

J (U, V, x jk )   uikm x k  v i

2 2

1



x jk X M

2 jk



( x jk   jk )2

e

2 2jk

,

(27)

AN US

i 1 k 1

K

CR IP T

calculate U , V and all x jk ( X M ) to produce the smallest possible value of the following clustering

where K is a positive number called the gain factor. (27) has two components: the first component is the objective function of the OCS algorithm (12), which is to achieve the minimum to obtain compact clusters, and the second component is the likelihood function of missing values, which is to achieve the

M

maximum to obtain the most probable values of all x jk ( X M ) . In order to achieve the minimum value of the overall objective function (27), a minus sign is arranged in front of the likelihood function. It can

ED

be seen that the distribution information of missing values can be fully considered when clustering incomplete data based on the objective function (27).

PT

In order to make the optimization computationally feasible, we introduce logarithm likelihood function

CE

to the clustering objective function, and the second component of (27) is modified as

AC

ln



x jk X M

1 2 jk



e

( x jk   jk )2 2 2jk





( x jk   jk )2 2 jk2

x jk X M

C ,

(28)

where C is a constant, and

C



ln

x jk X M

Therefore, the objective function (27) can be modified as

14

1 2 jk

.

(29)

ACCEPTED MANUSCRIPT

c

n

J (U, V, x jk )   u

xk  vi

m ik

i 1 k 1

2 2

K

( x jk   jk )2



2 jk2

x jk X M

.

(30)

Because not all x jk ( X M ) are subject to Gaussian distributions, we introduce factors K jk to underline this

Hence, the ultimate clustering objective function is modified as c

n

J (U, V, x jk )   u

xk  vi

i 1 k 1

2 2

K



x jk X M

K jk

( x jk   jk )2 2 jk2

.

AN US

m ik

CR IP T

1, if x jk ~ N (  jk ,  jk ) K jk   , for x jk  X M . 0, otherwise

(31)

(32)

Proposition 1. If incomplete data clustering is obtained by minimizing the objective function (32) with the constraint of c

u

ik

 1 for k  1, 2,..., n ,

(2)

M

i 1

then the necessary conditions that should be met by cluster prototypes, memberships and missing values

ED

lead to the following update equations

vi 

AC

CE

PT

n

u k 1 n

m ik

xk

for i  1, 2,..., c ,

 uikm

(3)

k 1

  c  xk  vi uik       t 1  x k  v t 

   2 

2

1 m 1

2 2

    

1

for i  1, 2,..., c and k  1, 2,..., n ,

(4)

and

c

x jk 

2 uikm v ji  i 1

c

2 u  i 1

m ik

KK jk

 jk2 KK jk

 jk2 15

 jk for x jk  X M .

(33)

ACCEPTED MANUSCRIPT

Proof 1. Applying the method of Lagrange multipliers for finding an optimum solution of the objective function (32), we consider the following Lagrange function c

n

J  (U, V, x jk )   uikm xk  vi i 1 k 1

K 2



2

x jk X M

( x jk   jk )2

K jk

2 jk2

n

c

k 1

i 1

  k ( uik  1) ,

(34)

CR IP T

where k , k  1, 2,..., n , are Lagrange multipliers. Fixing the parameters uik and x jk (  X M ), the objective function attains its minimum when n J   2 uikm (xk  vi )  0 , vi k 1

AN US

and then (3) can be derived directly.

(35)

For fixed parameters v i and x jk (  X M ), J  is stationary when c J    uik  1  0 , k i 1

PT

From (37), we have

ED

M

J   muikm1 xk  vi uik

 k uik    m xk  vi 

2 2

 k  0 .

(36)

(37)

1

 m1  . 2  2 

(38)

AC

CE

Substituting (38) into (36), we obtain the following result

 k     m 

1 m 1

   x  v i 1 i 2   k c



1

 

1 m 1

 1.

2

(39)

In the sequel,

 k     m 

1 m 1

  c  1      i 1  x k  v i 

  2  2 

1 m 1

1

   .  

Substitute (40) into (38), the membership uik can be expressed as shown in (4).

16

(40)

ACCEPTED MANUSCRIPT

For fixed values of the parameters uik and v i , the objective function stabilizes when c x jk   jk J   2 uikm ( x jk  v ji )  KK jk  0. x jk  jk2 i 1

(41)



We start with solving (41) for x jk , which leads to (33).

CR IP T

Proposition 1 demonstrates that we just need to perform a tri-level alternating optimization of (3), (4) and (33) when probabilistic information granules of the missing values are introduced to the OCS clustering for incomplete data. The procedure of the OCS clustering with probabilistic information granules (POCS for short) can be described as follows.

AN US

Step 1) Determine the q nearest neighbors of each incomplete datum according to (15) and (16).

Step 2) Check the corresponding attribute values of the q nearest neighbors of each missing value to see whether they are governed by the Gaussian distribution or not,

if yes, determine the probability density function of the missing value using (19), (20) and (22),

M

and set K jk  1 ;

ED

else, set K jk  0 .

0 Step 3) Choose m , c , K and a threshold value   0 ; initialize partition matrix U  and the set

PT

of missing values X M(0) .

), calculate matrix of cluster prototypes V   using l

CE

Step 4) When the iteration index is l ( l  1, 2, (3) and U

 l 1

, X M(l 1) .

AC

 l 1 l  Step 5) Update the set of missing values X M(l ) using (33) and U , V .

Step 6) Update partition matrix U

l 

l 

using (4) and V , X M(l ) .

l   l 1 Step 7) If i, k : max uik  uik   , then stop and get partition matrix U , matrix of cluster

prototypes V and the set of missing values X M ; otherwise set l  l  1 and return to Step 4.

17

ACCEPTED MANUSCRIPT

4.2. Illustration 1) Discussion on the role of K To make the algorithm structure of POCS more transparent, we consider two extreme values of K as follows:

KK jk

c

lim x jk  lim K 0

2 uikm v ji 



i 1

c

K 0

2 u  i 1

m ik

2 jk

c

 jk 

KK jk

 jk2

u i 1 c

m ik

v ji

u

m ik

i 1

CR IP T

i) If K tends to 0, (33) produces the limit

for x jk  X M .

(42)

AN US

(42) is, in fact, the same as (13) in OCS, that is, the POCS algorithm degenerates to the OCS algorithm when K tends to 0.

ii) By the same token, under the assumption that all the missing values satisfy the assumption of Gaussian distribution and if K tends to infinity, we have c

c

2 u  m ik

ED

K 

i 1

KK jk

i 1



2 jk

KK jk

 jk

M

lim x jk  lim

K 

2 uikm v ji 

KK jk

 jk2



 jk2

 jk

KK jk

  jk for x jk  X M .

(43)

 jk2

PT

(43) shows that missing value is directly replaced by the mean value of corresponding attribute values of the q nearest neighbors when K tends to infinity. Then POCS performs the standard FCM clustering on

CE

the recovered data. In this case, our algorithm degenerates to the FCM algorithm based on filling in the missing values with the mean of the neighbors (NNM for short) [26, 29].

AC

From the above discussions, it can be seen that the larger K is, the proposed algorithm starts resembling the NNM algorithm, and the smaller K is, it becomes positioned close to the OCS algorithm, as shown in Fig. 2. The value of the K determines the reliance degree between the overall structure information of data set and the local nearest neighbor information of missing values when clustering incomplete data.

18

ACCEPTED MANUSCRIPT

0←K 1   m 1      2  

1

2

2 2

c KK  2 uikm v ji  2 jk  jk   jk i 1  x  c KK jk  jk m 2 uik  2   jk i 1  n  uikm x k   K→∞  k 1  vi  n m  uik   k 1  1 1   2 m 1   c  xk  vi 2     uik     x  v 2   t 2   t 1  k      

OCS

POCS

x jk   jk n  uikm x k    v i  k 1n  uikm   k 1      c  xk  vi  uik        t 1  x k  v t  

1   m 1  2  2    2  

1

2

CR IP T

c  uikm v ji   i 1  x jk  c  uikm   i 1  n  uikm x k   k 1  vi  n  uikm   k 1     c  xk  vi  u  ik      t 1  x k  v t   

NNM

Fig. 2. The relationship between the values of K and the algorithm structure.

AN US

2) Convergence of POCS

As a typical tri-level alternating optimization, POCS can achieve convergence. The general theory of numerical convergence of tri-level alternating optimization has been provided in [40] and is similar to that for the basic FCM algorithm.

M

3) The time complexities

ED

The time complexity of the standard FCM algorithm is O(nc 2 s) [41-43], where n is the number of object data, c is the number of clusters, and s is the dimension of data vectors. On this basis, we give the

PT

time complexities of the proposed algorithm and the compared algorithms as follows. i) The time complexity of NNM is O(nc 2 s) , since the NNM algorithm actually performs the standard

CE

FCM clustering on the recovered data. ii) WDS discards the data with missing values, and then performs the basic FCM clustering on the

AC

complete data subset X W . Thus, the time complexity of WDS is O( X W c 2 s) , where X W

denotes

the cardinality of X W . iii) PDS uses the partial Euclidean distance shown as (9) to update partition matrix and cluster prototypes in FCM clustering, but the complexity of the algorithm does not be increased. So the time complexity of PDS is O(nc 2 s) . 19

ACCEPTED MANUSCRIPT

iv) In OCS, the calculation of partition matrix requires nc 2 s operations, and the calculations of cluster prototypes and missing values both require ncs operations. In conclusion, the time complexity of the OCS algorithm is O(nc 2 s) . v) NPS and POCS have slight differences with the OCS algorithm only in the calculation of missing

CR IP T

values. By comparison, the most calculation costs of NPS and POCS are implied by the calculation of partition matrix, so the time complexities of these two algorithms are still O(nc 2 s) .

5. Experimental studies

AN US

In this section, in order to assess the validity of the POCS algorithm, we mainly designed simple tests of missing single attribute value and experiments of random missing attribute values at several missing rates considering synthetic data sets and real-world data sets.

The performance of the proposed POCS algorithm and the compared algorithms is evaluated in terms of

M

prototype error, classification error and imputation error. Among them, the prototype error is our main evaluation index, the other two as references. The main reason is that the clustering problem is much

ED

concerned with the obtained cluster prototypes. The prototype as a representative of the cluster is often used in many studies, such as granular computing. In addition, the prototype error is a continuous variable of the

PT

prototype position, which can evaluate the clustering performance more elaborately. The classification error is one of the main evaluation indices of the classification algorithms, which is a discrete variation index. In the

CE

clustering problem, the case that the number of misclassified does not change even though the cluster prototypes have changed a lot often occurs. The classification error is not particularly sensitive to evaluate the

AC

clustering performance. The imputation error is mainly used to evaluate the imputation accuracy of the POCS, NNM, OCS and NPS algorithms. Take the OCS algorithm as an example, the algorithm estimates numerous missing values only by means of the convex combination of the c cluster prototypes shown as (13), which may result in the less high imputation accuracy. The imputation step of the above methods forms a bridge to reach the incomplete data clustering, so the obtained imputation values can only be used as a reference for the evaluation of clustering performance.

20

ACCEPTED MANUSCRIPT

Here, we adopt the error sum of squares (ESS) between the obtained cluster prototypes and the actual ones c

ESS   vi  v*i i 1

2

(44)

2

CR IP T

to measure the prototype error, adopt the misclassifications compared with the actual data partition to measure the classification error, and adopt the mean absolute error (MAE) between the imputed values and the corresponding real values

MAE 

1 XM



x jk X M

x jk  x*jk

(45)

AN US

to measure the imputation error. In (44) and (45), v*i and x*jk denote the actual cluster prototype and the real value of missing value, respectively. If the actual prototypes are unknown, we replace them with the full data prototypes [1], and then the ESS index can measure the offset degree of the obtained

M

prototypes compared with the full data prototypes, caused by the missing values. The relevant parameters in the experiments are set as follows: fuzzification parameter m is set to 2 that

ED

is the most frequently used value, threshold value is  =10-6, the size of nearest neighbors q is set to 8. 5.1. Simple tests of missing single attribute value

PT

This group of tests is designed to test whether there exist any missing values with such a feature, namely, when the prior distribution information of missing values is introduced by adopting maximum

CE

likelihood criterion, appropriate guidance can also be introduced for the OCS strategy. We use three real-world data sets, namely, Wholesale customers, Seeds and Wine [44-45] to carry out this group of

AC

tests. Incomplete data sets are artificially generated on the basis of the above complete data sets by randomly selecting one component to be designated as missing. 1) Wholesale Customers data set The Wholesale Customers data set contains 440 six-dimensional vectors, which refers to clients of a wholesale distributor. It includes the annual spending in monetary units on diverse product categories. The data set can be divided into 2 categories according to the Channel index, in which 298 samples 21

ACCEPTED MANUSCRIPT

belong to the first category and the rest 142 samples belong to the second. We perform 10 experiments on 10 incomplete Wholesale Customers data sets, and the average experimental results are listed in Table 1. Table 1 Averaged results of 10 experiments using incomplete Wholesale Customers data sets WDS 6766.83 60.9 37.1

PDS 94.6391 61 35.8

NPS 51.8851 61 1503.1 35.8

OCS 54.4075 61 1653.8 35.8

NNM 33.5839 61 1292.1 35.8

POCS 32.5120 61 1371.7 35.8

CR IP T

Mean prototype error Mean number of misclassifications Mean imputation error Mean number of iterations to termination

From Table 1, it can be seen that the mean misclassifications of each algorithm except WDS are all 61, which cannot distinguish merit orders of various algorithms. The mean prototype error of WDS is far

AN US

worse than those of other methods for the reason that missing one attribute value is equivalent to ignoring a whole object datum for the WDS. The mean prototype error of PDS is the second worst. The above WDS and PDS algorithms can not estimate the missing values. The clustering results obtained by NPS and OCS exhibit little difference, and are much worse than those obtained by NNM and POCS. The

M

POCS algorithm has produced the cluster prototypes with the minimal deviation in the 10 experiments,

ED

but its imputation error is slightly larger than that of the NNM algorithm by a large margin due to the fact that 3 missing values do not adhere to Gaussian distribution and the OCS strategy is executed. Next, we

PT

discuss the effect of the gain factor K on the clustering performance of the POCS algorithm, which is shown in a logarithmic coordinate in Fig. 3.

Mean prototype error

AC

CE

55

50

45

40

35

30 -8

-7

-6

-5

-4

-3

-2

-1

0

1

2

lgK

Fig.3. Trend line of mean prototype error with the increase of K for incomplete Wholesale Customers data sets.

22

ACCEPTED MANUSCRIPT

By examining Fig. 3, we can see that when K is increased from 10-8 to 10-6, the mean prototype error of the 10 experiments has no significant change, consistent with that of the OCS algorithm, which suggests that the POCS algorithm for this case degenerates into the OCS algorithm. When K is increased from 1 to 100, the mean prototype error also no longer changes, which suggests that the POCS algorithm for this

CR IP T

case degenerates into the NNM algorithm if missing value adhering to Gaussian distribution. When K locates between 10-6 and 1, the mean prototype error begins to decrease and then increases with the increase of K, in which a turning point appears in the K=10-3 and the corresponding prototype error is 32.5120, better than those produced by the compared algorithms.

AN US

In what follows, we discard the 3rd dimensional attribute value of the 26th object datum and further present the detailed results. In this instance, the imputations estimated by OCS and NNM are 4312.56 and 2120.88 respectively, and the prototype errors produced by OCS and NNM are 16.7403 and 19.6438 respectively. The trend of prototype error of the POCS algorithm with the value of K is shown in Fig. 4.

M

25

15

ED

Prototype error

20

10

PT

5

0 -8

-7

-6

-5

-4

-3

-2

-1

0

1

2

lgK

CE

Fig. 4. Trend line of prototype error with the increase of K for an incomplete Wholesale Customers data set.

AC

NNM does not reach the ideal clustering results for its absolute dependence on the mean of the near neighbors. OCS calculates imputation of missing value based on (13). From (13), we discover that the imputation is actually the convex combination of the corresponding attribute values of cluster prototypes, and in the case of the number of clusters c=2, the imputation must belong to the interval formed by the corresponding attribute values of the two cluster prototypes. If missing value were not within the range of the above interval, OCS is not likely to get an accurate imputation, as shown in Fig. 5. 23

ACCEPTED MANUSCRIPT

Imputation by OCS

xjk

v ji min

v ji max

the jth attribute

CR IP T

Fig. 5. The range of imputation for the OCS algorithm.

Specific to this instance, according to the full data prototypes, we can determine the OCS imputation of the 3rd dimensional missing attribute about in the range of [4238.87 17738.2], but the discarded real value is 3250, so the OCS algorithm is powerless. When K=10-3, the imputation estimated by POCS is 2824.10 that is outside the above interval and close to the discarded true value, which is actually caused

AN US

by introducing the prior distribution information of missing value into the OCS strategy. Eventually, the prototype error produced by POCS is 2.7781, far better than those produced by OCS and NNM. 2) Seeds data set

The Seeds data set contains 210 seven-dimensional vectors. The data come from the measurements of

M

geometrical properties of kernels belonging to three different varieties of wheat: Kama, Rosa and

ED

Canadian, 70 samples each. We perform 10 experiments on 10 incomplete Seeds data sets, the average experimental results are listed in Table 2, and the trend of mean prototype error of the POCS algorithm

PT

with the value of K is shown in Fig. 6.

Mean prototype error

AC

CE

9

x 10

-5

8

7

6

5

4 -7

-6

-5

-4

-3

-2

-1

0

1

2

3

lgK

Fig. 6. Trend line of mean prototype error with the increase of K for incomplete Seeds data sets.

24

ACCEPTED MANUSCRIPT

Table 2 Averaged results of 10 experiments using incomplete Seeds data sets

Mean prototype error (×10-5) Mean number of misclassifications Mean imputation error Mean number of iterations to termination

WDS 62.295 21.3 27.8

PDS 9.3406 21 28.1

NPS 8.7094 21 0.4232 27.9

OCS 7.8119 21 0.4065 27.9

NNM 8.7661 21 0.2975 27.9

POCS 4.8299 21 0.2659 27.9

CR IP T

From Table 2, it can be seen that the results of WDS are still the worst when only one component missing. The clustering performance of the four algorithms, i.e., PDS, NPS, OCS and NNM, is basically at the same level, where the relatively small mean prototype error is produced by OCS and the relatively accurate imputation of missing value is obtained by NNM. In Table 2, we give the clustering results of

AN US

POCS when K=10-2. Obviously, compared with the OCS and NNM algorithms, the clustering performance of POCS has been greatly improved. From Fig. 6, we can see that the mean prototype error of POCS decreases in the beginning and increases afterwards with the increasing K value, and its clustering performance appears better than those of the OCS and NNM algorithms when the value of K is

M

selected in a large range around the turning point, which show the effectiveness of the prior distribution information of missing value guiding the OCS strategy.

ED

3) Wine data set

The Wine data set consists of 178 thirteen-dimensional attribute vectors that describe the results of a

PT

chemical analysis of wines grown in the same region but derived from three different cultivars. The corresponding classes contain 59 samples, 71 samples, and 48 samples, respectively. We still perform 10

CE

experiments on 10 incomplete Wine data sets, and the average experimental results are shown in Table 3

AC

and Fig. 7.

Table 3 Averaged results of 10 experiments using incomplete Wine data sets

WDS Mean prototype error 19.0780 Mean number of misclassifications 9.1 Mean imputation error Mean number of iterations to termination 26.2

PDS 0.0521 9.1 27.1

25

NPS 0.0467 9.1 1.4067 27.4

OCS 0.0425 9.1 1.2411 27.4

NNM 0.0415 9.1 1.1918 27.2

POCS 0.0398 9.1 1.2165 27.4

ACCEPTED MANUSCRIPT

0.042

0.041

0.04

0.039 -7

-6

-5

-4

-3

-2

lgK

-1

0

1

CR IP T

Mean prototype error

0.043

2

3

Fig. 7. Trend line of mean prototype error with the increase of K for incomplete Wine data sets.

From Table 3 and Fig. 7, it can be seen that merit orders of six algorithms for the incomplete Wine data

AN US

sets are the same as those for the incomplete Wholesale Customers data sets, and both of the experimental conclusions are also consistent.

Finally, we elaborate on the POCS algorithm in the case of missing single attribute value. In contrast to Fig. 3, Fig. 6, and Fig. 7, we can see that the turning point of trend line of mean prototype error appears in

M

the K=10-3 for the incomplete Wholesale Customers data sets, while in the K=10-2 for the incomplete

ED

Seeds and Wine data sets. Here, we intend to make clear that the mean prototype error at the turning point of trend line is not minimal. For instance, for the incomplete Wholesale Customers data sets, the mean

PT

prototype error is 31.9573 when K=2  10-3, which is smaller than the value when K=10-3. If roughly judge from trend lines, the minimal mean prototype error for the incomplete Wholesale Customers data sets

CE

should be produced in the K  [10-4 10-3] or [10-3 10-2], while K should belong to [10-3 10-2] or [10-2 10-1] for the incomplete Seeds and Wine data sets. The appropriate range of K is not always consistent for

AC

different data sets. From Fig. 3, Fig. 6 and Fig. 7, it can also be seen that the clustering results of POCS on the above incomplete data sets are basically acceptable when K [10-4 10-1]. In practical applications, it is difficult to find the optimal value of the gain factor K, but finding the feasible value of K that can make the prior distribution information of missing value properly guide the OCS strategy is relatively easy. The above experimental results can also provide some reference for a selection of a suitable value of K. 26

ACCEPTED MANUSCRIPT

5.2. Applications In this subsection, we apply the POCS algorithm to two actual incomplete data sets with missing values in themselves. 1) Stone Flakes data set [45]. The data set concerns the earliest history of mankind and reflects the

CR IP T

technological progress during several hundred thousand years. The data set consists of 79 eight-dimensional attribute vectors to describe stone flakes in the prehistoric era, in which 13 object data describe geometric and stylistic features of the flakes left over from the Lower Paleolithic Age about 300 millennia years ago, 26 object data describe the flakes from the Levallois Technique Age about 200

AN US

millennia years ago, and 40 object data correspond to the Middle Paleolithic Age and the Homo Sapiens Age about 40 millennia years ago. The Stone Flakes data set is not complete. There are 6 incomplete data that contain 10 missing values.

2) Breast Cancer data set [45-46]. The data set was obtained from the University of Wisconsin

M

Hospitals, which contains 699 nine-dimensional attribute vectors. The data set can be divided into two categories, that is, 458 instances are malignant and 241 instances are benign. In the data set, there are 16

ED

samples that are not complete, each missing one attribute value. Since these two data sets themselves are incomplete, the real values of missing values and the full data

PT

prototypes are both unknown. Now we can only provide the misclassifications of each algorithm for these

Misclassifications

AC

CE

two data sets, as shown in Fig. 8, wherein the gain factor K of POCS is 0.05.

35

Stone Flakes data set 32

Breast Cancer data set 33

32

31

31

31

30 25 20

18

18

18

17

16

16

OCS

NNM

POCS

15 10 WDS

PDS

NPS

Fig. 8. Misclassifications of each algorithm for two actual incomplete data sets.

27

ACCEPTED MANUSCRIPT

From Fig. 8, we can see that for the Stone Flakes data set, POCS and NNM both achieve the least misclassifications; for the Breast data set, POCS, OCS and NPS all achieve the least misclassifications, while the misclassifications of the NNM algorithm are the most among the six algorithms. Because of poor sensitivity of the misclassification index, the above results do not show that the POCS algorithm stands out,

CR IP T

but the robustness of the algorithm for these two practical applications has been implied.

5.3 Experiments on synthetic data sets

We generate three synthetic data sets to validate the proposed algorithm. The points of the i th cluster are randomly distributed according to a Gaussian distribution with mean vector  i and covariance

AN US

matrix  i . For the synthetic Gaussian data set, its actual cluster prototypes are known, i.e., its mean vectors  i for i  1, 2,

, c . Each data set is given a denomination by the number of object data and

clusters. For instance, N100C3 denotes that the data set contains 100 object data and can be divided into three clusters.

M

Incomplete data sets are artificially generated on a basis of the complete synthetic data sets by randomly selecting a specified percentage of their components to be designated as missing, and the

ED

random selections of missing values satisfying the following constraints [1]: 1) Each incomplete datum retains at least one component;

PT

2) Each attribute has at least one value presents in the incomplete data set.

CE

Here, we generate incomplete data sets on top of each of the three complete data sets by taking missing rates as 5%, 10%, 15% and 20%.

AC

1) N200C2 data set

The data set consists of 200 two-dimensional vectors scattered between the two clusters, each

containing 100 points. The points of each cluster are drawn according to the following parameters:

0   50  50  3600 1   ,  2    , 1   2   .  3600   50   0 50  The data set is shown in Fig. 9, where the object data belonging to two different clusters are represented

28

ACCEPTED MANUSCRIPT

by × and +, respectively, and the actual cluster prototypes are represented by stars. From Fig. 9, we can

AN US

CR IP T

see that the data set exhibits some overlap between two clusters.

Fig. 9. N200C2 data set.

We perform 10 experiments on 10 incomplete N200C2 data sets at each missing rate, and the average

M

experimental results are shown in Table 4, in which those achieved by POCS only when K=0.01 are listed. Note: NC is the abbreviation for non-convergence in Table 4.

WDS PDS NPS 147.91 157.19 154.56 181.49 209.01 202.13 211.41 257.22 246.35 287.03 364.31 NC Mean imputation error WDS PDS NPS 55.19 58.46 57.83 NC

% missing

AC

CE

5 10 15 20

5 10 15 20

Mean prototype error

Mean number of misclassifications

OCS 151.69 192.23 225.62 301.87

NNM 90.46 129.08 250.47 333.00

POCS 129.38 147.14 158.40 186.08

OCS 55.05 58.47 56.91 61.62

NNM 65.05 65.31 66.82 66.83

POCS 55.16 56.20 56.60 59.68

PT

% missing

ED

Table 4 Averaged results of 10 experiments using incomplete N200C2 data sets

WDS PDS NPS OCS NNM 31.7 31.5 31.4 31.4 32.6 32.9 32.5 32.4 32.6 32.4 35.2 34.8 34.7 34.4 37.8 35.6 35.6 NC 35.9 39.1 Mean number of iterations to termination WDS PDS NPS OCS NNM 21.1 21.8 23.1 28.6 22.9 21.9 22.1 23.6 28.2 25.8 24.4 21.5 24.5 26.9 28.0 22.0 21.3 NC 37.2 28.2

POCS 31.7 32.3 35.1 36.3 POCS 25.0 25.9 44.2 40.5

From Table 4, we first note that there is a failure to converge in the 10 experiments for the NPS algorithm when missing rate is 20%. With the exception of NNM, the algorithms differ no greatly in their mean misclassifications. Among the four algorithms capable of providing imputations, POCS produces 29

ACCEPTED MANUSCRIPT

the smallest mean imputation errors in general whereas the NNM algorithm produces the largest ones. Next, the performance of each algorithm on the incomplete N200C2 data sets is discussed respectively, with the focus on the mean prototype error index. PDS produces the largest mean prototype errors at the four missing rates. WDS achieves good clustering effects and, as the missing rate rises, its mean prototype

CR IP T

errors grow slowly; particularly, when the missing rates are 15% and 20%, WDS is inferior only to POCS in terms of mean prototype error. OCS produces a larger mean prototype error than WDS at each missing rate, whereas, as the missing rate rises, its prototype error grows at a similar speed to the case of WDS. NNM produces the smallest mean prototype errors among the six algorithms when the missing rates are

AN US

5% and 10%; at higher missing rates, however, its mean prototype errors increase so quickly that they are only better than those produced by the PDS algorithm when the missing rates become 15% and 20%. The POCS algorithm produces the second smallest mean prototype errors when the missing rates are 5% and 10%, being 129.38 and 147.14 respectively and differing in fact not much from those produced by the

M

NNM algorithm; but when the missing rates are 15% and 20%, the resulting mean prototype errors become 158.40 and 186.08 respectively, far below those produced by the compared algorithms. In the

which is shown in Fig. 10.

ED

following, we will discuss the trend of mean prototype error of the POCS algorithm with the value of K,


First, consider the case when the missing rate is 5%, shown in Fig. 10(a). When K is less than 10^-4, the prior distribution information of missing values plays a weak role and the POCS algorithm effectively degrades into the standard OCS; with the increase of K, the guiding role of the prior distribution information of missing values is gradually strengthened and the mean prototype error gradually declines; when K is larger than 1, for the missing values adhering to Gaussian distributions, the mean values of their near neighbors play an entirely dominant role in their imputations, and the mean prototype error reaches another steady state. With the increase of K, POCS thus realizes a smooth transition from the OCS strategy to the NNM strategy for Gaussian missing values, and the clustering performance of POCS always remains somewhere in between. We refer to the trend line shown in Fig. 10(a) as the “Smooth Transition” type, ST-type for short.
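The trend lines in Fig. 10 are obtained by sweeping K over a logarithmic grid and recording the resulting mean prototype error. A minimal sketch of such a sweep is given below; run_pocs and prototype_error are placeholders standing for the clustering routine and the error index described above, not code from the original experiments:

```python
import numpy as np

def k_sweep(incomplete_sets, actual_prototypes, run_pocs, prototype_error,
            lgK_grid=np.arange(-7.0, 3.5, 0.5)):
    """Average the POCS prototype error over a collection of incomplete data sets
    for each value of lg K on a logarithmic grid.

    run_pocs(X_inc, K)           -> cluster prototypes found by POCS (placeholder)
    prototype_error(V, V_actual) -> scalar error between found and actual prototypes
    """
    trend = []
    for lgK in lgK_grid:
        K = 10.0 ** lgK
        errors = [prototype_error(run_pocs(X_inc, K), actual_prototypes)
                  for X_inc in incomplete_sets]
        trend.append((float(lgK), float(np.mean(errors))))
    return trend  # list of (lg K, mean prototype error) pairs, one point per K value
```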

Fig. 10. Trend lines of mean prototype error with the increase of K for incomplete N200C2 data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%. Each panel plots the mean prototype error against lgK.


Now consider the cases when the missing rates are 10%, 15% and 20%, shown in Fig. 10(b)-(d). With the increase of K, the mean prototype error first decreases and then increases during the transition from the OCS strategy to the NNM strategy for Gaussian missing values; that is, a turning point appears, and the mean prototype error of POCS at the turning point is smaller than those of OCS and NNM, reflecting the superiority of using the prior distribution information of missing values to guide the OCS strategy. We refer to the trend lines shown in Fig. 10(b)-(d) as the “Turning Point” type, TP-type for short. From Fig. 10(b)-(d), it can be seen that the turning points all appear at K=0.1, whereas the results of POCS listed in Table 4 are obtained with K=0.01; that is, the better results achievable by POCS are not listed in Table 4. We can see from Fig. 10(b) that when K=0.1, the mean prototype error of POCS is actually smaller than that of NNM.

2) N400C3 data set

The data set shown in Fig. 11 consists of 400 two-dimensional vectors scattered among three clusters of sizes 150 points (cluster 1), 100 points (cluster 2) and 150 points (cluster 3). The points of each cluster are drawn from bivariate Gaussian distributions with distinct means μ1, μ2, μ3 and covariance matrices Σ1, Σ2, Σ3; the resulting arrangement of the three clusters is shown in Fig. 11.


Fig. 11. N400C3 data set.

The experimental results of the POCS algorithm when K=0.01 and the compared algorithms are listed in Table 5.

Table 5 Averaged results of 10 experiments using incomplete N400C3 data sets

Mean prototype error
% missing   WDS      PDS      NPS      OCS      NNM      POCS
5           0.2911   0.2902   0.2883   0.2456   0.2147   0.2325
10          0.2816   0.3118   NC       0.1981   0.1834   0.1845
15          0.3369   0.3441   0.3445   0.2115   0.2134   0.1841
20          0.3289   0.3546   0.3587   0.2459   0.2875   0.1886

Mean number of misclassifications
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           30.6    30.7    30.7    33.2    37.1    33.4
10          36.8    37.2    NC      38.5    44.5    39.3
15          42.1    42.2    42.2    48.3    56.6    48.5
20          45.8    45.3    45.1    57.6    65.1    57.5

Mean imputation error
% missing   WDS    PDS    NPS      OCS      NNM      POCS
5           n/a    n/a    1.0823   1.4340   1.7727   1.3893
10          n/a    n/a    NC       1.4027   1.7537   1.3866
15          n/a    n/a    1.1528   1.4316   1.7330   1.3940
20          n/a    n/a    1.0970   1.4596   1.7342   1.4023

Mean number of iterations to termination
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           19.0    19.8    22.0    27.8    20.3    49.1
10          19.0    19.6    NC      36.2    21.7    43.0
15          19.6    21.4    30.0    43.4    28.1    65.7
20          18.8    22.4    32.3    71.0    30.2    78.3

From Table 5, we can see that NPS again fails to converge when the missing rate is 10%. The WDS, PDS and NPS algorithms yield fewer misclassifications, but the cluster prototypes they obtain have much larger errors; conversely, the OCS, NNM and POCS algorithms produce smaller prototype errors but more misclassifications, and their imputation errors are also larger than that of NPS. These results show the inconsistency of the three indices, i.e., prototype error, imputation error and misclassifications. The prototype error index will mainly be discussed in the following. For the incomplete N400C3 data sets, the mean prototype errors of POCS are worse only than those of NNM when the missing rates are 5% and 10%, and the cluster prototypes of POCS are the most accurate when the missing rates are 15% and 20%. The trend of the mean prototype error of the POCS algorithm with the value of K is shown in Fig. 12.
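These indices can be computed along the following lines. The sketch below is only illustrative: the prototype error is taken here as the sum of squared distances between matched prototype pairs, the misclassification count is minimized over cluster relabelings, and the imputation error is the squared deviation of imputed from true values; the function names and exact definitions are assumptions rather than the original implementation.

```python
import numpy as np
from itertools import permutations

def prototype_error(V, V_actual):
    """Sum of squared distances between obtained and actual prototypes,
    minimized over all assignments of obtained clusters to actual clusters."""
    c = len(V_actual)
    return min(sum(np.sum((V[p[i]] - V_actual[i]) ** 2) for i in range(c))
               for p in permutations(range(c)))

def misclassifications(labels, labels_actual, c):
    """Number of wrongly assigned data, minimized over cluster relabelings."""
    labels_actual = np.asarray(labels_actual)
    best = len(labels)
    for p in permutations(range(c)):
        relabeled = np.array([p[l] for l in labels])
        best = min(best, int(np.sum(relabeled != labels_actual)))
    return best

def imputation_error(X_imputed, X_complete, mask):
    """Sum of squared differences between imputed and true values of the missing cells."""
    return float(np.sum((X_imputed[mask] - X_complete[mask]) ** 2))
```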

Fig. 12. Trend lines of mean prototype error with the increase of K for incomplete N400C3 data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%. Each panel plots the mean prototype error against lgK.

From Fig. 12, it can be seen that the trend lines of mean prototype error are all of the TP-type for the incomplete N400C3 data sets, showing the effectiveness of introducing the prior distribution information of missing values into the OCS strategy. At the first three missing rates, the turning points of the trend lines all appear at K=0.1; only when the missing rate is 20% does the turning point appear at K=0.01. Hence the results listed in Table 5 for comparison with the other algorithms are not all the best results of POCS. In fact, the trend lines show that when the missing rate is 10%, the mean prototype error produced by the POCS algorithm can be smaller than that of the NNM algorithm, and when the missing rates are 15% and 20%, the mean prototype errors of the POCS algorithm are clearly smaller than those of the compared algorithms.


3) N250C5 data set

The data set shown in Fig. 13 consists of 250 two-dimensional vectors scattered among five clusters, each containing 50 points. The points of each cluster are drawn from bivariate Gaussian distributions with means μ1, ..., μ5 arranged as shown in Fig. 13 and covariance matrices Σ1 = Σ2 = Σ3 = diag(100, 100) and Σ4 = Σ5 = diag(300, 300).

Fig. 13. N250C5 data set.

We randomly generate 10 incomplete data sets at each missing rate, and the average results of 10 experiments are shown in Table 6.

Table 6 Averaged results of 10 experiments using incomplete N250C5 data sets

Mean prototype error
% missing   WDS      PDS      NPS      OCS      NNM      POCS
5            46.41   170.27    48.25    46.31    48.56    42.61
10           46.73    51.05   188.94    48.85    50.68    40.99
15          168.71    73.46   187.38   290.61    99.42    62.36
20          193.03    70.01   NC        91.24   125.19    69.25

Mean number of misclassifications
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           12.8    16.2    12.7    14.8    21.1    15.8
10          19.9    19.7    24.0    25.5    31.9    25.9
15          27.9    25.4    28.1    40.4    44.8    33.9
20          31.8    27.6    NC      44.1    54.6    41.7

Mean imputation error
% missing   WDS    PDS    NPS     OCS     NNM     POCS
5           n/a    n/a    20.72   26.62   34.63   26.07
10          n/a    n/a    24.28   30.24   34.90   29.39
15          n/a    n/a    24.02   33.67   35.05   29.50
20          n/a    n/a    NC      31.61   34.69   28.89

Mean number of iterations to termination
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           25.3    41.4    32.2    36.6    30.0    38.6
10          24.9    26.6    31.8    41.3    31.8    53.4
15          32.0    27.5    41.3    65.0    40.4    82.5
20          25.9    26.7    NC      98.9    45.3    92.7

Here we focus directly on the mean prototype error produced by each algorithm. First consider the WDS algorithm: when the missing rates are 15% and 20%, its prototype errors are unusually large, 168.71 and 193.03, respectively. Closer analysis shows that, when the missing rate is 15%, in one of the 10 experiments its solution converges to v3 = (33.16, 55.61)T and v5 = (59.06, 46.22)T; when the missing rate is 20%, its solution in one experiment converges to v3 = (40.03, 57.87)T and v5 = (61.75, 39.49)T. These solutions are obviously wrong. Next consider the PDS algorithm: when the missing rate is 5%, one wrong solution is v3 = (38.76, 55.66)T and v5 = (60.68, 39.79)T, which results in the very large mean prototype error of 170.27. For the NPS algorithm, when the missing rate is 10%, one wrong solution is v3 = (38.73, 56.49)T and v5 = (60.81, 37.15)T; when the missing rate is 15%, one wrong solution is v3 = (36.22, 37.51)T, with the deviations of v1 and v2 also being large; when the missing rate is 20%, the algorithm does not converge at all. For the OCS algorithm, when the missing rate is 15%, the solutions are wrong in two of the 10 experiments, with v3 = (49.87, 9.80)T in one and v3 = (-58.03, -29.15)T in the other, and both lead to a large deviation of v1. The above results suggest that unstable behaviour occurs for the four algorithms WDS, PDS, NPS and OCS. For the NNM algorithm, when the missing rate is 20%, although its prototype error is also slightly large, the obtained cluster prototypes all lie around the actual ones and are not wrong. Finally, the POCS algorithm not only achieves stable convergence, but also obtains more accurate cluster prototypes at all four missing rates.

Returning to Table 6, we find that in the cases of wrong convergence of the four compared algorithms, the mean misclassifications and the mean imputation error they produce are not large, and are even better than those produced by the POCS algorithm. For example, when the missing rate is 15%, the mean misclassifications and the mean imputation error of NPS are 28.1 and 24.02 respectively, while for POCS they are 33.9 and 29.50 respectively. This again suggests that the prototype error index is more suitable for evaluating the clustering of incomplete data. The trend of the mean prototype error of the POCS algorithm with the value of K is shown in Fig. 14.

Fig. 14. Trend lines of mean prototype error with the increase of K for incomplete N250C5 data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%. Each panel plots the mean prototype error against lgK.

From Fig. 14, it can be seen that for the incomplete N250C5 data sets at the four missing rates, the trend lines of mean prototype error are all of the TP-type, with turning points all appearing at K=0.01. From Fig. 14(c), we can see that even if the OCS algorithm converges to a wrong solution, it can still be guided to the correct solution by the prior distribution information of missing values.


5.4 Experiments on real-world data sets

Besides the synthetic data sets, three real-world data sets, i.e., Crude-Oil, Glass and IRIS [45, 47], are also used to validate the proposed algorithm. Here, we set six missing rates from 5% to 30%.

1) Crude-Oil data set

The data set has 56 data points and five attributes, representing a chemical analysis of crude-oil samples from three zones of sandstone: 7 samples are from Wilhelm, 11 from Sub-Mulinia and 38 from Upper-Mulinia. The average experimental results of 10 experiments for the incomplete Crude-Oil data sets are listed in Table 7, and the trend of the mean prototype error of the POCS algorithm with the value of K is shown in Fig. 15.

Fig. 15. Trend lines of mean prototype error with the increase of K for incomplete Crude-Oil data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%, (e) 25%, (f) 30%. Each panel plots the mean prototype error against lgK.

Table 7 Averaged results of 10 experiments using incomplete Crude-Oil data sets

Mean prototype error
% missing   WDS      PDS      NPS      OCS      NNM      POCS
5            4.636    1.186    1.121    1.025    1.283    1.057
10          18.149    4.034    3.721    3.440    7.841    3.636
15          33.955    5.774    6.789    6.682    7.589    5.960
20          42.693    9.197    9.803   10.646   19.949   10.162
25          47.116   10.674   25.600   25.405   26.069   10.357
30          89.921   15.430   29.448   21.105   45.812   17.627

Mean number of misclassifications
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           21.9    23.0    22.9    22.6    22.5    22.9
10          22.0    22.7    22.8    23.2    23.0    23.4
15          21.9    22.9    22.9    22.6    21.6    22.6
20          23.9    23.5    23.1    23.4    23.2    23.3
25          21.8    21.4    22.6    22.1    21.9    21.7
30          25.2    22.7    23.5    23.0    23.6    21.9

Mean imputation error
% missing   WDS    PDS    NPS     OCS     NNM     POCS
5           n/a    n/a    2.495   2.359   2.662   2.369
10          n/a    n/a    2.420   2.338   2.968   2.356
15          n/a    n/a    2.660   2.642   2.737   2.576
20          n/a    n/a    2.484   2.444   2.903   2.448
25          n/a    n/a    2.747   2.709   2.966   2.580
30          n/a    n/a    2.710   2.572   3.034   2.491

Mean number of iterations to termination
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           44.7    43.3    45.6    49.5    39.5    48.2
10          49.6    46.3    46.1    51.1    41.2    50.8
15          42.1    43.5    47.6    66.8    40.6    59.4
20          43.3    49.1    50.0    97.3    61.7    63.9
25          43.9    58.5    63.7    94.3    54.9   111.3
30          48.8    55.4    61.4    74.6    69.9    74.2

From Fig. 15, it can be seen that for the incomplete Crude-Oil data sets, the trend lines of mean prototype error of the POCS algorithm exhibit the ST-type when the missing rates are 5% and 10%, and the TP-type at the last four missing rates, showing the role played by the prior distribution information of the missing values in guiding the OCS strategy. When the missing rates are 15% and 20%, the turning points of the trend lines appear at K=0.1 and K=0.001 respectively, and when the missing rates are 25% and 30%, the corresponding turning points both appear at K=0.01. Table 7 provides the results of the POCS algorithm when K=0.01. It is clear that POCS behaves slightly less satisfactorily than OCS in terms of mean prototype error at the first two missing rates, but POCS produces much more accurate cluster prototypes than OCS at the last four missing rates.


From Table 7, it can also be seen that the six algorithms do not differ much from one another in terms of the mean misclassifications and the mean imputation error. WDS and NNM continue to produce unsatisfactory clustering in terms of mean prototype error. Unexpectedly, the performance of the PDS algorithm is diametrically different from its previous clustering performance on the synthetic data sets, in the sense that its mean prototype errors are far smaller than those of OCS at the last four missing rates, and even smaller than those of POCS listed in Table 7 when the missing rates are 15%, 20% and 30%. However, the mean prototype error of POCS is 5.587 at the turning point of the trend line corresponding to the missing rate of 15%, better than that of PDS; at the turning point corresponding to the missing rate of 20%, POCS produces a mean prototype error of 9.329, only a little larger than that produced by PDS. Therefore, in the experiments of clustering the incomplete Crude-Oil data sets, POCS achieves clustering results close to those of PDS when K remains fixed at 0.01, and performs slightly better than PDS at the turning points of the trend lines. The above analysis suggests that, for the incomplete Crude-Oil data sets, a satisfactory clustering is possible with POCS thanks to the guidance provided by the local distribution information of missing values, even at higher missing rates where OCS is far less desirable than PDS.

2) Glass data set

The data set contains 214 nine-dimensional attribute vectors. The class distribution of the data set is that 163 samples are from Window glass and 51 samples are from Non-window glass. The average experimental results of 10 experiments for the incomplete Glass data sets are listed in Table 8, and the trend of the mean prototype error of the POCS algorithm with the value of K is shown in Fig. 16.

Table 8 Averaged results of 10 experiments using incomplete Glass data sets

Mean prototype error
% missing   WDS      PDS      NPS      OCS      NNM      POCS
5           0.1148   0.0103   0.0123   0.0116   0.0266   0.0124
10          0.5092   0.0196   0.0192   0.0199   0.0916   0.0189
15          0.9089   0.0314   0.0433   0.0398   0.1612   0.0367
20          2.3619   0.0627   0.0675   0.0601   0.4616   0.0647
25          2.3125   0.1002   0.1178   0.0986   0.5482   0.0971
30          5.5301   0.1221   0.1361   0.1161   0.7807   0.1080

Mean number of misclassifications
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           20.4    20.5    20.7    20.7    21.7    20.7
10          21.6    21.6    21.6    21.9    23.7    22.0
15          21.4    21.1    21.4    22.0    23.6    21.9
20          33.9    20.0    20.1    21.0    27.0    21.0
25          31.0    21.7    21.9    21.9    27.9    22.0
30          51.2    22.4    22.3    23.6    30.2    23.6

Mean imputation error
% missing   WDS    PDS    NPS      OCS      NNM      POCS
5           n/a    n/a    0.3661   0.3698   0.5059   0.3719
10          n/a    n/a    0.3782   0.3715   0.5018   0.3721
15          n/a    n/a    0.3840   0.3787   0.4980   0.3794
20          n/a    n/a    0.3836   0.3841   0.5234   0.3853
25          n/a    n/a    0.3977   0.3977   0.5207   0.3986
30          n/a    n/a    0.3930   0.3975   0.5100   0.3973

Mean number of iterations to termination
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           21.3    21.3    23.5    25.7    22.1    25.2
10          22.6    22.4    25.4    36.4    24.2    38.4
15          25.3    21.6    27.4    36.5    24.2    37.2
20          21.6    22.5    33.4    45.5    28.7    46.0
25          23.6    22.9    36.2    52.9    30.7    52.8
30          20.6    24.1    44.3    61.3    33.8    52.0

Fig. 16. Trend lines of mean prototype error with the increase of K for incomplete Glass data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%, (e) 25%, (f) 30%. Each panel plots the mean prototype error against lgK.


From Fig. 16, it can be seen that the trend lines of mean prototype error exhibit the ST-type when the missing rates are 5% and 20%, and the TP-type at the remaining four missing rates. When the missing rates are 15% and 25%, the turning points of the trend lines appear at K=0.01 and K=0.0001 respectively, and when the missing rates are 10% and 30%, the corresponding turning points both appear at K=0.001. Combining this with the results of the POCS algorithm when K=0.001 listed in Table 8, we can see that POCS produces smaller mean prototype errors than OCS even at the non-turning points of the trend lines when the missing rates are 15% and 25%.

From Table 8, it can be seen that WDS and NNM are still not satisfactory in clustering; these two algorithms apart, the remaining algorithms still do not differ much from one another in terms of the mean misclassifications and the mean imputation error. For the incomplete Glass data sets, the algorithms doing well in clustering include PDS, OCS and POCS; roughly speaking, PDS compares advantageously at the first three, lower missing rates, whereas OCS and POCS fare better at the last three, higher missing rates. According to the result statistics over the six missing rates in Table 8, the numbers of times that PDS, OCS and POCS produce the smallest mean prototype errors are 2, 1 and 3 respectively, the numbers of times they produce the second smallest are 1, 3 and 1, the third smallest 3, 1 and 1, and the fourth smallest 0, 1 and 1. These statistics indicate that POCS generally achieves more satisfactory results.

3) IRIS data set


The IRIS data set is the one most commonly used for testing clustering algorithms. It contains 150 four-dimensional attribute vectors depicting four attributes of IRIS: Petal Length, Petal Width, Sepal Length and Sepal Width. The three IRIS classes involved are Setosa, Versicolor and Virginica, each containing 50 vectors; Setosa is well separated from the others, while Versicolor and Virginica are not easily separable due to the overlapping of their vectors. Hathaway and Bezdek presented the actual cluster prototypes of the IRIS data: v1* = (5.00, 3.42, 1.46, 0.24)T, v2* = (5.93, 2.77, 4.26, 1.32)T, v3* = (6.58, 2.97, 5.55, 2.02)T [48].


In order to make the result more intuitive, the clustering partition produced by the POCS algorithm at the missing rate of 5% is depicted in the Petal Length - Sepal Width plane, as shown in Fig. 17. In Fig. 17, the object data classified to Setosa, Versicolor and Virginica are represented by Δ, +, and ×, respectively, and the incomplete data are marked in black. Object data that actually belong to Virginica but are classified to Versicolor are represented by ○; object data that actually belong to Versicolor but are classified to Virginica are represented by □.


Fig. 17. The POCS Clustering partition for the incomplete IRIS data at the missing rate 5%.

The average experimental results of 10 experiments are listed in Table 9, and the trend of the mean prototype error of the POCS algorithm with the value of K is shown in Fig. 18.

Table 9 Averaged results of 10 experiments using incomplete IRIS data sets

Mean prototype error
% missing   WDS      PDS      NPS      OCS      NNM      POCS
5           0.0545   0.0511   0.0494   0.0484   0.0403   0.0453
10          0.0734   0.0571   0.0583   0.0569   0.0672   0.0511
15          0.0888   0.0564   0.0547   0.0531   0.1376   0.0450
20          0.1669   0.0781   0.0743   0.0745   0.2784   0.0663
25          0.2871   0.0802   0.0802   0.0751   0.4053   0.0640
30          0.3955   0.0883   0.0871   0.0973   0.7173   0.0870

Mean number of misclassifications
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           16.7    16.6    16.8    16.7    16.6    16.8
10          16.3    17.0    17.0    16.8    16.6    16.3
15          16.7    17.2    17.0    17.8    17.5    17.0
20          16.1    17.6    17.7    17.7    16.6    16.8
25          18.3    17.3    17.5    18.7    21.2    18.1
30          25.9    18.4    18.3    18.6    25.1    18.6

Mean imputation error
% missing   WDS    PDS    NPS      OCS      NNM      POCS
5           n/a    n/a    0.3318   0.3194   0.6452   0.3267
10          n/a    n/a    0.3092   0.3009   0.7286   0.3083
15          n/a    n/a    0.3180   0.3365   0.7387   0.3427
20          n/a    n/a    0.3229   0.3314   0.7860   0.3441
25          n/a    n/a    0.3340   0.3481   0.7708   0.3629
30          n/a    n/a    0.3442   0.3668   0.8190   0.3914

Mean number of iterations to termination
% missing   WDS     PDS     NPS     OCS     NNM     POCS
5           31.0    31.8    32.6    43.5    31.2    36.9
10          32.9    30.0    31.8    38.0    28.8    44.5
15          32.8    29.9    33.9    48.7    29.8    44.8
20          30.5    30.8    35.2    51.9    30.5    56.4
25          31.8    29.2    35.6    47.2    28.0    48.9
30          33.9    31.7    39.7    46.1    32.5    58.1

Fig. 18. Trend lines of mean prototype error with the increase of K for incomplete IRIS data sets: (a) missing rate 5%, (b) 10%, (c) 15%, (d) 20%, (e) 25%, (f) 30%. Each panel plots the mean prototype error against lgK.


From Fig. 18, it can be seen that the trend line of mean prototype error exhibits the ST-type when the missing rate is 5%, and the TP-type, with turning points all appearing at K=0.01, at the last five missing rates. Regarding Fig. 18(e)-(f) in particular, we can see that when the missing rates are 25% and 30%, although the clustering effect of the NNM strategy for Gaussian missing values is very poor, the prior distribution information of the missing values can still guide the OCS strategy appropriately, so that the mean prototype errors of POCS are actually much smaller than those of OCS, as can also be seen from the results of the POCS algorithm when K=0.01 listed in Table 9.

From Table 9, it can be seen that the algorithms do not differ greatly in their mean misclassifications, except that WDS and NNM produce more misclassifications when the missing rate is 30%. The mean imputation error does not differ much among the NPS, OCS and POCS algorithms. Next, we focus on the mean prototype error index. The cluster prototypes obtained by WDS remain rather inaccurate for the incomplete IRIS data sets. Among the PDS, NPS and OCS algorithms, OCS produces better results whereas PDS is weaker in clustering effect. When the missing rate is 5%, the smallest prototype error is produced by the NNM algorithm; however, the prototype error of NNM increases rapidly as the missing rate rises, becoming far above the values produced by the POCS algorithm and the other compared algorithms. The cluster prototypes obtained by the POCS algorithm are generally closer to the actual ones. When the missing rate is 5%, the mean prototype error of the POCS algorithm is worse only than that of the NNM algorithm; when the missing rates range from 10% to 30%, the prototype errors of POCS are the smallest among the six algorithms, and in terms of the prototype error values this advantage is very clear except when the missing rate is 30%.

In the following, we discuss the effect of the size of the near-neighbor set (that is, q) on clustering performance, taking the incomplete IRIS data sets at the missing rate of 15% as an example. The mean prototype errors produced by POCS under different values of q are shown in Fig. 19, where q ranges from 3 to 28.

Fig. 19. Mean prototype error corresponding to different sizes of the set of near neighbors.



The selection of q affects the judgment of whether the missing values obey Gaussian distributions and, at the same time, the numerical characteristics of the probability density functions of the missing values; it therefore affects the clustering results to some extent. From Fig. 19, it can be seen that the mean prototype error is relatively large when q=3, becomes smaller as q increases from 4 to 8, and then becomes larger again as q increases from 8 to 13. These results reveal that if the selected q is too small, the few neighbors of an incomplete datum are not sufficient to estimate the distributions of its missing values, whereas if the selected q is larger, ill-suited near neighbors may be introduced into the descriptions of the missing values.
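To make the role of q concrete, the sketch below estimates the parameters of a probabilistic information granule for a single missing value from the q nearest neighbors of its incomplete datum, using the Shapiro-Wilk test [38, 39] to check normality; the distance measure and the estimation details are simplified assumptions rather than the exact formulation given earlier in the paper:

```python
import numpy as np
from scipy.stats import shapiro

def gaussian_granule(X, i, j, q=8, alpha=0.05):
    """Estimate mean and variance of the missing value X[i, j] from the q nearest
    neighbors of datum i (nearness measured on the attributes observed for datum i)."""
    observed = ~np.isnan(X[i])                 # attributes present in datum i
    candidates = [k for k in range(len(X))
                  if k != i and not np.isnan(X[k, j]) and not np.isnan(X[k][observed]).any()]
    dists = [np.linalg.norm(X[k][observed] - X[i][observed]) for k in candidates]
    nearest = [candidates[t] for t in np.argsort(dists)[:q]]
    values = np.array([X[k, j] for k in nearest])   # neighbor values on attribute j
    _, p_value = shapiro(values)               # normality hypothesis test
    if p_value < alpha:
        return None                            # normality rejected: no Gaussian granule
    return values.mean(), values.var(ddof=1)   # parameters of the Gaussian granule
```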


If we continue increasing q from 13 to 28, Fig. 19 shows that the mean prototype error becomes larger again and then tends to stabilize with a little fluctuation. These results suggest that if the value of q is selected to be too large, the distribution information of the missing values described by so many near neighbors is no longer local but tends towards a global distribution description, and in this case the value of q is no longer sensitive for the estimation of the mean values and variances of the Gaussian distributions. Of course, the POCS algorithm should use the local distribution information of missing values rather than the global one.

Turning to the values of the mean prototype error shown in Fig. 19, when the value of q is selected arbitrarily from 4 to 13, POCS still performs better than its competitors in clustering the incomplete IRIS data sets at the missing rate of 15%; hence any value of q in this range is acceptable. In addition, although the optimal value of q may differ when clustering different incomplete data sets, we set q equal to 8 throughout all the previous experiments of clustering incomplete data and achieved satisfactory results. These findings demonstrate that when the local distribution information of missing values is introduced for clustering incomplete data, thanks to the guarantee offered by the normality hypothesis testing technique, the use of probabilistic information granules not only describes the local distribution more finely but is also effective in mitigating the sensitivity of the clustering results to the size of the near-neighbor set.

Finally, we concentrate on the convergence and the computational efficiency of the POCS algorithm as well as those of the compared algorithms, in combination with the experimental results. In order to gain a better intuitive insight, we first consider the incomplete IRIS data sets and look at the values of the POCS objective function obtained in successive iterations, as shown in Fig. 20.

Fig. 20. Objective function value in the iterative procedure.


It can be seen that the values of the objective function of POCS, at the six different missing rates, all decrease rapidly within the first 10 iterations and then converge gradually. Examining Table 4 to Table 9, we can see that the five algorithms WDS, PDS, NNM, OCS and POCS need, on average, no more than 100 iterations before achieving clustering on each of the tested incomplete data sets, all showing good convergence and a high convergence rate. The NPS algorithm occasionally fails to converge, which confirms the drawback that its convergence lacks a theoretical basis. Overall, the number of iterations of bi-level alternating optimization algorithms such as WDS, PDS and NNM is smaller than that of tri-level alternating optimization algorithms such as OCS, NPS and POCS. The number of iterations of the POCS algorithm is generally close to that of OCS, which shows that POCS does not incur too much computational overhead even though the local distribution information of missing values is also involved in the iterations.
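The stopping behaviour described above corresponds to the usual termination rule of alternating optimization. A generic skeleton of a tri-level loop of this kind is sketched below, with the three update steps passed in as placeholder callables, since the actual update formulas are those derived earlier in the paper:

```python
import numpy as np

def tri_level_ao(V0, update_U, update_missing, update_V, objective,
                 eps=1e-5, max_iter=100):
    """Generic tri-level alternating optimization loop: alternately update the
    partition matrix, the imputed missing values and the cluster prototypes
    until the prototypes stabilize or the iteration limit is reached."""
    V = np.asarray(V0, dtype=float)
    history = []
    for t in range(max_iter):
        U = update_U(V)                        # level 1: memberships for fixed prototypes
        M = update_missing(U, V)               # level 2: granule-guided imputations
        V_new = update_V(U, M)                 # level 3: prototypes for fixed U and imputations
        history.append(objective(U, M, V_new))
        if np.max(np.abs(V_new - V)) < eps:    # termination on prototype change
            return V_new, history
        V = V_new
    return V, history
```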



6. Conclusions

In this study, we describe missing values governed by Gaussian distributions as probabilistic information granules based on the nearest neighbors of incomplete data, and include them as part of a hybrid clustering model combining the optimal completion strategy with the maximum likelihood criterion. Under this framework, a novel tri-level alternating optimization is developed to cluster incomplete data. Experiments with synthetic and real-world data sets are carried out in order to show the usefulness of the proposed method. The accuracy of the results is evaluated by means of the prototype error index, with reference to the other two indices, namely classification error and imputation error. As can be seen from the results, the proposed method shows stable convergence, high accuracy and significant robustness. Probabilistic information granules of missing values rely on the inherent correlation among data, so that the local distribution characteristics of missing values they describe are more refined. The constructed clustering model introduces Gaussian information granules into the optimal completion strategy by means of the maximum likelihood criterion, which not only retains the advantages of the optimal completion strategy but also takes the distributions of missing values into account. The proposed method is equivalent to feeding the mined prior information of missing values forward into the OCS clustering process, which is useful for improving the clustering performance. Another advantage of the proposed method is that it can be solved by using gradient-based optimization. From the time complexity and the number of iterations of the proposed algorithm, it can be seen that POCS does not incur too much computational cost although probabilistic information granules have been introduced, which indicates the rationality of the proposed clustering model and optimization strategy.

Future work will mainly focus on enriching the forms of information granules used to describe the characteristics of the distributions of missing values. By improving the clustering methods, such more general information granules will be introduced into the clustering of incomplete data, yielding a better clustering performance.

Acknowledgements

This work was supported by National Natural Science Foundation of China (61472062), National Key Technology Support Program of China (2015BAF20B02), Canada Research Chair (CRC) Program and Natural Sciences and Engineering Council of Canada (NSERC).

References


[1] R. J. Hathaway, J. C. Bezdek, Fuzzy c-means clustering of incomplete data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 31 (5) (2001) 735-744.

[2] J. K. Dixon, Pattern recognition with partly missing data, IEEE Transactions on Systems, Man and Cybernetics 9 (10) (1979) 617-621.


[3] T. Furukawa, S. I. Ohnishi, T. Yamanoi, A study on a fuzzy clustering for mixed numerical and categorical incomplete data, In: Proceedings of 2013 International Conference on Fuzzy Theory and Its Application, 2013, pp. 425-428.

[4] T. Guan, B. Q. Feng, Fuzzy clustering of incomplete nominal and numerical data, In: Proceedings of


the 5th World Congress on Intelligent Control and Automation, 2004, pp. 2331-2334.


[5] Q. C. Zhang, Z. K. Chen, A distributed weighted possibilistic c-means algorithm for clustering incomplete big sensor data, International Journal of Distributed Sensor Networks (2014) 430814: 1-8.


[6] D. Q. Zhang, S. C Chen, Clustering incomplete data using kernel-based fuzzy c-means algorithm, Neural Processing Letters 18 (2003) 155-162.


[7] L. Himmelspach, S. Conrad, Fuzzy clustering of incomplete data based on cluster dispersion, Lecture Notes in Computer Science 6178 (2010) 59-68.


[8] C. P. Lim, J. H. Leong, M. M. Kuan, A hybrid neural network system for pattern classification tasks with missing features, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (4) (2005) 648-653.

[9] A. Balkis, S. B. Yahia, A new algorithm for fuzzy clustering handling incomplete dataset, International Journal on Artificial Intelligence Tools 23 (4) (2014) 1460012: 1-21. [10] P. R. Adhikari, J. Hollmen, Fast progressive training of mixture models for model selection, Journal


of Intelligent Information Systems 44 (2) (2015) 223-241. [11] T. I. Lin, J. C. Lee, H. J. Ho, On fast supervised learning for normal mixture models with missing information, Pattern Recognition 39 (6) (2006) 1177-1187. [12] T. I. Lin, Learning from incomplete data via parameterized t mixture models through eigenvalue


decomposition, Computational Statistics and Data Analysis 71 (2014) 183-195. [13] A. Y. Zhou, F. Cao, Y. Yan, C. F. Sha, X. F. He, Distributed data stream clustering: A fast EM-based approach, In: Proceedings of the 23rd International Conference on Data Engineering, 2007, pp. 736-745.


[14] X. N. Zhang, S. J. Song, C. Wu, Robust Bayesian classification with incomplete data, Cognitive Computation 5 (2) (2013) 170-187.

[15] K. Honda, N. Sugiura, H. Ichihashi, Simultaneous approach to principal component analysis and fuzzy clustering with missing values, In: Annual Conference of the North American Fuzzy


Information Processing Society, 2001, pp. 1810-1815.

[16] K. Honda, H. Ichihashi, Linear fuzzy clustering techniques with missing values and their application


to local principal component analysis, IEEE Transactions on Fuzzy Systems 12 (2) (2004) 183-193. [17] K. Honda, R. Nonoguchi, A. Notsu, H. Ichihashi, PCA-guided k-means clustering with incomplete


data, In: IEEE International Conference on Fuzzy Systems, 2011, pp. 1710-1714. [18] T. S. Chou, K. K. Yen, L. W. An, N. Pissinou, K. Makki, Fuzzy belief pattern classification of


incomplete data, In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics, 2007, pp. 535-540.


[19] K. Siminski, Clustering with missing values, Fundamenta Informaticae 123 (3) (2013) 331-350. [20] M. Ghannad-Rezaie, H. Soltanian-Zadeh, H. Ying, M. Dong, Selection-fusion approach for classification of datasets with missing values, Pattern Recognition 43 (6) (2010) 2340-2350.

[21] R. J. Hathaway, J. C. Bezdek, Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm, Pattern Recognition Letters 23 (1-3) (2002) 151-160. [22] T. Yamamoto, K. Honda, A. Notsu, H. Ichihashi, FCMdd-type linear fuzzy clustering for incomplete 49


non-Euclidean relational data, In: Proceedings of IEEE International Conference on Fuzzy Systems, 2011, pp. 792-798. [23] H. Timm, C. Doring, R. Kruse, Different approaches to fuzzy clustering of incomplete datasets, International Journal of Approximate Reasoning 35 (3) (2004) 239-249.


[24] Y. Y. He, H. M. Yousuff, J. W. Ma, B. Shafei, G. Steidl, A new fuzzy c-means method with total variation regularization for segmentation of images with noisy and incomplete data, Pattern Recognition 45 (9) (2012) 3463-3471.

[25] S. Wu, H. Chen, X. D. Feng, Clustering algorithm for incomplete data sets with mixed numeric and


categorical attributes, International Journal of Database Theory and Application 6 (5) (2013) 95-104. [26] Y. H. Ding, A. Ross, A comparison of imputation methods for handling missing scores in biometric fusion, Pattern Recognition 45 (3) (2012) 919-933.

[27] S. C. Zhang, Nearest neighbor selection for iteratively kNN imputation, Journal of Systems and


Software 85 (11) (2012) 2541-2552.

[28] H. J. Van, T. M. Khoshgoftaar, Incomplete-case nearest neighbor imputation in software


measurement data, Information Sciences 259 (2014) 596-610. [29] D. Li, C. Q. Zhong, S. Q. Wang. A fuzzy c-means approach for incomplete data sets based on

PT

nearest-neighbor intervals. Applied Mechanics and Materials 411-414 (2013) 1108-1111. [30] W. Pedrycz, R. Al-Hmouz, A. Morfeq, A. S. Balamash, Distributed proximity-based granular

CE

clustering: towards a development of global structural relationships in data, Soft Computing (online). [31] Z. H. Bing, L. Zhang, L. Y. Zhang, B. L. Wang. The particle swarm and fuzzy c-means hybrid

AC

method for incomplete data clustering. ICIC Express Letters 7 (8) (2013) 2437-2441.

[32] L. Zhang, Z. H. Bing, L. Y. Zhang, A hybrid clustering algorithm based on missing attribute interval estimation for incomplete data, Pattern Analysis and Applications 18 (2) (2015) 377-384.

[33] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. NY: Plenum Press, 1981. [34] W. Pedrycz, Knowledge-based clustering: from data to information granules. Hoboken, NJ: John Wiley & Sons, 2005. 50

ACCEPTED MANUSCRIPT

[35] S. Salehi, A. Selamat, H. Fujita. Systematic mapping study on granular computing. Knowledge-Based Systems 80 (2015) 78-97. [36] A. Ralph Henderson, Testing experimental data for univariate normality, Clinica Chimica Acta 366 (1-2) (2006) 112-129.

CR IP T

[37] N. Kim, The limit distribution of a modified Shapiro-Wilk statistic for normality to Type II censored data, Journal of the Korean Statistical Society 40 (3) (2011) 257-266.

[38] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (3-4) (1965) 591-611.

AN US

[39] S. S. Shapiro, R. S. Francia, An approximate analysis of variance test for normality, Journal of the American Statistical Association 67 (337) (1972) 215-216.

[40] R. J. Hathaway, Y. Hu, J. C. Bezdek, Local convergence of tri-level alternating optimization, Neural, Parallel, and Scientific Computation 9 (2001) 19-28.

13 (4) (2013) 1592-1607.

M

[41] B. A. Pimentel, R. M. C. R. Souza, A multivariate fuzzy c-means method, Applied Soft Computing

ED

[42] J. F. Kolen, T. Hutcheson, Reducing the time complexity of the fuzzy c-means algorithm. IEEE Transactions on Fuzzy Systems 10 (2) (2002) 263-267.

PT

[43] L. Y. Zhang, W. Pedrycz, W. Lu, X. D. Liu, L. Zhang, An interval weighed fuzzy c-means clustering by genetically guided alternating optimization, Expert Systems with Applications 41 (13) (2014)

CE

5960-5971.

[44] N. Abreu, Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional,

AC

Mestrado em Marketing, ISCTE-IUL, Lisbon, 2011.

[45] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2015.

[46] K. P. Bennett, O. L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods and Software 1 (1) (1992) 23-34. [47] R. A. Johnson, D. W. Wichern, Applied multivariate statistical analysis. New Jersey: Prentice-Hall, 51


1982.
[48] R. J. Hathaway, J. C. Bezdek, Optimization of clustering criteria by reformulation, IEEE Transactions on Fuzzy Systems 3 (2) (1995) 241-245.
