Multitask fuzzy Bregman co-clustering approach for clustering data with multisource features


Accepted Manuscript

Alireza Sokhandan, Peyman Adibi, Mohammadreza Sajadi

PII: S0925-2312(17)30595-7
DOI: 10.1016/j.neucom.2017.03.062
Reference: NEUCOM 18295

To appear in: Neurocomputing

Received date: 14 December 2015
Revised date: 28 January 2017
Accepted date: 22 March 2017

Please cite this article as: Alireza Sokhandan, Peyman Adibi, Mohammadreza Sajadi, Multitask Fuzzy Bregman Co-clustering Approach for Clustering Data with Multisource Features, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.03.062

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Multitask Fuzzy Bregman Co-clustering Approach for Clustering Data with Multisource Features

Alireza Sokhandan (1), Peyman Adibi (1,*), Mohammadreza Sajadi (2)

(1) Artificial Intelligence Department, Computer Engineering Faculty, University of Isfahan, Isfahan, Iran
(2) Mechatronics Engineering Department, School of Engineering Emerging Technologies, University of Tabriz, Tabriz, Iran

[email protected], [email protected], [email protected]
* Corresponding Author

Abstract


In typical real-world clustering problems, the set of features extracted from the data suffers from two problems that prevent accurate clustering. First, the features extracted from the samples provide poor information for the clustering task. Second, the feature vector usually has a high-dimensional, multi-source nature, which results in a complex cluster structure in the feature space. In this paper, we propose a combination of multi-task clustering and fuzzy co-clustering techniques to overcome these two problems. In addition, the Bregman divergence is used as the notion of dissimilarity in the proposed algorithm, in order to create a general framework in which any Bregman distance function consistent with the data distribution and the structure of the clusters can be employed. The experimental results indicate that the proposed algorithm overcomes the two mentioned problems and manages the complexity and weakness of the features, which results in appropriate clustering performance.

Keywords: Multitask Clustering, Fuzzy Co-clustering, Bregman Divergence

1 Introduction

Clustering is one of the basic problems in the field of machine learning, and it is used in various areas such as bioinformatics (protein structure analysis [1], genetic classification [2, 3]), trade and marketing (classification and analysis of customer behavior [4], classification of companies, and production chain management [5]), computer science (document classification [6], image segmentation [7]), social sciences (analysis of behavioral patterns of society [8], social media analysis [9]), medical applications (medical image analysis [10, 11]), and so on. The purpose of clustering is to assign the data samples to different groups such that the samples within a group are as similar as possible, while the samples in different groups have minimum similarity. With the growth of data in many real-world applications and advances in data processing systems, there is an increasing demand for powerful and efficient data processing and data mining algorithms. At the same time, the growing dimensionality and complexity of the data to be processed confront data processing and data mining algorithms, and above all the data clustering algorithms as their basic tools, with many challenges. The main sources of these challenges are the types of features used to describe the data and the dimensionality of the feature vectors. The raw data itself, due to its high dimensionality and the correlation of its components, is usually not used directly in the clustering and data mining algorithms. Instead, using feature extraction algorithms, a set of features is extracted from the data and combined into a feature vector or descriptor that represents the data; these descriptors are then used in the data mining algorithms. The weakness of the features and the complexity of the distribution of the feature vectors in the feature space are the two main challenges in the field of data clustering. The main purpose of this paper is to provide a solution that reduces the impact of these factors on the performance of clustering algorithms.

The structural complexity of the distribution of the feature vectors is a common challenge in many applications. In several types of real-world problems, the feature space is formed by combining various features obtained from different feature sources. These features describe the data from different aspects and usually produce different cluster structures. The inconsistency of the cluster structures produced by different features makes it difficult to put them together to achieve a unified clustering. Clustering countries economically is a good example. The features available for each country include the unemployment rate, the amount of exports and imports, and the gross production rate. Considering each of these attributes separately yields a different cluster structure, but putting these features together into a new feature space makes the clustering process very difficult.

The challenge of weak features arises when the features describing the data contain unreliable information for the clustering task at hand. In these cases, the clustering algorithm cannot extract the correct and appropriate cluster structures from the data. For example, if the only feature used in an image segmentation task is the location of the pixels, the resulting segmentation would be completely wrong.

In the last decade, different algorithms and methods have been proposed to solve each of these two problems. Collaborative clustering algorithms [12, 13, 14] were the most common solution for clustering data with multi-source features and high structural complexity. In these algorithms, the data are first clustered separately based on the feature blocks of each source. Then, using the results obtained from the clustering of each source, a new feature space is defined, and clustering this new feature space yields the overall clustering of the data. The passive behavior of these algorithms is their main weakness: no distinction is made between the different sources. In fact, the structural differences between clusters in different sources, such as differences in the ranges of their attributes or the clustering power of each source, are ignored. Because of these limitations, researchers have looked for methods that work directly with the feature blocks of the different sources and control the impact of the features of each source on the final clustering in order to achieve an appropriate result. Co-clustering [15, 16, 17, 18] is one of the solutions proposed for this kind of clustering problem. In this approach, by assigning weights to each feature or set of features, not only are the data assigned to the clusters, but the assignments of the features to the clusters are also calculated. In this way, each feature or set of features plays a more influential role in a particular cluster. This balances the cluster structures of the different feature sources and produces an appropriate clustering result based on the whole feature space.

Regarding the weakness of the features, transfer learning is a useful approach for clustering. In this approach, a small amount of the data is partitioned in a supervised manner, and the bulk of the data is clustered based on the information received from this partitioning [19]. The need for supervision in part of the clustering process is the main drawback of this technique. Multi-task learning is one of the methods introduced after transfer learning, and it can handle the weakness of the features significantly [20, 21, 22, 23, 24]. In this method, it is assumed that although the existing features are weak for the considered clustering task, they may be appropriate for other tasks. As a result, if the data are clustered simultaneously based on the assumed task and the other ones, instead of being clustered based on a single task, the information received from the other clustering tasks can provide a strong guide for the assumed one.

In recent years, following these approaches, various algorithms have been proposed that focus on solving one of the two mentioned problems. Compared to the other algorithms, multi-task clustering methods, especially the ones based on the Bregman divergence [22, 23], co-clustering methods, and multi-source fuzzy clustering algorithms [25, 26] achieve better and more reliable results. Thus, a combination of these techniques (i.e., multi-task and fuzzy co-clustering with the Bregman divergence as the dissimilarity measure) may result in an algorithm that overcomes both of the mentioned challenges simultaneously. Such an algorithm can act as a co-clustering algorithm, a multi-task clustering algorithm, or a combination of both, by assigning different values to its parameters. In this paper, such an algorithm is introduced: by combining the multi-task clustering and fuzzy co-clustering ideas, it addresses the weakness of the data features and the complexity of the cluster structures simultaneously. The proposed algorithm uses the multi-task clustering framework presented by Zhang and Zhang [22]. This framework performs multi-task clustering by minimizing a cost function consisting of a local part and a global part. In minimizing the local part of the cost function, the goal is to cluster each task without considering the other clustering tasks; in the proposed algorithm, the fuzzy co-clustering algorithm with Bregman divergence [27] is used in this local part. In the global cost function minimization phase, the guiding observation is that if two clusters in two different tasks contain the same data with the same membership values, the centers of those clusters should also be similar to each other. As a result, this part of the cost function attempts to draw the centers of similar clusters in different tasks closer to each other. Overall, the goal of the proposed algorithm is to find three sets of unknown variables for each task: the centers of the clusters, the data membership values, and the impact factors of the feature sources. The proposed algorithm estimates these three unknowns in an iterative process based on minimization of the cost function.

In a nutshell, the main contribution of this paper can be summarized as follows. The multi-task clustering and fuzzy co-clustering techniques are combined in order to handle the weakness of the data features and reduce the structural complexity of the clusters of multi-source features. In addition, the Bregman divergence is used as the notion of dissimilarity to deal with the nonlinearity of the data. The proposed algorithm offers multiple parameters, which let it operate in various modes: from fuzzy to crisp, from multi-source co-clustering to single-source clustering, and from multi-task to single-task clustering.

In order to evaluate the proposed algorithm, it is used to cluster several well-known datasets, and its results are compared with those of various multi-task clustering and fuzzy co-clustering methods. The experimental results indicate that, using the Euclidean distance as the simplest Bregman divergence, the proposed algorithm performs weaker than the multi-task kernel clustering method but achieves better results than the other multi-task clustering algorithms. By changing the Bregman divergence and choosing more suitable spaces, the proposed algorithm performs on par with the multi-task kernel clustering method. However, one should bear in mind that the choice of an appropriate kernel in kernel-based algorithms, as well as of an appropriate Bregman divergence here, imposes an extra cost on the system.

In the rest of this paper, the Bregman divergence is briefly introduced in Section 2. In Section 3, the formulation and the procedure of the proposed algorithm are discussed. Section 4 is dedicated to the experimental results. Finally, conclusions are given in Section 5.

2 Bregman Divergence

Data clustering is usually defined based on minimizing distances from the centers of the clusters or from the data points representing the clusters. Thus, the definition of the distance becomes a fundamental issue in this area. It has been shown that a large family of distance functions can be rewritten in a standard form called the Bregman divergence.

If φ(x) is a differentiable convex function, the Bregman divergence associated with this function is defined as follows [28]:

d_φ(x, y) = φ(x) − φ(y) − (x − y) · ∇φ(y)    (1)

where ∇ and · represent the operations of derivation and dot product, respectively. In other words, d_φ(x, y) is equal to the difference between φ(x) and a linear approximation of this value around the point y. Since φ(x) is a convex function, the value of d_φ(x, y) is always non-negative. For better understanding, the calculation of the Bregman divergence of a sample function is shown in Figure 1.

Figure 1: Bregman divergence between points x and y based on the function φ.

Like other distance functions, the Bregman divergence is used to quantify the similarity between two data points, but it does not satisfy all the properties of a standard distance. The Bregman divergence is always non-negative, but it can be asymmetric (d_φ(x, y) ≠ d_φ(y, x)) and it does not satisfy the triangle inequality. The Bregman divergence is also convex in its first argument, but not necessarily convex in the second one.

Another important property of the Bregman divergence used in data clustering is that the point minimizing the total divergence from a set of data points is the average of the data, independently of the base function φ. In other words:

x̄ = arg min_y Σ_{x ∈ X} d_φ(x, y) = (1 / |X|) Σ_{x ∈ X} x    (2)

where x̄ is the average of the data, X is the data set, and |X| denotes the cardinality of a set. A list of several well-known Bregman divergences and their base functions is given in Table 1. As indicated in this table, the Euclidean distance can be written in the form of a Bregman divergence. As a result, the classic k-means clustering algorithm can be considered a special case of clustering using the Bregman divergence [22, 29].

Table 1: Four conventional Bregman divergences with their base functions for the d-dimensional data space.

Divergence | Base function φ(x) | Bregman divergence d_φ(x, y)
Euclidean (Ecl) | Σ_{i=1}^{d} x_i² | Σ_{i=1}^{d} (x_i − y_i)²
Generalized Kullback–Leibler (KL) | Σ_{i=1}^{d} x_i log x_i | Σ_{i=1}^{d} ( x_i log(x_i / y_i) − x_i + y_i )
Itakura–Saito (IS) | −Σ_{i=1}^{d} log x_i | Σ_{i=1}^{d} ( x_i / y_i − log(x_i / y_i) − 1 )
Exponential (Exp) | Σ_{i=1}^{d} e^{x_i} | Σ_{i=1}^{d} ( e^{x_i} − e^{y_i} − (x_i − y_i) e^{y_i} )
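As a concrete illustration of Table 1, the short NumPy sketch below evaluates these four divergences element-wise on d-dimensional vectors. The function names and the example vectors are our own illustration and are not taken from the paper; the KL and Itakura–Saito divergences assume strictly positive inputs.

import numpy as np

def euclidean(x, y):
    # phi(x) = sum(x_i^2): squared Euclidean distance
    return np.sum((x - y) ** 2)

def generalized_kl(x, y):
    # phi(x) = sum(x_i log x_i): generalized Kullback-Leibler divergence
    return np.sum(x * np.log(x / y) - x + y)

def itakura_saito(x, y):
    # phi(x) = -sum(log x_i): Itakura-Saito divergence
    return np.sum(x / y - np.log(x / y) - 1)

def exponential(x, y):
    # phi(x) = sum(exp(x_i)): exponential-based Bregman divergence
    return np.sum(np.exp(x) - np.exp(y) - (x - y) * np.exp(y))

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.4, 0.3])
for name, d in [("Ecl", euclidean), ("KL", generalized_kl),
                ("IS", itakura_saito), ("Exp", exponential)]:
    print(name, d(x, y))  # all values are non-negative, and d(x, x) == 0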

3 The Proposed Algorithm

In this section, details of the proposed algorithm are explained. First, the formulation of the proposed method is given. Then, the parameters and the optimization process used in this algorithm are discussed.

3.1 Formulation

It is assumed that the data set is given as X = {x_i | i = 1, ..., N}. Also, we assume that there is no information about the data points except the features produced by the sources S = {s_j | j = 1, ..., P}, which are available in such a way that for each data point x_i the jth source produces a feature vector x_i^j of length d_j (the lengths of the feature vectors of different sources can be different). In this way, there is a descriptor for each datum x_i which consists of the set of features extracted from it, as follows:

x_i = { x_i^1 | x_i^2 | ... | x_i^P }    (3)

In order to cluster the data, several different clustering tasks are defined, and the data are clustered based on them simultaneously. These clustering tasks are defined as T = {T_t | t = 1, ..., M}, in which each task has properties of the form {n_t, m_t}, where n_t and m_t are two binary vectors of length N and P, respectively, that indicate which data points and which feature sources should be involved in this task:

n_t = { n_ti ∈ {0, 1} | i = 1, ..., N },   m_t = { m_tj ∈ {0, 1} | j = 1, ..., P }    (4)

The variable C_t indicates the number of clusters in the tth clustering task. For multi-task clustering, Zhang and Zhang [22] presented a general framework based on the minimization of a cost function, which is also used in our proposed algorithm, as follows:

J = Σ_{t=1}^{M} ( L_t + α G_t )    (5)

where L_t is the local cost function of the tth clustering task, regardless of the other tasks, and G_t is the global regularization of the cluster centers of the tth task based on the cluster centers of the other tasks. In this equation, α is a free parameter which indicates the importance of the regularization term. Different definitions can be used for these two parts of the cost function. As mentioned earlier, in this paper fuzzy co-clustering based on the Bregman divergence is used. Thus, the local cost function is defined based on the fuzzy co-clustering method and the Bregman divergence as follows:

L_t = Σ_{i=1}^{N} Σ_{k=1}^{C_t} n_ti (u_tik)^a D_t(x_i, v_tk)    (6)

where n_ti is defined in equation (4), u_tik indicates the degree of membership of the ith datum in the kth cluster of the tth clustering task, a is the fuzzification degree of the data membership values, and D_t(x_i, v_tk) is defined as:

D_t(x_i, v_tk) = Σ_{j=1}^{P} m_tj (λ_tkj)^b d_φ(x_i^j, v_tk^j)    (7)

where m_tj is defined in equation (4), λ_tkj indicates the importance of the features obtained from the jth source for the kth cluster in the tth clustering task, b is the fuzzification degree of the feature source importance factors, v_tk is the center of the kth cluster in this task, and d_φ is a Bregman divergence. In the calculation of D_t(x_i, v_tk), the vector v_tk is divided, based on the number of sources, into a set of vectors v_tk^j; the length of v_tk^j is equal to the length of the feature vector obtained from the jth source, i.e., d_j. In equation (6), if it is assumed that d_φ is the Euclidean distance, and knowing that Σ_i n_ti is the number of data samples participating in the tth clustering task, then L_t is essentially the cost function of the standard fuzzy c-means (FCM) algorithm [30]. Thus, the definition of D_t as in equation (7) generalizes the clustering method into a Bregman co-clustering type of algorithm.

For computing the distance function defined in equation (7), each data sample is first divided into a set of x_i^j's based on the features produced by the different sources. Then, the distance of the sample from the center of the kth cluster is calculated as a weighted sum of Bregman divergences between the parts of the data sample and the corresponding parts of the cluster center. The weights used for this purpose are the importance factors λ_tkj, whose degree of fuzzification, as noted before, is adjusted by the parameter b in equation (7). As mentioned earlier, the contributions of a datum and of the features of a source to the tth clustering task are controlled by n_ti and m_tj, respectively. As shown in equations (6) and (7), a zero value of n_ti or m_tj makes the ith datum or the features of the jth source play no role in the clustering procedure of the considered task. It should be noted that in the proposed algorithm, unlike the traditional co-clustering methods, the λ parameters represent the importance factors of the feature sources in the clusters, not the importance factors of single features. However, if every individual feature is considered as a source (i.e., every source provides a feature vector of length 1 for every datum), then the λ's represent the importance factors of the features in the clusters.

It is obvious that the conditions Σ_{k} u_tik = 1 and u_tik ∈ [0, 1], considered in the FCM algorithm, are also imposed on u_tik in the proposed method. In order to control the importance factors of the feature sources, similar conditions, Σ_{j} λ_tkj = 1 and λ_tkj ∈ [0, 1], are also defined.

The comparison of the centers of the clusters in different tasks is used to define the global regularization function of the tth clustering task in equation (5). If two clusters in two different tasks include similar data with the same membership values, the centers of these clusters should be close to each other. Considering this fact, the term G_t is defined as follows:

G_t = Σ_{s=1, s≠t}^{M} Σ_{k=1}^{C_t} Σ_{l=1}^{C_s} w_tksl d(v_tk, v_sl)    (8)

In this equation, a weighted average of the distances of the cluster centers of the tth clustering task from the cluster centers of the other tasks, d(v_tk, v_sl), is calculated with the weights w_tksl. These distances and weights are calculated according to equations (9) and (10), respectively: d(v_tk, v_sl) is the Bregman divergence between the two cluster centers (equation (9)), and

w_tksl = Σ_{i=1}^{N} n_ti n_si ( 1 − 2 |u_tik − u_sil| ) / Σ_{i=1}^{N} n_ti n_si    (10)

The weight w_tksl represents the degree of similarity between the kth and lth clusters in the tth and sth clustering tasks, respectively, and its value lies in the range (−1, 1). The maximum value of this weight occurs when the data memberships in both clusters are the same for every common datum of the two tasks. On the other hand, if these common data have completely different degrees of membership in the two clusters, in the worst case the value of this weight is −1. In the first case, the centers of the clusters should be as close as possible, while in the latter case the centers of the clusters should be as far apart as possible; this goal is achieved when the term G_t is minimized. On the other hand, if the tth and sth tasks have no common data, i.e., the values of n_ti and n_si are never non-zero simultaneously, then according to equation (10) the value of w_tksl is undefined. But this case means that the tth and sth tasks do not share any data, so the clustering information of each of these tasks is not useful for the other one. Thus, in this case the value of w_tksl is considered to be zero, which means that the distances between the clusters of these two tasks have no importance in G_t and do not affect the cost function value.
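To make the local part of this formulation concrete, the following sketch evaluates D_t (equation (7)) and L_t (equation (6)) for one task under the notation introduced above. It is a minimal illustration, not the paper's implementation: the participation indicators n and m are assumed to be all ones, the Euclidean divergence is used, and the variable names are our own.

import numpy as np

def bregman_euclidean(x, v):
    # Squared Euclidean distance as the simplest Bregman divergence.
    return np.sum((x - v) ** 2)

def local_cost(X_blocks, centers, U, lam, a=2.0, b=2.0, d_phi=bregman_euclidean):
    # Fuzzy Bregman co-clustering local cost for one task (sketch of eqs. (6)-(7)):
    # X_blocks[j] is the (N x d_j) feature block of source j, centers[k][j] the
    # j-th block of the k-th center, U[i, k] the membership of datum i in cluster k,
    # and lam[k, j] the importance of source j for cluster k.
    N = X_blocks[0].shape[0]
    C = len(centers)
    cost = 0.0
    for i in range(N):
        for k in range(C):
            D = sum(lam[k, j] ** b * d_phi(X_blocks[j][i], centers[k][j])
                    for j in range(len(X_blocks)))   # equation (7)
            cost += U[i, k] ** a * D                 # equation (6)
    return cost

# Tiny example: 6 data points, 2 sources (3 and 2 features), 2 clusters.
rng = np.random.default_rng(0)
X_blocks = [rng.random((6, 3)), rng.random((6, 2))]
centers = [[rng.random(3), rng.random(2)] for _ in range(2)]
U = np.full((6, 2), 0.5)
lam = np.full((2, 2), 0.5)
print(local_cost(X_blocks, centers, U, lam))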

If equations (6) and (8) are substituted into equation (5), the final cost function is obtained. In this function, there are three sets of unknown values, namely the data memberships (U_t), the centers of the clusters (V_t), and the importance factors of the feature sources (Λ_t), which should be calculated by the proposed algorithm. To estimate these values, the following optimization problem (equation (11)) is defined and solved using a Lagrangian optimization framework:

min over {U_t, Λ_t, V_t}, t = 1, ..., M:   Σ_{t=1}^{M} ( L_t + α G_t )
s.t.  Σ_{k=1}^{C_t} u_tik = 1,  u_tik ∈ [0, 1];   Σ_{j=1}^{P} λ_tkj = 1,  λ_tkj ∈ [0, 1]    (11)

where U_t = {u_tik}, Λ_t = {λ_tkj}, and V_t = {v_tk} are the sets of the algorithm parameters mentioned above for the tth clustering task.

The Lagrangian optimization procedure and the details of the calculation of the above equations are explained in Section 3.3. The proposed optimization process, discussed in the following, starts with initial values for Λ and V and obtains the values of U, Λ, and V by executing the resulting updating equations (12)-(14) iteratively, as summarized in Figure 2.

Figure 2: Flowchart of the learning phase of the proposed clustering algorithm, in which the optimal values of the parameters U, λ, and V are estimated from the input data and tasks (Start → Initialize (for each task) → Update U, phase 1 → Update λ → Update U, phase 2 → Update V → Calculate cost function → Stop criterion (maximum iterations or minimum cost value) → Finish).
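The sketch below mirrors the loop of Figure 2 for a single task with α = 0, i.e., without the multi-task regularization term. The update formulas used here are standard FCM-style rules assumed for illustration only; the paper's own update rules are its equations (12)-(14), and the Euclidean divergence, variable names, and default parameters are our assumptions.

import numpy as np

def d_euclid(x, v):
    # Squared Euclidean distance, the simplest Bregman divergence.
    return np.sum((x - v) ** 2)

def memberships(X_blocks, V, lam, a, b):
    # Source-weighted distances D and FCM-style memberships U for one task.
    N, P, C = X_blocks[0].shape[0], len(X_blocks), len(V)
    D = np.array([[sum(lam[k, j] ** b * d_euclid(X_blocks[j][i], V[k][j])
                       for j in range(P)) for k in range(C)]
                  for i in range(N)]) + 1e-12
    U = D ** (-1.0 / (a - 1.0))
    return U / U.sum(axis=1, keepdims=True), D

def fit_single_task(X_blocks, C, a=2.0, b=2.0, max_iter=50, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, P = X_blocks[0].shape[0], len(X_blocks)
    lam = np.full((C, P), 1.0 / P)                      # Initialize: equal source importances
    V = [[blk[rng.integers(N)].copy() for blk in X_blocks] for _ in range(C)]  # random centers
    prev_cost = np.inf
    for _ in range(max_iter):
        U, D = memberships(X_blocks, V, lam, a, b)      # Update U, phase 1
        E = np.array([[sum(U[i, k] ** a * d_euclid(X_blocks[j][i], V[k][j])
                           for i in range(N)) for j in range(P)]
                      for k in range(C)]) + 1e-12
        lam = E ** (-1.0 / (b - 1.0))
        lam /= lam.sum(axis=1, keepdims=True)           # Update lambda
        U, D = memberships(X_blocks, V, lam, a, b)      # Update U, phase 2
        for k in range(C):                              # Update V: per-source weighted means
            w = U[:, k] ** a
            V[k] = [(w[:, None] * X_blocks[j]).sum(axis=0) / w.sum() for j in range(P)]
        cost = float((U ** a * D).sum())                # approximate cost for the stop test
        if abs(prev_cost - cost) < tol:                 # Stop criterion
            break
        prev_cost = cost
    return U, lam, V

rng = np.random.default_rng(1)
X_blocks = [rng.random((30, 4)), rng.random((30, 3))]   # one task, two feature sources
U, lam, V = fit_single_task(X_blocks, C=3)
print(U.shape, lam.shape, len(V))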

In the proposed algorithm, described in Figure 2, first the importance factors of the features (λ) and the centers of the clusters (v) are initialized. In this step, the importance factors of all feature sources are considered to be the same for all clusters, and the centers of the clusters are selected randomly. Then, based on λ and v, the algorithm calculates the data membership values u. Next, the importance factors of the features are updated, and the values of the data memberships are re-calculated based on these updated coefficients. Finally, the centers of the clusters are updated based on the new values of u and λ. These steps are iterated until the termination condition is satisfied, i.e., when a maximum allowed number of iterations is reached or a minimum acceptable value of the cost function of equation (5) is obtained.

In the proposed clustering procedure, the values of the three sets of variables u, λ, and v are updated iteratively, based on the Lagrangian optimization and the gradient descent method, to minimize the cost function and achieve the best possible clustering result. It is worth mentioning that, because of the non-convex nature of the Bregman divergence and the greedy nature of the gradient descent method, there is no guarantee that the proposed algorithm converges to the global minimum of the cost function, and it is possible that the algorithm gets stuck in an undesired critical point (e.g., a local minimum). Even so, through multiple runs with random initializations of these three sets of variables, it is possible for the proposed clustering method to achieve the optimal clustering result.

Like several basic fuzzy clustering algorithms, the data membership values can be initialized instead of the centers of the clusters [31, 32]. However, our experiments show that for large data sets, since each cluster contains a large amount of data, the random initialization of the membership values causes each initial cluster to contain approximately the same amount of data from each real cluster. This results in very close cluster centers in the initial steps of the algorithm, which leads to poor clustering quality. Another benefit of initializing the cluster centers is that different conditions can be defined in this case to enforce a distance between the clusters (e.g., minimum or maximum distances between the initial cluster centers), whereas defining such conditions when initializing the membership values is very difficult. Therefore, random initialization of the cluster centers is used in the proposed algorithm. It is worth mentioning that several conditions are defined for the random initialization of the cluster centers, which are explained in Section 4.1.5.

The above-mentioned formulation encounters a problem in the particular case where two tasks have no common samples or sources. In this case, for each pair of clusters, the weight w_tksl or the distance d(v_tk, v_sl) is zero. Therefore, the global regularization term G_t is completely removed from the clustering procedure, and the multi-task clustering algorithm reduces to multiple local clustering tasks. To solve this problem, the membership values of all data, regardless of their n_ti values, and the centers of the clusters for all sources, regardless of their m_tj values, are calculated. It is worth mentioning that, even though the membership values of all the data are calculated, the data which do not take part in the clustering task play no role in the calculation of the feature importance factors (λ) and the centers of the clusters (v). The same holds for the feature sources which do not take part in the clustering task. As a result, equations (9), (10), and (14) change to equations (15), (16), and (17), respectively.

3.2 The Parameters and Execution Modes

As mentioned before, one of the aspects of the proposed algorithm is its ability to operate in different modes by adjusting its parameters. The way of adjusting these parameters is explained below. More details about these settings and the evaluation of their results are given in Section 4.1.5.

1st Mode: According to equation (5), if the value of the parameter α is set to zero, the impact of the global regularization term on the clustering process is ignored; hence the algorithm loses its multi-task nature, and the clustering tasks are performed separately.

2nd Mode: According to equation (12), if the value of the parameter a (in equation (6)) approaches one, then for each datum only the membership value of its nearest cluster will be one, and the rest of the membership values will be zero. Therefore, the algorithm loses its fuzzy aspect and reduces to a crisp clustering method.

3rd Mode: According to equation (13), if the value of the parameter b (in equation (7)) approaches infinity, the importance factors of all sources become the same. Thus, the algorithm disregards the importance factor of each source and loses its co-clustering aspect.

For example, based on the above-mentioned cases, if the value of the parameter α is set to 0, the values of the parameters a and b go toward 1 and ∞, respectively, and the Euclidean distance is used, the proposed algorithm behaves like the k-means algorithm.

3.3 The Optimization Process

In this section, the minimization of the cost function (equation (11)), which results in the updating equations for U, λ, and V (equations (12), (13), and (14)), is described. Considering the constraints defined on U and λ in equation (11), the Lagrangian function for the tth task is defined as follows:

J̃_t = L_t + α G_t + Σ_{i=1}^{N} γ_ti ( 1 − Σ_{k=1}^{C_t} u_tik ) + Σ_{k=1}^{C_t} β_tk ( 1 − Σ_{j=1}^{P} λ_tkj )    (18)

where the γ's and β's are the Lagrange multipliers.

3.3.1 Calculation of the Feature Importance Factors (λ)

To calculate λ, the values of U and V are assumed to be constant, and the derivative of the Lagrangian of each task (equation (18)) with respect to λ_tkj is set to zero (equation (19)). Replacing L_t from equations (6) and (7) and applying the derivation, equation (20) is obtained, in which E_tkj is used in place of Σ_{i=1}^{N} n_ti (u_tik)^a d_φ(x_i^j, v_tk^j) to simplify the relation. Solving this equation yields a relation for λ_tkj in terms of the Lagrange multiplier β_tk (equation (21)). Finally, by replacing λ_tkj from equation (21) into the constraint Σ_j λ_tkj = 1, the value of β_tk is found, and consequently the updating rule for λ_tkj is obtained as follows:

λ_tkj = (E_tkj)^{−1/(b−1)} / Σ_{j'=1}^{P} (E_tkj')^{−1/(b−1)}    (22)

3.3.2 Calculation of the Data Memberships (U)

To calculate U, the values of λ and V are assumed to be constant. For each task, the derivative of equation (18) with respect to u_tik is set to zero (equation (23)). Replacing L_t from equations (6) and (7) and applying the derivation yields equation (24), which can be rewritten as equation (25) by using the shorthand notation of equation (26) for the source-weighted Bregman distance D_t(x_i, v_tk) of the ith datum from the kth center. By substituting this quantity from equation (25) into equation (24) and removing the constant terms, a relation for u_tik in terms of the Lagrange multiplier γ_ti is obtained. Finally, by replacing this relation into the constraint Σ_k u_tik = 1, the value of γ_ti is found, and consequently the updating rule for u_tik is obtained as follows:

u_tik = (D_t(x_i, v_tk))^{−1/(a−1)} / Σ_{k'=1}^{C_t} (D_t(x_i, v_tk'))^{−1/(a−1)}    (27)

3.3.3 Calculation of the Cluster Centers (V)

To calculate V, the values of U and λ are assumed to be constant. Using L_t and G_t from equations (6) and (8), respectively, the cost function of each task can be written as a sum over the feature sources (equation (28)). Introducing shorthand notations for the data-dependent and regularization-dependent weights, the cost function of the jth source of the tth clustering task is written based on equation (28) as in equation (29). To estimate v_tk^j, the following optimization problem should be solved: the minimization of this per-source cost function with respect to v_tk^j (equation (30)).

Based on the characteristics of the Bregman divergence, the following property holds [33]: the weighted sum of divergences from a set of points to an arbitrary point can be decomposed into the weighted sum of divergences to their weighted mean plus a term, constant with respect to the points, involving the divergence from the weighted mean to that point (equation (31)). By applying this property to equation (30), equation (32) is obtained, and applying it once more to equation (32) yields equation (33). Thus, according to the fact that the minimum value of the distance function d_φ(x, y) is achieved when x = y, the optimal v_tk^j is equal to the corresponding weighted mean. By replacing the values of the weights, the variable v_tk^j is obtained (equation (34)) as a weighted combination of the participating data blocks x_i^j and the corresponding center blocks of the clusters of the other tasks. Since V_t is defined as {v_tk | k = 1, ..., C_t}, putting all the v_tk^j together yields the optimal value of V_t.

4 Experimental Results

In this section, the performance of the proposed algorithm is evaluated using a number of well-known datasets, and the results are compared with those of other clustering methods. The experiments are divided into three parts. In the first part, the proposed algorithm is used in a document classification application, and the results are compared with three multi-task clustering algorithms; furthermore, the effect of different values of the free parameters on the performance of the proposed algorithm is investigated. In the second part, the power of the proposed algorithm for data clustering in the presence of multi-source features is examined. Finally, in the last experiment, the proposed algorithm is compared with four different co-clustering algorithms.

4.1 Multi-Task Document Classification

As mentioned before, in this section the performance of the proposed algorithm is examined on a textual data classification problem. The results of the proposed algorithm are compared with those of the multi-task clustering algorithms MBC, S-MBC, and S-MKC, which use Zhang's multi-task framework. Moreover, at the end of this section, the effect of different values of the free parameters on the performance of the proposed clustering algorithm is studied.

4.1.1 The Experimental Dataset

To perform the experiments, the NG20 textual document database, which is widely used for the evaluation of clustering algorithms, is selected. This database contains about 20,000 messages collected from 20 different newsgroups, which are categorized into 6 main groups (such as computer, sports, politics, science, and religion) and 20 subcategories. The features of this database are extracted in the form of bag-of-words, so that each sample finally consists of 43,586 features. Using this database, three sets of experiments are conducted to examine different aspects of the proposed algorithm, as discussed below.

Same distribution and resolution of data (NG1): In the first experimental set, the textual materials of three classes are selected, and their samples are divided into two equal parts. In this way, we constitute two clustering tasks with three clusters each.

Same distribution, different data resolution (NG2): For the second set of experiments, four subcategories are used, of which the first two belong to the same main category and the other two belong to another main category. Four hundred samples are randomly selected from each subcategory, which results in two clustering tasks, one with four clusters (subcategories) and the other with two clusters (main categories).

Different data distribution (NG3): To create the data for the third set of experiments, four hundred data samples are randomly selected from all the classes, and three clustering tasks are created out of these different data classes, some of which have common categories. Complete information about these three experimental sets is shown in Table 2.

Table 2: Three different experimental sets extracted from NG20. The numbers in parentheses indicate which subcategories are used in each experiment.

Experimental set | Task | Number of Samples | Number of Features | Number of Clusters
NG1 | 1st | 1,493 | 43,586 | (6-8) 3
NG1 | 2nd | 1,493 | 43,586 | (6-8) 3
NG2 | 1st | 1,600 | 43,586 | (3, 4, 12, 15) 4
NG2 | 2nd | 1,600 | 43,586 | ([3, 4], [12, 15]) 2
NG3 | 1st | 4,000 | 43,586 | (1-10) 10
NG3 | 2nd | 4,800 | 43,586 | (4-15) 12
NG3 | 3rd | 6,400 | 43,586 | (5-20) 16
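For readers who want to reproduce a comparable setup, the sketch below builds an NG2-style pair of tasks from the 20 Newsgroups corpus with scikit-learn. The chosen category names, the preprocessing options, and the resulting vocabulary size are illustrative assumptions and will not reproduce the paper's exact 43,586-dimensional representation or sample counts.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Illustrative NG2-style setup: two subcategories from one main group and two from another.
categories = ["comp.graphics", "comp.windows.x", "rec.autos", "rec.motorcycles"]
news = fetch_20newsgroups(subset="train", categories=categories,
                          remove=("headers", "footers", "quotes"))

# Bag-of-words features (the paper reports 43,586 features over its corpus).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(news.data)

# Task 1: cluster into 4 subcategories; Task 2: cluster into 2 main categories.
y_sub = np.array(news.target)
y_main = np.array([0 if news.target_names[t].startswith("comp") else 1
                   for t in news.target])
print(X.shape, np.bincount(y_sub), np.bincount(y_main))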

4.1.2 The Algorithm Parameters

As mentioned before, all of the multi-task clustering algorithms used in the experiments (the proposed method and the MBC, S-MBC, and S-MKC algorithms) are based on the multi-task framework proposed in [22]. In all of these algorithms, a fixed value of the global regularization factor α is selected. Also, since the S-MBC and MBC algorithms cannot be solved in closed form for asymmetric Bregman spaces, in all algorithms, including the proposed one, the Euclidean distance is used as the Bregman divergence base function. The Gaussian kernel is used in the S-MKC algorithm (as in [23]) for clustering the NG20 database.

In order to create the feature sources required by the proposed algorithm, the data is divided into 100 equal parts in terms of the features. As a result, each source produces feature vectors with about 440 dimensions. Both the fuzzification degree of the data memberships (a) and that of the feature source importance factors (b) are set to 4.
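Splitting a feature matrix into (roughly) equal feature blocks, as described above, can be done in one line with NumPy; the block count of 100 matches the setting above, while the data shape below is only an example.

import numpy as np

n_samples, n_features, n_sources = 500, 43586, 100
X = np.random.rand(n_samples, n_features)

# Split the feature axis into 100 nearly equal blocks (sources of ~436 features each).
source_blocks = np.array_split(X, n_sources, axis=1)
print(len(source_blocks), source_blocks[0].shape, source_blocks[-1].shape)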

4.1.3 Evaluation Method

The NMI factor is used to examine the clustering results [34]. This factor measures the similarity of two clustering results. The NMI criterion, defined in equation (35), is a number between 0 and 1, whose larger value indicates more similar clustering results:

NMI(Y, Y') = Σ_{i=1}^{c} Σ_{j=1}^{c'} n_{i,j} log( n · n_{i,j} / (n_i n'_j) ) / sqrt( ( Σ_{i=1}^{c} n_i log(n_i / n) ) ( Σ_{j=1}^{c'} n'_j log(n'_j / n) ) )    (35)

In this equation, Y and Y' are the two compared clustering results, c and c' are the numbers of clusters used by Y and Y', respectively, n_i is the number of data points in the ith cluster of Y, n'_j is the number of data points in the jth cluster of Y', n_{i,j} is the number of data points in the ith cluster of Y as well as in the jth cluster of Y', and n is the total number of data points. The quality of the clustering algorithms can be evaluated using this criterion by comparing their results with the ground truth.
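For reference, this criterion can be computed in practice with scikit-learn, which implements the same normalized mutual information; this is only a usage illustration, not the paper's own code, and the toy labels below are arbitrary.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Ground-truth labels and a clustering result for 10 samples (toy example).
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
pred  = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2, 2])

# NMI in [0, 1]; larger values indicate more similar partitions.
# 'geometric' averaging matches the sqrt-normalized form of equation (35).
nmi = normalized_mutual_info_score(truth, pred, average_method="geometric")
print(f"NMI = {100 * nmi:.2f}%")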

4.1.4 Results

The proposed clustering method, the k-means algorithm, and the other mentioned clustering techniques are evaluated on the three experimental sets in 10 random trials. The average NMI values computed for the results of these methods are shown in Table 3. It can be seen from the reported results that when the target clustering data follow the same distribution, the MBC and S-MBC multi-task clustering methods achieve better clustering performance than local clustering. However, for complex data distributions (as in the NG3 experimental set), the improvements in the clustering results are not very significant. In addition to exploiting the positive aspects of multi-task clustering, using subspace clustering helps the proposed algorithm obtain better results than the MBC and S-MBC algorithms by managing the impact of the different feature sources in the clustering process. In the third experiment, where the data do not follow the same distribution, the multi-task clustering techniques cannot benefit from their positive attributes, and all the algorithms yield lower NMI values than in the first two experiments. However, the proposed algorithm shows better performance than the k-means, MBC, and S-MBC algorithms in all the experiments.

The results of S-MKC reported in Table 3 are better than those of the proposed method with the mentioned settings (using the Euclidean distance as the Bregman divergence base function). However, the ability of the Bregman divergence to use different distances helps to improve the results of the proposed method, as will be seen below.

Table 3: Clustering results of the experimental sets with different clustering algorithms based on the NMI criterion (%).

Exp. set | Task | K-means | MBC | S-MBC | S-MKC | Proposed
NG1 | 1st | 14.13 | 26.27 | 27.65 | 45.47 | 34.92
NG1 | 2nd | 12.50 | 27.23 | 28.77 | 49.01 | 35.18
NG2 | 1st | 10.30 | 20.48 | 21.19 | 40.82 | 30.45
NG2 | 2nd | 18.25 | 23.66 | 24.04 | 37.79 | 32.83
NG3 | 1st | 6.01 | 6.69 | 16.20 | 27.54 | 24.36
NG3 | 2nd | 7.98 | 7.84 | 25.32 | 28.47 | 25.12
NG3 | 3rd | 9.82 | 18.63 | 21.43 | 27.25 | 24.92

As can be seen from these experimental results, the S-MKC algorithm works better than the other algorithms, which is due to the ability of kernel-based algorithms to process data with non-linear transformations, in addition to the multi-task nature of this clustering algorithm. However, it should also be noted that since selecting an inappropriate kernel can lead to inaccurate results, the difficulty of selecting a proper kernel and its high computational cost are major weaknesses of this algorithm.

In the second evaluation phase, the ability of the Bregman divergence to use different functions, and their effect on the clustering results, is studied. For this purpose, the above three sets of experiments are conducted with the proposed algorithm using four common Bregman base functions (the Euclidean space and three other ones). The equations of these base functions and the clustering results obtained with them are shown in Table 1 and Figure 3, respectively. Furthermore, the clustering results of the S-MKC algorithm, which achieved the best results in the previous phase of the experiments, are included in Figure 3.

Figure 3: Clustering results (NMI (%)) on the experimental sets NG1-NG3 (per task) using the proposed algorithm with the Euclidean, generalized Kullback–Leibler, Itakura–Saito, and Exponential Bregman divergence base functions, compared with the S-MKC algorithm.

As Figure 3 indicates, the proposed algorithm shows acceptable results when the Itakura-Saito (IS) base function is used. Moreover, compared to the S-MKC clustering algorithm, the proposed method obtains better results in four of the seven clustering tasks. By treating the difference as a random variable and using a t-test evaluation [35], the resulting value of t equals 1.007. Thus, on the above experimental sets, with a probability of 72%, the proposed algorithm with the Itakura-Saito Bregman function works better than S-MKC.

The obtained results indicate the significant effect of the Bregman base function on the clustering results. It can be argued that, similarly to kernel methods, using different Bregman functions the calculations are carried out in other spaces, in which the clustering algorithm deals better with the complexity of the data distribution than in the original space. Moreover, Figure 3 shows a rather similar behavior of the proposed algorithm with each specified base function across the different tasks; only minor changes are observed in the results when the generalized Kullback–Leibler (KL) base function is used, which can be regarded as a direct effect of the data values in the calculation of d_φ based on the KL function. According to equation (7), the first argument of the divergence corresponds to the data, and the data distribution affects the KL base function more strongly than the other three functions.

Therefore, it can be stated that when the proper Bregman function is identified, the proposed algorithm clusters the data more accurately than the other relevant methods.

4.1.5 The Effect of Free Parameters

According to the cost function of the proposed clustering algorithm (equation (5)), there are three free parameters whose correct initialization affects the clustering quality.

Global regularization factor (α): This parameter can accept values in the range [0, ∞). If this value is selected to be zero, the global regularization will not have any role in the clustering process, and the method turns into several local clustering algorithms. On the other hand, if its value approaches infinity, the local part of the algorithm loses its effect, and the algorithm only makes the centers of similar clusters close to each other and the centers of dissimilar clusters distant from each other; in this case, how much closer or more distant the cluster centers get depends completely on the initial positions of the centers.

Data membership fuzzification degree (a): This parameter can accept values in the range (1, ∞). Based on equation (11), if this parameter approaches 1, the fuzzy clustering turns into a crisp clustering, and for each datum only the membership value of its nearest cluster will be one, while the rest of the membership values will be zero. On the other hand, if its value goes to infinity, the memberships become as fuzzy as possible and, for each datum, the membership values of all clusters become the same, which causes mis-clustering.

Fuzzification degree of the feature source importance factors (b): Like the previous parameter, this parameter accepts values in the range (1, ∞). Based on equation (12), if the parameter value approaches 1, then for each cluster only the importance factor of the one feature source which produces the most cluster-like distribution for that cluster will be 1. Because the importance factors of the other sources for this cluster will be 0, those features lose their effect in this cluster; in other words, each cluster is formed based on only one feature source, the most proper one for defining that cluster. On the other hand, if the parameter value approaches infinity, the importance factors of all sources become the same, and the clustering algorithm loses its co-clustering aspect.

In order to investigate the effect of the above-mentioned parameters, the proposed algorithm is executed on the previous experimental sets with different values of these three parameters, α, a, and b, changed in the ranges [0, 3], [1, 5], and [1, 5], respectively; the corresponding performances are shown in Figure 4, Figure 5, and Figure 6. As shown in Figure 4, a zero value of the global regularization factor α reduces the algorithm to a local clustering algorithm, while assigning high values to this parameter causes failure of the local section of the clustering algorithm, so that the proposed algorithm cannot use all of its capacity for the clustering purpose. As shown in Figure 4, values in the range [0.5, 1.5] are the proper ones for this parameter, and leaving this range decreases the algorithm's performance. The effect of different values of the data membership fuzzification degree a is shown in Figure 5. The value of 1 reduces the algorithm from a fuzzy clustering method to a crisp one; although this does not produce bad results, it does not improve the performance of the algorithm either. The higher the value of this factor, the higher the quality of the clustering, until it reaches saturation. As mentioned earlier, continuing this procedure and assigning very high values to this parameter causes failure in the execution of the algorithm.


As shown in Figure 6, changing the fuzzification degree of the feature source importance factors b shows a rather similar behavior to changing the data membership fuzzification degree. The difference is that assigning the value of 2 to this parameter causes a local minimum, and the algorithm has a lower performance at this point. The reason could be the linear calculation that occurs when the value of 2 is assigned to this parameter (see equation (13)).

Figure 4: The performance of the proposed algorithm (NMI (%)) on the NG1-NG3 tasks when the global regularization factor α is changed in the range [0, 3].

Figure 5: The proposed algorithm's performance (NMI (%)) on the NG1-NG3 tasks when the data membership fuzzification degree a is changed in the range [1, 5].

Figure 6: The proposed algorithm's performance (NMI (%)) on the NG1-NG3 tasks when the fuzzification degree of the feature source importance factors b is changed in the range [1, 5].

In addition to the three mentioned parameters, which play a direct role in the performance of the proposed algorithm, two other factors also affect the clustering performance, though indirectly. They are introduced and examined in the following.

Initialization of the cluster centers: In clustering algorithms, the most common method for initialization of the cluster centers is random selection. However, if some intelligence is used in this phase, better results can be achieved in a shorter period of time. We used three different center initialization methods for the proposed algorithm and examined its performance in these three situations (a code sketch of the latter two is given after Figure 8):

 Completely random selection.
 Random selection, with the distance between two centers required to be higher than the average distance between data points.
 The initial center selection method introduced in the k-means++ algorithm [36].

The clustering results of the proposed algorithm for these three initialization methods, in terms of the NMI criterion and the execution time, are shown in Figure 7 and Figure 8, respectively. As can be observed, the completely random selection method results in worse clustering quality than the two other methods. The other two methods obtain more or less similar results. The initialization method of the k-means++ algorithm results in a lower execution time, but also lower clustering accuracy. The random selection method that enforces the average distance between data points as the minimum distance between cluster centers has a higher execution time than the other two methods, but achieves better results. Moreover, since this algorithm is not intended to be used online, it can be a proper choice for initialization of the cluster centers.

Figure 7: The performance (NMI (%)) of the proposed algorithm on the NG1-NG3 tasks, using three different methods (k-means++, average-distance random, completely random) for initialization of the cluster centers.

Figure 8: The execution time (in seconds) of the proposed algorithm on the NG1-NG3 experimental sets, using three different methods for initialization of the cluster centers.
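The following is a minimal sketch of the two non-trivial initialization strategies in the list above. The function names, the use of squared Euclidean distance, the pairwise-distance subsample, and the retry cap are our own assumptions, not details taken from the paper.

import numpy as np

def init_random_min_dist(X, n_clusters, rng):
    # Random selection, rejecting candidates closer to an already chosen center
    # than the average pairwise distance between (a sample of) data points.
    sample = X[rng.choice(len(X), size=min(len(X), 200), replace=False)]
    avg_dist = np.mean([np.linalg.norm(a - b) for a in sample for b in sample])
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        for _ in range(100):  # rejection sampling with a retry cap
            cand = X[rng.integers(len(X))]
            if all(np.linalg.norm(cand - c) >= avg_dist for c in centers):
                break
        centers.append(cand)
    return np.array(centers)

def init_kmeans_pp(X, n_clusters, rng):
    # k-means++ style seeding: pick each new center with probability
    # proportional to its squared distance from the nearest chosen center.
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
print(init_random_min_dist(X, 3, rng).shape, init_kmeans_pp(X, 3, rng).shape)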

The number of feature sources and their dimensions: As with the dataset used in this section (NG20), it is possible that information about the feature sources is not available. Therefore, in order to use the proposed algorithm, the user has to separate the features and introduce the sources. The larger the number of sources (up to a maximum of the feature vector dimensionality), the greater the ability to control the importance of the features. However, increasing the number of sources increases the computational cost, while, due to the possibility of redundancy between features, there is no guarantee of an increase in the clustering quality. On the other hand, by reducing the number of sources and increasing their dimensions, the computational cost is decreased, while the structure of the clusters becomes more complicated, so that the definition of an importance factor for each source cannot control this complexity. In general, the selection of the number of sources is a compromise between the computational cost and the clustering quality, about which the user should decide.

To investigate the effect of the number and dimensions of the feature sources, the experimental sets are clustered in 10 different settings, in which the number of sources is changed from 10 to 100. The results are shown in Figure 9. These results show that dividing the features into a number of sources in the range (20, 50) does not lead to a good clustering performance. However, it should generally be noted that the number of sources cannot be generalized to all datasets, and it is completely dependent on the structure of the dataset. Also, because there is no exact information about the cluster structure and how the features are distributed, it cannot be claimed that a larger number of feature sources yields better clustering results. Therefore, if there is no accurate information about the number and properties of the sources, a similar procedure must be applied to find the proper number of sources that results in appropriate clusters.

Figure 9: The performance of the proposed algorithm (NMI (%)) on the NG1-NG3 tasks based on changes in the number of feature sources (from 10 to 100).

4.2 Multi-Source Features

In this experiment, the ability of the proposed algorithm to manage the effect of multi-source features is investigated. For this purpose, the proposed algorithm is used to cluster three real datasets in a single-task mode, and its results are compared directly with the multi-source clustering algorithm introduced in [26].

4.2.1 Experimental Datasets

In this section, three real datasets from the UCI repository [37] are used to investigate the algorithm, as described below.

Iris: Contains four features describing 150 data points, which are grouped into two sources:
 D1: Length and width of the petals.
 D2: Length and width of the sepals.

Boston city information: Contains descriptions of different areas of Boston using eight features, which are grouped into three feature sources:
 D1: Crime rate, density of nitrogen oxide, and student-teacher ratio, describing the quality of the environment.
 D2: Average age, number of rooms in each house, and distance from employment centers, describing the state of the environment.
 D3: Number of people in each building block and the percentage of poor people, describing the population properties.

Fire information: Contains nine features used to predict fire in 517 forest fire reports from northern Portugal, from which three data sources are produced as follows:
 D1: Geographical coordinates.
 D2: The four parameters of the FWI (fire weather index).
 D3: Temperature, humidity, and wind speed, describing the weather conditions.

4.2.2 Evaluation Method

To investigate the strength of clustering multi-source data and the ability to maintain the structure of the different sources, the following evaluation criterion is used [26]:

Q = Σ_{j=1}^{P} J(U | D_j) / J_j(U_j)    (36)

In this equation, J(U | D_j) is the energy (cost) function evaluated with the general membership matrix U with respect to the jth source, and J_j(U_j) is the cost function when only the features of the jth source are used for clustering. In general, J(U | D_j) / J_j(U_j) quantifies how proper the clustering is with respect to the jth source (the value of 1 indicates the best case). It is obvious that J(U | D_j) ≥ J_j(U_j), which results in Q ≥ P (Q = P corresponds to an ideal clustering result in which the clusters are in their ideal configurations for all sources). Therefore, Q can be used as a criterion to investigate the capability of the algorithm in clustering multi-source data such that the cluster structure of each source is maintained as much as possible. It should be noted that since the cost functions of the two algorithms (the proposed method and the version of FCM proposed in [26]) are different, after clustering by each algorithm the cost function of the standard FCM algorithm is used as J(·) in the criterion of equation (36), in order to have a fair comparison.
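A small sketch of how such a criterion can be evaluated is given below. It assumes the notation introduced above, uses a plain fuzzy c-means implementation as the per-source reference clustering, and uses squared Euclidean distances; all of these are illustrative choices, not the exact procedure of [26] or of the paper.

import numpy as np

def fcm_cost(X, U, V, m=2.0):
    # Standard fuzzy c-means cost: sum_i sum_k u_ik^m ||x_i - v_k||^2.
    return float(sum(U[i, k] ** m * np.sum((X[i] - V[k]) ** 2)
                     for i in range(len(X)) for k in range(len(V))))

def fcm(X, C, m=2.0, iters=50, seed=0):
    # Minimal standard FCM used as the per-source reference J_j(U_j).
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), C, replace=False)]
    for _ in range(iters):
        D = np.array([[np.sum((x - v) ** 2) for v in V] for x in X]) + 1e-12
        U = D ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
        V = (U.T ** m @ X) / (U.T ** m).sum(axis=1, keepdims=True)
    return U, V

def q_criterion(source_blocks, U_general, C, m=2.0):
    # Q = sum_j J(U_general | D_j) / J_j(U_j): the FCM cost of the general
    # memberships on each source, over the cost of clustering that source alone.
    q = 0.0
    for Xj in source_blocks:
        Vj = (U_general.T ** m @ Xj) / (U_general.T ** m).sum(axis=1, keepdims=True)
        Uj, Vj_own = fcm(Xj, C, m)
        q += fcm_cost(Xj, U_general, Vj, m) / fcm_cost(Xj, Uj, Vj_own, m)
    return q

rng = np.random.default_rng(2)
blocks = [rng.random((60, 2)), rng.random((60, 3))]
U_general, _ = fcm(np.hstack(blocks), C=3)        # stand-in for a multi-source result
print(q_criterion(blocks, U_general, C=3))        # ideally close to the number of sources (2)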

4.2.3 Results

In order to investigate the performance of the proposed algorithm and compare it with the algorithm presented in [26], the three mentioned datasets are clustered in five different settings, in which the data are clustered into 2 to 6 clusters. For this purpose, the Euclidean and Itakura-Saito distance functions are used as the Bregman divergence in the proposed algorithm. The obtained results are shown in Table 4 in terms of the mentioned criterion.

Table 4: The clustering results based on the criterion introduced in (36) on the Iris, Boston city information, and Fire information datasets, using the algorithm of [26] (denoted by I) and the proposed algorithm with the Euclidean (P-Ecl) and Itakura-Saito (P-IS) Bregman functions.

Dataset | Algorithm | C=2 | C=3 | C=4 | C=5 | C=6
Iris | I | 2.163 | 2.543 | 2.668 | 2.917 | 2.786
Iris | P-Ecl | 2.217 | 2.621 | 2.652 | 2.873 | 2.650
Iris | P-IS | 2.138 | 2.457 | 2.489 | 2.672 | 2.320
Boston city information | I | 3.457 | 4.192 | 4.459 | 4.638 | 4.814
Boston city information | P-Ecl | 3.872 | 4.432 | 4.532 | 4.534 | 4.721
Boston city information | P-IS | 3.247 | 3.932 | 4.281 | 4.302 | 4.421
Fire information | I | 3.568 | 3.812 | 4.225 | 4.583 | 4.844
Fire information | P-Ecl | 3.643 | 4.031 | 4.134 | 4.508 | 4.694
Fire information | P-IS | 3.293 | 3.412 | 3.872 | 4.152 | 4.521

The results presented in Table 4 show that considering Euclidean distance as Bregman distance function, the proposed algorithm and the clustering method of [26] have similar performances. In some cases, the proposed algorithm and in some others the algorithm of [26] produces better results. By applying the t-test and considering QI-QP-Ecl as its corresponding random variable, the value of t will be equal to -0.715 which shows that with the possibility of 62%, the algorithm of [26] performs better than the proposed one. In this case, by considering the Euclidean distance as a Bregman divergence function and not using the multitasking aspect, the cost functions of the two algorithms are similar to each other. The proposed algorithm uses a gradient descent search, but the algorithm of [26]

AC

When the Bregman divergence function is changed to Itakura-Saito, the clustering results of the proposed algorithm will be better than the algorithm of [26] in all cases. With this Bregman function, based on t-test, the proposed algorithm with the possibility of more than 95% performs better than the algorithm of [26]. The main reason for this performance can be the nonlinearity feature of Itakura-Saito function, which results in the better calculation of the degree of similarity between samples, which leads to the better performances compared to the ones resulted from the evolutionary algorithm used in [26]. It should also be noted that the algorithm of [26] is time-consuming, but the proposed algorithm has a better performance in terms of execution time. 4.3

4.3 Comparison with Co-clustering Algorithms






In the previous experiments, the proposed method is compared with other fuzzy clustering algorithms. In this section, for further evaluation, the proposed method is compared with several non-fuzzy co-clustering algorithms: NBVD (co-clustering by block value decomposition) [38], ICC (information-theoretic co-clustering) [39], HCC (hierarchical co-clustering) [40], and HICC (entropy-based hierarchical co-clustering) [41].

4.3.1 Experimental Datasets

In these experiments, the NG20 dataset is used again, and two famous experimental sets, widely used in the performance evaluation of document clustering papers, are created. The details of these experimental sets are listed in Table 5; a construction sketch follows the table.

Table 5: Details of the two experimental sets created from NG20.

Experimental Set | Newsgroups (Clusters) | Samples Per Group | Total Number of Samples
Multi5 | comp.graphics, rec.motorcycles, rec.sports.baseball, sci.space, talk.politics.mideast | 100 | 500
Multi10 | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns | 50 | 500
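For readers who wish to reproduce such subsets, the sketch below assembles a Multi5-like collection with scikit-learn's fetch_20newsgroups. The sampling seed, the TF-IDF representation, and the vocabulary size are illustrative assumptions (the paper's exact preprocessing is not specified in this excerpt), and the scikit-learn category name rec.sport.baseball corresponds to the entry listed as rec.sports.baseball in Table 5.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Newsgroups of the Multi5 set (scikit-learn naming).
multi5 = ['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball',
          'sci.space', 'talk.politics.mideast']

data = fetch_20newsgroups(subset='all', categories=multi5,
                          remove=('headers', 'footers', 'quotes'))

# Sample 100 documents per newsgroup, 500 in total (as in Table 5).
rng = np.random.default_rng(0)
docs, labels = [], []
for k in range(len(multi5)):
    idx = np.flatnonzero(np.asarray(data.target) == k)
    for i in rng.choice(idx, size=100, replace=False):
        docs.append(data.data[i])
        labels.append(k)

# A simple document-term representation, used here only for illustration.
X = TfidfVectorizer(stop_words='english', max_features=2000).fit_transform(docs)
print(X.shape)   # approximately (500, 2000)
```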

4.3.2 Evaluation Method

To compare the algorithms, the micro-averaged precision factor [42] is used. This factor measures the ratio of the number of correctly and incorrectly clustered samples, and is defined as follows:

$$ P(\hat{C}) = \frac{\sum_{k=1}^{C} a_k(\hat{C})}{\sum_{k=1}^{C} \left( a_k(\hat{C}) + b_k(\hat{C}) \right)} \qquad (37) $$

where $\hat{C}$ is the clustering result, and $a_k(\hat{C})$ and $b_k(\hat{C})$ respectively indicate the number of correctly and incorrectly clustered samples in the $k$th cluster. This criterion produces a value in the range [0,1]; a better clustering result generates a higher value.
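A minimal sketch of this measure is given below. Mapping each cluster to its majority ground-truth class to decide which samples count as correctly clustered is a common convention and an assumption here, since the excerpt does not spell out that mapping.

```python
import numpy as np

def micro_averaged_precision(y_true, y_cluster):
    """Micro-averaged precision: (#correctly clustered) / (total #clustered samples),
    where a sample is 'correct' if it belongs to the dominant true class of its cluster.
    Labels are assumed to be non-negative integers."""
    y_true = np.asarray(y_true)
    y_cluster = np.asarray(y_cluster)
    correct = 0
    for k in np.unique(y_cluster):
        members = y_true[y_cluster == k]
        correct += np.bincount(members).max()   # a_k: samples matching the dominant class of cluster k
    return correct / len(y_true)                # value in [0, 1]; higher is better

# Toy example: two clusters over six samples
print(micro_averaged_precision([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))   # 5/6 ~ 0.83
```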

4.3.3 Results

In order to evaluate the performance of the proposed algorithm in these experiments, its parameters are set as follows. The global regularization factor is selected as , both the fuzzification degree of the data memberships ( ) and that of the feature-source importance factors ( ) are set to 4, and the feature space is divided into 80 equal parts to form the feature sources.
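To illustrate the last setting, the sketch below splits a feature matrix into 80 contiguous blocks of (nearly) equal width, one per feature source; the helper name and the use of contiguous blocks are assumptions for illustration.

```python
import numpy as np

def split_into_sources(X, n_sources=80):
    """Split the columns of the (n_samples, n_features) matrix X into
    n_sources contiguous, (nearly) equal-width blocks, one per feature source."""
    return np.array_split(X, n_sources, axis=1)

# Example: a 500-document matrix with 2000 features becomes 80 sources of 25 features each.
X = np.random.rand(500, 2000)
sources = split_into_sources(X)
print(len(sources), sources[0].shape)   # 80 (500, 25)
```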



The clustering results of the four mentioned co-clustering algorithms and of the proposed algorithm with the four Bregman divergence functions used are shown in Table 6.





As shown in Table 6, in the case of using the Euclidean distance, the proposed algorithm performs like the simple fuzzy co-clustering algorithm, and its results are far from those of the state-of-the-art algorithms. However, with more complex Bregman divergence functions the clustering results improve, and with the Exponential and Itakura-Saito functions the proposed algorithm achieves its best results. In the Multi5 experimental set, the selected newsgroups are from different categories, which allows the clustering algorithms to achieve higher precision. In the Multi10 experimental set, on the other hand, some of the selected newsgroups are from the same categories (they differ only in their subcategories). Thus, similar to the first experiment, selecting complex Bregman functions helps the proposed algorithm to handle the complexity of the data and produce more precise results.


Table 6: Clustering results of the Multi5 and Multi10 experimental sets based on the micro-averaged precision (map) criterion.

Group                      Clustering Algorithm   Multi5   Multi10
Co-clustering algorithms   NBVD                   0.93     0.67
Co-clustering algorithms   ICC                    0.89     0.54
Co-clustering algorithms   HCC                    0.72     0.44
Co-clustering algorithms   HICC                   0.95     0.69
Proposed algorithm         Ecl                    0.85     0.53
Proposed algorithm         KL                     0.89     0.57
Proposed algorithm         Exp                    0.95     0.68
Proposed algorithm         IS                     0.95     0.71

5 Conclusion

In this paper, a multi-task clustering algorithm is proposed to improve clustering accuracy, based on local information and on information received from the relation of the clusters in different tasks. The co-clustering idea used in the proposed algorithm can handle the complexity of the data distribution and results in more accurate clustering. The experimental results show that two factors have significant effects on the performance of the proposed algorithm: the global regularization factor (α) and the type of Bregman divergence used. A proper selection of these factors has an important effect on the improvement of the algorithm performance, and this selection depends on the datasets to be clustered. Assigning high values to α can be useful when the data participating in each task have properties similar to those of the other tasks, for example, when their distributions are similar or at least close to each other; in this case, their close relation can help to regularize the centers of the clusters properly. The choice of the Bregman divergence may not be as critical as the selection of the value of α; however, if a proper distance consistent with the data distribution is selected, a great improvement in clustering accuracy is achieved, which is an opportunity provided by the Bregman divergence.

Based on the experimental results, even with the simplest selection of the Bregman distance, i.e. the Euclidean distance, the proposed algorithm achieves better results than the similar tested multi-task clustering algorithms, due to the power of the fuzzy co-clustering idea used in the algorithm. However, for non-linear data, the proposed algorithm with the Euclidean distance cannot perform better than the kernel-based multi-task clustering algorithm. By changing the Bregman divergence distance function and selecting more complex distances, the proposed method, like the kernel-based techniques, calculates the similarity of samples in a transformed space and handles nonlinear data more properly. However, selecting a suitable Bregman divergence distance function, like selecting a proper kernel function for kernel-based techniques, can be an expensive and time-consuming procedure. As observed through the experiments, with an appropriate selection of the Bregman divergence distance function, better results are achieved in comparison with the related tested kernel-based clustering algorithm.

6 References

[1] J. Zhou and D. S. Wishart, "An improved method to detect correct protein folds using partial clustering," BMC Bioinformatics, vol. 14, no. 11, pp. 1-12, January 2013.
[2] W. Li, L. Fu, B. Niu, Sitao Wu and John Wooley, "Ultrafast clustering algorithms for metagenomic sequence analysis," Briefings in Bioinformatics, vol. 16, no. 2, pp. 1-14, May 2012.
[3] W.-W. Fan, B. Chen, G. Selvaraj and F.-X. Wu, "Discovering biological patterns from short time-series gene expression profiles with integrating PPI data," Neurocomputing, vol. 135, pp. 3-13, 2014.
[4] G. Ho, W. Ip, C. Lee and W. Mou, "Customer grouping for better resources allocation using GA based clustering technique," Expert Systems with Applications, vol. 39, no. 2, pp. 1979-1987, August 2011.
[5] M. Kargari and M. M. Sepehri, "Stores clustering using a data mining approach for distributing automotive spare-parts to reduce transportation costs," Expert Systems with Applications, vol. 39, no. 5, pp. 4740-4748, April 2012.
[6] A. Tagarelli and G. Karypis, "A segment-based approach to clustering multi-topic documents," Knowledge and Information Systems, vol. 34, no. 3, pp. 563-595, March 2013.
[7] M. Gong, Y. Liang, J. Shi and W. Ma, "Fuzzy C-Means Clustering With Local Information and Kernel Metric for Image Segmentation," IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 573-584, February 2013.
[8] M. I. Lopez, J. M. Luna, C. Romero and S. Ventura, "Classification via Clustering for Predicting Final Marks Based on Student Participation in Forums," in International Conference on Educational Data Mining Society, Chania, 2012.
[9] D. Pohl, A. Bouchachia and H. Hellwagner, "Online indexing and clustering of social media data for emergency management," Neurocomputing, vol. 172, no. 8, pp. 168-179, 2016.
[10] A. Liu, Y. Lu, W. Nie, Y. Su and Z. Yang, "HEp-2 cells Classification via clustered multi-task learning," Neurocomputing, vol. 195, pp. 195-201, 2016.
[11] J. Han, X. Ji, X. Hu, J. Han and T. Liu, "Clustering and retrieval of video shots based on natural stimulus fMRI," Neurocomputing, vol. 144, pp. 128-137, 2014.
[12] W. Pedrycz, "Collaborative fuzzy clustering," Pattern Recognition Letters, vol. 23, no. 14, pp. 1675-1686, December 2002.
[13] V. Loia, W. Pedrycz and S. Senatore, "Semantic Web Content Analysis: A Study in Proximity-Based Collaborative Clustering," IEEE Transactions on Fuzzy Systems, vol. 15, no. 6, pp. 1294-1312, December 2007.
[14] L. Coletta, E. Hruschka and R. Campello, "Collaborative Fuzzy Clustering Algorithms: Some Refinements and Design Guidelines," IEEE Transactions on Fuzzy Systems, vol. 20, no. 3, pp. 444-462, June 2012.
[15] B. Mandhani, S. Joshi and K. Kummamuru, "A matrix density based algorithm to hierarchically co-cluster documents and words," in International Conference on World Wide Web, New York, 2003.
[16] W.-C. Tjhi and L. Chen, "Dual Fuzzy-Possibilistic Coclustering for Categorization of Documents," IEEE Transactions on Fuzzy Systems, vol. 17, no. 3, pp. 532-543, April 2008.
[17] Y. Yan, L. Chen and W. C. Tjhi, "Fuzzy semi-supervised co-clustering for text documents," Fuzzy Sets and Systems, vol. 215, pp. 74-89, March 2013.
[18] C. Laclau and M. Nadif, "Hard and fuzzy diagonal co-clustering for document-term partitioning," Neurocomputing, vol. 193, no. 12, pp. 133-147, 2016.
[19] W. Dai, Q. Yang, G.-R. Xue and Y. Yu, "Self-taught clustering," in International Conference on Machine Learning, 2008.
[20] Q. Gu and J. Zhou, "Learning the Shared Subspace for Multi-task Clustering and Transductive Transfer Classification," in IEEE International Conference on Data Mining, Miami, 2009.
[21] Z. Zhang and J. Zhou, "Multi-task clustering via domain adaptation," Pattern Recognition, vol. 45, no. 1, pp. 465-473, January 2012.
[22] J. Zhang and C. Zhang, "Multitask Bregman clustering," Neurocomputing, vol. 70, no. 10, pp. 1720-1734, May 2011.
[23] X. Zhang and X. Zhang, "Smart Multi-Task Bregman Clustering and Multi-Task Kernel Clustering," in Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
[24] T. N. Huy, H. Shao, B. Tong and E. Suzuki, "A feature-free and parameter-light multi-task clustering framework," Knowledge and Information Systems, vol. 36, no. 1, pp. 251-276, July 2013.
[25] M. Hanmandlu, O. P. Verma, S. Susan and V. Madasu, "Color segmentation by fuzzy co-clustering of chrominance color features," Neurocomputing, vol. 120, no. 23, pp. 235-249, November 2013.
[26] H. Izakian and W. Pedrycz, "Agreement-based fuzzy C-means for clustering data with blocks of features," Neurocomputing, vol. 120, no. 15, pp. 266-280, March 2014.
[27] A. Banerjee, S. Merugu, I. S. Dhillon and J. Ghosh, "Clustering with Bregman Divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, December 2005.
[28] L. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200-217, 1967.
[29] L. Cayton, "Fast Nearest Neighbor Retrieval for Bregman Divergences," in International Conference on Machine Learning, New York, 2008.
[30] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, 1st ed., Norwell: Kluwer Academic Publishers, 1981.
[31] K. Zou, Z. Wang and M. Hu, "An new initialization method for fuzzy c-means algorithm," Fuzzy Optimization and Decision Making, vol. 7, no. 4, pp. 409-416, 2008.
[32] R. Babuska, "Fuzzy Clustering," in Fuzzy and Neural Control - DISC Course Lecture Notes, Delft, the Netherlands, Delft University of Technology, 2001, pp. 55-72.
[33] F. Nielsen and R. Nock, "Sided and Symmetrized Bregman Centroids," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2882-2904, June 2009.
[34] A. Strehl and J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, March 2003.
[35] O. T. Yıldız, Ö. Aslan and E. Alpaydın, "Multivariate Statistical Tests for Comparing Classification Algorithms," Learning and Intelligent Optimization, vol. 6683, pp. 1-15, 2011.
[36] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, 2007.
[37] M. Lichman, "UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]," University of California, School of Information and Computer Science, Irvine, CA, 2013.
[38] B. Long, Z. Zhang and P. S. Yu, "Co-clustering by block value decomposition," in Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, NY, USA, 2005.
[39] I. S. Dhillon, S. Mallela and D. S. Modha, "Information-theoretic co-clustering," in International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2003.
[40] J. Li and T. Li, "HCC: a hierarchical co-clustering algorithm," in International ACM SIGIR Conference on Research and Development in Information Retrieval, UniMail, Geneva, Switzerland, 2010.
[41] W. Cheng, X. Zhang, F. Pan and W. Wang, "HICC: an entropy splitting-based framework for hierarchical co-clustering," Knowledge and Information Systems, vol. 46, no. 2, pp. 343-367, 2016.
[42] N. Zheng-Yu, J. Dong-Hong and T. Chew-Lim, "Document clustering based on cluster validation," in ACM International Conference on Information and Knowledge Management, Washington, DC, USA, 2004.


Alireza Sokhandan is a Ph.D. student in Artificial Intelligence at the University of Isfahan, Isfahan, Iran. He received his B.Sc. in information technology engineering in 2010 and his M.Sc. in mechatronics engineering in 2012, both from the University of Tabriz, Tabriz, Iran. His research interests include image processing and computer vision, machine learning, and evolutionary algorithms.




Peyman Adibi was born in Isfahan, Iran, in 1975. He received the B.S. degree in computer engineering from Isfahan University of Technology, Isfahan, Iran, in 1998, and the M.S. and Ph.D. degrees in computer engineering from Amirkabir University of Technology, Tehran, Iran, in 2001 and 2009, respectively. Since 2010, he has been with the Artificial Intelligence Department, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran, where he is currently an Assistant Professor. His current research interests include machine learning, soft-computing, and computer vision.

Mohammad Reza Sajadi was born in Tehran in 1987. He received his B.Sc. in mechanical engineering from Shahid Rajaee University, Tehran, Iran, in 2008, and his master's degree in mechatronics engineering from the School of Engineering Emerging Technologies, University of Tabriz, Tabriz, Iran, in 2012. His practical experience includes the manufacturing of an electric car, a surgical robot, a wrist rehabilitation robot, and a 7-DOF redundant manipulator. His surgical robot was recognized as a selected design in the fifth national HAREKAT festival.