Learning category distance metric for data clustering


Neurocomputing 306 (2018) 160–170


Baoguo Chen a,∗, Haitao Yin b

a Research Center for Science Technology and Society, Fuzhou University of International Studies and Trade, Fujian 350202, China
b Research Institute of Science Technology and Society, Fuzhou University, Fujian 350116, China

Article info

Article history: Received 12 November 2016; Revised 17 December 2017; Accepted 16 March 2018; Available online 4 May 2018. Communicated by Prof. Zhou Xiuzhuang.

Keywords: Data clustering; Categorical attribute; Distance metric; Distance learning; Category weight

Abstract

Unsupervised learning of adaptive distance metrics for categorical data remains a challenge, owing to the difficulty of defining an inherently meaningful measure that parameterizes the heterogeneity within matched or mismatched categorical symbols. In this paper, a new distance metric called the category distance and a non-center-based algorithm are proposed for categorical data clustering. The new metric is formulated in terms of category weights for each categorical attribute, no longer depending on the common assumption that all categories on the same attribute are independent of each other. The problem of learning the category distance is thereby transformed into the new problem of learning a set of category weights, which can be optimized jointly with the clusters. A case study on DNA sequences and experimental results on ten real-world data sets from different domains are given to demonstrate the performance of the proposed methods, with comparisons to existing distance measures for categorical data.

1. Introduction

Computing the dissimilarity (or similarity) between data objects is one of the key intermediate operations in many machine learning tasks, such as data clustering, which aims at partitioning a set of objects into homogeneous groups based on some distance function. Learning an adaptive distance metric for data clustering has sparked wide interest because the semantic dissimilarity between data objects is inherently data-dependent [1,2]. A number of unsupervised metric learning methods have been proposed, including kernel linear transformation [3,4], relevant component analysis [5], automated attribute-weighting [6] and many others integrating feature extraction methods [1,7,8]. These methods have been applied to many distance-based clustering tasks and gained great popularity. However, they mainly focus on learning distance metrics from numeric data, where the metric can be parameterized in a well-defined measure, for instance, the Mahalanobis distance function [9,10]. For categorical data, distance computation is not straightforward. The problem becomes difficult due to the fact that in the categorical case the data can only take discrete values (categorical symbols, or categories), and statistical measures such as the mean, variance and covariance, which are common for numeric data, are undefined for categorical data [11,12].



Corresponding author. E-mail address: [email protected] (B. Chen).

https://doi.org/10.1016/j.neucom.2018.03.048 0925-2312/© 2018 Elsevier B.V. All rights reserved.


Consequently, the learning methods that have been successfully used for numeric data, including the popular maximum-margin-based approaches [13], sparse representation and manifold learning [14], Laplacian regularized metric learning [15] and the continuous-kernel methods [16], cannot be directly applied to categorical data. Learning a distance metric on categorical data is a fundamental problem, as such data have become ubiquitous in machine learning applications [11,12,17,18]. A few intuitive metrics have been defined, such as the common overlap measure, occurrence frequency [19], the information-theoretic metric [20], etc. (see [21] for a survey or Section 2.2 for the typical measures). For example, the overlap measure (alternatively known as the simple-matching coefficient [12]) computes the similarity from the number of categories that appear in both objects, and has been widely used in categorical data clustering algorithms including the popular K-modes and its numerous variants [11,22]. We remark that these metrics are not valid in many real applications because, essentially, they are defined based on the assumption that all the categories in the data are independent of each other [23]: samples taking different categories are treated as completely unrelated, while they are regarded as perfectly correlated as soon as they take the same category. This assumption is generally not true in reality. For instance, in an international trade catalogue, the categories "blazers" and "jackets" are far from independent, since each makes up one of the components of a suit. As another example, take an attribute representing customers' age: two customers may share the same category "Middle-aged", but their real ages can actually be different.


Many examples like these pose a unique challenge to the distance definition, because there is currently no method for adaptively learning the category dissimilarity of a polytomous attribute when clustering categorical data [23,24].

In this paper, we propose to solve these problems by learning a category distance metric for categorical data clustering. The metric assigns an individual distance value to each pair of categories on the same attribute, either matched or mismatched, to distinguish their heterogeneity, such that the independence assumption is relaxed. The category distance metric is then parameterized by a set of category weights, allowing the learning problem to be transformed into the new problem of learning the optimized weights. We also define a new clustering algorithm based on the category distance metric, which performs non-center-based clustering on categorical data with the distance metric jointly learned from the data. A series of experiments on UCI categorical data sets are conducted to evaluate the performance of the distance metric and the clustering algorithm.

The remainder of this paper is organized as follows: Section 2 presents some preliminaries and related work. Section 3 describes our category distance metric. In Section 4, the distance learning method and the new clustering algorithm are presented. Experimental results are presented in Section 5. Section 6 gives our conclusions.

2. Preliminaries and related work

In this section, notation and definitions related to categorical data clustering are introduced, followed by a sampling of related work on distance measures for categorical data.

2.1. Preliminaries

In the following pages, the sample set to be grouped into K clusters is denoted by X, which consists of N = |X| data objects, each being a D-dimensional vector x = ⟨x_1, x_2, ..., x_D⟩ or y = ⟨y_1, y_2, ..., y_D⟩. We call x a categorical data object if each attribute x_d, for d = 1, 2, ..., D, is a categorical attribute, as defined in the following Definition 1.

Definition 1 (Categorical attribute). An attribute is of categorical type if it takes values from a finite set of symbols (categories) S = {s_1, s_2, ..., s_m}, where m = |S| is the number of symbols.

Such categorical data have become ubiquitous in machine learning applications. In bioinformatics, for instance, the nucleotide in each position of a DNA sequence can be viewed as a categorical attribute, where the category set is typically S = {'A', 'G', 'T', 'C'}. Clearly, the set mean is an undefined concept for such a categorical data set. As a consequence, the popular K-means type algorithms, which make use of the set mean to represent the cluster center, cannot be directly used for categorical data clustering. The K-modes algorithm and its variants [11,25] therefore resort to the mode categories on each attribute to represent the "center" of categorical clusters. However, such mode-based approaches can only capture partial information on the data objects in a cluster. To define an efficient clustering algorithm without the formulation of cluster centers, partition-based methods have been suggested [12], as shown in the following Definition 2.

Definition 2 (Partition-based categorical data clustering). Partition-based clustering of the categorical data set X is the optimized partitioning Π = {π_k | k = 1, 2, ..., K} that minimizes

$$J_0(\Pi) = \sum_{k=1}^{K} \frac{1}{|\pi_k|} \sum_{x \in \pi_k} \sum_{y \in \pi_k} \mathrm{Dis}(x, y), \quad \text{s.t.}\; X = \bigcup_{k=1}^{K} \pi_k \;\text{and}\; \forall k: \pi_k \neq \emptyset, \tag{1}$$

where Dis(·,·) measures the pairwise dissimilarity of categorical objects, and π_k denotes the kth cluster of X, with |π_k| being the number of objects in π_k. Unlike the numeric case, where the pairwise dissimilarity can be measured using common distance functions such as the Euclidean distance, here Dis(·,·) should be computed as the aggregation of the symbolic distance on each categorical attribute. Formally,

$$\mathrm{Dis}(x, y) = \sum_{d=1}^{D} [\psi(x_d, y_d)]^2 \tag{2}$$
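To make the aggregation in Eq. (2) concrete, the following minimal sketch (not the authors' code; representing objects as plain Python lists of symbols is an assumption made only for illustration) computes Dis(x, y) for two categorical objects given any per-attribute symbol distance ψ.

```python
from typing import Callable, Sequence

def dis(x: Sequence[str], y: Sequence[str],
        psi: Callable[[int, str, str], float]) -> float:
    """Aggregate per-attribute symbol distances as in Eq. (2):
    Dis(x, y) = sum_d psi_d(x_d, y_d)^2."""
    assert len(x) == len(y), "objects must have the same number of attributes"
    return sum(psi(d, xd, yd) ** 2 for d, (xd, yd) in enumerate(zip(x, y)))

# Example with the simple overlap measure (Eq. (3) with w = 1) on every attribute.
overlap = lambda d, a, b: 0.0 if a == b else 1.0
print(dis(["A", "G", "T"], ["A", "C", "T"], overlap))  # -> 1.0
```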

where ψ(·,·) is the distance metric measuring the symbolic dissimilarity between two categories. Based on the definitions, the problem of learning ψ(·,·) can alternatively be represented as learning, for each categorical attribute of X, the real matrix Ψ defined by

$$\Psi = \begin{bmatrix} \psi(s_1, s_1) & \psi(s_1, s_2) & \cdots & \psi(s_1, s_m) \\ \psi(s_2, s_1) & \psi(s_2, s_2) & \cdots & \psi(s_2, s_m) \\ \vdots & \vdots & \ddots & \vdots \\ \psi(s_m, s_1) & \psi(s_m, s_2) & \cdots & \psi(s_m, s_m) \end{bmatrix}$$

satisfying, for any three categories c, c', c'' on that attribute (therefore c, c', c'' ∈ S):

Condition (1): ψ(c, c') ≥ 0 (non-negativity),
Condition (2): ψ(c, c') = ψ(c', c) (symmetry), and
Condition (3): ψ(c, c') ≤ ψ(c, c'') + ψ(c'', c') (triangular inequality).

2.2. A sampling of related work

The similarity or distance measures defined for categorical data in the literature can be roughly divided into three groups, named Type I, Type II and Type III measures, respectively. The measures will all be discussed in the context of dissimilarity (distance), with a similarity converted by 1 − ψ(·,·). In the Type I measures, the diagonal elements of Ψ are fixed at 0, while ψ(c, c') = w for c ≠ c', where w is the attribute weight, a positive constant irrelevant to c or c'. Such a measure, called the Overlap Measure (OM) [21], is defined based on the simple-matching method, given by

$$\psi_{OM}(c, c') = w \times \begin{cases} 0 & c = c' \\ 1 & c \neq c' \end{cases} \tag{3}$$

In the case where w = 1, it degenerates to the common simple-matching distance mentioned before. Though simple, the measure has been used in categorical data clustering algorithms such as the well-known K-modes algorithm [25]. An effective extension to the simple-matching distance is to define a weighted measure by learning the attribute weight w ∈ [0, 1]. For example, in [26], w is computed as being inversely proportional to the kernel bandwidth of the categorical attribute, while in [11] it is calculated based on the complement-entropy of the category distribution. It can be seen that, in the Type I measures, the distance is defined as zero between two samples sharing a common category, without considering the heterogeneity of the categorical attribute. This is because such measures are generally based on the independence assumption, as described in Section 1.


The Type II measures fix this problem by assigning a non-zero dissimilarity w' to ψ(c, c). Examples include the weighting method used in the mixed-attribute-weighting K-modes (MWKM) algorithm [22], where w' is learned as being inversely proportional to the frequency of the mode category. In this method, however, both weights w and w' are single values shared by all categories. In Goodall's measure [27] and its variants, the values can differ over the categories, depending on their individual frequencies. For example, the Goodall3 measure presented in [21] is defined as

$$\psi_{Goodall3}(c, c') = \begin{cases} \dfrac{f_X(c) \times (N \times f_X(c) - 1)}{N - 1} & c = c' \\ 1 & c \neq c' \end{cases} \tag{4}$$

where

$$f_X(c) = \frac{\#_X(c)}{|X|} \tag{5}$$

is the frequency of c with regard to X, and #_X(c) is the number of times the samples take the symbol c on the categorical attribute of X.

Both the Type I and Type II measures compute ψ(c, c') with c ≠ c' as a constant for all categories. As discussed previously, such measures encounter difficulties in practice due to the implicit assumption that different categories have the same dissimilarity. To address this issue, a few alternatives were proposed in the statistics community by taking the category frequency into account in the definition. Representatives of this Type III group include the well-known Lin's measure [28], an information-theoretic measure giving lower dissimilarity to matches on frequent values and higher dissimilarity to mismatches on infrequent values [21]. However, the independence assumption is reused in other measures of this group. In the Modified Short and Fukunaga Metric (MSFM) [29], for example, the distance is computed as

$$\psi_{MSFM}(c, c') = \begin{cases} 0 & c = c' \\ |\log f_X(c) - \log f_X(c')| & c \neq c' \end{cases} \tag{6}$$

Based on the heuristic that mismatches on infrequent values should be more dissimilar than those on frequent values, [19] suggested the occurrence frequency (OF) measure, given by

$$\psi_{OF}(c, c') = \begin{cases} 0 & c = c' \\ \log f_X(c) \times \log f_X(c') & c \neq c' \end{cases} \tag{7}$$

We remark that ψ_OF is not a distance metric, since the triangular inequality, i.e., Condition (3), is not necessarily satisfied. Moreover, the Type III measures are typically designed for supervised learning tasks. Currently, the problem of efficiently learning such a measure for data clustering remains unsolved. For example, by applying Lin's measure directly to a K-means-type clustering algorithm, the time complexity would reach O(N^2) due to the time-consuming computation of its normalization factor [21,28].

In the following, a Type III measure called Category Distance is proposed for categorical data clustering without the independence assumption. By theoretical analysis, we show that the new measure is indeed a distance metric and can be learned in linear time complexity with respect to the number of data objects.

3. The category distance metric

The aim of this section is to define the category distance measure, denoted by ψ_CD(·,·). We begin by proposing a general distance metric for categorical data, from which the new measure is derived as a special case.

3.1. A general distance metric for categories

To provide a general distance metric for categories without the independence assumption, two key elements must be formulated: the heterogeneity of data objects sharing a common category, and the dissimilarity when they take different categories. We formulate the former by ρ(c), a measure defining the dissimilarity of the objects having the common category c on the same attribute. Accordingly, the latter is formulated based on ρ̄(c), the dissimilarity of objects taking different categories including c. Then, for two categories c and c' on the same attribute, the general distance can be given as

$$\psi(c, c') = \begin{cases} \sqrt{\rho(c)} & c = c' \\ \sqrt{\tfrac{1}{2}\left[\bar{\rho}(c) + \bar{\rho}(c')\right]} & c \neq c' \end{cases} \tag{8}$$

subject to

$$\forall c: \bar{\rho}(c) \geq \rho(c) \geq 0. \tag{9}$$
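As a quick illustration of Eq. (8)–(9) (a sketch under the reconstruction above, not the authors' code), the function below builds ψ from user-supplied ρ and ρ̄ tables and numerically spot-checks the three metric conditions on a small category set.

```python
import itertools
import math

def make_psi(rho: dict, rho_bar: dict):
    """Return psi(c, c') per Eq. (8), given rho(c) and rho_bar(c) with
    rho_bar(c) >= rho(c) >= 0 for every category (Eq. (9))."""
    def psi(c, cp):
        if c == cp:
            return math.sqrt(rho[c])
        return math.sqrt(0.5 * (rho_bar[c] + rho_bar[cp]))
    return psi

# Toy per-category values satisfying Eq. (9).
rho     = {"A": 0.0, "G": 0.2, "T": 0.5, "C": 0.1}
rho_bar = {"A": 1.0, "G": 1.3, "T": 2.0, "C": 1.1}
psi = make_psi(rho, rho_bar)

cats = list(rho)
for c, cp, cpp in itertools.product(cats, repeat=3):
    assert psi(c, cp) >= 0                                   # Condition (1)
    assert abs(psi(c, cp) - psi(cp, c)) < 1e-12              # Condition (2)
    assert psi(c, cp) <= psi(c, cpp) + psi(cpp, cp) + 1e-12  # Condition (3)
print("conditions (1)-(3) hold on this example")
```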

Intuitively, in Eq. (8), the distance between c and c' with c ≠ c' is the average dissimilarity of the data objects taking c and c', respectively. The following Lemma 1 indicates that ψ(·,·) is a distance metric.

Lemma 1. ψ(·,·) is a distance metric.

Proof. We shall show that Conditions (1)–(3) presented in Section 2.1 hold true for any categories c, c' and c'' on the same attribute. The non-negativity and symmetry properties follow directly from Eq. (8). For the triangular inequality, we distinguish four cases:

(i) c = c' = c''. Since ψ(c, c') = √ρ(c) ≤ 2√ρ(c) = √ρ(c) + √ρ(c) = ψ(c, c'') + ψ(c'', c') in this case, the triangular inequality follows.

(ii) c ≠ c' and (c'' = c or c'' = c'). Supposing that c'' = c, we have ψ(c, c') = √(½[ρ̄(c) + ρ̄(c')]) ≤ √ρ(c) + √(½[ρ̄(c) + ρ̄(c')]) = ψ(c, c'') + ψ(c'', c') because ρ(c) ≥ 0. A similar result is obtained when c'' = c'.

(iii) c ≠ c', c'' ≠ c and c'' ≠ c'. For this case, since [ψ(c, c')]² = ½[ρ̄(c) + ρ̄(c')] ≤ ½[ρ̄(c) + ρ̄(c'')] + ½[ρ̄(c'') + ρ̄(c')] ≤ [ψ(c, c'') + ψ(c'', c')]², the triangular inequality follows.

(iv) c = c' ≠ c''. For this case, we have ψ(c, c') = √ρ(c) ≤ √(ρ̄(c) + ρ̄(c'')) as ρ(c) ≤ ρ̄(c) and ρ̄(c'') ≥ 0. Thus, ψ(c, c') ≤ ½[√ρ̄(c) + √ρ̄(c'')] + ½[√ρ̄(c'') + √ρ̄(c')] (using the subadditivity of the square root and c = c') ≤ √(½[ρ̄(c) + ρ̄(c'')]) + √(½[ρ̄(c'') + ρ̄(c')]) = ψ(c, c'') + ψ(c'', c'). Here, the inequality ½[√ρ̄(c) + √ρ̄(c'')] ≤ √(½[ρ̄(c) + ρ̄(c'')]) holds true since [½(√ρ̄(c) + √ρ̄(c''))]² − ½[ρ̄(c) + ρ̄(c'')] = −¼(√ρ̄(c) − √ρ̄(c''))² ≤ 0. □

We remark that some commonly used measures discussed in Section 2.2 can be viewed as special cases of the general distance metric. Here are two examples:

Example 1. ρ(c) = 0 and ρ̄(c) = w with w ≥ 0. This setting immediately leads to the overlap measure, as Eq. (3) shows.

Example 2. ρ(c) = w² and ρ̄(c) = w'² with w' > w. In this case, the distance between c and c' is computed as w if c = c' and w' otherwise. This is precisely the distance measure used in the MWKM algorithm [22], where each categorical attribute is assigned two weights indicating its contribution to clustering.

Based on Eq. (8) and Lemma 1, one can easily define new distance measures for categorical data by specifying ρ(s_l) and ρ̄(s_l) for s_l ∈ S with l = 1, 2, ..., m, as long as Eq. (9) holds true. However, they should be data-dependent when applied to data clustering, i.e., both should be adaptively learned according to the statistics of the clusters. We therefore propose a parameterized formulation, from which the new distance measure ψ_CD(·,·) is derived.

3.2. The category-weighted distance

Our category distance measure is also a special case of the general metric ψ(·,·), where ρ(s_l) and ρ̄(s_l) for ∀s_l ∈ S are formulated as

$$\rho(s_l) = [1 - \lambda_X(s_l)]^{\frac{1}{\beta}}, \qquad \bar{\rho}(s_l) = 1 + [\lambda_X(s_l)]^{\frac{1}{\beta}}, \qquad l = 1, 2, \ldots, m. \tag{10}$$

Here, 0 ≤ λ_X(s_l) ≤ 1 is the category weight measuring the contribution of s_l to the distance computation with regard to the data objects in X: the larger the weight, the more the category contributes. The weighting exponent 1/β > 1 is introduced to control the strength of the incentive for clustering on more categories. In more detail, we can see from Eq. (10) that the value of ρ̄(·) is enlarged while ρ(·) scales down when the category weight increases. This means that a category with larger weight has more distinguishing capacity for data clustering. In particular, ρ(·) = 0 and ρ̄(·) = 1 when the weight falls in the range (0,1) and β goes to 0; in this case, the distance metric degenerates to the common overlap measure, as Example 1 in the last subsection shows. Substituting the two elements in Eq. (8) according to Eq. (10), the new distance measure can be derived, given by

$$\psi_{CD}(c, c') = \begin{cases} \left(1 - \lambda_X(c)\right)^{\frac{1}{2\beta}} & c = c' \\ \left(1 + \tfrac{1}{2}\left([\lambda_X(c)]^{\frac{1}{\beta}} + [\lambda_X(c')]^{\frac{1}{\beta}}\right)\right)^{\frac{1}{2}} & c \neq c' \end{cases} \tag{11}$$

In the sense of the following Lemma 2, Eq. (11) is indeed a distance metric.

Lemma 2. If 0 ≤ λ_X(s_l) ≤ 1 for ∀s_l ∈ S and 0 < β < 1, then ψ_CD(·,·) is a distance metric.

Proof. According to Lemma 1, ψ_CD(·,·) is a distance metric if ρ(s_l) and ρ̄(s_l) defined in Eq. (10) satisfy Eq. (9), i.e., ∀s_l ∈ S: 0 ≤ ρ(s_l) ≤ ρ̄(s_l). First, it is obvious that ρ(s_l) ≥ 0 and ρ̄(s_l) ≥ 0 if 0 < β < 1 and 0 ≤ λ_X(s_l) ≤ 1. Second, we consider the function

$$g(\lambda) = \rho(s_l) - \bar{\rho}(s_l) = (1 - \lambda)^{\frac{1}{\beta}} - 1 - \lambda^{\frac{1}{\beta}},$$

and obtain g(0) = 0 and

$$\frac{dg}{d\lambda} = -\frac{1}{\beta}\left((1 - \lambda)^{\frac{1}{\beta} - 1} + \lambda^{\frac{1}{\beta} - 1}\right) < 0.$$

The latter means that g(λ) is a monotonically decreasing function with respect to λ. Therefore, g(λ) = ρ(s_l) − ρ̄(s_l) ≤ 0 and subsequently ρ(s_l) ≤ ρ̄(s_l) when the conditions stated in the lemma are satisfied. □
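The following sketch (illustrative only; the dictionary-of-weights representation is an assumption, not the paper's implementation) instantiates ψ_CD from Eq. (10)–(11) for one attribute.

```python
def make_psi_cd(weights: dict, beta: float):
    """Category distance of Eq. (11) for one attribute.

    weights maps each category to its weight lambda_X in [0, 1];
    beta must lie in (0, 1) so that psi_cd is a metric (Lemma 2)."""
    assert 0.0 < beta < 1.0
    inv_beta = 1.0 / beta

    def psi_cd(c, cp):
        if c == cp:
            return (1.0 - weights[c]) ** (1.0 / (2.0 * beta))
        return (1.0 + 0.5 * (weights[c] ** inv_beta + weights[cp] ** inv_beta)) ** 0.5
    return psi_cd

# With one dominating category ('T') the matched distance on 'T' drops to 0,
# while mismatches involving 'T' become larger than mismatches among the rest.
psi_cd = make_psi_cd({"A": 0.0, "G": 0.0, "T": 1.0, "C": 0.0}, beta=0.85)
print(psi_cd("T", "T"), psi_cd("A", "A"), psi_cd("A", "G"), psi_cd("T", "A"))
```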

Based on Eq. (11), the problem of learning a distance metric from the sample set X, i.e., learning the optimal m × m matrix Ψ, can be transformed into the new problem of optimizing the category weights {λ_X(s_l) | s_l ∈ S, l = 1, 2, ..., m}, subject to the constraints given by Lemma 2. In the work described here, the optimization problem is solved by a clustering algorithm, as described below.

4. Clustering with distance metric learning

Despite the different kinds of categorical data clustering algorithms in the literature, such as hierarchical clustering, which organizes data objects into a tree of clusters [30], we are interested in the partitioning method aimed at grouping the data set into K flat clusters, owing to its inherent simplicity and computational efficiency. In this section, a new partitioning algorithm called CWC (for Clustering with Weighted Categories) is proposed, which learns the category-weighted distance metric ψ_CD(·,·). We will begin by defining the clustering criterion that needs to be optimized by the algorithm.

4.1. Clustering criterion

Given the categorical data set X and the number of clusters K, the goal of CWC is to search for an optimized set of clusters Π = {π_k | k = 1, 2, ..., K} according to the category distance ψ_CD(·,·). Note that here each object x = ⟨x_1, ..., x_d, ..., x_D⟩ consists of D categorical attributes. To distinguish the D attributes, we now use S_d to denote the set of categories on the dth attribute, and use c_d ∈ S_d or c'_d ∈ S_d to denote an arbitrary category in S_d, in order to simplify the representation. The clustering criterion of CWC is defined based on Eq. (1) of Definition 2, with the distance measure replaced by our category-weighted distance metric ψ_CD(·,·). Substituting the distance in Eqs. (1) and (2) with ψ_CD(·,·) of Eq. (11), the partition-based clustering criterion changes to

$$J_1(\Pi) = \sum_{k=1}^{K} \frac{1}{|\pi_k|} \sum_{x \in \pi_k} \sum_{y \in \pi_k} \sum_{d=1}^{D} [\psi_{CD}(x_d, y_d)]^2 = \sum_{k=1}^{K} |\pi_k| \sum_{d=1}^{D} \sum_{c_d \in S_d} f_{\pi_k}(c_d) \sum_{c'_d \in S_d} f_{\pi_k}(c'_d)\, [\psi_{CD}(c_d, c'_d)]^2,$$

where f_πk(c_d) is the frequency of c_d on the dth attribute of the data subset π_k, computed using Eq. (5) with X replaced by π_k. Since ψ_CD(·,·) is parameterized by a set of category weights in the sense of Eq. (10), the clustering criterion should be further rewritten with the weights as parameters. Letting Λ = {λ_πk(c_d) | c_d ∈ S_d, d = 1, 2, ..., D; k = 1, 2, ..., K} be the set of category weights, we obtain the resulting criterion as

$$J(\Pi, \Lambda) = \sum_{k=1}^{K} |\pi_k| \sum_{d=1}^{D} \sum_{c_d \in S_d} f_{\pi_k}(c_d) \times \Big( f_{\pi_k}(c_d)\,[1 - \lambda_{\pi_k}(c_d)]^{\frac{1}{\beta}} + \big(1 - f_{\pi_k}(c_d)\big)\,\big(1 + [\lambda_{\pi_k}(c_d)]^{\frac{1}{\beta}}\big) \Big) \tag{12}$$

where the category weight λ_πk(·) is defined in the same way as the previous λ_X(·), with the data set X replaced by π_k. We remark that, besides the optimization of the clusters Π, minimizing J(Π, Λ) will also lead to an optimized category-weight set Λ, which, in effect, equates to learning the category distance from the given data set X. Next, the unsupervised learning algorithm is described.

4.2. Clustering algorithm

Our CWC algorithm, as outlined in Algorithm 1, is an iterative algorithm aimed at obtaining a locally optimal solution that minimizes J(Π, Λ), given the data set X, the number of clusters K and the weighting parameter β. The algorithm starts clustering by creating an initial partition Π(0) (Step 2), using a method similar to that suggested by Chen et al. [12]. First, K objects e_1, ..., e_k, ..., e_K with e_k = ⟨e_k1, ..., e_kd, ..., e_kD⟩ are chosen at random from X as the seeds. Next, each object x ∈ X is assigned to its closest seed, where the distance between x and the seed e_k is computed by Σ_{d=1}^{D} ψ_OM(x_d, e_kd) with w fixed at 1. The resulting assignments then compose Π(0). Step 3 of the algorithm consists of a series of iterations in order to optimize both Π and Λ. The method used is the common partial optimization method [11,12,22,26], which achieves a local minimum of J(Π, Λ) by partially optimizing the two parameters in a sequential structure.

Algorithm 1: The outline of the CWC algorithm.
Input: the data set X, the number of clusters K, and the weighting exponent β;
Output: the set of clusters Π, and the set of category weights Λ;
begin
  1. Let t be the number of iterations, t = 0;
  2. Create the initial set of clusters, denoted by Π(0);
  3. repeat
     3.1 Learn the distance measure ψ_CD(·,·), i.e., compute the category weights λ_πk(c_d) in Λ according to Eq. (13) for k = 1, 2, ..., K, ∀c_d ∈ S_d and d = 1, 2, ..., D;
     3.2 Reassign each object x ∈ X to its closest cluster using the rule of Eq. (14), and obtain Π(t+1);
     3.3 t = t + 1;
     until Π(t) = Π(t−1);
  4. Output Π(t) and Λ.
end
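Algorithm 1 translates almost directly into code. The driver below is a hedged sketch (not the authors' implementation) that alternates the two partial optimization steps; it relies on helper functions learn_weights and reassign, which are sketched after Eq. (14).

```python
import random

def cwc(X, K, beta, learn_weights, reassign, max_iter=100):
    """Outline of Algorithm 1: initial partition, then alternate
    Step 3.1 (learn weights, Eq. (13)) and Step 3.2 (reassign, Eq. (14))."""
    D = len(X[0])
    # Step 2: pick K random seed objects and assign every object to its
    # closest seed under the simple-matching distance (Eq. (3) with w = 1).
    seeds = random.sample(range(len(X)), K)
    clusters = [[] for _ in range(K)]
    for i, x in enumerate(X):
        k = min(range(K), key=lambda s: sum(x[d] != X[seeds[s]][d] for d in range(D)))
        clusters[k].append(i)
    weights = None
    for _ in range(max_iter):                                 # Step 3
        weights = learn_weights(X, clusters, beta)            # Step 3.1
        new_clusters = reassign(X, clusters, weights, beta)   # Step 3.2
        if new_clusters == clusters:                          # until Pi(t) = Pi(t-1)
            break
        clusters = new_clusters
    return clusters, weights
```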

In the (t+1)th iterative step (Step 3.1), the cluster set is fixed at Π(t) and the distance measure, or equivalently the optimal Λ, is learned to minimize J(Π(t), Λ). This optimization problem can be solved by setting ∂J/∂λ_πk(c_d) = 0 for ∀k ∈ [1, K], c_d ∈ S_d and d ∈ [1, D], which yields

$$-\frac{1}{\beta}\,|\pi_k|\, f_{\pi_k}(c_d)\Big( f_{\pi_k}(c_d)\,[1 - \lambda_{\pi_k}(c_d)]^{\frac{1}{\beta}-1} - \big(1 - f_{\pi_k}(c_d)\big)\,[\lambda_{\pi_k}(c_d)]^{\frac{1}{\beta}-1} \Big) = 0,$$

obtaining

$$\frac{\lambda_{\pi_k}(c_d)}{1 - \lambda_{\pi_k}(c_d)} = \left(\frac{f_{\pi_k}(c_d)}{1 - f_{\pi_k}(c_d)}\right)^{\frac{\beta}{1-\beta}}. \tag{13}$$

In the next step (Step 3.2), all the objects are reassigned to generate the new partitioning Π(t+1) by minimizing J(Π(t+1), Λ) using the category weights newly learned by Eq. (13). This is achieved by reassigning each object x to its closest cluster k according to

$$k = \arg\min_{i=1,2,\ldots,K} \frac{1}{|\pi_i^{(t)}|} \sum_{y \in \pi_i^{(t)}} \sum_{d=1}^{D} \psi_{CD}(x_d, y_d). \tag{14}$$
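The following sketch shows how Steps 3.1 and 3.2 could look in code (a simplified illustration of the reconstructed Eqs. (13)–(14); it is not the authors' implementation, and the data structures are assumptions). Objects are lists of category symbols and clusters are lists of object indices; the reassignment uses per-cluster category frequencies so its cost per object matches the O(KDM) behaviour discussed below.

```python
from collections import Counter

def learn_weights(X, clusters, beta):
    """Step 3.1: category weights per Eq. (13), one table per (cluster, attribute)."""
    D = len(X[0])
    weights = []
    for members in clusters:
        per_attr = []
        for d in range(D):
            freq = Counter(X[i][d] for i in members)
            n = max(len(members), 1)
            lam = {}
            for c, cnt in freq.items():
                f = cnt / n
                if f >= 1.0:                          # a single dominating category
                    lam[c] = 1.0
                else:
                    r = (f / (1.0 - f)) ** (beta / (1.0 - beta))
                    lam[c] = r / (1.0 + r)            # solve lam/(1-lam) = r
            per_attr.append(lam)
        weights.append(per_attr)
    return weights

def psi_cd(c, cp, lam, beta):
    """Eq. (11); categories unseen in the cluster get weight 0."""
    lc, lcp = lam.get(c, 0.0), lam.get(cp, 0.0)
    if c == cp:
        return (1.0 - lc) ** (1.0 / (2.0 * beta))
    return (1.0 + 0.5 * (lc ** (1.0 / beta) + lcp ** (1.0 / beta))) ** 0.5

def reassign(X, clusters, weights, beta):
    """Step 3.2: move each object to its closest cluster per Eq. (14),
    using per-cluster category frequencies instead of scanning all pairs."""
    D = len(X[0])
    freqs = [[Counter(X[i][d] for i in members) for d in range(D)]
             for members in clusters]
    new_clusters = [[] for _ in clusters]
    for idx, x in enumerate(X):
        costs = []
        for k, members in enumerate(clusters):
            if not members:                            # guard against empty clusters
                costs.append(float("inf"))
                continue
            n = len(members)
            cost = sum(cnt / n * psi_cd(x[d], c, weights[k][d], beta)
                       for d in range(D) for c, cnt in freqs[k][d].items())
            costs.append(cost)
        new_clusters[costs.index(min(costs))].append(idx)
    return new_clusters
```

A full CWC run simply alternates learn_weights and reassign until the partition stops changing, as in Algorithm 1.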

The time complexity of Step 3.1 using Eq. (13) is O(KDMN), where M denotes the average number of categories per attribute. In Step 3.2, K distances are computed for each object according to Eqs. (13) and (11); thus, the time complexity is also O(KDMN). In terms of algorithmic structure, CWC can be viewed as an extension of the traditional K-means clustering algorithm. Therefore, the time complexity of CWC can be given as O(KDMNT), where T denotes the number of iterations.

4.3. Discussion

It can be seen that, unlike the mode-based clustering algorithms such as K-modes and its variants [11,18,22,25], CWC is a non-center-based algorithm for categorical data clustering. Thus, the problem of representing clusters by mode categories is avoided by our CWC. Interestingly, the mode categories can still be identified from the clustering results generated by CWC. This is based on the observation that the optimized category weight is proportional to the category frequency in the sense of Eq. (13); in fact, λ_πk(c_d) = f_πk(c_d) when β = 0.5. According to this view, the mode category on the dth attribute of π_k can be given as argmax_{c_d ∈ S_d} λ_πk(c_d).

The resulting category weights can also be used to rank the categorical attributes. Let us take two extreme cases as examples. When all the categories are assigned the same weight, the categories are distributed evenly on the attribute; in this case, the attribute is irrelevant to the cluster. In the extreme case where only one category has non-zero weight while the others receive 0, the attribute is considerably important, as there is one dominating category on the attribute. Thus, the attribute weight for the dth attribute of π_k, denoted by w_kd, can be computed based on the distribution of category weights by, for example,

$$w_{kd} = \frac{1}{|S_d|} \sum_{c_d \in S_d} [\lambda_{\pi_k}(c_d)]^2. \tag{15}$$

In other words, CWC also performs subspace clustering on categorical data.
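As a small follow-up (an illustrative sketch; the per-attribute weight dictionary is the same assumed representation used in the earlier sketches), the mode category and the attribute weight of Eq. (15) can be read directly off the learned category weights.

```python
def mode_category(lam: dict) -> str:
    """Mode on one attribute of one cluster: the category with the largest weight."""
    return max(lam, key=lam.get)

def attribute_weight(lam: dict, num_categories: int) -> float:
    """Eq. (15): mean squared category weight over the attribute's category set."""
    return sum(w ** 2 for w in lam.values()) / num_categories

lam_p35 = {"A": 0.0, "G": 0.0, "T": 1.0, "C": 0.0}   # promoters cluster, p-35 (Table 2)
print(mode_category(lam_p35))                          # -> 'T'
print(attribute_weight(lam_p35, num_categories=4))     # -> 0.25
```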

5. Experiments

In this section, we evaluate the performance of the CWC algorithm on some commonly used UCI data sets. It was also experimentally compared with a few existing clustering algorithms and distance measures for categorical data.

5.1. Experimental setup

Two K-modes type algorithms, namely the complement-entropy-weighting K-modes (CEWKM) [11] and the MWKM algorithm [22], were chosen for comparison. As discussed in Section 2.2, they are representatives of the existing categorical data clustering algorithms using Type I or Type II distance measures. CEWKM makes use of ψ_OM(·,·) as defined in Eq. (3), with the attribute weight w optimized according to the category distribution [11]. The Type II measure used in MWKM was discussed as Example 2 in Section 3.1; the parameters of MWKM were set to the author-recommended values [22]. The recently published KPC algorithm [26] was also used in the experiments. The distance measure used in KPC is similar to that of CEWKM but weights the attributes with a kernel smoothing method. Moreover, we chose three distance measures for the evaluation: the OF [19], Goodall3 [21,27] and MSFM [29] measures, denoted by ψ_OF(·,·), ψ_Goodall3(·,·) and ψ_MSFM(·,·), respectively, in Section 2.2. Note that the three were not originally defined for data clustering. To apply them to categorical data clustering, we designed new clustering algorithms for them based on the CWC algorithm, by removing Step 3.1 and replacing the distance function with their own definitions. However, it was observed that the resulting algorithms did not always converge in a finite number of iterations; in these cases, the iterations were terminated when the number reached 100 in the experiments.

Two main approaches were adopted to evaluate the clustering quality. One is based on the category utility (CU), given by Mirkin [31]:

$$CU(\Pi) = \frac{1}{K} \sum_{k=1}^{K} \frac{|\pi_k|}{N} \sum_{d=1}^{D} \sum_{c_d \in S_d} \Big([f_{\pi_k}(c_d)]^2 - [f_X(c_d)]^2\Big). \tag{16}$$

where f_X(c_d) is the frequency of c_d with regard to the dth attribute of the entire data set. The other approach is to calculate the clustering accuracy (CA) of a clustering:

$$CA(\Pi) = \frac{1}{N} \sum_{k=1}^{K} a_k,$$

where a_k is the number of data objects in the majority class corresponding to π_k. Clearly, such an index requires that the ground truth of the data sets be known, which is the case in our experiments.
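Both criteria are easy to compute from a labeled partition. The following sketch (an illustration of Eq. (16) and the CA definition, with plain Python lists assumed as the data representation, not the authors' code) shows one way to do so.

```python
from collections import Counter

def category_utility(X, clusters):
    """CU of Eq. (16): per-cluster gain in squared category frequencies
    over the whole-data-set base rates, averaged over K clusters."""
    N, D, K = len(X), len(X[0]), len(clusters)
    base = [Counter(x[d] for x in X) for d in range(D)]
    cu = 0.0
    for members in clusters:
        if not members:
            continue
        n = len(members)
        for d in range(D):
            in_cluster = Counter(X[i][d] for i in members)
            cu += (n / N) * sum((in_cluster[c] / n) ** 2 - (base[d][c] / N) ** 2
                                for c in base[d])
    return cu / K

def clustering_accuracy(labels, clusters):
    """CA: fraction of objects belonging to the majority true class of their cluster."""
    hits = sum(Counter(labels[i] for i in members).most_common(1)[0][1]
               for members in clusters if members)
    return hits / len(labels)
```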

5.2. A case study

This set of experiments aims at evaluating the performance of CWC through a case study on DNA sequence data.

Fig. 1. (a) Changes in CU and CA with various β ; (b) The attribute-weight distributions yielded by CWC for the Promoters data set.

The data set used was the well-known E. coli promoter gene sequences (Promoters, for short) [32]. The Promoters data set has 57 categorical attributes, each associated with one of the four nucleotides encoded 'A', 'G', 'T' or 'C', starting at position -50 (p-50) and ending at position +7 (p+7). There are 106 sequences in the data set, originally classified into two classes, promoters and non-promoters; thus, the number of clusters is K = 2. In the promoters class, the sequences are expected to contain promoters (a promoter is a special genetic region indicating the biological concept of interest), which are absent in the sequences of the non-promoters class.

To apply the CWC algorithm, its parameter β ∈ (0, 1) should be estimated in advance. For this purpose, a trial-and-error method was used: we first ran the algorithm with different β, and then the value associated with the highest CU was chosen as the optimal parameter. Note that CU can be viewed as an internal criterion evaluating the clustering quality [31], as Eq. (16) shows. However, since the algorithm starts from randomly chosen seeds, the resulting value may vary over different executions of the algorithm. To provide a robust condition for the parameter estimation, we propose to choose the object having the largest dissimilarity to the others in the data set as the first seed e_1, i.e., e_1 = argmax_{x∈X} Σ_{y∈X} Σ_{d=1}^{D} ψ_OM(x_d, y_d) with w fixed at 1. The selection can be further simplified into

$$e_1 = \arg\max_{x \in X} \sum_{d=1}^{D} 2\,[1 - f_X(x_d)] = \arg\min_{x \in X} \sum_{d=1}^{D} f_X(x_d).$$

The remaining K − 1 seeds are determined using the maximum–minimum principle [33], as follows. Suppose that a set of k (1 ≤ k < K) seeds E = {e_i | i ∈ [1, k]} has been determined. Then, we choose the (k+1)th seed according to

$$e_{k+1} = \arg\max_{x \in X \setminus E}\; \min_{i \in [1, k]} \sum_{d=1}^{D} \psi_{OM}(x_d, e_{id}),$$

with w of ψ_OM fixed at 1.
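As an illustration of this seeding scheme (a sketch under the stated w = 1 setting, not the authors' code), the farthest-first selection can be written as follows.

```python
from collections import Counter

def choose_seeds(X, K):
    """Pick K seed objects: the first by the simplified rule for e_1 (smallest
    summed category frequency, up to the constant 1/N), the rest by the
    maximum-minimum principle with psi_OM and w = 1."""
    N, D = len(X), len(X[0])
    freq = [Counter(x[d] for x in X) for d in range(D)]   # category counts per attribute

    def overlap(a, b):
        return sum(a[d] != b[d] for d in range(D))

    # e_1: the most "atypical" object, i.e. the one minimizing sum_d f_X(x_d).
    seeds = [min(range(N), key=lambda i: sum(freq[d][X[i][d]] for d in range(D)))]
    while len(seeds) < K:
        # e_{k+1}: the unchosen object farthest from its nearest current seed.
        candidate = max((i for i in range(N) if i not in seeds),
                        key=lambda i: min(overlap(X[i], X[s]) for s in seeds))
        seeds.append(candidate)
    return seeds
```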

Fig. 1(a) shows the CU values computed on the clustering results obtained by CWC on the Promoters data set, using this initialization method and different β ranging from 0.05 to 0.95 in increments of 0.05. The corresponding CA values are also shown in the figure to provide context. We observe that CWC obtains the highest category utility when β = 0.85, accompanied by the highest clustering accuracy. Using this setting, the data set was clustered by CWC for 100 executions.

The best result is reported in Table 1 (the average performance will be reported in Section 5.3), where the accuracy for each cluster is measured in terms of the F1-score. The results obtained by the competing methods are also summarized in the table, with the highest scores marked in bold typeface. It can be seen from Table 1 that CWC is more accurate than the other methods for both clusters.

To understand the reason for this performance, the detailed clustering results were used for further analysis. Fig. 1(b) shows the attribute weights computed using Eq. (15) on the category weights learned by CWC. The figure shows that the attribute weights over all positions are close to 0 for the non-promoters cluster, whereas the weights for the promoters cluster indicate a few interesting patterns hidden in the DNA sequences. For example, the attribute weights corresponding to p-36–p-32 are obviously higher than those of the neighboring nucleotides. Note that this slice of successive nucleotides is the very promoter element called contact-minus_35@-36 [32]. Table 2 lists the category weights learned by CWC for both clusters on these attributes. According to Table 2, the mode categories are assigned significantly larger weights for promoters than for non-promoters, and the categories other than the modes have weights approaching 0. Moreover, the modes for the promoters cluster on p-36–p-32 are "TTGAC", which is precisely the contact region sited on these positions according to the domain theory for promoters [32]. The table also indicates that there is no dominating category on the attributes of the non-promoters cluster. On position p-32, for example, both 'A' and 'T' can be considered as the mode; however, in a mode-based algorithm, such as K-modes (KM), CEWKM and MWKM used in this case study, a unique mode category has to be chosen as the representative for each attribute. In this case, they yield inaccurate clustering results, as Table 1 shows.

The clustering quality of the partition-based algorithms, including CWC, KPC and those using OF, Goodall3 or MSFM, depends on the distance measure used. To give an example, Table 3 shows the distance matrices calculated by the algorithms for p-35 of the promoters cluster. The results of KPC are omitted since its distance measure is similar to that of CEWKM. From the table, we see different behaviors of the algorithms. Both CEWKM and OF¹ give a zero value to all the diagonal elements of the matrix, while MWKM, Goodall3 and CWC assign them non-zero values. In particular, our CWC learns different distances for the diagonal elements: for example, ψ_CD('T', 'T') = 0 while the distances for the other categories reach 1, for the promoters cluster. This is reasonable because 'T' is identified as the mode on p-35 and is assigned a category weight as large as 1, as Table 2 shows. The results thus make CWC more accurate in the generation of high-quality clusters.

¹ According to the results in Table 3, the triangular inequality of ψ_OF does not hold true: for example, ψ_OF('A', 'T') + ψ_OF('T', 'G') = 1.07 + 1.23 < ψ_OF('A', 'G') = 6.08.

Table 1
F1-score obtained by different methods on the Promoters data set.

Cluster          CWC    KPC    CEWKM   MWKM   OF     Goodall3   MSFM
promoters        0.96   0.92   0.84    0.86   0.81   0.81       0.65
non-promoters    0.96   0.92   0.80    0.82   0.75   0.79       0.49

Table 2
Category weights learned by CWC, with the modes (the categories corresponding to the weights marked in bold typeface) identified on p-36–p-32 of the Promoters data set.

           promoters                          non-promoters
Position   'A'     'G'     'T'     'C'        'A'     'G'     'T'     'C'
p-36       0.000   0.000   1.000   0.000      0.173   0.000   0.000   0.001
p-35       0.000   0.000   1.000   0.000      0.002   0.000   0.000   0.077
p-34       0.000   0.999   0.000   0.000      0.004   0.000   0.019   0.000
p-33       0.738   0.000   0.000   0.001      0.000   0.031   0.000   0.012
p-32       0.000   0.000   0.000   0.909      0.012   0.000   0.012   0.001

Table 3
Category distance computed by different algorithms for the 16th attribute (p-35) of the resulting promoters cluster.

CWC       'A'    'G'    'T'    'C'
'A'       1.00   1.00   1.50   1.00
'G'       1.00   1.00   1.50   1.00
'T'       1.50   1.50   0.00   1.50
'C'       1.00   1.00   1.50   1.00

CEWKM     'A'    'G'    'T'    'C'
'A'       0.00   0.02   0.02   0.02
'G'       0.02   0.00   0.02   0.02
'T'       0.02   0.02   0.00   0.02
'C'       0.02   0.02   0.02   0.00

MWKM      'A'    'G'    'T'    'C'
'A'       0.01   1.00   1.00   1.00
'G'       1.00   0.01   1.00   1.00
'T'       1.00   1.00   0.01   1.00
'C'       1.00   1.00   1.00   0.01

OF        'A'    'G'    'T'    'C'
'A'       0.00   6.08   1.07   3.71
'G'       6.08   0.00   1.23   4.25
'T'       1.07   1.23   0.00   0.75
'C'       3.71   4.25   0.75   0.00

Goodall3  'A'    'G'    'T'    'C'
'A'       0.01   1.00   1.00   1.00
'G'       1.00   0.00   1.00   1.00
'T'       1.00   1.00   0.60   1.00
'C'       1.00   1.00   1.00   0.00

MSFM      'A'    'G'    'T'    'C'
'A'       0.00   0.29   1.73   0.81
'G'       0.29   0.00   2.01   1.10
'T'       1.73   2.01   0.00   0.92
'C'       0.81   1.10   0.92   0.00

5.3. Performance comparison

This set of experiments was designed to evaluate the average performance of CWC and to compare it with the competing methods on more real-world data sets.

5.3.1. Data sets and parameter settings

We used ten real-world data sets obtained from the UCI Machine Learning Repository, including the Promoters set used in the previous subsection. Table 4 lists the details. The Lungcancer, Soybean, Promoters, Dermatology, Breastcancer, Mushroom and SpliceDNA data sets originally consist of categorical attributes. The remaining sets contain attributes of mixed types. The Heartdisease data set includes patient instances collected at the Cleveland Clinic, and the Creditcard set is the Australian credit card data; both have 6 numeric attributes. To enable the categorical data clustering algorithms on them, the numeric attributes were transformed into ordinal ones using an equal-width binning method with the number of bins set to √N, following the suggestion of [34]. The Adult set was extracted from the 1994 census bureau database, where the 6 numeric attributes were discretized using the same method but with the number of bins set to 30, because √N ≈ 173 for this set, which is much larger than the number of categories on each attribute of the data set.

Generally, the number of clusters K should be estimated before applying the clustering algorithms. However, this is currently an open problem, especially for categorical data [33,35]. In the experiments, we simply set the parameter to the true number of classes of the data sets, as shown in Table 4, and leave this issue for future work. The parameter β in CWC was estimated using the same method described in Section 5.2. Fig. 2 illustrates the relationship between β and the CU values on all data sets except Promoters. The optimal β can be obtained from the figures, i.e., 0.35 (for Lungcancer), 0.05 (for Soybean), 0.4 (for Heartdisease), 0.9 (for Dermatology), 0.2 (for Creditcard), 0.55 (for Breastcancer), 0.8 (for SpliceDNA), 0.75 (for Mushroom) and 0.25 (for Adult). We also observe from the figures that CWC is robust with respect to β on seven of the ten data sets (the exceptions being SpliceDNA, Adult and Promoters). In particular, on Soybean, the CU values are almost unchanged with increasing β; in this case, the smallest value 0.05 is selected.
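For the discretization step described above, an equal-width binning with √N bins can be done in a few lines. The sketch below is an illustration of that preprocessing (not the authors' code), using only the Python standard library.

```python
import math

def equal_width_bins(values, num_bins=None):
    """Discretize a numeric column into ordinal categories by equal-width binning.
    If num_bins is None, use round(sqrt(N)) as suggested in the text."""
    n = len(values)
    k = num_bins or max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant column
    # Map each value to a bin index in [0, k-1]; the maximum falls in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [23, 35, 46, 52, 61, 29, 44, 38, 57]
print(equal_width_bins(ages))             # 3 bins for N = 9
```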

Table 4
Details of the ten real-world data sets.

Data set        #Objects (N)   #Clusters (K)   #Attributes (D)
Lungcancer      32             3               56
Soybean         47             4               21
Promoters       106            2               57
Heartdisease    270            2               13
Dermatology     366            6               33
Creditcard      653            2               15
Breastcancer    699            2               9
SpliceDNA       3190           3               60
Mushroom        8124           2               21
Adult           30162          2               14

5.3.2. Experimental results

Table 5 summarizes the average clustering accuracy of CWC and the competing algorithms on the ten data sets, while Table 6 gives the CU results. Each data set was clustered by each algorithm for 100 executions, and the average performance is reported in the format average ± one standard deviation.


Fig. 2. The relationships between β and the CU values on the real-world data sets except Promoters.

For fairness of comparison, all the algorithms performed clustering from the same initial centers (for CWC, they were used to generate the initial Π(0)). For each data set, the best result, or any result without significant difference from the best, is marked in bold typeface. The results shown in the tables indicate that CWC significantly outperforms the mode-based algorithms CEWKM and MWKM. Overall, KPC also performs better than the mode-based algorithms due to its use of a non-mode representation for categorical clusters; however, it obtains clearly worse results than CWC in most cases. Note that both CEWKM and KPC compute the object-to-cluster distances based on the independence assumption, as discussed in Section 2.2. This is essentially different from our CWC, where the distances are adaptively learned in terms of the category heterogeneity. Consequently, more accurate results can be obtained by CWC. The tables also show that both Goodall3 and MSFM perform unstably in clustering, and the OF measure performs poorly on the whole. For example, Goodall3 achieves the highest CU on Breastcancer; however, it fails in clustering Lungcancer, Soybean, Dermatology and Mushroom, while performing the worst on Promoters and Adult. Similar results were found with the MSFM measure in our experiments. These outcomes are mainly due to the fact that the three measures are defined independently of the clustering objective. In CWC, by contrast, each pairwise category distance is learned in order to minimize the clustering criterion.

Since the clusters and the distances are jointly optimized in the unsupervised learning process, CWC is able to achieve high-quality clustering on the categorical data sets, as Tables 5 and 6 show.

5.3.3. Efficiency

The efficiency of the different clustering algorithms with respect to the data size and the number of categories is tested in this subsection. The data set used was Adult (see Table 4), as it contains a relatively large number of objects and six numeric attributes that can be discretized into varying numbers of bins, which facilitates the efficiency evaluation. To test the sensitivity with respect to the number of categories, four data sets were generated based on the original Adult set, each by discretizing the numeric attributes into ordinal categories with the number of bins set to 5, 10, 20 or 30. The set with 30 categories has been used in the previous subsection for the performance comparison. Each data set was clustered by each algorithm for 20 executions, and the runtimes were averaged. The left panel of Fig. 3 illustrates the change in the average CPU time used by the algorithms on the four data sets. The results show that the time used by the mode-based algorithms remains approximately constant, while the time of the others, including our CWC, scales linearly with the number of categories. The reason for CWC lies in the need to learn the category weights in each iteration of the clustering process, as discussed in Section 4.2.


Table 5
Average clustering accuracy (CA) of different algorithms on the real-world data sets.

Data set       CWC           KPC           CEWKM         MWKM          OF            Goodall3      MSFM
Lungcancer     0.57 ± 0.06   0.51 ± 0.06   0.54 ± 0.06   0.54 ± 0.06   0.53 ± 0.05   0.57 ± 0.07   –
Soybean        0.93 ± 0.12   0.82 ± 0.15   0.86 ± 0.16   0.85 ± 0.16   0.76 ± 0.09   0.88 ± 0.13   0.61 ± 0.07
Promoters      0.95 ± 0.04   0.77 ± 0.10   0.63 ± 0.09   0.62 ± 0.09   0.62 ± 0.07   0.52 ± 0.02   0.73 ± 0.06
Heartdisease   0.83 ± 0.01   0.80 ± 0.02   0.75 ± 0.09   0.74 ± 0.08   0.68 ± 0.03   0.65 ± 0.05   –
Dermatology    0.72 ± 0.11   0.68 ± 0.13   0.68 ± 0.12   0.63 ± 0.11   0.63 ± 0.07   0.74 ± 0.12   0.60 ± 0.05
Creditcard     0.80 ± 0.04   0.73 ± 0.14   0.75 ± 0.10   0.76 ± 0.09   0.56 ± 0.03   0.73 ± 0.08   0.95 ± 0.04
Breastcancer   0.94 ± 0.00   0.95 ± 0.05   0.89 ± 0.08   0.89 ± 0.09   0.87 ± 0.00   0.80 ± 0.01   0.60 ± 0.03
SpliceDNA      0.71 ± 0.06   0.71 ± 0.04   0.42 ± 0.02   0.42 ± 0.02   0.55 ± 0.05   –             –
Mushroom       0.84 ± 0.09   0.79 ± 0.14   0.76 ± 0.14   0.76 ± 0.14   0.73 ± 0.13   0.71 ± 0.10   0.59 ± 0.06
Adult          0.69 ± 0.06   0.67 ± 0.06   0.61 ± 0.05   0.61 ± 0.05   0.65 ± 0.03   0.70 ± 0.05

Table 6
Average category utility (CU) of different methods on the real-world data sets.

Data set       CWC           KPC           CEWKM         MWKM          OF            Goodall3      MSFM
Lungcancer     1.22 ± 0.12   1.13 ± 0.12   1.21 ± 0.10   1.21 ± 0.10   1.10 ± 0.14   1.16 ± 0.12   –
Soybean        1.33 ± 0.11   1.23 ± 0.15   1.26 ± 0.16   1.24 ± 0.16   1.13 ± 0.13   1.25 ± 0.16   0.45 ± 0.06
Promoters      0.64 ± 0.03   0.62 ± 0.08   0.49 ± 0.05   0.49 ± 0.05   0.44 ± 0.06   0.28 ± 0.07   0.25 ± 0.07
Heartdisease   0.40 ± 0.00   0.40 ± 0.01   0.32 ± 0.07   0.32 ± 0.07   0.15 ± 0.03   0.21 ± 0.04   –
Dermatology    0.76 ± 0.03   0.76 ± 0.03   0.74 ± 0.05   0.69 ± 0.08   0.67 ± 0.04   0.76 ± 0.03   0.12 ± 0.06
Creditcard     0.39 ± 0.02   0.37 ± 0.03   0.34 ± 0.05   0.33 ± 0.04   0.07 ± 0.01   0.29 ± 0.09   0.60 ± 0.06
Breastcancer   0.59 ± 0.00   0.60 ± 0.08   0.47 ± 0.10   0.48 ± 0.12   0.40 ± 0.01   0.46 ± 0.01   0.40 ± 0.01
SpliceDNA      0.33 ± 0.03   0.41 ± 0.02   0.28 ± 0.04   0.29 ± 0.04   0.38 ± 0.02   –             –
Mushroom       0.85 ± 0.05   0.81 ± 0.09   0.75 ± 0.12   0.75 ± 0.12   0.67 ± 0.14   0.66 ± 0.19   0.21 ± 0.03
Adult          0.36 ± 0.06   0.34 ± 0.06   0.29 ± 0.06   0.28 ± 0.06   0.23 ± 0.09   0.33 ± 0.11

Fig. 3. The relationships between the runtime and different numbers of categories, and data objects.


The scalability with respect to the number of data objects was evaluated on the entire Adult data set (which consists of about 30,000 objects) and three subsets containing 5000, 10,000 or 20,000 objects, respectively. The subsets were created by sampling the original Adult set, and the numeric attributes in the four sets were discretized into 30 categories using the binning method described previously. The relationship between the average time and the different numbers of data objects is shown in the right panel of Fig. 3, where we can see that the runtime of all the algorithms increases sublinearly with respect to the number of objects to be clustered. The algorithms using the OF, Goodall3 and MSFM measures consume more CPU time, because they require more iterations to reach convergence on these data sets. The mode-based algorithms and KPC are efficient due to the use of mode categories or probabilistic centers to represent the categorical clusters. The efficiency of our CWC falls in between.

6. Concluding remarks

In this paper, we first discussed the problems arising from the implicit independence assumption in the existing distance measures for categorical data clustering. We then proposed a new measure called the category distance, where the pairwise category distance is measured by the category heterogeneity and can be learned by optimizing a set of category weights. We showed, via Lemmas 1 and 2, that the new measure is indeed a distance metric. We also proposed a partition-based clustering algorithm using the category distance measure, in which the distance is optimized jointly with the clusters. Further analysis showed that the resulting category weights can be used to identify the mode categories and to rank the categorical attributes. A case study on DNA sequences and evaluation experiments on ten UCI data sets were conducted, and the results demonstrated the effectiveness of the proposed methods. For future work, we want to further improve the distance definition to discriminate between nominal attributes and ordinal attributes. We will also explore the cluster validation problem based on the category distance metric and the problem of determining the number of clusters in a categorical data set.

Acknowledgments

The work has been funded by the National Social Science Foundation of China (No. 16BKS132). The authors would like to thank Professor Lifei Chen at Fujian Normal University for his helpful suggestions and comments.

References

[1] J. Ye, Z. Zhao, H. Liu, Adaptive distance metric learning for clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7.
[2] L. Yang, Distance Metric Learning: A Comprehensive Survey, Michigan State University, 2006.
[3] P. Jain, B. Kulis, J. Davis, I. Dhillon, Metric and kernel learning using a linear transformation, J. Mach. Learn. Res. 13 (2012) 519–547.
[4] Y. He, W. Chen, Y. Chen, Y. Mao, Kernel density metric learning, in: Proceedings of the IEEE International Conference on Data Mining, 2013, pp. 271–280.
[5] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning distance functions using equivalence relations, in: Proceedings of the International Conference on Machine Learning, 2003, pp. 11–18.
[6] J.Z. Huang, M.K. Ng, H. Rong, Z. Li, Automated variable weighting in k-means type clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 657–668.
[7] J. Yu, X. Yang, F. Gao, D. Tao, Deep multimodal distance metric learning using click constraints for image ranking, IEEE Trans. Cybern. 47 (12) (2017) 4014–4024.


[8] H. Chang, D.Y. Yeung, Locally linear metric adaptation with application to semi-supervised clustering and image retrieval, Pattern Recognit. 39 (2006) 1253–1264.
[9] E.P. Xing, A.Y. Ng, M.I. Jordan, S.J. Russell, Distance metric learning with application to clustering with side-information, in: Proceedings of the Conference and Workshop on Neural Information Processing Systems, 2002, pp. 505–512.
[10] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning a Mahalanobis metric from equivalence constraints, J. Mach. Learn. Res. 6 (2005) 937–965.
[11] F. Cao, J. Liang, D. Li, X. Zhao, A weighting k-modes algorithm for subspace clustering of categorical data, Neurocomputing 108 (2013) 23–30.
[12] L. Chen, S. Wang, K. Wang, J. Zhu, Soft subspace clustering of categorical data with probabilistic distance, Pattern Recognit. 51 (2016) 322–332.
[13] K. Weinberger, L. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[14] J. Zhang, J. Wang, X. Cai, Sparse locality preserving discriminative projections for face recognition, Neurocomputing 260 (2017) 321–330.
[15] S.C.H. Hoi, W. Liu, S.F. Chang, Semi-supervised distance metric learning for collaborative image retrieval and clustering, ACM Trans. Multimed. Comput. Commun. Appl. 6 (3) (2010) 18:1–18:26.
[16] J. Chen, Z. Zhao, J. Ye, H. Liu, Nonlinear adaptive distance metric learning for clustering, in: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2007, pp. 123–132.
[17] M. Bouguessa, Clustering categorical data in projected spaces, Data Min. Knowl. Discov. 29 (1) (2015) 3–38.
[18] J. Ji, T. Bai, C. Zhou, C. Ma, Z. Wang, An improved k-prototypes clustering algorithm for mixed numeric and categorical data, Neurocomputing 120 (2013) 590–596.
[19] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, in: Proceedings of the Document Retrieval Systems, in: Foundations of Information Science, 3, 1988, pp. 132–142.
[20] J. Davis, B. Kulis, P. Jain, S. Sra, I. Dhillon, Information-theoretic metric learning, in: Proceedings of the International Conference on Machine Learning, 2007, pp. 209–216.
[21] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: a comparative evaluation, in: Proceedings of the SIAM International Conference on Data Mining, 2008, pp. 243–254.
[22] L. Bai, J. Liang, C. Dang, F. Cao, A novel attribute weighting algorithm for clustering high-dimensional categorical data, Pattern Recognit. 44 (12) (2011) 2843–2861.
[23] R.W. Knippenberg, Orthogonalization of categorical data: how to fix a measurement problem in statistical distance metrics, SSRN Electron. J. (2013), doi:10.2139/ssrn.2357607.
[24] I. Morlini, S. Zani, A new class of weighted similarity indices using polytomous variables, J. Classif. 29 (2012) 199–226.
[25] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Discov. 2 (3) (1998) 283–304.
[26] L. Chen, A probabilistic framework for optimizing projected clusters with categorical attributes, Sci. China Inf. Sci. 58 (2015) 072104(15).
[27] D. Goodall, A new similarity index based on probability, Biometrics 22 (1966) 882–907.
[28] D. Lin, An information-theoretic definition of similarity, in: Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 296–304.
[29] C. Li, H. Li, A modified Short and Fukunaga metric based on the attribute independence assumption, Pattern Recognit. Lett. 33 (2012) 1213–1218.
[30] E. Cesario, G. Manco, R. Ortale, Top-down parameter-free clustering of high-dimensional categorical data, IEEE Trans. Knowl. Data Eng. 19 (12) (2007) 1607–1624.
[31] B. Mirkin, Reinterpreting the category utility function, Mach. Learn. 45 (2001) 219–228.
[32] G. Towell, J. Shavlik, M. Noordewier, Refinement of approximate domain theories by knowledge-based neural networks, in: Proceedings of the National Conference on Artificial Intelligence, 1990, pp. 861–866.
[33] G. Guo, L. Chen, Y. Ye, Q. Jiang, Cluster validation method for determining the number of clusters in categorical sequences, IEEE Trans. Neural Netw. Learn. Syst. 28 (12) (2017) 2936–2948.
[34] Y. Yang, G.I. Webb, Proportional k-interval discretization for Naive-Bayes classifiers, in: Proceedings of the European Conference on Machine Learning, 2001, pp. 564–575.
[35] Y.M. Cheung, H. Jia, Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number, Pattern Recognit. 46 (8) (2013) 2228–2238.

Baoguo Chen, born in 1966, received his Ph.D. in management from Hohai University. He is a professor at the Research Center for Science Technology and Society, Fuzhou University of International Studies and Trade, China. His research interests include large data learning and information management.


Haitao Yin, born in 1991, is a graduate student in philosophy of science and technology at Fuzhou University, China. His research interests include machine learning and science and technology management.