Nonlinear multicriteria clustering based on multiple dissimilarity matrices


Pattern Recognition 46 (2013) 3383–3394


Sergio Queiroz a,*, Francisco de A.T. de Carvalho a, Yves Lechevallier b

a Centro de Informática, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária, CEP 50740-560 - Recife (PE), Brazil
b INRIA Paris - Rocquencourt, Domaine de Voluceau, BP 105, 78153 Le Chesnay Cedex, France

Article info

Article history: Received 30 November 2012; received in revised form 24 May 2013; accepted 10 June 2013; available online 19 June 2013.

Abstract

We present a new algorithm capable of partitioning sets of objects by taking simultaneously into account their relational descriptions given by multiple dissimilarity matrices. The novelty of the algorithm is that it is based on a nonlinear aggregation criterion, weighted Tchebycheff distances, more appropriate than linear combinations (such as weighted averages) for the construction of compromise solutions. We obtain a hard partition of the set of objects, the prototype of each cluster and a weight vector that indicates the relevance of each matrix in each cluster. Since this is a clustering algorithm for relational data, it is compatible with any distance function used to measure the dissimilarity between objects. Results obtained in experiments with data sets (synthetic and real) show the usefulness of the proposed algorithm. © 2013 Elsevier Ltd. All rights reserved.

Keywords: Clustering analysis; Relational data; Multicriteria decision support; Nonlinear optimization

1. Introduction

Clustering methods organize a set of observations (items/objects) into subsets (called clusters) so that items in the same cluster are more similar to each other than items in different clusters. Such methods have been applied in areas as diverse as bioinformatics, image processing, data mining and information retrieval. A common approach to clustering is based on partitioning the input data [27]. Partitioning methods seek to obtain a partition of the input data into a fixed number of clusters. These methods usually seek a partition that optimizes (often locally) an objective function. To improve the quality of the obtained partition, it is common to execute the algorithm several times with different initial conditions, and the best final configuration obtained over the multiple runs is used as the result of the clustering process. Partitioning methods are divided into hard and fuzzy methods. Hard methods produce a partition in which each item of the input set is assigned to exactly one cluster. Fuzzy methods generate a partition that provides a degree of membership of each item in each cluster, which allows expressing that items belong to more than one cluster at a time. Commonly, two types of representation of items can be used to perform the clustering task: characteristic data and relational data.

* Corresponding author. Tel.: +55 81 21268430; fax: +55 81 21268438. E-mail addresses: [email protected] (S. Queiroz), [email protected] (F.d.A.T. de Carvalho), [email protected] (Y. Lechevallier).
http://dx.doi.org/10.1016/j.patcog.2013.06.008

When each item is described by a vector of quantitative or qualitative values, the set of vectors that describes the items is named characteristic data. Alternatively, when we describe a relationship between each pair of items, the set of relations is named relational data. The most common case of relational data is a dissimilarity matrix, for example D = [d(i, l)], where d(i, l) is the dissimilarity (usually a distance) between the pair of items i and l. Methods that are able to accomplish the task of clustering from relational data are extremely useful, allowing the grouping of items that cannot be described by characteristic data, or for which the distance measure between items does not have a closed form. For example, if the items to be grouped correspond to preference relations of different individuals, it is possible to group individuals of similar preferences just from the distances between the preferences of each pair of individuals, measured by an appropriate dissimilarity measure, such as the probabilistic distance of Ha and Haddawy [15], which can even be used under uncertainty or for partial relations.

In many situations we have not only a single measure of distance between pairs of items but a vector of distances for each pair. This may be due to multicriteria problems, where each distance is obtained according to a different criterion; to group decision, where each distance corresponds to the opinion of a different individual; or to decision under uncertainty, where each distance corresponds to the realization of a different scenario. Hence the need for methods capable of clustering according to multiple dissimilarity matrices taken into account simultaneously. Given the equivalence between the three types of problems, we will treat the problem as multicriteria clustering, without loss of generality.


In recent years, some techniques have been proposed in the literature to address the problem of multicriteria clustering. In [8], an extension of the popular k-means clustering algorithm, which is based on the optimization of a local criterion, is proposed for the multicriteria case, whereas in [11] a new multicriteria clustering algorithm based on heuristics is proposed. Both methods rely on specific distance functions defined in their respective papers, lacking the generality of clustering methods based on relational data, which can use any distance measure between items that is appropriate for the problem at hand. Among clustering methods based on relational data for the multicriteria problem, the Clustering and Aggregating Relational Data (CARD) algorithm, based on the popular relational clustering algorithms Non-Euclidean Relational Fuzzy clustering (NERF) [17] and Fuzzy Analysis (FANNY) [20], is proposed in [14]. In [7], the dynamic hard clustering algorithm based on multiple dissimilarity matrices (MRDCA) is proposed: a hard clustering algorithm that takes into account multiple dissimilarity matrices, extending the dynamic hard clustering algorithms of [22,6]. Both CARD and MRDCA are able to compute a local weight for each dissimilarity matrix in each of the clusters, which potentially produces more meaningful clusters [14]. However, both methods use weighted averages as the aggregation function, which may be inappropriate in many situations for finding compromise solutions.

In this paper we propose a new algorithm that is the first clustering method for relational data expressed in multiple dissimilarity matrices that does not use a linear aggregator in its process of combining them. Instead, the proposed method, a dynamic hard clustering algorithm called WRDCA, uses weighted Tchebycheff distances instead of weighted averages. Its potential advantage is that, as is known in the multicriteria optimization literature, optimizing a function based on weighted Tchebycheff distances is more suitable for the construction of compromise solutions than optimizing weighted averages. However, being a nonlinear aggregator, its optimization is almost always non-trivial. In this paper, we show that the criterion can be optimized in our clustering method by using linear programming. The use of linear programming also adds to the flexibility of the algorithm, as it opens the way to the introduction of constraints (those that can be expressed as linear program restrictions) in the clustering process, which may be useful in many problems.

The paper is organized as follows. In Section 2 the problem being treated is described in more detail, as well as its motivation. Section 3 describes our algorithm, while in Section 4 the application of the algorithm is illustrated through various representative examples; finally, Section 5 presents concluding remarks.

2. Problem description and motivation

We begin this section by introducing the notation that will be used throughout this paper. Then we highlight the drawbacks of using linear combinations (for example, weighted averages) as the aggregation function in the search for optimal solutions, and the possible advantages of using nonlinear aggregators such as weighted Tchebycheff distances.

2.1. Common notation

Throughout this paper, we consider E = {e_1, …, e_n} to be a set of n items to be partitioned into K clusters C_k (k ∈ {1, …, K}), whereas (D_1, …, D_p) is a list of p (n × n) dissimilarity matrices, where D_j[i, l] = d_j(e_i, e_l) ∈ R≥0 gives the dissimilarity between objects e_i and e_l according to the dissimilarity matrix D_j, ∀j = 1, …, p. Each cluster C_k has a representative element g_k called a prototype, which is an element of E, i.e., g_k ∈ E ∀k = 1, …, K. We call G the list of prototypes G = (g_1, …, g_K). In a multicriteria clustering algorithm such as MRDCA, a partition P = (C_1, …, C_K) of E in K clusters and the corresponding prototype g_k ∈ E for each cluster C_k in P are obtained by minimizing an (internal) adequacy criterion (objective function). This criterion measures the adequacy between the clusters and their respective prototypes and is, in general, based on a distance. Such a distance is frequently parameterized by weight vectors λ = (λ_1, …, λ_K), also called a relevance matrix, with λ_k = (λ_{1k}, …, λ_{pk}) ∈ R^p≥0 (∀k ∈ {1, …, K}).

2.2. Drawbacks of linear combinations

Current multicriteria clustering algorithms such as CARD and MRDCA seek to minimize adequacy criteria for the generated clusters by taking into account the weighted average of the dissimilarities between the elements of the cluster. In order to illustrate this, we describe hereinafter the procedure taken by the MRDCA algorithm. In MRDCA, the adequacy criterion, called J, is defined as

J = ∑_{k=1}^{K} ∑_{e_i ∈ C_k} d^{(k)}(e_i, g_k),  where  d^{(k)}(e_i, g_k) = ∑_{j=1}^{p} λ_{jk} d_j(e_i, g_k)   (1)

is the dissimilarity between an item e_i ∈ C_k and the prototype g_k ∈ E parameterized by the weight vector λ_k = (λ_{1k}, …, λ_{pk}), where λ_{jk} is the weight of the dissimilarity matrix D_j in the cluster C_k. The relevance matrix λ changes at each iteration (notice that the weights may be different from one cluster to another), being calculated in such a way as to minimize J (see [7]). Therefore, in MRDCA, optimization is performed by minimizing a function based on the linear combination of dissimilarities between the items in a cluster and its prototype.
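As a concrete reference point for the linear criterion of Eq. (1), the sketch below evaluates J from p dissimilarity matrices, a hard partition, a list of prototypes and a relevance matrix. The array layout (a p × n × n NumPy array and per-cluster index lists) is our own illustrative choice, not something prescribed by the paper.

```python
import numpy as np

def mrdca_criterion(D, partition, prototypes, weights):
    """Evaluate the MRDCA adequacy criterion J of Eq. (1).

    D          : array of shape (p, n, n) holding the p dissimilarity matrices.
    partition  : list of K lists with the indices of the items in each cluster.
    prototypes : list of K item indices, the prototype g_k of each cluster.
    weights    : array of shape (K, p) with the relevance weights lambda_jk.
    """
    J = 0.0
    for k, (cluster, g_k) in enumerate(zip(partition, prototypes)):
        for i in cluster:
            # d^(k)(e_i, g_k) = sum_j lambda_jk * d_j(e_i, g_k)
            J += float(np.dot(weights[k], D[:, i, g_k]))
    return J
```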

Fig. 1. Choice of the best alternative evaluated according to two criteria (to be minimized). The rectangles represent non-dominated alternatives (i.e., the Pareto frontier). An optimization function that is a linear combination (e.g., a weighted average) of the criteria (dotted lines in the figure) will lead to the choice of solutions in the convex hull (filled rectangles), while good trade-offs (circled in the figure) are not reachable by such an optimization procedure.


The minimization of a function that is a linear combination of the evaluations of objects with respect to a set of criteria may limit the possibility of finding interesting compromise solutions. For example, Fig. 1 illustrates objects evaluated according to two criteria to be minimized (horizontal and vertical axes). The rectangles represent non-dominated objects (the Pareto frontier). The optimization of a linear combination of the criteria (for example, through a weighted average of the evaluations obtained on the two criteria) will always lead to the choice of a solution on the convex hull of the frontier (filled rectangles), which in this case correspond to extremely ill-balanced solutions (i.e., with a very high disagreement between the two evaluation criteria). Objects that would be better choices w.r.t. the balance between the criteria (circled in the figure), and that represent a good compromise between the criteria, lie in the concave region of the frontier, making it impossible for them to be chosen by optimizing a linear combination of the criteria.

2.3. Advantages of weighted Tchebycheff distances

When we search for compromise solutions with respect to multiple criteria, a more flexible method than linear combinations, used in the multi-objective optimization literature, is the Tchebycheff criterion [25,26]. In order to generate compromise solutions according to the Tchebycheff criterion we minimize the function

f_w(x) = ‖w ∘ (u* − u(x))‖_∞ = max_{j ∈ {1,…,p}} w_j |u*_j − u_j(x)|

where

• u* = (u*_1, …, u*_p) represents an ideal point (normally fictitious) that simultaneously has the best evaluation according to all the p criteria. This ideal point dominates, in the Pareto sense, all the non-dominated solutions.
• w is a vector of (positive) weights.

Fig. 2. Distance according to the infinity norm to a fictitious ideal point (in the bottom left of the figure) for the example depicted in Fig. 1 (using the identity weight vector). The optimum point found (filled rectangle) is one of those identified as showing a good trade-off in Fig. 1.

Note that, by minimizing the distance according to the infinity norm (the maximum coordinate), the focus is on the criterion for which the difference between the evaluation obtained by the object being considered and the best result that may be obtained by any object is greatest. Thus, we favor objects close to the reference point u* in all dimensions of the criterion space. This allows us to find good compromises. The functions f_w(x) have two important properties [26]:

1. If w_j > 0 ∀j ∈ {1, …, p}, then every solution that minimizes f_w(x) in the set of objects X is weakly Pareto-optimal. In addition, at least one such solution is Pareto-optimal.
2. If u*_j < inf_{x ∈ X} u_j(x) ∀j ∈ {1, …, p}, then for every Pareto-optimal solution x there is a weight vector w such that x is the only solution that minimizes f_w(x) in X.

Property 1 shows that the minimization of f_w(x) reaches at least one Pareto-optimal solution. Property 2 shows that, unlike with linear aggregation functions, every Pareto-optimal solution can be reached by minimizing f_w(x) with an appropriate weight vector. This explains why the minimization of f_w(x) is preferable to weighted sums in the case of multiobjective optimization over non-convex sets [25,26]. Fig. 2 shows the distances to an ideal point according to the infinity norm for the example depicted in Fig. 1, using the identity weight vector for w. Note that by modifying the weight vector, we deform the hypercube in order to reach other regions of interest.
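To make the contrast concrete, the following small sketch (our own illustration; the numbers are made up and only mimic the geometry of Fig. 1) compares a weighted average with the weighted Tchebycheff criterion on two alternatives evaluated on two minimized criteria: an ill-balanced extreme point and a balanced compromise.

```python
import numpy as np

# Two alternatives evaluated on two criteria to be minimized (cf. Fig. 1):
# a balanced compromise in the concave part of the frontier and an extreme point.
balanced = np.array([0.5, 0.5])
extreme  = np.array([0.05, 0.9])
ideal    = np.array([0.0, 0.0])          # fictitious ideal point u*
w        = np.array([0.5, 0.5])          # weight vector

def weighted_average(u):
    return float(np.dot(w, u))

def weighted_tchebycheff(u):
    # f_w(x) = max_j w_j * |u*_j - u_j(x)|
    return float(np.max(w * np.abs(ideal - u)))

# The linear aggregator prefers the ill-balanced extreme point ...
print(weighted_average(extreme) < weighted_average(balanced))            # True
# ... while the Tchebycheff criterion prefers the balanced compromise.
print(weighted_tchebycheff(balanced) < weighted_tchebycheff(extreme))    # True
```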

3. Multicriteria clustering with weighted Tchebycheff distances for relational data

In view of the potential drawbacks of using objective functions based on weighted averages, and the pertinence of the Tchebycheff criterion in the case of multiobjective optimization over non-convex sets, we present the WRDCA algorithm, a new hard clustering algorithm for relational data with multiple dissimilarity matrices, based on weighted Tchebycheff distances instead of linear combinations.

In WRDCA we propose a nonlinear adequacy criterion, called Γ, to measure the adequacy between the clusters and their respective prototypes. This criterion is defined as

Γ = ∑_{k=1}^{K} ∑_{e_i ∈ C_k} ρ^{(k)}(e_i, g_k),  where  ρ^{(k)}(e_i, g_k) = max_{j=1,…,p} λ_{jk} d_j(e_i, g_k)   (2)

is the dissimilarity (a weighted Tchebycheff distance) between an item e_i ∈ C_k and the prototype g_k ∈ E parameterized by the weight vector λ_k = (λ_{1k}, …, λ_{pk}).

WRDCA looks for the partition P* = {C*_1, …, C*_K} of E into K clusters, the corresponding K prototypes G* = (g*_1, …, g*_K) and the K weight vectors λ* = (λ*_1, …, λ*_K) that parameterize its weighted Tchebycheff distances, such that

Γ(G*, λ*, P*) = min{ Γ(G, λ, P) : G ∈ E^K, λ ∈ Λ^K, P ∈ P_K }   (3)

where E^K = E × ⋯ × E (K times) is the space of prototypes; Λ^K = Λ × ⋯ × Λ (K times), with Λ = { (λ_1, …, λ_p) ∈ R^p : λ_j ≥ 0 ∀j ∈ {1, …, p} and ∑_{j=1}^{p} λ_j = 1 }, is the space of valid weight vectors; and P_K is the set of all possible partitions of E into K classes (the space of partitions).

The algorithm proceeds by selecting K initial prototypes G^(0) = (g_1^(0), …, g_K^(0)) ∈ E^K and an initial relevance matrix λ^(0) with λ_k^(0) = (λ_{1k}^(0), …, λ_{pk}^(0)) = (1/p, …, 1/p) ∀k ∈ {1, …, K}. Given G^(0) and λ^(0), it starts with the partition P^(0) that minimizes Γ, that is, each item e_i, ∀i ∈ {1, …, n}, is allocated to the nearest cluster, using the following


allocation rule:

C_k^(0) = { e_i ∈ E : ρ^{(k)}(e_i, g_k^(0)) < ρ^{(h)}(e_i, g_h^(0)) ∀h ≠ k },  ∀k = 1, …, K   (4)

If the minimum is not unique, e_i is allocated to the cluster with the lowest index. We call this initial configuration v_0 = (G^(0), λ^(0), P^(0)), and its associated objective function value u_0 = Γ(G^(0), λ^(0), P^(0)). From the initial v_0 (and corresponding u_0), the algorithm generates a series v_t = (G^(t), λ^(t), P^(t)) ∈ E^K × Λ^K × P_K (and corresponding u_t = Γ(G^(t), λ^(t), P^(t))), where t = 1, 2, …, T is the iteration number. Each iteration involves three steps, as we see in what follows. The WRDCA algorithm iterates until it converges to a value of Γ corresponding to a local minimum, reached at the Tth iteration, when there is no further change in the assignment of items to clusters.

• Step 1: definition of the best prototypes. In this step, we start with the configuration v_{t−1} = (G^(t−1), λ^(t−1), P^(t−1)) and end with a configuration (G^(t), λ^(t−1), P^(t−1)), i.e., the partition P^(t−1) and the relevance matrix λ^(t−1) are kept fixed while we update the cluster prototypes.

Proposition 1. The prototype g_k^(t) = e_l ∈ E of each cluster C_k^(t−1) in the partition P^(t−1) which minimizes the criterion Γ is computed according to

l = arg min_{h=1,…,n} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t−1) d_j(e_i, e_h)   (5)

• Step 2: definition of the best relevance matrix. In this step, we start with the configuration (G^(t), λ^(t−1), P^(t−1)) from Step 1 and end with a configuration (G^(t), λ^(t), P^(t−1)), i.e., the partition P^(t−1) (from the last iteration) and the vector of prototypes G^(t) (obtained in Step 1) are kept fixed, while we compute a new optimal relevance matrix.

Proposition 2. The optimal vector λ_k^(t) = (λ_{1k}^(t), …, λ_{pk}^(t)) of each cluster k can be calculated by solving the min–max linear programming (mmLP) problem

Minimize ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t) d_j(e_i, g_k^(t))
subject to the restrictions 0 ≤ λ_{jk}^(t) ≤ 1 ∀j = 1, …, p and ∑_{j=1}^{p} λ_{jk}^(t) = 1   (6)

This mmLP problem can be solved by branch-and-bound and, more interestingly, can also be transformed into a mixed integer linear programming (MILP) problem, allowing its resolution by general-purpose solvers, which is advantageous given the availability of highly efficient implementations of MILP solvers. To perform such a transformation, it suffices to note that:

1. A binary max constraint of the type x_i = max(x_j, x_k) can be replaced by the following restrictions:

x_i ≥ x_j,  x_i ≥ x_k   (7)
x_i − x_j ≤ c_i M,  x_i − x_k ≤ (1 − c_i) M   (8)

where c_i is a 0–1 integer variable and M is a large enough positive constant.

2. A p-ary max constraint can be transformed into a set of constraints involving binary max constraints:

max_{j=1,…,p} λ_{jk}^(t) d_j(e_i, g_k^(t)) = max( λ_{1k}^(t) d_1(e_i, g_k^(t)), m(2) )   (9)

where m(y) = λ_{pk}^(t) d_p(e_i, g_k^(t)) if y = p, and m(y) = max( λ_{yk}^(t) d_y(e_i, g_k^(t)), m(y+1) ) if y ≠ p.

• Step 3: definition of the best partition. In this step, we start with the configuration (G^(t), λ^(t), P^(t−1)) from Step 2 and finally end with the configuration v_t = (G^(t), λ^(t), P^(t)), i.e., the vector of prototypes G^(t) (obtained in Step 1) and the relevance matrix λ^(t) (computed in the previous step) are kept fixed while we update the partition.

Proposition 3. The partition P^(t) = (C_1^(t), …, C_K^(t)) that minimizes Γ is computed by following the same allocation rule used in Eq. (4), i.e.

C_k^(t) = { e_i ∈ E : ρ^{(k)}(e_i, g_k^(t)) < ρ^{(h)}(e_i, g_h^(t)) ∀h ≠ k },  ∀k = 1, …, K   (10)

and if the minimum is not unique, e_i will be allocated to the cluster with the lowest index.

We show in the next section that each step described above reduces the criterion Γ; therefore Γ converges to a local minimum.
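The prototype-update and allocation steps (Propositions 1 and 3) follow directly from Eqs. (2), (5) and (10). The sketch below is a minimal NumPy illustration under our own data-layout assumptions (a p × n × n array of dissimilarities and per-cluster index lists); Step 2 is omitted here because it requires an LP/MILP solver, for which a sketch is given after Section 3.3.

```python
import numpy as np

def rho(D, weights_k, i, g_k):
    # Weighted Tchebycheff dissimilarity of Eq. (2): max_j lambda_jk * d_j(e_i, g_k)
    return float(np.max(weights_k * D[:, i, g_k]))

def best_prototypes(D, partition, weights):
    """Step 1 (Proposition 1): for each cluster, choose the item e_l minimizing
    the summed weighted Tchebycheff distance to the cluster members (Eq. (5))."""
    n = D.shape[1]
    prototypes = []
    for k, cluster in enumerate(partition):
        costs = [sum(rho(D, weights[k], i, h) for i in cluster) for h in range(n)]
        prototypes.append(int(np.argmin(costs)))
    return prototypes

def best_partition(D, prototypes, weights):
    """Step 3 (Proposition 3, Eq. (10)): allocate each item to the nearest prototype;
    np.argmin breaks ties by the lowest cluster index, as required."""
    K, n = len(prototypes), D.shape[1]
    partition = [[] for _ in range(K)]
    for i in range(n):
        dists = [rho(D, weights[k], i, prototypes[k]) for k in range(K)]
        partition[int(np.argmin(dists))].append(i)
    return partition
```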

3.1. Properties of the algorithms

This section first shows the convergence properties of the WRDCA clustering algorithm; then its complexity is given.

3.1.1. Convergence properties

Following the method used in [9], the convergence properties of a dynamic clustering algorithm such as WRDCA can be studied from the two series v_t = (G^(t), λ^(t), P^(t)) ∈ E^K × Λ^K × P_K and u_t = Γ(v_t) = Γ(G^(t), λ^(t), P^(t)), t = 0, 1, …, T. As we saw, from an initial term v_0 = (G^(0), λ^(0), P^(0)), the algorithm computes the successive terms of the series v_t until convergence (to be shown), when the criterion Γ achieves a stationary value at the Tth iteration.

Proposition 4. The series u_t = Γ(v_t) decreases at each iteration t = 1, …, T, converging at the Tth iteration.

Proof. We will first show that the inequalities (I), (II) and (III) below hold (i.e., the series decreases at each iteration):

u_{t−1} = Γ(G^(t−1), λ^(t−1), P^(t−1)) ≥(I) Γ(G^(t), λ^(t−1), P^(t−1)) ≥(II) Γ(G^(t), λ^(t), P^(t−1)) ≥(III) Γ(G^(t), λ^(t), P^(t)) = u_t

The inequality (I) holds because

Γ(G^(t−1), λ^(t−1), P^(t−1)) = ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t−1) d_j(e_i, g_k^(t−1)),

Γ(G^(t), λ^(t−1), P^(t−1)) = ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t−1) d_j(e_i, g_k^(t)),

and, according to Proposition 1 (in Step 1),

G^(t) = (g_1^(t), …, g_K^(t)) = arg min_{G=(g_1,…,g_K) ∈ E^K} ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t−1) d_j(e_i, g_k).

Moreover, inequality (II) also holds because

Γ(G^(t), λ^(t), P^(t−1)) = ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk}^(t) d_j(e_i, g_k^(t))

and, according to Proposition 2 (in Step 2),

λ^(t) = (λ_1^(t), …, λ_K^(t)) = arg min_{λ=(λ_1,…,λ_K) ∈ Λ^K} ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_{jk} d_j(e_i, g_k^(t)).

The inequality (III) also holds because

Γ(G^(t), λ^(t), P^(t)) = ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t)} max_{j=1,…,p} λ_{jk}^(t) d_j(e_i, g_k^(t))

and, according to Proposition 3 (in Step 3),

P^(t) = (C_1^(t), …, C_K^(t)) = arg min_{P=(C_1,…,C_K) ∈ P_K} ∑_{k=1}^{K} ∑_{e_i ∈ C_k} max_{j=1,…,p} λ_{jk}^(t) d_j(e_i, g_k^(t)).

Finally, because the series u_t decreases and is bounded below (Γ(v_t) ≥ 0), it converges. □

Proposition 5. The series v_t = (G^(t), λ^(t), P^(t)) converges.

Proof. Assume that the stationarity of the series u_t is achieved at iteration t = T. Then we have u_T = u_{T+1}, that is, Γ(v_T) = Γ(v_{T+1}). From Γ(v_T) = Γ(v_{T+1}) we have Γ(G^(T), λ^(T), P^(T)) = Γ(G^(T+1), λ^(T+1), P^(T+1)), and this equality, according to Proposition 4, can be rewritten as the equalities (I)–(III):

u_T = Γ(G^(T), λ^(T), P^(T)) =(I) Γ(G^(T+1), λ^(T), P^(T)) =(II) Γ(G^(T+1), λ^(T+1), P^(T)) =(III) Γ(G^(T+1), λ^(T+1), P^(T+1)) = u_{T+1}

From the first equality (I), we have that G^(T) = G^(T+1), because G is the unique minimizer of Γ when the partition P^(T) and the vector of weight vectors λ^(T) are fixed. From the second equality (II), we have that λ^(T) = λ^(T+1), because λ is the unique minimizer of Γ when the partition P^(T) and the vector of prototypes G^(T+1) are fixed. Moreover, from the third equality (III), we have that P^(T) = P^(T+1), because P is the unique minimizer of Γ when the vector of prototypes G^(T+1) and the vector of weight vectors λ^(T+1) are fixed. Therefore we conclude that v_T = v_{T+1}. This conclusion holds for all t ≥ T, so v_t = v_T ∀t ≥ T, and it follows that the series v_t converges. □

3.1.2. Complexity analysis

The time complexity of WRDCA can be analyzed by considering the complexity of each single step. Let n be the number of objects, K ≪ n the number of clusters and p the number of dissimilarity matrices:

• Initialization: the initialization of the relevance weight vectors costs O(p × K). The random selection of K distinct prototypes can be done using a random generator and a data structure such as a red–black tree to check for repetitions. The time complexity is then O(p × K × log K).
• Step 1: for each cluster, using each dissimilarity table, we test each individual as a candidate prototype. This needs the computation of the distance between an individual i (i = 1, …, n) and all elements of each cluster using all p dissimilarity matrices, and it costs O(n² × p).
• Step 2: in this step we need to solve a MILP problem. Note that, when we transform our mmLP into a MILP using Eqs. (7)–(9), we introduce p − 1 integer variables of type 0–1 into the program, as each p-ary max is transformed into p − 1 binary maxes, and each binary max introduces a 0–1 integer variable. Minimizing such a mixed-integer program is a known NP-hard problem. Therefore, in order to be tractable, we need p to be small. As p is the number of criteria of our problem, this is often the case, but it is a limitation of our method that it becomes intractable for a large number of criteria, as the cost to find the optimal assignment of the p − 1 binary variables for all clusters could be as bad as O(K × 2^p).
• Step 3: this step needs the computation of the dissimilarity between an individual i (i = 1, …, n) and the prototype of each cluster using the p dissimilarity matrices. It thus costs O(p × K × n).

So, globally, under the assumption that p and K are small constants (such that Step 2 remains tractable), we have a cost of O(n²) per iteration¹ (remember that each iteration is a cycle through the three steps). If the clustering process needs t iterations to converge, we would have a time of O(n² × t). As in the traditional k-means method, the number of iterations needed to converge may, depending on the data, be prohibitively high; however, the observed speed is generally acceptable under the previous assumptions. Meaningful theoretical bounds on the worst-case running times of this kind of algorithm are very hard to obtain; we refer the interested reader to [1,2] for more information and recent investigations concerning k-means.

¹ As we would have O(1) in the Initialization, plus O(n²) in Step 1, plus O(1) in Step 2, plus O(n) in Step 3.

3.2. Global weights

As we mentioned before, the use of a local weight for each dissimilarity matrix in each of the clusters has the potential of producing more significant clusters. For instance, Ref. [14] gives the example that, when clustering image data, one cluster of images (e.g., "roses") may share similar texture and structure properties but have different colors, whereas another cluster (e.g., "outdoor scenes") may be similar in color only. Therefore, to obtain meaningful clusters across all similarity matrices in the given example, we need to learn cluster-dependent relevance weights for each similarity matrix.


However, in other applications it may be a requirement of the problem that all clusters share the same relevance weights. For example, consider a decision problem (as used in [11]) where we have a set of projects evaluated according to several criteria and we want to use a clustering method in order to group them into classes such as "Exceptional", "Average", "Bad". It is natural that the decision maker wants each cluster to use the same relevance weight matrix, so that every project is categorized under the same method of evaluation. This calls for a multicriteria clustering algorithm that learns a unique relevance weight matrix to be used in all clusters.

Fortunately, the WRDCA algorithm can easily be adapted to give a version with global weights. In this version, in the computation of the adequacy criterion Γ, we have λ_{jk} = λ_j ∀k ∈ {1, …, K}, ∀j ∈ {1, …, p}, as we have a single global weight vector (over all clusters): λ = (λ_1, …, λ_p), where λ_j is the global weight of the dissimilarity matrix D_j. Notice that this weight vector may change at each iteration (but is the same over all clusters). Thereupon, we need to make small changes in the equations of each of the steps of the algorithm. In Step 1 (Eq. (5)) it suffices to use λ_j in place of λ_{jk}. The modification for Step 3 is equally simple: it also suffices to use λ_j in place of λ_{jk} in Eq. (10). The major change comes in Step 2, where we should now optimize a single mmLP problem, instead of K independent ones (one for each cluster). This new mmLP problem is given by

Minimize ∑_{k=1}^{K} ∑_{e_i ∈ C_k^(t−1)} max_{j=1,…,p} λ_j^(t) d_j(e_i, g_k^(t))
subject to the restrictions 0 ≤ λ_j^(t) ≤ 1 ∀j = 1, …, p and ∑_{j=1}^{p} λ_j^(t) = 1   (11)

where the same kind of strategy as before may be used to rewrite it as a MILP. Following an analogous reasoning for the complexity analysis, one can conclude that, under the same assumptions (small p and K), the same complexity of WRDCA with local weights applies to the version of WRDCA with global weights.

3.3. Adding constraints

Another advantage brought by the use of linear programming in Step 2 of the WRDCA algorithm is the possibility of trivially introducing constraints over the weights, namely those that can be represented by adding restrictions to the linear program. At first this may seem an unimportant detail; however, it may be of great use in practice. For instance, if we have a decision problem where each dissimilarity matrix corresponds to the comparison between objects according to one decision dimension and we want to cluster similar objects, the decision maker may have a restriction of the kind "decision dimensions should have similar weights", such that we may prioritize some dimensions but not completely ignore others. This restriction may be easily introduced by adding constraints such that any two weights may not differ by more than a given margin (we illustrate this in Section 4.4).
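To illustrate how Step 2 can be handed to a general-purpose solver, the sketch below models the per-cluster subproblem of Proposition 2 in the spirit of the max-constraint linearization of Eqs. (7)-(9), together with a Section 3.3-style restriction on how much any two weights may differ. It uses the PuLP modeling library with its bundled CBC solver purely as an illustrative choice (the paper does not prescribe a particular solver), and the function and variable names are ours.

```python
import pulp

def step2_weights(d, max_gap=None):
    """Relevance weights for one cluster: minimize sum_i max_j lambda_j * d[i][j]
    over the simplex, with the max linearized via binary variables and big-M.

    d       : list of per-item rows, d[i][j] = d_j(e_i, g_k) for the cluster's items.
    max_gap : optional bound on |lambda_j - lambda_l| (constraint of Section 3.3).
    """
    n_items, p = len(d), len(d[0])
    big_m = max(max(row) for row in d) + 1.0          # large enough constant M

    prob = pulp.LpProblem("wrdca_step2", pulp.LpMinimize)
    lam = [pulp.LpVariable(f"lam_{j}", lowBound=0, upBound=1) for j in range(p)]
    prob += pulp.lpSum(lam) == 1

    obj = []
    for i, row in enumerate(d):
        # m[y] models max(lam_y * d_y, ..., lam_{p-1} * d_{p-1}) via the chain of Eq. (9)
        m = [pulp.LpVariable(f"m_{i}_{y}", lowBound=0) for y in range(p)]
        prob += m[p - 1] == lam[p - 1] * row[p - 1]
        for y in range(p - 2, -1, -1):
            c = pulp.LpVariable(f"c_{i}_{y}", cat="Binary")
            prob += m[y] >= lam[y] * row[y]                   # Eq. (7)
            prob += m[y] >= m[y + 1]
            prob += m[y] - lam[y] * row[y] <= c * big_m       # Eq. (8)
            prob += m[y] - m[y + 1] <= (1 - c) * big_m
        obj.append(m[0])
    prob += pulp.lpSum(obj)

    if max_gap is not None:                                   # Section 3.3 / Section 4.4
        for j in range(p):
            for l in range(j + 1, p):
                prob += lam[j] - lam[l] <= max_gap
                prob += lam[l] - lam[j] <= max_gap

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [v.value() for v in lam]
```

For the global-weights variant of Eq. (11), the same model would simply pool the rows of all clusters (each row built from that cluster's own prototype) into a single program.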

4. Application All clustering algorithms mentioned in this paper (including ours) are completely unsupervised methods that do not use any class information to find the clusters in the data. When we run the WRDCA algorithm for a given data set, we execute it 100 times and the best result according to the internal adequacy criterion Γ is then selected.

In the applications that follow, when there is a priori class information, we will use it to compare the clustering results obtained by the clustering methods, by means of two external indices: the Corrected Rand index (CR) [19] and the F-measure [24], computed over the confusion matrix obtained after each experiment. In particular, for a K-cluster case where a priori information says that the objects belong to m classes, let P = {P_1, …, P_i, …, P_m} be the a priori partition into m classes (given), Q = {Q_1, …, Q_j, …, Q_K} be the hard partition into K clusters computed by a clustering algorithm in the experiment, and N = [n_ij] be an m-by-K confusion matrix, with n_ij the number of objects in cluster j that belong to class i. The CR index is given by

CR = [ ∑_{i=1}^{m} ∑_{j=1}^{K} C(n_ij, 2) − C(n, 2)^{-1} ∑_{i=1}^{m} C(n_i, 2) ∑_{j=1}^{K} C(n_j, 2) ] / [ (1/2) ( ∑_{i=1}^{m} C(n_i, 2) + ∑_{j=1}^{K} C(n_j, 2) ) − C(n, 2)^{-1} ∑_{i=1}^{m} C(n_i, 2) ∑_{j=1}^{K} C(n_j, 2) ]   (12)

where C(a, 2) = a(a − 1)/2 denotes the binomial coefficient; n_i indicates the number of objects in class P_i; n_j indicates the number of objects in cluster Q_j; and n is the total number of objects in the data set. The CR index assesses the degree of agreement (similarity) between an a priori partition and a partition obtained by the clustering algorithm. Moreover, the CR index is not sensitive to the number of classes in the partitions or to the distribution of the items in the clusters. Finally, the CR index takes its values in the interval [−1, 1], in which the value 1 indicates perfect agreement between partitions, whereas values near 0 (or negative) correspond to cluster agreement found by chance [23].

The traditional F-measure (Fm) between class P_i (i = 1, …, m) and cluster Q_j (j = 1, …, K) is the harmonic mean of precision (Pr) and recall (Rc):

Fm(P_i, Q_j) = 2 · Pr(P_i, Q_j) · Rc(P_i, Q_j) / ( Pr(P_i, Q_j) + Rc(P_i, Q_j) )   (13)

The precision (Pr) between class P_i and cluster Q_j is defined as the ratio between the number of objects that are in both class P_i and cluster Q_j and the number of objects in cluster Q_j:

Pr(P_i, Q_j) = n_ij / n_j = n_ij / ∑_{i=1}^{m} n_ij   (14)

The recall (Rc) between class P_i and cluster Q_j is defined as the ratio between the number of objects that are in both class P_i and cluster Q_j and the number of objects in class P_i:

Rc(P_i, Q_j) = n_ij / n_i = n_ij / ∑_{j=1}^{K} n_ij   (15)

The F-measure between the a priori partition P = {P_1, …, P_i, …, P_m} and the hard partition Q = {Q_1, …, Q_j, …, Q_K} given by a clustering algorithm is defined as

Fm(P, Q) = (1/n) ∑_{i=1}^{m} n_i max_{1 ≤ j ≤ K} Fm(P_i, Q_j)   (16)

The F-measure index takes its values from the interval [0,1], in which the value 1 indicates perfect agreement between partitions.
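Both indices are simple functions of the confusion matrix N = [n_ij]; a compact sketch (ours, not from the paper) of how they could be computed is given below.

```python
import numpy as np
from math import comb

def corrected_rand(conf):
    """Corrected Rand index of Eq. (12) from an m-by-K confusion matrix."""
    conf = np.asarray(conf, dtype=int)
    n = int(conf.sum())
    sum_ij = sum(comb(int(x), 2) for x in conf.ravel())
    sum_i = sum(comb(int(x), 2) for x in conf.sum(axis=1))   # class sizes n_i
    sum_j = sum(comb(int(x), 2) for x in conf.sum(axis=0))   # cluster sizes n_j
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

def f_measure(conf):
    """Overall F-measure of Eq. (16): size-weighted best match of each class."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    precision = conf / conf.sum(axis=0, keepdims=True)       # Eq. (14)
    recall = conf / conf.sum(axis=1, keepdims=True)          # Eq. (15)
    with np.errstate(divide="ignore", invalid="ignore"):
        fm = np.where(precision + recall > 0,
                      2 * precision * recall / (precision + recall), 0.0)
    return float((conf.sum(axis=1) * fm.max(axis=1)).sum() / n)

# Example with the Iris confusion matrix of Table 4, rows given as classes:
# corrected_rand([[0, 50, 0], [48, 0, 2], [4, 0, 46]])  ~ 0.886  (cf. Table 2)
# f_measure([[0, 50, 0], [48, 0, 2], [4, 0, 46]])       ~ 0.960
```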

4.1. First example: synthetic toy data

Goal of this example: illustrate, with very simple data, situations where it would be important to have a nonlinear clustering method such as ours.


Fig. 3. Generated data consisting of 10 clusters ("sloped points" with two attributes). The middle point of each cluster (y-coordinate 50) would be the best prototype of each cluster, but it is in a non-convex region of the space.

In order to achieve our goal, we generated the synthetic data depicted in Fig. 3. It consists of 110 points in 10 clusters. Each point has two numerical attributes (the x, y coordinates in the figure). We then generated two dissimilarity tables for the input data: one for the Euclidean distance considering the x coordinate and another for the Euclidean distance considering the y coordinate. These two dissimilarity matrices were given as the input relational data for the MRDCA² algorithm and WRDCA. The results obtained (Table 1) show that WRDCA clearly outperforms MRDCA in this situation. The average number of iterations to converge over 100 runs was μT = 5.19, whereas the standard deviation of T was sT = 1.53.

4.2. Second example: Iris data set

Goal of this example: assess the quality of the clusters proposed by our algorithm using the Iris data, a classic data set introduced in [12], which is frequently used in the data mining literature as a benchmark.

This example comes from the UCI Repository [13]. It is used in [7] in order to evaluate the performance of the MRDCA algorithm. There, the results for CARD and NERF are also given; therefore we reproduce here the indices obtained by those algorithms on this data set for comparison. The Iris data set consists of three species (classes) of plants: Iris setosa, Iris versicolour and Iris virginica. The three classes have 50 instances (objects) each. One class is linearly separable from the other two; the latter two are not linearly separable from each other. Each object is described by four real-valued attributes: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. We then generated four dissimilarity matrices, each one corresponding to the Euclidean distance according to one of the attributes. As in [7], all dissimilarity matrices were normalized according to their overall dispersion to have the same dynamic range [4]. This means that each dissimilarity d(e_i', e_l') in a given dissimilarity matrix has been normalized as d(e_i', e_l')/Y, where Y = ∑_{i=1}^{n} d(e_i, g) is the overall dispersion, with g = e_l ∈ E = {e_1, …, e_n} such that l = arg min_{h=1,…,n} ∑_{i=1}^{n} d(e_i, e_h).

Table 2 shows that WRDCA and CARD had the best performance for this data set (they obtained the same indices), followed by MRDCA and NERF. For WRDCA we had μT = 5.76 and sT = 1.98. Table 3 gives the vectors of relevance weights obtained by WRDCA, while Table 4 gives the confusion matrix of the three-cluster hard partition given by the WRDCA algorithm. Looking at these tables, we see that for all clusters the two most important dissimilarity matrices were those based on the petal length and the petal width. However, for cluster one (which corresponds to the Iris versicolour plants), the petal width was more important than the petal length, while in clusters two and three (Iris setosa and Iris virginica plants, respectively) the most important aspect was the petal length, in the case of cluster two by a large margin.

² In [7], different configurations of the MRDCA algorithm are proposed. The results shown hereinafter used the configuration MRDCA-RWL with prototypes of cardinality one, the counterpart to our algorithm based on a linear combination of the dissimilarity matrices.
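The dispersion normalization used here (and again in the phoneme experiment of the next example) can be written in a few lines; the sketch below assumes, as in the earlier sketches, that the p matrices are stored in a p × n × n NumPy array.

```python
import numpy as np

def normalize_by_dispersion(D):
    """Divide each dissimilarity matrix by its overall dispersion Y = sum_i d(e_i, g),
    where g is the item minimizing the total dissimilarity to all items [4]."""
    out = []
    for Dj in np.asarray(D, dtype=float):        # D has shape (p, n, n)
        g = int(np.argmin(Dj.sum(axis=0)))       # overall prototype of matrix D_j
        Y = Dj[:, g].sum()                       # overall dispersion
        out.append(Dj / Y)
    return np.stack(out)
```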

4.3. Third example: phoneme data set

Goal of this example: compare the results given by WRDCA with those obtained by CARD and MRDCA using the same multicriteria modeling. This data set is harder to cluster than the previous examples.

This data set is used in [16] and is available at the website of the book (http://www-stat.stanford.edu/~tibs/ElemStatLearn/). Here we use a subset of the data set (the same used in [7]). It consists of five phonemes (classes): "sh", "iy", "dcl", "aa" and "ao". Each one of the five classes has 400 instances (objects). Each object corresponds to a time trajectory and is described as (x_i, y_i), i = 1, …, n, where y_i gives the class membership (phoneme), whereas x_i = (x_i(t_1), …, x_i(t_150)) is the i-th discretized functional datum, corresponding to the discretized log-periodograms. From this subset of the original data set, two additional data sets, corresponding to the velocity and acceleration of the discretized log-periodograms, were initially obtained in [7]. Then, three relational data tables were created from these three data sets (position, velocity and acceleration of the discretized log-periodograms) through the application of the squared Euclidean distance. Afterwards, all dissimilarity matrices were normalized according to their dispersion to have the same dynamic range [4]. Then CARD and MRDCA were applied to the three relational data tables and their results compared. Here we applied WRDCA to the same data tables.

Table 5 shows the obtained results. For this data set, WRDCA obtained the best performance, followed by MRDCA and CARD. For WRDCA we had μT = 15.06 and sT = 5.48. Table 6 gives the vectors of relevance weights (locally for each cluster and dissimilarity matrix) according to the best result (selected by using only the internal criterion Γ). For all clusters, the position dissimilarity matrix has the highest relevance weight. Table 7 gives the confusion matrix for this result.



Table 1. Synthetic toy data: results.
Algorithm   CR       F-measure
MRDCA       0.6564   0.7845
WRDCA       0.8944   0.9548

Table 2. Iris: results.
Algorithm   CR       F-measure
CARD        0.8856   0.9599
NERF        0.7294   0.8922
MRDCA       0.8680   0.9533
WRDCA       0.8856   0.9599

Table 3. Iris: vectors of relevance weights (WRDCA).
Criterion      Cluster 1   Cluster 2   Cluster 3
Sepal length   0.0891      0.0592      0.1146
Sepal width    0.0765      0.0191      0.1107
Petal length   0.3873      0.6432      0.4152
Petal width    0.4472      0.2786      0.3596

Table 4. Iris: confusion matrix (WRDCA).
Cluster   Iris setosa   Iris versicolour   Iris virginica
1         0             48                 4
2         50            0                  0
3         0             2                  46

Table 5. Phoneme: results.
Algorithm   CR       F-measure
CARD        0.1922   0.4853
MRDCA       0.5418   0.7435
WRDCA       0.6110   0.7920

Table 6. Phoneme: vectors of relevance weights (WRDCA).
Criterion      C1       C2       C3       C4       C5
Position       0.7121   0.4959   0.5163   0.6003   0.6943
Velocity       0.1483   0.2490   0.2561   0.2071   0.1592
Acceleration   0.1396   0.2552   0.2276   0.1926   0.1465

Table 7. Phoneme: confusion matrix (WRDCA).
Cluster   1-sh   2-iy   3-dcl   4-aa   5-ao
1         0      12     0       285    92
2         12     236    24      0      2
3         0      0      376     0      1
4         388    142    0       0      0
5         0      10     0       115    305

4.4. Fourth example: decision over R&D projects

Goal of this example: compare the results given by WRDCA with those obtained by a recent clustering algorithm from the multicriteria literature [11], based on an approach completely different from WRDCA. Here we also demonstrate the use of global weights and a type of constraint that can be easily incorporated in the model.

In [11], the behavior of their heuristic clustering algorithm is tested using a real data set coming from a decision problem where a decision maker attributed an aggregate global impact to each one of 81 candidate R&D projects. The description attributes (criteria) involved four dimensions: economic, social, scientific, and improvement of research competence/formation of human resources. Each one of the four criteria was expressed as an integer function with domain on the interval [1, 7] (higher values are better). It was expected that the human decision maker could attribute the aggregate global impact of each project to one of eight categories: {Exceptional, Very High, High, Above Average, Average, Below Average, Low, Very Low}. However, the decision maker was unable to decide between two categories for some projects, and therefore twelve categories were used: {Exceptional, Very High, High or Very High, High, Above Average or High, Above Average, Average or Above Average, Average, Below Average, Low or Below Average, Low, Very Low}.

As we mentioned before (in Section 3.2), using a multicriteria clustering algorithm such as ours to tackle this problem calls for global weights. Moreover, according to [11], "the decision maker considered that the four criteria have similar importance", which we can translate into a constraint over the weights. Therefore, in this example we used the version of WRDCA with global weights, and we introduced the constraint "the absolute difference of any two weights for the criteria should be at most 0.1".

To determine the number of clusters for our algorithm, we used a V-fold cross-validation approach (cf. [18], p. 123: finding the right number of clusters in k-means and EM clustering: V-fold cross-validation). To this end, we first randomly generate a well-balanced partition, in terms of the cardinality of its sets, Ψ = {Ψ_1, …, Ψ_V} of the elements to cluster E = {e_1, …, e_n}. If possible, we may use |Ψ_v| = n/V, ∀v ∈ {1, …, V}. Then, for each candidate value of K, we compute

Θ(K) = ∑_{v=1}^{V} ∑_{e_i ∈ Ψ_v} min_{k=1,…,K} max_{j=1,…,p} λ_{jk} d_j(e_i, g_k)   (17)

where the relevance matrix λ and the cluster prototypes g_k come from the result of running WRDCA on the set of elements E\Ψ_v with K clusters. These Θ(K) values correspond to the adequacy criterion for each K, considering the elements of the test folds Ψ_v assigned, among the clusters generated from the E\Ψ_v elements, to the one with the nearest prototype (compare with Eq. (2)). When we use |Ψ_v| > 1 it is customary to make multiple runs and take the average (as we could have run into a bad partition). In this example, we used K = 1, …, 25 and V = 9 (giving nine parts of nine elements each, as n = 81), and we ran the nine-fold cross-validation 30 times, taking the average Θ(K) over the 30 runs. Also, as we are here in the context of global weights, we have λ_{jk} = λ_j ∀k ∈ {1, …, K}. From this sequence of points we generate a graph of the "second-order differences" (cf. [5,18]) of Θ(K): Θ(K−1) + Θ(K+1) − 2Θ(K), K = 2, …, 24, and we consider its peaks. Fig. 4 shows the second-order differences of Θ(K) (solid line, left y-axis) as well as the obtained Θ(K) values with 1-standard-deviation error bars (dashed line, right y-axis). The most important peak occurs at K = 5.
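A compact sketch of this model-selection procedure, computing Θ(K) on held-out folds (Eq. (17)) and then its second-order differences, is given below; the helper `run_wrdca` is a hypothetical stand-in for a full WRDCA implementation returning the prototypes and weights learned on the training items.

```python
import numpy as np

def theta(D, folds, run_wrdca, K):
    """Theta(K) of Eq. (17): train on E \\ Psi_v, then score the held-out items of
    Psi_v against the nearest learned prototype (weighted Tchebycheff distance)."""
    total, all_items = 0.0, set(range(D.shape[1]))
    for fold in folds:
        train = sorted(all_items - set(fold))
        prototypes, weights = run_wrdca(D, train, K)   # hypothetical helper
        for i in fold:
            total += min(float(np.max(weights[k] * D[:, i, g_k]))
                         for k, g_k in enumerate(prototypes))
    return total

def second_order_differences(theta_by_k):
    """Peaks of Theta(K-1) + Theta(K+1) - 2*Theta(K) suggest candidate values of K.
    `theta_by_k` maps consecutive K values to the (averaged) Theta(K)."""
    ks = sorted(theta_by_k)
    return {k: theta_by_k[k - 1] + theta_by_k[k + 1] - 2 * theta_by_k[k]
            for k in ks[1:-1]}
```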


Using K = 5, we rerun WRDCA-global with the complete data set. The clusters obtained are shown in Table 8, where they are presented in the same format as in [11]. We can identify a high coincidence between the prototype of each cluster and its elements. Looking at the confusion matrix shown in Table 9, we see that the first cluster includes all elements that were classified by the decision maker as Below Average or worse. Its prototype is an element classified as Average, and this cluster includes the only element that was classified as Average or Above Average. It also includes some elements from Above Average to High, but no element classified as better than High. Cluster 2 is clearly a cluster of elements that were classified as High. It contains no elements worse than Above Average, and also includes some elements that were classified as Very High and one Exceptional element. Cluster 3 is also well defined: it is mostly composed of elements that were classified as Very High. Its worst elements were classified as High, and it also contains two elements that were considered Exceptional by the decision maker. Cluster 4 is composed exclusively of elements considered Exceptional by the decision maker. Cluster 5 has a prototype that was considered Above Average or High, and it includes half of the elements from class Average (the other half being in Cluster 1). Elements from the Average class are the most frequent in this cluster. Cluster 5 also contains the majority of the elements from class Above Average or High, some elements from class High, as well as one element from class Very High. It contains no element worse than Average. It is interesting to note that the Average class was split into two clusters, with half of its elements put in Cluster 1 (which contains elements at most High) and the other half going to Cluster 5 (which contains elements no worse than Average and up to Very High).

We then quantify the quality of the obtained clusters w.r.t. the classification given by the decision maker by calculating the F-measure and Corrected Rand (CR) indices. In order to compare with the results obtained in [11] (their heuristic algorithm obtained 21 clusters for this data set), we also computed the F-measure and CR indices for their results, published in [11] (no index is used there to measure the quality of the obtained results). Moreover, as their algorithm gives an incomplete partition (i.e., there remain objects that are not attributed to any cluster), when computing the CR and F-measure for their results we considered their algorithm as giving 22 clusters (the 22nd cluster being the one with all objects that were not attributed to any cluster). As the computation of the recommended number of clusters for our algorithm also showed K = 7, 10, 12, 14, 16, 19 and 21 as minor peaks, we likewise show the F-measure and CR indices calculated for these other configurations in Table 10. This table also shows the μT and sT obtained for each K. We see that


WRDCA-global obtained better results according to the CR and F-measure indices than Fernandez et al. [11], except in comparison with K = 21 (both indices) and K = 19 (only for the F-measure). In Table 11 we show the weights obtained for each of the configurations (K = 5, 7, 10, 12, 14, 16, 19, 21). Notice that they respect the imposed constraint that the absolute difference of any two weights should be at most 0.1, in order to reflect the opinion of the decision maker that the criteria should have similar weights. However, in the best configuration (K = 5) the criteria were not given equal weights: in this configuration, the scientific criterion was given higher importance than the others.

Table 9. Decision over R&D projects: confusion matrix (WRDCA-global, K = 5). Rows are clusters; columns are the a priori classes given by the decision maker (VL = Very Low, L = Low, BAvg = Below Average, Avg = Average, AAvg = Above Average, H = High, VH = Very High, El = Exceptional).

Cluster   VL   L   L or BAvg   BAvg   Avg   Avg or AAvg   AAvg   AAvg or H   H    H or VH   VH   El
1         2    2   1           1      5     1             1      1           4    0         0    0
2         0    0   0           0      0     0             1      0           14   0         9    1
3         0    0   0           0      0     0             0      0           2    1         18   2
4         0    0   0           0      0     0             0      0           0    0         0    3
5         0    0   0           0      5     0             0      2           4    0         1    0

Table 10. Decision over R&D projects: results.
Algorithm                CR       F-measure   μT     sT
WRDCA-global (K=5)       0.2569   0.5460      2.49   0.64
WRDCA-global (K=7)       0.1600   0.4416      2.77   0.81
WRDCA-global (K=10)      0.1209   0.4173      2.72   0.78
WRDCA-global (K=12)      0.0895   0.3930      2.89   0.75
WRDCA-global (K=14)      0.0762   0.3857      2.84   0.79
WRDCA-global (K=16)      0.1242   0.4274      2.93   0.74
WRDCA-global (K=19)      0.0658   0.3558      2.95   0.95
WRDCA-global (K=21)      0.0371   0.3259      3.00   0.74
Fernandez et al. [11]    0.0513   0.3659      n/a    n/a

0.1

Θ(K)

Second-order differences

0.25

3391

0.05

6.5

0

Table 11 Decision over R&D projects: vectors of relevance weights (WRDCA-global).

6

-0.05

Criterion

-0.1

5.5

-0.15 -0.2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

5

Number of clusters (K) Fig. 4. Decision over R&D projects: second-order differences for ΘðKÞ.

Economic Social Scientific HR

Global weights by configuration (K) 5

7

10

12

14

16

19

21

0.225 0.225 0.325 0.225

0.267 0.278 0.278 0.178

0.225 0.225 0.325 0.225

0.175 0.275 0.275 0.275

0.250 0.250 0.250 0.250

0.250 0.250 0.250 0.250

0.250 0.250 0.250 0.250

0.250 0.250 0.250 0.250

Table 8. Decision over R&D projects: results of the clustering process (WRDCA-global, K = 5).
Cluster 1 (center: 7): objects 6, 7, 8, 9, 25, 26, 27, 28, 32, 38, 40, 41, 45, 46, 50, 52, 61, 77; classification by its prototype: Average; classification by plurality: Average.
Cluster 2 (center: 49): objects 10, 13, 14, 17, 18, 21, 22, 30, 37, 39, 43, 44, 49, 53, 54, 56, 58, 62, 65, 67, 68, 71, 76, 78, 79; classification by its prototype: High; classification by plurality: High.
Cluster 3 (center: 48): objects 12, 15, 16, 19, 20, 23, 29, 31, 33, 42, 47, 48, 51, 57, 59, 64, 66, 69, 72, 73, 74, 75, 81; classification by its prototype: Very High; classification by plurality: Very High.
Cluster 4 (center: 2): objects 1, 2, 63; classification by its prototype: Exceptional; classification by plurality: Exceptional.
Cluster 5 (center: 80): objects 3, 4, 5, 11, 24, 34, 35, 36, 55, 60, 70, 80; classification by its prototype: Above Average or High; classification by plurality: Average.


Table 12. L.A. Times/KTLA poll: subgroups of the population (group name and description).
ALL: all respondents.
REG: all registered voters.
18-34, 35-44, etc.: age groups.
WHITE, NON/WHT: White non-Hispanic; all others, including Latino/Hispanic.
EVAN, N/EVAN: self-described Evangelicals; all others.
AF/DEM, AF/IND, AF/REP: affiliated Democrats, Independents and Republicans.
DEM, IND, REP: registered Democrats, Independents and Republicans.
LIB, MOD, CONS: self-described liberals, moderates and conservatives.
<COL, DEG+: no college degree; undergraduate degree or more.
DK/GL, KN/GL: do not know any gays or lesbians; know someone who is gay or lesbian.

Table 13. L.A. Times/KTLA poll: considered questions (answer type in parentheses).
Q8 (A): How closely have you been following the news about the California Supreme Court's decision on same-sex marriage? Have you been following it very closely, not too closely, or not closely at all?
Q9 (B): As you may know, last week the California Supreme Court ruled that the California Constitution requires that same-sex couples be given the same right to marry that opposite-sex couples have. Based on what you know, do you approve or disapprove of the Court's decision to allow same-sex marriage in California? Do you strongly or only somewhat (approve/disapprove)?
Q10 (C): As you may also know, a proposed amendment to the state's constitution may appear on the November ballot which would reverse the Supreme Court's decision and reinstate a ban on same-sex marriage. The amendment would state that marriage is only between a man and a woman. If the November election were held today, would you vote for or against the amendment to make marriage only between a man and a woman?
Q11 (D): As you may know, Governor Schwarzenegger said he will respect the Supreme Court's decision on same-sex marriage. He also said he will not support a ballot measure to amend the constitution to define marriage as only between a man and a woman. Do you agree or disagree with the governor's decision to respect the Supreme Court ruling and not support a ballot initiative? Do you strongly or only somewhat (agree/disagree) with the governor's decision?
Q12 (E): Do you think the debate about same-sex marriage is the most important issue facing the state, or an important issue but not the most important one, or is it not an important issue at all?
Q14 (F): Regardless of your opinion about same-sex marriage, do you think legal recognition of it is inevitable, or not?
Q15 (G): Do you personally believe that same-sex relationships between consenting adults are morally wrong or is that not a moral issue?
Q16 (D): "As long as two people are in love and are committed to each other it does not matter if they are a same-sex couple or a heterosexual couple." Do you agree or disagree with this statement? Do you (agree/disagree) strongly or only somewhat?
Q17 (D): "If gays are allowed to marry, the institution of marriage will be degraded." Do you agree or disagree with this statement? Do you (agree/disagree) strongly or only somewhat?
Q18 (H): Do you have a friend, family member or co-worker who you know is gay or lesbian?

Table 14. L.A. Times/KTLA poll: modalities for each answer type.
A: Very closely; Somewhat closely; Not too closely; Not closely at all; D/Know.
B: Strongly approve; Somewhat approve; Somewhat disapprove; Strongly disapprove; D/Know.
C: Vote yes; Lean yes; Lean no; Vote no; Would not vote; D/Know.
D: Strongly agree; Somewhat agree; Somewhat disagree; Strongly disagree; D/Know.
E: Most important issue; Important, not most; Not important; D/Know.
F: Yes, inevitable; No, not inevitable; Do not know; Refused.
G: Morally wrong; Not a moral issue; Do not know.
H: Do not know anyone; Know someone; Do not know.

We highlight that the configuration recommended by the highest peak in the graph of second-order differences, K = 5, achieved better CR and F-measure indices w.r.t. the classification given by the decision maker than all other options for K using WRDCA-global. The coincidence of the best second-order difference with the best CR and F-measure obtained is worthy of notice, since the second-order difference is an internal index that uses no a priori information about which class the objects belong to (we are in the context of unsupervised methods), while the CR and F-measure compare the obtained clusters with the classes given by the decision maker. This indicates that, in this example, the assumption that the natural grouping of the data set (which the clustering algorithm is to find) is reflected by the class information holds to a certain degree. As discussed in [10], this assumption may encounter several problems in practice, since class labels do not necessarily correspond to natural clusters.

4.5. Fifth example: L.A. Times/KTLA poll

Goal of this example: illustrate the flexibility of WRDCA, which, being a clustering algorithm for relational data, is


compatible with any distance function used to measure the dissimilarity between objects. In this last example, we do not have a priori classes to evaluate the obtained clusters using the external Corrected Rand (CR) and F-measure indices, we will use WRDCA in an exploratory way, in order to get insights from data. In May 2008, the US newspaper Los Angeles Times along with the TV station KTLA conducted a poll to assess the opinion of Californian adults on issues related to marriage between persons of the same sex [21]. The poll interviewed 834 adults. The results were reported not only considering the statistics for the population as a whole but also for specific subgroups, listed in Table 12. For each of the 10 questions considered here (Table 13), the responses of each of the 22 subgroups were reported, that is, which percentage of the subgroup chose each one of the possible modalities (Table 14) for the question. As WRDCA is based on relational dissimilarity data, we are free to use any adequate dissimilarity measure between objects (here subgroups of the population). To measure the relative dissimilarity between two subgroups ei and ej of the population w.r.t. a question q, we have used the affinity coefficient [3], which means representing each subgroup of the population by its list of histograms, each one of them corresponding to the answers given by the people in the subgroup to a question q. Each histogram is composed by a vector of m values, corresponding to the frequencies obtained by each of the modalities of q. The dissimilarity between subgroup ei and ej according to q is then given by sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi q q m f l ðei Þ f l ðej Þ  q ; dq ðei ; ej Þ ¼ 1− ∑ q f ðei Þ f ðej Þ l¼1 q

where f_l^q(\cdot) is the frequency of modality l of question q for the group and f^q(\cdot) = \sum_{l=1}^{m} f_l^q(\cdot).
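A minimal sketch of this dissimilarity, assuming each subgroup's answers to a given question are available as a vector of modality frequencies (the function and variable names are illustrative, not taken from the paper):

```python
import math

def affinity_dissimilarity(hist_i, hist_j):
    """Dissimilarity between two subgroups for one question, based on the
    affinity coefficient between their answer histograms [3]."""
    total_i = sum(hist_i)
    total_j = sum(hist_j)
    affinity = sum(math.sqrt((fi / total_i) * (fj / total_j))
                   for fi, fj in zip(hist_i, hist_j))
    return 1.0 - affinity

# Toy usage: answer frequencies (in %) of two subgroups for the same question
print(affinity_dissimilarity([54, 35, 6, 3, 2], [35, 30, 20, 10, 5]))
```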


In order to determine the number of groups, we used the same V-fold cross-validation approach as in the previous example (cf. Eq. (17)), but this time with a leave-one-out (jackknife) approach (as the number of objects is small) and K = 1, …, 10. As in this case (leave-one-out) we only have one possible partition Ψ (where each Ψ_v is a singleton set), we do not need to execute for multiple partitions and take the average. Fig. 5 shows that the graph of second-order differences for Θ(K) has its peak at K = 4. Table 15 shows the results obtained for K = 4 when we rerun WRDCA with the complete data set (for this configuration we had μ_T = 2.98 and s_T = 1.28). It is interesting to note that, by looking at the clusters formed (Table 15) and the weight of each question in each cluster (Table 16), we can extract a set of useful pieces of information from the results. For instance, the cluster formed by the most heterogeneous subgroups, cluster 1, has questions 12 and 14 with the two greatest weights. Note that these questions are mostly non-polemical, as groups with very different opinions on same-sex marriage can answer these particular questions in the same way. Another interesting observation is that conservatives, evangelicals and affiliated Republicans were put together mostly because they are very alike on question 18 (looking at the data, it is less common for them to have someone in their entourage who is gay or lesbian). We can also see that cluster 4 agrees mostly on the controversial question (considering the whole population) Q15. This is a more liberal cluster, formed by people who describe themselves as moderates, those with an undergraduate degree or more, people in the age group 35–44, those who know gays/lesbians, and independents.
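A small sketch of how the number of clusters can be selected from such a curve: given the criterion values Θ(K), compute the second-order differences and take the K where they peak. The sign convention below (Θ(K-1) - 2Θ(K) + Θ(K+1)) is one common form and is an assumption here, as are the function name and the made-up criterion values.

```python
def best_k_by_second_order_difference(theta):
    """theta: dict mapping K -> criterion value Θ(K) for consecutive K.
    Returns the K maximizing Θ(K-1) - 2*Θ(K) + Θ(K+1) (assumed convention);
    this difference is defined only for interior values of K."""
    ks = sorted(theta)
    second_diff = {k: theta[k - 1] - 2 * theta[k] + theta[k + 1]
                   for k in ks[1:-1]}
    return max(second_diff, key=second_diff.get)

# Toy usage with made-up criterion values for K = 1..10
theta = {1: 1.00, 2: 0.80, 3: 0.65, 4: 0.55, 5: 0.52, 6: 0.50,
         7: 0.49, 8: 0.48, 9: 0.47, 10: 0.465}
print(best_k_by_second_order_difference(theta))  # peak at K = 4 for these values
```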

Fig. 5. L.A. Times/KTLA poll: second-order differences for Θ(K) (x-axis: number of clusters K; y-axis: second-order difference).

Table 15
L.A. Times/KTLA poll: clusters obtained for K = 4.

Cluster   Center   Objects in this cluster
1         ALL      ALL, 45–64, 65, WHITE, NON/WHT, MALE, N/EVAN, REG, AF/DEM, LIB
2         <COL     18–34, <COL, DK/GL, FEMALE
3         AF/REP   CONS, EVAN, AF/REP
4         DEG+     MOD, DEG+, 35–44, KN/GL, AF/IND

Table 16
L.A. Times/KTLA poll: vectors of relevance weights (WRDCA).

Criterion   Weights by cluster
            Cluster 1   Cluster 2   Cluster 3   Cluster 4
Q8          0.0850      0.0398      0.0304      0.0127
Q9          0.0446      0.0567      0.0313      0.0158
Q10         0.0295      0.0376      0.0071      0.0211
Q11         0.0773      0.0863      0.0148      0.0145
Q12         0.2056      0.0731      0.0637      0.0177
Q14         0.1566      0.2176      0.0501      0.0548
Q15         0.0922      0.4468      0.0115      0.7786
Q16         0.0641      0.0173      0.0461      0.0545
Q17         0.0986      0.0219      0.0146      0.0178
Q18         0.1465      0.0028      0.7303      0.0125

5. Conclusion

We presented a new algorithm capable of partitioning sets of objects while simultaneously taking into consideration their relational descriptions given by multiple relational dissimilarity matrices. This algorithm is the first clustering method for relational data expressed in multiple dissimilarity matrices that does not use a linear aggregator to combine them. Instead, it is based on a nonlinear aggregation criterion, weighted Tchebycheff distances, which is richer than linear combinations (such as weighted averages).
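To make this contrast concrete, the sketch below compares a weighted Tchebycheff aggregation of per-matrix dissimilarities with a weighted average; the max-of-weighted-terms form used here is the standard weighted Tchebycheff aggregation and only approximates the criterion actually optimized by WRDCA, which is defined earlier in the paper.

```python
def weighted_tchebycheff(dissims, weights):
    """Nonlinear aggregation: standard weighted Tchebycheff form,
    max_j weights[j] * dissims[j].  Illustrative only; the exact WRDCA
    criterion is the one defined in the paper."""
    return max(w * d for w, d in zip(weights, dissims))

def weighted_average(dissims, weights):
    """Linear aggregation, for comparison."""
    return sum(w * d for w, d in zip(weights, dissims))

# One object compared with one prototype under two dissimilarity matrices
dissims = [0.9, 0.1]   # per-matrix dissimilarities
weights = [0.5, 0.5]   # relevance weights of the matrices
print(weighted_tchebycheff(dissims, weights))  # 0.45, driven by the worst criterion
print(weighted_average(dissims, weights))      # 0.50
```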



As the nonlinear criterion can explore more solutions than a linear one, it thus has the potential of finding better solutions. We presented two models: one that uses a local weight vector for the dissimilarity matrices in each cluster, and a global version that computes a single weight vector for all clusters. The use of linear programming in the algorithm also opens the way to trivially incorporating some types of constraints into the algorithm. In order to illustrate the use of the algorithm, various applications from different domains were shown, outlining the flexibility of the method. The results obtained suggest that this algorithm performs well, frequently outperforming other algorithms for multicriteria clustering. For future work, many directions seem compelling to explore. Examples include the possibility of incorporating other types of constraints (such as must-link and cannot-link) or fuzzy versions of the algorithm.

Conflict of interest

None declared.

Acknowledgments

The authors are grateful to the anonymous referees and the handling editor for their careful revision, valuable suggestions, and comments which improved this paper. The authors would also like to thank FACEPE (Research Agency from the State of Pernambuco, Brazil), CNPq (National Council for Scientific and Technological Development, Brazil) and INRIA (Institut National de Recherche en Informatique et en Automatique, France) for their financial support.

References

[1] D. Arthur, B. Manthey, H. Röglin, Smoothed analysis of the k-means method, Journal of the ACM 58 (2011) 19:1–19:31.
[2] D. Arthur, S. Vassilvitskii, How slow is the k-means method?, in: N. Amenta, O. Cheong (Eds.), Symposium on Computational Geometry, ACM, 2006, pp. 144–153.
[3] H. Bacelar-Nicolau, The affinity coefficient, in: H.H. Bock, E. Diday (Eds.), Analysis of Symbolic Data, Springer, Heidelberg, Germany, 2000, pp. 160–165.
[4] M. Chavent, J. Saracco, On central tendency and dispersion measures for intervals and hypercubes, Communications in Statistics—Theory and Methods 37 (2008) 1471–1482.

[5] A. Da Silva, Analyse de données évolutives: application aux données d'usage Web, Ph.D. Thesis, Université Paris-IX Dauphine, Paris, France, 2009.
[6] F.A.T. De Carvalho, M. Csernel, Y. Lechevallier, Clustering constrained symbolic data, Pattern Recognition Letters 30 (2009) 1037–1045.
[7] F.A.T. De Carvalho, Y. Lechevallier, F.M. De Melo, Partitioning hard clustering algorithms based on multiple dissimilarity matrices, Pattern Recognition 45 (2012) 447–464.
[8] Y. De Smet, L.M. Guzmán, Towards multicriteria clustering: an extension of the k-means algorithm, European Journal of Operational Research 158 (2004) 390–398.
[9] E. Diday, G. Govaert, Clustering analysis, in: K.S. Fu (Ed.), Digital Pattern Classification, Springer, Berlin, Germany, 1976, pp. 47–94.
[10] I. Färber, S. Günnemann, H.P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek, Using class-labels in evaluation of clusterings, in: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with KDD 2010, ACM, 2010.
[11] E. Fernandez, J. Navarro, S. Bernal, Handling multicriteria preferences in cluster analysis, European Journal of Operational Research 202 (2010) 819–827.
[12] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188.
[13] A. Frank, A. Asuncion, UCI Machine Learning Repository, 〈http://archive.ics.uci.edu/ml〉, 2010.
[14] H. Frigui, C. Hwanga, F.C.H. Rhee, Clustering and aggregation of relational data with applications to image database categorization, Pattern Recognition 40 (2007) 3053–3068.
[15] V. Ha, P. Haddawy, Similarity of personal preferences: theoretical foundations and empirical analysis, Artificial Intelligence 146 (2003) 149–173.
[16] T. Hastie, R. Tibshirani, J. Friedman, Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Verlag, New York, 2001.
[17] R.J. Hathaway, J.C. Bezdek, NERF c-means: non-Euclidean relational fuzzy clustering, Pattern Recognition 27 (1994) 429–437.
[18] T. Hill, P. Lewicki, Statistics: Methods and Applications, StatSoft, Inc., 2005.
[19] L. Hubert, P. Arabie, Comparing partitions, Journal of Classification 2 (1985) 193–218.
[20] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data, Wiley, New York, USA, 1990.
[21] L.A. Times, KTLA, California same sex marriage issues survey, 〈http://latimes.com/timespoll〉, 2008.
[22] Y. Lechevallier, Optimisation de quelques critères en classification automatique et application à l'étude des modifications des protéines sériques en pathologie clinique, Ph.D. Thesis, Université Paris-VI, Paris, France, 1974.
[23] G.W. Milligan, Clustering validation: results and implications for applied analysis, in: P. Arabie, L.J. Hubert, G. de Soete (Eds.), Clustering and Classification, River Edge, NJ, USA, 1996, pp. 341–375.
[24] C.J. van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.
[25] R. Steuer, E.U. Choo, An interactive weighted Tchebycheff procedure for multiple objective programming, Mathematical Programming 26 (1983) 326–344.
[26] A. Wierzbicki, On the completeness and constructiveness of parametric characterizations to vector optimization problems, OR Spektrum 8 (1986) 73–87.
[27] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (2005) 654–678.

Sergio Queiroz received the Dr. degree in Computer Science in 2008 from Université Pierre et Marie Curie (Paris-VI), France. Since 2010 he has been a Tenured Professor (Professor Adjunto) in the Center of Informatics at Universidade Federal de Pernambuco. His main research interests are multicriteria decision making and machine learning.

Francisco de A.T. de Carvalho received the Ph.D. degree in Computer Science in 1992 from Institut National de Recherche en Informatique et en Automatique (INRIA) and Université Paris-IX Dauphine, France. From 1992 to 1998, he was a Lecturer in the Statistics Department at Universidade Federal de Pernambuco, Brazil. He joined the Center of Informatics at Universidade Federal de Pernambuco in 1999, where he is currently a Full Professor. He has held visiting posts in several leading universities and research centers in Europe. With main research interests in symbolic data analysis, clustering analysis and statistical pattern recognition, he has authored over 190 technical papers in international journals and conferences. He served as Coordinator (2005–2009) of the post-graduate program in computer science of the CIn/UFPE. He has been involved in the program committees of many Brazilian and international conferences. He has also served as a reviewer for many international journals and conferences. He was elected to the council (2009–2013) of the International Association for Statistical Computing (IASC).

Yves Lechevallier joined INRIA in 1976, where he was engaged in the Clustering and Pattern Recognition project. Since 1988 he has been teaching clustering, neural networks and data mining at the University of Paris IX, CNAM and ENSAE. He specializes in mathematical statistics, applied statistics, data analysis and classification. His current research interests include clustering algorithms (dynamic clustering method, Kohonen maps, divisive clustering method), discrimination problems and decision tree methods, and building efficient neural networks via classification trees.