Dynamic clustering of interval data using a Wasserstein-based distance

Pattern Recognition Letters 29 (2008) 1648–1658

Antonio Irpino *, Rosanna Verde
Dipartimento di studi europei e mediterranei, Seconda Università degli Studi di Napoli, Caserta (CE), Italy

Article history: Received 27 April 2006; received in revised form 21 February 2008; available online 29 April 2008. Communicated by A. Fred.

Keywords: Interval data; Clustering; Wasserstein distance; Inertia

Abstract: Interval data describe statistical units by means of intervals of values, in cases where representation by a single value would be too reductive or inconsistent. In the present paper, we present a Wasserstein-based distance for interval data, and we show its interesting properties in the context of clustering techniques. We show that the proposed distance generalizes a wide set of distances proposed for interval data by different approaches or in different contexts of analysis. An application on real data illustrates the impact of using different metrics, including the proposed one, in a dynamic clustering algorithm. © 2008 Elsevier B.V. All rights reserved.

1. Introduction

The representation of data by means of intervals of values is becoming more and more frequent in different fields of application. Intervals appear as a way to describe the uncertainty affecting the observed values. Such uncertainty can be considered as the inability to obtain true values, arising from not knowing the model that regulates the phenomena. It can be the expression of three causes: randomness, vagueness or imprecision. Randomness is present when it is possible to hypothesize a probability distribution for the outcomes of an experiment, or when the observation is affected by an error component that is modeled as a random variable (e.g., Gaussian white noise). Vagueness arises when there is no clear fact of the matter as to whether a concept applies or not. Imprecision is related to the difficulty of measuring a phenomenon accurately. While randomness is strictly related to a probabilistic approach, vagueness and imprecision have been widely treated by using fuzzy set theory, as well as the interval algebra approach. The probabilistic, fuzzy and interval algebra approaches sometimes overlap in treating interval data. Many connections between interval algebra and fuzzy theory are presented in the literature, especially in the definition of dissimilarity measures to compare values affected by uncertainty and thus expressed by intervals. Some distances between intervals are based on a comparison of the domains of the membership function or on $\alpha$-cuts (Bezdek, 1981; Tran and Duckstein, 2002).

* Corresponding author. Tel.: +39 3287195399; fax: +39 081675009. E-mail addresses: [email protected] (A. Irpino), [email protected] (R. Verde). 0167-8655/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2008.04.008

Interval data have also been studied in symbolic data analysis (SDA) (Bock and Diday, 2000), a domain related to multivariate analysis, pattern recognition and artificial intelligence. In this framework, to take into account the variability and/or the uncertainty inherent in the data, variables can assume multiple values (bounded sets of real values, multi-categories, weight distributions), of which intervals are a particular case. As in classic multivariate data analysis, dissimilarities and distances between data play an important role. Just as several dissimilarities and distances have been defined in classic data analysis according to the aims of the analysis and the nature of the data, several proposals have been advanced for the analysis of interval data. In SDA, several dissimilarity measures have been proposed. Chavent and Lechevallier (2002) and Chavent et al. (2006) proposed Hausdorff $L_1$ distances, while De Carvalho et al. (2006) proposed $L_q$ distances and an adaptive $L_2$ version. It is worth observing that these measures are based essentially on the boundary values of the compared intervals. These distances have been mainly proposed as criterion functions in clustering algorithms that partition a set of interval data, whatever cluster structure can be assumed. In the present paper, we introduce a new metric based on the Wasserstein distance for the comparison of interval data. We show its properties, and we use it as the basis for the definition of criteria for a dynamic clustering algorithm. The choice is motivated by the fact that several proposals have already been introduced for dynamic clustering, so we are better able to compare the use of the new distance with others used in the recent literature. The paper is structured as follows: we present the general schema of the dynamic clustering algorithm in Section 2, which can be


considered a general schema for clustering complex data. In Section 3, we present the main dissimilarity (or distance) functions proposed in the fuzzy, symbolic data analysis and probabilistic contexts to compare intervals of real values. Then, in Section 4, we introduce a new metric based on the Wasserstein distance that respects all the classical properties of a distance and, being based on the quantile functions associated with the interval distributions, seems particularly able to keep the whole information contained in the intervals. In Section 5, we compare the reviewed distances, especially the Euclidean-based ones, as these allow us to define an inertia (variability) measure that respects the Huygens theorem of decomposition for clustered data. Further, the distances are compared on the basis of their capability to define a representative interval that minimizes the variability measure of a set of interval data. In Section 7, we use the Euclidean distances and the Hausdorff- and $L_1$-based ones on some datasets to compare the behavior of the different distances within the dynamic clustering algorithm. Section 8 gives some concluding remarks and perspectives.

2. Dynamic clustering of interval data

Clustering methods play a central role in allowing conceptual descriptions to be compared and clustered, in order to obtain typologies of concepts. The dynamic clustering algorithm (DCA) (Diday, 1971; Chavent et al., 2003) represents a general reference for unsupervised non-hierarchical iterative clustering algorithms, and it can be proven that DCA generalizes several partitive clustering methods such as the k-means and k-median algorithms. In particular, DCA simultaneously looks for the partition of the set of data and the representation of the clusters. The main innovation of the symbolic clustering approach is the definition of a way to represent the obtained clusters by means of prototypes (Chavent et al., 2003).
In the literature, several authors indicate two ways to compute prototypes. In the first approach (Verde and Lauro, 2000), the prototype of a cluster is an element having the same properties as the elements belonging to it. In this way, a cluster of intervals is described by a single prototypal interval, in the same way as a cluster of points is represented by its barycenter. In the second approach (Verde and Lauro, 2000; Michalski et al., 1981; Kodratoff and Bisson, 1992), the prototype of a cluster has to represent the whole variability of the clustered elements by means, for example, of a distribution on the domain of the descriptors.

An interval variable $X$ is a correspondence between a set $E$ of units and a set of closed intervals $[a, b]$, where $a \le b$ and $a, b \in \mathbb{R}$. A proximity measure $d$ is a non-negative function defined on each couple of elements of the space of descriptions of $E$, where the closer the individuals are, the lower the value assumed by $d$.

Let $E$ be a set of $n$ symbolic data described by $p$ interval variables $X_j$ ($j = 1, \ldots, p$). The general DCA looks for the partition $P^* \in P_k$ of $E$ in $k$ classes, among all the possible partitions $P_k$, and the vector $L^* \in L_k$ of $k$ prototypes representing the classes in $P^*$, such that the following fitting criterion $\Delta$ between $L$ and $P$ is minimized:

$$\Delta(P^*, L^*) = \min\{\Delta(P, L) \mid P \in P_k, L \in L_k\}. \quad (1)$$

Such a criterion is defined as the sum of dissimilarity or distance measures $d(x_i, G_h)$ of fitting between each object $x_i$ belonging to a class $C_h \in P$ and the class representation $G_h \in L$:

$$\Delta(P, L) = \sum_{h=1}^{k} \sum_{x_i \in C_h} d(x_i, G_h).$$

A prototype $G_h$ associated with a class $C_h$ is an element of the space of description of $E$, and it can be represented as a vector of intervals. The algorithm is initialized by generating $k$ random clusters or, alternatively, $k$ random prototypes. Generally, the criterion $\Delta(P, L)$ is based on an additive distance over the $p$ descriptors. In the following, we give an overview of the dissimilarities and distances proposed in the literature for interval data, focusing our attention on those that can give a solution to Eq. (1), i.e., on those metrics allowing the prototype of a set of intervals to be described as an interval itself.

3. A brief survey of the existing distances for interval data

According to symbolic data analysis, an interval variable $X$ is a correspondence between a set $E$ of units and a set of closed intervals $[a, b]$, where $a \le b$ and $a, b \in \mathbb{R}$. Without loss of generality, the same notation is used by the interval arithmetic approach and, with few modifications, by the fuzzy data analysis approach. Given $p$ interval variables, the interval description of the $i$th unit can be given using the vectorial notation $x_i = (x_i^1, \ldots, x_i^p)$, where $x_i^j = [a_i^j, b_i^j]$.

Let $A$ and $B$ be two intervals described, respectively, by $[a, b]$ and $[u, v]$. $d(A, B)$ can be considered a distance if the main properties that define a distance hold: (reflexivity) $d(A, A) = 0$; (symmetry) $d(A, B) = d(B, A)$; and (triangular inequality) $d(A, B) \le d(A, C) + d(C, B)$. Hereafter we present some of the most used distances for interval data, belonging to different families and referring to the several contexts where they have been proposed. The main properties of such measures are also underlined. We may group distances among interval data considering the different approaches that generated them: the feature extraction or component-wise approach, the fuzzy analysis approach and the symbolic data analysis approach.
Further, we can group the proposed distances into three main families: those following a component approach, where the distance is set up by combining different aspects of the comparison of two intervals (position, size, span and content); those following an extreme value approach, where the distance is computed considering only the bounds of the two intervals; and those following an extension approach, where the distance is an extension of distances defined between points.

Before introducing the most used measures of distance between two interval data, we recall the definition of two operators: the join and the meet. Given two multivariate intervals $a = (A_1, \ldots, A_p)$, where $A_j = [a_j^l, a_j^u]$, and $b = (B_1, \ldots, B_p)$, where $B_j = [b_j^l, b_j^u]$, the join operator (Ichino and Yaguchi, 1994) is defined as

$$c = (C_1, \ldots, C_p) = a \oplus b, \quad \text{where } C_j = [c_j^l, c_j^u] \text{ with } c_j^l = \min(a_j^l, b_j^l) \text{ and } c_j^u = \max(a_j^u, b_j^u).$$

The meet operator is defined as the intersection of the two interval data:

$$c = (C_1, \ldots, C_p) = a \otimes b, \quad \text{where } C_j = [c_j^l, c_j^u] = A_j \cap B_j.$$

Further, we introduce the De Carvalho (1994) potential of description of a multivariate interval datum which, in the case of interval data, corresponds to the well-known Lebesgue measure of a set:

$$\pi(a) = \prod_{j=1}^{p} l(A_j) = \prod_{j=1}^{p} (a_j^u - a_j^l).$$
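As an illustration (not part of the original paper), the join, meet and description-potential operators can be sketched in Python; encoding a multivariate interval as a list of (lower, upper) pairs is our own convention.

```python
# Illustrative sketch (not from the paper): join, meet and potential of
# description for multivariate interval data, with a box encoded as a
# list of (lower, upper) pairs.

def join(a, b):
    """Component-wise join: the smallest box containing both boxes."""
    return [(min(al, bl), max(au, bu)) for (al, au), (bl, bu) in zip(a, b)]

def meet(a, b):
    """Component-wise meet (intersection); None marks an empty component."""
    out = []
    for (al, au), (bl, bu) in zip(a, b):
        lo, hi = max(al, bl), min(au, bu)
        out.append((lo, hi) if lo <= hi else None)
    return out

def potential(a):
    """Potential of description: the Lebesgue measure (volume) of the box."""
    p = 1.0
    for al, au in a:
        p *= au - al
    return p

a = [(0, 2), (1, 3)]
b = [(1, 4), (2, 5)]
print(join(a, b))    # [(0, 4), (1, 5)]
print(meet(a, b))    # [(1, 2), (2, 3)]
print(potential(a))  # 4.0
```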


3.1. The component-wise approach

The following metrics are based on a feature extraction approach that emphasizes different aspects of the comparison of two (or more) intervals. We call this the "component-wise approach": the comparison of two intervals takes into account position, span, content and other aspects.

Gowda and Diday (1991): Considering two multivariate intervals $a$ and $b$ described by $p$ interval variables, Gowda and Diday (1991) proposed the following distance:

$$d(a, b) = \sum_{j=1}^{p} D(A_j, B_j).$$

The distance is the sum of three components,

$$D(A_j, B_j) = D_p(A_j, B_j) + D_s(A_j, B_j) + D_c(A_j, B_j): \quad (2)$$

a position component $D_p$ related to where the intervals are located along $\mathbb{R}$,

$$D_p(A_j, B_j) = \frac{|a_j^l - b_j^l|}{l(D_j)} \in [0, 1], \quad (3)$$

where $D_j$ is the domain of the $j$th interval variable (i.e., the join of all the intervals for the $j$th variable); a span component $D_s$ that compares the different spreads of the two intervals,

$$D_s(A_j, B_j) = \frac{|l(A_j) - l(B_j)|}{l(A_j \oplus B_j)} \in [0, 1]; \quad (4)$$

and a content component $D_c$ that takes into consideration how much the two intervals do not overlap, i.e., how much the two intervals do not have in common, normalized by the join of the two intervals:

$$D_c(A_j, B_j) = \frac{l(A_j) + l(B_j) - 2\,l(A_j \cap B_j)}{l(A_j \oplus B_j)} \in [0, 1]. \quad (5)$$

If we consider a set $E$ of $n$ elements, the minimization of Eq. (1) can be found only numerically, fixing the prototype interval as an interval having as minimum the median of the minimum bounds of the intervals belonging to the cluster, and varying the length of the prototypal interval.

Ichino and Yaguchi (1994): In 1994, Ichino and Yaguchi proposed a new distance measure where the comparisons are done for each descriptor by means of the following comparison function:

$$\phi(A_j, B_j) = l(A_j \oplus B_j) - l(A_j \cap B_j) + \gamma\,(2\,l(A_j \cap B_j) - l(A_j) - l(B_j)). \quad (6)$$

The distance combines the component related to the length of the join minus the length of the meet of the two intervals, integrated with the content component weighted by $\gamma$, where $0 \le \gamma \le 0.5$. If $\gamma = 0$, then $\phi(A_j, B_j) = l(A_j \oplus B_j) - l(A_j \cap B_j)$ simply compares the two intervals on the basis of the join minus the meet. If $\gamma = 0.5$, then

$$\phi(A_j, B_j) = l(A_j \oplus B_j) - \frac{l(A_j) + l(B_j)}{2}$$

compares the two intervals considering their join minus the average of the two lengths of the intervals. It is possible to prove that $\phi$ is a distance. More recently, De Carvalho and Diday (2000) proposed a version normalized by the join of the two intervals,

$$\upsilon(A_j, B_j) = \frac{\phi(A_j, B_j)}{l(A_j \oplus B_j)},$$

while Ichino and Yaguchi (1994) proposed a normalized version that takes as normalization parameter the length of the domain of the $j$th descriptor:

$$\psi(A_j, B_j) = \frac{\phi(A_j, B_j)}{l(D_j)}.$$

The distances computed for each descriptor are then aggregated by the following function:

$$d_q(a, b) = \left(\sum_{j=1}^{p} w_j\,(FC(A_j, B_j))^q\right)^{1/q}, \quad q \ge 1, \quad w_j > 0, \quad \sum_{j=1}^{p} w_j = 1, \quad FC: \phi \text{ or } \psi \text{ or } \upsilon. \quad (7)$$

If we consider a set $E$ of $n$ elements, the minimization of Eq. (1) cannot be determined analytically.

De Carvalho (1994): In 1994, De Carvalho proposed a family of metrics based on the following comparison function:

$$\phi(A_j, B_j) = \frac{1}{2}\left[\phi_c(A_j, B_j) + \phi_p(A_j, B_j)\right]. \quad (8)$$
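As a worked illustration (not part of the original paper), the Gowda–Diday components of Eqs. (2)–(5) and the Ichino–Yaguchi comparison function of Eq. (6) can be sketched for a single interval variable; intervals are (lower, upper) pairs, and the domain length passed to the position component is an input the caller must supply.

```python
# Illustrative sketch (not from the paper): Gowda-Diday components of
# Eqs. (2)-(5) and the Ichino-Yaguchi comparison function of Eq. (6)
# for a single interval variable.

def length(iv):
    return iv[1] - iv[0]

def join_len(A, B):
    return max(A[1], B[1]) - min(A[0], B[0])

def meet_len(A, B):
    return max(0.0, min(A[1], B[1]) - max(A[0], B[0]))

def gowda_diday(A, B, len_domain):
    Dp = abs(A[0] - B[0]) / len_domain                # position, Eq. (3)
    Ds = abs(length(A) - length(B)) / join_len(A, B)  # span, Eq. (4)
    Dc = (length(A) + length(B)
          - 2 * meet_len(A, B)) / join_len(A, B)      # content, Eq. (5)
    return Dp + Ds + Dc

def ichino_yaguchi(A, B, gamma=0.5):
    # phi(A,B) = l(join) - l(meet) + gamma*(2 l(meet) - l(A) - l(B)), Eq. (6)
    return (join_len(A, B) - meet_len(A, B)
            + gamma * (2 * meet_len(A, B) - length(A) - length(B)))

A, B = (0.0, 2.0), (1.0, 4.0)
print(gowda_diday(A, B, len_domain=10.0))  # ≈ 0.1 + 0.25 + 0.75 = 1.1
print(ichino_yaguchi(A, B, gamma=0.5))     # 4 - 1 + 0.5*(2 - 2 - 3) = 1.5
```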

It can be considered as the average of a content component and a position component. The content component measures the relationship between the common (agreement) and the uncommon (disagreement) parts of two intervals for each descriptor, on the basis of the lengths of the four intervals generated as in Table 1. De Carvalho proposed five different comparison functions in order to take into consideration the measures of agreement and disagreement; some of these have been proved to be metric measures, while others are only semi-metrics (i.e., dissimilarities). The five comparison functions for the content component are

$$s_c^1 = \frac{a}{a + b + c},$$

which can be considered a metric;

$$s_c^2 = \frac{2a}{2a + b + c},$$

which can be considered a dissimilarity;

$$s_c^3 = \frac{a}{a + 2(b + c)},$$

which can be considered a metric;

$$s_c^4 = \frac{1}{2}\left(\frac{a}{a + b} + \frac{a}{a + c}\right),$$

which can be considered a dissimilarity; and

$$s_c^5 = \frac{a}{\sqrt{(a + b)(a + c)}},$$

which can be considered a dissimilarity. The position component is proposed as the (normalized) length of the separation between two intervals, in the following way:

Table 1
Agreement and disagreement functions for the content component of the De Carvalho (1994) metric

              Agreement                   Disagreement                   Total
Agreement     a = l(A_j ∩ B_j)            b = l(A_j ∩ c(B_j))            l(A_j)
Disagreement  c = l(c(A_j) ∩ B_j)         d = l(c(A_j) ∩ c(B_j))         l(c(A_j))
Total         l(B_j)                      l(c(B_j))                      l(D_j)

c(A) is the complement of interval A.
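As an illustration (not part of the original paper), the agreement/disagreement lengths of Table 1 and the first content comparison function can be sketched as follows; intervals are (lower, upper) pairs, and complements are taken within the domain $D_j$, which is assumed to contain both intervals.

```python
# Illustrative sketch (not from the paper): the a, b, c, d lengths of
# Table 1 and the first content comparison function s1_c.

def overlap(lo1, hi1, lo2, hi2):
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))

def table1(A, B, D):
    a = overlap(*A, *B)            # agreement: l(A ∩ B)
    b = (A[1] - A[0]) - a          # l(A ∩ c(B))
    c = (B[1] - B[0]) - a          # l(c(A) ∩ B)
    d = (D[1] - D[0]) - a - b - c  # l(c(A) ∩ c(B))
    return a, b, c, d

def s1_c(A, B, D):
    a, b, c, _ = table1(A, B, D)
    return a / (a + b + c)

A, B, D = (0.0, 2.0), (1.0, 4.0), (0.0, 10.0)
print(table1(A, B, D))  # (1.0, 1.0, 2.0, 6.0)
print(s1_c(A, B, D))    # 0.25
```

Note that the row and column totals of Table 1 are recovered: a + b = l(A_j), a + c = l(B_j), and a + b + c + d = l(D_j).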


$$\phi_p(A_j, B_j) = \frac{l(c(A_j) \cap c(B_j) \cap (A_j \oplus B_j))}{l(A_j \oplus B_j)}, \quad (9)$$

which, for interval data, is

$$\phi_p(A_j, B_j) = \begin{cases} \dfrac{|\min(a_j^u, b_j^u) - \max(a_j^l, b_j^l)|}{\max(a_j^u, b_j^u) - \min(a_j^l, b_j^l)} & \text{if } A_j \cap B_j = \emptyset, \\[2mm] 0 & \text{otherwise.} \end{cases} \quad (10)$$

Finally, the distance is aggregated over the descriptors by means of the following formula:

$$d_q(a, b) = \left(\sum_{j=1}^{p} w_j\,(\phi(A_j, B_j))^q\right)^{1/q}, \quad q \ge 1. \quad (11)$$

Gowda and Ravi (1995): In 1995, Gowda and Ravi proposed a new metric combining a position and a size component:

$$d(a, b) = \sum_{j=1}^{p} D(A_j, B_j), \quad (12)$$

$$D(A_j, B_j) = D_p(A_j, B_j) + D_s(A_j, B_j). \quad (13)$$

The position component is defined as

$$D_p(A_j, B_j) = \cos\left[\left(1 - \frac{|a_j^l - b_j^l|}{l(D_j)}\right) \cdot 90^\circ\right], \quad (14)$$

where the role of the cosine function is that of standardizing the distances in order to compare the different measurement scales of the descriptors. The size component is defined as

$$D_s(A_j, B_j) = \cos\left[\frac{l(A_j) + l(B_j)}{2\,l(A_j \oplus B_j)} \cdot 90^\circ\right]. \quad (15)$$

De Carvalho (1998): In 1998, De Carvalho proposed an extension of the Ichino and Yaguchi (1994) distance based on the potential of description, as follows:

$$d^1(a, b) = \pi(a \oplus b) - \pi(a \otimes b) + \gamma\,(2\pi(a \otimes b) - \pi(a) - \pi(b)) \quad (16)$$

is a dissimilarity, where $0 \le \gamma \le 0.5$. De Carvalho also proposed the following normalizations:

$$d^2(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma\,(2\pi(a \otimes b) - \pi(a) - \pi(b))}{\pi(s_E)}, \quad \text{where } s_E = (D_1, \ldots, D_p), \quad (17)$$

is a dissimilarity, while

$$d^3(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma\,(2\pi(a \otimes b) - \pi(a) - \pi(b))}{\pi(a \oplus b)} \quad (18)$$

is a distance.

3.2. The fuzzy oriented approach

The fuzzy data analysis approach has supported the development of distance measures for interval data since, in several applications, fuzzy numbers are treated as interval data. We recall here some basic definitions.

Definition 1. A fuzzy number is a convex and normal fuzzy subset of $\mathbb{R}$, that is, a map (membership function) $a: \mathbb{R} \to [0, 1]$ with the following properties:

- $a(\lambda x + (1 - \lambda)y) \ge \inf[a(x), a(y)]$;
- $\exists x^*$ such that $a(x^*) = 1$;
- $\mathrm{Supp}(a)$ is a compact subset of $\mathbb{R}$.

It is easy to recognize that all the $\alpha$-cuts $\{a_\alpha \mid \alpha \in (0, 1]\}$ of a fuzzy number are intervals, as is the crisp subset $a_0 = \mathrm{Supp}(a)$. Then the $\alpha$-cut of a fuzzy number can be expressed as an interval:

$$a_\alpha = \{[a_\alpha^l, a_\alpha^u] \mid \alpha \in [0, 1]\}.$$

Bertoluzza et al. (1995): Considering two fuzzy numbers $a$ and $b$ with bounded support, Bertoluzza et al. (1995) defined a distance built on the squared distances $d^2(\alpha) = d^2(a_\alpha, b_\alpha)$ between corresponding $\alpha$-cuts, starting from the following definition.

Definition 2. Let $g$ be a normalized weight measure on $([0, 1], \mathcal{B}([0, 1]))$. The squared distance $d^2$ between two intervals $a = [a^l, a^u]$ and $b = [b^l, b^u]$ is given by

$$d_{Ber}^2(a, b) = \int_0^1 \left[t\,(a^l - b^l) + (1 - t)\,(a^u - b^u)\right]^2 dg(t).$$

The function $g$ has the same properties as a probability density. The authors restrict themselves to weight measures that are the sum of a term that is continuous with respect to the Lebesgue measure and of a finite weight distribution placed at $S$ points $t_1, \ldots, t_S$, that is,

$$dg = c(t)\,dt, \qquad c(t) = \gamma(t) + \sum_{s=1}^{S} \lambda_s\,\delta(t - t_s),$$

where $\gamma$ is the continuous part, $\delta$ is the Dirac function and $\lambda_s$ is the weight placed at $t_s$. The distance is rewritten as

$$d^2(a, b) = \int_0^1 \gamma(t)\left[t\,(a^l - b^l) + (1 - t)\,(a^u - b^u)\right]^2 dt + \sum_{s=1}^{S} \lambda_s\,[a_s - b_s]^2,$$

where $a_s = t_s a^l + (1 - t_s) a^u$, $b_s = t_s b^l + (1 - t_s) b^u$, and $c(t)$ has to satisfy the following properties:

$$c(t) \ge 0; \qquad \int_0^1 c(t)\,dt = 1; \qquad c(0) > 0; \qquad c(1) > 0; \qquad t_1 = 0,\ t_S = 1 \ \text{if } S > 1.$$

When no reason exists for preferring the left side of the interval to the right one, they imposed the supplementary condition $c(t) = c(1 - t)$.

The computation of the distance can be hard when complicated choices of $c(t)$ are made. The authors suggested using $\gamma(t) = 0$ and $S = 3$ (assigning $t_1 = 0$, $t_2 = 0.5$ and $t_3 = 1$) to obtain a sufficiently good measure of the distance. In this case, we have the weighted sum of the squared distances between the starting, central and ending points of the two intervals.

For every pair of fuzzy numbers $a$ and $b$ for which the function $d^2(\alpha) = d^2(a_\alpha, b_\alpha)$ is $\varphi$-integrable, the distance $d_{Ber}$ between $a$ and $b$ is defined by

$$d_{Ber}(a, b) = \sqrt{\int_0^1 [d(a_\alpha, b_\alpha)]^2\,d\varphi(\alpha)},$$

provided the integral exists. The function $\varphi$ is a suitable normalized weight function that looks like a probability density; moreover, the authors impose some conditions:

$$\varphi(\alpha) \ge 0; \qquad \alpha_1 < \alpha_2 \Rightarrow \varphi(\alpha_1) \le \varphi(\alpha_2); \qquad \int_0^1 \varphi(\alpha)\,d\alpha = 1.$$
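As a sketch of the discrete setting suggested above ($\gamma(t) = 0$, $S = 3$, equal weights $1/3$ at $t = 0, 0.5, 1$), the following hypothetical function (ours, not from the paper) computes the squared Bertoluzza distance between two intervals and checks it against the center–radius form it reduces to (see Section 5.1.1).

```python
# Illustrative sketch (not from the paper's code): squared Bertoluzza
# distance with the continuous part set to zero and three equally
# weighted points t = 0, 0.5, 1.

def bertoluzza_sq(a, b, ts=(0.0, 0.5, 1.0), weights=(1/3, 1/3, 1/3)):
    (al, au), (bl, bu) = a, b
    d2 = 0.0
    for t, w in zip(ts, weights):
        # convex combinations a_s = t*a^l + (1 - t)*a^u, likewise for b
        d2 += w * ((t * al + (1 - t) * au) - (t * bl + (1 - t) * bu)) ** 2
    return d2

a, b = (0.0, 2.0), (1.0, 5.0)  # centers 1 and 3, radii 1 and 2
print(bertoluzza_sq(a, b))     # ≈ 14/3
# center-radius form: (c_a - c_b)^2 + (2/3)(r_a - r_b)^2
print((1.0 - 3.0) ** 2 + (2 / 3) * (1.0 - 2.0) ** 2)  # ≈ 14/3 as well
```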


Coppi and D'Urso (2003): In several applications (D'Urso and Santoro, 2006), the most used class of fuzzy variable is the so-called symmetrical fuzzy variable. Usually, a symmetrical fuzzy variable is denoted by $\tilde{Y} = (m, l)$, where $m$ denotes the center and $l$ the left and right spread, with the following membership function:

$$\mu(x) = L\left(\frac{m - x}{l}\right), \qquad m - l \le x \le m + l \quad (l > 0),$$

where $L$ is a decreasing "shape" function from $\mathbb{R}^+$ to $[0, 1]$ with $L(0) = 1$; $L(x) < 1$ for all $x > 0$; $L(x) > 0$ for all $x < 1$; $L(1) = 0$; and $L(-x) = L(x)$. On the basis of the choice of $L$, different types of symmetrical fuzzy data can be defined; usually, the symmetric triangular, normal, parabolic and square root fuzzy variables are chosen. In particular, a squared Euclidean distance between a pair of fuzzy numbers $a = (m_a, r_a)$, where $m_a = 0.5(a^l + a^u)$ and $r_a = 0.5(a^u - a^l)$, and $b = (m_b, r_b)$, where $m_b = 0.5(b^l + b^u)$ and $r_b = 0.5(b^u - b^l)$, is defined by Coppi and D'Urso (2003) as

$$d_{CD}^2(k) = (m_a - m_b)^2 + [(m_a - k r_a) - (m_b - k r_b)]^2 + [(m_a + k r_a) - (m_b + k r_b)]^2 = 3(m_a - m_b)^2 + 2k^2 (r_a - r_b)^2, \quad (19)$$

where

$$k = \int_0^1 L^{-1}(t)\,dt.$$

The values of the parameter $k$ related to the shape function of the symmetric triangular, normal, parabolic and square root fuzzy data are, respectively, $\frac{1}{2}$, $\frac{\sqrt{\pi}}{2}$, $\frac{2}{3}$ and $\frac{1}{3}$.

Tran and Duckstein (2002): In the framework of fuzzy data analysis, Tran and Duckstein (2002) proposed the following "distance" between two intervals:

$$d_{TD}^2(A, B) = \int_{-1/2}^{1/2} \int_{-1/2}^{1/2} \left\{\left[\frac{a + b}{2} + x(b - a)\right] - \left[\frac{u + v}{2} + y(v - u)\right]\right\}^2 dx\,dy = \left[\frac{a + b}{2} - \frac{u + v}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b - a}{2}\right)^2 + \left(\frac{v - u}{2}\right)^2\right]. \quad (20)$$

In practice, they consider the expected value of the squared distance between all the points belonging to interval $A$ and all the points belonging to interval $B$. In their paper, they assert that it is a distance, but it is easy to observe that it does not satisfy the reflexivity property mentioned above: the distance of an interval from itself is equal to zero only if the interval is thin:

$$d_{TD}^2(A, A) = \left[\frac{a + b}{2} - \frac{a + b}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b - a}{2}\right)^2 + \left(\frac{b - a}{2}\right)^2\right] = \frac{2}{3}\left(\frac{b - a}{2}\right)^2 \ge 0. \quad (21)$$

3.3. The boundary approach

Hausdorff distance: The most common distance used for the comparison of two sets is the Hausdorff distance.¹ Considering two sets $A$ and $B$ of points of $\mathbb{R}^n$, and a distance $d(x, y)$ where $x \in A$ and $y \in B$, the Hausdorff distance is defined as follows:

$$d_H(A, B) = \max\left(\sup_{x \in A} \inf_{y \in B} d(x, y),\ \sup_{y \in B} \inf_{x \in A} d(x, y)\right). \quad (22)$$

If $d(x, y)$ is the $L_1$ city-block distance, then Chavent et al. (2002) proved that

$$d_H(A, B) = \max(|a - u|, |b - v|) = \left|\frac{a + b}{2} - \frac{u + v}{2}\right| + \left|\frac{b - a}{2} - \frac{v - u}{2}\right|. \quad (23)$$

$L_q$ distances between the bounds of intervals: A family of distances between intervals has been proposed by De Carvalho et al. (2006). Considering a set of interval data described in a space $\mathbb{R}^p$, the metric of norm $q$ is defined as

$$d_{L_q}(A, B) = \left(\sum_{j=1}^{p} |a_j - u_j|^q + |b_j - v_j|^q\right)^{1/q}; \quad (24)$$

they also showed that if the norm is $L_1$, then $d_{L_1} = d_H$ (in the $L_1$ norm). The same measure was extended (De Carvalho, 2007) to an adaptive one, in order to take into account the variability of the different clusters in a dynamical clustering process.

¹ The name refers to Felix Hausdorff, well known for the separability theorem on topological spaces at the end of the 19th century.

4. The Wasserstein distance

If $F$ and $G$ are the distribution functions of two random variables $f$ and $g$, respectively, the Wasserstein $L_2$ metric is defined as (Gibbs and Su, 2002)

$$d_{Wass}(F, G) := \left(\int_0^1 (F^{-1}(t) - G^{-1}(t))^2\,dt\right)^{1/2}, \quad (25)$$

where $F^{-1}$ and $G^{-1}$ are the quantile functions of the two distributions. Irpino and Romano (2007) also proved a general formulation of the Wasserstein distance: if $F$ and $G$ are the distribution functions of two random variables $f$ and $g$, respectively, with means $\mu_f$ and $\mu_g$ and standard deviations $\sigma_f$ and $\sigma_g$, the squared Wasserstein distance can be written as

$$d_{Wass}^2 = \underbrace{(\mu_f - \mu_g)^2}_{\text{Location}} + \underbrace{(\sigma_f - \sigma_g)^2}_{\text{Size}} + \underbrace{2\sigma_f \sigma_g (1 - \rho_{QQ}(F, G))}_{\text{Shape}}, \quad (26)$$

where

$$\rho_{QQ}(F, G) = \frac{\int_0^1 (F^{-1}(t) - \mu_f)(G^{-1}(t) - \mu_g)\,dt}{\sigma_f \sigma_g} = \frac{\int_0^1 F^{-1}(t)\,G^{-1}(t)\,dt - \mu_f \mu_g}{\sigma_f \sigma_g} \quad (27)$$

is the correlation of the quantiles of the two distributions as represented in a classical QQ plot. It is worth noting that $0 < \rho_{QQ} \le 1$, differently from the classical range of variation of the Bravais–Pearson $\rho$. This decomposition allows us to take into consideration three aspects in the comparison of distribution functions. The first aspect is related to the location: two distributions can differ in position, and this aspect is captured by the squared distance between the mean values of the two distributions. The second aspect is related to the different variability of the compared distributions, and is itself related to the different standard deviations and to the different shapes of the density functions. While the former sub-aspect is taken into account by the squared distance between the standard deviations, the latter sub-aspect is taken into consideration by the value of $\rho_{QQ}$. Indeed, $\rho_{QQ}$ is equal to one only if the two (standardized) distributions have the same shape.

If we suppose a uniform distribution of points, an interval of reals $x_i^j = [a_i^j, b_i^j]$ can be expressed as a function of the following type:

$$x_i^j(t_j) = a_i^j + t_j\,(b_i^j - a_i^j), \qquad 0 \le t_j \le 1. \quad (28)$$

If we consider a description of the interval by means of its midpoint $m_i^j$ and radius $r_i^j$, the same function can be rewritten as

$$x_i^j(t_j) = m_i^j + r_i^j\,(2t_j - 1), \qquad 0 \le t_j \le 1,$$

where

$$m_i^j = \frac{a_i^j + b_i^j}{2} \quad \text{and} \quad r_i^j = \frac{b_i^j - a_i^j}{2}.$$

Then, the squared Wasserstein distance between homologous points of two intervals $x_i^j$ and $x_{i'}^j$ (for the generic variable $X_j$) is defined as follows:

$$d_{Wass}^2(x_i^j, x_{i'}^j) = \int_0^1 \left[x_i^j(t_j) - x_{i'}^j(t_j)\right]^2 dt_j = \int_0^1 \left[(m_i^j - m_{i'}^j) + (r_i^j - r_{i'}^j)(2t_j - 1)\right]^2 dt_j = (m_i^j - m_{i'}^j)^2 + \frac{1}{3}(r_i^j - r_{i'}^j)^2. \quad (29)$$

It is a particular version of the probabilistic Wasserstein metric (Gibbs and Su, 2002) in $L_2$ or, as it is better known, Mallows' distance between two distributions, when each interval is supposed to be the support of a uniform distribution. In our context, such a distance is generalized to an $\mathbb{R}^p$ space (where $p$ is the number of interval variables). Let the description of the box $x_i(T)$ in $p$ dimensions be as follows:

$$\begin{cases} x_i^1(t_1) = m_i^1 + r_i^1\,(2t_1 - 1), \\ x_i^2(t_2) = m_i^2 + r_i^2\,(2t_2 - 1), \\ \quad\vdots \\ x_i^p(t_p) = m_i^p + r_i^p\,(2t_p - 1), \end{cases} \qquad 0 \le t_j \le 1, \quad j = 1, \ldots, p, \quad (30)$$

where $T = \{t_j \mid 0 \le t_j \le 1\}$. Hypothesizing independence among the variables, the generalization of the proposed distance to $\mathbb{R}^p$ is obtained as

$$d_{Wass}^2(x_i(T), x_{i'}(T)) = \sum_{j=1}^{p} \int_0^1 (x_i^j(t_j) - x_{i'}^j(t_j))^2\,dt_j = \sum_{j=1}^{p} \left[(m_i^j - m_{i'}^j)^2 + \frac{1}{3}(r_i^j - r_{i'}^j)^2\right]. \quad (31)$$
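Under the uniform-distribution assumption, the closed form of Eq. (29) can be checked numerically against the quantile-function definition of Eq. (25); here is a rough sketch (function names and the midpoint-rule check are ours, not from the paper).

```python
# Illustrative sketch (not from the paper's code): closed form of Eq. (29)
# for the squared Wasserstein distance between two intervals, plus a crude
# midpoint-rule check against the quantile definition of Eq. (25).

def wass_sq(c1, r1, c2, r2):
    """Squared L2 Wasserstein distance between [c1-r1, c1+r1] and [c2-r2, c2+r2]."""
    return (c1 - c2) ** 2 + (1 / 3) * (r1 - r2) ** 2

def wass_sq_numeric(c1, r1, c2, r2, n=20000):
    # For a uniform distribution on [c - r, c + r], the quantile function
    # is F^{-1}(t) = c + r(2t - 1); integrate its squared difference on [0, 1].
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h  # midpoint rule
        total += ((c1 + r1 * (2 * t - 1)) - (c2 + r2 * (2 * t - 1))) ** 2 * h
    return total

print(wass_sq(1.0, 1.0, 3.0, 2.0))          # ≈ 4.3333
print(wass_sq_numeric(1.0, 1.0, 3.0, 2.0))  # ≈ 4.3333
```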

5. Relationships among metrics and their ability to compute prototypes In this section, we do a comparison among the proposed metrics considering only those measures that can be considered, or have been considered by their authors, as distances. In particular, we show whether it is possible to identify a representative (prototype) element of a set of interval data as a solution of a minimization of an homogeneity criterion. A representative element of a set can be viewed as an interval itself. Given a set of n interval data described by p interval variables, as homogeneity criterion we consider the sum of a set of intervals with respect to a representative interval as follows: dðxij ; Gj Þ

ð32Þ

j¼1

or in the case of multivariate distances as n X dðxi ; GÞ: i¼1

1 1 1 ðxlij  xli0 j Þ2 þ ðcij  ci0 j Þ2 þ ðxuij  xui0 j Þ2 ; 3 3 3 2 2 2 2 dBer ðxij ; xi0 j Þ ¼ ðcij  ci0 j Þ þ ðrij  ri0 j Þ : 3 2

dBer ðxij ; xi0 j Þ ¼

ð34Þ

The representative is the interval Gj ¼ ðcj ; r j Þ that minimizes Eq. (32) 2 and considering d ¼ d . In this case, we obtain Gj ¼ ðcj ; rj Þ where cj ¼ n1

n X

cij

and

r j ¼ n1

i¼1

n X

r ij :

i¼1

5.1.2. Coppi and D’Urso Considering the distance 2

dCD ðkÞ ¼ 3ðam  bm Þ2 þ 2k2 ðal  bl Þ2 ; the values of the parameter k related to the shape function of the symmetric uniform (crisp), triangular, normal, parabolic and square pffiffi root fuzzy data are, respectively, 1 12, 2p, 23 and 13. Then, we obtain the following distances:

2

dCD ð1Þ ¼ 3ðcij  ci0 j Þ2 þ 2ðrij  ri0 j Þ2 :  Triangular:

1 1 2 ¼ 3ðcij  ci0 j Þ2 þ ðrij  r i0 j Þ2 : dCD 2 2  Normal: pffiffiffi

p p 2 dCD ¼ 3ðcij  ci0 j Þ2 þ ðrij  r i0 j Þ2 : 2 2  Parabolic:

2 8 2 ¼ 3ðcij  ci0 j Þ2 þ ðrij  r i0 j Þ2 : dCD 3 9

where T ¼ f8tj j0 6 t j 6 1g.

i¼1

5.1.1. Bertoluzza Let us start with the Bertoluzza distance considering cðtÞ ¼ 0 and S ¼ 3, obtaining

 Uniform (crisp):

Hypothesizing independency among the variables, the generalization of the proposed distance to Rp can be obtained as Z X p 2 dWass ðxi ðTÞ; xi0 ðTÞÞ ¼ ðxji ðtj Þ  xji0 ðt j ÞÞ2 dT

p n X X

We start rewriting interval data considering the center–radius notation as it simplifies notation. Given an interval xij ¼ ½xlij ; xuij , we rewrite it as xij ¼ ðcij ; r ij Þ where cij ¼ 0:5ðxlij þ xuij Þ and rij ¼ 0:5ðxuij  xlij Þ.

j

aji þ bi ; 2

0 6 t j 6 1;

5.1. Euclidean metrics: Bertoluzza, Coppi and D’Urso, Tran and Duckstein, L2 , Wasserstein

 Square root:

1 2 2 ¼ 3ðcij  ci0 j Þ2 þ ðrij  r i0 j Þ2 : dCD 3 9

ð35Þ

ð36Þ

ð37Þ

ð38Þ

ð39Þ

As in the Bertoluzza case, for each case, we obtain as represenP P tative Gj ¼ ðcj ; r j Þ where cj ¼ n1 ni¼1 cij and r j ¼ n1 ni¼1 rij . 5.1.3. Tran and Duckstein Tran and Duckstein (2002) ‘‘distance” have the following formulation: 

2 aþb u þ v 2  dTD ðA; BÞ ¼ 2 2 " #

2 1 ba v  u 2 þ þ ; ð40Þ 3 2 2 which can be rewritten as

ð33Þ

2

dTD ðxij ; xi0 j Þ ¼ ðcij  ci0 j Þ2 þ

1 ðr ij þ r i0 j Þ2 : 3

ð41Þ

1654

A. Irpino, R. Verde / Pattern Recognition Letters 29 (2008) 1648–1658

P The representative is obtained as Gj ¼ ðcj ; rj Þ where cj ¼ n1 ni¼1 cij and r j ¼ 0. This means that the prototypes are points and not intervals, or are all thin intervals. 2

5.1.4. L distance Considering a set of interval data described into a space Rp , the metric of norm 2 is defined as 2

dL2 ðxij ; xi0 j Þ ¼ jxlij  xli0 j j2 þ jxuij  xui0 j j2 ;

2

5.1.5. Wasserstein Wasserstein distance between two intervals, under the hypothesis of uniform distribution, can be computed as 1 2 dWass ðxij ; xi0 j Þ ¼ ðcij  ci0 j Þ2 þ ðrij  ri0 j Þ2 : 3

ð44Þ

Also in this case we obtain as representative Gj ¼ ðcj ; rj Þ, where P P cj ¼ n1 ni¼1 cij and r j ¼ n1 ni¼1 rij : 5.1.6. Comparison among Euclidean distances Wasserstein vs. Bertoluzza: It can be shown that Wasserstein distance is equal to the Bertoluzza, considering the ct function role we may consider the Bertoluzza as the application of the Wasserstein distance in the fuzzy approach. For example, if cðtÞ ¼ 0 and S ¼ 3 we may rewrite the Bertoluzza distance as ðF 1 ð0Þ  G1 ð0ÞÞ2 ðF 1 ð0:5Þ  G1 ð0:5ÞÞ2 þ 3 3 1 1 ðF ð1Þ  G ð1ÞÞ2 : þ 3

ð45Þ

Wasserstein vs. Coppi and D’Urso: In general, for uniform (crisp) fuzzy variables: 2

dCD ð1Þðxij ; xi0 j Þ ¼ 3dWass ðxij ; xi0 j Þ þ ðr ij  r i0 j Þ2 :

ð46Þ

Bertoluzza vs. L2 : The L2 distance between the bounds of the intervals is a particular case of the Bertoluzza distance where cðtÞ ¼ 0 and S ¼ 2 and then corresponds to the Wasserstein distance when the density is equally concentrated on the bounds: 2 2dBert2

¼

2 dL2 ðF; GÞ

¼ ðF

1

1

2

ð0Þ  G ð0ÞÞ þ ðF

1

1

2

ð1Þ  G ð1ÞÞ :

ð47Þ

All the proposed distances, except for the Bertoluzza, the Coppi–D’Urso and the Wasserstein distance, cannot take into consideration apriori information about how the uncertainty behaves inside the interval. While the Bertoluzza distance is the fuzzycounterpart of the Wasserstein distance, the Coppi and D’Urso distance gives a double emphasis to the range of symmetric fuzzy variables. Considering the center–radius representation of intervals, all the proposed distances let play a different role to the components related with the positions of intervals (the centers) and the component related to their sizes (the radii). From this point of view, we may compare them on the basis of the emphasis that each distance gives to each component. In other words we may compare the distances considering the general formulation: 2

a

b

a aþb

b aþb

Bertoluzza Coppi D’Urso

cðtÞ ¼ 0 and T ¼ 3 k ¼ 1 uniform k ¼ 1=2 triangular pffiffiffi k ¼ p=2 normal k ¼ 2=3 parabolic k ¼ 1=3 square root q¼2 Uniform

1 3 3 3 3 3 1 1

2/3 2 1/2 p=2 8/9 2/9 1 1/3

0.600 0.600 0.857 0.656 0.771 0.931 0.500 0.750

0.400 0.400 0.143 0.344 0.229 0.069 0.500 0.250

ð43Þ

Also in this case we obtain as representative Gj ¼ ðcj ; rj Þ, where P P cj ¼ n1 ni¼1 cij and r j ¼ n1 ni¼1 rij .

2

Parameter

Lq Wasserstein

dL2 ðxij ; xi0 j Þ ¼ 2ðcij  ci0 j Þ2 þ 2ðr ij  r i0 j Þ2 :

2

Distance

ð42Þ

which can be rewritten as

dBer ðF; GÞ ¼

Table 2 Comparison among Euclidean distances for interval data: how each distance weights the position and the span component in the comparison of two intervals

d ðxij ; xi0 j Þ ¼ aðcij  ci0 j Þ2 þ bðrij  ri0 j Þ2 :

ð48Þ

We denote the relative weight assigned to the position component a a as aþb and the size component as aþb (Table 2).
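The shared structure of Eq. (48) can be made concrete in a short sketch; the $(a,b)$ pairs are those of Table 2 and Eqs. (43)–(44) (for $L_2$ we use the $(2,2)$ form of Eq. (43), which has the same relative weights), while the function and dictionary names are ours:

```python
# Sketch of the general formulation of Eq. (48): every Euclidean-type distance
# of Table 2 takes the form d^2 = a*(dc)^2 + b*(dr)^2 for some weights (a, b).

import math

WEIGHTS = {
    "Bertoluzza (c(t)=0, S=3)": (1.0, 2.0 / 3.0),
    "Coppi-D'Urso (k=1, uniform)": (3.0, 2.0),
    "Coppi-D'Urso (k=sqrt(pi)/2, normal)": (3.0, math.pi / 2.0),
    "Coppi-D'Urso (k=1/3, square root)": (3.0, 2.0 / 9.0),
    "L2 (q=2)": (2.0, 2.0),                    # Eq. (43)
    "Wasserstein (uniform)": (1.0, 1.0 / 3.0),  # Eq. (44)
}

def squared_distance(x, y, a, b):
    """d^2 between intervals x and y given in (center, radius) form."""
    (cx, rx), (cy, ry) = x, y
    return a * (cx - cy) ** 2 + b * (rx - ry) ** 2

x, y = (5.0, 3.0), (1.0, 1.0)  # two intervals: [2, 8] and [0, 2]
for name, (a, b) in WEIGHTS.items():
    print(f"{name:38s} d^2 = {squared_distance(x, y, a, b):8.4f}"
          f"  position weight = {a / (a + b):.3f}")
```

The printed position weights reproduce the $a/(a+b)$ column of Table 2.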

5.2. Hausdorff, $L_1$ and $L_\infty$

Considering two intervals, the Hausdorff distance is

$$d_H(x_{ij},x_{i'j})=\max\big(|x^l_{ij}-x^l_{i'j}|,\,|x^u_{ij}-x^u_{i'j}|\big)=|c_{ij}-c_{i'j}|+|r_{ij}-r_{i'j}|.\qquad(49)$$

While for the Euclidean distances we use their squared version as the measure $d$ in Eq. (32), with the Hausdorff, $L_1$ and $L_\infty$ distances we use their formulas directly:

$$d_{L_1}(x_{ij},x_{i'j})=|x^l_{ij}-x^l_{i'j}|+|x^u_{ij}-x^u_{i'j}|,\qquad(50)$$

while

$$d_{L_\infty}(x_{ij},x_{i'j})=\max\big(|x^l_{ij}-x^l_{i'j}|,\,|x^u_{ij}-x^u_{i'j}|\big)=d_H(x_{ij},x_{i'j}).\qquad(51)$$
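A minimal sketch (ours, not the authors' code) of these formulas, including a numerical check of the identity in Eq. (49):

```python
# The Section 5.2 distances between two intervals [xl, xu] and [yl, yu],
# with a check that, for intervals, the Hausdorff distance equals
# |dc| + |dr| (absolute differences of centers and of radii, Eq. (49)).

def d_hausdorff(xl, xu, yl, yu):
    return max(abs(xl - yl), abs(xu - yu))        # Eqs. (49) and (51)

def d_l1(xl, xu, yl, yu):
    return abs(xl - yl) + abs(xu - yu)            # Eq. (50)

xl, xu, yl, yu = 2.0, 8.0, 0.0, 2.0
dc = (xl + xu) / 2 - (yl + yu) / 2                # centers: 5.0 - 1.0
dr = (xu - xl) / 2 - (yu - yl) / 2                # radii:   3.0 - 1.0
assert d_hausdorff(xl, xu, yl, yu) == abs(dc) + abs(dr)   # 6.0 = 4.0 + 2.0
```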

To compute the representative of a set of $n$ interval data for the generic $j$th variable, it is possible to show that, in the case of the Hausdorff distance, the interval $G_j=(\bar c_j,\bar r_j)$ minimizing Eq. (32) with $d=d_H$ is obtained for $\bar c_j=\mathrm{Median}(c_{ij}\mid i=1,\dots,n)$ and $\bar r_j=\mathrm{Median}(r_{ij}\mid i=1,\dots,n)$. In the case of the $L_1$ distance between the bounds, by analogy, the minimizing interval is $G_j=[\bar x^l_j,\bar x^u_j]$ with $\bar x^l_j=\mathrm{Median}(x^l_{ij}\mid i=1,\dots,n)$ and $\bar x^u_j=\mathrm{Median}(x^u_{ij}\mid i=1,\dots,n)$. From a clustering perspective, it can be proved that, unlike Euclidean-type distances, these distances do not allow the definition of an inertia measure satisfying the Huygens theorem of decomposition of inertia.

5.3. Component-wise metrics

In general, it is difficult to cope with the join ($\oplus$) and meet ($\otimes$) operators in the definition of a prototype: the optimum of Eqs. (32) and (33) is not easy to express analytically. It is also difficult to compare these kinds of metrics with the Wasserstein one.

6. Wasserstein distance as a new allocation function in DCA

According to Chavent et al. (2003), the prototype of a class $C_h$ of the partition $P$ is defined as the vector $G_h$ of intervals that minimizes the following function:

$$f(G_h)=\sum_{x_i\in C_h}d^2_{Wass}(x_i,G_h)=\sum_{x_i\in C_h}\sum_{j=1}^{p}d^2_{Wass}\big(x^j_i(t_j),G^j_h(t_j)\big).\qquad(52)$$

As the criterion is additive, we may solve the optimization problem separately for each variable $Y_j$. Let $G^j_h=[a^j_h,b^j_h]$ be the interval prototype of the class $C_h$ for the $j$th variable, and $\mu^j_h$ and $\rho^j_h$ its midpoint and radius; we then have to solve the following minimization problem for each variable:


$$\tilde f(G^j_h)=\sum_{x_i\in C_h}d^2_{Wass}\big(x^j_i(t_j),G^j_h(t_j)\big)=\sum_{x_i\in C_h}\Big[(m^j_i-\mu^j_h)^2+\frac13(r^j_i-\rho^j_h)^2\Big].\qquad(53)$$

The minimum of the function is obtained when

$$\mu^j_h=|C_h|^{-1}\sum_{x_i\in C_h}m^j_i,\qquad \rho^j_h=|C_h|^{-1}\sum_{x_i\in C_h}r^j_i,\qquad(54)$$

where $|C_h|$ is the number of elements in the class $C_h$.

6.1. A useful property of the Euclidean distances

We prove that the Wasserstein distance can be considered as an inertia measure among data and that it satisfies the well-known Huygens theorem of decomposition of inertia. As the criterion is additive with respect to each variable, it is sufficient to prove the result for the generic variable $Y_j$. Considering the general prototype $G^j=[\mu^j-\rho^j,\mu^j+\rho^j]$ of all the intervals for $Y_j$, we may write the total inertia as

$$In=\sum_{i=1}^n d^2(x^j_i,G^j)=\sum_{i=1}^n(m^j_i-\mu^j)^2+\frac13\sum_{i=1}^n(r^j_i-\rho^j)^2.$$

According to (53), solving the minimization problem yields

$$\mu^j=n^{-1}\sum_{i=1}^n m^j_i,\qquad \rho^j=n^{-1}\sum_{i=1}^n r^j_i.$$

Considering also that

$$\mu^j=n^{-1}\sum_{h=1}^k|C_h|\mu^j_h,\qquad \rho^j=n^{-1}\sum_{h=1}^k|C_h|\rho^j_h,$$

and using a little algebra, it is possible to prove that

$$\underbrace{\sum_{i=1}^n(m^j_i-\mu^j)^2+\frac13\sum_{i=1}^n(r^j_i-\rho^j)^2}_{\text{total inertia }T_j(k)}
=\underbrace{\sum_{i=1}^n\sum_{h=1}^k(m^j_i-\mu^j_h)^2\,1_{ih}+\frac13\sum_{i=1}^n\sum_{h=1}^k(r^j_i-\rho^j_h)^2\,1_{ih}}_{\text{within inertia }W_j(k)}
+\underbrace{\sum_{h=1}^k|C_h|(\mu^j_h-\mu^j)^2+\frac13\sum_{h=1}^k|C_h|(\rho^j_h-\rho^j)^2}_{\text{between inertia }B_j(k)},$$

where $1_{ih}=1$ if $x_i\in C_h$ and $1_{ih}=0$ otherwise. The last result proves that the proposed distance satisfies the Huygens theorem for the decomposition of the total inertia into the sum of a within component and a between one:

$$\underbrace{\sum_{j=1}^p\sum_{i=1}^n d^2(x^j_i,G^j)}_{T(k)}=\underbrace{\sum_{j=1}^p\sum_{i=1}^n\sum_{h=1}^k d^2(x^j_i,G^j_h)\,1_{ih}}_{W(k)}+\underbrace{\sum_{j=1}^p\sum_{h=1}^k|C_h|\,d^2(G^j_h,G^j)}_{B(k)}.\qquad(55)$$

The use of the Wasserstein distance in DCA, as an allocation function and as a criterion to be minimized in order to identify the prototypes, thus permits the usual indexes based on the decomposition of inertia to be used to validate the clustering procedure. Without loss of generality, all the Euclidean-type distances share the same property.

7. A comparison on artificial and real data

In this section, we compare the presented distances on the basis of the results obtained using the DCA, i.e., on the clusters obtained when the different distances are used in the criterion minimized by the DCA. We recall that this criterion is the minimization of the intra-class sum of distances from the class prototype (which generalizes the within-cluster inertia); for this reason, the DCA can be considered a general approach to partitive clustering algorithms. Given $n$ points described in $\mathbb{R}^p$, the inertia of a set of points, in the case of the Euclidean distance, is

$$\sum_{i=1}^n\sum_{i'=1}^n d^2_E(x_i,x_{i'})=2n\sum_{i=1}^n\sum_{l=1}^p(x_{il}-\bar x_l)^2.\qquad(56)$$

By analogy, using Euclidean distances we may define an inertia between interval data as the quantity

$$In=\sum_{i=1}^n\sum_{i'=1}^n d^2(x_i,x_{i'})=2n\sum_{i=1}^n\sum_{l=1}^p d^2(x_{il},G_l),\qquad(57)$$

where $G_l$ is the barycenter (interval) for the $l$th interval variable. All the presented Euclidean-based distances decompose into two parts, the first related to the distances among the centers of the intervals and the second related to the distances among their radii; this allows us to observe the impact of using the different distances. Indeed, denoting $\Delta c_{ij}=|c_{ij}-c_{G_j}|$ and $\Delta r_{ij}=|r_{ij}-r_{G_j}|$, and taking into consideration the notation introduced in Eq. (48), we may rewrite the inertia (57) for Euclidean-based distances in the general form

$$In=2n\sum_{j=1}^p\Big[a\sum_{i=1}^n(\Delta c_{ij})^2+b\sum_{i=1}^n(\Delta r_{ij})^2\Big].\qquad(58)$$

It is immediate to hypothesize that choosing one distance instead of another can modify the weight of the inertia related to the centers or to the radii in the definition of the clusters.

We use a real dataset and three artificial ones. The real dataset is a climatic dataset containing the temperature intervals (min and max) observed in 60 Chinese stations over the 12 months of 1988 (Long-Term Instrumental Climatic Data Base of the People's Republic of China, http://dss.ucar.edu/dataset/ds578.5/data/); the same data were used by Chavent et al. (2003) for the DCA of interval data. The three artificial datasets contain 100 interval observations for two variables, randomly generated so that $\sum_{ij}(\Delta r_{ij})^2=\sum_{ij}(\Delta c_{ij})^2$ (Fig. 1), $\sum_{ij}(\Delta r_{ij})^2=0.5\sum_{ij}(\Delta c_{ij})^2$ (Fig. 2), and $\sum_{ij}(\Delta r_{ij})^2=2\sum_{ij}(\Delta c_{ij})^2$ (Fig. 3). This choice of datasets is motivated by the different variability of the data related to the size of the intervals. We avoid standardizing the data as a preprocessing step, because the three artificial datasets are described by variables with the same scale of measure and similar variability. The datasets show (see Table 3) different structures of the inertia related to the centers and to the radii of the observed data.

In the DCA, a central role is played by the initialization of the procedure. We initialize the algorithm by generating a random partition of the units into $k$ groups. For a fixed $k$, two different initializations can lead to two different optimal solutions. Thus, for each dataset we fixed a range for the number of clusters ($k$ from 2 to 10) and, for each $k$, we performed 200 random initializations, storing only the best solution, i.e., the one yielding the minimum value of the DCA criterion.

Fig. 1. Artificial dataset 1: 100 interval data in 2d; the left panel is a cross representation emphasizing centers and radii, the right panel a rectangular representation of the data.

Fig. 2. Artificial dataset 2.

Fig. 3. Artificial dataset 3.

China temperature dataset: The variability of the China dataset is greatly related to the variability of the centers (96% of the sum of the centers and radii variability). An outline of the input data table is shown in Table 4: each row reports the vector of interval temperatures observed at a particular station along the 12 months of 1988. The DCA with the Euclidean-based distances, the $L_1$ distance and the Hausdorff distance is used to group the stations into homogeneous clusters. There is no a priori information about a classification structure of the data. We chose $k$ in a range from 3 to 10, performed 200 initializations and, for each distance used in the DCA, stored as best result the classification yielding the minimum value of the DCA criterion (the within-cluster sum of distances). To discover how different the clusters generated by the different distances are, we compare the clusterings using the Rand index corrected for chance (Hubert and Arabie, 1985), which measures the agreement between two partitions and is widely used in the literature for this purpose: it is equal to 1 when the two partitions coincide, while values close to 0 indicate chance-level agreement. The analysis of the China dataset shows (Tables 5–8) that the partitions obtained with the different distances are in good agreement, especially when comparing the Euclidean-based distances.
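The allocation/representation alternation evaluated in these experiments can be sketched as follows; this is our toy illustration of a DCA-style iteration with the Wasserstein distance (Eqs. (44) and (54)), with a deterministic initialization standing in for the 200 random restarts used in the paper:

```python
# Minimal sketch (ours, not the authors' code) of dynamic clustering with the
# squared Wasserstein distance: alternate (i) allocation of each unit to the
# closest prototype and (ii) prototype update via Eq. (54) (means of midpoints
# and of radii) until the partition stabilizes. Intervals are (center, radius).

def d2_wass(a, b):
    (c1, r1), (c2, r2) = a, b
    return (c1 - c2) ** 2 + (r1 - r2) ** 2 / 3.0       # Eq. (44)

def prototype(cluster):
    n = len(cluster)
    return (sum(c for c, _ in cluster) / n,
            sum(r for _, r in cluster) / n)             # Eq. (54)

def dca(data, k):
    # round-robin initial partition (a deterministic stand-in for the
    # random initializations used in the experiments)
    labels = [i % k for i in range(len(data))]
    while True:
        protos = [prototype([x for x, l in zip(data, labels) if l == h] or data)
                  for h in range(k)]
        new = [min(range(k), key=lambda h: d2_wass(x, protos[h])) for x in data]
        if new == labels:
            return labels, protos
        labels = new

# two well-separated groups of intervals
data = [(0.0, 1.0), (0.5, 1.2), (0.2, 0.9), (10.0, 2.0), (10.5, 2.2), (9.8, 1.9)]
labels, protos = dca(data, k=2)
assert labels == [0, 0, 0, 1, 1, 1]   # the two groups are recovered
```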

Table 3
The inertia of the centers and of the radii for the considered datasets

(1) Dataset    (2) n   (3) vars   (4) Σ(Δc_ij)²   (5) Σ(Δr_ij)²   (6) (4)+(5)   (7) (4)/(6)   (8) (5)/(6)
China          60      12         32,039          1411            33,450        0.96          0.04
Artificial 1   100     2          4924            4924            9848          0.50          0.50
Artificial 2   100     2          4847            2425            7273          0.67          0.33
Artificial 3   100     2          4659            9315            13,974        0.33          0.67

Table 4
China dataset: monthly temperature [min:max] in 60 stations during 1988

Station     January          February         …   November         December
AnQing      [1.8:7.1]        [2.1:7.2]        …   [7.8:17.9]       [4.3:11.8]
BaoDing     [−7.1:1.7]       [−5.3:4.8]       …   [0.8:14]         [−3.9:5.2]
BeiJing     [−7.2:2.1]       [−5.9:3.8]       …   [1.5:12.7]       [−4.4:4.7]
BoKeTu      [−23.4:−15.5]    [−24:−14]        …   [−13.5:−4.2]     [−21.1:−13.1]
ChangChun   [−16.9:−6.7]     [−17.6:−6.8]     …   [−7.9:2.2]       [−15.9:−7.2]
ChangSha    [2.7:7.4]        [3.1:7.7]        …   [7.6:19.6]       [4.1:13.3]
…           …                …                …   …                …
ZhiJiang    [2.7:8.4]        [2.7:8.7]        …   [8.2:20]         [5.1:13.3]

Table 5
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 3, the lower triangle shows results for k = 4

            dWass    dL2      dCD(1/2)  dCD(1/3)  dCD(2/3)  dCD(√π/2)  dCD(1)   dBer     dL1      dH
dWass       –        1        1         1         1         1          1        1        0.2629   0.2629
dL2         1        –        1         1         1         1          1        1        0.2629   0.2629
dCD(1/2)    0.9479   0.9479   –         1         1         1          1        1        0.2629   0.2629
dCD(1/3)    0.8689   0.8689   0.9176    –         1         1          1        1        0.2629   0.2629
dCD(2/3)    1        1        0.9479    0.8689    –         1          1        1        0.2629   0.2629
dCD(√π/2)   1        1        0.9479    0.8689    1         –          1        1        0.2629   0.2629
dCD(1)      1        1        0.9479    0.8689    1         1          –        1        0.2629   0.2629
dBer        1        1        0.9479    0.8689    1         1          1        –        0.2629   0.2629
dL1         0.8571   0.8571   0.8138    0.7451    0.8571    0.8571     0.8571   0.8571   –        1
dH          0.901    0.901    0.8547    0.7825    0.901     0.901      0.901    0.901    0.9504   –
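The corrected Rand index reported in Tables 5–9 can be computed from the contingency table of the two partitions; a minimal sketch (the function name is ours):

```python
# Sketch of the Rand index corrected for chance (Hubert and Arabie, 1985),
# computed from the pair counts of two labelings of the same units.

from collections import Counter
from math import comb

def corrected_rand(part_a, part_b):
    """Corrected-for-chance Rand index between two partitions."""
    n = len(part_a)
    sum_nij = sum(comb(v, 2) for v in Counter(zip(part_a, part_b)).values())
    sum_ai = sum(comb(v, 2) for v in Counter(part_a).values())
    sum_bj = sum(comb(v, 2) for v in Counter(part_b).values())
    expected = sum_ai * sum_bj / comb(n, 2)
    max_index = (sum_ai + sum_bj) / 2.0
    return (sum_nij - expected) / (max_index - expected)

# identical partitions (up to label names) give 1, the maximum agreement
assert corrected_rand([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]) == 1.0
```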

Table 6
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 5, the lower triangle shows results for k = 6

            dWass    dL2      dCD(1/2)  dCD(1/3)  dCD(2/3)  dCD(√π/2)  dCD(1)   dBer     dL1      dH
dWass       –        1        0.9542    0.9542    1         1          0.9542   1        1        0.9666
dL2         1        –        0.9542    0.9542    1         1          0.9542   1        1        0.9666
dCD(1/2)    1        1        –         1         0.9542    0.9542     1        0.9542   0.9542   0.9203
dCD(1/3)    1        1        1         –         0.9542    0.9542     1        0.9542   0.9542   0.9203
dCD(2/3)    1        1        1         1         –         1          0.9542   1        1        0.9666
dCD(√π/2)   1        1        1         1         1         –          0.9542   1        1        0.9666
dCD(1)      1        1        1         1         1         1          –        0.9542   0.9542   0.9203
dBer        1        1        1         1         1         1          1        –        1        0.9666
dL1         1        1        1         1         1         1          1        1        –        0.9666
dH          0.9632   0.9632   0.9632    0.9632    0.9632    0.9632     0.9632   0.9632   0.9632   –

Table 7
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 7, the lower triangle shows results for k = 8

            dWass    dL2      dCD(1/2)  dCD(1/3)  dCD(2/3)  dCD(√π/2)  dCD(1)   dBer     dL1      dH
dWass       –        0.889    1         0.9652    0.8252    0.9676     0.9676   0.9676   0.8184   0.7364
dL2         0.7902   –        0.889     0.8585    0.8852    0.9201     0.9201   0.9201   0.8794   0.8124
dCD(1/2)    0.8516   0.8977   –         0.9652    0.8252    0.9676     0.9676   0.9676   0.8184   0.7364
dCD(1/3)    1        0.7902   0.8516    –         0.794     0.9324     0.9324   0.9324   0.8254   0.7384
dCD(2/3)    1        0.7902   0.8516    1         –         0.835      0.835    0.835    0.8369   0.8239
dCD(√π/2)   0.9626   0.7675   0.8154    0.9626    0.9626    –          1        1        0.8498   0.7632
dCD(1)      0.8522   0.839    0.8986    0.8522    0.8522    0.888      –        1        0.8498   0.7632
dBer        0.8153   0.8767   0.9661    0.8153    0.8153    0.8501     0.9328   –        0.8498   0.7632
dL1         0.8223   0.8195   0.8834    0.8223    0.8223    0.8576     0.9695   0.9173   –        0.8851
dH          0.748    0.8069   0.8247    0.748     0.748     0.7775     0.8893   0.8532   0.8742   –


Table 8
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 9, the lower triangle shows results for k = 10

            dWass    dL2      dCD(1/2)  dCD(1/3)  dCD(2/3)  dCD(√π/2)  dCD(1)   dBer     dL1      dH
dWass       –        0.8774   0.7322    0.7402    0.6933    0.7608     0.7323   0.8073   0.8711   0.7982
dL2         0.5186   –        0.7999    0.8285    0.6013    0.8825     0.8389   0.701    0.7461   0.8706
dCD(1/2)    0.6855   0.695    –         0.9718    0.7751    0.8512     0.86     0.6438   0.8357   0.7959
dCD(1/3)    0.7685   0.6577   0.7956    –         0.773     0.8802     0.8893   0.6471   0.8435   0.8236
dCD(2/3)    0.6011   0.7037   0.8468    0.7225    –         0.649      0.6554   0.5934   0.7966   0.6032
dCD(√π/2)   0.7336   0.7667   0.9028    0.8782    0.8266    –          0.957    0.6559   0.7647   0.8305
dCD(1)      0.6011   0.7037   0.8468    0.7225    1         0.8266     –        0.641    0.7774   0.8081
dBer        0.6519   0.749    0.8791    0.8178    0.8366    0.9198     0.8366   –        0.7454   0.7101
dL1         0.7467   0.6697   0.8078    0.9604    0.7342    0.8911     0.7342   0.8302   –        0.7968
dH          0.5214   0.6243   0.767     0.7032    0.8777    0.7517     0.8777   0.8117   0.7202   –

Table 9
Corrected Rand Index for the artificial datasets (k = 5)

Distance     Artificial 1   Artificial 2   Artificial 3
dWass        0.824          1              0.7885
dL2          0.6991         0.7843         0.7641
dCD(1/2)     1              1              0.9746
dCD(1/3)     1              1              1
dCD(2/3)     0.824          1              0.9041
dCD(√π/2)    0.824          0.8074         0.789
dCD(1)       0.824          0.7962         0.789
dBer         0.824          0.7962         0.789
dL1          0.6991         0.7962         0.7256
dH           0.7696         0.8343         0.7919

The artificial datasets: The three artificial datasets have been generated so that the intervals form five clusters both for the centers and for the radii, as can be seen in the left panels of Figs. 1–3. This means that each dataset contains five clusters of 20 interval observations described by two interval variables. The results in Table 9 show that, in general, the Euclidean-based distances yield better partitions than the $L_1$ and Hausdorff distances. Among the Euclidean-based distances, $d_{L_2}$ has the worst performance.

8. Concluding remarks

The choice of a distance measure is an important task when performing a clustering of data. Several proposals have been made for dealing with interval data; in this paper we have reviewed the main distances proposed in the literature and introduced a new metric for measuring the distance between intervals. The proposed measure, which extends an existing metric (the Wasserstein metric), has several advantages with respect to those proposed in the literature: it is computed considering the density of the points within the intervals, it satisfies the Huygens theorem for the decomposition of inertia, it can be easily computed, and it can be suitably extended to the case where a non-uniform density is defined on the intervals. These characteristics are shared only with the Bertoluzza distance and the Coppi and D'Urso one, but the Wasserstein distance has a further property: it can also be used when different distributions are defined on the compared intervals or, in a fuzzy approach, when the fuzzy numbers have membership functions of different shapes. For these reasons, the proposed distance can be the basis of new techniques for the analysis of interval data and, more generally, of multivalued quantitative data (such as histogram data, as defined in symbolic data analysis).

References

Bertoluzza, C., Corral, N., Salas, A., 1995. On a new class of distances between fuzzy numbers. Mathware Soft Comput. 2, 71–84.

Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

Bock, H.H., 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Studies in Classification, Data Analysis and Knowledge Organisation. Springer-Verlag.

Chavent, M., De Carvalho, F., Lechevallier, Y., Verde, R., 2003. Trois nouvelles méthodes de classification automatique de données symboliques de type intervalle. Rev. Statist. Appl. 4, 5–29.

De Carvalho, F., 1994. Proximity coefficients between Boolean symbolic objects. In: Proc. 4th Conf. of the International Federation of Classification Societies. New Approaches in Classification and Data Analysis. Springer, pp. 370–378.

De Carvalho, F., 1998. Extension based proximities between constrained Boolean symbolic objects. In: Proc. 5th Conf. of the International Federation of Classification Societies. Data Science, Classification and Related Methods. Springer, pp. 387–394.

De Carvalho, F., Brito, P., Bock, H.H., 2006. Dynamic clustering of interval data based on L2 distance. Comput. Statist. 21 (2), 231–250.

Diday, E., 1971. La méthode des nuées dynamiques. Rev. Statist. Appl. 19 (2), 19–34.

Gibbs, A.L., Su, F.E., 2002. On choosing and bounding probability metrics. Internat. Statist. Rev. 70, 419–435.

Gowda, K.C., Diday, E., 1991. Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24, 567–578.

Gowda, K.C., Ravi, T.V., 1995. Agglomerative clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition Lett. 16, 647–652.

Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218.

Ichino, M., Yaguchi, H., 1994. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Systems Man Cybernet. 24 (4), 698–708.

Irpino, A., Romano, E., 2007. Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. RNTI E-9, 99–110.

Kodratoff, Y., Bisson, G., 1992. The epistemology of conceptual clustering: KBG, an implementation. J. Intell. Inform. Systems 1 (1), 57–84.

Michalski, R., Stepp, R., Diday, E., 1981. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In: Progress in Pattern Recognition, vol. 1. North-Holland, New York, pp. 33–56.

Tran, L., Duckstein, L., 2002. Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets Systems 130, 331–341.

Verde, R., Lauro, N., 2000. Basic choices and algorithms for symbolic objects dynamical clustering. In: XXXIIe Journées de Statistique, Fès, Maroc, Société Française de Statistique, pp. 38–42.
Agglomerative clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition Lett. 16, 647– 652. Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218. Ichino, M., Yaguchi, H., 1994. Generalized Minkowsky metrics for mixed featuretype data analysis. IEEE Trans. System Man Cybernet. 24 (4), 698–708. Irpino, A., Romano, E., 2007. Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. RNTI E-9, 99–110. Kodratoff, Y., Bisson, G., 1992. The epistemology of conceptual clustering: Kbg, an implementation. J. Intell. Inform. Systems 1 (1), 5784. Michalski, R., Stepp, R., Diday, E., 1981. A Recent Advance in Data Analysis: Clustering Objects into Classes Characterized by Conjunctive Concepts, vol. 1. North-Holland, New York. p. 3356. Tran, L., Duckstein, L., 2002. Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets Systems 130, 331–341. Verde, R., Lauro, N., 2000. Basic choices and algorithms for symbolic objects dynamical clustering. In: XXXIIe Journées de Statistique, Fés, Maroc, Societé Francßaise de Statistique, p. 3842.