Pattern Recognition Letters 29 (2008) 1648–1658
Dynamic clustering of interval data using a Wasserstein-based distance

Antonio Irpino *, Rosanna Verde

Dipartimento di studi europei e mediterranei, Seconda Università degli Studi di Napoli, Caserta (CE), Italy
Article info

Article history: received 27 April 2006; received in revised form 21 February 2008; available online 29 April 2008. Communicated by A. Fred.

Keywords: Interval data; Clustering; Wasserstein distance; Inertia
Abstract

Interval data allow statistical units to be described by means of intervals of values, when their representation by means of a single value would be too reductive or inconsistent. In the present paper, we present a Wasserstein-based distance for interval data, and we show its interesting properties in the context of clustering techniques. We show that the proposed distance generalizes a wide set of distances proposed for interval data by different approaches or in different contexts of analysis. An application to real data is performed to illustrate the impact of using different metrics, including the proposed one, in a dynamic clustering algorithm. © 2008 Elsevier B.V. All rights reserved.
1. Introduction

The representation of data by means of intervals of values is becoming more and more frequent in different fields of application. Intervals appear as a way to describe the uncertainty affecting the observed values. The uncertainty can be considered as the incapability to obtain true values, depending on not knowing the model that regulates the phenomena. It can be the expression of three causes: randomness, vagueness or imprecision. Randomness is present when it is possible to hypothesize a probability distribution for the outcomes of an experiment, or when the observation is affected by an error component that is modeled as a random variable (e.g., Gaussian white noise). Vagueness arises when it is unclear whether a concept applies or not. Imprecision is related to the difficulty of measuring a phenomenon accurately. While randomness is strictly related to a probabilistic approach, vagueness and imprecision have been widely treated by using fuzzy set theory, as well as the interval algebra approach. The probabilistic, fuzzy and interval algebra approaches sometimes overlap in treating interval data. Many connections are presented in the literature between interval algebra and fuzzy theory, especially in the definition of dissimilarity measures to compare values affected by uncertainty and thus expressed by intervals. Some distances between intervals are based on a comparison of the domains of the membership function or on $\alpha$-cuts (Bezdek, 1981; Tran and Duckstein, 2002).
* Corresponding author. Tel.: +39 3287195399; fax: +39 081675009. E-mail addresses: [email protected] (A. Irpino), [email protected] (R. Verde).

doi:10.1016/j.patrec.2008.04.008
Interval data have also been studied in symbolic data analysis (SDA) (Bock and Diday, 2000), a domain related to multivariate analysis, pattern recognition and artificial intelligence. In this framework, to take into account the variability and/or the uncertainty inherent in the data, variables can assume multiple values (bounded sets of real values, multi-categories, weight distributions), of which intervals are a particular case. As in classic multivariate data analysis, dissimilarities and distances between data play an important role. Whereas several dissimilarities and distances have been defined in classic data analysis according to the aims of the analysis and the nature of the data, several proposals have also been advanced for the analysis of interval data. In SDA, several dissimilarity measures have been proposed: Chavent and Lechevallier (2002) and Chavent et al. (2006) proposed Hausdorff L1 distances, while De Carvalho et al. (2006) proposed Lq distances and De Carvalho et al. (2006) an adaptive L2 version. It is worth observing that these measures are based essentially on the boundary values of the compared intervals. These distances have been mainly proposed as criterion functions in clustering algorithms to partition a set of interval data, whatever cluster structure can be assumed. In the present paper, we introduce a new metric based on the Wasserstein distance for the comparison of interval data. We show its properties, and we use it as the basis for the definition of criteria for a dynamic clustering algorithm. This choice is motivated by the fact that several proposals have already been introduced for dynamic clustering, so we are better able to compare the use of the new distance with others used in the recent literature. The paper is structured as follows: in Section 2, we present the general schema of the dynamic clustering algorithm, which can be
considered a general schema for clustering complex data. In Section 3, we present the main dissimilarity (or distance) functions proposed in the fuzzy, symbolic data analysis and probabilistic contexts to compare intervals of real values. Then, in Section 4, we introduce a new metric based on the Wasserstein distance that respects all the classical properties of a distance and, being based on the quantile functions associated with the interval distributions, seems particularly able to keep the whole information contained in the intervals. In Section 5, we compare the reviewed distances, especially the Euclidean-based ones, as the latter allow us to define an inertia (variability) measure that respects the Huygens theorem of decomposition of the inertia of clustered data. Further, the distances are compared on the basis of their capability to define a representative interval minimizing the variability measure of a set of interval data. In Section 7, we use the Euclidean distances as well as the Hausdorff-based and L1-based ones on some datasets to compare the different distances within the dynamic clustering algorithm. Section 8 gives some concluding remarks and some perspectives.

2. Dynamic clustering of interval data

Clustering methods play a central role in allowing conceptual descriptions to be compared and clustered, in order to obtain typologies of concepts. The dynamic clustering algorithm (DCA) (Diday, 1971; Chavent et al., 2003) represents a general reference for unsupervised non-hierarchical iterative clustering algorithms, and it can be proven that DCA generalizes several partitive clustering methods, such as the k-means and k-median algorithms. In particular, DCA simultaneously looks for the partition of the set of data and the representation of the clusters. The main innovation of the symbolic clustering approach is the definition of a way to represent the obtained clusters by means of prototypes (Chavent et al., 2003).
In the literature, several authors indicate two ways to compute prototypes. In the first approach (Verde and Lauro, 2000), the prototype of a cluster is an element having the same properties as the elements belonging to it. In this way, a cluster of intervals is described by a single prototypal interval, in the same way as a cluster of points is represented by its barycenter. In the second approach (Verde and Lauro, 2000; Michalski et al., 1981; Kodratoff and Bisson, 1992), the prototype of a cluster has to represent the whole variability of the clustered elements by means, for example, of a distribution on the domain of the descriptors.

An interval variable $X$ is a correspondence between a set $E$ of units and a set of closed intervals $[a, b]$, where $a \le b$ and $a, b \in \mathbb{R}$. A proximity measure $d$ is a non-negative function defined on each couple of elements of the space of descriptions of $E$, where the closer the individuals are, the lower the value assumed by $d$. Let $E$ be a set of $n$ symbolic data described by $p$ interval variables $X_j$ ($j = 1, \ldots, p$). The general DCA looks for the partition $P^* \in \mathcal{P}_k$ of $E$ in $k$ classes, among all the possible partitions $\mathcal{P}_k$, and the vector $L^* \in \mathcal{L}_k$ of $k$ prototypes representing the classes in $P^*$, such that the following fitting criterion $\Delta$ between $L$ and $P$ is minimized:

$$\Delta(P^*, L^*) = \min\{\Delta(P, L) \mid P \in \mathcal{P}_k,\ L \in \mathcal{L}_k\}. \quad (1)$$
Such a criterion is defined as the sum of dissimilarity or distance measures $d(x_i, G_h)$ of fitting between each object $x_i$ belonging to a class $C_h \in P$ and the class representation $G_h \in L$:

$$\Delta(P, L) = \sum_{h=1}^{k} \sum_{x_i \in C_h} d(x_i, G_h).$$
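As an illustration, the alternating optimization behind criterion (1) can be sketched in code. This is a minimal sketch, not the authors' implementation: it assumes a single interval variable, intervals coded as (center, radius) pairs, a squared distance of the Wasserstein type introduced later in Section 4, and the component-wise mean prototype derived in Section 5; all function names are ours.

```python
import random

def d2(x, g):
    # Squared distance between intervals x = (center, radius) and g = (center, radius);
    # this anticipates the Wasserstein form of Section 4 (Eq. (29)).
    return (x[0] - g[0]) ** 2 + (x[1] - g[1]) ** 2 / 3.0

def prototype(cluster):
    # Representation step: component-wise mean of centers and radii (cf. Section 5).
    n = len(cluster)
    return (sum(c for c, _ in cluster) / n, sum(r for _, r in cluster) / n)

def dynamic_clustering(data, k, n_iter=100, seed=0):
    # DCA: alternate an assignment step (build the partition P) and a
    # representation step (recompute the prototypes L) until the criterion stabilizes.
    rng = random.Random(seed)
    protos = rng.sample(data, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            h = min(range(k), key=lambda i: d2(x, protos[i]))
            clusters[h].append(x)
        new = [prototype(c) if c else protos[i] for i, c in enumerate(clusters)]
        if new == protos:  # criterion (1) can no longer decrease
            break
        protos = new
    return protos, clusters
```

On two well-separated groups of intervals, the scheme recovers one prototype per group in a few iterations.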
A prototype $G_h$ associated with a class $C_h$ is an element of the space of the descriptions of $E$, and it can be represented as a vector of intervals. The algorithm is initialized by generating $k$ random clusters or,
alternatively, $k$ random prototypes. Generally, the criterion $\Delta(P, L)$ is based on a distance that is additive over the $p$ descriptors. In the following, we give an overview of the dissimilarities and distances proposed in the literature for the treatment of interval data, focusing our attention on those that can give a solution to Eq. (1), i.e., on those metrics for which the prototype of a set of intervals is itself an interval.

3. A brief survey of the existing distances for interval data

According to symbolic data analysis, an interval variable $X$ is a correspondence between a set $E$ of units and a set of closed intervals $[a, b]$, where $a \le b$ and $a, b \in \mathbb{R}$. Without loss of generality, the same notation is used by the interval arithmetic approach and, with few modifications, by the fuzzy data analysis approach. Given $p$ interval variables, the interval description of the $i$th unit can be written using the vectorial notation $x_i = (x_i^1, \ldots, x_i^p)$, where $x_i^j = [a_i^j, b_i^j]$. Let $A$ and $B$ be two intervals described, respectively, by $[a, b]$ and $[u, v]$. $d(A, B)$ can be considered a distance if the main properties defining a distance hold: (reflexivity) $d(A, A) = 0$; (symmetry) $d(A, B) = d(B, A)$; and (triangular inequality) $d(A, B) \le d(A, C) + d(C, B)$. Hereafter we present some of the most used distances for interval data, belonging to different families and referring to the several contexts in which they have been proposed. The main properties of such measures are also underlined. We may group distances between interval data according to the different approaches that generated them: the feature extraction or component-wise approach, the fuzzy analysis approach and the symbolic data analysis approach.
Further, we can group the proposed distances into three main families: a component approach, where the distance is set up by combining different aspects of the comparison of two intervals (position, size, span and content); an extreme value approach, where the distance is computed considering only the bounds of the two intervals; and an extension approach, where the distance is considered as an extension of distances defined between points. Before introducing the most used measures of distance between two interval data, we recall the definition of two operators: the join and the meet. Given two multivariate intervals $a = (A_1, \ldots, A_p)$, where $A_j = [a_{lj}, a_{uj}]$, and $b = (B_1, \ldots, B_p)$, where $B_j = [b_{lj}, b_{uj}]$, the join operator (Ichino and Yaguchi, 1994) is defined as

$$c = (C_1, \ldots, C_p) = a \oplus b, \quad \text{where } C_j = [c_{lj}, c_{uj}] \text{ with } c_{lj} = \min(a_{lj}, b_{lj}) \text{ and } c_{uj} = \max(a_{uj}, b_{uj}).$$
The meet operator is defined as the intersection of the two interval data:

$$c = (C_1, \ldots, C_p) = a \otimes b, \quad \text{where } C_j = [c_{lj}, c_{uj}] = A_j \cap B_j.$$

Further, we introduce the potential of description (De Carvalho, 1994) of a multivariate interval datum, which, in the case of interval data, corresponds to the well-known Lebesgue measure of a set:

$$\pi(a) = \prod_{j=1}^{p} \mu(A_j) = \prod_{j=1}^{p} (a_{uj} - a_{lj}).$$
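A minimal sketch of these operators (function names ours), with univariate intervals coded as (lower, upper) pairs and a multivariate interval as a list of such pairs:

```python
def join(A, B):
    # Join: smallest interval covering both A = (al, au) and B = (bl, bu).
    return (min(A[0], B[0]), max(A[1], B[1]))

def meet(A, B):
    # Meet: intersection of A and B, or None when the intervals are disjoint.
    lo, hi = max(A[0], B[0]), min(A[1], B[1])
    return (lo, hi) if lo <= hi else None

def potential(box):
    # Potential of description: product of the interval lengths (Lebesgue measure).
    p = 1.0
    for lo, hi in box:
        p *= hi - lo
    return p
```

For example, the join of [0, 2] and [1, 5] is [0, 5], their meet is [1, 2], and the potential of the box ([0, 2], [1, 5]) is 2 × 4 = 8.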
3.1. The component-wise approach
The following metrics are based on a feature extraction approach that emphasizes different aspects of the comparison of two (or more) intervals. We call this the ``component-wise approach'': the comparison of two intervals is done taking into account position, span, content and other aspects.

Gowda and Diday (1991): Considering two multivariate intervals $a$ and $b$ described by $p$ interval variables, Gowda and Diday (1991) proposed the following distance:

$$d(a, b) = \sum_{j=1}^{p} D(A_j, B_j). \quad (2)$$

The distance is the sum of three components,

$$D(A_j, B_j) = D_p(A_j, B_j) + D_s(A_j, B_j) + D_c(A_j, B_j):$$

a position component $D_p$, related to where the intervals are located along $\mathbb{R}$,

$$D_p(A_j, B_j) = \frac{|a_{lj} - b_{lj}|}{\mu(D_j)} \in [0, 1], \quad (3)$$

where $D_j$ is the domain of the $j$th interval variable (i.e., the join of all the intervals observed for the $j$th variable) and $\mu(\cdot)$ denotes interval length; a span component $D_s$ that compares the different spreads of the two intervals,

$$D_s(A_j, B_j) = \frac{|\mu(A_j) - \mu(B_j)|}{\mu(A_j \oplus B_j)} \in [0, 1]; \quad (4)$$

and a content component $D_c$ that takes into consideration how much the two intervals do not overlap, i.e., how much the two intervals do not have in common, normalized on the join of the two intervals,

$$D_c(A_j, B_j) = \frac{\mu(A_j) + \mu(B_j) - 2\mu(A_j \cap B_j)}{\mu(A_j \oplus B_j)} \in [0, 1]. \quad (5)$$

If we consider a set $E$ of $n$ elements, the minimum of Eq. (1) can be found only numerically, by fixing the minimum of the prototype interval at the median of the minima of the intervals belonging to the cluster and varying the length of the prototypal interval.

Ichino and Yaguchi (1994): In 1994, Ichino and Yaguchi proposed a new distance measure where the comparisons are done for each descriptor by means of the following comparison function:

$$\phi(A_j, B_j) = \mu(A_j \oplus B_j) - \mu(A_j \cap B_j) + \gamma(2\mu(A_j \cap B_j) - \mu(A_j) - \mu(B_j)). \quad (6)$$

The distance combines the component related to the length of the join minus the length of the meet of the two intervals, integrated with the content component weighted by $\gamma$, where $0 \le \gamma \le 0.5$. If $\gamma = 0$, then $\phi(A_j, B_j) = \mu(A_j \oplus B_j) - \mu(A_j \cap B_j)$ simply compares the two intervals on the basis of the join minus the meet. If $\gamma = 0.5$, then

$$\phi(A_j, B_j) = \mu(A_j \oplus B_j) - \frac{\mu(A_j) + \mu(B_j)}{2}$$

compares the two intervals considering their join minus the average of the lengths of the two intervals. It is possible to prove that $\phi$ is a distance. More recently, De Carvalho and Diday (2000) proposed a version normalized on the basis of the join of the two intervals:

$$u(A_j, B_j) = \frac{\phi(A_j, B_j)}{\mu(A_j \oplus B_j)},$$

while Ichino and Yaguchi (1994) proposed a normalized version that takes as normalization parameter the length of the domain of the $j$th descriptor:

$$\psi(A_j, B_j) = \frac{\phi(A_j, B_j)}{\mu(D_j)}.$$

The distances computed for each descriptor are then aggregated by the following function:

$$d_q(a, b) = \left(\sum_{j=1}^{p} w_j\, (FC(A_j, B_j))^q\right)^{1/q}, \quad q \ge 1, \quad (7)$$

where $w_j > 0$, $\sum_{j=1}^{p} w_j = 1$, and $FC$ is $\phi$, $\psi$ or $u$. If we consider a set $E$ of $n$ elements, the minimum of Eq. (1) cannot be determined analytically.

De Carvalho (1994): In 1994, De Carvalho proposed a family of metrics based on the following comparison function:

$$\phi(A_j, B_j) = \frac{1}{2}\left[\phi_c(A_j, B_j) + \phi_p(A_j, B_j)\right]. \quad (8)$$

It can be considered as the average of a content component and a position component. The content component measures the relationship between the common (agreement) and the uncommon (disagreement) parts of the two intervals for each descriptor, on the basis of the lengths of the four intervals generated as in Table 1. De Carvalho proposed five different comparison functions in order to take into consideration the measures of agreement and disagreement; some of these have been proved to be metrics, while others are only semi-metrics (i.e., dissimilarities). The five comparison functions for the content component are

$$s_c^1 = \frac{a}{a + b + c},$$ which can be considered a metric;

$$s_c^2 = \frac{2a}{2a + b + c},$$ which can be considered a dissimilarity;

$$s_c^3 = \frac{a}{a + 2(b + c)},$$ which can be considered a metric;

$$s_c^4 = \frac{1}{2}\left(\frac{a}{a + b} + \frac{a}{a + c}\right),$$ which can be considered a dissimilarity; and

$$s_c^5 = \frac{a}{\sqrt{(a + b)(a + c)}},$$ which can be considered a dissimilarity.

The position component is proposed as the (normalized) length of the separation between the two intervals, in the following way:
Table 1
Agreement and disagreement functions for the content component of the De Carvalho (1994) metric

             | Agreement                     | Disagreement                     | Total
Agreement    | $a = \mu(A_j \cap B_j)$       | $b = \mu(A_j \cap c(B_j))$       | $\mu(A_j)$
Disagreement | $c = \mu(c(A_j) \cap B_j)$    | $d = \mu(c(A_j) \cap c(B_j))$    | $\mu(c(A_j))$
Total        | $\mu(B_j)$                    | $\mu(c(B_j))$                    | $\mu(D_j)$

$c(A)$ is the complement of interval $A$.
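The agreement and disagreement lengths of Table 1, and the content comparison functions built on them, can be sketched as follows (an illustrative sketch with names of our choosing; complements are taken within the variable's domain):

```python
def table1_lengths(A, B, domain):
    # Lengths a, b, c, d of Table 1 for intervals A, B inside the domain D_j:
    # a = mu(A ∩ B), b = mu(A ∩ c(B)), c = mu(c(A) ∩ B), d = mu(c(A) ∩ c(B)).
    inter = lambda X, Y: max(0.0, min(X[1], Y[1]) - max(X[0], Y[0]))
    a = inter(A, B)
    b = (A[1] - A[0]) - a          # mu(A) - mu(A ∩ B)
    c = (B[1] - B[0]) - a          # mu(B) - mu(A ∩ B)
    d = (domain[1] - domain[0]) - a - b - c
    return a, b, c, d

def s_c1(a, b, c):
    # First content comparison function: a / (a + b + c).
    return a / (a + b + c)
```

For A = [0, 2], B = [1, 5] in the domain [0, 10], this gives a = 1, b = 1, c = 3, d = 5, and $s_c^1 = 1/5$.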
$$\phi_p(A_j, B_j) = \frac{\mu(c(A_j) \cap c(B_j) \cap (A_j \oplus B_j))}{\mu(A_j \oplus B_j)}, \quad (9)$$

which, for interval data, is

$$\phi_p(A_j, B_j) = \begin{cases} \dfrac{|\min(a_{uj}, b_{uj}) - \max(a_{lj}, b_{lj})|}{\max(a_{uj}, b_{uj}) - \min(a_{lj}, b_{lj})} & \text{if } A_j \cap B_j = \emptyset, \\[2mm] 0 & \text{otherwise.} \end{cases} \quad (10)$$

Finally, the distance is aggregated over the descriptors by means of the following formula:

$$d_q(a, b) = \left(\sum_{j=1}^{p} w_j\, (\phi(A_j, B_j))^q\right)^{1/q}, \quad q \ge 1. \quad (11)$$

Gowda and Ravi (1995): In 1995, Gowda and Ravi proposed a new metric combining a position and a size component:

$$d(a, b) = \sum_{j=1}^{p} D(A_j, B_j), \quad (12)$$

$$D(A_j, B_j) = D_p(A_j, B_j) + D_s(A_j, B_j). \quad (13)$$

The position component is defined as

$$D_p(A_j, B_j) = \cos\left[\left(1 - \frac{|a_{lj} - b_{lj}|}{\mu(D_j)}\right) 90^\circ\right], \quad (14)$$

where the role of the cosine function is that of standardizing the distances in order to compare the different measurement scales of the descriptors. The size component is defined as

$$D_s(A_j, B_j) = \cos\left[\frac{\mu(A_j) + \mu(B_j)}{2\,\mu(A_j \oplus B_j)}\, 90^\circ\right]. \quad (15)$$

De Carvalho (1998): In 1998, De Carvalho proposed an extension of the Ichino and Yaguchi (1994) distance based on the potential of description, as follows:

$$d_1(a, b) = \pi(a \oplus b) - \pi(a \otimes b) + \gamma(2\pi(a \otimes b) - \pi(a) - \pi(b)) \quad (16)$$

is a dissimilarity, where $0 \le \gamma \le 0.5$. De Carvalho also proposed the following normalizations:

$$d_2(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma(2\pi(a \otimes b) - \pi(a) - \pi(b))}{\pi(s_E)}, \quad (17)$$

where $s_E = (D_1, \ldots, D_p)$, is a dissimilarity, while

$$d_3(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma(2\pi(a \otimes b) - \pi(a) - \pi(b))}{\pi(a \oplus b)} \quad (18)$$

is a distance.

3.2. The fuzzy oriented approach

The fuzzy data analysis approach has supported the development of distance measures for interval data since, in several applications, fuzzy numbers are considered interval data. We recall here some basic definitions.

Definition 1. A fuzzy number is a convex and normal fuzzy subset of $\mathbb{R}$, that is, a map (membership function) $a : \mathbb{R} \to [0, 1]$ with the following properties: $a(\lambda x + (1 - \lambda) y) \ge \inf[a(x), a(y)]$; there exists $x^*$ such that $a(x^*) = 1$; and $\mathrm{Supp}(a)$ is a compact subset of $\mathbb{R}$.

It is easy to recognize that all the $\alpha$-cuts $\{a_\alpha \mid \alpha \in (0, 1]\}$ of a fuzzy number are intervals, as is the crisp subset $a_0 = \mathrm{Supp}(a)$. Then the $\alpha$-cut of a fuzzy number can be expressed as an interval:

$$a_\alpha = [a_{l\alpha}, a_{u\alpha}], \quad \alpha \in [0, 1].$$

Bertoluzza et al. (1995): Considering two fuzzy numbers $a$ and $b$ with bounded support, Bertoluzza et al. (1995) defined the distance through the function $d^2(\alpha) = d^2(a_\alpha, b_\alpha)$.

Definition 2. Let $g$ be a normalized weight measure on $([0, 1], \mathcal{B}([0, 1]))$. The squared distance $d^2$ between two intervals $a = [a_l, a_u]$ and $b = [b_l, b_u]$ is given by

$$d^2_{Ber}(a, b) = \int_0^1 [t(a_l - b_l) + (1 - t)(a_u - b_u)]^2\, dg(t).$$

The function $g$ has the same properties as a probability density. The authors restrict themselves to weight measures that are the sum of a term that is continuous with respect to the Lebesgue measure and of a finite weight distribution placed at $S$ points $t_1, \ldots, t_S$, that is

$$dg = \tilde{c}(t)\, dt, \qquad \tilde{c}(t) = c(t) + \sum_{s=1}^{S} \lambda_s\, \delta(t - t_s),$$

where $\delta$ is the Dirac function and $\lambda_s$ is the weight placed at $t_s$. The distance is then rewritten as

$$d^2(a, b) = \int_0^1 c(t)[t(a_l - b_l) + (1 - t)(a_u - b_u)]^2\, dt + \sum_{s=1}^{S} \lambda_s [a_s - b_s]^2,$$

where $a_s = t_s a_l + (1 - t_s) a_u$, $b_s = t_s b_l + (1 - t_s) b_u$, and $c(t)$ has to satisfy the following properties:

$$c(t) \ge 0; \qquad \int_0^1 dg(t) = 1; \qquad \tilde{c}(0) > 0,\ \tilde{c}(1) > 0; \qquad t_1 = 0,\ t_S = 1 \ \text{if } S > 1.$$

When no reason exists for preferring the left side of the interval with respect to the right one, they imposed the supplementary condition $\tilde{c}(t) = \tilde{c}(1 - t)$.

The computation of the distance can be hard when complicated choices of $c(t)$ are made. The authors suggested using $c(t) = 0$ and $S = 3$ (assigning $t_1 = 0$, $t_2 = 0.5$ and $t_3 = 1$) to obtain a sufficiently good measure of the distance. In this case, we have a weighted sum of the squared distances between the starting, central and ending points of the two intervals.

For every pair of fuzzy numbers $a$ and $b$ for which the function $d^2(\alpha) = d^2(a_\alpha, b_\alpha)$ is $u$-integrable, the distance $d_{Ber}$ between $a$ and $b$ is defined by

$$d_{Ber}(a, b) = \sqrt{\int_0^1 [d(a_\alpha, b_\alpha)]^2\, du(\alpha)},$$

provided the integral exists. The function $u$ is a suitable normalized weight function that behaves like a probability density; moreover, some conditions are imposed: $u(\alpha) \ge 0$; $\alpha_1 < \alpha_2 \Rightarrow u(\alpha_1) \le u(\alpha_2)$; and $\int_0^1 u(\alpha)\, d\alpha = 1$.
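The suggested choice c(t) = 0, S = 3 can be sketched as follows (an illustrative sketch, names ours; equal weights λ_s = 1/3 are one admissible choice, with interval points parameterized as a(t) = t·a_l + (1 − t)·a_u):

```python
def bertoluzza2(A, B, lambdas=(1/3, 1/3, 1/3), ts=(0.0, 0.5, 1.0)):
    # Squared Bertoluzza distance with c(t) = 0 and point masses lambda_s at t_s:
    # a weighted sum of squared differences at the bounds and the midpoint.
    point = lambda I, t: t * I[0] + (1 - t) * I[1]
    return sum(l * (point(A, t) - point(B, t)) ** 2 for l, t in zip(lambdas, ts))
```

For A = [0, 2] and B = [1, 5], the three squared differences are 9, 4 and 1, so the distance is 14/3; in center-radius form this matches $(\Delta c)^2 + \frac{2}{3}(\Delta r)^2 = 4 + 2/3$, as derived later in Section 5.1.1.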
Coppi and D'Urso (2003): In several applications (D'Urso and Santoro, 2006), the most used class of fuzzy variable is the so-called symmetrical fuzzy variable. Usually, a symmetrical fuzzy variable is denoted by $\tilde{Y} = (m, l)$, where $m$ denotes the center and $l$ the left and right spread, with the following membership function:

$$\mu(x) = L\left(\frac{m - x}{l}\right), \quad m - l \le x \le m + l \ (l > 0),$$

where $L$ is a decreasing ``shape'' function from $\mathbb{R}^+$ to $[0, 1]$ with $L(0) = 1$; $L(x) < 1$ for all $x > 0$; $L(x) > 0$ for all $x < 1$; $L(1) = 0$; and $L(-x) = L(x)$. On the basis of the choice of $L$, it is possible to define different types of symmetrical fuzzy data: usually, the symmetric triangular, normal, parabolic and square root fuzzy variables are chosen. In particular, a squared Euclidean distance between a pair of fuzzy numbers $a = (m_a, r_a)$, where $m_a = 0.5(a_l + a_u)$ and $r_a = 0.5(a_u - a_l)$, and $b = (m_b, r_b)$, where $m_b = 0.5(b_l + b_u)$ and $r_b = 0.5(b_u - b_l)$, is defined by Coppi and D'Urso (2003):

$$d^2_{CD}(\lambda) = (m_a - m_b)^2 + [(m_a - \lambda r_a) - (m_b - \lambda r_b)]^2 + [(m_a + \lambda r_a) - (m_b + \lambda r_b)]^2 = 3(m_a - m_b)^2 + 2\lambda^2 (r_a - r_b)^2, \quad (19)$$

where

$$\lambda = \int_0^1 L^{-1}(t)\, dt.$$

The values of the parameter $\lambda$ related to the shape functions of the symmetric triangular, normal, parabolic and square root fuzzy data are, respectively, $\frac{1}{2}$, $\frac{\sqrt{\pi}}{2}$, $\frac{2}{3}$ and $\frac{1}{3}$.

Tran and Duckstein (2002): In the framework of fuzzy data analysis, Tran and Duckstein (2002) proposed the following ``distance'' between two intervals:

$$d^2_{TD}(A, B) = \int_{-1/2}^{1/2}\int_{-1/2}^{1/2} \left\{\left[\frac{a+b}{2} + x(b-a)\right] - \left[\frac{u+v}{2} + y(v-u)\right]\right\}^2 dx\, dy = \left[\frac{a+b}{2} - \frac{u+v}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b-a}{2}\right)^2 + \left(\frac{v-u}{2}\right)^2\right]. \quad (20)$$

In practice, they consider the expected squared difference between all the points belonging to interval $A$ and all the points belonging to interval $B$. In their paper they assert that it is a distance, but it is easy to observe that it does not satisfy the first property mentioned above. Indeed, the distance of an interval from itself is equal to zero only if the interval is thin:

$$d^2_{TD}(A, A) = \frac{2}{3}\left(\frac{b-a}{2}\right)^2 \ge 0. \quad (21)$$

3.3. The boundary approach

Hausdorff distance: The most common distance used for the comparison of two sets is the Hausdorff distance (named after Felix Hausdorff, well known for the separability theorem on topological spaces at the end of the 19th century). Considering two sets $A$ and $B$ of points of $\mathbb{R}^n$, and a distance $d(x, y)$ where $x \in A$ and $y \in B$, the Hausdorff distance is defined as follows:

$$d_H(A, B) = \max\left(\sup_{x \in A} \inf_{y \in B} d(x, y),\ \sup_{y \in B} \inf_{x \in A} d(x, y)\right). \quad (22)$$

If $d(x, y)$ is the $L_1$ city-block distance, then Chavent et al. (2002) proved that

$$d_H(A, B) = \max(|a - u|, |b - v|) = \left|\frac{a+b}{2} - \frac{u+v}{2}\right| + \left|\frac{b-a}{2} - \frac{v-u}{2}\right|. \quad (23)$$

$L_q$ distances between the bounds of intervals: A family of distances between intervals has been proposed by De Carvalho et al. (2006). Considering a set of interval data described in a space $\mathbb{R}^p$, the metric of norm $q$ is defined as

$$d_{L_q}(A, B) = \left(\sum_{j=1}^{p} |a - u|^q + |b - v|^q\right)^{1/q}; \quad (24)$$

they also showed that in the $L_\infty$ norm $d_{L_\infty} = d_H$. The same measure was extended (De Carvalho, 2007) to an adaptive one in order to take into account the variability of the different clusters in a dynamical clustering process.

4. The Wasserstein distance
If $F$ and $G$ are the distribution functions of two random variables $f$ and $g$, respectively, the Wasserstein $L_2$ metric is defined as (Gibbs and Su, 2002)

$$d_{Wass}(F, G) := \left(\int_0^1 (F^{-1}(t) - G^{-1}(t))^2\, dt\right)^{1/2}, \quad (25)$$

where $F^{-1}$ and $G^{-1}$ are the quantile functions of the two distributions. Irpino and Romano (2007) also proved a general formulation of the Wasserstein distance: if $F$ and $G$ are the distribution functions of two random variables $f$ and $g$, respectively, with first moments $\mu_f$ and $\mu_g$ and standard deviations $\sigma_f$ and $\sigma_g$, the squared Wasserstein distance can be written as

$$d^2_{Wass} = \underbrace{(\mu_f - \mu_g)^2}_{\text{Location}} + \underbrace{(\sigma_f - \sigma_g)^2}_{\text{Size}} + \underbrace{2\sigma_f\sigma_g\,(1 - \rho_{QQ}(F, G))}_{\text{Shape}}, \quad (26)$$

where

$$\rho_{QQ}(F, G) = \frac{\int_0^1 (F^{-1}(t) - \mu_f)(G^{-1}(t) - \mu_g)\, dt}{\sigma_f\,\sigma_g} = \frac{\int_0^1 F^{-1}(t)\,G^{-1}(t)\, dt - \mu_f\,\mu_g}{\sigma_f\,\sigma_g} \quad (27)$$

is the correlation of the quantiles of the two distributions as represented in a classical QQ plot. It is worth noting that $0 < \rho_{QQ} \le 1$, differently from the classical range of variation of the Bravais–Pearson $\rho$. This decomposition allows us to take three aspects into consideration in the comparison of distribution functions. The first aspect is related to the location: two distributions can differ in position, and this aspect is explained by the distance between the mean values of the two distributions. The second aspect is related to the different variability of the compared distributions: it involves both the different standard deviations of the distributions and the different shapes of the density functions. While the former sub-aspect is taken into account by the distance between the standard deviations, the latter is taken into consideration by the value of $\rho_{QQ}$. Indeed, $\rho_{QQ}$ is equal to one only if the two (standardized) distributions have the same shape. If we suppose a uniform distribution of points, an interval of reals $x_i^j = [a_i^j, b_i^j]$ can be expressed as a function of the following type:
$$x_i^j(t_j) = a_i^j + t_j\,(b_i^j - a_i^j), \quad 0 \le t_j \le 1. \quad (28)$$

If we consider a description of the interval by means of its midpoint $m_i^j$ and radius $r_i^j$, the same function can be rewritten as $x_i^j(t_j) = m_i^j + r_i^j\,(2t_j - 1)$, $0 \le t_j \le 1$, where

$$m_i^j = \frac{a_i^j + b_i^j}{2}, \qquad r_i^j = \frac{b_i^j - a_i^j}{2}.$$

Then, the squared Euclidean distance between homologous points of two intervals $x_i^j$ and $x_{i'}^j$ (for the generic variable $X_j$) is defined as follows:

$$d^2_{Wass}(x_i^j, x_{i'}^j) = \int_0^1 [x_i^j(t_j) - x_{i'}^j(t_j)]^2\, dt_j = \int_0^1 [(m_i^j - m_{i'}^j) + (r_i^j - r_{i'}^j)(2t_j - 1)]^2\, dt_j = (m_i^j - m_{i'}^j)^2 + \frac{1}{3}(r_i^j - r_{i'}^j)^2. \quad (29)$$

It is a particular version of the probabilistic Wasserstein metric (Gibbs and Su, 2002) in $L_2$ or, as it is better known, Mallows' distance between two distributions, when each interval is supposed to be the support of a uniform distribution. In our context, such a distance is generalized to an $\mathbb{R}^p$ space (where $p$ is the number of interval variables). Let the description of the box $x_i(T)$ in $p$ dimensions be as follows:

$$\begin{cases} x_{i1}(t_1) = c_{i1} + r_{i1}(2t_1 - 1),\\ x_{i2}(t_2) = c_{i2} + r_{i2}(2t_2 - 1),\\ \quad\vdots\\ x_{ip}(t_p) = c_{ip} + r_{ip}(2t_p - 1), \end{cases} \qquad 0 \le t_j \le 1,\ j = 1, \ldots, p, \quad (30)$$

where $T = \{t_j \mid 0 \le t_j \le 1\}$. Hypothesizing independence among the variables, the generalization of the proposed distance to $\mathbb{R}^p$ can be obtained as

$$d^2_{Wass}(x_i(T), x_{i'}(T)) = \sum_{j=1}^{p} \int_0^1 (x_i^j(t_j) - x_{i'}^j(t_j))^2\, dt_j = \sum_{j=1}^{p} \left[(m_i^j - m_{i'}^j)^2 + \frac{1}{3}(r_i^j - r_{i'}^j)^2\right]. \quad (31)$$

5. Relationships among metrics and their ability to compute prototypes

In this section, we compare the proposed metrics, considering only those measures that can be considered, or have been considered by their authors, as distances. In particular, we show whether it is possible to identify a representative (prototype) element of a set of interval data as the solution of the minimization of a homogeneity criterion. A representative element of a set can be viewed as an interval itself. Given a set of $n$ interval data described by $p$ interval variables, as homogeneity criterion we consider the sum of the distances of the set of intervals from a representative interval, as follows:

$$\sum_{i=1}^{n}\sum_{j=1}^{p} d(x_{ij}, G_j), \quad (32)$$

or, in the case of multivariate distances,

$$\sum_{i=1}^{n} d(x_i, G). \quad (33)$$

We start by rewriting interval data in center–radius notation, as it simplifies the presentation. Given an interval $x_{ij} = [x_{ij}^l, x_{ij}^u]$, we rewrite it as $x_{ij} = (c_{ij}, r_{ij})$, where $c_{ij} = 0.5(x_{ij}^l + x_{ij}^u)$ and $r_{ij} = 0.5(x_{ij}^u - x_{ij}^l)$.

5.1. Euclidean metrics: Bertoluzza, Coppi and D'Urso, Tran and Duckstein, $L_2$, Wasserstein

5.1.1. Bertoluzza
Let us start with the Bertoluzza distance, considering $c(t) = 0$ and $S = 3$, obtaining

$$d^2_{Ber}(x_{ij}, x_{i'j}) = \frac{1}{3}(x_{ij}^l - x_{i'j}^l)^2 + \frac{1}{3}(c_{ij} - c_{i'j})^2 + \frac{1}{3}(x_{ij}^u - x_{i'j}^u)^2 = (c_{ij} - c_{i'j})^2 + \frac{2}{3}(r_{ij} - r_{i'j})^2. \quad (34)$$

The representative is the interval $G_j = (c_j, r_j)$ that minimizes Eq. (32) with $d = d^2$. In this case, we obtain $G_j = (c_j, r_j)$ where

$$c_j = n^{-1}\sum_{i=1}^{n} c_{ij} \quad \text{and} \quad r_j = n^{-1}\sum_{i=1}^{n} r_{ij}.$$

5.1.2. Coppi and D'Urso
Considering the distance

$$d^2_{CD}(\lambda) = 3(c_{ij} - c_{i'j})^2 + 2\lambda^2(r_{ij} - r_{i'j})^2,$$

the values of the parameter $\lambda$ related to the shape functions of the symmetric uniform (crisp), triangular, normal, parabolic and square root fuzzy data are, respectively, $1$, $\frac{1}{2}$, $\frac{\sqrt{\pi}}{2}$, $\frac{2}{3}$ and $\frac{1}{3}$. Then, we obtain the following distances:

Uniform (crisp):
$$d^2_{CD}(1) = 3(c_{ij} - c_{i'j})^2 + 2(r_{ij} - r_{i'j})^2. \quad (35)$$

Triangular:
$$d^2_{CD}\left(\tfrac{1}{2}\right) = 3(c_{ij} - c_{i'j})^2 + \tfrac{1}{2}(r_{ij} - r_{i'j})^2. \quad (36)$$

Normal:
$$d^2_{CD}\left(\tfrac{\sqrt{\pi}}{2}\right) = 3(c_{ij} - c_{i'j})^2 + \tfrac{\pi}{2}(r_{ij} - r_{i'j})^2. \quad (37)$$

Parabolic:
$$d^2_{CD}\left(\tfrac{2}{3}\right) = 3(c_{ij} - c_{i'j})^2 + \tfrac{8}{9}(r_{ij} - r_{i'j})^2. \quad (38)$$

Square root:
$$d^2_{CD}\left(\tfrac{1}{3}\right) = 3(c_{ij} - c_{i'j})^2 + \tfrac{2}{9}(r_{ij} - r_{i'j})^2. \quad (39)$$

As in the Bertoluzza case, for each case we obtain as representative $G_j = (c_j, r_j)$ where $c_j = n^{-1}\sum_{i=1}^{n} c_{ij}$ and $r_j = n^{-1}\sum_{i=1}^{n} r_{ij}$.

5.1.3. Tran and Duckstein
The Tran and Duckstein (2002) ``distance'' has the following formulation:

$$d^2_{TD}(A, B) = \left[\frac{a+b}{2} - \frac{u+v}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b-a}{2}\right)^2 + \left(\frac{v-u}{2}\right)^2\right], \quad (40)$$

which can be rewritten as

$$d^2_{TD}(x_{ij}, x_{i'j}) = (c_{ij} - c_{i'j})^2 + \frac{1}{3}\left(r_{ij}^2 + r_{i'j}^2\right). \quad (41)$$
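The closed forms in Eqs. (29) and (41) can be checked numerically before turning to the corresponding prototypes. A minimal sketch (function names ours), with intervals coded as (center, radius) pairs:

```python
def wasserstein2(x, y):
    # Eq. (29)/(44): squared Wasserstein distance between uniform intervals.
    return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 / 3.0

def wasserstein2_numeric(x, y, n=100000):
    # Midpoint-rule approximation of the integral in Eq. (29):
    # integrate [x(t) - y(t)]^2 over t in [0, 1] with x(t) = c + r(2t - 1).
    s = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        s += ((x[0] + x[1] * (2 * t - 1)) - (y[0] + y[1] * (2 * t - 1))) ** 2
    return s / n

def tran_duckstein2(x, y):
    # Eq. (41): note that the radii enter through r^2 + r'^2, not (r - r')^2,
    # which is why the self-distance of a non-thin interval is non-zero (Eq. (21)).
    return (x[0] - y[0]) ** 2 + (x[1] ** 2 + y[1] ** 2) / 3.0
```

For x = (1, 1) and y = (3, 2), the Wasserstein closed form gives 4 + 1/3, matching the numeric integral; the Tran–Duckstein self-distance of (1, 1) is 2/3, as in Eq. (21).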
P The representative is obtained as Gj ¼ ðcj ; rj Þ where cj ¼ n1 ni¼1 cij and r j ¼ 0. This means that the prototypes are points and not intervals, or are all thin intervals. 2
5.1.4. L distance Considering a set of interval data described into a space Rp , the metric of norm 2 is defined as 2
dL2 ðxij ; xi0 j Þ ¼ jxlij xli0 j j2 þ jxuij xui0 j j2 ;
2
5.1.5. Wasserstein Wasserstein distance between two intervals, under the hypothesis of uniform distribution, can be computed as 1 2 dWass ðxij ; xi0 j Þ ¼ ðcij ci0 j Þ2 þ ðrij ri0 j Þ2 : 3
ð44Þ
Also in this case we obtain as representative Gj ¼ ðcj ; rj Þ, where P P cj ¼ n1 ni¼1 cij and r j ¼ n1 ni¼1 rij : 5.1.6. Comparison among Euclidean distances Wasserstein vs. Bertoluzza: It can be shown that Wasserstein distance is equal to the Bertoluzza, considering the ct function role we may consider the Bertoluzza as the application of the Wasserstein distance in the fuzzy approach. For example, if cðtÞ ¼ 0 and S ¼ 3 we may rewrite the Bertoluzza distance as ðF 1 ð0Þ G1 ð0ÞÞ2 ðF 1 ð0:5Þ G1 ð0:5ÞÞ2 þ 3 3 1 1 ðF ð1Þ G ð1ÞÞ2 : þ 3
ð45Þ
Wasserstein vs. Coppi and D’Urso: In general, for uniform (crisp) fuzzy variables: 2
dCD ð1Þðxij ; xi0 j Þ ¼ 3dWass ðxij ; xi0 j Þ þ ðr ij r i0 j Þ2 :
ð46Þ
Bertoluzza vs. L2 : The L2 distance between the bounds of the intervals is a particular case of the Bertoluzza distance where cðtÞ ¼ 0 and S ¼ 2 and then corresponds to the Wasserstein distance when the density is equally concentrated on the bounds: 2 2dBert2
¼
2 dL2 ðF; GÞ
¼ ðF
1
1
2
ð0Þ G ð0ÞÞ þ ðF
1
1
2
ð1Þ G ð1ÞÞ :
ð47Þ
All the proposed distances, except for the Bertoluzza, the Coppi–D’Urso and the Wasserstein distance, cannot take into consideration apriori information about how the uncertainty behaves inside the interval. While the Bertoluzza distance is the fuzzycounterpart of the Wasserstein distance, the Coppi and D’Urso distance gives a double emphasis to the range of symmetric fuzzy variables. Considering the center–radius representation of intervals, all the proposed distances let play a different role to the components related with the positions of intervals (the centers) and the component related to their sizes (the radii). From this point of view, we may compare them on the basis of the emphasis that each distance gives to each component. In other words we may compare the distances considering the general formulation: 2
a
b
a aþb
b aþb
Bertoluzza Coppi D’Urso
cðtÞ ¼ 0 and T ¼ 3 k ¼ 1 uniform k ¼ 1=2 triangular pffiffiffi k ¼ p=2 normal k ¼ 2=3 parabolic k ¼ 1=3 square root q¼2 Uniform
1 3 3 3 3 3 1 1
2/3 2 1/2 p=2 8/9 2/9 1 1/3
0.600 0.600 0.857 0.656 0.771 0.931 0.500 0.750
0.400 0.400 0.143 0.344 0.229 0.069 0.500 0.250
ð43Þ
Also in this case we obtain as representative Gj ¼ ðcj ; rj Þ, where P P cj ¼ n1 ni¼1 cij and r j ¼ n1 ni¼1 rij .
2
Parameter
Lq Wasserstein
dL2 ðxij ; xi0 j Þ ¼ 2ðcij ci0 j Þ2 þ 2ðr ij r i0 j Þ2 :
2
Distance
ð42Þ
which can be rewritten as
dBer ðF; GÞ ¼
Table 2 Comparison among Euclidean distances for interval data: how each distance weights the position and the span component in the comparison of two intervals
d ðxij ; xi0 j Þ ¼ aðcij ci0 j Þ2 þ bðrij ri0 j Þ2 :
ð48Þ
We denote the relative weight assigned to the position component a a as aþb and the size component as aþb (Table 2).
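Read through the center–radius representation, the Euclidean-type distances above differ only in the (a, b) weights of the general formulation (48). A minimal sketch of ours (function names are not from the paper), assuming intervals are stored as (center, radius) pairs:

```python
# Squared Euclidean-type distances between intervals given as (center, radius)
# pairs, following the general formulation d^2 = a*(c1 - c2)^2 + b*(r1 - r2)^2.

def d2_general(i1, i2, a, b):
    (c1, r1), (c2, r2) = i1, i2
    return a * (c1 - c2) ** 2 + b * (r1 - r2) ** 2

def d2_wasserstein(i1, i2):      # uniform densities: a = 1, b = 1/3
    return d2_general(i1, i2, 1.0, 1.0 / 3.0)

def d2_l2_bounds(i1, i2):        # L2 between the bounds, Eq. (43): a = 2, b = 2
    return d2_general(i1, i2, 2.0, 2.0)

def d2_bertoluzza(i1, i2):       # gamma(t) = 0 and T = 3: a = 1, b = 2/3
    return d2_general(i1, i2, 1.0, 2.0 / 3.0)

A, B = (5.0, 2.0), (1.0, 1.0)    # centers 5 and 1, radii 2 and 1
print(d2_wasserstein(A, B))      # 16 + 1/3, approximately 16.333
```

The relative weight of the position component is then a/(a + b), e.g. 0.75 for the Wasserstein distance, as reported in Table 2.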
5.2. Hausdorff, L1 and L∞

Considering two intervals, the Hausdorff distance is

d_H(x_ij, x_i′j) = max(|x^l_ij − x^l_i′j|, |x^u_ij − x^u_i′j|) = |c_ij − c_i′j| + |r_ij − r_i′j|.  (49)

While for the Euclidean distances we use their squared version as the d measure in Eq. (32), with the Hausdorff, L1 and L∞ distances we use their formulas directly:

d_L1(x_ij, x_i′j) = |x^l_ij − x^l_i′j| + |x^u_ij − x^u_i′j|,  (50)

while

d_L∞(x_ij, x_i′j) = max(|x^l_ij − x^l_i′j|, |x^u_ij − x^u_i′j|) = d_H(x_ij, x_i′j).  (51)
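The identity in Eq. (49), i.e. that the Hausdorff distance between two intervals equals the sum of the absolute differences of their centers and radii, is easy to check numerically; a small sketch of ours:

```python
# Hausdorff distance between two intervals, in bounds form and in the
# equivalent center/radius form of Eq. (49):
#   max(|l1 - l2|, |u1 - u2|) = |c1 - c2| + |r1 - r2|.

def hausdorff_bounds(l1, u1, l2, u2):
    return max(abs(l1 - l2), abs(u1 - u2))

def hausdorff_cr(c1, r1, c2, r2):
    return abs(c1 - c2) + abs(r1 - r2)

l1, u1, l2, u2 = 0.0, 4.0, 1.0, 9.0
c1, r1 = (l1 + u1) / 2, (u1 - l1) / 2      # 2.0, 2.0
c2, r2 = (l2 + u2) / 2, (u2 - l2) / 2      # 5.0, 4.0
print(hausdorff_bounds(l1, u1, l2, u2), hausdorff_cr(c1, r1, c2, r2))  # 5.0 5.0
```

The identity follows from |l1 − l2| = |(c1 − c2) − (r1 − r2)| and |u1 − u2| = |(c1 − c2) + (r1 − r2)|, since max(|x − y|, |x + y|) = |x| + |y|.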
To compute the representative element of a set of n interval data for the generic jth variable, it is possible to show that, in the case of the Hausdorff distance, the interval G_j = (c_j, r_j) minimizing Eq. (32) with d = d_H is obtained for c_j = Median(c_ij | i = 1, ..., n) and r_j = Median(r_ij | i = 1, ..., n). In the case of the L1 distance between the bounds, by analogy, the minimizing interval is G_j = [x^l_j, x^u_j], where x^l_j = Median(x^l_ij | i = 1, ..., n) and x^u_j = Median(x^u_ij | i = 1, ..., n). From a clustering perspective, it is possible to prove that, unlike the Euclidean-type distances, these distances do not allow the definition of an inertia measure satisfying the Huygens theorem of decomposition of inertia.

5.3. Component-wise metrics

In general, it is difficult to cope with the join and the meet operators in the definition of a prototype: it is not easy to express analytically the optimum reached by Eqs. (32) and (33). It is also difficult to compare these kinds of metrics with the Wasserstein one.

6. Wasserstein distance as a new allocation function in DCA

According to Chavent et al. (2003), the prototype of a class C_h of the partition P is defined as the vector G_h of intervals that minimizes the following function:

f(G_h) = Σ_{x_i ∈ C_h} d^2_Wass(x_i, G_h) = Σ_{x_i ∈ C_h} Σ_{j=1}^p d^2_Wass(x_ji(t_j), G_jh(t_j)).  (52)

As the criterion is additive, we may solve the optimization problem separately for each variable Y_j. Let G_jh = [a_jh, b_jh] be the interval prototype of the class C_h for the jth variable, and let μ_jh and ρ_jh be its midpoint and radius; we then have to solve the following minimization problem for each variable:
f̃(G_jh) = Σ_{x_i ∈ C_h} d^2_Wass(x_ji(t_j), G_jh(t_j)) = Σ_{x_i ∈ C_h} [(m_ji − μ_jh)^2 + (1/3)(r_ji − ρ_jh)^2].  (53)

The minimum of the function is obtained when

μ_jh = |C_h|^{−1} Σ_{x_i ∈ C_h} m_ji,    ρ_jh = |C_h|^{−1} Σ_{x_i ∈ C_h} r_ji,  (54)

where |C_h| is the number of elements in the class C_h.

6.1. A useful property of the Euclidean distances

We prove that the Wasserstein distance can be considered as an inertia measure among data and that it satisfies the well-known Huygens theorem of decomposition of inertia. As the criterion is additive with respect to each variable, it is sufficient to prove the result for the generic variable Y_j. Considering the general prototype G_j = [μ_j − ρ_j, μ_j + ρ_j] of all the intervals for the variable Y_j, with midpoint μ_j and radius ρ_j, we may write the total inertia as

In = Σ_{i=1}^n d^2(x_ji, G_j) = Σ_{i=1}^n (m_ji − μ_j)^2 + (1/3) Σ_{i=1}^n (r_ji − ρ_j)^2.

According to (53), we solve the minimization problem, obtaining the following results:

μ_j = n^{−1} Σ_{i=1}^n m_ji,    ρ_j = n^{−1} Σ_{i=1}^n r_ji.

Considering also that μ_j = n^{−1} Σ_{h=1}^k |C_h| μ_jh and ρ_j = n^{−1} Σ_{h=1}^k |C_h| ρ_jh, and using a little algebra, it is possible to prove that

Σ_{i=1}^n (m_ji − μ_j)^2 + (1/3) Σ_{i=1}^n (r_ji − ρ_j)^2    [Total inertia T_j(k)]
  = Σ_{i=1}^n Σ_{h=1}^k (m_ji − μ_jh)^2 1_ih + (1/3) Σ_{i=1}^n Σ_{h=1}^k (r_ji − ρ_jh)^2 1_ih    [Within inertia W_j(k)]
  + Σ_{h=1}^k |C_h| (μ_jh − μ_j)^2 + (1/3) Σ_{h=1}^k |C_h| (ρ_jh − ρ_j)^2    [Between inertia B_j(k)]

where 1_ih = 1 if x_i ∈ C_h and 1_ih = 0 otherwise. The last result proves that the proposed distance satisfies the Huygens theorem for the decomposition of the total inertia into the sum of a within component and a between component:

Σ_{j=1}^p Σ_{i=1}^n d^2(x_ji, G_j) = Σ_{j=1}^p Σ_{i=1}^n Σ_{h=1}^k d^2(x_ji, G_jh) 1_ih + Σ_{j=1}^p Σ_{h=1}^k |C_h| d^2(G_jh, G_j),  (55)

i.e., T(k) = W(k) + B(k).

The use of the Wasserstein distance in DCA, as an allocation function and as a criterion to be minimized to identify the prototypes, thus permits the usual indexes, generally based on the decomposition of inertia, to be used to validate the clustering procedure. Without loss of generality, all the Euclidean-type distances share the same property.

7. A comparison on artificial and real data

In this section, we present a comparison of the presented distances on the basis of the results obtained using the DCA. The comparison is based on the clusters obtained using the different distances in the criterion minimized by the DCA. We recall that the criterion is the minimization of the intra-class sum of distances from the class prototype (which generalizes the within-cluster inertia); for this reason, the DCA can be considered as a general approach to partitive clustering algorithms.

Given n points described in R^p, the inertia of a set of points in the case of the Euclidean distance is

In = Σ_{i=1}^n Σ_{i′=1}^n d^2_E(x_i, x_i′) = 2n Σ_{i=1}^n Σ_{l=1}^p (x_il − x̄_l)^2.  (56)

By analogy, using Euclidean distances, we may define an inertia between interval data as the following quantity:

In = Σ_{i=1}^n Σ_{i′=1}^n d^2(x_i, x_i′) = 2n Σ_{i=1}^n Σ_{l=1}^p d^2(x_il, G_l),  (57)

where G_l is the barycenter (interval) of the lth interval variable. Considering that all the presented Euclidean-based distances decompose into two parts, the first related to the distances among the centers of the intervals and the second related to the distances among their radii, we may observe the impact of using the different distances. Indeed, setting Δc_ij = |c_ij − c_Gj| and Δr_ij = |r_ij − r_Gj| and taking into account the notation introduced in Eq. (48), we may rewrite the inertia equation (57) for Euclidean-based distances in the general form

In = 2n [ a Σ_{i=1}^n Σ_{j=1}^p (Δc_ij)^2 + b Σ_{i=1}^n Σ_{j=1}^p (Δr_ij)^2 ].  (58)
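The Huygens decomposition T(k) = W(k) + B(k) with the Wasserstein components can be checked numerically; a small sketch of ours, on made-up data for a single interval variable:

```python
import random

# Numerical check of the decomposition total = within + between (Eq. (55))
# for the Wasserstein components on one interval variable; data are made up.
random.seed(0)
n, k = 30, 3
mids = [random.uniform(0.0, 10.0) for _ in range(n)]
rads = [random.uniform(0.5, 3.0) for _ in range(n)]
label = [random.randrange(k) for _ in range(n)]   # an arbitrary partition

def d2_wass(m1, r1, m2, r2):
    return (m1 - m2) ** 2 + (r1 - r2) ** 2 / 3.0

mu = sum(mids) / n                  # global prototype: mean midpoint...
rho = sum(rads) / n                 # ...and mean radius
T = sum(d2_wass(m, r, mu, rho) for m, r in zip(mids, rads))

W = B = 0.0
for h in range(k):
    idx = [i for i in range(n) if label[i] == h]
    if not idx:
        continue
    mu_h = sum(mids[i] for i in idx) / len(idx)   # class prototype, Eq. (54)
    rho_h = sum(rads[i] for i in idx) / len(idx)
    W += sum(d2_wass(mids[i], rads[i], mu_h, rho_h) for i in idx)
    B += len(idx) * d2_wass(mu_h, rho_h, mu, rho)

print(abs(T - (W + B)) < 1e-8)  # True: the decomposition holds exactly
```

The equality holds because the decomposition applies separately to the midpoint and (weighted) radius components, each being an ordinary variance decomposition.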
It is immediate to see that the choice of one distance instead of another can modify the weight of the inertia related to the centers or to the radii in the definition of the clusters.

We use a real dataset and three artificial ones. The real dataset is a climatic dataset containing the temperature intervals (min and max) observed at 60 Chinese stations over the 12 months of 1988 (Long-Term Instrum. Climatic Data Base of the People's Republic of China, http://dss.ucar.edu/dataset/ds578.5/data/); the same data were used by Chavent et al. (2003) for the DCA of interval data. The three artificial datasets contain 100 interval observations for two variables, randomly generated so that Σ_j (Δr_ij)^2 = Σ_j (Δc_ij)^2 (Fig. 1), Σ_j (Δr_ij)^2 = 0.5 Σ_j (Δc_ij)^2 (Fig. 2), and Σ_j (Δr_ij)^2 = 2 Σ_j (Δc_ij)^2 (Fig. 3). The choice of the datasets is motivated by the different variability of the data related to the size of the intervals. We avoid, in this case, the standardization of the data as a preprocessing step, because the three datasets are described by variables with the same scales of measure and a similar variability. The datasets show (see Table 3) a different structure of the inertia related to the centers and to the radii of the observed data.

Fig. 1. Artificial dataset 1: 100 interval data in 2d; the left panel shows a cross representation emphasizing centers and radii, the right panel a rectangular representation of the data.

Fig. 2. Artificial dataset 2.

Fig. 3. Artificial dataset 3.

In the DCA, a central role is played by the initialization of the procedure. We chose to initialize the algorithm by generating a random partition of the observations into k groups. For a fixed k, it is known that two different initializations can provide two different optimal solutions. Thus, for each dataset we fixed a range for the number of clusters (k from 2 to 10) and, for each k, we performed 200 random initializations, storing only the best solution, i.e., the one yielding the minimum value of the DCA criterion.

China temperature dataset: the variability of the China dataset is largely due to the variability of the centers (96% of the sum of the center and radius variability). An outline of the input data table is shown in Table 4; each row represents the vector of interval temperatures observed at a particular station along the 12 months of 1988. The DCA with the Euclidean-based distances, the L1 and the Hausdorff distances is used to cluster the stations into homogeneous clusters. There is no a priori information about a classification structure of the data. We chose k in a range from 3 to 10, performed 200 initializations for each k and, for each distance used in the DCA, stored the best result, i.e., the classification yielding the minimum value of the DCA criterion (the within-cluster sum of distances). To measure how different the clusterings generated by the different distances are, we compare them using the corrected-for-chance Rand index (Hubert and Arabie, 1985), which expresses the agreement between two partitions and is widely used for this aim in the literature: it equals 1 when the two partitions coincide and 0 under maximum disagreement. The analysis of the China dataset shows (Tables 5–8) that the partitions obtained with the different distances are in good agreement, especially among the Euclidean-based distances.
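The DCA procedure used in these experiments alternates an allocation step and a representation step. A compact sketch of ours (the toy data and function names are not from the paper), assuming one interval variable and the Wasserstein criterion of Eq. (53):

```python
import random

# DCA-style dynamic clustering of interval data (one variable for brevity)
# with the squared Wasserstein distance: alternate an allocation step and a
# representation step (prototype = mean midpoint, mean radius, Eq. (54)).
def d2_wass(x, g):
    return (x[0] - g[0]) ** 2 + (x[1] - g[1]) ** 2 / 3.0

def dca(data, k, iters=100, seed=0):
    rng = random.Random(seed)
    protos = rng.sample(data, k)                 # random initial prototypes
    labels = [0] * len(data)
    for _ in range(iters):
        # allocation step: assign each interval to the nearest prototype
        labels = [min(range(k), key=lambda h: d2_wass(x, protos[h]))
                  for x in data]
        # representation step: recompute each class prototype
        new = []
        for h in range(k):
            cl = [x for x, l in zip(data, labels) if l == h]
            if cl:
                new.append((sum(c for c, _ in cl) / len(cl),
                            sum(r for _, r in cl) / len(cl)))
            else:
                new.append(protos[h])            # keep an empty class as is
        if new == protos:                        # criterion can no longer decrease
            break
        protos = new
    return labels, protos

# two well-separated groups of (center, radius) pairs
data = [(0.0, 1.0), (0.5, 1.2), (0.2, 0.9), (10.0, 2.0), (10.4, 2.2), (9.8, 1.8)]
labels, protos = dca(data, 2)
print(labels)
```

With well-separated groups, any initialization here converges to the same two-cluster partition; in practice, as in the paper, several random initializations are run and the best criterion value is kept.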
Table 3
The inertia of the centers and of the radii for the four considered datasets

Dataset       n    vars  Σ_ij (Δc_ij)^2  Σ_ij (Δr_ij)^2  Sum     Centers/Sum  Radii/Sum
China         60   12    32,039          1,411           33,450  0.96         0.04
Artificial 1  100  2     4,924           4,924           9,848   0.50         0.50
Artificial 2  100  2     4,847           2,425           7,273   0.67         0.33
Artificial 3  100  2     4,659           9,315           13,974  0.33         0.67
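The center and radius shares of the inertia reported in Table 3 can be computed as follows; a minimal sketch of ours with made-up data, using the Δc and Δr quantities of Eq. (58):

```python
# Decompose the inertia of one interval variable into its center part and its
# radius part (the shares reported in Table 3).
def inertia_shares(centers, radii):
    n = len(centers)
    cbar = sum(centers) / n
    rbar = sum(radii) / n
    ic = sum((c - cbar) ** 2 for c in centers)   # sum of (Delta c)^2
    ir = sum((r - rbar) ** 2 for r in radii)     # sum of (Delta r)^2
    tot = ic + ir
    return ic / tot, ir / tot

centers = [1.0, 2.0, 3.0, 4.0]
radii = [1.0, 1.0, 1.0, 1.0]                     # no radius variability at all
print(inertia_shares(centers, radii))            # (1.0, 0.0)
```

A dataset like "Artificial 1" is built so that the two shares are both 0.50.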
Table 4
China dataset: monthly temperature intervals [min:max] at 60 stations during 1988

Station    January         February        ...  November       December
AnQing     [1.8:7.1]       [2.1:7.2]       ...  [7.8:17.9]     [4.3:11.8]
BaoDing    [−7.1:1.7]      [−5.3:4.8]      ...  [0.8:14]       [−3.9:5.2]
BeiJing    [−7.2:2.1]      [−5.9:3.8]      ...  [1.5:12.7]     [−4.4:4.7]
BoKeTu     [−23.4:−15.5]   [−24:−14]       ...  [−13.5:−4.2]   [−21.1:−13.1]
ChangChun  [−16.9:−6.7]    [−17.6:−6.8]    ...  [−7.9:2.2]     [−15.9:−7.2]
ChangSha   [2.7:7.4]       [3.1:7.7]       ...  [7.6:19.6]     [4.1:13.3]
...
ZhiJiang   [2.7:8.4]       [2.7:8.7]       ...  [8.2:20]       [5.1:13.3]
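The "[min:max]" cells of Table 4 are readily converted to the (center, radius) representation used by all the distances of Section 5; a one-function sketch of ours:

```python
# Convert a "[min:max]" cell of Table 4 into a (center, radius) pair.
def parse_interval(cell):
    lo, hi = (float(t) for t in cell.strip("[]").split(":"))
    return ((lo + hi) / 2.0, (hi - lo) / 2.0)

print(parse_interval("[-23.4:-15.5]"))   # center -19.45, radius 3.95
```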
Table 5
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 3, the lower triangle shows results for k = 4

            dWass   dL2     dCD(1/2) dCD(1/3) dCD(2/3) dCD(√π/2) dCD(1)  dBer    dL1     dH
dWass       —       1       1        1        1        1         1       1       0.2629  0.2629
dL2         1       —       1        1        1        1         1       1       0.2629  0.2629
dCD(1/2)    0.9479  0.9479  —        1        1        1         1       1       0.2629  0.2629
dCD(1/3)    0.8689  0.8689  0.9176   —        1        1         1       1       0.2629  0.2629
dCD(2/3)    1       1       0.9479   0.8689   —        1         1       1       0.2629  0.2629
dCD(√π/2)   1       1       0.9479   0.8689   1        —         1       1       0.2629  0.2629
dCD(1)      1       1       0.9479   0.8689   1        1         —       1       0.2629  0.2629
dBer        1       1       0.9479   0.8689   1        1         1       —       0.2629  0.2629
dL1         0.8571  0.8571  0.8138   0.7451   0.8571   0.8571    0.8571  0.8571  —       1
dH          0.901   0.901   0.8547   0.7825   0.901    0.901     0.901   0.901   0.9504  —
Table 6
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 5, the lower triangle shows results for k = 6

            dWass   dL2     dCD(1/2) dCD(1/3) dCD(2/3) dCD(√π/2) dCD(1)  dBer    dL1     dH
dWass       —       1       0.9542   0.9542   1        1         0.9542  1       1       0.9666
dL2         1       —       0.9542   0.9542   1        1         0.9542  1       1       0.9666
dCD(1/2)    1       1       —        1        0.9542   0.9542    1       0.9542  0.9542  0.9203
dCD(1/3)    1       1       1        —        0.9542   0.9542    1       0.9542  0.9542  0.9203
dCD(2/3)    1       1       1        1        —        1         0.9542  1       1       0.9666
dCD(√π/2)   1       1       1        1        1        —         0.9542  1       1       0.9666
dCD(1)      1       1       1        1        1        1         —       0.9542  0.9542  0.9203
dBer        1       1       1        1        1        1         1       —       1       0.9666
dL1         1       1       1        1        1        1         1       1       —       0.9666
dH          0.9632  0.9632  0.9632   0.9632   0.9632   0.9632    0.9632  0.9632  0.9632  —
Table 7
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 7, the lower triangle shows results for k = 8

            dWass   dL2     dCD(1/2) dCD(1/3) dCD(2/3) dCD(√π/2) dCD(1)  dBer    dL1     dH
dWass       —       0.889   1        0.9652   0.8252   0.9676    0.9676  0.9676  0.8184  0.7364
dL2         0.7902  —       0.889    0.8585   0.8852   0.9201    0.9201  0.9201  0.8794  0.8124
dCD(1/2)    0.8516  0.8977  —        0.9652   0.8252   0.9676    0.9676  0.9676  0.8184  0.7364
dCD(1/3)    1       0.7902  0.8516   —        0.794    0.9324    0.9324  0.9324  0.8254  0.7384
dCD(2/3)    1       0.7902  0.8516   1        —        0.835     0.835   0.835   0.8369  0.8239
dCD(√π/2)   0.9626  0.7675  0.8154   0.9626   0.9626   —         1       1       0.8498  0.7632
dCD(1)      0.8522  0.839   0.8986   0.8522   0.8522   0.888     —       1       0.8498  0.7632
dBer        0.8153  0.8767  0.9661   0.8153   0.8153   0.8501    0.9328  —       0.8498  0.7632
dL1         0.8223  0.8195  0.8834   0.8223   0.8223   0.8576    0.9695  0.9173  —       0.8851
dH          0.748   0.8069  0.8247   0.748    0.748    0.7775    0.8893  0.8532  0.8742  —
Table 8
Corrected Rand Index for partition agreement (China dataset): the upper triangle shows results for k = 9, the lower triangle shows results for k = 10

            dWass   dL2     dCD(1/2) dCD(1/3) dCD(2/3) dCD(√π/2) dCD(1)  dBer    dL1     dH
dWass       —       0.8774  0.7322   0.7402   0.6933   0.7608    0.7323  0.8073  0.8711  0.7982
dL2         0.5186  —       0.7999   0.8285   0.6013   0.8825    0.8389  0.701   0.7461  0.8706
dCD(1/2)    0.6855  0.695   —        0.9718   0.7751   0.8512    0.86    0.6438  0.8357  0.7959
dCD(1/3)    0.7685  0.6577  0.7956   —        0.773    0.8802    0.8893  0.6471  0.8435  0.8236
dCD(2/3)    0.6011  0.7037  0.8468   0.7225   —        0.649     0.6554  0.5934  0.7966  0.6032
dCD(√π/2)   0.7336  0.7667  0.9028   0.8782   0.8266   —         0.957   0.6559  0.7647  0.8305
dCD(1)      0.6011  0.7037  0.8468   0.7225   1        0.8266    —       0.641   0.7774  0.8081
dBer        0.6519  0.749   0.8791   0.8178   0.8366   0.9198    0.8366  —       0.7454  0.7101
dL1         0.7467  0.6697  0.8078   0.9604   0.7342   0.8911    0.7342  0.8302  —       0.7968
dH          0.5214  0.6243  0.767    0.7032   0.8777   0.7517    0.8777  0.8117  0.7202  —
Table 9
Corrected Rand Index for the artificial datasets (k = 5)

Distance    Artificial 1  Artificial 2  Artificial 3
dWass       0.824         1             0.7885
dL2         0.6991        0.7843        0.7641
dCD(1/2)    1             1             0.9746
dCD(1/3)    1             1             1
dCD(2/3)    0.824         1             0.9041
dCD(√π/2)   0.824         0.8074        0.789
dCD(1)      0.824         0.7962        0.789
dBer        0.824         0.7962        0.789
dL1         0.6991        0.7962        0.7256
dH          0.7696        0.8343        0.7919
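The agreement values reported in Tables 5–9 are corrected-for-chance Rand indexes. A self-contained sketch of the Hubert and Arabie (1985) formula (our own implementation, computed from the contingency table of the two partitions):

```python
from collections import Counter
from math import comb

# Corrected-for-chance (adjusted) Rand index of Hubert and Arabie (1985):
# ARI = (sum_ij C(n_ij, 2) - E) / (max_index - E), where E is the expected
# value of the pair-count statistic under random labelling.
def adjusted_rand_index(part1, part2):
    n = len(part1)
    pairs = Counter(zip(part1, part2))       # contingency table cells
    rows, cols = Counter(part1), Counter(part2)
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in rows.values())
    sum_b = sum(comb(v, 2) for v in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:                # degenerate (trivial) partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

p1 = [0, 0, 0, 1, 1, 1]
p2 = [1, 1, 1, 0, 0, 0]                      # same partition, relabelled
print(adjusted_rand_index(p1, p2))           # 1.0
```

The index equals 1 for identical partitions regardless of labelling and is 0 in expectation for independent random partitions.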
The artificial datasets: the three artificial datasets have been generated so that the intervals form five clusters both for the centers and for the radii, as can be seen in the left panels of Figs. 1–3. This means that each of the three datasets contains five clusters of 20 interval observations described by two interval variables. The results in Table 9 show that, in general, the Euclidean-based distances yield better partitions than the L1 and the Hausdorff distances. Among the Euclidean-based distances, dL2 has the worst performance.

8. Concluding remarks

The choice of a distance measure is an important task when performing a clustering of data. Several proposals have been made for interval data; in this paper we have reviewed the main distances proposed in the literature and introduced a new metric for the distance between intervals. The proposed measure, which extends an existing metric (the Wasserstein metric), has several advantages with respect to those proposed in the literature: it is computed considering the density of points within the intervals, it satisfies the Huygens theorem for the decomposition of inertia, it can be easily computed, and it can be suitably extended to the case where a non-uniform density is defined on the intervals. These characteristics are shared only with the Bertoluzza distance and the Coppi and D'Urso one, but the Wasserstein distance has a further property: it can be used also when different distributions are defined on the compared intervals or, in a fuzzy approach, when the fuzzy numbers have different shapes for their membership functions. For these reasons, the proposed distance can be the basis of new techniques for the analysis of interval data and, more generally, of multivalued quantitative data (such as histogram data, as defined in symbolic data analysis).

References

Bertoluzza, C., Corral, N., Salas, A., 1995. On a new class of distances between fuzzy numbers. Mathware Soft Comput. 2, 71–84.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Bock, H.H., 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Studies in Classification, Data Analysis and Knowledge Organisation. Springer-Verlag.
Chavent, M., De Carvalho, F., Lechevallier, Y., Verde, R., 2003. Trois nouvelles méthodes de classification automatique de données symboliques de type intervalle. Rev. Stat. Appl. 4, 5–29.
De Carvalho, F., 1994. Proximity coefficients between Boolean symbolic objects. In: Proc. 4th Conf. of the International Federation of Classification Societies: New Approaches in Classification and Data Analysis. Springer, pp. 370–378.
De Carvalho, F., 1998. Extension based proximities between constrained Boolean symbolic objects. In: Proc. 5th Conf. of the International Federation of Classification Societies: Data Science, Classification and Related Methods. Springer, pp. 387–394.
De Carvalho, F., Brito, P., Bock, H.H., 2006. Dynamic clustering of interval data based on L2 distance. Comput. Statist. 21 (2), 231–250.
Diday, E., 1971. La méthode des nuées dynamiques. Rev. Statist. Appl. 19 (2), 19–34.
Gibbs, A.L., Su, F.E., 2002. On choosing and bounding probability metrics. Internat. Statist. Rev. 70, 419–435.
Gowda, K.C., Diday, E., 1991. Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24, 567–578.
Gowda, K.C., Ravi, T.V., 1995. Agglomerative clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition Lett. 16, 647–652.
Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218.
Ichino, M., Yaguchi, H., 1994. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Systems Man Cybernet. 24 (4), 698–708.
Irpino, A., Romano, E., 2007. Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. RNTI E-9, 99–110.
Kodratoff, Y., Bisson, G., 1992. The epistemology of conceptual clustering: KBG, an implementation. J. Intell. Inform. Systems 1 (1), 57–84.
Michalski, R., Stepp, R., Diday, E., 1981. A Recent Advance in Data Analysis: Clustering Objects into Classes Characterized by Conjunctive Concepts, vol. 1. North-Holland, New York, pp. 33–56.
Tran, L., Duckstein, L., 2002. Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets Systems 130, 331–341.
Verde, R., Lauro, N., 2000. Basic choices and algorithms for symbolic objects dynamical clustering. In: XXXIIe Journées de Statistique, Fès, Maroc, Société Française de Statistique, pp. 38–42.