A study of divisive clustering with Hausdorff distances for interval data


Pattern Recognition 96 (2019) 106969


Yi Chen a, L. Billard b,∗

a Facebook, Menlo Park, CA 94025, USA
b Department of Statistics, University of Georgia, Athens, GA 30602, USA

Article history: Received 19 February 2019; Revised 30 April 2019; Accepted 15 July 2019; Available online 24 July 2019.

Keywords: Interval data; Divisive clustering; Hausdorff distances; Gowda–Diday distances; Ichino–Yaguchi distances; Span normalization; Euclidean normalization; Local and global normalizations.

Abstract

Clustering methods are becoming key as analysts try to understand what knowledge is buried inside contemporary large data sets. This article analyzes the impact of six different Hausdorff distances on sets of multivariate interval data (where, for each dimension, an interval is defined as an observation [a, b] with a ≤ b and with a and b taking values on the real line R^1), used as the basis for Chavent's [15,16] divisive clustering algorithm. Advantages and disadvantages are summarized for each distance. Comparisons with two other distances for interval data, the Gowda–Diday and Ichino–Yaguchi measures, are included. All have specific strengths depending on the type of data present. Global normalization of a distance is not recommended; and care needs to be taken when using local normalizations to ensure that the features of the underlying data sets are revealed. The study is based on sets of simulated data, and on a real data set.

1. Introduction

This work focuses on clustering for multivariate interval-valued data, where for each dimension interval observations assume the form X = [a, b], with a ≤ b and with a and b taking values on the real line R^1. Such observations are examples of what Diday [24] introduced as symbolic data (observations broadly defined as hypercubes or Cartesian products of distributions in R^p, e.g., lists, intervals, histograms, and the like); see, e.g., Bock and Diday [10], Billard and Diday [7], Diday and Noirhomme-Fraiture [26], and [52], with a non-technical introduction in [6]. With the advent of modern computer capabilities and the attendant proliferation of massively large data sets, it is imperative that methodologies for symbolic data be developed. Indeed, Goodman [29] opined that symbolic data analysis was one of the two most important new avenues appearing over the horizon for the future of statistical science.

Hierarchical clustering can be either divisive or agglomerative; see, e.g., Anderberg [3]. Divisive clustering starts with all the observations in one cluster Ω and successively divides a cluster at each stage until a requisite number of clusters is achieved (with one possibility being that the final stage consists entirely of single-element clusters).



Corresponding author. E-mail address: [email protected] (L. Billard).

https://doi.org/10.1016/j.patcog.2019.106969 0031-3203/© 2019 Elsevier Ltd. All rights reserved.


In contrast, in agglomerative clustering methods, the original clusters are merged two at a time until all the observations belong to a final cluster Ω. Except for pyramidal clustering (e.g., [25] and [11]; not covered in this work), clusters are non-overlapping and exhaustive. Typically, the divisive/merging criteria are based on dissimilarity or distance measures between observations or clusters. Our focus is on divisive clustering. When performing divisive clustering, we can partition the data according to either one single variable (monothetic method) or all variables simultaneously (polythetic method).

Clustering methodology has seen extensive activity for classical data sets in recent decades; good coverage and reviews can be found in, e.g., [3,9,27,30]. Much of the literature deals with agglomerative clustering, with some nice overall reviews in [38,39,49,51,59]. Other important contributions include [1,2,28,47,50,54–57,60,61], among others. Some papers look at questions revolving around initial seeds in k-means methods, e.g., [13,48,53,58]. Indhu and Porkodi [35] conclude that the hierarchical method has greater accuracy than the k-means, density-based or EM algorithms. Chang [14] uses principal components on a mixture of two normal distributions. These are for classical data.

Unfortunately, there are relatively few works on divisive clustering for symbolic data. Indeed, it was not until Chavent [15], Chavent [16] that the first divisive clustering method for interval data was introduced; this was a monothetic method for interval data, with distances adapted from the original Hausdorff [33] distances between point observations.


Yet, given the increasing prevalence of symbolic data sets, especially interval-valued data arising from the aggregation of the massively large data sets produced by modern computer environments, it is becoming increasingly important that methods be developed to handle such data. Even without aggregation, examples abound; e.g., temperatures are recorded as daily minimum and maximum values, and stock market values are recorded as daily (low, high) or (opening, closing) values. Thus, a clustering algorithm applied to interval stock prices might identify which kinds of stocks are similar to each other (such as manufacturing, banking, media, and so on). It is known that analyses based on interval midpoints or means lose critical information (see, e.g., [5,6]); therefore, answers are not necessarily correct when using classical point surrogates. While [42,43] have introduced algorithms for histogram data (a polythetic and a double monothetic algorithm, respectively), the original [15] algorithm remains the prime algorithm for interval data, and it was itself restricted to the Hausdorff distance only. Therefore, the aim herein is to compare different distances (for which no comparisons currently exist) for this well-known and important algorithm.

In contrast, many methods for obtaining partitions of interval data have been developed; see, e.g., [4,17,18,20–23,36,37], among others. More recently, partitioning algorithms for histogram-type data were introduced by, e.g., [46], and [44,45]. Some of these are also based on the Hausdorff distance, some transform the interval to its corresponding center and/or end-points, some use Mallows' L2 distances, and so on.

In our article, Chavent's hierarchical divisive monothetic method is applied to interval-valued data with the implementation based on six different types of Hausdorff distances: the basic Hausdorff distance (of [15]), the Euclidean Hausdorff distance, the Global Span Normalized Hausdorff distance, the Local Span Normalized Hausdorff distance, the Global Normalized Hausdorff distance, and the Local Normalized Hausdorff distance; see Section 2.1.1. To date, the global normalized distances have not been applied to symbolic data, nor have the advantages and disadvantages of these choices been discussed in the literature. As part of this comparison, we apply Chavent's hierarchical divisive monothetic method and the different distances to seven different types of simulated data, in Section 3.

It is noted that Hausdorff distances dominate clustering methodologies for interval data to date, primarily because of their simplicity and intuitive appeal. There are other distances/dissimilarities that could be used for interval data, such as the Gowda–Diday [31,32] dissimilarities, or the Ichino–Yaguchi [34] distances; these can also be used in a divisive clustering method. In our simulation study, these distances are compared with the Hausdorff distances. It is observed that they require longer computing times than the Hausdorff distances, primarily because of the complexity of their definitions. The Hausdorff distance, on the other hand, is easy to understand and to calculate and, more importantly, has lower computing costs than its counterparts. Hence, the primary focus of this paper is on the Hausdorff distance.
Our aim is to provide a comparative study of currently available distances and their extensions for interval-valued data sets, so as to obtain an intuitive sense of what distances may be more appropriate for particular settings. After defining the various Hausdorff distances in Section 2.1.1 and the Gowda–Diday and Ichino–Yaguchi distances in Section 2.1.2, Section 2.2 gives the detailed algorithm of applying Chavent's method to interval-valued data. Simulations are run in Section 3 in order to compare the different distances and to learn their respective advantages and disadvantages. While the algorithm is applicable for any dimension (p), for ease of presentation and illustration, these data sets are p = 2-dimensional. Simulations are also run for p = 10-dimensional observations for each of the different types of data sets studied.

Further, for each of these, uncorrelated and correlated observations are considered. A real data set with p = 13 variables is considered in Section 4. Some concluding remarks are in Section 5. Finally, an alternative aspect that needs attention when calculating local normalizations is discussed in the Supplementary Materials (Section S3).

2. Divisive clustering and Hausdorff distances

Suppose we have a domain Ω ≡ Ω_1 × ··· × Ω_p ⊂ R^p and a set of n interval-valued observations, measured on p random variables, described by X(i) = (X_{i1}, ..., X_{ip}), i = 1, ..., n, with X_{ij} = [a_{ij}, b_{ij}], a_{ij} ≤ b_{ij}, j = 1, ..., p. The goal is to divide Ω into R non-overlapping and exhaustive clusters C_1, ..., C_R, with C_u containing n_u observations. Typically, the clustering process is based on dissimilarities or distances between observations. In Section 2.1.1, the basic Hausdorff [33] distance between interval observations is defined; see Eq. (2.1). We also consider variations of this basic Hausdorff distance, viz., the Euclidean Hausdorff distance of Eq. (2.3), the Span Normalized Euclidean Hausdorff distance of Eq. (2.4), and the Normalized Euclidean Hausdorff distance of Eq. (2.5). Then, in Section 2.1.2, the basic Gowda–Diday [32] distance as well as the basic Ichino–Yaguchi [34] distance are defined. Normalized Euclidean distances for each are defined, as is a Span Normalized Euclidean formulation for the Ichino–Yaguchi distance. Chavent [15,16] introduced a divisive algorithm for interval observations based on the basic Hausdorff distance (see Section 2.2). Later, Kim [40] and Kim and Billard [41] adapted Chavent's algorithm for histogram-valued data (presented at the Symbolic Data Analysis Workshop, Vienna, Austria, October 2009).

2.1. Distances

2.1.1. Hausdorff distances

Chavent [15,16] extended the Hausdorff [33] distance between classical observations to the basic Hausdorff distance between interval observations. We also study five extensions of this basic distance. Thus, for two interval observations X(i_1) and X(i_2), i_1, i_2 = 1, ..., n, in cluster C, we define the basic Hausdorff distance (H) as:

d_j(X(i_1), X(i_2)) = max{ |a_{i_1 j} − a_{i_2 j}|, |b_{i_1 j} − b_{i_2 j}| },  j = 1, ..., p,    (2.1)

with

d(X(i_1), X(i_2)) = \sum_{j=1}^{p} d_j(X(i_1), X(i_2)).    (2.2)

Variations of the Hausdorff distance are the Euclidean Hausdorff (EH), Span Normalized Euclidean Hausdorff (SNEH), and Normalized Euclidean Hausdorff (NEH) distances, given respectively by

d(X(i_1), X(i_2)) = \left[ \sum_{j=1}^{p} [d_j(X(i_1), X(i_2))]^2 \right]^{1/2},    (2.3)

d(X(i_1), X(i_2)) = \left[ \sum_{j=1}^{p} [ |Y_j|^{-1} d_j(X(i_1), X(i_2)) ]^2 \right]^{1/2},    (2.4)

d(X(i_1), X(i_2)) = \left[ \sum_{j=1}^{p} [ H_j^{-1} d_j(X(i_1), X(i_2)) ]^2 \right]^{1/2},    (2.5)

where, for variable X_j, j = 1, ..., p, the span |Y_j| and normalizing factor H_j are, respectively,

|Y_j| = \max_{i \in \Omega_j} \{ b_{ij} \} − \min_{i \in \Omega_j} \{ a_{ij} \},   H_j^2 = \frac{1}{2 n^2} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} [d_j(X(i_1), X(i_2))]^2,   j = 1, ..., p.    (2.6)

In Eqs. (2.3)–(2.5), the d_j(X(i_1), X(i_2)) are the Hausdorff distances as defined in Eq. (2.1). Notice that distances calculated across all p variables (such as Eq. (2.3)) represent a distance between sample observations, whereas the distance of Eq. (2.1) for a single variable j is a distance between intervals. If, in the clustering algorithm, the Hausdorff distance (Eq. (2.1)) is used, we use the same variable (X_j) in Step 1 (see Section 2.2) as for calculating distances in Step 2. If the Euclidean Hausdorff distance (of Eq. (2.3)) is used, distances between observations do not depend on the chosen variable, and hence remain the same across all variables in Step 3. If the Span Normalized Euclidean Hausdorff distance is used, the normalization factor |Y_j| of Eq. (2.4) can either rely on the entire sample space Ω, giving a global |Y_j| value, or be based on the (sub-)cluster observations (C_u), giving a local |Y_j| value that is recalculated at each clustering stage. Similarly, the Normalized Euclidean Hausdorff distance (of Eq. (2.5)) can be calculated using a global or a local normalization value for H_j. Clearly, the global normalized distances are invariant as the algorithm proceeds, but the local normalized distances are not.
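To make Eqs. (2.1)–(2.6) concrete, the following sketch computes the basic Hausdorff distance and its Euclidean, span normalized, and normalized variants for interval data stored as an n × p × 2 array of lower and upper bounds. It is written in Python for illustration only (the authors' implementation is in R [19]); the function names are our own.

```python
import numpy as np

def hausdorff_components(x1, x2):
    """Per-variable Hausdorff distances d_j of Eq. (2.1).

    x1, x2: arrays of shape (p, 2) holding [a_j, b_j] for each variable j.
    """
    return np.maximum(np.abs(x1[:, 0] - x2[:, 0]), np.abs(x1[:, 1] - x2[:, 1]))

def hausdorff(x1, x2):
    """Basic Hausdorff distance of Eq. (2.2): sum over the p variables."""
    return hausdorff_components(x1, x2).sum()

def euclidean_hausdorff(x1, x2):
    """Euclidean Hausdorff distance of Eq. (2.3)."""
    return np.sqrt((hausdorff_components(x1, x2) ** 2).sum())

def spans(data):
    """Spans |Y_j| of Eq. (2.6), computed over the observations in `data` (n, p, 2)."""
    return data[:, :, 1].max(axis=0) - data[:, :, 0].min(axis=0)

def h_factors(data):
    """Normalizing factors H_j of Eq. (2.6)."""
    n = data.shape[0]
    d2 = np.zeros(data.shape[1])
    for i1 in range(n):
        for i2 in range(n):
            d2 += hausdorff_components(data[i1], data[i2]) ** 2
    return np.sqrt(d2 / (2 * n ** 2))

def normalized_euclidean_hausdorff(x1, x2, factors):
    """SNEH (Eq. (2.4)) or NEH (Eq. (2.5)), depending on whether `factors`
    holds the spans |Y_j| or the H_j; a global or a local normalization is
    obtained by passing the full data set or only the current cluster to
    spans()/h_factors()."""
    return np.sqrt(((hausdorff_components(x1, x2) / factors) ** 2).sum())

# Small example with n = 3 observations on p = 2 interval variables.
data = np.array([[[1.0, 2.0], [10.0, 14.0]],
                 [[1.5, 3.0], [11.0, 13.0]],
                 [[8.0, 9.5], [10.5, 12.0]]])
print(hausdorff(data[0], data[2]))
print(euclidean_hausdorff(data[0], data[2]))
print(normalized_euclidean_hausdorff(data[0], data[2], spans(data)))
print(normalized_euclidean_hausdorff(data[0], data[2], h_factors(data)))
```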

2.1.2. Gowda–Diday and Ichino–Yaguchi distances

Gowda and Diday [31], Gowda and Diday [32] introduced a distance (GD distance) between two interval observations X(i_1) and X(i_2), i_1, i_2 = 1, ..., n, in cluster C, as

d_j(X(i_1), X(i_2)) = D_{1j}(X(i_1), X(i_2)) + D_{2j}(X(i_1), X(i_2)) + D_{3j}(X(i_1), X(i_2)),    (2.7)

with

D_{1j}(X(i_1), X(i_2)) = |(b_{i_1 j} − a_{i_1 j}) − (b_{i_2 j} − a_{i_2 j})| / k_j,
D_{2j}(X(i_1), X(i_2)) = (b_{i_1 j} − a_{i_1 j} + b_{i_2 j} − a_{i_2 j} − 2 I_j) / k_j,
D_{3j}(X(i_1), X(i_2)) = |a_{i_1 j} − a_{i_2 j}| / |Y_j|,

for j = 1, ..., p, where the span |Y_j| is as in Eq. (2.6), k_j = |max(b_{i_1 j}, b_{i_2 j}) − min(a_{i_1 j}, a_{i_2 j})|, and

I_j = |max(a_{i_1 j}, a_{i_2 j}) − min(b_{i_1 j}, b_{i_2 j})|  if X(i_1) ∩ X(i_2) ≠ φ,   I_j = 0  if X(i_1) ∩ X(i_2) = φ.

Ichino and Yaguchi [34] developed the (IY) distance for observations X(i_1) and X(i_2) by

d_j(X(i_1), X(i_2)) = |X_j(i_1) ⊕ X_j(i_2)| − |X_j(i_1) ⊗ X_j(i_2)| + γ (2|X_j(i_1) ⊗ X_j(i_2)| − |X_j(i_1)| − |X_j(i_2)|)    (2.8)

where |A| = (b − a) is the length of the interval A = [a, b], and 0 ≤ γ ≤ 0.5 is any pre-specified constant (taken as γ = 0.5 in the sequel), and where X_j(i_1) ⊗ X_j(i_2) and X_j(i_1) ⊕ X_j(i_2) are operators for variable j, defined, for each j, respectively, by

X_{i_1} ⊗ X_{i_2} = [max{a_{i_1}, a_{i_2}}, min{b_{i_1}, b_{i_2}}]  when X_{i_1} ∩ X_{i_2} ≠ φ,   X_{i_1} ⊗ X_{i_2} = 0  when X_{i_1} ∩ X_{i_2} = φ,

X_{i_1} ⊕ X_{i_2} = [min{a_{i_1}, a_{i_2}}, max{b_{i_1}, b_{i_2}}].

Eq. (2.7) and Eq. (2.8) are distances for a single variable j. For distances across all j = 1, ..., p, the summation distance of Eq. (2.2) applies. Normalized Euclidean distances for both the Gowda–Diday distances and the Ichino–Yaguchi distances can be developed as follows:

d(X(i_1), X(i_2)) = \left[ \frac{1}{p} \sum_{j=1}^{p} [d_j(X(i_1), X(i_2))]^2 \right]^{1/2}    (2.9)

where d_j(X(i_1), X(i_2)) is given by Eq. (2.7) for the Normalized Euclidean Gowda–Diday (NEGD) distance and by Eq. (2.8) for the Normalized Euclidean Ichino–Yaguchi (NEIY) distance. The third term in Eq. (2.7) is a span component in the Gowda–Diday distance; so, there is no span distance counterpart for this distance. However, a span counterpart for the Ichino–Yaguchi distance (SNEIY) can be defined as:

d(X(i_1), X(i_2)) = \left[ \sum_{j=1}^{p} [ |Y_j|^{-1} d_j(X(i_1), X(i_2)) ]^2 \right]^{1/2}    (2.10)

where d_j(X(i_1), X(i_2)) is given by Eq. (2.8) and |Y_j| is the span given in Eq. (2.6).

Since in the simulation study the Hausdorff distance of Eq. (2.1), using one variable at a time, performed poorly compared to the Euclidean Hausdorff distance of Eq. (2.3), where all variables are utilized together, the summation Gowda–Diday distances and Ichino–Yaguchi distances over all variables are used in the sequel. It is also noted that, for computing time comparison purposes, the Ichino–Yaguchi distance and the Normalized Euclidean Ichino–Yaguchi distance compare with the Euclidean Hausdorff distance, as they are calculated over all variables and so the span normalization factor does not need to be recalculated at each stage. However, the normalization factors |Y_j| in the Gowda–Diday, Normalized Euclidean Gowda–Diday and Span Normalized Euclidean Ichino–Yaguchi distances are all local normalizations and so are recalculated based on each (sub-)cluster, and hence are not invariant; these compare with the Local Span Normalized Euclidean Hausdorff distance.
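A corresponding sketch for the per-variable Gowda–Diday and Ichino–Yaguchi components of Eqs. (2.7) and (2.8) follows (Python, for illustration only, with γ = 0.5 as used in the sequel; the helper names are ours).

```python
def gowda_diday(x1, x2, span):
    """Gowda-Diday distance of Eq. (2.7) for one variable.

    x1, x2: intervals (a, b); span: the |Y_j| of Eq. (2.6) for this variable.
    """
    a1, b1 = x1
    a2, b2 = x2
    k = max(b1, b2) - min(a1, a2)                    # k_j
    inter = max(0.0, min(b1, b2) - max(a1, a2))      # I_j (0 if the intervals are disjoint)
    d1 = abs((b1 - a1) - (b2 - a2)) / k
    d2 = (b1 - a1 + b2 - a2 - 2 * inter) / k
    d3 = abs(a1 - a2) / span
    return d1 + d2 + d3

def ichino_yaguchi(x1, x2, gamma=0.5):
    """Ichino-Yaguchi distance of Eq. (2.8) for one variable."""
    a1, b1 = x1
    a2, b2 = x2
    join = max(b1, b2) - min(a1, a2)                 # length of the join operator
    meet = max(0.0, min(b1, b2) - max(a1, a2))       # length of the meet operator
    return join - meet + gamma * (2 * meet - (b1 - a1) - (b2 - a2))

# Example for a single variable with span |Y_j| = 10.
print(gowda_diday((1.0, 3.0), (2.0, 6.0), span=10.0))
print(ichino_yaguchi((1.0, 3.0), (2.0, 6.0)))
```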

2.2. Chavent's divisive clustering algorithm for interval data

Our goal is to divide the complete data set Ω into R clusters which are internally as homogeneous as possible and as heterogeneous as possible between clusters. Suppose at the rth stage, Ω consists of the non-overlapping clusters P_r = (C_1^r, ..., C_r^r), with n_u^r observations in cluster C_u^r, u = 1, ..., r, r = 1, ..., R, where R ≤ n is a pre-determined number of clusters sought, i.e., the stopping stage. Let us describe the observations in C_u^r as C_u^r = ({X^{ru}(1), ..., X^{ru}(n_u^r)}). We have

\bigcup_{u} C_u^r = Ω,   C_{u_1}^r ∩ C_{u_2}^r = φ,  u_1 ≠ u_2,   and   \sum_{u=1}^{r} n_u^r = n.

Definition 1. For the cluster C_u^r, the within-cluster variation I(C_u^r) is defined by

I(C_u^r) = \frac{1}{2λ} \sum_{i_1=1}^{n_u^r} \sum_{i_2=1}^{n_u^r} w_{i_1} w_{i_2} d^2(i_1, i_2) ≡ \frac{1}{λ} \sum_{i_1 < i_2} w_{i_1} w_{i_2} d^2(i_1, i_2)    (2.11)

where d^2(i_1, i_2) is a distance or dissimilarity measure between the observations X^{ru}(i_1) and X^{ru}(i_2) in C_u^r, i_1, i_2 = 1, ..., n_u^r, where w_i is the weight associated with the observation X^{ru}(i), and where λ = \sum_{i=1}^{n_u^r} w_i. When observations are evenly weighted, i.e., w_i = 1/n, i = 1, ..., n, Eq. (2.11) becomes

I(C_u^r) = \frac{1}{2 n_u^r n} \sum_{i_1=1}^{n_u^r} \sum_{i_2=1}^{n_u^r} d^2(i_1, i_2).    (2.12)

Definition 2. The total within-cluster variation for partition P_r = (C_1^r, ..., C_r^r) is

W(P_r) = \sum_{u=1}^{r} I(C_u^r)    (2.13)


where I(C_u^r) is the within-cluster variation for C_u^r given in Eq. (2.11).
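For evenly weighted observations, Eqs. (2.12) and (2.13) amount to the following computation (a sketch only; any distance of Section 2.1 may be supplied as `dist`).

```python
def within_cluster_variation(cluster, dist, n_total):
    """I(C_u^r) of Eq. (2.12) with equal weights w_i = 1/n.

    cluster: sequence of observations; dist: a distance function d(x1, x2);
    n_total: n, the total number of observations in the data set.
    """
    nu = len(cluster)
    total = 0.0
    for i1 in range(nu):
        for i2 in range(nu):
            total += dist(cluster[i1], cluster[i2]) ** 2
    return total / (2.0 * nu * n_total)

def total_within_cluster_variation(partition, dist, n_total):
    """W(P_r) of Eq. (2.13): sum of I(C_u^r) over the clusters of the partition."""
    return sum(within_cluster_variation(c, dist, n_total) for c in partition)

# Tiny example with point observations and the absolute-value distance.
pts = [0.0, 1.0, 10.0, 11.0]
partition = [[0.0, 1.0], [10.0, 11.0]]
d = lambda x, y: abs(x - y)
print(total_within_cluster_variation(partition, d, n_total=len(pts)))
```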

Chavent's [15,16] divisive clustering algorithm consists essentially of the following steps:

Step 1. At the rth stage, Ω ≡ P_r = (C_1^r, ..., C_r^r). Within each cluster C_u^r, for each variable j = 1, ..., p, calculate the observation mean for each observation X^{ru}(i) in C_u^r, i.e., \bar{X}_{ij}^{ru} = (a_{ij}^{ru} + b_{ij}^{ru})/2, i = 1, ..., n_u^r. Then, re-order the observations {X^{ru}(1), ..., X^{ru}(n_u^r)} by increasing mean values \bar{X}_{ij}^{ru} to obtain the ordered observations {X^{ru}_{(1)}, ..., X^{ru}_{(n_u^r)}}.

Step 2. Cut each cluster C_u^r at the qth cut-point to give the two sub-clusters C_u^{r+1} = {X^{ru}_{(1)}, ..., X^{ru}_{(q)}} and C_{u+1}^{r+1} = {X^{ru}_{(q+1)}, ..., X^{ru}_{(n_u^r)}}, q = 1, ..., n_u^r. This cut produces a new partition P_{r+1} = (C_1^r, ..., C_{u−1}^r, C_u^{r+1}, C_{u+1}^{r+1}, C_{u+1}^r, ..., C_r^r). Calculate the decrease, Δ_{jq}^{ru}, in the total within-cluster variation going from P_r to P_{r+1} if we use the jth variable and cut-point q for the uth cluster at the rth stage, i.e., for I(C_u^r) given in Eq. (2.11),

Δ_{jq}^{ru} ≡ I(C_u^r) − I(C_u^{r+1}) − I(C_{u+1}^{r+1}).

Step 3. Within the cluster C_u^r, at the rth stage, obtain Δ^{ru} = max_{j,q} {Δ_{jq}^{ru}}, j = 1, ..., p, q = 1, ..., n_u^r.

Step 4. At the rth stage, repeat Steps 1–3 for u = 1, ..., r, and obtain Δ^r = max_u {Δ^{ru}}, u = 1, ..., r.

Step 5. From Steps 3–4, we now have our choice of (u*, j*, q*) such that Δ_{j*q*}^{ru*} = Δ^r. If this maximizer is not unique, i.e., if there is more than one (u*, j*, q*) with Δ_{j*q*}^{ru*} = Δ^r, then select (u**, j**, q**) such that

d_{j**}(X^{ru**}_{q** j**}, X^{ru**}_{q**+1, j**}) = max_{u*, j*, q*} d_{j*}(X^{ru*}_{q* j*}, X^{ru*}_{q*+1, j*})

where d_j(i_1, i_2) is the selected distance from Section 2.1; if the maximizer is unique, then (u**, j**, q**) = (u*, j*, q*). Then, the cut-point value is c_r = (\bar{X}^{ru**}_{q** j**} + \bar{X}^{ru**}_{q**+1, j**})/2, where \bar{X}^{ru}_{ij} is calculated in Step 1.

Step 6. If, at the rth stage, the criterion to cut the cluster C_{u**}^r is "Is \bar{X}^{ru**}_{i j**} ≤ c_r?", i = 1, ..., n_{u**}^r, then the observation i goes into cluster C_{u**}^{r+1}; else, it goes into cluster C_{u**+1}^{r+1}.

Step 7. At the (r + 1)th stage, the rest of the clusters, except for C_{u**}^{r+1} and C_{u**+1}^{r+1}, inherit from the clusters at the rth stage, respectively, i.e., C_1^{r+1} = C_1^r, ..., C_{u**−1}^{r+1} = C_{u**−1}^r, C_{u**+2}^{r+1} = C_{u**+1}^r, ..., C_{r+1}^{r+1} = C_r^r.

Step 8. If r < R ≤ n, where R is the pre-assigned number of clusters, return to Step 1; else, stop.

Step 3 identifies the location q of the cut point and the variable j which divides this particular cluster; and Step 4 identifies which particular cluster C_u^r is to be divided at the rth stage. The code in the R-statistical package to implement this algorithm is given in [19]. In Section 3, this hierarchical divisive monothetic clustering method is applied to interval-valued observations using each of the different distances of Section 2.1. Both classification error rates and computing times are reported.
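A minimal sketch of Steps 1–3 for a single cluster — ordering by interval midpoints and choosing the variable and cut-point that maximize the decrease in within-cluster variation — is given below, assuming equal weights and the basic Hausdorff distance; it illustrates the idea only and is not the authors' R implementation [19].

```python
import numpy as np

def hausdorff(x1, x2):
    """Basic Hausdorff distance of Eqs. (2.1)-(2.2) for (p, 2) interval arrays."""
    return np.maximum(np.abs(x1[:, 0] - x2[:, 0]), np.abs(x1[:, 1] - x2[:, 1])).sum()

def variation(cluster, n_total, dist=hausdorff):
    """I(C) of Eq. (2.12), equal weights."""
    if len(cluster) < 2:
        return 0.0
    total = sum(dist(c1, c2) ** 2 for c1 in cluster for c2 in cluster)
    return total / (2.0 * len(cluster) * n_total)

def best_cut(cluster, n_total, dist=hausdorff):
    """Return (decrease, variable j, cut-point q, cut value c_r) for one cluster."""
    cluster = np.asarray(cluster)                      # shape (n_u, p, 2)
    i_parent = variation(cluster, n_total, dist)
    best = (-np.inf, None, None, None)
    for j in range(cluster.shape[1]):
        means = cluster[:, j, :].mean(axis=1)          # interval midpoints (Step 1)
        order = np.argsort(means)                      # re-order by increasing mean
        for q in range(1, len(cluster)):               # Step 2: try each cut-point
            left, right = cluster[order[:q]], cluster[order[q:]]
            decrease = i_parent - variation(left, n_total, dist) \
                                - variation(right, n_total, dist)
            if decrease > best[0]:
                cut_value = (means[order[q - 1]] + means[order[q]]) / 2.0
                best = (decrease, j, q, cut_value)
    return best

# Two well-separated groups of intervals on p = 2 variables.
data = [np.array([[0.0, 1.0], [5.0, 6.0]]), np.array([[0.5, 1.5], [5.2, 6.1]]),
        np.array([[9.0, 10.0], [5.1, 6.2]]), np.array([[9.5, 10.5], [4.9, 6.0]])]
print(best_cut(data, n_total=len(data)))   # cuts on variable j = 0
```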

Definition 3. The classification error rate (CER) of the mth cluster is the fraction of the observations in that cluster that do not belong to the most common underlying cluster:

E_m = 1 − max_k (p_{mk})

where p_{mk} represents the proportion of observations in the mth cluster that are from the kth underlying cluster. The overall error rate across all clusters C_1, ..., C_r is, for weights w_u = n_u/n for n_u observations in C_u,

E = \sum_{u=1}^{r} w_u E_u.

For example, suppose we have two true clusters with observations C_1 = {X(1), ..., X(10)} and C_2 = {X(11), ..., X(20)}. Suppose the clustering result is C_1* = {X(1), ..., X(10), X(11), ..., X(16)} and C_2* = {X(17), ..., X(20)}. In C_1*, ten observations are from C_1 and six observations are from C_2. Hence, p_{11} = 10/16 = 0.625 and p_{12} = 6/16 = 0.375. Thus, the classification error rate of C_1* is E_1 = 1 − max(0.625, 0.375) = 0.375. All observations in C_2* are from C_2. Hence, p_{21} = 0 and p_{22} = 1. The classification error rate for C_2* is E_2 = 1 − max(0, 1) = 0. The overall error rate is simply the average of E_1 and E_2, weighted by each cluster's size, i.e., E = (16/20) × 0.375 + (4/20) × 0 = 0.3. The classification error rate ranges from 0 to 0.5, with 0 meaning no classification error at all and 0.5 meaning all observations are grouped into the wrong cluster.
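Definition 3 and the worked example above can be reproduced with a few lines (a sketch; observation identifiers and cluster labels are arbitrary).

```python
from collections import Counter

def classification_error_rate(clusters, truth):
    """Overall CER: weighted average of E_m = 1 - max_k p_mk (Definition 3).

    clusters: list of lists of observation ids produced by the algorithm;
    truth: dict mapping each observation id to its true cluster label.
    """
    n = sum(len(c) for c in clusters)
    error = 0.0
    for members in clusters:
        counts = Counter(truth[i] for i in members)
        e_m = 1.0 - max(counts.values()) / len(members)
        error += (len(members) / n) * e_m          # weight w_u = n_u / n
    return error

# The example of the text: C1* holds observations 1-16, C2* holds 17-20.
truth = {i: (1 if i <= 10 else 2) for i in range(1, 21)}
clusters = [list(range(1, 17)), list(range(17, 21))]
print(classification_error_rate(clusters, truth))   # approximately 0.3
```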

3. Simulations

The different distances (of Section 2.1) are used to cluster data by using the Chavent divisive algorithm in each of six different sets of simulated data designed to illustrate six differing types of data sets. Since, as is seen from the relevant tables, the Gowda–Diday distances and the Ichino–Yaguchi distances are considerably slower than the Hausdorff distances, for conciseness, this discussion will tend to focus on the Hausdorff distances per se. However, CER and computing time results, along with the cut variables, are presented in the relevant tables for all eleven distances, and are compared in the Summary of Section 3.8.

The simulations producing these six data sets were based on normal distributions. To study what, if anything, changes for non-normally distributed data, a seventh case using exponentially distributed data is also included. The results were comparable to those for the normally based simulations. All these seven cases were investigated using uncorrelated data. Therefore, each case was repeated for correlated data. Not surprisingly, results were similar to those for the uncorrelated studies (though for some intervals in Case 7 for the global normalized Euclidean Hausdorff distance there were slight differences; compare Tables 16–18). We give details of the correlated simulations for Case 1 (in Section 3.1), but omit the details for the other cases.

Further, simulations were also performed in all cases on both bivariate data and on p = 10-dimensional data, for both uncorrelated and correlated observations. The results for higher dimensions were consistent with those for two dimensions, for the respective cases. However, there were some differences computing time-wise on some of the global and locally normalized distances, as discussed for the Case 1 results in Section 3.1. Again, we give details of this simulation for Case 1. Also, since the Ichino–Yaguchi distance for the Case 3 data gave slightly different results from those for p = 2, we show these results briefly in Section 3.3. For the other cases, the results were the same and so we omit the details. All tables and hierarchies (Tables 1–18 and Figs. 10–12) are provided in the Supplementary Materials (Section S1).

3.1. Case 1: An intuitive example

A random sample of 1000 observations is simulated from each of two bivariate normal distributions N_2(μ_i, Σ_i), i = 1, 2, with μ_1 = (10, 15), μ_2 = (20, 15), Σ_1 = Σ_2 = diag(10, 10). Fig. 1 shows the scatter plot of these points. The distinct clusters are evident. As part of this study, the impact of the number of intervals will be considered. Therefore, these 2000 observations are aggregated into n = 4, 10, 20, 40, and 100 intervals with 500, 200, 100, 50, and 20 points, respectively. For example, if we decide to use 100 points for each interval, we take the first 100 points out of the 1000 points simulated, i.e., X_{11}, ..., X_{100,1} and X_{12}, ..., X_{100,2}; and then, for each variable, set the 5th percentile of the 100 points as the lower bound of the interval and the 95th percentile of the 100 points as the upper bound of the interval. This aggregates the first 100 points into an interval X(1). The process is continued until all the classical points have been aggregated, and we therefore have interval-valued observations {X(1), ..., X(10)} from the distribution N_2(μ_1, Σ_1), and likewise observations {X(11), ..., X(20)} from N_2(μ_2, Σ_2). Thus, there are now five data sets containing n = 4, 10, 20, 40, or 100 interval-valued observations, respectively.
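The simulation-and-aggregation step just described can be sketched as follows (Python used for illustration in place of the authors' R code; the random seed and function names are our own choices).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_case1(points_per_cluster=1000):
    """Draw classical points from the two bivariate normals of Case 1."""
    cov = np.diag([10.0, 10.0])
    c1 = rng.multivariate_normal([10.0, 15.0], cov, size=points_per_cluster)
    c2 = rng.multivariate_normal([20.0, 15.0], cov, size=points_per_cluster)
    return c1, c2

def aggregate(points, points_per_interval):
    """Aggregate consecutive blocks of classical points into intervals:
    lower bound = 5th percentile, upper bound = 95th percentile, per variable."""
    intervals = []
    for start in range(0, len(points), points_per_interval):
        block = points[start:start + points_per_interval]
        lo = np.percentile(block, 5, axis=0)
        hi = np.percentile(block, 95, axis=0)
        intervals.append(np.stack([lo, hi], axis=1))   # shape (p, 2)
    return intervals

c1, c2 = simulate_case1()
# n = 20 intervals in total: 10 per underlying cluster, 100 points each.
data = aggregate(c1, 100) + aggregate(c2, 100)
print(len(data), data[0])
```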


Fig. 1. Simulated data - Case 1.

The divisive monothetic clustering algorithm is applied to each data set using each of the distances of Section 2.1, where we recall that for the Hausdorff distances in Eqs. (2.4) and (2.5) there are two possibilities, depending on whether the local or the global normalizations are used. The CER and computing times for each of these fifty-five combinations are calculated. This process is repeated B = 1000 times. The mean and standard deviation of the B CER and computing time measures are calculated. These are summarized in Table 1 for each of the five data sets (represented by the number of intervals n in the data set) by using each of the different distances for the B = 1000 replications. Here, as in all subsequent tables, H, EH, GSNEH, GNEH, LSNEH, and LNEH refer to the Hausdorff distance over one variable j at a time, the Euclidean Hausdorff distance, the Global Span Normalized Euclidean Hausdorff distance, the Global Normalized Euclidean Hausdorff distance, the Local Span Normalized Euclidean Hausdorff distance, and the Local Normalized Euclidean Hausdorff distance, respectively; and GD, NEGD, IY, NEIY, and SNEIY refer to the Gowda–Diday distance, the Normalized Euclidean Gowda–Diday distance, the Ichino–Yaguchi distance, the Normalized Euclidean Ichino–Yaguchi distance, and the Span Normalized Euclidean Ichino–Yaguchi distance, respectively. Table 2 provides a count of the number of times each variable (either X_1 or X_2) is used for cutting in the clustering algorithm.

It can be seen that all the intervals, except for a few for the GNEH and LNEH distances, are correctly grouped into their underlying clusters by using the different Hausdorff distances. In this example, it is obvious that X_1 should be the correct variable to use for cutting in order to separate the two clusters. However, by looking at Table 2, we see that when the number of intervals is small, e.g., n = 4, and the GSNEH, GNEH, LSNEH and LNEH distances are used, the algorithm will occasionally select X_2 to do the cutting. This is due to the normalization step when calculating these distances.


Fig. 2. Regression fit for Hausdorff computing times - Case 1.

Notice that in Fig. 1, the variation along the X_1 axis is larger than that along the X_2 axis. Hence, the normalization factors |Y_j| and H_j introduced in Eqs. (2.4) and (2.5) are larger when j = 1, i.e., |Y_1| > |Y_2| and H_1 > H_2. For any interval observations X(i_1) and X(i_2) from the left cluster (the cluster with solid blue dots in Fig. 1) and the right cluster (the cluster with hollow red squares in Fig. 1), respectively, d_1(X(i_1), X(i_2)) > d_2(X(i_1), X(i_2)), where d_j(X(i_1), X(i_2)) is the Hausdorff distance between X(i_1) and X(i_2) for variable X_j. However, it is possible that when |Y_1| and H_1 become much larger than |Y_2| and H_2, |Y_1|^{-1} d_1(X(i_1), X(i_2)) < |Y_2|^{-1} d_2(X(i_1), X(i_2)) and H_1^{-1} d_1(X(i_1), X(i_2)) < H_2^{-1} d_2(X(i_1), X(i_2)). Hence, the Normalized distances between X(i_1) and X(i_2), of Eqs. (2.4) and (2.5) (p = 2), will be dominated by |Y_2|^{-1} d_2(X(i_1), X(i_2)) and H_2^{-1} d_2(X(i_1), X(i_2)); i.e., variable X_2 will be used as the cutting variable, since the clustering process believes that the two clusters are more dissimilar to each other relative to the variable X_2, while this is not true according to Fig. 1. Here, the Span Normalized distances perform better than the Normalized distances, since H_1 is much larger than H_2 in this case, while |Y_1| is just a little larger than |Y_2|.

In Table 1, the computing time increases quadratically as the number of intervals increases. For example, by doing a linear regression of computing time on the squared number of intervals, the relationship when using the Hausdorff distance is

Computing Time = 0.00485 + 10^{-2} × 0.084 × (Number of Intervals)^2.

The plot of this regression is shown in Fig. 2, where the red line is the regression through the fitted values. Clustering using the Hausdorff distance is the fastest of the methods used, which is not surprising since, unlike the other distances, the Hausdorff distance utilizes only one variable at a time when calculating distances. The computing times for the EH, GSNEH and GNEH distances are higher and are similar to each other. This is because they utilize all the variables when calculating distances. The LSNEH and LNEH distances need the most time to do the clustering, since they not only utilize all the variables when calculating distances, but also recalculate distances as each new cluster is formed.
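The quadratic relationship reported above corresponds to an ordinary least-squares fit of computing time on the squared number of intervals; a sketch of such a fit is shown below, using synthetic timings generated to follow the stated relationship rather than the study's measured times.

```python
import numpy as np

rng = np.random.default_rng(1)
n_intervals = np.array([4.0, 10.0, 20.0, 40.0, 100.0])

# Synthetic timings following the reported relationship, plus a little noise;
# these are NOT the measured times of the study.
times = 0.00485 + 1e-2 * 0.084 * n_intervals ** 2 + rng.normal(0, 0.01, size=5)

# Ordinary least-squares fit of time on (number of intervals)^2.
X = np.column_stack([np.ones_like(n_intervals), n_intervals ** 2])
intercept, slope = np.linalg.lstsq(X, times, rcond=None)[0]
print(f"time ≈ {intercept:.5f} + {slope:.2e} * n^2")
```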


Table 1. Mean and standard deviation of CER and computing time - Case 1 (for all eleven distances and n = 4, 10, 20, 40, 100; see Supplementary Materials, Section S1).

Table 2. Frequency of variable X_1, X_2 used for cutting - Case 1.

n    | H      | EH     | GSNEH  | GNEH   | LSNEH  | LNEH   | GD     | NEGD   | IY     | NEIY   | SNEIY
4    | 1000,0 | 1000,0 | 998,2  | 921,79 | 998,2  | 921,79 | 999,1  | 1000,0 | 1000,0 | 1000,0 | 1000,0
10   | 1000,0 | 1000,0 | 999,1  | 994,6  | 999,1  | 994,6  | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0
20   | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0
40   | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0
100  | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0 | 1000,0

(Entries give, over the B = 1000 replications, the counts of X_1, X_2 being used as the cutting variable.)

Suppose now, instead of uncorrelated data, a random sample of 1000 observations is simulated from each of two bivariate normal distributions N_2(μ_i, Σ_i), i = 1, 2, with μ_1 = (10, 15), μ_2 = (20, 15), and Σ_1 = Σ_2 with σ_{ii}^2 = 10 and σ_{ij} = 8, i ≠ j, i, j = 1, 2. The analyses are the same as those for the uncorrelated data described above. The mean and standard deviation for the CER values are shown in Table 3. These are almost the same as for the uncorrelated data in Table 1, and so the conclusions there also apply here.

To study the impact of higher dimensions, we repeat the simulations for p = 10-dimensional data with observations drawn from two normal distributions N_{10}(μ_i, Σ_i), i = 1, 2, with μ_1 = (10, 15, ..., 15), μ_2 = (20, 15, ..., 15), and Σ_1 = Σ_2 = diag(10, ..., 10). The means and standard deviations of the CER values and computing times are summarized in Table 4 for the respective distances. It is instructive to compare these results with those for p = 2 in Table 1. First, we observe that the normalized Euclidean Hausdorff distances (both the GNEH and LNEH) have a higher chance of selecting the wrong cutting variable, i.e., the CER is higher, when n is small. This is exacerbated when the dimension increases, as there are more variables competing to be the cutting variable. However, as n increases, the correct cutting variable is more likely to be selected, and so the CER values eventually improve to CER = 0. Except for these two distances, all the other distances perform as well as for the p = 2 case. It is also noted that the computing times for the GD, NEGD and SNEIY distances are much longer when p = 10 than for p = 2. These three distances also had longer computing times compared to the other distances when p = 2, but the differences were in seconds and so made no material difference. However, as the dimension increases, these time differences become worse. Thus, these distances are relatively disadvantageous compared to the other distances.

Fig. 3. Simulated data - Case 2.

3.2. Case 2: A bad example for Hausdorff distance

Now, let us consider a situation where the variances of the two variables are of different magnitudes. The steps for this simulation study are the same as those in Section 3.1, except that the bivariate normal distributions have parameters μ_1 = (14, 15), Σ_1 = diag(0.1, 30) and μ_2 = (16, 15), Σ_2 = diag(0.1, 0.1), respectively. Fig. 3 depicts these distributions. The corresponding interval-valued observations are {X(1), ..., X(n/2)} from the distribution with (μ_1, Σ_1), and {X(n/2 + 1), ..., X(n)} from the distribution with (μ_2, Σ_2), where n = 4, 10, 20, 40 and 100. Table 5 lists the means and standard deviations of CER and computing time for the clustering on each n for each distance, for B = 1000 replications. Table 6 lists the number of times each variable is used for splitting.


In this example, except for the Hausdorff distance, all other distances are able to group the interval observations into their underlying clusters perfectly. Since the Hausdorff distance utilizes just one variable when calculating distances, the calculation of the distance between two observations can easily be dominated by the variable with the larger variance (variable X_2 in this example), and the distance is exaggerated. As a consequence, the within-cluster variation I(C_u^r) (Eq. (2.11)) will be calculated based on X_2, and Δ^{ru} and Δ^r in Step 3 and Step 4 of Section 2 will also be obtained by using X_2. Therefore, X_2 will always be the variable used for cutting, even if it is not the best choice. This can also be observed in Table 6, where X_2 is used for splitting in all B = 1000 replications when the Hausdorff distance is used. Fig. 10 (in S1) is the clustering result for one simulated data set with n = 20 when using the Hausdorff distance. Clearly, the variable X_2 is dominating the clustering process, while the data should better be clustered by variable X_1. Although observations {2, 3, 4, 9} are successfully separated from {11, ..., 20} at stage 3, the clustering process should have stopped at stage r = 2 instead of turning the result into R = 3 clusters.

The CER when using the Hausdorff distance increases when the number of intervals n increases. As mentioned in Section 3.1 for Case 1, if more intervals are generated from a classical data set, that implies fewer points are needed to constitute one interval. Therefore, each interval contains less information about the underlying cluster to which it belongs; and for those intervals belonging to the same underlying cluster, their between-interval variations increase. This will cause the total within-cluster variation (Eq. (2.13)) of each underlying cluster to increase. Therefore, it will be harder for the clustering process to discover the true underlying cluster structure, since the process is always trying to minimize the total within-cluster variation to find the best clustering structure. This suggests that sometimes aggregating a classical data set into too many intervals may not be a good idea. Instead, aggregating the data set into fewer intervals may give better results.

The statistics of Table 5 for computing time are similar to those in Table 1. Computing time increases quadratically when the number of intervals increases. The order of time needed for clustering by distance is {H} < {EH, GSNEH, GNEH} < {LSNEH, LNEH}.

3.3. Case 3: A bad example for Hausdorff/Euclidean Hausdorff distance

Although the Euclidean Hausdorff distance performs well in the Case 2 study, it can still meet with problems when the variance of one variable is extraordinarily larger than for the other variables. Consider the example, shown in Fig. 4, where 1000 observations are randomly drawn from each of two bivariate normal distributions with μ_1 = (14, 15), μ_2 = (16, 15), Σ_1 = Σ_2 = diag(0.1, 30). Aggregation into intervals gives the observations {X(1), ..., X(n/2)} from the distribution with (μ_1, Σ_1), and {X(n/2 + 1), ..., X(n)} from the distribution with (μ_2, Σ_2), for n = 4, 10, 20, 40, 100. As before, the divisive clustering algorithm is applied to each data set for each distance.
Table 7 lists the means and standard deviations of CER and computing time for the clustering over the B = 1000 simulations performed. Table 8 lists the number of times each variable is used for splitting. It is seen that the Hausdorff and Euclidean Hausdorff distances perform worse as the number of intervals n increases, because these distances tend to use X_2 as the splitting variable as the number of intervals increases; when the number of intervals reaches 100, they use X_2 for cutting all the time. However, from Fig. 4, it is obvious that X_1 should be chosen as the cutting variable.


Fig. 4. Simulated Data - Case 3.

Otherwise, the result cannot be correct, since the two underlying clusters have the same mean and variance for variable X_2. All other normalized distances still perform well in this case, since the impact of the large variance of X_2 is eliminated by the normalization.

Consider now p = 10-dimensional observations drawn from two normal distributions N_{10}(μ_i, Σ_i), i = 1, 2, with μ_1 = (14, 15, ..., 15), μ_2 = (16, 15, ..., 15), Σ_1 = Σ_2 = diag(0.1, 30, ..., 30). The corresponding results for the CER values and computing times are as shown in Table 9. The normalized Euclidean Hausdorff distances (GNEH and LNEH) have higher CER values compared to those when p = 2 when n is small, but improve eventually. Also, the IY distance makes a few incorrect clustering results when n is small, which is not reflected in the p = 2-dimensional result. This is an indication that these distances can also be affected by large variances if variables with large variances exist in more dimensions. Nevertheless, they still perform better than do the Hausdorff and Euclidean Hausdorff distances.

3.4. Case 4: A bad example for global (span) normalized Euclidean Hausdorff distance

Here, we want to compare the Global Normalization with the Local Normalization Hausdorff distance. Recall, from Section 2, that the normalizations |Y_j| and H_j from Eqs. (2.4) and (2.5) can either rely on all the observations in the data set, or they can rely on the observations within the cluster that pertains as the algorithm proceeds. Therefore, random samples of 1000 classically valued observations are drawn, respectively, from each of the bivariate normal distributions N_2(μ_i, Σ_i), i = 1, 2, 3, with μ_1 = (2, 5), μ_2 = (7, 5), μ_3 = (50, 5), and Σ_1 = Σ_2 = Σ_3 = diag(1, 1). Fig. 5 shows a plot of these 3000 observations. Clearly, the observations are overdispersed with respect to X_1 and have multiple modes (here three) along the X_1 axis. Aggregating, as before, we obtain observations {X(1), ..., X(n/3)} from the distribution with (μ_1, Σ_1), {X(n/3 + 1), ..., X(2n/3)} from the distribution with (μ_2, Σ_2), and {X(2n/3 + 1), ..., X(n)} from the distribution with (μ_3, Σ_3), for n = 6, 15, 30, 60, 150. Table 10 shows the means and standard deviations of CER and computing time over B = 1000 simulations, and Table 11 shows the number of times each variable is used for splitting at the second stage (the third stage is the final stage and that is where the splitting stops).

Clearly, all distances perform well except for the two Global Normalization distances. The Global Normalization distances tend to choose X_2 to split the data during the second stage, which is evidently inappropriate according to Fig. 5. To illustrate the rationale of their bad performance, an example with a simulated data set will be studied below.


Fig. 5. Simulated data - Case 4.

We obtain the n = 30 interval observations with {X(1), ..., X(10)}, {X(11), ..., X(20)}, and {X(21), ..., X(30)} from the respective distributions. Applying the divisive algorithm with the six Hausdorff distances, we obtain the respective six hierarchies shown in Fig. 11. Except for the two Global Normalized distances (Fig. 11 (C) and (D)), the other distances are able to group the observations successfully into the three underlying clusters. All the distances separate {X(21), ..., X(30)} (hollow round green dots on the right in Fig. 5) from the other observations using X_1 as the cutting variable at the 1st stage of the clustering. Since the data are dispersed along X_1, the normalizations |Y_1| and H_1 are very large at the 1st stage. At the 2nd stage, {X(1), ..., X(20)} are supposed to be divided into two groups, namely {X(1), ..., X(10)} (solid round blue dots on the left in Fig. 5) and {X(11), ..., X(20)} (hollow square red dots on the left in Fig. 5), along X_1. It is obvious that the variation for the variable X_1 will be much smaller after removing the observations {X(21), ..., X(30)} (see Fig. 5). Nevertheless, the GSNEH and GNEH distances will still use the same |Y_1| and H_1 from the 1st stage to calculate distances when separating {X(1), ..., X(20)}. As discussed for Case 1 in Section 3.1, the Global Normalized distances between two observations X(i_1) and X(i_2) (both being in the same cluster at the beginning of stage 2) at the 2nd stage are as given in Eqs. (2.4) and (2.5). However, the values of |Y_1| and H_1 are too large for the 2nd stage. Therefore, d(X(i_1), X(i_2)) will be dominated by variable X_2, since the values of d_1(X(i_1), X(i_2))|Y_1|^{-1} and d_1(X(i_1), X(i_2))H_1^{-1} are consequently too small. Hence, the clustering process will choose variable X_2 for splitting observations, while {X(1), ..., X(20)} are supposed to be divided into two groups by using the variable X_1.

In contrast, the Local Normalized distances adjust the |Y_1| and H_1 values at the 2nd stage according to which observations are in which cluster. After separating {X(21), ..., X(30)} from the other observations at stage 1, |Y_1| and H_1 are much smaller at stage 2 if the LSNEH and LNEH distances are used, since the variation of the left two clusters (in Fig. 5) for X_1 is much smaller. Thus, X_2 is not able to dominate the distances and the cutting process, and hence X_1 is correctly chosen as the splitting variable.

3.5. Case 5: Outliers

We now consider a data set with outliers. Thus, random samples are simulated from each of the bivariate normal distributions N_2(μ_i, Σ_i), i = 1, 2, 3, with μ_1 = (12, 5), μ_2 = (18, 5), Σ_1 = Σ_2 = diag(2, 2) and μ_3 = (10, 30), Σ_3 = diag(0.1, 0.1), respectively. The observations are plotted in Fig. 6. The outliers are evident. Then, 1000 observations from the distribution with (μ_1, Σ_1) are aggregated into intervals {X(1), ..., X(n/2)}, and 1000 observations from the distribution with (μ_2, Σ_2) are aggregated into {X(n/2 + 1), ..., X(n)}, for n = 4, 10, 20, 40, 100.

Fig. 6. Simulated data - Case 5.

The number of observations forming interval X(n + 1) (i.e., random observations selected from the outlier distribution with (μ_3, Σ_3)) is always equal to the number of observations forming interval X(i), i = 1, ..., n. For example, when n = 100, interval X(1) is aggregated from 20 observations from the distribution with (μ_1, Σ_1). Consequently, 20 observations are randomly selected from the distribution with (μ_3, Σ_3) and are aggregated into interval X(5) as the outlier. Table 12 shows the means and standard deviations of the CER and computing time over B = 1000 simulations, and Table 13 shows the number of times each variable is used for splitting at the second stage (the third stage is the final stage and that is where the splitting stops).

All distances perform well in terms of the CER. However, when the number of intervals reaches a larger number (n = 20 here), GSNEH and LSNEH tend to exclude the outlier at the second stage (when variable X_2 is used for cutting, the outlier will be separated). The H and EH distances also carry this tendency, but at a much larger number (n = 100 here). Again, an example will be used to illustrate this better. Fig. 12 is the clustering result of one simulated data set using the six different Hausdorff distances when n = 20. Except for the GSNEH and LSNEH distances (Fig. 12 (C) and (E)), the hierarchies based on the other distances separate out the outlier at the 1st stage. The GSNEH and LSNEH distances do not exclude the outlier until the 2nd stage. This is because the variation for variable X_2 is diluted by the span normalization factor |Y_2| at the 1st stage, and hence X_1 dominates the clustering process at the 1st stage. This is similar to the discussion in Section 3.4 for Case 4. This brings up the potential problem that if the clustering is stopped at stage 2, the outlier cannot be separated out and will be grouped with observations {X(1), ..., X(10)}. The other distances exclude the outlier at the first stage, which is preferable. Therefore, it is always beneficial to continue the clustering process for more steps, even if the target structure (in terms of the number of stages R in the algorithm) has been achieved, so that potential outliers can be eliminated from the main clusters.


Fig. 8. Simulated data - Case 7. Fig. 7. Simulated data - Case 6.

3.6. Case 6: A special example for symbolic data

This simulation study considers a special case where the centers of the underlying clusters overlap. The simulation process is the same as for Case 1 in Section 3.1, except that the bivariate normal distributions used to generate random samples now have parameters μ_1 = μ_2 = (15, 15), Σ_1 = diag(1, 1), Σ_2 = diag(5, 30), respectively. Fig. 7 shows the scatter plot of the simulated data. Table 14 gives the means and standard deviations of the CER and computing time of doing the clustering on n = 4, 10, 20, 40 and 100 aggregated intervals by using the different distances on B = 1000 sets of simulated samples. Table 15 shows the number of times each variable is used as the cutting variable.

All the distances perform similarly to each other. Obviously, it would be very hard for the divisive monothetic clustering method to cluster a data set like this if the observations were treated as classical data points, since the method divides the data into two separate groups at each step and uses just one variable at each step to do the splitting. The CER would most probably be around 0.5, not to mention the tremendous computing time needed to group 2000 observations, since the method stops at every observation and calculates the distances and variations at each stop when performing the clustering algorithm; the time increases quadratically as the number of observations increases. However, by aggregating them into interval observations, the CER can even reach 0.085 when the number of intervals is n = 4. Still, the CERs increase as the number of intervals increases, and arrive at 0.3 when there are 100 intervals. Notice from Table 15 that all distances prefer to use X_2 as the splitting variable. This is reasonable since, looking at Fig. 7, the variable X_2 is the direction in which the big difference between the two clusters is located.

3.7. Case 7: Exponentially distributed intervals

Instead of the normal assumption used for the previous six cases, we now use the exponential distribution. Random samples, shown in Fig. 8, are simulated from each of the bivariate exponential distributions exp_2(λ_i), i = 1, 2, with λ_1 = (1, 1), λ_2 = (0.25, 1), ρ_1 = ρ_2 = diag(1, 1). From Tables 16 and 17, it can be seen that all distances perform well and use the right variable for cutting most of the time. There are some exceptions for the GNEH and LNEH distances, since they may ignore variables with large variances; the reason behind this has been explained in Case 1.

Again, it would be hard to do divisive monothetic clustering on a data set like this if the observations were treated as classical data points, since the data points are tangled together along X_1 even though the correct cutting should happen along X_1; this happens because of the large variance of the second cluster along X_1. Another 1000 simulations are run to add correlation to the variables. The bivariate exponential distributions, i = 1, 2, now have parameters λ_1 = (1, 1), λ_2 = (0.25, 1), ρ_1 = ρ_2 with σ_i^2 = 1 and σ_{ij} = 0.8, i ≠ j, i, j = 1, 2, and results are shown in Table 18. Overall, they are consistent with those of Table 16. All distances perform well except for some cases with the GNEH distances.

3.8. Summary

Based on the observations used in the simulation studies and in the context of these data sets, it is clear that no particular distance can outperform the other distances all the time. Each distance has its limitations. Their advantages and disadvantages can be summarized as follows:

Hausdorff distance: This is the fastest of the distances considered; but a drawback is that it is sensitive to variables with large variances. Thus, the clustering process can easily be controlled by large-variance variables even if they are the wrong choices.

Euclidean Hausdorff distance: This is fast; but it has the same drawback as the Hausdorff distance in that it is sensitive to variables with large variances.

Global span normalized/normalized Euclidean Hausdorff distance: This is fast. However, it can give the wrong result after the 1st stage, especially when underlying clusters are overdispersed and have multiple modes with respect to one variable. [Overdispersed variable: the standard deviation of the overdispersed variable is much larger (e.g., at least 5 times larger) than the other variables' standard deviations.]

Local span normalized Euclidean Hausdorff distance: The major advantage is that it is not sensitive to variables with large variances (the clustering process will not be wrongly controlled by variables with large variances). A disadvantage is that it is slow; variables with outliers may be ignored by the clustering process, and so it cannot identify outliers as early as possible.

Local normalized Euclidean Hausdorff distance: An advantage is that its sensitivity to large variances is better than that of the H and EH distances, but not as good as that of the LSNEH distance (i.e., the clustering process will not be wrongly controlled by variables with large variances most of the time). Also, it is robust to outliers (i.e., variables with outliers will not be ignored during the clustering process).


A disadvantage is that it is slow, and that variables with large variances may be ignored by the clustering process even if they are the right choices.

Gowda–Diday, normalized Euclidean Gowda–Diday, and span normalized Euclidean Ichino–Yaguchi distances: These have similar advantages to the Local Span Normalized Euclidean Hausdorff distance. A drawback is that they are exceptionally slow – computing times for the two Gowda–Diday distances are about seven times longer (when p = 2) than for the Local Span Normalized Euclidean Hausdorff distance, and the computing time for the Span Normalized Euclidean Ichino–Yaguchi distance is about four times longer than for the Local Span Normalized Euclidean Hausdorff distance. This problem is exacerbated as the dimension p increases.

Ichino–Yaguchi distance: An advantage is that it performs consistently well in all cases. A drawback is that it can sometimes be sensitive to variables with large variances when the dimension becomes higher. Computing time is about two times longer than that for the Euclidean Hausdorff distance (when p = 2).

Normalized Euclidean Ichino–Yaguchi distance: This performs similarly to the Euclidean Hausdorff distance. A disadvantage is that computing time is about two times longer than that for the Euclidean Hausdorff distance (when p = 2).

In summary, if computing time is important, then Hausdorff distances should be used. Global Normalized distances should be used cautiously, since they carry the risk of not being able to identify the real variation within each cluster after the 1st stage, especially when some variables are overdispersed compared to others. If we are conscious of variables with significantly larger variances, we can start with the LSNEH and LNEH distances, noting that LSNEH will slightly outperform LNEH. If outliers are observed along some variables, the LNEH distance should be considered. Overall, these two distances perform similarly. If variables' variances are of the same magnitude, the H and EH distances should be preferred, since they are much faster than the LSNEH and LNEH distances and do not have the limitations that the LSNEH and LNEH distances have. If computing time is not important, then the IY distances can be used as an alternative to the Hausdorff distances. Recall that as p increases, so does the computing time; see, e.g., Section 3.1.

4. Application

Table 22 shows temperature data for 60 weather stations of China in 1988. [Tables 19–22 and Fig. 13 are in the Supplementary Materials, Section S2.] The data set consists of minimum and maximum temperatures for each month, variables X_1 − X_{12} representing January - December, and X_{13} is the elevation. Temperatures are in degrees Celsius. Each observation is of equal weight here. These data are extracted from a larger data set which contains observations for many more stations, more variables and more years, and can be found at http://www.rda.ucar.edu/datasets/ds578.5/. For ease of illustration, consider the first n = 15 stations. These data are akin to Case 2 and Case 3 of Section 3, due to the elevation variable having a larger variance than the other variables. When the elevation variable is excluded from the analyses, the data are like those of Case 1.

Each station's average temperatures and variations [max(temperature) − min(temperature)] for the four seasons of 1988 (Winter: December - February, Spring: March - May, Summer: June - August, Autumn: September - November) and annually are shown in Table 19 and Table 20, respectively. The standard deviations for each month are shown in Table 21.
Each station's average temperatures and variations [max(temperature) − min(temperature)] for the four seasons of 1988 (Winter: December–February; Spring: March–May; Summer: June–August; Autumn: September–November) and annually are shown in Tables 19 and 20, respectively. The standard deviations for each month are shown in Table 21. (A short code sketch showing how these seasonal summaries can be computed from the monthly intervals is given below, after the discussion of Fig. 9.) According to these tables, the variances of the twelve temperature variables used here are of the same magnitude, no variable is overdispersed, and no outliers are observed. Therefore, based on the conclusions of Section 3.8, all the distances can be used here and they should give similar results.

Fig. 9 shows the results of clustering the China temperature data for these 15 stations, using each of the Hausdorff distance, the Euclidean Hausdorff distance, the Global and Local Span Normalized Euclidean Hausdorff distances, and the Global and Local Normalized Euclidean Hausdorff distances, based on the p = 12 temperature variables. Dendrograms, up to R = 5 clusters, for the Ichino–Yaguchi distance and the Normalized Euclidean Ichino–Yaguchi distance are the same as for the Euclidean Hausdorff distance; those for the Gowda–Diday distance, the Normalized Euclidean Gowda–Diday distance and the Span Normalized Euclidean Ichino–Yaguchi distance are the same as for the Local Span Normalized Euclidean Hausdorff distance. Thus, here and in subsequent analyses, these plots are not shown.

From Fig. 9, it can be seen that the two partitions obtained using the Hausdorff distance [Fig. 9 (A)] and the Euclidean Hausdorff distance [Fig. 9 (B)] are almost the same, except that station 15 is clustered with station 1 in Fig. 9 (A) and with station 2 in Fig. 9 (B). These clusters are consistent with the annual and seasonal average temperatures shown in Table 19 and the corresponding variations of Table 20. We observe that stations 6–11 are successfully separated from the others, since they are warmer than the other stations across the year. Stations 10 and 11 are further separated from stations 6–9 because they are the hottest stations across the year and their annual temperature variations are the smallest. Stations 1, 2 and 15 are grouped together since they are the coldest stations on annual average and their (large) temperature variations are similar to each other. Although the final result has seven clusters, with the last cut separating stations 1, 2 and 15, the process could have been stopped at R = 6 clusters, since the partitions are becoming increasingly granular and the small differences among stations 1, 2 and 15 do not necessarily warrant a further split. Stations 12, 13 and 14 are grouped together because they are the second coldest stations on annual average and their temperature variations are also similar (large). Stations 4 and 5, with moderate temperatures across the year compared to the other stations, are clustered together. Station 3, LaSa, is a special case: its temperatures are near the median of the stations during winter, spring and autumn and are the lowest during summer, while its temperature variation is the largest during winter and summer and moderate during spring and autumn.

The partitions obtained using the Global Span Normalized Euclidean Hausdorff distance and the Global Normalized Euclidean Hausdorff distance (Fig. 9 (C) and (D)) are the same as that obtained using the Euclidean Hausdorff distance (Fig. 9 (B)). From Table 22, the data are not overdispersed for any of the twelve variables; hence, the GSNEH and GNEH distances give the same result as the EH distance. The partition of Fig. 9 (E), using the Local Span Normalized Euclidean Hausdorff distance, is very similar to Fig. 9 (A) and (B), except that stations 10 and 11 are separated at the 6th stage.
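As noted above, the seasonal summaries of Tables 19 and 20 are simple functions of the monthly intervals of Table 22. The sketch below computes them for a single station: the seasonal average temperature (here taken as the mean of the monthly interval midpoints, an assumption made for illustration; the paper may aggregate the intervals differently) and the seasonal variation max(temperature) − min(temperature). The monthly values shown are hypothetical.

```python
# Months are 1..12; each value is the (min, max) temperature interval for that month.
# The numbers below are hypothetical, for illustration only.
station = {
    1: (-10.2, 1.3),  2: (-8.5, 3.0),  3: (-1.2, 10.4),  4: (5.6, 18.9),
    5: (11.0, 24.3),  6: (16.8, 29.1), 7: (19.5, 31.2),  8: (18.7, 30.0),
    9: (12.4, 25.6), 10: (4.9, 17.8), 11: (-2.3, 9.5),  12: (-9.1, 2.2),
}

SEASONS = {
    "Winter": [12, 1, 2],
    "Spring": [3, 4, 5],
    "Summer": [6, 7, 8],
    "Autumn": [9, 10, 11],
}

def seasonal_summary(monthly):
    """Per-season average (mean of interval midpoints) and variation (max - min)."""
    out = {}
    for season, months in SEASONS.items():
        lows = [monthly[m][0] for m in months]
        highs = [monthly[m][1] for m in months]
        mids = [(lo + hi) / 2.0 for lo, hi in zip(lows, highs)]
        out[season] = {
            "average": sum(mids) / len(mids),
            "variation": max(highs) - min(lows),   # max(temperature) - min(temperature)
        }
    # Annual summary over all twelve months.
    lows = [lo for lo, _ in monthly.values()]
    highs = [hi for _, hi in monthly.values()]
    out["Annual"] = {
        "average": sum((lo + hi) / 2.0 for lo, hi in zip(lows, highs)) / 12.0,
        "variation": max(highs) - min(lows),
    }
    return out

for season, s in seasonal_summary(station).items():
    print(f"{season:7s} average = {s['average']:6.1f}, variation = {s['variation']:5.1f}")
```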
Returning to Fig. 9, the partition of Fig. 9 (F), using the Local Normalized Euclidean Hausdorff distance, is also very similar to Fig. 9 (A) and Fig. 9 (B), except that station 6 and stations {7, 8, 9} are separated after R = 6 clusters. Had the clustering been stopped at R = 6 clusters, the H, EH, LSNEH and LNEH distances would thus have given the same clustering structures.

[Fig. 9. China temperature clustering results based on (X1, . . . , X12).]
Again, the simulation results indicate that if variables have very different variances, or the data are overdispersed with multiple modes along one variable, or there is an extreme outlier along one variable, then different distances can perform differently. However, if none of these special situations exists, all distances should perform similarly. Since the China temperature data based on the twelve temperature variables have none of these features, it is not surprising that all distances generate similar results.

When all p = 13 variables are used, the clusters based on the six Hausdorff distances are shown in Fig. 13 (see S2). Consider the Hausdorff distance. Comparing Fig. 9 (A) and Fig. 13 (A) shows that the biggest difference occurs at the first stage; after this, the elevation variable is used for two further cuts. This is due to the larger variance of the elevation variable compared to those of the other variables. Other than that, most clusters resulting from the p = 13 variables (Fig. 13 (A)) are the same as those found when using only the p = 12 temperature variables (Fig. 9 (A)). An expanded discussion of this analysis is provided in the Supplementary Materials, Section S2. (A sketch of how a monothetic divisive step selects its cut variable, and of why an unnormalized distance lets a large-variance variable such as elevation dominate, is given below.)
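To illustrate the mechanism just described, the following sketch carries out one monothetic bipartition step in the spirit of Chavent's method: for each variable, candidate cuts are placed between consecutive interval midpoints, the cluster is split by the binary question "midpoint ≤ c", and the (variable, cut) pair giving the smallest total within-cluster pairwise squared distance is retained. This is a simplified stand-in for the paper's algorithm, whose exact within-cluster criterion may differ; the distance and span-weight functions mirror the earlier sketch, and the data are hypothetical, with a third, elevation-like variable of much larger scale than the first two.

```python
import math
from itertools import combinations

def eh(x, y, w=None):
    """Euclidean Hausdorff distance with optional per-variable weights (as in the earlier sketch)."""
    p = len(x)
    w = w if w is not None else [1.0] * p
    return math.sqrt(sum((w[j] * max(abs(x[j][0] - y[j][0]), abs(x[j][1] - y[j][1]))) ** 2
                         for j in range(p)))

def span_w(data):
    """Local span weights 1/|max b - min a| per variable, computed on the cluster being split."""
    p = len(data[0])
    return [1.0 / (max(o[j][1] for o in data) - min(o[j][0] for o in data)) for j in range(p)]

def within_cost(cluster, dist):
    """Sum of pairwise squared distances within a cluster (a simple dispersion proxy)."""
    return sum(dist(a, b) ** 2 for a, b in combinations(cluster, 2))

def best_monothetic_split(data, weights=None):
    """One divisive step: try every variable j and every cut c between consecutive interval
    midpoints, split by 'midpoint_j <= c', and keep the split of smallest within-cluster cost."""
    dist = lambda a, b: eh(a, b, weights)
    best = None
    for j in range(len(data[0])):
        mids = sorted({(obs[j][0] + obs[j][1]) / 2.0 for obs in data})
        for c in [(lo + hi) / 2.0 for lo, hi in zip(mids, mids[1:])]:
            left = [o for o in data if (o[j][0] + o[j][1]) / 2.0 <= c]
            right = [o for o in data if (o[j][0] + o[j][1]) / 2.0 > c]
            cost = within_cost(left, dist) + within_cost(right, dist)
            if best is None or cost < best[0]:
                best = (cost, j, c, left, right)
    return best  # (cost, cut variable index, cut value, left cluster, right cluster)

# Hypothetical data: X1, X2 temperature-like, X3 elevation-like with a much larger variance.
data = [
    [(0, 4), (1, 5), (100, 180)],
    [(1, 5), (0, 4), (900, 1200)],
    [(9, 13), (8, 12), (150, 220)],
    [(10, 14), (9, 13), (950, 1300)],
]
_, j, c, _, _ = best_monothetic_split(data)                # plain EH distance
print("unnormalized cut: variable", j + 1, "at", c)        # variable 3 (elevation-like)
_, j, c, _, _ = best_monothetic_split(data, span_w(data))  # local span normalization
print("span-normalized cut: variable", j + 1, "at", c)     # variable 1
```

With the unnormalized distance, the large-scale elevation-like variable determines the cut; with the locally span-normalized distance, the variables are put on a comparable footing and a temperature-like variable drives the split, mirroring the behaviour of the H/EH versus LSNEH/LNEH distances reported above.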

Overall, it is clear that when the Local Normalized distances are used, the result can be quite different from that obtained with the non-normalized Hausdorff distances, especially after several stages of clustering. Unlike the Hausdorff distance and the Euclidean Hausdorff distance, the Local Normalized distances are not much influenced by the elevation variable and treat each variable equally, whether it is a temperature or the elevation. Hence, the LSNEH and LNEH distances give results in which the variables are effectively "equally weighted". The more granular result from the Hausdorff distance is potentially problematic in this case, since it starts to deviate from the other distances after R = 4 clusters. Although the Euclidean Hausdorff distance generates the same result as the Local Normalized distances if the process is stopped at R = 5 clusters, caution should still be taken when using the Euclidean Hausdorff distance on data like this; it is only fortuitous that the variance of the elevation variable is not extreme enough here to cause the Euclidean Hausdorff distance to fail.
Finally, if the clustering process is stopped after three iterations to produce just four clusters, all distances give the same four clusters. These are the same clusters as obtained by Billard and Le-Rademacher [8] in an exploratory analysis based on a principal component analysis of these interval data, corroborating the present technique.

The analysis was also performed for all n = 60 stations in the full data set. Again, because the elevation variable has a larger variance than the other variables, this data set is akin to Case 2 and Case 3 of Section 3. The results are not inconsistent with those obtained when using the first 15 stations illustrated thus far, with differences due to the presence of more observations. A more detailed discussion of the resulting cluster trees is given in Section S2, where a summary of the first three clusters generated by the respective distances is also provided.

5. Conclusion

Chavent [15] introduced a divisive clustering method for interval-valued data based on the basic Hausdorff distance, which has to date been the universally accepted distance for interval data. While distances can be adapted in various ways, such as normalizing or not (via a variety of possible normalizations), global or local weighting options, and so forth, there had been no definitive studies comparing these options. This article addresses that deficiency by comparing six different Hausdorff distances applied to seven different types of data sets, including some with and without outliers, and some with quite different means and/or variances in the underlying data; in each case, data sets of uncorrelated and of correlated observations were studied. These distances were also compared with the corresponding Gowda–Diday and Ichino–Yaguchi distances. Advantages and disadvantages for each distance, relative to the accuracy of separating out the underlying clusters and to computational efficiency, were summarized as they pertain to each data type. In all instances, the computing times for the Gowda–Diday distances and the Ichino–Yaguchi distances were considerably longer than those for the Hausdorff distances. However, distances, such as the basic Hausdorff distance, that focussed on a single variable, though computationally faster, had higher classification error rates than those that used all variables simultaneously, such as the normalized Euclidean distances. As might be expected, which distance measure is "optimal" will tend to depend on the type of data contained in the data sets. However, based on our results, it can be said that, overall, locally normalized distances perform better than globally normalized distances.

As a final point, all these distances rest on the assumption that the aggregated observations within the resulting intervals are uniformly spread across those intervals. One way to check this assumption is to apply the techniques of Cariou and Billard [12]. Further, while the study focussed on the classification error rate and computing time required for the different distances and data sets, if the clustering process was stopped at R = 4 clusters, then there were essentially no differences in the respective dendrograms.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patcog.2019.106969.

References
[1] Z. Abdullah, A.R. Hamdan, Hierarchical clustering algorithms in data mining, Int. J. Comput. Inf. Eng. 9 (2015) 2201–2206.

[2] V.S. Ananthanarayana, M.N. Murty, D.K. Subramanian, Rapid and brief communication: efficient clustering of large data sets, Pattern Recognit. 34 (2001) 2561–2563.
[3] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[4] V. Batagelj, Generalized Ward and related clustering problems, in: H.H. Bock (Ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 1988, pp. 67–74.
[5] L. Billard, M. Mizuta, J. Nakano, Sample covariance functions for complex quantitative data, in: Proceedings World Congress, International Association of Statistical Computing, Japanese Society of Computational Statistics, Japan, 2008, pp. 157–163.
[6] L. Billard, Brief overview of symbolic data and analytic issues, Stat. Anal. Data Min. 4 (2011) 149–156.
[7] L. Billard, E. Diday, Symbolic Data Analysis: Conceptual Statistics and Data Mining, John Wiley, Chichester, 2006.
[8] L. Billard, J. Le-Rademacher, Principal component analysis for interval data, Wiley Interdiscip. Rev. Comput. Stat. 4 (2012) 535–540.
[9] H.H. Bock, Clustering methods: a history of k-means algorithms, in: P. Brito, P. Bertrand, G. Cucumel, F. de Carvalho (Eds.), Selected Contributions in Data Analysis and Classification, Springer, Berlin, 2007, pp. 161–172.
[10] H.H. Bock, E. Diday, Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin, 2000.
[11] M.P. Brito, E. Diday, Pyramidal representation of symbolic objects, in: M. Schader, W. Gaul (Eds.), Knowledge, Data and Computer-Assisted Decisions, Springer-Verlag, Heidelberg, 1990, pp. 3–16.
[12] V. Cariou, L. Billard, Generalization method when manipulating relational databases, Revue des Nouvelles Technologies de l'Information 27 (2015) 59–86.
[13] M.E. Celebi, H.A. Kingravi, P.A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl. 40 (2013) 200–210.
[14] W.C. Chang, On using principal components before separating a mixture of two multivariate normal distributions, Appl. Stat. 32 (1983) 267–275.
[15] M. Chavent, A monothetic clustering method, Pattern Recognit. Lett. 19 (1998) 989–996.
[16] M. Chavent, Criterion-based divisive clustering for symbolic data, in: H.H. Bock, E. Diday (Eds.), Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Berlin, 2000, pp. 299–311.
[17] M. Chavent, F.A.T. de Carvalho, Y. Lechevallier, R. Verde, New clustering methods for interval data, Comput. Stat. 21 (2006) 211–229.
[18] M. Chavent, Y. Lechevallier, Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance, in: K. Jajuga, A. Sokolowski, H.H. Bock (Eds.), Classification, Clustering, and Data Analysis, Springer-Verlag, Berlin, 2002, pp. 53–60.
[19] Y. Chen, Symbolic Data Regression and Clustering, Doctoral dissertation, University of Georgia, 2014.
[20] F.A.T. de Carvalho, P. Brito, H.H. Bock, Dynamic clustering for interval data based on L2 distance, Comput. Stat. 21 (2006) 231–250.
[21] F.A.T. de Carvalho, Y. Lechevallier, R. Verde, Clustering methods in symbolic data analysis, in: E. Diday, M. Noirhomme-Fraiture (Eds.), Symbolic Data Analysis and the SODAS Software, John Wiley, Chichester, 2008, pp. 181–203.
[22] F.A.T. de Carvalho, R.M.C.R. de Souza, Unsupervised pattern recognition models for mixed feature-type symbolic data, Pattern Recognit. Lett. 31 (2010) 430–443.
[23] R.M.C.R. de Souza, F.A.T. de Carvalho, C.P. Tenório, Y. Lechevallier, Dynamic cluster methods for interval data based on Mahalanobis distances, in: D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering, and Data Mining Applications, Springer, Berlin, 2004, pp. 351–360.
[24] E. Diday, Introduction à l'approche symbolique en analyse des données, in: Premières Journées Symbolique-Numérique, CEREMADE, Université de Paris-Dauphine, 1987, pp. 21–56.
[25] E. Diday, P. Bertrand, An extension of hierarchical clustering: the pyramidal presentation, in: E.S. Gelsema, L.N. Kanal (Eds.), Pattern Recognition in Practice II, Elsevier, Amsterdam, 1986, pp. 411–424.
[26] E. Diday, M. Noirhomme-Fraiture, Symbolic Data Analysis and the SODAS Software, John Wiley, Chichester, 2008.
[27] R. Dubes, A.K. Jain, Validity studies in clustering methodologies, Pattern Recognit. 11 (1979) 235–254.
[28] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., Wiley and Sons, 2001.
[29] A. Goodman, Emerging topics and challenges for statistical analysis and data mining, Stat. Anal. Data Min. 4 (2011) 3–8.
[30] A.D. Gordon, Classification, 2nd ed., Chapman and Hall, Boca Raton, 1999.
[31] K.C. Gowda, E. Diday, Symbolic clustering using a new dissimilarity measure, Pattern Recognit. 24 (1991) 567–578.
[32] K.C. Gowda, E. Diday, Symbolic clustering using a new similarity measure, IEEE Trans. Syst. Man Cybern. 22 (1992) 368–378.
[33] F. Hausdorff, Set Theory (translated into English by J.R. Aumann, 1957), Chelsea, New York, 1937.
[34] M. Ichino, H. Yaguchi, Generalized Minkowski metrics for mixed feature-type data analysis, IEEE Trans. Syst. Man Cybern. 24 (1994) 698–708.
[35] R. Indhu, R. Porkodi, Comparison of clustering algorithm, Int. J. Scientif. Res. Comput. Sci. Eng. Inf. Technol. 3 (2018) 218–223.

[36] A. Irpino, R. Verde, Dynamic clustering of interval data using a Wasserstein-based distance, Pattern Recognit. Lett. 29 (2008) 1648–1658.
[37] A. Irpino, R. Verde, F.A.T. de Carvalho, Dynamic clustering of histogram data based on adaptive squared Wasserstein distances, Expert Syst. Appl. 41 (2014) 3351–3366.
[38] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (2010) 651–666.
[39] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999) 264–323.
[40] J. Kim, Dissimilarity Measures for Histogram-valued Data and Divisive Clustering of Symbolic Objects, Doctoral dissertation, University of Georgia, 2009.
[41] J. Kim, L. Billard, Divisive clustering for histogram-valued data, 2010. Manuscript.
[42] J. Kim, L. Billard, A polythetic clustering process for symbolic observations and cluster validity indexes, Comput. Stat. Data Anal. 55 (2011) 2250–2262.
[43] K. Kim, L. Billard, Double monothetic clustering for histogram-valued data, Commun. Stat. Appl. Methods 25 (2018) 263–274.
[44] K. Košmelj, L. Billard, Clustering of population pyramids using Mallows' L2 distance, Metodološki Zvezki 8 (2011) 1–15.
[45] K. Košmelj, L. Billard, Mallows' L2 distance in some multivariate methods and its application to histogram-type data, Metodološki Zvezki 9 (2012) 107–118.
[46] S. Korenjak-Černe, V. Batagelj, B.J. Pavešić, Clustering large data sets described with discrete distributions and its application on TIMSS data set, Stat. Anal. Data Min. 4 (2011) 199–215.
[47] A. Likas, N. Vlassis, J.J. Verbeek, The global k-means clustering algorithm, Pattern Recognit. 36 (2003) 451–461.
[48] J.F. Lu, J.B. Tang, Z.M. Tang, J.Y. Yang, Hierarchical initialization approach for K-means clustering, Pattern Recognit. Lett. 29 (2008) 787–795.
[49] T.S. Madhulatha, An overview of clustering methods, IOSR J. Eng. 2 (2012) 719–725.
[50] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach, Chapman and Hall, 2005.
[51] M. Namratha, T.R. Prajwala, A comprehensive overview of clustering algorithms in pattern recognition, IOSR J. Eng. 4 (2012) 23–30.
[52] M. Noirhomme-Fraiture, M.P. Brito, Far beyond the classical data models: symbolic data analysis, Stat. Anal. Data Min. 4 (2011) 157–170.
[53] J.M. Peña, J.A. Lozano, P. Larrañaga, An empirical comparison of four initialization methods for the K-means algorithm, Pattern Recognit. Lett. 20 (1999) 1027–1040.
[54] P. Praveen, B. Rama, U. Dulhare, A study on monothetic divisive hierarchical clustering method, Int. J. Adv. Scientif. Technol. Eng. Manag. Sci. 3 (2017) 369–372.
[55] M.K. Rafsanjani, Z.A. Varzaneh, N.E. Chukanlo, A survey of hierarchical clustering algorithms, J. Math. Comput. Sci. 5 (2012) 229–240.
[56] D.N. Rajalingam, K. Ranjini, Hierarchical clustering algorithm - a comparative study, Int. J. Comput. Appl. 19 (2011) 42–46.
[57] M.V. Reddy, M. Vivekananda, R.U.V.N. Satish, Divisive hierarchical clustering with K-means and agglomerative hierarchical clustering, Int. J. Comput. Sci. Trends Technol. 5 (2017) 6–11.
[58] S.J. Redmond, C. Heneghan, A method for initialising the k-means clustering algorithm using kd-trees, Pattern Recognit. Lett. 28 (2007) 965–973.
[59] D. Steinley, K-means clustering: a half-century synthesis, Br. J. Math. Stat. Psychol. 59 (2006) 1–34.
[60] Y. Wang, H. Yan, C. Sriskandarajah, The weighted sum of split and diameter clustering, J. Classif. 13 (1996) 231–248.
[61] N.A. Yousri, M.S. Kamel, M.A. Ismail, A distance-related dynamic model for clustering high dimensional data of arbitrary shapes and densities, Pattern Recognit. 42 (2009) 1193–1209.