Neural Networks 46 (2013) 124–132
Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances

Chantal Hajjar*, Hani Hamdan*

Department of Signal Processing and Electronic Systems, École Supérieure d'Électricité (SUPÉLEC), 91190 Gif-sur-Yvette, France
Article history: Received 30 July 2012; received in revised form 22 April 2013; accepted 24 April 2013.

Keywords: Self-organizing maps; Interval data; Mahalanobis distance; Unsupervised learning; Clustering
Abstract

The self-organizing map is a kind of artificial neural network used to map high-dimensional data into a low-dimensional space. This paper presents a self-organizing map for interval-valued data based on adaptive Mahalanobis distances, used to cluster interval data with topology preservation. Two methods based on the batch training algorithm for self-organizing maps are proposed. The first method uses a common Mahalanobis distance for all clusters. In the second method, the algorithm starts with a common Mahalanobis distance for all clusters and then switches to a different distance per cluster, which allows a clustering that is better adapted to the given data set. The performances of the proposed methods are compared and discussed using artificial and real interval data sets.
1. Introduction

In real-world applications, data may not be formatted as single values but may be represented by lists, intervals, distributions, etc. This type of data is called symbolic data. Interval data are a kind of symbolic data that typically reflect the variability and uncertainty in the observed measurements. Many data analysis tools have already been extended to handle interval data in a natural way: principal component analysis (Cazes, Chouakria, Diday, & Schektman, 1997), factor analysis (Chouakria, 1998), multilayer perceptron (Rossi & Conan-Guez, 2002), etc.

Within the clustering framework, several authors have presented clustering algorithms for interval data. In the partitioning clustering framework, Chavent and Lechevallier (2002) proposed a dynamic clustering algorithm for symbolic interval data where the prototypes are defined by the optimization of an adequacy criterion based on the Hausdorff distance. We recall that the dynamic clustering algorithm (Diday, 1971) is a generalized k-means algorithm where the class representative is not limited to the class center but may be of any kind (e.g. an individual of the class, a group of individuals, a probability distribution, etc.). Bock (2003) constructed a self-organizing map (SOM) based on a vertex-type distance for visualizing interval data. Hamdan and Govaert developed a theory on mixture model-based clustering for interval data. In this context, they proposed two interval data-based maximum likelihood approaches: the mixture approach (Hamdan & Govaert, 2003, 2005) and the classification approach (Hamdan & Govaert, 2004).
* Corresponding authors.
E-mail addresses: [email protected], [email protected] (C. Hajjar), [email protected] (H. Hamdan).
http://dx.doi.org/10.1016/j.neunet.2013.04.009
De Souza and De Carvalho (2004) proposed two dynamic clustering methods for interval data. The first method uses an extension of the city-block distance to interval data. The second method is adaptive and has been proposed in two variants: in the first variant the adaptive distance has a single component, whereas it has two components in the second variant. de Souza, de Carvalho, Tenório, and Lechevallier (2004) proposed two dynamic clustering methods for interval data based on the Mahalanobis distance. In both methods, the prototypes are defined by optimizing an adequacy criterion based on an extension of the Mahalanobis distance to interval data. In the first method, the distance used is adaptive and common to all classes, and the prototypes are vectors of intervals. In the second method, each class has its own adaptive distance, and the prototype of each class is then composed of an interval vector and an adaptive distance. El Golli, Conan-Guez, and Rossi (2004) proposed an adaptation of the self-organizing map to interval-valued dissimilarity data by implementing the SOM algorithm for interval-valued dissimilarity measures rather than for individuals-variables interval data. De Carvalho, Brito, and Bock (2006) applied the dynamic clustering algorithm based on the L2 distance to interval data and presented three standardization techniques for interval-type variables. Recently, Hajjar and Hamdan proposed a self-organizing map for individuals-variables interval data (Hajjar & Hamdan, 2011). De Carvalho and Pacifico (2011) extended the batch optimization algorithm for topological maps to interval-valued data using an adaptive L2 distance.

In this paper, we propose two methods to cluster interval-valued data with topology preservation. Both methods build self-organizing maps that cluster interval data using Mahalanobis distances.
Table 1
Comparison between different clustering methods for interval data.

Method | Distance | Result
Dynamic clustering algorithm for interval data (Chavent & Lechevallier, 2002) | Hausdorff distance | Partition
Online SOM extended to interval data (Bock, 2003) | Vertex-type distance | Partition with topology preservation
EM algorithm extended to intervals (Hamdan & Govaert, 2003, 2005) | – | Partition
CEM algorithm extended to interval data (Hamdan & Govaert, 2004) | – | Partition of K clusters
Dynamic clustering algorithm for interval data (De Souza & De Carvalho, 2004) | Adaptive distance based on city-block distance | Partition
Batch SOM algorithm for dissimilarity data^a (El Golli et al., 2004) | Hausdorff and L2 distance | Partition with topology preservation
Dynamic clustering algorithm for interval data (de Souza et al., 2004) | Common Mahalanobis distance for all clusters | Partition
Dynamic clustering algorithm for interval data (de Souza et al., 2004) | Different Mahalanobis distance per cluster | Partition
Dynamic clustering algorithm for interval data (De Carvalho et al., 2006) | L2 distance | Partition
Batch SOM algorithm extended to interval data (Hajjar & Hamdan, 2011) | L2 distance | Partition with topology preservation
Batch Optimization Algorithm for Topological Maps extended to interval data (De Carvalho & Pacifico, 2011) | Adaptive L2 distance | Partition with topology preservation
Proposed Method 1: Batch SOM algorithm extended to interval data | Common Mahalanobis distance for all clusters | Partition with topology preservation
Proposed Method 2: Batch SOM algorithm extended to interval data | Starting with a common Mahalanobis distance for all clusters, ending with a different Mahalanobis distance per cluster | Partition with topology preservation

^a A dissimilarity matrix is required instead of an individuals-variables matrix.
A self-organizing map (SOM) is a kind of artificial neural network used to map high-dimensional data onto a low-dimensional space, usually a two-dimensional map of neurons (Kohonen, 1984). A SOM is trained in an unsupervised way so that similar data vectors are allocated to the same neuron, and two neighboring neurons on the map represent close data vectors in the input space, which ensures topology preservation (see Fig. 1). The first method uses a common Mahalanobis distance for all clusters to compute the best matching units throughout the training process, whereas the second method starts with a common distance for all clusters and switches to a different distance per cluster near the end of the training process, which yields better clustering results. Since the Mahalanobis distance is scale invariant and takes the covariance between variables into account, the clustering is more adaptive and differently shaped clusters can be recognized. Table 1 compares the proposed methods with the clustering methods for interval data listed above.

The paper is organized as follows. In Section 2, we propose two methods to train self-organizing maps for interval data using Mahalanobis distances. In Section 3, we show the results of applying our approach to synthetic and real interval data sets. Finally, in Section 4, we give our conclusion.

2. Self-organizing map for interval data

Let $R = \{R_1, \ldots, R_n\}$ be a set of $n$ symbolic data objects described by $p$ interval variables. Each object $R_i$ is represented by a vector of intervals $R_i = ([a_i^1, b_i^1], \ldots, [a_i^p, b_i^p])^T$, where $[a_i^j, b_i^j] \in I = \{[a, b];\ a \in \mathbb{R},\ b \in \mathbb{R},\ a \leq b\}$. A self-organizing map of $K$ neurons arranged in a rectangular grid is trained using batch training in order to cluster the set $R$ while preserving the topology. To each neuron $k$ $(k = 1, \ldots, K)$ is associated a prototype vector $W_k$ of dimension $p$ whose components are intervals: $W_k = ([u_k^1, v_k^1], \ldots, [u_k^p, v_k^p])^T$. Since each neuron represents a cluster, a partition $P$ of $K$ clusters is formed: $P = \{C_1, \ldots, C_K\}$. Each cluster $C_k$ is represented by the prototype vector $W_k$.
Fig. 1. SOM for classification.
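To make this representation concrete, here is a minimal sketch (our own illustration, assuming NumPy; all array names are ours) that stores the interval objects and the prototype vectors as separate lower-bound and upper-bound matrices. The later sketches in this section reuse this layout.

```python
import numpy as np

rng = np.random.default_rng(0)

# n objects described by p interval variables: R_i = ([a_i^1, b_i^1], ..., [a_i^p, b_i^p])^T.
# Lower bounds a_i^j and upper bounds b_i^j are stored in two (n, p) arrays.
n, p = 1800, 2
R_lower = rng.uniform(0.0, 10.0, size=(n, p))            # a_i^j
R_upper = R_lower + rng.uniform(0.0, 3.0, size=(n, p))   # b_i^j >= a_i^j

# K neuron prototypes W_k = ([u_k^1, v_k^1], ..., [u_k^p, v_k^p])^T use the same layout.
K = 9
init = rng.choice(n, size=K, replace=False)
W_lower, W_upper = R_lower[init].copy(), R_upper[init].copy()
```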
2.1. Best matching unit

The Mahalanobis distance is used to determine the best matching unit (BMU) of an interval data vector $R_i$. It is the neuron $c(i)$ whose prototype vector is closest to the data vector $R_i$; $c(i)$ is determined according to the following equation:

$$\delta(R_i, W_{c(i)}) = \min_{k=1,\ldots,K} \delta(R_i, W_k) \qquad (1)$$

where $\delta(R_i, W_k)$ is the distance between two interval vectors, calculated using the Mahalanobis distance (de Souza et al., 2004):

$$\delta(R_i, W_k) = d_M(R_{iL}, W_{kL}) + d_M(R_{iU}, W_{kU}) \qquad (2)$$

where $R_{iL} = (a_i^1, \ldots, a_i^p)^T$ is the vector of lower bounds of $R_i$, $R_{iU} = (b_i^1, \ldots, b_i^p)^T$ is the vector of upper bounds of $R_i$, $W_{kL} = (u_k^1, \ldots, u_k^p)^T$ is the vector of lower bounds of $W_k$, $W_{kU} = (v_k^1, \ldots, v_k^p)^T$ is the vector of upper bounds of $W_k$, and $d_M$ is the Mahalanobis distance between two vectors:

$$d_M(R_{iL}, W_{kL}) = (R_{iL} - W_{kL})^T M_L (R_{iL} - W_{kL}), \qquad d_M(R_{iU}, W_{kU}) = (R_{iU} - W_{kU})^T M_U (R_{iU} - W_{kU}). \qquad (3)$$

The matrices $M_L$ and $M_U$ are defined as follows:

$$M_L = \left(\det(Q_{poolL})\right)^{1/p} Q_{poolL}^{-1}, \quad \det(Q_{poolL}) \neq 0, \qquad M_U = \left(\det(Q_{poolU})\right)^{1/p} Q_{poolU}^{-1}, \quad \det(Q_{poolU}) \neq 0 \qquad (4)$$
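A possible implementation of Eqs. (1)–(3) is sketched below, assuming NumPy and the array layout sketched above. The matrices $M_L$ and $M_U$ are taken here as inputs (they are built from the pooled covariance matrices of Eqs. (4)–(5) below), and the function names are ours.

```python
import numpy as np

def interval_mahalanobis(Ri_low, Ri_up, Wk_low, Wk_up, M_L, M_U):
    """delta(R_i, W_k) of Eq. (2): sum of the two quadratic forms of Eq. (3)."""
    dl = Ri_low - Wk_low
    du = Ri_up - Wk_up
    return dl @ M_L @ dl + du @ M_U @ du

def find_bmu(Ri_low, Ri_up, W_lower, W_upper, M_L, M_U):
    """Best matching unit c(i) of Eq. (1): index k minimizing delta(R_i, W_k)."""
    dists = [interval_mahalanobis(Ri_low, Ri_up, W_lower[k], W_upper[k], M_L, M_U)
             for k in range(len(W_lower))]
    return int(np.argmin(dists))
```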
where $Q_{poolL}$ and $Q_{poolU}$ are the pooled covariance matrices:

$$Q_{poolL} = \frac{(n_1 - 1) S_{1L} + \cdots + (n_K - 1) S_{KL}}{n_1 + \cdots + n_K - K}, \qquad Q_{poolU} = \frac{(n_1 - 1) S_{1U} + \cdots + (n_K - 1) S_{KU}}{n_1 + \cdots + n_K - K} \qquad (5)$$

where $S_{kL}$ is the covariance matrix of the set of vectors $\{R_{iL} : i \in C_k\}$, $S_{kU}$ is the covariance matrix of the set of vectors $\{R_{iU} : i \in C_k\}$, and $n_k$ is the cardinality of $C_k$.

Since the matrices $M_L$ and $M_U$ are the same for all clusters, the distance used is common to all clusters. A more adaptive version consists of replacing the matrices $M_L$ and $M_U$ by matrices that change with each cluster, which leads to a different distance per cluster:

$$\delta(R_i, W_{c(i)}) = \min_{k=1,\ldots,K} \left[ (R_{iL} - W_{kL})^T M_{kL} (R_{iL} - W_{kL}) + (R_{iU} - W_{kU})^T M_{kU} (R_{iU} - W_{kU}) \right]. \qquad (6)$$

The matrices $M_{kL}$ and $M_{kU}$ associated with cluster $C_k$ are expressed as follows:

$$M_{kL} = \left(\det(Q_{kL})\right)^{1/p} Q_{kL}^{-1}, \quad \det(Q_{kL}) \neq 0, \qquad M_{kU} = \left(\det(Q_{kU})\right)^{1/p} Q_{kU}^{-1}, \quad \det(Q_{kU}) \neq 0 \qquad (7)$$

where $Q_{kL}$ is the covariance matrix of the set of vectors $\{R_{iL} : i \in C_k\}$ and $Q_{kU}$ is the covariance matrix of the set of vectors $\{R_{iU} : i \in C_k\}$.

2.2. Computing the prototype vectors

At each iteration $t$, all the input vectors are presented to the map and each prototype vector $W_k$ $(k = 1, \ldots, K)$ is replaced by a weighted mean over the data vectors $R_i$ $(i = 1, \ldots, n)$, where the weights are the neighborhood function values:

$$W_k(t+1) = \frac{\sum_{i=1}^{n} h_{kc(i)}(\sigma(t))\, R_i}{\sum_{i=1}^{n} h_{kc(i)}(\sigma(t))}, \qquad (k = 1, \ldots, K) \qquad (8)$$

where $h_{kc(i)}(\sigma(t))$ is the Gaussian neighborhood function between the neuron $k$ and the BMU $c(i)$ of the input vector $R_i$, defined in Eq. (9):

$$h_{ck}(\sigma(t)) = \exp\left(-\frac{d^2(r_c, r_k)}{2\sigma^2(t)}\right) = \exp\left(-\frac{\|r_c - r_k\|^2}{2\sigma^2(t)}\right) \qquad (9)$$

where $r_c$ and $r_k$ are respectively the locations (line number, column number) of neuron $c$ and neuron $k$ on the grid, and $\sigma(t)$, the width of the Gaussian function, is the neighborhood radius at iteration $t$. All input vectors having the same BMU have the same value of the neighborhood function. In the last few iterations of the algorithm, when the neighborhood radius tends to zero, the neighborhood function $h_{kc(i)}(\sigma(t))$ is equal to 1 only if $k = c(i)$ ($k$ is the BMU of input vector $R_i$) and 0 otherwise. The input data set is then clustered into $K$ clusters. The prototype of each cluster $C_k$ is the neuron $k$ whose prototype vector $W_k$ is a mean of the data vectors belonging to that class (de Souza et al., 2004). This implies that the updating formula of Eq. (8) will minimize, at convergence of the algorithm, the adequacy clustering criterion:

$$G = \sum_{k=1}^{K} \sum_{R_i \in C_k} \delta(R_i, W_k). \qquad (10)$$

In addition, using the values of the neighborhood function as weights in the weighted mean defined in Eq. (8) preserves the data topology.

2.3. SOM training algorithms for interval data

In the following, we present two training algorithms for self-organizing maps applied to interval-valued data. In the first training algorithm, the SOM is trained using a common distance for all clusters. The second training algorithm is divided into two phases: in the first phase a common distance for all clusters is used, while in the second phase a different distance per cluster is used. The number of iterations of the first phase constitutes 90% of the total number of iterations, since only a few iterations are needed for the second phase. This process allows a clustering that is better adapted to the given data set. A common distance is used in the first phase because the algorithm may not converge when a different distance per cluster is used from the beginning.

2.3.1. SOM training algorithm for interval data with common Mahalanobis distance for all clusters (‘‘intSOM_MCDC’’)

1. Initialization: $t = 0$
• Choose the map dimensions (lines, cols). The number of neurons is $K = lines \cdot cols$.
• Choose the initial value ($\sigma_{init}$) and the final value ($\sigma_{final}$) of the neighborhood radius.
• Choose the total number of iterations (totalIter).
• Randomly choose an initial partition $P_0 = \{C_1, \ldots, C_K\}$.
• Compute the initial prototype vectors. Each prototype vector $W_k$ $(k = 1, \ldots, K)$ is the mean of the data vectors belonging to cluster $C_k$.
• Set the matrices $M_L$ and $M_U$ to the identity matrix.
2. Allocation:
• Compute the matrices $M_L$ and $M_U$ according to Eq. (4).
• For $i = 1$ to $n$, compute the best matching unit $c(i)$ of the input vector $R_i$ according to Eq. (1). (A new partition $P_t = \{C_1, \ldots, C_K\}$ is generated; all data vectors having the same BMU belong to the same cluster.)
• For $k = 1$ to $K$, compute the values of the neighborhood function $h_{kc(i)}(\sigma(t))$, $(i = 1, \ldots, n)$, according to Eq. (9).
3. Training:
• For $k = 1$ to $K$, update the prototype vectors $W_k$ of the map using Eq. (8).
4. Increment $t$ and reduce the neighborhood radius $\sigma(t)$ according to Eq. (11). Repeat from step 2 until $t$ reaches the maximum number of iterations (totalIter).

$$\sigma(t) = \sigma_{init} + \frac{t}{totalIter} \cdot (\sigma_{final} - \sigma_{init}). \qquad (11)$$
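The remaining ingredients of the batch algorithm — the pooled matrices of Eqs. (4)–(5), the neighborhood function of Eq. (9) and the radius schedule of Eq. (11) — might be sketched as follows. This is a hedged illustration assuming NumPy: `np.cov` with `rowvar=False` gives the sample covariance matrices $S_{kL}$, $S_{kU}$, and returning `None` for singular matrices is our own convention, mirroring the N.B. below.

```python
import numpy as np

def pooled_matrix(R_bounds, labels, K):
    """Pooled covariance Q_pool of Eq. (5) and the matrix M of Eq. (4) for one bound (lower or upper)."""
    p = R_bounds.shape[1]
    num = np.zeros((p, p))
    denom = 0
    for k in range(K):
        Xk = R_bounds[labels == k]
        nk = len(Xk)
        if nk > 1:
            num += (nk - 1) * np.cov(Xk, rowvar=False)   # (n_k - 1) * S_k
        denom += nk
    Q_pool = num / (denom - K)
    det = np.linalg.det(Q_pool)
    if det <= 0:          # singular case; the paper keeps the matrices of the preceding iteration instead
        return None
    return det ** (1.0 / p) * np.linalg.inv(Q_pool)

def neighborhood(grid_pos, bmu, sigma):
    """Gaussian neighborhood h_{k c(i)}(sigma(t)) of Eq. (9), evaluated for every neuron k at once."""
    sq = np.sum((grid_pos - grid_pos[bmu]) ** 2, axis=1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def radius(t, total_iter, sigma_init, sigma_final):
    """Linearly decreasing neighborhood radius sigma(t) of Eq. (11)."""
    return sigma_init + (t / total_iter) * (sigma_final - sigma_init)
```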
N.B.: If in the current iteration it is impossible to calculate the matrix $M_L$ or the matrix $M_U$ because of matrix inversion problems, we replace it by the matrix of the preceding iteration.

2.3.2. SOM training algorithm for interval data with a different Mahalanobis distance per cluster (‘‘intSOM_MDDC’’)

The algorithm of the ‘‘intSOM_MDDC’’ method is the same as that of the ‘‘intSOM_MCDC’’ method except during the second phase, where the matrices $M_L$ and $M_U$ are replaced by the matrices $M_{kL}$ and $M_{kU}$ calculated in Eq. (7). The number of iterations of the first phase, IterPhase1, is set at the initialization of the algorithm.
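Putting the pieces together, the training procedure of Section 2.3.1 could look roughly like the sketch below. It reuses the `interval_mahalanobis`, `find_bmu`, `pooled_matrix`, `neighborhood` and `radius` helpers sketched above; `interval_batch_som` is our own name, and the final comment marks where the ‘‘intSOM_MDDC’’ variant of Section 2.3.2 would switch to per-cluster matrices.

```python
import numpy as np

def interval_batch_som(R_lower, R_upper, lines=3, cols=3, total_iter=600,
                       sigma_init=3.0, sigma_final=0.1, seed=0):
    """Batch SOM for interval data with a common Mahalanobis distance (intSOM_MCDC-style sketch)."""
    rng = np.random.default_rng(seed)
    n, p = R_lower.shape
    K = lines * cols
    grid_pos = np.array([(r, c) for r in range(lines) for c in range(cols)], dtype=float)

    # Step 1 -- initialization: random balanced partition, prototypes = cluster means, M_L = M_U = I.
    labels = rng.permutation(np.arange(n) % K)
    W_lower = np.array([R_lower[labels == k].mean(axis=0) for k in range(K)])
    W_upper = np.array([R_upper[labels == k].mean(axis=0) for k in range(K)])
    M_L, M_U = np.eye(p), np.eye(p)

    for t in range(total_iter):
        # Step 2 -- allocation: update M_L, M_U (Eqs. (4)-(5)); keep the previous matrices if singular (N.B.).
        new_ML, new_MU = pooled_matrix(R_lower, labels, K), pooled_matrix(R_upper, labels, K)
        M_L = new_ML if new_ML is not None else M_L
        M_U = new_MU if new_MU is not None else M_U
        # Assign every data vector to its best matching unit (Eq. (1)).
        labels = np.array([find_bmu(R_lower[i], R_upper[i], W_lower, W_upper, M_L, M_U)
                           for i in range(n)])

        # Step 3 -- training: weighted means of Eq. (8) with neighborhood weights of Eq. (9).
        sigma = radius(t, total_iter, sigma_init, sigma_final)
        H = np.stack([neighborhood(grid_pos, c, sigma) for c in labels])   # H[i, k] = h_{k c(i)}(sigma)
        denom = H.sum(axis=0)
        for k in range(K):
            if denom[k] > 1e-12:   # our safeguard against numerically empty neighborhoods
                W_lower[k] = H[:, k] @ R_lower / denom[k]
                W_upper[k] = H[:, k] @ R_upper / denom[k]
        # (For the intSOM_MDDC variant, after 90% of the iterations M_L/M_U would be replaced by
        #  per-cluster matrices M_kL/M_kU built from the cluster covariances Q_kL/Q_kU as in Eq. (7).)
    return labels, W_lower, W_upper
```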
2.4. SOM quality evaluation

Two criteria are used to evaluate the map quality and the clustering results: the topographic error (tpe) and the data classification error (dce). The topographic error measures the degree of topology preservation. It is the proportion of data vectors for which the first and second BMUs are not adjacent units (Kiviluoto, 1996). The first BMU of a data vector is the BMU defined in Section 2.1; the second BMU is the neuron whose prototype vector is closest to the data vector after the first BMU. The data classification error is the percentage of misclassified data vectors. It requires prior knowledge of the true classes of the data and is calculated by taking into account the label switching problem that may arise with unsupervised classification.

3. Experiments

Two data sets are used in the experiments. The first one is an artificial interval data set. The second one is a real interval temperature data set.

3.1. Artificial interval data set

In order to simulate an artificial interval data set, we first simulate a single-valued quantitative data set of $n$ points $x_i \in \mathbb{R}^2$, then we make their positions imprecise by generating $\tilde{x}_i$, and finally we construct the rectangles representing the intervals $R_i$ having the $\tilde{x}_i$ as centers (Hamdan, 2005). The artificial single-valued data set consists of $n = 1800$ points equally distributed among $K = 9$ clusters (see Fig. 2). Each cluster $C_k$ is generated according to a bivariate normal distribution of mean $\mu_k = (\mu_k^1, \mu_k^2)$ and covariance matrix $\Sigma_k$:

$$\Sigma_k = \begin{pmatrix} \delta_{k1}^2 & \rho_k \delta_{k1} \delta_{k2} \\ \rho_k \delta_{k1} \delta_{k2} & \delta_{k2}^2 \end{pmatrix}.$$

Fig. 2. Single-valued quantitative data set.

The mean and covariance matrix of each simulated cluster are as follows:

$C_1$: $\mu_1^1 = 4$, $\mu_1^2 = 13$, $\delta_{11}^2 = 9$, $\delta_{12}^2 = 1/9$, $\rho_1 = 0$.
$C_2$: $\mu_2^1 = 4$, $\mu_2^2 = 8.5$, $\delta_{21}^2 = 2.33$, $\delta_{22}^2 = 6.78$, $\rho_2 = 0.97$.
$C_3$: $\mu_3^1 = 4$, $\mu_3^2 = 5$, $\delta_{31}^2 = 9$, $\delta_{32}^2 = 1/9$, $\rho_3 = 0$.
$C_4$: $\mu_4^1 = 15$, $\mu_4^2 = 13$, $\delta_{41}^2 = 9$, $\delta_{42}^2 = 1/9$, $\rho_4 = 0$.
$C_5$: $\mu_5^1 = 15$, $\mu_5^2 = 8.5$, $\delta_{51}^2 = 2.33$, $\delta_{52}^2 = 6.78$, $\rho_5 = 0.97$.
$C_6$: $\mu_6^1 = 15$, $\mu_6^2 = 5$, $\delta_{61}^2 = 9$, $\delta_{62}^2 = 1/9$, $\rho_6 = 0$.
$C_7$: $\mu_7^1 = 26$, $\mu_7^2 = 13$, $\delta_{71}^2 = 9$, $\delta_{72}^2 = 1/9$, $\rho_7 = 0$.
$C_8$: $\mu_8^1 = 26$, $\mu_8^2 = 8.5$, $\delta_{81}^2 = 2.33$, $\delta_{82}^2 = 6.78$, $\rho_8 = 0.97$.
$C_9$: $\mu_9^1 = 26$, $\mu_9^2 = 5$, $\delta_{91}^2 = 9$, $\delta_{92}^2 = 1/9$, $\rho_9 = 0$.

The obtained clusters overlap, with a mixture degree of 7.6%. From the single-valued simulated data set, we generated three interval data sets with different interval lengths: the first one has an average interval length of $L = 1.5$, the second one $L = 3$ and the third one $L = 4$, as described in Hamdan (2005).
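A sketch of this simulation protocol is given below (NumPy). The nine means and covariance parameters are those listed above, whereas the jitter scale and the way the interval side lengths are drawn are our own simplification of the construction of Hamdan (2005).

```python
import numpy as np

rng = np.random.default_rng(0)

mus = [(4, 13), (4, 8.5), (4, 5), (15, 13), (15, 8.5), (15, 5), (26, 13), (26, 8.5), (26, 5)]
sigmas = [(9, 1/9, 0.0), (2.33, 6.78, 0.97), (9, 1/9, 0.0)] * 3   # (delta1^2, delta2^2, rho) per cluster

points, labels = [], []
for k, ((m1, m2), (v1, v2, rho)) in enumerate(zip(mus, sigmas)):
    cov = [[v1, rho * np.sqrt(v1 * v2)], [rho * np.sqrt(v1 * v2), v2]]
    points.append(rng.multivariate_normal([m1, m2], cov, size=200))   # 9 x 200 = 1800 points
    labels.append(np.full(200, k))
x = np.vstack(points)
labels = np.concatenate(labels)

# Intervalization: jitter each point and build a rectangle around it (average side length L).
L = 1.5
centers = x + rng.normal(0, 0.3, size=x.shape)        # imprecise positions x_tilde (our jitter scale)
side = rng.uniform(0, 2 * L, size=x.shape)            # side lengths with mean L (our simplification)
R_lower, R_upper = centers - side / 2, centers + side / 2
```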
3.1.1. Clustering results

The simulated interval data sets are used to train a self-organizing map of $K = 9$ neurons arranged in a square grid. The training is done according to three algorithms: the first one (‘‘intSOM_MCDC’’) is described in Section 2.3.1, the second one (‘‘intSOM_MDDC’’) is described in Section 2.3.2, and the third one (‘‘intSOM_L2’’) differs from ‘‘intSOM_MCDC’’ only in that the matrices $M_L$ and $M_U$ are set to the identity matrix throughout the training process, so that the distance used is the L2 distance for interval data (Hajjar & Hamdan, 2011). In all algorithms, the neighborhood radius is initialized at $\sigma_{init} = 3$ and decreases to reach $\sigma_{final} = 0.1$. The final partition depends on the initial prototype vectors, which are chosen randomly. In order to obtain the most convenient set of initial prototype vectors, each algorithm is run 20 times and the best result, i.e. the one giving the smallest value of the clustering criterion $G$ of Eq. (10), is chosen. Concerning the total number of iterations, experiments were performed with different values of this parameter (200–1000 iterations); starting from 600 iterations, the three algorithms converged to the lowest clustering criterion $G$, so the total number of iterations is fixed at totalIter = 600 for all algorithms. The number of iterations of the first phase in algorithm ‘‘intSOM_MDDC’’ is set to 540 iterations, which is 90% of the total number of iterations.

Figs. 3–5 show the clustering results for an interval data set corresponding to an average interval length of $L = 1.5$ using respectively the methods ‘‘intSOM_MCDC’’, ‘‘intSOM_MDDC’’ and ‘‘intSOM_L2’’. Each cluster is drawn with a different color, and the prototype vectors, connected with their centers, are drawn in red. We notice that the ‘‘intSOM_MDDC’’ method gives the best clustering results: it produces a partition that corresponds mostly to the a priori partition. The ‘‘intSOM_MCDC’’ method gives better clustering results than the ‘‘intSOM_L2’’ method because the L2 distance only favors clusters of spherical shape. In all methods, we obtain an excellent degree of map topology preservation.

In order to provide more accurate clustering results, we simulated 20 replications of the single-valued data set, and for each replication we generated three interval data sets for the three values of the average interval length (Monte Carlo simulations). For each interval data set, we evaluated the SOM quality using the topographic error (tpe) and the data classification error (dce) for the three methods. Table 2 gives the mean and the standard deviation (in parentheses) of the tpe and dce for each method and for each value of the average interval length over the 20 replications. The obtained results show that the ‘‘intSOM_MDDC’’ method gives the best classification results for all generated interval data sets, since the data classification error (dce) for this method is the lowest. The ‘‘intSOM_MCDC’’ method performs better than the ‘‘intSOM_L2’’ method. All proposed methods give better classification results when the intervals are shorter. In all methods, we obtain very low values for the topographic error, which shows that the topology is preserved: in all cases the obtained map is well ordered and neighboring clusters are represented by neighboring neurons.
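The topographic error reported in Table 2 below (defined in Section 2.4) can be computed along the following lines. This sketch reuses the `interval_mahalanobis` helper of Section 2 and treats units whose grid coordinates differ by at most one in each direction as adjacent, which is our reading of ‘‘adjacent units’’.

```python
import numpy as np

def topographic_error(R_lower, R_upper, W_lower, W_upper, grid_pos, M_L, M_U):
    """Fraction of data vectors whose first and second BMUs are not adjacent on the grid."""
    errors = 0
    n = len(R_lower)
    for i in range(n):
        d = np.array([interval_mahalanobis(R_lower[i], R_upper[i], W_lower[k], W_upper[k], M_L, M_U)
                      for k in range(len(W_lower))])
        first, second = np.argsort(d)[:2]
        # Adjacency test: row and column indices differ by at most 1 (our assumption).
        if np.max(np.abs(grid_pos[first] - grid_pos[second])) > 1:
            errors += 1
    return errors / n
```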
Table 2
SOM quality evaluation for the artificial data set.

Avg. interval length | intSOM_MCDC | intSOM_MDDC | intSOM_L2
L = 1.5 | tpe: 0.43% (0.084%); dce: 18.55% (0.82%) | tpe: 0.39% (0.082%); dce: 10.57% (0.73%) | tpe: 0.44% (0.099%); dce: 42.95% (1.65%)
L = 3 | tpe: 0.48% (0.037%); dce: 20.35% (1.00%) | tpe: 0.43% (0.079%); dce: 17.55% (1.08%) | tpe: 0.40% (0.11%); dce: 44.02% (2.02%)
L = 4 | tpe: 0.41% (0.10%); dce: 34.85% (10.12%) | tpe: 0.42% (0.085%); dce: 24.54% (4.50%) | tpe: 0.40% (0.12%); dce: 44.12% (2.12%)

Fig. 3. Data and trained SOM with the ‘‘intSOM_MCDC’’ method.
Fig. 4. Data and trained SOM with the ‘‘intSOM_MDDC’’ method.
Fig. 5. Data and trained SOM with the ‘‘intSOM_L2’’ method.

More experiments were conducted on one replication of the simulated data with different grid sizes for the self-organizing map, in order to test the topology preservation capabilities of the proposed methods. Table 3 shows the tpe for different grid sizes when running the proposed methods. The grid size, as well as the data shape and dimensionality, may have an influence on the topographic error: the topographic error increases slightly with the grid size, but we recall that the main purpose of the proposed methods is clustering with topology preservation, which is usually performed with small grids. Fig. 6 plots the data rectangles and the prototype rectangles of a 12 × 12 square map trained with the ‘‘intSOM_MDDC’’ method on one replication of the simulated data set; we can see a good degree of map deployment over the data and a good degree of map topology preservation.

Table 3
Topographic error (tpe) per grid size and per method.

Method | Grid size: 3 × 3 | Grid size: 8 × 8 | Grid size: 12 × 12
intSOM_MDDC | 0% | 5.06% | 10.39%
intSOM_MCDC | 0% | 9.78% | 13.11%
intSOM_L2 | 0% | 2.2% | 6.06%

Fig. 6. Data and trained SOM with the ‘‘intSOM_MDDC’’ method (12 × 12 grid).

3.2. Real temperature interval data set
The real temperature interval data set concerns the monthly averages of the daily minimal temperatures and the monthly averages of the daily maximal temperatures observed in 106 meteorological stations spread all over France. We constructed this data set by collecting the data provided by Meteo France. Table 4 shows an excerpt of the interval data set. The lower bound and the upper bound of each interval are respectively the average minimal and average maximal temperatures recorded by a station over a month for the year 2010.
Fig. 7. ‘‘intSOM_MDDC’’ method clustering results: (a) SOM grid; (b) geographical map of France.
Fig. 8. ‘‘intSOM_MCDC’’ method clustering results: (a) SOM grid; (b) geographical map of France.
Table 4
French stations minimal and maximal monthly temperatures.

Number | Description | January | February | ··· | December
1 | Abbeville | [−1.5, 3] | [0.9, 5.9] | ··· | [−2.4, 2.2]
··· | ··· | ··· | ··· | ··· | ···
53 | Langres | [−3.7, 0.5] | [−1, 4] | ··· | [−3.4, 1.2]
··· | ··· | ··· | ··· | ··· | ···
106 | Vichy | [−2.4, 3.8] | [−0.2, 7.6] | ··· | [−3.1, 5.1]
The data set consists of $n = 106$ vectors of intervals, each of dimension $p = 12$. This data set is used to train a square self-organizing map composed of $K = 9$ neurons according to the three methods ‘‘intSOM_MDDC’’, ‘‘intSOM_MCDC’’ and ‘‘intSOM_L2’’. For all methods, the total number of iterations is totalIter = 600, and the neighborhood radius is initialized at $\sigma_{init} = 3$ and decreases to reach $\sigma_{final} = 0.1$. The number of iterations of the first phase is set to 540 iterations for the ‘‘intSOM_MDDC’’ method. The initial prototype vectors are chosen randomly; each algorithm is run 20 times and the best result, i.e. the one giving the smallest value of the clustering criterion $G$ of Eq. (10), is chosen.

3.2.1. Clustering results and interpretation

The trained network leads to a partition of 9 clusters. Figs. 7–9 show the distribution of the data vectors over the 9 neurons and the distribution of the clusters on the geographical map of France, where each station is plotted at its exact location, when the SOM is trained with the three methods.
Table 5 lists the 106 French meteorological stations. With the ‘‘intSOM_MDDC’’ method, stations installed in geographically close regions are assigned to the same cluster (neuron). Moreover, neighboring clusters on the geographical map are represented by neighboring neurons on the self-organizing map (see for example clusters C1, C2 and C4), which shows that the topology is preserved; the topographic error is tpe = 4.7%. Cluster C1 contains the stations of the northeastern regions, while cluster C9, its opposite on the SOM, contains the stations of the southwestern regions. Moving down and to the right on the self-organizing map corresponds to going to the west and to the south on the geographical map of France. The ‘‘intSOM_MCDC’’ method leads to clusters containing stations located in geographically close areas, with some exceptions: stations 42, 30 and 43 are allocated to clusters containing distant regions. The topographic error is tpe = 6.6%. With the ‘‘intSOM_L2’’ method, only clusters C1, C3, C4 and C7 contain geographically close stations. The topographic error is tpe = 6.6%.

In order to give more accurate results when comparing the proposed methods, we computed the geographical distance between two stations from their latitudes and longitudes according to the formula:

$$d(s_1, s_2) = \sqrt{(lat_1 - lat_2)^2 + (long_1 - long_2)^2}. \qquad (12)$$
For each cluster, we calculated the average geographical distance between each station and the center of gravity $g_k$ of the cluster $C_k$ as follows:

$$D_{C_k} = \frac{\sum_{s_i \in C_k} d(s_i, g_k)}{|C_k|} \qquad (13)$$

where $|C_k|$ is the cardinality of cluster $C_k$.
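A sketch of Eqs. (12) and (13) in NumPy follows; the latitude, longitude and cluster-label arrays are assumed to be available.

```python
import numpy as np

def avg_distance_per_cluster(lat, lon, labels, K):
    """D_{C_k} of Eq. (13): mean distance (Eq. (12)) between each station and its cluster's center of gravity."""
    out = np.zeros(K)
    for k in range(K):
        idx = labels == k
        g = np.array([lat[idx].mean(), lon[idx].mean()])              # center of gravity g_k
        d = np.sqrt((lat[idx] - g[0]) ** 2 + (lon[idx] - g[1]) ** 2)  # Eq. (12)
        out[k] = d.mean()
    return out   # summing these values gives the totals reported in Table 6
```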
Fig. 9. ‘‘intSOM_L2’’ method clustering results: (a) SOM grid; (b) geographical map of France.

Table 5
French meteorological stations.

1 Abbeville, 2 Agen, 3 Ajaccio, 4 Albi, 5 Alençon, 6 Ambérieu, 7 Angers, 8 Aubenas, 9 Auch, 10 Aurillac, 11 Auxerre, 12 Bâle-Mulhouse, 13 Bastia, 14 Beauvais, 15 Belfort, 16 Belle-île, 17 Bergerac, 18 Besançon, 19 Biarritz, 20 Biscarrosse, 21 Blois, 22 Bordeaux, 23 Boulogne-sur-Mer, 24 Bourg-Saint-Maurice, 25 Bourges, 26 Brest-Guipavas, 27 Brive-la-Gaillarde, 28 Cap-de-la-Hève, 29 Carcassonne, 30 Cazaux, 31 Chambéry, 32 Charleville-Mézières, 33 Chartres, 34 Châteauroux, 35 Cherbourg-Valognes, 36 Clermont-Ferrand, 37 Cognac, 38 Colmar, 39 Dax, 40 Dijon, 41 Dinard, 42 Dunkerque, 43 Embrun, 44 Epinal, 45 Evreux, 46 Gourdon, 47 Grenoble, 48 Guéret, 49 Ile d’Ouessant, 50 Ile d’Yeu, 51 La Roche-sur-Yon, 52 La Rochelle, 53 Langres, 54 Le Luc, 55 Le Mans, 56 Le Puy, 57 Le Touquet, 58 Lille, 59 Limoges, 60 Lons-le-Saunier, 61 Lorient, 62 Luxeuil, 63 Lyon-Bron, 64 Mâcon, 65 Marignane, 66 Melun, 67 Mende, 68 Metz, 69 Millau, 70 Mont-Aigoual, 71 Mont-de-Marsan, 72 Montauban, 73 Montélimar, 74 Montpellier, 75 Nancy-Essey, 76 Nantes, 77 Nevers, 78 Nice, 79 Nimes-Courbessac, 80 Niort, 81 Orange, 82 Orléans-Bricy, 83 Paris-Montsouris, 84 Pau, 85 Perpignan, 86 Poitiers, 87 Reims, 88 Rennes, 89 Romorantin, 90 Rouen, 91 Saint-Auban, 92 Saint-Brieuc, 93 Saint-Dizier, 94 Saint-Etienne, 95 Saint-Girons, 96 Saint-Quentin, 97 Saint-Raphael, 98 Salon-de-Provence, 99 Solenzara, 100 Strasbourg, 101 Tarbes, 102 Toulon, 103 Toulouse-Blagnac, 104 Tours, 105 Troyes, 106 Vichy.

Table 6
Average geographical distance per cluster for the 3 methods.

Clusters | ‘‘intSOM_MDDC’’ | ‘‘intSOM_MCDC’’ | ‘‘intSOM_L2’’
C1 | 2.59 | 1.19 | 2.09
C2 | 1.18 | 1.39 | 2.10
C3 | 1.23 | 2.55 | 1.89
C4 | 1.45 | 0.88 | 2.32
C5 | 1.02 | 2.02 | 1.92
C6 | 1.28 | 1.40 | 1.75
C7 | 1.25 | 1.83 | 0.36
C8 | 0.89 | 1.00 | 2.10
C9 | 1.83 | 1.36 | 2.27
Total | 12.72 | 13.62 | 16.80
Table 6 shows the average distance per cluster for the three proposed methods. The ‘‘intSOM_MDDC’’ method gives the smallest total geographical distance of 12.72, followed by 13.62 for the ‘‘intSOM_MCDC’’ method and 16.80 for the ‘‘intSOM_L2’’ method.

Experiments were also done by training the self-organizing map with data points instead of intervals according to the ‘‘intSOM_MDDC’’ method. Taking the monthly average temperatures instead of the minimal and maximal values, the obtained total geographical distance is 17.36, against 12.72 in the case of intervals.

3.3. Comparison with other methods

Two methods of the dynamic clustering algorithm for interval data based on Mahalanobis distances are used for comparison (de Souza et al., 2004). In the first method (‘‘intDyn_MCDC’’), a common distance for all clusters is used until convergence, whereas in the second method (‘‘intDyn_MDDC’’), a different distance per cluster is used from the beginning until convergence of the algorithm. For the comparison, we used the same simulated data sets as in de Souza et al. (2004) and the real French temperature data set proposed in the current paper.

3.3.1. Simulated data

Two simulated data sets are used, each one generated from 4 bivariate Gaussian distributions producing 2 elliptical clusters of 150 points and 2 spherical clusters of 100 points and 50 points. The clusters are well separated in the first data set, while they overlap in the second data set. 100 replications of each simulated data set are generated, and for each replication, five interval data sets with different interval lengths are generated. The corrected Rand (CR) index is used to evaluate the quality of the obtained partition (Hubert & Arabie, 1985). A CR value of 1 indicates a perfect match between the obtained partition and the a priori partition, while 0 or negative values mean mismatched partitions.
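The corrected Rand index of Hubert and Arabie (1985) is commonly known as the adjusted Rand index; assuming scikit-learn is available, the evaluation can be sketched as follows (the label arrays are hypothetical).

```python
from sklearn.metrics import adjusted_rand_score

a_priori_labels = [0, 0, 1, 1, 2, 2]   # hypothetical true partition
obtained_labels = [0, 0, 1, 2, 2, 2]   # hypothetical SOM clustering
print(adjusted_rand_score(a_priori_labels, obtained_labels))   # CR index: 1 = perfect match
```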
Table 7
Comparison between the different clustering methods for the first data set (mean CR index over 100 replications).

Interval length range | intSOM_MDDC | intDyn_MDDC | intSOM_MCDC | intDyn_MCDC
[1, 8] | 0.994 | 0.996 | 0.762 | 0.778
[1, 16] | 0.989 | 0.986 | 0.775 | 0.784
[1, 24] | 0.964 | 0.963 | 0.784 | 0.789
[1, 32] | 0.941 | 0.937 | 0.789 | 0.802
[1, 40] | 0.909 | 0.923 | 0.798 | 0.805

Table 8
Comparison between the different clustering methods for the second data set (mean CR index over 100 replications).

Interval length range | intSOM_MDDC | intDyn_MDDC | intSOM_MCDC | intDyn_MCDC
[1, 8] | 0.702 | 0.755 | 0.458 | 0.409
[1, 16] | 0.630 | 0.688 | 0.408 | 0.358
[1, 24] | 0.448 | 0.572 | 0.349 | 0.352
[1, 32] | 0.388 | 0.0435 | 0.343 | 0.349
[1, 40] | 0.356 | 0.386 | 0.337 | 0.341
Fig. 10. SOM for the overlapping clusters data set with ‘‘intSOM_MDDC ’’ method.
Tables 7 and 8 give the mean of the corrected Rand index over the 100 replications, calculated when clustering the first data set and the second data set respectively with the different methods, using the same settings as in de Souza et al. (2004). The first column of each table shows the range of values from which the interval lengths are chosen randomly. Figs. 10 and 11 show the results of applying the methods ‘‘intSOM_MDDC’’ and ‘‘intSOM_MCDC’’ respectively to an interval data set whose interval lengths vary between 1 and 8. We can see that the map is well ordered, which means that the topology is well preserved in both cases.

Fig. 11. SOM for the overlapping clusters data set with ‘‘intSOM_MCDC’’ method.

3.3.2. French meteorological real data set

The two methods of the dynamic clustering algorithm proposed in de Souza et al. (2004) are used to cluster the French temperature interval data set. The second method, where a different distance per cluster is used, failed to converge due to matrix inversion problems. For the first method, where a common distance for all clusters is used, the algorithm is run 20 times and the best result, i.e. the one giving the smallest value of the clustering criterion $G$ of Eq. (10), is chosen. The clustering results are shown in Table 9. All clusters contain geographically close stations except cluster C7, which contains stations 23, 28 and 42, located in northeastern regions, along with stations 24 and 43, located in western regions. We calculated the average geographical distance per cluster using Eq. (13) and summed the values over all clusters to obtain 13.89.

Table 9
Clustering results for the dynamic clustering method with common distance for all clusters.

Cluster | Individuals
C1 | 3, 8, 13, 54, 65, 73, 74, 78, 79, 81, 85, 91, 97, 98, 99, 102
C2 | 6, 12, 15, 18, 31, 36, 38, 40, 47, 60, 63, 64, 94, 106
C3 | 11, 32, 44, 53, 62, 66, 68, 75, 87, 93, 100, 105
C4 | 16, 26, 35, 41, 49, 50, 61, 88, 92
C5 | 5, 21, 25, 33, 34, 37, 55, 77, 80, 82, 86, 89, 104
C6 | 1, 7, 14, 45, 51, 57, 58, 76, 83, 90, 96
C7 | 23, 24, 28, 42, 43, 48, 59
C8 | 2, 4, 9, 17, 19, 20, 22, 27, 29, 30, 39, 46, 52, 71, 72, 84, 95, 101, 103
C9 | 10, 56, 67, 69, 70

3.3.3. Interpretation of comparison results

The ‘‘intDyn_MDDC’’ method of the dynamic clustering algorithm did not succeed in clustering the temperature data set because a different distance per cluster is used from the beginning, whereas the ‘‘intSOM_MDDC’’ method gives good clustering results for the same data set. Also, the results obtained by the proposed ‘‘intSOM_MCDC’’ method are better than those obtained by the ‘‘intDyn_MCDC’’ method in terms of average geographical distance. Moreover, the ‘‘intDyn_MCDC’’ method clusters data without topology preservation. However, for the simulated data sets, the ‘‘intDyn_MDDC’’ and ‘‘intDyn_MCDC’’ methods gave clustering results that are close to, but slightly better than, those of the ‘‘intSOM_MDDC’’ and ‘‘intSOM_MCDC’’ methods. Finally, after analyzing the different results, we may conclude that the proposed methods give good clustering results while providing additional information on cluster proximity.

4. Conclusion
In this paper, we proposed two methods to train a self-organizing map in order to cluster interval-valued data sets with topology preservation. Both methods use the Mahalanobis distance to find the best matching unit of an interval data vector. The second method is more adaptive than the first one because it uses a different distance per cluster in the last iterations of the training algorithm. The results obtained on simulated and real interval data sets show that the proposed methods succeed in clustering the data in a well-adapted way, especially when the data variables are correlated and the clusters tend to have ellipsoidal shapes. However, some limitations are to be considered when applying the proposed methods to data whose clusters are not convex, as in the case of concentric clusters for example.

References

Bock, H. H. (2003). Clustering methods and Kohonen maps for symbolic data. Journal of the Japanese Society of Computational Statistics, 15(2), 217–229.
Cazes, P., Chouakria, A., Diday, E., & Schektman, Y. (1997). Extension de l'analyse en composantes principales à des données de type intervalle. Revue de Statistique Appliquée, XIV(3), 5–24.
Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance. In K. Jajuga, A. Sokolowski, & H. H. Bock (Eds.), Classification, clustering and data analysis (pp. 53–60). Berlin, Germany: Springer. Also in the proceedings of IFCS2002, Poland.
Chouakria, A. (1998). Extension des méthodes d'analyse factorielle à des données de type intervalle. Université Paris 9 Dauphine.
De Carvalho, F. A. T., Brito, P., & Bock, H. H. (2006). Dynamic clustering for interval data based on L2 distance. Computational Statistics, 21(2), 231–250.
De Carvalho, F. A. T., & Pacifico, L. D. S. (2011). Une version batch de l'algorithme SOM pour des données de type intervalle. In Actes des XVIIIème rencontres de la société francophone de classification (pp. 99–102). Orléans, France.
De Souza, R. M. C. R., & De Carvalho, F. A. T. (2004). Clustering of interval data based on city-block distances. Pattern Recognition Letters, 25(3), 353–365.
de Souza, R. M. C. R., de Carvalho, F. A. T., Tenório, C. P., & Lechevallier, Y. (2004). Dynamic cluster methods for interval data based on Mahalanobis distances. In Proceedings of the 9th conference of the International Federation of Classification Societies (pp. 351–360). Chicago, USA: Springer-Verlag.
Diday, E. (1971). La méthode des nuées dynamiques. Revue de Statistique Appliquée, 19(2), 19–34.
El Golli, A., Conan-Guez, B., & Rossi, F. (2004). Self-organizing maps and symbolic data. Journal of Symbolic Data Analysis (JSDA), 2(1).
Hajjar, C., & Hamdan, H. (2011). Self-organizing map based on L2 distance for interval-valued data. In IEEE international symposium on applied computational intelligence and informatics. Timisoara, Romania (pp. 317–322).
Hamdan, H. (2005). Développement de méthodes de classification pour le contrôle par émission acoustique d'appareils à pression. Université de technologie de Compiègne.
Hamdan, H., & Govaert, G. (2003). Classification de données de type intervalle via l'algorithme EM. In XXXVèmes journées de statistique, SFdS. Lyon, France (pp. 549–552).
Hamdan, H., & Govaert, G. (2004). Int-EM-CEM algorithm for imprecise data. Comparison with the CEM algorithm using Monte Carlo simulations. In IEEE international conference on cybernetics and intelligent systems. Singapore (pp. 410–415).
Hamdan, H., & Govaert, G. (2005). Mixture model clustering of uncertain data. In IEEE international conference on fuzzy systems. Reno, Nevada, USA (pp. 879–884).
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Kiviluoto, K. (1996). Topology preservation in self-organizing maps. In International conference on neural networks. Washington, D.C., USA (pp. 294–299).
Kohonen, T. (1984). Self-organization and associative memory (2nd ed.). Springer-Verlag.
Rossi, F., & Conan-Guez, B. (2002). Multi-layer perceptron on interval data. In K. Jajuga, A. Sokolowski, & H. H. Bock (Eds.), Classification, clustering, and data analysis (pp. 427–436). Berlin, Germany: Springer.