Pattern Recognition 38 (2005) 151 – 156 www.elsevier.com/locate/patcog
Rapid and brief communication
Multivalued type dissimilarity measure and concept of mutual dissimilarity value for clustering symbolic patterns
D.S. Guru∗, Bapu B. Kiranagi
Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore 570 006, Karnataka, India
Received 25 May 2003; accepted 1 June 2004
Abstract
This paper presents a dissimilarity measure for symbolic patterns that is designed to reflect how proximity behaves in reality. Unlike other measures (Pattern Recognition 24(6) (1991) 567; Pattern Recognition Lett. 16 (1995) 647; Pattern Recognition 28(8) (1995) 1277; IEEE Trans. Syst. Man Cybern. 24(4) (1994)), the proposed measure is multivalued and non-symmetric. The concept of mutual dissimilarity value is introduced so that existing conventional clustering algorithms can work with the proposed unconventional dissimilarity measure.
© 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Symbolic patterns; Dissimilarity measure; Multivalued data type; Mutual dissimilarity value; Clustering of symbolic patterns
1. Introduction
The concept of clustering has in recent years been extended to patterns described by realistic/unconventional data types called symbolic data types [1]. Although the methods [2–5] work on symbolic patterns, they assume that the degree of proximity between two symbolic patterns is crisp and symmetric. In practice, it is quite natural for proximity values to be themselves symbolic and not necessarily symmetric. For instance, if we take the compatibility between two blood groups, say ‘O’ and ‘A’, as the proximity between them, then blood group ‘O’ possesses ‘Yes’ compatibility to blood group ‘A’, whereas in the other direction, from ‘A’ to ‘O’, the compatibility is ‘No’ from the point of view of donation, irrespective of the observer. Table 1 summarizes the proximities among the four blood groups A, B, AB and O. It can be noticed that Table 1 is non-symmetric and Boolean, a special instance of crisp.
∗ Corresponding author. Tel./fax: +91-821-251-0789.
E-mail addresses:
[email protected] (D.S. Guru),
[email protected] (B.B. Kiranagi).
However, the proximity can, in general, be expected either to lie within a certain range or to be an instance of multivalued type, in addition to being non-symmetric. Devising such a realistic proximity measure, and developing a clustering technique that works on such unconventional proximity values, therefore demand considerable attention in order to best simulate real scenarios. In view of this, we present a novel dissimilarity measure to estimate the degree of dissimilarity between two symbolic patterns. Unlike other methods [2–5], the proposed measure approximates the degree of dissimilarity by multivalued type data and, in addition, is non-symmetric. Furthermore, an agglomerative clustering method that works on this unconventional dissimilarity measure is explored by introducing the concept of mutual dissimilarity value (MDV).
2. Proposed methodology The methodology has two stages. The first stage proposes a novel dissimilarity measure while the second stage explores an agglomerative clustering algorithm by introducing the concept of MDV.
0031-3203/$30.00 © 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2003.06.006
D.S. Guru, B.B. Kiranagi / Pattern Recognition 38 (2005) 151 – 156
Table 1
Proximity (compatibility) matrix of 4 blood groups

Blood group (donor) | Receiver A | Receiver B | Receiver AB | Receiver O
A | Yes | No | Yes | No
B | No | Yes | Yes | No
AB | No | No | Yes | No
O | Yes | Yes | Yes | Yes

2.1. A novel dissimilarity measure
The degree of dissimilarity between two patterns is estimated based on their degrees of non-overlapping portion and the separability, if any, in each feature. Let $F_i$ and $F_j$ be two symbolic patterns described by $n$ interval valued features as follows:

\[ F_i = \{F_{i1}, F_{i2}, \ldots, F_{in}\}, \quad \text{i.e.} \quad F_i = \{[f_{i1}^-, f_{i1}^+], [f_{i2}^-, f_{i2}^+], \ldots, [f_{in}^-, f_{in}^+]\}, \]

\[ F_j = \{F_{j1}, F_{j2}, \ldots, F_{jn}\}, \quad \text{i.e.} \quad F_j = \{[f_{j1}^-, f_{j1}^+], [f_{j2}^-, f_{j2}^+], \ldots, [f_{jn}^-, f_{jn}^+]\}. \]

The kth features $F_{ik} = [f_{ik}^-, f_{ik}^+]$ and $F_{jk} = [f_{jk}^-, f_{jk}^+]$ may or may not overlap. In case of overlapping, the degree of dissimilarity of $F_i$ to $F_j$ with respect to the kth feature is defined to be the ratio of the non-overlapping portion of $F_{ik}$ to $|F_{jk}|$. On the other hand, in case of no overlapping, the degree of dissimilarity of the pattern $F_i$ to $F_j$ with respect to the kth feature is the ratio of the sum of $|F_{ik}|$ and the separability between $F_i$ and $F_j$ to $|F_{jk}|$, where $|\cdot|$ denotes the length. Hence, the degree of dissimilarity of $F_i$ to $F_j$ with respect to the kth feature (irrespective of overlapping or no overlapping) is characterized by

\[ d_{i\to j}^{k} = \frac{|F_{ik}| + [\max(f_{ik}^-, f_{jk}^-) - \min(f_{ik}^+, f_{jk}^+)]}{|F_{jk}|}. \tag{1} \]

Thus, the dissimilarity of $F_i$ to $F_j$ with respect to all $n$ features turns out to be multivalued, and is given by

\[ D_{i\to j} = [d_{i\to j}^{1}, d_{i\to j}^{2}, d_{i\to j}^{3}, \ldots, d_{i\to j}^{n}]. \tag{2} \]

Similarly, on the other hand, the degree of dissimilarity of $F_j$ to $F_i$ with respect to all $n$ features is

\[ D_{j\to i} = [d_{j\to i}^{1}, d_{j\to i}^{2}, d_{j\to i}^{3}, \ldots, d_{j\to i}^{n}]. \tag{3} \]

It can be perceived from the above that the dissimilarity matrix, apart from being multivalued, is not necessarily symmetric.
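The measure of Eqs. (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `feature_dissimilarity` and `dissimilarity_vector`, the tuple representation of an interval, and the two example patterns are all assumptions made here, and zero-length intervals (for which Eq. (1) would divide by zero) are simply rejected.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound f-, upper bound f+); representation assumed here


def feature_dissimilarity(f_ik: Interval, f_jk: Interval) -> float:
    """Degree of dissimilarity of F_i to F_j for one feature, following Eq. (1)."""
    lo_i, hi_i = f_ik
    lo_j, hi_j = f_jk
    len_j = hi_j - lo_j
    if len_j <= 0:
        # Eq. (1) divides by |F_jk|; how to treat a degenerate interval is not
        # specified in the text, so this sketch refuses it.
        raise ValueError("|F_jk| must be positive")
    # max(lower bounds) - min(upper bounds):
    #   negative (minus the overlap length) when the intervals overlap,
    #   positive (the gap between them) when they are separated.
    offset = max(lo_i, lo_j) - min(hi_i, hi_j)
    return ((hi_i - lo_i) + offset) / len_j


def dissimilarity_vector(F_i: Sequence[Interval], F_j: Sequence[Interval]) -> List[float]:
    """Multivalued dissimilarity D_{i->j} over all n features, following Eq. (2)."""
    return [feature_dissimilarity(a, b) for a, b in zip(F_i, F_j)]


# Hypothetical two-feature patterns: feature 1 overlaps, feature 2 does not.
F_i = [(0, 4), (10, 12)]
F_j = [(2, 10), (20, 24)]

D_ij = dissimilarity_vector(F_i, F_j)
D_ji = dissimilarity_vector(F_j, F_i)
print(D_ij)  # [0.25, 2.5]
print(D_ji)  # [1.5, 6.0]
```

The two printed vectors differ, which illustrates the non-symmetry noted above: the numerator depends on the length of the first pattern's interval and the denominator on the length of the second's.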
2.2. Mutual dissimilarity value in clustering
As the proposed dissimilarity measure is non-symmetric and of multivalued type, conventional clustering algorithms cannot be applied directly to the obtained proximity matrix. In view of this, in this section we propose a modified agglomerative clustering technique by introducing the concept of MDV. The MDV between two patterns is defined to be the magnitude of the vector obtained as the sum of scalar multiples of the vectors representing the degrees of dissimilarity between the patterns, i.e. $\mathrm{MDV} = |\alpha D_{i\to j} + \beta D_{j\to i}|$, where $\alpha$ and $\beta$ are scalars. Since the clustering is agglomerative, initially m clusters, each consisting of an individual pattern, are created, where m is the total number of patterns. Two patterns belonging to two different clusters and possessing the minimum MDV are chosen, and the corresponding clusters are merged into a single cluster. If there are several such pairs of clusters, they are merged at the same stage. This process of merging is continued until the desired number of clusters is obtained or a single cluster with all the m patterns is obtained. Thus, the proposed methodology for clustering symbolic patterns based on an unconventional dissimilarity matrix by the use of MDV is as simple as follows.
Algorithm. MDV-based agglomerative clustering
Input: The dissimilarity matrix of size m × m with entries $D_{i\to j} = [d_{i\to j}^{1}, d_{i\to j}^{2}, d_{i\to j}^{3}, \ldots, d_{i\to j}^{n}]$ for all i, j = 1, 2, ..., m, where m is the total number of patterns.
Output: Cluster(s) of patterns.
Method:
  Create m clusters, say C1, C2, ..., Cm, each containing an individual pattern.
  Repeat
    Merge two clusters Cp and Cq if there exist two patterns Fi and Fj, respectively in Cp and Cq, possessing the minimum MDV.
  Until the desired number of clusters is obtained OR a single cluster containing all m patterns is obtained.
Algorithm ends
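The procedure above can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' code: the names `mdv` and `mdv_clustering`, the parameter names `alpha` and `beta` for the two weight factors, and the toy input `D_example` are hypothetical; the magnitude is taken to be the Euclidean norm; each pair is evaluated once with i < j; and ties are detected by exact floating-point equality.

```python
import math
from typing import List, Sequence


def mdv(d_ij: Sequence[float], d_ji: Sequence[float],
        alpha: float = 1.0, beta: float = 1.0) -> float:
    """Magnitude (assumed Euclidean) of alpha * D_{i->j} + beta * D_{j->i}."""
    return math.sqrt(sum((alpha * a + beta * b) ** 2 for a, b in zip(d_ij, d_ji)))


def mdv_clustering(D: Sequence[Sequence[Sequence[float]]], n_clusters: int = 1,
                   alpha: float = 1.0, beta: float = 1.0) -> List[List[int]]:
    """Agglomerative clustering on an m x m matrix of dissimilarity vectors."""
    m = len(D)
    label = list(range(m))  # label[i] = id of the cluster holding pattern i
    pair_mdv = {(i, j): mdv(D[i][j], D[j][i], alpha, beta)
                for i in range(m) for j in range(i + 1, m)}
    n = m  # current number of clusters
    while n > max(n_clusters, 1):
        # Only pairs of patterns lying in different clusters are candidates.
        cross = {p: v for p, v in pair_mdv.items() if label[p[0]] != label[p[1]]}
        best = min(cross.values())
        # All pairs attaining the minimum are merged at the same stage,
        # so the final count can fall below n_clusters when ties occur.
        for (i, j), v in cross.items():
            if v == best and label[i] != label[j]:
                old, new = label[j], label[i]
                label = [new if c == old else c for c in label]
                n -= 1
    clusters = {}
    for idx, c in enumerate(label):
        clusters.setdefault(c, []).append(idx)
    return sorted(clusters.values())


# Hypothetical 4-pattern input with one feature per pattern pair.
D_example = [
    [[0.0], [0.1], [5.0], [6.0]],
    [[0.2], [0.0], [5.5], [6.5]],
    [[4.0], [4.5], [0.0], [0.3]],
    [[4.2], [4.8], [0.4], [0.0]],
]
print(mdv_clustering(D_example, n_clusters=2))  # [[0, 1], [2, 3]]
```

Because a merge is triggered by the closest pair of patterns across two clusters, this reading behaves like single-linkage clustering on the MDV values.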
Table 2
Results-based comparison

Methodology | Fat oil: description of the clusters | Microcomputer: description of the clusters
Ichino and Yaguchi (1994) [5] | {0,1,2,3,4,5} {6,7} | {0,1,2,3,4,5,7,8,9,10,11} {6}
Gowda and Diday (1991) [2] | {0,1} {2,3,4,5} {6,7} | {0,1,3,9,10} {6} {2,8} {4,5,11} {7}
Gowda and Ravi (1995) [3] | {0,1} {2,3,4,5} {6,7} | {0,1,3,5,7,8,9,10,11} {2} {6} {4}
Gowda and Ravi (1995) [4] | {0,1,2,3,4,5} {6,7} | {0,1,2,3,4,5,7,8,9,10,11} {6}
Proposed method | {0,1,2,3,4,5} {6,7} | {0,1,2,3,4,5,7,8,9,10,11} {6}
Table 3
Qualitative factors-based comparison

Methodology | Parametric/non-parametric | Knowledge of number of samples | Computational burden | Type of proximity matrix | Suitability on multivalued proximity matrix
Ichino and Yaguchi (1994) | Parametric | Not required | Computation of special operators | Crisp and symmetric | Non-suitable
Gowda and Diday (1991) | Non-parametric | Required | Computation of span, content and position | Crisp and symmetric | Non-suitable
Gowda and Ravi (1995(a), 1995(b)) | Non-parametric | Not required | Computation of span, content and position | Crisp and symmetric | Non-suitable
Proposed method | Non-parametric | Not required | No | Multivalued and non-symmetric (hence suitable for realistic analysis) | Suitable (due to the concept of MDV)
It should be noted that the MDV concept is not a step in the computation of the proximity matrix but a step in the proposed modified agglomerative clustering approach. The concept of MDV is introduced specifically to adapt conventional clustering algorithms to work on unconventional proximity matrices. Though the concept of MDV is used to obtain a crisp symmetric MDV matrix, the computed MDV retains the non-symmetric dissimilarity values with different weights. Unlike other algorithms, where the proximity value is directly derived to be crisp under the assumption that the measure is symmetric, in the proposed method the MDV is derived from the non-symmetric values $D_{i\to j}$ and $D_{j\to i}$ with different weight factors $\alpha$ and $\beta$. If $D_{i\to j}$ and $D_{j\to i}$ are one and the same (as in conventional techniques), the weight factors do not convey any meaning. Indeed, deciding suitable weight factors for obtaining a better clustering of symbolic patterns for a specific application is a challenging task and is our future target. In fact, it has been observed that this unconventional measure finds its importance in qualitative data analysis and also in pixel aggregation based on a seed point growing algorithm useful for image segmentation.
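The symmetry of the MDV matrix for equal weights can be checked directly. The short derivation below is a sketch under two assumptions that the text does not spell out: the two weight factors are written $\alpha$ and $\beta$, and the "magnitude" of a vector is taken to be its Euclidean norm.

```latex
\begin{align*}
\mathrm{MDV}(i,j)
  &= \bigl\lVert \alpha D_{i\to j} + \beta D_{j\to i} \bigr\rVert
   = \sqrt{\sum_{k=1}^{n} \bigl(\alpha\, d^{k}_{i\to j} + \beta\, d^{k}_{j\to i}\bigr)^{2}}, \\
\text{and for } \alpha = \beta:\quad
\mathrm{MDV}(i,j)
  &= |\alpha|\, \bigl\lVert D_{i\to j} + D_{j\to i} \bigr\rVert
   = \mathrm{MDV}(j,i).
\end{align*}
```

With both weights equal to 1, the m × m matrix of MDVs is therefore crisp and symmetric even though the underlying vectors are not. For unequal weights, the value can depend on which pattern takes the role of $F_i$, so a convention for ordering each pair would presumably be needed.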
3. Experimentation and comparative study
To validate the efficacy of the proposed method, we have conducted several experiments on different data sets of interval and qualitative type. The results of only three experiments, on fat oil patterns [2–5], microcomputer patterns [2–5] and temperature data (Table 4), are presented here. Throughout the experimentation, for the sake of simplicity, the weight factors $\alpha$ and $\beta$ are set to 1. The merits of our method can be better understood when it is compared with other methodologies. It can be noticed in Table 2 that the methods [4,5] group the fat oil patterns into 2 clusters and the methods [2,3] group the patterns into 3 clusters based on their own cluster indicator function, which acts as a stopping criterion. The proposed methodology on fat oil data has resulted in 2 clusters, which are the same as those of the methods [4,5]. It can also be noticed from Table 2 that the clusters obtained on microcomputer data by the other methods [2–5] are entirely different, except for the clusters obtained by the methods [4,5]. It is stated in the literature that no
Table 4
Minimum and maximum temperatures of cities in °C

Pattern No. | Cities | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sept | Oct | Nov | Dec
0 | Amsterdam | [−4, 4] | [−5, 3] | [2, 12] | [5, 15] | [7, 17] | [10, 20] | [10, 20] | [12, 23] | [10, 20] | [5, 15] | [1, 10] | [−1, 4]
1 | Athens | [6, 12] | [6, 12] | [8, 16] | [11, 19] | [16, 25] | [19, 29] | [22, 32] | [22, 32] | [19, 28] | [16, 23] | [11, 18] | [8, 14]
2 | Bahrain | [13, 19] | [14, 19] | [17, 23] | [21, 27] | [25, 32] | [28, 34] | [29, 36] | [30, 36] | [28, 34] | [24, 31] | [20, 26] | [15, 21]
3 | Bombay | [19, 28] | [19, 28] | [22, 30] | [24, 32] | [27, 33] | [26, 32] | [25, 30] | [25, 30] | [24, 30] | [24, 32] | [23, 32] | [20, 30]
4 | Cairo | [8, 20] | [9, 22] | [11, 25] | [14, 29] | [17, 33] | [20, 35] | [22, 36] | [22, 35] | [20, 33] | [18, 31] | [14, 26] | [10, 20]
5 | Calcutta | [13, 27] | [16, 29] | [21, 34] | [24, 36] | [26, 36] | [26, 33] | [26, 32] | [26, 32] | [26, 32] | [24, 32] | [18, 29] | [13, 26]
6 | Colombo | [22, 30] | [22, 30] | [23, 31] | [24, 31] | [25, 31] | [25, 30] | [25, 29] | [25, 29] | [25, 30] | [24, 29] | [23, 29] | [22, 30]
7 | Copenhagen | [−2, 2] | [−3, 2] | [−1, 5] | [3, 10] | [8, 16] | [11, 20] | [14, 22] | [14, 21] | [11, 18] | [7, 12] | [3, 7] | [1, 4]
8 | Dubai | [13, 23] | [14, 24] | [17, 28] | [19, 31] | [22, 34] | [25, 36] | [28, 39] | [28, 39] | [25, 37] | [21, 34] | [17, 30] | [14, 26]
9 | Frankfurt | [−10, 9] | [−8, 10] | [−4, 17] | [0, 24] | [3, 27] | [7, 30] | [8, 32] | [8, 31] | [5, 27] | [0, 22] | [−3, 14] | [−8, 10]
10 | Geneva | [−3, 5] | [−6, 6] | [3, 9] | [7, 13] | [10, 17] | [15, 17] | [16, 24] | [16, 23] | [11, 19] | [6, 13] | [3, 8] | [−2, 6]
11 | Hong Kong | [13, 17] | [12, 16] | [15, 19] | [19, 23] | [22, 27] | [25, 29] | [25, 30] | [25, 30] | [25, 29] | [22, 27] | [18, 23] | [14, 19]
12 | Kuala Lumpur | [22, 31] | [23, 32] | [23, 33] | [23, 33] | [23, 32] | [23, 32] | [23, 31] | [23, 32] | [23, 32] | [23, 31] | [23, 31] | [23, 31]
13 | Lisbon | [8, 13] | [8, 14] | [9, 16] | [11, 18] | [13, 21] | [16, 24] | [17, 26] | [18, 27] | [17, 24] | [14, 21] | [11, 17] | [8, 14]
14 | London | [2, 6] | [2, 7] | [3, 10] | [5, 13] | [8, 17] | [11, 20] | [13, 22] | [13, 21] | [11, 19] | [8, 14] | [5, 10] | [3, 7]
15 | Madras | [20, 30] | [20, 31] | [22, 33] | [26, 35] | [28, 39] | [27, 38] | [26, 36] | [26, 35] | [25, 34] | [24, 32] | [22, 30] | [21, 29]
16 | Madrid | [1, 9] | [1, 12] | [3, 16] | [6, 19] | [9, 24] | [13, 29] | [16, 34] | [16, 33] | [13, 28] | [8, 20] | [4, 14] | [1, 9]
17 | Manila | [21, 27] | [22, 27] | [24, 29] | [24, 31] | [25, 31] | [25, 31] | [23, 29] | [24, 28] | [25, 28] | [24, 29] | [22, 28] | [22, 27]
18 | Mauritius | [22, 28] | [22, 29] | [22, 29] | [21, 28] | [19, 25] | [18, 24] | [17, 23] | [17, 23] | [17, 24] | [18, 25] | [19, 27] | [21, 28]
19 | Mexico City | [6, 22] | [15, 23] | [17, 25] | [18, 27] | [18, 27] | [18, 27] | [18, 27] | [18, 26] | [18, 26] | [16, 25] | [14, 25] | [8, 23]
20 | Moscow | [−13, −6] | [−12, −5] | [−8, 0] | [0, 8] | [7, 18] | [11, 23] | [13, 24] | [11, 22] | [6, 16] | [1, 8] | [−5, 0] | [−11, −5]
21 | Munich | [−6, 1] | [−5, 3] | [−2, 9] | [3, 14] | [7, 18] | [10, 21] | [12, 23] | [11, 23] | [8, 20] | [4, 13] | [0, 7] | [−4, 2]
22 | Nairobi | [12, 25] | [13, 26] | [14, 25] | [14, 24] | [13, 22] | [12, 21] | [11, 21] | [11, 21] | [11, 24] | [13, 24] | [13, 23] | [13, 23]
23 | New Delhi | [6, 21] | [10, 24] | [14, 29] | [20, 36] | [26, 40] | [28, 39] | [27, 35] | [26, 34] | [24, 34] | [18, 34] | [11, 28] | [7, 23]
24 | New York | [−2, 4] | [−3, 4] | [1, 9] | [6, 15] | [12, 22] | [17, 27] | [21, 29] | [20, 28] | [16, 24] | [11, 19] | [5, 12] | [−2, 6]
25 | Paris | [1, 7] | [1, 7] | [2, 12] | [5, 16] | [8, 19] | [12, 22] | [14, 24] | [13, 24] | [11, 21] | [7, 16] | [4, 10] | [1, 6]
26 | Rome | [4, 11] | [5, 13] | [7, 16] | [10, 19] | [13, 23] | [17, 28] | [20, 31] | [20, 31] | [17, 27] | [13, 21] | [9, 16] | [5, 12]
27 | San Francisco | [6, 13] | [6, 14] | [7, 17] | [8, 18] | [10, 19] | [11, 21] | [12, 22] | [12, 22] | [12, 23] | [11, 22] | [8, 18] | [6, 14]
28 | Seoul | [0, 7] | [1, 6] | [1, 8] | [6, 16] | [12, 22] | [16, 25] | [18, 31] | [16, 30] | [9, 28] | [3, 24] | [7, 19] | [1, 8]
29 | Singapore | [23, 30] | [23, 30] | [24, 31] | [24, 31] | [24, 30] | [25, 30] | [25, 30] | [25, 30] | [24, 30] | [24, 30] | [24, 30] | [23, 30]
30 | Stockholm | [−9, −5] | [−9, −6] | [−4, 2] | [1, 8] | [6, 15] | [11, 19] | [14, 22] | [13, 20] | [9, 15] | [5, 9] | [1, 4] | [−2, 2]
31 | Sydney | [20, 30] | [20, 30] | [18, 26] | [16, 23] | [12, 20] | [5, 17] | [8, 16] | [9, 17] | [11, 20] | [13, 22] | [16, 26] | [20, 30]
32 | Tehran | [0, 5] | [5, 8] | [10, 15] | [15, 18] | [20, 25] | [28, 30] | [36, 38] | [38, 40] | [29, 30] | [18, 20] | [9, 12] | [−5, 0]
33 | Tokyo | [0, 9] | [0, 10] | [3, 13] | [9, 18] | [14, 23] | [18, 25] | [22, 29] | [23, 31] | [20, 27] | [13, 21] | [8, 16] | [2, 12]
34 | Toronto | [−8, −1] | [−8, −1] | [−4, 4] | [−2, 11] | [−8, 18] | [13, 24] | [16, 27] | [16, 26] | [12, 22] | [6, 14] | [−1, 17] | [−5, 1]
35 | Vienna | [−2, 1] | [−1, 3] | [1, 8] | [5, 14] | [10, 19] | [13, 22] | [15, 24] | [14, 23] | [11, 19] | [7, 13] | [2, 7] | [1, 3]
36 | Zurich | [−11, 9] | [−8, 15] | [−7, 18] | [−1, 21] | [2, 27] | [6, 30] | [10, 31] | [8, 25] | [5, 23] | [3, 22] | [0, 19] | [−11, 8]
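To make the measure concrete on the temperature data, the sketch below applies Eq. (1) to Moscow (pattern 20) and Stockholm (pattern 30) for three of the twelve months in Table 4. It is only an illustration: the helper name `feature_dissimilarity`, the choice of months, the equal weights of 1, and the use of the Euclidean norm for the MDV are assumptions made here, and the resulting numbers are not results reported in the paper.

```python
import math


def feature_dissimilarity(f_ik, f_jk):
    """Eq. (1): dissimilarity of interval f_ik to interval f_jk (each a (min, max) pair)."""
    lo_i, hi_i = f_ik
    lo_j, hi_j = f_jk
    # Negative when the intervals overlap, positive (the gap) when they are separated.
    offset = max(lo_i, lo_j) - min(hi_i, hi_j)
    return ((hi_i - lo_i) + offset) / (hi_j - lo_j)


# Jan, Feb and Nov intervals copied from Table 4 (a three-month subset, for brevity).
moscow = [(-13, -6), (-12, -5), (-5, 0)]
stockholm = [(-9, -5), (-9, -6), (1, 4)]

d_ms = [feature_dissimilarity(a, b) for a, b in zip(moscow, stockholm)]
d_sm = [feature_dissimilarity(a, b) for a, b in zip(stockholm, moscow)]

# MDV with both weights set to 1, taking the magnitude as the Euclidean norm.
mdv_value = math.sqrt(sum((a + b) ** 2 for a, b in zip(d_ms, d_sm)))

print(d_ms)
print(d_sm)
print(mdv_value)
```

The three months cover the three situations Eq. (1) handles: partial overlap (January), containment (in February the Stockholm interval lies inside the Moscow interval, so its dissimilarity to Moscow is 0), and separation (November).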
Table 5
Description of the clusters obtained by the panel of human experts and the proposed method on the data set shown in Table 4

Cluster No. | Clusters obtained by the panel of human experts (Type 1) | Clusters obtained by the panel of human experts (Type 2) | Clusters obtained by the proposed method
I | 2, 3, 4, 5, 6, 8, 11, 12, 15, 17, 19, 22, 23, 29, 31 | 2, 3, 4, 5, 6, 8, 12, 15, 17, 19, 23, 29, 31 | 2, 3, 4, 5, 6, 8, 12, 15, 17, 19, 22, 23, 29, 31
II | 0, [1], 7, 9, 10, [13], 14, 16, 20, 21, 24, 25, 26, [27], [28], 30, [33], 34, 35, 36 | 0, [1], 7, 9, [13], 14, 16, 21, 24, 25, 26, [27], [28], [33], 34, 35, 36 | 0, 1, 7, 9, 13, 14, 16, 21, 24, 25, 26, 27, 28, 33, 34, 35, 36
III | 18 | 10 | 10
IV | 32 | 11 | 11
V | - | 18 | 18
VI | - | 20 | 20
VII | - | 30 | 30
VIII | - | 32 | 32
consistency can be expected on microcomputer data. Application of our method has resulted in 2 clusters, which are the same as those of the methods [4,5], supporting their results. Nevertheless, the method [5] is parametric and, as stated in that work itself, finding a suitable value for the parameter is indeed a tough job. In addition, the method [5] is computationally expensive as it involves the computation of special operators (Cartesian join and Cartesian meet). Though the approaches [2–4] are non-parametric, they require prior knowledge of the number of samples in each pattern, in addition to being computationally expensive due to the computation of components such as span, content and position. In contrast, our method does not require prior knowledge of the number of samples and does not involve any extra computation of the components or special operators, in addition to being non-parametric. This comparative study based on qualitative factors is summarized in Table 3.

An experiment on temperature data (Table 4) has also been conducted. The data were given to a panel of human experts for classification. Some experts classified the data into four clusters while others classified the data into eight clusters of cities (Table 5). When the experts who classified the data into four clusters were asked about the reasons for such a classification, it was understood that the cities classified under cluster 1 most likely bear the same variations in temperature when compared to the other cities. Moreover, they lie between 0° and 40° latitude. The cities classified under cluster 2 also most likely bear the same temperature variation, and all of them, except those marked by ‘[ ]’, lie between 40° and 60° latitude. The cities marked by ‘[ ]’, though they lie between 0° and 40° latitude, are classified as members of cluster 2 because they bear low temperatures, being closer to the sea coast. It was also understood that the cities Mauritius, for being the only island (in the data set provided), and Tehran, for possessing irregular temperature, were made singleton clusters.

On the other hand, when the other experts were asked about the reasons for eight clusters, it was understood that they had classified the cities in a similar way, except that they identified four more singleton clusters: Geneva, for possessing pleasant temperature due to its location between the Alps and the Jura mountains; Hong Kong, for possessing moderate temperature all through the year; Moscow, for the snowy and coldest temperature (in the considered data set); and Stockholm, for colder temperature, though not as cold as Moscow. It appears that these experts looked into finer details to encapsulate reality in classification than the other experts did. It is quite interesting to note that our methodology has produced the same eight clusters (Table 5), thereby revealing its high consistency with the experts who considered finer details for classification.

4. Conclusion
In this paper, a novel dissimilarity measure and a modified clustering algorithm, based on the concept of mutual dissimilarity value, are proposed for clustering symbolic patterns. The proposed method is experimentally validated for its efficacy and is shown to have high consistency with human perception. The strength of our methodology lies in its notion of computing dissimilarity in multivalued form and, in addition, in its new concept, MDV, which helps in agglomerative clustering of patterns whose degrees of dissimilarity are multivalued and non-symmetric. Thus, our method bears the following characteristics:
(i) It is simple and computationally efficient.
(ii) It can be employed on quantitative, qualitative and multivalued qualitative symbolic data types.
(iii) It is non-parametric.
(iv) Being based on MDV, it works on a multivalued type proximity matrix.

References
[1] H.H. Bock, E. Diday (Eds.), Analysis of Symbolic Data, Springer, Berlin, 2000.
[2] K.C. Gowda, E. Diday, Symbolic clustering using a new dissimilarity measure, Pattern Recognition 24 (6) (1991) 567–578.
[3] K.C. Gowda, T.V. Ravi, Agglomerative clustering of symbolic objects using the concepts of both similarity and dissimilarity, Pattern Recognition Lett. 16 (1995) 647–652.
[4] K.C. Gowda, T.V. Ravi, Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity, Pattern Recognition 28 (8) (1995) 1277–1282.
[5] M. Ichino, H. Yaguchi, Generalized Minkowski metrics for mixed feature type data analysis, IEEE Trans. Syst. Man Cybern. 24 (4) (1994).