P-FCM: a proximity-based fuzzy clustering

Witold Pedrycz a,b,∗, Vincenzo Loia c, Sabrina Senatore c

Fuzzy Sets and Systems 148 (2004) 21–41
www.elsevier.com/locate/fss

a Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
b Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
c Department of Mathematics and Informatics, University of Salerno, Via S. Allende, 84081 Baronissi, SA, Italy

Abstract

In this study, we introduce and study a proximity-based fuzzy clustering. As the name stipulates, in this mode of clustering a structure "discovery" in the data is realized in an unsupervised manner and becomes augmented by a certain auxiliary supervision mechanism. The supervision mechanism introduced in this algorithm is realized via a number of proximity "hints" (constraints) that specify an extent to which some pairs of patterns are regarded as similar or different. They are provided externally to the clustering algorithm and help in the navigation of the search through the set of patterns, and this gives rise to a two-phase optimization process. Its first phase is the standard FCM, while the second step is concerned with the gradient-driven minimization of the differences between the provided proximity values and those computed on the basis of the partition matrix produced at the first phase of the algorithm. The proximity type of auxiliary information is discussed in the context of Web mining, where clusters of Web pages are built in the presence of some proximity information provided by a user who assesses (assigns) these degrees on the basis of some personal preferences. Numeric studies involve experiments with several synthetic data and Web data (pages).
© 2004 Elsevier B.V. All rights reserved.

Keywords: Fuzzy clustering; Proximity measure; Web mining; Fuzzy C-Means (FCM); Supervision hints; Preference modeling; Proximity hints (constraints)

1. Introduction

It almost goes without saying that clustering occupies a predominant position in data analysis and comes with various techniques of identifying groups of elements (patterns) exhibiting some level of resemblance or closeness. The conceptual diversity of clustering is impressive and so is the variety of algorithmic means. Narrowing down our discussion to the objective function-based fuzzy

∗ Corresponding author. Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada T6G 2G7. Tel.: +1-204-474-8380; fax: +1-204-261-4639. E-mail address: [email protected] (W. Pedrycz).

0165-0114/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.fss.2004.03.004


Table 1
Main tendencies in objective function-based fuzzy clustering: an overview

Structural enhancements of clustering:
• Enhanced geometry of clusters (from spherical to ellipsoidal; C-varieties, C-lines, C-shells)
• Robust clustering
• Possibilistic clustering
• Noise-absorbing clusters

Knowledge-based enhancements of clustering:
• Partial clustering (labels of some patterns available)
• Conditional clustering
• Logic-based clustering

clustering, we can identify the main tendencies and generalizations encountered in the literature; refer to Table 1. A general taxonomy involves two main categories of generalizations, namely (a) structural ones, which attempt to capture more refined forms of the geometry of clusters [3,10,12,13,22], either through a more complex objective function and/or distance function, or by revising the format of the data to handle relational patterns [8,9,11,14,23], and (b) knowledge-based ones. The approaches falling under this second rubric are concerned with the inclusion of a mechanism of partial supervision where some labeled patterns are involved in the process of finding or navigating a structure in the data. Very often these methods involve augmented objective functions with an additive format of the proposed extension that captures the available hints navigating the search of the data structure. In some other cases, we encounter a collaborative mechanism of clustering where a detailed objective function comes with a number of additive components that reflect some findings about the underlying structures in the collaborating portions of the patterns.

The proposed generalized version of the clustering can be positioned in the realm of knowledge-based clustering, cf. [2,15,20]. An important objective of this emerging paradigm is to seamlessly combine two sources of information: the one residing within the data set itself (and to be revealed through some optimization process) and the second one coming from the user/designer who instills some general observations (hints) as to the nature of the data (and their structure) while being placed in a general context of the problem at hand. These two sources of information need to be reconciled and put into some sort of collaborative environment. The features of such an environment and the ensuing algorithmic details constitute the crux of the approach proposed in this study. Here we are more specific, as we primarily look at a specific type of hints augmenting the data-driven clustering in the form of proximity between pairs of data (patterns). These suggestions are very intuitive: a user/designer can easily contrast two patterns and quantify their level of proximity. This could be done either in a qualitative (say, very similar patterns, different patterns, etc.) or quantitative manner (e.g., patterns "a" and "b" are similar to degree 0.7).

The study is arranged into six sections. In Section 2 we discuss the concept of proximity, formulate its underlying foundations and relate it to fuzzy partition matrices. Section 3 is concerned with the proximity-based clustering, called P-FCM; here we formulate the problem as a certain optimization task, outline a general flow of computing and discuss possible ways of interaction between two sources of information: the data to be clustered and a collection of proximity hints provided by a data analyst. We also contrast the P-FCM algorithm with an interesting and quite visible idea of relational clustering (that is, clustering using relational data). Experimental studies in which we use synthetic data are presented in Section 4. The P-FCM is especially suitable for Web exploration and data organization on the Web; this issue is covered in Section 5. In the same section, we present a


comprehensive case study discussing clustering of Web pages in the Open Directory Project (ODP). Conclusions are covered in Section 6.

The notation being used throughout the study is standard: patterns are located in an n-dimensional space of real numbers (R^n) and we are concerned with "N" patterns to be clustered (grouped) into "c" clusters. While the approach proposed here could be used to augment a vast array of clustering techniques (which eventually may call for some technical modifications, with the essence being fully retained), we confine ourselves to the FCM method. There are two basic reasons for that: first, this algorithm is widely known, so that any enhancements made to it could be of interest to current users. Second, an introduction of the idea realized in this setting is transparent and convincing, without too much unnecessary notational and algorithmic cluttering. Owing to the well-documented FCM technique, we can also take advantage of the familiarity of its notation and underlying concepts, such as partition matrices and prototypes, being used to capture the essence of the structure of the data.

2. The concept of proximity and its relationship to partition matrices

The concept of proximity between two objects (patterns) is one of the fundamental notions of high practical relevance. Formally, given two patterns "a" and "b", their proximity, p(a, b), is a mapping to the unit interval that satisfies the following two conditions:

p(a, b) = p(b, a)   (symmetry),
p(a, a) = 1   (reflexivity).

The notion of proximity is the most generic one, constituting a minimal set of requirements; what we impose is straightforward: "a" exhibits the highest proximity to itself and the proximity relation is symmetric. In this sense, we can envision that in any experimental setting these two properties can be easily realized. Given a collection of patterns, the proximity results obtained for all possible pairs of patterns are usually arranged in a matrix form known as a proximity relation P. It is worth mentioning that the concept of similarity is more demanding, as it comes with the request for transitivity (viewed in some well-defined sense, e.g., as max-min transitivity, etc.). In practice, experimental results (that come as a consequence of some comparison between pairs of objects) do not guarantee that the transitivity requirement is satisfied. One can, however, achieve it by computing a transitive closure of the proximity relation, cl(P).

A fuzzy partition (as produced by the FCM algorithm) is directly linked with the proximity relation. The well-known transformation of the partition matrix into its proximity counterpart is governed by the expression

\hat{p}[k_1, k_2] = \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}),   (1)

where \wedge denotes the minimum operation. Owing to the well-known properties of the partition matrix, we observe that for k_1 = k_2 we end up with the value of \hat{p}[k_1, k_2] equal to 1. Evidently, \hat{p}[k_1, k_1] = \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_1}) = \sum_{i=1}^{c} u_{ik_1} = 1. The symmetry of \hat{p}[k_1, k_2] is obvious.
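To make expression (1) concrete, here is a minimal sketch computing the induced proximity matrix from a fuzzy partition matrix; the NumPy rendering, the function name, and the toy partition matrix are our own illustrative choices.

import numpy as np

def induced_proximity(U):
    # U is a c x N fuzzy partition matrix (each column sums to 1).
    # Eq. (1): p_hat[k1, k2] = sum_i min(u_{i,k1}, u_{i,k2}).
    c, N = U.shape
    P_hat = np.empty((N, N))
    for k1 in range(N):
        for k2 in range(N):
            P_hat[k1, k2] = np.minimum(U[:, k1], U[:, k2]).sum()
    return P_hat

# Toy partition matrix: c = 2 clusters, N = 3 patterns.
U = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.8, 0.9]])
print(induced_proximity(U))

Since every column of U sums to 1, the diagonal of the resulting matrix equals 1 and the matrix is symmetric, exactly as argued above.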


3. The P-FCM algorithm

The FCM algorithm accepts a collection of data (patterns). This process is completely guided by some underlying objective function; the result depends exclusively upon the data to be analyzed. There is no further user/designer intervention (with the exception of setting up the parameters of the algorithm and deciding upon the number of clusters to be formed). There is, however, another important, yet quite often overlooked source of information coming directly from the potential user of the results of the clustering process. The algorithm consists of two main phases that are realized in an interleaved manner. The first phase is data driven and is primarily the standard FCM applied to the patterns. The second concerns the accommodation of the proximity-based hints and involves some gradient-oriented learning.

3.1. Problem formulation and underlying notation

The high-level computing scheme comprises two phases that form a nested optimization structure, see Table 2. The upper level of the scheme deals with the standard FCM computing (iterations), while the one nested there is aimed at the accommodation of the proximity requirements and optimizes the partition matrix on the basis of these hints. The upper part (phase) of the P-FCM is straightforward and follows the well-known scheme encountered in the literature. The inner part deserves a detailed discussion.

Given: the number of clusters, the fuzzification coefficient, the distance function, an initial partition matrix (generally started from a collection of random entries), and a termination condition (a small positive constant ε).

The accommodation of the proximity requirements (constraints or hints) is realized in the form of a certain performance index whose minimization leads us to the optimal partition matrix. As stated in the problem formulation, we are provided with pairs of patterns and their associated level of proximity. The partition matrix U (more specifically, the induced values of the proximity) should adhere to the given levels of proximity. Bearing this in mind, the performance is formulated as the following sum:

V = \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} (\hat{p}[k_1, k_2] - p[k_1, k_2])^2 \, b[k_1, k_2] \, d[k_1, k_2].   (2)

The notation \hat{p}[k_1, k_2] is used to describe the proximity level induced by the partition matrix. It becomes apparent that using directly the values of the membership (the corresponding entries of the partition matrix) is not suitable: if two patterns k_1 and k_2 have the same distribution of membership grades across the clusters, these membership grades are usually not equal to 1, even though the proximity value should be close or equal to 1. The value d[k_1, k_2] denotes the distance between the two corresponding patterns, while p[k_1, k_2] is the proximity level provided by the user or data analyst. Subsequently, the entries of the binary matrix B are defined as follows:
• b[k_1, k_2] assumes the value 1 if there is a proximity hint for this specific pair of patterns, that is, k_1 and k_2;
• otherwise the value of b[k_1, k_2] is set to zero (meaning that there is no proximity hint for this specific pair of data).


Table 2
A general flow of optimization of the P-FCM algorithm

Repeat {main external loop}
  Compute the prototypes and the partition matrix using the standard expressions encountered in the FCM method
  Repeat {internal optimization loop}
    Minimize the performance index V guided by the collection of proximity constraints
  Until no significant changes in its values over successive iterations have been reported (quantified by another threshold)
Until a termination condition has been met (namely, the distance between two successive partition matrices does not exceed ε).
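To make the flow of Table 2 concrete, the following sketch implements the outer (FCM) phase of the scheme; the fuzzification coefficient m = 2, the random initialization, and the function names are our own standard choices, and the internal proximity loop is left as a stub here (a sketch of it follows Eq. (7) below).

import numpy as np

def fcm_step(X, U, m=2.0):
    # One standard FCM iteration: prototypes from U, then membership update.
    Um = U ** m                                        # (c, N)
    proto = (Um @ X) / Um.sum(axis=1, keepdims=True)   # prototypes, (c, n)
    d2 = ((X[None, :, :] - proto[:, None, :]) ** 2).sum(axis=2) + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))                     # u_{ik} proportional to d_{ik}^{-2/(m-1)}
    return proto, inv / inv.sum(axis=0, keepdims=True) # columns sum to 1

rng = np.random.default_rng(0)
X = rng.random((20, 2))                    # N = 20 patterns in R^2
U = rng.dirichlet(np.ones(3), size=20).T   # random initial partition, c = 3
for _ in range(100):                       # main external loop of Table 2
    proto, U = fcm_step(X, U)
    # ... internal loop: gradient-driven minimization of V over the
    #     proximity hints would adjust U here (see Section 3.1) ...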

With the partition-induced proximity defined in this way, (2) reads as follows:

V = \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \left( \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}) - p[k_1, k_2] \right)^2 b[k_1, k_2] \, d[k_1, k_2].   (3)
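Assuming the hints are kept as a list of (k1, k2, proximity) triples (so that the binary matrix B never needs to be stored explicitly), the performance index (3) can be evaluated as in the following sketch; the function name and the Euclidean choice for d[k1, k2] are our assumptions.

import numpy as np

def performance_index(U, X, hints):
    # Eq. (3); b[k1, k2] = 1 exactly for the pairs listed in `hints`.
    total = 0.0
    for k1, k2, p in hints:
        p_hat = np.minimum(U[:, k1], U[:, k2]).sum()   # induced proximity, Eq. (1)
        d = np.linalg.norm(X[k1] - X[k2])              # distance d[k1, k2]
        total += (p_hat - p) ** 2 * d
    return total

U = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.8, 0.9]])
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.2, 0.9]])
print(performance_index(U, X, [(0, 1, 0.1), (1, 2, 0.9)]))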

The optimization of V with respect to the partition matrix does not lend itself to a closed-form expression and requires some iterative optimization. The gradient-based scheme comes in a well-known format

u_{st}(iter + 1) = \left[ u_{st}(iter) - \alpha \frac{\partial V}{\partial u_{st}(iter)} \right]_{0,1}, \quad s = 1, 2, \ldots, c, \; t = 1, 2, \ldots, N,   (4)

where [\,\cdot\,]_{0,1} indicates that the results are clipped to the unit interval and \alpha stands for a positive learning rate. Successive iterations are denoted by "iter". The detailed computations of the above derivative are straightforward. Taking the derivative with respect to u_{st}, s = 1, 2, \ldots, c, t = 1, 2, \ldots, N, one has

\frac{\partial V}{\partial u_{st}(iter)} = \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \frac{\partial}{\partial u_{st}} \left( \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}) - p[k_1, k_2] \right)^2 = 2 \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \left( \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}) - p[k_1, k_2] \right) \frac{\partial}{\partial u_{st}} \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}).   (5)

The inner derivative assumes binary values depending on the satisfaction of the conditions

\frac{\partial}{\partial u_{st}} \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}) = \begin{cases} 1 & \text{if } t = k_1 \text{ and } u_{sk_1} \leq u_{sk_2}, \\ 1 & \text{if } t = k_2 \text{ and } u_{sk_2} \leq u_{sk_1}, \\ 0 & \text{otherwise}. \end{cases}   (6)

Making this notation more concise, we can regard the above derivative as a binary (Boolean) predicate f[s, t, k_1, k_2] and plug it into (5), which leads to the overall expression

\frac{\partial V}{\partial u_{st}(iter)} = 2 \sum_{k_1=1}^{N} \sum_{k_2=1}^{N} \left( \sum_{i=1}^{c} (u_{ik_1} \wedge u_{ik_2}) - p[k_1, k_2] \right) f[s, t, k_1, k_2].   (7)
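The inner optimization then amounts to repeating a clipped gradient step. The sketch below combines update (4) with derivative (7) and renormalizes the columns afterwards, mirroring the normalization step mentioned in the discussion of Fig. 1; following the displayed form of (5)-(7), the weight d[k1, k2] of (3) is omitted, and letting both branches of (6) fire on ties is our own choice, as is the step size.

import numpy as np

def proximity_gradient_step(U, hints, alpha=0.05):
    # One update (4) of the partition matrix driven by the proximity hints.
    c, N = U.shape
    grad = np.zeros_like(U)
    for k1, k2, p in hints:
        err = np.minimum(U[:, k1], U[:, k2]).sum() - p   # p_hat - p
        for s in range(c):
            # Boolean predicate f[s, t, k1, k2] of Eq. (6)
            if U[s, k1] <= U[s, k2]:
                grad[s, k1] += 2.0 * err
            if U[s, k2] <= U[s, k1]:
                grad[s, k2] += 2.0 * err
    U_new = np.clip(U - alpha * grad, 0.0, 1.0)          # [.]_{0,1} of Eq. (4)
    return U_new / U_new.sum(axis=0, keepdims=True)      # restore column sums to 1

U = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.8, 0.9]])
U = proximity_gradient_step(U, [(0, 1, 0.9)])
print(U)   # memberships of patterns 0 and 1 move closer together

Iterating this step until V stabilizes implements the internal loop of Table 2.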


Fig. 1. P-FCM: a general flow of optimization activities (the FCM phase, the proximity hints, and the gradient-based minimization of V).

The organization of the computing process of the P-FCM scheme is visualized in Fig. 1. The two interleaving phases (FCM and the proximity-based navigation) are clearly identified, with an interface between them. The results of the FCM optimization (which is guided by a set of its own parameters) are passed on to the gradient-based procedure of the proximity-based optimization. The results obtained there are normalized to one (to meet the requirements of the partition matrix) and then processed by the FCM at its next iteration. As to the intensity of computing at both ends, we note that for each iteration step of the FCM we have a series of iterations of the gradient-based optimization. In this sense, we allow the data structure to become more visible, while the changes coming from the proximity hints play a supportive role.

There is an interesting optimization aspect of the P-FCM that is worth emphasizing: the proposed scheme dwells on two sources of knowledge, that is, (a) a certain performance index (objective function) that directs the search in the data space and (b) a collection of proximity hints. Evidently, if we confine ourselves to a certain distance function (quite commonly the Euclidean one), then the search for the structure is directed accordingly and helps reveal a structure in the data that conforms to a collection of hyperspheres. The proximity hints are provided independently and therefore could lead the search for the structure in a different direction. We do not know in advance whether these two sources of information are conflicting or competitive. If so, it is very likely that the interaction between them may result in some instability of the optimization process. In essence, bearing in mind the nature of the problem, we cannot guarantee that the optimization process will always converge, especially if these two sources of guidance (objective function and proximity hints) are in strong competition. If this is the case, the lack of stability can be used constructively and trigger some analysis of the existing sources of knowledge being used in the P-FCM clustering.

The P-FCM in its formulation is an original development. This obviously does not stipulate that there were no similar attempts in the past. For instance, in the seminal paper by Ruspini [24] we can note an interesting formulation of the optimization problem that reads as follows:

\min \sum_{k=1}^{N} \int_X \int_X |T_k(x, y) - V_k(g(x), g(y))| \, dP(x) \, dP(y),   (8)

where T_k(x, y) denotes a function on the Cartesian product of data X × X, g(·) is a so-called grouping function, while V_k(g(x), g(y)) describes the resemblance of a pair of patterns; dP denotes the underlying probability measure. The minimization is aimed at bringing T_k(x, y) as close as possible to V_k(g(x), g(y)). One may note, however, that while the issue of similarity between


patterns has been raised, the formulation (and the ensuing solution in the form of the P-FCM algorithm) is quite distant from the formulation presented above.

3.2. Interaction aspects of sources of information in the P-FCM

The P-FCM clustering environment brings up an important and fundamental issue of collaboration and/or competition between different sources of information and an algorithmic manifestation of such interaction. In the discussed framework, we are inherently faced with two diverse sources of data/knowledge. FCM aims at "discovering" the structure in the data by minimizing a certain objective function. The gradient-based learning concentrates on the proximity hints, and this is the point where the interaction with the data starts to unveil. The strength of this interaction is guided by the intensity of the gradient-based learning. More specifically, for higher values of the learning rate, we put more confidence in the hints and allow them to affect the already developed structure (partition matrix) to a higher extent. Collaboration may convert into competition when the two mechanisms start interacting more vigorously and the two sources of information are not fully consistent (a consistency that is difficult to quantify in a formal way). The existence of competition starts manifesting itself through substantial oscillations during the minimization of V; to avoid them, we need to loosen the interaction and lower the value of the learning rate (α).

3.3. P-FCM and relational clustering

The P-FCM exhibits some resemblance to the well-known and interesting trend of fuzzy relational clustering, cf. [8,9]. Instead of individual patterns, in this mode of clustering we consider relational objects, that is, new objects describing properties of pairs of the original data. More specifically, these objects are organized in a single matrix R = [r_{kl}], k, l = 1, 2, ..., N, where r_{kl} denotes a degree of similarity, resemblance or, more generally, dependency (or association) between two patterns (k and l). We assume that this type of referential information about patterns is available for clustering purposes, and the algorithms are developed along this line. The importance of relational data is motivated by situations where single patterns lack interpretation while their pairwise comparison (leading to the relational patterns) makes perfect sense. Computationally, the number of relational patterns is substantially higher than the number of original data (N versus all possible distinct pairs of data, that is, N(N − 1)/2). A computational advantage can arise with regard to the dimensionality of the space: the original patterns may be located in a highly dimensional space (n), whereas the relational data could often be one dimensional.

In the P-FCM, we have a number of differences when comparing with relational clustering:
(a) as already underlined, relational clustering operates in the homogeneous space of relational objects. The P-FCM augments FCM by adding an extra optimization scheme, so it still operates on patterns (rather than their relational counterparts);
(b) P-FCM attempts to reconcile two sources of information (structural and domain hints); this mechanism is not available to relational clustering;
(c) computationally, P-FCM does not affect the size of the original dataset; we can be provided with a different number of hints (being a certain percentage of the dataset). Relational clustering increases the size of the dataset while operating in a low-dimensional space;


(d) P-FCM dwells on the core part of the FCM optimization scheme, augmenting it with an additional gradient-based optimization phase; in contrast, relational clustering requires substantial revisions of the generic FCM method (which sometimes leads to optimization pitfalls and is still under further improvement, cf. [14,23]).

4. Experiments—synthetic data

In this section, we discuss how the P-FCM performs on some illustrative synthetic data with proximity hints and then show its application to clustering Web data (pages).

4.1. Experiment 1

Here we are concerned with a small two-dimensional synthetic data set shown in Fig. 2(a). Evidently, there is some structure, with several visible but not necessarily clearly distinguishable clusters. The standard FCM with c = 3 clusters gives rise to the cluster boundaries in Fig. 2(b) that distinguish between the produced groups. The prototypes of the clusters shown there are v1 = [0.95 1.154], v2 = [1.70 1.88], and v3 = [0.15 0.25]. The results are quite expected; the prototypes represent the structure of the data quite well and the resulting boundaries of the clusters split the dataset as one could have anticipated.

The proximity hints affect the clusters and their boundaries quite substantially. To illustrate this point, we added several proximity hints (constraints) to selected pairs of patterns. Two scenarios are presented in Figs. 3 and 4. They show the proximity constraints and the resulting boundaries of the clusters. It becomes apparent that these are quite substantially affected by the proximities: the prototypes moved around because of the imposed proximity constraints. Because of the low values of the proximities on the pairs of patterns, the region occupied by the second class has expanded over the previous case (in comparison with the case without any proximity constraints). The prototypes are now equal to v1 = [0.25 0.37], v2 = [1.65 1.78], and v3 = [1.06 1.36].

Fig. 2. Two-dimensional data (a) and boundaries of the clusters resulting from the FCM algorithm (b). The contour plots use the analytic expression of the membership functions and are built on the basis of the prototypes of the clusters.


Fig. 3. Proximity constraints (a) and resulting cluster boundaries (b) with four proximity constraints.

Fig. 4. Proximity constraints (a) and resulting cluster boundaries (b) with eight proximity constraints.

The situation visualized in Fig. 4 is completely different from the previous cases. The proximity constraints (which are quite "decisive", assuming mostly binary values) have radically changed the landscape of the clustering. The list of constraints consists of the following triples (data point, data point, proximity); see also Fig. 4(a): (1 7 0.9), (2 6 0), (2 9 1), (7 12 0.9), (9 13 0.0), (8 14 0), (13 14 0), (1 2 0). Noticeably, the proximity constraints throw some patterns that are visibly close in the feature space into different groups (because of the low values of their proximities). The prototypes are equal to v1 = [0.68 0.77], v2 = [0.58 0.80], and v3 = [1.43 1.68], so we note that two of them are placed very close to each other (as a result of the existing proximity constraints). In the sequel, the resulting boundaries shown in Fig. 4 become very distinct from those encountered in the previous cases.
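For illustration, these triples map directly onto the hint representation used in the earlier sketches; the 0-based index shift is an implementation detail of ours (the paper numbers the patterns from 1).

# Proximity hints of Fig. 4(a) as (pattern, pattern, proximity) triples
hints = [(1, 7, 0.9), (2, 6, 0.0), (2, 9, 1.0), (7, 12, 0.9),
         (9, 13, 0.0), (8, 14, 0.0), (13, 14, 0.0), (1, 2, 0.0)]
# Shift to 0-based indices before addressing partition matrix columns.
hints0 = [(k1 - 1, k2 - 1, p) for k1, k2, p in hints]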


Fig. 5. Experimental setup: from a highly dimensional feature space to its reduced version and proximity hints.

4.2. Experiment 2

While in the previous case the proximity constraints were introduced quite freely, with the primary intent to visualize how they impact the results of clustering by interplaying with the structure "discovered" within the original data, in this experiment the origin of the proximity values is very much different. The general scheme we are exercising here is as follows: start with the original patterns positioned in an n′-dimensional space, pick a subset (n) of the features (where n < n′) and cluster the patterns in this reduced feature space with some additional proximity constraints. These constraints are constructed in a systematic way: we cluster the patterns in the n′-dimensional space and then build the proximities based on the resulting partition matrix. The overall setup of the experiment is portrayed in Fig. 5.

The dataset consists of three clusters of four-dimensional synthetic data with Gaussian distributions, whose mean vectors and (diagonal) covariance matrices are

m_1 = [0.0\ 0.0\ 3.0\ 0.0]^T,  \Sigma_1 = diag(1.0, 1.0, 1.0, 0.5);
m_2 = [0.5\ 0.8\ 1.0\ 2.0]^T,  \Sigma_2 = diag(1.2, 1.5, 1.0, 0.5);
m_3 = [6.0\ 6.0\ 3.0\ 3.0]^T,  \Sigma_3 = diag(2.0, 1.0, 2.0, 0.5).

Each cluster consists of 100 patterns. The use of the FCM with c = 3 clusters leads to the prototypes v1 = [0.04 0.13 3.00 0.33], v2 = [5.90 6.01 3.34 3.08], v3 = [0.72 1.27 1.22 2.04], with the partition matrix illustrated in Fig. 6.

Now we take only the first two variables of the patterns. This leads to a substantial overlap between the clusters (as becomes apparent by looking at the statistical parameters of the generated


Fig. 6. Membership functions produced by the FCM for the four-dimensional set of patterns (three clusters).

Fig. 7. Membership grades of the patterns in three clusters (reduced two-dimensional data).

groups). This is also reflected in a more visible overlap between the clusters, as portrayed in Fig. 7. The prototypes of the clusters are given as v1 = [0.16 0.03], v2 = [1.65 3.56], and v3 = [6.51 6.28].

Consider the P-FCM algorithm with randomly selected proximity hints; we consider 30 and 150 hints (that is, pairs of patterns for which we calculate the proximity level on the basis of the partition matrix obtained for the four-dimensional patterns). The resulting partition matrices (membership grades) are shown in Fig. 8; the prototypes are given below:

No. of hints = 30: v1 = [0.13 −0.29], v2 = [0.76 2.07], v3 = [6.21 6.25].
No. of hints = 150: v1 = [0.21 −0.13], v2 = [0.96 2.25], v3 = [5.97 6.09].

Another way to quantify the impact of the proximity constraints is to calculate the Hamming distance between partition matrices, namely the one for the original four-dimensional patterns and the ones corresponding to the reduced two-dimensional patterns without and with the proximity hints. The tendency is clear: with no hints, this distance is equal to 171.0, and it gradually reduces to 151.6 and 146.4 for 30 and 150 hints, respectively. It becomes evident that the P-FCM can compensate for the unseen features by guiding the clustering process.
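Both ingredients of this experiment, generating hints from the partition matrix of the full feature space and comparing partition matrices by a Hamming distance, admit a short sketch; the random pair-sampling scheme and the function names are our assumptions.

import numpy as np

def hamming_distance(U1, U2):
    # L1 distance between two c x N partition matrices of equal shape
    return np.abs(U1 - U2).sum()

def hints_from_partition(U_full, n_hints, seed=0):
    # Pick random pairs of patterns and attach the proximity induced,
    # via Eq. (1), by the partition matrix of the full feature space.
    rng = np.random.default_rng(seed)
    _, N = U_full.shape
    hints = []
    for _ in range(n_hints):
        k1, k2 = rng.choice(N, size=2, replace=False)
        p = np.minimum(U_full[:, k1], U_full[:, k2]).sum()
        hints.append((int(k1), int(k2), float(p)))
    return hints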



Fig. 8. Membership grades for 30 (a) and 150 (b) proximity hints.

5. Web exploration and P-FCM

Considering the rapidly growing size of the Web, it becomes evident that any kind of manual classification and categorization of Web sources (sites, pages, etc.) will be prohibitively time consuming. Hence the great role of rapid, automatic, and accurate hypertext clustering algorithms. The main problem with the development of automated tools is related to finding, extracting, parsing, and filtering the user requirements from Web knowledge. There have been a number of approaches that help automatically retrieve, categorize and classify Web documents. Clustering techniques have been proposed in [4] as generic information retrieval tools. Two cluster-based approaches exploit graph partitions to induce clusters; one is based on a hypergraph used to define association rules gathering items that appear frequently together in many transactions. The other method produces, through recursive splitting, a binary tree of clusters in which the root is the starting document set, while each leaf node is a partition of the whole set. A recent proposal of a fuzzy-based approach to Web mining is presented in [14]; the idea there is to use "medoids", a kind of relational fuzzy clustering. Some other clustering-oriented approaches are more related to user interaction; the LOGSOM system [25], developed to mine Web log data and provide a visual tool guiding the user during navigation, is based on a self-organizing map (SOM) and organizes Web documents into a two-dimensional map according to user navigation behaviors. In [21], instead, clustering is used to


Fig. 9. A schematic view of the Web mining in the environment of P-FCM (the user exploring the cyberspace supplies proximity hints to the P-FCM).

discover semantic relationships among specified concepts and organize them into messages created during electronic meetings. There is a broad range of existing approaches, see, e.g., [5,7,16,19,26].

Owing to its concept, the P-FCM plays an interesting role in Web mining. A few observations can cast the discussion in a general setting. First, it is apparent that in assessing the proximity between two Web pages we are faced with highly heterogeneous information including text, images, video, and audio. There are a lot of factors that play a role in our judgment as to the proximity of Web pages, such as the layouts of individual pages, the form of the background, the intensity of links to other sites, as well as the origin in the cyberspace that navigates us to this specific page. Most importantly, a lot of these factors are difficult to quantify and translate into computationally meaningful features. The textual information is the most evident one and is almost the exclusive contributor to the feature space when determining structures in a collection of Web pages. This tendency is very visible in the literature, see [1,6,17,18], and goes back to the extensive studies on information retrieval. Here we can envision a role for the proximity hints, whose utilization can compensate for the consideration of only a subset of the feature space. For instance, we cluster Web pages in the subspace of textual information, and the proximity values provided by the user augment (and implicitly expand) this subspace by incorporating other features capturing the multimedia and layout portion of the description of the pages. The schematic view of Web mining with the involvement of the P-FCM is portrayed in Fig. 9; here we emphasize the origin of the two sources of information: the datasets themselves and the hints provided by the user/designer.

The usual approach to document categorization is based on the analysis of its content, since the information for categorizing a document is extracted from the document itself. Current (semi-)automatic attempts are still at a premature stage to be operative components of powerful Web search engines, which prefer the work done by humans. ODP (Open Directory Project, http://dmoz.org, informally known as Dmoz, i.e., Mozilla Directory) is the most widely distributed database of Web content, classified by a volunteer force of more than 8000 editors. The ODP organization provides the means to manage the Web growth thanks to editor/sub-editor chains, whose integration covers all the possible Web contents. This kind of "collective" brain represents the human technology at the basis of


Fig. 10. Selection of keywords from a descriptive web page dealing with the category Top:Computers:Software:Graphics:Image Manipulation.

the most popular Web search engines and portals, including Netscape Search, AOL Search, Google, Lycos, HotBot, DirectHit, and hundreds of others. For our testbed we considered three ODP categories:

1. Top:Computers:Software:Graphics:Image Manipulation (www.dmoz.org/Computers/Software/Graphics/Image Manipulation)
2. Top:Shopping:Gifts:Personalized:Photo Transfers (www.dmoz.org/Shopping/Gifts/Personalized/Photo Transfers)
3. Top:News:Media:Journalism:Photojounalism:Music Photography (www.dmoz.org/News/Media/Journalism/Photojounalism/Music Photography)

The open structure of ODP allowed us to analyze the structure of categories 1–3 in order to acquire the necessary information usable to categorize the related Web pages; in fact, for each category, Dmoz provides a descriptive web page, such as the one given in Fig. 10, explaining the category in terms of related terms and keywords. Table 3 indicates the keywords used in the case study, associated with each category. Some keywords are stemmed to capture words with a common prefix, denoted by an asterisk (for instance, photo* or manipulat*). The representation of each web page is treated as a 14-dimensional vector, where each component is the frequency of occurrence (probability) of the term in the specific page.
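As a sketch of how such a term-frequency vector might be assembled (the keyword list follows Table 3, while the tokenization and the prefix matching of stemmed keywords are simplifying assumptions of ours):

import re
from collections import Counter

# Keywords of Table 3; entries that are stems in the paper (photo*,
# manipulat*, promot*) are matched as prefixes below.
KEYWORDS = ["transfer", "gift", "photo", "logo",
            "image", "software", "filter", "digital", "manipulat",
            "concert", "music", "journal", "promot", "portrait"]

def page_vector(text):
    # Relative frequency of each keyword in the page text (prefix match).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return [sum(v for w, v in counts.items() if w.startswith(kw)) / total
            for kw in KEYWORDS]

print(page_vector("Digital image manipulation software with photo filters."))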

Table 3
Selected keywords chosen as features of the data set

Category                                                     Keywords
Top:Shopping:Gifts:Personalized:Photo Transfers              Transfer, gift, photo*, logo
Top:Computers:Software:Graphics:Image Manipulation           Image, software, filter, digital, manipulat*
Top:News:Media:Journalism:Photojounalism:Music Photography   Concert, music, journal, promot*, portrait

Fig. 11. The results (membership grades) of the FCM clustering of web pages.

Although in our approach the features can be keywords, hyperlinks, images, animations, etc., in this experiment we consider just keyword-based features, in order to be compliant with the Dmoz statistical classification (being based on term-frequency analysis). Our test has been performed on 20 web pages per category, leading to 60 pages in total. For reference, we applied the standard FCM algorithm and partitioned these 60 pages into three clusters (Fig. 11). Fig. 11 portrays the distribution of the membership grades of the Web pages in each cluster: for the first 20 pages, the membership grades are the highest in cluster 3 (whitish bars), while they remain quite irrelevant in clusters 1 and 2 (denoted by black and grey bars, respectively). Analogous observations can be made for the pages belonging to the two other categories: pages 21–40 are represented by cluster 1 (although some pages of this category assume higher membership values in cluster 2), while pages 41–60 form the second cluster.


Fig. 12. Typical web pages (prototypes) for each cluster constructed by the FCM algorithm.

Table 4
Proximity values between selected pairs of web pages reflecting the user's evaluation

Web pages    Proximity
1, 16        0.8
1, 18        0.9
1, 21        0.1
7, 10        0.9
25, 42       0.1
40, 57       0
32, 22       0.8
22, 42       0.1
37, 59       0.1
37, 5        0.1
21, 38       0.9
33, 37       0.8

The Web navigation is in part a cognitive process that may be influenced by several factors. If the user wants to express proximity values between pairs of Web pages, these values can be established upon a personal judgment. In this situation, the P-FCM approach can be useful to capture the user's feedback and reflect its impact on reshaping the clusters. In our test, the user identifies several pairs of Web pages and assigns to them some proximity values, as summarized in Table 4. For instance, the proximity value for pages 1 and 18 is equal to 0.9, and this value underlines that the pages are very similar. On the other hand, page 1 is very different from page 21, and this is reflected by the very low proximity value (0.1) associated with this pair of pages; see Fig. 13.


Fig. 13. Web pages for which the user expressed proximity preferences (hints).

This user's feedback, conveyed in terms of proximity values, has an impact on the previous clustering results (refer to Fig. 11). Fig. 14 illustrates the results of the P-FCM. It is evident that some pages improve their membership in the right cluster. For instance, pages 1 and 21 are now in the right cluster with higher values. For comparative reasons, Figs. 12 and 15 visualize the prototypes (highest membership values in the corresponding clusters) produced by the FCM and P-FCM, respectively.

In many real-world cases, the user may disagree with a textual Web search classification: in our test, this situation may occur since the categorization has been realized considering only textual information and no other media (components of the hypertext) have been taken into consideration. Fig. 16 shows this scenario; here the user defines the following proximity values between pairs of Web pages, namely (53 8 0.9) and (8 20 0.1). The user's intention is to assert that pages 53 and 8 are very similar while page 20 is very dissimilar from page 8 (although both pages are in the same cluster); in fact, according to the user's evaluation, pages 53 and 8 are similar because they show photo galleries, while web page 20 is related to existing plugin tools. After the realization of the P-FCM, refer to the results shown in Fig. 17: page 20 moves to cluster 1 (black bars), because it assumes the highest membership in that cluster (circled in the figure), while page 8 remains in cluster 3 (whitish bars), as expected. Finally, page 53 does not change its cluster, but its membership value gets lowered. This effect follows the low proximity values imposed by the user.

6. Conclusions

The proximity-based FCM is an example of the general category of knowledge-based clustering in which we form a synergy between structural knowledge (that resides with the patterns themselves)

Fig. 14. The results of the P-FCM clustering.

and user-supplied knowledge (that is external to the data). In the derived optimization model, we assumed that the external source of knowledge (being a mechanism of partial supervision of the problem) supports the clustering activities without being affected by the results produced there. The impact of supervision is quantified via the intensity of the gradient-based optimization that involves the series of proximity-based guidance constraints (navigation hints). Depending upon the framework of the pattern recognition tasks, we can envision a more symmetrical process of interaction. More specifically, not only can the hints affect the process of searching for a structure in the data, but, through the reconciliation of the two views of the data, some clustering hints may also require adjustments (corrections) prior to their usage at the classification end. In this sense, we end up with a highly collaborative process between its unsupervised and supervised facets.

The role of the proximity-based clustering becomes crucial in cases where the original feature space may not fully capture the essence of the clustering problem. The case study of Web pages is an excellent example in this regard. While the proposed feature space addresses the facet of the textual content of the pages (in the form of a collection of keywords), the hypertext nature of


Fig. 15. Typical web pages (prototypes) for each cluster built by the P-FCM algorithm.

Fig. 16. Subjective evaluation of the proximity levels between selected web pages.

the pages (including relevant information about layout, graphical content, density of associated links, etc.) is not included directly but comes in the form of the user’s hints about degrees of proximity occurring between the pages.


Fig. 17. Results produced by the P-FCM; note a shift in the values of the membership grades.

Acknowledgements

Support from the Canada Research Chair Program (W. Pedrycz), the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Alberta Software Engineering Research Consortium (ASERC) is gratefully acknowledged.

References

[1] R. Baeza-Yates, R.B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Reading, MA, 1999.
[2] A. Bargiela, W. Pedrycz, Granular Computing: An Introduction, Kluwer Academic Publishers, Dordrecht, 2002.
[3] J.C. Bezdek, Pattern Recognition and Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[4] D. Boley, et al., Partitioning-based clustering for Web document categorization, Decision Support Systems 27 (1999) 329–341.
[5] A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, Syntactic clustering of the Web, Comput. Networks ISDN Systems 29 (1997) 1157–1166.
[6] J. Fürnkranz, Exploiting structural information for text classification on the WWW, Proc. 3rd Internat. Symp. Advances in Intelligent Data Analysis, 1999, pp. 487–498.
[7] D. Guillaume, F. Murtagh, Clustering of XML documents, Comput. Phys. Comm. 127 (2000) 215–227.


[8] R.J. Hathaway, J.C. Bezdek, NERF c-means: non-Euclidean relational fuzzy clustering, Pattern Recognition 27 (1994) 429–437.
[9] R.J. Hathaway, J.C. Bezdek, J.W. Davenport, On relational data versions of c-means algorithms, Pattern Recognition Lett. 17 (1996) 607–612.
[10] R.J. Hathaway, J.C. Bezdek, Y. Hu, Generalized fuzzy C-means clustering strategies using Lp norm distances, IEEE Trans. Fuzzy Systems 8 (5) (2000) 576–582.
[11] R.J. Hathaway, J.W. Davenport, J.C. Bezdek, Relational dual of the C-means clustering algorithms, Pattern Recognition 22 (2) (1989) 205–212.
[12] F. Höppner, Fuzzy shell clustering in image processing—fuzzy c-rectangular and two rectangular shells, IEEE Trans. Fuzzy Systems 5 (5) (1997) 599–613.
[13] F. Höppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster Analysis—Methods for Image Recognition, Wiley, New York, 1999.
[14] R. Krishnapuram, A. Joshi, O. Nasraoui, L. Yi, Low-complexity fuzzy relational clustering algorithms for web mining, IEEE Trans. Fuzzy Systems 9 (4) (2001) 595–607.
[15] B. Lazzerini, F. Marcelloni, Classification based on neural similarity, Electron. Lett. 38 (15) (2002) 810–812.
[16] W.S. Li, D. Agrawal, Supporting web query expansion efficiently using multi-granularity indexing and query processing, Data Knowledge Eng. 35 (2000) 239–257.
[17] S. Loh, L.K. Wives, J. Palazzo, Concept-based knowledge discovery from texts extracted from the web, ACM SIGKDD Explorations 2 (1) (2000) 29–40.
[18] S. Mitra, S.K. Pal, P. Mitra, Data mining in soft computing framework: a survey, IEEE Trans. Neural Networks 13 (1) (2002) 3–14.
[19] S. Miyamoto, Information clustering based on fuzzy multisets, Inform. Process. Management 39 (2) (2003) 195–213.
[20] W. Pedrycz, G. Succi, M. Reformat, P. Musilek, X. Bai, Expressing similarity in software engineering: a neural model, Proc. 2nd Internat. Workshop on Soft Computing Applied to Software Engineering, Enschede, The Netherlands, February 2001.
[21] D. Roussinov, J.L. Zhao, Automatic discovery of similarity relationships through Web mining, Decision Support Systems 35 (1) (2003) 149–166.
[22] T.A. Runkler, J.C. Bezdek, Alternating cluster estimation: a new tool for clustering and function approximation, IEEE Trans. Fuzzy Systems 7 (4) (1999) 377–393.
[23] T.A. Runkler, J.C. Bezdek, Web mining with relational clustering, Internat. J. Approx. Reason. 32 (2003) 217–236.
[24] E. Ruspini, A new approach to clustering, Inform. and Control 15 (1) (1969) 22–32.
[25] K.A. Smith, A. Ng, Web clustering using a self-organizing map of user navigation patterns, Decision Support Systems 35 (2) (2003) 245–256.
[26] R.L. Walker, Search engine case study: searching the web using genetic programming and MPI, Parallel Computing 27 (2001) 71–89.