Knowledge-Based Systems 32 (2012) 91–100
An efficient incremental method for generating equivalence groups of search results in information retrieval and queries

Jin Zhang, Qiang Wei*, Guoqing Chen
School of Economics and Management, Tsinghua University, Beijing 100084, China
Article history: Available online 30 August 2011
Keywords: Intelligent search; Decision support; Incremental method; Transitive closure; Grouping
Abstract

Today's widespread web applications bring many challenges to decision support systems (DSS) research on effectively retrieving useful information from online data sources of huge volume. Importantly, in a web search and service environment where the scale of data is dynamically expanding, the grouping of search results becomes a crucial issue for DSS functionality and service. This paper proposes an intelligent method that generates equivalence groups (classes) in an incremental manner, so as to deal with the evolving nature of the data in web search. Such equivalence groups are derived from λ-cuts of the transitive closure of a closeness matrix over the search elements. The proposed incremental method does not need to redo the whole grouping procedure each time the overall search outcome changes, as is common in real applications; rather, it captures only the changes and the related elements, so that the computation is minimized in both time and space complexity. Theoretical analysis and data experiments show the advantage and effectiveness of the proposed incremental method.
1. Introduction

Due to the rapidly increasing need to efficiently and effectively organize and utilize data of huge volume, intelligent search/query techniques nowadays play a more and more important role in supporting the decisions of customers and companies [56,57,61]. For example, an online recommendation system based on such intelligent techniques may provide customers with a service that links a purchase intention (e.g., expressed via search keywords) with relevant products or advertisements. Apparently, a company providing this service has a good chance to facilitate customers' decision making through a precise-marketing strategy, which is deemed desirable for business. In this context, we are in the realm of decision support systems (DSS), where online applications are becoming more pervasive in our daily lives and decision makers rely increasingly on data and web-based services. The exemplified online recommendation system is a case of DSS with search as one of its key components for access to internal and external information/data [55,62]. In other words, the system is a service platform that helps retrieve useful information for effective decision making in web-search environments.

Traditionally, to evaluate the quality of search results for decision support, two well-known measures for information retrieval are widely used, namely Precision and Recall [43].
Briefly, Precision measures the accuracy of the search results, i.e., the fraction of results satisfying the criteria among all returned results, while Recall is the fraction of results satisfying the criteria among all the results in the data source that actually satisfy the criteria. An intelligent search technique may be regarded as more preferable if it attains higher Precision and Recall degrees.
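These two measures can be restated compactly in set notation (a standard formulation consistent with [43]; the symbols R for the set of returned results and S for the set of results in the data source satisfying the criteria are chosen here for illustration, not taken from the paper):

$$\text{Precision} = \frac{|R \cap S|}{|R|}, \qquad \text{Recall} = \frac{|R \cap S|}{|S|}.$$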
Moreover, other combined measures have also been proposed to evaluate the quality of search results [44,61].

In real-world search applications for decision support, the volume of search results can be quite huge. For instance, Google typically reports that millions of indexed web pages match the keywords a user inputs, with hundreds of pages of results available for display. Though all such search results satisfy the search criteria (i.e., with high Precision), users usually prefer only a small set of results in the overall search outcome, often those appearing in the first couple of pages. Thus, the quality of this small set of search results significantly impacts the related decision making as well as the service quality, and is therefore of great interest to both academia and practitioners. Notably, targeting different decision-making processes, search techniques usually evaluate search results with certain quality measures, e.g., the number of comments/visits/recommendations, freshness in time, the PageRank measure, etc. [45–48], so as to rank the results by their "importance" and "hotness", with the top-ranked results then displayed in the first several pages to support decision making.

While such top-ranked approaches are generally considered useful (e.g., generating a high degree of mutual similarity in results), they still have limitations. First, by browsing only the first several pages, users can hardly grasp the overall information of the whole result set; in other words, the first several pages may not sufficiently cover the information carried by the overall search outcome. This problem becomes severe when applications require a search/query that captures a compact set of results reflecting the totality and diversity of information in the overall outcome, especially in enterprise decision-making environments [49–51]. Second, it is not uncommon that the top-ranked results are redundant in content, because a data source may contain many similar or duplicated materials, and redundant results are usually ranked together [49–51]. In this case, users may be presented with results containing redundant content, which not only affects users' search experience but also significantly decreases the quality of search results for decision support.

In addressing the above issue of the totality and diversity of a small set of search results, grouping the results in the overall outcome is deemed useful and essential, and clustering/grouping methods play an important role here [18–26]. Clustering is the assignment of a set of data objects into subsets (called groups) such that data objects in the same group are similar in some sense, while data objects from different groups are dissimilar. In this way, the overall outcome of a search can be represented by the groups or by group representatives. That is, given n search results, a clustering method could generate k groups, and from each group one search result can be extracted as a representative, e.g., the one with the largest average similarity to all other results in the same cluster. The derived set of k representative search results, called a representative set, can then cover the whole range of information in the overall search outcome while carrying less redundancy (e.g., lower similarity between groups).

Furthermore, it is worth noticing that the quality of a representative set relies upon how effectively and efficiently the search results are grouped/clustered. Effectiveness refers to the property that any two search results in one group are sufficiently similar (e.g., with similarity no less than a threshold λ), while any two results from two distinct groups are sufficiently dissimilar (i.e., with similarity less than λ). Efficiency refers to the computational cost of handling new results in search updates so as to avoid re-generating all groups each time; this is particularly meaningful and important in a dynamic decision-support environment with massive data sources.

The focal point of this work is grouping. The method proposed in this paper aims at (1) grouping the results into equivalence classes in terms of λ-closeness, which serves as a basis for a representative set (composed of one result from each equivalence class) with high information coverage and low information redundancy (due to the notion of equivalence classes); and (2) generating the equivalence classes in an incremental manner, so that the time and space complexity can be improved.

The paper is organized as follows. Section 2 reviews related work on clustering methods and discusses grouping methods based on equivalence classes derived from fuzzy logic.
An iterative updating strategy for calculating the transitive closure is illustrated in Section 3. Section 4 discusses the incremental method for grouping search results in detail. The algorithmic details and the computational complexity analysis on time and space are presented in Section 5. Section 6 reports scalability experiments examining the efficiency of the proposed algorithm, along with a system implementation framework for the method.
2. Related work

Clustering methods [18–26,58–60] are widely used for large-scale data grouping in decision making, e.g., in marketing [16], business operations [11], and healthcare applications [17]. Clustering methods can be classified into several categories, namely partitioning, hierarchical, density-based, grid-based, model-based, link-based and fuzzy relation-based methods [1–7,18], among others. Generally speaking, data objects within the same group are more similar to each other than to objects in different groups [27,28], which is the common principle underlying clustering processes.

A promising line of thinking in this direction is to explore ways to generate groups, each containing data objects that are mutually "equivalent". This can be done by specifying a threshold degree λ ($0 \le \lambda \le 1$) such that any pair of data objects within the same group is close to a degree no less than λ, while data objects in different groups are distinct to a degree less than λ, giving rise to the so-called similarity-driven equivalence relation method [8–13]. The basic idea of the method is to compute a fuzzy max–min transitive closure from a given closeness matrix. Here, both the max–min transitive closure and the closeness matrix are fuzzy relations: the former is a similarity relation (reflexive, symmetric and max–min transitive) and the latter is a closeness relation (reflexive and symmetric), as discussed further below. An equivalence relation can then be obtained by applying a λ-cut operation to the transitive closure. Thus, the equivalence relation can be used to group the data into equivalence classes with three desirable features: (1) any pair of data objects within the same group is equivalent at degree λ; (2) data objects from different groups are mutually distinct at degree λ; and (3) the number of equivalence groups can be flexibly determined by setting different levels of λ according to decision makers' needs [10].

Concretely, given a set D containing n data objects, any two objects can be close to each other to some degree. For two data objects $d_i, d_j \in D$ (i, j = 1, 2, ..., n), the closeness degree between $d_i$ and $d_j$, denoted $e_{ij}$, can be measured in various ways, such as cosine similarity for web documents or distance-based closeness for imprecise databases [34–39]. Consequently, an $n \times n$ closeness matrix is obtained as $M = (e_{ij})_{n \times n}$. The corresponding transitive closure $M^+ = (e^+_{ij})_{n \times n}$ can be computed by a series of max–min compositions of M [15], with $M^+ = M^p = M^{p+1}$ for some $p \ge 1$; moreover, $M^+$ converges within $n - 1$ compositions, i.e., $M^+ = M^p$ with $p \le n - 1$ [40]. Furthermore, given threshold λ ($0 \le \lambda \le 1$), the λ-cut matrix $M^+_\lambda = (e^\lambda_{ij})_{n \times n}$ of $M^+$ is defined by $e^\lambda_{ij} = 1$ if $e^+_{ij} \ge \lambda$, and $e^\lambda_{ij} = 0$ otherwise. That is, the closeness matrix M and the transitive closure $M^+$ can be regarded as fuzzy binary relations, while the λ-cut matrix $M^+_\lambda$ is a crisp binary relation.

Generally, a fuzzy binary relation R on domain X is defined as a mapping from $X \times X$ to [0, 1]. R is reflexive if $R(x, x) = 1$ for all $x \in X$; symmetric if $R(x, y) = R(y, x)$ for all $x, y \in X$; and max–min transitive if $R(x, z) \ge \max_{y \in X} \min\{R(x, y), R(y, z)\}$ for all $x, y, z \in X$. Thus, R is a closeness relation if it is reflexive and symmetric, and a similarity relation if it is reflexive, symmetric and max–min transitive [11].
It can be further deduced that the closeness matrix M is a closeness relation (i.e., reflexive and symmetric) and the closure $M^+$ is a similarity relation (i.e., reflexive, symmetric and max–min transitive) [9]. Notably, a similarity relation is a special case of a closeness relation. Moreover, $M^+$ is the smallest max–min similarity relation that includes M: for any similarity relation $B = (b_{ij})_{n \times n}$ including M (i.e., $e_{ij} \le b_{ij}$ for all i and j), we have $e_{ij} \le e^+_{ij} \le b_{ij}$ for all i, j = 1, 2, ..., n [15]. Finally, with the λ-cut matrix $M^+_\lambda$, if $e^\lambda_{ij} = 1$, data objects $d_i$ and $d_j$ are clustered into the same group. In so doing, D is clustered into a family of k disjoint groups, say $\{C_1, C_2, \ldots, C_k\}$.
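As a concrete illustration of this classical route (closeness matrix → max–min transitive closure → λ-cut → groups), the following is a minimal Python sketch, assuming a NumPy closeness matrix with entries in [0, 1]; the function names are illustrative, and the paper's own implementation is in C++:

```python
import numpy as np

def max_min_compose(A, B):
    # (A ∘ B)[i, j] = max_m min(A[i, m], B[m, j])
    n = A.shape[0]
    return np.array([[np.minimum(A[i, :], B[:, j]).max()
                      for j in range(n)] for i in range(n)])

def transitive_closure(M):
    # Iterate M, M^2, M^4, ... until the fixpoint M+ = M^p = M^(p+1);
    # for a reflexive, symmetric M this converges within n - 1 compositions.
    closure = M
    while True:
        nxt = max_min_compose(closure, closure)
        if np.array_equal(nxt, closure):
            return closure
        closure = nxt

def lambda_cut_groups(M_plus, lam):
    # lambda-cut: pairs with e+_ij >= lam fall into the same equivalence class.
    n = M_plus.shape[0]
    labels = list(range(n))
    for i in range(n):
        for j in range(n):
            if M_plus[i, j] >= lam and labels[j] != labels[i]:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    groups = {}
    for idx, l in enumerate(labels):
        groups.setdefault(l, []).append(idx)
    return list(groups.values())
```

This is the O(n²)-space baseline whose cost the remainder of the paper seeks to avoid.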
While generating equivalence groups is considered desirable, the computational complexity is of concern as far as massive data pertain in real applications. Through various efforts [14,15,29–33,52–54], the complexity has been reduced from O(n⁵) to O(n²) in time and O(n²) in space [15]. It is worth noticing that these levels of complexity need to be further improved in the context of large data scales. First, regarding the space complexity O(n²), the computation may be restricted by the storage of the transitive closure matrix in memory [15]. For example, with $10^6$ data objects the closure matrix has $10^{12}$ entries; at 4 bytes per entry, the method needs almost 4 TB of memory, which is often infeasible on commonly used computational platforms. Second, when a new data object is incorporated, re-calculating the whole transitive closure matrix is inefficient, costing an additional O(n²) in both time and space, which is simply inapplicable in a frequently evolving search environment with rapid information updating. In other words, an efficient incremental strategy has to be investigated. The purpose of this paper is to introduce such an efficient method, as well as an algorithm, to incrementally group search results based on equivalence classes.

3. Generating $M^+_\lambda$ via iterative updating

This section presents a strategy to generate the matrix $M^+_\lambda$ in an iteratively updating manner; it can also be proved that the $M^+_\lambda$ generated by this method is equivalent to that of the classical methods.

Given a set D of n data objects with its corresponding closeness matrix M and threshold λ, let $W_\lambda$ be the set of pairs (i, j) representing data pairs $d_i$ and $d_j$ whose transitive similarities satisfy $e^+_{ij} \ge \lambda$. Initially, $W_\lambda = \emptyset$. Next, scan matrix M: if $e_{ij} \ge \lambda$ for some (i, j), update $W_\lambda = W_\lambda \cup \{(i, j)\}$; otherwise keep $W_\lambda$ unchanged. All the pairs collected in this way are called seeds, i.e., $W_\lambda = \{(i, j) \mid e_{ij} \ge \lambda,\ i, j = 1, 2, \ldots, n\}$. Based on the seeds, $W_\lambda$ can be iteratively updated according to the following properties. Note that, due to the transitivity of the transitive closure $M^+ = (e^+_{ij})_{n \times n}$ [10,30,32,33], for any i, j, m = 1, 2, ..., n we have $e^+_{ij} \ge \max_{1 \le m \le n} \min(e^+_{im}, e^+_{mj})$; hence, if $e^+_{im} \ge \lambda$ and $e^+_{mj} \ge \lambda$, then $e^+_{ij} \ge \lambda$. Moreover, due to the symmetry of M and $M^+$ ($e_{ij} = e_{ji}$ and $e^+_{ij} = e^+_{ji}$), if $e^+_{ij} \ge \lambda$ it can be further inferred that $e^+_{ji} \ge \lambda$. Therefore, the following inference properties hold:

i. If $(i, m) \in W_\lambda$ and $(m, j) \in W_\lambda$, then $(i, j) \in W_\lambda$.
ii. If $(i, j) \in W_\lambda$, then $(j, i) \in W_\lambda$.

With these two properties, the inference can be performed by iteratively updating $W_\lambda$. When $W_\lambda$ can no longer be updated, denoted $W^+_\lambda$, the process terminates. The following theorem (Theorem 1) guarantees that $W^+_\lambda$ is equivalent to $M^+_\lambda$.

Theorem 1. Let $W^0_\lambda = \{(i, j) \mid e^\lambda_{ij} = 1,\ e^\lambda_{ij} \in M^+_\lambda,\ i, j = 1, 2, \ldots, n\}$ be the set of pairs (i, j) with $e^\lambda_{ij} = 1$ in $M^+_\lambda$, and let $W^+_\lambda$ be the set generated by the proposed iterative updating method. Then $W^+_\lambda$ is equivalent to $W^0_\lambda$, i.e., $W^+_\lambda = W^0_\lambda$.

Proof 1.

(1) Proof of $W^+_\lambda \subseteq W^0_\lambda$. For any $(i, j) \in W^+_\lambda$, there are two possible cases. In case 1, (i, j) is from the set of seeds. By the definition of $M^+$, $e^+_{ij} \ge e_{ij}$; since $e_{ij} \ge \lambda$, we have $e^+_{ij} \ge \lambda$ and $(i, j) \in W^0_\lambda$. In case 2, (i, j) is iteratively
derived from the seeds via the inference properties. Thus there exists some m = 1, ..., n with $e^+_{im} \ge \lambda$ and $e^+_{mj} \ge \lambda$, or $e^+_{ji} \ge \lambda$; then $e^+_{ij} \ge \lambda$ and $(i, j) \in W^0_\lambda$. Therefore, all pairs (i, j) inferred from the seeds belong to $W^0_\lambda$.

(2) Proof of $W^0_\lambda \subseteq W^+_\lambda$. First, define $W = W^0_\lambda - W^+_\lambda$ to contain the pairs (i, j) with $(i, j) \in W^0_\lambda$ and $(i, j) \notin W^+_\lambda$. If $W = \emptyset$, the proposition is proved. If $W \ne \emptyset$, we can construct a matrix $B = (b_{ij})_{n \times n}$ such that $b_{ij} = e^+_{ij}$ if $(i, j) \notin W$, and $b_{ij} = x$ otherwise, where $x = \max(e_{ij}, e^+_{pq})$ taken over $(i, j) \in W$ and $(p, q) \notin W^0_\lambda$. For any pair $(i, j) \in W$ we know that $e_{ij} < \lambda$, otherwise it could be inferred that $(i, j) \in W^+_\lambda$, a contradiction. Since $e^+_{pq} < \lambda$ for $(p, q) \notin W^0_\lambda$, we have $b_{ij} = x < \lambda$ for every pair $(i, j) \in W$.

Proof that $M^+$ includes B: for any pair $(i, j) \in W$, we have $b_{ij} < \lambda \le e^+_{ij}$; for any pair $(i, j) \notin W$, we have $b_{ij} = e^+_{ij}$. Thus $M^+$ includes B.

Proof that B includes M: by the definition of $M^+$, for any pair $(i, j) \notin W$ we have $b_{ij} = e^+_{ij} \ge e_{ij}$; by the definition of B (and of x), for any pair $(i, j) \in W$ we have $b_{ij} \ge e_{ij}$. Thus B includes M.

Proof of reflexivity: since $e_{ii} = 1 \ge \lambda$, we have $(i, i) \in W^+_\lambda$, so $b_{ii} = e^+_{ii} = 1$ by the definition of B. Reflexivity is satisfied.

Proof of symmetry: (i) For any pair $(i, j) \notin W^0_\lambda$, also $(j, i) \notin W^0_\lambda$ due to the symmetry of $M^+$. Therefore $b_{ij} = e^+_{ij}$ and $b_{ji} = e^+_{ji}$; since $e^+_{ij} = e^+_{ji}$, $b_{ij} = b_{ji}$. (ii) For any pair $(i, j) \in W^0_\lambda$, also $(j, i) \in W^0_\lambda$ due to the symmetry of $M^+$, and there are two possible cases: $(i, j) \in W^+_\lambda$ or $(i, j) \in W$. In case 1, i.e., $(i, j) \in W^+_\lambda$, the corresponding pair $(j, i) \in W^+_\lambda$ according to the inference properties; thus $b_{ij} = e^+_{ij} = e^+_{ji} = b_{ji}$. In case 2, i.e., $(i, j) \in W$, the pair (j, i) must also belong to W, for otherwise $(j, i) \in W^+_\lambda$ would imply $(i, j) \in W^+_\lambda$ (by the inference properties), a contradiction; thus $b_{ij} = b_{ji} = x$. Symmetry is satisfied.

Proof of transitivity: for any pair $(i, j) \notin W$, $b_{ij} = e^+_{ij}$. Since $e^+_{ij} \ge \max_{1 \le m \le n} \min(e^+_{im}, e^+_{mj})$ for any m, and $M^+$ includes B, we have $b_{ij} = e^+_{ij} \ge \max_{1 \le m \le n} \min(e^+_{im}, e^+_{mj}) \ge \max_{1 \le m \le n} \min(b_{im}, b_{mj})$. If $(i, j) \in W$, then $b_{ij} = x$. For any m, the pairs (i, m) and (m, j) cannot both belong to $W^+_\lambda$, otherwise it could be inferred that $(i, j) \in W^+_\lambda$, which is contradictory; so at least one of them lies in W (B-value x) or outside $W^0_\lambda$ (B-value some $e^+_{pq} \le x$). Therefore, for any m, $\min(b_{im}, b_{mj}) \le x$ and $b_{ij} = x \ge \max_{1 \le m \le n} \min(b_{im}, b_{mj})$. Transitivity is satisfied.

Hence, matrix B (including M and included in $M^+$) is a similarity relation. Because $M^+$ is the smallest max–min similarity relation that includes M, as introduced in Section 2, B = $M^+$. However, for any pair $(i, j) \in W$, $b_{ij} < \lambda \le e^+_{ij}$, which is contradictory. Therefore W must be $\emptyset$, meaning that $W^+_\lambda = W^0_\lambda$. As a conclusion, $W^+_\lambda$ is equivalent to $W^0_\lambda$. □

Theorem 1 guarantees that obtaining $W^+_\lambda$ with the iterative updating strategy is equivalent to obtaining $M^+_\lambda$ by computing the transitive closure matrix with classical methods. In other words, given a specified λ, grouping based on the $W^+_\lambda$ obtained by iterative updating is equivalent to grouping through $M^+_\lambda$. Moreover, the iterative updating process can be performed step by step, as shown in Fig. 1. The process shown in Fig. 1 handles new data objects and performs inference one object at a time. For example, suppose k data objects have finished the processes of seed collection and inference.
For a new (k+1)th data object, the closeness degrees between the (k+1)th object and the other k objects are scanned first. Then only the pairs (i, k+1) with $e_{i(k+1)} \ge \lambda$ are added as new seeds to $W_\lambda$. After collecting the new seeds, the inference process likewise only processes and deduces new pairs whose inferred closeness degrees are no less than λ. In so doing, computational time and memory consumption can be significantly reduced. For illustrative purposes, let us consider Example 1 as follows.
Fig. 1. The process of step-by-step iterative updating.
Fig. 2. Closeness matrix M, transitive closure $M^+$ and λ-cut matrix $M^+_\lambda$ of Example 1.
Example 1. Suppose someone issues a search/query on the Internet or against a database. For simplicity, 7 data objects $d_1, d_2, \ldots, d_7$ (e.g., documents, files, transactions, etc.) are retrieved. The closeness matrix, the transitive closure and the λ-cut matrix (λ = 0.8) are as shown in Fig. 2. In the straightforward manner of the classical methods [15], the matrix $M^+_\lambda$ would be stored in O(n²) (i.e., 7 × 7) memory and the time complexity would be O(n²) as well. With the iterative updating strategy, however, $W^+_\lambda$ is obtained step by step:

(1) Initialization: the initial data set is $D = \{d_1\}$ and $W^+_\lambda = \{(1, 1)\}$.
(2) Add $d_2$ to D, i.e., $D = \{d_1, d_2\}$. New seed {(2, 2)} is added to $W^+_\lambda$; no new pairs are inferred.
(3) Add $d_3$ to D, i.e., $D = \{d_1, d_2, d_3\}$. New seed {(3, 3)} is added to $W^+_\lambda$; no new pairs are inferred.
(4) Add $d_4$ to D, i.e., $D = \{d_1, \ldots, d_4\}$. New seeds {(1, 4), (3, 4), (4, 1), (4, 3), (4, 4)} are added to $W^+_\lambda$, and new inferred pairs {(1, 3), (3, 1)} are added to $W^+_\lambda$.
(5) Add $d_5$ to D, i.e., $D = \{d_1, \ldots, d_5\}$. New seeds {(3, 5), (5, 3), (5, 5)} are added to $W^+_\lambda$, and new inferred pairs {(1, 5), (4, 5), (5, 1), (5, 4)} are added to $W^+_\lambda$.
(6) Add $d_6$ to D, i.e., $D = \{d_1, \ldots, d_6\}$. New seed {(6, 6)} is added to $W^+_\lambda$; no new pairs are inferred.
(7) Add $d_7$ to D, i.e., $D = \{d_1, \ldots, d_7\}$. New seeds {(7, 6), (6, 7), (2, 7), (7, 2), (7, 7)} are added to $W^+_\lambda$, and new inferred pairs {(2, 6), (6, 2)} are added to $W^+_\lambda$.
(8) There is no new data object, so the process terminates.
(9) Finally, $W^+_\lambda$ is obtained, which is just the same as $W^0_\lambda$ = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, 1), (3, 4), (3, 5), (4, 1), (4, 3), (4, 5), (5, 1), (5, 3), (5, 4), (6, 2), (6, 7), (7, 2), (7, 6)} corresponding to $M^+_\lambda$.

This example reveals that the memory allocation can be significantly reduced, that no computation is wasted on values smaller than λ, and that obtaining $W^+_\lambda$ with the iterative updating strategy is equivalent to obtaining $W^0_\lambda$ through computing the transitive closure matrix.
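The seed-collection and inference steps of this per-object update can be sketched in code (a minimal Python illustration, whereas the paper's implementation is in C++; the function name update_w_lambda and its argument layout are ours, chosen for exposition):

```python
def update_w_lambda(W, closeness_to_new, new_id, lam):
    """Incorporate one new object into the pair set W (a set of (i, j) tuples).

    closeness_to_new maps an existing object id i to e_{i, new_id}.
    Only pairs at or above the threshold lam are ever stored or inferred,
    mirroring the seed-collection and inference steps of Section 3.
    """
    W.add((new_id, new_id))                      # reflexivity: e_ii = 1 >= lam
    for i, e in closeness_to_new.items():
        if e >= lam:                             # new seed
            W.add((i, new_id))
            W.add((new_id, i))                   # property ii: symmetry
    # property i: transitivity-based inference, repeated until no change
    changed = True
    while changed:
        changed = False
        for (i, m) in list(W):
            for (m2, j) in list(W):
                if m == m2 and (i, j) not in W:
                    W.add((i, j))
                    W.add((j, i))
                    changed = True
    return W
```

Calling this function once per arriving object reproduces steps (1)–(9) of Example 1; the repeated scan over W is deliberately naive and only meant to make properties i and ii explicit.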
Additionally, the example also shows that, in order to obtain the final equivalence groups, it is not necessary to compute the transitive closure in advance; the groups can instead be generated step by step from the pairs in $W^+_\lambda$ obtained by the iterative updating strategy. Driven by this idea, the next section introduces an incremental grouping method for generating equivalence groups.

4. An incremental grouping method

As discussed in the previous section, the transitive closure can be obtained step by step using the iterative updating strategy, which can be further incorporated into the grouping process. Initially, assign each $d_i \in D$ a group label $C_i = i$. Let $D'$ be the set of data objects that have already been assigned group labels, and let u be the number of objects in $D'$. Originally, $D' = \emptyset$ and u = 0. Without loss of generality, suppose $D' = \{d_1, d_2, \ldots, d_u\}$ (u < n), assigned labels $C_1, C_2, \ldots, C_u$, respectively. For $d_i$ and $d_j$ (i, j = 1, 2, ..., u), if $C_i = C_j$, then $d_i$ and $d_j$ have been clustered into the same group, i.e., there exists a pair $(i, j) \in W^+_\lambda$ with $e^+_{ij} \ge \lambda$. For the (u+1)th object $d_{u+1}$, clearly $d_{u+1} \in D - D'$. Two operations should be carried out: one is to assign an appropriate group label to the (u+1)th object; the other is to update the group labels of the u existing objects when necessary.

Concretely, first, for $d_{u+1}$: if $e_{i(u+1)} < \lambda$ for every $d_i \in D'$, then $d_{u+1}$ brings no new seeds into $W^+_\lambda$, and no pair containing $d_{u+1}$ can be inferred in $W^+_\lambda$ with the properties stated in Section 3. As a result, $d_{u+1}$ cannot be assigned any of $C_1, C_2, \ldots, C_u$, so its group label $C_{u+1}$ is kept unchanged. Otherwise, $d_{u+1}$ brings at least one seed, i.e., there exists some $d_i \in D'$ with $e_{i(u+1)} \ge \lambda$; then update $d_{u+1}$ with $C_{u+1} = C_i$ and add the corresponding new seed (i, u+1) to $W^+_\lambda$.

Second, some new pairs may be inferred from the newly added seeds; that is to say, the group labels of the first u objects may need to be updated. Without loss of generality, suppose $d_{u+1}$ brings v new seeds, e.g., (1, u+1), (2, u+1), ..., (v, u+1), $v \le u$, which means there exist v data objects $d_1, d_2, \ldots, d_v$ in $D'$ whose closeness to $d_{u+1}$ is no less than λ. Notably, these v data objects may not all carry the same group label, i.e., some pairs (i, j), i, j = 1, 2, ..., v, may not yet exist in $W^+_\lambda$. With the increment of $d_{u+1}$, it can be inferred that $(i, j) \in W^+_\lambda$ (according to the properties in Section 3), which means that the group labels of $d_i$ and $d_j$ should be updated to the same group label as that of $d_{u+1}$. In addition, consider the other u - v data objects in $D'$ whose closeness to $d_{u+1}$ is less than λ: if some of them carry the same group label as one of $d_1, d_2, \ldots, d_v$, which means $W^+_\lambda$ contains pairs (i, j), i = 1, 2, ..., v, j = v+1, ..., u, their group labels should also be updated to that of $d_{u+1}$ (again according to the properties in Section 3). In this case, a trace operation is performed over all u data objects in $D'$, and a minimal-indexed group label is assigned to the objects whose group labels need to be updated.

Next, set $D' = D' \cup \{d_{u+1}\}$ and u = u + 1 to deal with the next object incrementally. When u = n, the procedure terminates and the final groups are obtained. Example 2 (following the sketch below) illustrates the whole procedure of this incremental grouping.
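A minimal sketch of this per-object update, assuming group labels are kept in a plain Python list (add_object is an illustrative name; the paper's authoritative pseudo-code appears in Fig. 4):

```python
def add_object(labels, closeness_row, lam):
    """Assign a group label to the new object d_{u+1} and update existing labels.

    labels: current group labels of d_1..d_u (labels[i] is the label of d_{i+1})
    closeness_row: closeness degrees e_{1,u+1}, ..., e_{u,u+1}
    Returns the updated label list including the new object.
    """
    u = len(labels)
    # Labels of existing objects lambda-close to d_{u+1} (the "trace" set)
    trace = {labels[i] for i in range(u) if closeness_row[i] >= lam}
    if not trace:
        labels.append(u)          # no seeds: d_{u+1} starts its own group C_{u+1}
        return labels
    min_group = min(trace)        # minimal-indexed group label
    # Trace operation: merge every touched group into min_group
    labels = [min_group if l in trace else l for l in labels]
    labels.append(min_group)
    return labels
```

The set comprehension plays the role of the trace array: it collects the labels of all λ-close objects, and the subsequent list rewrite merges every touched group into the minimal-indexed one.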
Example 2. Given the 7 query results of Example 1, with threshold λ = 0.8 and the proposed incremental grouping method, the procedure of computation is as shown in Fig. 3.

In this example, initially, $d_1, d_2, \ldots, d_7$ are assigned group labels 1, 2, ..., 7, respectively. A variable min_group stores the current minimal-indexed group label for incremental grouping, and the symbol '*' indicates that the group label of the corresponding data object needs to be updated by a trace operation.

When u = 0, add $d_1$; its group label does not need to be updated; set $D' = \{d_1\}$ and u = 1. When u = 1, add $d_2$. Since $e_{12} = 0.11 < \lambda$, the group label of $d_2$ does not need to be updated; set $D' = \{d_1, d_2\}$ and u = 2. When u = 2, add $d_3$. Since $e_{13} = 0.79 < \lambda$ and $e_{23} = 0.13 < \lambda$, the group label of $d_3$ does not need to be updated; set $D' = \{d_1, d_2, d_3\}$ and u = 3. When u = 3, add $d_4$. Since $e_{14} = 0.92 > \lambda$, $e_{24} = 0.15 < \lambda$ and $e_{34} = 0.86 > \lambda$, min_group is assigned the minimal group label among those of $d_1$ and $d_3$ (i.e., min_group = 1), and $d_4$ receives min_group; in addition, the group label of $d_3$ needs to be updated by tracing. After tracing $D'$, update the group label of $d_3$ to min_group; set $D' = \{d_1, \ldots, d_4\}$ and u = 4. When u = 4, add $d_5$. Since $e_{15} = 0.77 < \lambda$, $e_{25} = 0.22 < \lambda$, $e_{35} = 0.88 > \lambda$ and $e_{45} = 0.70 < \lambda$, min_group is assigned the group label of $d_3$ (i.e., min_group = 1); update $d_5$ with min_group; set $D' = \{d_1, \ldots, d_5\}$ and u = 5. When u = 5, add $d_6$. Since $e_{16} = 0.31 < \lambda$, $e_{26} = 0.78 < \lambda$, $e_{36} = 0.25 < \lambda$, $e_{46} = 0.11 < \lambda$ and $e_{56} = 0.21 < \lambda$, the group label of $d_6$ does not need to be updated; set $D' = \{d_1, \ldots, d_6\}$ and u = 6. When u = 6, add $d_7$. Since $e_{17} = 0.35 < \lambda$, $e_{27} = 0.90 > \lambda$, $e_{37} = 0.25 < \lambda$, $e_{47} = 0.12 < \lambda$, $e_{57} = 0.19 < \lambda$ and $e_{67} = 0.95 > \lambda$, min_group is assigned the minimal-indexed group label among those of $d_2$ and $d_6$ (i.e., min_group = 2), and $d_7$ receives min_group; by tracing $D'$, update the group label of $d_6$ to min_group; set $D' = \{d_1, \ldots, d_7\}$. When u = 7 = n, the procedure terminates with two final groups, namely $C_1 = \{d_1, d_3, d_4, d_5\}$ and $C_2 = \{d_2, d_6, d_7\}$.

Fig. 3. The procedure of incremental grouping.
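Feeding the closeness degrees quoted in this walkthrough to the add_object sketch above reproduces the same grouping (labels are 0-indexed in the code):

```python
# Closeness degrees e_ij quoted in Example 2 (lambda = 0.8); row i lists
# e_{1,i+1}, ..., e_{i,i+1} for the (i+1)th arriving object.
rows = [
    [],                                      # d1
    [0.11],                                  # e_12
    [0.79, 0.13],                            # e_13, e_23
    [0.92, 0.15, 0.86],                      # e_14, e_24, e_34
    [0.77, 0.22, 0.88, 0.70],                # e_15 .. e_45
    [0.31, 0.78, 0.25, 0.11, 0.21],          # e_16 .. e_56
    [0.35, 0.90, 0.25, 0.12, 0.19, 0.95],    # e_17 .. e_67
]
labels = []
for row in rows:
    labels = add_object(labels, row, lam=0.8)
print(labels)   # [0, 1, 0, 0, 0, 1, 1] -> C1 = {d1,d3,d4,d5}, C2 = {d2,d6,d7}
```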
5. Algorithmic details

In this section, the algorithmic details of the incremental grouping method are presented, along with a discussion of the space and time complexity. Fig. 4 provides the pseudo-code of the algorithm. In this algorithm, Step 1 scans $D'$ and records, in the trace array, the group labels of objects whose closeness degrees are no less than λ. Step 2 retrieves the minimal-indexed group label in the trace array by sorting its labels with the Heap Sort algorithm, an efficient sorting algorithm suitable for large-scale data volumes [41]. Step 3 traces the objects and updates their group labels when necessary, where a binary search strategy is adopted to improve search efficiency.

In order to examine the contributions of the Proposed method, we compare it with the Straight method [15] (i.e., the straightforward way used in classical methods: generate the whole transitive closure matrix and apply the λ-cut) on space complexity and time complexity, respectively.

On space complexity, the Straight method requires O(n²) memory allocation for the transitive closure matrix. The Proposed method needs only two O(n) allocations, for storing the groups and trace arrays, which is significantly more efficient than the Straight method [15].

On time complexity, the Straight method consists of three major steps: the closeness matrix calculation costs O(n²) time; the transitive closure computation costs at least O(n²) time [10,15]; and the λ-cut matrix calculation and grouping cost O(n²). Thus, the overall time complexity of the Straight method is O(n²): concretely, O(n²) for the closeness matrix computation plus O(n²) for the other computations. For the Proposed method, the outer loop has O(n) time complexity. Step 1 costs O(n), which over the loop amounts to O(n²), corresponding to the closeness matrix computation in the Straight method. The Heap Sort in Step 2 costs O(n log n) time on average (O(1) in the best case, O(n log n) in the worst case). Step 3 likewise costs O(n log n) on average (O(1) in the best case, O(n log n) in the worst case) for the trace operation. Thus, the overall time complexity of the Proposed method is O(n² log n) on average. However, in real applications λ is usually required to be large, e.g., λ ≥ 0.8 or higher, so the number of values in the trace array to be sorted is always small; Steps 2 and 3 are then close to their best cases, at O(n) overall per outer iteration, leading to a better overall time complexity than that of the Straight method.

It is important to note that the Straight method, at O(n²) space, is often infeasible when the volume of data is large, while the Proposed method, at O(n), is significantly advantageous, and that the Proposed method in most cases (e.g., with large λ) performs more efficiently in time than the Straight method, as the real-data experiments in Section 6 also illustrate. Moreover, the Proposed method focuses on incremental updates, which are frequently seen in evolving environments such as the Internet and enterprise databases. For a newly arriving data object, the Proposed method costs an extra O(n) space and O(n log n) time on average (O(n) in the best case, O(n log n) in the worst case), which is deemed efficient and desirable.
Fig. 4. Algorithmic details of incremental grouping.
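Steps 2 and 3 can be sketched as follows (an illustrative Python reading of the Fig. 4 pseudo-code, assuming a non-empty trace array; heapq provides the heap sort and bisect the binary search):

```python
import bisect
import heapq

def trace_and_update(labels, trace):
    # trace: group labels of existing objects that are lambda-close to the
    # new object (assumed non-empty here). Step 2: heap-sort trace to get the
    # minimal-indexed group label. Step 3: binary-search each existing
    # object's label in the sorted trace and merge hits into min_group.
    heapq.heapify(trace)
    sorted_trace = [heapq.heappop(trace) for _ in range(len(trace))]
    min_group = sorted_trace[0]
    for i, label in enumerate(labels):
        pos = bisect.bisect_left(sorted_trace, label)
        if pos < len(sorted_trace) and sorted_trace[pos] == label:
            labels[i] = min_group
    return labels, min_group
```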
Table 1. Time efficiencies (running times, in seconds) of the 4 methods.

| # | Dataset | Size | Method 1 | Method 2 | Straight | Proposed |
|---|---------|------|----------|----------|----------|----------|
| 1 | Cardiotocography | 2126 | 136.781 | 41.625 | 2.578 | 0.594 |
| 2 | Dexter | 2600 | 247.937 | 75.109 | 3.891 | 0.967 |
| 3 | Dorothea | 1950 | 104.109 | 32.016 | 2.141 | 0.538 |
| 4 | Image segmentation | 2310 | 176.125 | 53.422 | 3.047 | 0.683 |
| 5 | ISOLET | 7797 | 6815.41 | 1931.97 | 37.625 | 8.418 |
| 6 | Madelon | 4400 | 1205.88 | 346.75 | 11.125 | 2.765 |
| 7 | Multiple features | 2000 | 111.781 | 34.109 | 2.172 | 0.546 |
| 8 | Musk | 6598 | 4028.09 | 1156.73 | 25.891 | 6.318 |
| 9 | Opt. recognition of handwritten digits | 5620 | 2461.99 | 715.172 | 18.453 | 4.473 |
| 10 | Page blocks classification | 5473 | 2263.2 | 659.109 | 17.437 | 4.286 |
| 11 | Semeion handwritten digit | 1593 | 56.282 | 17.25 | 1.359 | 0.381 |
| 12 | Spambase | 4601 | 1398.95 | 401.75 | 12.609 | 3.106 |
| 13 | Steel plates faults | 1941 | 102.719 | 31.078 | 2.047 | 0.533 |
| 14 | Wine quality | 4898 | 1659.94 | 477 | 13.875 | 3.364 |
| 15 | Yeast | 1484 | 45.704 | 13.875 | 1.141 | 0.356 |
|   | Average | 3693 | 1387.660 | 399.131 | 10.359 | 2.489 |
6. Experiments and implementation

This section examines the space and time efficiency of the Proposed method with experimental data. The data experiment environment was a Windows XP system on a PC with an Intel E8400 CPU and 1 GB RAM, and the Proposed method and the three other methods were implemented with the same data structures and basic routines in Visual C++ 6.0. In addition, this section introduces the framework of a search/query results grouping system, which has been designed and implemented to support decision making based on incremental grouping as discussed in the previous sections.

Table 2. Space efficiencies (memory allocations, in MB) of the 4 methods.
| # | Dataset | Size | Method 1 | Method 2 | Straight | Proposed |
|---|---------|------|----------|----------|----------|----------|
| 1 | Cardiotocography | 2126 | 54.200 | 45.336 | 45.360 | 1.092 |
| 2 | Dexter | 2600 | 80.536 | 67.284 | 67.312 | 1.096 |
| 3 | Dorothea | 1950 | 45.768 | 38.312 | 38.336 | 1.088 |
| 4 | Image segmentation | 2310 | 63.804 | 53.336 | 53.364 | 1.092 |
| 5 | ISOLET | 7797 | 715.632 | 596.504 | 596.576 | 1.156 |
| 6 | Madelon | 4400 | 228.640 | 190.692 | 190.732 | 1.116 |
| 7 | Multiple features | 2000 | 48.096 | 40.248 | 40.272 | 1.092 |
| 8 | Musk | 6598 | 512.768 | 427.460 | 427.520 | 1.144 |
| 9 | Opt. recognition of handwritten digits | 5620 | 372.324 | 310.420 | 310.472 | 1.132 |
| 10 | Page blocks classification | 5473 | 353.156 | 294.448 | 294.500 | 1.132 |
| 11 | Semeion handwritten digit | 1593 | 30.908 | 25.924 | 25.944 | 1.084 |
| 12 | Spambase | 4601 | 249.908 | 208.408 | 208.452 | 1.120 |
| 13 | Steel plates faults | 1941 | 45.360 | 37.968 | 37.992 | 1.088 |
| 14 | Wine quality | 4898 | 283.064 | 236.044 | 236.092 | 1.120 |
| 15 | Yeast | 1484 | 26.960 | 22.640 | 22.660 | 1.084 |
|   | Average | 3693 | 207.408 | 173.002 | 173.039 | 1.109 |
Table 3. Statistical tests on the 4 methods.

|  | Hypothesis | t test: t value | t test: significance | Friedman test: chi-square | Friedman test: significance |
|---|---|---|---|---|---|
| Running times | Method 1 > Proposed | 2.806 | * | 15 | *** |
|  | Method 2 > Proposed | 2.834 | * | 15 | *** |
|  | Straight > Proposed | 3.656 | ** | 15 | *** |
| Memory allocations | Method 1 > Proposed | 3.829 | ** | 15 | *** |
|  | Method 2 > Proposed | 3.829 | ** | 15 | *** |
|  | Straight > Proposed | 3.830 | ** | 15 | *** |

\* p < 0.05. ** p < 0.01. *** p < 0.001.
Fig. 5. Memory allocation for the Proposed method (memory allocation in MB versus the number of data objects, n = $10^5$ to $10^6$).
In order to show the efficiency advantage of the Proposed method, comparisons have been conducted with the Straight method [15] (with O(n²) time complexity and O(n²) space complexity) and two classical methods: Method 1, with O(n³ log n) time complexity and O(n²) space complexity [52,53], and Method 2, with O(n³) time complexity and O(n²) space complexity [54]. All the following experiments were run on datasets from a commonly used benchmarking database in the field, namely the UCI Machine Learning Repository [42]. The UCI datasets used in the experiments are well-known and widely used benchmarking real data, with a rich variety of data features and application domains, provided mainly for the comparative assessment of data mining algorithms. In total, 15 datasets were used. Tables 1 and 2 present the time efficiency and space efficiency of all four methods on the 15 datasets, showing that the Proposed method significantly outperforms all the other methods, which is further justified by the statistical significance tests in Table 3. The experimental results on the UCI datasets reveal that the Proposed method can significantly improve efficiency in both time and space, which is consistent with the theoretical analysis in the previous sections.

To further illustrate the advantages of the Proposed method, experimental comparisons on scalability were performed on synthetic data (with λ = 0.9 by default). For space efficiency (memory allocation), Method 1, Method 2 and the Straight method became infeasible due to memory overflow when n > 14,000, and the differences were obvious (O(n²) for Method 1, Method 2 and the Straight method versus O(n) for the Proposed method); Fig. 5 therefore depicts only the memory consumption of the Proposed method (n = $10^5$ to $10^6$), which evidently shows a linear trend, i.e., O(n). Since Method 1 and Method 2 are far worse in both time efficiency and space efficiency than the Straight method and the Proposed method, we further examine only the difference between the Straight method and the Proposed method.
Fig. 6. System implementation for incremental grouping of search/query results (the diagram shows decision makers issuing search/query requests to the Internet/data center; the search/query results, including newly added ones, feeding the incremental grouping engine; and the engine updating groups 1, 2, ..., k with the new data).
Fig. 7. Running times of the Proposed method and the Straight method (1000 < n < 10,000).
Fig. 8. Running times of the Proposed method and the Straight method (interpolation).
Some experiments on synthetic data (1000 < n < 10,000) were conducted (Fig. 7), showing that the Proposed method is more efficient than the Straight method, which conforms to the theoretical analysis in Section 5. When n grows larger, e.g., from $10^5$ to $10^6$, the running times of the Proposed method are shown in Fig. 8, along with running times interpolated from the trend of the Straight method in Fig. 7 (since the memory allocation of the Straight method is infeasible when n > $10^5$); the results reveal a polynomial complexity consistent with the theoretical analysis, and the Proposed method is more applicable in large-data-volume environments than the Straight method.
Fig. 9. Running times of the Proposed method with different λ values.
Moreover, as indicated in Section 5, the trace operations highly affect the overall time efficiency of the Proposed method and depend significantly on the threshold λ. Generally, when λ is large enough, the number of trace operations is reduced; otherwise it increases. It is therefore suggested that a large λ be set for the Proposed method, which is also intuitive in most real applications. Fig. 9 illustrates the trend of running times as the λ value increases (n = $10^4$). The above scalability experiments show that the Proposed method does outperform the classical methods.

In addition, based on the incremental grouping method and the algorithm proposed in this paper, a search/query results grouping system has been designed and implemented; its framework is shown in Fig. 6. In the system, when the data center receives a search/query request from decision makers, the system incrementally organizes the search/query results into several equivalence groups, based on which decision makers can be presented with grouped results or representatives extracted accordingly. This system is considered useful for decision makers to gain insight into the data. First, the derived equivalence groups reveal the underlying distribution of the search results, which helps decision making and further helps extract representative results. Second, the system can nimbly respond to data updates from the evolving data environment of the Internet or enterprise databases through the incremental grouping process.

7. Conclusion

DSS research nowadays needs to deal with massive data processing in web applications. Importantly, to effectively and
efficiently extract a small and representative data set from a huge amount of search results is considered desirable for the quality of decision support. Furthermore, intelligent search/query techniques are key DSS components. In order to extract representative information that sufficiently covers the information of all search results, grouping the results from the overall search outcome is deemed meaningful and important. Based on the idea of grouping search results via the transitive closure of data objects, which provides equivalence groups at a given threshold λ, this paper has proposed an efficient method to incrementally generate the elements of the transitive closure only when they are necessary. This is enabled by investigating the relevant properties and developing an iterative updating strategy that is incorporated into the process of generating the equivalence groups step by step, giving the Proposed method desirable features: it is both space and time efficient, and it can efficiently deal with newly arriving data objects, a situation common in frequently evolving search environments. Data experiments have also been conducted, revealing the advantages and effectiveness of the Proposed method. In addition, a system implementation framework for the incremental grouping method was introduced. Future studies could proceed in two respects: one is to explore specific ways to extract representative results from the groups; the other is to perform more experiments/tests with large real-world DSS search data.

Acknowledgments

The work was partly supported by the National Natural Science Foundation of China (70890083/71072015/71110107027), the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities of China (07JJD630005), and Tsinghua University's Research Center for Contemporary Management.

References

[1] O. Zamir, O. Etzioni, O. Madani, R.M. Karp, Fast and intuitive clustering of web documents, in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997), California, USA, 1997, pp. 287–290.
[2] S. Bandyopadhyay, U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in $R^N$, Information Sciences 146 (2002) 221–237.
[3] J.S. Deogun, D. Kratsch, G. Steiner, An approximation algorithm for clustering graphs with dominating diametral path, Information Processing Letters 61 (1997) 121–127.
[4] S. Hirano, X. Sun, S. Tsumoto, Comparison of clustering methods for clinical databases, Information Sciences 159 (2004) 155–165.
[5] S.S. Khan, A. Ahmad, Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters 25 (2004) 1293–1302.
[6] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Transactions on Fuzzy Systems 1 (1993) 98–110.
[7] R.J. Kuo, J.L. Liao, C. Tu, Integration of ART2 neural network and genetic K-means algorithm for analyzing web browsing paths in electronic commerce, Decision Support Systems 40 (2005) 355–374.
[8] Y.J. Wang, A clustering method based on fuzzy equivalence relation for customer relationship management, Expert Systems with Applications 37 (9) (2010) 6421–6428.
[9] Y.J. Wang, H.S. Lee, A clustering method to identify representative financial ratios, Information Sciences 178 (4) (2008) 1087–1097.
[10] X.H. Tang, G.Q. Chen, Q. Wei, Introducing relation compactness for generating a flexible size of search results in fuzzy queries, in: Proceedings of the International Fuzzy Systems Association World Congress and European Society of Fuzzy Logic and Technology Conference (IFSA 2009/EUSFLAT 2009), Lisbon, Portugal, 2009, pp. 1462–1467.
[11] H.S. Lee, Automatic clustering of business processes in business systems planning, European Journal of Operational Research 114 (1999) 354–362.
[12] H.S. Lee, On automation of business processes clustering in business systems planning, in: Proceedings of the 27th Annual Meeting of the Western Decision Sciences Institute, Reno, USA, 1998, pp. 633–635.
[13] H. Wang, P.M. Bell, Fuzzy clustering analysis and multifactorial evaluation for students' imaginative power in physics problem solving, Fuzzy Sets and Systems 78 (1996) 95–105.
[14] L.A. Zadeh, Fuzzy sets, Information and Control 8 (1965) 338–353.
[15] H.S. Lee, An optimal algorithm for computing the max–min transitive closure of a fuzzy similarity matrix, Fuzzy Sets and Systems 123 (2001) 129–136.
[16] M.J. Shaw, C. Subramaniam, G.W. Tan, M.E. Welge, Knowledge management and data mining for marketing, Decision Support Systems 31 (2001) 127–137.
[17] M.K. Obenshain, Application of data mining techniques to healthcare data, Statistics for Hospital Epidemiology 25 (8) (2004) 690–695.
[18] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, California, USA, 2001.
[19] J. Hartigan, Clustering Algorithms, John Wiley and Sons, New York, NY, 1975.
[20] H. Spath, Cluster Analysis Algorithms, Ellis Horwood, Chichester, England, 1980.
[21] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.
[22] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, NY, 1990.
[23] B. Everitt, Cluster Analysis, third ed., Edward Arnold, London, UK, 1993.
[24] B. Mirkin, Mathematical Classification and Clustering, Kluwer Academic Publishers, Dordrecht, Netherlands, 1996.
[25] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[26] D. Fasulo, An analysis of recent work on clustering algorithms, Technical Report UW-CSE-01-03-02, University of Washington, 1999.
[27] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems 17 (2/3) (2001) 107–145.
[28] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: Proceedings of the ACM SIGMOD Conference, Seattle, USA, 1998, pp. 73–84.
[29] J.C. Dunn, A graph theoretic analysis of pattern classification via Tamura's fuzzy relation, IEEE Transactions on Systems, Man and Cybernetics 4 (3) (1974) 1–15.
[30] G.Y. Fu, An algorithm for computing the transitive closure of a fuzzy similarity matrix, Fuzzy Sets and Systems 51 (1992) 189–194.
[31] A. Kandel, L. Yelowitz, Fuzzy chains, IEEE Transactions on Systems, Man and Cybernetics 4 (1974) 472–475.
[32] H.K. Larsen, R. Yager, A fast max–min similarity algorithm, in: J.C. Verdegay, M. Delgado (Eds.), The Interface Between AI and OR in a Fuzzy Environment, Verlag TÜV Rheinland, Köln, Germany, 1989, pp. 147–155.
[33] H.B. Potoczny, On similarity relations in fuzzy relational databases, Fuzzy Sets and Systems 12 (3) (1984) 231–235.
[34] V. Owei, An intelligent approach to handling imperfect information in concept-based natural language queries, ACM Transactions on Information Systems 20 (3) (2002) 291–328.
[35] G.Q. Chen, Fuzzy Logic in Data Modeling: Semantics, Constraints, and Database Design, Kluwer Academic Publishers, Boston, 1998.
[36] H. Prade, C. Testemale, Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries, Information Sciences 34 (1984) 115–143.
[37] G.Q. Chen, J. Vanderbulcke, E.E. Kerre, A general treatment of data redundancy in a fuzzy relational data model, Journal of the American Society for Information Science 34 (4) (1992) 304–311.
[38] K.V.S.V.N. Raju, A.K. Majumdar, Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems, ACM Transactions on Database Systems 13 (2) (1988) 129–166.
[39] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1971.
[40] S. Tamura, S. Higuchi, K. Tanaka, Pattern classification based on fuzzy relations, IEEE Transactions on Systems, Man and Cybernetics 1 (1) (1971) 61–66.
[41] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, first ed., MIT Press, Cambridge, Massachusetts, USA, 1990.
[42] C.J. Merz, P. Murphy, UCI Repository of Machine Learning Databases, 1996.
[43] D.E. Kraft, A. Bookstein, Evaluation of information retrieval systems: a decision theory approach, Journal of the American Society for Information Science 29 (1) (1978) 31–40.
[44] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, 2009.
[45] X. Lian, L. Chen, Top-k dominating queries in uncertain databases, in: Proceedings of the 12th International Conference on Extending Database Technology, New York, 2009, pp. 660–671.
[46] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive skyline computation in database systems, ACM Transactions on Database Systems 30 (1) (2005) 41–82.
[47] M.L. Yiu, N. Mamoulis, Multi-dimensional top-k dominating queries, The VLDB Journal 18 (3) (2009) 695–718.
[48] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the web, Technical Report, Stanford InfoLab, 1998.
[49] D. Hawking, Challenges in enterprise search, in: Proceedings of the 15th Australasian Database Conference, Dunedin, New Zealand, 2004, pp. 15–24.
[50] K. Balog, People search in the enterprise, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 2007, p. 916.
[51] A.Z. Broder, A.C. Ciccolo, Towards the next generation of enterprise search technology, IBM Systems Journal 43 (3) (2004) 451–454.
[52] G.S. Liang, T.Y. Chou, T.C. Han, Cluster analysis based on fuzzy equivalence relation, European Journal of Operational Research 166 (2005) 160–171.
[53] G.J. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall PTR, Upper Saddle River, NJ, 1995.
[54] H. Naessens, H. De Meyer, B. De Baets, Algorithms for the computation of T-transitive closures, IEEE Transactions on Fuzzy Systems 10 (2002) 541–551.
[55] J.P. Shim, M. Warkentin, J.F. Courtney, D.J. Power, R. Sharda, C. Carlsson, Past, present and future of decision support technology, Decision Support Systems 33 (2002) 111–126.
[56] Ping-I Chen, Shi-Jen Lin, Word AdHoc network: using Google core distance to extract the most relevant information, Knowledge-Based Systems 24 (3) (2011) 393–405.
[57] Pierre-Antoine Champin, Peter Briggs, Maurice Coyle, Barry Smyth, Coping with noisy search experiences, Knowledge-Based Systems 23 (4) (2010) 287–294.
[58] Yu Zong, Guandong Xu, Yanchun Zhang, He Jiang, Mingchu Li, A robust iterative refinement clustering algorithm with smoothing search space, Knowledge-Based Systems 23 (5) (2010) 389–396.
[59] Wen Zhang, Taketoshi Yoshida, Xijin Tang, Qing Wang, Text clustering using frequent itemsets, Knowledge-Based Systems 23 (5) (2010) 379–388.
[60] Shunzhi Zhu, Dingding Wang, Tao Li, Data clustering with size constraints, Knowledge-Based Systems 23 (8) (2010) 883–889.
[61] Yan Pan, Hai-Xia Luo, Yong Tang, Chang-Qin Huang, Learning to rank with document ranks and scores, Knowledge-Based Systems 24 (4) (2011) 478–483.
[62] Jun Ma, Jie Lu, Guangquan Zhang, Decider: a fuzzy multi-criteria group decision support system, Knowledge-Based Systems 23 (1) (2010) 23–31.