Information Processing and Management 36 (2000) 571–583
www.elsevier.com/locate/infoproman
Digital library query clearing using clustering and fuzzy decision-making
M.I. Heywood a,*, A.N. Zincir-Heywood a, C.R. Chatwin b
a Department of Computer Science and Engineering, Dokuz Eylül, Bornova, Izmir, 35100, Turkey
b iims, University of Sussex, Falmer, Brighton BN1 9QT, UK
Accepted 15 December 1999
Abstract
A method is proposed and analysed for servicing keyword queries expressed in a digital library. Efficiency in the service routine is introduced via the concept of customers and producers. From the customer perspective, queries are grouped into clusters of similar concepts. The action of the servicing digital library is then formulated in terms of the amount of overlap each cluster of queries has with respect to the information density of the library. Furthermore, the concept of query priorities is incorporated within the formulation of the initial query clusters. The ensuing combination of prioritised query clustering and fuzzy decision-making is shown to ensure that the prioritised query instances receive preferential service times without increasing the query queue length or delay times. © 2000 Elsevier Science Ltd. All rights reserved.
Keywords: Query clearing; Clustering; Digital libraries; Fuzzy decision-making
* Corresponding author.
0306-4573/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0306-4573(99)00074-6

1. Introduction
In order to initiate searches in digital library (DL) applications, the user is required to suitably specify parameters associated with the query using keywords. To do so, the user's perception of the query requires interpretation in terms of the semantics associated with each digital library. Furthermore, it is assumed that the DL data is distributed across heterogeneous information providers. Related work (Topaloglu, Zincir, Heywood & Chatwin, 1998; Zincir & Tunali, 1998) describes the application of a decentralised query processing platform in which query agents enact queries across a completely decentralised heterogeneous network of information providers.
The objective of this work is to provide the basis for controlling the manner in which queries are serviced at the host DL. Specifically, we are interested in ensuring that the concept of keyword processing is maintained (i.e. the qualitative nature of the information), and that both the query and digital library host preferences/priorities are reflected. The former is achieved by using a fuzzy systems perspective of the initial keyword formulation (Chen & Wang, 1995) and will not be detailed further. The latter is achieved through a combination of fuzzy clustering and a formulation initially motivated by market based systems (Clearwater, 1995); that is, from the perspective of achieving a set of differing objectives derived from the point of view of customers (the query) and producers (the DL host). The resulting system, however, is strictly deterministic, thus a protracted period of negotiation is avoided. The main contribution of this work is therefore in the clearing process itself, where the original queries are expressed in terms of fuzzy linguistic keywords.
The paper is organised in the following manner: the concept of query clearing using clustering and fuzzy decision-making is introduced in Section 2, such that the user's and producer's perspectives are supported. Section 3 summarises the modelling constraints and simulation results of a study into the feasibility of the proposed system. The conclusion of Section 4 reiterates the approach and highlights future work.
2. Query clearing clustering and decision-making
As indicated above, queries to a digital library host are assumed to be in the form of a set of keywords, where some keywords are more significant to the user's request than others. Furthermore, the concept of service time is to be incorporated. That is to say, from a customer's perspective, their particular query should become more important as the time spent queuing at the DL host increases. From the perspective of the producer (DL host), the most important attribute is the amount of overlap between the query keywords and the density of information contained in the library. Finally, it is also assumed that only the query keywords associated with the information local to the DL host are submitted to the query processing activity.
The preceding represents the customer perspective alone; the producer or DL host, on the other hand, wishes to maximise a different set of objectives. Specifically, the host DL should service multiple queries at a time, where such queries are selected on the basis of a suitable similarity metric. The objective of this constraint is to ensure that the host DL search process is efficient, that is, that the maximum amount of information is gleaned from each search performed by the producer. Such a requirement also has a market economic basis in terms of the `cost' associated with servicing each query. In effect the DL is biased towards servicing clusters of queries which overlap with DL (producer) content of highest density, but without retaining queries for significant periods of time.
As indicated in the complementary papers (Topaloglu et al., 1998; Zincir & Tunali, 1998), fuzzy membership functions are used to represent a global keyword schema. This implies that for each keyword associated with the query, membership is described in terms of fuzzy sets
(Chen & Wang, 1995). Furthermore, this is expressed in terms of an area in a K dimensional space, where K is the number of keywords associated with the host DL. Naturally, those queries with most similarity are clustered in similar regions of the keyword space. Moreover, as the time a query spends waiting to be serviced by the host DL (producer) increases, the priority of the query from the customer perspective increases. The related concept of keyword significance to the overall query is incorporated in the initial value given to the overall query priority (semantic check significance) (Zincir & Tunali, 1998).
Hence, the result of the customer clustering activity is to group similar queries in terms of their position (and priority) in the keyword space, and then place these grouped sets into a new queue for consideration by the host DL. The host, effectively representing a producer, then ranks the significance of each query set from the perspective of the overlap in the information density at the host DL. This will be expressed in terms of a fuzzy decision problem (Ross, 1995). The general algorithm is therefore in the form of the following four steps:
1. cluster incoming queries on the host DL keyword space whilst incorporating time dependent priorities;
2. estimate overlap of the clusters from the perspective of DL information density;
3. service the cluster with greatest intersect;
4. loop on point (1).
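The four steps above amount to a simple service loop, which may be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Query`, `service_loop`, `cluster_fn`, `score_fn` and `arrivals_fn` are hypothetical, the clustering and ranking stages are passed in as callables, and the additive priority step applied to queries left waiting is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Query:
    keywords: Sequence[float]   # membership per keyword, each in [0, 1]
    priority: float = 1.0       # grows while the query waits (assumed additive)
    delay: int = 0              # iterations spent in the queue


def service_loop(queue: List[Query],
                 cluster_fn: Callable[[List[Query]], List[List[int]]],
                 score_fn: Callable[[List[Query]], float],
                 arrivals_fn: Callable[[int], List[Query]],
                 n_iter: int,
                 step: float = 0.125) -> Tuple[List[List[Query]], List[Query]]:
    """Run the cluster / rank / service cycle for n_iter iterations."""
    served: List[List[Query]] = []
    queue = list(queue)
    for t in range(n_iter):
        queue.extend(arrivals_fn(t))          # new queries join the queue
        if not queue:
            continue
        clusters = cluster_fn(queue)          # step 1: lists of queue indices
        # steps 2 and 3: rank the clusters and service the best one
        best = max(clusters, key=lambda idx: score_fn([queue[i] for i in idx]))
        served.append([queue[i] for i in best])
        chosen = set(best)
        queue = [q for i, q in enumerate(queue) if i not in chosen]
        for q in queue:                       # step 4: waiting queries gain priority
            q.priority += step
            q.delay += 1
    return served, queue
```

Keeping the clustering and ranking stages as parameters means the loop itself stays independent of how either stage is realised.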
2.1. Customer query clustering
As indicated above, the first step is to collect queries into groups of similar queries, that is to say, a clustering activity. Many robust fuzzy and statistical clustering algorithms exist (Rajesh & Krishnapuram, 1997); however, all suffer from various performance trade-offs. For the purposes of this study, the Potential Function method is employed (Chiu, 1994), principally due to the ease with which the concept of priorities may be introduced and the autonomy of the approach. In the latter case, it is implied that a clustering technique is sought in which as few a priori assumptions as possible are made regarding the nature of the clusters formed. In particular, it is very useful if the number of clusters required to describe the data, where the data in this case are incoming queries, is not specified a priori, but identified as part of the clustering process. The Potential Function clustering process used here satisfies these objectives and takes the form of the following four steps:
1. identify the potential of each point (query in the keyword space) with respect to all the other points;
2. select the point (query) with the highest potential;
3. subtract the potential of this point (query) from the others, where this effectively represents a cluster centre;
4. repeat from step two until the end criterion is satisfied.
The objective of step one is to characterise points (queries) in terms of how close they are to other points. This implies the evaluation of a suitable metric, denoted the potential function (Chiu, 1994; Rajesh & Krishnapuram, 1997),
$$P_t(x_j) = \sum_{i=1}^{K} \exp\left(-\alpha \lVert x_i - x_j \rVert^2\right) \tag{1}$$

where $x_j$ is the $j$th query, $P_t(x_j)$ is the potential for such a query at iteration $t$ and $\alpha$ is the cluster radii constant. This metric effectively allocates queries, $x_i$, located within the region of the candidate query, $x_j$, a higher weighting (approaching unity) than those further away, which only receive a low weighting (approaching zero). Hence the effect of the first activity is to give the highest potentials to the queries located in the centre of the most densely populated regions of query keyword space. The query with the highest potential, that is, the query with the most neighbours (footnote 1), is therefore identified as the first cluster centroid (step two). The purpose of step three is to remove the influence of the current query from all keywords within the same region. This means that all queries have their potentials reduced by a factor equivalent to their distance from the candidate query cluster centroid, $x_j$, or

$$P_{t+1}(x_j) = P_t(x_j) - P_t(x^*) \exp\left(-\beta \lVert x_j - x^* \rVert^2\right) \tag{2}$$

where the index $t + 1$ represents the updated potential at iteration $t$, $x^*$ is the query maximising step two at the present iteration, and $\beta$ is the second radii, selected such that it is larger than $\alpha$ (Chiu, 1994; Rajesh & Krishnapuram, 1997). The result of step three is the identification of points representing queries from similar semantic criteria as being members of the candidate centroid, $x_j$.

Footnote 1: In terms of a Euclidean distance norm.

The cycle then iterates until some suitable stop condition is identified; for example, Chiu (1994) details a criterion based on the ratio of the maximising potential at the present time step to that of the first. Such a stop condition results in three intervals. The first gives rise to a new cluster ($\gamma_{upper}$), the second identifies the lower bound in which all `significant' points are said to be clustered ($\gamma_{lower}$), and the third performs arbitration to assess the significance of a point (query) mid way between previously allocated cluster centres. The bounds are identified by always performing the comparison with respect to the first maximising point (i.e. the first cluster identified). The specific test used within this context is,

IF $P_t(x^*) > \gamma_{upper} P_1(x^*_1)$ THEN create new cluster
ELSE IF $P_t(x^*) < \gamma_{lower} P_1(x^*_1)$ THEN stop clustering
ELSE ignore point

where $0 < \gamma_{upper}, \gamma_{lower} < 1$, and $\gamma_{lower} < \gamma_{upper}$.
In order to introduce the concept of priorities, the following observation is made: the identification of the next cluster centre is driven by the query which maximises the remaining potential. Therefore, in order to incorporate the concept of different priorities, the selection of the maximising query at step two should be biased in accordance with the query priority. Specifically, if the initial query priorities are selected over the interval (0.5, 1.5) then the
modification to step two detailed in Fig. 1 biases potentials of a low priority query towards zero, and potentials of a high priority query away from zero. The final activity before submitting the clusters to the host DL is to associate points with clusters. However, such a process has already been performed. That is to say, the Potential Function of (1) represents a distance norm, and the process of subtracting a centre, (2), allocates points to centres. The entire clustering process is summarised pictorially in Fig. 2, using a three-dimensional keyword space and three query centres located at co-ordinates (0.75, 0.75, 0.25), (0.5, 0.25, 0.75) and (0.75, 0.5, 0.75). That is to say, keywords are recognised as linguistic properties, and a query is therefore able to possess the property of the keyword to a degree (membership) over the unit interval (Chen & Wang, 1995). The points highlighted with `' represent the location of the cluster centroids identified by the standard algorithm. Points identified following the incorporation of query priorities are indicated by the elements marked with a circle. Note, however, that this represents an artificially simplistic scenario for illustrative purposes alone.
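The clustering cycle of Eqs. (1) and (2), together with the three-interval stop test, may be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' code: the function name `potential_clustering` is hypothetical; the exponents are derived from two radii as 4/r², with the second radius larger than the first, which is the convention of Chiu (1994) and is not stated explicitly in the text; and, because Fig. 1 is not reproduced here, the priority bias is assumed to be a simple multiplication of each potential by the query priority during selection only. The default thresholds and radius follow Table 1.

```python
import numpy as np


def potential_clustering(X, r_a=0.25, r_b=0.375, g_upper=0.5, g_lower=0.15,
                         priority=None):
    """Return the indices of the queries selected as cluster centres."""
    X = np.asarray(X, dtype=float)
    # Assumed mapping from radii to the exponents of Eqs. (1) and (2).
    alpha, beta = 4.0 / r_a ** 2, 4.0 / r_b ** 2
    # Pairwise squared Euclidean distances between all queries.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-alpha * d2).sum(axis=1)                  # Eq. (1)
    w = np.ones(len(X)) if priority is None else np.asarray(priority, dtype=float)
    centres, p_first = [], None
    while True:
        k = int(np.argmax(w * P))        # step two, biased by priority (assumed form)
        p_star = P[k]
        if p_first is None:
            p_first = p_star             # potential of the first centre
        if p_star > g_upper * p_first:   # create new cluster
            centres.append(k)
            P = P - p_star * np.exp(-beta * d2[k])       # Eq. (2)
        elif p_star < g_lower * p_first: # stop clustering
            break
        else:                            # ignore point
            P[k] = 0.0
    return centres


# Three tight groups around the centres used in Fig. 2.
true_centres = np.array([[0.75, 0.75, 0.25], [0.5, 0.25, 0.75], [0.75, 0.5, 0.75]])
offsets = np.vstack([np.zeros(3), 0.02 * np.eye(3), -0.02 * np.eye(3)])
X = np.vstack([c + offsets for c in true_centres])
found = potential_clustering(X)
```

Every pass through the loop either stops or sets at least one positive potential to zero, so the loop terminates after at most one pass per query plus one.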
Fig. 1. Prioritised query cluster identification.
Fig. 2. Example of prioritised clustering process.
2.2. Producer decision-making
The input as seen by the host DL (producer) is a list of clustered queries, where the clusters represent the smallest `clumps' of information the DL may service. As indicated above, the next objective is to rank these sets in terms of the overlap with the density of information held at the DL, that is, the concept of maximising payoff on the basis of the amount of information that the host DL supplies. This is illustrated using a one-dimensional keyword space in Fig. 3.

Fig. 3. Digital library density to query membership function relationships.

It is apparent that this problem can be addressed in terms of estimating the maximal overlap between the query clusters (i.e. a membership function) identified above, and the membership function used to describe the density of information held at the host digital library. The use of membership functions is a reflection of the ambiguity inherent in the query and the approximate manner in which we wish to describe the information density at the host DL. That is to say, a membership function describes a mapping between the quantity under measurement and the unit interval. Such a function therefore expresses the ambiguity, as opposed to chance as in the case of a probability distribution, between the concept represented by the case of complete
membership and the case of no membership at all. Here, we use the concept of membership functions to describe the contribution of queries with respect to the cluster centroid. In the case of the host DL, the membership function provides a mapping between queries and information density of the host. Moreover, the overall approach significantly simplifies the complexity of ranking customer and producer perspectives. The general process for solving this problem is therefore of the form,
1. estimate the membership function for each of the query sets;
2. fuzzify the query and DL density antecedents for each information set (footnote 2): the process of fuzzification relates measurements made on the target attribute and expresses them in terms of a fuzzy concept by way of a membership function, Fig. 3 (Ross, 1995). In this case, the fuzzy concepts denote the information density (of the host digital library) and the query clusters themselves;
3. identify the maximal overlap between query and host DL membership functions: this selects the region maximising both queries processed and information given;
4. defuzzify and service the maximising case: the result of the above process is a fuzzy set, which hence requires interpretation in terms of a single scalar value if it is to define the single most appropriate cluster for servicing, as opposed to the combination of sets denoting the region of maximal intercept (Ross, 1995).
The first activity is satisfied by defining triangular membership functions in terms of the minimum, maximum and centroid associated with each of the cluster sets short-listed by the query clustering activity. Fuzzification then takes place in order to describe the contribution of each of the points (queries) in the cluster sets with respect to the membership functions of the query sets and DL densities; in effect the antecedent is estimated over all queries within each set.
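The four steps listed above may be sketched as follows, using the max-min decision function and the mean defuzzification operator defined in this section. The sketch is illustrative only: the names `tri`, `cluster_score` and `select_cluster` are hypothetical, the DL density for each keyword is assumed to be triangular as well (the text does not fix its shape), and a small `pad` is added to each cluster's minimum and maximum so that the triangle is well defined even for very tight clusters.

```python
import numpy as np


def tri(x, lo, peak, hi):
    """Triangular membership: 0 at lo and hi, 1 at peak (lo < peak < hi)."""
    return np.interp(x, [lo, peak, hi], [0.0, 1.0, 0.0])


def cluster_score(points, centre, density, pad=1e-6):
    """Mean over keywords of max_j min(A_i(x_j), B_i(x_j)) for one cluster."""
    points = np.asarray(points, dtype=float)           # shape (n_queries, K)
    K = points.shape[1]
    D = np.empty(K)
    for i in range(K):
        lo = points[:, i].min() - pad
        hi = points[:, i].max() + pad
        A = tri(points[:, i], lo, centre[i], hi)       # query cluster membership
        B = tri(points[:, i], *density[i])             # DL density membership (assumed triangular)
        D[i] = np.minimum(A, B).max()                  # maximal intersect for keyword i
    return D.mean()                                    # mean defuzzification


def select_cluster(clusters, density):
    """clusters: list of (points, centre) pairs; returns best index and all scores."""
    scores = [cluster_score(p, c, density) for p, c in clusters]
    return int(np.argmax(scores)), scores


# One keyword; DL density peaks at 0.5 on the unit interval.
density = [(0.0, 0.5, 1.0)]
clusters = [
    ([[0.4], [0.5], [0.6]], [0.5]),       # sits on the density peak
    ([[0.05], [0.1], [0.15]], [0.1]),     # sits in a sparse region
]
best, scores = select_cluster(clusters, density)
```

In this example the first cluster is selected, since its centroid coincides with the region of highest information density.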
Footnote 2: The DL density antecedents are effectively static with respect to the query clusters. That is, the DL density only changes when the content of the DL is updated.

The contribution of each information set is assessed using the following fuzzy decision function, which quantifies the significance of each query cluster, $C$, in terms of the maximising intercept within each cluster,

$$D_C(i) = \max_{j \in C} \min\left\{ A_i(x_j),\, B_i(x_j) \right\}, \quad i \in K$$

where $A_i(x_j)$ is the membership function of keyword $i$ cluster for query $j$; $B_i(x_j)$ is the membership function of keyword $i$ density for query $j$; $i$ indexes all keywords, and $j$ indexes all queries which are members of cluster set $C$. This function returns the maximum intersect for each cluster, recalling that the clusters identified by the query clustering activity denote the minimal divisible unit. All that remains is to interpret the vector $D_C$ in terms of the maximising instance. This is performed using the following mean defuzzification operator, chosen in order to avoid dominance by overly pessimistic or optimistic outcomes in any one keyword dimension,

$$D = \max_{C} \left( \frac{1}{K} \sum_{i}^{K} D_C(i) \right)$$
where K is the keyword space dimension local to the host digital library. This completes the selection process for the query set serviced by the DL; the process then repeats iteratively, each time incrementing the priority (Fig. 1) of queries carried over from the last iteration. In Section 3, a simulation study is performed to demonstrate the operation of the concept for different load factors.

3. Evaluation
In order to evaluate the robustness of the above query processing engine, a simulator is constructed in which the initial queries are generated over a K-dimensional keyword space, and different numbers of new queries are added after each iteration of the above query servicing routine. The various parameters associated with the clustering algorithm are summarised in Table 1; no attempt was made to tune this selection in any way, the values of the clustering algorithm conforming to those specified in the initial reference (Chiu, 1994). The simulation programme is specifically designed to assess the significance of: different values for the priority step size; the degree of overlap between the queries; and increases to the number of new queries entered at each iteration. Section 3.1 details modelling issues associated with the representation of the problem in the simulator, whereas Section 3.2 details the results of the simulation process.

3.1. Modelling
In order to model the environment, two major issues require clarification: how the initial queries are created; and how the membership functions representing the host DL densities are derived. Naturally, the degree of overlap between the initial queries and the host DL densities will affect the throughput of the system. As indicated in the partnering study (Topaloglu et al., 1998; Zincir & Tunali, 1998), it is assumed that the platform ensures that queries forwarded to the host DL have an information overlap proportional to the significance attributed to the query by the user and the appropriateness of the DL information.
This implies that any query actually submitted to the host does have `sufficient' overlap to warrant query servicing. The most optimistic bound represents the case in which all the query cluster centres correspond to information encountered on the host DL. The most pessimistic implies that a single keyword for each query has some overlap with the host density estimates. The approach taken here is to ensure that the membership functions representing the query and DL perspectives on the keywords differ by some pre-specified amount. Table 2 summarises the mean difference for each of the keywords.

Table 1 Algorithm constants
Priority increment                       0.125, 0.5
Upper cluster cut ($\gamma_{upper}$)     0.5
Lower cluster cut ($\gamma_{lower}$)     0.15
Minimum cluster radii                    0.25

Simulations were initially performed over two scenarios. One uses three keyword dimensions and 27 query clusters; the second uses 10 keyword dimensions and 81 query clusters. In each case, points corresponding to the query clusters are generated about the cluster centroid using a Gaussian probability distribution function with a variance of 0.1. The following results are specific to the three keyword case, but the observations made are generic to both keyword counts.
Two secondary issues are also specifically highlighted: (1) what process is used to model the generation of the initial query priorities; and (2) by how much the queries overlap. In the former case, the generation of the initial query priorities, it is assumed that the number of initially high priority instances will be low. Hence, a log normal distribution is used to generate the initial priority information with a mean of unity. In order to model the degree of homology between the queries received, the population of queries is varied by specifying the inter- and intra-cluster query counts; Table 3. That is to say, a reference case is chosen in which 100 queries are added at each iteration ((b) in Table 3). An intra-cluster example uses 200 queries where queries from the same class are favoured ((a) in Table 3); whereas the inter-cluster example enforces the selection of queries from double the number of query clusters ((c) in Table 3). In effect the inter-cluster case represents a scenario in which there is a high degree of similarity between queries, or the information space is maximally spanned.
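The query generation described above may be sketched as follows. This is an assumption-laden illustration, not the simulator used in the study: the function name `generate_queries` and its parameters are hypothetical, the shape parameter `sigma` of the log normal distribution is not given in the text and is chosen arbitrarily, and clipping the generated points to the unit interval is an added assumption so that they remain valid memberships. A log normal variable with parameters (mu, sigma) has mean exp(mu + sigma²/2), so setting mu = -sigma²/2 yields the mean of unity.

```python
import numpy as np


def generate_queries(centroids, per_cluster, var=0.1, sigma=0.5, seed=0):
    """Draw query points about each centroid, plus initial priorities."""
    rng = np.random.default_rng(seed)
    centroids = np.asarray(centroids, dtype=float)
    # Gaussian scatter about each cluster centroid (variance given in the text).
    pts = np.repeat(centroids, per_cluster, axis=0)
    pts = pts + rng.normal(0.0, np.sqrt(var), size=pts.shape)
    pts = np.clip(pts, 0.0, 1.0)          # assumption: keep memberships in [0, 1]
    # Log normal priorities with mean one: few queries start with high priority.
    mu = -0.5 * sigma ** 2
    priority = rng.lognormal(mean=mu, sigma=sigma, size=len(pts))
    labels = np.repeat(np.arange(len(centroids)), per_cluster)
    return pts, priority, labels


# Reference case (b): 10 clusters of 10 queries in a three-keyword space.
rng = np.random.default_rng(1)
pts, priority, labels = generate_queries(rng.random((10, 3)), per_cluster=10)
```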
In the intra-class scenario, the span of the information space is high, producing a more fractured distribution of queries. The significance of the priority increment step size is assessed by repeating the above experiments for three scenarios: no priority; priority with a step size of 0.125; and priority with a step size of 0.5.

Table 2 Mean difference between keyword membership functions
Keyword               1    2    3    4    5    6    7    8    9    10
Mean difference (%)   25   25   23   6.7  6.5  6.6  27   27   22   17

Table 3 Cluster homology scenarios
Identifier         Intra (a)   Std (b)   Inter (c)
Num. clusters      20          10        10
Queries/cluster    10          10        20

3.2. Results
The query processing activity is summarised in terms of five characteristics:
1. query delay: the number of iterations a query spends before it is serviced by the host digital library, as measured at each iteration of the simulation;
2. query priority: the priority of a query as measured at each iteration of the simulation;
3. queue length: the number of queries remaining once the digital library has serviced a cluster, as measured at each iteration of the simulation;
4. query prioritised delay: the delay experienced by different priorities at each iteration of the simulation (discussed in more detail below);
5. the number of query clusters formed: where this summarises the number of cluster centres
formed by the different algorithms (prioritised/unprioritised) at each iteration of the simulation.
In each case, a simulation period of 1000 iterations is used, where the results from the first 150 iterations are dropped from the evaluation of the above metrics in order to provide a sufficient period for steady state conditions to exist. The first three metrics are self explanatory, and summarised in Figs. 4–6. In the case of the delay metric, this is measured independently of the query priority. Ideally, it is desirable for the delay of the prioritised cases to be lower than the corresponding case without priority. In the case of the standard and intra-cluster scenarios, this is observed in the case of the maximum, whereas the mean is generally the same. However, in the case of the inter, or maximally spanning information space condition, the maximal delay of the prioritised cases is higher. This effect is repeated in the case of the mean priority of the queues at each iteration and queue length. To summarise, the above metrics indicate that the prioritised clustering performs as well as the unprioritised clustering instance in terms of net query delay, queue length and priority.

Fig. 4. Query delay for each scenario and priority condition. Cluster homology: Intra: columns 1–3; Std: columns 4–6; Inter: columns 7–9.

Fig. 5. Query priority for each scenario and priority condition. Cluster homology: Intra: columns 1–4; Std: columns 5–8; Inter: columns 9–12.

The central motivation of the prioritised clustering concept, however, was to ensure that the cases with higher priority received preferential treatment without adversely affecting other performance metrics. In order to provide a measure for this, the queries at each iteration are collected into a bucketed list, where each bucket represents a priority interval of 0.5; there are 20 buckets in total (in effect a histogram representation is constructed). The product is then
taken between the mean delay and number of queries per bucket. This estimates the ability of the prioritised query clustering algorithm to ensure that high priority cases are not significantly delayed. The results of this test are summarised in Table 4.

Fig. 6. Query queue length for each scenario and priority condition. Cluster homology: Intra: columns 1–3; Std: columns 4–6; Inter: columns 7–9.

Table 4 Delay–instance product
Simulation scenario   Priority increment   t-statistic   Mean (no priority)   Mean (priority included)
Intra-class           0.5                  −15.9         225                  260
                      0.25                 0.21          557                  555
Standard              0.5                  1.7           113                  111
                      0.25                 3.98          262                  245
Inter-class           0.5                  0.029         221                  221
                      0.25                 4.44          527                  491

This represents the result of a statistical t-test in which the hypothesis that there is no difference between corresponding prioritised and non-prioritised instances is tested. A t-statistic with a magnitude greater than 2 implies a statistically significant difference. It is apparent that the only time that a priority increment of 0.5 gives a mean response time statistically different from that of the unprioritised instance is when
a longer response time is returned in the case of the intra-class scenario. That is to say, the 0.5 step size instance performs worse than the unprioritised clustering algorithm. In the case of the 0.125 step size increment, however, the prioritised instance is better than the unprioritised case in the standard and inter-class scenarios, and as good as the unprioritised case in the intra-class scenario. In summary then, it is recommended that priority step sizes of 0.125 are used in order to ensure performance comparable with the unprioritised clustering algorithm under worst case conditions. The prioritised instance is then able to reduce the delay time of priority instances when the information space is not spanned by the incoming set of candidate queries.

Table 5 Query clusters formed
Scenario              No priority   0.125 increment   0.5 increment
Intra-class   Max     9             9                 10
              Mean    3.605         3.535             3.78
              Min     1             1                 1
Standard      Max     9             11                11
              Mean    3.55          3.36              3.78
              Min     1             1                 1
Inter-class   Max     9             9                 10
              Mean    3.605         3.534             3.78
              Min     1             1                 1

Finally, the number of query clusters formed under each scheme/scenario is observed to remain very similar; Table 5. Indeed the only perceivable difference is a tendency of the 0.5
priority increment case to produce slightly larger means and maximal cluster counts, whereas the 0.125 case always returns a lower mean number of cluster centres.

4. Conclusion
A customer–producer relationship is used to model the relationship between queries and servicing hosts in a digital library. Specifically, the customer's or query's preference is modelled in terms of the need to receive information in as short a time period as possible. The producer or library host's objective, however, is to maximise the number of query hits produced whilst upholding an efficient data access policy. In particular, the customer's perspective is incorporated by extending the method of Potential Function clustering to include the concept of prioritised data, whereas the producer's perspective is incorporated by expressing the information density of the host's library. Fuzzy decision-making is then used to identify the queries enacted such that both objectives are satisfied without recourse to negotiation. The overall approach is demonstrated to yield a balanced query processing system applicable to keyword driven queries and information density driven preferences in the host DL.

Acknowledgements
This work was conducted whilst A.N. Zincir-Heywood was a visiting scholar from Ege University under a British Council Scheme awarded between October 1996 and September 1997.

References
Chen, S.-M., & Wang, J.-Y. (1995). Document retrieval using knowledge-based fuzzy information retrieval techniques. IEEE Transactions on Systems, Man and Cybernetics, 25(5), 793–803.
Chiu, S. L. (1994). Fuzzy model identification based on cluster estimation. Journal of Intelligent and Fuzzy Systems, 2, 267–278.
Clearwater, S. H. (1995). Market based control: A paradigm for distributed resource allocation. World Scientific, Singapore.
Rajesh, N. D., & Krishnapuram, R. (1997). Robust clustering methods: a unified review. IEEE Transactions on Fuzzy Systems, 5(2), 270–293.
Ross, T. J. (1995). Fuzzy logic with engineering applications. McGraw Hill. ISBN 0-07-053917-0.
Topaloglu, N. Y., Zincir, A. N., Heywood, M. I., & Chatwin, C. R. (1998). Characterisation of centralised and decentralised text information retrieval search platforms using OMT. In Advances in Computer and Information Sciences '98, Proceedings of the 13th International Symposium (pp. 519–526). IOS Press.
Zincir, A. N., & Tunali, T. (1998). A simulation study on the optimization of distributed search of digital libraries. In Advances in Computer and Information Sciences '98, Proceedings of the 13th International Symposium (pp. 527–534). IOS Press.