Expert Systems with Applications 24 (2003) 365–373 www.elsevier.com/locate/eswa
An interactive agent-based system for concept-based web search Wei-Po Lee*, Tsung-Che Tsai Department of Management Information Systems, National Pingtung University of Science and Technology, Nei-Pu Hsiang, Pingtung, Taiwan, ROC (*Corresponding author. E-mail address: [email protected] (W.-P. Lee))
Abstract Search engines are useful tools for looking for information on the Internet. However, due to the difficulty of specifying appropriate queries and the problems of the keyword-based similarity ranking presently used by search engines, general users are still not satisfied with the results retrieved. To remedy these difficulties and problems, in this paper we present a multi-agent framework in which an interactive approach is proposed to iteratively collect a user's feedback on the pages he has identified. By analyzing the pages gathered, the system can then gradually formulate queries that efficiently describe the content the user is looking for. In our framework, evolution strategies are employed to evolve critical feature words for concept modeling in query formulation. The experimental results show that the framework developed is efficient and useful in enhancing the quality of web search, and that concept-based semantic search can thus be achieved. © 2003 Elsevier Science Ltd. All rights reserved. Keywords: Semantic search; Intelligent agent; Evolutionary computing; Concept modeling; Query formulation
1. Introduction With the continuous development of the Internet and the World Wide Web, anyone can easily publish information on the Internet. Meanwhile, though it is now convenient to obtain information from the Internet, the rapidly expanding World Wide Web causes the problem of information overload. To help users efficiently find the information they need, different types of search engines, such as Yahoo, Google, and Infoseek, have been developed during the last few years. Search engines are useful tools for collecting and indexing web pages. After receiving a user-specified query, a search engine uses its internal strategy to narrow the vast Internet information down to a certain range, and retrieves web pages that match the query for the user. However, because most conventional search engines use keyword-based similarity ranking, people may spend a lot of time evaluating the retrieved results and eventually still not find the specific information they really need. This is mainly because, in the common keyword-match design, the search engines rank documents according to their relevancy to the given query, where the relevance is measured by the similarity between a document and the query. The methods used to determine the similarity are normally based on probability models, such as
the well-known vector space model (Salton, 1989), in which the frequency of the query keywords appearing in the documents is the decisive factor. On the other hand, it is well known that users of web search engines tend to use short queries that in fact consist of only one or two words (Jansen, Spink, Bateman, & Saracevic, 1998; Silverstein, Henzinger, Hannes, & Moriz, 1999). Average users simply guess at some terms that seem most useful and submit them to the search engines. Because short queries are often non-specific in nature, the results retrieved by the similarity-based method can be of very poor quality: search engines may retrieve many irrelevant documents that include the keywords used, while failing to retrieve relevant documents that do not contain the keywords. The situation becomes worse when a user is looking for a specific concept but is not able to accurately specify useful keywords as a query for the search engines. In fact, it is difficult to specify appropriate keywords that lead a search engine to retrieve the web pages best matching a user's abstract concept, because doing so involves a semantic mapping between the query and the concept. For example, if a user would like to find web pages that provide downloadable research papers on intelligent agents, he might have to try different combinations of possible keywords, for example, 'intelligent', 'agent', 'paper', 'academic', 'research'. Still, the user may spend a lot of time evaluating the results without feeling satisfied with them. To solve the above problem, in this paper we present an agent-based system in which an interactive approach is
0957-4174/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0957-4174(02)00186-0
proposed to gradually capture the concept a user has in mind. In our system, an interface agent is developed to receive a user's query and redirect it to existing search engines, as general metasearch engines do. An information agent then analyzes the web pages chosen by the user and derives a temporary profile for him. Most importantly, based on this profile, a discovery agent performs query expansion and modification by evolution strategies (ES) to obtain more accurate results. A filtering agent uses the profile to rank the web pages retrieved for a new query and recommends the most relevant ones to the user. After the user indicates the pages he really needs, those pages are further analyzed and the profile is updated. This procedure continues iteratively until the user terminates the search. In this interactive way, the user's feedback is taken into account and his initial request can be expanded with good query terms; the task of web search therefore no longer relies completely on a precise query. With the evolution-based query formulation, concept-based semantic search can be achieved. The prototype system has been evaluated, and the preliminary results show the promise and efficiency of our approach.
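The iterative procedure just described can be sketched in pseudocode-like Python. This is a minimal illustration only; all callables and names are hypothetical stand-ins for the four agents, not the paper's implementation:

```python
def interactive_search(initial_query, search, analyze, rank,
                       get_feedback, max_iterations=5):
    """High-level sketch of the interactive loop: search, re-rank
    against the profile, collect user feedback, update the profile,
    and reformulate the query from the top profile words."""
    profile = {}                          # accumulated feature words
    query = initial_query
    for _ in range(max_iterations):
        pages = search(query)                    # discovery agent
        ranked = rank(pages, profile)            # filtering agent
        selected = get_feedback(ranked)          # interface agent
        if not selected:                         # user is satisfied
            break
        for word, freq in analyze(selected):     # information agent
            profile[word] = profile.get(word, 0) + freq
        top = sorted(profile, key=profile.get, reverse=True)[:5]
        query = " AND ".join(top)                # reformulated query
    return profile, query
```

The five-word conjunction in the last step anticipates the query-formulation scheme detailed in Sections 3 and 4.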
2. Web search As mentioned, search engines generally employ a probability model to measure the similarity between a document and a query as a proxy for their relevance. The most popular such method is the vector space model, widely accepted in information retrieval research. In this model, both documents and queries are represented as word vectors, after the common terms (such as 'the', 'to', 'for', etc.) have been removed and the words have been stemmed. The similarity of any two vectors is then determined by the angle between them, which can be obtained from their inner product (Salton, 1989). Though this method can measure the syntactic closeness of a document and a query, it does not mean that they are relevant in content. Hence, the search results of this approach are not satisfactory. To improve the search quality, it becomes important for a search system to automatically form effective queries that capture what the user really needs. A possible way to achieve concept-based semantic search is to develop a profile acting as a filter to further classify the retrieved results. In fact, this method has been advocated for web page recommendation, in which a profile contains feature words extracted from the pages a user is interested in, each word carrying a weight indicating its relative importance. The similarity between a new page and the personal profile is then used to predict whether the user will be interested in that page. This approach has been successfully used to develop systems for web page recommendations (Joachims, Freitag, & Mitchell, 1997; Moukas, 1997; Pazzani & Billsus, 1997),
and it is generally believed that better semantic relevance can be captured in this way. It should be noted that the above personalization technique used in web page recommendation cannot be applied directly to query-driven web search. In the former case, the user is passive; he normally retains his interests for a period of time, so the feature words in his personal profile can be accumulated over time. Yet in the latter case, a user may have a specific target in each single search; therefore, previously collected feature words are not necessarily related to a new search. Based on the above method, Shu and Kak used the pages given the highest and lowest scores by the search engine as positive and negative examples to train a neural network as a filter to recognize other pages for a specific query in web search (Shu & Kak, 1999). However, the task of concept-based semantic search is subjective; whether a certain page is what the user needs for the concept depends on his own judgment. A web page relevant to a query or a user's profile may still not be what the user means by that query, because even the user himself is not capable of specifying the most appropriate query for the contents he is looking for. Under such circumstances, how to generate efficient queries becomes the major problem to solve in this kind of search (Chen & Sycara, 1998; Goldman, Langer, & Rosenschein, 1997). As the above analysis shows, concept-based search cannot be achieved immediately after a query is specified; the query must be iteratively modified to capture what the user really needs. Also, a profile similar to that used in the web recommendation work is needed to characterize the concept in the user's mind during the process of query formulation. Some methods have been proposed to create effective queries (Hsu & Chang, 1999; Nick & Themis, 2001), and the most closely related work is the GA approach used in (Nick & Themis, 2001).
Yet, the nested design for evolving queries in their work is too time-consuming to be practical for real-world web search. As indicated above, concept-based semantic search is user-centric; therefore, in the iterative steps the profile should be developed interactively from the user's feedback on the retrieved web pages. In this way, additional words extracted from newly retrieved pages can be gradually integrated into the profile and used to generate new queries that further explore other areas of the web space. The more relevant web documents contribute to the profile, the higher the probability that near-optimal queries can be formulated. By exploiting the profile and exploring the web space, the search quality no longer relies completely on the preciseness of the original query; the interactive approach can lead the user to what he is targeting and provide him with abundant results. In this work, we follow these design principles to develop a system and show that it can greatly improve the quality of traditional web search.
Fig. 1. The overall architecture of the proposed agent-based system.
3. The proposed system 3.1. System framework Developing intelligent agents for Internet-based applications has been advocated in recent years (Etzioni & Weld, 1995; Maes, 1994; Sycara, Pannu, Williamson, Zeng, & Decker, 1996). Hence in this work an agent-based methodology is adopted to construct a framework for interactive concept-based web search. Each agent here autonomously performs a specific sub-task, and different agents work simultaneously to achieve the overall task. To achieve concept-based search, the system we develop includes four agents: an interface agent for interacting with the user and collecting his feedback, an information agent for extracting feature words from the pages gathered, a discovery agent for query formulation, and a filtering agent for re-ranking and recommending pages to the user. The agents work continuously until the user is satisfied with the results and then terminates the search. During the search process, a profile is used to record critical feature words for
automatic query generation. Fig. 1 illustrates the system framework. In Fig. 1, the interface agent receives a user's initial query and sends it directly to the discovery agent, which is responsible for interacting with the search engines and organizing the retrieved results. The interface agent also displays the search results from the filtering agent to the user and then collects his feedback indicating which pages he needs. With the user's feedback, the system can gradually learn what he means by the submitted query. Fig. 2 shows typical search results presented by the interface agent. In this figure, the user can follow the hyperlinks to evaluate the web pages, as he usually does with traditional search engines. He then marks the hyperlinks to identify the relevant ones. After gathering the hyperlinks with positive feedback, the interface agent delivers the corresponding web pages to the information agent for further analysis. Concept-based semantic search is a subjective task: the same query from different users could mean different contents to them, or different users may use their own queries to search for the same concept. It depends entirely on a user's personal opinions. Therefore, the user–system interaction described here is an important feature that enables the system to capture what a user means. The information agent takes responsibility for analyzing the selected web pages and maintaining a word profile that includes useful information describing the concept the user is looking for. The discovery agent described below will use the profile for query formulation. For a selected web page, the information agent first removes the HTML tags and the common but unimportant terms such as pronouns and prepositions, and then performs a stemming procedure to strip off word endings for statistical purposes.
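The information agent's cleanup step can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list and the crude suffix-stripping stemmer are simplified stand-ins (a real system would use a proper stemmer such as Porter's):

```python
import re

# Illustrative stop-word list; the real list would be much larger.
STOP_WORDS = {"the", "a", "an", "to", "for", "of", "and", "in", "on",
              "from", "he", "she", "it", "they", "this", "that"}

def preprocess(html):
    """Strip HTML tags, drop stop words, and crudely stem the
    remaining words (a stand-in for a real stemming procedure)."""
    text = re.sub(r"<[^>]+>", " ", html)        # remove HTML tags
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]

    def stem(w):
        # Naive suffix stripping, for illustration only.
        for suffix in ("ing", "ies", "es", "s", "ed"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[:-len(suffix)]
        return w

    return [stem(w) for w in words]

print(preprocess("<p>Intelligent agents learning from pages</p>"))
```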
Fig. 2. The typical results presented by the interface agent; the user can mark the hyperlinks to express the results he is expecting.
The remaining words are used to represent the page, and their frequencies of appearance in the web page indicate their relative importance. In this way, a page is transformed into a word vector. Consequently, the similarity between two pages can be derived from the angle between their vectors. The vectors of the collected pages are then combined into a profile in which the words from different pages are sorted according to their accumulated frequencies. As shown in Figs. 1 and 2, the system operates iteratively and the user can indicate the pages he needs at each iteration. Hence, the system has to continuously update the profile to take more user feedback into account, and the word frequencies are accumulated from iteration to iteration. Some pages may contain only a small number of words (e.g. the main page of a web site), so the words within them have relatively small impact. To prevent bias caused by page length, for pages with fewer words than a pre-defined threshold, the information agent explores the hyperlinks within the pages to extend the page content. The profile derived from the above procedure is then used to generate new queries and to re-rank the retrieved pages later. The discovery agent is the kernel of our system; based on the relevance feedback provided by the user, it is expected to generate new queries that better capture the concept in the user's mind. Different from the recommendation work described in Section 2, the task of concept-based semantic search is more abstract and restrictive; it is difficult to achieve by direct measurement of lexical similarity. A more efficient approach is needed to derive, from the pages the user has selected, sets of words that model what he means. The sets of words produced can then form sensitive queries that explore different regions of the web space for more accurate results.
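The page-to-vector transformation, the cosine similarity, and the accumulating profile can be sketched as below. This is an illustrative sketch only (it assumes the text has already been cleaned and stemmed as described above):

```python
import math
import re
from collections import Counter

def to_vector(text):
    """Turn cleaned page text into a word-frequency vector
    (stop-word removal and stemming assumed already done)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Similarity of two pages (or a page and the profile): the
    cosine of the angle between their word vectors, i.e. their
    inner product divided by the product of their norms."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def update_profile(profile, selected_pages):
    """Accumulate word frequencies from the pages the user marked
    as relevant; repeated calls sum frequencies across iterations."""
    for page in selected_pages:
        profile.update(to_vector(page))
    return profile

profile = Counter()
update_profile(profile, ["agent learn agent", "agent paper"])
update_profile(profile, ["paper agent"])           # a later iteration
print(profile.most_common(2))                      # top profile words
```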
Also, the method used to construct the discovery agent for query formulation must be time-efficient, because a user expects to find specific contents within a limited time when operating a search engine. With the above considerations, a special kind of evolutionary approach is employed in our work to create, from the word profile, new queries that model what the user is seeking. The details are described in Section 3.2. To evaluate a certain query, the discovery agent sends it to a search mechanism (e.g. an ordinary search engine or a metasearch engine), collects the retrieved web pages, and delivers them to the filtering agent for page re-ranking. After receiving the web pages, the filtering agent activates the information agent to analyze them and transform them into vector form, and re-ranks these pages according to their similarities to the accumulated word list in the profile. In other words, the newly retrieved web pages are re-ranked based on how similar they are to the pages previously indicated as user-needed. To save processing time, the filtering agent only deals with the top-ranked pages (e.g. the top l pages given by the search engine) by computing the inner product between each page vector and the profile. According to the measured results,
the pages most similar to the word profile (e.g. the top k pages after re-ranking, k < l) are presented to the user through the interface agent. The user can then select the web pages with the contents he needs as examples that show the system what he is looking for. As mentioned, the interface agent will send the selected pages to the information agent, which will analyze them and extract feature words to update the profile. Through this interactive and iterative process, our system can gradually exploit what the user provides and explore new web regions to retrieve what he needs. 3.2. Query formulation by evolution strategies Evolutionary algorithms are the kind of algorithms that simulate the process of natural evolution, searching for the fittest through selection and recreation over the generations. They are based on a collective learning process within a population of individuals, each of which represents a search point in the space of potential solutions for a given problem. In general, after an initialization phase, an evolutionary algorithm operates as an iterative cycle of selecting parents and creating children. Selection involves probabilistically choosing individuals from the current population to form a new population, where the survival probability of an individual is normally based on how fit it is for the specific problem to be solved. Recreation applies genetic operators to the parent individuals to generate new ones. In this work, a special kind of evolutionary algorithm, namely ES (Bäck, 1996; Schwefel, 1981), is used to construct the discovery agent for evolving queries. In the ES model, a member of the population is typically an individual constituted by a set of parameters (components) to be optimized; it is thus a fixed-length real-valued string.
Unlike Genetic Algorithms, the ES model emphasizes the behavioral linkage between parents and offspring; each component of a trial solution is thus regarded as a behavioral trait rather than a gene. It is assumed that, whatever genetic transformations happen, the resulting change in each behavioral trait will follow a Gaussian distribution with zero mean and some standard deviation. In ES, mutation is the primary operator for creating offspring. It is applied to all components simultaneously and is generally implemented by adding normally distributed random numbers to all components of an individual. A key concept of ES is that each individual maintains its own step sizes (i.e. the standard deviations σ in the equations below). In the ES model, the step sizes are self-adaptive: each offspring inherits its step sizes from its parent, and each step size is modified by a log-normally distributed random factor. This characteristic allows ES to self-adapt to different fitness landscapes. Therefore, apart from the population size, there are no system parameters to be tuned by the designer (Bäck, 1996). It
has been shown that, with the recreation scheme described above, ES are better alternatives to Genetic Algorithms for problems in which epistasis exists among the parameters to be optimized (Salomon, 1996). Epistasis describes the interaction (or dependency) of parameters with respect to the fitness of an individual. When it appears, all parameters involved have to be adapted simultaneously so that the overall fitness of an individual can be improved. Hence, with the genetic operators they use, GAs are very time-consuming for this kind of problem (epistasis drastically slows down their convergence). Another advantage of a mutation-based ES is that it reduces the negative impact of the permutation problem, so the evolutionary process can become more efficient (Bäck, 1996). In the web search task, the word frequencies are not mutually independent; the appearance of some words may increase or decrease the appearing frequencies of others. In addition, due to the time limitation (the time a user can wait for a response), only a small population size and few generations are allowed. Therefore, the ES model is chosen to evolve queries in this work. The multi-member (μ + λ)-ES is the most widely used ES variant; it incorporates the ideas of population and self-adaptation of strategy parameters. In this model, μ is the number of individuals in each population, λ is the number of offspring created from the parents, and the best μ individuals selected from the parents and offspring together form the next population. In our implementation, this (μ + λ) strategy is employed to select the survivors. As in the traditional ES model, an individual in our system is represented as a vector I = (f1, f2, …, fn, σ1, σ2, …, σn)
consisting of n components fi (1 ≤ i ≤ n) and their corresponding n standard deviations σi (1 ≤ i ≤ n) for componentwise Gaussian mutation. In this representation, each component fi denotes the mutated appearing frequency of feature word wi in the profile, and n is a user-specified parameter giving the number of top-ranked words in the profile to be considered. At each iteration of user–system interaction, the initial population is derived from the updated word profile. The evolutionary process then continues for a pre-defined number of generations, and the best individual obtained in the last generation is used to form the final query of that iteration. During the evolution, to create an offspring, each individual I is mutated to I′ = (f′1, f′2, …, f′n, σ′1, σ′2, …, σ′n) by the following operations
σ′i = σi · exp(τ′ · N(0,1) + τ · Ni(0,1))

and

f′i = fi + σ′i · Ni(0,1)

Here, N(0,1) is a normally distributed one-dimensional random number with mean zero and variance one; Ni(0,1) is a random number drawn anew for each component fi; τ and τ′ are set to the commonly used values (√(2√n))⁻¹ and (√(2n))⁻¹, respectively. To evaluate an individual I, the discovery agent first collects the set of m words (m < n) with the largest fi (i.e. the mutated frequency values) from the corresponding chromosome. The discovery agent then uses the conjunction of these m words to form a query and sends it to the search engine. Though it may be more flexible to use a combination of conjunctions and disjunctions of different words, to determine the appropriate combination for a set of
Fig. 3. The flow of query formulation during the evolutionary process of each iteration.
words nevertheless implies much more computational cost. It is thus not suitable for the interactive and iterative search task here. To avoid the over-constrained searches that the conjunctive form may cause, in our current implementation the search condition can be relaxed by iteratively removing one word from the conjunction. After a query is submitted, the top l pages retrieved by the search engine are delivered to the filtering agent, which re-ranks these web pages by calculating their similarities (i.e. the cosine measure) to the profile. The average of the similarity coefficients of the top k pages (k < l) after the re-ranking is then defined as the fitness of the individual I. Once the fitness values of all individuals have been determined, the best μ individuals selected from the (μ + λ) competitors are propagated to the next generation. After a pre-defined number of generations, the k pages derived from the final query of an iteration are presented to the user for his personal evaluation. With the user's feedback, the word profile is updated, a new initial population is created, and the evolutionary process starts again. This procedure is repeated until the user terminates his search. Fig. 3 illustrates the flow of query formulation in our framework. Note from Fig. 3 that the user only evaluates the results at the end of each iteration; during the evolution within each iteration, the fitness of an individual is estimated indirectly by the discovery agent rather than evaluated directly by the user. This inevitably causes some deviation, which can be observed in the experimental results.
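The mutation rules of Section 3.2 and one (μ + λ) generation can be sketched as follows. This is an illustrative sketch only: `evaluate` is a hypothetical stand-in for forming the conjunctive query from the top-m frequencies and averaging the top-k page similarities, and the toy fitness at the bottom is a placeholder:

```python
import math
import random

def mutate(individual, n):
    """Self-adaptive Gaussian mutation:
    sigma'_i = sigma_i * exp(tau' * N(0,1) + tau * N_i(0,1)),
    f'_i     = f_i + sigma'_i * N_i(0,1),
    with tau = 1/sqrt(2*sqrt(n)) and tau' = 1/sqrt(2*n).
    `individual` is the flat vector (f_1..f_n, sigma_1..sigma_n)."""
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    tau_prime = 1.0 / math.sqrt(2.0 * n)
    freqs, sigmas = individual[:n], individual[n:]
    shared = random.gauss(0, 1)       # one N(0,1) shared by all sigma_i
    new_sigmas = [s * math.exp(tau_prime * shared + tau * random.gauss(0, 1))
                  for s in sigmas]
    new_freqs = [f + s * random.gauss(0, 1)
                 for f, s in zip(freqs, new_sigmas)]
    return new_freqs + new_sigmas

def es_generation(population, n, evaluate, lam):
    """One (mu + lambda) generation: create lam offspring by mutation,
    then keep the best mu = len(population) of parents + offspring."""
    offspring = [mutate(random.choice(population), n) for _ in range(lam)]
    pool = population + offspring
    pool.sort(key=evaluate, reverse=True)     # higher fitness first
    return pool[:len(population)]

random.seed(0)
pop = [[10.0, 8.0, 6.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Toy fitness: the sum of the frequency part (a placeholder only).
next_pop = es_generation(pop, n=3, evaluate=lambda ind: sum(ind[:3]), lam=5)
print(len(next_pop))
```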
4. Experiments and results To assess the proposed framework and the learning approach, this section describes the experiments conducted. In the following experiments, we concentrate on evaluating the effect of the user–system interaction and the performance of the evolutionary mechanism. The former investigates whether the user's feedback is useful in modeling what he means by the query; the latter, how the learning approach presented can further improve the search quality. 4.1. Evaluating the effectiveness of a user's feedback As mentioned, this work aims to overcome the difficulty of specifying optimal queries to model the abstract concept a user has in mind, so that he can find semantically relevant web pages through the search engine. The optimal queries for this search task cannot be obtained immediately; one possible way is to derive them from the user–system interactions. Therefore, the first set of experiments examines the usefulness of a user's feedback. In this experiment, three volunteers independently submitted different queries to the system (the well-known search
engine Yahoo was used as the background search mechanism) to look for the same concept of “research work related to intelligent agent-based systems that involve machine learning techniques and with downloadable publications”. The three queries sent were 'computer science', 'artificial intelligence', and 'intelligent agent', representing the individual experimenters' different views of the concept to be searched. It should be noted that the semantic relevance of the retrieved results was determined subjectively by the experimenters rather than objectively by domain experts; it depended entirely on their personal understanding of the concept described. The three values 5, 5, and 7 corresponding to iteration 0 (i.e. i0) in Table 1 are the results for the above three queries, respectively. They represent the numbers of web pages a user indicated as semantically relevant among the top 20 pages retrieved by the search engine for each original query. As can be seen, the results are not satisfactory; only a small portion of the retrieved pages was chosen in each case. This shows the inefficiency of traditional search engines. With the above results, the interactive search procedure was activated. For each query, the selected pages were analyzed and the extracted feature words were used to modify the profile. The words with the top five appearing frequencies (a number determined by preliminary testing) in the profile were then combined with the logical operator 'and' to form a new query, which was submitted to the search engine again. To save processing time, only the top 50 pages retrieved by the search engine were collected and re-ranked by the cosine similarity measurement described in Section 3.1. If the total number of pages retrieved by the search engine was less than 50, the search condition was relaxed by removing the keyword with the smallest frequency value from the query, and the new combination was re-sent to the search engine.
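This relaxation step can be sketched as follows. The sketch is illustrative only: `search` is a hypothetical stand-in for the real search-engine call, and the toy engine at the bottom exists purely to exercise the loop:

```python
def relaxed_search(keywords, search, want=50):
    """Start from the AND of all keywords (ordered by descending
    profile frequency) and, while fewer than `want` pages come
    back, drop the lowest-frequency keyword and retry."""
    terms = list(keywords)
    pages = search(terms)
    while len(pages) < want and len(terms) > 1:
        terms.pop()                 # remove the least-frequent word
        pages = search(terms)
    return terms, pages[:want]

# Toy engine: the more terms in the conjunction, the fewer hits.
def fake_search(terms):
    return [f"page-{i}" for i in range(80 // len(terms) ** 2)]

terms, pages = relaxed_search(["agent", "learning", "paper"],
                              fake_search, want=50)
print(len(terms), len(pages))
```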
The above procedure continued until there were 50 pages collected. After that, the top 20 pages most similar to the newly updated word profile were presented to the user for further identification. The user’s feedback was then analyzed and used to update
Table 1
The search results of the interactive approach

Search keywords               i0   i1   i2   i3   i4   i5
Computer science          s    5    8    8   12   12   17
                          a    5   12   18   28   39   49
Artificial intelligence   s    5   16   11    5   16    5
                          a    5   20   29   32   45   46
Intelligent agent         s    7    4    4   10    9   13
                          a    7   10   14   19   28   41
s and a represent the number of pages selected by the user at each iteration, and the number of different pages accumulated so far, respectively.
the word profile for the next iteration. In our experiments, five consecutive iterations were performed for each initial query, and the results are shown in Table 1. In this table, each value in the first row (i.e. s) for a query is the number of pages identified as relevant by the experimenter among the 20 pages presented by the interface agent at each iteration; each value in the second row (i.e. a) is the number of distinct relevant pages accumulated so far. As can be seen, the numbers of pages selected tend to increase gradually from iteration to iteration. This shows that, by exploiting the user feedback collected through the interactive procedure, the search system can capture more and more accurately what the user is searching for semantically, and can therefore efficiently and continuously explore different regions of the web space. It can also be observed that some exceptions occur in which only four or five pages were selected. After further examination, we found that this is because, at certain iterations, the user chose some specific pages that were newly retrieved by the search engine and different from previously selected ones. This situation is similar to the effect called query drift (Crouch, Crouch, Chen, & Holtz, 2002; Mitra, Singhal, & Buckley, 1998), in which the focus of the query is altered by query expansion. In our cases, some new words appeared very often in the newly selected pages and thus dominated the profile. These words then directed the search engine to find pages on new topics that were still related to the concept originally specified. For example, in one of the cases with only five pages selected, the experimenter had indicated some index pages of knowledge-based systems as relevant, which differed from the previously selected pages whose contents mainly focused on intelligent computation. As shown in Table 1, though the pages retrieved during such a topic-transition phase could not meet
the user's need immediately, and the performance of the system thus declined temporarily, it soon recovered. 4.2. The performance of the evolution mechanism Having evaluated the validity of the above interactive approach, in this section we investigate whether the evolution mechanism can further expand the queries obtained from the user–system interaction for concept modeling. In the experiments, the evolutionary mechanism was implemented with the evolution strategies described in Section 3.2 for query formulation. In each iteration, the initial population was derived from the word profile by applying the mutation operation. Here, a (5 + 5)-ES model was used: each population contained five individuals, selected from the five parents and their five offspring. In the experiments, evaluating an individual involved picking the five words with the highest frequency values from the chromosome and sending the conjunction of these words to the search engine. The top 50 web pages retrieved were then re-ranked by their cosine similarities to the profile, and the average of the similarity coefficients of the top 20 pages after re-ranking determined the fitness of the individual. The above evolution procedure continued for five generations, and the 20 pages collected for the best individual of the last generation were presented to the user for his personal feedback. Once the user had identified the pages, those indicated as relevant were analyzed and the profile was updated. Then the next iteration of evolution started. As in Section 4.1, five iterations in total were performed for each initial query. Table 2 shows the results, in which the values in rows s1 and
Table 2
The search results by the search engine with the interactive evolution mechanism

Search keywords           Row    i0     i1      i2      i3      i4      i5
Computer science          s1     5      9       10      3       15      15
                          c1     –      0.283   0.269   0.191   0.194   0.194
                          a      5      14      21      24      37      37
                          s2     –      9       13      9       15      16
                          c2     –      0.283   0.229   0.188   0.194   0.174
Artificial intelligence   s1     5      13      16      18      16      17
                          c1     –      0.326   0.200   0.196   0.174   0.179
                          a      5      17      31      38      43      50
                          s2     –      14      16      19      18      18
                          c2     –      0.303   0.200   0.158   0.173   0.171
Intelligent agent         s1     7      12      16      13      14      20
                          c1     –      0.252   0.270   0.251   0.229   0.223
                          a      7      16      29      32      34      43
                          s2     –      15      16      17      17      20
                          c2     –      0.233   0.270   0.231   0.212   0.223

The symbols s1, c1, a, s2, and c2 are defined in the text.
a for each query are, respectively, the numbers of pages selected by the experimenters at the end of each iteration (i.e. after the evolution) and the numbers of distinct pages accumulated so far. Compared with Table 1, better search results are obtained with the evolutionary mechanism. For the case with the initial query 'intelligent agent', the system even achieves 100% search accuracy in the final iteration. This demonstrates the effectiveness of our evolution mechanism. As in Table 1, there is a small value (3) at the third iteration for the query 'computer science'. Examination showed that its cause is the same as in the specific cases of Section 4.1, and the situation likewise recovers soon. To limit the time required, the fitness of an individual in the above evolutionary process is determined by the cosine-similarity measure. As mentioned in Section 3.2, this indirect evaluation may cause some deviation in prediction, because the fitness obtained cannot truly reflect a user's opinion. Some results in Table 2 illustrate this situation. Each value in the row c1 is the average of the similarity coefficients of the 20 pages corresponding to the best individual at each iteration. The other two rows, s2 and c2, record the actual best results obtained from direct evaluation by the experimenters (who were asked to give feedback separately on the pages retrieved by each individual in the last generation): s2 is the number of pages the experimenter selected for the best query individual, and c2 the average of the similarity coefficients of the corresponding 20 pages. As can be seen, the larger similarity value used to determine the fitness in the experiments does not guarantee higher accuracy in the real situation. However, although the deviation slightly reduces performance, the indirect fitness measurement saves a great deal of processing time, which is an important factor for the interactive and iterative search task here.
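The evolutionary query-formulation procedure described in this section can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the real search-engine call is stubbed out by a hypothetical `fake_search` returning random page vectors, and the self-adaptation of mutation step sizes used in full evolution strategies is omitted for brevity.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mutate(chrom, sigma=0.1):
    """Perturb all word weights simultaneously with Gaussian noise."""
    return {w: max(0.0, f + random.gauss(0.0, sigma)) for w, f in chrom.items()}

def fitness(chrom, profile, search):
    """Indirect fitness: query with the 5 heaviest words, re-rank the 50
    retrieved pages by cosine similarity to the profile, and average the
    top-20 similarity coefficients."""
    query = sorted(chrom, key=chrom.get, reverse=True)[:5]
    pages = search(query, 50)                 # each page: word -> frequency
    scores = sorted((cosine(p, profile) for p in pages), reverse=True)
    top = scores[:20]
    return sum(top) / max(1, len(top))

def evolve(profile, search, mu=5, generations=5):
    """(5+5)-ES: 5 parents plus 5 mutated offspring; the best 5 survive."""
    parents = [mutate(dict(profile)) for _ in range(mu)]
    for _ in range(generations):
        offspring = [mutate(p) for p in parents]
        pool = parents + offspring
        pool.sort(key=lambda c: fitness(c, profile, search), reverse=True)
        parents = pool[:mu]
    return parents[0]

def fake_search(query, n):
    """Hypothetical stand-in for the search-engine call (not in the paper)."""
    vocab = ["agent", "intelligent", "web", "search", "evolution"]
    return [{w: random.random() for w in vocab} for _ in range(n)]

profile = {"agent": 0.4, "intelligent": 0.3, "web": 0.2, "search": 0.1}
best = evolve(profile, fake_search)
```

After the five generations, the 20 pages retrieved by `best` would be shown to the user, whose feedback updates the profile for the next iteration.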
5. Conclusion and future work

In this paper, we have emphasized the importance of user feedback and concept modeling in web search. To achieve concept-based semantic search, we presented an agent-based framework to overcome the difficulty of specifying appropriate queries for retrieving web content semantically relevant to a user's need. In our multi-agent framework, each agent is responsible for a specific subtask, including information presentation and gathering, web content analysis, query formulation, and information filtering. In particular, a special kind of evolutionary algorithm, the evolution strategies, is used to develop a discovery agent for concept modeling in query formulation. As analyzed, owing to their characteristics of simultaneous mutations and self-adaptation, ES are a better alternative than GAs for this search task and were consequently chosen here to evolve queries. In addition, because the search task here is user-centric, the process of query formulation must then be under the user's
guidance. Therefore, in this work, an interactive approach is employed to iteratively collect the user's feedback on the pages he has identified for the submitted query. By analyzing the pages gathered, the system can gradually create queries that efficiently describe the content the user is looking for. To assess our framework, several sets of experiments were conducted, and the preliminary results show that our approach improves search quality. The work presented here points to several prospects for future research. In the query-evolution experiments, instead of asking the user to evaluate the results of each generation manually, we used an indirect method to define the fitness of an individual, estimating it as the average of the similarity coefficients of the top-ranked pages. Although this method greatly reduces the effort of user-system interaction, it inevitably causes some deviation, because the cosine-similarity measure commonly used in information retrieval cannot truly reflect the semantic similarity of documents. It is therefore worthwhile to investigate other ways to guide the evolution. Another important issue is to examine the possibility of applying the collaborative techniques used in recommendation work (Basu, Hirsh, & Cohen, 1998; Smyth & Cotter, 2000) to enhance the effectiveness of the word profile used.
References

Bäck, T. (1996). Evolutionary algorithms in theory and practice. New York: Oxford University Press.
Basu, C., Hirsh, H., & Cohen, W. (1998). Recommendation as classification: using social and content-based information in recommendation. Proceedings of the National Conference on Artificial Intelligence, 714–720.
Chen, L., & Sycara, K. (1998). WebMate: a personal agent for browsing and searching. Proceedings of the International Conference on Autonomous Agents, 132–139.
Crouch, C. J., Crouch, D. B., Chen, Q., & Holtz, S. J. (2002). Improving the retrieval effectiveness of very short queries. Information Processing and Management, 38(1), 1–36.
Etzioni, O., & Weld, D. (1995). Intelligent agents on the internet: fact, fiction, and forecast. IEEE Expert, 10(4), 44–49.
Goldman, C. V., Langer, A., & Rosenschein, J. S. (1997). Musag: an agent that learns what you mean. Applied Artificial Intelligence, 11(5), 413–435.
Hsu, C.-C., & Chang, C.-H. (1999). WebYacht: a concept-based search tool for WWW. International Journal of Artificial Intelligence Tools, 8(2), 137–156.
Jansen, B., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life information retrieval: a study of user queries on the web. SIGIR Forum, 32(1), 5–17.
Joachims, T., Freitag, D., & Mitchell, T. (1997). WebWatcher: a tour guide for the World Wide Web. Proceedings of the International Joint Conference on Artificial Intelligence, 770–775.
Maes, P. (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 30–40.
Mitra, M., Singhal, A., & Buckley, C. (1998). Improving automatic query expansion. Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval, 206–214.
Moukas, A. (1997). Amalthaea: information discovery and filtering using a multi-agent evolving ecosystem. Applied Artificial Intelligence, 11(5), 437–457.
Nick, Z. Z., & Themis, P. (2001). Web search using a genetic algorithm. IEEE Internet Computing, 5(2), 18–26.
Pazzani, M., & Billsus, D. (1997). Learning and revising user profiles: the identification of interesting web sites. Machine Learning, 27, 313–331.
Salomon, R. (1996). Re-evaluating genetic algorithm performance under coordinate rotation of benchmark functions: a survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3), 263–278.
Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley.
Schwefel, H.-P. (1981). Numerical optimization of computer models. New York: Wiley.
Shu, B., & Kak, S. (1999). A neural-network based intelligent metasearch engine. Information Sciences, 120, 1–11.
Silverstein, C., Marais, H., Henzinger, M., & Moricz, M. (1999). Analysis of a very large web search engine query log. SIGIR Forum, 33(3), 6–22.
Smyth, B., & Cotter, P. (2000). A personalized TV listings service for the digital TV age. Knowledge-Based Systems, 13, 53–59.
Sycara, K., Pannu, A., Williamson, M., Zeng, D., & Decker, K. (1996). Distributed intelligent agents. IEEE Expert, 11(6), 36–46.