Information Systems 30 (2005) 630–648
Incremental mining of information interest for personalized web scanning

Rey-Long Liu, Wan-Jung Lin

Department of Information Management, Chung Hua University, No. 707, Sec. 2, Wufu Road, HsinChu, Taiwan 300, Republic of China

Received 7 February 2003; received in revised form 13 May 2004; accepted 5 July 2004
Recommended by Ricardo Baeza-Yates
Corresponding author: R.-L. Liu ([email protected])
Abstract

Businesses and people often organize their information of interest (IOI) into a hierarchy of folders (or categories). The personalized folder hierarchy provides a natural way for each user to manage and utilize his/her IOI (a folder corresponds to an interest type). Since the interest is relatively long-term, continuous web scanning is essential. It should be directed by precise and comprehensible specifications of the interest. A precise specification may direct the scanner to those spaces that deserve scanning, while a specification comprehensible to the user may facilitate manual refinement, and a specification comprehensible to information providers (e.g. Internet search engines) may facilitate the identification of proper seed sites to start scanning. However, expressing such specifications is quite difficult (and even implausible) for the user, since each interest type is often implicitly and collectively defined by the content (i.e. documents) of the corresponding folder, which may even evolve over time. In this paper, we present an incremental text mining technique to efficiently identify the user's current interest by mining the user's information folders. The specification mined for each interest type specifies the context of the interest type in conjunctive normal form, which is comprehensible to general users and information providers. The specification is also shown to be more precise in directing the scanner to those sites that are more likely to provide IOI. The user may thus maintain his/her folders and constantly get IOI, without paying much attention to the difficult tasks of interest specification and seed identification.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Context of information interest; Precise interest specifications; Comprehensible interest specifications; Incremental text mining; Web scanning
1. Introduction
Web scanning aims to continuously scan the web for information of interest (IOI) with respect to a set of relatively long-term interests (when compared with one-shot queries). It is motivated
by the common needs of those users (e.g. businesses and people) who constantly require timely information concerning their jobs and preferences. Given a specification of the user's interest, the scanner needs to identify the seeds from which IOI gathering and monitoring are started and conducted as a routine job. An example application of web scanning is business environmental scanning, which has been identified as an essential continuous task for many business activities, such as decision making [1] and product development [2]. Since the user's job and preferences are relatively long-term, the user's interest is relatively long-term as well [3]. The interest may evolve when internal conditions (e.g. business strategies and user preferences) and external conditions (e.g. the environments) change. The information scanned and the information manually collected are often stored in a hierarchy of folders (or categories). The personalized folder hierarchy facilitates the browsing and retrieval of information. Web scanning needs to be guided by precise and comprehensible specifications of the user's interest. A precise specification may direct the scanner to those spaces that deserve scanning (an imprecise specification may significantly deteriorate the performance of most information gatherers on the web [4]), while a specification comprehensible to the user may facilitate manual refinement, and a specification comprehensible to information providers (e.g. Internet search engines) may facilitate the identification of proper seed sites to start scanning. Moreover, due to the evolution of the user's interest, the specification should evolve as well. Imprecise or obsolete specifications may consume lots of resources (e.g. the scanner's effort, network bandwidth, and the services provided by related information servers) and deliver lots of garbage information to the user. Unfortunately, the user often has difficulties in properly specifying his/her relatively long-term but evolving interest. The interest is usually implicitly defined in the hierarchy of folders: a folder corresponds to an interest type, which is collectively defined by the documents under the folder (i.e. including the folder and the descendants of the folder). The user thus requires a mechanism that
may identify and represent each interest type and its evolution. With the help provided by such a mechanism, the user only needs to maintain his/her own folders; proper interest specifications may be automatically extracted to guide the scanner to find IOI using a smaller amount of resources. Fig. 1 illustrates the idea. Each user may set up and maintain a hierarchy of folders, and may enrich the folders by adding new IOI that is either automatically scanned or manually collected. A folder thus corresponds to an interest type of the user. A set of folders may be designated (by the user) as an interest set for the scanner to work on. An interest miner is activated once the contents of the designated folders are updated; it identifies the newest specification of each interest type (folder). Based on the specification, the scanner identifies proper seeds to start continuous gathering and monitoring of IOI from the web. Given the recent successful technologies in information monitoring [5] and gathering [6–10], the major challenge in realizing this scanning flow lies in the interest miner, which we believe is an essential component for achieving satisfactory and cost-effective web scanning. This is the target we pursue in this paper. In the next section, we discuss the weaknesses of previous studies in interest mining. An incremental mining technique is then developed to derive precise and comprehensible interest specifications from the user's personalized folder hierarchy (i.e. the interest miner in Fig. 1). The framework is named IMind (Interest Mining from personalized folders) and is presented in Section 3. To investigate the contributions of IMind, an experiment in a real-world environment was conducted (ref. Section 4). Theoretical analysis and empirical results show that IMind may efficiently keep the scanner precisely aware of the most recent interest of the user, and hence guide the scanner to find more targets of interest, while separating the user from the difficult (even implausible) tasks of interest specification and seed identification.
2. Related work

Web scanning consists of two main tasks: information gathering (discovery through crawling)
[Fig. 1. Mining personalized information folders for web scanning. The user designates interest folders in a personalized folder hierarchy; the interest miner derives a specification of the user's interest; the scanner uses the specification for seed finding and then for continuous gathering and monitoring on the web; newly scanned information flows back into the folders.]
and information monitoring [3]. They work together to find complete and up-to-date IOI for the user. Previous information gathering studies mainly focused on satisfying users' "one-shot" needs for IOI (e.g. [6,8,10]). They also considered various factors, such as time/cost/quality tradeoffs [7], alleviation of information overload [9], and mining linkage structure [11] and users' browsing behaviors [12] to improve searching. As to information monitoring, previous scanning systems often employed periodic monitoring to actively check for information updates [13,14]. To promote the cost-effectiveness of monitoring, adaptive monitoring was developed to monitor different targets at different frequencies [5]. The integrative treatment of continuous information gathering and monitoring under limited resource consumption was explored as well [3]. A web scanner may provide IOI to the user only if the user's interest is precisely specified. Imprecise specifications can significantly deteriorate the performance of state-of-the-art information gathering algorithms [4]. Interaction is a common way to disambiguate the user's interest [15,16]. Human–system interactions often helped to refine the queries for the next round of information retrieval [17–19]. To conduct such interaction, users need to read and process the retrieved information and
then provide relevance feedback to the system. However, users are often unable (or unwilling) to read much information [20] and interact with the system many times [21]. This problem becomes more serious in web scanning, which is a continuous task: the scanner never stops consuming resources to find candidate IOI to be verified, and the user may thus be notified at any time for such verification. Moreover, it is quite difficult for the user to set up queries to start scanning, since the user's interest is often implicitly and collectively defined by the contents of the information folders, which may evolve over time. One way to alleviate the problem is to automatically set up the interest specification, and to update the specification once the contents of the folders evolve. Mining the folders for the interest specification is thus a promising approach. Instead of grouping information to automatically create folders for archiving users' interests [22,23], the mining task aims to build a profile for each existing folder from the documents under the folder. The profile needs to be a good basis for deriving a precise specification of the corresponding interest type. It should be composed of carefully selected features (terms or phrases). Such a feature selection task was routinely conducted for various purposes, including query refinement (e.g. [17–19,24]), document summarization
(e.g. [25]), document classification (e.g. [25–27]), document filtering (e.g. [28]), and retrieval of similar documents (e.g. the well-known give-me-more retrieval). It also helped to build a classifier to judge the relevance of a document (e.g. a document was said to be more relevant if it had a higher degree of match with those categories that were of the user's interest [23,29,30]). However, these previous techniques did not extract the context of each interest type, making the interest specifications vague. As an example, consider a folder hierarchy containing three branches of folders: (1) Root → System Development → Decision Support Systems, (2) Root → Manufacturing → Decision Support Systems, and (3) Root → Financial Management → Decision Support Systems. Although there are three folders (interest types) named "decision support systems", they have different contexts of discussion (i.e. strategy of implementation, usage in manufacturing, and usage in financial services, respectively). Obviously, with proper attention to the contexts, web scanning may be more focused, cost-effective, and satisfactory. Perhaps the most straightforward way to consider interest contexts is to derive a query by simply concatenating the folder names of the same branch [4]. This approach leaves much room for improvement, since the name of a folder cannot precisely and completely define the folder. As noted above, each folder should be implicitly and collectively defined by its content, which may evolve over time. Incremental mining of the context of each interest type is thus essential. Moreover, the language used to represent the interest specifications is essential as well: it should be comprehensible, so that the user may manually refine the interest specifications and the system may comprehensively identify proper seeds to start scanning on the ever-changing web. Seed identification is the first key step of scanning. Previous related studies often retrieved seeds from a given set of web pages [10,29,30]. This method is not suitable for personalized web scanning, since (1) for each interest type of each user, it is unrealistic to predefine a static set of web pages that comprehensively contains proper seeds from the ever-changing web, (2) the user often cannot comprehensively specify the seeds on the web,
and (3) the documents in the user's folders cannot always contain proper seeds either (the documents may even contain no hyperlinks). To tackle the problem, Internet search engines may be invoked to provide comprehensive and adaptive coverage of the web. Therefore, the interest specifications should be comprehensible to both the user and the search engines (e.g. in conjunctive normal form). However, deriving such interest specifications is challenging. Therefore, a straightforward approach was often employed: selecting several good terms and integrating them in disjunctive form [30]. Unfortunately, this approach cannot precisely express the context of each interest type. For example, consider the above interest type: Root → System Development (SD) → Decision Support Systems (DSS). If the interest type is specified in disjunctive form, the terms concerning SD and/or the terms concerning DSS will be mixed together, incurring much irrelevant information, such as information about SD but not DSS, and information about DSS but not SD. Therefore, to achieve cost-effective web scanning, mining a precise and comprehensible specification for each interest type (folder) is both essential and helpful. The mining process should be incremental in order to adapt the interest specifications to the evolving folders. It should also precisely recognize the context of each interest type, and represent the context in a form that facilitates comprehension and seed identification. Such a mining process may promote user satisfaction and the cost-effectiveness of continuous information gathering and monitoring on the web, while separating the user from the tedious (and even implausible) tasks of interest specification and seed identification.
3. The interest miner

Table 1 presents the algorithm of the incremental interest miner IMind. IMind operates on a given hierarchy T of information folders (ref. Input I). A folder corresponds to an interest type of the user. Each folder has a profile, which serves as the basis for deriving an interest specification for the corresponding interest type. The user may
Table 1
Incremental interest mining

Input: (I) a hierarchy T of folders, (II) a set of folders G designated as the goals of web scanning, and (III) a set X of documents added to a folder f.
Effect: (I) update the profile of each related folder of f in T, and (II) for each folder g in G, if the interest specification of g has changed, send the new specification to the scanner.

Begin
(1) W ← {w | w is a word occurring in documents in X, and w is not a stop word};
(2) While f is not the root of T, do
    (2.1) For each word w in W, do
        (2.1.1) If w is not a profile term for f, add ⟨w, r_{w,f}, d_{w,f}⟩ to the profile of f (r_{w,f} and d_{w,f} are initially unknown);
    (2.2) For each 3-tuple ⟨w, r_{w,f}, d_{w,f}⟩ in the profile of f, do
        (2.2.1) Update r_{w,f} and d_{w,f};
        (2.2.2) For each sibling b of f, update d_{w,b};
    (2.3) f ← parent of f;
(3) For each goal folder g in G, do
    (3.1) I_g ← disjunction of the profile terms having higher r_{w,g}·d_{w,g} values (a number α of profile terms in g are selected);
    (3.2) a ← parent of g;
    (3.3) While a is not the root of T, do
        (3.3.1) I_g ← conjunction of I_g and the disjunction of the terms having higher r_{w,a}·d_{w,a} values (a number α of profile terms occurring in both a and g are selected);
        (3.3.2) a ← parent of a;
    (3.4) If I_g ≠ the specification of g, send I_g to the scanner to update the specification of g;
End.
designate a set G of interest types as the goals of web scanning (ref. Input II). Once a set X of documents is added to a folder f (ref. Input III), IMind conducts two tasks: (1) updating the profiles of related folders (ref. Effect I, and Steps 1 and 2), and (2) notifying the scanner if the interest specification of any interest type in G has changed (ref. Effect II, and Step 3).

3.1. Updating the folder profiles

The profile of a folder f is initially empty and may evolve over time. It is a set of 3-tuples ⟨w, r_{w,f}, d_{w,f}⟩, where w is a word (or feature), r_{w,f} is the degree to which w may represent the content of the documents under f, and d_{w,f} is the capability of w in discriminating f from the siblings of f. A good profile term is one that is both representative (having a higher r-value) and discriminative (having a higher d-value). Since d_{w,f} is estimated with respect to the siblings of f (rather than all folders, as in most previous approaches), general (specific) terms tend to be good profile terms for higher- (lower-) level
folders. For example, suppose that in the user's hierarchy of folders, system development (SD) has two subfolders: decision support systems (DSS) and accounting information systems (AIS). Terms like "information" and "maintenance" may have a lower d-value in both DSS and AIS (since the documents in both subfolders are about the implementation and maintenance of information systems), but a higher d-value in SD if the sibling folders of SD are not about system development (recall that, in the example raised in Section 2, SD has two siblings: manufacturing and financial management). Therefore, "information" and "maintenance" may be good for discriminating SD from its siblings, which are at the same level of generality. Similarly, terms like "inventory" and "sales" may get a higher d-value in AIS, but not in SD or DSS. That is, the generality of each term is implicitly defined by the personalized hierarchy; IMind makes the generality explicit by mining. Technically, the degree of representation of a word w under f (i.e. r_{w,f}) is estimated by P(w|f), which is equal to TF(w,f)/Size(f), where TF(w,f) is
the number of occurrences of w in documents under f, and Size(f) is the total number of terms in the documents under f (from the viewpoint of data mining, P(w|f) may be viewed as the support of w under f). On the other hand, the capability of w in discriminating f from its siblings (i.e. d_{w,f}) is estimated by P(w|f) × (B_f / Σ_i P(w|f_i)), where B_f is the number of siblings of f plus one (i.e. including f), and the summation of P(w|f_i) is conducted over f and its siblings. Thus, for example, w may get a higher d-value if it occurs frequently under f (i.e. P(w|f) is high) but infrequently, on average, under the siblings of f (i.e. B_f / Σ_i P(w|f_i) is high). In that case, w may be good in discriminating f from the siblings of f. Also note that 0 ≤ d_{w,f} ≤ B_f for each profile term w in f. When w only occurs in f, d_{w,f} = B_f, while d_{w,b} = 0 for each sibling b of f. If P(w|f) is higher than Σ_i P(w|f_i)/B_f (i.e. the average P(w|f_i)), then d_{w,f} > 1; otherwise d_{w,f} ≤ 1.

Note that the set X of documents added to a folder f may only trigger profile updates for a limited number of related folders. Fig. 2 illustrates the idea. For r-values, since the addition of X only changes the sizes of f and the ancestors of f, only the r-values of the profile terms of f (ref. Step 2.2.1) and its ancestors (ref. Step 2.3) are updated. As to d-values, the updates are conducted only on f, the siblings of f, the ancestors of f, and the siblings of the ancestors (ref. Steps 2.2.1, 2.2.2, and 2.3). This is because, when X is added to f, P(w|f) of each profile term w in f needs to be updated. This update triggers the update of the second component of d_{w,f} (i.e. B_f / Σ_i P(w|f_i)). Since each sibling b of f employs the same second component to compute d_{w,b}, d_{w,b} needs to be updated as well. Similar updates should be conducted when the process proceeds to the ancestors of f. Also note that, to update r-values and d-values, IMind only needs to keep track of the data required to compute P(w|f). No previous training documents need to be reprocessed, and no feature set needs to be predefined: the features (vocabulary) for expressing the profiles may evolve over time, without the time-consuming trial-and-error feature-set tuning employed in many previous related techniques. A code sketch of these update rules is given after Fig. 2.
[Fig. 2. Updating the profiles of related folders once a set X of documents is added to folder f: both r-values and d-values of the profile terms are updated for f and its ancestors, while only d-values are updated for their siblings.]
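To make the update rules concrete, the following is a minimal Python sketch of Steps 1–2 of Table 1 under simplifying assumptions: whitespace tokenization, and lazy recomputation of r- and d-values from the stored counts instead of the explicit 3-tuple storage of Table 1. All class and function names are illustrative, not the authors' implementation.

```python
from collections import Counter

class Folder:
    """A node in the personalized folder hierarchy (illustrative structure)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.tf = Counter()   # TF(w, f): occurrences of w in documents under f
        self.size = 0         # Size(f): total term occurrences under f
        if parent is not None:
            parent.children.append(self)

    def siblings(self):
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def p(self, w):
        """r_{w,f} = P(w|f) = TF(w,f) / Size(f)."""
        return self.tf[w] / self.size if self.size else 0.0

def d_value(w, f):
    """d_{w,f} = P(w|f) * B_f / sum of P(w|f_i) over f and its siblings."""
    group = [f] + f.siblings()          # B_f = len(group), i.e. siblings plus f
    total = sum(g.p(w) for g in group)
    return f.p(w) * len(group) / total if total else 0.0

def add_documents(f, docs, stopwords=frozenset()):
    """Step 2 of Table 1: add a document set X to folder f.

    Only the TF/Size counts of f and its ancestors change; r- and d-values
    are recomputed on demand from these counts, so the d-values of f's
    siblings (which share the sibling-group denominator) stay consistent
    without reprocessing any previously added document.
    """
    words = [w for doc in docs for w in doc.lower().split()
             if w not in stopwords]
    node = f
    while node is not None:             # f, parent of f, ..., root
        node.tf.update(words)
        node.size += len(words)
        node = node.parent
```

Storing counts and deriving r- and d-values lazily is a design choice of this sketch; the paper's version stores the 3-tuples explicitly and updates them eagerly, which trades memory for faster specification derivation.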
3.2. Deriving the specification of each goal folder

Profile mining provides the fundamental basis for deriving the specification of each folder (interest type). As noted above, it identifies good profile terms (i.e. both representative and discriminative) for each folder, and general (specific) terms tend to have higher d-values in general (specific) folders. Therefore, good profile terms for the ancestor folders of f may indicate the context of f. For example, consider the above-mentioned folder hierarchy in which system development (SD) has two subfolders: decision support systems (DSS) and accounting information systems (AIS). Through profile mining, "information" and "maintenance" may be found to be good profile terms for SD, while "sales" and "inventory" may be good profile terms for AIS. Thus a good specification for the AIS folder should include not only "sales" and "inventory" but also "information" and "maintenance", which indicate the context of the folder. Therefore, the specification for a folder f should be composed of good profile terms of f and of each ancestor folder of f. The good profile terms from the folders of the same pedigree should be integrated in conjunctive form, so that the contextual requirement at each level of generality is enforced. On the other hand, the good profile terms from each individual folder should be integrated in disjunctive form, so that the coverage of the terms may represent the main content of the folder. That is, the specification for a folder should be in conjunctive normal form. For the above example, the specification for AIS may look like {… ⟨information OR maintenance⟩ AND ⟨sales OR inventory⟩}. Therefore, for each goal folder f, IMind derives a specification in conjunctive normal form and then sends the specification to the scanner immediately (ref. Step 3). The specification is derived by checking the profiles of f and the ancestors of f. Good profile terms are extracted by selecting those terms that have higher r-values and d-values (ref. Steps 3.1 and 3.3.1). The number of terms selected from a profile is governed by a system parameter α. The terms selected are then integrated in disjunctive form, and the final interest specification is simply the conjunction of the disjunctions for f and its ancestors (ref. Step 3.3.1). Note that, when selecting terms to construct the disjunction for an ancestor of f, only the terms that occur in the profile of f may be candidates. This guarantees that each term in the specification of f really occurs in documents of f.
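Continuing the sketch above (and reusing its Folder class and d_value function), the following hedged Python fragment shows one way Step 3 of Table 1 could be realized; again, names and details are illustrative assumptions rather than the authors' code.

```python
def specification(g, alpha=10):
    """Step 3 of Table 1: build the CNF specification of goal folder g.

    One disjunction of (at most) alpha terms is built for g and for each
    non-root ancestor of g; candidate terms are restricted to terms that
    occur in g's own profile, so every selected term really occurs in
    documents of g. Ancestors score these terms with their own r*d values.
    """
    def top_terms(folder, candidates):
        ranked = sorted(candidates,
                        key=lambda w: folder.p(w) * d_value(w, folder),
                        reverse=True)
        return ranked[:alpha]

    own_terms = list(g.tf)                       # profile terms of g itself
    clauses = [top_terms(g, own_terms)]          # disjunction for g
    a = g.parent
    while a is not None and a.parent is not None:    # stop before the root
        clauses.append(top_terms(a, own_terms))
        a = a.parent
    # Most general context first, as in the paper's example specification.
    return " AND ".join("(" + " OR ".join(c) + ")" for c in reversed(clauses))
```

For the AIS example, specification(ais, alpha=2) could yield something like "(information OR maintenance) AND (sales OR inventory)".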
3.3. Space and time complexities of interest mining

To further illustrate the behavior of IMind, its space and time complexities deserve analysis. More empirical results are reported in the experiment (ref. Section 4.4).
3.3.1. Space complexity

IMind needs to store the r- and d-values of each folder's profile terms, and hence the maximum memory required is O(N·t), where N is the total number of distinct profile terms currently accumulated in the hierarchy and t is the number of folders in the hierarchy. Since it is often the case that N ≪ 10^5 and t ≪ 10^3 in a personalized hierarchy, the memory requirement may be supported by modern computers. Actually, IMind seldom requires so much memory, since the number of profile terms in most folders is much smaller than N. This is because each folder mainly contains documents that talk about the folder's topic using many topic-specific terms, which do not appear in all siblings of the folder, leading to the common case that child folders have far fewer profile terms than their parent (recall that a parent's set of profile terms is actually the union of its children's sets of profile terms). Therefore, the deeper a folder's level, the fewer profile terms the folder tends to have. Even the folders at level 1 (i.e. level-1 folders) may only have a subset of the N profile terms (i.e. only the union of the sets of profile terms of all level-1 folders may have N terms). For example, an empirical observation on a text hierarchy of Yahoo! (http://www.yahoo.com, ref. Section 4) showed that there were 82,620 distinct terms in the hierarchy (i.e. N = 82,620), but only about 60,000 profile terms in each level-1 folder.

3.3.2. Time complexity

The time complexity of interest mining is determined by the two main steps: profile mining (i.e. Step 2 in Table 1) and specification derivation (i.e. Step 3 in Table 1). The time complexity of profile mining may be measured by estimating the maximum number of profile terms that are updated when a document set X is added into a folder f. As noted in Section 3.1 and Fig. 2, the addition of X only triggers profile updates on f, the siblings of f, the ancestors of f, and the siblings of the ancestors. Therefore, the maximum number of updates is Σ_{i=1..h} B_i·N, where B_i is the number of siblings of the level-i ancestor of f (i.e. the ancestor whose level is i) plus one (i.e. including the level-i ancestor), and h is the level of f (i.e. B_h is simply the number of siblings of f plus one). On the other hand, the time complexity of specification derivation is determined by the maximum number of operations required to select profile terms to update the interest specifications of goal folders. As noted above, when a document set X is added to a folder f, only a limited number of profiles are updated (i.e. the profiles of f, the siblings of f, the ancestors of f, and the siblings of the ancestors), and only the specifications related to the updated profiles need to be updated. Therefore, the maximum number of operations required to update the interest specifications is Σ_{i=1..h} Σ_{j=1..B_i} α·b_{i,j}·N, where α is the parameter governing the number of terms selected from each profile (ref. Steps 3.1 and 3.3.1 in Table 1), and b_{i,j} is the number of descendant goal folders of the jth sibling of the level-i ancestor of f (1 ≤ i ≤ h, 1 ≤ j ≤ B_i, and when j = B_i the jth sibling is simply the level-i ancestor of f). Actually, both profile mining and specification derivation seldom require so much time, since the number of profile terms in most folders is much smaller than N (for the same reason discussed in the space complexity analysis of Section 3.3.1).
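Collecting the three bounds above in one place (a restatement of the analysis, not an addition; h is the level of f):

```latex
\underbrace{O(N\,t)}_{\text{space}},\qquad
\underbrace{\sum_{i=1}^{h} B_i\,N}_{\text{profile updates per added set } X},\qquad
\underbrace{\sum_{i=1}^{h}\sum_{j=1}^{B_i} \alpha\, b_{i,j}\, N}_{\text{specification-update operations}}
```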
4. Experiment

Experiments on real-world text hierarchies were conducted to evaluate IMind. The evaluation focused on the quality of the interest specifications produced by IMind and several previous techniques. The quality was measured by counting the seeds that could be retrieved by sending the specifications to Yahoo (http://www.yahoo.com). An interest specification can be said to be more precise if it guides Yahoo to find more seeds that are really of the corresponding interest type. That is, we employed a popular search engine (i.e. Yahoo) to objectively and quantitatively measure the quality of interest specifications. To further evaluate IMind from other perspectives, empirical results on its performance under different search engines and the time
spent for incremental interest mining are also reported.
4.1. The experimental data

Experimental data was extracted from the text hierarchy of Yahoo. To make the data more comprehensive, a hierarchy H was constructed by considering the categories under three top-level categories: "Science", "Computers and Internet", and "Society and Culture". A category in H corresponds to a folder, which could contain several documents. A document corresponded to a web site; it consisted of web page(s) manually extracted to represent the main content of the corresponding web site (a web site often contained many pages not conveying its main content). To investigate the contributions of IMind under different hierarchies, two hierarchies were constructed from H: a larger hierarchy and a smaller hierarchy. The larger hierarchy inherited both the structure and the data from H, except that only leaf folders contained documents. The smaller hierarchy shared the same structure and data with the larger hierarchy, except that some leaf folders in the smaller hierarchy were non-leaf folders in the larger hierarchy. Note that the total number of documents in these leaf folders was larger than the total number of documents in their corresponding descendants in the larger hierarchy, since the leaf folders could contain the documents from all their corresponding descendants in H, including both leaf and non-leaf descendants. Therefore, the smaller hierarchy had fewer leaf folders but more training documents. More specifically, the larger hierarchy had 261 folders, among which 174 were leaf folders containing a total of 2844 documents. The smaller hierarchy had 169 folders, among which 119 were leaf folders containing a total of 3615 documents. In the structures of the two hierarchies, 97 leaf folders were in common, while 22 leaf folders in the smaller hierarchy were non-leaf folders in the larger hierarchy. The profile of each folder was built by mining the documents in the hierarchies. The two hierarchies were mined and tested independently.
To conduct a thorough performance investigation, all leaf folders were designated as the user's goals of scanning (i.e. the set G in Table 1), except for those folders that were redirected to other folders (i.e. those categories having '@' at the end of their names in Yahoo). Thus each tested folder was unique (there were no duplicate folders in testing). In the larger hierarchy, 142 folders were designated as goals; in the smaller hierarchy, 109 folders were designated as goals.

4.2. The evaluation method and criteria

For each folder designated as a goal, a specification was constructed by IMind and several previous techniques (introduced in Section 4.3), and then sent as a query to Yahoo to retrieve web sites. If Yahoo retrieved web sites for a query, we controlled the number (<200) of top-ranking web sites retrieved. For each site retrieved, Yahoo showed a brief summary, followed by the site's category (folder) in the whole hierarchy of Yahoo. Based on this information, we could evaluate the quality of the specifications produced by IMind and the previous techniques. Two evaluation criteria were employed to measure the quality of the specifications: (1) the average number of web sites correctly retrieved per folder, and (2) the percentage of goal folders for which at least one web site was correctly retrieved. The first criterion measured the completeness of the retrieval of the information that was of the goal interest types, while the second criterion measured the reliability of the retrieval across different interest types. For example, if a set of interest specifications guides the search engine to retrieve a large number of target sites for only a small number of interest types, completeness is good only for those interest types (but poor for the others), making the retrieval unreliable. Interest specifications can be said to be more precise if they guide the search engine to (1) retrieve more sites that are of the goal folders, and (2) find sites for a higher percentage of the goal folders.
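As a concrete reading of the two criteria, the following short sketch computes both from a hypothetical per-folder tally of correct retrievals (the data format is an assumption for illustration, not the authors' tooling):

```python
def evaluate(correct_counts):
    """correct_counts: for each goal folder, the number of retrieved sites
    judged to belong to that folder. Returns (criterion 1, criterion 2)."""
    n = len(correct_counts)
    avg_sites = sum(correct_counts) / n                          # criterion 1
    pct_found = 100.0 * sum(c > 0 for c in correct_counts) / n   # criterion 2
    return avg_sites, pct_found

# e.g. evaluate([3, 0, 5, 1]) -> (2.25, 75.0)
```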
The objectivity of the evaluation method deserves justification. The quality of a specification was objectively evaluated by sending the specification to a search engine (i.e. Yahoo), which claims that the sites it retrieves are sorted by relevance using its complicated, proprietary algorithm; no other factors may affect the search results. Also note that we were not training the systems on a document and then asking the search engine to retrieve the same document. Two facts justify this claim: (1) only the documents from a subset of the web sites in each category were included in the training data, and (2) even when a web site was included in the training data, the document for the web site was selected by manually visiting and examining the web site, while Yahoo employed totally different information (i.e. not the manually selected document) to judge relevance when retrieving web sites. It is thus reasonable to expect that a specification proved to be better (in terms of the two criteria) in the experiment may perform better with other search engines and scanners as well.

4.3. The systems evaluated

Table 2 defines the systems evaluated in the experiment. IMind has only one parameter (i.e. α, the number of terms selected at each level of profile), which had two settings: 10 (IMind-10) and 20 (IMind-20). In addition to IMind, there were four baselines for performance comparison: norm of the folder (NOF), Rocchio (RO), Naive Bayes (NB), and hierarchical shrinkage (HS). For each leaf folder, the baselines created a profile by their individual weighting methods, although the profiles were originally used for other purposes (e.g. classification, filtering, and relevance feedback for query refinement). Employing NOF, RO, NB, and HS as baselines deserves justification. To our survey, no previous techniques were dedicated to context-based interest mining for personalized web scanning. The baselines thus aimed to represent modern techniques related to the challenge. NOF and RO represented techniques that build profiles by analyzing the document vectors in the corresponding folders, NB represented techniques that build a probability-based profile for each folder, and HS represented techniques that employ a text hierarchy to build a probability-based profile for each folder.
Table 2
Systems for performance investigation

System: IMind
Type: Experiment system to be evaluated
Description: The system was implemented by following the descriptions presented in Section 3.
Variants: (1) IMind-10: α = 10; (2) IMind-20: α = 20

System: Norm-of-the-folder (NOF)
Type: Baseline for techniques that build a profile for a folder using the documents of the folder only (without considering other folders)
Description: The profile of the folder was a vector constructed by averaging the document vectors in the folder.
Variants: (1) NOF-10: extracting the same maximum number of terms as IMind-10; (2) NOF-20: extracting the same maximum number of terms as IMind-20

System: Rocchio's method (RO)
Type: Baseline for techniques that build a profile for a folder by considering both positive (in-folder) and negative (out-folder) documents
Description: The profile was a vector constructed by computing a weighted sum of the positive document vectors and the negative document vectors.
Variants: (1) RO-10: extracting the same maximum number of terms as IMind-10; (2) RO-20: extracting the same maximum number of terms as IMind-20

System: Naive Bayes (NB)
Type: Baseline for techniques that build a probability-based profile for a folder
Description: The profile was constructed by estimating the conditional probabilities of the terms in the folder.
Variants: (1) NB-10: extracting the same maximum number of terms as IMind-10; (2) NB-20: extracting the same maximum number of terms as IMind-20

System: Hierarchical shrinkage (HS)
Type: Baseline for techniques that employ a text hierarchy to build a probability-based profile for a folder
Description: The profile was constructed by employing the hierarchical relationships (e.g. sibling) among folders to refine the estimates of the conditional probabilities (in NB).
Variants: (1) HS-10: extracting the same maximum number of terms as IMind-10; (2) HS-20: extracting the same maximum number of terms as IMind-20
When compared with IMind, the profiles constructed by the baselines (and by other previous methods as well) had a common property: no contextual information could be conveyed. This property helped to measure the contributions of IMind in mining and representing the context of each interest type. NOF and RO employed the well-known and popular tf·idf technique to estimate the weight of each feature. The weight of a feature w in the vector for document p was equal to TF(w,p) × log2(total number of documents / number of documents containing w), where TF(w,p) was the number of times w occurs in p. The tf·idf technique is routinely employed in many applications in which vector-based representation serves as the basis for similarity estimation. For each folder, NOF averaged the vectors of the documents in the folder.
RO is a state-of-the-art technique for creating vector-based profiles (or queries) from a collection of training data. It has been commonly studied from the perspectives of document classification (e.g. [31]), filtering (e.g. [27]), and retrieval (e.g. [21]). As with NOF, RO constructed a vector for each folder. The major difference was that, in addition to the documents in the folder, RO also considered the documents outside the folder. More specifically, the vector for folder f was equal to η1 × Σ_{Doc∈P} Doc/|P| − η2 × Σ_{Doc∈N} Doc/|N|, where P was the set of vectors for "positive" documents (i.e. the documents in f), and N was the set of vectors for "negative" documents (i.e. the documents not in f). In the experiment, η1 and η2 were set to 16 and 4, respectively, since this setting was believed to be promising in previous studies (e.g. [31]).
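A minimal sketch of the NOF/RO weighting described above (tf·idf document vectors, then Rocchio's positive/negative combination with η1 = 16 and η2 = 4); the dict-based vector representation and function names are assumptions of this sketch:

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, df, n_docs):
    """TF(w,p) * log2(total docs / docs containing w), per the text above."""
    return {w: c * math.log2(n_docs / df[w])
            for w, c in Counter(doc_terms).items()}

def centroid(vectors):
    """Average of a list of sparse vectors (dicts); empty input -> {}."""
    if not vectors:
        return {}
    acc = Counter()
    for v in vectors:
        acc.update(v)
    return {w: x / len(vectors) for w, x in acc.items()}

def rocchio_profile(pos_vectors, neg_vectors, eta1=16.0, eta2=4.0):
    """eta1 * centroid(positive docs) - eta2 * centroid(negative docs)."""
    profile = Counter({w: eta1 * x for w, x in centroid(pos_vectors).items()})
    for w, x in centroid(neg_vectors).items():
        profile[w] -= eta2 * x
    return profile   # NOF is the special case eta1 = 1, eta2 = 0
```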
NB and HS employed conditional probabilities to estimate the weight of each feature; they are often employed for document classification. NB estimated the conditional probability P(w|f) for every feature w and folder f (with standard Laplace smoothing to avoid zero probabilities). In document classification, the "similarity" between a category and an input document p was based on the product of the conditional probabilities of the features (in the category) that occurred in p. NB is a popular classification technique; for more details, the reader is referred to [26,32]. HS was implemented to represent those approaches that employ a text hierarchy to improve non-hierarchical classification [26,33]. It is a recent approach shown to be fault-tolerant and promising [26]. The given text hierarchy was employed to refine the parameter estimation of an NB classifier. The basic idea was, for each leaf folder f, to derive a weight for f and each of its ancestors. The refined estimate of the probability of a term w under f (i.e. P(w|f)) was simply a weighted sum of the probability estimates from f to the root folder (a dummy folder beyond the root was also considered). More specifically, to derive the weight of each folder of the same pedigree (i.e. from f to the root and the dummy folder), an expectation-maximization (EM) procedure was invoked. The weight of the parent of a folder c in the pedigree will be larger if the terms occurring under c often occur under the siblings of c as well; in that case, c is "similar" to its siblings, and hence the parent of c should have more strength in the parameter estimation for c. This method was shown to be promising in smoothing the parameter estimates of a data-sparse child using the data under its parent. A sketch of the NB and HS estimates is given at the end of this subsection.

It should be noted that, since the profiles constructed by the baselines were not comprehensible to the search engine, the baselines needed to transform each profile into a query acceptable to the search engine. This was achieved by selecting and integrating those features having higher positive weights, since these features were more representative and discriminative than the others. As in [30], the features were integrated in a disjunctive manner (i.e. the selected features were integrated using OR) in order to avoid generating overly constrained queries (i.e. integrated using AND). Moreover, since there is no standard way to determine how many features (terms) to select, to facilitate objective comparison with IMind, we controlled the number of terms in the queries from the baselines. That is, like IMind, each baseline had two versions as well. For example, RO-10 and RO-20 constructed queries having the same maximum number of terms as those constructed by IMind-10 and IMind-20, respectively. Thus, for example, for a level-5 folder, the query constructed by RO-10 may have at most 50 (= 5 × 10) terms. Also note that, instead of incrementally maintaining an evolving feature set to express each folder's profile (as in IMind), all the baselines needed to preset a fixed feature set on which each folder's profile was represented. The selection of the features was based on the strength of each feature, which was estimated by the χ² (chi-square) weighting technique. This technique has been shown to be more promising than several others [34]. As noted in previous studies (e.g. [26,34]), the size of the feature set was an experimental issue (there is no standard way to set a perfect size). Therefore, we tested several different sizes and report the results for sizes of 10,000, 20,000, and 40,000, respectively. Such feature-set sizes provided enough candidate features to be selected into the query for each folder (recall that there were 174 leaf folders in the larger hierarchy and 119 leaf folders in the smaller hierarchy).
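For completeness, here is a hedged sketch of the NB weight (Laplace smoothing) and the hierarchical-shrinkage estimate, reusing the Folder class from Section 3. The pedigree weights are taken as given, whereas the baseline derives them with EM, so this is illustrative only:

```python
def nb_weight(w, folder, vocab_size):
    """Laplace-smoothed conditional probability P(w|f)."""
    return (folder.tf[w] + 1) / (folder.size + vocab_size)

def shrinkage_weight(w, folder, lambdas, vocab_size):
    """Weighted sum of smoothed estimates along the pedigree of folder
    (folder, parent, ..., root), plus a uniform 'dummy' component.
    lambdas must hold one weight per pedigree member plus the dummy;
    in HS these weights would come from the EM procedure described above.
    """
    estimates = []
    node = folder
    while node is not None:
        estimates.append(nb_weight(w, node, vocab_size))
        node = node.parent
    estimates.append(1.0 / vocab_size)      # dummy folder beyond the root
    return sum(l * e for l, e in zip(lambdas, estimates))
```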
[Fig. 3. Average sites found per folder (the larger hierarchy): IMind, NOF, RO, NB, and HS (-10 and -20 variants) over feature-set sizes 10,000, 20,000, and 40,000.]

[Fig. 4. Average sites found per folder (the smaller hierarchy): same systems and feature-set sizes.]
4.4. Result and analysis

To illustrate the contributions and behaviors of IMind, the experimental results are analyzed from several perspectives.

4.4.1. Basic results and analysis

Figs. 3 and 4 show the results in terms of the first criterion (i.e. the average number of sites found per goal folder) on the larger and the smaller folder hierarchies, respectively. Since IMind is an incremental mining technique that maintains an evolving feature set, no feature set was predefined; therefore, the performance of IMind did not vary when the baselines employed different feature sets. The results showed that IMind significantly outperformed all the baselines under all conditions (i.e. feature sets of different sizes, queries with different maximum numbers of terms, and hierarchies of different sizes). When comparing the average performance of IMind with that of the baselines, IMind
contributed 200.78% improvement on the larger hierarchy and 496.99% improvement on the smaller hierarchy. A detailed analysis showed that IMind contributed the improvements by making the interest specifications more context-indicative. To illustrate the contribution, Fig. 5 shows the interest specification derived by IMind for a folder in the smaller hierarchy: Root → Computers & Internet → Hardware → Desktop Computers. The specification was context-indicative: higher-level folders tended to get more general terms, serving as the contextual requirements of their descendant folders, and each folder's context was thus expressed in conjunctive normal form. On the other hand, the specification constructed by RO-10 (the best baseline, with a feature set of size 20,000) was ⟨motherboard OR chip OR modem OR MHz OR SDRAM OR linux OR desktop OR gigabyte OR EDO OR window OR hardware OR soundblaster OR laser OR AMD OR game OR hardware OR XP OR …⟩. These terms could only be representative and discriminative for the folder; they could not form a context-indicative specification, misleading Yahoo into returning lots of sites that mentioned some of the terms from other perspectives (e.g. operating systems and B2B electronics). Actually, to derive a context-indicative specification, the system needs to (1) select candidate context-indicative terms, which are not necessarily representative and discriminative for the folder itself (e.g. information, server, web, CPU, and memory), and then (2) identify the generality of each candidate term. IMind achieved the two tasks
more successfully, making the specification more context-indicative, and hence more precise and comprehensible. The performance differences among the baselines deserve discussion. NOF and RO outperformed NB and HS, while NOF and RO had very similar performances. This indicates that (1) considering idf (i.e. inverse document frequency) in the profiles (as in NOF and RO) could be helpful, (2) using negative examples (as in RO) was not an important factor, and (3) employing a text hierarchy to refine parameter estimates (as in HS) did not help either. By mining the context of each individual folder, IMind could significantly break through the performance limit of all the baselines. Figs. 6 and 7 show the results in terms of the second criterion (i.e. the percentage of goal folders with sites successfully found) on the larger and the smaller folder hierarchies, respectively. IMind outperformed all the baselines in most cases. When comparing the average performance of IMind with that of the baselines, IMind provided 29% improvement on the larger hierarchy and 99% improvement on the smaller hierarchy. Interest mining by IMind was thus shown to be more reliable across different interest types. As with the first criterion, NOF and RO generally outperformed NB and HS, reconfirming the contribution of idf weighting. The performances of the baselines under different feature-set sizes deserve discussion as well. An experiment (not shown in the figures) indicated that, even when the size was smaller than 10,000 or larger than 40,000, IMind significantly
[Fig. 5. An example specification mined by IMind for Root → Computers & Internet → Hardware → Desktop Computers: ⟨file OR information OR window OR system OR site OR server OR web …⟩ AND ⟨CPU OR bit OR instruction OR register OR processor OR chip OR memory …⟩ AND ⟨computer OR card OR machine OR PC OR sound OR printer …⟩.]
[Fig. 6. Percentage of folders with sites retrieved (the larger hierarchy): IMind, NOF, RO, NB, and HS (-10 and -20 variants) over feature-set sizes 10,000, 20,000, and 40,000.]

[Fig. 7. Percentage of folders with sites retrieved (the smaller hierarchy): same systems and feature-set sizes.]
outperformed all the baselines as well. Moreover, there was no perfect way to determine the best feature set for the baselines: under different situations (i.e. evaluation criteria and hierarchies), the baselines would call for different feature-set sizes in order to achieve better performances. This highlights two further weaknesses of the baselines: (1) the feature set needs to be tuned in a
trial-and-error manner, which is too costly and even implausible in practice, since the performance of a personalized web scanner is much more difficult to measure quantitatively, and (2) even when a baseline is tuned best under a certain situation, it does not necessarily perform best under another situation. IMind does not have these weaknesses: its incremental mining capability avoids presetting a feature set, relieving IMind from trial-and-error tuning of the feature set for the evolving folder hierarchy and vocabulary. Through context mining and specification, IMind may guide the search engine to find more sites for a higher percentage of folders in the hierarchy.

4.4.2. Effects of different settings

It is interesting to further investigate the effects of different settings of (1) the parameter α (i.e. the number of profile terms selected for each folder), and (2) the hierarchy on which the systems operated. For each folder f, α governs how constrained the interest specification of f is: a smaller (larger) α introduces fewer (more) terms into each disjunction, making the specification more constrained (relaxed). Experimental results showed that all systems' performances dropped significantly when α switched from 10 to 20, indicating that the derivation of interest specifications was sensitive to noisy terms, which could mislead the relevance judgment of the search engine. Therefore, the setting of α heavily depends on the vocabulary and content diversity of each individual folder (rather than on other factors, such as the height of the hierarchy). Moreover, a folder's vocabulary and content diversity may evolve. An intelligent and evolvable way to set α is thus an interesting and challenging extension for all systems. As to the effect of different hierarchies, we observed that the improvements provided by IMind were more significant when all systems operated on the smaller hierarchy. This was because, on the smaller hierarchy, IMind improved much more significantly than the baselines. For example, on the first criterion (i.e. average sites found per folder), IMind-10 improved 86% (from 5.23 to 9.71), while the best baseline RO-10 only
improved 36% (from 2.12 to 2.88). Moreover, on the second criterion (i.e. percentage of folders with sites found), IMind-10 improved 20% (from 56% to 67%), while the best baseline RO-10 did not improve (from 54% to 53%). We were thus interested in why IMind could perform much better on the smaller hierarchy. As noted in Section 4.1, the smaller and the larger hierarchies shared the same structure and data, except that (1) 22 leaf folders in the smaller hierarchy were ancestor (non-leaf) folders in the larger hierarchy, and (2) due to these 22 folders, the smaller hierarchy contained more training documents. Actually, performance comparisons on the 22 folders (in the smaller hierarchy) and their descendants (in the larger hierarchy) could not be conclusive, since they were semantically quite different (the ancestor folders were more general than their descendants, and hence covered many more subcategories that might or might not be included in the hierarchies). On the other hand, the amount of training data could affect the convergence of profile mining, and hence significantly affect the performance of IMind. This is justified by the result that, among the 97 leaf folders included in both hierarchies, IMind performed better for 35 folders when operating on the smaller hierarchy, versus 9 folders on the larger hierarchy. Since the baselines did not improve as much when switching from the larger hierarchy to the smaller one, IMind was more competent in improving itself as more training data was given.

4.4.3. Cross evaluation under different search engines

It is also interesting to evaluate the possible contributions of IMind under different search engines. We thus tried sending the queries of IMind and the baselines to three search engines: Google (http://www.google.com), Lycos (http://www.lycos.com), and the Open Directory Project (ODP, http://www.dmoz.org). Like Yahoo, they organize information into directories and may show the categories of the web sites found (other search engines, such as AltaVista, http://www.altavista.com, and Netscape, http://www.netscape.com, do not show the categories of the web sites found).
Unfortunately, the search engines did not accept most of the queries generated by IMind and the baselines, since they limited the number of terms in a query (Google even allowed at most 10 terms). One way to tackle the problem is to (1) cut each query into several "small" disjunctions (e.g. each containing at most 10 terms), (2) send the disjunctions to the search engines separately, and then (3) integrate and rank the results for the disjunctions. However, this approach cannot objectively measure the contributions of IMind either, since the strategy for integrating and ranking the results deserves proper design (designing an effective strategy is an interesting extension, ref. Section 5). Moreover, this approach may be plausible for IMind but not necessarily for the baselines, since the baselines' queries may still be very vague, making the search space too large to process. This is justified by the experiment on Yahoo, which did not impose strict limits on the number of terms in a query, and hence accepted all queries from IMind and the baselines. The results showed that Yahoo responded to all queries from IMind, but did not return web sites for some of the queries from the baselines, even when each of these queries was tried 10 times and, in each trial, the maximum waiting time was set to 10 min. When the feature-set size was set to 40,000 in the smaller hierarchy, Yahoo did not respond to 2, 3, 19, and 78 queries generated by RO-20, NOF-20, NB-20, and HS-20, respectively. Similar results were obtained by repeating the experiments several times on different days. Therefore, in the lower panels of Figs. 4 and 7, only those queries to which Yahoo responded were counted and averaged. As the terms in the queries were more common (e.g. "list", "run", and "product"), the
problem became more serious. As the queries were longer (i.e. RO-20, NOF-20, NB-20, and HS-20) and the feature set was larger (i.e. 40,000), more such words could be included in the queries, incurring lots of matches (and hence a heavier processing cost). Due to the absence of idf weighting, the problem was more serious for NB and HS. Therefore, without a suitable specification mechanism, a query could only be in disjunctive form, which proved too vague for the search engines to find web sites with an acceptable amount of processing effort. IMind employs context specifications to remove a huge number of sites that do not deserve consideration. Moreover, as analyzed in Section 4.4.1, the context specifications contributed significant performance improvements on those queries that could be successfully processed by Yahoo, which promised to rank the web sites based on content relevance. Content relevance is the fundamental basis of the ranking strategies of many other search engines as well.

4.4.4. Efficiency of incremental mining

As noted in Section 3.3.2, we were also concerned with the efficiency of IMind. Fig. 8 shows the time spent on incremental interest mining for individual documents, which were sequentially added into the larger hierarchy (recall that the larger hierarchy had 174 leaf folders containing 2844 documents). The system operated on a personal computer (PC) with a 2.6 GHz CPU and 2 GB of RAM. On average, incremental interest mining triggered by each document required 3.43 s. Another experiment showed that, if the leaf folders were added one by one (recall that IMind may accept a set of
[Fig. 8. Time spent (s) for individual documents sequentially added into the larger hierarchy; x-axis: document ID (0–2500), y-axis: time spent (0–20 s).]
documents as input, ref. Input III in Table 1), the average time spent for each folder was 4.38 s (on average, a leaf folder had 16.3 documents). It is also interesting to note that documents (or folders) added later did not necessarily require more time, even though the total number of distinct profile terms in the hierarchy (i.e. N) increased monotonically (finally, N = 82,620). Observations also showed that the numbers of profile terms in the related folders tended to converge to values much smaller than N. The time complexity analysis in Section 3.3.2 was thus confirmed: the time spent by IMind mainly depends on the number of profile terms in the related folders, which do not necessarily have many profile terms. Together with the space complexity analysis (ref. Section 3.3.1), IMind is thus shown to be efficient enough to adapt to the evolving interest of the user, without requiring a memory space too large for modern computers.
5. Conclusion and extension

Personalized web scanning is triggered by the interests of individual users (e.g. businesses or people). The interest is relatively long-term and evolving. To satisfy such interest, a web scanner needs to consume many resources to continuously gather and monitor information on the huge and ever-changing web. Precise specifications of the interest are thus essential for the scanner to achieve cost-effective scanning. However, the user often has difficulties in expressing such specifications: the interest may be implicitly and collectively defined by the evolving folders that contain the documents of the user's interest (e.g. documents related to the user's job descriptions, skills, and preferences). Moreover, the specifications should also be comprehensible to both the user and the related information providers (e.g. Internet search engines). A specification comprehensible to the user may facilitate manual refinement of the specification, while a specification comprehensible to the information providers may facilitate comprehensive identification of proper seeds to start scanning. Comprehensive identification of proper seeds is essential for web scanning but difficult for the user, since the web is both huge and ever-changing.

IMind is an efficient and adaptive text mining technique that tackles these challenges. It guides web scanning by incremental interest mining. The specification mined for each interest type is represented in conjunctive normal form, which is comprehensible to both the user and many information providers. Empirical analysis also shows that the specifications are more precise, since they express the context of each interest type. Delivering such specifications to a scanner may keep the scanner precisely aware of the user's most recent interest, directing each resource and effort of the scanner to more focused and suitable targets, while freeing the user from the difficult (even implausible) tasks of interest specification and seed identification. Aiming to serve as a personalized assistant for individual users, the framework is being extended in several ways:
(1) Automatic determination of the number of profile terms in an interest specification: although IMind has removed other parameters (e.g. the size of the feature set required by many other interest miners), one parameter remains: the number of profile terms selected for each folder (i.e. a in Table 1), which is required by other interest miners as well. As noted in Section 4.4.2, automatic setting of this parameter is both interesting and challenging. The possibility of guiding the user to refine the parameter manually deserves exploration as well.

(2) Manual refinement of the interest specifications: as noted above, the specification produced by IMind is also comprehensible to the user. An interface may thus be developed for the user to comprehend and refine each specification, making the specification reflect the user's intentions more precisely. It may also alleviate the problem incurred by unsuitable profile terms (i.e. a in the above discussion). The interface should provide a visualized environment that supports online drill-down exploration of how the specification is produced.
(3) Post-processing of the results returned by scanners and search engines: to produce a specification comprehensible to the search engine, IMind removes the weight of each term (i.e. the r_{w,f}·d_{w,f} value). The weight is nevertheless helpful in post-processing the results returned by scanners and search engines. For example, each web page found by the scanners and search engines may be ranked according to its weight, which may be measured by summing the weights of the profile terms occurring in the page (a sketch of this ranking appears below). The ranking helps to assign priorities to the web pages for the next round of scanning, directing the limited resources to more promising spaces.
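As an illustration of the post-processing idea in (3), the following Python sketch (hypothetical; the profile format and the tokenizer are assumptions, and the weights merely stand in for the r_{w,f}·d_{w,f} values that IMind computes internally) ranks fetched pages by the summed weights of the profile terms they contain.

import re

def score_page(text, profile):
    # Sum the weights of the profile terms occurring in the page;
    # `profile` maps each profile term to its weight.
    tokens = set(re.findall(r'[a-z]+', text.lower()))
    return sum(w for term, w in profile.items() if term in tokens)

def rank_pages(pages, profile):
    # Order (url, text) pairs by descending score, assigning priorities
    # for the next round of scanning.
    return sorted(pages, key=lambda p: score_page(p[1], profile), reverse=True)

# Hypothetical profile of a 'baseball' folder (weights are made up).
profile = {'pitcher': 2.4, 'inning': 1.7, 'strike': 1.1}
pages = [('http://a.example', 'A pitcher threw a strike in the ninth inning.'),
         ('http://b.example', 'Quarterly earnings rose by three percent.')]
for url, _ in rank_pages(pages, profile):
    print(url)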
Acknowledgements

This research was supported by the National Science Council of the Republic of China under grant NSC 92-2213-E-216-018.