Incremental mining of information interest for personalized web scanning

ARTICLE IN PRESS Information Systems 30 (2005) 630–648 www.elsevier.com/locate/infosys Incremental mining of information interest for personalized w...

Download PDF

345KB Sizes 1 Downloads 32 Views

Report

PDF Reader
Full Text

ARTICLE IN PRESS

Information Systems 30 (2005) 630–648 www.elsevier.com/locate/infosys

Incremental mining of information interest for personalized web scanning$ Rey-Long Liu, Wan-Jung Lin Department of Information Management, Chung Hua University, No. 707, Sec. 2, Wufu Road, HsinChu, Taiwan 300, Republic of China Received 7 February 2003; received in revised form 13 May 2004; accepted 5 July 2004

Abstract Businesses and people often organize their information of interest (IOI) into a hierarchy of folders (or categories). The personalized folder hierarchy provides a natural way for each of the users to manage and utilize his/her IOI (a folder corresponds to an interest type). Since the interest is relatively long-term, continuous web scanning is essential. It should be directed by precise and comprehensible speciﬁcations of the interest. A precise speciﬁcation may direct the scanner to those spaces that deserve scanning, while a speciﬁcation comprehensible to the user may facilitate manual reﬁnement, and a speciﬁcation comprehensible to information providers (e.g. Internet search engines) may facilitate the identiﬁcation of proper seed sites to start scanning. However, expressing such speciﬁcations is quite difﬁcult (and even implausible) for the user, since each interest type is often implicitly and collectively deﬁned by the content (i.e. documents) of the corresponding folder, which may even evolve over time. In this paper, we present an incremental text mining technique to efﬁciently identify the user’s current interest by mining the user’s information folders. The speciﬁcation mined for each interest type speciﬁes the context of the interest type in conjunctive normal form, which is comprehensible to general users and information providers. The speciﬁcation is also shown to be more precise in directing the scanner to those sites that are more likely to provide IOI. The user may thus maintain his/her folders and then constantly get IOI, without paying much attention to the difﬁcult tasks of interest speciﬁcation and seed identiﬁcation. r 2004 Elsevier B.V. All rights reserved. Keywords: Context of information interest; Precise interest speciﬁcations; Comprehensible interest speciﬁcations; Incremental text mining; Web scanning

1. Introduction $

Recommended by Ricardo Baeza-Yates.

Corresponding author. Tel: +886-3-5186530; fax: +886-3-

5186546. E-mail address: [email protected] (R.-L. Liu).

Web scanning aims to continuously scan the web for information of interest (IOI) with respect to a set of relatively long-term interests (when compared with one-shot queries). It is motivated

0306-4379/$ - see front matter r 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.is.2004.07.001

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

by the common needs of those users (e.g. businesses and people) who constantly require timely information concerning their jobs and preferences. Given a speciﬁcation of the user’s interest, the scanner needs to identify the seeds from which IOI gathering and monitoring are started and conducted as a routine job. An example application of web scanning is business environmental scanning, which has been identiﬁed as an essential continuous task for many business activities such as decision making [1] and product development [2]. Since the user’s job and preferences are relatively long-term, the user’s interest is relatively long-term as well [3]. The interest may evolve when internal conditions (e.g. business strategies and user preferences) and external conditions (e.g. the environments) change. The information scanned and the information manually collected are often stored into a hierarchy of folders (or categories). The personalized folder hierarchy facilitates the browsing and retrieval of information. Web scanning needs to be guided by precise and comprehensible speciﬁcations of the user’s interest. A precise speciﬁcation may direct the scanner to those spaces that deserve scanning (imprecise speciﬁcation may signiﬁcantly deteriorate the performances of most information gatherers on the web [4]), while a speciﬁcation comprehensible to the user may facilitate manual reﬁnement, and a speciﬁcation comprehensible to information providers (e.g. Internet search engines) may facilitate the identiﬁcation of proper seed sites to start scanning. Moreover, due to the evolution of the user’s interest, the speciﬁcation should evolve as well. Imprecise or obsolete speciﬁcations may consume lots of resources (e.g. efforts of the scanner’s, bandwidths of the network, and services provided by related information servers), and produce lots of garbage information to the user. Unfortunately, the user often has difﬁculties in properly specifying his/her relatively long-term but evolving interest. The interest is usually implicitly deﬁned in the hierarchy of folders. A folder corresponds to an interest type, which is collectively deﬁned by the documents under the folder (i.e. including the folder and descendants of the folder). The user thus requires a mechanism that

631

may identify and represent each interest type and its evolution. With the help provided by the mechanism, the user only needs to maintain his/ her own folders. Proper interest speciﬁcation may be automatically extracted to guide the scanner to ﬁnd IOI using a smaller amount of resources. Fig. 1 illustrates the idea. Each user may set up and maintain a hierarchy of folders. He/she may enrich the folders by adding new IOI that is either automatically scanned or manually collected. A folder thus corresponds to an interest type of the user. A set of folders may be designated (by the user) as an interest set for the scanner to work on. An interest miner is activated once the contents of the designated folders are updated. It identiﬁes the newest speciﬁcation of each interest type (folder). Based on the speciﬁcation, the scanner identiﬁes proper seeds to start continuous gathering and monitoring of IOI from the web. With the recent successful technology in information monitoring [5] and gathering [6–10], the major challenge to realize the scanning ﬂow lies on the interest miner, which we believe is an essential component to achieve satisfactory and cost-effective web scanning. This is the target we pursue in this paper. In the next section, we discuss the weaknesses of previous studies in interest mining. An incremental mining technique is thus developed to derive precise and comprehensible interest speciﬁcations from the user’s personalized folder hierarchy (i.e. the interest miner in Fig. 1). The framework is named IMind (Interest Mining from personalized folders), and is presented in Section 3. To investigate the contributions of IMind, an experiment on a real-world environment was conducted (ref. Section 4). Theoretical analysis and empirical results show that IMind may efﬁciently keep the scanner precisely aware of the most recent interest of the user, and hence guide the scanner to ﬁnd more targets of interest, while separating the user from the difﬁcult (even implausible) tasks of interest speciﬁcation and seed identiﬁcation.

2. Related work Web scanning consists of two main tasks: information gathering (discovery through crawling)

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

632

Personalized Folder

Interest Miner Spec. for user’s interest { AND …}

Interest designation New Info

User

Info Scanned

Seed Finding

The Web

Gathering & Monitoring

Scanner New Info Info Scanned

Fig. 1. Mining personalized information folders for web scanning.

and information monitoring [3]. They work together to ﬁnd complete and up-to-date IOI for the user. Previous information gathering studies mainly focused on the satisfaction of users’ ‘‘oneshot’’ needs of IOI (e.g. [6,8,10]). They also considered various factors, such as time/cost/ quality tradeoffs [7], alleviation of information overloading [9], and mining linkage structure [11] and users’ browsing behaviors [12] to improve searching. As to information monitoring, previous scanning systems often employed periodical monitoring to actively check for information updates in a periodical way [13,14]. To promote the costeffectiveness of monitoring, adaptive monitoring was developed to monitor different targets in different frequencies [5]. The integrative treatment of continuous information gathering and monitoring under limited resource consumption was explored as well [3]. A web scanner may provide IOI to the user only if the user’s interest is precisely speciﬁed. Imprecise speciﬁcations could signiﬁcantly deteriorate the performances of state-of-the-art information gathering algorithms [4]. Interaction is a common way to disambiguate the user’s interest [15,16]. Human–system interactions often helped to reﬁne the queries for the next round of information retrieval [17–19]. To conduct such interaction, users need to read and process the retrieved information and

then provide relevance feedback to the system. However, users are often unable (or unwilling) to read much information [20] and interact with the system many times [21]. This problem will become more serious in web scanning, which is a continuous task. The scanner never stops consuming lots of resources to ﬁnd candidate IOI to be veriﬁed. The user thus may be notiﬁed at any time for such veriﬁcation. Moreover, it will be quite difﬁcult for the user to set up queries to start scanning, since the users’ interest is often implicitly and collectively deﬁned by the contents of the information folders, which may evolve over time. One way to alleviate the problem is to automatically set up the interest speciﬁcation, and update the interest speciﬁcation once the contents of the folders evolve. Mining the folders for the interest speciﬁcation is thus a promising approach. Instead of grouping information to automatically create folders for archiving users’ interests [22,23], the mining task aims to build a profile for each existing folder from the documents under the folder. The proﬁle needs to be a good basis to derive a precise speciﬁcation for the corresponding interest type. It should be composed of those features (terms or phrases) that are carefully selected. Such a feature selection task was routinely conducted for various purposes, including query reﬁnement (e.g. [17–19,24]), document summarization

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

(e.g. [25]), document classiﬁcation (e.g. [25–27]), document ﬁltering (e.g. [28]), and retrieval of similar documents (e.g. the well-known give-memore retrieval). It also helped to build a classiﬁer to judge the relevance of a document (e.g. a document was said to be more relevant if it had higher degrees of match with those categories that were of the user’s interest [23,29,30]). However, these previous techniques did not extract the context of each interest type, making the interest speciﬁcations vague. As an example, consider a folder hierarchy containing three branches of folders: (1) root-system development-decision support systems, (2) root-manufacturing-decision support systems, and (3) root-ﬁnancial Management-Decision Support Systems. Although there are three folders (interest types) named ‘‘decision support systems’’, they have different contexts of discussion (i.e. strategy of implementation, usage in manufacturing, and usage in ﬁnancial services, respectively). Obviously, with proper attention to the contexts, web scanning may be more focused, cost-effective, and satisfactory. Perhaps the most straightforward way to consider interest contexts is to derive a query by simply concatenating the folder names of the same branch [4]. This approach deserves signiﬁcant improvement, since the name of a folder cannot precisely and completely deﬁne the folder. As noted above, each folder should be implicitly and collectively deﬁned by its content, which may evolve over time. Incremental mining of the context of each interest type is thus essential. Moreover, the language to represent the interest speciﬁcations is essential as well. It should be comprehensible so that the user may manually reﬁne the interest speciﬁcations, and the system may comprehensively identify proper seeds to start scanning on the ever-changing web. Seed identiﬁcation is the ﬁrst key step for scanning. Previous related studies often retrieved seeds from a given set of web pages [10,29,30]. This method is not suitable for personalized web scanning, since (1) for each interest type of each user, it is unrealistic to predeﬁne a static set of web pages that may comprehensively contain proper seeds from the ever-changing web, (2) the user often cannot comprehensively specify the seeds on the web,

633

and (3) the documents in the user’s folders cannot always contain proper seeds either (the documents may even contain no hyperlinks). To tackle the problem, Internet search engines may be invoked to provide a comprehensive and adaptive coverage of the web. Therefore, the interest speciﬁcations should be comprehensible for both the user and the search engines (e.g. in conjunctive normal form). However, deriving such interest speciﬁcations is challenging. Therefore, a straightforward way was often employed: selecting several good terms and integrate them in disjunction form [30]. Unfortunately, this way cannot precisely express the context of each interest type. For example, consider the above interest type: Root-System Development (SD)-Decision Support Systems (DSS). If the interest type is speciﬁed in disjunction form, the terms concerning SD and/or the terms concerning DSS will be mixed together, incurring much irrelevant information, such as the information about SD but not DSS, and the information about DSS but not SD. Therefore, to achieve cost-effective web scanning, mining a precise and comprehensible speciﬁcation for each interest type (folder) is both essential and helpful. The mining process should be incremental in order to adapt the interest speciﬁcations to the evolving folders. It should also precisely recognize the context of each interest type, and represent the context in a form that may facilitate comprehension and seed identiﬁcation. Such a mining process may promote user satisfaction and cost-effectiveness of continuous information gathering and monitoring on the web, while separate the user from the tedious (and even implausible) task of interest speciﬁcation and seed identiﬁcation.

3. The interest miner Table 1 presents the algorithm of the incremental interest miner IMind. IMind operates on a given hierarchy T of information folders (ref. Input I). A folder corresponds to an interest type of the user. Each folder has a proﬁle, which serves as the basis to derive an interest speciﬁcation for the corresponding interest type. The user may

ARTICLE IN PRESS 634

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

Table 1 Incremental interest mining Input: (I) A hierarchy T of folders, (II) A set of folders G designated as the goals of web scanning, and (III) A set X of documents added to a folder f. Effect: (I) Update the proﬁle of each related folder of f in T, and (II) For each folder g in G, if the interest speciﬁcation of g has changed, send the new speciﬁcation to the scanner. Begin (1) W’w|w is a word occurring in documents in X, and w is not a stop word; (2) While (f is not the root of T) do (2.1) For each word w in W, do (2.1.1) If w is not a proﬁle term for f, add ow, rw,f, dw,f4 to the proﬁle of f (rw,f and dw,f are initially unknown); (2.2) For each 3-tuple ow, rw,f, dw,f4 in the proﬁle of f, do (2.2.1) Update rw,f and dw,f; (2.2.2) For each sibling b of f, update dw,b; (2.3) f’parent of f; (3) For each goal folder g in G, do (3.1) Ig’Disjunction of the proﬁle terms having higher rw,g dw,g values (a number a of proﬁle terms in g are selected); (3.2) a’parent of g; (3.3) While (a is not the root of T) do (3.3.1) Ig’Conjunction of Ig and disjunction of the terms having higher rw,a dw,a values (a number a of proﬁle terms in both a and g are selected); (3.3.2) a’parent of a; (3.4) If Ig6¼speciﬁcation of g, send Ig to the scanner to update the speciﬁcation of g; End.

designate a set G of interest types as the goals of web scanning (ref. Input II). Once a set X of documents is added to a folder f (ref. Input III), IMind conducts two tasks: (1) updating the proﬁles of related folders (ref. Effect I, and Steps 1 and 2), and (2) notifying the scanner if the interest speciﬁcation of any interest type in G has changed (ref. Effect II, and Step 3). 3.1. Updating the folder profiles The proﬁle of a folder f is initially empty and may evolve over time. It is a set of 3-tuples /w, rw,f, dw,fS, where w is a word (or feature), rw,f is the degree to which w may represent the content of the documents under f, and dw,f is the capability of w in discriminating f from siblings of f. A good proﬁle term should be the one that is both representative (having a higher r-value) and discriminative (having a higher d-value). Since dw,f is estimated with respect to siblings of f (rather than all folders as in most previous approaches), general (speciﬁc) terms tend to be good proﬁle terms for higher- (lower-) level

folders. For example, suppose in the user’s hierarchy of folders, system development (SD) has two subfolders: decision support systems (DSS) and accounting information systems (AIS). The terms like ‘‘information’’ and ‘‘maintenance’’ may have a lower d-value in both DSS and AIS (since the documents in both subfolders are about the implementation and maintenance of information systems), but a higher d-value in SD, if sibling folders of SD are not about system development (recall that in the example raised in Section 2, SD has two siblings: manufacturing and ﬁnancial management). Therefore, ‘‘information’’ and ‘‘maintenance’’ may be good in discriminating SD from its siblings, which have the same generality level. Similarly, the terms like ‘‘inventory’’ and ‘‘sales’’ may get a higher d-value in AIS, but not in SD or DSS. That is, based on the personalized hierarchy, generality of each term is implicitly deﬁned. IMind makes the generality explicit by mining. Technically, the degree of representation of a word w under f (i.e. rw,f) is estimated by P(w|f), which is equal to TF(w,f)/Size(f), where TF(w,f) is

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

the times of occurrences of w in documents under f, and Size(f) is the total number of terms in the documents under f (from the viewpoint of data mining, P(w|f) may be viewed as the support of w under f). On the other hand, the capability of w in discriminating f from its siblings (i.e. dw,f) is estimated by P(w|f)(Bf/SiP(w|fi)), where Bf is the number of siblings of f plus one (i.e. including f). The summation of P(w|fi) is conducted over f and its siblings. Thus, for example, w may get a higher d-value if it occurs frequently under f (i.e. P(w|f) is high), but infrequently (on average) under siblings of f (i.e. Bf/SiP(w|fi) is high). In that case, w may be good in discriminating f from siblings of f. Also note that, 0pdw,f pBf, for each proﬁle term w in f. When w only occurs in f, dw,f=Bf, while dw,b=0 for each sibling b of f. If P(w|f) is higher than SiP(w|fi)/Bf (i.e. the average P(w|fi)), dw,f41; otherwise dw,f p1. Note that the set X of documents added to a folder f may only trigger proﬁle updates for a limited number of related folders. Fig. 2 illustrates the idea. For r-values, since the addition of X only changes the sizes of f and ancestors of f, only the rvalues of the proﬁle terms of f (ref. Steps 2.2.1) and its ancestors (ref. Step 2.3) are updated. As to dvalues, the updates are conducted only on f, siblings of f, ancestors of f, and siblings of the ancestors (ref. Steps 2.2.1, 2.2.2, and 2.3). This is because, when X is added to f, P(w|f) of each proﬁle term w in f needs to be updated. This

update triggers the update of the second component of dw,f (i.e. Bf/SiP(w|fi)). Since each sibling b of f employs the same second component to compute dw,b, dw,b needs to be updated as well. Similar updates should be conducted when the process proceeds to the ancestors of f. Also note that, to update r-values and d-values, IMind only needs to keep track of the related data for computing P(w|f). No previous training documents need to be reprocessed. No feature set needs to be predeﬁned either. The features (vocabulary) for expressing the proﬁles may evolve over time without conducting time-consuming trail-and-error feature set tuning employed in many previous related techniques. 3.2. Deriving the specification of each goal folder Proﬁle mining provides a fundamental basis for deriving the speciﬁcation of each folder (interest type). As noted above, it identiﬁes good proﬁle terms (i.e. both representative and discriminative) for each folder. General (speciﬁc) terms tend to have higher d-values in general (speciﬁc) folders. Therefore, good proﬁle terms for ancestor folders of f may indicate the context of f. For example, consider the above-mentioned folder hierarchy in which system development (SD) has two subfolders: decision support systems (DSS) and accounting information systems (AIS). Through proﬁle mining, ‘‘information’’ and ‘‘maintenance’’

Both r-values and

...

d-values of the profile terms are updated

f

... Only d-values of the

635

... ...

X: the set of documents added to f

terms are updated Fig. 2. Updating the proﬁles of related folders once a set of documents is added.

ARTICLE IN PRESS 636

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

may be found to be a good proﬁle term for SD, while ‘‘sales’’ and ‘‘inventory’’ may be good proﬁle terms for AIS. Thus a good speciﬁcation for the AIS folder should include not only ‘‘sales’’ and ‘‘inventory,’’ but also ‘‘information’’ and ‘‘maintenance,’’ which may indicate the context of the folder. Therefore, the speciﬁcation for a folder f should be composed of good proﬁle terms of f and each ancestor folder of f. The good proﬁle terms from the folders of the same pedigree should be integrated in conjunction form so that the contextual requirement at each level of generality may be enforced. On the other hand, the good proﬁle terms from each folder should be integrated in disjunction form so that the coverage of the terms may represent the main content of the folder. That is, the speciﬁcation for a folder should be in conjunctive normal form. For the above example, the speciﬁcation for AIS may be like {y /information OR maintenanceS AND /sales OR inventoryS}. Therefore, for each goal folder f, IMind derives a speciﬁcation in conjunctive normal form, and then sends the speciﬁcation to the scanner immediately (ref. Step 3). The speciﬁcation is derived by checking the proﬁles of f and ancestors of f. Good proﬁle terms are extracted by selecting those terms that have higher r-values and d-values (ref. Steps 3.1 and 3.3.1). The number of terms to be selected from a proﬁle is governed by a system parameter a. The terms selected are then integrated in disjunction form. The ﬁnal interest speciﬁcation is simply the conjunction of the disjunctions for f and its ancestors (ref. Step 3.3.1). Note that, when selecting terms to construct the disjunction for an ancestor of f, only the terms that occur in the proﬁle of f may be the candidates. This method guarantees that each term in the speciﬁcation of f really occurs in documents of f.

3.3. Space and time complexities of interest mining To further illustrate the behavior of IMind, space and time complexities deserve analysis. More empirical results will be reported in the experiment (ref. Section 4.4).

3.3.1. Space complexity IMind needs to store the r- and d-values of each folder’s proﬁle terms, and hence the maximum memory required is O(N t), where N is the total number of distinct proﬁle terms currently accumulated in the hierarchy, and t is the number of folders in the hierarchy. Since it is often the case that N5105 and t5103 in a personalized hierarchy, the memory requirement may be supported by modern computers. Actually, IMind seldom requires so much memory, since the numbers of proﬁle terms in most folders are much smaller than N. This is because each folder mainly contains those documents that talk about the folder’s topic using many topic-speciﬁc terms, which do not appear in all siblings of the folder, leading to the common case that children folders have much fewer proﬁle terms than their parent (recall that a parent’s set of proﬁle terms is actually the union of its children’s sets of proﬁle terms). Therefore, the larger a folder’s level is, the fewer proﬁle terms the folder would have. Even those folders whose levels are 1 (i.e. level-1 folders) could only have a subset of the N proﬁle terms (i.e. only the union of the sets of proﬁle terms of all level-1 folders may have N terms). For example, an empirical observation on a text hierarchy of Yahoo! (http://www.yahoo.com, ref Section 4) showed that there were 82 620 distinct terms in the hierarchy (i.e. N=82620), but there were only about 60000 proﬁle terms in each level-1 folder. 3.3.2. Time complexity Time complexity of interest mining is determined by the two main steps: proﬁle mining (i.e. Step 2 in Table 1) and speciﬁcation derivation (i.e. Step 3 in Table 1). Time complexity of proﬁle mining may be measured by estimating the maximum number of proﬁle terms that are updated when a document set X is added into a folder f. As noted in Section 3.1 and Fig. 2, the addition of X only triggers the proﬁle updates on f, siblings of f, ancestors of f, and siblings of the ancestors. Therefore, the maximum number of P updates is B N, where Bi is the number of i i siblings of the level-i ancestor of f (i.e. the ancestor whose level is i) plus one (i.e. including the level-i

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

ancestor). Note that 1piph, where h is the level of f (i.e. Bh is simply the number of siblings of f plus one). On the other hand, time complexity of speciﬁcation derivation should be determined by the maximum number of operations required to select proﬁle terms to update interest speciﬁcations of goal folders. As noted above, when a document set X is added to a folder f, only a limited number of proﬁles are updated (i.e. the proﬁles of f, siblings of f, ancestors of f, and siblings of the ancestors). Only the speciﬁcations related to the updated proﬁles need to be updated. Therefore, the maximum number of operations P P required to update interest speciﬁcations is i jabi,jN, where a is the parameter governing the number of terms selected from each proﬁle (ref. Steps 3.1 and 3.3.1 in Table 1), and bi,j is the number of descendant goal folders of the jth sibling of the level-i ancestor of f (1piph, 1pjpBi, and when j=Bi the jth sibling is simply the level-i ancestor of f). Actually, both proﬁle mining and interest speciﬁcation seldom require so much time, since the numbers of proﬁle terms in most folders are much smaller than N (due to the same reason discussed above, ref. space complexity analysis in Section 3.3.1).

4. Experiment Experiments on real-world text hierarchies were conducted to evaluate IMind. The evaluation focused on the qualities of the interest speciﬁcations produced by IMind and several previous techniques. The qualities were measured by counting the seeds that could be retrieved by sending the speciﬁcations to Yahoo (http:// www.yahoo.com). An interest speciﬁcation could be said to be more precise if it guides Yahoo to ﬁnd more seeds that are really of the corresponding interest type. That is, we employed a popular search engine (i.e. Yahoo) to objectively and quantitatively measure the qualities of interest speciﬁcations. To further evaluate IMind from other perspectives, empirical results on its performances under different search engines and the time

637

spent for incremental interest mining are also reported.

4.1. The experimental data Experimental data was extracted from the text hierarchy of Yahoo. To make the data more comprehensive, a hierarchy H was constructed by considering the categories under three top-level categories: ‘‘Science’’, ‘‘Computers and Internet’’, and ‘‘Society and Culture’’. A category in H corresponds to a folder, which could contain several documents. A document corresponded to a web site. It consisted of web page(s) manually extracted to represent the main content of the corresponding web site (a web site often contained many pages not conveying its main content). To investigate the contributions of IMind under different hierarchies, two hierarchies were constructed from H: the larger hierarchy and the smaller hierarchy. The larger hierarchy inherited both the structure and the data from H, except that only leaf folders contained documents. On the other hand, the smaller hierarchy shared the same structure and data with the larger hierarchy, except that some leaf folders in the smaller hierarchy were non-leaf folders in the larger hierarchy. Note that, the total number of documents in the leaf folders was larger than the total number of documents in their corresponding descendants in the larger hierarchy, since the leaf folders could contain the documents from all their corresponding descendants in H, including both leaf and non-leaf descendants. Therefore, the smaller hierarchy had fewer leaf folders, but more training documents. More speciﬁcally, the larger hierarchy had 261 folders among which 174 were leaf folders, which totally contained 2844 documents. The smaller hierarchy had 169 folders among which 119 were leaf folders, which totally contained 3615 documents. In the structures of the two hierarchies, 97 leaf folders were in common, while 22 leaf folders in the smaller hierarchy were non-leaf folders in the larger hierarchy. The proﬁle of each folder was built by mining the documents in the hierarchies. The two hierarchies were mined and tested independently.

ARTICLE IN PRESS 638

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

To conduct thorough performance investigation, all leaf folders were designated as the user’s goals of scanning (i.e. the set G in Table 1), except for those folders that were redirected to other folders (i.e. those categories having ‘@’ at the end of their names in Yahoo). Thus each tested folder was unique (there were no duplicate folders in testing). In the larger hierarchy, there were 142 folders designated as goals, while in the smaller hierarchy, there were 109 folders designated as goals. 4.2. The evaluation method and criteria For each folder designated as a goal, a speciﬁcation was constructed by IMind and several previous techniques (to be introduced in Section 4.3), and then sent as a query to Yahoo to retrieve web sites. If Yahoo retrieved web sites for a query, we controlled the number (o200) of topranking web sites retrieved. For each site retrieved, Yahoo showed its brief summary, followed by its category (folder) in the whole hierarchy of Yahoo. Based on such information, we could evaluate the qualities of the speciﬁcations produced by IMind and the previous techniques. Two evaluation criteria were employed to measure the qualities of the speciﬁcations: (1) the average number of web sites correctly retrieved per folder, and (2) the percentage of goal folders for which at least one web site was correctly retrieved. The ﬁrst criterion measured the completeness of the retrieval of the information that was of the goal interest types, while the second criterion measured the reliability of the retrieval across different interest types. For example, if a set of interest speciﬁcations guides the search engine to retrieve a large number of target sites for only a small number of interest types, completeness is good only for the interest types (but poor for others), making the retrieval unreliable. Interest speciﬁcations could be said to be more precise, if they may guide the search engine to (1) retrieve more sites that are of the goal folders, and (2) ﬁnd sites for a higher percentage of the goal folders. The objectivity of the evaluation method deserves justiﬁcation. The quality of a speciﬁcation was objectively evaluated by sending the speciﬁca-

tion to a search engine (i.e. Yahoo), who claims that the sites she retrieved are sorted by relevance using her complicated and proprietary algorithm — no other factors may affect the search results. Also note that we were not training the systems using a document and then asking the search engine to retrieve the same document. There are two facts to justify the claim: (1) only the documents from a subset of the web sites in each category were included in the training data, and (2) even when a web site was included in the training data, the document for the web site was selected by manually visiting and examining the web site, while Yahoo employed totally different information (i.e. not the document manually selected) to judge relevance in retrieving the web sites. It is thus reasonable to expect that, a speciﬁcation that is proved to be better (in terms of the two criteria) in the experiment may perform better in other search engines and scanners as well. 4.3. The systems evaluated Table 2 deﬁnes the systems evaluated in the experiment. IMind has only one parameter (i.e. a, the number of terms selected in each level of proﬁle), which had two settings: 10 (IMind-10) and 20 (IMind-20). In addition to IMind, there were four baselines for performance comparison: norm of the folder (NOF), Rocchio (RO), Naive Bayes (NB), and hierarchical shrinkage (HS). For each leaf folder, the baselines created a proﬁle by their individual weighting methods, although the proﬁles were originally used for other purposes (e.g. classiﬁcation, ﬁltering, and relevance feedback for query reﬁnement). Employing NOF, RO, NB, and HS as baselines deserves justiﬁcation. To our survey, no previous techniques were dedicated to context-based interest mining for personalized web scanning. The baselines thus aimed to represent modern techniques related to the challenge. NOF and RO aimed to represent those techniques that built proﬁles by analyzing the document vectors in corresponding folders, NB aimed to represent those techniques that built a probability-based proﬁle for each folder, while HS aimed to represent those techniques that employed a text hierarchy to build a

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

639

Table 2 Systems for performance investigation System

Type

Description

Variant

IMind

Experiment system to be evaluated

(1) IMind-10: a=10, and (2) IMind-20: a=20

Norm-of-thefolder (NOF)

Baseline system for those techniques that build a proﬁle for a folder using the documents of the folder only (without considering other folders) Baseline system for those techniques that build a proﬁle for a folder by considering both positive (in-folder) and negative (out-folder) documents Baseline system for those techniques that build a probability-based proﬁle for a folder

The system was implemented by following the descriptions presented in Section 3. The proﬁle of the folder was a vector constructed by averaging the document vectors in the folder.

Rocchio’s method (RO)

Naive Bayes (NB)

Hierarchical Baseline system for those shrinkage (HS) techniques that employ a text hierarchy to build a probability-based proﬁle for a folder

The proﬁle was a vector constructed by computing a weighted sum of the positive document vectors and the negative document vectors. The proﬁle was constructed by estimating the conditional probabilities of the terms in the folder.

The proﬁle was constructed by employing the hierarchical relationships (e.g. sibling) among folders to reﬁne the estimates of the conditional probabilities (in NB).

probability-based proﬁle for each folder. When compared with IMind, the proﬁles constructed by the baselines (and other previous methods as well) had a common property: no contextual information could be conveyed. This property helped to measure the contributions of IMind in mining and representing the context of each interest type. NOF and RO employed the well-known and popular tfidf technique to estimate the weight of each feature. The weight of a feature w in the vector for document p was equal to TF(w,p) log2 (total number of documents/number of documents containing w), where TF(w,p) was the times w occurs in p. The tfidf technique was routinely employed in many applications in which vectorbased representation served as the basis for similarity estimation. For each folder, NOF averaged the vectors of the documents in the folder.

(1) NOF-10: extracting the same maximum number of terms with IMind-10, and (2) NOF-20: extracting the same maximum number of terms with IMind-20

(1) RO-10: extracting the same maximum number of terms with IMind-10, and (2) RO-20: extracting the same maximum number of terms with IMind-20 (1) NB-10: extracting the same maximum number of terms with IMind-10, and (2) NB-20: extracting the same maximum number of terms with IMind-20 (1) HS-10: extracting the same maximum number of terms with IMind-10, and (2) HS-20: extracting the same maximum number of terms with IMind-20

RO is a state-of-the-art technique to create vector-based proﬁles (or queries) from a collection of training data. It has been commonly studied from the perspectives of document classiﬁcation (e.g. [31]), ﬁltering (e.g. [27]), and retrieval (e.g. [21]). As NOF, RO constructed a vector for each folder. The major difference was that, in addition to considering the documents in the folder, RO also considered the documents outside the folder. More P speciﬁcally, the vector P for folder f was equal to Z1 DocAPDoc/|P|Z2 DocANDoc/|N|, where P was the set of vectors for ‘‘positive’’ documents (i.e. the documents in f), while N was the set of vectors for ‘‘negative’’ documents (i.e. the document not in f). In the experiment, Z1 and Z2 were set to 16 and 4 respectively, since such setting was believed to be promising in previous studies (e.g. [31]). NB and HS employed conditional probabilities to estimate the weight of each feature. They were

ARTICLE IN PRESS 640

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

often employed for document classiﬁcation. NB estimated the conditional probability P(w|f) for every feature w and folder f (with standard Laplace smoothing to avoid probabilities of zero). In document classiﬁcation, the ‘‘similarity’’ between a category and an input document p was based on the product of the conditional probabilities of the features (in the category) that occurred in p. NB is a popular technique for classiﬁcation. For more details about NB, the reader is referred to [26,32]. HS was implemented to represent those approaches that employed a text hierarchy to improve non-hierarchical classiﬁcation [26,33]. It was a recent approach shown to be more faulttolerant and promising [26]. The given text hierarchy was employed to reﬁne the parameter estimation for an NB classiﬁer. The basic idea was to, for each leaf folder f, derive a weight for f and each of its ancestors. The reﬁned estimate of the probability of a term w under f (i.e. P(w|f)) was simply a weighted sum of the probability estimates from f to the root folder (a dummy folder beyond the root was also considered). More speciﬁcally, to derive the weight of each folder of the same pedigree (i.e. from f to the root and the dummy folder), an expectation maximization (EM) procedure was invoked. The weight of the parent of a folder c in the pedigree will be larger if the terms occurring under c often occur under siblings of c as well. In that case, c is ‘‘similar’’ to its siblings, and hence the parent of c should have more strength affecting the parameter estimation for c. This method was shown to be promising in smoothing parameter estimates of a data-sparse child using the data under its parent. It should be noted that, since the proﬁles constructed by the baselines could not be comprehensible for the search engine, the baselines needed to transform each proﬁle into a query that was acceptable for the search engine. This was achieved by selecting and integrating those features having higher positive weights, since these features were more representative and discriminative than others. As in [30], the features were integrated in a disjunction manner (i.e. the selected featured were integrated using OR) in order to avoid generating too constrained queries (i.e.

integrated using AND). Moreover, since there is no standard way to determine how many features (terms) to select, to facilitate objective comparison with IMind, we controlled the number of terms in the queries from the baselines. That is, as IMind, each baseline had two versions as well. For example, RO-10 and RO-20 could construct those queries having the same maximum number of terms as those constructed by IMind-10 and IMind-20, respectively. Thus, for example, for a level-5 folder, the query constructed by RO-10 may have at most 50 (=5 10) terms. Also note that, instead of incrementally maintaining an evolving feature set to express each folder’s proﬁle (as in IMind), all the baselines needed to preset a ﬁxed feature set on which each folder’s proﬁle was represented. The selection of the features was based on the strength of each feature, which was estimated by the w2 (chi-square) weighting technique. The technique has been shown to be more promising than several others [34]. As noted in previous studies (e.g. [26,34]), the size of the feature set was an experimental issue (no standard way to set a perfect size). Therefore, we tested several different sizes, and reported the results when the size was 10 000, 20 000, and 40 000, respectively. Such sizes of feature sets provided enough candidate features to be selected into the query for each folder (recall that there were 174 leaf folders in the larger hierarchy and 119 leaf folders in the smaller hierarchy). 4.4. Result and analysis To illustrate the contributions and behaviors of IMind, experimental results are analyzed from several perspectives. 4.4.1. Basic results and analysis Figs. 3 and 4 show the results in terms of the ﬁrst criterion (i.e. average number of sites found per goal folder) on the larger and the smaller folder hierarchy, respectively. Since IMind is an incremental mining technique that maintains an evolving feature set, no feature set was predeﬁned. Therefore, the performance of IMind did not vary when the baselines employed different feature sets. The results showed that IMind signiﬁcantly

ARTICLE IN PRESS

Average sites found per folder

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

6 5 4 3 2 1 0

IMind-10 NOF-10 RO-10 NB-10 HS-10 10000

Average sites found per folder

641

20000 Feature set size

40000

3 2.5

IMind-20 NOF-20 RO-20 NB-20 HS-20

2 1.5 1 0.5 0 10000

20000 Feature set size

40000

Average sites found per folder

Fig. 3. Average sites found per folder (the larger hierarchy).

12 10

IMind-10 NOF-10 RO-10 NB-10 HS-10

8 6 4 2 0

Average sites found per folder

10000

20000 Feature set size

40000

6 5

IMind-20 NOF-20 RO-20 NB-20 HS-20

4 3 2 1 0 10000

20000 Feature set size

40000

Fig. 4. Average sites found per folder (the smaller hierarchy).

outperformed all the baselines under all conditions (i.e. feature sets with different sizes, queries with different maximum numbers of terms, and

hierarchies with different sizes). When comparing the average performance of IMind with the average performance of the baselines, IMind

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

642

contributed 200.78% improvement on the larger hierarchy, and 496.99% improvement on the smaller hierarchy. A detailed analysis showed that, IMind contributed the improvements by making interest speciﬁcations more context-indicative. To illustrate the contribution, Fig. 5 shows the interest speciﬁcation derived by IMind for a folder in the smaller hierarchy: Root-Computer & Internet Hardware - Desktop Computers. The speciﬁcation was context-indicative. Higher-level folders tended to get more general terms to serve as the contextual requirements of their descendant folders. Each folder’s context was thus expressed in conjunctive normal form. On the other hand, the speciﬁcation constructed by RO-10 (the best baseline with a feature set of size 20 000) was: /motherboard OR chip OR modem OR MHz OR SDRAM OR linux OR desktop OR gigabyte OR EDO OR window OR hardware OR soundblaster OR laser OR AMD OR game OR hardware OR XP ORyS. These terms could only be representative and discriminative for the folder. They could not form a context-indicative speciﬁcation for the folder, misleading Yahoo to return lots of sites that mentioned some of the terms from other perspectives (e.g. operating systems and B2B electronics). Actually, to derive a context-indicative speciﬁcation, the system needs to (1) select candidate context-indicative terms, which are not necessarily representative and discriminative for the folder (e.g. information, server, web, CPU, and memory), and then (2) identify the generality of each candidate term. IMind achieved the two tasks

more successfully, making the speciﬁcation more context-indicative, and hence more precise and comprehensible. The performance differences among the baselines deserve discussion. NOF and RO outperformed NB and HS, while NOF and RO had very similar performances. This indicates that (1) considering idf (i.e. inverse document frequency) in the proﬁles (as in NOF and RO) could be helpful, (2) using negative examples (as in RO) was not an important factor, and (3) employing a text hierarchy to reﬁne parameter estimates (as in HS) could not be helpful either. By mining the context of each individual folder, IMind could signiﬁcantly break the performance limit of all the baselines. Figs. 6 and 7 show the results in terms of the second criterion (i.e. percentage of goal folders with sites successfully found) on the larger and the smaller folder hierarchy, respectively. IMind outperformed all the baselines in most cases. When comparing the average performance of IMind and the average performance of the baselines, IMind provided 29% improvement on the larger hierarchy, and 99% improvement on the smaller hierarchy. Interest mining by IMind was thus shown to be more reliable across different interest types. As in the ﬁrst criterion, NOF and RO generally outperformed NB and HS, reconﬁrming the contributions of idf weighting. The performances of the baselines under different feature set sizes deserve discussion as well. An experiment (not shown in the ﬁgures) indicated that, even when the size was smaller than 10 000 or larger than 40 000, IMind signiﬁcantly

Root Computers & Internet file OR information OR window OR system OR site OR server OR web... … …

Hardware CPU OR bit OR instruction OR register OR processor OR chip OR memory...

… …

…

Desktop Computers Computer OR card OR machine OR PC OR sound OR printer...

… Fig. 5. An example speciﬁcation mined by IMind.

ARTICLE IN PRESS

Folders with sites found (%)

Folders with sites found (%)

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

60

643

IMind-10 NOF-10 RO-10 NB-10 HS-10

40 20 0 10000

20000 Feature set size

40000

60

IMind-20 NOF-20 RO-20 NB-20 HS-20

40 20 0 10000

20000 Feature set size

40000

Folders with sites found (%)

Folders with sites found (%)

Fig. 6. Percentage of folders with sites retrieved (the larger hierarchy).

60

IMind-10 NOF-10 RO-10 NB-10 HS-10

40 20 0 10000

20000 Feature set size

40000

60

IMind-20 NOF-20 RO-20 NB-20 HS-20

40 20 0 10000

20000 Feature set size

40000

Fig. 7. Percentage of folders with sites retrieved (the smaller hierarchy).

outperformed all the baselines as well. Moreover, there was no perfect way to determine a best feature set for the baselines. Under different situations (i.e. evaluation criteria and hierarchies),

the baselines would call for different feature set sizes in order to achieve better performances. This highlights another two weaknesses of the baselines: (1) the feature set needs to be tuned in a

ARTICLE IN PRESS 644

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

trial-and-error manner, which is too costly and even implausible in practice, since the performance of a personalized web scanner on the web is much more difﬁcult to measure quantitatively, and (2) even when a baseline is tuned best under a certain situation, it cannot necessarily perform best under another situation. IMind does not have such weaknesses. The incremental mining capability of IMind avoids presetting a feature set, relieving IMind from trial-and-error tuning of the feature set for the evolving folder hierarchy and vocabulary. Through context mining and speciﬁcation, IMind may guide the search engine to ﬁnd more sites for a higher percentage of folders in the hierarchy. 4.4.2. Effects of different settings It is interesting to further investigate the effects of different settings of (1) the parameter a (i.e. the number of proﬁle terms selected for each folder), and (2) the hierarchy on which the systems operated. For each folder f, a governs how constrained the interest speciﬁcation of f is. A smaller (larger) a may introduce fewer (more) terms into each disjunction, making the speciﬁcation more constrained (relaxed). Experimental results showed that all systems’ performances signiﬁcantly dropped when a switched from 10 to 20, indicating that the derivation of interest speciﬁcations was sensitive to noisy terms, which could mislead the relevance judgment of the search engine. Therefore, the setting of a heavily depends on the vocabulary and content diversity of each individual folder (rather than other factors such as the height of the hierarchy). Moreover, a folder’s vocabulary and content diversity may evolve. An intelligent and evolvable way to set a is thus an interesting and challenging extension for all systems. As to the effect of different hierarchies, we observed that the improvements provided by IMind were more signiﬁcant when all systems operated on the smaller hierarchy. This was because, on the smaller hierarchy, IMind improved much more signiﬁcantly than the baselines. For example, in the ﬁrst criterion (i.e. average sites found per folder), IMind-10 improved 86% (from 5.23 to 9.71), while the best baseline RO-10 only

improved 36% (from 2.12 to 2.88). Moreover, in the second criterion (i.e. percentage of folders with sites found), IMind-10 improved 20% (from 56% to 67%), while the best baseline RO-10 did not improve (from 54% to 53%). We are thus interested in the reason of why IMind could perform much better on the smaller hierarchy. As noted in Section 4.1, both the smaller and the larger hierarchies shared the same structure and data, except that (1) 22 leaf folders in the smaller hierarchy were ancestor (non-leaf) folders in the larger hierarchy, and (2) due to the 22 folders, the smaller hierarchy contained more training documents. Actually, performance comparisons on the 22 folders (in the smaller hierarchy) and their descendants (in the larger hierarchy) could not be conclusive, since they were quite semantically different (the ancestor folders were more general than their descendants, and hence covered much more subcategories that could be included or not included in the hierarchies). On the other hand, the amount of training data could affect the convergence of proﬁle mining, and hence signiﬁcantly affect the performance of IMind. This could be justiﬁed by the result that, among the 97 leaf folders that were included in both hierarchies, IMind performed better for 35 (9) folders when it operated on the smaller hierarchy (the larger hierarchy). Since the baselines could not improve so much when switching from the larger hierarchy to the smaller hierarchy, IMind was more competent in improving itself as more training data was given. 4.4.3. Cross evaluation under different search engines It is also interesting to evaluate the possible contributions of IMind under different search engines. We thus tried to send the queries of IMind and the baselines to 3 search engines: Google (http://www.google.com), Lycos (http:// www.lycos.com), and Open Directory Project (ODP, http://www.dmoz.org). As Yahoo, they organize information into directories and may show the categories of the web sites found (other search engines such as AltaVista, http://www. altavista.com, and Netscape, http://www.netscape. com do not show the categories of the web sites found).

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

Time Spent (Sec.)

Unfortunately, the search engines did not accept most of the queries generated by IMind and the baselines, since they limited the number of terms in a query (Google even only allowed at most 10 terms). One way to tackle the problem is to (1) cut each query into several ‘‘small’’ disjunctions (e.g. each disjunction contains at most 10 terms), (2) send the disjunctions to the search engines separately, and then (3) integrate and rank the results for the disjunctions. However, this way cannot objectively measure the contributions of IMind either, since the strategy to integrate and rank the results deserves proper design (designing an effective strategy is an interesting extension, ref. Section 5). Moreover, this way may be plausible for IMind but not necessarily plausible for the baselines, since the baselines’ queries may still be very vague, making the search space too large to process. This may be justiﬁed by the experiment on Yahoo, which did not impose strict limits on the number of terms in a query, and hence accepted all queries from IMind and the baselines. The results showed that Yahoo responded to all queries from IMind, but did not return web sites for some of the queries from the baselines, even when each of the queries was tried 10 times, and in each trial, the maximum waiting time was set to 10 min. When the feature set size was set to 40 000 in the smaller hierarchy, Yahoo did not respond to 2, 3, 19, and 78 queries generated by RO-20, NOF-20, NB-20, and HS-20, respectively. Similar results were obtained by repeating the experiments several times in different days. Therefore, in the lower ﬁgures of Figs. 4 and 7, only those queries to which Yahoo responded were counted and averaged. As the terms in the queries were more common (e.g. ‘‘list’’, ‘‘run’’, and ‘‘product’’), the

645

problem became more serious. As the queries were longer (i.e. RO-20, NOF-20, NB-20, and HS-20) and the feature set was larger (i.e. 40 000), more such words could be included in the queries, incurring lots of matches (and hence a heavier processing cost). Due to the absence of idf weighting, the problem was more serious for NB and HS. Therefore, without a suitable speciﬁcation mechanism, a query could only be in disjunction form, which showed to be too vague for the search engines to ﬁnd web sites using an acceptable amount of processing efforts. IMind employs context speciﬁcations to remove a huge amount of sites that did not deserve consideration. Moreover, as analyzed in Section 4.4.1, the context speciﬁcations contributed signiﬁcant performance improvements on those queries that could be successfully processed by Yahoo, who promised to rank the web sites based on content relevance. Content relevance is the fundamental basis for the ranking strategies of many other baselines as well. 4.4.4. Efficiency of incremental mining As noted in Section 3.3.2, we were also concerned with the efﬁciency of IMind. Fig. 8 shows the time spent for incremental interest mining for individual documents, which were sequentially added into the larger hierarchy (recall that the larger hierarchy had 174 leaf folders containing 2844 documents). The system operated on a personal computer (PC) with a CPU running in 2.6 GHz and a RAM whose size was 2 GB. On average, incremental interest mining triggered by each document required 3.43 s. Another experiment showed that if the leaf folders were added one by one (recall that IMind may accept a set of

20 15 10 5 0 0

500

1000

1500

2000

2500

Document ID Fig. 8. Time spent for individual documents sequentially added into the larger hierarchy.

ARTICLE IN PRESS 646

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

documents as input, ref. Input III in Table 1), the average time spent for each folder was 4.38 s (on average, a leaf folder had 16.3 documents). It is also interesting to note that, those documents (or folders) that were added later did not necessarily require more time, even though the total number of distinct proﬁle terms in the hierarchy (i.e. N) increased monotonically (ﬁnally, N was 82 620). Observations also showed that the related folders’ numbers of proﬁle terms tended to converge to certain numbers that were much fewer than N. The time complexity analysis in Section 3.3.2 was thus conﬁrmed: time spent by IMind mainly depends on the number of proﬁle terms in related folders, which did not necessarily have many proﬁle terms. Together with the space complexity analysis (ref. Section 3.3.1), IMind is thus shown to be efﬁcient enough to adapt to the evolving interest of the user, without requiring a memory space that is too large for modern computers.

but difﬁcult for the user, since the web is both huge and ever changing. IMind is an efﬁcient and adaptive text mining technique to tackle the challenges. It guides web scanning by incremental interest mining. The speciﬁcation mined for each interest type is represented in conjunctive normal form, which is comprehensible for both the user and many information providers. Empirical analysis also shows that the speciﬁcations are more precise by expressing the context of each interest type. The delivery of such speciﬁcations to a scanner may efﬁciently keep the scanner precisely aware of the most recent interest of the user, directing each resource consumption and effort of the scanner to more focused and suitable targets, while separating the user from the difﬁcult (even implausible) tasks of interest speciﬁcation and seed identiﬁcation. Aiming to serve as a personalized assistant for individual users, the framework is being extended in several ways:

5. Conclusion and extension Personalized web scanning is triggered by the interest of individual users (e.g. businesses or people). The interest is relatively long-term and evolving. To satisfy such interest, the web scanner needs to consume lots of resources to continuously gather and monitor information on the huge and ever-changing web. Precise speciﬁcations of the interest are thus essential for the scanner to achieve cost-effective scanning. However, the user often has difﬁculties in expressing such speciﬁcations. The interest may be implicitly and collectively deﬁned by the evolving folders that contain those documents of the user’s interest (e.g. documents related to the user’s job descriptions, skills, and preferences). Moreover, the speciﬁcations should also be comprehensible for both the user and the related information providers (e.g. Internet search engines). A speciﬁcation comprehensible to the user may facilitate manual reﬁnement of the speciﬁcation, while a speciﬁcation comprehensible to the information providers may facilitate comprehensive identiﬁcation of proper seeds to start scanning. Comprehensive identiﬁcation of proper seeds is essential for web scanning

(1) Automatic determination of the number of proﬁle terms in an interest speciﬁcation: although IMind has removed other parameters (e.g. the size of the feature set required by many other interest miners), there is still one parameter: the number of proﬁle terms selected for each folder (i.e. a in Table 1), which is required by other interest miners as well. As noted in Section 4.4.2, automatic setting of the parameter is both interesting and challenging. The possibility of guiding the user to manually reﬁne the parameter deserves exploration as well. (2) Manual reﬁnement of the interest speciﬁcations: as noted above, the speciﬁcation produced by IMind is also comprehensible to the user. An interface may thus be developed for the user to comprehend and reﬁne each speciﬁcation, making the speciﬁcation able to reﬂect the user’s intentions more precisely. It may also alleviate the problem incurred by unsuitable proﬁle terms (i.e. a in the above discussion). The interface should provide a visualized environment to support online drill-down exploration of how the speciﬁcation is produced.

ARTICLE IN PRESS R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

(3) Post-processing of the results returned by scanners and search engines: to produce a speciﬁcation comprehensible for the search engine, IMind removes the weight of each term (i.e. rw,f dw,f value). Actually, the weight is helpful in post-processing the results returned by scanners and search engines. For example, each web page found by the scanners and search engines may be ranked according its weight, which may be measured by summing the weights of the proﬁle terms occurring in the page. The ranking helps to assign priorities to the web pages for the next round of scanning, directing the limited resources to more promising spaces.

Acknowledgements This research is supported by the National Science Council of the Republic of China under the grant NSC 92-2213-E-216-018.

References [1] M.N. Frolick, M.J. Parzinger, R.K. Rainer, N.K. Ramarapu, Using EISs for environmental scanning, Inform. System Manage., 1997. [2] N. Ahituv, J. Zif, I. Machlin, Environmental scanning and information systems in relation to success in introducing new products, Inform. & Manage. 33 (1998) 201–211. [3] R.-L. Liu, Collaborative multiagent adaptation for business environmental scanning through the internet. Applied Intelligence 20 (2) (2004) 119–133. [4] P. Srinivasan, G. Pant, F. Menczer, Target-seeking crawlers and their topical performance, Proceedings of SIGIR’02, 2002. [5] R.-L. Liu, M.-J. Shih, Y.-F. Kao, Adaptive exception monitoring agents for management by exceptions, Applied Artiﬁcial Intelligence (AAI) 15 (4) (2001) 397–418. [6] H. Chen, Y.-M. Chung, R. Marshall, C.Y. Christopher, An intelligent personal spider (agent) for dynamic Internet/Intranet searching, Decision Support Systems 23 (1998) 41–58. [7] V. Lesser, B. Horling, F. Klassner, A. Raja, BIG: a resource-bounded information gathering agent,’’ UMass Computer Science Technical Report 1998-03, 1998. [8] F. Menczer, ARACHNID: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery, Proceedings of the 14th International Machine Learning Conference (ML97), 1997.

647

[9] K. Sycara, In-Context information management through adaptive collaboration of intelligent agents, in: M. Klusch (Ed.), Intelligent Information Agents Cooperative Rational and Adaptive Information Gathering on the Internet, Springer, Berlin, 1999. [10] C.C. Yang, J. Yen, H. Chen, Intelligent internet searching agent based on hybrid simulated annealing, Decision Support Systems 28 (2000) 269–277. [11] C.C. Aggarwal, F. Al-garawi, P.S. Yu, On the design of a learning crawler for topical resource discovery, ACM Trans. Inform. Systems 19 (3) (2001) 286–309. [12] C.C. Aggarwal, Collaborative crawling: mining user experiences for topical resource discovery, Proceedings. of ACM SIGKDD’02, 2002. [13] K.S. Decker, K. Sycara, Intelligent adaptive information agents, J. of Intelli. Inform. Systems, 1997. [14] S. Liu, Business environment scanner for senior managers: towards active executive support with intelligent agents, Proceedings of 31st Annual Hawaii International Conference on System Sciences, 1998, pp. 18–27. [15] R. Nordlie, User Revealment — A comparison of initial queries and ensuing question development in online searching and in human reference interactions, Proceedings of ACM SIGIR’99, 1999. [16] T. Saracevic, A. Spink, M.-M. Wu, Users and intermediaries in information retrieval: what are they talking about?, Proceedings of the 6th International Conference on User Modeling (UM97), 1997. [17] P. Bruza, R. McArthur, S. Dennis, Interactive Internet search: keyword, directory and query reformulation mechanisms compared, Proceedings of ACM SIGIR 2000, 2000. [18] K. Eguchi, Adaptive cluster-based browsing using incrementally expanded queries and its effects, Proceedings of ACM SIGIR’99, 1999. [19] S. Jones, M.S. Staveley, Phrasier: a system for interactive document retrieval using keyphrases, Proceedings ACM SIGIR’99, 1999. [20] M.B.J. Jansen, A. Spink, J. Bateman, T. Saracevic, Real life information retrieval: a study of user queries on the web, SIGIR Forum 32 (1) (1998) 5–17. [21] M. Iwayama, Relevance feedback with a small number of relevance judgments: incremental relevance feedback vs. document clustering,’’ Proceedings of ACM SIGIR 2000, 2000. [22] S. Chakrabarti, S. Srivastava, M. Subramanyam, M. Tiwari, Using memex to achieve and mine community web browsing experience, Proceedings of the 9th International World Wide Web Conference, 2000. [23] D. Godoy, A. Amandi, Building subject hierarchies for user modeling, 2001, http://www.exa.unicen.edu.ar/~dgodoy/pubﬁles/godccw2001.pdf. [24] H. Joho, C. Coverson, M. Sanderson, M. Beaulieu, hierarchical presentation of expansion terms, Proceedings of SAC 2002, Madrid, Spain, 2002. [25] D. Lawrie, W. B. Croft, A. Rosenberg, Finding topic words for hierarchical summarization, Proceedings of SIGIR’01, 2001.

ARTICLE IN PRESS 648

R.-L. Liu, W.-J. Lin / Information Systems 30 (2005) 630–648

[26] A. McCallum, R. Rosenfeld, T. Mitchell, A. Y. Ng, improving text classiﬁcation by shrinkage in a hierarchy of classes, Proceedings of ICML’98, 1998. [27] R.E. Schapire, Y. Singer, A. Singhal, Boosting and Rocchio applied to text ﬁltering, Proceedings of ACM SIGIR’98, 1998. [28] A. Singhal, M. Mitra, C. Buckley, Learning Routing Queries in a Query Zone, Proceedings of ACM SIGIR’97, 1997. [29] S. Chakrabarti, M. van den Berg, B. Dom, Focused crawling: a new approach to topic-speciﬁc web resource discovery, Proceedings of the 8th World Wide Web Conference, Toronto, Canada, 1999.

[30] M. Pazzani, J. Muramatsu, D. Billsus, Syskill & Webert: Identifying Interesting Web Sites, Proceedings of AAAI’96, 54–61, 1996. [31] H. Wu, T.H. Phang, B. Liu, and X. Li, A reﬁnement approach to handling model misﬁt in text categorization, Proceedings of ACM SIGKDD’02, 2002. [32] L.S. Larkey, W.B. Croft, Combining Classiﬁers in Text Categorization, Proceedings of ACM SIGIR’96, 1996. [33] D. Koller, M. Sahami, Hierarchically classifying documents using very few words, Proceedings of ICML’97, 1997. [34] Y. Yang, J. O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proceedings of ICML’97, 1997.

Incremental mining of information interest for personalized web scanning

Incremental mining of information interest for personalized web scanning

Recommend Documents