Computer Networks 50 (2006) 1547–1563 www.elsevier.com/locate/comnet
A dataflow approach to efficient change detection of HTML/XML documents in WebVigiL q Anoop Sanka, Shravan Chamakura, Sharma Chakravarthy
*
Department of Computer Science and Engineering, The University of Texas at Arlington, 329 Nedderman Hall, 416 Yates Street (or POB 19015), Arlington, TX 76019-0015, United States Available online 9 December 2005
Abstract The burgeoning data on the Web makes it difficult for one to keep track of the changes that constantly occur to specific information of interest. Currently, the most widespread way of detecting changes occurring to Web content is to periodically retrieve the pages of interest and check them for changes. This approach puts the burden on the user and wastes time and resources. Alternatively, systems that detect any change to a page is an overkill as it presents information that may not be relevant. Timeliness of change detection is also an issue in this approach. In this paper, we present a change-monitoring system—WebVigiL—which efficiently monitors user-specified Web pages for customized changes and notifies the user in a timely manner. The focus of this paper is on the dataflow approach used for detecting multiple types of changes to a page and to monitor changes to more then one page at a time. This approach has been optimized to group similar/same specifications to reduce the computation of changes. Multiple changes to the same page as well as to different pages are handled in our approach. As a special case, this includes the monitoring of Web pages containing frames. We also provide the overall architecture and functionality of the WebVigiL system to highlight the role of change detection graph (CDG) which forms the core of the system. 2005 Elsevier B.V. All rights reserved. Keywords: Change detection; Customized Web monitoring; Dataflow approach; User-defined profiles; Web pages and frames
1. Introduction The World Wide Web has outdone traditional media such as television, to become an indispensq
This work was supported, in part, by the Office of Naval Research and the SPAWAR System Center-San Diego and by the Rome Laboratory grant F30602-01-2-05430, and by NSF grants IIS-0123730, IIS-0326505, and EIA-0216500. * Corresponding author. E-mail addresses:
[email protected] (A. Sanka),
[email protected] (S. Chamakura),
[email protected] (S. Chakravarthy).
able source of information. There is data for everyone and everything on the Web, and this data is increasing at a rapid pace. This plethora of information often leads to situations wherein users looking for specific pieces of information are flooded with irrelevant data. For example, an information technology professional who is only interested in news related to his areas of interest, is flooded with news on many topics when he goes to any popular news Web site. There are other situations in which users are interested in knowing about the updates happening to a particular Web page of their interest.
1389-1286/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2005.10.016
1548
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
For example, students might want to know about the updates that are made to their course Web sites regarding any new projects. As another example, in a large software development project, there are number of documents, such as requirements analysis, design specification, detailed design document and implementation documents. When a number of people are working on a project, they need to be aware of the changes to any one of the documents to make sure the changes are propagated properly to other relevant documents. In general, the ability to specify and monitor changes to arbitrary documents and get notified in different ways in a timely manner is needed. The proposed architecture of WebVigiL is geared towards disseminating information efficiently without sending unnecessary or irrelevant information. Today, information retrieval is mostly done using the pull paradigm [1] where the information of interest is retrieved from its source by giving a query (or queries). The user has the burden of analyzing the retrieved information for any changes of interest. Although there are some proposed solutions, such as mailing lists, to notify users of the changes to information of interest, there is little or no provision for the user to customize those notifications. The user has to be satisfied with whatever information the source sends rather than what specific information the user wants. A great deal of research has been done in the field of active technology (essentially push paradigm in contrast to the pull paradigm) to provide the capability of timely notification for many applications. Event–Condition–Action (or ECA) rules are used to incorporate active capability into a system. In the case of large-scale distributed environments such as the Web, users are interested in monitoring changes to a particular Web page. In most cases, change detection is required at a finer granularity, such as specifying changes to links, images, phrases or keywords in a page. Web pages that are monitored for detecting changes may be either HTML or XML pages. In our approach, changes to pages, images, links, or keywords, etc., correspond to primitive events when mapped to the ECA paradigm and their combinations form composite events. Thus, techniques developed for active databases has been used effectively where appropriate to provide solution in a new context. This paper focuses on developing a framework and to provide a selective propagation approach to detect changes that are of interest to the users in the context of
Web and other large-scale network-centric environments by adapting and extending active technology. The remainder of the paper is organized as follows. In Section 2 we present related work in change detection as applied to Web contents. In Section 3 we present the architecture of the WebVigiL system. In Section 4 we explain how the ECA paradigm plays a central role in detecting changes for a large number of sentinels (user specification of what to monitor). In Section 5 we present the core component of the WebVigiL system which is the use of Change Detection Graph (CDG) to support the monitoring of Web pages. Finally, we discuss conclusions and future work in Section 6. 2. Related work WebVigiL is a content-monitoring system, which supports primitive (or simple) and composite changes to Web pages, and notifies the user of changes as soon as it is detected. To achieve this capability, active technology has been effectively used. Using active rules incorporated to make databases active has been well established. Active database systems, based on rule definition, event detection, and action execution, add the capability to recognize specific situations and react to them in contrast to the traditional passive databases. The usage of this capability has been very limited in Web-based systems. WebVigiL uses active capability for the run time management of user-defined specifications to monitor and detect changes efficiently. Many other tools have been developed and are currently available for tracking changes to Web pages: AIDE (AT&T Internet Difference Engine) [2] views a HTML document as a sequence of sentences and sentence-breaking markups. A token is either a sentence-breaking markup or a sentence, which can be a sequence of words and non-sentence-breaking markups. AIDE uses HTMLdiff, which uses the weighted Longest Common Subsequence (LCS) algorithm [3] to compare two tokens and computes a non-negative weight reflecting the degree to which they match. All markups are represented and are compared. This approach may be expensive computationally as each sentence may need to be compared with all sentences in the document. Thus in situations where the user is interested in change to a particular phrase, HTMLdiff will end up computing change to the whole page, resulting in excessive computational cost.
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
NetMind [4] formerly known as URL-minder provides keyword or text-based change detection and notification service over Web pages. NetMind detects changes to links, images, keywords and phrases in an HTML page. The media of notification are e-mail or mobile phone. The system lacks the support to specify ignoring changes which the user is not interested in. Also, it lacks the support to specify composite changes on a page. Also there is no provision for the user to come back later and view the history of changes that have been detected. The frequency of when to poll the page is predefined. WebCQ [5] detects customized changes between two given HTML pages and provides an expressive language for the user to specify his/her interests. However, WebCQ only supports changes between the last two versions of a page of interest. Hence, compare options such as ‘moving’ and ‘every’ that are supported in WebVigiL are not available to the user. Changes using composition operators (AND, OR, NOT) are not supported in WebCQ. WebVigiL allows users to inherit properties from other (correlated) sentinels, i.e., sentinels that start/stop based on other sentinels and temporal events. The architecture is quite different in that WebVigiL uses adaptive fetch algorithms and buffer management that is tailored to WebVigiL needs [6]. WYSIGOT [7] is a commercial application that can be used to detect changes to HTML pages. This system has to be installed on the local machine (which may not always be possible). The system monitors an HTML page and all the pages that it points to. But the granularity of change detection is at the page level. RSS [8] is expanded in multiple ways as ‘Rich Site Summary’ or ‘Really Simple Syndication’, etc. RSS is an XML-based specification format that is mainly used to syndicate news sites, personal Weblogs, CVS checkins, etc. Web pages featuring RSS publish a file which contains the required information in the XML format, which is updated whenever there is a change. To keep track of these changes, a client-side reader application is required which periodically checks the published RSS file and notifies of changes. RSS only provides a basic framework for change presentation and the complexity and usability of the published changes depend on the RSS reader. There are some commercial RSS readers [9] which support some additional features like monitoring an RSS feed for specific keywords or phrases. The main disadvantage of RSS is that even though the changes are published by the Web page,
1549
the main computation is done on the client side. Also, RSS feeds are not available for every Web page which greatly limits its scope. In addition to the above systems there has been other work related to events and changes on the Web [10]. Most of the above systems detect changes to the entire page instead of tracking changes to user specified components. The changes can be tracked only on limited pages and the user will be burdened with all the changes, irrespective of whether he/she requires them or not. Hence, these tools solve the problem of when to refresh a page but not how and when a page is modified. In addition, these tools have limited capabilities in terms of types of changes detected. The scalability of these systems is also not very clear. Some of the distinguishing features of WebVigiL [11–15] when compared with the above systems are: 1. Properties of monitoring requests can be inherited: The user has the option of specifying the monitoring request to be dependent on the status of other monitoring requests. One can specify the start/end of a request to be the start/end of another request. 2. Flexible specification of versions: All the above systems compute changes between two successive versions of a page. In WebVigiL, the user can explicitly specify the pages that can participate in change detection. 3. Composite change detection: WebVigiL provides an elegant way to specify multiple change types, such as ‘‘changes occurring to either images or links’’, which none of the above systems provide.
3. WebVigiL architecture WebVigiL is a change-detection and notification system, which can monitor and detect changes to unstructured documents in general. Although WebVigiL has been implemented in the context of a Web monitoring system, the architecture is modular and flexible. For example, to monitor changes to a document system, only a few modules specific to the document fetch and change detection need to be replaced as the rest of the architecture remains the same. To detect changes to a word document instead of an HTML/XML page, a new change detection module needs to be added. However, the change detection graph, version manager, and the notification subsystems will remain the same.
1550
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
Fig. 1. WebVigiL architecture.
WebVigiL aims at investigating the specification, management, and propagation of changes as requested by the user in a timely manner while meeting the quality of service requirements. Fig. 1 summarizes the complete architecture of WebVigiL. 3.1. Sentinel Users specify their monitoring request in the form of a sentinel. A Web-based user interface enables the users to submit their profiles, termed sentinels. A user-profile is composed of number of components allowing the user to specify his/her requirements at a desired level of granularity. The profile comprises of the following: • Page to be monitored (URL). • Type of content to be monitored. • Frequency with which the system checks for changes (either user-specified or systemdetermined). • Start and End time of the monitoring request. • The medium and frequency for notification of changes. • The relative version of comparison for changes. WebVigiL provides an expressive language with well-defined semantics for specifying the monitoring requirements [16]. The specification language supports a number of features some of which are summarized below: As monitoring a page for any type of change as a whole may be too restrictive, WebVigiL supports monitoring a page at different levels of granularity. The set includes changes to images, changes to links,
changes to specified keywords or phrases. Users can place a request with the above mentioned change types by combining them using the operators ‘OR’, ‘AND’ and ‘NOT’ (for example, changes to links AND images either on the same page or on different pages). The system also allows the users to set sentinels based on previously defined sentinels which allows one to track correlated changes. Using this specification, unless the users explicitly override the properties, the new sentinel inherits all the properties of the previous sentinel. The system provides a way for users to exploit the system capability (adaptive determination of change frequency [17]) to check for changes. The fetch frequency can be either user defined (e.g., every 2 days) or system-determined where the system tunes the fetch frequency to the actual change frequency using a learning algorithm based on the history of observed change intervals. The media for notifications can be specified as email, fax, PDA, or dashboard. Currently notification by email is supported. The frequency of notification can be immediate or can be given as a user-specified frequency. The options for the relative versions of a page to be compared are: ‘moving’, ‘every’, and ‘pairwise’. A sentinel with option ‘moving(n)’ compares the fetched version of a page with the last nth version in a sliding window. In ‘every(n)’, every n versions of a page are compared, and in pairwise, every version is compared with its previous version. For example, consider the scenario: John wants to monitor changes to links and images on the page http://www.msn.com and wants to be notified daily by e-mail starting from April 8, 2005 to May 8, 2005. The sentinel specified for the above scenario is as follows:
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
Create Sentinel s1 Using http:// www.msn.com Monitor all links AND all images Fetch every 2 days From 04/08/05 To 05/08/05 Notify By email
[email protected] Every day Compare pairwise For the BNF of sentinel specification and its semantics, refer to [16]. 3.2. Verification module This module provides the required communication interface between the system and the user for specifying sentinels. User requests (sentinels) are processed for syntactic and semantic correctness. Valid sentinels are stored in a Knowledge Base (Oracle is used currently) and a notification is sent to the change detection module once the sentinel is validated. The verification module performs as much of validation as possible in the client, thereby reducing excessive communication between the client (JSP-based Web client) and the server. 3.3. Knowledge base After the user input (i.e., a sentinel) is validated, the details of the sentinel are stored and persist in the Knowledge Base. The Knowledge Base is maintained as a set of relational tables in a database. As there are several modules that require the information on a sentinel at run time, it is essential for the meta-data of the sentinel to be stored in a persistent recoverable manner. In case the system crashes, the Knowledge Base (in addition to other information logged while the system is running) is used for recovering to a stable state and start serving the sentinels that were active at the time of crash. The Knowledge Base is updated with important run time parameters by several modules of the system. Queries are posed to retrieve the required data from the tables at run time using a JDBC bridge. 3.4. Change detection module Currently, change detection [18] has been addressed for HTML and XML [12,19,20]—the two most popular forms for publishing data on the Web. Change detection algorithms [13,16] work according to the change type specified by the users.
1551
The corresponding objects are extracted from the versions of pages being compared. The versions to be compared for change detection are decided by the option which is chosen during the input. Appropriate objects are obtained from the version manager and are compared for changes and if there are any, they are reported to the notification module for notifying the corresponding user. In HTML, changes are detected using content-based tags such as links and images, presentation tags and changes to specific content such as keywords, phrases, etc. Although XML consists of tags that define both the structure and content, we are only interested in supporting changes to the content. We assume that the structure does not change very often and even if it does, the user is oblivious to the structural changes as they cannot be seen by the user. 3.5. Fetch module The options for specifying the frequency [21] are: ‘user-defined’ or can be ‘system-defined’. A periodic event is used for fetching the page by the fetch module. The user-defined frequency is used, if provided. Otherwise, the learning-based algorithm adapts itself to the change frequency of the underlying page and decides the next fetch frequency. This module informs the version controller of every version that it fetches, stores it in the page repository and notifies the CDG of a successful fetch. The properties of a page are checked for determining the freshness of a page. Page properties used are: Last Modified Time (LMT) or size of the page (Checksum). Depending upon the nature (static/dynamic) of the page being monitored, the complete set or subset of the meta-data is used to evaluate the change. Only if a change is detected using the properties, the wrapper fetches the actual page and sends the page to the version controller for storage. 3.5.1. Best effort (BE) rule BE Rule uses a best-effort Algorithm (BEA) [22] to achieve this tuning. In the best-effort algorithm, the next fetch interval (Pnext) is determined from the history of observed changes on that page. A periodic event is used whose frequency is changed dynamically to the Pnext computed by the algorithm. 3.5.2. Fixed-interval (FI) rule A periodic event [23,24] with periodicity (interval t) equal to the user-specified interval is created and
1552
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
an FI rule is associated with it to fetch the page. As a result, there may be more than one FI rule on a given page with different or same periodicity, where each rule is associated with a unique periodic event (i.e., with different start and end times). 3.6. Version management As WebVigiL performs the change detection by comparing two versions of a page of interest, there is a need for a repository service (Version Manager) where different versions of a page are archived and managed efficiently [25]. The version management module archives, manages, and forwards the requested version of a page to different modules at run time. The primary purpose of the repository service is to reduce the number of network connections to the remote Web server by avoiding redundant fetches, thereby reducing network traffic. The requirements for the repository service include managing multiple versions of pages without excessive storage overhead. So there is a need to store only those versions that are needed by the system and to purge all the versions which are not needed anymore. A deletion algorithm based on the requirements of the sentinels being monitored is used for purging the pages in the version controller [26]. 3.7. Presentation and notification module The primary functionality of this module is to notify the users of detected changes and present the changes in an elegant manner. It is only when the computed differences are presented or displayed in an intuitive and meaningful way that the user can interpret what has changed. Two schemes have been developed for change presentation, namely, ‘only change’ approach and the ‘dual frame’ approach. In the first scheme, only the changes that have been made to a page are displayed in a tabular manner. The dual frame approach has the old and new versions of the page shown side by side, with the changes highlighted. As the HTML tags are used for presentation and XML tags define content, the presentation of HTML and XML differ.
Java applications in a seamless manner. Primitive and composite event detection in various parameter contexts and coupling modes has been implemented [27–29]. During its lifespan, a sentinel is active and participates in change detection. A sentinel can be disabled (does not detect changes during that period) or enabled (detects changes). By default, a sentinel is enabled during its lifespan. The user can also explicitly change the state of the sentinel during its lifespan. The start/end of a sentinel can be time points or other events (for example, start of a sentinel, end of a sentinel along with temporal offset can be used as events). When a sentinel’s start time is now, it is enabled immediately. But in cases where the start is at a later time point or depends on another event that has not occurred, we need to enable the sentinel only when the start time is reached or the event of interest has occurred. In WebVigiL, the ECA rule generation module creates the appropriate events and rules to enable/disable sentinels. Briefly, we accomplish this as follows. Consider the scenario where s1 is defined in the interval [06/02/04, 01/02/05]. At time 06/02/04 sentinel s1 has to be enabled. Fig. 2 shows the events and rules that are generated to enable sentinel s1. Fetch_s1 is a periodic event created with Start_s1 as the start event, the frequency of page fetch, and End_s1 as the end event. The rule associated with it handles the fetching of page for s1. A rule associated with an event is fired when the event is raised. More than one rule can be associated with an event. When event Temp1 is triggered at the specified time point, rule T1 is executed, which in turn raises the event Start_s1. Triggering of the event Start_s1 activates the sentinel s1 and also initiates the periodic event used for fetching the pages of URL specified in s1. Now, if another sentinel s2 which is defined over the interval [start(s1), end(s1)] arrives, the events and rules are generated in order to enable s2 are shown in Fig. 2. Here, we are associating the rule R_start_s2 with the event Start_s1, which was created at the arrival of sentinel s1. This rule actually raises the Start_s2 event to activate the
4. Activation and deactivation of ECA rules WebVigiL uses the Java Local Event Detector (LED) to incorporate active capability in the system. LED is a library designed to provide support for primitive and composite events, and rules in
Fig. 2. Event generation.
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
periodic event associated with s2. In this manner, ECA rules are used to asynchronously activate and deactivate sentinels at run time. Once the appropriate events and rules are created, the local event detector handles the execution at run time. Enabling/disabling of sentinels sets a flag which is checked before fetching a page. Addition/deletion of sentinels actually change the detection graph by inserting or deleting event nodes and making sure other dependent sentinels on those events are not disturbed. 5. Change detection of Web contents In WebVigiL, change detection is performed between two different versions of a given page. When the comparison of two versions of the same page results in the detection of a change, all the sentinels interested in that page are notified of the changes. Change detection algorithms, CH-DIFF [17] and CX-DIFF [16,12] have been developed to detect changes occurring to HTML and XML documents, respectively. CH-DIFF detects changes to various components such as links, images, keywords, phrases and anychange. By anychange we mean change to any word in the page and/or changes to links/images. For detecting changes to any word we could use the LCS algorithm. Extant tools use the LCS algorithm (with several speed optimizations) to compare HTML pages at page level. But for scenarios where the user is only interested in a change to a particular object in a page, using the LCS algorithm will be computationally expensive. We describe our approach for change detection to HTML pages briefly here. Let t be the object type, which is of interest in a page, S(A) be the set of objects of type t extracted from version Vi and S(B) be the set of objects extracted from version Vi+1. Here S(A) S(B) gives the objects that are absent in S(B) indicating deletion of those objects between versions Vi and Vi+1. Similarly S(B) S(A) gives the new objects that have been inserted or added into version Vi+1. The algorithm improves upon this idea by introducing the concept of window-based change detection. We define a set s(o1, c1), (o2, c2), (o3, c3), . . . , (on, cn) where o1, o2, o3, . . . , on are objects of type t with c1, c2, . . . , cn being the corresponding number of occurrences of each object in a version Vi. For detecting changes to objects of type t in version Vi, we need to compare the set obtained from Vi, with the old set obtained from version Vi1.
1553
Increase or decrease in the number of instances of an object, is taken as an insert or delete. As the name of the approach indicates we form a window of objects which are ordered based on their hash codes. In Java 1.3, for every string object a hash code is generated (since every object is of type string in our case, we use this hash code). The hash code of the first and last object in the reordered old set defines the bounds of the window. 1. Phase I: Every object in the new set with a hash code greater than the upper bound or lower than the lowerbound is flagged as an insert. Objects that have their hash code within [lowerbound, upperbound] are searched for occurrence in the objects defining this range (in the reordered old set). If found, the occurrence count is compared and accordingly an insert or delete is flagged. If not found then the object in the new set is flagged as an insert. 2. Phase II: All objects in the reordered old set that have not participated in the previous phase are flagged as deletes. Since the objects are extracted from the page and change is deduced from the occurrence count, knowing exactly which instance of this object changed in the page requires additional computation. We do the additional computation based on the user preferences at the presentation phase. The signature of each instance of the keywords is used to detect the exact instance that was deleted. For phrases, in addition to insert and delete, an update to a phrase is also detected. Here, for phrases, the objects in the extracted set denote the signature of each instance (occurrence) of the phrase. The process involved in change detection of a phrase is as follows: Initially during the object extraction phase, the Knuth–Morris–Pratt (KMP) string-matching algorithm is used for matching the phrase with the words extracted from the page (words object). For every hit, the corresponding signature (bounding box) is extracted. Thus the set of objects extracted from the new and old version of a page for phrase detection is the signature of each instance of the phrase in the corresponding version. The windowbased approach results in indicating inserts and deletes to the objects considered. Delete to the object here has two possibilities: either the phrase was completely deleted from the new version or the phrase is partially updated but is not caught by KMP. Here we use a heuristic approach for
1554
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
determining an update to a phrase. The object that was flagged as deleted in the old set (i.e., signature of that instance of a phrase) is used against the new page to get the words wrapped by this signature. The LCS algorithm is now run on the words thus extracted and the phrase, and if the length of the resulting longest common subsequence is greater than a given percentage (the percentage can be adjusted on the fly based on past history) of the length of the phrase we take it as an update else a delete is flagged. Additional details can be found in [17]. CX-DIFF detects customized changes on XML documents like keywords and phrases. These changes are word-based changes on the content of the page. In an XML tree, the leaf nodes represent the content. Hence changes to the leaf nodes are of interest. The structure (actually path) is taken into consideration for identifying the content and for efficient change detection. A keyword can be part of the leaf node or be the node itself. For a phrase change, the given phrase could be part of the node, the node itself or can span more than one node. Detecting changes to contents constituting part of the node or spanning several nodes complicates the extraction of the content and change detection. For XML, the attributes of an element are considered unordered. But as the changes are detected considering the content to be ordered, the attributes are assumed to be ordered for the proposed change detection algorithm. Attributes defining ID and IDREFS are also considered as simple ordered attributes. In addition, we only consider well-formed XML documents and do not process DTD, CData, Entity and Processing Instructions nodes. The changes are detected by identifying the change operations, which transform a tree T1 to tree T2. For customized change detection based on user intent, extraction of the objects of interest like keywords and phrases is necessary to detect changes to a page. A signature is computed for each extracted leaf node. To detect change operations between given trees T1 and T2, the unique inserts/deletes are filtered and matching nodes and signatures are extracted. The common order subsequence is detected on the extracted matching nodes to detect move and insert/deletes to duplicate nodes. The algorithm consists of following steps: (i) Object extraction and signature computation: To access the content and extract the structure of the XML document, it is first transformed into a Document Object Model (DOM) [30]. The Xerces-J 1.4.4 Java
parser [31] for XML is used for this purpose. The tree is traversed top-down and the leaf node consisting of text and attribute nodes are extracted. The signature of each node is also computed from the extracted element information. (ii) Filtering of unique inserts/deletes: insertion/deletion of distinct nodes can result in unique insert/delete unless they are moved, and can be detected on an unordered tree. Similarly, leaf nodes containing a non-matching signature can also be considered as unique inserts/deletes as the signature defines the context. To reduce the computation cost of finding the common order subsequence between two ordered trees, by considering all the leaf nodes, the unique inserts/deletes are filtered out and matching nodes extracted by the defined functions totalMatch and signatureMatch. (iii) Finding the common order subsequence between the leaf nodes of the given trees: For change detection to multiple occurrences of a node with common signatures and for moved nodes, it is necessary to consider an ordered tree. Due to realignment of the node and inserts and deletes, the order of occurrence needs to be considered with respect to the sibling. Hence, the common order subsequence is computed by running the Longest Common Subsequence (LCS) algorithm [3] between the matched nodes of both the trees. All the matched nodes are aligned in the order of occurrence. For reducing the computational time for detecting changes, an optimization is also proposed. Additional details can be found in [12] and [16]. 5.1. Monitoring a Web page One of the important goals of WebVigiL is to support a large number of user requests. This gives rise to the scenario of having overlapping user requests—requests in which different users want to monitor common Web pages. To achieve system scalability, it becomes important to reduce redundant computation of changes. In [32], after examining various approaches, a graph-based approach has been identified as the best solution for performing change computation in WebVigiL. A graphbased structure allows to asynchronously feed fetched pages for change detection, allows parallelism where possible and supports optimization of computation by grouping sentinels over URLs and change types. WebVigiL uses a change detection graph (CDG) and a data flow paradigm to perform change detection. The change detection graph is very similar to
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
the event detection graph used for detecting primitive and composite events [33]. Various functions of WebVigiL are mapped onto different ECA components [32]. These include mapping the starting/ stopping of sentinels and fetching of Web pages to ECA events as was briefly described in Section 4. Primitive change detection involves detecting changes to links, images, keywords, etc., in a page. In order to perform primitive change detection, a graph is constructed which is called the change detection graph. The graph is constructed in a bottom-up manner as shown in Fig. 3. The different types of nodes in the graph are described below: • URL node (Un): A URL node is a leaf node at level 0 (L0) that denotes the page of interest (e.g., www.uta.edu). The number of URL nodes in the graph is equal to the number of unique URLs the system is monitoring at that particular instant of time. Whenever a change is detected at this level, the page is propagated to the corresponding nodes at level 1. • Change type node (Cn): All level 1 (L1) nodes in the graph are change type nodes. These nodes represent the type of change on a page (links, images, keywords, phrases, etc.). Change detection for the specific type of node is performed at this level. • Composite node (COn): The nodes at level 2 (L2) and above are composite nodes, ‘AND’, ‘OR’, ‘NOT’. The nodes ‘OR’ and ‘AND’ have two change type nodes (binary operator) as children, whereas ‘NOT’ has only one child (unary operator). All of these follow the semantics of the Boolean operators.
1555
In the graph, to perform the propagation of changes, the relationship between nodes at different levels is captured using a subscription/notification mechanism. The higher level nodes subscribe to the nodes at the lower level and this subscription information is maintained in a subscriber list at each node. The change is computed for all the sentinels present in the subscriber list at the change type node (Cn). There can be more than one sentinel associated with each change type node. The arrows in the graph represent the data flow. Whenever a Web page is fetched (actually the meta-data), it is given to the corresponding URL node. The URL node performs the task of determining whether the page has changed from the time it was last fetched (by comparing the meta-data with a previous version of the same page). This is done calling the version controller and by checking the ‘last modified time’ associated with the page. If the page is a dynamically generated page and does not provide a ‘last modified time’, the change detection is performed by computing the checksum of the page. If the page is detected to have changed, it is given to the corresponding change type nodes at level 1. The change type nodes at level 1 detect the occurrence of the specific change type they represent. For example, the ‘links’ node detects any changes happening to links on a page and the ‘images’ node detects any changes happening to the images on a page. If a change is detected, the group of sentinels interested on the page are notified of the changes. If there is a composite node involving the nodes at level 1, the level 1 nodes propagate the changes to the level 2 node which in turn will inform the corresponding sentinels. Sentinels are
Fig. 3. Change detection graph (CDG).
1556
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
grouped based on the URL they monitor and the change type. Grouping based on frequency is also done where appropriate. For example, a fixed interval sentinel with 1-h frequency and another with 2-h frequency are grouped so that the change detection is done only once every other computation. The algorithm for sentinel grouping is shown in Fig. 4 (please refer to [32] for more details). The above assumes that the user is monitoring a single page at a time. Many Web pages are composed of frames (pointers/links to other Web pages) and some frame-based pages may not have any content except for the links. Also, it would be useful for a user to monitor a rooted Web page (up to a particular depth) rather than a single page. In addition, when users want to monitor changes on different pages in one request, we need to accept multiple URL’s as part of a single sentinel. Also, the CDG needs to be organized differently to handle multiple pages and links from within a page efficiently. We extend the above approach [34] in the next section to support monitoring more than one Web page in a single request. 5.2. Monitoring multiple Web pages
itoring changes to links on www.cnn.com and changes to keywords ‘‘rangers’’ on www.espn. com. The user might want to be notified only when both the pages are updated with the corresponding changes. These requests map to monitoring multiple Web pages for different change types. • Linked monitoring: There are many instances in which information present on a Web page is categorized into different sections and each section is placed on a different page. In such cases, all the different pages are linked from one main page. These are the cases in which users might be interested in monitoring multiple Web pages by giving a single request on the page that links all the pages. For example, for an online forum Web site like www.javaforums.com, there will be many different sections on different Web pages but all the sections will be linked from the home page. A user might be interested in monitoring posts made on different sections of such a Web site. In this case, the base page can be assumed to be at depth 0 and all the pages linked from it to be at depth 1. For Web pages containing frames, embedded pages in the form of frames are automatically mapped to depth of 1.
Monitoring of multiple Web pages [34] for changes can be classified into the following two categories:
The two categories are elaborated in the following sections.
• Assorted monitoring: There might be cases in which users might want to monitor multiple Web pages for changes using a single request. For example, a user might be interested in mon-
5.2.1. Assorted monitoring The first task in monitoring multiple Web pages is to facilitate the input specification to include multiple Web pages and multiple change types associ-
Fig. 4. Algorithm for sentinel grouping.
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
ated with each Web page. Every pair of Web page and its associated change type should be joined with another pair using any of the operators OR, AND and NOT. For example, consider: (URL1 for ChangeType1 OR URL2 for ChangeType2) AND URL3 for ChangeType3. This means that the user wants to be notified whenever there is a change to either URL1 or URL2 along with a change to URL3. To provide this functionality, the part of the sentinel specification where the URL of the page to be monitored is designed as shown in Fig. 5. Detecting composite changes on the same page is computationally different from detecting composite changes on different pages. For the former, the page is fetched once and different computations for detecting different types of changes are done by propagating the fetch event to different subscriber nodes. On the other hand, for composite changes on different pages, different pages need to be fetched using the same fetch frequency. One possibility is to express them as different sentinels. However, this can only be done for the OR operation as AND cannot be achieved by independent sentinels. Furthermore, using independent sentinels will increase the number of notifications as well. An illustration of how OR can be expressed as two independent sentinels is shown in Fig. 6. Fig. 6 shows the sequence of handling the computation for two requests. It can be seen from the figure that there is redundancy involved in the crea-
Fig. 5. Sentinel specification.
1557
tion of ECA events and the construction of the CDG. This redundancy increases with the increase in the number of URLs to monitor. A mechanism for handling ‘‘assorted’’ natively and efficiently is needed. The requirements for an efficient mechanism for assorted monitoring can be summarized as follows: • Users should be able to specify any number of URLs and any change type for each URL. • Users should be able to combine the URL and change type pairs using the operators OR, AND and NOT. • For every request a minimum number of ECA rules should be created to start and stop the sentinel. • For every request, only one CDG should be constructed, with multiple leaf nodes. • Fetching of all the specified URLs should be done using only one fetch rule. • A request should have only one notification mechanism for every change. Based on these requirements, whenever a request for assorted monitoring is received, only one ECA rule (for fetching) is generated, by treating the first URL in the request as the base page. The way the CDG is constructed for these requests is illustrated in Fig. 7. In Fig. 7, the first request S1 should be notified whenever either links on www.cnn.com change or the keywords ‘‘rangers’’ on www.espn.com change. The second request S2 should be notified only when a change to keywords ‘‘rangers’’ is detected on www.espn.com and a change to links is detected on www.nba.com. When a change to links is detected for the Web page www.cnn.com, it is propagated to the OR node corresponding to S1 and
Fig. 6. Using two independent sentinels to compose ‘OR’.
1558
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
Fig. 7. CDG for assorted monitoring.
finally S1 is notified. But a change to the keywords ‘‘rangers’’ is propagated to both the OR node for S1 and the AND node for S2. The OR node for S1 notifies S1 of this change but the AND node for S2 only notifies when there is a change to links on the Web page www.nba.com. 5.2.2. Linked monitoring As mentioned before, linked monitoring is a request to monitor a Web page with a given depth. When a sentinel with a request for linked monitoring is received, all the pages till the given depth are crawled and the required URLs are extracted. One issue with crawling is to handle circular references wherein a link on a Web page points to a Web page that has already been crawled. This might result in non-termination of crawling if not handled properly. In the WebVigiL system, a breadth-first search is used to crawl the required pages. Starting from the base page, all the pages at level 1 are crawled, followed by all the pages at level 2 and so on making sure circular references are avoided. Web pages containing frames can be treated as a special case of linked monitoring where the user does not provide any depth but a default depth of 1 is assumed. Therefore, requests on Web pages with frames are converted to requests for linked monitoring with a depth of 1 so that the base page and all the pages pointed by the frames can be monitored. The construction of CDG for these requests can be done using the following two approaches: 1. Approach 1: Using binary OR nodes. 2. Approach 2: Using an N-ary operator: Extended OR (EOR).
Binary OR Approach: One method to construct the CDG is to use multiple binary OR nodes as shown is Fig. 8. The construction of CDG starts by joining the first two URL nodes using a binary OR node. A dummy sentinel is inserted at this OR node which is connected with the next URL node using another OR node. This process is repeated till all the URL nodes have been connected. As the binary OR node propagates changes to any of its child nodes, changes get propagated through all the binary OR nodes above it till the top level, at which point the sentinel will be notified. Extended-OR (EOR) Approach: The ‘extendedOR’ (EOR) [34] is an N-ary operator designed to efficiently handle multi-URL requests. The design details of the ‘EOR’ node are presented in Fig. 9. The binary operator ‘OR’ has two child nodes; however, an ‘EOR’ operator can have a variable number of children nodes. The semantics of the ‘EOR’ node are the same as that of the ‘OR’ node in terms of propagation of changes. Whenever a child node propagates a change, the ‘EOR’ node notifies the sentinel of the change. To easily add and delete sentinels to the EOR node, it maintains all the child sentinels as a linked list. The EOR node also consists of a hash table which has a mapping from the URL of every child to its corresponding relative level. For example, the entry in the hash table for the base page will be 0, while the entries for pages linked from the base page will be 1. To construct the CDG using the EOR node, when a linked monitoring request is given, the base page is parsed (till the required depth) and all the required URLs are extracted. The traversal is done in a breadth-first manner to maintain the information regarding the
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
1559
Fig. 8. CDG construction using binary OR nodes.
Fig. 9. Extended-OR operator.
relative depth of every URL. Each of the extracted URLs is checked to see whether or not a URL node corresponding to that URL already exists. For every URL that does not have a corresponding URL node, a new URL node is created and the requested change type node is created and attached to it. After creating all the change type nodes, the list of URL nodes and the corresponding level information is used to create the EOR node. Finally, the main sentinel is inserted into the EOR node, which in turn is cloned and inserted at every child node of the EOR node. Fig. 10 illustrates the graph construction process for different sections (Web pages) present on www.javaforums.com. For every Web page, the corresponding change type node is created and the cloned sentinel is inserted at every change type node. When a change is detected on any Web page, it propagates to the EOR node immediately.
The EOR node determines the relative level of the page where the change is detected and notifies the same to the sentinel. The advantages of the EOR node with respect to change detection and propagation are: • Using EOR nodes for graph construction results in a smaller CDG than the one resulting with binary OR operators. There is only one EOR node created for every request and there are no dummy sentinels with a request. • There is no delay involved in the propagation of detected changes with EOR nodes. Any change that is detected gets propagated through only one level and the sentinel is notified. • Since all the child nodes are stored in a vector at the EOR node, addition and deletion is done in a constant time with simple operations.
1560
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
Fig. 10. CDG construction using EOR node.
• The EOR node maintains the relative depth information for all the child sentinels which makes it easy to propagate this information to the sentinel.
5.3. Fetching multiple Web pages In WebVigiL fetching of Web pages is done by creating ECA rules on periodic events. ECA rules fetch the Web pages with the requested frequency. In the case of fixed-interval sentinels, every sentinel has a unique fetch rule. Therefore, there can be multiple fetch rules on the same URL. A fetch rule is created on a periodic event with the user-specified periodicity for fetching. The fetch rule then fetches the Web page with the required frequency during the sentinel’s lifetime. In the case of best-effort sentinels, there is only one fetch rule associated with any given URL. The time period of this fetch rule changes based on the prediction of periodicity made by the best-effort algorithm. To handle the fetching of multiple Web pages, one approach is to create an independent fetch-rule for every URL in the request. But this greatly increases the load on the system as the number of URLs increases. More importantly, this does not reflect the essence of monitoring multiple URLs with a single request. Hence, one fetch rule is created on the base URL and whenever the rule is fired, all the URLs are fetched at the frequency specified by the user and checked for changes. In the case of best-effort sentinels, different URLs have different fetch intervals based on their change behavior. The way fetching is done for best-effort sentinels is explained in Fig. 11.
For best-effort sentinels having multiple URLs, the fetching for all the pages starts with the same time interval. Fig. 11 shows the fetching process of three pages, sports, travel and world on www.cnn.com. The fetching for all three pages starts with the same initial time period of 4 min. When the pages are fetched at point T1, changes are detected in sports and world pages but no change is detected in travel page. When this information is given to the best-effort algorithm, it predicts a next fetch interval of 4 min for the pages which changed and an interval of 8 min for the page that did not change. Therefore, the periodicity of the fetch rule is changed to 4 min. The next fetch event occurs after 4 min at point T2. At this point only sports and world pages are fetched. There is no change detected in either of the pages and the best-effort algorithm predicts a value of 8 min for both the pages. Since the travel page has to be fetched after 8 min from point T1, and 4 min have elapsed from point T1 to point T2, the periodicity of the fetch rule is set to 4 min. When the event is raised after 4 min, at point T3, only the travel page is fetched. Generalizing, the next fetch interval for the besteffort rule is: T next ¼ MinimumðT next ; Rnext Þ. Tnext = Minimum next fetch interval of all the pages that are fetched in the current cycle; Rnext = Minimum remaining fetch interval of all the pages that are not fetched in the current cycle. The original best effort algorithm [22] has been modified [34] to work for multiple page fetches within a single fetch rule and minimize the number of fetches if there are no changes in one or more pages.
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
1561
Fig. 11. Multi-URL fetching for best-effort sentinels.
6. Conclusions and future work
References
WebVigiL is a change monitoring system for the Web that supports the monitoring of single or multiple HTML/XML pages for different change types. It provides mechanisms to efficiently fetch required pages with either a ‘user-specified’ time interval or with an interval predicted by a learning-based algorithm. Customized changes to single or multiple Web pages (including frames) are supported by WebVigiL. The first version of the WebVigiL system incorporating the features discussed in this paper has been implemented and can be accessed from http://berlin.uta.edu:8081/webvigil. The system is being extended in a number of ways, such as the distribution of sentinels over several WebVigiL servers, and so forth. WebVigiL is a system developed at UT Arlington for providing an alternative paradigm for monitoring changes to the Web (or any structured document).
[1] P. Deolasee, A. Katkar, A. Panchbudhe, K. Ramamritham, P. Shenoy, Adaptive push-pull: disseminating dynamic Web data, in: Proceedings of the Tenth International WWW Conference, Hong Kong, China, 2001, pp. 265–274. [2] F. Douglis, T. Ball, Y.F. Chen, E. Koutsofios, The AT&T internet difference engine: tracking and viewing changes on the Web, World Wide Web Journal 1 (1) (1998) 27–44. [3] D. Hirschberg, Algorithms for the longest common subsequence problem, Journal of the ACM 24 (4) (1997) 664–675. [4] Mind-it. Available from:
. [5] L. Liu, C. Pu, T. Wei, WebCQ: detecting and delivering information changes on the Web, in: Proceedings of the International Conference on Information and Knowledge Management (CIKM), Washington, DC, 2000, pp. 512–519. [6] A. Eppili, Approaches to improve the performance of storage and processing subsystems in WebVigiL, Master’s thesis, The University of Texas at Arilngton, 2004. [Online]. Available from: . [7] Wysigot. Available from: . [8] RSS. Available from: . [9] RSS-Reader. Available from: . [10] M. Levene, A. Poulovassilis (Eds.), Web Dynamics: Adapting to Change in Content, Size, Topology and Use, SpringerVerlag, 2004. [11] J. Jacob, A. Sanka, N. Pandrangi, S. Chakravarthy, WebVigiL: an approach to just-in-time information propagation in large network-centric environments, in: Web Dynamics, Springer-Verlag, 2004, pp. 301–318. [12] J. Jacob, A. Sachde, S. Chakravarthy, CX-DIFF: a change detection algorithm for XML content and change
Acknowledgements The authors would like to acknowledge the contributions of Mr. C. Hari Hara Subramaninan and Raman Adaikkalavan in revising and strengthening the manuscript. The authors would also like to gratefully acknowledge the comments from the reviewers for improving the content and presentation of the paper.
1562
[13]
[14]
[15]
[16]
[17]
[18] [19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563
presentation issues for WebVigiL, Data and Knowledge Engineering 25 (52) (2005) 209–230. N. Pandrangi, J. Jacob, A. Sanka, S. Chakravarthy, WebVigiL: user-profile based change detection for HTML/XML documents, in: Proceedings of the 20th British National Conference on Data Bases, Coventry, UK, 2003, pp. 38–55. A. Eppili, J. Jacob, A. Sachde, S. Chakravarthy, Expressive profile specification and its semantics for a Web monitoring system, in: Proceedings of the 23rd International Conference on Conceptual Modeling (ER2004), Shanghai, China, 2004, pp. 420–433. S. Chamakura, A. Sachde, S. Chakravarthy, A. Arora, WebVigiL: monitoring multiple Webpages and presentation of XML pages, in: Proceedings of the International Workshop on XML Schema and Data Management, ICDE, Tokyo, Japan, 2005, pp. 41–50. J. Jacob, WebVigiL: sentinel specification and user-intent based change detection for XML, Master’s thesis, The University of Texas at Arilngton, 2003. [Online]. Available from: . N. Pandrangi, WebVigiL: adaptive fetching and user-profile based change detection of HTML pages, Master’s thesis, The University of Texas at Arilngton, (2003). [Online]. Available from: . Change detection. Available from: . G. Cobena, S. Abiteboul, A. Marian, Detecting changes in XML documents, in: Proceedings of the International Conference on Data Engineering, San Jose, California, USA, 2002, pp. 41–52. B. Nguyen, S. Abiteboul, G. Cobena, M. Preda, Monitoring XML data on the Web, in: SIGMOD ’01: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, ACM Press, New York, NY, USA, 2001, pp. 437–448. J. Cho, H. Garcia-Molina, Estimating frequency of change, ACM Transactions on Internet Technology 3 (3) (2003) 256– 290. S. Chakravarthy, A. Sanka, N. Pandrangi, J. Jacob, A learning-based approach for fetching pages in WebVigiL, in: Proceedings of the ACM Symposium on Applied Computing, March 2004, pp. 1725–1731. S. Yang, Formal semantics of composite events for distributed environments: algorithms and implementation, Master’s thesis, The University of Florida, Gainsville, December 1998. [Online]. Available from: . H. Lee, Support for temporal events in sentinel: design implementation and preprocessing, Master’s thesis, The University of Florida, May 1996. [Online]. Available from: . S. Chien, V. Tsotras, C. Zaniolo, Efficient management of multiversion documents by object referencing, in: Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 2001, pp. 291–300. A. Sachde, Persistence, notification, and presentation of changes in WebVigiL, Master’s thesis, The University of Texas at Arilngton, 2004. [Online]. Available from:
[27]
[28]
[29]
[30] [31] [32]
[33]
[34]
itlab.uta.edu/ITLABWEB/Students/sharma/theses/sachde. pdf>. S. Chakravarthy, V. Krishnaprasad, E. Anwar, S.-K. Kim, Composite events for active databases: semantics, contexts, and detection, in: Proceedings of the International Conference on Very Large Data Bases, 1994, pp. 606–617. S. Chakravarthy, D. Mishra, Snoop: an expressive event specification language for active databases, Data and Knowledge Engineering 14 (10) (1994) 1–26. S. Chakravarthy, E. Anwar, L. Maugis, D. Mishra, Design of sentinel: an object-oriented DBMS with event-based rules, Information and Software Technology 36 (9) (1994) 559– 568. Document-Object-Model. Available from: . Xerces-J. Available from: . A. Sanka, A dataflow approach to efficient change detection of HTML/XML documents in WebVigiL, Master’s thesis, The University of Texas at Arilngton, 2003. [Online]. Available from: . S. Chakravarthy, V. Krishnaprasad, Z. Tamizuddin, R. Badani, ECA rule integration into an OODBMS: architecture and implementation, in: Proceedings of the International Conference on Data Engineering, February 1995, pp. 341–348. S. Chamakura, Improvements to change detection and fetching to handle multiple URLs in WebVigiL, Master’s thesis, The University of Texas at Arilngton, 2004. [Online]. Available from: .
Anoop Sanka was born in Rajahmundry, India, in June 20, 1978. He received his B.Tech degree in Computer Science and Engineering from National Institute of Engineering, India. In the Fall of 2000, he started his graduate studies in Computer Science and Engineering and received his Master of Science in Computer Science and Engineering from The University of Texas at Arlington, in December 2003. His research interests include active databases and Web technologies.
Shravan Chamakura was born in Andhra Pradesh, India, in 1981. He received his B.S. degree in Computer Science and Engineering from V.N.R. Vignana Jyothi Inst of Engg and Tech, India, in 2002. In the Fall of 2002, he started his graduate studies in Computer Science and He received his Master of Science in Computer Science and Engineering from The University of Texas at Arlington, in Dec 2004. His research interests include active databases and Web technologies.
A. Sanka et al. / Computer Networks 50 (2006) 1547–1563 Sharma Chakravarthy is a Professor in the Computer Science and Engineering Department at the University of Texas at Arlington. He has established the Information Technology Laboratory (or ITLab) and a Distributed and Parallel Computing Cluster (DPCC) at UTA. His current research interests include: enabling technologies for advanced information systems, mining/knowledge discovery (conventional, graph, and text), active capability, Web change monitoring, stream data processing, document/email classification, and information
1563
security. Prior to joining UTA, he was with the CISE Department at the University of Florida, Gainesville. He also worked as a Computer Scientist at the Computer Corporation of America (CCA), Cambridge, MA and as a Member of Technical Staff at Xerox Advanced Information Technology (XAIT), Cambridge, MA. He has published over 115 papers in refereed International journals and conference proceedings. He has consulted for MCC, BBN, SRA Systems (India), Enigmatec, London, and RTI Inc. He has given tutorial on a number of database topics, such as active, real-time, distributed, object-oriented, and heterogeneous databases in North America, Europe, and Asia. He has given several invited and keynote addresses at universities and International conferences.