The Journal of Systems and Software 69 (2004) 29–42 www.elsevier.com/locate/jss
Parameter driven synthetic web database generation q Pallavi Priyadarshini, Fengqiong Qin, Ee-Peng Lim *, Wee-Keong Ng Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Blk N4, 2a-32, Singapore 639798, Singapore Received 26 June 2002; received in revised form 24 December 2002; accepted 26 December 2002
Abstract To support intelligent data analysis on the web information, a data warehousing system called WHOWEDA (WareHouse Of WEb DAta) has been proposed. Unlike other relational data warehouses, WHOWEDA incorporates a Web Data Model that describes the web objects and their relationships as they are maintained within a data warehouse. A set of web operations has also been developed to manipulate the warehoused web information. In order to measure the performance of WHOWEDA and other similar systems that store and manipulate web information, a synthetic web database generator called WEDAGEN (WEb DAtabase GENerator) has been developed. It has the capability of generating collections of web objects of different sizes and complexities determined by a set of user-specified parameters. This paper presents the issues in the design and implementation of WEDAGEN. It also gives a detailed description of its system components and the strategy to generate synthetic web databases. A formal analysis of the generated web database and an empirical assessment of WEDAGEN has been reported. 2003 Elsevier Inc. All rights reserved. Keywords: Synthetic web data generation; Web warehousing; Web database performance
1. Introduction 1.1. Motivation The Wold Wide Web (WWW) is fast becoming an invaluable repository of information. It can be viewed as a huge collection of graphs with nodes representing web pages and links representing hyperlinks among these pages. As Web grows both in content and volume, many new interesting applications can now be developed based on the web information. For example, specialized digital libraries consisting of catalogues of web pages useful to different academic subjects can support students’ learning beyond classroom environment. Intelligent agents can also be built to collect and process web
q This work was supported in part by the Nanyang Technological University, Ministry of Education (Singapore) under Academic Research Fund 4-12034-5060, 4-12034-3012, 4-12034-6022. Any opinions, findings, and recommendations in this paper are those of the authors and do not reflect the views of the funding agencies. * Corresponding author. Tel.: +852-26098246; fax: +852-26035505. E-mail address:
[email protected] (E.-P. Lim).
0164-1212/$ - see front matter 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0164-1212(03)00002-5
pages from multiple sources in order to provide companies with information about global competitors. Information available on the web can be visited using web browsers but browsing web information is a tedious process owing to the network overheads involved in fetching web pages from remote sites. Web browsing also does not scale up when users are interested in information from large number of web sites and pages. To allow web information to be located more easily, different search engines have been developed. Nevertheless, search engines only support simple search criteria on the global Web. The search results are often overwhelming and much unwanted information are returned. Hence, it has been concluded that the existing browsers and search engines do not always represent the best approaches for users to harness information from the Web (Manber and Bigot, 1998). To systematically store and manipulate web information, a data warehousing system specially designed for Web has been proposed (Ng et al., 1998; Cao et al., 2000a,b). The data warehousing system known as WHOWEDA (Warehouse of WEb DAta) stores selected web information as web tables and provides several web operators such as global coupling, web select,
30
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
web project, etc., to manipulate web tables. While WHOWEDA has query capabilities similar to that of other web database systems such as LORE (McHugh et al., 1997; Goldman et al., 1999) and Strudel (Fernandez et al., 1997), its ability to reuse query results for further query processing has been a unique feature. As part of the ongoing development efforts for WHOWEDA, we require the storage and query processing subsystems of WHOWEDA to be tested and their performance to be evaluated by a series of experiments. While it is possible to conduct these experiments using actual web pages downloaded from the Internet, the amount of time required would be so high that a comprehensive experimentation would not be possible. To overcome this constraint, a synthetic web database generator called WEDAGEN (WEb DAta GENerator) has been developed. With WEDAGEN, one can generate the appropriate web tables and web pages that correspond to the web objects in the web tables. WEDAGEN, as a synthetic web database generator, allows web tables of different web schemas and sizes 1 to be created very quickly to allow WHOWEDA system testing and performance evaluation experiments to be efficiently conducted on the created web tables. While these web tables may also be created by downloading web pages from the global World Wide Web or some web archive (e.g., the Internet Archive 2), such web table construction process will incur much more overheads due to two main reasons. • Firstly, it takes much communication overheads to download web pages from the World Wide Web or the site providing the web archive. Downloading a web page typically may take a few seconds to several minutes depending on the page size, the size of objects (e.g., image and video files) embedded in the page, and the network load. Such downloading time will become overwhelming when large number of web pages are required to construct the web tables. • Secondly, the web table construction will involve scanning and traversing through large number of web pages to determine the subset that meet the schema requirement of the web tables to be constructed. Such scanning and traversal operations are clearly less efficient than direct generation of web pages that meet the schema requirement and synthetizing the web pages into web tables. When the real web pages are used, it is also not possible to determine a priori the size of the web tables that will be constructed. This causes difficulties in con-
1 The size of a web table refers to the number of web tuples it contains. See Section 1.2 for the detailed definition of web table. 2 Internet Archive, URL ¼ http://www.archive.org.
ducting a thorough testing and evaluation experiments on the WHOWEDA system. Given the above practical limitations using the real web pages, we have chosen to adopt the synthetic web databases in testing and evaluating WHOWEDA. Note that when WHOWEDA is actually deployed to warehouse web information, the overheads of downloading web pages from the World Wide Web and constructing web tuples are usually of much smaller magnitude because the warehousing approach constructs web tables incrementally over long period of time. The web tables will only become large in size after a long period of warehousing. The concept of synthetic databases is not new in database system research. Unlike the real databases, one can construct synthetic databases of different sizes and schemas to test or evaluate a database system, be it a relational or web database system. To our best knowledge, WEDAGEN is the first synthetic web database generator developed. In some way, the collection of web pages generated by WEDAGEN may look similar to objects in an object-oriented database (OODB), and there are a few performance benchmarks designed for OODBs, e.g., 001 (Cattell and Skeen, 1992), 007 (Carey et al., 1993), Hypermodel (Anderson et al., 1990) and Sequoia 2000 (Stonebraker et al., 1993). Nevertheless, these benchmarks do not have a synthetic database generator that produces databases with different schema complexities and sizes. Their main purpose is also restricted to performance comparison between database systems of the same kind. Since our research is related to web data generation, we have also studied the work related to web measurement. The research in web measurement can be divided into three main categories. Cho and Garcia-Molina conducted experiments on more than half million web pages over 4 months to estimate how web pages evolve over time (Cho and Garcia-Molina, 2000). Their focus is on the life span and update frequency of web pages. Other efforts to measure web evolution include the OCLCs Web Characterization Project 3 and the work by Douglis et al. (Douglis et al., 1997). Another category of web measurement research focuses on measuring the network connectivity structure of Internet (Faloutsos et al., 1999). Such efforts study the complexity of the network connections of Internet so that efficient network protocols can developed. The last category of web measurement is related to analysing the link structures among web pages within the global web. Kleinberg identified two important classes of web pages known as the authoritative and hub web pages (Kleinberg, 1999). Authoritative web pages refer to pages with strong similarity in theme while hub web pages provides dense links 3 OCLCs Web Characterization Project, URL ¼ http://wcp.oclc.org.
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
Hubs Pages Authoritative Pages
Fig. 1. Authoritative and hub web pages.
to the former. Authoritative web pages are universally popular pages such as the home pages of various search engines. Example of hub web pages are pages of organizations and individuals that provide links to these search engines’ home pages. Their connectivity relationships can be shown in Fig. 1. Kumar et al. further developed the algorithms to identify interesting web pages using their link structures, and extract knowledge from them (Kumar et al., 1999). As the above three categories of web measurement research focused on the entire web, the characteristics of web pages derived is specific to the entire global web. While these characteristics holds for the entire web, they may not applicable to subsets of the web. In contrast, WEDAGEN aims to allow users to define their own sets of input parameters to generate web information. Hence, we consider our research unique compared to the existing web measurement work.
table. Each node instance represents a web page and each link instance represents a hyperlink from one web page to another. Node instances are described by a set of attributes namely URL, title, format, size, date and text. The format of a node (web page) refers to the file type of the node which can be HTML, postscript, plain text etc. The link instances have source_URL, target_URL, label and link_type as the main attributes. The link_type of a link can be interior, local or global. Fig. 2 depicts an example schema of a web table stored as a schema file. Fig. 3 shows the same schema graphically. In the schema shown, a, b, and c are node variables. Node a is a source node while c is a sink node. e and f are link variables. Inter-connectivities among the node variables have been defined. For example, node variable a is linked to node variable b via the link variable e. Predicates have been defined on the nodes and link variables. For example, Title CONTAINS ‘‘commerce’’ is a predicate defined on the node variable b. The predicate requires instances of b to have ‘‘commerce’’ included in their titles. A predicate on any textbased attribute of a node or link variable can involve either CONTAINS or EQUALS. The argument of CONTAINS predicate is a list of query terms while that of EQUALS is a string. To distinguish between individual node and link instances, unique IDs are assigned to them. A node ID uniquely identifies a node instance. Similarly, all link instances are assigned unique link IDs. The WHOWEDA system can maintain multiple web databases each consisting of a set of web tables (Lim
1.2. Overview of WHOWEDA To realize WHOWEDA as a web warehousing system, a data model called Web Data Model has been defined to represent and manipulate web information (Cao et al., 2000a). In this subsection, we give an overview of the data representation adopted by the web data model. The proposed web data model captures web information as a set of web tables, each consisting of a web schema and a set of web tuples. The web schema describes the meta-information that binds tuples in the web table. It defines a directed graph consisting of a set of node variables, link variables, connectivities and predicates defined on the attributes of node and link variables. A source node in a schema refers to a node variable with no incoming link i.e., it is a start node of the schema graph. The same schema may have more than one source node. A sink node is one that has no outgoing link. A web tuple is a directed graph consisting of node and link instances satisfying the web schema of the web
31
Fig. 2. An example web schema.
Fig. 3. Graphical representation of schema.
32
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
et al., 1998). While the web tables may be distinct in their schema and tuples, they may share common web pages associated with their node instances. A web table is stored as a set of files described below: • Web schema file: The web schema file contains the schema information about the web table. • Web tuple file: Each tuple in the web tuple file consist of node and link instances forming the tuple. The web tuple file maintains the node attributes like the node ID and URL, and link attributes like target node ID, label and link_type. • Node pool file: Since a node instance may be shared by multiple web tuples from possibly more than one web table, the complete information about node instances are stored in a single node pool file shared by all web tables. The node information stored includes the node ID, URL, title, format, size, text and the complete list of its outgoing links. • Web pages: This are files storing the actual web pages generated for the individual node instances. 1.3. Objectives and scope WEDAGEN is designed to be a parameter-driven synthetic web database generator that generates web data for system testing and performance evaluation. It allows maximum flexibility in web data generation by incorporating a set of parameters that specifies the complexity, size and content of the web data to be generated. In the following, we describe the main objectives to be accomplished by WEDAGEN. • Verification of web database system functionalities: WEDAGEN can support testing of web database system functionalities by generating web objects satisfying different properties specified by the system developers. As part of the WHOWEDA project, WEDAGEN generates web pages that are inter-connected in the topologies required and the web pages are further assembled into web tables. WHOWEDAs functionalities can therefore be thoroughly tested with the generated web information. Such a strategy significantly reduces the amount of time required for system testing since one does not need to spend much time finding the real web sites suitable for system testing and downloading their web pages. To many other web tools and web database systems, the web pages generated by WEDAGEN can also be useful for their system testing. For example, the functionality of a web site compression software can be tested by having WEDAGEN generated a web collection consisting of inter-connected web pages. WEDAGEN can only meet the system testing requirement of these tools and database systems when the web data generated can be configured by some user input parameters.
• Performance evaluation: Using WEDAGEN, one can create a testbed for conducting experiments that evaluate the performance of web database systems (or tools) like WHOWEDA. Performance evaluation sometimes involves the performance comparison between different approaches of implementing a subsystem of the same web database system (or tool), e.g. query processor. Performance evaluation may sometimes involve comparing some common subsystem of different web database systems. In both cases, WEDAGEN is able to speed up the evaluation process by synthesizing web data of required size and complexity. In this paper, we have identified a set of input parameters for web database generation. Using our web database generator WEDAGEN, users can specify different input parameter values to automatically derive web pages and tuples in a web table. WEDAGEN has been implemented. We have also conducted a formal analysis on the generated web data and determined the estimated number of generated web tuples from userspecified parameters. An empirical evaluation of time taken to generate web data was also performed. Although our research aims to generate web data with characteristics controlled by user-specified input parameters, it is not part of our objective to generate synthetic web data similar to the entire World Wide Web. Instead, we are interested in simulating only subsets of the entire web. For example, WEDAGEN can be used to generate web pages within a company’s web site. Such small proportion may demonstrate characteristics quite different from the entire web. Using the input parameters, such characteristics can be simulated by our synthetic web databases. 1.4. Paper outline The rest of this paper is structured as follows: Section 2 discusses the design and an overall web information generation strategy adopted by WEDAGEN. Section 3 deals mainly with the input parameters required by WEDAGEN. Sections 4 and 5 describe the creation of node and link instances and the generation of web tuples, respectively, as the two important steps in synthetic web table generation. In Section 6, we will present some theoretical results for the generated web databases. Section 7 gives a preliminary assessment of WEDAGEN. Finally, conclusion and the future work are presented in Section 8. 2. The design of WEDAGEN 2.1. Design objectives WEDAGEN is to be used for performance evaluation and system testing of a web database system. In the
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
2.2. System overview of WEDAGEN
design of WEDAGEN, several criteria have been considered. The implication of these design criteria and the decisions we made in the design of WEDAGEN are described below:
Fig. 4 depicts the overall architecture of WEDAGEN. It consists of three modules namely, specific parameter generation module, instance generation module and tuple extraction module. Every web table has a schema consisting of node variables, link variables, connectivities, and predicates. The web schema has to be given to WEDAGEN as an input parameter. In addition, WEDAGEN requires other input parameters to be given to control the size of web table to be generated and the connectivities between its nodes instances. Many of these input parameters are associated with the node and link variables in the schema. Instead of having a user specifying the same set of input parameters for each and every node and link variables, WEDAGEN allows default parameter values to be assigned to these variables. This set of default parameter values is stored in the default parameter file. Given the input schema and the default parameters, the specific parameter generation module generates a specific parameter file containing the default parameter values assigned to all the node and link variables in the schema. Within the specific parameter file, the user can modify the parameter values for individual node and link variables to control the actual composition of node and link instances finally generated. Once the input parameters for individual node and link variables are available, the instance generation module will generate the node and link instances for the web table according to the parameter values in specific parameter file. A set of selectivity parameters can be given to constrain the proportion of web pages satisfying some predicates not found in the web schema. These are specified in the selectivity parameter file. The instance generation module also generates the actual web pages for the node instances. In WEDAGEN, all the
• Scalability: To meet the objectives given in Section 1.3, the synthetic database generator should be designed to generate variable amount of data. We would like to be able to generate databases consisting of web documents in the order of tens of thousands. The database size should be large enough to be realistic for performance evaluation as query performance differences are magnified on large databases. In WEDAGEN, scalability is achieved by allowing users to specify input parameters that determine the size of the database to be generated and ensuring that the generator algorithm is able to produce as much as can be accommodated by the underlying operating system. • Flexibility: The flexibility criteria refers to control over the structure of web information to be generated. In the case of WHOWEDA, the structure of web information refers to schemas of the web tables. For testing and performance evaluation of web database systems, it is necessary to have tables with varying complexities. WEDAGEN is designed to accept a web schema as an input parameter, and web schemas of different topologies and predicates can be specified. • Simplicity: As web schema becomes an input parameter of WEDAGEN, users may have to specify a separate set of parameters for each node and link variables in the web schema. The parameter specification process could therefore be painstaking and error-prone. To ensure that WEDAGEN is easy to use, we have designed WEDAGEN to support default parameter values to be assigned to node and link variables if specialized parameter values are not required.
input schema file
dictionary
Specific parameter generation module
selectivity parameter file
web pages
Instance generation module
Tuple extraction module node & link instances
default parameter file
33
specific parameter file
user Fig. 4. System architecture of WEDAGEN.
web tuples
web tables
34
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
terms required to form the URLs, titles and text of web pages are obtained from a dictionary containing a huge collection of terms (see Section 4 for more details). For web database systems not using web table constructs, the web pages generated can be directly used for evaluation or testing purposes. The last module of WEDAGEN, i.e., the tuple extraction module, constructs tuples for the web table from the node and link instances generated by the instance generation module. Each tuple is a list of node IDs and link IDs which satisfy the input web schema. The tuples extracted are stored in the web table files. The web table files are now complete with the node, link and tuple information.
3. Input parameters for WEDAGEN To synthesize a web table consisting of a collection of interlinked web documents, it is necessary to exercise some control over the size, complexity and content of the web table generated through the following four categories of input parameters. • Web schema parameter: This refers to the schema of the web table to be generated. • Instance parameters: This refers to the parameters associated with the given web schema and are used for data generation. • Control parameters: This refers to parameters associated with the node and link variables of the web schema for verification purposes. • Selectivity parameters: This refers to parameters that impose constraints on the text content of the web pages to be generated in addition to those already specified in the web schema. Both the instance and control parameters are included by the default and specific parameter files. Most of the instance and control parameters are associated with node and link variables. Hence, they are the input parameters that have to be specified for each node and link variable in the specific parameter file. 3.1. Web schema parameter WEDAGEN aims to support the evaluation of web database system performance and the testing of its functionalities. It is therefore desired for WEDAGEN to generate web tables with different schemas. The schema parameter consists of four components, namely nodes, links, connectivities and the predicates specified in the format illustrated by Fig. 2. At present, WEDAGEN supports only schemas containing no cyclic connectivities among their node variables. The web schema parameter can be either specified using a text editor or automatically generated using a schema generator.
3.2. Instance parameters This set of parameters determines the properties of individual node and link instances to be generated by WEDAGEN. The parameters are directly used to generate the node and link instances, and to construct web tuples. • Approximate number of tuples (NumTuples): This parameter indicates the approximate number of tuples in the web table to be generated. Since NumTuples is merely used as a guideline to determine final number of tuples generated, the actual number of tuples created may differ. A formal estimation of the number of generated tuples is given in Section 6. • Number of instances for a node variable (NumNodeInstances): In the default parameter file, this parameter determines the number of node instances to be generated for a source node variable. In the specific parameter file, this parameter is associated with every node variable. In the default parameter file, users are not required to specify the NumNodeInstances parameter for each non-source node variable since the number of non-source node instances can be computed directly or indirectly from the number of source node instances by considering the fanout from each node instance (see Section 4). In cases where the pre-determined values are not appropriate, users can adjust the NumNodeInstances value for each node variable in the specific parameter file. Note that the node instances generated must satisfy the predicate(s) defined on the source node variable. For example, let a be a source node variable with the predicate a.title CONTAINS ‘‘java’’, and a NumNodeInstances of 5. WEDAGEN will generate 5 node instances for a and these instances have their titles containing ‘‘java’’. Note that if a predicate of the form a.URL EQUALS ‘‘http://xxx.yyy.com’’ is specified on a source node variable, the NumNodeInstances value will not be used since only one a node instance will be generated. • Fan-out per node instance for a link variable (FanOut): This positive integer parameter controls the number of outgoing links from a node instance to other node instances through a specific link variable. For example, given a connectivity aheib with predicate e.label CONTAINS ‘‘text’’, and a fanout of seven specified on the link variable e, WEDAGEN will generate seven outgoing links from each a instance to the b instances. Each link will be associated with a label containing the term ‘‘text’’. • Approximate number of keywords per node instance for a node variable (NumKeyWordsPerNodeInstance): This parameter determines the number of keywords to be included by the text attribute of each node instance of a node variable. Furthermore, these
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
keywords are indexed for efficient keyword searching. The keywords to be included are randomly chosen from a dictionary containing about 30,000 terms retrieved from the WWW. The keywords found in the text attribute of a node instance must also be included in the web page associated with the node instance. Note that if a predicate of the form b.text CONTAINS ‘‘java’’ is specified on a node variable b as part of the schema, the query term java will also be included in the text attribute of each b instance. Note that NumKeyWordsPerNodeInstance only determines an approximate number of keywords finally generated as it does not include the query terms specified on the node instances. • Approximate number of words per node instance for a node variable (NumWordsPerNodeInstance): This parameter controls the size of the web page generated for each node instance of a node variable. The words in the generated web pages are again chosen from the dictionary. The parameter value excludes the query words found in the predicates associated with the node variable, but includes the number of keywords associated with the text attribute of the node variable. Hence, NumWordsPerNodeInstance of a node variable is always larger than or equal to the corresponding NumKeyWordsPerNodeInstance. For example, let b.text CONTAINS ‘‘school’’ be a predicate on the node variable b in the web schema, and the values of NumKeyWordsPerNodeInstance and NumWordsPerNodeInstance be 3 and 10 respectively. The word ‘‘school’’, the three keywords that form the text attribute, and seven other words randomly chosen from the dictionary will form the text of the web page of each b node instance. • Approximate number of words for each link label (NumWordsPerLinkLabel): This parameter controls the length of the label associated with each link variable in the input schema. The words used in the link label are randomly chosen from the dictionary. The query word(s) in the predicate defined on the link label of a link variable will also be included by the link label of a corresponding link instance. For example, let NumWordsPerLinkLabel be 10 for link variable e that connects node variables a and b, i.e. aheib. Ten randomly chosen words and the query word(s) appearing in the predicate(s) on e will be included in the link label associated with each e instance. Note that when a predicate of the form e.label EQUALS ‘‘java’’ is given in the schema, each link instance corresponding to e will only contain the word java, i.e. NumWordsPerLinkLabel does not apply to e. • Approximate number of words for the hostname of the URL assigned to each node instance of a node variable (NumWordsPerHostname): This parameter controls the number of words used to construct the hostnames of the URLs generated for the node instances of a
35
node variable. The NumWordsPerHostname of a URL is defined by the number of words separated by dots and it excludes the default prefix word ‘‘http://www’’. For example, a NumWordsPerHostname of three will lead to URL of the form http:// www.word1.word2.word3/ where word1, word2, and word3 are randomly chosen from the dictionary. • Approximate number of words for the web page title of each node instance of a node variable (NumWordsPerTitle): This parameter controls the number of words to be chosen to form the title of a web page corresponding to each node instance of a node variable. Similar to NumWordsPerNodeInstance, NumWordsPerTitle does not include the query word(s) of a predicate specified on the title of the node variable. • Type of link for a link variable (LocalGlobalLink): This parameter determines if local or global links are to be generated for a link variable. When the link type of a link variable is global, we assign 0 to the flag LocalGlobalLink. The generated URLs of the source and destination node instances associated with the link variable will carry different site addresses. A local link type is indicated by having LocalGlobalLink ¼ 1. All instance parameters except NumTuples are associated with some node or link variable. Hence, these instance parameter values in the default parameter file are replicated for every node and link variable in the specific parameter file by the specific parameter generation module. 3.3. Control parameter At present, only one control parameter is used in WEDAGEN. Unlike the other parameters, a control parameter is not used for data generation. It indicates some criteria to be satisfied by the node and link instances generated. In order to satisfy the criteria, WEDAGEN may have to require the user to modify the instance parameters and re-generate the synthetic node and link instances. • Maximum Fan-in of each node instance of a node variable (FanIn): To avoid clustering of incoming links to a particular node instance, the maximum fan-in of the node instances can be specified. Like most instance parameters, FanIn parameters are associated with node variables. Hence, the parameter is found in the default parameter file, and is replicated for all node variables in the specific parameter file. A user can modify the FanIn parameter for each node variable to impose different fan-in criteria for different node variables. After all node and link instances have been generated by the instance generation module, they are checked
36
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
against the FanIn parameters associated with the node variables. If any of the node instances has a fan-in exceeding the FanIn parameter for its node variable, WEDAGEN will alert the user who may choose to modify the default parameter values and re-generate the node and link instances. 3.4. Selectivity parameters The selectivity of a predicate refers to the percentage of the node instances that satisfy the given predicate. While WEDAGEN always ensures that all predicates in the given web schema will be satisfied by the node and link instances generated, it is still useful to impose some additional constraints on the text content of the web pages corresponding to the generated node and link instances. In particular, the specially calibrated content of web pages can be used to evaluate the performance of web operations and web queries. In WEDAGEN, the constraints on the text content of the web pages to be generated are determined by a set of predicates together with their selectivities. Note that two kinds of predicates can be specified, namely predicates associated with some node variables, and predicates associated with the entire web table. • Selectivity of a predicate specified on some node variable (NodeSelectivity): This parameter determines the number of instances of a particular node variable satisfying a predicate not found in the input web schema. For example, given a predicate a.URL CONTAINS ‘‘language’’ and its selectivity of 0.6, WEDAGEN will generate 60% of the node instances of a having their URLs containing language. • Selectivity of a predicate specified on the entire web table (TableSelectivity): A predicate specified on the entire web table is one that does not require any node variable. Given such a predicate and its selectivity, WEDAGEN will generate the corresponding proportion of node instances in the entire web table (irrespective of their node variables) satisfying the given predicate. For example, the predicate title CONTAINS ‘‘hello’’ with selectivity 0.4 will lead to 40% of the node instances having their titles containing ‘‘hello’’. NodeSelectivity and TableSelectivity are specified in the selectivity parameter file. Each parameter entry consists of a predicate and its selectivity value. 4. Instance generation module This section describes in detail the instance generation module which creates the actual node and link instances and stores them in the node pool file and in the web table files. The major steps in instance generation are summarized as follows:
• Step 1: Determining the number of instances for each non-source node variable In this step, we calculate the number of instances for each non-source node variable since this piece of information is not directly provided by the user. The number of node instances of a non-source node variable is computed from the number of node instances of the other node variables having direct links to it and the fanouts associated with these links. For example, suppose c is a non-source node variable involved in two connectivities, aheic and bhf ic where a and b are source node variables. Let the NumNodeInstances values of a and b be 5 and 4 respectively, and the FanOut of e and f be 2 and 3 respectively. The number of c instances can be computed by maximum of ðNumNodeInstances of a FanOut of eÞ and ðNumNodeInstances of b FanOut of f Þ. That is, there are 12 c instances to be generated. Note that if a predicate of the form c.URL EQUALS ‘‘http://www.xxx.yyy/zzz’’ is associated with the nonsource node variable c, there will be only one c instance to be generated. Based on the above strategy that determines the number of node instances for a node variable, one can derive the following lemma: Lemma 1. Given a schema with no predicate of the form hnode vari.URL EQUALS hURL Stringi, the number of node instances for a node variable x is always no larger than that of any node variable to which x is linked. Proof. Let y be a node variable such that x is linked to y through a link variable e. Since y is a non-source node, the number of instances for y is determined by one of the node variables (including x) having direct links to y which gives the maximum NumNodeInstances Fanout value. In other words, NumNodeInstances of y P ðNumNodeInstances of x Fanoutof eÞ. Since Fanout of e > 1, NumNodeInstances of x 6 NumNodeInstances of y. Due to Lemma 1, one can conclude that the number of instances to be generated for a sink node variable is no fewer than that of its predecessor node variables. • Step 2: Generating the URLs of node instances: Having determined the number of instances for each node variable, the URLs of all node instances are generated. Recall that predicates on URLs may be specified within a web schema. The instance generation module would ensure that the URLs generated satisfy these predicates. Each URL consists of http followed by a hostname and a filename for the web page determined by the node variable name and node id of the corresponding node instance. The length of hostname in a URL is determined by the NumWordsPerHostname parameter. Moreover, the proportion of URLs
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
generated must also satisfy the selectivity parameters involving URLs. • Step 3: Creating node and link instances: This step determines the values of node attributes such as text, date etc., creates the outgoing link sets of node instances, and constructs the corresponding web pages. All generated node and link instances are stored in the node pool file and web table files. The various sub-steps have been elaborated below: Creating non-URL attributes of node instances: To create the text attribute of a node instance, NumKeyWordsPerNodeInstance words are randomly chosen from the dictionary. The text attribute of the node instance may also include some words suggested by predicates in the schema and the selectivity parameters. All keywords in the text would also be part of the textual content of the corresponding web page. Similar to the text attribute, the instance generation module generates the title attribute based on instance parameters, web schema, and selectivity parameters. At present, the dates of the node instances are simply the current date. Generating link sets for node instances: Having generated the attributes of node instances, the outgoing links of each node instance have to be created before the instance generation module proceeds to create the web pages. The number of outgoing links from a particular node instance is generated based on the FanOut parameter associated with the corresponding node variable. The target URL of each link instance is determined randomly from the URLs of node instances corresponding to the target node variable. To create the link label of a link instance, NumWordsPerLinkLabel words are randomly selected from the dictionary under normal circumstances. When the input web schema includes some predicate(s) on the link label of a link variable, the instance generation module must ensure that the the link labels of the corresponding link instances satisfy the predicate. Creating web pages: The node and link information generated so far are written into web pages. At the same time, the actual textual content of these web pages are randomly generated using the NumWordsPerNodeInstance parameters of the node variables. The instance generation module also includes in the generated web pages some images randomly chosen from a collection of GIF files. The size of the node instance is then determined by the size of its web page.
5. Tuple extraction module The instance generation module generates only node and link instances inter-connected as directed graphs but
37
Fig. 5. Node and link instances from which tuples are extracted.
e0
b3
e2
b5
f2
c9
e2
b5
f3
c10
e3
b6
f2
c9
e3
b6
f3
c10
a2
a1 f0
c7
e0
b3 a2
a1 f1
c8
e1
b4 a2
a1 f0
c7
e1
b4 a2
a1 f1
c8
Fig. 6. Extracted tuples.
not the individual tuples. It is therefore the responsibility of tuple extraction module to extract tuples from the node and link instances by navigating the directed graphs composed of these instances. The number of tuples to be generated will be no more than the NumTuples input parameter. Fig. 5 depicts an example directed graph of node and link instances for the input web schema involving node variables a, b, c and d, and link variables e and f, and connectivities aheib and ahf ic. The IDs of the node and link instance are given in Fig. 5. The tuples extracted by tuple extraction module are shown in Fig. 6. Each tuple is represented by the IDs of the node and link instances involved in the tuple. The tuples generated are stored in the web table files. Hence, the web table files are now complete with all node, link and tuple information.
6. Formal analysis of generated web databases In this section, we conduct a formal analysis of the web databases generated by WEDAGEN. The analytical results will help us to determine the relationships between the user-specified parameters and the size of the generated web database. Among the user-specified parameters, the number of node instances for a node variable (denoted by NumNodeInstances), and fan-out per node instance for a link
38
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
variable (denoted by FanOut) are closely related to the number of web tuples generated (denoted by NumTuples) by WEDAGEN. Let NumNodeInstances and Fan-Out of a node variable ni and link variable ek be abbreviated by #ðni Þ and fek respectively. Below is a lemma that determines the maximum number of web tuples that can be generated when the NumNodeInstances for each node variable is given. Lemma 2. Given a web schema G with node variables fn1 ; . . . ; nt g, and a #ðni Þ for each ni . The Qmaximum t number of tuples that can be generated for G is i¼1 #ðni Þ. Proof. Given a web table, all its tuples must be unique. The lemma can be easily shown by considering all possible combinations of node instances for the different node variables. Due to the above lemma, we derive the following theorem. Theorem 1. Given a web schema G fn1 ; . . . ; nt g, and a #ðni Þ for each number of tuples can be generated fek ¼ #ðnkj Þ for every link variable nki ; nkj 2 fn1 ; . . . ; nt g.
with node variables ni . The maximum for G by assigning ek from nki to nkj ,
Proof. We prove the above theorem by showing that by not assigning fek ¼ #ðnkj Þ, the maximum number of tuples cannot be generated. We first note that fek cannot be assigned a value larger than #ðnkj Þ. Suppose for some link variable ek from nki to nkj , we assign a fek value less than #ðnkj Þ. The node instances of nki and nkj , and link instances of ek do not form a complete bipartite graph. That is, there exists a pair of nki and nkj instances that are not connected. Hence, tuples that involve this pair of node instances will not be generated. We therefore conclude that the theorem holds. From the theorem, we know that if the fanout of a link variable is smaller than the number of instances for its target node variable, the unique combinations of node instances that can be derived for tuples will be fewer. In the following theorem, we will further estimate the number of tuples that can be generated from the numbers of instances for node variables and the fanouts for link variables. Theorem 2. Given a connected web schema G, the number of tuples that can be generated (denoted by #ðGÞ 4) is estimated as follows: 4 Note that #ðGÞ may not be the same as the user-specified parameter NumTuples as the latter is given by the user while the former is an estimate derived from #ðni Þ’s and fanouts of link variables.
• If G has only one node variable n but no link variable, #ðGÞ ¼ #ðnÞ • If G contains one or more link variable, we let G0 be a connected web schema obtained from G by removing a link variable ek from nki to nkj . 8 #ðG0 Þ fek > > > > #ðG0 Þ > < #ðnki Þ fek #ðGÞ ¼ #ðnkj Þ > > > > #ðG0 Þ > : fe #ðnkj Þ k
if only nki belongs to G0 if only nkj belongs to G0 if both nki and nkj belong to G0
Proof. The proof is trivial for G containing one node variable but no link variable. We now consider G containing one or more link variable. Let G be obtained from G0 by adding a link variable ek from nki to nkj . If only nki belongs to G0 , each tuple of G0 can be extended with different nkj instances according to the fanout fek . Hence, #ðGÞ ¼ #ðG0 Þ fek . If only nkj belongs to G0 , G tuples can be formed by each nki instance linking to an nkj instance in G0 . As each nkj instance can be involved in G0 =#ðnkj Þ tuples of G0 . Hence, #ðGÞ ¼ ð#ðG0 Þ=#ðnkj ÞÞ #ðnki Þ fek . If both nki and nkj belong to G0 , the probability of a pair of nki and nkj instances connected by ek is determined by fek =#ðnkj Þ. Hence, #ðGÞ ¼ ð#ðG0 Þ= #ðnkj ÞÞ fek . The above theorem provides a recursive way to compute the estimated number tuples generated for a web schema. From the above recursive formulas, we obtain the following closed form formula. Corollary 1. Given a connected web schema G with node variables fn1 ; . . . ; nt g and link variables fe1 ; . . . ; er g. t r Y Y fei #ðGÞ ¼ #ðni Þ #ðm iÞ i¼1 i¼1 where mi denotes the target node variable of ei . Discussions The above analysis provides us a means to estimate the number of web tuples generated by WEDAGEN. While we do not expect WEDAGEN to be able to generate web data that simulate the entire web, it is possible to generate web pages with connectivities similar to that of real web graphs with some careful specification of input parameters. For example, the authoritative and hub web pages identified by Kleinberg are important web pages for discovering high quality information out of the global web (Lleinberg, 1999). To synthetically construct web pages resembling authori-
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
39
tative and hub pages, we can define a node variable for the set of authoritative pages, and several other node variables for different types of hub pages. Link variables connecting the latter node variables to the authoritative node variable are also defined. To simulate the considerable overlap in the set of authoritative web pages, we can include several predicates that specify a larger common set of keywords to be shared by all such pages. To simulate a group of hub pages, one can define the corresponding node variable to have a high fanout to instances of the authoritative node variable. 7. Empirical evaluation of WEDAGEN WEDAGEN has been implemented in C++ using Standard Template Library (STL) (Ammeraal, 1997) on a SUN Ultra-30 Workstation with 250 MHz UltraSPARC II Processor and 128 MB RAM. In this section, we describe an empirical evaluation of the web database generation capabilities of WEDAGEN. Our evaluation focuses on the ability of WEDAGEN to generate web tables of different complexities and sizes and the overheads incurred. In our experiments, the overhead of web table generation is measured by the elapsed time required to generate a web table. The elapsed time includes the amount of time spent in generating the node instances, constructing web tuples, and inserting them into web tables. As shown in Fig. 7, four web schemas with different connectivity topologies have been used for the web tables to be generated. The topologies include a linear graph (Schema1), a tree (Schema2), a disconnected graph (Schema3) and a connected graph with multiple source nodes (Schema4). The four corresponding schema files are given in Figs. 8–11. The above web schemas have been selected because we believe that they represent the kinds of web schemas users will specify. We expect each user to be interested in some selected portions of a web site and such selected portions involve very few inter-connected web pages. For example, from an online news website, a user may be interested in the daily online news home page and its links to the world
Schema 1
Schema 2
Schema 3
Schema 4
Fig. 7. Topologies of experimental web schemas.
Fig. 8. The schema file for Schema1.
Fig. 9. The schema file for Schema2.
news pages about terrorism. In such an example, the numbers of node and link variables are small, as in the above four web schemas. To evaluate the overheads of web table generation, web tables of three sizes, i.e., small, medium and large, are generated for each of the above four schemas. In other words, there are altogether 12 web tables to be generated. The table sizes are determined by three sets of default parameters shown in Tables 1–3. Note that a different parameter value for NumNodeInstances for the source node variable is specified for each combination of
40
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42 Table 1 Parameter values for a small-size web table Parameter
Value
NumTuples NumNodeInstances (for source node variables only) –Schema1 –Schema2 –Schema3 –Schema4 FanOut NumWordsPerNodeInstance NumWordsPerLinkLabel NumWordsPerHostname NumWordsPerTitle LocalGlobalLink
600 24 5 1 24 5 200 6 5 6 0
Table 2 Parameter values for a medium-size web table
Fig. 10. The schema file for Schema3.
Parameter
Value
NumTuples NumNodeInstances (for source node variables only) –Schema1 –Schema2 –Schema3 –Schema4 FanOut NumWordsPerNodeInstance NumWordsPerLinkLabel NumWordsPerHostname NumWordsPerTitle LocalGlobalLink
2000 20 2 1 20 10 200 10 10 10 0
Table 3 Parameter values for a large-size web table
Fig. 11. The schema file for Schema4.
web schema and web table size. This is necessary in order for WEDAGEN to generate sufficient node and
Parameter
Value
NumTuples NumNodeInstances (for source node variable only) –Schema1 –Schema2 –Schema3 –Schema4 FanOut NumWordsPerNodeInstance NumWordsPerLinkLabel NumWordsPerHostname NumWordsPerTitle LocalGlobalLink
5000 8 1 1 8 25 200 10 10 10 0
link instances for constructing the required number of web tuples. To avoid the amount of user efforts involved in the experiments, each of the above 12 web tables is generated directly from the default parameter values, i.e., the specific parameter generation, instance generation, and tuple extraction modules are invoked automatically without user intervention. Combining the four different web schemas and the three different web table sizes, we have conducted
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
41
Table 4 Number of node instances generated
Fig. 12. Elapsed times for the generation of 12 web tables.
experiments that require WEDAGEN to generate the 12 web tables and measured the elapsed time for each web table generation. The experiments have been conducted on a lightly loaded SUN workstation. The results of our experiments are shown in Fig. 12. Based on the results shown in Fig. 12, the following observations can be made: • As expected, the elapsed time of generating a web table increases with the size of the table. However, the rates of growth are different for web tables with different schemas. This indicates that the complexity of schema also affects the time taken to synthesize web tables. • The elapsed times of generating web tables with linear schemas (Schema1) and inverse-tree schemas (Schema4) are higher than compared to that with those for tree schemas (Schema2) and schemas with multiple linear components (Schema3). This result looks counter-intuitive since Schema1 is relatively simpler compared to the other schemas. Nevertheless, if we examine the number of node instances to be created for Schema1 and Schema4, we can quickly realise that the two schemas actually require the creation of many more node instances than the other two schemas. The numbers of node instances generated for different web schemas are shown in Table 4. Schema1 and Schema4 require 744 and 768 node instances for the small-sized web tables, 2220 and 2240 node instances for the medium-sized web tables, and 5208 and 5216 node instances for the large-sized web tables respectively. The numbers of nodes instances required by Schema1 and Schema2 are at least two times more than that of other schemas. It is noted that the amount of time required for web table generation correlates well with the number of generated node instances.
Web table
Number of node instances
Small-size, Schema1 Small-size, Schema2 Small-size, Schema3 Small-size, Schema4 Medium-size, Schema1 Medium-size, Schema2 Medium-size, Schema3 Medium-size, Schema4 Large-size, Schema1 Large-size, Schema2 Large-size, Schema3 Large-size, Schema4
744 280 62 768 2220 422 222 2240 5208 1276 1302 5216
• For large web tables, the elapsed time of Schema3 becomes larger than that of Schema2. This is caused by slightly larger number of node instances required to generate the web table with Schema3 schema. Specifically, the number of node instances for Schema2 and Schema3 are 1276 and 1302 respectively.
8. Conclusions and future work In this paper, we describe a synthetic web database generator known as WEDAGEN. WEDAGEN is capable of generating web tables and web pages for comprehensive performance evaluation of web database systems. From our research, we have determined a set of parameters that allows web data of different complexities and sizes to be synthesized. The design and implementation of WEDAGEN have been described. A simple empirical evaluation of WEDAGEN indicates that the synthetic web database generator is able to scale up well with increasing web schema complexity and web table size. With the availability of WEDAGEN, the time and efforts required to systematically evaluate the performance of web database systems can be reduced dramatically. One can quickly construct a specially calibrated web database for performance evaluation or testing purposes instead of having to download actual web pages from the WWW. To our best knowledge, the WEDAGEN approach to construct web databases is unique and it represents an important step towards testing applications and systems developed on web information. At present, WEDAGEN is being used in the WHOWEDA project to evaluate the performance of low level data access operations on the web tables. More work also remains to be done for WEDAGEN to make it more versatile. In particular, WEDAGEN can be extended to handle more varied web schema that may include unbound node and link variables, and more expressive predicates.
42
P. Priyadarshini et al. / The Journal of Systems and Software 69 (2004) 29–42
WEDAGEN can be developed into a full-fledged benchmark toolkit for web database systems (not necessary WHOWEDA) (Gray, 1993). To achieve this, a standard set of web queries and a benchmark web database have to be determined. In this aspect, WEDAGEN can be further explored to generate a benchmark web database. We plan to look into the design of both the benchmark web database and the benchmark web query set. With the complete benchmark toolkit, the performance of different web database systems can be compared systematically. Although the present WEDAGEN does not aim to generate web data of the magnitude similar to the entire web, it will be interesting and perhaps useful to develop some synthetic web generators that can generate web pages of individual web sites or even the entire web. To develop such generators, measurement techniques for web sites or entire web have to be developed. Such web measurement will provide the necessary input parameters for synthetic web generation. The generators should also consider incorporating more input parameters to simulate web evolution, i.e. changes to web pages over time. By deriving the performance metrics to measure the quality of generated web data against the real data, one can also conduct a comprehensive verification experiments on the entire web generation process. References Ammeraal, L., 1997. STL for C++ Programmers. Wiley. Anderson, T., Berre, A., Mallison, M., H.H. Porter, B.S., 1990. The hypermodel benchmark. In: Proceedings of the International Conference on Extending Database Technology (EDBT). Venice, Italy, pp. 317–331. Cao, Y., Lim, E.-P., Ng, W.-K., 2000a. On warehousing historical web information. In: Proceedings in the 19th International Conference on Conceptual Modeling (ER 2000). Salt Lake City. Cao, Y., Lim, E.-P., Ng, W.-K., 2000b. Storage management of a historical web warehousing system. In: Proceedings of the International Conference on Database and Expert Systems Applications (DEXA 2000). Greenwich. Carey, M., DeWitt, D., Naughton, J., 1993. The 007 benchmark. In: ACM SIGMOD Conference on Management of Data. Washington, DC, pp. 12–21.
Cattell, R., Skeen, J., 1992. Object operations benchmark. ACM Transactions on Database Systems 17 (1), 1–31. Cho, J., Garcia-Molina, H., 2000. The evolution of the web and implications for an incremental crawler. In: Proceedings of the 26th International Conference on Very Large Data Bases, pp. 200– 209. Douglis, F., Feldmann, A., Krishnamurthy, B., Mogul, J., 1997. Rate of change and other metrics: a live study of the world wide web. In: Proceedings of the USENIX Symposium on Internet Technologies and Systems, pp. 147–158. Faloutsos, M., Faloutsos, P., Faloutsos, C., 1999. On power-law relationships of the internet topology. In: Proceedings of ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 251– 262. Fernandez, M., Florescu, D., Kang, J., Levy, A., Suciu, D., 1997. STRUDEL: a Web site management system. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. Tucson, Arizona, pp. 549–552. Goldman, R., McHugh, J., Widom, J., 1999. From semistructured data to XML: Migrating the Lore data model and query language. In: Proceedings of the ACM International Workshop on the Web and Databases (WebDB’99). Philadelphia, Pennsylvania, pp. 25– 30. Gray, J., 1993. The Benchmark Handbook For Database And Transaction Processing Systems. Morgan Kaufmann Publishers. Kleinberg, J.M., 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (5), 604–632. Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A., 1999. Extracting large-scale knowledge bases from the web. In: Proceedings of 25th International Conference on Very Large Data Bases, pp. 639–650. Lim, E.-P., Ng, W.-K., Bhowmick, S., Qin, F., Ye, X., 1998. A data warehousing system for web information. In: Proceedings of the 1st Asia Digital Library Workshop. Hong Kong. Manber, U., Bigot, P., 1998. Connecting diverse web search facilities. Data Engineering Bulletin 21 (2), 21–27. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., Widom, J., 1997. Lore: A database management system for semistructured data. SIGMOD Record 26 (3), 54–66. Ng, W.-K., Lim, E.-P., Huang, C.-T., Bhowmick, S., Qin, F.-Q., 1998. Web warehousing: An algebra for web information. In: Proceedings of the 3rd IEEE International Conference on Advances in Digital Libraries (ADL’98). Santa Barbara, California, pp. 228– 237. Stonebraker, M., Frew, J., Gardels, K., Meredith, J., 1993. The SEQUOIA 2000 Benchmark. In: Proceedings of the ACM SIGMOD Conference on Management of Data. Washington, DC, pp. 2–11.