Available online at www.sciencedirect.com
Data & Knowledge Engineering 64 (2008) 491–509 www.elsevier.com/locate/datak
Extracting lists of data records from semi-structured web pages ´ lvarez *, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda Manuel A Department of Information and Communication Technologies, University of A Corun˜a, Campus de Elvin˜a s/n. 15071 A Corun˜a, Spain Received 23 May 2007; accepted 2 October 2007 Available online 11 October 2007
Abstract Many web sources provide access to an underlying database containing structured data. These data can be usually accessed in HTML form only, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages while our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have also tested our techniques with a high number of real web sources and we have found them to be very effective. 2007 Elsevier B.V. All rights reserved. Keywords: Data extraction; Data mining/web-based information; Web/web-based information systems
1. Introduction In today’s Web, there are many sites providing access to structured data contained in an underlying database. Typically, these sources provide some kind of HTML form that allows issuing queries against the database, and they return the query results embedded in HTML pages conforming to a certain fixed template. These data sources are usually called ‘‘semi-structured’’ web sources. For instance, Fig. 1 shows an example page containing a list of data records, each data record representing the information about a book in an Internet shop. Allowing software programs to access these structured data is useful for a variety of purposes. For instance, it allows data integration applications to access web information in a manner similar to a database. It also allows information gathering applications to store the retrieved information maintaining its structure and, therefore, allowing more sophisticated processing. Several approaches have been reported in the literature for building and maintaining ‘‘wrappers’’ for semistructured web sources (see for instance [5,25,27,33,29]; [21] provides a brief survey). These techniques allow an administrator to create an ad-hoc wrapper for each target data source, using some kind of tool which *
Corresponding author. ´ lvarez),
[email protected] (A. Pan),
[email protected] (J. Raposo),
[email protected] (F. Bellas), fi
[email protected] E-mail addresses:
[email protected] (M. A (F. Cacheda). 0169-023X/$ - see front matter 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2007.10.002
492
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
Fig. 1. Example HTML page containing a list of data records.
nature varies depending of the approach used. Once created, wrappers are able to accept a query against the data source and return a set of structured results to the calling application. Although wrappers have been successfully used for many web data extraction and automation tasks, this approach has the inherent limitation that the target data sources must be known in advance. This is not possible in all cases. Consider, for instance, the case of ‘‘focused crawling’’ applications [8], which automatically crawl the web looking for topic-specific information. Several automatic methods for web data extraction have been also proposed in the literature [3,11,22,34], but they present several limitations. First, [3,11,22] require multiple pages generated using the same template as input. This can be inconvenience because a sufficient number of pages need to be collected even if we do not need to extract data from them. Second, the proposed methods make some assumptions about the pages containing structured data which do not always hold. For instance, [34] assumes the visual space between two data records in a page is always greater than any gap inside a data record (we will provide more detail about these issues in the related work section). In this paper, we present a new method to automatically detect a list of structured records in a web page and extract the data values that constitute them. Our method requires only one page containing a list of data records as input. In addition, it can deal with pages that do not verify the assumptions required by other previous approaches. We have also validated our method in a high number of real websites, obtaining very good effectiveness. 1.1. Organization of the paper The rest of the paper is organized as follows. Section 2 describes some basic definitions and models our approach relies on. Sections 3–5 describe the proposed techniques and constitute the core of the paper.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
493
Section 3 describes the method we use to detect the data region in the page containing the target list of records. Section 4 explains how we segment the data region into individual data records. Section 5 describes how we extract the values of each individual attribute from the data records. Section 6 describes our experiments using our method with real web pages. Section 7 discusses related work. Finally, Section 8 concludes the paper.
2. Definitions, models and problem formulation In this section, we introduce a set of definitions and models we will use through the paper. 2.1. Lists of structured data records We are interested in detecting and extracting lists of structured data records embedded in HTML pages. In this section, we formally define the concept of structured data and the concept of list of data records. A type or schema is defined recursively as follows [1,3]: 1. The Basic Type, denoted by b, represents a string of tokens. A token is some basic unit of text. For the rest of the paper, we define a token to be a text string or a HTML tag. 2. If T1, . . . , Tn are types, then their ordered list hT1, . . . , Tni is also a type. We say that the type hT1, . . . , Tni is constructed from the types T1, . . . , Tn using a tuple constructor of order n. 3. If T is a type, then {T} is also a type. We say that the type {T} is constructed from T using a set constructor. An instance of a schema is defined recursively as follows: 1. An instance of the basic type, b, is any string of tokens. 2. An instance of type hT1, T2, . . . , Tni is a tuple of the form hi1, i2, . . . , ini where i1, i2, . . . , in are instances of types T1, T2, . . . , Tn, respectively. Instances i1, i2, . . . , in are called attributes of the tuple. 3. An instance of type {T} is any set of elements {e1, . . . , em}, such that ei(1 6 i 6 m) is an instance of type T. According to this definition, we define a list of structured data records as an instance of a type of the form {hT1, T2, . . . , Tni}. T1, T2, . . . , Tn are the types of the attributes of the record in the list. For instance, the type of a list of books where the information about each book includes title, author, format and price could be represented as {hTITLE, AUTHOR, FORMAT, PRICEi}, where TITLE, AUTHOR, FORMAT and PRICE represent basic types. This definition can be easily extended to support also optional fields and disjunctions in the target data as shown in [3]. An optional type T is equivalent to a set constructor with the constraint that any instantiation of it has a cardinality of 0 or 1. Similarly, if T1 and T2 are types, a disjunction of T1 and T2 is equivalent to a type hT1, T2i where, in every instantiation of it, the instantiation of either T1 or T2 has exactly one occurrence and the other has zero occurrences. 2.2. Embedding lists of data records in HTML pages The answer to a query issued against a structured data repository will be a list of structured data records of the kind described in the previous section. This list needs to be embedded by a program into an HTML page for presentation to the user. The model we use for page creation is taken from [3]. A value x from a database is encoded into a page using a template T. We denote the page resulting from encoding of x using T by k(T, x). A template T for a schema S, is defined as a function that maps each type constructor s of S as follows: 1. If s is a tuple constructor of order n, T(s) is an ordered set of n + 1 strings hCs1, . . . , Cs(n+1)i. 2. If s is a set constructor, T(s) is a string Cs.
494
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
Given a template T for a schema S, the encoding k(T, x) of an instance x of S is defined recursively in terms of encoding of subvalues of x: 1. If x is of basic type, b, k(T, x) is defined to be x itself. 2. If x is a tuple of form hx1, . . . , xnist, k(T, x) is the string C1k(T, x1)C2k(T, x2) . . . k(T, xn)Cn+1. Here, x is an instance of the sub-schema that is rooted at type constructor st in S, and T(st) = hC1, . . . , Cn+1i. 3. If x is a set of the form {e1, . . . , em}ss, k(T, x) is given by the string k(T, e1)Ck(T, e2)C . . . Ck(T, em). Here x is an instance of the sub-schema that is rooted at type constructor ss in S, and T(ss) = C. This model of page creation defines a precise way for embedding lists of data records in HTML pages. For instance, Fig. 2 shows an excerpt of the HTML code of the page in Fig. 1, along with the used template. This model of page creation captures the basic intuition that the data records in a list are formatted in a consistent manner: the occurrences of each attribute in several records are formatted in the same way and they always occur in the same relative position with respect to the remaining attributes. In addition, this definition also establishes that the data records in a list are shown contiguously in the page. 2.3. Lists of records represented in the DOM tree of pages HTML pages can also be represented as DOM trees [31]. For instance, Fig. 3 shows an excerpt of the DOM tree of the example HTML code shown in Fig. 2. From the model defined in the previous section to embed lists of data records in HTML, we can derive the following properties of their representation as DOM trees:
Fig. 2. HTML source code and page template for page in Fig. 1.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
495
Fig. 3. DOM tree for HTML page in Fig. 1.
Property 1. Each record in the DOM tree is disposed in a set of consecutive sibling subtrees. Additionally, although it cannot be derived strictly from the page creation model, it is heuristically found that a data record comprises a certain number of complete subtrees. For instance, in Fig. 3 the first two subtrees form the first record, and the following three subtrees form the second record. Property 2. The occurrences of each attribute in several records have the same path from the root in the DOM tree. For instance, in Fig. 3 it can be seen how all the instances of the attribute title have the same path in the DOM tree, and the same applies to the remaining attributes. 3. Finding the dominant list of records in a page In this section, we describe how we locate the data region of the page containing the main list of records in the page. From the Property 1 of the previous section, we know finding the data region is equivalent to finding the common parent node of the sibling subtrees forming the data records. The subtree having as root that node will be the target data region. For instance, in our example of Fig. 3 the parent node we should discover is n1. Our method for finding the region containing the dominant list of records in a page p consists of the following steps: 1. Let us consider N, the set composed by all the nodes in the DOM tree of p. To each node ni 2 N, we will assign a score called si. Initially "i=1..jNjsi = 0. 2. Compute T, the set of all the text nodes in N. 3. Divide T into subsets p1, p2, . . . , pm, in a way such that all the text nodes with the same path from the root in the DOM tree are contained in the same pi. To compute the paths from the root, we ignore tag attributes. 4. For each pair of text nodes belonging to the same group, compute nj as their deepest common ancestor in the DOM tree, and add 1 to sj (the score of nj). 5. Let nmax be the node having a higher score. Choose the DOM subtree having nmax as root of the desired data region.
496
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
Now, we provide the justification for this algorithm. First, by definition, the target data region contains a list of records and each data record is composed of a series of attributes. By Property 2 in Section 2.3, we know all the occurrences of the same attribute have the same path from the root. Therefore, the subtree containing the dominant list in the page will typically contain more texts with the same path from the root than other regions. In addition, given two text nodes with the same path in the DOM tree, the following situations may occur: 1. By Property 1, if the text nodes are occurrences of texts in different records (e.g. two data values of the same attribute in different records), then their deepest common ancestor in the DOM tree will be the root node of the data region containing all the records. Therefore, when considering that pair in step 4, the score of the correct node is increased. For instance, in Fig. 3 the deepest common ancestor of d1 and d3 is n1, the root of the subtree containing the whole data region. 2. If the text nodes are occurrences from different attributes in the same record, then in some cases, their deepest common ancestor could be a deeper node than the one we are searching for and the score of an incorrect node would be increased. For instance, in the Fig. 3 the deepest common ancestor of d1 and d2 is n2. By Property 2, we can infer that there will usually be more cases of the case 1 and, therefore, the algorithm will output the right node. Now, we explain the reason for this. Let us consider the pair of text nodes (t11, t12) corresponding with the occurrences of attribute1 and attribute2 in record1. (t11, t12) is a pair in the case 2. But, by Property 2, for each record ri in which attribute1 and attribute2 appear, we will have pairs (t11, ti1), (t11, ti2), (t12, ti1), (t12, ti2), which are in case 1. Therefore, in the absence of optional fields, it can be easily proved that there will be more pairs in the case 1. When optional fields exist, it is easy to see that it is still very probable. This method tends to find the list in the page with the largest number of records and the largest number of attributes in each record. When the pages we want to extract data from have been obtained by executing a query on a web form, we are typically interested in extracting the data records that constitute the answer to the query, even if it is not the larger list (this may happen if the query has few results). If the executed query is known, this information can be used to refine the above method. The idea is very simple: in the step 2 of the algorithm, instead of using all the text nodes in the DOM tree, we will use only those text nodes containing text values used in the query with operators whose semantic be equals or contains. For instance, let us assume the page in Fig. 3 was obtained by issuing a query we could write as (title contains ‘java’) AND (format equals ‘paperback’). Then, the only text nodes considered in step 2 would be the ones marked with an ‘*’ in Fig. 3. If the query used does not include any conditions with the aforementioned operators, then the algorithm would use all the texts. The reason for this refinement is clear: the values used in the query will typically appear with higher probability in the list of results than in other lists of the page. 4. Dividing the list into records Now we proceed to describe our techniques for segmenting the data region in fragments, each one containing at most one data record. Our method can be divided into the following steps: – Generate a set of candidate record lists. Each candidate record list will propose a particular division of the data region into records. – Choose the best candidate record list. The method we use is based on computing an auto-similarity measure between the records in the candidate record lists. We choose the record division lending to records with the higher similarity. Sections 4.2 and 4.3 describe in detail each one of the two steps. Both tasks need a way to estimate the similarity between two sequences of consecutive sibling subtrees in the DOM tree of a page. The method we use for this is described in Section 4.1.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
497
4.1. Edit-distance similarity measure To compute ‘‘similarity’’ measures we use techniques based in string edit-distance algorithms. More precisely, to compute the edit-distance similarity between two sequences of consecutive sibling subtrees named ri and rj in the DOM tree of a page, we perform the following steps: 1. We represent ri and rj as strings (we will term them si and sj). This is done as follows: a. We substitute every text node by a special tag called text. b. We traverse each subtree in depth first order and, for each node, we generate a character in the string. A different character will be assigned to each tag having a different path from the root in the DOM tree. Fig. 4 shows the strings s0 and s1 obtained for the records r0 and r1 in Fig. 3, respectively. 2. We compute the edit-distance similarity between ri and rj, denoted as es(ri, rj), as the string edit distance between si and sj (ed(si, sj)). The edit distance between si and sj is defined as the minimum number of edition operations (each operation affecting one char at a time) needed to transform one string into the other. We calculate string distances using the Levenshtein algorithm [23]. Our implementation only allow insertion and deletion operations (in other implementations, substitutions are also permitted). To obtain a similarity score between 0 and 1, we normalize the result dividing by (len(si) + len(sj)) and then subtract the obtained result from 1 (1). In our example from Fig. 4, the similarity between r0 and r1 is 1 (2/(26 + 28)) = 0.96. esðri ; rj Þ ¼ 1
edðsi ; sj Þ lenðsi Þ þ lenðsj Þ
ð1Þ
4.2. Generating the candidate record lists In this section, we describe how we generate a set of candidate record lists inside the data region previously chosen. Each candidate record list will propose a particular division of the data region into records. By Property 1, we can assume every record is composed of one or several consecutive sibling sub-trees, which are direct descendants of the node chosen as root of the data region. We could leverage on this property to generate a candidate record list for each possible division of the subtrees verifying it. Nevertheless, the number of possible combinations would be too high: if the number of subtrees is n, the possible number of divisions verifying Property 1 is 2n1 (notice that different records
Fig. 4. Strings obtained for the records r0 and r1 in Fig. 3.
498
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
in the same list may be composed of a different number of subtrees, as for instance r0 and r1 in Fig. 3). In some sources, n can be low, but in others it may reach values in the hundreds (e.g. a source showing 25 data records, with each data record composed of an average of 4 subtrees). Therefore, this exhaustive approach is not feasible. The remaining of this section explains how we overcome these difficulties. Our method has two stages: 1. Clustering the subtrees according to their similarity. 2. Using the groups to generate the candidate record divisions. The next subsections respectively detail each one of these stages. Grouping the subtrees. For grouping the subtrees according to their similarity, we use a clustering-based process we describe in the following lines: 1. Let us consider the set {t1, . . . , tn} of all the subtrees which are direct children of the node chosen as root of the data region. Each ti can be represented as a string using the method described in Section 4.1. We will term these strings as s1, . . . , sn. 2. Compute the similarity matrix. This is a n · n matrix where the (i, j) position (denoted mij) is obtained as es(ti, tj), the edit-distance similarity between ti and tj. 3. We define the column similarity between ti and tj, denoted (ti, tj), as the inverse of the average absolute error between the columns corresponding to ti and tj in the similarity matrix (2). Therefore, to consider two subtrees as similar, the column similarity measure requires their columns in the similarity matrix to be very similar. This means two subtrees must have roughly the same edit-distance similarity with respect to the rest of subtrees in the set to be considered as similar. We have found column similarity to be more robust for estimating similarity between ti and tj in the clustering process than directly using es(ti, tj). X jmik mjk j ð2Þ csðti ; tj Þ ¼ 1 n k¼1::n 4. Now, we apply bottom-up clustering [7] to group the subtrees. The basic idea behind this kind of clustering is to start with one cluster for each element and successively combine them into groups within which interelement similarity is high, collapsing down to as many groups as desired. Fig. 5 shows the pseudo-code for the bottom-up clustering algorithm. Inter-element similarity of a set U is estimated using the auto-similarity measure. The auto-similarity of a set U is denoted s(U) and it is computed as the average of the similarities between each pair of elements in the set (3). X 2 sðUÞ ¼ csðti ; tj Þ ð3Þ jUjðjUj 1Þ ti ;tj 2U We use column similarity as the similarity measure between ti and tj. To allow a new group to be formed, it must verify two thresholds: – The global auto-similarity of the group must reach the auto-similarity threshold Xg. In our current implementation, we set this threshold to 0.9. – The column similarity between every pair of elements from the group must reach the pairwise-similarity threshold Xe. This threshold is used to avoid creating groups that, although showing high overall auto-similarity, contain some dissimilar elements. In our current implementation, we set this threshold to 0.8. Fig. 6 shows the result of applying the clustering algorithm to our example page from Fig. 3. Generating the candidate record divisions. Our method to generate the candidate record divisions is as follows. First, we assign an identifier to each of the generated clusters. Then, we build a sequence by listing in order the subtrees in the data region, representing each subtree with the identifier of the cluster it belongs to (see Fig. 6).
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
499
Fig. 5. Pseudo-code for bottom-up clustering.
Fig. 6. Result of applying the clustering algorithm for example page from Fig. 3.
The data region may contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information. For instance, these sub-trees may contain information about the number of results or web forms to refine the query or to navigate to other result intervals. These subtrees will typically be alone in a cluster, since there are not other similar subtrees in the data region. Therefore, we pre-process the string from the beginning and from the end removing tokens until we find the first cluster identifier that appears more than once in the sequence. In some cases, this pre-processing is not enough and, therefore, these additional subtrees will be still included in the sequence. As we will see, they will typically be removed from the output in the stage of extracting the attributes from the data records, which is described in Section 5. Once the pre-processing step is finished, we proceed to generate the candidate record divisions. By Property 1 in Section 2.3, we know each record is formed by a list of consecutive subtrees (i.e. characters in the string). From our page model, we know records are encoded consistently. Therefore, the string will tend to be formed by a repetitive sequence of cluster identifiers, each sequence corresponding to a data record. Since there may be optional data fields in the extracted records, the sequence for one record may be slightly different from the sequence corresponding to another record. Nevertheless, we will assume they always either start or end with a subtree belonging to the same cluster (i.e. all the data records always either start or end in the same way). This is based on the following heuristic observations: – In many sources, records are visually delimited in an unambiguous manner to improve clarity for the human user. This delimiter is present before or after every record. – When there is not an explicit delimiter between data records, the first data fields appearing in a record are usually mandatory fields which appear in every record (e.g. in our example source the delimiter may be the fragment corresponding to the title field, which appears in every record).
500
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
Fig. 7. Candidate record divisions obtained for example page from Fig. 3.
Based on the former observations, we will generate the following candidate record lists: – For each cluster ci, i = 1..k, we will generate two candidate divisions: one assuming every record starts with cluster ci and another assuming every record ends with cluster ci. For instance, Fig. 7 shows the candidate record divisions obtained for our example of Fig. 3. – In addition, we will add a candidate record division considering each record is formed by exactly one subtree. This reduces the number of candidate divisions from 2n1, where n is the number of subtrees, to 1 + 2k, where k is the number of generated clusters, turning feasible to evaluate each candidate list to choose the best one. 4.3. Choosing the best candidate record list To choose the correct candidate record list, we rely on the observation that the records in a list tend to be similar to each other. This can be easily derived from the page model described in Section 2.2. Therefore, we will choose the candidate list showing the highest auto-similarity. As we have already stated in previous sections, each candidate list is composed of a list of records. Each record is a sequence of consecutive sibling subtrees. Then, given a candidate list composed of the records hr1, . . . , rni, we compute its auto-similarity as the weighted average of the edit-distance similarities between each pair of records of the list. The contribution of each pair to the average is weighted by the length of the compared registers. See Eq. (4). P i¼1::n;j¼1::n;i6¼j esðr i ; r j Þðlenðri Þ þ lenðrj ÞÞ P ð4Þ i¼1::n;j¼1::n;i6¼j lenðri Þ þ lenðr j Þ For instance, in our example from Fig. 7, the first candidate record division is the one showing higher autosimilarity. 5. Extracting the attributes of the data records In this section, we describe our techniques for extracting the values of the attributes of the data records identified in the previous section. The basic idea consists in transforming each record from the list into a string using the method described in Section 4.1, and then using string alignment techniques to identify the attributes in each record. An alignment between two strings matches the characters in one string with the characters in the other one, in such a way
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
501
that the edit-distance between the two strings is minimized. There may be more than one optimal alignment between two strings. In that case, we choose any of them. For instance, Fig. 8 shows an excerpt of the alignment between the strings representing the records in our example. As can be seen in the figure, each aligned text token roughly corresponds with an attribute of the record. Notice that to obtain the actual value for an attribute we may need to remove common prefixes/suffixes found in every occurrence of an attribute. For instance, in our example, to obtain the value of the price attribute we would detect and remove the common suffix ‘‘€’’. In addition, those aligned text nodes having the same value in all the records (e.g. ‘‘Buy new:’’, ‘‘Price used:’’) will be considered ‘‘labels’’ instead of attribute values and will not appear in the output. To achieve our goals, it is not enough to align two records: we need to align all of them. Nevertheless, optimal multiple string alignment algorithms have a complexity of O(nk). Therefore, we need to use an approximation algorithm. Several methods have been proposed in the literature for this task [26,13]. We use a variation of the center star approximation algorithm [13], which is also similar to a variation used in [34] (although they use tree alignment instead of string alignment). The algorithm works as follows: 1. The longest string is chosen as the ‘‘master string’’, m. 2. Initially, S, the set of ‘‘still not aligned strings’’ contains all the strings but m. 3. For every s 2 S: a. Align s with m. b. If there is only one optimal alignment between s and m: i. If the alignment matches any null position in m with a character from s, then the character is added to m replacing the null position. ii. Remove s from S. 4. Repeat step 3 until S is empty or the master string m does not change. The key step of the algorithm is 3.b.i where each string is aligned with the master string and eventually used to extend it. Fig. 9 shows an example of this step. The alignment of the master record and the new record produces one optimal alignment where one char of the new record (‘d’) is matched with a null position in
Fig. 8. Alignment between strings representing the records of Fig. 1.
Fig. 9. Example of string alignment with the master string.
502
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
the master record. Therefore, the new master is obtained by inserting that char in the position determined by the alignment. As it was mentioned in Section 4.2, the data region may contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information (e.g. information about the number of results, web forms to navigate to other result intervals, etc.). As mentioned in Section 4.2, before generating the candidate record divisions, the system makes a first attempt to remove these subtrees. Nevertheless, some subtrees may still be present either at the beginning of the first record or at the end of the last. Therefore, after the multiple alignment process, if the first record (respectively the last) starts (respectively ends) with a sequence of characters that are not aligned with any other record, then those characters are removed. 6. Experiments This section describes the empirical evaluation of the proposed techniques with real web pages. During the development of the techniques, we used a set of 20 pages from 20 different web sources. The pilot tests performed with these pages were used to adjust the algorithm and to choose suitable values for the thresholds used by the clustering process (see Section 4.2). These pages were not used in the experimental tests. For the experimental tests, we chose 200 new websites in different application domains (bookshops, music shops, patent information, publicly financed R&D projects, movies information, etc.). We performed one query in each website and collected the first page containing the list of results. We selected queries to guarantee that the collection includes pages having a very variable number of results (some queries return only 2–3 results while others return hundreds of results). The collection of pages used in the experiments is available online.1 While collecting the pages for our experiments, we found three data sources where our page creation model is not correct. Our model assumes that all the attributes of a data record are shown contiguously in the page. In those three sources, the assumption does not hold and, therefore, our system would fail. We did not consider those sources in our experiments. In the related work section, we will further discuss this issue. We performed the following tests with the collection: – We measured the validity of the heuristic assumed by the Property 1 defined in Section 2.3. – We measured the effectiveness of the extraction process at each stage, including recall and precision measures. – We measured the execution times of the techniques. – We compared the effectiveness of our approach with respect to RoadRunner [11], one of the most significant previous proposals for automatic web data extraction. The following sub-sections detail the results of each group of tests. 6.1. Validity of Property 1 The first test we performed with the collection was checking the validity of the Property 1, defined in Section 2.3. It is important to notice that the Property 2 defined in the same section is not a heuristic (it is directly derived from our page creation model). Nevertheless, Property 1 has a heuristic component which needs to be verified. The result of this test corroborated the validity of the heuristic, since it was verified by the 100% of the examined pages. 6.2. Effectiveness of the data extraction techniques We tested the automatic data extraction techniques on the collection, and we measured the results at three stages of the process:
1
http://www.tic.udc.es/~mad/resources/projects/dataextraction/testcollection_0507.htm.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
503
– After choosing the data region containing the dominant list of data records. – After choosing the best candidate record division. – After extracting the structured data contained in the page. The reason for performing the evaluation at these three stages is two fold: on one hand, it allows to evaluate separately the effectiveness of the techniques used at each stage. On the other hand, the results after the two first stages are useful on their own. For instance, some applications need to automatically generate a reduced representation of a page for showing it on a portlet inside a web portal or on a small-screen device. To build this reduced version for pages containing the answer to a query, the application may choose to show the entire data region or the first records of the list, discarding the rest of the page. Table 1 shows the results obtained in the empirical evaluation. In the first stage, we use the information about the executed query, as explained at the end of Section 3. As it can be seen, the data region is correctly detected in all pages but two. In those cases, the answer to the query returned few results and there was a larger list on a sidebar of the page containing items related to the query. In the second stage, we classify the results in two categories. – Correct. The chosen record division is correct. As it was already mentioned, in some cases, the data region may still contain, either at the beginning or at the end, some subtrees that are not really part of the data but auxiliary information (e.g. information about the number of results, web forms to refine the query or controls to navigate to other result intervals). This is not a problem because, as explained in Section 5, the process in stage 3 removes those irrelevant parts because they cannot be aligned with the remaining records. Therefore, we consider these cases as correct. – Incorrect. The chosen record division contains some incorrect records (not necessarily all). For instance, two different records may be concatenated as one or one record may appear segmented into two. As it can be seen, the chosen record division is correct in the 96% of the cases. It is important to notice that, even in incorrect divisions, there will usually be many correct records. Therefore, stage 3 may still work fine for them. The main reason for the failures at this stage is that, in a few sources, the auto-similarity measure described in Section 4.3 fails to detect the correct record division. Although the correct record division is between the candidates, the auto-similarity measure ranks another one higher. This happens because, in these sources, some data records are quite dissimilar to each other. For instance, in one case where we have two consecutive data records that are much shorter than the rest, the system chooses a candidate division that groups these two records into one. It is worth noticing that, in most cases, only one or two records are incorrectly identified. Regarding this stage, we also experimented with the pairwise similarity threshold (defined in Section 4.2), which was set to 0.8 in the experiments. We tested other three values for the threshold (0.6, 0.7, and 0.9). As it can be seen in Table 2, the achieved effectiveness is high in all the cases, but the best results are obtained with the previously chosen value of 0.8 In stage 3, we use the standard metrics recall and precision. Recall is computed as the ratio between the number of records correctly extracted by the system and the total number of records that should be extracted.
Table 1 Results obtained in the empirical evaluation Stage1 Stage 2
Stage 3
# Correct 198 # Correct 192
# Incorrect 2 # Incorrect 8
# Records to extract # Extracted records # Correct extracted records
3557 3570 3496
% Correct 99.00 % Correct 96.00 Precision 97.93 Recall 98.29
504
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
Table 2 Effect of pairwise threshold Threshold
0.6
0.7
0.8
0.9
Stage 2 – %Correct
94.50
94.50
96.00
93.50
Precision is computed as the ratio between the number of correct records extracted and the total number of records extracted by the system. These are the most important metrics in what refers to web data extraction applications because they measure the system performance at the end of the whole process. As it can be seen in Table 1, the obtained results are very high, reaching respectively to 98.29% and 97.93%. Most of the failures come from the errors propagated from the previous stage. 6.3. Execution times We have also measured the execution time of the proposed techniques. The two most costly stages of the process are: – Generating the similarity matrix described in the Section 4.2. This involves computing the similarity between each pair of first-level subtrees in the data region. – Computing the auto-similarities of the candidate record divisions to choose the best one (as described in Section 4.3). This involves computing the similarity between each pair of records in every candidate record division. Fortunately, both stages can be greatly optimized by caching similarity measures: • When generating the similarity matrix, our implementation of the techniques checks for equality the strings generated from the first-level subtrees. Obviously, only the comparisons between distinct subtrees need to be performed. Notice that since the subtrees represent a list of data records with similar structure, there will typically be many identical subtrees. • A similar step can be performed when computing the auto-similarities of the candidate record divisions. Due to the regular structure of the data, there will typically be many identical candidate records across the candidate divisions. As a consequence, the proposed techniques run very efficiently. In an average workstation PC (Intel Pentium Centrino Core Duo 2 GHz, 1 GB RAM), the average execution time for each page in the above experimental set was of 786 milliseconds (including HTML parsing and DOM tree generation). The process ran in sub-second time for 87% of the pages in the collection. 6.4. Comparison with RoadRunner We also compared the effectiveness of the proposed techniques with respect to RoadRunner [11]. As far as we know, RoadRunner is the only automatic web data extraction system available for download. 2 Compared to our system, RoadRunner performs neither the region location stage nor the record division stage. Its function is comparable to the stage in our approach which extracts the individual attributes from each data record. RoadRunner requires as input multiple pages. Typically, each one of these pages contains data from only one data record. For instance, a typical RoadRunner execution could receive as input a set of ‘book detail’ pages from an online bookshop, and would output the books data.
2
http://www.dia.uniroma3.it/db/roadRunner/software.html.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
505
Table 3 Comparison with RoadRunner #Input records
#Extracted records
#Correct records
Precision
Recall
Proposed techniques
3557
3570
3496
97.93
98.29
RoadRunner
3557
2712
2360
87.02
66.35
Therefore, to generate the input for RoadRunner, we split each page of the collection, generating a new page for each record. This way, each page in the original collection is transformed in a set of pages following the same template which can be fed as input to RoadRunner. Table 3 shows the obtained results. As it can be seen, the obtained precision and recall values are inferior to the ones achieved by the techniques proposed in this paper over the same collection. 7. Related work Wrapper generation techniques for web data extraction have been an active research field for years. Many approaches have been proposed such as specialized languages for programming wrappers [30,14,19], inductive learning techniques able to generate the wrapper from a set of human-labeled examples [25,20,16,15,33,17], or supervised graphical tools which hide the complexity of wrapper programming languages [5,27]. [21] provides a brief survey of some of the main approaches. All the wrapper generation approaches require some kind of human intervention to create and configure the wrapper previously to the data extraction task. When the sources are not known in advance, such as in focused crawling applications, this approach is not feasible. Several works have addressed the problem of performing web data extraction tasks without requiring human input. In [6], a method is introduced to automatically find the region on a page containing the list of responses to a query. Nevertheless, this method does not address the extraction of the attributes of the structured records contained in the page. A first attempt which considers the whole problem is IEPAD [9] which uses the Patricia tree [13] and string alignment techniques to search for repetitive patterns in the HTML tag string of a page. The method used by IEPAD is probable to generate incorrect patterns along with the correct ones, so human post-processing of the output is required. This is an important inconvenience with respect to our approach. RoadRunner [11] receives as input multiple pages conforming to the same template and uses them to induce a union-free regular expression (UFRE) which can be used to extract the data from the pages conforming to the template. The basic idea consists in performing an iterative process where the system takes the first page as initial UFRE and then, for each subsequent page, tests if it can be generated using the current template. If not, the template is modified to represent also the new page. The proposed method cannot deal with disjunctions in the input schema. Another inconvenience with respect to our approach is that it requires receiving as input multiple pages conforming to the same template while our method only requires one. As it was previously mentioned, the page creation model we described in Section 2.3 was first introduced in ExAlg [3]. As well as RoadRunner, ExAlg receives as input multiple pages conforming to the same template and uses them to induce the template and derive a set of data extraction rules. ExAlg makes some assumptions about web pages which, according to the own experiments of the authors, do not hold in a significant number of cases: for instance, it is assumed that the template assign a relatively large number of tokens to each type constructor. It is also assumed that a substantial subset of the data fields to be extracted have a unique path from the root in the DOM tree of the pages. Ref. [22] leverages on the observation that the pages containing the list of results to a query in a semi-structured web source, usually include a link for each data record, allowing access to additional information about it. The method is based on locating the information redundancies between list pages and detail pages and using them to aid the extraction process. This method is not valid in sources where these ‘‘detail’’ pages do not exist and, in addition, it requires of multiple pages to work. Ref. [34] presents DEPTA, a method that uses the visual layout of information in the page and tree editdistance techniques to detect lists of records in a page and to extract the structured data records that form
506
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
it. As well as in our method, DEPTA requires as input one single page containing a list of structured data records. They also use the observation that, in the DOM tree of a page, each record in a list is composed of a set of consecutive sibling subtrees. Nevertheless, they make two additional assumptions: (1) that exactly the same number of sub-trees must form all records, and (2) that the visual space between two data records in a list is bigger than the visual space between any two data values from the same record. It is relatively easy to find counter-examples of both assumptions in real web sources. For instance, neither of the two assumptions holds in our example page of Fig. 3. In addition, the method used by DEPTA to detect data regions is more expensive than ours, since it involves a potentially high number of edit-distance computations. A limitation of our approach arises in the pages where the attributes constituting a data record are not contiguous in the page, but instead they are interleaved with the attributes from other data records. Those cases do not conform to our page creation model and, therefore, our current method is unable to deal with them. Although DEPTA implicitly assumes a page creation model similar to the one we use, after detecting a list of records, they propose some useful heuristics to identify these cases and transform them in ‘‘conventional’’ ones before continuing the process. These heuristics could be adapted to work with our approach. There are several research problems which are complementary to our work (and to the automatic web data extraction techniques in general). Several works [32,4] have addressed the problem of how to automatically obtain attribute names for the extracted data records. Our prototype implementation labels the attributes using techniques similar to the ones proposed in [32]. Another related problem is how to automatically obtain the pages which constitute the input to the automatic data extraction algorithm. Several works [24,28,2,18,10] have addressed the problem of how to automatically interpret and fill in web query forms to obtain the response pages. [12] has addressed the problem of examining a web site to automatically collect pages following the same template. All these works are complementary to our work. 8. Conclusions In this paper, we have presented a new method to automatically detecting a list of structured records in a web page and extracting the data fields that constitute them. We use a page creation model which captures the main characteristics of semi-structured web pages and allows us to derive the set of properties and heuristics our techniques rely on. Our method requires only one page containing a list of data records as input. The method begins by finding the data region containing the dominant list. Then, it performs a clustering process to limit the number of candidate record divisions in the data region and chooses the one having higher auto-similarity according to edit-distance-based similarity techniques. Finally, a multiple string alignment algorithm is used to extract the attribute values of each data record. With respect to previous works, our method can deal with pages that do not verify the assumptions required by other previous approaches. We have also validated our method in a high number of real websites, obtaining very good effectiveness. Acknowledgements This research was partially supported by the Spanish Ministry of Education and Science under Project TSI2005-07730 and the Spanish Ministry of Industry, Tourism and Commerce under Project FIT-3502002006-78. Alberto Pan’s work was partially supported by the ‘Ramo´n y Cajal’ programme of the Spanish Ministry of Education and Science. References [1] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison Wesley, Reading, Massachussetts, 1995. ´ lvarez, A. Pan, J. Raposo, F. Cacheda, F. Bellas, V. Carneiro, Crawling the content hidden behind web forms, in: Proceedings of [2] M. A the 2007 International Conference on Computational Science and Its Applications (ICCSA). Lecture Notes in Computer Science,
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
[3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33]
507
vol. 4706, Part 2, Springer, Berlin/Heidelberg, 2007, pp. 322–333, ISSN: 0302-9743, ISBN-10: 3-540-74475-4, ISBN-13: 978-3-54074475-7. A. Arasu, H. Garcia-Molina, Extracting structured data from web pages, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003. L. Arlota, V. Crescenzi, G. Mecca, P. Merialdo, Automatic annotation of data extracted from large websites, in: Proceedings of the WebDB Workshop, 2003, pp. 7–12. R. Baumgartner, S. Flesca, G. Gottlob, Visual web information extraction with Lixto, in: Proceedings of the 21st International Conference on Very Large DataBases (VLDB), 2001. J. Caverlee, L. Liu, D. Buttler, Probe, cluster, and discover: focused extraction of QA-Pagelets from the Deep Web, in: Proceedings of the 20th International Conference ICDE, 2004, pp. 103–115. S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2003, ISBN 1-55860754-4. S. Chakrabarti, Van den M. Berg, B. Dom, Focused crawling: a new approach to topic-specific web resource discovery, in: Proceedings of the Eighth International World Wide Web Conference, 1999. C. Chang, S. Lui, IEPAD: information extraction based on pattern discovery, in: Proceedings of 2001 International World Wide Web Conference, 2001, pp. 681–688. K. Chang, B. He, Z. Zhang, MetaQuerier over the Deep Web: Shallow Integration Across Holistic Sources, in: Proceedings of the VLDB Workshop on Information Integration on the Web (VLDB-IIWeb), 2004. V. Crescenzi, G. Mecca, P. Merialdo, ROADRUNNER: towards automatic data extraction from large web sites, in: Proceedings of the 2001 International VLDB Conference, 2001, pp. 109–118. V. Crescenzi, P. Merialdo, P. Missier, Clustering web pages based on their structure, Data and Knowledge Engineering Journal 54 (3) (2005) 279–299. G.H. Gonnet, R.A. Baeza-Yates, T. Snider, New Indices for Text: Pat trees and Pat Arrays. Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992. J. Hammer, J. McHugh, H. Garcia-Molina, Semistructured data: the Tsimmis experience, in: Proceedings of the First East-European Symposium on Advances in Databases and Information Systems (ADBIS), 1997, pp. 1–8. A. Hogue, D. Karger, Thresher: automating the unwrapping of semantic content from the wold wide web, in: Proceedings of the 14th International World Wide Web Conference, 2005. C.N. Hsu, M.T. Dung, Generating finite-state transducers for semi-structured data extraction from the web, Information Systems 23 (8) (1998) 521–538. V. Kovalev, S. Bhowmick, S. Madria, HW-STALKER: a machine learning-based system for transforming QURE-Pagelets to XML, Data and Knowledge Engineering Journal 54 (2) (2005) 241–276. Y. Jung, J. Geller, Y. Wu, Ae S. Chun, Semantic deep web: automatic attribute extraction from the deep web data sources, in: Proceedings of the International SAC Conference, 2007, pp. 1667–1672. T. Kistlera, H. Marais, WebL: A programming language for the web, in: Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998, pp. 259–270. N. Kushmerick, D.S. Weld, R.B. Doorenbos, Wrapper induction for information extraction, in: Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), 1997, pp. 729–737. A.H.F. Laender, B.A. Ribeiro-Neto, A. Soares da Silva, J.S. Teixeira, A brief survey of web data extraction tools, ACM SIGMOD Record 31 (2) (2002) 84–93. K. Lerman, L. Getoor, S. Minton, C. Knoblock, Using the structure of web sites for automatic segmentation of tables, in: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004, pp. 119–130. V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady 10 (1966) 707– 710. S. Liddle, S. Yau, D. Embley, On the Automatic Extraction of Data from the Hidden Web. ER (Workshops), 2001, pp. 212– 226. I. Muslea, S. Minton, C.A. Knoblock, Hierarchical wrapper induction for semistructured information sources, Autonomous Agents and Multi-Agent Systems (2001) 93–114. C. Notredame, Recent progresses in multiple sequence alignment: a survey, Technical report, Information Genetique et, 2002. A. Pan et al., Semi-automatic wrapper generation for commercial web sources, in: Proceedings of IFIP WG8.1 Conference on Engineering Inform, Systems in the Internet Context (EISIC), 2002. S. Raghavan, H. Garcı´a-Molina, Crawling the hidden web, in: Proceedings of the 27th International Conference on Very Large Databases (VLDB), 2001. ´ lvarez, J. Hidalgo, Automatically maintaining wrappers for web sources, Data and Knowledge Engineering J. Raposo, A. Pan, M. A Journal 61 (2) (2007) 331–358. A. Sahuguet, F. Azavant, Building intelligent web applications using lightweight wrappers, Data and Knowledge Engineering Journal 36 (3) (2001) 283–316. The W3 Consortium. The Document Object Model. http://www.w3.org/DOM/. J. Wang, F. Lochovsky, Data extraction and label assignment for web databases, in: Proceedings of the 12th International World Wide Web Conference (WWW12), 2003. Y. Zhai, B. Liu, Extracting web data using instance-based learning, in: Proceedings of Web Information Systems Engineering Conference (WISE), 2005, pp. 318–331.
508
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
[34] Y. Zhai, B. Liu, Structured data extraction from the web based on partial tree alignment, IEEE Transactions on Knowledge and Data Engineering 18 (12) (2006) 1614–1628.
´ lvarez is an Assistant Professor in the Department of Information and Communications Technologies, Manuel A at the University of A Corun˜a (Spain), and Product Manager at R&D Department of Denodo Technologies. He earned his Bachelor’s Degree in Computer Engineering from University of A Corun˜a and is working on his Ph.D. at the same university. His research interests are related to data extraction and integration, semantic and Hidden Web. Manuel has managed several projects at national and regional level in the field of data integration and Hidden Web accessing. He has also authored or co-authored numerous publications in regional and international conferences. He also teaches a Master’s degree at the University of A Corun˜a. In 2004, the HIWEB project was awarded with the GaliciaTIC Award for technology innovation by the Fundacio´n Universidad-Empresa (FEUGA) of the Xunta of Galicia (the regional autonomous government of Galicia, in Spain).
Alberto Pan is a senior research scientist at the University of A Corun˜a (Spain) and a consultant for Denodo Tehnologies. He received a Bachelor of Science Degree in Computer Science from the University of A Corun˜a in 1996 and a Ph.D. Degree in Computer Science from the same university in 2002. His research interests are related to data extraction and integration, semantic and Hidden Web. Alberto has led several projects at national and regional level in the field of data integration and Hidden Web accessing. He has also authored or co-authored numerous publications in scientific magazines and conference proceedings. Furthermore, he has held several research, teaching and professional positions in institutions such as CESAT (a telematic engineering company) the University Carlos III of Madrid and the University of Alfonso X el Sabio of Madrid. In 1999, Alberto was awarded the Isidro Parga Pondal Award for technology innovators by the Diputacio´n of A Corun˜a (the local representation of the Spanish government in A Corun˜a). In 2004, the HIWEB project was awarded with the GaliciaTIC Award for technology innovation by the Fundacio´n Universidad-Empresa (FEUGA) of the Xunta of Galicia (the regional autonomous government of Galicia, in Spain).
Juan Raposo is an Assistant Professor in the Department of Information and Communications Technologies at University of A Corun˜a and Product Manager at the R&D department at Denodo Technologies. He received his Bachelor’s Degree in Computer Engineering from the University of A Corun˜a in 1999 and a Ph.D. Degree in Computer Science from the same University in 2007. His research interests are related to data extraction and integration, semantic and Hidden Web. Juan has participated in several projects at national and regional level in the field of data integration and Hidden Web accessing. He has also authored or co-authored numerous publications in regional and international conference proceedings. He also teaches a Master’s degree at the University of A Corun˜a.
Fernando Bellas is an Associate Professor in the Department of Information and Communications Technologies at the University of A Corun˜a (Spain). His research interests include Web engineering, portal development, and data extraction and integration. He has a Ph.D. in computer science from University of A Corun˜a. In the past, he has held several professional positions in institutions such as CESAT (a telematic engineering company) and the University of Alfonso X el Sabio of Madrid.
M. A´lvarez et al. / Data & Knowledge Engineering 64 (2008) 491–509
509
Fidel Cacheda is an Associate Professor in the Department of Information and Communications Technologies at the University of A Corun˜a (Spain). He received his Ph.D. and B.S. degrees in Computer Science from the University of A Corun˜a, in 2002 and 1996, respectively. He has been an Assistant Professor in the Department of Information and Communication Technologies, the University of A Corun˜a, Spain, since 1998. From 1997 to 1998, he was an assistant in the Department of Computer Science of Alfonso X El Sabio University, Madrid, Spain. He has been involved in several research projects related to Web information retrieval and multimedia real time systems. His research interests include Web information retrieval and distributed architectures in information retrieval. He has published several papers in international journals and has participated in multiple international conferences.