Computer Standards & Interfaces 49 (2017) 34–43
Computer Standards & Interfaces journal homepage: www.elsevier.com/locate/csi
Indexing techniques for processing generalized XML documents
Ghassan Z. Qadah
Computer Science & Engineering Department, American University of Sharjah, P.O. Box 26666, Sharjah, United Arab Emirates
Article info
Article history: Received 26 November 2015; received in revised form 19 June 2016; accepted 1 July 2016; available online 2 July 2016.
Keywords: Algorithms; Containment queries; Database system; Extensible Markup Language; Query processing; XLink
Abstract
The Extensible Markup Language (XML) data model has gained wide popularity because of its ability to represent a wide variety of structured (relational) and semi-structured (document) data. Several query languages have been proposed for the XML model, the most widely used of which is XQuery. An important component of an XQuery query is its XPath expression, which retrieves the set of XML documents to be manipulated by the associated query. XPath expressions can be of several types, among which are the containment queries. Traditional research on processing containment queries has concentrated on data retrieval from independent XML documents; little research has been directed towards interlinked XML documents. This paper reviews this area of research and shows the adequacy and correctness of one of the reviewed algorithms when applied to independent XML documents. The direct application of this algorithm to queries against interlinked XML documents, however, is shown to generate incorrect results. To remedy this situation, two new algorithms and their associated indexing structures are developed and shown to perform correctly in processing both independent and interlinked XML documents. In addition, one of the new algorithms is shown to minimize the storage requirement of the intermediate lists generated throughout its execution, thereby further improving the algorithm's space and time performance.
© 2016 Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.csi.2016.07.002 0920-5489/© 2016 Published by Elsevier B.V.
1. Introduction
A database is a large repository of interrelated data elements [13]. Traditional databases are structured using the relational model [9]. In this model, data is organized into relations viewed as two-dimensional tables. The column headers of a table represent the attributes of the relation, whereas the table's rows are the relation's tuples. Relational databases are accessed using a very high-level language called Structured Query Language (SQL) [13]. A statement in this language, referred to as a query, specifies the data to be retrieved from the database. To process a query, a software system, the database management system (DBMS), generates a number of execution plans, selects the one with the least cost, and then executes it to retrieve the requested data. In a relational database, a given query plan consists of a sequence of operations called the relational algebra operations (select, project, join, etc.). The execution of these operations, especially projection, join and transitive closure, against very large collections of data is generally slow. Many serial and parallel algorithms have been proposed to speed up the execution of these operations. Some of these algorithms and their performance evaluation can be found in [1,12,18–20,22–23]. It is worth noting here that the relational model has many advantages, including its simplicity, rigorous mathematical foundation, and ease of programming and use. However, it also has many limitations, including its poor modeling capability, especially for semi-structured and unstructured data.
To overcome the drawbacks of the relational model, new and powerful data models, with semantics richer than those of the relational one, have been introduced. Among those models are the deductive [14], object-oriented [3] and XML [5] models. The XML (Extensible Markup Language) model organizes data into collections of nested elements and character sets (strings), referred to as documents. The XML data model has gained wide popularity because of its ability to model a wide variety of structured (relational) and semi-structured (document) data, as well as its use in integrating heterogeneous data sources (traditional relational databases, data files, email messages, web pages, etc.) for displaying data on a variety of devices, including personal computers, personal digital assistants (PDAs) and smart mobile phones [16,17]. Several query languages have been proposed for the XML data model [4,6]; the most widely used one is XQuery [4]. An important component of an XQuery query is the XPath expression, which defines the set of XML documents to be manipulated by the associated query. XPath expressions can be of several types, among which are the containment [24] ones. A containment query retrieves XML elements based on the containment of these elements within some other ones. Two types of containment queries exist, namely direct and indirect containment, both of which are presented in detail in Section 2.2.
G.Z. Qadah / Computer Standards & Interfaces 49 (2017) 34–43
Traditional containment query processing research [2,7,21,24] has concentrated on developing algorithms for data retrieval from independent XML documents. In this paper, we review this area of research and show the adequacy and correctness of one of the reviewed algorithms when applied to independent XML documents. The direct application of this algorithm to queries against interlinked XML documents, however, is shown to generate incorrect results. Two new algorithms and their associated index structures are therefore introduced and shown to perform correctly in processing both independent and interlinked XML documents. In addition, one of the new algorithms is shown to minimize the storage requirement of the intermediate lists generated throughout its execution, thereby further improving its space and time performance.
The rest of this paper is organized as follows. Section 2 presents background material on the XML data model and its query types, whereas Section 3 reviews the processing of containment queries against independent XML documents. Section 4 presents two new algorithms that handle both independent and interlinked XML documents. Finally, Section 5 presents a summary, some concluding remarks and future work.
2. Background
2.1. The Extensible Markup Language (XML) data model
The relational data model [9] is simple and easy to use; however, its ability to model data with complex structures is very limited. To overcome this limitation, the Extensible Markup Language (XML) model [5] has been introduced. This model, as shown in Fig. 1.a, organizes data into collections of nested elements and character sets (strings); these collections are referred to as documents. For example, Fig. 1.a presents a simple XML document that contains data about a specific car. “<car>, </car>”, “<vin>, </vin>” and “<owners>, </owners>” of Fig. 1.a are examples of XML elements, whereas “John”, “Debra” and “999” are examples of character sets/strings. An element within an XML document can be referred to by its tag; the element “<car>, </car>”, for example, is referred to as car. Moreover, an XML element may contain other elements and/or strings. Element car, for example, contains, as shown in Fig. 1.a, one owners element, which in turn contains two owner elements. Elements in the XML model may also be nested to any depth. For example, the car element, as shown in Fig. 1.a, contains two owner elements, and each owner element contains its name and ssn.
An XML document, as shown in Fig. 1.b, can also be viewed as a rooted, ordered tree. The nodes in this tree represent the elements and strings of the XML document. The root node of the tree, node car of Fig. 1.b, represents the outermost element of the XML document, namely, the car element of Fig. 1.a. The leaf nodes of the tree, on the other hand, represent the document's string data. The other elements of the XML document, as shown in Fig. 1.a, are represented by the internal (non-leaf) nodes of the tree. A node in the tree is labeled by the tag of the element or the string it represents. An edge from one node in the tree to one of its child nodes represents a direct element-subelement (parent-child or direct-containment) relationship between the two nodes. A path between two nodes in the tree, going from top to bottom, represents an ancestor-descendant (indirect containment) relationship between these two nodes. For example, as shown in Fig. 1.b, the node/element owners directly contains two owner elements, whereas it indirectly contains two name and two ssn elements. It is important to note that the general tree structure of the XML model explains its power in representing complex data structures, as opposed to the limited representation power of the relational model. This improvement, however, comes at a price, namely an increased complexity when processing XML data.
2.2. The XML database and queries
In general, an XML database is a huge collection (forest) of trees; each tree is similar to the one presented in Fig. 1.b. To query such a database, several languages have been proposed, namely, Quilt [6], XML-QL [11] and XQuery [4]. XQuery, being developed by a W3C working group, has emerged as the front-runner. An XQuery query contains two types of components, namely, the XPath expressions [8], which locate elements in the XML database, and the XML functions that are used to manipulate the located elements. It is interesting to note here that the XQuery standard has over 800 built-in functions and provides a mechanism to build new XQuery functions. A simple example of an XQuery query, which outputs the number of owners elements in the car XML database, is presented next.

    xquery version "1.0";
    let $owners := doc("carsdb.xml")/car/owners
    return count($owners)

Here, “carsdb.xml” is the XML database containing all of the car documents. The expression doc("carsdb.xml") calls an XQuery built-in function named doc and passes to it the name of the XML database to connect to and open. The expression “$owners := doc("carsdb.xml")/car/owners” contains an XPath expression that locates all owners elements that are children of car (see Fig. 1) and stores these elements in the XQuery variable $owners. Finally, the expression “return count($owners)” calls the XQuery function count() and passes to it $owners to output the number of elements within $owners.

Fig. 1. Different representations of an XML document.

Fig. 2. Dewey labeling scheme for the tree representation of the XML book fragment.

Several types of XPath expressions exist; the most basic ones are referred to as the containment queries [21,24]. Two types of containment queries can be identified, namely, direct containment and indirect containment. Direct containment refers to an expression of the form e1/e2, where e1 and e2 are elements of an XML document. e1/e2 retrieves all of the database elements e2 that are directly contained within an element e1. That is, an e2 element is retrieved if it is a child of e1. As an example, the direct containment query “owners/owner” retrieves all owner elements (refer to Fig. 1.b) that are children of (or directly nested within or directly contained within) the element owners. An indirect containment query, on the other hand, refers to an XPath expression of the form e1//e2, where e1 and e2 are XML elements of the same document. Such an expression retrieves all elements e2 that are descendants of (nested at any depth within) the element e1. As an example, the indirect containment expression “car//name” retrieves every name element (refer to Fig. 1.b) that is a descendant of a car element. That is, such a query retrieves the names of all car owners.
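The difference between the two query types can be tried out with a small sketch. The example below uses Python's standard xml.etree.ElementTree module, whose findall method accepts a limited XPath subset; the document is a hypothetical reconstruction of the car document described for Fig. 1 (the ssn values are invented for illustration), not the paper's actual figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical car document, loosely modeled on the description of Fig. 1.
# The ssn values are made up for this illustration.
CAR_XML = """
<car>
  <vin>999</vin>
  <owners>
    <owner><name>John</name><ssn>111</ssn></owner>
    <owner><name>Debra</name><ssn>222</ssn></owner>
  </owners>
</car>
"""

car = ET.fromstring(CAR_XML)

# Direct containment "owners/owner": owner elements that are children of owners.
direct = car.findall("owners/owner")

# Indirect containment "car//name": name elements nested at any depth below car.
indirect = car.findall(".//name")

# A direct-containment lookup of name below car finds nothing, because the
# name elements are deeper descendants of car, not its children.
no_match = car.findall("name")

print(len(direct), [n.text for n in indirect], no_match)
```

The last lookup is the useful contrast: the same tag is reachable through "//" but not through "/", which is exactly the distinction between indirect and direct containment.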
Fig. 3. A Pre-Post labeling scheme for the tree representation of the XML book fragment.
Fig. 4. Zhang's [24] labeling scheme for the tree representation of the XML book fragment.
3. Processing containment queries on independent XML documents
Several algorithms have been devised to process the containment queries. These algorithms are reviewed next.

3.1. Classical algorithms
Several algorithms fit into this category, namely:
1. Tree navigation-based techniques: these techniques recognize the fact that an indirect containment query such as e1//e2 is basically a tree reachability problem, that is, e1 contains e2 within a single XML tree if there exists a path that links e1 to e2 inside that tree. Such a path can be determined by traversing the XML sub-tree whose root is e1 to check whether any node e2 can be found (forward navigation). This navigation can be depth-first or breadth-first. For given e1 and e2 values and a single XML tree, the time complexity of these algorithms is linear in the number of nodes n within the tree, and it is O(n * k) for an XML database with k documents. It is important to note here that the tree navigation technique is advantageous when the XML database is update-intensive, whereas it is inefficient when the database is retrieval-intensive, since it may involve the traversal of many nodes within the XML trees that are not labeled by e1 or e2. Such traversal slows down the processing of the containment queries drastically.
2. Transitive-closure pre-computation technique: another technique to process e1//e2 is to pre-compute and store the transitive closure of the different XML trees. In this case, the time complexity of looking up the existence of a path between e1 and e2 within a given tree is constant (O(1)), and it is O(k) for an XML database with k documents. This algorithm is very fast for retrieving elements, but has a number of drawbacks, namely, the high storage cost and the high cost of re-computing the transitive closure when the nodes of the XML document are updated.

3.2. Node labeling-based techniques
The tree navigation technique is very slow. The pre-computation of transitive closures, on the other hand, is prohibitively expensive in terms of storage and is impractical, especially when the XML database is huge and update-intensive. To remedy this situation and speed up the processing of the containment queries, a number of algorithms that label the nodes of the XML trees have been devised. Two of these schemes [25,26] are presented next.
1. Dewey encoding scheme: the Dewey scheme was originally developed and widely used by librarians. Tatarinov et al. [27] were the first to introduce the Dewey code into XML document processing. In this approach, each node, as shown in Fig. 2, is associated with a vector of integers that represents the path from the document root to the given node. The number of integers in a vector is equal to the number of links within that path plus 1. Using the Dewey scheme, node e1 contains another node e2 if “e1.vector” is a prefix of “e2.vector”. For example, the “chapter” node of Fig. 2, with the Dewey vector “1.4”, contains the node “Beyond”, which has the Dewey vector “1.4.3.1.1”, because “1.4” is a prefix of “1.4.3.1.1”. It is interesting to note here that the Dewey labeling scheme speeds up the execution of the containment queries because it reduces the tree navigation operation (with O(n) complexity, where n is the number of nodes within the tree) to a prefix-matching operation whose cost is proportional to the vector length (O(log n) when the height of the tree is log n).
2. Pre-Post/interval coding scheme: in this scheme, each node in the tree, as shown in Fig. 3, is associated with two numbers (pre, post). Pre is the starting position of the element or string within the XML document. Post, on the other hand, is the end position of the element or string within the XML document. To determine the positions of the elements and strings within an XML document, one can use the document's tree representation, presented in Fig. 3, by traversing this tree in a depth-first fashion and sequentially assigning integer numbers (starting from 1) to the nodes at each visit. Pre for a node is then the integer generated during the first visit to that node, whereas Post is the integer generated during the second (and last) visit. Since a string is visited only once, Pre and Post for that string are the same. It is important to note here that the Pre-Post labeling scheme speeds up the execution of the containment queries because it reduces the tree navigation operation (with O(n) complexity, where n is the number of nodes within the tree) to two integer comparisons (with a complexity of O(1)).

3.3. Zhang's containment algorithm
To speed up the execution of the containment queries, Zhang et al. [24] adopted a variant of the Pre-Post scheme to label the nodes within the XML database, as shown in Fig. 4. For a given element or string node, the label is a 4-tuple (docId, sPos, ePos, level), where docId is the identifier of the document in which the element or string is stored, sPos is the starting position of the element or string within the XML document (same as Pre in the Pre-Post labeling scheme), ePos is the end position of the element or string (same as Post in the Pre-Post labeling scheme), and level is the nesting depth of the element (or string) within the document.
Using this numbering scheme, both the direct and indirect containment relationships, as shown by Zhang et al. [24], can easily be evaluated. For two elements e1 and e2 with labels (D1, S1, E1, L1) and (D2, S2, E2, L2), e1 indirectly contains e2 if the three conditions c1 through c3, presented below, are satisfied, whereas e1 directly contains e2 if c1 through c4 are satisfied:

    D1 = D2        (c1)
    S1 < S2        (c2)
    E1 > E2        (c3)
    L2 = L1 + 1    (c4)

For example, the chapter element at position (1, 11, 35, 1) indirectly contains the title element at position (1, 22, 26, 3) of the document in Fig. 4, since the two nodes satisfy conditions c1 through c3. However, the title element at position (1, 5, 7, 1) is neither directly nor indirectly contained within the chapter element at position (1, 11, 35, 1), since c2 and c4 are violated.
To facilitate the processing of the containment queries, an index, a sample of which is shown in Table 1, is created for all the nodes within the XML database. Each one of these nodes has an index entry. Such an entry is the 5-tuple (term, docId, sPos, ePos, level), where term is the element tag or string represented by the node and (docId, sPos, ePos, level) is the position of the node within the XML database, as defined above. For an element entry, such as book of Table 1, sPos is different from ePos, whereas sPos and ePos are the same for a string entry, such as RDBMS of Table 1. It is interesting to note here that the index table generated for the XML documents can also be viewed as a database relation with term, docId, sPos, ePos and level as its attributes.

Table 1
Part of an index representation of the book XML fragment.

    term      docId  sPos  ePos  level
    book      1      1     36    0
    ISBN      1      2     4     1
    title     1      5     7     1
    chapter   1      11    35    1
    title     1      15    20    2
    title     1      22    26    3
    title     1      28    32    4
    RDBMS     1      6     6     2
    RDBMS     1      19    19    3
    RDBMS     1      30    30    3

To process a containment query e1//e2, one can use Zhang's algorithm [24], presented in Fig. 5 in terms of the relational algebra operations [9] and the containment-merge operation. The algorithm starts by selecting all the rows of the index table of Table 1 in which “term = e1” and all the rows in which “term = e2” (steps 1 & 3 of Zhang's algorithm of Fig. 5). This selection results in two lists of index entries, namely, e1_list (all entries with term = e1) and e2_list (all entries with term = e2).
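The labeling schemes of Section 3.2 and the conditions c1 through c4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names Label, label_tree, indirectly_contains, directly_contains and dewey_contains are assumptions, and the small tree is a hypothetical, truncated version of the book fragment (only its first two children, with an invented ISBN string).

```python
from typing import NamedTuple


class Label(NamedTuple):
    doc_id: int
    s_pos: int
    e_pos: int
    level: int


def label_tree(node, doc_id, level=0, counter=None, out=None):
    """Depth-first labeling: elements are numbered on entry and exit, strings once."""
    counter = [0] if counter is None else counter
    out = [] if out is None else out
    counter[0] += 1
    start = counter[0]
    if isinstance(node, str):              # string node: sPos == ePos
        out.append((node, Label(doc_id, start, start, level)))
        return out
    tag, children = node
    slot = len(out)
    out.append(None)                       # placeholder until ePos is known
    for child in children:
        label_tree(child, doc_id, level + 1, counter, out)
    counter[0] += 1
    out[slot] = (tag, Label(doc_id, start, counter[0], level))
    return out


def indirectly_contains(a, b):
    """Conditions c1 to c3."""
    return a.doc_id == b.doc_id and a.s_pos < b.s_pos and a.e_pos > b.e_pos


def directly_contains(a, b):
    """Conditions c1 to c4."""
    return indirectly_contains(a, b) and b.level == a.level + 1


def dewey_contains(a, b):
    """Dewey test: the vector of a is a proper prefix of the vector of b."""
    va, vb = a.split("."), b.split(".")
    return len(va) < len(vb) and vb[:len(va)] == va


# Hypothetical, truncated book fragment: (tag, [children]); strings are leaves.
book = ("book", [("ISBN", ["123"]), ("title", ["RDBMS"])])
index = dict(label_tree(book, doc_id=1))

chapter = Label(1, 11, 35, 1)              # labels quoted from the text above
deep_title = Label(1, 22, 26, 3)
first_title = Label(1, 5, 7, 1)

print(index["ISBN"], index["title"], index["RDBMS"])
print(indirectly_contains(chapter, deep_title), directly_contains(chapter, deep_title))
print(indirectly_contains(chapter, first_title))
print(dewey_contains("1.4", "1.4.3.1.1"))
```

On this truncated tree the ISBN, title and RDBMS labels come out as in the corresponding rows of Table 1; only the closing position of book differs, because the rest of the fragment is omitted. The Dewey test compares vector components rather than raw strings so that "1.4" is not mistaken for a prefix of "1.40".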
The lists e1_list and e2_list are then sorted according to their docId and sPos fields (steps 2 & 4 of Fig. 5). Elements of e1_list and e2_list are then combined using the containment-merge operator “*”, as shown in step 5 of Fig. 5, to yield e1_e2_list. The operator “*” is defined as follows:

    e1_list * e2_list = { <a, b> | a ∈ e1_list, b ∈ e2_list, and a and b satisfy c1, c2 and c3 }

That is, an element of e1_e2_list is formed by combining one element of e1_list and another element of e2_list whenever these two elements satisfy the containment conditions c1 through c3 presented earlier. To process the containment query “chapter//title”, for example, the index table of Table 1 is searched for all rows in which term = “chapter” and for all rows in which term = “title”. This search, followed by a sorting operation, results in chapter_list = {(1, 11, 35, 1)} and title_list = {(1, 5, 7, 1), (1, 15, 20, 2), (1, 22, 26, 3), (1, 28, 32, 4)}. The elements of chapter_list and title_list are then merged using the containment-merge operation defined above. As a result, the list chapter_title_list = {<(1, 11, 35, 1), (1, 15, 20, 2)>, <(1, 11, 35, 1), (1, 22, 26, 3)>, <(1, 11, 35, 1), (1, 28, 32, 4)>} is generated for the query “chapter//title”.

Fig. 5. Zhang's algorithm for the indirect containment query evaluation.

4. Processing interlinked XML documents
4.1. The limitation of Zhang's algorithm
In Section 3.3, the algorithm developed by Zhang et al. [24] to evaluate the direct and indirect containment queries was presented. This algorithm is fast and efficient because it can test two XML nodes for containment by comparing their labels only, thus avoiding the need to traverse the path between these two nodes. However, Zhang's algorithm has a serious limitation: it works only for an XML database that contains independent documents, and it breaks down when processing containment across XML links [10,15]. To illustrate this point further, consider Fig. 6, which presents a collection of five interlinked XML documents. doc1 and doc2 store car data, doc3 and doc4 store owner data, and doc5 stores address data. Using XLink statements [10,15], doc1 and doc2 link to both doc3 and doc4, since the latter two represent the owners of the cars represented by doc1 and doc2. Documents doc3 and doc4, in turn, link to doc5, which stores the address of the two owners represented by doc3 and doc4 (the two owners of the cars live at the same address). Using the indexing scheme presented in Section 3.3, the labels associated with the nodes of these documents are generated, as shown in Fig. 6, and stored in the index table of Table 2.

Fig. 6. A collection of interlinked XML documents.

Table 2
An index-table representing the XML database documents of Fig. 6.
    term        docId  sPos  ePos  level
    car         1      1     15    0
    vin         1      2     4     1
    owners      1      5     8     1
    make        1      9     11    1
    year        1      12    14    1
    “999”       1      3     3     2
    owner       1      6     6     2
    owner       1      7     7     2
    “Toyota”    1      10    10    2
    “2000”      1      13    13    2
    car         2      1     15    0
    vin         2      2     4     1
    owners      2      5     8     1
    make        2      9     11    1
    year        2      12    14    1
    “888”       2      3     3     2
    owner       2      6     6     2
    owner       2      7     7     2
    “Nissan”    2      10    10    2
    “2001”      2      13    13    2
    owner       3      1     9     0
    name        3      2     4     1
    ssn         3      5     7     1
    addr        3      8     8     1
    “John”      3      3     3     2
    “333”       3      6     6     2
    owner       4      1     9     0
    name        4      2     4     1
    ssn         4      5     7     1
    addr        4      8     8     1
    “Debra”     4      3     3     2
    “444”       4      6     6     2
    addr        5      1     8     0
    str         5      2     4     1
    state       5      5     7     1
    “Hillside”  5      3     3     2
    “NJ”        5      6     6     2
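With the entries of Table 2 at hand, the containment-merge of Section 3.3 can be tried on both an intra-document and a cross-document query. The sketch below is an illustration under stated assumptions: INDEX holds only the Table 2 rows needed here, select and containment_merge are hypothetical names, and the merge is written as a simple nested loop over the sorted lists rather than the single-pass merge of Fig. 5.

```python
# Subset of Table 2: (term, (docId, sPos, ePos, level)).
INDEX = [
    ("car",   (1, 1, 15, 0)), ("make", (1, 9, 11, 1)),
    ("car",   (2, 1, 15, 0)), ("make", (2, 9, 11, 1)),
    ("state", (5, 5, 7, 1)),
]


def select(term):
    """Select the labels of a term, sorted by docId and then sPos."""
    return sorted(label for t, label in INDEX if t == term)


def containment_merge(e1_list, e2_list):
    """Pair up labels satisfying c1 (same docId), c2 (S1 < S2) and c3 (E1 > E2)."""
    return [(a, b) for a in e1_list for b in e2_list
            if a[0] == b[0] and a[1] < b[1] and a[2] > b[2]]


car_make = containment_merge(select("car"), select("make"))
car_state = containment_merge(select("car"), select("state"))

print(car_make)    # one <car, make> pair per car document
print(car_state)   # empty: no car shares a docId with the state element
```

The second query comes back empty because condition c1 can never hold between a car label (doc1 or doc2) and the state label (doc5), even though each car reaches the state element through its XLinks.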
Table 3
The XLink map for the XML documents of Fig. 6.

    XLink source node                    XLink destination node
    s_docId  s_sPos  s_ePos  s_level     d_docId  d_sPos  d_ePos  d_level
    1        6       6       2           3        1       9       0
    1        7       7       2           4        1       9       0
    2        3       3       2           3        1       9       0
    2        7       7       2           4        1       9       0
    3        8       8       1           5        1       8       0
    4        8       8       1           5        1       8       0
Now, the evaluation of a containment query such as “car//state” (for each car in the database, retrieve the corresponding state elements) using Zhang's algorithm of Fig. 5 proceeds by fetching from Table 2 the list of labels with term = “car”, car_list, and the list of labels with term = “state”, state_list. car_list and state_list contain {(1, 1, 15, 0), (2, 1, 15, 0)} and {(5, 5, 7, 1)}, respectively. However, since these two lists contain index entries with different document IDs, their containment merge generates an empty list instead of the list {<(1, 1, 15, 0), (5, 5, 7, 1)>, <(2, 1, 15, 0), (5, 5, 7, 1)>}. That is, Zhang's algorithm cannot handle containment evaluation across XLinks.

4.2. A new algorithm for processing interlinked XML documents
To process the containment queries against interlinked XML documents, a new algorithm has been devised. Similar to Zhang's algorithm, the new algorithm creates for the XML documents an index table similar to the one presented in Table 2. In addition, the new algorithm creates another table, the XLink-table, which stores information about the links that connect the nodes across the different XML documents. An entry in that table, as shown in Table 3, is the 8-tuple (s_docId, s_sPos, s_ePos, s_level, d_docId, d_sPos, d_ePos, d_level), where (s_docId, s_sPos, s_ePos, s_level) is the label of a link's source node, whereas (d_docId, d_sPos, d_ePos, d_level) is the label of the link's destination node. The XLink-table is sorted on s_docId and s_sPos. The processing of a containment query “e1//e2”, as shown in Algorithm-1 of Fig. 7, proceeds in three phases, namely:
• Phase 1: this phase generates the intra-document solutions to the “e1//e2” query using Zhang's algorithm presented in Fig. 5. The processing of this phase results in the list e1_e2_list_intra.
• Phase 2: this phase generates the inter-document solutions to “e1//e2” using the new Algorithm-1-phase 2 presented in Fig. 8, and stores the result in e1_e2_list_inter. This phase utilizes two operators, namely, node-duplication and extended containment-merge (X). Node-duplication is a unary operator that takes a list of XML nodes, e1_list for example, and generates another list, e1_e1_list, by duplicating each one of the nodes within e1_list. The new list is defined as follows:

    e1_e1_list = { <a, a> | a ∈ e1_list }

  The extended containment-merge operator “X” is a binary operator that takes two lists and generates a new list; its application is illustrated in steps 5 and 6 of the example that follows. The details of this phase are presented in Fig. 8.
• Phase 3: this phase appends the inter-document solutions (e1_e2_list_inter), found in phase 2, to the intra-document solutions (e1_e2_list_intra), found in phase 1, to yield the complete set of solutions (e1_e2_list) to the query “e1//e2”.
To illustrate Algorithm-1 further, consider the evaluation of the query “e1//e2”, where e1 is “car” and e2 is “state”, against the interlinked XML database of Fig. 6. Executing phase 1 of Algorithm-1 against the index table of Table 2 results in the two lists, car_list and state_list, as follows:

    car_list = {(1, 1, 15, 0), (2, 1, 15, 0)}
    state_list = {(5, 5, 7, 1)}
The two lists, car_list and state_list, are then combined using the containment-merge operator “*” to yield the set of intra-document solutions:

    car_state_list_intra = car_list * state_list = Φ (the empty set)

Executing the second phase of Algorithm-1, on the other hand, proceeds, as shown in Fig. 8, as follows:
a. Steps 1 & 2: transform car_list and state_list, obtained in phase 1, into the new lists car_car_list and state_state_list by applying the node-duplication operator to car_list and state_list, respectively. The contents of the new lists are:

    car_car_list = {<(1, 1, 15, 0), (1, 1, 15, 0)>, <(2, 1, 15, 0), (2, 1, 15, 0)>}
    state_state_list = {<(5, 5, 7, 1), (5, 5, 7, 1)>}

b. Step 3: create car_state_list_init and initialize it to empty. As the processing of phase 2 progresses, car_state_list_init accumulates all of the links (source, destination pairs) whose source nodes are those of car_list and whose destination nodes may contain the nodes of state_list.
c. Step 4: create δ_list and initialize it to the elements of car_car_list. For this example, therefore, δ_list is initialized, as shown in Table 4, to the following content:

    δ_list = {<(1, 1, 15, 0), (1, 1, 15, 0)>, <(2, 1, 15, 0), (2, 1, 15, 0)>}

d. Step 5: during this step, the algorithm loops until δ_list becomes empty. During a typical iteration, the extended containment-merge operator is applied to δ_list and the XLink-table to generate a new δ_list. The elements of the new δ_list that already exist in car_state_list_init are then removed, and the remaining elements of δ_list are added to those already in car_state_list_init. Table 4 presents the initial, intermediate and final contents of δ_list and car_state_list_init during the evaluation of the “/car//state” query.
e. Step 6: during this step, the extended containment-merge operator is applied to car_state_list_init and state_state_list to generate car_state_list_inter as follows:

    car_state_list_inter = car_state_list_init X state_state_list = {<(1, 1, 15, 0), (5, 5, 7, 1)>, <(2, 1, 15, 0), (5, 5, 7, 1)>}

Fig. 7. The generalized containment algorithm.

Fig. 8. The generalized containment Algorithm-1: Phase 2.

During phase 3 of Algorithm-1, car_state_list_inter is appended to car_state_list_intra to yield all the solutions to the “/car//state” query. It is interesting to note here that the algorithm presented above processes containment queries not only against interlinked XML documents but also against independent ones. For the independent documents case, Algorithm-1-phase 2 of Fig. 8 generates an empty e1_e2_list_inter, and therefore the final list e1_e2_list contains the elements of e1_e2_list_intra only. Phases 2 & 3 of Algorithm-1, on the other hand, generate all the solutions that cross document boundaries in the interlinked case. Together, phases 1, 2 and 3 generate the full set of solutions to the generalized containment queries for both independent and interlinked documents.
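The three phases can be sketched end to end on the running example. The code below is a hedged reading of Algorithm-1, not a transcription of Figs. 7 and 8: the semantics assumed for the extended containment-merge X (join a pair <s, d> with a pair <s2, d2> whenever d contains s2, producing <s, d2>) is inferred from the contents of Table 4, and all function names are hypothetical. XLINK reproduces Table 3 as printed, and INDEX holds only the Table 2 rows that the query touches.

```python
# Rows of Table 2 needed for car//state: (term, docId, sPos, ePos, level).
INDEX = [("car", 1, 1, 15, 0), ("car", 2, 1, 15, 0), ("state", 5, 5, 7, 1)]

# Table 3 as printed: (source label, destination label) per XLink.
XLINK = [
    ((1, 6, 6, 2), (3, 1, 9, 0)), ((1, 7, 7, 2), (4, 1, 9, 0)),
    ((2, 3, 3, 2), (3, 1, 9, 0)), ((2, 7, 7, 2), (4, 1, 9, 0)),
    ((3, 8, 8, 1), (5, 1, 8, 0)), ((4, 8, 8, 1), (5, 1, 8, 0)),
]


def contains(a, b):
    """Conditions c1 to c3 on two (docId, sPos, ePos, level) labels."""
    return a[0] == b[0] and a[1] < b[1] and a[2] > b[2]


def select(term):
    return sorted(row[1:] for row in INDEX if row[0] == term)


def containment_merge(l1, l2):
    """Zhang's operator "*", written as a nested loop for brevity."""
    return [(a, b) for a in l1 for b in l2 if contains(a, b)]


def node_duplication(nodes):
    return [(n, n) for n in nodes]


def extended_merge(pairs1, pairs2):
    """ASSUMED semantics of X: <s, d> joins <s2, d2> when d contains s2."""
    out = []
    for s, d in pairs1:
        for s2, d2 in pairs2:
            if contains(d, s2) and (s, d2) not in out:
                out.append((s, d2))
    return out


def algorithm_1(e1, e2):
    e1_list, e2_list = select(e1), select(e2)
    intra = containment_merge(e1_list, e2_list)            # phase 1
    init, delta = [], node_duplication(e1_list)            # phase 2, steps 1-4
    while delta:                                           # step 5
        delta = [p for p in extended_merge(delta, XLINK) if p not in init]
        init.extend(delta)
    inter = extended_merge(init, node_duplication(e2_list))  # step 6
    return intra + inter                                   # phase 3


car_state = algorithm_1("car", "state")
print(car_state)
```

Under these assumptions the loop visits the same δ_list contents as Table 4 (four pairs, then two, then none), and removing pairs already present in the accumulated list is what guarantees termination even if the links form a cycle.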
4.3. Algorithm-2 for processing interlinked XML documents
Examining the execution of Algorithm-1-phase 2, presented in Fig. 8, it is easy to see that the intermediate lists, δ_list and e1_e2_list_init, as shown in Table 4, contain many redundant source nodes. Removing this redundancy saves a considerable amount of storage, which in turn improves the time and storage performance of Algorithm-1.
Table 4
The content of δ_list and car_state_list_init during the different iterations of Algorithm-1-phase 2.

Iteration 0
    δ_list:               <(1, 1, 15, 0), (1, 1, 15, 0)>, <(2, 1, 15, 0), (2, 1, 15, 0)>
    car_state_list_init:  Φ
Iteration 1
    δ_list:               <(1, 1, 15, 0), (3, 1, 9, 0)>, <(1, 1, 15, 0), (4, 1, 9, 0)>, <(2, 1, 15, 0), (3, 1, 9, 0)>, <(2, 1, 15, 0), (4, 1, 9, 0)>
    car_state_list_init:  <(1, 1, 15, 0), (3, 1, 9, 0)>, <(1, 1, 15, 0), (4, 1, 9, 0)>, <(2, 1, 15, 0), (3, 1, 9, 0)>, <(2, 1, 15, 0), (4, 1, 9, 0)>
Iteration 2
    δ_list:               <(1, 1, 15, 0), (5, 1, 8, 0)>, <(2, 1, 15, 0), (5, 1, 8, 0)>
    car_state_list_init:  <(1, 1, 15, 0), (3, 1, 9, 0)>, <(1, 1, 15, 0), (4, 1, 9, 0)>, <(2, 1, 15, 0), (3, 1, 9, 0)>, <(2, 1, 15, 0), (4, 1, 9, 0)>, <(1, 1, 15, 0), (5, 1, 8, 0)>, <(2, 1, 15, 0), (5, 1, 8, 0)>
Iteration 3
    δ_list:               Φ
    car_state_list_init:  <(1, 1, 15, 0), (3, 1, 9, 0)>, <(1, 1, 15, 0), (4, 1, 9, 0)>, <(2, 1, 15, 0), (3, 1, 9, 0)>, <(2, 1, 15, 0), (4, 1, 9, 0)>, <(1, 1, 15, 0), (5, 1, 8, 0)>, <(2, 1, 15, 0), (5, 1, 8, 0)>
The enhanced algorithm, Algorithm-2, follows the same steps as those of Algorithm-1 with the following modifications:
• Algorithm-2 creates an additional table, the source-table, of size proportional to the number of source nodes generated for the given containment query. An entry in that table is capable of storing the label of a source node (s_docId, s_sPos, s_ePos, s_level).
• Algorithm-2 utilizes a new operator, node_annotation, that takes a list of source nodes, e1_list for example, stores these nodes into the source-table and generates a new list a_e1_list (the annotated version of e1_list) by prefixing each one of the nodes in e1_list with its entry index into the source-table. a_e1_list is defined as follows:

    a_e1_list = { (i, docId, sPos, ePos, level) | (docId, sPos, ePos, level) is the i-th entry of the source-table }

  For the “/car//state” example, the list car_list has the source nodes {(1, 1, 15, 0), (2, 1, 15, 0)}, and the corresponding source-table contains two entries; the 0th and 1st entries contain the nodes (1, 1, 15, 0) and (2, 1, 15, 0), respectively. The annotated version of car_list (a_car_list), on the other hand, has the elements (0, 1, 1, 15, 0) and (1, 2, 1, 15, 0).
• Algorithm-2 utilizes another operator, annotation_expansion, that takes an annotated list of nodes, a_e2_list for example, and the corresponding source-table, and expands these nodes, using the source-table, into full links (pairs of source and destination nodes). The annotation_expansion operator is defined as follows:

    { <source-table[i], (docId, sPos, ePos, level)> | (i, docId, sPos, ePos, level) ∈ a_e2_list }

  For the “/car//state” example, the a_e2_list_inter containing the annotated nodes (0, 3, 1, 9, 0), (0, 4, 1, 9, 0), (1, 3, 1, 9, 0), (1, 4, 1, 9, 0), (0, 5, 1, 8, 0) and (1, 5, 1, 8, 0) is expanded by the annotation_expansion operator to the elements <(1, 1, 15, 0), (3, 1, 9, 0)>, <(1, 1, 15, 0), (4, 1, 9, 0)>, <(2, 1, 15, 0), (3, 1, 9, 0)>, <(2, 1, 15, 0), (4, 1, 9, 0)>, <(1, 1, 15, 0), (5, 1, 8, 0)> and <(2, 1, 15, 0), (5, 1, 8, 0)>, by replacing the Id on every node with the source-table entry whose index equals that Id.
• Algorithm-2 modifies the extended containment-merge operator (X) such that it generates annotated versions of the intermediate lists (instead of the fully expanded ones).
Fig. 9. Phase 2 of the enhanced containment algorithm (Algorithm-2).
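The node_annotation and annotation_expansion operators described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function signatures, the use of Python lists and tuples, and the choice to add one source-table entry per input node in input order (without deduplication) are assumptions made for illustration only. The sample values are those of the “/car//state” example.

```python
from typing import List, Tuple

Node = Tuple[int, int, int, int]                 # (docId, sPos, ePos, Level)
AnnotatedNode = Tuple[int, int, int, int, int]   # (Id, docId, sPos, ePos, Level)
Link = Tuple[Node, Node]                         # (source node, destination node)


def node_annotation(e1_list: List[Node]) -> Tuple[List[Node], List[AnnotatedNode]]:
    """Store each source node in a source table and prefix it with its index.

    Assumption: one table entry per input node, in input order.
    """
    source_table: List[Node] = []
    a_e1_list: List[AnnotatedNode] = []
    for node in e1_list:
        source_table.append(node)
        index = len(source_table) - 1
        a_e1_list.append((index, *node))
    return source_table, a_e1_list


def annotation_expansion(a_e2_list: List[AnnotatedNode],
                         source_table: List[Node]) -> List[Link]:
    """Replace the Id of each annotated node with the source-table entry at that index."""
    return [(source_table[node[0]], node[1:]) for node in a_e2_list]


# The "/car//state" example from the text.
car_list = [(1, 1, 15, 0), (2, 1, 15, 0)]
source_table, a_car_list = node_annotation(car_list)
print(a_car_list)

a_e2_list_inter = [(0, 3, 1, 9, 0), (0, 4, 1, 9, 0), (1, 3, 1, 9, 0),
                   (1, 4, 1, 9, 0), (0, 5, 1, 8, 0), (1, 5, 1, 8, 0)]
for link in annotation_expansion(a_e2_list_inter, source_table):
    print(link)
```

In this sketch an annotated node carries five integers, whereas a fully expanded link carries eight, which is the reason the intermediate lists of Algorithm-2 are more compact than those of Algorithm-1.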
Table 5
The content of a_δ_list and a_state_list_init during the different iterations of phase 2 of Algorithm-2.
Iteration | a_δ_list | a_state_list_init
0 | ⟨(0, 1, 1, 15, 0)⟩, ⟨(1, 2, 1, 15, 0)⟩ | Φ
1 | ⟨(0, 3, 1, 9, 0)⟩, ⟨(0, 4, 1, 9, 0)⟩, ⟨(1, 3, 1, 9, 0)⟩, ⟨(1, 4, 1, 9, 0)⟩ | ⟨(0, 3, 1, 9, 0)⟩, ⟨(0, 4, 1, 9, 0)⟩, ⟨(1, 3, 1, 9, 0)⟩, ⟨(1, 4, 1, 9, 0)⟩
2 | ⟨(0, 5, 1, 8, 0)⟩, ⟨(1, 5, 1, 8, 0)⟩ | ⟨(0, 3, 1, 9, 0)⟩, ⟨(0, 4, 1, 9, 0)⟩, ⟨(1, 3, 1, 9, 0)⟩, ⟨(1, 4, 1, 9, 0)⟩, ⟨(0, 5, 1, 8, 0)⟩, ⟨(1, 5, 1, 8, 0)⟩
3 | Φ | ⟨(0, 3, 1, 9, 0)⟩, ⟨(0, 4, 1, 9, 0)⟩, ⟨(1, 3, 1, 9, 0)⟩, ⟨(1, 4, 1, 9, 0)⟩, ⟨(0, 5, 1, 8, 0)⟩, ⟨(1, 5, 1, 8, 0)⟩
With the modifications presented above, phase 2 of Algorithm-2 proceeds as shown in Fig. 9. Table 5 presents the contents of a_δ_list and a_state_list_init during the iterations of phase 2 of Algorithm-2. It is worth noting that the storage needed by the intermediate lists, as shown in Table 5, is substantially reduced (about 50% of that needed by the lists generated in phase 2 of Algorithm-1), leading to superior storage and time performance for Algorithm-2.

5. Conclusions and future work

Traditional XML query processing research has concentrated on data retrieval from independent XML documents and on the index structures that support it. In this paper, we have reviewed one such algorithm and its associated inverted index structure and shown its correctness and suitability for processing independent XML documents. However, the direct application of this algorithm to queries against interlinked XML documents was shown to generate incorrect results. Two new algorithms and their associated index structures have been introduced and shown to perform correctly when processing independent and/or interlinked XML documents. In addition, one of the new algorithms was shown to minimize the storage required by the intermediate lists generated during its execution, thereby further improving its space and time performance. As future work, we plan to develop space and time models for the new algorithms and to use these models to compute and compare their performance.

Ghassan Z. Qadah obtained his BSc from the Electrical Engineering Department of Ain Shams University, Cairo, Egypt, and his MSc and PhD from the Electrical and Computer Engineering Department of the University of Michigan-Ann Arbor. Currently, he is an Associate Professor at the Computer Engineering Department, American University of Sharjah, P.O. Box 26666, Sharjah, United Arab Emirates. His research interests include serial and parallel query processing algorithms, traditional and non-traditional database systems, wireless networks and mobile computing.