Reformulating XPath queries and XSLT queries on XSLT views


Data & Knowledge Engineering 57 (2006) 64–110 www.elsevier.com/locate/datak

Sven Groppe *, Stefan Böttcher, Georg Birkenheuer, André Höing
University of Paderborn, Faculty 5, Fürstenallee 11, D-33102 Paderborn, Germany
Received 28 February 2005; received in revised form 28 February 2005; accepted 14 April 2005
Available online 16 May 2005

Abstract

Applications using XML for data representation very often use different XML formats and thus require the transformation of XML data. The common approach transforms entire XML documents from one format into another, e.g. by using an XSLT stylesheet. In contrast to this approach, we use an XSLT stylesheet in order to transform a given XPath query or a given XSLT query so that we retrieve and transform only that part of the XML document which is sufficient to answer the given query. Among other things, our approach avoids problems of replication, saves processing time and, in distributed scenarios, reduces transportation costs.

© 2005 Elsevier B.V. All rights reserved.

Keywords: XML; Semi-structured data; XSLT; XPath; Query transformation; Query reformulation; Query optimization

1. Introduction

1.1. Problem definition and motivation

An application must often access data in a data format Ftransf that is different from the format Forig, which is used to store the data itself in the database. Applications deal with different data

* Corresponding author. E-mail addresses: [email protected] (S. Groppe), [email protected] (S. Böttcher), [email protected] (G. Birkenheuer), [email protected] (A. Höing).
0169-023X/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.datak.2005.04.002


formats; for example, in the scenarios of data integration, schema evolution, or in bilateral situations, where two applications exchange data. In these scenarios, the user can define a view in one database so that an application can access the data in the data format Ftransf of another database. The view definition specifies how the database management system transforms stored data from the format Forig into the format Ftransf when the user queries the view.

In order to avoid transforming the complete data of the view, database management systems use the method of query reformulation, whereby the query in the format Ftransf of the view is reformulated into a query, called the reformulated query, in the format Forig of the stored data. By applying the reformulated query, the database management system transforms only that section of the data which is necessary to answer the query on the view from the format Forig into the format Ftransf.

In this paper we introduce a slightly different approach (see Section 1.2) to query reformulation for XML, which we call query transformation, where the query can be formulated in XPath or in XSLT and the view definition in XSLT. This enables scenarios in which XML, XPath and XSLT are used throughout, similar to the scenarios for query reformulation in traditional databases. In these scenarios, query transformation has several advantages over the state-of-the-art method for XML and XSLT, which first transforms the entire XML document and then works on the copy of the original document transformed into Ftransf: query transformation avoids replication problems, saves processing time for the transformation, and in distributed scenarios reduces transportation costs.

In comparison to the classical definition of query reformulation [12], we perform the transformation of the data in a separate step (see Fig. 1). In the first step, our query transformation algorithm transforms a given query Q (which can be an XPath query QXPath or an XSLT query QXSLT) into an XPath query R according to a given XSLT stylesheet S. We then evaluate R on the original document D in order to retrieve a smaller resultant XML fragment R(D). The resultant XML fragment R(D) is defined to contain all nodes, and all their ancestors up to the root of the original XML document D, which contribute to the successful evaluation of the query R given in XML format Forig. The resultant XML fragment R(D) is then transformed by an XSLT processor according to the given XSLT stylesheet S to S(R(D)) and finally queried according to the query Q in order to retrieve the final result Q(S(R(D))).

In this paper, we investigate the algorithmic problem of query transformation, which we define as follows:

Definition 1. The algorithmic problem of query transformation is to determine an XPath expression R according to a given query Q (which can be an XPath query QXPath or an XSLT query QXSLT)

Fig. 1. Complete system for supporting query transformation of XPath and XSLT queries on XSLT views.


and an XSLT stylesheet S so that it meets the following conditions: The resultant XML fragment R(D) has to be as small as possible but has to guarantee the equivalence of Q(S(R(D))) and Q(S(D)), i.e. Q(S(R(D))) returns the same result as Q(S(D)) for every XML document D.

We take care that our approach uses standard XSLT processors for transforming XML documents according to the XSLT stylesheet S (and according to an XSLT query QXSLT respectively) and standard XPath evaluators for evaluating a given XPath query QXPath. Standard XSLT processors and standard XPath evaluators have been available for several years as well-tested free software or as commercial products, either as stand-alone programs or as programming libraries for all established programming languages.

The rest of the paper is organized as follows: We describe the related work in Section 1.2. Section 2 deals with the essentials, basic definitions and lemmas. Section 3 presents the first phase of the query transformation approach for XPath queries on XSLT views: In the first phase, we search for paths in the XSLT stylesheet which generate elements, attributes and attribute values in the correct order, i.e. as needed in order to answer a given XPath query. For each of these successfully searched paths we determine in the second phase the input path expression of the XSLT stylesheet (see Section 4), which summarizes the XPath expressions of input nodes along the stylesheet path. The transformed query R is the disjunction of the determined input path expressions of each successfully searched path. In Section 5, we prove the correctness of our query transformation algorithm. Section 6 extends our approach to XSLT queries. Section 7 presents approaches for retrieving the resultant XML fragment R(D), whereas Section 8 presents the experimental results. We close with the summary and conclusions in Section 9.

1.2. Related work

The query reformulation approach is known from relational databases.
In comparison to our query transformation approach, applying the reformulated query transforms the retrieved data into the target format in one step [12]. Given two schemas Forig and Ftransf and a correspondence S between them, find a query R formulated in terms of schema Forig that is equivalent to a given query Q formulated in terms of schema Ftransf modulo the correspondence S. In our approach, we perform the transformation of XML data in a separate step in order to use standard XSL processors and standard XPath evaluators wherever possible. As in the approach of query reformulation in relational databases, our approach does not use a direct connection to a database for transforming the query. Unlike the XML data model, the classical relational model and the deductive data model have no hierarchy, treat element order as insignificant, and do not support identity. These properties of the XML data model have to be considered in query reformulation approaches for XML. Therefore, query reformulation approaches for other data models cannot be applied in pure XML environments, where only XML is used. In comparison to approaches, where XML data and XML queries are first mapped to relational databases and relational queries [15,23,25,33–36], we


support whole XPath [39] and whole XSLT [38]. Furthermore, we can avoid performance bottlenecks, which occur whenever complex mappings are processed, by using an XML query reformulation approach, where only XML data and XML languages are used. For the transformation of XML queries into queries based upon other data formats, at least two major research directions can be distinguished. First, the mapping of XML queries to other data storage formats of object-oriented or of relational databases (e.g. [7,12]), and second, the transformation of XML queries or XML documents into other XML queries or XML documents based on another XML data format (e.g. [1]). We follow the second approach; however, we focus on XSL [38] for the transformation of both data and XPath [39] queries. In related contributions to schema integration, two approaches to data and query translation can be distinguished. While the majority of contributions (e.g. [10,2]) map the data to a unique representation, we follow [8] and map the queries to those domains where the data resides. The contribution in [11] contains query reformulation according to path-to-path mappings. We go beyond this, as we use XSLT as a more powerful mapping language. Ref. [29] describes how XSL processing can be incorporated into database engines, but focuses on efficient XSL processing. The complexity of XPath query evaluation on XML documents is examined in [17]. In comparison, we use an evaluation based on output nodes of XSLT and consider query transformation. Refs. [13] and [14] present algorithms to filter an XML document according to a given query and analyze the performance, but the algorithms do not contain query transformation. The contributions [16,18,32] introduce an algebra for XQuery. Additionally, they list transformation and optimization rules based on the introduced algebra, but they do not contain an XQuery variant of the optimization approaches presented here. Ref. 
[9] describes how the language XQuery can be extended to support views. It describes the language extensions but does not describe how to optimize. Ref. [28] projects XML documents to a sufficient XML fragment before processing XQuery queries. Ref. [28] contains a static path analysis of XQuery queries, which computes a set of projection paths formulated in XPath from an arbitrary XQuery expression. In comparison to this approach we describe, among other things, a path analysis in XSLT stylesheets depending on an input XPath query. Furthermore, we analyze paths in recursive calls (of templates). In order to retrieve the resultant XML fragment, we extend the approach of a loader contributed by [28] to evaluating filter expressions on XML streams and call it the SAXFilter approach. We were influenced by an approach [30], which also evaluates filter expressions on XML streams, but returns a node set and not a resultant XML fragment. Ref. [26] describes how to transform XQuery expressions into XSLT stylesheets. The approach of [26] can be used in order to extend our proposed approach to support XQuery as query language and as view language. In contrast to all these approaches, we focus on the transformation of XPath queries and XSLT queries according to an XSLT stylesheet. In this paper, we extend our previous contributions of [19–22] among other things by a sketch of a query transformation algorithm, which supports whole XPath and whole XSLT (in comparison to restricted subsets of XPath and XSLT in our previous contributions), by a proof of its correctness, by extending our approach to XSLT queries, by introducing a new SAX based approach for retrieving the resultant XML fragment and further experiments.


2. Essentials

Whereas expressions of the XPath language describe node sets of XML documents, XSLT stylesheets describe the transformation of XML documents. We give a short introduction to the XPath language in Section 2.1 and the XSLT language in Section 2.2 so that the reader can understand the examples in this paper. We refer to [39] for a full description of XPath, and we refer to [38] for a full description of XSLT. We present an example of a query transformation of an XPath query according to an XSLT stylesheet in Section 2.3. In Section 2.4, we outline how an XSLT processor executes an XSLT stylesheet, because the knowledge of the execution algorithm of an XSLT processor is necessary for understanding the query transformation algorithm and its proof of correctness. Section 2.5 presents basic definitions and lemmas. In Section 2.6, we define the state of an XSLT processor.

2.1. XPath essentials

The W3C developed the XPath language to describe node sets of XML documents. XPath expressions consist of location steps separated by a slash ("/"). Each location step contains

• an axis. Among the possible axes are the child axis, which contains all child XML nodes of a context XML node, the descendant-or-self axis containing all descendants and the context XML node itself, and the attribute axis containing all attributes of a context node.
• a node test. Among the possible node tests are a name node test for a specific name A (declared by A itself) and for an arbitrary name (declared by the wildcard "*"), and a node test node() for all node types.
• an arbitrary number of predicates. A predicate is enclosed by the brackets [ and ]. A predicate contains a Boolean expression; for example, a comparison of an XPath expression with a constant string or number.

For example, the XPath query

QXPath = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*/parent::node()

consists of the location steps

• /child::product_list with a child axis and a product_list name node test,
• /child::product[attribute::label="cockpit"] with a child axis, a product name node test and the [attribute::label="cockpit"] predicate, which checks whether the label attribute exists and has the value "cockpit",
• /attribute::* with an attribute axis and a node test for an arbitrary name ("*"), and
• /parent::node() with a parent axis and a node test for an arbitrary node.

Starting with the document root of an XML document, each location step from left to right describes which XML nodes must be considered for the next location step by following the axis


Fig. 2. Example of the transformation by an XSLT stylesheet S.

for the XML nodes of the previous location step, checking the node test and the predicates. The whole XPath expression describes the considered XML nodes of the last location step. The XPath query QXPath on the XML document S(D) of Fig. 2 describes a node set containing the XML node (R4), i.e. the product element whose label attribute has the value "cockpit".

Note that there exists an abbreviated syntax for XPath expressions (see [39]), which can be easily transformed into the long notation presented here. In the following sections, we use only the long notation for XPath expressions.

2.2. XSLT essentials

The W3C developed the declarative language XSLT, which describes the transformation of XML documents by template rules. An XSLT stylesheet itself is an XML document with a root element <xsl:stylesheet>. The xsl namespace is used to distinguish XSLT elements from other elements. Template rules are expressed by an <xsl:template> element. Its match attribute contains a pattern in the form of an XPath expression. Whenever a current input XML node fulfills the pattern of the match attribute, the template is executed. An XSLT processor starts the transformation of an input XML document with the current input XML node assigned to the document root.

Using a short form, the output of the executed template consists of the XML nodes that are not XSLT instructions and of the text inside the executed template. This output can also be described by a long form with the XSLT instructions <xsl:element> for generating XML elements, <xsl:attribute> for generating attributes of an XML element and <xsl:text> for generating text. Note that the short form can be easily transformed into the long form, which we use in this paper. Output is also described by the XSLT instruction <xsl:value-of>, which converts the result of an XPath expression to a string. The XSLT instruction <xsl:apply-templates> recursively applies the templates to all XML nodes in the result node set of the XPath expression given by its select attribute. We refer to [38] for a complete list of XSLT instructions.
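The template mechanism described above can be illustrated with a small Python sketch. It is not an XSLT processor and it is not the stylesheet S of Fig. 2, which is not reproduced here. It only mimics template rules (plain functions keyed by element name) and the recursive behaviour of apply-templates. The element names object, contains, product_list and product are borrowed from the queries used in this paper; the mapping from the name attribute to the label attribute, the sample document and all function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def apply_templates(nodes, rules, out):
    """Mimic apply-templates: run the matching rule for every selected node."""
    for node in nodes:
        rule = rules.get(node.tag, builtin_rule)
        rule(node, rules, out)

def builtin_rule(node, rules, out):
    """Stand-in for the built-in element rule: only recurse into the children."""
    apply_templates(list(node), rules, out)

def object_rule(node, rules, out):
    """Plays the role of a template with match="object" (hypothetical rule)."""
    # Output node: generate one flat <product> element with a label attribute.
    ET.SubElement(out, "product", {"label": node.get("name", "")})
    # Nested objects are appended to the same parent, which flattens the tree.
    apply_templates(node.findall("contains/object"), rules, out)

def transform(source_xml):
    """Plays the role of the template for the document root."""
    root = ET.fromstring(source_xml)
    out = ET.Element("product_list")
    apply_templates([root], {"object": object_rule}, out)
    return out

# Hypothetical input document with nested objects.
D = ('<object name="plane"><contains>'
     '<object name="cockpit"/><object name="wing"/>'
     '</contains></object>')
labels = [product.get("label") for product in transform(D)]
```

Each rule receives the current input node and the current output node, which mirrors how an executed template reads from the input document and attaches its output to the result tree.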


For an example of transforming an XML document by an XSLT stylesheet, see Fig. 2: The XSLT stylesheet S transforms the representation of nested objects (XML document D) into a flat model of a list of products, i.e. the transformed XML document S(D).

2.3. Example of an XPath query transformation according to an XSLT stylesheet

Assume that we have to answer an XPath query

QXPath = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*/parent::node()

on the transformed XML document S(D) of Fig. 2. It is sufficient to transform only a resultant XML fragment R(D) (see the bold face part in the left box of Fig. 2) for answering QXPath, where R is a query in XML format Forig computed by our new query transformation algorithm. Notice that standard XPath evaluators only return a query result as a node set, not as a resultant XML fragment. In the example it is sufficient for answering QXPath to transform the resultant XML fragment (see the bold face part of the left box in Fig. 2) of the query

R = /child::object(/child::contains/child::object)^n[attribute::name="cockpit"]

where A^n is a short notation for an arbitrary number of repetitions of the path A. Note that the XPath language does not support A^n, but we can retrieve a fragment containing a superset of the required XML nodes by replacing A^n with the XPath expression

/descendant-or-self::node()

if A contains only self, child, descendant and descendant-or-self axes, and with the XPath expression

/ancestor-or-self::node()/descendant-or-self::node()(/self::node() | /attribute::node() | /namespace::node())

for an arbitrary A.

2.4. Outline of the execution of an XSLT stylesheet by an XSLT processor

We outline the execution of an XSLT stylesheet by an XSLT processor for the following reasons: The first reason is to understand how an XSLT processor transforms an XML document and to introduce a formal XSLT processor model. The second reason is to develop a query transformation algorithm, and the final reason is to prove the correctness of the query transformation algorithm based on the formal XSLT processor model.

To simplify the XSLT processor model, we reduce the number of cases of XSLT instructions by first transforming XSLT instructions into their long form. Let us consider the case that an XSLT instruction I contains an attribute A which is set to the value of one or more XPath expressions, i.e. the case of an attribute A="{XP}", where XP is an XPath expression. In order to keep the XSLT processor model simple, we assume that the XSLT processor evaluates XP as if it were a child node of I. Furthermore, other attributes of I containing an XPath expression are also executed as if they were child nodes of I.

Fig. 3 presents the pseudocode of an XSLT processor. Whereas the algorithm processXSLTStylesheet initializes the XSLT processor, the algorithm processXSLTNode processes the given XSLT node N and its child XSLT nodes


Fig. 3. Algorithms to transform an input XML document iXML according to an XSLT stylesheet S into an output XML document oXML.

recursively. The XSLT processor starts with the root node of the input XML document iXML (see Fig. 3, line 5), then computes that XSLT node (or a built-in template respectively), which contains the template with the highest priority that matches the root node (see Fig. 3, line 6). The XSLT processor processes this template by calling the algorithm processXSLTNode (see Fig. 3, line 8). If a new input node set is selected in the current XSLT node N (see Fig. 3, line 15), the XSLT processor determines the input XML node set inputNodeSet of the current XSLT node N by computing a newly selected node set (see Fig. 3, line 16) and then assigns the input XML node K to the first unmarked node in inputNodeSet (see Fig. 3, line 17). The XSLT processor selects the new input node set by calling a method getNodeSet of an internal XPath evaluator XPathEvaluator with three parameters: the first parameter is the XPath expression in the select attribute of the current XSLT node N. The second parameter is the current context node K of the input XML document iXML, and the third parameter is the input XML document iXML itself. Otherwise, the input XML node set is assigned to a node set containing only the current input XML node K (see Fig. 3, line 18). After that, the XSLT processor iterates through the input XML node set inputNodeSet (see Fig. 3, line 19), and marks the current input XML node K of the input node set inputNodeSet to be processed (see Fig. 3, line 20). In each iteration step, the XSLT processor executes the current XSLT node N of a given XSLT stylesheet S. For this


purpose, line 21 contains a call of the method executeOneXSLTNode with three parameters, i.e. the current XSLT node N, the input XML node K, and the current XML node o in the output XML document. We do not describe the implementation of the method executeOneXSLTNode here, because [38] describes the execution semantics of the different types of XSLT nodes. Note that the execution of an XSLT node (for example <xsl:apply-templates>, <xsl:call-template> or <xsl:apply-imports>) can cause the execution of a whole template. If the current XSLT node generates output XML nodes oz, then these nodes oz are attached as child nodes of o. Furthermore, the XSLT processor returns each XML node oz in the output XML document (see Fig. 3, line 21), which serves as the current XML node in the output XML document in line 23. There the XSLT processor processes all child nodes DN of the current XSLT node N recursively by calling the method processXSLTNode with the parameters child node DN, current input XML node K, input XML document iXML and output XML node oz.

2.5. Used definitions and lemmas

We define the following terms for later use:

Definition 2. An XPath expression I can be divided into a relative part rp(I) and an absolute part ap(I) (both of which may be empty) in such a way that rp(I) contains a relative path expression, ap(I) contains an absolute path expression, and the union of ap(I) and rp(I), i.e. ap(I) | rp(I), is equivalent to I, i.e. the evaluation of I returns the same node set as the evaluation of ap(I) | rp(I) for all XML documents and for all context nodes in the current XML document.

Example 1. The relative part of

I = (/child::E1 | child::E2/child::E3 | child::E4)/child::E5

is

rp(I) = (child::E2/child::E3 | child::E4)/child::E5

and the absolute part is

ap(I) = /child::E1/child::E5

Lemma 1. Let N1 = <xsl:apply-templates select="I"/> be an XSLT node, let N2 = <xsl:template match="M"> be an XSLT node in the same XSLT stylesheet, and let Mi with i in {1, ..., k} be the match attributes of the templates with a higher priority than N2. There exists an incomplete tester T, which tests whether or not N2 can be called from N1. T is incomplete as it returns "maybe called" if there may exist an XML document for which N2 will be called from N1. If T returns "not called", then we are sure that N2 will not be called from N1 for any XML document.

Proof of Lemma 1. We can prove the lemma by a trivial function, which returns "maybe called" for every input. In the following, we prove the existence of an incomplete tester T, which returns better results, by using an incomplete intersection tester for XPath expressions.


N2 is called from N1 if the node set I(K), which I selects in the context of the current input XML node K of N1, and M are not disjoint, i.e. $I(K) \cap M \neq \emptyset$, and the templates with a higher priority than N2 do not consume all XML nodes of $I(K) \cap M$, which can be tested by

$(I(K) \cap M) \setminus (I(K) \cap M \cap (M_1 \cup \dots \cup M_k)) \neq \emptyset$.

We can choose an XML document for which N2 is called from N1 by adding at least one node of $(I(K) \cap M) \setminus (I(K) \cap M \cap (M_1 \cup \dots \cup M_k))$ to the XML document. As there is no logical tester which checks whether or not the difference of XPath expressions is empty, the tester T neglects templates with a higher priority.

There exist fast (but incomplete) testers (e.g. the testers presented in [5,6]) for the logical intersection test of XPath expressions. The testers are incomplete but provide one-sided correctness: if the tester returns "disjoint", then we are sure that the XPath expressions are disjoint; otherwise the tester returns "maybe not disjoint". We determine a superset of I(K) independent of the current XML node K by

ap(I) | /descendant-or-self::node()(/self::node() | /attribute::node() | /namespace::node())/rp(I)

and we use

ap(M) | /descendant-or-self::node()(/self::node() | /attribute::node() | /namespace::node())/rp(M)

instead of M. The tester T can be implemented by returning "not called" if the result of the logical intersection tester for these two expressions is "disjoint", and otherwise "maybe called" (which can also be returned in the case that the intersection tester does not support the used XPath expressions). □

Definition 3. The successor XSLT nodes of a currently executed XSLT node N1 are those XSLT nodes N2 which will be the next executed XSLT node in the call for the XSLT node N1 of the algorithm processXSLTNode (see Fig. 3) in line (21) or in line (23).

Proposition 1. An XSLT node N2 is the successor XSLT node of an XSLT node N1, if

• N2 is a child node of N1 in the XSLT stylesheet, or
• N1 is an XSLT node with an attribute xsl:use-attribute-sets=N and N2 is an XSLT node <xsl:attribute-set name=N> with the same N, or
• N1 is an XSLT node <xsl:call-template name=N> and N2 is an XSLT node <xsl:template name=N> with the same N, or
• N1 is <xsl:apply-templates select=I/> and N2 is <xsl:template match=M> and the template of N2 is called for at least one node of the selected node set I.

Proof of Proposition 1. We can conclude Proposition 1 from the semantics of the XSLT instructions described in [38]. □

If we determine the successor XSLT nodes independently of the current input XML document, we can determine a superset of the successor XSLT nodes N2 = <xsl:template match=M> of an XSLT node N1 = <xsl:apply-templates select=I/> by using a tester T (see Lemma 1), which checks whether an XML document exists for which N2 is called from N1.

Example 2. Fig. 4 contains all successor XSLT nodes of the XSLT stylesheet S of Fig. 2.


Fig. 4. Successor XSLT nodes of XSLT stylesheet S in Fig. 2.

Definition 4. If the XSLT node N2 is a successor XSLT node of N1, we call N1 the predecessor XSLT node of N2.

Definition 5. Given an XPath expression XP and an input XML document D, we say an XML node K of the input XML document D is successfully visited by an XPath evaluator, if the following holds: there exist two substrings XP1 and XP2 of XP so that the concatenation of these substrings XP1 and XP2 is equal to XP, i.e. XP = XP1 + XP2. Furthermore, the XPath evaluator must retrieve a node set that contains K, if the XPath evaluator parses and executes XP1 on D. We set the current context node of the XPath evaluator to K, and then further parsing and processing of XP2 must return a non-empty result. In this situation, we also say K is relevant to answer the query XP.

Example 3. Let XP1 = /child::product_list/child::product and XP2 = [attribute::label="cockpit"]/attribute::*. The XPath evaluator retrieves the XML nodes (R2), (R3), (R4) and (R5) of Fig. 2 for the evaluation of XP1 on the XML document S(D), but the XPath evaluator retrieves a non-empty result for the evaluation of XP2 only in the context of (R4). Therefore, (R4) is successfully visited by an XPath evaluator for XP = XP1 + XP2 = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*.

Definition 6. Let D be an input XML document. Furthermore, let N be an XSLT node of a given XSLT stylesheet S so that an execution of N generates output in the form of an XML node O while the XSLT processor transforms D to S(D). N is relevant to answer the query QXPath, if O is relevant to answer the query QXPath on S(D).

Example 4. (R4) of Fig. 2 is successfully visited while evaluating QXPath = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*/parent::node() (see Example 3). As (R4) is generated by the XSLT node (S6) of Fig. 2, (S6) is relevant to answer the query QXPath.

Definition 7. The resultant XML fragment R(D) of a given XML document D contains all XML nodes which are successfully visited by an XPath evaluator when the XPath evaluator processes a given query R given in XML format Forig, and all their ancestors up to the root of D. Furthermore, the same parent-child and sibling relationships as in D are set between the XML nodes of the resultant XML fragment R(D).
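Definition 7 can be made concrete with a short Python sketch that builds a fragment from a set of relevant element nodes by keeping these nodes and all their ancestors. It is a simplification under stated assumptions: it works on element nodes only, copies all attributes of the kept elements, and drops text. The sample document and the function name resultant_fragment are hypothetical, and the relevant nodes are chosen here with a descendant-or-self style superset of the query R of Section 2.3 rather than by a full XPath evaluator.

```python
import xml.etree.ElementTree as ET

def resultant_fragment(root, relevant):
    """Copy of the tree that keeps only the relevant elements and their ancestors."""
    parent = {child: node for node in root.iter() for child in node}
    keep = set()
    for node in relevant:
        # Walk up to the root; stop early when an ancestor is already kept.
        while node is not None and node not in keep:
            keep.add(node)
            node = parent.get(node)
    if root not in keep:
        return None                  # nothing is relevant: empty fragment

    def copy(node):
        clone = ET.Element(node.tag, dict(node.attrib))
        for child in node:           # document order of siblings is preserved
            if child in keep:
                clone.append(copy(child))
        return clone

    return copy(root)

# Hypothetical document D with nested objects.
D = ET.fromstring(
    '<object name="plane"><contains>'
    '<object name="cockpit"/><object name="wing"/>'
    '</contains></object>')

# Superset of R: every object element (at any depth) named "cockpit".
relevant = [e for e in D.iter("object") if e.get("name") == "cockpit"]
fragment = resultant_fragment(D, relevant)
kept_names = [e.get("name") for e in fragment.iter("object")]
```

The wing object is dropped because it is neither relevant nor an ancestor of a relevant node, while the plane object and the contains element survive as ancestors of the cockpit object.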


2.6. State of an XSLT processor

Considering the algorithm of an XSLT processor (see Fig. 3), we can define a state of an XSLT processor as follows:

Definition 8. The state X of an XSLT processor transforming an input XML document D according to an XSLT stylesheet S is a tuple (N, XP, K, I), where

• N is the currently executed XSLT node of S.
• XP is the state of the execution of the predecessor XSLT node of N. We call XP the predecessor state of X and define a function predecessor(X), which returns the predecessor state XP (or null if X is the first state).
• K is the current input XML node of the input XML document D.
• I is the input XML node set of the current XSLT node N, in which the XML nodes that are already processed are marked.

Definition 9. Let an input XML document D and an XSLT stylesheet S be given. The execution sequence Z for D is the sequence of states of an XSLT processor while transforming the input XML document D according to the XSLT stylesheet S.
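The state tuple of Definition 8 and the execution sequence of Definition 9 map naturally onto a small data structure. The following Python sketch is one possible encoding. The field names, the helper stylesheet_path and the three sample states are assumptions made for illustration and are not taken from Fig. 5.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class State:
    """One state X = (N, XP, K, I) of the XSLT processor (Definition 8)."""
    node: str                                # N: currently executed XSLT node
    pred: Optional["State"]                  # XP: predecessor state, None for the first state
    input_node: str                          # K: current input XML node
    input_set: Tuple[Tuple[str, bool], ...]  # I: (input node, already processed?) pairs

def predecessor(state):
    """The function predecessor(X) of Definition 8."""
    return state.pred

def stylesheet_path(state):
    """XSLT nodes from the first state up to `state`, following predecessor links."""
    path = []
    while state is not None:
        path.append(state.node)
        state = predecessor(state)
    return path[::-1]

# Hypothetical states; "/" stands for the document root of the input document.
x1 = State("(S1)", None, "/", (("/", True),))
x2 = State("(S2)", x1, "/", (("/", True),))
x3 = State("(S3)", x2, "/", (("/", True),))
execution_sequence = [x1, x2, x3]            # Definition 9: states in execution order
```

Following the predecessor links from any state recovers the sequence of XSLT nodes that led to it, which is the kind of stylesheet path the search in Section 3 looks for.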

Fig. 5. The execution sequence (see Definition 9) of states of the XSLT processor and the relation of predecessor states (see Definition 8) while transforming D according to S of Fig. 2. We use the node identifiers of Fig. 2.


Example 5. Fig. 5 contains the execution sequence of the XSLT processor while the XSLT processor executes the XSLT stylesheet S on the XML document D in Fig. 2. In Fig. 5, we use the node identifiers of Fig. 2.

3. Searching for relevant output nodes

In our new query transformation algorithm for determining R, we first search for paths in the XSLT stylesheet which generate elements, attributes and attribute values in the correct order, i.e. as needed in order to answer a given XPath query QXPath. Whereas we describe in this section the search by enumerating the modifications to a normal XPath evaluator, which supports whole XPath, we present in Section 3.1 an example modified XPath evaluator in pseudocode (see Section 3.1.2), which supports a subset of XPath (see Section 3.1.1).

The query QXPath is evaluated on the transformed XML document, which is generated by the XSLT processor in output nodes of the XSLT stylesheet S. Therefore, we first look at the output nodes of the XSLT stylesheet S:

Definition 10. Output nodes of an XSLT stylesheet S are those XSLT nodes of S, which

• generate an element E by the XSLT node <xsl:element name="E">,
• generate an attribute A by the XSLT node <xsl:attribute name="A">,
• generate text nodes containing "content" by the XSLT node <xsl:text>content</xsl:text>, or
• copy the content of XML nodes I of the input XML document by the XSLT node <xsl:value-of select="I"/>, or
• copy whole element nodes of the input XML document by the XSLT nodes <xsl:copy> or <xsl:copy-of select="I"/>, or
• generate a number by the XSLT node <xsl:number>.

In the example of Fig. 2, all the product_list elements in S(D) in the right part of Fig. 2 are generated by the node (S3) of S (see the middle box of Fig. 2), and all the product elements (R2), (R3), (R4) and (R5) in S(D) are generated by node (S6). These output nodes (S3) and (S6) of the XSLT stylesheet S are reached after a sequence of nodes of the XSLT stylesheet S is executed. In the example, <(S1), (S2), (S3), (S4), (S5), (S6)> is one sequence which reaches the nodes (S3) and (S6), i.e. which generates output that is relevant to answer an XPath query /child::product_list/child::product.

We modify an XPath evaluator in order to search for all relevant output nodes, which can generate output that is relevant to the query QXPath in the correct order. Our approach requires evaluating an XPath query on an XSLT stylesheet S, whereas an XPath evaluator is typically constructed to evaluate an XPath query on an input XML document. Therefore, the search must be modified in the following way:

• Similar to the search by a normal XPath evaluator, we start the search at the document root node of the XSLT stylesheet document.
• The search can pass non-output nodes, as they do not generate any output which is relevant to answer QXPath.

• The search continues from an XSLT node N1 to an XSLT node N2 whenever N2 is a successor XSLT node of N1.
• Whenever a filter expression of QXPath compares the value of an attribute or text node with a constant value V, and this value is generated as a constant const by an XSLT node, the modified XPath evaluator can decide without access to the XML document whether the filter is always true or always false by comparing the two constants V and const. Otherwise, the value of the attribute or text node is copied from the content of the input XML document. In this case, we transform the filter of QXPath into a filter in the XML format Forig in R (see Section 4.5), which restricts the node set of the input XML document more precisely when we apply R. As we transform the query QXPath independently of any given XML document, our modified XPath evaluator evaluates all other filter expressions to true if there exists an input XML document for which the elements, attributes and text nodes referenced in the filter expression are generated in the XSLT stylesheet S (or are not generated, for a filter expression in the scope of an odd number of not operators).
• In contrast to the evaluation on XML documents, the search in an XSLT stylesheet may revisit a node N of the stylesheet without any progress in the processing of QXPath. For example, in Fig. 2 the XPath evaluator can visit the XSLT nodes (S1), (S2), (S3), (S4), (S5), then the XSLT node (S9), and then the XSLT node (S5) again. We call this a loop. In the example of Fig. 2, the loop contains the nodes (S9) and (S5). In order to avoid an infinite search, we do not continue the search at the final node N once the loop is detected.

Furthermore, our modified XPath evaluator must log different paths in the XSLT stylesheet, which we use for the further computation of the transformed query R in Section 4.
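The search rules listed above can be illustrated with a small sketch. It is not the algorithm of Fig. 7: the successor graph, the set of output nodes and the function name `next_output_paths` are assumptions made for illustration, loosely modelled on the node sequences mentioned in the text. The sketch passes non-output nodes, stops at the next output node, and records a loop instead of continuing when a node is revisited without progress.

```python
from typing import Dict, List, Optional, Set, Tuple

Path = List[str]

# Hypothetical successor graph of XSLT nodes (illustrative, not Fig. 2 itself).
SUCCESSORS: Dict[str, List[str]] = {
    "S1": ["S2"],
    "S2": ["S3"],
    "S3": ["S4"],
    "S4": ["S5"],
    "S5": ["S6", "S9"],
    "S9": ["S5"],  # leads back to S5, which closes a loop
    "S6": [],
}
# Assumed output nodes, following the running example in the text.
OUTPUT_NODES: Set[str] = {"S3", "S6"}


def next_output_paths(
    path: Path, segment_start: Optional[int] = None
) -> Tuple[List[Path], List[Path]]:
    """Extend `path` through non-output nodes until an output node is reached.

    Returns (paths that end at the next output node, detected loops).
    A loop is the current path minus the path up to the first visit of the
    revisited node.
    """
    if segment_start is None:
        # Only nodes visited since the last progress count for loop detection.
        segment_start = len(path) - 1
    found: List[Path] = []
    loops: List[Path] = []
    for succ in SUCCESSORS.get(path[-1], []):
        if succ in path[segment_start:]:
            first_visit = path.index(succ, segment_start)
            loops.append(path[first_visit + 1:] + [succ])
            continue  # do not continue the search at the revisited node
        extended = path + [succ]
        if succ in OUTPUT_NODES:
            found.append(extended)
        else:
            sub_found, sub_loops = next_output_paths(extended, segment_start)
            found.extend(sub_found)
            loops.extend(sub_loops)
    return found, loops
```

Starting from the path that ends at (S3), the sketch finds the path to (S6) and records the loop consisting of (S9) and (S5), which mirrors the loop described for Fig. 2.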
In the following, we define the types of paths in the XSLT stylesheet and explain how the modified XPath evaluator uses them to log the search.

Definition 11. A stylesheet path is a list of tuples (E, N, A, F, L), where E is an XPath expression, N is an XSLT node, A represents attached paths, F filter stylesheet paths and L loop stylesheet paths. Whereas the attached paths A and the loop stylesheet paths L are lists of stylesheet paths, the filter stylesheet paths F are a list of tuples (s, o, c), where s is a stylesheet path, o is one of the operators "<", "<=", "=", ">", ">=" or "!=", and c is a constant string or a constant number.

See Fig. 6 for the logged stylesheet paths of the example XSLT stylesheet S in Fig. 2 and the example query QXPath = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*/parent::node().

• We call the stylesheet path that contains all XSLT nodes visited on the path from the start node to the current node of the search, in the order visited, the current stylesheet path. Each entry (E, N, A, F, L) of the current stylesheet path contains the suffix E of QXPath that still has to be searched for at the visited XSLT node N.

Fig. 6. Result of output path search. For simplicity of presentation, we depict only the XSLT node of each entry of the stylesheet paths and not the whole entry according to Definition 11.

• We call the stylesheet paths that begin at the start node of the search and may generate output relevant to QXPath detected stylesheet paths.
• The modified XPath evaluator can return from an XSLT node N1 to an XSLT node N2 that has already been visited in the current stylesheet path, because it processes a reverse axis (parent, ancestor, ancestor-or-self, preceding or preceding-sibling), evaluates the XPath expressions of the parameters of a built-in function, or has processed a filter expression before. We then store the stylesheet paths from the XSLT node N2 to the XSLT node N1 separately in the attached paths, which are attached to the XSLT node N2 of the current stylesheet path.
• We can transform the currently processed filter of QXPath into a filter in the XML format Forig in R (see Section 4.5). Whenever the value of an attribute or text node in the filter expression is compared with a constant value and is copied from the input XML document by an output node that copies content (see Definition 10), the transformed filter in R restricts the node set of the input XML document more precisely when we apply R. In this case, we store the filter operator and the constant value of the comparison, together with the stylesheet paths resulting from the search for the filter, in the filter stylesheet paths attached to the current stylesheet path (e.g., for [attribute::label="cockpit"] see nodes (S7) and (S8) in Fig. 2).
• For each loop detected for a current XSLT node N, we generate a stylesheet path loop, which is the current stylesheet path minus the stylesheet path of the first visit of N. We store loop as an entry in the list of loop stylesheet paths attached to the current stylesheet path, because we need to know the input that is consumed when the XSLT processor executes the nodes of loop (see Section 4.6).

3.1. A simple modified XPath evaluator for searching for relevant output nodes

In order to keep the presentation compact, the following subsections present the pseudocode of a simple modified XPath evaluator, which supports a smaller subset of XPath and XSLT than the general modified XPath evaluator outlined in Section 3. The considered subsets are described in Section 3.1.1, and the algorithm itself is described in Section 3.1.2. The pseudocode can be extended so that the general modified XPath evaluator outlined in Section 3 is implemented.

3.1.1. Considered subsets of XPath and XSLT of the simple modified XPath evaluator

For our simple modified XPath evaluator, we restrict XPath queries QXPath to the subset SXPath of XPath (see Definition 12).

Definition 12. The subset SXPath of XPath is defined by the following rule for LocationPath, given in Extended Backus Naur Form (EBNF):

LocationPath ::= ("/" Step)*.
Step ::= Axis "::" NodeTest Predicate*.
Axis ::= "child" | "descendant-or-self" | "attribute".
NodeTest ::= QName | "node()" | "*".
Predicate ::= "[" Step ("/" Step)* Operator (String | Number) "]".
Operator ::= "<" | "<=" | "=" | ">=" | ">" | "!=".

where QName tests a qualified name, String represents a string constant and Number a number constant. This subset of XPath allows querying for an XML fragment that can be described by succeeding elements (at arbitrary depth). Filter expressions allow restricting the values of paths to a constant value.

Similarly, we restrict XSLT, i.e., we consider the following nodes of an XSLT stylesheet:

(list of 17 XSLT node types; the node names are not recoverable in this copy of the text)

In these nodes, I and M contain an XPath expression without function calls, T is a Boolean expression and N is a string constant. Whenever attribute values are generated by the XSLT stylesheet, we assume that each attribute value is generated in one XSLT node.

3.1.2. Pseudocode of the simple modified XPath evaluator

The algorithm getNextOutputNodes of Fig. 7 has three input parameters. The parameter e represents the part of the XPath expression that still has to be searched for. The parameter s is a stylesheet path. The parameter t represents the type of the output nodes to be searched for and can be ELEMENT, ATTRIBUTE or CONTENT. The algorithm getNextOutputNodes adds successor XSLT nodes to s until the next output node of type t is reached. Loops are recognized in line 43 and stored in line 44. Lines 45 to 47 check whether an output node of the correct type has been reached. Line 49 checks whether the current node is not an output node, in which case line 50 searches the successor XSLT nodes of the current node by calling getNextOutputNodes recursively.

The algorithm evaluateXPath of Fig. 7 has two input parameters: e represents the part of the XPath expression that still has to be searched for, and s is an input stylesheet path. The algorithm evaluateXPath starts its search at the end of s, searches for e and returns all stylesheet paths that are relevant for e. Lines 8 to 18 of Fig. 7 evaluate the axis of the current location step; support for more axes can be added there. Lines 19 to 33 handle the filters of the current location step. Lines 25 and 26 check whether a filter is always true, and line 29 checks whether a filter can be transformed into a filter on the input XML document. If a filter cannot be fulfilled, the corresponding stylesheet path is deleted in line 33.
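The three possible outcomes of the filter handling described above (always true, never fulfilled, or transformable into a filter on the input document) can be sketched as follows. This is a minimal illustration and not the code of Fig. 7: the function name `classify_filter`, the result labels and the convention that `None` stands for a value copied from the input document are assumptions, and both constants are assumed to have the same type (no XPath type conversion is modelled).

```python
import operator
from typing import Callable, Dict, Optional, Union

Const = Union[str, float]

# The comparison operators allowed in filters of the considered XPath subset.
OPS: Dict[str, Callable[[Const, Const], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "!=": operator.ne,
}


def classify_filter(generated: Optional[Const], op: str, query_const: Const) -> str:
    """Classify a filter [P op query_const] without access to any XML document.

    `generated` is the constant produced by the stylesheet for P, or None if
    the value of P is copied from the input XML document (hypothetical
    convention of this sketch).
    """
    if generated is None:
        # The value depends on the input document: keep the filter for R.
        return "transform"
    # Both sides are constants, so the outcome is known in advance.
    return "always-true" if OPS[op](generated, query_const) else "always-false"
```

A stylesheet path whose filter is classified as "always-false" would be dropped, while "transform" corresponds to attaching a filter stylesheet path.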
Lines 34 and 35 evaluate the rest of the XPath expression after the current location step by calling evaluateXPath recursively.

Let us assume that an XSLT stylesheet S and an XPath query QXPath are given. We retrieve the detected stylesheet paths and their attached paths for transforming QXPath according to S by calling the algorithms of Fig. 7 in the following way:

Fig. 7. Algorithm for output path search.

l = evaluateXPath(QXPath, <(QXPath, root of S, null, null, null)>)
for each pl ∈ l do lc.addAll(getNextOutputNodes(null, pl, CONTENT))

The first call of evaluateXPath computes all detected stylesheet paths of S according to QXPath. The additional call of getNextOutputNodes is necessary because QXPath queries for the content of an element or an attribute. Before getNextOutputNodes is called, the stylesheet paths of l cover only the generation of this element or attribute itself. Calling getNextOutputNodes therefore extends the stylesheet paths of l to the XSLT nodes that generate the content of the element or attribute. The final result is the set of detected stylesheet paths stored in lc.

4. Determining the sufficient node set of the original document

While executing the detected stylesheet paths (and the attached stylesheet paths, filter stylesheet paths and loop stylesheet paths) computed in Section 3, the XSLT processor also processes input nodes (e.g. node (S4) in Fig. 2), each of which selects a certain node set of the input XML document D, described by a local input path expression I.

Definition 13. The input nodes of an XSLT stylesheet S with local input path expression I are
• ,
• , where I = self::node(),
• , where I = I'/descendant-or-self::node()(/self::node() | /attribute::node() | /namespace::node()) as copies whole subtrees of I',
• , where we deal with this XSLT node as it is on an attached stylesheet path,
• , where we deal with this XSLT node as it is on an attached stylesheet path,
• , where we deal with this XSLT node as it is on an attached stylesheet path,
• and
• .

Furthermore, the following XSLT nodes, which we call test nodes, can restrict the current input XML nodes of the input XML document by a Boolean test expression T.

Definition 14. The test nodes of an XSLT stylesheet S with a Boolean test expression T are
• <xsl:if test=T>,
• <xsl:when test=T> and
• <xsl:otherwise>, where T is implicitly "not(T1) and ... and not(Ti)" if there are i > 0 different XSLT nodes <xsl:when test=T1>, ..., <xsl:when test=Ti> in the same scope, or T is implicitly true() if there is no <xsl:when> XSLT node.

When considering all executed input nodes and test nodes of a detected stylesheet path (and its attached paths), the input nodes altogether select and the test nodes restrict a certain node set of

the input XML document D. In the following, we describe how to determine the whole node set (described by a query R) that is selected and restricted, respectively, on all stylesheet paths that generate output relevant to the query QXPath and that we have already computed in Section 3. We can then select a smaller but sufficient part R(D) of the input XML document D, where QXPath(S(R(D))) is equivalent to QXPath(S(D)).

In order to achieve this goal, we have to combine all local input path expressions of input nodes and all test expressions of test nodes along a detected stylesheet path (and its attached paths). For this purpose, we use two variables.

The current input path expression (current ipe) contains the whole input path expression of the detected stylesheet path down to (and including) the current XSLT node. The following holds for the node set described by current ipe: while executing the detected stylesheet path, the XSLT processor processes the current XSLT node with access to the XML nodes of the input XML document that are described by current ipe.

The completed input path expression (completed ipe) contains all input path expressions that are selected in the stylesheet path before the current node, but that will not be used further in the computation of a current ipe.

Fig. 8 shows the computation of the current input path expressions and the completed input path expressions for the example stylesheet of Fig. 2 and the query QXPath = /child::product_list/child::product[attribute::label="cockpit"]/attribute::*/parent::node(). The node identifiers (S1)–(S8) in Fig. 8 refer to the node identifiers of the XSLT stylesheet in Fig. 2. The completed ipe is always initialized with the empty set. For the example in Fig. 8, the current ipe is initialized with /.
In general, the XSLT processor starts executing the detected stylesheet path with the node set described by the match attribute M of the first template in the detected stylesheet path. Because of built-in templates, the template can match nodes of the node set rp(M) occurring at arbitrary depth of the XML document. Therefore, we initialize current ipe with

ap(M) | /descendant-or-self::node()/rp(M)

if rp(M) matches only element nodes, and otherwise with

ap(M) | /descendant-or-self::node()(/self::node() | /attribute::node() | /namespace::node())/rp(M).

Fig. 9 lists the different computing steps for current ipe and completed ipe (column 2). These steps depend on the type of the current node and on the type of the paths attached to the current node (column 1). Furthermore, Fig. 9 contains the identifiers of example nodes (column 3) to which each computing step is applied in Fig. 8.

In order to compute current ipe and completed ipe for each node along the detected stylesheet path and its attached paths (e.g. for the nodes (S2)–(S8) in Fig. 8), we iterate through the detected stylesheet path. Depending on the current node, we
• compute new path expressions of the current ipe (i.e. current ipe_new) and the completed ipe (i.e. completed ipe_new). The result depends on the local input path expression I or the test expression T of the current XSLT node, and on the old input path expressions of the current ipe (i.e. current ipe_old) and the completed ipe (i.e. completed ipe_old).

Fig. 8. Computing input path expressions.

• recursively compute and combine the current ipes and completed ipes of attached filter stylesheet paths, attached stylesheet paths and loop stylesheet paths. For this purpose, we first initialize current ipe (i.e. current ipe_init) and completed ipe (i.e. completed ipe_init), then compute recursively along the attached path as before, and obtain the current ipe (i.e. current ipe_path) and the completed ipe (i.e. completed ipe_path) after the last node of the attached path. Finally, we compute current ipe_new and completed ipe_new of the node with the attached path.

The complete input path expression, which is used as query R on the input XML document, is the union of all completed ipes and current ipes of the last node of each of the n detected stylesheet paths (1..n),

Fig. 9. Computing steps of current ipe and completed ipe.

$$R = \text{completed ipe}_1 \mid \text{current ipe}_1 \mid \dots \mid \text{completed ipe}_n \mid \text{current ipe}_n$$

If there is no entry in the set of detected stylesheet paths, i.e. n = 0, R remains empty. In the example of Fig. 8, we get

R = /child::object(/child::contains/child::object)*[attribute::name="cockpit"] | /child::object(/child::contains/child::object)*[attribute::name="cockpit"]/attribute::name

In the following subsections, we present each computation step of current ipe and completed ipe, depending on the current XSLT node, in more detail.

4.1. Nodes that are neither input nodes nor test nodes and have no attached path

Whenever an XSLT node is neither an input node nor a test node (for example, see XSLT nodes (S2), (S3) and (S7) in Fig. 8), the XSLT processor will neither select a new input node

set nor restrict the currently selected input node set while executing the current XSLT node. Therefore, when additionally no path is attached to the current XSLT node, the current ipe and the completed ipe remain unchanged, i.e., they are identical to their previous values.

4.2. Basic combination step, input node

The running example contains several computations of a new current input path expression (current ipe_new) and a new completed input path expression (completed ipe_new) from the local input path expression I, the old current input path expression (current ipe_old) and the old completed input path expression (completed ipe_old) (for example, see XSLT nodes (S4), (S8) and (S9) in Fig. 8). The general rule is as follows.

If rp(I) is empty, the XSLT processor executes this XSLT node with the node set ap(I), which contains only absolute paths. Therefore, we can neglect the old selected node set current ipe_old and assign ap(I) to current ipe_new:

$$\text{current ipe}_{\text{new}} = ap(I)$$

If rp(I) is not empty (e.g. (S4), (S8) and (S9) in the running example of Fig. 8), the XSLT processor executes the XSLT node with all input XML nodes selected by I in the context of the old selected node set current ipe_old. Therefore, current ipe_old must be concatenated with rp(I), and the new current ipe must also contain ap(I):

$$\text{current ipe}_{\text{new}} = \text{current ipe}_{\text{old}}/rp(I) \mid ap(I)$$

Before we present the rules for computing completed ipe_new, we give two example XSLT stylesheets in Figs. 11 and 13 that motivate the rule definition. The output of XSLT stylesheet A for input XML document 1 of Fig. 10 is presented in Fig. 12, and the output of XSLT stylesheet B for input XML document 1 is presented in Fig. 14. As the boldface part of Fig. 11 uses an absolute path where the boldface part of Fig. 13 uses a relative path, the output in Fig. 12 contains an additional element in comparison to the output in Fig. 14.
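The two rules for current ipe_new can be sketched on a simple string representation. The sketch rests on assumptions that are not taken from the paper: a local input path expression I is given as a list of union alternatives, ap(I) is taken to be the alternatives that start with "/", rp(I) the remaining ones, and the helper names `group` and `combine_current_ipe` are hypothetical. Unions are detected by a plain search for "|", which is a simplification.

```python
from typing import List


def group(expr: str) -> str:
    """Parenthesize a union so that a following step applies to all alternatives."""
    return f"({expr})" if "|" in expr else expr


def ap(local: List[str]) -> List[str]:
    """Absolute alternatives of a local input path expression (assumed form)."""
    return [alt for alt in local if alt.startswith("/")]


def rp(local: List[str]) -> List[str]:
    """Relative alternatives of a local input path expression (assumed form)."""
    return [alt for alt in local if not alt.startswith("/")]


def combine_current_ipe(current_old: str, local: List[str]) -> str:
    """Basic combination step for the current ipe of an input node."""
    absolute, relative = ap(local), rp(local)
    if not relative:
        # rp(I) is empty: the old context is irrelevant, only ap(I) remains.
        return "|".join(absolute)
    # rp(I) is evaluated in the context of current ipe_old; "/" alone is the root.
    context = group(current_old.rstrip("/"))
    concatenated = f"{context}/{group('|'.join(relative))}"
    # ap(I) is added as further union alternatives.
    return "|".join([concatenated] + absolute)
```

For instance, with the current ipe initialized to / and a purely relative I = child::object, the sketch yields /child::object.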
If we want to retrieve the part of the output in Figs. 12 and 14, respectively, that is described by the query /child::a/child::b, then in the case of XSLT stylesheet A the complete input XML document 1 must be transformed in order to generate output containing all information relevant to answering the query /child::a/child::b. In the case of XSLT stylesheet B, only the boldface part of Fig. 10 must be transformed. Considering these observations, we describe the rules for computing completed ipe_new in the following.

The XSLT processor executes the XSLT nodes preceding the current XSLT node, down to the last XSLT input node, with each node of current ipe_old. After that, if the absolute part of I, i.e. ap(I), is not empty, the XSLT processor not only selects XML nodes of rp(I), which

Fig. 10. Input XML document 1.

Fig. 11. XSLT stylesheet A.

Fig. 12. Output of XSLT stylesheet A of Fig. 11 with input XML document 1 of Fig. 10.

Fig. 13. XSLT stylesheet B.

Fig. 14. Output of XSLT stylesheet B of Fig. 13 with input XML document 1 of Fig. 10.

are in the context of current ipe_old, but also selects XML nodes in ap(I) to execute the current XSLT node and its successor XSLT nodes. Note that the node set ap(I) is independent of the context of current ipe_old, but the number of XML nodes in current ipe_old determines how often the current XSLT node and its successor XSLT nodes are executed with the XML nodes in ap(I). On the other hand, current ipe_new contains a more restricted node set than current ipe_old. Therefore, as all input XML nodes of current ipe_old have to be available in the resultant XML fragment, we add current ipe_old to completed ipe_new:

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}} \mid \text{current ipe}_{\text{old}}$$

If ap(I) is empty, i.e. I = rp(I), then we can optimize the determined section of the input XML document in the resultant XML fragment. The XSLT processor generates output that is

relevant to answering the query QXPath only if the XSLT processor executes a whole detected path (and its attached paths). Furthermore, the XSLT processor executes the current XSLT node and its successor XSLT nodes with the input XML nodes rp(I), which restrict the XML nodes of current ipe_old. Therefore, we need to retrieve only current ipe_new and not the whole of current ipe_old:

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}}$$

4.3. Test node combination step

In test nodes with test expression T, the XSLT processor checks whether evaluating the test expression T on the current input XML node K of the input XML document returns true. In other words, test nodes restrict the currently selected node set of the input XML document to the XML nodes that fulfil T. Therefore, we restrict current ipe to the XML nodes that fulfil T:

$$\text{current ipe}_{\text{new}} = \text{current ipe}_{\text{old}}[T]$$

completed ipe contains the node set of the input XML document that has been selected before the current XSLT node N is reached, but that no longer belongs to the currently selected node set of N. Therefore, the test node does not influence the node set of completed ipe:

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}}$$

4.4. Attached stylesheet path combination step

If there is a stylesheet path attached to the current node (for an example, see XSLT node (S7) and XSLT node (S8) in the right branch of Fig. 8), we recursively compute the current input path expression (current ipe_path) and the completed input path expression (completed ipe_path) of the attached stylesheet path. Before this recursive computation, we initialize current ipe with current ipe_old and completed ipe with completed ipe_old. After the last recursive evaluation step, we compute the input path expressions of the current node.
From now on, the relative parts of local input path expressions of following input nodes are in the context of current ipe_old:

$$\text{current ipe}_{\text{new}} = \text{current ipe}_{\text{old}}$$

As all selected input path expressions in the attached stylesheet path have to be retrieved, we set

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{path}} \mid \text{current ipe}_{\text{path}}$$

4.5. Filter combination step

Our approach optimizes filters of the form [P op const], where P is a path expression, op is a relational operator =, !=, >, >=, <= or <, and const is a constant value. Whenever the value of P is generated in the XSLT stylesheet S by copying content from the input XML document D by an output node that copies content (see Definition 10), our modified XPath evaluator attaches a filter path to the XSLT node (see Sections 3 and 3.1).

If there is a filter path attached to the current XSLT node (for an example, see XSLT nodes (S7) and (S8) in the left branch of Fig. 8), we recursively compute current ipe_filter and completed ipe_filter of the filter path. Before this recursive computation, we initialize the current ipe of the filter path (current ipe_filter) with an empty path and its completed ipe (completed ipe_filter) with completed ipe_old. After the last recursive evaluation step, we compute the input path expressions of the node to which the filter path is attached as follows. current ipe_filter contains the input path expression of the value that is assigned to the XML node P by an XSLT node while the XSLT stylesheet S is processed. Considering this, we transform the filter [P op const] into the filter [current ipe_filter op const] in the correct context of current ipe_old:

$$\text{current ipe}_{\text{new}} = \text{current ipe}_{\text{old}}[\text{current ipe}_{\text{filter}}\ op\ const]$$

We continue the computation of completed ipe with completed ipe_filter:

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{filter}}$$

4.6. Loop combination step

If there is a loop path attached to the current XSLT node (for example, see XSLT node (S9) in Fig. 8), we start an additional recursive computation of the input paths of this loop path. Before this recursive computation starts, we initialize both the current input path expression (current ipe_init) and the completed input path expression (completed ipe_init) of the loop with an empty path. Then we recursively compute the input path expressions in the loop as before. After the last recursive evaluation step, we retrieve the input path expressions after the last node in the loop, current ipe_loop and completed ipe_loop. We compute current ipe_new and completed ipe_new of the node to which the loop is attached according to the following rules.
In every iteration of the loop, rp(current ipe_loop) is selected in the context of the input path expression current ipe_old or, after at least one iteration of the loop, in the context of ap(current ipe_loop):

$$\text{current ipe}_{\text{new}} = (\text{current ipe}_{\text{old}} \mid ap(\text{current ipe}_{\text{loop}}))\,(/rp(\text{current ipe}_{\text{loop}}))^{*}$$

completed ipe_new is computed from completed ipe_old and completed ipe_loop. In order to compute all input path expressions that are completed during the processing along the XSLT nodes of the loop, we have to set the relative part of completed ipe_loop, i.e. rp(completed ipe_loop), in each possible context, i.e. in each context after an arbitrary number of iterations of the loop. Therefore, we have to attach rp(completed ipe_loop) to current ipe_new. Furthermore, we have to consider the absolute paths completed during the loop, i.e. ap(completed ipe_loop). Thus, we get

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}} \mid \text{current ipe}_{\text{new}}/rp(\text{completed ipe}_{\text{loop}}) \mid ap(\text{completed ipe}_{\text{loop}})$$

If rp(completed ipe_loop) is empty, we can simplify the assignment to

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}} \mid ap(\text{completed ipe}_{\text{loop}})$$
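The filter and loop combination steps can be sketched on strings as follows. The function names and the input values in the usage example are assumptions; in particular, the intermediate values of Fig. 8 are not reproduced here, and the example only shows that the rules produce an expression of the same shape as the first alternative of R given for Fig. 8. The `*` is the repetition notation of the text and not XPath syntax, and current ipe_old is assumed to contain no top-level union.

```python
from typing import List, Tuple


def filter_step(current_old: str, current_filter: str, op: str, const: str) -> str:
    """Section 4.5: current ipe_new = current ipe_old[current ipe_filter op const]."""
    return f"{current_old}[{current_filter}{op}{const}]"


def loop_current(current_old: str, ap_loop: List[str], rp_loop: str) -> str:
    """Section 4.6: (current ipe_old | ap(loop)) followed by (/rp(loop))*."""
    base = current_old if not ap_loop else "(" + "|".join([current_old] + ap_loop) + ")"
    # Without a relative part there is nothing to repeat.
    return base + (f"(/{rp_loop})*" if rp_loop else "")


def loop_completed(
    completed_old: Tuple[str, ...],
    current_new: str,
    rp_completed: str,
    ap_completed: List[str],
) -> Tuple[str, ...]:
    """Section 4.6: completed ipe_new as a tuple of union alternatives."""
    result = list(completed_old)
    if rp_completed:
        # The relative part is set into every context reachable by the loop.
        result.append(f"{current_new}/{rp_completed}")
    result.extend(ap_completed)
    return tuple(result)


# Hypothetical inputs chosen to mirror the shape of R in the running example.
looped = loop_current("/child::object", [], "child::contains/child::object")
filtered = filter_step(looped, "attribute::name", "=", '"cockpit"')
```

When `rp_completed` is empty, `loop_completed` reduces to the simplified assignment, i.e. only the absolute paths completed in the loop are added.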

The rule for attached loop paths in Fig. 9 handles the general case in which one or more loop paths are attached to the current XSLT node.

4.7. Variables

If a variable occurs in an XPath expression, we first compute current ipe_var and completed ipe_var of the stylesheet path that leads to the variable assignment. We replace the variable with current ipe_var and set the new completed ipe to

$$\text{completed ipe}_{\text{new}} = \text{completed ipe}_{\text{old}} \mid \text{completed ipe}_{\text{var}}$$

If XML nodes are generated in the variable assignment, we have to evaluate the location steps that follow the variable in the considered XPath expression on the XML nodes generated in the variable assignment. With this technique, we can determine the node set that the XPath expression accesses in the input XML document. We do not deal with this case in more detail here.
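Once current ipe and completed ipe are known for the last node of every detected stylesheet path, R is assembled as the union described at the beginning of this section. The following sketch shows that final step under assumptions: the function name `assemble_query` and the input format (one pair of completed ipes and current ipe per detected path) are hypothetical, and dropping duplicate alternatives is an addition of the sketch that does not change the selected node set.

```python
from typing import List, Sequence, Tuple


def assemble_query(paths: Sequence[Tuple[List[str], str]]) -> str:
    """Union of all completed ipes and current ipes of the detected paths.

    Each entry of `paths` is (completed ipes, current ipe of the last node).
    For n = 0 detected stylesheet paths the result is the empty string.
    """
    alternatives: List[str] = []
    for completed, current in paths:
        for expr in [*completed, current]:
            # Skip empty expressions and repeated alternatives of the union.
            if expr and expr not in alternatives:
                alternatives.append(expr)
    return "|".join(alternatives)


# Hypothetical values shaped like the result stated for Fig. 8.
first = '/child::object(/child::contains/child::object)*[attribute::name="cockpit"]'
example_r = assemble_query([([], first), ([], first + "/attribute::name")])
```

With these inputs, `example_r` consists of the two alternatives joined by the union operator, matching the structure of R in the running example.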

5. Proof of correctness and of minimum property of R(D)

In this section, we prove the correctness of the query transformation algorithm. More precisely, we prove that, when a query Q is transformed into a query R according to an XSLT stylesheet S using our approach, we retrieve the same result for Q(S(R(D))) and for Q(S(D)) for all XML documents D. Furthermore, we show that the resultant XML fragment R(D) is minimal.

For this purpose, let P be the set of paths that contains all detected stylesheet paths and their attached paths computed by our query transformation algorithm. Furthermore, we write x # P if a state x of an XSLT processor and its predecessor states contain executed XSLT nodes such that these executed XSLT nodes are the visited XSLT nodes of a detected stylesheet path or its attached paths, in the same order.

Definition 15. Given an execution sequence Z of states for a given XML document and given $x_1, x_2 \in Z$. Then $x_1 \le_Z x_2$ denotes an order relation such that $x_1$ is before the state $x_2$ or in the same place (i.e. $x_1 = x_2$) in the execution sequence Z.

Note that the order $\le_Z$ is different from the predecessor relation (see Definition 8) between states of Z.

Definition 16. Given an execution sequence Z of states. Then $Z'$ is a subsequence of Z iff $\forall x \in Z': x \in Z$ and $\forall x_1, x_2 \in Z': x_1 \le_{Z'} x_2 \iff x_1 \le_Z x_2$.

Definition 17. Given an execution sequence Z of states and a state $x_1 \in Z$. We define the distance of $x_1$ to the first state in Z, which we denote by $dist(x_1)$, by counting the number of predecessor states until the first state.

Example 6. In Fig. 5, the first state is X1 and the distance of X20, i.e. dist(X20), is 7.
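The difference between the order of Definition 15 and the predecessor relation used in Definition 17 can be made concrete with a small sketch. The state names, the predecessor map and the function names are hypothetical and are not taken from Fig. 5; the subsequence test assumes that the states of a sequence are pairwise distinct.

```python
from typing import Dict, Iterable, List, Optional


def is_subsequence(z_sub: Iterable[str], z: List[str]) -> bool:
    """Definition 16: every state of z_sub occurs in z, in the same order."""
    remaining = iter(z)
    # `x in remaining` consumes the iterator up to x, which enforces the order.
    return all(x in remaining for x in z_sub)


def dist(state: str, predecessor: Dict[str, Optional[str]]) -> int:
    """Definition 17: number of predecessor states until the first state."""
    steps = 0
    while predecessor.get(state) is not None:
        state = predecessor[state]
        steps += 1
    return steps


# Hypothetical execution sequence: X3 and X4 both have predecessor X2.
Z = ["X1", "X2", "X3", "X4"]
PREDECESSOR: Dict[str, Optional[str]] = {"X1": None, "X2": "X1", "X3": "X2", "X4": "X2"}
```

In this hypothetical sequence, X4 is the fourth state in the order of Z, but its distance to the first state is 2, because the predecessor relation skips X3.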

In the proof of correctness of the query transformation algorithm, we use the following Lemmas 2–4. In these lemmas, we assume that an XSLT stylesheet S and an XPath query QXPath are given. Furthermore, the XPath query R is the result of the query transformation algorithm with input S and QXPath. Let $Z_D$ be the execution sequence of states of the XSLT processor while transforming the whole input XML document D according to the XSLT stylesheet S into S(D), and let $Z_R$ be the execution sequence of states of the XSLT processor while transforming the resultant XML fragment R(D) into S(R(D)). The basic idea of Lemmas 2–4 is to prove the following condition C for subsequences of $Z_D$ and for the corresponding subsequences of $Z_R$.

Definition 18. Let $Z_D'$ be a subsequence of $Z_D$ and let $Z_R'$ be a subsequence of $Z_R$. The condition C is true for $Z_D'$ and $Z_R'$, iff for all $x_1, x_2 \in Z_D' \cup Z_R'$, where $x_1 \mathrel{\#} P$ and $x_2 \mathrel{\#} P$, the following holds:

$$x_1, x_2 \in Z_D' \;\wedge\; x_1 \le_{Z_D'} x_2 \iff x_1, x_2 \in Z_R' \;\wedge\; x_1 \le_{Z_R'} x_2.$$

Finally, the condition is proved for the entire $Z_D$ and the entire $Z_R$ in Lemma 4, which is used in Proposition 2 for the proof of correctness of the query transformation algorithm.

For Lemma 2, let $x_{Last} \in Z_D \cup Z_R$ be a state, where $x_{Last} \mathrel{\#} P$. We consider the processing steps of the state $x_{Last}$ in the algorithm processXSLTNode in Fig. 3, which presents the algorithm of the XSLT processor after initialization. The XSLT processor gets into a state whose predecessor state is $x_{Last}$ in the function call of line (21) and in the function call of line (23). The function call of line (21) generates such states, if the execution of the current XSLT node leads to the execution of other XSLT nodes (e.g. in the case of ) and therefore to a further call of processXSLTNode.

Lemma 2. Let $x_{Last} \in Z_D \cup Z_R$ be a state, where $x_{Last} \mathrel{\#} P$. We define the following subsequences for the states generated in the function calls of line (21) and of line (23):

Let $ND_{line\,21}$ be a subsequence of $Z_D$, where $ND_{line\,21} = \langle x \in Z_D \mid \text{predecessor}(x) = x_{Last} \wedge x \text{ is generated by line (21)} \rangle$.
Let $NR_{line\,21}$ be a subsequence of $Z_R$, where $NR_{line\,21} = \langle x \in Z_R \mid \text{predecessor}(x) = x_{Last} \wedge x \text{ is generated by line (21)} \rangle$.
Let $ND_{line\,23}$ be a subsequence of $Z_D$, where $ND_{line\,23} = \langle x \in Z_D \mid \text{predecessor}(x) = x_{Last} \wedge x \text{ is generated by line (23)} \rangle$.
Let $NR_{line\,23}$ be a subsequence of $Z_R$, where $NR_{line\,23} = \langle x \in Z_R \mid \text{predecessor}(x) = x_{Last} \wedge x \text{ is generated by line (23)} \rangle$.

Then the condition C holds for the pairs $ND_{line\,21}$ and $NR_{line\,21}$, $ND_{line\,23}$ and $NR_{line\,23}$.

Proof of Lemma 2. We conclude the condition C for the pairs $ND_{line\,21}$ and $NR_{line\,21}$, $ND_{line\,23}$ and $NR_{line\,23}$ by considering the algorithm processXSLTNode (see Fig. 3). Note that after the state $x_{Last}$ the internal XPath evaluator retrieves the same XML node sets in line (16) of Fig. 3 for the input XML document D and for the input XML document R(D), because the combination steps of current ipe and completed ipe of a single XSLT node on a detected stylesheet path (or its attached paths) for the computation of R are designed in this way. □
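To make condition C of Definition 18 more concrete, the following minimal Python sketch checks it for two finite subsequences. It is an illustration under stated assumptions, not part of the algorithm of the paper: states are modelled as hashable values that occur at most once per sequence, and the relation of a state to the detected stylesheet paths P is passed in as a hypothetical predicate `on_path`.

```python
from itertools import product


def condition_c(zd_sub, zr_sub, on_path):
    """Check condition C (Definition 18) for two subsequences of states.

    zd_sub, zr_sub: lists of hashable states in execution order
                    (assumption: each state occurs at most once per list).
    on_path:        hypothetical predicate; True if a state is related to
                    the detected stylesheet paths P.
    """
    pos_d = {x: i for i, x in enumerate(zd_sub)}
    pos_r = {x: i for i, x in enumerate(zr_sub)}
    # All states of the union of both subsequences that are related to P.
    relevant = [x for x in dict.fromkeys(list(zd_sub) + list(zr_sub)) if on_path(x)]
    for x1, x2 in product(relevant, repeat=2):
        left = x1 in pos_d and x2 in pos_d and pos_d[x1] <= pos_d[x2]
        right = x1 in pos_r and x2 in pos_r and pos_r[x1] <= pos_r[x2]
        if left != right:  # the equivalence of Definition 18 is violated
            return False
    return True


# States that are not related to P (here: "skip") may be missing in one sequence.
zd = ["start", "a", "skip", "b"]
zr = ["start", "a", "b"]
holds = condition_c(zd, zr, lambda x: x != "skip")
```

Because the pair with x1 = x2 is also checked, the sketch reports a violation as soon as a state related to P occurs in only one of the two subsequences, or as soon as two such states occur in a different order.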


Lemma 3. Let $SZD_{old}$ be a subsequence of $Z_D$ and let $SZR_{old}$ be a subsequence of $Z_R$. Let $x_{Last} \in SZD_{old} \cap SZR_{old}$ be a state, where $x_{Last} \mathrel{\#} P$. Furthermore, we require the following to hold: $\forall x \in SZD_{old} \cup SZR_{old}: \text{predecessor}(x) \neq x_{Last}$. We define $SZD_{new}$ to be a subsequence of $Z_D$, where $SZD_{new} = \langle x \in Z_D \mid x \in SZD_{old} \vee \text{predecessor}(x) = x_{Last} \rangle$, and $SZR_{new}$ to be a subsequence of $Z_R$, where $SZR_{new} = \langle x \in Z_R \mid x \in SZR_{old} \vee \text{predecessor}(x) = x_{Last} \rangle$. Then condition C holds for $SZD_{new}$ and $SZR_{new}$.

Proof of Lemma 3. $SZD_{new}$ can be divided into four subsequences $SD_{until\,xLast}$, $ND_{line\,21}$, $ND_{line\,23}$ and $SD_{after\,xLast}$ so that $SZD_{new} = SD_{until\,xLast} \circ ND_{line\,21} \circ ND_{line\,23} \circ SD_{after\,xLast}$, where $\circ$ denotes the concatenation of sequences:

(1) $SD_{until\,xLast}$ consists of the states from the first state in $SZD_{old}$ until $x_{Last}$, including $x_{Last}$ itself.
(2) After $x_{Last}$, the states of $ND_{line\,21}$ occur in $SZD_{new}$.
(3) After the states of $ND_{line\,21}$, the states of $ND_{line\,23}$ occur in $SZD_{new}$.
(4) Let $x_{Last+1}$ be the state after $x_{Last}$ in $SZD_{old}$. The fourth subsequence $SD_{after\,xLast}$ contains $x_{Last+1}$ and all states after $x_{Last+1}$ of $SZD_{old}$.

Considering the definitions of the execution sequences and the algorithm processXSLTNode in Fig. 3, first all states of $SD_{until\,xLast}$ occur in $SZD_{new}$, then all states of $ND_{line\,21}$, then all states of $ND_{line\,23}$ and finally all states of $SD_{after\,xLast}$. There are no other states in $SZD_{new}$. We can define subsequences of $SZR_{new}$ analogously so that first all states of $SR_{until\,xLast}$ occur in $SZR_{new}$, then all states of $NR_{line\,21}$, then all states of $NR_{line\,23}$ and finally all states of $SR_{after\,xLast}$. There are no other states in $SZR_{new}$.

Considering the premise of Lemma 3 and considering Lemma 2, condition C holds for each single subsequence in $SZD_{new}$ and its corresponding subsequence in $SZR_{new}$, i.e. for $SD_{until\,xLast}$ and $SR_{until\,xLast}$, $ND_{line\,21}$ and $NR_{line\,21}$, $ND_{line\,23}$ and $NR_{line\,23}$, $SD_{after\,xLast}$ and $SR_{after\,xLast}$. We now have to prove that condition C holds for the entire $SZD_{new}$ and the entire $SZR_{new}$.
For this purpose, we arbitrarily choose two states $x_1$ and $x_2$ in $SZD_{new} \cup SZR_{new}$, where $x_1 \mathrel{\#} P$ and $x_2 \mathrel{\#} P$. Then $x_1, x_2 \in SZD_{new} \cap SZR_{new}$, because this is one requirement of the condition C and the condition C holds for all corresponding subsequences $SD_{until\,xLast}$ and $SR_{until\,xLast}$, $ND_{line\,21}$ and $NR_{line\,21}$, $ND_{line\,23}$ and $NR_{line\,23}$, $SD_{after\,xLast}$ and $SR_{after\,xLast}$. We consider two different cases for $x_1$ and $x_2$.

(a) If $x_1, x_2$ are both elements of the same subsequence $SD_{until\,xLast}$, $ND_{line\,21}$, $ND_{line\,23}$ or $SD_{after\,xLast}$ of $SZD_{new}$ and of its corresponding subsequence $SR_{until\,xLast}$, $NR_{line\,21}$, $NR_{line\,23}$ or $SR_{after\,xLast}$ of $SZR_{new}$, then $x_1 \le_{SZD_{new}} x_2 \iff x_1 \le_{SZR_{new}} x_2$.

(b) Let $x_\star$ be the first state of $ND_{line\,23} \cap NR_{line\,23}$. If $x_1, x_2$ are elements of different subsequences $SD_{until\,xLast}$, $ND_{line\,21}$, $ND_{line\,23}$ and $SD_{after\,xLast}$ of $SZD_{new}$ and $x_1, x_2$ are elements of their corresponding subsequences $SR_{until\,xLast}$, $NR_{line\,21}$, $NR_{line\,23}$ and $SR_{after\,xLast}$ of $SZR_{new}$, then we relate $x_1$ and $x_2$ by $\le_{SZD_{new}}$ and by $\le_{SZR_{new}}$ to $x_{Last}$, $x_\star$ and $x_{Last+1}$, use the transitivity of the order relations $\le_{SZD_{new}}$ and $\le_{SZR_{new}}$, and use the relations $x_{Last} \le_{SZD_{new}} x_\star \le_{SZD_{new}} x_{Last+1}$ and $x_{Last} \le_{SZR_{new}} x_\star \le_{SZR_{new}} x_{Last+1}$ to conclude $x_1 \le_{SZD_{new}} x_2 \iff x_1 \le_{SZR_{new}} x_2$.

Therefore, condition C holds for $SZD_{new}$ and $SZR_{new}$. □

Lemma 4. Condition C holds for $Z_R$ and $Z_D$.

Proof of Lemma 4. If there is no detected stylesheet path, then Lemma 4 holds trivially. Otherwise there is at least one detected stylesheet path. Let $Z_D(n)$ and $Z_R(n)$ be the subsequences that contain all states with at most the distance n to $x_{Start}$ in $Z_D$ and in $Z_R$ respectively. In other words: Let $Z_D(n)$ be a subsequence of $Z_D$, where $Z_D(n) = \langle x \in Z_D \mid \text{dist}(x) \le n \rangle$. Let $Z_R(n)$ be a subsequence of $Z_R$, where $Z_R(n) = \langle x \in Z_R \mid \text{dist}(x) \le n \rangle$. Then we prove the condition C (see Definition 18) for $Z_D(n)$ and $Z_R(n)$ by induction on n. As $Z_D$ and $Z_R$ contain only finitely many states, Lemma 4 follows from the following induction proof:

Base clause (n = 0): The first state $x_{Start}$ in $Z_D$ and also in $Z_R$ is always (N = , XP = -, K = /, I = {/} with / marked). Therefore, $Z_D(0) = Z_R(0) = \langle x_{Start} \rangle$, where $x_{Start}$ = (N = , XP = -, K = /, I = {/} with / marked). $x_{Start} \mathrel{\#} P$, as detected stylesheet paths always start with the XSLT node . Furthermore, $x_{Start} \le_{Z_D} x_{Start}$ and $x_{Start} \le_{Z_R} x_{Start}$ so that condition C holds for $Z_D(0)$ and $Z_R(0)$.

Induction step: We assume that the condition C holds for $Z_D(n)$ and for $Z_R(n)$. At first, we set $SZD_{old}(n) = Z_D(n)$ and $SZR_{old}(n) = Z_R(n)$. Then, we iterate through the set $Z_{Last}$ of states of $Z_D \cup Z_R$, where $\forall x_{Last} \in Z_{Last}: x_{Last} \mathrel{\#} P \wedge \text{dist}(x_{Last}) = n$. Let $x_{Last}$ be the currently considered state of $Z_{Last}$. By using Lemma 3, we prove condition C for a subsequence $SZD_{new}(n)$ of $Z_D$, where

$$SZD_{new}(n) = \langle x \in Z_D \mid x \in SZD_{old}(n) \vee \text{predecessor}(x) = x_{Last} \rangle,$$

and a subsequence $SZR_{new}(n)$ of $Z_R$, where

$$SZR_{new}(n) = \langle x \in Z_R \mid x \in SZR_{old}(n) \vee \text{predecessor}(x) = x_{Last} \rangle.$$

After proving the condition C for $SZD_{new}(n)$ and $SZR_{new}(n)$, we set $SZD_{old}(n)$ to $SZD_{new}(n)$ and $SZR_{old}(n)$ to $SZR_{new}(n)$, and consider the next state of $Z_{Last}$ until all states of $Z_{Last}$ are considered and the condition C is proved for the entire execution sequences $Z_D(n+1)$ and $Z_R(n+1)$. Thus, condition C holds for $Z_D$ and $Z_R$. □


Proposition 2. Let S be an XSLT stylesheet and let QXPath be an XPath query. Then the query transformation algorithm transforms QXPath into an XPath query R so that for every input XML document D the following holds: QXPath(S(D)) returns the same result as QXPath(S(R(D))).

Proof of Proposition 2. We compute the detected stylesheet paths of QXPath in S. Then we can conclude from Lemma 4 that no output that is relevant to answer the query QXPath is generated in ZD but not in ZR, i.e. QXPath(S(D)) returns the same result as QXPath(S(R(D))) for every XML document D. □

Lemma 5. Let x be a state of an execution sequence Z, where P represents the detected stylesheet paths (and their attached paths) of a given query QXPath and x # P. The node sets described by the current ipe_new and the completed ipe_new, which are computed by the single combination step (see Section 4) of current ipe_old and completed ipe_old for the XSLT node of x in P, are minimal in the following sense: The internal XPath evaluator in line 16 of Fig. 3 successfully visits exactly the XML nodes described by current ipe_new and completed ipe_new, except those XML nodes described by current ipe_old and completed ipe_old, for the XSLT node of the state x in P and for those states that are on a path attached to the XSLT node of x in P.

Proof of Lemma 5. We prove Lemma 5 by considering each combination step for current ipe and completed ipe in Section 4. □

Proposition 3. Let S be an XSLT stylesheet, let D be an input XML document and let QXPath be an XPath query. Let further R be the transformed query of QXPath according to S. Furthermore, let O be those XML nodes of D, which will be successfully visited (see Definition 5) by the internal XPath evaluator (see Fig. 3, line 16) of the XSLT processor, when the XSLT processor executes that section of S, which generates output that is relevant to answer QXPath. Then the whole resultant XML fragment R(D) is minimal in the following sense: R(D) does not contain other XML nodes of the input XML document D except those in O.

Proof of Proposition 3. The query transformation algorithm only combines the local input path expressions along those stylesheet paths, which generate output that is relevant to answer the query QXPath. In each combination step, current ipe and completed ipe are computed in such a way that they describe only those XML nodes of the input XML document D, which will be successfully visited (see Definition 5) by the internal XPath evaluator (see Fig. 3, line 16) of the XSLT processor, when the XSLT processor processes the XSLT stylesheet S on the detected stylesheet paths and their attached paths. From the minimum property (see Lemma 5) of each single combination step of current ipe and completed ipe, we can conclude the minimum property of the resultant XML fragment R(D) described by the transformed query R. □


6. XSLT query transformation into XPath queries on XSLT views

For an example of an XSLT query, see Fig. 15. The XSLT query QXSLT in Fig. 15 returns an XML document with root element and the value of all attributes of the elements /child::product_list/child::product[attribute::label='cockpit'] of the input XML document enclosed by elements. Assume we have to answer the XSLT query QXSLT of Fig. 15 on the XSLT view S of Fig. 2. One possible approach is to transform the whole input XML document by S into S(D) and then transform the result S(D) by QXSLT into QXSLT(S(D)). Instead, we can apply query reformulation in order to use its advantages. In the following, we describe how XPath query reformulation can be used to enable XSLT query reformulation. For this purpose, we first have to determine that XML fragment of the input XML document S(D) of QXSLT, which is relevant for QXSLT (see Definition 19).

Definition 19. An XML node K of an input XML document S(D) is relevant for an XSLT stylesheet QXSLT, if the internal XPath evaluator of the XSLT processor visits K successfully (see Definition 5) while transforming S(D) according to QXSLT.

Proposition 4. The most general XPath query

G := /descendant-or-self::node() | /descendant-or-self::node()/attribute::node() | /descendant-or-self::node()/namespace::node()

returns all nodes of an XML document.

Proof of Proposition 4. /descendant-or-self::node() describes the root of an XML document and all its descendant nodes except for all attribute nodes, which are described by /descendant-or-self::node()/attribute::node(), and all namespace nodes, which are described by /descendant-or-self::node()/namespace::node(). □

We can use XPath query reformulation in order to determine that XML fragment, which is relevant for QXSLT, in the following way: As we are interested in the whole result of QXSLT, we reformulate the most general XPath query G according to QXSLT and retrieve the reformulated XPath

Fig. 15. XSLT query QXSLT.


query QXPath. QXPath(S(D)) is that XML fragment of S(D), which is relevant for QXSLT. Furthermore, we reformulate QXPath according to the XSLT view S as before and get a query R, which describes the part of D that is relevant to QXSLT. We retrieve the answer of the XSLT query QXSLT by evaluating QXSLT(S(R(D))).

For example, see the XSLT query QXSLT of Fig. 15 and the XSLT view S of Fig. 2. We retrieve

QXPath = /child::product_list | /child::product_list/child::product[attribute::label='cockpit']/attribute::*

by reformulating the most general XPath query G (see Proposition 4) according to QXSLT. We transform QXPath according to S into

R = / | /child::object(/child::contains/child::object)*[attribute::name="cockpit"] | /child::object(/child::contains/child::object)*[attribute::name="cockpit"]/attribute::name

The result of the evaluation of R is the bold face part of the XML fragment R(D) of Fig. 2. Note that this is significantly smaller than D, i.e. a significant part of D does not need to be transformed. The answer of the XSLT query QXSLT is the result of QXSLT(S(R(D))): cockpit

7. Computing the resultant XML fragment R(D)

Applications can use two different application programming interfaces (API) in order to access XML. Whereas the Document Object Model (DOM [37]) enables tree operations on XML data, the Simple Application Programming Interface for XML (SAX [27]) is an event-based parsing API. Current implementations of DOM first generate a DOM tree of the whole XML data in main memory, so that large XML data cannot be accessed because of memory limitations. In addition, the generation of the DOM tree is time consuming. In comparison, current implementations of SAX do not generate a DOM tree and are therefore faster than DOM implementations. Instead, SAX implementations trigger SAX events while parsing XML data, so that an application cannot navigate arbitrarily in the XML data.

We present two different approaches using the two different APIs DOM (see Section 7.1) and SAX (see Section 7.2) to compute the resultant XML fragment R(D) of a given query R on a given XML document D. The approach using DOM clones those XML nodes of the DOM tree of D that are relevant to a given query R. We call this approach the CloneNode approach.


In comparison, we call the approach using SAX the SAXFilter approach: The SAXFilter approach receives the SAX events of D, but transmits only the SAX events of R(D). The SAXFilter approach can be used in a pipe of SAX event handlers.

7.1. CloneNode: Cloning the relevant nodes of a DOM tree

The CloneNode approach uses a modified XPath evaluator in order to mark all the relevant XML nodes for a given query R in a DOM tree. After that, CloneNode creates a new document, and then starts cloning at the root node of the original DOM tree. The approach clones all marked children and their attribute nodes and namespace nodes recursively, inserts the cloned nodes into the new document, and establishes the parent–child relationships of the cloned nodes. CloneNode achieves best results, if the XML data is already in main memory as a DOM tree, and if CloneNode can benefit from the goal-oriented search of the XPath evaluator.

7.2. SAXFilter: Filtering SAX events

A SAX parser triggers the SAX events of an XML document D in the order in which they occur in D. We call this order document order. The SAXFilter approach must transmit the SAX events of R(D) in the same order so that document order is preserved. As the SAX API does not support arbitrary navigation in the XML data, reverse axes of R are eliminated in a first step by the approach presented in [31]. Ref. [31] describes a rule set to transform an XPath expression containing reverse axes into an equivalent XPath expression containing only forward axes. For simplicity of presentation, we restrict the given query R to a disjunction of XPath expressions conforming to the subset SXPath (see Definition 12) of XPath. Although restricted to SXPath, the presented algorithm shows the main technical difficulties to be solved, and therefore the algorithm can be easily extended to support a bigger subset of XPath.
In principle, the SAXFilter approach uses a stack, which contains the current XML node and all its ancestors up to the root of the input XML document D. We call all these XML nodes stack nodes. When R is detected, i.e. the stack nodes fulfill R, then the SAXFilter approach starts at the top of the stack, visiting each XML node down to the current XML node in the stack and transmitting this XML node as SAX event if not already transmitted before.

SXPath also supports filter expressions, which can contain other filter expressions, recursively. In this case, the algorithm becomes much more complicated. For example, R = /child::p1[child::p2/attribute::a='const']/child::p3 presumes that after detecting /child::p1, the two paths child::p2/attribute::a, the value of which must be 'const', and child::p3 are both detected relative to the already detected p1 element. If only one of these paths is detected, then the transmission of the SAX events of the stack nodes must be delayed until the other path is also detected, or the transmission of the SAX events must be cancelled, if there is no chance to detect the other path. We call the status of the stack nodes unsafe, if their transmission depends on the detection of another path, which is not yet detected, but there is still a possibility to detect the other path. It can also happen that stack nodes sn must be declared to be transmitted anyway, but the transmission of sn must be delayed until the transmission of other stack nodes, which have to occur before sn in document order, can be executed or cancelled. We call these stack nodes sn safe.

In the event handling of the startTag event, we administer these safe and unsafe stack nodes and the transmission of their SAX events in the right order. We must also take care that no duplicate references to the same stack nodes are inserted. If safe stack nodes are inserted and they (or some of them) already occur in the stored stack nodes with status unsafe, then these stack nodes change their status from unsafe to safe. Stack nodes never change their status from safe to unsafe.

In order to check whether all dependent paths of a given XPath query R are already detected, we first build a decision tree, which we define in the following.

Definition 20. The decision tree is built from a given XPath query R as follows: First, we generate all nodes, which we call location step nodes, which contain a location step of R. If the location step of a location step node does not contain a filter, then the child of a location step node is that location step node, which is generated from the next location step in R. We insert a newly created node as child, which contains an and operator, wherever a filter occurs in the location step of a location step node. We call the path from the root of the decision tree down to a filter the common start path of the filter and of the path after the filter. The children of the and operator are the root of the location step nodes for the path in the filter (and its comparison) and the root of the location step nodes for the path after the filter. We insert an empty node as child at each leaf node in order to mark the end of a path.

Example 7. Fig. 16 contains the decision tree of the query R = /child::object/descendant-or-self::node()[attribute::name="cockpit"]. The common start path of the filter and of the empty path after the filter is /child::object/descendant-or-self::node().
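The construction of Definition 20 can be sketched in a few lines of Python. This is an illustrative sketch only: the classes `Step` and `Node`, the function names and the textual rendering are assumptions made for this example, the query is given as an already parsed list of location steps instead of an XPath string, and the layout of Fig. 16 may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One location step; `filter` holds the location steps of its filter path."""
    text: str
    filter: Optional[List["Step"]] = None
    comparison: str = ""  # comparison belonging to the filter path, e.g. '="cockpit"'


@dataclass
class Node:
    label: str  # location step, "and", or "" for an empty end-of-path node
    children: List["Node"] = field(default_factory=list)


def build(steps: List[Step], comparison: str = "") -> Node:
    """Return the root of the decision (sub)tree for a list of location steps."""
    if not steps:
        return Node("")  # empty node marking the end of a path
    head, rest = steps[0], steps[1:]
    # A comparison is attached to the last location step of its path.
    node = Node(head.text + (comparison if not rest else ""))
    if head.filter is None:
        node.children.append(build(rest, comparison))
    else:
        and_node = Node("and")
        and_node.children.append(build(head.filter, head.comparison))  # path in the filter
        and_node.children.append(build(rest, comparison))              # path after the filter
        node.children.append(and_node)
    return node


def render(node: Node, depth: int = 0) -> List[str]:
    """Indented textual dump of the tree; '<end>' stands for an empty node."""
    lines = ["  " * depth + (node.label or "<end>")]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The query of Example 7.
example7 = [
    Step("child::object"),
    Step("descendant-or-self::node()",
         filter=[Step("attribute::name")], comparison='="cockpit"'),
]
tree = build(example7)
print("\n".join(render(tree)))
```

For the query of Example 7, the and node has two children: the location step node of the filter path with its comparison, and an empty node for the empty path after the filter.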
The event handling for the startTag event follows one edge, if the current location step matches the current node of the startTag event. Furthermore, the event handling stores at the leaves of the decision tree which single paths are already detected. The event handling methods update these decision trees regularly. If all single paths of a decision tree are detected, i.e. the whole decision tree is detected and therefore R is detected, then the statuses of all corresponding unsafe stack nodes are set to safe.
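The basic stack mechanism described in this section can be illustrated with a strongly simplified Python sketch based on the standard `xml.sax` module. It is not the algorithm of Figs. 17 and 18: as an assumption for this example, R is restricted to a disjunction of filter-free paths that use only the child axis, so neither decision trees nor the safe and unsafe statuses are needed. The class name `SimpleSaxFilter`, the function `filter_events` and the event tuples are hypothetical.

```python
import xml.sax


class SimpleSaxFilter(xml.sax.ContentHandler):
    """Transmits only the elements matched by one of the given paths, plus their ancestors."""

    def __init__(self, paths):
        super().__init__()
        self.paths = {tuple(p) for p in paths}  # each path: element names from the root
        self.stack = []    # [name, already transmitted?] for the current node and its ancestors
        self.events = []   # transmitted events in document order

    def startElement(self, name, attrs):
        self.stack.append([name, False])
        if tuple(n for n, _ in self.stack) in self.paths:  # the stack nodes fulfill R
            for entry in self.stack:                       # from the root down to the current node
                if not entry[1]:                           # not transmitted before
                    self.events.append(("start", entry[0]))
                    entry[1] = True

    def endElement(self, name):
        popped_name, transmitted = self.stack.pop()
        if transmitted:                                    # close only what was opened
            self.events.append(("end", popped_name))


def filter_events(xml_text, paths):
    handler = SimpleSaxFilter(paths)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = filter_events("<a><b><c/></b><d><c/></d><b/></a>", [("a", "b", "c")])
```

An ancestor is transmitted lazily, i.e. only when the first matching descendant is found. Since nothing is transmitted for unmatched nodes in between, the transmitted events remain in document order.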

Fig. 16. Decision tree of query R = /child :: object/descendant-or-self :: node( )[attribute :: name= "cockpit"].


Fig. 17. Algorithm startTag event handler.

The pseudocode of the startTag event handler is presented in Fig. 17.

If the endTag event occurs, then we pop the last node from the stack. If a decision tree was deactivated in the corresponding startTag event, then we activate the decision tree again so that the next events can consider this decision tree. If the current stack nodes leave a common start path of a decision tree, then no single paths with the old common start path can be detected any more. Therefore, if not all paths in the decision tree are detected, then all corresponding unsafe stack nodes are deleted, unless they are also unsafe nodes of another decision tree. The pseudocode of the endTag event handler is presented in Fig. 18.

Fig. 19 contains a dump of the SAX events when parsing the XML document D of Fig. 2, the stack nodes, the current location step nodes of the different decision trees and the transmitted SAX events.

Whenever SAX is used, SAXFilter can be integrated in a pipe of SAX event handlers. If a DOM tree is needed as output, then a simple SAX to DOM adapter can be installed as last event

Fig. 18. Algorithm endTag event handler.


Fig. 19. Dump of SAXFilter when parsing Fig. 2.

handler in the pipe. In comparison to CloneNode, SAXFilter achieves better results, if the XML data D is not already loaded in main memory. In this case, only the XML nodes of the resultant XML fragment R(D) are generated instead of first generating all XML nodes of D and then cloning the XML nodes of R(D). If the XML data is already loaded in main memory, then the SAXFilter approach can be used in a pipe after a DOM to SAX adapter, which transmits the SAX events of a DOM tree in a left-order traversal of the DOM tree. In comparison to CloneNode, the SAXFilter approach following a DOM to SAX adapter cannot then benefit from a goal-oriented search, as all XML nodes of the DOM tree are visited at least once.
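For comparison with the event-based filtering, the cloning step of the CloneNode approach (Section 7.1) can be sketched with Python's standard `xml.dom.minidom` module. The sketch makes several assumptions: the modified XPath evaluator is replaced by a hypothetical predicate `is_selected`, only element nodes and their attributes are handled, and the small input document is invented for illustration (its element names are borrowed from the query R of Section 6; it is not the document of Fig. 2).

```python
from xml.dom import minidom


def mark(doc, is_selected):
    """Return the ids of all selected elements and of their ancestor elements."""
    marked = set()

    def visit(node):
        if node.nodeType == node.ELEMENT_NODE and is_selected(node):
            current = node
            while current is not None and current.nodeType == current.ELEMENT_NODE:
                marked.add(id(current))
                current = current.parentNode
        for child in node.childNodes:
            visit(child)

    visit(doc)
    return marked


def clone_marked(doc, is_selected):
    """Build a new document that contains clones of the marked elements only."""
    marked = mark(doc, is_selected)
    new_doc = minidom.Document()

    def copy(source, parent):
        clone = new_doc.importNode(source, False)  # shallow: the element and its attributes
        parent.appendChild(clone)                  # establish the parent-child relationship
        for child in source.childNodes:
            if id(child) in marked:
                copy(child, clone)

    root = doc.documentElement
    if id(root) in marked:
        copy(root, new_doc)
    return new_doc


source = minidom.parseString(
    '<object name="plane"><contains>'
    '<object name="cockpit"/><object name="wing"/>'
    '</contains></object>'
)
result = clone_marked(source, lambda e: e.getAttribute("name") == "cockpit")
print(result.documentElement.toxml())
```

In this sketch the marking pass visits the whole tree, whereas the modified XPath evaluator of the CloneNode approach can search in a goal-oriented way.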

8. Performance analysis

In this section, we show the results of the experiments with our prototype in comparison to the standard approach, which transforms the entire XML document in order to answer a query.

8.1. Experimental environment

The test system for all runtime experiments is an Intel Pentium 4 processor 1.7 GHz with 1 GB RAM, Windows XP as operating system and Java VM build version 1.4.2. We use Xerces2 Java parser 2.5.0 release [4] as XML parser, and Xalan-Java version 2.5.1 [3] and Saxon version 8.0 [24] as XSLT processors. Fig. 20 contains the XSLT stylesheet, which we used for all experiments.

Definition 21. The size of an XML document (or a query result respectively) is the number of bytes used for the textual representation of the XML document (or of the query result respectively). The textual representation does not contain more characters than necessary to represent the XML document (or query result respectively) except for a return code after each start and end sequence of an XML element.


Fig. 20. Used XSLT stylesheet for the experiments.

Fig. 21. Used DTD Forig for the experiments.

We have generated test XML documents of different sizes according to the DTD in Fig. 21 (from 0MB up to 25MB in 1MB steps). The b attribute of the a tag contains a number value, which is uniformly distributed in the interval [0; 99].

Definition 22. The selectivity of a query is the size of the query result divided by the size of the original document.

For the experiments, we use the query QXPath = /child::s/child::c[attribute::d < X]/attribute::d for a query with a selectivity of X percent. For most experiments, and unless stated otherwise, the XML data is loaded from file. We run each experiment with both XSLT processors Xalan and Saxon. The average of 10 experiments is presented.

8.2. Analysis of experimental results

8.2.1. Varying the file size and evaluating queries with different selectivity

In the following subsections, we present the results of the experiments of transforming queries QXPath from 0% selectivity to 100% selectivity in 10% selectivity steps according to the XSLT stylesheet S of Fig. 20 into the transformed query R, and of processing these queries. The presented times also include the time for retrieving the resultant XML fragment R(D) from file using the CloneNode approach or the SAXFilter approach, for transforming R(D) to S(R(D)) and finally for evaluating the query QXPath on S(R(D)). Furthermore, the time without using our approach is presented. This time includes the time for loading the entire XML document D, transforming D according to S and evaluating QXPath on S(D), i.e. QXPath(S(D)). In this case, we choose the 0% selectivity query as QXPath, i.e. we compare our approach to the evaluation of that query which takes the shortest time. Section 8.2.1.1 presents the results of the CloneNode approach, whereas Section 8.2.1.2 presents the results of the SAXFilter approach.
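The connection between the threshold X in QXPath and the selectivity can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the experiments: the document is a synthetic stand-in for S(D) with a root element s and child elements c, the attribute values cycle deterministically through 0 to 99 instead of being drawn at random, and the selectivity is approximated by the fraction of selected elements rather than by the byte sizes of Definitions 21 and 22.

```python
import xml.etree.ElementTree as ET


def make_document(n_elements):
    """Build <s> with n_elements children <c d="..."/>; d cycles through 0..99."""
    root = ET.Element("s")
    for i in range(n_elements):
        ET.SubElement(root, "c", {"d": str(i % 100)})
    return root


def select(root, x):
    """Emulate /child::s/child::c[attribute::d < X]/attribute::d on the root element s."""
    return [c.get("d") for c in root.findall("c") if int(c.get("d")) < x]


def element_selectivity(root, x):
    """Fraction of c elements whose d attribute is selected (a proxy for Definition 22)."""
    return len(select(root, x)) / len(root.findall("c"))


document = make_document(1000)
for x in (0, 30, 100):
    print(x, element_selectivity(document, x))
```

With values spread evenly over [0; 99], a threshold of X selects the fraction X/100 of the elements, which is why the threshold can be read directly as a percentage.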


Fig. 22. Using the CloneNode approach.

Fig. 23. Transforming the entire XML document using the DOM API.

8.2.1.1. Evaluating the CloneNode approach. In the following experiments, we first load and parse the XML document from file and then generate the DOM tree in main memory by a DOM parser. When using the CloneNode approach (see Fig. 22), we also consider the time for transforming the given query QXPath to R and for then retrieving the resultant XML fragment R(D), which we transform by the XSLT processor according to the XSLT stylesheet in Fig. 20. In comparison, without using the CloneNode approach (see Fig. 23), we transform the entire DOM tree. In both cases, we retrieve a DOM tree as result of the XSLT processor, to which we apply QXPath. Finally, we iterate once through the result node set of the query QXPath.

Fig. 24 shows the entire time when using the Xalan processor. Fig. 25 shows the entire time when using the Saxon processor. In both cases, the operating system starts to swap memory at

Fig. 24. Comparing the CloneNode approach in QXPath(S(R(D))) with QXPath(S(D)) using the Xalan XSLT processor.


Fig. 25. Comparing the CloneNode approach in QXPath(S(R(D))) with QXPath(S(D)) using the Saxon XSLT processor.

7MB, which limits the experiments. We reserve 768MB for the Java virtual machine so that an out of memory error occurs at 9MB. When using the Xalan processor, the CloneNode approach is faster for a document size of more than 10KB for queries with selectivity less than or equal to 30%. At a document size of 1MB, the CloneNode approach is even faster for queries with selectivity less than 50%. When using the Saxon processor, the CloneNode approach is faster for a document size of more than 10KB for queries with selectivity less than or equal to 20%. At a document size of 1MB, the CloneNode approach is even faster for queries with selectivity less than or equal to 40%.

8.2.1.2. Evaluating the SAXFilter approach. When using the SAXFilter approach (see Fig. 26), we consider the time for transforming the given query QXPath to R and for then parsing the XML document by a SAX parser, filtering the SAX events for R(D) by the SAXFilter approach and generating the resultant XML fragment R(D) as DOM tree by a SAX to DOM adapter, which we transform by the XSLT processor according to the XSLT stylesheet in Fig. 20. In comparison, without using the SAXFilter approach (see Fig. 27), we take the input stream of the XML document as input for the XSLT processor. In both cases, we retrieve SAX events as result of the XSLT processor, which we query by the SAXFilter approach for QXPath. Finally, we transmit the resultant SAX events of the SAXFilter module to the SAX DefaultHandler.

Fig. 26. Using the SAXFilter approach.


Fig. 27. Transforming the entire XML document using the SAX API.

Fig. 28. Comparing the SAXFilter approach in QXPath(S(R(D))) with QXPath(S(D)) using the Xalan XSLT processor.

Fig. 28 shows the entire time when using the Xalan processor. Fig. 29 shows the entire time when using the Saxon processor. We reserve 768MB for the Java virtual machine and no out of memory error occurs in the experiments. When using the Xalan processor, the SAXFilter approach is faster for a document size larger than 1MB for queries with selectivity less than or equal to 50%, for a document size larger than 7MB the SAXFilter approach is faster for queries with selectivity less than or equal to 70%, and for a document size larger than 15MB the SAXFilter approach is even faster for queries with a selectivity less than or equal to 80%. When using the Saxon processor, the SAXFilter approach is faster for a document size larger than 1MB for queries with selectivity less than or equal to 50%.

8.2.2. Varying the selectivity whilst maintaining constant file size

In the following representations of the experimental results, we keep the file size fixed at 7MB and vary the selectivity of the queries. Furthermore, we present not only the entire time, which is used by our approaches, but also the times, which are used for the query transformation, for retrieving the resultant XML fragment R(D) and for transforming the resultant XML fragment R(D). We can conclude from the experiments that the time used by the query transformation algorithm is negligible, and that the time used for the transformation of the resultant XML fragment R(D)


Fig. 29. Comparing the SAXFilter approach in QXPath(S(R(D))) with QXPath(S(D)) using the Saxon XSLT processor.

Fig. 30. Xalan processor with CloneNode approach at 7MB.

increases linearly with the size of R(D). Furthermore, the Saxon XSLT processor transforms faster so that our approaches benefit more, if the Xalan XSLT processor is used.

8.2.2.1. The CloneNode approach. Fig. 30 presents the results of the experiments when using the Xalan processor. Fig. 31 presents the results of the experiments when using the Saxon processor. Figs. 30 and 31 contain a curve, which presents the time spent for query reformulation (1), a curve for the time spent for parsing the original XML document D and applying the CloneNode approach afterwards (2), and a curve for transforming the resultant XML fragment (3). In both experiments, we can neglect the time spent for query reformulation; most time is spent for parsing


Fig. 31. Saxon processor with CloneNode approach at 7MB.

Fig. 32. Xalan processor with SAXFilter approach at 7MB.

the XML document D, generating the DOM tree and using the CloneNode approach. Furthermore, the time spent for transforming the resultant XML fragment is linear in the size of the resultant XML fragment.

8.2.2.2. The SAXFilter approach. Fig. 32 presents the results of the experiments when using the Xalan processor. Fig. 33 presents the results of the experiments when using the Saxon processor. Figs. 32 and 33 contain a curve, which presents the time spent for query reformulation (1), a curve for the time spent for parsing the original XML document D, applying the SAXFilter approach and a SAX to DOM adapter afterwards (2), and a curve for transforming the resultant XML fragment (3). In both experiments, we can neglect the time spent for query reformulation. In contrast to the CloneNode approach, the time for parsing the XML document by a SAX parser, for using the SAXFilter approach and the SAX to DOM adapter afterwards, increases linearly with the


Fig. 33. Saxon processor with SAXFilter approach at 7MB.

size of the resultant XML fragment R(D). Again, the time spent on transforming the resultant XML fragment increases linearly with the size of the resultant XML fragment.
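The SAXFilter pipeline measured above (SAX parsing, filtering, then a SAX to DOM adapter) can be illustrated with a small Python sketch. It is not the implementation used in the experiments: the names FilteringDomBuilder, filter_to_dom and keep_paths are hypothetical, and a fixed set of root-to-element paths stands in for the reformulated query R, whose real form is not reproduced here.

```python
import xml.sax
from xml.dom.minidom import Document


class FilteringDomBuilder(xml.sax.ContentHandler):
    """SAX handler that builds a DOM containing only retained elements.

    keep_paths is a hypothetical stand-in for the reformulated query R:
    a set of root-to-element tag tuples whose subtrees should survive.
    """

    def __init__(self, keep_paths):
        super().__init__()
        self.keep_paths = set(keep_paths)
        # Proper prefixes of retained paths are kept as bare ancestors.
        self.prefixes = {p[:i] for p in self.keep_paths
                         for i in range(1, len(p))}
        self.doc = Document()
        self.path = []            # tags of the open source elements
        self.emitted = []         # parallel to path: was the element copied?
        self.nodes = [self.doc]   # open nodes in the output DOM
        self.inside = 0           # > 0 while inside a retained subtree

    def startElement(self, name, attrs):
        self.path.append(name)
        current = tuple(self.path)
        if self.inside or current in self.keep_paths:
            self.inside += 1
            keep = True
        else:
            keep = current in self.prefixes
        self.emitted.append(keep)
        if keep:
            element = self.doc.createElement(name)
            for key, value in attrs.items():
                element.setAttribute(key, value)
            self.nodes[-1].appendChild(element)
            self.nodes.append(element)

    def characters(self, content):
        # Text is copied only inside retained subtrees.
        if self.inside:
            self.nodes[-1].appendChild(self.doc.createTextNode(content))

    def endElement(self, name):
        self.path.pop()
        if self.emitted.pop():
            self.nodes.pop()
        if self.inside:
            self.inside -= 1


def filter_to_dom(xml_text, keep_paths):
    """Parse xml_text with SAX and return the filtered DOM document."""
    handler = FilteringDomBuilder(keep_paths)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.doc


# Illustrative input document (an assumption, not data from the experiments).
SOURCE = ("<shop><item><name>pen</name><price>2</price></item>"
          "<staff><name>Ann</name></staff></shop>")

fragment = filter_to_dom(SOURCE, {("shop", "item", "name")})
print(fragment.documentElement.toxml())
```

In this sketch the whole document is still read by the SAX parser, but only the retained elements are materialised as DOM nodes, so the cost of building and later transforming the fragment depends on the size of the fragment rather than on the size of D.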

9. Summary and conclusions

Whenever XML data D given in an XML format Forig can be transformed by an XSLT stylesheet S into an XML format Ftransf, and a query expressed in terms of format Ftransf has to be applied, our goals are to avoid replicas, to reduce the processing costs of document transformation by an XSLT processor, and to reduce data shipping costs in distributed scenarios. In our approach, we use a given XSLT stylesheet S to transform a given query Q (which can be an XPath query or an XSLT query) into a query R. R can be applied to the input XML document D in order to retrieve a smaller XML fragment R(D), which contains all the relevant data. R(D) can be transformed by the XSLT stylesheet into S(R(D)), from which the query Q selects the relevant data. We have presented a proof that our query transformation algorithm is correct and that the XML fragment R(D) is minimal.

Furthermore, our experimental results show that our approach to queries on transformed XML data has considerable advantages over transforming the entire XML document, particularly for queries on large XML documents. We have also shown that our approach is scalable and becomes more efficient for larger XML documents. The query selectivity at which our approach becomes faster converges to a constant limit. This limit depends on the approach used (CloneNode or SAXFilter), the XSLT processor, and the input and output APIs used (DOM or SAX). In a professional environment, our approach can be switched on and off depending on the file size of the original XML document and on estimates of the selectivity of the transformed query.

Whenever XML data is transformed according to an XSLT stylesheet, our approach can be used to enable on-demand transformations; in particular, it can be incorporated into XML databases in an efficient and scalable manner. Furthermore, our approach can be included


in relational databases that are able to generate XML data from the relational data and then transform the generated XML data with an XSLT processor.

References

[1] S. Abiteboul, On views and XML, in: PODS, 1999, pp. 1–9.
[2] S. Abiteboul, S. Cluet, T. Milo, Correspondence and translation for heterogeneous data, in: Proc. of the 6th ICDT, 1997.
[3] Apache Software Foundation, Xalan-Java, http://xml.apache.org/xalan-j/index.html, 2003.
[4] Apache Software Foundation, Xerces2 Java Parser 2.5.0 Release, http://xml.apache.org/xerces2-j, 2003.
[5] S. Böttcher, R. Steinmetz, Optimized internet search based on an intersection test for XPath expressions under a DTD, International Conference on Internet Computing IC04, Las Vegas, June 2004.
[6] S. Böttcher, A. Türling, Checking XPath expressions for synchronization, access control and reuse of query results on mobile clients, Workshop: Database Mechanisms for Mobile Applications, Karlsruhe, Germany, 2003.
[7] R. Bourret, C. Bornhövd, A.P. Buchmann, A generic load/extract utility for data transfer between XML documents and relational databases, 2nd International Workshop on Advanced Issues of EC and Web-based Information Systems (WECWIS), San Jose, California, 2000.
[8] C.-C.K. Chang, H. Garcia-Molina, Approximate query translation across heterogeneous information sources, VLDB 2000, 2000.
[9] Y.B. Chen, T.W. Ling, M.L. Lee, Designing valid XML views, ER 2002, LNCS 2503, 2002, pp. 463–477.
[10] S. Cluet, C. Delobel, J. Simon, K. Smaga, Your mediators need data conversion! in: Proceedings of the 1998 ACM SIGMOD Conference, 1998.
[11] S. Cluet, P. Veltri, D. Vodislav, Views in a large scale XML repository, in: Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
[12] A. Deutsch, V. Tannen, Reformulation of XML queries and constraints, in: ICDT 2003, LNCS 2572, 2003, pp. 225–241.
[13] Y. Diao, M. Altinel, M.J. Franklin, H. Zhang, P. Fischer, Path sharing and predicate evaluation for high-performance XML filtering, in: TODS, 2003.
[14] Y. Diao, S. Rizvi, M.J. Franklin, Towards an internet-scale XML dissemination service, in: Proceedings of VLDB 2004, 2004.
[15] M. Fernández, Y. Kadiyska, D. Suciu, A. Morishima, W.C. Tan, SilkRoute: a framework for publishing relational data in XML, ACM Transactions on Database Systems 27 (4) (2002) 438–493.
[16] D. Fisher, F. Lam, R.K. Wong, Algebraic transformation and optimization for XQuery, APWeb 2004, LNCS 3007, 2004, pp. 201–210.
[17] G. Gottlob, C. Koch, R. Pichler, The complexity of XPath query evaluation, in: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2003), San Diego, California, USA, 2003.
[18] M. Grinev, S. Kuznetsov, Towards an exhaustive set of rewriting rules for XQuery optimisation: BizQuery experience, ADBIS 2002, LNCS 2435, 2002, pp. 340–345.
[19] S. Groppe, S. Böttcher, Querying transformed XML documents: determining a sufficient fragment of the original document, 3rd International Workshop Web Databases (WebDB), Berlin, 2003.
[20] S. Groppe, S. Böttcher, XPath query transformation based on XSLT stylesheets, Fifth International Workshop on Web Information and Data Management (WIDM03), New Orleans, LA, USA, 2003.
[21] S. Groppe, S. Böttcher, G. Birkenheuer, Efficient querying of transformed XML documents, 6th International Conference on Enterprise Information Systems (ICEIS 2004), Porto, Portugal, 2004.
[22] S. Groppe, S. Böttcher, R. Heckel, G. Birkenheuer, Using XSLT stylesheets to transform XPath queries, Eighth East-European Conference on Advances in Databases and Information Systems (ADBIS 2004), Budapest, Hungary, September 2004.
[23] S. Jain, R. Mahajan, D. Suciu, Translating XSLT programs to efficient SQL queries, in: Proceedings of the Eleventh International World Wide Web Conference (WWW 2002), Honolulu, HI, USA, 2002.


[24] M.H. Kay, Saxon: The XSLT and XQuery processor, http://saxon.sourceforge.net, April 2004.
[25] R. Krishnamurthy, R. Kaushik, J.F. Naughton, Efficient XML-to-SQL query translation: where to add the intelligence? in: Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada, 2004.
[26] S. Lechner, G. Preuner, M. Schrefl, Translating XQuery into XSLT, ER 2001 Workshops, Yokohama, Japan, 2001.
[27] D. Megginson, SAX, http://www.saxproject.org/, 2000.
[28] A. Marian, J. Siméon, Projecting XML documents, in: Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
[29] G. Moerkotte, Incorporating XSL processing into database engines, in: Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[30] D. Olteanu, T. Furche, F. Bry, Evaluating complex queries against XML streams with polynomial combined complexity, 21st British National Conference on Databases, BNCOD 21, Edinburgh, UK, July 7–9, 2004.
[31] D. Olteanu, H. Meuss, T. Furche, F. Bry, XPath: looking forward, XML-based data management (XMLDM), EDBT Workshops, Prague, Czech Republic, 2002.
[32] S. Paparizos, Y. Wu, L.V.S. Lakshmanan, H.V. Jagadish, Tree logical classes for efficient evaluation of XQuery, SIGMOD 2004, Paris, France, 2004.
[33] M. Rys, Bringing the internet to your database: using SQL server 2000 and XML to build loosely coupled systems, ICDE 2001, Heidelberg, Germany, 2001.
[34] J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, J. Funderburk, Querying XML views of relational data, in: Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
[35] J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, B. Reinwald, Efficiently publishing relational data as XML documents, VLDB Journal 10 (2–3) (2001).
[36] L. Wang, M. Mulchandani, E.A. Rundensteiner, Updating XQuery views published over relational data: a roundtrip case study, XSym 2003, LNCS 2824, 2003, pp. 223–237.
[37] World Wide Web Consortium (W3C), Document Object Model (DOM) Level 3 Core Specification Version 1.0, W3C Recommendation, http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/, 2004.
[38] W3C, Extensible Stylesheet Language (XSL), http://www.w3.org/Style/XSL/, 2001.
[39] W3C, XML Path Language (XPath) Version 1.0, http://www.w3.org/TR/xpath/, 1999.

Sven Groppe earned his diploma degree in Informatik (Computer Science) from the University of Paderborn in 2002. He is currently working on his Doctorate at C-Lab, which is a research institute in co-operation with Siemens and the University of Paderborn. In 2001/2002, he worked in the project B2B-ECOM, which dealt with distributed internet market places for the electrical industry. From 2002 to 2004, he worked in the project MEMPHIS in the area of premium services. Both projects were funded by the European Union. His research interests include XML and semistructured data, query reformulation, data integration in heterogeneous environments, distributed systems, electronic market places, web services and mobile devices.

Stefan Böttcher is a Professor for Databases and E-Commerce at the University of Paderborn. After receiving his Ph.D. from Johann Wolfgang Goethe-University in Frankfurt, he was a research fellow at IBM Scientific Center in Stuttgart and a senior researcher at Daimler Benz Research Center Ulm. In 1993 he became a Professor at the University of Applied Science in Ulm, and in 1997 he joined the University of Paderborn. His research interests cover query optimization, caching and privacy in XML databases, transaction processing in MANETs and middleware for distributed E-Commerce systems.


Georg Birkenheuer received a B.Sc. degree in Computer Science from the University of Paderborn, Germany in 2004. He is currently working on his Diplom Informatik (comparable to a Master's degree in Computer Science) at the same institution. His areas of research include XML and semistructured data, distributed resource management and grid computing.

André Höing received a Bachelor's degree in Computer Science from the University of Paderborn, Germany in 2004. He is currently working on his Diplom of Computer Science (comparable to a Master's degree) at the same institution.