Future Generation Computer Systems, journal homepage: www.elsevier.com/locate/fgcs

Advantages of complex SQL types in storing XML documents

Kai Schweinsberg*, Lutz Wegner
Universität Kassel, FB 16 Elektrotechnik/Informatik, D-34121 Kassel, Germany

Highlights

• We look at new ways to map XML documents to relational database systems.
• We preserve the hierarchical structure of the documents.
• The mapping uses the advanced types available following the SQL:2003 standard.
• We design and implement a reversible conversion method in a prototype.
• The performance of the mapping process (time and space) is evaluated.

Article info

Article history:
Received 22 April 2015
Received in revised form 15 December 2015
Accepted 24 February 2016
Available online xxxx

Keywords:
XML mapping
SQL:2003 complex types
Performance evaluation

Abstract

This contribution looks at and evaluates new ways to map XML documents to relational database systems preserving their hierarchical structure. In particular we take advantage of the advanced features available following the SQL:2003 standard, which defines complex structure and collection types. Starting out with a survey of existing methods, basic questions concerning appropriate and efficient forms of mapping XML documents to SQL-compliant data types are addressed. This leads to a reversible conversion method, which is subsequently implemented in a prototype application. The design of the mapping is geared towards existing, mature relational database management systems (DBMS). It turns out that among the products considered, IBM Informix seems to offer the best support for complex structure and collection types in the context of our application. To better assess the performance of the mapping process, the observed running time and the required in-memory and database storage of the implementation are measured and evaluated.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Motivation

The markup language XML is used for annotating documents and has established itself as a data exchange format. Consequently, there is a need to store and transfer these documents not only as pure text files, but also in a better structured way in persistent storage. This can be achieved by means of dedicated XML database systems or by relational database systems. Traditionally, relational databases either store the XML documents unchanged as binary or string objects, or the documents are broken up and stored normalized in conventional relational tables (so-called flattening or shredding of the hierarchical structure). Both approaches are unsatisfactory. This motivates a new approach which lies between the two previous solutions.



* Corresponding author. E-mail addresses: [email protected] (K. Schweinsberg), [email protected] (L. Wegner).
http://dx.doi.org/10.1016/j.future.2016.02.013

It takes advantage of the opportunities offered by the advanced features of the SQL standard. SQL:2003 defines complex structure and collection types (tuples, arrays, lists, sets, multisets). They allow a mapping of XML documents to a relational schema which preserves the hierarchical structure. This offers two advantages: on the one hand, proven technologies, which originate from the area of relational databases, are available without any restrictions. On the other hand, using SQL collection types preserves the inherent tree structure of XML documents, so that in rebuilding documents, expensive joins over the tuples from normalized and distributed tables can be avoided.

1.2. Structure of this paper

The paper starts with basic questions concerning appropriate and efficient forms of mapping XML documents to SQL:2003-compliant data types. Based on this, a suitable, reversible conversion method is developed which is subsequently implemented in a prototype application and then analyzed. The design of the mapping aims at later use with an existing, mature relational database management system (DBMS). However, support of SQL:2003 in the commercial DBMS world is still incomplete. Thus we investigate to what extent the various systems are suited for the implementation of the mapping process. The paper closes with a performance evaluation of our Informix implementation, which includes a comparison with a rival method, here Oracle's object-relational storage, and lists future research issues.

1.3. XML

We assume familiarity with the Extensible Markup Language, XML for short, as defined by the W3C (World Wide Web Consortium) in [1] and for which numerous tutorials and textbooks exist [2]. XML itself is a meta language which defines other markup languages, e.g. MathML or SVG (Scalable Vector Graphics). Languages which are derived from XML serve two major purposes: first, they define well-formed – and if combined with a schema document (Document Type Definition (DTD) or XML Schema) even valid – documents as containers for information. Stylesheets guide the visual representation for the most common mode of usage, namely viewing the documents in a browser. Second, XML plays a major role in defining standards for data exchange, notably for SOAP and XBRL (Extensible Business Reporting Language).

XML documents can be classified as either document-centered (also called text-centered) or data-centered. As an example, the ordering of elements is important for the first type, while it does not matter much for the second type, where pieces of content are often retrieved using XQuery. All in all, XML documents are ordered tree-like structures with seven types of nodes: a root, elements, texts, attributes, namespaces, processing instructions, and comments. Whatever mappings are devised, they must take care of the specific requirements of each node type, in particular if the mapping is to be reversible (see Section 4.4). Properties of the nodes, how to address them, and the data model as a whole are defined in the XML Path Language and XPath Data Model standards [3,4].

2. Object-relational database systems and SQL:2003

2.1. The SQL standards

Most readers will be familiar with SQL (Structured Query Language) and its standardization after 1986. Table 1 lists the evolution steps of the SQL standard. In the context of this contribution, we focus on the revisions and extensions introduced with SQL:2003 [5]. The most important innovations which the standard established in 2003 were XML-related functions, support for the programming language Java, and the introduction of sequences and columns with identities and automatically generated values. Furthermore, there is now a multiset data type and there are table-valued functions, both extending object-relational concepts already introduced in SQL:1999. Also worth mentioning is a new, convenient variant of the CREATE TABLE command which uses an SQL query as its declaration base. Michels et al. [6] give a good overview of the new features.

Although over ten years old now, SQL:2003 is by no means outdated. In fact, many of the features outside the minimal subset, called Core SQL, have either not been implemented by the big RDBMS vendors or have been implemented differently. This will be seen in our implementation considerations below (cf. Section 2.2). Furthermore, SQL:2006 three years later added only one small part, namely SQL/XML, which also defines how XQueries can be embedded into SQL queries. We consider SQL/XML as part of the review of mapping alternatives further below (Section 3.4).



Table 1
Evolution of the SQL standards.

Year   Name
1986   SQL-86 (SQL-87)
1989   SQL-89
1992   SQL-92 (SQL2)
1999   SQL:1999 (SQL3)
2003   SQL:2003
2006   SQL:2006 (only part 14: SQL/XML)
2008   SQL:2008
2011   SQL:2011

Table 2
Support for object-relational data types (X = supported).

Data Type            SQL:1999   SQL:2003   Oracle   DB2   Informix
ROW (unnamed)        X          X          –        –     X
ROW (named)          –          –          –        –     X
SET (unnamed)        –          –          –        –     X
MULTISET (unnamed)   –          X          –        –     X
TABLE (named)        –          –          X        –     –
LIST (unnamed)       –          –          –        –     X
ARRAY (unnamed)      X          X          –        –     –
ARRAY (named)        –          –          X        –     –
OBJECT (named)       X          X          X        X     –
REF (unnamed)        X          X          X        X     –

The subsequent standard SQL:2008 defined commands already present in many implementations, like TRUNCATE TABLE, extended CASE and MERGE, and introduced INSTEAD OF triggers, but did not touch the new data types from 2003. The most recent standard, SQL:2011, added insert, update and delete operations inside queries and introduced temporary tables, together with other minor changes. Thus, SQL:2003 (together with SQL/XML from 2006) still remains the standard which defines what an object-relational DBMS should offer for XML storage and retrieval.

2.2. The object-relational type system and its implementations

It is impossible to describe the type system of the SQL standards from 1999 and 2003 together with the implementations of the dialects in Oracle, DB2, and Informix in less than ten pages, not to mention the religious wars among database people surrounding the terms ''object'' and ''object reference''. We must therefore refer the reader to the literature, in particular to the excellent textbook by Türker [7]. However, Table 2 summarizes the offered types sufficiently well and we comment on the deficiencies with respect to our mapping task.

It is obvious at a short glance that the support of object-relational data types differs very much among the database systems. None of the SQL:2003 data types is supported by all three systems. Whatever XML mapping one may devise will be very implementation specific if one decides to use these advanced features. Informix offers good support in the area of tuple and collection types, but misses out on object and object reference types. Many database researchers therefore prefer to call Informix an extended relational database system. Conversely, DB2 knows object and reference types but lacks built-in types for collections and tuples. Finally, Oracle is a more balanced implementation; object and reference types may be declared, although syntactically different from the SQL standard. There are array types, but unlike in the SQL:2003 standard, the constructor is named. Instead of multisets, Oracle offers a type TABLE with much the same functionality. The situation with SQL Server and with open source products is rather bad. MySQL has no collection types. PostgreSQL offers an array type and a composite type which together resemble LIST and ROW, but it lacks sets and multisets. We return to this topic once the mapping method has been fixed.
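To make these differences concrete, the following sketch shows how a collection-valued column might be declared in standard SQL:2003, in Informix, and in Oracle. It is our illustration, not taken from the paper; the person/phones example and all names are invented.

    -- SQL:2003: unnamed MULTISET constructor on a column
    CREATE TABLE person (
        name   VARCHAR(100),
        phones VARCHAR(30) MULTISET
    );

    -- Informix: unnamed collection constructors (SET, MULTISET, LIST)
    CREATE TABLE person (
        name   VARCHAR(100),
        phones LIST(VARCHAR(30) NOT NULL)
    );

    -- Oracle: no MULTISET column syntax; a named nested TABLE type instead
    CREATE TYPE phone_tab AS TABLE OF VARCHAR2(30);
    CREATE TABLE person (
        name   VARCHAR2(100),
        phones phone_tab
    ) NESTED TABLE phones STORE AS phones_nt;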


3. Alternatives in storing XML documents

Apart from the classification of XML documents into data-centric vs. text-centric, XML documents may come in many variations. Thus it is difficult to devise a mapping which guarantees optimal storage and retrieval (called ''publishing'') regardless of the document structure. It is not surprising that the research community came up with many intricate solutions after 1999, few of which, however, claim to be industry-strength solutions. In the Ph.D. thesis which forms the basis for this article, Schweinsberg [8] classifies the approaches and gives examples of the storage result when applied to, e.g., the well-known mondial document [9]. We use this taxonomy and list the methods which go into each category, but for lack of space without examples.1 The main purpose is to set our method into perspective. Three fundamental storage methods are distinguished:

• text-based methods which store the complete XML document as text.

• structure-based methods which examine the XML documents and store them in database tables whose schemas reflect the document structure.
• model-based methods which choose a generic model reflecting the general, tree-like structure of XML documents without regard for the particular document instance.

Some authors use different terms. Lausen [10] calls structure-based content-based, while model-based becomes structure-based.

3.1. Text-based methods

Storing the documents as long strings has several advantages. There is no need for schema information; in fact, it is possible to store fragments only. If the documents come with a signature, the signature remains valid as the documents are retrieved with byte-for-byte fidelity. On the other hand, the granularity of access is large and fast retrieval of particular elements or attributes is not possible. In practice, full-text indexing can solve some of these problems. The simplest way to implement text-based storage is to rely on the file system of a server. However, no transaction processing is possible and it is a step backwards into pre-database times. Consequently, a better approach is to use LOB (Large Object) columns in the DBMS, possibly supplemented by additional attributes. In SQL:2003, the data type is called CLOB (Character Large Object) and all commercial databases provide something similar. Oracle knows CLOB, NCLOB for the Unicode character set, and BLOB (Binary Large Object) without string semantics. LOB types can take data up to at least 2 GB, often more. However, unlike the VARCHAR2 type in Oracle, which is limited to 4000 characters, neither direct string comparisons nor use of the LIKE operator is possible. To offset these shortcomings, Oracle provides additional functionality in its Oracle Text package, which understands XPath expressions.
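As a minimal sketch of this alternative (table and column names invented for illustration, not taken from the paper), text-based storage amounts to little more than:

    CREATE TABLE xml_documents (
        docid INTEGER PRIMARY KEY,   -- additional attribute identifying the document
        doc   CLOB                   -- the complete XML text, stored verbatim
    );

Full-text or XPath-aware indexes, e.g. from the Oracle Text package mentioned above, would have to be declared separately.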

1 Examples may be found in chapter 4 of [8].

3.2. Structure-based methods

When many documents obey the same schema definition, structure-based methods can generate a database schema which maps the XML elements, attributes and texts of the document into appropriate entities, relations and attributes. The mapping can be done manually or by a mapping algorithm. Storage and retrieval of data is provided by an RDBMS module or through middleware. The schema can be optimized to support frequently accessed elements or attributes. However, this can turn into a disadvantage if applications or document structures change. Overall, structure-based methods are geared towards data-centric documents with a structure that is stable over time.

As for the mapping, one can either dissect the documents such that the nodes fit into ordinary ''flat'' database tables, or map the structure into non-normalized tables which resemble the NF² data model [11]. The first method is generally known as ''shredding'' (see [12] for an introduction to this technique); a minimal sketch follows at the end of this subsection. Whether to pack elements into one table or to start new tables for elements with sub-elements differs from algorithm to algorithm. The use of ordinal numbers to maintain sequence order is common but creates problems if the documents are updated later on. General difficulties arise with the mapping of XML value types to SQL types, with the XML Schema types choice and any, with mixed content, and with recursion. Mapping algorithms and performance considerations are discussed, e.g., in [13–18] and [19, p. 486ff.]; [20] considers mapping XPath expressions into equivalent clauses of SQL queries.

The second method makes use of the collection types of the database system as discussed for SQL:2003 above. In particular, if lists are available (as in the case of the eNF² data model [21]), the ordering problem can be solved in an elegant way. Variant and optional parts of the XML document lead to NULL values in the database. Mixed content and recursion, however, are still difficult to map. Oracle and other DBMS provide built-in mapping methods between XML documents and object-relational tables which resemble the NF² model above. They all require registering the XML schema with the DBMS. We will use the Oracle mapping in Section 5 on performance evaluation as a comparison. A disadvantage of this second method is the creation of large database schemas with correspondingly sparsely filled tables. Algebraic operations for querying and transforming XML documents which are stored in NF² relations are given in [22].
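As a hedged illustration of shredding (our example, not taken from [8] or [12]; table and column names are invented, loosely following the mondial structure), a country element with city children might be flattened as follows:

    -- The parent element becomes its own table
    CREATE TABLE country (
        country_id INTEGER PRIMARY KEY,
        car_code   VARCHAR(4)                   -- XML attribute of <country>
    );

    -- Sub-elements move to a separate table with a foreign key to the parent
    CREATE TABLE city (
        city_id    INTEGER PRIMARY KEY,
        country_id INTEGER REFERENCES country,  -- parent element
        pos        INTEGER,                     -- ordinal number keeps document order
        name       VARCHAR(60)                  -- text content of the <name> child
    );

Rebuilding a document from such tables requires joins over the parent–child keys, which is exactly the cost the collection-type variant tries to avoid.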

3.3. Model-based methods

This method provides a generic mapping for XML documents without looking at schema definitions. The general task is to record an ordered tree structure in database tables. It is the most thoroughly researched concept and we can only hint at the available algorithms.

• Edge table. All nodes in the tree are assigned a number in some order of traversal. In addition, for each node, the ID of the predecessor node is recorded, together with some ordering information to reconstruct the sequence of child nodes. Finally, the type of node, its name and its value must be stored. With a document-ID, one table can hold several documents (a minimal table sketch is given after this list). Edge tables can become very large and performance for XQuery is poor. Optimizations are possible, e.g. by splitting up the edge table as done in XRel [23] or by storing parent–child relations in another table as in XParent [24].
• Labeling schemes. Nodes in the tree are given labels which are chosen in such a way that the node position can be reconstructed from the label [25,26]. The art is to find ordering schemes which allow insertions and deletions of nodes without extensive relabeling. Several such schemes exist. A fairly simple one relies on the rank of a node in pre- and post-order traversals. Another one is the Dewey scheme, which records the path to a node in the label. If documents change frequently, the original Dewey scheme requires expensive relabeling. The Ordpath method, used in Microsoft's SQL Server, is based on Dewey, but avoids relabeling using some tricks in naming the nodes. Similarly, the dynamic Dewey labeling scheme (DDE) [27] claims to solve this problem. Next, labels can be derived from intervals. Each node records start, end, and level. Predecessor and successor, as well as parent–child relations, are easily computed. Insertions, however, destroy the scheme, and leaving gaps in the numbering system is of limited use. Finally, there are labeling schemes based on mathematical principles (prime labeling scheme [28]) and schemes based on coding theory (QED, Dynamic Quaternary Encoding [29]), which use some type of lexicographic ordering.

• IBM DB2 pureXML. Starting with Version 9, DB2 offers XML support. The new functionalities are called pureXML and provide a column type XML, which stores XML documents in a native format. This involves parsing the document and creating a DOM (Document Object Model) tree in which all names of elements and attributes are replaced by integers through a table SYSXMLSTRINGS. These integers are even valid across several documents with identical names. Another interesting point is how the documents are mapped to memory pages by means of a so-called regional index. Details can be found in [30]. Like Oracle, DB2 offers native XQuery support.
• eXist. We should mention eXist, which is one of the twenty native XML databases discussed in [31]. eXist uses a labeling scheme based on publications by Lee and Shin [32,33] and models the document as a k-tree. Labels quickly become large, but with a modification of the pure k-tree definition, 8-byte integers are sufficient in eXist for almost all documents.
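The following is a rough sketch of such a generic edge table (our illustration; the column names are invented and not taken from XRel, XParent, or [8]):

    CREATE TABLE edge (
        docid     INTEGER,          -- one table can hold several documents
        nodeid    INTEGER,          -- number assigned in traversal order
        parentid  INTEGER,          -- ID of the predecessor (parent) node
        ordinal   INTEGER,          -- position among the siblings
        nodetype  CHAR(1),          -- element, attribute, text, comment, ...
        name      VARCHAR(255),     -- element or attribute name, if any
        value     VARCHAR(4000),    -- text or attribute value, if any
        PRIMARY KEY (docid, nodeid)
    );

Reconstructing even a small subtree from such a table requires repeated self-joins over parentid, which is one reason for the poor XQuery performance noted above.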

3.4. SQL/XML

In reviewing and classifying XML-to-RDBMS methods, SQL/XML stands out as a standardized approach [5]; detailed descriptions can be found in Melton et al. [34,35] and in [7]. SQL/XML provides functionalities for the following tasks:

• XML output from queries over relational data
• storage and processing of XML data in relational databases
• queries over XML data and output of results either in XML or relational format
• mapping of relational data and identifiers to and from XML.

The standard makes no recommendations concerning the storage formats. In practice, textual storage is widespread because of its easy implementation. SQL/XML defines a data type XML which also accepts XML fragments or forests. There are publishing functions to be used inside SQL queries for converting relational data to XML elements and attributes. There are mappings for values and identifiers, including handling of the different data type universes. How XML Schema should be incorporated into the mapping process is not prescribed by the standard, but Melton strongly recommends registering schemas with the DBMS to avoid validation conflicts of previously stored documents resulting from subsequent schema changes (which the DBMS would prohibit for registered schemas).
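To give a flavour of the publishing functions (a minimal sketch; the city table and its columns are invented for illustration), a relational row can be turned into an XML element like this:

    -- Standard SQL/XML publishing functions
    SELECT XMLELEMENT(NAME "city",
               XMLATTRIBUTES(c.country AS "country"),
               XMLFOREST(c.name AS "name",
                         c.population AS "population"))
    FROM city c;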

4. A new method and its prototype implementation

Having reviewed storage and retrieval alternatives for XML documents in relational databases, we now propose a new bijective mapping from XML to SQL:2003-conformant data types. We start with properties and requirements, then identify which DBMS suits these goals best, and finally describe our Java application which maps into object-relational data structures and back.

4.1. Properties and requirements

Essentially we look for a model-based mapping which mirrors the tree structure of the XML documents in a most ''natural'' way.

To this end, SQL:2003 offers the collection types LIST, SET, MULTISET, and ROW. This allows us to combine a set (or list) of child nodes directly with its parent node. Clearly, text-based methods, complete shredding or tricky labeling schemes do not reveal this parent–child relation in such an obvious form. Of course, any database designer will immediately recognize that a hierarchical relation can be mapped into a table where each child carries the identifier (or primary key) of its parent. We argue that for XML documents this would correspond to processing from the leaves to the root, while, in fact, most elements are accessed with path expressions which start at the root element or at least process the tree top-down. For attributes it is essential that an element can see directly the set of attributes which belongs to it, rather than having each attribute carry the ID of the element it belongs to. Mixed content is crucial as well. Intermittent text can be considered as a text child and could go with sub-elements into a list of children. Long attribute and element names (tags) exceeding 30 characters are rare, but should be possible. Proprietary functionalities of any particular DBMS will not be considered, as they cause trouble with respect to portability. Stored procedures and easy access from a programming language, say Java through the JDBC interface, would be helpful.

4.2. Selecting a suitable database

As indicated in Table 2, support for the object-relational SQL:2003 data types differs among commercial vendors. In comparing the three big players Oracle, IBM DB2, and Informix, DB2 dropped out of the race early on as it does not offer collection types which could be used for table definitions. ARRAY types are supported and could represent set-valued types, but they are restricted to processing in stored procedures written in SQL PL; they are not first-class citizens for column definitions. Oracle looked like the winner for a long time, in particular since we run it at our institute for teaching, content management and administrative tasks. Collection types are available, but they are not as orthogonal as in Informix. Furthermore, a first prototype in Oracle turned out to be rather cumbersome to manage. The fact is that Oracle is so powerful and has included so many new features that even the so-called Enterprise Manager, which is an independent module with a graphical interface, often hides problems occurring deeper down instead of tracing them.

It came as somewhat of a surprise that Informix offered the best choice. Although smaller in its range of functionalities, it offered all collection types in an easy-to-manage environment. The only missing features were object methods, identities, and references. For the prototype we had in mind, this constituted no drawback, and everything can be done with relational functionalities and collection types. Informix SPL for stored procedures has no encapsulation of data and code, but otherwise it is on par with Oracle's PL/SQL. Experiences with early implementations both in Oracle and Informix supported our view that, for the given SQL:2003-compliant mapping, Informix Dynamic Server 11.7 was the DBMS of choice. On the downside of this comparison, it is clear that porting the implementation to another DBMS, despite trying to be SQL:2003-conformant, would require considerable rewriting.
4.3. Data structures

The most direct way to map the hierarchical structure of elements and sub-elements is to define nested tables in the sense of the NF² data model. As XML documents may have an arbitrary depth of nesting, defined by a recursive DTD or XML Schema
definition, we need recursive collection types.2 A closer look at user-defined data types in SQL:2003 [5, p. 632ff.], however, reveals that an attribute of a user-defined type may not be based on the type being declared. Accordingly, none of the three DBMS which we considered for the implementation allows direct or indirect recursion. As a way out, we can replace the sub-elements by references to sub-elements. The references could use object identifiers generated by the system or a user-generated identification system. In a structure-based model, each element definition in a DTD or XML Schema would generate a unique type for the database, which can handle text content, sub-elements, and attributes. The object references then go into a list-valued attribute. Alternatively, a generic structure with object references can be devised to reflect one parent–child level. Mixed content is again difficult but can be accommodated with some tricks. This approach belongs to the model-based methods and no DTD or XML Schema is needed. On the other hand, it is hard to validate insertions, deletions and updates of stored documents in the database without extracting the complete document. Balmin et al. [37], however, show that incremental checking is possible using an additional data structure whose size is in the order of the size of the document.

The implementation which we present uses this generic approach. As Informix knows neither object tables nor object identifiers or references, we take typed tables, based on tuple types. Each row takes one instance of this type. Object-IDs are modeled by foreign key relations (named NODEID). An additional document-ID (DOCID) is introduced to store multiple documents in one table. Together they form the composite primary key. Fig. 1 shows the necessary SQL commands for the type definitions in Informix. The definitions should be self-explanatory in connection with the following example (cf. Figs. 2 and 3) taken from the mondial document [9].

Other node types in XML documents require the insertion of additional tuples. For instance, consider text in XML documents, which can be very long. Texts longer than the 30,000 characters defined above as LVARCHAR are split up into separate sub-nodes of 30,000 characters each with node name !TEXT. Elements which contain these over-length texts receive a list of node-IDs in the SUBELEMENTREFS attribute for the fragments. This allows proper reconstruction regardless of tuple order. The limits we mention result from a length restriction of 32,767 bytes for the row type of tables in Informix.

Fig. 1. Type definition instructions.

Fig. 2. A city element in the mondial document.
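Since Fig. 1 is not reproduced here, the following is only a sketch of plausible Informix definitions consistent with the description above; apart from NODEID, DOCID and SUBELEMENTREFS, all names and field sizes are our assumptions.

    -- Named row type for attributes (assumed name and field sizes)
    CREATE ROW TYPE attribute_t (
        attrname  LVARCHAR(255),
        attrvalue LVARCHAR(30000)
    );

    -- Named row type for element nodes; children are referenced by node-ID
    CREATE ROW TYPE element_t (
        docid          INTEGER,                    -- document-ID
        nodeid         INTEGER,                    -- node-ID (object-ID substitute)
        elemname       LVARCHAR(255),              -- element name (tag)
        content        LVARCHAR(30000),            -- text content, split if longer
        attributes     SET(attribute_t NOT NULL),  -- attributes as name-value tuples
        subelementrefs LIST(INTEGER NOT NULL)      -- node-IDs of the children, in order
    );

    -- Typed table; DOCID and NODEID form the composite primary key
    CREATE TABLE elements OF TYPE element_t;
    ALTER TABLE elements ADD CONSTRAINT (PRIMARY KEY (docid, nodeid));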

2 [36] reports that, in practice, more than half of all DTDs are recursively built.

4.4. XML2DB implementation

Mixed content is handled similarly. Details for all other node types and properties – document root, XML version, standalone attribute, encoding, comments, processing instructions, CDATA sections, document type declarations, entities, and namespaces – can be found in [8]. The thesis also contains a description of the command-line tool XML2DB which handles storage and extraction of XML documents into and from the database. The insertion method works recursively in post-order, i.e. all sub-nodes are inserted as tuples first and return their node-IDs. These are collected and then stored as a list in the parent tuple.

Finally, the question of byte-fidelity arises. Because attributes are stored as sets of name-value tuples, their order may change in a reconstructed document. Similarly, single and double quotes around the values are converted into double quotes, entities are replaced by their texts, and ignorable whitespaces are condensed. Although a byte-by-byte identical reconstruction was not implemented, as it is prohibitively expensive both in time and added space, the mapping does produce logically identical documents as defined by the ''Canonical XML'' [38] standard.
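To illustrate the post-order insertion, here is a hedged example in terms of the table sketched in Section 4.3 (the values are invented and do not reproduce Fig. 2; document-ID 1 and node-IDs 100–102 are arbitrary):

    -- Children are inserted first and report their node-IDs ...
    INSERT INTO elements VALUES (1, 101, 'name', 'Kassel', SET{}, LIST{});
    INSERT INTO elements VALUES (1, 102, 'population', '194087', SET{}, LIST{});

    -- ... then the parent stores the collected node-IDs as an ordered list
    INSERT INTO elements VALUES (1, 100, 'city', NULL,
        SET{ROW('country', 'D')::attribute_t},
        LIST{101, 102});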

5. Performance evaluation

To evaluate the mapping, the three most relevant questions are:

• How fast is the mapping into and out of the database?
• How much space is consumed by the database in storing a document?

• How well does the proposed method compare to other existing solutions?

To answer them, a thorough performance evaluation of the Informix prototype was performed. It involved four different document styles and a comparison with the Oracle implementation. Readers are again referred to [8] for detailed results.

5.1. Document classes

The documents chosen for the empirical experiments included – apart from the previously mentioned mondial document – three more test files from the literature: bible.xml, purchaseOrder.xml, and SigmodRecord.xml. As can be seen from Table 3, they represent different document styles.

The document mondial.xml represents a typical data-centric XML document. There is a fairly well-balanced mixture of elements and attributes, elements are nested up to 5 levels deep, and each element has at most 5 attributes. One single parent element has up to 1357 child elements. Text content of elements and attribute values are relatively short, similar to what might be stored in columns of a relational table anyway. With a total size of 1.73 MB it belongs to the class of larger XML files. Due to its formatting, there is a high number of ignorable whitespaces.

Fig. 3. Mapping of elements and attributes (names shortened).

Table 3
Document structure in comparison.

Property                         mondial     bible       purOrder   SigRecord
File size (bytes)                1 810 233   1 120 635   1 039      478 336
Total no. of elements            30 822      7 379       25         11 526
Total no. of attributes          30 015      7 424       5          3 737
Number of diff. elements         48          17          15         11
Number of diff. attributes       22          11          3          1
Max. no. of attr. per elem.      5           5           1          1
Max. depth of elem. nesting      5           4           4          6
Max. no. of children per elem.   1 357       80          5          89
Total length attribute values    449 005     12 817      26         7 474
Total length text node cont.     138 688     890 551     181        146 015
Total length ignor. whitesp.     330 625     29 983      207        70 545

Table 4
Timing results for document insertion with DTD applied.

Time required                    mondial   bible   purOrder   SigRecord
In total (s)                     18.275    6.820   0.515      7.913
Parsing                          0.939     0.760   0.352      1.340
Delete old document              1.030     0.300   0.028      0.363
Other DB-operations              0.080     0.075   0.084      0.084
Document insertion               16.227    5.685   0.051      6.127
- of these DB-inserts            14.946    4.884   0.036      5.514
- - of these JDBC handling       1.203     0.513   0.020      0.686
- percentage for DB-inserts      92.1      85.9    70.6       90.0

The document bible.xml contains the text of the Luther bible of 1546 with mark-up satisfying Zefania XML, for which an XML schema exists. Clearly it belongs to the class of document- or text-centric XML files with low nesting and long text contents. In fact, 80% of all document characters belong to a text node.

purchaseOrder.xml, known as the po.xml example in the W3C recommendations, is in contrast a very short, data-centric document of the kind that appears in record keeping and has, e.g., at most one attribute per element. It is included to gain insight into setup costs when they are not amortized by thousands of node insertions and retrievals. Finally, SigmodRecord.xml contains a directory of articles from the ACM SIGMOD Record. With half a megabyte in size, it represents data-centric documents in the medium range, in this case also with a low number of different element types and only one attribute type.

5.2. Timing results

Experiments were conducted on an isolated server to eliminate influence from other traffic and were repeated ten times with the identical document. With the exception of purchaseOrder.xml, which is a very small document, all figures were very close to each other when scaled against the number of DOM nodes.

We consider first document insertion into the database. Apart from the total time taken, which of course depends mostly on the size of the document, it is important to understand which individual task contributes to what extent to the total effort (see Table 4 for the overall picture). As an example, inserting the mondial document into the database required a total of 18.275 s, of which 0.939 s involved parsing (instantiating the Xerces DOM parser, parsing the XML document and building the DOM tree in main memory), 1.030 s deleting the old document (with the same document-ID), 0.080 s other DB-operations, and finally 16.227 s the actual insertion into the database. This last figure can again be broken down into 14.946 s for the DB-inserts and 1.203 s for the JDBC handling. What is not shown in the table is that, of the 14.946 s for DB-inserts, 14.929 s concern element insertions, which include attributes. Everything else, e.g. texts/CDATA, comments, processing instructions and DocType declarations, can be neglected.

Table 5
Timing results for document reconstruction with DTD applied.

Time required                    mondial   bible   purOrder   SigRecord
In total (s)                     26.686    9.104   0.422      9.054
Find root node-ID                0.218     0.075   0.029      0.097
Other DB-operations              0.263     0.245   0.263      0.271
Document retrieval               25.891    8.540   0.106      8.440
- of these DB-selects            24.130    7.568   0.062      7.793
- - of these JDBC handling       6.513     2.214   0.043      2.577
- percentage of DB-selects       93.2      88.6    58.5       92.3
DOM serialization                0.313     0.244   0.024      0.246

One very important fact must be mentioned. Even though XML2DB is a model-based method, a DTD existed for each document and could be used by the parser to recognize ignorable whitespaces, which are then deleted. Without a DTD, they must go into the database as text nodes. If that is the case, the number of database entries in the ELEMENTS table doubles both for the mondial and for the bible document. Accordingly, the performance suffers and the execution time for DB-inserts climbs from 14.946 s to 23.432 s for the mondial document, resp. from 4.884 s to 7.415 s for the bible document.

Consider next reconstructing a document, which actually takes more time than insertion (see Table 5). The task starts with finding the root node-ID, which means fetching the node-ID of the document root node when given the document-ID. The term ''other DB-operations'' covers opening and closing the database connection and initializing SQL commands. The core of the task is document retrieval, which means initializing the DOM tree and subsequently reading the table rows to recursively build the tree. DOM serialization is the common conversion of the in-memory DOM tree into an XML document to be written to standard output. For the mondial document with DTD, we need 26.686 s in total, of which 25.891 s belong to the actual retrieval. Of these, 6.513 s are attributed to the JDBC handling. Only 0.313 s – or up to 3% of the total time – go into the serialization of the DOM tree into an output document, once the tree has been constructed from the database tuples.

How does the document structure, e.g. depth of nesting and number of different attribute types, influence the performance? To answer this question, one needs to normalize the data, i.e. divide the measured times by the number of elements or DOM nodes, respectively.

Table 6
Normalized timing results for document reconstruction with DTD applied.

Time required (in ms)   mondial   bible   purOrder   SigRecord
Per XML element         0.856     1.201   15.305     0.764
Per DB-select           0.856     1.200   14.738     0.764
Per DOM node            0.513     0.611   8.651      0.442

Table 7
Database storage requirements with compression and with DTD applied.

                                      mondial   bible    SigRecord
Page size (kB)                        4         4        4
Pages consumed for data               872       338      234
- unused space inside (kB)            1686      644      459
Pages consumed for index              163       41       63
Size of compression dictionary (kB)   72        82       63
Storage requirement in total (kB)     2526      954      793
Database to file size ratio           1.43:1    0.87:1   1.70:1

Fig. 4. Time per DOM node for retrieval from the database.

This has been done in Table 6. As in the tables before, purchaseOrder.xml is out of scale by a factor of ten or more because of its extremely short length. The other data seem to indicate that long texts, as in the bible.xml document, play a larger role than the variability of elements and attributes.

Finally, Schweinsberg [8] conducts another experiment in which he doubles the size of the purchaseOrder document by copying the contents repeatedly under the root. This way he can precisely measure the amortization of setup costs with a growing number of nodes. For document reconstruction, the graph in Fig. 4 shows that past about 6000 nodes, the time to perform the select operations, measured per DOM node, becomes invariant at less than 0.6 ms. For insertions, the values are even lower, with less than 0.5 ms past 2000 nodes.

5.3. Space requirements

To answer the second question (storage space) we looked at the required heap and non-heap space in memory for document insertion. Across all documents, the non-heap space is largely invariant at about 9 MB, while the heap space grows with the document size. The growth actually happens during the parsing of the document in the initial phase of the document storage and is needed for the DOM tree. Afterwards, when the nodes are successively written into the database, little extra space is required on top. For the larger documents, between 200 and 400 bytes are initially allocated per DOM node. The situation is similar for the retrieval of documents, but there the heap space grows continually with each select operation which creates a DOM node.

The more interesting part is the database storage. The mondial document had a file size of about 1.7 MB. When stored in the database, using a DTD for parsing but without compression, it takes up 7.0 MB – a growth by a factor of 4! This might come as a surprise considering that XML documents are claimed to be verbose. In the database, additional overhead comes from type and length information; texts and names are just as long as in the XML documents. This changes when compression is applied, for which Informix has an option. Table 7 summarizes the space requirements with compression, again under the assumption that a DTD was available when the document was stored.

Fig. 5. Database space requirements for growing document file sizes.

Note that the purchaseOrder document could not be compressed, as the algorithm requires at least 2000 rows to work. Now mondial.xml takes only 2.5 MB of database storage; in the case of the bible document, the storage even drops below the original document size, with a factor of 0.87. Insertion times also drop by about 3%–4%, while extraction times remain unchanged.

As with the time measurements in the section above, Schweinsberg [8] determines the asymptotic space requirements for growing file sizes. In the test runs, the size of the purchaseOrder document is again successively doubled by copying the contents repeatedly under the root. Fig. 5 shows the result. Clearly, for XML documents with at least a few hundred nodes, the needed database space is linearly correlated with the file size. Without compression, the growth factor is 5.41, with compression around 1.65 for files larger than 2 MB.

5.4. Comparison with Oracle's storage method

Finally, to get a feeling for the quality of our mapping method, we ran the same experiments, but restricted to the mondial document, using Oracle's built-in structured (object-relational) storage method. Oracle's mapping guarantees DOM fidelity. It requires registering the document schema, but does not accept DTDs. Thus we rewrote the DTD into an XML schema. Oracle then generates a suitable table of type XMLType and nested sub-tables to map the hierarchic structure of the document. In the case of mondial.xml, 34 sub-tables were required. Also, one attribute value had to be shortened, as it exceeded an unexpected Oracle limit of presumably 2000 bytes.

As Table 8 shows, insertion and retrieval in Oracle are faster by a factor of 3.3 and 4.5, respectively, than what XML2DB could achieve under Informix. Measurements for DB2 were not taken but are expected to be similar to Oracle's. The poor performance of XML2DB is not, or at least not entirely, due to the mapping design. The cause lies in the database access: XML2DB runs outside the database using a JDBC connection to issue SQL statements. In contrast, with Oracle and DB2, insertions and retrievals occur inside the database in a built-in module, which may profit from kernel-related optimizations.
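For reference, the Oracle setup just described amounts to roughly the following (a hedged sketch; the schema URL, the placeholder schema text and the table name are ours, not the ones actually used in the experiments):

    -- Register the rewritten XML schema and let Oracle generate the
    -- object-relational types and nested sub-tables
    DECLARE
      xsd CLOB := '... contents of the rewritten mondial XML schema ...';
    BEGIN
      DBMS_XMLSCHEMA.registerSchema(
          schemaurl => 'http://example.org/mondial.xsd',
          schemadoc => xsd,
          gentables => TRUE);
    END;
    /

    -- Structured (object-relational) storage for documents of that schema
    CREATE TABLE mondial_or OF XMLType
      XMLSCHEMA "http://example.org/mondial.xsd" ELEMENT "mondial";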

Table 8
Timing comparisons (in s) for the mondial document.

Operation                                       XML2DB   Oracle
Insertion (includes deletion of old document)   18.275   5.510
Retrieval                                       26.686   5.973

6. Conclusion

We reviewed the XML-to-RDBMS mapping problem and offered a taxonomy. A model-based method was devised which makes use of SQL:2003-conformant collection types. A prototype was then built and implemented for Informix, which came out as the winner against DB2 and Oracle as far as type requirements were concerned. A performance evaluation revealed that the prototype is considerably slower than a comparable built-in method in Oracle, in part due to the JDBC connection. Storage requirements are acceptable when Informix's compression is used. Although the storage method is generic, having a DTD when parsing the documents at insertion time is essential for dropping ignorable whitespaces. This cuts execution times in half.

Future research should go into update functionality under the given mapping. In particular, document changes should map into local, incremental database updates (cf. [37]) and their validity should also be checked locally inside the database. Index optimizations and XQuery translations are other natural themes to look into.

Acknowledgments

We would like to thank the chairmen of the MEDI 2014 conference – Yamine AIT AMEUR, Ladjel BELLATRECHE, and George PAPADOPOULOS – for the opportunity to expand and revise our conference contribution for this FGCS journal. Likewise, the valuable comments of the reviewers are gratefully acknowledged.

References

[1] J. Paoli, M. Sperberg-McQueen, F. Yergeau, J. Cowan, T. Bray, E. Maler, Extensible markup language (XML) 1.1 (second edition), W3C recommendation, W3C, http://www.w3.org/TR/2006/REC-xml11-20060816 (August 2006).
[2] E.R. Harold, XML 1.1 Bible, Wiley & Sons, 2004.
[3] J. Simeon, A. Berglund, S. Boag, J. Robie, D. Chamberlin, M. Fernandez, M. Kay, XML path language (XPath) 2.0 (second edition), W3C recommendation, W3C, http://www.w3.org/TR/2010/REC-xpath20-20101214/ (December 2010).
[4] J. Marsh, M. Fernandez, N. Walsh, A. Malhotra, M. Nagy, A. Berglund, XQuery 1.0 and XPath 2.0 data model (XDM) (second edition), W3C recommendation, W3C, http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/ (December 2010).
[5] ISO, International standard: Information technology—Database languages—SQL, Tech. Rep., International Organization for Standardization, 2003, ISO/IEC 9075(1-4, 9-11, 13, 14):2003.
[6] J.-E. Michels, K.G. Kulkarni, C.M. Farrar, A. Eisenberg, N.M. Mattos, H. Darwen, The SQL standard, IT—Inf. Technol. 45 (1) (2003) 30–38.
[7] C. Türker, SQL:1999 & SQL:2003—Objektrelationales SQL, SQLJ & SQL/XML, Dpunkt-Verlag, 2003.
[8] K. Schweinsberg, Abbildung von XML-Dokumenten auf SQL:2003-konforme Datentypen (Ph.D. Thesis), Universität Kassel, Germany, 2012, http://nbn-resolving.de/urn:nbn:de:hebis:34-2012081041594.
[9] W. May, Web page of the mondial database, maintained at the Institute for Informatics of the Georg-August-Universität Göttingen, 2010, http://www.dbis.informatik.uni-goettingen.de/Mondial/.
[10] G. Lausen, Datenbanken—Grundlagen und XML-Technologien, Spektrum Akademischer Verlag, 2005.
[11] H.-J. Schek, M.H. Scholl, The relational model with relation-valued attributes, Inf. Syst. 11 (2) (1986) 137–147.
[12] D.D. Chamberlin, D. Draper, M.F. Fernández, M. Kay, J. Robie, M. Rys, J. Siméon, J. Tivy, P. Wadler, XQuery from the Experts: A Guide to the W3C XML Query Language, Addison Wesley, 2003.
[13] D. Barbosa, J. Freire, A.O. Mendelzon, Designing information-preserving mapping schemes for XML, in: Böhm et al. [39], pp. 109–120.
[14] S. Chaudhuri, Z. Chen, K. Shim, Y. Wu, Storing XML (with XSD) in SQL databases: Interplay of logical and physical designs, IEEE Trans. Knowl. Data Eng. 17 (12) (2005) 1595–1609.

[15] P. Genevès, N. Layaïda, V. Quint, Impact of XML schema evolution, ACM Trans. Internet Technol. 11 (1) (2011) 4.
[16] H. Lu, J.X. Yu, G. Wang, S. Zheng, H. Jiang, G. Yu, A. Zhou, What makes the differences: benchmarking XML database implementations, ACM Trans. Internet Technol. 5 (1) (2005) 154–194.
[17] L. Kulic, Adaptability in XML-to-relational mapping strategies, in: S.Y. Shin, S. Ossowski, M. Schumacher, M.J. Palakal, C.-C. Hung (Eds.), SAC, ACM, 2010, pp. 1674–1679.
[18] A. Algergawy, M. Mesiti, R. Nayak, G. Saake, XML data clustering: An overview, ACM Comput. Surv. 43 (4) (2011) 25.
[19] C. Türker, G. Saake, Objektrelationale Datenbanken, Dpunkt-Verlag, 2006.
[20] W. Fan, J.X. Yu, J. Li, B. Ding, L. Qin, Query translation from XPath to SQL in the presence of recursive DTDs, VLDB J. 18 (4) (2009) 857–883, http://dx.doi.org/10.1007/s00778-008-0131-0.
[21] P. Dadam, K. Küspert, F. Andersen, H.M. Blanken, R. Erbe, J. Günauer, V.Y. Lum, P. Pistor, G. Walch, A DBMS prototype to support extended NF2 relations: An integrated view on flat tables and hierarchies, in: C. Zaniolo (Ed.), SIGMOD Conference, ACM Press, 1986, pp. 356–367.
[22] H.L. Lau, W. Ng, Querying XML data by the nested relational sequence database system, in: IDEAS, IEEE Computer Society, 2003, pp. 236–241.
[23] M. Yoshikawa, T. Amagasa, T. Shimura, S. Uemura, XRel: a path-based approach to storage and retrieval of XML documents using relational databases, ACM Trans. Internet Technol. 1 (1) (2001) 110–141.
[24] H. Jiang, H. Lu, W. Wang, J.X. Yu, Path materialization revisited: An efficient storage model for XML data, in: X. Zhou (Ed.), Australasian Database Conference, in: CRPIT, vol. 5, Australian Computer Society, 2002, pp. 85–94.
[25] I. Tatarinov, S. Viglas, K.S. Beyer, J. Shanmugasundaram, E.J. Shekita, C. Zhang, Storing and querying ordered XML using a relational database system, in: M.J. Franklin, B. Moon, A. Ailamaki (Eds.), SIGMOD Conference, ACM, 2002, pp. 204–215.
[26] M.F. O’Connor, M. Roantree, Desirable properties for XML update mechanisms, in: F. Daniel, L.M.L. Delcambre, F. Fotouhi, I. Garrigós, G. Guerrini, J.-N. Mazón, M. Mesiti, S. Müller-Feuerstein, J. Trujillo, T.M. Truta, B. Volz, E. Waller, L. Xiong, E. Zimányi (Eds.), EDBT/ICDT Workshops, ACM International Conference Proceeding Series, ACM, 2010, pp. 23:1–23:9.
[27] L. Xu, T.W. Ling, H. Wu, Z. Bao, DDE: from dewey to a fully dynamic XML labeling scheme, in: U. Çetintemel, S.B. Zdonik, D. Kossmann, N. Tatbul (Eds.), SIGMOD Conference, ACM, 2009, pp. 719–730.
[28] X. Wu, M.-L. Lee, W. Hsu, A prime number labeling scheme for dynamic ordered XML trees, in: Z.M. Özsoyoglu, S.B. Zdonik (Eds.), ICDE, IEEE Computer Society, 2004, pp. 66–78.
[29] C. Li, T.W. Ling, QED: a novel quaternary encoding to completely avoid relabeling in XML updates, in: O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, W. Teiken (Eds.), CIKM, ACM, 2005, pp. 501–508.
[30] M. Nicola, B.V. der Linden, Native XML support in DB2 universal database, in: Böhm et al. [39], pp. 1164–1174.
[31] R.P. Bourret, XML database products, http://www.rpbourret.com/xml/XMLDatabaseProds.htm (June 2010).
[32] Y.K. Lee, S.-J. Yoo, K. Yoon, P.B. Berra, Index structures for structured documents, in: Digital Libraries, ACM, 1996, pp. 91–99.
[33] D. Shin, H. Jang, H. Jin, BUS: An effective indexing and retrieval scheme in structured documents, in: ACM DL, ACM, 1998, pp. 235–243.
[34] A. Eisenberg, J. Melton, Advancements in SQL/XML, SIGMOD Rec. 33 (3) (2004) 79–86.
[35] J. Melton, S. Buxton, Querying XML—XQuery, XPath, and SQL/XML in Context, Morgan Kaufmann, 2006.
[36] B. Choi, What are real DTDs like? in: WebDB, 2002, pp. 43–48.
[37] A. Balmin, Y. Papakonstantinou, V. Vianu, Incremental validation of XML documents, ACM Trans. Database Syst. 29 (4) (2004) 710–751.
[38] J. Boyer, Canonical XML version 1.0, W3C recommendation, W3C, http://www.w3.org/TR/2001/REC-xml-c14n-20010315 (March 2001).
[39] K. Böhm, C.S. Jensen, L.M. Haas, M.L. Kersten, P.-Å. Larson, B.C. Ooi (Eds.), Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30–September 2, 2005, ACM, 2005.

Kai Schweinsberg studied Mathematics and Computer Science at the University of Kassel. He received his Ph.D. in computer science with a thesis on XML mappings for SQL:2003-compliant databases. He is currently a research assistant at the Department of Electrical Engineering and Computer Science of the University of Kassel. His interests are in database systems, web technologies, and software design.

Lutz M. Wegner is professor emeritus at the University of Kassel, Germany. He received his Ph.D. from the University of Karlsruhe, Germany in 1977. He was appointed as full professor and head of the database group in 1987. His interests are in database systems, interface design, and algorithms & data structures.