Mapping DTDs to relational schemas with semantic constraints


Information and Software Technology 48 (2006) 245–252 www.elsevier.com/locate/infsof

Teng Lv a,b, Ping Yan a,*

a College of Mathematics and System Science, Xinjiang University, Urumqi 830046, People's Republic of China
b Teaching and Research Section of Computer Science, Artillery Academy, Hefei 230031, People's Republic of China

Received 5 July 2004; revised 2 May 2005; accepted 2 May 2005 Available online 14 June 2005

Abstract

XML is becoming a prevalent format and standard for data exchange in many applications. As the volume of XML data grows, efficient methods to store and manage it are urgently needed. Because relational databases are the primary choice for this purpose, given their data management power, it is necessary to study the problem of mapping XML schemas to relational schemas. The semantics of XML schemas are crucial for designing, querying, and storing XML documents, and functional dependencies are a very important representation of the semantic information of XML schemas. As DTDs are among the most frequently used schemas for XML documents, we use DTDs as the schemas of XML documents here. This paper proposes the concept and the formal definition of XML functional dependencies over DTDs. It then gives a method to map XML DTDs to relational schemas with constraints such as functional dependencies, domain constraints, choice constraints, reference constraints, and cardinality constraints over DTDs. The method preserves the structures of DTDs as well as the semantics implied by these constraints. The concepts and the mapping method presented in the paper can be extended to XML Schema with some modifications to the related formal definitions.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: XML; Functional dependency; Schema mapping; Semantic constraints

* Corresponding author. Tel.: +44 1604 779109; fax: +44 1604 716247. E-mail addresses: [email protected] (T. Lv), [email protected] (P. Yan).
0950-5849/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.infsof.2005.05.001

1. Introduction

XML (eXtensible Markup Language) [1] has become one of the primary standards for data exchange on the World Wide Web and is widely used in many fields. As the volume of XML data on the Web grows, efficient methods to manage and store it are urgently needed. There are two main approaches to storing and managing XML data today. The first uses native XML databases to store and manage XML data directly, such as Refs. [15,16]. The second, proposed in Refs. [2,3,18], stores XML data in relational databases and so takes advantage of the mature technology and products of relational databases. The second approach requires efficient methods to map XML schemas to relational schemas. As DTDs (Document Type Definitions) [4] are among the most frequently used schemas for XML documents [5], we use DTDs as the schemas of XML documents.

The semantics of XML DTDs is a very important aspect of mapping XML DTDs to relational schemas in most situations. Functional dependencies, domain constraints, choice constraints, reference constraints, and cardinality constraints [6] over DTDs are very important representations of the semantic information of DTDs. A method that maps XML DTDs to relational schemas should therefore preserve the structures of DTDs as well as the semantics implied by these constraints.

1.1. Related work

For the concept of XML functional dependencies, Ref. [11] proposes a definition of XML functional dependencies. Unfortunately, it does not differentiate between global functional dependencies and local functional dependencies for XML. Ref. [17] also gives a definition of XML functional dependencies, but it only considers the string values of attributes and elements of XML documents. The definition of XML functional dependencies proposed in this paper overcomes the shortcomings of the above definitions in the following aspects: (1) It captures the characteristics of XML structure and differentiates between global functional dependencies and local functional dependencies for XML. (2) It considers not only the string values of attributes and elements but also the elements themselves of XML documents in XML functional dependencies.

Ref. [19] gives a partially supervised method to extract relations from XML documents without consistent schemas such as DTDs. Our approach maps DTDs to relational schemas with constraints, which is orthogonal to the method given in Ref. [19]. Refs. [20–22] propose an XML-to-relational mapping framework through annotations in an XML schema, considering different mapping methods without functional dependencies. Our approach considers a specific mapping method, which is also orthogonal to the work given in Refs. [20–22].

For the problem of mapping XML DTDs to relational schemas, several mapping methods [2,7,8,18] have been proposed, but none of them considers the semantics of DTDs in the mapping process. As stated above, the semantics of XML DTDs is a very important aspect when mapping XML DTDs to relational schemas, so the resulting relational schemas lose some important information of the original DTDs.

For the problem of mapping XML DTDs to relational schemas while considering the semantics of DTDs, Refs. [6,9] give two methods that preserve the semantics implied by XML keys, cardinality constraints, etc. These methods do not consider the semantics implied by functional dependencies over DTDs. Although Ref. [10] proposes another mapping method that considers functional dependencies over DTDs, it requires XML functional dependencies to be flat and well-structured, which restricts the flexibility and applicability of the method.

The mapping method from XML DTDs to relational schemas presented here has the following significant characteristics: (1) It can preserve the structures of DTDs as well as the semantics implied by functional dependencies and other constraints such as domain constraints, choice constraints, reference constraints, and cardinality constraints over DTDs. (2) It does not restrict the forms of functional dependencies over DTDs.

In this paper, we first propose the concept and the formal definition of functional dependencies over XML DTDs. Then, based on the semantics of functional dependencies over DTDs, we propose a method to map XML DTDs to relational schemas in the presence of functional dependencies, domain constraints, choice constraints, reference constraints, and cardinality constraints over DTDs. The method preserves the structures as well as the semantic constraints mentioned above.

1.2. Organization

The rest of the paper is organized as follows. Section 2 gives some notations and the definition of functional dependencies over DTDs as preliminary work. Section 3 presents our method to map XML DTDs to relational schemas in the presence of functional dependencies and other constraints over DTDs. Section 4 gives a case study as an application of the method presented in Section 3. Finally, Section 5 concludes the paper and points out directions for future work.

2. Preliminaries

We give the definitions of DTD, path, and XML tree [11] as preliminary work.

Definition 1. A DTD (Document Type Definition) is defined to be $D = (E, A, P, R, r)$, where (1) $E$ is a finite set of element types; (2) $A$ is a finite set of attributes; (3) $P$ is a mapping from $E$ to element type definitions. For each $\tau \in E$, $P(\tau)$ is a regular expression $\alpha$ defined as $\alpha ::= S \mid \varepsilon \mid \tau_1 \mid \alpha | \alpha \mid \alpha , \alpha \mid \alpha^{*}$, where $S$ denotes string types, $\varepsilon$ is the empty sequence, $\tau_1 \in E$, and '|', ',' and '*' denote union (or choice), concatenation, and Kleene closure, respectively; (4) $R$ is a mapping from $E$ to the power set $\mathcal{P}(A)$; (5) $r \in E$ is called the element type of the root.

A path $p$ in $D = (E, A, P, R, r)$ is defined to be $p = u_1.\cdots.u_n$, where (1) $u_1 = r$; (2) $u_i \in P(u_{i-1})$, $i \in [2, n-1]$; (3) $u_n \in P(u_{n-1})$ if $u_n \in E$ and $P(u_n) \neq \emptyset$, or $u_n = S$ if $u_n \in E$ and $P(u_n) = \emptyset$, or $u_n \in R(u_{n-1})$ if $u_n \in A$. Let $paths(D) = \{p \mid p \text{ is a path in } D\}$.

Example 1. Consider the following DTD $D_1$, which describes the information of course, student, and teacher:

<!ELEMENT courses (course*)>
<!ELEMENT course (title,takenby)>
<!ATTLIST course cno CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT takenby (student*)>
<!ELEMENT student (sname,teacher)>
<!ATTLIST student sno CDATA #REQUIRED>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT teacher (tname)>
<!ATTLIST teacher tno CDATA #REQUIRED>
<!ELEMENT tname (#PCDATA)>

According to Definition 1, $D_1$ is defined as $D_1 = (E_1, A_1, P_1, R_1, r_1)$, where

$E_1$ = {courses, course, title, takenby, student, sname, teacher, tname}
$A_1$ = {cno, sno, tno}
$P_1$(courses) = course*
$P_1$(course) = title, takenby
$P_1$(title) = $P_1$(sname) = $P_1$(tname) = $S$
$P_1$(takenby) = student*
$P_1$(student) = sname, teacher
$P_1$(teacher) = tname
$R_1$(course) = {cno}
$R_1$(student) = {sno}
$R_1$(teacher) = {tno}
$R_1$(courses) = $R_1$(title) = $R_1$(takenby) = $R_1$(sname) = $R_1$(tname) = $\emptyset$
$r_1$ = courses

The paths courses, courses.course, courses.course.@cno, courses.course.title, courses.course.title.S, and courses.course.takenby are some of the paths in $paths(D_1)$.
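To make Definition 1 and Example 1 concrete, the following is a minimal Python sketch that encodes $D_1$ and enumerates $paths(D_1)$. The dictionary encoding and the names `P1`, `R1`, `r1`, and `paths` are assumptions made for illustration and are not part of the paper. The sketch drops the operators of the content models and assumes a non-recursive DTD.

```python
# Hypothetical encoding of D1 = (E1, A1, P1, R1, r1) from Example 1.
# P1 maps each element type to the element names of its content model
# (operators such as '*' are dropped), or to "S" for string content.
P1 = {
    "courses": ["course"],
    "course": ["title", "takenby"],
    "title": "S",
    "takenby": ["student"],
    "student": ["sname", "teacher"],
    "sname": "S",
    "teacher": ["tname"],
    "tname": "S",
}
R1 = {"course": ["cno"], "student": ["sno"], "teacher": ["tno"]}
r1 = "courses"


def paths(P, R, root):
    """Enumerate all paths of a non-recursive DTD, depth first."""
    found = []

    def walk(prefix, element):
        found.append(prefix)
        for attribute in R.get(element, []):
            found.append(prefix + ".@" + attribute)
        content = P[element]
        if content == "S":
            found.append(prefix + ".S")
        else:
            for child in content:
                walk(prefix + "." + child, child)

    walk(root, root)
    return found


paths_D1 = paths(P1, R1, r1)
```

Under this encoding, `paths_D1` contains the six paths listed in Example 1 together with the remaining paths below `takenby`.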

Definition 2. Let $D = (E, A, P, R, r)$. An XML tree $T$ conforming to $D$ (denoted by $T \models D$) is defined to be $T = (V, lab, ele, att, val, root)$, where (1) $V$ is a finite set of vertexes; (2) $lab$ is a mapping from $V$ to $E \cup A$; (3) $ele$ is a partial function from $V$ to $V^{*}$ such that for any $v \in V$, $ele(v) = [v_1, \ldots, v_n]$ if $lab(v_1), \ldots, lab(v_n)$ is defined in $P(lab(v))$; (4) $att$ is a partial function from $V$ to $A$ such that for any $v \in V$, $att(v) = R(lab(v))$ if $lab(v) \in E$ and $R(lab(v))$ is defined in $D$; (5) $val$ is a partial function from $V$ to $S$ such that for any $v \in V$, $val(v)$ is defined if $P(lab(v)) = S$ or $lab(v) \in A$; (6) $root$, with $lab(root) = r$, is called the root of $T$.

Given a DTD $D$ and an XML tree $T \models D$, a path $p$ in $T$ is defined to be $p = v_1.\cdots.v_n$, where (1) $v_1 = root$; (2) $v_i \in ele(v_{i-1})$, $i \in [2, n-1]$; (3) $v_n \in ele(v_{n-1})$ if $lab(v_n) \in E$, or $v_n \in att(v_{n-1})$ if $lab(v_n) \in A$, or $v_n = S$ if $P(lab(v_{n-1})) = S$. Let $paths(T) = \{p \mid p \text{ is a path in } T\}$.

If $v$ is a node in an XML tree $T \models D$ and $p$ is a path in $D$, then the last node set of path $p$ passing node $v$ is $v[[p]]$. Specifically, $root[[p]]$ is simplified as $[[p]]$. A path $p$ is denoted as $p(v)$ if its last node is node $v$. For paths $p_1, p_2, \ldots, p_n$ in an XML tree $T$, the maximal common prefix is denoted as $p_1 \cap p_2 \cap \cdots \cap p_n$, which is also a path in $T$.

Example 2. Fig. 1 is an XML tree $T_1$ conforming to $D_1$ in Example 1 (note: each node is marked by its $lab$ mapping value for clarity). According to Definition 2, $T_1$ is defined as $T_1 = (V_1, lab_1, ele_1, att_1, val_1, root_1)$, where $V_1$ is the finite set of nodes of Fig. 1, $lab_1(root) = r$ = courses, for the leftmost course node $ele_1$(course) = [title, takenby] and $att_1$(course) = @cno, and for the leftmost title node $val_1$(title) = 'db'. The paths courses, courses.course, courses.course.@cno, courses.course.title, and courses.course.title.S are some paths in $T_1$, and courses $\cap$ courses.course $\cap$ courses.course.@cno $\cap$ courses.course.title $\cap$ courses.course.title.S = courses. courses[[courses.course]] is the set of the three course nodes in Fig. 1. For the first course node, $p$(course) = courses.course, which is the leftmost path in $T_1$.

We give the definition of value equality of two nodes. Intuitively, two nodes are value equal iff the two sub-trees rooted on the two nodes are identical.

Definition 3. Two nodes $x$ and $y$ are value equal, denoted as $x =_v y$, iff (1) $lab(x) = lab(y)$; (2) $val(x) = val(y)$ if $x, y \in A$ or $x = y = S$; (3) if $x, y \in E$, then (a) for any attribute $a \in att(x)$, there exists $b \in att(y)$ such that $a =_v b$, and vice versa; (b) if $ele(x) = v_1, \ldots, v_k$, then $ele(y) = w_1, \ldots, w_k$, and for any $i \in [1, k]$, $v_i =_v w_i$, and vice versa.

Example 3. In Fig. 1, the leftmost two teacher nodes are value equal according to Definition 3.

[Fig. 1. An XML tree T1 conforming to D1: a courses root with three course nodes (@cno "c10", "c20", "c30"; titles "db", "at", "os") taken by one, two, and three student nodes, respectively; each student node has @sno, sname, and a teacher node with @tno and tname.]

A functional dependency over DTDs is defined as follows:

Definition 4. Given a DTD $D$, a functional dependency (FD) $f$ over $D$ has the form $(S_h, [S_{x_1}, \ldots, S_{x_n}] \rightarrow [S_{y_1}, \ldots, S_{y_m}])$, where

(1) $S_h \in paths(D)$ is called the header path of $f$, which defines the scope of $f$ over $D$. The last symbol of path $S_h$ is an element name, i.e. $last(S_h) \in E$. If $S_h \neq \emptyset$ and $S_h \neq r$, then $f$ is called a local FD, which means that the scope of $f$ is the sub-tree rooted on $last(S_h)$; otherwise, $f$ is called a global FD, which means that the scope of $f$ is the whole of $D$.
(2) $[S_{x_1}, \ldots, S_{x_n}]$ is called the left path of $f$. For $i = 1, \ldots, n$, it is the case that $S_{x_i} \in paths(D)$, $S_{x_i} \supseteq_{Path} S_h$ ($S_h$ is a prefix of $S_{x_i}$, but not necessarily a proper prefix), $S_{x_i} \neq \emptyset$, and $last(S_{x_i}) \in E \cup A \cup S$.
(3) $[S_{y_1}, \ldots, S_{y_m}]$ is called the right path of $f$. For $j = 1, \ldots, m$, it is the case that $S_{y_j} \in paths(D)$, $S_{y_j} \supseteq_{Path} S_h$, $S_{y_j} \neq \emptyset$, and $last(S_{y_j}) \in E \cup A \cup S$.

For an XML tree $T \models D$, we say that $T$ satisfies FD $f$ (denoted as $T \models f$) iff for any nodes $H \in [[S_h]]$ (let $H = root$ if $S_h = \emptyset$) and $X_1, X_2 \in H[[S_{x_1} \cap \cdots \cap S_{x_n}]]$ in $T$, if $X_1[[S_{x_1}]] =_v X_2[[S_{x_1}]], \ldots, X_1[[S_{x_n}]] =_v X_2[[S_{x_n}]]$, then for any nodes $Y_1, Y_2 \in H[[S_{y_1} \cap \cdots \cap S_{y_m}]]$ with $H(p(X_1) \cap p(Y_1)), H(p(X_2) \cap p(Y_2)) \in H[[S_{x_1} \cap \cdots \cap S_{x_n} \cap S_{y_1} \cap \cdots \cap S_{y_m}]]$, it is the case that $Y_1[[S_{y_1}]] =_v Y_2[[S_{y_1}]], \ldots, Y_1[[S_{y_m}]] =_v Y_2[[S_{y_m}]]$.

For each FD $f$: $(S_h, [S_{x_1}, \ldots, S_{x_n}] \rightarrow [S_{y_1}, \ldots, S_{y_m}])$ over $D$, the right path of $f$ can always be divided into a set of single paths, so $f$ can be represented as the following $m$ FDs $f_j$: $(S_h, [S_{x_1}, \ldots, S_{x_n}] \rightarrow [S_{y_j}])$, where $j = 1, \ldots, m$.

Example 4. In Fig. 1, we have the following FD: (courses.course, [courses.course.takenby.student.@sno] $\rightarrow$ [courses.course.takenby.student]), which implies that within the sub-tree rooted on a course node, a student's number (@sno) uniquely determines a student node.
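Definitions 3 and 4 can be illustrated with a small Python sketch that checks the local FD of Example 4 on a toy tree. The node encoding (dictionaries with `lab`, `att`, `ele`, `val`) and all function names are assumptions made for illustration. The sample trees are loosely modelled on Fig. 1 and do not reproduce it, and the check is hard-wired to this one FD rather than being a general FD checker.

```python
def node(lab, att=None, ele=None, val=None):
    """A tree node: label, attribute values, ordered child elements, string value."""
    return {"lab": lab, "att": att or {}, "ele": ele or [], "val": val}


def value_equal(x, y):
    """Definition 3 on this encoding: the sub-trees rooted at x and y are identical."""
    return (
        x["lab"] == y["lab"]
        and x["val"] == y["val"]
        and x["att"] == y["att"]
        and len(x["ele"]) == len(y["ele"])
        and all(value_equal(a, b) for a, b in zip(x["ele"], y["ele"]))
    )


def student(sno, sname, tno, tname):
    teacher = node("teacher", {"tno": tno}, [node("tname", val=tname)])
    return node("student", {"sno": sno}, [node("sname", val=sname), teacher])


def course(cno, title, students):
    takenby = node("takenby", ele=students)
    return node("course", {"cno": cno}, [node("title", val=title), takenby])


def satisfies_local_fd(courses_root):
    """Check the FD of Example 4: inside one course, @sno determines the student."""
    for c in courses_root["ele"]:          # each node in [[courses.course]]
        students = c["ele"][1]["ele"]      # course -> takenby -> student nodes
        for s1 in students:
            for s2 in students:
                same_sno = s1["att"]["sno"] == s2["att"]["sno"]
                if same_sno and not value_equal(s1, s2):
                    return False
    return True


good = node("courses", ele=[
    course("c10", "db", [student("s10", "Joe", "t10", "John")]),
    course("c20", "at", [student("s10", "Joe", "t20", "Mary"),
                         student("s20", "Smith", "t10", "John")]),
])
bad = node("courses", ele=[
    course("c10", "db", [student("s10", "Joe", "t10", "John"),
                         student("s10", "Joe", "t20", "Mary")]),
])
```

Because the FD is local to each course node, the tree `good` satisfies it even though student s10 has different teachers in courses c10 and c20, whereas `bad` violates it inside a single course.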

A DTD graph is defined as follows:

Definition 5. Given a DTD $D = (E, A, P, R, r)$, a DTD graph is a digraph $G(V, R_V)$, where $V = \{v \mid v \in E \cup A\}$ and $R_V = \{\langle x, y \rangle \mid (P(x, y) \text{ or } R(x, y)) \text{ and } x, y \in V\}$. Intuitively, $V$ is the set of attributes and elements (together called vertexes) in $D$, $R_V$ is the cardinality relationship between these vertexes, and $\langle x, y \rangle$ denotes the arc from vertex $x$ to vertex $y$, which are called the tail and the head of the arc, respectively. The indegree of vertex $v$ is the number of arcs whose heads are $v$ and is denoted by in-degree($v$). Similarly, the outdegree of vertex $v$ is the number of arcs whose tails are $v$ and is denoted by out-degree($v$). Specifically, an arc marked with ',' (i.e. a ,_arc) or '?' (i.e. a ?_arc) is regarded as just one arc, and an arc marked with '*' (i.e. a *_arc) or '+' (i.e. a +_arc) is regarded as infinitely many arcs.

Example 5. For DTD $D_1$ given in Example 1, the DTD graph for $D_1$ is shown in Fig. 2.

[Fig. 2. The DTD graph for D1: courses has a *_arc to course; course has ,_arcs to @cno, title, and takenby; takenby has a *_arc to student; student has ,_arcs to @sno, sname, and teacher; teacher has ,_arcs to @tno and tname.]

3. Mapping XML DTDs to relational schemas

3.1. Rules for simplifying DTDs

We first give some rules to simplify a DTD.

Rule 1. Remove the choice operator '|' between elements [6]: $(e_1 \mid \cdots \mid e_n) \Rightarrow (e_1?, \ldots, e_n?)$ with a choice constraint CHECK((e1 is not null, e2 is null, …, and en is null) or … or (e1 is null, e2 is null, …, and en is not null)).

This rule replaces the choice operator in a group of elements by the operator '?' with a constraint stating that one and only one element in the group is not null. For example, (a|b|c) can be simplified as (a?,b?,c?) with a constraint stating that there is one and only one non-empty element among the three elements a, b, and c.

Rule 2. Ungroup operators '*', '+' and '?':

(a) $(e_1, \ldots, e_n)^{*} \Rightarrow (e_1^{*}, \ldots, e_n^{*})$ with a cardinality constraint CHECK(CARDINALITY(e1) = … = CARDINALITY(en));
(b) $(e_1, \ldots, e_n)^{+} \Rightarrow (e_1^{+}, \ldots, e_n^{+})$ with a cardinality constraint CHECK(CARDINALITY(e1) = … = CARDINALITY(en) ≥ 1);
(c) $(e_1, \ldots, e_n)? \Rightarrow (e_1?, \ldots, e_n?)$ with a cardinality constraint CHECK(CARDINALITY(e1) = … = CARDINALITY(en) = 0 or CARDINALITY(e1) = … = CARDINALITY(en) = 1).

This rule converts a definition of a group of elements to a definition in which the operator ',' does not occur inside the operators '*', '+', and '?', i.e. a nested definition is converted to a flat definition. This rule extends the rules given in Refs. [7,8] by adding a cardinality constraint which preserves the semantics of the original definition. For example, (a,b,c)* is converted to (a*,b*,c*) with a cardinality constraint CHECK(CARDINALITY(a) = CARDINALITY(b) = CARDINALITY(c)).

Rule 3. Remove two redundant operators:

(a) $e^{**} = e^{*+} = e^{*?} = e^{+*} = e^{+?} = e^{?*} = e^{?+} \Rightarrow e^{*}$;
(b) $e^{++} \Rightarrow e^{+}$;
(c) $e^{??} \Rightarrow e^{?}$.

Rule 3 converts two consecutive redundant operators to a single operator. It improves the rule given in Ref. [7] in the sense that it can deal with not only the operators '*' and '?' but also the operator '+'.

Rule 4. Remove redundant elements: Suppose $e_i \in \{e, e?, e^{+}, e^{*}\}$ where $i \in [1, n]$.

(a) $e_1, \ldots, e_n \Rightarrow e^{*}$ if $\exists e_i = e^{*}$;
(b) $e_1, \ldots, e_n \Rightarrow e^{+}$ if $\exists e_i = e^{+}$;
(c) $e_1, \ldots, e_n \Rightarrow e?$ if $\exists e_i = e?$;
(d) Otherwise, $e_1, \ldots, e_n \Rightarrow e^{*}$.

This rule removes redundant elements in some specific conditions. For example, the definition a,a,a* can be simplified as a* without loss of the original semantics.

Rules 1–4 have some significant improvements over the rules proposed in Refs. [7,8] in the following aspects: (1) The operator '|' is reasonably treated with an additional choice constraint. (2) Some semantic constraints are added to the resulting DTD. (3) Not only the operators ',' and '*' but also '+' and '?' are preserved in the resulting DTD, which maintains the structural and semantic information of the original DTD. After applying Rules 1–4 to a DTD in order and recursively, there are only distinct element names with the operators ',', '?', '+' and '*' in the DTD.

3.2. Constructing DTD graphs with constraints

For a DTD simplified by the rules of the previous sub-section, a DTD graph can be constructed as in Definition 5, with reference constraints and domain constraints as specified by the following treatments:

1. If an attribute rid is IDREF or IDREFS, a dashed arc is created to point to the referenced vertex e according to the semantic connection. The dashed arc does not contribute to the indegree or outdegree of any vertex. Attribute rid is a foreign key referencing e.
2. If a vertex e has a domain, the values of the vertex are listed below the vertex. A domain constraint is added as CHECK(e VALUE IN (domain_value1, …, domain_valuen)).

3.3. Mapping DTDs to relational schemas

In this section, we give a method to map XML DTDs to relational schemas:

Step 1. Simplify the DTD by Rules 1–4.

Step 2. Construct the DTD graph with constraints as in Section 3.2.

Step 3. For each vertex e such that in-degree(e) ≠ 1 or out-degree(e) ≠ 0, a relation e is generated as follows:
(1) If out-degree(e) = 0, then the relation is e(ID, e), where ID is the primary key of relation e and attribute e stores the corresponding value of vertex e. The corresponding constraint is (ID is not null).
(2) Otherwise, the attributes of e consist of:
(a) An attribute ID as the primary key of e, with the corresponding constraint (ID is not null);
(b) Each leaf vertex e1 of vertex e such that in-degree(e1) = 1 in the DTD graph. The corresponding constraint is (e1 is not null) if the arc $\langle e, e_1 \rangle$ is a ,_arc;
(c) An attribute e2.ID for each arc $\langle e, e_2 \rangle$ that is a ,_arc or a ?_arc in the DTD graph. e2.ID is a foreign key of relation e referencing relation e2. Specifically, the constraint is (e2.ID is not null) for the former case.

Step 4. For each *_arc $\langle e_3, e_4 \rangle$ in the DTD graph, there is a relation e3_e4(e3.ID, e4.ID) which stores the parent–child relationship between vertexes e3 and e4, and the primary key of relation e3_e4 is (e3.ID, e4.ID). Moreover, the constraint is: e3.ID is not null, e3.ID is a foreign key referencing e3(ID), and e4.ID is a foreign key referencing e4(ID) or e4.ID is null.

Step 5. For each +_arc $\langle e_5, e_6 \rangle$ in the DTD graph, there is a relation e5_e6(e5.ID, e6.ID) which stores the parent–child relationship between vertexes e5 and e6, and the primary key of relation e5_e6 is (e5.ID, e6.ID). Moreover, the constraint is: e5.ID is not null, e6.ID is not null, e5.ID is a foreign key referencing e5(ID), and e6.ID is a foreign key referencing e6(ID).

Step 6. For each FD $f$: $(S_h, [S_{x_1}, \ldots, S_{x_n}] \rightarrow [S_y])$ over the DTD, create a relation $f(S_h.ID, \{S_{x_1}\}, \ldots, \{S_{x_n}\}, \{S_y\})$, where $\{S_{x_i}\} = S_{x_i}.ID$ if $last(S_{x_i}) \in E$, $\{S_y\} = S_y.ID$ if $last(S_y) \in E$, $\{S_{x_i}\} = S_{x_i}$ if $last(S_{x_i}) \in A$, $\{S_y\} = S_y$ if $last(S_y) \in A$, $\{S_{x_i}\} = S_{x_i} \setminus last(S_{x_i})$ if $last(S_{x_i}) = S$, and $\{S_y\} = S_y \setminus last(S_y)$ if $last(S_y) = S$ (i.e. the path without its trailing $S$). The primary key of relation $f$ is $(S_h.ID, \{S_{x_1}\}, \ldots, \{S_{x_n}\})$. Specifically, the relation $f$ is $f(\{S_{x_1}\}, \ldots, \{S_{x_n}\}, \{S_y\})$ if $S_h = \emptyset$ or $S_h = r$.

Step 7. Remove redundant relations introduced in Step 6. If a relation introduced by an FD has already been expressed in the resulting relation set, then it is not necessary to create such a relation, because such data dependencies have already been represented in other relations. It is only necessary to include such FDs over the DTD in the set of FDs of the resulting relations.

Step 8. Add the constraints introduced in Steps 1 and 2 to the corresponding relations in the resulting relation set.
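Steps 3–5 of the method can be sketched in Python as follows. This is only an illustrative reading of those steps: the function `map_graph`, the triple encoding of arcs, and the list-of-columns output are assumptions, and FDs (Step 6), redundancy removal (Step 7), and constraints (Step 8) are not covered. Attributes are written without the '@' prefix for brevity. The input is the DTD graph of $D_1$ from Fig. 2.

```python
import math


def map_graph(arcs):
    """Sketch of Steps 3-5. arcs is a list of (parent, child, op) triples,
    where op is one of ',', '?', '*', '+' as in the DTD graph."""
    indeg, outdeg = {}, {}
    for parent, child, op in arcs:
        for v in (parent, child):
            indeg.setdefault(v, 0)
            outdeg.setdefault(v, 0)
        # Definition 5: a *_arc or +_arc counts as infinitely many arcs.
        weight = math.inf if op in ("*", "+") else 1
        indeg[child] += weight
        outdeg[parent] += weight

    def has_relation(v):
        return indeg[v] != 1 or outdeg[v] != 0

    relations = {}
    # Step 3: one relation per qualifying vertex, keyed by ID.
    for v in indeg:
        if has_relation(v):
            relations[v] = ["ID", v] if outdeg[v] == 0 else ["ID"]
    for parent, child, op in arcs:
        if op in ("*", "+"):
            # Steps 4 and 5: a separate parent-child relation.
            relations[parent + "_" + child] = [parent + ".ID", child + ".ID"]
        elif has_relation(child):
            relations[parent].append(child + ".ID")   # Step 3(2)(c)
        else:
            relations[parent].append(child)           # Step 3(2)(b): inlined leaf
    return relations


d1_arcs = [
    ("courses", "course", "*"),
    ("course", "cno", ","), ("course", "title", ","), ("course", "takenby", ","),
    ("takenby", "student", "*"),
    ("student", "sno", ","), ("student", "sname", ","), ("student", "teacher", ","),
    ("teacher", "tno", ","), ("teacher", "tname", ","),
]
d1_relations = map_graph(d1_arcs)
```

On this input the sketch yields the relations courses, course, takenby, student, and teacher, plus courses_course and takenby_student for the two *_arcs. Leaf vertexes with indegree 1, such as title and sname, are inlined as columns of their parent.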


Some explanations of the above method:

(1) Step 1 simplifies the input DTD, and Step 2 constructs the digraph for the DTD. Each arc in the DTD graph is marked with the operator ',', '?', '+' or '*', with some additional constraints.
(2) Step 3 creates a separate relation for each non-leaf vertex and for each leaf vertex e such that in-degree(e) > 1.
(3) Step 4 creates a separate relation for each parent–child relation between two vertexes connected by a *_arc.
(4) Step 5 creates a separate relation for each parent–child relation between two vertexes connected by a +_arc.
(5) Step 6 creates a separate relation for each FD over the DTD.
(6) Step 7 removes the redundant relations from the resulting relation set.
(7) Step 8 carries the constraints of the simplified DTD and the DTD graph over to the resulting relation set.
(8) From (2)–(4) and (7), we can see that the structure as well as the semantics implied by domain constraints, choice constraints, reference constraints, and cardinality constraints are preserved in the resulting relation set.
(9) From (5), we can see that the semantics implied by FDs over the input DTD are preserved in the resulting relation set, too.
(10) From (6), we can see that there are no redundant relations in the resulting relation set.

As the method does not combine the resulting relations in any other way, the resulting relations represent the structure and the semantics implied by FDs and other constraints over the input DTD more clearly and intuitively than the methods proposed in Refs. [6,8].

From the above discussion, it is easy to get the following proposition:

Proposition 1. Given a DTD and a set of FDs over it, the mapping method maps it to a set of relations that preserves the structure and the semantics implied by FDs and other constraints (such as domain constraints, choice constraints, reference constraints, and cardinality constraints) over the DTD, and there are no redundant relations in the resulting relation set.

4. A case study

In this section, we give an example to illustrate the application of the mapping method presented in Section 3. Consider a DTD $D_2$ describing the information of courses:

<!ELEMENT courses (course,room**)*>
<!ELEMENT course (name,type,takenby)>
<!ATTLIST course year CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT takenby (student,student+)>
<!ELEMENT student (name,major,teacher,email|phone)>
<!ELEMENT major (#PCDATA)>
<!ELEMENT teacher (name)>
<!ATTLIST teacher sex (male|female) #REQUIRED>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT room (#PCDATA)>

Suppose the FDs over $D_2$ include:

F1: [courses.course.name.S] $\rightarrow$ [courses.course.type.S],
F2: [courses.course.takenby.student.name.S] $\rightarrow$ [courses.course.takenby.student.major.S],
F3: [courses.course.takenby.student.teacher.name.S] $\rightarrow$ [courses.course.takenby.student.teacher.sex.S],
F4: (courses.course, [courses.course.takenby.student.name.S] $\rightarrow$ [courses.course.takenby.student.teacher.name.S]), and
F5: [courses.room] $\rightarrow$ [courses].

Step 1. Simplify $D_2$ by Rules 1–4.

By Rule 1, <!ELEMENT student (name,major,teacher,email|phone)> is simplified as <!ELEMENT student (name,major,teacher,email?,phone?)> with a choice constraint CHECK((email is not null and phone is null) or (email is null and phone is not null)).

By Rule 2(a), <!ELEMENT courses (course,room**)*> is simplified as <!ELEMENT courses (course*,room***)> with a cardinality constraint CHECK(CARDINALITY(course) = CARDINALITY(room)).

By Rule 3(a), the above definition <!ELEMENT courses (course*,room***)> is simplified as <!ELEMENT courses (course*,room*)>, with the same cardinality constraint CHECK(CARDINALITY(course) = CARDINALITY(room)).

By Rule 4(b), <!ELEMENT takenby (student,student+)> is simplified as <!ELEMENT takenby (student+)>.

The resulting simplified DTD is $D_3$:

<!ELEMENT courses (course*,room*)>
<!ELEMENT course (name,type,takenby)>
<!ATTLIST course year CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT takenby (student+)>
<!ELEMENT student (name,major,teacher,email?,phone?)>
<!ELEMENT major (#PCDATA)>
<!ELEMENT teacher (name)>
<!ATTLIST teacher sex (male|female) #REQUIRED>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT room (#PCDATA)>
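The applications of Rules 3 and 4 in Step 1 can be reproduced with a short Python sketch. The functions `collapse`, `merge`, and `simplify_sequence` are hypothetical names introduced here. The sketch only handles a flat, comma-separated content model without parentheses or '|', and it assumes that cases (a) to (d) of Rule 4 are tried in that order.

```python
def collapse(ops):
    """Rule 3: reduce a run of stacked operators, e.g. '***' or '+?', to one."""
    if len(ops) <= 1:
        return ops
    if set(ops) == {"+"}:
        return "+"          # Rule 3(b)
    if set(ops) == {"?"}:
        return "?"          # Rule 3(c)
    return "*"              # Rule 3(a): every other combination


def merge(occurrences):
    """Rule 4: merge repeated occurrences of one element name.
    Each item is the operator of one occurrence: '', '?', '+' or '*'."""
    if len(occurrences) == 1:
        return occurrences[0]
    for op in ("*", "+", "?"):      # cases (a), (b), (c), tried in order
        if op in occurrences:
            return op
    return "*"                      # case (d)


def simplify_sequence(model):
    """Apply Rule 3, then Rule 4, to a flat model such as 'student,student+'."""
    grouped = {}
    for item in model.split(","):
        item = item.strip()
        name = item.rstrip("*+?")
        grouped.setdefault(name, []).append(collapse(item[len(name):]))
    return ",".join(name + merge(ops) for name, ops in grouped.items())


courses_model = simplify_sequence("course*,room***")
takenby_model = simplify_sequence("student,student+")
```

With these inputs, `courses_model` and `takenby_model` correspond to the content models of courses and takenby in $D_3$.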

Step 2. The DTD graph for $D_3$ is shown in Fig. 3. We use a rectangle to represent that vertexes email and phone are constrained by a choice constraint.

[Fig. 3. The DTD graph for DTD D3: courses has *_arcs to course and room; course has ,_arcs to @year, name, type, and takenby; takenby has a +_arc to student; student has ,_arcs to name, major, and teacher and ?_arcs to email and phone; teacher has ,_arcs to @sex (male or female) and name.]

Step 3. For vertexes courses, course, room, name, takenby, student, and teacher, the following relations are created, respectively (the primary key of each relation is ID):

1. courses(ID). ID is not null.
2. course(ID, year, name.ID, type, takenby.ID). All attributes are not null; attributes name.ID and takenby.ID are foreign keys referencing name(ID) and takenby(ID), respectively.
3. room(ID, room). All attributes are not null.
4. name(ID, name). All attributes are not null.
5. takenby(ID). ID is not null.
6. student(ID, name.ID, major, teacher.ID, email, phone). Attributes ID, name.ID, major, and teacher.ID are not null; attributes name.ID and teacher.ID are foreign keys referencing name(ID) and teacher(ID), respectively.
7. teacher(ID, sex, name.ID). All attributes are not null; name.ID is a foreign key referencing name(ID).

Step 4. For the *_arcs $\langle$courses, course$\rangle$ and $\langle$courses, room$\rangle$, the following relations are created:

8. courses_course(courses.ID, course.ID), with primary key (courses.ID, course.ID). courses.ID is not null, courses.ID is a foreign key referencing courses(ID), and course.ID is a foreign key referencing course(ID) or course.ID is null.
9. courses_room(courses.ID, room.ID), with primary key (courses.ID, room.ID). courses.ID is not null, courses.ID is a foreign key referencing courses(ID), and room.ID is a foreign key referencing room(ID) or room.ID is null.

Step 5. For the +_arc $\langle$takenby, student$\rangle$, the following relation is created:

10. takenby_student(takenby.ID, student.ID), with primary key (takenby.ID, student.ID). takenby.ID is not null, student.ID is not null, takenby.ID is a foreign key referencing takenby(ID), and student.ID is a foreign key referencing student(ID).

Step 6. For FDs F1, F2, F3, F4 and F5, the following relations are created, respectively:

11. F1(courses.course.name, courses.course.type), with primary key courses.course.name.
12. F2(courses.course.takenby.student.name, courses.course.takenby.student.major), with primary key courses.course.takenby.student.name.
13. F3(courses.course.takenby.student.teacher.name, courses.course.takenby.student.teacher.sex), with primary key courses.course.takenby.student.teacher.name.
14. F4(courses.course.ID, courses.course.takenby.student.name, courses.course.takenby.student.teacher.name), with primary key (courses.course.ID, courses.course.takenby.student.name).
15. F5(room.ID, courses.ID), with primary key room.ID.

Step 7. Relation 15 can be expressed by relation 9 with some modification, as the following relation:

9′. courses_room(courses.ID, room.ID), with primary key room.ID. courses.ID is not null, courses.ID is a foreign key referencing courses(ID), and room.ID is a foreign key referencing room(ID) or room.ID is null.

Step 8. For relation 6, a choice constraint is introduced in Step 1, so relation 6 is changed to:

6′. student(ID, name.ID, major, teacher.ID, email, phone). Attributes ID, name.ID, major, and teacher.ID are not null; attributes name.ID and teacher.ID are foreign keys referencing name(ID) and teacher(ID), respectively. CHECK((email is not null and phone is null) or (email is null and phone is not null)).

For relation 7, a domain constraint is introduced in Step 2, so relation 7 is changed to:

7′. teacher(ID, sex, name.ID). All attributes are not null; name.ID is a foreign key referencing name(ID). The domain constraint is: CHECK VALUE(sex IN ('male', 'female')).

For relations 2 and 3, a cardinality constraint is introduced in Step 1, so there is a cardinality constraint such as CHECK(CARDINALITY(course) = CARDINALITY(room)).

The resulting relation set generated by the method consists of relations 1–5, 6′, 7′, 8, 9′, and 10–14.
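As a rough illustration of how relations 4, 6′ and 7′ and their constraints might look in a concrete relational system, the following Python sketch creates them in an in-memory SQLite database. The choice of SQLite, the underscore column names (`name_ID`, `teacher_ID`), the column types, the sample rows, and the helper `try_insert` are all assumptions made for illustration. The paper does not prescribe a particular DBMS or DDL, and the cardinality constraint is not modelled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE name (
    ID   INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE teacher (              -- relation 7' with its domain constraint
    ID      INTEGER PRIMARY KEY NOT NULL,
    sex     TEXT NOT NULL CHECK (sex IN ('male', 'female')),
    name_ID INTEGER NOT NULL REFERENCES name(ID)
);
CREATE TABLE student (              -- relation 6' with its choice constraint
    ID         INTEGER PRIMARY KEY NOT NULL,
    name_ID    INTEGER NOT NULL REFERENCES name(ID),
    major      TEXT NOT NULL,
    teacher_ID INTEGER NOT NULL REFERENCES teacher(ID),
    email      TEXT,
    phone      TEXT,
    CHECK ((email IS NOT NULL AND phone IS NULL)
        OR (email IS NULL AND phone IS NOT NULL))
);
""")


def try_insert(sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn.executemany("INSERT INTO name VALUES (?, ?)", [(1, "John"), (2, "Joe")])
ok_teacher = try_insert("INSERT INTO teacher VALUES (?, ?, ?)", (1, "male", 1))
bad_teacher = try_insert("INSERT INTO teacher VALUES (?, ?, ?)", (2, "unknown", 1))
ok_student = try_insert(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?)",
    (1, 2, "cs", 1, "joe@example.org", None),
)
bad_student = try_insert(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?, ?)",
    (2, 2, "cs", 1, "joe@example.org", "555-0100"),
)
```

In this sketch the teacher row with sex 'unknown' is rejected by the domain constraint, and the student row that supplies both email and phone is rejected by the choice constraint.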

5. Conclusions and future work

We give the definition of functional dependencies for XML DTDs in the paper. Then a method for mapping XML DTDs to relational schemas in the presence of functional dependencies and other constraints (such as domain constraints, choice constraints, reference constraints, and cardinality constraints) over DTDs is given. Our method preserves the structures of DTDs as well as the semantics implied by functional dependencies and the above constraints over DTDs, and it places no extra restrictions on the forms of functional dependencies over DTDs.

We choose DTD rather than other XML schemas such as XML Schema [23] as a starting point for research on mapping from XML to relations, considering the following facts: (1) There is much similarity between XML Schema and DTD in structure. The structures of both XML Schema and DTD can be represented as a tree model as described in the paper. (2) FDs over XML Schema can also be represented as relationships between paths, like those over DTDs defined in the paper. So the concepts and the method of mapping DTDs to relational schemas used in the paper can be generalized to XML Schema with some minor modifications to the related formal definitions.

Although the method presented here preserves the semantics implied by functional dependencies and some other constraints of DTDs in the resulting relational schemas, there are more semantics, such as multi-valued dependencies [12–14], in XML documents that should also be preserved. We plan to address this in future work. Another interesting direction is to investigate the mapping from relational schemas and their functional dependencies to XML DTDs, which is another aspect of the mapping problem between XML DTDs and relational schemas.

Acknowledgements

This work is supported by Science Research Foundation for Young Teachers of Xinjiang University under Grant No. QN040101.

References

[1] T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler (Eds.), Extensible Markup Language (XML) 1.0, Third Ed., W3C Recommendation, 4 February, 2000, http://www.w3.org/TR/REC-xml.
[2] D. Florescu, D. Kossmann, Storing and querying XML data using an RDBMS, Bulletin of the Technical Committee on Data Engineering (1999) 431–442.
[3] A. Schmidt, M.L. Keysten, M. Windhouwer, F. Wass, Efficient relational storage and retrieval of XML documents, Proceedings of the Third Workshop on the Web and Databases (WebDB’00), Dallas, TX, 2000, pp. 47–52.
[4] W3C XML Specification DTD, ArborText Inc., 10 September, 1998, http://www.w3.org/XML/1998/06/xmlspec-report-19980910.htm.
[5] B. Choi, What are real DTDs like, Proceedings of the Fifth Workshop on the Web and Databases (WebDB’02), ACM Press, Madison, WI, 2002, pp. 43–48.
[6] D. Lee, M. Mani, W.W. Chu, Schema conversion methods between XML and relational models, in: B. Omelayenko, M. Klein (Eds.), Knowledge Transformation for the Semantic Web, IOS Press, Amsterdam, 2003, pp. 1–17.
[7] J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, J. Naughton, Relational databases for querying XML documents: limitations and opportunities, Proceedings of the 25th VLDB Conference, Morgan Kaufmann Publisher, Edinburgh, Scotland, 1999, pp. 302–314.
[8] S. Lu, Y. Sun, M. Atay, F. Fotouhi, A new inlining algorithm for mapping XML DTDs to relational schemas, Proceedings of the First International Workshop on XML Schema and Data Management (XSDM2003), LNCS 2814, Springer, New York, 2003, pp. 366–377.
[9] Y. Chen, S.B. Davidson, Y. Zheng, Constraints preserving XML storage in relations, Technical Report MS-CIS-02-04, University of Pennsylvania, 2002.
[10] Q. Wang, J. Zhou, H. Wu, J. Xiao, A. Zhou, Mapping XML documents to relations in the presence of functional dependencies, Journal of Software 14 (7) (2003) 1275–1281.
[11] M. Arenas, L. Libkin, A normal form for XML documents, Proceedings of Symposium on Principles of Database Systems (PODS’02), ACM Press, Madison, WI, 2002, pp. 85–96.
[12] T. Lv, N. Gu, B. Shi, A normal form for XML DTD, Journal of Computer Research and Development 41 (4) (2004) 615–620.
[13] L.V. Saxton, X. Tang, Tree multivalued dependencies for XML datasets, Proceedings of 5th International Conference on Advances in Web-Age Information Management (WAIM2004), LNCS 3129, Springer, New York, 2004, pp. 357–367.
[14] M.W. Vincent, J. Liu, Multivalued dependencies in XML, 20th British National Conference on Databases (BNCOD20), LNCS 2712, Springer, New York, 2003, pp. 4–18.
[15] Tamino XML Server, Software AG, http://www1.softwareag.com/corporate/products/tamino/default.asp
[16] Sonic XML Server, Sonic Software Corporation, http://www.sonicsoftware.com/products/sonic_xml_server/index.ssp
[17] M.L. Lee, T.W. Ling, W.L. Low, Designing functional dependencies for XML, Proceedings of VIII Conference on Extending Database Technology (EDBT’02), LNCS 2287, Springer, New York, 2002, pp. 124–141.
[18] A. Deutsch, M. Fernandez, D. Suciu, Storing semistructured data with STORED, Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), ACM Press, Madison, WI, 1999, pp. 431–442.
[19] E. Agichtein, C.T. Howard, V. Josifovski, J. Gerhardt, Extracting relational from XML documents, Proceedings of the First International Workshop on XML Schema and Data Management (XSDM2003), LNCS 2814, Springer, New York, 2003, pp. 390–401.
[20] P. Bohannon, J. Freire, P. Roy, J. Simeon, From XML-Schemas to relations: a cost-based approach to XML storage, Proceedings of the 18th International Conference on Data Engineering (ICDE2002), IEEE Computer Society, 2002, pp. 564–580.
[21] S. Amer-Yahia, F. Du, J. Freire, A generic and flexible framework for mapping XML documents into relations, Technical report, OGI/OHSU, 2004.
[22] P. Bohannon, J. Freire, J.R. Haritsa, M. Ramanath, P. Roy, J. Simeon, Bridging the XML-relational divide with LegoDB: a demonstration, Proceedings of the 19th International Conference on Data Engineering (ICDE2003), IEEE Computer Society, 2003, pp. 759–760.
[23] D.C. Fallside, P. Walmsley (Eds.), XML Schema Part 0: Primer Second Edition, W3C Recommendation, 28 October, 2004, http://www.w3.org/TR/2004/REC-xmlschema-0-20041028.