On the feasibility of using conceptual modeling constructs for the design and analysis of XML data




Data & Knowledge Engineering 72 (2012) 219–238



Arijit Sengupta ⁎

Wright State University, Raj Soin College of Business, Dayton, OH 45435, United States

Article info

Article history: Received 15 May 2010; received in revised form 4 November 2011; accepted 4 November 2011; available online 20 November 2011.

Keywords: Data models (7.3); Dynamic modeling (7.4); Digital libraries (14.3); Data quality (15.2); Visualization (16.1)

Abstract

XML is one of the most widely accepted data representation languages in today's Internet-dominated computing. While most XML data on the net today use commonly known structures, the power of XML lies in the ability to develop application-specific structures and models. XER (Extensible Entity Relationship) is a conceptual modeling methodology that uses visual constructs reminiscent of Entity Relationship (ER) in the logical design of XML, instead of relying on the text-based DTD (Document Type Definition) and XML Schema notations. In this paper, we demonstrate how XER can be used to effectively design and analyze applications that use XML data. We also compare XER against other design constructs to demonstrate that a conceptual modeling artifact can potentially be superior to other artifacts for modeling XML structures, by significantly improving accuracy, efficiency and user satisfaction.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction and motivation

Design is one of the most important steps in the software development process. In most systems analysis and design literature, the process of design always comes before implementation, and carries a major weight in determining the success of a project. The process of conceptual design is the phase of design that is independent of the final platform and the medium of implementation, and is usually in a form that is understandable and usable by managers and other personnel who may not be familiar with low-level implementation details, yet play an influential role in the development process. Most of the development methods utilized in today's software development include such a conceptual design phase. For instance, in relational database development, the conceptual design is presented using the well-accepted Entity-Relationship (ER) diagrams [1]; in software model development, the conceptual model is presented using data flow diagrams (DFD's) [2]; in Object-Oriented design, the conceptual model is presented using the Unified Modeling Language (UML) [3]; and so forth. The Entity-Relationship approach is a well-accepted technique for data modeling, and is one of the techniques most used in the design process of applications involving relational databases. Unfortunately, the design process of XML-based applications does not involve an accepted conceptual modeling methodology, although several tree-based methods and ER-based methods have been proposed (see e.g., [4,5]). The goal of this research is to propose a conceptual design method building on top of the ER model, staying as true to the original model as possible, yet creating a semantic equivalence between the conceptual model and the final schema. Just as in ER, the model should lead to an equivalent schema representing the modeled data structure.
As an added motivation, the well-established domain of model-driven architecture [6] recommends the use of conceptual models for software development because of the benefits it provides to the software development process, both short term and long term. We derive further motivation from the principles of metamodeling which show that the need for concise models can be addressed by providing a visual modeling language [7]. This paper intends to empirically demonstrate this principle with a visual

⁎ Tel.: +1 937 775 2115; fax: +1 937 775 3533. E-mail address: [email protected].

0169-023X/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2011.11.001


language based on familiar notation, an important principle for universal design. In fact, the Entity Relationship approach has been successfully applied to the design of the Eclipse Modeling Framework (EMF) [8], a metamodel framework used extensively in software development in the open software foundation domain. EMF also uses an XML-based model specification, and a guideline has been proposed for using the Entity Relationship approach for EMF metamodel definition [8]. The concept of web services, SOAP (Simple Object Access Protocol), and a host of other XML-based standards mandates the use of XML at the level of design, representation, as well as retrieval of data; yet there is no standard mechanism for modeling and designing XML structures without a low-level hierarchy-based design. Most XML design processes start by directly marking up data in XML, and the metadata is typically designed at the time of encoding the documents. Users who prefer a more graphical method typically employ an XML editor to design the schema, usually via a process of creating the hierarchical structure using graphical constructs. We feel that there is a strong need for a methodology (not simply a tool) that allows users to develop conceptual models and have a process to translate the model into the schema, as is typical in relational database design. The methodology described in this paper utilizes the fact that conceptual models are network-based structures, whereas a hierarchy is what one ultimately needs to achieve when building XML applications. XER relieves the designer by not enforcing the hierarchy constraint, instead allowing a flexible method for extracting the best possible hierarchy for the data from an unconstrained conceptual model. In this respect we present XER (pronounced "Cher", or "Sure"), a primarily semantic extension of the well-known ER model [1], which is capable of handling all the nuances of XML in a highly presentable graphical form.
While the presentation of the modeling technique is one contribution of this article, the most important contribution lies in the process of analyzing user behavior while using a modeling technique for designing a new application or analyzing an existing one. In this paper, we compare the process of analysis and design in the context of XML applications, using comparable conceptual techniques as well as direct application of schema. We develop a theory based on existing literature on systems analysis and design, and demonstrate that the use of a domain-independent modeling methodology may lead to better user performance in accurately and efficiently designing and analyzing XML applications. The contribution of this paper, thus, lies in the conservative extension of the ER modeling method for conceptual modeling of XML, and in empirically establishing its effectiveness in improving user performance in the design and analysis of XML data oriented applications. The biggest claim we intend to make in this paper is that any visually inclined metamodeling architecture should provide a better modeling experience for XML-based applications. The metamodel architecture provides a strong theory for this claim, and our empirical study shows results supporting this claim as well. We hope that in the near future, W3C and the research community will work together to develop a recommendation for a commonly known and understood conceptual model for XML.

1.1. Modeling nuances in XML

The biggest question in the reader's mind will likely be: "Why do we need a different modeling method for XML? Why will current modeling strategies not work?" The answer comes from the fact that XML has a complex ordered and multi-level structure. This structure is more complex than the flat relational model, yet less complex than the semantically rich object-oriented model.
At first glance, the Entity-Relationship (ER) model seems to be an appropriate modeling approach for XML. ER is very successful in the relational domain, with a set of simple graphical modeling constructs like rectangles and diamonds to describe the data objects and the relationships between them. However, XML structures are typically not designed with a data relationship view; they are designed more from a document perspective, with primarily textual content that has embedded links, meta-data information, and a structural hierarchy. The main complicating factors that make the standard ER model hard to adapt to XML are as follows:

1. Order: XML objects are inherently ordered, i.e., there is a specific ordering between different elements as well as different instances of the same element. Ordering is not a constraint but an inherent design aspect of XML, and several structural constructs are included in both DTDs and Schemas to handle document order.

2. Hierarchy: XML does not have a direct way to support many-many relationships, since the structure is essentially hierarchical. Every document must represent a strict hierarchy to be well-formed.

3. Heterogeneity: XML structures often involve heterogeneous types, a concept in which different instances of an element may have different structures.

4. Structural complexity: Individual element structures can be complex, involving structured groups of elements, as well as optional, required and multi-valued elements, unlike the essentially flat nature of structures in ER models.

5. Empty elements: On the flip side, elements may be empty, containing no sub-structure, yet representing a significant positional characteristic in the document, perhaps augmented with attributes.

6. Mixed elements: An element in XML may have mixed content, with atomic (e.g., text) as well as non-atomic (e.g., other structure) items at the same time.
7. Namespaces: Last but not least, XML supports many related concepts, such as namespaces, that make the task of creating a conceptual model nontrivial.

Given these departures from the norm (primarily compared with relational systems), we contend that an extension to the ER model can appropriately result in a method for conceptual development of XML Schema, retaining the ease of use and success of the ER approach while still allowing the possibility of modeling XML structures.
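Two of the nuances above, order (1) and mixed content (6), can be seen concretely by parsing a small document with Python's standard library. This is an illustrative sketch; the `Para`/`Bold`/`Footnote` element names are hypothetical, not taken from the paper:

```python
import xml.etree.ElementTree as ET

# A hypothetical mixed-content element: free text interleaved with
# child elements, where sibling order is part of the document itself.
doc = """<Para>An element may be <Bold>mixed</Bold>, combining text
with structure<Footnote>as in XHTML paragraphs</Footnote>.</Para>"""

para = ET.fromstring(doc)

# Nuance 1 (order): children are kept in exactly the order written.
children = [child.tag for child in para]
print(children)      # ['Bold', 'Footnote']

# Nuance 6 (mixed content): interleaved text lives in .text and .tail,
# so text and element structure coexist inside the same element.
print(repr(para.text))       # text before <Bold>
print(repr(para[0].tail))    # text between </Bold> and <Footnote>
```

A relational or flat ER view has no natural place for either the sibling order or the interleaved text, which is why the standard model needs extension.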


This paper is organized as follows. Section 2 provides a background in XML design, from both research and commercial perspectives. Section 3 introduces the XER methodology, and Section 4 shows how XER assists in the design and analysis of XML applications. Section 5 presents the empirical study comparing XER against two other artifacts from the perspectives of both design and analysis. Section 6 details the results from the study, and we finally conclude in Section 7. We add detailed information on the models used in the empirical study in Appendix A and statistical analysis results in Appendix B.

2. Background in XML design and modeling

Conceptual modeling is a fairly standard aspect of software engineering, and all software engineering models involve conceptual design as the first phase of an application development life cycle. Although XML has matured as a well-developed language for the internal representation of most web-based application-to-application data interchange, XML for core data representation still hasn't caught up to where it was hyped to be. W3C has released different models for XML for the purpose of programmatic traversal (DOM), as well as query language understanding (the XPath/XQuery data model). DOM (Document Object Model) [9] is a low-level programmatic model that allows traversal of XML parse trees and provides APIs for the purpose of developing XML-based applications. The XPath/XQuery data model [10] is a formal model for describing the query language features of XML. Neither of these models can serve as a satisfactory substitute for a user-friendly, design-oriented conceptual model. As a result, the design of XML applications mostly depends on direct editing of XML Schema or DTD's, sometimes using graphical interfaces that support interactive manipulation. While conceptual modeling has been a topic of research for some time, a standard conceptual modeling method has not yet been accepted as the method of choice for XML.
In this section, we review some of the conceptual modeling methods in the literature, as well as some of the research comparing conceptual models with schema-based models. Most of the methodologies explored in the literature are based on currently existing modeling techniques. Two of the most popular modeling approaches in current use are the Entity Relationship (ER) model and UML (Unified Modeling Language), both of which have been used for modeling XML. The literature includes several surveys (see e.g., [4,5,11]) reviewing in detail work over the last ten years on various methods for conceptual modeling. In this section, we summarize a selection of such techniques to introduce the basis for our work. Some of the early research on XML modeling involved capturing the development of Document Type Definitions (DTDs). Dos Santos Mello and Heuser [12] describe a semiautomated process for converting a DTD to a conceptual schema and for providing a canonical representation of the same. The process involves a set of conversion rules that consider DTD elements, attributes, syntactical constructs and heuristics related to default semantic interpretations of these constructs to generate the corresponding conceptual concept. It entails human intervention to generate the final graphical conceptual model using a three-phase conversion process: (i) rules application, (ii) instance analysis, and (iii) user validation. The conversion rule application phase follows the DTD definition, applying appropriate rules according to the syntax of the element and its attribute definitions to generate a preliminary conceptual schema. The user further validates this preliminary schema in the validation phase to generate the definitive conceptual schema. Rule-based translation of XML specifications to XML Schema using XSLT stylesheets [13] has also been developed [14,15].
In these techniques, logical models can be used to define XML Schemas, and stylesheets can be used to dynamically determine the output schemas of the stylesheet specifications. Extending the Entity-Relationship approach [1] to model XML has also been fairly common. Such methods involve adapting the ER modeling approach with extensions to incorporate XML capabilities. Common methods among these approaches include EER (Extended Entity Relationship) and ERX (Entity Relationship Extended). EER [16] uses the semantic data modeling capabilities of XML Schemas to capture the mapping between features of XML Schema and existing models. This work formalizes a core set of features found in various XML Schema languages into XGrammar, a commonly used notation in formal language theory. XGrammar is an extension of the regular tree grammar definition, and uses a six-tuple notation to describe the model of XML Schema languages. XGrammar has three pertinent features: (i) ordered binary relationships; (ii) union Boolean operations; and (iii) recursive relationships. Mani et al. compare these features to the standard ER model and, based on this comparison, extend the ER model to better support XML. Mani et al. term this extended model the Extended Entity Relationship model (EER). The main extensions from the ER are (i) modification of the ER constructs to better support order and (ii) introduction of a dummy "has" relationship to describe the element–sub-element relationship that is prevalent in XML. Psaila [17] introduces ERX (Entity Relationship for XML) as a conceptual model based on the Entity Relationship model [1]. ERX is designed primarily to provide support for the development of complex XML structures. It provides a conceptual model to help better represent the various document classes that are used and their interrelationships. ERX is not nested but rather exploits a flat representation to explain XML concepts and their relationships.
The basic building blocks of an ER model, such as entities, relationships and attributes, have been modified or extended to support XML-specific features such as order, complex structures, etc. ERX is not constrained by the syntactic structure of XML and is specifically focused on data management issues. ERX, however, does not support some XML-specific features such as mixed content, and does not describe how complex types with their various nuances can be modeled into the system. ERX does not provide support for ordered and unordered attributes. The qualified attribute "Order" supported in ERX serves to establish the order between the various instances of a complex concept in XML; however, there is no mechanism to determine the order of attributes within an XML concept. ERX nevertheless establishes the reasoning that a conventional ER model can be used to describe a conceptual model for XML structures, and it serves as an effective support for the development of complex XML structures for advanced applications. In a related article, Psaila [18] also describes algorithms to translate XML DTD's into a corresponding ERX model. Combi and Oliboni [19] also present a similar Entity-Relationship-based technique that is demonstrated using a prototype and a translation mechanism into XML Schema. Other techniques following the E-R approach include X-Entity [20], which also uses an extension of the ER model to incorporate XML features. The modeling technique described in this paper, XER (Extensible Entity Relationship), builds on an earlier conference presentation [21] and falls in the category of extending the ER model. The biggest difference between this approach and many of the other approaches described earlier is the use of semantic extension, whereby the natural look and feel of the Entity-Relationship model is kept intact with very minor structural changes. Instead of changing the ER model significantly, we use semantics that allow transformation of the ER constructs to XML. UML, the well-accepted object-oriented modeling technique, has also been used for modeling XML. Conrad et al. [22] describe various UML constructs such as classes, aggregation, composition, generalization and packages and explain their transformation into appropriate DTD fragments. Their work also extends UML to take advantage of all facets that DTD's offer. UML classes are used to represent XML element type notations. The element name is represented using the class name and the element content is described by using the attributes of the class. Since UML classes do not directly support order, the authors introduce an implicit top-bottom order. DTD constructs for element types, which express an element–sub-element relationship, are modeled using aggregations. The authors argue that the multiplicity specification for UML aggregations is semantically as rich as cardinality constraints, and they use it to express relationship cardinalities. One advantage of using UML is the easy incorporation of Object-Oriented design in implementing generalization. The conceptual model as proposed by the authors can handle most of the constructs that are commonly used in a DTD.
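The essence of the class-to-DTD mapping described by Conrad et al. (class name becomes the element name; class attributes, in an implicit top-to-bottom order, become the content model) can be sketched as follows. This is a hypothetical illustration of the general idea, not the authors' algorithm; the function name and simplified signature are my own:

```python
def uml_class_to_dtd(class_name, attributes):
    """Sketch of a UML-class-to-DTD mapping: the class name becomes the
    element name, and the attributes (listed top to bottom, optionally
    carrying DTD occurrence markers like '+' or '*') become an ordered
    content model. A class with no attributes maps to an EMPTY element."""
    if not attributes:
        return f"<!ELEMENT {class_name} EMPTY>"
    return f"<!ELEMENT {class_name} ({', '.join(attributes)})>"

print(uml_class_to_dtd("Book", ["Title", "Author", "Chapter+"]))
# <!ELEMENT Book (Title, Author, Chapter+)>
```

Note how the implicit ordering assumption is baked in: the sequence content model `(Title, Author, Chapter+)` imposes the top-to-bottom attribute order that UML itself does not guarantee.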
Further, some of the UML constructs, such as UML attributes for classes, which do not have an equivalent XML representation, are suitably modified to represent XML-specific features. This method is fairly successful in the conversion of DTD fragments into corresponding conceptual models, but since the authors' work restricts the model to DTD's, the expressive power of the model is limited. The UML-based method also requires the user designing an XML conceptual model to learn the concepts of UML. The UML models can then be used to build XML Schemas using a mapping process. One such mapping is presented by Routledge et al. [23]. In this technique, once the conceptual model has been validated (which requires a domain expert), the process involves automatically converting the model into a logical-level diagram that describes the XML Schema in a graphical and abstract manner. The logical-level model, in most cases, serves as a direct representation of the XML Schema data structures. It uses standard XML stereotypes such as "Simple Element", "Complex Element" and "Sequence" that are defined in the UML profile for XML Schema. These definitions enable a direct representation of the XML Schema components in UML. More details of the UML profile can be found in [23]. The third and final stage is the physical level and involves the representation of the logical-level diagram in the implementation language, namely XML Schema. The authors have not included algorithms to directly convert the logical model to a Schema and vice versa. Other advanced concepts have also been used for modeling XML. One such approach uses the concept of semantic data networks [24].
This methodology uses a semantic network at the top level that provides a semantic model of the XML document through (i) atomic and complex nodes, (ii) directed edges representing semantic relationships between nodes, (iii) labels denoting the type of relationship and (iv) constraints defined over nodes and edges. This method also provides for a lower-level design that includes element/attribute declarations and simple/complex type definitions. The main idea is that the mapping between these two levels can be used to transform the XML semantic model into an XML schematic model, which can then be used to create, modify, manage and validate documents. Research has also been performed on identifying dependencies in Entity-Relationship models via constraints for the purpose of automated model interpretation [25]. This paper is influenced by many of the above articles. Many commercial tools have also been developed for the purpose of building XML structures. A list of these tools is available from W3C [26]; it includes XMLSpy from Altova, a full-featured XML editor with capabilities for visually editing DTDs and XML Schemas. A few additional tools are described in the rest of this section. XML Authority [27] provides a visual representation of a DTD or an XML Schema. It supports two different views: a tree representation and a tabular representation listing the various elements and attributes of the Schema or DTD. XML Authority also provides the ability to validate DTD's and Schemas. Near and Far Designer [28] is a DTD modeling tool to create a DTD without prior knowledge of the DTD syntax. It provides the user with a simple, easy-to-use tree representation to create a DTD. It provides features to validate DTD's and also provides conversion tools to convert between XML and SGML. Microsoft Visual Studio .NET Suite [29] includes XML Designer, a graphical XML Schema editor. XML Designer provides a set of visual tools for working with XML Schemas.
The designer provides three views or modes to work on XML files and datasets: Schema View, XML View and Data View. The schema view provides a visual representation of the elements, attributes, types, and other components that make up XML Schemas and ADO.NET datasets. The XML view provides an editor for editing raw XML, with IntelliSense and color-coding. In summary, a fair amount of work exists on tools and methods for XML design. Table 1 provides a comparison of the existing methods cited above with respect to six functional aspects. The literature does not show any existing work in empirically evaluating conceptual design methods for XML. We select six functional criteria from the systems covered, primarily based on the main four design complexities of XML structures, adding the forward and reverse engineering capabilities of conceptual designs. The first criterion, order, represents the representation of document and structure order. The next two (heterogeneity and complex content) are also structural aspects of XML, and all the models faithfully represent them. The fourth criterion, mixed content, is an important data representation characteristic of XML. The next two criteria refer to the processes of translating an existing application up to the model ("up-translation" or reverse translation), and of generating a logical schema from the conceptual model (forward or "down-translation"). Not all the models can re-generate (or down-translate to) the original applications. The final criterion refers to the extent to which these methods support the XML Schema standard.


Table 1. Comparison of methods and tools regarding XML support (✓ = full support, – = partial support, X = no support).

| Method       | Order | Hetero | Complex | Mixed | Up-tran | Down-tran | XML Schema support |
|--------------|-------|--------|---------|-------|---------|-----------|--------------------|
| ERX          | –     | ✓      | ✓       | –     | ✓       | ✓         | ✓                  |
| UML DTD      | ✓     | ✓      | ✓       | ✓     | ✓       | ✓         | –                  |
| UML Schema   | ✓     | ✓      | ✓       | X     | ✓       | ✓         | ✓                  |
| Semantic     | ✓     | ✓      | ✓       | –     | –       | ✓         | ✓                  |
| XGrammar     | X     | ✓      | ✓       | X     | –       | ✓         | ✓                  |
| VS .NET      | ✓     | ✓      | ✓       | ✓     | ✓       | –         | ✓                  |
| XML Auth     | ✓     | ✓      | ✓       | ✓     | ✓       | –         | ✓                  |
| Near and Far | ✓     | ✓      | ✓       | ✓     | ✓       | –         | ✓                  |
| XER          | ✓     | ✓      | ✓       | ✓     | ✓       | ✓         | ✓                  |

Although the UML method by Conrad et al. [22] does support almost all the criteria, its restriction to XML DTDs limits it as a conceptual representation of XML Schema. In the methodology we present in this paper, XER addresses all of the functional aspects of XML conceptual modeling. In addition, we demonstrate empirically that conceptual modeling artifacts such as XER do result in improved user performance, from the perspectives of both efficiency and accuracy, and provide greater satisfaction to users for the development of new applications, as well as for analyzing existing applications.

3. The XER modeling methodology

We present XER, an extension of the popular ER model [1], as a conceptual model for XML. To avoid the minute details of all the various features of XML, we assume a canonical view of XML called ENF (Element Normal Form) [30], which is a representation of XML documents without using any attributes. Any XML document with attributes can be transformed into ENF by converting the attributes to elements with a special naming convention (e.g., prefixed with a symbol like '@'). This transformation can be performed using XSLT [13], and the original document can be obtained back from an ENF representation using XSLT as well. Note that given the ENF specification, we assume that the ENF translation process places the "@" symbol in front of newly created simple elements to identify that they were originally XML attributes. In the rest of the paper, we assume that all the documents in our data set are in ENF. Readers should note that the ENF assumption is not a requirement in order to use XER. It is merely used to simplify the description of a conceptual model for XML documents without a potential conflict in terminology, since the term "attribute" in XML has an entirely different meaning from the same term in Entity Relationship models.
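The ENF transformation just described (each attribute becomes a child element whose name carries an "@" prefix) can be sketched with the standard library instead of XSLT. This is a minimal in-memory sketch; the function name and recursion strategy are my own, and since "@" is not a legal XML name character, the result is an internal canonical form rather than something meant for re-serialization:

```python
import xml.etree.ElementTree as ET

def to_enf(elem):
    """Sketch of the Element Normal Form (ENF) transformation: every XML
    attribute becomes a leading child element whose tag is the attribute
    name prefixed with '@', following the naming convention in the text.
    (Tags containing '@' are not legal XML names, so this tree is an
    internal representation only.)"""
    result = ET.Element(elem.tag)
    result.text = elem.text
    # Former attributes become simple elements, placed before other children.
    for name, value in elem.attrib.items():
        ET.SubElement(result, "@" + name).text = value
    for child in elem:
        result.append(to_enf(child))
    return result

person = ET.fromstring('<Person SSN="123-45-6789" status="active">'
                       '<FirstName>Ada</FirstName></Person>')
enf = to_enf(person)
print([child.tag for child in enf])   # ['@SSN', '@status', 'FirstName']
```

The reverse direction (stripping the "@" prefix and turning those children back into attributes) recovers the original document, which is why ENF loses no information.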
Given the metamodel architecture motivation introduced earlier, XER fits well with the layered metamodel-based design principles [6]. In this architecture, at the top level is a domain model, captured by XER in the next level as the metamodel layer. XML Schema forms the model layer, and at the bottom is the data represented in XML. This method aims at developing a metamodel-based design paradigm for XML, whereby a domain model is captured via XER, and the schema may be generated as needed. Note that XER is not necessarily a complete structural equivalent of XML Schema, and in fact requires some simplification of XML Schema before an up-translation algorithm can be applied. However, designs in XER can always be translated into an equivalent XML Schema. The metamodel architecture is shown in Fig. 1.

3.1. XER constructs


As noted earlier, XER extends the ER model [21] by introducing some semantic modifications to the ER constructs. Before discussing the constructs, we would like to emphasize the design principle for XER — to be as close to the original ER design as possible, allowing only semantic variations to enable XML functionality. For the following discussion, we are going to assume some background

[Fig. 1. Metamodel layer with XER: a domain model is conceptualized into an XER model (with validation back against the domain model); the XER model is down-translated into XML Schema and can be up-translated from it; and schema discovery connects the XML application back to the XML Schema layer.]


in XML constructs to avoid defining all the related XML terms. However, because of the overlap between the terms and concepts in XML and ER, some of the terms in the rest of this paper will refer to XER concepts. For example, an "entity" will refer to the concept of entities in the XER model and not the concept of general or parameter entities in XML. Also, XER attributes will refer to attributes of entities and not XML attributes of elements (we are assuming ENF, thereby eliminating the need for attributes). The XER model includes all the basic constructs of the ER model, and some new constructs of its own. We present a brief overview of the primary constructs in XER below. In this presentation, along with each XER construct, we also show its equivalent XML Schema, highlighting the specific schema construct equivalent to the XER concept being discussed. The figures in this section show the XER constructs and the XML Schema translation of each construct, highlighting the primary aspects of the translation process in bold in the XML Schema.

3.1.1. XER entity

The XER entity is the basic conceptual object in XER. An XER entity is represented using a rectangle with a title area showing the name of the entity and the body showing the attributes. XER attributes are properties of entities and can be atomic or complex, potentially optional or multi-valued. Attributes are shown in the model by placing the names of the attributes in the body of the entity. Attributes are ordered by default, and the ordering in the diagram is top to bottom. The XER entity can be of the following three types, as shown in Table 2:

(i) Ordered Entity: XER entities are ordered by default. An ordered entity indicates that the XER attributes in the entity must appear in the same order they are presented in the diagram. Most of the Entity Relationship features, such as keys, arity, and multiplicity of the attributes, are supported. An ordered entity is translated into a standard XML element with a complexType containing a sequence of all its attributes as simpleTypes.

(ii) Unordered Entity: XER also has support for unordered entities, in which all attributes are required but may appear in a document in any order. Unordered entities (as in the "all" tag in XML Schema) have the same representation as ordered entities, but the name of the entity is preceded with a question mark (?). An unordered entity is translated into an XML element containing a complexType with an "all" content.

(iii) Mixed Entity: XER supports mixed entities, in which text content as well as element content is allowed. The mixed entity (as in the "mixed" attribute in XML Schema) is represented in XER using a solid rounded outer rectangle. Mixed entities can be either ordered or unordered. A mixed entity is translated into a complexType with the mixed attribute set to true.

Table 2. Different types of entities in XER, with the XML Schema translation of each entity type.

Ordered:

    <xs:element name="Person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="FirstName"/>
          <xs:element name="LastName"/>
          <xs:element name="PhoneNumber" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="SSN" type="xs:ID"/>
        <xs:attribute name="status"/>
      </xs:complexType>
    </xs:element>

Unordered:

    <xs:element name="Evaluations">
      <xs:complexType>
        <xs:all>
          <xs:element name="Technical"/>
          <xs:element name="Structure"/>
          <xs:element name="Presentation"/>
          <xs:element name="Applicability"/>
          <xs:element name="Methodology"/>
        </xs:all>
      </xs:complexType>
    </xs:element>

Mixed:

    <xs:element name="Para">
      <xs:complexType mixed="true">
        <xs:sequence>
          <xs:element name="Bold"/>
          <xs:element name="Italics"/>
          <xs:element name="Footnote"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

3.1.2. XER relationships

Relationships denote a connection between two or more XER entities. Relationships can be "one to one" (1–1), "one to many" (1–M) or "many to many" (M–M). In the XER diagram, a relationship is represented as a diamond, as in the ER model. The reader


should note that while relationships with different cardinalities are not new in ER, many-to-many relationships are not directly captured by XML because of its hierarchical nature, and hence must be properly translated when building the schema. Using XER, however, designers do not need to worry about this internal translation mechanism and may develop models as simply as they would develop relational models. When translated into a schema, a relationship becomes part of the entity's complexType if it is a 1–1 or 1–M relationship. For M–M relationships, as well as for ternary relationships, separate elements are created and referenced using the KEY/KEYREF feature of XML Schema. Relationships may or may not be named, and labels along the connectors indicate participation constraints for a relationship and the connecting entity. An example of a 1–M XER relationship is shown in Table 3, and a more complex example involving a ternary relationship is shown in Table 5. Note that the diagram specifies the position of the relationship in the entity by repeating the name of either the connected entity (for unnamed or 1–M relationships) or the relationship (for ternary or M–M relationships).

Participation and cardinality in relationships: Participation and cardinality constraints in XER are exactly the same as those in standard ER models. Unfortunately, the ER modeling literature contains a large number of competing "standards" for representing participation and cardinality. We therefore use a "min-max" approach (see e.g., Ch. 11 in [31] and also [32]), in which the participation/cardinality constraints indicate the minimum and maximum number of times the entity participates in the relationship. This approach was chosen because it comes closest to the minOccurs and maxOccurs attributes of elements. Table 3, for example, shows that a book must have at least one chapter but could have many chapters, while a chapter is associated with only one book.
This is closest to the XML equivalent, where within the complexType for book, the minOccurs and maxOccurs of Chapter are defined as 1 and unbounded (M) respectively.

3.1.3. XER generalizations

The term "generalization" refers to the concept of an entity that can have different sub-entities (similar to an ISA hierarchy). Conceptually, a generalization allows one entity to have multiple sub-entities that have some properties in common, but also distinguishing properties. The closest concept in XML Schema is a "choice", and it fits nicely in XER as well. When there is no overlap between the sub-entities, a choice is an exact fit. When there are overlaps between the sub-entities, a complexType involving appropriate values for minOccurs and maxOccurs can be used. In XER, we present generalizations using a covering rectangle containing the specialized XER entities, as shown in Table 4. This is equivalent to using the "choice" tag in XML Schema. The choice of including the sub-entities within the body of the parent entity follows the trend of incorporating element structure within the element, although the reader should note that a choice does not indicate any form of order. While heterogeneity in XML seems conceptually different from ER generalization, heterogeneity is typically used in documents to provide alternative structures of a parent element, a semantic that is close to generalization; this supports our use of generalization to model heterogeneous structures.

3.1.4. Other XER concepts

Like the ER model, XER can also have weak entities, ternary relationships, and aggregations, with similar semantics. A weak entity in XER indicates a strong relationship between the "strong" entity and the weak entity. This is essentially a hint to the underlying translation engine to ensure that the weak entity is always modeled as a sub-element of the parent entity's element (instead of creating an IDREF relationship). For example, consider the XER diagram in Table 5.
The weak entity “answer” is dependent on the strong entity “question” since answers cannot exist without a corresponding entity in the Question entity set. Question has other relationships, for example the relationship between question and category, or the ternary relationship between question,

Table 3
XER relationships and the corresponding XML Schema. (The XER diagram column of the original table is graphical and is not reproduced here.)

  <xs:element name="BOOK">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title"/>
        <xs:element name="author" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Chapter" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title"/>
              <xs:element name="abstract"/>
              <xs:element name="section" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:attribute name="chapno" type="xs:ID"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
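The 1–M mapping in Table 3 can be generated mechanically. The helper below is an illustrative sketch (the function name and its defaults are ours, not part of the XER toolset); it emits the child element with the minOccurs/maxOccurs attributes that encode the participation constraint:

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def one_to_many(parent, child, min_occurs="1", max_occurs="unbounded"):
    """Translate a 1-M XER relationship: the child entity becomes a
    multiply-occurring element inside the parent's complexType/sequence,
    as in the BOOK/Chapter example of Table 3."""
    elem = ET.Element(f"{{{XS}}}element", name=parent)
    ctype = ET.SubElement(elem, f"{{{XS}}}complexType")
    seq = ET.SubElement(ctype, f"{{{XS}}}sequence")
    ET.SubElement(seq, f"{{{XS}}}element",
                  name=child, minOccurs=min_occurs, maxOccurs=max_occurs)
    return elem

book = one_to_many("BOOK", "Chapter")
chapter = book.find(f"{{{XS}}}complexType/{{{XS}}}sequence/{{{XS}}}element")
print(chapter.get("minOccurs"), chapter.get("maxOccurs"))  # 1 unbounded
```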


Table 4
XER generalizations and the corresponding XML Schema. (The XER diagram column of the original table is graphical and is not reproduced here.)

  <xs:element name="ITEM">
    <xs:complexType>
      <xs:choice>
        <xs:element name="BOOK">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pages"/>
              …
          </xs:complexType>
        </xs:element>
        <xs:element name="VIDEO">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title"/>
              …
          </xs:complexType>
        </xs:element>
      </xs:choice>
      <xs:attribute name="itemno" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
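Since a generalization becomes an xs:choice, an instance document may contain exactly one of the alternatives under the parent. The sketch below (function name is ours; attributes and keys are omitted) builds the choice structure of Table 4 and lists its alternatives:

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def generalization(parent, subentities):
    """Render an XER generalization as an XML Schema choice (cf. Table 4).
    Each sub-entity becomes one alternative inside xs:choice."""
    elem = ET.Element(f"{{{XS}}}element", name=parent)
    ctype = ET.SubElement(elem, f"{{{XS}}}complexType")
    choice = ET.SubElement(ctype, f"{{{XS}}}choice")
    for sub in subentities:
        ET.SubElement(choice, f"{{{XS}}}element", name=sub)
    return elem

item = generalization("ITEM", ["BOOK", "VIDEO"])
alternatives = [e.get("name")
                for e in item.find(f"{{{XS}}}complexType/{{{XS}}}choice")]
print(alternatives)  # ['BOOK', 'VIDEO']
```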

Table 5
Translation of complex XER structures. (The partial XER diagram of the original table is graphical and is not reproduced here; the partial XML Schema follows.)

  <element name='root'>
    <complexType>
      <choice minOccurs='0' maxOccurs='unbounded'>
        <sequence minOccurs='0' maxOccurs='unbounded'>
          <element name='student'>
            <complexType>
              <attribute name='sid' type='ID' use='required'/>
            </complexType>
          </element>
        </sequence>
        <sequence minOccurs='0' maxOccurs='unbounded'>
          <element name='test'>
            <complexType mixed='true'>
            </complexType>
          </element>
        </sequence>
        <sequence minOccurs='0' maxOccurs='unbounded'>
          <element name='question'>
            <complexType>
              <sequence>
                <element ref='answer' minOccurs='0' maxOccurs='unbounded'/>
              </sequence>
            </complexType>
          </element>
        </sequence>
        <sequence>
          <element ref='category' minOccurs='0' maxOccurs='unbounded'/>
        </sequence>
        <sequence>
          <element ref='attempt' minOccurs='0' maxOccurs='unbounded'/>
        </sequence>
      </choice>
    </complexType>
    <key name="studentKey">
      <selector xpath="./student"/><field xpath="@sid"/>
    </key>
    <keyref name="attemptRef" refer="studentKey">
      <selector xpath="./attempt"/><field xpath="@attsid"/>
    </keyref>
  </element>


test and student. Relationships with arity greater than two (e.g., ternary relationships) mandate the use of the KEY/KEYREF features of XML. Table 5 demonstrates the use of weak entities and ternary relationships, showing the pertinent translations in bold. Finally, aggregations, in the same manner as ER aggregations, provide a method for turning a relationship into an entity, thus restructuring a hierarchy into a graph. The reader should note that weak entities are highly conceptual and cannot be easily detected from a schema, although a weak entity in a model is translated in the same manner as a 1–M relationship. A full XER model for a real application, including all of the major constructs of XER described here, is shown in Fig. 3 as well as in Appendix A (Fig. 5).

In summary, we designed XER as a semantic extension to ER rather than a syntactic or structural one. While some structural differences exist, most extensions involve semantic re-positioning of ER constructs for applicability to XML. Table 6 summarizes these structural and semantic differences.

4. Using XER to design/analyze XML applications

Designing XML applications involves a systematic construction of the XML Schema. Using XER, users do not need to work through the textual verbosity of the XML Schema; instead, they construct a model conceptually using graphical tools. This process is very similar to the creation of conceptual models for relational database systems using the ER model. With XER, once a model is constructed, it can be translated into a fully conforming XML Schema (a process that we describe as down-translation). Analysis of existing XML applications can also easily be performed using XER: any conforming XML Schema can be "reverse-engineered" (or up-translated) into an XER model, and the ensuing graphical model can be used to reason about and analyze the structure of the existing XML data.
In this section, we describe these two translations algorithmically. For the sake of explanation, we will term the translation from a schema to XER "reverse translation" or "up-translation", and the translation from XER to a schema "forward translation" or "down-translation".

4.1. Down-translation from XER to schema

The down-translation process follows the basic schema construction methods discussed in the previous section. The biggest issue in down-translation is the creation of an XML hierarchical structure from the network structure of the model. The designer may designate a specific entity as the "root" element, in which case the translation mechanism incorporates all other elements under its hierarchy. The more typical method, however, is to let the algorithm create a root element and incorporate all primary entities as multiply occurring children of that root. The algorithm is presented as pseudocode in Listing 1.

4.1.1. Notes on the algorithm

Two items in the above algorithm deserve additional explanation. The first is the priority order. Since there is no explicit hierarchy in an ER model, we prioritize entities based on how many other entities each may be a "parent" of, i.e., 1–M relationships in which the entity is on the "one" side. Thus, we give the highest priority to entities that have a weak entity, sub-entities in a generalization, or the largest number of relationships associated with them. The second item that deserves special attention is termination. To see that this algorithm always terminates, note that there are only a finite number of elements in xermodel, each of which is unmarked to start with; as each node gets incorporated, it is marked and never considered again for re-modeling.
The selection of the "top-level" nodes ensures that even if the XER model is simply a disconnected set of entities, the translation still produces a valid XML Schema with a generated root and all entities as its children.

4.2. Up-translation from schema to XER

Before we discuss the up-translation process, note that an XML Schema needs to be pre-processed into the Element Normal Form (ENF) before up-translation, which can easily be achieved using an XSLT transformation. Also, we ignore the data types

Table 6
Summary of structural and semantic extensions in XER over the ER model.

Entity
  Structural: Rounded rectangle for mixed; "?" before the name to represent unordered.
  Semantic: Content of the entity is inherently ordered; the position of attribute and relationship markers denotes order.

Attribute
  Structural: Prefixed with "@" to represent an XML attribute; multivalued attributes represented by appending "(*)".
  Semantic: Inherent order maintained; underlined to denote a key. An attribute may represent a marker for a relationship. Keys may be composite but must be underlined in the diagram.

Relationship
  Structural: May be unnamed for implicit complex types.
  Semantic: Position of relationships matters. The participation and cardinality constraints represent minimum and maximum occurrence.

Weak entity
  Structural: Represented with double borders.
  Semantic: Same semantics as ER. Cannot be up-translated.

Generalization
  Structural: Represented using entities drawn inside parent entities.
  Semantic: Represents a choice between structures semantically; more a structural choice than an object-oriented one.
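The ENF pre-processing required before up-translation is performed in our setting with an XSLT transformation; purely as an illustration of the rewrite itself, the sketch below performs the same normalization in Python, turning every XML attribute into a leading child element:

```python
import xml.etree.ElementTree as ET

def to_enf(elem):
    """Rewrite every XML attribute as a leading child element so the
    document uses elements only (Element Normal Form). Recurses so that
    nested elements are normalized as well."""
    for child in list(elem):
        to_enf(child)
    for pos, (name, value) in enumerate(elem.attrib.items()):
        sub = ET.Element(name)
        sub.text = value
        elem.insert(pos, sub)
    elem.attrib.clear()
    return elem

doc = ET.fromstring('<book isbn="1-234"><title lang="en">XML</title></book>')
to_enf(doc)
print(ET.tostring(doc, encoding="unicode"))
```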


Table 7
Translation of a recursive structure into XER. (a) The XML Schema; the corresponding XER diagram (b) is graphical and is not reproduced here.

  <xsd:element name="section">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="secno"/>
        <xsd:element name="title"/>
        <xsd:element name="sectext"/>
        <xsd:element ref="section" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

and external namespaces for the presentation of the algorithm, and we expand implicit complexTypes into complexTypes with automatically generated names. Since XER is primarily meant to be a conceptual modeling methodology and not an exact graphical representation of an XML Schema, some of the specialized schema elements are not represented visually but are saved as properties of the diagram for future down-translation purposes. For example, the "schema" element includes information about the namespace from which the elements and data types used in the schema are derived. This information is kept in the property sheet (a collection of specialized properties of any object) of the diagram, and can be shown as a footnote in the diagram. Listing 2 presents the up-translation algorithm.

4.2.1. Notes on the algorithm

This algorithm traverses the XML tree structure of the schema. The macro root() finds the root of the XML Schema tree, and the macro content() finds the content type of a node, following a reference if necessary. The intuition behind the proper termination of this algorithm is that, both in XML Schema and in XER, every element (in the schema) and entity (in XER) has a unique name. However, because of the xs:ref feature in XML, the same element may be encountered multiple times. For this reason, the up-translation algorithm treats xs:ref differently: when a ref node is encountered, the algorithm finds the entity representing the definition of that node to create the relationship (see line 23 in Listing 2). Thus, even in the case of a recursive structure, where an element contains a complexType including a reference to that element itself, the algorithm creates a relationship between the entity and itself and stops (combination of lines 07 and 23 in Listing 2). The schema in Table 7a is therefore translated into the XER model in Table 7b. The seeming strangeness of a section attribute inside the section entity is explained by the necessity of providing a location for that relationship within the section entity.

The reader should note that the above algorithm gives higher weight to entities, and may thus produce fewer relationships than a human designer would create. However, the resulting models behave similarly, and the literature suggests that entities are better than relationships at representing composites [33]. As far as the quality of the model generated from the schema is concerned, the only claim we make is that it is a true representation of the schema; it will likely require human manipulation to become a presentable and understandable model. Models created from a very complex XML Schema may not be very presentable at first, but should still be easier to comprehend than the textual form.
Metrics to determine the quality of models have been studied in the literature (see e.g., [34]); we consider such quality analysis of the generated models a future extension of this article.
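The recursion-handling argument above can be exercised with a small sketch. The dictionary encoding of the schema below is an illustrative stand-in for the real XML Schema tree (and the function name is ours), but the guard mirrors the "node not in Outputmodel" test of Listing 2:

```python
def up_translate_sketch(defs, root):
    """Mimic the recursion guard of Listing 2: a named definition is
    visited only once; a reference back to an already-seen element yields
    a (possibly self-referential) relationship instead of a new entity."""
    entities, relationships, seen = [], [], set()

    def visit(name):
        if name in seen:          # cf. the 'node not in Outputmodel' test
            return
        seen.add(name)
        entities.append(name)
        for child in defs.get(name, []):
            if child in defs:     # complex content: relationship + recurse
                relationships.append((name, child))
                visit(child)

    visit(root)
    return entities, relationships

# The recursive 'section' element of Table 7, as a name -> children map:
defs = {"section": ["secno", "title", "sectext", "section"]}
print(up_translate_sketch(defs, "section"))  # (['section'], [('section', 'section')])
```

The single self-loop relationship is exactly the outcome described for Table 7: the algorithm terminates despite the recursive reference.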

Table 8
Task sets and corresponding domain, type of task, and complexity.

Task set  Domain    Task type                    Task complexity
1a        Analysis  Basic sequence               L
1b        Analysis  Long sequence                L
1c        Analysis  Basic relationship           L
1d        Analysis  Complex sequence             M
1e        Analysis  Mixed type                   M
1f        Analysis  Simple/complex combination   M
1g        Analysis  Multiple complex types       H
1h        Analysis  Unordered type               H
1i        Analysis  Generalization               H
1j        Analysis  Simplicity                   L
1k        Analysis  Ease of use                  L
1l        Analysis  Preference                   L
2x        Design    Complete design              All
2a        Design    Simplicity                   L
2b        Design    Ease of use                  L
2c        Design    Preference                   L

Table 9
Means and standard deviations of dependent variables.

                      XER                   ERX                   .NET
                      Analysis   Design     Analysis   Design     Analysis   Design
Accuracy   Mean       12.93      11.47      8.91       10.4       5.87       9.07
           Std. Dev   1.63       1.55       2.1        1.18       2.27       1.91
Time       Mean       25.53      16.27      33.33      18.27      28.2       27.27
           Std. Dev   11.51      4.98       11.53      7.25       10.84      8.39
EU         Mean       0.53       0.53       1.6        1.4        1.13       0.74
           Std. Dev   0.64       0.64       0.91       0.74       0.73       0.7
Pref       Freq a     14         15         5          7          9          10

a Count of the number of preferences for the modeling artifact.

In summary, the translation between XML Schema and ER creates a form of semantic mapping that is implemented by the forward and reverse translation algorithms. Fig. 2 demonstrates this mapping by showing the same XER model as in Table 5 and how the XML Schema structure is formed for each element in the XER diagram.

4.3. Prototype implementation of XER

We developed two prototype implementations of XER: the first using VBA (Visual Basic for Applications) built on top of Microsoft Visio®, and the second using XML and XSLT (XML Stylesheet Language Transformations) on top of Dia, an open-source drawing utility. We designed both implementations to create new XER diagrams and to perform up- and down-translations. However, we abandoned the development of the Visio-based system because of limitations we ran into on the size of diagrams. The Dia implementation is also more open and capable of incorporating more current techniques and standards. Moreover, because of the functional nature of XSLT, and the fact that XML traversal is built into XSLT, the XSLT implementation does not need to perform any explicit tree traversal; the traversal of the XML trees is taken care of by the XSLT engine, and the translation works by attaching semantics to nodes and paths in the parse/translation step of the XSLT transformation.

In the Dia-based prototype, users create XER diagrams in Dia using an XER stencil that we created. Users create entities, attributes, relationships and participation constraints using the stencil objects, and may specify additional properties of the objects (such as the occurrence indicators of attributes) using context menus associated with the objects. A screenshot from the implementation showing a partial XER model and the XER stencil is shown in Fig. 3.

5. Empirical evaluation of XER

We conducted an empirical study to determine user reactions with respect to comprehension and design of XML structures using XER. The study had a two-pronged objective: on one hand, we intended to determine whether the use of a conceptual

Table 10
Results of statistical analyses showing measures of significance on the dependent variables from the independent variables task, model, or the interaction of task and model (task ∗ model).

                      MS         F        p
Accuracy
  Model               168.351    40.197   .000
  Task                25.942     10.939   .002
  Task ∗ model        41.805     17.628   .000
Efficiency
  Model               372.211    3.431    .042
  Task                1596.011   23.345   .000
  Task ∗ model        378.544    5.537    .007
Ease of use
  Model               7.078      10.876   .000
  Task                .900       2.100    .155
  Task ∗ model        .300       .700     .502


Listing 1
Algorithm down-translate. Translate an XER diagram into an XML Schema.

01  Algorithm down-translate (XML xermodel, String root): XMLschema {
02    Outputmodel = empty tree;
03    Unmark all entities in xermodel;
04    if (root <> nil) add root to outputmodel;
05    else {
06      Create a new node "root" in outputmodel;
07      targetlist = FindTopLevelUnmarkedEntities(xermodel);
08      Add all entities in targetlist as multiply-occurring children of the root entity;
09      Mark all entities above as modeled;
10    }
11    while (unmarked nodes exist in xermodel) do {
12      sort unmarked nodes in order of priority (highest priority first)
13      node n = select next unmarked node based on node priority
14      convert node to schema element based on conversion rules
15      mark node as converted
16    }
17    return outputmodel;
18  }
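The priority order used in Listing 1 (Section 4.1.1) can be sketched as a sort key. The paper does not fix an exact weighting, so the scheme below is an illustrative assumption:

```python
def priority(entity):
    """Sort key for picking the next node in the down-translation:
    entities owning a weak entity or generalization sub-entities come
    first, then entities with many outgoing 1-M relationships.
    The weights here are an illustrative assumption, not the paper's."""
    return (entity.get("weak_children", 0) + entity.get("sub_entities", 0),
            entity.get("one_to_many", 0))

# Entities from the Table 5 example: question owns the weak entity answer.
entities = [
    {"name": "answer"},
    {"name": "question", "weak_children": 1, "one_to_many": 1},
    {"name": "category", "one_to_many": 1},
]
order = [e["name"] for e in sorted(entities, key=priority, reverse=True)]
print(order)  # ['question', 'category', 'answer']
```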

model was helpful for the purpose of designing solutions to new problems or understanding existing solutions; and on the other hand whether any of the individual tools (XER, ERX, .NET) had a significant advantage over the others in the process of design as well as comprehension. For both purposes, the standard ISO 9241-11 criteria of effectiveness, efficiency and satisfaction of users were used. Since the process of conceptual design is independent of the final platform and medium of implementation, and is usually in a form that is understandable and usable even to people who are not familiar with low-level implementation details, we investigated the use of the above models as a means of design as well as of comprehension.

Studies determining the effectiveness of conceptual models are not uncommon. Batra [35] provides a framework demonstrating that data retrieval using conceptual methods yields superior user performance compared to other querying approaches, and that users tend to make fewer errors in retrieval when using concept-based methods. Using this framework, Batra et al. [36] found that conceptual modeling approaches make a major impact, especially for end-user development of information systems, when comparing the low-level relational model with the conceptual EER model. Chan et al. [37] observe a pattern of improved user performance in accuracy, performance and time with a higher-level model, and generalize this into a theory of the user's cognitive process for querying. The general notion, once again, is that users perform better using a conceptual model. We extend

Listing 2
Algorithm up-translate. Translate an XML Schema into an XER representation.

01  Algorithm up-translate (XML inputschema): XERModel {
02    Outputmodel = empty;
03    Queue = root(inputschema);
04    // Perform breadth-first queue-based traversal on the input schema tree:
05    While (not empty(Queue)) {
06      node = dequeue(Queue);
07      if content(node) = complexType and node not in Outputmodel {
08        if attval(mixed) = 'true' addMixedEntity(Outputmodel, node);
09        else addEntity(Outputmodel, node);
10        case content(node) {
11          sequence:
12            for each element e in sequence {
13              if content(e) is complexType addRelationship(Outputmodel, node, e)
14              enqueue(Queue, e);
15            }
16          all: changeEntityType(Outputmodel, node, UNORDERED);
17          choice: create a generalization hierarchy between elements in choice;
18          any: unsupported;
19        }
20      }
21      else if node is simpleType
22        addAttribute(Outputmodel, parent(node), node);
23      else if node is ref {
24        tnode = findDef(inputschema, node);
25        enqueue(Queue, tnode);
26      }
27    } // while
28    return Outputmodel;
29  }
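The case analysis in the middle of Listing 2 maps schema content models onto XER constructs. A minimal executable sketch of that classification step (function name and simplifications are ours; refs and nested definitions are not handled here):

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def entity_kind(element):
    """Classify a schema element the way the case analysis in Listing 2
    does: mixed -> mixed entity, sequence -> ordered entity,
    all -> unordered entity, choice -> generalization; simple content
    becomes an XER attribute."""
    ctype = element.find(f"{XS}complexType")
    if ctype is None:
        return "attribute"
    if ctype.get("mixed") == "true":
        return "mixed entity"
    for tag, kind in ((f"{XS}sequence", "ordered entity"),
                      (f"{XS}all", "unordered entity"),
                      (f"{XS}choice", "generalization")):
        if ctype.find(tag) is not None:
            return kind
    return "ordered entity"

# The Evaluations example from Table 2 uses an xs:all content model:
evaluations = ET.fromstring(
    '<xs:element name="Evaluations" xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:complexType><xs:all><xs:element name="Technical"/></xs:all>'
    '</xs:complexType></xs:element>')
print(entity_kind(evaluations))  # unordered entity
```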


Fig. 2. Translation and conversion between ER and schema trees.

this notion twofold: we first place this theory in the context of XML, and next, place the effect of the conceptual model at a pre-implementation stage as well as in the post-implementation analysis of existing systems. Thus, our primary research question is: "What will be the performance improvements of XER over other existing methods, either conceptual (e.g., ERX) or graphical schema-based (e.g., .NET)?" To test this research question, we conducted a lab experiment with the following hypotheses.

H1. The use of XER will generate more accurate models for analysis (H1a) and design (H1b) tasks, when compared to ERX and .NET.

H2. The use of XER will generate more efficient models for analysis (H2a) and design (H2b) tasks, when compared to ERX and .NET.

H3. The use of XER will lead to more satisfaction in analysis (H3a) and design (H3b) tasks, when compared to ERX and .NET.

Although technically the comparison should be between XER (or ERX) and XML Schema, the syntax-heavy nature of XML Schema makes any feasible comparison with more visual conceptual models difficult in practice. Hence, we choose

Fig. 3. Screenshot from the Dia implementation of XER showing the XER stencil.


Fig. 4. Empirical study model. (The figure shows two factors, modeling artifact {XER, ERX, .NET} and task {Analysis, Design}, influencing performance, measured as accuracy, efficiency and satisfaction.)

the Visual Studio .NET Schema designer, which is a graphical tool comparable to XER and ERX. More details on the rationale for the choice of the artifacts are presented later in this section.

5.1. Research methodology

The primary goal of this research is to establish a model of user performance in design and analysis tasks for XML using one of three artifacts, two conceptual and one schema-based. Fig. 4 shows a pictorial representation of this model: two types of task (design and analysis) and three modeling artifacts (XER, ERX, .NET) impact user performance.

5.1.1. Subjects

45 students (33 males and 12 females) from a graduate-level Information Systems course on Web-based programming with XML participated in this study. A majority of the students came directly from their bachelor's degree with little more than summer internship and/or part-time work experience in information systems; as a result, the students were in the 23–25 age range. A university Human Subjects Committee approved the study, and the students participating in the experiment received extra course credit and were ensured complete confidentiality. The students had prior exposure to conceptual modeling, XML, and writing XML Schemas.

5.1.2. Dependent variables

The primary dependent variable, accuracy, was measured by the following criteria. For the analysis task, the subjects' answers were evaluated based on correctness and the validity of their reasoning (asked as part of the question). For the design task, the accuracy of subjects was determined by coding each answer based on the following metrics (based on Batini et al. [38]).

Completeness: A model is complete when it represents all relevant features of the application domain.
Completeness is checked in principle by ensuring that all mentioned requirements are represented in the model and all concepts mentioned in the model are present in the requirements.

Correctness: A model is correct when it properly uses the concepts of the conceptual model being tested.

Minimality: A model is minimal if every aspect of the requirements occurs only once in the schema.

Expressiveness: A model is expressive when it represents requirements in a natural way and can be easily understood.

Readability: A model has good readability when it respects certain aesthetic criteria, such as symmetry, that make the diagram graceful.

Extensibility: A model is easily adaptable to changing requirements when it can be decomposed into pieces so that changes are applied within each piece.

Self-explanatory: A model is self-explanatory when a large number of properties can be expressed in the conceptual model itself, without other formalisms (e.g., annotations in natural language).

The second dependent variable, efficiency, was measured by the amount of time the subjects used in each of the tasks. The third dependent variable, user satisfaction, was measured using the ease of use of the modeling artifact and the preference for the modeling artifact compared to XML Schema.

5.1.3. Independent variables

This study used a repeated-measures design in which all participants performed two tasks, namely analysis and design. The independent variable was the modeling artifact: XER, ERX or .NET. The choice of ERX was straightforward, due to its prominence in academic and practical settings [17,18]. Visual Studio .NET was chosen primarily because it is a popular tool for designing XML Schemas, and its diagrams look similar to a conceptual model rather than a visual representation of a tree structure. Also, the subjects were enrolled in an XML development course utilizing .NET tools, and had prior exposure to developing XML Schemas using the .NET tool.


5.2. Tasks

To simulate the two types of tasks, two separate question sets were created. The first question set (design) described a typical data representation scenario that had to be modeled as an XML database. The scenario was chosen to include most of the nuances of XML modeling mentioned earlier, and the subjects had to model it using the randomly assigned artifact (XER, ERX or .NET). The second question set was designed to test the ability of users to analyze a given conceptual model. A conceptual model describing a typical business situation was chosen as part of the analysis task, and each subject received a conceptual model created using the same modeling treatment that they were assigned to, namely XER, ERX or .NET (see Appendix A for the actual models used).

For each task, sub-tasks were carefully chosen to cover the types of errors that users normally make. For the analysis task, a total of 9 subtasks with differing degrees of complexity were created, to determine whether users were able to comprehend the underlying XML structure with only the diagrammatic model shown to them. Both the design and analysis tasks also included questions related to the subjects' judgment regarding the simplicity and ease of use of the model they used, and their preference for that model over the others they were exposed to during training. Table 8 shows the different types of tasks used for both the analysis and design domains. For the analysis domain, the tasks essentially consisted of questions based on the diagrams/schemas distributed to the subjects, where each question covered a specific task type involving the identification of different types of structure in the model. The design task involved the subjects creating a design for a specific problem domain; the same problem was used for all three artifacts. No special software was used for either the design or the analysis tasks. The students had to create their designs with pencil and paper, and in all cases had access to the instruction material and the Internet while performing the tasks. The multiple-choice questions (task sets 1a–1l and 2a–2c) were all filled out on the web. The next section describes the experimental procedure in more detail.

5.3. Experimental procedure

A pilot study was first conducted to determine the feasibility of the final experiment. The pilot study involved three subjects, one for each modeling treatment. The final study was conducted over two days (two class periods). On the first day, the subjects were trained on all three modeling artifacts (XER, ERX and .NET). Subjects received instruction (through a 90-minute lecture) on how to use the above conceptual models to design XML Schemas and on how to comprehend the constructs in the above models. An example scenario different from those used in the experimental materials was used for training purposes; the example incorporated most of the nuances of XML and all the constructs supported by these models. The students were also given documentation and research papers describing the above models, which they were allowed to consult during the actual experiment.

On the day of the tests, the subjects were randomly assigned to one of the modeling treatment conditions. Each subject answered the two question sets described in the previous section as per the modeling treatment condition they were assigned to. The order of sub-tasks (within each task) was kept consistent for all treatment conditions, progressing from simple to difficult. All the questions were answered directly using a web-based interface. Students were instructed to choose the correct answer and explain the rationale behind their choices. The subjects indicated their ease-of-use and preference perceptions on the same questionnaire. The time taken by the students to answer the questions was recorded by the web interface.
For measuring the accuracy of the analysis task, a 5-point Likert scale was used to score the subjects' responses: 1) incorrect with no reasoning, 2) incorrect with partial reasoning, 3) incorrect with good reasoning, 4) correct, and 5) correct with proper reasoning. For the design task, accuracy was coded based on weights assigned to each of the above categories. Two independent evaluators coded the answers. One evaluator evaluated all the responses; the other evaluated a random subset of the responses, yielding a high inter-rater reliability score of 0.93. For the final statistical computations, the first evaluator's scores were used. For the repeated-measures comparison, the scores from both tasks were normalized to a total of 15 points.

6. Results

We designed a 3 × 2 mixed randomized-repeated measures MANOVA (Multivariate Analysis of Variance) [39] analysis for evaluating the results. MANOVA is a generalized form of univariate Analysis of Variance (ANOVA). We applied this measure because it helps answer whether changes in the independent variables have a significant impact on the dependent variables; MANOVA is also used to determine interactions between the dependent and independent variables. In the statistical model (see Fig. 4), we used "task" as the repeated factor, "model" as the randomized factor, and accuracy, efficiency and ease of use as dependent variables [40]. (Preference was not considered for this test, as MANOVA requires continuous variables.) Pillai's Trace (Model: F (6, 82) = 13.725, p < 0.001) indicates that the model type has a significant effect on the three continuous dependent variables (see Table 11 in Appendix B). Normality of the sampling distribution and homogeneity of variance were assessed for each dependent variable; no deviations from normality and no heterogeneity of variance were found. Hence, the alpha level for comparison was set at 0.05.
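As a concrete illustration of the multivariate test statistic reported above, Pillai's trace for a one-way layout can be computed from the between-group (H) and within-group (E) sums-of-squares-and-cross-products matrices. The sketch below uses small made-up scores for the three treatments on two dependent variables; it does not reproduce the study's actual data or its mixed repeated-measures design.

```python
import numpy as np

# Hypothetical scores for three model treatments on two dependent
# variables (say, accuracy and efficiency); values are invented.
groups = {
    "XER":  np.array([[12.0, 30.0], [13.0, 28.0], [11.0, 32.0], [12.5, 29.0]]),
    "ERX":  np.array([[ 9.0, 40.0], [ 8.0, 42.0], [10.0, 38.0], [ 9.5, 41.0]]),
    ".NET": np.array([[ 6.0, 36.0], [ 7.0, 35.0], [ 5.5, 37.0], [ 6.5, 34.0]]),
}

def pillais_trace(groups):
    """One-way MANOVA Pillai's trace: V = tr(H (H + E)^-1)."""
    all_obs = np.vstack(list(groups.values()))
    grand = all_obs.mean(axis=0)
    p = all_obs.shape[1]
    H = np.zeros((p, p))  # between-groups SSCP (hypothesis)
    E = np.zeros((p, p))  # within-groups SSCP (error)
    for X in groups.values():
        m = X.mean(axis=0)
        d = (m - grand).reshape(-1, 1)
        H += X.shape[0] * (d @ d.T)   # weighted outer product of mean deviation
        C = X - m
        E += C.T @ C                  # residual cross-products
    return np.trace(H @ np.linalg.inv(H + E))

V = pillais_trace(groups)
print(f"Pillai's trace: {V:.3f}")
```

The statistic is bounded by min(p, g − 1), so with two dependent variables and three groups it lies between 0 and 2; larger values indicate a stronger group effect.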
Table 9 provides the means and standard deviations of the dependent variables for the 6 cells. To further test the effect of the model artifact on each of the dependent variables, repeated-measures ANOVA was performed [40]. As shown in Table 10, the main effect of model, the main effect of task and the interaction effect of task ∗ model are significant for accuracy and efficiency; only the main effect of model is significant for ease of use. The model type has a significant effect on accuracy (F (2, 42) = 40.197, p < 0.001), efficiency (F (2, 42) = 3.431, p < 0.05) and ease of use (F (2, 42) = 10.876, p < 0.001). Pair-wise comparisons (see Table 12 in Appendix B) further indicate that when


A. Sengupta / Data & Knowledge Engineering 72 (2012) 219–238

compared to ERX, XER generated more accurate results (p < 0.001) and was easier to use (p < 0.001). When compared to .NET, XER generated more accurate results (p < 0.001) and was more efficient (p < 0.05). To compare the effects of the models across the two types of tasks, one-way ANOVA was performed (see Table 13 in Appendix B). Results indicate that for the analysis task, XER generated more accurate results than ERX (p < 0.001) and .NET (p < 0.001) and was easier to use than ERX (p < 0.01); there were no significant differences in efficiency across the models. For the design task, XER generated more accurate results and was more efficient than .NET (p < 0.001), and was significantly easier to use than ERX (p < 0.01).

Within-model comparisons were performed to further elicit the interaction effects of task and model (see Table 14 in Appendix B). Results demonstrate that for XER, the accuracy score is significantly lower for the design task than for the analysis task (F (1, 14) = 15.44, p = 0.0015), whereas for .NET it is significantly higher (F (1, 14) = 24.08, p = 0.0002). Efficiency results for XER and ERX are significantly better for the design task than for the analysis task, whereas for .NET there is no substantial difference. Ease-of-use comparisons were not performed, as the interaction effect for this dependent variable is not significant. One additional comparison, of preference, was performed using repeated-measures logistic regression (using PROC GENMOD in SAS). Pair-wise comparisons indicate that XER is preferred over ERX (Chi-sq: 13.78, p = 0.0002) and .NET (Chi-sq: 6.58, p = 0.0103) (see Table 15 in Appendix B).

6.1. Discussion

The study compared the accuracy, efficiency and satisfaction of XER against the other modeling artifacts, ERX and .NET.
Results demonstrate the superiority of XER over the other modeling artifacts in both the analysis and design tasks for XML applications. Specifically, XER was found to be more accurate for both the analysis and design tasks. For the analysis task, the average accuracy scores were 45% higher compared to ERX and 120% higher compared to .NET. For the design task, the accuracy scores were comparable, even though the differences between XER and .NET were significant. Similar trends are seen for the efficiency and ease-of-use scores. Thus, XER was found to be better suited than ERX and .NET for both analysis and design tasks.

Additional statistical analyses yielded some interesting results. The interaction effect of task and model encouraged comparison of the differences between the tasks within each model, for each dependent variable. XER achieved higher accuracy for the analysis task than for the design task, whereas for the other artifacts the reverse was observed. With respect to efficiency, the design task was performed in 45% less time than the analysis task using ERX, followed by 36% for XER; .NET did not gain significant efficiency for design tasks. These two interactions lead to the conclusion that the other modeling artifacts are probably better suited for design tasks than for analysis tasks.

More than 95% of the subjects in the XER condition preferred it over the XML Schema for both tasks, whereas for the other modeling artifacts the preference was closer to 60% or less. This subjective evaluation demonstrates the usability of XER over the other modeling artifacts. Thus, results indicate that XER performs better on both objective and subjective measures: it generated more accurate, more efficient and more satisfying results in both analysis and design tasks than ERX and .NET.
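The pairwise contrasts summarized above can be illustrated with a Bonferroni adjustment, which simply multiplies each raw p-value by the number of comparisons (capped at 1). The scores below are invented for illustration, and the sketch uses plain two-sample t-tests, whereas the study's Table 12 comparisons are based on estimated marginal means from the fitted model.

```python
from scipy import stats

# Hypothetical accuracy scores (out of 15) for the three treatments.
scores = {
    "XER":  [13.0, 12.5, 14.0, 12.0, 13.5],
    "ERX":  [10.0,  9.5, 11.0, 10.5,  9.0],
    ".NET": [ 8.0,  7.5,  9.0,  8.5,  7.0],
}

def bonferroni_pairwise(baseline, others, data):
    """Two-sample t-tests of baseline vs. each other group,
    with Bonferroni-adjusted p-values (raw p times the number of
    comparisons, capped at 1.0)."""
    m = len(others)
    results = {}
    for g in others:
        t, p = stats.ttest_ind(data[baseline], data[g])
        results[g] = (t, min(p * m, 1.0))
    return results

for g, (t, p_adj) in bonferroni_pairwise("XER", ["ERX", ".NET"], scores).items():
    print(f"XER vs {g}: t = {t:.2f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

The adjustment trades power for control of the family-wise error rate, which is why an adjusted p-value can exceed the nominal 0.05 threshold even when the raw p-value does not.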
However, there are a few limitations to the study, which we discuss below to ensure a complete and unbiased explanation of the results and analyses.

6.2. Limitations

This section discusses potential limitations of the study concerning student subjects, the ordering of tasks and the training procedure. First, student subjects were used for the lab study, which might call into question the generalizability of the research to practical settings. These students were part of a graduate course taught by the primary researcher, and hence represent a convenience sample. The use of student subjects is, by itself, not always appropriate [41], but given their background and knowledge in databases, conceptual modeling and XML, these students represent an appropriate sample of the target population, i.e., the people who are most likely to develop XML databases and schemas.

Second, the ordering of the tasks was kept constant for each of the modeling treatments. To make an effective comparison between the tasks, it is suggested that the ordering of the tasks be blocked (or randomized) across treatments, especially when one employs a repeated-measures design [42]. However, since the slope of the curve differs across treatments, based on post-hoc comparisons of both the accuracy and efficiency results, the consistent ordering of the tasks may not have had a confounding effect on the interpretations.

Third, the experimental procedure was spread over two days, which raises the question of the differential effectiveness of the training with respect to the treatments received by each subject. Subjects might have failed to remember some key issues while solving the tasks, or may have received additional information on one of the treatments between the days. To counter this, subjects were randomly assigned to the treatment conditions on the second day and were allowed to refer to materials while solving the tasks.
However, no post-hoc measures of the effectiveness of the training were collected, and hence it could not be compared across treatments. Thus, even though limitations exist, they do not constitute a serious threat to the findings.

7. Contributions and conclusion

XER provides a novel method for giving XML a conceptual presentation. Typically, XML document structures are represented using DTDs or XML Schema, both heavily textual formats that are often fairly complex to understand in their written or printed


forms. XER is an extension of the ER model that represents all the nuances of XML. It provides mechanisms for forward and reverse translations for going back and forth between the visual and textual views, and also provides a means for quality assessment of XML data designs. The prototype implementation is capable of performing all the transformations presented in this paper, and has shown its robustness by preserving the schema structure through multiple up and down translations. Given that the up and down translations are implemented as XSLT transformations, an XML Schema that is reverse translated into XER and then forward translated to a new schema maintains a mapping between the original XML Schema and the generated schema, and hence no structural information is lost in the translation process. The generated XER model is not always a perfect representation of the underlying domain model, but it provides users with a conceptual view of the model and allows the user to then improve the design based on long-understood modeling principles.

Commercial and industry applications of XER constitute an important effect of the methodology presented in this article. Development of XML applications falls within the overall bounds of software development, and hence a strong design cycle is required for the successful implementation of software projects involving XML data. Although including a conceptual modeling stage in XML data design introduces a new level of design, such a stage is commonplace in other data-driven application development domains such as the relational and object-oriented domains. Studies on relational and object-oriented designs show that a conceptual modeling approach not only improves the designs, but also improves user performance in analysis and querying of the data using such conceptual models [35,36].
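The round-trip property described above (schema to XER and back with no structural loss) lends itself to a simple structural spot-check: compare the set of declared element names before and after translation. The sketch below uses only the standard library; the `round_tripped` string is a stand-in for the output of the actual XSLT pipeline, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

def declared_elements(schema_text):
    """Collect the names of all xs:element declarations in an XML Schema."""
    root = ET.fromstring(schema_text)
    return {el.get("name")
            for el in root.iter(f"{{{XSD}}}element")
            if el.get("name") is not None}

original = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
    <xs:complexType><xs:sequence>
      <xs:element name="book" maxOccurs="unbounded"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

# Stand-in for the schema produced by reverse-translating to XER and
# forward-translating back to XML Schema.
round_tripped = original

assert declared_elements(original) == declared_elements(round_tripped)
print(sorted(declared_elements(original)))
```

A full equivalence check would also compare attribute declarations, content models and occurrence constraints, but even this name-set comparison catches gross structural loss.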
In this study we re-established the same theory in the field of XML application development, a result that should assist XML application developers in improving designs as well as enhancing the design and analysis performance of developers and users. While XER provides a method that designers of small-to-medium-scale applications can use to simplify and expedite the design process, it suffers from some of the same weaknesses as other visual design concepts (ER, UML, flowcharts): visual designs do not scale very well. For very large schemas, XER models can also become fairly large, and hence a multi-level model (e.g., [43]) is a possible future extension. Other future directions of this research involve modeling more intricate XML concepts such as namespaces, XLink and XPointer. A further limitation of XER not addressed in the current implementation is the "ANY" content type, which allows an element to have any other element in the schema in its content; such a design, when literally mapped into XER, would result in a fully connected and unusable model. A project-based study, measuring user performance in team projects involving XML application development with frequent requirement modifications, should determine whether or not XER can be used appropriately in commercial project development settings. However, given the theoretical basis and empirical findings, XER can be a highly useful tool in the presentation of XML data models to developers and users alike, both for the design and for the analysis of XML applications.

Appendix A. Empirical study models and tasks

Models used for XML analysis task

Fig. 5. Library scenario in XER.


Fig. 6. ERX model for library schema.

Fig. 7. Library scenario in VS.NET.


Appendix B. Details of analysis results

Table 11
Pillai's Trace.

               Value   F        p      Partial eta-squared   Observed power
Model          1.002   13.725   .000   .501                  1.000
Task            .485   12.580   .000   .485                  1.000
Task ∗ model    .649    6.565   .000   .324                   .999

Table 12
Model pairwise comparisons.

Measure       (I) Model   (J) Model   Mean difference (I–J)*   p**
Accuracy      XER         ERX           2.545                  .000
                          .NET          4.733                  .000
Efficiency    XER         ERX          −4.900                  .227
                          .NET         −6.833                  .045
Ease of use   XER         ERX           −.967                  .000
                          .NET          −.400                  .185

*Based on estimated marginal means. **p-value adjustment for multiple comparisons: Bonferroni.

Table 13
Task pairwise comparisons.

                                        Analysis task                  Design task
Dependent variable   (I) Model   (J) Model   Mean diff (I–J)*   p      Mean diff (I–J)*   p
Accuracy             XER         ERX          4.023            .000      1.0667          .165
                                 .NET         7.0667           .000      2.4000          .000
Efficiency           XER         ERX         −7.800            .154     −2.000           .717
                                 .NET        −2.667            .795    −11.000           .000
Ease of use          XER         ERX         −1.07             .001      −.87            .004
                                 .NET         −.60             .097      −.20            .712

*Tukey comparison based on observed means.

Table 14
Task comparison within model.

              Model   MS         F       p
Accuracy      XER       32.296   15.44   0.0015
              ERX       33.27     5.78   0.0306
              .NET     153.33    24.08   0.0002
Efficiency    XER     1288        8.78   0.0103
              ERX     3405.067   24.21   0.0002
              .NET       3.066    0.11   0.7492

Table 15
Chi-square analysis with PROC GENMOD.

Variable   Comparison   Chi-sq   Pr > ChiSq
Model                   15.3     0.0005
XER        ERX          13.78    0.0002
           .NET          6.58    0.0103

References

[1] P. Chen, The entity-relationship model — toward a unified view of data, ACM Transactions on Database Systems (TODS) 1 (1) (1976) 9–36.
[2] P.D. Bruza, T.P. van der Weide, The semantics of data flow diagrams, International Conference on Management of Data, McGraw-Hill Publishing Company, 1993, pp. 66–78.


[3] OMG, OMG Unified Modeling Language (OMG UML) Infrastructure Version 2.4, Object Management Group, 2010.
[4] M. Necasky, Conceptual modeling for XML: a survey, in: V. Snasel, K. Richta, J. Pokorny (Eds.), Dateso 2006 Annual International Workshop on Databases, Texts, Specifications and Objects, Desna, Czech Republic, 2006, pp. 40–53.
[5] E. Wilde, Towards conceptual modeling for XML, in: R. Eckstein, R. Tolksdorf (Eds.), Berliner XML Tage, Berlin, Germany, 2005, pp. 213–224.
[6] D.S. Frankel, Model Driven Architecture: Applying MDA to Enterprise Computing, Wiley, 2003.
[7] C. Atkinson, T. Kuhne, Model-driven development: a metamodeling foundation, IEEE Software 20 (5) (2003) 36–41.
[8] I. García-Magariño, R. Fuentes-Fernández, J.J. Gómez-Sanz, Guideline for the definition of EMF metamodels using an entity-relationship approach, Information and Software Technology 51 (8) (2009) 1217–1230.
[9] A. Le Hors, P. Le Hegaret, J. Robie, L. Wood, G. Nicol, M. Champion, S. Byrne, Document Object Model (DOM) Level 3 Core Specification — Version 1.0, W3C, available at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/, 2004.
[10] M. Fernandez, A. Malhotra, J. Marsh, M. Nagy, N. Walsh, XQuery 1.0 and XPath Data Model, W3C Working Draft, available at http://www.w3.org/TR/query-datamodel/, 2003.
[11] H. Chen, H. Liao, A survey to conceptual modeling for XML, 2010 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), IEEE, Chengdu, China, 2010, p. 473.
[12] R. Dos Santos Mello, C. Heuser, A rule-based conversion of a DTD to a conceptual schema, Proceedings, ER 2001, also as Lecture Notes in Computer Science 2224, 2001, pp. 133–148.
[13] M. Kay, XSL Transformations (XSLT) Version 2.0, World Wide Web Consortium, available at http://www.w3.org/TR/2007/REC-xslt20-20070123/, 2007.
[14] J. Groppe, S. Groppe, Filtering unsatisfiable XPath queries, Data and Knowledge Engineering 64 (1) (2008) 134–169.
[15] S. Groppe, J. Groppe, Output schemas of XSLT stylesheets and their applications, Information Sciences 178 (21) (2008) 3989–4018.
[16] M. Mani, D. Lee, R.R. Muntz, Semantic data modeling using XML schemas, Proceedings of the 20th International Conference on Conceptual Modeling: ER 2001, Springer-Verlag, 2001, pp. 149–163.
[17] G. Psaila, ERX: a conceptual model for XML documents, ACM Symposium on Applied Computing (SAC 2000), Como, Italy, 2000, pp. 898–903.
[18] G. Psaila, From XML DTDs to entity-relationship schemas, Conceptual Modeling for Novel Application Domains, Lecture Notes in Computer Science, vol. 2814, 2003, pp. 378–389.
[19] C. Combi, B. Oliboni, Conceptual modeling of XML data, Proceedings of the 2006 ACM Symposium on Applied Computing, ACM, Dijon, France, 2006, pp. 467–473.
[20] B.F. Lóscio, A.C. Salgado, L.d.R. Galvão, Conceptual modeling of XML schemas, Fifth International Workshop on Web Information and Data Management, ACM, New Orleans, Louisiana, USA, 2003.
[21] A. Sengupta, S. Mohan, R. Doshi, XER — extensible entity relationship modeling, in: Proceedings of the XML 2003 Conference, IDEAlliance, Philadelphia, PA, 2003.
[22] R. Conrad, D. Scheffner, J. Freytag, XML conceptual modeling using UML, 19th International Conference on Conceptual Modeling — ER 2000, also as Lecture Notes in Computer Science 1920, Salt Lake City, Utah, 2000, pp. 558–571.
[23] N. Routledge, L. Bird, A. Goodchild, UML and XML Schema, in: X. Zhou, L. Goodchild (Eds.), Thirteenth Australasian Database Conference (ADC2002), Melbourne, Australia, 2002.
[24] L. Feng, E. Chang, T. Dillon, A semantic network-based design methodology for XML documents, ACM Transactions on Information Systems (TOIS) 20 (4) (2002) 390–421.
[25] S. Hartmann, S. Link, T. Trinh, Constraint acquisition for entity-relationship models, Data and Knowledge Engineering 68 (10) (2009) 1128–1155.
[26] W3C, XML Schema Tools, available at http://www.w3.org/XML/Schema-Tools, 2010.
[27] Tibco, XML Authority Version 2.0, XML.com, available at http://www.xml.com/pub/p/258, 2003.
[28] MicroStar, Near and Far Designer, available at http://www.xml.com/pub/p61, 2003.
[29] Microsoft, Microsoft Visual Studio .NET, available at http://www.microsoft.com/visualstudio/en-us, 2010.
[30] H.S. Thompson, Normal form conventions for XML representations of structured data, in: XML 2001, IDEAlliance, Orlando, FL, 2001.
[31] T. Connolly, C. Begg, Database Systems, 4th ed., Addison Wesley/Pearson Education, 2005.
[32] A. Silberschatz, H. Korth, S. Sudarshan, Database System Concepts, 5th ed., McGraw-Hill Science/Engineering/Math, 2005.
[33] G. Shanks, E. Tansley, R. Weber, Representing composites in conceptual modeling, Communications of the ACM 47 (7) (2004) 77–80.
[34] M. Genero, G. Poels, M. Piattini, Defining and validating metrics for assessing the understandability of entity-relationship diagrams, Data and Knowledge Engineering 64 (3) (2008) 534–557.
[35] D. Batra, A framework for studying human error behavior in conceptual database modeling, Information & Management 25 (3) (1993) 121–131.
[36] D. Batra, J.A. Hoffler, R.P. Bostrom, Comparing representations with relational and EER models, Communications of the ACM 33 (2) (1990) 126–139.
[37] H. Chan, K. Siau, K.-K. Wei, The effect of data model, system and task characteristics on user query performance: an empirical study, SIGMIS Database 29 (1) (1997) 31–49.
[38] C. Batini, S. Ceri, S. Navathe, An Entity Relationship Approach, Benjamin Cummings, Menlo Park, California, USA, 1992.
[39] J.P. Stevens, Applied Multivariate Statistics for the Social Sciences, Lawrence Erlbaum, Mahwah, NJ, 2002.
[40] B.G. Tabachnick, L.S. Fidell, Computer Assisted Research Design and Analysis, Allyn and Bacon, MA, 2001.
[41] A. Dennis, J. Valacich, Conducting research in information systems, Communications of the AIS 7 (2001) 5.
[42] J. Cohen, P. Cohen, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Lawrence Erlbaum Associates, NJ, 1983.
[43] M. Gandhi, E. Robertson, D. Van Gucht, Levelled entity relationship model, in: P. Loucopoulos (Ed.), Entity-Relationship Approach — ER '94, Springer, 1994, pp. 420–436.

Dr. Arijit Sengupta is Associate Professor of Information Systems and Operations Management at the Raj Soin College of Business at Wright State University, Dayton, Ohio. He received his Ph.D. in Computer Science from Indiana University. Prior to joining Wright State, Dr. Sengupta served as faculty at the Kelley School of Business at Indiana University and the Robinson College of Business at Georgia State University. Dr. Sengupta's current primary research interest is the efficient use and deployment of RFID (Radio Frequency Identification) for business applications; he founded SmartRF Solutions, a startup company commercializing systems using RFID technology. His other research interests are in databases and XML, specifically in modeling, query languages, data mining, and human-computer interaction. He has published over 30 scholarly articles in leading journals and conferences, and has authored several books and book chapters.