Data modeling: Description or design?



Information & Management 49 (2012) 151–163

Contents lists available at SciVerse ScienceDirect

Information & Management journal homepage: www.elsevier.com/locate/im

Data modeling: Description or design?

Graeme Simsion, Simon K. Milton *, Graeme Shanks

Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia

ARTICLE INFO

Article history:
Received 13 November 2010
Received in revised form 15 November 2011
Accepted 25 January 2012
Available online 14 February 2012

Keywords:
Conceptual data modeling
Practitioner study
Design
Analysis

ABSTRACT

Data modeling for database creation has generally been considered to be a descriptive process: the real world is observed and represented in a conceptual model that is then transformed into a logical structure for a database. This is reflected in prescriptive methods and is the dominant assumption in most studies. However, data modeling can also be considered a type of design with negotiable requirements, a creative process, and many workable solutions. Our paper discusses empirical results from almost 500 practitioners on three continents comparing data modeling to design. We found that data modeling, as practiced, was better characterized as design.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Alternative views of data modeling

Data modeling is one of the most critical activities in the implementation of an IS; it has been characterized as a process of reality mapping. This characterization has occasionally been challenged from a philosophical perspective, from observations of practice, and from empirical evidence. Nevertheless, the descriptive characterization also dominates the practitioner literature.

In a descriptive activity, a set of artifacts may be created, and creating them might well be called design, yet this may not be important enough to the overall result to characterize the entire activity as design. In data modeling, there is choice in the selection of components (typically entities, relationships and attributes) used to represent some part of reality. The difference between description and design lies in whether this selection is a trivial part of the process compared with understanding the Universe of Discourse (UoD) (the descriptive type), or whether it is the essence of the process (the design type).

1.2. Previous empirical research

We know little about how experienced data modelers approach their work or about the models that are produced for real business applications. This is because most studies have assumed the process to be descriptive and have not involved practitioners.

The descriptive characterization is embodied in most empirical studies through the use of a gold standard: a single correct solution devised by the researcher, who often embedded entity and relationship names in descriptions, thus constraining the modeling abstractions. The “business requirements” amounted to a plain-language description, and the participant’s task was to translate the description back into the original diagram. For example, “An employee can report to only one department. Each department has a phone number.” Tasks showing these two traits mainly tested facility with modeling formalisms. Yet it is common to see conclusions that novice designers did not run into much trouble in modeling entities and attributes. In the context of the research task, modeling entities may have meant little more than identifying nouns in the description. The use of simple models and prescriptive instructions limited the scope of the design.

Most empirical studies have used students as participants, which limited the difficulty of the problems posed. Of the total of 3210 participants across 59 studies that we surveyed, only 147 in nine studies had more than one year’s industry experience of data modeling. Thus most studies used unrealistically simple data models.

Nevertheless, some studies have used experienced data modelers and have uncovered design behavior. Comparisons of novice and expert data modelers have revealed behaviors characteristic of designers (attempting to gain a holistic understanding, categorization of problems, pattern re-use) in the experts but not in the novices.

1.3. The research question

* Corresponding author. Fax: +61 3 9349 4596. E-mail address: [email protected] (S.K. Milton).
0378-7206/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.im.2012.01.003

Our research question was: Is data modeling better characterized as description or design? Here, data model refers to a model of a specific UoD (e.g. the data model of ABC corporation’s human resources operations), while data modeling refers to the set of activities required to specify a conceptual schema that will be transformed into a database schema, but prior to its transformation into a specific DBMS data definition. Specifically, we examined the process of data modeling that resulted in a database design that could be implemented in a relational DBMS. We did not include other purposes of conceptual data modeling (e.g., its use in IS planning).

This is an important research question for at least three reasons:

(1) Most data modeling research assumes the descriptive characterization, notably in the design of experiments and in the application of ontology [2,3]. If data modeling is, in fact, design, research results need to be reinterpreted in that light.
(2) Data modeling education should include expert-level practice. Creative thinking and evaluation of alternative designs are intrinsic to design processes.
(3) If data modeling is seen as a design activity rather than description, then data modeling methods should be updated to reflect a design process and explicitly include creative thinking and comparative evaluation.

2. The ideal type of ‘design’

Design is the ideal type against which we measure data modeling. Its essence has been synthesized in the form of a list of some of the important characteristics of design problems and solutions, and the design process itself. These characteristics are intended to typify design and thus to differentiate it from description. The list is not exhaustive, and the characteristics are interrelated. Collectively, they provide an overall picture of design. The characteristics (shown in Table 1) were grouped into fourteen properties within three dimensions: Problems, Solutions (i.e., Products), and Process, following Lawson’s [1] properties of design.

3. Research design

The framework provides a basis for expanding the research question to 11 research sub-questions (RSQs) (see Table 2).
Scope seeks to clarify what practitioners mean by data modeling. The remaining three dimensions determine whether data modeling practice has the properties of the design type: Problem deals with the negotiability of data modeling requirements, Process examines whether data modeling is creative, and Product deals with the diversity in data models produced by experienced practitioners in response to a task; such diversity would suggest that data modeling is a design process.

Table 1
Lawson’s properties of design.

Design problems
1. Design problems cannot be comprehensively stated
2. Design problems require subjective interpretation
3. Design problems tend to be organized hierarchically

The design process
1. The process is endless
2. There is no infallibly correct process
3. The process involves finding as well as solving problems (including creativity)
4. Design inevitably involves subjective value judgments
5. Design is a prescriptive activity
6. Designers work in the context of a need for action

Design solutions
1. There are [sic] an inexhaustible number of different solutions
2. There are no optimal solutions to design problems
3. Design solutions are often holistic responses
4. Design solutions are a contribution to knowledge
5. Design solutions are parts of other design problems

The RSQs were addressed with a combination of surveys, laboratory studies, and interviews. The selection of the mode for each is shown in Table 3. Three methods were chosen to provide multiple sources of data for assessing the practice of data modeling against the ideal type of design: semi-structured interviews (with influencers of practitioners, or thought leaders); surveys (to collect the perceptions of experienced data modelers about data modeling products, processes, and problems); and laboratory studies (in which participants completed modeling tasks that were examined for evidence of diversity, style, and patterns in data modeling).

The surveys and laboratory studies were incorporated into 12 data modeling seminars and workshops for experienced practitioners delivered by the first author in the US, UK, Scandinavia and Australia between May 2002 and November 2004 (see Appendix A for a summary of participants). Figs. 1–4 summarize the responses to demographic questions. There was a strong correlation between the two experience measures (γ = 0.65, p < 0.0005). Our study was the largest currently published; it included 381 participants with at least one year of data modeling experience. The minimum number to complete any task was 55.

Three other groups participated in the research. The practitioner thought leaders and expert model evaluators were purposive samples. Architects and accountants (who provided a benchmark for the Characteristics of Data Modeling component) were recruited from personal and professional contact lists.
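The correlation between the two experience measures is reported as a single coefficient. If that coefficient is Goodman and Kruskal's gamma (an assumption; the text does not name the statistic), it can be computed from counts of concordant and discordant pairs. The sketch below is illustrative only: the function name and the ordinal codes are hypothetical and are not study data.

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Return (C - D) / (C + D), where C and D count concordant and
    discordant pairs. Pairs tied on either variable are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ordinal codes (not study data): a years-of-experience band
# and a number-of-models band for eight imaginary participants.
years_band = [1, 1, 2, 2, 3, 3, 4, 4]
models_band = [1, 2, 1, 3, 2, 4, 3, 4]
print(goodman_kruskal_gamma(years_band, models_band))
```

Gamma suits banded survey answers because it uses only the ordering of the categories, not the distances between them.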

Table 2
Research sub-questions (RSQs).
Columns: Type dimension / Name / Research sub-question

General (applying to all three dimensions)
  Scope: (RSQ1) What do data modeling practitioners believe is the scope and role of data modeling within the database design process?
  Importance: (RSQ2) Is the description/design question considered important by data modeling practitioners?
  Espoused Beliefs: (RSQ3) What are the (espoused) beliefs of data modeling practitioners on the description/design question?

Problem
  Perception of Problems: (RSQ4) Are data modeling problems perceived as design problems by data modeling practitioners?

Process
  Methods: (RSQ5) Do database design methods used in practice support a descriptive or design characterization of data modeling?
  Perception of Processes: (RSQ6) Are data modeling processes perceived as design processes by data modeling practitioners?

Product
  Perception of Products: (RSQ7) Are data modeling products perceived as design products by data modeling practitioners?
  Diversity in Conceptual Modeling: (RSQ8) Will different data modeling practitioners produce different conceptual data models for the same scenario?
  Diversity in Logical Modeling: (RSQ9) Will different data modeling practitioners produce different logical data models from the same conceptual model?
  Patterns: (RSQ10) Do data modeling practitioners use patterns when developing models?
  Style: (RSQ11) Do data modeling practitioners exhibit personal styles that can be identified in the data models that they create?


Table 3
Research methods used.
Columns: Research component / Method / Sub-questions addressed (listed by name)

Interviews with thought-leaders / Interviews / Importance (RSQ2), Espoused Beliefs (RSQ3), Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Scope and stages / Survey / Scope (RSQ1), Methods (RSQ5)
Espoused positions on data modeling / Survey / Importance (RSQ2), Espoused Beliefs (RSQ3)
Characteristics of data modeling / Survey / Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Diversity in conceptual modeling / Laboratory study / Diversity in Conceptual Modeling (RSQ8), Style (RSQ11)
Diversity in logical modeling / Laboratory study / Diversity in Logical Modeling (RSQ9), Style (RSQ11)
Style in data modeling / Laboratory study / Style (RSQ11), Diversity in Conceptual Modeling (RSQ8), Diversity in Logical Modeling (RSQ9), Patterns (RSQ10)

4. Results

4.1. Interviews with thought leaders

Interviews were held with seventeen “thought leaders” (see Appendix B); they were conducted by the first author. These were used to confirm the currency of the research question and to clarify which aspects of the research would shed most light on the questions. All interviewees chose acknowledgment over anonymity. Interviewees were asked for their views on the research question and asked to comment on three aspects of the ideal type:

1. Whether the data modeler should challenge business requirements (Problem).
2. Whether data modeling is a creative activity (Process).
3. Whether data modeling problems have a single right answer (Product).

Analysis was based on videotape transcription and confirmation; organization of statements by common meanings; organization of meanings into themes, and then into the research framework; synthesis of views and positions; and participant review of the findings. Opinions on the research question, expressed directly and in discussion, varied widely and are summarized in Table 4. Positions were starkly articulated:

“Data modeling is not a process of creation, it is a process of discovery.”
“Data modeling is certainly a descriptive activity, it’s not a design activity.”
“I believe rabidly and intensely that it’s a design process.”
“We’re designing (but) some of the people that we work with see us as scribes.”

4.1.1. Problem: are business requirements negotiable?

Proponents of the design characterization answered that business requirements were negotiable, and that modelers should be active in exposing new ways of doing business. One view was that the business does not know what is best for it. Modelers were seen as being able to make suggestions, and to bring in their own business knowledge to provide new perspectives. Interviewees favoring the description characterization supported the primacy of the business in determining its data model and the danger of the modeler taking that role: “What we’re modeling is what the domain expert says is right.”

4.1.2. Process: is data modeling a creative activity?

Proponents of design believed that they were “creating” the objects in the model: “10–15% of entities are obvious and everyone agrees with them, but (beyond that) the actual choice of entities requires a lot of imagination and creativity.” Some supporters of the description characterization recognized a role for creativity in peripheral areas like the layout and presentation of the model, or in the way of understanding the business.

Fig. 1. Occupation of participants.

Fig. 2. How participants learned (x-axis: method; y-axis: number of participants).

Fig. 3. Experience: number of models (x-axis: number of models; y-axis: number of participants).

Fig. 4. Experience: years (x-axis: years of experience; y-axis: number of participants).


Table 4
Interviewees’ overall positions.
Columns: Position / Number of interviewees

Strongly supports description / 5
Somewhat supports description / 1
Supports neither position more strongly than the other / 3
Position depends on modeling formalism/language / 1
Somewhat supports design / 3
Strongly supports design / 4

4.1.3. Product: one right answer?

This generated strongly conflicting responses. Some who believed that requirements were negotiable were less sure that the models would vary once requirements were settled. Some held that there is a right model, allowing for variation only in notation and the naming of objects. Those who saw differences because modeling was design spoke in terms of utility vs truth. A few raised the theoretical position of choice in classification, but most who argued against the one right answer drew on personal experience. Instructors noted that students produced different workable models in response to case study scenarios. The difficulty of integrating different databases within and across organizations was cited as evidence that different workable models could be implemented for the same data.

The trade-off between level of generalization and enforcement of business rules was a central theme for those who believed in design. Three groups emerged: the literalists (concepts should be modeled as used in the business); the moderate abstractors (some generalizations); and the rule removers (deliberately removing business rules for representation elsewhere). Some saw these as stylistic preferences of modelers.

4.1.4. Summary

Thought leaders considered the question important. Further, they were divided on whether data modeling was design or description. Their perceptions of whether the problems, processes, and products were best characterized as design or description also varied.

4.2. Survey: scope and stages

There were two reasons for this stage. We sought to determine what practitioners meant by data modeling before designing and interpreting survey questions, and we sought responses about parts of the database design process to see whether they could individually be characterized as design or description.

Our questionnaire therefore asked participants to nominate the stages in database design, and to map 26 elementary activities (such as normalization and definition of indexes) against the stages. The elementary activities served as common reference points for comparing the higher-level stages nominated by different participants. Participants were also asked to define data modeling in terms of the stages that it covered. Respondents were attendees at advanced data modeling classes in London (25) and Los Angeles (30).

We found broad agreement on the scope, stages, and activities considered to be data modeling activities. After consolidation of names, five stages accounted for 201 (83%) of the total of 243 activities cited: (1) Business Requirements Analysis; (2) Conceptual Data Modeling; (3) Logical Data Modeling; (4) Physical Data Modeling and/or Physical Database Design; and (5) Post-database-design Activities (optional). The two tasks in (4) were found to contain essentially the same activities. A stage simply named “Data Modeling” was listed on nine occasions and was in the same place in the sequence as “Logical Data Modeling”. Forty-one (75%) responses matched this overall pattern, and a further nine (16%) matched the pattern except for the omission of “Business Requirements Analysis”. Eighty percent of responses put the activities of data modeling and responsibilities of the data modeler as (at least) the specification of an initial conceptual schema, to meet agreed business requirements, prior to any modifications to improve performance.

Thus there was a broad consensus on the overall composition and sequence of the database design process. The description/design debate was therefore unlikely to be a consequence of different definitions of data modeling.

4.2.1. Questions measuring stages against the ideal type ‘Design’

The data modeling scope and stages survey provided answers to the following questions.

Question 1 – Is a business requirements stage (not including entity, relationship, attribute identification) a necessary part? Including this stage prior to identifying key model components runs counter to the descriptive characterization, in which the UoD is mapped directly onto the data model. Requirements statements can be seen as problem statements to which the model provides a solution. A business requirements stage was nominated by 84% of our respondents, of whom 46% saw it as the responsibility of the data modeler (solely or jointly), 65% the analyst, and 39% the user.

Question 2 – Is entity/relationship/attribute identification part of the business requirements stage? If these are established before data modeling starts, then the data modeler cannot “create” them, and the description characterization is supported. Only 5% of respondents included identification of entities, relationships, and attributes in a business requirements stage.

Question 3 – Are there separate stages for DBMS-independent (conceptual) modeling and DBMS-specific (logical) modeling? Data modeling can be seen as an implementation-independent descriptive stage followed by transformation to a logical data model, supporting the descriptive characterization. In contrast, designers are constantly conscious of the implementation environment or “medium”. If data modeling is design, we would expect the two stages to blur and often combine. Respondents grouped most conceptual schema specification tasks into one stage, generally called Logical Data Modeling, with only entity and relationship identification being part of a Conceptual Data Modeling stage. Respondents effectively used the term conceptual modeling to describe a preliminary “sketch plan” and not a rigorous and complete product for mechanical translation into a conceptual schema. This is in line with practitioner terminology.

Question 4 – Where does view integration happen? In the descriptive characterization, where there is one right answer, view integration is relatively simple. In the design characterization, models may differ in complex ways and their integration becomes a process of negotiation. View development and integration was a median task in the sequence of the tasks classed as data modeling, and it was considered ongoing rather than a discrete terminal task. No respondent nominated it as a discrete stage.

Question 5 – Where is external schema specification located and who is responsible? One approach to database design uses external schemas (views) to replicate user views. In the descriptive characterization, these are mappings from the conceptual schema that originally integrated them, and the data modeler will be responsible for their definition. If, instead, external schemas are a tool for managing data independence, programming needs, and security rather than reproducing user views, we would expect them to be defined later in the process. This is indeed what we found, and this activity was seen to be outside the primary responsibility of the data modeler.


4.2.2. Summary

In questions asked specifically about important aspects of the process that could illuminate whether data modeling is design or description, practitioners offered opinions that were consistent with the design type.

4.3. Survey: espoused positions on data modeling

We surveyed attendees at a one-day advanced data modeling seminar at an international practitioner convention in an attempt to determine their position on the description/design dichotomy by asking them two questions. First, we asked an open question: What is data modeling? Second, we asked a closed question: Which better describes data modeling: (a) Describing the data requirements of an organization or part of an organization? or (b) Designing data structures to meet the requirements of an organization or part of an organization? Ninety-three respondents answered both questions.

The questions were given after participants had completed a data modeling task developing a model from a business scenario. Participants were told: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”. They were not shown the closed question until after they answered the open question.

Two researchers coded the responses to the open question as neutral, somewhat, or strongly, depending on the level of support for either the design characterization (coded as 4 and 5) or the description characterization (coded as 1 and 2). The distribution of coded responses to the open question is shown in Fig. 5. Neutral was subdivided into both or neither (3 and 0 respectively). Inter-coder reliability (α) was 0.82 (0.7 was considered acceptable). Responses to the closed question are shown in Fig. 6.

Fig. 5. Coded responses to open question (categories: Neither, Strong Description, Somewhat Description, Both, Somewhat Design, Strong Design; y-axis: number of participants).

Fig. 6. Responses to closed question (categories: Description, Design, Both; y-axis: number of participants).

The word design (as a verb) was used in only six responses to the open question. Only 17% of responses to the open question did not embody a position. There was no significant correlation between responses and experience, method of learning, or job position.

Fig. 7 compares responses to the open and closed questions: the vertical axis shows the break-up of responses to the Closed Question for the participants who gave each of the possible (coded) responses to the Open Question. Thus, participants favoring the design characterization in the open question mostly maintained that view in the closed question, but a significant number of participants whose open question answers supported description reversed their position in the closed question, yielding only a moderate correlation (K = 0.34, p = 0.007) between the open and closed questions when both and neither were excluded. This difference suggested that some responses to the open question may have been influenced by taught definitions of data modeling (which favor a description characterization), whilst the closed question demanded some reflection.

Fig. 7. Source of responses to closed question (legend: Both, Design, Description).

A facilitated discussion followed response collection. A show of hands reporting closed question answers caused surprise. The discussion which followed established that many participants had expected their own response to dominate. A second show of hands showed a close-to-unanimous view that the design/description distinction was real and important.

4.3.1. Summary

Data modeling practitioners espoused beliefs that data modeling was description in response to the open question and were evenly split between description and design in response to the closed question. The practitioners confirmed that researching the design/description dichotomy was of importance.

4.4. Survey: characteristics of data modeling

This part of our survey sought a deeper understanding of theory-in-use by addressing practitioners’ perceptions of characteristics of data modeling problems, products and processes, to address RSQs 4, 6, and 7 (see Table 3). The 25 questions shown in Appendix C used a five-point Likert scale.

The survey was benchmarked using architects and accountants. They were chosen because architecture is a design discipline whereas accounting is a process of recording, classifying, reporting and communicating, and thus fits a descriptive characterization. Data modelers have been compared with both architects and accountants. Responses were received from a snowball sample of 38 accountants and 21 architects, all based in Australia. The results for these two professions were then used as benchmarks for the data modeling results.

The survey was administered to 266 attendees at seven seminars targeting data modeling practitioners in the USA, Australia, UK, and Scandinavia (the smallest had 20 attendees and the largest 90). Participants were told, before completing the survey: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”. No significant differences were found in the results across the seminars.

Scale reliability (Cronbach’s α) was 0.73. The Corrected Item-Total Correlation (CITC) was positive (showing that the questions were measuring the same underlying construct in the same direction) for all but one item. The exception was: “Data modeling is prescriptive rather than descriptive” – this had a CITC value of 0.15. Subsequent discussion has suggested that some respondents had interpreted “prescriptive” as applying to the modeling process rather than product (paradoxically supporting a descriptive characterization).


Table 5
Summary of responses to the data modeling questionnaire.
Columns: Design mean / Dimension / Property / Property mean

Design mean (overall) = 3.75; t(267) = 31, p < 0.0005

A. Design problems: mean = 4.11; t(317) = 37, p < 0.0005
  A.1 Design problems cannot be comprehensively stated / 4.09
  A.2 Design problems require subjective interpretation / 4.08
  A.3 Design problems tend to be organized hierarchically / 4.18

B. Design products: mean = 3.60; t(302) = 23, p < 0.0005
  B.1 There are an inexhaustible number of different solutions / 4.04
  B.2 There are no optimal solutions to design problems / 3.90
  B.3 Design solutions are often holistic responses / 2.55
  B.4 Design solutions are a contribution to knowledge / 3.94

C. The design process: mean = 3.49; t(299) = 16, p < 0.0005
  C.1 The process is endless / 4.00
  C.2 There is no infallibly correct process / 3.34
  C.3 The process involves finding as well as solving problems / 4.00
  C.4 Design inevitably involves subjective value judgments / 3.65
  C.5 Design is a prescriptive activity / 2.66
  C.6 Designers work in the context of a need for action / 3.35
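The t statistics in Table 5 come from one-sample tests of each mean against the neutral Likert score of 3. A minimal sketch of that calculation follows; the scores are hypothetical, not survey data, and the p-value (which would come from the t distribution with n − 1 degrees of freedom) is left out because the Python standard library does not provide it.

```python
from math import sqrt
from statistics import mean, stdev

NEUTRAL = 3  # midpoint of the five-point Likert scale


def one_sample_t(scores, mu0=NEUTRAL):
    """Return (t, degrees_of_freedom) for H0: population mean == mu0."""
    n = len(scores)
    standard_error = stdev(scores) / sqrt(n)
    return (mean(scores) - mu0) / standard_error, n - 1


# Hypothetical respondent-level scores for a single property.
scores = [4, 4, 5, 3, 4]
t, df = one_sample_t(scores)
print(f"t({df}) = {t:.2f}")
```

A positive t indicates a mean above the neutral score and a negative t a mean below it.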

Table 5 shows the mean scores (maximum score of 5) for the survey, at the Property, Dimension, and Overall levels. The one-sample t-test results indicate the significance of the difference between the mean and the neutral score of 3. Fig. 8 shows the frequency distribution of the Overall score, showing that most values were above the neutral value. These results show that modeler-espoused characteristics fit a design characterization. Data modelers also scored significantly higher than accountants in all dimensions (p < 0.01), and significantly higher than architects in the problem dimension (p < 0.01) and overall (p = 0.02).

4.4.1. Summary

The results clearly showed that participants did perceive that the problems, products, and processes of data modeling fit the design ideal type.

4.5. Laboratory study: diversity in conceptual models

Our laboratory study examined product diversity in the conceptual data models developed by experienced data modelers for a real-world problem. The task was to develop a conceptual model for a medical research database from a description of a real business requirement. Participants were attendees at an international (practitioner-oriented) data management conference in North America; they viewed a video of the project sponsor and data administrator describing the requirements, and were given a transcript of it (see Appendix D). Participants were then given 25 min to complete the task and a further 5 min to complete a questionnaire about the process.

Ninety-three models were received. Forty-nine participants responded that they understood the problem fairly well or very well, did not find it very difficult, made no guesses or only trivial guesses, and did not think their models would be much different if more time was allowed. The first author judged 66 of the models to be workable; they were submitted by the more experienced participants (8.5 years vs 3.5 years; two-tailed t(87) = 4.5, p < 0.001), who found the problem less difficult (difficulty rating on a 5-point Likert scale 2.6 vs 3.2; two-tailed t(89) = 3.4, p = 0.001). Eighty-eight percent used some variant of the “crow’s foot” notation. A reference set of standard entity names and definitions was synthesized to facilitate comparison of models.

4.5.1. Assessment of diversity

Seven measures of diversity were used. These are neither orthogonal nor exhaustive but are indicative of diversity.

Diversity measure no. 1 – Participants’ perceptions of difference: On completing their model, participants paired off and compared their models. One percent of participants perceived the models as identical, six percent as identical except for naming or agreed errors, 53% as structurally different in minor ways, and 40% as structurally different in important ways.

Diversity measure no. 2 – Number of entities: Fig. 9 shows the frequency distribution of entity counts from the models. Subtypes were excluded from the count to improve comparison with the 77% of models which did not use subtypes.

Diversity measure no. 3 – Variety of entity names: The 93 models had 291 different entity names after removing synonyms. In addition to any unrecognized synonyms, this figure includes some homonyms, whose different uses were evident from the context of their relationships with other entities.

Diversity measure no. 4 – Use of nouns from the description: The frequency distribution of entity names matching nouns from the interview transcripts is shown in Fig. 10. Of the 291 different names given to entities, comparatively few came directly from the problem description; most had been invented by the modeler.

Diversity measure no. 5 – Variability in construct use: One concept was represented in some models as an entity (52 times) and in others as a relationship (5 times). Three concepts were shown in some models as entities and in others, explicitly or implicitly, as attributes. Correlation between the three decisions was negligible in two cases and weakly positive but not significant in the third (K = 0.14, p = 0.2).

[Figure: bar chart titled "Frequency Distribution of Overall Score"; x-axis: Overall Score (2.9 to 4.5); y-axis: Frequency.]

Fig. 8. Frequency distribution of overall score – 3 is neutral.

t(330) = 31, p < 0.0005; t(335) = 31, p < 0.0005; t(342) = 24, p < 0.0005; t(339) = 23, p < 0.0005; t(340) = 21, p < 0.0005; t(334) = 6.4, p < 0.0005; t(320) = 19, p < 0.0005; t(342) = 23, p < 0.0005; t(337) = 4.9, p < 0.0005; t(341) = 18, p < 0.0005; t(331) = 10, p < 0.0005; t(323) = 6.7, p < 0.0005; t(337) = 5.1, p < 0.0005

[Figure: bar chart titled "Total Number of Entities"; x-axis: Number of Entities (3 to 18); y-axis: Frequency.]

Fig. 9. Total number of entities in each model.


[Figure: bar chart titled "Frequency of Nouns"; x-axis: Entity Names Matching Nouns (0 to 7); y-axis: Frequency.]

Fig. 10. Number of entities corresponding to nouns in the problem description.

Diversity measure no. 6 – Level of entity generalization: Three concepts were represented at different levels of generalization, though there were significant correlations between the three generalization decisions (0.42 ≤ γ ≤ 0.77, p < 0.02), suggesting that modelers bring personal styles to the generalization decision.

Diversity measure no. 7 – Holistic difference (expert assessed): Nineteen experts assessed ten selected standardized models for the viability of their implementation. Standardization involved providing a common name for the same entities, removing entities outside the defined scope, and presenting the models in a common format. Diversity was supported if more than one model was judged as being practically viable. The expert modelers (with a minimum of 15 years’ experience) gave scores for overall quality, understandability, and flexibility on a 5-point Likert scale. The inter-rater reliability, measured by Cronbach’s α, was 0.92 for overall quality, 0.84 for understandability, and 0.73 for flexibility (0.7 would be considered acceptable). The mean understandability score across all evaluators and models was 3.53 (s = 0.47), placing it between “neither easy nor difficult to understand” and “reasonably easy to understand”. Fig. 11 shows, for each model, the number of experts who assessed overall quality as 3 (mid-point of the scale: application would work with no serious problems) or more, and the number who scored it at or above the benchmark (an average model that they would expect to encounter in their work, developed in the last ten years by someone other than themselves). Thus, between three and five models were acceptable to the majority of these experts. Evaluators were also asked to nominate the best model overall; four different models were selected.

4.5.2. Summary
The results demonstrated a diversity of objectively assessed workable solutions to the same problem and answered RSQ8. The diversity observed was consistent with the design characterization.

4.6. Laboratory study: diversity in logical models
Our laboratory study also examined product diversity in the logical data models developed by experienced data modelers for a real-world problem.


The task involved developing a logical data model to serve as the specification for a relational database from a list of 22 attributes that the user wished to record for a real business application. The attributes were presented as a single table/relation (see Appendix E). Participants were attendees at advanced data modeling seminars: for measures 1–4 these were attendees in London, U.K., and Pittsburgh, U.S.A. (40 participants in total); for measure 5 the participants included a further 58 at two other seminars. All but two of the models produced supported the data specified by the original table and were thus workable. All but three models were fully normalized. Straightforward assumptions were made about the columns in each table for the few solutions that did not provide a full list. Consistent application of these assumptions may have led to less diversity than if this task had been completed by the modelers themselves.

4.6.1. Diversity measures
Diversity measure no. 1 – Participants’ perceptions of difference: Participants paired off and compared models. No participant reported the two models as identical, nine percent reported the models as identical except for naming or agreed errors, 39% as structurally different in minor ways, and 52% as structurally different in important ways.

Diversity measure no. 2 – Number of tables: Fig. 12 shows the frequency distribution of table counts in the models.

Diversity measure no. 3 – Variety of table names: After consolidation of obvious synonyms, the 39 models contained 66 different table names.

Diversity measure no. 4 – Construct variability: Across the 39 models, seven concepts were represented in more than one way: as tables in some models and as columns in others (see Appendix E).

Diversity measure no. 5 – Generalization: Examination of the models revealed five different generalizations: four produced a single column from two different attributes and one changed a column’s name to increase consistency with other columns (see Appendix E). A score was then calculated for each participant by totaling the number of these decisions taken by the participant. Fig. 13 shows the frequency distribution of the scores. With one exception, generalization scores were not significantly correlated with the standard demographic groupings or with responses to the process questions. The sole significant correlation was that participants who had developed more than one model in practice had significantly higher generalization scores.

4.6.2. Summary
The diversity in logical models is evident from the results (RSQ9).

[Figure: bar chart titled "Acceptability of Models"; x-axis: Model (1 to 10); y-axis: Number of Evaluations; series: >=Benchmark, >=3.]

Fig. 11. Model acceptability.

[Figure: bar chart titled "Total number of tables"; x-axis: Number of tables (1 to 9); y-axis: Frequency.]

Fig. 12. Total number of tables in each model.


[Figure: bar chart titled "Frequency of Generalization Scores"; x-axis: Generalization Score (0 to 5); y-axis: Frequency of Score.]

Fig. 13. Frequency distribution of generalization scores.

Table 7
Family Tree generalization decisions.

                Relationship         Parenthood
Person          0.39 (p < 0.0005)    0.37 (p < 0.0005)
Relationship                         0.88 (p < 0.0005)

4.7. Laboratory study: style in data modeling
Our laboratory study also examined whether modelers consistently favored higher or lower levels of generalization within and between models. The task involved developing two models. As in the prior section, generalization scores were calculated for each model. These were then used to determine the consistency of decisions within each model, and the correlation of the scores between the two models.

Participants were attendees at advanced data modeling seminars in the USA (two seminars) and in Stockholm, Sweden: a total of 91 participants. Three modeling problems were used, with each participant being assigned two. Two of the problems required the development of a conceptual data model from a text description and one required the development of a logical data model (see Appendix F). Models that omitted any of the constructs were excluded from analysis. Each identified construct in each model was coded according to its level of generalization, using ‘0’ for the lowest level, ‘1’ for the next-lowest, and so on. A total generalization score was determined by adding the individual levels coded.

Four concepts in the Bank Loans and three concepts in the Family Tree solutions were identified as being subject to different levels of generalization. In each case, only two levels were found. With one exception, the decisions were logically independent. Appendix F contains the frequencies with which each decision was used. Recall from the previous section that there were five generalization decisions in the logical data-modeling task (see “diversity measure no. 5”). Combinations of the decisions resulted in ten versions of the Bank Loans model and five versions of the Family Tree model; the most popular in each case accounted for 50% of the models. Tables 6 and 7 show the correlation between each pair of generalization decisions. Covariance among the decisions, measured using the Kuder-Richardson 20 (KR20) coefficient, was 0.80 for the Family Tree decisions, 0.75 for the Bank Loans decisions, and 0.68 for the logical data modeling task generalization decisions.

Generalization scores for the Family Tree and Bank Loans models were moderately positively correlated. The gamma statistic (γ = 0.69, p < 0.0005) showed a strong correlation. Thus, between-model generalization correlation was supported for the two conceptual models. There was some correlation with demographic categories, so the analysis was repeated for each of them. Table 8 shows that the correlation remained moderate and significant (p < 0.005) within all but one group. No significant correlation was found between the generalization scores for the Annual Budget (logical) model and either the Family Tree or Bank Loans conceptual models (Family Tree: γ = 0.26, p = 0.46; Bank Loans: γ = 0.53, p = 0.86).

Table 6
Bank loans generalization decisions.

                     Party                Party relationship   Transaction
Customer             0.77 (p < 0.0005)    0.58 (p < 0.0005)    0.27 (p = 0.03)
Party                                     0.75 (p < 0.0005)    0.28 (p = 0.03)
Party relationship                                             0.36 (p < 0.0005)

Table 8
Correlation between generalization scores across the conceptual models.

Demographic group                   γ       p
>10 models produced                 0.54    <0.0005
≤10 models produced                 0.27    0.32
Occupation = data modelers          0.52    <0.0005
Occupation = non data modelers      0.62    0.001
>6 years experience                 0.56    0.001
<6 years experience                 0.49    0.003
Total sample                        0.51    <0.0005

4.7.1. Summary
Some modelers consistently choose higher (or lower) levels of generalization than others, within both conceptual and logical models and across conceptual models. This bias (or style) is not due to their level of experience (and, by implication, expertise). Consequently, RSQ11 was answered in the affirmative, which supported the conclusion that the products of data modeling were influenced by differences in the style of the practitioners that produce them.

5. Summary
Our findings were based on the four key dimensions of our design framework: general, problem, process and product.

The general dimension includes scope, importance and beliefs. Data modeling was found to consist of the specification of the initial conceptual schema to meet the business requirements prior to any performance tuning (RSQ1). The research question was considered to be important (RSQ2). Data modeling practitioners were evenly divided between the belief that data modeling was design and the belief that it was description (RSQ3).

Data modeling problems were seen as having the characteristics of design problems by data modeling practitioners, significantly more so than by architects and accountants (RSQ4). Data modelers generally worked from a problem statement rather than directly from observations of the UoD (RSQ4).

The data modeling process was perceived as having the characteristics of design processes, similar to the perceptions of architects and significantly more than the perceptions of accountants (RSQ6). Consistent with the design characterization, identification of entities, relationships, and attributes was not considered to be part of the business requirements analysis. Furthermore, there was no evidence of the widely advocated “view-definition, view-integration, view-reconstruction” sequence, which requires that model differences can be reduced to reconcilable views (RSQ5).

Data modeling products were perceived to be design products by data modeling practitioners, significantly more than in the perceptions of accountants, and similar to the perceptions of architects (RSQ7). Conceptual data models and logical data models developed in response to a common problem were found to have substantial

diversity (RSQ8, RSQ9). Data modelers frequently re-used their own or others’ patterns, significantly more than architects. Experienced conceptual data modelers re-used patterns much more than less-experienced data modelers (RSQ10). A significant correlation was found between the levels of generalization of entities within and between conceptual data models developed by the same modeler. This suggested that personal style, evidenced by generalization decisions, affected the data models that modelers produce (RSQ11).

6. Discussion and conclusions
Answers to our research sub-questions suggested that data modeling, while traditionally characterized as description, was better characterized as design, based on how it was practiced. We focused entirely on the design/description question and produced consistent evidence in favor of the design characterization.

For researchers, there are two implications. First, careful design of data modeling experiments that take into account the likelihood of alternative solutions is required. Second, generalization of the results of empirical studies that use students as participants is problematic, because design skills take time to develop.

Data modeling teachers should consider it as a design activity. Designing data modeling tasks by articulating a domain based on nouns and verbs that relate to the entity types and relationship types is a way to teach data modeling notation, but it does not teach the practice of data modeling.

Practitioners should be aware that their method is more consistent with data modeling as a design activity. They should be aware that alternative data modeling solutions may be useful and need to be evaluated for quality as part of the process.
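The between-model style result in Section 4.7 rests on the gamma statistic computed over pairs of generalization scores. As a rough illustration only, the sketch below computes the Goodman-Kruskal gamma from concordant and discordant pairs, ignoring ties. The function name and the sample scores are hypothetical and are not taken from the study data; the paper does not state which software or tie handling was used.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Return gamma = (C - D) / (C + D) for two equal-length ordinal sequences.

    C counts concordant pairs, D counts discordant pairs; tied pairs are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical generalization scores for six modelers on two conceptual models.
family_tree_scores = [0, 1, 1, 2, 3, 3]
bank_loans_scores = [0, 0, 2, 1, 3, 4]
gamma = goodman_kruskal_gamma(family_tree_scores, bank_loans_scores)
```

A value near 1 would indicate that modelers who generalize more in one model also tend to generalize more in the other; a significance test would be needed separately.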

Appendix A. Summary of participants Research samples and research components. Location

Research component

Number

DAMA/Metadata Conference, San Antonio, TX, USA

Diversity in conceptual modeling Espoused positions on data modeling Diversity in logical modeling Diversity in logical modeling Data modeling style Characteristics of data modeling Characteristics of data modeling Characteristics of data modeling Data modeling style and Diversity in logical modelinga Characteristics of data modeling Characteristics of data modeling Characteristics of data modeling Scope and stages Characteristics of data modeling; Scope and stages Data modeling style and Diversity in Logical Modelinga

112

66%

17 23 41 54 28 39 28

85% 80% 77% 90%b 90%b 90%b 70%b

70c 20 25

90%b 90%b 83%

30 459

86% 75%

DAMA Conference, London, UK Enterprise Data Forum, Pittsburgh, PA, USA DAMA /Metadata Conference, Orlando, FL, USA DAMA Chapter Presentation, Portland, OR, USA DAMA Chapter Presentation, Phoenix, AZ, USA DAMA Chapter Presentation, Des Moines, IA, USA IRM Data Modeling Workshop Stockholm, Sweden IRM /DAMA Conference, Stockholm, Sweden DAMA Chapter Presentation, Sydney, Australia DAMA/Data Quality Conference, London, UK Wilshire Conferences Data Modeling Masterclass, Los Angeles, CA, USA

Response rate (%)

a The Diversity in Logical Modeling task was incorporated in the Data Modeling Style task in these two locations.
b Estimate – exact attendee numbers not available.
c This group included 28 who attended the previous item.

Appendix B. List of participants in the thought leaders interviews

The participants, and their positions or roles at the time of interview, were:

Peter Aiken, data management consultant, Associate Professor at Virginia Commonwealth University.
Richard Barker, company director, architect of the Oracle CASE tool.
Michael Brackett, President of the International Data Management Association.
Harry Ellis, data modeling consultant to the British Department of Defence.
Larry English, leading proponent of data quality techniques.
Terry Halpin, Professor at Northface University Utah.
David Hay, independent data modeling consultant and educator.
Steve Hoberman, global reference data manager with Mars, Inc.
Karen Lopez, data modeling consultant and commentator.
Dawn Michels, data modeling specialist, Vice President of Chapter Services for DAMA International.
Terry Moriarty, president of Inastrol data modeling consultancy.
Ronald Ross, editor of the Database Newsletter for 22 years.
Robert Seiner, data management consultant.
Alec Sharp, independent data and process modeling consultant.
Len Silverston, data modeling consultant, industry educator.
Eskil Swende, Chief Executive of the IRM group, President of the Scandinavian chapter of the Data Management Association.
John Zachman, industry consultant and educator.


Appendix C. Survey questions – perceptions of characteristics of data modeling

Properties of design organized as a set of questions.

A. Design problems
Properties:
1. Problems cannot be comprehensively stated
2. Problems require subjective interpretation
3. Problems tend to be organized hierarchically
Survey questions:
2. Data modeling problems are often full of uncertainties about objectives and relative priorities
3. Many requirements do not emerge until some attempt has been made at developing a model
4. Objectives and priorities are likely to change during the modeling process
5. In establishing requirements for a data model, something that seems important to one data modeler may not seem important to another data modeler
6. In establishing requirements for a data model, something that seems important to one business stakeholder may not seem important to another business stakeholder
7. Modeling problems are often symptoms of higher level problems

B. Design products
Properties:
1. There are an inexhaustible number of different solutions
2. There are no optimal solutions to design problems
3. Design solutions are often holistic responses
4. Design solutions are a contribution to knowledge
Survey questions:
9. Most data modeling problems do not have a single correct solution
10. In most practical business situations, there is a wide range of possible (and workable) data models
11. Data modeling almost invariably involves compromise
12. Data modelers will almost invariably appear wrong in some ways to some people
13. It is not usually possible to dissect a data model and identify which piece of the model supports each piece of the business requirements
14. I frequently re-use patterns (structures) from other data models that I have developed myself
15. I frequently re-use patterns (structures) that I have seen in models developed by others

C. The design process
Properties:
1. The process is endless
2. There is no infallibly correct process
3. The process involves finding as well as solving problems
4. Design inevitably involves subjective value judgments
5. Design is a prescriptive activity
6. Designers work in the context of a need for action
Survey questions:
17. Identifying the end of the data modeling process (i.e. when to stop modeling) requires experience and judgment
16. There is no infallible correct process that (if properly followed) will always produce a sound data model
21. Data modeling requires a high level of creative thinking
23. I find it difficult to remain dispassionate and detached in my data modeling work
24. Data modeling is prescriptive rather than descriptive
25. The final data model is often a result of compromise decisions made on the basis of inadequate information

The 19 Characteristics at the lowest level were derived from concepts in the descriptions of the Properties, and were operationalized as questions that could be scored on a Likert scale (the numbers, complete with gaps in the sequence, are the numbers of the corresponding questions in the resulting questionnaire). Scores were computed by taking the mean to provide a score for the higher level Property. Then the Property scores were computed to provide Problem, Product, and Process scores, and ultimately an overall Design score. There was some subjectivity in the identiﬁcation of these concepts and the framing of the questions. There was no question addressing the Property (of design products) Design solutions are parts of other design problems. This Property proved difﬁcult to communicate in a simple question or questions and after pilot testing it was excluded. In all, six questions addressed problem, seven addressed product, and six addressed process. Five further questions were added based on other differences between description and design. These were classiﬁed under their relevant dimensions. Questions added to characteristics of data modeling survey.

Additional survey question (Dimension)

8. Business requirements are often negotiable (Problem)
18. When I am developing a data model, I sometimes produce more than one workable solution, and then choose the best one (Process)
19. I often start modeling before I have a thorough understanding of business requirements (Process)
20. Sometimes, even when I understand the business requirements, I find it difficult to produce a data model (Process)
22. I have experienced “eureka” moments (sudden and dramatic insights or solutions to problems) in my data modeling work (Process)
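The scoring scheme described in this appendix (question scores averaged into Property scores, then Dimension scores, then an overall Design score) and the one-sample t-test against the neutral value of 3 used in Section 4.4 can be sketched as follows. This is a minimal illustration, assuming an unweighted mean at every level, which the text does not state explicitly. The function names, the property labels, and the sample answers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def design_scores(responses):
    """Aggregate one respondent's Likert answers.

    responses: {dimension: {property: [answers on a 1-5 scale]}}
    Returns (property_scores, dimension_scores, overall_score).
    Assumption: an unweighted mean is taken at each level.
    """
    property_scores = {
        dim: {prop: mean(answers) for prop, answers in props.items()}
        for dim, props in responses.items()
    }
    dimension_scores = {
        dim: mean(list(scores.values())) for dim, scores in property_scores.items()
    }
    overall = mean(list(dimension_scores.values()))
    return property_scores, dimension_scores, overall


def one_sample_t(values, neutral=3.0):
    """Return (t statistic, degrees of freedom) for H0: population mean == neutral."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    return (mean(values) - neutral) / standard_error, n - 1


# Hypothetical answers from a single respondent (not study data).
respondent = {
    "problem": {"cannot be comprehensively stated": [4, 5], "subjective": [3]},
    "product": {"no optimal solutions": [4]},
    "process": {"no infallible process": [2, 4]},
}
_, dimension_scores, overall = design_scores(respondent)

# Hypothetical overall scores from five respondents, tested against neutral = 3.
t_stat, dof = one_sample_t([4, 4, 3, 5, 4])
```

A large positive t statistic would indicate that the mean lies above the neutral point; the corresponding p value would be read from a t distribution with the returned degrees of freedom.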

A further question was added as Question 1 in the survey to determine whether the most difﬁcult part of data modeling was in understanding the business requirements. It served three purposes: (1) To answer the question: are requirements ﬁxed or negotiable? If requirements are negotiable, but perceived as ﬁxed by some modelers (or vice versa), we would expect those modelers to ﬁnd the task difﬁcult. (2) To determine whether perceived difﬁculty in understanding requirements correlated with other indicators of design. Incompleteness, subjectivity, and negotiability of requirements are cited as properties of design; if the task is essentially descriptive, then gaining an understanding of it is the central (and most difﬁcult) task. (3) In eliciting a deeper understanding of either description or design positions, to reduce the possibility that respondents would recognize the dichotomy behind their questions and answer. The question did not signal the dichotomy. It was placed ﬁrst on the questionnaire. The model was developed speciﬁcally for our research, in the absence of established measures for differentiating description and design activities. We were obliged to rely solely on the soundness of the underlying theory (and on its operationalization) when drawing conclusions from the results. Questions were adapted, through minor re-wording, to enable them to be used with two other professional groups, viz. architects and accountants. The two groups were chosen because: (1) Architecture is generally recognized as a design discipline and is frequently employed as a metaphor for IS tasks and deliverables.


(2) Accounting is a process of recording, classifying, reporting and communicating, a definition consistent with the descriptive paradigm. Data modelers have in fact been compared with accountants: “Just as an accountant might use a financial model, the analyst can develop an entity model”.

To encourage a focus on common tasks, the accountants’ questions were framed in the context of preparing a set of accounts for a business and the architects’ questions in the context of designing a building.

Appendix D. Laboratory materials – diversity in conceptual modeling

The problem to be analyzed was presented to the participant in three parts:
(1) A videotaped description of the business requirements as recorded by the project director and also by the manager responsible for managing the production system. The two stakeholders were responding independently to our request to tell us about a project and the data that was needed to run it.
(2) A verbatim transcript of the videotape (see below), with a short glossary of terms added by the author in consultation with the project director.
(3) A list of questionnaires to be used for data collection, with excerpts from two questionnaires.

Postnatal depression interview transcript and glossary.

Case study interview transcripts

Key Terms (as used in the transcripts):
Post-natal or post-partum – after the birth of a baby
Ante-natal – before the birth of a baby (i.e. during pregnancy)
Intervention – action taken by a health professional, e.g. counseling, prescription of drugs. Also used by Prof. Buist (final sentence) to mean “actions taken to educate health professionals and the public about Post-Natal Depression.”
Screening – administering a questionnaire (to a woman participating in the study)

Professor Anne Buist, Director, National Post-natal Depression Initiative

This project is looking at ante-natal and post-natal depression, and it’s going to run over four years, and cover five states of Australia.
It’s being funded by Beyond Blue, which is the Australian national depression institute, and it’s going to cover somewhere between 50,000 and 100,000 women over this time period.

The data collection is in three kind-of-separate bits: Firstly across all states we’re going to be screening women at a minimum of two time points – once through the pregnancy and once post-natally. And the data we’re collecting there will be the same in each state. However there’s also going to be state-specific interventions for these women, and that will be evaluated both pre and post intervention with another set of questionnaires that women or/and the research assistants will be completing. And these may be at up to six different time points in covering through pregnancy and post-partum. The other sort of aspect of the data collection is before we even start the study and at the end of the study we’re going to be sending questionnaires to both women who have had babies and health professionals (general practitioners, midwives and maternal child health nurses), and evaluating their understanding of post-natal depression with respect to what it is, with respect to stigma and with respect to treatment. And we’ll be evaluating that again after our four-year time period where we’re going to be doing some interventions and in particular increasing awareness of post-natal depression.

Dr Justin Biltza, Project Officer, National Post-natal Depression Initiative

So really there are two types of data that we’re collecting: the first lot being patient demographic data (name, address, date of birth, contact details), and the other set of data is based on a series of questionnaires which are either “short answer” or the selection of a score based on (say) a range from (say) “good” to “bad”.
There’s approximately forty questionnaires that we’re using, ﬁve which form a core key component that everyone in the survey is doing, but then each of the states has a number of individual surveys that they’re using, none of which cross over, so one of the problems that we have is ensuring that we collect all the data on all the patients.


A couple of the other problems we have are the need for a central identiﬁcation number that we need to generate: we can’t use (for instance) a Medicare number or social security number because of privacy issues. Another one of the problems that we have is that a lot of these surveys are used multiple times – two, sometimes three times. So it’s the ability to be able to collect data on the third survey, linking it up with the same patient that we used for the ﬁrst survey. So if you were to participate in the study, you would come into your ante-natal visit and with the help of staff ﬁll out (say) four or ﬁve questionnaires asking you about your mood and how you’re feeling. You would then answer the same questionnaires again at your post-natal visit, and the reason we have the same questionnaires again is just to see how the mood and the response has changed over a period of time.

Instructions and a set of “process” questions addressing assumptions, level of difficulty and use of patterns (common to all modeling exercises used in our research) were added to the standard demographics questionnaire.

Appendix E. Laboratory materials – diversity in logical modeling

The task was to produce a logical model based on the conceptual model (see Fig. E1). The logical model needed to be a workable specification for a database: a single table/relation is needed so that data can be stored: it is already normalized. The quarterly items are not repeating groups; they are different items with different names and meanings.

The task and associated questionnaire were administered to 96 attendees at four advanced data modeling seminars. For some of the measures of diversity, only the 39 responses from London (a substantial European conference) and Pittsburgh (a substantial North American conference) were included.

The last three concepts in Table E1 did not directly reflect columns in the original model but were added by modelers to capture semantics lost when generalizing some of the original columns. Ignoring differing levels of generalization and considering only the choice of representing each concept as either a column or a table resulted in 19 distinct models.

There were five situations in which some participants had:
(a) explicitly generalized two or more of the attributes in the original model to produce a single column, e.g. generalizing Budget First Quarter Material, Budget Second Quarter Material, Budget Third Quarter Material and Budget Last Quarter Material into a single column Quarterly Material Budget plus a Quarter Number column to identify which quarter the amount applied to (Decision 1 in Table E2); or
(b) altered columns to increase consistency, e.g. replacing Actual Total Material with Actual Fourth Quarter Material (Decision 5 in Table E2) to make it consistent with the representation of budgeted amounts and comparable with the other (quarterly) actual material amounts. Although these decisions are not manifested as generalizations, they are based on the recognition of commonality, and thus have been treated together with the explicit generalizations.

Table E2 shows the five situations and the different decisions made by participants. Fig. E2 shows the frequency distribution of the decisions. Nineteen of the 22 modelers who generalized Budget and Actual amounts (Gen BA) also made the other four generalizations, and this covariance amongst the decisions was supported by a Kuder-Richardson 20 (KR20) statistic of 0.68. This suggests an underlying propensity to generalize on the part of the modeler. Correlations between individual decisions (φ) ranged from negligible to strong, and were positive in all cases.
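The KR20 statistic quoted above summarizes how consistently the binary (yes/no) generalization decisions vary together across modelers. The sketch below shows the standard KR20 formula and, assuming the pairwise correlations are read as phi coefficients between two binary decisions, a phi computation as well. The function names and the small decision matrix are hypothetical and do not reproduce the study data; the population variance of total scores is used, which is one common convention.

```python
from math import sqrt
from statistics import pvariance


def kr20(rows):
    """Kuder-Richardson 20 for a matrix of 0/1 decisions.

    rows: one list per modeler, one 0/1 entry per decision.
    KR20 = k / (k - 1) * (1 - sum(p_j * q_j) / variance of total scores)
    """
    n = len(rows)
    k = len(rows[0])
    totals = [sum(row) for row in rows]
    total_variance = pvariance(totals)  # population variance (an assumed convention)
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in rows) / n  # proportion choosing "yes" for decision j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


def phi(x, y):
    """Phi coefficient between two equal-length 0/1 sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denominator


# Hypothetical decisions: four modelers, three generalization decisions each.
decisions = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(decisions)
first_vs_second = phi([row[0] for row in decisions], [row[1] for row in decisions])
```

Both functions are undefined when a decision (or the total score) has zero variance, so real data would need a guard for that case.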


Department Number (Primary key item)
Year (Primary key item)
Approved-By
Budget-First-Quarter-Material
Budget-Second-Quarter-Material
Budget-Third-Quarter-Material
Budget-Last-Quarter-Material
Actual-First-Quarter-Material
Actual-Second-Quarter-Material
Actual-Third-Quarter-Material
Actual-Total-Material
Discretionary-Spending-Limit
Budget-First-Quarter-Labor
Budget-Second-Quarter-Labor
Budget-Third-Quarter-Labor
Budget-Last-Quarter-Labor
Actual-First-Quarter-Labor
Actual-Second-Quarter-Labor
Actual-Third-Quarter-Labor
Actual-Total-Labor
Budget-Other
Actual-Other

Fig. E1. Annual Budget conceptual model.

Table E1
Alternative representations of concepts.

Concept               Not present   As column   Literal table   Generalized table (in scope)   Generalized table (beyond scope)   Total
Approved by           0             13          7               1                              18                                 39
Department            0             4           34              1                              0                                  39
Disc spending limit   0             27          12              0                              0                                  39
Year                  0             26          6               7                              0                                  39
LMO-type              14            7           18              0                              0                                  39
Quarter               10            13          8               7                              1                                  39
BA-type               28            7           4               0                              0                                  39

Table E2. Generalization choices in the logical data models.

| Decision number | Decision name | Yes | No | Both |
| --- | --- | --- | --- | --- |
| 1 | Gen QTR (generalization decision) | Quarterly amount columns generalized – no columns specific to a particular quarter | Columns for individual quarters | N/A |
| 2 | Gen LMO (generalization decision) | Labor, material, other generalized – no columns specific to a particular type | Specific columns for Labor, Material, Other amounts | N/A |
| 3 | Gen BA (generalization decision) | Budget and actual columns generalized – no columns specific to a particular type | Specific columns for Budget and Actual amounts | N/A |
| 4 | Other QTR (consistency decision) | Support for quarterly values for "Other" amounts – no column for annual amount | Support only for annual values for Other amounts | Both options supported |
| 5 | Fourth QTR LM (consistency decision) | Direct representation of fourth quarter labor and material actual amounts | Annual totals held for Labor and Material Actual amounts | N/A |
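To make the "Yes" options for Decisions 1 to 3 and 5 concrete, the sketch below turns one literal (wide) Annual Budget row into generalized amount rows keyed by BA type, LMO type, and quarter. The column names follow Fig. E1; the function names, the output row layout, the sample values, and the rule that a fourth-quarter actual equals the annual total minus the first three quarters are assumptions made for illustration, not a reconstruction of any participant's model.

```python
from itertools import product

BA_TYPES = ("Budget", "Actual")
LMO_TYPES = ("Labor", "Material")
QUARTERS = ("First", "Second", "Third", "Last")


def fourth_quarter_actual(wide_row, lmo):
    """Derive a fourth-quarter actual from the annual total (assumed rule)."""
    total = wide_row.get(f"Actual-Total-{lmo}")
    earlier = [wide_row.get(f"Actual-{q}-Quarter-{lmo}") for q in QUARTERS[:3]]
    if total is None or any(v is None for v in earlier):
        return None
    return total - sum(earlier)


def generalize(wide_row, derive_fourth_quarter=False):
    """Return one row per (BA type, LMO type, quarter) amount found in wide_row."""
    rows = []
    for ba, lmo, qtr in product(BA_TYPES, LMO_TYPES, QUARTERS):
        column = f"{ba}-{qtr}-Quarter-{lmo}"
        if column in wide_row:
            amount = wide_row[column]
        elif derive_fourth_quarter and ba == "Actual" and qtr == "Last":
            # Corresponds to the "Yes" option of Decision 5.
            amount = fourth_quarter_actual(wide_row, lmo)
            if amount is None:
                continue
        else:
            continue
        rows.append({
            "Department Number": wide_row["Department Number"],
            "Year": wide_row["Year"],
            "BA-Type": ba,
            "LMO-Type": lmo,
            "Quarter": qtr,
            "Amount": amount,
        })
    return rows


# Hypothetical sample values, not data from the study.
sample = {
    "Department Number": 10,
    "Year": 2011,
    "Budget-First-Quarter-Labor": 100,
    "Budget-Last-Quarter-Labor": 130,
    "Actual-First-Quarter-Labor": 90,
    "Actual-Second-Quarter-Labor": 95,
    "Actual-Third-Quarter-Labor": 105,
    "Actual-Total-Labor": 400,
}

if __name__ == "__main__":
    for row in generalize(sample, derive_fourth_quarter=True):
        print(row)
```

The annual "Other" amounts are left out of the sketch because Decision 4 concerns whether they are held quarterly at all.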

Fig. E2. Frequency of design options. [Bar chart: frequency (axis 0 to 90) of Yes, No, Unclear, and Both responses for each design option: Gen Qtr, Gen LMO, Gen BA, Other Qtr, Fourth Qtr LM.]

Appendix F. Laboratory materials – style in data modeling

Three data modeling problems were used in this research component.

The Annual Budget problem: participants were presented with a conceptual model and some supporting information, and asked to produce a logical data model (the model shown in Appendix E).
A Bank Loans problem: a simplified version of a real example, presented as a short plain-language description written by the author.
A Family Tree problem: it included the concept of marriage, presented as a short plain-language description written by the author.

Bank loans data modeling problem.
Bank loans
To support the business of a bank, we need to record details of personal loans, housing loans and motor vehicle finance loans. Against each loan, we need to record the details of the borrower(s), the Loan Officer who approved the loan, and (in some cases) a guarantor. We also need to record payments, drawings (initial and further borrowings) and interest transactions against each loan.

Family tree data modeling problem.
Family tree
We are developing a database to record details of a family tree. For each person of interest to us, we need to be able to record details (where known) of their mother, father, children, and marriages, and their date of birth, death and marriages.
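The Family Tree problem above admits both a literal and a more generalized representation. The sketch below shows one possible pair: a literal form with mother and father held on the person plus a separate marriage record, and a generalized form in which every link is a typed relationship between two people. All class, field, and function names are hypothetical and are not taken from the participants' models.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    """Literal form: parents are recorded directly on the person."""
    person_id: int
    name: str
    date_of_birth: Optional[date] = None
    date_of_death: Optional[date] = None
    mother_id: Optional[int] = None
    father_id: Optional[int] = None


@dataclass(frozen=True)
class Marriage:
    """Literal form: one record per marriage."""
    spouse_a_id: int
    spouse_b_id: int
    date_of_marriage: Optional[date] = None


@dataclass(frozen=True)
class Relationship:
    """Generalized form: any typed link between two people."""
    kind: str  # "mother", "father" or "marriage" in this sketch
    from_person_id: int
    to_person_id: int
    start_date: Optional[date] = None


def to_relationships(people: List[Person], marriages: List[Marriage]) -> List[Relationship]:
    """Re-express the literal records as generalized relationships."""
    rels = []
    for p in people:
        if p.mother_id is not None:
            rels.append(Relationship("mother", p.mother_id, p.person_id))
        if p.father_id is not None:
            rels.append(Relationship("father", p.father_id, p.person_id))
    for m in marriages:
        rels.append(Relationship("marriage", m.spouse_a_id, m.spouse_b_id, m.date_of_marriage))
    return rels


def children_of(person_id: int, relationships: List[Relationship]) -> List[int]:
    """Children are derived from parent links rather than stored separately."""
    return sorted(
        r.to_person_id
        for r in relationships
        if r.kind in ("mother", "father") and r.from_person_id == person_id
    )
```

Both forms cover the stated requirements; the generalized form would also accommodate further relationship types without new structures, at the cost of moving rules such as "at most one mother" out of the structure itself.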

Fig. F1. Bank Loans generalization decisions. [Bar chart: frequency (axis 0 to 60) of Generalised and Not Generalised choices for each design option: Customer, Party, Party Relationship, Transaction.]

Fig. F2. Family Tree generalization decisions. [Bar chart: frequency (axis 0 to 70) of Generalised and Not Generalised choices for each design option: Person, Relationship, Parenthood.]

Figs. F1 and F2 show the frequency with which each generalization option was used in the two models.

Graeme Simsion is an Information Systems Consultant, Educator, and Researcher. For 20 years he was CEO of a business and information systems consultancy with offices in three Australian cities. His PhD from The University of Melbourne examined attitudes and practices of data modeling practitioners. He is the author of Data Modeling Essentials, one of the most widely used practitioner texts on the subject, Data Modeling Theory and Practice, and numerous academic and practitioner articles, and is a regular speaker at industry and academic forums. His current focus is on improving the consulting skills of business and information systems professionals.

Simon Milton is a Senior Lecturer in the Department of Computing and Information Systems at The University of Melbourne, and received his PhD from The University of Tasmania in which he reported the first comprehensive analysis of data modeling languages using a common-sense realistic ontology. Dr Milton continues his interest in the ontological foundations and practice of data modeling. He is also interested in the value and use of ontologies for business and biomedicine.

Graeme Shanks is an Australian Professorial Fellow in the Department of Computing and Information Systems at The University of Melbourne. He received his PhD from Monash University. His research interests focus on the management and impact of information systems, business analytics, data quality and conceptual modeling. Graeme has published in journals including MIS Quarterly, Journal of Information Technology, Information Systems Journal, Information & Management, Journal of the AIS, Electronic Commerce Research, Journal of Strategic Information Systems, Information Systems, Behaviour and Information Technology, Communications of the AIS, Communications of the ACM, and Requirements Engineering.
