Information & Management 49 (2012) 151–163
Information & Management journal homepage: www.elsevier.com/locate/im
Data modeling: Description or design? Graeme Simsion, Simon K. Milton *, Graeme Shanks Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
ARTICLE INFO
Article history: Received 13 November 2010; received in revised form 15 November 2011; accepted 25 January 2012; available online 14 February 2012
Keywords: Conceptual data modeling; Practitioner study; Design; Analysis

ABSTRACT
Data modeling for database creation has generally been considered to be a descriptive process: the real world is observed and represented in a conceptual model that is then transformed into a logical structure for a database. This is reflected in prescriptive methods and is the dominant assumption in most studies. However, data modeling can also be considered a type of design with negotiable requirements, a creative process, and many workable solutions. Our paper discusses empirical results from almost 500 practitioners on three continents comparing data modeling to design. We found that data modeling, as practiced, was better characterized as design. © 2012 Elsevier B.V. All rights reserved.
1. Introduction

1.1. Alternative views of data modeling
Data modeling is one of the most critical activities in the implementation of an IS: it has been characterized as a process of reality mapping. This characterization has occasionally been challenged from a philosophical perspective, from observations of practice, and from empirical evidence. This descriptive characterization also dominates the practitioner literature.
In a descriptive activity, a set of artifacts may be created, and this might well be called design, but it may not be of sufficient importance to the overall result to characterize the entire activity as design. In data modeling, there is choice in the selection of components (typically entities, relationships and attributes) used to represent some part of reality. The difference between description and design lies in whether this selection is a trivial part of the process compared to understanding the Universe of Discourse (UoD) (the descriptive type), or whether it is the essence of the process (the design type).

1.2. Previous empirical research
We know little about how experienced data modelers approach their work or about the models that are produced for real business applications. This is because most studies have assumed the process to be descriptive and have not involved practitioners.
The descriptive characterization is embodied in most empirical studies through the use of a gold standard: a single correct solution devised by the researcher, who often embedded entity and relationship names in descriptions, thus constraining the modeling abstractions. The “business requirements” amounted to a plain language description and the participant’s task was to translate the description to the original diagram. For example, “An employee can report to only one department. Each department has a phone number.” Tasks showing these two traits mainly tested facility with modeling formalisms. Yet it is common to see conclusions indicating that novice designers did not run into much trouble in modeling entities and attributes. In the context of the research task, modeling entities may have meant little more than identifying nouns in the description. The use of simple models and prescriptive instructions limited the scope of the design.
Most empirical studies have used students as participants; of course, this limited the difficulty of the problems posed. Of the total of 3210 participants across 59 studies that we surveyed, only 147 in nine studies had more than one year’s industry experience of data modeling. Thus most studies used unrealistically simple data models.
Nevertheless, some studies have used experienced data modelers and have uncovered design behavior. Comparisons of novice and expert data modelers have revealed behaviors characteristic of designers (attempting to gain a holistic understanding, categorization of problems, pattern re-use) in the experts but not in the novices.

1.3. The research question
* Corresponding author. Fax: +61 3 9349 4596. E-mail address: [email protected] (S.K. Milton).
0378-7206/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.im.2012.01.003
Our research question was: Is data modeling better characterized as description or design? Here, data model refers to a model of a specific UoD (e.g. the data model of ABC corporation’s human
resources operations), while data modeling refers to the set of activities required to specify a conceptual schema that will transform into a database schema but prior to its transformation into a specific DBMS data definition. Specifically, we examined the process of data modeling that resulted in a database design that could be implemented in a relational DBMS. We did not include other purposes of conceptual data modeling (e.g., its use in IS planning).
This is an important research question for at least three reasons: Most data modeling research assumes the descriptive characterization, notably in the design of experiments and in the application of ontology [2,3]. If data modeling is, in fact, design, research results need to be reinterpreted in that light. Data modeling education should include expert level practice. Creative thinking and evaluation of alternative designs are intrinsic to design processes. If data modeling is seen as a design activity rather than description, then data modeling methods should be updated to reflect a design process and explicitly include creative thinking and comparative evaluation.

2. The ideal type of ‘design’
Design is the ideal type against which we measure data modeling. Its essence has been synthesized in the form of a list of some of the important characteristics of design problems and solutions, and the design process itself. These characteristics are intended to typify design and thus to differentiate it from description. The list is not exhaustive, and the characteristics are interrelated. Collectively, they provide an overall picture of design. The characteristics (shown in Table 1) were grouped into fourteen properties within three dimensions – Problems, Solutions (i.e., Products), and Process, following Lawson’s [1] properties of design.

3. Research design
The framework provides a basis for expanding the research question to 11 research sub-questions (RSQs) (see Table 2).
Scope seeks to clarify what practitioners mean by data modeling. The remaining three dimensions determine whether data modeling practice has the properties of the design type: Problem deals with the negotiability of data modeling requirements, Process examines whether data modeling is creative, and Product deals with the diversity in data models produced by experienced practitioners in response to a task, where diversity would suggest that data modeling is a design process.
Table 1
Lawson’s properties of design.

Design problems
  1. Design problems cannot be comprehensively stated
  2. Design problems require subjective interpretation
  3. Design problems tend to be organized hierarchically
The design process
  1. The process is endless
  2. There is no infallibly correct process
  3. The process involves finding as well as solving problems (including creativity)
  4. Design inevitably involves subjective value judgments
  5. Design is a prescriptive activity
  6. Designers work in the context of a need for action
Design solutions
  1. There are [sic] an inexhaustible number of different solutions
  2. There are no optimal solutions to design problems
  3. Design solutions are often holistic responses
  4. Design solutions are a contribution to knowledge
  5. Design solutions are parts of other design problems
The RSQs were addressed with a combination of surveys, laboratory studies, and interviews. The selection of the mode for each is shown in Table 3. Semi-structured interviews (with influencers of practitioners or thought leaders), surveys (to collect the perceptions of experienced data modelers about data modeling products, processes, and problems), and laboratory studies (designed to explore diversity and style in data models by asking participants to complete modeling tasks which were examined for evidence of diversity, style, and patterns in data modeling) were chosen to provide multiple sources of data to assess the practice of data modeling against the ideal type of design.
The surveys and laboratory studies were incorporated into 12 data modeling seminars and workshops for experienced practitioners delivered by the first author in the US, UK, Scandinavia and Australia between May 2002 and November 2004 (see Appendix A for summary of participants). Figs. 1–4 summarize the responses to demographic questions. There was a strong correlation between the two experience measures (γ = 0.65, p < 0.0005). Our study was the largest currently published; it included 381 participants with at least one year of data modeling experience. The minimum number to complete any task was 55.
Three other groups participated in the research. The practitioner thought leaders and expert model evaluators were purposive samples. Architects and accountants (who provided a benchmark for the Characteristics of Data Modeling component) were recruited from personal and professional contact lists.
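As an illustration of the kind of ordinal association reported for the two experience measures, the following is a minimal sketch that assumes the coefficient is Goodman and Kruskal's gamma (the paper does not state how it was computed). The function name and the sample bands are hypothetical and are not data from the study.

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Return gamma = (C - D) / (C + D) over all pairs of observations.

    C counts concordant pairs (both measures order the pair the same way),
    D counts discordant pairs; pairs tied on either measure are ignored.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ordinal bands (NOT data from the study): a years-of-experience
# band and a number-of-models band for six imaginary participants.
years_band = [1, 2, 2, 3, 4, 5]
models_band = [1, 1, 3, 2, 4, 5]
gamma = goodman_kruskal_gamma(years_band, models_band)
```

Gamma ranges from -1 (every untied pair discordant) to +1 (every untied pair concordant), so a value such as 0.65 would indicate that participants with more years of experience also tended to report more models.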
Table 2
Research sub-questions (RSQs).

Type dimension | Name | Research sub-question
General (applying to all three dimensions) | Scope | (RSQ1) What do data modeling practitioners believe is the scope and role of data modeling within the database design process?
General (applying to all three dimensions) | Importance | (RSQ2) Is the description/design question considered important by data modeling practitioners?
General (applying to all three dimensions) | Espoused Beliefs | (RSQ3) What are the (espoused) beliefs of data modeling practitioners on the description/design question?
Problem | Perception of Problems | (RSQ4) Are data modeling problems perceived as design problems by data modeling practitioners?
Process | Methods | (RSQ5) Do database design methods used in practice support a descriptive or design characterization of data modeling?
Process | Perception of Processes | (RSQ6) Are data modeling processes perceived as design processes by data modeling practitioners?
Product | Perception of Products | (RSQ7) Are data modeling products perceived as design products by data modeling practitioners?
Product | Diversity in Conceptual Modeling | (RSQ8) Will different data modeling practitioners produce different conceptual data models for the same scenario?
Product | Diversity in Logical Modeling | (RSQ9) Will different data modeling practitioners produce different logical data models from the same conceptual model?
Product | Patterns | (RSQ10) Do data modeling practitioners use patterns when developing models?
Product | Style | (RSQ11) Do data modeling practitioners exhibit personal styles that can be identified in the data models that they create?
Table 3
Research methods used.

Research component | Method | Sub-questions addressed (listed by name)
Interviews with thought-leaders | Interviews | Importance (RSQ2), Espoused Beliefs (RSQ3), Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Scope and stages | Survey | Scope (RSQ1), Methods (RSQ5)
Espoused positions on data modeling | Survey | Importance (RSQ2), Espoused Beliefs (RSQ3)
Characteristics of data modeling | Survey | Perception of Problems (RSQ4), Perception of Processes (RSQ6), Perception of Products (RSQ7)
Diversity in conceptual modeling | Laboratory study | Diversity in Conceptual Modeling (RSQ8), Style (RSQ11)
Diversity in logical modeling | Laboratory study | Diversity in Logical Modeling (RSQ9), Style (RSQ11)
Style in data modeling | Laboratory study | Style (RSQ11), Diversity in Conceptual Modeling (RSQ8), Diversity in Logical Modeling (RSQ9), Patterns (RSQ10)
Fig. 1. Occupation of participants. [Pie chart; categories: Data Modeler, Database, Data Warehouse, Enterprise Modeler, Manager, Data Administrator, Other, Not Reported.]
Fig. 2. How participants learned. [Bar chart; x-axis: Method (Experience, Books, Publications, Mentor, Tertiary Educ., Industry Educ., Not Reported); y-axis: Number of Participants.]
Fig. 3. Experience: number of models. [Bar chart; x-axis: Number of Models; y-axis: Number of Participants.]
Fig. 4. Experience: years. [Bar chart; x-axis: Years Experience; y-axis: Number of Participants.]

4. Results

4.1. Interviews with thought leaders
Interviews were held with seventeen “thought leaders” (see Appendix B); they were conducted by the first author. These were used to confirm the currency of the research question and to clarify which aspects of the research would shed most light on the questions. All interviewees chose acknowledgment over anonymity.
Interviewees were asked for their views on the research question and asked to comment on three aspects of the ideal type:
1. Whether the data modeler should challenge business requirements (Problem).
2. Whether data modeling is a creative activity (Process).
3. Whether data modeling problems have a single right answer (Product).
Analysis was based on videotape transcription and confirmation; organization of statements by common meanings; organization of meanings into themes, and then into the research framework; synthesis of views and positions; and participant review of the findings.
Opinions on the research question, expressed directly and in discussion, varied widely and were summarized in Table 4. Positions were starkly articulated:
  Data modeling is not a process of creation, it is a process of discovery.
  Data modeling is certainly a descriptive activity, it’s not a design activity.
  I believe rabidly and intensely that it’s a design process.
  We’re designing (but) some of the people that we work with see us as scribes.

4.1.1. Problem: are business requirements negotiable?
Proponents of the design characterization answered that business requirements were negotiable, and that modelers should be active in exposing new ways of doing business. One view was that the business does not know what is best for it. Modelers were seen as being able to make suggestions, and to bring in their own business knowledge to provide new perspectives. Interviewees favoring the description characterization supported the primacy of the business in determining its data model and the danger of the modeler taking that role: “What we’re modeling is what the domain expert says is right.”

4.1.2. Process: is data modeling a creative activity?
Proponents of design believed that they were “creating” the objects in the model: “10–15% of entities are obvious and everyone agrees with them, but (beyond that) the actual choice of entities requires a lot of imagination and creativity.” Some supporters of the description characterization recognized a role for creativity in peripheral areas like the layout and presentation of the model, or in the way of understanding the business.
Table 4
Interviewees’ overall positions.

Position | Number of interviewees
Strongly supports description | 5
Somewhat supports description | 1
Supports neither position more strongly than the other | 3
Position depends on modeling formalism/language | 1
Somewhat supports design | 3
Strongly supports design | 4
4.1.3. Product: one right answer?
This question generated strongly conflicting responses. Some who believed that requirements were negotiable were less sure that the models would vary once requirements were settled. Some held that there is a right model, allowing for variation only in notation and the naming of objects. Those who saw differences because modeling was design spoke in terms of utility vs truth. A few raised the theoretical position of choice in classification, but most who argued against the one right answer drew on personal experience. Instructors noted that students produced different workable models in response to case study scenarios. The difficulty of integrating different databases within and across organizations was evidence that different workable models could be implemented for the same data.
The trade-off between level of generalization and enforcement of business rules was a central theme for those who believed in design. Three groups emerged: the literalists (concepts should be modeled as used in the business); the moderate abstractors (some generalizations); and the rule removers (deliberately removing business rules for representation elsewhere). Some saw these as stylistic preferences of modelers.

4.1.4. Summary
Thought leaders considered the question important. Further, they were divided about whether they believed that data modeling was design or description. Their perceptions of whether the problems, processes, and products were best characterized as design or description also varied.

4.2. Survey: scope and stages
There were two reasons for this stage. We sought to determine what practitioners meant by data modeling before designing and interpreting survey questions, and sought responses about parts of the database design process to see whether they could individually be characterized as design or description.
Our questionnaire therefore asked participants to nominate the stages in database design, and map 26 elementary activities (such as normalization and definition of indexes) against the stages. The elementary activities served as common reference points for comparing the higher-level stages nominated by different participants. Participants were also asked to define data modeling in terms of the stages that it covered. Respondents were attendees at advanced data modeling classes in London (25) and Los Angeles (30).
We found broad agreement on the scope, stages, and activities considered to be data modeling activities. After consolidation of names, five stages accounted for 201 (83%) of the total of 243 activities cited: (1) Business Requirements Analysis; (2) Conceptual Data Modeling; (3) Logical Data Modeling; (4) Physical Data Modeling and/or Physical Database Design; and (5) Post-database-design Activities (optional). The two tasks in (4) were found to contain essentially the same activities. A stage simply named “Data Modeling” was listed on nine occasions and was in the same place in the sequence as “Logical Data Modeling”. Forty-one (75%) responses matched this overall pattern, and a further nine (16%) matched the pattern except for the omission of “Business Requirements Analysis”. Eighty percent of responses put the activities of data modeling and responsibilities of the data modeler as (at least) the specification of an initial conceptual schema, to meet agreed business requirements, prior to any modifications to improve performance.
Thus there was a broad consensus on the overall composition and sequence of the database design process. The description/design debate was thus unlikely to be a consequence of different definitions of data modeling.

4.2.1. Questions measuring stages against the ideal type ‘Design’
The data modeling scope and stages survey provided answers to the following questions.
Question 1 – Is a business requirements stage (not including entity, relationship, attribute identification) a necessary part? Including this stage prior to identifying key model components runs counter to the descriptive characterization when the UoD is mapped directly onto the data model. Requirements statements can be seen as problem statements to which the model provides a solution. A business requirements stage was nominated by 84% of our respondents, of whom 46% saw it as the responsibility of the data modeler (solely or jointly), 65% the analyst, and 39% the user.
Question 2 – Is entity/relationship/attribute identification part of the business requirements stage? If these are established before data modeling starts, then the data modeler cannot “create” them, and the description characterization is supported. Only 5% of respondents included identification of entities, relationships, and attributes in a business requirements stage.
Question 3 – Are there separate stages for DBMS-independent (conceptual) modeling and DBMS-specific (logical) modeling? Data modeling can be seen as an implementation-independent descriptive stage followed by transformation to a logical data model, supporting the descriptive characterization. In contrast, designers are constantly conscious of the implementation environment or “medium”. If data modeling is design, we would expect the two stages to blur and often combine. Respondents grouped most conceptual schema specification tasks into one stage, generally called Logical Data Modeling, with only entity and relationship identification being part of a Conceptual Data Modeling stage. Respondents effectively used the term conceptual modeling to describe a preliminary “sketch plan” and not a rigorous and complete product for mechanical translation into a conceptual schema. This is in line with practitioner terminology.
Question 4 – Where does view integration happen? In the descriptive characterization, where there is one right answer, view integration is relatively simple. In the design characterization, models may differ in complex ways and their integration becomes a process of negotiation. View development and integration was a median task in the sequence of the tasks classed as data modeling and it was considered ongoing rather than a discrete terminal task. No respondent nominated it as a discrete stage.
Question 5 – Where is external schema specification located and who is responsible? One approach to database design uses external schemas (views) to replicate user views. In the descriptive characterization, these are mappings from the conceptual schema that originally integrated them and the data modeler will be responsible for their definition. If, instead, external schemas are a tool for managing data independence, programming needs, and security rather than reproducing user views, we would expect them to be defined later in the process. This is indeed what we found and this activity was seen to be outside the primary responsibility of the data modeler.
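The stage and response percentages reported for the scope and stages survey can be checked with simple arithmetic. The sketch below assumes that the 55 class attendees (25 in London, 30 in Los Angeles) are the denominator for the response-level figures; the helper name `percentage` is hypothetical.

```python
def percentage(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


respondents = 25 + 30  # London + Los Angeles attendees (assumed denominator)

stage_share = percentage(201, 243)            # activities in the five stages: 83
full_pattern = percentage(41, respondents)    # responses matching the pattern: 75
no_requirements = percentage(9, respondents)  # pattern minus requirements stage: 16
```

Under that assumption the three results agree with the 83%, 75%, and 16% quoted in the text.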
4.2.2. Summary
In questions asked specifically about important aspects of the process that could illuminate whether data modeling is design or description, practitioners offered opinions that were consistent with the design type.

4.3. Survey: espoused positions on data modeling
We surveyed attendees at a one-day advanced data modeling seminar at an international practitioner convention in an attempt to determine their position on the description/design dichotomy by asking them two questions. We asked first an open question: What is data modeling? We then asked a closed question: Which better describes data modeling: (a) describing the data requirements of an organization or part of an organization, or (b) designing data structures to meet the requirements of an organization or part of an organization? Ninety-three respondents answered both questions. The questions were given after participants had completed a data modeling task developing a model from a business scenario. Participants were told: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”. They were not shown the closed question until after they answered the open question.
Two researchers coded each response to the open question as neutral, somewhat, or strongly supporting either the design characterization (coded as 4 and 5) or the description characterization (coded as 1 and 2). The distribution of coded responses to the open question is shown in Fig. 5. Neutral was subdivided into both or neither (3 and 0 respectively). Inter-coder reliability (α) was 0.82 (0.7 was considered acceptable). Responses to the closed question are shown in Fig. 6. The word design (as a verb) was used in only six responses to the open question. Only 17% of responses to the open question did not embody a position. There was no significant correlation between responses and experience, method of learning, or job position.
Fig. 7 compared responses to the open and closed questions: the vertical axis shows the break-up of responses to the Closed Question for the participants who gave each of the possible (coded) responses to the Open Question.

Fig. 5. Coded responses to open question. [Bar chart; x-axis: Response (Neither, Strong Description, Somewhat Description, Both, Somewhat Design, Strong Design); y-axis: Number of Participants.]
Fig. 6. Responses to closed question. [Bar chart; x-axis: Response (Description, Design, Both); y-axis: Number of Participants.]
Fig. 7. Source of responses to closed question. [Stacked bar chart; x-axis: coded response (0 to 5, NA); y-axis: Frequency of Each Response to Closed Question; series: Both, Design, Description.]

Thus, participants favoring the design characterization in the open question mostly maintained that view in the closed question, but a significant number of participants whose open question answers supported description reversed it in the closed question, providing only a moderate correlation (K = 0.34, p = 0.007) between the open and closed questions when both and neither were excluded. This difference suggested that some responses to the open question may have been influenced by taught definitions of data modeling (which favor a description characterization) whilst the closed question demanded some reflection.
A facilitated discussion followed response collection. A show of hands reporting closed question answers caused surprise. The discussion which followed established that many participants had expected their own response to dominate. A second show of hands showed a close-to-unanimous view that the design/description distinction was real and important.

4.3.1. Summary
Data modeling practitioners espouse beliefs that data modeling was description in response to the open question and were evenly split between description and design in response to the closed question. The practitioners confirmed that researching the design/description dichotomy was of importance.

4.4. Survey: characteristics of data modeling
This part of our survey sought a deeper understanding of theory-in-use by addressing practitioners’ perceptions of characteristics of data modeling problems, products and processes to address RSQs 4, 7. The 25 questions shown in Appendix C used a five-point Likert scale.
The survey was benchmarked using architects and accountants. They were chosen because architecture is a design discipline whereas accounting is a process of recording, classifying, reporting and communicating, and is thus a descriptive characterization. Data modelers have been compared with both architects and accountants. Responses were received from a snowball sample of 38 accountants and 21 architects, all based in Australia. The results for these two professions were then used as benchmarks for the data modeling results.
The survey was administered to 266 attendees at seven seminars targeting data modeling practitioners in the USA, Australia, UK, and Scandinavia (the smallest with 20 attendees and the largest with 90). Participants were told, before completing the survey: “We are referring to data modeling to support the development of a relational database; not enterprise data modeling or reverse-engineering”. No significant differences were found in the results across the seminars.
Scale reliability (Cronbach’s α) was 0.73. The Corrected Item-Total Correlation (CITC) was positive (showing that the questions were measuring the same underlying construct in the same direction) for all but one item.
The exception was: “Data modeling is prescriptive rather than descriptive”, which had a CITC value of −0.15. Subsequent discussion has suggested that some respondents had interpreted “prescriptive” as applying to the modeling process rather than product (paradoxically supporting a descriptive characterization).
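The two scale diagnostics mentioned above can be sketched in a few lines. This is a minimal illustration using the textbook definitions of Cronbach's alpha and the corrected item-total correlation, not the authors' analysis script; the function names and the response matrix are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (all the same length)."""
    k = len(rows[0])
    items = list(zip(*rows))  # regroup scores so there is one tuple per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variances / total_variance)


def pearson(a, b):
    """Pearson correlation; fails if either sequence has zero variance."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    spread = sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)
    return cov / sqrt(spread)


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    items = list(zip(*rows))
    result = []
    for i, item in enumerate(items):
        rest = [sum(row) - row[i] for row in rows]  # total excluding item i
        result.append(pearson(item, rest))
    return result


# Hypothetical 5-point Likert responses (NOT data from the study):
# five respondents, three items, with the third item running in reverse.
responses = [[1, 2, 5], [2, 3, 4], [3, 3, 4], [4, 5, 2], [5, 5, 1]]
citc = corrected_item_total(responses)
```

A negative entry in `citc` flags an item that runs against the rest of the scale, which is the pattern the text describes for the “prescriptive rather than descriptive” question.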
Table 5 Summary of responses to the data modeling questionnaire. Design mean
Dimension
Property
Property mean
3.75 t(267) = 31, p < 0.0005
A. Design problems mean = 4.11 t(317) = 37, p < 0.0005
1. 2. 3. 1. 2. 3. 4. 1. 2. 3. 4. 5. 6.
4.09 4.08 4.18 4.04 3.90 2.55 3.94 4.00 3.34 4.00 3.65 2.66 3.35
B. Design products mean = 3.60 t(302) = 23, p<0.0005
C. The design process mean = 3.49 t(299) = 16, p<0.0005
Design problems cannot be comprehensively stated Design problems require subjective interpretation Design problems tend to be organized hierarchically There are an inexhaustible number of different solutions There are no optimal solutions to design problems Design solutions are often holistic responses Design solutions are a contribution to knowledge The process is endless There is no infallibly correct process The process involves finding as well as solving problems Design inevitably involves subjective value judgments Design is a prescriptive activity Designers work in the context of a need for action
Table 5 shows the mean scores (maximum score of 5) for the survey, at the Property, Dimension, and Overall levels. The one-sample t-test results indicated the significance of the difference between each mean and the neutral score of 3. Fig. 8 shows the frequency distribution of the Overall score; most values were above the neutral value. These results show that modeler-espoused characteristics fit a design characterization. Data modelers also scored significantly higher than accountants in all dimensions (p < 0.01), and significantly higher than architects in the problem dimension (p < 0.01) and overall (p = 0.02).
4.4.1. Summary
The results clearly showed that participants did perceive that the problems, products, and processes of data modeling fit the design ideal type.
4.5. Laboratory study: diversity in conceptual models
Our laboratory study examined product diversity in the conceptual data models developed by experienced data modelers for a real-world problem. The task was to develop a conceptual model for a medical research database from a description of a real business requirement. Participants were attendees at an international (practitioner-oriented) data management conference in North America; they viewed a video of the project sponsor and data administrator describing the requirements, and were given a transcript of it (see Appendix D). Participants were then given 25 min to complete the task and a further 5 min to complete a questionnaire about the process. Ninety-three models were received. Forty-nine participants responded that they understood the problem fairly well or very well, did not find it very difficult, made no guesses or only trivial guesses, and did not think their models would be much different if more time was allowed. The first author judged 66 of the models to be workable; these were submitted by the more experienced participants (8.5 years vs 3.5 years; two-tailed t(87) = 4.5, p < 0.001), who also found the problem less difficult (rating on a 5-point Likert scale of 2.6 vs 3.2; two-tailed t(89) = 3.4, p = 0.001). Eighty-eight percent used some variant of the “crow’s foot” notation. A reference set of standard entity names and definitions was synthesized to facilitate comparison of models.
4.5.1. Assessment of diversity
Seven measures of diversity were used. These are neither orthogonal nor exhaustive but are indicative of diversity.
Diversity measure no. 1 – Participants’ perceptions of difference: On completing their models, participants paired off and compared them. One percent of participants perceived the models as identical, six percent as identical except for naming or agreed errors, 53% as structurally different in minor ways, and 40% as structurally different in important ways.
Diversity measure no. 2 – Number of entities: Fig. 9 shows the frequency distribution of entity counts from the models. Subtypes were excluded from the count to improve comparison with the 77% of models which did not use subtypes.
Diversity measure no. 3 – Variety of entity names: The 93 models had 291 different entity names after removing synonyms. This count may still include unrecognized synonyms, and it includes some homonyms, whose different uses were evident from the context of their relationships with other entities.
Diversity measure no. 4 – Use of nouns from the description: The frequency distribution of entity names matching nouns from the interview transcripts is shown in Fig. 10. Of the 291 different names given to entities, comparatively few came directly from the problem description; most had been invented by the modeler.
Diversity measure no. 5 – Variability in construct use: One concept was represented in some models as an entity (52 times) and in others as a relationship (5 times). Three concepts were shown in some models as entities and in others, explicitly or implicitly, as attributes. Correlation between the three decisions was negligible in two cases and weakly positive but not significant in the third (K = 0.14, p = 0.2).
Fig. 8. Frequency distribution of overall score – 3 is neutral. (x-axis: Overall Score; y-axis: Frequency.)
t(330) = 31, p < 0.0005; t(335) = 31, p < 0.0005; t(342) = 24, p < 0.0005; t(339) = 23, p < 0.0005; t(340) = 21, p < 0.0005; t(334) = 6.4, p < 0.0005; t(320) = 19, p < 0.0005; t(342) = 23, p < 0.0005; t(337) = 4.9, p < 0.0005; t(341) = 18, p < 0.0005; t(331) = 10, p < 0.0005; t(323) = 6.7, p < 0.0005; t(337) = 5.1, p < 0.0005
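The one-sample t-tests reported above compare each mean score with the neutral value of 3. The following is a minimal sketch of that calculation, not the authors' analysis code: the function name `one_sample_t` and the example scores are hypothetical, and only the Python standard library is assumed.

```python
import math
import statistics

NEUTRAL = 3.0  # mid-point of the 5-point scale used in the survey


def one_sample_t(scores, mu0=NEUTRAL):
    """Return (t, degrees_of_freedom) for H0: mean(scores) == mu0."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)
    if sd == 0:
        raise ValueError("scores have zero variance")
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical overall scores, for illustration only (not the study's data).
example_scores = [3.4, 3.9, 3.6, 4.1, 3.2, 3.8, 3.7, 4.0]
t_stat, df = one_sample_t(example_scores)
print(f"t({df}) = {t_stat:.2f}")
```

A positive t with a small p-value indicates a mean above the neutral score; the p-value itself would be read from the t distribution with the returned degrees of freedom.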
Fig. 9. Total number of entities in each model. (x-axis: Number of Entities; y-axis: Frequency.)
Fig. 10. Number of entities corresponding to nouns in the problem description. (x-axis: Entity Names Matching Nouns; y-axis: Frequency.)
Diversity measure no. 6 – Level of entity generalization: Three concepts were represented at different levels of generalization, though there were significant correlations between the three generalization decisions, suggesting that modelers bring personal styles to the generalization decision (0.42 ≤ γ ≤ 0.77, p < 0.02).
Diversity measure no. 7 – Holistic difference (expert assessed): Nineteen experts assessed ten selected standardized models for the viability of their implementation. Standardization involved providing a common name for the same entities, removing entities outside the defined scope, and presenting the models in a common format. Diversity was supported if more than one model was judged as being practically viable. The expert modelers (each with a minimum of 15 years' experience) gave scores for overall quality, understandability, and flexibility on a 5-point Likert scale. The inter-rater reliability, measured by Cronbach’s α, was 0.92 for overall quality, 0.84 for understandability, and 0.73 for flexibility (0.7 would be considered acceptable). The level of understandability across all evaluators and models was 3.53 (s = 0.47), placing it between “neither easy nor difficult to understand” and “reasonably easy to understand”. Fig. 11 shows, for each model, the number of experts who assessed overall quality as 3 or more (3 being the mid-point of the scale: application would work with no serious problems), and the number who scored it at or above the benchmark (an average model that they would expect to encounter in their work, developed in the last ten years by someone other than themselves). Thus, between three and five models were acceptable to the majority of these experts. Evaluators were also asked to nominate the best model overall; four different models were selected.
4.5.2. Summary
The results demonstrated a diversity of objectively assessed workable solutions to the same problem and answered RSQ8.
The diversity observed was consistent with the design characterization.
4.6. Laboratory study: diversity in logical models
Our laboratory study also examined product diversity in the logical data models developed by experienced data modelers for a real-world problem.
The task involved developing a logical data model to serve as the specification for a relational database from a list of 22 attributes that the user wished to record for a real business application. The attributes were presented as a single table/relation (see Appendix E). Participants were attendees at advanced data modeling seminars: for measures 1–4 these were attendees in London, U.K., and Pittsburgh, U.S.A. (40 participants in total); for measure 5 the participants included a further 58 at two other seminars. All but two of the models produced supported the data specified by the original table and were thus workable. All but three models were fully normalized. Straightforward assumptions were made about the columns in each table for the few solutions that did not provide a full list. Consistent application of these assumptions may have led to less diversity than if this task had been completed by the modelers themselves.
4.6.1. Diversity measures
Diversity measure no. 1 – Participants’ perceptions of difference: Participants paired off and compared models. No participant reported the two models as identical, nine percent reported the models as identical except for naming or agreed errors, 39% as structurally different in minor ways, and 52% as structurally different in important ways.
Diversity measure no. 2 – Number of tables: Fig. 12 shows the frequency distribution of table counts in the models.
Diversity measure no. 3 – Variety of table names: After consolidation of obvious synonyms, the 39 models contained 66 different table names.
Diversity measure no. 4 – Construct variability: The 39 models resulted in seven concepts being represented in more than one way: as tables in some models and as columns in others (see Appendix E).
Diversity measure no. 5 – Generalization: Examination of the models revealed five different generalizations: four produced a single column from two different attributes and one changed a column’s name to increase consistency with other columns (see Appendix E). A score was then calculated for each participant by totaling the number of generalization decisions taken by that participant. Fig. 13 shows the frequency distribution of the scores. With one exception, generalization scores were not significantly correlated with the standard demographic groupings or with responses to the process questions. The sole significant correlation was that participants who had developed more than one model in practice had significantly higher generalization scores.
4.6.2. Summary
The diversity in logical models is evident from the results (RSQ9).
Fig. 11. Model acceptability. (Chart title: Acceptability of Models; x-axis: Model, 1–10; y-axis: Number of Evaluations; series: ≥Benchmark, ≥3.)
Fig. 12. Total number of tables in each model. (x-axis: Number of tables; y-axis: Frequency.)
Fig. 13. Frequency distribution of generalization scores. (x-axis: Generalization Score, 0–5; y-axis: Frequency of Score.)

Table 7 Family Tree generalization decisions.
               Relationship        Parenthood
Person         0.39 (p < 0.0005)   0.37 (p < 0.0005)
Relationship                       0.88 (p < 0.0005)
4.7. Laboratory study: style in data modeling
Our laboratory study also examined whether modelers consistently favored higher or lower levels of generalization within and between models. The task was to develop two models. As in the prior section, generalization scores were calculated for each model. These were then used to determine the consistency of decisions within each model, and the correlation of the scores between the two models. Participants were attendees at advanced data modeling seminars in the USA (two seminars) and in Stockholm, Sweden: a total of 91 participants. Three modeling problems were used, with each participant being assigned two. Two of the problems required the development of a conceptual data model from a text description and one required the development of a logical data model (see Appendix F). Models that omitted any of the constructs were excluded from analysis. Each identified construct in each model was coded according to its level of generalization, using ‘0’ for the lowest level, ‘1’ for the next-lowest, and so on. A total generalization score was determined by adding the individual levels coded. Four concepts in the Bank Loans and three concepts in the Family Tree solutions were identified as being subject to different levels of generalization. In each case, only two levels were found. With one exception, the decisions were logically independent. Appendix F contains the frequencies with which each decision was used. Recall from the previous section that there were five generalization decisions in the logical data-modeling task (see “diversity measure no. 5”). Combinations of the decisions resulted in ten versions of the Bank Loans model and five versions of the Family Tree model; the most popular version in each case accounted for 50% of the models. Tables 6 and 7 show the correlation between each pair of generalization decisions.
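The coding and scoring step described above can be sketched as follows. This is an illustrative sketch only: the decision names are taken from Table 7, but the coded values, the model identifiers, and the helper names `generalization_score` and `count_versions` are hypothetical.

```python
from collections import Counter


def generalization_score(decisions):
    """Sum the coded levels (0 = lowest) over the constructs of one model."""
    return sum(decisions.values())


def count_versions(models):
    """Count how many models share each combination of coded decisions."""
    return Counter(tuple(sorted(m.items())) for m in models.values())


# Hypothetical coding of the three Family Tree decisions (not the study's data):
# 0 = lower level of generalization, 1 = higher level.
coded_models = {
    "model_1": {"Person": 0, "Relationship": 0, "Parenthood": 0},
    "model_2": {"Person": 1, "Relationship": 1, "Parenthood": 1},
    "model_3": {"Person": 0, "Relationship": 1, "Parenthood": 1},
    "model_4": {"Person": 0, "Relationship": 0, "Parenthood": 0},
}

scores = {name: generalization_score(d) for name, d in coded_models.items()}
versions = count_versions(coded_models)
print(scores)
print(len(versions), "distinct versions")
```

Each distinct combination of coded decisions corresponds to one "version" of the model in the sense used in this section.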
Covariance, measured using the Kuder-Richardson 20 (KR20) coefficient, was 0.80 for the Family Tree decisions, 0.75 for the Bank Loans decisions, and 0.68 for the logical data modeling task generalization decisions. Generalization scores for the Family Tree and Bank Loans models were moderately positively correlated. The gamma statistic (γ = 0.69, p < 0.0005) showed a strong correlation. Thus, between-model generalization correlation was supported for the two conceptual models. There was some correlation with demographic categories, so the analysis was repeated for each of them.

Table 6 Bank loans generalization decisions.
                     Party               Party relationship   Transaction
Customer             0.77 (p < 0.0005)   0.58 (p < 0.0005)    0.27 (p = 0.03)
Party                                    0.75 (p < 0.0005)    0.28 (p = 0.03)
Party relationship                                            0.36 (p < 0.0005)

Table 8 Correlation between generalization scores across the conceptual models.
Demographic group                  γ      p
>10 models produced                0.54   <0.0005
≤10 models produced                0.27   0.32
Occupation = data modelers         0.52   <0.0005
Occupation = non data modelers     0.62   0.001
>6 years experience                0.56   0.001
<6 years experience                0.49   0.003
Total sample                       0.51   <0.0005

Table 8
shows that the correlation remained moderate and significant (p < 0.005) within all but one group. No significant correlation was found between the generalization scores for the Annual Budget (logical) model and either the Family Tree or Bank Loans conceptual models (Family Tree: γ = 0.26, p = 0.46; Bank Loans: γ = 0.53, p = 0.86).
4.7.1. Summary
Some modelers consistently choose higher (or lower) levels of generalization than others, within both conceptual and logical models and across conceptual models. This bias (or style) is not due to their level of experience (and, by implication, expertise). Consequently, RSQ9 was answered in the affirmative, supporting the conclusion that the products of data modeling were influenced by differences in the style of the practitioners who produce them.
5. Summary
Our findings were based on the four key dimensions of our design framework: general, problem, process and product. The general dimension includes scope, importance and beliefs. Data modeling was found to consist of the specification of the initial conceptual schema to meet the business requirements prior to any performance tuning (RSQ1). The research question was considered to be important (RSQ2). Data modeling practitioners were evenly divided between the belief that data modeling was design and the belief that it was description (RSQ3). Data modeling problems were seen as having the characteristics of design problems by data modeling practitioners, significantly more so than by architects and accountants (RSQ4). Data modelers generally worked from a problem statement rather than directly from observations of the UoD (RSQ4). The data modeling process was perceived as having the characteristics of design processes, similar to the perceptions of architects and significantly more than the perceptions of accountants (RSQ6).
Consistent with the design characterization, identification of entities, relationships, and attributes was not considered to be part of the business requirements analysis. Furthermore, there was no evidence of the widely advocated “view-definition, view-integration, view-reconstruction” sequence, which requires that model differences be reducible to reconcilable views (RSQ5). Data modeling products were perceived to be design products by data modeling practitioners, significantly more so than by accountants and to a similar degree as by architects (RSQ7). Conceptual data models and logical data models developed in response to a common problem were found to have substantial diversity (RSQ8, RSQ9). Data modelers frequently re-used their own or others’ patterns, significantly more than architects. Experienced conceptual data modelers re-used patterns much more than less-experienced data modelers (RSQ10). A significant correlation was found between the levels of generalization of entities within and between conceptual data models developed by the same modeler. This suggested that personal style, evidenced by generalization decisions, affected the data models that modelers produce (RSQ11).
6. Discussion and conclusions
Answers to our research sub-questions suggested that data modeling, while traditionally characterized as description, was better characterized as design, based on how it was practiced. We focused entirely on the design/description question and produced consistent evidence in favor of the design characterization. For researchers, there are two implications. First, careful design of data modeling experiments that take into account the likelihood of alternative solutions is required. Second, generalization of the results of empirical studies that use students as participants is problematic, because design skills take time to develop. Data modeling teachers should consider it as a design activity. Designing data modeling tasks by articulating a domain based on nouns and verbs that relate to the entity types and relationship types is a way to teach data modeling notation but it does not teach the practice of data modeling. Practitioners should be aware that their method is more consistent with data modeling as a design activity. They should be aware that alternative data modeling solutions may be useful and need to be evaluated for quality as part of the process.
Appendix A. Summary of participants
Research samples and research components.

Location                                                                 Number    Response rate (%)
DAMA/Metadata Conference, San Antonio, TX, USA                           112       66%
DAMA Conference, London, UK                                              17        85%
Enterprise Data Forum, Pittsburgh, PA, USA                               23        80%
DAMA/Metadata Conference, Orlando, FL, USA                               41        77%
DAMA Chapter Presentation, Portland, OR, USA                             54        90% (b)
DAMA Chapter Presentation, Phoenix, AZ, USA                              28        90% (b)
DAMA Chapter Presentation, Des Moines, IA, USA                           39        90% (b)
IRM Data Modeling Workshop, Stockholm, Sweden                            28        70% (b)
IRM/DAMA Conference, Stockholm, Sweden                                   70 (c)    90% (b)
DAMA Chapter Presentation, Sydney, Australia                             20        90% (b)
DAMA/Data Quality Conference, London, UK                                 25        83%
Wilshire Conferences Data Modeling Masterclass, Los Angeles, CA, USA     30        86%
Total                                                                    459       75%

Research component (column entries in printed order): Diversity in conceptual modeling; Espoused positions on data modeling; Diversity in logical modeling; Diversity in logical modeling; Data modeling style; Characteristics of data modeling; Characteristics of data modeling; Characteristics of data modeling; Data modeling style and Diversity in logical modeling (a); Characteristics of data modeling; Characteristics of data modeling; Characteristics of data modeling; Scope and stages; Characteristics of data modeling; Scope and stages; Data modeling style and Diversity in Logical Modeling (a).

(a) The Diversity in Logical Modeling task was incorporated in the Data Modeling Style task in these two locations.
(b) Estimate – exact attendee numbers not available.
(c) This group included 28 who attended the previous item.
Appendix B. List of participants in the thought leaders interviews
The participants, and their positions or roles at the time of interview were:
Peter Aiken, data management consultant, Associate Professor at Virginia Commonwealth University.
Richard Barker, company director, architect of the Oracle CASE tool.
Michael Brackett, President of the International Data Management Association.
Harry Ellis, data modeling consultant to the British Department of Defence.
Larry English, leading proponent of data quality techniques.
Terry Halpin, Professor at Northface University Utah.
David Hay, independent data modeling consultant and educator.
Steve Hoberman, global reference data manager with Mars, Inc.
Karen Lopez, data modeling consultant and commentator.
Dawn Michels, data modeling specialist, Vice President of Chapter Services for DAMA International.
Terry Moriarty, president of Inastrol data modeling consultancy.
Ronald Ross, editor of the Database Newsletter for 22 years.
Robert Seiner, data management consultant.
Alec Sharp, independent data and process modeling consultant.
Len Silverston, data modeling consultant, industry educator.
Eskil Swende, Chief Executive of the IRM group; President of the Scandinavian chapter of the Data Management Association.
John Zachman, industry consultant and educator.
Appendix C. Survey questions – perceptions of characteristics of data modeling
Properties of design organized as a set of questions. (Overall: Design.)

Dimension A. Design problems
Properties:
1. Problems cannot be comprehensively stated
2. Problems require subjective interpretation
3. Problems tend to be organized hierarchically
Survey questions:
2. Data modeling problems are often full of uncertainties about objectives and relative priorities
3. Many requirements do not emerge until some attempt has been made at developing a model
4. Objectives and priorities are likely to change during the modeling process
5. In establishing requirements for a data model, something that seems important to one data modeler may not seem important to another data modeler
6. In establishing requirements for a data model, something that seems important to one business stakeholder may not seem important to another business stakeholder
7. Modeling problems are often symptoms of higher level problems

Dimension B. Design products
Properties:
1. There are an inexhaustible number of different solutions
2. There are no optimal solutions to design problems
3. Design solutions are often holistic responses
4. Design solutions are a contribution to knowledge
Survey questions:
9. Most data modeling problems do not have a single correct solution
10. In most practical business situations, there is a wide range of possible (and workable) data models
11. Data modeling almost invariably involves compromise
12. Data modelers will almost invariably appear wrong in some ways to some people
13. It is not usually possible to dissect a data model and identify which piece of the model supports each piece of the business requirements
14. I frequently re-use patterns (structures) from other data models that I have developed myself
15. I frequently re-use patterns (structures) that I have seen in models developed by others

Dimension C. The design process
Properties:
1. The process is endless
2. There is no infallibly correct process
3. The process involves finding as well as solving problems
4. Design inevitably involves subjective value judgments
5. Design is a prescriptive activity
6. Designers work in the context of a need for action
Survey questions:
17. Identifying the end of the data modeling process (i.e. when to stop modeling) requires experience and judgment
16. There is no infallible correct process that (if properly followed) will always produce a sound data model
21. Data modeling requires a high level of creative thinking
23. I find it difficult to remain dispassionate and detached in my data modeling work
24. Data modeling is prescriptive rather than descriptive
25. The final data model is often a result of compromise decisions made on the basis of inadequate information
The 19 Characteristics at the lowest level were derived from concepts in the descriptions of the Properties, and were operationalized as questions that could be scored on a Likert scale (the numbers, complete with gaps in the sequence, are the numbers of the corresponding questions in the resulting questionnaire). Scores were computed by taking the mean to provide a score for the higher level Property. Then the Property scores were computed to provide Problem, Product, and Process scores, and ultimately an overall Design score. There was some subjectivity in the identification of these concepts and the framing of the questions. There was no question addressing the Property (of design products) Design solutions are parts of other design problems. This Property proved difficult to communicate in a simple question or questions and after pilot testing it was excluded. In all, six questions addressed problem, seven addressed product, and six addressed process. Five further questions were added based on other differences between description and design. These were classified under their relevant dimensions. Questions added to characteristics of data modeling survey.
Additional survey question (Dimension)
8. Business requirements are often negotiable (Problem)
18. When I am developing a data model, I sometimes produce more than one workable solution, and then choose the best one (Process)
19. I often start modeling before I have a thorough understanding of business requirements (Process)
20. Sometimes, even when I understand the business requirements, I find it difficult to produce a data model (Process)
22. I have experienced “eureka” moments (sudden and dramatic insights or solutions to problems) in my data modeling work (Process)
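The score roll-up described in this appendix (question scores averaged into Property scores, then into Dimension scores, then into an overall Design score) can be sketched as below. This is a sketch under stated assumptions: it assumes an unweighted mean at every level, and the grouping of question numbers under property labels such as `A1` is a hypothetical fragment for illustration, not the published assignment.

```python
from statistics import fmean

# Hypothetical fragment of the hierarchy: dimension -> property -> question numbers.
# The grouping of questions under each property is assumed for illustration.
HIERARCHY = {
    "Problem": {"A1": [2, 3, 4], "A2": [5, 6]},
    "Product": {"B1": [9, 10]},
    "Process": {"C1": [17], "C2": [16]},
}


def roll_up(responses, hierarchy=HIERARCHY):
    """Average Likert responses upward; assumes unweighted means at each level."""
    property_scores = {
        dim: {prop: fmean([responses[q] for q in questions])
              for prop, questions in props.items()}
        for dim, props in hierarchy.items()
    }
    dimension_scores = {
        dim: fmean(list(props.values())) for dim, props in property_scores.items()
    }
    overall = fmean(list(dimension_scores.values()))
    return property_scores, dimension_scores, overall


# Hypothetical responses from one participant (question number -> score 1..5).
one_response = {2: 4, 3: 4, 4: 4, 5: 2, 6: 2, 9: 5, 10: 3, 17: 3, 16: 5}
props, dims, overall = roll_up(one_response)
print(dims, round(overall, 2))
```

Averaging at each level, rather than over all questions at once, keeps a property with many questions from dominating its dimension score.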
A further question was added as Question 1 in the survey to determine whether the most difficult part of data modeling was in understanding the business requirements. It served three purposes: (1) To answer the question: are requirements fixed or negotiable? If requirements are negotiable, but perceived as fixed by some modelers (or vice versa), we would expect those modelers to find the task difficult. (2) To determine whether perceived difficulty in understanding requirements correlated with other indicators of design. Incompleteness, subjectivity, and negotiability of requirements are cited as properties of design; if the task is essentially descriptive, then gaining an understanding of it is the central (and most difficult) task. (3) In eliciting a deeper understanding of either description or design positions, to reduce the possibility that respondents would recognize the dichotomy behind their questions and answer. The question did not signal the dichotomy. It was placed first on the questionnaire. The model was developed specifically for our research, in the absence of established measures for differentiating description and design activities. We were obliged to rely solely on the soundness of the underlying theory (and on its operationalization) when drawing conclusions from the results. Questions were adapted, through minor re-wording, to enable them to be used with two other professional groups, viz. architects and accountants. The two groups were chosen because: (1) Architecture is generally recognized as a design discipline and is frequently employed as a metaphor for IS tasks and deliverables.
(2) Accounting is a process of recording, classifying, reporting and communicating, a definition consistent with the descriptive paradigm. Data modelers have in fact been compared with accountants: “Just as an accountant might use a financial model, the analyst can develop an entity model”. To encourage a focus on common tasks, the accountants’ questions were framed in the context of preparing a set of accounts for a business and the architects’ questions in the context of designing a building.
Appendix D. Laboratory materials – diversity in conceptual modeling
The problem to be analyzed was presented to the participant in three parts: (1) A videotaped description of the business requirements as recorded by the project director and also by the manager responsible for managing the production system. The two stakeholders were responding independently to our request to tell us about a project and the data that was needed to run it. (2) A verbatim transcript of the videotape (see below), with a short glossary of terms added by the author in consultation with the project director. (3) A list of questionnaires to be used for data collection, with excerpts from two questionnaires.
Postnatal depression interview transcript and glossary.
Case study interview transcripts
Key Terms (as used in the transcripts):
Post-natal or post-partum – after the birth of a baby
Ante-natal – before the birth of a baby (i.e. during pregnancy)
Intervention – action taken by a health professional e.g. counseling, prescription of drugs. Also used by Prof. Buist (final sentence) to mean “actions taken to educate health professionals and the public about Post-Natal Depression.”
Screening – administering a questionnaire (to a woman participating in the study)
Professor Anne Buist, Director, National Post-natal Depression Initiative
This project is looking at ante-natal and post-natal depression, and it’s going to run over four years, and cover five states of Australia. It’s being funded by Beyond Blue, which is the Australian national depression institute, and it’s going to cover somewhere between 50,000 and 100,000 women over this time period. The data collection is in three kind-of-separate bits: Firstly across all states we’re going to be screening women at a minimum of two time points – once through the pregnancy and once post-natally. And the data we’re collecting there will be the same in each state. However there’s also going to be state-specific interventions for these women, and that will be evaluated both pre and post intervention with another set of questionnaires that women or/and the research assistants will be completing. And these may be at up to six different time points in covering through pregnancy and post-partum. The other sort of aspect of the data collection is before we even start the study and at the end of the study we’re going to be sending questionnaires to both women who have had babies and health professionals (general practitioners, midwives and maternal child health nurses), and evaluating their understanding of post-natal depression with respect to what it is, with respect to stigma and with respect to treatment. And we’ll be evaluating that again after our four-year time period where we’re going to be doing some interventions and in particular increasing awareness of post-natal depression.
Dr Justin Biltza, Project Officer, National Post-natal Depression Initiative
So really there are two types of data that we’re collecting: the first lot being patient demographic data (name, address, date of birth, contact details), and the other set of data is based on a series of questionnaires which are either “short answer” or the selection of a score based on (say) a range from (say) “good” to “bad”.
There’s approximately forty questionnaires that we’re using, five which form a core key component that everyone in the survey is doing, but then each of the states has a number of individual surveys that they’re using, none of which cross over, so one of the problems that we have is ensuring that we collect all the data on all the patients.
A couple of the other problems we have are the need for a central identification number that we need to generate: we can’t use (for instance) a Medicare number or social security number because of privacy issues. Another one of the problems that we have is that a lot of these surveys are used multiple times – two, sometimes three times. So it’s the ability to be able to collect data on the third survey, linking it up with the same patient that we used for the first survey. So if you were to participate in the study, you would come into your ante-natal visit and with the help of staff fill out (say) four or five questionnaires asking you about your mood and how you’re feeling. You would then answer the same questionnaires again at your post-natal visit, and the reason we have the same questionnaires again is just to see how the mood and the response has changed over a period of time.
Instructions and a set of “process” questions addressing assumptions, level of difficulty and use of patterns (common to all modeling exercises used in our research) were added to the standard demographics questionnaire.
Appendix E. Laboratory materials – diversity in logical modeling
The task was to produce a logical model based on the conceptual model (see Fig. E1). The logical model needed to be a workable specification for a database: a single table/relation is needed so that data can be stored: it is already normalized. The quarterly items are not repeating groups; they are different items with different names and meanings. The task and associated questionnaire were administered to 96 attendees at four advanced data modeling seminars. For some of the measures of diversity, only the 39 responses from London (a substantial European conference) and Pittsburgh (a substantial North American conference) were included. The last three concepts in Table E1 did not directly reflect columns in the original model but were added by modelers to capture semantics lost when generalizing some of the original columns. Ignoring differing levels of generalization and considering only the choices of representing each concept as either a column or table resulted in 19 distinct models. There were five situations in which some participants had: (a) explicitly generalized two or more of the attributes in the original model to produce a single column, e.g. generalizing Budget First Quarter Material, Budget Second Quarter Material, Budget Third Quarter Material and Budget Last Quarter Material into a single column Quarterly Material Budget plus a Quarter Number column to identify which quarter the amount applied to (Decision 1 in Table E2); or (b) altered columns to increase consistency, e.g. replacing Actual Total Material with Actual Fourth Quarter Material (Decision 5 in Table E2) to make it consistent with the representation of budgeted amounts and comparable with the other (quarterly) actual material amounts. Although these decisions are not manifested as generalizations, they are based on the recognition of commonality, and thus have been treated together with the explicit generalizations. Table E2 shows the five situations and the different decisions made by participants. Fig. E2 shows the frequency distribution of the decisions. Nineteen of the 22 modelers who generalized Budget and Actual amounts (Gen BA) also made the other four generalizations, and this covariance amongst the decisions was supported by a Kuder-Richardson 20 (KR20) statistic of 0.68. This suggests an underlying propensity to generalize on the part of the modeler. Correlations between individual decisions (φ) ranged from negligible to strong, and were positive in all cases.
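For readers unfamiliar with the KR20 statistic quoted above, the following sketch computes it for a matrix of yes/no (1/0) decisions, one row per modeler and one column per decision. It uses the standard formula KR20 = k/(k − 1) · (1 − Σ pᵢqᵢ / σ²), and assumes the population variance of the total scores (conventions differ on this point). The function name `kr20` and the example rows are hypothetical and are not the study's data.

```python
from statistics import pvariance


def kr20(rows):
    """Kuder-Richardson 20 for dichotomous items.

    rows: one list of 0/1 values per participant, all of the same length.
    Assumes the population variance of total scores (divisor n).
    """
    n = len(rows)
    k = len(rows[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two participants and two items")
    totals = [sum(r) for r in rows]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(r[j] for r in rows) / n  # proportion choosing 'yes' on item j
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical decisions for six modelers over five generalization decisions.
example = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(example), 2))
```

Values near 1 indicate that modelers who make one generalization tend to make the others as well, which is the pattern this appendix reports.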
Department Number (Primary key item)
Year (Primary key item)
Approved-By
Budget-First-Quarter-Material
Budget-Second-Quarter-Material
Budget-Third-Quarter-Material
Budget-Last-Quarter-Material
Actual-First-Quarter-Material
Actual-Second-Quarter-Material
Actual-Third-Quarter-Material
Actual-Total-Material
Budget-First-Quarter-Labor
Budget-Second-Quarter-Labor
Budget-Third-Quarter-Labor
Budget-Last-Quarter-Labor
Actual-First-Quarter-Labor
Actual-Second-Quarter-Labor
Actual-Third-Quarter-Labor
Actual-Total-Labor
Budget-Other
Actual-Other
Discretionary-Spending-Limit
Fig. E1. Annual Budget conceptual model.
Table E1 Alternative representations of concepts.
Concept               Not present   As column   Literal table   Generalized table (in scope)   Generalized table (beyond scope)   Total
Approved by           0             13          7               1                              18                                 39
Department            0             4           34              1                              0                                  39
Disc spending limit   0             27          12              0                              0                                  39
Year                  0             26          6               7                              0                                  39
LMO-type              14            7           18              0                              0                                  39
Quarter               10            13          8               7                              1                                  39
BA-type               28            7           4               0                              0                                  39
Table E2 Generalization choices in the logical data models.

| Decision number | Decision name | Yes | No | Both |
|---|---|---|---|---|
| 1 | Gen QTR (generalization decision) | Quarterly amount columns generalized – no columns specific to a particular quarter | Columns for individual quarters | N/A |
| 2 | Gen LMO (generalization decision) | Labor, material, other generalized – no columns specific to a particular type | Specific columns for Labor, Material, Other amounts | N/A |
| 3 | Gen BA (generalization decision) | Budget and actual columns generalized – no columns specific to a particular type | Specific columns for Budget and Actual amounts | N/A |
| 4 | Other QTR (consistency decision) | Support for quarterly values for "Other" amounts – no column for annual amount | Support only for annual values for Other amounts | Both options supported |
| 5 | Fourth QTR LM (consistency decision) | Direct representation of fourth quarter labor and material actual amounts | Annual totals held for Labor and Material Actual amounts | N/A |
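To make the contrast between literal and generalized representations concrete, the sketch below unpivots one literal Annual Budget row (column names as in Fig. E1) into rows of a fully generalized structure, corresponding to a "Yes" on Gen QTR, Gen LMO and Gen BA. It is only one possible illustration, not a model produced by any participant; the function `generalize`, the output field names, and the sample values are all hypothetical.

```python
# Map the quarter words used in the Fig. E1 column names to quarter numbers.
QUARTERS = {"First": 1, "Second": 2, "Third": 3, "Last": 4}


def generalize(record):
    """Unpivot one literal Annual Budget row into generalized amount rows."""
    rows = []
    for column, amount in record.items():
        parts = column.split("-")
        if parts[0] not in ("Budget", "Actual"):
            continue  # key and non-amount columns are not amounts
        ba_type, lmo_type = parts[0], parts[-1]
        # Quarterly columns look like Budget-First-Quarter-Labor;
        # anything shorter (e.g. Actual-Total-Labor, Budget-Other) is annual.
        quarter = QUARTERS.get(parts[1]) if len(parts) == 4 else None
        rows.append({
            "department_number": record["Department Number"],
            "year": record["Year"],
            "ba_type": ba_type,    # Budget or Actual
            "lmo_type": lmo_type,  # Labor, Material or Other
            "quarter": quarter,    # None means an annual amount
            "amount": amount,
        })
    return rows


# HYPOTHETICAL literal row holding only a few of the Fig. E1 columns.
literal_row = {
    "Department Number": 10,
    "Year": 2012,
    "Approved-By": "(hypothetical approver)",
    "Budget-First-Quarter-Labor": 100.0,
    "Actual-First-Quarter-Labor": 95.0,
    "Actual-Total-Labor": 400.0,
    "Budget-Other": 50.0,
}

for row in generalize(literal_row):
    print(row)
```

In the generalized form, adding a new amount type or period requires a new row value rather than a new column, which is the trade-off that the generalization decisions in Table E2 capture.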
[Figure: bar chart titled "Frequency of Design Options". Vertical axis: Frequency (0 to 90). Horizontal axis: Design Option (Gen Qtr, Gen LMO, Gen BA, Other Qtr, Fourth Qtr LM). Series: Yes, No, Unclear, Both.]
Fig. E2. Frequency of design options.
Appendix F. Laboratory materials – style in data modeling

Three data modeling problems were used in this research component.

The Annual Budget problem: participants were presented with a conceptual model and some supporting information, and asked to produce a logical data model (the model shown in Appendix E).
A Bank Loans problem: a simplified version of a real example, presented as a short plain-language description written by the author.
A Family Tree problem: it included the concept of marriage, and was presented as a short plain-language description written by the author.

Bank loans data modeling problem.
Bank loans
To support the business of a bank, we need to record details of personal loans, housing loans and motor vehicle finance loans. Against each loan, we need to record the details of the borrower(s), the Loan Officer who approved the loan, and (in some cases) a guarantor. We also need to record payments, drawings (initial and further borrowings) and interest transactions against each loan.

Family tree data modeling problem.
Family tree
We are developing a database to record details of a family tree. For each person of interest to us, we need to be able to record details (where known) of their mother, father, children, and marriages, and their date of birth, death and marriages.
[Figure: bar chart titled "Frequency of Design Option". Vertical axis: Frequency (0 to 60). Horizontal axis: Design Option (axis labels as extracted: Customer, Party, Party Relationship, Transaction). Series: Generalised, Not Generalised.]
Fig. F1. Bank Loans generalization decisions.
Figs. F1 and F2 show the frequency with which each generalization option was used in the two models.

[Figure: bar chart titled "Frequency of Design Options". Vertical axis: Frequency (0 to 70). Horizontal axis: Design Option (Person, Relationship, Parenthood). Series: Generalised, Not Generalised.]
Fig. F2. Family Tree generalization decisions.

Graeme Simsion is an Information Systems Consultant, Educator, and Researcher. For 20 years he was CEO of a business and information systems consultancy with offices in three Australian cities. His PhD from The University of Melbourne examined attitudes and practices of data modeling practitioners. He is the author of Data Modeling Essentials, one of the most widely used practitioner texts on the subject, Data Modeling Theory and Practice, and numerous academic and practitioner articles, and is a regular speaker at industry and academic forums. His current focus is on improving the consulting skills of business and information systems professionals.

Simon Milton is a Senior Lecturer in the Department of Computing and Information Systems at The University of Melbourne, and received his PhD from The University of Tasmania in which he reported the first comprehensive analysis of data modeling languages using a common-sense realistic ontology. Dr Milton continues his interest in the ontological foundations and practice of data modeling. He is also interested in the value and use of ontologies for business and biomedicine.

Graeme Shanks is an Australian Professorial Fellow in the Department of Computing and Information Systems at The University of Melbourne. He received his PhD from Monash University. His research interests focus on the management and impact of information systems, business analytics, data quality and conceptual modeling. Graeme has published in journals including MIS Quarterly, Journal of Information Technology, Information Systems Journal, Information & Management, Journal of the AIS, Electronic Commerce Research, Journal of Strategic Information Systems, Information Systems, Behaviour and Information Technology, Communications of the AIS, Communications of the ACM, and Requirements Engineering.