Int. J. Man-Machine Studies (1992) 37, 83-101
Conceptual data modetling in database design: simlaritii and differences between expert and novice designers DINESH BATRA
Decision Sciences & Information Systems, College of Business Administration, Florida International University, Miami, FL 33199, USA JOSEPH G.
DAVIS
Management Science and Information Systems, University of Auckland, Auckland, New Zealand
Private Bag,
(Received 3 May 1990 and accepted in revised form 3 August 1991) This paper explores the similarities and differences between experts and novices
engaged in a conceptual data modelling task, a critical part of overall database design, using data gathered in the form of think-aloud protocols. It develops a three-level process model of the subjects’ behavior and the differentiated application of this model by experts and novices. The study found that the experts focused on generating a holistic understanding of the problem before developing the conceptual model. They were able to categorize problem descriptions into standard abstractions. The novices tended to have more errors in their solutions largely due to their inability to integrate the various parts of the problem description and map them into appropriate knowledge structures. The study also found that the expert and novice behavior was similar in terms of modelling facets like entity, identifier, descriptor and binary relationship, somewhat different in modelling ternary relationship, but quite different in the modelling of unary relationship and category. These findings are discussed in relation to the results of previous expert-novice studies in other domains.
1. Introduction Developing a conceptual data model based on the detailed information requirements provided by users is a critical and demanding task in the overall database design. It is in this phase that the structure of the database to be implemented is captured along with the semantic constraints. The conceptual model is usually easy to understand and can form the basis for communication with users. It does not include implementation details. This enables the users and designers to focus on specifying the properties of data without being concerned with file structures and storage details (CODASYL, 1971; Elmasri & Navathe, 1988). A good proportion of research in the area of conceptual modelling has been devoted to introducing additional formalisms for capturing greater meaning in the representations (Chen, 1976; Smith & Smith, 1977; Hammer & McLeod, 1981; Hull & Ring, 1987; see Peckham & Maryanski, 1988 for a detailed review), and the comparison of different data models (Lochovsky & Tsichritzis, 1977; Brosey & Shneiderman, 1978; Hoffer, 1982; Jahn & Naumann, 1985; Ridjanovic, 1986; Shoval & Even-Chaime, 1987; 83 0020-7373/92/070083 + 19$03.00/O
@ 1992 Academic Press Limited
84
D. BATRA AND J. G. DAVIS
Jarvenpaa & Machesky, 1989; Batra, Hoffer & Bostrom, 1990) for ease of representation from a human factors perspective. The processes and expertise employed by designers in eliciting user requirements and representing them in a conceptual model have received relatively little research attention. A deeper understanding of these processes can provide useful insights for aiding pedagogy and building knowledge-based systems to support and partially automate database design. With the advent of end-user computing (traditional end users designing and developing applications), database design is not restricted to well-trained, experienced designers (Davis, 1985). This necessitates greater diffusion of conceptual modelling and design skills across a wider cross-section of end users. Such diffusion can be greatly facilitated if the skills and expertise involved in conceptual modelling can be better understood through careful scrutiny and analysis. This paper attempts to enhance our understanding of the cognitive processes underlying conceptual data modelling through an exploratory investigation into the similarities and differences between experts and novices engaged in such a task. It employs protocol analysis, a process tracing methodology which has been successfully used in a number of other domains. We proceed to synthesize a process model of problem solving in this domain, and use the model to illustrate expert-novice differences and strategies used by experts. This paper is divided into six sections. The next section presents a survey of previous studies that have motivated this research. In the third section, the characteristics and distinctive features of conceptual data modelling as a problemsolving task are outlined. The fourth section presents the methodology, profiles of the experts and novices who participated in the study and a description of the task presented to the subjects. This is followed by the results section which includes the analysis of data in the form of verbal protocols. The process model developed and strategies followed by experts are also included in this section. The sixth section discusses the results obtained and synthesizes them with the findings concerning expert-novice differences in other domains. The final section outlines the implications and conclusions of this study. 2. Previous research A growing body of research has examined the differences in the problem-solving processes employed by experts and novices in a wide variety of domains. Studying such differences can contribute to a deeper understanding of what the expert does differently from the novice to account for the generally observed superiority of the expert (Larkin et al., 1980; Chi, Glaser & Farr, 1988). While some of these differences may be applicable only to the specific domain in question, it is also possible to account for some generalizable differences across domains. We proceed to review the prior research in a domain-specific manner and then to synthesize these results in a more general fashion to the extent it is possible. The superiority of experts based on efficient processing of large amounts of information was one of the first expert-novice differences documented. In the game of chess, Chase and Simon (1973) found that chess masters (comparable to experts) worked with familiar configurations of several chess pieces recognized as distinct “chunks” that could be evoked from memory with little effort. Performance
CONCEI’RJAL
DATA
MODELLING
85
differences between experts and novices in terms of their ability to recall the position of chess pieces on the board were not found when the pieces were randomly placed. Similar information-processing behavior by experts has been reported in the game of Go (Reitman, 1976), bridge (Engle & Bukstel, 1978) and in solving physics problems (Larkin et al., 1980). Larkin et al. (1980) also speculated that the greater speed with which the experts solved physics problems is related to their ability to execute the problem-solving steps in “compiled” form as opposed to “interpreted” form. Chi, Feltovich and Glaser (1981) and Larkin (1983) also found that experts tended to categorize physics problems into standard types based on fundamental principles of the domain. Empirical studies of skill differences between experts and novices in the context of information systems have tended to focus on programming tasks. Gugerty and Olson (1986) investigated expert-novice differences in debugging computer programs written in LOGO and Pascal. They found that experts were faster and more successful at finding bugs and suggested that the experts’ superiority could be traced to a more comprehensive understanding of the program and their consequent ability to generate and test high quality hypotheses concerning the bugs. It should be noted that the novices in this study did not do qualitatively different things as compared to the experts; it is just that the experts were able to tap into their knowledge structures to work faster and more correctly. These were consistent with the results obtained by Jeffries (1981, 1982). In her study, the experts spent a larger proportion of their time in the comprehension of the program to develop a more complete representation of the program in addition to better remembering the details of the program. These findings were also in accordance with the results obtained by McKeithen et al. (1981) who found that expert computer programmers were able to process large amounts of relevant information compared to novices because of their ability to organize information meaningfully into useful chunks. In studying expert-novice differences in the area of mathematical programming, Orlikowski and Dhar (1986) confirmed several results obtained in physics problems solving. They also proposed a differentiated model of knowledge organization between experts and novices. Experts’ concepts were found to be more finely differentiated than those of the novices. This differentiation both at structural and semantic levels enabled them to categorize problems into standard formulations, and facilitated the association of problem features with appropriate meaning. The use of accounting information in financial decision making has been studied using protocol analysis by Bouwman among others (Bouwman 1982, 1984). Dillard (1984) provides a review of this research. Bouwman (1982) compared the decision making processes of experts and novices engaged in a financial analysis task. The verbal protocols obtained were split up into distinct decision-making “processes”, each consisting of a goal and one or more activities. Such processes constituted the basic units of the Problem Behavior Graph that provided a trace of the subjects’ decision-making behavior. Bouwman reported that at the global level, experts and novices used similar decision-making processes. However, the relative frequencies of specific processes were significantly different. Experts tended to periodically summarize the results and formulate useful hypotheses. It was also reported that the financial analysis task can be decomposed into three non-contiguous phases: (1) examination of given information, (2) integration of observation and finding and (3)
86
D. BATRA
AND
.I. G. DAVIS
reasoning. The most significant differences between experts and novices occurred during the reasoning phase (Bouwman, 1982, 1984). The review of the results of the expert-novice problem-solving comparisons from a cross section of domains presented above cannot fully endorse the claim of Larkin et al. (1980), except at an abstract level, that “. . . expertness probably has much the same foundations wherever encountered” (p. 1336). The aspects that have been found to be common across most of such studies include (i) ability of experts to process large amounts of information into meaningful chunks; (ii) the consequent facility to trigger such structures with little effort; (iii) categorization of problems into standard types based on underlying domain principles. However, the previous research is inconclusive as to whether the experts and novice use the same, or similar models, differently (Bouwman, 1982), or if experts possess a differentiated model as compared to novices (Chi, Feltovich & Glaser, 1981; McKeithen, ef al., 1981; Orlikowski & Dhar, 1986). 3. Conceptual
modelling
as a problem-solving
task
Conceptual modelling involves the representation of the entire information content of the database being designed in somewhat abstract terms relative to the way the data is physically stored (Date, 1990). Essentially, it is the process of identifying entities, relationships between the entities, attributes and categories. As suggested previously, a number of methodologies and data models have been proposed in the database literature for developing such a model (Chen, 1976; Smith & Smith, 1977; Hammer & McLeod, 1981; Teorey, Yang & Fry, 1986; Blaha, Premerlani & Rumbaugh, 1988). While they differ in the specific notations proposed, there are a few basic “facets” common across these representation schemes. Batra et al. (1990) have identified the following “facets” as comprising the typical set of constructs that the modeller will need to work with in building the conceptual model against a set of given information requirements: entities, relationships qualified by degree and connectivity, attributes (identifiers and descriptors) and categories. These facets are by no means independent of each other; entities, relationships and attributes may be viewed as the primitive constructs of the conceptual model while the category facet supports generalization abstraction. In this view, conceptual modeling is the process of identifying these facets as they apply to the given situation and representing them in an integrated fashion. The notion of a facet permits a more micro-level analysis of a conceptual data model than would be possible otherwise. This becomes useful in carrying out a finely grained assessment of the performance of experts and novices engaged in a conceptual modelling task in which the subjects need to identify instances of such facets in the given situation and represent them in accordance with the convention they are used to. These conventions could range from a relational representation to more semantic representations such as the extended entity relationship (EER) and object-oriented models. It should be noted that the modelling task in this study is somewhat different from the ones in which expert-novice problem-solving process differences have been studied previously. The idea of unique correct solutions is generally central to domains such as physics, linear programming, puzzle solving and program debugging where the task usually involves focused thinking in rather
CONCEPUAL
DATA MODELLING
87
narrowly defined problem spaces. The conceptual data modelling problem, in most realistic cases, involves a more open-ended, semantically rich problem space in which the modeller is forced to engage in both broader conceptual thinking, as well as focused problem-solving activities. It is more realistic to think of the solutions (conceptual models developed) in terms of degree of correctness than a unique correct solution. In this respect, the domain resembles the accounting problem studied by Bouwman (1982). We view the differences in domains as an opportunity to add to the available body of knowledge on expert-novice differences across multiple domains. 4. Research methodology RESEARCH STRATEGY AND PROCEDURE
As mentioned previously, the study employed the protocol analysis methodology for data gathering and analysis. Ericsson and Simon (1980) have convincingly argued that the verbal reports obtained provide the means to understanding the cognitive processes of subjects involved in performing tasks or solving problems in different domains. This approach becomes particularly useful in domains in which the task is relatively complex and open-ended since the verbal data enable us to study in greater detail the intermediate stages of such processes. In this study, subjects were provided with a case and asked to prepare a conceptual representation of the problem using the relational or any other data model and to concurrently verbalize their thoughts as they proceeded with the task. They were expected to show all the rough work as they developed the solution. One of the authors acted as the user, so as to provide clarifications if needed. The subjects were given a sample demonstration of “thinking aloud” as the conceptual model was being developed. It was observed that the subjects quickly adapted to this mode of working. The verbal protocols provided by the subjects were audiotaped. These protocols were transcribed from the tapes into a text document by a trained secretary. SUBJECT PROFILE
Nine subjects-four novices and five experts-participated in the study. The novices in our study were undergraduate and graduate students who had completed an introductory database course, but did not have extensive design experience. The novices had learned conceptual, logical and physical database design, and had used standard relational DBMSs to define and manipulate data. The experts chosen showed more diversity in their background. Since expertise can be acquired through practical experience, as well as through extensive study, it was decided to have experts who could be placed at various points in this spectrum. Six experts were considered for the study. Of these, one subject who ran a small consulting firm, had no formal training in database design and had learned database concepts solely through experience. His erratic performance did not seem representative of an expert, and hence was excluded from the study. All the remaining five experts had formal training, as well as practical experience in database design. One of the experts was an MIS doctoral student who was
88
D. BATRA
AND J. G. DAVIS
completing his dissertation on logical database design. The second expert had a Master’s degree in Computer Science and was employed in a small software company which specialized in the design of object-oriented databases. Two experts had been working in database administration of fairly large-sized companies for about 10 years each and were part-time faculty in local universities. The fifth expert was full-time faculty in a university teaching introductory MIS courses as well as actively consulting for microcomputer-based database systems. He had previously held a senior position in the MIS department of an airline company. TASK The case that provided
the information requirements for the conceptual modelling task was prepared by the authors (see Appendix A). The case description was adapted from a real application for which a prototype had been developed by senior level computer science students. One of the authors had reviewed the conceptual design of the system. The case was adapted based on interviews with a systems professionals in the organization. As with any case material, the requirements mentioned in the case were somewhat incomplete and ambiguous. This was, however, a desirable feature since it forced the subject to make assumptions or seek clarifications from the user. The case was semantically rich and included the following facets: entity, category, identifier, descriptor and the following types of relationships: unary many-many, binary one-many, binary many-many, and ternary many-many-many.
5. Results The verbal data in the form of transcripts and the written trace on worksheets were initially analysed by the authors to identify and document the stages the subjects went through from the initial reading of the case to the completed conceptual data model. This was intended to provide a working model of the process which would permit a more detailed analysis of the data. Such a model becomes useful in analysing lengthy protocols dealing with relatively open-ended tasks (Bouwman, 1982). In identifying such a working (process) model, we were guided primarily by the ANSI/SPARC (American National Standards Institute/Systems Planning and Requirements Committee) architecture which explicitly recognizes distinct levels of abstraction in conceptual database design and implementation. This study focused on the translation of user requirements (external view level) into a conceptual data model. The subjects seemed to operate at three distinct levels of abstraction and they iterated among these levels over the time they worked on the task. These levels were applicable to both experts and novices through the frequency and pattern of iterations and the total time spent at each level differed between experts and novices. Figure 1 presents the process model. The specific activities at each level are described below: 1.ENTERPRISE LEVEL
During the enterprise phase, the subject would read, contemplate the requirements, comment, elicit user requirements (from the simulated user), seek clarifications or
89
CONCEPTUAL DATA MODELLING
Recognition level
FIGURE1.
Process model of conceptual database design.
establish connections. The focus at this level is on developing understanding of the problem domain.
a reasonable
2. RECOGNITION LEVEL
At this level, the subject would focus on some specific aspect of the user requirements and try to understand the sub-problem at hand, and this would trigger the appropriate knowledge structures in his or her repertoire. 3. REPRESENTATlON LEVEL
The representation phase constituted the operationalization of the subject’s understanding of the structure into a conceptual data representation using the relational (or any other data) model. This phase also included the verification of the representation to ensure it satisfied the user requirements. The following examples illustrate the differences between the three levels. Suppose that one is modelling the fact that an equipment is composed of parts which in turn may be composed of other parts. At the enterprise level, one may expect a protocol segment of the form: “Now, I am looking
at parts.
This example
deals with centrifugal
pumps.
. . it is
composed of body, impeller and so on. Then, a part is made of smaller parts . . I guess I have a question. Can a sub-part go into many parts?” During the recognition phase, the subject would attempt to grasp the underlying structure. He may articulate some key phrases which suggested his understanding of the database structure of the description under focus. “So it seems that there is a recursive relationship. But it has to be captured in tabular form. How will I do this? I guess if I have a table for parts, one for the relationship between parts and sub-parts. . . and I guess, this would be different from the one for the relationship between parts and equipment . . .”
90
D. BATRA
AND
J. G. DAVIS
The beginning of the recognition level may be marked by an expression of uncertainty about an aspect of user requirements. The recognition phase can be long and, at time, arduous if the subject has difficulty understanding the situation. “I am having difficulty understanding this . . . it does not say how many levels of parts. But do we have to know that? I guess not. I must find a way to capture this irrespective of the number of the levels. Maybe, I should create another relation. This would connect equipment, part and sub-part. . No, actually, there are two relations. . .”
The representation level is usually a mechanical process where the designer gathers the complete information for the situation under focus, and actually develops
the representation.
“I am now creating a relation for the relationship between parts and sub-parts _. . Since a sub-part can go into many parts, the relationship is many to many and so this relation will have part-code and sub-part-code concatenated as the identifier;” The mode1 in Figure 1 shows the three 1eve1s-enterprise, recognition and representation-identified in the protocols collected from both novices and experts. It was found that a subject would typically stay at one level for some time before moving on to another, and could move from a given level to any other level. The following specific shifts were observed:
1. Enterprise to recognition: Once a subject understood the requirements, he would attempt to understand the underlying database structures. This is typical in a top down approach. 2. Recognition to representation: Once the structure was recognized, the solution was represented. 3. Enterprise to representation: For the relatively simpler aspects, e.g. attributes and binary relationships, recognition level may be missing since the translation occurs in a “compiled” fashion. 4. Representation to enterprise: To seek more information, or to validate the representation. 5. Representation to recognition : For verification, or if the subject suspected a mistake in the representation. Also, in a bottom-up approach where the subject is trying to understand the underlying structures in a user view. 6. Recognition to enterprise: For verification. ANALYSIS OF DATA USING THE PROCESS MODEL
Once the mode1 was identified, each protocol was carefully studied, partitioned and classified into segments such that each segment corresponded to one of the three levels. This was achieved by identifying transition points between the levels. Each audiotape was rerun so that the transition points could be time-stamped. A subject typically had 30-60 such points. The duration of each segment was calculated as the difference between adjacent time stamps. By aggregating the phase durations of all instances of a level in a given protocol, the total time spent in each phase was found. This analysis revealed some interesting qualitative and quantitative differences between the novices and experts. While the process mode1 introduced earlier was generally applicable to experts and novices alike, there were substantive differences in the emphasis placed by them
91
CONCEI’llJAL DATA MODELLING
on each of the levels and patterns of iterations between them. For the most part, experts tended to work on one part of the case at a time, seek clarifications, integrate all the relevant information, recognize the similarities between the situations at hand and appropriate knowledge structures, presumably from the long-term memory and then proceeded to represent the information. At the enterprise level, the experts’ focus was on developing a holistic understanding of the problem by asking questions of the simulated user, if necessary. Frequently, they relied on referent experiences to assist them organize and understand the problem task. At the recognition level, the experts seemed to categorize the information requirements into known abstractions. They then proceeded to represent the information based on the conventions they were most familiar with (e.g. relational, aggregation-generalization, entity relationship etc). They seemed to cycle through the levels, typically starting at the enterprise level, and returning to the same level for a different part of the case, without much iterative backtracking. The novices’ behavior was considerably different. Their protocols did not reveal a focused effort at integrating the information and filling the gaps by seeking clarifications at the enterprise level. Their experiences were very general and quite inferior to those of the experts. In attempting to recognize certain requirements, the novices lacked the categorizing terminology and knowledge of expert. Much of the novices’ efforts were directed toward understanding the underlying structure of the requirements. When they encountered an unfamiliar situation, they would skip over to another part of the case. This resulted in considerable amount of cycling between levels and backtracking. We now turn to a qualitative analysis of the data by each facet. QUALITATIVE
ANALYSIS
AT THE FACET LEVEL
The
qualitative analysis revealed both similarities and differences between the novices and experts in modelling various facets in the case. The similarities pertained to the modelling of the entities, and the identifiers and descriptors of the entities. None of the subjects had any problem classifying an object as an entity or an attribute. The general rule that the subjects seemed to use to distinguish an entity from an attribute
was as follows:
If an object has descriptive information, then treat it as an entity; else if an identifying name and number is the only information to be stored, treat it as an attribute or descriptor. This rule, however, was seldom articulated is an excerpt from the protocol of a novice:
explicitly.
For example,
“Says there are several mechanics. . . we might need something mechanics. I’ll create an entity called mechanics . . .” Another similarity relationships. Again,
binary relationships
the following
to store data for the
was the relative ease with which all subjects modelled binary the recognition phase seemed to be generally missing when the
were captured,
and the rules were not explicitly verbalized:
“Since we need to link equipments to the customer number or the customer name . . . Okay, 1’11just call it a customer name. . . customer name as key. . . it should be there also in the equipment.”
92
D. BATRA
AND J. G. DAVIS
In this case, the subject was capturing the relationship between EQUIPMENT and CUSTOMER. The subject correctly included the identifier of CUSTOMER in the relation for EQUIPMENT, and not vice versa. However, the subject did not go through the formal process of stating that the relationship is one to many, or consciously ensuring that the relations obeyed the normalization rules. There were similarities, too, between the way the novices and experts modelled the ternary relationships. The three user views (or reports) after normalization decomposed into two ternary relationships-one between date, equipment and mechanics, which captured the labor involved in a repair, and the other between date, equipment and parts, which captured the parts replaced in a repair. (The same solution could also be simplified by introducing a new attribute REPAIR#, although the subjects rarely did so). The main task for representing these relationships was to determine the concatenated key capturing the information pertaining to labor and parts replaced. There did not seem to be any major differences between the processes of novices and experts as they modelled the ternary (or four-way) relationship, though the experts showed better understanding and came up with marginally better solutions. The general procedure employed by both experts and novices to model ternary or higher degree relationships seemed to be: If a user view involves identifiers or descriptors of more than two entities, then include the identifiers of the entities in a relation and exclude the descriptors of entities. Next, find information about the connectivity of the involved entities from the user reports (e.g. many mechanics can be involved in a repair), from real-world knowledge (any equipment can be repaired many times), and from querying the user (are the repair descriptions standard?). The connectivity information is used to determine the primary key of the relation. Finally, ensure that the descriptors are dependent on the primary key of the relation for the ternary relationship.
Note that a relation will be considered as denoting a ternary relationship only if at least two of the three foreign keys (i.e. identifiers of other relations) form the primary key of the relation. The subjects, however, seldom verbalized that they were modelling a ternary relationship-they were mainly interested in determining the primary key such that the normalization rules be satisfied. The novices had considerable difficulty in modelling unary relationships and categories. For example, the following excerpt reveals one novice’s struggle with the unary relationship that captured the description that parts are composed of sub-parts. “The equipment is comprised of parts which may consist of sub-parts. I am not sure how to put this in yet. . .” Later: “Part bearing block . . . I don’t understand what this part means yet. I’ll skip this part and go to the next sentence . . .” Later: “Instead of calling it sub-parts, we may be able to call it . . . just parts, and there should be descriptions of parts, price and weight. Types and parts would be the key OK. . OK. . ahight. . . wrong. . . part. . . part code. . . exclude these parts. . . OK. . I think we should have another entity called sub-parts. . . That identifies the parts that make up the sub-parts.”
CONCEPTUAL DATA
93
MODELLING
This contrasted with the confident approach recognized the relationship was recursive.
of one of the experts,
who quickly
(Enterprise Level): “Any equipment is comprised of parts. Okay, now they are talking about “parts” (Recognition Level): “So this is a recursive relationship, but still we have to capture it in tables . . .” etc (Representation Level): “I’ll create a new parts relation . . . I want to create one more relation which is sub-part . . .” etc. It seemed that the expert had worked previously with recursive situations and the representation of such problems was stored in the expert’s memory in “compiled form”. The expert was able to recognize and represent the problem quickly. The novices generally did not appear to be familiar with recursive relationships and consequently had to work their way to a reasonable representation after some backtracking. Differences were also found in the way the novices and experts modelled the categories. The case involved an entity EQUIPMENT which could be one of three types-centrifugal pumps, reciprocating pumps and diesel engines. Each category had its own set of attributes. Three of the five experts had knowledge of the category concept and were able to recognize and map the application description into a representation. A typical excerpt from the protocols is as follows:
“There are three different types of equipment: centrifugal pumps which. . . looks like three sub-types: centrifugal and reciprocating pumps, diesel engines.” This was followed by the representation of the categories. procedures to model categories can be expressed as:
The
rules and
“If an entity can be of different types such that each type has its own descriptive information, then create subtypes (e.g. relations) for each, use the identifier of the parent entity for each subtype entity and specify the respective descriptors.” The novices had little success in modelling categories. They were not able to distinguish categories of an entity from the entity itself and hence used simplifying approaches to model the concept. QUANTITATIVE ANALYSIS
The quantitative results are presented to support the qualitative findings of the study. Since the sample size is not a significant issue given the methodology, the quantitative results do not have the statistical rigor of the conventional empirical studies. We, therefore, did not conduct any statistical tests. The strength of the protocol methodology is primarily derived from the quality of its verbal data. We recorded the time taken for each subject to complete the task. Further, by determining the time interval of each segment, the total task completion time was divided into the times spent at the enterprise, recognition, and representation levels. The results of this analysis (Table 1) revealed some interesting patterns. The task completion times were similar for the experts and the novices. Thus, overall, the experts did not work faster than the novices. There were notable differences, however, in the proportions of the time spent by the novices and experts at the three levels. The novices spent most of the time at the representation level (about 50%) a fair proportion of the time at the recognition level (about 30%) and
D. BATRA AND J. G. DAVIS
94 TABLE 1
Time spent at different levels (min) Subject Novice Nl Novice N2 Novice N3 Novice N4 Expert El Expert E2 Expert E3 Expert E4 Expert E5
Enterprise
Recognition
Representation
9:54 14.6% 9:46 21.1% 16:34 26.5% 17: 13 19.8% 22127 55.3% 18:47 34.0% 8.38 17.2% 29:51 40.1% 29:19 47.0%
22124 52.3% 13:48 29.8% 14:35 23.3% 26:28 30.4% 2:50 7.0% 11:28 20.7% 6:47 13.5% 9~24 12.6% 6: 11 9.9%
35~22 33.1% 22:42 49.1% 31: 19 50.2% 43: 15 49.8% 15:15 37.7% 25:00 45.3% 34:41 69.3% 35:07 47.3% 26153 43.1%
Total 67:40 46:16 62:28 86:56 40:32 55:15 50:06 74122 62~23
the least time at the enterprise level (about 20%). The experts spent minimal time at the recognition level (about 15%). All but one expert spent about equal time at the other two levels. The above data suggest that although there was were no notable differences in the total time taken by novices and experts, there were critical similarities and differences in the way the completion times were apportioned into the enterprise, recognition and representation levels. Both experts and novices spent a good proportion of their time at the representation level. This was expected since this level involved the actual conceptual representation of the database and it required pulling together all details of the data included in the case and writing them down. Differences were found, however, in the time spent by experts and novices at the enterprise and the recognition levels. Experts were able quickly to relate the underlying structure of parts of the case to knowledge they already possessed. Consequently, they spent little time at the recognition level. However, they were more concerned about getting the requirements right. As a result, they spent a considerable amount of time asking questions, seeking clarifications and trying to put the various pieces of the domain together. Their enquiry process was aided by similar experiences from the past. The novices were not as concerned about getting the requirements right, but had to try hard to recognize and represent the database structures in the case since such processes were not automatically triggered. They lacked the experts’ finer differentiation of conceptual modelling concepts and had to struggle to come up with the correct representation of some of the facets. The novices, therefore spent a lot of time at the recognition level and much less time at the enterprise level.
CONCEPTUAL
STRATEGIES
DATA
MODELLING
95
USED BY EXPERTS
The quality of the solutions developed by the experts was generally superior to that of the novices. While richer knowledge of the “facets” (e.g. category) was one of the reasons, the other reason was a better understanding of the application itself. The experts, especially those who had a considerable real-world experience in database design were able to draw on their referent experiences. They may not have worked on an identical system, but were able to draw parallels between the application at hand and similar applications on which they had worked in the past. Consequently, they were able to ask better questions. None of the novices showed a great deal of familiarity with similar applications. In particular, two experts explicitly drew similarities between the case problem and systems they had worked on in the past. One of the experts had worked in an airline company, which had a database containing information about equipment. The other expert had participated in the design of a database which included such information. These subjects seemed to be able to develop a holistic comprehension of the problem and ask searching questions. For example, the reports showed a field EQUIPMENT#, but it was not evident if the field pertained to the serial number or the model of the equipment. Both experts asked the user to clarify the meaning of EQUIPMENT#. They felt that both serial number and model number should be stored and have separate names. It seemed that they had a “frame” about the object equipment, and were using the information provided to fill the “slots” in this frame. The novices did not have referent experiences to draw from, and consequently, they made their own assumptions, at times erroneously, as to what EQUIPMENT# meant. A similar behavior was also observed in modelling the PARTS-REPLACED for a repair. It was expected that this relation would capture the part code in a specific repair. However, one of the novices assumed that, given a repair description, there would be a standard set of parts replaced. After the experiment, the subject was asked about the reasons for making the assumption. She replied that this was what the reports seemed to imply, and, further, this was consistent with some day-to-day examples, e.g. the oil-and-filter change of a car. Her inference was, therefore, based on the sketchy data in the reports and naive referent experiences. In contrast, the experts appeared to have relatively sophisticated models, One of them even considered the billing problems that could result if there were changes in the selling prices of parts. Both experts and novices asked probing questions on those aspects of the problem which could be related to more common situations of everyday life. For example, the application dealt with repair of equipment. The bill for the repair did not have a unique code (e.g. a repair number, or a bill number). Some of experts and novices were concerned about this point. They were relating to the everyday experience of seeing order numbers or account numbers on the bills. But the experts went further and tried to find a way to generate the repair numbers automatically. Some experts paid more attention to the process of the application From the information provided in the case, they would create a process “script” which would try to simulate the sequence of activities involved. For example, in the case of equipment repair, they would start from the arrival of the equipment in the shop to
96
D. BATRA AND J. G. DAVIS
the data entry of items involved in the repair. Sometimes, this would lead to probing questions. Novices restricted themselves to data provided in the case and proceeded in a text book fashion. In this study we were also interested in finding out if the subjects used intermediate representations to help them in developing the relational solutions. Two novices and two experts used the entity relationship (ER) or a similar graphical model, but the rest approached the relational solution directly. While a different study with a larger sample size may be required to validate our observation, the results in this study suggest that both expert and novice designers do not show a real preference for the ER model.
6. Discussion In this section, we attempt to place the study in perspective by comparing our results with those obtained in previous studies on expert-novice differences. The critical issues identified by prior research are listed, and the corresponding findings, based on our study, are discussed. 1. Experts and novices tend to apply qualitatively different (process) models to the task. This differentiation is reflected in the structure of the model as well as the meaning they assign to concepts (Chi, Feltovich & Glaser, 1981; Larkin, 1983; Orlikowski & Dhar, 1986). Alternatively, Bouwman (1982) has shown that in the accounting domain, problem-solving processes of experts and novices can be captured by the same model, but the model is applied differently. Our results are more in accordance with that of Bouwman which may be attributed to the relatively open ended nature of the tasks involved in the two studies. 2. It has been found in most studies in this area that experts exhibit richer vocabulary and relative ability to categorize problem descriptions into standard abstractions. This was corroborated in our study. At the recognition level, the mapping of the case to knowledge structures was triggered often for the experts whereas the novices were unable to achieve this for some of the facets. 3. Hinsley, Hayes and Simon (1977) have reported that experts have demonstrated the ability to automate some aspects of problem-solving process while the novices often have to work from “first principles”. This result was partially confirmed with respect to some aspects of the overall conceptual model. 4. Several studies have argued that experts are able to process and meaningfully organize large amounts of information (Chase & Simon, 1973; Jeffries, 1981, 1982; McKeithen et al., 1981). This result was not directly supported; however, the experts in our study demonstrated a more detailed and systematic approach to information gathering before addressing the representation aspects. They are aided by past experiences, which act as templates for filling information at hand. 5. Most previous studies have underscored the greater preponderance of misconceptions and errors by novices. The novices in this study tended to have more errors in their conceptual models than experts largely due to their inability to map parts of the problem description to appropriate knowledge structures as well as the failure to integrate pieces of information.
CONCEF’IIJAL DATA MODELLING
7. lmpkations
97
and conclusions
The findings
reported in this paper have important implications for training, research and development in database design. The study has explicated a general process model of conceptual data modelling and explored the differentiated application of this model by experts and novices. The experts’ modelling approach is characterized by decomposing the overall problem description into meaningful parts, gathering and organizing all relevant information concerning each part and mapping this to appropriate knowledge structures prior to actual representation. In gathering and organizing relevant information, the experts use domain-related experiences to understand the application and ask questions if necessary. They also have a richer and more differentiated vocabulary of facets, and are better able to map specific aspects of the application to database structures. Thus, their expertise is both at the enterprise and recognition level of the data-modelling process. The findings provide some guidelines for the development of knowledge-based support tools for the conceptual data-modelling task. There have been few attempts at developing knowledge-based database design systems (e.g. Storey & Goldstein, 1988; Choobineh et al. 1988; Dogac, Yuruten & Spaccapietra, 1989). However, these systems are typically “facet’‘-based, that is, they try to identify specific facets, e.g. a binary relationship based on forms, user views and dependency information inferred from interviewing the user. While one can design such systems to model facets, the behavior of these systems is unlikely to match the typical expert. This study points to the complexities involved in designing such a system. While the experts use a number of general design rules, as well as frames from referent experiences, they also spend a considerable portion of their time at the “enterprise” level, that it, in accurately understanding the situation and getting the requirements right. This study has addressed an aspect of data modelling that has not been researched previously-the process of conceptual design as performed by experts and novices. The similarities and differences between experts and novices provide some understanding of the nature of expert knowledge and experience in this domain. Further work involving a real-world data modelling task would contribute to developing domain-specific expert and novice models, as well as discovering the precise heuristics used by experts.
An earlier version of this paper was presented at the Tenth International Conference on Information Systems, Boston, 4-6 December, 1989. We are grateful to Professor Ananth Srinivasan of Case Western Reserve University and the reviewers for their helpful comments.
References BATRA, D., HOFFER,J. A. & BOSTROM,R. P. (1990). Comparing representatives with relational and EER models. Communications of the ACM, 33, 126-139. BLAHA, M. R., PREMERLANI, W. J. & RUMBAUGH,J. E. (1988). Relational database design using an object-oriented methodology. Communications of the ACM, 31,414-427. BOUWMAN,M. (1982). The use of accounting information: expert versus novice behavior. In G. UNGSON& D. BRAUNSTEIN, Eds. De&ion Making: An Interdisciplinary Inquiry, pp. 134-167. Boston, MA: Kent Publishing Company.
98
D. BATRA AND J. G. DAVIS
BOU~MAN, M. (1984). Expert versus novice decision making in accounting:
a summary.
Accounting, Organizations and Society, 9, 325-327. BROSEY, M. & SHNEIDERMAN,B. (1978). Two experimental comparisons of relational and hierarchical database models. International Journal of Man-Machine Studies, 10, 625-637. CHASE, W. G. & SIMON, H. A. (1973). Perception of chess. Cognitive Psychology, 4, 55-81. CHEN, P. P. (1976). The entity-relationship model-toward a unified view of data. ACM Transactions of Database Systems, 1, 9-36. CHI, M. T. H., FELTOVICH,P. J. & GLASER, R. (1981). Categorization and representation of physics problems by experts and novices. Cognitive Science, 5, 121-152. CHI, M. T. H., GLASER, R. & FARR, M. J. (1988). The Nature of Expertise. Hillsdale, NJ: Lawrence Erlbaum. CHOOBINEH,J., KONSYNSKI,B. R., MANNINO, M. V. & NUNAMAKER, J. F. (1988). An expert system based on forms. IEEE Software Engineering, February 1988, 242-253. CODASYL DATA BASE TASK GROUP (1971). April 1971 Report, New York: ACM. DATE, C. J. (1990). An Introduction to Database Systems, Volume 1, Fifth Edn, Reading:
Addison-Wesley. DAVIS, J. G. (1985). A typology of management information system users and its implications for user information satisfaction research. Proceedings of the ACM SIGBDP Conference
on End User Computing, Minneapolis, MN, USA. DILLARD, J. E. (1984).
Cognitive
science and decision
making research in accounting.
Accounting, Organizations and Society, 9, 343-354. DOGAC, A., YURUTEN, B. & SPACCAPIETRA,S. A generalized expert system for database
design. IEEE Transactions on Software Engineering, 15, 479-491. ELMASRI, R. & NAVATHE, S. B. (1988). Fundamentals of Database Systems. Menlo Park,
CA: Benjamin Cummings Publishing Co. ENGLE, R. W. & BUKSTEL, L. (1978). Memory processes among bridge players of differing
expertise. American Journal of Psychology, 91, 673-689. ERICSSON,K. A. & SIMON, H. A. (1986). Verbal reports as data. Psychological Review, 87,
215-251. GUGERTY, L. & OLSON, G. M. (1986). Debugging by skilled and novice programmers. CHI 1986 Proceedings, pp. 171-175 Boston, MA, USA. HAMMER, M. & MCLEOD, D. (1981). Database descriptions with SDM: a semantic database model. ACM Transactions of Database Systems, 6, 351-386. HINSLEY, D. A., HAYES, J. R. & SIMON, H. A. (1977). From words to equations: meaning and representations in algebra word problems. In P. A. CARPENTER& M. A. JUST Eds.
Cognitive Processes in Comprehension,
Hillsdale, NJ: Lawrence Erlbaum.
HOFFER, J. A. (1982). An empirical investigation into individual differences in database models. Proceedings of the Third International Conference on Information Systems, Ann
Arbor, 153-168. HULL, R. & KING, R. (1987). Semantic database modeling. Computing Surveys, 19, 201-260. JARVENPAA, S. L. & MACHESKY, J. J. (1989). Data analysis and learning: an experimental study of data modelling tools. International Journal of Man-Machine Studies, 31,
367-391. JEFFRIES, R. (1981). Computer program debugging by experts. Paper presented at the Meeting of the Psychonomic Society, November 1981. JEFFRIES R. (1982). A comparison of the debugging behavior of expert and novice programmers. Paper presented at the Meetings of the American Educational Research Association, March 1982. JUHN, S. & NAUMANN, J. D. (1985). The effectiveness of data representation characteristics on user validation. Proceedings of the Sixth International Conference on Information
Systems. pp. 212-226, Indianapolis, USA, December 16-18. LARKIN, J. (1983). Problem representation in physics in mental models. In D. R. GENTNER, & A. L. STEVENS,Eds. Mental Models, pp. 75-100, Hillsdale, NJ: Lawrence Erlbaum. LARKIN, J. H., MCDERMOI-~, J., SIMON, D. P. & SIMON, H. A. (1980). Expert and novice
performance in solving physics problems. Science, #)%, 1335-1342.
CONCEPTUAL DATA MODELLING
99
IAXHOVSKY,
F. H. & TSICHRITWS,D. C. (1977). User performance considerations in DDMS selection. Proceedings ACM SIGMOD, pp. 128-134. MCKJZITHEN, K. B., RE~TMAN, J. S., REUTER, H. H. & HIRTLE, S.C. (1981). “Knowledge organization and skill differences in computer programmers. Cognitive Psychology, 23, 307-325. ORLIKOWSKI, W. & DHAR, V. (1986). Imposing structure on linear programming problems: an empirical analysis of expert and novice models. Proceedings of the Fifth National Conference on Artificial Intelligence (AAAZ-86) pp. 308-311, Philadelphia, PA, USA, August 11-15. PECKHAM,J. & MARYANSKI,F. (1988). Semantic data models. ACM Computing Surveys, 24, 153-189. REITMAN, J. S. (1976). Skilled perception in Go: deducing memory structures from interresponse times. Cognitive Psychology, 8, 336-356. RIDJANOYIC,D. (1986). Comparing quality of data representations produced by nonexperts using logical data structure and relational data models. Ph.D. thesis, University of Minnesota, USA. SHOVAL, P. & EVEN-CHAINE, M. (1987). Database schema design: an experimental comparison between normalization and information analysis. Database, 18, 30-39. SMITH, J. M. & SMITH, D. C. P. (1977). Database Abstractions: aggregations and generalization. ACM Transactions on Database Systems, 2, 105-133. STOREY, V. C. & GOLDSTEIN,R. C. (1988). A methodology for creating user views in database design. ACM Transactions in Database Systems, W, 305-338. TEOREY, T., YANG, D. & FRY, J. P. (1986). A logical design methodology for relational databases using the extended entity relationship model. ACM Computing Surveys, 18, 197-222.
Appendix
1: Case
FAR-EASTERN REPAIR CENTER
Midwestern Manufacturing, Inc. has a large repair facility in Southeast Asia named Far-Eastern Repair Facility. This repair facility carries out repairs of equipment manufactured by Midwestern, as well as other manufacturers. The company would like to computerize the Facility, so that it can automate billing and inventory, and keep a track of the repair history of each engine. Currently, however, the company would like to build a small prototype. After interviewing the managers and engineers, the following requirements were obtained. Equipment comes in different types. Far-eastern can handle centrifugal pumps, reciprocating pumps and diesel engines. For centrifugal, pumps and diesel engines, the important data are the piston diameter and cylinder volume. Each equipment serviced by the Center is identified by its manufacturer’s code and is characterized by its horsepower and speed (RPM). Each piece of equipment is owned by a customer. Equipment needing repair or maintenance is brought to Far-Eastern Repair. The facility maintains the current address (street, city, state) and phone number of the customers for billing purposes. The manufacturers of the equipment are mostly overseas, and therefore, the facility has to maintain adequate level of inventory of spare parts. The facility maintains the address (street, city, state, country) and telex and phone number of each manufacturer, primarily for ordering spare parts to replenish stocks. Any equipment is comprised of parts which, in turn, may consist of sub-parts. The lowest (level of) sub-part considered is that which can be ordered as a spare part.
100
D. BATRA
AND
J. G. DAVIS
(As an example, a centrifugal pump can be considered as consisting of body, impeller, shaft, mechanical seal and bearing block.) The part shaft is at the lowest level because it does not have any component part. The part mechanical seal is composed of other parts, but is considered as the lowest level since its sub-parts cannot be ordered as spare parts. The part bearing block can be ordered as a spare part and has two components-bearing and housing-both of which can also be ordered as spare parts. The part bearing block, therefore, is not at the lowest level (its components are). The inventory of parts is warehoused in three different buildings. A, B and C. Parts are stored in numbered bins within each building. When a mechanic has to fetch a part, he first looks up the location of the part using a terminal. After finding the building and the bin number, the mechanic can easily fetch the required part. Each part is identified by a code. The essential data to be stored about a part are its description, list price and weight. There are several mechanics at the facility. Since mechanics differ in skill level and experience, the labor rate for each mechanic is different. When equipment arrives for repair, the history of the equipment is called up (refer to Report 1). One or more mechanics are then assigned to the repair task. The number of mechanics assigned to a repair job varies depending on the magnitude of the job. After the repair is over, each mechanic enters the number of hours spent on repair, parts replaced and a description of the repair. These data are then used to compute two reports, a bill for the customer (Report 2) and a second report for internal accounting (Report 3). Report
1.
HISTORY OF THE EQUIPMENT CP12758
Type: Centrifugal pump HP: 50 RPM: 1500 Date received 7115186
Jay Chu Ram
12109187
Description
Mechanic
Jay
Pump leaking: mechanical seal replaced Pump leaking: mechanical seal replaced Bearin replaced Shaft 8 round
Report 2. BILL FOR EQUIPMENT CP12758
Type: Customer: Date Received: Amount:
of repairs
Centrifugal pump Lee Associates 719188 $18140
101
CONCEPTUAL DATA MODELLING
Description of repairs Pump leaking: mechanical seal replaced Coupling worn off: replaced Charges Total labor: Total labor charge:
2h $5540
Parts replaced Part name
Quantity
Mechanical seal Coupling Sub-Total Taxes Total
:
Unit price ($) Total price ($) 8040 4040
E:E 20-00 12~:~
Report 3. BILL FOR EQUIPMENT CP12758
Type: Customer: Date Received: Amount:
Centrifugal pump Lee Associates 719188 $18140
Labor Mechanic
Jay
3540
Bud
2040
Number of hours
Description of repairs
1
Pump leaking: mechanical seal replaced Coupling worn off: replaced
1 Total
Labor charge ($) 3540 2040 55.08
Parts Part name Mechanical seal Coupling Sub-total
Quantity 1 1
Unit price ($) . E.E
Total/Part g:E
12040
($)