Data & Knowledge Engineering 56 (2006) 122–138 www.elsevier.com/locate/datak
Evolution of entity–relationship modelling

Susanne Patig
Otto-von-Guericke-University of Magdeburg, P.O. Box 4120, D-39016 Magdeburg, Germany

Received 27 September 2004; received in revised form 3 February 2005; accepted 24 March 2005
Available online 25 April 2005
Abstract

New modelling languages are continually being developed. Some of them are variants of existing ones, others are completely different, and a few become widely accepted. The overall variety of modelling languages can be explained by an evolutionary process that exhibits regularities. Evolution has been studied extensively in biology and linguistics. From the results of these studies, hypotheses on the causes, mechanisms and directions of the evolution of entity–relationship (ER) modelling are derived and tested, using a sample of 100 ER models. The empirical results lead to a theory of the evolution of modelling languages.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Modelling languages; Entity–relationship modelling; Theory of evolution
1. Motivation

Languages are the predominant tool in any field of computer science. For example, programming languages are used to implement systems, specification languages describe systems, and modelling languages [53] formally describe some aspects of the real world. The descriptions that arise from applying a modelling language are intended to be used mainly by humans for the purpose of understanding, but also occasionally by some system to perform a task. Therefore, conceptual models, semantic data models, process modelling languages and knowledge representation
0169-023X/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2005.03.010
S. Patig / Data & Knowledge Engineering 56 (2006) 122–138
techniques are modelling languages. The term language implies a set of constructs, syntactic rules to connect them and a description of the meaning of each construct (semantics). A method additionally comprises guidelines on how to apply the language in building descriptions [29]. This paper deals with modelling languages only.

Besides the overall variety of languages in computer science (inter-language variety), languages themselves change over time. Descending from a common origin, new language variants are basically similar to their parent, but some characteristics differ. The result is variety within a language (intra-language variety), which can be observed, for instance, in Petri nets [17], conceptual graphs [78] and entity–relationship (ER) models.

Given the variety of and within modelling languages, judging the importance of a particular proposal is difficult, and it becomes increasingly difficult as new proposals continuously emerge. These proposals are usually not constructed systematically, but rather based on intuition [73]. If the creation of (variants of) modelling languages followed certain patterns, these regularities would help to classify new language proposals. One day, ideal modelling languages could be designed based on the knowledge of how and why humans create artificial languages, which is contained in these patterns.

Investigating the changes ER modelling has undergone over time reveals regularities. ER modelling is considered because it has been used for nearly 30 years, which, first, indicates importance and, secondly, provides enough empirical data for statistical analysis (see Section 3). Thirdly, there are other modelling languages, e.g., NIAM [29], UML [67] or semantic nets [63], whose relationship to ER modelling can be examined. Therefore, ER modelling is also a good starting point to analyse inter-language variety, but this is a topic for future research.
Evolution is defined as changes over time that exhibit regularities and lead in certain directions [50,7]. The characteristics of evolution have been studied extensively in biology (see Section 2.1) and in linguistics (see Section 2.2). This work is prompted by the assumption that the historical development of ER modelling is an evolutionary process. From the common characteristics of biological and linguistic evolution, some hypotheses on the evolution of ER modelling are derived (Section 2.3). A sample of 100 ER models, presented from 1975 to 2003, is used to test these hypotheses (Section 3). The empirical results lead to a theory of the evolution of modelling languages (Section 4).
2. Exploring evolution

2.1. Evolution in biology

By searching for “The Origin of (the) Species”, Charles Darwin [14] founded the field of evolutionary biology in 1859. The Synthetic Theory that emerged during the 1930s and 1940s forms the foundation of contemporary evolutionary biology. Following this theory, biological evolution is defined as a change in the properties of populations of organisms over the course of generations [24]. A population refers to a group of individuals of a species that occupy a particular locality and are reproductively isolated from other populations of the same species [50]. Biological evolution claims that all organisms descended from a single common ancestor (monogenesis) [24].
With the exception of twins, all individuals of a population are genetically different, which leads to a great variety in their phenotypic (observable physical) properties. Genetic differences originate from the random processes of mutation, recombination or hybridization [24]. A mutation is an unpredictable alteration of an individual's genetic material. Recombination occurs during meiosis: new gametes are formed (after the union of genetically different gametes), which possess chromosomes or DNA sequences of both individuals of the same population that united. Finally, hybridization introduces genes from one species to another.

Genetic differences have consequences for evolution only if they are transmitted to succeeding generations. Only some of the offspring produced by hybridization are fully fertile; most genetic variability that arises through recombination is destroyed immediately by the same process. Genetic differences owing to mutation that constitute non-random advantages in survival or reproduction are inherited; this is called natural selection [14]. The properties that are retained by natural selection are better adapted to the natural environment. Evolutionary adaptation to the environment is limited by the genetic material.

An analysis of the changes within phylogenetic lineages reveals several trends as to how organisms better adapt [50]. For every trend, however, exceptions exist [24]. First, adaptation by increasing differentiation (i.e., different cells acquire different functions) produces more complex organisms. For instance, because of the special requirements of feeding, mammals have more highly differentiated teeth than reptiles. Secondly, better adaptation can be achieved by simplification; features disappear or become rudiments as soon as they are no longer needed, e.g., the posterior extremities of whales. Finally, specialization leads to a better adaptation to different ways of life.
On the one hand, homologous organs are inherited from the common ancestor of the species yet differ in form and function, depending on the requirements of the environment. For example, the extremities of vertebrates are adapted to running, flying, swimming or grasping. So, the trend of specialization results in the divergent evolution of related lineages (evolutionary radiation) [24]. On the other hand, convergent evolution of unrelated lineages is observable as well: owing to similar environmental factors, analogous organs of different origin are similar in form and function (e.g., wings of birds and bats). Biological evolution is a pluralistic process that finds alternative solutions for each challenge posed by nature.

2.2. Evolution in linguistics

Language change is described and explained by historical linguistics. Investigating the origin of languages was prohibited by the Linguistic Society in Paris in 1866 and again in 1911, but such investigations have been resumed since 1965 [7]. Language evolution refers to similarities and differences among languages that are due to a process of change from a common ancestral language [23,66]. It is highly probable that all languages are traceable to a few common ancestors (polygenesis) [7].

In contrast to most of its causes, language change is systematic [57]. On the one hand, language changes by chance, e.g., by errors of the speakers (spelling pronunciation, articulatory simplification, etc.) or the hearers (e.g., fusion of words) [7]. On the other hand, non-random cultural developments require words for new items, e.g., television [57,7]. Results of errors or cultural changes are regular variations within a language, which are classified according to the objects that vary (see Table 1). In general, a word is an arbitrary combination of sound and meaning [66,57] that can be thought of as an entry in our mental lexicon
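Table 1
Linguistic operators to create variety

| Operator of variation | Phonology (phonemes, allophones) | Morphology (morphemes) | Syntax (word order) | Semantics (word meaning) | Lexicon (word) |
|---|---|---|---|---|---|
| Add/delete | +/+ | +/+ | − | +/+ | +/+ |
| Merge/split | +/+ | −/+ | − | − | − |
| Assimilate/dissimilate | +/+ | − | − | − | − |
| Weaken/strengthen | +/+ | − | − | +/+ | − |
| Broadening/narrowing | − | − | − | +/+ | − |
| Shift | + | − | + | + | − |
| Borrow | + | + | − | + | + |
| Invent | − | − | − | + | + |

Columns give the branch of linguistics with its objects of variation in parentheses. +: operator applies; −: operator does not apply.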
[23]. Morphology deals with the internal structure of words, semantics with their meanings and syntax with the order of words in sentences [57]. The minimal unions of sound and meaning that cannot be further decomposed are called morphemes [23]. For example, words like boy, prefixes (e.g., dis-) and suffixes (e.g., -ment) are morphemes [57]. Phonology is concerned with the sound patterns of a language. The distinctive sounds are known as phonemes (e.g., vowels); their predictable phonetic variants are categorized as allophones (e.g., oral and nasalized allophones of vowels) [23].

Linguistic variation can be ascribed to several operators (see Table 1). Some of the operators are dedicated to certain objects, e.g., making a sound more like (assimilate) or less like (dissimilate) another sound in its environment, or making a meaning more inclusive (broadening) or less inclusive (narrowing) than its earlier form [57,23]. Both sounds and meanings can get weaker or stronger [57]. Moreover, new sounds can be the result of merging or splitting existing objects; splitting also creates new morphemes. Shifts refer either to a change in the sequence of words or to substituting an old sound or meaning by a new, closely related one [57,23].

As a consequence of speaker interaction, allophones, phonemes, morphemes, words or word meanings can be borrowed from other languages [57]. The fundamental vocabulary of a language is rarely borrowed [66]. Alternatively, words or word meanings can be invented from scratch when they are required to describe new, unknown items [7]. Finally, a language varies because of the addition or the deletion of nearly all kinds of objects. In contrast to inventing or borrowing, the added objects already occur in the language, but in other contexts. Adding is realized in specific ways, namely by epenthesis in phonology, by fusion or analogy in morphology, by metaphor in semantics, and by word formation or derivation in the lexicon [57,7].
Most of the variants produced by the linguistic operators are disregarded or not even noticed. For a language change to take place, the linguistic variant must be appreciated and accepted by the linguistic community [7]. Acceptance is probable if the variant is better adapted to the physiological (e.g., ease of articulation) or cognitive (e.g., preference for regular patterns, spelling pronunciation) capabilities of speakers or if it fits the latest cultural development [57]. Consequently, the adaptation of languages is directed to articulatory and grammatical simplification. However, the trend to simplification is countered by the need to limit ambiguity. For that reason,
complication of sounds and grammar also occurs [23]. Finally, specialized languages for social or professional speech communities arise if a linguistic variant is only partly accepted [7].

2.3. Derivation of hypotheses

In each of the two scientific disciplines, the term evolution describes changes over time that show regularities. Despite the differences in detail, there are characteristics on an abstract level that are common to both biological and linguistic evolution. Both are based on variation that is created by certain mechanisms and guided into various directions by a rule; the rule converts singular variants into persistent changes.

Adding, borrowing and inventing are the common types of mechanisms of variation; the linguistic operators have analogues in biology. Variation always reveals itself in the new elements within a system (a population or a language). The first mechanism of variation adds new elements by rearranging elements already contained in the system; this is similar to recombination in biology. Creating variation by borrowing means that the system varies due to elements introduced from the outside (as in biological hybridization). Finally, variation within a system can be the result of new elements that are invented from scratch, i.e., elements that exist neither within nor outside the system. Mutation is the name for inventions in biology.

Both in biological and in linguistic evolution, variation is directed to better adaptation, either to the natural environment or to the speakers' physiological and cognitive abilities. The new elements that are selected because of their more suitable adaptation can be alternatively more complex, simpler or more specialized. These evolutionary trends are observable in biology as well as in linguistics.

The main differences between biological and linguistic evolution refer to their origins (mono-/polygenesis), to the causes of variation and to the rules that establish persistent changes.
Biological variation occurs randomly; persistent changes are achieved by the non-random rule of natural selection. In linguistics, variation is not only caused by chance but also by cultural changes. For a linguistic change to remain in existence, it must be accepted by the linguistic community.

The general assumption of this paper is that the changes of ER modelling over time form an evolution. From the definition of the term evolution, the observable variation has three general characteristics: It is (1) caused by something [H1], (2) systematically created [H2] and (3) directed [H3] (see Table 2). These general hypotheses [H1]–[H3] are partially refined by special hypotheses,

Table 2
Hypotheses for investigation on ER modelling

| General hypotheses | Special hypotheses |
|---|---|
| [H1] There are causes of variation. | * |
| [H2] There are mechanisms of variation. | The mechanisms of variation include adding, borrowing and inventing. |
| [H3] Starting from an origin, variation is directed, i.e.: | |
| [H3-a] There is a general direction of variation. | Variation is directed to better adaptation to *. |
| [H3-b] There are trends of variation. | Best adaptation is achieved by increasing complexity, simplification or specialization. |
| [H3-c] There is a rule to make changes persistent. | * |

*: Distinctive features of the evolution of ER modelling.
which can be tested more easily. The special hypotheses correspond to common characteristics of evolution in biology and linguistics, which are assumed to also be valid for the evolution of ER modelling. However, where biological and linguistic evolution differ, distinctive features (indicated by * in Table 2) are expected for ER modelling; in this case, special hypotheses cannot be stated. The hypotheses summarized in Table 2 are tested in the empirical study in Section 3.
3. Evolution of entity–relationship modelling

3.1. Sample

The sample comprises 100 published ER models. The ER models must be either scientifically or commercially accepted. Scientific acceptance is assumed for reviewed conference or journal publications. In detail, three series of conferences (VLDB, ER, ACM SIGMOD) and ten journals (ACM Computing Surveys, ACM Transactions on Database Systems, Data and Knowledge Engineering, IEEE Data Engineering, IEEE Transactions on Data and Knowledge Engineering, IEEE Transactions on Systems, Man and Cybernetics, Information and Software Technology, Information Systems, Journal of Database Management, Journal of Systems and Software) have been investigated from 1975 to 2003. The year 1975 was chosen as a starting point because the ER model was introduced at the Conference on Very Large Databases in 1975 (although its first publication dates to 1976 [11]). Additionally, publications on ER models were searched in various databases using the keywords ERM, entity–relationship modeling/modelling, entity–relationship and entity. Also included are references to books, manuals or other papers, if the ER model is part of some commercially available tool (e.g., [4]); the existence of such a tool indicates commercial acceptance. Because of space limitations, the complete sample is available on the following website: http://www-wi.cs.uni-magdeburg.de/forschung/er_evo/.

Papers dedicated to theoretical aspects of ER modelling, ER-based data definition, data manipulation or query languages as well as practical guidelines for creating ER diagrams were not considered, because the investigation is focused specifically on ER models as special types of modelling languages. Moreover, only the first reference to an ER model was included. The authors themselves sometimes (e.g., [80]) supplied the chronological classification.
To be included in the sample, the proposed ER models had to provide constructs possessing the same semantics as the original ER model constructs. Constructs are the (sets of) signs on which the syntactic rules are defined. In addition to graphical notations of constructs (e.g., rectangles, diamonds [11,20]), textual notations (e.g., first-order predicate logic [54] and pseudo-code [62]) also occurred in the sample.

The original ER model [11] used the constructs entity type, relationship type and attribute. Entity types stand for things that can be uniquely identified and characterized by their attributes. Relationship types represent associations among entity types. Attributes express information on entity and relationship types by mappings into value sets. The syntactic rules require that relationship types connect entity types and that attributes are associated with entity types and relationship types.

For the most part, the meanings of the above constructs (semantics) are provided by a mathematical substratum, such as set theory or logic. The semantics of the original ER model [11] was
introduced by referring to the mathematical substratum without developing a complete framework (formalized semantics). For a complete framework, the semantics is called formal (e.g., [80,27]). If the mathematical meaning of the constructs is not specified, the term informal semantics is applied (e.g., [1,70]).

3.2. Causes of variation

In contrast to variety itself, the causes of variation are not directly observable. Instead, conclusions must be drawn from statements of the authors concerning their motivation to propose a new ER model. The explicit statements are cited in the complete sample. Since it is uncertain whether the statements reveal the causes correctly and completely, the overall content of each ER model was additionally considered. Altogether, variation of ER models is caused by the following (see Fig. 1):

• Expressiveness [8], i.e., the ability to represent any intended meaning.
• Practical applicability, i.e., making large ER diagrams easy to read and modify (e.g., [77,58]).
• Corrections to remedy deficiencies of the original constructs (e.g., [76]).
• Integration, i.e., to provide a framework that contains other approaches (e.g., logical [11] or semantic [27] data models, alternative ER models [42]) as special cases.
• Implementation support (e.g., [62]).

If the stated motive did not comply with the causes above, it was classified as other. Sometimes a new ER model is motivated by more than one cause (e.g., expressiveness and integration [11]). Thus, the total number of counted causes in Fig. 1 exceeds the number of proposed ER models.

The causes clearly reflect the problem to be solved by ER modelling, namely data modelling. A data model provides concepts to describe the structure of and the operations on a database [55]. Typically, it also includes primitives to define and manipulate databases [8]. Hence, the general
Fig. 1. Causes to propose new ER models.
purpose of data modelling can be divided into the sub-purposes of description and implementation support. Implementation support directly appears as a cause to modify ER models. Expressiveness, applicability, correction and integration emphasize particular qualities of description; the sub-purpose expressive description is further analysed below.

Increasing expressiveness always means adding new semantics, which can be classified by its intension. For ER models, expressiveness is enhanced concerning the following:

1. Structure, referring to the abstraction capabilities in describing database schemas (e.g., generalization, aggregation [70], or category [21]).
2. Integrity, making it possible to state constraints that restrict the valid states of the database and that are not already included in the constructs (e.g., [2]).
3. Behaviour, to describe operations on the database by the modelling language (e.g., [18]).
4. Time, revealing differences in the instances [41] or the structure [43] of the database at distinct points of time.
5. Uncertainty, if the structure or the instances of the database must be characterized by probability distributions (e.g., [48]).
6. Knowledge, to represent inference rules [36].
7. Multidimensionality, to show the facts and the dimensions of data analysis [68].
8. Domain-specifics, using dedicated vocabulary, e.g., in the fields of process industry [1], manufacturing [22], security [59], electronic commerce [35], multimedia [82], hypermedia [26], psychology [28] and geographic information systems [83].

To summarize, emphasizing particular sub-purposes of data modelling causes the variety of ER models. This is additionally supported by the significant ($\alpha = 0.05$) association between the sub-purpose (expressive description or implementation support) and the kind of notation (graphical/textual) chosen (see Table 3, $\chi^2$ test for $r \times c$ tables [72], $\chi^2 = 6.36$, $\chi^2_{0.05;1} = 3.84$).
Graphical notations are preferred to improve expressiveness and textual notations to support implementation. In Table 3, ER models that are motivated by both sub-purposes or use both kinds of notations are randomly classified.

3.3. Mechanisms of variation

The investigation of the sample suggests that there are certain mechanisms to realize the variation of ER models. These mechanisms are summarized in Fig. 2 and will be explained in the following.

Table 3
Association between sub-purpose and notation

| Sub-purpose | Graphical notation | Textual notation | Sum |
|---|---|---|---|
| Expressive description | 48 | 17 | 65 |
| Implementation support | 6 | 9 | 15 |
| Sum | 54 | 26 | 80 |
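The association reported for Table 3 can be re-derived from the four cell counts. The following Python sketch is an illustration added here, not part of the original study. It assumes a plain Pearson $\chi^2$ statistic without continuity correction, the function name `chi_square` is hypothetical, and the critical value 3.84 is simply the one quoted in Section 3.2.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumption: no continuity correction is applied.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of sub-purpose and notation.
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Cell counts from Table 3: rows are the sub-purposes (expressive description,
# implementation support), columns are the notations (graphical, textual).
TABLE_3 = [[48, 17],
           [6, 9]]

# Critical value for alpha = 0.05 and one degree of freedom, as quoted in the text.
CRITICAL_0_05_DF1 = 3.84

stat, dof = chi_square(TABLE_3)
print(round(stat, 2), dof, stat > CRITICAL_0_05_DF1)  # prints: 6.36 1 True
```

Under these assumptions the recomputed statistic agrees with the reported value of 6.36 and exceeds the critical value, so independence of sub-purpose and notation is rejected at the 5% level.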
Fig. 2. Mechanisms of variation.
From Section 2.2, it is known that language variation is observable in phonology, morphology, syntax, semantics and the lexicon. Since ER models are a written language, phonology is irrelevant. Instead, the function of the sounds in (natural) language is fulfilled by the constructs of the ER models. Each construct is assigned a meaning (semantics).

The branches of linguistics distinguish between simple meaningful constructs, which cannot be further decomposed (morphemes), and composed ones (words). The formation of complex meaningful constructs would be controlled by morphological rules, whereas syntactic rules would apply to the connection of constructs to form a description. Constructs differ from descriptions in that the meaning of constructs is independent of a particular part of reality. Thus, the meaning of a complex construct should be implied by the meanings of its constituents. Although complex constructs in the above sense are occasionally to be found in ER models (e.g., weak relationships that result from combining the usual construct of relationship and a construct representing weakness [69]), the combination of constructs is always fixed, i.e., semantically unsplittable (like linguistic mergers). Consequently, it seems appropriate to simply treat complex constructs as constructs whose meaning cannot be further decomposed and whose connection to describe reality must adhere to the syntactic rules. Morphology is not part of this investigation.

Any modelling language that claims to be an ER model must preserve the original semantics (see Section 3.1). Therefore, deleting inherent semantics and semantic change, i.e., the replacement of the original meaning by an alternative one (linguistic operator semantic shift), are prohibited. Instead, semantics is always added (or deleted) when additional constructs are introduced (or removed), which results in an update of the lexicon.
Similar to linguistics, the constructs for carrying the given meaning are selected randomly. There are, for instance, many alternative notations (e.g., numbers [11], single lines and crow's feet [4], filled or unfilled diamonds [77]) to describe in how many relationships an entity may be involved. Moreover, instead of ER diagrams, occurrences structure diagrams are sometimes depicted [76].

Variation of ER models stems from language-inherent constructs or from new ones. Inherent constructs indeed cannot be avoided in the modelling language. Variety results from adding or
deleting specializations of inherent constructs. Here, the original syntactic rules still apply and are not affected. Moreover, the semantics of the inherent construct is also valid for the specialized one and modified by additional semantics, which can be broader (e.g., types instead of entity types [16]) or narrower (e.g., total relationships [69]) than the original one. Semantic strengthening, e.g., making a meaning better (amelioration), or weakening are irrelevant to ER modelling. Depending on the intended application, a construct specialization is called general if the specialized construct is not restricted to certain application domains and otherwise named domain-specific. Typically, domain-specific specializations of inherent constructs depend heavily on the vocabulary of the domains. Inherent constructs of the ER model are entity type, relationship type and attribute. Although attributes are sometimes not explicitly represented, specifying entity types is impossible without defining their attributes. General specializations of the inherent constructs entity type (weak entity type), relationship type (1:N, M:N, 1:1, existence-dependency) and attribute (key, role) already formed a part of the original ER model [11]. Some of these specializations have been deleted later on, e.g., weak entity types [30] or keys [21]. Widely accepted general specializations of inherent constructs are contained in extended ER models (e.g., [77,20,27]). Besides, a plethora of other general specializations of inherent constructs exist, e.g., time-period classes (entity types) [46], facts and classifications (relationship types) as well as dimensions (entity types) [46] and fuzzy attribute types [48]. 
Domain-specific specializations have been proposed, for example, for geographic information systems (spatio-temporal entity types) [22], security applications (security object entity or relationship types) [59], psychology (conceptual entity types, perceptual attribute types) [28] and electronic commerce (specialized entity types action, agent, claim/commitment and specialized relationship types sends, perceives, does) [84].

Alternatively, variation is induced by new constructs whose meaning does not refer to the semantics of inherent constructs. If the meaning is taken from another modelling language or method, the new construct is borrowed. If not, it is invented from scratch. A new construct is called adapted if syntactic rules for connecting it to inherent constructs are defined; otherwise it is just included. Invented constructs that are adapted to ER models are, for example, constraint and transformation sets that connect attributes [40] as well as type constructors [27] (which transform several entity types into distinct output entity types if certain requirements are satisfied). Moreover, constructs were borrowed from Petri nets (e.g., [18]), object-oriented modelling (e.g., [56]), knowledge-based systems (e.g., [36]) or abstract data types (e.g., [3]) and adapted to the ER model. Missing adaptation of borrowed (e.g., [25], distribute transaction schema diagrams) or invented (e.g., [61], sequence diagrams) constructs leads to separate diagrams, whose connection to the ER diagrams must be specified.

In addition to being a consequence of adapted constructs, syntactic rules can also change in isolation. Constructs may not be connected if a corresponding syntactic rule is missing. Consequently, syntactic rules are mostly added. Examples of syntactic rules that are added in isolation are relationship types that connect relationship types [65] or attributes [1]. Syntactic rules would only be deleted if adapted constructs were removed.
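The distinctions drawn so far (specialization of an inherent construct versus new construct, general versus domain-specific, borrowed versus invented, adapted versus included) can be read as a small decision procedure. The Python sketch below is an illustration only and not the classification instrument used in the study. The record layout `ConstructChange` and the function `classify` are hypothetical names, and syntactic rules that change in isolation as well as deletions are deliberately left out. Treating an unstated origin as invented follows the convention the study applies to its sample.

```python
from dataclasses import dataclass
from typing import Optional

# The three inherent constructs of the original ER model, as named in the text.
INHERENT_CONSTRUCTS = {"entity type", "relationship type", "attribute"}


@dataclass(frozen=True)
class ConstructChange:
    """One added construct of an ER model proposal (hypothetical record layout)."""
    name: str
    specializes: Optional[str] = None    # inherent construct being specialized, if any
    domain: Optional[str] = None         # application domain, if restricted to one
    borrowed_from: Optional[str] = None  # source language or method, if stated
    has_syntactic_rules: bool = False    # rules connecting it to inherent constructs exist


def classify(change: ConstructChange) -> str:
    """Return the mechanism of variation for one construct change."""
    if change.specializes is not None:
        if change.specializes not in INHERENT_CONSTRUCTS:
            raise ValueError(f"not an inherent construct: {change.specializes!r}")
        kind = "domain-specific" if change.domain else "general"
        return f"{kind} specialization of {change.specializes}"
    # New construct: an unstated origin is counted as invented.
    origin = "borrowed" if change.borrowed_from else "invented"
    integration = "adapted" if change.has_syntactic_rules else "included"
    return f"{origin}, {integration} construct"


# Examples taken from the constructs mentioned in the text.
examples = [
    ConstructChange("weak entity type", specializes="entity type"),
    ConstructChange("security object entity type", specializes="entity type",
                    domain="security"),
    ConstructChange("type constructor", has_syntactic_rules=True),
    ConstructChange("Petri net construct", borrowed_from="Petri nets",
                    has_syntactic_rules=True),
    ConstructChange("sequence diagram"),
]

for change in examples:
    print(f"{change.name}: {classify(change)}")
```

Keeping the two questions for new constructs (origin and integration) independent mirrors the text, where borrowed and invented constructs can each be either adapted or merely included.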
Deleting syntactic rules in this way requires an ER model with adapted constructs to be an ancestor of another ER model. In the sample, only three ER models with new constructs [34,75,35] were modified by adding constructs. Furthermore, since the modifications
were either carried out [34,35] or initiated [75] by the author of the ancestor himself, introducing constructs with a meaning not referring to the semantics of the inherent constructs is probably not an optimal starting point for further variation.

The complete sample (available on the website) contains details on the variation of ER models. The names of the constructs are given as defined in the respective proposal of the ER model; renaming of inherent constructs is also indicated. To avoid any subjective bias, I adhered strictly to what the authors said in describing their modifications. Consequently, generalization, for example, appears in the classification as a general specialization of both entity [6] and relationship types [70]. Furthermore, if the origin of a new construct was not stated, it was classified as invented.

3.4. Directions of variation

Evolution means directed variation starting from some origin. For the evolution of ER modelling, its origin is obvious [11]. However, answering the question of why the original ER model [11] became the starting point of variation is not that easy. Help comes from the theory of science, which coined the term paradigm. A paradigm [38] is an idea guiding scientific research that is characterized by its outstanding abilities (1) to solve problems and (2) to integrate both (2a) existing problem solutions and (2b) a community, as well as by (3) its precision. The paradigmatic qualities (1)–(3) play an important role in directing the variation of ER modelling.

The ER model has been empirically demonstrating its ability to integrate a community (2b), i.e., to reach agreement among researchers, for nearly 30 years. Besides, the original ER model [11] provided a framework to integrate the data models (2a) (relational, network, entity set) that existed at that time.
The integrating framework distinguishes the original ER model from another semantic data model [71] published in 1975, whose constructs are semantically equal to those of the ER model, but which refers only to the relational data model. To summarize, the original ER model already had integrative power. Improving this integrative power was the impetus for proposing new ER models from 1981 to 1993 (see cause integration in Fig. 1). Especially from 1981 to 1994, new ER models were motivated by the cause of correction (see Fig. 1); this suggests that the original ER model lacked precision (3) to some extent. Nowadays, most deficiencies are cured, and ER modelling is sufficiently precise.

As mentioned in Section 3.2, ER modelling is used to (1) solve the problem of data modelling. Since data models are mainly descriptive, most paradigmatic problem-solving ability is related to description. Fig. 1 shows that the major cause of developing a new ER model with respect to description is expressiveness. Of the intensions that are added to raise expressiveness (see Section 3.2), structure, integrity and behaviour are directly relevant to the purpose of a data model. In Fig. 3, all occurrences of these intensions are summarized accordingly. The necessity of describing time, uncertainty, knowledge and multidimensionality arises from the intended application of the modelled database system. Again, aggregated data appears in Fig. 3. Application-dependent intensions are found in any domain, in contrast to domain-specifics, which are depicted separately in Fig. 3.

The number of ER models created to enhance data model expressiveness increased from 1975 to 1988 and has decreased since then; this number is therefore best fitted by a quadratic function (see Table 4). There seems to be an agreement within the community of computer science
Fig. 3. Directions of variation.
Table 4
Statistical details of the fitted functions (a)

Fitted function                     | Data model                   | Application                  | Domain-specifics
                                    | R²     Femp   Sig F  Sig C   | R²     Femp   Sig F  Sig C   | R²     Femp   Sig F  Sig C
Linear: f(t) = a + b·t              | 0.194  1.051  0.31   0.31    | 0.033  0.930  0.34   0.34    | 0.209  7.145  0.01   0.01
Quadratic: f(t) = a + b·t + c·t²    | 0.275  4.916  0.01   <0.02   | 0.172  2.692  0.09   <0.05   | 0.209  3.455  0.05   <0.88

R²: coefficient of determination; Femp: empirical F for R²; Sig F: significance of Femp; Sig C: significance of function coefficients.
(a) The statistics have been computed by the software package SPSS.
on the basic descriptive abilities that a data model should provide. All of these consensual problem-solving abilities, which are common to most semantic data models [31], have been incorporated in extended ER models, making further improvements in this direction superfluous. Nearly the same holds true for application-dependent expressiveness (see Table 4). In contrast, a linearly growing number of ER models has been presented since 1983 to increase domain-specific expressiveness (see Table 4).

The imperfect fit measured by the coefficient of determination R² is due to the variability of the empirical data, which covers a time span of nearly 30 years (n = 28). However, the R² values of the functions whose lines are depicted in Fig. 3 are statistically significant [72] at the α = 0.05 level and, for application, at the α = 0.10 level. Table 4 contains the empirical F values of R² (Femp) and their significance (Sig F), which must be smaller than the theoretical levels of significance (α = 0.05,
0.10) to reject the null hypothesis (R² = 0). Moreover, the empirical significance of the function coefficients is given (Sig C). The coefficients of the functions in Fig. 3 are significant (α = 0.05).

The observations imply that the variation of ER modelling over time is, on the one hand, directed to an improvement of paradigmatic qualities (general perfection). As a result of general perfection, variation converges to the paradigm. ER models that are in line with the paradigm are mainly characterized by their adaptation to the general purpose of data modelling and their acceptance in the community. On the other hand, a trend of specialized perfection exists; ER models are changed to yield a better adaptation to the data modelling requirements of certain domains. Consequently, variation also leads to divergence into domains. Mostly, specialized ER models are accepted only in the domain.

General perfection and, to some extent, specialized perfection tend to complicate ER models. But variation does not necessarily lead to increasing complexity. Here, the complexity of an ER model is measured by the number of its constructs and by the number of restrictions imposed by syntactic rules. In ER models, specializations of inherent constructs have been deleted because they were no longer necessary (e.g., weak entity types [30]). Moreover, according to the argumentation in Section 3.3, adding a syntactic rule removes a restriction, e.g., by allowing relationships among relationships [65].
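To illustrate how the quantities reported in Table 4 relate to each other, the following minimal Python sketch fits a linear and a quadratic trend by ordinary least squares and derives R² and the empirical F value from the fit. It is not the procedure used in the study, whose statistics were computed with SPSS. The yearly counts are hypothetical, and all function names are assumptions introduced here for illustration only.

```python
def fit_polynomial(ts, ys, degree):
    """Least-squares fit of f(t) = beta[0] + beta[1]*t + ... (normal equations)."""
    m = degree + 1
    # Build X^T X and X^T y for the design matrix X[i][j] = ts[i] ** j.
    a = [[float(sum(t ** (r + c) for t in ts)) for c in range(m)] for r in range(m)]
    rhs = [float(sum(y * t ** r for t, y in zip(ts, ys))) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            rhs[r] -= factor * rhs[col]
    # Back substitution.
    beta = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(a[r][c] * beta[c] for c in range(r + 1, m))
        beta[r] = (rhs[r] - tail) / a[r][r]
    return beta


def r_squared(ts, ys, beta):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum(
        (y - sum(b * t ** j for j, b in enumerate(beta))) ** 2
        for t, y in zip(ts, ys)
    )
    return 1.0 - ss_res / ss_tot


def f_statistic(r2, n, k):
    """Empirical F for the null hypothesis R^2 = 0 (n observations, k predictors)."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical data: number of new ER models per year, indexed from t = 0.
years = list(range(14))
counts = [0, 1, 1, 2, 3, 4, 4, 5, 4, 3, 3, 2, 1, 1]

r2_lin = r_squared(years, counts, fit_polynomial(years, counts, 1))
r2_quad = r_squared(years, counts, fit_polynomial(years, counts, 2))

for name, r2, k in (("linear", r2_lin, 1), ("quadratic", r2_quad, 2)):
    print(name, round(r2, 3), round(f_statistic(r2, len(years), k), 3))
```

Obtaining Sig F from the empirical F value additionally requires the F distribution with k and n − k − 1 degrees of freedom, which the Python standard library does not provide; a statistics package would be needed for that step.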
4. Conclusions and future research

The results of the empirical study presented in Section 3 do not indicate that the hypotheses formulated in Section 2.3 are false. They show instead that the change of ER models over time corresponds to an evolution. This means:

[H1] The variation of ER models is caused by the purpose of data modelling (including the subpurposes description and implementation support) and by chance, as far as the selection of constructs for meanings is concerned.

[H2] The variation of ER models is created systematically. ER models are mostly extended by constructs. Specializations of inherent constructs refer to semantics and syntax already contained in the ER model, corresponding to the type add of the operators of variation. Additionally, constructs are borrowed or invented. Inherent semantics is never deleted or replaced but only modified as a consequence of other variations. New syntactic rules arise in isolation or when constructs are adapted. Extended constructs or syntactic rules are rarely deleted. ER models containing new constructs do not vary further.

[H3] The variation of ER models started from a paradigmatic proposal and is directed to better adaptation to the general and domain-specific purpose of data modelling. Improving adaptation results in ER models that are more complex, more specialized or simpler. Only ER models that basically comply with the paradigm survive, that is, are chosen by other authors for further variation.

From these characteristics of the evolution of ER modelling, the following conclusions about its history and future are inferred:
[C1] There is no sense in trying to justify the origin of variation, as all paradigms are the result of non-cumulative developments. A paradigm emerges as an unanticipated novelty, from scratch, based on the intuition of individual researchers [38].

[C2] The variation of a modelling language will reach a state of paradigmatic stability, which is characterized by stable semantics (for ER models, the consensual general data modelling capabilities) and pluralism of notations.

[C3] The number of domain-specific variants of a modelling language is as infinite as the number of domains. Since acceptance is lacking, domain-specific variants will not affect the paradigm.

The above conclusions are not restricted exclusively to ER modelling. To obtain conclusions independent of any modelling language, the hypotheses must also be generalized. If a modelling language has been subject to evolution, then the following characteristics [A1]–[A6] will apply:

[A1] Within the modelling language, variation that starts from a paradigmatic origin is observable.

[A2] Variation is caused by the purpose of the modelling language, which (according to the definition of modelling languages) always contains the purpose of description and occasionally the purpose of implementation support, and by chance.

[A3] Variation is realized through adding, borrowing and inventing.

[A4] Variation is directed to better adaptation to the purpose.

[A5] Better adaptation results in general or more specialized perfection or in simplification of the modelling language.

[A6] Only variants that comply with the paradigm survive.

Claiming any evolution of a modelling language to possess the characteristics [A1]–[A6] constitutes a law. A law is a hypothesis that is confirmed and depicts a general pattern [10]. The hypothetical characteristics [A1]–[A6] are consistent, independent (i.e., none of the hypotheses is derivable from the others) and can be tested empirically.
From the hypotheses, the conclusions [C1]–[C3] can be deduced. Altogether, [A1]–[A6] and [C1]–[C3] encompass what is known or what can be predicted about the evolution of modelling languages. Such a hypothetical-deductive system is called a theory [10].

Future research will test this theory of evolution on other modelling languages. At present, the evolution of Petri nets is being investigated. Petri nets have been chosen because they are, from the viewpoint of semantics, orthogonal to ER modelling; that is, if the theory holds for Petri nets as well, this cannot be ascribed to semantic equivalence. Moreover, I would like to encourage any reader of this paper to test the theory on any modelling language which has undergone change over time, to report the results and to improve the theory.

References

[1] S.S. Al-Fedaghi, An entity–relationship approach to modelling petroleum engineering database, in: [15], pp. 761–779.
[2] A. Badia, Extending entity–relationship models with higher-order operators, in: [64], pp. 321–330.
[3] M. Balaban, P. Shoval, Enhancing the ER model with integrity methods, Journal of Database Management 10 (4) (1999) 14–23.
[4] R. Barker, CASE*Method: Entity–Relationship Modelling, Addison-Wesley, Wokingham, 1990.
[5] C. Batini (Ed.), Entity–Relationship Approach: A Bridge to the User, North-Holland, Amsterdam, 1989.
[6] C. Batini, S. Ceri, S.B. Navathe, Conceptual Database Design: An Entity–Relationship Approach, Benjamin/Cummings, Redwood City, 1992.
[7] D. Bolinger, Aspects of Language, second ed., Harcourt Brace Jovanovich, New York, 1975.
[8] M.L. Brodie, On the development of data models, in: [9], pp. 19–47.
[9] M.L. Brodie, J. Mylopoulos, J.W. Schmidt (Eds.), On Conceptual Modelling: Perspectives from Artificial Intelligence, Databases, and Programming Languages, Springer, Berlin, 1984.
[10] M. Bunge, Philosophy of Science: From Problem to Theory, rev. ed., Transaction Publishers, New Brunswick, 1998.
[11] P. Chen, The entity–relationship model—toward a unified view of data, ACM Transactions on Database Systems 1 (1) (1976) 9–36.
[12] P. Chen (Ed.), Entity–Relationship Approach to Systems Analysis and Design, North-Holland, Amsterdam, 1980.
[13] P. Chen (Ed.), Entity–Relationship Approach: The Use of ER Concept in Knowledge Representation, North-Holland, Amsterdam, 1985.
[14] C. Darwin, The Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, Modern Library, New York, 1859.
[15] C.G. Davis et al. (Eds.), Entity–Relationship Approach to Software Engineering, North-Holland, Amsterdam, 1983.
[16] C.S. dos Santos, E.J. Neuhold, A.L. Furtado, A data type approach to the entity–relationship model, in: [12], pp. 103–119.
[17] S. Dress et al., Bibliography of Petri Nets, Working Paper No. 315, GMD, Sankt Augustin, 1988.
[18] J. Eder et al., BIER—the behaviour integrated entity–relationship approach, in: [74], pp. 147–166.
[19] R. Elmasri, V. Kouramajian, B.
Thalheim (Eds.), Entity–Relationship-Approach—ER 93, Lecture Notes in Computer Science (LNCS), vol. 881, Springer, Berlin, 1994.
[20] R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, third ed., Addison-Wesley, Reading, 2000.
[21] R. Elmasri, J. Weeldreyer, A. Hevner, The category concept: An extension to the entity–relationship model, Data and Knowledge Engineering 1 (1) (1985) 75–116.
[22] A. Flory, V. Giard, Modelling requirements of a manufacturing design application using an E/R schema, in: S.T. March (Ed.), Entity–Relationship-Approach—ER 87, North-Holland, Amsterdam, 1988, pp. 249–267.
[23] V. Fromkin, R. Rodman, An Introduction to Language, sixth ed., Harcourt Brace College Publishers, Fort Worth, 1998.
[24] D.J. Futuyma, Evolutionary Biology, third ed., Sinauer Associates, Sunderland, 1998.
[25] H.-M. Garcia, O. Sheng, An entity–relationship-based methodology for distributed database design: An integrated approach towards combined logical and distribution designs, in: [60], pp. 178–193.
[26] F. Garzotto, L. Mainetti, P. Paolini, HDM2: Extending the E–R approach to hypermedia application design, in: [19], pp. 178–189.
[27] M. Gogolla, U. Hohenstein, Towards a semantic view of an extended entity–relationship model, ACM Transactions on Database Systems 16 (3) (1991) 369–416.
[28] T.R.G. Green, D.R. Benyon, The skull beneath the skin: entity–relationship models of information artefacts, International Journal of Human–Computer Studies 44 (6) (1996) 801–828.
[29] T. Halpin, Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design, Morgan Kaufmann, San Francisco, 2001.
[30] C. Hsu, Structured database system analysis and design through entity relationship approach, in: [13], pp. 56–63.
[31] R. Hull, R. King, Semantic database modeling: Survey, Applications, and Research Issues, ACM Computing Surveys 19 (3) (1987) 201–260.
[32] Y. Kambayashi et al. (Eds.), Advances in Database Technologies, LNCS, vol.
1552, Springer, Berlin, 1999.
[33] H. Kangassalo (Ed.), Entity–Relationship-Approach: The Core of Conceptual Modelling, North-Holland, Amsterdam, 1991.
[34] G. Kappel, M. Schrefl, A behavior integrated entity–relationship approach for the design of object-oriented databases, in: [5], pp. 311–328.
[35] K. Karlapalem, A.R. Dani, P.R. Krishna, A framework for modeling electronic contracts, in: [39], pp. 193–207.
[36] L. Kerschberg, R. Baum, J. Hung, KORTEX: An expert database system shell for a knowledge-based entity relationship model, in: [44], pp. 255–268.
[37] W.F. King (Ed.), Proceedings of the 1975 ACM SIGMOD International Conference on Management of Data, IBM Research Laboratory, San José, 1975.
[38] T. Kuhn, The Structure of Scientific Revolutions, third ed., University of Chicago Press, Chicago, 1996.
[39] H.S. Kunii, S. Jajodia, A. Sølvberg (Eds.), Conceptual Modeling—ER 2001, LNCS, vol. 2224, Springer, Berlin, 2001.
[40] R. Lazimy, Knowledge representation and modeling support in knowledge-based systems, in: [49], pp. 133–161.
[41] J.Y. Lee, R. Elmasri, J. Won, Specification of calendars and time series for temporal databases, in: [79], pp. 341–356.
[42] M. Lenzerini, SERM: Semantic Entity Relationship Model, in: [13], pp. 270–278.
[43] C.-T. Liu, P.K. Chrysanthis, S.-K. Chang, Database schema evolution through the specification and maintenance of changes on entities and relationships, in: [45], pp. 132–151.
[44] F.H. Lochovsky (Ed.), Entity–Relationship-Approach to Database Design and Querying, North-Holland, Amsterdam, 1990.
[45] P. Loucopoulos (Ed.), Entity–Relationship-Approach—ER 94: Business Modelling and Re-Engineering, LNCS, vol. 881, Springer, Berlin, 1994.
[46] P. Loucopoulos, B. Theodoulidis, D. Pantazis, Business rules modelling: conceptual modelling and object-oriented specifications, in: [81], pp. 323–342.
[47] P. Loucopoulos, R. Zicari (Eds.), Conceptual Modeling, Databases, and CASE: An Integrated View of Information Systems Development, Wiley, New York, 1992.
[48] Z.
Ma et al., Conceptual design of fuzzy object-oriented databases using an extended entity–relationship model, International Journal of Intelligent Systems 16 (6) (2001) 697–711.
[49] S.T. March, Entity–Relationship-Approach—ER 87, North-Holland, Amsterdam, 1988.
[50] E. Mayr, Systematics and the Origin of Species, Columbia University Press, New York, 1982.
[51] M. Minsky (Ed.), Semantic Information Processing, MIT Press, Cambridge/MA, 1968.
[52] M.-L. Mugnier, M. Chein (Eds.), Conceptual Structures: Theory, Tools and Applications—ICCS 98, LNCS, vol. 1453, Springer, Berlin, 1998.
[53] J. Mylopoulos, Conceptual modeling and telos, in: [47], pp. 49–68.
[54] R. Nakano, Integrity checking in a logic-oriented ER model, in: [15], pp. 551–564.
[55] S.B. Navathe, Evolution of data modeling for databases, Communications of the ACM 35 (9) (1992) 112–123.
[56] S.B. Navathe, M.K. Pillalamarri, OOER: Toward making the E–R approach object-oriented, in: [5], pp. 185–206.
[57] W. O'Grady, M. Dobrovolsky, M. Aronoff, Contemporary Linguistics—An Introduction, second ed., St. Martin's, New York, 1993.
[58] C. Parent, S. Spaccapietra, ERC+: An object-based entity relationship approach, in: [47], pp. 69–86.
[59] G. Pernul, W. Winiwarter, A.M. Tjoa, The Entity–Relationship Model for Multilevel Security, in: [19], pp. 166–177.
[60] G. Pernul, A.M. Tjoa (Eds.), Entity–Relationship-Approach—ER 92, LNCS, vol. 645, Springer, Berlin, 1992.
[61] F. Put, The ER approach extended with the action concept as a conceptual modelling tool, in: [5], pp. 423–440.
[62] X. Qian, G. Wiederhold, Data definition facility of CRITIAS, in: [13], pp. 46–55.
[63] M.R. Quillian, Semantic memory, in: [51], pp. 227–270.
[64] Z.W. Ras, S. Ohsuga (Eds.), Foundations of Intelligent Systems—ISMIS 2000, Lecture Notes in Artificial Intelligence, vol. 1932, Springer, Berlin, 2000.
[65] A. Rochfeld, J. Morejon, P. Negros, Inter-relationship links in E–R model, in: [33], pp. 149–163.
[66] M.
Ruhlen, On the Origin of Languages: Studies in Linguistic Taxonomy, Stanford University Press, Stanford, 1994.
[67] J. Rumbaugh, I. Jacobson, G. Booch, The Unified Modeling Language Reference Manual, Addison-Wesley, Reading, 1998.
[68] C. Sapia, Extending the E/R model for the multidimensional paradigm, in: [32], pp. 105–116.
[69] P. Scheuermann, G. Schiffner, H. Weber, Abstraction capabilities and invariant properties: Modelling within the entity–relationship approach, in: [12], pp. 121–140.
[70] G. Schiffner, P. Scheuermann, Multiple views and abstractions with an extended-entity–relationship model, Journal of Computer Languages 4 (4) (1979) 139–154.
[71] H.A. Schmid, J.R. Swenson, On the semantics of the relational data model, in: [37], pp. 211–223.
[72] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, second ed., Chapman & Hall, London, 2000.
[73] K. Siau, Information modeling and method engineering: A psychological perspective, Journal of Database Management 10 (4) (1999) 44–50.
[74] S. Spaccapietra (Ed.), Entity–Relationship-Approach: Ten Years of Experience in Information, North-Holland, Amsterdam, 1987.
[75] M. Steeg, The conceptual database design optimizer CoDO-Concepts, Implementation, Application, in: [79], pp. 105–120.
[76] Y. Tabourier, Further development of the occurrences structure concept: The EROS approach, in: [15], pp. 565–583.
[77] T.J. Teorey et al., ER model clustering as an aid for user communication and documentation in database design, Communications of the ACM 32 (2) (1989) 975–987.
[78] W.M. Tepfenhart, Ontologies and conceptual structures, in: [52], pp. 334–348.
[79] B. Thalheim (Ed.), Conceptual Modeling—ER 96, LNCS, vol. 1157, Springer, Berlin, 1996.
[80] B. Thalheim, Entity–Relationship Modeling: Foundations of Database Technology, Springer, Berlin, 2000.
[81] F. van Assche, B. Moulin, C. Rolland (Eds.), Object Oriented Approach in Information Systems, North-Holland, Amsterdam, 1991.
[82] F. Velez, LAMBDA: An entity–relationship based query language for the retrieval of structured documents, in: [13], pp. 82–89.
[83] G. Vert, M. Stock, A.
Morris, Extending ERD modeling notation to fuzzy management of GIS data files, Data and Knowledge Engineering 40 (2) (2002) 163–179.
[84] G. Wagner, The Agent-Object-Relationship metamodel: towards a unified view of state and behavior, Information Systems 28 (5) (2003) 475–504.

Susanne Patig works in the Business Information Systems group at the Otto-von-Guericke-University in Magdeburg, Germany. She received her diploma in Business Administration from the University of Leipzig, Germany (1997) and her Ph.D. in Computer Science from the Otto-von-Guericke-University in Magdeburg (2001). Her current research interests include conceptual modelling, knowledge representation and software specification.