Database sampling with functional dependencies


Information and Software Technology 43 (2001) 607–615

www.elsevier.com/locate/infsof

Jesús Bisbal*, Jane Grimson
Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland
Received 29 June 2000; revised 5 March 2001; accepted 27 April 2001

Abstract

During the development of information systems, there is a need to prototype the database that the applications will use when in operation. A prototype database can be built by sampling data from an existing database. Including relevant semantic information when extracting a sample from a database is considered invaluable to support the development of data-intensive applications. Functional dependencies are an example of semantic information that could be considered when sampling a database. This paper investigates how a database relation can be sampled so that the resulting sample satisfies precisely a given set of functional dependencies (and its logical consequences), i.e. is an Armstrong relation. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Information system development; Legacy migration; Database sampling; Integrity constraints; Database prototyping; Armstrong relation; Functional dependencies

1. Introduction

During the development of an information system there is a need to construct a prototype of the operational database. Such a database can support several stages of the development process:

- Validation of database and/or application requirements. Different designs can be checked for completeness and correctness [17].
- User training. Applications will need a database on which to operate during training.
- Testing. As needed in, for example, a system's functional test [4], performance evaluation [11], and back-to-back testing [19].

Most of the existing approaches to building prototype databases use synthetic data values to populate the resulting database. Although using synthetic data may be the only alternative, the lack of data values relevant to the application domain limits the usefulness of the prototype database [5,6]. Increasingly, however, real (existing) data are actually available. It has been reported that from 60 to 85% of the total software development costs in industry can be attributed to maintenance [21]. Thus, it is reasonable to expect that in over 60% of projects there will be a database from which real data can be extracted. This paper proposes that these data should be used to populate a prototype database. The resulting database is called a Sample Database.

A database can be sampled according to many different criteria, for example random selection, or selecting data items in such a way that the resulting sample satisfies a set of integrity constraints. Functional dependencies are an example of such integrity constraints, and have been widely investigated in the context of relational databases [1,3,10]. They capture intuitive relationships between data items, and play a central role in the design of databases [16]. Sampling a database in such a way that the resulting sample satisfies a set of functional dependencies will produce a sample with appropriate semantic information to support the database design process, as outlined above. Since it will also contain domain-relevant data values, this sample can be expected to support user training and testing better than synthetic values would.

This research was initially motivated in the context of Legacy Information Systems Migration [8], where the need to identify a representative sample (in terms of semantic information and data values) of an existing operational database is essential to the success of any migration project [22]. A sample of the migrating database can be used to thoroughly test the mapping between legacy and target schemas; additionally, a migrated sample can support the development of the target information system, in the stages identified above. Using the entire legacy database for development and testing purposes only may not be cost-effective.

* Corresponding author. E-mail address: [email protected] (J. Bisbal).

This paper investigates the use of functional dependencies in the context of database sampling. This same problem has also been investigated considering other types of integrity constraints, as will be briefly outlined in Section 2.

The remainder of this paper is organised as follows. The next section reviews existing research in the area of database prototyping and database sampling. Required notation and terminology are introduced in Section 3. Section 4 defines the objectives when sampling with functional dependencies, and describes a concrete database sampling example that will be used in later sections. Two alternative sampling algorithms are analysed in Section 5. The goal of sampling with functional dependencies, as defined in Section 4, is generalised in Section 6. The final section summarises the paper and gives a number of future directions for this research.

2. Related work

Existing research into database prototyping methods has focused mainly on populating the database using synthetic values, i.e. values not necessarily related to the database application domain. The objective is to produce a database satisfying a predefined set of integrity constraints that represent domain-relevant semantics for the data. Noble [18] describes one such method, which considers referential integrity constraints and functional dependencies, but populates the database mainly with synthetic values, although the user can also enter a list of values to be used as the domain for an attribute.

A different database prototyping approach is presented in [24]. This method firstly checks the consistency of an Extended Entity-Relationship (EER) schema defined for the database being prototyped, based on cardinality constraints only. Once the design has been proved consistent, a database according to this schema is generated. To guide the generation process, a so-called general Dependency Graph is created from the EER diagram. This graph represents the set of referential integrity constraints that must hold in the database and is used to define a partial order between the entities of the EER diagram. The synthetic data generation process populates the database entities following this partial order. Löhr-Richter [13] further develops the data generation step used by this method, and recognises the need for additional information in order to generate relevant data for the application domain, without addressing how this is to be achieved.

Tucherman [20] presents a database design tool that prototypes a database, also based on an Entity-Relationship schema. The tool automatically maps this design into a normalised relational schema. Special attention is paid in this contribution to what are called restrict/propagate rules, that is, to enforcing referential integrity constraints. When an operation would violate one such constraint, the system can be instructed either to block (restrict) the operation so that the violation does not occur, or to propagate it to the associated tables by deleting or inserting tuples as required. No explicit reference is made as to how the resulting database should be populated for prototyping, although it identifies the possibility of interacting with the user when new insertions are required, as the user may be the only source for appropriate domain-relevant data values.

An approach more closely related to the work presented here was reported by Mannila [14]. This paper described a mechanism for efficiently populating a database relation so that it satisfies exactly a predefined set of functional dependencies (thus an Armstrong relation, as will be defined in Section 3). Such a relation can be used to assist the database designer in identifying the appropriate set of functional dependencies that the database must satisfy. Since this relation satisfies all the required dependencies and only those, it can be seen as an alternative representation for the dependencies themselves. A relation generated using this method is expected to expose missing or undesirable functional dependencies, and the designer can use it to iteratively refine the database design until it contains only the required dependencies. Special attention is paid in [14] to the size of the relation being generated. A designer might not be able to identify missing or undesirable dependencies if (s)he was presented with a relation containing a large number of tuples. For this reason, the goal is to produce a relation that, while satisfying exactly a given set of functional dependencies, contains the smallest possible number of tuples. See [16] for a more extended treatment of this approach.

All the database prototyping approaches described above populate the database using data values not related to the application domain. This limits their applicability, especially in cases where data values from such prototype databases must be presented to an end user, as is common in software prototyping for requirements elicitation. A user who is unfamiliar with the data values presented may not fully understand their semantics and, therefore, may not be able to comment on required modifications, enhancements or missing functionality. This paper investigates how a prototype database can be built by sampling from an operational database, thus using domain-relevant data.

Limited work has been reported to date in the area of database sampling resulting in a database that satisfies a predefined set of integrity constraints. Bisbal et al. [9] described a mechanism to represent a wide range of integrity constraints (including cardinality constraints, inclusion dependencies, hierarchical relationships and a generalisation of referential integrity constraints) especially well suited for sampling purposes. In a related paper [7] the database sampling process was analysed from a more abstract point of view, without focusing on any particular type of integrity constraint, thus identifying the issues that must always be addressed when consistently sampling from a database. The authors [5,6] have extensively discussed the motivations and application areas of database prototyping and database sampling, and reviewed existing research in this area. The design of a database sampling tool that implements the work presented here has been reported in [5,7].

The work presented here has similar objectives to those described in [14]. The aim is also to produce a small database relation that satisfies a predefined set of functional dependencies. The difference, however, is that the resulting relation must be a subset of a relation in an existing operational database. Using such a relation as the source of data in software prototyping, an end user would be familiar with the data values presented. Also, a database designer, when maintaining an existing database, could identify dependencies that should no longer be satisfied, or dependencies that should be added as a consequence of a change in the environment in which the database operates. Using the entire operational database relation may not be cost-effective in some applications of database prototyping, for example in legacy migration. Also, as outlined above, the size of the entire relation is likely to be too large to be manageable by the designer.

Note that the need for a sample of a database arises in a context significantly different from that which motivated the database prototyping approaches described earlier. Methods that produce synthetically generated databases are needed when developing completely new applications. Currently, there is an increasing need to develop or extend applications based on existing software and data (enhancements, migration, etc.), which motivates the need to address the traditional problem with an additional constraint: the availability of existing applications and related data. See [8] for a more extended discussion of the challenges posed by existing (legacy) applications.

3. Definitions and notation

This section defines the terminology and notation, based on standard definitions [1,23], that will be used throughout this paper. Examples to illustrate these definitions will be given in later sections, where they are used.

Assume there is a set of attributes U, each with an associated domain. A relation schema R over U is a subset of U. A database schema D = {R1, …, Rn} is a set of relation schemas. A relation r over the relation schema R is a set of R-tuples, where an R-tuple is a mapping from the attributes of R to their domains. Attributes in U are denoted with uppercase letters A, B, C, …. The notation will be simplified when referring to sets of attributes, e.g. {A, B}, by removing the curly brackets, e.g. AB, which should not cause confusion as it will be clear from the context. R-tuples are denoted by lowercase letters, possibly with subscripts, t1, t2, …. The relation r to which an R-tuple belongs will always be clear from the context. The value of an attribute, A ∈ U, for a particular R-tuple, t1, is denoted t1(A).

This notation is extended to sets of attributes so that, for Y ⊆ U, t1(Y) denotes a tuple u1 over Y with u1(A) = t1(A) for all attributes A ∈ Y.

A functional dependency can be defined as follows.

Definition 3.1. If U is a set of attributes, then a Functional Dependency over U is an expression of the form Y → Z, where Y, Z ⊆ U. A relation r over U satisfies Y → Z, denoted r ⊨ Y → Z, if for each pair of tuples t1 and t2 in r, t1(Y) = t2(Y) implies t1(Z) = t2(Z). If Σ is a set of functional dependencies, then r ⊨ Σ means that all dependencies in Σ are satisfied in r.

Individual functional dependencies are denoted with lowercase Greek letters, possibly with subscripts, σ, γ1, γ2, …. Sets of functional dependencies are denoted using uppercase Greek letters, possibly with subscripts, Σ, Γ1, Γ2, ….

Definition 3.2. Let Σ and Γ be sets of functional dependencies over U. Then Σ implies Γ, denoted Σ ⊨ Γ, iff for all relations r over U, r ⊨ Σ implies r ⊨ Γ.

Definition 3.3. Let Σ be a set of functional dependencies over U. The closure of Σ, denoted Σ*, is defined as Σ* = {Y → Z | YZ ⊆ U and Σ ⊨ Y → Z}.

Σ* represents the set of all functional dependencies that are logical consequences of Σ. This concept allows for the definition of what are called Armstrong relations, a concept initially introduced in Ref. [2], although the term itself was coined by Fagin [10].

Definition 3.4. A database relation r is an Armstrong relation for a set of functional dependencies Σ iff r satisfies all functional dependencies in Σ* and no other functional dependency. It is minimal if every Armstrong relation for Σ has at least as many tuples as r.

It should be noted that a set of functional dependencies and an Armstrong relation for it are dual representations of the same information [15]. During the design and maintenance of a database, incorrect constraints are more easily identified from a list of dependencies, whereas missing dependencies are better revealed from an example of the relation. This justifies the use of Armstrong relations for design purposes.

The construction of an Armstrong relation presented here relies on the closure of a set of attributes under a set of functional dependencies, called the fd-closure, which is defined next [1,3].

Definition 3.5. Given a set Σ of functional dependencies over U and an attribute set X ⊆ U, the fd-closure of X under Σ, denoted (X, Σ)*, or simply X* if Σ and U are understood from the context, is the set X* = {A ∈ U | Σ ⊨ X → A}.
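The fd-closure of Definition 3.5 can be computed with the usual fixed-point iteration: starting from X, repeatedly add the right-hand side of any dependency whose left-hand side is already contained in the current set. The following Python sketch is an illustration only; the function name and the (lhs, rhs) encoding of dependencies are choices made here, not the paper's.

```python
def fd_closure(X, fds):
    """Compute X* under fds, a list of (lhs, rhs) pairs of frozensets
    (Definition 3.5), by iterating to a fixed point. Illustrative sketch."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If lhs is already determined by X, then so is rhs.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return frozenset(closure)

# With the set used later as the running example, Σ = {B → C, AC → B}:
fds = [(frozenset('B'), frozenset('C')), (frozenset('AC'), frozenset('B'))]
assert fd_closure(frozenset('B'), fds) == frozenset('BC')   # B* = BC
assert fd_closure(frozenset('A'), fds) == frozenset('A')    # A* = A
```

A linear-time variant [1] indexes the dependencies by their left-hand-side attributes; the quadratic loop above is kept for brevity.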


Table 1
Running example: Armstrong relation for Σ = {B → C, AC → B}

        A    B    C
  t1    a1   b1   c1
  t2    a1   b2   c2
  t3    a2   b3   c1
  t4    a2   b2   c2
  t5    a4   b3   c1
  t6    a3   b1   c1

Table 2
Agreements table for the running example

        A*      C*          {BC}*    ∅*              No. X*
  t1    {t2}    {t3, t5}    {t6}     {t4}            4
  t2    {t1}    -           {t4}     {t3, t5, t6}    3
  t3    {t4}    {t1, t6}    {t5}     {t2}            4
  t4    {t3}    -           {t2}     {t1, t5, t6}    3
  t5    -       {t1, t6}    {t3}     {t2, t4}        3
  t6    -       {t3, t5}    {t1}     {t2, t4}        3

Intuitively, X* represents the set of all attributes that are determined by X according to Σ. Some sets of attributes, referred to as saturated [2] (or closed sets [3]), do not determine any attributes other than themselves.

Definition 3.6. A set of attributes X over U is saturated with respect to a set of functional dependencies Σ over U iff X* = X.
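Assuming the fd_closure sketch above, the saturated sets of Definition 3.6 can be enumerated by brute force, testing X* = X for every proper subset of U. This is exponential in |U| and meant only to make the definition concrete; the helper names are ours.

```python
from itertools import combinations

def saturated_sets(U, fds):
    """All proper subsets X of U with X* = X (Definition 3.6).
    Exponential in |U|; illustration only."""
    sats = []
    for k in range(len(U)):                   # sizes 0 .. |U|-1; U itself is
        for xs in combinations(sorted(U), k): # saturated but excluded here
            X = frozenset(xs)
            if fd_closure(X, fds) == X:
                sats.append(X)
    return sats

# Running example of Section 4.2, Σ = {B → C, AC → B}:
fds = [(frozenset('B'), frozenset('C')), (frozenset('AC'), frozenset('B'))]
print(saturated_sets(set('ABC'), fds))        # ∅, {A}, {C}, {B, C}
```

Note that the empty set is saturated here (∅* = ∅), which is why it appears as a column of the agreements table in Section 5.1.1.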

Following the terminology of [3], we define the concept of two tuples agreeing exactly on a set of attributes as follows.

Definition 3.7. Let X be a set of attributes, X ⊂ U. A pair of tuples t1 and t2 over U agree exactly on X iff, for all A ∈ X, t1(A) = t2(A) and, for all B ∉ X, t1(B) ≠ t2(B).

This concept is next extended to entire relations.

Definition 3.8. A database relation r over U satisfies X, with X ⊂ U, iff there exist two different tuples in r that agree exactly on X.

This last definition is central to the purposes of this paper. As will be shown by Theorem 5.1, if a relation r satisfies precisely all saturated sets, then it is an Armstrong relation for the specified set of functional dependencies. Therefore, the goal is to develop an algorithm that selects a set of tuples from a relation in such a way that the resulting relation satisfies all saturated sets.

4. Sampling with functional dependencies

This section defines the objective of the process described in this paper, that is, what is involved when sampling a relation according to a given set of functional dependencies, and also introduces an example.

4.1. Defining the problem

Given a database relation r over U, and a set of functional dependencies Σ satisfied by r, the goal is to extract a sample of this relation that is a small Armstrong relation for Σ (the sense in which it is small will be addressed in Section 5.1.2). The exposition given here will firstly assume that r is an Armstrong relation for Σ. Then, the same problem will be addressed in Section 6, considering a subset Γ1 and a superset Γ2 of Σ, with Γ1 ⊂ Σ ⊂ Γ2.

4.2. Running example

The following is an example that will be used in this paper to illustrate the concepts being developed. Consider the set of functional dependencies Σ = {B → C, AC → B}, and the database relation r over U = {A, B, C} shown in Table 1. It can be seen that tuples t1, t2, t3 and t4 in this relation form a minimal Armstrong relation for Σ. The entire relation r is also an Armstrong relation for Σ (adding tuples t5 and t6 does not invalidate any functional dependency in Σ). It follows that this is an example of a relation from which a sample can be extracted so that the set of functional dependencies it satisfies is the same.

5. Sampling algorithms

This section describes the two algorithms that are used in this paper to select a set of tuples from a relation so that the resulting sample is an Armstrong relation for a predefined set of functional dependencies. The first algorithm extracts a sample of minimal size. The second selects tuples randomly. Although the sample resulting from the second algorithm can be considerably larger than that produced by the first, the efficiency improvement can be very significant.

5.1. Sampling method (I): minimal sample size

In order to guarantee the minimal possible size for the resulting sample, the algorithm presented here relies on constructing what is called in this paper an Agreements Table. This table represents the interactions between tuples in the input relation r, in terms of the set of functional dependencies each pair of tuples, if included in the sample, invalidates (see below for an example). This information is used to guide the sampling process. (All database sampling strategies require some information structure to guide the sampling process [7].)


Fig. 1. Algorithm for minimal size sampling.

The construction of the agreements table is based on the concept of a set of attributes being saturated (Definition 3.6), as described below.

5.1.1. Agreements table

The fd-closure is used in an agreements table to compute all sets of attributes X ⊂ U such that X* = X, that is, all saturated sets. An agreements table gathers, for each tuple in the input database relation, which other tuples can be used with that one to agree exactly (Definition 3.7) on each saturated set. As an example, using Σ = {B → C, AC → B} as in Section 4.2, the saturated sets of attributes are {A, C, BC, ∅}. (To simplify the exposition, this paper considers all saturated sets. It has been shown [2,3] that it suffices to consider only the so-called intersection generators of this family; the algorithm presented below is valid in either case. By convention, U is the intersection of the empty collection of sets, thus it is not a generator, although it is saturated; for this reason it has not been included here. In the example relation used in this paper, the set of generators and the set of all saturated sets are the same except for U.)

An agreements table has a column for each such set. Table 2 shows the agreements table for the running example of Section 4.2. For example, tuple t2 does not agree exactly on C* with any other tuple; for this reason column C* of row t2 is empty. Even though t2 and t4 have the same value for attribute C, they also have the same value for B, and a different value for A; for this reason they agree exactly on {B, C}*, not on C*. As another example, it can be seen that tuple t1 can only use tuple t2 to agree exactly on A*. This follows from the fact that in relation r of Section 4.2 we have t1(A) = t2(A), but t1(B) ≠ t2(B) and t1(C) ≠ t2(C), as required by Definition 3.7. The meaning of this relationship between t1 and t2 is that if both tuples are included in the sample, no functional dependency with only attribute A on the left-hand side will be satisfied. The algorithm described below exploits this fact in order to build a sample of the input relation that violates (i.e. does not satisfy) all functional dependencies not in Σ*.
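The agreements table can be built directly from Definition 3.7: for every pair of tuples, compute the set of attributes on which they agree, which is by definition the set on which the pair agrees exactly. The following hedged Python sketch represents tuples as dicts and takes the saturated sets from the saturated_sets sketch above; all names are ours.

```python
from itertools import combinations

def agree_exactly_on(t1, t2, U):
    """The set of attributes on which t1 and t2 agree exactly
    (Definition 3.7): equal inside the set, different outside it."""
    return frozenset(A for A in U if t1[A] == t2[A])

def agreements_table(r, U, sat_sets):
    """For each tuple index i and each saturated set X, the indices j such
    that tuples i and j agree exactly on X (cf. Table 2). If r is an
    Armstrong relation for Σ, every pair of distinct tuples agrees exactly
    on some saturated set, so no pair is silently dropped."""
    table = {i: {X: set() for X in sat_sets} for i in range(len(r))}
    for i, j in combinations(range(len(r)), 2):
        X = agree_exactly_on(r[i], r[j], U)
        if X in table[i]:
            table[i][X].add(j)
            table[j][X].add(i)
    return table

# The 'No. X*' column of Table 2 is then:
# no_x = {i: sum(1 for X in row if row[X]) for i, row in table.items()}
```

Applied to the relation of Table 1 (tuples indexed in table order) with the saturated sets {A, C, BC, ∅}, this reproduces Table 2; for instance, row t1 gets A* = {t2}, C* = {t3, t5}, {BC}* = {t6} and ∅* = {t4}.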


The rationale behind an agreements table is that if a sample is selected in such a way that it satisfies all saturated attribute sets, then all functional dependencies in Σ*, and only those, hold in the resulting sample. This is formalised next.

Theorem 5.1. Let r be a database relation and Σ a set of functional dependencies, both over U. If relation r satisfies precisely all sets of attributes X with X* = X, then r is an Armstrong relation for Σ.

Proof. (This proof is included here for completeness only; see p. 37 of [3] or p. 592 of [2] for alternative proofs.) Firstly, to prove that all functional dependencies in Σ* hold in r, let σ ∈ Σ* be a functional dependency. Without loss of generality, assume that σ is of the form X → A, with X ⊂ U and A ∈ U. Let t1 and t2 be any pair of distinct tuples with t1(X) = t2(X). By hypothesis, any two distinct tuples agree exactly on some saturated set; let W be the set on which t1 and t2 agree exactly. Then X ⊆ W and, since W is saturated, X* ⊆ W* = W. Since A ∈ X*, t1 and t2 also agree on A. It follows that r ⊨ σ, as required. To prove that only the dependencies in Σ* hold in r, take a functional dependency X → A ∉ Σ*. This implies A ∉ X*. The set X* is saturated, so by hypothesis there is a pair of tuples, t1 and t2, that agree exactly on X*; therefore t1(X) = t2(X) and t1(A) ≠ t2(A). So, by definition, r ⊭ X → A. This proves that r is an Armstrong relation for Σ. □

In addition to one column for each saturated set, another column, termed 'No. X*' (number of X*), is shown in the agreements table of Table 2. It represents the number of different sets X that could be satisfied by selecting this tuple (i.e. the number of non-empty columns in each row). This column is used to select between several possible tuples that could be inserted into the sample, in order for it to be of minimal size. Selecting tuples with the highest value for 'No. X*' leads to the minimum possible number of additional tuples required to satisfy all saturated sets.

5.1.2. Minimal size sampling algorithm

The above discussion has shown how, using an agreements table, the resulting sample (1) is an Armstrong relation, and (2) is expected to be small. This solves the problem as stated in Section 4.1.

The algorithm presented in this section extracts a sample of minimal size. It is only minimal according to the information available in the agreements table. Tuples are selected according to their values for column 'No. X*' in this table. These values represent the number of saturated sets to which each particular tuple is directly related (in the sense of being able to satisfy them), as opposed to considering all saturated sets to which each tuple is transitively related. These two alternatives can be identified with selecting tuples using local information or global information, respectively. The algorithm that follows would not change if the second alternative was followed; however, the complexity (in terms of execution time) of building an agreements table would be significantly higher.

Fig. 1 shows an algorithm that implements the sampling process described above. The initial call for this recursive algorithm should be MinimalSizeSample(r, Σ, Sample), where r is the relation being sampled, Σ is the set of functional dependencies being considered, and Sample contains the resulting sample relation. In addition to a sample of relation r, this algorithm also returns whether sampling was successfully performed. (The situation in which it is not possible to extract an Armstrong relation for Σ from the input relation r is analysed in Section 6.)

The algorithm initially selects a tuple, tj, with the highest value for column 'No. X*' according to the agreements table (if several tuples have the same value, it can select randomly between them). This tuple is then included in the sample. After that, a recursive procedure, MinimalSizeSample_rec, is used to select tuples that, taken together with the tuple inserted last, agree exactly on as many saturated sets as possible. Using this last inserted tuple, LastTuple, function SelectTuple_local returns another tuple, ti, that agrees exactly with LastTuple on one (or more) saturated sets, X. The information stored in the agreements table is used to select such a tuple, and is updated after each insertion in order to record that tuple ti, and the newly satisfied set (or sets) X, are not to be considered in future selections. Then, MinimalSizeSample_rec is called again using the newly inserted tuple. The chain of recursive calls terminates when either all saturated sets have been satisfied, or no more sets can be satisfied using tuple tj. In the former case, the algorithm terminates and the current sample is an Armstrong relation. In the latter, a new tuple tj is selected and the process described above is repeated. If no such tuple can be found, there are some saturated sets that cannot be satisfied and, therefore, an Armstrong relation cannot be extracted (see Section 6).
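Fig. 1 itself is not reproduced in this text, so the following Python sketch is a reconstruction of the strategy just described, not the authors' code: pick the tuple with the highest 'No. X*' value, then recursively chain through its partners in the agreements table, updating the set of still-unsatisfied saturated sets. It reuses agreements_table from the previous sketch.

```python
def minimal_size_sample(r, U, sat_sets):
    """Greedy reconstruction of MinimalSizeSample (Section 5.1.2).
    Returns (set of tuple indices, success flag)."""
    agr = agreements_table(r, U, sat_sets)
    unsat = set(sat_sets)                  # saturated sets still to satisfy
    sample = set()

    def no_x(i):
        # 'No. X*' restricted to the sets that are still unsatisfied.
        return sum(1 for X in unsat if agr[i][X])

    def extend(last):
        # SelectTuple_local: satisfy further sets via the last inserted tuple.
        for X in list(agr[last]):
            if X not in unsat or not agr[last][X]:
                continue
            ti = max(agr[last][X], key=no_x)
            sample.add(ti)
            unsat.discard(X)
            # Sets satisfied as a side effect of tuples already in the sample.
            for Y in list(unsat):
                if any(ti in agr[j][Y] for j in sample if j != ti):
                    unsat.discard(Y)
            extend(ti)

    while unsat:
        candidates = [i for i in agr if no_x(i) > 0]
        if not candidates:
            return sample, False           # some saturated set is unsatisfiable
        tj = max(candidates, key=no_x)     # SelectTuple: global 'No. X*' choice
        sample.add(tj)
        extend(tj)
    return sample, True
```

On the running example this returns a four-tuple minimal Armstrong relation; exactly which four tuples are chosen depends on tie-breaking and iteration order, and the walkthrough in Section 5.1.3 arrives at t1, t2, t4 and t3.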


5.1.3. Sampling example

Applying the algorithm of Fig. 1 to the running example of Section 4.2 and the agreements table shown in Table 2 leads to the following scenario.

1. According to the global selection criterion (function SelectTuple), two tuples could initially be selected, since they share the highest value for 'No. X*': t1 and t3. Assume that tuple t1 is (randomly) chosen. Now the recursive procedure is called as MinimalSizeSample_rec(AgrT, Sample, t1).

1.1. Using the agreements table it can be seen that t2 must be selected in order to satisfy A* (function SelectTuple_local). Now MinimalSizeSample_rec(AgrT, Sample, t2) is called.

1.1.1. Set C* cannot be satisfied using t2. However, according to the agreements table (row t2), t4 must be selected so that {BC}* is satisfied. Additionally, due to the interaction between t1 (already in the sample) and t4 (just inserted), set ∅* is also satisfied.

1.1.2. MinimalSizeSample_rec(AgrT, Sample, t4) is called.

1.1.2.1. The only saturated set that remains to be satisfied is C*. However, t4 does not agree exactly with any tuple on C*, so this recursive call terminates.

1.1.3. For the same reason, t2 cannot be used. This recursive call also terminates.

1.2. In contrast, there are two tuples that agree exactly on C* with t1: t3 and t5. Function SelectTuple_local, using the values of column 'No. X*' in Table 2, would select t3 to be inserted into the sample. MinimalSizeSample_rec(AgrT, Sample, t3) is called.

1.3. All X* (saturated sets) are satisfied, so there is nothing for this recursive call to do.

1.4. All X* are satisfied, so this recursive call terminates.

2. All X* are satisfied; function MinimalSizeSample terminates.

The resulting sample (tuples t1, t2, t4 and t3) is exactly the minimal Armstrong relation identified in Section 4.2.

5.2. Sampling method (II): random sampling

An alternative method for sampling a database is to select the set of tuples randomly. Under this approach the sample size cannot be guaranteed to be minimal, but the efficiency improvement (both in execution time and in memory usage) is significant, as discussed below. A list of the saturated sets of attributes still to be satisfied by the current sample is maintained. Whenever a new tuple is randomly selected, the algorithm checks whether any new saturated set becomes satisfied. If so, the tuple becomes part of the sample and the newly satisfied sets are deleted from the list; if not, another tuple is sampled. This process continues until the list of saturated sets to be satisfied is empty.

The two approaches, minimal size sampling and random sampling, must be compared. It is expected that in those cases where the minimal size sample is much smaller than the initial table, random sampling will, on average, not perform better than the minimal size sampling algorithm, as the resulting sample will be significantly larger than necessary.
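Sampling method (II) can be sketched as follows. A single tuple never satisfies a saturated set on its own, since Definition 3.8 needs a pair, so this version remembers every drawn tuple and accepts a pair as soon as it agrees exactly on a still-unsatisfied set; that pairing detail is our choice to avoid a cold start, and is not spelled out in the paper. Termination assumes the input relation is an Armstrong relation for Σ.

```python
import random

def random_sample(r, U, sat_sets, seed=0):
    """Sketch of sampling method (II): random tuple selection driven by
    the list of saturated sets still to be satisfied."""
    rng = random.Random(seed)
    unsat = set(sat_sets)
    sample, seen = set(), set()
    while unsat:
        i = rng.randrange(len(r))          # draw a tuple at random
        for j in seen - {i}:
            X = agree_exactly_on(r[i], r[j], U)
            if X in unsat:                 # the pair newly satisfies X,
                sample |= {i, j}           # so both tuples are kept
                unsat.discard(X)
        seen.add(i)
    return sample
```

Since the input relation contains, for every saturated set, some pair agreeing exactly on it, and every tuple is eventually drawn, the loop terminates with probability 1.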

5.3. Complexity of sampling algorithms

This section analyses the worst-case complexity of the algorithms given in Sections 5.1.2 and 5.2 (set operations are ignored for simplicity). The standard O-notation [12] will be used. The minimal size sampling algorithm consists of three phases: (1) computing all saturated sets of attributes (recall from Section 5.1.1 that it suffices to consider only the intersection generators); (2) constructing the agreements table; and (3) extracting a minimal size sample using the algorithm shown in Fig. 1. Let n be the number of attributes in U, p the number of tuples in the database relation, and q the number of functional dependencies in Σ; that is, n = |U|, p = |r| and q = |Σ|.

1. Computing X* can be done in time linear in the size of Σ and X [1], and thus checking whether X is saturated has complexity O(q + n). Beeri [3] showed that there exist sets of functional dependencies for which any Armstrong relation has size exponential in the size of Σ; therefore, any algorithm must face this worst-case complexity. However, this is only the case for highly unnormalised relations, which should be rare. For relations in Boyce-Codd Normal Form [16], the number of saturated sets is bounded by a polynomial in n and q [14], whose degree depends on the number of keys in the relation. Refer to this polynomial as pol(n, q). This first phase therefore has complexity O(pol(n, q)(q + n)).

2. For each tuple t1 and each saturated set X, compute which tuples t2 agree with t1 exactly on X. Since |X| ≤ n, this phase's complexity is O(pol(n, q) p² n). It can be improved, however, if the relation is sorted on X before each test [15], which results in complexity O(pol(n, q) n p log p).

3. The algorithm of Fig. 1 has complexity O(pol(n, q) p), ignoring the construction of the agreements table, which was analysed in (1).

Therefore, the entire algorithm to extract a minimal size sample, if the relation is normalised, has complexity O(pol(n, q) n p log p). This result is comparable to the complexity of known algorithms that build Armstrong relations using synthetic data [14]. Following a similar analysis, it can be seen that the algorithm for random sampling with functional dependencies has complexity O(pol(n, q) p n).

6. Sampling with subsets or supersets of Σ

The previous section assumed that the input relation r was an Armstrong relation for Σ, and the goal was to extract a minimal Armstrong relation. This section relaxes this assumption in both directions, first considering a subset of the whole set of functional dependencies satisfied by r, and then a superset.


If a subset Γ1 of Σ, with Γ1* ⊂ Σ*, is considered, then sampling is not possible. That is, no sample of a relation that satisfies exactly Σ* will satisfy exactly Γ1*. This is proved next.

Theorem 6.1. Let r be an Armstrong relation for a set of functional dependencies Σ. There cannot be a sample s of relation r that is an Armstrong relation for a proper subset Γ1 of Σ, Γ1* ⊂ Σ*.

Proof. Let σ be a functional dependency such that σ ∈ Σ* but σ ∉ Γ1*. Since r is an Armstrong relation for Σ, there cannot be any pair of tuples in r that violates σ. Therefore, a sample s of r cannot violate σ either. Since s ⊨ σ for some σ ∉ Γ1*, s cannot be an Armstrong relation for Γ1, as required. □

Consider now sampling with supersets of Σ. If the aim is also to extract an Armstrong relation for a given superset of Σ, then this may not always be possible. To see this, assume that a sample satisfying precisely Γ2 = Σ ∪ {A → C} is to be extracted from a relation containing only tuples {t1, t2, t3, t4} of Table 1. It can be seen that one tuple from {t1, t2} and another from {t3, t4} must be left out of the sample, in order to satisfy the new functional dependency A → C. However, the resulting sample (e.g. {t1, t4} or {t2, t3}) satisfies more functional dependencies than only those in Γ2* (e.g. B → A). Thus, it cannot be an Armstrong relation for Γ2.

Therefore, in order to sample with a superset of Σ, it is necessary to weaken the objective of sampling. Instead of requiring the resulting sample to be an Armstrong relation, the goal is now only to ensure that it satisfies all specified functional dependencies (even if it may also satisfy others).

Theorem 6.2. Let r be an Armstrong relation for a set of functional dependencies Σ over U. There exists a sample s of r that satisfies a superset Γ2 of Σ, Σ* ⊂ Γ2*.

Proof. Let γ1 be a functional dependency such that γ1 ∈ Γ2 and γ1 ∉ Σ*. Without loss of generality, assume γ1 is of the form Y → A, with Y ⊂ U and A ∈ U. Initially, let sample s be the entire relation r. For any pair of tuples t1 and t2 in s with t1(Y) = t2(Y) and t1(A) ≠ t2(A), remove t2 from s. No functional dependency in Σ can become unsatisfied by removing a tuple from s (Theorem 6.1), therefore now s ⊨ Σ* ∪ {γ1}. (Note that by removing a tuple, however, a new functional dependency not in Γ2* may become satisfied.) This process can be repeated for any additional γ2 ∈ Γ2 \ {γ1} with γ2 ∉ Σ* ∪ {γ1}, until s ⊨ Γ2, as required. □

The algorithm given in Fig. 1 can easily be modified to sample with supersets of Σ, or to detect when this is not possible, as in the previous example.

All that is required is to ensure that functions SelectTuple() and SelectTuple_local() do not select tuples that satisfy non-saturated sets of attributes. As an example, consider Γ2 = Σ ∪ {A → C} = {B → C, AC → B, A → C} and the (entire) relation shown in Table 1 of Section 4.2. The corresponding set of saturated sets is {∅, C, BC}; therefore, the set from Table 2 that must not be satisfied is A*. Applying the algorithm of Fig. 1 with the suggested modification, the following sample would be extracted: {t1, t5, t4, t6}. This sample is, indeed, an Armstrong relation for Γ2. The same procedure demonstrates why, in the previous example, no sample of {t1, t2, t3, t4} can be an Armstrong relation for Γ2.
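The modification just described can be captured by a side condition on the candidate sample: no two selected tuples may agree exactly on a set outside the saturated family of Γ2. The paper leaves this change inside SelectTuple and SelectTuple_local; the small sketch below, reusing earlier helpers, is our reconstruction of the check itself.

```python
from itertools import combinations

def respects_saturated_family(sample, r, U, sat_sets):
    """True iff no pair of sampled tuples agrees exactly on a
    non-saturated set (the Section 6 side condition)."""
    return all(agree_exactly_on(r[i], r[j], U) in sat_sets
               for i, j in combinations(sorted(sample), 2))

# For Γ2 = {B → C, AC → B, A → C} the saturated sets are ∅, C and BC,
# so any candidate pair agreeing exactly on A (such as t1 and t2 of
# Table 1) must be rejected during selection.
```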

The two theorems presented in this section identify what can be expected when sampling with functional dependencies, and the minimum knowledge required about the database relation being sampled in order to do so (i.e. a set of functional dependencies for which the input relation is an Armstrong relation). If the available set of functional dependencies is not this minimum, the algorithm given in Section 5.1.2 can detect the situation, notifying the database designer and thereby increasing the understanding of the relation.

7. Summary and future work

This paper has investigated how a database relation can be sampled so that the resulting set of tuples satisfies precisely those functional dependencies that are logical consequences of a given set of functional dependencies, that is, it is an Armstrong relation. This has been achieved using the concept of the closure of a set of attributes under a set of functional dependencies, the fd-closure, which is used to compute the so-called saturated sets of attributes. The saturated sets guide the sampling process to select tuples in such a way that all functional dependencies not in the initial set are violated (i.e. not satisfied). Two algorithms have been proposed: one that extracts a sample with the minimum number of tuples, and one that selects tuples randomly. Initially, it was assumed that sampling would be performed according to the whole set of functional dependencies satisfied by the relation, denoted by Σ. This assumption was then relaxed, considering subsets and supersets of the initial set of functional dependencies.

As outlined in Section 5.1.2, the algorithm presented in this paper to extract a minimal size sample is in fact suboptimal, in the sense that a sample of smaller size than the one generated by the algorithm could exist in some circumstances. This is due to the fact that the algorithm considers only local information to decide which tuples must be inserted into the sample. How global information could be used to ensure minimal sample size, without excessively increasing the algorithm's complexity, is still to be investigated.

Section 6 showed that extracting an Armstrong relation for a superset of Σ is not always possible. An interesting line of work could investigate in which cases it is actually possible to extract an Armstrong relation for a superset of Σ. Finally, it should be noted that the concept of Armstrong relation is not exclusively related to functional dependencies, but has been extended to a more general class of dependencies [10]. Database sampling with these dependencies, in order to extract Armstrong relations for them, would be an interesting area for future work.

References

[1] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley, Reading, 1995.
[2] W.W. Armstrong, Dependency structures of data base relationships, in: J.L. Rosenfeld (Ed.), Information Processing 74 (Proceedings of IFIP Congress 74), North-Holland, Amsterdam, 1974, pp. 580-583.
[3] C. Beeri, M. Dowd, R. Fagin, R. Statman, On the structure of Armstrong relations for functional dependencies, Journal of the ACM 31 (1) (1984) 30-46.
[4] B. Beizer, Black-Box Testing: Techniques for Functional Testing of Software and Systems, Wiley, New York, 1995.
[5] J. Bisbal, Database Sampling to Support the Development of Data-Intensive Applications, PhD thesis, University of Dublin, Trinity College, 2000.
[6] J. Bisbal, J. Grimson, Database prototyping through consistent sampling, in: The International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet (SSGRR'2000), Scuola Superiore Guglielmo Reiss Romoli (SSGRR), 2000.
[7] J. Bisbal, J. Grimson, Generalising the consistent database sampling process, in: B. Sanchez, N. Nada, A. Rashid, T. Arndt, M. Sanchez (Eds.), Proceedings of the Joint Meeting of the Fourth World Multiconference on Systemics, Cybernetics and Informatics (SCI'2000) and the Sixth International Conference on Information Systems Analysis and Synthesis (ISAS'2000), vol. II: Information Systems Development, International Institute of Informatics and Systemics (IIIS), Orlando, 2000, pp. 11-16.
[8] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: issues and directions, IEEE Software 16 (5) (1999) 103-111.
[9] J. Bisbal, B. Wu, D. Lawless, J. Grimson, Building consistent sample databases to support information system evolution and migration, in: G. Quirchmayr, E. Schweighofer, T.J.M. Bench-Capon (Eds.), Proceedings of the Ninth International Conference on Database and Expert Systems Applications (DEXA'98), vol. 1460 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1998, pp. 196-205.

[10] R. Fagin, Horn clauses and database dependencies, Journal of the ACM 29 (4) (1982) 952-985.
[11] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, P.J. Weinberger, Quickly generating billion-record synthetic databases, in: R.T. Snodgrass, M. Winslett (Eds.), Proceedings of the International Conference on Management of Data (SIGMOD 1994), ACM Press, New York, 1994, pp. 243-252.
[12] D.E. Knuth, Fundamental Algorithms, vol. 1 of The Art of Computer Programming, 3rd ed., Addison-Wesley, 1997.
[13] P. Löhr-Richter, A. Zamperoni, Validating database components of software systems, Technical Report 94-24, Leiden University, Department of Computer Science, 1994.
[14] H. Mannila, K.-J. Räihä, Small Armstrong relations for database design, in: Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS'85), 1985, pp. 245-250.
[15] H. Mannila, K.-J. Räihä, Dependency inference, in: P.M. Stocker, W. Kent, P. Hammersley (Eds.), Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'87), Morgan Kaufmann, 1987, pp. 155-158.
[16] H. Mannila, K.-J. Räihä, The Design of Relational Databases, Addison-Wesley, Reading, 1992.
[17] A. Neufeld, G. Moerkotte, P.C. Lockemann, Generating consistent test data: restricting the search space by a generator formula, VLDB Journal 2 (2) (1993) 173-213.
[18] H. Noble, The automatic generation of test data for a relational database, Information Systems 8 (2) (1983) 79-86.
[19] I. Sommerville, Software Engineering, 5th ed., Addison-Wesley, Reading, 1995.
[20] L. Tucherman, M.A. Casanova, A.L. Furtado, The CHRIS consultant: a tool for database design and rapid prototyping, Information Systems 15 (2) (1990) 187-195.
[21] G. Wiederhold, Modeling and system maintenance, in: M.P. Papazoglou (Ed.), Proceedings of the 14th International Conference on Object-Oriented and Entity-Relationship Modeling (OOER'95), Springer-Verlag, Heidelberg, 1995, pp. 1-20.
[22] B. Wu, D. Lawless, J. Bisbal, R. Richardson, J. Grimson, V. Wade, D. O'Sullivan, The butterfly methodology: a gateway-free approach for migrating legacy information systems, in: B. Werner (Ed.), Proceedings of the Third IEEE Conference on Engineering of Complex Computer Systems (ICECCS'97), IEEE Computer Society Press, 1997, pp. 200-205.
[23] M. Yannakakis, Perspectives on database theory, in: Proceedings of the 36th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, Los Alamitos, 1995, pp. 224-246.
[24] A. Zamperoni, P. Löhr-Richter, Enhancing the quality of conceptual database specifications through validation, in: Proceedings of the 12th International Conference on Entity-Relationship Approach (ER'93), vol. 823 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1993, p. 85.