Information Systems 29 (2004) 483–507

On approximation measures for functional dependencies

Chris Giannella*,1, Edward Robertson1

Department of Computer Science, Indiana University, Bloomington, IN 47405, USA

Abstract

We examine the issue of how to measure the degree to which a functional dependency (FD) is approximate. The primary motivation lies in the fact that approximate FDs represent potentially interesting patterns present in a table, and their discovery is a valuable data mining problem. However, before discovery algorithms can be developed, a measure must be defined quantifying the approximation degree. First we develop an approximation measure by axiomatizing the following intuition: the degree to which X → Y is approximate in a table T is the degree to which T determines a function from Π_X(T) to Π_Y(T). We prove that a unique unnormalized measure satisfies these axioms up to a multiplicative constant. Next we compare the measure developed with two other measures from the literature. In all but one case, we show that the measures can be made to differ as much as possible within normalization. We examine these measures on several real datasets and observe that many of the theoretically possible extreme differences do not bear themselves out. We offer some conclusions as to particular situations where certain measures are more appropriate than others.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Functional dependency; Approximation measures; Entropy; Dependency discovery

1. Introduction

The field of data mining has enjoyed tremendous growth in the last 15 years. Broadly put, it is [1, p. 1]: "... the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." A great deal of effort has focused on the development of efficient algorithms for discovering certain (often well-defined) patterns. One such pattern is the functional dependency (FD); other examples include association rules and clusters.

Let S be a non-empty set of attributes (a schema) and T a finite table over S (possibly containing repeated tuples). Given non-empty X, Y ⊆ S, X → Y is an FD in T if for any tuples t1, t2, t1[X] = t2[X] implies t1[Y] = t2[Y] (written T ⊨ X → Y). An equivalent definition of FDs will be useful later. Let Π_X(T) denote the standard projection operation.² For x ∈ Π_X(T), let T_{X=x} denote the collection of tuples t in T where t[X] = x (T_{X=x} may contain repeats). T ⊨ X → Y if for all x ∈ Π_X(T), |Π_Y(T_{X=x})| = 1.

*Corresponding author. E-mail addresses: [email protected] (C. Giannella), [email protected] (E. Robertson).
¹ Supported by NSF Grant IIS-0082407.
² Π_X(T) is a set and does not contain repeats, unlike T. We assume that the reader is familiar with standard notation from relational database theory. See [2] for a review.
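As a concrete illustration of the second definition (not part of the original paper), the following Python sketch checks whether T ⊨ X → Y by verifying that every X-group projects to a single Y value. The table layout (a list of dictionaries) and the column names are assumptions made for the example.

```python
from collections import defaultdict

def holds_fd(rows, x_cols, y_cols):
    """Return True iff T |= X -> Y: every X value maps to exactly one Y value."""
    y_values = defaultdict(set)
    for t in rows:
        x = tuple(t[c] for c in x_cols)
        y = tuple(t[c] for c in y_cols)
        y_values[x].add(y)          # Pi_Y(T_{X=x}), built incrementally
    return all(len(ys) == 1 for ys in y_values.values())

T = [{"emp": 1, "years": 5, "salary": 50}, {"emp": 2, "years": 5, "salary": 50},
     {"emp": 3, "years": 7, "salary": 60}, {"emp": 4, "years": 7, "salary": 65}]
print(holds_fd(T, ["years"], ["salary"]))  # False: years 7 maps to two salaries
```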


In the last 11 years there has been growing interest in the problem of discovering FDs that hold in a given relational table T [3–9]. FDs represent potentially novel and interesting patterns present in T, and their discovery provides valuable knowledge of the "structure" of T. Unlike FD researchers in the 1970s, we are interested in FDs that hold in a given instance of the schema rather than FDs that are pre-defined to hold in any instance of the schema. The FDs of our interest are instance based: they represent structural properties that a given instance of the schema satisfies, rather than properties that any instance of the schema must satisfy to be considered valid.

In some cases an FD may "almost" hold. For example, we may well imagine a corporation database where years of experience determines salary level in all but a few exceptional cases. These are approximate functional dependencies (AFDs). AFDs also represent interesting patterns contained in T.

1.1. Motivations

One motivation for the study of AFDs lies in the fact that AFDs represent knowledge of dependencies between columns in a table, so their discovery is potentially valuable to domain experts. For example, paraphrasing from [3, p. 100]: an AFD in a table of chemical compounds relating various structural attributes to carcinogenicity could provide valuable hints to biochemists about potential causes of cancer (but cannot be taken as fact without further analysis by domain specialists). Along these lines, Boulicaut [10] describes how FDs, AFDs, and inclusion dependencies can be used to aid the understanding of data semantics in relational databases (Boulicaut describes this process as following a "database audit" perspective). However, that work does not investigate the question of how an AFD measure should be defined (it uses the g3 measure introduced in [5]). Similarly, Lopes [11] and Lopes et al. [12] develop a framework over which algorithms can be developed for the discovery of FDs, AFDs, minimal keys, and normal form tests. They argue that such algorithms are useful for understanding databases, but they too do not investigate how an AFD measure should be defined (they use g3).

Another motivation lies in the fact that AFDs can be used to partially normalize tables, separating off exceptions. Such a process can produce a database whose design is more intuitively appealing and more compact [13]. Moreover, rewriting queries to take advantage of a partial normalization can result in improved evaluation time [14]. Finally, we feel that the study of AFD discovery is interesting in its own right. Similar to the discovery of association rules, AFD discovery in a large table presents difficult and interesting algorithmic challenges.

1.2. Purpose and layout

Before algorithms for discovering AFDs can be developed, an approximation measure must be defined. There are many ways to do so and there is no universally superior measure; choosing the best measure is difficult because the decision is partly subjective, and intuition developed from background knowledge must be taken into account. Hence, we believe this problem is interesting and deserving of further study. One approach to defining a measure is to isolate properties that the measure must satisfy; assumptions from intuition are taken into account in the definition of these properties, and a measure is then derived from them.
In Section 2 we describe works in the database literature related to AFD measures and their discovery. In Sections 3 and 4 we develop an approximation measure following the methodology mentioned in the previous paragraph. The intuition from which the properties are developed is the following: the degree to which X → Y is approximate in T is the degree to which T determines a function from Π_X(T) to Π_Y(T). By "determines" we mean that each tuple in T is to be regarded as a data point that either supports or denies a


mapping choice x ∈ Π_X(T) ↦ y ∈ Π_Y(T). We prove that a unique unnormalized measure satisfies these properties up to a multiplicative constant. In Sections 5–7 we compare the measure developed with two other measures from the literature. The goal is to determine how much these measures differ. In all but one case, we prove tight bounds on the differences between measures. These bounds show that the measures can be made to differ to the maximum possible extent (i.e. one measure goes to zero while the other goes to one). However, on several real datasets, many of these theoretically possible extreme differences do not bear themselves out. Nonetheless, considerable differences are observed on the real datasets. Finally, we conclude the paper in Section 8, where we describe some directions for future work.

Stylistic comment. Lengthy proofs are deferred to appendices to avoid detracting from the flow of the discussion.

2. Related work

The first subsection describes previous measures developed based on information theoretic principles. The second describes previous measures developed based on other principles. The third describes works concerning FD and AFD discovery. Finally, the fourth subsection describes other works related to AFDs.

2.1. Information theoretic approaches to defining measures

Lee [15], Malvestuto [16], and Nambiar [17] introduce the idea of applying the Shannon entropy function to measure the "information content" of the data in the columns of an attribute set. They extend the idea to develop a measure that, given T, quantifies the amount of information the columns of X contain about Y. This measure is the conditional entropy between the probability distributions associated with X and Y through frequencies (i.e. x ∈ Π_X(T) is assigned probability equal to the number of tuples containing x divided by the total number of tuples). All three authors show that this measure is non-negative and is zero exactly when X → Y is an FD. As such, an approximation measure is obtained. However, the main thrust of [15–17] was to introduce the idea of characterizing information content using entropy, not to explore the problem of defining an approximation measure for FDs.

Cavallo and Pittarelli [18] also develop an approximation measure. Their measure is the conditional entropy normalized to lie between zero and one. However, the main thrust of [18] was to generalize the relational model to allow for probabilistic instances; they do not explore the problem of defining an approximation measure for FDs. Finally, Dalkilic and Robertson [19] (also [20]) independently discover the idea of applying the Shannon entropy function to measure "information content". They use the conditional entropy to quantify the amount of information the columns of X contain about the columns of Y in T, calling this the information dependency measure. While they make explicit mention of the idea of using the information dependency as an FD approximation measure, they do not explore the idea further.

Interestingly, the authors of [15,17,18] make little or no mention of the potential applicability of entropy as a measure of information content to data mining. This is probably due to the fact that, at the time of their writing (all before 1988), data mining had not yet received the attention that it does today. Dalkilic [20], however, does make explicit mention of the potential applicability to data mining.

Pfahringer and Kramer [21,22] propose a measure for AFDs (which they call partial dependencies) based on the minimum description length principle (MDL). Given X → Y, they define the encoding length of T when X → Y is used to compress T. Essentially, T can be represented as T with the Y columns removed, a table mapping each x ∈ Π_X(T) to a y ∈ Π_Y(T), and a table of exceptions to the mapping rule. Based on entropy,


the encoding length of each of these parts is quantified and summed to produce the encoding length of T based on X → Y.

2.2. Other approaches to defining measures

Piatetsky-Shapiro [23] introduces the concept of a probabilistic data dependency to develop an approximation measure. The measure developed is the same as the τ measure of Goodman and Kruskal [24]. Piatetsky-Shapiro develops a method for examining the significance of probabilistic data dependencies and, as such, touches on the issue of how the measure should be defined (see his Section 3.2). However, he does not examine the fundamental assumptions upon which the measure is based. Goodman and Kruskal [24] give a survey of dependency measures that had appeared in statistics and the social sciences (one of which is τ).

Kivinen and Mannila [5] take a non-probabilistic approach to developing an approximation measure. They propose three measures, all of which are based on counting the number of tuples, or pairs of tuples, that cause the dependency to break. For example, one of the measures proposed is denoted g3 and is defined as the minimum number of tuples that need to be removed from T so that X → Y is an FD, divided by the number of tuples in T. The main thrust of [5], however, is to develop an efficient algorithm that finds, with high probability, all FDs that hold in a given instance. The problem of how best to define an approximation measure is not considered. Huhtala et al. [3] develop an algorithm, TANE, for finding all AFDs whose g3 value is no greater than some user-specified value. Again, though, they do not consider the problem of how best to define an approximation measure.

Simovici et al. [25] develop the notion of purity dependencies. A purity dependency X → Y is an AFD whose measure is defined by quantifying the degree to which the partition of T induced by Π_X(T) is "pure" with respect to that induced by Π_Y(T). They develop a generalized framework for quantifying that degree, one specialization of which results in a measure similar to the information theoretic measures described earlier.

2.3. FD and AFD discovery

Numerous algorithms have been proposed for discovering all FDs that hold in a given table [3–9]. All of these algorithms essentially perform a search through the lattice of subsets of the attributes of T. Huhtala et al. [3] develop an algorithm for finding all AFDs whose approximation degree does not lie above some fixed threshold (setting this threshold to zero yields only FDs). They use the measure g3 introduced in [5]. Their algorithm performs a level-wise search of the attribute subset lattice of T. Lopes et al. [26] develop connections between formal concept analysis and FD and AFD discovery. Based on the concept of "agree sets", they develop an algorithm for computing a minimum cover of the set of FDs that hold in a table (called the Duquenne–Guigues basis). The authors apply the so-called "Luxenburger basis" from formal concept analysis to develop a basis of the set of AFDs in a table T. Unlike Huhtala et al. [3], they define the set of AFDs to be discovered as the set of all X → A (X a subset of attributes and A a single attribute) such that T ⊭ X → A and, for all X ⊂ Y, T ⊨ Y → A (maximal left-hand side excluded dependencies). Associated with each X → A is an approximation measure (Lopes et al. use g3). Pfahringer and Kramer develop an algorithm for finding AFDs with a high degree of encoding compression (described earlier). Their algorithm does not find all highly compressive AFDs.
First they find all FDs (using any algorithm); then, for each FD, they search for highly compressive AFDs whose left-hand side is a subset of the left-hand side of the FD. They claim to find AFDs resulting from the distortion of FDs due to noise. Simovici et al. [25] briefly describe an algorithm for finding purity dependencies. For a fixed Y, their algorithm performs a level-wise search for left-hand sides X such that X → Y has purity


measure below a user-specified threshold. Finally, Calders et al. [27] develop an algorithm for the discovery of AFDs in the presence of roll-up lattices. They generalize FDs and define the concept of a roll-up dependency, which describes functional dependence between attributes at various levels in the roll-up lattice. Their algorithm finds roll-up dependencies whose support and confidence exceed fixed thresholds. They define support as the fraction of tuple pairs that are equivalent with respect to the particular levels in the lattice, and confidence in the standard way. Confidence can be thought of as an AFD measure.

2.4. Other related works

De Bra and Paredaens [28] describe a method for horizontally decomposing a relation instance with respect to an AFD. The result is two instances; on one instance the FD holds perfectly, while the other represents exceptions. The authors go on to develop a normal form with respect to their horizontal decomposition method. They do not consider the problem of defining an AFD measure. Demetrovics et al. [29] study a weaker form of functional dependencies that they call partial dependencies. They examine a number of combinatorial properties of partial dependencies and investigate some questions about how certain related combinatorial structures can be realized in a database with a minimal number of tuples or columns. They do not consider the problem of defining an AFD measure. Demetrovics et al. [30] study the average-case properties of keys and functional dependencies in random databases. They show that the worst-case exponential properties of keys and functional dependencies (e.g. the number of minimal keys can be exponential in the number of attributes) are unlikely to occur. They do not consider the problem of defining an AFD measure.

Sections 3 and 4 develop a measure for AFDs in an axiomatic fashion. First a general definition for AFDs is given, laying the framework for the subsequent axiomatic discussion. Next, based on a set of intuitions, axioms are developed. Finally we prove that the axioms uniquely determine an unnormalized measure, up to a multiplicative constant.

3. FD approximation measures: general definition

Let D be some fixed, countably infinite set that serves as a universal domain. Let I(S, D) be the set of all finite relation tables T over schema S such that for all attributes A ∈ S, Π_A(T) ⊆ D. An approximation measure for X → Y over I(S, D) is a function from I(S, D) to R≥0 (the non-negative reals). Intuitively, the number to which a table T is mapped quantifies the degree to which X → Y holds in T.

The concept of genericity in relational database theory asserts that the behavior of queries should be invariant on the values in the database up to equality [31]. In our setting, genericity implies that the actual values from D used in T are irrelevant for approximation measures, provided that equality is respected. More formally put: given any permutation ρ : D → D, any approximation measure should map T and ρ(T) to the same value, where ρ(T) denotes the instance obtained by replacing each value a in T by ρ(a). Therefore, the only information of relevance needed from T is the attribute value counts.

3.1. Intuition

The degree to which X → Y is approximate in T is the degree to which T determines a function from Π_X(T) to Π_Y(T). Consider the two tables seen in Fig. 1. The one on the left has n tuples (n ≥ 2) while the one on the right has two tuples. Our intuition states that the degree to which A → B is approximate in each of these tables is the degree to which each determines a function from {1} to {1, 2}. We have a choice



Fig. 1. Two tables over schema A, B, C.

between 1 ↦ 1 and 1 ↦ 2. If we were to randomly draw a tuple from the table on the right, there would be an equal likelihood of drawing (1, 1, ·) or (1, 2, ·). Hence the table does not provide any information to decrease our uncertainty in choosing between 1 ↦ 1 and 1 ↦ 2. On the other hand, if we were to randomly draw a tuple from the table on the left, the likelihood of drawing (1, 1, ·) would be (n − 1)/n and the likelihood of (1, 2, ·) would be 1/n. Hence, if n is large, then the table on the left decreases our uncertainty substantially. The tuples of the two tables can be thought of as data points supplying the choice between 1 ↦ 1 and 1 ↦ 2 (e.g. tuple (1, 1, 1) supplies choice 1 ↦ 1). In the next section we unpack our intuition further by developing a set of axioms.

4. Axioms

This section is divided into three subsections. In the first, the boundary case where |Π_X(T)| = 1 is considered and axioms are described that formalize our intuitions. In the second, the general case is considered and one additional axiom is introduced. We close the section with a theorem stating that any approximation measure satisfying the axioms must be equivalent, up to a multiplicative constant, to the dependency measure based on Shannon entropy described in [19].

Some preliminary definitions are needed. For x ∈ Π_X(T), let c_X(x) denote the count of x in X: the number of tuples in T_{X=x}. For y ∈ Π_Y(T), let c_XY(x, y) denote the number of tuples t in T where t[X ∪ Y] = (x, y).

4.1. The |Π_X(T)| = 1 case

The only information of relevance needed from T is the vector [c_Y(y) : y ∈ Π_Y(T)],⁴ or equivalently the frequency vector [f_Y(y) : y ∈ Π_Y(T)], where f_Y(y) = c_Y(y)/|T|. We may therefore think of an approximation measure as a function mapping finite, non-zero, rational probability distributions into R≥0. Let Q(0, 1] denote the set of rational numbers in (0, 1]. Given an integer q ≥ 1, let F_q = {[f_1, …, f_q] ∈ Q(0, 1]^q : Σ_{i=1}^q f_i = 1}. Formally, we may think of an approximation measure as a function from ∪_{q=1}^∞ F_q to R≥0. Equivalently, an approximation measure may be thought of as a family G = {G_q | q = 1, 2, …} of functions G_q : F_q → R≥0; G represents an approximation measure.

⁴ The counts for the values of Π_X(T) are not needed because we are assuming, for the moment, that |Π_X(T)| = 1. In the next subsection, we drop this assumption and the counts for the values of Π_X(T) become important.

Example 1. Recall the approximation measure g3 described in Section 2.2. It can be seen that

  g3(T) = Σ_{x∈Π_X(T)} (c_X(x) − max{c_XY(x, y) : y ∈ Π_Y(T)}) / |T|.


In the case of |Π_X(T)| = 1, we have g3(T) = 1 − max{f_Y(y) : y ∈ Π_Y(T)}. So g3 can be represented as the family of functions G^{g3}, where G^{g3}_q([f_1, …, f_q]) = 1 − max{f_j}. Consider the instance seen on the left in Fig. 1; call this table T′. We have g3(T′) = G^{g3}_2([(n − 1)/n, 1/n]) = 1 − max{(n − 1)/n, 1/n} = 1 − (n − 1)/n = 1/n.

We now develop our axioms. The first axiom, called the Zero Axiom, is based on the observation that when there is only one Y value, X → Y holds. In this case, we require that the measure return zero; formally put: G_1([1]) = 0.

The second axiom, called the Symmetry Axiom, is based on the observation that the order in which the frequencies appear should not affect the measure. Formally stated: for all q ≥ 1 and all 1 ≤ i ≤ j ≤ q, we have G_q([…, f_i, …, f_j, …]) = G_q([…, f_j, …, f_i, …]).

The third axiom concerns the behavior of G on uniform frequency distributions. Consider the two tables seen in Fig. 2. The B column frequency distributions are both uniform. The table on the left has frequencies 1/2, 1/2 while the table on the right has frequencies 1/3, 1/3, 1/3. The degree to which A → B is approximate in the table on the left (right) is the degree to which the table determines a function from {1} to {1, 2} ({1} to {1, 2, 3}). In the table on the left, we have a choice between 1 ↦ 1 and 1 ↦ 2. In the table on the right, we have a choice between 1 ↦ 1, 1 ↦ 2, and 1 ↦ 3. If we were to randomly draw a tuple from each table, in either case, each B value would have an equal likelihood of being drawn. However, the table on the left has fewer mapping choices than the table on the right (1 ↦ 1, 2 vs. 1 ↦ 1, 2, 3). Therefore, A → B is closer to an FD in the table on the left than in the table on the right. Since we assumed that an approximation measure maps a table to zero when the FD holds, a measure should map the table on the left to a number no larger than the table on the right. Formalizing this intuition we have: for all q′ ≥ q ≥ 2, G_{q′}([1/q′, …, 1/q′]) ≥ G_q([1/q, …, 1/q]). This is called the Monotonicity Axiom.

For the fourth axiom, denote the single X value of T as x and denote the Y values as y_1, …, y_q (q ≥ 3). The degree to which X → Y is approximate in T matches the degree of uncertainty we have in making the mapping choice x ↦ y_1, …, y_q. Let us group together the last two choices as G = {y_{q−1}, y_q}. The mapping choice can be broken into two steps: (i) choose between y_1, …, y_{q−2}, G and (ii) choose between the elements of G if G was chosen first. The uncertainty in making the final mapping choice is then the sum of the uncertainties of the choice in each of these steps. Consider step (i). The uncertainty of making this choice is G_{q−1}([f(y_1), …, f(y_{q−2}), f(y_{q−1}) + f(y_q)]). Consider step (ii). If one of y_1, …, y_{q−2} was chosen in step (i), then step (ii) is not necessary (equivalently, step (ii) has zero uncertainty). If G was chosen in step (i), then an element must be chosen from G in step (ii). The

Fig. 2. Two tables over schema A, B, C.


uncertainty of making this choice is G_2([f(y_{q−1})/(f(y_{q−1}) + f(y_q)), f(y_q)/(f(y_{q−1}) + f(y_q))]). However, this choice is made only with probability f(y_{q−1}) + f(y_q). Hence the uncertainty of making the choice in step (ii) is (f(y_{q−1}) + f(y_q)) · G_2([f(y_{q−1})/(f(y_{q−1}) + f(y_q)), f(y_q)/(f(y_{q−1}) + f(y_q))]). Our fourth axiom, called the Grouping Axiom, is: for q ≥ 3,

  G_q([f_1, …, f_q]) = G_{q−1}([f_1, …, f_{q−2}, f_{q−1} + f_q]) + (f_{q−1} + f_q) G_2([f_{q−1}/(f_{q−1} + f_q), f_q/(f_{q−1} + f_q)]).

4.2. The general case

We now drop the assumption that |Π_X(T)| = 1. Consider the instance T′ seen in Fig. 3. The degree to which A → B is approximate in T′ is determined by the uncertainty in making the mapping choice for each A value, namely, 1 ↦ {1, 2} and 2 ↦ {1, 3, 4}. The choice made for the A value 1 should not influence the choice for 2, and vice versa. Hence, the approximation measure on T′ should be determined from the measures on T′_{A=1} and T′_{A=2}. Each of these falls into the |Π| = 1 case. However, there are five tuples with A value 1 and only three with A value 2. So, intuitively, the measure on T′_{A=1} should contribute more to the total measure on T′ than the measure on T′_{A=2}. Indeed, five-eighths of the tuples in T′ contribute to making the choice 1 ↦ {1, 2} while only three-eighths contribute to making the choice 2 ↦ {1, 3, 4}. Hence, we assume that the measure on T′ is the weighted sum of the measures on T′_{A=1} and T′_{A=2}, namely, 5/8 · (measure on T′_{A=1}) + 3/8 · (measure on T′_{A=2}). Put in more general terms, the approximation measure for X → Y in T should be the weighted sum of the measures for each T_{X=x}, x ∈ Π_X(T).

Before we can state our final axiom, we need to generalize the notation from the |Π| = 1 case. In the |Π| = 1 case, G_q was defined on frequency vectors [f_1, …, f_q]. However, with the |Π| = 1 assumption dropped, we need a relative frequency vector for each x ∈ Π_X(T). Given y ∈ Π_Y(T), let f_{Y|X}(y|x) denote the relative frequency of y with respect to x: f_{Y|X}(y|x) = c_XY(x, y)/c_X(x). The relative frequency vector associated with x is [f_{Y|X}(y|x) : y ∈ Π_Y(T_{X=x})]. Notice that Y values that do not appear in any tuple with x are omitted from the relative frequency vector. Moreover, we also need the frequency vector for the X values, [f_X(x) : x ∈ Π_X(T)]. Let Π_X(T) = {x_1, …, x_p} and |Π_Y(T_{X=x_i})| = q_i. G must be generalized to operate on the X frequency vector [f_X(x_1), …, f_X(x_p)] and the relative frequency vectors for Y associated with each X value, [f_{Y|X}(y|x_i) : y ∈ Π_Y(T_{X=x_i})].

The next set of definitions makes precise the declaration of G. Given integers p, q_1, …, q_p ≥ 1, let Q(0, 1]^{p; q_1, …, q_p} denote Q(0, 1]^p × (∏_{i=1}^p Q(0, 1]^{q_i}). Let F_{p; q_1, …, q_p} = {([f_1, …, f_p], [f_{1|1}, …, f_{q_1|1}], …, [f_{1|p}, …, f_{q_p|p}]) ∈ Q(0, 1]^{p; q_1, …, q_p} : Σ_{j=1}^{q_i} f_{j|i} = 1 for each i and Σ_{i=1}^p f_i = 1}. An approximation measure is a family G = {G_{p; q_1, …, q_p} : p, q_1, …, q_p = 1, 2, …} of functions G_{p; q_1, …, q_p} : F_{p; q_1, …, q_p} → R≥0. Our final axiom, called the Weighted Sum Axiom, is: for all p ≥ 2 and q_1, …, q_p ≥ 1,

  G_{p; q_1, …, q_p}([f_1, …, f_p], [f_{1|1}, …, f_{q_1|1}], …, [f_{1|p}, …, f_{q_p|p}]) = Σ_{i=1}^p f_i G_{q_i}([f_{1|i}, …, f_{q_i|i}]).
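The axioms can be checked numerically against the measure that Theorem 1 below singles out (Shannon entropy, up to the constant G_2([1/2, 1/2])). The following sketch, with made-up frequency vectors, verifies the Grouping and Weighted Sum axioms for that candidate; it is an illustration, not part of the original development, and the Fig. 3 frequencies used are assumptions since the figure is not reproduced here.

```python
import math

def G(freqs):
    """Shannon entropy of a frequency vector; the candidate G_q."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

# Grouping axiom: G_q(f) = G_{q-1}(f_1,...,f_{q-2}, s) + s * G_2(f_{q-1}/s, f_q/s),
# where s = f_{q-1} + f_q.
f = [0.5, 0.25, 0.125, 0.125]
s = f[-2] + f[-1]
lhs = G(f)
rhs = G(f[:-2] + [s]) + s * G([f[-2] / s, f[-1] / s])
assert abs(lhs - rhs) < 1e-12

# Weighted Sum axiom: the measure on T is the f_X-weighted sum of the measures
# on the groups T_{X=x}.  Relative frequencies below are assumed for illustration.
fX = [5 / 8, 3 / 8]                                    # frequencies of the X values
fY_given_X = [[3 / 5, 2 / 5], [1 / 3, 1 / 3, 1 / 3]]   # relative Y frequencies per group
print(sum(w * G(v) for w, v in zip(fX, fY_given_X)))   # the unnormalized measure
```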

Fig. 3. Table T′ over schema A, B, C.


FD Approximation Axioms

(1) Zero. G_1([1]) = 0.
(2) Symmetry. For all q ≥ 1 and 1 ≤ i ≤ j ≤ q, G_q([…, f_i, …, f_j, …]) = G_q([…, f_j, …, f_i, …]).
(3) Monotonicity. For all q′ ≥ q ≥ 1, G_{q′}([1/q′, …, 1/q′]) ≥ G_q([1/q, …, 1/q]).
(4) Grouping. For all q ≥ 3, G_q([f_1, …, f_q]) = G_{q−1}([f_1, …, f_{q−2}, f_{q−1} + f_q]) + (f_{q−1} + f_q) G_2([f_{q−1}/(f_{q−1} + f_q), f_q/(f_{q−1} + f_q)]).
(5) Weighted Sum. For all p ≥ 2 and q_1, …, q_p ≥ 1, G_{p; q_1, …, q_p}([f_1, …, f_p], [f_{1|1}, …, f_{q_1|1}], …, [f_{1|p}, …, f_{q_p|p}]) = Σ_{i=1}^p f_i G_{q_i}([f_{1|i}, …, f_{q_i|i}]).

We arrive at our desired theorem.

Theorem 1. Assume G satisfies the FD Approximation Axioms. Then

  G_{p; q_1, …, q_p}([f_X(x_1), …, f_X(x_p)], [f_{Y|X}(y|x_1) : y ∈ Π_Y(T_{X=x_1})], …, [f_{Y|X}(y|x_p) : y ∈ Π_Y(T_{X=x_p})])
    = −c Σ_{x∈Π_X(T)} f_X(x) Σ_{y∈Π_Y(T_{X=x})} f_{Y|X}(y|x) log_2(f_{Y|X}(y|x)),

where c = G_2([1/2, 1/2]) (a non-negative constant).

The information dependency measure of [19] (written H_{X→Y}) is defined as

  H_{X→Y}(T) = −Σ_{x∈Π_X(T)} f_X(x) Σ_{y∈Π_Y(T)} f_{Y|X}(y|x) log_2(f_{Y|X}(y|x))

(assuming that 0 log_2(0) = 0). Theorem 1 shows that if G satisfies the FD Approximation Axioms, then G is equivalent to c·H_{X→Y}(T). Making the additional assumption that G_2([1/2, 1/2]) = 1, the theorem states that the only measure satisfying the axioms is the information dependency measure (InD measure). A proof of Theorem 1 can be found in Appendix A.
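A direct computation of the unnormalized InD measure H_{X→Y}(T) from a table is a short exercise; the sketch below follows the formula above, with the table layout and column names assumed for the example (it is an illustration, not the authors' implementation).

```python
import math
from collections import Counter

def ind_measure(rows, x_cols, y_cols):
    """Unnormalized InD measure H_{X->Y}(T), i.e. the conditional entropy H(Y|X)
    computed from value counts, with 0*log2(0) treated as 0."""
    n = len(rows)
    cx = Counter(tuple(r[c] for c in x_cols) for r in rows)
    cxy = Counter((tuple(r[c] for c in x_cols), tuple(r[c] for c in y_cols)) for r in rows)
    h = 0.0
    for (x, _y), cnt in cxy.items():
        f_y_given_x = cnt / cx[x]
        h -= (cx[x] / n) * f_y_given_x * math.log2(f_y_given_x)
    return h

# The left-hand table of Fig. 1 with n = 8: seven tuples (1, 1, .) and one (1, 2, .).
rows = [{"A": 1, "B": 1, "C": i} for i in range(7)] + [{"A": 1, "B": 2, "C": 7}]
print(ind_measure(rows, ["A"], ["B"]))  # about 0.54, versus 1.0 for the two-tuple table
```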

5. Comparing measures

We developed an AFD measure in the previous section based on a set of axioms (up to a constant multiple). The measure was denoted H_{X→Y} (setting the constant equal to one) and called the InD measure. Our result should not be interpreted to mean that the InD measure is the only reasonable measure. Instead, it implies that any other measure must violate at least one of the axioms. Hence, if a measure is needed for some application and the designers decide to use another measure, then they must accept that the measure they use violates one of the axioms.

Several other measures have been proposed in the database and statistics literature (see [24] for an impressive survey of the statistics and social sciences literature). We compare two of these measures with the InD measure. We first develop analytical bounds on their differences; then we compare the measures on several real datasets. In order to compare measures accurately, each is normalized to lie between zero and one (inclusive) and defined to equal zero exactly when an FD holds. Our purpose is not to provide a comprehensive comparison of all existing measures (there are too many to do so), but rather to compare three measures that have appeared in the database literature in order to understand how they differ. Moreover, our purpose is not to argue that any one measure is universally more appropriate than another. We feel that the appropriateness of a measure depends on the context in which it is used. In Section 6.1 we briefly describe some particular contexts in which certain measures are more appropriate than others.

5.1. The normalized InD measure

The first measure in our comparison is the InD measure H_{X→Y}. This measure can be written as Σ_{x∈Π_X(T)} (c_X(x)/|T|) H_{X=x→Y}(T), where H_{X=x→Y}(T) = −Σ_{y∈Π_Y(T)} (c_XY(x, y)/c_X(x)) log_2(c_XY(x, y)/c_X(x)) (recall we assume that 0 log_2(0) = 0). For each x ∈ Π_X(T), H_{X=x→Y}(T) can be thought of as the uncertainty in Y


conditioned on x. If this uncertainty is zero, then x has only one associated Y value, so x perfectly determines Y. If this uncertainty is maximal (log_2(|Π_Y(T_{X=x})|)) [32], then the Y values associated with x have a uniform frequency distribution, so x yields no information about Y. Intuitively, the InD measure is the amortized uncertainty of Y conditioned on X. Note that the InD measure is exactly the conditional entropy described in [32] (denoted H(Y|X)). It is proven in [19] that H_{X→Y}(T) = 0 if and only if T ⊨ X → Y.

Some useful properties of the InD measure can be proven using the concept of the entropy of a set of attributes X of T. The entropy of X in T (written H_X(T)) is −Σ_{x∈Π_X(T)} (c_X(x)/|T|) log_2(c_X(x)/|T|). Likewise define H_Y(T) and H_XY(T). Intuitively, H_X(T) quantifies the "uncertainty" inherent in the X part of T. If Π_X(T) consists of one value (which is repeated |T| times), then the uncertainty is zero. If each value of Π_X(T) appears with the same frequency, then the uncertainty is maximized and equals log_2(|Π_X(T)|) [32]. It is proven in [32,19] that H_{X→Y}(T) = H_XY(T) − H_X(T) (chain rule) and H_X(T) + H_Y(T) ≥ H_XY(T) ≥ max{H_X(T), H_Y(T)}. Since the entropy is clearly non-negative, H_Y(T) ≥ H_{X→Y}(T) ≥ 0. We define the normalized measure

  IFD(T) = 0 if |Π_Y(T)| = 1, and IFD(T) = H_{X→Y}(T)/H_Y(T) otherwise.

By definition, H_Y(T) ≥ 0 with equality exactly when |Π_Y(T)| = 1; hence the measure is well defined. Moreover, IFD(T) lies between zero and one and equals zero exactly when the FD holds. The next lemma characterizes when IFD(T) reaches one (implying that the measure can attain one). We say that X and Y are independent in T if for every x ∈ Π_X(T) and y ∈ Π_Y(T), it is the case that c_XY(x, y) = c_X(x) c_Y(y)/|T|.

Lemma 1. IFD(T) = 1 if and only if |Π_Y(T)| > 1 and X and Y are independent in T.

Proof. Assume that |Π_Y(T)| > 1 and X, Y are independent. Plugging into the definitions we get H_XY(T) = H_X(T) + H_Y(T). We have H_{X→Y}(T) = H_XY(T) − H_X(T) = H_Y(T). Hence IFD(T) = 1. Conversely, assume that IFD(T) = 1. Then H_XY(T) = H_X(T) + H_Y(T), which implies that X and Y are independent [32, Theorem 2.6.6]. □

The following fact will be useful later.

Lemma 2. If T ⊨ Y → X and |Π_Y(T)| > 1, then IFD(T) = 1 − H_X(T)/H_Y(T).

Proof. Since |Π_Y(T)| > 1, IFD(T) = H_{X→Y}(T)/H_Y(T). Since T ⊨ Y → X, for all x ∈ Π_X(T) and y ∈ Π_Y(T_{X=x}), c_XY(x, y) = c_Y(y). Hence it follows from the definition that H_XY(T) = H_Y(T). The result follows from the fact that H_{X→Y}(T) = H_XY(T) − H_X(T). □
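The normalized measure IFD can be computed from attribute-set entropies via the chain rule H_{X→Y} = H_XY − H_X used above. A minimal sketch follows (example table and column names are assumptions, not data from the paper):

```python
import math
from collections import Counter

def entropy(rows, cols):
    """H_Z(T) for an attribute set Z, computed from value frequencies."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def ifd(rows, x_cols, y_cols):
    """Normalized InD measure: H_{X->Y}(T) / H_Y(T), with H_{X->Y} = H_XY - H_X;
    defined as 0 when Y has a single value."""
    h_y = entropy(rows, y_cols)
    if h_y == 0:
        return 0.0
    return (entropy(rows, x_cols + y_cols) - entropy(rows, x_cols)) / h_y

rows = [{"A": a, "B": b} for a, b in [(1, 1), (1, 1), (1, 2), (2, 3), (2, 3)]]
print(ifd(rows, ["A"], ["B"]))  # 0 would mean A -> B holds exactly
```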

5.2. The τ measure

Another measure was originally proposed by Goodman and Kruskal [24] and later used by Piatetsky-Shapiro [23]. This measure is motivated by the following situation. A tuple t is drawn uniformly from T. The drawer then guesses the value t[Y] in one of two situations: (1) no further information is given about t, and (2) the drawer is told the value t[X] before guessing. Let G_Y denote the guess. The drawer has complete knowledge of the counts {c_X(x)}, {c_XY(x, y)}, and {c_Y(y)} and is assumed to guess according to likelihood. In situation (1), G_Y = y with probability c_Y(y)/|T|. In situation (2), G_Y = y having seen t[X] = x with probability c_XY(x, y)/c_X(x). Let P_1(T) denote the probability that a correct guess is made in


situation (1) and P_2(T) the probability of a correct guess in situation (2). We have

  P_1(T) = P(⋁_y [t[Y] = y and G_Y = y]) = Σ_{y∈Π_Y(T)} P(t[Y] = y) P(G_Y = y) = Σ_y c_Y(y)² / |T|².

The second equality is due to the fact that t was drawn uniformly and not seen before guessing. Further, we have

  P_2(T) = P(⋁_x ⋁_y [t[X] = x, t[Y] = y and G_Y = y having seen t[X]])
         = Σ_x Σ_y P(t[X] = x, t[Y] = y) P(G_Y = y having seen t[X])
         = Σ_x Σ_y c_XY(x, y)² / (|T| c_X(x)).

The inner sum of P_2(T) is maximized to Σ_y c_XY(x, y)²/(|T| c_X(x)) = c_X(x)/|T| exactly when |Π_Y(T_{X=x})| = 1. Thus, P_2(T) reaches its maximum of one exactly when for each x ∈ Π_X(T), |Π_Y(T_{X=x})| = 1 (i.e. T ⊨ X → Y). Thus, P_2(T) − P_1(T) ≤ 1 − P_1(T), with equality exactly when T ⊨ X → Y. Normalizing the difference we have

  τ(T) = 1 if |Π_Y(T)| = 1, and τ(T) = (P_2(T) − P_1(T))/(1 − P_1(T)) otherwise.    (1)

Note that 1 − P_1(T) ≥ 0 with equality exactly when |Π_Y(T)| = 1. Thus the measure is well-defined. This measure was originally proposed by Goodman and Kruskal [24] and later by Piatetsky-Shapiro [23] in a data mining setting. It follows from the paragraph immediately before Eq. (1) that τ(T) equals one exactly when T ⊨ X → Y. To be comparable with IFD, we instead consider the measure τ′(T) = 1 − τ(T). The next lemma shows that τ′ satisfies the desirable properties needed to be comparable to IFD, and characterizes when the measure attains one. A proof can be found in Appendix B.

Lemma 3. The following statements hold:
1. 0 ≤ τ′(T) ≤ 1;
2. τ′(T) = 1 if and only if |Π_Y(T)| > 1 and X, Y are independent in T.
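A sketch of τ′ computed directly from the guessing probabilities P_1 and P_2 defined above (the example table is an assumption for illustration):

```python
from collections import Counter

def tau_prime(rows, x_cols, y_cols):
    """tau'(T) = 1 - tau(T), where tau is built from P1 (Y guessed blind) and
    P2 (Y guessed after seeing X).  Defined as 0 when Y has a single value."""
    n = len(rows)
    cx = Counter(tuple(r[c] for c in x_cols) for r in rows)
    cy = Counter(tuple(r[c] for c in y_cols) for r in rows)
    cxy = Counter((tuple(r[c] for c in x_cols), tuple(r[c] for c in y_cols)) for r in rows)
    p1 = sum((c / n) ** 2 for c in cy.values())
    p2 = sum(c * c / (n * cx[x]) for (x, _y), c in cxy.items())
    if p1 == 1.0:          # |Pi_Y(T)| = 1: the FD holds trivially
        return 0.0
    return 1.0 - (p2 - p1) / (1.0 - p1)

rows = [{"A": a, "B": b} for a, b in [(1, 1), (1, 2), (2, 1), (2, 2)]]
print(tau_prime(rows, ["A"], ["B"]))  # 1.0: A and B are independent here
```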

5.3. The g3 measure

Recall the approximation measure g3 described in Section 2.2 and in Example 1 (proposed in [5]):

  g3(T) = Σ_{x∈Π_X(T)} (c_X(x) − max{c_XY(x, y) : y ∈ Π_Y(T)}) / |T|.


This measure is based on the intuitive idea that the degree to which X → Y is approximate is determined by the minimum number of tuples that must be removed from T for X → Y to become an FD. Note that g3 ranges from 0 to 1 and equals zero exactly when T ⊨ X → Y. However, it can easily be seen that the numerator of g3(T) is bounded above by |T| − |Π_X(T)|. Hence g3(T) cannot attain one. Since we wish to compare g3 with IFD and τ′, g3 must be renormalized so that one is included in its range:

  g3′(T) = Σ_{x∈Π_X(T)} (c_X(x) − max{c_XY(x, y) : y ∈ Π_Y(T)}) / (|T| − |Π_X(T)|).

The next lemma characterizes when g3′ reaches one.

Lemma 4. g3′(T) = 1 if and only if for all x ∈ Π_X(T), |Π_Y(T_{X=x})| = c_X(x).

Proof. It can be seen from the definition that g3′(T) = 1 if and only if for all x ∈ Π_X(T), max{c_XY(x, y) : y ∈ Π_Y(T)} = 1. This holds if and only if for all x ∈ Π_X(T) and y ∈ Π_Y(T_{X=x}), c_XY(x, y) = 1, which, in turn, holds if and only if for all x ∈ Π_X(T), |Π_Y(T_{X=x})| = c_X(x). □
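The renormalized g3′ keeps, for each x, the most frequent associated y value and counts everything else as removable. A minimal sketch (example table assumed; X is assumed not to be a key, so the denominator is positive):

```python
from collections import Counter

def g3_prime(rows, x_cols, y_cols):
    """Renormalized g3: (tuples that must be removed for X -> Y to hold)
    divided by |T| - |Pi_X(T)|."""
    n = len(rows)
    cx = Counter(tuple(r[c] for c in x_cols) for r in rows)
    cxy = Counter((tuple(r[c] for c in x_cols), tuple(r[c] for c in y_cols)) for r in rows)
    keep = {}
    for (x, _y), c in cxy.items():
        keep[x] = max(keep.get(x, 0), c)   # keep the most frequent y per x
    removed = sum(cx[x] - keep[x] for x in cx)
    return removed / (n - len(cx))

rows = [{"A": a, "B": b} for a, b in [(1, 1), (1, 1), (1, 2), (2, 3), (2, 3)]]
print(g3_prime(rows, ["A"], ["B"]))  # 1/3: one tuple out of three removable ones violates A -> B
```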

6. Analytical comparisons

In this section, we develop bounds on g3′ − IFD, g3′ − τ′, and IFD − τ′. Each of these differences is trivially bounded above by one and below by negative one. Our goal is to develop tight bounds, thereby completely characterizing the theoretical differences between the measures.

Consider the five tables T1, …, T5 in Figs. 4 and 5. All tables have two columns A and B and possibly repeated rows. Tables T1, T3, and T5 have n rows; T2 has nk² rows; T4 has nk rows. Tables T1 and T3 are further parameterized by a ≥ 2 and b ∈ [0, 1]. Tables T2, T3, and T4 have extra labels indicating counts. For

Fig. 4. Tables T1 (n rows), T2 (k²n rows), and T3 (n rows), left to right.


Fig. 5. Tables T4 (nk rows) and T5 (n rows), left to right.

Fig. 6. Limiting measure results.

example, in T3 the first n/2 rows are (1, 1) and the next bn/2 rows are (2, 2). Strictly speaking, tables T1, T3, and T5 are not well-defined, as some of their row counts may not be whole numbers (e.g. n^{1/a} and n^{1−1/a} in T1). This problem can be solved by rounding; the analysis to follow is not affected significantly, so we omit rounding for simplicity.

Let Li g3′ denote lim_{n→∞} g3′(Ti); Li IFD and Li τ′ are defined analogously. The results are summarized in Fig. 6. Some of the entries in Fig. 6 are left empty as they are not needed for the comparison. The computation details can be found in Appendix C.

From T1 with a → ∞ and T2 with k → ∞ we see that the trivial bounds 1 ≥ IFD − g3′ ≥ −1 are tight. In other words, IFD can be made larger or smaller than g3′ by the maximum amount. Moreover, from T1 with a → ∞ and T3 with b → 1, we see that the trivial bounds on IFD − τ′ are also tight. From T4 we see that the bound 1 ≥ τ′ − g3′ is tight (τ′ can be made larger than g3′ by as much as possible). T5 shows that any upper bound on g3′ − τ′ must be at least one half. We conjecture that g3′ − τ′ is bounded above by one half, but proving this seems difficult. To gain some insight into the bound, consider g3 − τ′. From


the definition we have g3′(T) = g3(T)·|T|/(|T| − |Π_X(T)|). For |T|/(|T| − |Π_X(T)|) close to one, g3 behaves much like g3′. The next theorem gives a tight upper bound on g3 − τ′; Corollary 1 then gives an upper bound on g3′ − τ′ in terms of |T|/(|T| − |Π_X(T)|).

Theorem 2. For any T, g3(T) − τ′(T) ≤ 0. Moreover, this bound is tight.

A proof of Theorem 2 can be found in Appendix D.

Corollary 1. For any T, g3′(T) − τ′(T) ≤ min{|T|/(|T| − |Π_X(T)|) − 1, 1}.

Proof. By Theorem 2, g3′(T) − τ′(T) + (g3(T) − g3′(T)) ≤ 0. From the definition, g3(T)·|T|/(|T| − |Π_X(T)|) = g3′(T). Thus g3′(T) − τ′(T) + g3(T)(1 − |T|/(|T| − |Π_X(T)|)) ≤ 0. Since g3(T) ≤ 1, we have g3′(T) − τ′(T) ≤ |T|/(|T| − |Π_X(T)|) − 1. □


We believe that in real-world data the ratio |T|/(|T| − |Π_X(T)|) is close to one (for X not a key). In this setting we understand the upper bound on g3′ − τ′ well. Only in cases where the ratio is large do we have difficulty characterizing a tight upper bound. In Section 7 we show, on a collection of five real-world datasets, that the ratio is very close to one.

6.1. Discussion

Our goal in this section was to characterize bounds on g3′ − IFD, g3′ − τ′, and IFD − τ′. We showed that the trivial bounds 1 ≥ IFD − τ′ ≥ −1 and 1 ≥ g3′ − IFD ≥ −1 are tight. Thus, in theory, g3′ can be made larger or smaller than IFD by as much as possible; the same is true for IFD and τ′. We showed that the trivial upper bound 1 ≥ τ′ − g3′ is tight; τ′ can be made larger than g3′ by as much as possible. Finally, we showed that g3′(T) − τ′(T) is bounded above by min{|T|/(|T| − |Π_X(T)|) − 1, 1}, and we gave an example (T5) where, in the limit, g3′ − τ′ is one half. Hence g3′ can be made significantly larger than τ′. Moreover, for |Π_X(T)| small (dominated by |T|), we can completely characterize the limiting upper bound.

Some qualitative observations are in order. All measures equal zero exactly when the FD holds, but their maximizing behavior is different. IFD and τ′ are maximized exactly when A and B are independent, while g3′ is maximized exactly when each x value is associated with |T_{X=x}| distinct y values. Radically different behavior results from this difference (as seen from table T4). Whether the behavior of g3′ or of IFD and τ′ is regarded as anomalous depends on the desired properties of the measures.

In situations where measures are intended to quantify the degree to which knowledge of X determines Y (prediction or classification), we believe that IFD and τ′ are more appropriate. In T4, knowledge of an x value in a tuple tells nothing about the associated y value; hence measures should be at their maximum (as IFD and τ′ are and g3′ is not). Moreover, in T5, knowledge of an x value tells quite a bit about the associated y value, but g3′ is at its maximum while IFD and τ′ are not. An additional interesting observation can be made: L5 IFD = 0 while L5 τ′ = 1/2. Intuitively one would expect both measures to be very small, as knowledge of x reduces the number of choices for an associated y value from n to 2. It seems that IFD is more appropriate here and τ′ is anomalous; the large number of x values appears to be the cause of the anomalous behavior. Moreover, g3′ can be thought of as coarser than IFD and τ′ in the sense that, for each x value, the frequency distribution of the associated y values does not affect g3′ except through its maximum (any two distributions with the same maximum behave the same under g3′). IFD and τ′, however, take the entire y value distribution into account; thus these measures separate tables more finely.

In situations where measures are intended to quantify the number of "error" tuples with respect to an FD, we believe g3′ is more appropriate than IFD and τ′. For example, in a situation where we are interested in normalizing a table by separating off errors into another table, g3′ is more appropriate. This situation is


explored further in [13,14]. Viewing T4 in this light, we expect measures to be very small as there are only k error tuples. g3′(T4) is indeed small, but τ′(T4) and IFD(T4) are large.
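Fig. 5 is not reproduced in this extraction, so the table built below is only a stand-in with the properties the discussion ascribes to T4 — k "error" tuples among nk rows, so g3′ is tiny while IFD and τ′ are large. The helper functions compress the measure sketches from Section 5; the concrete values of n and k are arbitrary.

```python
import math
from collections import Counter

def counts(rows, cols):
    return Counter(tuple(r[c] for c in cols) for r in rows)

def entropy(rows, cols):
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts(rows, cols).values())

def ifd(rows, X, Y):
    hy = entropy(rows, Y)
    return 0.0 if hy == 0 else (entropy(rows, X + Y) - entropy(rows, X)) / hy

def tau_prime(rows, X, Y):
    n, cx, cy, cxy = len(rows), counts(rows, X), counts(rows, Y), counts(rows, X + Y)
    p1 = sum((c / n) ** 2 for c in cy.values())
    p2 = sum(c * c / (n * cx[k[: len(X)]]) for k, c in cxy.items())
    return 0.0 if p1 == 1 else 1 - (p2 - p1) / (1 - p1)

def g3_prime(rows, X, Y):
    n, cx, cxy = len(rows), counts(rows, X), counts(rows, X + Y)
    best = Counter()
    for k, c in cxy.items():
        best[k[: len(X)]] = max(best[k[: len(X)]], c)
    return sum(cx[x] - best[x] for x in cx) / (n - len(cx))

# Stand-in for T4: k groups of n rows each; within every A-group all B values
# but one equal 0, so only k tuples violate A -> B.
n, k = 200, 5
rows = [{"A": a, "B": 0} for a in range(k) for _ in range(n - 1)]
rows += [{"A": a, "B": a + 1} for a in range(k)]
print(g3_prime(rows, ["A"], ["B"]), ifd(rows, ["A"], ["B"]), tau_prime(rows, ["A"], ["B"]))
# g3' is tiny while IFD and tau' are large, as the discussion describes.
```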

7. Empirical comparisons

In this section, we compare the measures empirically on five real-world datasets. The goal is to develop an understanding of how these measures differ in practice; in particular, to determine whether the extreme differences that are obtainable in theory occur in practice. The datasets were stored as tables in Oracle (Version 8.05) and accessed by SQL queries executed from a Java 1.3 application using the JDBC thin client interface provided by Oracle.

The Employee dataset consisted of 63142 rows and 5 columns. The column domain sizes ranged from 2 (SEX) to 545 (YEARSOFSERVICE). The dataset came from the personnel data of a company [33]. Another dataset we looked at was IPUMS Census data [34]. In all experiments we only used the first 59999 rows.⁵ We then split the dataset into two datasets, Census Sm. and Census Lg., by projecting out columns. Census Sm. consisted of eight columns: AHISPAN, ALANG, ASCHOOL, CLASS, ENGLISH, LOOKING, MOBILITY, YEARSCHOOL. These columns all have small domain sizes (2–8). Census Lg. consisted of eight columns: AGE, ANCSTRY1, HOURS, INDUSTRY, MEANS, PERSCARE, RACE, WORKLWK. These columns have large domain sizes (91–179). The COIL dataset [35] consisted of 4000 rows and contains information taken from a European insurance company. We projected out all columns with domain size one or 4000 (e.g. keys), resulting in 84 columns with domain sizes ranging from 2 to 40. The INTERNET dataset [36] consisted of 10103 rows and 70 columns with domain sizes ranging from 2 to 120. This dataset contains information gathered from an Internet usage survey conducted in 1997.

For each dataset we computed the differences g3′ − IFD, g3′ − τ′, and IFD − τ′ for each pair of attributes. The average, maximum, minimum, and standard deviation were computed for each difference. These results are reported in Figs. 7–9.

g3′ − IFD. The average in all but the EMPLOYEE and Census Lg. datasets was less than 0.59, with small relative standard deviations (31, 54, and 29 percent of the average for Census Sm., COIL, and INTERNET, respectively).⁶ We conclude that IFD tends to be smaller than g3′ on several of the datasets tested. Looking at the maximum and minimum, the difference never exceeds 0.35, but nearly reaches −1. We conclude that the theoretically possible positive extreme (one) was not observed in our experiments; a large difference (0.65) exists between the observed maximum and the theoretical upper limit. However, the theoretically possible lower limit (−1) was very nearly observed.

g3′ − τ′. Similar to g3′ − IFD, the average in all but the EMPLOYEE and Census Lg. datasets was less than 0.61, with small relative standard deviations (31, 54, and 34 percent of the average for Census Sm., COIL, and INTERNET, respectively). We conclude that τ′ tends to be smaller than g3′ on several of the datasets tested. Looking at the maximum and minimum, the difference never exceeds 0.009 (INTERNET), but nearly reaches −1. Note that the ratio |T|/(|T| − |Π_X(T)|) is very close to one (maximum of 1.01); from Corollary 1 we can conclude that, on the datasets we used, g3′ − τ′ will never exceed 0.01. We see that the difference comes close to achieving both 0.01 and −1, so the theoretical extremes were very nearly observed.

IFD − τ′. The averages tend to be close to zero. On all datasets except EMPLOYEE, the averages lie within 60 percent of one standard deviation of zero. For EMPLOYEE the average lies within 106 percent of one standard deviation of zero.

⁵ Gathered by selecting columns from the dataset where personID < 60000.
⁶ We are not taking a random sample from a population, so statistical significance tests are not applicable. We report the standard deviation merely as a descriptive statistic.


Fig. 7. Average, maximum, minimum, and standard deviation of g3′ − IFD over all attribute pairs, per dataset (Employee, Census Sm., Census Lg., COIL, INTERNET).

Fig. 8. Average, maximum, minimum, and standard deviation of g3′ − τ′ over all attribute pairs, per dataset (Employee, Census Sm., Census Lg., COIL, INTERNET).

We conclude that IFD and τ′ tend to be very similar on the datasets tested. Looking at the maximum and minimum, the difference never exceeds 0.132 (INTERNET) or falls below −0.39 (COIL). Hence the theoretical upper and lower bounds are far from being observed.
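For reference, a sketch of the per-pair computation described at the start of this section, assuming the table is available as a pandas DataFrame (the original study read the tables from Oracle via JDBC); the file name is hypothetical and no column is assumed to be a key.

```python
import itertools
import math
import pandas as pd

def entropy(series):
    p = series.value_counts(normalize=True)
    return float(-(p * p.map(math.log2)).sum())

def measures(df, x, y):
    """Return (g3', IFD, tau') for the single-column pair x -> y."""
    n = len(df)
    cx = df[x].value_counts()
    cxy = df.groupby([x, y]).size()
    g3p = (n - cxy.groupby(level=0).max().sum()) / (n - df[x].nunique())
    hy = entropy(df[y])
    ifd = 0.0 if hy == 0 else (entropy(df[[x, y]].apply(tuple, axis=1)) - entropy(df[x])) / hy
    p1 = float(((df[y].value_counts() / n) ** 2).sum())
    p2 = float((cxy ** 2 / (n * cx.reindex(cxy.index.get_level_values(0)).values)).sum())
    tau_p = 0.0 if p1 == 1 else 1 - (p2 - p1) / (1 - p1)
    return g3p, ifd, tau_p

df = pd.read_csv("census_sm.csv")          # hypothetical file name
diffs = []
for x, y in itertools.permutations(df.columns, 2):
    g3p, ifd, tau_p = measures(df, x, y)
    diffs.append({"g3-IFD": g3p - ifd, "g3-tau": g3p - tau_p, "IFD-tau": ifd - tau_p})
print(pd.DataFrame(diffs).agg(["mean", "max", "min", "std"]))
```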


Fig. 9. Average, maximum, minimum, and standard deviation of IFD − τ′ over all attribute pairs, per dataset (Employee, Census Sm., Census Lg., COIL, INTERNET).

7.1. Discussion

Our goal in this section was to develop an understanding of how the three measures differ in practice; in particular, to determine whether the extreme differences that are obtainable in theory occur in practice. Only IFD and τ′ were observed to be quite similar (although considerably large differences were observed in extreme cases). IFD and τ′ were observed to be considerably smaller than g3′, in most cases, on several datasets. This makes sense, since τ′ and IFD were defined based on similar notions of the amount of information X contains about Y, while g3′ was defined on the notion of the number of tuples that violate the FD. It is interesting to note that, despite the similarities in their definitions, IFD and τ′ still demonstrated considerable differences in some cases.

The theoretical bounds on g3′ − IFD and IFD − τ′ were not observed. The difference between the theoretical bounds and the observed results was particularly acute for IFD − τ′. However, the bounds on g3′ − τ′ were observed. We conclude that much of the extreme theoretical difference for g3′ − IFD and IFD − τ′ is due to unusual situations which are not likely to bear themselves out in practice.

Finally, the differences observed between Census Sm. and Census Lg. are interesting because they highlight the effect of attribute domain sizes. For g3′ − IFD and g3′ − τ′ we see that small domains tend to produce larger differences. For IFD − τ′ the opposite effect is observed, but it is very small (the results on small and large domains are very close).

8. Conclusions and future work

8.1. Conclusions

The primary purpose of this paper was to develop a deeper understanding of the concept of FD approximation degree. We first developed a set of axioms based on the following intuition: the degree to which X → Y holds in T is the degree to which T determines a function from Π_X(T) to Π_Y(T). We proved


that a unique unnormalized measure (up to a constant multiple) satisfies the axioms, namely, the InD measure of [19]. Care must be taken in how this result is interpreted. We do not think it should be taken to imply that the information dependency measure is the only reasonable approximation measure. Other approximation measures may be reasonable as well; in fact, the determination of the "reasonability" of a measure is subjective (like the determination of the "interestingness" of rules in KDD). The way to interpret the result is as follows: it implies that measures other than information dependencies must violate one of the FD Approximation Axioms. Hence, if a measure is needed for some application and the designers decide to use another measure, then they must accept that the measure they use violates one of the axioms.

Other measures have been proposed in the literature. We compared the (normalized) InD measure with two other (normalized) measures. We first developed analytical bounds on the measure differences, then compared the measures empirically on five real datasets. In theory, the measures can all be made radically different. However, looking at the maximum and minimum differences between measures, with the exception of g3′ and τ′, the extreme theoretically possible differences did not bear themselves out in practice. Looking at the average differences, IFD and τ′ in practice seem to be rather similar (which makes sense, as they are defined from similar intuitions). IFD and τ′ seem to be considerably smaller than g3′ (but not nearly as much as allowed in theory).

We posit that in situations where measures are intended to quantify the degree to which knowledge of X determines Y (prediction or classification), IFD and τ′ are more appropriate than g3′. For example, in T4 knowledge of an x value in a tuple tells nothing about the associated y value, so measures should be at their maximum (IFD and τ′ are and g3′ is not). In situations where measures are intended to quantify the number of "error" tuples with respect to an FD, we posit that g3′ is more appropriate than IFD and τ′. For example, in a situation where we are interested in normalizing a table by separating off errors into another table, g3′ is more appropriate. Viewing T4 in this light, we expect measures to be very small as there are only k error tuples; g3′(T4) is indeed small, but τ′(T4) and IFD(T4) are large.

8.2. Future work

Three directions for future work come to mind. The first is to examine other situations where an axiomatic approach could be enlightening: for example, the axiomatic development of approximation measures for multivalued dependencies and inclusion dependencies. Another example is the axiomatic development of measures for quantifying the "interestingness" of generalizations (summaries) of columns in relation instances (see [37] and the citations contained therein). The basic idea of this work is that it is often desirable to generalize a column along pre-specified taxonomic hierarchies; each generalization forms a different data set. There are often a large number of ways that a column can be generalized along a hierarchy (e.g. levels of granularity). Moreover, if there are many available hierarchies, then the number of possible generalizations increases yet further, and can become quite large. Finding the right generalization can significantly improve the gleaning of useful information out of a column.
A common approach to addressing this problem is to develop a measure of the interestingness of generalizations and use it to rank them. Such an approach typically bases the interestingness of a generalization on the diversity of its frequency distribution. No work has been done on taking an axiomatic approach to defining a diversity measure.

A second direction lies in the discovery of AFDs in large datasets. Some approaches have been developed for AFD discovery (see Section 2), but these approaches are not likely to scale well (particularly with the number of attributes). In particular, the discovery of AFDs in market basket databases presents a whole new set of challenges, because these datasets have huge numbers of attributes and rows. Because AFDs can


describe patterns not expressible by association rules, their discovery is interesting. However, AFD discovery seems more complicated than association rule discovery.

A third direction is to develop finer analytical differences between the measures compared in this paper. The measures are theoretically different to the maximum degree (except g3′ and τ′), but many such radical differences do not seem to bear themselves out in practice. This is likely due to the fact that most of the bounding analysis depended on the construction of unusual tables that are unlikely to occur in practice. For example, consider tables T1 and T5 from Figs. 4 and 5. The domain size of A is n^{1−1/a} and n/2, respectively, and the domain size of B is n (for both). Such enormous domains are unlikely for many types of columns. An interesting direction for future work is to limit the number of domain elements of A or B (or both) and rigorously prove bounds. For example, if we limit the number of A elements to two (but do not limit the number of B elements or the number of rows), do the bounds for g3 − IFD change? Such analysis may shed light on the experimental results we observed.

In conclusion, we believe that the problem of defining an FD approximation measure is interesting and difficult. Moreover, we feel that the study of AFDs more generally is worthy of greater consideration in the KDD community.

Acknowledgements The authors thank the following people (in no particular order): Dirk Van Gucht, Jan Paredaens, Marc Gyssens, Memo Dalkilic, and Dennis Groth. The authors also thank a reviewer who pointed out several related works to consider and another reviewer whose comments greatly improved the part of the paper comparing the three measures.

Appendix A. Proof of Theorem 1

First we prove the result for the |Π_X(T)| = 1 case. Namely, we prove the following proposition (the general result then follows by the Weighted Sum axiom).

Proposition A.1. Assume G satisfies the FD Approximation Axioms. For all q ≥ 1, G_q([f_1, …, f_q]) is of the form −G_2([1/2, 1/2]) Σ_{j=1}^q f_j log_2(f_j).

The case q = 1 follows directly from the Zero axiom, so we now prove the proposition for q ≥ 2. The proof is very similar to that of Theorem 1.2.1 in [38]; however, for the sake of being self-contained, we include our proof here. We show four lemmas, the fourth of which serves as the base case of a straightforward induction proof of the proposition on q ≥ 2.

Lemma A.1. For all q ≥ 2, G_q([1/q, …, 1/q]) = Σ_{i=2}^q (i/q) G_2([1/i, (i − 1)/i]).

Proof. Apply the Grouping axiom q − 2 times. □

Lemma A.2. For all q ≥ 2 and k ≥ 1, G_{q^k}([1/q^k, …, 1/q^k]) = k G_q([1/q, …, 1/q]).

Proof. Let q ≥ 2. We prove the desired result by induction on k ≥ 1. In the base case (k = 1), the result follows trivially. Consider now the induction case (k ≥ 2). By q − 1 applications of Grouping followed by an

ARTICLE IN PRESS 502

C. Giannella, E. Robertson / Information Systems 29 (2004) 483–507

application of Symmetry we have     1 1 1 1 1 Gqk ; y; ; ; y; ¼ G k q ðq 1Þ qk qk qk 1 qk qk     q X i 1 i 1 ; þ G : k 2 q i i i¼2

ðA:1Þ

Repeating the reasoning that arrived at Eq. (A.1) qk 1 1 more times, we have       q X 1 1 1 1 i 1 i 1 k 1 ; ; y; k ; y; k 1 G2 Gqk ¼ Gqk ðqk 1 Þðq 1Þ þ ðq Þ qk q qk 1 q qk i i i¼2     q X 1 1 i 1 i 1 G2 ; ¼ Gqk 1 ; y; k 1 : þ qk 1 q q i i i¼2 By Lemma A.1, we have       1 1 1 1 1 1 ; y; Gqk ; y; ; y; : ¼ G þ G k 1 q q qk qk qk 1 qk 1 q q So, by induction, we have       1 1 1 1 1 1 ; y; ; y; Gqk ; y; þ G ¼ ðk 1ÞG q q qk qk q q q q   1 1 ; y; ¼ kGq : & q q

Lemma A.3. For all qX2; Gq ð½1=q; y; 1=qÞ ¼ G2

1 1 2; 2 log2 ðqÞ:

Pq Proof. Let qX2: Assume Gq ð½1=q; y; 1=qÞ 1¼10: Then by Lemma 1, 0 ¼ i¼2 ði=qÞG2 ð½1=i; ði 1Þ=iÞ: Since G2 is non-negative by definition, then G2 2; 2 ¼ 0; so, the desired result holds. Assume henceforth that Gq ð½1=q; y; 1=qÞ > 0: For any integer rX1; there exists integer kX1 such that qk p2r pqkþ1 : Therefore, k=rp1=log2 ðqÞpðk þ 1Þ=r: Moreover, by the Monotonicity axiom, we have Gqk ðyÞpG2r ðyÞpGqkþ1 ðyÞ: So, by Lemma 6, k r pG2 ðyÞ=Gq ðyÞpðk þ 1Þ=r: Therefore, jG2 ðyÞ=Gq ðyÞ 1=log2 ðqÞjp1=r: Letting r-N; we have G2 ðyÞ=Gq ðyÞ ¼ 1=log2 ðqÞ: So, Gq ðyÞ ¼ G2 ðyÞlog2 ðqÞ; as desired. & Lemma A.4. For any pAQð0; 1Þ; G2 ð½p; 1 pÞ ¼ G2

1 1 2; 2 ½p log2 ðpÞ þ ð1 pÞ log2 ð1 pÞ:

     Proof. We shall show for all integers s > rX1; G2 rs; 1 rs ¼ G2 12; 12 ½ðr=sÞ log2 ðr=sÞ þ ð1 r=sÞ log2 ð1 r=sÞ: Let s > rX1: If s ¼ 2; then the result holds trivially, so, assume sX3: By r 1 applications of Grouping, followed by single application of Symmetry, followed by another s r 1 applications of Grouping we have       s r r hr s ri X X 1 1 i 1 i 1 i 1 i 1 ; y; ; G2 ; G2 ; Gs þ ¼ G2 þ s s s s s i i s i i i¼2 i¼2     s r r hr s ri s r X i 1 i 1 rX i 1 i 1 ; G2 ; G2 ; ¼ G2 þ þ : s s s i¼2 s r i i s i¼2 r i i

ARTICLE IN PRESS C. Giannella, E. Robertson / Information Systems 29 (2004) 483–507

By Lemmas A.1 and A.3, we have   hr s ri s r 1 1 1 1 ; y; ; G2 ; Gs ¼ G2 þ log2 ðs rÞ s s s s s 2 2    r 1 1 þ G2 ; log2 ðrÞ: s 2 2

503

ðA:2Þ

By Lemma A.3, again, we have     1 1 1 1 Gs ; y; ¼ G2 ; log2 ðsÞ: s s 2 2 From Eq. (A.2), it follows that  h i hr s ri 1 1 s r r ; log2 ðs rÞ þ log2 ðrÞ log2 ðsÞ G2 ; ¼ G2 s s 2 2 s s hr  s r i r s r ¼ G2 ðyÞ log2 ðrÞ log2 ðsÞ þ log2 ðs rÞ log2 ðsÞ s hr s r s  s r i r ¼ G2 ðyÞ log2 þ 1 log2 1 : & s s s s Now we prove the proposition by induction on qX2: The base case of q ¼ 2 follows directly from Lemma A.4. Consider now the induction case of qX3: By Grouping we have Gq ð½ f1 ; y; fq Þ ¼ Gq 1 ð½ f1 ; y; fq 2 ; fq 1 þ fq Þ   fq 1 fq ; þ ðfq 1 þ fq ÞG2 : fq 1 þ fq fq 1 þ fq Now we apply the induction assumption to both terms in the right-hand side and get   X  1 1 q 2 ; Gq ð½ f1 ; y; fq Þ ¼ G2 f log ðf Þ þ ðf þ f Þ log ðf þ f Þ i i q 1 q q 1 q 2 2 i¼1 2 2        fq 1 fq 1 fq 1 fq 1 1 1 ðfq 1 þ fq ÞG2 ; log2 log2 þ 2 2 fq 1 þ fq fq 1 þ fq fq 1 þ fq fq 1 þ fq " #    X q 2 1 1 ; ¼ G2 fi log2 ðfi Þ þ fq 1 log2 ðfq 1 Þ þ fq log2 ðfq Þ 2 2 i¼1   X q 1 1 ¼ G2 ; fi log2 ðfi Þ: & 2 2 i¼1
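To make the Grouping recursion that drives the proof concrete, the following short Python sketch (ours, not part of the original paper; the function names are made up) numerically checks the identity of Lemma A.1 for the Shannon-entropy instance $G_q([f_1, \ldots, f_q]) = -\sum_j f_j \log_2(f_j)$, i.e. taking $G_2([\tfrac{1}{2}, \tfrac{1}{2}]) = 1$.

```python
import math

def G(freqs):
    # Shannon entropy in bits: the (normalized) form that Proposition A.1
    # shows is forced by the FD Approximation Axioms, with G_2([1/2,1/2]) = 1.
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def lemma_A1_rhs(q):
    # Right-hand side of Lemma A.1: sum_{i=2}^{q} (i/q) * G_2([1/i, (i-1)/i]).
    return sum((i / q) * G([1 / i, (i - 1) / i]) for i in range(2, q + 1))

for q in range(2, 12):
    lhs = G([1 / q] * q)          # G_q([1/q, ..., 1/q]) = log2(q)
    rhs = lemma_A1_rhs(q)
    assert abs(lhs - rhs) < 1e-9, (q, lhs, rhs)
print("Lemma A.1 verified numerically for q = 2..11")
```

The right-hand side telescopes to $\tfrac{1}{q}\sum_{i=2}^{q}[i\log_2(i) - (i-1)\log_2(i-1)] = \log_2(q)$, which is exactly what the assertion confirms.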

Appendix B. Proof of Lemma 3

$\tau'(T) \geq 0$ follows from the paragraph immediately above Eq. (1). To show $\tau'(T) \leq 1$ and statement 2, we reorder sums. Let the elements of $\pi_X(T)$ be $x_1, \ldots, x_n$ and the elements of $\pi_Y(T)$ be $y_1, \ldots, y_m$. We have

$$P_2(T) - P_1(T) = \frac{1}{|T|^2}\left[\sum_{i=1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)^2\, |T|}{c_X(x_i)} - \sum_{j=1}^{m}\left(\sum_{i=1}^{n} c_{XY}(x_i, y_j)\right)^2\right]. \quad (\mathrm{B.1})$$

The expression inside $[\cdots]$ equals

$$\sum_{i=1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)^2\, |T|}{c_X(x_i)} - \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{c_X(x_i)\, c_{XY}(x_i, y_j)^2}{c_X(x_i)} - 2\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)\, c_{XY}(x_k, y_j)\, c_X(x_i)\, c_X(x_k)}{c_X(x_i)\, c_X(x_k)}$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)^2\, (|T| - c_X(x_i))}{c_X(x_i)} - 2\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \cdots$$
$$= \sum_{i=2}^{n}\sum_{k=1}^{i-1}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)^2\, c_X(x_k)^2}{c_X(x_i)\, c_X(x_k)} + \sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)^2\, c_X(x_k)^2}{c_X(x_i)\, c_X(x_k)} - 2\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \cdots.$$

The first triple sum can be rewritten as

$$\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_k, y_j)^2\, c_X(x_i)^2}{c_X(x_i)\, c_X(x_k)}.$$

Thus the expression inside $[\cdots]$ in Eq. (B.1) equals

$$\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_k, y_j)^2\, c_X(x_i)^2 + c_{XY}(x_i, y_j)^2\, c_X(x_k)^2}{c_X(x_i)\, c_X(x_k)} - 2\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{c_{XY}(x_i, y_j)\, c_{XY}(x_k, y_j)\, c_X(x_i)\, c_X(x_k)}{c_X(x_i)\, c_X(x_k)}$$
$$= \sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{\left(c_{XY}(x_k, y_j)\, c_X(x_i) - c_{XY}(x_i, y_j)\, c_X(x_k)\right)^2}{c_X(x_i)\, c_X(x_k)}.$$

Thus

$$P_2(T) - P_1(T) = \frac{1}{|T|^2}\left[\sum_{i=1}^{n-1}\sum_{k=i+1}^{n}\sum_{j=1}^{m} \frac{\left(c_{XY}(x_k, y_j)\, c_X(x_i) - c_{XY}(x_i, y_j)\, c_X(x_k)\right)^2}{c_X(x_i)\, c_X(x_k)}\right]. \quad (\mathrm{B.2})$$

Clearly $P_2(T) - P_1(T)$ is non-negative. Moreover, from the definition, we know $1 - P_1(T) \geq 0$. Thus $\tau(T) \geq 0$, so $\tau'(T) \leq 1$.

Now we prove statement 2. Assume $|\pi_Y(T)| > 1$ and $X$, $Y$ are independent. Then by plugging the independence equality into the definition of $P_2$, we get that $P_2(T)$ equals $\sum_{i=1}^{n} c_X(x_i)\sum_{j=1}^{m} c_Y(y_j)^2/|T|^3$. This is clearly equal to $P_1(T)$; hence $\tau'(T) = 1$.

Assume $\tau'(T) = 1$. Then $|\pi_Y(T)| > 1$ and $P_2(T) - P_1(T) = 0$. It follows from Eq. (B.2) that for each $1 \leq i < k \leq n$ and $1 \leq j \leq m$ we have $c_{XY}(x_k, y_j)\, c_X(x_i) = c_{XY}(x_i, y_j)\, c_X(x_k)$. Fix $1 \leq i \leq n$, $1 \leq j \leq m$. We have

$$\frac{c_X(x_i)\, c_Y(y_j)}{|T|} = \frac{\sum_{k=1}^{n} c_X(x_i)\, c_{XY}(x_k, y_j)}{|T|} = \frac{\sum_{k=1}^{n} c_X(x_k)\, c_{XY}(x_i, y_j)}{|T|} = c_{XY}(x_i, y_j).$$

Hence $X$ and $Y$ are independent. $\Box$
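As a sanity check on Lemma 3, here is a minimal Python sketch (ours, not from the paper; the helper names are made up). It computes $P_1$ and $P_2$ directly from a table of $(x, y)$ pairs and, assuming per the derivation in Appendix D that $\tau'(T) = 1 - (P_2(T) - P_1(T))/(1 - P_1(T))$ when $|\pi_Y(T)| > 1$, confirms that $\tau'$ is 1 on a table where $X$ and $Y$ are independent and 0 on a table where $X \to Y$ holds.

```python
from collections import Counter

def tau_prime(rows):
    # rows: list of (x, y) pairs over attributes X and Y.
    # Assumes tau'(T) = 1 - (P2 - P1)/(1 - P1) when |pi_Y(T)| > 1,
    # with P1, P2 as in Eq. (B.1); tau'(T) = 0 when |pi_Y(T)| = 1 (FD holds).
    n = len(rows)
    c_x = Counter(x for x, _ in rows)
    c_y = Counter(y for _, y in rows)
    c_xy = Counter(rows)
    if len(c_y) == 1:
        return 0.0
    P1 = sum(c * c for c in c_y.values()) / (n * n)
    P2 = sum(c * c / (c_x[x] * n) for (x, _), c in c_xy.items())
    return 1.0 - (P2 - P1) / (1.0 - P1)

# X and Y independent: every (x, y) combination appears equally often.
independent = [(x, y) for x in "ab" for y in "cd" for _ in range(3)]
print(tau_prime(independent))   # expect 1.0 (Lemma 3, statement 2)

# X functionally determines Y.
functional = [("a", "c"), ("a", "c"), ("b", "d"), ("b", "d")]
print(tau_prime(functional))    # expect 0.0
```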


Appendix C. Measure computation details

Tables $T_1, \ldots, T_5$ are depicted in Figs. 4 and 5.

Consider $T_1$. By Lemma 4, it follows that $g_3'(T_1) = 1$. Also,

$$\mathit{IFD}(T_1) = \frac{H_{AB}(T_1) - H_A(T_1)}{H_B(T_1)} = \frac{\log_2(n) - \log_2(n^{1-1/a})}{\log_2(n)} = 1 - (1 - 1/a) = 1/a.$$

Finally, we have

$$\tau'(T_1) = 1 - \frac{n^{1/a}\, n^{1-1/a}/(n^{1/a}\, n) - n/n^2}{1 - n/n^2} = 1 - \frac{n^{1-1/a} - 1}{n - 1},$$

which goes to 1 as $n \to \infty$.

Consider $T_2$. We have $g_3(T_2) = k(n-1)/(nk^2 - k - 1)$, so $\lim_{n\to\infty} g_3(T_2) = 1/k$. To compute $\mathit{IFD}(T_2)$, first observe that $T_2 \models B \to A$ and $|\pi_B(T_2)| > 1$. So by Lemma 2, $\mathit{IFD}(T_2) = 1 - H_A(T_2)/H_B(T_2)$. It can be seen that

$$H_A(T_2) = -\frac{nk(k-1)}{nk^2}\log_2\!\left(\frac{nk(k-1)}{nk^2}\right) - \frac{nk}{nk^2}\log_2\!\left(\frac{n}{nk^2}\right) = -\frac{k-1}{k}\log_2\!\left(\frac{k-1}{k}\right) - \frac{1}{k}\log_2\!\left(\frac{1}{k^2}\right).$$

Also,

$$H_B(T_2) = -\frac{k-1}{k}\log_2\!\left(\frac{k-1}{k}\right) - \frac{1}{k}\log_2\!\left(\frac{1}{nk^2}\right).$$

It follows that $\lim_{n\to\infty} \mathit{IFD}(T_2) = 1$.

Consider $T_3$. To compute $\mathit{IFD}(T_3)$, first observe that $T_3 \models B \to A$ and $|\pi_B(T_3)| > 1$, so by Lemma 2, $\mathit{IFD}(T_3) = 1 - H_A(T_3)/H_B(T_3)$. It can be seen that

$$H_A(T_3) = -\tfrac{1}{2}\log_2\!\left(\tfrac{1}{2}\right) - \tfrac{b}{2}\log_2\!\left(\tfrac{b}{2}\right) - \tfrac{1-b}{2}\log_2\!\left(\tfrac{1-b}{2}\right).$$

Also,

$$H_B(T_3) = -\tfrac{1}{2}\log_2\!\left(\tfrac{1}{2}\right) - \tfrac{b}{2}\log_2\!\left(\tfrac{b}{2}\right) - \tfrac{1-b}{2}\log_2\!\left(\tfrac{1}{n}\right).$$

It follows that $\lim_{n\to\infty} \mathit{IFD}(T_3) = 1$. To compute $\tau'(T_3)$, observe that $P_2(T_3) = \tfrac{1}{2} + \tfrac{b}{2} + \tfrac{1}{n}$ and $P_1(T_3) = \tfrac{1}{4} + \tfrac{b^2}{4} + \tfrac{1-b}{2n}$. Hence

$$\lim_{n\to\infty} \tau'(T_3) = 1 - \frac{1 + 2b - b^2}{3 - b^2}.$$

Consider $T_4$. To compute $g_3'(T_4)$, observe that $g_3'(T_4) = k/(kn - k) = 1/(n-1)$. Thus $\lim_{n\to\infty} g_3'(T_4) = 0$. To compute $\mathit{IFD}(T_4)$ and $\tau'(T_4)$, observe that $A$ and $B$ are independent in $T_4$. Hence by Lemmas 1 and 3, $\mathit{IFD}(T_4) = 1$ and $\tau'(T_4) = 1$.

Consider $T_5$. It follows from Lemma 4 that $g_3'(T_5) = 1$. To compute $\mathit{IFD}(T_5)$, observe that $T_5 \models B \to A$ and $|\pi_B(T_5)| > 1$, so $\mathit{IFD}(T_5) = 1 - H_A(T_5)/H_B(T_5) = 1 - \log_2(n/2)/\log_2(n) \to 0$. To compute $\tau'(T_5)$, observe that $P_2(T_5) = \tfrac{1}{2}$ and $P_1(T_5) = 1/n$. Hence $\lim_{n\to\infty} \tau'(T_5) = \tfrac{1}{2}$.

Appendix D. Proof of Theorem 2

Once the bound has been proven, tightness follows immediately since any $T$ for which $T \models X \to Y$ satisfies $g_3(T) = \tau'(T) = 0$. Now we show the bound. If $|\pi_Y(T)| = 1$, then $T \models X \to Y$, so $\tau'(T) = g_3(T) = 0$. Now assume $|\pi_Y(T)| > 1$. We will show that $\tau'(T) - g_3(T) \geq 0$. For each $x \in \pi_X(T)$, denote $\max\{c_{XY}(x, y) : y \in \pi_Y(T)\}$ by $m_X(x)$. We have

$$\tau'(T) - g_3(T) = 1 - \frac{\sum_{x \in \pi_X(T)}\sum_{y \in \pi_Y(T)} c_{XY}(x, y)^2/(|T|\, c_X(x)) - \sum_{y \in \pi_Y(T)} c_Y(y)^2/|T|^2}{1 - \sum_{y \in \pi_Y(T)} c_Y(y)^2/|T|^2} - \frac{\sum_{x \in \pi_X(T)}\left(c_X(x) - m_X(x)\right)}{|T|}$$
$$= 1 - [\cdots] - \frac{\sum_x c_X(x)}{|T|} + \frac{\sum_x m_X(x)}{|T|} = \frac{\sum_x m_X(x)}{|T|} - [\cdots],$$

where $[\cdots]$ denotes the quotient appearing in the first equality. Multiplying the numerator and denominator of $[\cdots]$ by $|T|$ and combining the terms over a common denominator, we get

$$\tau'(T) - g_3(T) = \frac{\sum_x m_X(x)}{|T|} - \frac{\sum_x\sum_y c_{XY}(x, y)^2/c_X(x) - \sum_y c_Y(y)^2/|T|}{|T| - \sum_y c_Y(y)^2/|T|}$$
$$= \frac{\sum_x m_X(x) - \sum_x\sum_y c_{XY}(x, y)^2/c_X(x)}{|T| - \sum_y c_Y(y)^2/|T|} + \frac{\sum_y c_Y(y)^2/|T| - \left(\sum_x m_X(x)/|T|\right)\sum_y c_Y(y)^2/|T|}{|T| - \sum_y c_Y(y)^2/|T|}.$$

Since $|\pi_Y(T)| > 1$, we have $\sum_y c_Y(y)^2 < |T|^2$, so the denominators in the last equality are positive. Moreover, since $\sum_x m_X(x) \leq |T|$, the second numerator is non-negative. The first numerator is also non-negative:

$$\sum_x\sum_y \frac{c_{XY}(x, y)^2}{c_X(x)} \leq \sum_x \frac{m_X(x)}{c_X(x)}\sum_y c_{XY}(x, y) = \sum_x m_X(x).$$

Hence $\tau'(T) - g_3(T) \geq 0$. $\Box$
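To make the three measures compared in Appendices C and D concrete, here is a small Python sketch (ours, not part of the paper; the function names are made up). It computes $g_3$, $\mathit{IFD}$, and $\tau'$ for the FD $X \to Y$ on a table given as a list of $(x, y)$ pairs, assuming $g_3(T) = (|T| - \sum_x m_X(x))/|T|$, $\mathit{IFD}(T) = (H_{XY}(T) - H_X(T))/H_Y(T)$, and $\tau'(T) = 1 - (P_2 - P_1)/(1 - P_1)$ as used in the appendices, and taking all three measures to be 0 when $|\pi_Y(T)| = 1$ (the FD then holds trivially). On any input it should exhibit the Theorem 2 bound $g_3(T) \leq \tau'(T)$.

```python
import math
from collections import Counter

def measures(rows):
    """Return (g3, IFD, tau') for the FD X -> Y on a table of (x, y) pairs.

    Formulas follow the appendices; the |pi_Y(T)| = 1 convention (all
    measures equal 0) is our assumption for the degenerate case.
    """
    n = len(rows)
    c_x = Counter(x for x, _ in rows)
    c_y = Counter(y for _, y in rows)
    c_xy = Counter(rows)

    def H(counter):
        # Shannon entropy (in bits) of the empirical distribution.
        return -sum((c / n) * math.log2(c / n) for c in counter.values())

    # g3: fraction of rows that must be removed for X -> Y to hold.
    g3 = (n - sum(max(c for (x2, _), c in c_xy.items() if x2 == x)
                  for x in c_x)) / n
    if len(c_y) == 1:
        return g3, 0.0, 0.0
    ifd = (H(c_xy) - H(c_x)) / H(c_y)
    P1 = sum(c * c for c in c_y.values()) / (n * n)
    P2 = sum(c * c / (c_x[x] * n) for (x, _), c in c_xy.items())
    tau_p = 1.0 - (P2 - P1) / (1.0 - P1)
    return g3, ifd, tau_p

# Years of experience "almost" determines salary level: one exceptional row.
T = [(1, "low")] * 4 + [(2, "mid")] * 4 + [(2, "high")]
g3, ifd, tau_p = measures(T)
print(g3, ifd, tau_p)
assert g3 <= tau_p + 1e-12   # Theorem 2: g3(T) <= tau'(T)
```

On this small table $g_3$ is small (only one exceptional row) while $\mathit{IFD}$ and $\tau'$ are larger, which is the kind of disagreement among the measures that the experiments in the body of the paper explore.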

References

[1] D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
[2] R. Ramakrishnan, J. Gehrke, Database Management Systems, 2nd Edition, McGraw Hill, New York, 2000.
[3] Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen, TANE: an efficient algorithm for discovering functional and approximate dependencies, Comput. J. 42 (2) (1999) 100–111.
[4] M. Kantola, H. Mannila, K. Räihä, H. Siirtola, Discovering functional and inclusion dependencies in relational databases, Int. J. Intelligent Systems 7 (1992) 591–607.
[5] J. Kivinen, H. Mannila, Approximate inference of functional dependencies from relations, Theoret. Comput. Sci. 149 (1995) 129–149.
[6] S. Lopes, J. Petit, L. Lakhal, Efficient discovery of functional dependencies and Armstrong relations, in: Proceedings of the Seventh International Conference on Extending Database Technology, Lecture Notes in Computer Science, Vol. 1777, Springer, Berlin, 2000, pp. 350–364.
[7] H. Mannila, K. Räihä, Dependency inference, in: Proceedings of the 13th International Conference on Very Large Databases, 1987, pp. 155–158.
[8] N. Novelli, R. Cicchetti, Functional and embedded dependency inference: a data mining point of view, Inform. Systems 26 (2001) 477–506.
[9] C. Wyss, C. Giannella, E. Robertson, FastFDs: a heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances, in: Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Vol. 2114, Springer, Berlin, 2001, pp. 101–110.
[10] J.-F. Boulicaut, A KDD framework for database audit, in: Proceedings of the Eighth Workshop on Information Technologies and Systems, Helsinki, Finland, 1998.
[11] S. Lopes, DBA Companion: a tool for database analysis, in: Proceedings of the Seventh International Conference on Reverse Engineering Technologies for Information Systems, 2001.
[12] S. Lopes, J. Petit, L. Lakhal, A framework for understanding existing databases, in: Proceedings of the International Database Engineering and Applications Symposium (IDEAS), IEEE Computer Society, Silver Spring, MD, 2001, pp. 330–338.
[13] F. Berzal, J.C. Cubero, F. Cuenca, J.M. Medina, Relational decomposition through partial functional dependencies, Data Knowledge Engineering 43 (2002) 207–234.
[14] C. Giannella, M. Dalkilic, D. Groth, E. Robertson, Improving query evaluation with approximate functional dependency based decompositions, in: Proceedings of the 19th British National Conference on Databases, Lecture Notes in Computer Science, Vol. 2405, Springer, Berlin, 2002, pp. 26–41.
[15] T. Lee, An information-theoretic analysis of relational databases, Part I: data dependencies and information metric, IEEE Trans. Software Eng. SE-13 (10) (1987) 1049–1061.
[16] F. Malvestuto, Statistical treatment of the information content of a database, Inform. Systems 11 (3) (1986) 211–223.
[17] K.K. Nambiar, Some analytic tools for the design of relational database systems, in: Proceedings of the Sixth International Conference on Very Large Databases, 1980, pp. 417–428.


[18] R. Cavallo, M. Pittarelli, The theory of probabilistic databases, in: Proceedings of the 13th International Conference on Very Large Databases, 1987, pp. 71–81.
[19] M. Dalkilic, E. Robertson, Information dependencies, in: Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 245–253.
[20] M. Dalkilic, Establishing the foundations of data mining, Ph.D. Thesis, Department of Computer Science, Indiana University, 2000.
[21] S. Kramer, B. Pfahringer, Efficient search for strong partial determinations, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 371–374.
[22] B. Pfahringer, S. Kramer, Compression based evaluation of partial determinations, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995, pp. 234–239.
[23] G. Piatetsky-Shapiro, Probabilistic data dependencies, in: Proceedings of the ML-92 Workshop on Machine Discovery, Aberdeen, UK, 1992, pp. 11–17.
[24] L. Goodman, W. Kruskal, Measures of associations for cross classifications, J. Amer. Statist. Assoc. 49 (1954) 732–764.
[25] D. Simovici, D. Cristofor, L. Cristofor, Impurity measures in databases, Acta Inform. 38 (5) (2001) 307–324.
[26] S. Lopes, J. Petit, L. Lakhal, Functional and approximate dependency mining: database and FCA points of view, J. Exp. Theoret. Artificial Intelligence 14 (2–3) (2002) 93–114.
[27] T. Calders, R. Ng, J. Wijsen, Searching for dependencies at multiple abstraction levels, ACM Trans. Database Systems 27 (3) (2002) 229–260.
[28] P. De Bra, J. Paredaens, An algorithm for horizontal decompositions, Inform. Process. Lett. 17 (1983) 91–95.
[29] J. Demetrovics, G.O.H. Katona, D. Miklos, Partial dependencies in relational databases and their realization, Discrete Appl. Math. 40 (1992) 127–138.
[30] J. Demetrovics, G.O.H. Katona, D. Miklos, O. Seleznjev, B. Thalheim, Asymptotic properties of keys and functional dependencies in random databases, Theoret. Comput. Sci. 40 (2) (1998) 151–166.
[31] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley, Reading, MA, 1995.
[32] T. Cover, J. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[33] D. Groth, Personal communication, School of Informatics, Indiana University, Bloomington, IN, 2002.
[34] S. Ruggles, M. Sobek, et al., Integrated Public Use Microdata Series: Version 2.0, Historical Census Projects, University of Minnesota, Minneapolis, 1997. Data download: UC Irvine KDD archive (kdd.ics.uci.edu/summary.data.date.html), "IPUMS Census Data" heading.
[35] P. van der Putten, M. van Someren (Eds.), CoIL Challenge 2000: The Insurance Company Case, Sentient Machine Research, Amsterdam. Also Leiden Institute of Advanced Computer Science Technical Report 2000-09, June 22, 2000. Data download: UC Irvine KDD archive (kdd.ics.uci.edu/summary.data.date.html), "The Insurance Company Benchmark (COIL 2000)" heading.
[36] 8th WWW User Survey (Oct. 10, 1997–Nov. 16, 1997), Graphics, Visualization, and Usability Center (GVU), College of Computing, Georgia Institute of Technology. Copyright 1994–1998 Georgia Tech Research Corporation. Data download: UC Irvine KDD archive (kdd.ics.uci.edu/summary.data.date.html), "Internet Usage Data" heading.
[37] R. Hilderman, H. Hamilton, Evaluation of interestingness measures for ranking discovered knowledge, in: Proceedings of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Vol. 2035, 2001, pp. 247–259.
[38] R. Ash, Information Theory, Interscience Publishers, Wiley, New York, 1965.