TOD: Temporal outlier detection by using quasi-functional temporal dependencies

Data & Knowledge Engineering 69 (2010) 619–639

Contents lists available at ScienceDirect

Data & Knowledge Engineering journal homepage: www.elsevier.com/locate/datak

TOD: Temporal outlier detection by using quasi-functional temporal dependencies Giulia Bruno, Paolo Garza * Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy

Article info

Article history:
Received 8 May 2009
Received in revised form 9 February 2010
Accepted 11 February 2010
Available online 1 March 2010

Keywords: Knowledge discovery; Temporal outlier detection; Temporal databases; Temporal association rules; Temporal functional dependencies

Abstract

The problem of detecting outliers has been investigated in several research areas, such as databases, machine learning, knowledge discovery, and logic programming, with the aim of identifying objects of a given population whose behavior differs from that of the other data objects of the dataset. Outliers represent semantically correct situations, albeit infrequent with respect to the majority of cases. Detecting them allows extracting useful and actionable knowledge of interest to domain experts. In this paper, we focus on outlier detection in temporal databases. We propose a method, based on association rules, to infer the normal behavior of objects by extracting frequent rules from a given dataset. To include time information, we define the concept of temporal association rules. Temporal association rules are then combined to generate temporal quasi-functional dependencies, which define relationships among attributes over time that hold frequently. Once such dependencies have been inferred from data, outliers are retrieved with respect to them. Given a temporal quasi-functional dependency, the outliers can be discovered by querying the previously stored temporal association rules. Our method is independent of the considered database and infers the rules used to highlight outliers directly from data. The applicability of the proposed approach is validated through a set of experiments that show its effectiveness and efficiency.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction The analysis of collected data with the aim of detecting implicit information is clearly a fascinating task, which can be complex due to the size of datasets. There are two kinds of interesting knowledge to discover from data sources: frequent trends and outliers with respect to such frequent trends. Both of them can augment the knowledge about a data source. In the database research field the problem of outlier detection has recently drawn increased attention due to its wide applicability in areas such as fraud detection [1], data cleaning [2], and clinical diagnosis [3]. The problem of analyzing outliers is interesting, since outliers represent errors or semantically correct, albeit infrequent, situations. In both cases detecting outliers is a challenging task either to clean errors or discover the meaning of exceptions through a further investigation. A functional dependency [4] is a relationship between database attributes. It states that in each record the value of an attribute is uniquely determined by the values of some other attributes. Functional dependencies are properties of the database schema and describe a priori knowledge that is known at design time. Nevertheless, collected data can hide interesting and previously unknown information on unstated constraints. For example, it happens when data are the result of an integration process of several sources or when they represent dynamic aspects. When functional dependencies are unknown,

* Corresponding author. Tel.: +39 011 564 7178; fax: +39 011 564 7099. E-mail addresses: [email protected] (G. Bruno), [email protected] (P. Garza). 0169-023X/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2010.02.003

inferential algorithms can be used to discover potential functional dependencies from the current instance of the database (e.g., [5,6]). Potential functional dependencies are relations that are satisfied by each tuple of the database. However, since they are inferred from a single instance of the database, they may not hold for all possible database instances. Unlike functional dependencies, approximate and quasi-functional dependencies [5,7] hold on almost all the tuples of a database; only a few tuples do not satisfy them. Such violating tuples represent objects that deviate from the normal behavior. Hence, they can be labeled as outliers.

The construction of temporal databases has attracted substantial interest for several years [8]. Temporal data mining can mine activities rather than just states and can thus infer relations that may indicate a cause-effect association. Temporal databases cover a wide spectrum of applications (e.g., [9,10]). For example, in the medical domain, they can store data related to drug administration to patients [10]. An interesting rule may state that the use of a certain quantity of a drug influences the patient's disease status after a time interval. Email data and stock price data have also been used to extract temporal patterns [11,12]. For example, in the financial context, a rule may state that if the price of a stock has a certain value, then the price of another stock has another value after a certain period of time. Trend relationships can also be discovered: for example, if the price of a stock shows a steep increase, the price of another stock shows a similar trend after 10 min. Detecting outliers with respect to such rules can suggest cases worth investigating, to understand whether they are errors or interesting exceptions.
In this paper, we address the outlier detection problem as part of the data mining process and we introduce a proposal for discovering outliers from temporal databases. Our technique mines temporal association rules (TARs) and merges them to derive temporal quasi-functional dependencies. A temporal quasi-functional dependency (TQFD) is an approximate functional dependency derived from data; it represents an implication among attributes over time that holds frequently in the analyzed dataset. We argue that the data that do not satisfy the TQFD implication represent outliers. Hence, as described in detail in the following sections, we propose an outlier detection approach based on the analysis of quasi-functional dependencies to mine interesting temporal outliers from temporal datasets. The proposed approach (i) infers temporal quasi-functional dependencies from the given dataset, and (ii) detects outliers by selecting the data violating such inferred dependencies.

We make the following contributions:

(1) we give a rigorous definition of temporal association rules (TARs) and temporal quasi-functional dependencies (TQFDs);
(2) we propose an algorithm to mine temporal association rules from data and an approach to infer temporal quasi-functional dependencies from temporal association rules;
(3) we propose a temporal outlier detection algorithm (called TOD) based on the analysis of temporal quasi-functional dependencies and temporal association rules;
(4) we perform an extensive set of experiments to show the effectiveness and efficiency of our approach.

The rest of the paper is organized as follows. In Section 2 we survey related works and place our contribution in context. In Section 3 we provide background knowledge and give formal definitions of temporal association rules and temporal quasi-functional dependencies. In Section 4 we describe how temporal association rules and temporal quasi-functional dependencies can be extracted from data, while in Section 5 we explain the TOD algorithm in detail. Experimental results are reported in Section 6. Finally, Section 7 draws conclusions and presents directions for future work.

2. Related works Statistical approaches were the first methods exploited to extract outliers. The basic component is a probabilistic model that can be either a priori established or automatically derived by analyzed data. Objects that do not suit the probabilistic model are considered as outliers [13–16]. Once a probabilistic model is given or constructed, statistical methods are very efficient. However, they have several disadvantages which make their use in data mining systems inconvenient. Firstly, they strictly require the usage of a data model. If the model is parametrized, complex procedures for finding values they assume are necessary. Furthermore, it is not guaranteed that the data of the observed population match the assumed distribution law if there is no estimate of the distribution density based on the empirical data. Nowadays, the most popular approaches for outlier detection are distance-based methods [1,17]. Outliers are quantitatively characterized by means of a distance function between objects of the database. Although the distance-based methods have the main advantage of not requiring a probabilistic model, the majority of them have a quadratic complexity. Furthermore, when they are applied to real information systems, which contain heterogeneous data with a complex structure, the definition of a distance function between objects is a non-trivial problem. In literature, many examples of distance-based methods to detect outlier exist. In [18] a survey of classical outlier detection techniques is presented. However, the problem of temporal outlier is not considered. A more recent and complete overview of outlier detecting techniques is proposed in [19]. However, the authors do not consider the issue of detecting outliers among relations of time series. In the following, we focus our analysis on temporal data and present the motivation of introducing our algorithm in the context of temporal outlier detection.

In [20] a method that defines outliers as drastic changes in historical trends is applied to vehicle traffic data. The outliers are detected by analyzing the differences between each trend and the average of its neighbor trends. The authors present two examples of outliers detected in the vehicle load of roads. The problem of outlier detection in moving object datasets was also analyzed in [21]. However, unlike [20], the authors of [21] seek outlier objects instead of outlier road segments. In [22] the problem of finding time sequence discords (i.e., subsequences of a longer time series that are maximally different from all the other subsequences) is discussed. A similar issue is discussed in [23], where the authors propose a framework to detect outliers in trajectory data. However, these kinds of analysis are limited to numerical data, since they need numerical attribute values to define temporal trends. Our method, instead, can be exploited for both categorical and numerical data. Furthermore, the presented methods are not applicable to multidimensional databases, i.e., databases with several attributes that vary over time. In such databases, correlations may exist among different attributes, and outliers can be detected with respect to them. Our aim is to find correlations among the values of different attributes that vary over time, and then to extract outliers with respect to these correlations. The problem of analyzing co-evolving time sequences is addressed in [24], where the authors highlight the issue of detecting correlations among sequences that evolve over time (e.g., network management data or sales data). The authors concentrate on discovering correlation patterns among sequences and predicting missing values, but pay less attention to the problem of outliers (i.e., they consider as outliers only values that are outside the range mean + standard deviation). Instead, we concentrate on the problem of detecting outliers among correlations between attribute values.
In some domains online outlier detection algorithms are needed to identify outliers immediately (i.e., real time outlier identification). In [25] an ad-hoc real time approach for sensor data, based on non-parametric models, is proposed. The proposed approach can be used, for example, to identify faulty sensors. In this paper, we do not address the problem of online or real time outlier detection. In literature, different types of data relations similar to functional dependencies have been proposed. Even if the formal names of such relations are different, they have almost the same meaning. In [6] approximate functional dependencies are defined as functional dependencies that almost hold, i.e., with an error lower than a threshold. Hence, approximate dependencies are similar to quasi-functional dependencies [7]. Authors define various measures for the error of a dependency in a relation. In [5] an efficient algorithm to discover approximate functional dependencies is described. However, such works do not consider temporal dependencies. In [26] the authors introduce the notion of pseudo-constraints, which are predicates having significantly few violations. The authors use this pattern to identify rare events in databases. The aim of the work is similar to our main purpose: they define this data mining pattern to detect interesting anomalies. However, our approach differs for two reasons. Firstly, their approach does not detect outliers from temporal databases. Secondly, they define the notion of pseudo-constraint on the Entity-Relationship model, whereas we use association rules to define a quasi-functional constraint. In [27] authors define the conditional functional dependencies. They are a sort of functional dependencies which are conditioned to the specific value of other attributes, i.e., which are valid only for certain values of other attributes. 
For example, in a citizen database, if two tuples have the same Address value, they must also have the same Zip code, but only for a particular City value. This happens when all the tuples with that City value respect the constraint, while some tuples with other City values do not. Thus, there is no functional dependency between Address and Zip code in general; the dependency is conditioned by the value of City. The authors define a method to detect violations of conditional functional dependencies to perform data cleaning. They do not consider the problem of outliers, since they are mainly interested in detecting inconsistencies in data.

The use of quasi-functional dependencies to detect anomalies was first introduced in [28]. The authors present results on protein structure databases, without formalizing the concept of quasi-functional dependency and without defining a general method to retrieve the anomalies. A further step is presented in [7], which introduces the notion of quasi-functional dependency together with a method to retrieve anomalies and to discover their possible nature.

As explained, the discovery of functional dependencies (or attribute correlations [29]) from non-temporal databases is a well-known problem. The methods described above succeed in discovering correlations between attributes and relations close to functional dependencies, but they are limited to "static" databases, i.e., databases without time information. This is a significant limitation, since temporal databases describe the temporal evolution of data, and discovering temporal relations can suggest causality among conditions. In [30] the authors define a uniform notation for temporal functional dependencies through an algebraic framework, and review some previous representations. In our paper, we focus on the definition of a method to directly infer temporal quasi-functional dependencies from data and use them to extract outliers.
On the other hand, a vast literature is present on detecting temporal association rules from data. The problem of discovering associations from data was introduced in [31], followed by successive refinements, generalizations and improvements, among which there is the extension to temporal associations. A review of some methods to detect temporal association rules can be found in [32]. A definition of temporal association rules is provided in [33]. The authors consider objects in a database, and each object has a unique identifier and a set of numerical attributes. The database is viewed as a set of sequences of snapshots and each snapshot has objects consisting of attributes with their numerical values. They provide an algorithm to detect temporal association rules which achieves specific support, density, and strength thresholds. They proposed to use a density-based clustering algorithm to find interval clusters first, then generate temporal association rules with Apriori algorithm. We propose a similar method to extract temporal association rules and then we use them to detect temporal quasi-functional dependencies and outliers.

The temporal association rule mining algorithms discussed above focus on temporal datasets containing point-based events (i.e., events existing at a point in time). The problem of mining associations and patterns from sequences with interval-based events has also been addressed [12,34]. However, we are not interested in this type of rule, since we propose an approach for temporal outlier identification from temporal databases composed of point-based events.

There is another kind of temporal relation, which does not suggest causality but indicates the period in which a rule is valid. For example, in market basket analysis, it can be found that a product is bought together with another product only in the month of January. This extension has been introduced to overcome the problem that rules can have a low support on the entire database, but can be relevant if considered in a specific time interval. For example, in [35] the authors reduce the number of rules presented to the user by eliminating outdated rules and obsolete itemsets as a function of their lifetime, thus reducing the amount of work needed to determine the frequent itemsets. With the concept of time, they consider the rules that have enough support in their lifespan, as long as they also have temporal support. We do not consider this type of rule in our work. Analogously, we do not consider asynchronous periodic patterns [36], since they are not useful to identify temporal (quasi-)functional dependencies.

In order to allow interoperability between different data mining systems, ad hoc standard models and languages have been proposed for storing data mining patterns and models. The Predictive Model Markup Language (PMML) [37] is an XML-based markup language and is usually considered the standard language for storing data mining models and association rules.
Even if we do not consider the problem of model representation, many approaches can be used to store, query, and share data mining models and patterns (e.g., [38]).

3. Temporal association rules and temporal quasi-functional dependencies

In this section, some background knowledge is introduced, including the definitions of relational database, association rules, and functional dependencies. Furthermore, temporal association rules and temporal quasi-functional dependencies are defined.

3.1. Background

In a relational database, a relation corresponds to a table. A table is characterized by a fixed number of attributes (columns) and is composed of a set of tuples (rows). In each tuple, at most one value is assigned to each attribute.

Definition 1. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of attributes and $Dom(d_i)$ the domain of attribute $d_i$. To allow null values, we can assume that the empty set is included in the attribute domain.

Definition 2. Let $R = \{(d_1, v_1), (d_2, v_2), \ldots, (d_n, v_n)\}$ be a set of items, where an item is an attribute-value pair $(d_i, v_i)$ with $d_i \in D$ and $v_i \in Dom(d_i)$. A relational table R is a table containing tuples (or records) t, where $t \subseteq R$ such that $\exists (d_i, v_i)$, $i = 1, 2, \ldots, n$.

Association rule discovery [31] is an important data mining technique, which is commonly used for local pattern detection in unsupervised learning systems. It shows items that occur together in a given dataset; these combinations of items are useful for finding correlations among sets of data. Association rules are defined as follows.

Definition 3. An association rule is a rule in the form $X \rightarrow Y\,[s, c]$, where

(1) $X \subseteq R$, $Y \subseteq R$,
(2) $X \cap Y = \emptyset$,
(3) $s = \frac{N_{xy}}{N_r}$, $c = \frac{N_{xy}}{N_x}$,

where $N_r$ is the number of tuples in the relational database, $N_{xy}$ is the number of tuples that contain the set of items $X \cup Y$, and $N_x$ is the number of tuples that contain X. s and c are respectively called support and confidence of the rule.
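As an illustration of Definition 3, the following minimal Python sketch computes the support and confidence of a rule over a small relation. It assumes each tuple is represented as a frozenset of (attribute, value) items; the function name `support_confidence` and the toy data are hypothetical and only serve the example.

```python
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]  # an (attribute, value) pair, as in Definition 2


def support_confidence(tuples: List[FrozenSet[Item]],
                       x: FrozenSet[Item],
                       y: FrozenSet[Item]) -> Tuple[float, float]:
    """Return (s, c) for the rule X -> Y, following Definition 3."""
    if x & y:
        raise ValueError("X and Y must be disjoint")
    n_r = len(tuples)                                # N_r: all tuples
    n_x = sum(1 for t in tuples if x <= t)           # N_x: tuples containing X
    n_xy = sum(1 for t in tuples if (x | y) <= t)    # N_xy: tuples containing X and Y
    s = n_xy / n_r if n_r else 0.0
    c = n_xy / n_x if n_x else 0.0
    return s, c


# Hypothetical toy relation (illustrative values only).
rows = [
    frozenset({("City", "Torino"), ("Nation", "Italy")}),
    frozenset({("City", "Torino"), ("Nation", "Italy")}),
    frozenset({("City", "Torino"), ("Nation", "France")}),
    frozenset({("City", "Lyon"), ("Nation", "France")}),
]
s, c = support_confidence(rows,
                          frozenset({("City", "Torino")}),
                          frozenset({("Nation", "Italy")}))
```

Here two of the four tuples contain both items, and three contain the antecedent, so the sketch yields s = 2/4 and c = 2/3.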
A functional dependency states that if two rows of a relation agree on the values of a set of attributes $A \subseteq D$, then they must also agree on the values of a set of attributes $B \subseteq D$. The functional dependency is written as $A \Rightarrow B$. Functional dependencies apply to combinations of attributes and must be valid in all the tuples of the relation. For example, consider the relation Buyers (IdBuyer, Name, Address, City, Nation, Age, Product), where IdBuyer is the key attribute. IdBuyer $\Rightarrow$ Name is a functional dependency, because for each row the value of the attribute IdBuyer implies the value of the attribute Name. Key constraints are among the most important of these kinds of dependencies.

Definition 4. Let R be a relational table and $A \subseteq D$, $B \subseteq D$ two sets of attributes. An instance r of R satisfies the functional dependency $A \Rightarrow B$ if the following holds for every pair of tuples $t_1$ and $t_2$ in r:

(1) if $t_1.A = t_2.A$, then $t_1.B = t_2.B$.

In [39], the dependency degree index (p) between two sets of attributes is defined. In particular, given two sets of attributes A and B, and a directed dependency $A \Rightarrow B$, the dependency degree value represents the strength of the dependency

between A and B (i.e., how strongly B is functionally dependent on A). The dependency degree value represents how frequently the dependency holds. In [39], p is computed by combining the support and confidence of the association rules involving A and B.

Definition 5. Let $A \subseteq D$ and $B \subseteq D$ be two sets of attributes. The dependency degree p of a dependency $A \Rightarrow B$ is defined as follows:

(1) $p = \sum_{i \in AR} s_i \cdot c_i$,

where AR is the set of all association rules which involve the sets of attributes A and B, and $s_i$ and $c_i$ are respectively the support and confidence values of each rule.

The dependency degree index p assumes values in the range (0, 1]. As demonstrated in [39], the dependency degree index can be exploited to infer the presence of potential functional dependencies between sets of attributes. If p is equal to one, a potential functional dependency has been mined [39]. If p is lower than one, but higher than a specific threshold, the dependency is said to be a quasi-functional dependency (i.e., it is a dependency that is not always satisfied). Note that p is equal to one if and only if all the association rules that involve the considered attributes have confidence equal to 1 and the sum of all association rule supports is 1 [39].

Definition 6. A quasi-functional dependency is an implication in the form $A \rightsquigarrow B\,[p]$, where

(1) $A \subseteq D$, $B \subseteq D$,
(2) $A \cap B = \emptyset$,
(3) $p < 1$,

where p is the dependency degree.

3.2. Temporal association rules

Since transaction data are often temporal, the problem of discovering temporal relations becomes relevant. For example, when gathering data about products purchased in a supermarket, the time of the purchase is registered in the transaction. Temporal association rules are an extension of the previously defined association rules and also consider the time delay between the antecedent and the consequent. The concepts of database and itemset are extended accordingly.

Definition 7. Let T be a temporal attribute and $Dom(T)$ be the domain of T. A temporal database K is a database containing transactions or tuples in the form (t, E), where $t \in T$ and $E \subseteq R$. The temporal attribute domain is ordinal and can be divided into equal-length intervals, which can be represented by the integers 0, 1, 2, ..., without loss of generality.
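The dependency degree of Definition 5 can be sketched in a few lines of Python. The sketch below assumes that AR contains one rule (A = x) → (B = y) for every combination of values observed in the relation, i.e., that no minimum support or confidence threshold prunes the rule set; with a pruned rule set the sum would be smaller. The function name `dependency_degree` and the toy rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Row = Dict[str, str]


def dependency_degree(rows: List[Row], a: Sequence[str], b: Sequence[str]) -> float:
    """Sum of s_i * c_i over all rules (A = x) -> (B = y) observed in rows."""
    n_r = len(rows)
    if n_r == 0:
        raise ValueError("empty relation")
    # N_x: occurrences of each value combination of the attributes in A.
    n_x = Counter(tuple(r[d] for d in a) for r in rows)
    # N_xy: occurrences of each (A-values, B-values) combination.
    n_xy = Counter((tuple(r[d] for d in a), tuple(r[d] for d in b)) for r in rows)
    # s = N_xy / N_r and c = N_xy / N_x for every observed rule.
    return sum((cnt / n_r) * (cnt / n_x[x]) for (x, _y), cnt in n_xy.items())


# Hypothetical data: IdBuyer determines Name in `exact`, but not in `noisy`.
exact = [{"IdBuyer": "1", "Name": "a"}, {"IdBuyer": "1", "Name": "a"},
         {"IdBuyer": "2", "Name": "b"}, {"IdBuyer": "3", "Name": "c"}]
noisy = [{"IdBuyer": "1", "Name": "a"}, {"IdBuyer": "1", "Name": "a"},
         {"IdBuyer": "1", "Name": "b"}, {"IdBuyer": "2", "Name": "c"}]
p_exact = dependency_degree(exact, ["IdBuyer"], ["Name"])
p_noisy = dependency_degree(noisy, ["IdBuyer"], ["Name"])
```

On the first relation every rule has confidence 1 and the supports sum to 1, so p = 1; on the second, the single violating tuple lowers p to 2/3.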
While a temporal association rule can theoretically span many intervals, discovering all such rules would take up too many resources, and users may not be interested in rules that span more than a certain number of intervals. To avoid spending unnecessary resources, a mining parameter called sliding window size, denoted by w, is introduced. Only the rules that span at most w intervals are mined. A sliding window W in the transaction database is defined as follows.

Definition 8. A sliding window W in a temporal database K is a block of w contiguous intervals along domain T, starting from an interval $t_0$ such that K contains a transaction or a tuple at interval $t_0$.

Every sliding window in the temporal database forms a megatuple, i.e., the union of all the tuples (t, E) where $t \in W$. To distinguish the items in a megatuple from the items in a tuple, they are called extended-items.

Definition 9. Given R and w, the set of extended-items is $R_{ext} = \{(d_1 = v_1)(0), (d_1 = v_2)(1), \ldots, (d_n = v_z)(w-1)\}$.

Definition 10. A temporal association rule is an implication in the form $X_{time} \rightarrow Y_{time}\,[s, c]$, where

(1) $X_{time} \subseteq R_{ext}$,
(2) $Y_{time} \subseteq R_{ext}$,
(3) $X_{time} \cap Y_{time} = \emptyset$,
(4) $\exists (d_i = v_i)(0)$ such that $(d_i = v_i)(0) \in (X_{time} \cup Y_{time})$, $1 \le i \le n$,
(5) $s = \frac{T_{xy}}{N_r}$, $c = \frac{T_{xy}}{T_x}$,

where $T_{xy}$ is the number of megatuples that contain the set of extended-items $X_{time} \cup Y_{time}$ and $T_x$ is the number of megatuples that contain $X_{time}$.

The fourth condition in Definition 10 is necessary to exclude redundant rules which are temporal shifts of other rules. For example, the rule $(d_1 = a)(1) \rightarrow (d_2 = b)(2)$ is a temporal shift of the rule $(d_1 = a)(0) \rightarrow (d_2 = b)(1)$, and thus is redundant and useless.
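The role of the fourth condition can be illustrated with a short Python sketch. It assumes an extended-item is encoded as an (attribute, value, offset) triple and a rule as a pair of frozensets; the helper names `is_canonical` and `canonical_shift` are hypothetical, not taken from the paper.

```python
from typing import FrozenSet, Tuple

ExtItem = Tuple[str, str, int]  # (attribute, value, interval offset in the window)
ItemSet = FrozenSet[ExtItem]


def is_canonical(x: ItemSet, y: ItemSet) -> bool:
    """Condition (4): at least one extended-item of the rule has offset 0."""
    return any(offset == 0 for _, _, offset in x | y)


def canonical_shift(x: ItemSet, y: ItemSet) -> Tuple[ItemSet, ItemSet]:
    """Shift a non-empty rule back in time so its earliest item is at offset 0."""
    delta = min(offset for _, _, offset in x | y)

    def shift(items: ItemSet) -> ItemSet:
        return frozenset((d, v, o - delta) for d, v, o in items)

    return shift(x), shift(y)


# The shifted rule from the text: (d1 = a)(1) -> (d2 = b)(2).
x_shifted = frozenset({("d1", "a", 1)})
y_shifted = frozenset({("d2", "b", 2)})
x0, y0 = canonical_shift(x_shifted, y_shifted)
```

Applied to the example above, the shifted rule is rejected by `is_canonical`, and `canonical_shift` maps it back to $(d_1 = a)(0) \rightarrow (d_2 = b)(1)$, the only representative that needs to be kept.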

3.3. Temporal quasi-functional dependencies

Temporal quasi-functional dependencies can also be defined by extending the previously described quasi-functional dependencies.

Definition 11. Given D and w, the set of extended attributes is

$D_{ext} = \{d_1(0), d_1(1), \ldots, d_1(w-1), \ldots, d_n(0), \ldots, d_n(w-1)\}$,

where $d_i(t)$ represents the attribute $d_i$ at time t.

Definition 12. A temporal quasi-functional dependency is an implication in the form $A_{time} \rightsquigarrow B_{time}\,[p]$, where

(1) $A_{time} \subseteq D_{ext}$,
(2) $B_{time} \subseteq D_{ext}$,
(3) $A_{time} \cap B_{time} = \emptyset$,
(4) $\exists d_i(0)$ such that $d_i(0) \in (A_{time} \cup B_{time})$, $1 \le i \le n$,
(5) $p < 1$,

where p is the dependency degree.

Analogously to Definition 10, the fourth condition in Definition 12 is necessary to exclude redundant and useless dependencies, which are temporal shifts of other dependencies.

4. Mining temporal association rules and quasi-functional dependencies

Our algorithm first extracts the temporal association rules from the database. Then, the rules are combined to detect temporal quasi-functional dependencies. To extract temporal association rules, a time-delay matrix is built first. If the original database contains k tuples, the time-delay matrix is a matrix with $(k - w + 1)$ rows and $(n \times w)$ columns. Each row is a sliding window, and the columns contain the w values of each of the n attributes. For example, if we consider the original relation in Table 1 and relations between two time instants (i.e., w = 2), the time-delay relation reported in Table 2 is obtained. A similar structure is used in [40].

After the time-delay matrix has been built, temporal association rules are extracted by applying a traditional rule mining approach (e.g., Apriori [31], FP-growth [41]) to the time-delay relation. Once temporal association rules have been mined, they are combined to extract temporal quasi-functional dependencies. In particular, for each possible quasi-functional dependency $A_{time} \rightsquigarrow B_{time}\,[p]$, its dependency degree p is computed by applying the formula reported in Definition 5. If p is lower than 1, a temporal quasi-functional dependency has been found.

4.1. Temporal association rule mining: managing numerical values

Algorithms for discovering association rules work well for datasets drawn from a domain of values with no relative meaning (categorical values). However, when applied to continuous data, they yield results that are not very intuitive. Thus, a discretization process is needed.
Discretization is the process of mapping the range of possible values associated with a continuous attribute into a number of intervals, each denoted by a unique integer label, and converting all the values associated with this attribute to the corresponding integer labels [42].

Table 1
Original relation.

Time   d1    d2    ...   dn
t0     v11   v12   ...   v1n
t1     v21   v22   ...   v2n
t2     v31   v32   ...   v3n
...    ...   ...   ...   ...
tj     vj1   vj2   ...   vjn
tk     vk1   vk2   ...   vkn

Table 2
Time-delay relation.

Window       d1(0)   d1(1)   ...   dn(0)   dn(1)
W_1          v11     v21     ...   v1n     v2n
W_2          v21     v31     ...   v2n     v3n
...          ...     ...     ...   ...     ...
W_{k-w+1}    vj1     vk1     ...   vjn     vkn
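The construction of the time-delay relation in Table 2 can be sketched as follows. This is a minimal Python illustration, assuming the input rows are already sorted by time with exactly one tuple per interval (no gaps); the function name `time_delay_relation` and the column naming scheme `d(offset)` are choices made for the example.

```python
from typing import Dict, List


def time_delay_relation(rows: List[Dict[str, str]],
                        attributes: List[str],
                        w: int) -> List[Dict[str, str]]:
    """Build one megatuple per sliding window of w consecutive rows."""
    if w < 1 or w > len(rows):
        raise ValueError("w must be between 1 and the number of rows")
    windows = []
    for start in range(len(rows) - w + 1):      # k - w + 1 windows
        window = {}
        for d in attributes:                    # n attributes ...
            for offset in range(w):             # ... times w offsets
                window[f"{d}({offset})"] = rows[start + offset][d]
        windows.append(window)
    return windows


# Hypothetical relation with k = 3 tuples and n = 2 attributes.
original = [
    {"d1": "v11", "d2": "v12"},
    {"d1": "v21", "d2": "v22"},
    {"d1": "v31", "d2": "v32"},
]
delayed = time_delay_relation(original, ["d1", "d2"], w=2)
```

With k = 3, n = 2 and w = 2 the result has k − w + 1 = 2 rows and n × w = 4 columns, and a standard itemset miner can then be run on it as if it were an ordinary relation.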

The first proposed algorithm [43] is based on the notion of quantitative association rules. Each quantitative attribute domain is partitioned into small intervals, and adjacent intervals are combined into larger ones such that the combined intervals have enough support. Finally, the obtained intervals are mapped to consecutive integers. By replacing each original attribute value with its corresponding integer, the quantitative problem is transformed into a categorical one, leading to a uniform representation.

There are several ways to partition continuous data into discrete values; the best one strongly depends on the domain, and there is no general method to choose it. The choice of the number of bins is also crucial: if intervals are too large, they may hide rules that hold only inside a portion of the interval, and if intervals are too small, they may not get enough support to be considered. To investigate how discretization affects the results, we perform experiments by varying the discretization technique and the number of bins, thus showing how the results change with these parameters.

Among the most popular discretization methods there are (i) equal interval width methods, in which the range of values is divided into sub-ranges of equal extent, (ii) equal frequency methods, in which the range is divided into sub-ranges containing an equal number of values, and (iii) more sophisticated clustering techniques, which identify partitions that maximize inter-cluster distance and minimize intra-cluster distance [44]. For our purpose, we considered equi-width binning and two clustering algorithms (i.e., a partitional one and a hierarchical one). Equi-frequency binning is not appropriate for our aim because it may not work well on highly skewed data: it tends to split adjacent values into separate intervals or to merge distant values into the same bin. Thus, we do not consider it a good discretization method for our purpose.
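Equi-width binning, the simplest of the methods considered above, can be sketched as follows. This is an illustrative Python version under two assumptions that the text does not specify: the range is taken from the observed minimum and maximum, and the maximum value is placed in the last bin. The function name `equal_width_bins` is hypothetical.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each value to an integer label in 0 .. n_bins - 1 (equal-width bins)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values identical, a single bin suffices.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last bin, so clamp it to n_bins - 1.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


labels = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
```

The integer labels replace the original numeric values before the time-delay relation is mined, so the rule miner only ever sees categorical items.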
The experimental results are reported in Section 6.1.3.

Recently, some works have considered fuzzy discretization [42,45–48]. However, the quality of the produced results relies crucially on the appropriateness of the fuzzy sets to the given data, so the fuzzy sets must be consistent with the values of the corresponding attribute. Sometimes the fuzzy sets and the membership functions are defined a priori [45,48]. In [47] a method is proposed for autonomously mining fuzzy sets by means of a clustering algorithm, namely CURE. Then, for each fuzzy set a corresponding membership function is generated. A similar approach has been presented in [42], where the authors exploit fuzzy C-means to determine the cluster centroids, and then define the membership functions according to the centroids. In both approaches, the shape of the membership functions is given a priori: in [47] they are triangular, while in [42] they are trapezoidal. Several other membership function shapes exist in the literature: Gaussian, sigmoidal, polynomial, J-shaped, etc. Combinations of them are also possible. For example, in [49] two types of membership functions are used: a Gaussian membership function, which achieves smoothness for the degree of membership of the normal level, and two sigmoidal membership functions, which can specify asymmetric membership functions for the low and high levels.

Since the definition of appropriate membership functions is a non-trivial problem, we do not exploit fuzzy discretization in the current paper. However, we believe that our method can be adapted to handle fuzziness by following the same process presented in [47]. Then, a fuzzy rule extractor (e.g., AprioriT [50]) can be exploited to extract fuzzy rules, which can be combined to detect quasi-functional (fuzzy) dependencies.

5. The TOD algorithm

The proposed temporal outlier detection algorithm, called TOD, is based on the discovery of temporal association rules and temporal quasi-functional dependencies. The idea is that a temporal quasi-functional dependency t with a high dependency degree value (e.g., 0.99) represents a dependency between two sets of attributes that holds on almost all data. Thus, t is not a potential functional dependency only because there is a small set of violating data. If this set of data is removed, the quasi-functional dependency t becomes a potential functional dependency. We argue that this set of data is composed of outliers. Hence, we propose a temporal outlier detection algorithm based on the following steps:

(1) Extraction of temporal quasi-functional dependencies with a dependency degree value higher than or equal to a user-specified minimum dependency degree threshold.
(2) For each temporal quasi-functional dependency, selection of the minimal set of data that should be removed in order to transform the temporal quasi-functional dependency into a potential temporal functional dependency. The selected data are considered outliers.

The main structure of the TOD algorithm is outlined in Algorithm 1. First, the set of temporal association rules is mined from the temporal database K with a sliding window size equal to w and a maximum length of l (Algorithm 1, line 2). Then, the mined association rules are used to extract temporal quasi-functional dependencies (Algorithm 1, line 3). In particular, for each possible pair of attribute sets A and B, the temporal association rules are used to compute the dependency degree p of Atime → Btime. Only the temporal quasi-functional dependencies with a p greater than or equal to the minimum dependency degree threshold (i.e., Pthreshold) are extracted. Each of the extracted temporal quasi-functional dependencies is considered in order to highlight a set of outliers (Algorithm 1, lines 5–8).
In particular, for each temporal quasi-functional dependency, the temporal association rules which represent outliers are extracted (Algorithm 1, line 6). Then, a scan of the dataset is performed and the data matching one of the selected rules are included in the outlier set (Algorithm 1, line 7). Algorithm 2 shows in more detail how, for each temporal quasi-functional dependency, the set of temporal association rules associated with outlier data is selected (LowConfAssRules). The procedure first selects, for each temporal quasi-functional


dependency Atime → Btime, the set of rules related to it, i.e., those which contain Atime in the body and Btime in the head. Then, it groups the rules by body value (Algorithm 2, line 3) and for each group it computes the maximum confidence value (Algorithm 2, lines 4–5). Finally, it finds the rules with a confidence value lower than the maximum confidence of their group and adds them to the output set of low confidence rules (Algorithm 2, lines 6–8).

Algorithm 1. TOD: temporal outlier detection algorithm
Input: temporal dataset K, sliding window size w, length of extracted quasi-functional dependency l, minimum dependency degree value Pthreshold
Output: set of outliers O
1: /* Extract temporal association rules of length l and discover l-length temporal quasi-functional dependencies analyzing extracted association rules */
2: TARs = mine_temporal_association_rules(K, w, l)
3: TQFDs = mine_temporal_quasi-functional_dependencies(TARs, Pthreshold, l)
4: /* Highlight outliers by considering mined temporal quasi-functional dependencies */
5: for all t in TQFDs do
6:     LowConfAssRules = select_low_confidence_rules(t)
7:     O = O ∪ matching_data(K, LowConfAssRules)
8: end for
9: return O

Algorithm 2. select_low_confidence_rules
Input: temporal quasi-functional dependency t
Output: low confidence association rules ARlow_conf
1: ARlow_conf = ∅
2: /* Create one group of rules for each antecedent */
3: rule_groups = group by antecedent the rules in t.AssociationRules. Create one group for each possible antecedent.
4: for all group g in rule_groups do
5:     max_confidence = max over r ∈ g of (r.confidence)
6:     for all rule r ∈ g do
7:         if r.confidence < max_confidence then
8:             ARlow_conf = ARlow_conf ∪ {r}
9:         end if
10:     end for
11: end for
12: return ARlow_conf
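The selection step of Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the `Rule` type and its field names are assumptions introduced here, and the input rules are the ones listed in Table 5 (support and confidence in percent).

```python
from collections import defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for a temporal association rule."""
    body: str          # antecedent, e.g. "PkL(0)=7"
    head: str          # consequent, e.g. "PkR(1)=8"
    support: float     # percent
    confidence: float  # percent


def select_low_confidence_rules(rules):
    """Return the rules whose confidence is below the maximum of their body group."""
    groups = defaultdict(list)
    for r in rules:                      # one group per antecedent
        groups[r.body].append(r)
    low_conf = []
    for group in groups.values():
        max_confidence = max(r.confidence for r in group)
        low_conf.extend(r for r in group if r.confidence < max_confidence)
    return low_conf


# Rules of Table 5 (PkL(0) in the body, PkR(1) in the head).
table5 = [
    Rule("PkL(0)=4", "PkR(1)=4", 12, 100),
    Rule("PkL(0)=8", "PkR(1)=8", 5, 100),
    Rule("PkL(0)=7", "PkR(1)=7", 6, 90),
    Rule("PkL(0)=7", "PkR(1)=8", 8, 9),
    Rule("PkL(0)=7", "PkR(1)=6", 0.2, 1),
    Rule("PkL(0)=9", "PkR(1)=9", 20, 98),
    Rule("PkL(0)=9", "PkR(1)=10", 1.5, 2),
]
low = select_low_confidence_rules(table5)
low_pairs = [(r.body, r.head) for r in low]
```

On this input the selected rules are the three reported in Table 7. Note that a rule tied with the group maximum is never selected, since the comparison is strict, as in line 7 of Algorithm 2.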

5.1. Example

As an illustrative example, consider the data in Table 3, which reports the number of packets sent (PkS), packets lost (PkL), and packets repeated (PkR) in a network. For convenience, data values have already been discretized. Table 4 shows the corresponding time-delay matrix (with a sliding window of length 2). By analyzing this matrix, the TOD algorithm detects the temporal association rules. Examples of temporal association rules with PkL(0) in the body and PkR(1) in the head are reported in Table 5. The TOD algorithm analyzes the temporal association rules and extracts the temporal quasi-functional dependencies. For example, consider the temporal quasi-functional dependency PkL(0) → PkR(1), which has a dependency degree higher than Pthreshold. It represents the fact that if the amount of lost packets is known, the amount of packets repeated one time unit later is also known. Once the temporal quasi-functional dependencies have been detected, the low confidence association rules are found. To this aim, the rules are first grouped by body value and the maximum confidence value for each group is computed, as shown in Table 6. Then, for each group, the rules with confidence lower than the maximum confidence are added to the low confidence rule set, as shown in Table 7. Data that match at least one rule of the low confidence rule set are labeled as outliers.

Table 3
Example of network data.

Time    Packet sent (PkS)    Packet lost (PkL)    Packet repeated (PkR)
0       6                    4                    0
1       6                    8                    4
2       5                    7                    8
3       7                    7                    7
4       8                    9                    8
5       6                    9                    9
...     ...                  ...                  ...

Table 4
Time-delay matrix with a time window of length 2.

Window    PkS(0)    PkS(1)    PkL(0)    PkL(1)    PkR(0)    PkR(1)
W1        6         6         4         8         0         4
W2        6         5         8         7         4         8
W3        5         7         7         7         8         7
W4        7         8         7         9         7         8
W5        8         6         9         9         8         9
...       ...       ...       ...       ...       ...       ...

Table 5
Temporal association rules between PkL(0) and PkR(1).

Rules                         Support (%)    Confidence (%)
PkL(0) = 4 ⇒ PkR(1) = 4       12             100
PkL(0) = 8 ⇒ PkR(1) = 8       5              100
PkL(0) = 7 ⇒ PkR(1) = 7       6              90
PkL(0) = 7 ⇒ PkR(1) = 8       8              9
PkL(0) = 7 ⇒ PkR(1) = 6       0.2            1
PkL(0) = 9 ⇒ PkR(1) = 9       20             98
PkL(0) = 9 ⇒ PkR(1) = 10      1.5            2
...                           ...            ...

Table 6
Maximum confidence for each group of temporal association rules of Table 5.

Rule body      Max confidence (%)
PkL(0) = 4     100
PkL(0) = 8     100
PkL(0) = 7     90
PkL(0) = 9     98
...            ...

Table 7
Low confidence temporal association rules between PkL(0) and PkR(1).

Low confidence rules          Support (%)    Confidence (%)
PkL(0) = 7 ⇒ PkR(1) = 8       8              9
PkL(0) = 7 ⇒ PkR(1) = 6       0.2            1
PkL(0) = 9 ⇒ PkR(1) = 10      1.5            2
...                           ...            ...

5.2. Complexity of the TOD algorithm

The most computationally intensive steps of TOD are the temporal rule mining step (Algorithm 1, line 2) and the temporal quasi-functional dependency mining step (Algorithm 1, line 3). The outlier detection step (Algorithm 2, lines 5–8) is also potentially intensive if the number of TQFDs with a dependency degree value higher than Pthreshold is high. However, since we are interested in selecting few temporal quasi-functional dependencies, and in particular the few associated with the set of outliers, the set of TQFDs with a p value higher than the applied Pthreshold is usually limited to a few tens. Thus, we concentrate our analysis on the first two steps, and particularly on the first one, because the second one is quite similar.


In the worst case, the complexity of the association rule mining process is linear with respect to the number of records (N) of the dataset and linear with respect to the number of mined rules (R) [51]. Hence, the complexity of line 2 of Algorithm 1 is N · R. As demonstrated in [51], the total number of rules (R) in the worst case is:

$$R = \sum_{k=1}^{N_I} \left[ \binom{N_I}{k} \times \sum_{i=1}^{N_I-k} \binom{N_I-k}{i} \right] = 3^{N_I} - 2^{N_I+1} + 1,$$

where $N_I$ is the number of items contained in the dataset. In our algorithm, since the items are (extended attribute, value) pairs, the number of items $N_I$ is:

$$N_I = \sum_{i=1}^{N_A} |Dom(d_i)| \cdot w,$$

where $N_A$ is the number of attributes of the original dataset. If the domains ($Dom(d_i)$) of the attributes all have approximately the same size (i.e., D), then $N_I$ can be approximated as:

$$N_I = N_A \cdot D \cdot w.$$

Hence, $N_I$ is linear with respect to the window size and linear with respect to the number of attributes. It follows that, in the worst case, the rule mining phase is exponential with respect to the number of attributes ($N_A$) and the sliding window size (w). However, it has been empirically shown on real datasets [31,41] that on average the number of generated rules is significantly lower than the upper bound given by the formula reported above. Furthermore, a threshold can be enforced to limit the maximum size of the mined rules. In fact, rules that are too long are often not useful because they are not interpretable or are redundant. By mining rules of length lower than or equal to L ($R_L$), the maximum number of mined rules is:

$$R_L = \sum_{k=1}^{L} \left[ \binom{N_I}{k} \times \sum_{i=1}^{L-k} \binom{N_I-k}{i} \right].$$
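The two counting formulas above can be checked numerically with a short sketch. The function names are hypothetical and introduced only for illustration; the sketch assumes that the maximum rule length does not exceed the number of items.

```python
from math import comb


def total_rules(n_items):
    """Worst-case number of rules over n_items items (double sum over body size k and head size i)."""
    return sum(
        comb(n_items, k) * sum(comb(n_items - k, i) for i in range(1, n_items - k + 1))
        for k in range(1, n_items + 1)
    )


def bounded_rules(n_items, max_len):
    """Worst-case number of rules of length <= max_len (assumes max_len <= n_items)."""
    return sum(
        comb(n_items, k) * sum(comb(n_items - k, i) for i in range(1, max_len - k + 1))
        for k in range(1, max_len + 1)
    )


def item_count(n_attributes, domain_size, window):
    """Approximate number of items when all attribute domains have the same size."""
    return n_attributes * domain_size * window


# The double sum agrees with the closed form 3^n - 2^(n+1) + 1 for small n.
closed_form_ok = all(total_rules(n) == 3**n - 2**(n + 1) + 1 for n in range(1, 11))

r2 = bounded_rules(10, 2)  # 90, of the order of 10**2
r3 = bounded_rules(10, 3)  # 810, of the order of 10**3
```

With 10 items, the unbounded count is 3^10 − 2^11 + 1 = 57,002 rules, while bounding the rule length to 2 or 3 gives 90 and 810 rules respectively, which illustrates how strongly the length threshold reduces the search space.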

By applying this threshold, the maximum number of mined rules decreases significantly. For example, for L = 2 the number of mined rules is $R_2 \approx N_I^2$, and for L = 3 it is $R_3 \approx N_I^3$. Since $N_I = N_A \cdot D \cdot w$, we can conclude that for our purpose the number of mined rules is polynomial with respect to the number of attributes and the sliding window size, and exponential with respect to the rule length. A similar argument holds for the temporal quasi-functional dependency mining step (Algorithm 1, line 3). In the worst case, the maximum number of mined temporal quasi-functional dependencies is the total number of possible combinations of the extended attributes of the time-delay matrix (i.e., $N_A \cdot w$, where $N_A$ is the number of attributes of the original dataset and w the sliding window size). Thus, also for this step, if a threshold is enforced on the maximum length, the complexity is polynomial with respect to the number of attributes and the sliding window size, and exponential with respect to the rule length. We can also notice that the proposed approach assumes that no a priori knowledge is available. Thus, it analyzes all the possible pairs of extended attributes and computes the dependency degree value for each pair. However, if some a priori knowledge is available, it can be exploited to reduce the execution time. For example, primary keys generate exclusively exact functional dependencies. Since we are interested in quasi-functional dependencies, primary keys can be removed from the initial dataset without affecting the set of detected outliers, thus reducing the complexity and execution time of TOD. An empirical measurement which confirms the theoretical evaluation of the complexity of TOD is reported in Section 6.2.

6. Experiments

We performed a wide set of experiments to determine the effectiveness and the efficiency of the TOD method. The experiments were performed by using two different data generators.
The financial time series benchmark FinTime data generator [52] is used to generate the first set of synthetic datasets. We used the financial datasets to analyze (i) the temporal quasi-functional dependencies extracted by the TOD algorithm when hierarchical clustering is used as the discretization technique and Pthreshold is set to 0.99, (ii) the effect of varying Pthreshold in the range [0–1] on the extracted temporal quasi-functional dependencies and on the precision and recall of TOD, and (iii) the effect of different discretization methods and numbers of bins on the extracted quasi-functional dependencies. The results on the financial datasets are reported in Section 6.1. A second data generator, which we implemented, was used to perform a set of scalability experiments to measure the execution time of TOD when varying (i) the number of records, (ii) the number of attributes, (iii) the size of the sliding window, and (iv) the maximum length of the extracted temporal quasi-functional dependencies. These results are reported in Section 6.2. Our data generator is publicly available at http://dbdmg.polito.it/~garza/TemporalDataGenerator/.


All the experiments were performed on a 3.0 GHz Intel Xeon system with 16 GB RAM, running Ubuntu 6.10. For itemset and association rule extraction, we used a publicly available version of Apriori downloaded from the FIMI repository [53].

6.1. FinTime datasets

FinTime is a synthetic dataset composed of four tables, and it is usually used to analyze the performance of query engines and financial tools on stock market data. The FinTime datasets were generated by using the publicly available FinTime data generator [52]. In our experiments, we considered the Market Data table, which stores daily historical market data about stock prices. Table 8 describes its schema in detail. In particular, each tuple contains the stock identifier (Id), the temporal attribute, i.e., the date (TradeDate), the four prices measured on the day (HighPrice, LowPrice, OpenPrice and ClosePrice), and the volume of shares traded (Volume). Different datasets can be generated by setting the required number of days and stocks. We performed the experiments on a set of datasets generated by varying the number of days and by setting the number of stocks to one (hence, the Id attribute of the Market Data table always assumes the same value in our datasets). Before extracting association rules, all the continuous attributes of the FinTime dataset were discretized. We repeated the experiments by applying different discretization algorithms to show how this choice affects the results (see Section 6.1.3). In this section we report experiments obtained by mining 2-length quasi-functional dependencies (i.e., one extended attribute in the antecedent and one in the consequent) and by setting the sliding window parameter w to 2 (i.e., only rules which span two consecutive time instants are mined). The effect of varying these values is discussed in Sections 6.2.3 and 6.2.4.

6.1.1. Extracted temporal quasi-functional dependencies and outliers

We performed an initial set of experiments by generating a FinTime dataset with 1000 tuples (i.e., 1000 consecutive daily records) about one stock. We discretized continuous attributes into 10 bins by using hierarchical clustering. The effects of the discretization technique and of the number of bins on the extracted outliers are discussed in Section 6.1.3. As described in Section 5, the TOD algorithm exclusively analyzes the temporal quasi-functional dependencies with a dependency degree p higher than or equal to a user-given threshold Pthreshold (see Algorithm 1, lines 1–3). We set Pthreshold to 0.99, since we are interested only in "strong" dependencies which describe the normal behavior of the analyzed data. By setting a lower threshold value, a larger number of quasi-functional dependencies is detected, and hence also a larger number of outliers. However, if low threshold values are used, many of the selected objects are false outliers, as discussed later (Section 6.1.2).

Only two of the mined temporal quasi-functional dependencies have a dependency degree value higher than 0.99. In particular, the following temporal quasi-functional dependencies with a p value higher than 0.99 have been extracted:

(1) ClosePrice(0) → OpenPrice(1), p = 0.998;
(2) OpenPrice(1) → ClosePrice(0), p = 0.998.

The first one states that for each stock the open price of one day is almost always functionally dependent on the close price of the day before. Hence, the value of the open price attribute is implied by the value of the close price attribute of the previous day. This temporal quasi-functional dependency arises because the open price of a stock is usually equal to the close price of the day before, except when particular events happen (e.g., stock splits). Hence, the quasi-functional dependency automatically extracted by our approach represents an actual behavior of the analyzed data.
The second temporal quasi-functional dependency with a p value higher than 0.99 states that for each stock the close price value is quasi-functionally dependent on the value of the open price of the day after. Like the previous one, this dependency is also related to stock splits. After retrieving the temporal quasi-functional dependencies with a dependency degree value higher than 0.99, the TOD algorithm selects, for each of them, the set of related rules (Algorithm 1, lines 5–8 and Algorithm 2). In particular, to extract the outliers associated with the first temporal quasi-functional dependency, TOD examines the temporal association rules which involve the attribute ClosePrice at time 0 (in the body of the rule) and the attribute OpenPrice at time 1 (in the head of the rule). Such rules are reported in Table 9. The attributes assume values in the range [1, 10], because values are discretized into 10 bins. The first nine rules in Table 9 have a confidence of 100%. This means that if the closing price has a label in the range [1, 9], we know the value of the open price on the following day. On the contrary, if the closing price has a label equal to 10, in the majority of cases (i.e., 86.36%) the opening price of the following day is 10, but in a few cases it has another value. When we apply the TOD algorithm to detect outliers, as described in Section 5, it first finds the rules with a confidence value lower than the maximum confidence of the rules with the same body (Algorithm 2, lines 6–8). Then, outliers are extracted by applying the selected rules to the source data (Algorithm 1, line 7). For the analyzed data, the rules which represent outliers are the last two of Table 9. The two selected rules correspond to stock splits, i.e., particular situations in which the price of a stock decreases instantaneously. For financial reasons, at a certain time the price of the stock is decreased by a scale factor of 2 (i.e., the price becomes half of the previous value) or 3 (i.e., the price becomes a third of the previous value). This behavior explains why the rule (ClosePrice = 10)(0) ⇒ (OpenPrice = 10)(1) does not have a confidence of 100%. Thus, our approach labeled as outliers the situations in which the open price is different from the close price of the previous day, because they are rare events which happen only in particular situations (i.e., they are "different" from the majority of the other data). Hence, our approach allows identifying a set of interesting outliers which are present in the analyzed dataset. In particular, TOD finds three outliers in the FinTime dataset used.

Table 8
FinTime dataset: main characteristics of the Market Data table.

Attribute     Type        Description
Id            Char(30)    Stock identifier
Tradedate     Date        Trade date
HighPrice     Double      High price for the day
LowPrice      Double      Low price for the day
ClosePrice    Double      Closing price for the day
OpenPrice     Double      Open price for the day
Volume        Long        Volume of shares traded
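The procedure described above for a single dependency X(0) → Y(1) can be sketched end to end: pair each value of the body attribute with the value of the head attribute one step later (two columns of the time-delay matrix with w = 2), compute support and confidence of each rule, and flag the windows matching a rule whose confidence is below the maximum of its body group. The function names and the toy bin labels are hypothetical; the series is not taken from the FinTime data.

```python
from collections import Counter


def time_delay_pairs(body_series, head_series, lag=1):
    """Pair body_series[t] with head_series[t + lag]; lag must be >= 1."""
    return list(zip(body_series[:-lag], head_series[lag:]))


def rule_statistics(pairs):
    """Return {(body, head): (support, confidence)} as fractions in [0, 1]."""
    pair_counts = Counter(pairs)
    body_counts = Counter(body for body, _ in pairs)
    n = len(pairs)
    return {
        (body, head): (count / n, count / body_counts[body])
        for (body, head), count in pair_counts.items()
    }


def outlier_windows(pairs):
    """Indices of windows matching a rule below the maximum confidence of its body group."""
    stats = rule_statistics(pairs)
    max_conf = {}
    for (body, _), (_, conf) in stats.items():
        max_conf[body] = max(conf, max_conf.get(body, 0.0))
    return [
        t for t, (body, head) in enumerate(pairs)
        if stats[(body, head)][1] < max_conf[body]
    ]


# Hypothetical discretized series (bin labels). The open price equals the previous
# close everywhere except at index 6, which mimics a stock split.
close_price = [9, 9, 10, 10, 10, 10, 4, 4, 5]
open_price = [9, 9, 9, 10, 10, 10, 4, 4, 4]

pairs = time_delay_pairs(close_price, open_price)
outliers = outlier_windows(pairs)  # [5]: window pairing close_price[5] with open_price[6]
```

In this toy series the rule (ClosePrice = 10)(0) ⇒ (OpenPrice = 10)(1) has confidence 0.75 and the rule (ClosePrice = 10)(0) ⇒ (OpenPrice = 4)(1) has confidence 0.25, so only the window containing the simulated split is reported.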
The temporal quasi-functional dependency OpenPrice(1) → ClosePrice(0) is determined by the same outliers as the first temporal quasi-functional dependency discussed previously, so we do not analyze it in detail. However, we point out that two quasi-functional dependencies X(ti) → Y(tj) and Y(tj) → X(ti) can potentially be related to different outliers.

6.1.2. Effect of minimum dependency degree value on the extracted temporal quasi-functional dependencies

We performed a set of experiments by varying the minimum dependency degree value (Pthreshold) passed as a parameter to TOD. Fig. 1(a) reports the cumulative number of extracted quasi-functional dependencies at increasing dependency degree value. If no minimum dependency degree threshold is enforced (i.e., Pthreshold = 0), 70 temporal quasi-functional dependencies are extracted from the FinTime dataset. If Pthreshold = 0.99, the only extracted temporal quasi-functional

Table 9
Temporal association rules.

Rules                                         Support (%)    Confidence (%)
(ClosePrice = 1)(0) ⇒ (OpenPrice = 1)(1)      19.4           100
(ClosePrice = 2)(0) ⇒ (OpenPrice = 2)(1)      5.3            100
(ClosePrice = 3)(0) ⇒ (OpenPrice = 3)(1)      15.3           100
(ClosePrice = 4)(0) ⇒ (OpenPrice = 4)(1)      8.0            100
(ClosePrice = 5)(0) ⇒ (OpenPrice = 5)(1)      17.5           100
(ClosePrice = 6)(0) ⇒ (OpenPrice = 6)(1)      5.6            100
(ClosePrice = 7)(0) ⇒ (OpenPrice = 7)(1)      12.3           100
(ClosePrice = 8)(0) ⇒ (OpenPrice = 8)(1)      11.5           100
(ClosePrice = 9)(0) ⇒ (OpenPrice = 9)(1)      2.8            100
(ClosePrice = 10)(0) ⇒ (OpenPrice = 10)(1)    1.9            86.36
(ClosePrice = 10)(0) ⇒ (OpenPrice = 1)(1)     0.2            9.09
(ClosePrice = 10)(0) ⇒ (OpenPrice = 3)(1)     0.1            4.55

Fig. 1. Effect of minimum dependency degree value. [Plots omitted: (a) number of dependencies versus Pthreshold; (b) precision versus Pthreshold.]


dependencies are the two discussed in the previous subsection. Fig. 1(a) shows that few mined temporal quasi-functional dependencies are characterized by a high dependency degree value. We also analyzed the effects of the Pthreshold parameter on the extracted outliers. In particular, we measured the precision and the recall of our approach. Precision measures the fraction of the outliers extracted by TOD which are actually outliers, while recall measures the fraction of the complete set of actual outliers in the dataset which are identified by TOD. The precision and recall formulas are as follows:

$$\mathrm{Precision} = \frac{|O_{tod} \cap O_{dataset}|}{|O_{tod}|}, \qquad \mathrm{Recall} = \frac{|O_{tod} \cap O_{dataset}|}{|O_{dataset}|},$$

where $O_{dataset}$ is the complete set of actual outliers present in the dataset, while $O_{tod}$ is the set of outliers returned by TOD. Both measures take values in the range [0%, 100%]. In order to compute precision and recall, we manually identified the actual outliers in the FinTime dataset. In particular, we identified as outliers the three outliers discussed in Section 6.1.1 (i.e., the cases in which the open price is different from the close price of the previous day). Fig. 1(b) reports the precision of TOD for increasing dependency degree values, while the recall is always 100% for each dependency degree value. As expected, the precision of TOD increases when the dependency degree threshold increases, and it is equal to 100% when the threshold is greater than or equal to 0.93.

6.1.3. Effect of discretization on extracted temporal quasi-functional dependencies

We performed experiments by varying both the discretization method and the bin cardinality, to show how discretization influences the extracted temporal quasi-functional dependencies. We varied the number of bins (NB) from 2 to 100 and we exploited the following discretization methods.

Hierarchical clustering. For each attribute, a hierarchical clustering algorithm is used to group the attribute values into NB bins, where NB is the number of bins specified by the user.
K-means clustering. For each attribute, the K-means clustering algorithm computes NB cluster centers, and each value of the attribute is then assigned to the nearest one.
Equi-width binning. For each attribute, the interval range is divided into NB bins of equal extent.

Figs. 2(a)–(h), 3(a)–(h), and 4(a)–(h) report the results obtained by applying, respectively, the hierarchical clustering algorithm, the K-means clustering algorithm, and equi-width binning to discretize the data. As expected, on average the dependency degree value of the mined dependencies is higher when few bins are used (for example, see Fig. 2(a)), while it decreases when the number of bins increases (Fig. 2(h)). This behavior arises because, when few bins are created, on average few rules are extracted, all characterized by high support and confidence. Hence, on average, higher p values are achieved. In contrast, when the number of bins increases (and hence also the number of items), rules with lower support and confidence are mined, which negatively affects the p value of the extracted dependencies. This behavior is similar for all three considered methods. Since we are interested in TQFDs with p higher than 0.99, we analyze these results in more detail. To compare the three considered discretization approaches, we report in Fig. 5(a)–(c), for each technique, the number of extracted TQFDs with a dependency degree greater than or equal to 0.99 when varying the number of bins from 2 to 100. The results show that hierarchical clustering is less sensitive to the number of bins than the other techniques. In fact, with hierarchical clustering the number of extracted TQFDs with a dependency degree higher than 0.99 is equal to 2 for a significant range of values (see Fig. 5(a)). Only when few bins are used (2 bins) is the number of TQFDs significantly higher than 2 (i.e., 12). In contrast, the equi-width binning method allows mining TQFDs with a p value higher than or equal to 0.99 only if the number of bins is set to 2 (see Fig. 5(c)), and K-means clustering only if the number of bins is in the range 2–10 (see Fig. 5(b)). The list of TQFDs with p > 0.99 for 2, 3, and 4 bins obtained with the three different discretization techniques is reported in Table 10. With 2 bins, the equi-width method detects only two TQFDs, which are not those associated with the real outliers. Instead, the two clustering techniques detect both the TQFDs which correspond to outliers (i.e., ClosePrice(0) → OpenPrice(1) and OpenPrice(1) → ClosePrice(0)) and some other TQFDs. This is caused by the rough discretization into only two bins. When three or more bins are used, the equi-width method does not detect any TQFD with p > 0.99, while both clustering techniques detect the TQFDs corresponding to the real outliers. This behavior continues up to 10 bins. Beyond 10 bins, only hierarchical clustering detects TQFDs with p > 0.99. From these results we can conclude that the two clustering-based approaches obtain better results than the equi-width technique and, in particular, that hierarchical clustering is less affected by the bin cardinality.

6.2. Scalability of TOD

In order to analyze the scalability of TOD in detail, we performed experiments by varying both the dataset characteristics (i.e., number of records and number of attributes) and the values of the TOD parameters (i.e., sliding window size and rule

Fig. 2. Discretization based on hierarchical clustering. Effects of the number of bins on the dependency degree value of the mined temporal quasi-functional dependencies. [Plots omitted: panels (a)–(h); x-axis: dependency degree value; y-axis: number of dependencies.]

length). The experiments were performed by exploiting a set of synthetic datasets from a data generator implemented by ourselves, which allows specifying (i) the number of attributes of the generated dataset, (ii) the number of records, and

Fig. 3. Discretization based on K-means clustering. Effects of the number of bins on the dependency degree value of the mined temporal quasi-functional dependencies. [Plots omitted: panels (a)–(h); x-axis: dependency degree value; y-axis: number of dependencies.]

(iii) the set of temporal quasi-functional dependencies to be injected into the dataset. All the attributes of the generated datasets are categorical. Hence, no discretization process is needed. This allows us to analyze the effects of the parameters independently of the discretization technique.

Fig. 4. Discretization based on equi-width binning. Effects of the number of bins on the dependency degree value of the mined temporal quasi-functional dependencies. [Plots omitted: panels (a)–(h); x-axis: dependency degree value; y-axis: number of dependencies.]

6.2.1. Execution time when varying the number of records

Fig. 6(a) and (b) report the execution time of TOD when varying the cardinality of the dataset (i.e., the number of records) in the range from 10,000 to 100,000. The number of attributes is set to 15.


Fig. 5. Effects of the number of bins and the discretization technique on the number of extracted temporal quasi-functional dependencies with a dependency degree value greater than or equal to 0.99. [Plots omitted: (a) hierarchical clustering, (b) K-means clustering, (c) equi-width binning; x-axis: number of bins; y-axis: number of dependencies.]

Fig. 6(a) shows the execution time to extract 2-length temporal quasi-functional dependencies. Results for different sliding window sizes (from 2 to 5) are reported. The execution time of TOD increases almost linearly with the number of records, and the slope grows with the sliding window size. Since the number of extracted rules increases almost linearly with the number of records (as explained in Section 5.2), the execution time of TOD follows the same trend. Increasing the sliding window size also increases the number of mined rules, and thus the execution time, because there are more extended attributes in the time-delay matrix (see Section 4). A similar trend was obtained by extracting 3-length temporal quasi-functional dependencies (see Fig. 6(b)). However, since the maximum length of the mined TQFDs influences the maximum length of the mined association rules, this parameter has a significant impact on the execution time (its effect is discussed in more detail in Section 6.2.4).
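The role of the sliding window size can be illustrated with a small sketch of a time-delay matrix. It assumes that each extended row concatenates `window` consecutive records and that extended attributes are named with the lag in parentheses, in the style of V(0) and V(1) used in Table 10; the exact construction is given in Section 4 and may differ. The function name `time_delay_matrix` is hypothetical.

```python
def time_delay_matrix(records, attributes, window):
    """Build extended rows by concatenating `window` consecutive records.

    Assumption: extended attribute names follow the A(lag) pattern,
    e.g. V(0), V(1), and rows are ordered by starting time.
    """
    header = [f"{a}({lag})" for lag in range(window) for a in attributes]
    rows = []
    for t in range(len(records) - window + 1):
        row = []
        for lag in range(window):
            # Append the record observed `lag` steps after time t.
            row.extend(records[t + lag])
        rows.append(row)
    return header, rows
```

Under this assumption the number of extended attributes is the number of original attributes multiplied by the window size, so a dataset with 15 attributes and a window of 5 yields 75 extended attributes.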


Table 10
Temporal quasi-functional dependencies (TQFD) with p > 0.99 for 2, 3, and 4 bins obtained with different discretization techniques, among attributes Volume (V), ClosePrice (CP), OpenPrice (OP), HighPrice (HP), and LowPrice (LP).
Column groups, left to right: Hierarchical, K-means, Equi-width. Each group lists TQFD and p.

2 bins

Hierarchical
TQFD              p
V(0) [ V(1)       0.999
V(1) [ V(0)       0.999
CP(0) [ OP(1)     0.997
OP(1) [ CP(0)     0.997
CP(0) [ OP(0)     0.995
OP(0) [ CP(0)     0.995
CP(0) [ CP(1)     0.992
CP(1) [ CP(0)     0.992
OP(0) [ OP(1)     0.992
OP(1) [ OP(0)     0.992
CP(0) [ HP(1)     0.991
HP(1) [ CP(0)     0.991

K-means
TQFD              p
CP(0) [ OP(1)     0.996
OP(1) [ CP(0)     0.996
V(0) [ V(1)       0.995
V(1) [ V(0)       0.995
CP(0) [ OP(0)     0.992
OP(0) [ CP(0)     0.992

Equi-width
TQFD              p
V(0) [ V(1)       0.999
V(1) [ V(0)       0.999

3 bins

TQFD              p
V(0) [ V(1)       0.998
V(1) [ V(0)       0.998
CP(0) [ OP(1)     0.997
OP(1) [ CP(0)     0.997

TQFD              p
CP(0) [ OP(1)     0.997
OP(1) [ CP(0)     0.997

4 bins

TQFD              p
CP(0) [ OP(1)     0.997
OP(1) [ CP(0)     0.997

TQFD              p
CP(0) [ OP(1)     0.991
OP(1) [ CP(0)     0.991

[Fig. 6 plots: execution time in seconds (y-axis) versus number of records, from 10,000 to 100,000 (x-axis), with one curve per sliding window size (2, 3, 4, and 5). The y-axis reaches 80 s in one panel and 10,000 s in the other.]

Fig. 6. Execution time of TOD when varying the number of records.

6.2.2. Execution time when varying the number of attributes

Another parameter of interest is the number of attributes of the dataset. Increasing the number of attributes increases both the number of temporal association rules and the potential number of temporal quasi-functional dependencies. We performed experiments on a set of generated datasets with a number of attributes varying from 5 to 15. The number of records of the generated datasets was set to 100,000. The execution time is reported in Fig. 7(a) and (b), for the extraction of 2-length and 3-length rules, respectively. Results for different sliding window sizes (from 2 to 5) are reported. Since the number of generated association rules and the number of potential dependencies are almost polynomial with respect to the number of attributes (as explained in Section 5.2), the execution time of TOD is also almost polynomial with respect to the number of attributes.

6.2.3. Execution time when varying the sliding window size

Fig. 8(a) and (b) report the execution time of TOD when varying the sliding window size from 2 to 5, for the extraction of 2-length and 3-length rules, respectively. The dataset cardinality was fixed to 100,000. The execution time for different numbers of attributes (i.e., 5, 10, 12, 15) is reported in each figure. The sliding window size affects the execution time because

[Fig. 7 plots: execution time in seconds (y-axis) versus number of attributes, from 5 to 15 (x-axis), with one curve per sliding window size (2, 3, 4, and 5).]

Fig. 7. Execution time of TOD when varying the number of attributes.

[Fig. 8 plots: execution time in seconds (y-axis) versus sliding window size, from 2 to 5 (x-axis), with one curve per number of attributes (5, 10, 12, and 15).]

Fig. 8. Execution time of TOD when varying the sliding window size.

[Fig. 9 plots: execution time in seconds (y-axis) versus length of dependency, from 2 to 4 (x-axis), with one curve per number of attributes (5, 6, and 7).]

Fig. 9. Execution time of TOD when varying the extracted dependency length.

TOD mines the set of association rules by exploiting a time-delay matrix. The number of extended attributes of the time-delay matrix is equal to the number of attributes of the original dataset multiplied by the sliding window size. Since the execution time of TOD is almost polynomial with respect to the number of extended attributes of the time-delay matrix (as explained in Section 5.2), and the number of extended attributes depends on the sliding window size, the execution time of TOD is almost polynomial with respect to the sliding window size.

6.2.4. Execution time when varying the dependency length

Finally, we performed a set of experiments to analyze the effect of varying the maximum length of the mined temporal quasi-functional dependencies. Fig. 9(a) and (b) report the execution time of TOD when varying the maximum length from 2 to 4, for a sliding window size of 2 and 3, respectively. Results for different numbers of attributes (i.e., 5, 6, 7) are reported in each figure. The length of the extracted temporal quasi-functional dependencies has a significant impact on the execution time of TOD. The number of potential dependencies to be mined, and also the number of rules, increases exponentially with respect to the length of the mined dependencies (as explained in Section 5.2). Indeed, the results show that the execution time of TOD increases significantly when dependencies of length 4 are mined. However, long dependencies are usually of little use: in many real cases they highlight outliers already discovered by means of shorter dependencies.

7. Conclusion

In this paper, we have (i) formally defined the temporal quasi-functional dependencies and (ii) presented the TOD algorithm to show how they can be used to detect temporal outliers. We showed the effectiveness and the efficiency of the TOD method by means of a wide set of experiments.
In particular, we discussed the performance of TOD when varying the dependency degree threshold and the discretization technique. Furthermore, we performed a set of scalability experiments varying both the dataset and the algorithm parameters. As ongoing work, we are planning to develop a schema matching tool in order to apply temporal quasi-functional dependencies extracted from one database to another one containing semantically similar data. To this end, we are exploiting a schema matching algorithm that maps the attributes of two databases depending on their semantics.

Giulia Bruno is a postdoctoral researcher at the Database and Data Mining group of Politecnico di Torino. She is currently working in the field of data mining and bioinformatics. Her activity is focused on the analysis of microarray gene expression data to propose algorithms for selecting genes relevant for tumor classification and to detect gene regulatory networks. Furthermore, she is investigating issues related to anomaly detection and semantic information discovery in temporal and biological databases. Her research activities are also devoted to data mining techniques for monitoring patient conditions and detecting unsafe events.

Paolo Garza received the master’s and PhD degrees in computer engineering from the Politecnico di Torino. He has been a postdoctoral fellow in the Dipartimento di Automatica e Informatica, Politecnico di Torino, since January 2005. His current research interests include data mining and database systems. In particular, he has worked on the classification of structured and unstructured data, outlier detection, and itemset mining algorithms.