Knowledge-Based Systems 21 (2008) 934–940
Conceptual modeling rules extracting for data streams

Xiao-Dong Zhu*, Zhi-Qiu Huang

College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Article info

Article history: Received 18 April 2007; Received in revised form 30 March 2008; Accepted 13 April 2008; Available online 20 April 2008

Keywords: Granular computing; Data streams mining; Conceptual modeling; Rules extraction; Knowledge discovery
Abstract

In a growing number of applications, including network traffic monitoring, network intrusion detection, sensor networks, fraudulent transaction detection, and financial monitoring, data take the form of continuous data streams rather than traditional stored databases. People are interested in the potential rules in data streams, such as association rules and decision rules. Compared with the extensive work on developing algorithms for data streams mining, little attention has been paid to modeling data mining and data streams mining. Considering the problem of conceptually modeling data streams mining, we put forward a data streams oriented decision logic language as a granular computing formalism and a rules extracting model based on granular computing. In this model, we propose the notion of granular drifting, which accurately interprets the concept drifting problem in data streams. The model helps in understanding the nature of data streams mining, and new algorithms and techniques of data streams mining can be developed on top of it. © 2008 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +86 25 84896263. E-mail address: [email protected] (X.-D. Zhu).
doi:10.1016/j.knosys.2008.04.003

1. Introduction

Research on data streams mining has recently gained much attention due to its important applications and the increasing generation of streaming data in areas such as sensor networks, web clicks, and computer network traffic. One task of knowledge discovery and data mining is to search for knowledge, patterns, and regularities derivable from large datasets, so in data streams mining people are likewise interested in the rules and knowledge hidden in data streams. Data streams mining differs from traditional data mining in its characteristics: high speed, great volume, constrained memory, and limited power are obvious characteristics of data streams, and they pose challenges to the algorithms, tools, models, and data structures designed for traditional data mining. Association rules mining and decision rules mining are important problems in data mining, and hence in data streams mining. However, compared with the many techniques and algorithms that have been proposed, little work has been done on modeling data mining, and this lack of conceptual modeling may jeopardize further development of the field. Granular computing is a label for theories, methodologies, techniques, and tools that make use of granules in the process of problem solving [1,2]. Yao modeled data mining with granular computing, but his model addresses traditional stored data rather than data streams [3]. In this paper, we put forward a data streams mining model based on granular computing. A notion of granular drifting is proposed to interpret the concept drifting in
data streams. We also present experiments to validate the model. The remainder of the paper is organized as follows. Section 2 introduces related work. In Section 3, we present a data stream oriented decision logic language, DS–DL, described by a model and satisfiability in Tarski's style; utilizing DS–DL as the formal method of granular computing, we model data streams mining, and the notion of granular drifting is proposed in that section. In Section 4, we interpret rules extraction in data streams and illustrate how to acquire interesting rules in data streams using granular computing. Section 5 reports our experiments with granular computing and gives the experimental evaluation. Section 6 presents our conclusion and future work.

2. Related work

Granular computing is a label for theories, methodologies, techniques, and tools that make use of granules in the process of problem solving. Zadeh first proposed the notion of information granulation [4]; however, this notion did not receive much attention for more than 10 years. In 1982, Pawlak advanced the theory of rough sets [5], which provides a concrete example of granular computing; to some extent, rough set theory made more people realize the importance of the notion of granulation. In 1997, Zadeh revisited information granulation [6], and Lin suggested the term granular computing to label this new and growing research field [7]. Yao and Zhong proposed granular computing methods and modeled data mining with them [3,8]. Association rules mining and decision rules mining are important problems in both static data mining and data streams mining [9].
Many algorithms have been presented on this research focus. The first algorithm for association rule mining was proposed by Agrawal [10]. Since then, many researchers have been interested in the association rules hidden in large databases, and significant algorithms have been developed to discover association rules [11–14]. Zhu designed and implemented an intrusion detection system to detect network intrusion packets [15]; association rules and decision rules mining techniques were adopted in that system. In recent years, algorithms for mining frequent sets and association rules in data streams have also been developed [16–18]. Yao and Lin advocated more attention to conceptual modeling of data mining. Yao modeled data mining with granular computing; in fact, he utilized a traditional decision logic language as the granular computing approach, accurately describing data sets and interpreting a concept in terms of its intension and extension [3,8]. Lin introduced the relation between data mining and granular computing and presented a modeling method based on neighborhood systems [19]. Yao and Lin contributed much to granular computing and the modeling of data mining; however, their work and models are based on traditional data sets. In this paper, we present conceptual models of data streams mining and precisely interpret the rules extracting process in data streams.
3. Modeling data streams mining with granular computing

3.1. Introduction to data streams

Intelligent data analysis has passed through several stages. Statistical exploratory data analysis represents the first stage; its goal was to explore the available data in order to test a specific hypothesis [20]. With the advance of artificial intelligence and database techniques, many new data analysis problems have been addressed, one of which is knowledge extraction from very large databases. Data mining is the interdisciplinary field of research that extracts models and patterns from large amounts of information stored in data repositories; this statistical data mining can be regarded as the second stage of intelligent data analysis. Advances in networking and parallel computation have led to distributed and parallel data mining, whose goal is to extract knowledge from different subsets of a dataset and integrate the generated knowledge structures into a global model of the whole dataset. In recent years, data take the form of continuous data streams rather than traditional stored databases in a growing number of applications, including network traffic monitoring, network intrusion detection, sensor networks, fraudulent transaction detection, and financial monitoring. Data in these applications share several distinguishing features, such as huge volumes and unpredictable arrivals, and data generation rates are faster than ever before. These continuous stream data require online or real-time processing instead of first storing the data in a static repository. A data stream is an ordered sequence of items that arrive in timely order. Differing from data in traditional static databases, data streams are continuous, unbounded, usually arrive at high speed, and have a data distribution that often changes with time [21]. Data in data streams are called streams data.
As the number of applications mining data streams grows rapidly, there is an increasing need to perform association rules mining on data streams [14]. A typical data stream is illustrated in Fig. 1, which shows a sliding window model of data processing adopted by many data streams mining algorithms. In [16,22], the authors use this method to obtain the frequent itemsets of the data stream within the current sliding window. In this model, only the part of the data stream within the sliding window is stored and processed as the data flows in. The size of the sliding window may be decided according to the application and system resources. The mining result of the sliding window method depends entirely on recently generated transactions in the range of the window; all transactions in the window must be maintained so that their effects on the current mining results can be removed once they fall out of the window.

Fig. 1. Data streams (a sliding window over the stream: history time window, current time window, and the current batch of transactions).

Association rules obtained from a static database are generally invariable, while in data streams association rules vary as time goes on. Finding frequent itemsets is the first step of association rules mining; in data streams mining, by contrast, some frequent itemsets may become infrequent, and some currently infrequent itemsets may become frequent later. As time goes by, streams data vary continuously, so frequent sets and association rules potentially change. This is regarded as the concept drifting problem in data streams. We will describe and interpret concept drifting using the notion of granular drifting.

3.2. Conceptual modeling upon data streams

Data mining was modeled with granular computing by Yao [3]. As is well known, a concept has two aspects, its intension and its extension. Yao interprets the intension of a concept as a formula over an information table and the extension as the set of objects satisfying that formula. He utilized traditional decision logic to construct and interpret granules. However, his work focused on static data. We focus on data streams and propose the notion of granular drifting over dynamic streams data. An information table is a convenient formal method to describe a dataset with hidden knowledge and patterns. We assume that data streams are continuous information tables whose objects are unbounded; however, we use a time period to view the streams data.

Definition 1. A data stream is a six-tuple DS = {TGr, T, At, DS–DL, {V_a | a ∈ At}, {I_a | a ∈ At}}.
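As a rough illustration (a sketch of our own; the class name StreamTable and the window-filtering logic are our assumptions, not the paper's code), the six-tuple can be represented as a per-period information table:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

@dataclass
class StreamTable:
    """One information table DS, covering a single time period TGr."""
    tgr: Tuple[float, float]                  # time period <s_b, s_e>
    transactions: List[Dict[str, Any]] = field(default_factory=list)
    attributes: Set[str] = field(default_factory=set)

    def add(self, timestamp: float, row: Dict[str, Any]) -> None:
        # Only transactions whose time stamp falls inside TGr belong to DS.
        s_b, s_e = self.tgr
        if s_b <= timestamp < s_e:
            self.transactions.append({"ts": timestamp, **row})
            self.attributes.update(row.keys())

    def I(self, a: str, x: Dict[str, Any]) -> Any:
        """Information function I_a: the value of transaction x on attribute a."""
        return x[a]

    @property
    def length(self) -> int:
        # DS.length = |T|
        return len(self.transactions)

# A one-minute TGr; the second transaction arrives outside it and is ignored.
ds = StreamTable(tgr=(0.0, 60.0))
ds.add(5.0, {"protocol_type": "tcp"})
ds.add(120.0, {"protocol_type": "udp"})
```

Here an information table is identified by its TGr, and DS.length is simply the cardinality of T, matching the description below.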
Here TGr is a time period, a time section that may be set to one second, one minute, one hour, and so on. TGr implies a time granule, as defined in Definition 7. A TGr decides the length of the information table DS of a concrete data stream; an information table DS is identified by its TGr. T is the set of transactions in a concrete TGr; we identify transactions in T by their time stamps. The length of DS equals the cardinality of the set T and is denoted DS.length. At is a finite nonempty set of attributes. DS–DL is a language defined over the attributes in At, which is defined in the next subsection. V_a is a nonempty set of values for a ∈ At. I_a is an information (mapping) function from T to V_a.

3.3. Data streams oriented decision logic language

We define a language DS–DL, which is similar to the language studied by Yao [3,8] and the decision logic language studied by
Pawlak [23] for describing data streams. In DS–DL, an atomic formula is a binary tuple (a, v), where a ∈ At and v ∈ V_a. The semantics of DS–DL can be defined in Tarski's style through the notions of a model and satisfiability. The model is an information table DS, described in the subsection above, which provides the interpretation for the symbols and formulas of DS–DL. The satisfiability of a formula φ by a transaction x is written x ⊨_DS φ, or x ⊨ φ for short.

Definition 2. Formulas. Formulas in DS–DL are obtained through the following three rules:

(1) (a, v) is an atomic formula, where a ∈ At and v ∈ V_a;
(2) if φ and ψ are formulas, then ¬φ, φ ∧ ψ, φ ∨ ψ, φ → ψ and φ ↔ ψ are formulas;
(3) only compounds generated by the first and second rules in finitely many steps are formulas.

Definition 3. Satisfiability. Satisfiability is defined by the following conditions:

(1) x ⊨ (a, v) iff I_a(x) = v;
(2) x ⊨ ¬φ iff not x ⊨ φ;
(3) x ⊨ φ ∧ ψ iff x ⊨ φ and x ⊨ ψ;
(4) x ⊨ φ ∨ ψ iff x ⊨ φ or x ⊨ ψ;
(5) x ⊨ φ → ψ iff x ⊨ ¬φ ∨ ψ;
(6) x ⊨ φ ↔ ψ iff x ⊨ φ → ψ and x ⊨ ψ → φ.

In this paper, iff denotes "if and only if". Formulas such as φ ∧ ψ and (φ ∧ ψ) → χ are called combined formulas; atomic formulas and combined formulas are jointly called formulas.

Definition 4. T is satisfiable on a formula φ iff ∃x ∈ T, x ⊨_DS φ.

Definition 5. A formula φ is true iff every transaction x in T satisfies φ; that is, φ is true upon the current DS iff ∀x ∈ T, x ⊨_DS φ.

Definition 6. For a formula φ, the set m(φ) = {x ∈ T | x ⊨_DS φ} is called the meaning set of the formula φ in DS.

Theorem 1. The following properties hold:

(1) m((a, v)) = {x ∈ T | I_a(x) = v};
(2) m(¬φ) = T − m(φ);
(3) m(φ ∧ ψ) = m(φ) ∩ m(ψ);
(4) m(φ ∨ ψ) = m(φ) ∪ m(ψ);
(5) m(φ → ψ) = m(¬φ ∨ ψ);
(6) m(φ ↔ ψ) = m(φ → ψ) ∩ m(ψ → φ).

Theorem 2. If T in a DS is satisfiable on a formula φ, then m(φ) is not empty. If a formula φ is true for the current DS, then m(φ) is T itself.

Definition 7. A time period TGr is defined by a time section, written <s_b, s_e>, where s_b is the begin time and s_e is the end time of the period. Usually one second, one minute or one hour is used as a time period, according to the application and system resources. The time period is a sliding window in terms of the data processing model. From the perspective of granular computing, this time period is regarded as a time granule: a one-hour granularity is rougher than a one-second granularity, a month is rougher than a day, and so on. A time granularity decides the length of DS (i.e. DS.length) for a concrete data stream, because a rougher time granularity contains more data. In some areas, such as sensor networks and stock data streams, the streams data are generated periodically, so a granularity can be equally partitioned into scales, with one transaction generated at every time scale.

3.4. Granular drifting
In this subsection, the notion of granular drifting is presented, and we model concept drifting problems with granular drifting. Concepts have a formal and precise description in the language DS–DL. A concept contains two aspects, its intension and its extension. A concept over an information table is defined as a binary pair <φ, m(φ)>, where φ is a formula over the information table DS. The former element φ is a description of m(φ), that is, the intension of the concept; the latter element m(φ) is the set of satisfying transactions, that is, the extension of the concept. Within one time period, transactions are invariable; the data look like a temporarily static data set. In a time period TGr, transaction data sets are regarded as static, but when the time period changes, the transaction data are updated. From the standpoint of time periods, the updates can only be additions and deletions, because a modification can be expressed as a deletion followed by an addition. If transaction data have been added, the extension of a concept may expand or hold; if transaction data have been deleted, the extension may shrink or hold; if additions and deletions happen in the same time period, the extension may expand, shrink or hold. In a word, the extension of a concept changes. Now we interpret this process with granular computing. Assume the current time period is <s_b, s_e>; within this period, a concept <φ, m(φ)> is invariable. As time goes on, the current time period becomes <s'_b, s'_e>; the period <s_b, s_e> becomes a history time period, and <s'_b, s'_e> is called the direct subsequence of <s_b, s_e>.
In an information table, if transactions x and y satisfy a formula φ simultaneously, they may be put into the same granule. In fact, the formula φ decides an equivalence class of transactions satisfying it. Within a time period, this equivalence class is invariable; in the subsequent time period, however, the equivalence class changes because the transaction data are updated. We use m(φ) to represent the set of transactions that satisfy φ. It has the following two properties:

(1) m(φ) is stronger than before iff m(φ) has expanded;
(2) m(φ) is weaker than before iff m(φ) has shrunk.

If m(φ) is stronger, we say the concept <φ, m(φ)> is stronger; on the contrary, if m(φ) is weaker, the concept <φ, m(φ)> is weaker. In general, we say the concept <φ, m(φ)> is drifting. Therefore, the concept drifts in data streams as the time periods go on. Concept drifting in data streams is actually decided by changes in the extension of the concept, that is, the meaning set of the concept drifts. We call this process granular drifting.
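As a toy sketch (our own construction, not the authors' code; formulas are encoded as predicates), the meaning set m(φ) and the stronger/weaker test between two consecutive time periods can be written as:

```python
from typing import Any, Callable, Dict, List

Transaction = Dict[str, Any]
Formula = Callable[[Transaction], bool]   # x |= phi, encoded as a predicate

def atom(a: str, v: Any) -> Formula:
    """Atomic formula (a, v): satisfied when I_a(x) = v."""
    return lambda x: x.get(a) == v

def conj(phi: Formula, psi: Formula) -> Formula:
    """phi AND psi."""
    return lambda x: phi(x) and psi(x)

def meaning(phi: Formula, T: List[Transaction]) -> List[Transaction]:
    """m(phi) = {x in T | x |= phi}."""
    return [x for x in T if phi(x)]

def drift(phi: Formula, T_old: List[Transaction], T_new: List[Transaction]) -> str:
    """Granular drifting: compare |m(phi)| across two time periods."""
    old, new = len(meaning(phi, T_old)), len(meaning(phi, T_new))
    if new > old:
        return "stronger"   # extension expanded
    if new < old:
        return "weaker"     # extension shrunk
    return "hold"

# Two consecutive windows of a hypothetical stream:
phi = atom("p", "tcp")
T1 = [{"p": "tcp"}, {"p": "udp"}]
T2 = [{"p": "tcp"}, {"p": "tcp"}, {"p": "udp"}]
```

Here drift(phi, T1, T2) reports the concept <φ, m(φ)> getting stronger, exactly in the sense of properties (1) and (2) above.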
4. Interpretation of rules extraction
4.1. Quantity measures of rules

Association rules and decision rules represent expected knowledge. Depending on the meanings and forms of rules, one may classify rules in many ways, and a clear classification of rules is useful for understanding the basic tasks of machine learning and data mining. Rules can be classified into two types. A type 1 rule states that "if the value of an object is v_a on attribute a, then the value of the object is v_b on attribute b"; a type 2 rule states that "if two objects have the same value on attribute a, then they have the same value on attribute b". An association rule, which states that the presence of one set of items implies the presence of another set of items, is a typical type 1 rule [10]. Decision rules from decision tree learning algorithms are also examples of type 1 [24]. High order decision rules, which focus on pairs of objects, represent type 2 [25,26]. The difference between traditional association rules and data streams association rules is the concept drifting problem in data streams [27]. In data mining, rules are typically interpreted in terms of conditional probability. Using the cardinalities of sets, we obtain the contingency Table 1, representing the quantitative information about the rule φ → ψ. The quantitative measures generality, absolute support, change of support and mutual support can be defined on this table. The 2×2 contingency table has been used by many authors for representing information about rules [1,28,29]. Here |·| denotes the cardinality of a set.

(1) The generality of a concept <φ, m(φ)> is defined by:

G(φ) = |m(φ)| / |T|

It indicates the relative size of the concept <φ, m(φ)>. A concept is more general if it covers more instances of the universe T. Obviously, 0 ≤ G(φ) ≤ 1.

(2) The absolute support of ψ provided by φ is defined by:

AS(φ → ψ) = AS(ψ|φ) = |m(φ) ∩ m(ψ)| / |m(φ)|

Here m(φ) ≠ ∅. The quantity AS(ψ|φ) states the degree to which φ supports ψ. It may be viewed as the conditional probability that a randomly selected element satisfies ψ given that it satisfies φ; in set-theoretic terms, it is the degree to which m(φ) is included in m(ψ). Clearly, AS(ψ|φ) = 1 iff m(φ) ≠ ∅ and m(φ) ⊆ m(ψ).

(3) The change of support of ψ provided by φ is defined by:

CS(φ → ψ) = CS(ψ|φ) = AS(ψ|φ) − G(ψ)

The change of support varies from −1 to 1. For a positive value, one may say that φ is positively related to ψ; for a negative value, one may say that φ is negatively related to ψ.

(4) The mutual support of φ and ψ is defined by:

MS(φ ↔ ψ) = MS(φ, ψ) = |m(φ) ∩ m(ψ)| / |m(φ) ∪ m(ψ)|

Obviously, 0 ≤ MS(φ, ψ) ≤ 1. One may interpret it as a measure of the strength of the double implication φ ↔ ψ; it measures the degree to which φ causes, and only causes, ψ.

4.2. Interpretation of association rules

Rules such as association rules and decision rules can be accurately interpreted with granular computing. From the perspective of granular computing, we revisit association rules. Traditional association rules were defined on transaction data sets [10]; Agrawal first interpreted the data as a market basket dataset. In fact, a transaction data set can be transformed into a binary information table; see Table 2.

Definition 8. Items. Let I = {i_1, i_2, ..., i_m} be a set of binary attributes, called items. Let T be a database of transactions, T = {T_1, T_2, ..., T_N}. Each transaction t (t ∈ T) is represented as a binary vector, with t[k] = 1 if t possesses the item i_k and t[k] = 0 otherwise; there is one tuple in the database for each transaction. Let X be a set of items in I. We say that a transaction t satisfies X if t[k] = 1 for all items i_k in X. A k-itemset is an itemset containing k items.

Definition 9. Support count. The support count of an itemset X is defined by σ(X) = |{T_i | X ⊆ T_i, T_i ∈ T}|, where |·| is the cardinality of a set.

The support count is an important property of an itemset.

Theorem 3. In a binary information table, let X be an itemset in I with k items (i_x1, i_x2, ..., i_xk). X maps to a formula χ, namely i_x1 = 1 ∧ i_x2 = 1 ∧ ... ∧ i_xk = 1; then the support count of the itemset X is the cardinality of the meaning set of the formula χ, i.e. |m(χ)|.

Definition 10. Association rules. An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

X and Y are called the antecedent and the consequent, respectively; the antecedent X is also called the LHS (left hand side), and the consequent Y the RHS (right hand side). The support and the confidence are two measures used in association rules mining: support indicates the frequency of the occurring patterns, and confidence denotes the strength of the implication in the rule.

support(X ⇒ Y) = σ(X ∪ Y) / |T|
confidence(X ⇒ Y) = σ(X ∪ Y) / σ(X)

These are indeed the generality and the absolute support:

support(X ⇒ Y) = G(φ ∧ ψ)
confidence(X ⇒ Y) = AS(φ → ψ)

where φ and ψ are the formulas mapped from X and Y, respectively, according to Theorem 3. By specifying threshold values of support and confidence, one can obtain all association rules whose support and confidence are above the thresholds. Finding frequent itemsets is the key step of association rules mining. In data streams, the association rules vary with the frequent itemsets: some frequent itemsets may become infrequent, and some currently infrequent itemsets may become frequent later. We measure association rules in data streams based on granular drifting.
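The four measures and the support/confidence correspondence can be computed directly from meaning sets. This is a hypothetical sketch in our own notation (transactions are identified by integer ids), not the paper's code:

```python
from typing import Set

def generality(m_phi: Set[int], T: Set[int]) -> float:
    """G(phi) = |m(phi)| / |T|."""
    return len(m_phi) / len(T)

def absolute_support(m_phi: Set[int], m_psi: Set[int]) -> float:
    """AS(psi | phi) = |m(phi) & m(psi)| / |m(phi)|; requires m(phi) nonempty."""
    return len(m_phi & m_psi) / len(m_phi)

def change_of_support(m_phi: Set[int], m_psi: Set[int], T: Set[int]) -> float:
    """CS(psi | phi) = AS(psi | phi) - G(psi), ranging over [-1, 1]."""
    return absolute_support(m_phi, m_psi) - generality(m_psi, T)

def mutual_support(m_phi: Set[int], m_psi: Set[int]) -> float:
    """MS(phi, psi) = |m(phi) & m(psi)| / |m(phi) | m(psi)|."""
    return len(m_phi & m_psi) / len(m_phi | m_psi)

# Toy meaning sets over a universe of 10 transactions:
T = set(range(10))
m_phi = {0, 1, 2, 3}   # m(phi)
m_psi = {2, 3, 4}      # m(psi)
```

With these sets, support(X ⇒ Y) is generality(m_phi & m_psi, T) and confidence(X ⇒ Y) is absolute_support(m_phi, m_psi), mirroring the correspondence stated above.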
Table 1. Contingency table for the rule φ → ψ

         ψ                 ¬ψ                 row totals
φ        |m(φ) ∩ m(ψ)|     |m(φ) ∩ m(¬ψ)|     |m(φ)|
¬φ       |m(¬φ) ∩ m(ψ)|    |m(¬φ) ∩ m(¬ψ)|    |m(¬φ)|
col      |m(ψ)|            |m(¬ψ)|            |T|

Table 2. Binary information table

TID  i1  i2  i3  i4  i5  i6
1    1   1   0   0   0   0
2    1   0   1   1   1   0
3    0   1   1   1   0   1
4    1   1   1   1   0   0
5    1   1   1   0   0   1
Given a support threshold λ_support, in an information table DS the frequent itemsets are those whose support is greater than λ_support. With granular drifting, the meaning sets of the formulas φ → ψ and φ ∧ ψ may expand or shrink. If they shrink, the support descends, and once the support falls below λ_support, a frequent itemset may become infrequent; if they expand, the confidence ascends, which indicates that the association rule becomes stronger. In mining association rules, concepts with low support are not considered in the search for associations. On the other hand, two concepts with low support may have either a large confidence or a large change of support. For example, suppose confidence(X ⇒ Y) = 0.90 while G(Y) = 0.95; then CS(X ⇒ Y) = 0.90 − 0.95 < 0, and we can conclude that X is in fact negatively associated with Y. This suggests that an association rule may not reflect a true association. Exception rules and peculiarity rules have been studied as extensions of association rules to resolve such problems [30,31]; both exception rule mining and peculiarity rule mining aim at finding rules missed by association rule mining.

5. Experiments and model evaluation

In this section, we illustrate how to extract rules with granular computing. Our illustration includes two parts: extracting decision rules from an intrusion detection data stream, and mining association rules from a transaction data stream.

5.1. Decision rules extraction illustration

We used the KDD Cup data downloaded from http://kdd.ics.uci.edu. This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining.
The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. The database contains a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment. The data set has 494,021 transactions, shown in Table 3. We wrote some stored procedures to realize the data streams form of the data set. The length of the time period was varied from 10 transactions to 100 transactions. The data set has 41 attributes: 34 numerical variables and 7 flag variables. The last attribute is the decision attribute, which denotes the corresponding intrusion type; for instance, 'Normal.' shows that the datum is a normal network action, 'Back.' shows that it is a back-door or bad datum, and 'Spy.' shows that it is a spy datum. Our goal is to detect the decision rules in the data set. There are many attributes; however, we pay particular attention to the relation between the intrusion type and the other attributes, so we define the term axis attribute. The axis attributes are those in which people are most interested. For example, the attributes displayed in Table 3 are the axis attributes in
Table 3. KDD Cup 99 data

Protocol_type  Service  Flag  ...  Intrusion_type
Tcp            http     SF    ...  Normal
Tcp            http     RSTR  ...  Back
Tcp            http     SF    ...  Back
Tcp            Private  REJ   ...  Neptune
Tcp            Private  REJ   ...  Satan
Tcp            Private  S0    ...  Neptune
...            ...      ...   ...  ...
our experiments. Of course, people can choose their own axis attributes according to their interests and concrete problems. A lower generality threshold admits more transactions, and a higher absolute support yields stronger decision rules. Setting the generality threshold to 0.1 and the absolute support threshold to 0.90, in the first time period we obtain the strong decision rule {service='http' ∧ protocol_type='tcp' ∧ flag='SF' ⇒ Intrusion_type='Normal.'}, whose absolute support is 1; in the last time period, this rule's absolute support is 0.91. Setting the generality threshold to 0.2, the itemset {protocol_type='tcp', Intrusion_type='Neptune.'} changed from not frequent in the first time period to frequent in the last time period, with generality 0.217. Subsequently we obtain a valuable decision rule {Intrusion_type='Neptune.' ⇒ protocol_type='tcp'}, whose confidence is 1. We interpret some of these results as follows. protocol_type='tcp' ∧ Intrusion_type='Neptune.' is an interesting formula; the meaning set satisfying this formula is a granule, that is, the transactions satisfying the formula belong to the same granule. As time goes forward, the meaning set of the formula becomes stronger, which indicates that {protocol_type='tcp', Intrusion_type='Neptune.'} changes from a non-frequent set to a frequent set. The experiments show that the model works well and accurately interprets decision rules mining in data streams.

5.2. Association rules extraction illustration

The first association rules mining algorithm, proposed by Agrawal, is Apriori [10]; many algorithms have been developed based on the ideas of Apriori [11,12]. Utilizing bitwise operations, we bring the notion of computing with granules into association rules mining; the idea of bitwise operations was investigated earlier by Lin [19]. We illustrate our model on a transaction data stream. Note, however, that when developing a concrete algorithm for data streams, many features of streams data need consideration.
Table 4 shows a transaction data set in one time period of the data stream. Issues of granular computing can be divided into two related aspects: the construction of granules and computing with granules. The former deals with the formation, representation, and interpretation of granules; the latter deals with the utilization of granules in problem solving [3].
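The construction of granules can be sketched as building, for each item, the bit vector of transactions containing it (a sketch with our own helper name, shown on a toy transaction list, not the authors' implementation):

```python
from typing import Dict, List, Set

def build_granules(transactions: List[Set[str]], items: List[str]) -> Dict[str, str]:
    """For each item, build the binary expression of its granule:
    bit k (from the left) is 1 iff transaction k contains the item."""
    return {
        item: "".join("1" if item in t else "0" for t in transactions)
        for item in items
    }

# Toy stream batch: each set is one transaction.
batch = [{"i2", "i4", "i5"}, {"i1", "i4"}, {"i3", "i4"}]
granules = build_granules(batch, ["i1", "i2", "i3", "i4", "i5"])
# granules["i4"] is "111": i4 occurs in every transaction of this batch.
```

Computing with granules then reduces to bitwise operations on these strings, as illustrated below.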
Table 4. Transaction data set #1

TID  i1  i2  i3  i4  i5
001  0   1   0   1   1
002  1   0   0   1   0
003  0   0   1   1   0
004  0   1   0   1   0
005  0   1   1   0   0
006  0   0   1   1   0
007  0   1   1   0   1
008  0   1   1   1   1
009  0   1   1   1   0
Table 5. Basic information granules (candidate 1-itemsets); in the binary expression, the leftmost bit corresponds to TID 001

Itemset  Transactions of granule          Binary expression  Size
[i1]     {002}                            010000000          1
[i2]     {001,004,005,007,008,009}        100110111          6
[i3]     {003,005,006,007,008,009}        001011111          6
[i4]     {001,002,003,004,006,008,009}    111101011          7
[i5]     {001,007,008}                    100000110          3
Suppose the support threshold λ_support is 2/9. As in the Apriori algorithm, in this illustration we first generate candidate itemsets and then generate frequent itemsets. By the first scan of the data set, we get the candidate 1-item compound granules; see Table 5. Because the sizes of all these granules satisfy λ_support, the itemsets in Table 5 are all frequent 1-itemsets. Computing with granules is here defined by the bitwise AND operation on the binary expressions of the corresponding granules. As in Apriori, we need joining and pruning in the mining process. Table 6 shows the 2-item compound granules, i.e., the candidate 2-itemsets. By computing the support of the corresponding compound granules, we get the frequent 2-item compound granules, that is, the frequent 2-itemsets; see Table 7. Based on the frequent 2-item compound granules, we generate candidate 3-item compound granules by the bitwise AND operation. We get four new compound granules, [i2,i3,i4], [i2,i3,i5], [i2,i4,i5] and [i3,i4,i5]; however, the itemset [i3,i4,i5] is pruned in the joining process. Because the remaining itemsets in Table 8 satisfy λ_support, these candidates are all frequent sets. Because these itemsets can no longer be joined according to the joining condition, the mining procedure ends. If we set the confidence threshold λ_confidence = 0.9, we obtain the association rule {i3, i5} ⇒ i2.
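The granule intersection used above can be sketched with bit strings (a hypothetical encoding of Table 5; the leftmost bit stands for TID 001, and the helper names are our own):

```python
# Binary expressions of the 1-item granules from Table 5.
granules = {
    "i1": "010000000",
    "i2": "100110111",
    "i3": "001011111",
    "i4": "111101011",
    "i5": "100000110",
}

def intersect(g1: str, g2: str) -> str:
    """Bitwise AND of two granules: the compound granule [g1, g2]."""
    return format(int(g1, 2) & int(g2, 2), f"0{len(g1)}b")

def size(g: str) -> int:
    """Support count = number of set bits in the granule."""
    return g.count("1")

g_23 = intersect(granules["i2"], granules["i3"])  # granule of [i2, i3]
# size(g_23) is the support count of the 2-itemset {i2, i3}.
```

The result matches Table 6: the granule of [i2, i3] is 000010111 with size 4, and [i2, i5] has size 3.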
However, stream data change continuously as time goes on: when the window moves to the subsequent time period, new data enter it and some old data fall out of its range.

Table 6
Candidate 2-item compound granules (candidate 2-itemsets)

Itemsets    Transactions of granules    Binary expression of granules    Size of granules
[i1, i2]    ∅                           000000000                        0
[i1, i3]    ∅                           000000000                        0
[i1, i4]    {002}                       010000000                        1
[i1, i5]    ∅                           000000000                        0
[i2, i3]    {005,007,008,009}           000010111                        4
[i2, i4]    {001,004,008,009}           100100011                        4
[i2, i5]    {001,007,008}               100000110                        3
[i3, i4]    {003,006,008,009}           001001011                        4
[i3, i5]    {007,008}                   000000110                        2
[i4, i5]    {001,008}                   100000010                        2
Table 9
Transaction data set #2

TID   i1   i2   i3   i4   i5
001   0    1    0    1    1
002   1    0    0    1    0
003   0    0    1    1    0
004   1    1    0    1    0
005   0    1    1    0    0
006   0    0    1    1    0
007   0    1    1    0    0
008   0    1    1    1    1
009   0    1    1    1    0
Table 9 shows the information table in the subsequent time period (marked with #2), which follows the information table of the previous period (marked with #1, see Table 4). In the new period some frequent itemsets become infrequent, while, conversely, some previously infrequent itemsets become frequent. The updated association rules are: i1 ⇒ i4, i5 ⇒ i2, i5 ⇒ i4, {i2, i5} ⇒ i4 and {i4, i5} ⇒ i2.
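Granular drifting can be observed directly on the bitmasks. Recomputing the item granules over window #2 (our own illustration; the encoding follows the binary expressions used in the tables, leftmost bit = transaction 001) shows {i3, i5} losing support while i1 ⇒ i4 emerges:

```python
# Illustration only: item granules over transaction set #2 (Table 9).
granules2 = {
    "i1": int("010100000", 2),      # {002,004} -- 004 entered the window
    "i2": int("100110111", 2),
    "i3": int("001011111", 2),
    "i4": int("111101011", 2),
    "i5": int("100000010", 2),      # {001,008} -- 007 left the window
}

def support2(itemset):
    g = (1 << 9) - 1
    for item in itemset:
        g &= granules2[item]        # compound granule = bitwise AND
    return bin(g).count("1")

print(support2(("i3", "i5")))       # 1 -> fell below k_support
print(support2(("i1", "i4")) / support2(("i1",)))   # 1.0 -> i1 => i4 holds
```

The same granule operations thus explain both the disappearance of old rules and the emergence of new ones as the window slides.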
The nature of the association rules' variation in the above transaction data stream is the drifting of granules as time goes by; this granular drifting drives the concept drifting in data streams. In our experiments we illustrate and interpret rules extraction with granular computing in detail, combining the traditional Apriori algorithm with granular computing to mine frequent sets in a dynamic data set. Granular computing is therefore not only a formal method for conceptually modeling data streams mining, but also a practical instrument for extracting rules from data streams. The two illustrations above reflect decision rule mining and association rule mining, respectively. A number of experiments indicate that granular computing serves both as a formal method and tool for conceptually modeling data stream mining and as a technique for problem solving in data mining as well as data streams mining. In practice, characteristics of data streams such as high speed, huge volume, and their continuous and unbounded nature must be considered when designing a concrete algorithm; resource awareness is likewise a significant concern when executing data streams mining algorithms.

6. Conclusion
Table 7
Frequent 2-item compound granules (frequent 2-itemsets)

Itemsets    Transactions of granules    Binary expression of granules    Size of granules
[i2, i3]    {005,007,008,009}           000010111                        4
[i2, i4]    {001,004,008,009}           100100011                        4
[i2, i5]    {001,007,008}               100000110                        3
[i3, i4]    {003,006,008,009}           001001011                        4
[i3, i5]    {007,008}                   000000110                        2
[i4, i5]    {001,008}                   100000010                        2
Table 8
Candidate 3-item compound granules (candidate 3-itemsets)

Itemsets       Transactions of granules    Binary expression of granules    Size of granules
[i2, i3, i4]   {008,009}                   000000011                        2
[i2, i3, i5]   {007,008}                   000000110                        2
[i2, i4, i5]   {001,008}                   100000010                        2
It is important to study formal and mathematical modeling of data mining as well as data streams mining. Lin and Yao have advocated devoting more attention to the modeling of data mining. The main contribution of this paper is a data stream oriented decision logic language, DS–DL, whose model and satisfiability are described in Tarski's style. We then propose a model for data streams mining based on DS–DL, and analyze the construction of and computation with granules in detail. Addressing the problem of concept drifting in data streams mining, we introduce the notion of granular drifting, with which concept drifting can be precisely interpreted in this model. Research on the conceptual modeling of data streams mining benefits understanding of, and insight into, the nature of data streams mining, and new techniques and algorithms based on granular computing could be developed within this model. In the future, we will further investigate and develop
methods of granular computing for data mining as well as data streams mining.
References

[1] Y.Y. Yao, Potential applications of granular computing in knowledge discovery and data mining, in: Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, 1999, pp. 573–580.
[2] Y.Y. Yao, Granular computing: basic issues and possible solutions, in: Proceedings of the 5th Joint Conference on Information Sciences, 2000, pp. 186–189.
[3] Y.Y. Yao, On modeling data mining with granular computing, in: Proceedings of the 25th Annual International Computer Software and Applications Conference, 2001, pp. 638–643.
[4] L.A. Zadeh, Fuzzy sets and information granulation, in: M. Gupta, R.K. Ragade, R.R. Yager (Eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, 1979, pp. 3–18.
[5] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356.
[6] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90 (2) (1997) 111–127.
[7] T.Y. Lin, From rough sets and neighborhood systems to information granulation and computing in words, in: Proceedings of the European Congress on Intelligent Techniques and Soft Computing, 1997, pp. 1602–1607.
[8] Y.Y. Yao, N. Zhong, Granular computing using information tables, in: T.Y. Lin, Y.Y. Yao, L.A. Zadeh (Eds.), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, Heidelberg, 2002, pp. 102–124.
[9] M.H. Dunham, Data Mining: Introductory and Advanced Topics, Tsinghua University Press, Beijing, 2005.
[10] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.
[11] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[12] J.W. Han, J. Pei, Y.W. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000, pp. 1–12.
[13] X.D. Zhu, C. Zheng, Research on hash pruning algorithm of association rules, Journal of Anhui University (Natural Sciences) 29 (4) (2005) 20–23.
[14] N. Jiang, L. Gruenwald, Research issues in data stream association rule mining, SIGMOD Record 35 (1) (2006) 14–19.
[15] X.D. Zhu, Z.Q. Huang, H. Zhou, Design of a multi-agent based intelligent intrusion detection system, in: Proceedings of the 1st International Symposium on Pervasive Computing and Applications, Urumchi, 2006, pp. 290–295.
[16] J.H. Chang, W.S. Lee, A sliding window method for finding recently frequent itemsets over online data streams, Journal of Information Science and Engineering 20 (4) (2004) 753–762.
[17] J.H. Chang, W.S. Lee, A. Zhou, Finding recent frequent itemsets adaptively over online data streams, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 487–492.
[18] G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in: Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002, pp. 346–357.
[19] T.Y. Lin, E. Louie, Modeling the real world for data mining: granular computing approach, in: Proceedings of the 9th IFSA World Congress and 20th NAFIPS International Conference, 2001, pp. 3044–3049.
[20] M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, Mining data streams: a review, SIGMOD Record 34 (2) (2005) 18–26.
[21] S. Guha, N. Koudas, K. Shim, Data streams and histograms, in: Proceedings of the ACM Symposium on Theory of Computing, 2001, pp. 471–475.
[22] C.-H. Lin, D.-Y. Chiu, Y.-H. Wu, A.L.P. Chen, Mining frequent itemsets from data streams with a time-sensitive sliding window, in: Proceedings of the SIAM International Conference on Data Mining, 2005, pp. 68–79.
[23] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic, Dordrecht, 1991.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[25] Y.Y. Yao, Mining high order decision rules, in: M. Inuiguchi, S. Hirano, S. Tsumoto (Eds.), Rough Set Theory and Granular Computing, Springer, Berlin, 2003, pp. 125–135.
[26] Y.Y. Yao, Y. Sai, Mining ordering rules using rough set theory, in: T. Terano, T. Nishida, A. Namatame, S. Tsumoto, Y. Ohsawa, T. Washio (Eds.), Lecture Notes in Computer Science 2253, Springer, Berlin, 2001, pp. 316–321.
[27] H. Wang, W. Fan, P.S. Yu, J.W. Han, Mining concept-drifting data streams using ensemble classifiers, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 226–235.
[28] B.R. Gaines, The trade-off between knowledge and data in knowledge acquisition, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT Press, 1991, pp. 491–506.
[29] C. Silverstein, S. Brin, R. Motwani, Beyond market baskets: generalizing association rules to dependence rules, Data Mining and Knowledge Discovery 2 (1) (1998) 39–68.
[30] E. Suzuki, Autonomous discovery of reliable exception rules, in: Proceedings of KDD'97, 1997, pp. 259–262.
[31] N. Zhong, Y.Y. Yao, S. Ohsuga, Peculiarity oriented multi-database mining, in: Proceedings of PKDD'99, 1999, pp. 136–146.