Knowledge-Based Systems 21 (2008) 934–940
Conceptual modeling rules extracting for data streams

Xiao-Dong Zhu*, Zhi-Qiu Huang

College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Article info

Article history: Received 18 April 2007; Received in revised form 30 March 2008; Accepted 13 April 2008; Available online 20 April 2008

Keywords: Granular computing; Data streams mining; Conceptual modeling; Rules extraction; Knowledge discovery
Abstract

In a growing number of applications, including network traffic monitoring, network intrusion detection, sensor networks, fraudulent transaction detection, and financial monitoring, data take the form of continuous data streams rather than traditional stored databases. People are interested in the potential rules in data streams, such as association rules and decision rules. Compared with the extensive work on developing algorithms for data streams mining, little attention has been paid to modeling data mining and data streams mining. Considering the problem of conceptually modeling data streams mining, we put forward a data streams oriented decision logic language as a granular computing formalism and a rules extracting model based on granular computing. In this model, we propose the notion of granular drifting, which accurately interprets the concept drifting problem in data streams. The model helps in understanding the nature of data streams mining, and new algorithms and techniques of data streams mining can be developed on top of it. © 2008 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +86 25 84896263. E-mail address: [email protected] (X.-D. Zhu).
doi:10.1016/j.knosys.2008.04.003

1. Introduction

Research on data streams mining has recently gained much attention due to its important applications and the increasing generation of streaming data in areas such as sensor networks, web clicks, and computer network traffic. One task of knowledge discovery and data mining is to search for knowledge, patterns, and regularities derivable from large datasets, so in data streams mining people are likewise interested in the rules and knowledge hidden in data streams. Data streams mining differs from traditional data mining in its characteristics: high speed, great volume, constrained memory, and limited power are obvious characteristics of data streams, and they pose challenges to the algorithms, tools, models, and data structures designed for traditional data mining. Association rules mining and decision rules mining are important problems in data mining, and hence in data streams mining. However, compared with the many techniques and algorithms that have been proposed, little work has been done on modeling data mining, and this lack of conceptual modeling may jeopardize further development of the field. Granular computing is a label for theories, methodologies, techniques, and tools that make use of granules in the process of problem solving [1,2]. Yao modeled data mining with granular computing, but his model addresses traditional stored data rather than data streams [3]. In this paper, we put forward a data streams mining model based on granular computing. A notion of granular drifting is proposed to interpret the concept drifting in
data streams. We also present experiments to validate the model. The remainder of the paper is organized as follows. Section 2 introduces related work. In Section 3, we present a data stream oriented decision logic language, DS–DL, described by a model and satisfiability in Tarski's style; utilizing DS–DL as the formal method of granular computing, we model data streams mining, and the notion of granular drifting is proposed in that section. In Section 4, we interpret rules extraction in data streams and illustrate how to acquire interesting rules in data streams using granular computing. Section 5 reports our experiments with granular computing and gives the experimental evaluation. Section 6 presents our conclusion and future work.

2. Related work

Granular computing is a label for theories, methodologies, techniques, and tools that make use of granules in the process of problem solving. Zadeh first proposed the notion of information granulation [4]; however, this notion did not receive much attention for more than 10 years. In 1982, Pawlak advanced the theory of rough sets [5], which provides a concrete example of granular computing; to some extent, rough set theory made more people realize the importance of the notion of granulation. In 1997, Zadeh revisited information granulation [6], and Lin suggested the term granular computing to label this new and growing research field [7]. Yao and Zhong proposed granular computing methods and modeled data mining with them [3,8]. Association rules mining and decision rules mining are important problems in both static data mining and data streams mining [9].
Many algorithms have been presented on this research focus. The first algorithm for association rule mining was proposed by Agrawal [10]. Since then, many researchers have been interested in the association rules hidden in large databases, and significant algorithms have been developed to discover association rules [11–14]. Zhu designed and implemented an intrusion detection system to detect network intrusion packets [15]; association rules and decision rules mining techniques were adopted in that system. In recent years, algorithms for mining frequent sets and association rules in data streams have also been developed [16–18]. Yao and Lin advocated more attention to conceptual modeling of data mining. Yao modeled data mining with granular computing; in fact, he utilized a traditional decision logic language as the granular computing approach, accurately describing data sets and interpreting a concept in terms of its intension and extension [3,8]. Lin introduced the relation between data mining and granular computing and presented a modeling method based on neighborhood systems [19]. Yao and Lin contributed much to granular computing and the modeling of data mining; however, their work and models are based on traditional data sets. In this paper, we present conceptual models of data streams mining and precisely interpret the rules extracting process in data streams.
3. Modeling data streams mining with granular computing

3.1. Introduction to data streams

Intelligent data analysis has passed through several stages. Statistical exploratory data analysis represents the first stage; its goal was to explore the available data in order to test a specific hypothesis [20]. With the advance of artificial intelligence and database techniques, many new data analysis problems have been addressed, one of which is knowledge extraction from very large databases. Data mining is the interdisciplinary field of research that extracts models and patterns from large amounts of information stored in data repositories; this statistical data mining can be regarded as the second stage of intelligent data analysis. Advances in networking and parallel computation have led to distributed and parallel data mining, whose goal is to extract knowledge from different subsets of a dataset and integrate the generated knowledge structures into a global model of the whole dataset. In recent years, data take the form of continuous data streams rather than traditional stored databases in a growing number of applications, including network traffic monitoring, network intrusion detection, sensor networks, fraudulent transaction detection, and financial monitoring. Data in these applications share several distinguishing features, such as huge volumes and unpredictable arrivals, and data generation rates are faster than ever before. These continuous stream data require online or real-time processing instead of first storing the data in a static repository. A data stream is an ordered sequence of items that arrive in timely order. Differing from data in traditional static databases, data streams are continuous, unbounded, usually arrive at high speed, and have a data distribution that often changes with time [21]. Data in data streams are called streams data.
As the number of applications mining data streams grows rapidly, there is an increasing need to perform association rules mining on data streams [14]. A typical data stream is illustrated in Fig. 1, which shows a sliding window model of data processing adopted by many data streams mining algorithms. In [16,22], the authors use this method to obtain the frequent itemsets of the data stream within the current sliding window. In this model, only the part of the data stream within the sliding window is stored and processed as the data flows in. The size of the sliding window may be decided according to the application and system resources. The mining result of the sliding window method depends entirely on recently generated transactions in the range of the window; all transactions in the window must be maintained so that their effects on the current mining results can be removed once they fall out of the window.

Fig. 1. Data streams (a sliding window over the stream: history time window, current time window, and the current batch of transactions).

Association rules obtained from a static database are generally invariable, while in data streams association rules vary as time goes on. Finding frequent itemsets is the first step of association rules mining; in data streams mining, by contrast, some frequent itemsets may become infrequent, and some currently infrequent itemsets may become frequent later. As time goes by, streams data vary continuously, so frequent sets and association rules potentially change. This is regarded as the concept drifting problem in data streams. We will describe and interpret concept drifting using the notion of granular drifting.

3.2. Conceptual modeling upon data streams

Data mining was modeled with granular computing by Yao [3]. As is well known, a concept has two aspects, its intension and its extension. Yao interprets the intension of a concept as a formula over an information table and the extension as the set of objects satisfying that formula. He utilized traditional decision logic to construct and interpret granules. However, his work focused on static data. We focus on data streams and propose the notion of granular drifting over dynamic streams data. An information table is a convenient formal method to describe a dataset with hidden knowledge and patterns. We assume that data streams are continuous information tables whose objects are unbounded; however, we use a time period to view the streams data.

Definition 1. A data stream is a six-tuple DS = {TGr, T, At, DS–DL, {V_a | a ∈ At}, {I_a | a ∈ At}}.
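As a rough illustration (a sketch of our own; the class name StreamTable and the window-filtering logic are our assumptions, not the paper's code), the six-tuple can be represented as a per-period information table:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

@dataclass
class StreamTable:
    """One information table DS, covering a single time period TGr."""
    tgr: Tuple[float, float]                  # time period <s_b, s_e>
    transactions: List[Dict[str, Any]] = field(default_factory=list)
    attributes: Set[str] = field(default_factory=set)

    def add(self, timestamp: float, row: Dict[str, Any]) -> None:
        # Only transactions whose time stamp falls inside TGr belong to DS.
        s_b, s_e = self.tgr
        if s_b <= timestamp < s_e:
            self.transactions.append({"ts": timestamp, **row})
            self.attributes.update(row.keys())

    def I(self, a: str, x: Dict[str, Any]) -> Any:
        """Information function I_a: the value of transaction x on attribute a."""
        return x[a]

    @property
    def length(self) -> int:
        # DS.length = |T|
        return len(self.transactions)

# A one-minute TGr; the second transaction arrives outside it and is ignored.
ds = StreamTable(tgr=(0.0, 60.0))
ds.add(5.0, {"protocol_type": "tcp"})
ds.add(120.0, {"protocol_type": "udp"})
```

Here an information table is identified by its TGr, and DS.length is simply the cardinality of T, matching the description below.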
Here TGr is a time period, a time section that may be set to one second, one minute, one hour, and so on. TGr implies a time granule, as defined in Definition 7. A TGr decides the length of the information table DS of a concrete data stream; an information table DS is identified by its TGr. T is the set of transactions in a concrete TGr; we identify transactions in T by their time stamps. The length of DS equals the cardinality of the set T and is denoted DS.length. At is a finite nonempty set of attributes. DS–DL is a language defined over the attributes in At, which is defined in the next subsection. V_a is a nonempty set of values for a ∈ At. I_a is an information (mapping) function from T to V_a.

3.3. Data streams oriented decision logic language

We define a language DS–DL, which is similar to the language studied by Yao [3,8] and the decision logic language studied by
Pawlak [23] for describing data streams. In DS–DL, an atomic formula is a binary tuple (a, v), where a ∈ At and v ∈ V_a. The semantics of DS–DL can be defined in Tarski's style through the notions of a model and satisfiability. The model is an information table DS, described in the subsection above, which provides the interpretation for the symbols and formulas of DS–DL. The satisfiability of a formula φ by a transaction x is written x ⊨_DS φ, or x ⊨ φ for short.

Definition 2. Formulas. Formulas in DS–DL are obtained through the following three rules:

(1) (a, v) is an atomic formula, where a ∈ At and v ∈ V_a;
(2) if φ and ψ are formulas, then ¬φ, φ ∧ ψ, φ ∨ ψ, φ → ψ and φ ↔ ψ are formulas;
(3) only compounds generated by the first and second rules in finitely many steps are formulas.

Definition 3. Satisfiability. Satisfiability is defined by the following conditions:

(1) x ⊨ (a, v) iff I_a(x) = v;
(2) x ⊨ ¬φ iff not x ⊨ φ;
(3) x ⊨ φ ∧ ψ iff x ⊨ φ and x ⊨ ψ;
(4) x ⊨ φ ∨ ψ iff x ⊨ φ or x ⊨ ψ;
(5) x ⊨ φ → ψ iff x ⊨ ¬φ ∨ ψ;
(6) x ⊨ φ ↔ ψ iff x ⊨ φ → ψ and x ⊨ ψ → φ.

In this paper, iff denotes "if and only if". Formulas such as φ ∧ ψ and (φ ∧ ψ) → χ are called combined formulas; atomic formulas and combined formulas are jointly called formulas.

Definition 4. T is satisfiable on a formula φ iff ∃x ∈ T, x ⊨_DS φ.

Definition 5. A formula φ is true iff every transaction x in T satisfies φ; that is, φ is true upon the current DS iff ∀x ∈ T, x ⊨_DS φ.

Definition 6. For a formula φ, the set m(φ) = {x ∈ T | x ⊨_DS φ} is called the meaning set of the formula φ in DS.

Theorem 1. The following properties hold:

(1) m((a, v)) = {x ∈ T | I_a(x) = v};
(2) m(¬φ) = T − m(φ);
(3) m(φ ∧ ψ) = m(φ) ∩ m(ψ);
(4) m(φ ∨ ψ) = m(φ) ∪ m(ψ);
(5) m(φ → ψ) = m(¬φ ∨ ψ);
(6) m(φ ↔ ψ) = m(φ → ψ) ∩ m(ψ → φ).

Theorem 2. If T in a DS is satisfiable on a formula φ, then m(φ) is not empty. If a formula φ is true for the current DS, then m(φ) is T itself.

Definition 7. A time period TGr is defined by a time section, written <s_b, s_e>, where s_b is the begin time and s_e is the end time of the period. Usually one second, one minute or one hour is used as a time period, according to the application and system resources. The time period is a sliding window in terms of the data processing model. From the perspective of granular computing, this time period is regarded as a time granule: a one-hour granularity is rougher than a one-second granularity, a month is rougher than a day, and so on. A time granularity decides the length of DS (i.e. DS.length) for a concrete data stream, because a rougher time granularity contains more data. In some areas, such as sensor networks and stock data streams, the streams data are generated periodically, so a granularity can be equally partitioned into scales, with one transaction generated at every time scale.

3.4. Granular drifting
In this subsection, the notion of granular drifting is presented, and we model concept drifting problems with granular drifting. Concepts have a formal and precise description in the language DS–DL. A concept contains two aspects, its intension and its extension. A concept over an information table is defined as a binary pair <φ, m(φ)>, where φ is a formula over the information table DS. The former element φ is a description of m(φ), that is, the intension of the concept; the latter element m(φ) is the set of satisfying transactions, that is, the extension of the concept. Within one time period, transactions are invariable; the data look like a temporarily static data set. In a time period TGr, transaction data sets are regarded as static, but when the time period changes, the transaction data are updated. From the standpoint of time periods, the updates can only be additions and deletions, because a modification can be expressed as a deletion followed by an addition. If transaction data have been added, the extension of a concept may expand or hold; if transaction data have been deleted, the extension may shrink or hold; if additions and deletions happen in the same time period, the extension may expand, shrink or hold. In a word, the extension of a concept changes. Now we interpret this process with granular computing. Assume the current time period is <s_b, s_e>; within this period, a concept <φ, m(φ)> is invariable. As time goes on, the current time period becomes <s'_b, s'_e>; the period <s_b, s_e> becomes a history time period, and <s'_b, s'_e> is called the direct subsequence of <s_b, s_e>.
In an information table, if transactions x and y satisfy a formula φ simultaneously, they may be put into the same granule. In fact, the formula φ decides an equivalence class of transactions satisfying it. Within a time period, this equivalence class is invariable; in the subsequent time period, however, the equivalence class changes because the transaction data are updated. We use m(φ) to represent the set of transactions that satisfy φ. It has the following two properties:

(1) m(φ) is stronger than before iff m(φ) has expanded;
(2) m(φ) is weaker than before iff m(φ) has shrunk.

If m(φ) is stronger, we say the concept <φ, m(φ)> is stronger; on the contrary, if m(φ) is weaker, the concept <φ, m(φ)> is weaker. In general, we say the concept <φ, m(φ)> is drifting. Therefore, the concept drifts in data streams as the time periods go on. Concept drifting in data streams is actually decided by changes in the extension of the concept, that is, the meaning set of the concept drifts. We call this process granular drifting.
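As a toy sketch (our own construction, not the authors' code; formulas are encoded as predicates), the meaning set m(φ) and the stronger/weaker test between two consecutive time periods can be written as:

```python
from typing import Any, Callable, Dict, List

Transaction = Dict[str, Any]
Formula = Callable[[Transaction], bool]   # x |= phi, encoded as a predicate

def atom(a: str, v: Any) -> Formula:
    """Atomic formula (a, v): satisfied when I_a(x) = v."""
    return lambda x: x.get(a) == v

def conj(phi: Formula, psi: Formula) -> Formula:
    """phi AND psi."""
    return lambda x: phi(x) and psi(x)

def meaning(phi: Formula, T: List[Transaction]) -> List[Transaction]:
    """m(phi) = {x in T | x |= phi}."""
    return [x for x in T if phi(x)]

def drift(phi: Formula, T_old: List[Transaction], T_new: List[Transaction]) -> str:
    """Granular drifting: compare |m(phi)| across two time periods."""
    old, new = len(meaning(phi, T_old)), len(meaning(phi, T_new))
    if new > old:
        return "stronger"   # extension expanded
    if new < old:
        return "weaker"     # extension shrunk
    return "hold"

# Two consecutive windows of a hypothetical stream:
phi = atom("p", "tcp")
T1 = [{"p": "tcp"}, {"p": "udp"}]
T2 = [{"p": "tcp"}, {"p": "tcp"}, {"p": "udp"}]
```

Here drift(phi, T1, T2) reports the concept <φ, m(φ)> getting stronger, exactly in the sense of properties (1) and (2) above.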
4. Interpretation of rules extraction
4.1. Quantity measures of rules

Association rules and decision rules represent expected knowledge. Depending on the meanings and forms of rules, one may classify rules in many ways, and a clear classification of rules is useful for understanding the basic tasks of machine learning and data mining. Rules can be classified into two types. A type 1 rule states that "if the value of an object is v_a on attribute a, then the value of the object is v_b on attribute b"; a type 2 rule states that "if two objects have the same value on attribute a, then they have the same value on attribute b". An association rule, which states that the presence of one set of items implies the presence of another set of items, is a typical type 1 rule [10]. Decision rules from decision tree learning algorithms are also examples of type 1 [24]. High order decision rules, which focus on pairs of objects, represent type 2 [25,26]. The difference between traditional association rules and data streams association rules is the concept drifting problem in data streams [27]. In data mining, rules are typically interpreted in terms of conditional probability. Using the cardinalities of sets, we obtain the contingency Table 1, representing the quantitative information about the rule φ → ψ. The quantitative measures generality, absolute support, change of support and mutual support can be defined on this table. The 2×2 contingency table has been used by many authors for representing information about rules [1,28,29]. Here |·| denotes the cardinality of a set.

(1) The generality of a concept <φ, m(φ)> is defined by:

G(φ) = |m(φ)| / |T|

It indicates the relative size of the concept <φ, m(φ)>. A concept is more general if it covers more instances of the universe T. Obviously, 0 ≤ G(φ) ≤ 1.

(2) The absolute support of ψ provided by φ is defined by:

AS(φ → ψ) = AS(ψ|φ) = |m(φ) ∩ m(ψ)| / |m(φ)|

Here m(φ) ≠ ∅. The quantity AS(ψ|φ) states the degree to which φ supports ψ. It may be viewed as the conditional probability that a randomly selected element satisfies ψ given that it satisfies φ; in set-theoretic terms, it is the degree to which m(φ) is included in m(ψ). Clearly, AS(ψ|φ) = 1 iff m(φ) ≠ ∅ and m(φ) ⊆ m(ψ).

(3) The change of support of ψ provided by φ is defined by:

CS(φ → ψ) = CS(ψ|φ) = AS(ψ|φ) − G(ψ)

The change of support varies from −1 to 1. For a positive value, one may say that φ is positively related to ψ; for a negative value, one may say that φ is negatively related to ψ.

(4) The mutual support of φ and ψ is defined by:

MS(φ ↔ ψ) = MS(φ, ψ) = |m(φ) ∩ m(ψ)| / |m(φ) ∪ m(ψ)|

Obviously, 0 ≤ MS(φ, ψ) ≤ 1. One may interpret it as a measure of the strength of the double implication φ ↔ ψ; it measures the degree to which φ causes, and only causes, ψ.

4.2. Interpretation of association rules

Rules such as association rules and decision rules can be accurately interpreted with granular computing. From the perspective of granular computing, we revisit association rules. Traditional association rules were defined on transaction data sets [10]; Agrawal first interpreted the data as a market basket dataset. In fact, a transaction data set can be transformed into a binary information table; see Table 2.

Definition 8. Items. Let I = {i_1, i_2, ..., i_m} be a set of binary attributes, called items. Let T be a database of transactions, T = {T_1, T_2, ..., T_N}. Each transaction t (t ∈ T) is represented as a binary vector, with t[k] = 1 if t possesses the item i_k and t[k] = 0 otherwise; there is one tuple in the database for each transaction. Let X be a set of items in I. We say that a transaction t satisfies X if t[k] = 1 for all items i_k in X. A k-itemset is an itemset containing k items.

Definition 9. Support count. The support count of an itemset X is defined by σ(X) = |{T_i | X ⊆ T_i, T_i ∈ T}|, where |·| is the cardinality of a set.

The support count is an important property of an itemset.

Theorem 3. In a binary information table, let X be an itemset in I with k items (i_x1, i_x2, ..., i_xk). X maps to a formula χ, namely i_x1 = 1 ∧ i_x2 = 1 ∧ ... ∧ i_xk = 1; then the support count of the itemset X is the cardinality of the meaning set of the formula χ, i.e. |m(χ)|.

Definition 10. Association rules. An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

X and Y are called the antecedent and the consequent, respectively; the antecedent X is also called the LHS (left hand side), and the consequent Y the RHS (right hand side). The support and the confidence are two measures used in association rules mining: support indicates the frequency of the occurring patterns, and confidence denotes the strength of the implication in the rule.

support(X ⇒ Y) = σ(X ∪ Y) / |T|
confidence(X ⇒ Y) = σ(X ∪ Y) / σ(X)

These are indeed the generality and the absolute support:

support(X ⇒ Y) = G(φ ∧ ψ)
confidence(X ⇒ Y) = AS(φ → ψ)

where φ and ψ are the formulas mapped from X and Y, respectively, according to Theorem 3. By specifying threshold values of support and confidence, one can obtain all association rules whose support and confidence are above the thresholds. Finding frequent itemsets is the key step of association rules mining. In data streams, the association rules vary with the frequent itemsets: some frequent itemsets may become infrequent, and some currently infrequent itemsets may become frequent later. We measure association rules in data streams based on granular drifting.
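The four measures and the support/confidence correspondence can be computed directly from meaning sets. This is a hypothetical sketch in our own notation (transactions are identified by integer ids), not the paper's code:

```python
from typing import Set

def generality(m_phi: Set[int], T: Set[int]) -> float:
    """G(phi) = |m(phi)| / |T|."""
    return len(m_phi) / len(T)

def absolute_support(m_phi: Set[int], m_psi: Set[int]) -> float:
    """AS(psi | phi) = |m(phi) & m(psi)| / |m(phi)|; requires m(phi) nonempty."""
    return len(m_phi & m_psi) / len(m_phi)

def change_of_support(m_phi: Set[int], m_psi: Set[int], T: Set[int]) -> float:
    """CS(psi | phi) = AS(psi | phi) - G(psi), ranging over [-1, 1]."""
    return absolute_support(m_phi, m_psi) - generality(m_psi, T)

def mutual_support(m_phi: Set[int], m_psi: Set[int]) -> float:
    """MS(phi, psi) = |m(phi) & m(psi)| / |m(phi) | m(psi)|."""
    return len(m_phi & m_psi) / len(m_phi | m_psi)

# Toy meaning sets over a universe of 10 transactions:
T = set(range(10))
m_phi = {0, 1, 2, 3}   # m(phi)
m_psi = {2, 3, 4}      # m(psi)
```

With these sets, support(X ⇒ Y) is generality(m_phi & m_psi, T) and confidence(X ⇒ Y) is absolute_support(m_phi, m_psi), mirroring the correspondence stated above.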
Table 1. Contingency table for the rule φ → ψ

         ψ                 ¬ψ                 row totals
φ        |m(φ) ∩ m(ψ)|     |m(φ) ∩ m(¬ψ)|     |m(φ)|
¬φ       |m(¬φ) ∩ m(ψ)|    |m(¬φ) ∩ m(¬ψ)|    |m(¬φ)|
col      |m(ψ)|            |m(¬ψ)|            |T|

Table 2. Binary information table

TID  i1  i2  i3  i4  i5  i6
1    1   1   0   0   0   0
2    1   0   1   1   1   0
3    0   1   1   1   0   1
4    1   1   1   1   0   0
5    1   1   1   0   0   1
Given a support threshold λ_support, in an information table DS the frequent itemsets are those whose support is greater than λ_support. With granular drifting, the meaning sets of the formulas φ → ψ and φ ∧ ψ may expand or shrink. If they shrink, the support descends, and once the support falls below λ_support, a frequent itemset may become infrequent; if they expand, the confidence ascends, which indicates that the association rule becomes stronger. In mining association rules, concepts with low support are not considered in the search for associations. On the other hand, two concepts with low support may have either a large confidence or a large change of support. For example, suppose confidence(X ⇒ Y) = 0.90 while G(Y) = 0.95; then CS(X ⇒ Y) = 0.90 − 0.95 < 0, and we can conclude that X is in fact negatively associated with Y. This suggests that an association rule may not reflect a true association. Exception rules and peculiarity rules have been studied as extensions of association rules to resolve such problems [30,31]; both exception rule mining and peculiarity rule mining aim at finding rules missed by association rule mining.

5. Experiments and model evaluation

In this section, we illustrate how to extract rules with granular computing. Our illustration includes two parts: extracting decision rules from an intrusion detection data stream, and mining association rules from a transaction data stream.

5.1. Decision rules extraction illustration

We used the KDD Cup data downloaded from http://kdd.ics.uci.edu. This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining.
The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. The database contains a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment. The data set has 494,021 transactions, shown in Table 3. We wrote some stored procedures to realize the data streams form of the data set. The length of the time period was varied from 10 transactions to 100 transactions. The data set has 41 attributes: 34 numerical variables and 7 flag variables. The last attribute is the decision attribute, which denotes the corresponding intrusion type; for instance, 'Normal.' shows that the datum is a normal network action, 'Back.' shows that it is a back-door or bad datum, and 'Spy.' shows that it is a spy datum. Our goal is to detect the decision rules in the data set. There are many attributes; however, we pay particular attention to the relation between the intrusion type and the other attributes, so we define the term axis attribute. The axis attributes are those in which people are most interested. For example, the attributes displayed in Table 3 are the axis attributes in
Table 3. KDD Cup 99 data

Protocol_type  Service  Flag  ...  Intrusion_type
Tcp            http     SF    ...  Normal
Tcp            http     RSTR  ...  Back
Tcp            http     SF    ...  Back
Tcp            Private  REJ   ...  Neptune
Tcp            Private  REJ   ...  Satan
Tcp            Private  S0    ...  Neptune
...            ...      ...   ...  ...
our experiments. Of course, people can choose their own axis attributes according to their interests and concrete problems. A lower generality threshold admits more transactions, and a higher absolute support yields stronger decision rules. Setting the generality threshold to 0.1 and the absolute support threshold to 0.90, in the first time period we obtain the strong decision rule {service='http' ∧ protocol_type='tcp' ∧ flag='SF' ⇒ Intrusion_type='Normal.'}, whose absolute support is 1; in the last time period, this rule's absolute support is 0.91. Setting the generality threshold to 0.2, the itemset {protocol_type='tcp', Intrusion_type='Neptune.'} changed from not frequent in the first time period to frequent in the last time period, with generality 0.217. Subsequently we obtain a valuable decision rule {Intrusion_type='Neptune.' ⇒ protocol_type='tcp'}, whose confidence is 1. We interpret some of these results as follows. protocol_type='tcp' ∧ Intrusion_type='Neptune.' is an interesting formula; the meaning set satisfying this formula is a granule, that is, the transactions satisfying the formula belong to the same granule. As time goes forward, the meaning set of the formula becomes stronger, which indicates that {protocol_type='tcp', Intrusion_type='Neptune.'} changes from a non-frequent set to a frequent set. The experiments show that the model works well and accurately interprets decision rules mining in data streams.

5.2. Association rules extraction illustration

The first association rules mining algorithm, proposed by Agrawal, is Apriori [10]; many algorithms have been developed based on the ideas of Apriori [11,12]. Utilizing bitwise operations, we bring the notion of computing with granules into association rules mining; the idea of bitwise operations was investigated earlier by Lin [19]. We illustrate our model on a transaction data stream. Note, however, that when developing a concrete algorithm for data streams, many features of streams data need consideration.
Table 4 shows a transaction data set in one time period of the data stream. Issues of granular computing can be divided into two related aspects: the construction of granules and computing with granules. The former deals with the formation, representation, and interpretation of granules; the latter deals with the utilization of granules in problem solving [3].
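The construction of granules can be sketched as building, for each item, the bit vector of transactions containing it (a sketch with our own helper name, shown on a toy transaction list, not the authors' implementation):

```python
from typing import Dict, List, Set

def build_granules(transactions: List[Set[str]], items: List[str]) -> Dict[str, str]:
    """For each item, build the binary expression of its granule:
    bit k (from the left) is 1 iff transaction k contains the item."""
    return {
        item: "".join("1" if item in t else "0" for t in transactions)
        for item in items
    }

# Toy stream batch: each set is one transaction.
batch = [{"i2", "i4", "i5"}, {"i1", "i4"}, {"i3", "i4"}]
granules = build_granules(batch, ["i1", "i2", "i3", "i4", "i5"])
# granules["i4"] is "111": i4 occurs in every transaction of this batch.
```

Computing with granules then reduces to bitwise operations on these strings, as illustrated below.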
Table 4. Transaction data set #1

TID  i1  i2  i3  i4  i5
001  0   1   0   1   1
002  1   0   0   1   0
003  0   0   1   1   0
004  0   1   0   1   0
005  0   1   1   0   0
006  0   0   1   1   0
007  0   1   1   0   1
008  0   1   1   1   1
009  0   1   1   1   0
Table 5. Basic information granules (candidate 1-itemsets); in the binary expression, the leftmost bit corresponds to TID 001

Itemset  Transactions of granule          Binary expression  Size
[i1]     {002}                            010000000          1
[i2]     {001,004,005,007,008,009}        100110111          6
[i3]     {003,005,006,007,008,009}        001011111          6
[i4]     {001,002,003,004,006,008,009}    111101011          7
[i5]     {001,007,008}                    100000110          3
Suppose the support threshold λ_support is 2/9. As in the Apriori algorithm, in this illustration we first generate candidate itemsets and then generate frequent itemsets. By the first scan of the data set, we get the candidate 1-item compound granules; see Table 5. Because the sizes of all these granules satisfy λ_support, the itemsets in Table 5 are all frequent 1-itemsets. Computing with granules is here defined by the bitwise AND operation on the binary expressions of the corresponding granules. As in Apriori, we need joining and pruning in the mining process. Table 6 shows the 2-item compound granules, i.e., the candidate 2-itemsets. By computing the support of the corresponding compound granules, we get the frequent 2-item compound granules, that is, the frequent 2-itemsets; see Table 7. Based on the frequent 2-item compound granules, we generate candidate 3-item compound granules by the bitwise AND operation. We get four new compound granules, [i2,i3,i4], [i2,i3,i5], [i2,i4,i5] and [i3,i4,i5]; however, the itemset [i3,i4,i5] is pruned in the joining process. Because the remaining itemsets in Table 8 satisfy λ_support, these candidates are all frequent sets. Because these itemsets can no longer be joined according to the joining condition, the mining procedure ends. If we set the confidence threshold λ_confidence = 0.9, we obtain the association rule {i3, i5} ⇒ i2.
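The granule intersection used above can be sketched with bit strings (a hypothetical encoding of Table 5; the leftmost bit stands for TID 001, and the helper names are our own):

```python
# Binary expressions of the 1-item granules from Table 5.
granules = {
    "i1": "010000000",
    "i2": "100110111",
    "i3": "001011111",
    "i4": "111101011",
    "i5": "100000110",
}

def intersect(g1: str, g2: str) -> str:
    """Bitwise AND of two granules: the compound granule [g1, g2]."""
    return format(int(g1, 2) & int(g2, 2), f"0{len(g1)}b")

def size(g: str) -> int:
    """Support count = number of set bits in the granule."""
    return g.count("1")

g_23 = intersect(granules["i2"], granules["i3"])  # granule of [i2, i3]
# size(g_23) is the support count of the 2-itemset {i2, i3}.
```

The result matches Table 6: the granule of [i2, i3] is 000010111 with size 4, and [i2, i5] has size 3.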
However, stream data change continuously as time goes on: when the window moves to the subsequent time period, new data enter it and some old data fall out of its range.

Table 6
Candidate 2-item compound granules (candidate 2-itemsets)

Itemsets    Transactions of granules    Binary expression of granules    Size of granules
[i1, i2]    ∅                           000000000                        0
[i1, i3]    ∅                           000000000                        0
[i1, i4]    {002}                       010000000                        1
[i1, i5]    ∅                           000000000                        0
[i2, i3]    {005,007,008,009}           000010111                        4
[i2, i4]    {001,004,008,009}           100100011                        4
[i2, i5]    {001,007,008}               100000110                        3
[i3, i4]    {003,006,008,009}           001001011                        4
[i3, i5]    {007,008}                   000000110                        2
[i4, i5]    {001,008}                   100000010                        2
Table 9
Transaction data set #2

TID   i1   i2   i3   i4   i5
001   0    1    0    1    1
002   1    0    0    1    0
003   0    0    1    1    0
004   1    1    0    1    0
005   0    1    1    0    0
006   0    0    1    1    0
007   0    1    1    0    0
008   0    1    1    1    1
009   0    1    1    1    0
Table 9 shows the information table in the subsequent time period (marked with #2), which follows the information table of the previous period (marked with #1, see Table 4). In the new period some frequent itemsets become infrequent, while, conversely, some previously infrequent itemsets become frequent. The updated association rules are: i1 ⇒ i4, i5 ⇒ i2, i5 ⇒ i4, {i2, i5} ⇒ i4 and {i4, i5} ⇒ i2.
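Granular drifting can be observed directly on the bitmasks. Recomputing the item granules over window #2 (our own illustration; the encoding follows the binary expressions used in the tables, leftmost bit = transaction 001) shows {i3, i5} losing support while i1 ⇒ i4 emerges:

```python
# Illustration only: item granules over transaction set #2 (Table 9).
granules2 = {
    "i1": int("010100000", 2),      # {002,004} -- 004 entered the window
    "i2": int("100110111", 2),
    "i3": int("001011111", 2),
    "i4": int("111101011", 2),
    "i5": int("100000010", 2),      # {001,008} -- 007 left the window
}

def support2(itemset):
    g = (1 << 9) - 1
    for item in itemset:
        g &= granules2[item]        # compound granule = bitwise AND
    return bin(g).count("1")

print(support2(("i3", "i5")))       # 1 -> fell below k_support
print(support2(("i1", "i4")) / support2(("i1",)))   # 1.0 -> i1 => i4 holds
```

The same granule operations thus explain both the disappearance of old rules and the emergence of new ones as the window slides.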
The nature of the association rules' variation in the above transaction data stream is the drifting of granules as time goes by; this granular drifting drives the concept drifting in data streams. In our experiments we illustrate and interpret rules extraction with granular computing in detail, combining the traditional Apriori algorithm with granular computing to mine frequent sets in a dynamic data set. Granular computing is therefore not only a formal method for conceptually modeling data streams mining, but also a practical instrument for extracting rules from data streams. The two illustrations above reflect decision rule mining and association rule mining, respectively. A number of experiments indicate that granular computing serves both as a formal method and tool for conceptually modeling data stream mining and as a technique for problem solving in data mining as well as data streams mining. In practice, characteristics of data streams such as high speed, huge volume, and their continuous and unbounded nature must be considered when designing a concrete algorithm; resource awareness is likewise a significant concern when executing data streams mining algorithms.

6. Conclusion
Table 7
Frequent 2-item compound granules (frequent 2-itemsets)

Itemsets    Transactions of granules    Binary expression of granules    Size of granules
[i2, i3]    {005,007,008,009}           000010111                        4
[i2, i4]    {001,004,008,009}           100100011                        4
[i2, i5]    {001,007,008}               100000110                        3
[i3, i4]    {003,006,008,009}           001001011                        4
[i3, i5]    {007,008}                   000000110                        2
[i4, i5]    {001,008}                   100000010                        2
Table 8
Candidate 3-item compound granules (candidate 3-itemsets)

Itemsets       Transactions of granules    Binary expression of granules    Size of granules
[i2, i3, i4]   {008,009}                   000000011                        2
[i2, i3, i5]   {007,008}                   000000110                        2
[i2, i4, i5]   {001,008}                   100000010                        2
It is important to study formal and mathematical modeling of data mining as well as data streams mining. Lin and Yao have advocated devoting more attention to the modeling of data mining. The main contribution of this paper is a data stream oriented decision logic language, DS–DL, whose model and satisfiability are described in Tarski's style. We then propose a model for data streams mining based on DS–DL, and analyze the construction of and computation with granules in detail. Addressing the problem of concept drifting in data streams mining, we introduce the notion of granular drifting, with which concept drifting can be precisely interpreted in this model. Research on the conceptual modeling of data streams mining benefits understanding of, and insight into, the nature of data streams mining, and new techniques and algorithms based on granular computing could be developed within this model. In the future, we will further investigate and develop
methods of granular computing for data mining as well as data streams mining.
References

[1] Y.Y. Yao, Potential applications of granular computing in knowledge discovery and data mining, in: Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, 1999, pp. 573–580.
[2] Y.Y. Yao, Granular computing: basic issues and possible solutions, in: Proceedings of the 5th Joint Conference on Information Sciences, 2000, pp. 186–189.
[3] Y.Y. Yao, On modeling data mining with granular computing, in: Proceedings of the 25th Annual International Computer Software and Applications Conference, 2001, pp. 638–643.
[4] L.A. Zadeh, Fuzzy sets and information granulation, in: M. Gupta, R.K. Ragade, R.R. Yager (Eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, 1979, pp. 3–18.
[5] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356.
[6] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90 (2) (1997) 111–127.
[7] T.Y. Lin, From rough sets and neighborhood systems to information granulation and computing in words, in: Proceedings of the European Congress on Intelligent Techniques and Soft Computing, 1997, pp. 1602–1607.
[8] Y.Y. Yao, N. Zhong, Granular computing using information tables, in: T.Y. Lin, Y.Y. Yao, L.A. Zadeh (Eds.), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, Heidelberg, 2002, pp. 102–124.
[9] M.H. Dunham, Data Mining: Introductory and Advanced Topics, Tsinghua University Press, Beijing, 2005.
[10] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.
[11] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[12] J.W. Han, J. Pei, Y.W. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000, pp. 1–12.
[13] X.D. Zhu, C. Zheng, Research on hash pruning algorithm of association rules, Journal of Anhui University (Natural Sciences) 29 (4) (2005) 20–23.
[14] N. Jiang, L. Gruenwald, Research issues in data stream association rule mining, SIGMOD Record 35 (1) (2006) 14–19.
[15] X.D. Zhu, Z.Q. Huang, H. Zhou, Design of a multi-agent based intelligent intrusion detection system, in: Proceedings of the 1st International Symposium on Pervasive Computing and Applications, Urumchi, 2006, pp. 290–295.
[16] J.H. Chang, W.S. Lee, A sliding window method for finding recently frequent itemsets over online data streams, Journal of Information Science and Engineering 20 (4) (2004) 753–762.
[17] J.H. Chang, W.S. Lee, A. Zhou, Finding recent frequent itemsets adaptively over online data streams, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 487–492.
[18] G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in: Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002, pp. 346–357.
[19] T.Y. Lin, E. Louie, Modeling the real world for data mining: granular computing approach, in: Proceedings of the 9th IFSA World Congress and 20th NAFIPS International Conference, 2001, pp. 3044–3049.
[20] M.M. Gaber, A. Zaslavsky, S. Krishnaswamy, Mining data streams: a review, SIGMOD Record 34 (2) (2005) 18–26.
[21] S. Guha, N. Koudas, K. Shim, Data streams and histograms, in: Proceedings of the ACM Symposium on Theory of Computing, 2001, pp. 471–475.
[22] C.-H. Lin, D.-Y. Chiu, Y.-H. Wu, A.L.P. Chen, Mining frequent itemsets from data streams with a time-sensitive sliding window, in: Proceedings of the SIAM International Conference on Data Mining, 2005, pp. 68–79.
[23] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic, Dordrecht, 1991.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[25] Y.Y. Yao, Mining high order decision rules, in: M. Inuiguchi, S. Hirano, S. Tsumoto (Eds.), Rough Set Theory and Granular Computing, Springer, Berlin, 2003, pp. 125–135.
[26] Y.Y. Yao, Y. Sai, Mining ordering rules using rough set theory, in: T. Terano, T. Nishida, A. Namatame, S. Tsumoto, Y. Ohsawa, T. Washio (Eds.), Lecture Notes in Computer Science 2253, Springer, Berlin, 2001, pp. 316–321.
[27] H. Wang, W. Fan, P.S. Yu, J.W. Han, Mining concept-drifting data streams using ensemble classifiers, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 226–235.
[28] B.R. Gaines, The trade-off between knowledge and data in knowledge acquisition, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT Press, 1991, pp. 491–506.
[29] C. Silverstein, S. Brin, R. Motwani, Beyond market baskets: generalizing association rules to dependence rules, Data Mining and Knowledge Discovery 2 (1) (1998) 39–68.
[30] E. Suzuki, Autonomous discovery of reliable exception rules, in: Proceedings of KDD'97, 1997, pp. 259–262.
[31] N. Zhong, Y.Y. Yao, S. Ohsuga, Peculiarity oriented multi-database mining, in: Proceedings of PKDD'99, 1999, pp. 136–146.