Mining knowledge from object-oriented instances


Expert Systems with Applications Expert Systems with Applications 33 (2007) 441–450 www.elsevier.com/locate/eswa

Cheng-Ming Huang a, Tzung-Pei Hong b,*, Shi-Jinn Horng c,d

a Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, ROC
b Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
c Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan, ROC
d Department of Electronic Engineering, National United University, Miao-li 360, Taiwan, ROC

Abstract

Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for specific purposes. Recently, the object concept has been very popular and used in a variety of applications, especially for complex data description. This paper thus proposes a new data-mining algorithm for extracting interesting knowledge from transactions stored as object data. Each item itself is thought of as a class, and each item purchased in a transaction is thought of as an instance. Instances with the same class (item name) may have different attribute values since they may appear in different transactions. The proposed algorithm is divided into two main phases, one for intra-object association rules, and the other for inter-object association rules. Two apriori-like procedures are adopted to find the two kinds of rules. The first phase finds the association relations within the same kind of objects. Each large itemset found in this phase can be thought of as a composite item used in phase 2. The second phase then finds the relationships among different kinds of objects. Both the intra-object and inter-object association rules can thus be easily derived by the proposed algorithm at the same time. Experiments are also made to show the effect of the proposed algorithm.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Association rule; Data mining; Object transaction; Object-oriented mining

* Corresponding author. E-mail addresses: [email protected] (C.-M. Huang), [email protected] (T.-P. Hong), [email protected] (S.-J. Horng).
0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2006.05.029

1. Introduction

Knowledge discovery in databases (KDD) has become a process of considerable interest in recent years as the amounts of data in many databases have grown tremendously large. KDD means the application of nontrivial procedures for identifying effective, coherent, potentially useful, and previously unknown patterns in large databases (Frawley, Piatetsky-Shapiro, & Matheus, 1991). The KDD process generally consists of the following three phases (Famili, Shen, Weber, & Simoudis, 1997; Mannila, 1997).

(1) Pre-processing: This consists of all the actions taken before the actual data analysis process starts (Famili et al., 1997). Famili et al. argue that pre-processing may be performed on the data for the following reasons: solving data problems that may prevent us from performing any type of analysis on the data, understanding the nature of the data, performing a more meaningful data analysis, and extracting more meaningful knowledge from a given set of data (Famili et al., 1997).
(2) Data mining: This involves applying specific algorithms for extracting patterns or rules from data sets in a particular representation.
(3) Post-processing: This translates discovered patterns into forms acceptable for human beings. It may also make possible the visualization of extracted patterns.

Due to the importance of data mining to KDD, many researchers in database and machine learning fields are primarily interested in this new research topic because it offers opportunities to discover useful information and important


relevant patterns in large databases, thus helping decision-makers easily analyze the data and make good decisions regarding the domains concerned.

Recently, the object concept has been very popular and used in different applications such as databases, software engineering, knowledge representation (Clair, Liu, & Pissinou, 1998; Clark & Niblett, 1989), geographic information systems and even computer architecture (Kim, 1990; Kimura, 1995). An object represents an instance with several related attribute values and methods integrated together. In the past, data mining has most commonly been used to induce association rules from transaction data. In this paper, we generalize it and propose a mining algorithm to derive association rules from object data. The proposed algorithm is divided into two main phases, one for intra-object association rules, and the other for inter-object association rules. Two apriori-like (Agrawal & Srikant, 1994) procedures are adopted to find the two kinds of rules.

The remaining parts of this paper are organized as follows. Related mining algorithms are reviewed in Section 2. The object-oriented concept is introduced in Section 3. The proposed data-mining algorithm for object-oriented transaction data is described in Section 4. An example to illustrate the proposed algorithm is given in Section 5. Experimental results are described in Section 6. Conclusion and future work are given in Section 7.

2. Review of related mining approaches

The goal of data mining is to discover important associations among items such that the presence of some items in a transaction will imply the presence of some other items. To achieve this purpose, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data (Agrawal & Srikant, 1994; Agrawal & Srikant, 1995; Agrawal, Imielinski, & Swami, 1993a; Agrawal, Imielinski, & Swami, 1993b; Srikant, Vu, & Agrawal, 1997). They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the number of transactions in which an itemset appeared was larger than a predefined threshold value (called minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first. Large itemsets containing only single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called minimum confidence) were output as association rules.

In addition to proposing methods for mining association rules from transactions of binary values, Agrawal et al. also proposed a method (Srikant & Agrawal, 1996) for mining association rules from transactions with quantitative and categorical attributes. Their proposed method first determines the number of partitions for each quantitative attribute, and then maps all possible values of each attribute into a set of consecutive integers. It then finds large itemsets whose support values are greater than the user-specified minimum support levels. These large itemsets are then processed to generate association rules, and rules of interest to users are output.

Agrawal and Srikant also proposed the AprioriAll mining approach to mine sequential patterns from a set of transactions (Agrawal & Srikant, 1995). Five phases are included in this approach. In the first phase, the transactions are sorted first by customer ID as the major key and then by transaction time as the minor key. This phase thus converts the original transactions into customer sequences. In the second phase, the set of all large itemsets is found from the customer sequences by comparing their counts with a predefined support parameter. This phase is similar to the process of mining association rules. Note that when an itemset occurs more than once in a customer sequence, it is counted only once for this customer sequence. In the third phase, each large itemset is mapped to a contiguous integer and the original customer sequences are transformed into the mapped integer sequences. In the fourth phase, the set of transformed integer sequences is used to find large sequences among them. In the fifth phase, the maximally large sequences are then derived and output to users.

In this paper, a mining algorithm is proposed for finding inter- and intra-object association rules from object data. It includes two apriori-like procedures and lies somewhere between the above two approaches.

3. Object-oriented transactions

An object-oriented transaction includes one or more purchased items, each of which is represented as an object or an instance. Each instance inherits its characteristics from a superior object, called a class, which defines the basic structure of objects with common properties, including attributes, default values, and methods. The roles of classes and instances in object-oriented transaction data are like those that schemas and tuples play in a relational database (Kim, 1990). A simple structure of a class is shown in Fig. 1, which includes at least three major components: the class name, the attributes and the methods. The class name is an identifier used to identify a class, the attributes are used to represent the characteristics of a class, and the methods are used to implement the operations and functions of a class. An example for a class "wine" is shown in Fig. 2 to illustrate the above concept. The class name is specified as "wine". The class includes four attributes: on_sale, discount, take_out service and free trial. It also has two methods: confirmation and acknowledgement.
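As a rough illustration of the class and instance roles described above, the class "wine" could be sketched in Python as follows. The attribute and method names are taken from Fig. 2; the method bodies and the helper `active_attributes` are hypothetical, since the paper names the methods but does not define their behavior.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Wine:
    """One purchased instance of the class "wine" (attribute names from Fig. 2)."""

    on_sale: bool = False
    discount: bool = False
    take_out_service: bool = False
    free_trial: bool = False

    def active_attributes(self) -> list[str]:
        """Hypothetical helper: names of the attributes whose value is 1."""
        return [f.name for f in fields(self) if getattr(self, f.name)]

    def confirmation(self) -> str:
        """Placeholder body; the paper only names this method."""
        return "confirmed"

    def acknowledgement(self) -> str:
        """Placeholder body; the paper only names this method."""
        return "acknowledged"


# Two instances of the same class appearing in different transactions
# may carry different attribute values.
bottle_in_t1 = Wine(on_sale=True, discount=True)
bottle_in_t2 = Wine(free_trial=True)
print(bottle_in_t1.active_attributes())
print(bottle_in_t2.active_attributes())
```

The class plays the role of a schema, while each `Wine(...)` object plays the role of a tuple recorded in one transaction.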

Fig. 1. Structure of a typical class. [Diagram: a class box containing the class name, attribute 1 to attribute n, and method 1 to method m; the methods receive a message and return a response.]

Fig. 2. An example of the class "wine". [Diagram: class name wine; attributes 1. on_sale, 2. discount, 3. take_out service, 4. free trial; methods 1. confirmation, 2. acknowledgement.]

In this paper, each item itself (or item name) is thought of as a class, and each item purchased in a transaction is thought of as an instance. Instances with the same class (item name) may have different attribute values since they may appear in different transactions.

4. The data-mining algorithm for object-oriented transaction data

In this section, an algorithm is proposed for discovering useful association rules from object-oriented transaction data. The notation used in the algorithm is first listed below.

n: the total number of object-oriented transaction data;
w: the total number of items (classes);
m: the number of attributes for each item;
Ij: the jth item (class), 1 ≤ j ≤ w;
Ak: the kth attribute, 1 ≤ k ≤ m;
Ij.Ak: the kth attribute of the jth item;
countjk: the count of Ij.Ak;
supportjk: the support of Ij.Ak;
α: the predefined minimum support value;
λ: the predefined minimum confidence value;
Cr: the set of candidate itemsets with r intra-object items;
Lr: the set of large itemsets with r intra-object items;
C'z: the set of candidate itemsets with z inter-object items;
L'z: the set of large itemsets with z inter-object items.

In this paper, the attributes in each item (class) are assumed to be binary, with 1 representing that an instance of the item has the attribute property. The proposed algorithm can be divided into two main phases. The first phase is called the intra-object mining phase, in which the large itemsets associated with the same classes (items) but with different attributes are derived. This phase finds the association relations within the same kind of objects. Each large itemset found in this phase can thus be thought of as a composite item used in phase 2. The second phase is called the inter-object mining phase, in which the large itemsets are obtained from the composite items to get the relationships among different kinds of objects. Both the intra-object and inter-object association rules can thus be easily derived by the proposed algorithm at the same time. Two apriori-like procedures are adopted to find the two kinds of rules. The details of the proposed algorithm are described below.

The object-oriented data-mining algorithm for association rules:

INPUT: A set of w items (classes) with m attributes, a body of n transaction data, each with some items (objects) and their attribute values, a predefined minimum support value α, and a predefined minimum confidence value λ.

OUTPUT: A set of intra- and inter-object association rules.

Step 1: Calculate the number (countjk) of each class attribute Ij.Ak appearing in the n transaction data, where Ij is the jth class (item), Ak is the kth attribute, 1 ≤ k ≤ m, 1 ≤ j ≤ w; set the support (supportjk) of Ij.Ak as countjk/n.

Step 2: Check whether the support of each class attribute Ij.Ak is larger than or equal to the predefined minimum support value α. If Ij.Ak satisfies the condition, put it in the set of large 1-itemsets (L1). That is,

$L_1 = \{ I_j.A_k \mid count_{jk}/n \ge \alpha,\ 1 \le k \le m,\ 1 \le j \le w \}$.

Step 3: If L1 is null, then exit the algorithm; otherwise, do the next step.

Step 4: Set r = 1, where r is the number of items in the itemsets currently being processed.

Step 5: Generate the candidate set Cr+1 by joining Lr in a way similar to that in the apriori algorithm (Agrawal & Srikant, 1994) except that the two r-itemsets to be joined must belong to the same class (item).

Step 6: Calculate the number (counts) of transactions containing each candidate (r + 1)-itemset s in Cr+1 (with all the attribute values in the itemset equal to 1); set its support (supports) as counts/n.

Step 7: Check whether the support of each candidate (r + 1)-itemset s is larger than or equal to the predefined minimum support value α. If s satisfies the condition, put it in the set of large (r + 1)-itemsets (Lr+1). That is,

$L_{r+1} = \{ s \mid count_s/n \ge \alpha,\ s \in C_{r+1} \}$.

Step 8: If Lr+1 is null, do the next step; otherwise, set r = r + 1 and repeat Steps 5–7.

Step 9: Each large itemset found so far is then thought of as an intra-object composite item and is put in the inter-object large 1-itemset (L'1).

Step 10: Set z = 1, where z is used to represent the number of composite items in the inter-object itemsets currently being processed.

Step 11: Generate the candidate set C'z+1 from L'z in a way similar to that in the apriori algorithm under the condition that each (z + 1)-itemset must not include composite items from the same classes.

Step 12: Calculate the number (counts) of each candidate (z + 1)-itemset s in C'z+1 appearing in the transaction data; set its support (supports) as counts/n.

Step 13: Check whether the support of each candidate (z + 1)-itemset s is larger than or equal to the predefined minimum support value α. If s satisfies the condition, put it in the set of large (z + 1)-itemsets (L'z+1). That is,

$L'_{z+1} = \{ s \mid count_s/n \ge \alpha,\ s \in C'_{z+1} \}$.

Step 14: If L'z+1 is null, do the next step; otherwise, set z = z + 1 and repeat Steps 11–13.

Step 15: Derive intra-object association rules with confidence values larger than or equal to λ from the large itemsets L2 to Lr.

Step 16: Derive inter-object association rules with confidence values larger than or equal to λ from the large itemsets L'2 to L'z.

After Step 16, the two kinds of intra- and inter-object association rules are found from the given set of object-oriented transactions.
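The large-itemset part of the procedure (Steps 1 to 14) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: a transaction is assumed to be a dict mapping an item name to the set of its attributes whose value is 1, a class attribute Ij.Ak is a tuple `(item, attribute)`, and the function names (`count`, `join`, `mine`) and the four toy transactions are hypothetical.

```python
from itertools import combinations


def count(transactions, pairs):
    """Number of transactions in which every (item, attribute) pair has value 1."""
    return sum(all(a in t.get(i, ()) for i, a in pairs) for t in transactions)


def join(large, allowed):
    """Apriori-style join: merge two k-itemsets that differ in one element and
    keep the result if `allowed` accepts it and all of its k-subsets are large."""
    large = set(large)
    out = set()
    for a, b in combinations(large, 2):
        u = a | b
        if len(u) == len(a) + 1 and allowed(u) and all(u - {x} in large for x in u):
            out.add(u)
    return out


def same_class(itemset):
    """Intra-object condition: all class attributes belong to one item."""
    return len({item for item, _ in itemset}) == 1


def distinct_classes(itemset):
    """Inter-object condition: no two composite items share a class."""
    classes = [next(iter(composite))[0] for composite in itemset]
    return len(set(classes)) == len(classes)


def mine(transactions, min_support):
    """Return (intra, inter): dicts mapping each large itemset to its count."""
    n = len(transactions)

    def levelwise(level, allowed, flatten):
        found = {}
        while level:
            found.update(level)
            level = {
                c: k
                for c in join(level, allowed)
                if (k := count(transactions, flatten(c))) / n >= min_support
            }
        return found

    # Phase 1 (Steps 1-8): large itemsets within a single class.
    pairs = {(i, a) for t in transactions for i, attrs in t.items() for a in attrs}
    level1 = {frozenset([p]): count(transactions, [p]) for p in pairs}
    level1 = {s: k for s, k in level1.items() if k / n >= min_support}
    intra = levelwise(level1, same_class, lambda s: s)

    # Phase 2 (Steps 9-14): every intra-object large itemset is a composite item.
    composites = {frozenset([c]): k for c, k in intra.items()}
    found = levelwise(composites, distinct_classes, lambda s: frozenset().union(*s))
    inter = {s: k for s, k in found.items() if len(s) > 1}
    return intra, inter


# Hypothetical toy data, not taken from the paper.
transactions = [
    {"I1": {"A1", "A2"}, "I4": {"A3"}},
    {"I1": {"A1", "A2"}, "I4": {"A3"}},
    {"I1": {"A1"}, "I2": {"A2"}},
    {"I2": {"A2"}, "I4": {"A3"}},
]
intra, inter = mine(transactions, min_support=0.5)
print(len(intra), "intra-object large itemsets")
print(len(inter), "inter-object large itemsets")
```

The same `join` routine serves both phases; only the admissibility test changes (same class in phase 1, distinct classes in phase 2). Rule derivation (Steps 15 and 16) is left out of the sketch.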

5. An example

In this section, an example is given to illustrate the proposed data-mining algorithm. This is a simple example to show how the proposed algorithm can be used to generate inter-object and intra-object sale strategies for commodities in a store. Assume there are four items, I1 to I4, to be sold in this example and each item has the same four attributes related to the sale behavior. The attributes are on_sale, discount, take_out service and free trial, represented as A1 to A4. Their attribute values are either 0 or 1. Also assume the data set includes 10 transactions, as shown in Table 1. In Table 1, I1.A1 represents that the value of the attribute A1 in item I1 is 1, meaning that I1 has the characteristic A1.

For the transaction data in Table 1, the proposed mining algorithm proceeds as follows:

Step 1: The count value of each item attribute appearing in the ten transaction data is first calculated. Take the class attribute I1.A1 as an example. Its count value = (1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1) = 8. This step is repeated for the other item attributes, with the results shown in Table 2.

Step 2: The support of each item attribute is derived as its count value divided by the number of transactions. The support of each item attribute is checked to determine whether it is larger than or equal to the predefined minimum support value α. Assume in this example that the minimum support α is set at 0.4. Since the support values of I1.A1, I1.A2, I1.A3, I2.A2, I3.A3, I4.A1 and I4.A3 are larger than or equal to 0.4, these item attributes are put in the set of large 1-itemsets L1 (Table 3).

Step 3: If L1 is null, then the algorithm is exited; otherwise, the next step is done. In this example, since L1 is not null, Step 4 is then executed.

Step 4: r is set at 1, where r is the number of item attributes in the itemsets currently being processed.

Step 5: The candidate set Cr+1 is formed by joining Lr such that the two r-itemsets to be joined must belong to the same item (class). C2 is first generated from L1 as follows: (I1.A1, I1.A2), (I1.A1, I1.A3), (I1.A2, I1.A3), and (I4.A1, I4.A3).

Table 1. The set of 10 transactions in the example

Transaction ID | Purchased items | Attribute values of purchased items
1 | I1, I3, I4 | (I1.A1, I1.A2, I1.A3, I1.A4), (I3.A1, I3.A2, I3.A3), (I4.A1, I4.A3)
2 | I2, I4 | (I2.A1, I2.A2, I2.A4), (I4.A1, I4.A3, I4.A4)
3 | I1, I4 | (I1.A1, I1.A3), (I4.A2)
4 | I2, I3 | (I2.A2), (I3.A2, I3.A3, I3.A4)
5 | I1, I2 | (I1.A1, I1.A2, I1.A3, I1.A4), (I2.A1, I2.A2, I2.A3)
6 | I1, I4 | (I1.A1, I1.A2, I1.A3), (I4.A1, I4.A2, I4.A3)
7 | I1, I2, I3, I4 | (I1.A1, I1.A2, I1.A4), (I2.A1, I2.A2, I2.A4), (I3.A1, I3.A2, I3.A3), (I4.A2)
8 | I1, I2, I3, I4 | (I1.A1, I1.A2, I1.A3), (I2.A3, I2.A4), (I3.A1, I3.A3), (I4.A1, I4.A3)
9 | I1, I2 | (I1.A1, I1.A3), (I2.A3)
10 | I1, I3, I4 | (I1.A1, I1.A2, I1.A3), (I3.A4), (I4.A3, I4.A4)
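The counting in Step 1 can be illustrated with a short Python sketch over the data of Table 1. The compact encoding (each purchased item mapped to the digits of its attributes whose value is 1) and the variable names are assumptions made here for brevity.

```python
from collections import Counter

# Table 1: each purchased item is mapped to the digits of its attributes
# whose value is 1, e.g. "13" stands for A1 and A3 (hypothetical encoding).
TABLE_1 = [
    {"I1": "1234", "I3": "123", "I4": "13"},
    {"I2": "124", "I4": "134"},
    {"I1": "13", "I4": "2"},
    {"I2": "2", "I3": "234"},
    {"I1": "1234", "I2": "123"},
    {"I1": "123", "I4": "123"},
    {"I1": "124", "I2": "124", "I3": "123", "I4": "2"},
    {"I1": "123", "I2": "34", "I3": "13", "I4": "13"},
    {"I1": "13", "I2": "3"},
    {"I1": "123", "I3": "4", "I4": "34"},
]

# Step 1: count how many transactions contain each class attribute Ij.Ak.
counts = Counter(
    f"{item}.A{digit}"
    for transaction in TABLE_1
    for item, digits in transaction.items()
    for digit in digits
)

n = len(TABLE_1)
for name in sorted(counts):
    print(name, counts[name], counts[name] / n)  # name, count, support
```

Because an item appears at most once per transaction, counting attribute occurrences is the same as counting transactions.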


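Table 2. The counts of the item attributes in Table 1

Trans. ID | I1.A1 | I1.A2 | I1.A3 | I1.A4 | I2.A1 | I2.A2 | I2.A3 | I2.A4 | I3.A1 | I3.A2 | I3.A3 | I3.A4 | I4.A1 | I4.A2 | I4.A3 | I4.A4
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0
2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1
3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0
5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
6 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0
7 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0
8 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0
9 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1
Count | 8 | 6 | 7 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 4 | 2 | 4 | 3 | 5 | 2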

Table 3. The set of large 1-itemsets L1 in this example

Itemset | Count
I1.A1 | 8
I1.A2 | 6
I1.A3 | 7
I2.A2 | 4
I3.A3 | 4
I4.A1 | 4
I4.A3 | 5
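The thresholding of Step 2 can be sketched as a simple filter over the counts of Table 2. The constant names are assumptions; the values are those of the example (10 transactions, minimum support 0.4).

```python
# Counts of the 16 class attributes, copied from the last row of Table 2.
COUNTS = {
    "I1.A1": 8, "I1.A2": 6, "I1.A3": 7, "I1.A4": 3,
    "I2.A1": 3, "I2.A2": 4, "I2.A3": 3, "I2.A4": 3,
    "I3.A1": 3, "I3.A2": 3, "I3.A3": 4, "I3.A4": 2,
    "I4.A1": 4, "I4.A2": 3, "I4.A3": 5, "I4.A4": 2,
}
N_TRANSACTIONS = 10
MIN_SUPPORT = 0.4

# Step 2: keep the class attributes whose support reaches the minimum support.
L1 = {
    name: count
    for name, count in COUNTS.items()
    if count / N_TRANSACTIONS >= MIN_SUPPORT
}
print(L1)
```

The seven entries that survive the filter are the ones listed in Table 3.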

Table 4. The counts of the itemsets in C2

Itemset | Count
(I1.A1, I1.A2) | 6
(I1.A1, I1.A3) | 7
(I1.A2, I1.A3) | 5
(I4.A1, I4.A3) | 4
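The intra-object join and the counts in Table 4 can be reproduced with the following sketch. It assumes the 0/1 columns of the large 1-itemsets are taken from Table 2; the names `COLUMNS` and `item_of` are hypothetical.

```python
from itertools import combinations

# 0/1 columns of the seven large 1-itemsets, read from Table 2
# (one entry per transaction, in transaction order 1 to 10).
COLUMNS = {
    "I1.A1": [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    "I1.A2": [1, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    "I1.A3": [1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    "I2.A2": [0, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    "I3.A3": [1, 0, 0, 1, 0, 0, 1, 1, 0, 0],
    "I4.A1": [1, 1, 0, 0, 0, 1, 0, 1, 0, 0],
    "I4.A3": [1, 1, 0, 0, 0, 1, 0, 1, 0, 1],
}


def item_of(name):
    """Class (item) part of a class attribute such as "I1.A2"."""
    return name.split(".")[0]


# Intra-object join: pair only class attributes that belong to the same item.
C2 = [(a, b) for a, b in combinations(COLUMNS, 2) if item_of(a) == item_of(b)]

# A candidate is counted in a transaction when all of its attribute values are 1.
counts = {
    (a, b): sum(x & y for x, y in zip(COLUMNS[a], COLUMNS[b]))
    for a, b in C2
}
for pair, count in counts.items():
    print(pair, count)
```

Restricting the join to one class is what keeps candidates such as (I1.A1, I4.A3) out of the intra-object phase; they are formed later from composite items.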

Step 6: The count of each candidate in C2 is calculated, with the results shown in Table 4.

Step 7: The support of each candidate is then calculated as its count divided by 10. The support of each 2-candidate is then compared with the predefined minimum support value 0.4. In this example, all four 2-candidates, (I1.A1, I1.A2), (I1.A1, I1.A3), (I1.A2, I1.A3), and (I4.A1, I4.A3), are large and thus kept in L2 (Table 5).

Table 5. The itemsets kept in L2

Itemset | Count
(I1.A1, I1.A2) | 6
(I1.A1, I1.A3) | 7
(I1.A2, I1.A3) | 5
(I4.A1, I4.A3) | 4

Step 8: If Lr+1 is null, the next step is done; otherwise, r = r + 1 and Steps 5–7 are repeated. Since L2 is not null in the example, r = r + 1 = 2. Steps 5–7 are then repeated to find L3. C3 is first generated from L2, and only the 3-itemset (I1.A1, I1.A2, I1.A3) is formed. Its support is calculated as 0.5, larger than 0.4. It is thus put in L3. Since L3 contains only one itemset, no 4-itemsets are formed and L4 is null. Step 9 then begins.

Step 9: Each large itemset found so far is thought of as an intra-object composite item and is put in the inter-object large 1-itemset (L'1). In this example, the large itemsets I1.A1, I1.A2, I1.A3, I2.A2, I3.A3, I4.A1, I4.A3, (I1.A1, I1.A2), (I1.A1, I1.A3), (I1.A2, I1.A3), (I4.A1, I4.A3), (I1.A1, I1.A2, I1.A3) are put in the set of large 1-itemsets L'1. Table 6 shows the results.

Table 6. The set of large inter-object 1-itemsets L'1 in this example

Itemset | Count
I1.A1 | 8
I1.A2 | 6
I1.A3 | 7
I2.A2 | 4
I3.A3 | 4
I4.A1 | 4
I4.A3 | 5
(I1.A1, I1.A2) | 6
(I1.A1, I1.A3) | 7
(I1.A2, I1.A3) | 5
(I4.A1, I4.A3) | 4
(I1.A1, I1.A2, I1.A3) | 5

Step 10: z is set at 1, where z is used to represent the number of composite items in the inter-object itemsets currently being processed.

Step 11: The candidate set C'z+1 is generated by joining L'z under the condition that each (z + 1)-itemset must not include composite items from the same classes. C'2 is first generated from L'1 as follows: [I1.A1, I2.A2], [I1.A1, I3.A3], [I1.A1, I4.A1], [I1.A1, I4.A3], [I1.A1, (I4.A1, I4.A3)], [I1.A2, I2.A2], [I1.A2, I3.A3], [I1.A2, I4.A1], [I1.A2, I4.A3], [I1.A2, (I4.A1, I4.A3)], [I1.A3, I2.A2], [I1.A3, I3.A3], [I1.A3, I4.A1], [I1.A3, I4.A3], [I1.A3, (I4.A1, I4.A3)], [I2.A2, I3.A3], [I2.A2, I4.A1], [I2.A2, I4.A3], [I2.A2, (I1.A1, I1.A2)], [I2.A2, (I1.A1, I1.A3)], [I2.A2, (I1.A2, I1.A3)], [I2.A2, (I4.A1, I4.A3)], [I2.A2, (I1.A1, I1.A2, I1.A3)], [I3.A3, I4.A1], [I3.A3, I4.A3], [I3.A3, (I1.A1, I1.A2)], [I3.A3, (I1.A1, I1.A3)], [I3.A3, (I1.A2, I1.A3)], [I3.A3, (I4.A1, I4.A3)], [I3.A3, (I1.A1, I1.A2, I1.A3)], [I4.A1, (I1.A1, I1.A2)], [I4.A1, (I1.A1, I1.A3)], [I4.A1, (I1.A2, I1.A3)],


[I4.A1, (I1.A1, I1.A2, I1.A3)], [I4.A3, (I1.A1, I1.A2)], [I4.A3, (I1.A1, I1.A3)], [I4.A3, (I1.A2, I1.A3)], [I4.A3, (I1.A1, I1.A2, I1.A3)], [(I1.A1, I1.A2), (I4.A1, I4.A3)], [(I1.A1, I1.A3), (I4.A1, I4.A3)], [(I1.A2, I1.A3), (I4.A1, I4.A3)] and [(I4.A1, I4.A3), (I1.A1, I1.A2, I1.A3)]. There are 42 candidates in C'2 in total.

Step 12: The count of each candidate in C'2 is calculated, with the results shown in Table 7.

Table 7. The counts of the itemsets in C'2

Itemset | Count
[I1.A1, I2.A2] | 2
[I1.A1, I3.A3] | 3
[I1.A1, I4.A1] | 3
[I1.A1, I4.A3] | 4
[I1.A1, (I4.A1, I4.A3)] | 4
[I1.A2, I2.A2] | 2
[I1.A2, I3.A3] | 3
[I1.A2, I4.A1] | 3
[I1.A2, I4.A3] | 4
[I1.A2, (I4.A1, I4.A3)] | 4
[I1.A3, I2.A2] | 1
[I1.A3, I3.A3] | 2
[I1.A3, I4.A1] | 3
[I1.A3, I4.A3] | 4
[I1.A3, (I4.A1, I4.A3)] | 4
[I2.A2, I3.A3] | 2
[I2.A2, I4.A1] | 1
[I2.A2, I4.A3] | 4
[I2.A2, (I1.A1, I1.A2)] | 2
[I2.A2, (I1.A1, I1.A3)] | 1
[I2.A2, (I1.A2, I1.A3)] | 1
[I2.A2, (I4.A1, I4.A3)] | 1
[I2.A2, (I1.A1, I1.A2, I1.A3)] | 1
[I3.A3, I4.A1] | 2
[I3.A3, I4.A3] | 2
[I3.A3, (I1.A1, I1.A2)] | 3
[I3.A3, (I1.A1, I1.A3)] | 2
[I3.A3, (I1.A2, I1.A3)] | 2
[I3.A3, (I4.A1, I4.A3)] | 2
[I3.A3, (I1.A1, I1.A2, I1.A3)] | 2
[I4.A1, (I1.A1, I1.A2)] | 3
[I4.A1, (I1.A1, I1.A3)] | 3
[I4.A1, (I1.A2, I1.A3)] | 3
[I4.A1, (I1.A1, I1.A2, I1.A3)] | 3
[I4.A3, (I1.A1, I1.A2)] | 4
[I4.A3, (I1.A1, I1.A3)] | 4
[I4.A3, (I1.A2, I1.A3)] | 4
[I4.A3, (I1.A1, I1.A2, I1.A3)] | 4
[(I1.A1, I1.A2), (I4.A1, I4.A3)] | 3
[(I1.A1, I1.A3), (I4.A1, I4.A3)] | 3
[(I1.A2, I1.A3), (I4.A1, I4.A3)] | 3
[(I4.A1, I4.A3), (I1.A1, I1.A2, I1.A3)] | 3

Step 13: The support of each candidate is calculated and compared with the predefined minimum support value 0.4. In this example, the 11 itemsets, [I1.A1, I4.A3], [I1.A1, (I4.A1, I4.A3)], [I1.A2, I4.A3], [I1.A2, (I4.A1, I4.A3)], [I1.A3, I4.A3], [I1.A3, (I4.A1, I4.A3)], [I2.A2, I4.A3], [I4.A3, (I1.A1, I1.A2)], [I4.A3, (I1.A1, I1.A3)], [I4.A3, (I1.A2, I1.A3)] and [I4.A3, (I1.A1, I1.A2, I1.A3)] satisfy the condition and are kept in L'2. Table 8 shows the results.

Table 8. The counts of the itemsets in L'2

Itemset | Count
[I1.A1, I4.A3] | 4
[I1.A1, (I4.A1, I4.A3)] | 4
[I1.A2, I4.A3] | 4
[I1.A2, (I4.A1, I4.A3)] | 4
[I1.A3, I4.A3] | 4
[I1.A3, (I4.A1, I4.A3)] | 4
[I2.A2, I4.A3] | 4
[I4.A3, (I1.A1, I1.A2)] | 4
[I4.A3, (I1.A1, I1.A3)] | 4
[I4.A3, (I1.A2, I1.A3)] | 4
[I4.A3, (I1.A1, I1.A2, I1.A3)] | 4

Step 14: If L'z+1 is null, the next step is done; otherwise, z = z + 1 and Steps 11–13 are repeated. Since L'2 is not null in the example, z = z + 1 = 2. Steps 11–13 are then repeated to find L'3. C'3 is first generated from L'2, and four inter-object 3-itemsets, [I2.A2, I4.A3, (I1.A1, I1.A2)], [I2.A2, I4.A3, (I1.A1, I1.A3)], [I2.A2, I4.A3, (I1.A2, I1.A3)] and [I2.A2, I4.A3, (I1.A1, I1.A2, I1.A3)], are formed. The count of each 3-itemset is then calculated. Since all the count values are 0, smaller than 4, the above 3-itemsets are not large. L'3 is thus an empty set. Step 15 then begins.

Step 15: Intra-object association rules with confidence values larger than or equal to λ are derived from the large itemsets L2 to Lr. In this example, r = 3. This step includes the following sub-steps:

(a) All possible intra-object association rules are formed. The following 14 association rules are thus generated:
1. If I1 = A1, then I1 = A2;
2. If I1 = A2, then I1 = A1;
3. If I1 = A1, then I1 = A3;
4. If I1 = A3, then I1 = A1;
5. If I1 = A2, then I1 = A3;
6. If I1 = A3, then I1 = A2;
7. If I4 = A1, then I4 = A3;
8. If I4 = A3, then I4 = A1;
9. If I1 = A1, then (I1 = A2 and I1 = A3);
10. If (I1 = A2 and I1 = A3), then I1 = A1;
11. If I1 = A2, then (I1 = A1 and I1 = A3);
12. If (I1 = A1 and I1 = A3), then I1 = A2;
13. If I1 = A3, then (I1 = A1 and I1 = A2);
14. If (I1 = A1 and I1 = A2), then I1 = A3.

(b) The confidence factors for the above association rules are calculated. Take the first association rule as an example. The intra-object count of I1.A1 ∩ I1.A2 is calculated as shown in Table 9.

Table 9. The calculation for the intra-object count of I1.A1 ∩ I1.A2

Trans. ID | I1.A1 | I1.A2 | I1.A1 ∩ I1.A2
1 | 1 | 1 | 1
2 | 0 | 0 | 0
3 | 1 | 0 | 0
4 | 0 | 0 | 0
5 | 1 | 1 | 1
6 | 1 | 1 | 1
7 | 1 | 1 | 1
8 | 1 | 1 | 1
9 | 1 | 0 | 0
10 | 1 | 1 | 1
Count | 8 | 6 | 6

The confidence factor for the association rule "If I1 = A1, then I1 = A2" is then calculated as

$\frac{\sum_{i=1}^{10} (I_1.A_1 \cap I_1.A_2)}{\sum_{i=1}^{10} (I_1.A_1)} = \frac{6}{8} = 0.75.$

With the same calculation process, the results for all the 14 rules are shown below.
1. "If I1 = A1, then I1 = A2" with a confidence factor of 0.75;
2. "If I1 = A2, then I1 = A1" with a confidence factor of 1;
3. "If I1 = A1, then I1 = A3" with a confidence factor of 0.88;
4. "If I1 = A3, then I1 = A1" with a confidence factor of 1;
5. "If I1 = A2, then I1 = A3" with a confidence factor of 0.83;
6. "If I1 = A3, then I1 = A2" with a confidence factor of 0.71;
7. "If I4 = A1, then I4 = A3" with a confidence factor of 1;
8. "If I4 = A3, then I4 = A1" with a confidence factor of 0.8;
9. "If I1 = A1, then (I1 = A2 and I1 = A3)" with a confidence factor of 0.63;
10. "If (I1 = A2 and I1 = A3), then I1 = A1" with a confidence factor of 1;
11. "If I1 = A2, then (I1 = A1 and I1 = A3)" with a confidence factor of 0.83;
12. "If (I1 = A1 and I1 = A3), then I1 = A2" with a confidence factor of 0.71;
13. "If I1 = A3, then (I1 = A1 and I1 = A2)" with a confidence factor of 0.71;
14. "If (I1 = A1 and I1 = A2), then I1 = A3" with a confidence factor of 0.63.

(c) The confidence factors of the above association rules are then compared with the predefined confidence threshold λ. Assume the confidence threshold λ was set at 0.8 in this example. The following eight rules are thus output to users:
1. "If I1 = A2, then I1 = A1" with a confidence factor of 1;
2. "If I1 = A1, then I1 = A3" with a confidence factor of 0.88;
3. "If I1 = A3, then I1 = A1" with a confidence factor of 1;
4. "If I1 = A2, then I1 = A3" with a confidence factor of 0.83;
5. "If I4 = A1, then I4 = A3" with a confidence factor of 1;
6. "If I4 = A3, then I4 = A1" with a confidence factor of 0.8;
7. "If (I1 = A2 and I1 = A3), then I1 = A1" with a confidence factor of 1;
8. "If I1 = A2, then (I1 = A1 and I1 = A3)" with a confidence factor of 0.83.

The above rules can then be explained in a comprehensible way. For example, the association rule "If I1 = A2, then I1 = A1" with a confidence factor of 1 can be explained as "If item I1 has the characteristic discount, then I1 is on_sale, with a confidence factor of 1".

Step 16: Inter-object association rules with confidence values larger than or equal to λ are derived from the large itemsets L'2 to L'z. In this example, z = 2. The following inter-object association rules can then be derived:
1. "If I4 = A3, then I1 = A1" with a confidence factor of 0.8;
2. "If (I4 = A1 and I4 = A3), then I1 = A1" with a confidence factor of 1;
3. "If I1 = A2, then I4 = A3" with a confidence factor of 1;
4. "If I4 = A3, then I1 = A2" with a confidence factor of 0.8;
5. "If (I4 = A1 and I4 = A3), then I1 = A2" with a confidence factor of 1;
6. "If I4 = A3, then I1 = A3" with a confidence factor of 0.8;
7. "If (I4 = A1 and I4 = A3), then I1 = A3" with a confidence factor of 1;
8. "If I2 = A2, then I4 = A3" with a confidence factor of 1;
9. "If I4 = A3, then I2 = A2" with a confidence factor of 0.8;
10. "If I4 = A3, then (I1 = A1 and I1 = A2)" with a confidence factor of 0.8;
11. "If I4 = A3, then (I1 = A1 and I1 = A3)" with a confidence factor of 0.8;
12. "If I4 = A3, then (I1 = A2 and I1 = A3)" with a confidence factor of 1;
13. "If (I1 = A2 and I1 = A3), then I4 = A3" with a confidence factor of 1;
14. "If I4 = A3, then (I1 = A1 and I1 = A2 and I1 = A3)" with a confidence factor of 0.8;
15. "If (I1 = A1 and I1 = A2 and I1 = A3), then I4 = A3" with a confidence factor of 0.8.

The association rule "If (I1 = A1 and I1 = A2 and I1 = A3), then I4 = A3" with a confidence factor of 0.8 can be explained as "If item I1 is on_sale, with discount and take_out service, then I4 is also with take_out service, with a confidence factor of 0.8".
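The confidence calculation used in Step 15 can be sketched for the intra-object 2-itemsets as follows. The counts are those of the large 1-itemsets and 2-itemsets for I1 and I4 in the example; the function name `rules_from_pairs` and the tuple-based rule representation are assumptions made for illustration, and rules from the 3-itemset are left out to keep the sketch short.

```python
# Counts of the large 1-itemsets and 2-itemsets for items I1 and I4.
COUNT_1 = {"I1.A1": 8, "I1.A2": 6, "I1.A3": 7, "I4.A1": 4, "I4.A3": 5}
COUNT_2 = {
    ("I1.A1", "I1.A2"): 6,
    ("I1.A1", "I1.A3"): 7,
    ("I1.A2", "I1.A3"): 5,
    ("I4.A1", "I4.A3"): 4,
}
MIN_CONFIDENCE = 0.8


def rules_from_pairs(count_1, count_2, min_confidence):
    """Yield (antecedent, consequent, confidence) for each rule that passes.

    confidence(X -> Y) = count(X and Y) / count(X)
    """
    for (x, y), joint in count_2.items():
        for antecedent, consequent in ((x, y), (y, x)):
            confidence = joint / count_1[antecedent]
            if confidence >= min_confidence:
                yield antecedent, consequent, confidence


kept = list(rules_from_pairs(COUNT_1, COUNT_2, MIN_CONFIDENCE))
for antecedent, consequent, confidence in kept:
    print(f"If {antecedent}, then {consequent}: {confidence:.2f}")
```

With these counts the rule "If I1 = A1, then I1 = A2" gets 6/8 = 0.75 and is filtered out, while its reverse gets 6/6 = 1 and is kept. The same ratio applies to inter-object rules, with composite items in place of single class attributes.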
After Step 16, the two kinds of intra- and inter-object association rules are found from the given set of object-oriented transactions.

6. Experimental results

This section reports on experiments made to show the effects of the parameters on the proposed algorithm for inter- and intra-object association rules. The algorithm was implemented in Java on a Pentium-IV 2.8 GHz personal computer with 1 GB of memory. There were 100 object-oriented items. Each item had four attributes, and each attribute value was 0 or 1. Data sets with different numbers of transactions were run by the proposed algorithm. In each data set, the numbers of purchased items in transactions were first randomly generated. The purchased items and their attribute values in each transaction were then generated. An item could not be generated twice in a transaction.

Experiments were first performed to find the relationships between numbers of rules and minimum supports when the number of transactions was set at 1000, the minimum confidence was 0.2 and the average number of purchased items in transactions was 10. The results for both the intra- and inter-object association rules are shown in Fig. 3. It can be observed from Fig. 3 that the number of rules decreased as the minimum support value increased. This is consistent with the usual behavior of data mining, since a higher threshold admits fewer large itemsets. Besides, the numbers of intra-object association rules were smaller than those of inter-object association rules because the number of attributes was smaller than the number of items in the experiments. This situation usually occurs in real applications. Intra-object association rules are just internal relations within objects, and inter-object association rules are external relations among objects. We also found that the execution time for intra-object association rules was smaller than that for inter-object association rules (which will be shown later).

Experiments were then made to find the relationships between numbers of rules and minimum confidences when the number of transactions was 1000, the minimum support was 0.2 and the average number of purchased items in transactions was 10. The results for both the intra- and inter-object association rules are shown in Fig. 4. It can be observed from Fig. 4 that the number of rules decreased as the minimum confidence value increased. This is also consistent with the usual behavior of data mining. Besides, the numbers of intra-object association rules were smaller than those of inter-object association rules when the minimum confidence was small, and larger when the minimum confidence was large. This was because the inter-object association rules were derived from the given set of items, which was more dispersed than the set of attributes. Thus, when the minimum confidence value was high, only a few inter-object association rules could be derived.

Experiments were then performed to compare the results for different numbers of transactions. The relationship between numbers of intra-object association rules and minimum support values for different numbers of transactions, with an average of 10 purchased items in transactions and a minimum confidence value set at 0.5, is shown in Fig. 5. The relationship between numbers of inter-object association rules and minimum support values for different numbers of transactions is shown in Fig. 6. From Figs. 5 and 6, it is easily seen that the numbers of rules are nearly the same for different numbers of transactions since the minimum support and the minimum confidence were set as ratios and were independent of the numbers of transactions. The rule numbers along with different transactions for the minimum

inter oo Intra-1000

Number of rules

800 600 400 200 0.22

0.3

Fig. 4. The relationship between numbers of rules and minimum confidence values.

1000 Number of rules

1000

200

1200

0 0.2

inter oo

intra oo

Number of rules

448

0.24 0.26 Mini-support

0.28

0.3

Fig. 3. The relationship between numbers of rules and minimum support values.

160 140 120 100 80 60 40 20 0 0

Intra-2000

0.1

Intra-3000

0.2

Intra-4000

0.3 Mini-support

0.4

Intra-5000

0.5

Intra-6000

0.6

Fig. 5. The relationship between numbers of intra-object rules and minimum support values for different numbers of transactions.

Inter-1000

Inter-2000

Inter-5000

Inter-6000

Inter-3000

Inter-1000

Inter-4000

16000

40000

14000

35000 30000 25000 20000

Inter-4000

Inter-5000

Inter-6000

12000 10000 8000 6000

2000

10000

0

5000 0 0

0.1

0.2

0.3 Mini-support

0.4

0.5

0.6

Fig. 6. The relationship between numbers of inter-object rules and minimum support values for different numbers of transactions.

support set at 0.2 and the minimum confidence set at 0.5 are shown in Fig. 7. The lines are nearly constant. intra oo
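The data generation procedure described at the start of this section can be sketched roughly as follows. This is only an illustrative sketch, not the generator used in the experiments: the class name `SyntheticObjectData`, the uniform distributions, the range of transaction sizes and the random seed are all assumptions, since the text only states that sizes, items and attribute values were randomly generated. The `support` method illustrates why thresholds expressed as ratios are largely insensitive to the number of transactions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the synthetic object-transaction generator.
class SyntheticObjectData {
    static final int NUM_ITEMS = 100;     // 100 object-oriented items (from the text)
    static final int NUM_ATTRIBUTES = 4;  // four binary attributes per item (from the text)

    // Each transaction maps an item id (class) to the attribute values of its instance.
    static List<Map<Integer, int[]>> generate(int numTransactions, int avgItems, long seed) {
        Random rng = new Random(seed);
        List<Map<Integer, int[]>> data = new ArrayList<>();
        for (int t = 0; t < numTransactions; t++) {
            // Assumption: sizes are uniform on [1, 2*avgItems - 1], so the mean is avgItems.
            int size = Math.min(1 + rng.nextInt(2 * avgItems - 1), NUM_ITEMS);
            Map<Integer, int[]> transaction = new LinkedHashMap<>();
            while (transaction.size() < size) {
                int item = rng.nextInt(NUM_ITEMS);
                if (transaction.containsKey(item)) {
                    continue; // an item cannot be generated twice in a transaction
                }
                int[] attrs = new int[NUM_ATTRIBUTES];
                for (int a = 0; a < NUM_ATTRIBUTES; a++) {
                    attrs[a] = rng.nextInt(2); // each attribute value is 0 or 1
                }
                transaction.put(item, attrs);
            }
            data.add(transaction);
        }
        return data;
    }

    // Relative support of "item is purchased with attribute attr equal to value".
    // Assumption: support is measured against the total number of transactions.
    static double support(List<Map<Integer, int[]>> data, int item, int attr, int value) {
        int count = 0;
        for (Map<Integer, int[]> tx : data) {
            int[] attrs = tx.get(item);
            if (attrs != null && attrs[attr] == value) {
                count++;
            }
        }
        return (double) count / data.size();
    }

    public static void main(String[] args) {
        // The ratio stays in the same range as the data set grows.
        for (int n : new int[] {1000, 6000}) {
            List<Map<Integer, int[]>> data = generate(n, 10, 42L);
            System.out.printf("%d transactions: support = %.3f%n", n, support(data, 0, 0, 1));
        }
    }
}
```

Because the support is a ratio of counts, multiplying the number of transactions scales the numerator and denominator together, which matches the nearly constant rule counts reported for different data set sizes.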

Fig. 9. The execution times for inter-object association rules. [Plot: Times (Second) versus Mini-support; series: Inter-1000 to Inter-6000.]

Finally, the execution time for intra-object rules under different minimum support values and different numbers of transactions, with an average of 10 purchased items per transaction and a minimum confidence of 0.5, is shown in Fig. 8. The execution time for inter-object rules is shown in Fig. 9. It is clear from Figs. 8 and 9 that the execution time increased with the number of transactions. In addition, finding inter-object association rules took more time than finding intra-object association rules. The second phase of the proposed algorithm is thus the bottleneck in finding the rules.

Fig. 7. The relationship between numbers of rules and numbers of transactions. [Plot: number of rules versus Transactions; series: intra oo, inter oo.]

Fig. 8. The execution times for intra-object association rules. [Plot: Times (Second) versus Mini-support; series: Intra-1000 to Intra-6000.]

7. Conclusion and future work

The object concept has been very popular and is used in a variety of applications, especially for complex data description. An object represents an instance with several related attribute values and methods integrated together. In this paper, we have studied how to mine intra- and inter-object association rules from object transactions. Each item itself is thought of as a class, and each item purchased in a transaction is thought of as an instance. Instances of the same class (item name) may have different attribute values since they may appear in different transactions.

The proposed algorithm can be divided into two main phases. The first phase is called the intra-object mining phase, in which the large itemsets associated with the same classes (items) but with different attributes are derived. This phase finds the association relations within the same kind of objects. Each large itemset found in this phase can thus be thought of as a composite item used in phase 2. The second phase is called the inter-object mining phase, in which the large itemsets of the composite items are obtained to find the relationships among different kinds of objects. Both the intra-object and inter-object association rules can thus be easily derived by the proposed algorithm at the same time. An example has also been given to illustrate the algorithm in detail.

In the future, we will further generalize our approach to manage different types of attribute values in addition to binary ones. Experimental results have also shown the effects of the parameters on the proposed algorithm. Finding inter-object association rules usually takes more time than finding intra-object association rules.
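The two-phase idea summarized in the conclusion can be illustrated with the heavily simplified sketch below. It is not the proposed algorithm itself: phase 1 here keeps only single attribute-value pairs per item and phase 2 only counts pairs of composite items, whereas the apriori-like procedures in the paper grow larger itemsets level by level. The class name `TwoPhaseSketch`, the string encoding `item:aK=value`, the sample data and the choice to measure support against all transactions are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified sketch of intra-object then inter-object mining.
class TwoPhaseSketch {
    // Encodes one attribute value of one item instance, e.g. "Bread:a0=1".
    // Assumption: item names do not contain ':'.
    static String composite(String item, int attr, int value) {
        return item + ":a" + attr + "=" + value;
    }

    // Phase 1 (simplified): large attribute-value pairs within each item class.
    static Set<String> largeComposites(List<Map<String, int[]>> data, double minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, int[]> tx : data) {
            for (Map.Entry<String, int[]> e : tx.entrySet()) {
                for (int a = 0; a < e.getValue().length; a++) {
                    counts.merge(composite(e.getKey(), a, e.getValue()[a]), 1, Integer::sum);
                }
            }
        }
        Set<String> large = new TreeSet<>();
        for (Map.Entry<String, Integer> c : counts.entrySet()) {
            if (c.getValue() >= minSupport * data.size()) {
                large.add(c.getKey());
            }
        }
        return large;
    }

    // Phase 2 (simplified): large pairs of composite items from different item classes.
    static Map<String, Integer> largePairs(List<Map<String, int[]>> data,
                                           Set<String> large, double minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, int[]> tx : data) {
            List<String> present = new ArrayList<>();
            for (Map.Entry<String, int[]> e : tx.entrySet()) {
                for (int a = 0; a < e.getValue().length; a++) {
                    String key = composite(e.getKey(), a, e.getValue()[a]);
                    if (large.contains(key)) {
                        present.add(key);
                    }
                }
            }
            Collections.sort(present); // canonical order so each pair has one key
            for (int i = 0; i < present.size(); i++) {
                for (int j = i + 1; j < present.size(); j++) {
                    String x = present.get(i);
                    String y = present.get(j);
                    boolean sameItem = x.substring(0, x.indexOf(':'))
                            .equals(y.substring(0, y.indexOf(':')));
                    if (!sameItem) {
                        counts.merge(x + " & " + y, 1, Integer::sum);
                    }
                }
            }
        }
        counts.values().removeIf(c -> c < minSupport * data.size());
        return counts;
    }

    // Made-up sample data: four transactions over two items with two binary attributes.
    static List<Map<String, int[]>> sampleData() {
        return List.of(
                Map.of("Bread", new int[] {1, 0}, "Milk", new int[] {1, 1}),
                Map.of("Bread", new int[] {1, 1}, "Milk", new int[] {1, 0}),
                Map.of("Bread", new int[] {1, 0}),
                Map.of("Milk", new int[] {0, 1}));
    }

    public static void main(String[] args) {
        List<Map<String, int[]>> data = sampleData();
        Set<String> large = largeComposites(data, 0.5);
        System.out.println("Phase 1 composite items: " + large);
        System.out.println("Phase 2 large pairs: " + largePairs(data, large, 0.5));
    }
}
```

In this sketch the second phase only examines composite items that survived the first phase, which mirrors the way the large itemsets of phase 1 serve as the input items of phase 2.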