A new relational learning system using novel rule selection strategies




Knowledge-Based Systems 19 (2006) 765–771 www.elsevier.com/locate/knosys

Short communication

Mahmut Uludag a, Mehmet R. Tolun b,*

a European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
b Cankaya University, Department of Computer Engineering, Balgat, 06530 Ankara, Turkey

Received 20 March 2006; accepted 26 May 2006; available online 7 July 2006

Abstract

This paper describes a new rule induction system, rila, which can extract frequent patterns from multiple connected relations. The system supports two different rule selection strategies, namely the select early and the select late strategies. Pruning heuristics are used to control the number of hypotheses generated during the learning process. Experimental results are provided on the mutagenesis and the segmentation data sets. The rule induction algorithm is also compared to similar relational learning algorithms; the results show that it is competitive with them.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Relational rule induction; Rule selection strategies; Pruning

1. Introduction

Modern relational database systems employ many advanced technologies for the efficient management of relational data, such as indexing and query optimization, transaction management, and security support. Compared to a single table of data, a relational database can represent more complex and structured data. As a result of these advantages over other ways of storing and managing data, a significant amount of today's scientific and commercial data is stored in relational databases. It is therefore important to have data mining algorithms that run on relational data in its natural form, without requiring the data to be transformed into one single table.

Integration of data mining systems with relational databases has been receiving attention for almost a decade [1,16,14]. One of the reasons for coupling with relational database systems is to benefit from the advanced search mechanisms they make available.

This coupling also saves data mining systems from loading the complete training data into their run-time memory.

Theoretically, any relational database can be transformed into a single universal relation to make it usable by traditional data mining systems. However, in practice this can lead to relations of unmanageable size, especially when there are recursive relations in the schema. Flattening relational data can result in a combinatorial explosion in either the number of instances or the number of attributes, depending upon whether one decides to duplicate or aggregate [15].

Traditional relational learning algorithms were generally designed for relational data stored in Datalog/Prolog servers. These algorithms are usually called ILP-based algorithms [13]. There have been efforts to couple ILP-based algorithms with relational database systems [3]. ILP-based algorithms are well suited to discovering recursive relations in a generic way. A good example of a working, modern ILP-based system is the relational data mining system of PharmaDM¹.

* Corresponding author. Tel.: +905334754399; fax: +903122848043. E-mail addresses: [email protected] (M. Uludag), [email protected] (M.R. Tolun).
0950-7051/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2006.05.004

¹ Examples of relational rules generated by the PharmaDM data mining system can be found at the PharmaDM web site, www.pharmadm.com.


Fig. 1. The basic architecture of the rila induction system. (The induction system, which performs hypotheses construction, rule selection, pruning, and conversions to/from SQL, communicates with the DBMS through a JDBC driver: SQL and metadata queries are sent to the DBMS, which returns result sets and metadata.)

In the next section, the architecture of the rila system is described. This is followed by sections on query generation, pruning, and rule selection. After presenting the experimental results on two different data sets, we report a brief comparison of rila with two other relational learning systems.

2. Architecture of the system

The architecture of the rila system and its basic components are shown in Fig. 1. The algorithm has hypotheses construction and rule selection components, as in other rule induction algorithms. However, being a system for relational data, it also has components to traverse a relational schema, to translate hypothesis and rule objects into SQL, and to interpret the results returned from a database system. The system is written in Java and uses the Java JDBC API to communicate with database management systems. It has a simple graphical user interface through which users can set the learning and pruning parameters and start/stop a learning task. The database connection and table selection wizard allows users to connect to a database and select the set of tables that stores the training data for the learning task. The wizard also allows the user to specify the target table², and the class attribute for the learning task.

After a learning process is started, meta-data queries are sent to the connected DBMS to retrieve column and foreign/primary key information for the selected tables. After this initialization phase, the SQL queries sent to the database management system are generally for building valid hypotheses about the data. Basically, the system sends SQL queries to the database system and then, by analyzing the results of these queries, generates new hypotheses. These two steps are repeated many times to further analyze the training data.

Each time a new rule is selected, the examples covered by the selected rule are separated from the active search space. Instead of deleting the associated rows from the input relational tables, a temporary relational table is used to store the identifiers of the examples covered by the selected rules (using the primary key in the target table as an identifier for each example). The examples are excluded from the search space by a join to the temporary table in the generated SQL queries. This strategy is important in keeping the input relational data in its original form during a learning process, so the data remains available for other processes. The temporary table stores only the identifiers of the examples in the current class for which classification rules are being searched, and it is cleared after each class has been processed.

3. Query generation

Relational queries are the basic building blocks of the hypothesis generation and rule selection steps. During the hypothesis generation step, relational queries are required to gather frequency information about the training data; during the rule selection step, they are required to evaluate the candidate rules. Initial hypotheses are composed of only one condition. During initial hypotheses construction, the following template is used to generate SQL queries that find hypotheses together with their frequency values.

² A target table contains unique rows for objects to be analyzed during a learning task [12]. This table is also called 'primary table' in [7], and 'master table' in [18].

Select attr, count(distinct targetTable.pk)
from covered, path.getTableList()
where path.getJoins()
and targetTable.classAttr = currentClass
and covered.id = targetTable.pk
and covered.mark = 0
group by attr

In the above template,

• attr is the name of the current attribute column,
• targetTable is the target table,
• pk is the name of the primary key column in the target table,
• covered is the name of the temporary table where identifiers of the objects covered by the selected rules are stored,
• path refers to the path object that links the current table to the target table. Path objects have two methods: the getTableList method returns the list of tables in the path in comma-separated form, while the getJoins method returns the list of conditions that represent the joins between the target table and the last table in the path. If a table is visited more than once in a given path, table aliases are used to distinguish the different instances of the same table,
• classAttr is the column representing the class attribute for the learning task,
• currentClass is the current class for which the hypotheses are being searched.

The template is applied to each column in the training data except the foreign and primary key columns and the class attribute column. The column representing the class attribute of the learning task is assumed to be in the target table. In order to remove this limitation, a table would have to be identified as the class table, and the getTableList and getJoins methods would have to be revised so that they also include the path between the class table and the target table.

In a levelwise search, the set of hypotheses generated in a previous level is refined by adding further conditions in the next level. The refinement process usually continues until a predetermined maximum size for the rules is reached. Refinement of a relational hypothesis results in a new selection of objects that is a subset of the selection associated with the original hypothesis³. The following template is used to generate SQL queries for refining the most promising hypotheses in the set of current hypotheses.

Select attr, count(distinct targetTable.pk)
from covered, hypothesis.getTableList(path)
where classAttr = currentClass
and path.getJoins()
and hypothesis.toSQL()
and covered.id = targetTable.pk
and covered.mark = 0
group by attr

In the above template, attr, targetTable, pk, covered, path, classAttr, and currentClass have the same meanings as in the initial hypotheses template. Here, hypothesis is the hypothesis object being refined. Hypothesis objects have the following two methods to help the SQL construction process:

• the getTableList method of the hypothesis class returns the list of the tables in the conditions of a hypothesis, plus the tables in the paths that connect each test feature to the target table and the list of tables in the current path,
• the toSQL method of the hypothesis class returns the list of joins for the features in the hypothesis plus the list of joins that connect each feature to the target attribute.
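To make the templates concrete, the sketch below shows how they might expand for a hypothetical two-table schema in the style of the mutagenesis data set: a target table molecule(mol_id, class) linked by a foreign key to a table atom(atom_id, mol_id, element, charge). All table, column, and class names here are illustrative assumptions, not taken from the paper.

-- Initial hypotheses on atom.element for the class 'active':
-- counts, per element value, the distinct not-yet-covered
-- molecules of the current class having an atom with that element.
Select atom.element, count(distinct molecule.mol_id)
from covered, molecule, atom
where atom.mol_id = molecule.mol_id
and molecule.class = 'active'
and covered.id = molecule.mol_id
and covered.mark = 0
group by atom.element

-- Refining the hypothesis "atom.element = 'c'" with a second
-- attribute, atom.charge; the existing condition is contributed
-- by hypothesis.toSQL().
Select atom.charge, count(distinct molecule.mol_id)
from covered, molecule, atom
where molecule.class = 'active'
and atom.mol_id = molecule.mol_id
and atom.element = 'c'
and covered.id = molecule.mol_id
and covered.mark = 0
group by atom.charge

A minimum support threshold (Section 4) could be enforced directly in such queries with a having clause, e.g. having count(distinct molecule.mol_id) >= 5, although the paper does not state whether rila pushes this check into SQL.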

³ One important point in refining existing hypotheses is to avoid redundant hypothesis generation. An ideal refinement mechanism should not produce two hypotheses with the same conditions in different orders. For this purpose, rila uses a hash table to remember the attribute combinations previously used.


4. Pruning

In order to find satisfactory solutions to large data mining problems within a reasonable time, one needs heuristics that reduce the number of hypotheses to be processed. Heuristics used for this purpose are called pruning heuristics. The best known pruning heuristic is minimum support pruning [17], which is used by most data mining systems. Although minimum support pruning is an effective technique, by itself it is not always enough to eliminate weak hypotheses that are unlikely to produce strong hypotheses when refined. For large problems, more complex pruning techniques are generally needed to further reduce the search space.

The optimistic estimate pruning technique is one of the pruning techniques that many previous machine-learning algorithms have used [18]. It is also widely known as beam search [6]. This technique exploits the fact that we are interested only in the n best solutions; if a hypothesis and its descendants cannot make it into the top-n list, it is pruned. If the parameter n is not selected large enough, this may result in the myopia problem [4]. In rila, both the minimum support pruning heuristic and the optimistic estimate pruning heuristic are used.

5. Rule selection

The hypothesis construction and rule selection steps of an inductive algorithm can be ordered in several different ways. For example, while one strategy may activate rule selection each time a group of hypotheses has been constructed, another strategy may activate rule selection only after all hypotheses have been enumerated. The rila relational learning system supports two rule selection strategies, namely the select early and the select late strategies. The select early strategy [19] is similar to the rule selection strategies of covering-type learning algorithms, where new rules are selected several times during the intermediate stages of a learning process. The select late strategy, on the other hand, postpones rule selection until all possible hypotheses have been enumerated; it is similar to the rule selection strategy used by the well-known Apriori association rule induction algorithm, where rules are selected after the hypothesis search process is completed. The main difference between the two strategies is therefore the rate of recurrence of rule selection: each time hypothesis construction is completed for a level, the select early strategy selects rules among the hypotheses built so far, whereas the select late strategy waits until all possible hypotheses have been enumerated in all levels for the active class.


(Rila works on each class in turn: if the class attribute has three different values, the learning loop is repeated three times.) The main steps of the algorithm using the select late strategy are shown in Fig. 2. First, hypotheses with one condition are generated for the active class. Then the algorithm moves to level 2 and builds new hypotheses by refining the best n hypotheses generated in the previous level. These steps are repeated until level m is reached. The parameter m is defined by the user at the beginning of a learning process; it limits the maximum number of attributes in the rules generated. Finally, rule selection starts.

Details of the rule selection process are shown in Fig. 3. In the rule selection step, first the hypothesis with the maximum score is selected as a candidate rule. If its score is positive, the candidate rule is asserted as a new rule, the hypothesis is removed from the active hypothesis set, and the examples covered by the new rule are marked in the temporary table as 'covered'. In the next step, the hypothesis that now has the maximum score is considered as the candidate rule. The score of the candidate rule is recalculated using its effective cover, which is determined by counting the examples covered by the candidate rule that were not covered by the previously selected rule(s). This recalculated score is then compared to the score of the next hypothesis in the tree of hypotheses sorted by score. If the score of the candidate rule is higher than the score of the next hypothesis, the candidate rule is asserted as a new rule, the hypothesis is removed from the active hypothesis set, and the examples covered by the new rule are marked in the temporary table as 'covered'. Otherwise, the score of the hypothesis is set to the recalculated score of the candidate rule and the hypothesis is inserted back into the sorted tree. The rule selection process then continues with the next hypothesis, which now has the highest score. This is repeated until all examples in the active class are covered by the generated rules or until there are no more hypotheses with a positive score.
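As a sketch of how the effective cover of a candidate rule could be obtained with the same query machinery, the hypothetical query below counts the examples that satisfy the candidate rule's conditions and are not yet marked as covered. It reuses the illustrative molecule/atom schema introduced after the query templates in Section 3; the paper does not spell out this query, so the details are assumptions.

-- Effective cover of the candidate rule
-- "class = active if atom.element = 'c'": distinct molecules of
-- the class that satisfy the rule and are not covered by
-- previously selected rules.
Select count(distinct molecule.mol_id)
from covered, molecule, atom
where atom.mol_id = molecule.mol_id
and atom.element = 'c'
and molecule.class = 'active'
and covered.id = molecule.mol_id
and covered.mark = 0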

Fig. 2. The rila algorithm using the select late strategy. (Flowchart: level = 1; build initial hypotheses; while level < m, increment level and refine the current hypotheses; then select rules and end.)

6. Experiments

Experiments have been conducted on one relational data set and on one propositional data set. The relational data set is the mutagenesis data set⁴ on the mutagenic activity of molecules [11]. The segmentation data set⁵ used in the experiments is a propositional data set which is distributed as an example data set with the Weka machine learning software suite [17]. The cross-validation technique was used to assess the results on the mutagenesis data set, whereas the results on the segmentation data set were assessed using the separate test set provided with the training set. The experiments were run on an Intel Pentium 4 1.2 GHz PC.

6.1. Experiments on the mutagenesis data set

For the mutagenesis data set tests, the maximum size for the hypotheses (parameter m) was selected as 2, and the penalty factor [19] was selected as 20. The class with the maximum number of examples (the majority class) was selected as the default class (for this reason the coverage of the rule sets on the test set was always 100%). The rules were ordered by their score before they were applied to the test set for prediction. The set of experiments presented in Table 1 was performed using the select late strategy. The minimum f-score [17] for the rules was set to 0.001; hypotheses having a score less than this minimum were not refined during the hypothesis search process. As a result of the optimistic estimate pruning, the second experiment was completed in about 50% of the time the first experiment required, with a more compact rule set (80 conditions vs. 118). The accuracy of the rule set in the second experiment was also better than the accuracy in the first experiment. The best accuracy recorded for rila was 95.75%. This result is better than the accuracy of 87.5% recently reported in [2]. It is also better than the accuracy of 89.4% reported by [11].

6.2. Experiments on the segmentation data set

The instances in the segmentation data set were drawn randomly from a database of seven outdoor images by the Vision Group of the University of Massachusetts. Each instance is a 3 × 3 region represented by 19 continuous attributes. The training data includes 210 instances while the test data includes 2100 entries. Numeric attributes were discretized using the default options of the Weka discretization filter.

⁴ The mutagenesis data set used in the experiments and the rila learning system can be downloaded from the web site of the project at http://cmpe.emu.edu.tr/rila.
⁵ The segmentation data set can be downloaded from the Weka web site at http://www.cs.waikato.ac.nz/ml/weka.


Fig. 3. Rule selection algorithm when using the select late strategy. (Flowchart: select the hypothesis with the highest score; if the score is not positive, end. Otherwise, consider the selected hypothesis as the new candidate rule, find the number of new examples covered by the candidate rule, and calculate its score. If the candidate rule's score is better than the next hypothesis's score, assert it as a new rule; otherwise push the hypothesis back into the sorted tree after changing its score to the score of the candidate rule. Repeat until all examples are covered.)

Table 1
Cross-validation test results using the select late strategy on the mutagenesis data set

                        All hypotheses          Maximum five hypotheses
                        extended in each level  extended in each level
Time (seconds)          125                     61
Number of rules         69                      50
Number of conditions    118                     80
Accuracy (%)            95.21                   95.75

The maximum size for the hypotheses, parameter m, was set to 5, and the penalty factor was selected as 15. The data set was discretized using the default discretize filter in Weka. Before the discretization process, the training and the test sets were merged into one single data set; the discretized data set was split back into the test and training sets afterwards. Table 2 shows the test results using the select early strategy.

Table 2
Test results on the segmentation data set using the select early strategy

Max. number of hypotheses to extend (n)    20       30       40       50
Time (seconds)                             284      401      509      589
Number of rules                            39       37       38       36
Number of conditions                       69       65       67       63
Training set coverage (%)                  89.4     92.73    97.2     98.73
Training set accuracy (%)                  95       96       96       96
Test set accuracy (%)                      85.55    89.01    94.69    97.41

The time required by the learning process and the accuracy on the test set increase steadily as the number of hypotheses extended in each refinement process (the beam width) increases. However, the size of the final rule set does not increase steadily with the beam width. For example, the size of the final rule set (the total number of conditions) decreases from 69 to 65 as the beam width increases from 20 to 30; similarly, the rule set size decreases from 67 to 63 as the beam width increases from 40 to 50. This is due to the fact that, as the number of evaluated hypotheses increases, it becomes possible to find better rules which are supported by more examples.

The Weka machine learning suite was used to obtain the accuracy of the classifiers generated by other well-known algorithms on the discretized segmentation data set. Table 3 shows the accuracy of the classifiers produced by seven different algorithms available in Weka and the accuracy of the best classifier produced by rila. All algorithms, except rila, were executed using their default options. The best accuracy on the test set, 97.41%, was observed using the select early strategy of rila; this was better than the accuracy of the classifiers generated by all seven other algorithms.

Table 3
Test results on the discretized segmentation data set using different learning algorithms

Algorithm       Accuracy (%)
ID3             93.56
Prism           91.37
JRip            92.78
Part            91.05
Naïve Bayes     93.56
IB1             92.78
KStar           92.78
Rila            97.41


Table 4
A comparison of rila with two other relational learning systems

                                  rila                  MRDTL                  WARMR
Type of the models generated      Classification rules  Decision trees         Association rules
Location of data                  Relational database   Relational database    Prolog/Datalog
                                  management system     management system      server
Recursive relations               Yes                   No                     Yes
Temporary tables on the
database management system        Yes                   No                     No
Discretization                    Weka                  Can handle continuous  -
                                                        attributes

7. Comparison with other systems

The rila system was compared with two other relational learning systems. The first system compared, WARMR, uses a relational association rule mining algorithm based on the Apriori algorithm. Unlike rila, WARMR is an ILP-based system which has been developed in the context of a logic programming environment [8]. The second algorithm compared, MRDTL, has a client–server architecture similar to rila's, where the data mining system is a client of the database management system. Unlike rila and WARMR, MRDTL produces classifiers in the form of relational decision trees and can handle continuous attributes without discretizing them first. Table 4 summarizes the comparison of rila with the MRDTL and WARMR algorithms.

Unlike the two other systems, rila uses temporary tables on the database management system to store temporary information about a data mining job while it is running. One of the temporary tables is for storing discretization data, and the other is for storing the identifiers of the objects covered by the rules selected. The temporary tables are used as a means to pass more work to the database management system (which is known to be efficient at joining data) that would normally be done on the client side. This work includes the joins with these two temporary tables, either to exclude marked items from further search or to find the associated discretization intervals for numeric attribute values. The temporary tables on the database are also used to minimize the data transfer between the data mining system and the database management system.
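The paper does not give the SQL for managing these temporary tables, but their life cycle can be sketched as follows. The statements below are illustrative assumptions built on the same hypothetical molecule/atom schema used earlier; the actual table layouts in rila are not specified beyond the id and mark columns used in the query templates.

-- Temporary table of covered examples, keyed by the target
-- table's primary key; mark = 0 means not yet covered.
create table covered (id integer primary key, mark integer);

-- At the start of a class, register all examples of that
-- class as uncovered.
insert into covered (id, mark)
select mol_id, 0 from molecule where class = 'active';

-- After a rule such as "class = active if atom.element = 'c'"
-- is selected, mark the examples it covers.
update covered set mark = 1
where id in (select molecule.mol_id
             from molecule, atom
             where atom.mol_id = molecule.mol_id
             and atom.element = 'c');

-- After the class has been processed, clear the table for
-- the next class.
delete from covered;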

8. Discussion and conclusion

In this paper, a new relational rule learning algorithm is presented. Based on this algorithm, the relational rule induction system rila has been developed. Pruning heuristics have been incorporated into the system to speed up the learning process. Experimental results were provided on two widely used example data sets⁶, namely the mutagenesis and segmentation data sets. The results of the experiments on the widely used mutagenesis data set indicate that rila can generate rule sets better than well-known relational learning algorithms. The accuracy result was approximately 8% better than the best result reported by [2]. It was also better than the accuracy reported by the originators of the mutagenesis data set [11], by 6.4%. An early version of the rila algorithm was also tested on the KDD Cup 2001 genes data set [5]. An executable of the rila system and the example data sets can be downloaded from the rila web site at http://cmpe.emu.edu.tr/rila/.

One of the factors responsible for the slow growth of the multi-relational data mining research area is the limited scalability of multi-relational algorithms [10]. The rila system is better positioned than ILP-based systems from the scalability point of view because, unlike ILP-based relational learning systems, rila directly uses data in relational database systems and collaborates with the database management systems through SQL queries. It therefore profits from the query optimizers and efficient query execution procedures that characterize modern relational database systems.

Most of the existing relational rule induction systems described in the current literature are either for association rule learning, such as [9], or have been designed for relational data stored in some kind of Prolog/Datalog server, such as WARMR, or require a copy of the relational data in the learning process's internal memory, such as most graph-mining algorithms. It is therefore worth noting that rila is one of the earliest supervised relational rule induction systems described in the open literature that is able to mine relational data stored in relational databases without requiring a copy of the data in its internal memory.

Our experience in adapting a propositional learning algorithm to the relational domain can be useful for similar projects in the future. This experience also contributes to the general knowledge of relational learning, as this study describes its own approach to relational learning, which differs from other approaches in the ways described above. As indicated in a recent review paper [10], one of the bottlenecks preventing the wider use of multi-relational data mining (MRDM) algorithms is that relational algorithms tend to be more difficult to understand than propositional ones; the review invites MRDM researchers to present their algorithms in the most accessible form possible.

⁶ An early version of the rila system was also applied to the KDD Cup 2001 genes data set.


References

[1] R. Agrawal, K. Shim, Developing tightly coupled data mining applications on a relational database system, in: Proceedings of KDD-96, 1996.
[2] A. Atramentov, H. Leiva, V. Honavar, A multi-relational decision tree learning algorithm: implementation and experiments, in: Proceedings of the 13th International Conference on Inductive Logic Programming, Springer-Verlag, Berlin, 2003.
[3] H. Blockeel, M. Sebag, Scalability and efficiency in multi-relational data mining, SIGKDD Explorations, Special Issue on Multi-Relational Data Mining 5 (1) (2003).
[4] L.P. Castillo, S. Wrobel, A comparative study on methods for reducing myopia of hill-climbing search in multi-relational learning, in: Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[5] J. Cheng, M. Krogel, J. Sese, C. Hatsiz, S. Morishita, H. Hayashi, D. Page, KDD Cup 2001 report, SIGKDD Explorations 3 (2) (2002).
[6] P. Clark, T. Niblett, The CN2 induction algorithm, Machine Learning 3 (1989) 261–283.
[7] V. Crestana-Jensen, N. Soparkar, Frequent item-set counting across multiple tables, in: Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 49–61.
[8] L. Dehaspe, L. De Raedt, Mining association rules with multiple relations, in: Proceedings of the 7th International Workshop on Inductive Logic Programming, Lecture Notes in Artificial Intelligence, vol. 1297, Springer-Verlag, 1997, pp. 125–132.
[9] L. Dehaspe, H. Toivonen, Discovery of relational association rules, in: S. Dzeroski, N. Lavrac (Eds.), Relational Data Mining, Springer-Verlag, 2001, pp. 189–212.


[10] P. Domingos, Prospects and challenges for multi-relational data mining, SIGKDD Explorations, Special Issue on Multi-Relational Data Mining 5 (1) (2003).
[11] R.D. King, S.H. Muggleton, A. Srinivasan, M.J.E. Sternberg, Structure–activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming, Proceedings of the National Academy of Sciences of the United States of America 93 (1) (1996) 438–442.
[12] A.J. Knobbe, A. Siebes, D.M.G. Van der Wallen, Multi-relational decision tree induction, in: Proceedings of the 3rd European Conference on Principles of Data Mining and Knowledge Discovery, 1999, pp. 378–383.
[13] S. Muggleton, Inductive Logic Programming, Academic Press, 1992.
[14] A. Netz, S. Chaudhuri, U. Fayyad, Integration of data mining and relational databases, in: Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
[15] J. Neville, D. Jensen, Supporting relational knowledge discovery: lessons in architecture and algorithm design, in: Proceedings of the Data Mining Lessons Learned Workshop, 19th International Conference on Machine Learning, 2002.
[16] F. Provost, V. Kolluri, Scaling up inductive algorithms: an overview, in: Proceedings of KDD-97, 1997.
[17] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, CA, 2000.
[18] S. Wrobel, An algorithm for multi-relational discovery of subgroups, in: Proceedings of Principles of Data Mining and Knowledge Discovery (PKDD '97), 1997.
[19] M. Uludag, M.R. Tolun, T. Etzold, A multi-relational rule discovery system, in: Proceedings of the 18th International Symposium on Computer and Information Sciences, Lecture Notes in Computer Science, vol. 2869, Springer-Verlag, 2003, pp. 252–259.