Knowledge-Based Systems 23 (2010) 743–756
Concept discovery on relational databases: New techniques for search space pruning and rule quality improvement Y. Kavurucu, P. Senkul *, I.H. Toroslu Middle East Technical University, Department of Computer Engineering, 06531 Ankara, Turkey
Article info
Article history: Received 4 February 2010; Received in revised form 20 April 2010; Accepted 21 April 2010; Available online 28 April 2010.
Keywords: ILP; Data mining; MRDM; Concept discovery; Transitive rules; Support; Confidence.

Abstract
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency of storing data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with intractably large search spaces and to be able to generate high-quality patterns. In this work, we introduce an ILP-based concept discovery framework named Concept Rule Induction System (CRIS), which includes new approaches for search space pruning and new features, such as defining aggregate predicates and handling numeric attributes, for rule quality improvement. In CRIS, all target instances are considered together, which leads to the construction of more descriptive rules for the concept. This property also makes it possible to use aggregate predicates more accurately in concept rule construction. Moreover, it facilitates the construction of transitive rules. A set of experiments is conducted in order to evaluate the performance of the proposed method in terms of accuracy and coverage. © 2010 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +90 312 2105518. E-mail addresses: [email protected] (Y. Kavurucu), [email protected] (P. Senkul), [email protected] (I.H. Toroslu). URL: http://www.ceng.metu.edu.tr/karagoz/ (P. Senkul).
0950-7051/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2010.04.011

1. Introduction

The amount of data collected in relational databases has been increasing due to the growing use of complex data in real-life applications. This has motivated the development of multi-relational learning algorithms that can be applied directly to multi-relational data in databases [19,17]. For such learning systems, first-order predicate logic is generally employed as the representation language. Learning systems that induce logical patterns valid for given background knowledge have been investigated under a research area called Inductive Logic Programming (ILP) [46]. In general, using logic in data mining is a common technique in the literature [52,47,57,12,63,67,39,23,49,50,5,40,35,10,36,64,16,55,18,51,65,66].

A concept is a set of patterns to be discovered by using the hidden relationships in the database. Concept discovery in relational databases is a predictive learning task. In predictive learning, there is a specific target concept to be learned in the light of past experiences [45]. The problem setting of the predictive learning task introduced by Muggleton in [45] can be stated as follows: given a target class/concept C (target relation), a set E of positive and negative examples of the class/concept C, a finite set of background facts/clauses B (background relations), and a concept description language L (language bias); find a finite set of clauses H, expressed in the concept description language L, such that H together with the background knowledge B entails all positive instances E(+) and none of the negative instances E(−). In other words, H is complete and consistent with respect to B and E, respectively.

Association rule mining in relational databases is a descriptive learning task. In descriptive learning, the task is to identify frequent patterns, associations or correlations among sets of items or objects in databases [45]. Relational association rules are expressed as query extensions in first-order logic [11,13]. In the proposed work, there is a specific target concept, and association rule mining techniques are employed to induce association rules which have the target concept as the only head relation.

In this paper, we present Concept Rule Induction System (CRIS), which is a concept learning ILP system that employs relational association rule mining concepts and techniques to find frequent and strong concept definitions according to a given target relation and background knowledge [31]. CRIS utilizes the absorption operator of inverse resolution for generalization of concept instances in the presence of background knowledge and refines these general patterns into frequent and strong concept definitions with an APRIORI-based specialization operator based on confidence.
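The problem setting above can be sketched concretely on the ancestor example used later in this paper (Table 1). The following is our own illustrative script, not the CRIS implementation: it forward-chains a hypothesis H over background facts B and checks that H covers exactly the positive instances E(+).

```python
# Illustrative sketch of the ILP problem setting: B ∧ H must entail all of
# E(+). Function and variable names are our own, not from CRIS.

def derive(background, rules, max_iter=100):
    """Forward-chain the hypothesis rules over the background facts."""
    derived = set()
    for _ in range(max_iter):
        new = rules(background, derived) - derived
        if not new:
            break
        derived |= new
    return derived

# B: background facts (the parent relation from Table 1)
parent = {("kubra", "ali"), ("ali", "yusuf"),
          ("yusuf", "esra"), ("yusuf", "aysegul")}

# H: ancestor(A,B) <- parent(A,B).
#    ancestor(A,B) <- parent(A,C), ancestor(C,B).
def rules(parent, ancestor):
    out = set(parent)  # first rule
    for (a, c) in parent:  # second (recursive) rule
        for (c2, b) in ancestor:
            if c == c2:
                out.add((a, b))
    return out

ancestor = derive(parent, rules)

# E(+): the nine positive concept instances from Table 1
e_pos = {("kubra", "ali"), ("ali", "yusuf"), ("yusuf", "esra"),
         ("yusuf", "aysegul"), ("kubra", "yusuf"), ("kubra", "esra"),
         ("kubra", "aysegul"), ("ali", "esra"), ("ali", "aysegul")}

# H is complete (covers all of E(+)) and, here, derives nothing else.
print(ancestor == e_pos)  # True
```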
1.1. Contributions Major contributions and the main features of this work can be listed as follows:
1. The selection order of the target instances (the order in the target relation) may change the resulting hypothesis set: in each coverage set, the induced rules depend on the selected target instance, and the covered target instances in each step have no effect on the rules induced in the following coverage steps. To overcome this problem, first, all possible values for each argument of a relation are determined by executing simple SQL statements on the database. Instead of selecting a single target instance, these values for each argument are used in the generalization step of CRIS. In this way, the generated rules do not depend on the instance selection order, and induced rule quality is improved.
2. This technique facilitates the generation of transitive rules, as well. When the target concept shares attribute types with only some of the background predicates, the rest of the predicates (called unrelated relations) can never take part in the hypothesis. This prevents the generation of transitive rules through such predicates. In CRIS, since all target instances are considered together, there is no distinction between related and unrelated relations, and hence transitive rules can be induced.
3. Better rules (with higher accuracy and coverage) can be discovered by using aggregate predicates in the background knowledge. To do this, aggregate predicates are defined in first-order logic and used in CRIS. In addition, numerical attributes are handled in a more accurate way: rules having comparison operators on numerical attributes are defined and used in the main algorithm.
4. CRIS utilizes the primary key-foreign key relationship (if it exists) between the head and body relations as a search space pruning strategy. If such a relationship exists between the head and the body predicates, the foreign key argument of the body relation can only have the same variable as the primary key argument of the head predicate in the generalization step.
5.
The main difficulty in relational ILP systems is searching intractably large hypothesis spaces. In order to reduce the search space, a confidence-based pruning mechanism is used. In addition, many multi-relational rule induction systems require the user to specify the input–output modes of predicate arguments. Instead, we use the information about the relationships between entities in the database, if given.
6. Muggleton shows in [48] that the expected error of a hypothesis learned from positive-only examples versus all (positive and negative) examples does not differ much if the number of examples is large enough. Most ILP-based concept learning systems take background facts in the Prolog language; this restricts the usage of ILP engines in real-world applications due to the time-consuming transformation of the problem specification from tabular to logical format. The proposed system works directly on relational databases, which contain only positive information, without any requirement of negative instances. Moreover, the definition of confidence is modified to apply the Closed World Assumption (CWA) [53] in relational databases. We introduce type relations into the body of the rules in order to express CWA.

In [31], the contribution presented in the first item of the above list was introduced without performance evaluation. In [33], the basics of aggregate predicate usage are presented. In this work, the features of CRIS are elaborated in more detail, with performance evaluation results on several data sets. In [29,28,30,32], features of another concept discovery system developed by our research group, namely C2D, are presented. Although CRIS and C2D have common properties, such as the use of only positive instances, the concept discovery algorithm of CRIS has different properties and advantages, which are presented, discussed and evaluated in this work.
This paper is organized as follows: Section 2 gives preliminary information about concept discovery in general and the concepts employed in CRIS. Section 3 presents the related work. Section 4 describes the proposed method. Section 5 presents the experiments conducted to discuss the performance of CRIS. Finally, Section 6 includes concluding remarks.

2. Preliminaries

In this section, basic terminology in concept discovery and the basics of concept representation and discovery are introduced.

2.1. Basics

A concept is a set of patterns which are embedded in the features of the instances of a given target relation and in the relationships of this relation with other relations. In this work, a concept is defined through concept rules.

Definition 1. [Concept rule] A concept rule (or shortly rule) is an association rule (range-restricted query extension). It is represented as "h ← b", where h is the head of the rule and b denotes the body of the rule.

Definition 2. [Target relation] A target relation is a predicate that corresponds to the concept to be discovered. The instances of the target relation have to be correctly covered by the discovered pattern. If the discovered pattern is in the form of rules (as in this work), the target relation appears in the head of the rule. In recursive rules, it may take part in the body, as well.

Definition 3. [Background relation] A background relation is a predicate that is different from the target relation and is involved in the concept discovery. When the discovered pattern is in the form of rules, a background relation may appear in the body of the rule.

In Table 1, the relation given in the first column, ancestor, is the target relation. The content of the first column constitutes the target instances. For this example, one of the concept rules defining the concept is "ancestor(A, B) ← parent(A, B)". We use first-order logic as the language to represent data and patterns. The concept rule structure is based on query extension.
However, to emphasize the difference from the classical clause and query, we first present definitions for these terms.

Definition 4. [Clause] A clause is a universally quantified disjunction ∀(l1 ∨ l2 ∨ … ∨ ln). When it is clear from the context that clauses are meant, the quantifier ∀ is dropped. A clause h1 ∨ h2 ∨ … ∨ hp ∨ ¬b1 ∨ ¬b2 ∨ … ∨ ¬br, where the hi are positive literals and the ¬bj are negative literals, can also be written as h1 ∨ h2 ∨ … ∨ hp ← b1 ∧ b2 ∧ … ∧ br, where h1 ∨ h2 ∨ … ∨ hp
Table 1. The database of the ancestor example with type declarations.

Concept instances      Background facts
a(kubra, ali).         p(kubra, ali).
a(ali, yusuf).         p(ali, yusuf).
a(yusuf, esra).        p(yusuf, esra).
a(yusuf, aysegul).     p(yusuf, aysegul).
a(kubra, yusuf).
a(kubra, esra).
a(kubra, aysegul).
a(ali, esra).
a(ali, aysegul).
(p ≥ 0) is called the head of the clause, and b1 ∧ b2 ∧ … ∧ br (r ≥ 0) is called the body of the clause. This representation can be read as "h1 or … or hp if b1 and … and br".

Definition 5. [Definite clause] A definite clause is a clause which has only one head literal. A definite clause with an empty body is called a fact. A denial is a clause with an empty head.

Definition 6. [Query] A query is an existentially quantified conjunction ∃(l1 ∧ l2 ∧ … ∧ ln). When it is clear from the context that queries are meant, the quantifier ∃ is dropped. A query ∃(l1 ∧ … ∧ lm) corresponds to the negation of a denial ∀(← l1 ∧ … ∧ lm).

Definition 7. [Query extension] A query extension is an existentially quantified implication ∃(l1 ∧ l2 ∧ … ∧ lm) → ∃(l1 ∧ l2 ∧ … ∧ lm ∧ lm+1 ∧ … ∧ ln), with 1 ≤ m < n. To avoid confusion with clauses (which are also implications), we write it as l1 ∧ l2 ∧ … ∧ lm ⇝ lm+1 ∧ … ∧ ln. We call the query l1 ∧ l2 ∧ … ∧ lm the body and the query lm+1 ∧ … ∧ ln the head of the query extension [11]. In [13], relational association rules are called query extensions.

Definition 8. [Range-restricted query] A range-restricted query is a query in which all variables that occur in negative literals also occur in at least one positive literal. A range-restricted query extension is a query extension such that both its head and body are range-restricted queries.

Definition 9. [θ-subsumption] A definite clause C θ-subsumes¹ a definite clause C′, i.e. C is at least as general as C′, if and only if there exists a substitution θ such that:

head(C) = head(C′) and body(C)θ ⊆ body(C′).

¹ A substitution θ is a set {X1/t1, …, Xm/tm}, where each Xi is a variable such that Xi = Xj ⟺ i = j, each ti is a term different from Xi, and each element Xi/ti is called a binding for the variable Xi.

In this work, we adapt the θ-subsumption definition for query extensions. Two basic steps in the search for a correct theory are specialization and generalization [38]. If a theory covers negative examples, it is too strong and needs to be weakened; in other words, a more specific theory should be generated. This process is called specialization. On the other hand, if a theory does not imply all positive examples, it is too weak and needs to be strengthened; in other words, a more general theory should be generated. This process is called generalization. Specialization and generalization steps are repeated to adjust the induced theory in the overall learning process.

2.2. Support and confidence

Two criteria are important in the evaluation of a candidate concept rule: how many of the concept instances are captured by the rule (coverage) and the proportion of the objects which truly belong to the target concept among all those that show the pattern of the rule (accuracy); these correspond to support and confidence, respectively. Therefore, the system should assign a score to each candidate concept rule according to its support and confidence values.

Definition 10. [Support] The support value of a concept rule C is defined as the number of different bindings for the variables in the head relation that satisfy the rule, divided by the number of different bindings for the variables in the head relation. In other words, it is the ratio of the number of positive target instances captured by the rule over the number of target instances. Let C be h ← b; then

support(h ← b) = |bindings of variables for h that satisfy h ← b| / |bindings of variables for h that satisfy h|

Definition 11. [Confidence] The confidence of a concept rule C is defined as the number of different bindings for the variables in the head relation that satisfy the rule, divided by the number of different bindings for the variables in the head relation that satisfy the body literals. In other words, it is the ratio of the number of target instances captured by the rule over the number of instances that are deducible from the body literals in the rule. Let C be h ← b; then

confidence(h ← b) = |bindings of variables for h that satisfy h ← b| / |bindings of variables for h that satisfy b|
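The θ-subsumption test of Definition 9 can be illustrated with a small sketch. This is our own helper, not part of CRIS: it checks the condition for a *given* substitution θ (the full test would search over candidate substitutions), with literals represented as tuples and clause bodies as lists of literals.

```python
# Hypothetical helper illustrating theta-subsumption (Definition 9):
# C theta-subsumes C' iff head(C) = head(C') and body(C)θ ⊆ body(C').

def apply_subst(literal, theta):
    """Apply substitution theta to one literal (pred, arg1, arg2, ...)."""
    pred, *args = literal
    return (pred, *[theta.get(a, a) for a in args])

def subsumes(head_c, body_c, head_c2, body_c2, theta):
    """Check the subsumption condition under a given substitution theta."""
    if apply_subst(head_c, theta) != head_c2:
        return False
    return {apply_subst(l, theta) for l in body_c} <= set(body_c2)

# C : a(A, B) <- p(A, C).        C': a(A, B) <- p(A, C), a(C, A).
c_head, c_body = ("a", "A", "B"), [("p", "A", "C")]
c2_head, c2_body = ("a", "A", "B"), [("p", "A", "C"), ("a", "C", "A")]

# With the identity substitution, C is at least as general as C'.
print(subsumes(c_head, c_body, c2_head, c2_body, {}))  # True
```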
In the literature, the support and confidence values are obtained with the SQL queries given in [13]. The database given in Table 1 is used as a running example to illustrate the algorithm. In this example, ancestor (a) is the concept to be learned, and nine concept instances are given. Also, a background relation, namely parent (p), is provided. For the rule a(A, B) ← p(A, yusuf), the support and confidence values are 3/9 (0.33) and 1/1 (1.0), which can be obtained by the SQL queries shown in Tables 2 and 3. The confidence query definition produces superficially high values for the cases where the head of the concept rule includes variables not appearing in the body, as shown above, and hence it is not applicable as is. Therefore, we proposed a modification on the use of the confidence query definition [29,32]. Confidence is the ratio of the number of positive instances deducible from the rule to the number of examples deducible from the
Table 2. The SQL queries for support calculation: Support = COUNT1/COUNT2.

COUNT1: SELECT COUNT(*) FROM (SELECT DISTINCT a.arg1, a.arg2 FROM ancestor a, parent p WHERE a.arg1 = p.arg1 AND p.arg2 = 'yusuf')
COUNT2: SELECT COUNT(*) FROM (SELECT DISTINCT a.arg1, a.arg2 FROM ancestor a)
Table 3. The SQL queries for confidence calculation: Confidence = COUNT3/COUNT4.

COUNT3: SELECT COUNT(*) FROM (SELECT DISTINCT p.arg1 FROM ancestor a, parent p WHERE a.arg1 = p.arg1 AND p.arg2 = 'yusuf')
COUNT4: SELECT COUNT(*) FROM (SELECT DISTINCT p.arg1 FROM parent p WHERE p.arg2 = 'yusuf')
rule. In other words, it shows how strong the concept rule is. For the example rule, the high confidence value indicates that it is very strong. However, out of the following five deducible facts, a(ali, ali), a(ali, kubra), a(ali, yusuf), a(ali, esra) and a(ali, aysegul), only three (a(ali, yusuf), a(ali, esra) and a(ali, aysegul)) exist in the database as positive instances. As a result, the example rule covers some negative instances. In order to adapt the confidence query to concept discovery, we add type relations to the body of the concept rule corresponding to the arguments of the head predicate whose variables do not appear in the body predicates. The type tables (relations) for the arguments of the target relation are created in the database (if they do not exist). For the ancestor example, person is the type table, which contains all values in the domain of the corresponding argument of the target relation in the database; it contains 5 records, which are kubra, ali, yusuf, esra and aysegul. A type relation is named after the type of the corresponding head predicate argument, and it contains a single argument whose domain is the same as the domain of the corresponding head predicate argument. The rules obtained by adding type relations are used only for computing the confidence values; for the rest of the computation, the original rules without type relations are used.

Definition 12. [CWA] The Closed World Assumption (CWA) is the presumption that what is not currently known to be true is false [53,42,44].
Definition 13. [Type-extended concept rule] A type-extended concept rule is a concept rule in which type relations, corresponding to the variable arguments of the head predicate that do not appear in the body, are added to the body of the concept rule.

In this work, we extend the concept rules to type-extended concept rules. By adding the type relations, negative instances can be deduced as in CWA. Besides this, since a type relation is always true for the instance, this modification does not affect the semantics of the concept rule. In addition, the definition of the confidence query remains intact. As a result of this modification, the support and confidence values for the example rule are as follows:
a(A, B) ← p(A, yusuf), person(B). (s = 0.33, c = 0.6)
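These values can be reproduced with a small script over the Table 1 database. The following sketch is our own illustration (the arg1/arg2 column naming follows Tables 2 and 3, but the schema is otherwise made up), using sqlite3 to evaluate the support query and the type-extended confidence query:

```python
import sqlite3

# Reproduce s = 0.33, c = 0.6 for a(A,B) <- p(A,yusuf), person(B)
# on the Table 1 database. Schema names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ancestor(arg1, arg2);
CREATE TABLE parent(arg1, arg2);
CREATE TABLE person(arg1);
""")
db.executemany("INSERT INTO ancestor VALUES (?,?)", [
    ("kubra","ali"), ("ali","yusuf"), ("yusuf","esra"), ("yusuf","aysegul"),
    ("kubra","yusuf"), ("kubra","esra"), ("kubra","aysegul"),
    ("ali","esra"), ("ali","aysegul")])
db.executemany("INSERT INTO parent VALUES (?,?)", [
    ("kubra","ali"), ("ali","yusuf"), ("yusuf","esra"), ("yusuf","aysegul")])
db.executemany("INSERT INTO person VALUES (?)",
    [("kubra",), ("ali",), ("yusuf",), ("esra",), ("aysegul",)])

def scalar(q):
    return db.execute(q).fetchone()[0]

# Support: covered target instances / all target instances.
support = scalar("""SELECT COUNT(*) FROM (SELECT DISTINCT a.arg1, a.arg2
    FROM ancestor a, parent p
    WHERE a.arg1 = p.arg1 AND p.arg2 = 'yusuf')""") / scalar(
    "SELECT COUNT(*) FROM ancestor")

# Confidence with the type-extended body p(A,yusuf), person(B):
# covered target instances / instances deducible from the body.
confidence = scalar("""SELECT COUNT(*) FROM (SELECT DISTINCT a.arg1, a.arg2
    FROM ancestor a, parent p, person t
    WHERE a.arg1 = p.arg1 AND p.arg2 = 'yusuf'
      AND a.arg2 = t.arg1)""") / scalar(
    """SELECT COUNT(*) FROM (SELECT DISTINCT p.arg1 AS pa, t.arg1 AS tb
    FROM parent p, person t WHERE p.arg2 = 'yusuf')""")

print(round(support, 2), round(confidence, 2))  # 0.33 0.6
```

Without the person(B) type atom, the denominator collapses to the single binding A = ali, giving the superficially high confidence 1/1 discussed above; adding person(B) expands the deducible instances to 5, yielding 3/5 = 0.6.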
The confidence of a concept rule is calculated with respect to all concept instances. This is due to the fact that the covered examples remain true for the candidate rules in the following coverage steps. On the other hand, the support is calculated over the uncovered concept instances.

Definition 14. [f-metric] f-metric is a hypothesis evaluation criterion that is calculated as follows:

f-metric = ((B² + 1) · conf · supp) / ((B² · conf) + supp)

The user can emphasize the effect of support or confidence by changing the value of B. If the user defines B to be greater than 1, then confidence has a higher effect. On the other hand, if B has a value less than 1, then support has a higher effect. Otherwise, both support and confidence have equal weight in the evaluation. In this work, f-metric is used as the concept rule evaluation metric in order to select the best concept rule.
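For the type-extended example rule above (supp = 0.33, conf = 0.6), the metric can be computed as in this small sketch. The placement of B² follows the standard F-measure-style form, which is our reading of Definition 14:

```python
# f-metric sketch (F-measure-style combination of support and confidence;
# the exact placement of B is our interpretation of Definition 14).
def f_metric(supp, conf, b=1.0):
    return ((b * b + 1) * conf * supp) / ((b * b * conf) + supp)

# Example rule: a(A,B) <- p(A,yusuf), person(B) with s = 1/3, c = 0.6
print(round(f_metric(1/3, 0.6), 3))  # 0.429
```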
to the concept, it is also useful for predictive approaches. In this work, there is a specific target concept to be learned, and association rule mining is employed for inducing association rules with the target concept as the only head relation. The most popular and well-known association rule mining algorithm, introduced in [4], is APRIORI. APRIORI utilizes an important property of frequent item sets in order to prune the candidate item set space:

Property 1. All subsets of a frequent item set must be frequent.

The contra-positive of this property says that if an item set is not frequent, then any superset of this set is also not frequent. It can be concluded that the item set space should be traversed from small item sets to large ones, so that the supersets of infrequent item sets can be discarded as early as possible. To apply this reasoning, APRIORI organizes the item set space as a lattice based on the subset relation. In the proposed method, the search space is searched with an APRIORI-based specialization operator.

3. Related work

In this section, we present an overview of relational learning systems related to our work. FOIL [52] is one of the earliest concept discovery systems. It is a top–down relational ILP system which uses a refinement graph in the search process. In FOIL, negative examples are not explicitly provided; they are generated on the basis of CWA. PROGOL [47] is a top–down relational ILP system based on inverse entailment. PROGOL extends clauses by traversing the refinement lattice and reduces the hypothesis space by using a set of mode declarations given by the user, together with a most specific clause (also called the bottom clause) as the greatest lower bound of the refinement graph. A bottom clause is a maximally specific clause which covers a positive example and is derived using inverse entailment. PROGOL applies the covering approach and supports learning from positive data, as in FOIL.
ALEPH [57] is similar to PROGOL, but it allows applying different search strategies, evaluation functions and refinement operators. It is also possible to define additional settings in ALEPH, such as minimum confidence and support. The design of algorithms for frequent pattern discovery has become a popular topic in data mining. Almost all such algorithms use the level-wise search technique known from the APRIORI algorithm. The level-wise algorithm is based on a breadth-first search in the lattice spanned by a specialization relation between patterns. WARMR [12] is a descriptive ILP system that employs APRIORI-style search to find frequent queries containing the target relation, using a support criterion. One major difficulty in ILP is managing the search space. The most common approach is to perform a search for hypotheses that are local optima of the quality measure. To overcome this problem, simulated annealing algorithms [2] can be used. SAHILP [56] uses simulated annealing methods instead of the covering approach in ILP for inducing hypotheses. It uses the neighborhood notion in the search space, a refinement operator similar to FOIL's, and weighted relative accuracy [37] as the quality measure. PosILP [55] extends propositional logic to the first-order case to deal with exceptions in a multi-class problem. It reformulates the ILP problem in first-order possibilistic logic and redefines it as an optimization problem. At the end, it learns a set of prioritized rules.

The proposed work is similar to ALEPH, as both systems produce concept definitions from a given target. WARMR is another similar work in the sense that both systems employ APRIORI-based searching methods. Unlike ALEPH and WARMR, CRIS does not need input/output mode declarations. It only requires type specifications of the arguments, which already exist together with the relational tables corresponding to predicates. Most ILP-based systems require negative information, whereas CRIS works directly on databases which contain only positive data. Similar to FOIL, negative information is implicitly described according to CWA. Finally, CRIS uses a novel confidence-based hypothesis evaluation criterion and search space pruning method. ALEPH and WARMR can use indirectly related relations and generate transitive rules only by means of strict mode declarations. In CRIS, transitive rules are generated without the guidance of mode declarations.

There are some other studies that use aggregation in multi-relational learning. Crossmine [67] is an ILP-based multi-relational classifier that uses TupleID propagation. Mr. G-Tree [39] extends the concepts of propagation described in Crossmine by introducing the g-mean TupleID propagation algorithm, also known as the GTIP algorithm. CLAMF [23] extends TupleID propagation in order to efficiently perform single- and multi-feature aggregation over related tables. Decision trees were extended to the multi-relational domain while incorporating single-feature aggregation and probability estimates for the classification labels [49]. A hierarchy of relational concept classes in order of increasing complexity is presented in [50], where the complexity depends on that of any aggregate functions used. Aggregation and selection are combined efficiently in [5]. MRDTL [40] constructs selection graphs for rule discovery. Selection Graph (SG) is a graphical language developed to express multi-relational patterns; these graphs can be translated into SQL or first-order logic expressions. The Generalised Selection Graph is an extended version of SG that uses aggregate functions [35].
It inspired this work in defining and using aggregation; however, we followed a logic-based approach and included aggregate predicates in an ILP-based context for concept discovery. C2D [29,28,32,30] is another concept discovery system proposed by the same research group prior to CRIS. In CRIS, some features, such as confidence-based pruning, are borrowed from C2D and further improved. The major difference between the two systems is in the generalization technique, which improves rule quality, enables effective use of aggregate predicates, and facilitates transitive rule generation.

4. Concept discovery in CRIS

CRIS [31] is a concept discovery system that uses first-order logic as the concept definition language and generates a set of concept rules having the target relation in the head. In this section, the basic techniques used in CRIS are described in detail. In the first subsection, the novel pruning techniques in CRIS are introduced. In the next subsection, the concept discovery algorithm of CRIS is described in detail. Finally, in the third subsection, the inclusion of aggregation into the concept discovery process is explained.

4.1. Pruning strategies in CRIS
In CRIS, three mechanisms are utilized for pruning the search space. The first one is a generality ordering on the concept rules based on θ-subsumption:

Strategy 1. In CRIS, candidate concept rules are generated according to the θ-subsumption definition given in Definition 9 in Section 2.

For instance, consider the following two concept rules from the ancestor example (Table 1):

C1: a(A, B) ← p(A, C).
C2: a(A, B) ← p(A, C), a(C, A).

As the heads of C1 and C2 (a(A, B)) are the same and the body of C1 is a subset of the body of C2, C1 is more general than C2 and it θ-subsumes C2.

The second pruning strategy is about the use of confidence. For this strategy, we first define a "non-promising rule".

Definition 15. [Non-promising rule] Let C1 and C2 be the two parent rules of the concept rule C in the APRIORI search lattice [3]. If the confidence value of C is not higher than the confidence values of C1 and C2, then C is called a non-promising rule.

Strategy 2. In CRIS, non-promising rules are pruned from the search space.

By using this strategy, in the solution path, each specialized rule has a higher confidence value than its parents. A similar approach is used in the Dense–Miner system [8] for traditional association rule mining. For the illustration of this technique on the ancestor example, consider the following two rules in the first level of the APRIORI lattice:

C1: a(A, B) ← p(A, C). (c = 0.6)
C2: a(A, B) ← p(C, B). (c = 0.45)

These rules are suitable for union since their head literals are the same and they have exactly one different literal from each other. Possible union rules are as follows:

C3: a(A, B) ← p(A, C), p(C, B). (c = 1.0)
C4: a(A, B) ← p(A, C), p(D, B). (c = 0.75)

C3 and C4 have higher confidence values than C1 and C2. Therefore, they are not pruned from the search space.

The last pruning strategy employed in CRIS, which is also a novel approach, utilizes the primary key-foreign key relationship between the head and body relations:

Strategy 3. If a primary key-foreign key relationship exists between the head and the body predicates, the foreign key argument of the body relation can only have the same variable as the primary key argument of the head predicate in the generalization step.

For example, in the Mutagenesis database [61], the target relation is molecule(drug, boolean) and a background relation is atm(drug, atom, element, integer, charge). As there is a primary key-foreign key relationship between the molecule and atm relations through the "drug" argument, some of the rules obtained at the end of the generalization step are as follows:

molecule(A, true) ← atm(A, B, c, 22, C).
molecule(A, true) ← atm(A, B, h, C, D).
molecule(A, true) ← atm(A, B, C, 22, D).
...

On the basis of this idea, concept rules that have different variables for primary key-foreign key attributes are not allowed in the generalization step. For example, the rule "molecule(A, true) ← atm(B, C, c, 22, D)" is not generated in the generalization step.

4.2. The algorithm

As shown in the flowchart given in Fig. 1, the concept rule induction algorithm of CRIS takes the target relation and background facts from
Table 4. The SQL query for finding feasible constants.
[Fig. 1 appears here as a flowchart. Inputs: the database (target relation and background facts) and the parameters Min_sup, Min_conf, Max_depth. Steps: feasible values are calculated for the head and body relations; the most general rules (one head and one body literal) are found using absorption (Depth = 1); infrequent and non-strong rules are filtered; if the candidate rule set is not empty and Depth is smaller than Max_depth, the general rules are refined with APRIORI-based specialization (Depth = Depth + 1); otherwise, the coverage step finds solution rules and covers target instances. The loop repeats until all target instances are covered, after which the hypothesis is printed.]

Fig. 1. CRIS algorithm.
the database. It works under minimum support, minimum confidence and maximum rule depth parameters. Rule construction starts with the calculation of feasible values for the head and body relations in order to generate the most general rules with a head and a single body predicate. In the generalization step, the primary key-foreign key relationship (Strategy 3) is also used in most general rule construction. After the generalization step, the concept rule space is searched with an APRIORI-based specialization operator. In this step, θ-subsumption (Strategy 1) is employed for candidate rule generation. In the refinement graph, infrequent rules are pruned. In addition, on the basis of Strategy 2, rules whose confidence values are not higher than those of their parents are also eliminated. When the maximum rule depth is reached or no more candidate rules can be found, the rules that are below the confidence threshold are eliminated from the solution set. Among the produced strong and frequent rules, the best rule (with the highest f-metric value) is selected. The rule search is repeated for the remaining concept instances that are not in the coverage of the generated hypothesis rules. At the end, some uncovered positive concept instances may remain, due to the user settings for the thresholds. In the rest of this section, the main steps of the algorithm are described.

Generalization: the generalization step of the algorithm constructs the most general two-literal rules by considering all target instances together. In this way, the quality of the rule induction does not depend on the order of the target instances. This novel technique proceeds as follows. For a given target relation such as t(A, B), the induced rule has a head including either a constant or a variable for each argument of
Table 4 The SQL query template for support calculation. SELECT a FROM t GROUP BY a HAVING COUNT(*) ≥ (min_sup*num_of_uncov_inst)
t. Each argument can be handled independently in order to find the feasible head relations for the hypothesis set. As an example, for the first argument A, a constant must appear at least min_sup*number_of_uncovered_instances times in the target relation so that it can be used as a constant in an induced rule. In order to find the feasible constants for attribute A, the SQL statement given in Table 4 is executed. For example, in the PTE-1 data set [60], the target relation pte_active has only one argument (drug). Initially, there are 298 uncovered instances in pte_active. When the min_sup parameter is set as 0.05, the SQL statement given in Table 5 returns an empty set, which means there are no feasible constants for the argument drug of pte_active. Therefore, the argument drug of pte_active can only be a variable in the head of the candidate concept rules. In the same manner, for a background relation such as r(A, B, C), if a constant appears at least min_sup*number_of_instances times for the same argument in r, then it is a frequent value for that argument of r and may take part in a solution rule for the hypothesis set. As an example, in the PTE-1 database, pte_atm(drug, atom, element, integer, charge) is a background relation, and the feasible constants that can take part in the hypothesis set can be found for each argument of pte_atm by using the above SQL statement template. For numeric attributes, due to the support threshold, it is not feasible to search for acceptable constants. For this reason, feasible ranges
Table 5 The SQL query example for support calculation. SELECT drug FROM pte_active GROUP BY drug HAVING COUNT(*) ≥ 298*0.05
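The support queries of Tables 4 and 5 can be reproduced directly with an in-memory database; the rows below are invented toy data, not PTE-1 instances:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pte_active (drug TEXT)")
# Toy instances: d1 occurs 3 times, d2 twice, d3 once.
conn.executemany("INSERT INTO pte_active VALUES (?)",
                 [("d1",), ("d1",), ("d1",), ("d2",), ("d2",), ("d3",)])

min_sup, num_uncovered = 0.4, 6
# Table 4 template: a constant is feasible if it appears at least
# min_sup * num_of_uncovered_instances times.
rows = conn.execute(
    "SELECT drug FROM pte_active GROUP BY drug HAVING COUNT(*) >= ?",
    (min_sup * num_uncovered,)).fetchall()
feasible = sorted(r[0] for r in rows)
```

With this toy data only d1 (3 occurrences ≥ 2.4) survives the threshold, so d1 is the only constant that could appear in a rule head.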
Table 6 The feasible constants for pte_atm.
Arg.      Constants                            SQL query
Drug      Empty set (only variable)            SELECT drug FROM pte_atm GROUP BY drug HAVING COUNT(*) ≥ 9189*0.05
Atom      Empty set (only variable)            SELECT atom FROM pte_atm GROUP BY atom HAVING COUNT(*) ≥ 9189*0.05
Element   c, h, o (also variable)              SELECT element FROM pte_atm GROUP BY element HAVING COUNT(*) ≥ 9189*0.05
Integer   3, 10, 12 (also variable)            SELECT integer FROM pte_atm GROUP BY integer HAVING COUNT(*) ≥ 9189*0.05
Charge    19 range constants (also variable)   SELECT charge FROM pte_atm ORDER BY charge
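The 19 range constants in the Charge row come from the border-value scheme for numeric arguments described in the text; a minimal sketch, with `range_borders` as an illustrative name:

```python
def range_borders(values, min_sup):
    """Sort the numeric domain and pick (1/min_sup) - 1 border values,
    one every min_sup-fraction of the records, for use in
    less-than/greater-than literals."""
    vals = sorted(values)
    step = int(len(vals) * min_sup)
    n_borders = int(round(1 / min_sup)) - 1
    return [vals[(i + 1) * step] for i in range(n_borders)]

# 1000 records with min_sup = 0.05: 19 borders, taken at the 51st
# smallest value (index 50), the 101st, and so on.
borders = range_borders(range(1000), 0.05)
```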
are given through less-than/greater-than operators on constants. As an example, for the charge argument of the pte_atm predicate, all values in the database are sorted in ascending order. For min_sup = 0.05, there should be 19 (that is, (1/0.05) - 1) border values for the less-than/greater-than operators. If the pte_atm relation had 1000 records, after ordering from smallest to largest, the less-than/greater-than operators would be applied at the 51st constant, the 101st constant, and so on. In addition to these constants denoting feasible ranges, this argument can be a variable, as well. The feasible constants and the SQL statements used for each argument of pte_atm are shown in Table 6. As a result, the pte_atm relation yields 320 (that is, 1*1*4*4*20) body relations for each possible head relation in the generalization step of CRIS. Example generalized rules are listed in Table 7. In the ancestor example, the target and background relations can only have variables for their arguments in the hypothesis set. The constructed rules are listed in Table 8. Specialization: CRIS refines the two-literal concept descriptions with an APRIORI-based specialization operator that searches the concept rule space in a top-down manner, from general to specific. As in APRIORI, the search proceeds level-wise in the hypothesis space and is mainly composed of two steps: frequent rule set selection from candidate rules, and candidate rule set generation as refinements of the frequent rules in the previous level. The standard APRIORI search lattice is extended in order to capture concept rules, and the candidate generation and frequent pattern selection tasks are customized for first-order concept rules. The candidate rules for the next level of the search space are generated in three important steps: 1. Frequent rules of the previous level are joined to generate the candidate rules via the union operator.
In order to apply the union operator to two frequent concept rules, the rules must have the same head literal, and their bodies must have all but one literal in common. Therefore, the unioned rule is θ-subsumed by its parents. Since only rules that have the same head literal are combined, the search space is partitioned into disjoint APRIORI sub-lattices according to the head literal. In addition to this, the
Table 7 Example generalized rules for PTE-1 data set.
pte_active(A) ← pte_atm(A, B, c, 3, X), X ≤ 0.133.
pte_active(A) ← pte_atm(A, B, c, 3, X), X ≥ 0.133.
pte_active(A) ← pte_atm(A, B, c, 3, C).
pte_active(A) ← pte_atm(A, B, c, 10, X), X ≤ 0.133.
pte_active(A) ← pte_atm(A, B, c, 10, X), X ≥ 0.133.
pte_active(A) ← pte_atm(A, B, c, 10, C).
pte_active(A) ← pte_atm(A, B, c, 22, X), X ≤ 0.133.
pte_active(A) ← pte_atm(A, B, c, 22, X), X ≥ 0.133.
pte_active(A) ← pte_atm(A, B, c, 22, C).
pte_active(A) ← pte_atm(A, B, c, C, X), X ≤ 0.133.
pte_active(A) ← pte_atm(A, B, c, C, X), X ≥ 0.133.
pte_active(A) ← pte_atm(A, B, c, C, D).
pte_active(A) ← pte_atm(A, B, h, 3, X), X ≤ 0.133.
pte_active(A) ← pte_atm(A, B, h, 3, X), X ≥ 0.133.
pte_active(A) ← pte_atm(A, B, h, 3, C).
Table 8 Generalized rules for ancestor data set (first column).
a(A, B) ← p(A, B).
a(A, B) ← p(B, A).
a(A, B) ← p(C, A).
a(A, B) ← p(C, D).
a(A, B) ← a(A, C).
a(A, B) ← a(B, C).
a(A, B) ← a(C, B).
system does not combine rules that are specializations of the same candidate rule produced in the second step of the candidate rule generation task, in order to prevent logical redundancy in the search space. 2. For each frequent union rule, a further specialization step is employed that unifies the existential variables of the same type in the body of the rule. In this way, rules with relations indirectly bound to the head predicate can be captured. 3. Except for the first level, candidate rules whose confidence values are not higher than their parents' confidence values are eliminated. If a concept rule has a confidence value of 1, it is not further specialized in the following steps (Strategy 3). Evaluation: once the system constructs the search tree consisting of the frequent and confident candidate rules for that round, it eliminates the rules having confidence values below the confidence threshold. Among the remaining strong rules, the system decides which rule in the search tree represents a better concept description than the other candidates according to the f-metric definition given in Section 2. The user can emphasize the effect of support or confidence by changing the value of B. Coverage: after the best rule is selected, the target instances covered by this rule are determined and removed from the concept instance set. The main loop continues until all concept instances are covered or no more candidate rules can be generated for the uncovered concept instances. In the ancestor example, at the end of the first coverage step, the following rule, which covers 7 of the target instances, is induced:
a(A, B) ← p(A, C), a(C, B).
In the second coverage step, the following rule, which covers all of the uncovered target instances, is induced:
a(A, B) ← p(A, B).
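The union step of candidate generation in the specialization phase (joining two frequent rules with the same head whose bodies share all but one literal) can be sketched as follows; the tuple-and-frozenset encoding of rules is my illustration, not CRIS's internal representation:

```python
def union_rules(rule1, rule2):
    """Join two frequent rules into a next-level candidate if they have
    the same head and their bodies differ in exactly one literal each.
    The candidate is theta-subsumed by both parent rules."""
    head1, body1 = rule1
    head2, body2 = rule2
    # Symmetric difference of size 2 means all but one literal is shared.
    if head1 != head2 or len(body1 ^ body2) != 2:
        return None
    return (head1, body1 | body2)

# Two level-1 rules from the ancestor example, combined into a level-2 candidate.
r1 = ("a(A,B)", frozenset({"p(A,C)"}))
r2 = ("a(A,B)", frozenset({"a(C,B)"}))
candidate = union_rules(r1, r2)
```

Because only rules with identical heads combine, each head literal induces its own disjoint APRIORI sub-lattice, as the text describes.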
4.3. Aggregation in CRIS

An important feature for a concept discovery method is the ability to incorporate aggregated information into the concept discovery process. Such information becomes descriptive, as in the example "the total charge on a compound is descriptive for the usefulness or harmfulness of the compound". Therefore, a concept discovery system needs aggregation capability in order to construct high-quality rules (with high accuracy and coverage) for such domains. In relational database queries, aggregate functions characterize groups of records gathered around a common property. In concept discovery, aggregate functions are utilized in order to construct aggregate predicates that capture aggregate information over one-to-many relationships. Conditions on the aggregation, such as count < 10 or sum > 100, may define the basic characteristics of a given concept better. For this reason, in CRIS, we extend the background knowledge with aggregate predicates in order to characterize the structural information that is stored in tables and the associations between them [33].

Definition 16. An Aggregate Predicate (P) is a predicate that defines aggregation over an attribute of a given Predicate (a). We use a notation similar to the one given in [24] to represent the general form for aggregate predicates as follows:
Table 8 Generalized rules for ancestor data set (second column).
a(A, B) ← p(A, C).
a(A, B) ← p(B, C).
a(A, B) ← p(C, B).
a(A, B) ← a(A, B).
a(A, B) ← a(B, A).
a(A, B) ← a(C, A).
a(A, B) ← a(C, D).
P^{a,b}_{c,x}(c, r), where a is the predicate over which the Aggregate Function (x) (COUNT, MIN, MAX, SUM and AVG are the frequently used functions) is computed, Key (c) is a set of arguments that will form
the key for P, and Aggregate Value (r) is the value of x applied to the set of values defined by Aggregate Variable List (b). The Mutagenesis data set [61] is used for illustrating the usage of aggregate functions. In this data set, the target relation is molecule and atom is a background relation. In addition, there is a primary-foreign key relationship between these two relations through the drug argument.
atom_count^{atom,atom-id}_{drug,COUNT}(drug, cnt).

The above notation represents the aggregate predicate atom_count(drug, cnt), which keeps the total number of atoms for each drug.

Definition 17. An aggregate query is an SQL statement including aggregate functions. The instances of aggregate predicates are created by using an aggregate query template. Given P^{a,b}_{c,x}(c, r), the corresponding aggregate query is given in Table 9.
and feasible to define concepts on specific numeric values, in this work numeric attributes are considered only together with comparison operators. For example, the pte_atm relation in the above example has the argument charge, which holds floating-point values. It is infeasible to search for a rule such as: a drug is active if it has an atom with charge equal to 0.117. As there are many possible numeric values in the relation, such a rule would probably be eliminated by the minimum support criterion. Instead, it is more feasible to search for drugs whose charge is larger or smaller than some threshold value. For this purpose, numeric attributes are handled as described below. As the first step, the domains of the numeric attributes are explicitly defined as infinite in the generalization step. For the infinite attributes, concept rules are generated on the basis of the following strategy.
For example, the instances of the atom_count aggregate predicate on the atom relation are constructed by the query given in Table 10.
Strategy 4. For a given target concept t(a, x) and a related fact such as p(a, b, num), where a and b are nominal values and num is a numeric value; instead of a single rule, the following two rules are generated:
Definition 18. An aggregate rule is a concept rule which has at least one aggregate predicate in the body of the rule.
t(a, x) ← p(a, b, A), A ≥ num.

t(a, x) ← p(a, b, A), A ≤ num.
An example aggregate rule is:
molecule(d1, true) ← atom_count(d1, A), A ≥ 28.
As a more comprehensive example, the PTE-1 data set [60] is used for explaining how aggregation is used in concept discovery in the proposed system. There is a one-to-many relationship between the pte_active and pte_atm relations over the drug argument. A similar relation exists between the pte_active and pte_bond tables. Also, there is a one-to-many relationship between the pte_atm and pte_bond relations over the atm-id argument. pte_atm_count^{atom,atm-id}_{drug,COUNT}(drug, cnt) is an example aggregate predicate that can be defined in the PTE-1 data set. For simplicity, we abbreviate it as pte_atm_count(drug, cnt), which represents the number of atoms for each drug. The instances of the pte_atm_count(drug, cnt) aggregate predicate on the pte_atm relation are constructed by the query given in Table 11. All aggregate predicates defined on the PTE-1 data set, their descriptions and the corresponding SQL query definitions are listed in Table 12. Aggregate predicates have numeric attributes by their nature. Therefore, in order to add aggregate predicates into the system, numeric attribute types should also be handled. Since it is not useful
Table 9 The SQL template for aggregate predicates. SELECT c, x(b) as r FROM a GROUP BY c
Table 10 SQL query for the predicate atom_count(drug, cnt). SELECT drug, COUNT (atom-id) as cnt FROM atom GROUP BY drug
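The aggregate query template of Table 9, instantiated as in Table 10, can be run as-is; the atom rows below are invented toy data, and the hyphenated atom-id column is renamed atom_id for SQL-identifier convenience:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atom (drug TEXT, atom_id TEXT)")
# Toy one-to-many data: drug d1 has three atoms, d2 has one.
conn.executemany("INSERT INTO atom VALUES (?, ?)",
                 [("d1", "a1"), ("d1", "a2"), ("d1", "a3"), ("d2", "a4")])

# Instances of the aggregate predicate atom_count(drug, cnt),
# following the Table 9 template: SELECT c, x(b) AS r FROM a GROUP BY c.
instances = sorted(conn.execute(
    "SELECT drug, COUNT(atom_id) AS cnt FROM atom GROUP BY drug"
).fetchall())
```

Each returned (drug, cnt) row becomes one ground instance of atom_count, added to the background knowledge.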
In order to find the most descriptive num value, the basic method is to order the domain values for attribute A and define intervals with respect to the given support threshold, so that a set of rules describing the interval borders is generated. This method, described in Section 4.2, is applicable to the numeric attributes of the aggregate predicates as well. However, the number of generalized rules increases considerably under a low support threshold. For this reason, in order to improve time efficiency, a simplification is employed and only the median element of the domain is selected as the num value. The integration of aggregate predicates into the concept rule generation process can be summarized as follows. One-to-many relationships between the target concept and background relations are defined on the basis of the schema information. Under these relationships, aggregate predicates are generated by using the SQL template described earlier in this section. In the generalization step, the instances of these predicates are considered for rule generation. As an example, for the pte_atm_count predicate defined in the PTE-1 data set, the following example rules are created in the generalization step.
pte_active(A, true) ← pte_atm_count(A, X), X ≥ 22.

pte_active(A, true) ← pte_atm_count(A, X), X ≤ 22.
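The median simplification described above can be sketched as follows (`median_threshold` is an illustrative helper name; the values are toy data):

```python
def median_threshold(values):
    """Pick the median domain element as the single comparison
    threshold num, instead of one border value per support interval."""
    vals = sorted(values)
    return vals[len(vals) // 2]

# Toy aggregate-value domain: the median, 22, becomes num in the
# generated X >= num / X <= num rule pair.
num = median_threshold([4, 40, 22, 7, 31])
```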
Including aggregate predicates in the concept discovery process increases the size of the background relations. In addition, handling numeric attributes for the aggregate predicates further increases the number of aggregate predicate instances. Therefore, this inclusion increases the concept discovery duration. However, it is a necessary feature that increases rule quality in certain domains. The most practical approach is to make this feature optional in the rule generation mechanism. For instance, for domains that do not include numeric attributes, it is not useful and should be switched off.

5. Experimental results
Table 11 SQL statement for aggregate predicate pte_atm_count(drug, cnt). SELECT drug, COUNT (atm-id) as cnt FROM pte_atm GROUP BY drug
A set of experiments was performed to test the performance of CRIS on well-known problems in terms of coverage and predictive accuracy. Coverage denotes the number of target instances of the test data set covered by the induced hypothesis set. Predictive accuracy denotes the sum of correctly covered true positive and true nega-
Table 12 The aggregate predicates in PTE-1 data set.
Predicate                    Desc.                               SQL query definition
pte_atm_count(drug, cnt)     Number of atoms for each drug       SELECT drug, COUNT(atm-id) FROM pte_atm GROUP BY drug
pte_bond_count(drug, cnt)    Number of bonds for each drug       SELECT drug, COUNT(atm-id) FROM pte_bond GROUP BY drug
pte_atm_b_cnt(atm-id, cnt)   Number of bonds for each atom       SELECT atm-id, COUNT(atm-id) FROM pte_bond GROUP BY atm-id
pte_charge_max(drug, mx)     Max charge of the atoms in a drug   SELECT drug, MAX(charge) FROM pte_atm GROUP BY drug
pte_charge_min(drug, mn)     Min charge of the atoms in a drug   SELECT drug, MIN(charge) FROM pte_atm GROUP BY drug
Table 13 Summary of the benchmark data sets used in the experiments.
Data set      No. of Pred.   No. of Inst.   No. of AggPred   Min. sup.   Min. conf.
Same-gen      2              408            0                0.3         0.6
Mesh          26             1749           0                0.1         0.1
PTE-1         32             29267          5                0.1         0.7
Mutagenesis   26             15003          0                0.1         0.7
Diterpene     22             46593          0                0.05        0.8
Alzheimer     35             2505           0                0.25        0.8
Satellite     31             18732          0                0.1         0.7
Eastbound     12             196            0                0.3         0.6
Elti          9              224            0                0.2         0.6
Table 14 Descriptions of the benchmark data sets used in the experiments.
Same-gen: An actual family data set containing recursive relationships.
Mesh: A sparse data set about learning rules to determine the number of elements on each edge of a mesh.
PTE-1: The first data set of the Predictive Toxicology Evaluation (PTE) project, to determine carcinogenic effects of chemicals in terms of numeric arguments and aggregate predicates.
Mutagenesis: A data set about chemicals in which the aim is to determine the mutagenicity of each drug.
Diterpene: A data set about diterpenes containing numeric attributes, where the aim is to identify their skeletons.
Alzheimer: A data set concerning the design of analogues to the Alzheimer's disease drug tacrine, without numeric attributes and aggregate predicates.
Eastbound: A data set including information about trains and their indirectly related facts.
Satellite: A data set about diagnosis of power-supply failures in a communications satellite.
Elti: An actual family data set containing transitive relationships.
tive instances over the sum of true positive, true negative, false positive and false negative instances.2 The experiments were run on a computer with an Intel Core Duo 1.6 GHz processor and 1 GB memory. The benchmark data sets used in the experiments are summarized in Tables 13 and 14. In this work, the experimental results on these benchmark data sets are sometimes given in terms of different evaluation metrics. This is because we refer to the results of related work in the literature, and these are not available under the same metrics. For the mesh data set, coverage results are given, whereas for PTE, accuracy is the measure used in the comparisons. This variation stems from the different characteristics of the data sets. For instance, since mesh is a sparse data set, an increase in coverage is a more valuable quality than an increase in accuracy. Similarly, the rules generated in the literature, and the parameter values such as support and coverage under which they were generated, are not always available. We present the rules generated by our algorithm for some of the data sets; we could not do so for all data sets, since comparable results from other systems are not available. The rest of this section is organized as follows. In Section 5.1, performance for recursive rule discovery is tested and evaluated. Section 5.2 presents the performance results for rule discovery
2 In order to find the number of false positive and false negative instances, the test data set is extended with the dual of the data set under CWA.
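The accuracy measure, with negatives obtained as the CWA dual of the positive set (footnote 2), can be computed as in the following sketch; the set encoding and the toy instance sets are mine:

```python
def predictive_accuracy(predicted_pos, actual_pos, universe):
    """(TP + TN) / |universe|: negatives are universe - actual_pos
    under the closed-world assumption."""
    tp = len(predicted_pos & actual_pos)
    tn = len((universe - predicted_pos) & (universe - actual_pos))
    return (tp + tn) / len(universe)

# Toy example: 10 instances, 5 actual positives, hypothesis covers 4.
universe = set(range(10))
actual = {0, 1, 2, 3, 4}
predicted = {0, 1, 2, 5}
acc = predictive_accuracy(predicted, actual, universe)  # (3 + 4) / 10
```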
on sparse data sets. Section 5.3 includes the experiments on the accuracy of the generated rules; this subsection also includes the performance evaluation of the use of aggregate predicates. Lastly, in Section 5.4, the performance of the proposed method for discovery of transitive rules is presented.

5.1. Performance evaluation for linear recursive rule discovery

One of the interesting test cases that we have used is a complex family relation, the same-generation learning problem. In this experiment, only linear recursion is allowed and the B value is set to 1. We set the confidence threshold as 0.6, the support threshold as 0.3 and the maximum depth as 3. In the data set, 344 pairs of actual family members are given as positive examples of the same-generation (sg) relation. Additionally, 64 background facts are provided to describe the parental (p) relationships in the family. The tables sg and p each have two arguments of type person. As there are 47 persons in the examples, the person table (type table) has 47 records. The solutions under different evaluation criteria are given in Table 15 (the parameters in lower-case letters are constants that exist in the data set). In this experiment, the effect of using confidence and using the f-metric for selecting the best rule is evaluated. The row titles Conventional (as described in Definition 13) and Improved (as described in Definition 15) denote the use of the conventional and the proposed confidence query definition, as defined in
752
Y. Kavurucu et al. / Knowledge-Based Systems 23 (2010) 743–756
Section 2. The sub-row titles Confidence and f-metric denote using confidence and the f-metric, respectively, as the rule evaluation metric. As seen from the results in Table 15, improved confidence evaluation finds better rules than conventional confidence evaluation according to the support and confidence values of the induced hypothesis. With improved confidence, the f-metric produces better rules than using confidence alone for hypothesis evaluation. The first two rules of the solution obtained using f-metric evaluation with improved confidence show that the same-generation relation is symmetric, and the third rule forms the base rule for the recursive solution. The rules induced with this setting cover all the target instances and are very strong, with 100% accuracy. The discovered rules are the same as the standard solution for the same-generation problem in the literature. On the other hand, when conventional confidence is used in evaluation, recursive rules are not discovered; therefore, the accuracy is very low. When improved confidence is used, recursive rules can be discovered. However, although they have high accuracy, the rules generated under confidence evaluation are data-specific. Therefore, f-metric evaluation with improved confidence clearly produces the best set of rules compared to the others. For this data set, ALEPH, PROGOL and GOLEM cannot find a solution under default settings. Under strong mode declarations and constraints, ALEPH finds the following hypothesis:
sg(A, B) ← p(C, A), p(C, B).

sg(A, B) ← sg(A, C), sg(C, B).

sg(A, B) ← p(C, A), sg(C, D), p(D, B).
On the other hand, PROGOL can only find the following rule:
sg(A, B) ← sg(B, C), sg(C, A).
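The base and recursive rules discovered under f-metric evaluation with improved confidence correspond to the standard same-generation program; a naive bottom-up (fixpoint) evaluation on an invented toy family illustrates what they compute:

```python
def same_generation(parent):
    """Naive bottom-up evaluation of:
         sg(A, B) <- p(C, A), p(C, B).
         sg(A, B) <- sg(C, D), p(C, A), p(D, B).
    parent is a set of (parent, child) pairs."""
    # Base rule: two individuals with a common parent.
    sg = {(a, b) for (c1, a) in parent for (c2, b) in parent if c1 == c2}
    changed = True
    while changed:  # iterate the recursive rule to a fixpoint
        changed = False
        for (c, a) in parent:
            for (d, b) in parent:
                if (c, d) in sg and (a, b) not in sg:
                    sg.add((a, b))
                    changed = True
    return sg

# Toy family (invented): tom -> ann, bob; ann -> cal; bob -> dee.
parent = {("tom", "ann"), ("tom", "bob"), ("ann", "cal"), ("bob", "dee")}
sg = same_generation(parent)
```

Here ann and bob are same-generation via the base rule, which propagates to their children cal and dee via the recursive rule.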
5.2. Performance evaluation on sparse data

In mechanical engineering, physical structures are represented by a finite number of elements (a mesh) in order to sufficiently minimize the errors in the calculated deformation values. A mesh is a grid composed of points called nodes. It is programmed to contain the material and structural properties which define how the structure will react to certain loading conditions. Nodes are assigned at a certain density throughout the material depending on the anticipated stress levels of a particular area. The problem is to determine an appropriate mesh resolution for a given structure that results in accurate deformation values.
Table 15 Rules discovered for same-generation data set.

Conventional confidence, Conf.-based:
sg(A, B) ← p(C, A).
sg(A, B) ← p(C, B).

Conventional confidence, f-metric-based:
sg(A, B) ← p(C, A), p(C, B).

Improved confidence, Conf.-based:
sg(A, B) ← sg(C, D), p(C, A), p(D, B).
sg(A, B) ← sg(A, neriman), p(yusuf, B).
sg(A, B) ← sg(B, ali), p(mediha, A).
sg(A, B) ← p(yusuf, A), p(yusuf, B).
sg(A, B) ← p(mediha, A), p(mediha, B).

Improved confidence, f-metric-based:
sg(A, B) ← sg(C, D), p(C, A), p(D, B).
sg(A, B) ← sg(C, D), p(C, B), p(D, A).
sg(A, B) ← p(C, A), p(C, B).
Table 16 Test results for the mesh-design data set.
System   Coverage (over 55 records)
CRIS     29
ALEPH    26
PosILP   23
SAHILP   21
MFOIL    19
PROGOL   17
GOLEM    17
FOIL     17
Mesh-design is, in fact, the determination of the number of elements on each edge of the mesh. The task is to learn rules to determine the number of elements for a given edge in the presence of background knowledge such as the type of edges, boundary conditions, loadings and geometric position. Four different structures, called b-e in [15], are used for learning in this experiment. Then, structure a is used for testing the accuracy and coverage of the induced rules. The number of elements on each edge in these structures is given as positive concept instances, in the form mesh(Edge, NumberOfElements). An example instance such as (c15, 8) means that edge 15 of structure c should be divided into 8 sub-edges. There are 223 positive training examples and 1474 background facts in the data set. The target relation mesh_train has two arguments of element and integer type. The type tables element and integer are created with 278 and 13 records, respectively. The test relation mesh_test has 55 examples. For this experiment, recursion is disallowed, the support and confidence thresholds are set as 0.1, B is set as 1 and the maximum depth is set as 3. The details of the results and the coverage of previous systems are shown in Table 16. Since mesh is a sparse data set, finding rules with high coverage is a hard task. For this reason, an increase in coverage is a more valuable quality than an increase in accuracy. For this special and hard case, CRIS finds concept rules with higher coverage than the previous systems.

5.3. Accuracy evaluation of concept discovery

5.3.1. Experiments on PTE-1 data set

A large percentage of cancer incidents stems from environmental factors, such as carcinogenic compounds. Carcinogenicity tests of compounds are necessary to prevent cancers; however, the standard bioassays of chemicals on rodents are really time-consuming and expensive.
Therefore, the National Toxicology Program (NTP) of the US National Institute of Environmental Health Sciences (NIEHS) started the Predictive Toxicology Evaluation (PTE) project in order to relate the carcinogenic effects of
Table 17 Predictive accuracies for PTE-1.
Method              Type                    Pred. acc.
CRIS (with aggr.)   ILP + DM                0.88
CRIS                ILP + DM                0.86
Ashby               Chemist                 0.77
PROGOL              ILP                     0.72
RASH                Biol. potency an.       0.72
C2D (with aggr.)    ILP + DM                0.70
TIPT                Propositional ML        0.67
Bakale              Chem. reactivity an.    0.63
Benigni             Expert-guided regr.     0.62
DEREK               Expert system           0.57
TOPCAT              Statistical disc.       0.54
COMPACT             Molecular modeling      0.54
chemicals on humans to their substructures and properties using machine learning methods [14]. In the NTP program, the tests conducted on rodents resulted in a database of more than 300 compounds classified as carcinogenic or non-carcinogenic. Among these compounds, 298 form the training set, 39 form the test set of the first PTE challenge (PTE-1) and the other 30 chemicals constitute the test set of the second PTE challenge (PTE-2) for data mining programs [60]. The background knowledge has roughly 25,500 facts [59]. The target relation pte_active has two arguments of drug and bool type. The primary key of the target relation is drug, and it exists in all background relations as a foreign key. The type tables drug and bool are created with 340 and 2 (true/false) records, respectively. For this experiment, recursion is disallowed, the maximum rule length is set to 3 predicates, and the support and confidence thresholds are set as 0.1 and 0.7, respectively. The predictive accuracy of the hypothesis set is computed as the proportion of the sum of the carcinogenic concept instances classified as positive and the non-carcinogenic instances classified as negative to the total number of concept instances that the hypothesis set classifies. For the PTE-1 data set, the aggregate predicates given in Table 12 are defined and their instances are added to the background information. An example induced rule including an aggregate predicate is as follows:
pte_active(A, false) ← pte_atm(A, B, c, 22, X), X ≥ 0.020, pte_has_property(A, salmonella, n), pte_has_property(A, mouse_lymph, p).

The predictive accuracies of the state-of-the-art methods and CRIS for the PTE-1 data set are listed in Table 17. As seen from the table, CRIS has a better predictive accuracy than the other systems. The reader may refer to [59,62,27,6,9,54,7,21,41] for more information on the compared systems in Table 17. Within this experiment, the effect of including aggregate predicates on execution time is analyzed as well. For the experiments, the proposed method is applied on the PTE-1 data set with none to five aggregate predicates included in the background knowledge. The result is presented in Fig. 2. As seen in the figure, a linear increase
is observed with the linear increase in the number of included aggregate predicates. The load basically comes from numeric attribute handling. For domains where the aggregate predicates are descriptive for the concept, the experimentally observed increase in execution time can be tolerated.

5.3.2. Experiments on mutagenesis data set

In this experiment, we have studied the mutagenicity of the 230 compounds listed in [61]. We use the regression-friendly data set, which has 188 compounds. The target relation molecule has two arguments of drug and bool type. The primary key of the target relation is drug, and it exists in all background relations as a foreign key. The type tables drug and bool are created with 230 and 2 (true/false) records, respectively. In the literature [58], five levels of background knowledge for Mutagenesis are defined, where Bi ⊂ Bi+1 for i = 0..3. In this experiment, B2 is used. Recursion is disallowed, the support threshold is set as 0.1, the confidence threshold as 0.7, B is set as 1 and the maximum depth is set as 3. The predictive accuracies of the state-of-the-art methods and the proposed method on the Mutagenesis data are listed in Table 18 [40]. As seen from the results, CRIS has the highest accuracy in this experiment.

5.3.3. Experiments on diterpene data set

In another experiment on accuracy performance, we used the diterpene data set [20]. The data contains information on 1503 diterpenes with known structure. The predicate red(Mol, Mult, Freq) keeps the measured NMR-spectra information. For each of the 20 carbon atoms in the diterpene skeleton, the multiplicity and frequency values are recorded. The predicate prop(Mol, Satoms, Datoms, Tatoms, Qatoms) counts the atoms that have multiplicity s, d, t, or q, respectively. The data set contains additional unary predicates in order to describe to which of the 23 classes a compound belongs.
In this experiment, support threshold is 0.05, confidence threshold is 0.8, and maximum depth for the rules is 2. The predictive accuracies of the previous systems given in [5,55] and CRIS are shown in Table 19.
Table 18 Predictive accuracies for the Mutagenesis data set.
Method   Predictive accuracy
CRIS     0.95
PosILP   0.90
SAHILP   0.89
MRDTL    0.88
C2D      0.85
TILDE    0.85
PROGOL   0.83
FOIL     0.83
Fig. 2. Execution time graph for concept discovery with aggregation.

Table 19 Predictive accuracies for the Diterpene data set.
Method   Predictive accuracy
CRIS     0.98
RIBL     0.91
PosILP   0.91
TILDE    0.90
ICL      0.86
SAHILP   0.84
FOIL     0.78
5.3.4. Experiments on alzheimer data set The Alzheimer data set is about the drug design problem for Alzheimer’s disease. In the data set, four biological/chemical properties are considered and the target is to discover rules for them. These properties are maximization of acetyl cholinesterase inhibition, maximization of inhibition of amine re-uptake, maximization of the reversal of scopolamine-induced memory impairment and minimization of toxicity. In the data set, for each of the biological/chemical properties, instances are given as comparison of the drugs for that property. For example, less_toxic(d1, d2) indicates that the toxicity of d1 is less than that of d2. For this data set, the following rule is discovered by CRIS:
better_acinh(A, B) ← alk_groups(A, C), alk_groups(B, D), gt(D, C), ring_substitution(B, 1).

The same rule is discovered by GOLEM, as reported in [34]. No other work reports performance results on this data set. For this data set, CRIS discovers the concept with 89.3% accuracy.

5.3.5. Experiments on satellite data set

The satellite data set is about the temporal fault diagnosis of power-supply failures in a communication satellite [1]. In the data set, battery faults are simulated and the times of failures are recorded in the form fault(n), where n denotes the time of failure. This predicate constitutes the target concept. The background knowledge includes the history of the components (obtained from 29 sensors in the power subsystem) and the start times of the basic operation phases. For this data set, one of the rules discovered by CRIS is as follows:
fault(A) :- fault(B), succ(A, B).

The same rule is reported to be discovered by GOLEM in [22]. No other work reports performance results on this data set. With the data set given in [1], CRIS discovers the concept with 98% accuracy. For the battery fault diagnosis experiments, [22] uses a slightly different data set (the reported number of facts differs from that given in [1], and the test data set differs). In [22], the accuracy of the concept discovered by GOLEM is given as 98%.

5.4. Performance evaluation on transitive rule discovery

Approaches that consider only related facts in the generalization step, as in C2D, fall short in cases where the domain includes many unrelated facts. Michalski's trains problem [43] is a typical case. In this data set, the target relation eastbound(train) is only related with the has_car(train, car) relation. The other background relations have an argument of type car and are only related with the has_car relation. For this data set, the rules generated by C2D are very general and cannot include any information about the properties of the cars of a train. C2D fixes this problem by adding the background facts that are indirectly related with the selected target concept instance into the APRIORI lattice in the generalization step. As a result of this extension, C2D finds the following rule:
eastbound(A) :- has_car(A, B), closed(B). (s = 1.0, c = 0.72)
CRIS can find the same rule without any further extension, since its generalization step takes all target instances into account; therefore, there is no distinction between directly and indirectly related facts in CRIS. Furthermore, it finds the same rule generated by C2D in a shorter time. PROGOL finds only the following rule (with lower support and confidence) for this experiment:
eastbound(A) :- has_car(A, B), double(B). (s = 0.4, c = 0.67)
ALEPH cannot find a rule without negative instances. When negative instances are provided, it finds the following rule (best rule) for this experiment:
eastbound(A) :- has_car(A, B), short(B), closed(B). (s = 1.0, c = 1.0)
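For intuition about the s and c values attached to these rules, the following minimal Python sketch computes them under the standard support/confidence definitions over instance sets (the paper's modified confidence for unmatched head predicate arguments is not reproduced here, and the toy instance counts are hypothetical):

```python
def support_confidence(targets, covered):
    """Standard rule-quality measures: `targets` is the set of positive
    target instances, `covered` the set of instances the rule covers."""
    true_covered = targets & covered
    support = len(true_covered) / len(targets)
    confidence = len(true_covered) / len(covered)
    return support, confidence

# Hypothetical: 10 eastbound trains; a rule body covering 14 trains,
# 10 of which are actually eastbound.
targets = set(range(10))
covered = set(range(14))
s, c = support_confidence(targets, covered)
print(s, round(c, 2))  # support 1.0, confidence ~0.71 (cf. c = 0.72 above)
```

With these definitions, a rule covering all positives has s = 1.0, while extra (negative) coverage lowers c, matching the relative ordering of the rules reported for this experiment.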
Another example of transitive rule construction is the kinship data set, adapted from [26]. The names and arguments of the relations in the data set are given in Table 20. There are 217 records in the data set. As there are 24 different people in the relations, a person table (type table) is created containing their names. In this experiment, a new relation called elti(A, B) was defined, which represents the family relation between the wives of two brothers (the term elti is the Turkish word for this family relationship). In the data set, the people in the elti relation have no brothers; therefore, brother instances are unrelated facts for elti. The minimum support is set as 0.2 and the minimum confidence as 0.6. When indirectly related facts are added to the lattice, C2D finds the rules given in Table 21, which capture the description of the elti concept. For the same data set, GOLEM cannot find any rule under several mode declarations. PROGOL cannot find a successful rule for this experiment, either. However, if only the husband, wife and brother relations are given as background knowledge, it finds only one transitive rule (given below) under strict mode declarations:
elti(A, B) :- husband(C, A), husband(D, B), brother(C, D).
Similarly, ALEPH can only find one transitive rule for this experiment:
elti(A, B) :- husband(D, A), wife(B, C), brother(C, D).
CRIS finds the correct hypothesis set for this experiment. The time efficiency and rule quality comparison of CRIS and C2D is given in Table 22. In this table, the first column corresponds to C2D with the extension in the generalization step for handling transitive rules, the second column to the original C2D algorithm, and the last column to CRIS. As seen
Table 20
The relations in the kinship data set.

Relation name   Argument types
Aunt            Person, Person
Brother         Person, Person
Daughter        Person, Person
Father          Person, Person
Husband         Person, Person
Mother          Person, Person
Nephew          Person, Person
Niece           Person, Person
Sister          Person, Person
Son             Person, Person
Uncle           Person, Person
Wife            Person, Person

Table 21
Rules induced by C2D for the elti data set.

elti(A, B) :- husband(C, A), husband(D, B), brother(C, D)
elti(A, B) :- husband(C, A), husband(D, B), brother(D, C)
elti(A, B) :- husband(C, A), wife(B, D), brother(C, D)
elti(A, B) :- husband(C, A), wife(B, D), brother(D, C)
elti(A, B) :- husband(C, B), wife(A, D), brother(C, D)
elti(A, B) :- husband(C, B), wife(A, D), brother(D, C)
elti(A, B) :- wife(A, C), wife(B, D), brother(C, D)
elti(A, B) :- wife(A, C), wife(B, D), brother(D, C)
Table 22
The experimental results for train and elti data sets.

Experiment         C2D with Unrel.F.   C2D w/o Unrel.F.   CRIS
Eastbound train
  Accuracy         0.7                 0                  0.7
  Coverage         1.0                 0                  1.0
  Time (second)    8                   1                  5
Elti
  Accuracy         1.0                 0.5                1.0
  Coverage         1.0                 0.5                1.0
  Time (minute)    110                 25                 2.5
in the table, the original C2D algorithm cannot find good concept rules, and rule quality is much better in the extended version. On the other hand, CRIS can find high-quality rules in a shorter time due to its generalization technique. In order to test the scalability of CRIS on this experiment, a synthetic data set for the elti experiment was prepared with 2170 records (10 fictitious records for each record of each table in the original elti data set). CRIS still finds the same hypothesis with a linear increase in time.

6. Conclusion

This work presents a concept discovery system, named CRIS, which combines rule extraction methods in ILP with an APRIORI-based specialization operator. In this way, strong declarative biases are relaxed; instead, support and confidence values are used for pruning the search space. In addition, CRIS requires neither user specification of input/output modes of predicate arguments nor negative concept instances. Thus, it provides a suitable data mining framework for non-expert users who are not expected to know much about the semantic details of the large relations, stored in classical database management systems, that they would like to mine.

CRIS has a confidence-based hypothesis evaluation criterion and a confidence-based search space pruning mechanism. The conventional definition of confidence is slightly modified in order to calculate confidence correctly when there are unmatched head predicate arguments. Confidence-based pruning is used in the candidate filtering phase: if the confidence value of a generated rule is not higher than the confidence values of its parents, further specializations of that rule will not make the hypothesis more confident. In this way, such rules are eliminated at early steps.
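The confidence-based pruning criterion described above can be sketched as follows; this is a simplified illustration under one plausible reading of the criterion (a candidate survives only if it improves on every parent), not the CRIS implementation itself:

```python
def keep_candidate(candidate_conf, parent_confs):
    """Confidence-based pruning: keep a specialized rule only if it is
    strictly more confident than each of the parent rules it was
    specialized from; otherwise eliminate it early."""
    return all(candidate_conf > p for p in parent_confs)

# A specialization that raises confidence survives the filtering phase;
# one that does not is pruned, cutting off its entire sub-lattice.
print(keep_candidate(0.80, [0.72, 0.67]))  # True
print(keep_candidate(0.70, [0.72, 0.67]))  # False
```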
In order to generate successful rules for domains where aggregated values such as sum, max and min are descriptive of the semantics of the target concept, it is essential for a concept discovery system to support the definition of aggregations and their inclusion in the concept discovery mechanism. In CRIS, aggregation information is defined in the form of aggregate predicates, which are included in the background knowledge of the concept. The aggregate value in an aggregate predicate is generated by considering all values of the attribute. This increases execution time; however, concept discovery accuracy improves considerably. Given the satisfactory results in rule quality, the decrease in time efficiency may be considered tolerable.

The proposed system is tested on several benchmark problems, including same-generation, mesh design, predictive toxicology evaluation and the mutagenicity test. The experiments show that CRIS has better accuracy than most state-of-the-art knowledge discovery systems. It can handle sparse data with coverage values comparable to those of the state-of-the-art systems, and it can discover transitive rules with high accuracy without mode declarations.
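As an illustration of how an aggregate predicate might be materialized from a background relation by considering all values of an attribute, consider the following sketch; the transaction facts and predicate names are hypothetical, and CRIS's actual aggregation machinery is not reproduced here:

```python
from collections import defaultdict

def build_aggregate_predicate(facts, agg):
    """Materialize an aggregate predicate, e.g. sum_amount(Customer, V),
    from binary facts (key, numeric value), aggregating over *all*
    values of the attribute for each key."""
    groups = defaultdict(list)
    for key, value in facts:
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Hypothetical facts: transaction(Customer, Amount)
transactions = [("ann", 10), ("ann", 25), ("bob", 7)]
print(build_aggregate_predicate(transactions, sum))  # {'ann': 35, 'bob': 7}
print(build_aggregate_predicate(transactions, max))  # {'ann': 25, 'bob': 7}
```

The resulting facts (e.g. sum_amount(ann, 35)) can then be added to the background knowledge like any other relation, which is what lets aggregate information participate in rule bodies.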
As future work, there are several directions in which CRIS can be further improved. One direction is using more efficient query processing in order to handle repeating queries. Another issue to be studied is analyzing and improving the handling of numeric attributes. As another improvement, it is possible to investigate the use of association rule mining techniques other than APRIORI. For this purpose, FP-growth [25] is a good candidate, since it is more efficient owing to its ability to eliminate the candidate generation step. Since APRIORI is more straightforward to apply in the relational domain, we chose to use APRIORI for specialization in CRIS. Adapting FP-growth to the relational domain remains an interesting question.
References

[1] Learning rules for temporal fault diagnosis in satellites. Available from: .
[2] E. Aarts, J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, John Wiley & Sons, Inc., New York, NY, USA, 1989.
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, Fast discovery of association rules, in: Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp. 307–328.
[4] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proc. 20th Int. Conf. Very Large Data Bases (VLDB), Morgan Kaufmann, 1994, pp. 487–499.
[5] A. Assche, C. Vens, H. Blockeel, S. Džeroski, First order random forests: learning relational classifiers with complex aggregates, Mach. Learn. 64 (1–3) (2006) 149–182.
[6] D. Bahler, D.W. Bristol, The induction of rules for predicting chemical carcinogenesis in rodents, in: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, AAAI Press, 1993, pp. 29–37.
[7] G. Bakale, R.D. McCreary, Prospective ke screening of potential carcinogens being tested in rodent bioassays by the US National Toxicology Program, Mutagenesis 7 (2) (1992) 91–94.
[8] R.J. Bayardo, R. Agrawal, D. Gunopulos, Constraint-based rule mining in large, dense databases, Data Min. Knowl. Discovery 4 (2–3) (2000) 217–240.
[9] R. Benigni, Predicting chemical carcinogenesis in rodents: the state of the art in light of a comparative exercise, Mutat. Res. Environ. Mutagen. Relat. Subj. 334 (1) (1995) 103–113.
[10] Y.C. Chien, Y. Chen, A phenotypic genetic algorithm for inductive logic programming, Expert Syst. Appl. 36 (3) (2009) 6935–6944.
[11] L. Dehaspe, Frequent Pattern Discovery in First-Order Logic, PhD thesis, Katholieke Universiteit Leuven, Belgium, 1998.
[12] L. Dehaspe, L. De Raedt, Mining association rules in multiple relations, in: ILP'97: Proceedings of the 7th International Workshop on Inductive Logic Programming, Springer-Verlag, London, UK, 1997, pp. 125–132.
[13] L. Dehaspe, H. Toivonen, Discovery of relational association rules, in: S. Džeroski, N. Lavrač (Eds.), Relational Data Mining, Springer-Verlag, 2001, pp. 189–212.
[14] L. Dehaspe, H. Toivonen, R.D. King, Finding frequent substructures in chemical compounds, in: 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1998, pp. 30–36.
[15] B. Dolsak, S. Muggleton, The application of inductive logic programming to finite element mesh design, in: S. Muggleton (Ed.), Inductive Logic Programming, Academic Press, London, 1992.
[16] B. Dolsak, Finite element mesh design expert system, Knowl. Based Syst. 15 (8) (2002) 315–322.
[17] P. Domingos, Prospects and challenges for multi-relational data mining, SIGKDD Explor. 5 (2003) 80–81.
[18] A. Doncescu, J. Waissman, G. Richard, G. Roux, Characterization of bio-chemical signals by inductive logic programming, Knowl. Based Syst. 15 (1–2) (2002) 129–137.
[19] S. Džeroski, Multi-relational data mining: an introduction, SIGKDD Explor. 5 (1) (2003) 1–16.
[20] S. Džeroski, S. Schulze-Kremer, K. Heidtke, K. Siems, D. Wettschereck, H. Blockeel, Diterpene structure elucidation from 13C NMR spectra with inductive logic programming, Appl. Artif. Intell. 12 (1998) 363–383.
[21] K. Enslein, B.W. Blake, H.H. Borgstedt, Prediction of probability of carcinogenicity for a set of ongoing NTP bioassays, Mutagenesis 5 (4) (1990) 305–306.
[22] C. Feng, Inducing temporal fault diagnostic rules from a qualitative model, in: Inductive Logic Programming, Academic Press, 1992, pp. 473–488.
[23] R. Frank, F. Moser, M. Ester, A method for multi-relational classification using single and multi-feature aggregation functions, in: PKDD, 2007, pp. 430–437.
[24] L. Getoor, J. Grant, PRL: a probabilistic relational language, Mach. Learn. 62 (1–2) (2006) 7–31.
[25] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, SIGMOD Rec. 29 (2) (2000) 1–12.
[26] G. Hinton, UCI machine learning repository kinship data set, 1990. Available from: .
[27] T.D. Jones, C.E. Easterly, On the rodent bioassays currently being conducted on 44 chemicals: a RASH analysis to predict test results from the National Toxicology Program, Mutagenesis 6 (6) (1991) 507–514.
[28] Y. Kavurucu, P. Senkul, I.H. Toroslu, Aggregation in confidence-based concept discovery for multi-relational data mining, in: Proceedings of IADIS European Conference on Data Mining (ECDM), Amsterdam, The Netherlands, 2008, pp. 43–50.
[29] Y. Kavurucu, P. Senkul, I.H. Toroslu, Confidence-based concept discovery in multi-relational data mining, in: Proceedings of International Conference on Data Mining and Applications (ICDMA), Hong Kong, 2008, pp. 446–451.
[30] Y. Kavurucu, P. Senkul, I.H. Toroslu, Analyzing transitive rules on a hybrid concept discovery system, in: Hybrid Artificial Intelligent Systems, LNCS, vol. 5572, Springer, Berlin/Heidelberg, 2009, pp. 227–234.
[31] Y. Kavurucu, P. Senkul, I.H. Toroslu, Confidence-based concept discovery in relational databases, in: Proceedings of 2009 World Congress on Computer Science and Information Engineering (CSIE 2009), Los Angeles, USA, 2009, pp. 43–50.
[32] Y. Kavurucu, P. Senkul, I.H. Toroslu, ILP-based concept discovery in multi-relational data mining, Expert Syst. Appl. 36 (2009).
[33] Y. Kavurucu, P. Senkul, I.H. Toroslu, Multi-relational concept discovery with aggregation, in: Proceedings of 24th International Symposium on Computer and Information Sciences (ISCIS 2009), Northern Cyprus, 2009, pp. 43–50.
[34] R.D. King, A. Srinivasan, M.J.E. Sternberg, Relating chemical activity to structure: an examination of ILP successes, New Gener. Comput. 13 (1995) 411–433.
[35] A.J. Knobbe, A. Siebes, B. Marseille, Involving aggregate functions in multi-relational search, in: PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, London, UK, 2002, pp. 287–298.
[36] E. Lamma, P. Mello, M. Milano, F. Riguzzi, Integrating induction and abduction in logic programming, Inf. Sci. Inf. Comput. Sci. 116 (1) (1999) 25–54.
[37] N. Lavrač, P.A. Flach, B. Zupan, Rule evaluation measures: a unifying view, in: ILP '99: Proceedings of the 9th International Workshop on Inductive Logic Programming, Springer-Verlag, London, UK, 1999, pp. 174–185.
[38] N. Lavrač, S. Džeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York, 1994.
[39] C. Lee, C. Tsai, T. Wu, W. Yang, An approach to mining the multi-relational imbalanced database, Expert Syst. Appl. 34 (4) (2008) 3021–3032.
[40] H.A. Leiva, MRDTL: a multi-relational decision tree learning algorithm, Master's thesis, Iowa State University, Iowa, USA, 2002.
[41] D.F.V. Lewis, C. Ioannides, D.V. Parke, A prospective toxicity evaluation (COMPACT) on 40 chemicals currently being tested by the National Toxicology Program, Mutagenesis 5 (5) (1990) 433–435.
[42] V. Lifschitz, Closed-world databases and circumscription, Artif. Intell. 27 (1985) 229–235.
[43] R. Michalski, J. Larson, Inductive inference of VL decision rules, in: Workshop on Pattern-Directed Inference Systems, SIGART Newsletter, vol. 63, ACM, Hawaii, 1977, pp. 33–44.
[44] J. Minker, On indefinite databases and the closed world assumption, in: Proceedings of the Sixth International Conference on Automated Deduction (CADE'82), 1982, pp. 292–308.
[45] S. Muggleton, Inductive logic programming, New Gener. Comput. 8 (4) (1991) 295–318.
[46] S. Muggleton (Ed.), Inductive Logic Programming, Academic Press, London, 1992.
[47] S. Muggleton, Inverse entailment and PROGOL, New Gener. Comput., Special Issue on Inductive Logic Programming 13 (3–4) (1995) 245–286.
[48] S. Muggleton, Learning from positive data, in: Proceedings of the 6th International Workshop on Inductive Logic Programming, Lecture Notes in Artificial Intelligence, vol. 1314, Springer-Verlag, 1996, pp. 358–376.
[49] J. Neville, D. Jensen, L. Friedland, M. Hay, Learning relational probability trees, in: KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2003, pp. 625–630.
[50] C. Perlich, F. Provost, Aggregation-based feature invention and relational concept classes, in: KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2003, pp. 167–176.
[51] H. Prendinger, M. Ishizuka, A creative abduction approach to scientific and knowledge discovery, Knowl. Based Syst. 18 (7) (2005) 321–326.
[52] J.R. Quinlan, Learning logical definitions from relations, Mach. Learn. 5 (3) (1990) 239–266.
[53] R. Reiter, On closed world data bases, in: Logic and Data Bases, Plenum Press, 1978.
[54] D.M. Sanderson, C.G. Earnshaw, Computer prediction of possible toxic action from chemical structure: the DEREK system, Hum. Exp. Toxicol. 10 (4) (1991) 261–273.
[55] M. Serrurier, H. Prade, Introducing possibilistic logic in ILP for dealing with exceptions, Artif. Intell. 171 (16–17) (2007) 939–950.
[56] M. Serrurier, H. Prade, Improving inductive logic programming by using simulated annealing, Inf. Sci. 178 (6) (2008) 1423–1441.
[57] A. Srinivasan, The ALEPH Manual, 1999.
[58] A. Srinivasan, R. King, S. Muggleton, The role of background knowledge: using a problem from chemistry to examine the performance of an ILP program, under review for Intelligent Data Analysis in Medicine and Pharmacology, Kluwer Academic Press, 1996.
[59] A. Srinivasan, R.D. King, S. Muggleton, M.J.E. Sternberg, Carcinogenesis predictions using ILP, in: Proceedings of the 7th International Workshop on Inductive Logic Programming, vol. 1297, Springer-Verlag, 1997, pp. 273–287.
[60] A. Srinivasan, R.D. King, S.H. Muggleton, M. Sternberg, The predictive toxicology evaluation challenge, in: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), Morgan Kaufmann, 1997, pp. 1–6.
[61] A. Srinivasan, S. Muggleton, R.D. King, M.J.E. Sternberg, Theories for mutagenicity: a study of first-order and feature-based induction, Technical Report PRG-TR-8-95, Oxford University Computing Laboratory, 1995.
[62] R.W. Tennant, J. Spalding, S. Stasiewicz, J. Ashby, Prediction of the outcome of rodent carcinogenicity bioassays currently being conducted on 44 chemicals by the National Toxicology Program, Mutagenesis 5 (1990) 3–14.
[63] I.H. Toroslu, M. Yetisgen-Yildiz, Data mining in deductive databases using query flocks, Expert Syst. Appl. 28 (3) (2005) 395–407.
[64] M. Uludag, M.R. Tolun, A new relational learning system using novel rule selection strategies, Knowl. Based Syst. 19 (8) (2006) 765–771.
[65] L. Wang, X. Liu, A new model of evaluating concept similarity, Knowl. Based Syst. 21 (8) (2008) 842–846.
[66] Q. Wu, Z. Liu, Real formal concept analysis based on grey-rough set theory, Knowl. Based Syst. 22 (1) (2009) 38–45.
[67] X. Yin, J. Han, J. Yang, P.S. Yu, CrossMine: efficient classification across multiple database relations, in: ICDE, 2004, pp. 399–411.