Dynamic programming approach to optimization of approximate decision rules


Information Sciences 221 (2013) 403–418


Information Sciences journal homepage: www.elsevier.com/locate/ins

Talha Amin a, Igor Chikalov a, Mikhail Moshkov a,*, Beata Zielosko a,b,*

a Mathematical and Computer Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
b Institute of Computer Science, University of Silesia, 39, Będzińska St., 41-200 Sosnowiec, Poland

Article info

Article history: Received 29 August 2011. Received in revised form 8 August 2012. Accepted 18 September 2012. Available online 28 September 2012.

Keywords: Approximate decision rules; Dynamic programming; Length; Coverage

Abstract

This paper studies an extension of the dynamic programming approach that allows sequential optimization of approximate decision rules relative to length and coverage. We introduce an uncertainty measure R(T), which is the number of unordered pairs of rows with different decisions in the decision table T. For a nonnegative real number b, we consider b-decision rules that localize rows in subtables of T with uncertainty at most b. Our algorithm constructs a directed acyclic graph $D_b(T)$ whose nodes are subtables of the decision table T given by systems of equations of the kind "attribute = value". The algorithm stops partitioning a subtable when its uncertainty is at most b. The graph $D_b(T)$ allows us to describe the whole set of so-called irredundant b-decision rules. We can describe all irredundant b-decision rules with minimum length and then, among these rules, describe all rules with maximum coverage. We can also change the order of optimization. Restricting the consideration to irredundant rules does not change the results of optimization. The paper also contains results of experiments with decision tables from the UCI Machine Learning Repository.

© 2012 Elsevier Inc. All rights reserved.

* Corresponding authors. Address: Institute of Computer Science, University of Silesia, 39, Będzińska St., 41-200 Sosnowiec, Poland (B. Zielosko). http://dx.doi.org/10.1016/j.ins.2012.09.018

1. Introduction

Decision rules are widely used as parts of algorithms (parallel or nondeterministic), as a way of representing knowledge, and as parts of classifiers (predictors). Exact decision rules can be overfitted, i.e., they can depend essentially on noise or be adjusted too closely to the existing examples. Approximate rules depend less on noise. They usually contain fewer attributes and are easier to understand. Classifiers based on approximate decision rules often have better accuracy than classifiers based on exact decision rules. Therefore approximate decision rules, and the closely related approximate reducts, have been studied intensively in recent years by Nguyen et al. [5,9,10,20–23,25,26,28,31,32,34,35,37].

There are different approaches to the construction of decision rules and reducts: the brute-force approach, which is applicable to tables with a relatively small number of attributes, genetic algorithms [3,35,36], the Apriori algorithm [1], simulated annealing [15], Boolean reasoning [22,27,33], the separate-and-conquer approach (algorithms based on a sequential covering procedure) [6,11,13,14,16], ant colony optimization [7,17,24], algorithms based on decision tree construction [18,21,29], and different kinds of greedy algorithms [20,22]. Each method can have different modifications. For example, as in the case of decision trees, we can use greedy algorithms based on different uncertainty measures (Gini index, entropy, etc.) for the construction of decision rules.

We are interested in the construction of short rules that cover many objects. In particular, the choice of short rules is connected with the Minimum Description Length principle [30]. The rule coverage is important for discovering major patterns in the data. Unfortunately, the problems of minimization of length and maximization of coverage of decision rules are NP-hard. Using results of Feige [12] it is possible to prove that, under reasonable assumptions on the class NP, there are no approximate algorithms with high accuracy and polynomial complexity for minimization of decision rule length.

Most of the approaches mentioned above (with the exception of brute-force, Apriori, and Boolean reasoning) cannot guarantee the construction of shortest rules or rules with maximum coverage.

To work with approximate decision rules, we introduce an uncertainty measure R(T), which is the number of unordered pairs of rows with different decisions in the decision table T. Then we consider b-decision rules that localize rows in subtables of T with uncertainty at most b.

To construct optimal b-decision rules we use an approach based on an extension of dynamic programming. We construct a directed acyclic graph $D_b(T)$ whose nodes are subtables of the decision table T given by systems of equations of the kind "attribute = value". We stop partitioning a subtable when its uncertainty is at most b. The parameter b helps us to control computational complexity and makes the algorithm applicable to more complex problems.

The constructed graph allows us to describe the whole set of so-called irredundant b-decision rules. Then we can find (in fact, describe) all irredundant b-decision rules with minimum length, and after that find among these rules all rules with maximum coverage.
We can also change the order of optimization: find all irredundant b-decision rules with maximum coverage, and after that find among them all rules with minimum length.

We prove that, by removing some conditions from the left-hand side of each b-decision rule that is not irredundant and by changing the decision on the right-hand side of this rule, we can obtain an irredundant b-decision rule whose length is at most the length of the initial rule and whose coverage is at least the coverage of the initial rule. This means that we work not only with optimal rules among irredundant b-decision rules but also with optimal rules among all b-decision rules.

A similar approach to decision tree optimization was considered in [2,8,19]. First results for decision rules based on dynamic programming methods were obtained in [38]. The aim of that study was to find one decision rule with minimum length for each row.

In this paper, we consider algorithms for optimization of b-decision rules relative to length and coverage, and results of experiments with some decision tables from the UCI Machine Learning Repository [4], based on the Dagger software system created at KAUST.

This paper consists of nine sections. Section 2 contains the main notions. Section 3 is devoted to irredundant b-decision rules. In Section 4, we study the directed acyclic graph $D_b(T)$, which allows us to describe the whole set of irredundant b-decision rules. In Section 5, we consider a procedure of optimization of this graph (in fact, of the corresponding b-decision rules) relative to length, and in Section 6 relative to coverage. In Section 7, we discuss possibilities of sequential optimization of rules relative to length and coverage. Section 8 contains results of experiments with decision tables from the UCI Machine Learning Repository, and Section 9 contains conclusions.

2. Main notions

In this section, we define the notions related to decision tables and decision rules.
A decision table T is a rectangular table with n columns labeled with conditional attributes $f_1, \ldots, f_n$. Rows of this table are filled with nonnegative integers, which are interpreted as values of the conditional attributes. Rows of T are pairwise different, and each row is labeled with a nonnegative integer (decision), which is interpreted as a value of the decision attribute.

We denote by N(T) the number of rows in the table T. By R(T) we denote the number of unordered pairs of rows with different decisions. We interpret this value as the uncertainty of the table T. The table T is called degenerate if T is empty or all rows of T are labeled with the same decision. It is clear that in this case R(T) = 0. The minimum decision value among those attached to the maximum number of rows in T is called the most common decision for T.

Let $f_{i_1}, \ldots, f_{i_m} \in \{f_1, \ldots, f_n\}$ and let $a_1, \ldots, a_m$ be nonnegative integers. We denote by $T(f_{i_1}, a_1) \ldots (f_{i_m}, a_m)$ the subtable of the table T that contains only the rows of T that have the numbers $a_1, \ldots, a_m$ at the intersection with the columns $f_{i_1}, \ldots, f_{i_m}$. Such subtables (including the table T itself) are called separable subtables of T.

We denote by E(T) the set of attributes from $\{f_1, \ldots, f_n\}$ that are not constant on T. For any $f_i \in E(T)$, we denote by $E(T, f_i)$ the set of values of the attribute $f_i$ in T.

The expression

$$f_{i_1} = a_1 \wedge \ldots \wedge f_{i_m} = a_m \rightarrow d \tag{1}$$

is called a decision rule over T if $f_{i_1}, \ldots, f_{i_m} \in \{f_1, \ldots, f_n\}$, and $a_1, \ldots, a_m, d$ are nonnegative integers. It is possible that m = 0. In this case (1) is equal to

$$\rightarrow d.$$
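The notions above translate into a few short functions. The following Python sketch is only an illustration and is not taken from the paper: it assumes that a table is stored as a list of (row, decision) pairs, that attributes are indexed from 0 (so index 0 stands for $f_1$), and that the table is nonempty where a most common decision is requested. All function names are hypothetical. The sample data are the rows of the table T0 that the paper uses later as its running example (Table 1).

```python
from collections import Counter
from itertools import combinations

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def uncertainty(table):
    """R(T): number of unordered pairs of rows with different decisions."""
    return sum(1 for (_, d1), (_, d2) in combinations(table, 2) if d1 != d2)


def most_common_decision(table):
    """Minimum decision among those attached to the maximum number of rows."""
    counts = Counter(d for _, d in table)
    best = max(counts.values())
    return min(d for d, k in counts.items() if k == best)


def subtable(table, conditions):
    """T(f_i1, a1)...(f_im, am); conditions is a list of (attribute index, value)."""
    return [(row, d) for row, d in table if all(row[i] == a for i, a in conditions)]


def nonconstant_attributes(table):
    """E(T): indices of the attributes that take at least two values on T."""
    if not table:
        return []
    n = len(table[0][0])
    return [i for i in range(n) if len({row[i] for row, _ in table}) > 1]


def attribute_values(table, i):
    """E(T, f_i): sorted list of the values of attribute i in T."""
    return sorted({row[i] for row, _ in table})
```

For T0 this gives R(T0) = 8, and the subtable T0(f1, 0) has uncertainty 2 and most common decision 3.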


Let $r = (b_1, \ldots, b_n)$ be a row of T. We say that the rule (1) is realizable for r if $a_1 = b_{i_1}, \ldots, a_m = b_{i_m}$. If m = 0 then (1) is realizable for any row of T.

Let b be a nonnegative real number. We say that the rule (1) is b-true for T if d is the most common decision for $T' = T(f_{i_1}, a_1) \ldots (f_{i_m}, a_m)$ and $R(T') \le b$. If m = 0 then the rule (1) is b-true for T if d is the most common decision for T and $R(T) \le b$.

If the rule (1) is b-true for T and realizable for r, we say that (1) is a b-decision rule for T and r. Note that if b = 0 then we have an exact decision rule for T and r (a 100% true rule for T and r).

3. Irredundant b-decision rules

Let (1) be a decision rule over T. We say that the rule (1) is an irredundant b-decision rule for T and r if (1) is a b-decision rule for T and r and the following conditions hold if m > 0:

(i) $f_{i_1} \in E(T)$, and if m > 1 then $f_{i_j} \in E(T(f_{i_1}, a_1) \ldots (f_{i_{j-1}}, a_{j-1}))$ for j = 2, ..., m;
(ii) if m = 1 then $R(T) > b$, and if m > 1 then $R(T(f_{i_1}, a_1) \ldots (f_{i_{m-1}}, a_{m-1})) > b$.

Lemma 1. Let T be a decision table with $R(T) > b$, $f_{i_1} \in E(T)$, $a_1 \in E(T, f_{i_1})$, and let r be a row of the table $T' = T(f_{i_1}, a_1)$. Then the rule (1) with $m \ge 1$ is an irredundant b-decision rule for T and r if and only if the rule

$$f_{i_2} = a_2 \wedge \ldots \wedge f_{i_m} = a_m \rightarrow d \tag{2}$$

is an irredundant b-decision rule for T' and r (if m = 1 then (2) is equal to $\rightarrow d$).

Proof. It is clear that (1) is a b-decision rule for T and r if and only if (2) is a b-decision rule for T' and r. It is easy to show that the statement of the lemma holds if m = 1. Let now m > 1.

Let (1) be an irredundant b-decision rule for T and r. Then from (i) it follows that $f_{i_2} \in E(T')$ and, if m > 2, then for j = 3, ..., m,

$$f_{i_j} \in E(T'(f_{i_2}, a_2) \ldots (f_{i_{j-1}}, a_{j-1})).$$

From (ii) it follows that $R(T') > b$ if m = 2 and $R(T'(f_{i_2}, a_2) \ldots (f_{i_{m-1}}, a_{m-1})) > b$ if m > 2. Therefore (2) is an irredundant b-decision rule for T' and r.

Let (2) be an irredundant b-decision rule for T' and r. Then, for j = 2, ..., m,

$$f_{i_j} \in E(T(f_{i_1}, a_1) \ldots (f_{i_{j-1}}, a_{j-1})).$$

We also know that $f_{i_1} \in E(T)$. Therefore condition (i) holds. Since (2) is an irredundant b-decision rule for T' and r, we have that if m = 2 then $R(T') > b$, and if m > 2 then $R(T'(f_{i_2}, a_2) \ldots (f_{i_{m-1}}, a_{m-1})) > b$. Hence condition (ii) holds. Therefore (1) is an irredundant b-decision rule for T and r. □

Let s be a decision rule over T and let s be equal to (1). The number m of conditions on the left-hand side of s is called the length of this rule and is denoted by l(s). If m = 0 then the length of the decision rule s is equal to 0. The coverage of s is the number of rows in T for which s is realizable and which are labeled with the decision d. We denote it by c(s). If m = 0 then the coverage of the decision rule s is equal to the number of rows in T which are labeled with the decision d.

Proposition 1. Let T be a nonempty decision table, r be a row of T, and s be a b-decision rule for T and r which is not irredundant. Then by removing some conditions from the left-hand side of s and by changing the decision on the right-hand side of s we can obtain an irredundant b-decision rule irr(s) for T and r such that $l(\mathrm{irr}(s)) \le l(s)$ and $c(\mathrm{irr}(s)) \ge c(s)$.

Proof. Let s be equal to (1). Let T be a table for which $R(T) \le b$, and let p be the most common decision for T. One can show that the rule $\rightarrow p$ is an irredundant b-decision rule for T and r. We denote this rule by irr(s). It is clear that $l(\mathrm{irr}(s)) \le l(s)$ and $c(\mathrm{irr}(s)) \ge c(s)$.

Let T be a table for which $R(T) > b$. Let $k \in \{1, \ldots, m\}$ be the minimum number such that $R(T') \le b$ for $T' = T(f_{i_1}, a_1) \ldots (f_{i_k}, a_k)$. If k < m then we denote by s' the decision rule

$$f_{i_1} = a_1 \wedge \ldots \wedge f_{i_k} = a_k \rightarrow q,$$

where q is the most common decision for T'. If k = m then s' = s. It is clear that s' is a b-decision rule for T and r. If $f_{i_1} \notin E(T)$ then we remove the condition $f_{i_1} = a_1$ from s'. For any $j \in \{2, \ldots, k\}$, if $f_{i_j} \notin E(T(f_{i_1}, a_1) \ldots (f_{i_{j-1}}, a_{j-1}))$ then we remove the condition $f_{i_j} = a_j$ from the left-hand side of the rule s'. One can show that the obtained rule is an irredundant b-decision rule for T and r. We denote this rule by irr(s). It is clear that $l(s) \ge l(\mathrm{irr}(s))$. One can show that $c(s) \le c(s') = c(\mathrm{irr}(s))$. □

Let $0 \le b_1 \le b_2$ and let s be an irredundant $b_1$-decision rule for T and r. It is clear that s is a $b_2$-decision rule for T and r. If s is not an irredundant $b_2$-decision rule for T and r, then from Proposition 1 it follows that there exists an irredundant $b_2$-decision rule irr(s) for T and r such that $l(\mathrm{irr}(s)) \le l(s)$ and $c(\mathrm{irr}(s)) \ge c(s)$. From here the next two corollaries of Proposition 1 follow:


- If $0 \le b_1 \le b_2$ then the minimum length of an irredundant $b_2$-decision rule for T and r is less than or equal to the minimum length of an irredundant $b_1$-decision rule for T and r.
- If $0 \le b_1 \le b_2$ then the maximum coverage of an irredundant $b_2$-decision rule for T and r is greater than or equal to the maximum coverage of an irredundant $b_1$-decision rule for T and r.

4. Directed acyclic graph $D_b(T)$

We now consider an algorithm that constructs a directed acyclic graph $D_b(T)$, which will be used to describe the set of irredundant b-decision rules for T and for each row r of T. Nodes of the graph are some separable subtables of the table T. During each step, the algorithm processes one node and marks it with the symbol *. At the first step, the algorithm constructs a graph containing a single node T which is not marked with *.

Let the algorithm have already performed p steps. We describe step (p + 1). If all nodes are marked with the symbol * as processed, the algorithm finishes its work and presents the resulting graph as $D_b(T)$. Otherwise, choose a node (table) H which has not been processed yet. If $R(H) \le b$, mark the considered node with the symbol * and proceed to step (p + 2). If $R(H) > b$, then for each $f_i \in E(H)$ draw a bundle of edges from the node H as follows. Let $E(H, f_i) = \{b_1, \ldots, b_t\}$. Draw t edges from H and label these edges with the pairs $(f_i, b_1), \ldots, (f_i, b_t)$ respectively. These edges enter the nodes $H(f_i, b_1), \ldots, H(f_i, b_t)$. If some of the nodes $H(f_i, b_1), \ldots, H(f_i, b_t)$ are absent from the graph then add these nodes to the graph. We label each row r of H with the set of attributes $E_{D_b(T)}(H, r) = E(H)$. Mark the node H with the symbol * and proceed to step (p + 2).

The graph $D_b(T)$ is a directed acyclic graph. A node of this graph is called terminal if there are no edges leaving this node. Note that a node H of $D_b(T)$ is terminal if and only if $R(H) \le b$.

Later, we will describe the procedures of optimization of the graph $D_b(T)$.
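The construction of the graph can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: a table is a nonempty list of (row, decision) pairs, attributes are indexed from 0, and a node is identified with the set of indices of the rows of the subtable, so that two systems of equations selecting the same rows give the same node. The names `uncertainty` and `build_graph` are hypothetical. The sample data are the rows of the paper's example table T0 (Table 1).

```python
from itertools import combinations

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def uncertainty(table, node):
    """R(H) for the subtable H given by the set of row indices `node`."""
    return sum(1 for i, j in combinations(sorted(node), 2)
               if table[i][1] != table[j][1])


def build_graph(table, b):
    """Return (root, edges); edges[H] maps a label (attribute, value) to a child."""
    n_attrs = len(table[0][0])
    root = frozenset(range(len(table)))
    edges = {}          # keys are the processed nodes (those marked with *)
    pending = [root]    # nodes that are in the graph but not processed yet
    while pending:
        node = pending.pop()
        if node in edges:
            continue    # this subtable has already been processed
        edges[node] = {}
        if uncertainty(table, node) <= b:
            continue    # terminal node: partitioning stops here
        for f in range(n_attrs):
            values = sorted({table[i][0][f] for i in node})
            if len(values) < 2:
                continue  # f is constant on H, so f does not belong to E(H)
            for v in values:
                child = frozenset(i for i in node if table[i][0][f] == v)
                edges[node][(f, v)] = child
                pending.append(child)
    return root, edges


root, edges = build_graph(T0, 2)
terminal = [h for h, out in edges.items() if not out]
```

For T0 and b = 2 the sketch produces 9 nodes, 7 of them terminal, and 10 edges.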
As a result of each such procedure we obtain a graph C with the same sets of nodes and edges as in $D_b(T)$. The only difference is that any row r of each nonterminal node H of C is labeled with a nonempty set of attributes $E_C(H, r) \subseteq E(H)$, possibly different from E(H).

Now, for each node H of C and for each row r of H, we describe a set of rules $\mathrm{Rul}_C(H, r)$. We move from the terminal nodes of C to the node T. Let H be a terminal node of C and d be the most common decision for H. Then

$$\mathrm{Rul}_C(H, r) = \{\rightarrow d\}.$$

Let now H be a nonterminal node of C such that for each child H' of H and for each row r' of H', the set of rules $\mathrm{Rul}_C(H', r')$ is already defined. Let $r = (b_1, \ldots, b_n)$ be a row of H. For any $f_i \in E_C(H, r)$, we define the set of rules $\mathrm{Rul}_C(H, r, f_i)$ as follows:

$$\mathrm{Rul}_C(H, r, f_i) = \{f_i = b_i \wedge c \rightarrow s : c \rightarrow s \in \mathrm{Rul}_C(H(f_i, b_i), r)\}.$$

Then

$$\mathrm{Rul}_C(H, r) = \bigcup_{f_i \in E_C(H, r)} \mathrm{Rul}_C(H, r, f_i).$$
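The recursive definition of the rule sets can be turned into a short memoized function. The sketch below is illustrative and rests on assumptions that are not in the paper: it handles only the case $C = D_b(T)$, where $E_C(H, r) = E(H)$; a node is the set of row indices of a subtable; attributes are indexed from 0; and a rule is stored as a pair (conditions, decision), where conditions is a tuple of (attribute, value) pairs. The name `all_rules` is hypothetical. The uncertainty is computed from decision counts through the identity $R(H) = (N^2 - \sum_d k_d^2)/2$, where N is the number of rows of H and $k_d$ is the number of rows with decision d.

```python
from collections import Counter
from functools import lru_cache

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def all_rules(table, b):
    """For each row index r, return Rul(T, r) as a frozenset of rules."""
    n_attrs = len(table[0][0])

    def summary(node):
        """Return (R(H), most common decision for H)."""
        counts = Counter(table[i][1] for i in node)
        size = len(node)
        r_value = (size * size - sum(k * k for k in counts.values())) // 2
        best = max(counts.values())
        return r_value, min(d for d, k in counts.items() if k == best)

    @lru_cache(maxsize=None)
    def rul(node, r):
        r_value, decision = summary(node)
        if r_value <= b:
            # Terminal node: the only rule has an empty left-hand side.
            return frozenset({((), decision)})
        found = set()
        for f in range(n_attrs):
            if len({table[i][0][f] for i in node}) < 2:
                continue  # f is constant on H
            v = table[r][0][f]
            child = frozenset(i for i in node if table[i][0][f] == v)
            for conditions, d in rul(child, r):
                found.add((((f, v),) + conditions, d))
        return frozenset(found)

    root = frozenset(range(len(table)))
    return {r: rul(root, r) for r in range(len(table))}


rules = all_rules(T0, 2)
```

Here `rules[0]` contains the four rules f1 = 0 → 3, f2 = 0 → 1, f3 = 1 ∧ f1 = 0 → 1 and f3 = 1 ∧ f2 = 0 → 1 for the row r1, written with 0-based attribute indices.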

Theorem 1. For each node H of the graph $D_b(T)$ and for each row r of H, the set $\mathrm{Rul}_{D_b(T)}(H, r)$ is equal to the set of all irredundant b-decision rules for H and r.

Proof. We prove this statement by induction on the nodes of $D_b(T)$. Let H be a terminal node of $D_b(T)$ and d be the most common decision for H. One can show that the rule $\rightarrow d$ is the only irredundant b-decision rule for H and r. Therefore the set $\mathrm{Rul}_{D_b(T)}(H, r)$ is equal to the set of all irredundant b-decision rules for H and r.

Let H be a nonterminal node of $D_b(T)$ such that the statement of the theorem holds for each child of H. Let $r = (b_1, \ldots, b_n)$ be a row of H. It is clear that $R(H) > b$. Using Lemma 1 we obtain that the set $\mathrm{Rul}_{D_b(T)}(H, r)$ contains only irredundant b-decision rules for H and r.

Let s be an irredundant b-decision rule for H and r. Since $R(H) > b$, the left-hand side of s is nonempty. Therefore s can be represented in the form $f_i = b_i \wedge c \rightarrow d$, where $f_i \in E(H)$. By Lemma 1, $c \rightarrow d$ is an irredundant b-decision rule for $H(f_i, b_i)$ and r. From the inductive hypothesis we get that the rule $c \rightarrow d$ belongs to the set $\mathrm{Rul}_{D_b(T)}(H(f_i, b_i), r)$. Therefore $s \in \mathrm{Rul}_{D_b(T)}(H, r)$. □

To illustrate the algorithms studied in this paper, we consider a simple decision table $T_0$ (see Table 1). We set b = 2, so during the construction of the graph $D_2(T_0)$ (see Fig. 1) we stop partitioning a subtable H of $T_0$ if $R(H) \le 2$. We denote $G = D_2(T_0)$.

Table 1. Decision table T0.

T0    f1   f2   f3   Decision
r1    0    0    1    1
r2    1    0    1    1
r3    1    1    1    2
r4    0    1    1    3
r5    0    0    0    3

For each node H of the graph G and for each row r of H we describe the set $\mathrm{Rul}_G(H, r)$. We move from the terminal nodes of G to the node $T_0$. The terminal nodes of the graph G are $H_1, H_2, H_3, H_4, H_5, H_7, H_8$. For these nodes:

$\mathrm{Rul}_G(H_1, r_1) = \mathrm{Rul}_G(H_1, r_4) = \mathrm{Rul}_G(H_1, r_5) = \{\rightarrow 3\}$;
$\mathrm{Rul}_G(H_2, r_2) = \mathrm{Rul}_G(H_2, r_3) = \{\rightarrow 1\}$;
$\mathrm{Rul}_G(H_3, r_1) = \mathrm{Rul}_G(H_3, r_2) = \mathrm{Rul}_G(H_3, r_5) = \{\rightarrow 1\}$;
$\mathrm{Rul}_G(H_4, r_3) = \mathrm{Rul}_G(H_4, r_4) = \{\rightarrow 2\}$;
$\mathrm{Rul}_G(H_5, r_5) = \{\rightarrow 3\}$;
$\mathrm{Rul}_G(H_7, r_1) = \mathrm{Rul}_G(H_7, r_4) = \{\rightarrow 1\}$;
$\mathrm{Rul}_G(H_8, r_1) = \mathrm{Rul}_G(H_8, r_2) = \{\rightarrow 1\}$.

Now we can describe the sets of rules attached to the rows of $H_6$. This is a nonterminal node of G for which all children $H_2$, $H_4$, $H_7$, and $H_8$ are already treated. We have:

$\mathrm{Rul}_G(H_6, r_1) = \{f_1 = 0 \rightarrow 1,\ f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}_G(H_6, r_2) = \{f_1 = 1 \rightarrow 1,\ f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}_G(H_6, r_3) = \{f_1 = 1 \rightarrow 1,\ f_2 = 1 \rightarrow 2\}$;
$\mathrm{Rul}_G(H_6, r_4) = \{f_1 = 0 \rightarrow 1,\ f_2 = 1 \rightarrow 2\}$.

Finally, we can describe the sets of rules attached to the rows of $T_0$:

$\mathrm{Rul}_G(T_0, r_1) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_1 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}_G(T_0, r_2) = \{f_1 = 1 \rightarrow 1,\ f_2 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_1 = 1 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}_G(T_0, r_3) = \{f_1 = 1 \rightarrow 1,\ f_2 = 1 \rightarrow 2,\ f_3 = 1 \wedge f_1 = 1 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 1 \rightarrow 2\}$;
$\mathrm{Rul}_G(T_0, r_4) = \{f_1 = 0 \rightarrow 3,\ f_2 = 1 \rightarrow 2,\ f_3 = 1 \wedge f_1 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 1 \rightarrow 2\}$;
$\mathrm{Rul}_G(T_0, r_5) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1,\ f_3 = 0 \rightarrow 3\}$.

In Sections 5 and 6, we present optimization of irredundant b-decision rules relative to length and coverage, respectively, and in Section 7 we present sequential optimization of such rules. We can consider the problem of optimization of irredundant b-decision rules relative to length and coverage as a multi-criteria optimization problem with hierarchically dependent criteria.

Fig. 1. Directed acyclic graph $G = D_2(T_0)$.

5. Procedure of optimization relative to length

We consider the procedure of optimization of the graph C relative to the length l. For each node H in the graph C, this procedure assigns to each row r of H the set $\mathrm{Rul}^l_C(H, r)$ of b-decision rules with minimum length from $\mathrm{Rul}_C(H, r)$ and the number $\mathrm{Opt}^l_C(H, r)$, the minimum length of a b-decision rule from $\mathrm{Rul}_C(H, r)$.

The idea of the procedure is simple. It is clear that for each terminal node H of C and for each row r of H the following equalities hold:

$$\mathrm{Rul}^l_C(H, r) = \mathrm{Rul}_C(H, r) = \{\rightarrow d\},$$

where d is the most common decision for H, and

$$\mathrm{Opt}^l_C(H, r) = 0.$$

Let H be a nonterminal node of C, and $r = (b_1, \ldots, b_n)$ be a row of H. We know that

$$\mathrm{Rul}_C(H, r) = \bigcup_{f_i \in E_C(H, r)} \mathrm{Rul}_C(H, r, f_i)$$

and, for $f_i \in E_C(H, r)$,

$$\mathrm{Rul}_C(H, r, f_i) = \{f_i = b_i \wedge c \rightarrow s : c \rightarrow s \in \mathrm{Rul}_C(H(f_i, b_i), r)\}.$$

For $f_i \in E_C(H, r)$, we denote by $\mathrm{Rul}^l_C(H, r, f_i)$ the set of all b-decision rules with minimum length from $\mathrm{Rul}_C(H, r, f_i)$ and by $\mathrm{Opt}^l_C(H, r, f_i)$ the minimum length of a b-decision rule from $\mathrm{Rul}_C(H, r, f_i)$. One can show that

$$\mathrm{Rul}^l_C(H, r, f_i) = \{f_i = b_i \wedge c \rightarrow s : c \rightarrow s \in \mathrm{Rul}^l_C(H(f_i, b_i), r)\},$$

$$\mathrm{Opt}^l_C(H, r, f_i) = \mathrm{Opt}^l_C(H(f_i, b_i), r) + 1,$$

and

$$\mathrm{Opt}^l_C(H, r) = \min\{\mathrm{Opt}^l_C(H, r, f_i) : f_i \in E_C(H, r)\} = \min\{\mathrm{Opt}^l_C(H(f_i, b_i), r) + 1 : f_i \in E_C(H, r)\}.$$

It is also easy to see that

$$\mathrm{Rul}^l_C(H, r) = \bigcup_{\substack{f_i \in E_C(H, r), \\ \mathrm{Opt}^l_C(H(f_i, b_i), r) + 1 = \mathrm{Opt}^l_C(H, r)}} \mathrm{Rul}^l_C(H, r, f_i).$$

We now describe the procedure of optimization of the graph C relative to the length l. We move from the terminal nodes of the graph C to the node T. We assign to each row r of each table H the number $\mathrm{Opt}^l_C(H, r)$, which is the minimum length of a b-decision rule from $\mathrm{Rul}_C(H, r)$, and we change the set $E_C(H, r)$ attached to the row r in H if H is a nonterminal node of C. We denote the obtained graph by $C^l$.

Let H be a terminal node of C. Then we assign the number

$$\mathrm{Opt}^l_C(H, r) = 0$$

to each row r of H.

Let H be a nonterminal node of C and all children of H have already been treated. Let $r = (b_1, \ldots, b_n)$ be a row of H. We assign the number

$$\mathrm{Opt}^l_C(H, r) = \min\{\mathrm{Opt}^l_C(H(f_i, b_i), r) + 1 : f_i \in E_C(H, r)\}$$

to the row r in the table H and set

$$E_{C^l}(H, r) = \{f_i : f_i \in E_C(H, r),\ \mathrm{Opt}^l_C(H(f_i, b_i), r) + 1 = \mathrm{Opt}^l_C(H, r)\}.$$

From the reasoning that precedes the description of the procedure of optimization relative to the length, the next statement follows.

Theorem 2. For each node H of the graph $C^l$ and for each row r of H, the set $\mathrm{Rul}_{C^l}(H, r)$ is equal to the set $\mathrm{Rul}^l_C(H, r)$ of all b-decision rules with the minimum length from the set $\mathrm{Rul}_C(H, r)$.
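The procedure of optimization relative to the length can be sketched as a bottom-up recursion with memoization. The Python code below is a hedged illustration, not the authors' implementation. It assumes that $C = D_b(T)$ (so the initial attribute sets are E(H)), that rows are pairwise different as the paper requires, that a node is the set of row indices of a subtable, and that attributes are indexed from 0. The names `stats`, `split` and `optimize_length` are hypothetical.

```python
from collections import Counter
from functools import lru_cache

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def stats(table, node):
    """Return (R(H), most common decision d for H, number of rows of H with d)."""
    counts = Counter(table[i][1] for i in node)
    size = len(node)
    r_value = (size * size - sum(k * k for k in counts.values())) // 2
    best = max(counts.values())
    return r_value, min(d for d, k in counts.items() if k == best), best


def split(table, node, r):
    """Yield (f, value of f in row r, child) for each f that is not constant on H."""
    for f in range(len(table[0][0])):
        if len({table[i][0][f] for i in node}) > 1:
            v = table[r][0][f]
            yield f, v, frozenset(i for i in node if table[i][0][f] == v)


def optimize_length(table, b):
    """For each row index r return (minimum length, set of minimum-length rules)."""

    @lru_cache(maxsize=None)
    def opt(node, r):
        if stats(table, node)[0] <= b:
            return 0  # terminal node: the rule "-> d" has length 0
        return 1 + min(opt(child, r) for _, _, child in split(table, node, r))

    @lru_cache(maxsize=None)
    def rules(node, r):
        r_value, decision, _ = stats(table, node)
        if r_value <= b:
            return frozenset({((), decision)})
        kept = set()
        for f, v, child in split(table, node, r):
            if opt(child, r) + 1 == opt(node, r):  # f stays in the reduced set
                kept |= {(((f, v),) + cond, d) for cond, d in rules(child, r)}
        return frozenset(kept)

    root = frozenset(range(len(table)))
    return {r: (opt(root, r), rules(root, r)) for r in range(len(table))}


result = optimize_length(T0, 2)
```

For T0 and b = 2 every row has minimum length 1, and for the row r1 the sketch returns the rules f1 = 0 → 3 and f2 = 0 → 1.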


Fig. 2. Graph $G^l$.

Fig. 2 presents the directed acyclic graph $G^l$ obtained from the graph G (see Fig. 1) by the procedure of optimization relative to the length. Using the graph $G^l$ we can describe, for each row $r_i$, i = 1, ..., 5, of the table $T_0$, the set $\mathrm{Rul}^l_G(T_0, r_i)$ of all irredundant 2-decision rules for $T_0$ and $r_i$ with minimum length:

$\mathrm{Rul}^l_G(T_0, r_1) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}^l_G(T_0, r_2) = \{f_1 = 1 \rightarrow 1,\ f_2 = 0 \rightarrow 1\}$;
$\mathrm{Rul}^l_G(T_0, r_3) = \{f_1 = 1 \rightarrow 1,\ f_2 = 1 \rightarrow 2\}$;
$\mathrm{Rul}^l_G(T_0, r_4) = \{f_1 = 0 \rightarrow 3,\ f_2 = 1 \rightarrow 2\}$;
$\mathrm{Rul}^l_G(T_0, r_5) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1,\ f_3 = 0 \rightarrow 3\}$.

6. Procedure of optimization relative to coverage

We consider the procedure of optimization of the graph C relative to the coverage c. For each node H in the graph C, this procedure assigns to each row r of H the set $\mathrm{Rul}^c_C(H, r)$ of b-decision rules with maximum coverage from $\mathrm{Rul}_C(H, r)$ and the number $\mathrm{Opt}^c_C(H, r)$, the maximum coverage of a b-decision rule from $\mathrm{Rul}_C(H, r)$.

The idea of the procedure is simple. It is clear that for each terminal node H of C and for each row r of H the following equalities hold:

$$\mathrm{Rul}^c_C(H, r) = \mathrm{Rul}_C(H, r) = \{\rightarrow d\},$$

where d is the most common decision for H, and $\mathrm{Opt}^c_C(H, r)$ is equal to the number of rows in H which are labeled with the decision d.

Let H be a nonterminal node of C and $r = (b_1, \ldots, b_n)$ be a row of H. We know that

$$\mathrm{Rul}_C(H, r) = \bigcup_{f_i \in E_C(H, r)} \mathrm{Rul}_C(H, r, f_i)$$

and, for $f_i \in E_C(H, r)$,

$$\mathrm{Rul}_C(H, r, f_i) = \{f_i = b_i \wedge c \rightarrow s : c \rightarrow s \in \mathrm{Rul}_C(H(f_i, b_i), r)\}.$$

For $f_i \in E_C(H, r)$, we denote by $\mathrm{Rul}^c_C(H, r, f_i)$ the set of all b-decision rules with maximum coverage from $\mathrm{Rul}_C(H, r, f_i)$ and by $\mathrm{Opt}^c_C(H, r, f_i)$ the maximum coverage of a b-decision rule from $\mathrm{Rul}_C(H, r, f_i)$. One can show that

$$\mathrm{Rul}^c_C(H, r, f_i) = \{f_i = b_i \wedge c \rightarrow s : c \rightarrow s \in \mathrm{Rul}^c_C(H(f_i, b_i), r)\},$$

$$\mathrm{Opt}^c_C(H, r, f_i) = \mathrm{Opt}^c_C(H(f_i, b_i), r),$$

and

$$\mathrm{Opt}^c_C(H, r) = \max\{\mathrm{Opt}^c_C(H, r, f_i) : f_i \in E_C(H, r)\} = \max\{\mathrm{Opt}^c_C(H(f_i, b_i), r) : f_i \in E_C(H, r)\}.$$

It is also easy to see that

$$\mathrm{Rul}^c_C(H, r) = \bigcup_{\substack{f_i \in E_C(H, r), \\ \mathrm{Opt}^c_C(H(f_i, b_i), r) = \mathrm{Opt}^c_C(H, r)}} \mathrm{Rul}^c_C(H, r, f_i).$$

We now describe the procedure of optimization of the graph C relative to the coverage c. We move from the terminal nodes of the graph C to the node T. We assign to each row r of each table H the number $\mathrm{Opt}^c_C(H, r)$, which is the maximum coverage of a b-decision rule from $\mathrm{Rul}_C(H, r)$, and we change the set $E_C(H, r)$ attached to the row r in H if H is a nonterminal node of C. We denote the obtained graph by $C^c$.

Let H be a terminal node of C and d be the most common decision for H. Then we assign to each row r of H the number $\mathrm{Opt}^c_C(H, r)$ that is equal to the number of rows in H which are labeled with the decision d.

Let H be a nonterminal node of C and all children of H have already been treated. Let $r = (b_1, \ldots, b_n)$ be a row of H. We assign the number

$$\mathrm{Opt}^c_C(H, r) = \max\{\mathrm{Opt}^c_C(H(f_i, b_i), r) : f_i \in E_C(H, r)\}$$

to the row r in the table H and set

$$E_{C^c}(H, r) = \{f_i : f_i \in E_C(H, r),\ \mathrm{Opt}^c_C(H(f_i, b_i), r) = \mathrm{Opt}^c_C(H, r)\}.$$

From the reasoning that precedes the description of the procedure of optimization relative to the coverage, the next statement follows.

Theorem 3. For each node H of the graph $C^c$ and for each row r of H, the set $\mathrm{Rul}_{C^c}(H, r)$ is equal to the set $\mathrm{Rul}^c_C(H, r)$ of all b-decision rules with the maximum coverage from the set $\mathrm{Rul}_C(H, r)$.

Fig. 3 presents the directed acyclic graph $G^c$ obtained from the graph G (see Fig. 1) by the procedure of optimization relative to the coverage.
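The procedure of optimization relative to the coverage differs from the length case in two places: a terminal node contributes the number of its rows labeled with the most common decision, and a nonterminal node takes the maximum over its children without adding 1. The Python sketch below illustrates this under the same assumptions as before, which are not part of the paper: $C = D_b(T)$, rows are pairwise different, a node is the set of row indices of a subtable, and attributes are indexed from 0. The names `stats`, `split` and `optimize_coverage` are hypothetical.

```python
from collections import Counter
from functools import lru_cache

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def stats(table, node):
    """Return (R(H), most common decision d for H, number of rows of H with d)."""
    counts = Counter(table[i][1] for i in node)
    size = len(node)
    r_value = (size * size - sum(k * k for k in counts.values())) // 2
    best = max(counts.values())
    return r_value, min(d for d, k in counts.items() if k == best), best


def split(table, node, r):
    """Yield (f, value of f in row r, child) for each f that is not constant on H."""
    for f in range(len(table[0][0])):
        if len({table[i][0][f] for i in node}) > 1:
            v = table[r][0][f]
            yield f, v, frozenset(i for i in node if table[i][0][f] == v)


def optimize_coverage(table, b):
    """For each row index r return (maximum coverage, set of maximum-coverage rules)."""

    @lru_cache(maxsize=None)
    def opt(node, r):
        r_value, _, covered = stats(table, node)
        if r_value <= b:
            return covered  # rows of the terminal node labeled with d
        return max(opt(child, r) for _, _, child in split(table, node, r))

    @lru_cache(maxsize=None)
    def rules(node, r):
        r_value, decision, _ = stats(table, node)
        if r_value <= b:
            return frozenset({((), decision)})
        kept = set()
        for f, v, child in split(table, node, r):
            if opt(child, r) == opt(node, r):  # f stays in the reduced set
                kept |= {(((f, v),) + cond, d) for cond, d in rules(child, r)}
        return frozenset(kept)

    root = frozenset(range(len(table)))
    return {r: (opt(root, r), rules(root, r)) for r in range(len(table))}


result = optimize_coverage(T0, 2)
```

For T0 and b = 2 the maximum coverages for the rows r1, ..., r5 are 2, 2, 1, 2, 2, and for the row r4 the only rule with maximum coverage is f1 = 0 → 3.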

Fig. 3. Graph $G^c$.

Using the graph $G^c$ we can describe, for each row $r_i$, i = 1, ..., 5, of the table $T_0$, the set $\mathrm{Rul}^c_G(T_0, r_i)$ of all irredundant 2-decision rules for $T_0$ and $r_i$ with maximum coverage. We also give the value $\mathrm{Opt}^c_G(T_0, r_i)$, which is equal to the maximum coverage of an irredundant 2-decision rule for $T_0$ and $r_i$. This value was obtained during the procedure of optimization of the graph G relative to the coverage. We have

$\mathrm{Rul}^c_G(T_0, r_1) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_G(T_0, r_1) = 2$;
$\mathrm{Rul}^c_G(T_0, r_2) = \{f_2 = 0 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_G(T_0, r_2) = 2$;
$\mathrm{Rul}^c_G(T_0, r_3) = \{f_1 = 1 \rightarrow 1,\ f_2 = 1 \rightarrow 2,\ f_3 = 1 \wedge f_1 = 1 \rightarrow 1,\ f_3 = 1 \wedge f_2 = 1 \rightarrow 2\}$, $\mathrm{Opt}^c_G(T_0, r_3) = 1$;
$\mathrm{Rul}^c_G(T_0, r_4) = \{f_1 = 0 \rightarrow 3\}$, $\mathrm{Opt}^c_G(T_0, r_4) = 2$;
$\mathrm{Rul}^c_G(T_0, r_5) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_G(T_0, r_5) = 2$.

7. Sequential optimization

Theorems 2 and 3 show that we can perform sequential optimization relative to length and coverage. We can find all irredundant b-decision rules with minimum length and after that find among these rules all rules with maximum coverage. We can also change the order of optimization: find all irredundant b-decision rules with maximum coverage, and after that find among such rules all rules with minimum length.

It is possible to show (see the analysis of similar algorithms in [21], page 64) that the time complexities of the algorithms which construct the graph $D_b(T)$ and perform sequential optimization of b-decision rules relative to length and coverage are bounded from above by polynomials in the number of separable subtables of T and the number of attributes in T. In [19] it was shown that the number of separable subtables for decision tables with attributes from a restricted infinite information system is bounded from above by a polynomial in the number of attributes in the table. Examples of restricted infinite information systems were considered, in particular, in [21].

We say that an irredundant b-decision rule for T and r is totally optimal if it has minimum length and maximum coverage among all irredundant b-decision rules for T and r. We can describe all totally optimal rules using the procedures of optimization relative to length and coverage.

Set $C = D_b(T)$. We apply the procedure of optimization relative to the coverage to the graph C. As a result we obtain the graph $C^c$ and, for each row r of T, the value $\mathrm{Opt}^c_C(T, r)$, which is equal to the maximum coverage of an irredundant b-decision rule for T and r. Now we apply the procedure of optimization relative to the length to the graph C. As a result we obtain the graph $C^l$. After that, we apply the procedure of optimization relative to the coverage to the graph $C^l$. As a result we obtain the graph $C^{lc}$ and, for each row r of T, the value $\mathrm{Opt}^c_{C^l}(T, r)$, which is equal to the maximum coverage of an irredundant b-decision rule for T and r among all irredundant b-decision rules for T and r with minimum length.

Fig. 4. Graph Glc.


One can show that a totally optimal irredundant b-decision rule for T and r exists if and only if $\mathrm{Opt}^c_C(T, r) = \mathrm{Opt}^c_{C^l}(T, r)$. If this equality holds then the set $\mathrm{Rul}_{C^{lc}}(T, r)$ is equal to the set of all totally optimal irredundant b-decision rules for T and r.

Fig. 4 presents the directed acyclic graph $G^{lc}$ obtained from the graph $G^l$ (see Fig. 2) by the procedure of optimization relative to the coverage. Using the graph $G^{lc}$ we can describe, for each row $r_i$, i = 1, ..., 5, of the table $T_0$, the set $\mathrm{Rul}_{G^{lc}}(T_0, r_i)$ of all irredundant 2-decision rules for $T_0$ and $r_i$ which have maximum coverage among all irredundant 2-decision rules for $T_0$ and $r_i$ with minimum length. We also give the value $\mathrm{Opt}^c_{G^l}(T_0, r_i)$, which is equal to the maximum coverage of a 2-decision rule for $T_0$ and $r_i$ among all irredundant 2-decision rules for $T_0$ and $r_i$ with minimum length. This value was obtained during the procedure of optimization of the graph $G^l$ relative to the coverage. We have

$\mathrm{Rul}_{G^{lc}}(T_0, r_1) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_{G^l}(T_0, r_1) = 2$;
$\mathrm{Rul}_{G^{lc}}(T_0, r_2) = \{f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_{G^l}(T_0, r_2) = 2$;
$\mathrm{Rul}_{G^{lc}}(T_0, r_3) = \{f_1 = 1 \rightarrow 1,\ f_2 = 1 \rightarrow 2\}$, $\mathrm{Opt}^c_{G^l}(T_0, r_3) = 1$;
$\mathrm{Rul}_{G^{lc}}(T_0, r_4) = \{f_1 = 0 \rightarrow 3\}$, $\mathrm{Opt}^c_{G^l}(T_0, r_4) = 2$;
$\mathrm{Rul}_{G^{lc}}(T_0, r_5) = \{f_1 = 0 \rightarrow 3,\ f_2 = 0 \rightarrow 1\}$, $\mathrm{Opt}^c_{G^l}(T_0, r_5) = 2$.

It is easy to see that $\mathrm{Opt}^c_G(T_0, r_i) = \mathrm{Opt}^c_{G^l}(T_0, r_i)$ for i = 1, ..., 5. Therefore, for i = 1, ..., 5, $\mathrm{Rul}_{G^{lc}}(T_0, r_i)$ is the set of all totally optimal irredundant 2-decision rules for $T_0$ and $r_i$.
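The existence test for totally optimal rules can be sketched by computing the maximum coverage twice: once over all attributes of each node and once over only the attributes kept by the optimization relative to length. The Python code below is an illustration under assumptions that are not part of the paper: the starting graph is $D_b(T)$, rows are pairwise different, a node is the set of row indices of a subtable, and attributes are indexed from 0. The names `stats`, `split` and `sequential` are hypothetical.

```python
from collections import Counter
from functools import lru_cache

# Rows of the example table T0: (values of f1, f2, f3), decision.
T0 = [((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 2), ((0, 1, 1), 3), ((0, 0, 0), 3)]


def stats(table, node):
    """Return (R(H), most common decision d for H, number of rows of H with d)."""
    counts = Counter(table[i][1] for i in node)
    size = len(node)
    r_value = (size * size - sum(k * k for k in counts.values())) // 2
    best = max(counts.values())
    return r_value, min(d for d, k in counts.items() if k == best), best


def split(table, node, r):
    """Yield (f, value of f in row r, child) for each f that is not constant on H."""
    for f in range(len(table[0][0])):
        if len({table[i][0][f] for i in node}) > 1:
            v = table[r][0][f]
            yield f, v, frozenset(i for i in node if table[i][0][f] == v)


def sequential(table, b):
    """For each row index r return the pair
    (max coverage over all rules, max coverage over minimum-length rules)."""

    @lru_cache(maxsize=None)
    def opt_len(node, r):
        if stats(table, node)[0] <= b:
            return 0
        return 1 + min(opt_len(child, r) for _, _, child in split(table, node, r))

    def length_optimal_children(node, r):
        """Children reached through attributes kept by optimization relative to length."""
        return [child for _, _, child in split(table, node, r)
                if opt_len(child, r) + 1 == opt_len(node, r)]

    @lru_cache(maxsize=None)
    def opt_cov(node, r, after_length):
        r_value, _, covered = stats(table, node)
        if r_value <= b:
            return covered
        if after_length:
            children = length_optimal_children(node, r)
        else:
            children = [child for _, _, child in split(table, node, r)]
        return max(opt_cov(child, r, after_length) for child in children)

    root = frozenset(range(len(table)))
    return {r: (opt_cov(root, r, False), opt_cov(root, r, True))
            for r in range(len(table))}


pairs = sequential(T0, 2)
# A totally optimal rule exists for row r exactly when both values coincide.
has_totally_optimal = {r: a == c for r, (a, c) in pairs.items()}
```

For T0 and b = 2 both values coincide for every row, which agrees with the conclusion drawn from the graphs $G^c$ and $G^{lc}$.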

8. Experimental results

We studied a number of decision tables from the UCI Machine Learning Repository [4]. Some decision tables contain conditional attributes that take a unique value for each row. Such attributes were removed. In some tables there were equal rows with, possibly, different decisions. In this case each group of identical rows was replaced with a single row from the group, labeled with the most common decision for this group. In some tables there were missing values. Each such value was replaced with the most common value of the corresponding attribute.

Let T be one of these decision tables. For this table we consider values of b from the set B(T) = {R(T) × 0.01, R(T) × 0.05, R(T) × 0.1, R(T) × 0.2, R(T) × 0.3} (these values can be found in Tables 2 and 3, where the column "Rows" contains the number of rows in T).

Let b ∈ B(T). Tables 4 and 5 present the structure of the graph $D_b(T)$. The column "n_nodes" contains the number of nonterminal nodes, the column "t_nodes" contains the number of terminal nodes, and the column "edges" contains the number of edges in the graph $D_b(T)$. The obtained results show that the number of nodes and the number of edges of $D_b(T)$ decrease as b grows. This means that the parameter b can be used for managing the complexity of the algorithm. They also show that the structure of the graph $D_b(T)$ is usually far from a tree: the number of edges is essentially larger than the number of nodes.

We studied the minimum length of irredundant b-decision rules. Results can be found in Tables 6 and 7. The column "Rows" contains the number of rows in T and the column "Attr" contains the number of conditional attributes in T. For each row r of T, we find the minimum length of an irredundant b-decision rule for T and r. After that, we find over the rows of T the minimum length of a decision rule with minimum length (column "min"), the maximum length of such a rule (column "max"), and the average length of rules with minimum length, one for each row (column "avg").
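The preprocessing described at the start of this section can be sketched as follows. The code is a hedged illustration and not the procedure actually used in the experiments: the paper does not state the order of the three steps, the marker of a missing value, or how ties between equally frequent attribute values are broken. The sketch assumes the marker "?", fills missing values before merging identical rows, breaks ties by taking the smallest value, and assumes every attribute has at least one known value. The name `preprocess` is hypothetical.

```python
from collections import Counter


def preprocess(rows, decisions, missing="?"):
    """Return a list of (row, decision) pairs with pairwise different rows."""
    n = len(rows[0])
    # Step 1: drop attributes that take a unique value for each row.
    keep = [j for j in range(n)
            if len(rows) < 2 or len({row[j] for row in rows}) < len(rows)]
    rows = [[row[j] for j in keep] for row in rows]

    # Step 2: replace each missing value with the most common known value
    # of its attribute (assumption: ties go to the smallest value).
    for j in range(len(keep)):
        counts = Counter(row[j] for row in rows if row[j] != missing)
        best = max(counts.values())
        fill = min(v for v, k in counts.items() if k == best)
        for row in rows:
            if row[j] == missing:
                row[j] = fill

    # Step 3: merge identical rows and keep the most common decision of the
    # group (the minimum decision among the most frequent ones).
    groups = {}
    for row, d in zip(rows, decisions):
        groups.setdefault(tuple(row), []).append(d)
    table = []
    for row, ds in groups.items():
        counts = Counter(ds)
        best = max(counts.values())
        table.append((row, min(d for d, k in counts.items() if k == best)))
    return table


raw_rows = [["a", "x", "1"], ["a", "?", "2"], ["b", "x", "3"], ["a", "x", "4"]]
raw_decisions = [0, 1, 1, 1]
clean = preprocess(raw_rows, raw_decisions)
```

In this toy input the third attribute is unique for each row and is dropped, the missing value is filled with "x", and the three identical rows ("a", "x") collapse into one row with decision 1.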
We studied the maximum coverage of irredundant β-decision rules. Results can be found in Tables 8 and 9. Column "Rows" contains the number of rows in T. For each row r of T, we find the maximum coverage of an irredundant β-decision rule for T and r. After that, we find for rows of T the minimum coverage of a decision rule with maximum coverage (column "min"), the maximum coverage of such a rule (column "max"), and the average coverage of rules with maximum coverage, one for each row (column "avg").

Table 2. Values R(T) × 0.01, R(T) × 0.05, and R(T) × 0.1.

| Name of decision table | Rows | R(T) | R(T) × 0.01 | R(T) × 0.05 | R(T) × 0.1 |
|---|---|---|---|---|---|
| Adult-stretch | 16 | 48 | 0.480 | 2.400 | 4.800 |
| Balance-scale | 625 | 111,168 | 1111.680 | 5558.400 | 11116.800 |
| Breast-cancer | 266 | 14,440 | 144.400 | 722.000 | 1444.000 |
| Cars | 1728 | 682,721 | 6827.210 | 34136.050 | 68272.100 |
| Flags | 193 | 15,105 | 151.050 | 755.250 | 1510.500 |
| Hayes-roth-data | 69 | 1548 | 15.480 | 77.400 | 154.800 |
| Lenses | 24 | 155 | 0.240 | 1.250 | 2.400 |
| Lymphography | 148 | 5801 | 58.010 | 290.050 | 580.100 |
| Mushroom | 8124 | 16,478,528 | 164785.280 | 823926.400 | 1647852.800 |
| Nursery | 12960 | 57,319,460 | 573194.600 | 2865973.000 | 5731946.000 |
| Shuttle-landing | 15 | 54 | 0.540 | 2.700 | 5.400 |
| Soybean-small | 47 | 810 | 8.100 | 40.500 | 81.000 |
| Teeth | 23 | 253 | 2.530 | 12.650 | 25.300 |
| Tic-tac-toe | 958 | 207,832 | 2078.320 | 10391.600 | 20783.200 |
| Zoo | 59 | 1405 | 14.050 | 70.250 | 140.500 |

Table 3. Values R(T) × 0.2 and R(T) × 0.3.

| Name of decision table | Rows | R(T) | R(T) × 0.2 | R(T) × 0.3 |
|---|---|---|---|---|
| Adult-stretch | 16 | 48 | 9.600 | 14.400 |
| Balance-scale | 625 | 111,168 | 22233.600 | 33350.400 |
| Breast-cancer | 266 | 14,440 | 2888.000 | 4332.000 |
| Cars | 1728 | 682,721 | 136544.200 | 204816.300 |
| Flags | 193 | 15,105 | 3021.000 | 4531.500 |
| Hayes-roth-data | 69 | 1548 | 309.600 | 464.400 |
| Lenses | 24 | 155 | 4.800 | 7.200 |
| Lymphography | 148 | 5801 | 1160.200 | 1740.300 |
| Mushroom | 8124 | 16,478,528 | 3295705.600 | 4943558.400 |
| Nursery | 12,960 | 57,319,460 | 11463892.000 | 17195838.000 |
| Shuttle-landing | 15 | 54 | 10.800 | 16.200 |
| Soybean-small | 47 | 810 | 162.000 | 243.000 |
| Teeth | 23 | 253 | 50.600 | 75.900 |
| Tic-tac-toe | 958 | 207,832 | 41566.400 | 62349.600 |
| Zoo | 59 | 1405 | 281.000 | 421.500 |

Table 4. Structure of graph Δ_β(T) for β ∈ {R(T) × 0.01, R(T) × 0.05}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | n_Nodes (0.01) | t_Nodes (0.01) | Edges (0.01) | n_Nodes (0.05) | t_Nodes (0.05) | Edges (0.05) |
|---|---|---|---|---|---|---|
| Adult-stretch | 72 | 60 | 108 | 60 | 56 | 92 |
| Balance-scale | 171 | 165 | 320 | 21 | 35 | 20 |
| Breast-cancer | 3272 | 2977 | 8116 | 1223 | 1172 | 2055 |
| Cars | 473 | 445 | 688 | 199 | 196 | 317 |
| Flags | 558,293 | 382,468 | 11,709,941 | 278,270 | 246,181 | 2,551,370 |
| Hayes-roth-data | 170 | 159 | 249 | 88 | 93 | 117 |
| House-votes-84 | 56,459 | 49,325 | 164,202 | 9514 | 8897 | 16,394 |
| Lenses | 90 | 71 | 168 | 65 | 65 | 92 |
| Lymphography | 26,861 | 22,636 | 163,465 | 9196 | 8396 | 36,255 |
| Mushroom | 24,648 | 22,074 | 161,983 | 8011 | 7549 | 35,355 |
| Nursery | 1338 | 1260 | 2043 | 339 | 332 | 525 |
| Shuttle-landing | 85 | 37 | 513 | 82 | 44 | 480 |
| Soybean-small | 3571 | 2191 | 53,821 | 2626 | 2321 | 14,170 |
| Teeth | 134 | 66 | 935 | 122 | 105 | 512 |
| Tic-tac-toe | 2106 | 1933 | 4035 | 342 | 334 | 555 |
| Zoo | 4518 | 1863 | 60,525 | 3765 | 2778 | 26,036 |

Table 5. Structure of graph Δ_β(T) for β ∈ {R(T) × 0.1, R(T) × 0.2, R(T) × 0.3}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | n_Nodes (0.1) | t_Nodes (0.1) | Edges (0.1) | n_Nodes (0.2) | t_Nodes (0.2) | Edges (0.2) | n_Nodes (0.3) | t_Nodes (0.3) | Edges (0.3) |
|---|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 32 | 40 | 44 | 32 | 40 | 44 | 20 | 32 | 20 |
| Balance-scale | 21 | 35 | 20 | 21 | 35 | 20 | 21 | 35 | 20 |
| Breast-cancer | 654 | 642 | 912 | 333 | 336 | 422 | 175 | 184 | 215 |
| Cars | 118 | 126 | 129 | 22 | 36 | 21 | 22 | 36 | 21 |
| Flags | 141,090 | 130,877 | 881,758 | 50,082 | 47,737 | 219,737 | 21,175 | 20,432 | 74,056 |
| Hayes-roth-data | 73 | 81 | 84 | 16 | 30 | 15 | 16 | 30 | 15 |
| House-votes-84 | 3267 | 3111 | 4680 | 881 | 857 | 1126 | 390 | 390 | 446 |
| Lenses | 39 | 46 | 56 | 37 | 46 | 44 | 17 | 30 | 16 |
| Lymphography | 4673 | 4379 | 14657 | 2106 | 2009 | 5587 | 1207 | 1165 | 2914 |
| Mushroom | 4148 | 3962 | 16178 | 1692 | 1645 | 5302 | 874 | 860 | 2614 |
| Nursery | 249 | 252 | 293 | 78 | 90 | 77 | 28 | 42 | 27 |
| Shuttle-landing | 81 | 50 | 442 | 74 | 60 | 320 | 66 | 61 | 234 |
| Soybean-small | 1708 | 1591 | 6165 | 861 | 830 | 2294 | 482 | 477 | 1041 |
| Teeth | 99 | 99 | 288 | 67 | 75 | 154 | 58 | 67 | 138 |
| Tic-tac-toe | 316 | 312 | 459 | 28 | 42 | 27 | 28 | 42 | 27 |
| Zoo | 2719 | 2277 | 12717 | 1421 | 1299 | 4107 | 789 | 750 | 1694 |
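The thresholds listed in Tables 2 and 3 follow directly from the definition of R(T) as the number of unordered pairs of rows with different decisions. The sketch below computes R(T) and the set B(T) from a list of decisions. The function names and the toy class split are illustrative assumptions; the paper does not state the class distribution of any table.

```python
from collections import Counter


def uncertainty(decisions):
    """R(T): number of unordered pairs of rows with different decisions."""
    n = len(decisions)
    all_pairs = n * (n - 1) // 2
    same_decision_pairs = sum(c * (c - 1) // 2
                              for c in Counter(decisions).values())
    return all_pairs - same_decision_pairs


def thresholds(decisions, factors=(0.01, 0.05, 0.1, 0.2, 0.3)):
    """B(T) = {R(T) x k}, rounded to three decimals as in Tables 2 and 3."""
    r = uncertainty(decisions)
    return [round(r * k, 3) for k in factors]


# Hypothetical 16-row table with 12 rows in one class and 4 in the other.
toy_decisions = ["yes"] * 12 + ["no"] * 4
r_toy = uncertainty(toy_decisions)
b_toy = thresholds(toy_decisions)
```

For this toy split, `r_toy` is 12 × 4 = 48 and `b_toy` is `[0.48, 2.4, 4.8, 9.6, 14.4]`, which coincides with the Adult-stretch row (16 rows, R(T) = 48) of Tables 2 and 3.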


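Table 6. Minimum length of β-decision rules for β ∈ {R(T) × 0.01, R(T) × 0.05}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | Attr. | Min (0.01) | Avg. (0.01) | Max (0.01) | Min (0.05) | Avg. (0.05) | Max (0.05) |
|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 4 | 1 | 1.250 | 2 | 1 | 1.250 | 2 |
| Balance-scale | 625 | 4 | 2 | 2.000 | 2 | 1 | 1.000 | 1 |
| Breast-cancer | 266 | 9 | 1 | 1.398 | 2 | 1 | 1.023 | 2 |
| Cars | 1728 | 6 | 1 | 1.444 | 2 | 1 | 1.250 | 2 |
| Flags | 193 | 26 | 1 | 1.098 | 2 | 1 | 1.000 | 1 |
| Hayes-roth-data | 69 | 4 | 1 | 1.565 | 2 | 1 | 1.261 | 2 |
| House-votes-84 | 279 | 16 | 2 | 2.007 | 3 | 1 | 1.262 | 2 |
| Lenses | 24 | 4 | 1 | 2.000 | 3 | 1 | 1.500 | 2 |
| Lymphography | 148 | 18 | 1 | 1.480 | 2 | 1 | 1.027 | 2 |
| Mushroom | 8124 | 22 | 1 | 1.047 | 2 | 1 | 1.000 | 1 |
| Nursery | 12,960 | 8 | 1 | 1.667 | 2 | 1 | 1.000 | 1 |
| Shuttle-landing | 15 | 6 | 1 | 1.400 | 4 | 1 | 1.200 | 4 |
| Soybean-small | 47 | 35 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Teeth | 23 | 8 | 1 | 1.739 | 3 | 1 | 1.174 | 2 |
| Tic-tac-toe | 958 | 9 | 2 | 2.001 | 3 | 1 | 1.292 | 2 |
| Zoo | 59 | 16 | 1 | 1.305 | 2 | 1 | 1.085 | 2 |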

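Table 7. Minimum length of β-decision rules for β ∈ {R(T) × 0.1, R(T) × 0.2, R(T) × 0.3}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | Attr. | Min (0.1) | Avg. (0.1) | Max (0.1) | Min (0.2) | Avg. (0.2) | Max (0.2) | Min (0.3) | Avg. (0.3) | Max (0.3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 4 | 1 | 1.250 | 2 | 1 | 1.250 | 2 | 1 | 1.000 | 1 |
| Balance-scale | 625 | 4 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Breast-cancer | 266 | 9 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Cars | 1728 | 6 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Flags | 193 | 26 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Hayes-roth-data | 69 | 4 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| House-votes-84 | 279 | 16 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Lenses | 24 | 4 | 1 | 1.333 | 2 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Lymphography | 148 | 18 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Mushroom | 8124 | 22 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Nursery | 12,960 | 8 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Shuttle-landing | 15 | 6 | 1 | 1.133 | 3 | 1 | 1.067 | 2 | 1 | 1.000 | 1 |
| Soybean-small | 47 | 35 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Teeth | 23 | 8 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Tic-tac-toe | 958 | 9 | 1 | 1.081 | 2 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Zoo | 59 | 16 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |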
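To make the quantities in Tables 6 and 7 concrete, the sketch below finds, for each row of a tiny table, the minimum number of "attribute = value" conditions needed to localize the row in a subtable with uncertainty at most β, and then takes the minimum, average, and maximum over rows. It uses exhaustive search over attribute subsets, not the dynamic programming algorithm over Δ_β(T) studied in the paper, so it is only feasible for toy tables. All names and the toy data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def uncertainty(decisions):
    """R(T): number of unordered pairs of rows with different decisions."""
    n = len(decisions)
    same = sum(c * (c - 1) // 2 for c in Counter(decisions).values())
    return n * (n - 1) // 2 - same


def min_rule_length(rows, decisions, r, beta):
    """Smallest number of conditions 'attribute = value of row r' whose
    subtable has uncertainty at most beta (brute force, smallest first)."""
    n_attrs = len(rows[r])
    for length in range(n_attrs + 1):
        for attrs in combinations(range(n_attrs), length):
            sub = [d for row, d in zip(rows, decisions)
                   if all(row[a] == rows[r][a] for a in attrs)]
            if uncertainty(sub) <= beta:
                return length
    return None  # only reachable if identical rows carry different decisions


# Toy table: two binary attributes, decision is their conjunction.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
decisions = [0, 0, 0, 1]

lengths = [min_rule_length(rows, decisions, r, beta=0) for r in range(len(rows))]
summary = (min(lengths), sum(lengths) / len(lengths), max(lengths))
```

Here R(T) = 3. With β = 0 only the row (1, 1) needs two conditions, so `lengths` is `[1, 1, 1, 2]` and `summary` is `(1, 1.25, 2)`, which plays the role of the "Min", "Avg." and "Max" columns.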

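Table 8. Maximum coverage of β-decision rules for β ∈ {R(T) × 0.01, R(T) × 0.05}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | Min (0.01) | Avg. (0.01) | Max (0.01) | Min (0.05) | Avg. (0.05) | Max (0.05) |
|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 4 | 7.000 | 8 | 4 | 7.000 | 8 |
| Balance-scale | 625 | 14 | 22.160 | 24 | 63 | 92.312 | 98 |
| Breast-cancer | 266 | 15 | 28.064 | 36 | 28 | 56.974 | 65 |
| Cars | 1728 | 92 | 372.402 | 576 | 92 | 419.409 | 576 |
| Flags | 193 | 12 | 20.135 | 26 | 19 | 28.679 | 33 |
| Hayes-roth-data | 69 | 3 | 7.435 | 12 | 5 | 8.043 | 12 |
| House-votes-84 | 279 | 54 | 101.563 | 132 | 72 | 141.029 | 174 |
| Lenses | 24 | 2 | 7.250 | 12 | 3 | 8.250 | 12 |
| Lymphography | 148 | 17 | 27.723 | 32 | 23 | 42.392 | 53 |
| Mushroom | 8124 | 1216 | 2778.117 | 3152 | 2336 | 3284.965 | 3408 |
| Nursery | 12,960 | 822 | 2082.133 | 4320 | 1158 | 2334.237 | 4320 |
| Shuttle-landing | 15 | 1 | 2.133 | 3 | 1 | 2.333 | 3 |
| Soybean-small | 47 | 10 | 12.532 | 17 | 10 | 13.085 | 17 |
| Teeth | 23 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Tic-tac-toe | 958 | 54 | 81.340 | 94 | 113 | 140.781 | 144 |
| Zoo | 59 | 5 | 12.458 | 19 | 8 | 14.508 | 19 |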

Table 9. Maximum coverage of β-decision rules for β ∈ {R(T) × 0.1, R(T) × 0.2, R(T) × 0.3}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | Min (0.1) | Avg. (0.1) | Max (0.1) | Min (0.2) | Avg. (0.2) | Max (0.2) | Min (0.3) | Avg. (0.3) | Max (0.3) |
|---|---|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 4 | 7.000 | 8 | 4 | 7.000 | 8 | 6 | 7.500 | 8 |
| Balance-scale | 625 | 63 | 92.312 | 98 | 63 | 92.312 | 98 | 63 | 92.312 | 98 |
| Breast-cancer | 266 | 46 | 73.906 | 79 | 46 | 88.793 | 97 | 93 | 120.880 | 136 |
| Cars | 1728 | 292 | 485.509 | 576 | 368 | 499.259 | 576 | 368 | 499.259 | 576 |
| Flags | 193 | 23 | 33.974 | 39 | 27 | 39.073 | 43 | 32 | 40.181 | 43 |
| Hayes-roth-data | 69 | 5 | 10.913 | 12 | 10 | 11.348 | 12 | 10 | 11.348 | 12 |
| House-votes-84 | 279 | 88 | 147.799 | 174 | 88 | 150.771 | 174 | 88 | 159.814 | 174 |
| Lenses | 24 | 4 | 8.583 | 12 | 4 | 8.625 | 12 | 7 | 9.875 | 12 |
| Lymphography | 148 | 32 | 47.270 | 53 | 43 | 57.750 | 63 | 46 | 58.473 | 63 |
| Mushroom | 8124 | 2752 | 3305.968 | 3408 | 3152 | 3328.167 | 3408 | 3152 | 3348.292 | 3408 |
| Nursery | 12,960 | 2412 | 3066.104 | 4320 | 2412 | 3066.104 | 4320 | 2412 | 3066.104 | 4320 |
| Shuttle-landing | 15 | 2 | 3.133 | 4 | 3 | 4.333 | 5 | 3 | 5.067 | 6 |
| Soybean-small | 47 | 10 | 14.170 | 17 | 11 | 15.894 | 17 | 17 | 17.000 | 17 |
| Teeth | 23 | 1 | 1.000 | 1 | 1 | 1.000 | 1 | 1 | 1.000 | 1 |
| Tic-tac-toe | 958 | 125 | 163.470 | 172 | 225 | 327.791 | 366 | 225 | 327.791 | 366 |
| Zoo | 59 | 8 | 16.000 | 19 | 12 | 17.695 | 19 | 15 | 18.373 | 19 |

Table 10. Totally optimal β-decision rules for β = R(T) × 0.01.

| Name of decision table | Rows | t-Opt. | Min | Avg. | Max |
|---|---|---|---|---|---|
| Adult-stretch | 16 | 16 | 1 | 1.500 | 2 |
| Balance-scale | 625 | 625 | 2 | 2.739 | 12 |
| Breast-cancer | 266 | 25 | 0 | 0.154 | 2 |
| Cars | 1728 | 1728 | 1 | 1.705 | 8 |
| Flags | 193 | 37 | 0 | 0.238 | 2 |
| Hayes-roth-data | 69 | 69 | 1 | 2.232 | 6 |
| House-votes-84 | 279 | 23 | 0 | 0.179 | 4 |
| Lenses | 24 | 24 | 1 | 3.500 | 6 |
| Lymphography | 148 | 48 | 0 | 0.581 | 2 |
| Mushroom | 8124 | 600 | 0 | 0.121 | 2 |
| Nursery | 12,960 | 12,960 | 1 | 1.667 | 2 |
| Shuttle-landing | 15 | 13 | 0 | 2.600 | 24 |
| Soybean-small | 47 | 37 | 0 | 1.851 | 5 |
| Teeth | 23 | 23 | 1 | 11.391 | 60 |
| Tic-tac-toe | 958 | 639 | 0 | 2.217 | 12 |
| Zoo | 59 | 38 | 0 | 1.000 | 18 |

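Table 11. Totally optimal β-decision rules for β ∈ {R(T) × 0.05, R(T) × 0.1}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | t-Opt. (0.05) | Min (0.05) | Avg. (0.05) | Max (0.05) | t-Opt. (0.1) | Min (0.1) | Avg. (0.1) | Max (0.1) |
|---|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 16 | 1 | 1.500 | 2 | 16 | 1 | 1.500 | 2 |
| Balance-scale | 625 | 625 | 1 | 1.440 | 4 | 625 | 1 | 1.440 | 4 |
| Breast-cancer | 266 | 27 | 0 | 0.117 | 2 | 14 | 0 | 0.053 | 1 |
| Cars | 1728 | 1728 | 1 | 1.429 | 6 | 1728 | 1 | 1.130 | 2 |
| Flags | 193 | 12 | 0 | 0.078 | 2 | 29 | 0 | 0.150 | 1 |
| Hayes-roth-data | 69 | 69 | 1 | 1.609 | 4 | 69 | 1 | 1.304 | 2 |
| House-votes-84 | 279 | 238 | 0 | 1.061 | 2 | 226 | 0 | 0.810 | 1 |
| Lenses | 24 | 23 | 0 | 1.500 | 4 | 24 | 1 | 1.333 | 2 |
| Lymphography | 148 | 6 | 0 | 0.088 | 6 | 26 | 0 | 0.176 | 1 |
| Mushroom | 8124 | 3528 | 0 | 0.434 | 1 | 3528 | 0 | 0.434 | 1 |
| Nursery | 12,960 | 10,800 | 0 | 0.833 | 1 | 12960 | 1 | 1.000 | 1 |
| Shuttle-landing | 15 | 11 | 0 | 2.400 | 24 | 5 | 0 | 0.467 | 2 |
| Soybean-small | 47 | 32 | 0 | 1.979 | 5 | 24 | 0 | 1.255 | 5 |
| Teeth | 23 | 23 | 1 | 6.304 | 40 | 23 | 1 | 3.348 | 6 |
| Tic-tac-toe | 958 | 662 | 0 | 1.691 | 12 | 894 | 0 | 1.591 | 8 |
| Zoo | 59 | 39 | 0 | 0.678 | 2 | 31 | 0 | 0.610 | 2 |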


We also studied totally optimal irredundant β-decision rules. Results can be found in Tables 10–12. Column "Rows" contains the number of rows in T. For each row r of T, we find the number of totally optimal irredundant β-decision rules for T and r. After that, we find for rows of T the minimum number of totally optimal rules (column "min"), the maximum number of such rules (column "max"), and the average number of totally optimal rules (column "avg"). Column "t-opt" contains the number of rows r for each of which there exists a totally optimal irredundant β-decision rule for T and r. It is interesting to see that the number of rows that have totally optimal irredundant β-decision rules can both decrease (see row "Breast-cancer" in Table 12) and increase (see row "Lymphography" in Table 12) as β increases.

Experiments were done on a workstation with 2 Intel Xeon X5550 processors and 16 GB of RAM. The software used the pthreads and MPI libraries for parallelism.

We also present results of experiments on the accuracy of classifiers based on approximate decision rules optimized relative to the length or coverage, and optimized sequentially relative to the length and coverage. Table 13 presents the average test error for the two-fold cross-validation method (we repeat the experiments for each decision table 50 times). Each data set was divided randomly into three parts: train (30%), validation (20%), and test (50%). A classifier is constructed on the train part and then pruned by minimum error on the validation set. On the train part of the data set we constructed exact decision rules (0-rules). We then prune these rules and obtain β-rules for increasing β, and choose the value of β which gives the minimum error on the validation set. We use this model as a classifier on the test part of the decision table. The test error is the number of objects (rows) from the test part of the decision table which are improperly classified, divided by the number of all rows in the test part. The last row of Table 13 presents the average test error over all decision tables. Based on the performed experiments, the smallest average test errors are obtained for sequential optimization of decision rules relative to the coverage and length (0.191) and for sequential optimization relative to the length and coverage (0.193).
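The evaluation protocol described above (random 30/20/50 split, choice of β by validation error, test error as the fraction of misclassified rows) can be sketched as follows. The rule construction and pruning steps are not shown. The helper names, the fixed random seed, and the validation errors in the example are hypothetical and are not taken from the authors' software.

```python
import random


def split_rows(n_rows, seed=0):
    """Randomly split row indices into train (30%), validation (20%), test (50%)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = round(0.3 * n_rows)
    n_valid = round(0.2 * n_rows)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]


def error_rate(predicted, actual):
    """Number of improperly classified rows divided by the number of rows."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


def pick_beta(validation_error_by_beta):
    """Value of beta with the minimum error on the validation part."""
    return min(validation_error_by_beta, key=validation_error_by_beta.get)


train_idx, valid_idx, test_idx = split_rows(100)

# Hypothetical validation errors for three pruning levels (beta = 0 is exact rules).
chosen_beta = pick_beta({0: 0.30, 2.4: 0.25, 4.8: 0.28})

# Hypothetical predictions on a four-row test part.
example_error = error_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"])
```

In this sketch ties between values of β are resolved in favour of the first one listed; the paper does not say how ties are handled.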

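Table 12. Totally optimal β-decision rules for β ∈ {R(T) × 0.2, R(T) × 0.3}. The number in parentheses is the factor k in β = R(T) × k.

| Name of decision table | Rows | t-Opt. (0.2) | Min (0.2) | Avg. (0.2) | Max (0.2) | t-Opt. (0.3) | Min (0.3) | Avg. (0.3) | Max (0.3) |
|---|---|---|---|---|---|---|---|---|---|
| Adult-stretch | 16 | 16 | 1 | 1.500 | 2 | 16 | 1 | 1.500 | 2 |
| Balance-scale | 625 | 625 | 1 | 1.440 | 4 | 625 | 1 | 1.440 | 4 |
| Breast-cancer | 266 | 130 | 0 | 0.489 | 1 | 102 | 0 | 0.624 | 3 |
| Cars | 1728 | 1728 | 1 | 1.111 | 2 | 1728 | 1 | 1.111 | 2 |
| Flags | 193 | 155 | 0 | 0.803 | 1 | 150 | 0 | 0.777 | 1 |
| Hayes-roth-data | 69 | 69 | 1 | 1.391 | 3 | 69 | 1 | 1.391 | 3 |
| House-votes-84 | 279 | 240 | 0 | 0.867 | 2 | 269 | 0 | 0.971 | 2 |
| Lenses | 24 | 21 | 0 | 0.875 | 1 | 24 | 1 | 1.250 | 2 |
| Lymphography | 148 | 100 | 0 | 0.676 | 1 | 123 | 0 | 0.831 | 1 |
| Mushroom | 8124 | 3952 | 0 | 0.486 | 1 | 6852 | 0 | 0.843 | 1 |
| Nursery | 12,960 | 12,960 | 1 | 1.000 | 1 | 12,960 | 1 | 1.000 | 1 |
| Shuttle-landing | 15 | 2 | 0 | 0.200 | 2 | 5 | 0 | 0.400 | 2 |
| Soybean-small | 47 | 28 | 0 | 1.340 | 3 | 27 | 0 | 1.851 | 4 |
| Teeth | 23 | 23 | 3 | 4.913 | 6 | 23 | 4 | 5.391 | 6 |
| Tic-tac-toe | 958 | 958 | 1 | 1.551 | 4 | 958 | 1 | 1.551 | 4 |
| Zoo | 59 | 24 | 0 | 0.407 | 1 | 19 | 0 | 0.322 | 1 |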
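The four statistics reported in Tables 10–12 are simple aggregates of one number per row: how many totally optimal rules exist for that row. A minimal sketch, assuming these per-row counts are already available (the function name and the example counts are hypothetical):

```python
def summarize_totally_optimal(counts):
    """Aggregate per-row counts of totally optimal rules.

    counts -- one non-negative integer per row of the decision table
    """
    return {
        "t_opt": sum(1 for c in counts if c > 0),  # rows with at least one such rule
        "min": min(counts),
        "avg": round(sum(counts) / len(counts), 3),
        "max": max(counts),
    }


# Hypothetical counts for a five-row table.
stats = summarize_totally_optimal([0, 2, 1, 0, 3])
```

One consequence of these definitions is that "Min" is positive exactly when "t-Opt." equals "Rows", and when "Max" is 1 the average equals t-Opt. divided by Rows. Both properties can be observed in the tables, for example in the Mushroom row for β = R(T) × 0.2, where 3952 / 8124 ≈ 0.486.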

Table 13. Average test error of classifiers based on decision rules, by type of optimization.

| Name of decision table | Coverage | Coverage and length | Length | Length and coverage |
|---|---|---|---|---|
| Balance-scale | 0.289 | 0.286 | 0.294 | 0.286 |
| Breast-cancer | 0.296 | 0.299 | 0.307 | 0.305 |
| Cars | 0.207 | 0.196 | 0.217 | 0.196 |
| Flags | 0.441 | 0.449 | 0.440 | 0.434 |
| Hayes-roth-data | 0.345 | 0.338 | 0.368 | 0.332 |
| House-votes-84 | 0.048 | 0.049 | 0.082 | 0.064 |
| Lymphography | 0.277 | 0.254 | 0.347 | 0.253 |
| Mushroom | 0.001 | 0.001 | 0.002 | 0.001 |
| Nursery | 0.053 | 0.052 | 0.050 | 0.052 |
| Soybean-small | 0.171 | 0.097 | 0.170 | 0.079 |
| Spect-test | 0.081 | 0.080 | 0.080 | 0.080 |
| Zoo | 0.180 | 0.195 | 0.244 | 0.233 |
| Average | 0.199 | 0.191 | 0.217 | 0.193 |
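The "Average" row of Table 13 can be reproduced from the twelve per-table errors. The short check below copies the values from Table 13 in table order (Balance-scale first, Zoo last); only the dictionary keys are invented labels.

```python
# Per-table test errors copied from Table 13, one list per optimization type.
errors = {
    "coverage": [0.289, 0.296, 0.207, 0.441, 0.345, 0.048,
                 0.277, 0.001, 0.053, 0.171, 0.081, 0.180],
    "coverage_and_length": [0.286, 0.299, 0.196, 0.449, 0.338, 0.049,
                            0.254, 0.001, 0.052, 0.097, 0.080, 0.195],
    "length": [0.294, 0.307, 0.217, 0.440, 0.368, 0.082,
               0.347, 0.002, 0.050, 0.170, 0.080, 0.244],
    "length_and_coverage": [0.286, 0.305, 0.196, 0.434, 0.332, 0.064,
                            0.253, 0.001, 0.052, 0.079, 0.080, 0.233],
}

# Mean over the twelve decision tables, rounded to three decimals.
averages = {name: round(sum(vals) / len(vals), 3) for name, vals in errors.items()}
best = min(averages, key=averages.get)
```

The computed means are 0.199, 0.191, 0.217 and 0.193, matching the last row of the table, and `best` is `"coverage_and_length"`.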


9. Conclusions

We studied an extension of the dynamic programming approach to the optimization of β-decision rules relative to the length and coverage. This extension allows us to describe the whole set of irredundant β-decision rules and to optimize these rules sequentially relative to the length and coverage or relative to the coverage and length. We discussed the notion of a totally optimal β-decision rule. We also considered results of experiments with decision tables from the UCI Machine Learning Repository [4].

We can also optimize rules relative to the number of misclassifications. In this case, it is possible to make sequential optimization relative to an arbitrary subset and order of the cost functions length l, coverage c, and number of misclassifications m: (l, c), (c, l), (l, m), (m, l), (m, c), (c, m), (l, c, m), (l, m, c), (c, l, m), (c, m, l), (m, l, c), (m, c, l).

Optimization of irredundant β-decision rules relative to the length and coverage and construction of totally optimal rules can be considered as tools which support the design of classifiers. To predict the value of the decision attribute for a new object, we can use in a classifier only totally optimal rules, or rules with maximum coverage, etc. Moreover, it is known that, very often, classifiers based on approximate decision rules have better accuracy than classifiers based on exact decision rules.

Short rules which cover many objects can also be useful in knowledge discovery to represent knowledge extracted from decision tables. In this case, rules with a smaller number of descriptors are more understandable.

Future investigations will be connected with the study of other cost functions, uncertainty measures, and construction of classifiers.
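The twelve optimization orders listed above are exactly the ordered selections of two or three of the cost functions l, c and m. The sketch below enumerates them and illustrates sequential optimization as repeated filtering of a rule set. It works on an explicit list of rules with invented cost values, whereas the approach of the paper operates on the graph Δ_β(T). It also assumes that length and misclassifications are minimized and coverage is maximized.

```python
from itertools import permutations

COSTS = ("l", "c", "m")  # length, coverage, number of misclassifications

# All ordered selections of 2 or 3 cost functions: 6 + 6 = 12 orders.
orders = [p for k in (2, 3) for p in permutations(COSTS, k)]


def sequential_optimize(rules, order):
    """Keep the rules that are optimal for the first cost, then, among
    those, the rules optimal for the second cost, and so on."""
    best = list(rules)
    for cost in order:
        # Coverage is maximized, so its sign is flipped before minimizing.
        sign = -1 if cost == "c" else 1
        target = min(sign * r[cost] for r in best)
        best = [r for r in best if sign * r[cost] == target]
    return best


# Hypothetical rules for one row, described only by their cost values.
rules = [
    {"name": "r1", "l": 1, "c": 5, "m": 2},
    {"name": "r2", "l": 1, "c": 8, "m": 3},
    {"name": "r3", "l": 2, "c": 9, "m": 0},
]

by_length_then_coverage = sequential_optimize(rules, ("l", "c"))
by_coverage_then_length = sequential_optimize(rules, ("c", "l"))
```

In this toy example the order (l, c) selects `r2` and the order (c, l) selects `r3`. No rule has both minimum length and maximum coverage, so this row would have no totally optimal rule.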
Acknowledgments

This research was supported by King Abdullah University of Science and Technology in the framework of a joint project with Nizhni Novgorod State University "Novel Algorithms in Machine Learning and Computer Vision, and their High Performance Implementations", Russian Federal Program "Research and Development in Prioritized Directions of Scientific-Technological Complex of Russia in 2007–2013".

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: J.B. Bocca, M. Jarke, C. Zaniolo (Eds.), VLDB'94, Morgan Kaufmann, 1994, pp. 487–499.
[2] A. Alkhalid, I. Chikalov, M. Moshkov, On algorithm for building of optimal α-decision trees, in: M.S. Szczuka, M. Kryszkiewicz, S. Ramanna, R. Jensen, Q. Hu (Eds.), RSCTC 2010, LNCS, vol. 6086, Springer, Heidelberg, 2010, pp. 438–445.
[3] J. Ang, K. Tan, A. Mamun, An evolutionary memetic algorithm for rule extraction, Expert Systems with Applications 37 (2010) 1302–1315.
[4] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, 2007.
[5] J.G. Bazan, H.S. Nguyen, T.T. Nguyen, A. Skowron, J. Stepaniuk, Synthesis of decision rules for object classification, in: E. Orłowska (Ed.), Incomplete Information: Rough Set Analysis, Physica-Verlag, Heidelberg, 1998, pp. 23–57.
[6] J. Błaszczyński, R. Słowiński, M. Szeląg, Sequential covering rule induction algorithm for variable consistency rough set approaches, Information Sciences 181 (2011) 987–1002.
[7] U. Boryczka, J. Kozak, New algorithms for generation decision trees – ant-miner and its modifications, in: A. Abraham, A.E. Hassanien, A.C.P. de Leon Ferreira de Carvalho, V. Snášel (Eds.), Foundations of Computational Intelligence, Studies in Computational Intelligence, vol. 206(6), Springer, 2009, pp. 229–262.
[8] I. Chikalov, On algorithm for constructing of decision trees with minimal number of nodes, in: W. Ziarko, Y.Y. Yao (Eds.), RSCTC 2000, LNCS, vol. 2005, Springer, Heidelberg, 2001, pp. 139–143.
[9] I. Chikalov, M. Moshkov, B. Zielosko, Online learning algorithm for ensemble of decision rules, in: S.O. Kuznetsov, D. Ślęzak, D.H. Hepting, B. Mirkin (Eds.), RSFDGrC 2011, LNCS, vol. 6743, Springer, Berlin, 2011, pp. 310–313.
[10] C. Cornelis, R. Jensen, G. Hurtado, D. Ślęzak, Attribute selection with fuzzy decision reducts, Information Sciences 180 (2010) 209–224.
[11] K. Dembczyński, W. Kotłowski, R. Słowiński, Ender: a statistical framework for boosting decision rules, Data Mining and Knowledge Discovery 21 (2010) 52–90.
[12] U. Feige, A threshold of ln n for approximating set cover, in: F.T. Leighton (Ed.), Journal of the ACM (JACM), vol. 45, ACM, New York, 1998, pp. 634–652.
[13] Q. Feng, D. Miao, Y. Cheng, Hierarchical decision rules mining, Expert Systems with Applications 37 (2010) 2081–2091.
[14] J. Fürnkranz, P.A. Flach, Roc 'n' rule learning – towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
[15] R. Jensen, Q. Shen, Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches, IEEE Transactions on Knowledge and Data Engineering 16 (2004) 1457–1471.
[16] N. Lavrač, J. Fürnkranz, D. Gamberger, Explicit feature construction and manipulation for covering rule learning algorithms, in: J. Koronacki, Z.W. Raś, S.T. Wierzchoń, J. Kacprzyk (Eds.), Advances in Machine Learning I, vol. 262, Springer, 2010, pp. 121–146.
[17] B. Liu, H.A. Abbass, B. McKay, Classification rule discovery with ant colony optimization, in: IAT 2003, IEEE Computer Society, 2003, pp. 83–88.
[18] S. Michalski, J. Pietrzykowski, iAQ: a program that discovers rules, AAAI-07 AI Video Competition, 2007.
[19] M. Moshkov, I. Chikalov, On algorithm for constructing of decision trees with minimal depth, Fundamenta Informaticae 41 (2000) 295–299.
[20] M. Moshkov, M. Piliszczuk, B. Zielosko, Partial Covers, Reducts and Decision Rules in Rough Sets – Theory and Applications, Studies in Computational Intelligence, vol. 145, Springer, Heidelberg, 2008.
[21] M. Moshkov, B. Zielosko, Combinatorial Machine Learning – A Rough Set Approach, Studies in Computational Intelligence, vol. 360, Springer, Berlin, 2011.
[22] H.S. Nguyen, Approximate boolean reasoning: foundations and applications in data mining, in: J.F. Peters, A. Skowron (Eds.), T. Rough Sets, LNCS, vol. 4100, Springer, 2006, pp. 334–506.
[23] H.S. Nguyen, D. Ślęzak, Approximate reducts and association rules – correspondence and complexity results, in: SDasdsa, LNCS, vol. 1711, Springer, 1999, pp. 137–145.
[24] G.L. Pappa, A.A. Freitas, Creating rule ensembles from automatically-evolved rule induction algorithms, in: J. Koronacki, Z.W. Raś, S.T. Wierzchoń, J. Kacprzyk (Eds.), Advances in Machine Learning I, vol. 262, Springer, 2010, pp. 257–273.
[25] Z. Pawlak, Rough Sets – Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, Dordrecht, 1991.
[26] Z. Pawlak, Rough set elements, in: L. Polkowski, A. Skowron (Eds.), Rough Sets in Knowledge Discovery, Physica-Verlag, Heidelberg, 1998, pp. 10–30.
[27] Z. Pawlak, A. Skowron, Rough sets and boolean reasoning, Information Sciences 177 (2007) 41–73.
[28] Z. Pawlak, A. Skowron, Rudiments of rough sets, Information Sciences 177 (2007) 3–27.
[29] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[30] J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.
[31] M. Sikora, Decision rule-based data models using TRS and NetTRS – methods and algorithms, T. Rough Sets 11 (2010) 130–160.
[32] A. Skowron, Rough sets in KDD, in: Z. Shi, B. Faltings, M. Musen (Eds.), 16th World Computer Congress, IFIP2000, Proc. Conf. Intelligent Information Processing, IIP2000, House of Electronic Industry, Beijing, 2000, pp. 1–17.
[33] A. Skowron, C. Rauszer, The discernibility matrices and functions in information systems, in: R. Słowiński (Ed.), Intelligent Decision Support, Handbook of Applications and Advances of the Rough Set Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 331–362.
[34] D. Ślęzak, Normalized decision functions and measures for inconsistent decision tables analysis, Fundamenta Informaticae 44 (2000) 291–319.
[35] D. Ślęzak, J. Wróblewski, Order based genetic algorithms for the search of approximate entropy reducts, in: G. Wang, Q. Liu, Y. Yao, A. Skowron (Eds.), RSFDGrC 2003, LNCS, vol. 2639, Springer, 2003, pp. 308–311.
[36] J. Wróblewski, Finding minimal reducts using genetic algorithm, in: Proc. of the Second Annual Joint Conference on Information Sciences, Wrightsville Beach, NC, 1995, pp. 186–189.
[37] J. Wróblewski, Ensembles of classifiers based on approximate reducts, Fundamenta Informaticae 47 (2001) 351–360.
[38] B. Zielosko, M. Moshkov, I. Chikalov, Optimization of decision rules based on methods of dynamic programming, Vestnik of Lobachevsky State University of Nizhny Novgorod 6 (2010) 195–200 (in Russian).