Discrete Applied Mathematics 215 (2016) 1–13
Totally optimal decision trees for Boolean functions Igor Chikalov, Shahid Hussain ∗ , Mikhail Moshkov Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
Article history: Received 3 April 2015 Received in revised form 27 June 2016 Accepted 4 July 2016 Available online 26 July 2016 Keywords: Boolean functions Monotone Boolean functions Totally optimal decision trees Time complexity Space complexity
Abstract
We study decision trees which are totally optimal relative to different sets of complexity parameters for Boolean functions. A totally optimal tree is an optimal tree relative to each parameter from the set simultaneously. We consider the parameters characterizing both time (in the worst and average case) and space complexity of decision trees, i.e., depth, total path length (average depth), and number of nodes. We have created tools based on extensions of dynamic programming to study totally optimal trees. These tools are applicable to both exact and approximate decision trees, and allow us to make multi-stage optimization of decision trees relative to different parameters and to count the number of optimal trees. Based on the experimental results, we formulated and subsequently proved the following hypotheses: for almost all Boolean functions there exist totally optimal decision trees (i) relative to the depth and number of nodes, and (ii) relative to the depth and average depth. © 2016 Elsevier B.V. All rights reserved.
1. Introduction
Time and space complexity relationships for algorithms play an important role in computational complexity theory. These relationships are often considered for non-universal computational models, such as branching programs or decision trees [3,20], where time and space complexity are each characterized by a number and not by a function depending on the length of the input. The considered relationships become trivial if there exist totally optimal algorithms, that is, algorithms that are optimal with respect to both time and space complexity. Total optimality with respect to two different criteria (e.g., time and space) gives us an optimal solution which can be used in many practical applications.
We study totally optimal decision trees for computing Boolean functions. We consider depth and total path length (average depth) of decision trees as time complexity in the worst and in the average case, respectively, and the number of nodes in decision trees as space complexity.
The research presented in this paper has four goals. First, we present the tools (algorithms and mathematical background) to construct and optimize decision trees for Boolean functions. Second, we perform experiments on table representations of Boolean functions. Third, we formulate hypotheses based on the experimental results and prove them. Finally, we present hardness results regarding optimization of decision trees corresponding to some cost functions such as depth, number of nodes, and average depth of decision trees (see Section 4 for details).
We have created a number of tools based on extensions of dynamic programming to study decision trees, i.e., to construct and optimize decision trees with respect to various criteria (see for example [1,12]). These tools allow us to work with both exact and approximate decision trees. We can describe the set of decision trees under consideration by a directed acyclic
∗
Corresponding author. Fax: +966 12 8021291. E-mail address:
[email protected] (S. Hussain).
http://dx.doi.org/10.1016/j.dam.2016.07.009 0166-218X/© 2016 Elsevier B.V. All rights reserved.
graph, make multi-stage optimization of decision trees relative to different cost functions, and count the number of optimal trees (see for example Section 5). Dynamic programming for optimization of decision trees has been studied before by Garey [9] and others [18,22], but not in the context of multi-stage optimization. Previously, we considered multi-stage optimization (see, for example, [2]), but we did not distinguish between optimal and strictly optimal decision trees (see Section 3.3). However, the considered procedures of optimization allow us to describe the whole set of optimal trees only for number of nodes and average depth. For depth we can only obtain the set of strictly optimal decision trees for which each subtree is optimal for the corresponding subtable.
Chikalov in [5] discusses in detail the average time complexity of decision trees. Moshkov in [19] presented approximate algorithms for minimization of the depth of decision trees and proved a bound using an approximate algorithm for the set cover problem. There is another direction of research involving evaluating stochastic Boolean functions in the context of decision trees (see for example Deshpande et al. [8]). Construction of decision trees for Boolean functions in general and for special purposes is well studied. Kundakcioglu and Ünlüyurt also discuss minimum-cost decision trees (specialized and/or trees) for sequential fault diagnosis (see [17]). There are many research papers which discuss the complexity of Boolean functions in various contexts and for different parameters (see for example the surveys by Korshunov [15] and Buhrman and de Wolf [4]).
We discuss the cost functions for optimization of decision trees in detail in Section 2. It is important to note that all considered cost functions are increasing; however, only some of them are strictly increasing.
That leads to the consideration of the two kinds of optimal trees, optimal and strictly optimal, which allow us to understand the difference between optimization (i) relative to depth, and (ii) relative to average depth and number of nodes. We also discuss the hardness results for decision tree optimization. Particularly, we show that total Boolean functions in a decision table form have polynomial-time algorithms for construction and optimization of decision trees in terms of the size of the table (the existence of such algorithm for minimization of size of decision trees was proved earlier by David Guijarro, Víctor Lavín, and Vijay Raghavan [10]). However, for general cases of partial Boolean functions the problem of optimization remains hard. We provide the proofs of NP-hardness for the well-understood cases when optimizing decision trees for depth and number of nodes of decision trees and show the result from Hyafil and Rivest [13] about NP-hardness of minimization of average depth of decision trees for pseudo Boolean functions. An essential part of this paper is devoted to the experimental study of total and partial Boolean functions. We begin with the study of three well-known functions: conjunction, linear function, and majority function each for n variables for n from 2 to 14 (and in some cases up to n = 16). We study minimum depth, average depth and number of nodes for these functions as well as the existence of totally optimal decision trees for each pair of the considered cost functions. We show that such totally optimal trees exist in all cases we have studied, and how the complexity of decision trees depends on their accuracy. The second direction of experimental research is to study the existence of totally optimal decision trees for monotone and arbitrary total Boolean functions. 
In [6], we proved that for any monotone Boolean function with at most five variables, there exists a totally optimal decision tree relative to the depth and the number of nodes. In this paper, we extend this result to six variables and find a counterexample for seven variables. We obtained similar results for each possible pair of these three cost functions and all three cost functions together. We also studied randomly generated total Boolean functions for n = 4, . . . , 10. The obtained experimental results allowed us to formulate the following hypotheses (which we subsequently proved in Theorems 16 and 17): for almost all total Boolean functions there exist (i) totally optimal decision trees relative to the depth and number of nodes, and (ii) totally optimal decision trees relative to the depth and average depth.
The third direction of study is devoted to experiments with partial Boolean functions. The obtained results show that the percentage of partial Boolean functions that have totally optimal trees decreases quickly as the number of variables grows. This situation is in some sense similar to the situation with patterns in partial Boolean functions: for each total Boolean function and each point of the Boolean cube, there is a pattern which covers this point and only points with the same value of the function, and which has both the minimum length of description and the maximum coverage. For partial Boolean functions, such patterns do not always exist, and that is why it is necessary to consider Pareto optimal patterns [11].
The experimental results presented in this paper lead us to postulate two hypotheses (mentioned above) regarding the total optimality of decision trees relative to (i) depth and number of nodes, and (ii) depth and average depth. We formally treat these hypotheses and prove the consequent theorems. This paper consists of seven sections.
Section 2 explains partial Boolean functions and decision trees for these functions. Section 3 presents definitions and tools to study the decision trees. We prove some hardness results for decision tree optimization in Section 4. Section 5 is devoted to the consideration of experimental results. Proofs of the hypotheses postulated with the help of experimental results are given in Section 6, and Section 7 concludes the paper.
2. Partial Boolean functions and decision trees
In this section, we consider the notions connected with table representation of partial Boolean functions, and with approximate decision trees (α -decision trees). A partial Boolean function f (x1 , . . . , xn ) is a partial function of the kind f : {0, 1}^n → {0, 1}. We work with a table representation of the function f (table for short) which is a rectangular table T with n columns filled with numbers from the set {0, 1}. Columns of the table are labeled with variables x1 , . . . , xn . Rows of the table are pairwise different, and the set of
rows coincides with the set of n-tuples from {0, 1}^n on which the value of f is defined. Each row is labeled with the value of function f on this row. A table is called empty if it has no rows. We denote by F2 the set of table representations for all partial Boolean functions. Let T ∈ F2 . The table T is called degenerate if it is empty or all rows of T are labeled with the same value. We denote by N (T ) the number of rows in the table T and, for any t ∈ {0, 1}, we denote by Nt (T ) the number of rows of T labeled with the number t. By mcv(T ) we denote the most common value for T defined as
\[ \mathrm{mcv}(T) = \arg\max_{t \in \{0,1\}} N_t(T). \]
For an empty table T we have mcv(T ) = 0. For any variable xi ∈ {x1 , . . . , xn }, we denote by E (T , xi ) the set of values of the variable xi in the table T . We denote by E (T ) the set of variables for which |E (T , xi )| = 2. A subtable of T is a table obtained from T by removal of some rows. Let T be a nonempty table, xi1 , . . . , xim ∈ {x1 , . . . , xn } and a1 , . . . , am ∈ {0, 1}. We denote by T (xi1 , a1 ) . . . (xim , am ) the subtable of the table T containing the rows from T which at the intersection with the columns xi1 , . . . , xim have numbers a1 , . . . , am , respectively. Such nonempty subtables, including the table T , are called separable subtables of T . The set of these tables is denoted by SEP (T ).
The notion of separable subtable is different from the notion of subfunction. For example, for the function x1 ∧ x2 , the two different pairs of substitutions (x1 = 0, x2 = 1) and (x1 = 1, x2 = 0) describe the same subfunction 0 but different separable subtables of the table representation of x1 ∧ x2 : in the first case, the subtable contains the unique row (0, 1), and in the second case, it contains the unique row (1, 0).
As an uncertainty measure for tables from F2 we consider relative misclassification error rme(T ) = (N (T ) − Nmcv(T ) (T ))/N (T ). We assume that rme(T ) = 0 if T is empty.
A decision tree over T is a finite directed tree with root in which nonterminal nodes are labeled with variables from the set {x1 , . . . , xn }, terminal nodes are labeled with numbers from {0, 1}, and, for each nonterminal node, edges starting in this node are labeled with pairwise different numbers from {0, 1}.
Let Γ be a decision tree over T and v be a node of Γ . We denote by Γ (v) the subtree of Γ for which v is the root. We define now a subtable T (v) = TΓ (v) of the table T . If v is the root of Γ then T (v) = T . Let v be a node in Γ other than the root of Γ and v1 , e1 , . . . 
, vm , em , vm+1 = v be the directed path from the root of Γ to v in which nodes v1 , . . . , vm are labeled with variables xi1 , . . . , xim and edges e1 , . . . , em are labeled with numbers a1 , . . . , am , respectively. Then T (v) = T (xi1 , a1 ) . . . (xim , am ). Let α ∈ R+ , and α < 1, where R+ is the set of nonnegative real numbers. A decision tree Γ over T is called an α -decision tree for T if, for any node v of Γ ,
• If rme(T (v)) ≤ α then v is a terminal node which is labeled with mcv(T (v)).
• If rme(T (v)) > α then v is a nonterminal node labeled with a variable xi ∈ E (T (v)). Two edges start from the node v which are labeled with 0 and 1, respectively.
Note that a 0-decision tree for T is a decision tree for exact computation of the partial Boolean function represented by T and, for α > 0, each row of the subtable is localized with uncertainty at most α .
For b ∈ {0, 1}, we denote by tree(b) the decision tree that contains only one (terminal) node labeled with b. Let xi ∈ {x1 , . . . , xn }, and Γ0 , Γ1 be decision trees over T . We denote by tree(xi , Γ0 , Γ1 ) the following decision tree over T : the root of the tree is labeled with xi , and two edges start from the root which are labeled with 0 and 1 and enter the roots of the decision trees Γ0 and Γ1 , respectively.
We denote by DTα (T ) the set of α -decision trees for T . For xi ∈ E (T ), we denote DTα (T , xi ) = {tree(xi , Γ0 , Γ1 ) : Γt ∈ DTα (T (xi , t )), t = 0, 1}, that is, DTα (T , xi ) is the set of α -decision trees for T in which the root is labeled with xi . The following proposition is a direct consequence of these definitions.
Proposition 1. Let T ∈ F2 , α ∈ R+ , and α < 1. Then
\[ DT_\alpha(T) = \begin{cases} \{\mathrm{tree}(\mathrm{mcv}(T))\}, & \text{if } \mathrm{rme}(T) \le \alpha, \\ \bigcup_{x_i \in E(T)} DT_\alpha(T, x_i), & \text{if } \mathrm{rme}(T) > \alpha. \end{cases} \]
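The definitions of this section and the recursion of Proposition 1 can be illustrated by a short Python sketch. It is only an illustration under stated assumptions and not the authors' implementation: a table is modeled as a frozenset of (row, label) pairs, variables are indexed from 0 (so index 0 stands for x1), a tie in mcv is resolved to 0, and the names make_table, E, sub and dt are hypothetical.

```python
from fractions import Fraction
from itertools import product

def make_table(f, n):
    """Table of a total Boolean function f: a frozenset of (row, label) pairs."""
    return frozenset((row, f(*row)) for row in product((0, 1), repeat=n))

def mcv(T):
    """Most common label; the empty table gives 0, ties give 0 (assumed tie rule)."""
    ones = sum(label for _, label in T)
    return 1 if ones > len(T) - ones else 0

def rme(T):
    """Relative misclassification error (N(T) - N_mcv(T)) / N(T); 0 if T is empty."""
    if not T:
        return Fraction(0)
    hits = sum(1 for _, label in T if label == mcv(T))
    return Fraction(len(T) - hits, len(T))

def E(T, n):
    """Indices i such that column x_i takes both values in T."""
    return [i for i in range(n) if len({row[i] for row, _ in T}) == 2]

def sub(T, i, a):
    """Separable subtable T(x_i, a)."""
    return frozenset((row, label) for row, label in T if row[i] == a)

def dt(T, n, alpha):
    """All alpha-decision trees for T (Proposition 1).
    A leaf is a label 0/1, an inner node is a tuple (i, G0, G1)."""
    if rme(T) <= alpha:
        return [mcv(T)]
    return [(i, g0, g1)
            for i in E(T, n)
            for g0 in dt(sub(T, i, 0), n, alpha)
            for g1 in dt(sub(T, i, 1), n, alpha)]

T = make_table(lambda x1, x2: x1 & x2, 2)
print(rme(T))                    # 1/4
print(dt(T, 2, 0))               # the two exact trees, rooted at x1 and at x2
print(dt(T, 2, Fraction(1, 4)))  # [0]
```

Exact rational arithmetic is used for rme so that the comparison with α is not affected by floating-point rounding. Enumerating trees explicitly is feasible only for very small tables.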
We consider a partial order ≤ on the set R^2_+ : (x1 , x2 ) ≤ (y1 , y2 ) if x1 ≤ y1 and x2 ≤ y2 . A function F : R^2_+ → R+ is called increasing if F (x) ≤ F (y) for any x, y ∈ R^2_+ such that x ≤ y. A function F : R^2_+ → R+ is called strictly increasing if F (x) < F (y) for any x, y ∈ R^2_+ such that x ≤ y and x ≠ y. If F is strictly increasing then, evidently, F is increasing. For example, max(x1 , x2 ) is increasing and x1 + x2 is strictly increasing.
A cost function for decision trees is a function ψ(T , Γ ) which is defined on pairs (table T ∈ F2 , decision tree Γ for T ) and has values from R+ . Here, a decision tree Γ over a decision table T is called a decision tree for T if it is an α -decision tree for T for some α . The function ψ is given by the three operators ψ^0 : F2 → R+ , F : R^2_+ → R+ , and w : F2 → R+ . The value of ψ(T , Γ ) is defined by induction:
• If Γ = tree(mcv(T )) then ψ(T , Γ ) = ψ^0 (T ).
• If Γ = tree(xi , Γ0 , Γ1 ) then ψ(T , Γ ) = F (ψ(T (xi , 0), Γ0 ), ψ(T (xi , 1), Γ1 )) + w(T ).
The cost function ψ is called (strictly) increasing if F is (strictly) increasing. We now consider some examples of cost functions for decision trees:
• Depth: h(T , Γ ) = h(Γ ) of a decision tree Γ for a table T is the maximum length of a path in Γ from the root to a terminal node. For this cost function, ψ^0 (T ) = 0, F (x, y) = max(x, y), and w(T ) = 1. This is an increasing cost function.
• Total path length: tpl(T , Γ ) of a decision tree Γ for a table T is equal to Σ_{r ∈ Row(T )} lΓ (r ), where Row(T ) is the set of rows of T , and lΓ (r ) is the length of the path in Γ from the root to the terminal node v such that the row r belongs to TΓ (v). For this cost function, ψ^0 (T ) = 0, F (x, y) = x + y, and w(T ) = N (T ). This is a strictly increasing cost function. For a nonempty table T , the value tpl(T , Γ )/N (T ) is called the average depth of a decision tree Γ for a decision table T and is denoted by havg (T , Γ ).
• Number of nodes: L(T , Γ ) = L(Γ ) of a decision tree Γ for a table T . For this cost function, ψ^0 (T ) = 1, F (x, y) = x + y, and w(T ) = 1. This is a strictly increasing cost function.
• Number of nonterminal nodes: Ln (T , Γ ) = Ln (Γ ) of a decision tree Γ for a table T . For this cost function, ψ^0 (T ) = 0, F (x, y) = x + y, and w(T ) = 1. This is a strictly increasing cost function.
• Number of terminal nodes: Lt (T , Γ ) = Lt (Γ ) of a decision tree Γ for a table T . For this cost function, ψ^0 (T ) = 1, F (x, y) = x + y, and w(T ) = 0. This is a strictly increasing cost function.
In this paper, we do not consider the parameters Ln (Γ ) and Lt (Γ ) since they can be easily derived from the parameter L(Γ ). Let T ∈ F2 and Γ be a decision tree for T . It is easy to see that Γ is a full binary tree. Therefore Lt (Γ ) = Ln (Γ ) + 1, Ln (Γ ) = (L(Γ ) − 1)/2, and Lt (Γ ) = (L(Γ ) + 1)/2.
3. Tools for studying decision trees
In this section, we consider main definitions and present some algorithms.
Specifically, we present algorithms for the construction of a directed acyclic graph (DAG), for counting the number of decision trees represented by the DAG (or its proper subgraph), and for multi-stage optimization of decision trees.
3.1. Directed acyclic graph ∆α (T )
Here, we discuss the possibility to represent the set of α -decision trees for a decision table in the form of a directed acyclic graph. Let α ∈ R+ , α < 1, and T ∈ F2 . We now consider Algorithm A1 for the construction of a directed acyclic graph ∆α (T ). This graph is used for the description and optimization of α -decision trees for T , and also for counting the number of such trees. Nodes of this graph are some separable subtables of the table T . We consider Algorithm A1 as the definition of the graph ∆α (T ).
Algorithm A1
Input: Table T ∈ F2 and number α ∈ R+ , α < 1.
Output: Directed acyclic graph ∆α (T ).
1. Construct the graph that consists of one node T which is not labeled as processed.
2. If all nodes are processed then the algorithm finishes and the resulting graph is ∆α (T ). Otherwise, choose a node (table) Θ that has not been processed yet.
3. If rme(Θ ) ≤ α, mark Θ as processed and proceed to step 2.
4. If rme(Θ ) > α then, for each xi ∈ E (Θ ), draw a pair of edges from the node Θ (this pair of edges is called an xi -pair) and label these edges with pairs (xi , 0) and (xi , 1). These edges enter nodes Θ (xi , 0) and Θ (xi , 1), respectively. If some of the nodes Θ (xi , 0), Θ (xi , 1) are not present in the graph then add these nodes to the graph. Mark the node Θ as processed and proceed to step 2.
It is easy to see that the time complexity of Algorithm A1 is bounded from above by a polynomial in the size of the input table T and the number of nodes in the graph ∆α (T ). The latter number is bounded from above by the number |SEP (T )| of different separable subtables of T . For a total Boolean function with n variables, N (T ) = 2^n and |SEP (T )| ≤ 3^n = N (T )^{log2 3} , so the number of nodes is polynomial in the size of the table. This implies the following proposition.
Proposition 2. The time complexity of Algorithm A1 for total Boolean functions is bounded from above by a polynomial in the size of the input table.
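The steps of Algorithm A1 can be sketched in Python as follows. This is a minimal sketch under assumptions that are not part of the paper: a table is a frozenset of (row, label) pairs, the graph is stored as a dictionary that maps each node to its xi -pairs, variables are indexed from 0, and the names build_dag and sub are hypothetical.

```python
from fractions import Fraction
from itertools import product

def rme(T):
    """Relative misclassification error; equals min(N_0, N_1) / N for nonempty T."""
    if not T:
        return Fraction(0)
    ones = sum(label for _, label in T)
    return Fraction(min(ones, len(T) - ones), len(T))

def sub(T, i, a):
    """Separable subtable T(x_i, a)."""
    return frozenset((row, label) for row, label in T if row[i] == a)

def build_dag(T, n, alpha):
    """Sketch of Algorithm A1.
    Returns {node: {i: (child0, child1)}}; terminal nodes map to an empty dict."""
    dag = {}
    todo = [T]                       # nodes that still have to be processed
    while todo:
        theta = todo.pop()
        if theta in dag:             # already processed
            continue
        dag[theta] = {}
        if rme(theta) <= alpha:      # step 3: terminal node
            continue
        for i in range(n):           # step 4: one x_i-pair for each x_i in E(theta)
            if len({row[i] for row, _ in theta}) == 2:
                pair = (sub(theta, i, 0), sub(theta, i, 1))
                dag[theta][i] = pair
                todo.extend(pair)
    return dag

# Majority function of three variables (a total Boolean function), alpha = 0.
n = 3
T = frozenset((row, int(sum(row) >= 2)) for row in product((0, 1), repeat=n))
dag = build_dag(T, n, 0)
print(len(dag), "nodes; upper bound 3^n =", 3 ** n)
```

Because subtables are hashable frozensets, equal subtables reached along different paths are merged into one node automatically, which is what keeps the graph within the |SEP (T )| bound.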
For tables which are representations of partial Boolean functions, the number of nodes in the graph ∆α (T ) can grow exponentially with the number of rows in table T . Let us consider a table T with n columns labeled with variables and n + 1 rows (0, . . . , 0), (1, 0, . . . , 0), . . . , (0, . . . , 0, 1) labeled with values 0, 1, . . . , 1, respectively. It is not difficult to show that, for any subset {i1 , . . . , im } of the set {1, . . . , n} which is different from {1, . . . , n}, the separable subtable T (xi1 , 0) . . . (xim , 0) of the table T is nondegenerate. The considered subtables are pairwise different. One can show that any nondegenerate separable subtable of the table T is a node of the graph ∆0 (T ). Therefore the number of nodes in ∆0 (T ) is at least 2^n − 1.
A node of a directed graph is called terminal if there are no edges starting in this node. A node Θ of the graph ∆α (T ) is terminal if and only if rme(Θ ) ≤ α .
We now describe the notion of a proper subgraph of the graph ∆α (T ). Such subgraphs arise as results of optimization of α -decision trees relative to different cost functions. A proper subgraph of the graph ∆α (T ) is a graph G obtained from ∆α (T ) by removal of some xi -pairs of edges such that each nonterminal node of ∆α (T ) keeps at least one pair of edges starting from this node. By definition, ∆α (T ) is a proper subgraph of ∆α (T ). A node Θ of the graph G is terminal if and only if rme(Θ ) ≤ α . We denote by L(G) the number of nodes in the graph G.
Let G be a proper subgraph of the graph ∆α (T ). For each nonterminal node Θ of the graph G, we denote by EG (Θ ) the set of variables xi from E (Θ ) such that xi -pair of edges starts from Θ in G. For each node Θ of the graph G, we define the set Tree(G, Θ ) of decision trees in the following way. If Θ is a terminal node of G, then Tree(G, Θ ) = {tree(mcv(Θ ))}. Let Θ be a nonterminal node of G and xi ∈ EG (Θ ).
We denote Tree(G, Θ , xi ) = {tree(xi , Γ0 , Γ1 ) : Γt ∈ Tree(G, Θ (xi , t )), t = 0, 1}. Then
\[ \mathrm{Tree}(G, \Theta) = \bigcup_{x_i \in E_G(\Theta)} \mathrm{Tree}(G, \Theta, x_i). \tag{1} \]
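As a numerical check of the exponential growth noted earlier in this subsection for partial Boolean functions, the following Python sketch counts the nodes of ∆0 (T ) for the table with n + 1 rows consisting of the zero row labeled 0 and the n unit rows labeled 1. The representation of tables as frozensets of (row, label) pairs and the name count_nodes are assumptions made for this illustration only.

```python
def count_nodes(n):
    """Count nodes of Delta_0(T) for the (n+1)-row table described in the text.
    Returns (number of nonterminal nodes, total number of nodes)."""
    zero = (0,) * n
    units = [tuple(int(j == i) for j in range(n)) for i in range(n)]
    T = frozenset([(zero, 0)] + [(u, 1) for u in units])
    seen, todo, nonterminal = set(), [T], 0
    while todo:
        theta = todo.pop()
        if theta in seen:
            continue
        seen.add(theta)
        if len({label for _, label in theta}) <= 1:
            continue                 # degenerate subtable: rme = 0, terminal node
        nonterminal += 1
        for i in range(n):           # follow every x_i-pair with x_i in E(theta)
            if len({row[i] for row, _ in theta}) == 2:
                for a in (0, 1):
                    todo.append(frozenset((r, l) for r, l in theta if r[i] == a))
    return nonterminal, len(seen)

for n in range(2, 7):
    print("n =", n, "rows =", n + 1, "(nonterminal, total) =", count_nodes(n))
```

For each n tried here, the number of nonterminal nodes can be compared with the lower bound 2^n − 1, while the table itself has only n + 1 rows.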
Proposition 3. Let T ∈ F2 , α ∈ R+ , and α < 1. Then, for any node Θ of the graph ∆α (T ), the following equality holds: Tree(∆α (T ), Θ ) = DTα (Θ ).
Proof. We prove this statement by induction on nodes of ∆α (T ). Let Θ be a terminal node of ∆α (T ). Then Tree(∆α (T ), Θ ) = {tree(mcv(Θ ))} = DTα (Θ ). Let now Θ be a nonterminal node of ∆α (T ), and let us assume that Tree(∆α (T ), Θ (xi , t )) = DTα (Θ (xi , t )) for any xi ∈ E (Θ ) and t ∈ {0, 1}. Then, for any xi ∈ E (Θ ), we have Tree(∆α (T ), Θ , xi ) = DTα (Θ , xi ). Using the equality in (1) and Proposition 1, we obtain Tree(∆α (T ), Θ ) = DTα (Θ ).
3.2. Cardinality of the set Tree(G, T )
We describe how to count the number of α -decision trees represented by a proper subgraph of the graph ∆α (T ). Let T be a table from F2 , α ∈ R+ , α < 1, and G be a proper subgraph of the graph ∆α (T ). We now describe an algorithm which counts, for each node Θ of the graph G, the cardinality C (Θ ) of the set Tree(G, Θ ), and returns the number C (T ) = |Tree(G, T )|.
Algorithm A2
Input: A proper subgraph G of the graph ∆α (T ) for some table T ∈ F2 and number α ∈ R+ , α < 1.
Output: The number |Tree(G, T )|.
1. If all nodes of the graph G are processed then return the number C (T ) and terminate the algorithm. Otherwise, choose a node Θ of the graph G which is not processed yet and which is either a terminal node of G or a nonterminal node of G such that, for each xi ∈ EG (Θ ), the nodes Θ (xi , 0) and Θ (xi , 1) are processed.
2. If Θ is a terminal node then set C (Θ ) = 1, mark the node Θ as processed, and proceed to step 1.
3. If Θ is a nonterminal node then set
\[ C(\Theta) = \sum_{x_i \in E_G(\Theta)} C(\Theta(x_i, 0)) \times C(\Theta(x_i, 1)), \]
mark the node Θ as processed, and proceed to step 1.
Proposition 4. Let T be a table from F2 with n columns, α ∈ R+ , α < 1, and G be a proper subgraph of the graph ∆α (T ). Then Algorithm A2 returns the number |Tree(G, T )| and makes at most 2nL(G) operations of addition and multiplication, where L(G) is the number of nodes in the graph G.
Proof. We prove by induction on the nodes of G that C (Θ ) = |Tree(G, Θ )| for each node Θ of G. Let Θ be a terminal node of G. Then Tree(G, Θ ) = {tree(mcv(Θ ))} and |Tree(G, Θ )| = 1. Therefore the considered statement holds for Θ . Let now Θ be a nonterminal node of G such that the considered statement holds for its children. By definition,
\[ \mathrm{Tree}(G, \Theta) = \bigcup_{x_i \in E_G(\Theta)} \mathrm{Tree}(G, \Theta, x_i), \]
where, for xi ∈ EG (Θ ), Tree(G, Θ , xi ) = {tree(xi , Γ0 , Γ1 ) : Γt ∈ Tree(G, Θ (xi , t )), t = 0, 1}. One can show that, for any xi ∈ EG (Θ ), |Tree(G, Θ , xi )| = |Tree(G, Θ (xi , 0))| × |Tree(G, Θ (xi , 1))|, and, since trees with different root labels are different, |Tree(G, Θ )| = Σ_{xi ∈ EG (Θ )} |Tree(G, Θ , xi )|. By the inductive hypothesis, C (Θ (xi , t )) = |Tree(G, Θ (xi , t ))| for any xi ∈ EG (Θ ) and t ∈ {0, 1}. Therefore C (Θ ) = |Tree(G, Θ )|. Hence, the considered statement holds. From here it follows that C (T ) = |Tree(G, T )|, and Algorithm A2 returns the cardinality of the set Tree(G, T ).
We now evaluate the number of arithmetic operations made by Algorithm A2 . At each nonterminal node of G, Algorithm A2 makes at most n operations of multiplication and at most n − 1 operations of addition. Therefore, to compute the value |Tree(G, T )| the algorithm makes at most 2nL(G) operations of addition and multiplication.
We provide some bounds on the number of operations and time complexity of Algorithm A2 .
Proposition 5. For Algorithm A2 , the number of arithmetic operations is bounded from above by a polynomial depending on the size of the input table T and on the number of separable subtables of T .
Proposition 6. For total Boolean functions the time complexity of Algorithm A2 is bounded from above by a polynomial in the size of the input table T .
3.3. Multi-stage optimization of decision trees
We now discuss how to optimize α -decision trees represented by a proper subgraph of the graph ∆α (T ) relative to a cost function for decision trees. We also explain possibilities of multi-stage optimization of decision trees for different cost functions and consider the notion of a totally optimal decision tree relative to a number of cost functions. A totally optimal decision tree is a decision tree which is optimal simultaneously for each of the considered cost functions.
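Before turning to optimization, the counting recurrence of Algorithm A2 from Section 3.2 can be sketched in Python. The sketch makes simplifying assumptions that are not in the paper: it handles only α = 0 and the full graph G = ∆0 (T ) (no xi -pairs removed), it uses memoized recursion instead of explicit "processed" marks, and the names count_trees, table and sub are hypothetical.

```python
from functools import lru_cache
from itertools import product

def sub(T, i, a):
    """Separable subtable T(x_i, a)."""
    return frozenset((row, label) for row, label in T if row[i] == a)

def count_trees(T, n):
    """|Tree(G, T)| for G = Delta_0(T): the number of exact decision trees for T."""
    @lru_cache(maxsize=None)
    def C(theta):
        if len({label for _, label in theta}) <= 1:
            return 1                 # terminal node: exactly one tree
        # Sum over x_i in E(theta) of C(theta(x_i, 0)) * C(theta(x_i, 1)).
        return sum(C(sub(theta, i, 0)) * C(sub(theta, i, 1))
                   for i in range(n)
                   if len({row[i] for row, _ in theta}) == 2)
    return C(T)

def table(f, n):
    """Table of a total Boolean function f given on whole rows."""
    return frozenset((row, f(row)) for row in product((0, 1), repeat=n))

for n in range(1, 5):
    conj = count_trees(table(lambda r: int(all(r)), n), n)
    parity = count_trees(table(lambda r: sum(r) % 2, n), n)
    print("n =", n, "conjunction:", conj, "parity:", parity)
```

The memoization cache plays the role of the values C (Θ ) attached to the nodes of G, so each separable subtable is evaluated once even when it is reachable along many paths.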
Let ψ be an increasing cost function for decision trees given by the triple of operators ψ^0 , F and w , α ∈ R+ , α < 1, T be a table from F2 with n columns labeled with attributes x1 , . . . , xn , and G be a proper subgraph of the graph ∆α (T ). Let Θ be a node of G and Γ ∈ Tree(G, Θ ). One can show that, for any node v of Γ , the decision tree Γ (v) belongs to the set Tree(G, ΘΓ (v)).
A decision tree Γ from Tree(G, Θ ) is called an optimal decision tree for Θ relative to ψ and G if ψ(Θ , Γ ) = min{ψ(Θ , Γ′ ) : Γ′ ∈ Tree(G, Θ )}. A decision tree Γ from Tree(G, Θ ) is called a strictly optimal decision tree for Θ relative to ψ and G if, for any node v of Γ , the decision tree Γ (v) is an optimal decision tree for ΘΓ (v) relative to ψ and G.
We denote by Tree^{opt}_ψ (G, Θ ) the set of optimal decision trees for Θ relative to ψ and G. We denote by Tree^{s-opt}_ψ (G, Θ ) the set of strictly optimal decision trees for Θ relative to ψ and G.
Let Γ ∈ Tree^{opt}_ψ (G, Θ ) and Γ = tree(xi , Γ0 , Γ1 ). Then Γ ∈ Tree^{s-opt}_ψ (G, Θ ) if and only if Γt ∈ Tree^{s-opt}_ψ (G, Θ (xi , t )) for t = 0, 1.
Proposition 7. Let ψ be a strictly increasing cost function for decision trees, α ∈ R+ , α < 1, T ∈ F2 , and G be a proper subgraph of the graph ∆α (T ). Then, for any node Θ of the graph G, Tree^{opt}_ψ (G, Θ ) = Tree^{s-opt}_ψ (G, Θ ).
Proof. It is clear that Tree^{s-opt}_ψ (G, Θ ) ⊆ Tree^{opt}_ψ (G, Θ ). Let Γ ∈ Tree^{opt}_ψ (G, Θ ) and let us assume that Γ ∉ Tree^{s-opt}_ψ (G, Θ ). Then there is a node v of Γ such that Γ (v) ∉ Tree^{opt}_ψ (G, ΘΓ (v)). Let Γ0 ∈ Tree^{opt}_ψ (G, ΘΓ (v)) and Γ′ be the decision tree obtained from Γ by replacing Γ (v) with Γ0 . One can show that Γ′ ∈ Tree(G, Θ ). Since ψ is strictly increasing and ψ(ΘΓ (v), Γ0 ) < ψ(ΘΓ (v), Γ (v)), we have ψ(Θ , Γ′ ) < ψ(Θ , Γ ). Therefore Γ ∉ Tree^{opt}_ψ (G, Θ ) which is impossible. Thus Tree^{opt}_ψ (G, Θ ) ⊆ Tree^{s-opt}_ψ (G, Θ ).
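The difference between optimal and strictly optimal trees can be observed on a small example. The Python sketch below enumerates all exact decision trees (α = 0, G = ∆0 (T )) for one three-variable function and compares the two sets for depth, which is increasing but not strictly increasing, and for the number of nodes, which is strictly increasing. The test function, the brute-force enumeration and all names are assumptions chosen for illustration; this is not the dynamic programming procedure of the paper.

```python
from itertools import product

def sub(T, i, a):
    """Separable subtable T(x_i, a)."""
    return frozenset((row, label) for row, label in T if row[i] == a)

def trees(T, n):
    """All exact decision trees for T: a leaf is a label, an inner node (i, G0, G1)."""
    labels = {label for _, label in T}
    if len(labels) == 1:
        return [labels.pop()]
    return [(i, g0, g1)
            for i in range(n) if len({row[i] for row, _ in T}) == 2
            for g0 in trees(sub(T, i, 0), n)
            for g1 in trees(sub(T, i, 1), n)]

# Cost functions as triples (psi0, F, w), following Section 2.
DEPTH = (lambda T: 0, max, lambda T: 1)
NODES = (lambda T: 1, lambda x, y: x + y, lambda T: 1)

def cost(psi, T, g):
    """Inductive definition of psi(T, Gamma)."""
    psi0, F, w = psi
    if not isinstance(g, tuple):
        return psi0(T)
    i, g0, g1 = g
    return F(cost(psi, sub(T, i, 0), g0), cost(psi, sub(T, i, 1), g1)) + w(T)

def optimal(psi, T, n):
    """Trees of minimum cost among all exact trees for T."""
    all_trees = trees(T, n)
    best = min(cost(psi, T, g) for g in all_trees)
    return [g for g in all_trees if cost(psi, T, g) == best]

def strictly_optimal(psi, T, n):
    """Trees in which every subtree is optimal for its own subtable."""
    def ok(theta, g):
        if g not in optimal(psi, theta, n):
            return False
        if not isinstance(g, tuple):
            return True
        i, g0, g1 = g
        return ok(sub(theta, i, 0), g0) and ok(sub(theta, i, 1), g1)
    return [g for g in trees(T, n) if ok(T, g)]

# Hypothetical test function: f = x2 if x1 = 0, and f = x2 XOR x3 if x1 = 1.
f = lambda x1, x2, x3: (x2 ^ x3) if x1 else x2
T = frozenset((row, f(*row)) for row in product((0, 1), repeat=3))

for name, psi in (("depth", DEPTH), ("nodes", NODES)):
    print(name, len(optimal(psi, T, 3)), len(strictly_optimal(psi, T, 3)))
```

For depth, a tree rooted at x1 may test x3 before x2 in the branch x1 = 0 without increasing the overall depth, so it is optimal although that subtree is not optimal for its subtable. For the number of nodes the two sets coincide, in agreement with Proposition 7.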
We now describe Algorithm A3 (a procedure of optimization relative to the cost function ψ ). Algorithm A3 attaches to each node Θ of G the number c (Θ ) = min{ψ(Θ , Γ ) : Γ ∈ Tree(G, Θ )} and, possibly, removes some xi -pairs of edges starting from nonterminal nodes of G. As a result, we obtain a proper subgraph Gψ of the graph G. It is clear that Gψ is also a proper subgraph of the graph ∆α (T ).
Algorithm A3
Input: A proper subgraph G of the graph ∆α (T ) for some table T ∈ F2 and number α ∈ R+ , α < 1, and an increasing cost function ψ for decision trees given by the triple of operators ψ^0 , F and w .
Output: The proper subgraph Gψ of the graph G.
1. If all nodes of the graph G are processed then return the obtained graph as Gψ and terminate the algorithm. Otherwise, choose a node Θ of the graph G which is not processed yet and which is either a terminal node of G or a nonterminal node of G for which all children are processed.
2. If Θ is a terminal node then set c (Θ ) = ψ^0 (Θ ), mark Θ as processed and proceed to step 1.
3. If Θ is a nonterminal node then, for each xi ∈ EG (Θ ), compute the value c (Θ , xi ) = F (c (Θ (xi , 0)), c (Θ (xi , 1))) + w(Θ ) and set c (Θ ) = min{c (Θ , xi ) : xi ∈ EG (Θ )}. Remove all xi -pairs of edges starting from Θ for which c (Θ ) < c (Θ , xi ). Mark Θ as processed and proceed to step 1.
Proposition 8. Let G be a proper subgraph of the graph ∆α (T ) for some table T ∈ F2 with n columns and number α ∈ R+ , α < 1, and ψ be an increasing cost function for decision trees given by the triple of operators ψ^0 , F and w . Then, to construct the graph Gψ , Algorithm A3 makes O(nL(G)) elementary operations (computations of F , w , ψ^0 , comparisons, and additions).
Proof. In each terminal node of the graph G, Algorithm A3 computes the value of ψ^0 . In each nonterminal node of G, Algorithm A3 computes the value of F at most n times and the value of w at most n times, and makes at most n additions and at most 2n comparisons. Therefore, Algorithm A3 makes O(nL(G)) elementary operations.
Using this proposition, we obtain the following upper bound on the time complexity of Algorithm A3 for total Boolean functions; in general, this bound does not hold for partial Boolean functions.
Proposition 9. For any cost function ψ ∈ {h, tpl, L}, for tables which represent total Boolean functions, the time complexity of Algorithm A3 is bounded from above by a polynomial in the size of the input table.
For any node Θ of the graph G and for any xi ∈ EG (Θ ), we denote ψG (Θ ) = min{ψ(Θ , Γ ) : Γ ∈ Tree(G, Θ )} and
ψG (Θ , xi ) = min{ψ(Θ , Γ ) : Γ ∈ Tree(G, Θ , xi )}.
Lemma 10. Let G be a proper subgraph of the graph ∆α (T ) for some table T ∈ F2 with n columns and number α ∈ R+ , α < 1, and ψ be an increasing cost function for decision trees given by the triple of operators ψ^0 , F and w . Then, for any node Θ of the graph G and for any variable xi ∈ EG (Θ ), Algorithm A3 computes values c (Θ ) = ψG (Θ ) and c (Θ , xi ) = ψG (Θ , xi ).
Proof. We prove the considered statement by induction on the nodes of the graph G. Let Θ be a terminal node of G. Then Tree(G, Θ ) = {tree(mcv(Θ ))} and ψG (Θ ) = ψ^0 (Θ ). Therefore c (Θ ) = ψG (Θ ) and the considered statement holds for Θ .
Let now Θ be a nonterminal node of G such that the considered statement holds for each node Θ (xi , t ) with xi ∈ EG (Θ ) and t ∈ {0, 1}. By definition, Tree(G, Θ ) = ⋃_{xi ∈ EG (Θ )} Tree(G, Θ , xi ) and, for each xi ∈ EG (Θ ), Tree(G, Θ , xi ) = {tree(xi , Γ0 , Γ1 ) : Γt ∈ Tree(G, Θ (xi , t )), t = 0, 1}. Since ψ is an increasing cost function,
$\psi_G(\Theta, x_i) = F(\psi_G(\Theta(x_i, 0)), \psi_G(\Theta(x_i, 1))) + w(\Theta)$. It is clear that $\psi_G(\Theta) = \min\{\psi_G(\Theta, x_i) : x_i \in E_G(\Theta)\}$. By the inductive hypothesis, $\psi_G(\Theta(x_i, t)) = c(\Theta(x_i, t))$ for each $x_i \in E_G(\Theta)$ and $t \in \{0, 1\}$. Therefore $c(\Theta, x_i) = \psi_G(\Theta, x_i)$ for each $x_i \in E_G(\Theta)$, and $c(\Theta) = \psi_G(\Theta)$.

Theorem 11. Let $\psi$ be an increasing cost function for decision trees, $\alpha \in \mathbb{R}_+$, $\alpha < 1$, $T \in F_2$, and $G$ be a proper subgraph of the graph $\Delta_\alpha(T)$. Then, for any node $\Theta$ of the graph $G^\psi$, the following equality holds: $\mathrm{Tree}(G^\psi, \Theta) = \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$.

Proof. We prove the considered statement by induction on the nodes of $G^\psi$. We use Lemma 10, which shows that, for any node $\Theta$ of the graph $G$ and for any $x_i \in E_G(\Theta)$, $c(\Theta) = \psi_G(\Theta)$ and $c(\Theta, x_i) = \psi_G(\Theta, x_i)$.

Let $\Theta$ be a terminal node of $G^\psi$. Then $\mathrm{Tree}(G^\psi, \Theta) = \{\mathrm{tree}(\mathrm{mcv}(\Theta))\}$. It is clear that $\mathrm{Tree}(G^\psi, \Theta) = \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$. Therefore the considered statement holds for $\Theta$.

Let $\Theta$ be a nonterminal node of $G^\psi$ such that the considered statement holds for each node $\Theta(x_i, t)$ with $x_i \in E_G(\Theta)$ and $t \in \{0, 1\}$. By definition,

$$\mathrm{Tree}(G^\psi, \Theta) = \bigcup_{x_i \in E_{G^\psi}(\Theta)} \mathrm{Tree}(G^\psi, \Theta, x_i)$$

and, for each $x_i \in E_{G^\psi}(\Theta)$, $\mathrm{Tree}(G^\psi, \Theta, x_i) = \{\mathrm{tree}(x_i, \Gamma_0, \Gamma_1) : \Gamma_t \in \mathrm{Tree}(G^\psi, \Theta(x_i, t)), t = 0, 1\}$. We know that $E_{G^\psi}(\Theta) = \{x_i : x_i \in E_G(\Theta), \psi_G(\Theta, x_i) = \psi_G(\Theta)\}$.

Let $x_i \in E_{G^\psi}(\Theta)$ and $\Gamma \in \mathrm{Tree}(G^\psi, \Theta, x_i)$. Then $\Gamma = \mathrm{tree}(x_i, \Gamma_0, \Gamma_1)$, where $\Gamma_t \in \mathrm{Tree}(G^\psi, \Theta(x_i, t))$ for $t = 0, 1$. According to the inductive hypothesis, $\mathrm{Tree}(G^\psi, \Theta(x_i, t)) = \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta(x_i, t))$ and $\Gamma_t \in \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta(x_i, t))$ for $t = 0, 1$. In particular, $\psi(\Theta(x_i, t), \Gamma_t) = \psi_G(\Theta(x_i, t))$ for $t = 0, 1$. Since $\psi_G(\Theta, x_i) = \psi_G(\Theta)$, we have $F(\psi_G(\Theta(x_i, 0)), \psi_G(\Theta(x_i, 1))) + w(\Theta) = \psi_G(\Theta)$ and $\psi(\Theta, \Gamma) = \psi_G(\Theta)$. Therefore $\Gamma \in \mathrm{Tree}^{opt}_\psi(G, \Theta)$, $\Gamma \in \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$, and $\mathrm{Tree}(G^\psi, \Theta) \subseteq \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$.

Let $\Gamma \in \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$. Since $\Theta$ is a nonterminal node, $\Gamma$ can be represented in the form $\mathrm{tree}(x_i, \Gamma_0, \Gamma_1)$ where $x_i \in E_G(\Theta)$ and $\Gamma_t \in \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta(x_i, t))$ for $t = 0, 1$. Since $\Gamma \in \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta)$, $\psi_G(\Theta, x_i) = \psi_G(\Theta)$ and $x_i \in E_{G^\psi}(\Theta)$. According to the inductive hypothesis, $\mathrm{Tree}(G^\psi, \Theta(x_i, t)) = \mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta(x_i, t))$ for $t = 0, 1$. Therefore $\Gamma \in \mathrm{Tree}(G^\psi, \Theta, x_i) \subseteq \mathrm{Tree}(G^\psi, \Theta)$. As a result, we have $\mathrm{Tree}^{s\text{-}opt}_\psi(G, \Theta) \subseteq \mathrm{Tree}(G^\psi, \Theta)$.
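The bottom-up computation of the values c(Θ) analyzed above can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes α = 0 and a total Boolean function, encodes each node Θ as a partial assignment of the variables (a hypothetical encoding), and uses memoized recursion instead of explicit "processed" marks. The triples (ψ0, F, w) chosen for h, tpl, and L are assumptions that fit the form c(Θ, xi) = F(c(Θ(xi, 0)), c(Θ(xi, 1))) + w(Θ).

```python
from functools import lru_cache
from itertools import product


def min_cost(f, n, psi0, F, w):
    """Minimum cost of an exact decision tree for f (sketch of the c(Theta) values).

    A node Theta is a tuple with one entry per variable:
    None if the variable is still free, otherwise its fixed value 0 or 1.
    """

    def table_rows(theta):
        # All rows of the truth table that agree with the fixed values of theta.
        free = [i for i, v in enumerate(theta) if v is None]
        for bits in product((0, 1), repeat=len(free)):
            row = list(theta)
            for i, b in zip(free, bits):
                row[i] = b
            yield tuple(row)

    @lru_cache(maxsize=None)
    def c(theta):
        table = list(table_rows(theta))
        if len({f(r) for r in table}) == 1:
            return psi0(len(table))  # terminal node: c(Theta) = psi0(Theta)
        best = None
        for i, v in enumerate(theta):
            if v is not None:
                continue  # x_i is constant on Theta, so it is not in E(Theta)
            c0 = c(theta[:i] + (0,) + theta[i + 1:])
            c1 = c(theta[:i] + (1,) + theta[i + 1:])
            cost = F(c0, c1) + w(len(table))  # c(Theta, x_i)
            best = cost if best is None else min(best, cost)
        return best

    return c((None,) * n)


# Assumed triples (psi0, F, w); psi0 and w receive the number of rows of Theta.
DEPTH = (lambda size: 0, max, lambda size: 1)
TPL = (lambda size: 0, lambda a, b: a + b, lambda size: size)
NODES = (lambda size: 1, lambda a, b: a + b, lambda size: 1)


def maj3(x):
    return int(sum(x) >= 2)


print([min_cost(maj3, 3, *triple) for triple in (DEPTH, TPL, NODES)])
# prints [3, 20, 11]
```

Under these assumptions the total path length 20 over 8 rows corresponds to an average depth of 2.5.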
Corollary 12. Let $\psi$ be a strictly increasing cost function, $\alpha \in \mathbb{R}_+$, $\alpha < 1$, $T \in F_2$, and $G$ be a proper subgraph of the graph $\Delta_\alpha(T)$. Then, for any node $\Theta$ of the graph $G^\psi$, $\mathrm{Tree}(G^\psi, \Theta) = \mathrm{Tree}^{opt}_\psi(G, \Theta)$.

This corollary follows immediately from Proposition 7 and Theorem 11.

Let $T \in F_2$, $\alpha \in \mathbb{R}_+$, and $\alpha < 1$. We can make multi-stage optimization of $\alpha$-decision trees for $T$ relative to a sequence of strictly increasing cost functions $\psi_1, \psi_2, \ldots$. We begin from the graph $G = \Delta_\alpha(T)$ and apply to it the procedure of optimization relative to the cost function $\psi_1$ (Algorithm $A_3$). As a result, we obtain a proper subgraph $G^{\psi_1}$ of the graph $G$. By Proposition 3, the set $\mathrm{Tree}(G, T)$ is equal to the set $DT_\alpha(T)$ of all $\alpha$-decision trees for $T$. Using Corollary 12, we obtain that the set $\mathrm{Tree}(G^{\psi_1}, T)$ coincides with the set $\mathrm{Tree}^{opt}_{\psi_1}(G, T)$ of all decision trees from $\mathrm{Tree}(G, T)$ which have minimum cost relative to $\psi_1$ among all trees from $\mathrm{Tree}(G, T)$. Next we apply to $G^{\psi_1}$ the procedure of optimization relative to the cost function $\psi_2$. As a result, we obtain a proper subgraph $G^{\psi_1, \psi_2}$ of the graph $G^{\psi_1}$ (and of the graph $G = \Delta_\alpha(T)$). By Corollary 12, the set $\mathrm{Tree}(G^{\psi_1, \psi_2}, T)$ coincides with the set $\mathrm{Tree}^{opt}_{\psi_2}(G^{\psi_1}, T)$ of all decision trees from $\mathrm{Tree}(G^{\psi_1}, T)$ which have minimum cost relative to $\psi_2$ among all trees from $\mathrm{Tree}(G^{\psi_1}, T)$, etc.

If one of the cost functions $\psi_i$ is increasing and not strictly increasing, then the set $\mathrm{Tree}(G^{\psi_1, \ldots, \psi_i}, T)$ coincides with the set $\mathrm{Tree}^{s\text{-}opt}_{\psi_i}(G^{\psi_1, \ldots, \psi_{i-1}}, T)$, which is a subset of the set of all decision trees from $\mathrm{Tree}(G^{\psi_1, \ldots, \psi_{i-1}}, T)$ that have minimum cost
relative to $\psi_i$ among all trees from $\mathrm{Tree}(G^{\psi_1, \ldots, \psi_{i-1}}, T)$.

For a cost function $\psi$, we denote $\psi^\alpha(T) = \min\{\psi(T, \Gamma) : \Gamma \in DT_\alpha(T)\}$, i.e., $\psi^\alpha(T)$ is the minimum cost of an $\alpha$-decision tree for $T$ relative to the cost function $\psi$. Let $\psi_1, \ldots, \psi_m$ be cost functions and $m \geq 2$. An $\alpha$-decision tree $\Gamma$ for $T$ is called a totally optimal $\alpha$-decision tree for $T$ relative to the cost functions $\psi_1, \ldots, \psi_m$ if $\psi_1(T, \Gamma) = \psi_1^\alpha(T), \ldots, \psi_m(T, \Gamma) = \psi_m^\alpha(T)$, i.e., $\Gamma$ is optimal relative to $\psi_1, \ldots, \psi_m$ simultaneously.

Let us assume that $\psi_1, \ldots, \psi_{m-1}$ are strictly increasing cost functions and $\psi_m$ is increasing or strictly increasing. We now describe how to recognize the existence of an $\alpha$-decision tree for $T$ which is a totally optimal $\alpha$-decision tree for $T$ relative to the cost functions $\psi_1, \ldots, \psi_m$.

First, we construct the graph $G = \Delta_\alpha(T)$ using Algorithm $A_1$. For $i = 1, \ldots, m$, we apply to $G$ the procedure of optimization relative to $\psi_i$ (Algorithm $A_3$). As a result, we obtain, for $i = 1, \ldots, m$, the graph $G^{\psi_i}$ and the number $\psi_i^\alpha(T)$ attached to the node $T$ of $G^{\psi_i}$. Next, we apply to $G$ sequentially the procedures of optimization relative to the cost functions $\psi_1, \ldots, \psi_m$. As a result, we obtain graphs $G^{\psi_1}, G^{\psi_1, \psi_2}, \ldots, G^{\psi_1, \ldots, \psi_m}$ and numbers $\varphi_1, \varphi_2, \ldots, \varphi_m$ attached to the node $T$ of these graphs. It is clear that $\varphi_1 = \psi_1^\alpha(T)$. For $i = 2, \ldots, m$, $\varphi_i = \min\{\psi_i(T, \Gamma) : \Gamma \in \mathrm{Tree}(G^{\psi_1, \ldots, \psi_{i-1}}, T)\}$. One can show that a totally optimal $\alpha$-decision tree for $T$ relative to the cost functions $\psi_1, \ldots, \psi_m$ exists if and only if $\varphi_i = \psi_i^\alpha(T)$ for $i = 1, \ldots, m$.

4. Hardness of decision tree optimization
In the following we provide some computational complexity results related to minimization of depth, number of nodes, and average depth of decision trees computing partial Boolean or pseudo-Boolean functions (see footnote 2). We reduce the well known set cover problem (Richard Karp showed in [14] that the set cover problem is NP-hard) to minimization of one of the cost functions (depth and number of nodes) for decision trees computing partial Boolean functions. To the best of our knowledge, the results for depth ($h$) and number of nodes ($L$) are folklore.

Footnote 2. A function $f : \{0, 1\}^n \to \mathbb{R}$ is called a pseudo-Boolean function. Every pseudo-Boolean function can be uniquely represented as a multilinear polynomial: $f(x) = a + \sum_i a_i x_i + \sum_{i<j} a_{ij} x_i x_j + \cdots$.

Let $A = \{a_1, \ldots, a_N\}$ and $F = \{S_1, \ldots, S_p\}$ be a family of subsets of $A$ such that $A = \bigcup_{i=1}^{p} S_i$. A subfamily $\{S_{i_1}, \ldots, S_{i_t}\}$ of the family $F$ will be called a cover if $\bigcup_{j=1}^{t} S_{i_j} = A$. The problem of searching for a cover with minimum cardinality $t$ is the set cover problem.

We associate with the considered set cover problem a partial Boolean function $f_{A,F}$ with $p$ variables $x_1, \ldots, x_p$ represented by a table $T(A, F)$. This table contains $p$ columns labeled with variables $x_1, \ldots, x_p$ corresponding to the sets $S_1, \ldots, S_p$, respectively, and $N + 1$ rows. The first $N$ rows correspond to elements $a_1, \ldots, a_N$, respectively. That is, the table contains the value 1 at the intersection of the $j$th row and the $i$th column, for $j = 1, \ldots, N$ and $i = 1, \ldots, p$, if and only if $a_j \in S_i$. The last, $(N + 1)$-st, row is filled with 0's. The value of $f_{A,F}$ corresponding to the last row is equal to 0. All other rows are labeled with the value 1 of the function $f_{A,F}$.

One can show that each decision tree corresponding to $f_{A,F}$ is of the kind shown in Fig. 1, where $\{S_{i_1}, \ldots, S_{i_m}\}$ is a cover for the considered set cover problem.

Fig. 1. Decision tree computing the function $f_{A,F}$.

From here it follows that there is a polynomial time reduction of the set cover
problem to the problem of minimization of decision tree depth for partial Boolean functions, and there exists a polynomial reduction of the set cover problem to the problem of minimization of the number of nodes in decision trees corresponding to partial Boolean functions. So, we have the following statement.

Proposition 13. The problem of minimization of depth for decision trees computing partial Boolean functions given by table representation is NP-hard.

We can construct a similar polynomial reduction from the set cover problem to minimization of the number of nodes for decision trees.

Proposition 14. The problem of minimization of the number of nodes for decision trees computing partial Boolean functions given by table representation is NP-hard.

A similar result can also be obtained for minimization of the average depth of decision trees computing partial pseudo-Boolean functions (this problem was proven to be NP-hard by Hyafil and Rivest in [13]).

Proposition 15 ([13]). The problem of minimization of average depth of decision trees computing partial pseudo-Boolean functions given by table representation is NP-hard.

It is important to note that the optimization of decision trees for depth, average depth, and number of nodes for total Boolean functions given by decision tables can be done in polynomial time (see Propositions 2 and 9). Note also that this result for the number of nodes was obtained earlier in [10].

5. Experimental results

In this section, we consider some experimental results for (i) three well known Boolean functions, (ii) total Boolean functions (with a relatively small number of variables), and (iii) partial Boolean functions (also with a relatively small number of variables). All experiments presented in this section were performed on a machine with four Intel Xeon E7-4870 (2.3 GHz) processors and 1.5 TB of onboard system memory.
For the case of three known Boolean functions, we present empirical time measurements for construction of decision trees.

5.1. Three known functions

We did some experiments with three well known Boolean functions: conjunction, linear function, and majority function (with a slight modification):

$$\mathrm{con}_n(x_1, \ldots, x_n) = \bigwedge_{i=1}^{n} x_i,$$

$$\mathrm{lin}_n(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i \bmod 2,$$

$$\mathrm{maj}_n(x_1, \ldots, x_n) = \begin{cases} 0, & \text{if } \sum_{i=1}^{n} x_i < n/2, \\ 1, & \text{otherwise.} \end{cases}$$
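As a concrete reading of these three definitions, the following Python sketch evaluates them on 0/1 tuples. The function names `con`, `lin`, and `maj` are chosen here for illustration only. Note that, with the modified majority, a tie for even n gives the value 1.

```python
from itertools import product


def con(x):
    """Conjunction: 1 only if every input bit is 1."""
    return int(all(x))


def lin(x):
    """Linear function: sum of the input bits modulo 2."""
    return sum(x) % 2


def maj(x):
    """Modified majority: 0 if fewer than n/2 bits are 1, otherwise 1."""
    return 0 if sum(x) < len(x) / 2 else 1


# Truth tables for n = 2 as rows (x, con, lin, maj).
for x in product((0, 1), repeat=2):
    print(x, con(x), lin(x), maj(x))
```

For n = 2 the modified majority coincides with disjunction, since only the all-zero input has fewer than n/2 ones.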
Fig. 2. Exact decision trees for conjunction, majority, and linear functions for n = 3: (a) con_3, (b) lin_3, (c) maj_3.

Table 1. Minimum depth, average depth, and number of nodes for 0-decision trees computing the functions con_n, lin_n, and maj_n, n = 2, ..., 16.

      Conjunction            Linear function          Majority function
n     h    h_avg    L        h    h_avg   L           h    h_avg     L
2     2    1.5000   5        2    2       7           2    1.5000    5
3     3    1.7500   7        3    3       15          3    2.5000    11
4     4    1.8750   9        4    4       31          4    3.1250    19
5     5    1.9375   11       5    5       63          5    4.1250    39
6     6    1.9687   13       6    6       127         6    4.8125    69
7     7    1.9844   15       7    7       255         7    5.8125    139
8     8    1.9922   17       8    8       511         8    6.5391    251
9     9    1.9961   19       9    9       1,023       9    7.5391    503
10    10   1.9980   21       10   10      2,047       10   8.2930    923
11    11   1.9990   23       11   11      4,095       11   9.2930    1,847
12    12   1.9995   25       12   12      8,191       12   10.0674   3,431
13    13   1.9998   27       13   13      16,383      13   11.0674   6,863
14    14   1.9999   29       14   14      32,767      14   11.8579   12,869
15    15   1.9999   31       15   15      65,535      15   12.8579   25,739
16    16   1.9999   33       16   16      131,071     16   13.6615   48,619
Table 2. Minimum values of h, h_avg, and L for α-decision trees (α ∈ {0, 0.1, ..., 0.9}) computing functions con_10, lin_10, and maj_10.

       Conjunction           Linear function        Majority function
α      h    h_avg    L       h    h_avg   L         h    h_avg    L
0      10   1.9980   21      10   10      2,047     10   8.2930   923
0.1    0    0        1       10   10      2,047     10   7.1602   717
0.2    0    0        1       10   10      2,047     10   4.9766   277
0.3    0    0        1       10   10      2,047     8    2.8125   61
0.4    0    0        1       10   10      2,047     0    0        1
0.5    0    0        1       0    0       1         0    0        1
0.6    0    0        1       0    0       1         0    0        1
0.7    0    0        1       0    0       1         0    0        1
0.8    0    0        1       0    0       1         0    0        1
0.9    0    0        1       0    0       1         0    0        1
We found that for each $n = 2, \ldots, 16$ there exist totally optimal decision trees for the functions $\mathrm{con}_n$, $\mathrm{lin}_n$, and $\mathrm{maj}_n$ relative to $h$, $h_{avg}$, and $L$. Examples of totally optimal 0-decision trees, relative to $h$, $h_{avg}$, and $L$, for the functions $\mathrm{con}_3$, $\mathrm{lin}_3$, and $\mathrm{maj}_3$ can be found in Fig. 2. Table 1 contains minimum values of depth, average depth, and number of nodes for exact decision trees (0-decision trees) computing the Boolean functions $\mathrm{con}_n$, $\mathrm{lin}_n$, and $\mathrm{maj}_n$, $n = 2, \ldots, 16$.

We now consider some results for $n = 10$. The number of different 0-decision trees for the function $\mathrm{con}_{10}$ is equal to 3,628,800. Each such tree is a totally optimal 0-decision tree for $\mathrm{con}_{10}$ relative to $h$, $h_{avg}$, and $L$. The same holds for the functions $\mathrm{lin}_{10}$ and $\mathrm{maj}_{10}$. The number of different 0-decision trees for the functions $\mathrm{lin}_{10}$ and $\mathrm{maj}_{10}$ is $5.84 \times 10^{353}$ and $2.90 \times 10^{251}$, respectively.

Table 2 contains minimum values of depth, average depth, and number of nodes for $\alpha$-decision trees ($\alpha \in \{0, 0.1, \ldots, 0.9\}$) computing the functions $\mathrm{con}_{10}$, $\mathrm{lin}_{10}$, and $\mathrm{maj}_{10}$. In particular, there is only one 0.3-decision tree for the function $\mathrm{con}_{10}$, and this tree is totally optimal relative to $h$, $h_{avg}$, and $L$. The number of totally optimal 0.3-decision trees for the function $\mathrm{lin}_{10}$ remains the same, $5.84 \times 10^{353}$. However, the number of totally optimal 0.3-decision trees for the function $\mathrm{maj}_{10}$ reduces to $9.83 \times 10^{20}$.
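The count 3,628,800 reported above for the conjunction equals 10!, which can be reproduced with a small counting sketch. The sketch rests on an assumption about which trees are counted: a tree stops as soon as the function is constant on the current subtable and never queries a variable that is already fixed on the path. The helper name `count_exact_trees` and the partial-assignment encoding are illustrative and not taken from the paper.

```python
from functools import lru_cache
from itertools import product


def count_exact_trees(f, n):
    """Count exact decision trees for f under the assumptions stated above."""

    def is_constant(theta):
        # True if f takes a single value on all rows compatible with theta.
        free = [i for i, v in enumerate(theta) if v is None]
        values = set()
        for bits in product((0, 1), repeat=len(free)):
            row = list(theta)
            for i, b in zip(free, bits):
                row[i] = b
            values.add(f(tuple(row)))
            if len(values) > 1:
                return False
        return True

    @lru_cache(maxsize=None)
    def count(theta):
        if is_constant(theta):
            return 1  # a single leaf
        total = 0
        for i, v in enumerate(theta):
            if v is None:
                left = count(theta[:i] + (0,) + theta[i + 1:])
                right = count(theta[:i] + (1,) + theta[i + 1:])
                total += left * right  # independent choices in both branches
        return total

    return count((None,) * n)


def con(x):
    return int(all(x))


print(count_exact_trees(con, 10))  # prints 3628800
```

For the conjunction, the 0-branch of every query is a leaf, so the count satisfies count(n) = n · count(n − 1), which gives n!.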
Table 3. Time (in seconds) to construct decision trees for three known Boolean functions.

n     Conjunction    Linear function    Majority function
2     1.03           0.96               0.95
3     1.07           1.06               1.07
4     1.20           1.18               1.21
5     1.29           1.28               1.31
6     1.41           1.42               1.39
7     1.50           1.53               1.53
8     1.65           1.87               1.80
9     1.71           3.29               3.01
10    1.84           9.08               7.81
11    2.12           29.35              25.53
12    2.90           98.44              87.31
13    4.83           360.84             330.81
14    10.01          1435.52            1304.44
15    27.64          7455.53            8101.94
16    69.85          74135.17           34356.56
Table 4. The existence (example f_i) or nonexistence (–) of a monotone Boolean function (mon) or a Boolean function (all) with n variables which does not have totally optimal decision trees relative to different combinations of cost functions h, h_avg, and L.

      h, L          h, h_avg      h_avg, L      h, h_avg, L
n     mon   all     mon   all     mon   all     mon   all
0     –     –       –     –       –     –       –     –
1     –     –       –     –       –     –       –     –
2     –     –       –     –       –     –       –     –
3     –     –       –     –       –     –       –     –
4     –     –       –     f3      –     –       –     f3
5     –     f1      –     f3      –     f4      –     f3
6     –     f1      f2    f3      f2    f4      f2    f3
7     f5    f1      f2    f3      f2    f4      f2    f3
>7    f5    f1      f2    f3      f2    f4      f2    f3
5.1.1. Empirical time measurements

Table 3 shows the time required to compute the values of the cost functions $h$, $h_{avg}$, and $L$ in Table 1 for $n = 2, \ldots, 16$. We can see that $\mathrm{con}_n$ is the easiest of the three functions for constructing decision trees, while the function $\mathrm{lin}_n$ takes a huge amount of time. The total time for all these experiments was almost 26 h.

5.2. Total Boolean functions

We studied the existence of totally optimal decision trees relative to different combinations of the depth, average depth, and number of nodes for monotone and arbitrary Boolean functions with $n = 0, 1, 2, \ldots$ variables. The obtained results can be found in Table 4. In particular (see the second column of Table 4), for each monotone Boolean function with at most 6 variables, there exists a totally optimal decision tree relative to the depth and number of nodes (relative to $h$ and $L$) computing this function. However, for each $n$, $n \geq 7$, there exists a monotone Boolean function with $n$ variables which does not have a totally optimal decision tree relative to $h$ and $L$. We also give an example of such a function, $f_5$ in this case.

The functions $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ mentioned in Table 4 are:

$f_1 = x_1 \bar{x}_2 \bar{x}_3 \bar{x}_4 \vee \bar{x}_1 \bar{x}_2 x_3 \vee \bar{x}_1 x_3 x_5 \vee \bar{x}_1 x_4 \vee x_2 x_4 \vee x_3 x_4 x_5$

$f_2 = x_1 x_2 x_4 \vee x_1 x_4 x_5 \vee x_5 x_6 \vee x_3 x_4 \vee x_3 x_6$

$f_3 = \bar{x}_1 x_2 \bar{x}_4 \vee \bar{x}_1 x_3 x_4 \vee \bar{x}_2 \bar{x}_3$

$f_4 = x_1 \bar{x}_2 \bar{x}_3 x_5 \vee x_1 x_3 \bar{x}_4 \bar{x}_5 \vee x_1 x_4 x_5 \vee \bar{x}_1 x_2 x_3 x_5 \vee \bar{x}_1 \bar{x}_2 x_3 \bar{x}_5 \vee \bar{x}_1 \bar{x}_2 \bar{x}_3 x_4 \vee \bar{x}_1 x_4 \bar{x}_5 \vee x_2 \bar{x}_3 x_4 \bar{x}_5$

$f_5 = x_1 x_2 x_5 x_7 \vee x_1 x_2 x_6 x_7 \vee x_1 x_3 x_6 x_7 \vee x_1 x_4 x_6 x_7 \vee x_2 x_3 x_6 x_7 \vee x_2 x_5 x_6 x_7 \vee x_1 x_4 x_5 \vee x_2 x_4 x_5 \vee x_3 x_4 x_5$.

Table 5 shows results for experiments on total Boolean functions. For $n = 4, \ldots, 10$ we randomly generated 1000 total Boolean functions. For different combinations of cost functions, we counted the number of Boolean functions for which there exist totally optimal decision trees.
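The existence check for a single function can be sketched in Python for the pair depth and average depth, using f3 from the list above. This is an illustrative sketch and not the authors' tool. It assumes exact trees (α = 0), encodes nodes as partial assignments (a hypothetical encoding), and optimizes total path length first and depth second, on the assumption that total path length is the strictly increasing member of the pair. A totally optimal tree exists if the second-stage depth equals the unrestricted minimum depth.

```python
from functools import lru_cache
from itertools import product

N = 4  # number of variables of f3


def f3(x):
    x1, x2, x3, x4 = x
    return int(bool((not x1 and x2 and not x4)
                    or (not x1 and x3 and x4)
                    or (not x2 and not x3)))


def rows(theta):
    # All rows of the truth table compatible with the partial assignment theta.
    free = [i for i, v in enumerate(theta) if v is None]
    for bits in product((0, 1), repeat=len(free)):
        row = list(theta)
        for i, b in zip(free, bits):
            row[i] = b
        yield tuple(row)


def fix(theta, i, b):
    return theta[:i] + (b,) + theta[i + 1:]


@lru_cache(maxsize=None)
def terminal(theta):
    return len({f3(r) for r in rows(theta)}) == 1


@lru_cache(maxsize=None)
def tpl(theta):
    """Minimum total path length and the variables that attain it at theta."""
    if terminal(theta):
        return 0, ()
    size = sum(1 for _ in rows(theta))
    costs = {i: tpl(fix(theta, i, 0))[0] + tpl(fix(theta, i, 1))[0] + size
             for i, v in enumerate(theta) if v is None}
    best = min(costs.values())
    return best, tuple(i for i, c in costs.items() if c == best)


@lru_cache(maxsize=None)
def depth(theta, restricted):
    """Minimum depth; if restricted, only tpl-optimal variables may be queried."""
    if terminal(theta):
        return 0
    if restricted:
        allowed = tpl(theta)[1]
    else:
        allowed = [i for i, v in enumerate(theta) if v is None]
    return min(max(depth(fix(theta, i, 0), restricted),
                   depth(fix(theta, i, 1), restricted)) + 1
               for i in allowed)


root = (None,) * N
min_tpl = tpl(root)[0]       # minimum total path length
min_depth = depth(root, False)  # unrestricted minimum depth
phi2 = depth(root, True)     # minimum depth among tpl-optimal trees
print(min_tpl, min_depth, phi2)  # prints 44 3 4
print("totally optimal tree exists:", phi2 == min_depth)  # prints False
```

Under these assumptions the sketch agrees with the entry for f3 in Table 4: the only tree of depth 3 has average depth 3, while the minimum average depth is 44/16 = 2.75.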
Table 5. Number of total Boolean functions with totally optimal trees.

n     h, h_avg    h, L     h_avg, L    L, h_avg, h
4     992         1,000    1,000       992
5     985         997      994         982
6     997         999      927         926
7     1,000       1,000    753         753
8     1,000       1,000    426         426
9     1,000       1,000    100         100
10    1,000       1,000    14          14

Table 6. Number of totally optimal trees for different sets of randomly generated partial Boolean functions and different combinations of cost functions.

      Sets S^n_0.25                  Sets S^n_0.50                  Sets S^n_0.75
n     h, h_avg   h, L   h_avg, L     h, h_avg   h, L   h_avg, L     h, h_avg   h, L   h_avg, L
4     839        839    839          839        839    839          778        920    999
5     893        983    994          893        983    994          510        722    976
6     718        898    971          718        898    971          295        539    886
7     491        715    890          491        715    890          146        301    693
8     292        510    638          292        510    638          91         143    329
9     157        279    367          157        279    367          66         67     90
10    49         176    95           49         176    95           49         49     8

We can see that the number of total Boolean functions with totally optimal trees decreases whenever all the cost functions in the optimization sequence are strictly increasing. Based on these results, we can formulate the following two hypotheses: for almost all total Boolean functions there exist (i) totally
optimal decision trees relative to depth and number of nodes, and (ii) totally optimal decision trees relative to the depth and average depth (see Section 6 for proofs of these hypotheses).

5.3. Partial Boolean functions

For each $n = 4, \ldots, 10$, we randomly generated three sets $S^n_{0.25}$, $S^n_{0.50}$, $S^n_{0.75}$ of partial Boolean functions. Each set $S^n_j$, $j \in \{0.25, 0.50, 0.75\}$, $n = 4, \ldots, 10$, contains 1000 table representations of partial Boolean functions with $j \times 2^n$ rows. The obtained results (see Table 6) show that the number of partial Boolean functions with totally optimal trees decreases quickly as the number of variables grows.

6. Proofs of hypotheses

In our experiments, we noted that the depth of the decision trees obtained was almost always equal to the number of variables of the corresponding Boolean functions. This helped us to prove the two proposed hypotheses regarding total optimality of decision trees relative to (i) depth and number of nodes, and (ii) depth and average depth.

Theorem 16. As $n \to \infty$, almost all Boolean functions with $n$ variables have totally optimal decision trees relative to depth and number of nodes.

Proof. A Boolean function with $n$ variables is called exhaustive if the minimum depth of a decision tree computing this function is equal to $n$. Rivest and Vuillemin (see Corollary 3.4 in [21]) proved that, as $n \to \infty$, almost all Boolean functions with $n$ variables are exhaustive.

Let $f$ be an exhaustive Boolean function with $n$ variables and $\Gamma_f$ be a decision tree which computes $f$ and has the minimum number of nodes. It is clear that the depth of $\Gamma_f$ is at most $n$ (from the definition of decision trees presented in this paper). Since $f$ is an exhaustive Boolean function, the depth of $\Gamma_f$ is exactly $n$. Therefore, $\Gamma_f$ is a totally optimal tree for $f$ relative to depth and number of nodes.
Taking into account that $f$ is an arbitrary exhaustive function, we obtain that, as $n \to \infty$, almost all Boolean functions with $n$ variables have totally optimal trees relative to depth and number of nodes.

We can prove the following theorem in a similar way.

Theorem 17. As $n \to \infty$, almost all Boolean functions with $n$ variables have totally optimal decision trees relative to depth and average depth.

The following clarifies the phrase "almost all Boolean functions with $n$ variables" used in Theorems 16 and 17. Rivest and Vuillemin proved (see the proof of Corollary 3.4 in [21]) the following: the probability $P(n)$ that a Boolean function with $n$ variables is not exhaustive is at most $\frac{1}{2^{2k}} \binom{2k}{k}$, where $k = 2^{n-1}$. It is known (see [16]) that

$$\binom{2k}{k} < \frac{2^{2k}}{\sqrt{2k}}$$

for $k \geq 2$. Since $2k = 2^n$, we obtain $P(n) < \frac{1}{\sqrt{2^n}}$ for $n \geq 2$.

Based on this bound and the proof of Theorem 16, we obtain the following statement: the probability that a Boolean function with $n \geq 2$ variables has no totally optimal decision trees relative to depth and number of nodes is less than $\frac{1}{\sqrt{2^n}}$.

We can prove the following statement in a similar way: the probability that a Boolean function with $n \geq 2$ variables has no totally optimal decision trees relative to depth and average depth is less than $\frac{1}{\sqrt{2^n}}$.
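The bound above can be checked numerically for small n. The following Python sketch compares the quantity C(2k, k) / 2^(2k) with k = 2^(n−1) against 1 / sqrt(2^n). The function name `non_exhaustive_bound` is chosen here for illustration.

```python
from math import comb, sqrt


def non_exhaustive_bound(n):
    """Upper bound C(2k, k) / 2^(2k), k = 2^(n-1), on the probability P(n)."""
    k = 2 ** (n - 1)
    return comb(2 * k, k) / 4 ** k


for n in range(2, 9):
    bound = non_exhaustive_bound(n)
    limit = 1 / sqrt(2 ** n)
    print(n, bound, limit, bound < limit)
```

For n = 2 the bound is C(4, 2) / 16 = 0.375, which is below 1 / sqrt(4) = 0.5.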
We have also shown that there exist Boolean functions which have no totally optimal decision trees relative to different combinations of cost functions. From the results presented in Table 4 it follows, in particular, that:

• For $n \leq 4$, each Boolean function with $n$ variables has a totally optimal decision tree relative to depth and number of nodes. In contrast, the Boolean function $f_1$ with five variables has no totally optimal decision trees relative to depth and number of nodes.
• For $n \leq 3$, each Boolean function with $n$ variables has a totally optimal decision tree relative to depth and average depth. In contrast, the Boolean function $f_3$ with four variables has no totally optimal decision trees relative to depth and average depth.

7. Conclusion

In this paper, we have created a set of tools for the study of exact and approximate decision trees for total and partial Boolean functions. These tools allow us to study the existence of totally optimal decision trees relative to different combinations of cost functions. The minimization of decision trees for different cost functions (e.g., depth, number of nodes) is, in general, a hard problem, while we have polynomial-time results when optimizing decision trees for total Boolean functions (when represented as a decision table). We performed experiments on table representations of Boolean functions and formed (and proved) two hypotheses about the total optimality of decision trees for two different pairs of cost functions.
Acknowledgments

Research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST). The authors acknowledge valuable comments and suggestions by anonymous reviewers that clearly improved the readability of this work and helped prove the two hypotheses.

References

[1] Abdulaziz Alkhalid, Talha Amin, Igor Chikalov, Shahid Hussain, Mikhail Moshkov, Beata Zielosko, Dagger: a tool for analysis and optimization of decision trees and rules, in: Francisco V.C. Ficarra, Andreas Kratky, Kim H. Veltman, Miguel C. Ficarra, Emma Nicol, Mary Brie (Eds.), Computational Informatics, Social Factors and New Information Technologies: Hypermedia Perspectives and Avant-Garde Experiencies in the Era of Communicability Expansion, Blue Herons, 2011, pp. 29–39.
[2] Abdulaziz Alkhalid, Talha Amin, Igor Chikalov, Shahid Hussain, Mikhail Moshkov, Beata Zielosko, Optimization and analysis of decision trees and rules: dynamic programming approach, Int. J. Gen. Syst. 42 (6) (2013) 614–634.
[3] Paul Beame, Michael E. Saks, Jayram S. Thathachar, Time-space tradeoffs for branching programs, in: 39th Annual Symposium on Foundations of Computer Science, FOCS '98, November 8-11, 1998, Palo Alto, California, USA, 1998, pp. 254–263.
[4] Harry Buhrman, Ronald de Wolf, Complexity measures and decision tree complexity: a survey, Theoret. Comput. Sci. 288 (1) (2002) 21–43.
[5] Igor Chikalov, Average Time Complexity of Decision Trees, in: Intelligent System Reference Library, vol. 21, Springer-Verlag, Berlin, 2011.
[6] Igor Chikalov, Shahid Hussain, Mikhail Moshkov, Totally optimal decision trees for monotone boolean functions with at most five variables, in: Junzo Watada, Lakhmi C. Jain, Robert J. Howlett, Naoto Mukai, Koichi Asakura (Eds.), 17th International Conference in Knowledge Based and Intelligent Information and Engineering Systems, (KES 2013), Kitakyushu, Japan, in: Procedia Computer Science, vol. 22, Elsevier, 2013, pp. 359–365.
[7] Yves Crama, Peter L. Hammer, Boolean Functions: Theory, Algorithms, and Applications, in: Encyclopedia of Mathematics and its Applications, Cambridge University Press, Cambridge, 2011.
[8] Amol Deshpande, Lisa Hellerstein, Devorah Kletenik, Approximation algorithms for stochastic Boolean function evaluation and stochastic submodular set cover, in: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '14, SIAM, 2014, pp. 1453–1467.
[9] Michael R. Garey, Optimal binary identification procedures, SIAM J. Appl. Math. 23 (1972) 173–186.
[10] David Guijarro, Víctor Lavín, Vijay Raghavan, Exact learning when irrelevant variables abound, Inform. Process. Lett. 70 (1999) 233–239.
[11] Peter L. Hammer, Alexander Kogan, Bruno Simeone, Sándor Szedmák, Pareto-optimal patterns in logical analysis of data, Discrete Appl. Math. 144 (1–2) (2004) 79–102.
[12] Shahid Hussain, Relationships among various parameters for decision tree optimization, in: Colette Faucher, Lakhmi C. Jain (Eds.), Innovations in Intelligent Machines-4 - Recent Advances in Knowledge Engineering, in: Studies in Computational Intelligence, vol. 514, Springer, 2014, pp. 393–410.
[13] Laurent Hyafil, Ronald L. Rivest, Constructing optimal binary decision trees is NP-complete, Inform. Process. Lett. 5 (1) (1976) 15–17.
[14] Richard M. Karp, Reducibility among combinatorial problems, in: Raymond E. Miller, James W. Thatcher (Eds.), Proceedings of a Symposium on the Complexity of Computer Computations, in: The IBM Research Symposia Series, Plenum Press, New York, 1972, pp. 85–103.
[15] Aleksei D. Korshunov, Computational complexity of boolean functions, Russian Math. Surveys 67 (1) (2012) 93–165.
[16] Thomas Koshy, Catalan Numbers with Applications, Oxford University Press, 2008.
[17] O. Erhun Kundakcioglu, Tonguç Ünlüyurt, Bottom-up construction of minimum-cost and/or trees for sequential fault diagnosis, IEEE Trans. Syst. Man. Cybern. Part A 37 (5) (2007) 621–629.
[18] Alberto Martelli, Ugo Montanari, Optimizing decision trees through heuristically guided search, Commun. ACM 21 (12) (1978) 1025–1039.
[19] Mikhail Moshkov, Approximate algorithms for minimization of decision tree depth, in: Guoyin Wang, Qing Yao, Andrzej Skowron (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 9th International Conference, RSFDGrC 2003, Chongqing, China, Proceedings, in: Lecture Notes in Computer Science, vol. 2639, Springer, 2003, pp. 611–614.
[20] Mikhail Moshkov, Time complexity of decision trees, in: James F. Peters, Andrzej Skowron (Eds.), T. Rough Sets III, in: Lecture Notes in Computer Science, vol. 3400, Springer, Heidelberg, 2005, pp. 244–459.
[21] Ronald L. Rivest, Jean Vuillemin, On recognizing graph properties from adjacency matrices, Theoret. Comput. Sci. 3 (3) (1976) 371–384.
[22] Helmut Schumacher, Kenneth C. Sevcik, The synthetic approach to decision table conversion, Commun. ACM 19 (6) (1976) 343–351.