VolumeS. number 1
INFORMA’i’IONPROCESSING LETTERS
May i 976
CONsTRUCTJNG O~JMALJRNARY I;EcJgJON TREES IS NJ’-COMPLETE* Laurent JJYAFJL IRIA - tobode, 78150 Rocqucnczwt, France
and Ronald L. RJVEST &Pt. of Ekct*l Ensrnreritrsand Gmputer Science.MI. T., Cbmiuidge, Mamchusetts
02139, LE,S
Rscdved 7 Novrmbct 1975, revirod venion received 26 January 1976
b baV
dt#Sbn We& computationa! complexity. NPcomplete
1. lnttoduction We demonstrate that cons&g optimal binary de&ion trees ia an NP=compkt.e probtem, where an op timal tree is one which minin&s the expected number c: teats required to identuy the unknown object. Precise defu\ltons of NP-compkte problems are given in refs. f 1.2.41. while the proof to be given is relatively simple, the importance of this result can be measured in terms of the Jarge amount of effort that has been put into fmding efftient aJgorJthms for constructing optimal binary decision trees (see (3,5,6) and their references). Thus at present we may conjecture that no such efficient dgodhm exists (cm the supposition tit P# NP), thereby suppIyingmotivation for finding efficient hetitics for constructing nearsptimal decision trees. c 0 2. JkfJuJtJons J.&X={+ .*.,xn) be a finite set of ob@Vr and let Tt} be a ftite set of teszr For each test 7= IT,, .... TI, 1 Cci 6 f and object x ,I Gf S; R, we either have T-&x$= true or T&) = tlaJse. Without fear of confu+ Research for this pa
was supported by lRlA - Mxwia and the National S&nce Fwadatiop~ w3er NSF contract no. RCR74-12997.
sion, we shall also let Ti denote the set (xEX(‘P@) = true). The problem is to construct an identification procedure for the objects in X such that the expected number of tests required to completely identify an element of X is minimaL An identification procedure is essentilllly a binary decision tree; at the root and all other internal nodes a test is specified, and the terminal nodes specify ob;$Msin X. To apply the identificaprocedur;: one first applies the test specified at the root to tJle unknown object; ifit is fake one takes the left branch, otherwise the right. Ibis procedure is repeated at the root of each succea;ive subtree until one reaches a terminal node which names the unknown object. Jet dx& be thy length of the path from the root of th tree to the terminal node naming Xi, that is, the runber of tests required to identify Xi-Then the cost of this tree is merely the extema! path length, that is, &,,=xp(x$ This mode! is identical to that studied by Carey [3] . The de&m tree problem DT( 7, X, w) is to determine whether there exists a decision tree with cost less than or eqwl to w, given 9and X. y applications, most of theta Tlus probk. has straightforward identification problems. The problem of compiling decision tables is on@such ap~~~a~~~. This proldem can be genereiized by ad each test T&X, but Qj%vTistithese are
May 19’76
INFORMATION PROCESSING LETTERS Table 1
n
1
2
3
4
5
6
7
8
f(n)
0
2
5
8
12
16
20
25
a corresponding problem TIT@‘, X’, w) such that the optimal solution to DT embodies the solution to EC3 if Y exists, thus showing that EC3 Q DT. We define x’ def =
x u f& h cl,
where a, b, c are three elements not in X, and y' def I TU ffxllxfx'l. is NP-complete. Tkorem. DT(‘?, X, Rot j: Clearly DTE 2 I sime a non-deterministx Turing machine can guess the decision tree and &en WCif its weight is iess than 0.requal to w. . To show that DT is NY-complete, we show that IK3 QUT, where EC3 is the problem of finding an exact tiver for a set X, and where each of the s&sets available for use contains exactly c3elements. More precisely, we are given a set X = {xl, .-. , x,) and a family 9= :q;, ..*, Tt) of subsets of X, such that ~TiJ~~3for16i~t,~dwewishtofindasubsetcS of 9’such that
That is, 7’ contains all the triples in 7 plus all the elements in X’ as singleton sets. We wish to shq,o that an optimal decision tree has the form of the OT.Q given in fq. 1 where the sets T* ... . Tik form an exact cover of X by triples from $ kd the singleton sets (tests) {areused to distiquish elements within each triple. While this may seem intuitively true to the &ader, we give F rigorous proof. Let f(n) denote the minimal cost (path length) of a decision tree on an n=element set X, where all one, two, and three element subsets of X are available a~ tests. Table 1 gives the first few values off(n). We wish to show that the root of the optimal decision tree always selects a 3 element subset of X as a test, for n 3 7. Py definition we have
f(n)=1s3 Theexact cover problem EC (where ther,e is no reestrictiomron the size of each Tt) is known to be NP= complete (see ref. [4]). To shoQ that EC3 is complete, ws show that 3DM Q EC3, where 3DM is the problem of fmding a “thr+e~nsional n&hing”. That is, we :jpcgiven a ~etUEBXBXB~dwew~tof'indasui~~t i/such that I Wj = iB1 and no two
Wof
elements .>f W agree in any coordinate. Karp [4] shows that 3DM is NPcomplete. But 3DM becomes EC3 if we let U consist of a set of triples chosen from B, U B2 U B3 where B, , B2, B, are distinct sets of size ISrland each element of U aMains one element from each BP Thus EC3 is NP
[f(n-i)+f@+h].forn>4,
where i represents the size of the subset chosen as the test (atthe root. But if f(m) -f(m - 1) > 3 for n - 2 ~;m9n-I,theni~3m~tk~Rarentominimize fln),andf(n)-fQ - lj=f(n -3)-fin -4)+ 1 > 3 as well. Thus a 3 element subset must be chosen at the root for all n > 7. Since the decision tree of fig. 1 a&eves a cost of f(IX’l) exactly, it is optimal. FurthermoLe, any opti= mal decision tree must embody the solution to the corresponding EC3 problem, because only in this way the root of each sufficiently large subtree be a teat on a three element subset of X. Thus EC3 Q DT so that D ‘DTis NP-complete.
Volume S, number 1
INFORMATION
PROCESSING LETTERS
4. Corwlusions The construction of optimal binary decision trees is shown to be NP-cornplete,and thus very likely of non-polynomial time complexity (assuming that P f NP is probable).The results ejvenhere also imply that various relatedcost measuresfor binary decision trees yield NP-completeconstruction problems, for example if probabilitiesare assignedto the objects, or if the worst-case(instead of average)numberof tests is used (the same construction works). Accordingly, it is to be expected that goud heuristics for constructing newoptimal binary decision trees will be the best solution to this problem in the near ft:eure.
May 1976
References A.V. Aho, 1.L Hopwir aiad J.D.Ullman, I%.