Pattern Recognition, Vol. 26, No. 6, pp. 883-889, 1993
Printed in Great Britain
0031-3203/93 $6.00+.00
© 1993 Pattern Recognition Society. Pergamon Press Ltd
A MORE EFFICIENT BRANCH AND BOUND ALGORITHM FOR FEATURE SELECTION

BIN YU and BAOZONG YUAN

Institute of Information Science, Northern Jiaotong University, Beijing 100044, People's Republic of China

(Received 6 August 1991; in revised form 22 September 1992; received for publication 16 December 1992)
Abstract--An algorithm is presented that selects the globally optimal subset of d features from a larger D-feature set. This is a fundamental problem in statistical pattern recognition and combinatorial optimization, for which exhaustive enumeration is computationally infeasible in most applications. The algorithm dynamically searches for the globally optimal solution on a minimum solution tree, a subtree of the solution tree used in the traditional branch and bound algorithm. The new algorithm is compared theoretically with the branch and bound algorithm; the analysis and the experimental results show that it is more efficient than the traditional algorithm.

Feature selection    Combinatorial optimization    Solution tree    Pattern recognition
1. INTRODUCTION
In statistical pattern recognition, designing a classifier with fewer selected features not only improves its usefulness when only a small sample set is available, but also reduces the cost of the system in most applications. Let S(D, d) denote the problem of selecting the optimal d-feature subset from the D-feature set X_D = {x_1, ..., x_D}. There are C_D^d = D!/((D - d)! d!) candidate d-feature subsets X_d ⊂ X_D for this problem. One of them, X*_d, which satisfies

J(X*_d) = max_{X_d ⊆ X_D} J(X_d)    (1)

is its globally optimal solution, where J is a criterion function which here should satisfy a monotonicity property, i.e.

J(X_s) ≥ J(X_t), if X_s ⊇ X_t.    (2)

The monotonicity requirement is not particularly restrictive, as it merely means that a subset of features should be no better than any larger set that contains it. Indeed, a large variety of feature selection criteria satisfy the monotonicity property; discriminant functions and distance measures such as the Bhattacharyya distance and divergence are examples. Exhaustive evaluation of all the subsets is computationally costly, since the number of subsets to be considered grows rapidly with the total number of features. Stepwise techniques,(1) dynamic programming(2) and other solutions(3,4) are efficient since they avoid the cost associated with exhaustive enumeration; however, they cannot guarantee that the selected feature subset yields the globally best value of the criterion among all candidate subsets.(5,6) The branch and bound (BAB) algorithm(7) is a traditional feature selection method which avoids exhaustive enumeration and still guarantees that the selected feature subset yields the globally best value of any criterion that satisfies the monotonicity property. This paper presents a more efficient BAB algorithm, called BAB+, which achieves higher performance by reducing the searching time compared with BAB. Experimental results show that about a third of BAB's computation can usually be eliminated by the new algorithm in feature selection problems.
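To make this combinatorial growth concrete, the short Python sketch below (our illustration, not part of the original paper) tabulates C_D^d for a few problem sizes using only the standard library.

    from math import comb

    # Number of candidate d-feature subsets of a D-feature set:
    # C(D, d) = D! / ((D - d)! d!).
    for D, d in [(6, 2), (12, 2), (24, 12), (50, 25)]:
        print(f"S({D},{d}): {comb(D, d):,} candidate subsets")

    # S(24,12) already has 2,704,156 candidates and S(50,25) more than 1.2e14,
    # so exhaustive enumeration quickly becomes infeasible.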
2. THE BRANCH AND BOUND METHOD
The BAB algorithm is based on a solution tree: a tree in which the terminal nodes, each holding the required number of features, enumerate all candidate solutions of selecting d features from D, the subset held by a node contains the subset held by each of its successors, and the relationship of their criterion values satisfies (2). The next section analyses this tree intensively. The BAB algorithm, which takes the simplest-subtree-first search strategy, successively generates portions of the solution tree and computes the criterion. Let B be the best criterion value (bound) of the terminal nodes found so far in the search. Whenever the criterion evaluated for a node is not larger than the bound B, by (2) all nodes that are successors of that node also have criterion values not larger than B, and therefore cannot be the optimum solution; the subtree under that node is thus implicitly rejected. This rejection never discards the optimum solution. Whenever a terminal node holds a criterion value larger than the bound, the bound is replaced with that value. The node holding the final bound is the solution node of this algorithm.
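For illustration only, the following Python sketch implements the pruning logic just described on a recursively generated solution tree; the function names and the depth-first organization are our own choices, and the node-ordering heuristic of the original algorithm is omitted.

    def branch_and_bound(features, d, criterion):
        # Select the globally optimal d-subset under a monotonic criterion:
        # criterion(S) >= criterion(T) whenever S is a superset of T, so any
        # node whose value does not exceed the bound is pruned with its subtree.
        best_value, best_subset = float("-inf"), None

        def search(current, start):
            nonlocal best_value, best_subset
            if len(current) == d:                      # terminal node
                value = criterion(current)
                if value > best_value:                 # update the bound B
                    best_value, best_subset = value, tuple(current)
                return
            if criterion(current) <= best_value:       # bound check: reject subtree
                return
            remaining = len(current) - d               # discards still needed
            # Branch: discard one feature; the index bound keeps every candidate
            # subset reachable exactly once (the solution tree of Section 3).
            for i in range(start, len(current) - remaining + 1):
                search(current[:i] + current[i + 1:], i)

        search(tuple(features), 0)
        return best_subset, best_value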
3. MINIMUM SOLUTION TREE
The following analysis is helpful to quantitatively analyse the inherent structure of the solution tree. Define a four-tuple for the solution tree according to
the problem S(D, d) as

T(D, d) = (N, E, r, L)

where N is a node set; E is an edge set in which an edge e(n, n_c) denotes that node n_c holds all features held by node n except one discarded feature (the former is one of the latter's children); r is the root node, which holds the feature set X_D; and L is the terminal node set, which contains C_D^d nodes, each holding d features. Figure 1 is an example of the solution tree T(6, 2) for problem S(6, 2). The numeral k near a node means that feature x_k is discarded at this node from its parent's set, and s indicates the depth of nodes in the solution tree, s = 0, 1, ..., d̄, where d̄ = D - d. The depth of every terminal node (in set L) is d̄, and any node at depth s contains D - s features. Let N^s = {n | n ∈ N and n's depth is not larger than s}, E^s = {e(n, n_c) | e ∈ E and n, n_c ∈ N^s}, and let L^s ⊆ N^s be the set of nodes with depth exactly s. We can define a subtree of T(D, d) = (N, E, r, L) as

T^s(D, d) = (N^s, E^s, r, L^s)

which has the same structure as T(D, d), excluding the nodes with depth larger than s and the corresponding edges. When s = d̄ we have N^d̄ = N, E^d̄ = E, L^d̄ = L and T^d̄(D, d) = T(D, d). Obviously, the subtree T^s(D, d) of T(D, d) has the same structure as the tree T(d + s, d) associated with the problem S(d + s, d), i.e. T^s(D, d) = T(d + s, d). In other words, the generation of a solution tree T(D, d) can be regarded as a recursive procedure from s = 0 to d̄:

T^0(D, d) = T(d, d) = (N^0, E^0, r, L^0)
T^1(D, d) = T(d + 1, d) = (N^1, E^1, r, L^1)
...
T^s(D, d) = T(d + s, d) = (N^s, E^s, r, L^s)
...
T^d̄(D, d) = T(D, d) = (N, E, r, L).

The procedure is given as:

(1) T^0(D, d) = (N^0, E^0, r, L^0), where N^0 = {r}, E^0 = ∅, node r holds the d features {x_1, ..., x_d}, and L^0 = {r}.
(2) Let T^s(D, d) = (N^s, E^s, r, L^s); then T^{s+1}(D, d) = (N^{s+1}, E^{s+1}, r, L^{s+1}), where N^{s+1} = N^s + L^{s+1} and E^{s+1} = E^s + E', and L^{s+1} and E' are produced by the following rules. The nodes in L^s always contain d features and represent the combinations of selecting d features from d + s. Add the kth feature x_k, k = d + s + 1, into each node in N^s of T^s(D, d); now each node in L^s of T^s(D, d) holds d + 1 features. From this L^s we can produce the L^{s+1} for T^{s+1}(D, d), which contains the nodes representing the combinations of selecting d features from d + s + 1. In order from left to right (referring to Fig. 1), generate all of node n's (n ∈ L^s) possible child nodes {n_c}, each by discarding one feature x_i, i = 1, ..., d + s + 1, from the feature set held by node n. Some of them are collected into L^{s+1} under the rule that if n_c is not yet in L^{s+1} (initialized as L^{s+1} = ∅), then n_c is added to L^{s+1} and the edge e(n, n_c) is added to E' (also initialized as E' = ∅). The number of nodes in L^{s+1} is |L^{s+1}| = C_{d+s+1}^d.

Fig. 1. A solution tree T(6, 2) for problem S(6, 2) with some string-structure subtrees.

Now two theorems can be deduced.

Theorem 1. The solution tree T(D, d) = (N, E, r, L) has C_{d+s}^d nodes at depth s, and in total it has |N| = C_{D+1}^{d+1} nodes.

Proof. For problem S(n, m), T(n, m) has C_n^m = n!/((n - m)! m!) terminal nodes. Consider a subtree of T(D, d), T^s(D, d) = (N^s, E^s, r, L^s). Because T^s(D, d) = T(d + s, d), it has C_{d+s}^d terminal nodes, i.e. |L^s| = C_{d+s}^d. And

|N| = Σ_{s=0}^{D-d} |L^s| = Σ_{s=0}^{D-d} C_{d+s}^d = C_{D+1}^{d+1}.  □

Theorem 2. The degree of the rightmost child node n_R of a node n_p is one if n_R ∉ L, and such a node is called a one-degree node. The degree of any successor of a one-degree node is not larger than one, and there are C_{D-1}^{d+1} one-degree nodes in T(D, d) (see the nodes linked by dashed lines in Fig. 1).

Proof. Assume node n_R is node n_p's rightmost child, and that they are at depths s + 1 and s, respectively. Considering the recursive procedure, discarding the added feature x_{d+s+1} at node n_R generates at least one child node that has not been generated by the nodes to its left. This means that the node n_R has a successor. If n_R had another child node, it would have to be generated by discarding a feature x_i with i < d + s + 1 from the node n_R. Then, according to step (2) of the recursive procedure, when generating n_p's children, discarding the feature x_i would generate a child of n_p to the right of n_R, because it would not yet be in L^{s+1}; i.e. n_R would not be n_p's rightmost child, which is a contradiction. The child node of a one-degree node must be its rightmost one; therefore its degree is one when it does not belong to the set L, and zero otherwise. In T(D, d), each node whose depth is not larger than d̄ - 2 has one and only one rightmost child that is a one-degree node. Let N_1 be the set of all these one-degree nodes; then

|N_1| = Σ_{s=0}^{d̄-2} |L^s| = Σ_{s=0}^{D-d-2} C_{d+s}^d = C_{D-1}^{d+1}.  □
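As a quick check of the two theorems (our illustration, not part of the paper), one can generate T(D, d) with the recursive procedure above and count its nodes and one-degree nodes:

    from math import comb

    def tree_counts(D, d):
        # Count all nodes and all one-degree nodes of the solution tree T(D, d).
        total = one_degree = 0

        def walk(size, start):
            # size: number of features held at this node; start: smallest position
            # allowed for the next discard (the left-to-right generation rule).
            nonlocal total, one_degree
            total += 1
            remaining = size - d                   # discards still to make
            if remaining == 0:                     # terminal node in L
                return
            children = range(start, size - remaining + 1)
            if len(children) == 1:                 # a one-degree node
                one_degree += 1
            for i in children:
                walk(size - 1, i)

        walk(D, 0)
        return total, one_degree

    for D, d in [(6, 2), (9, 2), (12, 2)]:
        total, ones = tree_counts(D, d)
        assert total == comb(D + 1, d + 1)         # Theorem 1
        assert ones == comb(D - 1, d + 1)          # Theorem 2
        print(f"T({D},{d}): {total} nodes, {ones} one-degree nodes, "
              f"{total - ones} nodes on the minimum tree")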
In fact, any tree in which the feature subset held by a node contains the subset held by each of that node's children, and in which every candidate solution subset is held by one terminal node, can be used as the above solution tree, and a corresponding feature selection algorithm can be designed. The efficiency of such an algorithm depends on the solution tree used. It has been seen that the one-degree nodes of the solution tree T form string-structure subtrees. Because the solution node is included in L, all of these subtrees can be pruned, and a minimum solution tree T_M can be obtained from T by removing the one-degree nodes. Figure 2 is a minimum solution tree T_M(6, 2) for S(6, 2).

Fig. 2. A minimum solution tree T_M(6, 2) for problem S(6, 2) without any string-structure subtrees.
Fig. 3. Flowchart for the BAB+ algorithm.

4. THE BAB+ ALGORITHM
Let J̄(X̄) = J(X_D - X̄), where X̄ ⊂ X_D denotes a set of discarded features. By (2) the solution of problem S(D, d) can be interpreted as finding the d̄-feature subset X̄* of discarded features such that

J̄(X̄*) = max_{X̄_d̄ ⊂ X_D} J̄(X̄_d̄)

and X*_d = X_D - X̄*. For a criterion function J satisfying monotonicity as in (2), this condition can be rewritten as

J̄(X̄_s) ≥ J̄(X̄_t), if X̄_s ⊆ X̄_t.    (3)
Algorithm
Like the BAB algorithm, the BAB+ algorithm finds the optimal subset of features by traversing the minimum solution tree from the root node to the terminal nodes. Let B be a bound denoting the best value J̄(X̄_d̄) found so far, i.e. B = J̄(X̄_d̄). If J̄(X̄_s) (s < d̄) is not larger than B, then by (3), J̄(X̄_t) ≤ B for all possible X̄_t (t = s + 1, ..., d̄) which contain X̄_s. This means that whenever the criterion evaluated for a node (holding X_D - X̄_s) is not larger than the bound B, all nodes that are successors (holding X_D - X̄_t, t = s + 1, ..., d̄) of that node, if any, also have criterion values not larger than B, and therefore cannot be the optimal solution. BAB+ does not generate these nodes, and during the search it replaces the current bound with any larger criterion value held by a terminal node; the bound corresponding to the criterion value of the globally optimal solution node will never be replaced. Because it is based on T_M, the algorithm implicitly skips over the one-degree nodes of T; we say it short-traverses, which improves the algorithm's efficiency. Figure 3 is the flowchart of the BAB+ algorithm. The following notation will be used in the BAB+ algorithm.
LIST(s): A stack storing the features enumerated at level s.
POINTER(s): The pointer to the element of LIST(s) being currently considered.
SUCCESSOR(s, k): The number of successors that the kth element in LIST(s) can have.
AVAIL: A list of the available features that LIST(s) can assume.

Step 0. /* Initialization */
Set B = B_0 (B_0 is an arbitrary value less than J(X*_d)); AVAIL = X_D; s = 1; X̄_0 = ∅; LIST(0) = ∅; SUCCESSOR(0, 1) = d + 1; POINTER(0) = 1.

Step 1. /* Initialize LIST(s) */
Set NODE = POINTER(s - 1). Rank the features x_k ∈ AVAIL in increasing order of J̄(X̄_{s-1} + {x_k}). Remove the smallest p features from AVAIL to LIST(s) in increasing order (with the bottom element of LIST(s) being the feature yielding the smallest J̄), where p = SUCCESSOR(s - 1, NODE). Without loss of generality, denote these p features by x_s, x_{s+1}, ..., x_{s+p-1} in increasing order. Set SUCCESSOR(s, i) = p - i + 1 for i = 1, 2, ..., p.

Step 2. /* Handle the first node, i.e. the rightmost child */
Pop the top element of LIST(s), x.

Step 3. /* Short traverse */
If [s = d̄] {Compute J_c = J̄(X̄_{s-1} + {x}); go to Step 4};
Else {Let the elements left in AVAIL form the set X̄_{d̄-s}; compute J_c = J̄(X̄_{s-1} + {x} + X̄_{d̄-s})}.

Step 4. /* Check bound */
Return x to AVAIL; If [J_c ≤ B] go to Step 6.

Step 5. /* Update bound */
Set B = J_c; record X̄* = X̄_{s-1} + {x} + X̄_{d̄-s} (at s = d̄ the set X̄_{d̄-s} is empty).

Step 6. /* Handle the next node */
If [LIST(s) = ∅] go to Step 8;
Else {Set POINTER(s) to the current number of elements in LIST(s); pop the top element of LIST(s), x}.

Step 7. /* Check bound */
If [J_c = J̄(X̄_{s-1} + {x}) ≤ B] {Return x to AVAIL; go to Step 6};
Else if [s = d̄] {Return x to AVAIL; go to Step 5};
Else {Set X̄_s = X̄_{s-1} + {x}; set s = s + 1; go to Step 1}.

Step 8. /* Backtrack */
Set s = s - 1; If [s = 0] terminate the algorithm; Else {Return x to AVAIL; go to Step 6}.
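The following Python sketch (our simplified rendering, not the paper's notation) captures the essence of BAB+: the rightmost child of each node heads a chain of one-degree nodes whose terminal node holds exactly the first d features of the current set, so that terminal is evaluated directly and the chain is never generated. Step 1's feature-ranking heuristic is omitted for brevity.

    def bab_plus(features, d, criterion):
        # Optimal feature selection on the minimum solution tree T_M.
        best_value, best_subset = float("-inf"), None

        def record(subset):
            nonlocal best_value, best_subset
            value = criterion(subset)
            if value > best_value:                 # update the bound B
                best_value, best_subset = value, tuple(subset)

        def search(current, start):
            if len(current) == d:                  # terminal node
                record(current)
                return
            # Short traverse: the rightmost child's one-degree chain ends at
            # the terminal node holding current[:d]; evaluate it directly
            # without generating the intermediate one-degree nodes.
            record(current[:d])
            # The other children are expanded with the usual bound check.
            for i in range(start, d):
                child = current[:i] + current[i + 1:]
                if criterion(child) > best_value:  # monotonicity, as in (3)
                    search(child, i)

        search(tuple(features), 0)
        return best_subset, best_value

    # Toy usage with a monotonic criterion (a sum of non-negative per-feature
    # scores); the optimum for selecting 2 of 6 features is the top-2 scores:
    # scores = dict(zip("abcdef", [3, 1, 4, 1, 5, 9]))
    # print(bab_plus("abcdef", 2, lambda S: sum(scores[f] for f in S)))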
Adoptability

It is clear that the solution node n* holding the globally optimal feature subset of size d, X*_d, is certainly on the minimum solution tree T_M. Provided only that the initial bound B_0 is adequate, i.e. less than the value of the criterion function on the optimal subset, BAB+ will eventually find this subset. The algorithm rejects only those nodes, and the successors of those nodes, whose criterion values are not larger than the current bound. That is to say, BAB+ is guaranteed to find the globally optimal feature subset without exhaustive enumeration.
5. COMPARISON AND EXAMPLES

Comparison

As we know, the branch and bound algorithm(7) is the most efficient algorithm so far whose solution is the global optimum, so it is reasonable to compare the new algorithm with it. Figure 4 is the flowchart for this algorithm, where an original mistake in reference (7), marked with the sign "×", is corrected with dashed lines here. The BAB algorithm is implemented on the solution tree T, so the comparison is primarily between T_M and T. Theorems 1 and 2 tell us that T(D, d) has |N| = C_{D+1}^{d+1} nodes, while T_M(D, d) has |N_1| = C_{D-1}^{d+1} fewer nodes than T(D, d). Define

R(D, d) = 1 - |N_1|/|N| = 1 - (D - d)(D - d - 1)/((D + 1)D);

then R < 1, i.e. the number of nodes on T_M is less than that on T. Let d = kD, 0 < k < 1. Usually D - d ≫ 1, and thus R ≈ 1 - (1 - k)^2, where k is regarded as a width coefficient of the solution tree T. This approximate relation, shown in Fig. 5, expresses that the nodes on T_M are always fewer than those on T and that R decreases, following the square law, as k decreases: the narrower the solution tree T, the smaller the ratio R of the number of nodes on T_M to that on T. Moreover, the BAB algorithm uses the simplest-subtree-first search approach, so the one-degree nodes of T have a higher probability of being traversed than other nodes; therefore the computational complexity of the BAB+ algorithm is less than that of the BAB algorithm.

Fig. 4. Flowchart for the BAB algorithm (an original mistake in reference (7), marked with the sign "×", is corrected with dashed lines here).

Fig. 5. The approximate ratio of the numbers of nodes on T_M and T with respect to the width coefficient k.
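Plugging the two problems of the experiment below into the formula for R above (our arithmetic, as a small illustration) shows the effect of the tree width directly; the exact ratio is used alongside the large-D approximation.

    from math import comb

    def R(D, d):
        # Fraction of T's nodes that remain on the minimum tree T_M.
        return 1 - comb(D - 1, d + 1) / comb(D + 1, d + 1)

    for D, d in [(9, 2), (12, 2)]:
        print(f"S({D},{d}): R = {R(D, d):.3f}, "
              f"large-D approximation 1-(1-d/D)^2 = {1 - (1 - d / D) ** 2:.3f}")

    # R(12,2) = 0.423 < R(9,2) = 0.533: the narrower tree T(12,2) keeps a
    # smaller fraction of nodes, consistent with Table 1 below.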
Table 1

Problems    Algorithms    Nodes    Expanded    Time (min)
S(12, 2)    BAB+          121      17          21.1
            BAB           286      41          32.0
S(9, 2)     BAB+          64       15          14.4
            BAB           120      29          20.9
Examples

The first example is the problem S(6, 2) mentioned in Sections 2 and 3. Figures 6(a) and (b) are its solution tree and minimum solution tree, respectively, where each node is assigned an artificial criterion value. In these two trees, the subtrees indicated with dashed lines are not generated by the algorithms, since their root nodes' criterion values are not larger than the previous bounds; B_1, B_2 and B_3 are the bounds found in order. The nodes with asterisks are the solution nodes, which indicate that the globally optimal criterion value is 58 for this problem. This example shows that on the solution tree the BAB algorithm computes 19 nodes, while on the minimum solution tree the BAB+ algorithm computes only 11 nodes.

Fig. 6. T and T_M according to Figs 1 and 2, each node with an artificial criterion value for the first example: (a) solution tree for the BAB algorithm, where 19 nodes are computed; (b) minimum solution tree for the BAB+ algorithm, where 11 nodes are computed.

The second example is an application of the BAB+ algorithm to a real-world problem.(8) There are two groups of samples from two types of symbols in circuit diagrams, one of 256 samples and the other of 192, both in a 21-dimensional feature space in which a 12-dimensional subspace is originally used in the classification of one type of symbol and a 9-dimensional subspace for the other. To design a classifier distinguishing these two groups of symbols, we need to select the two
globally optimal features from each subspace. Two criterion functions are given by

J^(1) = tr Σ(1, 1) / tr Σ(1, 2)

and

J^(2) = tr Σ(2, 2) / tr Σ(2, 1)
where tr Σ(i, j) is the trace of the covariance matrix of the ith group of samples in the jth subspace. The experiment was implemented on an SGI 4D/25 workstation. Table 1 gives the experimental results. The experiment shows that BAB+ expands fewer nodes than BAB, and so is more efficient; on the other hand, it guarantees, as BAB does, that the selected feature subset is globally optimal. The cost of opening a node varies, and for one-degree nodes it is higher than the average, so the time ratio and the expanded-node ratio differ, the former being a little higher. The experimental results also show that the new algorithm is applied more efficiently to the problem S(12, 2) than to the problem S(9, 2), since T(12, 2) is narrower than T(9, 2).
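A minimal sketch of a trace-based criterion of the form above (our reading; the exact construction and data handling in reference (8) may differ, and the variable names are ours). Note that, unlike the distance measures cited in Section 1, a ratio of traces is not automatically monotonic, so in practice one would verify that property (2) holds for the data at hand.

    import numpy as np

    def trace_criterion(group_a, group_b, subset):
        # Ratio of the two groups' covariance traces over the selected
        # features; one plausible form of J^(1) or J^(2) restricted to a
        # candidate subset.
        cov_a = np.atleast_2d(np.cov(group_a[:, subset], rowvar=False))
        cov_b = np.atleast_2d(np.cov(group_b[:, subset], rowvar=False))
        return np.trace(cov_a) / np.trace(cov_b)

    # Hypothetical data shaped like the experiment: 256 and 192 samples in a
    # 12-dimensional subspace; the criterion plugs into bab_plus (Section 4).
    rng = np.random.default_rng(0)
    X1, X2 = rng.normal(size=(256, 12)), rng.normal(size=(192, 12))
    crit = lambda S: trace_criterion(X1, X2, list(S))
    # bab_plus(range(12), 2, crit)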
Acknowledgements--This research was supported in part by the China Postdoctoral Science Foundation and in part by the China National Science Foundation under grant 9690015.
REFERENCES
1. A. N. Mucciardi and E. E. Gose, A comparison of seven techniques for choosing subsets of pattern recognition properties, IEEE Trans. Computers 20, 1023-1031 (1971).
2. C. Y. Chang, Dynamic programming as applied to feature subset selection in a pattern recognition system, IEEE Trans. SMC 3, 166-171 (1973).
3. S. D. Stearns, On selecting features for pattern classifiers, Proc. 3rd IJCPR, Coronado, pp. 71-75 (1976).
4. J. Kittler, Feature set search algorithms, Pattern Recognition and Signal Processing, C. H. Chen, ed., pp. 41-60. Sijthoff and Noordhoff, Alphen aan den Rijn (1978).
5. T. M. Cover, The best two independent measurements are not the two best, IEEE Trans. SMC 4, 116-117 (1974).
6. T. M. Cover and J. M. Van Campenhout, On the possible orderings in the measurement selection problem, IEEE Trans. SMC 7, 657-661 (1977).
7. P. M. Narendra and K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans. Computers 26, 917-922 (1977).
8. B. Yu and B. Yuan, A feature selection method for multi-class-set classification, Proc. IJCNN'92, Baltimore (1992).
About the Author--BIN YU was born in Shanghai, China, on 29 August 1962. Since May 1990 he has been with the Institute of Information Science, Northern Jiaotong University, China, as a Postdoctoral Fellow, and he is currently an associate professor in the institute. He obtained his Ph.D. in electronic engineering and computer science from Tsinghua University, China, in 1990, his M.S. in biomedical engineering from Tianjin University, China, in 1986, and his B.S. in mechanical engineering from Hefei Polytechnic University, China, in 1983. His research interests are in pattern recognition, applications of AI, document image analysis, and machine reading of engineering drawings. Dr Yu has been a referee for the IEEE journal Computer since 1990. He has published over 20 journal and conference articles in the areas of pattern recognition, artificial intelligence, and image analysis.

About the Author--BAOZONG YUAN received his Ph.D. in electrical engineering from the Leningrad Institute of Railway Engineering, U.S.S.R., in 1960. He has been with Northern Jiaotong University since 1953. He is now a professor of electrical engineering, and serves as the director of the Information Science Institute as well as head of the Image and Information Processing Laboratory at the university. His research interests include digital signal processing, speech signal processing and data communication. He has authored numerous journal publications, and has edited and authored three books on signal processing. Dr Yuan is a member of the IEEE and Chairman of the Computer Chapter of the IEEE Beijing Section. He is a Fellow of the China Institute of Electronics and a member of the Board of Directors of that organization.