Computational Statistics & Data Analysis 47 (2004) 311 – 322 www.elsevier.com/locate/csda
Discrete support vector decision trees via tabu search

Carlotta Orsenigo, Carlo Vercellis∗

Dipartimento di Ingegneria Gestionale, Politecnico di Milano, p.za Leonardo da Vinci 32, Milano 20133, Italy

∗ Corresponding author. Tel.: +39-022-399-2784; fax: +39-022-399-2720. E-mail address: [email protected] (C. Vercellis).

Received 9 November 2003; received in revised form 9 November 2003
Abstract

An algorithm is proposed for generating decision trees in which multivariate splitting rules are based on the new concept of discrete support vector machines. By this term a discrete version of SVMs is denoted in which the error is properly expressed as the count of misclassified instances, in place of a proxy of the misclassification distance considered by traditional SVMs. The resulting mixed integer programming problem formulated at each node of the decision tree is then efficiently solved by a tabu search heuristic. Computational tests performed on both well-known benchmark and large marketing datasets indicate that the proposed algorithm consistently outperforms other classification approaches in terms of accuracy, and is therefore capable of good generalization on validation sets.
© 2003 Published by Elsevier B.V.

Keywords: Classification; Decision trees; Support vector machines; Tabu search
1. Introduction

Statistical classification problems have a wide variety of applications in such diversified fields as marketing, finance, fraud detection and medical diagnosis. Moreover, the advent of new themes of high impact on the business community, like data mining and analytics for customer relationship management, has further increased the popularity of classification well beyond the scientific niche.
In a classification problem we are provided with a set of instances whose associated class is already known. For example, instances may represent the customers of a bank,
whose known class indicates whether an individual has subscribed to a given investment fund in the past. Each instance is represented by the values of a number of attributes, such as age, income and number of children of each customer in our example. We are then required to build a classifier, that is, an algorithm able to learn from the available data and to classify with maximum accuracy new instances whose class is still unknown. In this paper, we will restrict our attention to binary classification problems, in which the class may assume only two different values.
Although many different approaches to classification have been developed in the past, such as neural networks or statistical discriminant analysis, it appears that decision trees have attracted a great deal of interest, particularly in the data mining context. A possible explanation for their popularity may derive from the intuitive appeal and the understandability of the discrimination rules generated through decision trees. The reader is referred to some comprehensive surveys on classification trees (Murthy, 1998; Safavin and Landgrebe, 1991).
Although myopic single-variable rules, based on information theoretic concepts, have largely prevailed in the literature for deriving the ramification at each node of a decision tree, such as in CART (Breiman et al., 1984) or C4.5 (Quinlan, 1993), more sophisticated approaches based on a multivariate splitting rule at each node have been proposed. In general, most authors have focused on the construction of linear (or oblique) splitting rules, which in many cases are based upon mathematical programming models (Mangasarian et al., 1990; Mangasarian, 1993; Bennett and Mangasarian, 1992, 1994). In particular, the technique proposed in Bennett et al. (2000) relied on the theory of support vector machines (SVM), originally developed by Vapnik (1995, 1998). The objective function of the optimization problem considered by SVM is based on a principle known as structural risk minimization (SRM), and is aimed at minimizing the sum of the empirical classification error and the generalization error, in order to achieve a higher prediction accuracy on unseen data, such as validation or future instances.
Although in a classification problem the error should properly be evaluated by counting the number of misclassified instances, the SVM approach takes a proxy of the total misclassification distance from the canonical supporting hyperplane as a continuous approximation of the discrete error. This has the computational advantage of avoiding the overwhelming complexity of mixed integer programming (MIP) models, permitting the application of efficient linear programming (LP) techniques. However, it could reasonably be asked whether a radically different approach, based on an accurate discrete modeling of the error combined with an approximate solution of the resulting MIP problem via efficient heuristics, might lead to an overall increase in classification accuracy.
In this paper we propose a new classifier algorithm for generating decision trees in which linear splitting rules are based on discrete support vector machines (DSVM). By this term we denote SVMs in which the empirical classification error is represented by the discrete function counting the number of misclassified instances, subsequently termed the misclassification rate, in place of a proxy of the misclassification distance considered by traditional SVM approaches.
The resulting MIP model, formulated at each node of the tree, is then efficiently solved by a tabu search heuristic. Thus, the discrete optimization problem considered at each node selects a linear combination of
the attributes to determine the best hyperplane separating the instances belonging to the node. The objective function to be minimized is composed of the weighted sum of two terms, expressing a trade-off between the accuracy of the separating hyperplane on the training instances and its potential for generalization to future unseen data.
The minimization of the misclassification rate has been considered previously in the literature, with significant differences from the approach presented in this work. On the one hand, some papers (Mangasarian, 1994, 1996; Chunhui Chen and Mangasarian, 1996; La Torre and Vercellis, 2003) focused on the misclassification rate alone, not in conjunction with margin maximization, and transformed the discrete problem into a nonlinear optimization model by means of appropriate smoothing techniques. On the other hand, MIP models were formulated by other authors (Koehler and Erenguc, 1990; Lam et al., 1996), but once again without including the margin in the objective function. Actually, our computational evidence indicates that the margin plays a crucial role in improving the generalization capability of the classifier, significantly increasing the classification accuracy.
In order to validate the proposed algorithm, we have tested it on several well-known datasets used in the literature for benchmarking purposes, adopting ten-fold cross-validation. The comparison made against the best alternative classifiers proposed in the literature shows that our algorithm generally outperforms other approaches, since the trees it generates are more accurate and capable of good generalization on validation sets. Furthermore, the empirical evidence also indicates that the tabu search heuristic applied at each node is worthwhile, since it leads to improvements over the accuracy achieved by applying a simple heuristic based on truncated branch and bound.
2. Modeling the classification problem

In a classification problem we are required to distinguish between distinct pattern sets. The problem can be mathematically formulated as follows. Given m points (x_i, y_i), i ∈ M = {1, 2, ..., m}, in the (n + 1)-dimensional real space R^{n+1}, where x_i is an n-dimensional vector and y_i a scalar, determine a discriminant function f, from R^n into the real line R, such that f(x_i) = y_i, i ∈ M. With respect to the applications of classification problems mentioned in the introduction, we can interpret each point as an instance, the coordinates of the vector x_i as the values of the attributes, and the target y_i as the class to which the instance belongs. In this paper we confine our attention to the two-class classification problem, in which the target y_i takes only two different values. Without loss of generality, we will assume the two classes to be labeled by the values {−1, +1}, that is y_i ∈ {−1, +1}. Let also A and B denote the two sets of points represented by the vectors x_i in the space R^n and corresponding respectively to the two classes y_i = −1 and y_i = +1.
If the two point sets A and B are linearly separable, that is when their convex hulls do not intersect, a family of separating hyperplanes f(x) = wx − b exists which discriminates the points in A from those in B, i.e.

    wx_i − b > 0,   x_i ∈ A,
    wx_i − b < 0,   x_i ∈ B.                                                    (1)
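As a small concrete illustration of (1), the following minimal numpy sketch evaluates a linear discriminant f(x) = wx − b on two points and assigns each one to A or B according to the sign of f; the coefficients and points are arbitrary example values, not taken from the paper.

```python
# Minimal illustration of the separating hyperplane in (1); data are arbitrary examples.
import numpy as np

w = np.array([1.0, -2.0])          # hyperplane coefficients
b = 0.5                            # offset
X = np.array([[2.0, 0.0],          # a point expected on the A side (f > 0)
              [0.0, 1.0]])         # a point expected on the B side (f < 0)

f = X @ w - b                      # f(x_i) = w x_i - b
labels = np.where(f > 0, "A", "B") # assign by the sign of the discriminant
print(list(zip(f, labels)))        # [(1.5, 'A'), (-2.5, 'B')]
```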
The coefficients w ∈ R^n and b ∈ R can be determined by solving a linear programming problem, as in Mangasarian (1965). In the most likely case in which the point sets A and B are not linearly separable, one should resort to more complex schemes of classification, such as decision trees, neural networks and support vector machines. In general terms, all these approaches to classification are aimed at determining a discriminant function which minimizes some reasonable measure of misclassification.
For the sake of assessing the accuracy of a classification method, and for identifying the best classifier among alternative competing techniques, it is customary to subdivide the point set A ∪ B into two disjoint subsets, termed respectively the training and the validation set. For a given classifier, the discriminant function is computed using only instances from the training set, and is then applied to predict the class of each validation instance, in order to estimate the accuracy of the classifier with respect to unseen data.
In this section, we propose a mathematical programming model for constructing a linear discriminant function. This technique will be applied to obtain a linear multivariate splitting rule at each node of a decision tree, as described in Section 4, using the tabu search heuristic developed in Section 3. Our model provides a discrete variant of an SVM based on a stricter adherence to the SRM principle formulated by Vapnik (1995, 1998) within the context of statistical learning theory. Essentially, SRM formally establishes the intuitive idea that a good classifier trained on a given dataset must reduce both the empirical classification error and the generalization error in order to achieve a higher prediction accuracy on future unseen data. This concept plays a central role in the development of SVMs. In particular, the reduction of the generalization error is related by Vapnik to the maximization of the so-called margin of separation, defined as the distance between the pair of parallel supporting canonical hyperplanes wx − b − 1 = 0 and wx − b + 1 = 0. The margin of separation, whose geometric interpretation for two linearly inseparable sets is provided in Fig. 1, can be shown to be equal to

    2 / ‖w‖_2,   where ‖w‖_2 = ( Σ_{j∈N} w_j^2 )^{1/2} denotes the 2-norm of w (N = {1, 2, ..., n}).        (2)

The problem of determining the best separating hyperplane is formulated as follows in the SVM framework. For each instance of the dataset define a nonnegative slack variable d_i, i ∈ M, as indicated in Fig. 1.
Fig. 1. Margin maximization for linearly nonseparable sets.
One can therefore determine the optimal hyperplane by solving the following quadratic programming model, where λ ∈ [0, 1] is a parameter available to control the trade-off between the misclassification error and the generalization capability of the classifier:

    (QSVM)   min_{w,b,d}   (λ/2) ‖w‖_2^2 + (1 − λ) Σ_{i∈M} d_i
             s.t.   y_i (wx_i − b) ≥ 1 − d_i,   i ∈ M,                          (3)
                    d_i ≥ 0,   i ∈ M.
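A minimal sketch of model QSVM, assuming the cvxpy library and the notation of the reconstruction above (the trade-off parameter λ written as lam); this is an illustrative formulation under those assumptions, not the authors' implementation.

```python
# Sketch of model QSVM with cvxpy (illustrative; assumes X is m x n, y has values -1/+1).
import cvxpy as cp
import numpy as np

def solve_qsvm(X, y, lam=0.5):
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    d = cp.Variable(m, nonneg=True)                      # slack variables d_i >= 0
    objective = cp.Minimize(lam / 2 * cp.sum_squares(w) + (1 - lam) * cp.sum(d))
    constraints = [cp.multiply(y, X @ w - b) >= 1 - d]   # y_i (w x_i - b) >= 1 - d_i
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, d.value
```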
In order to reformulate problem QSVM as a linear programming model one has to replace the 2-norm in the previous objective function with the 1-norm ‖w‖_1 = Σ_{j∈N} |w_j|, introducing at the same time the upper bounding variables u_j, j ∈ N, to obtain

    (LSVM)   min_{w,b,d,u}   (λ/2) Σ_{j∈N} u_j + (1 − λ) Σ_{i∈M} d_i
             s.t.   y_i (wx_i − b) ≥ 1 − d_i,   i ∈ M,                          (4)
                    −u_j ≤ w_j ≤ u_j,   j ∈ N,                                  (5)
                    d_i ≥ 0,  i ∈ M;   u_j ≥ 0,  j ∈ N.
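To make the linear programming structure of LSVM explicit, the following sketch assembles the model in matrix form and solves it with scipy.optimize.linprog; the variable ordering, the symbol lam for λ, and the solver choice are assumptions made purely for illustration.

```python
# Sketch of model LSVM as an explicit LP with scipy.optimize.linprog (illustrative).
import numpy as np
from scipy.optimize import linprog

def solve_lsvm(X, y, lam=0.5):
    m, n = X.shape
    # Variable order: w (n, free), b (1, free), d (m, >= 0), u (n, >= 0).
    c = np.concatenate([np.zeros(n), [0.0], (1 - lam) * np.ones(m), lam / 2 * np.ones(n)])
    # y_i (w x_i - b) >= 1 - d_i  rewritten as  -y_i x_i w + y_i b - d_i <= -1
    A1 = np.hstack([-y[:, None] * X, y[:, None], -np.eye(m), np.zeros((m, n))])
    b1 = -np.ones(m)
    # -u_j <= w_j <= u_j  rewritten as  w_j - u_j <= 0  and  -w_j - u_j <= 0
    A2 = np.hstack([np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    A3 = np.hstack([-np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([b1, np.zeros(2 * n)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b = res.x[:n], res.x[n]
    return w, b
```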
Model LSVM benefits from the advantages of a linear programming formulation: very efficient computation of the optimal solution by means of state-of-the-art computer codes, and consequently a high degree of scalability towards large scale classification problems. However, problem LSVM evaluates a proxy of the misclassification distance using the slack variables d_i, i ∈ M, instead of the misclassification rate. The latter should be defined by counting the number of misclassified points, and actually appears to be the most appropriate measure of inaccuracy. We therefore propose a discrete modification of model LSVM, in which the misclassification rate is used in the objective function in place of the second term. To the end of counting the number of misclassified points, define the binary variables

    θ_i = 0 if x_i is correctly classified,   θ_i = 1 if x_i is misclassified,

and let c_i, i ∈ M, denote the misclassification cost associated to instance i, available to the analyst to express the relative importance of different points. Let also Q be a sufficiently large constant value. We can now formulate the following optimization problem, aimed at minimizing a weighted sum of margin and misclassification rate, and termed linear discrete support vector machine (LDVM):

    (LDVM)   min_{w,b,θ,u}   (λ/2) Σ_{j∈N} u_j + (1 − λ) Σ_{i∈M} c_i θ_i
             s.t.   y_i (wx_i − b) ≥ 1 − Q θ_i,   i ∈ M,                        (6)
                    −u_j ≤ w_j ≤ u_j,   j ∈ N,                                  (7)
                    θ_i ∈ {0, 1},  i ∈ M;   u_j ≥ 0,  j ∈ N.
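A minimal sketch of model LDVM as a mixed integer program, written with PuLP and its bundled CBC solver; this is an assumption made for illustration (the original work used the Cplex library), and the symbols lam, Q and theta follow the reconstruction above.

```python
# Sketch of model LDVM with PuLP/CBC (illustrative; the paper used Cplex).
import numpy as np
import pulp

def solve_ldvm(X, y, costs, lam=0.5, Q=1e4, time_limit=60):
    m, n = X.shape
    prob = pulp.LpProblem("LDVM", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{j}") for j in range(n)]                 # free
    b = pulp.LpVariable("b")                                          # free
    u = [pulp.LpVariable(f"u_{j}", lowBound=0) for j in range(n)]
    theta = [pulp.LpVariable(f"theta_{i}", cat="Binary") for i in range(m)]

    # Objective: (lam/2) * sum_j u_j + (1 - lam) * sum_i c_i * theta_i
    prob += (lam / 2) * pulp.lpSum(u) + (1 - lam) * pulp.lpSum(
        float(costs[i]) * theta[i] for i in range(m)
    )
    # Constraint (6): y_i (w x_i - b) >= 1 - Q * theta_i
    for i in range(m):
        prob += float(y[i]) * (pulp.lpSum(float(X[i, j]) * w[j] for j in range(n)) - b) \
            >= 1 - Q * theta[i]
    # Constraint (7): -u_j <= w_j <= u_j, so that u_j bounds |w_j|
    for j in range(n):
        prob += w[j] <= u[j]
        prob += -w[j] <= u[j]

    # Truncated branch and bound: stop the solver after time_limit seconds.
    prob.solve(pulp.PULP_CBC_CMD(msg=False, timeLimit=time_limit))
    w_val = np.array([v.value() for v in w])
    return w_val, b.value(), np.array([t.value() for t in theta])
```

Replacing the binary variables θ_i and the term Qθ_i with continuous slacks d_i ≥ 0 in this sketch recovers model LSVM; the solver time limit corresponds to the truncated branch and bound used in Sections 3 and 5.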
Model LDVM is a linear mixed integer programming problem, which is notoriously much more difficult to solve to optimality than the continuous linear programming model LSVM. This increase in computational complexity is the price to pay in order to achieve a more accurate representation of the empirical misclassification error. In the next section, we propose an efficient tabu search heuristic for obtaining a suboptimal solution to problem LDVM.

3. A tabu search heuristic for discrete support vector machines

In order to generate a feasible suboptimal solution to model LDVM we propose a tabu search (TS) algorithm, described in the following. TS is a broad class of heuristics that has achieved significant success in solving a wide range of optimization problems. Basically, TS is an iterative search algorithm for finding a suboptimal solution to an optimization problem P, which can be cast in the form {min c(v) : v ∈ V}. TS methods operate under the assumption that an appropriate problem-dependent neighborhood S(v) can be constructed to identify solutions adjacent to the current one. Thus, at each iteration the neighborhood of the current solution is exhaustively searched to find the local solution that is best with respect to an assigned evaluation function h(v), not necessarily coinciding with the original objective function c(v) of problem P. Hereafter, without loss of generality, we will assume that the evaluation function h(v) has to be minimized. This local optimum becomes the new current solution, provided the move required to reach it is not included in a list of forbidden moves, called the tabu list. The tabu list is dynamically updated, and the algorithm stops when either the neighborhood of the current solution is empty or a prescribed iteration limit is reached. The main advantage of TS over naive local search approaches is that the tabu list, together with the evaluation function, helps the algorithm move out of local regions of attraction to hopefully reach better solutions. General references to tabu search methods can be found in Glover (1989, 1990). Below we provide a basic framework of a generic TS algorithm.

Generic TS algorithm
1. Find an initial feasible solution v ∈ V, and let the best known solution v_TS = v. Set the iteration counter t = 0 and empty the list of tabu moves T = ∅.
2. Set t = t + 1. If the neighborhood S(v) is empty then stop. If S(v) − T is empty, go to step 4. Otherwise select s_t = arg min{h(s) : s ∈ S(v) − T} and set v = s_t.
3. If c(s_t) < c(v_TS) then update the best solution found, v_TS = s_t.
4. If a chosen number of iterations has elapsed either in total or since v_TS was last updated, then stop. Otherwise update T and return to step 2.

We now provide the details of how the generic TS algorithm can be adapted to solve problem LDVM. First, a starting feasible solution to problem LDVM can be generated by means of a truncated branch and bound algorithm for MIP. Notice that model LDVM is in fact easily feasible, and therefore a standard branch and bound code, such as the Cplex library, with its parameters properly tuned, is able to generate good feasible solutions quickly.
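To make the generic scheme concrete, the following minimal Python sketch mirrors steps 1–4 above. It is problem-agnostic: the callables `initial_solution`, `neighborhood`, `evaluate` (the function h), `objective` (the function c), `is_tabu` and `push_tabu` are hypothetical placeholders to be supplied by the application; their LDVM-specific form (moves of type I and II, tabu tenure) is described in the remainder of this section.

```python
# Minimal, problem-agnostic sketch of the generic TS algorithm (steps 1-4 above).
# The callables passed in are hypothetical placeholders, not part of the original paper.
def tabu_search(initial_solution, neighborhood, evaluate, objective,
                is_tabu, push_tabu, max_iter=200, max_no_improve=50):
    v = initial_solution                  # step 1: initial feasible solution
    v_best, c_best = v, objective(v)
    tabu = []                             # list of forbidden moves T
    no_improve = 0
    for t in range(1, max_iter + 1):      # step 2: iterate
        neighbors = neighborhood(v)       # list of (move, candidate solution) pairs
        if not neighbors:
            break                         # empty neighborhood: stop
        admissible = [(mv, s) for mv, s in neighbors if not is_tabu(mv, tabu)]
        if admissible:
            mv, v = min(admissible, key=lambda ms: evaluate(ms[1]))
            push_tabu(mv, tabu)           # record the move in the tabu list
            if objective(v) < c_best:     # step 3: update the incumbent
                v_best, c_best = v, objective(v)
                no_improve = 0
            else:
                no_improve += 1
        else:
            no_improve += 1
        if no_improve >= max_no_improve:  # step 4: stopping criterion
            break
    return v_best, c_best
```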
The evaluation function h(v) coincides in our case with the original objective function of LDVM. The neighborhood S(v) of a solution (w, b, θ, u) is defined by the set of feasible solutions that can be reached by performing one of the two following types of moves.
Move type I: one of the components of the vector θ taking the value one is changed to zero, whereas all the remaining binary variables are held fixed at their current values. This is tantamount to converting the state of the corresponding point from misclassified to correctly labeled.
Move type II: two components of the vector θ taking respectively the values one and zero are mutually exchanged, provided the sum of their distances from the corresponding supporting hyperplane falls below a given threshold ε, still preserving all other binary variables at their current values. Formally, given two variables θ_p = 1, θ_q = 0, the move swaps the assignment into θ_p = 0, θ_q = 1, provided d_p + d_q ≤ ε, recalling the definition of the distances d_p, d_q in (3). The threshold ε can be used to limit the size of the search space and to prevent unreasonable pairwise exchanges between points too far away from the separating hyperplane. Notice that geometrically these moves correspond to marginal rotations of the separating hyperplane.
Since both types of moves fix in advance the value of the binary vector θ, the feasibility of the corresponding (w, b, θ, u) solution can be verified by simply solving the LP problem obtained from LDVM in the remaining variables (w, b, u). Since the neighbors determined by type I moves are intrinsically limited in number by the variables assuming the value one, while the cardinality of the set of adjacent solutions resulting from type II moves is controlled by the threshold ε, in our implementation the candidate list is given by the full neighborhood S(v) of the current solution. Any possible move involving a variable θ_p is included in the tabu list at the iteration in which the variable is actually used to reach an adjacent solution. These moves are kept in the tabu list for a predetermined number of iterations, called the tabu tenure. Computational experience has shown that in our case the tabu tenure should be taken proportional to the square root of the number m of training instances.

4. Building a discrete support vector decision tree

Among other approaches to classification, decision trees are probably the most popular technique for constructing discriminant functions. Top-down induction of decision trees (TDIDT) (Quinlan, 1993) represents a framework for the generation of classification trees from a training dataset, by applying a recursive partitioning of the instances. More specifically, the points of the training set are repeatedly subdivided at each node of the tree starting from the root, which at the beginning contains all the instances, by making use of some splitting rule. The algorithm terminates when no admissible split can be derived at the tree leaves. At this point, the discriminant function is obtained from a simple majority voting scheme: if a leaf contains more points belonging to A than points belonging to B then the leaf is labeled as A, and classified as B when the opposite is true.
When the class of new instances has to be predicted, as for the validation set, the tree is traversed from the root to the appropriate leaf, by applying the rule at each node along the path, and the new instance is classified according to the label of the corresponding leaf.
The most significant difference among the specific algorithms fitting into the outlined framework lies in the way the splitting rule applied at each node is derived. Early and still most popular approaches to tree induction have confined themselves to simple single-attribute splits, in which the attribute and its threshold value are selected for the split so as to minimize some information theoretic measure of "confusion" among the partitions determined by the split itself. For instance, Quinlan's (1993) well-known C4.5 algorithm picks the attribute k, and its threshold value b, maximizing the information gain determined by the subsequent split. This means that the rule for splitting instances at the given node becomes x_ik > b or x_ik < b, with the tie x_ik = b arbitrarily broken. However, the accuracy achieved by these simple univariate classifiers, also termed axis-parallel due to their geometric interpretation in the point space R^n, is not always satisfactory. This has led to considering more general multivariate splitting rules, in the form of linear combinations of the attributes. These approaches have been called oblique trees (Murthy et al., 1994), or perceptron trees (Bennett et al., 2000), according to different authors. Below is the scheme of a generic TDIDT algorithm:

Generic TDIDT algorithm
1. Include all the instances of the training set into the root node, and put it into the list L of pending nodes.
2. If the list L is empty, then stop. Otherwise, select any node from L, remove it from L, and set it as the current node J.
3. Select the best splitting rule for the instances belonging to J, according to an assigned evaluation criterion. Apply the selected split, deriving from J a set of nonempty child nodes. If there is only one child, then J is a leaf of the tree, and it is labeled according to majority voting. Otherwise, the child nodes are appended to the list L. Repeat from step 2.

We propose two different classifiers derived within the TDIDT frame. Both algorithms select the best multivariate splitting rule by solving model LDVM, as formulated in Section 2, at each node on the instances included in the node. The first method for generating a discrete support vector decision tree, denoted as LDSDT_TS, is based on the approximate solution of problem LDVM at each node of the tree by means of the tabu search heuristic procedure described in Section 3. The second method, denoted as LDSDT_BB and used here mainly as a benchmark, is based on approximating the solution of problem LDVM at each node by means of a truncated branch and bound algorithm. For both algorithms a node is considered a leaf, and therefore its splitting disallowed, whenever at least one of the following conditions is met:
• the percentage of instances of one class in the node falls above a specified threshold. This value is problem specific, and ranges in our tests between 60% and 95%;
• the number of instances belonging to the node falls below a given threshold, again problem dependent. It generally ranges between 5 and 10.
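The following sketch shows, under stated assumptions, how the generic TDIDT scheme combines with the LDVM split and the two leaf conditions above; `solve_ldvm` is the hypothetical routine sketched in Section 2, and the purity and size thresholds are illustrative values within the ranges just mentioned.

```python
# Minimal sketch of discrete support vector decision tree induction (assumption:
# solve_ldvm is the LDVM routine sketched earlier; threshold values are illustrative).
import numpy as np

def grow_tree(X, y, costs, purity=0.95, min_size=10, lam=0.5):
    labels, counts = np.unique(y, return_counts=True)
    majority = int(labels[np.argmax(counts)])
    # Leaf conditions: dominant class above the purity threshold, or too few instances.
    if counts.max() / len(y) >= purity or len(y) <= min_size:
        return {"leaf": True, "label": majority}
    w, b, _ = solve_ldvm(X, y, costs, lam=lam)          # multivariate split w x - b
    side = X @ w - b >= 0
    if side.all() or (~side).all():                     # degenerate split: make a leaf
        return {"leaf": True, "label": majority}
    return {"leaf": False, "w": w, "b": b,
            "left": grow_tree(X[~side], y[~side], costs[~side], purity, min_size, lam),
            "right": grow_tree(X[side], y[side], costs[side], purity, min_size, lam)}

def predict(tree, x):
    # Traverse from the root to a leaf, applying the rule at each node along the path.
    while not tree["leaf"]:
        tree = tree["right"] if x @ tree["w"] - tree["b"] >= 0 else tree["left"]
    return tree["label"]
```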
5. Computational experiences

In this section the performance of the proposed classifiers LDSDT_TS and LDSDT_BB is evaluated in terms of accuracy and compared with four alternative classification approaches: the univariate algorithm C4.5 (Quinlan, 1993), the multivariate algorithm OC1 (Murthy et al., 1994), the Gaussian kernel SVM (Lee and Mangasarian, 2001) and the version of Quest (Loh and Shih, 1997) employing linear splits. Although the choice of existing techniques against which to compare our classifiers is broad, these methods were selected because they represent a wide range of effective approaches; furthermore, from the results of a recent benchmark of thirty-three classification methods (Lim et al., 2000) it appears that Quest emerged as the leading classifier on most datasets.
The classifiers were tested on six publicly available benchmark datasets, all from the UCI Machine Learning Repository of the University of California at Irvine (http://www.ics.uci.edu/∼mlearn/). The datasets used were: Cleveland Heart Disease (Heart), Wisconsin Breast Cancer (Cancer), Johns Hopkins University Ionosphere (Ionosphere), Pima Indians Diabetes (Diabetes), Bupa Liver Disorders (Liver), and 1984 United States Congressional Voting Records (House). Notice that the original Pima Indians Diabetes dataset was filtered, to remove the noisy attribute "serum insulin", together with some records containing several missing values. The actual size and the number of attributes of each dataset are given in Table 1.
To measure the learning ability of the alternative classifiers, for each dataset we applied ten-fold cross-validation (Kohavi, 1995); the average testing set accuracy across the ten partitions of each dataset is reported in Table 1, together with the average computational times and the average number of leaves of the generated trees. In particular, for the datasets marked with (◦) we used the same ten-fold partition as in Lim et al. (2000). Whereas the accuracy of methods LDSDT_TS and LDSDT_BB has been directly computed using our implementation, the remaining results in Table 1 are derived from the literature.
In applying both methods LDSDT_TS and LDSDT_BB, we performed a scaling of the numeric values in the datasets, so that the resulting coefficients varied in the range [−1, +1]. This was done to avoid numeric ill-conditioning and singularities in the formulation of problem LDVM. Furthermore, we noticed that a great benefit in accuracy is achieved when the misclassification cost c_i, i ∈ M, of a point belonging to a given class, appearing in the formulation of model LDVM, is taken equal to the percentage of instances of the opposite class. That is, if the class of instance i is y_i = −1, take its cost equal to the percentage of instances of class +1, and vice versa.
Algorithm LDSDT_BB was implemented using the branch and bound functions of the Cplex callable library, truncating the search after a given amount of time. In our tests, we provided a time limit of 60 s for each node in the tree. Also for obtaining the starting solution of the TS heuristic at each node in algorithm LDSDT_TS we used truncated branch and bound, but with a lower time limit of 15 s.
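As a small illustration of the two preprocessing choices just described (attribute scaling into [−1, +1] and class-balanced misclassification costs), the following sketch shows one plausible implementation; the function names and the exact linear scaling formula are assumptions for illustration, not taken from the paper.

```python
# Sketch of the preprocessing described above (illustrative; exact formulas assumed).
import numpy as np

def scale_to_unit_interval(X):
    # Map each numeric attribute linearly into [-1, +1] to avoid ill-conditioning.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0

def class_balanced_costs(y):
    # Cost of an instance = fraction of instances of the opposite class.
    frac_pos = np.mean(y == +1)
    frac_neg = np.mean(y == -1)
    return np.where(y == -1, frac_pos, frac_neg)
```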
Table 1. Accuracy results with ten-fold cross-validation. Each cell reports accuracy (%) / computational time (s = seconds, m = minutes) / number of leaves; — denotes a result not available. Datasets marked with ◦ use the same ten-fold partition as in Lim et al. (2000).

Dataset      Points × attributes   LDSDT_TS           LDSDT_BB           C4.5              OC1                SVM               Quest
Heart◦       270 × 13              85.2 / 35 s / 2    82.6 / 3.8 m / 5   80.4 / 4 s / 23   77.8 / 4.2 m / 3   85.9 / 3.4 s / 2  84.8 / 1.2 m / 3
Cancer◦      699 × 9               97.8 / 32 s / 2    97.0 / 54 s / 2    95.7 / 4 s / 11   95.9 / 13.3 m / 5  —                 96.9 / 1.5 m / 2
Ionosphere   351 × 34              94.6 / 31 s / 3    91.1 / 4.4 m / 6   93.7 / 3 s / 12   89.5 / — / 6       94.4 / 59 s / 2   —
Diabetes◦    532 × 7               80.2 / 1.3 m / 5   78.5 / 5.4 m / 6   75.8 / 8 s / 18   75.3 / 17.2 m / 5  76.6 / 5.4 m / 2  77.7 / 2.3 m / 5
Liver◦       345 × 6               75.3 / 4.2 m / 6   73.0 / 6 m / 7     70.8 / 6 s / 26   72.1 / 8.3 / 5     73.6 / 32 s / 2   69.4 / 1.4 m / 6
House◦       435 × 16              96.5 / 37 s / 2    92.6 / 3.8 m / 5   95.2 / 2 s / 6    94.2 / 4.2 m / 2   —                 96.4 / 1.5 m / 2
From the results presented in Table 1, we can draw the empirical conclusion that algorithm LDSDT_TS generally outperforms the competing classification techniques considered in these tests in terms of accuracy. Actually, even the performance of algorithm LDSDT_BB appears quite remarkable. These facts seem to encourage the conclusion that our overall approach to building decision trees based on DSVM is rather robust, with a relatively mild dependence on the specific method used for approximately solving model LDVM at each node of the tree. Furthermore, it appears that the tabu search heuristic is preferable to truncated branch and bound, since it achieves a greater accuracy, appearing dominant on all datasets. We also notice that the average number of leaves of the trees generated by LDSDT_TS is rather low, both in absolute terms and compared to other techniques, leading to fewer discrimination rules. However, the observed improvement in accuracy achieved by method LDSDT_TS over competing classifiers involves an increase in computational times.
Therefore, to investigate how method LDSDT_TS can scale up to large datasets, we applied it to a pair of real world problems arising in marketing applications. The first deals with a conquer targeting task in the automotive industry, whereas the second refers to a retention analysis in the context of mobile telecommunications.
Table 2. Accuracy results and computational times for the marketing datasets. Each cell reports accuracy (%) / computational time (s = seconds, m = minutes).

Dataset      (Training, validation)   LDSDT_TS        C5.0           CART
Automotive   (2250, 9000)             95.8 / 5.8 m    93.1 / 54 s    88.9 / 1.9 m
Telecom      (4750, 19 000)           94.1 / 6.4 m    92.3 / 1.3 m   88.6 / 2.5 m
For each problem, we compared algorithm LDSDT_TS with commercial implementations of methods C5.0 and CART, in terms of accuracy and computational times, as described in Table 2. The validation set contained in each test 80% of the instances, whereas the remaining 20% of the points were available to the classifiers for training. Experiments were conducted by feeding each classifier with subsets of these training instances of various sizes, in order to achieve the best accuracy on the validation sets. They showed that algorithms C5.0 and CART performed best when the whole training set was used for learning, while for algorithm LDSDT_TS the best accuracy was achieved by applying it to random subsets of the training set of size 1000. A similar behavior was noticed by some authors (Lee and Mangasarian, 2001) in relation to other classifiers, where reduced training sets were employed to improve accuracy and learning speed. From the inspection of Table 2 we draw the conclusion that for the two marketing problems classifier LDSDT_TS is still more accurate, while keeping the computational times close to those of competing methods.

6. Conclusions

We have proposed an algorithm for building decision trees with multivariate linear splitting rules. The optimal separating hyperplane at each node is obtained by solving a mixed integer programming problem, whose formulation is derived from a discrete variant of support vector machines. The difference between the two formulations stems from the representation of the empirical misclassification error: in our approach, it is based on the discrete count of misclassified instances, in accordance with the structural risk minimization principle, whereas in traditional SVMs a proxy of the misclassification distance is considered. We have shown that the complexity of the mixed integer programming problem formulated at each node of the tree can be tackled by an efficient tabu search heuristic. Indeed, computational tests performed on well-known benchmark datasets indicate that our algorithm significantly outperforms other classification approaches in terms of accuracy. At a general level, we have shown a case in which an approach to data analysis based on an accurate complex model combined with an efficient approximate solution
procedure empirically dominates an alternative scheme based on the exact solution of an approximate model. Future developments of this research will be concerned with other approximate algorithms for solving model LDVM, alternative to tabu search.

References

Bennett, K., Cristianini, N., Shawe-Taylor, J., Wu, D., 2000. Enlarging the margins in perceptron decision trees. Mach. Learning 41, 295–313.
Bennett, K., Mangasarian, O.L., 1992. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods Software 1, 23–34.
Bennett, K., Mangasarian, O.L., 1994. Multicategory discrimination via linear programming. Optimization Methods Software 3, 29–39.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Chunhui Chen, Mangasarian, O.L., 1996. Hybrid misclassification minimization. Adv. Comput. Math. 5, 127–136.
Glover, F., 1989. Tabu search. Part I. ORSA J. Comput. 1, 190–206.
Glover, F., 1990. Tabu search. Part II. ORSA J. Comput. 2, 4–32.
Koehler, G.J., Erenguc, S., 1990. Minimizing misclassifications in linear discriminant analysis. Decision Sci. 21, 63–85.
Kohavi, R., 1995. A study of cross-validation and bootstrapping for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence, Montreal.
Lam, K.F., Choo, E.U., Moy, J.W., 1996. Minimizing deviations from the group mean: a new linear programming approach for the two-group classification problem. European J. Oper. Res. 88, 358–367.
La Torre, D., Vercellis, C., 2003. C^{1,1} approximations of generalized support vector machines. J. Concrete Appl. Math. 1, 125–134.
Lee, Y.J., Mangasarian, O.L., 2001. RSVM: reduced support vector machines. CD Proceedings of the SIAM International Conference on Data Mining, Chicago.
Lim, T.S., Loh, W.Y., Shih, Y.S., 2000. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learning 40, 203–229.
Loh, W.Y., Shih, Y.S., 1997. Split selection methods for classification trees. Statist. Sinica 7, 815–840.
Mangasarian, O.L., 1965. Linear and nonlinear separation of patterns by linear programming. Oper. Res. 13, 444–452.
Mangasarian, O.L., 1993. Mathematical programming in neural networks. ORSA J. Comput. 5, 349–360.
Mangasarian, O.L., 1994. Misclassification minimization. J. Global Optimization 5, 309–323.
Mangasarian, O.L., 1996. Machine learning via polyhedral concave minimization. In: Fischer, H., et al. (Eds.), Applied Mathematics and Parallel Computing. Physica-Verlag, Wurzburg, pp. 175–188.
Mangasarian, O.L., Setiono, R., Wolberg, W., 1990. Pattern recognition via linear programming: theory and application to medical diagnosis. In: Coleman, T.F., Li, Y. (Eds.), Large-Scale Numerical Optimization. SIAM, Philadelphia, PA.
Murthy, S.K., 1998. Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining Knowledge Discovery 2, 345–389.
Murthy, S.K., Kasif, S., Salzberg, S., 1994. A system for induction of oblique decision trees. J. Artificial Intelligence Res. 2, 1–32.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA.
Safavin, S.R., Landgrebe, D., 1991. A survey of decision tree classifier methodology. IEEE Trans. Systems Man Cybernet. 21, 660–674.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, Berlin.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.