European Journal of Operational Research 230 (2013) 581–595
A memetic approach to construct transductive discrete support vector machines

Hubertus Brandner, Stefan Lessmann*, Stefan Voß

Institute of Information Systems, Department for Business and Economics, University of Hamburg, Von-Melle-Park 5, D-20146 Hamburg, Germany
Article history: Received 2 May 2012; Accepted 6 May 2013; Available online 23 May 2013

Keywords: Data mining; Transductive learning; Support vector machines; Memetic algorithms; Combinatorial optimization
Abstract

Transductive learning involves the construction and application of prediction models to classify a fixed set of decision objects into discrete groups. It is a special case of classification analysis with important applications in web-mining, corporate planning and other areas. This paper proposes a novel transductive classifier that is based on the philosophy of discrete support vector machines. We formalize the task to estimate the class labels of decision objects as a mixed integer program. A memetic algorithm is developed to solve the mathematical program and, thereby, to construct a transductive support vector machine classifier. Empirical experiments on synthetic and real-world data evidence the effectiveness of the new approach and demonstrate that it identifies high-quality solutions in a short time. Furthermore, the results suggest that the class predictions following from the memetic algorithm are significantly more accurate than the predictions of a CPLEX-based reference classifier. Comparisons to other transductive and inductive classifiers provide further support for our approach and suggest that it performs competitively with respect to several benchmarks. (c) 2013 Elsevier B.V. All rights reserved.
1. Introduction

Classification analysis is an important approach to support decision making in various disciplines including medical diagnosis, information retrieval, risk management and marketing. A classification model categorizes objects into disjoint groups. The group assignment is based on a set of attributes that characterize the objects. Depending on the application, the objects can, e.g., represent patients who are to be categorized into medical risk groups on the basis of symptoms, clinical tests, or their health behavior (e.g. [1-3]). Similarly, financial institutions discriminate between high and low risk loan applicants to support money lending decisions (e.g. [4,5]), and service companies divide customers into loyal clients and likely churners to target retention programs to the right customers (e.g., [6-8]). Independent of the application, classification analysis always aims at constructing a model that predicts group memberships with high accuracy. The prevailing approach toward classification is to employ a sample of objects with known group memberships. The relationship between the attribute values of these objects and the corresponding class labels is then inferred in an inductive manner (e.g. [9]). Several techniques pursuing this principle have been proposed in Statistics, Machine Learning, and Operations Research.
* Corresponding author. Tel.: +49 (0)40 42838 4706; fax: +49 (0)40 42838 5535.
E-mail addresses: [email protected] (H. Brandner), [email protected] (S. Lessmann), [email protected] (S. Voß).

(c) 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ejor.2013.05.010
Statistical classifiers often rely on probability theory and estimate the conditional probability (i.e., the a posteriori probability) of an object belonging to a class given the object's attribute values (e.g. [10]). Many machine learning methods adopt a data-driven paradigm. For example, tree-based classifiers recursively partition a data set through a sequence of tests on attribute values (e.g. [11]). Eventually, this produces a clear separation of objects of disjoint classes. Operations Research methods typically ground on linear and mixed integer programming (e.g. [12-16]).

In this work, we consider the transductive learning (TL) setting [17]. Standard (inductive) classification aims at creating a global prediction model that facilitates classifying arbitrary decision objects. TL differs from this approach in that it advocates a direct estimation of group memberships for a fixed set of objects, called the working set, which the decision maker knows in advance. A transductive model can be characterized as a local model that is applicable to working set objects only. The main advantage of TL compared to the more general classification setting is that the additional constraint of a fixed working set simplifies the learning task [17]. This, in turn, will often facilitate more accurate class predictions for working set objects (e.g. [18,19]). With respect to the applicability of TL, it has been shown that several important corporate planning tasks do not require a global model and could potentially benefit from TL [20]. Consequently, developing and testing transductive classification models is an important task to support decision making in organizations.
From the perspective of inferential statistics, TL is simpler than inductive classification because it explicitly considers the working set objects when building the local classification model (e.g., [21]). In other words, the objects for which classification accuracy matters are taken into account during classifier construction. However, this approach brings about new algorithmic challenges. First, it is not obvious how to best exploit the predictive information contained in the working set. Second, creating a transductive classifier involves working with both labeled and unlabeled data. This is because the class labels of working set objects are (by definition) unknown. Accommodating labeled and unlabeled data in a learning algorithm is a nontrivial task in its own right.

In this work, we propose solutions to these challenges and develop a novel transductive classifier. Our approach is based on two foundations. First, it relies on the principle of maximal margin separation, which has been put forward in the context of support vector machine (SVM) classifiers (e.g., [22]). The maximal margin principle is also a common approach toward TL (e.g. [18,23-25]). According to the overall risk minimization (ORM) theory, maximizing the margin of separation of a linear classifier helps to minimize a bound of the classifier's error on working set objects (e.g. [21,26]). Second, our algorithm builds upon the Discrete Support Vector Machine (DSVM) of Orsenigo and Vercellis [27]. DSVMs improve upon standard SVMs in the sense that they implement the principles of statistical learning more accurately [28]. To achieve this, Orsenigo and Vercellis propose to capture classification errors through a discrete step function, which is exactly the notion of errors used in the risk bounds of statistical learning theory. In a large number of simulations, Orsenigo and Vercellis as well as others show that the discrete error measurement produces highly accurate classifiers that outperform standard SVMs and other challenging benchmarks under several experimental conditions [27-33]. We hypothesize that the appropriateness of a discrete error measurement extends to the TL setting. A first contribution of this paper is thus the development of a transductive DSVM (tDSVM) classifier.

Building a transductive classification model is challenging from an algorithmic point of view. In general, classifier construction involves optimizing some measure of model fit over the objects of the training set. Inductive classification algorithms often rely on continuous optimization methods (e.g. [34]). In contrast, within TL the mathematical program underlying classifier construction is typically a mixed integer program (MIP) (e.g. [18,23]). This is also true for Orsenigo and Vercellis's DSVM classifier and our tDSVM in particular. A second contribution of this paper is associated with the development of a memetic algorithm to solve the MIP underlying tDSVMs. Our approach, which we call tDSVMmem, incorporates population-based and local search operators. We design these operators so as to account for characteristics of the MIP underlying our tDSVM classifier. Additional characteristics of tDSVMmem include a self-adaptive tuning of endogenous strategy parameters and an inheritance of solution characteristics.

We test the effectiveness of tDSVMmem through several empirical experiments on synthetic data and real-world data from the UCI Machine Learning Repository [35]. The results show that tDSVMmem performs significantly better than CPLEX.
More specifically, whenever the two solvers find the same solution, this solution is also the optimal solution of the corresponding problem instance. Whenever finding an optimal solution is computationally infeasible, tDSVMmem gives significantly better objective values than a truncated CPLEX benchmark (i.e., better than the best objective value obtained with CPLEX for a given time limit of reasonable length). We also find that tDSVMmem produces classification models that predict significantly more accurately than the CPLEX-based reference classifier. These results confirm the appropriateness of our approach and suggest that tDSVMmem is well suited to
construct tDSVM classifiers. Regarding the tDSVM classifier itself, we conduct several experiments to assess its predictive performance in comparison to other inductive and transductive methods. The results confirm the effectiveness of a discrete error measurement in TL settings. Furthermore, we find that tDSVM often, but not always, performs better than inductive classifiers. This suggests that TL and tDSVM in particular are not necessarily preferable to inductive classifiers, even if class predictions are sought for a known group of working set objects only. Through a set of follow-up experiments, we gain some insight into what factors influence the suitability of TL. For example, we observe that the ratio between labeled and unlabeled examples in a data set is an important determinant of TL success. Overall, the analysis allows us to provide some practical recommendations under which circumstances TL is preferable to an inductive approach.

A general implication of our study is that it emphasizes the efficacy of relatively simple heuristic procedures for combinatorial optimization under the condition that the focal problem is well understood, appropriately formalized in a mathematical model, and that the search operators within the heuristic are well adapted to this formulation. Our tDSVM formulation is well grounded in theory and thus captures the learning task in a suitable way. On this basis, a carefully selected set of standard search mechanisms suffices to devise an effective solver and obtain promising results. On the one hand, this evidences the power and generality of the heuristic search framework. On the other hand, it puts the popular approach to extend this framework and invent novel metaheuristics somewhat into perspective. Efforts related to the development of novel metaheuristics are best geared toward novel problems, whereas the techniques known today are well suited to approach a wide range of standard combinatorial problems. We provide empirical evidence in favor of this view for the problem of building tDSVM classifiers, which can be considered a further contribution of our study.

The paper is organized as follows: Section 2 introduces the original DSVM classifier and explains our modifications to extend it to the TL setting. We then develop our memetic algorithm in Section 3. Section 4 introduces the design of our empirical study. The corresponding results are presented and discussed from an optimization and predictive modeling point of view in Section 5. We conclude the paper with a summary of the main findings and an outlook on future research in Section 6.

2. Classification with transductive discrete support vector machines

The objective of a classification model is to group objects $x_j^* \in \mathbb{R}^n$ into fixed, disjoint classes $y_j^*$. In other words, a classifier defines a mapping from objects to classes $f: x \mapsto y$. An object is characterized by a set of n attributes. The fundamental assumption of classification analysis is that attribute values determine class memberships. However, the specific (functional) relationship between attribute values and class memberships is unknown. A classification method strives to reconstruct this relationship from a training sample L that consists of objects with known class labels, $L = \{x_i, y_i\}_{i=1}^{l}$. The model resulting from this step facilitates predicting the class memberships of novel objects $U = \{x_j^*\}_{j=1}^{u}$. Without loss of generality [36], we concentrate on binary classification problems in this paper and assume that $y_j^* \in \{\pm 1\}\ \forall j$.

2.1. Discrete support vector machines

SVMs are a popular approach toward classification. They are inspired by statistical learning theory and the principle of structural risk minimization (SRM) in particular [17].
Roughly speaking, it can be shown that a SVM classifier is optimal in the sense that it explicitly minimizes a bound of the classification error on novel objects (e.g. [37]). The error on novel objects (not contained in the training set) is called the generalization error, as opposed to the empirical error, which is measured (in-sample) on training set objects. The concept of SVMs is to separate examples $\{x_i\}_{i=1}^{l}$ into the two groups $y = \pm 1$ by means of a hyperplane $H = w \cdot x + b$. The hyperplane facilitates classifying novel objects $x_j^*$ according to their position relative to the plane (i.e., below or above H). Such a classification model is characterized by two parameters, the normal w and intercept b of the hyperplane. To build a SVM classifier, these parameters are estimated from L by maximizing the margin between the closest objects of adjacent classes while avoiding misclassifications (e.g. [22]). The idea of a maximal margin separation provides the link to the SRM principle. It ensures that the final classifier generalizes well to novel objects not contained in L [17]. Orsenigo and Vercellis note that the particular way in which SVMs deal with classification errors only approximates the risk bound derived within the SRM framework [27,28]. They recommend a stricter compliance with the SRM philosophy to obtain more accurate class predictions. To achieve this, they identify classification errors through binary indicator variables $h_i$ in their DSVM framework. This idea leads to the following combinatorial program to construct DSVM classifiers [27,28]:
$$\min_{w,b,h} \ \alpha \|w\|_1 / 2 + (1-\alpha)\, C \sum_{i=1}^{l} h_i \qquad (1)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) + Q h_i \ge 1 \quad \forall i \in \{1,\dots,l\}$$
$$h_i \in \{0,1\} \quad \forall i \in \{1,\dots,l\}$$

The objective balances two conflicting goals: achieving a large margin and low classification error. The margin of separation is maximized by minimizing the norm of the hyperplane's normal w (e.g., [22]). Using the $L_1$ norm ensures linearity of the objective (e.g. [38]). The second term in the objective minimizes the number of classification errors. More specifically, the first constraint ensures that training objects $\{x_i\}_{i=1}^{l}$ are considered as classification errors if they are either on the wrong side of the hyperplane or fall into the margin of separation (gray tube in Fig. 1). The approach to consider all objects within the margin as classification errors, including those that are actually on the correct side of the hyperplane, is standard in SVM learning (e.g. [22]). The characteristic feature of DSVMs is to count these errors through $h_i$, whereas standard SVMs approximate classification errors through continuous slack variables.
The meta-parameter $\alpha$ allows decision makers to control the trade-off between error minimization and margin maximization, and the constant Q is a sufficiently large number [31]. Orsenigo and Vercellis extend their DSVM formulation in several ways to enable, e.g., fuzzy classification [30], time-series classification [33], or multi-category classification [32]. These and other studies (e.g. [29,30]) have shown that a closer correspondence with SRM through a discrete measurement of classification errors increases the prediction performance of the resulting classifier.

2.2. Transductive discrete support vector machines

A transductive classifier is by definition aware of the location (in the attribute space) of the objects in the working set U. Transductive SVMs strive to capitalize on this additional information by means of extending the maximal margin principle to the objects of U. The class predictions of working set objects follow directly from their location relative to the separating hyperplane. Therefore, an optimal hyperplane is one that (i) achieves the largest possible margin on the labeled objects in L, (ii) produces the largest possible margin on the unlabeled objects in U given their grouping through the hyperplane, and (iii) achieves minimal classification error (e.g., [17,18]). Geometrically, this is equivalent to positioning the separating hyperplane into a low density region of the attribute space [39]. We illustrate this concept with an example in Fig. 1.

Fig. 1 depicts objects of a possible training and working set, respectively. More specifically, the symbols + and - represent the objects and class labels of L, whereas the objects of the working set U are represented by separate (unlabeled) markers. The class labels of working set objects (i.e., the class predictions $\{y_j^*\}$) follow from their position relative to the transductive classifier (solid line). The dashed line represents the inductive classifier that results from solving (1) for the training set L. Comparing the solution of the inductive classifier and the transductive classifier, it is clear that the latter achieves a larger margin on working set objects. Recall that the location of working set objects remains unknown in an inductive approach. Consequently, the resulting classifier is biased toward the data in L. Although the actual class labels of working set objects are unknown, a visual inspection suggests that the transductive solution captures the true relationship between an object's location and its class membership more closely. This suggests that the class predictions $\{y_j^*\}_{j=1}^{u}$ resulting from the transductive approach are likely to be more accurate than the predictions of the inductive classifier.
Our visual argument is also supported by learning theory. In particular, the ORM theory shows that considering the margin on both training and working set objects leads to a tighter bound of the generalization error, compared to the risk bound developed in the inductive SRM framework [21]. In order to implement the maximal margin principle for TL within Orsenigo and Vercellis's DSVM framework [27,28], we combine their idea of a discrete measurement of classification errors with Joachims' formulation of a transductive SVM [18] and propose the following MIP to construct a tDSVM classifier:
$$\min_{w,b,h,h^*,y^*} \ \alpha \|w\|_1 / 2 + (1-\alpha)\left(C \sum_{i=1}^{l} h_i + C^* \sum_{j=1}^{u} h_j^*\right) \qquad (2)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) + Q h_i \ge 1 \quad \forall i \in \{1,\dots,l\}$$
$$y_j^* (w \cdot x_j^* + b) + Q h_j^* \ge 1 \quad \forall j \in \{1,\dots,u\}$$
$$h_i \in \{0,1\} \quad \forall i \in \{1,\dots,l\}$$
$$h_j^* \in \{0,1\} \quad \forall j \in \{1,\dots,u\}$$
$$y_j^* \in \{\pm 1\} \quad \forall j \in \{1,\dots,u\} \qquad (3)$$

Fig. 1. Transductive SVM (solid line) and inductive SVM (dashed line) in a linearly non-separable case in $\mathbb{R}^2$.
We define $h_j^*$ such that $h_j^* > 0$ if and only if $x_j^*$ is located in the margin of separation (gray tube in Fig. 1), $\forall j \in \{1,\dots,u\}$:

$$h_j^* := \begin{cases} 1, & \text{if } x_j^* \text{ is within the margin} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
The main difference between tDSVM (2) and Joachims' transductive SVM [18] pertains to the use of the binary indicator variables $h_j^*$ to capture misclassifications among working set examples. This approach extends Orsenigo and Vercellis's idea to count classification errors by means of a discrete step function [27,28] to the TL setting. Subsequently, we refer to working set objects within the margin as transductive errors. Recall that misclassified objects of the training set are called empirical errors. The second constraint in (2) ensures that working set examples $\{x_j^*\}_{j=1}^{u}$ are either outside the margin or counted as transductive errors.
Fig. 2. Simplified generation cycle of memetic algorithms.
The objective incorporates these transductive errors (last term) and ensures that the final classifier is 'far away' from working set objects. The meta-parameters C and $C^*$ allow controlling the relative influence of transductive versus empirical errors on the objective. The consideration of transductive errors in the objective provides an incentive to push the separating hyperplane into a region with a low density of unlabeled objects. Assuming that objects of the same class are 'close' to each other and reside in the same cluster of the space, it is likely that this approach eventually leads to a large margin on working set examples. In view of the ORM principle, a large margin should result in better predictions [24]. Put differently, given that TL aims only at generating class predictions for the fixed working set $\{x_j^*\}_{j=1}^{u}$ and given that these predictions $\{y_j^*\}_{j=1}^{u}$ are explicitly considered in the optimization, it
is reasonable to assume that tDSVM achieves higher accuracy on working set examples compared to an inductive classifier.
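To make formulation (2) concrete, the sketch below states it with the open-source PuLP modeler rather than the CPLEX interface used later in the paper. The variable split $w = w^+ - w^-$ linearizes the $L_1$ norm, and an auxiliary binary $z_j$ with $y_j^* = 2z_j - 1$ linearizes the product $y_j^*(w \cdot x_j^* + b)$ via big-M constraints; the constants Q and M and the default weights are illustrative modeling choices, not taken from the paper.

```python
import pulp

def build_tdsvm_mip(X_lab, y_lab, X_unl, alpha=0.1, Q=100.0, M=100.0):
    """Sketch of the tDSVM MIP (2); Q and M are illustrative big-M values,
    C and C* follow the default rule (13) discussed in Section 5.3."""
    l, n = X_lab.shape
    u = X_unl.shape[0]
    C, C_star = n / l, n / u

    prob = pulp.LpProblem("tDSVM", pulp.LpMinimize)
    # ||w||_1 is linearized through the split w = wp - wm with wp, wm >= 0
    wp = [pulp.LpVariable(f"wp{d}", lowBound=0) for d in range(n)]
    wm = [pulp.LpVariable(f"wm{d}", lowBound=0) for d in range(n)]
    b = pulp.LpVariable("b")
    h = [pulp.LpVariable(f"h{i}", cat="Binary") for i in range(l)]    # empirical errors
    hs = [pulp.LpVariable(f"hs{j}", cat="Binary") for j in range(u)]  # transductive errors
    z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(u)]    # encodes y*_j = 2 z_j - 1

    norm_w = pulp.lpSum(wp) + pulp.lpSum(wm)
    prob += 0.5 * alpha * norm_w + (1 - alpha) * (
        C * pulp.lpSum(h) + C_star * pulp.lpSum(hs))

    def plane(x):  # w . x + b as a linear expression
        return pulp.lpSum((wp[d] - wm[d]) * float(x[d]) for d in range(n)) + b

    for i in range(l):  # first constraint of (2)
        prob += float(y_lab[i]) * plane(X_lab[i]) + Q * h[i] >= 1
    for j in range(u):  # second constraint of (2), linearized over z_j
        prob += plane(X_unl[j]) + Q * hs[j] + M * (1 - z[j]) >= 1  # active if y*_j = +1
        prob += -plane(X_unl[j]) + Q * hs[j] + M * z[j] >= 1       # active if y*_j = -1
    return prob
```

Any solver supported by PuLP, including CPLEX, can then be invoked through prob.solve().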
3. A memetic algorithm for tDSVM learning

In order to solve our tDSVM program (2), we employ the framework of memetic search [40]. A memetic algorithm (MA) is a population-based heuristic for global optimization inspired by cultural evolution (e.g. [41]). The notion of memes goes back to [42]. In contrast to genes in genetic algorithms [43], memes are refined separately and their individual improvements are passed on to the next generation. Fig. 2 illustrates the overall process of MA-based search. MAs operate on a pool of candidate solutions and employ local learning. They can thus be considered hybridizations between genetic algorithms and local search. Due to using population-based and local search operators, MAs explicitly pursue the two main objectives of heuristic search, exploration and exploitation. This feature makes them well suited for constructing tDSVM classifiers. Recall that the tDSVM objective (2) embodies two conflicting goals. First, the classifier should achieve a large margin of separation. Second, it should avoid both empirical and transductive errors. As we show in the following, search mechanisms that emphasize exploration are useful to minimize classification errors, whereas exploitation-centric search mechanisms facilitate increasing a classifier's margin. Therefore, a heuristic for solving (2) should incorporate mechanisms of both types.

In our approach, individuals of a population represent candidate solutions to (2). More specifically, each individual is characterized by a set of memes $(w_1,\dots,w_n,b) =: (w,b) \in \mathbb{R}^{n+1}$ and represents a separating hyperplane. We obtain the initial population by sampling (w, b) at random from a normal distribution.
Fig. 3. Computation of offspring solutions.
To determine the distribution's parameters $\mu$ and $\sigma$, we create a set of standard SVM classifiers on randomly selected sub-samples of the training data and estimate the mean and standard deviation of (w, b) over the resulting solutions. In order to measure the fitness of an individual with memes (w, b), we use the corresponding hyperplane to classify the objects of the training and working set and count the resulting number of empirical errors $\{h_i\}_{i=1}^{l}$ and transductive errors $\{h_j^*\}_{j=1}^{u}$.
We then use the tDSVM objective (2) to compute a fitness value. Individuals enter an iterative process of selection and stochastic variation (in the form of three search operators: reproduction, mutation, and local refinement) to increase their fitness. In accordance with [44], our algorithm, which we call tDSVMmem, terminates after a maximal number of generations or when the fitness of candidate solutions stops improving. Fig. 3 illustrates the overall structure of the evolutionary process within tDSVMmem. We detail important components of the algorithm in the following sections.

3.1. Reproduction

Reproduction represents the main population-based search operator in tDSVMmem. The operator randomly samples pairs of candidate solutions from the current population with replacement. In particular, we employ a roulette wheel selection approach [45], in which an individual's selection probability, and thus its probability to pass on memes to offspring solutions, is proportional to its fitness.
Fig. 4. (a)-(c) Geometrical illustration of a candidate solution before (ocher decision boundary) and after local refinement (gray decision boundary) in $\mathbb{R}^2$.
Afterwards, the operator generates new individuals from two parents I and II in two steps. Let X denote a new individual and $w_X$ the direction of the hyperplane corresponding to X. We first compute $w_X$ as:

$$w_X = \frac{w_I}{\mathrm{norm}(w_I)} + \frac{w_{II}}{\mathrm{norm}(w_{II})} \qquad (5)$$

We then set the margin $\frac{2}{\|w_X\|_1}$ equal to

$$\frac{1}{2}\left(\frac{2}{\|w_I\|_1} + \frac{2}{\|w_{II}\|_1}\right) \qquad (6)$$

Finally, we obtain the intercept of the new individual by averaging over the intercepts of its parents. As a result of these steps, the hyperplane of the new individual bisects the hyperplanes of its parents and has a margin equal to the mean margin of the parental planes. Our motivation for this recombination approach is twofold. First, given that decision variables are continuous, our recombination avoids potential problems of binary encoding schemes, which are required by alternative recombination procedures such as, e.g., uniform cross-over (e.g. [43]). Second, our modification is appropriate from a geometric point of view. That is, the offspring of two hyperplanes represents a decision surface between the parental ones. Averaging over potentially distant (parental) hyperplanes in the beginning of the search achieves a broad exploration of the search space. On the other hand, given that candidate solutions will become more similar in later generations, the reproduction mechanism will intensify the search in promising regions of the solution space in the later course of tDSVMmem. Finally, an advantage of this recombination approach is that it sustains the modifications of the local refinement operator.
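A direct implementation of the recombination rules (5) and (6) takes only a few lines of NumPy; reading norm(.) in (5) as the $L_1$ norm is our assumption:

```python
import numpy as np

def recombine(w1, b1, w2, b2):
    """Offspring hyperplane per (5)-(6): bisecting direction, margin equal
    to the mean parental margin, averaged intercept."""
    d = w1 / np.abs(w1).sum() + w2 / np.abs(w2).sum()             # direction, eq. (5)
    margin = 0.5 * (2 / np.abs(w1).sum() + 2 / np.abs(w2).sum())  # target margin, eq. (6)
    w_child = d / np.abs(d).sum() * (2 / margin)  # rescale so that 2 / ||w||_1 = margin
    return w_child, 0.5 * (b1 + b2)
```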
3.2. Mutation

In tDSVMmem, we mutate an individual's memes by adding n + 1 normally distributed variables:

$$w_d' := w_d + N(0, \sigma_d^2) \ \forall d \in \{1,\dots,n\} \quad \text{and} \quad b' := b + N(0, \sigma_{n+1}^2) \qquad (7)$$

The mutation operator shifts and rotates the separating hyperplane. Consequently, it affects the class assignments of working set objects, the margin of separation, and eventually classification errors. This suggests that the fitness of an individual depends heavily on the choices of the $\sigma_d^2$, which, in turn, determine the intensity of mutation. In order to identify appropriate settings, we adopt Beyer and Schwefel's self-adaptation philosophy [44]. In particular, we assign an $\mathbb{R}^{n+1}$-vector $(\sigma)$, representing $\sigma_1,\dots,\sigma_{n+1}$, to each individual and incorporate it into the heuristic search. As a consequence, the endogenous strategy parameters $\sigma_d$ are subject to the same variation mechanisms as the decision variables (w, b). They are tuned during the evolutionary process to automatically adjust the mutation operator's evolvability. In formal terms, the mutation is given by the following log-normal rule [44]:

$$\sigma_d' := \sigma_d \exp\left(\tau' \mathcal{N}'(0,1) + \tau \mathcal{N}_d(0,1)\right) \quad \text{with} \quad \tau' \propto \frac{1}{\sqrt{2(n+1)}}, \quad \tau \propto \frac{1}{\sqrt{2\sqrt{n+1}}} \qquad (8)$$

The $\mathcal{N}_1,\dots,\mathcal{N}_{n+1}$ are i.i.d. normal variables, and $\mathcal{N}'$ is held constant during the mutation process.
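Equations (7) and (8) translate directly into NumPy. Following the usual convention in self-adaptive evolution strategies, the sketch updates the strategy parameters first and then perturbs the memes; setting the proportionality constants in (8) to 1 is our assumption, since (8) only fixes the learning rates up to a constant:

```python
import numpy as np

def mutate(w, b, sigma, rng):
    """Self-adaptive mutation per (7)-(8) for memes (w, b) with n + 1
    endogenous strategy parameters sigma."""
    n1 = len(sigma)                        # n + 1
    tau_p = 1 / np.sqrt(2 * n1)            # global learning rate, eq. (8)
    tau = 1 / np.sqrt(2 * np.sqrt(n1))     # coordinate-wise learning rate, eq. (8)
    sigma_new = sigma * np.exp(tau_p * rng.standard_normal()   # N' drawn once, held constant
                               + tau * rng.standard_normal(n1))
    genome = np.concatenate([w, [b]]) + sigma_new * rng.standard_normal(n1)  # eq. (7)
    return genome[:-1], genome[-1], sigma_new
```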
3.3. Local refinement

We devise a local refinement operator that further improves the fitness of candidate solutions by enlarging the margin of separation. In particular, we employ the hyperplanes resulting from recombination and mutation to classify training and working set objects and, thereby, determine empirical and transductive errors. Given a hyperplane H, we define its neighborhood as the set of all planes which produce correct classifications for all objects that H classifies correctly. Out of this neighborhood, we choose the hyperplane with maximal margin to replace the current (w, b). To that end, we solve the following convex linear program for the objects that (w, b) classifies correctly. Ignoring misclassified objects at this point guarantees that program (9) is feasible:

$$\min_{w,b} \ \|w\|_1 / 2 \qquad (9)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \quad \forall i \in \{i \mid y_i (w \cdot x_i + b) \ge 1,\ i \in \{1,\dots,l\}\}$$
$$|w \cdot x_j^* + b| \ge 1 \quad \forall j \in \{j \mid |w \cdot x_j^* + b| \ge 1,\ j \in \{1,\dots,u\}\}$$
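The neighborhood condition $|w \cdot x_j^* + b| \ge 1$ in (9) is a disjunction, but once the side of every retained working set object is fixed to its incumbent classification, (9) becomes an ordinary linear program. The sketch below solves it with SciPy under exactly this convention; the variable split for $\|w\|_1$ is the same as in the MIP sketch above:

```python
import numpy as np
from scipy.optimize import linprog

def local_refinement(w, b, X_lab, y_lab, X_unl):
    """Sketch of program (9). Decision variables: [w_plus (n), w_minus (n), b]."""
    n = len(w)
    keep_l = y_lab * (X_lab @ w + b) >= 1    # correctly classified, outside the margin
    keep_u = np.abs(X_unl @ w + b) >= 1
    side_u = np.sign(X_unl @ w + b)[keep_u]  # fixed side for retained working set objects

    rows = [(x, y) for x, y in zip(X_lab[keep_l], y_lab[keep_l])]
    rows += [(x, s) for x, s in zip(X_unl[keep_u], side_u)]
    if not rows:
        return w, b
    # s (w . x + b) >= 1  <=>  -s x . w_plus + s x . w_minus - s b <= -1
    A = np.array([np.concatenate([-s * x, s * x, [-s]]) for x, s in rows])
    c = np.concatenate([np.full(2 * n, 0.5), [0.0]])  # minimize ||w||_1 / 2
    res = linprog(c, A_ub=A, b_ub=-np.ones(len(rows)),
                  bounds=[(0, None)] * (2 * n) + [(None, None)])
    if res.success:
        return res.x[:n] - res.x[n:2 * n], res.x[-1]
    return w, b                              # keep the incumbent on failure
```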
Fig. 4 illustrates the local refinement operator. Blue and red symbols represent elements of the training sample, and crosses depict unlabeled objects of the working set. The ocher hyperplane in (a) shows a solution H prior to refinement, whereas the gray classifier in (b) represents the refined classifier, that is, the optimal solution of (9). Clearly, the refined classifier achieves a larger margin than H. Moreover, panel (c) shows that local refinement can, in addition to enlarging the margin, also reduce the number of classification errors. In particular, a circle around an object in panel (c) indicates that the classification of this object changes from misclassified to correctly classified. This possibility exists because we solve program (9) only for correctly classified objects and enforce (through the second constraint) that these objects keep their correct classification.¹ Consequently, the refined classifier predicts at least as accurately as the original classifier. On the other hand, objects that were misclassified
by the original classifier can change their classifications after local refinement. It may be that the refined classifier gives correct predictions for some of these objects. Therefore, local refinement can improve but never harm classification accuracy.

¹ Note that this approach is similar to the hard-margin formulation of SVMs (e.g. [22]).

3.4. Fitness evaluation

The fitness evaluation in tDSVMmem is mainly driven by the tDSVM objective. However, a potential problem with (2) and other mathematical programs for classification is that they are susceptible to trivial solutions with w = 0 and b ∈ {±1} (e.g. [46]). Several techniques to circumvent this problem have been proposed in the literature (e.g. [47,48]). Within TL, a common approach is to enforce that the fraction of positively (negatively) classified working set objects matches the prior probability of positive (negative) objects in the training sample [18,49]. However, previous research has shown that this approach can decrease the classifier's performance whenever the training sample is not well representative of the working set [50]. Therefore, tDSVMmem incorporates a more flexible approach. In particular, we introduce a soft constraint to require that the fraction of positively classified objects in U falls into a 95%-confidence interval deduced from L:

$$\frac{|\{x_j^* \mid (w \cdot x_j^* + b) \ge 1,\ j \in \{1,\dots,u\}\}|}{|U|} \in \left[0.5\left(p - 1.96\sqrt{\frac{p(1-p)}{|L|}}\right),\ 1 - 0.5\left((1-p) - 1.96\sqrt{\frac{(1-p)p}{|L|}}\right)\right] \qquad (10)$$

with $p = \frac{|\{y_i \mid y_i = 1,\ i \in \{1,\dots,l\}\}|}{|L|}$.

We then augment the fitness evaluation within tDSVMmem by penalizing solutions that violate the constraint with a cost proportional to the violation.
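Putting objective (2) and the soft constraint (10) together, the fitness evaluation can be sketched as follows; the penalty weight rho is an assumption, since the text only states that the cost is proportional to the violation:

```python
import numpy as np

def fitness(w, b, X_lab, y_lab, X_unl, alpha=0.1, rho=1.0):
    """Fitness = objective (2) plus a soft penalty for violating the
    class-balance interval (10)."""
    l, n = X_lab.shape
    u = X_unl.shape[0]
    C, C_star = n / l, n / u
    f_lab, f_unl = X_lab @ w + b, X_unl @ w + b

    h = (y_lab * f_lab < 1).sum()        # empirical errors (margin violations)
    h_star = (np.abs(f_unl) < 1).sum()   # transductive errors
    value = 0.5 * alpha * np.abs(w).sum() + (1 - alpha) * (C * h + C_star * h_star)

    # soft constraint (10): share of positively classified working set objects
    p = (y_lab == 1).mean()
    share = (f_unl >= 1).mean()
    lo = 0.5 * (p - 1.96 * np.sqrt(p * (1 - p) / l))
    hi = 1 - 0.5 * ((1 - p) - 1.96 * np.sqrt((1 - p) * p / l))
    violation = max(lo - share, share - hi, 0.0)
    return value + rho * violation       # penalty proportional to the violation
```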
4. Experimental setup

The empirical evaluation pursues two objectives: verifying the effectiveness of tDSVMmem to solve (2) and assessing the predictive performance of the resulting tDSVM classifier. To that end, we require appropriate performance indicators and benchmark methods. The main benchmark used in this study is CPLEX 12. The unique advantage of CPLEX is that it solves the same mathematical program as tDSVMmem. Therefore, comparing the objective values of CPLEX and tDSVMmem clarifies the effectiveness of tDSVMmem as a solver for (2). CPLEX is also a suitable benchmark to assess predictive performance. The ORM principle suggests that a better solution to (2) will, in general, give a more accurate classifier (e.g. [21]). However, given that the labels of the working set objects are unknown and given that the relationship between an object's location (i.e., attribute values) and its class membership is not perfect and often corrupted by noise, it is well possible that this relationship does not hold. A better solution to (2) may give a classifier that is less accurate than one corresponding to some other solution [29]. Comparing the predictive accuracy of tDSVMmem and CPLEX classifiers provides some insight into how strong the link between objective values and classifier accuracy is in TL settings. In addition, to demonstrate the merit of the tDSVM classifier in itself, it is important to compare it to other inductive and transductive classification methods. Given that such benchmarks are based on different mathematical programs, comparisons of objective values are misleading. Accordingly, we concentrate on predictive accuracy in such comparisons.

A number of accuracy and error measures have been proposed to assess the predictive performance of a classification method from different angles (e.g. [9]). In this study, we consider two common performance measures, classification accuracy (ACC) and the area under a receiver operating characteristics curve (AUC). The ACC of a classifier is simply the fraction of working set objects that it assigns to the correct class:

$$\mathrm{ACC} = \frac{\sum_{j=1}^{u} h_j^*}{|U|} \qquad (11)$$

This notion of classifier performance is well aligned with the discrete error measurement of DSVMs [27,28] and our tDSVMmem in particular. A characteristic of ACC is that the performance of a classifier depends only on the number of working set objects on the correct side of the separating hyperplane. The AUC, on the other hand, is based on the distance of an object to the separating hyperplane. It measures the ability of a classifier to rank objects of different classes in the correct order [51]. Assume, for example, a classifier that assigns all objects to the negative class. This is equivalent to:

$$\mathrm{sign}(w \cdot x_j^* + b) = -1 \quad \forall j \in \{1,\dots,u\} \qquad (12)$$

meaning that all working set objects are located below the separating hyperplane. The classifier could still achieve a maximal AUC of 1 if all positive objects are closer to the hyperplane than the negative objects. In this sense, the AUC emphasizes a different aspect of classifier performance and complements an ACC-based assessment.
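For a hyperplane classifier, both measures are straightforward to compute once the true labels of the working set are known, as they are in a benchmarking study. The sketch below uses scikit-learn's roc_auc_score on the signed distances to the plane:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(w, b, X_work, y_true):
    """ACC and AUC of a hyperplane classifier on the working set; the AUC
    ranks objects by their signed distance w . x + b."""
    scores = X_work @ w + b
    acc = (np.sign(scores) == y_true).mean()
    auc = roc_auc_score(y_true, scores)   # labels in {-1, +1} are accepted
    return acc, auc
```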
An important design choice in empirical studies associated with TL concerns the ratio between labeled and unlabeled examples l/u. It is clear from (2) that the number of binary decision variables is closely related to the amount of unlabeled data. Consequently, the CPLEX benchmark will experience more and more difficulties in finding an optimal solution as l/u decreases. To set tDSVMmem a challenging benchmark, we perform most experiments with a constant value of l/u = 1, meaning that the numbers of unlabeled and labeled objects are the same. We will also consider ratios of l/u < 1 where appropriate (see Section 5.4).

Another design choice in the experimental study is associated with the settings of tDSVM parameters. For example, the predictive performance of any SVM-type classifier depends heavily on the parameter C (e.g., [52]). Larger values result in classifiers with lower empirical error, whereas smaller values yield classifiers with larger margin. Given their influence on the optimal solution of (2) and the resulting classifier, we explicitly account for different parameter settings in our experiments. In particular, we consider three alternative settings of Ierror (low, medium, high), where we set α = 0.9, α = 0.5, and α = 0.1, respectively.

Finally, we acknowledge that the CPLEX benchmark (tDSVMCPLEX in the following) might fail to identify an optimal solution, especially for larger problem instances. We therefore impose a time limit and terminate CPLEX after 30 min. After termination, we extract the currently best solution for comparison with tDSVMmem. Note that the running times of tDSVMmem vary between a few seconds (synthetic data) and four minutes (see Tables 3-5 for details). In this sense, tDSVMmem consistently uses less CPU time than the CPLEX benchmark. Therefore, in the following we concentrate on the effectiveness of both methods rather than focusing on running times.

5. Empirical results

We first present empirical results obtained from synthetic data. We then discuss results from real-world data. There, we begin with a comparison of tDSVMmem versus tDSVMCPLEX in terms of objective values and predictive performance. Subsequently, we benchmark tDSVM classification models, produced with tDSVMmem, against other classifiers.
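As a concrete illustration of the experimental design above, the following sketch simulates a small TL scenario with l/u = 1. Scikit-learn's make_classification serves here as a simple stand-in for the NDC generator used in Section 5.1, and all settings shown are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data generator: Gaussian clusters with tunable dimensionality
# and class overlap (illustrative parameters, not the NDC settings).
X, y = make_classification(n_samples=50, n_features=4, n_informative=4,
                           n_redundant=0, class_sep=0.5, random_state=0)
y = 2 * y - 1                                  # map {0, 1} to {-1, +1}

# l/u = 1: half the objects form L; the other half, with labels withheld
# during training, forms the working set U.
X_lab, X_unl, y_lab, y_unl = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=0)
```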
5.1. Synthetic data experiment

We employ Musicant's NDC generator [53] to generate a set of artificial (binary) classification problems. This package has been widely used in the literature (e.g. [25,54,55]) and offers some more generality compared to alternative data generators [56]. In the experiments with synthetic data, we consider two experimental factors in addition to the severity of classification errors (Ierror). First, we consider the number of attributes (two settings with n = 4 and 32, respectively) because it is an important characteristic of any classification problem and may thus have an influence on the performance of the two solvers. Second, the difficulty of the classification problem will directly affect the trade-off between a large margin and a low error solution and thus the two solvers under study. Musicant's data generator [53] can accommodate problem difficulty as the degree of overlap between the two classes, which is equivalent to the Bayes error. We reuse settings from previous research (s = 0.01 and 15; [56,57]) to stretch the covariance matrix of the data generator. Together with the three settings for Ierror (low, medium, high), we obtain a 3 × 2 × 2 factorial design for the synthetic data experiment.

Although it represents an important characteristic of a classification task, we refrain from considering the number of examples (i.e., problem size) as a further experimental factor. This is because the main objective of the synthetic data experiment is to scrutinize the ability of tDSVMmem to identify near-optimal solutions. Therefore, we must ensure that the benchmark, tDSVMCPLEX, actually finds the optimal solution of a problem instance. A series of pretests revealed that u = l = 25 fulfills this requirement for all twelve experimental settings, whereas larger data sets lead to CPLEX not converging within the 30 min time limit. We therefore consider randomly generated datasets of 50 examples for the synthetic data experiment.

Table 1 reports the empirical results of the comparison of tDSVMmem vs. tDSVMCPLEX across the twelve experimental settings on synthetic data. Columns one to three depict problem characteristics in terms of Ierror, n, and s, whereas the remaining columns report the objective values of tDSVMCPLEX and tDSVMmem, and their percentage difference (i.e., the gap between the optimal solution and the tDSVMmem solution), respectively. For tDSVMmem, we also report the standard deviation of objective values, which we compute over ten random initializations per setting. The last two columns of Table 1 give the predictive performance in terms of AUC and ACC, which is the same for the two solvers in all experiments. Here and in the following, an asterisk (*) indicates that CPLEX finds the optimal solution of a problem instance.
Table 1
Comparative results of tDSVMmem and tDSVMCPLEX on synthetic data.

Ierror   n    s      tDSVMCPLEX obj.   tDSVMmem obj.   Std. dev.   Gap (%)   AUC      ACC
Low      4    0.01   0.4652*           0.4652          0           0         1        1
Low      4    15     0.8744*           0.8744          0           0         0.8701   0.7600
Low      32   0.01   0.4560*           0.4560          0           0         1        1
Low      32   15     2.0237*           2.0237          0           0         0.6508   0.8000
Medium   4    0.01   0.2584*           0.2584          0           0         1        1
Medium   4    15     1.1546*           1.1546          0           0         0.8691   0.8000
Medium   32   0.01   0.2533*           0.2533          0           0         1        1
Medium   32   15     1.8170*           1.8170          0           0         0.9683   0.8400
High     4    0.01   0.0517*           0.0517          0           0         1        1
High     4    15     0.5214*           0.5214          0           0         1        1
High     32   0.01   0.0507*           0.0507          0           0         1        1
High     32   15     0.3634*           0.3634          0           0         0.9603   0.8400
Table 1 provides strong support for the efficacy of tDSVMmem. It finds the optimal solution under all experimental conditions. Moreover, the low standard deviations evidence its robustness (i.e., that it finds optimal solutions independent of its random initialization). In a similar fashion, none of the three experimental factors (severity of classification errors, dimensionality, and difficulty of the classification problem) seems to affect tDSVMmem. We observe optimal solutions for all factor combinations. In summary, the results on synthetic data clearly show the strong performance of tDSVMmem. However, the number of examples was only 50 in these experiments. It is important to also examine the performance of tDSVMmem in larger settings that are more representative of practical applications of classification analysis and TL in particular. This is the subject of the next section.

5.2. Analysis of solver efficacy

To complement the previous analysis and to examine the competitive performance of tDSVMmem and tDSVMCPLEX in realistic scenarios, we select seven binary classification problems from the UCI Machine Learning Repository [35]. The data sets represent real-world decision tasks in different domains (e.g., business, medical diagnosis, physics, and text classification) to facilitate an industry-independent assessment. In particular, Australian Credit (AC) and German Credit (GC) represent loan approval problems. The classification task is to discriminate between good and bad credit risks on the basis of demographic and transactional attributes that characterize loan applications. The attributes in Wisconsin Breast Cancer (BC) refer to characteristics of cell nuclei, and the classification task is to distinguish tumor cells into benign or malignant. The Heart (HE) data set is also from the field of medical decision making. It includes the records of 267 patients who are classified into two groups (normal vs. abnormal) on the basis of single proton emission computed tomography images. The Ionosphere (IO) dataset requires a categorization of radar returns into two classes (good and bad), whereas the Spambase (SP) task is to discriminate between ordinary and spam emails on the basis of word and/or character counts extracted from the mail body. Finally, the Adult (AD) dataset contains Census data concerning demographic and socio-demographic characteristics of US households. The task is to predict whether household income exceeds $50,000. These data sets have been employed in several previous studies to explore the performance of competing learning algorithms (e.g. [25,27,28,58-61]). In this sense, they represent an established set of classification problems well suited for evaluating novel classifiers and tDSVMmem in particular. Table 2 summarizes the characteristics of the data sets.

It is known that many classifiers and SVM-type methods in particular benefit from data preprocessing (e.g. [62]). In this study, we preprocess all data sets as follows: we first convert categorical attributes into numerical attributes using the weight of evidence approach [63]. We then employ the z-transformation to avoid problems with continuous attributes on different measurement scales.
Table 2
Characteristics of the seven real-world UCI datasets.

D    |D|      n    P(y = +1)
AC   690      14   0.4449
GC   1000     20   0.3000
BC   569      30   0.3726
HE   267      44   0.2060
IO   351      34   0.6410
SP   4601     57   0.3940
AD   48,842   14   0.2393
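The preprocessing described above can be sketched as follows; the smoothing constant eps is our addition to guard against empty cells and is not part of the weight of evidence approach as cited:

```python
import numpy as np

def weight_of_evidence(col, y, eps=0.5):
    """WoE encoding of a categorical pandas Series col against binary
    labels y in {-1, +1}."""
    pos, neg = (y == 1), (y == -1)
    woe = {}
    for v in col.unique():
        m = (col == v)
        rate_pos = ((m & pos).sum() + eps) / (pos.sum() + eps)
        rate_neg = ((m & neg).sum() + eps) / (neg.sum() + eps)
        woe[v] = np.log(rate_pos / rate_neg)
    return col.map(woe)

def z_transform(X):
    """Standardize continuous attributes to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```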
Table 3
Empirical results on UCI data at Ierror = low. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   0.5697       >1800    0.5352      0.0000      20.2        -6.06
AC   2   0.5555       >1800    0.5372      0.0000      20.6        -3.30
AC   3   0.5514       >1800    0.5393      0.0000      21.0        -2.20
AC   4   0.5514       >1800    0.5413      0.0000      22.8        -1.84
AC   5   0.5372       >1800    0.5413      0.0000      23.0        0.75
AC   £   0.5531       >1800    0.5389      0.0000      21.5        -2.53
BC   1   2.4662       >1800    2.3533      0.0028      103.0       -4.57
BC   2   2.4925       >1800    2.3668      0.0030      103.3       -5.04
BC   3   2.4921       >1800    2.4131      0.0065      103.4       -3.17
BC   4   2.4016       >1800    2.4273      0.0796      104.4       1.07
BC   5   2.6745       >1800    2.4307      0.0041      110.1       -9.11
BC   £   2.5054       >1800    2.3982      0.0192      104.8       -4.17
GC   1   0.8820       >1800    0.8714      0.0243      71.6        -1.20
GC   2   0.8772       >1800    0.8714      0.0095      73.1        -0.66
GC   3   0.8820       >1800    0.8726      0.0342      85.9        -1.06
GC   4   0.8892       >1800    0.8741      0.0297      85.9        -1.70
GC   5   0.9012       >1800    0.8815      0.0591      97.2        -2.18
GC   £   0.8863       >1800    0.8742      0.0314      82.8        -1.36
HE   1   2.0469       >1800    1.5441      0.0064      42.6        -24.56
HE   2   1.8806       >1800    1.6943      0.0205      44.7        -9.91
HE   3   1.9880       >1800    1.7007      0.0397      48.6        -14.45
HE   4   1.7110       >1800    1.7030      0.0502      50.6        -0.46
HE   5   1.9431       >1800    1.7037      0.0009      51.5        -12.32
HE   £   1.9139       >1800    1.6691      0.0236      47.6        -12.34
IO   1   2.3307       >1800    2.1079      0.0717      79.9        -9.56
IO   2   2.1903       >1800    2.1217      0.0692      82.2        -3.13
IO   3   2.1901       >1800    2.1235      0.0379      88.3        -3.04
IO   4   2.4343       >1800    2.1360      0.1085      90.6        -12.25
IO   5   2.2386       >1800    2.1445      0.0148      91.4        -4.20
IO   £   2.2768       >1800    2.1267      0.0604      86.5        -6.44
SP   1   6.4979       >1800    6.4390      0.9653      157.3       -0.91
SP   2   /            >1800    7.1841      1.0335      171.2       -
SP   3   /            >1800    7.3723      2.0785      195.0       -
SP   4   /            >1800    7.4562      1.6170      201.0       -
SP   5   /            >1800    7.5422      2.1602      210.8       -
SP   £   6.4979       >1800    7.1988      1.5709      187.1       -0.91
AD   1   /            >1800    0.7934      0.0447      163.3       -
AD   2   /            >1800    0.7999      0.0000      178.8       -
AD   3   /            >1800    0.8004      0.0246      189.0       -
AD   4   /            >1800    0.8035      0.0163      189.6       -
AD   5   /            >1800    0.8088      0.0218      189.9       -
AD   £   /            >1800    0.8012      0.0215      182.1       -

Table 4
Empirical results on UCI data at Ierror = medium. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   0.8486*      28.6     0.8486      0.0000      14.0        0.00
AC   2   0.7775*      14.6     0.7775      0.0000      19.5        0.00
AC   3   0.7572*      23.2     0.7572      0.0000      22.0        0.00
AC   4   0.7572*      13.8     0.7572      0.0000      22.1        0.00
AC   5   0.6862*      11.4     0.6862      0.0000      22.3        0.00
AC   £   0.7654       18.3     0.7654      0.0000      20.0        0.00
BC   1   4.0338       >1800    3.6990      0.0250      58.3        -8.30
BC   2   3.7492       >1800    3.7300      0.0284      59.6        -0.51
BC   3   3.7737       >1800    3.7364      0.0172      61.5        -0.99
BC   4   4.0748       >1800    3.7727      0.0273      61.8        -7.41
BC   5   3.8456       >1800    3.8067      0.0310      63.6        -1.01
BC   £   3.8954       >1800    3.7490      0.0258      61.0        -3.65
GC   1   2.3380       >1800    2.2912      0.1100      77.7        -2.00
GC   2   2.4100       >1800    2.3104      0.1404      82.9        -4.13
GC   3   2.4820       >1800    2.3284      0.1292      83.0        -6.19
GC   4   2.3500       >1800    2.3306      0.0481      84.6        -0.83
GC   5   2.3260       >1800    2.3380      0.0000      88.0        0.52
GC   £   2.3812       >1800    2.3197      0.0856      83.2        -2.53
HE   1   3.8507       >1800    3.1539      0.0567      20.6        -18.10
HE   2   3.2087       >1800    3.1784      0.0622      21.7        -0.95
HE   3   3.2234       >1800    3.0206      0.1224      19.9        -6.29
HE   4   2.7729       >1800    3.0532      0.1333      19.9        10.11
HE   5   3.1157       >1800    3.1082      0.1520      20.6        -0.24
HE   £   3.2343       >1800    3.1029      0.1053      20.5        -3.09
IO   1   4.0505       >1800    2.9877      0.0109      39.1        -26.24
IO   2   3.1552       >1800    3.0558      0.0808      41.0        -3.15
IO   3   3.6279       >1800    3.0641      0.0063      41.8        -15.54
IO   4   3.5531       >1800    3.1513      0.2648      44.3        -11.31
IO   5   2.7699*      1410.6   3.1573      0.1322      47.1        13.99
IO   £   3.4313       >1800    3.0832      0.0990      42.7        -8.45
SP   1   17.0316      >1800    11.0977     0.6896      226.9       -34.84
SP   2   15.7249      >1800    10.8216     0.6719      209.1       -31.18
SP   3   15.6210      >1800    10.8737     0.5021      226.3       -30.39
SP   4   /            >1800    10.9386     0.7731      227.1       -
SP   5   /            >1800    11.0377     0.8186      229.4       -
SP   £   16.1258      >1800    10.9538     0.6911      223.7       -32.14
AD   1   /            >1800    1.3820      0.0162      162.3       -
AD   2   /            >1800    1.4203      0.1267      177.2       -
AD   3   /            >1800    1.4465      0.1935      177.2       -
AD   4   /            >1800    1.4578      0.1945      181.5       -
AD   5   /            >1800    1.4979      0.2600      184.1       -
AD   £   /            >1800    1.4409      0.1582      176.5       -
To simulate a TL scenario, we partition all data sets randomly into a training and a working set using a split ratio of 1:1 (see above). We repeat this partitioning five times to account for the fact that the performance of tDSVMmem and tDSVMCPLEX might depend upon the random assignment of decision objects to the training and working set, respectively. Tables 3-5 report the empirical results in terms of objective values for the three experimental settings Ierror = low, medium, high. In each table, the five rows per dataset provide the individual results of the random assignments of objects to L and U. We use an asterisk (*) to highlight settings where tDSVMCPLEX finds the optimal solution within the time limit; otherwise we report the best CPLEX result found within 30 min. The £ row per dataset gives the average performance of the two solvers, computed over the five individual data partitionings. A / indicates that tDSVMCPLEX failed to find a feasible solution within the time limit. The running time for tDSVMmem varies between one and four minutes (depending on data set size) and never exceeds the time limit. In the last column of Tables 3-5, a negative gap indicates that tDSVMmem outperforms tDSVMCPLEX.

Tables 3-5 give a similar picture as the synthetic data experiment and suggest that tDSVMmem typically performs better than tDSVMCPLEX. More specifically, the average objective values of tDSVMmem are consistently better than those of the CPLEX benchmark across all experiments. Considering the individual results across different random partitionings of the data sets, the overall win/tie/loss counts of tDSVMmem vs. tDSVMCPLEX are 69/10/5. On the basis of these results, a Wilcoxon signed rank test for matched samples (e.g. [64]) enables us to reject the null-hypothesis that the median objective values of the two solvers are the same (p-value < 0.001). It is also noteworthy that all ties are due to both approaches solving a problem instance to optimality. It is encouraging to observe that tDSVMmem finds the optimal solution of a problem instance whenever it is revealed (i.e., through tDSVMCPLEX). Moreover, the standard deviation of tDSVMmem performance (computed over ten different random initializations) is consistently low. This suggests that the quality of tDSVMmem does not depend on the random initialization of the algorithm. In view of these results, we may conclude that tDSVMmem performs significantly better than tDSVMCPLEX. This also confirms that the heuristic operators and their combination within tDSVMmem are effective and well geared to the problem at hand. Such confirmation is crucial for any new metaheuristic.
Table 5
Empirical results on UCI data at Ierror = high. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   1.1274*      482.7    1.1274      0.0000      12.5        0.00
AC   2   0.9996*      143.1    0.9996      0.0000      13.0        0.00
AC   3   0.9630*      85.3     0.9630      0.0000      13.1        0.00
AC   4   0.9630*      129.6    0.9630      0.0000      13.4        0.00
AC   5   0.8352*      12.2     0.8352      0.0000      13.5        0.00
AC   £   0.9777       170.6    0.9777      0.0000      13.1        0.00
BC   1   2.6192       >1800    1.7320      0.6203      61.0        -33.87
BC   2   2.3681       >1800    1.8146      0.6258      62.1        -23.37
BC   3   2.4459       >1800    1.8630      0.6595      63.3        -23.83
BC   4   2.4141       >1800    1.8853      0.6890      83.4        -21.90
BC   5   2.6541       >1800    1.9056      0.6382      85.1        -28.20
BC   £   2.5003       >1800    1.8401      0.6466      71.0        -26.24
GC   1   3.1796       >1800    2.3842      0.3022      33.2        -25.02
GC   2   3.5560       >1800    2.4475      0.3619      35.1        -31.17
GC   3   3.3604       >1800    2.4856      0.3484      48.7        -26.03
GC   4   3.7992       >1800    2.4907      0.3163      49.4        -34.44
GC   5   3.3684       >1800    2.5087      0.1537      49.6        -25.52
GC   £   3.4527       >1800    2.4633      0.2965      43.2        -28.44
HE   1   5.3507       >1800    3.6760      0.0568      15.8        -31.30
HE   2   5.3763       >1800    3.6795      0.0760      15.9        -31.56
HE   3   4.8141       >1800    3.8482      0.0754      16.1        -20.06
HE   4   4.3857       >1800    4.0385      0.0857      16.2        -7.92
HE   5   5.3164       >1800    4.0796      0.0694      16.3        -23.26
HE   £   5.0486       >1800    3.8644      0.0727      16.0        -22.82
IO   1   4.1134       >1800    2.7874      0.2014      41.4        -32.24
IO   2   3.4063       >1800    2.8186      0.0256      44.1        -17.25
IO   3   4.0042       >1800    2.8338      0.0009      45.4        -29.23
IO   4   3.8410       >1800    2.8622      0.1832      46.3        -25.48
IO   5   3.4778       >1800    2.8767      0.09822     47.2        -17.28
IO   £   3.7685       >1800    2.8357      0.1019      44.9        -24.30
SP   1   21.9349      >1800    11.8460     0.30136     111.6       -45.99
SP   2   18.3260      >1800    11.9520     0.63059     116.5       -34.78
SP   3   22.8557      >1800    11.9706     0.33681     118.1       -47.63
SP   4   19.5278      >1800    11.9775     0.12441     124.9       -38.66
SP   5   24.8513      >1800    12.0379     0.30096     127.1       -51.56
SP   £   21.4991      >1800    11.9568     0.33881     119.7       -43.73
AD   1   /            >1800    1.8699      0.0000      137.8       -
AD   2   /            >1800    1.8781      0.0000      138.7       -
AD   3   /            >1800    1.8789      0.0000      138.8       -
AD   4   /            >1800    1.8830      0.0000      141.2       -
AD   5   /            >1800    1.8836      0.0000      141.9       -
AD   £   /            >1800    1.8787      0.0000      139.7       -
Finally, it is important to note that tDSVMCPLEX experiences severe difficulties in finding any solution within the time limit when working with larger data sets. This is true for SP and especially AD. On the other hand, tDSVMmem delivers solutions for these problems within a few minutes. This emphasizes the need for heuristic search in transductive SVM learning. Lacking an objective benchmark, we are unable to assess many SP and all AD solutions. Considering the SP experiments where tDSVMCPLEX has found a solution (i.e., with Ierror = medium or high), the large magnitude by which tDSVMmem outperforms the CPLEX benchmark gives rise to the suspicion that the advantage of tDSVMmem is particularly prominent for larger data sets. However, future research is needed to examine the relationship between problem size and tDSVMmem performance in more detail. With respect to the general appropriateness of tDSVMmem, the above results provide strong evidence for the algorithm being an effective and efficient approach to solve the tDSVM learning problem.
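The matched-samples test reported above can be reproduced along the following lines; the arrays hold the run-1 objective values of four data sets from Table 3 purely for illustration, not the full set of 84 paired observations:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired objective values of the two solvers (illustrative subset).
obj_mem = np.array([0.5352, 2.3533, 0.8714, 1.5441])
obj_cplex = np.array([0.5697, 2.4662, 0.8820, 2.0469])
stat, p_value = wilcoxon(obj_mem, obj_cplex)   # H0: equal median objective values
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```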
5.3. Evaluation of prediction performance

Having established the appropriateness of tDSVMmem from an optimization perspective, this section examines the ability of tDSVMmem to predict the class membership of working set objects with high accuracy. We begin once again with a comparison to tDSVMCPLEX because it solves the same mathematical program as tDSVMmem. Subsequently, we consider other transductive and inductive classifiers to extend the scope of the comparison in Section 5.4.

In general, the predictive performance of a classifier depends on the choice of meta-parameter settings (e.g. [9]). In the case of tDSVM, these parameters are α, C, and C* (see (2)). It is common practice to determine suitable parameter values empirically by testing a set of candidate values (e.g., [8,52,65]). In our case, the three experimental settings Ierror = low, medium, high can be seen as a simple means of parameter tuning, a three-value grid search, which helps to identify suitable parameter values. Fig. 5 illustrates the dependency of predictive performance on parameter values by means of a box plot. For each of the three settings α ∈ {0.1, 0.5, 0.9}, the corresponding box includes the results of tDSVMCPLEX (from all data sets and all randomized runs per data set). Fig. 5 reveals that the highest median AUC and ACC values follow from the setting with Ierror = high. Moreover, the performance of the classifiers corresponding to this setting varies less than in alternative settings. This suggests that tDSVMCPLEX performs more robustly when classification errors receive a larger weight in the objective. Finally, Fig. 5 indicates that the worst-case performance is highest in the Ierror = high setting. Together, these results suggest that the setting Ierror = high produces the strongest tDSVMCPLEX classifiers and thus the most challenging benchmark models. Consequently, we select this setting for subsequent comparisons of tDSVMmem and tDSVMCPLEX and set α = 0.1. We refrain from tuning C and C* in this paper, but use the following default rule:
$$C = n/l, \qquad C^* = n/u \qquad (13)$$
The intuition is that higher (lower) dimensionality will, in tendency, increase (decrease) the influence of w on the objective, relative to the influence of empirical and transductive errors. Multiplying the weights of classification errors with n compensates this effect. In a similar way, dividing the weights by the number of examples (l or u) ensures that the relative influence of classification errors does not increase (decrease) simply because of working with larger (smaller) data sets. Moreover, dividing C and C* by l and u implies that the relative importance of empirical errors compared to transductive errors depends on the prevalence of labeled and unlabeled objects in the data. This is a plausible choice for benchmark experiments that use data sets from different applications. We acknowledge that a more elaborate tuning of all three meta-parameters, α, C, and C*, is likely to improve predictive performance. However, this is true for both tDSVMmem and tDSVMCPLEX, as well as for any other classification method. Considering that tDSVMmem creates a classification model much faster than tDSVMCPLEX, allotting the same amount of tuning resources to both methods would give the former an advantage. In this sense, our approach to avoid extensive parameter tuning leads to a tougher benchmark for tDSVMmem.

The prediction performance of tDSVMmem and tDSVMCPLEX is compared in Table 6. In particular, Table 6 reports average performance estimates (over the five randomized runs per dataset; see Section 4), which we compute from working set objects. The results evidence that tDSVMmem predicts more accurately than tDSVMCPLEX. Clearly, the two solvers perform identically on the AC data set, where they both found the optimal solution (again marked by an asterisk in Table 6). On all other data sets, tDSVMmem outperforms the CPLEX benchmark in terms of ACC and AUC. Fig. 6 provides a more detailed view on the comparative performance of the two solvers across the five randomized runs per UCI data set. In particular, Fig. 6 depicts the distribution of the percentage differences in AUC and ACC between tDSVMmem and tDSVMCPLEX.
Fig. 5. Predictive performance of tDSVMCPLEX in terms of AUC and ACC across all data sets and randomized runs per data set for different settings of the meta-parameter α (high = 0.1, medium = 0.5, low = 0.9).
Table 6. Average prediction performance of tDSVMmem and tDSVMCPLEX on UCI data.

         AUC                                  ACC
D        tDSVMCPLEX  tDSVMmem  Gap (%)       tDSVMCPLEX  tDSVMmem  Gap (%)
AC^a     0.872       0.872      0.00         0.857       0.857      0.00
BC       0.984       0.996      1.20         0.930       0.968      4.09
GC       0.601       0.792     31.73         0.677       0.933     37.75
HE       0.666       0.820     23.10         0.741       0.778      4.93
IO       0.791       0.901     13.93         0.807       0.875      8.50
SP       0.865       0.918      6.04         0.807       0.842      4.35
AD       /           0.898     /             /           0.835     /

^a Both solvers find the proven optimal solution on this data set.
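For reference, the Gap columns report the relative improvement of tDSVMmem over tDSVMCPLEX. A minimal sketch of this computation follows (it recomputes from the rounded table entries, so results can deviate slightly from the reported gaps, which are presumably based on unrounded values):

```python
# Relative improvement (%) of tDSVMmem over tDSVMCPLEX, as in the Gap
# columns of Table 6; a subset of the rounded AUC entries for illustration.
auc_cplex = {"AC": 0.872, "BC": 0.984, "GC": 0.601, "HE": 0.666}
auc_mem   = {"AC": 0.872, "BC": 0.996, "GC": 0.792, "HE": 0.820}

for d in auc_cplex:
    gap = 100.0 * (auc_mem[d] - auc_cplex[d]) / auc_cplex[d]
    print(f"{d}: {gap:.2f}%")  # e.g., GC: 31.78% vs. 31.73% in the table
```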
To ensure consistency with previous results, we calculate the percentage differences such that negative values indicate superior performance of tDSVMmem. Note that we exclude AC and AD from the analysis. Results on AC do not differ between the two solvers because both find the optimal solution. We exclude AD because the CPLEX benchmark does not find a feasible solution within the time limit. Fig. 6 reveals that all performance differences are below zero. This shows that tDSVMmem not only provides more accurate class predictions on average, but consistently outperforms tDSVMCPLEX in all randomized runs. A formal comparison of the two solvers by means of a Wilcoxon signed rank test (including the AC results) confirms that tDSVMmem predicts significantly more accurately than tDSVMCPLEX (p-value < 0.0001 for both performance indicators). In addition, we can estimate the expected performance difference between the two classifiers (i.e., on new data not used in the study) by calculating the median of their observed performance differences [66]. The median performance difference between tDSVMCPLEX and tDSVMmem is 0.064 points for AUC and 0.043 points for ACC. Compared to the median prediction performance of tDSVMCPLEX, this equates to a relative improvement of 7.61% for AUC and about 5.31% for ACC.
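The testing procedure can be illustrated as follows (the arrays hold placeholder values; in the actual analysis, each entry corresponds to one randomized run on one data set):

```python
# Sketch of the solver comparison: a Wilcoxon signed rank test on paired
# AUC values, and the median paired difference as an estimate of the
# expected improvement [66]. Placeholder numbers, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

auc_mem   = np.array([0.87, 0.99, 0.80, 0.82, 0.90, 0.92])
auc_cplex = np.array([0.87, 0.98, 0.60, 0.67, 0.79, 0.86])

stat, p_value = wilcoxon(auc_mem, auc_cplex)  # paired, two-sided test
median_diff = np.median(auc_mem - auc_cplex)  # expected gain in AUC points
rel_improvement = 100.0 * median_diff / np.median(auc_cplex)
print(p_value, median_diff, rel_improvement)
```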
Fig. 6. Distribution of the percentage differences in ACC and AUC between tDSVMCPLEX and tDSVMmem across the randomized runs per data set for selected UCI data sets.
Table 7. Predictive performance of tDSVM compared to other transductive and inductive classifiers in terms of AUC.

D        tDSVM   TSVM    L1-SVM   DSVM
AC       0.872   0.923   0.926    0.894
BC       0.996   0.992   0.986    0.976
GC       0.792   0.723   0.801    0.771
HE       0.820   0.750   0.729    0.774
IO       0.901   0.871   0.852    0.846
SP       0.918   0.902   0.965    0.951
AD       0.898   0.890   0.892    0.900
Ø AUC    0.885   0.864   0.879    0.873
Ø rank   1.927   2.736   2.482    2.855

Table 8. Predictive performance of tDSVM compared to other transductive and inductive classifiers in terms of ACC.

D        tDSVM   TSVM    L1-SVM   DSVM
AC       0.857   0.865   0.861    0.826
BC       0.968   0.965   0.956    0.952
GC       0.933   0.744   0.764    0.738
HE       0.778   0.776   0.744    0.743
IO       0.875   0.853   0.865    0.855
SP       0.842   0.831   0.915    0.890
AD       0.835   0.718   0.839    0.848
Ø ACC    0.870   0.822   0.849    0.836
Ø rank   2.191   2.736   2.346    2.727
A relative improvement of this size will be meaningful in many applications. For example, increasing the accuracy of a targeting model in marketing by five percent will typically improve the profits resulting from the corresponding marketing campaign to a large degree (e.g. [67]). In this sense, our results indicate that the predictive advantage of tDSVMmem is not only statistically significant but also meaningful from a managerial perspective. Finally, it is worth commenting on the previous observation that a better objective value may translate into an inferior classifier [29]. We do not observe such a case in our comparison of tDSVMmem versus tDSVMCPLEX: the better solver consistently produces the more accurate classifier. In this sense, our results provide some evidence that the link between the objective of (2) and the accuracy of the resulting classifier is relatively stable. This might result from a strict compliance with ORM theory and thus be a feature of the discrete measurement of empirical and transductive errors. It might also result from the direct estimation of class labels for working set objects, and thus be a general feature of TL. Clarifying the specific origin of the strong connection between objective and accuracy might be a promising avenue for future research.

5.4. Comparison to other inductive and transductive classifiers

Having confirmed the suitability of tDSVMmem as a solver for program (2), we now turn our attention to the tDSVM classifier itself and examine its predictive performance in comparison to other inductive and transductive classifiers. In these comparisons, the tDSVM classifier is always constructed by means of tDSVMmem. First, we compare tDSVM to the original TSVM [18]. The two classifiers differ mainly in how they account for classification errors during training. TSVM uses continuous slack variables, whereas tDSVM adopts Orsenigo and Vercellis's approach to count classification errors with a discrete step function [27,28]. Comparing the two classifiers allows us to examine the effectiveness of a discrete error measurement in a TL setting. In addition, we consider two inductive classifiers, DSVM and a linear support vector machine with an L1-penalty (L1-SVM; e.g. [68,69]). Like the TSVM benchmark, DSVM shares several similarities with tDSVM. The main difference between the two concerns the underlying learning paradigm, induction versus transduction. Therefore, contrasting tDSVM with DSVM clarifies the value of using unlabeled data during classifier construction.2 The L1-SVM benchmark completes the set in the sense that it represents an inductive classifier without a discrete measurement of classification errors. Tables 7 and 8 depict the results of the comparison in terms of AUC and ACC, respectively.
2 Note that we use a simplified version of DSVM to increase the similarity with tDSVM. In particular, we solve problem (1) to devise a DSVM classifier, using a linear programming heuristic [28]. Orsenigo and Vercellis [27] employed an additional regularizer in their DSVM, which penalizes the number of non-zero coefficients in the normal vector.
All performance figures represent averages, computed over five randomized training/test set splits per data set. The last two rows report the average performance and the average rank per method, which we compute across the individual results of the five randomized runs. For each run, the best performing method receives a rank of one, the runner-up a rank of two, etc. (e.g. [64]); a small sketch of this ranking computation follows below. A first finding is that tDSVM provides, on average, the most accurate predictions. It achieves the highest average AUC and ACC, and also the lowest average rank across all methods. A closer look at the two transductive approaches reveals that tDSVM performs better than TSVM on all data sets but AC. Furthermore, a Wilcoxon signed rank test allows us to conclude that tDSVM classifies significantly more accurately than TSVM (p-value: 0.0074 for AUC; 0.0001 for ACC).3 Given the similarity between the mathematical programs underlying these two classifiers, the main factor that can explain a significant difference in classification performance is the discrete measurement of classification errors. Orsenigo and Vercellis found this approach to be highly effective in inductive settings. Our results extend this finding to transductive classification: measuring empirical and transductive errors by means of a discrete step function helps to increase the accuracy of the resulting classifier. This finding is consistent with ORM theory [21]. In this sense, our results provide further evidence that designing prediction methods that are not only data-driven but also well anchored in learning theory pays off. In comparison to the inductive classifiers, tDSVM gives better performance on average. However, its predictive advantage is smaller than in the TSVM case. L1-SVM in particular performs competitively. For example, L1-SVM beats tDSVM on three data sets when AUC is used as performance measure. In the case of ACC, tDSVM 'wins' four data sets, whereas no competitor secures more than one win. A statistical analysis confirms the competitive performance of L1-SVM and DSVM. In particular, our results do not provide sufficient evidence to conclude whether the AUC differences between tDSVM and L1-SVM are systematic or random (p-value of a Wilcoxon signed rank test: 0.466). The same holds true for the comparison of tDSVM and DSVM (p-value: 0.301). Using ACC as indicator of predictive performance, we find marginal evidence for the superiority of tDSVM over L1-SVM (p-value: 0.103), whereas the results allow concluding that tDSVM delivers significantly higher accuracy than DSVM (p-value: 0.038). Higher average accuracy, especially in terms of ACC, is an appealing result. In this sense, the analysis confirms the efficacy of TL and of tDSVM in particular. However, the inductive approaches perform surprisingly competitively. Given that tDSVM training includes the working set objects, one would expect the resulting model to classify these objects much more accurately than an inductive model.
3 Note that we use the individual results of the randomized runs per data set for statistical testing.
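The average ranks reported in Tables 7 and 8 can be computed as in the following sketch (placeholder values; the use of scipy is our own convenience, not part of the experimental code):

```python
# Per randomized run, classifiers are ranked by performance (rank 1 =
# best); ranks are then averaged over all runs, as in Demsar [64].
import numpy as np
from scipy.stats import rankdata

# rows = runs, columns = classifiers (tDSVM, TSVM, L1-SVM, DSVM)
auc = np.array([[0.87, 0.92, 0.93, 0.89],   # placeholder AUC values
                [0.99, 0.99, 0.98, 0.97]])

ranks = rankdata(-auc, axis=1)  # negate so that higher AUC receives rank 1
print(ranks.mean(axis=0))       # average rank per classifier; ties share ranks
```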
Fig. 7. Development of predictive performance of tDSVM and L1-SVM across different ratios l/u.
Arguably, our results do not show a substantial difference between tDSVM and the two inductive approaches. A possible explanation lies in our experimental setup. In all experiments, we use the same amount of labeled and unlabeled data (see Section 4). TL is typically employed when the amount of labeled data is small and unlabeled data is available in abundance (e.g. [18,70,71]). The marginal utility of additional examples during model building decreases once a classifier has seen 'enough' labeled examples [72]. With even amounts of labeled and unlabeled data in our experiments, the inductive classifiers may have received sufficient information (i.e., labeled examples) to reconstruct the relationship between class membership and attribute values with high accuracy. To test this, we run an additional set of experiments with varying amounts of labeled and unlabeled data using nine ratios l/u = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.25]. For each ratio, we construct five samples per data set with random training/test set assignments. We use these samples to construct tDSVM and L1-SVM classifiers, which we then compare in terms of AUC and ACC. We select L1-SVM for this analysis because our previous results suggest that it performs slightly better than the other inductive classifier on our data. The results are shown in Fig. 7. Fig. 7 reveals that tDSVM typically performs better than L1-SVM. For most data sets, tDSVM achieves higher predictive accuracy across all ratios of l to u. In the most extreme setting, where only one percent of the original labeled training set is available for model building, tDSVM outperforms L1-SVM on all but one data set. This evidences the importance of the relative amounts of labeled and unlabeled data when comparing inductive and transductive classifiers.
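The construction of these samples can be sketched as follows (a hypothetical helper built on scikit-learn; the sampling procedure used in our experiments may differ in detail):

```python
# For a target ratio r = l/u, draw l labeled training objects and treat
# the remaining u objects as the unlabeled working set.
import numpy as np
from sklearn.model_selection import train_test_split

def split_by_ratio(X, y, r, seed):
    frac_labeled = r / (1.0 + r)  # since l / (l + u) = r / (1 + r)
    X_l, X_u, y_l, y_u = train_test_split(
        X, y, train_size=frac_labeled, random_state=seed, stratify=y)
    return X_l, y_l, X_u, y_u     # y_u is withheld, used only for evaluation

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))   # synthetic attributes
y = (X[:, 0] > 0).astype(int)     # synthetic binary labels
X_l, y_l, X_u, y_u = split_by_ratio(X, y, r=0.05, seed=0)
print(len(y_l), len(y_u))         # 47 labeled vs. 953 unlabeled objects
```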
More specifically, the results of Fig. 7 confirm that the competitive performance of the inductive classifiers in the previous analysis (Tables 7 and 8) is largely explained by the ratio of labeled to unlabeled data. Another finding from Fig. 7 is that the gain in predictive performance from applying TL can be small, even if labeled data is scarce. Consider, for example, the ACC results on the IO data set. L1-SVM and tDSVM perform virtually the same across all ratios l/u, including the most extreme case l/u = 0.01. Some other data sets show a similar tendency. One may thus ask whether the additional computational effort of TL compared to L1-SVM is justified. It is difficult to give a general answer to this question. The value of improvements in predictive performance depends on the application context. A common view is that marginal improvements in accuracy can be very meaningful in business applications. This is typically explained with scaling effects: when a decision support model is used with high frequency, even a small increase in its effectiveness can improve business performance substantially. This has, for example, been observed in churn management (e.g. [67,73,74]). Targeting banner ads so as to increase click-through rates in online marketing is another example (e.g., [75]). Predictive performance is even more important when a model is used to aid medical decision making, for instance in cancer diagnosis (e.g. [76]), where the consequences of misclassification errors are dramatic. In this sense, there is always room for better, that is, more accurate, prediction methods, even if the degree to which they improve upon current standards is moderate. It is, however, important to choose the right applications for such methods. This is also apparent from the results on the SP data set, where L1-SVM outperforms tDSVM on all but the most extreme l/u ratio (see Fig. 7). Of all data sets used in this study, SP is the one with the largest number of attributes (see Table 2). One may speculate that the poor performance of tDSVM is partly due to high dimensionality. In particular, tDSVM implements the cluster assumption and strives to construct the decision surface in a low density area of the feature space. The distance between decision objects in the feature space increases with the number of attributes. As a consequence, low density areas are more likely to occur not only between objects of different classes but also between objects of the same class. It might then be misleading to rely on density-based information when creating the classification model, which could explain why tDSVM classifies SP objects less accurately; a small illustration of this distance effect follows below. However, future research is needed to better understand the relationship between dimensionality and TL efficacy and, more generally, to clarify what other factors besides the ratio of labeled to unlabeled data determine the appropriateness of TL. For now, we conclude that tDSVM is a viable alternative to inductive models that should be considered whenever class decisions are needed for a priori known working set objects. There is a good chance that tDSVM improves predictive performance, especially when the amount of labeled data is small relative to the unlabeled data.
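The distance effect referenced above can be verified with a small simulation (illustrative only, not part of our experiments): average pairwise distances between uniformly random points grow with the number of dimensions, so low density regions become ubiquitous in high-dimensional feature spaces.

```python
# Mean pairwise Euclidean distance of random points in the unit cube
# grows with the dimensionality d of the feature space.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 20, 200]:
    X = rng.uniform(size=(200, d))         # 200 random objects in d dimensions
    diffs = X[:, None, :] - X[None, :, :]  # pairwise coordinate differences
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(200, k=1)         # unique object pairs only
    print(d, round(dist[iu].mean(), 2))    # roughly 0.52, 1.81, 5.77
```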
6. Summary and conclusions

We set out to develop a novel transductive classifier which extends Orsenigo and Vercellis's DSVM methodology [27,28] to the TL setting. Drawing inspiration from Joachims's [18] formulation of a TSVM, we developed a combinatorial program that incorporates a discrete step function to count empirical and transductive errors on training and working set objects, respectively. We then devised a memetic algorithm to solve the MIP formulation and construct a tDSVM classification model. Our algorithm performs a direct optimization of the model parameters (w, b) and an indirect optimization of the decision variables y⋆_i. It employs population-based and local search operators to pursue the conflicting goals of low classification error and a large margin of separation, respectively.
In several empirical experiments on synthetic and real-world data, we examined the appropriateness of our tDSVM formulation, the effectiveness of tDSVMmem, and, more generally, the relative advantages of TL compared to inductive classification. Overall, our results warrant three main conclusions.
First, we find further support for Orsenigo and Vercellis's view that a discrete measurement of classification errors is preferable to a continuous error approximation during SVM training. In particular, we observed that tDSVM classifies working set objects significantly more accurately than the original TSVM. We attribute this to the discrete measurement of transductive errors, which represents the main difference between the two classifiers. In this sense, we show that the effectiveness of the DSVM framework [27,28] generalizes to TL.
Second, the comparisons of tDSVM to two inductive classifiers revealed that L1-SVM and DSVM predict only slightly less accurately when working with even amounts of labeled and unlabeled data, whereas tDSVM will typically give better prediction performance if labeled data is scarce. Hence, TL, and tDSVM in particular, is not guaranteed to be the better choice whenever the application requirements of TL are met. The transductive approach entails a higher computational cost (e.g., compared to L1-SVM), but will not always improve upon an inductive classifier. An important determinant of whether TL is preferable is the similarity or dissimilarity of the distributions of working and training set objects. Given that this factor may be difficult to quantify in real-world applications, a practical recommendation is to consider TL alongside – but not instead of – inductive methods whenever class predictions are sought for a set of known working set instances. In cases where the modeler faces a severe imbalance between labeled and unlabeled objects, TL is particularly appropriate because an inductive approach will have difficulties learning the relationship between attribute values and class memberships.
Third, we found strong evidence that tDSVMmem is a suitable approach to construct tDSVM classifiers. Despite its heuristic nature, it successfully identified optimal solutions in several cases. In all other cases, where optimality was not assured due to the difficulty of the problem instance, we were able to show that tDSVMmem performs at least as well as, and typically better than, CPLEX. This confirmed the effectiveness of tDSVMmem in solving the mathematical program underlying the tDSVM classifier. We also found that the classification models resulting from tDSVMmem predict significantly more accurately than the tDSVM models produced by CPLEX. This suggests that better solutions (lower objective values) translate into better classifiers (higher accuracy). Although this relationship is intuitive, previous results have shown that there is not always a one-to-one correspondence between objective values and predictive accuracy in inductive learning [29]. In this sense, our results suggest that the link between the two is more stable in a transductive setting, where decision objects are considered when constructing the classifier. One could argue that this makes the TL setting particularly suitable for developing novel (better) classification methods: the stronger the link between the objective values of a classifier learning program and the classifier's predictive performance, the more rewarding it is to develop better solvers.
Our study suggests several directions for further research, related either to the formulation of the tDSVM learning problem or to the novel solver tDSVMmem. With respect to the mathematical program, future research could test alternative formulations of the tDSVM program and compare these to the one proposed here. For example, instead of using a discrete measurement for both empirical and transductive errors, it would be possible to use the approximate error measurement of standard SVMs for empirical errors and reserve the discrete approach for working set objects. This would reduce the number of binary variables and thus accelerate the classifier construction process, especially in applications where a reasonable amount of labeled data is available.
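To sketch what such a hybrid program could look like, assume the usual soft-margin structure of (2): continuous slacks ξ_i would absorb empirical errors on the l labeled objects, while binary indicators θ_j count transductive errors on the u working set objects. The following is an illustrative sketch under this assumption, not the formulation proposed in this paper:

```latex
\begin{align*}
\min_{w,\,b,\,\xi,\,\theta,\,y^\star}\quad
  & \alpha \lVert w \rVert + C \sum_{i=1}^{l} \xi_i
    + C^\star \sum_{j=1}^{u} \theta_j \\
\text{s.t.}\quad
  & y_i \,(w^\top x_i + b) \ge 1 - \xi_i,
    && i = 1,\ldots,l, \\
  & y_j^\star \,(w^\top x_j + b) \ge 1 - M\,\theta_j,
    && j = 1,\ldots,u, \\
  & \xi_i \ge 0, \quad \theta_j \in \{0,1\}, \quad y_j^\star \in \{-1,+1\},
\end{align*}
```

where M is a sufficiently large constant; the products y⋆_j (w·x_j + b) would have to be linearized, e.g., through standard big-M reformulations, to obtain a MIP with binary variables only for the working set objects.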
Such analysis would also deepen our understanding of how and under which conditions a stricter compliance with the ORM principle leads to better classifiers. Eventually, initiatives along this line could lead to a family of alternative programming formulations and decision rules for choosing among these formulations on the basis of data set characteristics. In a similar fashion, tDSVMmem is a first approach to construct tDSVM classifiers. Our results evidence the collective efficacy of the search operators incorporated in tDSVMmem. Future research could examine their effectiveness in isolation. In particular, running tDSVMmem with just one of the three operators proposed in this work (or with different pairs of operators, respectively) facilitates appraising the extent to which each operator contributes individually to the performance of tDSVMmem. This insight could lead to a revision of our operators or the development of novel operators, respectively, to further improve tDSVMmem. On a wider scope, future research could examine the potential of alternative approaches to construct tDSVM classifiers. We have argued for the appropriateness of the memetic framework for the focal programming problem. However, experiments with other metaheuristics could provide valuable insight to confirm this view or, possibly, to identify an even more suitable search philosophy. Finally, the TL setting is rarely considered outside Machine Learning. As we show in this work, building a transductive classification model often involves solving a MIP. Given its excellence in mixed integer programming, the OR community could undoubtedly contribute greatly toward the further advancement of TL. Analyzing TL programming problems and the corresponding solution spaces from a theoretical angle, and designing tailor-made exact or heuristic solvers, exemplify the potential for interdisciplinary work in the field. Our work makes a first step in this direction, which we believe is a promising avenue for future research at the interface of OR and data mining.

References

[1] X. Bai, R. Padman, J. Ramsey, P. Spirtes, Tabu search-enhanced graphical models for classification in high dimensions, INFORMS Journal on Computing 20 (3) (2008) 423–437.
[2] O.L. Mangasarian, W.N. Street, W.H. Wolberg, Breast cancer diagnosis and prognosis via linear programming, Operations Research 43 (4) (1995) 570–577.
[3] D. West, P. Mangiameli, R. Rampal, V. West, Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application, European Journal of Operational Research 162 (2) (2005) 532–551.
[4] W. Gehrlein, B. Wagner, A two-stage least cost credit scoring model, Annals of Operations Research 74 (0) (1997) 159–171.
[5] D. West, S. Dellana, J. Qian, Neural network ensemble strategies for financial decision applications, Computers & Operations Research 32 (10) (2005) 2543–2559.
[6] J. Hadden, A. Tiwari, R. Roy, D. Ruta, Computer assisted customer churn management: state-of-the-art and future trends, Computers & Operations Research 34 (10) (2007) 2902–2917.
[7] J. Qi, L. Zhang, Y. Liu, L. Li, Y. Zhou, Y. Shen, L. Liang, H. Li, ADTreesLogit model for customer churn prediction, Annals of Operations Research 168 (1) (2009) 247–265.
[8] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, B. Baesens, New insights into churn prediction in the telecommunication sector: a profit driven data mining approach, European Journal of Operational Research 218 (1) (2012) 211–229.
[9] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2009.
[10] D.J. Hand, W.E. Henley, Statistical classification models in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A (General) 160 (3) (1997) 523–541.
[11] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[12] N. Freed, F. Glover, Linear programming and statistical discrimination – the LP side, Decision Sciences 13 (1) (1982) 172–175.
[13] O.L. Mangasarian, Linear and nonlinear separation of patterns by linear programming, Operations Research 13 (3) (1965) 444–452.
[14] P.A. Rubin, Solving mixed integer classification problems by decomposition, Annals of Operations Research 74 (0) (1997) 51–64.
[15] F. Talla Nobibon, R. Leus, F.C.R. Spieksma, Optimization models for targeted offers in direct marketing: exact and heuristic algorithms, European Journal of Operational Research 210 (3) (2011) 670–683.
[16] J. Zhang, Y. Shi, P. Zhang, Several multi-criteria programming methods for classification, Computers & Operations Research 36 (3) (2009) 823–836.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[18] T. Joachims, Transductive inference for text classification using support vector machines, in: I. Bratko, S. Dzeroski (Eds.), Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, Bled, Slovenia, 1999, pp. 200–209.
[19] M.M. Silva, T.T. Maia, A.P. Braga, An evolutionary approach to transduction in support vector machines, in: Proceedings of the 5th International Conference on Hybrid Intelligent Systems, 2005, pp. 329–334.
[20] H. Brandner, S. Lessmann, S. Voß, Support of managerial decision making by transductive learning, in: A. Bernstein, G. Schwabe (Eds.), Proceedings of the 10th International Conference on Wirtschaftsinformatik, Zurich, Switzerland, 2011, pp. 973–982.
[21] V. Cherkassky, F.M. Mulier, Learning from Data: Concepts, Theory, and Methods, second ed., Wiley & Sons, New Jersey, 2007.
[22] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[23] K.P. Bennett, A. Demiriz, Semi-supervised support vector machines, in: M.J. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems, vol. 11, MIT Press, Cambridge, 1999, pp. 368–374.
[24] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[25] G. Fung, O.L. Mangasarian, Semi-supervised support vector machines for unlabeled data classification, Optimization Methods and Software 15 (2001) 29–44.
[26] V. Vapnik, S. Kotz, Estimation of Dependences Based on Empirical Data, second ed., Springer, New York, 2006.
[27] C. Orsenigo, C. Vercellis, Multivariate classification trees based on minimum features discrete support vector machines, IMA Journal of Management Mathematics 14 (3) (2003) 221–234.
[28] C. Orsenigo, C. Vercellis, Discrete support vector decision trees via tabu-search, Computational Statistics & Data Analysis 47 (2) (2004) 311–322.
[29] M. Caserta, S. Lessmann, S. Voß, A novel approach to construct discrete support vector machine classifiers, in: A. Fink, B. Lausen, W. Seidel, A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence, Springer, Berlin, 2010, pp. 115–125.
[30] C. Orsenigo, C. Vercellis, Evaluating membership functions for fuzzy discrete SVM, in: Proceedings of the 7th International Workshop on Fuzzy Logic and Applications, Springer, Berlin, 2007, pp. 187–194.
[31] C. Orsenigo, C. Vercellis, Softening the margin in discrete SVM, in: P. Perner (Ed.), Proceedings of the 7th Industrial Conference on Data Mining, Springer, Berlin, 2007, pp. 49–62.
[32] C. Orsenigo, C. Vercellis, Multicategory classification via discrete support vector machines, Computational Management Science 6 (1) (2009) 101–114.
[33] C. Orsenigo, C. Vercellis, Combining discrete SVM and fixed cardinality warping distances for multivariate time series classification, Pattern Recognition 43 (11) (2010) 3787–3794.
[34] R.I. Bot, N. Lorenz, Optimization problems in statistical learning: duality and optimality conditions, European Journal of Operational Research 213 (2) (2011) 395–404.
[35] A. Frank, A. Asuncion, UCI Machine Learning Repository, Tech. Rep., School of Information and Computer Science, University of California, Irvine, CA, USA, 2010.
[36] E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multi-class to binary: a unifying approach for margin classifiers, Journal of Machine Learning Research 1 (2000) 113–141.
[37] V. Cherkassky, Y. Ma, Another look at statistical learning theory and regularization, Neural Networks 22 (7) (2009) 958–969.
[38] P.S. Bradley, U.M. Fayyad, O.L. Mangasarian, Mathematical programming for data mining: formulations and challenges, INFORMS Journal on Computing 11 (3) (1999) 217–238.
[39] O. Chapelle, V. Vapnik, J. Weston, Transductive inference for estimating values of functions, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, 2000, pp. 421–427.
[40] P. Moscato, On Evolution, Search, Optimization, Genetic Algorithms and Martial Arts – Towards Memetic Algorithms, Tech. Rep., Caltech Concurrent Computation Program Report 826, CalTech, Pasadena, CA, USA, 1989.
[41] W.E. Hart, N. Krasnogor, J.E. Smith, Recent Advances in Memetic Algorithms, first ed., Springer, Berlin, 2005.
[42] R. Dawkins, The Selfish Gene, Oxford University Press, 1976.
[43] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[44] H. Beyer, H. Schwefel, Evolution strategies – a comprehensive introduction, Natural Computing 1 (1) (2002) 3–52.
[45] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, first ed., Addison-Wesley, Boston, 1989.
[46] G.J. Köhler, Improper linear discriminant classifiers, European Journal of Operational Research 50 (2) (1991) 188–198.
[47] G.J. Köhler, Characterization of unacceptable solutions in LP discriminant analysis, Decision Sciences 20 (2) (1989) 239–257.
[48] G.J. Köhler, Considerations for mathematical programming models in discriminant analysis, Managerial and Decision Economics 11 (4) (1990) 227–234.
[49] V. Sindhwani, S.S. Keerthi, Large scale semi-supervised linear SVMs, in: E.N. Efthimiadis, S.T. Dumais, D. Hawking, K. Järvelin (Eds.), Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 2006, pp. 477–484.
[50] Y. Chen, G. Wang, S. Dong, Learning with progressive transductive support vector machine, Pattern Recognition Letters 24 (12) (2003) 1845–1855.
[51] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
[52] S. Lessmann, S. Voß, A reference model for customer-centric data mining with support vector machines, European Journal of Operational Research 199 (2) (2009) 520–530.
[53] D.R. Musicant, NDC: Normally Distributed Clustered Datasets, 1998.
[54] C. Audet, P. Hansen, A. Karam, C. Ng, S. Perron, Exact l2-norm plane separation, Optimization Letters 2 (4) (2008) 483–495.
[55] F. Plastria, S. De Bruyne, E. Carrizosa, Alternating local search based VNS for linear classification, Annals of Operations Research 174 (1) (2010) 121–134.
[56] A. Karam, G. Caporossi, P. Hansen, Arbitrary-norm hyperplane separation by variable neighbourhood search, IMA Journal of Management Mathematics 18 (2) (2007) 173–189.
[57] O.L. Mangasarian, D.R. Musicant, Active support vector machine classification, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, 2000, pp. 577–583.
[58] P.S. Bradley, O.L. Mangasarian, J.B. Rosen, Parsimonious least norm approximation, Computational Optimization and Applications 11 (1) (1998) 5–21.
[59] R.S. Sexton, S. McMurtrey, D. Cleavenger, Knowledge discovery using a neural network simultaneous optimization algorithm on a real world classification problem, European Journal of Operational Research 168 (3) (2006) 1009–1018.
[60] D. Bertsimas, R. Shioda, Classification and regression via integer optimization, Operations Research 55 (2) (2007) 252–271.
[61] Y. Marinakis, M. Marinaki, M. Doumpos, N. Matsatsinis, C. Zopounidis, A hybrid ACO-GRASP algorithm for clustering analysis, Annals of Operations Research 188 (1) (2009) 343–358.
[62] S.F. Crone, S. Lessmann, R. Stahlbock, The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing, European Journal of Operational Research 173 (3) (2006) 781–800.
[63] D. Martens, B. Baesens, T. Van Gestel, Decompositional rule extraction from support vector machines by active learning, IEEE Transactions on Knowledge and Data Engineering 21 (2) (2009) 178–191.
[64] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[65] T. Van Gestel, B. Baesens, J.A.K. Suykens, D. Van den Poel, D.-E. Baestaens, M. Willekens, Bayesian kernel based classification for financial distress detection, European Journal of Operational Research 172 (3) (2006) 979–1003.
[66] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences 180 (10) (2010) 2044–2064.
[67] S.A. Neslin, S. Gupta, W. Kamakura, J. Lu, C.H. Mason, Defection detection: measuring and understanding the predictive accuracy of customer churn models, Journal of Marketing Research 43 (2) (2006) 204–211.
[68] P. Bradley, O. Mangasarian, W. Street, Feature selection via mathematical programming, INFORMS Journal on Computing 10 (2) (1998) 209–217.
[69] K.P. Bennett, Combining support vector and mathematical programming methods for classification, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, 1999, pp. 307–326.
[70] R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, Journal of Machine Learning Research 7 (2006) 1687–1712.
[71] Y. Wang, S.-T. Huang, Training TSVM with the proper number of positive samples, Pattern Recognition Letters 26 (14) (2005) 2187–2194.
[72] C. Perlich, F. Provost, J.S. Simonoff, W.W. Cohen, Tree induction vs. logistic regression: a learning-curve analysis, Journal of Machine Learning Research 4 (2) (2003) 211–255.
[73] B. Baesens, S. Viaene, D. Van den Poel, J. Vanthienen, G. Dedene, Bayesian neural network learning for repeat purchase modelling in direct marketing, European Journal of Operational Research 138 (1) (2002) 191–211.
[74] W. Buckinx, D. Van den Poel, Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting, European Journal of Operational Research 164 (1) (2005) 252–268.
[75] C. Perlich, B. Dalessandro, R. Hook, O. Stitelman, T. Raeder, F.J. Provost, Bid optimizing and inventory scoring in targeted online advertising, in: Q. Yang, D. Agarwal, J. Pei (Eds.), Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2012, pp. 804–812.
[76] D. West, P. Mangiameli, R. Rampal, V. West, Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application, European Journal of Operational Research 162 (2) (2005) 532–551.