European Journal of Operational Research 230 (2013) 581–595
A memetic approach to construct transductive discrete support vector machines

Hubertus Brandner, Stefan Lessmann*, Stefan Voß

Institute of Information Systems, Department for Business and Economics, University of Hamburg, Von-Melle-Park 5, D-20146 Hamburg, Germany
Article history: Received 2 May 2012; Accepted 6 May 2013; Available online 23 May 2013

Keywords: Data mining; Transductive learning; Support vector machines; Memetic algorithms; Combinatorial optimization
Abstract

Transductive learning involves the construction and application of prediction models to classify a fixed set of decision objects into discrete groups. It is a special case of classification analysis with important applications in web-mining, corporate planning and other areas. This paper proposes a novel transductive classifier that is based on the philosophy of discrete support vector machines. We formalize the task to estimate the class labels of decision objects as a mixed integer program. A memetic algorithm is developed to solve the mathematical program and, thereby, to construct a transductive support vector machine classifier. Empirical experiments on synthetic and real-world data evidence the effectiveness of the new approach and demonstrate that it identifies high-quality solutions in a short time. Furthermore, the results suggest that the class predictions following from the memetic algorithm are significantly more accurate than the predictions of a CPLEX-based reference classifier. Comparisons to other transductive and inductive classifiers provide further support for our approach and suggest that it performs competitively with respect to several benchmarks. (c) 2013 Elsevier B.V. All rights reserved.
1. Introduction

Classification analysis is an important approach to support decision making in various disciplines including medical diagnosis, information retrieval, risk management and marketing. A classification model categorizes objects into disjoint groups. The group assignment is based on a set of attributes that characterize the objects. Depending on the application, the objects can, e.g., represent patients who are to be categorized into medical risk groups on the basis of symptoms, clinical tests, or their health behavior (e.g. [1-3]). Similarly, financial institutions discriminate between high and low risk loan applicants to support money lending decisions (e.g. [4,5]), and service companies divide customers into loyal clients and likely churners to target retention programs to the right customers (e.g., [6-8]). Independent of the application, classification analysis always aims at constructing a model that predicts group memberships with high accuracy. The prevailing approach toward classification is to employ a sample of objects with known group memberships. The relationship between the attribute values of these objects and the corresponding class labels is then inferred in an inductive manner (e.g. [9]). Several techniques pursuing this principle have been proposed in Statistics, Machine Learning, and Operations Research.
* Corresponding author. Tel.: +49 (0)40 42838 4706; fax: +49 (0)40 42838 5535.
E-mail addresses: [email protected] (H. Brandner), [email protected] (S. Lessmann), [email protected] (S. Voß).

(c) 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ejor.2013.05.010
Statistical classifiers often rely on probability theory and estimate the conditional probability (i.e., the a posteriori probability) of an object belonging to a class given the object's attribute values (e.g. [10]). Many machine learning methods adopt a data-driven paradigm. For example, tree-based classifiers recursively partition a data set through a sequence of tests on attribute values (e.g. [11]). Eventually, this produces a clear separation of objects of disjoint classes. Operations Research methods typically ground on linear and mixed integer programming (e.g. [12-16]).

In this work, we consider the transductive learning (TL) setting [17]. Standard (inductive) classification aims at creating a global prediction model that facilitates classifying arbitrary decision objects. TL differs from this approach in that it advocates a direct estimation of group memberships for a fixed set of objects, called the working set, which the decision maker knows in advance. A transductive model can be characterized as a local model that is applicable to working set objects only. The main advantage of TL compared to the more general classification setting is that the additional constraint of a fixed working set simplifies the learning task [17]. This, in turn, will often facilitate more accurate class predictions for working set objects (e.g. [18,19]). With respect to the applicability of TL, it has been shown that several important corporate planning tasks do not require a global model and could potentially benefit from TL [20]. Consequently, developing and testing transductive classification models is an important task to support decision making in organizations.
From the perspective of inferential statistics, TL is simpler than inductive classification because it explicitly considers the working set objects when building the local classification model (e.g., [21]). In other words, the objects for which classification accuracy matters are taken into account during classifier construction. However, this approach brings about new algorithmic challenges. First, it is not obvious how to best exploit the predictive information contained in the working set. Second, creating a transductive classifier involves working with both labeled and unlabeled data. This is because the class labels of working set objects are (by definition) unknown. Accommodating labeled and unlabeled data in a learning algorithm is a nontrivial task in its own right.

In this work, we propose solutions to these challenges and develop a novel transductive classifier. Our approach is based on two foundations. First, it relies on the principle of maximal margin separation, which has been put forward in the context of support vector machine (SVM) classifiers (e.g., [22]). The maximal margin principle is also a common approach toward TL (e.g. [18,23-25]). According to the overall risk minimization (ORM) theory, maximizing the margin of separation of a linear classifier helps to minimize a bound of the classifier's error on working set objects (e.g. [21,26]). Second, our algorithm builds upon the Discrete Support Vector Machine (DSVM) of Orsenigo and Vercellis [27]. DSVMs improve upon standard SVMs in the sense that they implement the principles of statistical learning more accurately [28]. To achieve this, Orsenigo and Vercellis propose to capture classification errors through a discrete step function, which is exactly the notion of errors used in the risk bounds of statistical learning theory. In a large number of simulations, Orsenigo and Vercellis as well as others show that the discrete error measurement produces highly accurate classifiers that outperform standard SVMs and other challenging benchmarks under several experimental conditions [27-33]. We hypothesize that the appropriateness of a discrete error measurement extends to the TL setting. A first contribution of this paper is thus the development of a transductive DSVM (tDSVM) classifier.

Building a transductive classification model is challenging from an algorithmic point of view. In general, classifier construction involves optimizing some measure of model fit over the objects of the training set. Inductive classification algorithms often rely on continuous optimization methods (e.g. [34]). In contrast, within TL the mathematical program underlying classifier construction is typically a mixed integer program (MIP) (e.g. [18,23]). This is also true for Orsenigo and Vercellis's DSVM classifier and our tDSVM in particular. A second contribution of this paper is associated with the development of a memetic algorithm to solve the MIP underlying tDSVMs. Our approach, which we call tDSVMmem, incorporates population-based and local search operators. We design these operators so as to account for characteristics of the MIP underlying our tDSVM classifier. Additional characteristics of tDSVMmem include a self-adaptive tuning of endogenous strategy parameters and an inheritance of solution characteristics.

We test the effectiveness of tDSVMmem through several empirical experiments on synthetic data and real-world data from the UCI Machine Learning Repository [35]. The results show that tDSVMmem performs significantly better than CPLEX.
More specifically, whenever the two solvers find the same solution, this solution is also the optimal solution of the corresponding problem instance. Whenever finding an optimal solution is computationally infeasible, tDSVMmem gives significantly better objective values than a truncated CPLEX benchmark (i.e., better than the best objective value obtained with CPLEX for a given time limit of reasonable length). We also find that tDSVMmem produces classification models that predict significantly more accurately than the CPLEX-based reference classifier. These results confirm the appropriateness of our approach and suggest that tDSVMmem is well suited to
construct tDSVM classifiers. Regarding the tDSVM classifier itself, we conduct several experiments to assess its predictive performance in comparison to other inductive and transductive methods. The results confirm the effectiveness of a discrete error measurement in TL settings. Furthermore, we find that tDSVM often, but not always, performs better than inductive classifiers. This suggests that TL and tDSVM in particular are not necessarily preferable to inductive classifiers, even if class predictions are sought for a known group of working set objects only. Through a set of follow-up experiments, we gain some insight into what factors influence the suitability of TL. For example, we observe that the ratio between labeled and unlabeled examples in a data set is an important determinant of TL success. Overall, the analysis allows us to provide some practical recommendations under which circumstances TL is preferable to an inductive approach.

A general implication of our study is that it emphasizes the efficacy of relatively simple heuristic procedures for combinatorial optimization under the condition that the focal problem is well understood, appropriately formalized in a mathematical model, and that the search operators within the heuristic are well adapted to this formulation. Our tDSVM formulation is well grounded in theory and thus captures the learning task in a suitable way. On this basis, a carefully selected set of standard search mechanisms suffices to devise an effective solver and obtain promising results. On the one hand, this evidences the power and generality of the heuristic search framework. On the other hand, it puts the popular approach to extend this framework and invent novel metaheuristics somewhat into perspective. Efforts related to the development of novel metaheuristics are best geared toward novel problems, whereas the techniques known today are well suited to approach a wide range of standard combinatorial problems. We provide empirical evidence in favor of this view for the problem of building tDSVM classifiers, which can be considered a further contribution of our study.

The paper is organized as follows: Section 2 introduces the original DSVM classifier and explains our modifications to extend it to the TL setting. We then develop our memetic algorithm in Section 3. Section 4 introduces the design of our empirical study. The corresponding results are presented and discussed from an optimization and predictive modeling point of view in Section 5. We conclude the paper with a summary of the main findings and an outlook on future research in Section 6.

2. Classification with transductive discrete support vector machines

The objective of a classification model is to group objects $x_j^* \in \mathbb{R}^n$ into fixed, disjoint classes $y_j^*$. In other words, a classifier defines a mapping from objects to classes $f: x \mapsto y$. An object is characterized by a set of n attributes. The fundamental assumption of classification analysis is that attribute values determine class memberships. However, the specific (functional) relationship between attribute values and class memberships is unknown. A classification method strives to reconstruct this relationship from a training sample L that consists of objects with known class labels, $L = \{x_i, y_i\}_{i=1}^{l}$. The model resulting from this step facilitates predicting the class memberships of novel objects $U = \{x_j^*\}_{j=1}^{u}$. Without loss of generality [36], we concentrate on binary classification problems in this paper and assume that $y_j^* \in \{\pm 1\}\ \forall j$.

2.1. Discrete support vector machines

SVMs are a popular approach toward classification. They are inspired by statistical learning theory and the principle of structural risk minimization (SRM) in particular [17].
Roughly speaking, it can be shown that a SVM classifier is optimal in the sense that it explicitly minimizes a bound of the classification error on novel objects (e.g. [37]). The error on novel objects (not contained in the training set) is called the generalization error, as opposed to the empirical error, which is measured (in-sample) on training set objects. The concept of SVMs is to separate examples $\{x_i\}_{i=1}^{l}$ into the two groups $y = \pm 1$ by means of a hyperplane $H = w \cdot x + b$. The hyperplane facilitates classifying novel objects $x_j^*$ according to their position relative to the plane (i.e., below or above H). Such a classification model is characterized by two parameters, the normal w and intercept b of the hyperplane. To build a SVM classifier, these parameters are estimated from L by maximizing the margin between the closest objects of adjacent classes while avoiding misclassifications (e.g. [22]). The idea of a maximal margin separation provides the link to the SRM principle. It ensures that the final classifier generalizes well to novel objects not contained in L [17]. Orsenigo and Vercellis note that the particular way in which SVMs deal with classification errors only approximates the risk bound derived within the SRM framework [27,28]. They recommend a stricter compliance with the SRM philosophy to obtain more accurate class predictions. To achieve this, they identify classification errors through binary indicator variables $h_i$ in their DSVM framework. This idea leads to the following combinatorial program to construct DSVM classifiers [27,28]:
$$\min_{w,b,h} \ \alpha \|w\|_1 / 2 + (1-\alpha)\, C \sum_{i=1}^{l} h_i \qquad (1)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) + Q h_i \ge 1 \quad \forall i \in \{1,\dots,l\}$$
$$h_i \in \{0,1\} \quad \forall i \in \{1,\dots,l\}$$

The objective balances two conflicting goals: achieving a large margin and low classification error. The margin of separation is maximized by minimizing the norm of the hyperplane's normal w (e.g., [22]). Using the $L_1$ norm ensures linearity of the objective (e.g. [38]). The second term in the objective minimizes the number of classification errors. More specifically, the first constraint ensures that training objects $\{x_i\}_{i=1}^{l}$ are considered as classification errors if they are either on the wrong side of the hyperplane or fall into the margin of separation (gray tube in Fig. 1). The approach to consider all objects within the margin as classification errors, including those that are actually on the correct side of the hyperplane, is standard in SVM learning (e.g. [22]). The characteristic feature of DSVMs is to count these errors through $h_i$, whereas standard SVMs approximate classification errors through continuous slack variables.
The meta-parameter $\alpha$ allows decision makers to control the trade-off between error minimization and margin maximization, and the constant Q is a sufficiently large number [31]. Orsenigo and Vercellis extend their DSVM formulation in several ways to enable, e.g., fuzzy classification [30], time-series classification [33], or multi-category classification [32]. These and other studies (e.g. [29,30]) have shown that a closer correspondence with SRM through a discrete measurement of classification errors increases the prediction performance of the resulting classifier.

2.2. Transductive discrete support vector machines

A transductive classifier is by definition aware of the location (in the attribute space) of the objects in the working set U. Transductive SVMs strive to capitalize on this additional information by means of extending the maximal margin principle to the objects of U. The class predictions of working set objects follow directly from their location relative to the separating hyperplane. Therefore, an optimal hyperplane is one that (i) achieves the largest possible margin on the labeled objects in L, (ii) produces the largest possible margin on the unlabeled objects in U given their grouping through the hyperplane, and (iii) achieves minimal classification error (e.g., [17,18]). Geometrically, this is equivalent to positioning the separating hyperplane into a low density region of the attribute space [39]. We illustrate this concept with an example in Fig. 1.

Fig. 1 depicts objects of a possible training and working set, respectively. More specifically, the symbols + and - represent the objects and class labels of L, whereas the objects of the working set U are represented by separate (unlabeled) markers. The class labels of working set objects (i.e., the class predictions $\{y_j^*\}$) follow from their position relative to the transductive classifier (solid line). The dashed line represents the inductive classifier that results from solving (1) for the training set L. Comparing the solution of the inductive classifier and the transductive classifier, it is clear that the latter achieves a larger margin on working set objects. Recall that the location of working set objects remains unknown in an inductive approach. Consequently, the resulting classifier is biased toward the data in L. Although the actual class labels of working set objects are unknown, a visual inspection suggests that the transductive solution captures the true relationship between an object's location and its class membership more closely. This suggests that the class predictions $\{y_j^*\}_{j=1}^{u}$ resulting from the transductive approach are likely to be more accurate than the predictions of the inductive classifier.
Our visual argument is also supported by learning theory. In particular, the ORM theory shows that considering the margin on both training and working set objects leads to a tighter bound of the generalization error, compared to the risk bound developed in the inductive SRM framework [21]. In order to implement the maximal margin principle for TL within Orsenigo and Vercellis's DSVM framework [27,28], we combine their idea of a discrete measurement of classification errors with Joachims' formulation of a transductive SVM [18] and propose the following MIP to construct a tDSVM classifier:
$$\min_{w,b,h,h^*,y^*} \ \alpha \|w\|_1 / 2 + (1-\alpha)\left(C \sum_{i=1}^{l} h_i + C^* \sum_{j=1}^{u} h_j^*\right) \qquad (2)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) + Q h_i \ge 1 \quad \forall i \in \{1,\dots,l\}$$
$$y_j^* (w \cdot x_j^* + b) + Q h_j^* \ge 1 \quad \forall j \in \{1,\dots,u\}$$
$$h_i \in \{0,1\} \quad \forall i \in \{1,\dots,l\}$$
$$h_j^* \in \{0,1\} \quad \forall j \in \{1,\dots,u\}$$
$$y_j^* \in \{\pm 1\} \quad \forall j \in \{1,\dots,u\} \qquad (3)$$

Fig. 1. Transductive SVM (solid line) and inductive SVM (dashed line) in a linearly non-separable case in $\mathbb{R}^2$.
We define $h_j^*$ such that $h_j^* > 0$ if and only if $x_j^*$ is located in the margin of separation (gray tube in Fig. 1), $\forall j \in \{1,\dots,u\}$:

$$h_j^* := \begin{cases} 1, & \text{if } x_j^* \text{ is within the margin} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
The main difference between tDSVM (2) and Joachims' transductive SVM [18] pertains to the use of the binary indicator variables $h_j^*$ to capture misclassifications among working set examples. This approach extends Orsenigo and Vercellis's idea to count classification errors by means of a discrete step function [27,28] to the TL setting. Subsequently, we refer to working set objects within the margin as transductive errors. Recall that misclassified objects of the training set are called empirical errors. The second constraint in (2) ensures that working set examples $\{x_j^*\}_{j=1}^{u}$ are either outside the margin or counted as transductive errors.
Fig. 2. Simplified generation cycle of memetic algorithms.
The objective incorporates these transductive errors (last term) and ensures that the final classifier is 'far away' from working set objects. The meta-parameters C and $C^*$ allow controlling the relative influence of transductive versus empirical errors on the objective. The consideration of transductive errors in the objective provides an incentive to push the separating hyperplane into a region with a low density of unlabeled objects. Assuming that objects of the same class are 'close' to each other and reside in the same cluster of the space, it is likely that this approach eventually leads to a large margin on working set examples. In view of the ORM principle, a large margin should result in better predictions [24]. Put differently, given that TL aims only at generating class predictions for the fixed working set $\{x_j^*\}_{j=1}^{u}$ and given that these predictions $\{y_j^*\}_{j=1}^{u}$ are explicitly considered in the optimization, it
is reasonable to assume that tDSVM achieves higher accuracy on working set examples compared to an inductive classifier.
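To make formulation (2) concrete, the sketch below states it with the open-source PuLP modeler rather than the CPLEX interface used later in the paper. The variable split $w = w^+ - w^-$ linearizes the $L_1$ norm, and an auxiliary binary $z_j$ with $y_j^* = 2z_j - 1$ linearizes the product $y_j^*(w \cdot x_j^* + b)$ via big-M constraints; the constants Q and M and the default weights are illustrative modeling choices, not taken from the paper.

```python
import pulp

def build_tdsvm_mip(X_lab, y_lab, X_unl, alpha=0.1, Q=100.0, M=100.0):
    """Sketch of the tDSVM MIP (2); Q and M are illustrative big-M values,
    C and C* follow the default rule (13) discussed in Section 5.3."""
    l, n = X_lab.shape
    u = X_unl.shape[0]
    C, C_star = n / l, n / u

    prob = pulp.LpProblem("tDSVM", pulp.LpMinimize)
    # ||w||_1 is linearized through the split w = wp - wm with wp, wm >= 0
    wp = [pulp.LpVariable(f"wp{d}", lowBound=0) for d in range(n)]
    wm = [pulp.LpVariable(f"wm{d}", lowBound=0) for d in range(n)]
    b = pulp.LpVariable("b")
    h = [pulp.LpVariable(f"h{i}", cat="Binary") for i in range(l)]    # empirical errors
    hs = [pulp.LpVariable(f"hs{j}", cat="Binary") for j in range(u)]  # transductive errors
    z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(u)]    # encodes y*_j = 2 z_j - 1

    norm_w = pulp.lpSum(wp) + pulp.lpSum(wm)
    prob += 0.5 * alpha * norm_w + (1 - alpha) * (
        C * pulp.lpSum(h) + C_star * pulp.lpSum(hs))

    def plane(x):  # w . x + b as a linear expression
        return pulp.lpSum((wp[d] - wm[d]) * float(x[d]) for d in range(n)) + b

    for i in range(l):  # first constraint of (2)
        prob += float(y_lab[i]) * plane(X_lab[i]) + Q * h[i] >= 1
    for j in range(u):  # second constraint of (2), linearized over z_j
        prob += plane(X_unl[j]) + Q * hs[j] + M * (1 - z[j]) >= 1  # active if y*_j = +1
        prob += -plane(X_unl[j]) + Q * hs[j] + M * z[j] >= 1       # active if y*_j = -1
    return prob
```

Any solver supported by PuLP, including CPLEX, can then be invoked through prob.solve().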
3. A memetic algorithm for tDSVM learning

In order to solve our tDSVM program (2), we employ the framework of memetic search [40]. A memetic algorithm (MA) is a population-based heuristic for global optimization inspired by cultural evolution (e.g. [41]). The notion of memes goes back to [42]. In contrast to genes in genetic algorithms [43], memes are refined separately and their individual improvements are passed on to the next generation. Fig. 2 illustrates the overall process of MA-based search. MAs operate on a pool of candidate solutions and employ local learning. They can thus be considered hybridizations between genetic algorithms and local search. Due to using population-based and local search operators, MAs explicitly pursue the two main objectives of heuristic search, exploration and exploitation. This feature makes them well suited for constructing tDSVM classifiers. Recall that the tDSVM objective (2) embodies two conflicting goals. First, the classifier should achieve a large margin of separation. Second, it should avoid both empirical and transductive errors. As we show in the following, search mechanisms that emphasize exploration are useful to minimize classification errors, whereas exploitation-centric search mechanisms facilitate increasing a classifier's margin. Therefore, a heuristic for solving (2) should incorporate mechanisms of both types.

In our approach, individuals of a population represent candidate solutions to (2). More specifically, each individual is characterized by a set of memes $(w_1,\dots,w_n,b) =: (w,b) \in \mathbb{R}^{n+1}$ and represents a separating hyperplane. We obtain the initial population by sampling (w, b) at random from a normal distribution.
Fig. 3. Computation of offspring solutions.
To determine the distribution's parameters $\mu$ and $\sigma$, we create a set of standard SVM classifiers on randomly selected sub-samples of the training data and estimate the mean and standard deviation of (w, b) over the resulting solutions. In order to measure the fitness of an individual with memes (w, b), we use the corresponding hyperplane to classify the objects of the training and working set and count the resulting number of empirical errors $\{h_i\}_{i=1}^{l}$ and transductive errors $\{h_j^*\}_{j=1}^{u}$.
We then use the tDSVM objective (2) to compute a fitness value. Individuals enter an iterative process of selection and stochastic variation (in the form of three search operators: reproduction, mutation, and local refinement) to increase their fitness. In accordance with [44], our algorithm, which we call tDSVMmem, terminates after a maximal number of generations or when the fitness of candidate solutions stops improving. Fig. 3 illustrates the overall structure of the evolutionary process within tDSVMmem. We detail important components of the algorithm in the following sections.

3.1. Reproduction

Reproduction represents the main population-based search operator in tDSVMmem. The operator randomly samples pairs of candidate solutions from the current population with replacement. In particular, we employ a roulette wheel selection approach [45], in which an individual's selection probability, and thus its probability to pass on memes to offspring solutions, is proportional to its fitness.
Fig. 4. (a)-(c) Geometrical illustration of a candidate solution before (ocher decision boundary) and after local refinement (gray decision boundary) in $\mathbb{R}^2$.
Afterwards, the operator generates new individuals from two parents I and II in two steps. Let X denote a new individual and $w_X$ the direction of the hyperplane corresponding to X. We first compute $w_X$ as:

$$w_X = \frac{w_I}{\mathrm{norm}(w_I)} + \frac{w_{II}}{\mathrm{norm}(w_{II})} \qquad (5)$$

We then set the margin $\frac{2}{\|w_X\|_1}$ equal to

$$\frac{1}{2}\left(\frac{2}{\|w_I\|_1} + \frac{2}{\|w_{II}\|_1}\right) \qquad (6)$$

Finally, we obtain the intercept of the new individual by averaging over the intercepts of its parents. As a result of these steps, the hyperplane of the new individual bisects the hyperplanes of its parents and has a margin equal to the mean margin of the parental planes. Our motivation for this recombination approach is twofold. First, given that decision variables are continuous, our recombination avoids potential problems of binary encoding schemes, which are required by alternative recombination procedures such as, e.g., uniform cross-over (e.g. [43]). Second, our modification is appropriate from a geometric point of view. That is, the offspring of two hyperplanes represents a decision surface between the parental ones. Averaging over potentially distant (parental) hyperplanes in the beginning of the search achieves a broad exploration of the search space. On the other hand, given that candidate solutions will become more similar in later generations, the reproduction mechanism will intensify the search in promising regions of the solution space in the later course of tDSVMmem. Finally, an advantage of this recombination approach is that it sustains the modifications of the local refinement operator.
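A direct implementation of the recombination rules (5) and (6) takes only a few lines of NumPy; reading norm(.) in (5) as the $L_1$ norm is our assumption:

```python
import numpy as np

def recombine(w1, b1, w2, b2):
    """Offspring hyperplane per (5)-(6): bisecting direction, margin equal
    to the mean parental margin, averaged intercept."""
    d = w1 / np.abs(w1).sum() + w2 / np.abs(w2).sum()             # direction, eq. (5)
    margin = 0.5 * (2 / np.abs(w1).sum() + 2 / np.abs(w2).sum())  # target margin, eq. (6)
    w_child = d / np.abs(d).sum() * (2 / margin)  # rescale so that 2 / ||w||_1 = margin
    return w_child, 0.5 * (b1 + b2)
```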
3.2. Mutation

In tDSVMmem, we mutate an individual's memes by adding n + 1 normally distributed variables:

$$w_d' := w_d + N(0, \sigma_d^2) \ \forall d \in \{1,\dots,n\} \quad \text{and} \quad b' := b + N(0, \sigma_{n+1}^2) \qquad (7)$$

The mutation operator shifts and rotates the separating hyperplane. Consequently, it affects the class assignments of working set objects, the margin of separation, and eventually classification errors. This suggests that the fitness of an individual depends heavily on the choices of the $\sigma_d^2$, which, in turn, determine the intensity of mutation. In order to identify appropriate settings, we adopt Beyer and Schwefel's self-adaptation philosophy [44]. In particular, we assign an $\mathbb{R}^{n+1}$-vector $(\sigma)$, representing $\sigma_1,\dots,\sigma_{n+1}$, to each individual and incorporate it into the heuristic search. As a consequence, the endogenous strategy parameters $\sigma_d$ are subject to the same variation mechanisms as the decision variables (w, b). They are tuned during the evolutionary process to automatically adjust the mutation operator's evolvability. In formal terms, the mutation is given by the following log-normal rule [44]:

$$\sigma_d' := \sigma_d \exp\left(\tau' \mathcal{N}'(0,1) + \tau \mathcal{N}_d(0,1)\right) \quad \text{with} \quad \tau' \propto \frac{1}{\sqrt{2(n+1)}}, \quad \tau \propto \frac{1}{\sqrt{2\sqrt{n+1}}} \qquad (8)$$

The $\mathcal{N}_1,\dots,\mathcal{N}_{n+1}$ are i.i.d. normal variables, and $\mathcal{N}'$ is held constant during the mutation process.
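Equations (7) and (8) translate directly into NumPy. Following the usual convention in self-adaptive evolution strategies, the sketch updates the strategy parameters first and then perturbs the memes; setting the proportionality constants in (8) to 1 is our assumption, since (8) only fixes the learning rates up to a constant:

```python
import numpy as np

def mutate(w, b, sigma, rng):
    """Self-adaptive mutation per (7)-(8) for memes (w, b) with n + 1
    endogenous strategy parameters sigma."""
    n1 = len(sigma)                        # n + 1
    tau_p = 1 / np.sqrt(2 * n1)            # global learning rate, eq. (8)
    tau = 1 / np.sqrt(2 * np.sqrt(n1))     # coordinate-wise learning rate, eq. (8)
    sigma_new = sigma * np.exp(tau_p * rng.standard_normal()   # N' drawn once, held constant
                               + tau * rng.standard_normal(n1))
    genome = np.concatenate([w, [b]]) + sigma_new * rng.standard_normal(n1)  # eq. (7)
    return genome[:-1], genome[-1], sigma_new
```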
3.3. Local refinement

We devise a local refinement operator that further improves the fitness of candidate solutions by enlarging the margin of separation. In particular, we employ the hyperplanes resulting from recombination and mutation to classify training and working set objects and, thereby, determine empirical and transductive errors. Given a hyperplane H, we define its neighborhood as the set of all planes which produce correct classifications for all objects that H classifies correctly. Out of this neighborhood, we choose the hyperplane with maximal margin to replace the current (w, b). To that end, we solve the following convex linear program for the objects that (w, b) classifies correctly. Ignoring misclassified objects at this point guarantees that program (9) is feasible:

$$\min_{w,b} \ \|w\|_1 / 2 \qquad (9)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \quad \forall i \in \{i \mid y_i (w \cdot x_i + b) \ge 1,\ i \in \{1,\dots,l\}\}$$
$$|w \cdot x_j^* + b| \ge 1 \quad \forall j \in \{j \mid |w \cdot x_j^* + b| \ge 1,\ j \in \{1,\dots,u\}\}$$
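The neighborhood condition $|w \cdot x_j^* + b| \ge 1$ in (9) is a disjunction, but once the side of every retained working set object is fixed to its incumbent classification, (9) becomes an ordinary linear program. The sketch below solves it with SciPy under exactly this convention; the variable split for $\|w\|_1$ is the same as in the MIP sketch above:

```python
import numpy as np
from scipy.optimize import linprog

def local_refinement(w, b, X_lab, y_lab, X_unl):
    """Sketch of program (9). Decision variables: [w_plus (n), w_minus (n), b]."""
    n = len(w)
    keep_l = y_lab * (X_lab @ w + b) >= 1    # correctly classified, outside the margin
    keep_u = np.abs(X_unl @ w + b) >= 1
    side_u = np.sign(X_unl @ w + b)[keep_u]  # fixed side for retained working set objects

    rows = [(x, y) for x, y in zip(X_lab[keep_l], y_lab[keep_l])]
    rows += [(x, s) for x, s in zip(X_unl[keep_u], side_u)]
    if not rows:
        return w, b
    # s (w . x + b) >= 1  <=>  -s x . w_plus + s x . w_minus - s b <= -1
    A = np.array([np.concatenate([-s * x, s * x, [-s]]) for x, s in rows])
    c = np.concatenate([np.full(2 * n, 0.5), [0.0]])  # minimize ||w||_1 / 2
    res = linprog(c, A_ub=A, b_ub=-np.ones(len(rows)),
                  bounds=[(0, None)] * (2 * n) + [(None, None)])
    if res.success:
        return res.x[:n] - res.x[n:2 * n], res.x[-1]
    return w, b                              # keep the incumbent on failure
```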
Fig. 4 illustrates the local refinement operator. Blue and red symbols represent elements of the training sample, and crosses depict unlabeled objects of the working set. The ocher hyperplane in (a) shows a solution H prior to refinement, whereas the gray classifier in (b) represents the refined classifier, that is, the optimal solution of (9). Clearly, the refined classifier achieves a larger margin than H. Moreover, panel (c) shows that local refinement can, in addition to enlarging the margin, also reduce the number of classification errors. In particular, a circle around an object in panel (c) indicates that the classification of this object changes from misclassified to correctly classified. This possibility exists because we solve program (9) only for correctly classified objects and enforce (through the second constraint) that these objects keep their correct classification.¹ Consequently, the refined classifier predicts at least as accurately as the original classifier. On the other hand, objects that were misclassified
by the original classifier can change their classifications after local refinement. It may be that the refined classifier gives correct predictions for some of these objects. Therefore, local refinement can improve but never harm classification accuracy.

¹ Note that this approach is similar to the hard-margin formulation of SVMs (e.g. [22]).

3.4. Fitness evaluation

The fitness evaluation in tDSVMmem is mainly driven by the tDSVM objective. However, a potential problem with (2) and other mathematical programs for classification is that they are susceptible to trivial solutions with w = 0 and b ∈ {±1} (e.g. [46]). Several techniques to circumvent this problem have been proposed in the literature (e.g. [47,48]). Within TL, a common approach is to enforce that the fraction of positively (negatively) classified working set objects matches the prior probability of positive (negative) objects in the training sample [18,49]. However, previous research has shown that this approach can decrease the classifier's performance whenever the training sample is not well representative of the working set [50]. Therefore, tDSVMmem incorporates a more flexible approach. In particular, we introduce a soft constraint to require that the fraction of positively classified objects in U falls into a 95%-confidence interval deduced from L:

$$\frac{|\{x_j^* \mid (w \cdot x_j^* + b) \ge 1,\ j \in \{1,\dots,u\}\}|}{|U|} \in \left[0.5\left(p - 1.96\sqrt{\frac{p(1-p)}{|L|}}\right),\ 1 - 0.5\left((1-p) - 1.96\sqrt{\frac{(1-p)p}{|L|}}\right)\right] \qquad (10)$$

with $p = \frac{|\{y_i \mid y_i = 1,\ i \in \{1,\dots,l\}\}|}{|L|}$.

We then augment the fitness evaluation within tDSVMmem by penalizing solutions that violate the constraint with a cost proportional to the violation.
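Putting objective (2) and the soft constraint (10) together, the fitness evaluation can be sketched as follows; the penalty weight rho is an assumption, since the text only states that the cost is proportional to the violation:

```python
import numpy as np

def fitness(w, b, X_lab, y_lab, X_unl, alpha=0.1, rho=1.0):
    """Fitness = objective (2) plus a soft penalty for violating the
    class-balance interval (10)."""
    l, n = X_lab.shape
    u = X_unl.shape[0]
    C, C_star = n / l, n / u
    f_lab, f_unl = X_lab @ w + b, X_unl @ w + b

    h = (y_lab * f_lab < 1).sum()        # empirical errors (margin violations)
    h_star = (np.abs(f_unl) < 1).sum()   # transductive errors
    value = 0.5 * alpha * np.abs(w).sum() + (1 - alpha) * (C * h + C_star * h_star)

    # soft constraint (10): share of positively classified working set objects
    p = (y_lab == 1).mean()
    share = (f_unl >= 1).mean()
    lo = 0.5 * (p - 1.96 * np.sqrt(p * (1 - p) / l))
    hi = 1 - 0.5 * ((1 - p) - 1.96 * np.sqrt((1 - p) * p / l))
    violation = max(lo - share, share - hi, 0.0)
    return value + rho * violation       # penalty proportional to the violation
```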
4. Experimental setup

The empirical evaluation pursues two objectives: verifying the effectiveness of tDSVMmem to solve (2) and assessing the predictive performance of the resulting tDSVM classifier. To that end, we require appropriate performance indicators and benchmark methods. The main benchmark used in this study is CPLEX 12. The unique advantage of CPLEX is that it solves the same mathematical program as tDSVMmem. Therefore, comparing the objective values of CPLEX and tDSVMmem clarifies the effectiveness of tDSVMmem as a solver for (2). CPLEX is also a suitable benchmark to assess predictive performance. The ORM principle suggests that a better solution to (2) will, in general, give a more accurate classifier (e.g. [21]). However, given that the labels of the working set objects are unknown and given that the relationship between an object's location (i.e., attribute values) and its class membership is not perfect and often corrupted by noise, it is well possible that this relationship does not hold. A better solution to (2) may give a classifier that is less accurate than one corresponding to some other solution [29]. Comparing the predictive accuracy of tDSVMmem and CPLEX classifiers provides some insight into how strong the link between objective values and classifier accuracy is in TL settings. In addition, to demonstrate the merit of the tDSVM classifier in itself, it is important to compare it to other inductive and transductive classification methods. Given that such benchmarks are based on different mathematical programs, comparisons of objective values are misleading. Accordingly, we concentrate on predictive accuracy in such comparisons.

A number of accuracy and error measures have been proposed to assess the predictive performance of a classification method from different angles (e.g. [9]). In this study, we consider two common performance measures, classification accuracy (ACC) and the area under a receiver operating characteristics curve (AUC). The ACC of a classifier is simply the fraction of working set objects that it assigns to the correct class:

$$\mathrm{ACC} = \frac{\sum_{j=1}^{u} h_j^*}{|U|} \qquad (11)$$

This notion of classifier performance is well aligned with the discrete error measurement of DSVMs [27,28] and our tDSVMmem in particular. A characteristic of ACC is that the performance of a classifier depends only on the number of working set objects on the correct side of the separating hyperplane. The AUC, on the other hand, is based on the distance of an object to the separating hyperplane. It measures the ability of a classifier to rank objects of different classes in the correct order [51]. Assume, for example, a classifier that assigns all objects to the negative class. This is equivalent to:

$$\mathrm{sign}(w \cdot x_j^* + b) = -1 \quad \forall j \in \{1,\dots,u\} \qquad (12)$$

meaning that all working set objects are located below the separating hyperplane. The classifier could still achieve a maximal AUC of 1 if all positive objects are closer to the hyperplane than the negative objects. In this sense, the AUC emphasizes a different aspect of classifier performance and complements an ACC-based assessment.
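For a hyperplane classifier, both measures are straightforward to compute once the true labels of the working set are known, as they are in a benchmarking study. The sketch below uses scikit-learn's roc_auc_score on the signed distances to the plane:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(w, b, X_work, y_true):
    """ACC and AUC of a hyperplane classifier on the working set; the AUC
    ranks objects by their signed distance w . x + b."""
    scores = X_work @ w + b
    acc = (np.sign(scores) == y_true).mean()
    auc = roc_auc_score(y_true, scores)   # labels in {-1, +1} are accepted
    return acc, auc
```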
An important design choice in empirical studies associated with TL concerns the ratio between labeled and unlabeled examples l/u. It is clear from (2) that the number of binary decision variables is closely related to the amount of unlabeled data. Consequently, the CPLEX benchmark will experience more and more difficulties in finding an optimal solution as l/u decreases. To set tDSVMmem a challenging benchmark, we perform most experiments with a constant value of l/u = 1, meaning that the numbers of unlabeled and labeled objects are the same. We will also consider ratios of l/u < 1 where appropriate (see Section 5.4).

Another design choice in the experimental study is associated with the settings of tDSVM parameters. For example, the predictive performance of any SVM-type classifier depends heavily on the parameter C (e.g., [52]). Larger values result in classifiers with lower empirical error, whereas smaller values yield classifiers with larger margin. Given their influence on the optimal solution of (2) and the resulting classifier, we explicitly account for different parameter settings in our experiments. In particular, we consider three alternative settings of Ierror (low, medium, high), where we set α = 0.9, α = 0.5, and α = 0.1, respectively.

Finally, we acknowledge that the CPLEX benchmark (tDSVMCPLEX in the following) might fail to identify an optimal solution, especially for larger problem instances. We therefore impose a time limit and terminate CPLEX after 30 min. After termination, we extract the currently best solution for comparison with tDSVMmem. Note that the running times of tDSVMmem vary between a few seconds (synthetic data) and four minutes (see Tables 3-5 for details). In this sense, tDSVMmem consistently uses less CPU time than the CPLEX benchmark. Therefore, in the following we concentrate on the effectiveness of both methods rather than focusing on running times.

5. Empirical results

We first present empirical results obtained from synthetic data. We then discuss results from real-world data. There, we begin with a comparison of tDSVMmem versus tDSVMCPLEX in terms of objective values and predictive performance. Subsequently, we benchmark tDSVM classification models, produced with tDSVMmem, against other classifiers.
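As a concrete illustration of the experimental design above, the following sketch simulates a small TL scenario with l/u = 1. Scikit-learn's make_classification serves here as a simple stand-in for the NDC generator used in Section 5.1, and all settings shown are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data generator: Gaussian clusters with tunable dimensionality
# and class overlap (illustrative parameters, not the NDC settings).
X, y = make_classification(n_samples=50, n_features=4, n_informative=4,
                           n_redundant=0, class_sep=0.5, random_state=0)
y = 2 * y - 1                                  # map {0, 1} to {-1, +1}

# l/u = 1: half the objects form L; the other half, with labels withheld
# during training, forms the working set U.
X_lab, X_unl, y_lab, y_unl = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=0)
```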
5.1. Synthetic data experiment

We employ Musicant's NDC generator [53] to generate a set of artificial (binary) classification problems. This package has been widely used in the literature (e.g. [25,54,55]) and offers some more generality compared to alternative data generators [56]. In the experiments with synthetic data, we consider two experimental factors in addition to the severity of classification errors (Ierror). First, we consider the number of attributes (two settings with n = 4 and 32, respectively) because it is an important characteristic of any classification problem and may thus have an influence on the performance of the two solvers. Second, the difficulty of the classification problem will directly affect the trade-off between a large margin and a low error solution and thus the two solvers under study. Musicant's data generator [53] can accommodate problem difficulty as the degree of overlap between the two classes, which is equivalent to the Bayes error. We reuse settings from previous research (s = 0.01 and 15; [56,57]) to stretch the covariance matrix of the data generator. Together with the three settings for Ierror (low, medium, high), we obtain a 3 × 2 × 2 factorial design for the synthetic data experiment.

Although it represents an important characteristic of a classification task, we refrain from considering the number of examples (i.e., problem size) as a further experimental factor. This is because the main objective of the synthetic data experiment is to scrutinize the ability of tDSVMmem to identify near-optimal solutions. Therefore, we must ensure that the benchmark, tDSVMCPLEX, actually finds the optimal solution of a problem instance. A series of pretests revealed that u = l = 25 fulfills this requirement for all twelve experimental settings, whereas larger data sets lead to CPLEX not converging within the 30 min time limit. We therefore consider randomly generated datasets of 50 examples for the synthetic data experiment.

Table 1 reports the empirical results of the comparison of tDSVMmem vs. tDSVMCPLEX across the twelve experimental settings on synthetic data. Columns one to three depict problem characteristics in terms of Ierror, n, and s, whereas the remaining columns report the objective values of tDSVMCPLEX and tDSVMmem, and their percentage difference (i.e., the gap between the optimal solution and the tDSVMmem solution), respectively. For tDSVMmem, we also report the standard deviation of objective values, which we compute over ten random initializations per setting. The last two columns of Table 1 give the predictive performance in terms of AUC and ACC, which is the same for the two solvers in all experiments. Here and in the following, an asterisk (*) indicates that CPLEX finds the optimal solution of a problem instance.
Table 1
Comparative results of tDSVMmem and tDSVMCPLEX on synthetic data.

Ierror   n    s      tDSVMCPLEX obj.   tDSVMmem obj.   Std. dev.   Gap (%)   AUC      ACC
Low      4    0.01   0.4652*           0.4652          0           0         1        1
Low      4    15     0.8744*           0.8744          0           0         0.8701   0.7600
Low      32   0.01   0.4560*           0.4560          0           0         1        1
Low      32   15     2.0237*           2.0237          0           0         0.6508   0.8000
Medium   4    0.01   0.2584*           0.2584          0           0         1        1
Medium   4    15     1.1546*           1.1546          0           0         0.8691   0.8000
Medium   32   0.01   0.2533*           0.2533          0           0         1        1
Medium   32   15     1.8170*           1.8170          0           0         0.9683   0.8400
High     4    0.01   0.0517*           0.0517          0           0         1        1
High     4    15     0.5214*           0.5214          0           0         1        1
High     32   0.01   0.0507*           0.0507          0           0         1        1
High     32   15     0.3634*           0.3634          0           0         0.9603   0.8400
Table 1 provides strong support for the efficacy of tDSVMmem. It finds the optimal solution under all experimental conditions. Moreover, the low standard deviations evidence its robustness (i.e., that it finds optimal solutions independent of its random initialization). In a similar fashion, none of the three experimental factors (severity of classification errors, dimensionality, and difficulty of the classification problem) seems to affect tDSVMmem. We observe optimal solutions for all factor combinations. In summary, the results on synthetic data clearly show the strong performance of tDSVMmem. However, the number of examples was only 50 in these experiments. It is important to also examine the performance of tDSVMmem in larger settings that are more representative of practical applications of classification analysis and TL in particular. This is the subject of the next section.

5.2. Analysis of solver efficacy

To complement the previous analysis and to examine the competitive performance of tDSVMmem and tDSVMCPLEX in realistic scenarios, we select seven binary classification problems from the UCI Machine Learning Repository [35]. The data sets represent real-world decision tasks in different domains (e.g., business, medical diagnosis, physics, and text classification) to facilitate an industry-independent assessment. In particular, Australian Credit (AC) and German Credit (GC) represent loan approval problems. The classification task is to discriminate between good and bad credit risks on the basis of demographic and transactional attributes that characterize loan applications. The attributes in Wisconsin Breast Cancer (BC) refer to characteristics of cell nuclei, and the classification task is to distinguish tumor cells into benign or malignant. The Heart (HE) data set is also from the field of medical decision making. It includes the records of 267 patients who are classified into two groups (normal vs. abnormal) on the basis of single proton emission computed tomography images. The Ionosphere (IO) dataset requires a categorization of radar returns into two classes (good and bad), whereas the Spambase (SP) task is to discriminate between ordinary and spam emails on the basis of word and/or character counts extracted from the mail body. Finally, the Adult (AD) dataset contains Census data concerning demographic and socio-demographic characteristics of US households. The task is to predict whether household income exceeds $50,000. These data sets have been employed in several previous studies to explore the performance of competing learning algorithms (e.g. [25,27,28,58-61]). In this sense, they represent an established set of classification problems well suited for evaluating novel classifiers and tDSVMmem in particular. Table 2 summarizes the characteristics of the data sets.

It is known that many classifiers and SVM-type methods in particular benefit from data preprocessing (e.g. [62]). In this study, we preprocess all data sets as follows: we first convert categorical attributes into numerical attributes using the weight of evidence approach [63]. We then employ the z-transformation to avoid problems with continuous attributes on different measurement scales.
Table 2
Characteristics of the seven real-world UCI datasets.

D    |D|      n    P(y = +1)
AC   690      14   0.4449
GC   1000     20   0.3000
BC   569      30   0.3726
HE   267      44   0.2060
IO   351      34   0.6410
SP   4601     57   0.3940
AD   48,842   14   0.2393
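The preprocessing described above can be sketched as follows; the smoothing constant eps is our addition to guard against empty cells and is not part of the weight of evidence approach as cited:

```python
import numpy as np

def weight_of_evidence(col, y, eps=0.5):
    """WoE encoding of a categorical pandas Series col against binary
    labels y in {-1, +1}."""
    pos, neg = (y == 1), (y == -1)
    woe = {}
    for v in col.unique():
        m = (col == v)
        rate_pos = ((m & pos).sum() + eps) / (pos.sum() + eps)
        rate_neg = ((m & neg).sum() + eps) / (neg.sum() + eps)
        woe[v] = np.log(rate_pos / rate_neg)
    return col.map(woe)

def z_transform(X):
    """Standardize continuous attributes to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```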
Table 3
Empirical results on UCI data at Ierror = low. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   0.5697       >1800    0.5352      0.0000      20.2        -6.06
AC   2   0.5555       >1800    0.5372      0.0000      20.6        -3.30
AC   3   0.5514       >1800    0.5393      0.0000      21.0        -2.20
AC   4   0.5514       >1800    0.5413      0.0000      22.8        -1.84
AC   5   0.5372       >1800    0.5413      0.0000      23.0        0.75
AC   £   0.5531       >1800    0.5389      0.0000      21.5        -2.53
BC   1   2.4662       >1800    2.3533      0.0028      103.0       -4.57
BC   2   2.4925       >1800    2.3668      0.0030      103.3       -5.04
BC   3   2.4921       >1800    2.4131      0.0065      103.4       -3.17
BC   4   2.4016       >1800    2.4273      0.0796      104.4       1.07
BC   5   2.6745       >1800    2.4307      0.0041      110.1       -9.11
BC   £   2.5054       >1800    2.3982      0.0192      104.8       -4.17
GC   1   0.8820       >1800    0.8714      0.0243      71.6        -1.20
GC   2   0.8772       >1800    0.8714      0.0095      73.1        -0.66
GC   3   0.8820       >1800    0.8726      0.0342      85.9        -1.06
GC   4   0.8892       >1800    0.8741      0.0297      85.9        -1.70
GC   5   0.9012       >1800    0.8815      0.0591      97.2        -2.18
GC   £   0.8863       >1800    0.8742      0.0314      82.8        -1.36
HE   1   2.0469       >1800    1.5441      0.0064      42.6        -24.56
HE   2   1.8806       >1800    1.6943      0.0205      44.7        -9.91
HE   3   1.9880       >1800    1.7007      0.0397      48.6        -14.45
HE   4   1.7110       >1800    1.7030      0.0502      50.6        -0.46
HE   5   1.9431       >1800    1.7037      0.0009      51.5        -12.32
HE   £   1.9139       >1800    1.6691      0.0236      47.6        -12.34
IO   1   2.3307       >1800    2.1079      0.0717      79.9        -9.56
IO   2   2.1903       >1800    2.1217      0.0692      82.2        -3.13
IO   3   2.1901       >1800    2.1235      0.0379      88.3        -3.04
IO   4   2.4343       >1800    2.1360      0.1085      90.6        -12.25
IO   5   2.2386       >1800    2.1445      0.0148      91.4        -4.20
IO   £   2.2768       >1800    2.1267      0.0604      86.5        -6.44
SP   1   6.4979       >1800    6.4390      0.9653      157.3       -0.91
SP   2   /            >1800    7.1841      1.0335      171.2       -
SP   3   /            >1800    7.3723      2.0785      195.0       -
SP   4   /            >1800    7.4562      1.6170      201.0       -
SP   5   /            >1800    7.5422      2.1602      210.8       -
SP   £   6.4979       >1800    7.1988      1.5709      187.1       -0.91
AD   1   /            >1800    0.7934      0.0447      163.3       -
AD   2   /            >1800    0.7999      0.0000      178.8       -
AD   3   /            >1800    0.8004      0.0246      189.0       -
AD   4   /            >1800    0.8035      0.0163      189.6       -
AD   5   /            >1800    0.8088      0.0218      189.9       -
AD   £   /            >1800    0.8012      0.0215      182.1       -

Table 4
Empirical results on UCI data at Ierror = medium. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   0.8486*      28.6     0.8486      0.0000      14.0        0.00
AC   2   0.7775*      14.6     0.7775      0.0000      19.5        0.00
AC   3   0.7572*      23.2     0.7572      0.0000      22.0        0.00
AC   4   0.7572*      13.8     0.7572      0.0000      22.1        0.00
AC   5   0.6862*      11.4     0.6862      0.0000      22.3        0.00
AC   £   0.7654       18.3     0.7654      0.0000      20.0        0.00
BC   1   4.0338       >1800    3.6990      0.0250      58.3        -8.30
BC   2   3.7492       >1800    3.7300      0.0284      59.6        -0.51
BC   3   3.7737       >1800    3.7364      0.0172      61.5        -0.99
BC   4   4.0748       >1800    3.7727      0.0273      61.8        -7.41
BC   5   3.8456       >1800    3.8067      0.0310      63.6        -1.01
BC   £   3.8954       >1800    3.7490      0.0258      61.0        -3.65
GC   1   2.3380       >1800    2.2912      0.1100      77.7        -2.00
GC   2   2.4100       >1800    2.3104      0.1404      82.9        -4.13
GC   3   2.4820       >1800    2.3284      0.1292      83.0        -6.19
GC   4   2.3500       >1800    2.3306      0.0481      84.6        -0.83
GC   5   2.3260       >1800    2.3380      0.0000      88.0        0.52
GC   £   2.3812       >1800    2.3197      0.0856      83.2        -2.53
HE   1   3.8507       >1800    3.1539      0.0567      20.6        -18.10
HE   2   3.2087       >1800    3.1784      0.0622      21.7        -0.95
HE   3   3.2234       >1800    3.0206      0.1224      19.9        -6.29
HE   4   2.7729       >1800    3.0532      0.1333      19.9        10.11
HE   5   3.1157       >1800    3.1082      0.1520      20.6        -0.24
HE   £   3.2343       >1800    3.1029      0.1053      20.5        -3.09
IO   1   4.0505       >1800    2.9877      0.0109      39.1        -26.24
IO   2   3.1552       >1800    3.0558      0.0808      41.0        -3.15
IO   3   3.6279       >1800    3.0641      0.0063      41.8        -15.54
IO   4   3.5531       >1800    3.1513      0.2648      44.3        -11.31
IO   5   2.7699*      1410.6   3.1573      0.1322      47.1        13.99
IO   £   3.4313       >1800    3.0832      0.0990      42.7        -8.45
SP   1   17.0316      >1800    11.0977     0.6896      226.9       -34.84
SP   2   15.7249      >1800    10.8216     0.6719      209.1       -31.18
SP   3   15.6210      >1800    10.8737     0.5021      226.3       -30.39
SP   4   /            >1800    10.9386     0.7731      227.1       -
SP   5   /            >1800    11.0377     0.8186      229.4       -
SP   £   16.1258      >1800    10.9538     0.6911      223.7       -32.14
AD   1   /            >1800    1.3820      0.0162      162.3       -
AD   2   /            >1800    1.4203      0.1267      177.2       -
AD   3   /            >1800    1.4465      0.1935      177.2       -
AD   4   /            >1800    1.4578      0.1945      181.5       -
AD   5   /            >1800    1.4979      0.2600      184.1       -
AD   £   /            >1800    1.4409      0.1582      176.5       -
To simulate a TL scenario, we partition all data sets randomly into a training and a working set using a split ratio of 1:1 (see above). We repeat this partitioning five times to account for the fact that the performance of tDSVMmem and tDSVMCPLEX might depend upon the random assignment of decision objects to the training and working set, respectively. Tables 3-5 report the empirical results in terms of objective values for the three experimental settings Ierror = low, medium, high. In each table, the five rows per dataset provide the individual results of the random assignments of objects to L and U. We use an asterisk (*) to highlight settings where tDSVMCPLEX finds the optimal solution within the time limit; otherwise we report the best CPLEX result found within 30 min. The £ row per dataset gives the average performance of the two solvers, computed over the five individual data partitionings. A / indicates that tDSVMCPLEX failed to find a feasible solution within the time limit. The running time for tDSVMmem varies between one and four minutes (depending on data set size) and never exceeds the time limit. In the last column of Tables 3-5, a negative gap indicates that tDSVMmem outperforms tDSVMCPLEX.

Tables 3-5 give a similar picture as the synthetic data experiment and suggest that tDSVMmem typically performs better than tDSVMCPLEX. More specifically, the average objective values of tDSVMmem are consistently better than those of the CPLEX benchmark across all experiments. Considering the individual results across different random partitionings of the data sets, the overall win/tie/loss counts of tDSVMmem vs. tDSVMCPLEX are 69/10/5. On the basis of these results, a Wilcoxon signed rank test for matched samples (e.g. [64]) enables us to reject the null-hypothesis that the median objective values of the two solvers are the same (p-value < 0.001). It is also noteworthy that all ties are due to both approaches solving a problem instance to optimality. It is encouraging to observe that tDSVMmem finds the optimal solution of a problem instance whenever it is revealed (i.e., through tDSVMCPLEX). Moreover, the standard deviation of tDSVMmem performance (computed over ten different random initializations) is consistently low. This suggests that the quality of tDSVMmem does not depend on the random initialization of the algorithm. In view of these results, we may conclude that tDSVMmem performs significantly better than tDSVMCPLEX. This also confirms that the heuristic operators and their combination within tDSVMmem are effective and well geared to the problem at hand. Such confirmation is crucial for any new metaheuristic.
Table 5
Empirical results on UCI data at Ierror = high. (* optimal CPLEX solution; / no feasible solution within the time limit; £ average over the five partitionings.)

D    #   tDSVMCPLEX            tDSVMmem                            Gap (%)
         Obj. val.    Sec.     Obj. val.   Std. dev.   Sec.
AC   1   1.1274*      482.7    1.1274      0.0000      12.5        0.00
AC   2   0.9996*      143.1    0.9996      0.0000      13.0        0.00
AC   3   0.9630*      85.3     0.9630      0.0000      13.1        0.00
AC   4   0.9630*      129.6    0.9630      0.0000      13.4        0.00
AC   5   0.8352*      12.2     0.8352      0.0000      13.5        0.00
AC   £   0.9777       170.6    0.9777      0.0000      13.1        0.00
BC   1   2.6192       >1800    1.7320      0.6203      61.0        -33.87
BC   2   2.3681       >1800    1.8146      0.6258      62.1        -23.37
BC   3   2.4459       >1800    1.8630      0.6595      63.3        -23.83
BC   4   2.4141       >1800    1.8853      0.6890      83.4        -21.90
BC   5   2.6541       >1800    1.9056      0.6382      85.1        -28.20
BC   £   2.5003       >1800    1.8401      0.6466      71.0        -26.24
GC   1   3.1796       >1800    2.3842      0.3022      33.2        -25.02
GC   2   3.5560       >1800    2.4475      0.3619      35.1        -31.17
GC   3   3.3604       >1800    2.4856      0.3484      48.7        -26.03
GC   4   3.7992       >1800    2.4907      0.3163      49.4        -34.44
GC   5   3.3684       >1800    2.5087      0.1537      49.6        -25.52
GC   £   3.4527       >1800    2.4633      0.2965      43.2        -28.44
HE   1   5.3507       >1800    3.6760      0.0568      15.8        -31.30
HE   2   5.3763       >1800    3.6795      0.0760      15.9        -31.56
HE   3   4.8141       >1800    3.8482      0.0754      16.1        -20.06
HE   4   4.3857       >1800    4.0385      0.0857      16.2        -7.92
HE   5   5.3164       >1800    4.0796      0.0694      16.3        -23.26
HE   £   5.0486       >1800    3.8644      0.0727      16.0        -22.82
IO   1   4.1134       >1800    2.7874      0.2014      41.4        -32.24
IO   2   3.4063       >1800    2.8186      0.0256      44.1        -17.25
IO   3   4.0042       >1800    2.8338      0.0009      45.4        -29.23
IO   4   3.8410       >1800    2.8622      0.1832      46.3        -25.48
IO   5   3.4778       >1800    2.8767      0.09822     47.2        -17.28
IO   £   3.7685       >1800    2.8357      0.1019      44.9        -24.30
SP   1   21.9349      >1800    11.8460     0.30136     111.6       -45.99
SP   2   18.3260      >1800    11.9520     0.63059     116.5       -34.78
SP   3   22.8557      >1800    11.9706     0.33681     118.1       -47.63
SP   4   19.5278      >1800    11.9775     0.12441     124.9       -38.66
SP   5   24.8513      >1800    12.0379     0.30096     127.1       -51.56
SP   £   21.4991      >1800    11.9568     0.33881     119.7       -43.73
AD   1   /            >1800    1.8699      0.0000      137.8       -
AD   2   /            >1800    1.8781      0.0000      138.7       -
AD   3   /            >1800    1.8789      0.0000      138.8       -
AD   4   /            >1800    1.8830      0.0000      141.2       -
AD   5   /            >1800    1.8836      0.0000      141.9       -
AD   £   /            >1800    1.8787      0.0000      139.7       -
Finally, it is important to note that tDSVMCPLEX experiences severe difficulties in finding any solution within the time limit when working with larger data sets. This is true for SP and especially AD. On the other hand, tDSVMmem delivers solutions for these problems within a few minutes. This emphasizes the need for heuristic search in transductive SVM learning. Lacking an objective benchmark, we are unable to assess many SP and all AD solutions. Considering the SP experiments where tDSVMCPLEX has found a solution (i.e., with Ierror = medium or high), the large magnitude by which tDSVMmem outperforms the CPLEX benchmark gives rise to the suspicion that the advantage of tDSVMmem is particularly prominent for larger data sets. However, future research is needed to examine the relationship between problem size and tDSVMmem performance in more detail. With respect to the general appropriateness of tDSVMmem, the above results provide strong evidence for the algorithm being an effective and efficient approach to solve the tDSVM learning problem.
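The matched-samples test reported above can be reproduced along the following lines; the arrays hold the run-1 objective values of four data sets from Table 3 purely for illustration, not the full set of 84 paired observations:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired objective values of the two solvers (illustrative subset).
obj_mem = np.array([0.5352, 2.3533, 0.8714, 1.5441])
obj_cplex = np.array([0.5697, 2.4662, 0.8820, 2.0469])
stat, p_value = wilcoxon(obj_mem, obj_cplex)   # H0: equal median objective values
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```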
5.3. Evaluation of prediction performance

Having established the appropriateness of tDSVMmem from an optimization perspective, this section examines the ability of tDSVMmem to predict the class membership of working set objects with high accuracy. We begin once again with a comparison to tDSVMCPLEX because it solves the same mathematical program as tDSVMmem. Subsequently, we consider other transductive and inductive classifiers to extend the scope of the comparison in Section 5.4.

In general, the predictive performance of a classifier depends on the choice of meta-parameter settings (e.g. [9]). In the case of tDSVM, these parameters are α, C, and C* (see (2)). It is common practice to determine suitable parameter values empirically by testing a set of candidate values (e.g., [8,52,65]). In our case, the three experimental settings Ierror = low, medium, high can be seen as a simple means of parameter tuning, a three-value grid search, which helps to identify suitable parameter values. Fig. 5 illustrates the dependency of predictive performance on parameter values by means of a box plot. For each of the three settings α ∈ {0.1, 0.5, 0.9}, the corresponding box includes the results of tDSVMCPLEX (from all data sets and all randomized runs per data set). Fig. 5 reveals that the highest median AUC and ACC values follow from the setting with Ierror = high. Moreover, the performance of the classifiers corresponding to this setting varies less than in alternative settings. This suggests that tDSVMCPLEX performs more robustly when classification errors receive a larger weight in the objective. Finally, Fig. 5 indicates that the worst-case performance is highest in the Ierror = high setting. Together, these results suggest that the setting Ierror = high produces the strongest tDSVMCPLEX classifiers and thus the most challenging benchmark models. Consequently, we select this setting for subsequent comparisons of tDSVMmem and tDSVMCPLEX and set α = 0.1. We refrain from tuning C and C* in this paper, but use the following default rule:
$$C = n/l, \qquad C^* = n/u \qquad (13)$$
The intuition is that higher (lower) dimensionality will, in tendency, increase (decrease) the influence of w on the objective, relative to the influence of empirical and transductive errors. Multiplying the weights of classification errors with n compensates this effect. In a similar way, dividing the weights by the number of examples (l or u) ensures that the relative influence of classification errors does not increase (decrease) simply because of working with larger (smaller) data sets. Moreover, dividing C and C* by l and u implies that the relative importance of empirical errors compared to transductive errors depends on the prevalence of labeled and unlabeled objects in the data. This is a plausible choice for benchmark experiments that use data sets from different applications. We acknowledge that a more elaborate tuning of all three meta-parameters, α, C, and C*, is likely to improve predictive performance. However, this is true for both tDSVMmem and tDSVMCPLEX, as well as for any other classification method. Considering that tDSVMmem creates a classification model much faster than tDSVMCPLEX, allotting the same amount of tuning resources to both methods would give the former an advantage. In this sense, our approach to avoid extensive parameter tuning leads to a tougher benchmark for tDSVMmem.

The prediction performance of tDSVMmem and tDSVMCPLEX is compared in Table 6. In particular, Table 6 reports average performance estimates (over the five randomized runs per dataset; see Section 4), which we compute from working set objects. The results evidence that tDSVMmem predicts more accurately than tDSVMCPLEX. Clearly, the two solvers perform identically on the AC data set, where they both found the optimal solution (again marked by an asterisk in Table 6). On all other data sets, tDSVMmem outperforms the CPLEX benchmark in terms of ACC and AUC. Fig. 6 provides a more detailed view on the comparative performance of the two solvers across the five randomized runs per UCI data set. In particular, Fig. 6 depicts the distribution of the percentage differences in AUC and ACC between tDSVMmem and tDSVMCPLEX.
Fig. 5. Predictive performance of tDSVMCPLEX in terms of AUC and ACC across all data sets and randomized runs per data set for different settings of the meta-parameter α (high = 0.1, medium = 0.5, low = 0.9).
Table 6. Average prediction performance of tDSVMmem and tDSVMCPLEX on UCI data.

         AUC                                  ACC
D        tDSVMCPLEX  tDSVMmem  Gap (%)       tDSVMCPLEX  tDSVMmem  Gap (%)
AC^a     0.872       0.872      0.00         0.857       0.857      0.00
BC       0.984       0.996      1.20         0.930       0.968      4.09
GC       0.601       0.792     31.73         0.677       0.933     37.75
HE       0.666       0.820     23.10         0.741       0.778      4.93
IO       0.791       0.901     13.93         0.807       0.875      8.50
SP       0.865       0.918      6.04         0.807       0.842      4.35
AD       /           0.898     /             /           0.835     /

^a Both solvers find the proven optimal solution on this data set.
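For reference, the Gap columns report the relative improvement of tDSVMmem over tDSVMCPLEX. A minimal sketch of this computation follows (it recomputes from the rounded table entries, so results can deviate slightly from the reported gaps, which are presumably based on unrounded values):

```python
# Relative improvement (%) of tDSVMmem over tDSVMCPLEX, as in the Gap
# columns of Table 6; a subset of the rounded AUC entries for illustration.
auc_cplex = {"AC": 0.872, "BC": 0.984, "GC": 0.601, "HE": 0.666}
auc_mem   = {"AC": 0.872, "BC": 0.996, "GC": 0.792, "HE": 0.820}

for d in auc_cplex:
    gap = 100.0 * (auc_mem[d] - auc_cplex[d]) / auc_cplex[d]
    print(f"{d}: {gap:.2f}%")  # e.g., GC: 31.78% vs. 31.73% in the table
```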
To ensure consistency with previous results, we calculate the percentage differences such that negative values indicate superior performance of tDSVMmem. Note that we exclude AC and AD from the analysis. Results on AC do not differ between the two solvers because both find the optimal solution. We exclude AD because the CPLEX benchmark does not find a feasible solution within the time limit. Fig. 6 reveals that all performance differences are below zero. This shows that tDSVMmem not only provides more accurate class predictions on average, but consistently outperforms tDSVMCPLEX in all randomized runs. A formal comparison of the two solvers by means of a Wilcoxon signed rank test (including the AC results) confirms that tDSVMmem predicts significantly more accurately than tDSVMCPLEX (p-value < 0.0001 for both performance indicators). In addition, we can estimate the expected performance difference between the two classifiers (i.e., on new data not used in the study) by calculating the median of their observed performance differences [66]. The median performance difference between tDSVMCPLEX and tDSVMmem is 0.064 points for AUC and 0.043 points for ACC. Compared to the median prediction performance of tDSVMCPLEX, this equates to a relative improvement of 7.61% for AUC and about 5.31% for ACC.
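The testing procedure can be illustrated as follows (the arrays hold placeholder values; in the actual analysis, each entry corresponds to one randomized run on one data set):

```python
# Sketch of the solver comparison: a Wilcoxon signed rank test on paired
# AUC values, and the median paired difference as an estimate of the
# expected improvement [66]. Placeholder numbers, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

auc_mem   = np.array([0.87, 0.99, 0.80, 0.82, 0.90, 0.92])
auc_cplex = np.array([0.87, 0.98, 0.60, 0.67, 0.79, 0.86])

stat, p_value = wilcoxon(auc_mem, auc_cplex)  # paired, two-sided test
median_diff = np.median(auc_mem - auc_cplex)  # expected gain in AUC points
rel_improvement = 100.0 * median_diff / np.median(auc_cplex)
print(p_value, median_diff, rel_improvement)
```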
Fig. 6. Distribution of the percentage differences in ACC and AUC between tDSVMCPLEX and tDSVMmem across the randomized runs per data set for selected UCI data sets.
Table 7. Predictive performance of tDSVM compared to other transductive and inductive classifiers in terms of AUC.

D        tDSVM   TSVM    L1-SVM   DSVM
AC       0.872   0.923   0.926    0.894
BC       0.996   0.992   0.986    0.976
GC       0.792   0.723   0.801    0.771
HE       0.820   0.750   0.729    0.774
IO       0.901   0.871   0.852    0.846
SP       0.918   0.902   0.965    0.951
AD       0.898   0.890   0.892    0.900
Ø AUC    0.885   0.864   0.879    0.873
Ø rank   1.927   2.736   2.482    2.855

Table 8. Predictive performance of tDSVM compared to other transductive and inductive classifiers in terms of ACC.

D        tDSVM   TSVM    L1-SVM   DSVM
AC       0.857   0.865   0.861    0.826
BC       0.968   0.965   0.956    0.952
GC       0.933   0.744   0.764    0.738
HE       0.778   0.776   0.744    0.743
IO       0.875   0.853   0.865    0.855
SP       0.842   0.831   0.915    0.890
AD       0.835   0.718   0.839    0.848
Ø ACC    0.870   0.822   0.849    0.836
Ø rank   2.191   2.736   2.346    2.727
A relative improvement of this size will be meaningful in many applications. For example, increasing the accuracy of a targeting model in marketing by five percent will typically improve the profits resulting from the corresponding marketing campaign to a large degree (e.g. [67]). In this sense, our results indicate that the predictive advantage of tDSVMmem is not only statistically significant but also meaningful from a managerial perspective. Finally, it is worth commenting on the previous observation that a better objective value may translate into an inferior classifier [29]. We do not observe such a case in our comparison of tDSVMmem versus tDSVMCPLEX: the better solver consistently produces the more accurate classifier. In this sense, our results provide some evidence that the link between the objective of (2) and the accuracy of the resulting classifier is relatively stable. This might result from a strict compliance with ORM theory and thus be a feature of the discrete measurement of empirical and transductive errors. It might also result from the direct estimation of class labels for working set objects, and thus be a general feature of TL. Clarifying the specific origin of the strong connection between objective and accuracy might be a promising avenue for future research.

5.4. Comparison to other inductive and transductive classifiers

Having confirmed the suitability of tDSVMmem as a solver for program (2), we now turn our attention to the tDSVM classifier itself and examine its predictive performance in comparison to other inductive and transductive classifiers. In these comparisons, the tDSVM classifier is always constructed by means of tDSVMmem. First, we compare tDSVM to the original TSVM [18]. The two classifiers differ mainly in how they account for classification errors during training. TSVM uses continuous slack variables, whereas tDSVM adopts Orsenigo and Vercellis's approach to count classification errors with a discrete step function [27,28]. Comparing the two classifiers allows us to examine the effectiveness of a discrete error measurement in a TL setting. In addition, we consider two inductive classifiers, DSVM and a linear support vector machine with an L1-penalty (L1-SVM; e.g. [68,69]). Like the TSVM benchmark, DSVM shares several similarities with tDSVM. The main difference between the two concerns the underlying learning paradigm, induction versus transduction. Therefore, contrasting tDSVM with DSVM clarifies the value of using unlabeled data during classifier construction.2 The L1-SVM benchmark completes the set in the sense that it represents an inductive classifier without a discrete measurement of classification errors. Tables 7 and 8 depict the results of the comparison in terms of AUC and ACC, respectively.
2 Note that we use a simplified version of DSVM to increase the similarity with tDSVM. In particular, we solve problem (1) to devise a DSVM classifier, using a linear programming heuristic [28]. Orsenigo and Vercellis [27] employed an additional regularizer in their DSVM, which penalizes the number of non-zero coefficients in the normal vector.
All performance figures represent averages, computed over five randomized training/test set splits per data set. The last two rows report the average performance and the average rank per method, which we compute across the individual results of the five randomized runs. For each run, the best performing method receives a rank of one, the runner-up a rank of two, etc. (e.g. [64]); a small sketch of this ranking computation follows below. A first finding is that tDSVM provides, on average, the most accurate predictions. It achieves the highest average AUC and ACC, and also the lowest average rank across all methods. A closer look at the two transductive approaches reveals that tDSVM performs better than TSVM on all data sets but AC. Furthermore, a Wilcoxon signed rank test allows us to conclude that tDSVM classifies significantly more accurately than TSVM (p-value: 0.0074 for AUC; 0.0001 for ACC).3 Given the similarity between the mathematical programs underlying these two classifiers, the main factor that can explain a significant difference in classification performance is the discrete measurement of classification errors. Orsenigo and Vercellis found this approach to be highly effective in inductive settings. Our results extend this finding to transductive classification: measuring empirical and transductive errors by means of a discrete step function helps to increase the accuracy of the resulting classifier. This finding is consistent with ORM theory [21]. In this sense, our results provide further evidence that designing prediction methods that are not only data-driven but also well anchored in learning theory pays off. In comparison to the inductive classifiers, tDSVM gives better performance on average. However, its predictive advantage is smaller than in the TSVM case. L1-SVM in particular performs competitively. For example, L1-SVM beats tDSVM on three data sets when AUC is used as performance measure. In the case of ACC, tDSVM 'wins' four data sets, whereas no competitor secures more than one win. A statistical analysis confirms the competitive performance of L1-SVM and DSVM. In particular, our results do not provide sufficient evidence to conclude whether the AUC differences between tDSVM and L1-SVM are systematic or random (p-value of a Wilcoxon signed rank test: 0.466). The same holds true for the comparison of tDSVM and DSVM (p-value: 0.301). Using ACC as indicator of predictive performance, we find marginal evidence for the superiority of tDSVM over L1-SVM (p-value: 0.103), whereas the results allow concluding that tDSVM delivers significantly higher accuracy than DSVM (p-value: 0.038). Higher average accuracy, especially in terms of ACC, is an appealing result. In this sense, the analysis confirms the efficacy of TL and of tDSVM in particular. However, the inductive approaches perform surprisingly competitively. Given that tDSVM training includes the working set objects, one would expect the resulting model to classify these objects much more accurately than an inductive model.
3 Note that we use the individual results of the randomized runs per data set for statistical testing.
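The average ranks reported in Tables 7 and 8 can be computed as in the following sketch (placeholder values; the use of scipy is our own convenience, not part of the experimental code):

```python
# Per randomized run, classifiers are ranked by performance (rank 1 =
# best); ranks are then averaged over all runs, as in Demsar [64].
import numpy as np
from scipy.stats import rankdata

# rows = runs, columns = classifiers (tDSVM, TSVM, L1-SVM, DSVM)
auc = np.array([[0.87, 0.92, 0.93, 0.89],   # placeholder AUC values
                [0.99, 0.99, 0.98, 0.97]])

ranks = rankdata(-auc, axis=1)  # negate so that higher AUC receives rank 1
print(ranks.mean(axis=0))       # average rank per classifier; ties share ranks
```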
Fig. 7. Development of predictive performance of tDSVM and L1-SVM across different ratios l/u.
Arguably, our results do not show a substantial difference between tDSVM and the two inductive approaches. A possible explanation lies in our experimental setup. In all experiments, we use the same amount of labeled and unlabeled data (see Section 4). TL is typically employed when the amount of labeled data is small and unlabeled data is available in abundance (e.g. [18,70,71]). The marginal utility of additional examples during model building decreases once a classifier has seen 'enough' labeled examples [72]. With even amounts of labeled and unlabeled data in our experiments, the inductive classifiers may have received sufficient information (i.e., labeled examples) to reconstruct the relationship between class membership and attribute values with high accuracy. To test this, we run an additional set of experiments with varying amounts of labeled and unlabeled data using nine ratios l/u = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.25]. For each ratio, we construct five samples per data set with random training/test set assignments. We use these samples to construct tDSVM and L1-SVM classifiers, which we then compare in terms of AUC and ACC. We select L1-SVM for this analysis because our previous results suggest that it performs slightly better than the other inductive classifier on our data. The results are shown in Fig. 7. Fig. 7 reveals that tDSVM typically performs better than L1-SVM. For most data sets, tDSVM achieves higher predictive accuracy across all ratios of l to u. In the most extreme setting, where only one percent of the original labeled training set is available for model building, tDSVM outperforms L1-SVM on all but one data set. This evidences the importance of the relative amounts of labeled and unlabeled data when comparing inductive and transductive classifiers.
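The construction of these samples can be sketched as follows (a hypothetical helper built on scikit-learn; the sampling procedure used in our experiments may differ in detail):

```python
# For a target ratio r = l/u, draw l labeled training objects and treat
# the remaining u objects as the unlabeled working set.
import numpy as np
from sklearn.model_selection import train_test_split

def split_by_ratio(X, y, r, seed):
    frac_labeled = r / (1.0 + r)  # since l / (l + u) = r / (1 + r)
    X_l, X_u, y_l, y_u = train_test_split(
        X, y, train_size=frac_labeled, random_state=seed, stratify=y)
    return X_l, y_l, X_u, y_u     # y_u is withheld, used only for evaluation

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))   # synthetic attributes
y = (X[:, 0] > 0).astype(int)     # synthetic binary labels
X_l, y_l, X_u, y_u = split_by_ratio(X, y, r=0.05, seed=0)
print(len(y_l), len(y_u))         # 47 labeled vs. 953 unlabeled objects
```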
More specifically, the results of Fig. 7 confirm that the competitive performance of the inductive classifiers in the previous analysis (Tables 7 and 8) is largely explained by the ratio of labeled to unlabeled data. Another finding from Fig. 7 is that the gain in predictive performance from applying TL can be small, even if labeled data is scarce. Consider, for example, the ACC results on the IO data set. L1-SVM and tDSVM perform virtually the same across all ratios l/u, including the most extreme case l/u = 0.01. Some other data sets show a similar tendency. One may thus ask whether the additional computational effort of TL compared to L1-SVM is justified. It is difficult to give a general answer to this question. The value of improvements in predictive performance depends on the application context. A common view is that marginal improvements in accuracy can be very meaningful in business applications. This is typically explained with scaling effects: when a decision support model is used with high frequency, even a small increase in its effectiveness can improve business performance substantially. This has, for example, been observed in churn management (e.g. [67,73,74]). Targeting banner ads so as to increase click-through rates in online marketing is another example (e.g., [75]). Predictive performance is even more important when a model is used to aid medical decision making, for instance in cancer diagnosis (e.g. [76]), where the consequences of misclassification errors are dramatic. In this sense, there is always room for better, that is, more accurate, prediction methods, even if the degree to which they improve upon current standards is moderate. It is, however, important to choose the right applications for such methods. This is also apparent from the results on the SP data set, where L1-SVM outperforms tDSVM on all but the most extreme l/u ratio (see Fig. 7). Of all data sets used in this study, SP is the one with the largest number of attributes (see Table 2). One may speculate that the poor performance of tDSVM is partly due to high dimensionality. In particular, tDSVM implements the cluster assumption and strives to construct the decision surface in a low density area of the feature space. The distance between decision objects in the feature space increases with the number of attributes. As a consequence, low density areas are more likely to occur not only between objects of different classes but also between objects of the same class. It might then be misleading to rely on density-based information when creating the classification model, which could explain why tDSVM classifies SP objects less accurately; a small illustration of this distance effect follows below. However, future research is needed to better understand the relationship between dimensionality and TL efficacy and, more generally, to clarify what other factors besides the ratio of labeled to unlabeled data determine the appropriateness of TL. For now, we conclude that tDSVM is a viable alternative to inductive models that should be considered whenever class decisions are needed for a priori known working set objects. There is a good chance that tDSVM improves predictive performance, especially when the amount of labeled data is small relative to the unlabeled data.
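The distance effect referenced above can be verified with a small simulation (illustrative only, not part of our experiments): average pairwise distances between uniformly random points grow with the number of dimensions, so low density regions become ubiquitous in high-dimensional feature spaces.

```python
# Mean pairwise Euclidean distance of random points in the unit cube
# grows with the dimensionality d of the feature space.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 20, 200]:
    X = rng.uniform(size=(200, d))         # 200 random objects in d dimensions
    diffs = X[:, None, :] - X[None, :, :]  # pairwise coordinate differences
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(200, k=1)         # unique object pairs only
    print(d, round(dist[iu].mean(), 2))    # roughly 0.52, 1.81, 5.77
```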
6. Summary and conclusions

We set out to develop a novel transductive classifier which extends Orsenigo and Vercellis's DSVM methodology [27,28] to the TL setting. Drawing inspiration from Joachims's [18] formulation of a TSVM, we developed a combinatorial program that incorporates a discrete step function to count empirical and transductive errors on training and working set objects, respectively. We then devised a memetic algorithm to solve the MIP formulation and construct a tDSVM classification model. Our algorithm performs a direct optimization of the model parameters (w, b) and an indirect optimization of the decision variables y⋆_i. It employs population-based and local search operators to pursue the conflicting goals of low classification error and a large margin of separation, respectively.
In several empirical experiments on synthetic and real-world data, we examined the appropriateness of our tDSVM formulation, the effectiveness of tDSVMmem, and, more generally, the relative advantages of TL compared to inductive classification. Overall, our results warrant three main conclusions.
First, we find further support for Orsenigo and Vercellis's view that a discrete measurement of classification errors is preferable to a continuous error approximation during SVM training. In particular, we observed that tDSVM classifies working set objects significantly more accurately than the original TSVM. We attribute this to the discrete measurement of transductive errors, which represents the main difference between the two classifiers. In this sense, we show that the effectiveness of the DSVM framework [27,28] generalizes to TL.
Second, the comparisons of tDSVM to two inductive classifiers revealed that L1-SVM and DSVM predict only slightly less accurately when working with even amounts of labeled and unlabeled data, whereas tDSVM will typically give better prediction performance if labeled data is scarce. Hence, TL, and tDSVM in particular, is not guaranteed to be the better choice whenever the application requirements of TL are met. The transductive approach entails a higher computational cost (e.g., compared to L1-SVM), but will not always improve upon an inductive classifier. An important determinant of whether TL is preferable is the similarity or dissimilarity of the distributions of working and training set objects. Given that this factor may be difficult to quantify in real-world applications, a practical recommendation is to consider TL alongside – but not instead of – inductive methods whenever class predictions are sought for a set of known working set instances. In cases where the modeler faces a severe imbalance between labeled and unlabeled objects, TL is particularly appropriate because an inductive approach will have difficulties learning the relationship between attribute values and class memberships.
Third, we found strong evidence that tDSVMmem is a suitable approach to construct tDSVM classifiers. Despite its heuristic nature, it successfully identified optimal solutions in several cases. In all other cases, where optimality was not assured due to the difficulty of the problem instance, we were able to show that tDSVMmem performs at least as well as, and typically better than, CPLEX. This confirmed the effectiveness of tDSVMmem in solving the mathematical program underlying the tDSVM classifier. We also found that the classification models resulting from tDSVMmem predict significantly more accurately than the tDSVM models produced by CPLEX. This suggests that better solutions (lower objective values) translate into better classifiers (higher accuracy). Although this relationship is intuitive, previous results have shown that there is not always a one-to-one correspondence between objective values and predictive accuracy in inductive learning [29]. In this sense, our results suggest that the link between the two is more stable in a transductive setting, where decision objects are considered when constructing the classifier. One could argue that this makes the TL setting particularly suitable for developing novel (better) classification methods: the stronger the link between the objective values of a classifier learning program and the classifier's predictive performance, the more rewarding it is to develop better solvers.
Our study suggests several directions for further research, related either to the formulation of the tDSVM learning problem or to the novel solver tDSVMmem. With respect to the mathematical program, future research could test alternative formulations of the tDSVM program and compare these to the one proposed here. For example, instead of using a discrete measurement for both empirical and transductive errors, it would be possible to use the approximate error measurement of standard SVMs for empirical errors and reserve the discrete approach for working set objects. This would reduce the number of binary variables and thus accelerate the classifier construction process, especially in applications where a reasonable amount of labeled data is available.
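To sketch what such a hybrid program could look like, assume the usual soft-margin structure of (2): continuous slacks ξ_i would absorb empirical errors on the l labeled objects, while binary indicators θ_j count transductive errors on the u working set objects. The following is an illustrative sketch under this assumption, not the formulation proposed in this paper:

```latex
\begin{align*}
\min_{w,\,b,\,\xi,\,\theta,\,y^\star}\quad
  & \alpha \lVert w \rVert + C \sum_{i=1}^{l} \xi_i
    + C^\star \sum_{j=1}^{u} \theta_j \\
\text{s.t.}\quad
  & y_i \,(w^\top x_i + b) \ge 1 - \xi_i,
    && i = 1,\ldots,l, \\
  & y_j^\star \,(w^\top x_j + b) \ge 1 - M\,\theta_j,
    && j = 1,\ldots,u, \\
  & \xi_i \ge 0, \quad \theta_j \in \{0,1\}, \quad y_j^\star \in \{-1,+1\},
\end{align*}
```

where M is a sufficiently large constant; the products y⋆_j (w·x_j + b) would have to be linearized, e.g., through standard big-M reformulations, to obtain a MIP with binary variables only for the working set objects.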
Such analysis would also deepen our understanding of how and under which conditions a stricter compliance with the ORM principle leads to better classifiers. Eventually, initiatives along this line could lead to a family of alternative programming formulations and decision rules for choosing among these formulations on the basis of data set characteristics. In a similar fashion, tDSVMmem is a first approach to construct tDSVM classifiers. Our results evidence the collective efficacy of the search operators incorporated in tDSVMmem. Future research could examine their effectiveness in isolation. In particular, running tDSVMmem with just one of the three operators proposed in this work (or with different pairs of operators, respectively) facilitates appraising the extent to which each operator contributes individually to the performance of tDSVMmem. This insight could lead to a revision of our operators or the development of novel operators, respectively, to further improve tDSVMmem. On a wider scope, future research could examine the potential of alternative approaches to construct tDSVM classifiers. We have argued for the appropriateness of the memetic framework for the focal programming problem. However, experiments with other metaheuristics could provide valuable insight to confirm this view or, possibly, to identify an even more suitable search philosophy. Finally, the TL setting is rarely considered outside Machine Learning. As we show in this work, building a transductive classification model often involves solving a MIP. Given its excellence in mixed integer programming, the OR community could undoubtedly contribute greatly toward the further advancement of TL. Analyzing TL programming problems and the corresponding solution spaces from a theoretical angle, and designing tailor-made exact or heuristic solvers, exemplify the potential for interdisciplinary work in the field. Our work makes a first step in this direction, which we believe is a promising avenue for future research at the interface of OR and data mining.

References

[1] X. Bai, R. Padman, J. Ramsey, P. Spirtes, Tabu search-enhanced graphical models for classification in high dimensions, INFORMS Journal on Computing 20 (3) (2008) 423–437.
[2] O.L. Mangasarian, W.N. Street, W.H. Wolberg, Breast cancer diagnosis and prognosis via linear programming, Operations Research 43 (4) (1995) 570–577.
[3] D. West, P. Mangiameli, R. Rampal, V. West, Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application, European Journal of Operational Research 162 (2) (2005) 532–551.
[4] W. Gehrlein, B. Wagner, A two-stage least cost credit scoring model, Annals of Operations Research 74 (0) (1997) 159–171.
[5] D. West, S. Dellana, J. Qian, Neural network ensemble strategies for financial decision applications, Computers & Operations Research 32 (10) (2005) 2543–2559.
[6] J. Hadden, A. Tiwari, R. Roy, D. Ruta, Computer assisted customer churn management: state-of-the-art and future trends, Computers & Operations Research 34 (10) (2007) 2902–2917.
[7] J. Qi, L. Zhang, Y. Liu, L. Li, Y. Zhou, Y. Shen, L. Liang, H. Li, ADTreesLogit model for customer churn prediction, Annals of Operations Research 168 (1) (2009) 247–265.
[8] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, B. Baesens, New insights into churn prediction in the telecommunication sector: a profit driven data mining approach, European Journal of Operational Research 218 (1) (2012) 211–229.
[9] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2009.
[10] D.J. Hand, W.E. Henley, Statistical classification models in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A (General) 160 (3) (1997) 523–541.
[11] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[12] N. Freed, F. Glover, Linear programming and statistical discrimination – the LP side, Decision Sciences 13 (1) (1982) 172–175.
[13] O.L. Mangasarian, Linear and nonlinear separation of patterns by linear programming, Operations Research 13 (3) (1965) 444–452.
[14] P.A. Rubin, Solving mixed integer classification problems by decomposition, Annals of Operations Research 74 (0) (1997) 51–64.
[15] F. Talla Nobibon, R. Leus, F.C.R. Spieksma, Optimization models for targeted offers in direct marketing: exact and heuristic algorithms, European Journal of Operational Research 210 (3) (2011) 670–683.
[16] J. Zhang, Y. Shi, P. Zhang, Several multi-criteria programming methods for classification, Computers & Operations Research 36 (3) (2009) 823–836.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[18] T. Joachims, Transductive inference for text classification using support vector machines, in: I. Bratko, S. Dzeroski (Eds.), Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, Bled, Slovenia, 1999, pp. 200–209.
[19] M.M. Silva, T.T. Maia, A.P. Braga, An evolutionary approach to transduction in support vector machines, in: Proceedings of the 5th International Conference on Hybrid Intelligent Systems, 2005, pp. 329–334.
[20] H. Brandner, S. Lessmann, S. Voß, Support of managerial decision making by transductive learning, in: A. Bernstein, G. Schwabe (Eds.), Proceedings of the 10th International Conference on Wirtschaftsinformatik, Zurich, Switzerland, 2011, pp. 973–982.
[21] V. Cherkassky, F.M. Mulier, Learning from Data: Concepts, Theory, and Methods, second ed., Wiley & Sons, New Jersey, 2007.
[22] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[23] K.P. Bennett, A. Demiriz, Semi-supervised support vector machines, in: M.J. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems, vol. 11, MIT Press, Cambridge, 1999, pp. 368–374.
[24] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[25] G. Fung, O.L. Mangasarian, Semi-supervised support vector machines for unlabeled data classification, Optimization Methods and Software 15 (2001) 29–44.
[26] V. Vapnik, S. Kotz, Estimation of Dependences Based on Empirical Data, second ed., Springer, New York, 2006.
[27] C. Orsenigo, C. Vercellis, Multivariate classification trees based on minimum features discrete support vector machines, IMA Journal of Management Mathematics 14 (3) (2003) 221–234.
[28] C. Orsenigo, C. Vercellis, Discrete support vector decision trees via tabu-search, Computational Statistics & Data Analysis 47 (2) (2004) 311–322.
[29] M. Caserta, S. Lessmann, S. Voß, A novel approach to construct discrete support vector machine classifiers, in: A. Fink, B. Lausen, W. Seidel, A. Ultsch (Eds.), Advances in Data Analysis, Data Handling and Business Intelligence, Springer, Berlin, 2010, pp. 115–125.
[30] C. Orsenigo, C. Vercellis, Evaluating membership functions for fuzzy discrete SVM, in: Proceedings of the 7th International Workshop on Fuzzy Logic and Applications, Springer, Berlin, 2007, pp. 187–194.
[31] C. Orsenigo, C. Vercellis, Softening the margin in discrete SVM, in: P. Perner (Ed.), Proceedings of the 7th Industrial Conference on Data Mining, Springer, Berlin, 2007, pp. 49–62.
[32] C. Orsenigo, C. Vercellis, Multicategory classification via discrete support vector machines, Computational Management Science 6 (1) (2009) 101–114.
[33] C. Orsenigo, C. Vercellis, Combining discrete SVM and fixed cardinality warping distances for multivariate time series classification, Pattern Recognition 43 (11) (2010) 3787–3794.
[34] R.I. Bot, N. Lorenz, Optimization problems in statistical learning: duality and optimality conditions, European Journal of Operational Research 213 (2) (2011) 395–404.
[35] A. Frank, A. Asuncion, UCI Machine Learning Repository, Tech. Rep., School of Information and Computer Science, University of California, Irvine, CA, USA, 2010.
[36] E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multi-class to binary: a unifying approach for margin classifiers, Journal of Machine Learning Research 1 (2000) 113–141.
[37] V. Cherkassky, Y. Ma, Another look at statistical learning theory and regularization, Neural Networks 22 (7) (2009) 958–969.
[38] P.S. Bradley, U.M. Fayyad, O.L. Mangasarian, Mathematical programming for data mining: formulations and challenges, INFORMS Journal on Computing 11 (3) (1999) 217–238.
[39] O. Chapelle, V. Vapnik, J. Weston, Transductive inference for estimating values of functions, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, 2000, pp. 421–427.
[40] P. Moscato, On Evolution, Search, Optimization, Genetic Algorithms and Martial Arts – Towards Memetic Algorithms, Tech. Rep., Caltech Concurrent Computation Program Report 826, CalTech, Pasadena, CA, USA, 1989.
[41] W.E. Hart, N. Krasnogor, J.E. Smith, Recent Advances in Memetic Algorithms, first ed., Springer, Berlin, 2005.
[42] R. Dawkins, The Selfish Gene, Oxford University Press, 1976.
[43] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, 1975.
[44] H. Beyer, H. Schwefel, Evolution strategies – a comprehensive introduction, Natural Computing 1 (1) (2002) 3–52.
[45] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, first ed., Addison-Wesley, Boston, 1989.
[46] G.J. Köhler, Improper linear discriminant classifiers, European Journal of Operational Research 50 (2) (1991) 188–198.
[47] G.J. Köhler, Characterization of unacceptable solutions in LP discriminant analysis, Decision Sciences 20 (2) (1989) 239–257.
[48] G.J. Köhler, Considerations for mathematical programming models in discriminant analysis, Managerial and Decision Economics 11 (4) (1990) 227–234.
[49] V. Sindhwani, S.S. Keerthi, Large scale semi-supervised linear SVMs, in: E.N. Efthimiadis, S.T. Dumais, D. Hawking, K. Järvelin (Eds.), Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 2006, pp. 477–484.
[50] Y. Chen, G. Wang, S. Dong, Learning with progressive transductive support vector machine, Pattern Recognition Letters 24 (12) (2003) 1845–1855.
[51] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
[52] S. Lessmann, S. Voß, A reference model for customer-centric data mining with support vector machines, European Journal of Operational Research 199 (2) (2009) 520–530.
[53] D.R. Musicant, NDC: Normally Distributed Clustered Datasets, 1998.
[54] C. Audet, P. Hansen, A. Karam, C. Ng, S. Perron, Exact l2-norm plane separation, Optimization Letters 2 (4) (2008) 483–495.
[55] F. Plastria, S. De Bruyne, E. Carrizosa, Alternating local search based VNS for linear classification, Annals of Operations Research 174 (1) (2010) 121–134.
[56] A. Karam, G. Caporossi, P. Hansen, Arbitrary-norm hyperplane separation by variable neighbourhood search, IMA Journal of Management Mathematics 18 (2) (2007) 173–189.
[57] O.L. Mangasarian, D.R. Musicant, Active support vector machine classification, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, 2000, pp. 577–583.
[58] P.S. Bradley, O.L. Mangasarian, J.B. Rosen, Parsimonious least norm approximation, Computational Optimization and Applications 11 (1) (1998) 5–21.
[59] R.S. Sexton, S. McMurtrey, D. Cleavenger, Knowledge discovery using a neural network simultaneous optimization algorithm on a real world classification problem, European Journal of Operational Research 168 (3) (2006) 1009–1018.
[60] D. Bertsimas, R. Shioda, Classification and regression via integer optimization, Operations Research 55 (2) (2007) 252–271.
[61] Y. Marinakis, M. Marinaki, M. Doumpos, N. Matsatsinis, C. Zopounidis, A hybrid ACO-GRASP algorithm for clustering analysis, Annals of Operations Research 188 (1) (2009) 343–358.
[62] S.F. Crone, S. Lessmann, R. Stahlbock, The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing, European Journal of Operational Research 173 (3) (2006) 781–800.
[63] D. Martens, B. Baesens, T. Van Gestel, Decompositional rule extraction from support vector machines by active learning, IEEE Transactions on Knowledge and Data Engineering 21 (2) (2009) 178–191.
[64] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[65] T. Van Gestel, B. Baesens, J.A.K. Suykens, D. Van den Poel, D.-E. Baestaens, M. Willekens, Bayesian kernel based classification for financial distress detection, European Journal of Operational Research 172 (3) (2006) 979–1003.
[66] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences 180 (10) (2010) 2044–2064.
[67] S.A. Neslin, S. Gupta, W. Kamakura, J. Lu, C.H. Mason, Defection detection: measuring and understanding the predictive accuracy of customer churn models, Journal of Marketing Research 43 (2) (2006) 204–211.
[68] P. Bradley, O. Mangasarian, W. Street, Feature selection via mathematical programming, INFORMS Journal on Computing 10 (2) (1998) 209–217.
[69] K.P. Bennett, Combining support vector and mathematical programming methods for classification, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, 1999, pp. 307–326.
[70] R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, Journal of Machine Learning Research 7 (2006) 1687–1712.
[71] Y. Wang, S.-T. Huang, Training TSVM with the proper number of positive samples, Pattern Recognition Letters 26 (14) (2005) 2187–2194.
[72] C. Perlich, F. Provost, J.S. Simonoff, W.W. Cohen, Tree induction vs. logistic regression: a learning-curve analysis, Journal of Machine Learning Research 4 (2) (2003) 211–255.
[73] B. Baesens, S. Viaene, D. Van den Poel, J. Vanthienen, G. Dedene, Bayesian neural network learning for repeat purchase modelling in direct marketing, European Journal of Operational Research 138 (1) (2002) 191–211.
[74] W. Buckinx, D. Van den Poel, Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting, European Journal of Operational Research 164 (1) (2005) 252–268.
[75] C. Perlich, B. Dalessandro, R. Hook, O. Stitelman, T. Raeder, F.J. Provost, Bid optimizing and inventory scoring in targeted online advertising, in: Q. Yang, D. Agarwal, J. Pei (Eds.), Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, 2012, pp. 804–812.
[76] D. West, P. Mangiameli, R. Rampal, V. West, Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application, European Journal of Operational Research 162 (2) (2005) 532–551.