European Journal of Operational Research 173 (2006) 893–909 www.elsevier.com/locate/ejor
Robust classification and regression using support vector machines

Theodore B. Trafalis *, Robin C. Gilbert

Laboratory of Optimization and Intelligent Systems, School of Industrial Engineering, University of Oklahoma, 202 West Boyd, Room 124, Norman, OK 73019, USA

Received 18 November 2004; accepted 18 July 2005; available online 16 November 2005
Abstract

In this paper, we investigate the theoretical aspects of robust classification and robust regression using support vector machines. Given training data $(x_1, y_1), \dots, (x_l, y_l)$, where $l$ is the number of samples, $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ (for classification) or $y_i \in \mathbb{R}$ (for regression), we investigate the training of a support vector machine when a bounded perturbation is added to the value of the input $x_i \in \mathbb{R}^n$. We consider both the case where the training data are linearly separable and the case where they are nonlinearly separable. We show that robust classification or regression can be performed by solving linear or second order cone programming problems.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Robustness; Classification; Regression; Support vector machines
1. Introduction

In real-life problems, random and non-random processes give rise to uncertain data. Therefore, it is a critical issue to develop machine learning and data mining techniques that are "immune" to data uncertainties and perturbations. The aim of a realistic classification (or regression) predictive data mining model such as support vector machines (SVMs) [1–4] is to be unaffected by potential noise in the input data. Developing such models requires exploiting recent developments in the area of robust optimization [5–7].
* Corresponding author. Tel.: +1 405 325 4347; fax: +1 405 325 7555. E-mail addresses: [email protected] (T.B. Trafalis), [email protected] (R.C. Gilbert). URL: http://www.lois.ou.edu/ (T.B. Trafalis).
0377-2217/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2005.07.024
To model real-world systems in a realistic manner, we construct formulations that do not require a priori information about the uncertainties: we assume that the data uncertainties and perturbations themselves are unknown. However, a bound on the perturbations with respect to a chosen norm in the input space is known. We therefore deal with hyperspheres in $\mathbb{R}^n$ that delimit the portion of space in which the perturbed data are supposed to lie [8,9]. Our aim is to obtain classifiers or regressors that have good generalization properties and are insensitive to bounded perturbations of the data. We produce classifiers and regressors whose outputs are very close to those of the precise (unperturbed) case, even if the inputs are corrupted. Our analysis is general, is based on optimization arguments, and does not assume any probability distribution for the noise.

Next we provide a quick overview of robust optimization methods. For example, assume that we are trying to obtain the robust formulation of the following linear programming problem:
$$\begin{array}{ll} \underset{x}{\text{minimize}} & c^{t} x \\ \text{subject to} & A x \leq b, \end{array}$$
with $c, x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $A \in \mathbb{R}^{m \times n}$ a matrix of perturbed coefficients which can be decomposed into $A = \bar{A} + \tilde{A}$, where $\bar{A} \in \mathbb{R}^{m \times n}$ is the matrix of the original (known) coefficients and $\tilde{A} \in \mathbb{R}^{m \times n}$ is the matrix of bounded perturbations. Therefore, the robust counterpart of the previous linear programming problem is
$$\begin{array}{ll} \underset{x}{\text{minimize}} & c^{t} x \\ \text{subject to} & (\bar{A} + \tilde{A}) x \leq b, \\ & \|\tilde{A}^{t} e_i\|_p \leq \eta_i, \quad \forall i \in \{1,\dots,m\}, \end{array}$$
where $e_i \in \mathbb{R}^m$ is the $i$th column of the identity matrix of rank $m$, and $\eta_i \in \mathbb{R}_+$ is an upper bound for the $p$-norm of the perturbation of the $i$th row of $\tilde{A}$ (i.e., of the $i$th constraint). An optimal solution $x^*$ of the above problem is then robust with respect to the perturbations induced by the matrix $\tilde{A}$. This mathematical programming problem cannot be solved directly using standard techniques; it is equivalent either to a linear programming problem or to a second order cone programming (SOCP) problem [10], depending on the specific choice of the norm. These robust counterparts allow us to obtain the same separating hyperplanes as in the case of precise data even if the inputs are corrupted by unknown perturbations.

The connection between large margin SVM classifiers and robust optimization can be explained by the following simple geometric argument. Given a training point $(\bar{x}, y)$, we can generate a test point of the form $(\bar{x} + r, y)$, where the perturbation vector $r$ is bounded in norm by $\eta$ (the radius of the uncertainty sphere around the point $\bar{x}$, with $\eta \geq 0$). If we manage to separate the training data with margin $\rho \geq 2\eta$, then we can correctly classify this test point. This provides a connection between training with pattern noise and margin maximization; the latter can be thought of as a regularizer. Similar links between supervised training with noise and other types of regularizers have been investigated before for neural networks by Bishop [11].

The paper is structured as follows. In Section 2, we discuss robust classification. After brief reviews of classical support vector classification in Sections 2.1 and 2.2, we define the formulation of the linear robust classification problem in Section 2.3. Section 2.4 extends this formulation to the nonlinear case, and Section 2.5 deals with linear robust classification using $L_1$ or $L_\infty$ as a measure of distance in $\mathbb{R}^n$. In Section 3, we discuss robust regression using SVMs. Section 3.1 gives the formulation of the linear regression problem in the non-separable case; Section 3.2 introduces the linear robust regression problem, and Section 3.3 provides an extension to the nonlinear case. Section 3.4 deals with linear robust regression using $L_1$ or $L_\infty$ as a measure of distance in $\mathbb{R}^n$. Computational results are provided in Section 4. Finally, Section 5 concludes the paper.
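As a hedged illustration (not part of the original text), the following Python sketch poses the robust counterpart above with the cvxpy modeling package, taking $p = 2$ and replacing each semi-infinite constraint by its worst case $\bar{a}_i^{t} x + \eta_i \|x\|_2 \leq b_i$ (the same dual-norm argument used for the classification constraints in Section 2.3); the data $\bar{A}$, $b$, $c$ and the bounds $\eta_i$ are placeholder values.

```python
# Hedged sketch: robust LP counterpart with row-wise 2-norm perturbation
# bounds, using the worst-case (dual-norm) reformulation
#   abar_i^t x + eta_i * ||x||_2 <= b_i   for every row i.
# A_bar, b, c and eta are illustrative placeholders, not data from the paper.
import numpy as np
import cvxpy as cp

m, n = 4, 3
rng = np.random.default_rng(0)
A_bar = rng.normal(size=(m, n))          # nominal coefficient matrix
b = rng.normal(size=m) + 5.0             # right-hand side
c = rng.normal(size=n)                   # cost vector
eta = 0.1 * np.ones(m)                   # perturbation bound per row

x = cp.Variable(n)
constraints = [A_bar[i] @ x + eta[i] * cp.norm(x, 2) <= b[i] for i in range(m)]
problem = cp.Problem(cp.Minimize(c @ x), constraints)
problem.solve()
print("robust optimal x:", x.value)
```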
2. Robust classification

We will extend the current techniques of pattern classification using SVMs [12] to the general case of pattern classification with uncertain data. In this section, we will consider the $L_p$ norm as a measure of distance in $\mathbb{R}^n$.

2.1. Maximum margin classifiers

We assume that our set of classifiers in $\mathbb{R}^n$ is defined through oriented hyperplanes and that our data are linearly separable.¹ Given a family $\{(x_i, y_i)\}_{i \in \{1,\dots,l\}}$ of training data, where $l$ is the number of given samples and, for all $i$ in $\{1,\dots,l\}$, the vectors $x_i$ belong to $\mathbb{R}^n$ and $y_i \in \{-1, 1\}$, our purpose is to look for the separating hyperplane with largest margin with respect to the set of inequalities
$$y_i(\langle w \cdot x_i \rangle + b) \geq 1, \quad \forall i \in \{1,\dots,l\}, \qquad (1)$$
where $y_i \in \{-1, 1\}$ for all $i$ in $\{1,\dots,l\}$ and $l$ is the number of samples. The margin $\rho_p(b, w)$, which is twice the distance between the separating hyperplane and the closest points with respect to the $L_p$ norm, is then
$$\rho_p(b, w) = \rho_p(w) = \frac{2}{\|w\|_q},$$
where $L_q$ is the dual norm of $L_p$. To maximize $\rho_p(w)$ we just have to solve a minimization problem: specifically, to minimize a function of $\|w\|_q$ subject to the inequalities (1). The value of this function of $\|w\|_q$, referred to as $f_p(w)$, can be chosen for example to be $f_p(w) = \|w\|_q$ or $f_p(w) = \|w\|_q^2$; in fact this choice depends on the choice of $p$.

2.2. Linear classification in the non-separable case

In the case where our data are not linearly separable, but we still wish to perform linear separation, we introduce positive slack variables $\xi_i$, for all $i$ in $\{1,\dots,l\}$, in the constraints, which then become
$$y_i(\langle w \cdot x_i \rangle + b) \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}.$$
Also, we introduce a real-valued penalty function $g(C, \xi)$ in the objective function. The role of this function $g(C, \xi)$ is to assign a penalty cost in the objective function when the data are not completely separated by the resulting hyperplane. The vector $C$ is a vector of positive parameters which assigns different penalties to the errors of each data point. A natural choice for $g(C, \xi)$ is $g(C, \xi) = \langle C \cdot \xi \rangle$ (linear formulation). Alternatively, we may take $g(C, \xi) = \xi^{t}\,\mathrm{diag}(C)\,\xi$ (quadratic formulation), where $\mathrm{diag}(C)$ is the diagonal matrix whose diagonal is the vector $C$. This last formulation avoids adding the supplementary constraints $\xi_i \geq 0$, for all $i$ in $\{1,\dots,l\}$. Therefore the resulting optimization problem for training a maximum margin classifier is
$$\begin{array}{ll} \underset{w, b, \xi}{\text{minimize}} & f_p(w) + g(C, \xi) \\ \text{subject to} & y_i(\langle w \cdot x_i \rangle + b) \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Given an optimal solution $w^*$, $b^*$ and $\xi^*$, the decision function $h$ is defined for $x$ in $\mathbb{R}^n$ by $h(w^*, b^*, x) = \mathrm{sign}(\langle w^* \cdot x \rangle + b^*)$.
¹ A set of data will be called linearly separable if there exists a pair $(w, b)$ such that (1) holds.
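As a hedged illustration (not from the paper), the soft-margin problem of Section 2.2 with $f_2(w) = \|w\|_2^2$ and the linear penalty $g(C, \xi) = \langle C \cdot \xi \rangle$ can be posed with the cvxpy modeling package; the data and the vector $C$ below are placeholders.

```python
# Hedged sketch of the soft-margin primal with f_2(w) = ||w||_2^2 and the
# linear penalty g(C, xi) = <C . xi>; X, y and C are illustrative placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                                  # training inputs x_i
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=20))    # labels in {-1, +1}
C = 10.0 * np.ones(20)                                        # per-point penalty weights

l, n = X.shape
w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l, nonneg=True)                              # slack variables xi_i >= 0
margin_constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
objective = cp.Minimize(cp.sum_squares(w) + C @ xi)
cp.Problem(objective, margin_constraints).solve()
decision = lambda x: np.sign(x @ w.value + b.value)           # h(w*, b*, x)
```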
2.3. Linear robust classification

Next we consider that our training vectors $x_i$, $i$ in $\{1,\dots,l\}$, are subject to perturbations of each of their coordinates. These perturbations can be represented as an additional vector $r_i$ which alters the coordinates of the mean or nominal vector $\bar{x}_i$. Therefore, we will have $x_i = \bar{x}_i + r_i$, and the first set of constraints of the previous optimization problem becomes
$$y_i(\langle w \cdot \bar{x}_i \rangle + b) + y_i \langle w \cdot r_i \rangle \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}.$$
We consider here that the value of the vector $r_i$ is unknown. Despite this fact, we assume that we are able to bound the norm of each $r_i$, for all $i$ in $\{1,\dots,l\}$, with respect to the $L_p$ norm. This means that we have a real constant $\eta_i$ for each $r_i$ such that $\|r_i\|_p \leq \eta_i$, for all $i$ in $\{1,\dots,l\}$. Now, we have to find a feasible solution of the previous optimization problem with respect to the above new set of constraints, for every realization of $r_i$. We consider $w$ as a robust feasible solution if and only if for every $i \in \{1,\dots,l\}$ we have
$$\min_{\|r_i\|_p \leq \eta_i} \; y_i(\langle w \cdot \bar{x}_i \rangle + b) + y_i \langle w \cdot r_i \rangle + \xi_i \geq 1.$$
Thus, we need to solve the following problem (in which $w$ is fixed):
$$\begin{array}{ll} \underset{r_i}{\text{minimize}} & y_i \langle w \cdot r_i \rangle \\ \text{subject to} & \|r_i\|_p \leq \eta_i. \end{array}$$
Using Hölder's inequality,
$$|y_i \langle w \cdot r_i \rangle| \leq \|w\|_q \|r_i\|_p \leq \eta_i \|w\|_q,$$
where $L_q$ is the dual norm of $L_p$. Then, the minimum of $y_i \langle r_i \cdot w \rangle$ is equal to $-\eta_i \|w\|_q$. By substituting this minimum in the previous set of constraints, we obtain the following formulation of the linear robust classification problem:
$$\begin{array}{ll} \underset{w, b, \xi}{\text{minimize}} & f_p(w) + g(C, \xi) \\ \text{subject to} & y_i(\langle w \cdot \bar{x}_i \rangle + b) - \eta_i \|w\|_q \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \end{array} \qquad (2)$$
where $x_i = \bar{x}_i + r_i$ and $\|r_i\|_p \leq \eta_i$, for all $i$ in $\{1,\dots,l\}$. Given an optimal solution $w^*$, $b^*$ and $\xi^*$, the decision function $h$ is still defined for $x$ in $\mathbb{R}^n$ by $h(w^*, b^*, x) = \mathrm{sign}(\langle w^* \cdot x \rangle + b^*)$.

2.4. Nonlinear robust classification

In order to use a nonlinear classifier, we first map the data into some other Euclidean space, using a mapping which we call $\Phi$. In this space, denoted by $H$ and called the "feature space" (the space where our data originally live will be called the "input space"), we perform a linear robust classification. Our formulation of the optimization problem then becomes
$$\begin{array}{ll} \underset{\omega, b, \xi}{\text{minimize}} & f(\omega) + g(C, \xi) \\ \text{subject to} & y_i(\langle \omega \cdot \Phi(\bar{x}_i) \rangle_H + b) - \delta_i \|\omega\|_H \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \end{array}$$
where $\Phi(x_i) = \Phi(\bar{x}_i) + \iota_i$ and $\|\iota_i\|_H \leq \delta_i$, for all $i$ in $\{1,\dots,l\}$. The bound $\delta_i$ in the feature space can be computed directly from the known bound $\eta_i$ in the input space. We use the notation $\omega$ to refer to the vector defining the classifier in the feature space. For practical reasons, it is not necessary to know $\Phi$ explicitly. We can rewrite this problem using a "kernel" function $k$ [13] such that $k(x, y) = \langle \Phi(x) \cdot \Phi(y) \rangle_H$, for every pair of vectors $(x, y)$ which belong to the input space. Then, let
$$\omega = \sum_{i=1}^{l} \beta_i \Phi(\bar{x}_i),$$
where the $\beta_i$ are real, for all $i$ in $\{1,\dots,l\}$. The dot product $\langle \omega \cdot \Phi(\bar{x}_i) \rangle_H$ then becomes
$$\langle \omega \cdot \Phi(\bar{x}_i) \rangle_H = \sum_{j=1}^{l} \beta_j k(\bar{x}_j, \bar{x}_i),$$
which can be computed through the kernel $k$ without even using $\Phi$. The next problem is to express the value $\|\omega\|_H$ through a kernel function. In this case, we have
$$\|\omega\|_H^2 = \langle \omega \cdot \omega \rangle_H = \sum_{i=1}^{l} \sum_{j=1}^{l} \beta_i \beta_j k(\bar{x}_i, \bar{x}_j) = \beta^{t} K \beta,$$
where $K$ is the matrix such that for every $i$ and $j$ in $\{1,\dots,l\}$ we have $K_{ij} = k(\bar{x}_i, \bar{x}_j)$. Now we can compute $\|\omega\|_H$ without knowing $\Phi$. But at this point our set of constraints is still nonlinear. For computational reasons, we would like this set of constraints to be, at least, a set of second order cone constraints. To do so, we need to define an affine function $F$ for which $\|\omega\|_H = \|F(\beta)\|_2$. As the matrix $K$ is real, symmetric and positive semi-definite, we can perform a factorization of $K$ such that $K$ equals the matrix product $R^{t} R$. Then $\beta^{t} K \beta = (R\beta)^{t} R\beta$ and $\|\omega\|_H = \|R\beta\|_2$; our function $F$ is the linear map $\beta \mapsto R\beta$. Thus, the optimization problem becomes
$$\begin{array}{ll} \underset{\beta, b, \xi}{\text{minimize}} & \|R\beta\|_2^2 + g(C, \xi) \\ \text{subject to} & y_i(K_i \beta + b) - \delta_i \|R\beta\|_2 \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \end{array}$$
where $K_i$ is the $1 \times l$ vector corresponding to the $i$th row of the kernel matrix $K$. Note that this optimization problem is a second order cone programming (SOCP) problem if its objective function is linear. Then, by introducing an auxiliary variable $a$ and using a linear formulation for the penalty function $g$, we obtain the following SOCP problem:
$$\begin{array}{ll} \underset{a, \beta, b, \xi}{\text{minimize}} & a + C^{t} \xi \\ \text{subject to} & \|R\beta\|_2 \leq a, \\ & \|\delta_i R\beta\|_2 \leq y_i(K_i \beta + b) - 1 + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & 0 \leq \xi_i, \quad \forall i \in \{1,\dots,l\}, \end{array}$$
which is the SOCP formulation of the nonlinear robust classification problem. If we wanted a quadratic penalty function $g$ instead, we would use the same trick as before, introducing the auxiliary variable $a$. In this case, by introducing an extra auxiliary variable $c$ and setting $D = \mathrm{diag}(\sqrt{C})$, we obtain
$$\begin{array}{ll} \underset{a, \beta, b, c, \xi}{\text{minimize}} & a + c \\ \text{subject to} & \|R\beta\|_2 \leq a, \\ & \|D\xi\|_2 \leq c, \\ & \|\delta_i R\beta\|_2 \leq y_i(K_i \beta + b) - 1 + \xi_i, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Now, given an optimal solution $\beta^*$, $b^*$ and $\xi^*$, the decision function $h$ is defined for $x$ in $\mathbb{R}^n$ by
$$h(\beta^*, b^*, x) = \mathrm{sign}\!\left(\sum_{j=1}^{l} \beta_j^* k(\bar{x}_j, x) + b^*\right).$$
2.5. Linear programming robust classification formulations

We may transform the optimization problem (2) into a linear programming (LP) problem by using the $L_1$ or $L_\infty$ norm as a measure of distance in $\mathbb{R}^n$. If we choose $p = 1$ (so that the dual norm is $L_\infty$), the optimization problem (2) becomes
$$\begin{array}{ll} \underset{w, b, \xi}{\text{minimize}} & \|w\|_\infty + \langle C \cdot \xi \rangle \\ \text{subject to} & y_i(\langle w \cdot \bar{x}_i \rangle + b) - \eta_i \|w\|_\infty \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
The above problem is equivalent to an LP problem obtained by introducing an auxiliary variable $a$ equal to $\|w\|_\infty$ and adding the constraints $0 \leq a$, $w_k \leq a$ and $-w_k \leq a$, for all $k$ in $\{1,\dots,n\}$. The resulting optimization problem is
$$\begin{array}{ll} \underset{a, w, b, \xi}{\text{minimize}} & a + \langle C \cdot \xi \rangle \\ \text{subject to} & y_i(\langle w \cdot \bar{x}_i \rangle + b) - \eta_i a \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & a \geq w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a \geq -w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a \geq 0. \end{array}$$
If we choose the $L_\infty$ norm instead ($p = \infty$, with dual norm $L_1$), the optimization problem (2) becomes
$$\begin{array}{ll} \underset{w, b, \xi}{\text{minimize}} & \|w\|_1 + \langle C \cdot \xi \rangle \\ \text{subject to} & y_i(\langle w \cdot \bar{x}_i \rangle + b) - \eta_i \|w\|_1 \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Similarly, we obtain an LP problem by introducing an auxiliary vector $a \in \mathbb{R}^n$ where, for all $k$ in $\{1,\dots,n\}$, $a_k$ equals $|w_k|$, and adding the constraints $0 \leq a_k$, $w_k \leq a_k$ and $-w_k \leq a_k$, for all $k$ in $\{1,\dots,n\}$. The resulting optimization problem is
$$\begin{array}{ll} \underset{a, w, b, \xi}{\text{minimize}} & \displaystyle\sum_{k=1}^{n} a_k + \langle C \cdot \xi \rangle \\ \text{subject to} & y_i(\langle w \cdot \bar{x}_i \rangle + b) - \eta_i \displaystyle\sum_{k=1}^{n} a_k \geq 1 - \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & a_k \geq w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a_k \geq -w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a_k \geq 0, \quad \forall k \in \{1,\dots,n\}. \end{array}$$
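For illustration (again not from the paper), a brief sketch of the first LP above ($p = 1$, dual norm $L_\infty$) with cvxpy; the norm atom plays the role of the auxiliary variable $a$, and the data, $C$ and $\eta_i$ are placeholders.

```python
# Hedged sketch of the p = 1 (dual norm L_infinity) robust classification LP;
# cvxpy's norm atom stands in for the auxiliary variable a above.
# Xbar, y, C and eta are illustrative placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
Xbar = rng.normal(size=(25, 3))                    # nominal training vectors
y = np.sign(Xbar @ np.array([1.0, -2.0, 0.5]) + 0.1)
l, n = Xbar.shape
C = 10.0 * np.ones(l)
eta = 0.1 * np.ones(l)                             # L_1 perturbation bounds

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l, nonneg=True)
cons = [y[i] * (Xbar[i] @ w + b) - eta[i] * cp.norm(w, "inf") >= 1 - xi[i]
        for i in range(l)]
cp.Problem(cp.Minimize(cp.norm(w, "inf") + C @ xi), cons).solve()
```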
3. Robust regression

The support vector regression (SVR) that we will establish uses the $\varepsilon$-insensitive loss function to find a function $h$ that has at most $\varepsilon$ deviation from the targets $y_i$ for all the training data [14,15]. Again, each data point in the input space is mapped into a higher dimensional feature space using a feature map $\Phi$. The function $h$ will be linear in the feature space. A kernel function $k$ will be used to substitute the dot product in the nonlinear case. Here, we will consider the $L_p$ norm as a measure of distance in $\mathbb{R}^n$.

3.1. Linear regression in the non-separable case

The similarities with the classification case are numerous. Our objective is still to find a normal vector $w$ that is small in terms of length. We are still trying to minimize the norm of $w$, and it has to be done with respect to constraints that now translate the fact that we wish to be at most at $\varepsilon$ from the targets $y_i$. In other words,
$$|\langle w \cdot x_i \rangle + b - y_i| \leq \varepsilon, \quad \forall i \in \{1,\dots,l\}.$$
Therefore, the basic linear SVR problem is
$$\begin{array}{ll} \underset{w, b}{\text{minimize}} & f_p(w) \\ \text{subject to} & (\langle w \cdot x_i \rangle + b) - y_i \leq \varepsilon, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot x_i \rangle + b) \leq \varepsilon, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
For the case of infeasible constraints, we introduce positive slack variables $\xi_i$ and $\hat{\xi}_i$, for all $i$ in $\{1,\dots,l\}$, in the previous constraints, which then become
$$(\langle w \cdot x_i \rangle + b) - y_i \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\},$$
$$y_i - (\langle w \cdot x_i \rangle + b) \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}.$$
The corresponding penalty function $g$ in the objective function may be $g(C, \xi, \hat{\xi}) = \langle C \cdot (\xi + \hat{\xi}) \rangle$.
Then, in this case the SVR problem is
$$\begin{array}{ll} \underset{w, b, \xi, \hat{\xi}}{\text{minimize}} & f_p(w) + g(C, \xi, \hat{\xi}) \\ \text{subject to} & (\langle w \cdot x_i \rangle + b) - y_i \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot x_i \rangle + b) \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Given an optimal solution $w^*$, $b^*$, $\xi^*$ and $\hat{\xi}^*$, the regression function $h$ is defined for $x$ in $\mathbb{R}^n$ by $h(w^*, b^*, x) = \langle w^* \cdot x \rangle + b^*$.

3.2. Linear robust regression

We still consider that the perturbations are represented as additional vectors $r_i$ which alter the coordinates of the mean vectors $\bar{x}_i$, for all $i$ in $\{1,\dots,l\}$. We have $x_i = \bar{x}_i + r_i$, and the set of constraints of the previous optimization problem becomes
$$(\langle w \cdot \bar{x}_i \rangle + b) - y_i + \langle w \cdot r_i \rangle \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\},$$
$$y_i - (\langle w \cdot \bar{x}_i \rangle + b) - \langle w \cdot r_i \rangle \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}.$$
Also, according to Hölder's inequality, since $\|r_i\|_p \leq \eta_i$ for all $i$ in $\{1,\dots,l\}$, the absolute value of the dot product $\langle w \cdot r_i \rangle$ is bounded by $\eta_i \|w\|_q$, where the $L_q$ norm is the dual norm of $L_p$. Taking the worst-case scenario, the previous constraints are transformed into
$$(\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i \|w\|_q \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\},$$
$$y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i \|w\|_q \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}.$$
Then, the resulting SVR problem is
$$\begin{array}{ll} \underset{w, b, \xi, \hat{\xi}}{\text{minimize}} & f_p(w) + g(C, \xi, \hat{\xi}) \\ \text{subject to} & (\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i \|w\|_q \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i \|w\|_q \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array} \qquad (3)$$
Given an optimal solution $w^*$, $b^*$, $\xi^*$ and $\hat{\xi}^*$, the regression function $h$ is still defined for $x$ in $\mathbb{R}^n$ by $h(w^*, b^*, x) = \langle w^* \cdot x \rangle + b^*$.
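To make formulation (3) concrete, a hedged sketch (not from the paper) with $p = q = 2$, $f_2(w) = \|w\|_2^2$ and the linear penalty $g(C, \xi, \hat{\xi}) = \langle C \cdot (\xi + \hat{\xi}) \rangle$, posed with cvxpy; the data and parameters are placeholders.

```python
# Hedged sketch of the linear robust SVR problem (3) with p = q = 2 and
# f_2(w) = ||w||_2^2; Xbar, y, C, eta and eps are illustrative placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
Xbar = rng.uniform(-3, 3, size=(25, 1))            # nominal inputs
y = np.sin(Xbar[:, 0])                             # targets
l, n = Xbar.shape
C = 10.0 * np.ones(l)
eta = 0.05 * np.ones(l)                            # perturbation bounds
eps = 0.1                                          # epsilon-insensitive width

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l, nonneg=True)
xihat = cp.Variable(l, nonneg=True)
cons = []
for i in range(l):
    pred = Xbar[i] @ w + b
    cons += [pred - y[i] + eta[i] * cp.norm(w, 2) <= eps + xi[i],
             y[i] - pred + eta[i] * cp.norm(w, 2) <= eps + xihat[i]]
cp.Problem(cp.Minimize(cp.sum_squares(w) + C @ (xi + xihat)), cons).solve()
```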
3.3. Nonlinear robust regression

We introduce a mapping $\Phi$ and map the data into some other Euclidean space where we intend to perform a linear regression. The resulting optimization problem is
$$\begin{array}{ll} \underset{\omega, b, \xi, \hat{\xi}}{\text{minimize}} & f(\omega) + g(C, \xi, \hat{\xi}) \\ \text{subject to} & (\langle \omega \cdot \Phi(\bar{x}_i) \rangle + b) - y_i + \delta_i \|\omega\|_H \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle \omega \cdot \Phi(\bar{x}_i) \rangle + b) + \delta_i \|\omega\|_H \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Again, writing
$$\omega = \sum_{i=1}^{l} \beta_i \Phi(\bar{x}_i),$$
and using a kernel function $k$ such that $k(x, y) = \langle \Phi(x) \cdot \Phi(y) \rangle_H$, for every pair of vectors $(x, y)$ which belong to the input space, we will have
$$\langle \omega \cdot \Phi(\bar{x}_i) \rangle_H = \sum_{j=1}^{l} \beta_j k(\bar{x}_j, \bar{x}_i).$$
Then the constraints become
$$\left(\sum_{j=1}^{l} \beta_j k(\bar{x}_j, \bar{x}_i) + b\right) - y_i + \delta_i \|\omega\|_H \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\},$$
$$y_i - \left(\sum_{j=1}^{l} \beta_j k(\bar{x}_j, \bar{x}_i) + b\right) + \delta_i \|\omega\|_H \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}.$$
Also, by setting $p = 2$, we can bypass the use of $\Phi$ in the computation of the expression $\|\omega\|_H$. This quantity becomes equal to $\sqrt{\beta^{t} K \beta}$, where $K$ is a positive semi-definite matrix such that for every $i$ and $j$ in $\{1,\dots,l\}$ we have $K_{ij} = k(\bar{x}_i, \bar{x}_j)$. Then, as the matrix $K$ is real, symmetric and positive semi-definite, we are able to perform a factorization of $K$ such that there exists a matrix $R$ with $K = R^{t} R$. Now we may write $\|\omega\|_H = \|F(\beta)\|_2$, where $F$ is the linear map $\beta \mapsto R\beta$. Then, the formulation of our optimization problem is
$$\begin{array}{ll} \underset{\beta, b, \xi, \hat{\xi}}{\text{minimize}} & \|R\beta\|_2^2 + g(C, \xi, \hat{\xi}) \\ \text{subject to} & (K_i \beta + b) - y_i + \delta_i \|R\beta\|_2 \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (K_i \beta + b) + \delta_i \|R\beta\|_2 \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \end{array}$$
where $K_i$ is the $1 \times l$ vector corresponding to the $i$th row of the kernel matrix $K$. By introducing an auxiliary variable $a$ and using a linear formulation for the penalty function $g$, we transform the previous optimization problem into an SOCP problem:
$$\begin{array}{ll} \underset{a, \beta, b, \xi, \hat{\xi}}{\text{minimize}} & a + C^{t}\xi + C^{t}\hat{\xi} \\ \text{subject to} & \|R\beta\|_2 \leq a, \\ & \|\delta_i R\beta\|_2 \leq \varepsilon + \xi_i + y_i - (K_i\beta + b), \quad \forall i \in \{1,\dots,l\}, \\ & \|\delta_i R\beta\|_2 \leq \varepsilon + \hat{\xi}_i - y_i + (K_i\beta + b), \quad \forall i \in \{1,\dots,l\}, \\ & 0 \leq \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & 0 \leq \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \end{array}$$
which is the SOCP formulation of the nonlinear robust SVR problem. In the case of a quadratic penalty function $g$, we use the same trick as in Section 2.4 by introducing the auxiliary variable $a$. In this case, introducing two auxiliary variables $c$ and $\hat{c}$, and setting $D = \mathrm{diag}(\sqrt{C})$, we obtain
$$\begin{array}{ll} \underset{a, \beta, b, c, \hat{c}, \xi, \hat{\xi}}{\text{minimize}} & a + c + \hat{c} \\ \text{subject to} & \|R\beta\|_2 \leq a, \\ & \|D\xi\|_2 \leq c, \\ & \|D\hat{\xi}\|_2 \leq \hat{c}, \\ & \|\delta_i R\beta\|_2 \leq \varepsilon + \xi_i + y_i - (K_i\beta + b), \quad \forall i \in \{1,\dots,l\}, \\ & \|\delta_i R\beta\|_2 \leq \varepsilon + \hat{\xi}_i - y_i + (K_i\beta + b), \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Now, given an optimal solution $\beta^*$, $b^*$, $\xi^*$ and $\hat{\xi}^*$, the regression function $h$ is defined for $x$ in $\mathbb{R}^n$ by
$$h(\beta^*, b^*, x) = \sum_{j=1}^{l} \beta_j^* k(\bar{x}_j, x) + b^*.$$
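As a small hedged illustration (not from the paper), the resulting regression function could be evaluated as follows once $\beta^*$ and $b^*$ have been obtained from the SOCP above, assuming an RBF kernel; `Xbar`, `beta_star` and `b_star` are placeholders for quantities produced by a solved problem.

```python
# Hedged sketch: evaluating h(beta*, b*, x) = sum_j beta*_j k(xbar_j, x) + b*
# for an RBF kernel; Xbar, beta_star and b_star stand for quantities obtained
# by solving the SOCP above and are illustrative placeholders here.
import numpy as np

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def predict(x, Xbar, beta_star, b_star, sigma=1.0):
    return sum(beta_star[j] * rbf(Xbar[j], x, sigma)
               for j in range(len(beta_star))) + b_star
```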
3.4. Linear programming robust regression formulations

In this case we wish to perform robust SVR using an LP formulation. The formulation (3) can be transformed into an LP problem by using the $L_1$ or $L_\infty$ norm as a measure of distance in $\mathbb{R}^n$. If we use the $L_1$ norm ($p = 1$, with dual norm $L_\infty$), this problem becomes
$$\begin{array}{ll} \underset{w, b, \xi, \hat{\xi}}{\text{minimize}} & \|w\|_\infty + \langle C \cdot (\xi + \hat{\xi}) \rangle \\ \text{subject to} & (\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i \|w\|_\infty \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i \|w\|_\infty \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Again, we introduce an auxiliary variable $a$ equal to $\|w\|_\infty$ and add the constraints $0 \leq a$, $w_k \leq a$ and $-w_k \leq a$, for all $k$ in $\{1,\dots,n\}$. The resulting optimization problem is
$$\begin{array}{ll} \underset{a, w, b, \xi, \hat{\xi}}{\text{minimize}} & a + \langle C \cdot (\xi + \hat{\xi}) \rangle \\ \text{subject to} & (\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i a \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i a \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & a \geq w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a \geq -w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a \geq 0. \end{array}$$
If we use the $L_\infty$ norm instead ($p = \infty$, with dual norm $L_1$), problem (3) becomes
$$\begin{array}{ll} \underset{w, b, \xi, \hat{\xi}}{\text{minimize}} & \|w\|_1 + \langle C \cdot (\xi + \hat{\xi}) \rangle \\ \text{subject to} & (\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i \|w\|_1 \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i \|w\|_1 \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}. \end{array}$$
Next, we introduce an auxiliary vector $a \in \mathbb{R}^n$ where, for all $k$ in $\{1,\dots,n\}$, $a_k$ equals $|w_k|$, and add the constraints $0 \leq a_k$, $w_k \leq a_k$ and $-w_k \leq a_k$, for all $k$ in $\{1,\dots,n\}$. The resulting optimization problem is
$$\begin{array}{ll} \underset{a, w, b, \xi, \hat{\xi}}{\text{minimize}} & \displaystyle\sum_{k=1}^{n} a_k + \langle C \cdot (\xi + \hat{\xi}) \rangle \\ \text{subject to} & (\langle w \cdot \bar{x}_i \rangle + b) - y_i + \eta_i \displaystyle\sum_{k=1}^{n} a_k \leq \varepsilon + \xi_i, \quad \forall i \in \{1,\dots,l\}, \\ & y_i - (\langle w \cdot \bar{x}_i \rangle + b) + \eta_i \displaystyle\sum_{k=1}^{n} a_k \leq \varepsilon + \hat{\xi}_i, \quad \forall i \in \{1,\dots,l\}, \\ & \xi_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & \hat{\xi}_i \geq 0, \quad \forall i \in \{1,\dots,l\}, \\ & a_k \geq w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a_k \geq -w_k, \quad \forall k \in \{1,\dots,n\}, \\ & a_k \geq 0, \quad \forall k \in \{1,\dots,n\}. \end{array}$$
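As a hedged illustration of the LP assembly (not from the paper), the second formulation above ($p = \infty$, dual norm $L_1$) can be written in standard inequality form and passed to a generic LP solver; the variable ordering, data and parameters below are assumptions made for this sketch.

```python
# Hedged sketch: assembling the p = infinity (dual L_1) robust regression LP
# and solving it with scipy.optimize.linprog. The variable vector is
# z = [a (n), w (n), b, xi (l), xihat (l)]; Xbar, y, C, eta and eps are
# illustrative placeholders.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
Xbar = rng.uniform(-3.0, 3.0, size=(20, 2))
y = Xbar @ np.array([0.5, -1.0]) + 0.3
l, n = Xbar.shape
C = 10.0 * np.ones(l)
eta = 0.05 * np.ones(l)
eps = 0.1

nz = n + n + 1 + l + l                       # total number of LP variables
c = np.concatenate([np.ones(n), np.zeros(n + 1), C, C])   # sum(a) + C.(xi + xihat)

rows, rhs = [], []
for i in range(l):
    up = np.zeros(nz)                        # (w.xbar_i + b) - y_i + eta_i*sum(a) <= eps + xi_i
    up[:n] = eta[i]; up[n:2*n] = Xbar[i]; up[2*n] = 1.0; up[2*n + 1 + i] = -1.0
    rows.append(up); rhs.append(eps + y[i])
    lo = np.zeros(nz)                        # y_i - (w.xbar_i + b) + eta_i*sum(a) <= eps + xihat_i
    lo[:n] = eta[i]; lo[n:2*n] = -Xbar[i]; lo[2*n] = -1.0; lo[2*n + 1 + l + i] = -1.0
    rows.append(lo); rhs.append(eps - y[i])
for k in range(n):                           # w_k <= a_k and -w_k <= a_k
    for s in (1.0, -1.0):
        row = np.zeros(nz); row[k] = -1.0; row[n + k] = s
        rows.append(row); rhs.append(0.0)

bounds = [(0, None)] * n + [(None, None)] * (n + 1) + [(0, None)] * (2 * l)
res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(rhs), bounds=bounds)
a_opt, w_opt, b_opt = res.x[:n], res.x[n:2*n], res.x[2*n]
```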
4. Computational results

We have developed MATLAB code for our models. The SOCP problems related to our models have been solved with both the SeDuMi Matlab toolbox, version 1.05 [16], and the SDPT3 Matlab toolbox, version 3.02 [17].
4.1. Robust classification results

We start by testing our robust techniques on a simple synthetic example. We train our robust SVMs on the dataset shown in Fig. 1. In order to simplify, we consider that the upper bounds of the perturbations on each training vector all have the same value, denoted by $\eta$. We assume that $C = 10$, and we choose a kernel based on the radial basis function (RBF) with $\sigma = 1$. Using a 1-norm penalty function, we obtain the results shown in Fig. 2. While increasing the upper bound of the uncertainty, $\eta$, we observe that the evolution of the quantity $a^{-1}$, which can be considered as a generalized margin in the case where we have data uncertainties (see the previous SOCP problems), goes through four phases:
• The initial phase ($\eta = 0$): this is the precise case. The value of $a^{-1}$ (denoted $\rho_0$) is a reference for further comparisons with other robust classifiers.
• The linear phase ($0 < \eta \leq 0.1147$): $a^{-1}$ decreases linearly with $\eta$, with a discontinuity at zero. Therefore, for $0 < \eta \leq 0.0027$, the value of $a^{-1}$ is greater than $\rho_0$. We deduce that a small uncertainty gives better maximum margin classifiers than the one we obtain in the precise case. Between 0.0027 and 0.1147,
Fig. 1. From left to right, plots of the classification lines for respectively η = 0.00025, η = 0.1147 and η = 0.1477. The points of class −1 are the circles; the points of class +1 are the squares.

Fig. 2. Plot of $a^{-1}$ as a function of η (on the left, the circles) with the plot of the percentage of misclassified training points of class −1 (on the right, the crosses).
due to increasing uncertainties on the position of the training points, $a^{-1}$ keeps decreasing linearly toward an absolute minimum value (denoted $\rho_{\min}$). From Fig. 1, we observe that, in this phase, the classification line does not change significantly. Also, no training points of class −1 (in Fig. 2) or +1 (not represented) are misclassified.
• The nonlinear phase ($0.1147 < \eta \leq 0.1477$): $a^{-1}$ increases continuously toward a maximum value (denoted $\rho_{\max}$). In this phase it appears that we have an increasing proportion of misclassified training points. In Fig. 1, we observe that the surface containing the points of class −1 and delimited by a closed classification line is shrinking. The frontier of the "minus 1" surface is repelled, and after a certain $\eta$, this surface is suddenly divided into small bubbles which also shrink as $\eta$ increases.
• The final phase ($0.1477 < \eta$): the SOCP problem is no longer feasible, so we cannot obtain further results. No points are classified, and 0.1477 represents an upper limit for $\eta$: when the norm of the perturbations on the input is greater than this value, we cannot train a support vector machine.
We suspect that we can deduce the value of $\eta$ separating the third and fourth phases directly from the dataset itself without involving massive experiments, but this is still under investigation.
Finally, when we divide our dataset into training and testing subsets, we remark that the variations of misclassified training and testing points of class −1 (or +1) are correlated. Actually, this is exactly the kind of behavior we expected from such a classifier. Because a separating hyperplane shatters the feature space into two portions, testing and training points of the same class are gathered together. Therefore, when the separating hyperplane moves in the feature space due to increasing perturbations, we should expect the numbers of misclassified testing and training points of one particular class to increase together in a correlated way.
The following results are for the breastcancer dataset from the UCI database [18] and for different realizations of $\eta$. We took $C = 10$, an RBF kernel with $\sigma = 1$, and we used a 1-norm penalty function. We again considered that the upper bound for the norm of the perturbations is the same for every input (and referred to as $\eta$). For each experiment with a different $\eta$, we trained our SVMs 30 times, by considering 30 random partitions of the data into training and testing points. The proportion of training to testing points is 1 to 9 for the different values of $\eta$. For each experiment, we have computed the mean value of the 30 consecutive results. Fig. 3 displays the results we obtained.
Fig. 3. Breastcancer dataset. On the left: plot of $a^{-1}$ as a function of η. On the right: plots of the percentages of misclassified training and testing points of class −1 and class +1.
We still observe the four phases we previously described.
• Initial phase ($\eta = 0$): the value of $a^{-1}$ appears to be slightly lower than the corresponding value of $a^{-1}$ for $\eta$ slightly greater than zero.
• Linear phase ($0 < \eta \leq 0.1945$): the value of $a^{-1}$ decreases linearly toward an absolute minimum value. The percentages of misclassified training and testing points are stable (around 18% for testing points of class −1 and around 1% for testing points of class +1).
• Nonlinear phase ($0.1945 < \eta \leq 0.286$): the value of $a^{-1}$ increases in an irregular manner. The percentages of misclassified training and testing points of class −1 increase rapidly while the percentages of misclassified training and testing points of class +1 decrease toward zero. We are witnessing the shifting of the separating hyperplane in the feature space.
• Final phase ($0.286 < \eta$): the SOCP problem is no longer feasible, so we cannot obtain further results. No points are classified, and 0.286 represents an upper limit for $\eta$: when the norm of the perturbations on the input is greater than this value, we cannot train a support vector machine.
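As a hedged outline (not code from the paper, whose experiments were run in MATLAB with SeDuMi/SDPT3), the experimental protocol above can be organized as a sweep over η; `train_robust_svm` is a hypothetical wrapper around the SOCP sketch of Section 2.4, and the early exit models the final, infeasible phase.

```python
# Hedged outline of the eta-sweep experiment: for each value of eta, solve the
# robust classification SOCP, record the generalized margin 1/a and the
# per-class misclassification rates. `train_robust_svm` is a hypothetical
# wrapper (not defined in the paper) returning None when the SOCP is infeasible.
import numpy as np

def sweep(Xbar, y, etas, C=10.0, sigma=1.0):
    records = []
    for eta in etas:
        model = train_robust_svm(Xbar, y, eta=eta, C=C, sigma=sigma)  # hypothetical
        if model is None:                     # final phase: SOCP infeasible
            break
        yhat = model.predict(Xbar)
        records.append({
            "eta": eta,
            "inverse_a": 1.0 / model.a,       # generalized margin a^{-1}
            "err_neg": float(np.mean(yhat[y == -1] != -1)),
            "err_pos": float(np.mean(yhat[y == +1] != +1)),
        })
    return records
```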
4.2. Robust regression results

Next, we consider a classic example: we train our robust SVM to fit the sinc function with different levels of uncertainty. The interval [−3, 3] has been discretized into 30 equal segments. Each node of the sub-intervals is a one-dimensional training vector (see Fig. 4). We train support vector machines using these training points with the following parameters: $C = 10$, $\varepsilon = 0.01$. The small value of $\varepsilon$ guarantees a good approximation of the outputs of the SVM in the precise case. Therefore, the mean square error (MSE) between the real and the predicted outputs will be relatively "small" for the precise case, and the MSE of the precise case will play the role of a "threshold". We consider that the upper bounds of the perturbations on each training vector all have the same value, denoted by $\eta$.
Fig. 4. On the left: the sinc function and its discretization; the inputs are the circles, the outputs the crosses. On the right: MSE as a function of η.
We then study the behavior of the MSE as $\eta$ increases and, if for a certain $\eta > 0$ the MSE is bigger than the MSE of the precise case, we deduce that the outputs of the SVM are not good enough to fit correctly the outputs of the sinc function. In our example, the chosen kernel is based on the radial basis function (RBF) with $\sigma = 1$. Using a 1-norm penalty function, we obtain the results shown in Fig. 4.

The results show that the evolution of the MSE has two phases. In the first phase, $0 \leq \eta \leq 0.008$, the MSE drops as the bound of the perturbation increases, so the trained robust support vector machines give better results than the classical SVMs. When $\eta = 0.008$, we remark on the plot that the resulting regressor is almost perfect in the interval [−3, 3]. Then, for $0.008 \leq \eta \leq 0.016$, the resulting MSE is still below the MSE of the precise case. So, for $\eta \leq 0.016$, we obtain better regressors than the ones we obtain with the classical non-robust support vector machines. Between 0.016 and 0.08, the outputs are not good enough, but the MSE increases regularly and continuously. For $0.08 \leq \eta \leq 0.216$, the MSE increases discontinuously but regularly toward a limit value. At this stage, the bound of the perturbations is bigger than half the distance between two nodes. When $\eta \geq 0.216$, the plots show that the resulting regressors are close to a linear function. Results show that when $\eta$ is over 0.216, the value of $a^{-1}$ increases by steps toward infinity. Therefore the coefficients $\beta_i$ of the expansion of $\omega$ are very close to zero, and the resulting regressor is close to a line. Fig. 5 shows the evolution of the value of $a^{-1}$. The profile of the plot of $a^{-1}$ is very much like a continuous succession of affine functions, and the plot of the approximation of the second derivative of $a^{-1}$ is very
Fig. 5. On the left: plot of $a^{-1}$ as a function of η. On the right: plot of the approximate second derivative of $a^{-1}$.
Fig. 6. From left to right, curves for respectively η = 0.008, η = 0.016, η = 0.08 and η = 0.23.
much like the plot of the null function. We also notice a discontinuity at zero, as in the classification case. For the moment, we assert that $a^{-1}$ as a function of $\eta$ is a piecewise affine function. Note that this function admits a minimum value, and at that minimum the MSE between the predicted and the real output is minimum. We also trained support vector machines for other values of $C$ and $\varepsilon$ and obtained similar results (see the curves in Fig. 6).
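For reference, a hedged sketch of the experimental setup described above; `train_robust_svr` is a hypothetical wrapper around the robust SVR SOCP of Section 3.3, and the exact sinc convention and number of nodes are assumptions.

```python
# Hedged sketch of the sinc experiment: the interval [-3, 3] is discretized
# into 30 equal segments (31 nodes assumed, endpoints included), the nodes
# serve as one-dimensional training inputs, and the MSE between the SVR
# prediction and the true outputs is recorded for each eta. Note that numpy's
# np.sinc is the normalized sinc sin(pi x)/(pi x); the paper's convention may
# differ. `train_robust_svr` is a hypothetical wrapper, not defined here.
import numpy as np

x_train = np.linspace(-3.0, 3.0, 31)[:, None]   # 30 equal segments -> 31 nodes
y_train = np.sinc(x_train[:, 0])

def mse_for(eta, C=10.0, eps=0.01, sigma=1.0):
    model = train_robust_svr(x_train, y_train, eta=eta, C=C, eps=eps, sigma=sigma)
    y_pred = model.predict(x_train)
    return float(np.mean((y_pred - y_train) ** 2))
```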
5. Conclusion

For both the classification and the regression case we have obtained new robust programming formulations which give, for relatively small perturbations, robust support vector machines. For nonlinear classification or regression, we have derived second order cone programming problems which can be solved using primal-dual interior point methods. These formulations are also very flexible: one can freely choose the penalty function, and the formulations accommodate mixed precise and uncertain data.

Within a certain range of perturbations, $a^{-1}$ decreases linearly toward an absolute minimum value as the upper bound of the perturbations, $\eta$, increases. During this phase, the separating hyperplane in the feature space is roughly the same as in the precise case. Then, after a threshold value of $\eta$, this linear behavior ceases and the value of $a^{-1}$ fluctuates irregularly. In this phase, the separating hyperplane in the feature space moves without predictable behavior as $\eta$ increases, and we witness an increasing number of misclassified points. After a maximum value of $\eta$, the programming problem is no longer feasible. The linear decrease of the value of $a^{-1}$ is a very helpful tool for predicting the values of the coefficients of the support vectors. Unfortunately, we are not able to estimate directly from the dataset parameters such as the minimum value of $a^{-1}$ and the corresponding value of $\eta$, the rate of the linear decrease, and the maximum admissible value of $\eta$. Nevertheless, we think it is possible to estimate these parameters (and therefore construct robust support vector classifiers without solving an SOCP problem) directly from the dataset itself. We also do not know much about the behavior of the separating hyperplane after the minimum value of $a^{-1}$ is reached.

In the regression case, our results showed that there exists a particular, strictly positive value of $\eta$ for which the MSE between the predicted and the real outputs is minimum. Moreover, within a certain range of uncertainty, this MSE is less than the corresponding MSE of the precise case. A problem for future research is to develop techniques that provide the level of noise that gives the best generalization for a particular data set.
Acknowledgement

The present work has been partially supported by the NSF grant ECS-0099378.
References

[1] K.P. Bennett, C. Campbell, Support vector machines: Hype or hallelujah? SIGKDD Explorations 2 (2) (2000) 1–6.
[2] B. Schölkopf, Support Vector Learning, R. Oldenbourg Verlag, Munich, Germany, 1997.
[3] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[4] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[5] A. Ben-Tal, A. Nemirovski, Robust solutions to uncertain linear programs via convex programming, Operations Research Letters 25 (1) (1996) 1–17.
[6] A. Ben-Tal, A. Nemirovski, Robust convex optimization, Mathematics of Operations Research 23 (4) (1998) 769–805.
[7] L.E. Ghaoui, H. Lebret, Robust solutions to least-square problems with uncertain data, SIAM Journal on Matrix Analysis and Applications 18 (1997) 1035–1064.
[8] T.B. Trafalis, S.A. Alwazzi, Robust optimization in support vector machine training with bounded errors, in: Proceedings of the International Joint Conference on Neural Networks, 2003, pp. 2039–2042.
[9] T.B. Trafalis, S.A. Alwazzi, Robust support vector regression and applications, in: C.H. Dagli, A.L. Buczak, J. Ghosh, M.J. Embrechts, O. Ersoy, S.W. Kercel (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, vol. 13, ASME Press, 2003, pp. 181–186.
[10] S. Boyd, M.S. Lobo, L. Vandenberghe, Applications of second-order cone programming, Linear Algebra and its Applications 284 (1998) 193–226.
[11] C. Bishop, Training with noise is equivalent to Tikhonov regularization, Neural Computation 7 (1995) 108–116.
[12] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[13] A.J. Smola, Learning with kernels, Ph.D. thesis, Technical University Berlin, 1998.
[14] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, United Kingdom, 2000.
[15] A. Smola, B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14 (3) (2004) 199–222 (invited paper).
[16] J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optimization Methods and Software 11–12 (1999) 625–653.
[17] K.C. Toh, M.J. Todd, R.H. Tutuncu, SDPT3 – a Matlab software package for semidefinite programming, version 2.1, Optimization Methods and Software 11 (1999) 545–581.
[18] C. Blake, C. Merz, UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/mlrepository.html], University of California, Irvine, Department of Information and Computer Sciences, 1998.