Online algorithm based on support vectors for orthogonal regression


Roberto C.S.N.P. Souza, Saul C. Leite, Carlos C.H. Borges, Raul Fonseca Neto
Computer Science Department, Federal University of Juiz de Fora, Juiz de Fora, MG, Brazil

Article history: Received 26 November 2012; Available online 10 May 2013; Communicated by G. Moser

Keywords: Support vector machines; Online algorithms; Kernel methods; Regression problem; Orthogonal regression

Abstract: In this paper, we introduce a new online algorithm for orthogonal regression. The method is constructed via a stochastic gradient descent approach combined with the idea of a tube loss function, similar to the one used in support vector (SV) regression. The algorithm can be used in primal or in dual variables. The latter formulation allows the introduction of kernels and soft margins. In addition, an incremental strategy algorithm is introduced, which can be used to find sparse solutions and also an approximation to the "minimal tube" containing the data. The algorithm is very simple to implement and avoids quadratic optimization.

1. Introduction

A regression problem consists of finding an unknown relationship between given points $x_i \in \mathbb{R}^d$ and their corresponding target values $y_i \in \mathbb{R}$. This problem is usually formulated as one of finding a function $f : \mathbb{R}^d \to \mathbb{R}$, which maps points to target values, that minimizes a certain loss function. Usually, it is assumed that noise is only present in the target values, and the loss function measures deviations of $f(x_i)$ from its corresponding value $y_i$. This is the case of the classical regression formulation proposed by Gauss, which minimizes the sum of the quadratic deviations between targets and the estimated function (Huber, 1972). Aiming to solve the classical regression problem, Vapnik developed a formulation based on a loss function called $\epsilon$-insensitive and introduced the concept of a tube (Vapnik, 1995). These new elements, based on the structural risk minimization principle, allowed the development of a specific support vector machine formulation for regression problems, called support vector (SV-)regression. This method has become quite popular due to its flexibility, especially with respect to the use of kernels (Smola and Schölkopf, 2002), which permits the construction of nonlinear solutions by only choosing an appropriate kernel function. In addition, the concept of a tube allows the representation of the final solution only in terms of the support vectors, which is crucial for kernel-based methods.

Support vector machines, in their standard form, are formulated as a quadratic optimization problem. This formulation can be computationally expensive in large problems or in problems where the regression is solved iteratively with a growing sample.

In classification and classical regression problems, these applications have motivated the development of a large literature on online algorithms that are able to obtain approximate solutions efficiently (e.g., Suykens and Vandewalle, 1999; Gentile, 2001; Li and Long, 2002; Kivinen et al., 2004; Shwartz et al., 2007; Crammer et al., 2006; Leite and Fonseca Neto, 2008, to name a few). Online algorithms are also an attractive alternative to the standard SV approach, since they are typically simple to understand and implement.

Consider now a regression problem where noise can be present not only in the target values but also in the data points $x_i$. This problem has a long history, dating back to 1877 with Adcock (1877) (see Markovsky and Huffel, 2007 for a historical overview). The problem appears in the literature under different names: it is called orthogonal regression, after its geometric interpretation (Ammann and Ness, 1988); it is also known as error-in-variables in the statistics community (Griliches and Ringstad, 1970; Gleser, 1981) and total least-squares in the numerical analysis field (Golub, 1973; Golub and Loan, 1980). For linear regression, Golub (1973) and Golub and Loan (1980) proposed a solution to this problem constructed using a singular value decomposition of the overdetermined regression system, which has now become a standard solution. Nowadays, orthogonal regression is motivated by several applications, for instance in audio (Hermus et al., 2005) and image (Luong et al., 2012; Luong et al., 2011; Hirakawa and Parks, 2006) processing (see Markovsky, 2010 for a more comprehensive list). Most of these problems involve nonlinear data. Therefore, orthogonal regression can benefit from kernel methods.

In this work, we present a new formulation for orthogonal regression which adapts the idea of the $\epsilon$-insensitive loss function introduced by Vapnik (1995). The problem is defined as the


minimization of the empirical risk with respect to a novel tube loss function for orthogonal regression. An algorithm to solve this problem is proposed based on the stochastic gradient descent approach, similar to Rosenblatt's Perceptron (Rosenblatt, 1958). The proposed method can be used in primal or in dual variables, making it flexible for different types of problems. In dual variables, the algorithm allows the introduction of kernels and margin flexibility. To the best of our knowledge, this is the first method that allows the introduction of kernels via the so-called "kernel trick" for orthogonal regression. In addition, an incremental strategy is introduced, similar to the one proposed in Leite and Fonseca Neto (2008), which can be used to find sparser solutions and also an approximation of the "minimal tube" containing the data. It is important to mention that this minimal tube cannot be found using the standard SV-regression approach. The strategy is based on successive calls to the proposed online algorithm with decreasing tube radii. As the radius decreases, the most outlying points are found, contributing to the sparsity of the solution. Also, this strategy can be useful for datasets corrupted with outliers, especially when combined with the margin flexibility of the online method. This treatment of outlying data is not so easily addressed in the classical SVD approaches.

The paper is structured as follows. First, Section 2 introduces the regression problem and different loss functions used for regression. In Section 3, the framework for online algorithms is presented and the online algorithm for orthogonal regression is derived in primal variables. This provides the basis of the proposed method. Section 4 presents the algorithm in dual variables. In Section 5, we introduce the incremental strategy to find sparse solutions and also to approximate the minimal tube containing the data. Furthermore, some numerical tests and results are shown in Section 6 to support the theory. Finally, Section 7 presents some conclusions and discussion.

2. The regression problem and loss functions

Let $X_m := \{x_i\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$, be the set of training points and let $Y_m := \{y_i\}_{i=1}^{m}$, with $y_i \in \mathbb{R}$, be the corresponding set of target values. Let $Z_m := \{(y, x) : y \in Y_m \text{ and } x \in X_m\}$ be the training set. The general problem of regression can be stated as follows: suppose that the pairs $z_i := (y_i, x_i) \in Z_m$ are independent samples of a random vector $Z := (Y, X)$, where $Y$ and $X$ are correlated and have an unknown joint distribution $P_Z$. Given $Z_m$, the problem is to find an unknown relationship between points and their respective targets, given by a function $f : \mathbb{R}^d \to \mathbb{R}$ over a certain class $C$ of functions, which minimizes the expected risk:

$$E_Z[\ell(Y, X, f)],$$

where the expectation is taken over the distribution $P_Z$ and $\ell : \mathbb{R} \times \mathbb{R}^d \times C \to \mathbb{R}$ is a loss function, which penalizes deviations between functional and target values. One approach in this case is to use $Z_m$ to estimate $P_Z$; however, this generally turns out to be a more challenging task than the original problem. Therefore, it is common to consider the reduced problem of finding a function $f \in C$ which minimizes the empirical risk given the training set $Z_m$, that is:

$$R_{emp}[f; Z_m] := \frac{1}{m}\sum_{i=1}^{m} \ell(y_i, x_i, f).$$

Since we are interested in applying the so-called "kernel trick" later on, we restrict our class of functions $C$ to linear functions of the form $f_{(w,b)}(x) := \langle w, x \rangle + b$, where $w \in \mathbb{R}^d$ is the normal vector and $b \in \mathbb{R}$ is the bias term.

The most common choice for $\ell$ is the squared loss given by $\ell_2(y, x, f) := (y - f(x))^2$, which gives origin to least-square regression. The rationale behind this approach is to minimize the sum of squared residuals $\delta y_i := y_i - f(x_i)$ in such a way that $y_i = f(x_i) + \delta y_i$. The common assumption is that only the target values are random. However, some practical problems may have noisy training points $x_i$ as well. A natural generalization of the above procedure is to also minimize variation in the training points $x_i$, that is, minimize $\delta y_i^2 + \|\delta x_i\|^2$ in such a way that $y_i = f(x_i + \delta x_i) + \delta y_i$. This is commonly called total least-squares or orthogonal regression (Markovsky and Huffel, 2007). Geometrically, this problem minimizes the sum of squared orthogonal distances between the points $z_i := (y_i, x_i)$ and the hyperplane

$$\{(y, x) \in \mathbb{R}^{d+1} : y - \langle x, w \rangle = b\} \equiv \{z \in \mathbb{R}^{d+1} : \langle z, (1, -w) \rangle = b\},$$

as opposed to the least-square formulation, which minimizes the sum of the squared direct differences between functional values and target values, as shown in Fig. 1. In order to set the total least-square problem in the framework introduced at the beginning of the section, let us define the $p$-distance of a point $z \in \mathbb{R}^d$ to a hyperplane $H := \{z \in \mathbb{R}^d : \langle z, w \rangle + b = 0\}$ as $\mathrm{dist}_p(z, H) := \min_{x \in H} \|z - x\|_p$, where $\|x\|_p$ is the $p$-norm of a vector $x$. It is possible to write this distance as:

$$\mathrm{dist}_p(z, H) = \frac{|\langle z, w \rangle + b|}{\|w\|_q},$$

where $\|\cdot\|_q$ is the dual norm of $\|\cdot\|_p$, with $1/p + 1/q = 1$; see for instance Dax (2006). Therefore, the corresponding loss function for the total least-square formulation can be written as:

$$\ell_t(y, x, f_{(w,b)}) := \frac{(y - \langle w, x \rangle - b)^2}{\|(1, w)\|^2},$$

where $\|(1, w)\|^2 = 1 + \|w\|^2$, and the norm $\|\cdot\|$ with no lower index corresponds to the $\ell_2$ norm $\|\cdot\|_2$. Another common choice for the loss function, which is used in SV-regression, is called the $\epsilon$-insensitive (or $\epsilon$-tube) loss. It is given by:

$$\ell_\epsilon(y, x, f) := \max\{0,\ |y - f(x)| - \epsilon\},$$

where $\epsilon$ is interpreted as the radius of this tube. Therefore, this loss function penalizes solutions which leave training points outside of the tube. A favorable feature of this loss function is that it gives sparse solutions when formulated in dual variables. With respect to this loss function, let us introduce some notation and terminology which will be used later on. For each fixed $\epsilon > 0$, define the following set:

$$V(Z_m, \epsilon) := \{(w, b) \in \mathbb{R}^{d+1} : |y_i - \langle w, x_i \rangle - b| \le \epsilon,\ \forall (x_i, y_i) \in Z_m\},$$

which we call the version space. When this set is not empty, we say that the problem accepts a tube of size $\epsilon$, or an $\epsilon$-tube. In order to consider a tube loss function for the orthogonal regression problem, which is useful to maintain the sparsity of the dual solution, we propose the following loss function: for a $\rho > 0$, let

$$\ell_\rho(y, x, f_{(w,b)}) := \max\left\{0,\ \frac{|y - \langle w, x \rangle - b|}{\|(1, w)\|} - \rho\right\},$$

which we call the $\rho$-insensitive (or $\rho$-tube) loss. In a similar fashion to what was done for the $\ell_\epsilon$ loss function, define the following version space:

$$X(Z_m, \rho) := \{(w, b) \in \mathbb{R}^{d+1} : |y_i - \langle w, x_i \rangle - b| \le \rho\,\|(1, w)\|,\ \forall (x_i, y_i) \in Z_m\},$$

Fig. 1. Left: ordinary least-square fit. Right: total least-square fit.

for each $\rho > 0$. Again, we say that the problem accepts a $\rho$-tube if the version space is not empty. For each fixed $(w, b)$, notice that there is an interesting relationship between the orthogonal distances and the direct functional differences for each point $(y_i, x_i)$. For that, let $\epsilon_i := y_i - \langle w, x_i \rangle - b$ and $\rho_i := (y_i - \langle w, x_i \rangle - b)/\|(1, w)\|$; then clearly $\epsilon_i = \rho_i\,\|(1, w)\|$.

3. Online algorithms for regression

In the online learning setting, one constructs the candidate functions $f \in C$ (usually called hypotheses) minimizing the empirical risk by examining one training example $(y_i, x_i)$ at a time. In this way, we start with an initial hypothesis $f_0$ and, at each iteration $t$, the algorithm examines one example and updates its current hypothesis $f_t$ according to some specific update rule. In order to derive this update rule, we follow the ideas of the Perceptron algorithm (Rosenblatt, 1958) and use a stochastic gradient descent approach. Considering the empirical risk defined in the previous section, let us define the following cost:

$$J(f) := \sum_{(y_i, x_i) \in Z_m} \ell(y_i, x_i, f),$$

which should be minimized with respect to $f$. Therefore, for each pair of points $(y_i, x_i)$, the following update rule is applied to the current hypothesis $f_t$:

$$f_{t+1} \leftarrow f_t - \eta\, \partial_f\, \ell(y_i, x_i, f_t), \qquad (1)$$

where $\eta > 0$ is usually called the learning rate and $\partial_f$ denotes the gradient of the loss function with respect to $f$. One important aspect of this approach is that, if $\ell(\cdot) \ge 0$, which is true for most loss functions, the above update only needs to take effect when $\ell(y_i, x_i, f_t) > 0$. Otherwise, the current hypothesis $f_t$ already achieves the minimum for the example $(y_i, x_i)$ and no change needs to be made, i.e., $f_{t+1} = f_t$. In this sense, loss functions that are based on the idea of a tube are well suited for this scheme, since an example will only affect the current hypothesis if it is outside of the tube.
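This update-on-violation scheme is easy to state in code. The fragment below is a minimal Python illustration, with names of our own choosing, of the two tube losses defined in Section 2 for a linear hypothesis; an online pass simply skips any example whose loss evaluates to zero.

```python
import numpy as np

def eps_loss(y, x, w, b, eps):
    # epsilon-insensitive loss of Section 2 for f(x) = <w, x> + b
    return max(0.0, abs(y - np.dot(w, x) - b) - eps)

def rho_loss(y, x, w, b, rho):
    # rho-insensitive loss: the direct residual scaled by ||(1, w)||, minus rho
    return max(0.0, abs(y - np.dot(w, x) - b) / np.sqrt(1.0 + np.dot(w, w)) - rho)
```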

3.1. Fixed Radius Perceptron (FRP)

In this section, the proposed online algorithm for orthogonal regression is presented. It is based on the online learning setting described in the previous section applied to the $\rho$-insensitive loss function. In order to do so, we begin by presenting an online algorithm for classical regression using the $\epsilon$-tube loss function, which is the same used in the SV-regression formulation. The derivation of this algorithm helps to introduce the main ideas before presenting the algorithm for orthogonal regression. It is also used for comparison in the experimental section.

Let us apply the ideas of the previous section to the loss function $\ell_\epsilon$, restricting our class of functions $C$ to linear functions $f_{(w,b)}$. Then the condition $\ell_\epsilon(\cdot) > 0$ to update the hypothesis $f_{(w_t,b_t)}$ after example $(y_i, x_i)$ becomes:

$$|y_i - \langle w_t, x_i \rangle - b_t| > \epsilon. \qquad (2)$$

For the update rule, the gradient in Eq. (1) is taken with respect to the parameters $(w, b)$ that compose the function $f_{(w,b)}$. Therefore we have:

$$w_{t+1} \leftarrow w_t + \eta\, \mathrm{sign}(y_i - \langle w_t, x_i \rangle - b_t)\, x_i, \qquad b_{t+1} \leftarrow b_t + \eta\, \mathrm{sign}(y_i - \langle w_t, x_i \rangle - b_t), \qquad (3)$$

where $\mathrm{sign}(x) := x/|x|$, for $x \in \mathbb{R} \setminus \{0\}$. We call this algorithm Fixed $\epsilon$-Radius Perceptron ($\epsilon$FRP). A similar algorithm has been presented in Kivinen et al. (2004), using a similar loss function; the algorithms are equivalent when the parameter $\nu$ used in Kivinen et al. (2004) is set to zero.

Since we are interested in orthogonal regression, we consider the $\rho$-tube loss function presented in Section 2. Following an analogous derivation, the condition to update the hypothesis after examining $(y_i, x_i)$ becomes:

$$\frac{|y_i - \langle w_t, x_i \rangle - b_t|}{\|(1, w_t)\|} > \rho.$$

The corresponding update rule has the following form:

$$w_{t+1} \leftarrow \lambda_t\, w_t + \eta\, \frac{\mathrm{sign}(y_i - \langle w_t, x_i \rangle - b_t)\, x_i}{\|(1, w_t)\|}, \qquad b_{t+1} \leftarrow b_t + \eta\, \frac{\mathrm{sign}(y_i - \langle w_t, x_i \rangle - b_t)}{\|(1, w_t)\|}, \qquad (4)$$

where $\lambda_t$ is given by

$$\lambda_t := 1 + \eta\, \frac{|y_i - \langle w_t, x_i \rangle - b_t|}{\|(1, w_t)\|^3}. \qquad (5)$$

We call the corresponding algorithm Fixed $\rho$-Radius Perceptron ($\rho$FRP). It is presented in detail in Algorithm 1.
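Algorithm 1 itself is not reproduced in this text, so the following Python sketch is our own rendering of the primal ρFRP loop built from Eqs. (4) and (5): it sweeps the sample, corrects (w, b) whenever a point falls outside the ρ-tube, and stops after a pass with no corrections (i.e., a point of the version space) or after a fixed number of passes.

```python
import numpy as np

def rho_frp_primal(X, y, rho, eta=0.01, max_epochs=1000):
    """Primal rho-FRP sketch: Eqs. (4)-(5) applied until a full pass needs no correction."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        corrections = 0
        for i in range(m):
            r = y[i] - np.dot(w, X[i]) - b           # direct residual
            norm = np.sqrt(1.0 + np.dot(w, w))       # ||(1, w)||
            if abs(r) / norm > rho:                  # point outside the rho-tube
                lam = 1.0 + eta * abs(r) / norm**3   # scaling factor, Eq. (5)
                w = lam * w + eta * np.sign(r) * X[i] / norm   # Eq. (4)
                b = b + eta * np.sign(r) / norm
                corrections += 1
        if corrections == 0:
            break
    return w, b
```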


4. Algorithm in dual variables

Suppose now that the training examples are in some abstract space $\mathcal{X}$. In addition, suppose that the functions $f \in C$ accept the following representation: $f = f_H + b$, for some $f_H \in \mathcal{H}$ and $b \in \mathbb{R}$, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) (e.g., Smola and Schölkopf, 2002). Let $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the associated inner product and reproducing kernel, respectively. Then, the reproducing property of $k$ implies that $k(x, \cdot) \in \mathcal{H}$ and, for any $f \in \mathcal{H}$, we have that $\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for all $x \in \mathcal{X}$. Another interesting property of an RKHS is that any $f \in \mathcal{H}$ can be written as a linear combination of the $k(x, \cdot)$. This fact is very useful for learning algorithms, since we can write the hypothesis at iteration $t$ as:

$$f_t(x) = \sum_{i=1}^{m} \alpha_{t,i}\, k(x_i, x) + b_t, \qquad (6)$$

for some $\alpha_t := (\alpha_{t,1}, \ldots, \alpha_{t,m})' \in \mathbb{R}^m$, $b_t \in \mathbb{R}$, $x \in \mathcal{X}$ and $x_i \in X_m$. In this sense, we can define $w_t := \sum_{i=1}^{m} \alpha_{t,i}\, k(x_i, \cdot)$ and interpret the function $f_t$, given in Eq. (6), as:

$$f_t(x) = \langle w_t, k(x, \cdot) \rangle_{\mathcal{H}} + b_t, \qquad (7)$$

by the reproducing property of $k$. Let $\|\cdot\|_{\mathcal{H}}$ be the norm induced by the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, i.e. $\|f\|_{\mathcal{H}}^2 := \langle f, f \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. Then, the norm of $w_t$ can be written as:

$$\|w_t\|_{\mathcal{H}}^2 := \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_{t,i}\, \alpha_{t,j}\, k(x_i, x_j),$$

by the reproducing property. Usually, in practice, the above construction of the function class $C$ is established by choosing a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which intuitively measures similarities between points in $\mathcal{X}$. If this function $k$ satisfies Mercer's condition (e.g. Smola and Schölkopf, 2002), it can be shown that there is a corresponding reproducing kernel Hilbert space $\mathcal{H}$ that has $k$ as its associated kernel. When $\mathcal{X} = \mathbb{R}^d$, one possible choice for $k$ is the inner product $\langle \cdot, \cdot \rangle$ of $\mathbb{R}^d$. This leads us to the linear representation of $f$ used in the previous sections.

In this sense, given the above representation of $w_t$ as the linear combination $\sum_{i=1}^{m} \alpha_{t,i}\, k(x_i, \cdot)$, we can derive the update rule for the $\epsilon$FRP in the dual variables $\alpha_t$ by examining the update rule given by Eq. (3). For a mistake in the example $(y_i, x_i)$, the update rule for $w_t$ becomes:

$$\sum_{j=1}^{m} \alpha_{t+1,j}\, k(x_j, \cdot) \leftarrow \sum_{j=1}^{m} \alpha_{t,j}\, k(x_j, \cdot) + \eta\, \mathrm{sign}(y_i - f_t(x_i))\, k(x_i, \cdot),$$

which implies the following update rule for the dual variable $\alpha_t$:

$$\alpha_{t+1,i} \leftarrow \alpha_{t,i} + \eta\, \mathrm{sign}(y_i - f_t(x_i)). \qquad (8)$$

The update rule for the $\rho$FRP in dual variables is constructed in a similar fashion. Following the update rule given in Eq. (4), notice that $w_t$ is scaled by a factor of $\lambda_t$, given by Eq. (5). In dual variables, this corresponds to scaling the vector $\alpha_t$ by the same factor $\lambda_t$ before the associated component of $\alpha_t$ is updated. Hence, if the $i$th example $(y_i, x_i)$ is outside of the $\rho$-tube, the update is done as follows: first $\alpha_t$ is scaled by $\lambda_t$, then the following update is performed:

$$\alpha_{t+1,i} \leftarrow \alpha_{t,i} + \eta\, \frac{\mathrm{sign}(y_i - f_t(x_i))}{\|(1, w_t)\|}, \qquad (9)$$

where we define $\|(1, w_t)\| := \sqrt{1 + \|w_t\|_{\mathcal{H}}^2}$.

In order to improve the computational efficiency of the $\rho$FRP, it is possible to update the functional values $f_t(x_j)$, for $j = 1, \ldots, m$, and the norm $\|w_t\|_{\mathcal{H}}$ from their previous values after each update. First, suppose that an update at iteration $t$ was made for a mistake in the $i$th pattern $(y_i, x_i)$. Then, by examining the update rule for $\alpha_{t+1}$ and the expression of $f_t(\cdot)$ given by Eq. (7), we can calculate $f_{t+1}(x_j)$ from its previous values as follows:

$$f_{t+1}(x_j) = \lambda_t \langle w_t, k(x_j, \cdot) \rangle_{\mathcal{H}} + \eta\, \frac{\mathrm{sign}(y_i - f_t(x_i))}{\|(1, w_t)\|}\, k(x_i, x_j) + b_{t+1} = \lambda_t\, (f_t(x_j) - b_t) + \eta\, \frac{\mathrm{sign}(y_i - f_t(x_i))}{\|(1, w_t)\|}\, k(x_i, x_j) + b_{t+1}. \qquad (10)$$

An analogous derivation can be carried out for the $\epsilon$FRP. The norm $\|w\|_{\mathcal{H}}^2$ can also be computed after each update. If an update occurs for a mistake in the $i$th example, the norm $\|w_{t+1}\|_{\mathcal{H}}$ can be computed as:

$$\|w_{t+1}\|_{\mathcal{H}}^2 = \lambda_t^2\, \langle w_t, w_t \rangle_{\mathcal{H}} + \frac{2\eta\lambda_t\, \mathrm{sign}(y_i - f_t(x_i))}{\|(1, w_t)\|}\, \langle w_t, k(x_i, \cdot) \rangle_{\mathcal{H}} + \frac{\eta^2\, k(x_i, x_i)}{\|(1, w_t)\|^2} = \lambda_t^2\, \|w_t\|_{\mathcal{H}}^2 + \frac{2\eta\lambda_t\, \mathrm{sign}(y_i - f_t(x_i))}{\|(1, w_t)\|}\, (f_t(x_i) - b_t) + \frac{\eta^2\, k(x_i, x_i)}{\|(1, w_t)\|^2}. \qquad (11)$$

In this way, vector–matrix multiplications are avoided, improving the computational efficiency of the method. The $\rho$FRP algorithm in dual variables is depicted in Algorithm 2.
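Algorithm 2 is likewise not reproduced here; the sketch below, with our own naming and assuming a precomputed kernel matrix, shows one pass of the dual ρFRP using the cached quantities of Eqs. (9)-(11), so that no vector-matrix product is needed at update time.

```python
import numpy as np

def rho_frp_dual_epoch(K, y, alpha, b, f, sqnorm, rho, eta):
    """One pass of the dual rho-FRP over a kernel matrix K (K[i, j] = k(x_i, x_j)).

    alpha: dual coefficients; f: cached functional values f_t(x_j);
    sqnorm: cached ||w_t||_H^2.  Implements the updates of Eqs. (9)-(11).
    """
    m = len(y)
    for i in range(m):
        norm1w = np.sqrt(1.0 + sqnorm)            # ||(1, w_t)||
        r = y[i] - f[i]                           # direct residual y_i - f_t(x_i)
        if abs(r) / norm1w > rho:                 # outside the rho-tube
            lam = 1.0 + eta * abs(r) / norm1w**3  # Eq. (5)
            s = np.sign(r)
            # Eq. (11): norm update from cached quantities
            sqnorm = (lam**2 * sqnorm
                      + 2.0 * eta * lam * s * (f[i] - b) / norm1w
                      + eta**2 * K[i, i] / norm1w**2)
            # Eq. (10): cached functional values; the new bias follows the primal
            # rule b_{t+1} = b_t + eta * sign / ||(1, w_t)||
            b_new = b + eta * s / norm1w
            f = lam * (f - b) + eta * s * K[:, i] / norm1w + b_new
            b = b_new
            # Eq. (9): scale alpha by lam, then correct the i-th component
            alpha = lam * alpha
            alpha[i] += eta * s / norm1w
    return alpha, b, f, sqnorm
```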

4.1. Margin flexibility

Noisy experimental data frequently lead to problems where outliers are present. In such cases, an accurate representation of the training set may result in a hypothesis with poor generalization performance. Thus, a mechanism to weigh the trade-off between an accurate representation of the outlying data and the generalization capability of the hypothesis is crucial. In the setting of tube loss functions, a common approach is to consider soft margins,


where the most outlying training points are allowed to cross the tube boundaries. An early formulation of soft margins was proposed for a linear programming scheme in Bennett and Mangasarian (1992). A few years later, Vapnik (1995) adapted the concept of soft margins to support vector machines. In this approach, slack variables are added to the problem's constraints to allow margin violation. These slack variables are then penalized in the cost function and a parameter $C$ is introduced as a measure of the trade-off between the magnitude of margin violations and the accurate representation of the training data. Another common approach to soft margins is to simply add a constant $\lambda_{diag} > 0$ to the diagonal of the kernel matrix (Smola and Schölkopf, 2002):

$$\tilde{K} := K + \lambda_{diag}\, I,$$

where the kernel matrix is defined as $K \in \mathbb{R}^{m \times m}$ with components $K_{ij} := k(x_i, x_j)$, for $x_i, x_j \in X_m$. It can be shown that this approach is equivalent to the introduction of the slack variables in the SVM formulation, when they are penalized quadratically (Smola and Schölkopf, 2002). In fact, it is possible to establish a direct relationship between this constant $\lambda_{diag}$ and the parameter $C$, which is given by $\lambda_{diag} = 1/(2C)$ (see for instance Campbell, 2002). It is important to mention here that this modification of the kernel matrix is used only in the training algorithm; it should not be used for testing.

In order to consider margin flexibility for the proposed method, we follow a similar idea and add a constant $\lambda_{diag}$ to the diagonal of the kernel matrix. It is interesting to see the effect of this constant $\lambda_{diag}$ on the $\epsilon$FRP algorithm. For the modified kernel matrix, notice that the condition for a point $(y_i, x_i)$ to be inside of the tube at iteration $t$ can be written as:

$$-\epsilon \le y_i - \sum_{j=1}^{m} \alpha_{t,j}\, \tilde{K}_{ij} - b_t \le \epsilon,$$

or equivalently, using the definition of $\tilde{K}$:

$$-\epsilon + \alpha_{t,i}\, \lambda_{diag} \le y_i - \sum_{j=1}^{m} \alpha_{t,j}\, K_{ij} - b_t \le \epsilon + \alpha_{t,i}\, \lambda_{diag}.$$

Notice that $\alpha_{t,i}$ will generally have the same sign as $y_i - f_t(x_i)$, by the update rule given by Eq. (8). The only case in which this might not be true is when $(y_i, x_i)$ changed sides with respect to the hyperplane after some updates. This implies that the desired margin flexibility is achieved by adding a slack, given by $\xi_i := \alpha_{t,i}\, \lambda_{diag}$, to the problem's constraints. With a similar analysis, one obtains an analogous relaxation of the constraints for the $\rho$FRP algorithm:

$$-\rho\, \|(1, w_t)\| + \alpha_{t,i}\, \lambda_{diag} \le y_i - \sum_{j=1}^{m} \alpha_{t,j}\, K_{ij} - b_t \le \rho\, \|(1, w_t)\| + \alpha_{t,i}\, \lambda_{diag}.$$

Again, the values $\xi_i := \alpha_{t,i}\, \lambda_{diag}$ can be interpreted as slacks added to the problem's constraints. In this sense, the addition of $\lambda_{diag}$ to the diagonal of the kernel matrix is only a device used to introduce these slacks into the problem. Therefore, the modified kernel matrix should only be used to test whether a point violates the tube; it should not be used to calculate the norm $\|(1, w_t)\|$.

4.2. Regularization

As observed in Markovsky and Huffel (2007), orthogonal regression is intrinsically a problem that promotes deregularization of the candidate solutions. In the algorithm proposed here, this effect can be observed in the fact that the scaling parameter $\lambda_t$, used in the update rule, is always strictly greater than one. Thus, the norm $\|w_t\|$ has a tendency to grow large as the algorithm iterates. This is natural, since by increasing the norm, the algorithm forces training points to enter the tube. When the kernel function has a limited representation capability, this deregularizing effect does not produce undesirable results, since the norm growth is limited by the capability of the kernel to represent the data. However, this effect can become a problem in cases where the kernel function has unlimited or large representation capability, as is the case of the Gaussian kernel. In these cases, the deregularization can become a problem and the final solution tends to overfit the training data. One solution to control the norm growth in such cases is to minimize a regularized empirical risk (e.g., Kivinen et al., 2004; Herbrich, 2002), which is given by:

$$R_{reg}[f; Z_m] := R_{emp}[f; Z_m] + \beta\, \Omega(f),$$

where $\Omega$ is called the regularizer, which penalizes the complexity of the solution $f$, and $\beta > 0$ is the regularization parameter, which controls the amount of regularization performed. The most common choice of regularizer, which we adopt here, is $\Omega(f) := \frac{1}{2}\|w\|_{\mathcal{H}}^2$. In this way, regularization penalizes norm growth and, therefore, this scheme should be used with caution, so that it does not interfere with this natural aspect of the orthogonal regression problem. Using the regularized empirical risk, the update rule for a point $(y_i, x_i)$ that violates the tube is the same as the previous update rule, given by Eq. (9), except for the premultiplication by the parameter $\lambda_t$, which is now given by:

$$\lambda_t := 1 + \eta \left( \frac{|y_i - \langle w_t, x_i \rangle - b_t|}{\|(1, w_t)\|^3} - \beta \right).$$

In this way, the parameter $\beta$ compensates the first term inside the parentheses, contributing to the control of the scale parameter $\lambda_t$. In addition, the regularized update is also applied to points $(y_i, x_i)$ which are examined but do not violate the tube (the same approach is taken in Kivinen et al. (2004)). The update rule for these points consists in scaling the values $\alpha_t$ by the factor $\tilde{\lambda}_t := 1 - \eta\beta$. Notice that, after this update, one can update $f_{t+1}(x_j)$ and $\|w_{t+1}\|$ following a derivation analogous to the one that led to Eqs. (10) and (11).
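The two devices above fit in a few lines of code. The fragment below (hypothetical names, a sketch rather than the paper's Algorithm 2 extension) tests tube violations against the kernel matrix with λ_diag added to its diagonal while keeping the original kernel for ‖(1, w_t)‖, and computes the regularized scale factors λ_t and λ̃_t = 1 − ηβ.

```python
import numpy as np

def violates_tube(i, K, y, alpha, b, sqnorm, rho, lam_diag):
    """Soft-margin tube test of Section 4.1.

    The kernel matrix with lam_diag added to its diagonal is used only here;
    predictions, residuals for the update and ||(1, w_t)|| keep the original K.
    """
    r_test = y[i] - (np.dot(K[i], alpha) + lam_diag * alpha[i]) - b
    return abs(r_test) > rho * np.sqrt(1.0 + sqnorm)

def regularized_scales(r, sqnorm, eta, beta):
    """Regularized scale factors of Section 4.2.

    r is the residual y_i - f_t(x_i) computed with the unmodified kernel.
    """
    norm1w = np.sqrt(1.0 + sqnorm)
    lam = 1.0 + eta * (abs(r) / norm1w**3 - beta)  # applied when the tube is violated
    lam_tilde = 1.0 - eta * beta                   # plain decay for non-violating points
    return lam, lam_tilde
```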

5. Incremental strategy

In this section, we consider an incremental strategy, based on the one introduced in Leite and Fonseca Neto (2008), which can be useful to obtain sparse solutions and to find an approximation to the minimal tube containing the data. For this, we restrict the discussion in this section to the $\rho$FRP algorithm, although the same arguments extend directly to the $\epsilon$FRP. Given a training set $Z_m$ and a fixed constant $\rho$, the $\rho$FRP algorithm can be seen as one that finds a point $(w, b)$ inside the version space $X(Z_m, \rho)$. Suppose that one is able to construct a tube radius $\tilde{\rho}$, with $\tilde{\rho} < \rho$, from a solution $(w, b) \in X(Z_m, \rho)$ in such a way that the new version space $X(Z_m, \tilde{\rho})$ is not empty. Then the $\rho$FRP algorithm can be used to find a sequence of strictly decreasing tube radii $\rho_0, \rho_1, \ldots, \rho_n$ such that the corresponding version spaces are not empty.

One application of such a strategy of constructing a sequence of decreasing tube radii is to identify support vectors, or the points that are the most outlying among the training set. Suppose, for instance, that a radius $\rho_f$ is desired for a given problem. Then one can proceed in the following fashion: first one chooses a large $\rho_0$ and progressively decreases this radius up to a final radius $\rho_n$ such that $\rho_n \le \rho_f$. This way, as the radius shrinks, only the most outlying training points will have an effect on the hypothesis, contributing to the sparsity of the solution. In addition, notice that this strategy is useful when combined with the soft margins of Section 4.1 in order to deal with datasets corrupted with outliers.

In the case where soft margins are not desired, one can use the above strategy to approximate the minimal tube containing the data, that is:

$$\rho^* := \inf\{\rho : X(Z_m, \rho) \neq \emptyset\}.$$

This can be done by iteratively producing new radii $\rho_n$ up to a final iteration $N$ where $\rho_N \approx \rho^*$. In order to construct a new radius value $\rho_{n+1}$ from the previous one $\rho_n$, suppose that the $\rho$FRP finds a solution in $X(Z_m, \rho_n)$, namely $(w_n, b_n)$. Define the corresponding positive and negative radii as:

$$\rho_n^{+} = \max_i \left( \frac{y_i - \langle w_n, x_i \rangle - b_n}{\|(1, w_n)\|} \right), \qquad \rho_n^{-} = \max_i \left( \frac{\langle w_n, x_i \rangle + b_n - y_i}{\|(1, w_n)\|} \right). \qquad (12)$$

A desirable feature for the final solution $\rho^*$ is to have the positive and negative radii balanced. This way, we can update the radius by setting:

$$\rho_{n+1} = \frac{\rho_n^{+} + \rho_n^{-}}{2}. \qquad (13)$$

Notice that this new radius always yields a feasible solution, since

$$-\rho_n^{-} \le \frac{y_i - \langle w_n, x_i \rangle - b_n}{\|(1, w_n)\|} \le \rho_n^{+} \quad \forall i, \qquad (14)$$

and by adding $(\rho_n^{-} - \rho_n^{+})/2$ to the inequality, a new solution is obtained by only changing the bias term.

However, in some cases, it is possible to have $\rho_n^{+} \approx \rho_n^{-}$ and the new radius would not be much different from the previous value. In order to circumvent this effect, we use the following update rule for the radius:

$$\rho_{n+1} = \min\left\{ \frac{\rho_n^{+} + \rho_n^{-}}{2},\ \left(1 - \frac{\delta}{2}\right)\rho_n \right\},$$

where a new parameter $\delta$ was introduced. It turns out that, by adopting this rule, we might end up with an empty version space and the $\rho$FRP will not converge. Hence we stipulate a maximum number of iterations $T$ for the $\rho$FRP to converge. If convergence is not achieved within $T$ iterations, the algorithm returns the solution of the last solved problem. Therefore, the value of $\delta$ should be chosen carefully in order not to interfere with the incremental process. In the case that a final radius $\rho_f$ is given and we are only looking for a sparser solution, the choice of $\delta$ must be such that $\delta \le 2(1 - \rho_f/\rho_n)$. If an approximation to the minimal tube containing the data is desired, then this parameter can be chosen according to the quality of the expected approximation. That is, if an $\alpha$-approximation of the minimum radius is desired (i.e. the final radius is less than $(1 + \alpha)\rho^*$, with $\alpha \in (0, 1)$), then $\delta$ should be set to the value of $\alpha$. In order to see this, suppose that we have a solution $(w_n, b_n) \in X(Z_m, \rho_n)$, for some $n \ge 1$, and that a new radius is constructed as $\rho_{n+1} = (1 - \alpha/2)\rho_n$. Suppose that this new radius is such that $\rho_{n+1} < \rho^*$. Then, the $\rho$FRP will not converge and the last feasible solution found in $X(Z_m, \rho_n)$ is returned. This final solution has radius $\rho_n$, which satisfies $\rho_n = \frac{\rho_{n+1}}{1 - \alpha/2} < \frac{\rho^*}{1 - \alpha/2} < (1 + \alpha)\rho^*$.

Finally, it is important to mention that each final $\rho$FRP solution is used as a starting point for the next $\rho$FRP problem. This strategy has the advantage that the $\rho$FRP needs to make fewer corrections to satisfy the new radius. In addition, for the first $\rho$FRP, we set the initial bias term to $b_0 = \frac{1}{m}\sum_{i=1}^{m} y_i$ in order to obtain better sparsity results, since this gives an initial hypothesis centered on the data, avoiding unnecessary updates. Algorithm 3 presents the incremental strategy algorithm (ISA) used to find the minimal tube containing the data. We also illustrate the incremental strategy process in Fig. 2.
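Algorithm 3 is not listed in the text; the following is a minimal sketch of the radius schedule it describes, where rho_frp stands for any routine that returns a feasible (w, b) for the requested radius or None when it fails to converge within T iterations — the names and the failure convention are our own assumptions.

```python
import numpy as np

def next_radius(X, y, w, b, rho_n, delta):
    """Radius schedule of the ISA: balance the tube (Eqs. (12)-(13)) and force shrinkage."""
    norm = np.sqrt(1.0 + np.dot(w, w))
    r = (y - X @ w - b) / norm                     # signed orthogonal residuals
    rho_pos, rho_neg = r.max(), (-r).max()         # Eq. (12)
    return min((rho_pos + rho_neg) / 2.0,          # Eq. (13): balanced radius
               (1.0 - delta / 2.0) * rho_n)        # guarantee a minimum decrease

def isa(X, y, rho0, delta, rho_frp):
    """Call rho_frp with decreasing radii; keep the last feasible solution."""
    rho, best = rho0, None
    while True:
        fit = rho_frp(X, y, rho)                   # assumed to return (w, b) or None on failure
        if fit is None:                            # radius below the minimal one: stop
            return best, rho
        best = fit
        rho = next_radius(X, y, fit[0], fit[1], rho, delta)
```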

5.1. Sorting the data

In view of the incremental strategy presented in the previous section, and considering that the most outlying data are gradually discovered as the radii $\rho_n$ shrink, we can take advantage of this process and consider first only the most distant points during the $\rho$FRP main loop. We do this with a very simple sorting scheme. The algorithm sorts the data as the $\rho$FRP iterates, according to how frequently each point promotes an update. This way, the $\rho$FRP can consider a reduced set of only the first $s$ points before considering the rest of the data. The variable $s$ is calculated iteratively as the algorithm progresses: initially, its value is set to zero, and it progressively increases as points that update the solution are found. The parameter $s$ is passed among the several calls to the $\rho$FRP performed by the incremental strategy algorithm. In addition, an index array $idx$ is defined to control the data ordering. This array is initialized with $idx \leftarrow \{1, \ldots, m\}$ and is passed among the consecutive calls to the $\rho$FRP made by the ISA. The procedure is presented in Algorithm 4.
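Algorithm 4 is not listed in the text; the fragment below sketches one plausible, simplified way (our own) to keep recently updating points at the front of the index array so that later passes examine them first.

```python
def promote(idx, s, pos):
    """Move the point at position pos of idx into the leading block of the first s indices."""
    if pos >= s:
        idx[s], idx[pos] = idx[pos], idx[s]   # swap it to the end of the active block
        s += 1
    return idx, s
```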

Fig. 2. Incremental strategy algorithm process after a growing number of iterations.

6. Experimental results

This section was constructed aiming to cover the various technical features of the proposed method. In this sense, the section is divided into five parts, as follows. In Section 6.1, an interesting aspect of orthogonal regression models is verified for the proposed method. Section 6.2 presents the results related to the application of the $\epsilon$FRP and $\rho$FRP combined with the ISA in order to find sparse solutions; we also present comparisons with Joachims' SVM-light (Joachims, 1999) and discuss the quality of the solution under varying degrees of noise in the variables. In Section 6.3, we present results regarding the use of the ISA to find an approximation to the minimal tube containing the data. Section 6.4 has the results concerning the use of regularization in the $\rho$FRP method. Finally, Section 6.5 presents the results using real-world datasets. For the FRP algorithms, we always use the sorting algorithm presented in Section 5.1. Regularization was only used in Sections 6.4 and 6.5.

6.1. Treating variables symmetrically

An interesting technical feature of orthogonal regression models is that they treat the variables of the problem symmetrically (Ammann and Ness, 1988). Such a procedure can be very useful when the problem actually does not have independent and dependent variables, and they should instead be treated equally. This first toy example shows that the proposed method presents the behavior discussed above. Fig. 3 shows three regression lines. The solid line corresponds to the orthogonal regression using the $\rho$FRP. The dashed line is the standard regression using the $\epsilon$FRP to predict $Y$ (as dependent variable) from $X$ (as independent variable). The dash-dot line is the standard regression using the $\epsilon$FRP with the dependent and independent variables swapped, that is, the regression predicts $X$ from $Y$. Also, Fig. 3 depicts the lines of the respective distances that are measured from a typical point. It is important to mention that, if we swap the variables and run the orthogonal regression with the $\rho$FRP, the same solid line is obtained.

Fig. 3. Orthogonal regression ($\rho$FRP) and standard regression ($\epsilon$FRP).

6.2. Sparsity


In this group of experiments, we applied the $\rho$FRP and $\epsilon$FRP algorithms combined with the incremental strategy algorithm (ISA) to different datasets. As discussed in Section 5, the ISA is very useful to obtain sparse solutions, since the process finds the points that contribute to the solution gradually. In order to observe this, we also tested both methods without the ISA. The results of SVM-light (Joachims, 1999) are also presented for comparison.

The training sets were generated from a chosen function and polluted with different noise intensities in both variables. This is done in order to compare the orthogonal regression ($\rho$FRP) with the classical regression ($\epsilon$FRP and SVM). For the testing set, we generated a new dataset using the same function, with different points distributed over the same input range, with no noise added. The testing set has twice the number of points of the training set. The sets are described below for each chosen function. The Linear1 training set has 51 examples generated from the function $y = 2x + 0.1$, where $x \in [-5, 5]$; both variables are polluted with Gaussian noise with standard deviation $\sigma_x = \sigma_y = 0.2$. The Linear2 training set has the same 51 points of the previous function, but we modified the noise intensities to $\sigma_x = 0.04$ and $\sigma_y = 0.4$. The Exp1 training set has 51 examples of the function $y = e^{-x^2}$, where $x \in [-1, 1]$, and Gaussian noise is introduced with $\sigma_x = \sigma_y = 0.1$. The Poly3 training set has 61 examples of the function $y = x^3$, where $x \in [-3, 3]$, and both variables are polluted with Gaussian noise with $\sigma_x = \sigma_y = 0.2$.

In order to compare the solutions, we present the following data obtained from the experiments: the number of support vectors (sv), shown to observe sparsity; the orthogonal ($\rho$) and vertical ($\epsilon$) tube radii; and the norm of the solution (norm $= \|(1, w)\|$). For the FRP methods, we also present the total number of iterations (t) and the total number of updates (up) performed by the algorithms. In order to measure the quality of the fit, we use two error measures. The first is the root mean squared error (RMSE). This criterion takes the square root of the mean squared error, where the error accounts for the direct difference between the estimated function and the target values. We also propose a second error measure, based on the orthogonal fit, which we call the geometric mean squared error, given by

$$\mathrm{gMSE} := \frac{1}{m} \sum_{i=1}^{m} \frac{(y_i - \langle w, x_i \rangle - b)^2}{\|(1, w)\|^2},$$

and we also take its square root, deriving the RgMSE criterion.

The tests presented in this section were performed in the following way. First, we calculated the range of the target values $r := \max_{i=1,\ldots,m} y_i - \min_{i=1,\ldots,m} y_i$ in the training set. Then, we set the value of $\epsilon$ to $0.1\,r$ for the $\epsilon$FRP and SVM algorithms; this value was also used as a stopping criterion for the $\epsilon$FRPISA. In order to compare the results, we calculated the respective $\rho$ obtained for the $\epsilon$FRP solution, through the equation $\epsilon = \rho\,\|(1, w)\|$, and used it to run the $\rho$FRP and $\rho$FRPISA algorithms. The capacity control parameter $C$ was set to $C/m = 10$, where $m$ is the number of training examples, as suggested in Smola and Schölkopf (2002). Also, we set the learning rate to $\eta = 0.01$. For the Linear1 and Linear2 datasets, the linear kernel $k(x_i, x_j) := \langle x_i, x_j \rangle$ was used. For Exp1, we used a polynomial kernel $k(x_i, x_j) := (s\,\langle x_i, x_j \rangle + c)^d$ with $d = 2$, $s = 1$ and $c = 0$. A polynomial kernel with $d = 3$, $s = 1$ and $c = 0$ was also used for the Poly3 dataset. In each case, we saved the model and used it to perform the tests. The RMSE and RgMSE criteria are measured on the training and testing sets. The results are presented in Table 1.

First, notice that the ISA performs well with respect to maintaining the sparsity of the solution. In most of the cases, the FRP methods have fewer support vectors when combined with the ISA. The only exception is the $\rho$FRP on the Poly3 dataset, which could be explained by the fact that $\rho$FRPISA achieved a smaller radius for this training set. Also, notice that the number of support vectors is slightly higher but follows closely those presented by the SVM. Secondly, when using the ISA, the FRP algorithms show a higher number of iterations, as expected; however, they make a smaller number of updates. Considering the quality of the fit, the orthogonal regression methods present the best results. For dataset Exp1, the results are very similar for the RMSE measure; however, the $\epsilon$FRP and $\epsilon$FRPISA performed a large number of iterations and updates.
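For reference, the two criteria can be computed as below for a linear model (w, b); this is a simple sketch with our own naming, and in the kernel case the residuals and the norm ‖(1, w)‖_H would come from the dual representation instead.

```python
import numpy as np

def rmse(X, y, w, b):
    """Root mean squared error of the direct (vertical) residuals."""
    r = y - X @ w - b
    return np.sqrt(np.mean(r ** 2))

def rgmse(X, y, w, b):
    """Root geometric MSE: residuals measured orthogonally to the fitted hyperplane."""
    r = (y - X @ w - b) / np.sqrt(1.0 + np.dot(w, w))
    return np.sqrt(np.mean(r ** 2))
```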

Table 1. Results from orthogonal (ρFRP) and classical regression (εFRP and SVM).

Linear1
| Method | sv | ρ/ε/norm | t/up | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|
| ρFRP | 44 | 0.626/1.344/2.149 | 7/65 | 0.48068 | 0.22367 | 0.28514 | 0.13268 |
| ρFRPISA | 5 | 0.625/1.327/2.122 | 92/46 | 0.56054 | 0.26415 | 0.41071 | 0.19354 |
| εFRP | 37 | 0.626/1.952/3.120 | 4/50 | 0.88479 | 0.28359 | 0.81896 | 0.26249 |
| εFRPISA | 6 | 0.639/1.928/3.015 | 74/37 | 1.00106 | 0.33201 | 0.94416 | 0.31314 |
| SVM | 2 | 0.998/1.928/1.932 | -/- | 1.09079 | 0.56468 | 1.04105 | 0.53893 |

Linear2
| Method | sv | ρ/ε/norm | t/up | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|
| ρFRP | 45 | 0.624/1.302/2.087 | 4/61 | 0.60639 | 0.29061 | 0.49572 | 0.23757 |
| ρFRPISA | 4 | 0.579/1.221/2.105 | 86/43 | 0.56098 | 0.26645 | 0.43669 | 0.20742 |
| εFRP | 36 | 0.624/1.936/3.10 | 4/49 | 0.90116 | 0.29044 | 0.83506 | 0.26913 |
| εFRPISA | 4 | 0.584/1.828/3.131 | 76/38 | 0.88459 | 0.28254 | 0.81595 | 0.26061 |
| SVM | 2 | 0.935/1.828/1.955 | -/- | 0.99727 | 0.51003 | 0.93765 | 0.47954 |

Exp1
| Method | sv | ρ/ε/norm | t/up | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|
| ρFRP | 40 | 0.094/0.240/2.566 | 1105/5789 | 0.11911 | 0.04642 | 0.03528 | 0.01375 |
| ρFRPISA | 5 | 0.094/0.230/2.460 | 2580/3894 | 0.11343 | 0.04611 | 0.03103 | 0.01261 |
| εFRP | 41 | 0.094/0.098/1.043 | 26304/164166 | 0.11382 | 0.10912 | 0.04679 | 0.04486 |
| εFRPISA | 22 | 0.093/0.098/1.055 | 25727/130023 | 0.11272 | 0.10685 | 0.03840 | 0.03640 |
| SVM | 22 | 0.093/0.080/1.158 | -/- | 0.11289 | 0.09748 | 0.04033 | 0.03483 |

Poly3
| Method | sv | ρ/ε/norm | t/up | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|
| ρFRP | 5 | 4.179/5.777/1.382 | 65/113 | 1.94745 | 1.40884 | 0.53883 | 0.38980 |
| ρFRPISA | 7 | 3.950/5.770/1.461 | 79/196 | 1.89469 | 1.29713 | 0.77797 | 0.53260 |
| εFRP | 9 | 4.179/5.409/1.294 | 12062/44740 | 1.96528 | 1.51846 | 1.26403 | 0.97665 |
| εFRPISA | 5 | 4.214/5.370/1.274 | 237/574 | 2.09847 | 1.64681 | 1.61558 | 1.26785 |
| SVM | 2 | 4.068/5.370/1.320 | -/- | 2.27268 | 1.72150 | 2.00520 | 1.51889 |


Table 2. Results from orthogonal regression (ρFRP) and classical regression (εFRP and SVM) without allowing margin flexibility. The FRP methods were combined with the ISA to find the minimal tube containing the data.

Linear1
| Method | sv | ρ | ε | norm | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|---|
| ρFRPISA | 6 | 0.34818 | 0.78690 | 2.26007 | 0.43324 | 0.19169 | 0.06748 | 0.02986 |
| εFRPISA | 10 | 0.18559 | 0.79373 | 4.27679 | 0.45657 | 0.10676 | 0.13262 | 0.03101 |
| SVM | 2 | 0.34834 | 0.79373 | 2.27858 | 0.46253 | 0.20299 | 0.14460 | 0.06346 |

Exp1
| Method | sv | ρ | ε | norm | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|---|
| ρFRPISA | 4 | 0.22521 | 0.26368 | 1.17082 | 0.11425 | 0.09758 | 0.03105 | 0.02652 |
| εFRPISA | 4 | 0.24895 | 0.26670 | 1.07129 | 0.11544 | 0.10776 | 0.03078 | 0.02873 |
| SVM | 2 | 0.23021 | 0.26670 | 1.15852 | 0.11459 | 0.09891 | 0.03614 | 0.03119 |

Poly3
| Method | sv | ρ | ε | norm | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|---|---|
| ρFRPISA | 8 | 1.00253 | 5.35425 | 5.34074 | 2.14634 | 0.40188 | 1.88537 | 0.35302 |
| εFRPISA | 6 | 3.99057 | 5.21344 | 1.30644 | 1.84730 | 1.41399 | 0.90730 | 0.69449 |
| SVM | 2 | 3.92966 | 5.21344 | 1.32669 | 2.29019 | 1.72625 | 1.96983 | 1.48477 |

6.3. Minimal tube

In this section, we test the $\rho$FRP and $\epsilon$FRP combined with the ISA to obtain the minimal tube containing the data. In this case, we do not allow points to violate the tube. It is important to mention that this minimal tube cannot be found using the standard SV-regression approach. In order to perform the tests, we used three of the datasets presented in the previous section, namely Linear1, Exp1 and Poly3. To compare the results, we present the number of support vectors (sv), the orthogonal ($\rho$) and vertical ($\epsilon$) tube radii and the norm of the solution (norm $= \|(1, w)\|$). Also, we present the RMSE and RgMSE measured on the training and testing sets. The tests were performed in the following way: for $\rho$FRPISA and $\epsilon$FRPISA we set the learning rate to $\eta = 0.01$ and fixed the number of iterations to $T = 1000$. The value of $\epsilon$ obtained with $\epsilon$FRPISA was used to run SVM-light. The capacity control parameter $C$ was set to a large value in order to avoid margin flexibility in SVM-light. The kernel functions used were the same presented in the previous section. The results are presented in Table 2.

Since we are shrinking the tube radius, it is expected that the algorithms yield similar results. This can be noted for datasets Linear1 and Exp1, where the error measures are very close. For dataset Poly3, notice that $\epsilon$FRPISA had the best performance for the RMSE measure, while $\rho$FRPISA obtained the best results for RgMSE.

6.4. Introducing the regularization

As discussed in Section 4.2, orthogonal regression is intrinsically a process that promotes a deregularizing effect on the candidate solutions, and this might be a problem when kernel functions with a large representation capability are used. In this section, we test the $\rho$FRP combined with the ISA, using the Gaussian kernel

$$k(x_i, x_j) := \exp(-\gamma\, \|x_i - x_j\|^2) \qquad (15)$$

and the regularization presented in Section 4.2. The dataset used, named Sinc, was generated with 63 examples of the function $y = \mathrm{sinc}(x)$, where $\mathrm{sinc}(x) := \sin(x)/x$ and $x \in [-\pi, \pi]$. Gaussian noise was added to both the training points $x$ and the target values $y$, with standard deviation $\sigma_x = \sigma_y = 0.2$. The testing set was generated with twice the number of examples, in the same range and noise free. In order to compare the results, we present the same criteria of the previous section. Also, four different values of the regularization parameter $\beta$ were used: (i) $\beta = 0.001$, (ii) $\beta = 0.005$, (iii) $\beta = 0.01$ and (iv) $\beta = 0.02$. We call the regularized version of the algorithm $\rho$FRPISA-reg. The tests were performed in the following way: first, we ran $\rho$FRPISA and $\rho$FRPISA-reg with $T = 1000$ and $T = 5000$ iterations. The learning rate was set to $\eta = 0.01$. In order to run SVM-light, we chose the $\epsilon$ corresponding to the best result obtained with the regularized algorithm. The capacity control parameter was set to $C/m = 10$, as suggested in Smola and Schölkopf (2002). The Gaussian kernel parameter was set to $\gamma = 1$. The RMSE and RgMSE were computed for the training and testing sets. The results are presented in Table 3. In addition, Fig. 4 depicts the solutions for $T = 5000$ iterations.

Table 3. Results obtained with ρFRPISA, ρFRPISA-reg and SVM-light on the dataset Sinc. The FRP algorithms were set to run for 1000 and 5000 iterations.

Sinc (T = 1000)
| Method | sv/m | ρ/ε/norm | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|
| ρFRPISA | 14/63 | 0.18699/0.35855/1.91750 | 0.22058 | 0.11503 | 0.17869 | 0.09319 |
| ρFRPISA-reg (i) | 14/63 | 0.19485/0.35597/1.82689 | 0.21589 | 0.11817 | 0.16089 | 0.08806 |
| ρFRPISA-reg (ii) | 12/63 | 0.21893/0.35785/1.63453 | 0.20658 | 0.12638 | 0.11808 | 0.07224 |
| ρFRPISA-reg (iii) | 10/63 | 0.26349/0.37910/1.43873 | 0.20840 | 0.14485 | 0.11157 | 0.07755 |
| ρFRPISA-reg (iv) | 6/63 | 0.40338/0.49470/1.22638 | 0.25835 | 0.21066 | 0.16254 | 0.13254 |
| SVM-light | 6/63 | 0.26647/0.37910/1.42265 | 0.21025 | 0.14778 | 0.10802 | 0.07593 |

Sinc (T = 5000)
| Method | sv/m | ρ/ε/norm | Train RMSE | Train RgMSE | Test RMSE | Test RgMSE |
|---|---|---|---|---|---|---|
| ρFRPISA | 24/63 | 0.08280/0.32418/3.91510 | 0.21568 | 0.05509 | 0.18522 | 0.04731 |
| ρFRPISA-reg (i) | 21/63 | 0.11361/0.32843/2.89093 | 0.22022 | 0.07618 | 0.18170 | 0.06285 |
| ρFRPISA-reg (ii) | 17/63 | 0.16136/0.33366/2.06782 | 0.21277 | 0.10289 | 0.13467 | 0.06513 |
| ρFRPISA-reg (iii) | 9/63 | 0.23142/0.35414/1.53029 | 0.21209 | 0.13860 | 0.11383 | 0.07439 |
| ρFRPISA-reg (iv) | 6/63 | 0.40816/0.49540/1.21376 | 0.26224 | 0.21606 | 0.16730 | 0.13784 |
| SVM-light | 7/63 | 0.23039/0.35414/1.53709 | 0.21512 | 0.13995 | 0.11123 | 0.07236 |

Fig. 4. Top left: ρFRPISA. Top right: ρFRPISA-reg (iii). Bottom left: ρFRPISA-reg (iv). Bottom right: SVM-light.

First, notice that, with more iterations, the norm of the $\rho$FRPISA solution grows and the RgMSE error decreases. However, Fig. 4 shows that $\rho$FRPISA generates a somewhat rough curve. After the introduction of the regularization, the obtained solution is smoother. Also, it is interesting to notice that, by choosing a high penalty, the obtained solution tends to be very naive and does not meet the purpose, as illustrated for $\rho$FRPISA-reg (iv) in Fig. 4. In addition, notice that it is important to choose the regularization parameter $\beta$ carefully. From Table 3, we can observe that the regularized algorithm (i), which uses a small value of $\beta$, has a weakly penalized solution, which leads to results very similar to those of the algorithm without regularization. Conversely, the regularized algorithm (iv) has a high value of $\beta$; in this case the solution is highly penalized and the obtained results are not desirable. In this sense, we must choose a suitable value of $\beta$ which entails a good solution. In particular, for the problem of this section, the best results were obtained with choices (ii) and (iii) for the regularization, and this value may vary with the problem.

6.5. Real-world datasets

In this group of experiments, we applied the $\rho$FRP and $\epsilon$FRP algorithms combined with the ISA to three different real-world datasets: diabetes, DEE and quake. The datasets were obtained from the KEEL repository (Alcalá-Fdez et al., 2011). Dataset diabetes has 43 samples and 2 variables. Dataset DEE has 365 samples and 6 attributes. Dataset quake has 2178 examples and 3 variables. The input data were linearly normalized, but no normalization was performed on the target values. In all experiments we used the Gaussian kernel given by Eq. (15). In addition, we present the results obtained with SVM-light to compare solutions.

The tests were performed in the following way: first, we randomly sampled the data to construct the training and testing sets. Then, we defined the value of $\rho$ used as the stopping criterion to run $\rho$FRPISA. The corresponding $\epsilon$ obtained was used as the stopping criterion for $\epsilon$FRPISA. The final $\epsilon$ achieved with $\epsilon$FRPISA was used to execute SVM-light. The learning rate was set to $\eta = 0.01$ and the capacity control parameter was set to $C/m = 10$. The Gaussian kernel parameter was set to $\gamma = 1$. In all experiments we regularized $\rho$FRPISA, and for dataset quake we also regularized $\epsilon$FRPISA. In order to compare the results, we present the percentage of support vectors in the training set (%sv); the orthogonal ($\rho$) and classical ($\epsilon$) radii; the norm of the solution (norm); and the error criteria RMSE and RgMSE measured on the testing set (see Table 4).

Table 4. Results obtained with ρFRPISA, εFRPISA and SVM-light on real-world datasets.

Diabetes
| Method | %sv | ρ | ε | norm | RMSE | RgMSE |
|---|---|---|---|---|---|---|
| ρFRPISA | 2.564 | 1.8000 | 1.8015 | 1.0008 | 0.39744 | 0.39710 |
| εFRPISA | 5.128 | 1.7999 | 1.8000 | 1.0000 | 0.39686 | 0.39686 |
| SVM-light | 2.564 | 1.8000 | 1.8000 | 1.0000 | 0.41231 | 0.41231 |

DEE
| Method | %sv | ρ | ε | norm | RMSE | RgMSE |
|---|---|---|---|---|---|---|
| ρFRPISA | 0.912 | 1.9969 | 2.0730 | 1.0381 | 0.59430 | 0.57251 |
| εFRPISA | 0.912 | 2.0675 | 2.0716 | 1.0020 | 0.61005 | 0.60882 |
| SVM-light | 0.912 | 2.0488 | 2.0716 | 1.0112 | 0.74385 | 0.73564 |

Quake
| Method | %sv | ρ | ε | norm | RMSE | RgMSE |
|---|---|---|---|---|---|---|
| ρFRPISA | 0.561 | 0.5500 | 0.5501 | 1.0002 | 0.34201 | 0.34193 |
| εFRPISA | 0.255 | 0.5500 | 0.5500 | 1.0000 | 0.38669 | 0.38669 |
| SVM-light | 0.102 | 0.5500 | 0.5500 | 1.0000 | 0.41746 | 0.41746 |

First, notice that the obtained solutions are very sparse, due to the value established as the stopping criterion for $\rho$FRPISA. This also leads to a small value for the norm, since the stopping radius is not too narrow. Considering the quality of the solution, we can see that the algorithms obtained very similar results and that, in general, $\rho$FRPISA obtained the best results.

7. Conclusions

In this paper we introduced a new formulation for orthogonal regression based on an online training approach using stochastic gradient descent. The proposed method uses a new loss function, based on the $\epsilon$-insensitive loss, which enables the algorithm to use the support vector approach. When formulated in dual variables, the algorithm allows the introduction of kernels, via the kernel trick, and margin flexibility. For kernels with a large representation capability, a regularization scheme is presented in order to control the norm growth. The proposed algorithm is entirely Perceptron-based, which makes it simple to understand and implement. To the best of our knowledge, this is the first online algorithm for orthogonal regression with kernels. We also presented an incremental strategy algorithm that can be combined with the proposed method in order to find sparse solutions and also the minimal tube containing the data. The experimental results show that the proposed method compares well with the classical regression approach under different noise scenarios.

References

Adcock, R., 1877. Note on the method of least squares. Analyst 4, 183–184.

Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F., 2011. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287.

Ammann, L., Ness, J.V., 1988. A routine for converting regression algorithms into corresponding orthogonal regression algorithms. ACM Transactions on Mathematical Software 14, 76–87.

Bennett, K.P., Mangasarian, O.L., 1992. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1, 23–34.

Campbell, C., 2002. Kernel methods: a survey of current techniques. Neurocomputing 48, 63–84.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y., 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research 7, 551–585.

Dax, A., 2006. The distance between two convex sets. Linear Algebra and its Applications 416, 184–213.

Gentile, C., 2001. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research 2, 213–242.

Gleser, L., 1981. Estimation in a multivariate error in variables regression model: large sample results. Annals of Statistics 9 (1), 24–44.

Golub, G.H., 1973. Some modified matrix eigenvalue problems. SIAM Review 15, 318–334.

Golub, G.H., Loan, C.V., 1980. An analysis of the total least squares problem. SIAM Journal on Numerical Analysis 17, 883–893.

Griliches, Z., Ringstad, V., 1970. Error-in-the-variables bias in nonlinear contexts. Econometrica 38 (2), 368–370.

Herbrich, R., 2002. Learning Kernel Classifiers: Theory and Algorithms. The MIT Press.

Hermus, K., Verhelst, W., Lemmerling, P., Wambacq, P., Huffel, S.V., 2005. Perceptual audio modeling with exponentially damped sinusoids. Signal Processing 85 (1), 163–176.

Hirakawa, K., Parks, T.W., 2006. Image denoising using total least squares. IEEE Transactions on Image Processing 15, 2730–2742.

Huber, P.J., 1972. Robust statistics: a review. The Annals of Mathematical Statistics 43, 1041–1067.

Joachims, T., 1999. Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods. MIT Press, pp. 169–184.

Kivinen, J., Smola, A., Williamson, R., 2004. Online learning with kernels. IEEE Transactions on Signal Processing 52, 2165–2176.

Leite, S.C., Fonseca Neto, R., 2008. Incremental margin algorithm for large margin classifiers. Neurocomputing 71, 1550–1560.

Li, Y., Long, P.M., 2002. The relaxed online maximum margin algorithm. Machine Learning 46, 361–387.

Luong, H.Q., Goossens, B., Pizurica, A., Philips, W., 2011. Joint photometric and geometric image registration in the total least square sense. Pattern Recognition Letters 32, 2061–2067.

Luong, H.Q., Goossens, B., Pizurica, A., Philips, W., 2012. Total least square kernel regression. Journal of Visual Communication and Image Representation 23, 94–99.

Markovsky, I., 2010. Bibliography on total least squares and related methods. Statistics and Its Interface 3, 329–334.

Markovsky, I., Huffel, S.V., 2007. Overview of total least-square methods. Signal Processing 87, 2283–2303.

Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408.

Shwartz, S.S., Singer, Y., Srebro, N., 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: Proceedings of the 24th International Conference on Machine Learning. ACM, pp. 807–814.

Smola, A., Schölkopf, B., 2002. Learning with Kernels. MIT Press.

Suykens, J.A.K., Vandewalle, J., 1999. Least squares support vector machine classifiers. Neural Processing Letters 9, 293–300.

Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer.