Minimum deviation distribution machine for large scale regression

Accepted manuscript, to appear in Knowledge-Based Systems (2018). doi: 10.1016/j.knosys.2018.02.002
Received 23 March 2017; revised 26 January 2018; accepted 1 February 2018.

Ming-Zeng Liu a, Yuan-Hai Shao b,∗, Zhen Wang c, Chun-Na Li d, Wei-Jie Chen d

a School of Mathematics and Physics Science, Dalian University of Technology, Panjin, 124221, P.R. China
b School of Economics and Management, Hainan University, Haikou, 570228, P.R. China
c School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R. China
d Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, P.R. China

Abstract


In this paper, by introducing the statistics of the training data into support vector regression (SVR), we propose a minimum deviation distribution regression (MDR). Rather than minimizing only the structural risk, MDR also minimizes both the regression deviation mean and the regression deviation variance, which makes it able to cope with different distributions of boundary data and with noise. Minimizing these first- and second-order statistics leads to a strongly convex quadratic programming problem (QPP). An efficient dual coordinate descent algorithm is adopted for small scale problems, and an averaged stochastic gradient algorithm for large scale ones. Both theoretical analysis and experimental results illustrate the efficiency and effectiveness of the proposed method.


Keywords: regression, support vector machine, minimum deviation distribution machine, dual coordinate descent algorithm, stochastic gradient algorithm



∗ Corresponding author. Email address: [email protected] (Yuan-Hai Shao)


1. Introduction

Regression analysis [1, 2, 3, 4], a powerful statistical tool for estimating the relationships among variables, has been studied extensively. In general, there are two main kinds of statistical regressors. The first kind discovers the sample distribution through statistics [5, 6, 7, 8], e.g., first-order and second-order statistics, necessary condition analysis, and correlations. These methods, such as linear regression [5, 6, 7, 9, 10, 11] and least squares regression [12, 13, 14], attempt to find the best fitting function as a regressor, and they are mainly concerned with empirical risk minimization. The second kind focuses on minimizing the expected structural risk, where the regressor aims to predict well on unseen samples. Two popular predictive regressors are ridge regression [15] and support vector regression (SVR) [16, 17]. SVR constructs an ε-insensitive tube as the bounds of the regressor together with a flatness regularization term. SVR has two advantages: sparsity of the solutions and the flexibility to generalize easily to non-linear regression.

Recent research on SVR mainly concerns two aspects. One is the design of efficient algorithms for solving a quadratic programming problem (QPP) whose size is twice the number of training samples, such as Chunking [18], sequential minimal optimization (SMO) [19], least squares support vector machines (LSSVM) [12, 13], LIBSVM [20], Pegasos [21] and LIBLINEAR [22]. The other is to provide more comprehensive models for data with different statistical structures, such as modifications of the ν-support vector machine (Par-vSVM) [23], twin support vector regression (TSVR) [24], ε-twin support vector regression (ε-TSVR) [10], and parametric-insensitive nonparallel support vector regression (PINSVR) [25], which were proposed to capture data structure and boundary information more accurately.

Inspired by the theoretical result [26] that the margin distribution is important to the generalization performance of SVM, Zhou et al. [8, 27] presented the large margin distribution machine (LDM), which introduces statistical information into SVM by seeking the support hyperplanes with the largest statistical margin, keeping samples of the same class close to each other and samples of different classes far apart in the margin sense. More precisely, LDM maximizes the margin between the support hyperplanes together with the margin mean and minimizes the margin variance. The idea of margin distribution has received a great deal of attention [28, 29, 30, 31, 32, 33, 34, 35].


For regression, it is more reasonable to take the sample statistics into account in the deviation sense, so as to characterize the sample distribution precisely. Therefore, in this paper, we propose a minimum deviation distribution regression (MDR) by introducing the statistics of the deviation into SVR. However, the margin statistics strategy used in LDM cannot be applied to MDR directly, since the decision hyperplane in SVM or LDM is a separating function, whereas in MDR it is a fitting function. In contrast to the formulation of LDM, MDR minimizes the regression deviation mean together with the regression deviation variance. The regressor of MDR can be obtained by solving a QPP, as in SVR. To speed up the learning of MDR, an efficient dual coordinate descent algorithm and an averaged stochastic gradient descent (ASGD) algorithm are constructed for small scale and large scale problems, respectively. The main contributions of this paper are as follows:

i) The regression deviation mean and the regression deviation variance are defined to represent the first-order and second-order statistics in regression.
ii) MDR minimizes these two deviation statistics within SVR, and thus also follows the structural risk minimization principle.
iii) MDR is robust to different distributions of boundary data and to noise, due to the use of the regression deviation mean and the regression deviation variance.
iv) A dual coordinate descent algorithm and an averaged stochastic gradient algorithm are designed for solving small scale and large scale regression problems, respectively.
v) Experimental results on both artificial and benchmark data sets demonstrate the effectiveness and efficiency of the proposed method.

This paper is organized as follows. Section 2 introduces the basic notation and gives a brief review of SVR. Section 3 presents the details of MDR, including its formulation, algorithms and properties. Experimental results are reported in Section 4. Section 5 concludes the paper.

2. Preliminaries

Suppose we are given a training set S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ⊂ X × R, where x_i ∈ X is an input sample and y_i ∈ R is its response value. The goal of regression is to find a function f(x) to predict the output of an input x. In SVR [6], the goal is to find a regression function f(x) = (w̃ · φ(x)) + b, where w̃ is a weight vector, b is a bias, and φ(x) is a feature mapping of x induced by a kernel k(·, ·), i.e., k(x_i, x_j) = (φ(x_i) · φ(x_j)). In fact, by appending an additional dimension to each sample, x = [x, 1]^T and w = [w̃, b]^T, the function f(x) = (w̃ · φ(x)) + b can be expressed in its unbiased version f(x) = (w · φ(x)). In the following, we consider only the unbiased function.

In practice, we may want to allow some errors when f approximates the pairs (x_i, y_i) with precision ε > 0. Therefore, soft-margin SVR can be expressed as

$$\min_{w,\xi,\xi^*}\ \tfrac{1}{2}\,w^T w + C\sum_{i=1}^{m}(\xi_i+\xi_i^*)\quad\text{s.t.}\quad -\varepsilon-\xi_i \le y_i-(w\cdot\phi(x_i)) \le \varepsilon+\xi_i^*,\ \ \xi_i,\xi_i^*\ge 0,\ i=1,\dots,m, \qquad (1)$$

where C > 0 is a trade-off parameter, and ξ = [ξ_1, ..., ξ_m]^T and ξ^* = [ξ_1^*, ..., ξ_m^*]^T measure the losses of the samples. In general, one solves the dual problem of (1) to obtain the regressor. The dual QPP of (1) is formulated as

$$\min_{\alpha^{(*)}}\ \tfrac{1}{2}\sum_{i,j=1}^{m}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)(\phi(x_i)\cdot\phi(x_j)) + \varepsilon\sum_{i=1}^{m}(\alpha_i+\alpha_i^*) - \sum_{i=1}^{m}y_i(\alpha_i-\alpha_i^*)\quad\text{s.t.}\quad 0\le\alpha_i^{(*)}\le C,\ i=1,\dots,m, \qquad (2)$$

where α^{(*)} = [α_1, α_1^*, ..., α_m, α_m^*]^T is the vector of Lagrange multipliers.
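To make the role of the ε-insensitive loss in (1) concrete, the following minimal Python/NumPy sketch evaluates the soft-margin SVR objective for a given linear model w. It is an illustration only, not the implementation used in this paper (which is in MATLAB); the function names, the linear kernel and the toy data are assumptions.

import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    """epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

def svr_primal_objective(w, X, y, C=1.0, eps=0.1):
    """Value of the soft-margin SVR objective (1) for an unbiased linear model
    f(x) = w.x; the slacks xi, xi* are replaced by their optimal values,
    i.e. the eps-insensitive losses of the residuals."""
    losses = eps_insensitive_loss(y, X @ w, eps)
    return 0.5 * float(w @ w) + C * float(losses.sum())

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(svr_primal_objective(np.zeros(3), X, y))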


3. Minimum deviation distribution regression (MDR)

In this section, we first define two deviation statistics, then give the primal optimization problem of MDR, and finally present two solving algorithms together with the corresponding theoretical guarantees.


3.1. Definitions of deviation distributions

We consider the most straightforward first- and second-order statistics for characterizing the deviation distribution, that is, the mean and the variance of the deviation. For regression, let the regression deviation of a sample (x, y) be γ = y − f(x). From this definition it is easy to determine on which side of the regressor a sample lies: a sample (x_i, y_i) lies above the regressor when its regression deviation γ_i = y_i − f(x_i) is positive, and below it when γ_i is negative. We now define the deviation statistics in regression.

Definition 1. Regression deviation mean:

$$\bar{\gamma} = \frac{1}{m}\sum_{i=1}^{m}\gamma_i = \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - f(x_i)\bigr).$$

It can be seen that the regression deviation mean is the average difference between the predicted values and the real ones, which indicates the relative position of the data set and the regressor.

Proposition 1. Minimizing the least squares loss function ℓ(e) = e^T e = Σ_{i=1}^{m} (y_i − f(x_i))² in regression leads the expectation of the regression deviation to be zero, i.e., E(γ) = 0. Here e = [y_1 − f(x_1), ..., y_m − f(x_m)]^T.

From Proposition 1, it is evident that least squares regression implies the minimization of the regression deviation mean. Based on the definition of the regression deviation mean, it is natural to extend it to the following second-order statistic.

Definition 2. Regression deviation variance:

$$\hat{\gamma}^2 = \left(\frac{1}{m}\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{m}\bigl[y_i - f(x_i) - y_j + f(x_j)\bigr]^2}\,\right)^{2}.$$

As can be seen from this definition, the regression deviation variance quantifies the scatter of the regression deviations around their mean.
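As a concrete illustration of Definitions 1 and 2, the following Python/NumPy sketch computes the regression deviation mean and variance from a vector of predictions; it is an illustrative helper, not part of the released MATLAB code.

import numpy as np

def regression_deviation_stats(y, y_pred):
    """Regression deviation mean (Definition 1) and variance (Definition 2)."""
    gamma = np.asarray(y, float) - np.asarray(y_pred, float)  # deviations gamma_i
    m = gamma.size
    mean_dev = gamma.mean()                                   # Definition 1
    pairwise = gamma[:, None] - gamma[None, :]                # gamma_i - gamma_j
    var_dev = (np.sqrt((pairwise ** 2).sum()) / m) ** 2       # Definition 2
    return mean_dev, var_dev

# Closed-form check: var_dev equals 2 * gamma.var() (biased variance), since
# sum_ij (g_i - g_j)^2 = 2m * sum_i g_i^2 - 2 * (sum_i g_i)^2.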


3.2. Formulation of MDR

Formally, denote by X the matrix whose i-th column is φ(x_i), i.e., X = [φ(x_1), ..., φ(x_m)], let y = [y_1, ..., y_m]^T be a column vector, and let e stand for the all-one vector of appropriate dimension. When f(x) = (w · φ(x)), the regression deviation mean is

$$\bar{\gamma} = \frac{1}{m}\, e^T\bigl(y - X^T w\bigr),$$

and the regression deviation variance is

$$\hat{\gamma} = \frac{1}{m}\Bigl\{2\bigl[w^T X (mI - ee^T) X^T w - 2\, y^T (mI - ee^T) X^T w + y^T (mI - ee^T) y\bigr]\Bigr\}^{\frac{1}{2}}.$$

To obtain a good mean and variance of the regression deviation, we need to minimize the regression deviation mean and the regression deviation variance simultaneously. We first consider the feasible case, where f approximates all pairs (x_i, y_i) with precision ε. In this case, minimizing the regression deviation mean and the regression deviation variance leads to the following primal problem of the hard ε-tube MDR,

$$\min_{w}\ \tfrac{1}{2}\, w^T w + \lambda_1 \bar{\gamma}^2 + \lambda_2 \hat{\gamma}^2 \quad\text{s.t.}\quad y - X^T w \le \varepsilon e,\ \ X^T w - y \le \varepsilon e, \qquad (3)$$

where λ1 and λ2 are parameters trading off the regression deviation mean, the regression deviation variance and the model complexity. It is evident that the hard ε-tube MDR subsumes the hard ε-tube SVR when λ1 and λ2 equal 0. For the infeasible cases, similarly to the soft-margin SVR, the soft ε-tube MDR leads to

$$\min_{w,\xi,\xi^*}\ \tfrac{1}{2}\, w^T w + \lambda_1 \hat{\gamma}^2 + \lambda_2 \bar{\gamma}^2 + C\bigl(e^T\xi + e^T\xi^*\bigr) \quad\text{s.t.}\quad y - X^T w \le \varepsilon e + \xi,\ \ X^T w - y \le \varepsilon e + \xi^*,\ \ \xi,\xi^* \ge 0, \qquad (4)$$

where C is a positive trade-off parameter, and ξ = [ξ_1, ..., ξ_m]^T and ξ^* = [ξ_1^*, ..., ξ_m^*]^T measure the ε-insensitive losses of the samples. The geometric meaning of (4) is clear: (i) minimizing the first term of the objective makes the regression function flatter; (ii) minimizing the second and third terms keeps the training samples close to the regression function; (iii) minimizing the last term, together with the constraints of (4), drives all training samples to lie inside the ε-tube. Similarly, the soft ε-tube MDR subsumes the soft-margin SVR when λ1 and λ2 are both 0. Because the soft-margin SVR suits most real problems, in the following we focus on the soft ε-tube model; unless otherwise clarified, MDR in this paper refers to the soft ε-tube MDR.
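For intuition, a small Python/NumPy sketch that evaluates the soft ε-tube MDR objective (4) for a given linear model w is shown below, reusing the deviation statistics of Definitions 1 and 2. It is illustrative only; the default parameter values and the row-wise data layout are assumptions.

import numpy as np

def mdr_soft_objective(w, X, y, lam1=1.0, lam2=1.0, C=1.0, eps=0.1):
    """Soft eps-tube MDR objective (4) for a linear model f(x) = w.x.
    Here X has one sample per row (unlike the column convention in the text),
    and the slacks are replaced by their optimal values, i.e. the
    eps-insensitive losses of the residuals."""
    residual = y - X @ w                      # gamma_i = y_i - f(x_i)
    mean_dev = residual.mean()                # regression deviation mean
    var_dev = 2.0 * residual.var()            # regression deviation variance
    slack = np.maximum(np.abs(residual) - eps, 0.0)
    return (0.5 * float(w @ w)
            + lam1 * var_dev + lam2 * mean_dev ** 2
            + C * float(slack.sum()))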


3.3. Algorithms for MDR

3.3.1. Dual algorithm for MDR

In this subsection, we first derive the dual of MDR, and then design a dual coordinate descent algorithm for solving it. Considering the primal problem (4) and substituting Definitions 1 and 2, we have

$$\min_{w,\xi,\xi^*}\ \tfrac{1}{2}\, w^T w + \tfrac{2\lambda_1}{m}\, w^T X X^T w + \tfrac{\lambda_2 - 2\lambda_1}{m^2}\, w^T X ee^T X^T w - \tfrac{4\lambda_1}{m}\, y^T X^T w - \tfrac{2\lambda_2 - 4\lambda_1}{m^2}\, y^T ee^T X^T w + C\bigl(e^T\xi + e^T\xi^*\bigr) \quad\text{s.t.}\quad y - X^T w \le \varepsilon e + \xi,\ \ X^T w - y \le \varepsilon e + \xi^*,\ \ \xi,\xi^* \ge 0. \qquad (5)$$


144 145 146

Theorem 1. The optimal solution w∗ for problem (5) could be expressed as

CE

147

In (5), the constants are omitted without influence on optimization. Problem (5) is hard to solve because of the high or infinite dimensionality of φ(·). Fortunately, inspired by the representer theorem in [17], the following theorem states that the optimal solution for (5) can be spanned by {φ(xi ), 1 ≤ i ≤ m}.

PT

143

ED

X T w − y ≤ εe + ξ ∗ , ξ, ξ ∗ ≥ 0.

m X w = αi φ(xi ) = Xα, ∗

(6)

i=1

where α = [α1 , · · · , αm ]T are the coefficients.

AC 148

149

150

Proof 1. As w is the weight vector in φ(·) space, it can be decomposed into a part that lives in the span of φ(xi ) and an orthogonal part, i.e., w=

m X

αi φ(xi ) + v = Xα + v

i=1

7

(7)

ACCEPTED MANUSCRIPT

for some α = [α1 , · · · , αm ]T and v satisfying (φ(xj ) · v) = 0 for all j, i.e., X T v = 0. Note that

CR IP T

X T w = X T (Xα + v) = X T Xα, thus the second and the third terms of (5) are independent of v. Further note that the constraint is also independent of v, so the last term of (5) is also independent of v. With regard to the first term of (5), since X T v = 0, hence we get w · w = (Xα + v)T (Xα + v) = αT X T Xα + v T v ≥ αT X T Xα

153 154

155 156 157

with equality if and only if v = 0. Since the value of v does not affect the other terms, we can set v = 0 to minimize the first term and obtain the optimal solution of (5). Therefore, the optimal solution has the form (6).

AN US

152

According to Theorem 1, we have X T w = X T Xα = Gα, wT w = α X T Xα = αT Gα, where G = X T X is the kernel matrix. Then, (5) could be reformulated into T

M

151

1 T α Qα + pT α + C(eT ξ + eT ξ ∗ ) α,ξ,ξ 2 s.t.y − Gα ≤ εe + ξ, Gα − y ≤ εe + ξ ∗ , ξ, ξ ∗ ≥ 0,

159

1 1 1 1 where Q = G+ 4λ GT G+ 2λ2m−4λ GT eeT G and p = − 4λ GT y− 2λ2m−4λ GT eeT y. 2 2 m m Next, we give the Lagrangian of (8) as

CE

158

(8)

PT

ED

min∗

AC

1 ˆ β, ˜ η, L(α, ξ, ξ ∗ , β, ˆ η) ˜ = αT Qα + pT α + C(eT ξ + eT ξ ∗ ) 2 ˆ − β T (εe + ξ − y + Gα) − β˜T (εe + ξ ∗ + y − Gα)

160 161

− ηˆT ξ − η˜T ξ ∗ ,

(9)

where βˆ = [βˆ1 , · · · , βˆm ]T , β˜ = [β˜1 , · · · , β˜m ]T , ηˆ = [ηˆ1 , · · · , ηˆm ]T and η˜ = [η˜1 , · · · , η˜m ]T are lagrange multipliers. By setting the partial derivations of 8

ACCEPTED MANUSCRIPT

162

{α, ξ, ξ ∗ } to zero, we have

163 164

169 170 171 172



,d =



−GQ−1 p + εe − y GQ−1 p + εe + y



.

M

−GQ−1 GT GQ−1 GT

Due to the simple box-constraint and the convex quadratic objective function, there exists many methods to solve the optimization problem [36, 37, 38, 40]. As suggested by [39], (13) can be efficiently solved by the dual coordinate descent algorithm. In dual coordinate descent algorithm [40], one of the variables is selected to minimize while the other variables are kept as constants at each iteration, so each iteration could obtain a bounded constraint solution for one variable. For instance, we can keep the βj6=i as constraints to minimize βi , then we obtain the following problem

CE

173

GQ−1 GT −GQ−1 GT

(13)

ED

168

(12)

PT

167



AN US

where ˆT , β ˜T ]T , H = β = [β

166

(11)

By substituting (10), (11) and (12) into (9), the dual of (8) can be obtained 1 min f (β) = β T Hβ + dT β β 2 s.t. 0 ≤ βi ≤ C, i = 1, · · · , 2m,

165

(10)

CR IP T

∂L = Qα + p − GT βˆ + GT β˜ = 0 ∂α ∂L = Ce − βˆ − ηˆ = 0 ∂ξ ∂L = Ce − β˜ − η˜ = 0 ∂ξ ∗

min f (β + tei ) t

s.t. 0 ≤ βi + t ≤ C, i = 1, · · · , 2m,

(14)

AC

where ei denotes the vector with 1 in the i-th coordinate and 0s elsewhere. Furthermore f could be simplified as 1 f (β + tei ) = hii t2 + [∇f (β)]i t + f (β), 2

where H = [hij ]i,j=1,··· ,2m and [∇f (β)]i is the i-th component of the gradient ∇f (β). Note that f (β + tei ) is a simple quadratic function of t and f (β) 9

ACCEPTED MANUSCRIPT

can be dropped. Combined the bounded constraint 0 ≤ βi + t ≤ C, the minimizer of (14) leads to a bounded constraint solution

174 175

[∇f (β)]i , 0), C). hii

CR IP T

βinew = min(max(βi −

We summarize the pseudo-code of the dual coordinate descent solver for kernel MDR in Algorithm 1.
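As an illustration of the single-coordinate update underlying Algorithm 1, the following Python/NumPy sketch runs dual coordinate descent on a generic box-constrained QP of the form (13), min ½βᵀHβ + dᵀβ subject to 0 ≤ βᵢ ≤ C. It is a simplified stand-in for the MATLAB implementation and omits the efficient incremental update of α used there; the stopping rule is an assumption.

import numpy as np

def dual_coordinate_descent(H, d, C, n_sweeps=100, tol=1e-8):
    """Coordinate descent for: min 0.5*b'Hb + d'b  s.t. 0 <= b_i <= C.
    Each step minimizes over one coordinate with the others fixed, then
    clips the unconstrained minimizer to the box [0, C]."""
    n = d.size
    beta = np.zeros(n)
    grad = d.copy()                      # gradient H @ beta + d at beta = 0
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(n):
            if H[i, i] <= 0:
                continue                 # skip degenerate coordinates
            new_bi = np.clip(beta[i] - grad[i] / H[i, i], 0.0, C)
            delta = new_bi - beta[i]
            if delta != 0.0:
                grad += delta * H[:, i]  # keep the gradient up to date
                beta[i] = new_bi
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta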

177

For prediction, according to (10), one can obtain the coefficients α from the optimal β ∗ as

PT

176

ED

M

AN US

Algorithm 1: Dual coordinate descent solver for kernel MDR Input: Data set X, λ1 , λ2 , C Output: α 1 1 Initialize β = 0, α = 4λ Q−1 GT y + 2λ2m−4λ Q−1 GT eeT y, 2 m A = Q−1 GT N , hii = eTi Hei ; while β not converge do for i = 1, · · · , 2m  do    G εe − y [∇f (β)]i ← α+ ; −G εe + y i βiold ← βi ; [∇f (β)]i , 0), C); βi ← min(max(βi − hii α ← α+(βi − βiold )Aei ; end for end while

178

where N = [I, −I]. Hence, for a test sample x̃, its predicted value can be obtained by f(x̃) = (w · φ(x̃)) = Σ_{i=1}^{m} α_i k(x_i, x̃).
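As a sketch of how predictions are formed from the recovered coefficients α, the following Python helper evaluates f(x̃) = Σ_i α_i k(x_i, x̃) with a Gaussian kernel; the kernel choice and parameter names are assumptions for illustration.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / sigma^2) for all rows of A against rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def mdr_predict(X_train, alpha, X_test, sigma=1.0):
    """f(x~) = sum_i alpha_i k(x_i, x~) for each test sample x~."""
    K = gaussian_kernel(X_test, X_train, sigma)   # shape (n_test, m)
    return K @ alpha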

AC

179

CE

α = Q−1 (GT N β ∗ − p) 4λ1 2λ2 − 4λ1 T = Q−1 GT (N β ∗ + y+ ee y), m m2

180 181 182 183 184

3.3.2. Properties of MDR In this subsection, we investigate the statistical property of MDR that leads to a bound on the expectation of MDR according to the leave-one-out cross-validation estimate, which is an unbiased estimate of the probability of test error. 10

ACCEPTED MANUSCRIPT

185

Recall the linear kernel MDR 1 T 4λ1 2λ2 − 4λ1 w [I + XX T − XeeT X T ]w w,ξ,ξ 2 m m2 2λ2 − 4λ1 4λ1 Xy − XeeT y]w + C(eT ξ + eT ξ ∗ ) + [− m m2 s.t.y − X T w ≤ εe + ξ,

CR IP T

min∗

X T w − y ≤ εe + ξ ∗ , ξ, ξ ∗ ≥ 0,

188

189

where ˆT , β ˜T ]T , H = β = [β

193 194

196

197

2λ2 −4λ1 XeeT X T m2

,d =



−X T Q−1 p + εe − y X T Q−1 p + εe + y

1 and p = − 4λ Xy − m

2λ2 −4λ1 XeeT y. m2

Definition 3. Regression error: π(x, y) = |y − f (x)|. We have the following Theorem: Theorem 2. Let β be the optimal solution of (16), and E[R(β)] be the expectation of the probability of test error, then we have P P E[ε|I1 | + h i∈I2 (βi − ε) + i∈I3 (ε + ξi )] (17) E[R(θ)] ≤ , m where I1 ≡ {i|βi = 0}, I2 ≡ {i|0 < βi < C}, I3 ≡ {i|βi = C} and h = max{hii , i = 1, · · · , m}.

AC

195



ED

192

4λ1 XX T m



PT

191

Q=I+

X T Q−1 X −X T Q−1 X −X T Q−1 X X T Q−1 X

CE

190



AN US

187

where X = [x1 , · · · , xm ] and y = [y1 , · · · , ym ]T is a column vector. Following the same steps in subsection 3.2.1, one can obtain the dual problem of (15), i.e. 1 min f (β) = β T Hβ + dT β β 2 (16) s.t. 0 ≤ βi ≤ C, i = 1, · · · , 2m,

M

186

(15)

Proof 2. Suppose β ∗ = argminf (β), 0≤β≤C i

β = argmin f (β), i = 1, · · · , 2m, 0≤β≤C,βi =0

11

(18)



,

ACCEPTED MANUSCRIPT

198 199

and the corresponding solution for MDR are w∗ and wi , respectively. As shown in [44], E[L((x1 , y1 ), · · · , (xm , ym ))] (19) , m where L((x1 , y1 ), · · · , (xm , ym )) is the number of errors in the leave-one-out procedure. To compute the test error, we divide it into the following three cases: i) if any of βˆi∗ = 0 or β˜i∗ = 0 we have that (xi , yi ) is in the ε-tube in the leave-one-out procedure according to the KKT conditions (ξi = 0, ξi∗ = 0), and it is evident that π(xi , yi ) ≤ ε. ii) if both 0 < βˆ∗ < C and 0 < β˜∗ < C, we have

201 202

203

AN US

200

CR IP T

E[R(θ)] =

i

i i

f (β ) − min f (β + tei ) ≤ f (β ) − f (β ∗ ) ≤ f (β ∗ − βi∗ ei ) − f (β ∗ ). i

i

t

205

206

we can find that the right-hand side of (20) is equal to side of (20) is equal to

[∇f (β i )]2i . 2hii

βi∗ 2 hii , 2

and the left-hand

Note that     xi ε − yi i [∇f (β )]i = w+ −xi ε + yi

(21)

M

204

and

(20)

So we have

PT

[∇f (β i )]2i βi∗ 2 hii (ε + π(xi , yi ))2 ≤ ≤ . 2hii 2hii 2 Further, we can obtain π(xi , yi ) ≤ βi∗ hii − ε. iii) if β˜i∗ = C and βˆi∗ = C, we have that (xi , yi ) is out of the ε-tube in the leave-one-out procedure according to the KKT conditions, and

CE

207

ED

(ε + f − y) ≥ 0, (ε − f + y) ≥ 0, max{(ε + f − y), (ε − f + y)} = ε + π.(22)

π(xi , yi ) = ε + ξi∗ .

So we have

AC 208

L((x1 , y1 ), · · · , (xm , ym )) ≤ ε|I1 | + h

209 210 211

X

i∈I2

(βi∗ − ε) +

X

(ε + ξi∗ )

(23)

i∈I3

where I1 ≡ {i|βi∗ = 0}, I2 ≡ {i|0 < βi∗ < C}, I3 ≡ {i|βi∗ = C} and h = max{hii , i = 1, · · · , m}. Take expectation on both side and with (19), we get that (17) holds. 12

ACCEPTED MANUSCRIPT

214 215 216 217 218

min g(w) = w

CR IP T

213

3.3.3. Primal algorithm for MDR In the above subsection, dual coordinate descent algorithm could solve kernel MDR efficiently. As for large scale regression problem, the computational complexity of kernel matrix is O(m2 ) in nonlinear kernel MDR. In the following, we adopt an average stochastic gradient (ASGD) algorithm [41] to linear kernel MDR for large scale problem. We reformulate the linear kernel MDR as follows 4λ1 2λ2 − 4λ1 1 T w [I + XX T + XeeT X T ]w 2 m m2 2λ2 − 4λ1 4λ1 Xy − XeeT y]w + [− m m2 m m X X T + C( max(0, yi − w xi − ε) + max(0, wT xi − yi − ε)),

AN US

212

i=1

223 224 225 226 227 228

229 230

M

222

ED

221

where X = [x1 · · · xm ] and y = [y1 · · · ym ]T . For large scale problem, it is expensive to compute the gradient of (24) because its computation involves all the training samples. Stochastic gradient descent (SGD) works by computing a noisy unbiased estimation of the gradient via sampling a subset of the training samples. Recently, SGD has been successfully used in classical SVM with powerful computation efficiency [42, 43], where it is convergent to the global optimal solution because of the convexity of the formulation of SVM. Following the step of SGD in SVM, we present an approach to obtain an unbiased estimation of the gradient ∇g(w).

PT

220

(24)

Theorem 3. If two samples (xi , yi ) and (xj , yj ) are sampled from training data set randomly, then

CE

219

i=1

AC

∇g(w, xi , xj ) = 4λ1 xi xTi w + (2λ2 − 4λ1 )ei xi ej xTj w + w  i ∈ I1 ,  xi −xi i ∈ I2 , − 4λ1 yi xi − (2λ2 − 4λ1 )ei xi ej yj − mC  0 otherwise, (25)

231 232

is an unbiased estimation of ∇g(w). Here I1 ≡ {i | wᵀxᵢ < yᵢ − ε} and I2 ≡ {i | wᵀxᵢ > yᵢ + ε}.
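Under the stated assumptions (linear model, two samples drawn uniformly at random), a direct Python transcription of the estimator in (25) could look as follows; it is a sketch, with e_i taken as 1 as in the proof, and is not the released code.

import numpy as np

def mdr_stochastic_grad(w, xi, yi, xj, yj, m, lam1, lam2, C, eps):
    """Noisy unbiased estimate of grad g(w) built from two random samples
    (xi, yi) and (xj, yj), following (25)."""
    grad = (4.0 * lam1 * xi * (xi @ w)
            + (2.0 * lam2 - 4.0 * lam1) * xi * (xj @ w)
            + w
            - 4.0 * lam1 * yi * xi
            - (2.0 * lam2 - 4.0 * lam1) * xi * yj)
    pred = w @ xi
    if pred < yi - eps:        # target above the eps-tube (set I1)
        grad -= m * C * xi
    elif pred > yi + eps:      # target below the eps-tube (set I2)
        grad += m * C * xi
    return grad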

ACCEPTED MANUSCRIPT

234

CR IP T

233

Proof 3. Note that the gradient of g(w) is  Pm i ∈ I1 ,  Pi=1 xi m −x ∇g(w) = Qw + p − C i i ∈ I2 ,  i=1 0 otherwise,

1 1 1 1 XX T + 2λ2m−4λ XeeT X T and p = − 4λ Xy− 2λ2m−4λ XeeT y. where Q = I+ 4λ 2 2 m m Further note that m 1 1 X T xi xTi = XX T , Exi [xi xi ] = m i=1 m

m

AN US

1 X 1 Exi [yi xi ] = yi xi = Xy, m i=1 m m

1 X 1 Exi [ei xi ] = xi = Xe, m i=1 m

(26)

m

1 X 1 Exi [ei yi ] = ei xi = y T e. m i=1 m

236

According to the linearity of expectation, the independence between xi and xj ,and with (26), we have

M

235

ED

Exi ,xj [∇g(w, xi , xj )]

AC

CE

PT

= 4λ1 Exi [xi xTi ]w + (2λ2 − 4λ1 )Exi [ei xi ]Exj [ej xj ]w + w   Exi [xi |i ∈ I1 ], Exi [−xi |i ∈ I2 ], −4λ1 Exi [yi xi ] − (2λ2 − 4λ1 )Exi [ei xi ]Exj [ej yj ] − mC  0 otherwise, 1 1 1 = 4λ1 XX T w + (2λ2 − 4λ1 ) Xe( Xe)T w + w m m m  Pm xi i ∈ I1 , 1 1 1  Pi=1 m T −xi i ∈ I2 , −4λ1 Xy − (2λ2 − 4λ1 ) 2 Xee y − mC m m m  i=1 0, otherwise  Pm i ∈ I1 ,  Pi=1 xi m −x = Qw + p − C i i ∈ I2 ,  i=1 0 otherwise, = ∇g(w).

237

It means that ∇g(w, xi , xj ) is a noisy unbiased gradient of g(w). 14

ACCEPTED MANUSCRIPT

238

With Theorem 3, the stochastic gradient update can be expressed as wt+1 = wt − ηt ∇g(w, xi , xj ),

where ηt is an appropriate selected step-size parameter in the t-th iteration. In practice, averaged stochastic gradient descent (ASGD) is adopted because it is more robust than SGD. During each iteration, besides performing the normal stochastic gradient update (27), we also compute t X 1 ¯t = wi , w t − t0 i=t +1

AN US

0

CR IP T

239

(27)

where t0 determines when we carry out the averaging operation. This average can be performed efficiently via a recursive formula: ¯ t+1 = w ¯ t + µt (wt+1 − w ¯ t ), w 241 242

where μ_t = 1/max{1, t − t0}. Algorithm 2 summarizes the pseudo-code of the primal algorithm for linear MDR.
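The following Python sketch mirrors the averaged update: a plain stochastic gradient step followed by the running-average recursion w̄_{t+1} = w̄_t + μ_t (w_{t+1} − w̄_t). The step-size schedule, the choice of t0 and the gradient oracle are illustrative assumptions rather than the exact settings of Algorithm 2.

import numpy as np

def asgd(grad_fn, dim, n_steps, eta0=0.01, t0=None):
    """Averaged SGD: run plain SGD on a noisy gradient oracle grad_fn(w) and
    maintain a running average of the iterates, started after t0 steps."""
    if t0 is None:
        t0 = n_steps // 2                   # start averaging halfway (arbitrary)
    w = np.zeros(dim)
    w_bar = np.zeros(dim)
    for t in range(1, n_steps + 1):
        eta_t = eta0 / np.sqrt(t)           # assumed decaying step size
        w = w - eta_t * grad_fn(w)          # stochastic gradient step (27)
        mu_t = 1.0 / max(1, t - t0)
        w_bar = w_bar + mu_t * (w - w_bar)  # recursive averaging
    return w_bar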

M

240

AC

CE

PT

ED

Algorithm 2: primal algorithm for linear MDR Input: Data set X, λ1 , λ2 , C, ε ¯ Output: w Initialize u = 0, T = 5; for t = 1 · · · T m do Sample two training samples (xi , yi ) and (xj , yj ) randomly; Compute ∇g(w, xi , xj ) as in (25); w ← ηt ∇g(w, xi , xj ); ¯ ←w ¯ + µt (w − w); ¯ w end for

243

244

245 246

4. Experimental Results In this section, the experiments were made to illustrate the effectiveness of our MDR compared with ε-TSVR [10], ε-SVR [7, 5] and LSSVR [46] on some data sets. The methods were implemented by MATLAB 7.0 runing 15

ACCEPTED MANUSCRIPT

248 249 250 251 252 253 254 255

AN US

256

on a PC with an Intel(R) Core Duo i7(2.70GHZ) with 32 GB RAM. ε-SVR was solved by LIBSVM and LIBLINEAR and LSSVR was solved by > > > 2 2 LSSVMlab. Gaussian kernel K(x> i , xj ) = exp(−||xi − xj || /σ ) and Poly d > kernel K(x> i , xj ) = (xi ·xj +1) were employed for nonlinear regression. The values of parameters were selected by grid-search method [4]. For brevity, we set c1 = c2 , c3 = c4 , and ε1 = ε2 for ε-TSVR and λ1 = λ2 for our nonlinear MDR. The parameters in Gaussian kernel and the penalty parameters in four methods were selected from the set {2−9 , 2−8 , · · · , 29 } by 10-fold cross-validation. Specifically, the parameter d in poly kernel was selected from {2, 3, 4, 5, 6}. Without loss of generality, let m be the number of training samples and l P be the number of test samples, yˆi is the prediction value of yi , and y¯ = l 1 i=1 yi . Then, the definitions of some performance criteria are stated in l Table 1. To demonstrate the overall performance of a method, a performance criterion referred to average rank is defined as

CR IP T

247

average rank =

M

258

where D denotes the collection of data sets, rank(F) ∈ {1, 2, ..., n} is the rank of the method on data set F, and n is the number of compared methods.
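For reference, the criteria of Table 1 can be computed as in the following Python sketch; the function is a plain restatement of the formulas (SSE, SST, SSR, NMSE, R² and MAPE) and is not taken from the released code.

import numpy as np

def regression_criteria(y, y_hat):
    """Performance criteria of Table 1 for test targets y and predictions y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = ((y - y_hat) ** 2).sum()                        # sum of squared errors
    sst = ((y - y.mean()) ** 2).sum()                     # total sum of squares
    ssr = ((y_hat - y.mean()) ** 2).sum()                 # regression sum of squares
    nmse = sse / sst
    r2 = (((y_hat - y_hat.mean()) * (y - y.mean())).sum() ** 2
          / (((y - y.mean()) ** 2).sum() * ((y_hat - y_hat.mean()) ** 2).sum()))
    mape = np.abs((y - y_hat) / y).mean()                 # assumes y has no zeros
    return {"SSE": sse, "SST": sst, "SSR": ssr, "NMSE": nmse, "R2": r2, "MAPE": mape}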

ED

257

1 X rank(F), |D| F ∈D

Table 1: Performance criteria.

Criteria

Definition

SSE

SSE =

AC

SSR

259 260 261 262 263 264

(yi − yˆi )2

PT SST =

CE

SST

l P

SSR =

i=1 l P

Criteria

Definition

NMSE

NMSE = SSE / SST (

P (ˆ yi −E[ˆ yi ])(yi −E[yi ]))2

(yi − y¯)2

R2

R2 =

(ˆ yi − y¯)2

MAPE

MAPE =

i=1 l P

i=1

i

σy2 σy2ˆ

1 l

l P

i=1

yi | yiy−ˆ | i

In our experiments, we test the performance of the above regressors on two artificial data sets, eight regular scale data sets, eleven large scale data sets and three large scale data sets with specified test set, including UCI data sets and Statlib data sets. Tables 2 and 3 summarize the details of these data sets. In particular, the feature in Table 3 is preprocessed as in [47]. For regular data sets, RBF kernel and Poly kernel are employed separately, and 16

ACCEPTED MANUSCRIPT

Table 2: The information of benchmark data sets. #Instances #Features

regular Diabetes Autoprice Wisconsin MachineCPU large

ConcreteCS abalone(scale) cpusmall Bike cadate spatialnetwork

Dataset

43 159 194 209

2 15 32 7

Motorcycle Servo WisconsinBC AutoMpg

1,030 4,177 8,192 10,886 20,640 434,874

8 8 12 9 8 3

Abalone bank8fh cpusmall(scale) driftdataset CASP

#Instances

#Features

133 167 194 398

1 4 31 7

4,177 8,192 8,192 13,910 45,730

8 8 12 128 9

CR IP T

Dataset

AN US

Scale

Table 3: The information of large scale data sets with specified test set.

Data

#Training #Testing

MSD TFIDF-2006 LOG1P-2006

268

269 270 271 272 273

274

41,734,350 19,971,013 96,731,838

[0,1] [-7.90,-0.52] [-7.90,-0.52]

M

for large scale data sets the linear kernel MDR is only used. Experiments were repeated for 30 times with 10-fold cross data partitions, and the mean of the evaluation of R2 , NMSE and MAPE as well as their standard deviations were recorded. 4.1. Artificial Data Sets To compare our MDR with ε-TSVR, ε-SVR and LSSVR, two artificial data sets with different distributions were implemented on these methods. 2 The first artificial example is a synthetic data set x 3 function, which is defined as 2 yi = xi3 + ξi , xi ∼ U [−2, 2], ξi ∼ N (0, 0.52 ), (28) where U [a, b] represents the uniform random variable in [a, b]. To fully reflect the performance of our method, training samples were polluted by Gaussian noise ξi with zero means and 0.5 standard deviation. To avoid biased comparisons, ten independent groups of noisy samples, including 200 training samples and 400 none noise test samples, were generated. The estimated functions generated by these four methods are illustrated in Fig. 1. It is clear that these methods could fit the data fairly well, but our MDR is closer to the test function than others. In fact, by the observation that on the right

AC

275

90 150,358 4,272,226

ED

267

Range of y

PT

266

51,630 3,308 3,308

#Non-zeros(training)

CE

265

463,715 16,087 16,087

#Features

276 277 278 279 280 281

17

AN US

CR IP T

ACCEPTED MANUSCRIPT

(b) ε-SVR

CE

PT

ED

M

(a) ε-TSVR

(c) LSSVR

(d) MDR

AC

Figure 1: The performance of ε-TSVR, ε-SVR, LSSVR and our MDR on y = x2/3 function.

18

ACCEPTED MANUSCRIPT

288 289 290 291

292 293 294 295 296 297 298 299 300 301

302 303 304 305 306 307 308 309

4.2. Small Benchmark Data Sets Tables 5 and 6 exhibit the experimental results on 8 regular scale data sets with RBF and Poly kernels, respectively. All these data sets are normalized to zero mean and unit deviation. From the average rank at the bottom of Table 5 and Table 6, the overall performance of MDR is superior to other three methods. More specifically, for RBF kernel, MDR performs dramatically better than other compared methods on 6, 7 and 7 over 8 data sets for performance criteria R2 , NMSE and MAPE, respectively. For poly kernel, MDR performs better than other methods on 7, 5 and 4 over 8 data sets for performance criteria R2 , NMSE and MAPE, respectively. The optimal parameters were listed in Tables 7 and 8, respectively. Fig. 5(a) shows the CPU time of RBF kernel MDR on regular scale data sets. To further demonstrate the performance of our MDR, we investigate the regression deviation mean and variance of our MDR with RBF kernel, εSVR, ε-TSVR and LSSVR on the regular scale data sets as shown in Fig.3.

AC

310

CR IP T

287

AN US

286

M

285

ED

284

PT

283

half part of the data set, our MDR performs intrinsically the same as the other three methods. However, when we looking at the left half part, the advantage of MDR is obvious, i.e., its fitting function is almost identical to the test one. The corresponding performance criteria and training time are listed in Table 4, which shows that compared with the the other methods, MDR has the highest R2 ,lowest NMSE and comparable MAPE. The CPU time of our MDR is also superior over ε-TSVR and LSSVR and comparable with ε-SVR. The second artificial example is the regression estimation on the sinc function as follows |xi | sin(xi ) + (0.5 − )ξi , xi ∼ U [−4π, 4π], ξi ∼ N (0, 0.52 ). (29) yi = xi 8π The data set consists of 200 training samples and 400 test samples. Fig.2 shows the estimated functions generated by these four methods, and Table 4 exhibits the corresponding results. As can be seen from Fig. 2 and Table 4, our MDR obtains the highest R2 , comparable NMSE and MAPE compared with other regressors. The CPU time is also superior than ε-TSVR and LSSVR. At the bottom of Table 4, we listed the average ranks of all four methods on the artificial data sets with regard to different performance criteria. It can be seen that our MDR is superior than other methods on R2 and comparable to other three methods on NMSE and MAPE.

CE

282

311 312 313 314

315 316

19

AN US

CR IP T

ACCEPTED MANUSCRIPT

(b) ε-SVR

CE

PT

ED

M

(a) ε-TSVR

(c) LSSVR

(d) MDR

AC

Figure 2: The performance of ε-TSVR, ε-SVR, LSSVR and our MDR on sinc function.

20

ACCEPTED MANUSCRIPT

Table 4: Comparisons of ε-TSVR, ε-SVR,LSSVR and our MDR on artificial data sets.

sinc(x)

R2 (rank)

NMSE(rank)

ε-TSVR ε-SVR LSSVR MDR

0.9686(3) 0.9599(4) 0.9704(2) 0.9733(1)

0.0340(2) 0.0463(4) 0.0376(3) 0.0332(1)

ε-TSVR ε-SVR LSSVR MDR

0.9852(3) 0.9869(2) 0.9849(4) 0.9878(1)

0.0335(2) 0.0310(1) 0.0338(3) 0.0361(4)

3.0000 3.0000 3.0000 1.0000

2.0000 2.5000 3.0000 2.5000

average rank ε-TSVR ε-SVR LSSVR MDR

321 322 323 324 325 326 327 328 329 330

AC

331

M

320

ED

319

PT

318

332

333 334

335

336 337

0.1883(2) 0.1799(1) 0.2063(4) 0.2027(3)

0.0085 0.0037 0.0063 0.0043

0.6148(3) 0.5407(1) 0.6153(4) 0.5484(2)

0.0073 0.0023 0.0055 0.0045

2.5000 1.0000 4.0000 2.5000

-

From Fig.3, the means of our MDR have the closest distance to zero, and the medians are always zero or around zero on most data sets, , where the other methods have some biases. On the other hand, MDR also has the most compact variance distribution and the lowest variance median, which indicates its robustness. In particular, although LSSVR model implies the minimization of regression deviation mean as claimed in Proposition 1 and exhibits also a pretty good distribution of regression deviation and variance, it is evident that our MDR illustrates more smaller distribution regression deviation mean and compact distribution deviation variance compared with the other three methods and demonstrates the superiority on the minimum regression deviation distribution. We now study the influence of parameters on the performance of our MDR on these 8 regular data sets. The formulation of our RBF kernel MDR involves three trade-off parameters, i.e., λ1 , λ2 , C and a kernel parameter σ. Fig.4(a) and Fig.4(b) show the influence of λ1 on the NMSE and CPU time by varying it from 2−9 to 29 while fixing λ2 , C and σ as the optimal ones by cross validation. Fig.4(c)∼Fig.4(h) show the influence of λ2 , C and σ on NMSE and CPU time, respectively. It can be observed from Fig.4(a) that λ1 has more obvious influence on NMSE. As λ1 gets larger, NMSE is getting smaller. It implies that a larger λ1 needs to be specified when one wants a more smaller NMSE. Fig.4(c) and Fig.4(e) show the influence

CE

317

MAPE(rank) CPU(sec)

CR IP T

x

2/3

regressor

AN US

Dataset

21

ACCEPTED MANUSCRIPT

CR IP T

Table 5: Comparisons of ε-TSVR, ε-SVR, LSSVR and our MDR on regular scale data sets with RBF kernel. Dataset

regressor

R2 (rank)

Diabetes

ε-TSVR ε-SVR LSSVR MDR

0.6178±0.1395 (4) 0.7472±0.1231(2) 0.7249±0.1879 (3) 0.7624±0.0871(1)

0.9612±0.2113(3) 1.8546±0.0810(2) 1.1755±0.2870(4) 1.9454±0.1463(4) 0.9420±0.1901(2) 1.9043±0.0654(3) 0.9316±0.1430(1) 1.8244±0.0786(1)

0.0010 0.0003 0.0024 0.0010

Motorcycle

ε-TSVR ε-SVR LSSVR MDR

0.7912±0.0797(4) 0.8681±0.0618(2) 0.8091±0.0727(3) 0.8733±0.0418(1)

0.2686±0.0384(4) 1.3227±0.0268(2) 0.2676±0.0590(3) 1.3280±0.0383(3) 0.2563±0.0521(2) 1.3550±0.0273(4) 0.2436±0.0305(1) 1.2991±0.0253(1)

0.0028 0.0009 0.0031 0.0068

Autoprice

ε-TSVR ε-SVR LSSVR MDR

0.8198±0.0651(2) 0.8121±0.0790(4) 0.8155±0.0808(3) 0.8601±0.0232(1)

0.1647±0.0375(2) 0.7119±0.0277(4) 0.1954±0.0267(4) 0.6905±0.0290(2) 0.1900±0.0326(3) 0.7107±0.0179(3) 0.1618±0.0297(1) 0.6804±0.0234(1)

0.0040 0.0027 0.0035 0.0106

Servo

ε-TSVR ε-SVR LSSVR MDR

0.9009±0.0638(2) 0.1976±0.0515(3) 0.6585±0.0354(2) 0.6080±0.0887(4) 0.2263±0.0648(4) 0.7002±0.0414(3) 0.8419±0.0919 (3) 0.1793±0.0583(2) 0.7034±0.0295(4) 0.9220±0.0322(1) 0.1626±0.0356(1) 0.6213±0.0233(1)

0.0044 0.0020 0.0060 0.0119

Wisconsin

ε-TSVR ε-SVR LSSVR MDR

0.3059±0.1432(2) 0.4430±0.1118(1) 0.2791±0.1468(4) 0.3146±0.1961(3)

MAPE(rank)

CPU(sec)

0.9209±0.1218(2) 1.1124±0.0218(4) 1.0389±0.0955(4) 1.0703±0.0449(2) 0.9432±0.1622(3) 1.0759±0.0168(3) 0.9170±0.1350(1) 1.0572±0.0182(1)

0.0077 0.0174 0.0047 0.0177

WisconsinBC ε-TSVR ε-SVR LSSVR MDR

0.3484±0.1569(3) 0.3719±0.0581(2) 0.2679±0.0650(4) 0.3747±0.1700(1)

0.9360±0.1836(1) 1.1323±0.0230(4) 1.0307±0.0771(4) 1.0455±0.0452(2) 0.9455±0.0690(2) 1.0671±0.0248(3) 0.9643±0.1246(3) 1.0385±0.0228(1)

0.0235 0.0111 0.0129 0.0177

PT

ED

M

AN US

NMSE(rank)

MachineCPU

0.8145±0.0973(4) 0.8469±0.0888(2) 0.8186±0.0798(3) 0.9152±0.0384(1)

0.1129±0.0361(2) 1.5032±0.0465(1) 0.1130±0.0342(3) 1.6066±0.0484(3) 0.1270±0.0272(4) 1.7890±0.0442(4) 0.1046±0.0225(1) 1.5575±0.0453(2)

0.0235 0.0026 0.0043 0.0235

ε-TSVR ε-SVR LSSVR MDR

0.9130±0.0176(4) 0.9635±0.0140(1) 0.9561±0.0162(2) 0.9379±0.0055(3)

0.0497±0.0060(2) 0.2527±0.0053(2) 0.0508±0.0083(3) 0.2558±0.0030(3) 0.0534±0.0052(4) 0.2835±0.0024(4) 0.0474±0.0085(1) 0.2516±0.0063(1)

0.0256 0.0206 0.0204 0.1380

ε-TSVR ε-SVR LSSVR MDR

3.2500 2.2500 3.1250 1.3750

CE

ε-TSVR ε-SVR LSSVR MDR

AC

AutoMpg

average rank

2.3750 3.6250 2.7500 1.2500

22

2.6250 2.8750 3.3750 1.1250

-

ACCEPTED MANUSCRIPT

R2 (rank)

Dataset

regressor

Diabetes

ε-TSVR ε-SVR LSSVR MDR

0.8097±0.1132(2) 0.9603±0.1979(2) 2.0172±0.1248(4) 0.7495±0.0960(3) 1.1327±0.2353(4) 1.9838±0.0938(3) 0.4877±0.1922(4) 1.0001±0.2123(3) 1.2047±0.0471(1) 0.8177±0.0682(1) 0.9465±0.1183(1) 1.4689±0.0660(2)

MAPE(rank)

0.0009 0.0002 0.0023 0.0008

Motorcycle

ε-TSVR ε-SVR LSSVR MDR

0.7222±0.1421(2) 0.8677±0.1297(4) 3.1522±0.0776(3) 0.6459±0.1300(4) 0.6813±0.1422(2) 2.0734±0.2143(2) 0.6850±0.1601(3) 0.7181±0.1288(3) 2.2004±0.1706(4) 0.7775±0.0895(1) 0.6506±0.0931(1) 2.0697±0.1215(2)

0.0041 0.1625 0.0033 0.0084

Autoprice

ε-TSVR ε-SVR LSSVR MDR

0.7662±0.0946(4) 0.4058±0.0886(4) 1.0432±0.0543(4) 0.8346±0.0784(2) 0.2208±0.0584(3) 0.6581±0.0355(3) 0.8052±0.0754(3) 0.1478±0.0253(2) 0.6445±0.0219(2) 0.8634±0.0241(1) 0.1236±0.0147(1) 0.6432±0.0187(1)

0.0087 0.0064 0.0031 0.0103

Servo

ε-TSVR ε-SVR LSSVR MDR

0.8699±0.0718(3) 0.1941±0.0377(3) 0.7498±0.0403(4) 0.8329±0.0659(4) 0.1598±0.0370(2) 0.5150±0.0356(1) 0.9002±0.0597(2) 0.2005±0.0581(4) 0.7163±0.0459(3) 0.9654±0.0210(1) 0.1536±0.0309(1) 0.6807±0.0399(2)

0.0046 0.4495 0.0043 0.0129

Wisconsin

ε-TSVR ε-SVR LSSVR MDR

0.4299±0.1163(3) 1.0743±0.1186(2) 1.3511±0.0463(3) 0.3478±0.0653(4) 1.0020±0.0722(1) 1.0233±0.0306(1) 0.5351±0.1483(2) 1.3165±0.1651(4) 1.3665±0.0477(4) 0.5357±0.2000(1) 1.2255±0.1044(3) 1.0988±0.0407(2)

ED

M

AN US

NMSE(rank)

CR IP T

Table 6: Comparisons of ε-TSVR, ε-SVR, LSSVR and our MDR on regular scale data sets with Poly kernel.

0.0272 0.0032 0.0034 0.0168

0.8714±0.0915(2) 0.1490±0.0613(2) 1.8438±0.0821(3) 0.8311±0.1304(4) 0.4008±0.2163(4) 1.7974±0.1271(2) 0.8345±0.1130(3) 0.3981±0.2569(3) 1.8776±0.0727(4) 0.8820±0.0558(1) 0.1105±0.0243(1) 1.7466±0.0857(1)

0.0055 0.0198 0.0034 0.0203

ε-TSVR ε-SVR LSSVR MDR

0.9284±0.0186(4) 0.0736±0.0157(4) 0.2893±0.0060(4) 0.9390±0.0162(3) 0.0542±0.0088(1) 0.2787±0.0045(2) 0.9559±0.0192(2) 0.0576±0.0047(3) 0.2867±0.0050(3) 0.9643±0.0193(1) 0.0554±0.0074(2) 0.2737±0.0063(1)

0.0309 0.0338 0.0115 0.1335

ε-TSVR ε-SVR LSSVR MDR

CE

MachineCPU

AutoMpg

AC

0.9124±0.0715(1) 1.6502±0.1405(4) 1.9877±0.1055(4) 0.2329±0.0374(4) 0.9609±0.0710(1) 0.8664±0.0284(1) 0.5631±0.1815(2) 1.3305±0.1598(3) 1.3524±0.0539(3) 0.3388±0.0492(3) 1.1263±0.1055(2) 1.1004±0.0508(2)

0.0080 0.0047 0.0036 0.0169

PT

WisconsinBC ε-TSVR ε-SVR LSSVR MDR

CPU(sec)

average rank

ε-TSVR ε-SVR LSSVR MDR

2.5714 3.4286 2.7143 1.2857

3.1250 2.2500 3.000 1.6250

23

3.7500 1.8750 2.8750 1.5000

-

ACCEPTED MANUSCRIPT

339 340 341 342

on NMSE of corresponding parameters λ2 and C, indicating no significant influence. Fig.4(g) shows the influence of σ on NMSE and it means that the trend varies with different data sets, such as “Motorcycle” and “Autoprice”. Fig.4(b), Fig.4(d), Fig.4(f) and Fig.4(h) show the influence on CPU time of parameters λ1 , λ2 , C and σ.

CR IP T

338

Table 7: The optimal parameters on regular scale data sets with RBF kernel. Dataset ε-TSVR ε-SVR LSSVR MDR c1 = c2 c3 = c4 σ c σ c σ λ1 = λ2 c σ

2−3 21 2−9 2−9 2−5 2−4 2−7 2−5

23 2−3 25 20 28 27 27 21

27 20 23 27 27 25 23 23

2−5 22 2−6 21 2−8 2−7 2−7 2−4

27 25 23 24 23 21 24 22

25 2−2 27 22 28 25 29 21

2−5 2−9 22 20 2−4 2−7 2−2 24

25 27 29 29 27 27 29 29

AN US

2−5 2−6 2−7 2−6 2−4 2−6 2−3 2−4

M

Diabetes Motorcycle Autoprice Servo Wisconsin WisconsinBC MachineCPU AutoMpg

2−2 23 2−6 2−2 2−8 2−8 2−7 2−2

Table 8: The optimal parameters on regular scale data sets with Poly kernel.

ε-TSVR c1 = c2 c3 = c4 −8

2 2−3 20 2−5 2−2 24 2−7 2−3

−7

2 2−8 2−3 26 26 29 2−5 22

d 2 6 2 4 2 2 2 3

ε-SVR c d −3

2 23 22 25 2−1 2−4 22 2−5

AC

CE

PT

Diabetes Motorcycle Autoprice Servo Wisconsin WisconsinBC MachineCPU AutoMpg

ED

Dataset

343 344 345 346 347 348

5 6 2 4 3 4 3 6

LSSVR c d −8

2 2−3 2−5 2−6 2−8 2−8 25 2−4

3 6 2 5 2 2 2 3

MDR λ1 = λ2 c −6

2 25 21 2−1 2−4 2−5 26 20

−1

2 2−1 26 25 26 2−5 2−6 27

d 4 6 2 5 2 2 2 4

4.3. Large Scale Data Sets Tables 9 and 10 summarize the experimental results on 11 large scale data sets with linear kernel. All these data sets were normalized to zero mean and unit deviation. From the average rank at the bottom of Tables 9 and 10, the overall performance of MDR is superior or highly competitive to other compared methods. More specifically, MDR performs dramatically better 24

CR IP T

ACCEPTED MANUSCRIPT

(b) Motorcycle

(c) Autoprice

(e) Wisconsin

(f) WisconsinBC

AN US

(a) Diabetes

(g) MachineCPU

(d) Servo

(h) AutoMpg

Figure 3: The regression deviation mean and variance of MDR with RBF kernel on regular scale data sets.

353 354 355 356 357 358 359 360 361

M

AC

362

ED

352

PT

350 351

than other compared methods on 10 and 6 over 11 data sets for performance criteria R2 and MAPE, respectively. Although MDR achieves only the best NMSE on 4 over 11 data sets compared with other four methods, it is not the worst on the other 7 data sets. Fig. 5(b) shows the CPU time of linear MDR on large scale data sets, and it is highly competitive to ε-TSVR, ε-SVR and LSSVR, and comparable with LIBLINEAR. The optimal parameters on 11 large scale data sets are listed in Table 11. Further, three large scale data sets with specified test set were provided to test, which have high dimension of features. The details of data sets were listed in Table 3, and the last two data sets are sparse sets. Since ε-TSVR, εSVR and LSSVR have high computation costs, we only compared our linear MDR with LIBLINEAR. The results are given in Table 12. It can be seen that our MDR performs better than LIBLINEAR on 2 over 3 data sets for R2 . For other performance criterions, MDR is comparable with LIBLINEAR, with a fast learning speed. Table 13 lists the optimal parameters on these three data sets.

CE

349

363 364

25

CR IP T

ACCEPTED MANUSCRIPT

(b) Influence of parameter λ1 on time

M

AN US

(a) Influence of parameter λ1 on NMSE

(d) Influence of parameter λ2 on time

CE

PT

ED

(c) Influence of parameter λ2 on NMSE

(f) Influence of parameter C on time

AC

(e) Influence of parameter C on NMSE

26 (g) Influence of parameter σ on NMSE

(h) Influence of parameter σ on time

Figure 4: All:The influence of λ1 and λ2 on NMSE and time on regular scale data sets with RBF kernel.

ACCEPTED MANUSCRIPT

R2 (rank)

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.6665 0.8203 0.8424 0.6257 0.8599

Abalone

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.6247 0.5534 0.5922 0.5126 0.6265

abalone (scale)

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.5138 0.5481 0.5555 0.5387 0.6178

bank8fh

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.7178 0.7859 0.7258 0.7420 0.8899

cpusmall

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.7464 0.7311 0.6651 0.7374 0.7888

cpusmall (scale)

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.8373 0.8532 0.6754 0.7365 0.8699

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.2858 0.3245 0.3196 0.3305 0.3315

PT

CE

Bike

average rank ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

AC

NMSE(rank)

± ± ± ± ±

0.1179(4) 0.5111 ± 0.1109(3) 0.5701 ± 0.0947(2) 0.5572 ± 0.0297(5) 0.4003 ± 0.0635(1) 0.4999 ±

± ± ± ± ±

0.0494(5) 0.5067 0.0499(3) 0.5152 0.0317(2) 0.5030 0.0188(4) 0.4801 0.0305(1) 0.4779

± ± ± ± ±

± ± ± ± ±

0.0619(2) 0.5267 ± 0.0442(4) 0.5251 ± 0.0291(3) 0.5124 ± 0.0205(5) 0.4868 ± 0.0226(1) 0.5132 ±

MAPE(rank)

0.0495(3) 2.2493 ± 0.0641(5) 2.3755 ± 0.0462(4) 2.3916 ± 0.0187(1) 1.9662 ± 0.0750(2) 2.0265 ± 0.0387(5) 3.7255 0.0180(4) 3.4396 0.0167(2) 3.6543 0.0151(1) 3.4040 0.0659(3) 3.3262

0.0310(5) 0.2666 0.0284(2) 0.2738 0.0225(4) 0.2667 0.0161(3) 0.2607 0.0215(1) 0.2603

± ± ± ± ± ± ± ± ± ±

0.0165(4) 3.4426 0.0188(5) 3.4187 0.0117(3) 3.5739 0.0148(2) 3.4559 0.0672(1) 3.2898 0.0056(3) 2.3249 0.0065(5) 2.5529 0.0015(4) 2.3664 0.0049(2) 2.3411 0.0189(1) 2.2163

0.1658 0.0610 0.0436 0.0006 0.0011

± ± ± ± ±

0.1027(5) 0.1226(3) 0.0968(4) 0.0081(2) 0.4585(1)

3.1778 3.5160 1.1190 0.0021 0.0044

0.1188(3) 0.1210(2) 0.1050(5) 0.0074(4) 0.3532(1)

1.6736 64.1820 0.9930 0.0025 0.0044

± ± ± ± ±

0.0712(2) 0.0698(5) 0.0591(4) 0.0037(3) 0.8939(1)

7.5381 1.5519 5.3512 0.0031 0.0082

± ± ± ± ±

± ± ± ± ±

0.1282(2) 0.3462 ± 0.0339(4) 5.2208 ± 0.4072(4) 0.1444(4) 0.4590 ± 0.0384(5) 3.6823 ± 0.0596(1) 0.0539(5) 0.3230 ± 0.0108(3) 5.2031 ± 0.2817(3) 0.0437(3) 0.2937 ± 0.0113(1) 6.1691 ± 0.0096(5) 0.0380(1) 0.3173 ± 0.0249 4.6351(2) ± 0.8357

± ± ± ± ±

0.0174(5) 0.6765 0.0134(3) 0.7202 0.0100(4) 0.6757 0.0046(2) 0.6664 0.0144(1) 0.6473

± ± ± ± ±

0.1014(3) 0.3666 ± 0.1090(2) 0.4758 ± 0.0529(5) 0.3203 ± 0.0447(4) 0.2900 ± 0.0652(1) 0.4084 ±

3.7143 3.0000 3.2857 4.0000 1.0000

± ± ± ± ±

0.0362(3) 5.5668 ± 0.2614(4) 0.0436(5) 3.7185 ± 0.0508(1) 0.0117(2) 5.2580 ± 0.3372(3) 0.0101(1) 6.1661 ± 0.0082(5) 0.1071(4) 5.0433 ± 1.4498(2) 0.0066(4) 2.1755 ± 0.0503(2) 0.0077(5) 1.8959 ± 0.0431(1) 0.0027(3) 2.2549 ± 0.0383(3) 0.0066(2) 2.2875 ± 0.0014(5) 0.0094(1) 2.2578 ± 0.4987(4)

3.7143 4.8571 3.0000 1.4286 2.0000

27

CPU(sec)

0.1799(3) 0.2123(4) 0.2462(5) 0.0161(1) 0.7575(2)

AN US

ConcreteCS

M

regressor

ED

Dataset

CR IP T

Table 9: Comparisons of ε-SVR, ε-TSVR,LSSVR and our MDR on large scale data sets

3.2857 2.4286 3.7143 3.7143 1.8571

7.7388 2.0502 5.3547 0.0063 0.0099(2) 7.7821 1.9375 5.3137 0.0064 0.0098 13.4827 4.3324 11.3286 0.0060 0.0118 -

(b) CPU time on large data sets

AN US

(a) CPU time on regular data sets

CR IP T

ACCEPTED MANUSCRIPT

Figure 5: CPU time on regular and large data sets with RBF kernel.

Table 10: Comparisons of ε-SVR, ε-TSVR,LS-SVR and our MDR on large scale data sets R2 (rank)

Dataset

regressor

driftdataset

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.9218 0.8350 0.8433 0.9280 0.9474

cadata

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

0.6301 0.6623 0.6376 0.6363 0.7093

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

AC

spatialnetwork

0.0373(3) 0.1674 ± 0.0362(5) 0.1723 ± 0.0163(4) 0.1619 ± 0.0179(2) 0.1195 ± 0.0236(1) 0.2150 ±

average rank ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

± ± ± ± ±

73.7589 255.3798 64.1550 0.0150 0.0265

N/A(5) N/A(5) N/A(5) 0.4372 ± 0.0122(1) 0.7685 ± 0.0077(3) 1.6074 ± 0.0264(4) 0.2992 ± 0.0082(3) 0.7296 ± 0.0054(2) 1.4027 ± 0.0185(2) 0.2819 ± 0.0027(4) 0.7179 ± 0.0031(1) 1.3963 ± 0.0009(1) 0.4276 ± 0.1885(2) 0.8410 ± 0.2226(4) 1.4927 ± 0.2336(3)

N/A(5) 439.2588 1488.3350 0.0405 0.0831

M

0.0403(4) 0.0366(2) 0.0399(5) 0.0050(3) 0.5873(1)

± ± ± ± ±

0.0222(5) 0.3703 0.0188(2) 0.3796 0.0128(3) 0.3705 0.0085(4) 0.3660 0.0206(1) 0.3642

± ± ± ± ±

0.0371(3) 0.8852 0.0271(4) 0.7840 0.0114(2) 0.9211 0.0159(1) 0.8289 0.0491(5) 0.7837

CPU(sec) 21.7462 41.2956 22.3623 0.3259 0.1790

ED

ε-TSVR ε-SVR LSSVR LIBLINEAR MDR

MAPE(rank) 0.0110(4) 0.0114(2) 0.0109(5) 0.0009(3) 0.0553(1)

PT

CE

CASP

± ± ± ± ±

NMSE(rank)

0.0057(3) 2.0284 0.0061(5) 1.8931 0.0023(4) 2.0332 0.0036(2) 2.0230 0.0177(1) 1.8732

± ± ± ± ±

N/A(5) N/A(5) N/A(5) N/A(5) N/A(5) N/A(5) N/A(5) N/A(5) N/A(5) 0.0136 ± 0.0001(2) 0.9867 ± 0.0003(1) 2.4266 ± 0.0003(2) 0.0144 ± 0.0003(1) 0.9903 ± 0.0009(2) 2.2681 ± 0.1225(1) 4.5000 3.2500 3.7500 3.0000 1.2500

4.0000 4.2500 3.2500 1.2500 3.0000

28

4.5000 3.2500 4.2500 2.2500 1.5000

N/A N/A N/A 0.1208 0.8178 -

ACCEPTED MANUSCRIPT

Table 11: The optimal parameters on large scale data sets. ε-SVR c

LSSVR c

LIBLINEAR c

2−3 2−7 2−2 2−3 2−2 2−3 2−4 2−9 2−5 -

27 29 27 2−5 2−4 23 2−2 2−6 24 23 -

27 23 20 2−3 2−6 2−6 2−5 2−7 2−2 21 -

20 2−6 2−3 22 26 22 2−8 2−8 20 22 24

21 2−9 2−4 2−4 21 2−1 26 25 22 -

λ1

MDR λ2 c

CR IP T

ConcreteCS Abalone abalone(scale) bank8fh cpusmall cpusmall(scale) Bike driftdataset cadata CASP spatialnetwork

ε-TSVR c1 = c2 c3 = c4

20 2−6 20 24 20 20 20 2−6 24 22 27

AN US

Dataset

2−6 22 21 25 21 21 22 2−3 26 23 27

24 29 28 25 2−1 2−2 21 24 28 25 27

Table 12: Comparisons data sets in Table 3. R2

regressor

MSD

LIBLINEAR MDR

0.2107±0.0000(1) 0.7707±0.0000(1) 0.1301±0.0000(2) 0.1946±0.0003(2) 0.8462±0.0002(2) 0.1397±0.0000(2)

NMSE

4.9590 4.9483

TFIDF-2006

LIBLINEAR MDR

0.6627±0.0000(2) 0.8835±0.0000(1) 0.1822±0.0000(1) 0.7137±0.0014(1) 0.8882±0.0012(2) 0.1830±0.0002(2)

1.6344 1.8546

LOG1P-2006

LIBLINEAR 0.7128±0.0000(2) 0.8113±0.0000(1) 0.1718±0.0000(2) MDR 0.7383±0.1291(1) 0.8957±0.0967(2) 0.1760±0.0069(1)

76.5066 16.2212

average rank

LIBLINEAR MDR

ED

1.6667 1.3333

CE

PT

MAPE

M

Dataset

1.0000 2.0000

1.3333 1.6667

AC

Table 13: The optimal parameters on data sets in Table 3.

Dataset MSD TFIDF-2006 LOG1P-2006

LIBLINEAR c

λ1

MDR λ2

c

28 28 2−5

2−7 2−4 2−9

2−7 2−4 2−9

21 23 26

29

CPU(sec)

-

ACCEPTED MANUSCRIPT

365

5. Conclusions

379

Acknowledgment

368 369 370 371 372 373 374 375 376 377

AN US

367

CR IP T

378

In this paper, by introducing the regression deviation mean and the regression deviation variance into regression, we have proposed a robust minimum deviation distribution machine for large scale regression. Our MDR is not only robust to different deviation distributions but also achieves structural risk minimization. In addition, two algorithms are proposed for the linear and nonlinear MDR. Experiments on benchmark data sets and large scale data sets show that MDR is more robust than traditional SVRs, with comparable learning speed. The corresponding MDR MATLAB codes can be downloaded from: http://www.optimal-group.org/Resources/Code/MDR.html. However, several parameters need to be tuned in MDR, and the design of a proper parameter selection scheme is left for future work. Furthermore, it would also be interesting to extend the regression deviation mean and the regression deviation variance to other problems.

366

388

References

384 385 386

389 390

[1] NR Draper and H Smith, Applied regression analysis, 3rd Edition, Wiley-Interscience publication, 1998. [2] CJC Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl Discovery 2(2) (1998) 121-167.

AC

391

ED

383

PT

382

CE

381

M

387

The authors thank the editors and the anonymous reviewers, whose invaluable comments helped improve the presentation of this paper substantially. This work is supported by the National Natural Science Foundation of China (No. 11501310, 61603338, 11371365, 11426202), the Zhejiang Provincial Natural Science Foundation of China (No. LY15F030013, LQ17F030003, LY16A010020), Inner Mongolia Natural Science Foundation of China (No. 2015BS0606) and the Fundamental Research Funds for the Central Universities (No. DUT14RC(3)024, DUT16RC(4)67).

380

392

393 394

395 396

[3] V Christianini and J Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2002. [4] CW Hsu and CJ Lin, A comprasion of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13(2002)415-425. 30

ACCEPTED MANUSCRIPT

397 398 399

[5] NY Deng, YJ Tian and CH Zhang, Support vector Machine: Optimization Based Theory, Algorithms, and Extensions, Chapman and Hall/CRC, 2012.

402

[7] VN Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

408 409 410

411 412 413

414 415 416 417

418 419

420

[9] DL Liu, Y Shi, YJ Tian, Ramp loss nonparallel support vector machine for pattern classification, Knowledge-Based Systems, 85:224-233, 2015. [10] YH Shao, CH Zhang, ZM Yang, L Jing and NY Deng, An ε-twin support vector machine for regression, Neural Computing and Applications, 23:175-185, 2013. [11] M Tanveer, K Shubham, M Aldhaifallah, et al. An efficient regularized k-nearest neighbor based weighted twin support vector regression, Knowledge-Based Systems, 94:70-87, 2016. [12] JAK Suykens, L Lukas, P van Dooren, B De Moor and J Vandewalle, Least squares support vector machine classifiers: a large scale algorithm. In: Proceedings of European conference of circuit theory design, pp 839842, 1999. [13] JAK Suykens and J Vandewalle, Least squares support vector machine classifiers, Neural Process Letter, 9(3):293-300, 1999. [14] K Wang, P Zhong, Robust non-convex least squares loss function for regression with outliers, Knowledge-Based Systems, 71:290-302, 2014.

AC

421

AN US

407

M

406

ED

405

[8] T Zhang and ZH Zhou, Large margin distribution learning, In: Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discoverty and Data Mining (KDD’14), New York, NY, pp.313-322, 2014.

PT

404

CE

403

CR IP T

401

[6] VN Vapnik, The Natural of Statistical Learning Theory, Springer, New York,1995.

400

422 423 424

425 426

[15] R TIBSHIRANI, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), 267-288, 1996. [16] B Schl¨olkopf and AJ Smola, A tutorial on support vector regression, Statistics and Computing, 14(3):199-222, 2004. 31

ACCEPTED MANUSCRIPT

433

434 435 436

437 438 439

440 441 442

443 444

445 446

447 448 449

450

[20] CC Chang and CJ Lin, LIBSVM: A library for support vector machines, ACM Transctions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

[21] S Shalev-Shwartz, Y Singer and N Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, In Proceedings of the 24th International Conference on Machine Learning, pp.807-814, Helsinki, Finland, 2007. [22] RE Fan, KW Chang, CJ Hsieh, XR Wang and CJ Lin, LIBLINEAR: A library for large scale linear classification, Journal of Machine Learning Research, 9:1871-1874, 2008. [23] PY Hao, New support vector algorithms with parametric insensitive margin model, Neural Networks,23(1):60-73, 2010. [24] XJ Peng, TSVR: an efficient twin support vector machine for regression, Neural Networks, 23(3):365-372, 2010. [25] ZM Yang, XY Hua, YH Shao, et al, A novel parametric-insensitive nonparallel support vector machine for regression, Neurocomputing,171:649663, 2016. [26] W Gao and ZH Zhou, On the doubt about margin explanation of boosting, Artificial Intellegence,199-200:22-44, 2013.

AC

451

CR IP T

432

[19] JC Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research, Technical Report MSR-TR-98-14, 1998.

AN US

431

[18] C Cortes and VN Vapnik, Support vector networks, Machine Learning, 20(3):273-297, 1995.

M

430

ED

429

PT

428

[17] B Schl¨olkopf and AJ Smola, Learning with kernels, MIT Press, Cambridge, 2002.

CE

427

452 453

454 455 456

[27] T Zhang and ZH Zhou, Optimal margin distribution machine, arXiv preprint arXiv:1604.03348, 2016. [28] FY Cheng, J Zhang and CH Wen, Cost-Sensitive Large margin Distribution Machine for classification of imbalanced data, Pattern Recognition Letters, 80: 107-112, 2016. 32

ACCEPTED MANUSCRIPT

463 464

465 466 467

468 469 470

471 472 473

474 475 476

477 478

479 480 481

[31] B Sun, HY Chen and JD Wang, An empirical margin explanation for the effectiveness of DECORATE ensemble learning algorithm, KnowledgeBased Systems,78: 1-12, 2015. [32] YH Shao, WJ Chen and NY Deng, Nonparallel hyperplane support vector machine for binary classification problems, Information Sciences, 263: 22-35, 2014.

[33] YF Ye, YH Shao, NY Deng, CN Li and XY Hua, Robust Lp-norm least squares support vector regression with feature selection, Applied Mathematics and Computation, 305: 32-52, 2017. [34] Z Wang, YH Shao, L Bai and NY Deng, Twin support vector machine for clustering, IEEE transactions on neural networks and learning systems, 26(10): 2583-2588, 2015. [35] WJ Chen, YH Shao, CN Li and NY Deng, MLTSVM: a novel twin support vector machine to multi-label learning, Pattern Recognition, 52: 61-74, 2016. [36] D Anguita, A Ghio, S Pischiutta and S Ridella, A support vector machine with integer parameters, Neurocomputing, 72(1): 480-489,2008. [37] L Oneto, A Ghio, S Ridella and D Anguita, Learning Resource-Aware Classifiers for Mobile Devices: From Regularization to Energy Efficiency, Neurocomputing, 169: 225-235, 2015. [38] L Oneto, JL Ortiz and D Anguita, Constraint-Aware Data Analysis on Mobile Devices: An Application to Human Activity Recognition on Smartphones, Adaptive Mobile Computing, 127-149,2017.

AC

482

CR IP T

462

AN US

461

[30] Y Wang, G Ou, W Pang, et al, e-Distance Weighted Support Vector Regression, arXiv preprint arXiv:1607.06657, 2016.

M

460

ED

459

PT

458

[29] YH Zhou and Zh Zhou, Large margin distribution learning with cost interval and unlabeled data, IEEE Transactions on Knowledge and Data Engineering, 28(7): 1749-1763, 2016.

CE

457

483 484

485 486

[39] GX Yuan, CH Ho and CJ Lin, Recent advances of larger-scale linear classification, Proceedings of The IEEE, 100(9):2584-2603, 2012.

33

ACCEPTED MANUSCRIPT

492 493

494 495 496

497 498

499 500 501

502 503

504 505 506

507

[41] BT Polyak, AB Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization, 30(4):838-855, 1992. [42] L Bottou, Large-scale machine learning with stochastic gradient descent, In Proceedings of the 19th International Conference on Computational Statistics, pages 177-186, Paris, France, 2010.

[43] HJ Kushner and GG Yin, Stochastic Approximation and Recursive Algorithms and Applications, Springer, New York, 2nd edition, 2003. [44] A Luntz and V Brailovsky, On estimation of characters obtained in statistical procedure of recognition, Tehnicheskaya Kibernetica, 3, 1969. (in Russian). [45] XJ Peng, D. Xu and J.D. Shen, A twin projection support vector machine for data regression, Neurocomputing, 138(2014) 131-141. [46] JAK Suykens, T Van Gestel, J De Brabanter, B De Moor, J Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002 (ISBN 981-238-151-1) [47] CH Ho and CJ Lin, Large-scale linear support vector regression, Journal of Machine Learning Research, 13: 3323-3348, 2012.

AC

CE

508

CR IP T

491

AN US

490

M

489

ED

488

[40] CJ Hsieh, KW Chang, CJ Lin, Keerthi SS and Sundararajan S, A dual coordinate descent method for large-scale linear SVM, In Proceedings of the 25th International Conference on Machine Learing, pages 408-415, Helsinki, Finland, 2008.

PT

487

34