Some Sets of Orthogonal Polynomial Kernel Functions

Accepted manuscript, to appear in Applied Soft Computing.
Authors: Meng Tian, Wenjian Wang
PII: S1568-4946(17)30492-1; DOI: http://dx.doi.org/10.1016/j.asoc.2017.08.010; Reference: ASOC 4400
Received: 30-11-2016; Revised: 23-6-2017; Accepted: 2-8-2017

Please cite this article as: Meng Tian, Wenjian Wang, Some Sets of Orthogonal Polynomial Kernel Functions, Applied Soft Computing (2017), http://dx.doi.org/10.1016/j.asoc.2017.08.010. This is a PDF file of an unedited manuscript that has been accepted for publication. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights
1. Two new kinds of orthogonal polynomial kernels are proposed.
2. The construction methods of several sets of orthogonal polynomial kernels are compared.
3. The similarities and differences among these kernels are highlighted.
4. The experimental results reveal the effectiveness of orthogonal polynomial kernels.
5. A guide to the usage of orthogonal polynomial kernels is given for practical applications.


Some Sets of Orthogonal Polynomial Kernel Functions

Meng Tian a,b, Wenjian Wang a,∗

a School of Computer and Information Technology, Shanxi University, Taiyuan 030006, PR China
b School of Science, Shandong University of Technology, Zibo 255049, PR China

Abstract


Kernel methods provide high performance in a variety of machine learning tasks. However, the success of kernel methods depends heavily on selecting the right kernel function and properly setting its parameters. Several sets of kernel functions based on orthogonal polynomials have been proposed recently. Besides their good performance in terms of error rate, these kernel functions have only one parameter, chosen from a small set of integers, which greatly facilitates kernel selection. Two sets of orthogonal polynomial kernel functions, namely the triangularly modified Chebyshev kernels and the triangularly modified Legendre kernels, are proposed in this study. Furthermore, we compare the construction methods of several orthogonal polynomial kernels and highlight the similarities and differences among them. Experiments on 32 data sets are performed for better illustration and comparison of these kernel functions in classification and regression scenarios. In general, there are differences among these orthogonal polynomial kernels in terms of accuracy, and most orthogonal polynomial kernels can match the commonly used kernels, such as the polynomial kernel, the Gaussian kernel and the wavelet kernel. Compared with these universal kernels, the orthogonal polynomial kernels each have a single, easily optimized parameter, and they store statistically significantly fewer support vectors in support vector classification. The newly presented kernels can obtain better generalization performance for both classification and regression tasks.

Keywords: Kernel selection, Orthogonal polynomial kernel, Classification, Regression, Generalization

∗ Corresponding author. Tel.: +86 351 7017566; Fax: +86 351 7018176. Email addresses: [email protected] (Meng Tian), [email protected] (Wenjian Wang).


1. Introduction

Kernel methods [1] have recently received considerable attention in the machine learning community because of their sound mathematical justification, lower error rate, and relatively fast training time compared to other learning methods, such as artificial neural networks and decision trees, in solving nonlinear problems. Supervised algorithms (such as classification and regression) and unsupervised algorithms (such as clustering and feature extraction) both benefit largely from the tremendous growth of kernel methods. These methods include the support vector machine (SVM) [2], support vector clustering [3], generalized discriminant analysis (GDA) [4], kernel principal component analysis [5], kernel canonical correlation analysis [6], and many others. Conceptually, kernel methods work by nonlinearly mapping the data points in the input space to a higher, or possibly infinite, dimensional feature space, so that linear algorithms built in the feature space implement nonlinear counterparts in the input space. These approaches are called kernel methods due to their dependence on kernel functions. This operation simplifies the computation of the inner product of the implicitly mapped points in the feature space. Mercer's theorem provides a necessary and sufficient characterization of a function as a kernel function. There are several commonly used kernel functions [1, 2, 7], such as the linear kernel (K_Lin), the polynomial kernel (K_Pol), the Gaussian kernel (K_Gau), and the wavelet kernel (K_Wav):

• K_{Lin}(x, z) = \langle x, z \rangle;
• K_{Pol}(x, z) = (\langle x, z \rangle + 1)^n;
• K_{Gau}(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right);
• K_{Wav}(x, z) = \prod_{i=1}^{n} \cos\left(1.75\,\frac{x_i - z_i}{\alpha}\right)\exp\left(-\frac{\|x - z\|^2}{2}\right).

As choosing a kernel function implies defining the mapping from the input space to the feature space, the performance of a kernel method depends heavily on determining the right type and suitable parameter (known as hyperparameter) settings of the kernel function. In general, the improved performance of a kernel function implies the effectiveness of the kernel in defining similarities and capturing distinctions between data points in the feature space.
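For concreteness, the following minimal sketch (assuming NumPy; the helper names are illustrative, not taken from the paper) evaluates these four reference kernels for a pair of input vectors.

```python
import numpy as np

def k_lin(x, z):
    # Linear kernel <x, z>
    return float(np.dot(x, z))

def k_pol(x, z, n=3):
    # Polynomial kernel (<x, z> + 1)^n
    return float((np.dot(x, z) + 1.0) ** n)

def k_gau(x, z, sigma=1.0):
    # Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))
    return float(np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(z)) ** 2 / (2.0 * sigma ** 2)))

def k_wav(x, z, alpha=1.0):
    # Wavelet kernel: product of cosine factors times a Gaussian envelope
    x, z = np.asarray(x, float), np.asarray(z, float)
    return float(np.prod(np.cos(1.75 * (x - z) / alpha)) *
                 np.exp(-np.linalg.norm(x - z) ** 2 / 2.0))
```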



However, choosing a proper kernel function or optimizing the hyperparameter is a crucial problem. Part of the reason is that the mapping function cannot be directly defined or accessed. To remedy this problem, kernel selection has become a new trend in machine learning over the past few years. It aims to select an optimal kernel, or its hyperparameters, to best describe the nature of the underlying data. The approaches can be divided into three categories. The first approach determines the parameters of a single commonly used kernel by using a grid search technique or by optimizing a quality function such as a generalization error bound [8], kernel target alignment [9], or the Fisher discriminant criterion [10]. In this category of methods, the kernel used is often predefined, fixed and independent of the input data, and the goal is to find appropriate hyperparameters for the kernel function. Basically, the Gaussian kernel is the most commonly used kernel; as its parameter is embedded in the exponential part, the tuning of the Gaussian kernel parameter is not easy [11, 12, 13]. The second approach employs multiple kernel learning (MKL), which adopts multiple kernels instead of selecting one specific kernel function and its hyperparameters to improve performance [14, 15]. A common implementation of multiple kernels is


k_\eta(x_i, x_j) = f_\eta\left(\{k_h(x_i^h, x_j^h)\}_{h=1}^{P}\right),


where P is the number of base kernels, and the kernel functions \{k_h : \mathbb{R}^{m_h} \times \mathbb{R}^{m_h} \to \mathbb{R}\}_{h=1}^{P} take m_h-dimensional feature representations of the data instances. The combination function f_\eta : \mathbb{R}^P \to \mathbb{R} can be a linear function, a nonlinear function, or a data-dependent combination. MKL pursues the optimal combination of a group of kernels. It solves for the combination coefficients and the kernel algorithm associated with the combined kernel simultaneously, or by using a two-step alternating optimization procedure. A comprehensive tutorial on MKL classifiers has been published recently [16]. Though using multiple kernels instead of a single one may improve accuracy in practice, MKL needs more time for training and testing to obtain the kernel combination coefficients. The third approach is designing new kernel functions. Genton [17] listed the spectral representation of various classes of kernels and gave a new formula to construct kernels. Basak [18] proposed a long-tailed kernel function based on the Cauchy distribution and observed that the Cauchy kernel



is useful with constrained least squares kernel machines for some data sets. Zhang and Wang [19] presented a kernel function based on the Lorentzian function, and this kernel can obtain better generalization performance. However, tuning the parameters of these proposed kernels is not a trivial task. Daoud and Turabieh [20] proposed some empirical nonparametric kernels for support vector classification. Having no parameter simplifies kernel selection effectively; however, the performance of these kernels varies from one data set to another. As a compromise, constructing an easily optimized kernel may be a better choice. Classical orthogonal polynomials, which arise as solutions to differential equations related to the hypergeometric equations, are powerful mathematical tools. They have been proven to be very helpful in many applied problems [21, 22, 23]. These orthogonal polynomials have good uniform proximity and orthogonality, and these properties have attracted the attention of many researchers in kernel function design [2, 24, 25, 26, 27]. Vapnik [2] depicted an n-dimensional Hermite polynomial kernel for the approximation of real-valued functions. Based on the Chebyshev polynomials of the first kind, Ye et al. [25] constructed the orthogonal Chebyshev kernel. Ozer et al. [26] further proposed a set of generalized Chebyshev kernels, which extended the orthogonal Chebyshev kernel from single-variable input to vector input. Furthermore, the modified Chebyshev kernels, which replace the weighting function with an exponential function, have also been proposed. By combining Chebyshev polynomials of the first kind and the second kind, Zhao et al. [27] proposed a new sequence of orthogonal polynomials, namely the unified Chebyshev polynomials, and built on these new polynomials, the unified Chebyshev kernels have been constructed. Among these kernel functions, the orthogonal Chebyshev kernel and the generalized Chebyshev kernel can be optimized easily because they both have only one kernel parameter, chosen from a small set of integers. Some orthogonal polynomial kernels thus provide competitive performance while their parameters can be optimized easily, which is an appealing feature for a kernel function. To facilitate kernel optimization, the triangularly modified Chebyshev kernel and the triangularly modified Legendre kernel are proposed in this paper. Here we do not attempt a full treatment of all existing orthogonal polynomial kernels; rather, we present a somewhat biased point of view, illustrating the main ideas by drawing mainly from the work of the authors referred to above. Three key points of discussion are: which factors may contribute to a better-performing orthogonal kernel;


which orthogonal polynomial kernels can match or outperform the others; and whether, compared with the well-known kernels, the orthogonal kernel functions can provide competitive performance. In this paper, the construction methods of these orthogonal kernel functions are shown, and the similarities and differences among them are highlighted. Experiments are performed on synthetic and real data sets for better illustration and comparison of these orthogonal polynomial kernels in classification and regression scenarios. We give an overall comparison of these orthogonal polynomial kernels by using statistical analysis methods, and discuss the possibility of using these kernels as general kernels. The results may be of practical significance for the use of orthogonal polynomial kernels. The remaining part of this paper is organized as follows. Section 2 provides a brief description of some orthogonal polynomial kernels and proposes two sets of new orthogonal polynomial kernels. Section 3 categorizes the orthogonal polynomial kernels and highlights their similarities and differences; the construction methods of some sets of orthogonal polynomial kernels are expressed for better comparison. Section 4 describes the experimental study and analyses the results obtained. Finally, Section 5 concludes this article.

2. Some sets of orthogonal polynomial kernel functions

Two functions are called orthogonal if their inner product \langle f, g \rangle is zero for f \neq g. A typical definition of an inner product for functions is

\langle f, g \rangle = \int w(x) f(x) g(x)\, dx,

with appropriate integration boundaries, where w(x) is a weighting function. There are many sets of orthogonal functions with varying weighting functions and integration boundaries. To better understand the situation, we now give a brief description of four sets of orthogonal polynomials. For clarity, the characteristics of these orthogonal polynomial functions are summarized in Table 1. Subsequently, some sets of orthogonal polynomial kernels will be illustrated.

2.1. Some sets of existing orthogonal polynomial kernels

Based on the orthogonal polynomials given in Table 1, some sets of kernel functions will be described in the following discussion.


Table 1: List of orthogonal polynomial functions.

Legendre polynomials: weighting function 1; integration boundary [−1, 1]; recurrence P_0(x) = 1, P_1(x) = x, P_s(x) = \frac{2s-1}{s} x P_{s-1}(x) - \frac{s-1}{s} P_{s-2}(x).

Chebyshev polynomials (the 1st kind): weighting function \frac{1}{\sqrt{1-x^2}}; integration boundary [−1, 1]; recurrence T_0(x) = 1, T_1(x) = x, T_s(x) = 2x T_{s-1}(x) - T_{s-2}(x).

Unified Chebyshev polynomials: weighting function \sqrt{1-x^2}; integration boundary [−1, 1]; recurrence F_0(x) = 1, F_1(x) = (a+1)x, F_2(x) = (a^2+a+2)x^2 - 1, F_s(x) = (a+2)x F_{s-1}(x) - (2ax^2+1) F_{s-2}(x) + ax F_{s-3}(x).

Hermite polynomials: weighting function e^{-x^2}; integration boundary (−∞, +∞); recurrence H_0(x) = 1, H_1(x) = 2x, H_s(x) = 2x H_{s-1}(x) - 2(s-1) H_{s-2}(x).
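As a small illustration of how these families are evaluated in practice, the sketch below (NumPy assumed; the function names are ours, not the paper's) generates the Chebyshev and Legendre terms up to order n directly from the recurrences in Table 1.

```python
import numpy as np

def chebyshev_terms(x, n):
    # T_0..T_n of the first kind via T_s = 2 x T_{s-1} - T_{s-2}
    x = np.asarray(x, float)
    terms = [np.ones_like(x), x]
    for s in range(2, n + 1):
        terms.append(2.0 * x * terms[s - 1] - terms[s - 2])
    return terms[: n + 1]

def legendre_terms(x, n):
    # P_0..P_n via P_s = ((2s - 1) x P_{s-1} - (s - 1) P_{s-2}) / s
    x = np.asarray(x, float)
    terms = [np.ones_like(x), x]
    for s in range(2, n + 1):
        terms.append(((2.0 * s - 1.0) * x * terms[s - 1] - (s - 1.0) * terms[s - 2]) / s)
    return terms[: n + 1]
```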

(a) The Hermite kernels

Vapnik [2] gave a kernel for the regularized expansion on m-dimensional Hermite polynomials. Firstly, a kernel for one-dimensional Hermite polynomials was obtained:

K(x, z) = \sum_{i=0}^{\infty} q^i H_i(x) H_i(z) = \frac{1}{\sqrt{\pi(1 - q^2)}} \exp\left( \frac{2xzq}{1+q} - \frac{(x - z)^2 q^2}{1 - q^2} \right),

where q is a convergence factor and 0 ≤ q ≤ 1. By increasing the order i, the kernel K(x, z) approaches the δ-function. Based on the theorem about the basis functions in a multidimensional set of functions, the m-dimensional Hermite polynomial kernels can be formulated as [2]

K_{Her}(x, z) = \prod_{k=1}^{m} \frac{1}{\sqrt{\pi(1 - q^2)}} \exp\left( \frac{2 x_k z_k q}{1+q} - \frac{(x_k - z_k)^2 q^2}{1 - q^2} \right) = \frac{1}{(1 - q^2)^{m/2}} \exp\left( \frac{2q \langle x, z \rangle}{1+q} - \frac{q^2 \|x - z\|^2}{1 - q^2} \right).   (1)
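A direct transcription of Eq. (1) might look as follows; this is a sketch only, assuming NumPy and inputs that have already been scaled suitably.

```python
import numpy as np

def hermite_kernel(x, z, q=0.5):
    # Product over dimensions of the one-dimensional closed form in Eq. (1), 0 < q < 1
    x, z = np.asarray(x, float), np.asarray(z, float)
    per_dim = np.exp(2.0 * x * z * q / (1.0 + q) - (x - z) ** 2 * q ** 2 / (1.0 - q ** 2))
    per_dim /= np.sqrt(np.pi * (1.0 - q ** 2))
    return float(np.prod(per_dim))
```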

Vapnik proposed that the Hermite kernel (K_Her) can be seen as a semi-local kernel, since it is the product of a composite function of the linear kernel, which defines a "global" approximation, and the Gaussian kernel, which defines a "local" approximation.

(b) The Chebyshev kernels

Ye et al. [25] constructed m-dimensional Chebyshev kernels by replacing the infinite series with a partial sum sequence. This operation cleverly

avoids the pure algebraic manipulation of infinite series. Firstly, they constructed a kernel for one-dimensional Chebyshev polynomials of the first kind:

K(x, z) = \frac{\sum_{i=1}^{n} T_i(x) T_i(z)}{\sqrt{1 - xz}}, \quad n = 1, 2, \ldots,

where n is the highest order of the Chebyshev polynomials utilized in the kernel. Following Vapnik [2], the m-dimensional Chebyshev kernels are constructed as

K_{Che}(x, z) = \prod_{k=1}^{m} \frac{\sum_{i=0}^{n} T_i(x_k) T_i(z_k)}{\sqrt{1 - x_k z_k}}, \quad n = 1, 2, \ldots.   (2)

The Chebyshev kernel (K_Che) chooses a convergence factor, namely \frac{1}{\sqrt{1 - x_k z_k}}, similar to the corresponding weighting function (see Table 1), and this convergence factor depends only on the input data. Thus this set of Chebyshev kernels has only one parameter n, selected from a small set of integers. Once the polynomial order n changes, the mathematical formulation of the kernel function changes correspondingly. Note that, when the value of \sqrt{1 - x_k z_k} is very close to zero, the kernel value may become extremely large. This can badly affect the Hessian matrix and make K_Che suffer from ill-posed problems. Furthermore, when the number of dimensions m is large, as a result of the multiplication, K_Che starts to have very small kernel values at the off-diagonal entries of the kernel matrix. This may cause the classifier to perform badly in test accuracy. To overcome these problems, some researchers set the denominator as \sqrt{1 - x_k z_k + 0.002} in experiments [25, 26, 27]. According to Table 1, the Legendre polynomials are orthogonal with respect to the weight 1 on the interval [−1, 1]. Thus the Legendre kernels can be formally expressed as [28]:

K_{Leg}(x, z) = \prod_{k=1}^{m} \sum_{i=0}^{n} P_i(x_k) P_i(z_k), \quad n = 1, 2, \ldots.   (3)
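The coordinate-wise kernels (2) and (3) can be sketched as follows (NumPy assumed; the term helpers mirror the Table 1 recurrences, and the small constant added inside the square root follows the guard mentioned above for K_Che).

```python
import numpy as np

def _terms(x, n, family):
    # Chebyshev (first kind) or Legendre terms up to order n, evaluated coordinate-wise
    t = [np.ones_like(x), np.asarray(x, float)]
    for s in range(2, n + 1):
        if family == "chebyshev":
            t.append(2.0 * x * t[-1] - t[-2])
        else:
            t.append(((2.0 * s - 1.0) * x * t[-1] - (s - 1.0) * t[-2]) / s)
    return t[: n + 1]

def chebyshev_kernel(x, z, n=3, eps=0.002):
    # Eq. (2): product over coordinates of sum_i T_i(x_k) T_i(z_k) / sqrt(1 - x_k z_k)
    x, z = np.asarray(x, float), np.asarray(z, float)
    num = sum(a * b for a, b in zip(_terms(x, n, "chebyshev"), _terms(z, n, "chebyshev")))
    return float(np.prod(num / np.sqrt(1.0 - x * z + eps)))

def legendre_kernel(x, z, n=3):
    # Eq. (3): product over coordinates of sum_i P_i(x_k) P_i(z_k), no denominator
    x, z = np.asarray(x, float), np.asarray(z, float)
    num = sum(a * b for a, b in zip(_terms(x, n, "legendre"), _terms(z, n, "legendre")))
    return float(np.prod(num))
```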

Compared with the Chebyshev kernels, the omission of the denominator facilitates the calculation of the kernel matrix and, more significantly, avoids the ill-posed problems caused by the denominator approaching zero. Pan et al. [28] pointed out that the Legendre polynomial kernels (K_Leg) can construct the


separating hyperplane with fewer support vectors and less running time in experiments. Reducing the number of support vectors can reduce the execution time in the testing phase [29].

(c) The generalized Chebyshev kernels

As kernel functions are defined as the inner product of two given vectors in the high-dimensional feature space, and many applications in machine learning require multidimensional vector inputs, Ozer et al. [26] extended the previous work in [25]. They applied kernel functions to vector inputs directly instead of applying them to each input element. Firstly, the generalized Chebyshev polynomials are defined as


T_0(x) = 1, \quad T_1(x) = x, \quad T_n(x) = 2x T_{n-1}^{T}(x) - T_{n-2}(x), \quad n = 2, 3, 4, \ldots,


where x is a row vector and T_{n-1}^{T}(x) is the transpose of T_{n-1}(x). Based on the generalized Chebyshev polynomials, the generalized Chebyshev kernels are defined as

K_{G-Che}(x, z) = \frac{\sum_{i=0}^{n} T_i(x) T_i^{T}(z)}{\sqrt{m - xz^T}},   (4)

where n is the order of the generalized Chebyshev polynomial, and x and z are m-dimensional row vectors. Generalizing the Chebyshev polynomials is a reasonable and valuable exploration, and different kernels can be derived based on the generalized Chebyshev polynomials. Experimental results showed that the generalized Chebyshev kernel is more robust with respect to the kernel parameter n than the Chebyshev kernel, and the kernel approaches the minimum support vector number for support vector classification. Following [26], the generalized Legendre polynomial kernel (K_G-Leg), based on the generalized Legendre polynomials, is proposed [30]:

K_{G-Leg}(x, z) = \sum_{i=0}^{n} P_i(x) P_i^{T}(z),   (5)

where

P_0(x) = 1, \quad P_1(x) = x, \quad P_n(x) = \frac{2n-1}{n} x P_{n-1}^{T}(x) - \frac{n-1}{n} P_{n-2}(x), \quad n = 2, 3, \ldots.

Note that the two generalized orthogonal polynomial kernels follow the format

K(x, z) = (a_{x,z} + b_{x,z}\, xz^T)\, w,


where a_{x,z} and b_{x,z} are constants depending on x and z, and w is \frac{1}{\sqrt{m - xz^T}} and 1 for the generalized Chebyshev kernels and the generalized Legendre kernels, respectively. The generalized polynomial yields a row vector if the polynomial order n is an odd number; otherwise, it yields a scalar value. Because of this, compared with the (n−1)th order generalized kernels, only a_{x,z} changes when n is an odd number; otherwise, only b_{x,z} changes. Because w = 1, the generalized Legendre kernels are first-order polynomial kernels with variable coefficients. For the generalized Chebyshev kernels,

w = \frac{1}{\sqrt{m - xz^T}} = \frac{1}{\sqrt{m}}\left[1 + \frac{xz^T}{2d} + \ldots + \frac{(2n-1)!!\,(xz^T)^n}{2^n n!\, d^n} + \ldots\right], \quad n = 1, 2, \ldots.
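A sketch of the vector-input construction in Eqs. (4) and (5) is given below, assuming NumPy; the generalized terms alternate between scalars and row vectors exactly as described above, and K_G-Leg follows the same pattern with the Legendre coefficients and w = 1.

```python
import numpy as np

def generalized_chebyshev_terms(x, n):
    # T_0 = 1 (scalar), T_1 = x (row vector), T_s = 2 x T_{s-1}^T - T_{s-2}
    x = np.asarray(x, float)
    terms = [1.0, x]
    for _ in range(2, n + 1):
        prev, prev2 = terms[-1], terms[-2]
        if np.ndim(prev) == 1:
            terms.append(2.0 * float(np.dot(x, prev)) - prev2)   # scalar result
        else:
            terms.append(2.0 * x * prev - prev2)                 # row-vector result
    return terms[: n + 1]

def generalized_chebyshev_kernel(x, z, n=3):
    # Eq. (4): sum_i T_i(x) T_i(z)^T divided by sqrt(m - x z^T)
    x, z = np.asarray(x, float), np.asarray(z, float)
    tx, tz = generalized_chebyshev_terms(x, n), generalized_chebyshev_terms(z, n)
    s = sum(float(np.dot(np.atleast_1d(a), np.atleast_1d(b))) for a, b in zip(tx, tz))
    return s / np.sqrt(len(x) - float(np.dot(x, z)))
```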


174 175

d

176

Obviously, the generalized Chebyshev kernels have more complex expressions compared with the generalized Legendre kernels, so the generalized Chebyshev kernels can depict more abundant nonlinear information [30]. (d) The exponentially modified Chebyshev kernels Since an exponential function (the Gaussian kernel) can capture local information along the decision surface better than the square root function can, Ozer et al. [26] replaced the weighting function √ 1 T with Gaussian

M

173

an

1 1 xzT (2n − 1)!!(xzT )n w=√ = √ [1+ +. . .+ +. . .], n = 1, 2, . . . . 2d 2n n!dn m m − xzT

m−xz

177 178 179 180

181 182

Ac ce p

te

kernel exp(−γ||x − z||2 ), and defined the exponentially modified nth order Chebyshev kernels as ∑n Ti (x)TiT (z) KExp−Che (x, z) = i=0 , (6) exp(γ||x − z||2 ) where n is the Chebyshev polynomial order and γ is the decaying parameter. The test results showed that the exponentially modified Chebyshev kernels (KExp−Che ) provide better classification performance in tests compared with the generalized Chebyshev kernels. Similarly, the exponentially modified Legendre kernels (KExp−Leg ) are defined as follows [31]: ∑n Pi (x)PiT (z) KExp−Leg (x, z) = i=0 . (7) exp(γ||x − z||2 ) And the only difference between these two sets of exponentially modified orthogonal polynomial kernels, by comparing (6) with (7), is the basis 9

Page 10 of 33

188 189 190 191 192 193

ip t

187

cr

186

us

185

= = = =

1, (a + 1)x, (a2 + a + 2)xxT − 1, T (a + 2)xFsT (x) − (2axxT + 1)Fs−1 (x) + axFs−2 (x), s = 2, 3, . . . .

d

F0 (x) F1 (x) F2 (x) Fs+1 (x)

an

184

function. Note that, the exponentially modified orthogonal polynomial kernels are actually the product of the well-known Gaussian kernel and the corresponding generalized orthogonal polynomial kernels (without weighting function). Similar to the Hermite kernel, the modified generalized orthogonal polynomial kernels can be seen as semi-local kernels. Having two parameters, n and γ, makes the optimization of these two kernels more difficult to exploit than that of the generalized orthogonal polynomial kernels. Though the selection of γ of Gaussian kernel is not trivial, Sedat et al. [26] pointed that the modified Chebyshev kernel is less sensitive to the change in γ value. In the following experiments, the value of γ is fixed as same as [26], i.e., γ = 1. (e) The unified Chebyshev kernel The unified generalized Chebyshev polynomials are generated in the same way of the general Chebyshev polynomials, and they can be defined as

M

183

194 195 196 197 198

199 200 201 202 203 204

Ac ce p

te

Based on the obtained polynomial functions above, Zhao et al. [27] derived the unified Chebyshev kernels by the construction method of the generalized Chebyshev kennels. The expression of the unified Chebyshev kernel (KU −Che ) is ∑n Fi (x)FiT (z) √ KU −Che (x, z) = i=1 . (8) m − xzT Note that KU −Che has the same weighting function √ 1 as the genm−xzT

eralized Chebyshev kernel, rather than the corresponding weighting function √ √ m − xzT . In fact, m − xzT is not a valid kernel. So Zhao et al. [27] defined the unified Chebyshev kernel as (8) may be out of consideration for a more powerful kernel capturing the nonlinearity along the decision surface. 2.2. Two sets of new orthogonal polynomial kernels By importing a triangular kernel, two sets of triangularly modified orthogonal polynomial kernels are proposed. Triangular kernel [32, 33], which is basically an affine function of the Euclidean distance between the points in the original space, is expressed as K(x, z) = (1 − ||x − z||/λ)+ . The ()+ forces this mapping to be positive, and ensures this expression to be a kernel. 10

Page 11 of 33

ip t

The triangularly modified Chebyshev kernels (K_Tri-Che) are defined as follows:

K_{Tri-Che}(x, z) = \left(1 - \frac{\|x - z\|}{\lambda}\right)_+ \sum_{i=0}^{n} T_i(x) T_i^{T}(z).   (9)


And the triangularly modified Legendre kernels (K_Tri-Leg) are listed below:

K_{Tri-Leg}(x, z) = \left(1 - \frac{\|x - z\|}{\lambda}\right)_+ \sum_{i=0}^{n} P_i(x) P_i^{T}(z).   (10)

Here let \lambda = \max\{d(x_i, \bar{x}) \mid \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,\ x_i \in X\}, where X is a finite sample set and N is the number of samples; thus all the data live in a ball of radius λ. Since the parameter λ depends only on the input data, the triangularly modified orthogonal polynomial kernels have a unique parameter chosen from a small set of integers. It is easy to verify that these kernels satisfy the Mercer conditions.
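The proposed kernels can be sketched directly from Eqs. (9) and (10): compute λ once from the training set and multiply the generalized polynomial sum by the triangular weight. The code below is a minimal illustration (NumPy assumed; poly_sum stands for the generalized Chebyshev or Legendre inner-product sum sketched earlier).

```python
import numpy as np

def radius_lambda(X):
    # lambda = max_i ||x_i - x_bar||: every sample lies in a ball of this radius
    X = np.asarray(X, float)
    return float(np.max(np.linalg.norm(X - X.mean(axis=0), axis=1)))

def triangularly_modified_kernel(x, z, lam, poly_sum):
    # Eqs. (9) and (10): (1 - ||x - z|| / lambda)_+ times sum_i T_i(x) T_i(z)^T (or P_i)
    x, z = np.asarray(x, float), np.asarray(z, float)
    weight = max(0.0, 1.0 - np.linalg.norm(x - z) / lam)
    return weight * poly_sum(x, z)
```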

3. A summarization of orthogonal polynomial kernels


Ten orthogonal kernels are categorized into a summary table (see Table 2). Three key properties of these orthogonal kernel functions, namely the construction method, the form of the weighting function, and the kernel parameter, are identified in order to obtain a meaningful categorization. These orthogonal polynomial kernels are arranged so as to make the similarities and differences among them clear; in particular, the advantages and weaknesses of the kernel functions are highlighted. To intuitively show the difference between the two construction methods, namely the tensor product of the inner product of iterative expressions and the partial sum of the inner product of generalized polynomials, schematic diagrams of the second order Legendre kernel and the second order generalized Legendre kernel are given as examples in Fig. 1. The Legendre kernel is based on the tensor product of the inner product of iteration expressions. Given the polynomial order n, the corresponding orthogonal polynomials up to the nth order for each ith element of x (x_i) can be obtained. There exists an (n+1)-dimensional vector whose kth element is the (k−1)th order Legendre polynomial of x_i (k ≤ n+1); the obtained vector is called the n-order Legendre iterative vector of x_i. Thus, the inner product of the n-order Legendre iterative vectors of the corresponding



elements of the samples can be computed. After the tensor product of these inner product values is calculated, the Legendre kernel for m-dimensional samples is obtained.


Table 2: List of orthogonal polynomial kernels and their significance.
Construction methods: Method I, partial sum of the inner product of generalized polynomials; Method II, tensor product of the limit of the inner product; Method III, tensor product of the inner product of iteration expressions.

K_G-Che [26]: Method I; weighting function 1/\sqrt{m - xz^T}; parameter n ∈ N. Advantage: semi-parametric, semi-local kernel. Weakness: the weighting function decays slowly.
K_G-Leg [30]: Method I; weighting function 1; parameter n ∈ N. Advantage: semi-parametric kernel, easy to calculate. Weakness: the weighting function decays very slowly.
K_Her [2]: Method II; parameter q ∈ (0, 1). Advantage: preserves the orthogonality; semi-local kernel. Weakness: the parameter can not be optimized easily.
K_Tri-Che (proposed): Method I; weighting function (1 - ||x - z||/λ)_+; parameter n ∈ N. Advantage: semi-local kernel; the weighting function decays fast; the parameter can be optimized easily.
K_Tri-Leg (proposed): Method I; weighting function (1 - ||x - z||/λ)_+; parameter n ∈ N. Advantage: semi-local kernel; the weighting function decays fast; the parameter can be optimized easily.
K_Exp-Che [26]: Method I; weighting function exp(−γ||x − z||^2); parameters n ∈ N, γ > 0. Advantage: semi-local kernel; the weighting function decays fast. Weakness: having two parameters.
K_Exp-Leg [31]: Method I; weighting function exp(−γ||x − z||^2); parameters n ∈ N, γ > 0. Advantage: semi-local kernel; the weighting function decays fast. Weakness: having two parameters.
K_Che [25]: Method III; weighting function 1/\sqrt{1 - xz}; parameter n ∈ N. Advantage: semi-parametric kernel. Weakness: suffers from ill-posed problems caused by high dimension and by the denominator being close to 0.
K_Leg [28]: Method III; weighting function 1; parameter n ∈ N. Advantage: semi-parametric kernel. Weakness: suffers from ill-posed problems caused by high dimension.
K_U-Che [27]: Method I; weighting function 1/\sqrt{m - xz^T}; parameters n ∈ N, a > 0. Advantage: semi-local kernel. Weakness: having two parameters; one parameter can not be optimized easily.


On the other hand, the construction method of the generalized Legendre kernel, namely the partial sum of the inner product of generalized polynomials, differs from that of the Legendre kernel. It takes the whole vector as the object of study and exploits the generalized Legendre polynomials. Based on x, the


n + 1 corresponding generalized Legendre polynomial vectors can be obtained easily. The n-order generalized Legendre kernel is then obtained by summing the inner products of the corresponding generalized Legendre polynomial vectors.

Figure 1: Schematic diagrams of the Legendre kernel and the generalized Legendre kernel, n = 2. (a) The 2nd order Legendre kernel: product over coordinates of 1 + x_k z_k + (\frac{3}{2}x_k^2 - \frac{1}{2})(\frac{3}{2}z_k^2 - \frac{1}{2}). (b) The 2nd order generalized Legendre kernel: 1 + xz^T + (\frac{3}{2}xx^T - \frac{1}{2})(\frac{3}{2}zz^T - \frac{1}{2}).
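To make the contrast between the two construction methods concrete, the following sketch (our notation, NumPy assumed) writes out the two second-order kernels of Fig. 1 explicitly.

```python
import numpy as np

def legendre_kernel_order2(x, z):
    # Fig. 1(a), Method III: product over coordinates of 1 + x_k z_k + P_2(x_k) P_2(z_k)
    x, z = np.asarray(x, float), np.asarray(z, float)
    p2 = lambda t: 1.5 * t ** 2 - 0.5
    return float(np.prod(1.0 + x * z + p2(x) * p2(z)))

def generalized_legendre_kernel_order2(x, z):
    # Fig. 1(b), Method I: 1 + x z^T + (1.5 x x^T - 0.5)(1.5 z z^T - 0.5)
    x, z = np.asarray(x, float), np.asarray(z, float)
    return float(1.0 + np.dot(x, z) +
                 (1.5 * np.dot(x, x) - 0.5) * (1.5 * np.dot(z, z) - 0.5))
```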


The schematic diagram of the construction method of the Hermite kernel, namely the tensor product of the limit of the inner product, is not given here. In fact, this method is similar to that of the Legendre kernel; the difference is that the iterative vector of x_i has infinite order. For the one-dimensional Hermite kernel, the inner product of two pairs of elements can be expressed in a compact form. However, this does not always hold for other orthogonal

kernels. Therefore, the two construction methods shown in Fig. 1 have a much wider application range than this method.
Fig. 2 shows the output of these ten kinds of orthogonal polynomial kernel functions for various kernel parameters, where x changes within the range [−0.999, 0.999] and z is fixed at a constant value, i.e., z = 0.64. For the modified generalized orthogonal polynomial kernels, γ = 1; for the unified Chebyshev kernel, a = 0.2. One can observe that these kernels can be used in kernel methods for the similarity purpose. Unlike the Gaussian kernel, these orthogonal polynomial kernels alter their shapes based on the input values, and the shapes of these kernels are not symmetric around x.

4. Experiments

In this section, experiments are conducted to evaluate the performance of these orthogonal polynomial kernel functions for both classification and regression tasks with four artificial data sets and twenty-eight standard data sets [34, 35]. The specifications of these data sets are listed in Table 3.

Table 3: Data sets for experiments.

Classification:
No.  Data set        Features  Samples
1    A1a             123       1605
2    Australian      14        690
3    Checkerboard    2         200
4    German          24        1000
5    Heart           13        270
6    Ionosphere      34        351
7    Liverdisorder   6         345
8    Monks1          6         432
9    Monks2          6         432
10   Monks3          6         432
11   Ringnorm        20        1000
12   Sonar           60        208
13   Spiraldata      2         186
14   Splice          60        1000
15   Twonorm         20        1000
16   Wdbc            30        569

Regression:
No.  Data set     Features  Samples
17   Airfoil      5         1503
18   Bodyfat      13        252
19   Concrete     8         1030
20   Eunite2001   15        367
21   Forestfire   12        517
22   Gabor        2         1296
23   Housing      12        506
24   Mexican2     1         401
25   Mexican3     2         324
26   Mg           5         1385
27   Mpg          6         392
28   Pyrim        27        74
29   Triazine     59        186
30   Wankara      10        321
31   Winered      11        1599
32   Yacht        6         308

Ten sets of orthogonal polynomial kernels mentioned above are implemented. Besides these kernels, four kinds of commonly used kernel functions (K_Lin, K_Pol, K_Gau, and K_Wav) are selected as references. The values of the parameters of these kernels are chosen according to the list in Table 4.


Figure 2: The x-value vs. the kernel output for a fixed value z = 0.64, for 10 different orthogonal polynomial kernels: (a) K_G-Che(x, 0.64), (b) K_G-Leg(x, 0.64), (c) K_Her(x, 0.64), (d) K_Tri-Che(x, 0.64), (e) K_Tri-Leg(x, 0.64), (f) K_Exp-Che(x, 0.64), (g) K_Exp-Leg(x, 0.64), (h) K_Che(x, 0.64), (i) K_Leg(x, 0.64), (j) K_U-Che(x, 0.64). (Plots not reproduced; curves are shown for n = 1, 3, 5, 7, 9, and for K_Her for q = 0.1, 0.3, 0.5, 0.7, 0.9.)


Table 4: The values corresponding to kernel parameters in the experiments.

Kernel      Parameter(s)   Values
K_Lin       (none)         (none)
K_Pol       n              1, 2, ..., 9
K_Gau       σ              2^{-4}, 2^{-3}, ..., 2^{4}
K_Wav       α              2^{-4}, 2^{-3}, ..., 2^{4}
K_G-Che     n              1, 2, ..., 9
K_G-Leg     n              1, 2, ..., 9
K_Her       q              0.1, 0.2, ..., 0.9
K_Tri-Che   n              1, 2, ..., 9
K_Tri-Leg   n              1, 2, ..., 9
K_Exp-Che   n; γ           1, 2, ..., 9; γ = 1
K_Exp-Leg   n; γ           1, 2, ..., 9; γ = 1
K_Che       n              1, 2, ..., 9
K_Leg       n              1, 2, ..., 9
K_U-Che     n; a           1, 2, 3, 4; 0.2, 0.6, 1, 1.4

Seen from Table 4, K_Lin has no parameter, while K_Exp-Che, K_Exp-Leg, and K_U-Che require two parameters to be chosen. Evaluating each combination of the corresponding parameters would be quite time consuming, so in the experiments γ = 1 is used for K_Exp-Che and K_Exp-Leg, and a small value set is assigned to the parameter a of K_U-Che. Each of K_Che, K_Leg, K_G-Che, K_G-Leg, K_Tri-Che, and K_Tri-Leg has only one parameter, chosen from a set of integers. As normalization of the data is inevitable for K_Che, K_G-Che, and K_U-Che, all the data sets are normalized within the interval [−1, 1] in advance. The data normalization method used here is the same as in [26]: the normalized value of each element is defined as x_new = 2(x_i − Min_i)/(Max_i − Min_i) − 1, where Min_i is the minimum value of the ith elements among all the possible input vectors and Max_i is the maximum value among all the ith elements. All the compared calculations are carried out using Matlab (V2010, The MathWorks, Inc.), the SVM toolbox developed by Gunn (1997) from http://www.isis.ecs.soton.ac.uk/isystems/kernel/, and the Statistical Pattern Recognition Toolbox by V. Franc (2008) from http://cmp.felk.cvut.cz/cmp/software/stprtool/index.html. All experiments are conducted on a PC with a 2.93 GHz CPU and 2 GB RAM.
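A sketch of this normalization step (assuming NumPy; the handling of constant columns is our own convention, not specified in the paper):

```python
import numpy as np

def normalize_to_minus1_plus1(X):
    # x_new = 2 (x_i - Min_i) / (Max_i - Min_i) - 1, applied column-wise
    X = np.asarray(X, float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    X_new = 2.0 * (X - mins) / span - 1.0
    return np.where(maxs > mins, X_new, 0.0)         # constant columns mapped to 0
```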


4.1. Experimental methodology

In order to evaluate the performance, a strategy pointed out and adopted in [16] is used here. The experimental methodology is as follows. Given


a data set, if learning and test sets are not supplied separately, a random two-thirds is reserved as the learning set and the remaining one-third is used as the test set. If the learning set has more than 1000 data instances, it is resampled using 5×2 cross-validation to generate 10 training and validation sets with stratification. If the number of instances in the learning set is less than 1000, 10-fold cross-validation is applied on the benchmark data sets to evaluate the generalization ability of these kernels. Two kernel-based methods, GDA and SVM, are applied to each dichotomous classification data set, and ε-support vector regression (ε-SVR) is performed for the regression tasks. In the experiments, GDA is followed by a k-Nearest Neighbor (k-NN) classifier to perform the recognition, and the parameter k in k-NN is set to 1. The common regularization parameter C of SVM is chosen within the given set {0.01, 0.1, 1, 10, 100}. For regression targets, ε is fixed at 0.1 for all the simulations. Various groups of parameters ((n, C), (α, C), (σ, C), (q, C) and (n, t, C)) are tested. For classification targets, the configuration with the lowest misclassification error is chosen first. When two pairs of parameters produce the same misclassification error, the one with the lowest average support vector percentage is picked for SVM; if two parameter configurations have both the same misclassification error and the same support vector percentage, the one with the shortest training time is chosen. When performing the pattern recognition procedure with GDA+KNN, the configuration with the shortest running time is chosen when two pairs of parameters achieve the highest accuracy simultaneously. Similarly, for the regression target, the best parameter configuration is the one with the lowest average mean square error (MSE), the highest Willmott's index of agreement (WIA), and the shortest training time; these three indexes are in decreasing order of priority on the validation folds. The averages and standard deviations of the test error of SVM and GDA+KNN, the support vector percentage of SVM, and the mean square error and Willmott's index of agreement of SVR are measured. On each data set, these indexes are compared by using the paired t-test according to the resampling scheme used. The significance level is taken as 0.05 for all statistical tests. In the following tables, a superscript a refers to the performance values of K_Lin; using this kernel corresponds to running the original algorithm in the input space. An overlined a and an underlined a denote that the compared kernel has a statistically significantly higher or lower average value than K_Lin, respectively. Similarly, the superscripts b, c and d refer to the performance values of K_Pol, K_Gau, and K_Wav, respectively.

Overlined b, c and d denote that the compared kernel has a statistically significantly higher average value than K_Pol, K_Gau, and K_Wav, respectively, and underlined b, c and d denote that the compared kernel has a statistically significantly lower average value than K_Pol, K_Gau, and K_Wav, respectively. After comparing the performances of these kernel functions for each data set, an overall comparison on the 16 data sets is given by using the nonparametric Friedman test on rankings, with Tukey's honestly significant difference criterion as the post-hoc test [36]. In the Friedman test, missing data are substituted by an assumed output in Tables 5-9, i.e., all of the missing test error, support vector percentage, MSE, and 1−WIA values are fixed at 1.
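The selection procedure described above can be sketched as a grid search over (n, C) with cross-validation; the snippet below assumes scikit-learn and a user-supplied kernel-matrix builder, and only illustrates the primary criterion (validation error) with the support-vector ratio as the first tie-breaker.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def select_n_and_C(kernel_matrix, X, y, orders=range(1, 10), Cs=(0.01, 0.1, 1, 10, 100)):
    # kernel_matrix(A, B, n) must return the Gram matrix K(A, B) for polynomial order n
    X, y = np.asarray(X), np.asarray(y)
    best_key, best_pair = None, None
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for n in orders:
        for C in Cs:
            errors, sv_ratios = [], []
            for tr, va in cv.split(X, y):
                svc = SVC(C=C, kernel="precomputed")
                svc.fit(kernel_matrix(X[tr], X[tr], n), y[tr])
                errors.append(1.0 - svc.score(kernel_matrix(X[va], X[tr], n), y[va]))
                sv_ratios.append(len(svc.support_) / float(len(tr)))
            key = (np.mean(errors), np.mean(sv_ratios))
            if best_key is None or key < best_key:
                best_key, best_pair = key, (n, C)
    return best_pair
```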

4.2. Experimental results

(a) Classification experiments

Table 5 and Table 6 summarize the classification performance of all compared kernel functions for SVM and GDA+KNN, respectively, and Table 7 gives the support vector percentages of these kernels on the 16 data sets. Firstly, the kernel functions based on the same construction method and weighting function but different orthogonal polynomials are compared. Subsequently, the winning orthogonal polynomial kernels are compared with the four classical kernels. Table 5 shows that there is no significant difference between the two sets of triangularly modified generalized kernels, K_Tri-Che and K_Tri-Leg; the same holds for the difference between K_Exp-Che and K_Exp-Leg. However, compared with K_G-Leg, K_G-Che provides better test accuracy on 10 out of 16 data sets. In particular, K_G-Che outperforms K_G-Leg by more than nineteen percent on Checkerboard, Monks2, and Monks3, which implies that K_G-Che can capture more local information along the decision surface than K_G-Leg. The above results show that the impact of the weighting function is more prominent than that of the basis function for these kernels based on generalized orthogonal polynomials. Note that K_Che has no results on 4 data sets, perhaps for two reasons: first, the square root in the denominator of K_Che causes ill-posed problems; and second, the tensor product of coordinate-wise basis functions leads to poor test accuracy for high-dimensional vectors. The latter reason may also be responsible for the failure of K_Leg on A1a and the poor performance of K_Leg on Splice. After ruling out the four data sets on which K_Che has no results, K_Leg presents better performance on 9 out of 12 data sets compared with K_Che in terms of test accuracy. This result indicates that K_Leg is a better and more robust kernel function than K_Che.




Table 5: Test error of different kernels on data sets for SVM (mean(%)± std). KP ol

KGau

KG−Leg

KW av

18.43 ± 0.0148 18.13 ± 0.0291 19.26 ± 0.0354ad 26.62 ± 0.0103 21.56 ± 0.0393 11.63 ± 0.0092ab 31.48 ± 0.0503a 36.53 ± 0.0658ab 24.65 ± 0.1235a 0.28 ± 0.0088a 3.71 ± 0.0049abd 7.14 ± 0.0363a 43.07 ± 0.0833d 17.01 ± 0.0229abd 3.68 ± 0.0047a 3.00 ± 0.0145b KHer KT ri−Che

19.65 ± 0.0199b 17.96 ± 0.0242 13.43 ± 0.0307abc 27.31 ± 0.0202 22.22 ± 0.0502 11.54 ± 0.0147ab 31.48 ± 0.0510a 43.61 ± 0.1275 25.84 ± 0.2023a 1.53 ± 0.0395a 7.13 ± 0.0140abc 10.14 ± 0.0626a 28.71 ± 0.1050abc 17.49 ± 0.0236ac 3.98 ± 0.0083a 2.48 ± 0.0079b KT ri−Leg

ip t

17.98 ± 0.0139d 17.26 ± 0.0334 24.93 ± 0.1017ad 27.28 ± 0.0184 24.00 ± 0.0457 13.85 ± 0.0248acd 33.30 ± 0.0685a 50.42 ± 0.4128c 28.96 ± 0.0717a 0.28 ± 0.0088a 4.91 ± 0.0102acd 8.57 ± 0.0316a 48.22 ± 0.0656d 19.64 ± 0.0401ac 3.59 ± 0.0101a 5.26 ± 0.0096acd

cr

KLin 18.24 ± 0.0114 18.13 ± 0.0575 48.06 ± 0.0118bcd 27.46 ± 0.0082 20.33 ± 0.0229 20.34 ± 0.0332bcd 41.57 ± 0.1016bcd 47.77 ± 0.4152c 59.79 ± 0.0543bcd 19.65 ± 0.0066bcd 23.08 ± 0.0132bcd 23.14 ± 0.0534bcd 44.19 ± 0.0296d 30.00 ± 0.0114bcd 5.15 ± 0.0056bcd 3.32 ± 0.0093b KG−Che

us

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

an

Data set

18.92 ± 19.22 ± 0.0263 27.91 ± 0.0724acd 26.50 ± 0.0142 23.44 ± 0.0353 13.76 ± 0.0163acd 31.22 ± 0.0787a 38.26 ± 0.0826a 21.04 ± 0.0840a 1.67 ± 0.0243a 3.83 ± 0.0109ad 13.86 ± 0.1098a 0.16 ± 0.0051abcd 18.56 ± 0.0176a 5.00 ± 0.0172bc 3.16 ± 0.0061b KExp−Che

18.28 ± 0.0167 18.70 ± 0.0306 52.39 ± 0.0333abcd 28.05 ± 0.0188c 19.44 ± 0.0150b 14.96 ± 0.0167acd 32.87 ± 0.0850a 46.25 ± 0.0477c 40.21 ± 0.0862abc 32.08 ± 0.0881abcd 3.23 ± 0.0083abd 16.43 ± 0.0576abc 0 ± 0abcd 23.26 ± 0.0128abcd 3.86 ± 0.0065a 3.63 ± 0.0254 KExp−Leg

19.42 ± 0.0152 17.91 ± 0.0280 20.00 ± 0.0473ad 27.18 ± 0.0176 23.44 ± 0.0349a 13.33 ± 0.0177acd 35.22 ± 0.0885 34.86 ± 0.0894ab 22.29 ± 0.1292a 0.97 ± 0.0132a 4.82 ± 0.0067acd 8.14 ± 0.0316a 41.45 ± 0.1231d 23.14 ± 0.0201acd 3.68 ± 0.0080a 3.42 ± 0.0114bd KChe

18.69 ± 0.0236 16.22 ± 0.0158d 21.05 ± 0.0500ad 26.59 ± 0.0192 23.44 ± 0.0322 11.97 ± 0.0107a 29.83 ± 0.0393a 39.45 ± 0.1002a 30.69 ± 0.0627a 6.04 ± 0.0528abc 3.14 ± 0.0060abd 7.28 ± 0.0207a 0.32 ± 0.0102abcd 18.35 ± 0.0151a 4.37 ± 0.0103 4.05 ± 0.0124bd KLeg

18.71 ± 0.0256 17.22 ± 0.0194d 22.39 ± 0.0977ad 25.66 ± 0.0121abd 24.00 ± 0.2966 11.79 ± 0.0097b 28.35 ± 0.0387a 35.28 ± 0.1181a 30.90 ± 0.0589a 5.76 ± 0.0331abcd 2.87 ± 0.0035abcd 7.14 ± 0.0393a 1.77 ± 0.0246abcd 17.87 ± 0.0134a 5.18 ± 0.0087bcd 3.79 ± 0.0155bd KU −Che

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

21.23 ± 0.0180abcd 19.00 ± 0.0169 28.21 ± 0.0395acd 27.78 ± 0.0083c 23.78 ± 0.0119a 13.17 ± 0.0072acd 32.78 ± 0.0594a 48.61 ± 0.1181c 7.22 ± 0.0573abcd 2.15 ± 0.0442a 3.08 ± 0.0081abd 7.71 ± 0.0193a 0 ± 0abcd 33.38 ± 0.0097abcd 4.55 ± 0.0056abc 4.00 ± 0.0097abcd

21.35 ± 0.0179abcd 18.52 ± 0.0219 26.27 ± 0.0728acd 27.75 ± 0.0063c 24.33 ± 0.0110ac 13.42 ± 0.0081acd 32.00 ± 0.0454a 48.61 ± 0.0969c 6.67 ± 0.0544abcd 2.08 ± 0.0440a 3.98 ± 0.0099ad 7.71 ± 0.0193a 0.16 ± 0.0051abcd 33.23 ± 0.0098abcd 4.37 ± 0.0081a 3.89 ± 0.0139bd

— — 17.61 ± 0.0549abd — 30.00 ± 0.0203abcd 14.79 ± 0.0264acd 34.87 ± 0.0297 49.86 ± 0.0044c 26.18 ± 0.2747a 0 ± 0a 3.20 ± 0.0035abcd 23.86 ± 0.0302bcd 30.81 ± 0.0923abc — 4.52 ± 0.0118 5.00 ± 0.0112acd

— 17.74 ± 0.0093 17.46 ± 0.0290ad 26.59 ± 0.0116 21.56 ± 0.0472 11.88 ± 0.0110a 31.48 ± 0.0390a 47.22 ± 0.1070c 9.44 ± 0.1452abcd 1.81 ± 0.0444a 3.00 ± 0.0088abd 14.00 ± 0.0147abc 33.87 ± 0.0986abc 61.20 ± 0.0092abcd 3.89 ± 0.0097a 5.05 ± 0.0079acd

18.43 ± 0.0109d 18.39 ± 0.0293 27.31 ± 0.0503acd 27.37 ± 0.0306 24.11 ± 0.0355a 13.93 ± 0.0198acd 28.35 ± 0.0778a 31.81 ± 0.0561abcd 23.19 ± 0.0720a 1.39 ± 0.0245a 2.87 ± 0.0089abcd 8.71 ± 0.0326a 38.71 ± 0.0684abd 18.80 ± 0.0169acd 3.89 ± 0.0068a 3.42 ± 0.0028bd

Ac ce p

te

d

M

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

0.0165b

19

Page 20 of 33

Table 6: Test error of different kernels on data sets for GDA+KNN (mean(%)± std). KP ol

KGau

KG−Leg 22.24 ± 0.0401cd 21.99 ± 0.0174 46.42 ± 0.0362bcd 31.52 ± 0.0208d 23.34 ± 0.0374d 12.57 ± 0.0389cd 38.43 ± 0.0268 45.64 ± 0.0447d 22.01 ± 0.1253ab 44.97 ± 0.0807abcd 2.93 ± 0.0106ab 19.00 ± 0.0579bcd 19.68 ± 0.1708abc 24.79 ± 0.0204bcd 4.69 ± 0.0078d 7.78 ± 0.0223acd KExp−Leg

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

24.34 ± 0.1098abd 23.61 ± 0.0148abd 37.02 ± 0.0720acd 27.07 ± 0.0038abcd 31.11 ± 0.0301abc 8.72 ± 0.0138a 39.39 ± 0.0470c 44.80 ± 0.0572d 3.89 ± 0.0468abcd 5.70 ± 0.0745ad 4.52 ± 0.0098abcd 28.77 ± 0.0656abcd 2.10 ± 0.0275abcd 44.01 ± 0.0328abcd 4.62 ± 0.0163 6.00 ± 0.0236d

24.43 ± 0.0576abcd 24.73 ± 0.0042ac 23.69 ± 0.0174abd 33.26 ± 0.0538abcd 37.46 ± 0.0442acd 21.64 ± 0.0452ab 27.14 ± 0.0024abcd 32.57 ± 0.0302cd 31.22 ± 0.0399abc 43.22 ± 0.0693abcd 9.49 ± 0.0239a 30.77 ± 0.0217abcd 40.09 ± 0.0484 40.17 ± 0.0364b 48.05 ± 0.0940d 51.81 ± 0.0495cd 5.07 ± 0.0506abcd 43.90 ± 0.0977abcd 3.19 ± 0.0571acd 22.50 ± 0.1099b 4.01 ± 0.0064abcd 2.67 ± 0.0066ab 25.58 ± 0.0417abcd 35.57 ± 0.0304abcd 0.48 ± 0.0153abcd 29.52 ± 0.1266abc 45.99 ± 0.0433abcd 47.01 ± 0abcd 5.24 ± 0.0148 20 5.81 ± 0.0127a 5.69 ± 0.0095d 17.11 ± 0.0405abcd

te

20.89 ± 0.0237a 20.61 ± 0.0205 25.37 ± 0.0870abc 29.42 ± 0.0064ab 28.11 ± 0.0529a 8.63 ± 0.0148a 39.39 ± 0.0939 34.86 ± 0.1205abc 19.38 ± 0.1847ab 27.53 ± 0.1577bc 3.05 ± 0.0074ab 8.43 ± 0.0439a 24.59 ± 0.0718abc 15.12 ± 0.0021ab 5.78 ± 0.0060ab 4.10 ± 0.0098ab KT ri−Leg

ip t

21.78 ± 0.0366d 21.04 ± 0.0196 43.73 ± 0.0660bcd 30.15 ± 0.0082abd 22.89 ± 0.0283d 12.82 ± 0.0315acd 37.56 ± 0.0312c 26.94 ± 0.0586abc 19.73 ± 0.0498ab 4.45 ± 0.0158abcd 2.97 ± 0.0154ab 9.14 ± 0.0263a 0.48 ± 0.0109abcd 15.75 ± 0.0098abc 5.08 ± 0.0094 5.21 ± 0.0139 KExp−Che

d

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

Ac ce p

KW av

21.75 ± 0.0158ab 21.78 ± 0.0343 18.96 ± 0.0541abd 29.49 ± 0.0229ab 25.89 ± 0.0463 9.15 ± 0.0140a 42.00 ± 0.0434b 45.83 ± 0.0472d 25.56 ± 0.1759 15.42 ± 0.1092bd 2.57 ± 0.0073ab 9.14 ± 0.0344a 44.84 ± 0.0825d 15.18 ± 0.0086ab 4.55 ± 0.0155 4.42 ± 0.0155b KHer KT ri−Che

cr

22.41 ± 0.0096cd 20.13 ± 0.0185a 37.61 ± 0.0415acd 32.46 ± 0.0211cd 24.89 ± 0.0183 10.34 ± 0.0281a 36.35 ± 0.0287ac 55.14 ± 0.1433ad 39.44 ± 0.0616d 0.83 ± 0.0134acd 8.80 ± 0.0210acd 9.71 ± 0.0176a 41.94 ± 0.0564d 17.30 ± 0.0109acd 4.88 ± 0.0075d 6.01 ± 0.0086cd

us

KLin 22.60 ± 0.0187cd 21.39 ± 0.0234b 48.66 ± 0.1032bcd 32.45 ± 0.0124cd 22.89 ± 0.0371d 15.64 ± 0.0255bcd 39.83 ± 0.0523b 46.94 ± 0.0964bd 39.44 ± 0.0616d 24.31 ± 0.0521b 30.60 ± 0.0229bcd 18.29 ± 0.0385bcd 46.61 ± 0.0566d 25.54 ± 0.0182bcd 4.70 ± 0.0069d 5.31 ± 0.0120d KG−Che

an

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

M

Data set

23.15 ± 0.0267c 22.39 ± 0.0325 25.82 ± 0.1079ab 28.17 ± 0.0231ab 27.89 ± 0.0515a 15.84 ± 0.0651bcd 34.09 ± 0.0352ac 49.16 ± 0.0839d 22.64 ± 0.1669ab 1.67 ± 0.0438acd 3.00 ± 0.0203ab 11.43 ± 0.0928a 42.90 ± 0.0903d 27.40 ± 0.0367bcd 5.90 ± 0.0201 3.63 ± 0.0088ab KChe

23.69 ± 0.0658bcd 18.39 ± 0.0134abcd 20.01 ± 0.0708ab 25.36 ± 0.0092abcd 24.22 ± 0.0261cd 8.46 ± 0.0316a 39.57 ± 0.0288 41.28 ± 0.0801b 24.23 ± 0.0549ab 5.83 ± 0.0543abcd 2.77 ± 0.0095ab 8.29 ± 0.0470a 1.61 ± 0.0186abcd 16.53 ± 0.0133acd 5.43 ± 0.0111 3.64 ± 0.0143ab KLeg

23.82 ± 0.0626bcd 18.86 ± 0.0127ac 24.03 ± 0.1227ab 25.33 ± 0.0078abcd 22.22 ± 0.0257bcd 8.21 ± 0.0252ab 34.78 ± 0.0542c 40.34 ± 0.0868b 21.06 ± 0.0501ab 2.78 ± 0.0041abcd 3.50 ± 0.0161ab 8.57 ± 0.0442a 4.68 ± 0.0596abcd 16.56 ± 0.0131acd 5.56 ± 0.0096a 3.58 ± 0.0069ab KU −Che

23.39 ± 0.0113bcd 23.70 ± 0.0298abd 22.39 ± 0.0558ab 29.91 ± 0.0349a 23.11 ± 0.0462 24.79 ± 0.0351abcd 35.56 ± 0.0410c 54.58 ± 0.0658acd 8.26 ± 0.0737abc 5.63 ± 0.1378ad 2.27 ± 0.0038abd 23.86 ± 0.0527abcd 40.16 ± 0.1329d 48.50 ± 0.0541abcd 4.70 ± 0.0134d 18.59 ± 0.0366abcd

20.64 ± 0.0134 21.29 ± 0.0165 37.46 ± 0.0447acd 30.63 ± 0.0143abd 24.11 ± 0.0363d 25.08 ± 0.0239abcd 33.21 ± 0.0471acd 34.55 ± 0.1942ab 17.92 ± 0.0472ab 3.48 ± 0.0118abcd 2.01 ± 0.0063abcd 10.14 ± 0.0282a 39.03 ± 0.0752d 14.97 ± 0.0046abd 4.98 ± 0.0060d 5.47 ± 0.0090d

Page 21 of 33

Table 7: Ratio of support vectors of different kernels on data sets (mean(%)± std).

ip t

cr

us

an

M

d

Ac ce p

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

KLin KP ol KGau KW av 49.35 ± 0.1818 46.89 ± 0.3209 48.06 ± 0.0579d 65.88 ± 0.1802c 82.78 ± 0.2072bcd 57.03 ± 0.3028a 71.13 ± 0.2498a 72.54 ± 0.2696ab 92.83 ± 0.0483bcd 64.36 ± 0.2778acd 43.59 ± 0.1184ab 33.59 ± 0.1444ab 66.50 ± 0.1783 59.90 ± 0.1374d 71.73 ± 0.1531 68.47 ± 0.1253b 50.18 ± 0.2639 32.45 ± 0.1479 46.36 ± 0.1690 46.98 ± 0.1648 47.11 ± 0.2423bcd 35.99 ± 0.0961b 23.08 ± 0.1064acd 35.80 ± 0.0931b 90.82 ± 0.1058acd 84.78 ± 0.1182 77.25 ± 0.0803ab 88.70 ± 0.0802d 79.74 ± 0.0939c 75.17 ± 0.2493c 63.93 ± 0.2880 45.56 ± 0.2146ad 93.87 ± 0.0986bcd 27.82 ± 0.0976acd 63.87 ± 0.2207ab 68.41 ± 0.2582abc 75.73 ± 0.1305bcd 24.72 ± 0.0199ab 5.04 ± 0.0759acd 28.45 ± 0.0481ab 29.52 ± 0.0903acd 51.62 ± 0.1543ab 66.84 ± 0.2352ab 87.73 ± 0.1350bcd bcd acd ab 54.24 ± 0.1785 17.44 ± 0.1514 85.47 ± 0.1828 81.62 ± 0.1928ab 91.11 ± 0.0403 92.68 ± 0.0447 96.20 ± 0.0473 92.87 ± 0.1068 75.08 ± 0.1364 50.68 ± 0.3033 63.67 ± 0.1432 65.66 ± 0.1505 17.38 ± 0.1838 20.00 ± 0.1758 15.49 ± 0.0584 22.86 ± 0.2626 14.77 ± 0.0524bcd 4.94 ± 0.0760acd 11.71 ± 0.0267ab 11.68 ± 0.0209ab KG−Che KG−Leg KHer KT ri−Che KT ri−Leg 34.75 ± 0.2645d 60.73 ± 0.2717 5.82 ± 0.0051abcd 27.65 ± 0.2390cd 29.09 ± 0.2567d 38.74 ± 0.2978acd 38.75 ± 0.2963acd 59.13 ± 0.3316 86.96 ± 0.2079 26.49 ± 0.2469abcd 53.08 ± 0.1922ad 53.08 ± 0.1921ad 51.97 ± 0.2114ac 93.98 ± 0.0445bcd 61.28 ± 0.0807acd 53.06 ± 0.2591 71.85 ± 0.1980 56.65 ± 0.2227 23.56 ± 0.1604abcd 43.62 ± 0.3176cd 40.78 ± 0.2587 37.67 ± 0.0615d 52.91 ± 0.1743 29.10 ± 0.0322 27.73 ± 0.2966 23.96 ± 0.0968acd 55.02 ± 0.3115b 18.45 ± 0.1815acd 19.37 ± 0.1431ac 21.06 ± 0.1408acd 90.39 ± 0.1012d 91.26 ± 0.0901d 77.91 ± 0.1055b 70.39 ± 0.1801ab 68.84 ± 0.1469abc a a a 48.10 ± 0.4006 48.09 ± 0.4005 38.25 ± 0.4547 86.54 ± 0.1611bc 50.87 ± 0.4142a 53.06 ± 0.3028ab 52.58 ± 0.2942ab 40.71 ± 0.5103a 81.27 ± 0.1329abc 58.81 ± 0.3043ac 8.65 ± 0.1031acd 8.65 ± 0.0131acd 15.08 ± 0.2866a 20.79 ± 0.1727bcd 20.76 ± 0.1724ab abcd abcd a 6.08 ± 0.0092 6.95 ± 0.0182 46.19 ± 0.2859 12.54 ± 0.0158abcd 11.92 ± 0.0142abcd 46.24 ± 0.3257bcd 51.20 ± 0.1752bcd 48.89 ± 0.3313cd 15.47 ± 0.2937acd 1.68 ± 0.0171acd 16.73 ± 0.0242abcd 14.02 ± 0.0074abcd 92.50 ± 0.0789 69.16 ± 0.0124abcd 79.91 ± 0.0096abcd 28.48 ± 0.2576acd 57.18 ± 0.1967a 99.04 ± 0.0057abcd 38.71 ± 0.3361ad 44.35 ± 0.3063acd 36.72 ± 0.4182 36.71 ± 0.4180 36.80 ± 0.1996ac 18.63 ± 0.2862 16.16 ± 0.0326 a 10.27 ± 0.0139 23.86 ± 0.2742 10.48 ± 0.0008b 5.95 ± 0.0779a 12.08 ± 0.0888b KExp−Che KExp−Leg KChe KLeg KU −Che 20.63 ± 0.4100d 20.24 ± 0.4100d — — 42.73 ± 0.2318d 43.18 ± 0.2055a 39.11 ± 0.4191acd — 5.51 ± 0.0035abcd 52.10 ± 0.3127ad 63.59 ± 0.1045acd 49.32 ± 0.1901a 44.69 ± 0.2098a 42.05 ± 0.2335ab 82.75 ± 0.2361cd — 4.29 ± 0.0027abcd 37.82 ± 0.2715acd 64.80 ± 0.4769 72.25 ± 0.0063b 38.16 ± 0.4462 36.67 ± 0.4602 6.11 ± 0.0026abcd 27.16 ± 0.2464a 35.93 ± 0.1769 25.12 ± 0.3197 20.72 ± 0.2872a 18.26 ± 0.0582acd 14.77 ± 0.1179acd 22.27 ± 0.0208acd 73.62 ± 0.2479 80.97 ± 0.1646 56.13 ± 0.2249abcd 82.37 ± 0.1148 87.10 ± 0.1123 56.67 ± 0.3697a 34.44 ± 0.4585a 7.85 ± 0.0067abcd 60.95 ± 0.2582d 56.11 ± 0.3008a 67.06 ± 0.2386ab 40.48 ± 0.5081a 7.85 ± 0.0067abcd 52.18 ± 0.2849ab 55.13 ± 0.2407ab ab a 21.19 ± 0.1719 14.84 ± 0.2819 2.85 ± 0.0066acd 20.87 ± 0.1835ab 13.63 ± 0.0717abcd 36.53 ± 0.1462acd 35.30 ± 0.2092ad 3.29 ± 0.0813abcd 46.05 ± 0.1243ab 14.29 ± 0.2116acd 20.08 ± 0.4046cd 20.68 ± 0.4181cd 7.52 ± 0.0201acd 4.32 ± 0.0268acd 23.25 ± 0.3365abcd 60.36 ± 0.0554abcd 36.02 ± 0.0563abcd 63.24 ± 0.1478abcd 90.94 ± 0.0366c 90.00 ± 0.0545c 20.77 ± 0.4176acd 
25.86 ± 0.4013acd — 0.47 ± 0.0043abcd 18.36 ± 0.2463abcd 14.98 ± 0.2440 36.21 ± 0.1732c 24.34 ± 0.1971 23.01 ± 0.2088 18.14 ± 0.2094 4.78 ± 0.1842 14.78 ± 0.1832 3.10 ± 0.0154acd 4.68 ± 0.0558acd 10.70 ± 0.0533ab

te

Data set A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

A1a Australian Checkerboard German Heart Ionosphere Liverdisorder Monks1 Monks2 Monks3 Ringnorm Sonar Spiraldata Splice Twonorm Wdbc

21

Page 22 of 33

371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401

ip t

cr

370

us

369

an

368

M

367

d

366

te

365

Based on the analysis above, six sets of orthogonal polynomial kernels, i.e., KG−Che , KHer , KT ri−Che , KExp−Che , KLeg , and KU −Che , are compared with four standard kernels. On most data sets, KLin is outperformed by these six orthogonal kernel functions. Compared with KP ol , KGau , and KW av , these orthogonal polynomial kernels have a better (statistically significantly) or equal (not statistically significantly) accuracy on more than 13 data sets, 11 data sets and 11 data sets, respectively. It means that these six kinds of orthogonal polynomial kernels can give competitive results compared with the four common kernels. Table 6 lists the results of GDA+KNN for different kernels. Seen from Table 6, KG−Che outperforms KG−Leg , and KLeg can be found to exceed KChe . The difference between KT ri−Che and KT ri−Leg and that between KExp−Che and KExp−Leg are not significant. So these orthogonal polynomial kernels, KG−Che , KHer , KT ri−Che , KExp−Che , KLeg , and KU −Che , are chosen to compare with four commonly used kernels. The result is similar to that in SVM. Compared with KLin , KP ol , KGau , and KW av , these orthogonal polynomial kernels have a better (statistically significantly) or equal (not statistically significantly) accuracy on more than 10 data sets, 10 data sets, 10 data sets and 9 data sets out of 16 data sets, respectively. It means that these six kinds of orthogonal polynomial kernels have competitive classification abilities compared with the four commonly used kernels. Table 7 lists the percentages of support vectors stored by all compared kernel functions on 16 data sets. Taking #SV/N as an index is out of the consideration that it can be seen as a coarse estimate of Leave-One-Out (LOO) generalization error rate. Seen from Table 7, ten sets of orthogonal polynomial kernels all store less (statistically significantly) or equal (not statistically significantly) support vectors than four common kernels on more than 11 data sets. This property may be a result of the orthogonality feature of the basis orthogonal polynomial functions. KChe stores statistically significantly less support vectors than KLin , KP ol , KGau and KW av on 6 data sets, but it has no results on 4 data sets. KLeg is trapped in the similar situation with having no result on A1a. In a nutshell, KChe and KLeg are not good choices for some data sets although their competitive performance in terms of support vector percentage. By comparison of the kernels constructed by the same method, KG−Che stores obviously less support vectors than KG−Leg (by less than five percent) on 6 data sets, while KT ri−Leg stores obviously less support vectors than KT ri−Che (by less than five percent) on 5 data sets. The performance of KExp−Che and KExp−Leg in terms of support vector per-


Fig. 3(a) shows the overall comparison between the kernels in terms of test error in SVM. KGau is more accurate (statistically significantly) than KLin and KChe, and more accurate (but not statistically significantly) than the remaining kernel functions. Of the ten orthogonal polynomial kernels, KTri−Che and KTri−Leg achieve the highest test accuracy, while KG−Che, KHer, KLeg and KU−Che fall behind. The effectiveness of KTri−Che and KTri−Leg may suggest that incorporating a suitable auxiliary kernel can improve the average test accuracy. KG−Che outperforms KG−Leg, while KChe is inferior to KLeg. There is no large difference among the performances of KG−Che, KHer, KLeg and KU−Che, and KChe has the worst test accuracy of these ten orthogonal polynomial kernels.
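The "rank" axes in Figs. 3-5 can be reproduced, under the assumption that they report average ranks of the kernels over the data sets (the standard average-rank methodology for comparing learners on multiple data sets), with a short sketch such as the following; the kernel subset and the error matrix are placeholders, not the paper's results:

# Minimal sketch: average rank of each kernel over the data sets.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
kernels = ["K_Lin", "K_Pol", "K_Gau", "K_Wav", "K_Che"]   # illustrative subset of the 14 kernels
errors = rng.random((16, len(kernels)))                   # placeholder test-error matrix: 16 data sets x kernels

# Rank kernels within each data set (rank 1 = lowest error, ties get average ranks),
# then average the ranks over data sets; a lower average rank is better overall.
ranks = np.vstack([rankdata(row) for row in errors])
avg_rank = ranks.mean(axis=0)
for name, r in sorted(zip(kernels, avg_rank), key=lambda t: t[1]):
    print(f"{name}: average rank {r:.2f}")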


Figure 3: Overall comparison of different kernels in terms of test error (average rank) for classification: (a) SVM; (b) GDA+KNN.


The same set of experiments is replicated using GDA+KNN. Fig. 3(b) displays the overall comparison between these kernel functions in terms of misclassification rate. Note that KTri−Che and KTri−Leg are more accurate than the remaining kernel functions. Similar to the result in Fig. 3(a), KG−Leg and KChe are outperformed by KG−Che and KLeg, respectively. There is no substantial difference in accuracy among KG−Che, KExp−Che and KU−Che. The difference between KExp−Che and KExp−Leg is trivial, and so is that between KTri−Che and KTri−Leg.


Again, KChe has the worst test accuracy of these ten orthogonal polynomial kernels.

Fig. 4 illustrates the overall comparison between these kernel functions in terms of support vector percentages. First of all, the orthogonal polynomial kernels store fewer (though not statistically significantly fewer) support vectors than KLin, KGau and KWav, and KTri−Leg, KChe and KLeg win out among all fourteen kernels. The remaining orthogonal polynomial kernels store comparable numbers of support vectors. Combining these observations with the results in Table 5, the conclusion can be drawn that KChe and KLeg are not good choices despite their good performance in terms of support vector percentage.


Figure 4: Overall comparison of different kernels in terms of the support vector ratio for SVC.


(b) Regression experiments

In the regression scenario, Table 8 and Table 9 list the average MSE values and 1−WIA values, respectively, of all compared kernel functions for SVR on sixteen data sets. As seen in Table 8, and similar to the previous classification experiments, there is no significant difference between KTri−Che and KTri−Leg, nor between KExp−Che and KExp−Leg. KChe has no results on the three data sets Pyrim, Eunite2001, and Triazine, and KLeg has no results on Triazine; the reasons have been discussed for the classification experiments. Compared with KChe, KLeg has a lower MSE value on 10 of 13 data sets (the three data sets on which KChe has no results are ruled out).
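For reference, the two regression indexes can be computed as in the following sketch; here WIA is assumed to denote Willmott's index of agreement, so that 1−WIA, like MSE, is smaller for better predictions, and the targets and predictions below are placeholders rather than SVR outputs from the paper:

# Minimal sketch: the two regression indexes used in Tables 8 and 9.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def one_minus_wia(y_true, y_pred):
    """1 minus Willmott's index of agreement; 0 means perfect agreement."""
    y_bar = y_true.mean()
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((np.abs(y_pred - y_bar) + np.abs(y_true - y_bar)) ** 2)
    return num / den

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + 0.1 * rng.normal(size=100)   # stand-in for SVR predictions
print(f"MSE     = {mse(y_true, y_pred):.4f}")
print(f"1 - WIA = {one_minus_wia(y_true, y_pred):.4f}")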


Table 8: Mean square error of different kernels on data sets (mean(%) ± std). [Entries for the fourteen kernels (KLin, KPol, KGau, KWav, KG−Che, KG−Leg, KHer, KTri−Che, KTri−Leg, KExp−Che, KExp−Leg, KChe, KLeg, KU−Che) on the 16 regression data sets Airfoil, Bodyfat, Concrete, Eunite2001, Forestfire, Gabor, Housing, Mexican2, Mexican3, Mg, Mpg, Pyrim, Triazine, Wankara, Winered, and Yacht are not reproduced here.]


Table 9: 1−WIA of different kernels on data sets (mean(%) ± std). [Entries for the fourteen kernels on the same 16 regression data sets as in Table 8 are not reproduced here.]


KG−Che has better or equal performance on 11 of the 16 data sets in comparison with KG−Leg. Accordingly, six sets of orthogonal polynomial kernels (KG−Che, KHer, KTri−Che, KExp−Che, KLeg and KU−Che) are selected for comparison with the four classical kernels in the following discussion. Notice that these six sets of orthogonal kernel functions outperform KLin on most data sets in Table 8. Compared with the remaining three common kernels (KPol, KGau and KWav), KG−Che, KHer, KTri−Che, KExp−Che, KLeg and KU−Che have better (statistically significantly) or equal (not statistically significantly) test performance in terms of MSE on more than 9, 14, 10, 8, 9 and 8 of the 16 data sets, respectively. Clearly, KHer and KTri−Che contribute competitive results, whereas the remaining four kernels show no clear advantage in the evaluation of generalization ability. The performance of KTri−Che may imply that changing the weighting function helps.

Similarly, these six kinds of orthogonal polynomial kernels (KG−Che, KHer, KTri−Che, KExp−Che, KLeg and KU−Che) are compared with the four kinds of common kernels (KLin, KPol, KGau and KWav) in terms of the 1−WIA values in Table 9. KLin is outperformed by all compared orthogonal polynomial kernels on most data sets, suggesting better generalization performance of these orthogonal kernel functions than that of the original SVR algorithm in the input space. KG−Che, KHer, KTri−Che, KExp−Che, KLeg and KU−Che compete with or outperform KPol, KGau and KWav in terms of 1−WIA on more than 9, 14, 13, 9, 9 and 12 of the 16 data sets, respectively. Thus, KHer, KTri−Che and KU−Che are the best-performing of the six kinds of orthogonal polynomial kernels.

Fig. 5(a)-(b) illustrates the overall comparison between the kernel functions in terms of MSE values and 1−WIA values, respectively; the two panels show similar results. Of all kernel functions, KLin performs worst, while KHer achieves the best values of the corresponding indexes in both Fig. 5(a) and Fig. 5(b); however, for one reason or another, the Hermite kernel has received very little attention in machine learning. Among the orthogonal polynomial kernels, the MSE and 1−WIA values of the two triangularly modified orthogonal polynomial kernels are better than those of the other orthogonal polynomial kernels except KHer. As observed in the previous experiments, KG−Che outperforms KG−Leg, while KChe is inferior to KLeg. There is no large difference between the performances of KG−Che and KU−Che. For regression tasks, the performance of KExp−Che is better (though not statistically significantly) than that of KExp−Leg, but it is poorer than that of KHer, KTri−Che and KTri−Leg.

Figure 5: Overall comparison of different kernels in terms of test error (average rank) for regression: (a) MSE; (b) 1−WIA.


5. Conclusion

In this paper, two sets of orthogonal polynomial kernels are proposed by employing the triangular kernel. Ten sets of orthogonal polynomial kernels based on varying construction methods, weighting functions, and basis orthogonal polynomials are discussed. Having only one parameter, chosen from the positive integers, facilitates kernel selection for some orthogonal polynomial kernels; this property is attractive to users who do not want to spend much time on parameter tuning.

Several issues should be noted from the experiments. First, using a partial sum of the inner products of generalized polynomials to construct kernels is effective in terms of both test performance and parameter optimization. Second, the weighting function matters significantly more than the basis function; by exploiting a more effective kernel function as the weighting factor, a more effective orthogonal kernel can be constructed. Third, many orthogonal polynomial kernels show excellent generalization performance for both classification and regression tasks compared with the commonly used kernels. In particular, the experiments demonstrate that the newly proposed triangularly modified orthogonal polynomial kernels are competitive in both classification and regression scenarios, and there is a clear advantage in favor of the Hermite kernel for regression problems. Finally, orthogonal polynomial kernels store fewer support vectors than the other common kernels in support vector classification, which can be useful for applications requiring fewer support vectors.

The performances of these orthogonal polynomial kernel functions are demonstrated experimentally on a number of examples. However, one main concern for some orthogonal polynomial kernels is that they are more expensive to compute than some general kernel functions: these orthogonal kernel functions use only a subset of the monomials and their expressions are not compact, which may limit their applicability, although recursive procedures can be used as a shortcut. Another interesting direction is to investigate the scalability of the orthogonal polynomial kernels to large data sets compared with other kernels.
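To make the construction pattern referred to above concrete, the following sketch illustrates the general recipe of a partial sum of orthogonal polynomial products multiplied by a separate weighting kernel; the Chebyshev basis, the triangular factor max(0, 1 − ||x − z||/θ), and the degree n are illustrative assumptions and do not reproduce the paper's exact kernel definitions:

# Minimal sketch of the construction pattern, not the paper's formulas:
# orthogonal-polynomial partial sum times a weighting/damping kernel.
import numpy as np
from numpy.polynomial import chebyshev as C

def poly_partial_sum(x, z, n=4):
    """Sum_{k=0}^{n} prod_j T_k(x_j) T_k(z_j), for inputs scaled into [-1, 1]."""
    total = 0.0
    for k in range(n + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0                                   # selects the k-th Chebyshev polynomial T_k
        total += np.prod(C.chebval(x, coef) * C.chebval(z, coef))
    return total

def triangular(x, z, theta=2.0):
    # assumed triangular factor; the paper's triangular kernel may be defined differently
    return max(0.0, 1.0 - np.linalg.norm(x - z) / theta)

def tri_cheb_kernel(x, z, n=4, theta=2.0):
    # "triangularly modified" pattern: polynomial partial sum times triangular damping
    return poly_partial_sum(x, z, n) * triangular(x, z, theta)

x = np.array([0.2, -0.5])
z = np.array([0.1, -0.4])
print(tri_cheb_kernel(x, z))

Swapping the triangular factor for another kernel (for example a Gaussian weighting) yields a different member of the same family, which is the sense in which a more effective weighting kernel may produce a more effective orthogonal kernel.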


Acknowledgements

The work described in this paper was partially supported by the National Natural Science Foundation of China (Nos. 61273291 and 61673249) and by the Research Project Supported by Shanxi Scholarship Council of China (No. 2016-004).
