
Accepted Manuscript

Matrix-Pattern-Oriented Classifier with Boundary Projection Discrimination

Zhe Wang, Zonghai Zhu

To appear in: Knowledge-Based Systems
PII: S0950-7051(17)30602-0
DOI: 10.1016/j.knosys.2017.12.024
Reference: KNOSYS 4160
Received: 24 January 2017; Revised: 19 December 2017; Accepted: 22 December 2017


Matrix-Pattern-Oriented Classifier with Boundary Projection Discrimination

Zhe Wang (corresponding author, [email protected]), Zonghai Zhu
Department of Computer Science & Engineering, East China University of Science & Technology, Shanghai, 200237, P.R. China

Abstract

The matrix-pattern-oriented Ho-Kashyap classifier (MatMHKS), which utilizes two-sided weight vectors to constrain the matrix-based pattern, extends the representation of samples from vectors to matrices. To further improve the classification ability of MatMHKS, we introduce a new regularization term into MatMHKS to form a new algorithm named BPDMatMHKS. In detail, we first divide the samples into three types: noise samples, fuzzy samples, and boundary samples. Then, we combine projection discrimination with these boundary samples, thus proposing a regularization term that exploits the prior structural information of the boundary samples. By doing so, the classification ability of MatMHKS is further improved. Experiments validate the effectiveness and efficiency of the proposed BPDMatMHKS.

Keywords: Matrix-based Classifier; Boundary Sample; Projection Discrimination; Regularization Learning; Pattern Recognition.

I. Introduction

Conventional linear classifiers [9], such as the Perceptron Learning Algorithm (PLA) [22] and the Support Vector Machine (SVM) [8], obtain the final classification result based on one weight vector. Different from these traditional linear classifiers, matrix-based classifiers [6], [7], [34] use two-sided weight vectors to obtain the final result. Therefore, they can be applied directly to matrix-based pattern representation problems [20] and have been demonstrated to be superior to vector-based classifiers. Our previous work proposed a new matrix-based learning machine named MatMHKS [6].


Fig. 1. (a) shows the situation in which the iteration is stopped once the classification hyperplane can separate the samples of the two classes correctly; it is, however, not a good classification hyperplane. (b) shows a better classification hyperplane, since it obviously has better generalization ability than the hyperplane in (a). (In (b) the boundary sample sets Sb+ and Sb− and the weight vector w are marked.)

It follows the minimum risk framework [6], [32] and naturally introduces two weight vectors into the classifier design. Therefore, it can simply and effectively process the matrix-based pattern directly. The framework is as follows:

$$\min J = R_{emp} + c\,R_{reg}, \tag{1}$$

In Eq. 1, the first term Remp is the empirical risk, which adopts the mathematical form u^T A v, where A ∈ R^{d1×d2} is a matrix pattern and u ∈ R^{d1×1} and v ∈ R^{d2×1} are two weight vectors. In this way, the matrix pattern A constrained by u and v is a natural extension of the vector method. Moreover, the second term, the regularization term Rreg, extends the geometric margin and reduces the generalization error. The parameter c is the penalty coefficient.

Although the matrix method learns more flexible and structural information from the matrix pattern, and the term Rreg boosts the generalization ability by extending the geometric margin between the two classes, it might still neglect useful information. MatMHKS, which originates from the Modified Ho-Kashyap (MHKS) [18] algorithm, may not obtain a good classification hyperplane, since the iteration in MHKS stops once the classification hyperplane can separate the samples of the two classes. As shown in Fig. 1 (a), the classification hyperplane does not change once all training samples are separated correctly. Obviously, the generalization ability of the two dotted lines is not good, although they distinguish all training samples. In Fig. 1 (b), the classification hyperplane described by the solid line performs better, since it is based on margin maximization and the samples of the different classes are treated equally.

In order to achieve better prediction, most classification algorithms attempt to learn the information of the samples [21], especially the boundary information of each class. Yao [35], [36], [37] proposed a three-way decision model, which introduced a boundary decision into the original positive and negative decisions. Li [19] proposed a method for selecting critical samples based on local geometrical and statistical information. K. Nikolaidis [24] proposed a boundary-preserving algorithm for data condensation, which tends to preserve the boundary samples in application. Besides considering the location of the data, some researchers also pay attention to sample pretreatment. C. G. Osorio [25] proposed a linear-complexity instance selection algorithm based on classifier ensemble concepts. Sample pretreatment technology also plays an important role in imbalanced problems. H. Han [12] proposed Borderline-SMOTE, which pays more attention to the boundary samples of the minority class and improves the performance on imbalanced problems. After that, Sukarna Barua [2] proposed the MWMOTE method, which further improved the performance compared with Borderline-SMOTE. The boundary samples are easier to misclassify and are thus more important for classification. This paper introduces a new regularization term Rbpd into the original framework of Eq. 1, thus boosting the performance compared with the original framework. The new framework can be described as:

$$\min J = R_{emp} + c\,R_{reg} + \lambda R_{bpd}, \tag{2}$$

The terms Remp and Rreg are the same as in Eq. 1, and the new term Rbpd is expected to learn the boundary information that is not handled well by the original MatMHKS. The parameter λ is the penalty coefficient.

In the term Rbpd, the first letter b stands for the boundary samples, the second letter p stands for projection, and the last letter d stands for discrimination. The term Rbpd is expected to yield a better final classification hyperplane according to the projection discrimination of the boundary samples. There are two major steps in constructing the regularization term Rbpd. Firstly, the K Nearest Neighbour (KNN) [13] method provides a useful reference for selecting boundary samples; secondly, the idea of Fisher linear discriminant analysis (FLDA) [10] is adopted to project these boundary samples onto the argument vector of the classification hyperplane. The KNN method is widely applied in many areas such as sample selection and noise filtering. Pan [26] used the KNN method to decide the weights of samples in TWSVM [15]. M. Tanveer [29] introduced the KNN method into TWSVM, thus reducing the running time and improving the generalization ability. By using the KNN method, we can obtain the neighbours of every sample and further divide the samples into noise samples, fuzzy samples and boundary samples. These boundary samples can then be used to assist the algorithm in finding a better classification hyperplane. According to FLDA, it is better if the projection values of these points from both classes have a large between-class distance and a small within-class distance.

To illustrate this clearly, we name one class class+ and the other class−. For every sample in class+, its k nearest neighbours in class− are selected. In turn, the selected boundary samples in class− are used to search for the boundary samples in class+. After the boundary samples of the two classes have been selected, the samples of the different classes are expected to have a large between-class distance and a small within-class distance. As shown in Fig. 1 (b), the classification hyperplane described by the solid line performs better than the one described by the dotted line in Fig. 1 (a), since it is based on margin maximization and the samples of the different classes are treated equally. In order to obtain this type of classification hyperplane, it is important to obtain the boundary samples of the two classes. In Fig. 1 (b), the boundary samples of the two classes are filled with color and the weight vector w is displayed. It can be found that these boundary samples satisfy three characteristics when they are projected onto the vector w. Firstly, they have a large between-class distance. Secondly, they have a small within-class distance. Finally, the classification hyperplane is close to the midpoint of the means of the two boundary sample sets.

It is worth noting that the SVM pays attention to the support vectors near the classification hyperplane, while FLDA aims at a large between-class distance and a small within-class distance. The boundary samples of the proposed BPDMatMHKS and the support vectors of the SVM are quite different. The support vectors of the SVM are the samples that satisfy the strict constraint in Eq. 3. They are computed during the optimization of the objective function; therefore, these support vectors are nonintuitive.

$$\text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, ..., n. \tag{3}$$

Different from the support vectors, the boundary samples of BPDMatMHKS are computed from geometric intuition. In addition, these samples are obtained before the optimization process and they offer prior information. By finding the relationships among neighbours, we obtain prior structural information about the boundary samples, thus forming a regularization term based on these boundary samples. Although FLDA shares the first two characteristics above, it does not satisfy the third one. On the other hand, the boundary samples of the different classes come in pairs, so it is better if the classification hyperplane is close to the midpoint of the means of the two boundary sample sets. With the help of the boundary sample selection method, the designed regularization term Rbpd learns the boundary information of the data, thus improving the generalization ability when the term Rbpd is introduced into MatMHKS. The proposed method is named BPDMatMHKS for short. The major contributions of this paper lie in the following aspects:

• The proposed BPDMatMHKS extends the framework of the existing MatMHKS; therefore, the proposed BPDMatMHKS is a special example of the proposed framework. In addition, a boundary sample selection method is proposed in this paper. By utilizing the prior information of the boundary samples, the new regularization term is introduced.

• The proposed BPDMatMHKS inherits the advantages of the original MatMHKS. Moreover, it lays emphasis on the boundary sample information, which plays an important role in finding a better classification hyperplane, thus improving the generalization ability. To the best of our knowledge, it is the first time that boundary sample learning is introduced into MatMHKS.

• The proposed BPDMatMHKS is feasible and effective according to the experiments. In addition, the proposed BPDMatMHKS has a tighter generalization risk bound than the original MatMHKS, as shown by analyzing their Rademacher complexities and testing their stability.

The rest of this paper is organized as follows. Section II presents a brief introduction to the preliminary knowledge of MatMHKS and the method of boundary sample selection. Section III gives the architecture of the proposed BPDMatMHKS and the modified method of boundary sample selection. Section IV reports all the experimental results. Section V gives a theoretical and experimental generalization risk bound analysis for BPDMatMHKS. Finally, conclusions are given in Section VI.

II. Related Work

In this section, a concise introduction to MatMHKS and to regularization learning is given.

A. MatMHKS

Our previous work MatMHKS is the origin of the proposed algorithm. Moreover, MatMHKS derives from the vector-based Modified Ho-Kashyap algorithm (MHKS), which in turn comes from the Ho-Kashyap (HK) algorithm. Therefore, HK, MHKS and MatMHKS are briefly introduced in this section.

For the original binary-class (class+, class−) problem, there are N samples (x_i, ϕ_i), i = 1, ..., N, with x_i ∈ R^d and class label ϕ_i ∈ {+1, −1}. When x_i ∈ class+, ϕ_i = +1; otherwise, ϕ_i = −1. By turning the vector pattern x_i into y_i = [x_i^T, 1], a corresponding augmented weight vector w = [w̃^T, w_0]^T ∈ R^{d+1} is created. The weight vector w can be derived from the objective function of HK:

$$\min_{w} J(w) = \|Yw - b\|^2, \tag{4}$$

where b > 0_{N×1} ∈ R^N is a bias vector whose elements are all expected to be non-negative, and Y = [ϕ_1 y_1, ϕ_2 y_2, ..., ϕ_N y_N].

However, the original HK is sensitive to outliers. To solve this problem, Leski proposed a Modified Ho-Kashyap algorithm, MHKS for short, which is based on regularized least squares. MHKS tries to maximize the discriminating margin by adding the margin term 1_{N×1} ∈ R^N. Thus, the objective function of MHKS, modified from Eq. 4, is as follows:

$$\min_{w,\,b} J(w, b) = \|Yw - 1_{N\times 1} - b\|^2 + c\,w^T w. \tag{5}$$

Eq. 5 is a detailed form of Eq. 1. The term ||Yw − 1_{N×1} − b||^2 corresponds to Remp in Eq. 1, and the term w^T w corresponds to Rreg. The parameter c is a regularization coefficient which controls the balance between Remp and Rreg. Since MHKS and HK are vector-based classifiers, patterns with very high dimensionality may lead to wasted memory and the curse of dimensionality.
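For a fixed margin vector b, setting the derivative of Eq. 5 with respect to w to zero gives the regularized least-squares update w = (Y^T Y + cI)^{-1} Y^T (1_{N×1} + b). The following minimal NumPy sketch illustrates this update; it is our own illustration (synthetic data, fixed b), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 toy samples, d = 3
phi = np.where(X[:, 0] > 0, 1.0, -1.0)        # toy labels in {+1, -1}

Yaug = np.hstack([X, np.ones((20, 1))])       # augmented patterns y_i = [x_i^T, 1]
Y = phi[:, None] * Yaug                       # rows phi_i * y_i
b = np.zeros(20)                              # fixed, non-negative margin vector
c = 0.1

# dJ/dw = 0 in Eq. 5 for fixed b: (Y^T Y + c I) w = Y^T (1 + b)
w = np.linalg.solve(Y.T @ Y + c * np.eye(Y.shape[1]), Y.T @ (1.0 + b))
print(w)
```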

To solve this problem, our previous work proposed a novel matrix-based classifier named MatMHKS, which can process matrix-based data directly. The matrix-based samples are described as (A_i, ϕ_i), i = 1, ..., N, where A_i ∈ R^{d1×d2} and the class label is ϕ_i ∈ {+1, −1}. Then, the discriminant function of MatMHKS can be written as:

$$g(A_i) = u^T A_i \tilde v + v_0 \;\begin{cases} > 0, & A_i \in class+ \\ < 0, & A_i \in class-, \end{cases} \tag{6}$$

where u ∈ R^{d1} and ṽ ∈ R^{d2} are two weight vectors. Accordingly, the optimization function of MatMHKS is as follows:

$$\min_{u,\,v,\,b,\,v_0} J(u, v, b, v_0) = \sum_{i=1}^{N} \big(\varphi_i (u^T A_i \tilde v + v_0) - 1 - b_i\big)^2 + c\,(u^T S_1 u + v^T S_2 v), \tag{7}$$

where S_1 = I_{d1×d1} and S_2 = I_{(d2+1)×(d2+1)} are two regularization matrices corresponding to u and v, respectively, and the regularization parameter c controls the importance of the regularization term.

The weight vectors u and v can be derived from Eq. 7. Compared with MHKS, the advantages of MatMHKS can be summarized as follows: (1) it processes matrix-based patterns such as A ∈ R^{d1×d2} directly; (2) it saves memory by reducing the weight vector size from d1×d2 to d1+d2; (3) it helps avoid over-fitting and the curse of dimensionality. However, MatMHKS, which depends only on the training error, pays little attention to the distribution of the boundary samples and is thus weak in generalization performance. Accordingly, it is necessary to find a new way to solve this problem.

B. The difference and relationship between the matrix model and the vector model

The major difference between MatMHKS and other classifiers is that MatMHKS can process not only matrix-based but also vector-based samples, whereas a vector-based classifier can only process vector-based samples. To illustrate the difference between the matrix-based and the vector-based classifier, we provide Fig. 2: (a) shows a matrix-model sample and its corresponding vector model, and (b) shows that the matrix-based classifier, using two-sided weight vectors, processes the matrix-based sample, while the vector-based classifier, using one weight vector, only processes the vector-based sample. In real-world applications, images are matrix-based problems. The matrix-based classifier can process matrix-based samples directly, whereas the vector-based classifier must convert the images into vector form. For instance, if the size of an image is 28 × 23, the image sample must be converted into a vector-based sample with 28 × 23 = 644 dimensions.

The advantage of the matrix-based classifier is shown in Table I. According to the table, the dimensionality of the weight vector is reduced greatly. This is why the matrix-based classifier relieves the curse of dimensionality. In addition, a vector-based sample with high dimensionality can be converted into a matrix-based sample, thus extending the representation of the sample. In fact, the matrix model and the vector model can be converted into each other according to the Kronecker product [33].

Fig. 2. (a) The matrix-based sample and its corresponding vector-based form (matrix model A ∈ R^{d1×d2}, vector model A ∈ R^{N×1} with N = d1 × d2). (b) The matrix-based classifier and the vector-based classifier. As shown in (b), the matrix-based classifier utilizes two-sided weight vectors to constrain the matrix-based sample (u^T A v with u ∈ R^{d1×1}, v ∈ R^{d2×1}); therefore, it can process matrix-based samples. The vector-based classifier utilizes one long weight vector (Aw with w ∈ R^{N×1}).

TABLE I
Memory required and relative reduction rate for the weight vectors of the matrix-based and vector-based classifiers.

Image Size    d1 + d2 (matrix)    d1 × d2 (vector)    (d1 + d2)/(d1 × d2)
32×32         64                  1024                1/16
24×18         42                  432                 1/10.29
28×23         51                  644                 1/12.63
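The reduction ratios in Table I follow directly from the two weight-vector sizes; a short illustrative check:

```python
for d1, d2 in [(32, 32), (24, 18), (28, 23)]:
    print(f"{d1}x{d2}: matrix {d1 + d2}, vector {d1 * d2}, "
          f"ratio 1/{d1 * d2 / (d1 + d2):.2f}")
# 32x32: matrix 64, vector 1024, ratio 1/16.00
# 24x18: matrix 42, vector 432,  ratio 1/10.29
# 28x23: matrix 51, vector 644,  ratio 1/12.63
```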

Lemma 1. Let A ∈ R^{m×n}, B ∈ R^{n×p} and C ∈ R^{p×q}. Then

$$vec(ABC) = (C^T \otimes A)\,vec(B), \tag{8}$$

where vec(X) denotes the operation that reshapes the matrix X into its vector form. Let the discriminant functions of MHKS and MatMHKS be 1) MHKS: g(x) = w̃^T x + w_0 and 2) MatMHKS: g(A) = u^T A ṽ + v_0. It can be concluded that MHKS and MatMHKS have the same form. In addition, the solution space for the weights in MatMHKS is contained in that of MHKS, and MatMHKS is an MHKS with an imposed Kronecker product decomposability constraint. To make this clear, we give a concise example of the change from matrix to vector.

Suppose the matrix A = [1 2; 3 4]; then the vector form of A is x = [1 3 2 4]^T. Let u = [1 2]^T and v = [3 4]^T; then w = v^T ⊗ u = [3 6 4 8]. We can verify the equation u^T A v = w x.
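The identity can be checked numerically. The NumPy snippet below (an illustrative sketch, not from the paper) reproduces the concrete example above and confirms that the two-sided matrix score equals the vector score of the Kronecker-structured weight.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
u = np.array([1.0, 2.0])      # left weight vector  (d1,)
v = np.array([3.0, 4.0])      # right weight vector (d2,)

matrix_score = u @ A @ v      # u^T A v

w = np.kron(v, u)             # [3, 6, 4, 8], the Kronecker-structured weight
x = A.flatten(order="F")      # vec(A) = [1, 3, 2, 4] (column-major)
vector_score = w @ x          # w x

print(matrix_score, vector_score)   # both 61.0
```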

C. Regularization learning

Regularization [5], [11], [28], [38], which can be traced back to the ill-posed problem [4], is a useful technique for alleviating over-fitting and boosting generalization ability. By introducing a certain amount of prior knowledge into the optimization formulation, regularization is powerful in overcoming the over-fitting problem, and this has proved successful, especially for classification. By combining regularization learning with data dimensionality reduction technology, a regularization term based on geometric information can be created. Data dimensionality reduction algorithms can be divided into linear and nonlinear types. The main approach to nonlinear dimensionality reduction is manifold learning, which includes isometric feature mapping (ISOMAP) [30], locally linear embedding (LLE) [27] and so on. The linear dimensionality reduction algorithms mainly include principal component analysis (PCA) [1] and linear discriminant analysis (LDA). Although PCA can effectively discover the potential primary information of the data, it does not take the difference between classes into account. Different from PCA, LDA tries to model the difference between the classes and can be used not only for dimensionality reduction but also for classification. These algorithms, which consider the maintenance of local geometric structure information, offer useful information for classification problems.

FLDA, as a kind of LDA, finds a linear combination of features that characterizes or separates two or more classes. In a binary-class problem, FLDA aims to find an argument vector of the classification hyperplane. If the labeled samples are projected onto this vector, a larger between-class distance and a smaller within-class distance mean a higher recognition rate. The criterion function of FLDA is as follows:

$$\max_{w} J(w) = \frac{w^T S_B w}{w^T S_w w}. \tag{9}$$

In this criterion function, w is the argument vector, S_B is the between-class scatter matrix, and S_w is the within-class scatter matrix. In the binary-class problem, S_B and S_w can be written as:

$$S_B = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T, \tag{10}$$

$$S_w = \sum_{x_i \in class+} (x_i - \mu_+)(x_i - \mu_+)^T + \sum_{x_j \in class-} (x_j - \mu_-)(x_j - \mu_-)^T, \tag{11}$$

where µ+ and µ− are the mean points of the two classes. In this paper, we combine the boundary samples with the geometric information of FLDA to design the regularization term. The term is then used to assist MatMHKS in obtaining a better classification hyperplane.
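As a small illustration of Eqs. 9-11, the sketch below (our own example with synthetic data, not from the paper) builds S_w for two classes and computes the Fisher direction in its usual closed form w ∝ S_w^{-1}(µ+ − µ−); the tiny ridge term is only for numerical stability.

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Fisher direction for a binary problem: w ∝ S_w^{-1} (mu_+ - mu_-)."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter S_w (Eq. 11)
    Sw = ((X_pos - mu_pos).T @ (X_pos - mu_pos)
          + (X_neg - mu_neg).T @ (X_neg - mu_neg))
    # Maximizing Eq. 9 with the rank-one S_B of Eq. 10 gives this closed form
    w = np.linalg.solve(Sw + 1e-8 * np.eye(Sw.shape[0]), mu_pos - mu_neg)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[0.0, 2.0], scale=0.5, size=(50, 2))
print(fisher_direction(X_pos, X_neg))
```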

III. Matrix-pattern-oriented Classifier with Boundary Samples

This section first presents the strategy used for selecting boundary samples, and then describes the architecture of the proposed BPDMatMHKS.

A. Selecting the boundary samples

This paper utilizes the idea of KNN to select suitable boundary samples. Fig. 3 (a) shows such a scenario, where the blue and red points represent the samples of class+ and class−, respectively. Before describing the detailed construction of the boundary samples, we note that Sb+ and Sb− displayed in Fig. 3 (b) are the two boundary sample sets. For every sample xi ∈ class+, we search for its k nearest neighbours (assuming k = 5) in class− and add these selected neighbours to Sb−. After that, starting from every sample in Sb− rather than from every sample in class−, Sb+ is constructed by adding the k nearest neighbours (in class+) of these samples. As shown in Fig. 3 (b), Sb+ and Sb− are found through this method.

However, just using nearest neighbours to search for boundary samples may cause bad situations, since it may pick up noise samples instead of boundary samples. Fig. 4 (a) shows that samples A, B, C, D and E may affect the final classification hyperplane. For instance, if we search the boundary sets as we do in Fig. 3 (a) and Fig. 3 (b), the final classification hyperplane derived from Sb+ and Sb− would be influenced by these points. In order to find such points, we record the k nearest neighbours of every training sample. Further, this paper proposes two definitions concerning the neighbours of the training samples:

(1) If the k nearest neighbours of a sample xi (xi ∈ class+) all belong to class−, xi is a noise sample.

(2) If a sample xi (xi ∈ class+) has more than ⌈k/2⌉ nearest neighbours of the other class (class−), then generate vectors by subtracting the mean µ+ from xi and from its neighbours in class−, project these vectors onto the vector (µ− − µ+) and record the projection values. xi is a fuzzy sample if its own projection value is not the minimum. Here µ+ and µ− are the mean points of the two classes (class+ and class−).

Fig. 3. In (a), the method starts from the samples in class+. These samples search for their k nearest neighbours in class− and record these neighbours in the boundary sample set of class−, marked as Sb− for convenience, until every sample in class+ has finished searching its neighbours in class−. Then the same operation is started from the samples in Sb−. (b) shows the final result, in which the two boundary sample sets Sb+ and Sb− have been found.

A noise sample cannot be regarded as a boundary sample, and its k neighbours belonging to the other class are not recorded as boundary samples either. For a fuzzy sample, although its k neighbours belonging to the other class are still added to the boundary sample set, the fuzzy sample itself cannot be selected as a boundary sample. To illustrate this situation clearly, Fig. 4 (c) shows such a scenario, where we start finding boundary samples from class+. It can be seen that the 5 nearest neighbours of A all belong to the other class, so A is a noise sample. Looking at C (C ∈ class+) in Fig. 4 (c), it may be a fuzzy sample, since it has more than ⌈k/2⌉ nearest neighbours of the other class. The next step is to judge whether it really is a fuzzy sample. In this figure, µ+ and µ− are the two mean points of class+ and class−, and the vector w is equal to (µ− − µ+). Sample C and its neighbours belonging to class− have µ+ subtracted from them to form vectors, and these vectors are then projected onto w. The projection value of sample D (D ∈ class−) is smaller than that of C. According to definition (2), C is a fuzzy sample, and it cannot be selected as a boundary sample. When the boundary samples of class− have been selected, we select the boundary samples of class+ by using the samples in Sb−. The final boundary data sets Sb+ and Sb−, shown in Fig. 4 (d), are the boundary data sets of class+ and class−. They are more effective and robust in assisting the classifier to find the discrimination hyperplane.

Fig. 4. (a) shows samples, including A, B, C, D and E, that may affect the search for the boundary sample sets. (b) shows the situation in which the two boundary sample sets are irregular, which has a bad influence on the boundary projection discrimination. (c) demonstrates that A is a noise sample, since all the neighbours of A belong to class−; moreover, as shown in (c), ||D′µ+|| is smaller than ||C′µ+||, and since the projected point C′ of C on w = (µ− − µ+) is not the minimum compared with its neighbours in class−, C is a fuzzy sample. (d) displays the final boundary sample sets after the influence of the noise points and fuzzy points has been eliminated.

The procedure for selecting boundary samples is shown in Table II. It should be noted that this procedure is designed for the binary-class problem; N+ and N− are the total numbers of training samples in class+ and class−, respectively.

TABLE II
Algorithm I: Selecting Boundary Samples

Input: the training set S = {(A1, ϕ1), ..., (AN, ϕN)}; the parameter k is the number of nearest neighbours;
Output: the selected boundary sample sets Sb+ and Sb−;
1. Divide the multi-class problem into multiple binary-class ones. Obtain the training sets S+ = {(A1, 1), ..., (AN+, 1)} and S− = {(A1, −1), ..., (AN−, −1)}, and their means µ+ = (1/N+) Σ_{Ai∈S+} Ai and µ− = (1/N−) Σ_{Ai∈S−} Ai;
2. For each sample Ai (Ai ∈ S+), calculate the distance from it to all samples in S except itself, record its k nearest neighbours in S− as Scandidate, and its k nearest neighbours in S as Sjudge. In Sjudge, count the number m of samples belonging to S−;
    if m = k
        Ai is a noise point and is removed from S, i.e., S = S \ Ai;
    else if ⌈k/2⌉ ≤ m < k
        compare the projected distances of Ai and of its neighbours in S− on the vector (µ− − µ+);
        if the distance of Ai is not the minimum, add Ai to the fuzzy point set SF;
        add Scandidate to Sb−;
    else if m < ⌈k/2⌉
        add Scandidate to Sb−;
    end if
end for
3. For each sample Ai (Ai ∈ Sb−), do the same operation as in step 2 (scanning only Sb− instead of the whole S−) and obtain Sb+;
4. Obtain the final boundary sample sets by removing the fuzzy samples: Sb+ = Sb+ \ SF and Sb− = Sb− \ SF.
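As a concrete reference, the sketch below follows the main steps of Algorithm I for the binary, vector-sample case. It is a simplified reading rather than the authors' implementation: Euclidean distances are used, the projection direction for the class− scan is taken symmetrically as (µ+ − µ−), and the function name and data layout are our own.

```python
import numpy as np

def select_boundary_samples(X_pos, X_neg, k=5):
    """Simplified Algorithm I for vector samples.
    Returns (Sb_plus, Sb_minus) as index arrays into X_pos / X_neg."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

    def scan(X_self, X_other, mu_self, mu_other, restrict=None):
        """Scan one class; collect boundary candidates from the other class."""
        w = mu_other - mu_self                    # projection direction
        boundary_other, fuzzy_self = set(), set()
        idx = range(len(X_self)) if restrict is None else restrict
        for i in idx:
            x = X_self[i]
            d_self = np.linalg.norm(X_self - x, axis=1)
            d_self[i] = np.inf                    # exclude the sample itself
            d_other = np.linalg.norm(X_other - x, axis=1)
            nn = np.argsort(np.concatenate([d_self, d_other]))[:k]
            cand = [j - len(X_self) for j in nn if j >= len(X_self)]
            m = len(cand)                         # neighbours in the other class
            if m == k:
                continue                          # noise sample: record nothing
            if m >= int(np.ceil(k / 2)):
                # fuzzy check: project (sample - mu_self) onto w
                proj_x = (x - mu_self) @ w
                proj_nb = (X_other[cand] - mu_self) @ w
                if proj_x > proj_nb.min():
                    fuzzy_self.add(i)             # neighbours kept, sample excluded
            boundary_other.update(cand)
        return boundary_other, fuzzy_self

    # step 2: scan class+, collecting boundary candidates of class-
    sb_minus, fuzzy_pos = scan(X_pos, X_neg, mu_pos, mu_neg)
    # step 3: scan only the class- candidates, collecting class+ boundary samples
    sb_plus, fuzzy_neg = scan(X_neg, X_pos, mu_neg, mu_pos,
                              restrict=sorted(sb_minus))
    # step 4: remove fuzzy samples from the boundary sets
    return (np.array(sorted(sb_plus - fuzzy_pos)),
            np.array(sorted(sb_minus - fuzzy_neg)))
```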

B. Architecture of BPDMatMHKS

In this section, the proposed BPDMatMHKS is introduced in four steps. First, the novel regularization term Rbpd is constructed from the boundary samples. Then, the term Rbpd is introduced into the original MatMHKS to boost the generalization ability. After that, the optimal solutions for the weight vectors u and v are derived for the proposed BPDMatMHKS. Finally, the realization of the proposed BPDMatMHKS is given as pseudo-code.

In the proposed BPDMatMHKS, there is a matrix data set S = {(A1, ϕ1), ..., (Ai, ϕi), ..., (AN, ϕN)}, where N is the size of the data set, Ai ∈ R^{d1×d2} is a matrix sample, and its class label is ϕi ∈ {+1, −1}, corresponding to the binary-class data sets S+ and S−. In addition, the other two matrix

data sets Sb+ and Sb− are the two boundary sample data sets. Their samples can be described as {(A1+, +1), ..., (Ai+, +1), ..., (An++, +1)} and {(A1−, −1), ..., (Ai−, −1), ..., (An−−, −1)}, where n+ and n− are the sizes of Sb+ and Sb−. Therefore, the regularization term Rbpd can be written as follows:

$$R_{bpd} = \sum_{i=1}^{n_+} \big((u^T A_i^+ \tilde v + v_0) - mean_+\big)^2 + \sum_{i=1}^{n_-} \big((u^T A_i^- \tilde v + v_0) - mean_-\big)^2 + (mean_+ + mean_-)^2 - (mean_+ - mean_-)^2, \tag{12}$$

$$mean_+ = u^T A_{mean}^+ \tilde v + v_0, \tag{13}$$

$$mean_- = u^T A_{mean}^- \tilde v + v_0. \tag{14}$$

Here, mean+ and mean− are the projected mean points of Sb+ and Sb−. Eq. 12 expects that the boundary samples belonging to the different sample sets have a smaller within-class distance and a bigger between-class distance when the samples of Sb+ and Sb− are projected onto the argument vector of the classification hyperplane. Moreover, it encourages the classification hyperplane to fit the perpendicular bisector of A_mean^+ and A_mean^-, by introducing the part (mean+ + mean−)^2 − (mean+ − mean−)^2 into Eq. 12, since these boundary samples appear in pairs. In fact, this part can be simplified, and Eq. 12 can be rewritten as follows:

$$R_{bpd} = \sum_{i=1}^{n_+} \big((u^T A_i^+ \tilde v + v_0) - mean_+\big)^2 + \sum_{i=1}^{n_-} \big((u^T A_i^- \tilde v + v_0) - mean_-\big)^2 + 4\, mean_+\, mean_-. \tag{15}$$
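For reference, Eq. 15 can be evaluated directly from the boundary matrices. The small helper below is our own sketch, with assumed array shapes; it is not part of the paper.

```python
import numpy as np

def r_bpd(u, v_tilde, v0, A_pos, A_neg):
    """Boundary projection discrimination term of Eq. 15.
    A_pos, A_neg: boundary matrices with shapes (n+, d1, d2) and (n-, d1, d2)."""
    def proj(A):                                   # u^T A v~ + v0
        return u @ A @ v_tilde + v0
    g_pos = np.array([proj(A) for A in A_pos])
    g_neg = np.array([proj(A) for A in A_neg])
    mean_pos = proj(A_pos.mean(axis=0))            # Eq. 13
    mean_neg = proj(A_neg.mean(axis=0))            # Eq. 14
    return (np.sum((g_pos - mean_pos) ** 2)
            + np.sum((g_neg - mean_neg) ** 2)
            + 4.0 * mean_pos * mean_neg)           # Eq. 12 after simplification
```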

The key of BPDMatMHKS is to introduce the designed boundary-sample regularization term Rbpd into Eq. 7. According to regularized learning, the framework can be written as:

$$\min_{u,\,v,\,b,\,v_0} J(u, v, b, v_0) = \sum_{i=1}^{N} \big(\varphi_i (u^T A_i \tilde v + v_0) - 1 - b_i\big)^2 + c\,(u^T S_1 u + v^T S_2 v) + \lambda R_{bpd}. \tag{16}$$

Then, the gradient descent method is used to seek the optimal values of u, v and b. First, we take the partial derivative with respect to u:

$$\begin{aligned} \frac{\partial J}{\partial u} ={}& \sum_{i=1}^{N} A_i \tilde v \tilde v^T A_i^T u + cS_1 u - \sum_{i=1}^{N} \varphi_i A_i \tilde v (1+b_i) + \sum_{i=1}^{N} A_i \tilde v v_0 + \lambda \sum_{i=1}^{n_+} (A_i^+ - A_{mean}^+)\tilde v \tilde v^T (A_i^+ - A_{mean}^+)^T u \\ &+ \lambda \sum_{i=1}^{n_-} (A_i^- - A_{mean}^-)\tilde v \tilde v^T (A_i^- - A_{mean}^-)^T u + 2\lambda\big(A_{mean}^+ \tilde v \tilde v^T (A_{mean}^-)^T u + A_{mean}^- \tilde v \tilde v^T (A_{mean}^+)^T u + A_{mean}^+ \tilde v v_0 + A_{mean}^- \tilde v v_0\big). \end{aligned} \tag{17}$$

Set it be equal to zero, then the expression of u can be written as: n− n+ N X X X (A+i − A+mean )˜vv˜ T (A+i − A+mean )T + λ (A−i − A−mean )˜vv˜ T (A−i − A−mean )T + u = ( Ai v˜ v˜ T ATi + cS1 + λ i

i

i

N N X X Ai v˜ v0 + 2λA+mean v˜ v0 + 2λA−mean v˜ v0 ), 2λA+mean v˜ v˜ T A−mean T + 2λA−mean v˜ v˜ T A+mean T )−1 ( ϕi Ai v˜ (1 + bi ) + i

i

(18)

CR IP T

To simplify Eq. 16, we define Y = [y1 , y2 , ..., yN ]T , yi = ϕi [uT Ai , 1]T , (i = 1, 2, ..., N), v = [˜vT , v0 ]T . And y+j = [uT A+j , 1]T , ( j = 1, 2, ..., n+ ) Y+ = [y+1 , y+2 , ..., y+n+ ]T , y−j = [uT A−j , 1]T , ( j = 1, 2, ..., n− ), Y− = [y−1 , y−2 , ..., y−n− ]T , Then the objective function Eq. 16 can be reformulated as: min J(u, v, b) = (Yv − 1N×1 − b)T (Yv − 1N×1 − b) + c(uT S1 u + vT S˜ 2 v)+ λ((Y v −

Y+mean v)T (Y+ v



Y−mean v)

4Y+mean vY−mean v).

AN US

Y+mean v)

Y−mean v)T (Y− v

(19)

− + (Y v − − +    S2 0   is an (d2 + 1) × (d2 + 1) matrix. Y+ is the mean value of Y+ , the In Eq. 19, S˜ 2 =  mean  0 0  Y−mean is the mean value of Y− , respectively. +

Therefore, the partial derivative of v can be derived as:

λ(Y − and set it to zero:

Y−mean )T (Y−



Y−mean )v

+

2λ(Y+mean T Y−mean

+

ED



M

∂J = YT Yv − YT (1N×1 + b) + cS˜ 2 v + λ(Y+ − Y+mean )T (Y+ − Y+mean )v+ ∂v

v =(YT Y + cS˜ 2 + λ(Y+ − Y+mean )T (Y+ − Y+mean ) + λ(Y− − Y−mean )T (Y− − Y−mean )+

(21)

PT

2λ(Y+mean T Y−mean + Y−mean T Y+mean ))−1 YT (1N×1 + b).

(20)

Y−mean T Y+mean )v,

CE

in addition, the vector b, whose iteration function is designed as: ∂J = −2(Yv − 1N×1 − bN×1 ) = −2e, ∂b

(22)

AC

Obviously, the weight vectors u and v are depend on the margin vector b, which controls the distance from the corresponding sample to the classification hyperplane. And the elements in b are non-negative. It expects that the classification hyperplane can separate all samples belong to different class rightly. And e is the error. Moreover, according to Eq. 18 and 21), it can be found that u and v are determined by each other. At the first step in iteration, v can be derived by initialized u and b. Then v and b can determine the new value of u. The b can be recalculated

ACCEPTED MANUSCRIPT 16

by measuring the difference between the value of error in previous step and the current step. The iterative function of b can be written as follows:      b1 ≥ 0 ,     b( k + 1) = b(k) + ρ(e(k) + |e(k)|)

(23)

Finally, the discriminant function of BPDMatMHKS for an input sample Ai ∈ Rd1×d2 is defined

as:

CR IP T

     > 0, A ∈ class+ T g(A) = u Ai v˜ + v0     < 0, A ∈ class−

,

(24)

The procedure of BPDMathMHKS in binary-class situation is shown by Table III, which is written by Pseudo code. According to Table III, the proposed BPDMatMHKS is the same as

AN US

MatMHKS if the parameter λ is set to zero. TABLE III

Algorithm II: BPDMatMHKS Input: S = {(A1 , ϕ1 ), ..., (AN , ϕN )}, Sb+ = A+1 , ..., A+n+ ;Sb− = A−1 , ..., A−n− ;

Output: the weight vectors u, v, and the bias v0 ;

M

1. Initialization: c ≥ 0, λ ≥ 0, 0 < ρ < 1; b(1) ≥ 0, u(1); k = 1;

2. Training:

ED

for k = 1, k ≤ kmax , k = k + 1

Set Y = [y1 , y2 , ..., yN ]T , where yi = ϕi [u(k)T Ai , 1], i = 1, 2, ..., N;

Set Y+ = [y+1 , y+2 , ..., y+n+ ]T , where y+j = [u(k)T A+j , 1], j = 1, 2, ..., n+ ;

PT

Set Y− = [y−1 , y−2 , ..., y−n− ]T , where y−j = [u(k)T A−j , 1], j = 1, 2, ..., n− ;

Obtain v(k) = (YT Y + cS˜ 2 + λ(Y+ − Y+mean )T (Y+ − Y+mean ) + λ(Y− − Y−mean )T (Y− − Y−mean )+ 2λ(Y+mean T Y−mean + Y−mean T Y+mean ))−1 YT (1N×1 + b(k));

CE

Obtain e = Yv − 1N×1 − b(k);

Obtain b(k + 1) = b(k) + ρ(e(k) + |e(k)|);

if k b(k + 1) − b(k) k> ξ, and k b(k + 1) − b(k) k≤ temp(k ≥ 2) ,

AC

temp =k b(k + 1) − b(k) k; P P u(k + 1) = ( iN Ai v˜ (k)˜v(k)T ATi + cS1 + λ ni + (A+i − A+mean )˜v(k)˜v(k)T (A+i − A+mean )T + P λ ni − (A−i − A−mean )˜v(k)˜v(k)T (A−i − A−mean )T + 2λA+mean v˜ (k)˜v(k)T A−mean T + P P 2λA−mean v˜ (k)˜v(k)T A+mean T )−1 ( iN ϕi Ai v˜ (k)(1 + bi (k)) + iN Ai v˜ v0 + 2λA+mean v˜ v0 + 2λA−mean v˜ v0 );

else,

Stop; end if end for

ACCEPTED MANUSCRIPT 17

TABLE IV The description of the parameters. Related parameters

Values

Remark

c

10−2 , 10−1 , 100 , 101 , 102

Selected

λ

10−2 , 10−1 , 100 , 101 , 102

Selected

MatMHKS MHKS SVM(Linear) SVM(RBF)

c

−1

0

1

2

Selected

−2

−1

0

1

2

Selected

−2

−1

0

−2

−1

0

−2

−1

0

−2

−1

0

10 , 10 , 10 , 10 , 10

c

10 , 10 , 10 , 10 , 10 1

2

Selected

1

2

Selected

1

2

Selected

c1

1

10 , 10 , 10 , 10 , 10

2

Selected

c2

10−2 , 10−1 , 100 , 101 , 102

Selected

c

10 , 10 , 10 , 10 , 10

c

10 , 10 , 10 , 10 , 10 10 , 10 , 10 , 10 , 10

σ

AN US

TWSVM

−2

CR IP T

Algorithm BPDMatMHKS

IV. Experiments

In this section, the experiments are designed to investigate the feasibility and effectiveness of BPDMatMHKS. The proposed BPDMatMHKS is compared with MatMHKS, MHKS , SVM and TWSVM on 20 UCI benchmark and 6 image data sets. More concretely, this section is divided

M

into five parts. First, the basic setting of all involved algorithms are given. In the second part, the corresponding experimental results and analyses on matrix-based data sets such as images

ED

are given. In the third part, the corresponding experimental results and analyses on the UCI benchmark data sets are given. In the fourth part, the effect of the varying parameters (including further discussed.

PT

c, λ) are discussed. Finally, classification performance of the proposed regularization term is

CE

A. Experimental setting

In the experiments, the parameters used in all algorithm are shown in Table IV. According to

AC

this table, the algorithms such as MHKS, MatMHKS and BPDMatMHKS are related to original Ho-Kashyap. Therefore, they share some common parameters. For BPDMatMHKS, the number of nearest neighbors is fixed to 5. For SVM, the linear kernel, K(xi , x j ) = xTi x j , and the Radial Basis Function kernel (RBF), K(xi , x j ) = exp(−||xi − x j ||2 /σ), are taken into consideration. For TWSVM, the linear kernel is used.

In the detail method of classification, the one-against-one classification strategy [17] is utilized here to deal with the multi-classes problems. After all results of one-against-one classification

ACCEPTED MANUSCRIPT 18

have been obtained, the voting role is used to combine all results and, as a result, outputs the final label according to majority rule. To get the average validation performance and avoid the random deviation, we combine the cross-validation with hold-out. In detail, we use 5-fold crossvalidation in the outer layer. Each data set used in the experiments is randomly split into 5 parts, where the one for testing and the others for training. Inside every cross-validation, 4 parts are further used to do the 4-fold cross-validation to get the optimal parameters such as c and

CR IP T

λ in BPDMatMHKS. Then, we use the obtained optimal parameters to train the classification hyperplane on the 4 parts samples. Finally, the classification hyperplane is tested on the 5th unseen part. The procedure of the experiment is shown in Fig. 5. All the computations are performed on Intel i5 6600K with 3.50GHz, 16G RAM DDR4, Microsoft Windows 10, and

round

1

2

3

5

3

4

5

2

3

4

5

1

3

4

5

1

2

4

5

1

2

3

5

1

2

3

4

1

2

3

4

2

2

3

4

1

3

3

4

1

2

4

4

1

2

3

3 4

M

5

1

1

ED

for every training set 2

3

test set

2

2 4

training set

1

1

5-fold

AN US

MATLAB R2015b environment.

4

…...

PT

use the obtained best parameters to train the training set

CE

obtain the parameters corresponding to the best result in the 4-fold cross-validation

Fig. 5. In the process of the experiment, the 5-fold corss-validation is combined with the hold-out strategy. Then, the optimal

AC

parameters are selected through training on the 4 parts seen training set, thus avoiding excessive optimistic results.
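The protocol in Fig. 5 can be written down compactly. The sketch below is illustrative only: the nearest-mean stand-in classifier, the random data and the helper name are ours, not the paper's. It runs 5 outer folds, selects (c, λ) by 4-fold inner cross-validation on the 4 training parts, retrains with the selected pair, and tests on the held-out part.

```python
import numpy as np

def accuracy(X_tr, y_tr, X_te, y_te, c, lam):
    """Stand-in for training/evaluating a classifier with hyper-parameters (c, lam).
    Here: a trivial nearest-mean rule; c and lam are accepted but unused."""
    mu_pos, mu_neg = X_tr[y_tr == 1].mean(0), X_tr[y_tr == -1].mean(0)
    pred = np.where(np.linalg.norm(X_te - mu_pos, axis=1)
                    < np.linalg.norm(X_te - mu_neg, axis=1), 1, -1)
    return np.mean(pred == y_te)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=100) > 0, 1, -1)
grid = [10.0 ** p for p in range(-2, 3)]            # {0.01, 0.1, 1, 10, 100}

folds = np.array_split(rng.permutation(len(X)), 5)  # outer 5-fold split
outer_scores = []
for i in range(5):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
    inner = np.array_split(train_idx, 4)            # inner 4-fold CV over (c, lam)
    best, best_score = None, -1.0
    for c in grid:
        for lam in grid:
            s = np.mean([accuracy(
                    X[np.concatenate([inner[m] for m in range(4) if m != k])],
                    y[np.concatenate([inner[m] for m in range(4) if m != k])],
                    X[inner[k]], y[inner[k]], c, lam) for k in range(4)])
            if s > best_score:
                best, best_score = (c, lam), s
    # retrain on all 4 parts with the selected pair, test on the unseen part
    outer_scores.append(accuracy(X[train_idx], y[train_idx],
                                 X[test_idx], y[test_idx], *best))

print(np.mean(outer_scores))
```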

ACCEPTED MANUSCRIPT 19

TABLE V Description of the used image data sets. Size

Dimension

Class

Original Matrix Size

Coil-20

1440

1024

20

32×32

Letter

500

432

10

24×18

ORL

400

644

40

28×23

Yale

165

1024

15

32×32

YaleB

2414

1024

38

32×32

JAFFE

213

1024

10

32×32

B. Classification comparison on image data sets

CR IP T

Dataset

AN US

In this subsection, the performance of BPDMatMHKS is compared with that of the other classifiers on six image data sets: Coil-20¹, Letter², Orl-Faces³, Yale⁴, YaleB⁵, and JAFFE⁶. The results lay emphasis on the classification accuracy. In addition, the parameters c and λ are discussed in this section.

(1) Data introduction and performance comparison

M

Coil-20 with 20 objects is a data set of gray-scale images. The rotation of each object is recorded by a fixed camera and each object has 72 images with the size 32 × 32. Letter with

ED

500 samples contains 10 handwritten digits from ”0” to ”9”. Therefore, the Letter is a date

set with 10 classes and each class has 50 images. Orl-Faces with 10 different images for each person is the set of the facial expression images. These images with the size 28 × 23 come

PT

form 40 persons. Yale comes from 15 persons and each person has 11 facial expression images. Similar with Yale, YaleB comes from 38 persons and each person has approximately 64 facial

CE

expression images. The Japanese Female Facial Expression (JAFFE) is with 10 persons and each person has approximately 20 facial expression images with the size 32 × 32. Table V

AC

gives the detailed information of all used image data sets. In the experiment, these matrix-based samples are firstly convert in to vector pattern. After that, they can be transformed into different 1

http://www.cs.columbia.edu/CAVE/coil-20.html

2

http://sun16.cecs.missouri.edu/pgader/CECS477/NNdigits.zip

3

http://www.cam-orl.co.uk

4

http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.

5

http://vision.ucsd.edu/ leekc/ExtYaleDatabase/ExtYaleB.html.

6

http://www.kasrl.org/jaffe info.html.

ACCEPTED MANUSCRIPT 20

TABLE VI Test classification accuracy (%) for BPDMatMHKS, MatMHKS, MHKS, and SVM on image data sets (The best result on each data set is written in bold). MHKS

model

accuracy±std

accuracy±std

(1×1024)

95.9 ± 4.24

95.28 ± 3.91

95.78 ± 3.8

94.6 ± 4.87

(3×256) (4×128) letter

(1×432) (2×216) (4×108) (8×56)

ORL

(1×644) (2×322) (4×161) (7×92)

Yale

(1×1024) (2×512) (4×256) (8×128)

YaleB

(1×1024) (2×512) (4×256) (8×128)

JAFFE

(1×1024) (2×512) (4×256)

95.68 ± 4.22 91.97 ± 4.5

92.6 ± 3.44

92.8 ± 3.56

92.8 ± 2.59

92.8 ± 2.59

98.25 ± 2.44

98.25 ± 2.44

92.4 ± 2.7

92.6 ± 3.44

98.25 ± 2.44 98 ± 2.27

97.75 ± 2.24

92.6 ± 3.05

92.4 ± 3.71

98.25 ± 2.44 98 ± 2.27

97.75 ± 2.24

76.89 ± 11.95

76.89 ± 11.95

76.89 ± 9.34

76.89 ± 9.34

78.67 ± 10.95 71.56 ± 13.58 82.28 ± 19.51 79.5 ± 20.33

73.49 ± 22.08 70.1 ± 24.85 99.58 ± 0.93

78.67 ± 10.95 74.22 ± 14.67 84.45 ± 18.77 85.4 ± 17.55

100 ± 0

99.58 ± 0.93 100 ± 0 -/-/-

TWSVM (Linear)

accuracy±std

accuracy±std

accuracy±std

accuracy±std

93.47 ± 3.42

96.85 ± 3.35

94.47 ± 3.87

92.92 ± 3.9

92.4 ± 1.67

93.2 ± 3.11

93.4 ± 3.36

92.8 ± 2.77

98 ± 2.27

98.25 ± 2.09

98.75 ± 1.25

97.25 ± 2.05

76.44 ± 12.85

78.44 ± 15.19

77.11 ± 15.31

80.67 ± 7.6

62.41 ± 28.67

86.93 ± 16.51

86.2 ± 17.39

79.89 ± 16.9

99.58 ± 0.93

99.17 ± 1.14

99.17 ± 1.14

100 ± 0

6/0/0

2/2/2

3/3/0

3/1/2

87.95 ± 14.6

86.38 ± 15.98 99.58 ± 0.93 100 ± 0

99.58 ± 0.93 100 ± 0 1/1/4

CE

Win/Loss/Tie

94.23 ± 3.55

PT

(8×128)

95.53 ± 3.86

SVM (RBF)

AN US

(2×512)

SVM (Linear)

CR IP T

MatMHKS

M

coil-20

BPDMatMHKS

ED

Dataset

matrix-based size when they are used in BPDMatMHKS and MatMHKS. For instance, the data

AC

set Coil-20 with 1024 dimensions is transformed into the matrix patterns such as 1 × 1024,

2 × 512, 4 × 256, 8 × 128, 16 × 64 and 32 × 32. The 4 best classification result is marked in

bold. The classification accuracy results of BPDMatMHKS, MatMHKS, MHKS, SVM (Linear), SVM (RBF) and TWSVM(Linear) are all listed in Table VI. In Table VI, rather than the vector-based classifier use the original high dimensional vectorbased sample, the matrix-based classifier such as BPDMatMHKS and MatMHKS, converting

the dimensions form d1 × d2 to d1 + d2, still perform well in the lower dimensions condition. In

ACCEPTED MANUSCRIPT 21

100

100

95

90

80

80 75

60

coil-20 letter ORL Yale YaleB JAFFE

70 65 60 0.01

0.1

1

10

70

50

40 0.01

100

c

coil-20 letter ORL Yale YaleB JAFFE

CR IP T

85

Accuracy(%)

Accuracy(%)

90

0.1

1

10

100

λ

(a) parameter c

(b) parameter λ

AN US

Fig. 6. The classification accuracy(%) of BPDMatMHKS with varying parameter c and λ.

addition, the performance of BPDMatMHKS is competitive with MatMHKS and SVM in data sets with high dimensions.

(2) The influence of the parameter c and λ in the experiment

The parameter c controls the term Rreg , while λ control the new proposed regularization term

M

Rbpd in Eq. 2. Both the two parameters are selected form {0.01, 0.1, 1, 10, 100}. First, we fix λ

to 0.01 and choose c from the range. For convenience, only the value of the best matrix form

ED

on the data set is recorded. For parameter λ, we fix c to 0.01 and discuss the λ in the range of options. The sub-figure (a) in Fig .6 shows the classification accuracy with varying c and

PT

sub-figure (b) is corresponding to the varying λ. In sub-figure (a), the image data sets such as JAFFE and Coli-20 share the same trend that

CE

the accuracy is declining while c is increasing. Conversely, the trend in YaleB is that the higher c is corresponding to higher accuracy. The trend in Yale and ORL is stable. For Letter data set, the accuracy keeps increasing while c is less than 10 and it declines when c is equal to 100. In

AC

sub-figure (b), every image data set shares the same trend: the accuracy is stable when λ is less than 1 and keeps declining when λ is larger than 1.

C. Classification comparison on UCI data sets

In this subsection, the proposed BPDMatMHKS is validated as to whether it is effective on vector-based data sets. Table VII shows the detailed description of the used UCI benchmark data sets. Since BPDMatMHKS and MatMHKS can process matrix-based samples directly,

ACCEPTED MANUSCRIPT 22

TABLE VII Information of UCI Data Sets Size

Dimension

Class

Data set

Size

Dimension

Class

Water

116

38

2

Wine

178

13

3

Iris

150

4

3

Sonar

208

60

2

Ionosphere

351

34

2

House Colic

366

27

2

House Vote

435

16

2

Wdbc

569

30

2

Breast Cancer Wisconsin

699

9

2

Transfusion

768

8

2

Hill Valley

1372

4

2

CMC

Secom

1567

590

2

Semeion

Waveform

5000

21

3

Statlog

Marketing

8993

9

13

Pendigits

748

4

2

1212

100

2

1473

9

3

1593

256

10

6435

36

6

10992

16

10

AN US

Pima indians Banknote Authentication

CR IP T

Data set

these UCI benchmark data sets represented with vector forms can be transformed into their corresponding matrix forms. In general, one original vector-based pattern can be reshaped in multiple matrix forms. In this experiment, the dimensions of these data sets are not as high

M

as image data sets. For instance, the Sonar data set with 60 dimensions can be transformed into the matrix patterns such as 1 × 60, 2 × 30, 3 × 20, 4 × 15, 5 × 12 and 6 × 10. These

ED

kinds of matrix size are calculated in BPDMatMHKS and MatMHKS. Table VIII gives the results from all the implemented algorithms on the used UCI data sets. For each data set, the

PT

optimal matrix size of BPDMatMHKS and MatMHKS is recorded in the table. And the best result of different algorithm is marked in bold. Moreover, Friedman test [14] and Win/Loss/Tie

CE

counts are adapted to validate the difference between the proposed BPDMatMHKS and compared algorithms. Further, the parameter c and λ are discussed. (1) Classification comparison with other algorithms

AC

In this subsection, the classification performance of the proposed BPDMatMHKS and the other

comparison algorithms are listed in Table VIII. In this experiment, BPDMatMHKS outperforms MatMHKS on 14 data sets, which can be attributed to the assistance of the boundary information. When BPDMatMHKS is compared with MHKS, the proposed BPDMatMHKS is well over MHKS, since the BPDMatMHKS takes advantage of not only the different matrix models but

the boundary information. The SVM (RBF) uses the kernel function that maps the original space to higher dimensions space, thus solving the issue of non-linear separable. Compared

ACCEPTED MANUSCRIPT 23

with SVM (RBF), BPDMatMHKS, as a linear classifier, is still competitive with SVM (RBF) according to the Win/Loss/Tie counts and Friedman average rank. In addition, BPDMatMHKS outperforms TWSVM (Linear) in 17 data sets, which means the performance of BPDMatMHKS is obviously better than TWSVM. In generally, the BPDMatMHKS has obvious advantage when it is compared with the other linear classifier and it is competitive with the non-linear classifier SVM (RBF).

CR IP T

The Friedman test is a non-parametric statistical test, which is used to detect whether the used algorithms are similar. If these algorithm are at the same value, the average rank of different algorithm should be the same. According to original Friedman test, suppose there are k algorithms are compared on N data sets. The ri is the average rank of the ith algorithm. Then, ri obeys normal distribution, whose mean value and variance are (k + 1)/2 and (k2 + 1)/12. the variable

AN US

τχ2 can be written as follows:

 k  12N X 2 k(k + 1)2  τχ2 =  ri −  , k(k + 1)  i=1 4

(25)

the aforementioned Friedman test, however, is too conservative. Now the typically used variable (N − 1)τχ2 , N(k − 1) − τχ2

M

is: τF =

(26)

τF is given as follows:

PT

k X

ED

and the τF obeys the F distribution whose degrees of freedom is k − 1 and (k − 1)(N − 1). P According to Table VIII, the value of ki=1 ri 2 can be calculated and the process of calculating

! 12 × 20 6 × (6 + 1)2 77.2362 − = 21.3497, τχ 2 = 6(6 + 1) 4

AC

CE

i=1

ri2 = 2.2752 + 3.3252 + 4.352 + 3.82 + 2.82 + 4.452 = 77.2362,

τF =

(20 − 1) × 32.9386 = 5.1576, 20 × (6 − 1) − 21.3497

(27) (28) (29)

Compared with the critical value of F(k − 1, (k − 1)(N − 1)) = F(5, 95) = 2.31, the τF = 5.1576

is larger than the value 2.31, which means the hypothesis that all used algorithms are similar

is rejected. Accordingly, the performance of these algorithm is significant different. Moreover, the Nemenyi post-hoc test is used to distinguish the difference of these algorithm. The related

ACCEPTED MANUSCRIPT 24

TABLE VIII Test classification accuracy (%) and t-test comparison for BPDMatMHKS, MatMHKS, MHKS, SVM and TWSVM on UCI datasets. BPDMatMHKS

MatMHKS

MHKS

SVM (linear)

SVM (RBF)

TWSVM (Linear)

Accuracy±std

Accuracy±std

Accuracy±std

Accuracy±std

Accuracy±std

Accuracy±std

Model/Rank

Model/Rank

Rank

Rank

Rank

Rank

Water

96.67 ± 3.49

95.83 ± 2.95

95.00 ± 5.43

96.50 ± 3.56

98.33 ± 2.28

91.67 ± 6.59

Wine

94.59 ± 5.73

92.43 ± 5.20

95.55 ± 3.01

96.76 ± 2.26

97.84 ± 2.26

94.22 ± 4.23

Iris

98.00 ± 2.98

98.00 ± 2.98

97.33 ± 3.65

96.67 ± 4.71

97.33 ± 3.65

97.33 ± 3.65

Sonar

64.88 ± 17.33

61.54 ± 13.83

56.33 ± 13.64

58.84 ± 14.93

55.05 ± 21.63

57.08 ± 18.29

Ionosphere

86.69 ± 2.00

84.90 ± 3.05

85.27 ± 6.61

85.25 ± 5.98

93.75 ± 2.88

82.72 ± 5.57

80.80 ± 5.37

80.80 ± 5.37

60.11 ± 3.51

81.39 ± 4.97

81.39 ± 4.97

80.85 ± 4.18

94.70 ± 2.09

91.75 ± 2.76

93.33 ± 1.51

94.00 ± 1.93

93.78 ± 1.10

92.18 ± 2.06

96.32 ± 1.68

95.09 ± 2.63

95.45 ± 1.51

98.08 ± 0.94

98.25 ± 0.86

96.86 ± 1.30

Breast Cancer Wisconsin

96.45 ± 2.84

96.73 ± 2.58

95.60 ± 4.23

96.73 ± 1.69

95.73 ± 3.20

94.89 ± 5.84

Transfusion

76.21 ± 0.46

78.61 ± 3.83

76.34 ± 0.48

69.71 ± 13.87

66.61 ± 20.85

76.21 ± 0.46

Pima Indians Diabetes

76.17 ± 2.47

76.70 ± 2.61

76.43 ± 2.10

74.61 ± 3.78

75.01 ± 4.42

76.82 ± 3.03

Hill Valley

50.58 ± 1.81

50.17 ± 1.48

49.92 ± 1.25

48.50 ± 3.21

49.74 ± 3.57

50.17 ± 2.66

99.05 ± 0.66

98.83 ± 0.75

97.89 ± 0.60

98.98 ± 0.70

100.00 ± 0.00

97.74 ± 0.70

51.79 ± 2.08

51.52 ± 2.21

49.48 ± 3.03

43.45 ± 6.13

48.39 ± 7.32

47.90 ± 3.35

93.36 ± 0.11

92.72 ± 0.50

82.92 ± 22.50

72.11 ± 13.25

93.36 ± 0.11

92.41 ± 0.70

94.66 ± 1.34

94.09 ± 0.91

92.13 ± 1.51

93.71 ± 0.67

95.48 ± 1.80

91.70 ± 1.44

Waveform

87.08 ± 1.07

86.94 ± 0.84

86.26 ± 1.08

86.78 ± 0.99

86.96 ± 0.88

86.70 ± 0.80

Statlog

85.64 ± 1.45

85.05 ± 1.60

83.90 ± 1.30

84.26 ± 2.09

87.96 ± 0.94

83.93 ± 2.06

Marketing

32.22 ± 0.92

32.34 ± 0.93

31.52 ± 0.79

27.98 ± 0.78

27.80 ± 2.83

30.45 ± 0.70

Penbased

98.01 ± 0.56

96.21 ± 0.63

96.85 ± 0.66

98.06 ± 0.58

99.51 ± 0.19

96.26 ± 0.66

Win/Loss/Tie

-/-/-

14/4/2

17/3/0

15/5/0

10/9/1

17/2/1

average rank

2.275

3.325

4.35

3.8

2.8

4.45

1x4 / 1.5 6x10 / 1

1x34 / 1.5

Horse Colic

1x27 / 4.5

House Vote

1x16 / 1

Wdbc

1x30 / 4 3x3 / 3

PT

1x8 / 4

10x10 / 1

Banknote Authentication

AC

Secom

CE

1x4 / 2

Cmc

Semeion

1x13 / 6

1x9 / 1

5x118 / 1.5 2x128 / 2 1x21 / 1 1x36 / 2 1x13 / 2 1x16 / 3

3

1x4 / 2

3x20 / 2 1x34 / 5

1x27 / 4.5 2x8 / 6

2x15 / 5

3x3 / 1.5 1x4 / 1

ED

2x2 / 3.5

5

4

5

1x8 / 2

4x25 / 2.5 1x4 / 4 1x9 / 2

10x59 / 3

2x128 / 3 1x21 / 3 1x36 / 3 1x13 / 1 1x16 / 5

3 2

6

3

1

3

6

6 5

2

3

4

5 3

5

5

6 6 3 4

6

CR IP T

1x13 / 4

1x38 / 4

1

4

6

AN US

1x38 / 2

M

Data set

4

1.5 2

2

1.5 5 6

6

3 6

6 4

4 4 5 2

1

1.5 3

1

4

6

5

5

1 4

1.5 1

2 1 6 1

5

4

4 6

3

5

3 6

3.5 1

2.5 6 5

4 6

5 5 4 6

ACCEPTED MANUSCRIPT 25

average ranks differ named Critical Difference (CD) in Nemenyi post-hoc test can be written as follow: CD = qα

r

k(k + 1) 6N

(30)

According to the literature [13], the value of qα is 2.85 for six algorithms with α = 0.05.

CD = 2.85 ×

r

6(6 + 1) = 1.6861, 6 × 20

CR IP T

Therefore, the value CD is equal to:

(31)

If the difference between two algorithms is over the value of CD, it means the two algorithms are not similar. As is shown by Table VIII, it can be found that BPDMatMHKS is much better than MHKS, SVM (Linear) and TWSVM (Linear), since the difference of average rank between these algorithms is over 1.6816. According to average rank in Table VIII, the performance

AN US

of BPDMatMHKS is still superior to SVM (RBF) and MatMHKS, since the average rank of BPDMatMHKS is the lowest.

(2) The influence of the parameter c and λ in the experiment

In this subsection, the influence of the parameter c and λ is discussed in Fig. 7 and .8. The experiment setting is the same as we do in the image data sets.

M

Fig. 7 shows the classification performance of varying c. It can been found that the classification performance is stable in most data sets. For some data sets including Wine, Ionosphere,

ED

Sonar, House vote and Horse colic, the classification accuracy keeps stable when c is less than 10 and declines when c is equal to 100. For secom data set, the classification accuracy keep {0.1, 1, 10}.

PT

increasing while c is increasing. In generally, we can find that the range of c can be shrank to

CE

Fig. 8 shows the classification performance of varying λ. The classification accuracy keeps

stable or ascending when λ is less than 10. It is getting worse when λ is equal to 100, except

AC

for the secom and transfusion data sets. In these two data sets, the classification performance is increasing with λ increasing. In generally, the best choice of the parameter of c and λ is from {0.1, 1, 10}.

(3) Convergence analysis Here, we discus the convergence of the proposed BPDMatMHKS. As is shown in Fig. 9, it can

be found that the convergence speed of the BPDMatMHKS is fast at the beginning. The value of ||b(k + 1) − b(k)|| is stable and close to 0 when the iteration number is over 10. In addition,

about half of the data sets is satisfied with the break condition when the iteration number is less

Fig. 7. The classification accuracy (%) of BPDMatMHKS with varying parameter c.

In addition, about half of the data sets satisfy the break condition when the iteration number is less than 10. Therefore, it can be concluded that the proposed BPDMatMHKS has a fast convergence speed.
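A minimal sketch of the break condition suggested by this analysis is shown below; `update_b` is a placeholder for one BPDMatMHKS iteration and is not the authors' code.

```python
import numpy as np

def run_until_converged(b0, update_b, tol=1e-4, max_iter=30):
    """Iterate until ||b(k+1) - b(k)|| drops below tol or max_iter is reached."""
    b = np.asarray(b0, dtype=float)
    history = []
    for _ in range(max_iter):
        b_new = update_b(b)                  # placeholder for one BPDMatMHKS update step
        diff = np.linalg.norm(b_new - b)     # ||b(k+1) - b(k)|| as plotted in Fig. 9
        history.append(diff)
        b = b_new
        if diff < tol:                       # break condition
            break
    return b, history
```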

D. Difference between matrix model and vector model

As described in Section II, the matrix model and the vector model can be converted to each other according to the Kronecker product. Inspired by this, we can measure the difference between the solution space of the matrix-based algorithms and that of the vector-based algorithm. To reflect the difference in the weights required to be optimized in the vector and matrix cases, we first convert the solutions u and ṽ of BPDMatMHKS and MatMHKS to the vector form ṽᵀ ⊗ u. After that, we get the solution w̃ of MHKS.

Fig. 8. The classification accuracy (%) of BPDMatMHKS with varying parameter λ.

Then, we unitize the obtained solutions of BPDMatMHKS, MatMHKS and MHKS and measure the difference between these solution vectors by using the 2-norm. The formula is as follows:

$$\left\| \frac{\tilde{v}^T \otimes u}{\|\tilde{v}^T \otimes u\|_2} - \frac{\tilde{w}}{\|\tilde{w}\|_2} \right\|_2, \qquad (32)$$

The range of Eq. 32 is from 0 to 2. Therefore, the value 1 can be seen as a threshold reflecting whether the difference is obvious. In addition, we also measure the difference of the solution vectors between BPDMatMHKS and MatMHKS under the same matrix model. The result is shown in Table IX. Since the multi-class data sets have too many solution vectors to reflect the difference clearly, only the binary-class data sets are used in this experiment. In addition, Table IX also shows the required sum of weights in the matrix (d1 + d2) and vector (d1 × d2) cases.
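The conversion and the distance of Eq. 32 can be sketched with NumPy as follows; the weight vectors u, ṽ and w̃ below are random stand-ins rather than trained solutions, and the Kronecker ordering used here is only one possible convention.

```python
import numpy as np

def solution_distance(u, v, w):
    """Eq. 32: 2-norm distance between the unitized matrix-model solution
    (the Kronecker product of v and u) and the unitized vector-model solution w.
    The returned value lies in [0, 2]."""
    mat_as_vec = np.kron(v, u)                 # vectorized two-sided weights
    a = mat_as_vec / np.linalg.norm(mat_as_vec)
    b = w / np.linalg.norm(w)
    return np.linalg.norm(a - b)

# Toy example with d1 = 3, d2 = 4, so the vector model has d1 * d2 = 12 weights.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=3), rng.normal(size=4), rng.normal(size=12)
print(solution_distance(u, v, w))              # values above 1 indicate an obvious difference
```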

TABLE IX
Test of the 2-norm distance between BPDMatMHKS, MatMHKS and MHKS on the binary-class data sets (the best result on each data set is written in bold; BPDMatMHKS is abbreviated as BPD and MatMHKS as Mat). For every data set and matrix model, the table reports the model (d1 × d2), the required sums of weights d1 + d2 and d1 × d2, and the pairwise distances BPD–Mat, BPD–MHKS and Mat–MHKS. The average row is: BPD–Mat 0.6996, BPD–MHKS 1.0385, Mat–MHKS 1.0623.

Fig. 9. The value of ||b(k + 1) − b(k)|| of BPDMatMHKS versus the iteration number on the used data sets.

According to Table IX, the average difference between BPDMatMHKS and MatMHKS is 0.6996, which is far less than 1. Therefore, it can be concluded that the difference between BPDMatMHKS and MatMHKS is not very significant. However, the average difference between the matrix-based algorithms (BPDMatMHKS and MatMHKS) and MHKS is over 1. Observing the data sets including Sonar, Hill Valley and Secom further, the difference between the matrix model and the vector model grows as the required number of weights declines. Therefore, it can be concluded that the matrix-based algorithms differ substantially from the vector-based algorithm.

E. Classification performance of the proposed regularization term

In this paper, we propose a new regularization term Rbpd, which concerns the prior structural information of the boundary samples. Through introducing the regularization term into the original MatMHKS, we further extend the framework and improve the classification performance of MatMHKS. In this subsection, we further discuss how the regularization term works on data sets with different dimensions and sizes. Here, we list the classification results of BPDMatMHKS versus MatMHKS in Table X.

TABLE X
Performance difference between BPDMatMHKS and MatMHKS on data sets with different dimensions.
BPDMatMHKS vs MatMHKS | dimensions < 100 | 100 ≤ dimensions < 1000 | 1000 ≤ dimensions
Win/Loss/Tie | 11/4/2 | 3/0/2 | 1/1/2

According to the statistical results, the proposed BPDMatMHKS outperforms MatMHKS when the dimensionality of the data set is less than 100. Compared with MatMHKS, BPDMatMHKS has a great advantage when 100 ≤ dimensions < 1000. For the data sets with more than 1000 dimensions, BPDMatMHKS is still competitive with MatMHKS.

In order to further reflect the stability of the regularization term and to explore which factor affects the results under varying λ, the standard deviation of the classification accuracies for every data set in Fig. 6 and Fig. 8 is calculated first. Here, this standard deviation is denoted Stdλ to reflect the stability of the regularization term: the lower the value of Stdλ, the higher the stability of the regularization term. Then, the corresponding best matrix type and the required sum of weights (d1 + d2) are listed in the table. Moreover, the data size is listed in the table. Since BPDMatMHKS is a binary classifier, the average number of samples per class is more reasonable for reflecting the data size. In addition, the value equal to this average number divided by the required sum of weights is also listed in the table.
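A minimal sketch of how Stdλ and the size-to-weight ratio of Table XI can be computed is given below; the accuracy values, the matrix model and the sample counts are placeholders, not numbers from the paper.

```python
import numpy as np

# Accuracies of one data set for lambda in {0.01, 0.1, 1, 10, 100}; placeholder values.
acc_per_lambda = np.array([85.2, 85.6, 86.1, 86.0, 79.3])
std_lambda = np.std(acc_per_lambda)      # lower value -> more stable regularization term

d1, d2 = 6, 10                           # chosen matrix model d1 x d2 (placeholder)
n_samples, n_classes = 208, 2
average_size = n_samples / n_classes     # average number of samples per class
ratio = average_size / (d1 + d2)         # "Average size / (d1 + d2)" column of Table XI
print(std_lambda, ratio)
```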

According to Table XI, we can get the correlation coefficients between Stdλ and the other factors. The correlation coefficient ρXY reflecting the relevance between dimensions X and Y can be calculated as follows:

$$\rho_{XY} = \frac{\mathrm{COV}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}, \qquad (33)$$

TABLE XI
The used model, required sum of weights, average size, the proportion between the average size and the required sum of weights, and Stdλ (%) of BPDMatMHKS on all used data sets (the best result on each data set is written in bold). Columns: Data set, Model (d1 × d2), d1 + d2, Average size, Average size/(d1 + d2), Stdλ.

$$\mathrm{COV}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big], \qquad (34)$$

where E[X] is the mathematical expectation of dimension X, and Var(X) and Var(Y) are the variances of dimensions X and Y, respectively. In Table XI, the correlation coefficient between Stdλ and the required sum of weights is 0.5659, which means that the larger the number of required weights, the lower the stability of the regularization term. Moreover, the correlation coefficient between Stdλ and the average size is -0.6059, which means that the stability of the regularization term becomes better as the average size increases. In addition, the correlation coefficient between Stdλ and average size/(d1 + d2) is -0.6129, which means that the stability of the regularization term is related to the proportion between the data size and the number of required weights. This result accords with the learning theory that more samples and fewer dimensions bring better generalization and stability.
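Eqs. 33–34 correspond to the ordinary Pearson correlation; a minimal NumPy sketch is given below, where the input arrays would be the corresponding columns of Table XI (the example call in the comment is hypothetical).

```python
import numpy as np

def corr(x, y):
    """Pearson correlation coefficient of Eqs. 33-34."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))         # Eq. 34
    return cov / (np.sqrt(x.var()) * np.sqrt(y.var()))     # Eq. 33

# Hypothetical usage: corr(std_lambda_column, required_weights_column)
# would return about 0.5659 with the values reported in the paper.
```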

V. Rademacher Complexity Analysis and Stability Test

In this section, the Rademacher complexity [16] is used to estimate the generalization risk bound [23] of the proposed BPDMatMHKS. The Rademacher complexity is different from the well-known Vapnik-Chervonenkis (VC) dimension theory [31]: the VC dimension is independent of the distribution of the data, whereas the Rademacher complexity takes the data distribution into account. Accordingly, it has been demonstrated to be an effective technique for measuring the generalization risk bound of a classifier [34].

Suppose the training data set is $D = \{x_i, y_i\}_{i=1}^{N}$, where $y_i$ is the label of the sample $x_i$. For the binary-classification problem, suppose the hypothesis space is $H$ and let $Z = X \times \{-1, 1\}$. Then the hypothesis $h$ can be written as:

$$f_h(z) = f_h(x, y) = \mathbb{I}(h(x) \neq y), \qquad (35)$$

Then the range of the hypothesis space $H$ transforms from $\{-1, 1\}$ to $[0, 1]$, and the corresponding function space is $F_H = \{f_h, h \in H\}$. Then, the empirical Rademacher complexity can be written as:

$$\hat{R}_Z(F_H) = E_{\sigma}\Big[\sup_{f_h \in F_H} \frac{1}{N}\sum_{i=1}^{N} \sigma_i f_h(x_i, y_i)\Big] = \frac{1}{2} E_{\sigma}\Big[\sup_{h \in H} \frac{1}{N}\sum_{i=1}^{N} \sigma_i h(x_i)\Big] = \frac{1}{2}\hat{R}_D(H), \qquad (36)$$

Taking the mathematical expectation of Eq. 36, we can get:

$$R_N(F_H) = \frac{1}{2} R_N(H), \qquad (37)$$
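The empirical Rademacher complexity of Eq. 36 can be approximated by Monte-Carlo sampling of the Rademacher variables σ. The sketch below additionally restricts the supremum to a finite set of candidate hypotheses, which is an assumption made only for this illustration and is not how the bound is used in the paper.

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=1000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity R_D(H) in Eq. 36.
    `predictions` is an (m, N) array holding the {-1, +1} outputs of m candidate
    hypotheses from H on the N training samples (a finite-candidate approximation)."""
    m, N = predictions.shape
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)      # Rademacher variables
        total += np.max(predictions @ sigma) / N     # sup_h (1/N) * sum_i sigma_i * h(x_i)
    return total / n_draws

# Example: 50 random sign hypotheses evaluated on 100 samples.
rng = np.random.default_rng(1)
H_outputs = rng.choice([-1.0, 1.0], size=(50, 100))
print(empirical_rademacher(H_outputs))
```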

Theorem 1. In the hypothesis space $H: X \to [-1, 1]$, the data set $X = \{x_1, x_2, ..., x_N\}$, $x_i \in X$, is a random sample set independently chosen according to the distribution $\mathcal{D}$. Then, with probability at least $1 - \delta$, $\delta \in (0, 1)$, every $h \in H$ satisfies:

$$E[h] \le \hat{E}(h) + R_N(H) + \sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (38)$$

$$E[h] \le \hat{E}(h) + \hat{R}_D(H) + 3\sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (39)$$

Then, we use $R_N(h_{BPDMatMHKS})$, $R_N(h_{MatMHKS})$ and $R_N(h_{MHKS})$ to denote the Rademacher complexities of BPDMatMHKS, MatMHKS and MHKS, respectively. The objective function of BPDMatMHKS is inherited from MatMHKS and, beyond that, it has one more regularization term Rbpd than MatMHKS. Therefore, the hypothesis space H of BPDMatMHKS is smaller than that of MatMHKS, and we can get $\{h_{BPDMatMHKS}\} \subseteq \{h_{MatMHKS}\}$. According to Eq. 36, we have:

$$R_N(h_{BPDMatMHKS}) \le R_N(h_{MatMHKS}). \qquad (40)$$

For MatMHKS and MHKS, it is known that MatMHKS originates from MHKS and that the solution space of MatMHKS is decomposed from that of MHKS according to the Kronecker product decomposability constraint. Accordingly, the hypothesis space H of MHKS contains the hypothesis space H of MatMHKS, so we can get $\{h_{MatMHKS}\} \subseteq \{h_{MHKS}\}$. Similarly, it can be obtained that:

$$R_N(h_{MatMHKS}) \le R_N(h_{MHKS}). \qquad (41)$$

Based on Eq. 40 and Eq. 41, the relationship among $R_N(h_{BPDMatMHKS})$, $R_N(h_{MatMHKS})$ and $R_N(h_{MHKS})$ can be concluded as:

$$R_N(h_{BPDMatMHKS}) \le R_N(h_{MatMHKS}) \le R_N(h_{MHKS}). \qquad (42)$$

Then, it can be concluded that BPDMatMHKS has a tighter generalization risk bound than both MatMHKS and MHKS.

Although the generalization error bound can be deduced from the Rademacher complexity, we need another way to obtain a concrete estimate of the error bound. In this paper, we analyze the stability of BPDMatMHKS, MatMHKS and MHKS. The stability test observes the change of the output when the input is changed.

Suppose that the training data set $D = \{z_1 = (x_1, y_1), z_2 = (x_2, y_2), ..., z_n = (x_n, y_n)\}$ is independently chosen according to the distribution $\mathcal{D}$, where $x_i \in X$ and $y_i$ is the label of the sample. For the binary-classification problem, suppose the hypothesis space is $H$ and let the function $F: X \to \{-1, 1\}$ be learned from $H$ according to the training samples. Consider the change of D: $D^{\backslash i} = \{z_1, z_2, ..., z_{i-1}, z_{i+1}, ..., z_n\}$ means removing the ith sample from D. The loss function is denoted as $L(F_D, z)$.

Definition 1. For every $x \in X$ and $z = (x, y)$, if the function F satisfies:

$$|L(F_D, z) - L(F_{D^{\backslash i}}, z)| \le \beta, \quad i = 1, 2, ..., n, \qquad (43)$$

then the function F is said to satisfy β-uniform stability.

For every D and $z = (x, y)$, if the loss function L satisfies $0 \le L(F_D, z) \le M$, then [3]:

Theorem 2. Suppose D with n samples is a random sample set independently chosen according to the distribution $\mathcal{D}$. If the function F satisfies β-uniform stability and the upper bound of the loss function L is M, then, with probability at least $1 - \delta$, $\delta \in (0, 1)$, the generalization error satisfies:

$$L(F_D, z) \le \hat{L}(F_D, z) + 2\beta + (4n\beta + M)\sqrt{\frac{\ln(1/\delta)}{2n}}, \qquad (44)$$

where $\hat{L}(F_D, z)$ is the empirical error.

According to Eq. 43, every sample is removed from the data set in turn and the corresponding function $F_{D^{\backslash i}}$ is learned. However, it is time-consuming to retrain for every removed sample. In our experiment, the number of samples removed from each data set is fixed to 5. Therefore, the obtained value of β is relatively larger than the β calculated according to the original definition.
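A minimal sketch of the stability test described above (and of the bound of Eq. 44) is given below. `train_fn` and the 0/1 test loss are placeholders rather than the authors' exact protocol; only the fixed number of removed samples (5) is taken from the text.

```python
import numpy as np

def estimate_beta(train_fn, X, y, X_test, y_test, n_removed=5, seed=0):
    """Estimate beta by removing a few samples one at a time, retraining,
    and recording the largest change of the test loss."""
    rng = np.random.default_rng(seed)
    full = train_fn(X, y)                                   # placeholder training routine
    base_loss = np.mean(full.predict(X_test) != y_test)     # 0/1 loss of the full model
    beta = 0.0
    for i in rng.choice(len(y), size=n_removed, replace=False):
        keep = np.delete(np.arange(len(y)), i)
        reduced = train_fn(X[keep], y[keep])                # model trained on D \ {z_i}
        loss_i = np.mean(reduced.predict(X_test) != y_test)
        beta = max(beta, abs(loss_i - base_loss))
    return beta

def stability_bound(emp_loss, beta, n, M=1.0, delta=0.05):
    """Right-hand side of Eq. 44 for a given beta, sample size n and loss bound M."""
    return emp_loss + 2 * beta + (4 * n * beta + M) * np.sqrt(np.log(1 / delta) / (2 * n))
```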

The result of the experiment is shown in Fig. 10. BPDMatMHKS generally obtains the smallest value of β except for Pima Indians. According to Theorem 2, BPDMatMHKS therefore has a tighter error bound than the other two algorithms. MHKS gets the highest value of β on 8 of the binary-class data sets. MatMHKS ranks first on Pima Indians, but its stability is not good on the data sets including House Vote, Wdbc and Breast Cancer Wisconsin; on the other data sets, MatMHKS ranks second. Broadly speaking, the experimental result is consistent with the theoretical analysis based on the Rademacher complexity. Therefore, BPDMatMHKS has the lowest generalization error bound, which means that it has better generalization ability in terms of both theory and experiment.

VI. Conclusions

The boundary samples, which are easily misclassified, play an important role in classifier design. In this paper, we propose a boundary sample selection method based on the nearest neighbor principle. This method is robust since it can eliminate the influence of noise points and fuzzy points. Further, we use these boundary samples to design a regularization term Rbpd based on boundary projection discrimination.

Fig. 10. The value of β calculated for BPDMatMHKS, MatMHKS and MHKS on the binary-class data sets including Water, Sonar, Ionosphere, Horse Colic, House Vote, Wdbc, Breast Cancer Wisconsin (BCW), Transfusion, Pima Indians, Hill Valley, and Banknote Authentication (BA).

Through introducing the term Rbpd into the original matrix-based classifier MatMHKS, we overcome the drawback that MatMHKS ignores the boundary information of the whole data distribution. In the experiments, we validate the feasibility and effectiveness of BPDMatMHKS: the performance of the proposed BPDMatMHKS is better than that of the original MatMHKS and the other comparison algorithms on the UCI data sets and the image data sets. Further, the influence of the parameters c and λ is discussed in the analysis of the experiments. Finally, we use the Rademacher complexity and a stability test to validate that the proposed BPDMatMHKS has a tighter generalization risk bound.

Acknowledgment

This work is supported by Natural Science Foundations of China under Grant No. 61672227,

Shuguang Program supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission, and the 863 Plan of China Ministry of Science and Technology under Grant No. 2015AA020107.


References

[1] H. Abdi and L.J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
[2] S. Barua, M.M. Islam, X. Yao, and K. Murase. MWMOTE—majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2):405–425, 2014.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(3), 2002.
[4] F. Cakoni and D. Colton. Ill-posed problems. Applied Mathematical Sciences, 188:27–43, 2014.
[5] H. Chen, L.Q. Li, and J.T. Peng. Error bounds of multi-graph regularized semi-supervised classification. Information Sciences, 179(12):1960–1969, 2009.
[6] S. Chen, Z. Wang, and Y. Tian. Matrix-pattern-oriented Ho–Kashyap classifier with regularization learning. Pattern Recognition, 40(5):1533–1543, 2007.
[7] S. Chen, Y. Zhu, D. Zhang, and J.Y. Yang. Feature extraction approaches based on matrix pattern: MatPCA and MatFLDA. Pattern Recognition Letters, 26(8):1157–1167, 2005.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[9] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, 2012.
[10] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[11] K. Hagiwara and K. Kuno. Regularization learning and early stopping in linear networks. In IEEE International Joint Conference on Neural Networks, volume 4, pages 511–516, 2000.
[12] H. Han, W.Y. Wang, and B.H. Mao. Borderline-SMOTE: A new oversampling method in imbalanced data sets learning. Lecture Notes in Computer Science, 3644(5):878–887, 2005.
[13] P. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515–516, 1968.
[14] M. Hollander, D. Wolfe, and E. Chicken. Nonparametric Statistical Methods. John Wiley & Sons, 2013.
[15] Jayadeva, R. Khemchandani, and S. Chandra. Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):905–910, 2007.
[16] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[17] U. Kreßel. Pairwise classification and support vector machines. In Advances in Kernel Methods, pages 255–268, 1999.
[18] J. Leski. Ho–Kashyap classifier with generalization control. Pattern Recognition Letters, 24(14):2281–2290, 2003.
[19] Y.H. Li. Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 2011.
[20] R.Z. Liang, L. Shi, J. Meng, J.J.Y. Wang, Q. Sun, and Y. Gu. Top precision performance measure of content-based image retrieval by learning similarity function. In Pattern Recognition (ICPR), 2016 23rd International Conference, 2016.
[21] R.Z. Liang, W. Xie, W. Li, H. Wang, J.J.Y. Wang, and L. Taylor. A novel transfer learning method based on common space mapping and weighted domain matching. In Tools with Artificial Intelligence (ICTAI), 2016 IEEE 28th International Conference, pages 299–303, 2016.
[22] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. The MIT Press, 1988.
[23] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[24] K. Nikolaidis, J.Y. Goulermas, and Q.H. Wu. A class boundary preserving algorithm for data condensation. Pattern Recognition, 44(3):704–715, 2011.
[25] C.G. Osorio, A.D.H. García, and N.G. Pedrajas. Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artificial Intelligence, 174(5–6):410–441, 2010.
[26] X. Pan, Y. Luo, and Y. Xu. K-nearest neighbor based structural twin support vector machine. Knowledge-Based Systems, 88:34–44, 2015.
[27] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[28] X. Shi, Y. Yang, Z. Guo, and Z. Lai. Face recognition by sparse discriminant analysis via joint L2,1-norm minimization. Pattern Recognition, 47(7):2447–2453, 2014.
[29] M. Tanveer, K. Shubham, M. Aldhaifallah, and S.S. Ho. An efficient regularized k-nearest neighbor based weighted twin support vector regression. Knowledge-Based Systems, 94:70–87, 2016.
[30] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[31] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
[32] V.N. Vapnik. Statistical Learning Theory. Wiley Interscience, 1998.
[33] Z. Wang, S. Chen, J. Liu, and D. Zhang. Pattern representation in feature extraction and classifier design: Matrix versus vector. IEEE Transactions on Neural Networks, 19(5):758–769, 2008.
[34] Z. Wang, S.C. Chen, and D.Q. Gao. A novel multi-view learning developed from single-view patterns. Pattern Recognition, 44(10):2395–2413, 2011.
[35] Y. Yao. Three-way decisions with probabilistic rough sets. Information Sciences, 180(3):341–353, 2010.
[36] Y. Yao. The superiority of three-way decisions in probabilistic rough set models. Information Sciences, 181(6):1080–1096, 2011.
[37] Y. Yao. Two semantic issues in a probabilistic rough set model. Fundamenta Informaticae, 108(3):249–265, 2011.
[38] H.T. Zhao and W.K. Wong. Regularized discriminant entropy analysis. Pattern Recognition, 47(2):806–819, 2014.