Matrix-Pattern-Oriented Classifier with Boundary Projection Discrimination
Zhe Wang∗,1, Zonghai Zhu1

Abstract

The matrix-pattern-oriented Ho-Kashyap classifier (MatMHKS), utilizing two-sided weight vectors to constrain the matrix-based pattern, extends the representation of samples from vectors to matrices. To further improve the classification ability of MatMHKS, we introduce a new regularization term into MatMHKS to form a new algorithm named BPDMatMHKS. In detail, we first divide the samples into three types: noise samples, fuzzy samples and boundary samples. Then, we combine the projection discrimination with these boundary samples, thus proposing a regularization term which concerns the priori structural information of the boundary samples. By doing so, the classification ability of MatMHKS is further improved. Experiments validate the effectiveness and efficiency of the proposed BPDMatMHKS.

Keywords: Matrix-based Classifier; Boundary Sample; Projection Discrimination; Regularization Learning; Pattern Recognition.
I. Introduction
Conventional linear classifiers [9], such as the Perceptron Learning Algorithm (PLA) [22] and the Support Vector Machine (SVM) [8], obtain the final classification result based on one weight vector. Different from the traditional linear classifiers, the matrix-based classifiers [6], [7], [34] use two-sided weight vectors to obtain the final result. Therefore, they can be directly applied to matrix-based pattern representation problems [20] and have been demonstrated to be superior to vector-based classifiers. Our previous work proposed a new matrix-based learning machine named MatMHKS [6].

∗ Corresponding author. Email: [email protected] (Z. Wang)
1 Department of Computer Science & Engineering, East China University of Science & Technology, Shanghai, 200237, P.R. China
Fig. 1. (a) The situation where the iteration has stopped once the classification hyperplane separates the samples of the two classes correctly; it is, however, not a good classification hyperplane. (b) A better classification hyperplane, since it obviously has better generalization ability than the hyperplane in (a).
MatMHKS follows the minimum risk framework [6], [32], and naturally introduces two weight vectors into the classifier design. Therefore, it can simply and effectively process the matrix-based pattern directly. The framework is as follows:

\min J = R_{emp} + c\,R_{reg},  (1)

In Eq. 1, the first term R_emp is the empirical risk, which adopts the mathematical form u^T A v. A ∈ R^{d1×d2} is a matrix pattern, while u ∈ R^{d1×1} and v ∈ R^{d2×1} are two weight vectors. In this way, the matrix pattern A constrained by u and v is a natural extension of the vector method. Moreover, the second term, the regularization term R_reg, extends the geometric interval and reduces the generalization error. The parameter c is the penalty coefficient.

Although the matrix method learns more feasible and structural information from the matrix pattern, and the term R_reg boosts the generalization ability by extending the geometric interval between the two classes, it still might neglect useful information. MatMHKS, which originates from the Modified Ho-Kashyap (MHKS) algorithm [18], may not obtain a perfect classification hyperplane, since the iteration in the MHKS algorithm stops once the classification hyperplane can separate the samples of the two classes. As shown in Fig. 1 (a), the classification hyperplane does not change once all training samples are separated correctly. Obviously, the generalization ability of the two dotted lines is not good, although they distinguish all training samples. In Fig. 1 (b), the classification hyperplane described by the
solid line performs better, since it is based on margin maximization and the samples of the two classes are treated equally.

In order to achieve better prediction, most classification algorithms attempt to learn the information of the samples [21], especially the boundary information of each class. Yao [35], [36], [37] proposed a three-way decision model, which introduces a boundary decision in addition to the original positive and negative decisions. Li [19] proposed a method for selecting critical samples based on local geometrical and statistical information. Nikolaidis [24] proposed a boundary-preserving algorithm for data condensation, which tends to preserve the boundary samples. Besides considering the location of the data, some researchers also pay attention to sample pretreatment. Osorio [25] proposed a linear-complexity instance selection algorithm based on classifier ensemble concepts. Sample pretreatment also plays an important role in imbalance problems. Han [12] proposed Borderline-SMOTE, which pays more attention to the boundary samples of the minority class and improves the performance on imbalance problems. After that, Barua [2] proposed the MWMOTE method, which further improves the performance compared with Borderline-SMOTE. The boundary samples are easier to misclassify and are thus more important for classification.

This paper introduces a new regularization term R_bpd into the original framework of Eq. 1, thus boosting the performance compared with the original framework. The framework can be described as:

\min J = R_{emp} + c\,R_{reg} + \lambda\,R_{bpd},  (2)

The terms R_emp and R_reg are the same as in Eq. 1, and the new term R_bpd is expected to learn the boundary information that is not handled well in the original MatMHKS. The parameter λ is the penalty coefficient.

In the term R_bpd, the first letter b denotes the boundary samples, the second letter p denotes the projection, and the last letter d denotes the discrimination. The term R_bpd is expected to obtain a better final classification hyperplane according to the projection discrimination of the boundary samples. There are two major steps in constituting the regularization term R_bpd. Firstly, the K Nearest Neighbour (KNN) rule [13] provides a useful reference for selecting boundary samples, and secondly, the idea of Fisher linear discriminant analysis (FLDA) [10] is adopted to project these boundary samples onto the argument vector of the classification hyperplane. The KNN method is widely applied in many aspects such as sample selection, noise filtering and so on. Pan [26] used the KNN method to decide the weights of samples in TWSVM [15]. Tanveer [29] introduced
the KNN method into TWSVM, thus reducing the training time and improving the generalization ability. By using the KNN method, we can obtain the neighbours of every sample and further divide the samples into noise samples, fuzzy samples and boundary samples. These boundary samples can then be used to assist the algorithm in finding a better classification hyperplane. According to FLDA, it would be better if the projection values of these points from both classes have a large between-class distance and a small within-class distance.

To illustrate this clearly, we name one class class+ and the other class−. For every sample in class+, its k nearest neighbours in class− are selected. Conversely, the selected boundary samples in class− are used to search for the boundary samples in class+. After the boundary samples of the two classes are selected, the samples in different classes are expected to have a large between-class distance and a small within-class distance. As shown in Fig. 1 (b), the classification hyperplane described by the solid line performs better than the one described by the dotted line in Fig. 1 (a), since it is based on margin maximization and the samples in different classes are treated equally. In order to obtain this type of classification hyperplane, it is important to obtain the boundary samples of the two classes. In Fig. 1 (b), the boundary samples of the two classes are filled with colour and the weight vector w is displayed. It can be found that these boundary samples satisfy three characteristics when they are projected onto the vector w. Firstly, they have a large between-class distance. Secondly, they have a small within-class distance. Finally, the classification hyperplane is close to the midpoint of the means of the two boundary sample sets.

It is worth noting that the SVM pays attention to the support vectors near the classification hyperplane and that FLDA aims to gain a large between-class distance and a small within-class distance. The boundary samples of the proposed BPDMatMHKS and the support vectors of the SVM are quite different. The support vectors of the SVM are the samples which satisfy the strict constraint condition in Eq. 3. They are computed during the optimization of the objective function. Therefore, these support vectors are nonintuitive.

s.t. \; y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, ..., n  (3)

Different from the support vectors, the boundary samples of BPDMatMHKS are computed from geometric intuition. In addition, these samples are obtained before the optimization process and they offer priori information. Through the relationships between neighbours, we obtain the priori structural information of the boundary samples, thus forming a regularization term
according to these boundary samples. Although FLDA shares the first two characteristics, it does not satisfy the third one. On the other hand, the boundary samples of the two classes come in pairs, so it would be better if the classification hyperplane is close to the midpoint of the means of the two boundary sample sets. With the help of the boundary sample selection method, the designed regularization term R_bpd learns the boundary information of the data, thus improving the generalization ability when the term R_bpd is introduced into MatMHKS. The proposed method is named BPDMatMHKS for short. The major contributions of this paper lie in the following aspects:

• The proposed BPDMatMHKS extends the framework of the existing MatMHKS and is a special example of the proposed framework of Eq. 2. In addition, a boundary sample selection method is proposed in this paper. By utilizing the priori information of the boundary samples, the new regularization term is introduced.

• The proposed BPDMatMHKS inherits the advantages of the original MatMHKS. Moreover, it lays emphasis on the boundary sample information, which plays an important role in finding a better classification hyperplane, thus improving the generalization ability. To the best of our knowledge, it is the first time that boundary sample learning is introduced into MatMHKS.

• The proposed BPDMatMHKS is feasible and effective according to the experiments. In addition, the proposed BPDMatMHKS has a tighter generalization risk bound than the original MatMHKS, as shown by analyzing their Rademacher complexity and testing their stability.

The rest of this paper is organized as follows. Section II presents a brief introduction to the preliminary knowledge of MatMHKS and the method of boundary sample selection. Section III gives the architecture of the proposed BPDMatMHKS and the modified method of boundary sample selection. Section IV reports all the experimental results. Section V gives a theoretical and experimental generalization risk bound analysis for BPDMatMHKS. Finally, conclusions are given in Section VI.
II. Related Work
In this section, a concise introduction to MatMHKS and regularization learning is given.
A. MatMHKS

Our previous work MatMHKS is the basis of the proposed algorithm. Moreover, MatMHKS originates from the vector-based Modified Ho-Kashyap algorithm (MHKS), which in turn comes from the Ho-Kashyap (HK) algorithm. Therefore, HK, MHKS and MatMHKS are briefly introduced in this section.

In the original binary-class (class+, class−) problem, there are N samples (x_i, φ_i), i = 1, ..., N, with x_i ∈ R^d and the class label φ_i ∈ {+1, −1}. When x_i ∈ class+, φ_i = +1; otherwise, φ_i = −1. By turning the vector pattern x_i into y_i = [x_i^T, 1], a corresponding augmented weight vector w = [\tilde{w}^T, w_0]^T ∈ R^{d+1} is created. The weight vector w can be derived from the objective function of HK:

\min_{w} J(w) = ||Yw - b||^2,  (4)

where b > 0_{N×1} ∈ R^N is a bias vector and Y = [φ_1 y_1, φ_2 y_2, ..., φ_N y_N]^T, and all elements of Yw − b are expected to be non-negative.
However, the original HK is sensitive to outliers. To solve this problem, Leski proposed a Modified Ho-Kashyap algorithm, MHKS for short, which is based on regularized least squares. MHKS tries to maximize the discriminating margin by adding the margin vector 1_{N×1} ∈ R^N. Thus, the objective function of MHKS, modified from Eq. 4, is as follows:

\min_{w,b} J(w, b) = ||Yw - 1_{N×1} - b||^2 + c\,w^T w.  (5)
Eq. 5 is a concrete form of Eq. 1. The term ||Yw − 1_{N×1} − b||^2 corresponds to R_emp in Eq. 1, and the term w^T w corresponds to R_reg. The parameter c is a regularization coefficient which controls the balance between R_emp and R_reg. Since MHKS and HK are vector-based classifiers, patterns with very high dimensionality may lead to memory waste and the curse of dimensionality.
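As a concrete illustration of Eq. 5, the following sketch alternates a regularized least-squares solve for w with a non-negative update of the margin vector b. It is a minimal, hedged reading of an MHKS-style iteration in NumPy (fixed learning rate rho, fixed iteration count), not the reference implementation.

```python
# Minimal sketch of an MHKS-style update for Eq. 5 (assumes NumPy only;
# rho and the iteration cap are illustrative choices).
import numpy as np

def mhks_sketch(X, phi, c=0.1, rho=0.5, iters=100):
    N, d = X.shape
    Y = phi[:, None] * np.hstack([X, np.ones((N, 1))])   # rows: phi_i * [x_i, 1]
    b = np.ones(N)                                        # margin vector b > 0
    for _ in range(iters):
        # regularized least squares: w = (Y^T Y + c I)^{-1} Y^T (1 + b)
        w = np.linalg.solve(Y.T @ Y + c * np.eye(d + 1), Y.T @ (1 + b))
        e = Y @ w - 1 - b                                 # error vector
        b = b + rho * (e + np.abs(e))                     # keep b non-decreasing
    return w

# toy usage on two linearly separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)) + 3, rng.normal(0, 1, (20, 2)) - 3])
phi = np.hstack([np.ones(20), -np.ones(20)])
w = mhks_sketch(X, phi)
print((np.sign(np.hstack([X, np.ones((40, 1))]) @ w) == phi).all())
```

The b-update used here mirrors the later iteration of Eq. 23; only the least-squares solve changes when the matrix pattern and the boundary term are introduced.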
To solve this problem, our previous work proposed a novel matrix-based classifier named MatMHKS, which can process matrix-based data directly. In this case, the matrix-based sample is described as (A_i, φ_i), i = 1, ..., N, where A_i ∈ R^{d1×d2} and the class label is φ_i ∈ {+1, −1}. Then, the discriminant function of MatMHKS can be written as:

g(A_i) = u^T A_i \tilde{v} + v_0, \quad \{> 0, \; A_i ∈ class+; \; < 0, \; A_i ∈ class−\},  (6)

where u ∈ R^{d1} and \tilde{v} ∈ R^{d2} are two weight vectors. Accordingly, the optimization function of MatMHKS is as follows:

\min_{u,v,b,v_0} J(u, v, b, v_0) = \sum_{i=1}^{N} (φ_i(u^T A_i \tilde{v} + v_0) - 1 - b_i)^2 + c(u^T S_1 u + v^T S_2 v),  (7)

where S_1 = I_{d1×d1} and S_2 = I_{(d2+1)×(d2+1)} are two regularization matrices corresponding to u and v, respectively. The regularization parameter c controls the importance of the regularization term. The weight vectors u and v can be derived from Eq. 7. Compared with MHKS, the advantages of MatMHKS can be summarized as follows: (1) processing matrix-based patterns such as A ∈ R^{d1×d2} directly; (2) saving memory by reducing the number of weights from d1 × d2 to d1 + d2; (3) avoiding over-fitting and the curse of dimensionality. However, MatMHKS, which depends on the training error, pays less attention to the distribution of the boundary samples and is thus weak in generalization performance. Accordingly, it is necessary to find a new way to solve this problem.
B. The difference and relationship between the matrix model and the vector model

The major difference between MatMHKS and other classifiers is that MatMHKS can process not only matrix-based but also vector-based samples, whereas a vector-based classifier can only process vector-based samples. To illustrate the difference between the matrix-based classifier and the vector-based classifier, Fig. 2 is given. In Fig. 2, (a) shows the matrix-model sample and its corresponding vector model, and (b) shows that the matrix-based classifier, using two-sided weight vectors, processes the matrix-based sample, while the vector-based classifier, using one weight vector, only processes the vector-based sample. In real-world applications, images are matrix-based problems. The matrix-based classifier can process matrix-based samples directly, whereas the vector-based classifier must convert the images into the vector-based model. For instance, if the size of an image is 28 × 23, the image sample must be converted into a vector-based sample with 28 × 23 = 644 dimensions.

The advantage of the matrix-based classifier is shown in Table I. According to the table, the number of weights is reduced greatly, which is why the matrix-based classifier relieves the curse of dimensionality. In addition, a vector-based sample with high dimensionality can be converted into a matrix-based sample, thus extending the representation of the sample. In fact, the matrix model and the vector model can be converted to each other according to the Kronecker product [33].
Fig. 2. (a) The matrix-based samples and their corresponding vector-based forms (matrix model A ∈ R^{d1×d2} versus vector model A ∈ R^{N×1} with N = d1 × d2). (b) The matrix-based classifier (u^T A v, u ∈ R^{d1×1}, v ∈ R^{d2×1}) and the vector-based classifier (Aw, w ∈ R^{N×1}). As shown in (b), the matrix-based classifier utilizes two-sided weight vectors to constrain the matrix-based sample and can therefore process matrix-based samples, while the vector-based classifier utilizes one long weight vector.
TABLE I
Memory required and relative reduction rate for the weight vectors of the matrix-based and vector-based classifiers.

Image Size    d1 + d2 (matrix)    d1 × d2 (vector)    (d1 + d2)/(d1 × d2)
24×18         42                  432                 1/10.29
28×23         51                  644                 1/12.63
32×32         64                  1024                1/16
Lemma 1. Let A ∈ R^{m×n}, B ∈ R^{n×p} and C ∈ R^{p×q}; then

vec(ABC) = (C^T ⊗ A)\,vec(B),  (8)

where vec(X) denotes the operation that changes the matrix X into its vector model. Let the discriminant functions of MHKS and MatMHKS be the following: 1) MHKS: g(x) = \tilde{w}^T x + w_0 and 2) MatMHKS: g(A) = u^T A \tilde{v} + v_0. It can be concluded that MHKS and MatMHKS have the same form. In addition, the solution space for the weights in MatMHKS is contained in that of MHKS, and MatMHKS is an MHKS imposed with a Kronecker-product decomposability constraint. To make this clear, we give a concise example to show the change from matrix to vector.
Suppose matrix A = [1 2; 3 4]; then the vector form of A is x = [1 3 2 4]^T. Let u = [1, 2]^T and v = [3, 4]^T; then w = v^T ⊗ u = [3 6 4 8]. We can get the equation u^T A v = w x.

C. Regularization learning
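The identity in Lemma 1 and the worked example above can be checked numerically. The snippet below is only a verification aid; it assumes NumPy and a column-wise vectorization (order='F'), which matches x = [1 3 2 4]^T in the example.

```python
# Numerical check of Lemma 1 (Eq. 8) and of the 2x2 example above.
import numpy as np

# Lemma 1: vec(ABC) = (C^T kron A) vec(B), with column-wise vec.
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
C = np.arange(8.0).reshape(4, 2)
lhs = (A @ B @ C).flatten(order="F")
rhs = np.kron(C.T, A) @ B.flatten(order="F")
print(np.allclose(lhs, rhs))            # True

# The example: u^T A v equals w x with w = v^T kron u and x = vec(A).
A = np.array([[1.0, 2.0], [3.0, 4.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
x = A.flatten(order="F")                # [1, 3, 2, 4]
w = np.kron(v, u)                       # [3, 6, 4, 8]
print(u @ A @ v, w @ x)                 # 61.0 61.0
```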
Regularization [5], [11], [28], [38], which can be traced back to ill-posed problems [4], is a useful technique to alleviate over-fitting and boost the generalization ability. By introducing a certain amount of priori knowledge into the optimization formulation, regularization is powerful in overcoming the over-fitting problem, and this has proved successful, especially for classification. By combining regularization learning with data dimensionality reduction technology, regularization terms based on geometric information can be created. Data dimensionality reduction algorithms can be divided into linear and nonlinear types. The main approach to nonlinear dimensionality reduction is manifold learning, which includes isometric feature mapping (ISOMAP) [30], locally linear embedding (LLE) [27] and so on. The linear dimensionality reduction algorithms mainly include principal component analysis (PCA) [1] and linear discriminant analysis (LDA). Although PCA can effectively discover the potential primary information of the data, it does not take the difference between classes into account. Different from PCA, LDA tries to model the difference between the classes of the data and can be used not only for dimensionality reduction but also for classification. These algorithms, which consider the maintenance of the local geometric structure information, offer useful information for classification problems.

FLDA, as a kind of LDA, can find a linear combination of features that characterizes or separates two or more classes. In the binary-class problem, FLDA aims to find an argument vector of the classification hyperplane. If the labeled samples are projected onto this vector, a larger between-class distance and a smaller within-class distance mean a higher recognition rate. The criterion function of FLDA is as follows:

\max_{w} J(w) = \frac{w^T S_B w}{w^T S_w w}.  (9)

In this criterion function, the parameter w is the argument vector, S_B denotes the between-class scatter matrix, and S_w denotes the within-class scatter matrix. In the binary-class problem, S_B and S_w can be written as:

S_B = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T,  (10)
S_w = \sum_{x_i ∈ class+} (x_i - \mu_+)(x_i - \mu_+)^T + \sum_{x_j ∈ class−} (x_j - \mu_-)(x_j - \mu_-)^T,  (11)

where \mu_+ and \mu_- are the respective mean points of the two classes. In this paper, we combine the boundary samples and the geometric information of FLDA to design the regularization term. Then, the term is used to assist MatMHKS in obtaining a better classification hyperplane.
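For reference, the FLDA quantities of Eqs. 9–11 can be computed directly. The sketch below is only illustrative; it assumes NumPy and uses the standard closed-form direction w ∝ S_w^{-1}(µ_+ − µ_−) as the maximizer of Eq. 9.

```python
# Minimal FLDA sketch for the binary case of Eqs. 9-11 (assumes NumPy).
import numpy as np

def flda_direction(X_pos, X_neg):
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_b = np.outer(mu_p - mu_n, mu_p - mu_n)                 # Eq. 10
    S_w = (X_pos - mu_p).T @ (X_pos - mu_p) + \
          (X_neg - mu_n).T @ (X_neg - mu_n)                  # Eq. 11
    w = np.linalg.solve(S_w, mu_p - mu_n)                    # maximizer of Eq. 9
    return w, S_b, S_w

# toy usage with two Gaussian clouds
rng = np.random.default_rng(1)
X_pos = rng.normal([2.0, 0.0], 1.0, size=(50, 2))
X_neg = rng.normal([-2.0, 0.0], 1.0, size=(50, 2))
w, S_b, S_w = flda_direction(X_pos, X_neg)
print(w / np.linalg.norm(w))   # projection direction separating the two classes
```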
III. Matrix-pattern-oriented Classifier with Boundary Samples

This section first presents the strategy used for selecting boundary samples, and then describes the architecture of the proposed BPDMatMHKS.

A. Selecting the boundary samples

This paper utilizes the idea of KNN to select suitable boundary samples. Fig. 3 (a) shows such a scenario, where the blue and red points represent the samples of class+ and class−, respectively. Before describing the detailed construction of the boundary samples, we note that S_b+ and S_b−, displayed in Fig. 3 (b), are the two boundary sample sets. For every sample x_i ∈ class+, we search for its k nearest neighbours (assuming k = 5) in class− and add these selected neighbours to S_b−. After that, for every sample x_i ∈ S_b− rather than x_i ∈ class−, S_b+ can be constructed by adding the k nearest neighbours (in class+) of the samples in S_b−. As shown in Fig. 3 (b), S_b+ and S_b− are found through this method.

However, just using nearest neighbours to search for boundary samples may cause bad situations, since it may pick up noise samples instead of boundary samples. Fig. 4 (a) shows that samples A, B, C, D and E may affect the final classification hyperplane. For instance, if we search the boundary sets as we do in Fig. 3 (a) and Fig. 3 (b), the final classification hyperplane derived from S_b+ and S_b− would be influenced by these points. In order to find these points, we record the k nearest neighbours of every training sample among all training samples. Further, this paper proposes two definitions concerning these neighbours:

(1) If the k nearest neighbours of one sample x_i (x_i ∈ class+) all belong to class−, x_i is a noise sample.

(2) If one sample x_i (x_i ∈ class+) has more than k/2 nearest neighbours of the other class (class−), then generate vectors by subtracting the mean µ_+ from x_i and from its neighbours in class−, project these vectors onto the vector (µ_− − µ_+) and record their values. x_i is a fuzzy sample if its own projection value is not the minimum. Here µ_+ and µ_− are the mean points of the two classes (class+, class−).
Fig. 3. In (a), the method starts from the samples of class+: each sample searches for its k nearest neighbours in class− and records these neighbours in the boundary sample set of class−, marked as S_b− for convenience. After every sample of class+ has finished searching for its neighbours in class−, the same operation is started from the samples of S_b−. (b) shows the final result, in which the two boundary sample sets S_b+ and S_b− have been found.
A noise sample cannot be regarded as a boundary sample, and its k neighbours belonging to the other class are not recorded as boundary samples. For a fuzzy sample, although its k neighbours belonging to the other class should be added to the boundary sample set, the fuzzy sample itself cannot be selected as a boundary sample. To illustrate this situation clearly, Fig. 4 (c) shows such a scenario, where we start finding boundary samples from class+. It can be found that the 5 nearest neighbours of A all belong to the other class, so A is a noise sample. Considering C (C ∈ class+) in Fig. 4 (c), it may be a fuzzy sample, since it has more than k/2 nearest neighbours of the other class. The next step is to judge whether it is a real fuzzy sample. In this figure, µ_+ and µ_− are the two mean points of class+ and class−, and the vector w is equal to (µ_− − µ_+). Sample C and its neighbours belonging to class− are subtracted by µ_+ to form vectors, and these vectors are then projected onto w. The projection value of sample D (D ∈ class−) is smaller than that of C. According to definition (2), C is a fuzzy sample, and it cannot be selected as a boundary sample. When the boundary samples of class− have been selected, we select the boundary samples of class+ by using the samples of S_b−. The final boundary data sets S_b+ and S_b−, shown in Fig. 4 (d), are the boundary data sets belonging to class+ and class−. They are more effective and robust for assisting the classifier in finding the discriminant hyperplane.
Fig. 4. (a) shows that samples, including A, B, C, D and E, may affect finding the boundary sample set. (b) shows the situation in which the two boundary sample sets are irregular, which has a bad influence on the boundary projection discrimination. (c) demonstrates that A is a noise sample, since all the neighbours of A belong to class−; moreover, as shown in (c), ||D′µ_+|| is smaller than ||C′µ_+||, and since the distance of the projected point C′ on w = (µ_− − µ_+) is not the minimum compared with its neighbours in class−, C is a fuzzy sample. (d) displays the final boundary sample sets after the influence of the noise points and fuzzy points has been eliminated.
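The selection rules described above can be sketched in code before the full procedure is stated. The snippet below is a simplified, vector-pattern reading of one direction of the method (samples of class+ searching neighbours of class−), assuming NumPy, Euclidean distances and k = 5; Algorithm I in Table II below gives the complete matrix-pattern procedure.

```python
# Simplified sketch of the noise/fuzzy/boundary rules for one direction
# (class+ samples searching class- neighbours). Assumptions: NumPy,
# vector-form samples, Euclidean distance, k = 5.
import numpy as np

def select_boundary_neg(X_pos, X_neg, k=5):
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = mu_n - mu_p                                    # projection direction
    X_all = np.vstack([X_pos, X_neg])
    lab = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    Sb_neg, fuzzy = set(), []
    for i, x in enumerate(X_pos):
        d_all = np.linalg.norm(X_all - x, axis=1)
        d_all[i] = np.inf                              # exclude the sample itself
        judge = np.argsort(d_all)[:k]                  # k nearest in the whole set
        m = int(np.sum(lab[judge] == -1))              # opposite-class neighbours
        d_neg = np.linalg.norm(X_neg - x, axis=1)
        candidate = np.argsort(d_neg)[:k]              # k nearest in class-
        if m == k:                                     # rule (1): noise sample
            continue                                   # its neighbours are not recorded
        if m >= k / 2:                                 # rule (2): possible fuzzy sample
            proj_x = (x - mu_p) @ w
            proj_nb = (X_neg[candidate] - mu_p) @ w
            if proj_x > proj_nb.min():
                fuzzy.append(i)                        # fuzzy: excluded from Sb+ later
        Sb_neg.update(candidate.tolist())              # neighbours become boundary candidates
    return np.array(sorted(Sb_neg)), fuzzy
```

Applying the same routine from the samples indexed by the returned set (this time searching neighbours in class+) yields S_b+, and the recorded fuzzy samples are removed at the end, as in Algorithm I.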
The procedure for selecting boundary samples is shown in Table II. It should be noted that this procedure is designed for the binary-class problem. N_+ and N_− are the total numbers of training samples in class+ and class−, respectively.

TABLE II
Algorithm I: Selecting Boundary Samples
Input: the training set S = {(A_1, φ_1), ..., (A_N, φ_N)}; the parameter k, the number of nearest neighbours.
Output: the selected boundary sample sets S_b+ and S_b−.
1. Divide the multi-class problem into binary-class ones such as S_+ and S_−; obtain the training sets S_+ = {(A_1, 1), ..., (A_{N+}, 1)} and S_− = {(A_1, −1), ..., (A_{N−}, −1)}; compute µ_+ = (1/N_+) \sum_{A_i ∈ S_+} A_i and µ_− = (1/N_−) \sum_{A_i ∈ S_−} A_i.
2. For each sample A_i (A_i ∈ S_+): calculate the distance from A_i to all samples in S except itself; record the k nearest neighbours in S_− as S_candidate and the k nearest neighbours in S as S_judge; in S_judge, calculate the number m of samples belonging to S_−;
   if m = k: A_i is a noise point and is removed from S, i.e. S = S \ A_i;
   else if k/2 ≤ m < k: compare the projection distances of A_i and of its neighbours in S_− on the vector (µ_− − µ_+); if the distance of A_i is not the minimum, add A_i to the fuzzy point set S_F; add S_candidate to S_b−;
   else (m < k/2): add S_candidate to S_b−;
   end if
   end for
3. For each sample A_i (A_i ∈ S_b−), perform the same operation as in Step 2 to obtain S_b+ (here we scan only the samples A_i ∈ S_b− instead of A_i ∈ S_−).
4. Obtain the final boundary sample sets by removing the fuzzy samples: S_b+ = S_b+ \ S_F and S_b− = S_b− \ S_F.

B. Architecture of BPDMatMHKS

In this section, the proposed BPDMatMHKS is introduced in four steps. First, the novel regularization term R_bpd is constructed from the boundary samples. Then, the term R_bpd is introduced into the original MatMHKS to boost the generalization ability. After that, the optimal solutions for the weight vectors u and v are derived from the proposed BPDMatMHKS. Finally, the realization of the proposed BPDMatMHKS is given as pseudo-code.

In the proposed BPDMatMHKS, there is a matrix data set S = {(A_1, φ_1), ..., (A_i, φ_i), ..., (A_N, φ_N)},
where N is the size of the data set, Ai ∈ Rd1×d2 is the matrix sample and its class label ϕi ∈
{+1, −1} corresponding to the binary-class data set (S+ and S− ). In addition, the other two matrix
data sets S_b+ and S_b− are the two boundary sample data sets. Their samples can be described as (A_1^+, +1), ..., (A_i^+, +1), ..., (A_{n_+}^+, +1) and (A_1^−, −1), ..., (A_i^−, −1), ..., (A_{n_−}^−, −1), where n_+ and n_− are the sizes of S_b+ and S_b−. Therefore, the regularization term R_bpd can be written as follows:

R_{bpd} = \sum_{i=1}^{n_+} ((u^T A_i^+ \tilde{v} + v_0) - mean_+)^2 + \sum_{i=1}^{n_-} ((u^T A_i^- \tilde{v} + v_0) - mean_-)^2 + (mean_+ + mean_-)^2 - (mean_+ - mean_-)^2,  (12)

mean_+ = u^T A_{mean}^+ \tilde{v} + v_0,  (13)

mean_- = u^T A_{mean}^- \tilde{v} + v_0,  (14)

where mean_+ and mean_− are the mean values of S_b+ and S_b− (A_{mean}^+ and A_{mean}^− being the mean samples of the two boundary sets). Eq. 12 expects that the boundary samples belonging to the different sample sets have a smaller within-class distance and a bigger between-class distance when the samples of S_b+ and S_b− are projected onto the argument vector of the classification hyperplane. Moreover, it hopes that the classification hyperplane can fit the perpendicular bisector of A_{mean}^+ and A_{mean}^− by introducing the section (mean_+ + mean_−)^2 − (mean_+ − mean_−)^2 in Eq. 12, since these boundary samples appear in pairs. In fact, this section can be simplified, because (mean_+ + mean_−)^2 − (mean_+ − mean_−)^2 = 4 mean_+ mean_−, and Eq. 12 can be rewritten as follows:

R_{bpd} = \sum_{i=1}^{n_+} ((u^T A_i^+ \tilde{v} + v_0) - mean_+)^2 + \sum_{i=1}^{n_-} ((u^T A_i^- \tilde{v} + v_0) - mean_-)^2 + 4\,mean_+ mean_-,  (15)

The key of BPDMatMHKS is to introduce the designed boundary sample regularization term R_bpd into Eq. 7. According to regularized learning, the framework can be written as:

\min_{u,v,b,v_0} J(u, v, b, v_0) = \sum_{i=1}^{N} (φ_i(u^T A_i \tilde{v} + v_0) - 1 - b_i)^2 + c(u^T S_1 u + v^T S_2 v) + \lambda R_{bpd}.  (16)

Then, the gradient descent method is used to seek the optimal values of u, v and b. First, we take the partial derivative with respect to u:

\frac{\partial J}{\partial u} = \sum_{i=1}^{N} A_i \tilde{v} \tilde{v}^T A_i^T u + c S_1 u - \sum_{i=1}^{N} φ_i A_i \tilde{v} (1 + b_i) + \sum_{i=1}^{N} A_i \tilde{v} v_0 + \lambda \sum_{i=1}^{n_+} (A_i^+ - A_{mean}^+) \tilde{v} \tilde{v}^T (A_i^+ - A_{mean}^+)^T u + \lambda \sum_{i=1}^{n_-} (A_i^- - A_{mean}^-) \tilde{v} \tilde{v}^T (A_i^- - A_{mean}^-)^T u + 2\lambda (A_{mean}^+ \tilde{v} \tilde{v}^T A_{mean}^{-T} u + A_{mean}^- \tilde{v} \tilde{v}^T A_{mean}^{+T} u + A_{mean}^+ \tilde{v} v_0 + A_{mean}^- \tilde{v} v_0),  (17)
Setting it equal to zero, the expression of u can be written as:

u = \Big( \sum_{i=1}^{N} A_i \tilde{v} \tilde{v}^T A_i^T + c S_1 + \lambda \sum_{i=1}^{n_+} (A_i^+ - A_{mean}^+) \tilde{v} \tilde{v}^T (A_i^+ - A_{mean}^+)^T + \lambda \sum_{i=1}^{n_-} (A_i^- - A_{mean}^-) \tilde{v} \tilde{v}^T (A_i^- - A_{mean}^-)^T + 2\lambda A_{mean}^+ \tilde{v} \tilde{v}^T A_{mean}^{-T} + 2\lambda A_{mean}^- \tilde{v} \tilde{v}^T A_{mean}^{+T} \Big)^{-1} \Big( \sum_{i=1}^{N} φ_i A_i \tilde{v} (1 + b_i) + \sum_{i=1}^{N} A_i \tilde{v} v_0 + 2\lambda A_{mean}^+ \tilde{v} v_0 + 2\lambda A_{mean}^- \tilde{v} v_0 \Big),  (18)

To simplify Eq. 16, we define Y = [y_1, y_2, ..., y_N]^T with y_i = φ_i [u^T A_i, 1]^T (i = 1, 2, ..., N), and v = [\tilde{v}^T, v_0]^T. Moreover, y_j^+ = [u^T A_j^+, 1]^T (j = 1, 2, ..., n_+), Y^+ = [y_1^+, y_2^+, ..., y_{n_+}^+]^T, y_j^- = [u^T A_j^-, 1]^T (j = 1, 2, ..., n_-), and Y^- = [y_1^-, y_2^-, ..., y_{n_-}^-]^T. Then the objective function Eq. 16 can be reformulated as:

\min J(u, v, b) = (Yv - 1_{N×1} - b)^T (Yv - 1_{N×1} - b) + c(u^T S_1 u + v^T \tilde{S}_2 v) + \lambda \big( (Y^+ v - Y_{mean}^+ v)^T (Y^+ v - Y_{mean}^+ v) + (Y^- v - Y_{mean}^- v)^T (Y^- v - Y_{mean}^- v) + 4 Y_{mean}^+ v\, Y_{mean}^- v \big).  (19)

In Eq. 19, \tilde{S}_2 = [S_2, 0; 0, 0] is a (d2 + 1) × (d2 + 1) matrix, and Y_{mean}^+ and Y_{mean}^- are the mean values (the means of the rows) of Y^+ and Y^-, respectively.

Therefore, the partial derivative with respect to v can be derived as:

\frac{\partial J}{\partial v} = Y^T Y v - Y^T (1_{N×1} + b) + c \tilde{S}_2 v + \lambda (Y^+ - Y_{mean}^+)^T (Y^+ - Y_{mean}^+) v + \lambda (Y^- - Y_{mean}^-)^T (Y^- - Y_{mean}^-) v + 2\lambda (Y_{mean}^{+T} Y_{mean}^- + Y_{mean}^{-T} Y_{mean}^+) v,  (20)

and setting it to zero gives:

v = \big( Y^T Y + c \tilde{S}_2 + \lambda (Y^+ - Y_{mean}^+)^T (Y^+ - Y_{mean}^+) + \lambda (Y^- - Y_{mean}^-)^T (Y^- - Y_{mean}^-) + 2\lambda (Y_{mean}^{+T} Y_{mean}^- + Y_{mean}^{-T} Y_{mean}^+) \big)^{-1} Y^T (1_{N×1} + b).  (21)

In addition, for the vector b, whose iterative function is designed below, we have

\frac{\partial J}{\partial b} = -2(Yv - 1_{N×1} - b) = -2e,  (22)

Obviously, the weight vectors u and v depend on the margin vector b, which controls the distance from the corresponding sample to the classification hyperplane, and the elements of b are non-negative. It is expected that the classification hyperplane can separate all samples belonging to the different classes correctly, and e is the error. Moreover, according to Eqs. 18 and 21, it can be found that u and v are determined by each other. At the first step of the iteration, v can be derived from the initialized u and b. Then v and b determine the new value of u. The vector b can be recalculated
by measuring the difference between the error value of the previous step and that of the current step. The iterative function of b can be written as follows:

b(1) ≥ 0, \quad b(k + 1) = b(k) + \rho (e(k) + |e(k)|),  (23)

Finally, the discriminant function of BPDMatMHKS for an input sample A ∈ R^{d1×d2} is defined as:

g(A) = u^T A \tilde{v} + v_0, \quad \{> 0, \; A ∈ class+; \; < 0, \; A ∈ class−\},  (24)
The procedure of BPDMatMHKS in the binary-class situation is shown in Table III as pseudo-code. According to Table III, the proposed BPDMatMHKS is the same as MatMHKS if the parameter λ is set to zero.

TABLE III
Algorithm II: BPDMatMHKS
Input: S = {(A_1, φ_1), ..., (A_N, φ_N)}; S_b+ = {A_1^+, ..., A_{n_+}^+}; S_b− = {A_1^−, ..., A_{n_−}^−};
Output: the weight vectors u, v, and the bias v_0;
1. Initialization: c ≥ 0, λ ≥ 0, 0 < ρ < 1; b(1) ≥ 0, u(1); k = 1;
2. Training:
for k = 1, k ≤ k_max, k = k + 1
  Set Y = [y_1, y_2, ..., y_N]^T, where y_i = φ_i [u(k)^T A_i, 1], i = 1, 2, ..., N;
  Set Y^+ = [y_1^+, y_2^+, ..., y_{n_+}^+]^T, where y_j^+ = [u(k)^T A_j^+, 1], j = 1, 2, ..., n_+;
  Set Y^- = [y_1^-, y_2^-, ..., y_{n_-}^-]^T, where y_j^- = [u(k)^T A_j^-, 1], j = 1, 2, ..., n_-;
  Obtain v(k) = (Y^T Y + c \tilde{S}_2 + λ(Y^+ - Y_{mean}^+)^T (Y^+ - Y_{mean}^+) + λ(Y^- - Y_{mean}^-)^T (Y^- - Y_{mean}^-) + 2λ(Y_{mean}^{+T} Y_{mean}^- + Y_{mean}^{-T} Y_{mean}^+))^{-1} Y^T (1_{N×1} + b(k));
  Obtain e(k) = Y v(k) - 1_{N×1} - b(k);
  Obtain b(k + 1) = b(k) + ρ(e(k) + |e(k)|);
  if ||b(k + 1) - b(k)|| > ξ and ||b(k + 1) - b(k)|| ≤ temp (k ≥ 2),
    temp = ||b(k + 1) - b(k)||;
    u(k + 1) = (\sum_{i=1}^{N} A_i \tilde{v}(k) \tilde{v}(k)^T A_i^T + c S_1 + λ \sum_{i=1}^{n_+} (A_i^+ - A_{mean}^+) \tilde{v}(k) \tilde{v}(k)^T (A_i^+ - A_{mean}^+)^T + λ \sum_{i=1}^{n_-} (A_i^- - A_{mean}^-) \tilde{v}(k) \tilde{v}(k)^T (A_i^- - A_{mean}^-)^T + 2λ A_{mean}^+ \tilde{v}(k) \tilde{v}(k)^T A_{mean}^{-T} + 2λ A_{mean}^- \tilde{v}(k) \tilde{v}(k)^T A_{mean}^{+T})^{-1} (\sum_{i=1}^{N} φ_i A_i \tilde{v}(k)(1 + b_i(k)) + \sum_{i=1}^{N} A_i \tilde{v} v_0 + 2λ A_{mean}^+ \tilde{v} v_0 + 2λ A_{mean}^- \tilde{v} v_0);
  else
    Stop;
  end if
end for
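To make the iteration of Table III concrete, the following sketch implements the v-update of Eq. 21 and the b-update of Eq. 23 for a fixed u. It is a simplified illustration assuming NumPy, not the authors' MATLAB implementation, and it omits the u-update of Eq. 18.

```python
# Sketch of one BPDMatMHKS step (v-update of Eq. 21, b-update of Eq. 23)
# for a fixed u. Assumptions: NumPy; A_i are d1 x d2 matrices; S2~ = blkdiag(I, 0).
import numpy as np

def v_and_b_step(A_list, phi, A_pos, A_neg, u, b, c=0.1, lam=0.1, rho=0.5):
    d2 = A_list[0].shape[1]
    row = lambda A: np.append(u @ A, 1.0)                   # [u^T A, 1]
    Y  = np.array([p * row(A) for A, p in zip(A_list, phi)])
    Yp = np.array([row(A) for A in A_pos])                  # boundary set Sb+
    Yn = np.array([row(A) for A in A_neg])                  # boundary set Sb-
    Yp_m = Yp.mean(axis=0, keepdims=True)
    Yn_m = Yn.mean(axis=0, keepdims=True)
    S2t = np.eye(d2 + 1); S2t[-1, -1] = 0.0                 # regularize v~ only
    M = (Y.T @ Y + c * S2t
         + lam * (Yp - Yp_m).T @ (Yp - Yp_m)
         + lam * (Yn - Yn_m).T @ (Yn - Yn_m)
         + 2 * lam * (Yp_m.T @ Yn_m + Yn_m.T @ Yp_m))
    v = np.linalg.solve(M, Y.T @ (1 + b))                   # Eq. 21
    e = Y @ v - 1 - b                                       # error vector
    b_new = b + rho * (e + np.abs(e))                       # Eq. 23
    return v, b_new
```

In the full algorithm this step alternates with the u-update of Eq. 18 until ||b(k + 1) − b(k)|| falls below the threshold ξ, as in Table III.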
TABLE IV
The description of the parameters.

Algorithm       Related parameters    Values                                  Remark
BPDMatMHKS      c                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
                λ                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
MatMHKS         c                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
MHKS            c                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
SVM (Linear)    c                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
SVM (RBF)       c                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
                σ                     10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
TWSVM           c1                    10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
                c2                    10^−2, 10^−1, 10^0, 10^1, 10^2          Selected
IV. Experiments

In this section, the experiments are designed to investigate the feasibility and effectiveness of BPDMatMHKS. The proposed BPDMatMHKS is compared with MatMHKS, MHKS, SVM and TWSVM on 20 UCI benchmark data sets and 6 image data sets. More concretely, this section is divided into five parts. First, the basic settings of all the involved algorithms are given. In the second part, the corresponding experimental results and analyses on matrix-based data sets such as images are given. In the third part, the corresponding experimental results and analyses on the UCI benchmark data sets are given. In the fourth part, the effect of the varying parameters (including c and λ) is discussed. Finally, the classification performance of the proposed regularization term is further discussed.

A. Experimental setting

In the experiments, the parameters used in all algorithms are shown in Table IV. According to this table, the algorithms MHKS, MatMHKS and BPDMatMHKS are related to the original Ho-Kashyap algorithm and therefore share some common parameters. For BPDMatMHKS, the number of nearest neighbours is fixed to 5. For SVM, the linear kernel, K(x_i, x_j) = x_i^T x_j, and the Radial Basis Function (RBF) kernel, K(x_i, x_j) = exp(−||x_i − x_j||^2 / σ), are taken into consideration. For TWSVM, the linear kernel is used.

For multi-class problems, the one-against-one classification strategy [17] is utilized. After all the results of the one-against-one classifiers have been obtained, a voting rule is used to combine them and the final label is output according to the majority rule. To get the average validation performance and avoid random deviation, we combine cross-validation with hold-out. In detail, we use 5-fold cross-validation in the outer layer: each data set used in the experiments is randomly split into 5 parts, where one part is used for testing and the others for training. Inside every round, the 4 training parts are further used for a 4-fold cross-validation to select the optimal parameters such as c and λ in BPDMatMHKS. Then, we use the obtained optimal parameters to train the classification hyperplane on the 4 training parts. Finally, the classification hyperplane is tested on the 5th unseen part. The procedure of the experiment is shown in Fig. 5. All the computations are performed on an Intel i5 6600K at 3.50 GHz, with 16 GB DDR4 RAM, Microsoft Windows 10, and the MATLAB R2015b environment.

Fig. 5. In the experimental procedure, the 5-fold cross-validation is combined with the hold-out strategy. The optimal parameters are selected through training on the 4 seen training parts only, thus avoiding excessively optimistic results.
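The nested protocol just described (5-fold outer hold-out testing with a 4-fold inner parameter search) can be sketched as follows. The snippet assumes scikit-learn and uses a linear SVM as a stand-in classifier, since the paper's MATLAB implementation of BPDMatMHKS is not reproduced here; note that SVC internally uses a one-vs-one scheme, mirroring the one-against-one strategy above.

```python
# Minimal sketch of the nested evaluation protocol (assumes scikit-learn and NumPy;
# the classifier and data set are stand-ins, not the paper's implementation).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}          # same grid style as Table IV

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in outer.split(X, y):
    inner = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=inner)
    search.fit(X[train_idx], y[train_idx])           # 4-fold search on training parts only
    scores.append(search.score(X[test_idx], y[test_idx]))  # test on the unseen fold

print("accuracy: %.2f +/- %.2f" % (np.mean(scores), np.std(scores)))
```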
TABLE V
Description of the used image data sets.

Dataset    Size    Dimension    Class    Original Matrix Size
Coil-20    1440    1024         20       32×32
Letter     500     432          10       24×18
ORL        400     644          40       28×23
Yale       165     1024         15       32×32
YaleB      2414    1024         38       32×32
JAFFE      213     1024         10       32×32

B. Classification comparison on image data sets
In this subsection, the performance of BPDMatMHKS is compared with the other classifiers on six image data sets: Coil-20^1, Letter^2, ORL-Faces^3, Yale^4, YaleB^5 and JAFFE^6. The results lay emphasis on the classification accuracy. In addition, the parameters c and λ are discussed in this section.

(1) Data introduction and performance comparison

Coil-20, with 20 objects, is a data set of gray-scale images. The rotation of each object is recorded by a fixed camera and each object has 72 images of size 32 × 32. Letter, with 500 samples, contains 10 handwritten digits from "0" to "9"; therefore, Letter is a data set with 10 classes and each class has 50 images. ORL-Faces, with 10 different images for each person, is a set of facial images of size 28 × 23 coming from 40 persons. Yale comes from 15 persons and each person has 11 facial expression images. Similar to Yale, YaleB comes from 38 persons and each person has approximately 64 facial expression images. The Japanese Female Facial Expression (JAFFE) data set contains 10 persons and each person has approximately 20 facial expression images of size 32 × 32. Table V gives the detailed information of all the used image data sets.

1 http://www.cs.columbia.edu/CAVE/coil-20.html
2 http://sun16.cecs.missouri.edu/pgader/CECS477/NNdigits.zip
3 http://www.cam-orl.co.uk
4 http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html
5 http://vision.ucsd.edu/ leekc/ExtYaleDatabase/ExtYaleB.html
6 http://www.kasrl.org/jaffe info.html
TABLE VI
Test classification accuracy (%) for BPDMatMHKS, MatMHKS, MHKS, SVM (Linear), SVM (RBF) and TWSVM (Linear) on the image data sets (Coil-20, Letter, ORL, Yale, YaleB and JAFFE) under different matrix models; the best result on each data set is written in bold.
In the experiments, these matrix-based samples are first converted into vector patterns; after that, they can be transformed into matrix patterns of different sizes when they are used in BPDMatMHKS and MatMHKS. For instance, the data set Coil-20 with 1024 dimensions is transformed into matrix patterns such as 1 × 1024, 2 × 512, 4 × 256, 8 × 128, 16 × 64 and 32 × 32. For each data set, the best classification result is marked in bold. The classification accuracy results of BPDMatMHKS, MatMHKS, MHKS, SVM (Linear), SVM (RBF) and TWSVM (Linear) are all listed in Table VI. In Table VI, rather than using the original high-dimensional vector-based samples as the vector-based classifiers do, the matrix-based classifiers such as BPDMatMHKS and MatMHKS, which reduce the number of weights from d1 × d2 to d1 + d2, still perform well under the lower-dimension condition.
Fig. 6. The classification accuracy (%) of BPDMatMHKS on the image data sets with varying parameters: (a) parameter c, (b) parameter λ.
In addition, the performance of BPDMatMHKS is competitive with MatMHKS and SVM on the data sets with high dimensions.

(2) The influence of the parameters c and λ in the experiment

The parameter c controls the term R_reg, while λ controls the newly proposed regularization term R_bpd in Eq. 2. Both parameters are selected from {0.01, 0.1, 1, 10, 100}. First, we fix λ to 0.01 and choose c from this range. For convenience, only the value of the best matrix form on each data set is recorded. For the parameter λ, we fix c to 0.01 and discuss λ over the same range of options. Sub-figure (a) in Fig. 6 shows the classification accuracy with varying c, and sub-figure (b) corresponds to the varying λ.

In sub-figure (a), image data sets such as JAFFE and Coil-20 share the same trend that the accuracy declines while c increases. Conversely, the trend on YaleB is that a higher c corresponds to a higher accuracy. The trends on Yale and ORL are stable. For the Letter data set, the accuracy keeps increasing while c is less than 10 and declines when c is equal to 100. In sub-figure (b), every image data set shares the same trend that the accuracy is stable when λ is less than 1 and keeps declining when λ is larger than 1.

C. Classification comparison on UCI data sets

In this subsection, the proposed BPDMatMHKS is validated as to whether it is effective in dealing with vector-based data sets. Table VII shows the detailed description of the used UCI benchmark data sets.
TABLE VII
Information of the UCI data sets.

Data set                    Size     Dimension    Class
Water                       116      38           2
Wine                        178      13           3
Iris                        150      4            3
Sonar                       208      60           2
Ionosphere                  351      34           2
Horse Colic                 366      27           2
House Vote                  435      16           2
Wdbc                        569      30           2
Breast Cancer Wisconsin     699      9            2
Transfusion                 748      4            2
Pima Indians Diabetes       768      8            2
Hill Valley                 1212     100          2
Banknote Authentication     1372     4            2
CMC                         1473     9            3
Secom                       1567     590          2
Semeion                     1593     256          10
Waveform                    5000     21           3
Statlog                     6435     36           6
Marketing                   8993     9            13
Pendigits                   10992    16           10
Since BPDMatMHKS and MatMHKS can process matrix-based samples directly, these UCI benchmark data sets, represented in vector form, can be transformed into their corresponding matrix forms. In general, one original vector-based pattern can be reshaped into multiple matrix forms. In this experiment, the dimensions of these data sets are not as high as those of the image data sets. For instance, the Sonar data set with 60 dimensions can be transformed into matrix patterns such as 1 × 60, 2 × 30, 3 × 20, 4 × 15, 5 × 12 and 6 × 10. These kinds of matrix sizes are evaluated in BPDMatMHKS and MatMHKS. Table VIII gives the results of all the implemented algorithms on the used UCI data sets. For each data set, BPDMatMHKS and MatMHKS are evaluated with their optimal matrix size, and the best result among the different algorithms is highlighted. Moreover, the Friedman test [14] and Win/Loss/Tie counts are adopted to validate the difference between the proposed BPDMatMHKS and the compared algorithms. Further, the parameters c and λ are discussed.

(1) Classification comparison with other algorithms

The classification performances of the proposed BPDMatMHKS and the other comparison algorithms are listed in Table VIII. In this experiment, BPDMatMHKS outperforms MatMHKS on 14 data sets, which can be attributed to the assistance of the boundary information. When BPDMatMHKS is compared with MHKS, the proposed BPDMatMHKS clearly outperforms MHKS, since BPDMatMHKS takes advantage not only of the different matrix models but also of the boundary information. SVM (RBF) uses a kernel function that maps the original space to a higher-dimensional space, thus handling non-linearly separable problems. Compared
with SVM (RBF), BPDMatMHKS, as a linear classifier, is still competitive according to the Win/Loss/Tie counts and the Friedman average rank. In addition, BPDMatMHKS outperforms TWSVM (Linear) on 17 data sets, which means that the performance of BPDMatMHKS is obviously better than that of TWSVM. In general, BPDMatMHKS has an obvious advantage when compared with the other linear classifiers, and it is competitive with the non-linear classifier SVM (RBF).

The Friedman test is a non-parametric statistical test, which is used to detect whether the used algorithms perform similarly. If these algorithms were equivalent, the average ranks of the different algorithms should be the same. According to the original Friedman test, suppose there are k algorithms compared on N data sets, and let r_i be the average rank of the i-th algorithm. Then r_i obeys a normal distribution whose mean value and variance are (k + 1)/2 and (k^2 − 1)/12, respectively. The variable τ_{χ²} can be written as follows:

\tau_{\chi^2} = \frac{12N}{k(k + 1)} \Big( \sum_{i=1}^{k} r_i^2 - \frac{k(k + 1)^2}{4} \Big),  (25)

The aforementioned Friedman test, however, is too conservative. The typically used variable is:

\tau_F = \frac{(N - 1)\,\tau_{\chi^2}}{N(k - 1) - \tau_{\chi^2}},  (26)

and τ_F obeys the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. According to Table VIII, the value of \sum_{i=1}^{k} r_i^2 can be calculated, and the computation of τ_F proceeds as follows:

\sum_{i=1}^{k} r_i^2 = 2.275^2 + 3.325^2 + 4.35^2 + 3.8^2 + 2.8^2 + 4.45^2 = 77.2362,  (27)

\tau_{\chi^2} = \frac{12 × 20}{6 × (6 + 1)} \Big( 77.2362 - \frac{6 × (6 + 1)^2}{4} \Big) = 21.3497,  (28)

\tau_F = \frac{(20 - 1) × 21.3497}{20 × (6 - 1) - 21.3497} = 5.1576,  (29)

Compared with the critical value F(k − 1, (k − 1)(N − 1)) = F(5, 95) = 2.31, τ_F = 5.1576 is larger than 2.31, which means that the hypothesis that all the used algorithms are similar is rejected. Accordingly, the performances of these algorithms are significantly different. Moreover, the Nemenyi post-hoc test is used to further distinguish the differences among these algorithms.
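As an arithmetic cross-check of Eqs. 27–29, the snippet below recomputes τ_{χ²} and τ_F from the average ranks reported in Table VIII. It assumes NumPy and is only a verification aid.

```python
# Arithmetic check of the Friedman statistics of Eqs. 25-26 (assumes NumPy).
import numpy as np

r = np.array([2.275, 3.325, 4.35, 3.8, 2.8, 4.45])   # average ranks from Table VIII
k, N = len(r), 20
tau_chi2 = 12 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)
tau_F = (N - 1) * tau_chi2 / (N * (k - 1) - tau_chi2)
print(round(tau_chi2, 4), round(tau_F, 4))           # approximately 21.35 and 5.16
```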
TABLE VIII
Test classification accuracy (%) for BPDMatMHKS, MatMHKS, MHKS, SVM (Linear), SVM (RBF) and TWSVM (Linear) on the UCI data sets.

Data set                    BPDMatMHKS       MatMHKS          MHKS             SVM (Linear)     SVM (RBF)        TWSVM (Linear)
Water                       96.67 ± 3.49     95.83 ± 2.95     95.00 ± 5.43     96.50 ± 3.56     98.33 ± 2.28     91.67 ± 6.59
Wine                        94.59 ± 5.73     92.43 ± 5.20     95.55 ± 3.01     96.76 ± 2.26     97.84 ± 2.26     94.22 ± 4.23
Iris                        98.00 ± 2.98     98.00 ± 2.98     97.33 ± 3.65     96.67 ± 4.71     97.33 ± 3.65     97.33 ± 3.65
Sonar                       64.88 ± 17.33    61.54 ± 13.83    56.33 ± 13.64    58.84 ± 14.93    55.05 ± 21.63    57.08 ± 18.29
Ionosphere                  86.69 ± 2.00     84.90 ± 3.05     85.27 ± 6.61     85.25 ± 5.98     93.75 ± 2.88     82.72 ± 5.57
Horse Colic                 80.80 ± 5.37     80.80 ± 5.37     60.11 ± 3.51     81.39 ± 4.97     81.39 ± 4.97     80.85 ± 4.18
House Vote                  94.70 ± 2.09     91.75 ± 2.76     93.33 ± 1.51     94.00 ± 1.93     93.78 ± 1.10     92.18 ± 2.06
Wdbc                        96.32 ± 1.68     95.09 ± 2.63     95.45 ± 1.51     98.08 ± 0.94     98.25 ± 0.86     96.86 ± 1.30
Breast Cancer Wisconsin     96.45 ± 2.84     96.73 ± 2.58     95.60 ± 4.23     96.73 ± 1.69     95.73 ± 3.20     94.89 ± 5.84
Transfusion                 76.21 ± 0.46     78.61 ± 3.83     76.34 ± 0.48     69.71 ± 13.87    66.61 ± 20.85    76.21 ± 0.46
Pima Indians Diabetes       76.17 ± 2.47     76.70 ± 2.61     76.43 ± 2.10     74.61 ± 3.78     75.01 ± 4.42     76.82 ± 3.03
Hill Valley                 50.58 ± 1.81     50.17 ± 1.48     49.92 ± 1.25     48.50 ± 3.21     49.74 ± 3.57     50.17 ± 2.66
Banknote Authentication     99.05 ± 0.66     98.83 ± 0.75     97.89 ± 0.60     98.98 ± 0.70     100.00 ± 0.00    97.74 ± 0.70
CMC                         51.79 ± 2.08     51.52 ± 2.21     49.48 ± 3.03     43.45 ± 6.13     48.39 ± 7.32     47.90 ± 3.35
Secom                       93.36 ± 0.11     92.72 ± 0.50     82.92 ± 22.50    72.11 ± 13.25    93.36 ± 0.11     92.41 ± 0.70
Semeion                     94.66 ± 1.34     94.09 ± 0.91     92.13 ± 1.51     93.71 ± 0.67     95.48 ± 1.80     91.70 ± 1.44
Waveform                    87.08 ± 1.07     86.94 ± 0.84     86.26 ± 1.08     86.78 ± 0.99     86.96 ± 0.88     86.70 ± 0.80
Statlog                     85.64 ± 1.45     85.05 ± 1.60     83.90 ± 1.30     84.26 ± 2.09     87.96 ± 0.94     83.93 ± 2.06
Marketing                   32.22 ± 0.92     32.34 ± 0.93     31.52 ± 0.79     27.98 ± 0.78     27.80 ± 2.83     30.45 ± 0.70
Penbased                    98.01 ± 0.56     96.21 ± 0.63     96.85 ± 0.66     98.06 ± 0.58     99.51 ± 0.19     96.26 ± 0.66
Win/Loss/Tie                -/-/-            14/4/2           17/3/0           15/5/0           10/9/1           17/2/1
Average rank                2.275            3.325            4.35             3.8              2.8              4.45
The critical value for the difference of average ranks, named the Critical Difference (CD), in the Nemenyi post-hoc test can be written as follows:

CD = q_\alpha \sqrt{\frac{k(k + 1)}{6N}},  (30)

According to the literature, the value of q_α is 2.85 when six algorithms are compared at α = 0.05. Therefore, the value of CD is equal to:

CD = 2.85 × \sqrt{\frac{6 × (6 + 1)}{6 × 20}} = 1.6861,  (31)
(31)
If the difference between two algorithms is over the value of CD, it means the two algorithms are not similar. As is shown by Table VIII, it can be found that BPDMatMHKS is much better than MHKS, SVM (Linear) and TWSVM (Linear), since the difference of average rank between these algorithms is over 1.6816. According to average rank in Table VIII, the performance
AN US
of BPDMatMHKS is still superior to SVM (RBF) and MatMHKS, since the average rank of BPDMatMHKS is the lowest.
(2) The influence of the parameter c and λ in the experiment
In this subsection, the influence of the parameter c and λ is discussed in Fig. 7 and .8. The experiment setting is the same as we do in the image data sets.
M
Fig. 7 shows the classification performance of varying c. It can been found that the classification performance is stable in most data sets. For some data sets including Wine, Ionosphere,
ED
Sonar, House vote and Horse colic, the classification accuracy keeps stable when c is less than 10 and declines when c is equal to 100. For secom data set, the classification accuracy keep {0.1, 1, 10}.
PT
increasing while c is increasing. In generally, we can find that the range of c can be shrank to
CE
Fig. 8 shows the classification performance of varying λ. The classification accuracy keeps
stable or ascending when λ is less than 10. It is getting worse when λ is equal to 100, except
AC
for the secom and transfusion data sets. In these two data sets, the classification performance is increasing with λ increasing. In generally, the best choice of the parameter of c and λ is from {0.1, 1, 10}.
(3) Convergence analysis Here, we discus the convergence of the proposed BPDMatMHKS. As is shown in Fig. 9, it can
be found that the convergence speed of the BPDMatMHKS is fast at the beginning. The value of ||b(k + 1) − b(k)|| is stable and close to 0 when the iteration number is over 10. In addition,
about half of the data sets is satisfied with the break condition when the iteration number is less
ACCEPTED MANUSCRIPT 26
100
100 95
95
90 90
80 75 70
Accuracy(%)
Accuracy(%)
85 water wine iris sonar ionosphere
65
85
horse colic house vote wdbc breast cancer wisconsin transfusion
80 75 70
55 50 0.01
0.1
1
10
65 0.01
100
c
0.1
CR IP T
60
1
10
100
10
100
c
(a)
(b)
100
100
90
AN US
90
80
70
60
Accuracy(%)
Accuracy(%)
80
pima indians diabetes hill valley banknote authentication cmc secom
60
semeion waveform statlog marketing penbased
50
50
40
0.1
1
c
100
30 0.01
0.1
1
c
(d)
ED
(c)
10
M
40 0.01
70
PT
Fig. 7. The classification accuracy(%) of BPDMatMHKS with varying parameter c.
than 10. Therefore, it can be concluded that the proposed BPDMatMHKS has a fast convergence speed.
D. Difference between the matrix model and the vector model

As described in Section II, the matrix model and the vector model can be converted into each other according to the Kronecker product. Inspired by this, we can measure the difference between the solution space of the matrix-based algorithms and that of the vector-based algorithms.

Fig. 8. The classification accuracy (%) of BPDMatMHKS with varying parameter λ.
Then, we unitize the obtained solutions of BPDMatMHKS, MatMHKS and MHKS and measure the difference between these solution vectors using the 2-norm. The formula is as follows:

$$\left\| \frac{\tilde{v}^{T} \otimes u}{\|\tilde{v}^{T} \otimes u\|_{2}} - \frac{\tilde{w}}{\|\tilde{w}\|_{2}} \right\|_{2}, \qquad (32)$$
The range of Eq. 32 is from 0 to 2. Therefore, the value 1 can be regarded as the threshold indicating whether the difference is obvious. In addition, we also measure the difference of the solution vectors between BPDMatMHKS and MatMHKS under the same matrix model. The results are shown in Table IX. Since the multi-class data sets have too many solution vectors to reflect the difference clearly, only the binary-class data sets are used in this experiment. In addition, Table IX also shows the required sum of weights in the matrix (d1 + d2) and vector (d1 × d2) cases.
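To make the comparison concrete, a minimal sketch of Eq. 32 is given below; u, v_tilde and w_tilde stand for already trained weight vectors of the matrix-based and vector-based classifiers, and the feature ordering of the Kronecker product is assumed to match that of the vector model.

```python
import numpy as np

def solution_distance(u, v_tilde, w_tilde):
    """Eq. 32: 2-norm distance between the unitized Kronecker-product solution
    of a matrix-based classifier and the unitized solution of the vector-based one."""
    kron = np.kron(np.ravel(v_tilde), np.ravel(u))   # flattened form of v~^T (x) u, length d1*d2
    kron = kron / np.linalg.norm(kron)               # unitize the matrix-based solution
    w = np.ravel(w_tilde) / np.linalg.norm(w_tilde)  # unitize the vector-based solution
    return np.linalg.norm(kron - w)                  # the value lies in [0, 2]
```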
TABLE IX
Two-norm distances between the solution vectors of BPDMatMHKS, MatMHKS and MHKS on the binary-class data sets (the best result on each data set is written in bold; BPDMatMHKS is abbreviated as BPD and MatMHKS as Mat). For each data set and each matrix model d1 × d2, the table lists the required sums of weights in the matrix (d1 + d2) and vector (d1 × d2) cases and the distances BPD–Mat, BPD–MHKS and Mat–MHKS; averaged over all configurations, the distances are 0.6996 (BPD–Mat), 1.0385 (BPD–MHKS) and 1.0623 (Mat–MHKS).
Fig. 9. The value of ||b(k + 1) − b(k)|| of BPDMatMHKS versus the iteration number on all used data sets (panels (a)–(d)).
According to Table IX, the average difference between BPDMatMHKS and MatMHKS is 0.6996, which is far less than 1. Therefore, it can be concluded that the difference between BPDMatMHKS and MatMHKS is not very significant. However, the average difference between the matrix-based algorithms (BPDMatMHKS and MatMHKS) and MHKS is over 1. Further observation on the data sets Sonar, Hill Valley and Secom shows that the difference between the matrix model and the vector model becomes larger as the required number of weights declines. Therefore, it can be concluded that the matrix-based algorithms differ considerably from the vector-based algorithm.
E. Classification performance of the proposed regularization term

In this paper, we propose a new regularization term Rbpd, which concerns the prior structural information of the boundary samples. By introducing this regularization term into the original MatMHKS, we further extend the framework and improve the classification performance of MatMHKS. In this subsection, we further discuss how the regularization term works on data sets with different dimensions and sizes. Here, we list the classification results of BPDMatMHKS and MatMHKS in Table X.

TABLE X
Performance difference between BPDMatMHKS and MatMHKS on data sets with different dimensions.

BPDMatMHKS vs MatMHKS    dimensions < 100    100 ≤ dimensions < 1000    1000 ≤ dimensions
Win/Loss/Tie             11/4/2              3/0/2                      1/1/2
According to the statistical results, the proposed BPDMatMHKS outperforms MatMHKS when the dimension of the data set is less than 100. Compared with MatMHKS, BPDMatMHKS has a clear advantage when 100 ≤ dimensions < 1000. For the data sets with 1000 or more dimensions, BPDMatMHKS is still competitive with MatMHKS.
In order to further reflect the stability of the regularization term and to explore which factors affect the results under varying parameter λ, the standard deviation of the classification accuracies of every data set in Figs. 6 and 8 is calculated first. Here, this standard deviation is denoted Stdλ and reflects the stability of the regularization term: the lower the value of Stdλ, the higher the stability of the regularization term. Then, the corresponding best matrix type and the required sum of weights (d1 + d2) are listed in the table, together with the data size. Since BPDMatMHKS is a binary classifier, the average number of samples per class is more reasonable for reflecting the data size. In addition, the ratio of this average number to the required sum of weights is also listed in the table.
According to Table XI, we can compute the correlation coefficients between Stdλ and the other factors. The correlation coefficient ρXY, which reflects the relevance between variables X and Y, is calculated as follows:

$$\rho_{XY} = \frac{\mathrm{COV}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}, \qquad (33)$$
TABLE XI
The used model, required sum of weights (d1 + d2), average class size, the ratio of the average class size to (d1 + d2), and Stdλ (%) of BPDMatMHKS on all used data sets (the best result on each data set is written in bold).
$$\mathrm{COV}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big], \qquad (34)$$
where E[X] is the mathematical expectation of X, and Var(X) and Var(Y) are the variances of X and Y, respectively. In Table XI, the correlation coefficient between Stdλ and the required sum of weights is 0.5659, which means that the larger the number of required weights is, the lower the stability of the regularization term is.
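A minimal sketch of Eqs. 33 and 34 is given below; the arrays in the usage comment are illustrative placeholders rather than the measurements reported in Table XI.

```python
import numpy as np

def correlation(x, y):
    """Correlation coefficient of Eqs. 33-34: COV(X, Y) / (sqrt(Var(X)) * sqrt(Var(Y)))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))        # Eq. 34
    return cov / (np.sqrt(x.var()) * np.sqrt(y.var()))    # Eq. 33

# Illustrative usage with placeholder arrays (not the paper's values):
# std_lambda  = [...]   # Std_lambda per data set
# sum_weights = [...]   # corresponding d1 + d2 per data set
# rho = correlation(std_lambda, sum_weights)
```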
Moreover, the correlation coefficient between Stdλ and the average size is −0.6059, which means that the stability of the regularization term improves as the average class size increases. In addition, the correlation coefficient between Stdλ and average size/(d1 + d2) is −0.6129, which means that the stability of the regularization term is related to the ratio between the data size and the number of required weights. This result accords with learning theory: more samples and fewer dimensions bring better generalization and stability.

V. Rademacher Complexity Analysis and Stability Test
In this section, the Rademacher complexity [16] is used to estimate the generalization risk bound [23] of the proposed BPDMatMHKS. The Rademacher complexity is different from the well-known Vapnik-Chervonenkis (VC) dimension theory [31]: the VC dimension is independent of the distribution of the data, whereas the Rademacher complexity takes the distribution of the data into account. Accordingly, it has been demonstrated to be an effective technique for measuring the generalization risk bound of a classifier [34].
Suppose the training data set is D = {(x_i, y_i)}, i = 1, ..., N, where y_i is the label of the sample. For a binary-classification problem, suppose the hypothesis space is H and let Z = X × {−1, 1}. Then the hypothesis h can be written as:

$$f_{h}(z) = f_{h}(x, y) = I(h(x) \neq y), \qquad (35)$$

Then the range of the hypothesis space H transforms from {−1, 1} to [0, 1], and the corresponding function space is F_H = {f_h, h ∈ H}. The empirical Rademacher complexity can then be written as:

$$\hat{R}_{Z}(F_{H}) = E_{\sigma}\Big[\sup_{f_{h} \in F_{H}} \frac{1}{N}\sum_{i=1}^{N} \sigma_{i} f_{h}(x_{i}, y_{i})\Big] = \frac{1}{2} E_{\sigma}\Big[\sup_{h \in H} \frac{1}{N}\sum_{i=1}^{N} \sigma_{i} h(x_{i})\Big] = \frac{1}{2}\hat{R}_{D}(H), \qquad (36)$$

Taking the mathematical expectation of Eq. 36, we obtain:

$$R_{N}(F_{H}) = \frac{1}{2} R_{N}(H), \qquad (37)$$
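The following is a minimal sketch of estimating the empirical Rademacher complexity of Eq. 36 by Monte Carlo sampling of the Rademacher variables σ for a finite set of candidate hypotheses; the prediction matrix is an illustrative placeholder and not part of the paper.

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=1000, seed=None):
    """Monte Carlo estimate of R_hat_D(H) in Eq. 36.

    predictions: array of shape (n_hypotheses, N) holding h(x_i) in {-1, +1}
                 for a finite set of candidate hypotheses h.
    """
    rng = np.random.default_rng(seed)
    n_h, N = predictions.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)      # Rademacher variables
        total += np.max(predictions @ sigma) / N     # sup over h of (1/N) sum_i sigma_i h(x_i)
    return total / n_draws

# By Eq. 36, the complexity of the 0/1-loss class F_H is half of this value:
# R_hat_Z(F_H) = 0.5 * empirical_rademacher(predictions)
```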
Theorem 1. In the hypothesis space H : X → [−1, 1], the data set X = {x1, x2, ..., xN}, xi ∈ X, is a random sample set independently chosen according to distribution D. Then, with probability at least 1 − δ, δ ∈ (0, 1), every h ∈ H satisfies:

$$E[h] \le \hat{E}[h] + R_{N}(H) + \sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (38)$$

$$E[h] \le \hat{E}[h] + \hat{R}_{D}(H) + 3\sqrt{\frac{\ln(1/\delta)}{2N}}, \qquad (39)$$
Then, we use R_N(h_BPDMatMHKS), R_N(h_MatMHKS) and R_N(h_MHKS) to denote the Rademacher complexities of BPDMatMHKS, MatMHKS and MHKS, respectively. The objective function of BPDMatMHKS is inherited from MatMHKS and, beyond that, it contains one more regularization term Rbpd than MatMHKS. Therefore, the hypothesis space H of BPDMatMHKS is smaller than that of MatMHKS, and we get {h_BPDMatMHKS} ⊆ {h_MatMHKS}. According to Eq. 36, we have:

$$R_{N}(h_{BPDMatMHKS}) \le R_{N}(h_{MatMHKS}). \qquad (40)$$

For MatMHKS and MHKS, it is known that MatMHKS originates from MHKS and that the solution space of MatMHKS is decomposed from that of MHKS according to the Kronecker product decomposability constraint. Accordingly, the hypothesis space H of MHKS contains the hypothesis space H of MatMHKS, and we get {h_MatMHKS} ⊆ {h_MHKS}. Similarly, it follows that:

$$R_{N}(h_{MatMHKS}) \le R_{N}(h_{MHKS}). \qquad (41)$$

Based on Eq. 40 and Eq. 41, the relationship between R_N(h_BPDMatMHKS), R_N(h_MatMHKS) and R_N(h_MHKS) can be concluded as:

$$R_{N}(h_{BPDMatMHKS}) \le R_{N}(h_{MatMHKS}) \le R_{N}(h_{MHKS}). \qquad (42)$$

Then, it can be concluded that BPDMatMHKS has a tighter generalization risk bound than both MatMHKS and MHKS.
Although the generalization error bound can be deduced from the Rademacher complexity, another approach is needed to obtain a concrete estimate of the error bound. In this paper, we analyze the stability of BPDMatMHKS, MatMHKS and MHKS. The stability test observes the change of the output when the input is changed.

Suppose that the training data set D = {z1 = (x1, y1), z2 = (x2, y2), ..., zn = (xn, yn)} is independently chosen according to distribution D, where xi ∈ X and yi is the label of the sample. For a binary-classification problem, suppose the hypothesis space is H and let the function F : X → {−1, 1}; the function F is learned from H according to the training samples. Consider the change of D: D\i = {z1, z2, ..., zi−1, zi+1, ..., zn} denotes the data set obtained by removing the i-th sample from D, and the loss function is denoted as L(FD, z).
Definition 1. For every x ∈ X and z = (x, y), if the function F satisfies

$$|L(F_{D}, z) - L(F_{D^{\setminus i}}, z)| \le \beta, \quad i = 1, 2, ..., n, \qquad (43)$$

then the function F is said to have uniform stability β.

For every D and z = (x, y), if the loss function L satisfies 0 ≤ L(FD, z) ≤ M, then [3]:

Theorem 2. Suppose D with n samples is a random sample set independently chosen according to distribution D. If the function F has uniform stability β and the upper bound of the loss function L is M, then with probability at least 1 − δ, δ ∈ (0, 1), the generalization error satisfies:

$$L(F_{D}, z) \le \hat{L}(F_{D}, z) + 2\beta + (4n\beta + M)\sqrt{\frac{\ln(1/\delta)}{2n}}, \qquad (44)$$

where L̂(FD, z) is the empirical error.
According to Eq. 43, every sample is removed from the data set in turn and the corresponding function FD\i is learned. However, it is time-consuming to retrain for every removed sample. In our experiment, the number of samples removed from the data set is fixed to 5. Therefore, the obtained value of β is relatively larger than the β computed under the original definition.
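A minimal sketch of this estimation procedure is given below; train_fn and loss_fn are hypothetical placeholders for the learner under test and its bounded loss, not the paper's implementation.

```python
import numpy as np

def estimate_beta(train_fn, loss_fn, D, n_removed=5, seed=None):
    """Rough estimate of the stability constant beta of Def. 1: retrain with a few
    individual samples removed and record the largest change in loss.

    train_fn(data)    -> fitted model        (placeholder for the learner under test)
    loss_fn(model, z) -> loss on one sample  (assumed bounded by M)
    D                 -> list of samples z_i = (x_i, y_i)
    """
    rng = np.random.default_rng(seed)
    full_model = train_fn(D)
    removed = rng.choice(len(D), size=n_removed, replace=False)  # only a few removals, as in the paper
    beta = 0.0
    for i in removed:
        reduced_model = train_fn([z for j, z in enumerate(D) if j != i])
        # The sup over z in Def. 1 is approximated by the available samples.
        diffs = [abs(loss_fn(full_model, z) - loss_fn(reduced_model, z)) for z in D]
        beta = max(beta, max(diffs))
    return beta
```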
The results of the experiment are shown in Fig. 10. BPDMatMHKS generally obtains the smallest value of β except on Pima Indians; according to Theorem 2, BPDMatMHKS therefore has a tighter error bound than the other two algorithms. MHKS obtains the highest value of β on 8 of the binary-class data sets. MatMHKS is ranked first on Pima Indians, while its stability is not good on House Vote, Wdbc and Breast Cancer Wisconsin; on the other data sets, MatMHKS is ranked second. Broadly speaking, the experimental results are consistent with the theoretical analysis based on the Rademacher complexity. Therefore, BPDMatMHKS has the lower generalization error bound, which means that it has better generalization ability in terms of both theory and experiment.

VI. Conclusions
Boundary samples, which are easily misclassified, play an important role in classifier design. In this paper, we proposed a boundary sample selection method based on the nearest neighbor principle. This method is robust since it eliminates the influence of noise points and fuzzy points. Further, we used these boundary samples to design a regularization term Rbpd based on boundary projection discrimination. By introducing the term Rbpd into the original matrix-based classifier MatMHKS, we overcome the drawback that MatMHKS ignores the boundary information of the whole data distribution.
Fig. 10. The value of β calculated for BPDMatMHKS, MatMHKS and MHKS on the binary-class data sets Water, Sonar, Ionosphere, Horse Colic, House Vote, Wdbc, Breast Cancer Wisconsin (BCW), Transfusion, Pima Indians, Hill Valley, and Banknote Authentication (BA).
In the experiments, we validate the feasibility and effectiveness of BPDMatMHKS. The performance of the proposed BPDMatMHKS is better than that of the original MatMHKS and the other comparison algorithms on the UCI data sets and image data sets. Further, the influence of the parameters c and λ is discussed in the analysis of the experiments. Finally, we use the Rademacher complexity analysis and the stability test to show that the proposed BPDMatMHKS has a tighter generalization risk bound.
Acknowledgment
This work is supported by the Natural Science Foundation of China under Grant No. 61672227, the Shuguang Program supported by the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission, and the 863 Plan of the China Ministry of Science and Technology under Grant No. 2015AA020107.
References

[1] H. Abdi and L.J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
[2] S. Barua, M.M. Islam, X. Yao, and K. Murase. MWMOTE-Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2):405–425, 2014.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(3), 2002.
[4] F. Cakoni and D. Colton. Ill-posed problem. Applied Mathematical Sciences, 188:27–43, 2014.
[5] H. Chen, L.Q. Li, and J.T. Peng. Error bounds of multi-graph regularized semi-supervised classification. Information Sciences, 179(12):1960–1969, 2009.
[6] S. Chen, Z. Wang, and Y. Tian. Matrix-pattern-oriented Ho–Kashyap classifier with regularization learning. Pattern Recognition, 40(5):1533–1543, 2007.
[7] S. Chen, Y. Zhu, D. Zhang, and J.Y. Yang. Feature extraction approaches based on matrix pattern: MatPCA and MatFLDA. Pattern Recognition Letters, 26(8):1157–1167, 2005.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[9] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, 2012.
[10] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[11] K. Hagiwara and K. Kuno. Regularization learning and early stopping in linear networks. In IEEE International Joint Conference on Neural Networks, volume 4, pages 511–516, 2000.
[12] H. Han, W.Y. Wang, and B.H. Mao. Borderline-SMOTE: A new oversampling method in imbalanced data sets learning. Lecture Notes in Computer Science, 3644(5):878–887, 2005.
[13] P. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515–516, 1968.
[14] M. Hollander, D. Wolfe, and E. Chicken. Nonparametric Statistical Methods. John Wiley & Sons, 2013.
[15] Jayadeva, R. Khemchandani, and S. Chandra. Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):905–910, 2007.
[16] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[17] U. Kreßel. Pairwise classification and support vector machines. In Advances in Kernel Methods, pages 255–268, 1999.
[18] J. Leski. Ho–Kashyap classifier with generalization control. Pattern Recognition Letters, 24(14):2281–2290, 2003.
[19] Y.H. Li. Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 2011.
[20] R.Z. Liang, L. Shi, J. Meng, J.J.Y. Wang, Q. Sun, and Y. Gu. Top precision performance measure of content-based image retrieval by learning similarity function. In Pattern Recognition (ICPR), 2016 23rd International Conference, 2016.
[21] R.Z. Liang, W. Xie, W. Li, H. Wang, J.J.Y. Wang, and L. Taylor. A novel transfer learning method based on common space mapping and weighted domain matching. In Tools with Artificial Intelligence (ICTAI), 2016 IEEE 28th International Conference, pages 299–303, 2016.
[22] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. The MIT Press, 1988.
[23] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[24] K. Nikolaidis, J.Y. Goulermas, and Q.H. Wu. A class boundary preserving algorithm for data condensation. Pattern Recognition, 44(3):704–715, 2011.
[25] C.G. Osorio, A.D.H. García, and N.G. Pedrajas. Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artificial Intelligence, 174(5–6):410–441, 2010.
[26] X. Pan, Y. Luo, and Y. Xu. K-nearest neighbor based structural twin support vector machine. Knowledge-Based Systems, 88:34–44, 2015.
[27] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[28] X. Shi, Y. Yang, Z. Guo, and Z. Lai. Face recognition by sparse discriminant analysis via joint L2,1-norm minimization. Pattern Recognition, 47(7):2447–2453, 2014.
[29] M. Tanveer, K. Shubham, M. Aldhaifallah, and S.S. Ho. An efficient regularized k-nearest neighbor based weighted twin support vector regression. Knowledge-Based Systems, 94:70–87, 2016.
[30] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[31] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
[32] V. Vapnik. Statistical Learning Theory. Wiley Interscience, 1998.
[33] Z. Wang, S. Chen, J. Liu, and D. Zhang. Pattern representation in feature extraction and classifier design: Matrix versus vector. IEEE Transactions on Neural Networks, 19(5):758–769, 2008.
[34] Z. Wang, S.C. Chen, and D.Q. Gao. A novel multi-view learning developed from single-view patterns. Pattern Recognition, 44(10):2395–2413, 2011.
[35] Y. Yao. Three-way decisions with probabilistic rough sets. Information Sciences, 180(3):341–353, 2010.
[36] Y. Yao. The superiority of three-way decisions in probabilistic rough set models. Information Sciences, 181(6):1080–1096, 2011.
[37] Y. Yao. Two semantic issues in a probabilistic rough set model. Fundamenta Informaticae, 108(3):249–265, 2011.
[38] H.T. Zhao and W.K. Wong. Regularized discriminant entropy analysis. Pattern Recognition, 47(2):806–819, 2014.