Maximum margin classification based on flexible convex hulls

Neurocomputing 149 (2015) 957–965
Ming Zeng, Yu Yang*, Jinde Zheng, Junsheng Cheng
State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha 410082, PR China
* Corresponding author. E-mail address: [email protected] (Y. Yang).


Abstract

Article history: Received 16 January 2014; received in revised form 14 May 2014; accepted 14 July 2014; available online 2 August 2014. Communicated by Shiliang Sun.
http://dx.doi.org/10.1016/j.neucom.2014.07.038

Based on the definition of a flexible convex hull, a maximum margin classification based on flexible convex hulls (MMC-FCH) is presented in this work. The flexible convex hull defined in our work is a class region approximation looser than a convex hull but tighter than an affine hull. MMC-FCH approximates each class region with a flexible convex hull of its training samples, and then finds a linear separating hyperplane that maximizes the margin between flexible convex hulls by solving a closest pair of points problem. The method can be extended to the nonlinear case by using the kernel trick, and multi-class classification problems are dealt with by constructing binary pairwise classifiers as in the support vector machine (SVM). Experiments on several databases show that the proposed method compares favorably to the maximum margin classification based on convex hulls (MMC-CH) or affine hulls (MMC-AH). © 2014 Elsevier B.V. All rights reserved.

Keywords: Flexible convex hull; Maximum margin classification; Kernel method; Convex hull; Affine hull; Support vector machine

1. Introduction

Over recent years, as a robust methodology for classification, the support vector machine (SVM) [1] has been successfully used in a wide variety of applications including computer vision [2,3], text categorization [4,5], bioinformatics [6,7] and fault diagnosis [8,9]. Moreover, some fruitful methods combining SVM with other learning strategies have also been proposed and bring performance improvements to SVM. Ji et al. [10] proposed a new learning paradigm named multitask multiclass privileged information support vector machine, which takes full advantage of multitask learning and privileged information. Sun and Shawe-Taylor [11] presented a general framework for sparse semi-supervised learning and subsequently proposed a sparse multi-view support vector machine. Wang et al. [12] introduced a novel active learning support vector machine algorithm with adaptive model selection, which traces the full solution path of the base classifier before each new query and then performs efficient model selection using the unlabeled samples.

The basic idea of SVM is to construct a separating hyperplane that maximizes the geometric margin, defined as the distance between the separating hyperplane and the closest samples from the two sample sets. From a geometrical point of view, in the linearly separable case the SVM optimization problem of finding the maximum margin between two sample sets is equivalent to finding the closest pair of points on the respective convex hulls of the two sample sets, and the optimal separating hyperplane is the one that perpendicularly bisects the line segment connecting the closest pair of points [13]. That is to say, SVM can be regarded as a maximum margin classification based on convex hulls (MMC-CH), which in effect approximates each class region with a convex hull, and the two closest points on the convex hulls determine the hyperplane separating them. In practice, the samples we can obtain are always finite, but the class region that generated them actually contains infinitely many points. SVM approximates the class region with the convex hull of the samples rather than with the isolated samples themselves, so the finite sample set is effectively extended to an infinite one.

Other analogous geometric models have also been proposed for approximating class regions, such as affine hulls, hyperspheres and hyperellipsoids. In the spirit of the geometric interpretation of SVM, Zhou et al. [14] and Cevikalp et al. [15] independently presented a maximum margin classification based on affine hulls (MMC-AH). In contrast to SVM, this method approximates class regions with the affine hulls of their sample sets rather than with convex hulls, and the separating hyperplane is then determined by solving the closest pair of points problem. Tax and Duin [16,17] introduced a method called support vector data description (SVDD) for novelty or outlier detection. SVDD tries to find the smallest hypersphere containing almost all sample points to approximate the class region, and classifies an unknown sample according to the Euclidean distance from this sample to the center of the hypersphere. The hypersphere approximation has also been extended to multi-class classification problems.

Lee and Lee [18] employed a hypersphere to approximate the class region of each sample set and then used these approximations to classify an unknown sample via a Bayesian decision rule. Wei et al. [19] proposed another one-class classification method, which replaces the hypersphere with a hyperellipsoid for the class region approximation in SVDD and accordingly uses the Mahalanobis distance rather than the Euclidean distance in the decision rule. Whether for one-class or multi-class classification problems, these geometric-model-based classification methods have one thing in common: a geometric model (convex hull, affine hull, hypersphere, hyperellipsoid, etc.) is used to approximate the class region of each sample set, and the classification model is then built on top of a certain decision rule. These methods therefore suggest a useful idea for classification: one can start from geometric approximation models of the class regions.

The convex hull is the smallest convex set containing the given finite samples, so it approximates the class region very tightly. Generally speaking, the natural class region almost always extends beyond the convex hull of its finite samples, especially in high-dimensional spaces. In this sense, the unrealistically tight convex hull is typically a substantial under-approximation of the class region. As opposed to the convex hull, the affine hull gives a rather loose approximation, because it does not constrain the positions of the sample points within the affine subspace. Besides, linear separability of the samples does not necessarily guarantee the separability of the corresponding affine hulls [15]; it is more restrictive from this point of view, since affine hulls extend to infinity in every direction and therefore avoid overlapping only if they are parallel to each other. The hypersphere model is particularly suitable for samples with a spherical distribution and a high degree of clustering; otherwise, a large empty space appears inside the hypersphere. In this case, the hypersphere also gives a relatively loose approximation of the class region, and samples from other classes, or outliers generated by noise and interference, that fall into this empty space are likely to be misclassified to the approximated class. Compared with the hypersphere, the hyperellipsoid model takes the distribution of the samples into consideration and thus approximates the class region more tightly. However, in the case of high-dimensional small samples the covariance matrix tends to be singular, which creates algorithmic obstacles for classification methods based on hyperellipsoids in high-dimensional feature spaces. Although techniques have been developed to improve the estimation of the covariance matrix, such methods still require many more parameters in order to represent it. In a word, it is still worthwhile to explore other appropriate geometric models for approximating the class regions of samples in pattern recognition research.

Inspired by convex hulls and affine hulls, in this study we define a new geometric model, called a flexible convex hull, for the class region approximation and propose a novel classification method, maximum margin classification based on flexible convex hulls (MMC-FCH). The basic goal of MMC-FCH is to find an optimal linear separating hyperplane that yields the maximum margin between the flexible convex hulls of the sample sets. This can be solved by computing a closest pair of points, and the optimal separating hyperplane is then chosen to be the one that perpendicularly bisects the line segment connecting the closest pair of points. MMC-FCH can also be extended to the nonlinear case by using the kernel trick. To use the proposed method in multi-class classification problems, we can adopt most of the common strategies developed for extending binary SVM classifiers to the multi-class case.

The remaining part of the paper is organized as follows. Section 2 gives the definition of a flexible convex hull. Section 3 introduces the proposed method, MMC-FCH. Section 4 presents our experimental results and Section 5 concludes the paper.

2. Definition of a flexible convex hull

2.1. Motivation

As mentioned above, SVM can be regarded as a maximum margin classification based on convex hulls (MMC-CH), which first approximates each class with the convex hull of its training samples and then finds a hyperplane that maximizes the margin between the two convex hulls. The convex hull of a sample set is the set of linear combinations of the sample points in which all coefficients are non-negative and sum to one. Consider a finite sample set X = \{x_i\}\ (i = 1, 2, \ldots, n), x_i \in \mathbb{R}^p. The convex hull of X can be written as

\mathrm{conv}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1,\; 0 \le \alpha_i \le 1 \right\} \quad (1)

where \alpha_i is the combination coefficient of the ith sample point x_i. The convex hull model is the tightest possible convex approximation of the class region, and for classes with more general convex forms it is typically a substantial under-approximation. In contrast to convex hull models, maximum margin classification based on affine hulls (MMC-AH) approximates each class with an affine hull [15]. The affine hull of a sample set is the set of linear combinations of the sample points whose coefficients add up to one, without non-negativity constraints. The affine hull of X can thus be written as

\mathrm{aff}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1 \right\} \quad (2)

where \alpha_i is the combination coefficient of the ith sample point x_i. Even though the affine hull is an unbounded and hence typically rather loose model of the class region compared with the convex hull approximation, MMC-AH works surprisingly better than SVM (or MMC-CH), especially in high-dimensional spaces with a limited number of samples [15]. This is one indication that convex-hull-based methods may be too tight to be realistic. Nevertheless, owing to the unboundedness of affine hulls, the separability of affine hulls requires them to be parallel to each other. If different sample sets have similar or intersecting affine hulls but very different distributions of samples within their affine hulls, MMC-AH fails to separate the sample sets. Therefore, it seems more reasonable to tighten the affine hull model.

2.2. Flexible convex hulls

Motivated by convex hulls and affine hulls, we define a new geometric model, called a flexible convex hull, for the class region approximation. Similar to the convex hull and the affine hull, a flexible convex hull can also be expressed as the set of linear combinations of the sample points whose coefficients add up to one, but it imposes different lower and upper bounds on the coefficients. More formally, the flexible convex hull of the sample set X is defined as

\mathrm{flex}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1,\; \frac{1-\lambda}{n} \le \alpha_i \le \frac{1-\lambda}{n} + \lambda \right\} \quad (3)

where \alpha_i is the combination coefficient of the ith sample point x_i and \lambda \in [1, +\infty) is the flexible factor.

The flexible convex hull and the flexible factor both have explicit geometric interpretations. For a given \lambda, an arbitrary sample point x_i \in X extends along the radial direction from the set center \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i through x_i, and the corresponding extended point can be written as x'_i = (1-\lambda)\bar{x} + \lambda x_i. The convex hull of the new sample set X' = \{x'_i\}\ (i = 1, 2, \ldots, n) is then exactly the flexible convex hull of the original sample set X; moreover, the flexible factor \lambda is exactly the ratio of the distance between the extended point x'_i and the set center \bar{x} to the distance between the original point x_i and the set center \bar{x}. In other words, we have the following proposition.

Proposition. The flexible convex hull (3) of the sample set X = \{x_i\}\ (i = 1, 2, \ldots, n) is equivalent to the convex hull of the new sample set

X' = \left\{ x'_i \;\middle|\; x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i,\; i = 1, 2, \ldots, n \right\} \quad (4)

i.e.,

\mathrm{conv}(X') = \left\{ \sum_{i=1}^{n} \beta_i x'_i \;\middle|\; x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i,\; \sum_{i=1}^{n} \beta_i = 1,\; 0 \le \beta_i \le 1 \right\} \quad (5)

where \beta_i is the combination coefficient of the ith extended point x'_i and \lambda \in [1, +\infty) is the flexible factor. The complete proof is given in the Appendix.

Note that the lower and upper bounds of the coefficients vary with the flexible factor. If \lambda = 1, then \frac{1-\lambda}{n} = 0 and \frac{1-\lambda}{n} + \lambda = 1, and the flexible convex hull reduces to the convex hull; if 1 < \lambda < +\infty, then \frac{1-\lambda}{n} < 0 and \frac{1-\lambda}{n} + \lambda > 1, and we obtain the flexible convex hull with respect to \lambda; if \lambda \to +\infty, then \frac{1-\lambda}{n} \to -\infty and \frac{1-\lambda}{n} + \lambda \to +\infty, and the flexible convex hull expands to the affine hull. In this sense, the convex hull and the affine hull can be considered limiting cases of the flexible convex hull, and the flexible convex hull is an approximation model looser than the convex hull but tighter than the affine hull. The introduction of the flexible factor \lambda, just as its name implies, makes the newly defined geometric model more flexible. For an appropriate choice of \lambda, it is entirely possible for the flexible convex hull to give a desirable approximation of the class region.
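To make the proposition concrete, the following minimal NumPy sketch builds the extended sample set X' of Eq. (4) for a given flexible factor; by the proposition, the convex hull of these extended points is exactly flex(X). The function name and the toy data are illustrative only and not part of the original work.

```python
import numpy as np

def extended_points(X, lam):
    """Extended sample set X' of Eq. (4):
    x'_i = (1 - lam) * mean(X) + lam * x_i.
    For lam = 1 the points are unchanged (convex hull);
    larger lam pushes every point away from the class center."""
    center = X.mean(axis=0)                 # (1/n) * sum_j x_j
    return (1.0 - lam) * center + lam * X   # broadcast over rows

# toy example: five 2-D samples, flexible factor 1.5
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
print(extended_points(X, lam=1.5))
```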

3. MMC-FCH

3.1. Linearly separable case

Having defined the flexible convex hull, we propose a maximum margin classification based on flexible convex hulls (MMC-FCH). The basic goal of MMC-FCH is to find an optimal linear separating hyperplane that yields the maximum margin between the flexible convex hulls of the sample sets. All points x lying on the separating hyperplane satisfy \langle w, x \rangle + b = 0, where w and b are the normal and bias of the separating hyperplane, respectively. For the optimal separating hyperplane, all points x in the positive sample set satisfy \langle w, x \rangle + b > 0 and all points x in the negative sample set satisfy \langle w, x \rangle + b < 0. Finding the best separating hyperplane maximizing the margin between the flexible convex hulls can be accomplished by computing the closest pair of points on them; the optimal separating hyperplane is then the one that perpendicularly bisects the line segment connecting the closest pair of points, as in SVM. Once the optimal separating hyperplane is determined, an unknown sample x is classified by the decision function f(x) = \mathrm{sign}(\langle w, x \rangle + b).

Consider a binary classification problem with the positive and negative training sets given in the form X_+ = \{x_{+i}\}\ (i = 1, 2, \ldots, n_+) and X_- = \{x_{-i}\}\ (i = 1, 2, \ldots, n_-). According to formula (3), the flexible convex hulls of the positive and negative sample sets can be written as

\mathrm{flex}(X_+) = \left\{ \sum_{i=1}^{n_+} \alpha_{+i} x_{+i} \;\middle|\; \sum_{i=1}^{n_+} \alpha_{+i} = 1,\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+ \right\} \quad (6)

\mathrm{flex}(X_-) = \left\{ \sum_{i=1}^{n_-} \alpha_{-i} x_{-i} \;\middle|\; \sum_{i=1}^{n_-} \alpha_{-i} = 1,\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_- \right\} \quad (7)

where \lambda_+ and \lambda_- are the flexible factors of the positive and negative flexible convex hulls, respectively. In the linearly separable case, finding the closest pair of points on the flexible convex hulls of the sample sets can be written as the following optimization problem:

\min_{\alpha_+, \alpha_-} \; \frac{1}{2} \left\| \sum_{i=1}^{n_+} \alpha_{+i} x_{+i} - \sum_{i=1}^{n_-} \alpha_{-i} x_{-i} \right\|^2
\text{s.t.} \;\; \sum_{i=1}^{n_+} \alpha_{+i} = 1, \;\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+, \; i = 1, 2, \ldots, n_+
\;\;\;\;\; \sum_{i=1}^{n_-} \alpha_{-i} = 1, \;\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_-, \; i = 1, 2, \ldots, n_- \quad (8)

Expanding the objective function, the optimization problem can be written as

\min_{\alpha_+, \alpha_-} \; \frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha_{+i}\alpha_{+j} \langle x_{+i}, x_{+j} \rangle + \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha_{-i}\alpha_{-j} \langle x_{-i}, x_{-j} \rangle - 2 \sum_{i=1}^{n_+}\sum_{j=1}^{n_-} \alpha_{+i}\alpha_{-j} \langle x_{+i}, x_{-j} \rangle \right)
\text{s.t.} \;\; \sum_{i=1}^{n_+} \alpha_{+i} = 1, \;\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+, \; i = 1, 2, \ldots, n_+
\;\;\;\;\; \sum_{i=1}^{n_-} \alpha_{-i} = 1, \;\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_-, \; i = 1, 2, \ldots, n_- \quad (9)

This is a quadratic programming problem that can be solved using standard optimization algorithms. From the geometric properties of flexible convex hulls, we know that the closest points always lie on the vertices or boundaries of the flexible convex hulls; in other words, the optimal separating hyperplane depends only on a few vertices or boundary points of the flexible convex hulls. Fig. 1 shows a two-dimensional case, in which the areas enclosed by dashes and by dots represent the flexible convex hulls and the convex hulls, respectively. The affine hulls cannot be shown in this figure, since they completely overlap each other in the two-dimensional case.

Fig. 1. The closest pair of points determines the optimal separating hyperplane. [Figure: positive and negative samples with their convex and flexible convex hulls, the closest pairs of points #1 and #2, and the corresponding separating hyperplanes #1 and #2.]

Given the optimal solution (\alpha^*_{+1}, \alpha^*_{+2}, \ldots, \alpha^*_{+n_+}, \alpha^*_{-1}, \alpha^*_{-2}, \ldots, \alpha^*_{-n_-})^T, let x^*_+ and x^*_- denote the corresponding closest points on the positive and negative flexible convex hulls, respectively. The normal w and bias b of the optimal separating hyperplane can then be computed from the following equations:

w = x^*_+ - x^*_- = \sum_{i=1}^{n_+} \alpha^*_{+i} x_{+i} - \sum_{i=1}^{n_-} \alpha^*_{-i} x_{-i} \quad (10)


b = -\frac{1}{2} w^T (x^*_+ + x^*_-) = -\frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha^*_{+i}\alpha^*_{+j} \langle x_{+i}, x_{+j} \rangle - \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha^*_{-i}\alpha^*_{-j} \langle x_{-i}, x_{-j} \rangle \right) \quad (11)

Finally, the decision function can equivalently be expressed as

f(x) = \mathrm{sign}\left\{ \sum_{i=1}^{n_+} \alpha^*_{+i} \langle x_{+i}, x \rangle - \sum_{i=1}^{n_-} \alpha^*_{-i} \langle x_{-i}, x \rangle - \frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha^*_{+i}\alpha^*_{+j} \langle x_{+i}, x_{+j} \rangle - \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha^*_{-i}\alpha^*_{-j} \langle x_{-i}, x_{-j} \rangle \right) \right\} \quad (12)
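As a rough illustration of the linear case, the sketch below solves the quadratic program (8) with a generic constrained optimizer and then recovers w, b and the decision function from Eqs. (10)–(12). This is a minimal sketch assuming SciPy's SLSQP solver is acceptable; it is not the solver used by the authors, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def mmc_fch_fit(Xp, Xn, lam_p=1.5, lam_n=1.5):
    """Linear MMC-FCH: closest pair of points between the two
    flexible convex hulls (Eq. (8)), then w, b from Eqs. (10)-(11)."""
    npos, nneg = len(Xp), len(Xn)

    def objective(a):
        ap, an = a[:npos], a[npos:]
        d = Xp.T @ ap - Xn.T @ an           # difference of the two hull points
        return 0.5 * d @ d

    cons = [{'type': 'eq', 'fun': lambda a: np.sum(a[:npos]) - 1.0},
            {'type': 'eq', 'fun': lambda a: np.sum(a[npos:]) - 1.0}]
    lo_p, hi_p = (1 - lam_p) / npos, (1 - lam_p) / npos + lam_p
    lo_n, hi_n = (1 - lam_n) / nneg, (1 - lam_n) / nneg + lam_n
    bounds = [(lo_p, hi_p)] * npos + [(lo_n, hi_n)] * nneg

    a0 = np.concatenate([np.full(npos, 1.0 / npos), np.full(nneg, 1.0 / nneg)])
    res = minimize(objective, a0, method='SLSQP', bounds=bounds, constraints=cons)
    ap, an = res.x[:npos], res.x[npos:]

    xp_star, xn_star = Xp.T @ ap, Xn.T @ an   # closest pair of points
    w = xp_star - xn_star                     # Eq. (10)
    b = -0.5 * w @ (xp_star + xn_star)        # Eq. (11)
    return w, b

def mmc_fch_predict(w, b, X):
    """Eq. (12) in the linear case."""
    return np.sign(X @ w + b)
```

A dedicated QP solver would be preferable for large training sets; the sketch only illustrates how the constraints and bounds of (8) translate into code.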

3.2. Linearly inseparable case

In the case of linearly inseparable flexible convex hulls, we can map the samples into a higher-dimensional space in which the flexible convex hulls constructed from the mapped samples become linearly separable, by using the kernel trick. Note that both the objective function of (9) and the decision function (12) can be written purely in terms of inner products among samples, which allows the use of the kernel trick, i.e., replacing the inner product \langle x_i, x_j \rangle with a kernel function k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle.

3.3. Extension to multi-class classification problems

MMC-FCH can be extended to multi-class classification problems by using most of the common strategies developed for extending binary SVM classifiers to the multi-class case. Here we only use the most popular strategy in our experiments: one-against-one [20]. For a c-class classification problem, one-against-one constructs all possible c(c-1)/2 binary pairwise classifiers out of the c classes and assigns an unknown sample to the class that wins the most pairwise decisions, as sketched below. Other strategies such as one-against-rest [21], binary decision trees [22] and directed acyclic graphs [23] can also be used.
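A minimal sketch of the one-against-one strategy described above; `fit_binary` and `predict_binary` are hypothetical placeholders standing for any binary classifier (for example, the MMC-FCH routines sketched earlier), not functions defined in the paper.

```python
import numpy as np
from itertools import combinations

def ovo_fit(X, y, fit_binary):
    """Train c(c-1)/2 pairwise classifiers; fit_binary(Xp, Xn) -> model."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        models[(a, b)] = fit_binary(X[y == a], X[y == b])
    return models

def ovo_predict(models, predict_binary, x):
    """Majority vote over all pairwise decisions for a single sample x."""
    votes = {}
    for (a, b), m in models.items():
        winner = a if predict_binary(m, x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```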

3.4. Choice of the flexible factor

Generally speaking, for a binary pairwise classifier each of the two flexible convex hulls is allowed to have its own flexible factor. In this preliminary study, we simply keep the two flexible factors equal, i.e., \lambda_+ = \lambda_- = \lambda; moreover, we also keep the flexible factors of all the pairwise classifiers equal. Different flexible factors correspond to different flexible convex hulls and thus generate different separating hyperplanes. Fig. 2 gives a two-dimensional example of finding a line separating simulated two-class data; as shown in the figure, the separating line varies with the flexible factor. As a result, the flexible factor directly affects the classification performance of MMC-FCH. Evidently, the flexible factor is so important that it has to be treated carefully. In the absence of prior knowledge about the sample distributions, however, we can hardly decide in advance which flexible factor is best for classification.

In the field of classification, k-fold cross-validation is usually used to evaluate or compare the performance of different classification methods. Generally, an appropriate flexible factor among the candidates is the one that gives the method the best classification performance, so we can choose the flexible factor by k-fold cross-validation. In k-fold cross-validation, the original sample set is randomly partitioned into k subsamples of roughly equal size and roughly the same class proportions. Of the k subsamples, a single subsample is retained as the validation data for testing the classification method, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are averaged to produce a single performance estimate for the classification method. In this preliminary work, we use 10-fold cross-validation to choose the flexible factor, as in the sketch below.

Fig. 2. Separating lines generated by MMC-FCH for simulated two-class data with various \lambda: 1.2, 1.4, 1.6, 1.8, 2.0. [Figure: two-class scatter plot (1st and 2nd dimensions) with the corresponding separating lines.]
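A minimal sketch of the flexible-factor selection just described, under the assumption that a generic train-and-evaluate routine is available; `train_and_score` is a hypothetical placeholder that fits the pairwise MMC-FCH classifiers with a given flexible factor and returns validation accuracy, and scikit-learn's StratifiedKFold is used only to create the ten label-balanced partitions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def choose_flexible_factor(X, y, candidate_lams, train_and_score, n_splits=10):
    """Pick the flexible factor with the best average k-fold CV accuracy.
    train_and_score(X_tr, y_tr, X_va, y_va, lam) -> validation accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_lam, best_acc = None, -np.inf
    for lam in candidate_lams:
        scores = [train_and_score(X[tr], y[tr], X[va], y[va], lam)
                  for tr, va in skf.split(X, y)]
        acc = float(np.mean(scores))
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc
```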

4. Experiments

In this section, we test the linear and kernelized versions of the proposed classification method, MMC-FCH, on several databases and compare them with those of MMC-CH and MMC-AH. The linear versions of the above methods are tested on the Yale face database [24], while the kernelized versions using Gaussian kernels are tested on the Jochen Triesch hand posture database [25] and the MNIST handwritten digits database [26]. Kernel parameters are specified by 10-fold cross-validation based on MMC-AH, and the three classification methods then share the same kernel parameter within the same experiment.

4.1. Performance evaluation

In the following experiments, we evaluate the classification performance not only over all classes but also for each class. To this end, the common overall accuracy (classification accuracy mentioned later means overall accuracy unless otherwise specified) and the F-measure are used, respectively. Overall accuracy is computed as the total number of correctly classified samples divided by the total number of samples. The F-measure for a single class is defined as the harmonic mean of precision and recall. Precision and recall for class i are defined as follows:

\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \quad (13)

\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \quad (14)

where TP_i (true positives) is the number of samples correctly classified to class i, FP_i (false positives) is the number of samples that do not belong to class i but are incorrectly classified to class i, and FN_i (false negatives) is the number of samples that actually belong to class i but are not classified to class i. The F-measure for class i is then defined as

F_i = \frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \quad (15)

The value of F_i varies from 0 to 1; a larger F_i implies that the classifier gives higher classification quality on class i.
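For completeness, a small sketch computing Eqs. (13)–(15) for every class from predicted and true labels; it assumes integer class labels 0..c-1 and is not part of the original paper.

```python
import numpy as np

def per_class_f_measure(y_true, y_pred, n_classes):
    """Precision, recall and F-measure (Eqs. (13)-(15)) for every class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    P, R, F = np.zeros(n_classes), np.zeros(n_classes), np.zeros(n_classes)
    for i in range(n_classes):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        P[i] = tp / (tp + fp) if tp + fp > 0 else 0.0
        R[i] = tp / (tp + fn) if tp + fn > 0 else 0.0
        F[i] = 2 * P[i] * R[i] / (P[i] + R[i]) if P[i] + R[i] > 0 else 0.0
    return P, R, F
```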

4.2. Yale face database

The Yale face database contains 165 grayscale images of 15 subjects. Each of the 15 subjects has 11 different images with variations in illumination (center light, left light and right light), facial expression (normal, happy, sad, sleepy, surprised and wink) and facial details (glasses and no glasses). The size of each image is 243 × 320 pixels, with 256 gray levels per pixel. In the experiments, each image from the Yale face database was cropped and resized to 64 × 64 pixels; Fig. 3 shows the processed images of one subject. Raw pixel values were used as features and each feature vector was normalized to unit length before use. Here the flexible factor determined by 10-fold cross-validation is set to 1.7378.

Fig. 3. Processed images of one subject from the Yale face database.

To assess the effect of the number of training samples per subject on classification performance, we randomly selected n = 2, 4, 6, 8, 10 images of each subject for training and the remaining 11 - n images for testing. This process was repeated 20 times independently, and the final classification accuracies were averaged over the 20 results. The experimental results are shown in Table 1.

Table 1. Classification accuracies (%) with different numbers of training samples per subject on the Yale face database.

Training samples   MMC-CH       MMC-AH       MMC-FCH
2                  82.9 ± 2.9   83.0 ± 2.9   83.0 ± 2.9
4                  92.4 ± 3.0   92.9 ± 2.7   92.9 ± 2.8
6                  94.7 ± 2.8   95.3 ± 3.1   95.3 ± 2.6
8                  96.6 ± 2.8   97.0 ± 1.9   97.3 ± 2.0
10                 99.3 ± 2.1   98.7 ± 2.7   99.3 ± 2.1

For the different numbers of training samples per subject, MMC-FCH and MMC-AH both outperform MMC-CH except in the case of 10 training samples. When 6 or fewer training samples per subject are used, MMC-FCH and MMC-AH give similar results, and both outperform MMC-CH. As the number of training samples per subject increases, MMC-AH performs worse than MMC-FCH and is even outperformed by MMC-CH in the case of 10 training samples per subject. These results show that flexible convex hulls, in addition to the affine hulls reported in Ref. [15], are also good models for representing classes in high-dimensional spaces with a limited number of training samples. In the small-sample case, convex hulls give unrealistically tight approximations to the classes, which leads to the lower classification accuracies of MMC-CH. In contrast, an appropriate flexible factor for the flexible convex hulls and the looseness of the affine hulls contribute to the better classification results of MMC-FCH and MMC-AH in such a case, respectively. As the number of training samples grows, the samples tend to become sufficient for training, and the excessive looseness of the affine hulls in turn lowers the performance of MMC-AH. From these results we conclude that the flexible convex hull is a geometric model between the convex hull and the affine hull, and that it captures the best aspects of both.

For each iteration of the repeated experiments, we can compute the F-measure value for each subject; the F-measure value over the whole repeated experiment is then the average over the 20 iterations. However, we find that in some cases the F-measure value for a certain subject varies dramatically over the 20 iterations, resulting in a large standard deviation. For example, in the case of 10 training samples per subject, the F-measure values and standard deviations for subject #3 given by MMC-CH, MMC-AH and MMC-FCH are 98.3 ± 7.5, 98.1 ± 7.9 and 98.3 ± 7.5, respectively. The standard deviations are too large to be acceptable. Additionally, in some cases the F-measure value for a certain subject cannot even be calculated through Eq. (15) because TP (true positives) for that subject is zero. The reason for these phenomena is that the classification result for a single subject is easily influenced by randomness in the repeated experiments, and since the F-measure concentrates on a single subject, it is easily influenced as well. We therefore do not report F-measure values for this database.
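The evaluation protocol used above (repeated random splits with a fixed number of training samples per class, accuracies averaged over the repeats) can be sketched as follows; `evaluate` is a hypothetical placeholder for training any of the three classifiers and returning test accuracy, and the sketch is only an illustration of the protocol, not the authors' code.

```python
import numpy as np

def repeated_split_accuracy(X, y, n_train_per_class, evaluate, n_repeats=20, seed=0):
    """Mean and std of test accuracy over repeated random splits with
    n_train_per_class training samples drawn from each class."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        tr_idx = []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            tr_idx.extend(rng.choice(idx, size=n_train_per_class, replace=False))
        tr_idx = np.array(tr_idx)
        te_idx = np.setdiff1d(np.arange(len(y)), tr_idx)
        accs.append(evaluate(X[tr_idx], y[tr_idx], X[te_idx], y[te_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```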

4.3. Jochen Triesch hand posture database

The Jochen Triesch hand posture database consists of 10 hand postures (“a”, “b”, “c”, “d”, “g”, “h”, “i”, “l”, “v” and “y”) performed by 24 persons against three backgrounds. For each person the 10 hand postures were recorded in front of uniform light, uniform dark and complex backgrounds. There are variations in scale and shape of each hand posture performed by different persons against the same background, and even performed by the same person against different backgrounds. Some images of hand postures against the three backgrounds are depicted in Fig. 4.

Fig. 4. Some examples of images from the Jochen Triesch hand posture database.

In the experiments, we not only used the images against the three backgrounds separately but also mixed all the images together for 10-fold cross-validation tests. Compared with those against the uniform backgrounds, the images against the complex background contain significant background clutter, which leads to a relatively difficult classification task in the independent experiments. Additionally, the classification of the mixed images is also a quite challenging task because of the diverse intra- and inter-class variations. Here we use a bag-of-words model [27] to represent each image, as sketched below. Each image is first represented by a collection of 128-dimensional vectors using the scale-invariant feature transform (SIFT) [28,29]. Then, the l codewords of a codebook are obtained by performing k-means clustering over all the training SIFT descriptors, and the resulting codewords are histogrammed over each image to generate an l-dimensional descriptor vector as the input feature. According to the maximum number of SIFT descriptors in a single image, we set l to 70, 80 and 270 for the images against the uniform light, uniform dark and complex backgrounds in the independent experiments, respectively, and to 270 for all the images in the mixed experiment. The flexible factors are set to 1.0471, 1.1482, 1.5849 and 1.0715 in the uniform light, uniform dark, complex and mixed experiments, respectively.
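The bag-of-words feature construction described above can be sketched roughly as follows; this assumes OpenCV with SIFT available (cv2.SIFT_create) and scikit-learn's KMeans, and is only an illustration of the pipeline, not the authors' exact implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(gray_images):
    """128-D SIFT descriptors for every image (list of HxW uint8 arrays)."""
    sift = cv2.SIFT_create()
    all_desc = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        all_desc.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return all_desc

def bow_histograms(desc_per_image, n_codewords):
    """k-means codebook over all descriptors, then an l-dimensional
    codeword histogram per image (the input feature)."""
    codebook = KMeans(n_clusters=n_codewords, n_init=10, random_state=0)
    codebook.fit(np.vstack([d for d in desc_per_image if len(d) > 0]))
    feats = []
    for d in desc_per_image:
        ids = codebook.predict(d) if len(d) > 0 else np.array([], dtype=int)
        hist, _ = np.histogram(ids, bins=np.arange(n_codewords + 1))
        feats.append(hist.astype(float))
    return np.array(feats)
```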

For the 10-fold cross-validation tests, the results are reported as the average classification accuracies over 20 iterations along with standard deviations. The experimental results are shown in Table 2. In all cases the proposed MMC-FCH achieves the best results, significantly outperforming MMC-AH and MMC-CH in the relatively challenging experiment against the complex background. Between the remaining two methods, MMC-AH outperforms MMC-CH in two of the four experiments, while MMC-CH outperforms MMC-AH in the other two.

A single 10-fold cross-validation test defines 10 random partitions of the data. For each data partition, we can compute the F-measure value for each hand posture; the average over the 10 partitions is taken as the F-measure value for that single 10-fold cross-validation test, and the final F-measure value over the repeated tests is reported as the average over 20 iterations for each hand posture. Tables 3–6 summarize the experimental results against the different backgrounds. As expected, the proposed MMC-FCH brings encouraging results. For example, in the case of the complex background, MMC-FCH gives F-measure values larger than 99% for hand postures “c”, “d”, “g”, “h”, “l”, “v” and “y”; MMC-AH does so only for “d”, “v” and “y”; and MMC-CH does so only for “v” and “y”. Both MMC-FCH and MMC-AH significantly outperform MMC-CH on hand postures “d”, “g”, “h”, “i” and “l”; furthermore, on these hand postures except “d”, MMC-FCH even significantly outperforms MMC-AH. In addition, MMC-FCH and MMC-CH significantly outperform MMC-AH on hand posture “b”. For the other hand postures, the three methods yield comparable results.

Table 2. Classification accuracies (%) for 10-fold cross-validation tests on the Jochen Triesch hand posture database.

Background   MMC-CH       MMC-AH       MMC-FCH
Light        96.9 ± 0.4   96.8 ± 0.7   97.4 ± 0.7
Dark         96.6 ± 0.7   96.9 ± 0.6   97.0 ± 0.7
Complex      97.0 ± 0.5   97.9 ± 0.4   98.7 ± 0.4
Mixed        96.7 ± 0.4   96.4 ± 0.5   96.9 ± 0.4

Table 3. F-measure values (%) for each hand posture against the uniform light background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    98.1 ± 0.2  94.7 ± 1.2  97.9 ± 1.2  97.8 ± 2.1  94.3 ± 2.0  96.2 ± 1.7  98.4 ± 1.0  96.1 ± 1.8  94.2 ± 1.6  98.4 ± 0.8
MMC-AH    97.5 ± 1.9  94.9 ± 2.0  97.7 ± 1.5  96.3 ± 2.1  95.6 ± 1.9  95.3 ± 2.1  98.1 ± 0.9  97.4 ± 1.3  94.2 ± 1.7  98.4 ± 0.5
MMC-FCH   99.7 ± 0.7  96.9 ± 1.0  97.9 ± 1.2  98.1 ± 2.7  95.8 ± 2.5  96.3 ± 1.7  98.4 ± 1.0  96.0 ± 1.9  94.1 ± 1.7  98.4 ± 0.8

Table 4. F-measure values (%) for each hand posture against the uniform dark background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    95.8 ± 1.2  94.1 ± 1.8  97.0 ± 1.7  96.9 ± 1.9  97.6 ± 1.6  96.6 ± 2.1  97.3 ± 1.7  96.0 ± 1.8  94.6 ± 2.1  98.3 ± 1.7
MMC-AH    97.2 ± 0.7  95.3 ± 1.2  97.6 ± 1.1  97.9 ± 1.7  97.2 ± 0.9  96.2 ± 1.7  97.2 ± 1.7  95.9 ± 1.7  94.6 ± 2.3  98.3 ± 1.7
MMC-FCH   97.2 ± 1.2  95.4 ± 1.4  96.6 ± 1.6  96.8 ± 2.5  97.8 ± 1.2  96.7 ± 1.9  97.8 ± 1.7  96.2 ± 1.8  94.8 ± 2.1  98.4 ± 1.5

Table 5. F-measure values (%) for each hand posture against the complex background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    96.0 ± 1.3  95.7 ± 1.7  99.0 ± 1.6  97.2 ± 1.7  94.0 ± 1.4  93.2 ± 1.5  95.5 ± 1.3  98.0 ± 0.9  99.3 ± 1.3  99.5 ± 1.0
MMC-AH    95.9 ± 1.3  94.4 ± 1.7  98.7 ± 1.3  99.5 ± 1.0  97.3 ± 0.9  97.1 ± 1.4  97.4 ± 2.1  98.6 ± 1.3  99.3 ± 1.3  99.5 ± 1.0
MMC-FCH   95.9 ± 1.3  95.4 ± 1.3  99.3 ± 1.4  99.5 ± 1.0  99.8 ± 0.5  99.0 ± 1.4  98.7 ± 1.4  99.4 ± 1.1  99.3 ± 1.3  99.5 ± 1.0

Table 6. F-measure values (%) for each hand posture against the mixed backgrounds.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    97.5 ± 0.7  95.9 ± 0.8  96.9 ± 0.9  96.0 ± 1.1  95.5 ± 1.2  96.7 ± 0.9  97.2 ± 0.8  96.7 ± 0.8  96.3 ± 1.0  98.2 ± 0.8
MMC-AH    96.4 ± 1.1  95.1 ± 1.0  96.6 ± 0.6  95.9 ± 1.0  96.7 ± 1.0  96.2 ± 0.9  95.4 ± 1.0  96.9 ± 0.8  96.6 ± 0.7  98.2 ± 0.9
MMC-FCH   97.9 ± 0.9  96.2 ± 1.0  96.4 ± 0.6  95.5 ± 0.8  95.5 ± 1.1  96.7 ± 0.8  97.1 ± 0.8  97.1 ± 0.7  97.3 ± 1.0  98.8 ± 0.8

4.4. MNIST handwritten digits database

The MNIST handwritten digits database contains 70,000 grayscale images of size 28 × 28 of the handwritten digits 0 to 9, with 60,000 reserved for training and the remaining 10,000 for testing. Some example images are shown in Fig. 5. Raw pixel values were used as features without any preprocessing or feature extraction. In this experiment, we used only the first 10,000 samples of the original training set as training samples, owing to the limit of computer memory. The flexible factor is set to 1.3804.

Fig. 5. Some examples of images from the MNIST handwritten digits database.

Table 7. Classification accuracies (%) on the MNIST handwritten digits database.

MMC-CH   MMC-AH   MMC-FCH
97.0     97.2     97.2

Table 8. F-measure values (%) for each digit.

Method    "0"    "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"
MMC-CH    98.1   98.8   96.6   96.7   96.9   96.6   97.5   96.2   96.6   95.8
MMC-AH    98.1   98.8   97.0   96.8   97.4   97.2   97.5   96.8   96.4   96.1
MMC-FCH   98.2   98.9   96.9   96.7   97.1   96.8   97.7   96.7   96.6   96.1

Table 7 summarizes the classification accuracies. MMC-FCH and MMC-AH give comparable results, and both are slightly better than MMC-CH in terms of classification accuracy. The F-measure values for each digit are given in Table 8. From these results we find that all three methods give their best classification performance on digit "1". Even though the best performance for each digit is alternately given by MMC-FCH and MMC-AH, MMC-FCH always achieves performance better than or comparable to MMC-CH, while MMC-AH is outperformed by MMC-CH on digit "8".

Considering the real-time efficiency (testing time), however, we find that MMC-CH performs better on this database. For the three classification methods, the real-time performance depends mainly on the number of support vectors (i.e., training samples whose corresponding coefficients are non-zero) of every binary classifier. For this database there are 45 binary classifiers in total, corresponding to the ten-class classification problem. Fig. 6 shows the number of support vectors of every binary classifier. The numbers of support vectors produced by MMC-FCH and MMC-AH are the same for each binary classifier, and are much larger than those produced by MMC-CH. Although the optimal separating hyperplane generated by MMC-FCH is determined only by a few vertices or boundary points of the flexible convex hulls, these key points still have to be expressed as a linear combination of all the original training points with non-zero coefficients when decision function values are computed. Specifically, a vertex x'_i of the flexible convex hull that determines the separating hyperplane is necessarily written as x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i, where x_i is itself necessarily a vertex of the convex hull of the training set, according to the geometric interpretation of the flexible convex hull. As for MMC-AH, the separability of affine hulls requires that all the training sample points contribute to the affine hull models [15].

Fig. 6. Number of support vectors of every binary classifier constructed for the MNIST handwritten digits database. [Figure: support vector counts over the 45 binary classifier indices for the three methods.]

However, we find that the difference between the testing times of MMC-CH and MMC-FCH is not so significant in high-dimensional spaces. This is due to the fact that most of the training samples become support vectors in such cases. For example, Fig. 7 illustrates the number of support vectors of every binary classifier constructed by MMC-CH for the Yale face database, with different numbers of training samples per class; MMC-CH returns all the training samples as support vectors in all cases.

Fig. 7. Number of support vectors of every binary classifier constructed by MMC-CH for the Yale face database. [Figure: support vector counts over the 105 binary classifier indices, for 2, 4, 6, 8 and 10 training samples per class.]
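The support-vector count used in this comparison is simply the number of training samples whose optimal coefficient is non-zero; a one-line sketch (the tolerance value is arbitrary and ours):

```python
import numpy as np

def count_support_vectors(alpha, tol=1e-8):
    """Number of training samples with non-zero optimal coefficients."""
    return int(np.sum(np.abs(alpha) > tol))
```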

5. Conclusions

We defined a flexible convex hull as an alternative class region approximation to the convex hull and the affine hull. By introducing a flexible factor, the flexible convex hull is by construction an approximation model looser than the convex hull but tighter than the affine hull, and it captures the best aspects of both; our experimental results verify this. We proposed a maximum margin classification based on flexible convex hulls and showed how to construct such a classification model given two flexible convex hulls. The proposed method can also be kernelized by using the kernel trick and extended to multi-class classification by constructing binary pairwise classifiers. Experiments on several databases show that MMC-FCH achieves encouraging results compared to MMC-CH and MMC-AH. These results do not indicate the absolute superiority of MMC-FCH in every aspect, but they do provide useful insights into the potential of the proposed method.


Acknowledgments


This research is supported by the National Natural Science Foundation of China (Grant nos. 51175158 and 51375152) and the Hunan Provincial Innovation Foundation for Postgraduate (Grant no. CX2014B146). We also greatly appreciate the database owners' authorization to use the free databases for this research.


Appendix

Proposition. The flexible convex hull (3) is equivalent to the convex hull (5).

Proof. On the one hand, any x \in \mathrm{flex}(X) can be written as

x = \sum_{i=1}^{n} \alpha_i x_i \quad (16)

Let \alpha_i = \frac{1-\lambda}{n} + \lambda\beta_i; since \alpha_i \in \left[\frac{1-\lambda}{n}, \frac{1-\lambda}{n}+\lambda\right], we know that \beta_i \in [0, 1]. Thus Eq. (16) can be further written as

x = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right)x_i = (1-\lambda)\frac{1}{n}\sum_{i=1}^{n} x_i + \lambda\sum_{i=1}^{n}\beta_i x_i \quad (17)

At the same time, we have

\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right) = 1 - \lambda + \lambda\sum_{i=1}^{n}\beta_i \quad (18)

Combining Eq. (18) with the constraint \sum_{i=1}^{n}\alpha_i = 1, we obtain

\sum_{i=1}^{n}\beta_i = 1 \quad (19)

Substituting Eq. (19) into Eq. (17),

x = (1-\lambda)\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n}\beta_i\right) + \lambda\sum_{i=1}^{n}\beta_i x_i = \sum_{i=1}^{n}\beta_i\left((1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i\right) \quad (20)

Let x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i; then

x = \sum_{i=1}^{n}\beta_i x'_i \quad (21)

Thus we have proved that x \in \mathrm{flex}(X) \Rightarrow x \in \mathrm{conv}(X').

On the other hand, any x \in \mathrm{conv}(X') can be written as

x = \sum_{i=1}^{n}\beta_i x'_i = \sum_{i=1}^{n}\beta_i\left((1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i\right) = (1-\lambda)\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n}\beta_i\right) + \lambda\sum_{i=1}^{n}\beta_i x_i \quad (22)

Substituting \sum_{i=1}^{n}\beta_i = 1 into Eq. (22),

x = (1-\lambda)\frac{1}{n}\sum_{i=1}^{n} x_i + \lambda\sum_{i=1}^{n}\beta_i x_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right)x_i \quad (23)

Let \alpha_i = \frac{1-\lambda}{n} + \lambda\beta_i; since \beta_i \in [0, 1], we know that \alpha_i \in \left[\frac{1-\lambda}{n}, \frac{1-\lambda}{n}+\lambda\right]. Thus Eq. (23) can be further written as

x = \sum_{i=1}^{n}\alpha_i x_i \quad (24)

Moreover, we have

\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right) = 1 - \lambda + \lambda\sum_{i=1}^{n}\beta_i \quad (25)

Substituting \sum_{i=1}^{n}\beta_i = 1 into Eq. (25), we obtain

\sum_{i=1}^{n}\alpha_i = 1 \quad (26)

Thus we have proved that x \in \mathrm{conv}(X') \Rightarrow x \in \mathrm{flex}(X). In summary, x \in \mathrm{flex}(X) \Leftrightarrow x \in \mathrm{conv}(X'). □

References

[1] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[2] A. Sanchez, V. David, Advanced support vector machines and kernel methods, Neurocomputing 55 (1) (2003) 5–20.
[3] J. Dong, A. Krzyżak, C.Y. Suen, An improved handwritten Chinese character recognition system using support vector machine, Pattern Recognit. Lett. 26 (12) (2005) 1849–1856.
[4] M.A. Kumar, M. Gopal, A comparison study on multiple binary-class SVM methods for unilabel text categorization, Pattern Recognit. Lett. 31 (11) (2010) 1437–1444.
[5] V. Mitra, C.J. Wang, S. Banerjee, Text classification: a least square support vector machine approach, Appl. Soft Comput. 7 (3) (2007) 908–914.
[6] S. Hua, Z. Sun, A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach, J. Mol. Biol. 308 (2) (2001) 397–407.
[7] M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Expert Syst. Appl. 36 (2) (2009) 3240–3247.
[8] Y. Yang, D. Yu, J. Cheng, A fault diagnosis approach for roller bearing based on IMF envelope spectrum and SVM, Measurement 40 (9) (2007) 943–950.
[9] S. Abbasion, A. Rafsanjani, A. Farshidianfar, et al., Rolling element bearings multi-fault classification based on the wavelet denoising and support vector machine, Mech. Syst. Signal Process. 21 (7) (2007) 2933–2945.
[10] Y. Ji, S. Sun, Y. Lu, Multitask multiclass privileged information support vector machines, in: Proceedings of the IEEE 21st International Conference on Pattern Recognition (ICPR), 2012.
[11] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, J. Mach. Learn. Res. 11 (2010) 2423–2455.
[12] Z. Wang, S. Yan, C. Zhang, Active learning with adaptive regularization, Pattern Recognit. 44 (10) (2011) 2375–2383.
[13] K.P. Bennett, E.J. Bredensteiner, Duality and geometry in SVM classifiers, in: P. Langley (Ed.), Proceedings of the International Conference on Machine Learning, 2000.
[14] X. Zhou, W. Jiang, Y. Tian, et al., A new kernel-based classification algorithm, in: W. Wang, H. Kargupta, S. Ranka, P.S. Yu, X. Wu (Eds.), Proceedings of the Ninth IEEE International Conference on Data Mining, 2009.
[15] H. Cevikalp, B. Triggs, H.S. Yavuz, et al., Large margin classifiers based on affine hulls, Neurocomputing 73 (16) (2010) 3160–3168.
[16] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognit. Lett. 20 (11) (1999) 1191–1199.
[17] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[18] D. Lee, J. Lee, Domain described support vector classifier for multi-classification problems, Pattern Recognit. 40 (1) (2007) 41–51.
[19] X.K. Wei, G.B. Huang, Y.H. Li, Mahalanobis ellipsoidal learning machine for one class classification, in: Proceedings of the IEEE International Conference on Machine Learning and Cybernetics, 2007.
[20] S. Knerr, L. Personnaz, G. Dreyfus, Single-layer learning revisited: a stepwise procedure for building and training a neural network, in: F. Fogelman Soulié, J. Hérault (Eds.), Neurocomputing: Algorithms, Architectures and Applications, Springer-Verlag, 1990, pp. 41–50.
[21] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[22] B. Fei, J. Liu, Binary tree of SVM: a new fast multiclass training and classification algorithm, IEEE Trans. Neural Netw. 17 (3) (2006) 696–704.
[23] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: S.A. Solla, T.K. Leen, K. Müller (Eds.), Advances in Neural Information Processing Systems, 2000.
[24] Yale Face Database, 〈http://cvc.yale.edu/projects/yalefaces/yalefaces.html〉.
[25] J. Triesch, C. Von Der Malsburg, Robust classification of hand postures against complex backgrounds, in: Proceedings of the Second IEEE International Conference on Automatic Face and Gesture Recognition, 1996.
[26] Y. LeCun, L. Bottou, Y. Bengio, et al., Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[27] F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[28] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
[29] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.


Ming Zeng received the B.S. degree from the School of Electro-mechanical Engineering, Guangdong University of Technology, Guangzhou, China, in 2010. He is currently working toward the Ph.D. degree at Hunan University, Changsha, China. His main research interests include pattern recognition and machinery fault diagnosis.


Yu Yang received the B.S. degree, the M.S. and Ph.D. degrees in mechanical engineering from the College of Mechanical and Vehicle Engineering, Hunan University, Changsha, PR China, in 1994, 1997 and 2005, respectively. Her research interests include pattern recognition, digital signal processing and machine fault diagnosis.

Jinde Zheng received the B.S. degree in Mathematics from Anhui Normal University, Wuhu, China, in 2009. He is currently working toward the Ph.D. degree at Hunan University, Changsha, China. His main research interests include dynamic signal processing, time–frequency analysis and machinery fault diagnosis.


Junsheng Cheng received the Ph.D. degree in manufacturing engineering and automation from Hunan University in 2005. He is currently a professor in the College of Mechanical and Vehicle Engineering, Hunan University. His main research interests include mechanical fault diagnosis, dynamic signal processing, and vibration and noise control.