Maximum margin classification based on flexible convex hulls

Neurocomputing 149 (2015) 957–965
Ming Zeng, Yu Yang*, Jinde Zheng, Junsheng Cheng
State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha 410082, PR China
* Corresponding author. E-mail address: [email protected] (Y. Yang).


Abstract

Article history: Received 16 January 2014; received in revised form 14 May 2014; accepted 14 July 2014; available online 2 August 2014. Communicated by Shiliang Sun.
http://dx.doi.org/10.1016/j.neucom.2014.07.038

Based on the definition of a flexible convex hull, a maximum margin classification based on flexible convex hulls (MMC-FCH) is presented in this work. The flexible convex hull defined in our work is a class region approximation looser than a convex hull but tighter than an affine hull. MMC-FCH approximates each class region with a flexible convex hull of its training samples, and then finds a linear separating hyperplane that maximizes the margin between flexible convex hulls by solving a closest pair of points problem. The method can be extended to the nonlinear case by using the kernel trick, and multi-class classification problems are dealt with by constructing binary pairwise classifiers as in the support vector machine (SVM). Experiments on several databases show that the proposed method compares favorably to the maximum margin classification based on convex hulls (MMC-CH) or affine hulls (MMC-AH). © 2014 Elsevier B.V. All rights reserved.

Keywords: Flexible convex hull; Maximum margin classification; Kernel method; Convex hull; Affine hull; Support vector machine

1. Introduction

Over recent years, as a robust methodology for classification, the support vector machine (SVM) [1] has been successfully used in a wide variety of applications including computer vision [2,3], text categorization [4,5], bioinformatics [6,7] and fault diagnosis [8,9]. Moreover, some fruitful methods combining SVM with other learning strategies have also been proposed and bring performance improvements to SVM. Ji et al. [10] proposed a new learning paradigm named multitask multiclass privileged information support vector machine, which takes full advantage of multitask learning and privileged information. Sun and Shawe-Taylor [11] presented a general framework for sparse semi-supervised learning and subsequently proposed a sparse multi-view support vector machine. Wang et al. [12] introduced a novel active learning support vector machine algorithm with adaptive model selection, which traces the full solution path of the base classifier before each new query and then performs efficient model selection using the unlabeled samples.

The basic idea of SVM is to construct a separating hyperplane that maximizes the geometric margin, defined as the distance between the separating hyperplane and the closest samples from the two sample sets. From a geometrical point of view, in the linearly separable case the SVM optimization problem of finding the maximum margin between two sample sets is equivalent to finding the closest pair of points on the respective convex hulls of the two sample sets, and the optimal separating hyperplane is the one that perpendicularly bisects the line segment connecting the closest pair of points [13]. That is to say, SVM can be regarded as a maximum margin classification based on convex hulls (MMC-CH), which in effect approximates each class region with a convex hull, and the two closest points on the convex hulls determine the hyperplane separating them. In practice, the samples we can obtain are always finite, but the class region that generated them actually contains infinitely many points. SVM approximates the class region with the convex hull of the samples rather than with the isolated samples themselves, so the finite sample set is effectively extended to an infinite one.

Other analogous geometric models have also been proposed for approximating class regions, such as affine hulls, hyperspheres and hyperellipsoids. In the spirit of the geometric interpretation of SVM, Zhou et al. [14] and Cevikalp et al. [15] independently presented a maximum margin classification based on affine hulls (MMC-AH). In contrast to SVM, this method approximates class regions with the affine hulls of their sample sets rather than with convex hulls, and the separating hyperplane is then determined by solving the closest pair of points problem. Tax and Duin [16,17] introduced a method called support vector data description (SVDD) for novelty or outlier detection. SVDD tries to find the smallest hypersphere containing almost all sample points to approximate the class region, and classifies an unknown sample according to the Euclidean distance from this sample to the center of the hypersphere. The hypersphere approximation has also been extended to multi-class classification problems.

Lee and Lee [18] employed a hypersphere to approximate the class region of each sample set and then used these approximations to classify an unknown sample via a Bayesian decision rule. Wei et al. [19] proposed another one-class classification method, which replaces the hypersphere with a hyperellipsoid for the class region approximation in SVDD and accordingly uses the Mahalanobis distance rather than the Euclidean distance in the decision rule. Whether for one-class or multi-class classification problems, these geometric-model-based classification methods have one thing in common: a geometric model (convex hull, affine hull, hypersphere, hyperellipsoid, etc.) is used to approximate the class region of each sample set, and the classification model is then built on top of a certain decision rule. These methods therefore suggest a useful idea for classification: one can start from geometric approximation models of the class regions.

The convex hull is the smallest convex set containing the given finite samples, so it approximates the class region very tightly. Generally speaking, the natural class region almost always extends beyond the convex hull of its finite samples, especially in high-dimensional spaces. In this sense, the unrealistically tight convex hull is typically a substantial under-approximation of the class region. As opposed to the convex hull, the affine hull gives a rather loose approximation, because it does not constrain the positions of the sample points within the affine subspace. Besides, linear separability of the samples does not necessarily guarantee the separability of the corresponding affine hulls [15]; it is more restrictive from this point of view, since affine hulls extend to infinity in every direction and therefore avoid overlapping only if they are parallel to each other. The hypersphere model is particularly suitable for samples with a spherical distribution and a high degree of clustering; otherwise, a large empty space appears inside the hypersphere. In this case, the hypersphere also gives a relatively loose approximation of the class region, and samples from other classes, or outliers generated by noise and interference, that fall into this empty space are likely to be misclassified to the approximated class. Compared with the hypersphere, the hyperellipsoid model takes the distribution of the samples into consideration and thus approximates the class region more tightly. However, in the case of high-dimensional small samples the covariance matrix tends to be singular, which creates algorithmic obstacles for classification methods based on hyperellipsoids in high-dimensional feature spaces. Although techniques have been developed to improve the estimation of the covariance matrix, such methods still require many more parameters in order to represent it. In a word, it is still worthwhile to explore other appropriate geometric models for approximating the class regions of samples in pattern recognition research.

Inspired by convex hulls and affine hulls, in this study we define a new geometric model, called a flexible convex hull, for the class region approximation and propose a novel classification method, maximum margin classification based on flexible convex hulls (MMC-FCH). The basic goal of MMC-FCH is to find an optimal linear separating hyperplane that yields the maximum margin between the flexible convex hulls of the sample sets. This can be solved by computing a closest pair of points, and the optimal separating hyperplane is then chosen to be the one that perpendicularly bisects the line segment connecting the closest pair of points. MMC-FCH can also be extended to the nonlinear case by using the kernel trick. To use the proposed method in multi-class classification problems, we can adopt most of the common strategies developed for extending binary SVM classifiers to the multi-class case.

The remaining part of the paper is organized as follows. Section 2 gives the definition of a flexible convex hull. Section 3 introduces the proposed method, MMC-FCH. Section 4 presents our experimental results and Section 5 concludes the paper.

2. Definition of a flexible convex hull

2.1. Motivation

As mentioned above, SVM can be regarded as a maximum margin classification based on convex hulls (MMC-CH), which first approximates each class with the convex hull of its training samples and then finds a hyperplane that maximizes the margin between the two convex hulls. The convex hull of a sample set is the set of linear combinations of the sample points in which all coefficients are non-negative and sum to one. Consider a finite sample set X = \{x_i\}\ (i = 1, 2, \ldots, n), x_i \in \mathbb{R}^p. The convex hull of X can be written as

\mathrm{conv}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1,\; 0 \le \alpha_i \le 1 \right\} \quad (1)

where \alpha_i is the combination coefficient of the ith sample point x_i. The convex hull model is the tightest possible convex approximation of the class region, and for classes with more general convex forms it is typically a substantial under-approximation. In contrast to convex hull models, maximum margin classification based on affine hulls (MMC-AH) approximates each class with an affine hull [15]. The affine hull of a sample set is the set of linear combinations of the sample points whose coefficients add up to one, without non-negativity constraints. The affine hull of X can thus be written as

\mathrm{aff}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1 \right\} \quad (2)

where \alpha_i is the combination coefficient of the ith sample point x_i. Even though the affine hull is an unbounded and hence typically rather loose model of the class region compared with the convex hull approximation, MMC-AH works surprisingly better than SVM (or MMC-CH), especially in high-dimensional spaces with a limited number of samples [15]. This is one indication that convex-hull-based methods may be too tight to be realistic. Nevertheless, owing to the unboundedness of affine hulls, the separability of affine hulls requires them to be parallel to each other. If different sample sets have similar or intersecting affine hulls but very different distributions of samples within their affine hulls, MMC-AH fails to separate the sample sets. Therefore, it seems more reasonable to tighten the affine hull model.

2.2. Flexible convex hulls

Motivated by convex hulls and affine hulls, we define a new geometric model, called a flexible convex hull, for the class region approximation. Similar to the convex hull and the affine hull, a flexible convex hull can also be expressed as the set of linear combinations of the sample points whose coefficients add up to one, but it imposes different lower and upper bounds on the coefficients. More formally, the flexible convex hull of the sample set X is defined as

\mathrm{flex}(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1,\; \frac{1-\lambda}{n} \le \alpha_i \le \frac{1-\lambda}{n} + \lambda \right\} \quad (3)

where \alpha_i is the combination coefficient of the ith sample point x_i and \lambda \in [1, +\infty) is the flexible factor.

The flexible convex hull and the flexible factor both have explicit geometric interpretations. For a given \lambda, an arbitrary sample point x_i \in X extends along the radial direction from the set center \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i through x_i, and the corresponding extended point can be written as x'_i = (1-\lambda)\bar{x} + \lambda x_i. The convex hull of the new sample set X' = \{x'_i\}\ (i = 1, 2, \ldots, n) is then exactly the flexible convex hull of the original sample set X; moreover, the flexible factor \lambda is exactly the ratio of the distance between the extended point x'_i and the set center \bar{x} to the distance between the original point x_i and the set center \bar{x}. In other words, we have the following proposition.

Proposition. The flexible convex hull (3) of the sample set X = \{x_i\}\ (i = 1, 2, \ldots, n) is equivalent to the convex hull of the new sample set

X' = \left\{ x'_i \;\middle|\; x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i,\; i = 1, 2, \ldots, n \right\} \quad (4)

i.e.,

\mathrm{conv}(X') = \left\{ \sum_{i=1}^{n} \beta_i x'_i \;\middle|\; x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i,\; \sum_{i=1}^{n} \beta_i = 1,\; 0 \le \beta_i \le 1 \right\} \quad (5)

where \beta_i is the combination coefficient of the ith extended point x'_i and \lambda \in [1, +\infty) is the flexible factor. The complete proof is given in the Appendix.

Note that the lower and upper bounds of the coefficients vary with the flexible factor. If \lambda = 1, then \frac{1-\lambda}{n} = 0 and \frac{1-\lambda}{n} + \lambda = 1, and the flexible convex hull reduces to the convex hull; if 1 < \lambda < +\infty, then \frac{1-\lambda}{n} < 0 and \frac{1-\lambda}{n} + \lambda > 1, and we obtain the flexible convex hull with respect to \lambda; if \lambda \to +\infty, then \frac{1-\lambda}{n} \to -\infty and \frac{1-\lambda}{n} + \lambda \to +\infty, and the flexible convex hull expands to the affine hull. In this sense, the convex hull and the affine hull can be considered limiting cases of the flexible convex hull, and the flexible convex hull is an approximation model looser than the convex hull but tighter than the affine hull. The introduction of the flexible factor \lambda, just as its name implies, makes the newly defined geometric model more flexible. For an appropriate choice of \lambda, it is entirely possible for the flexible convex hull to give a desirable approximation of the class region.
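To make the proposition concrete, the following minimal NumPy sketch builds the extended sample set X' of Eq. (4) for a given flexible factor; by the proposition, the convex hull of these extended points is exactly flex(X). The function name and the toy data are illustrative only and not part of the original work.

```python
import numpy as np

def extended_points(X, lam):
    """Extended sample set X' of Eq. (4):
    x'_i = (1 - lam) * mean(X) + lam * x_i.
    For lam = 1 the points are unchanged (convex hull);
    larger lam pushes every point away from the class center."""
    center = X.mean(axis=0)                 # (1/n) * sum_j x_j
    return (1.0 - lam) * center + lam * X   # broadcast over rows

# toy example: five 2-D samples, flexible factor 1.5
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
print(extended_points(X, lam=1.5))
```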

3. MMC-FCH

3.1. Linearly separable case

Having defined the flexible convex hull, we propose a maximum margin classification based on flexible convex hulls (MMC-FCH). The basic goal of MMC-FCH is to find an optimal linear separating hyperplane that yields the maximum margin between the flexible convex hulls of the sample sets. All points x lying on the separating hyperplane satisfy \langle w, x \rangle + b = 0, where w and b are the normal and bias of the separating hyperplane, respectively. For the optimal separating hyperplane, all points x in the positive sample set satisfy \langle w, x \rangle + b > 0 and all points x in the negative sample set satisfy \langle w, x \rangle + b < 0. Finding the best separating hyperplane maximizing the margin between the flexible convex hulls can be accomplished by computing the closest pair of points on them; the optimal separating hyperplane is then the one that perpendicularly bisects the line segment connecting the closest pair of points, as in SVM. Once the optimal separating hyperplane is determined, an unknown sample x is classified by the decision function f(x) = \mathrm{sign}(\langle w, x \rangle + b).

Consider a binary classification problem with the positive and negative training sets given in the form X_+ = \{x_{+i}\}\ (i = 1, 2, \ldots, n_+) and X_- = \{x_{-i}\}\ (i = 1, 2, \ldots, n_-). According to formula (3), the flexible convex hulls of the positive and negative sample sets can be written as

\mathrm{flex}(X_+) = \left\{ \sum_{i=1}^{n_+} \alpha_{+i} x_{+i} \;\middle|\; \sum_{i=1}^{n_+} \alpha_{+i} = 1,\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+ \right\} \quad (6)

\mathrm{flex}(X_-) = \left\{ \sum_{i=1}^{n_-} \alpha_{-i} x_{-i} \;\middle|\; \sum_{i=1}^{n_-} \alpha_{-i} = 1,\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_- \right\} \quad (7)

where \lambda_+ and \lambda_- are the flexible factors of the positive and negative flexible convex hulls, respectively. In the linearly separable case, finding the closest pair of points on the flexible convex hulls of the sample sets can be written as the following optimization problem:

\min_{\alpha_+, \alpha_-} \; \frac{1}{2} \left\| \sum_{i=1}^{n_+} \alpha_{+i} x_{+i} - \sum_{i=1}^{n_-} \alpha_{-i} x_{-i} \right\|^2
\text{s.t.} \;\; \sum_{i=1}^{n_+} \alpha_{+i} = 1, \;\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+, \; i = 1, 2, \ldots, n_+
\;\;\;\;\; \sum_{i=1}^{n_-} \alpha_{-i} = 1, \;\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_-, \; i = 1, 2, \ldots, n_- \quad (8)

Expanding the objective function, the optimization problem can be written as

\min_{\alpha_+, \alpha_-} \; \frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha_{+i}\alpha_{+j} \langle x_{+i}, x_{+j} \rangle + \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha_{-i}\alpha_{-j} \langle x_{-i}, x_{-j} \rangle - 2 \sum_{i=1}^{n_+}\sum_{j=1}^{n_-} \alpha_{+i}\alpha_{-j} \langle x_{+i}, x_{-j} \rangle \right)
\text{s.t.} \;\; \sum_{i=1}^{n_+} \alpha_{+i} = 1, \;\; \frac{1-\lambda_+}{n_+} \le \alpha_{+i} \le \frac{1-\lambda_+}{n_+} + \lambda_+, \; i = 1, 2, \ldots, n_+
\;\;\;\;\; \sum_{i=1}^{n_-} \alpha_{-i} = 1, \;\; \frac{1-\lambda_-}{n_-} \le \alpha_{-i} \le \frac{1-\lambda_-}{n_-} + \lambda_-, \; i = 1, 2, \ldots, n_- \quad (9)

This is a quadratic programming problem that can be solved using standard optimization algorithms. From the geometric properties of flexible convex hulls, we know that the closest points always lie on the vertices or boundaries of the flexible convex hulls; in other words, the optimal separating hyperplane depends only on a few vertices or boundary points of the flexible convex hulls. Fig. 1 shows a two-dimensional case, in which the areas enclosed by dashes and by dots represent the flexible convex hulls and the convex hulls, respectively. The affine hulls cannot be shown in this figure, since they completely overlap each other in the two-dimensional case.

Fig. 1. The closest pair of points determines the optimal separating hyperplane. [Figure: positive and negative samples with their convex and flexible convex hulls, the closest pairs of points #1 and #2, and the corresponding separating hyperplanes #1 and #2.]

Given the optimal solution (\alpha^*_{+1}, \alpha^*_{+2}, \ldots, \alpha^*_{+n_+}, \alpha^*_{-1}, \alpha^*_{-2}, \ldots, \alpha^*_{-n_-})^T, let x^*_+ and x^*_- denote the corresponding closest points on the positive and negative flexible convex hulls, respectively. The normal w and bias b of the optimal separating hyperplane can then be computed from the following equations:

w = x^*_+ - x^*_- = \sum_{i=1}^{n_+} \alpha^*_{+i} x_{+i} - \sum_{i=1}^{n_-} \alpha^*_{-i} x_{-i} \quad (10)


b = -\frac{1}{2} w^T (x^*_+ + x^*_-) = -\frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha^*_{+i}\alpha^*_{+j} \langle x_{+i}, x_{+j} \rangle - \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha^*_{-i}\alpha^*_{-j} \langle x_{-i}, x_{-j} \rangle \right) \quad (11)

Finally, the decision function can equivalently be expressed as

f(x) = \mathrm{sign}\left\{ \sum_{i=1}^{n_+} \alpha^*_{+i} \langle x_{+i}, x \rangle - \sum_{i=1}^{n_-} \alpha^*_{-i} \langle x_{-i}, x \rangle - \frac{1}{2} \left( \sum_{i=1}^{n_+}\sum_{j=1}^{n_+} \alpha^*_{+i}\alpha^*_{+j} \langle x_{+i}, x_{+j} \rangle - \sum_{i=1}^{n_-}\sum_{j=1}^{n_-} \alpha^*_{-i}\alpha^*_{-j} \langle x_{-i}, x_{-j} \rangle \right) \right\} \quad (12)
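As a rough illustration of the linear case, the sketch below solves the quadratic program (8) with a generic constrained optimizer and then recovers w, b and the decision function from Eqs. (10)–(12). This is a minimal sketch assuming SciPy's SLSQP solver is acceptable; it is not the solver used by the authors, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def mmc_fch_fit(Xp, Xn, lam_p=1.5, lam_n=1.5):
    """Linear MMC-FCH: closest pair of points between the two
    flexible convex hulls (Eq. (8)), then w, b from Eqs. (10)-(11)."""
    npos, nneg = len(Xp), len(Xn)

    def objective(a):
        ap, an = a[:npos], a[npos:]
        d = Xp.T @ ap - Xn.T @ an           # difference of the two hull points
        return 0.5 * d @ d

    cons = [{'type': 'eq', 'fun': lambda a: np.sum(a[:npos]) - 1.0},
            {'type': 'eq', 'fun': lambda a: np.sum(a[npos:]) - 1.0}]
    lo_p, hi_p = (1 - lam_p) / npos, (1 - lam_p) / npos + lam_p
    lo_n, hi_n = (1 - lam_n) / nneg, (1 - lam_n) / nneg + lam_n
    bounds = [(lo_p, hi_p)] * npos + [(lo_n, hi_n)] * nneg

    a0 = np.concatenate([np.full(npos, 1.0 / npos), np.full(nneg, 1.0 / nneg)])
    res = minimize(objective, a0, method='SLSQP', bounds=bounds, constraints=cons)
    ap, an = res.x[:npos], res.x[npos:]

    xp_star, xn_star = Xp.T @ ap, Xn.T @ an   # closest pair of points
    w = xp_star - xn_star                     # Eq. (10)
    b = -0.5 * w @ (xp_star + xn_star)        # Eq. (11)
    return w, b

def mmc_fch_predict(w, b, X):
    """Eq. (12) in the linear case."""
    return np.sign(X @ w + b)
```

A dedicated QP solver would be preferable for large training sets; the sketch only illustrates how the constraints and bounds of (8) translate into code.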

3.2. Linearly inseparable case

In the case of linearly inseparable flexible convex hulls, we can map the samples into a higher-dimensional space in which the flexible convex hulls constructed from the mapped samples become linearly separable, by using the kernel trick. Note that both the objective function of (9) and the decision function (12) can be written purely in terms of inner products among samples, which allows the use of the kernel trick, i.e., replacing the inner product \langle x_i, x_j \rangle with a kernel function k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle.

3.3. Extension to multi-class classification problems

MMC-FCH can be extended to multi-class classification problems by using most of the common strategies developed for extending binary SVM classifiers to the multi-class case. Here we only use the most popular strategy in our experiments: one-against-one [20]. For a c-class classification problem, one-against-one constructs all possible c(c-1)/2 binary pairwise classifiers out of the c classes and assigns an unknown sample to the class that wins the most pairwise decisions, as sketched below. Other strategies such as one-against-rest [21], binary decision trees [22] and directed acyclic graphs [23] can also be used.
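A minimal sketch of the one-against-one strategy described above; `fit_binary` and `predict_binary` are hypothetical placeholders standing for any binary classifier (for example, the MMC-FCH routines sketched earlier), not functions defined in the paper.

```python
import numpy as np
from itertools import combinations

def ovo_fit(X, y, fit_binary):
    """Train c(c-1)/2 pairwise classifiers; fit_binary(Xp, Xn) -> model."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        models[(a, b)] = fit_binary(X[y == a], X[y == b])
    return models

def ovo_predict(models, predict_binary, x):
    """Majority vote over all pairwise decisions for a single sample x."""
    votes = {}
    for (a, b), m in models.items():
        winner = a if predict_binary(m, x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```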

3.4. Choice of the flexible factor

Generally speaking, for a binary pairwise classifier each of the two flexible convex hulls is allowed to have its own flexible factor. In this preliminary study, we simply keep the two flexible factors equal, i.e., \lambda_+ = \lambda_- = \lambda; moreover, we also keep the flexible factors of all the pairwise classifiers equal. Different flexible factors correspond to different flexible convex hulls and thus generate different separating hyperplanes. Fig. 2 gives a two-dimensional example of finding a line separating simulated two-class data; as shown in the figure, the separating line varies with the flexible factor. As a result, the flexible factor directly affects the classification performance of MMC-FCH. Evidently, the flexible factor is so important that it has to be treated carefully. In the absence of prior knowledge about the sample distributions, however, we can hardly decide in advance which flexible factor is best for classification.

In the field of classification, k-fold cross-validation is usually used to evaluate or compare the performance of different classification methods. Generally, an appropriate flexible factor among the candidates is the one that gives the method the best classification performance, so we can choose the flexible factor by k-fold cross-validation. In k-fold cross-validation, the original sample set is randomly partitioned into k subsamples of roughly equal size and roughly the same class proportions. Of the k subsamples, a single subsample is retained as the validation data for testing the classification method, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are averaged to produce a single performance estimate for the classification method. In this preliminary work, we use 10-fold cross-validation to choose the flexible factor, as in the sketch below.

Fig. 2. Separating lines generated by MMC-FCH for simulated two-class data with various \lambda: 1.2, 1.4, 1.6, 1.8, 2.0. [Figure: two-class scatter plot (1st and 2nd dimensions) with the corresponding separating lines.]
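A minimal sketch of the flexible-factor selection just described, under the assumption that a generic train-and-evaluate routine is available; `train_and_score` is a hypothetical placeholder that fits the pairwise MMC-FCH classifiers with a given flexible factor and returns validation accuracy, and scikit-learn's StratifiedKFold is used only to create the ten label-balanced partitions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def choose_flexible_factor(X, y, candidate_lams, train_and_score, n_splits=10):
    """Pick the flexible factor with the best average k-fold CV accuracy.
    train_and_score(X_tr, y_tr, X_va, y_va, lam) -> validation accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_lam, best_acc = None, -np.inf
    for lam in candidate_lams:
        scores = [train_and_score(X[tr], y[tr], X[va], y[va], lam)
                  for tr, va in skf.split(X, y)]
        acc = float(np.mean(scores))
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc
```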

4. Experiments

In this section, we test the linear and kernelized versions of the proposed classification method, MMC-FCH, on several databases and compare them with those of MMC-CH and MMC-AH. The linear versions of the above methods are tested on the Yale face database [24], while the kernelized versions using Gaussian kernels are tested on the Jochen Triesch hand posture database [25] and the MNIST handwritten digits database [26]. Kernel parameters are specified by 10-fold cross-validation based on MMC-AH, and the three classification methods then share the same kernel parameter within the same experiment.

4.1. Performance evaluation

In the following experiments, we evaluate the classification performance not only over all classes but also for each class. To this end, the common overall accuracy (classification accuracy mentioned later means overall accuracy unless otherwise specified) and the F-measure are used, respectively. Overall accuracy is computed as the total number of correctly classified samples divided by the total number of samples. The F-measure for a single class is defined as the harmonic mean of precision and recall. Precision and recall for class i are defined as follows:

\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \quad (13)

\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \quad (14)

where TP_i (true positives) is the number of samples correctly classified to class i, FP_i (false positives) is the number of samples that do not belong to class i but are incorrectly classified to class i, and FN_i (false negatives) is the number of samples that actually belong to class i but are not classified to class i. The F-measure for class i is then defined as

F_i = \frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \quad (15)

The value of F_i varies from 0 to 1; a larger F_i implies that the classifier gives higher classification quality on class i.
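For completeness, a small sketch computing Eqs. (13)–(15) for every class from predicted and true labels; it assumes integer class labels 0..c-1 and is not part of the original paper.

```python
import numpy as np

def per_class_f_measure(y_true, y_pred, n_classes):
    """Precision, recall and F-measure (Eqs. (13)-(15)) for every class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    P, R, F = np.zeros(n_classes), np.zeros(n_classes), np.zeros(n_classes)
    for i in range(n_classes):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        P[i] = tp / (tp + fp) if tp + fp > 0 else 0.0
        R[i] = tp / (tp + fn) if tp + fn > 0 else 0.0
        F[i] = 2 * P[i] * R[i] / (P[i] + R[i]) if P[i] + R[i] > 0 else 0.0
    return P, R, F
```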

4.2. Yale face database

The Yale face database contains 165 grayscale images of 15 subjects. Each of the 15 subjects has 11 different images with variations in illumination (center light, left light and right light), facial expression (normal, happy, sad, sleepy, surprised and wink) and facial details (glasses and no glasses). The size of each image is 243 × 320 pixels, with 256 gray levels per pixel. In the experiments, each image from the Yale face database was cropped and resized to 64 × 64 pixels; Fig. 3 shows the processed images of one subject. Raw pixel values were used as features and each feature vector was normalized to unit length before use. Here the flexible factor determined by 10-fold cross-validation is set to 1.7378.

Fig. 3. Processed images of one subject from the Yale face database.

To assess the effect of the number of training samples per subject on classification performance, we randomly selected n = 2, 4, 6, 8, 10 images of each subject for training and the remaining 11 - n images for testing. This process was repeated 20 times independently, and the final classification accuracies were averaged over the 20 results. The experimental results are shown in Table 1.

Table 1. Classification accuracies (%) with different numbers of training samples per subject on the Yale face database.

Training samples   MMC-CH       MMC-AH       MMC-FCH
2                  82.9 ± 2.9   83.0 ± 2.9   83.0 ± 2.9
4                  92.4 ± 3.0   92.9 ± 2.7   92.9 ± 2.8
6                  94.7 ± 2.8   95.3 ± 3.1   95.3 ± 2.6
8                  96.6 ± 2.8   97.0 ± 1.9   97.3 ± 2.0
10                 99.3 ± 2.1   98.7 ± 2.7   99.3 ± 2.1

For the different numbers of training samples per subject, MMC-FCH and MMC-AH both outperform MMC-CH except in the case of 10 training samples. When 6 or fewer training samples per subject are used, MMC-FCH and MMC-AH give similar results, and both outperform MMC-CH. As the number of training samples per subject increases, MMC-AH performs worse than MMC-FCH and is even outperformed by MMC-CH in the case of 10 training samples per subject. These results show that flexible convex hulls, in addition to the affine hulls reported in Ref. [15], are also good models for representing classes in high-dimensional spaces with a limited number of training samples. In the small-sample case, convex hulls give unrealistically tight approximations to the classes, which leads to the lower classification accuracies of MMC-CH. In contrast, an appropriate flexible factor for the flexible convex hulls and the looseness of the affine hulls contribute to the better classification results of MMC-FCH and MMC-AH in such a case, respectively. As the number of training samples grows, the samples tend to become sufficient for training, and the excessive looseness of the affine hulls in turn lowers the performance of MMC-AH. From these results we conclude that the flexible convex hull is a geometric model between the convex hull and the affine hull, and that it captures the best aspects of both.

For each iteration of the repeated experiments, we can compute the F-measure value for each subject; the F-measure value over the whole repeated experiment is then the average over the 20 iterations. However, we find that in some cases the F-measure value for a certain subject varies dramatically over the 20 iterations, resulting in a large standard deviation. For example, in the case of 10 training samples per subject, the F-measure values and standard deviations for subject #3 given by MMC-CH, MMC-AH and MMC-FCH are 98.3 ± 7.5, 98.1 ± 7.9 and 98.3 ± 7.5, respectively. The standard deviations are too large to be acceptable. Additionally, in some cases the F-measure value for a certain subject cannot even be calculated through Eq. (15) because TP (true positives) for that subject is zero. The reason for these phenomena is that the classification result for a single subject is easily influenced by randomness in the repeated experiments, and since the F-measure concentrates on a single subject, it is easily influenced as well. We therefore do not report F-measure values for this database.
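The evaluation protocol used above (repeated random splits with a fixed number of training samples per class, accuracies averaged over the repeats) can be sketched as follows; `evaluate` is a hypothetical placeholder for training any of the three classifiers and returning test accuracy, and the sketch is only an illustration of the protocol, not the authors' code.

```python
import numpy as np

def repeated_split_accuracy(X, y, n_train_per_class, evaluate, n_repeats=20, seed=0):
    """Mean and std of test accuracy over repeated random splits with
    n_train_per_class training samples drawn from each class."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        tr_idx = []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            tr_idx.extend(rng.choice(idx, size=n_train_per_class, replace=False))
        tr_idx = np.array(tr_idx)
        te_idx = np.setdiff1d(np.arange(len(y)), tr_idx)
        accs.append(evaluate(X[tr_idx], y[tr_idx], X[te_idx], y[te_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```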

4.3. Jochen Triesch hand posture database

The Jochen Triesch hand posture database consists of 10 hand postures (“a”, “b”, “c”, “d”, “g”, “h”, “i”, “l”, “v” and “y”) performed by 24 persons against three backgrounds. For each person the 10 hand postures were recorded in front of uniform light, uniform dark and complex backgrounds. There are variations in scale and shape of each hand posture performed by different persons against the same background, and even performed by the same person against different backgrounds. Some images of hand postures against the three backgrounds are depicted in Fig. 4.

Fig. 4. Some examples of images from the Jochen Triesch hand posture database.

In the experiments, we not only used the images against the three backgrounds separately but also mixed all the images together for 10-fold cross-validation tests. Compared with those against the uniform backgrounds, the images against the complex background contain significant background clutter, which leads to a relatively difficult classification task in the independent experiments. Additionally, the classification of the mixed images is also a quite challenging task because of the diverse intra- and inter-class variations. Here we use a bag-of-words model [27] to represent each image, as sketched below. Each image is first represented by a collection of 128-dimensional vectors using the scale-invariant feature transform (SIFT) [28,29]. Then, the l codewords of a codebook are obtained by performing k-means clustering over all the training SIFT descriptors, and the resulting codewords are histogrammed over each image to generate an l-dimensional descriptor vector as the input feature. According to the maximum number of SIFT descriptors in a single image, we set l to 70, 80 and 270 for the images against the uniform light, uniform dark and complex backgrounds in the independent experiments, respectively, and to 270 for all the images in the mixed experiment. The flexible factors are set to 1.0471, 1.1482, 1.5849 and 1.0715 in the uniform light, uniform dark, complex and mixed experiments, respectively.
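The bag-of-words feature construction described above can be sketched roughly as follows; this assumes OpenCV with SIFT available (cv2.SIFT_create) and scikit-learn's KMeans, and is only an illustration of the pipeline, not the authors' exact implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(gray_images):
    """128-D SIFT descriptors for every image (list of HxW uint8 arrays)."""
    sift = cv2.SIFT_create()
    all_desc = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        all_desc.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return all_desc

def bow_histograms(desc_per_image, n_codewords):
    """k-means codebook over all descriptors, then an l-dimensional
    codeword histogram per image (the input feature)."""
    codebook = KMeans(n_clusters=n_codewords, n_init=10, random_state=0)
    codebook.fit(np.vstack([d for d in desc_per_image if len(d) > 0]))
    feats = []
    for d in desc_per_image:
        ids = codebook.predict(d) if len(d) > 0 else np.array([], dtype=int)
        hist, _ = np.histogram(ids, bins=np.arange(n_codewords + 1))
        feats.append(hist.astype(float))
    return np.array(feats)
```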

For the 10-fold cross-validation tests, the results are reported as the average classification accuracies over 20 iterations along with standard deviations. The experimental results are shown in Table 2. In all cases the proposed MMC-FCH achieves the best results, significantly outperforming MMC-AH and MMC-CH in the relatively challenging experiment against the complex background. Between the remaining two methods, MMC-AH outperforms MMC-CH in two of the four experiments, while MMC-CH outperforms MMC-AH in the other two.

A single 10-fold cross-validation test defines 10 random partitions of the data. For each data partition, we can compute the F-measure value for each hand posture; the average over the 10 partitions is taken as the F-measure value for that single 10-fold cross-validation test, and the final F-measure value over the repeated tests is reported as the average over 20 iterations for each hand posture. Tables 3–6 summarize the experimental results against the different backgrounds. As expected, the proposed MMC-FCH brings encouraging results. For example, in the case of the complex background, MMC-FCH gives F-measure values larger than 99% for hand postures “c”, “d”, “g”, “h”, “l”, “v” and “y”; MMC-AH does so only for “d”, “v” and “y”; and MMC-CH does so only for “v” and “y”. Both MMC-FCH and MMC-AH significantly outperform MMC-CH on hand postures “d”, “g”, “h”, “i” and “l”; furthermore, on these hand postures except “d”, MMC-FCH even significantly outperforms MMC-AH. In addition, MMC-FCH and MMC-CH significantly outperform MMC-AH on hand posture “b”. For the other hand postures, the three methods yield comparable results.

Table 2. Classification accuracies (%) for 10-fold cross-validation tests on the Jochen Triesch hand posture database.

Background   MMC-CH       MMC-AH       MMC-FCH
Light        96.9 ± 0.4   96.8 ± 0.7   97.4 ± 0.7
Dark         96.6 ± 0.7   96.9 ± 0.6   97.0 ± 0.7
Complex      97.0 ± 0.5   97.9 ± 0.4   98.7 ± 0.4
Mixed        96.7 ± 0.4   96.4 ± 0.5   96.9 ± 0.4

Table 3. F-measure values (%) for each hand posture against the uniform light background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    98.1 ± 0.2  94.7 ± 1.2  97.9 ± 1.2  97.8 ± 2.1  94.3 ± 2.0  96.2 ± 1.7  98.4 ± 1.0  96.1 ± 1.8  94.2 ± 1.6  98.4 ± 0.8
MMC-AH    97.5 ± 1.9  94.9 ± 2.0  97.7 ± 1.5  96.3 ± 2.1  95.6 ± 1.9  95.3 ± 2.1  98.1 ± 0.9  97.4 ± 1.3  94.2 ± 1.7  98.4 ± 0.5
MMC-FCH   99.7 ± 0.7  96.9 ± 1.0  97.9 ± 1.2  98.1 ± 2.7  95.8 ± 2.5  96.3 ± 1.7  98.4 ± 1.0  96.0 ± 1.9  94.1 ± 1.7  98.4 ± 0.8

Table 4. F-measure values (%) for each hand posture against the uniform dark background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    95.8 ± 1.2  94.1 ± 1.8  97.0 ± 1.7  96.9 ± 1.9  97.6 ± 1.6  96.6 ± 2.1  97.3 ± 1.7  96.0 ± 1.8  94.6 ± 2.1  98.3 ± 1.7
MMC-AH    97.2 ± 0.7  95.3 ± 1.2  97.6 ± 1.1  97.9 ± 1.7  97.2 ± 0.9  96.2 ± 1.7  97.2 ± 1.7  95.9 ± 1.7  94.6 ± 2.3  98.3 ± 1.7
MMC-FCH   97.2 ± 1.2  95.4 ± 1.4  96.6 ± 1.6  96.8 ± 2.5  97.8 ± 1.2  96.7 ± 1.9  97.8 ± 1.7  96.2 ± 1.8  94.8 ± 2.1  98.4 ± 1.5

Table 5. F-measure values (%) for each hand posture against the complex background.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    96.0 ± 1.3  95.7 ± 1.7  99.0 ± 1.6  97.2 ± 1.7  94.0 ± 1.4  93.2 ± 1.5  95.5 ± 1.3  98.0 ± 0.9  99.3 ± 1.3  99.5 ± 1.0
MMC-AH    95.9 ± 1.3  94.4 ± 1.7  98.7 ± 1.3  99.5 ± 1.0  97.3 ± 0.9  97.1 ± 1.4  97.4 ± 2.1  98.6 ± 1.3  99.3 ± 1.3  99.5 ± 1.0
MMC-FCH   95.9 ± 1.3  95.4 ± 1.3  99.3 ± 1.4  99.5 ± 1.0  99.8 ± 0.5  99.0 ± 1.4  98.7 ± 1.4  99.4 ± 1.1  99.3 ± 1.3  99.5 ± 1.0

Table 6. F-measure values (%) for each hand posture against the mixed backgrounds.

Method    "a"         "b"         "c"         "d"         "g"         "h"         "i"         "l"         "v"         "y"
MMC-CH    97.5 ± 0.7  95.9 ± 0.8  96.9 ± 0.9  96.0 ± 1.1  95.5 ± 1.2  96.7 ± 0.9  97.2 ± 0.8  96.7 ± 0.8  96.3 ± 1.0  98.2 ± 0.8
MMC-AH    96.4 ± 1.1  95.1 ± 1.0  96.6 ± 0.6  95.9 ± 1.0  96.7 ± 1.0  96.2 ± 0.9  95.4 ± 1.0  96.9 ± 0.8  96.6 ± 0.7  98.2 ± 0.9
MMC-FCH   97.9 ± 0.9  96.2 ± 1.0  96.4 ± 0.6  95.5 ± 0.8  95.5 ± 1.1  96.7 ± 0.8  97.1 ± 0.8  97.1 ± 0.7  97.3 ± 1.0  98.8 ± 0.8

4.4. MNIST handwritten digits database

The MNIST handwritten digits database contains 70,000 grayscale images of size 28 × 28 of the handwritten digits 0 to 9, with 60,000 reserved for training and the remaining 10,000 for testing. Some example images are shown in Fig. 5. Raw pixel values were used as features without any preprocessing or feature extraction. In this experiment, we used only the first 10,000 samples of the original training set as training samples, owing to the limit of computer memory. The flexible factor is set to 1.3804.

Fig. 5. Some examples of images from the MNIST handwritten digits database.

Table 7. Classification accuracies (%) on the MNIST handwritten digits database.

MMC-CH   MMC-AH   MMC-FCH
97.0     97.2     97.2

Table 8. F-measure values (%) for each digit.

Method    "0"    "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"
MMC-CH    98.1   98.8   96.6   96.7   96.9   96.6   97.5   96.2   96.6   95.8
MMC-AH    98.1   98.8   97.0   96.8   97.4   97.2   97.5   96.8   96.4   96.1
MMC-FCH   98.2   98.9   96.9   96.7   97.1   96.8   97.7   96.7   96.6   96.1

Table 7 summarizes the classification accuracies. MMC-FCH and MMC-AH give comparable results, and both are slightly better than MMC-CH in terms of classification accuracy. The F-measure values for each digit are given in Table 8. From these results we find that all three methods give their best classification performance on digit "1". Even though the best performance for each digit is alternately given by MMC-FCH and MMC-AH, MMC-FCH always achieves performance better than or comparable to MMC-CH, while MMC-AH is outperformed by MMC-CH on digit "8".

Considering the real-time efficiency (testing time), however, we find that MMC-CH performs better on this database. For the three classification methods, the real-time performance depends mainly on the number of support vectors (i.e., training samples whose corresponding coefficients are non-zero) of every binary classifier. For this database there are 45 binary classifiers in total, corresponding to the ten-class classification problem. Fig. 6 shows the number of support vectors of every binary classifier. The numbers of support vectors produced by MMC-FCH and MMC-AH are the same for each binary classifier, and are much larger than those produced by MMC-CH. Although the optimal separating hyperplane generated by MMC-FCH is determined only by a few vertices or boundary points of the flexible convex hulls, these key points still have to be expressed as a linear combination of all the original training points with non-zero coefficients when decision function values are computed. Specifically, a vertex x'_i of the flexible convex hull that determines the separating hyperplane is necessarily written as x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i, where x_i is itself necessarily a vertex of the convex hull of the training set, according to the geometric interpretation of the flexible convex hull. As for MMC-AH, the separability of affine hulls requires that all the training sample points contribute to the affine hull models [15].

Fig. 6. Number of support vectors of every binary classifier constructed for the MNIST handwritten digits database. [Figure: support vector counts over the 45 binary classifier indices for the three methods.]

However, we find that the difference between the testing times of MMC-CH and MMC-FCH is not so significant in high-dimensional spaces. This is due to the fact that most of the training samples become support vectors in such cases. For example, Fig. 7 illustrates the number of support vectors of every binary classifier constructed by MMC-CH for the Yale face database, with different numbers of training samples per class; MMC-CH returns all the training samples as support vectors in all cases.

Fig. 7. Number of support vectors of every binary classifier constructed by MMC-CH for the Yale face database. [Figure: support vector counts over the 105 binary classifier indices, for 2, 4, 6, 8 and 10 training samples per class.]
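The support-vector count used in this comparison is simply the number of training samples whose optimal coefficient is non-zero; a one-line sketch (the tolerance value is arbitrary and ours):

```python
import numpy as np

def count_support_vectors(alpha, tol=1e-8):
    """Number of training samples with non-zero optimal coefficients."""
    return int(np.sum(np.abs(alpha) > tol))
```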

5. Conclusions

We defined a flexible convex hull as an alternative class region approximation to the convex hull and the affine hull. By introducing a flexible factor, the flexible convex hull is by construction an approximation model looser than the convex hull but tighter than the affine hull, and it captures the best aspects of both; our experimental results verify this. We proposed a maximum margin classification based on flexible convex hulls and showed how to construct such a classification model given two flexible convex hulls. The proposed method can also be kernelized by using the kernel trick and extended to multi-class classification by constructing binary pairwise classifiers. Experiments on several databases show that MMC-FCH achieves encouraging results compared to MMC-CH and MMC-AH. These results do not indicate the absolute superiority of MMC-FCH in every aspect, but they do provide useful insights into the potential of the proposed method.


Acknowledgments


This research is supported by the National Natural Science Foundation of China (Grant nos. 51175158 and 51375152) and the Hunan Provincial Innovation Foundation for Postgraduate (Grant no. CX2014B146). We also greatly appreciate the database owners' authorization to use the free databases for this research.


Appendix

Proposition. The flexible convex hull (3) is equivalent to the convex hull (5).

Proof. On the one hand, any x \in \mathrm{flex}(X) can be written as

x = \sum_{i=1}^{n} \alpha_i x_i \quad (16)

Let \alpha_i = \frac{1-\lambda}{n} + \lambda\beta_i; since \alpha_i \in \left[\frac{1-\lambda}{n}, \frac{1-\lambda}{n}+\lambda\right], we know that \beta_i \in [0, 1]. Thus Eq. (16) can be further written as

x = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right)x_i = (1-\lambda)\frac{1}{n}\sum_{i=1}^{n} x_i + \lambda\sum_{i=1}^{n}\beta_i x_i \quad (17)

At the same time, we have

\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right) = 1 - \lambda + \lambda\sum_{i=1}^{n}\beta_i \quad (18)

Combining Eq. (18) with the constraint \sum_{i=1}^{n}\alpha_i = 1, we obtain

\sum_{i=1}^{n}\beta_i = 1 \quad (19)

Substituting Eq. (19) into Eq. (17),

x = (1-\lambda)\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n}\beta_i\right) + \lambda\sum_{i=1}^{n}\beta_i x_i = \sum_{i=1}^{n}\beta_i\left((1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i\right) \quad (20)

Let x'_i = (1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i; then

x = \sum_{i=1}^{n}\beta_i x'_i \quad (21)

Thus we have proved that x \in \mathrm{flex}(X) \Rightarrow x \in \mathrm{conv}(X').

On the other hand, any x \in \mathrm{conv}(X') can be written as

x = \sum_{i=1}^{n}\beta_i x'_i = \sum_{i=1}^{n}\beta_i\left((1-\lambda)\frac{1}{n}\sum_{j=1}^{n} x_j + \lambda x_i\right) = (1-\lambda)\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n}\beta_i\right) + \lambda\sum_{i=1}^{n}\beta_i x_i \quad (22)

Substituting \sum_{i=1}^{n}\beta_i = 1 into Eq. (22),

x = (1-\lambda)\frac{1}{n}\sum_{i=1}^{n} x_i + \lambda\sum_{i=1}^{n}\beta_i x_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right)x_i \quad (23)

Let \alpha_i = \frac{1-\lambda}{n} + \lambda\beta_i; since \beta_i \in [0, 1], we know that \alpha_i \in \left[\frac{1-\lambda}{n}, \frac{1-\lambda}{n}+\lambda\right]. Thus Eq. (23) can be further written as

x = \sum_{i=1}^{n}\alpha_i x_i \quad (24)

Moreover, we have

\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\left(\frac{1-\lambda}{n} + \lambda\beta_i\right) = 1 - \lambda + \lambda\sum_{i=1}^{n}\beta_i \quad (25)

Substituting \sum_{i=1}^{n}\beta_i = 1 into Eq. (25), we obtain

\sum_{i=1}^{n}\alpha_i = 1 \quad (26)

Thus we have proved that x \in \mathrm{conv}(X') \Rightarrow x \in \mathrm{flex}(X). In summary, x \in \mathrm{flex}(X) \Leftrightarrow x \in \mathrm{conv}(X'). □

References

[1] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[2] A. Sanchez, V. David, Advanced support vector machines and kernel methods, Neurocomputing 55 (1) (2003) 5–20.
[3] J. Dong, A. Krzyżak, C.Y. Suen, An improved handwritten Chinese character recognition system using support vector machine, Pattern Recognit. Lett. 26 (12) (2005) 1849–1856.
[4] M.A. Kumar, M. Gopal, A comparison study on multiple binary-class SVM methods for unilabel text categorization, Pattern Recognit. Lett. 31 (11) (2010) 1437–1444.
[5] V. Mitra, C.J. Wang, S. Banerjee, Text classification: a least square support vector machine approach, Appl. Soft Comput. 7 (3) (2007) 908–914.
[6] S. Hua, Z. Sun, A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach, J. Mol. Biol. 308 (2) (2001) 397–407.
[7] M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Expert Syst. Appl. 36 (2) (2009) 3240–3247.
[8] Y. Yang, D. Yu, J. Cheng, A fault diagnosis approach for roller bearing based on IMF envelope spectrum and SVM, Measurement 40 (9) (2007) 943–950.
[9] S. Abbasion, A. Rafsanjani, A. Farshidianfar, et al., Rolling element bearings multi-fault classification based on the wavelet denoising and support vector machine, Mech. Syst. Signal Process. 21 (7) (2007) 2933–2945.
[10] Y. Ji, S. Sun, Y. Lu, Multitask multiclass privileged information support vector machines, in: Proceedings of the IEEE 21st International Conference on Pattern Recognition (ICPR), 2012.
[11] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, J. Mach. Learn. Res. 11 (2010) 2423–2455.
[12] Z. Wang, S. Yan, C. Zhang, Active learning with adaptive regularization, Pattern Recognit. 44 (10) (2011) 2375–2383.
[13] K.P. Bennett, E.J. Bredensteiner, Duality and geometry in SVM classifiers, in: P. Langley (Ed.), Proceedings of the International Conference on Machine Learning, 2000.
[14] X. Zhou, W. Jiang, Y. Tian, et al., A new kernel-based classification algorithm, in: W. Wang, H. Kargupta, S. Ranka, P.S. Yu, X. Wu (Eds.), Proceedings of the Ninth IEEE International Conference on Data Mining, 2009.
[15] H. Cevikalp, B. Triggs, H.S. Yavuz, et al., Large margin classifiers based on affine hulls, Neurocomputing 73 (16) (2010) 3160–3168.
[16] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognit. Lett. 20 (11) (1999) 1191–1199.
[17] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[18] D. Lee, J. Lee, Domain described support vector classifier for multi-classification problems, Pattern Recognit. 40 (1) (2007) 41–51.
[19] X.K. Wei, G.B. Huang, Y.H. Li, Mahalanobis ellipsoidal learning machine for one class classification, in: Proceedings of the IEEE International Conference on Machine Learning and Cybernetics, 2007.
[20] S. Knerr, L. Personnaz, G. Dreyfus, Single-layer learning revisited: a stepwise procedure for building and training a neural network, in: F. Fogelman Soulié, J. Hérault (Eds.), Neurocomputing: Algorithms, Architectures and Applications, Springer-Verlag, 1990, pp. 41–50.
[21] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[22] B. Fei, J. Liu, Binary tree of SVM: a new fast multiclass training and classification algorithm, IEEE Trans. Neural Netw. 17 (3) (2006) 696–704.
[23] J.C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: S.A. Solla, T.K. Leen, K. Müller (Eds.), Advances in Neural Information Processing Systems, 2000.
[24] Yale Face Database, 〈http://cvc.yale.edu/projects/yalefaces/yalefaces.html〉.
[25] J. Triesch, C. Von Der Malsburg, Robust classification of hand postures against complex backgrounds, in: Proceedings of the Second IEEE International Conference on Automatic Face and Gesture Recognition, 1996.
[26] Y. LeCun, L. Bottou, Y. Bengio, et al., Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[27] F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[28] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
[29] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.


Ming Zeng received the B.S. degree from the School of Electro-mechanical Engineering, Guangdong University of Technology, Guangzhou, China, in 2010. He is currently working toward the Ph.D. degree at Hunan University, Changsha, China. His main research interests include pattern recognition and machinery fault diagnosis.


Yu Yang received the B.S. degree, the M.S. and Ph.D. degrees in mechanical engineering from the College of Mechanical and Vehicle Engineering, Hunan University, Changsha, PR China, in 1994, 1997 and 2005, respectively. Her research interests include pattern recognition, digital signal processing and machine fault diagnosis.

Jinde Zheng received the B.S. degree in Mathematics from Anhui Normal University, Wuhu, China, in 2009. He is currently working toward the Ph.D. degree at Hunan University, Changsha, China. His main research interests include dynamic signal processing, time–frequency analysis and machinery fault diagnosis.


Junsheng Cheng received the Ph.D. degree in manufacturing engineering and automation from Hunan University in 2005. He is currently a professor in the College of Mechanical and Vehicle Engineering, Hunan University. His main research interests include mechanical fault diagnosis, dynamic signal processing, and vibration and noise control.