Fast data selection for SVM training using ensemble margin

Pattern Recognition Letters 51 (2014) 112–119 · doi:10.1016/j.patrec.2014.08.003

Short Communication

Li Guo a,b,∗, Samia Boukir a

a G&E Laboratory (EA 4592), IPB/University of Bordeaux, 1 allée F. Daguin, 33670 Pessac, France
b CNRS-IMS Laboratory (UMR 5218), Bt A4, 351 Cours de la Libération, 33402 Talence Cedex, France
✩ This paper has been recommended for acceptance by Dr. G. Moser.
∗ Corresponding author. Tel.: +33 557 121 036; fax: +33 557 121 001. E-mail address: [email protected] (L. Guo).

Article history: Received 10 October 2013. Available online xxx.

Keywords: Boundary points; Instance selection; Ensemble learning; Margin theory; Large data; Support vector machine

Abstract

Support Vector Machine (SVM) is a powerful classification method. However, it suffers from a major drawback: the high memory and time complexity of its training, which constrains the application of SVM to large classification tasks. To accelerate SVM training, a new ensemble margin-based data selection approach is proposed. It relies on a simple and efficient heuristic to provide support vector candidates: selecting the lowest margin instances. This technique significantly reduces the complexity of the SVM training task while maintaining the accuracy of the SVM classification. A fast alternative of our approach, called SVIS (Small Votes Instance Selection), with great potential for large data problems, is also introduced.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Support Vector Machine (SVM) relies on statistical learning as proposed by Vapnik [1]. It is a powerful classification scheme that outperforms most existing classifiers on various applications. However, it suffers from a major drawback: the high computational cost of the training, which constrains the application of SVM to classification tasks involving a huge amount of data. To reduce the SVM training complexity, mainly two different approaches have been proposed in the literature. The first approach consists in selecting support vector candidates, then using them to train the SVM [2]. The second approach speeds up the training by a decomposition procedure, leading to problems of smaller size solved repeatedly [3]. Although the latter approach reduces the difficulty of the underlying optimization problem, the considerable training price induced by memory space complexity must be endured [4].

Our approach belongs to the first category of methods and is based on ensemble learning. Ensemble learning consists in producing multiple learners, which are generally homogeneous, and then combining them [5]. The homogeneous base classifiers are produced by different runs of the same training algorithm. A popular method for creating homogeneous models, through the manipulation of training data, is bagging [6]. Bagging is particularly well-suited to huge data. First, all base classifiers in bagging are independent of each other and can be created simultaneously. Moreover, each base classifier uses just a portion of
the total training set: about 63.2% for classic bagging with replacement sampling, and 50% for sampling without replacement, also known as subagging [7]. In addition, and more interestingly, Breiman [8] showed that the base classifiers can be built with a very small sampling ratio, as small as 0.5% of the total training data in some cases, without degrading the classification performance. Indeed, the optimal value of the sampling ratio is problem-dependent and changes from dataset to dataset. However, in most cases, the best sampling ratio is smaller than standard choices [9]. Besides sampling at the data level, the variables (or features) can also be sampled to construct a bagging ensemble, known as random forest [10], one of the most successful classifiers nowadays.

We first introduced ensemble methods into instance selection in previous work [11]. That work relies on the margin paradigm of ensemble classifiers to select the most relevant training instances for SVM and thus accelerate its training. In this preliminary work, a classic bagging tree ensemble [6] is at the core of the proposed ensemble-based instance selection method. However, classic tree bagging is not very effective for data selection, especially when the training set is very large and the dimensionality of the data is very high. In this work, we investigate, within our instance selection scheme, more powerful ensemble methods: random forest and a very small ensemble called SVIS (Small Votes Instance Selection). The latter is particularly well-suited to large data problems and is the major contribution of this work.

This paper is organized as follows. A brief overview of instance selection methods for speeding up SVM training is given in Section 2. We then introduce our support vector selection method, which relies on the ensemble margin, in Section 3. The validation of our
approach is presented in Section 4. Finally, discussions and concluding remarks are given in Section 5.

2. Building SVM with reduced training complexity

2.1. Problem statement

The SVM method consists in finding the hyperplane maximizing the distance (called margin) to the closest training data points, known as Support Vectors (SV), in both classes. These SV play an important role in defining the discriminant hyperplane or predicting function. They are chosen among all the training data by solving a Quadratic Programming (QP) problem. The solution of this QP problem is the crux of the SVM design, which depends on all training instances and on the selection of a few kernel parameters. However, the memory space for storing the kernel matrix in the SVM QP formulation is O(N²), where N is the number of training data. In addition, the time complexity of a standard QP solver is O(N³) [12]. This indicates that SVM is unsuitable for problems of large size, such as the classification of remote sensing data.

In SVM, the decision function for classification or regression problems is fully determined by the support vectors. Therefore, training with a reduced training set consisting of all support vectors gives an identical model to training with the whole training set. Furthermore, these SV are almost identical regardless of which kernel function is applied for training [13]. In many real-world applications, the number of support vectors is expected to be much smaller than the total number of training examples. Therefore, if one knows in advance which patterns correspond to the support vectors, the same solution can be obtained by solving a much smaller QP problem that involves only the support vectors [14]. As long as the size of the resulting reduced training set is much smaller than the size of the total training set, the SVM training speed will be significantly improved since, in practice, the training algorithm scales quadratically with the training set size on many problems [14]. The problem is then how to select training examples that are likely to be support vectors. Recently, there has been considerable research on data selection for SVM training.
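To make this premise concrete, the following sketch (not from the paper; it assumes scikit-learn and an arbitrary synthetic dataset) trains an SVM on a full training set, keeps only its support vectors, retrains on them, and checks that the two models make nearly identical predictions. With a soft-margin SVM the match is close but not guaranteed to be exact.

# Sketch: retraining an SVM on its own support vectors yields essentially the
# same decision function. Dataset and hyper-parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

full_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Reduced training set: only the support vectors found by the full run.
sv_idx = full_svm.support_
reduced_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train[sv_idx], y_train[sv_idx])

agreement = np.mean(full_svm.predict(X_test) == reduced_svm.predict(X_test))
print(f"{len(sv_idx)} SV out of {len(X_train)} training points, "
      f"test prediction agreement = {agreement:.3f}")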

2.2. Support vector selection methods

Many SV selection methods have been proposed in the literature and can be classified into the following categories:

• Clustering-based method. A common approach consists firstly in clustering the samples of each class, then identifying the critical clusters; finally, the SV candidates are selected from the identified critical clusters according to their proximity to the border [15]. A pattern selection technique based on K-means clustering, called SVM-KM, was proposed by Almeida et al. [16]. All members of the heterogeneous clusters, which consist of instances from different classes, are selected as SV candidates, while instances from the same homogeneous cluster are replaced by the centroid of the cluster. The main drawback of this approach is the sensitivity of the SV selection performance to the heterogeneity threshold and to the number of clusters. Moreover, both parameters are difficult to set. A similar work is introduced in [17], where it is assumed that the cluster centers are known in advance, and the boundary patterns of homogeneous clusters are selected instead of the centroid. Lyhyaoui et al. [15] applied 1-nearest neighbor search from the opposite class, after class-wise clustering, to identify SV candidates located near the boundary of the two classes. But the performance of this method is poor when the classes overlap or there are many mislabeled instances in the
training set. Wang and Xu [4] proposed a heuristic method for accelerating SVM training in which all the training data are first divided into several groups using some clustering method; then, for each group, some training vectors are discarded based on a measure of similarity among examples, under a pre-established similarity threshold. In this method, data points which are close to some reference data point, such as the mean data point, are removed. Although this approach can reduce the training size efficiently, the extracted data points are often not support vectors.

• K-nearest neighbors-based method. This approach relies on the fact that an instance located near the decision boundary tends to have more heterogeneous neighbors (a simplified sketch of this neighborhood heuristic is given after this list). In Guo and Zhang [18], a training set reduction method for SVM training is developed based on the observation that a support vector's target value is usually a local extremum, or near an extremum, among the values of its K nearest neighbors. Thus, an example with very few neighbors of the same category has a higher probability of being a support vector. Li et al. [19] combined the samples selected by an edge detector and the centroids of clusters to reconstruct the training dataset; a K-nearest neighbor search was applied as the edge detector. Shin and Cho [2] proposed a method called Neighborhood Property based Pattern Selection (NPPS). It consists of two steps. First, it computes the label entropy among the K nearest neighbors of a considered instance: an instance located near the decision boundary tends to have more heterogeneous neighbors in their class membership, thus its corresponding entropy value should be positive. The second step consists in removing the noisy samples: if an instance's own label differs from the majority label of its neighbors, it is likely to be incorrectly labeled.

• Statistical method. This approach employs some statistical information on the training data to determine which instances should be selected as SV candidates. In Wang et al. [14], the selection method is based on a statistical confidence measure that determines whether a training example is near an opposite class instance. This approach considers the number of training examples that are contained in the largest sphere centered on the considered training example without covering examples of other classes. Intuitively, the larger this number N(x_i), the more likely training example x_i truly belongs to class y_i as labeled in the training data, i.e., the more confidence in its class membership. Although the underlying concept of this method sounds reasonable, its selection performance is not sufficient to build a robust SVM classifier. Indeed, the authors reported experimental results on five UCI datasets, among which the proposed method is significantly better than random selection on only one dataset. Another method proposed by the same authors uses the minimal distance (Hausdorff distance) from a training example to the training examples of a different class as a criterion to select patterns near the decision boundary. However, it turned out to be less effective than the former one. Recently, Li and Maguire [20] proposed the Border-Edge Pattern Selection (BEPS) method. This method combines both local geometrical (neighborhood) and statistical information for instance selection. It selects edge patterns based on the approximated tangent hyperplane of a class surface, and also determines border patterns between classes using their local statistical information. It relies on K-nearest neighbors to determine the local information.
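As a concrete illustration of the neighborhood heuristic mentioned in the K-nearest neighbors item above, the sketch below keeps instances whose K nearest neighbors have mixed labels and whose own label agrees with the local majority. It is a simplified, assumption-laden rendition of that idea (arbitrary K, scikit-learn neighbor search), not the reference NPPS code.

# Simplified sketch of an NPPS-style selection: keep boundary-like instances
# (positive label entropy in the K-neighborhood) and drop probable label noise
# (own label differs from the neighborhood majority). Not the authors' code.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def knn_entropy_selection(X, y, k=7):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                    # column 0 is the point itself
    keep = []
    for i, neighbors in enumerate(idx[:, 1:]):
        counts = Counter(y[neighbors])
        probs = np.array(list(counts.values()), dtype=float) / k
        entropy = -np.sum(probs * np.log(probs))
        majority_label = counts.most_common(1)[0][0]
        if entropy > 0 and y[i] == majority_label:
            keep.append(i)                       # heterogeneous, non-noisy instance
    return np.array(keep)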

In fact, the K-nearest neighbors-based and clustering-based methods could be grouped into the same category: neighborhood-based methods, which select instances according to their local geometrical information. Many other strategies have also been introduced in this domain, such as rough set models [21], graph-based models [22] and genetic algorithms [23].


3. A new support vector selection method based on ensemble margin

3.1. Margin of ensemble methods

The margin theory of ensemble methods was first proposed by Schapire et al. [24] to explain the success of boosting [25]. It also cast boosting into a very inviting context: large margin classifiers [26], which also include another major classifier: SVM. We use an unsupervised ensemble margin that combines the first and second most voted class labels under the model [11,27,28]. This margin can be computed by Eq. (1), where c_1 is the most voted class for sample x and v_{c_1} is the number of related votes, and c_2 is the second most popular class and v_{c_2} the number of corresponding votes. This margin's range is from 0 to 1. It is an alternative to the classical definition of the margin [24] with an appealing property: it does not require the true class labels of instances. Thus, it is potentially more robust to noise, as it is not affected by errors occurring on the class label itself.

margin(x) = \frac{\max_{c=1,\dots,L}(v_c) - \max_{c=1,\dots,L,\ c \neq c_1}(v_c)}{\sum_{c=1}^{L} v_c} = \frac{v_{c_1} - v_{c_2}}{T}    (1)

where T represents the number of base classifiers in the ensemble and L is the number of classes (the total number of votes \sum_{c=1}^{L} v_c equals T).

3.2. Method

We propose here a new method to extract a relevant set of support vector candidates, which relies on the ensemble margin of each training instance. We use the unsupervised margin concept previously introduced. A plain SVM is then run on this smaller set to reduce computational costs. Our method selects patterns near the decision boundary based on their margin values: the lower the margin margin(x_i) of a training example x_i, the more likely x_i a priori belongs to the decision boundary, and therefore the more confidence in its support vector set membership.

Our method is illustrated by Algorithm 1. It is an instance ranking method: it consists in ranking all the instances according to their margin values and selecting the most informative ones. It first trains the ensemble on the whole training set and then selects the data on which most of the committee of base classifiers do not agree with each other.
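A minimal sketch of this selection scheme is given below. It is an illustration under assumptions (scikit-learn's random forest as the bagging ensemble; illustrative values for M, T and the sampling ratio), not the authors' R implementation; the SVIS variant would simply use a much smaller ensemble, e.g. 10 trees each trained on 10% of the data.

# Sketch of the ensemble margin-based instance selection idea (Algorithm 1):
# train a bagging-type ensemble, compute the unsupervised margin of Eq. (1)
# from the per-tree votes, keep the M% lowest-margin instances, then train a
# plain SVM on that reduced set. All settings here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def ensemble_margins(forest, X):
    """Unsupervised margin of Eq. (1): (v_c1 - v_c2) / T for each instance."""
    # Votes of each tree, as encoded class indices (one column per tree).
    votes = np.stack([tree.predict(X) for tree in forest.estimators_], axis=1)
    T = votes.shape[1]
    margins = np.empty(len(X))
    for i, row in enumerate(votes):
        counts = np.sort(np.bincount(row.astype(int), minlength=2))[::-1]
        margins[i] = (counts[0] - counts[1]) / T   # e.g. 60 vs 30 votes of 100 -> 0.3
    return margins

def margin_based_selection(X, y, M=0.20, n_trees=100, sample_ratio=0.632):
    forest = RandomForestClassifier(n_estimators=n_trees, bootstrap=True,
                                    max_samples=sample_ratio, n_jobs=-1,
                                    random_state=0).fit(X, y)
    margins = ensemble_margins(forest, X)
    n_keep = int(M * len(X))
    return np.argsort(margins)[:n_keep]            # indices of lowest-margin data

# Usage: the SVM is trained on the reduced set only.
# keep = margin_based_selection(X_train, y_train, M=0.20)          # classic RF
# keep = margin_based_selection(X_train, y_train, M=0.20,
#                               n_trees=10, sample_ratio=0.10)     # SVIS-like
# svm = SVC(kernel="rbf").fit(X_train[keep], y_train[keep])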

Actually, in most applications, the common choice of an ensemble size T = 100 and a sampling ratio γ = 0.632 for standard bootstrap sampling has been shown to be sufficient to achieve stable classifications [29,6,10,5]. However, we investigate the ensemble method not for classification, but for instance selection. Thus, a very small
sampling ratio γ and a reduced ensemble size T can be applied. We call this very small ensemble-based instance selection method, introduced in this work, the Small Votes Instance Selection (SVIS) method. This method is both effective and efficient, and is investigated here for instance selection for the first time.

Our data selection algorithm has a single parameter to tune: the percentage M of lowest margin training data to set as SV candidates. M can be automatically determined by maximizing the accuracy achieved on a validation set by SVM trained on the instances selected by our ensemble margin-based SV selection method. Though bagging-type ensembles such as random forest, and especially small votes, are much faster than SVM, running the training several times to optimize the setting of M would be time-consuming and could compromise the efficiency gain of the proposed approach over SVM. The tuning of parameter M therefore needs to be carefully addressed to ensure both effectiveness and efficiency. However, in the case of huge data, this parameter is hardware dependent: the largest M permitted by the hardware (or by the affordable computing time) should achieve the best performance. Indeed, the main objective of SV selection methods is to handle huge datasets with SVM in affordable time.

3.3. Complexity

Our method is less costly than the SVM approach. Indeed, the time complexity of a bagging tree ensemble is O(TN log(N)), where N is the number of training data and T is the number of trees in the ensemble, compared to O(N³) for a standard implementation of SVM. In general, T << N, especially for large datasets, reducing the complexity to O(N log(N)). In our experiments, we used ensembles of moderate size, with T = 100, a typical ensemble size sufficient for most applications [6], and even T = 10, which is 10 to 100 times lower than our training dataset sizes. In addition, the bagging algorithm is suitable for a parallel implementation. Besides, just a small portion of the training set can be used to construct each single base classifier.

A bagging tree ensemble does not need intensive computing time; however, a considerable amount of memory is needed to store an N × T matrix, hence its space complexity is O(N) (when T << N). The memory requirements of the SVM algorithm depend on the number of support vectors N_sv: as the memory space for storing the N_sv × N_sv kernel matrix is O(N_sv²), the SVM space complexity is O(N²) in the worst case (large datasets).

Let us emphasize that our SV selection method can be implemented using any bagging ensemble algorithm. Thus, when there are strict memory requirements, or if some classification accuracy can be sacrificed for training speed, a small sampling ratio can be applied for each base classifier, and simple classifiers such as decision stumps [30] can be used, as they have very low memory requirements and can be generated very quickly. An attractive alternative that reduces the memory requirements and largely increases the speed of both training and classification of ensembles, while maintaining their generalization performance, is the use of pruned CART trees. An additional intermediate stage, ensemble pruning, that deals with the reduction of the ensemble size prior to combination, can also be applied [28]. Finally, though our method requires an ensemble training step prior to the SVM training on the resulting reduced training set, the additional cost in training is small compared to the SVM training time on the whole dataset.
Previous experiments run on huge datasets, in the context of remote sensing, showed that the training time of SVM was 30 to 40 times higher than that of tree-based ensembles [27].

3.4. Discussion

Like support vectors, lowest margin instances are usually close to decision boundaries and therefore are more likely to be misclassified. That is why all lowest margin instances are considered in our data
selection scheme regardless of the outcome of their classification. Let us emphasize that an instance of low margin is not always a boundary instance. It can also belong to the central instances of a rare class whose under-representation in the training set leads to low margins and a high misclassification rate. Hence, our method is not expected to select exactly the same set of SV as SVM. Indeed, it is based on a selection criterion that is not limited to boundary instances but covers rare and/or difficult class instances as well. Improving the classification accuracy of minority or hard classes will have an impact on the overall performance of the SVM classifier and will somehow compensate for the non-detection of a portion of the SV determining the SVM decision function.

Unlike clustering or K-nearest neighbors based SV selection methods, where one has to compute the distance between each pair of instances, a procedure of time complexity O(N²) which can be very costly, especially in high dimension problems, our SV selection technique is very fast and well adapted to both high dimension and huge datasets. In addition, distance calculation is difficult and not practical when the variables are categorical. The ensemble margin plays a key role in our method; in particular, we rely on small margin instances, a portion of which belongs to difficult or minority classes. Thus, our method is also suitable for imbalanced datasets. Moreover, compared to the single classifiers on which existing state-of-the-art SV selection methods rely, bagging ensemble methods are more robust to noisy problems.

4. Experimental results

4.1. Datasets

In this section, we report experimental results on a synthetic imbalanced dataset, Sin-Square (Fig. 1), and nine datasets from the UCI Machine Learning repository [31], shown in Table 1.

Fig. 1. The 20% of smallest margin instances from ensemble margin-based data selection (displayed in filled points) on dataset Sin-Square.

Table 1. Datasets.

Dataset            Training   Test    Attributes   Classes
Glass              107        107     10           6
Heart              135        135     13           2
Iris               75         75      4            3
Letter             10000      10000   16           26
Optdigits          2810       2810    64           10
Pendigits          5496       5496    16           10
Segment            1155       1155    19           7
Sin-Square         2500       2500    2            3
Waveform           2500       2500    21           3
Wine quality red   800        799     11           6

To validate our method, we use random forest to create an ensemble, the base classifiers being Classification and Regression Trees (CART) [32], whose number has been set to 100. In classic random forest, each base classifier uses a bootstrap sample of about 63.2% of the total training set. However, that portion might still be too large for building a CART when the dataset is huge. Hence, we also report here random forest results with just 10 CARTs, each of them using just 10% of the total training set, which is very fast to construct. We call this very small ensemble-based instance selection method SVIS (Small Votes Instance Selection). For each training set, according to the data selection method used, a portion of the training set is selected as a reduced training set to train the SVM classifier. For each SVM, a radial basis kernel is used. Besides, a 10-fold cross-validation is applied on the training set to tune the SVM parameters.

A comparative analysis is carried out with respect to two reference methods, called respectively NPPS (Neighborhood Property based Pattern Selection) [2] and the Border-Edge Pattern Selection (BEPS) method [20]. NPPS is a successful K-nearest neighbors based method
described in Section 2.2. Shin and Cho [2] compared their proposed NPPS method with the clustering-based method SVM-KM introduced in [16]. They showed that, compared to SVM-KM, NPPS achieved better computational efficiency. NPPS has become an important reference method for SV selection, and hence it is one of the methods we chose to compare our data selection to. We also compare our method to another competitive method, the Border-Edge Pattern Selection (BEPS) method [20]. Li and Maguire [20] compared their BEPS method to 4 different pattern selection methods on 19 classification problems, and demonstrated that BEPS achieves a better reduction performance than the other 4 methods.

This comparison is difficult to conduct because, unlike our method, NPPS and BEPS cannot set beforehand a percentage of the whole training set to be selected, their performances depending on their parameters. Therefore, in order to conduct a fair comparison, we compared these methods in the optimal conditions that lead to the best performance achievable by each method on a validation set. The smallest parameter K and the smallest percentage M of training data of lowest margins were respectively used, for NPPS and for our method, to achieve the accuracy A* of SVM, trained on all the training set, on a validation set. In this work, the whole training set is used as validation set. Because NPPS failed to converge for some of the datasets we used, we degraded the best achievable SVM accuracy A* by 0.5% for both methods. As for BEPS, two parameters kb and ke are determined according to the formulae proposed by Li and Maguire [20]; the other two parameters λ and γ are optimized by a grid search on the ranges (0.1, 0.3) and (0, 0.1), as proposed by Li and Maguire [20]. All of the results reported in this paper are mean values over 10 runs.

4.2. Low margin instances as support vectors

First of all, we demonstrate the relationship between the ensemble margin value of an instance and the relevance of this instance as an SV candidate. In Fig. 1, the filled points represent the 20% smallest margin instances found by the random forest classifier. This result clearly shows that the majority of the depicted small margin instances are located at class boundaries, thus behaving like SV. Hence, the margin criterion is a good heuristic for determining whether a training example is near class boundaries. The SVM method led to 308 SV (using 5000 training instances), of which the ensemble classifier found 90% using just 21% of smallest margin instances as training data.

The data selected by our method that are not genuine SV cannot totally be considered as false positives. Indeed, as stated before, relying on lowest margins will certainly provide boundary instances
but rare and difficult class instances as well. This also has a significant impact on the outcome of a classifier, which explains the low loss in SVM accuracy despite an incomplete selection of genuine SV, as shown later by our experiments.

4.3. Support vector selection

We compare here the effectiveness of the data selection strategies in terms of their ability to reduce the training set size while maintaining the generalization performance of the resulting SVM classifier.

4.3.1. Overall performance

Fig. 2 shows the rate of genuine SV (provided by SVM) among the selected SV candidates, as a function of training data size, achieved by our classic random forest ensemble instance ranking method and by uniform random sampling respectively. Our method selects significantly more SV than the random alternative. It provides 90% of all SV using just 21% of training instances. However, for the same reduced training data size, we selected only 25% of all SV related to dataset Glass, which is the worst performance of our method in these experimental results.

Fig. 2. Rate of genuine SV among the selected SV candidates on dataset Sin-Square.

Fig. 3 displays the SVM classification accuracy behavior with respect to the training data size achieved on dataset Sin-Square after a training step using, respectively, instances selected by the classic random forest ensemble classifier and randomly selected instances. It clearly shows that our method performs better than random selection. Besides, our data selection method can greatly increase the training speed of SVM, with very limited or even no loss in classification ability. For dataset Sin-Square, just 20% of lowest margin instances are sufficient to achieve the SVM classification accuracy obtained with the whole training set.

Table 2 presents the difference in accuracy obtained on the test set by SVM trained on all the training set and on the reduced training set selected by the four competing data selection methods, as well as the associated selected training set sizes. We also report the accuracy of SVM with all of the training set as a reference. Compared to SVM, our data selection method, involving a classic random forest (RF) ensemble of size 100 based on a bootstrap sample using about 63.2% of the whole training set for each base classifier, lost on average 0.66% in accuracy with an average training size percentage of 44.89%. It achieves the first rank both in terms of accuracy and reduction rate among the four methods. Our classic ensemble method reduces SVM training size on average about 20% more than NPPS and 6% more than BEPS, while inducing a lower loss in accuracy.
Fig. 3. Classification accuracy on dataset Sin-Square.

Table 2. Loss in SVM accuracy (%) and reduced training size (%) by NPPS, BEPS, RF and SVIS pattern selection methods (rank in parentheses).

Dataset            SVM Acc.   NPPS Loss   NPPS Size   BEPS Loss   BEPS Size   RF Loss     RF Size     SVIS Loss   SVIS Size
Glass              95.32      5.61 (4)    68.2 (2)    2.79 (2)    57.0 (1)    2.14 (1)    75.9 (4)    2.89 (3)    73.9 (3)
Heart              78.51      0.74 (2)    58.5 (2)    1.47 (4)    60.0 (3)    0.37 (1)    49.4 (1)    0.88 (3)    62.1 (4)
Iris               94.66      0.00 (1)    92.0 (4)    1.32 (3)    42.6 (1)    1.46 (4)    54.8 (2)    0.53 (2)    68.4 (3)
Letter             96.58      0.29 (3)    91.0 (4)    8.87 (4)    52.1 (1)    0.15 (1)    56.7 (2)    0.25 (2)    65.4 (3)
Optdigits          98.93      0.24 (1)    52.7 (3)    1.27 (4)    92.0 (4)    0.34 (2)    20.1 (1)    0.34 (2)    26.8 (2)
Pendigits          99.59      0.69 (4)    60.4 (4)    0.55 (3)    40.2 (3)    0.38 (1)    13.9 (1)    0.50 (2)    21.4 (2)
Segment            96.10      0.00 (3)    72.5 (4)    0.94 (4)    30.3 (1)    −0.69 (1)   43.7 (2)    −0.48 (2)   47.1 (3)
Sin-Square         97.44      0.24 (2)    18.8 (2)    2.96 (4)    28.4 (3)    0.13 (1)    17.1 (1)    0.32 (3)    37.4 (4)
Waveform           86.88      0.64 (1)    38.3 (1)    1.24 (4)    83.8 (4)    0.82 (3)    40.8 (2)    0.67 (2)    42.0 (3)
Wine quality red   61.07      1.50 (1)    79.6 (4)    3.99 (4)    28.6 (1)    1.50 (1)    76.5 (3)    2.21 (2)    67.3 (2)
Average loss & size           0.99        63.20       2.5         51.5        0.66        44.89       0.81        51.18
Average rank                  2.2         3.0         3.6         2.2         1.6         1.9         2.3         2.9

It can be noticed that our classic ensemble method significantly outperforms the NPPS method in SV selection (training size reduction) in 80% of the cases (considering both the classic and reduced complexity ensembles). We also show SV selection results based on a lower complexity ensemble built with only 10 base classifiers, each of them being trained on a random sample without replacement of size 10% of the whole training set (SVIS). Let us emphasize that this ensemble is very fast and achieves comparable results to NPPS and BEPS in terms of both SVM training size reduction and SVM accuracy. In addition, the removal of higher margin instances (via lowest margin selection) from the training set can sometimes even improve the classifier's performance, as shown on dataset Segment, for which the loss in accuracy is negative: an improvement in SVM accuracy of 0.7% while removing about 56% of the training data.

Table 3 shows the accuracy of SVM, RF and SVIS trained on all the training data, and the accuracy of SVM trained on the data selected by our methods. RF is generally less effective than SVM in terms of accuracy: it achieves three wins and seven losses compared to SVM trained on all the training data. Our instance selection approach can be considered as a combination of these two powerful methods. SVM trained on the data selected by SVIS appears as a particularly powerful alternative: it achieves an accuracy comparable to SVM trained on all the training data while largely reducing the time complexity of classic SVM.

Table 3. Accuracy of SVM, RF and SVIS trained on all the training data, and accuracy of SVM trained on the data selected by the ensemble margin-based pattern selection methods (RF+SVM, SVIS+SVM).

Data   SVM     RF      RF+SVM   SVIS    SVIS+SVM
Gla.   95.32   98.13   93.18    78.50   92.43
Hea.   78.51   77.77   78.14    73.33   77.63
Iri.   94.66   93.33   93.20    92.00   94.13
Let.   96.58   95.01   96.43    81.28   96.33
Opt.   98.93   97.86   98.59    89.25   98.59
Pen.   99.59   98.58   99.21    94.57   99.09
Seg.   96.10   97.40   96.79    92.81   96.58
Sin.   97.44   96.56   97.31    91.68   97.12
Wav.   86.88   85.24   86.06    79.88   86.21
Win.   61.07   64.58   59.57    55.81   58.86

4.3.2. Per-class analysis

Fig. 4 shows the variation in accuracy per class of SVM trained using, respectively, the classic random forest based selection and the random instance selection method on dataset Sin-Square. These results show the effectiveness of our low-margin based strategy in handling minority classes (class C2 here). Our method selects all the instances of this smallest class using 49% of all training instances. According to the unsupervised margin definition (Eq. (1)), the instances of the smallest and most difficult class have the smallest margin values. Therefore, these instances have more chance of being selected by our method. Consequently, our technique will increase the accuracy of the most difficult and smallest class.

Table 4 presents the loss in minimum classification accuracy per class obtained on the test set, compared to SVM trained on all the training data, by NPPS, BEPS and our ensemble margin-based methods respectively. These results confirm once again the superiority of our method. The worst performance for our approach is on dataset Glass, a loss in accuracy of about 4% for RF and 7.91% for SVIS, while the loss induced by NPPS is twice as high (8% for the same dataset) and that of BEPS four times as high. The best performance achieved by our method is an increase in SVM per-class accuracy of 7%, compared to significantly lower increases for BEPS (3%) and NPPS (1.5%). Compared to SVM trained on the whole training set, our reduced training strategy (both classic and reduced ensembles) increased SVM per-class accuracy for four datasets. Let us notice that SVM failed to classify one of the classes of dataset Wine quality red, which explains the results of the minimum classification accuracy for that dataset.
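For clarity, the per-class criterion reported in Table 4 (the accuracy of the worst-classified class) can be computed as in the small sketch below; this helper is an illustrative assumption, not code from the paper.

# Sketch: minimum classification accuracy per class, i.e. the recall of the
# worst-classified class, computed from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def min_class_accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    per_class = np.diag(cm) / cm.sum(axis=1)   # recall of each class
    return per_class.min()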

Fig. 4. Improvement of SVM accuracy per class using ensemble margin-based selection with respect to random selection of training data on dataset Sin-Square.

Table 4. Loss in minimum classification accuracy (%) per class by NPPS, BEPS, RF and SVIS pattern selection methods (rank in parentheses).

Data   SVM     NPPS        BEPS        RF          SVIS
Gla.   75.00   8.33 (3)    25.00 (4)   4.16 (1)    7.91 (2)
Hea.   69.35   1.61 (4)    −3.23 (1)   0.23 (3)    −0.48 (2)
Iri.   90.90   0.00 (1)    4.53 (4)    4.09 (3)    3.18 (2)
Let.   92.61   1.31 (1)    30.60 (4)   3.00 (2)    4.52 (3)
Opt.   97.09   0.06 (1)    2.04 (4)    0.37 (2)    0.50 (3)
Pen.   99.25   2.54 (3)    2.45 (2)    2.19 (1)    2.74 (4)
Seg.   89.94   −0.05 (3)   3.90 (4)    −1.45 (1)   −0.94 (2)
Sin.   89.39   −1.51 (2)   0.00 (3)    −7.06 (1)   0.00 (3)
Wav.   81.96   1.77 (3)    4.85 (4)    −0.08 (1)   0.91 (2)
Win.   0.00    0.00 (1)    0.00 (1)    0.00 (1)    0.00 (1)
Average loss   1.41        7.01        0.54        1.83
Average rank   2.2         3.1         1.6         2.4

4.3.3. Execution times

Table 5 shows the minimum and maximum computing times of the four pattern selection methods achieved on the datasets of Table 1. The fastest method is SVIS, our fast ensemble margin-based method; it is more than 400 times faster than the slowest method, NPPS. These results were obtained on a machine with two 2.67 GHz Intel Xeon processors and 12 GB of RAM. Let us notice that BEPS is implemented on a Matlab platform while the three others are implemented in R.

Table 5. Execution times of selection methods (in seconds).

Method   Time (s)
NPPS     0.05 ≤ t ≤ 184.47
BEPS     0.06 ≤ t ≤ 83.32
RF       0.02 ≤ t ≤ 3.66
SVIS     0.01 ≤ t ≤ 0.44

5. Conclusion

We have proposed a new efficient heuristic to select support vector candidates using an unsupervised version of the ensemble margin. It consists in considering the lowest margin training instances as potential support vectors. This simple algorithm significantly reduces the SVM training set size while maintaining the performance of the SVM classification. SVIS, the fast alternative of our ensemble margin-based instance selection approach, has great potential for large data and high dimensionality problems. It achieved a performance comparable to NPPS and BEPS. Future work will focus on automatically setting the percentage M of lowest margin data to select as support vectors for SVM training. This procedure could involve the optimization of some function of the margins.

Acknowledgment

The authors thank Dr. Yuhua Li for providing the Matlab code of the BEPS method.

References

[1] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[2] H. Shin, S. Cho, Neighborhood property-based pattern selection for support vector machines, Neural Comput. 19 (2007) 816–855.
[3] S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.
[4] W. Wang, Z. Xu, A heuristic training for support vector regression, Neurocomputing 61 (2004) 259–275.
[5] T. Dietterich, Ensemble methods in machine learning, in: Proceedings of the First International Workshop on Multiple Classifier Systems, 2000, pp. 1–15.
[6] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[7] P. Bühlmann, B. Yu, Analyzing bagging, Ann. Stat. 30 (2002) 927–961.
[8] L. Breiman, Pasting small votes for classification in large databases and on-line, Mach. Learn. 36 (1999) 85–103.
[9] G. Martínez-Muñoz, A. Suárez, Out-of-bag estimation of the optimal sample size in bagging, Pattern Recogn. 43 (2010) 143–152.
[10] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[11] L. Guo, S. Boukir, N. Chehata, Support vectors selection for supervised learning using an ensemble approach, in: 20th IAPR International Conference on Pattern Recognition, 2010, pp. 37–40.
[12] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods: Support Vector Machines, MIT Press, 1999, pp. 185–208.
[13] B. Schölkopf, C. Burges, V. Vapnik, Extracting support data for a given task, in: First International Conference on Knowledge Discovery & Data Mining, 1995, pp. 252–257.
[14] J. Wang, P. Neskovic, L. Cooper, Selecting data for fast support vector machines training, in: K. Chen, L. Wang (Eds.), Trends in Neural Computation, vol. 35, Springer, 2007, pp. 61–84.
[15] A. Lyhyaoui, M. Martinez, I. Mora, M. Vaquez, J.L. Sancho, A. Figueiras-Vidal, Sample selection via clustering to construct support vector-like classifiers, IEEE Trans. Neural Networ. 10 (1999) 1474–1481.
[16] M. Barros de Almeida, A. de Padua Braga, J. Braga, SVM-KM: speeding SVMs learning with a priori cluster selection and k-means, in: Sixth Brazilian Symposium on Neural Networks, 2000, pp. 162–167.
[17] R. Koggalage, S. Halgamuge, Reducing the number of training samples for fast support vector machine classification, Neural Inform. Process. 2 (2004) 57–65.
[18] G. Guo, J.S. Zhang, Reducing examples to accelerate support vector regression, Pattern Recogn. Lett. 28 (2007) 2173–2183.
[19] B. Li, Q. Wang, J. Hu, A fast SVM training method for very large datasets, in: International Joint Conference on Neural Networks, IJCNN 2009, 2009, pp. 1784–1789.
[20] Y. Li, L. Maguire, Selecting critical patterns based on local geometrical and statistical information, IEEE Trans. Pattern Anal. 33 (2011) 1189–1201.
[21] Q. He, Z. Xie, Q. Hu, C. Wu, Neighborhood based sample and feature selection for SVM classification learning, Neurocomputing 74 (2011) 1585–1594.
[22] E. Marchiori, Graph-based discrete differential geometry for critical instance filtering, in: Lecture Notes in Computer Science, vol. 5782, 2009, pp. 63–78.
[23] M. Kawulok, J. Nalepa, Support vector machines training data selection using a genetic algorithm, in: Lecture Notes in Computer Science, vol. 7626, 2012, pp. 557–565.
[24] R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, Ann. Stat. 26 (1998) 1651–1686.
[25] R.E. Schapire, The strength of weak learnability, Mach. Learn. 5 (1990) 197–227.
[26] P.J. Bartlett, B. Schölkopf, D. Schuurmans, A.J. Smola (Eds.), Advances in Large Margin Classifiers, first ed., Neural Information Processing, The MIT Press, 2000.
[27] L. Guo, Margin framework for ensemble classifiers. Application to remote sensing data (Ph.D. thesis), University of Bordeaux, France, 2011.
[28] L. Guo, S. Boukir, Margin-based ordered aggregation for ensemble pruning, Pattern Recogn. Lett. 34 (2013) 603–609.
[29] D. Hernández-Lobato, G. Martínez-Muñoz, A. Suárez, How large should ensembles of classifiers be?, Pattern Recogn. 46 (2013) 1323–1336.
[30] W. Iba, P. Langley, Induction of one-level decision trees, in: International Workshop on Machine Learning, 1992, pp. 233–240.
[31] K. Bache, M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2013.
[32] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.