Author's Accepted Manuscript

Novel Approaches using Evolutionary Computation for Sparse Least Square Support Vector Machines
Danilo Avilar Silva, Juliana Peixoto Silva, Ajalmar R. Rocha Neto

To appear in: Neurocomputing
PII: S0925-2312(15)00694-3
DOI: http://dx.doi.org/10.1016/j.neucom.2015.05.034
Reference: NEUCOM15559
Received date: 6 October 2014; Revised date: 10 January 2015; Accepted date: 9 May 2015
www.elsevier.com/locate/neucom

Novel Approaches using Evolutionary Computation for Sparse Least Square Support Vector Machines

Danilo Avilar Silva, Juliana Peixoto Silva, Ajalmar R. Rocha Neto (corresponding author)

Federal Institute of Ceará, Department of Teleinformatics, Av. Treze de Maio, 2081, Benfica, CEP 60040-215, Fortaleza, Ceará, Brasil

Abstract

This paper introduces two new approaches to building sparse least square support vector machines (LSSVM) based on genetic algorithms (GAs) for classification tasks. LSSVM classifiers are an alternative to SVM ones because training an LSSVM classifier only requires solving a linear equation system instead of a quadratic programming optimization problem. However, the absence of sparseness in the Lagrange multiplier vector (i.e., the solution) is a significant problem for the effective use of these classifiers. In order to overcome this lack of sparseness, we propose both single- and multi-objective GA approaches that leave some support vectors out of the solution without degrading the classifier's accuracy, and in some cases even improving it. The main idea is to leave out outliers, non-relevant patterns, and patterns that may be corrupted with noise, which would otherwise prevent the classifiers from achieving higher accuracies along with a reduced set of support vectors. Differently from previous works, genetic algorithms are used here to obtain sparseness, not to find the optimal values of the LSSVM hyper-parameters.

Keywords: Least Square Support Vector Machines, Pruning Methods, Evolutionary Computation

Email addresses: [email protected] (Danilo Avilar Silva), [email protected] (Juliana Peixoto Silva), [email protected] (Ajalmar R. Rocha Neto)


1. Introduction

Evolutionary Computation tools such as Genetic Algorithms (GAs) have been used to solve optimization problems in many different real-world areas, such as bioinformatics, medicine, economics and chemistry [1, 2, 3, 4, 5]. GAs are optimization methods which can be used to generate useful solutions to search problems. Due to the underlying features of GAs, some optimization problems can be solved without assuming linearity, differentiability, continuity or convexity of the objective function. Unfortunately, these desirable characteristics are not found in several mathematical methods when applied to the same kind of problems.

GAs have been used with several types of models in classification tasks in order to tune their parameters [6, 7, 8]. Similarly, other meta-heuristic methods have also been used to tune classifier parameters [9, 10, 11]. As examples of such classifiers, we point out large margin classifiers such as Support Vector Machines (SVMs) [6, 7, 8] and Least Square Support Vector Machines (LSSVMs) [12, 13, 14, 15]. In these works, GAs are mostly applied to search for the kernel and classifier parameters.

A theoretical advantage of large margin classifiers such as Support Vector Machines [16] concerns the empirical and structural risk minimization, which balances the complexity of the model against its success at fitting the training data, along with the production of sparse solutions [17]. By sparseness we mean that the decision hyperplane built by the induced classifier can be written in terms of a relatively small number of input examples, the so-called support vectors (SVs), which usually lie close to the decision border between the two classes. In practice, however, it is observed that the application of different training approaches to the same kernel-based machine over identical training sets yields distinct degrees of sparseness [18], i.e. produces solutions with a greater number of SVs than are strictly necessary.

The LSSVM is an alternative to the standard SVM formulation [16]. A solution for the LSSVM is achieved by solving linear KKT (Karush-Kuhn-Tucker) systems in a least square sense. In fact, the solution follows directly from solving a linear equation system, instead of a QP optimization problem. On the one hand, it is in general easier and less computationally intensive to solve a linear system than a QP problem. On the other hand, the resulting solution is far from sparse, in the sense that it is common to have all training samples being used as SVs.

Besides that, due to the size of the matrix built from the data in the linear system resulting from the LSSVM primal problem, the LSSVM formulation is not suitable for handling very large datasets, since such a matrix must be inverted. Such datasets can be dealt with by Fixed-Size Support Vector Machines instead of LSSVM [19].

To handle the lack of sparseness in SVM and LSSVM solutions, several reduced set (RS) and pruning methods have been proposed, respectively. These methods comprise a set of techniques aiming at simplifying the internal structure of SVM and LSSVM models, while keeping the decision boundaries as similar as possible to the original ones. The basic idea behind this kind of method was apparently first introduced by Burges [20]. RS and pruning methods are very useful in reducing the computational complexity of the original models, since they speed up the decision process by reducing the number of SVs. They are particularly important for handling large datasets, when a great number of data samples may be selected as support vectors, either by pruning less important SVs [21, 22, 23, 24] or by constructing a smaller set of training examples [25, 26, 27, 28, 29, 30, 31, 32], often with minimal impact on performance. A smaller number of SVs is also useful for a better understanding of the internal structure of SVM and LSSVM models by means of more succinct prototype-based rules [33]. The general idea of RS and pruning methods is to reduce the out-of-sample prediction time. Thus, in general, these methods require some extra computation at training time, but they save time at testing by reducing the number of support vectors.

Few of these methods provide explicit SVM and LSSVM solutions (see e.g. the work in [34]). Most of the works rely on selective sampling or active learning techniques to reduce the number of training data samples. Such techniques aim at selecting a representative subset of instances from the original dataset [35, 36, 37]. Active learning optimizes learning from a minimum number of data samples, selected from the entire dataset by means of some clever (deterministic or probabilistic) heuristic. Usually, it involves an iterative process that avoids selecting redundant or non-informative samples. When applied to SVM (LSSVM) training, the model is constructed on an initial small subset of samples. The trained model is then used to query new samples to add to the existing training set, and this step is repeated until convergence. This approach essentially selects data samples close to the decision border, since they have higher chances of becoming support vectors.
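As a rough illustration of the selective-sampling loop just described (a generic sketch under our own naming, not an implementation of any specific method from [35, 36, 37]), the following Python fragment starts from a small random subset and repeatedly queries the pool samples closest to the current decision border; train_fn and score_fn are hypothetical placeholders for a trainer and its decision function.

```python
import numpy as np

def active_learning_loop(X, y, train_fn, score_fn,
                         init_size=50, batch=20, rounds=10):
    """Generic selective-sampling loop (illustrative sketch).

    train_fn(X_s, y_s) -> model and score_fn(model, X) -> f(x) are
    placeholders for any SVM/LSSVM trainer and its decision function.
    """
    rng = np.random.default_rng(0)
    selected = list(rng.choice(len(X), size=init_size, replace=False))
    for _ in range(rounds):
        model = train_fn(X[selected], y[selected])
        pool = np.setdiff1d(np.arange(len(X)), selected)
        # Samples with the smallest |f(x)| lie closest to the decision border
        margins = np.abs(score_fn(model, X[pool]))
        selected += list(pool[np.argsort(margins)[:batch]])
    return train_fn(X[selected], y[selected]), selected
```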

In order to combine the aforementioned advantages of LSSVM classifiers and genetic algorithms, this work aims at putting both of them to work together to reduce the number of support vectors while achieving performance equivalent to (and in some cases superior to) that of standard full-set LSSVM classifiers. Our two proposals deal with single- and multi-objective optimization problems, respectively. The first one finds sparse classifiers based on a single-objective genetic algorithm, with the accuracy of the classifier over the training dataset as the fitness function. The second one finds sparse classifiers based on an a priori multi-objective genetic algorithm that considers the accuracy of the classifier, the pruning rate and a cost of pruning patterns. To do so, we also propose a new a priori multi-objective fitness function which incorporates a cost of pruning in its formulation. Differently from previous works, genetic algorithms are used to obtain sparseness, not to find the optimal values of the kernel and classifier parameters.

The remaining part of this paper is organized as follows. In Section 2 we review the fundamentals of LSSVM classifiers. In Section 3 we briefly present some methods for obtaining sparse LSSVM classifiers, namely Pruning LSSVM and IP-LSSVM. In Section 4 we introduce the genetic algorithm concepts necessary to understand our proposals, which are presented in Sections 5 and 6. In Section 7 we present our simulations, and the paper is concluded in Section 8.

2. LSSVM Classifiers

The formulation of the primal problem for the LSSVM [38] is given by

\min_{\mathbf{w}, b, \xi_i} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \gamma\frac{1}{2}\sum_{i=1}^{L}\xi_i^2, \qquad (1)

subject to y_i[(\mathbf{w}^T\mathbf{x}_i) + b] = 1 - \xi_i, \quad i = 1, \ldots, L,

where \gamma is a positive cost parameter; the slack variables \{\xi_i\}_{i=1}^{L} can assume negative values. The Lagrangian function for the LSSVM is then written as

L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \gamma\frac{1}{2}\sum_{i=1}^{L}\xi_i^2 - \sum_{i=1}^{L}\alpha_i\big(y_i(\mathbf{x}_i^T\mathbf{w} + b) - 1 + \xi_i\big), \qquad (2)

where \{\alpha_i\}_{i=1}^{L} are the Lagrange multipliers.

The conditions for optimality, similarly to the SVM problem, can be given by the partial derivatives

\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{L}\alpha_i y_i \mathbf{x}_i,

\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{L}\alpha_i y_i = 0,

\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \alpha_i} = 0 \;\Rightarrow\; y_i(\mathbf{x}_i^T\mathbf{w} + b) - 1 + \xi_i = 0, \qquad (3)

\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = \gamma\xi_i.

Thus, based on Eq. (3), one can formulate a linear system \mathbf{A}\mathbf{x} = \mathbf{B} in order to represent this problem as

\begin{bmatrix} 0 & \mathbf{y}^T \\ \mathbf{y} & \Omega + \gamma^{-1}\mathbf{I} \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (4)

where \Omega \in \mathbb{R}^{L \times L} is a matrix whose entries are given by \Omega_{i,j} = y_i y_j \mathbf{x}_i^T\mathbf{x}_j, \; i, j = 1, \ldots, L. In addition, \mathbf{y} = [y_1 \cdots y_L]^T and the symbol \mathbf{1} denotes a vector of ones with dimension L. The solution of this linear system can be computed by direct inversion of matrix \mathbf{A} as follows:

\mathbf{x} = \mathbf{A}^{-1}\mathbf{B} = \begin{bmatrix} 0 & \mathbf{y}^T \\ \mathbf{y} & \Omega + \gamma^{-1}\mathbf{I} \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}. \qquad (5)

For nonlinear kernels, one should use \Omega_{i,j} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j).
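As an illustration of Eqs. (4)-(5), the following minimal NumPy sketch assembles the matrix A for a linear kernel and solves the system directly. The function and variable names (train_lssvm, X, y, gamma) are our own choices for illustration, not the authors' original code.

```python
import numpy as np

def train_lssvm(X, y, gamma):
    """Solve the LSSVM linear system of Eq. (4) for a linear kernel.

    X: (L, d) training patterns, y: (L,) labels in {-1, +1}, gamma: cost parameter.
    Returns the bias b and the Lagrange multipliers alpha (one per pattern).
    """
    L = X.shape[0]
    # Omega_{ij} = y_i y_j x_i^T x_j (use K(x_i, x_j) instead for a nonlinear kernel)
    Omega = np.outer(y, y) * (X @ X.T)
    # Assemble the block matrix A and the right-hand side B of Eq. (4)
    A = np.zeros((L + 1, L + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(L) / gamma
    B = np.r_[0.0, np.ones(L)]
    x = np.linalg.solve(A, B)          # direct solution, as in Eq. (5)
    b, alpha = x[0], x[1:]
    return b, alpha

def predict(X_train, y_train, alpha, b, X_new):
    # Decision function f(x) = sign(sum_i alpha_i y_i x_i^T x + b)
    return np.sign((alpha * y_train) @ (X_train @ X_new.T) + b)
```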

3. Sparse Classifiers

In this section, we present two of the most important methods used to obtain sparse classifiers: Pruning LSSVM and IP-LSSVM. Both of them reduce the number of support vectors based on the values of the Lagrange multipliers. These methods are compared with our proposals in Section 7.

3.1. Pruning LSSVM

Pruning LSSVM was proposed by Suykens in 2000 [24]. In this method, support vectors, and therefore their respective input vectors, are eliminated according to the absolute value of their Lagrange multipliers. The process is carried out recursively, with gradual vector elimination at each iteration, until a stop criterion is reached, which is usually associated with a decrease in performance on a validation set. Vectors are eliminated by setting the corresponding Lagrange multipliers to zero, without any change in matrix dimensions. The current linear system must be solved again for each new reduced set at every iteration, and the reduced set is selected from the best iteration. This is a multi-step method, since the linear system needs to be solved many times until the convergence criterion is reached.

3.2. IP-LSSVM

IP-LSSVM [26] uses a criterion in which patterns close to the separating surface and far from the support hyperplanes are very likely to become support vectors. In fact, since patterns with α_i ≫ 0 are likely to become support vectors, the margin limits are located closer to the separating surface than the support hyperplanes. As can be seen, the idea applied in IP-LSSVM training is based on how SVM classifiers work. According to these arguments, the relevance criteria proposed in the IP-LSSVM work can be described as:

• x_i with α_i ≫ 0 is a support vector. In this case, x_i is placed on the border between the two classes or in the opposite class area, corresponding to the vectors associated with non-zero α_i > 0 in SVM classifiers solved by Quadratic Programming (QP) algorithms.

• x_i with α_i ≈ 0 (α_i ≥ 0) is removed. In this situation, x_i is correctly classified and close to the support hyperplane, corresponding to vectors associated with α_i = 0 in SVM classifiers solved by QP algorithms.

• x_i with α_i < 0 or α_i ≪ 0 is eliminated. In this case, x_i is correctly classified and far from the decision surface, corresponding to vectors associated with α_i = 0 in QP SVM classifiers.

These criteria are applied in IP-LSSVM in order to eliminate non-relevant columns of the original matrix A and to build a non-square reduced matrix A2 to be used a posteriori. The eliminated columns correspond to the least relevant vectors for the classification problem, selected according to their Lagrange multiplier values. The rows of A are not removed, because their elimination would lead to a loss of labeling information and of performance [39].


The first step is accomplished by using the matrix inverse to solve the system of linear equations represented by Eq. (4); the solution of this system is x = A^{-1}B. The first element of x is discarded, because it corresponds to the bias value and only the α values are of interest in this vector-elimination phase. In the second step, a system A_2 x_2 = b_2 is solved using the pseudo-inverse, yielding the solution x_2. The training process of this sparse classifier can be described as follows:

1. The system of linear equations presented in Eq. (4) is solved, with all training vectors, using x = A^{-1}B, since A is a square matrix.
2. The parameter τ ∈ [0, 1] defines the fraction of training vectors that will be considered support vectors.
3. The training vectors are ordered by their α values.
4. The fraction 1 − τ of the training data that corresponds to the smaller α values is selected.
5. The non-square matrix A_2 is generated by removing from A the columns associated with the selected elements of α.
6. The new system of linear equations, represented by A_2 x_2 = b_2, is solved as x_2 = A_2^† b_2, where A_2^† = (A_2^T A_2)^{-1} A_2^T is the pseudo-inverse.
7. The training points kept in A_2 are the support vectors.
8. The α and b values are obtained from the solution x_2.

It is also important to mention a related form of pruning, proposed in [40], that retains only the positive values in the Lagrange multiplier vector.
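The two-step procedure above can be sketched as follows. This is only an illustrative reading of steps 1-8 (the helper name ip_lssvm_prune and the use of NumPy's pinv are our assumptions), not the reference implementation of [26].

```python
import numpy as np

def ip_lssvm_prune(A, B, tau):
    """Two-step pruning in the spirit of IP-LSSVM (illustrative sketch).

    A, B: the full (L+1) x (L+1) system of Eq. (4) and its right-hand side.
    tau: fraction of the training vectors kept as support vectors.
    Returns the reduced solution x2 = [b, alpha_sv] and the kept pattern indices.
    """
    L = A.shape[0] - 1
    # Step 1: solve the full square system; discard the bias (first element),
    # since only the alpha values drive the ranking.
    x = np.linalg.solve(A, B)
    alpha = x[1:]
    # Step 2: keep the tau fraction with the largest alpha values.
    n_keep = int(np.ceil(tau * L))
    keep = np.argsort(alpha)[::-1][:n_keep]
    # Remove only the columns of the discarded patterns (pattern i -> column i + 1);
    # all rows are kept so that no labeling information is lost.
    cols = np.r_[0, keep + 1]
    A2 = A[:, cols]
    # Solve the non-square system with the pseudo-inverse A2^+ = (A2^T A2)^{-1} A2^T
    x2 = np.linalg.pinv(A2) @ B
    return x2, keep
```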

4. Genetic Algorithms

A genetic algorithm is a search meta-heuristic inspired by mechanisms of natural evolution, such as inheritance, mutation, natural selection and crossover. This meta-heuristic can be used to generate useful solutions to optimization and search problems. Due to the characteristics of GA methods, some kinds of problems are easier to solve with a GA than with mathematical methods that have to rely on the assumption of linearity, differentiability, continuity or convexity of the objective function.

In a genetic algorithm, a population of candidate individuals (or solutions) to an optimization problem is evolved toward better solutions by natural selection, i.e., guided by a fitness function. In this population, each individual has a set of genes (a gene vector), called a chromosome, which can be changed by mutation or combined, generation by generation, with another one to build new individuals through reproduction processes that use crossover. The most common way of representing solutions is in binary format, i.e., strings of 0s and 1s, although other kinds of encoding are also possible. The first population (a set of solutions) is generated in a pseudo-random way and evolves in cycles, usually called generations. The value of the fitness function for each individual ranks how 'good' that solution is. The fitness value is computed after decoding the chromosome, and the best solution of the problem to be solved has the best fitness value. This value is used to guide the reproduction process, in which individuals with high fitness values have a greater chance of spreading their genes over the population. The standard implementation of natural selection in genetic algorithms is the roulette wheel. After the selection process, the selected individuals are used as input to the other genetic operators: crossover and mutation. The crossover operator combines two chromosome strings, whereas mutation modifies a few bits of a single chromosome. A single-objective GA can be outlined as follows.

1. Initiate t = 0, where t stands for the generation;
2. Generate the initial population P(t) randomly;
3. For each individual i in P(t):
4. Evaluate the fitness function;
5. While the stopping criterion is not reached:
6. Select some individuals from P(t);
7. Apply the crossover operator to the selected individuals;
8. Apply the mutation operator to the selected individuals;
9. Compute t = t + 1;
10. Evaluate the fitness function for the new population P(t);
11. Select the best individual or solution.
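For concreteness, a minimal Python skeleton of such a single-objective GA over binary chromosomes is sketched below. It assumes a non-negative fitness function and plain generational replacement, and it omits refinements such as elitism; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def roulette_select(pop, fitness, n):
    # Roulette-wheel selection: probability proportional to (non-negative) fitness
    p = fitness / fitness.sum()
    idx = rng.choice(len(pop), size=n, p=p)
    return pop[idx]

def crossover(parent_a, parent_b):
    # Single-point crossover of two binary chromosomes
    cut = rng.integers(1, len(parent_a))
    return np.r_[parent_a[:cut], parent_b[cut:]]

def mutate(chrom, rate=0.01):
    # Flip each bit with a small probability
    flip = rng.random(len(chrom)) < rate
    return np.where(flip, 1 - chrom, chrom)

def run_ga(fitness_fn, n_genes, pop_size=50, generations=100):
    pop = rng.integers(0, 2, size=(pop_size, n_genes))       # initial population
    for _ in range(generations):
        fit = np.array([fitness_fn(c) for c in pop])          # evaluate fitness
        parents = roulette_select(pop, fit, 2 * pop_size)     # selection
        children = [mutate(crossover(parents[2 * i], parents[2 * i + 1]))
                    for i in range(pop_size)]                 # crossover + mutation
        pop = np.array(children)
    fit = np.array([fitness_fn(c) for c in pop])
    return pop[np.argmax(fit)]                                # best individual
```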

5. Proposal #1: (Single-Objective) Genetic Algorithm for Sparse LSSVM (GAS-LSSVM)

Our first proposal, called GAS-LSSVM, is based on a single-objective genetic algorithm which aims at improving (maximizing) the classifier's accuracy over the training (or validation) dataset.

The individual or chromosome in our simulations is represented by a binary vector of genes, where each gene is set to "one" (true) if the corresponding pattern will be used as a support vector in the training process and to "zero" (false) otherwise. Our fitness function is the value of the resulting accuracy of a classifier when we take into account the genes that were set to "one". In this approach, each individual has as many support vectors as the number of genes set to "one", except when some of them receive a zero value for their Lagrange multipliers. GAS-LSSVM is guided by accuracy; however, in parallel, this approach achieves sparse classifiers due to the added flexibility, i.e., the ability of having or not having a certain pattern as a support vector. Thus, each individual is a solution in terms of which patterns belong to the set of support vectors. In this way, some patterns will not belong to the set of support vectors, so those which are outliers, corrupted with noise or even non-relevant can be eliminated. For this reason, our population can evolve to the best solution or at least to a better one. In a similar way to IP-LSSVM, our proposal only removes the non-relevant columns related to the patterns we intend to leave out of the support vector set. Therefore, we keep all of the rows, which means keeping the constraints, in order to avoid a loss of labeling information and of performance.

5.1. Individuals or chromosomes

In order to avoid misunderstanding, we present simple examples of how to construct individuals using GAS-LSSVM (or MOGAS-LSSVM). Let us consider the matrix A of the LSSVM linear system for a very small training set with only four patterns, as presented in Fig. (1). In our proposal, to keep all the patterns in the support vector set, we have to set the gene vector to [1 1 1 1].

Figure 1: Matrix A without leaving any pattern out of the support vector set. This corresponds to an individual with gene vector equal to [1 1 1 1].

For an individual without the fourth pattern in the SV set, we need to set the gene vector to [1 1 1 0].

This means the fifth column has to be eliminated, as presented in Fig. (2).

Figure 2: Matrix A1, leaving only the fourth pattern out of the support vector set. This corresponds to an individual with gene vector equal to [1 1 1 0]; in this situation, the fifth column is removed.
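The mapping from a chromosome to its reduced matrix can be sketched as follows (an illustrative NumPy fragment; the helper name reduced_matrix is ours):

```python
import numpy as np

def reduced_matrix(A, genes):
    """Build the reduced matrix encoded by a binary gene vector.

    The first column (bias) is always kept; the column of pattern i is i + 1,
    so a gene equal to 0 drops that column while every row is preserved.
    """
    genes = np.asarray(genes, dtype=bool)
    cols = np.r_[0, np.flatnonzero(genes) + 1]
    return A[:, cols]

# Example with four patterns: the chromosome [1, 1, 1, 0] of Fig. 2
# removes the fifth column of the 5 x 5 matrix A.
A = np.arange(25, dtype=float).reshape(5, 5)   # placeholder for the LSSVM matrix
A1 = reduced_matrix(A, [1, 1, 1, 0])
print(A1.shape)   # (5, 4)
```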

Generally speaking, in order to eliminate from the support vector set the pattern described by the i-th gene (g_i) of the vector g = [g_1 g_2 g_3 g_4], it is necessary to remove the (i + 1)-th column. It is worth emphasizing again that we must not remove the first column, in order to avoid a loss of performance.

5.2. Fitness Function for GAS-LSSVM

Our single-objective fitness function is the classifier's accuracy over the training (or validation) set, obtained by using the pseudo-inverse method to solve the LSSVM linear system. As stated before, we use the pseudo-inverse method because the matrix that results from eliminating the patterns whose gene value equals zero is non-square.

5.3. GAS-LSSVM Algorithm

The GAS-LSSVM algorithm for training a classifier can be described as follows.

1. Initiate t = 0, where t stands for the generation;
2. Randomly generate the initial population P(t), i.e., the sets of genes, and build their related matrices {A_i}_{i=1}^{s}, where s is the number of individuals at generation t;
3. For each individual i in P(t):
4. Solve the LSSVM linear system (x_i = (A_i^T A_i)^{-1} A_i^T b_i), where (A_i^T A_i)^{-1} A_i^T is the pseudo-inverse;
5. Evaluate the fitness function, i.e., the training set accuracy;
6. While t ≤ t_max, where t_max is the maximum number of generations:
7. Select individuals i and their matrices A_i;
8. Apply the crossover operation to the selected individuals;
9. Apply the mutation operation to the selected individuals;
10. Compute t = t + 1;
11. Build a matrix A_i for each individual in P(t + 1);
12. Evaluate the training set accuracy for each A_i;
13. Select the best individual or solution, i.e., A_o.

6. Proposal #2: Multi-Objective Genetic Algorithm for Sparse LSSVM (MOGAS-LSSVM)

Our second proposal, named MOGAS-LSSVM, is based on an a priori multi-objective genetic algorithm which aims at both improving (maximizing) the classifier's accuracy over the training (or validation) dataset and reducing (minimizing) the number of support vectors. Similarly, the individual or chromosome in our simulations is represented by a binary vector of genes, where each gene is set to "one" (true) if the corresponding pattern will be used as a support vector in the training process and to "zero" (false) otherwise. Our scalarization-based fitness function [41] is a balance between the value of the resulting accuracy of a classifier when we take into account the genes that were set to "one" and the proportion of support vectors pruned relative to the number of training patterns. In this approach, each individual also has as many support vectors as the number of genes set to "one", except when some of them receive a zero value for their Lagrange multipliers.

6.1. Fitness Function for MOGAS-LSSVM

With respect to individual modeling, MOGAS-LSSVM is very similar to GAS-LSSVM. However, MOGAS-LSSVM is very different in terms of fitness function, since this improvement in the fitness function requires a more complex modeling than that done by the fitness function of the GAS-LSSVM classifier. The MOGAS-LSSVM classifier tries to both maximize the accuracy and minimize the number of support vectors, but this optimization process takes into account a certain cost θ of pruning patterns which belong to the support vector set. This can be achieved as follows:

\mathrm{fitness} = 1 - \frac{(1 - \mathrm{accuracy}) + (1 - \theta)\cdot \mathrm{reduction}}{2 - \theta}, \qquad (6)

where θ ∈ [0, 1] is a parameter of MOGAS-LSSVM and

\mathrm{reduction} = \frac{\#\,\text{training patterns pruned}}{\#\,\text{training patterns}}. \qquad (7)
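A direct transcription of Eqs. (6)-(7) is sketched below; the function name mogas_fitness and its argument names are illustrative.

```python
def mogas_fitness(accuracy, n_pruned, n_train, theta):
    """Scalarized fitness of Eq. (6).

    accuracy: training (or validation) accuracy in [0, 1].
    n_pruned / n_train: the reduction of Eq. (7).
    theta: cost of pruning in [0, 1]; with theta = 1 the reduction term vanishes
    and the fitness equals the accuracy.
    """
    reduction = n_pruned / n_train
    return 1.0 - ((1.0 - accuracy) + (1.0 - theta) * reduction) / (2.0 - theta)

# With theta = 0 the accuracy and the reduction are weighted equally:
print(mogas_fitness(accuracy=0.85, n_pruned=180, n_train=248, theta=0.0))
```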

It is easy to see that the values of the fitness function lie in [0, 1]: when the cost θ is high (equal to one), the fitness depends only on the accuracy; when the cost θ is low (equal to zero), the fitness depends on the accuracy and, with equal weight, on the reduction.

6.2. MOGAS-LSSVM Algorithm

The MOGAS-LSSVM algorithm for training a classifier can be described as follows.

1. Initiate t = 0, where t stands for the generation;
2. Randomly generate the initial population P(t), i.e., the sets of genes, and build their related matrices {A_i}_{i=1}^{s}, where s is the number of individuals at generation t;
3. For each individual i in P(t):
4. Solve the LSSVM linear system (x_i = (A_i^T A_i)^{-1} A_i^T b_i), where (A_i^T A_i)^{-1} A_i^T is the pseudo-inverse;
5. Evaluate the fitness function, i.e., the function presented in Eq. (6);
6. While t ≤ t_max, where t_max is the maximum number of generations:
7. Select individuals i and their matrices A_i;
8. Apply the crossover operation to the selected individuals;
9. Apply the mutation operation to the selected individuals;
10. Compute t = t + 1;
11. Build a matrix A_i for each individual in P(t + 1);
12. Evaluate the training set accuracy for each A_i;
13. Select the best individual or solution, i.e., A_o.

7. Simulations and Discussion

For the simulations carried out and presented below, 80% of the data examples were randomly selected for training purposes. The remaining 20% of the examples were used for testing the classifiers' generalization performance. Tests with real-world benchmarking datasets were also carried out in this work.

We used three UCI datasets (Diabetes, Haberman and Breast Cancer) and the vertebral column pathologies dataset described in [42]. For this study, we transformed the original three-class VCP problem into a binary one by aggregating the two classes of pathologies, disc hernia and spondylolisthesis, into a single one; the normal class remained unchanged. In addition to these datasets, we evaluated our proposals on two other well-known datasets, Ripley (RIP) and Banana (BAN). Information about the evaluated datasets, such as name, abbreviation and number of patterns (# Patterns), is presented in Table 1.

Initially, the GA randomly creates a population of feasible solutions. Each solution is a string of binary values in which each bit represents the presence (1) or absence (0) of a pattern in the support vector set (i.e., whether the Lagrange multiplier related to a certain pattern is zero or not). For the next generation, the best 10% of the individuals are kept by an elitist selection scheme, 80% of the new individuals are generated by applying the crossover operator, and the remaining individuals are obtained by mutation.

All the LSSVM parameters were taken from [43, 44]. In general, the idea is to obtain the parameter by any method (such as cross-validation, trial and error, and so on) for the standard LSSVM and then, in order to be fair, to fix the value of gamma for the other classifiers based on the parameter obtained for the standard LSSVM. This is necessary because, otherwise, we would have different feature spaces resulting from the kernel method and a certain value of gamma. In this situation, the problem (i.e., how the patterns lie in the multivariate space) would likely not be the same, and thus a different set of Lagrange multipliers (with a different amount) would be required.

Table 1: List of the datasets classified in this work.

Data Set                        Abbreviation    # Patterns
Vertebral Column Pathologies    VCP             310
Pima Indians Diabetes           PID             768
Breast Cancer Wisconsin         BCW             683
Haberman                        HAB             306
Ripley                          RIP             1250
Banana                          BAN             1001

Table 2: Results for the LSSVM, IP-LSSVM, GAS-LSSVM and MOGAS-LSSVM classifiers with 80% (20%) of the full dataset for training (testing).

Dataset  Model          γ     Accuracy      # TP   # SVs   Red.
VCP      LSSVM          0.05  81.2 ± 4.9    248    248.0   −
VCP      IP-LSSVM       0.05  75.4 ± 8.6    248    184.0   25.8%
VCP      GAS-LSSVM      0.05  84.6 ± 3.5    248    116.5   53.0%
VCP      MOGAS-LSSVM    0.05  84.3 ± 4.5    248    53.0    78.6%
HAB      LSSVM          0.04  73.8 ± 5.0    245    245.0   −
HAB      IP-LSSVM       0.04  70.6 ± 4.9    245    223.0   9.0%
HAB      GAS-LSSVM      0.04  76.1 ± 4.5    245    105.8   56.8%
HAB      MOGAS-LSSVM    0.04  75.7 ± 4.6    245    57.2    76.6%
BCW      LSSVM          0.04  96.7 ± 1.1    546    546.0   −
BCW      IP-LSSVM       0.04  96.8 ± 1.1    546    284.0   48.0%
BCW      GAS-LSSVM      0.04  96.6 ± 1.3    546    273.1   49.9%
BCW      MOGAS-LSSVM    0.04  96.5 ± 1.1    546    146.8   73.1%
PID      LSSVM          0.04  75.8 ± 2.7    614    614.0   −
PID      IP-LSSVM       0.04  74.0 ± 5.3    614    503.0   18.1%
PID      GAS-LSSVM      0.04  77.8 ± 3.1    614    280.4   54.3%
PID      MOGAS-LSSVM    0.04  77.7 ± 2.7    614    199.6   67.5%
RIP      LSSVM          0.04  87.7 ± 1.62   1000   1000    −
RIP      IP-LSSVM       0.04  86.4 ± 3.35   1000   496.0   50.4%
RIP      GAS-LSSVM      0.04  87.7 ± 2.2    1000   478.3   52.2%
RIP      MOGAS-LSSVM    0.04  88.1 ± 2.7    1000   346.3   65.4%
BAN      LSSVM          0.04  96.9 ± 1.3    800    800.0   −
BAN      IP-LSSVM       0.04  96.5 ± 1.4    800    394.0   50.8%
BAN      GAS-LSSVM      0.04  96.7 ± 1.2    800    394.2   50.7%
BAN      MOGAS-LSSVM    0.04  96.4 ± 1.2    800    245.5   69.4%

Nevertheless, during the simulations, we were mainly concerned with results obtained using values of gamma from the literature that achieve accuracies close to the state of the art [45].

In Table 2, we report performance metrics (mean value and standard deviation of the recognition rate) on the test set, averaged over 30 independent runs with θ equal to zero. We also show the average number of SVs (# SVs), the number of training patterns (# TP), the values of the parameter γ (LS-SVM) and the reduction obtained (Red.).

In order to compare our novel approach with other proposed methods (such as Pruning LSSVM) and to be in accordance with their assessment methodology, we also carried out simulations splitting our data set into three subsets.

In this configuration, 60% of the patterns were placed in the first subset (training set), 20% of the examples were used for validation purposes and, as usual, the remaining 20% of the patterns were used for testing the classifiers' generalization performance. In Table 3 we show the results for LSSVM, Pruning LSSVM (P-LSSVM), GAS-LSSVM and MOGAS-LSSVM.

By analyzing these tables, one can conclude that the performances of the reduced-set classifiers (GAS-LSSVM and MOGAS-LSSVM) were equivalent to those achieved by the full-set classifiers. In some cases, as shown in Table 3 for VCP and Pima Diabetes, the performances of the reduced-set classifiers were even better. It is worth mentioning that, as expected, the multi-objective proposal achieves a much larger reduction in the number of support vectors than the single-objective one, since MOGAS-LSSVM incorporates the ability of reducing the number of support vectors into the training process.

In addition to the previous results, we show in Figure 3(a) the LSSVM and GAS-LSSVM accuracies and in Figure 3(b) the mean number of support vectors for training set sizes of 10%, 20%, ..., 80%, over 30 independent runs, for the VCP data set. We notice in this figure that the GAS-LSSVM accuracies are higher than the LSSVM ones when the training data set size is equal to or greater than 20%. We also notice that the MOGAS-LSSVM accuracies are higher than the LSSVM ones when the training data set size is greater than 50%. Moreover, the mean numbers of support vectors for MOGAS-LSSVM and GAS-LSSVM are lower than those for LSSVM, which emphasizes the importance of our proposals: even with a small training data set, our proposals achieve classifiers with a lower number of support vectors and higher accuracy than the LSSVM one.

Figure 4 depicts the accuracy and the mean number of support vectors for different costs of reduction (θ) and different training data set sizes. We can see in Figure 4(b) that the mean number of support vectors depends directly on θ. In this figure, as an example, for a training data set size of 80%, we have fewer support vectors with θ = 0 than with θ = 1. The same holds for the other training data set sizes.

Figure 5 shows the results for the reduction according to the cost of reduction (θ) obtained for each problem evaluated with 80% and 20% of the data for training and testing, respectively. Figure 6 similarly presents the reductions according to the cost of reduction (θ) obtained for each problem evaluated with 60%, 20% and 20% of the data for training, validation and testing, respectively. In Figures 5 and 6, one can notice that the accuracy is almost the same even when the value of θ is changed in the range [0, 1], for each and every data set evaluated.

Table 3: Results for the LSSVM, P-LSSVM, GAS-LSSVM and MOGAS-LSSVM classifiers with 60%, 20% and 20% of the full dataset for training, validation and testing, respectively.

Dataset  Model          γ     Accuracy     # TS   # SVs   Red.
VCP      LSSVM          0.05  80.1 ± 3.2   186    186.0   −
VCP      P-LSSVM        0.05  80.0 ± 4.7   186    120.5   35.2%
VCP      GAS-LSSVM      0.05  82.7 ± 5.8   186    92.6    50.1%
VCP      MOGAS-LSSVM    0.05  81.3 ± 5.3   186    29.4    84.2%
HAB      LSSVM          0.04  73.7 ± 3.6   184    184.0   −
HAB      P-LSSVM        0.04  74.2 ± 6.1   184    94.0    48.9%
HAB      GAS-LSSVM      0.04  73.0 ± 6.8   184    95.7    47.9%
HAB      MOGAS-LSSVM    0.04  75.1 ± 7.1   184    23.1    87.5%
BCW      LSSVM          0.04  96.9 ± 0.8   410    410.0   −
BCW      P-LSSVM        0.04  96.3 ± 1.5   410    210.3   48.7%
BCW      GAS-LSSVM      0.04  95.9 ± 1.8   410    206.2   49.7%
BCW      MOGAS-LSSVM    0.04  96.0 ± 1.6   410    91.0    77.8%
PID      LSSVM          0.04  75.8 ± 1.5   461    461.0   −
PID      P-LSSVM        0.04  75.6 ± 3.3   461    230.5   50.0%
PID      GAS-LSSVM      0.04  77.0 ± 3.5   461    230.8   49.9%
PID      MOGAS-LSSVM    0.04  76.9 ± 3.9   461    113.6   75.4%
RIP      LSSVM          0.04  87.9 ± 0.9   750    750     −
RIP      P-LSSVM        0.04  87.8 ± 1.7   750    377.0   49.7%
RIP      GAS-LSSVM      0.04  87.6 ± 2.1   750    371.2   50.5%
RIP      MOGAS-LSSVM    0.04  87.9 ± 2.1   750    220.6   70.6%
BAN      LSSVM          0.04  96.8 ± 0.6   601    601     −
BAN      P-LSSVM        0.04  97.1 ± 0.9   601    293     51.2%
BAN      GAS-LSSVM      0.04  97.0 ± 1.2   601    300.3   50.0%
BAN      MOGAS-LSSVM    0.04  97.0 ± 1.3   601    159.4   73.5%

These results show that our proposals are very robust in terms of accuracy. It is also possible to notice in those figures that the reduction increases as the value of θ rises. This can be observed in particular for the data sets HAB, PID, RIP and VCP, with both methodologies (80%-20% and 60%-20%-20%).


Figure 3: (a) Accuracy versus training data set size for the VCP problem. (b) Mean number of support vectors versus training data set size for the VCP problem.

Figure 4: (a) Accuracy versus training data set size, for θ in the range [0, 1], for the VCP problem. (b) Mean number of support vectors versus training data set size, for θ in the range [0, 1], for the VCP problem.


Figure 5: Accuracy and reduction according to the cost of reduction (θ) obtained for each problem evaluated with 80% and 20% of data for training and testing, respectively.

Figure 6: Accuracy and reduction according to the cost of reduction (θ) obtained for each problem evaluated with 60%, 20% and 20% of data for training, validation and testing, respectively.


8. Conclusions

Our proposals, called Genetic Algorithm for Sparse LSSVM (GAS-LSSVM) and Multi-Objective GAS-LSSVM (MOGAS-LSSVM), are based on single-objective and multi-objective genetic algorithms, respectively. Our first proposal is guided by the training (or validation) accuracy; however, in parallel, the approach achieves sparse classifiers due to the added flexibility, i.e., the ability of having or not having a certain pattern as a support vector. Our second one, based on a multi-objective genetic algorithm, is guided by the training (or validation) accuracy along with the reduction of support vectors. For this multi-objective proposal, we also propose a way of balancing the accuracy and reduction amounts. Based on the obtained results, GAS-LSSVM and MOGAS-LSSVM were able to reduce the number of support vectors and also improve the classifiers' performance. Moreover, the achieved results indicate that GAS-LSSVM and MOGAS-LSSVM work very well, providing a reduced number of support vectors while maintaining or increasing the classifiers' accuracy.

References

[1] S. Jauhari, S. Rizvi, Mining gene expression data focusing cancer therapeutics: A digest, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (3) (2014) 533–547.

[2] P. Faragó, C. Faragó, S. Hintea, M. Cîrlugea, An evolutionary multiobjective optimization approach to design the sound processor of a hearing aid, in: International Conference on Advancements of Medicine and Health Care through Technology, Vol. 44, 2014, pp. 181–186.

[3] D. Alvarez, R. Hornero, J. V. Marcos, F. del Campo, Feature selection from nocturnal oximetry using genetic algorithms to assist in obstructive sleep apnoea diagnosis, Medical Engineering & Physics 34 (8) (2012) 1049–1057.

[4] A. E. R. Palencia, G. E. M. Delgadillo, A computer application for a bus body assembly line using genetic algorithms, International Journal of Production Economics 140 (1) (2012) 431–438.

[5] Z. E. Brain, M. A. Addicoat, Using meta-genetic algorithms to tune parameters of genetic algorithms to find lowest energy molecular conformers, in: ALIFE, 2010, pp. 378–385.


[6] F. Samadzadegan, A. Soleymani, R. Abbaspour, Evaluation of genetic algorithms for tuning SVM parameters in multi-class problems, in: 2010 11th International Symposium on Computational Intelligence and Informatics (CINTI), 2010, pp. 323–328.

[7] J. Huang, X. Hu, F. Yang, Support vector machine with genetic algorithm for machinery fault diagnosis of high voltage circuit breaker, Measurement 44 (6) (2011) 1018–1027.

[8] Y. Ren, G. Bai, Determination of optimal SVM parameters by using GA/PSO, Journal of Computers 5 (8) (2010) 1160–1168.

[9] X. Zhang, X. Chen, Z. He, An ACO-based algorithm for parameter optimization of support vector machines, Expert Systems with Applications 37 (9) (2010) 6618–6628.

[10] I. Aydin, M. Karakose, E. Akin, A multi-objective artificial immune algorithm for parameter optimization in support vector machine, Applied Soft Computing 11 (1) (2011) 120–129.

[11] G. S. dos Santos, L. G. J. Luvizotto, V. C. Mariani, L. dos Santos Coelho, Least squares support vector machines with tuning based on chaotic differential evolution approach applied to the identification of a thermal process, Expert Systems with Applications 39 (5) (2012) 4805–4812.

[12] S. Mousavi, S. Iranmanesh, Least squares support vector machines with genetic algorithm for estimating costs in NPD projects, in: 2011 IEEE 3rd International Conference on Communication Software and Networks (ICCSN), 2011, pp. 127–131. doi:10.1109/ICCSN.2011.6014864.

[13] S. Dong, T. Luo, Bearing degradation process prediction based on the PCA and optimized LS-SVM model, Measurement 46 (9) (2013) 3143–3152.

[14] L. Yu, H. Chen, S. Wang, K. K. Lai, Evolving least squares SVM for stock market trend mining, IEEE Transactions on Evolutionary Computation 13 (1) (2009) 87–102.

[15] M. Mustafa, M. Sulaiman, H. Shareef, S. Khalid, Reactive power tracing in pool-based power system utilising the hybrid genetic algorithm and least squares SVM, IET Generation, Transmission & Distribution 6 (2) (2012) 133–141.

[16] V. N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.

[17] I. Steinwart, Sparseness of support vector machines, Journal of Machine Learning Research 4 (2003) 1071–1105.

[18] L. D'Amato, J. A. Moreno, R. Mujica, Reducing the complexity of kernel machines with neural growing gas in feature space, in: Proceedings of the IX Ibero-American Artificial Intelligence Conference (IBERAMIA'04), Vol. LNAI-3315, Springer, 2004, pp. 799–808.

[19] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, 2002 (ISBN 981-238-151-1).

[20] C. J. C. Burges, Simplified support vector decision rules, in: Proceedings of the 13th International Conference on Machine Learning (ICML'96), Morgan Kaufmann, 1996, pp. 71–77.

[21] D. Geebelen, J. A. K. Suykens, J. Vandewalle, Reducing the number of support vectors of SVM classifiers using the smoothed separable case approximation, IEEE Transactions on Neural Networks and Learning Systems 23 (4) (2012) 682–688.

[22] Y. Li, C. Lin, W. Zhang, Improved sparse least-squares support vector machine classifiers, Neurocomputing 69 (2006) 1655–1658.

[23] L. Hoegaerts, J. A. K. Suykens, J. Vandewalle, B. De Moor, A comparison of pruning algorithms for sparse least squares support vector machines, in: Proceedings of the 11th International Conference on Neural Information Processing (ICONIP'04), 2004, pp. 22–25.

[24] J. A. K. Suykens, L. Lukas, J. Vandewalle, Sparse least squares support vector machine classifiers, in: Proceedings of the 8th European Symposium on Artificial Neural Networks (ESANN'00), 2000, pp. 37–42.

[25] R. Peres, C. E. Pedreira, Generalized risk zone: Selecting observations for classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (7) (2009) 1331–1337.


[26] B. P. R. Carvalho, A. P. Braga, IP-LSSVM: A two-step sparse classifier, Pattern Recognition Letters 30 (2009) 1507–1515.

[27] A. Hussain, S. Shahbudin, H. Husain, S. A. Samad, N. M. Tahir, Reduced set support vector machines: Application for 2-dimensional datasets, in: Proceedings of the 2nd International Conference on Signal Processing and Communication Systems (ICSPCS'08), 2008, pp. 1–4.

[28] B. Tang, D. Mazzoni, Multiclass reduced-set support vector machines, in: Proceedings of the 23rd International Conference on Machine Learning (ICML'06), 2006, pp. 921–928.

[29] T. Downs, K. E. Gates, A. Masters, Exact simplification of support vector solutions, Journal of Machine Learning Research 2 (2002) 293–297.

[30] G. Fung, O. L. Mangasarian, Proximal support vector machine classifiers, in: F. Provost, R. Srikant (Eds.), Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), 2001, pp. 77–86.

[31] Y.-J. Lee, O. L. Mangasarian, SSVM: A smooth support vector machine for classification, Computational Optimization and Applications 20 (1) (2001) 5–22.

[32] O. L. Mangasarian, Generalized support vector machines, in: A. J. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, 2000, pp. 135–146.

[33] M. Blachnik, M. Kordos, Simplifying SVM with weighted LVQ algorithm, in: Proceedings of the 12th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'2011), Springer, 2011, pp. 212–219.

[34] T. Thies, F. Weber, Optimal reduced-set vectors for support vector machines with a quadratic kernel, Neural Computation 16 (9) (2004) 1769–1777.

[35] J. Wang, P. Neskovic, L. N. Cooper, Selecting data for fast support vector machines training, in: K. Shen, L. Wang (Eds.), Trends in Neural Computation, Vol. 35 of Studies in Computational Intelligence (SCI), Springer, 2007, pp. 61–84.


[36] J. L. Balcázar, Y. Dai, O. Watanabe, A random sampling technique for training support vector machines, in: Proceedings of the 12th International Conference on Algorithmic Learning Theory (ALT'01), 2001, pp. 119–134.

[37] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, Journal of Machine Learning Research 2 (2001) 45–66.

[38] J. A. K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (3) (1999) 293–300.

[39] J. Valyon, G. Horvath, A sparse least squares support vector machine classifier, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2004), Vol. 1, 2004, pp. –548.

[40] C. Lu, T. Van Gestel, J. A. K. Suykens, S. Van Huffel, I. Vergote, D. Timmerman, Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines, Artificial Intelligence in Medicine 28 (3) (2003) 281–306.

[41] J. Dubois-Lacoste, M. López-Ibáñez, T. Stützle, Combining two search paradigms for multi-objective optimization: Two-phase and Pareto local search, in: Hybrid Metaheuristics, Vol. 434 of Studies in Computational Intelligence, 2013, pp. 97–117.

[42] A. R. Rocha Neto, G. A. Barreto, On the application of ensembles of classifiers to the diagnosis of pathologies of the vertebral column: A comparative analysis, IEEE Latin America Transactions 7 (4) (2009) 487–496.

[43] A. Rocha Neto, G. A. Barreto, Opposite maps: Vector quantization algorithms for building reduced-set SVM and LSSVM classifiers, Neural Processing Letters 37 (1) (2013) 3–19.

[44] J. Peixoto Silva, A. Da Rocha Neto, Sparse least squares support vector machines via genetic algorithms, in: 2013 BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence (BRICS-CCI & CBIC), 2013, pp. 248–253.

[45] T. van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. de Moor, J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (1) (2004) 5–32.
