Applied Soft Computing 12 (2012) 819–825
A prototype classifier based on gravitational search algorithm

Abbas Bahrololoum, Hossein Nezamabadi-pour*, Hamid Bahrololoum, Masoud Saeed

Department of Electrical Engineering, Shahid Bahonar University of Kerman, P.O. Box 76169-133, Kerman, Iran
Article history: Received 4 October 2010; Received in revised form 4 April 2011; Accepted 23 October 2011; Available online 31 October 2011.

Keywords: Classification; Prototype classifier; Swarm intelligence; Gravitational search algorithm; UCI machine learning repository.
Abstract

In recent years, heuristic algorithms have been successfully applied to solve clustering and classification problems. In this paper, the gravitational search algorithm (GSA), one of the newest swarm-based heuristic algorithms, is used to provide a prototype classifier to face the classification of instances in multiclass data sets. The proposed method employs GSA as a global searcher to find the best positions of the representatives (prototypes). The proposed GSA-based classifier is used for data classification on some well-known benchmark sets, and its performance is compared with the artificial bee colony (ABC), particle swarm optimization (PSO), and nine other classifiers from the literature. The experimental results on twelve data sets from the UCI machine learning repository confirm that GSA can successfully be applied as a classifier to classification problems.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction

A classification problem is the task of assigning objects to one of several predefined categories. From a mathematical point of view, it is defined as a mapping from the input feature space into a set of labels. Many classification techniques have been developed in the field of machine learning (ML), the study of computer programs that improve their performance through the experience they gain over sets of prepared data known as data sets. Among these techniques are binary classifiers [1–4], decision tree classifiers [5–7], artificial neural network (ANN) classifiers [8–10], Bayesian classifiers [1,11], support vector machine (SVM) classifiers [12–16], and instance (prototype) based classifiers [17–22], on which we shed more light later in this paper. The choice of a proper classification technique depends on factors such as whether the data are noisy or noise-free and whether the inputs are discrete-valued or real-valued. It also depends on the type of hypothesis space representation, the choice of a suitable inductive bias, prior knowledge about the class probabilities of the input data (for instance, the parameters or shapes of the probability density functions), the size of the data set, the dimension of the inputs, the complexity of the classifier, etc. More details can be found in [1]. A number of optimization algorithms have been used for classification problems. The main reason for this, perhaps,
* Corresponding author. Tel.: +98 341 3235900; fax: +98 341 3235900. E-mail address: [email protected] (H. Nezamabadi-pour).
doi:10.1016/j.asoc.2011.10.008
comes from the nature of ML techniques, which carry uncertainty within themselves. Since the accuracy of such techniques largely depends on the quality of training, they need to be trained well. Training is gradual: as the data set members are fed into an ML technique, it becomes more experienced and its performance improves. This gradual improvement is usually achieved by a search through a high-dimensional space called the "hypothesis space" [1]. Different ML techniques may have different hypothesis space representations; for example, in concept learning, which is a binary classification method, the hypothesis space has a lattice structure, while in genetic algorithms (GAs) it is usually the space of bit strings. The common characteristic of all these techniques is that they search their own hypothesis spaces using different semantics or strategies. Such a space may contain local minima, and a side effect of bad training can be convergence to a local optimum, leading to poor performance of the technique; in classification, for instance, this may significantly increase the number of misclassifications. In other words, every ML technique contains an optimization subproblem of its own kind, and training it properly amounts to achieving correct convergence in its hypothesis space. Here, by "training properly" we mean an expansion of the search mechanism of the ML technique, which is expected to better handle problems such as becoming trapped in local minima. In this light, it is reasonable that some researchers have incorporated other optimization techniques into ML techniques to enhance their performance in terms of accuracy or speed of convergence.
As briefly pointed out above, one type of classifier is the prototype (instance based) classifier. These classifiers are referred to as model-free classification techniques. The training data in prototype classifiers are represented as a set of points in feature space, but the prototypes are usually chosen to be different from the training examples; one-nearest-neighbor (1-NN) is the exception [17]. Each prototype has a class label, and each new instance is classified by finding the closest prototype under some defined distance measure (typically Euclidean distance). Variations of prototype methods include K-means clustering [23,24], learning vector quantization (LVQ) [25,26], Gaussian mixtures (GM) [27–30], and K-nearest neighbor (K-NN) [31–33]. They vary according to two criteria: the number of prototypes used, and the way the prototypes are selected. In K-means clustering (KMC), the goal is to find clusters and their centers so as to minimize the within-cluster variance. After the initial cluster centers are chosen randomly, the algorithm repeats the following two steps until convergence: first, each training point is assigned to its closest cluster; second, each cluster center is replaced by the mean of the instances in its cluster (the mean is calculated per feature dimension). Clustering can be cast into a classification task in three steps: first, KMC is run on each class separately; next, each prototype is assigned a label; last, each new instance is classified according to the label of the nearest prototype. In LVQ, sampled training points attract their closest prototype when it classifies them correctly and repel it otherwise; the degree of attraction and repulsion is controlled through a parameter called the "learning rate".
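The three-step KMC recipe just described (run K-means per class, label each prototype with its class, assign new instances to the nearest prototype) might look as follows in Python. This is an illustrative sketch using NumPy, not code from the paper, and the function names are our own:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: returns k cluster centers for the given points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each training point to its closest center.
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2: replace each center with the mean of its assigned points.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers

def fit_prototypes(X, y, k=1):
    """Run K-means separately on each class; label each prototype by its class."""
    protos, proto_labels = [], []
    for cls in np.unique(y):
        for center in kmeans(X[y == cls], k):
            protos.append(center)
            proto_labels.append(cls)
    return np.array(protos), np.array(proto_labels)

def classify(X, protos, proto_labels):
    """Assign each instance the label of its nearest prototype (Euclidean)."""
    d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return proto_labels[np.argmin(d, axis=1)]
```

With one prototype per class (k = 1), each prototype is simply the class mean, which is the degenerate case most relevant to the classifier proposed later in this paper.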
In GM, each cluster is represented by a Gaussian density with a centroid and a covariance matrix. The Expectation Maximization (EM) algorithm adjusts the cluster parameters iteratively in two steps. In the first step, each instance (observation) is given a weight based on its likelihood under each Gaussian; the likelihood indicates which cluster the instance belongs to and to what degree. In the second step, the weighted observations contribute to the centroids and covariance matrices. After this EM optimization finishes, the best Gaussian parameters are identified, and the instances can be classified by calculating posterior probabilities. In K-NN, which is a memory-based method, an instance is classified by the majority vote of its K nearest neighbors, identified under a defined distance metric; in this case, no optimization is required. For more details, along with a comparative study of the above prototype methods, see [17]. A number of researchers have applied optimization techniques to prototype-based methods. For example, De Falco et al. proposed a particle swarm optimization (PSO) based classifier to face the problem of classifying instances in multiclass databases [34], proposing three different fitness functions for classification purposes. Karaboga and Ozturk [35] adapted the artificial bee colony (ABC) algorithm to classification using one of the fitness functions proposed by De Falco et al. The experiments in [34,35] confirm that these heuristic-based classifiers (PSO and ABC) provide good results in contrast to well-known classification techniques. Gravitational search algorithm (GSA) is one of the latest heuristic optimization algorithms, first introduced by Rashedi et al.
as a new stochastic population-based optimization tool [36], based on the metaphor of gravitational interaction between masses. This approach provides an iterative method that simulates mass interactions, in which agents move through a multi-dimensional search space under the influence of gravitation. The heuristic algorithm is inspired by the Newtonian laws of gravity and motion
[36]. The effectiveness of GSA and its binary version (BGSA) [37] in solving a set of nonlinear benchmark functions has been demonstrated [36,37]. Moreover, the results obtained in [38–40] confirm that GSA is a suitable tool for linear and nonlinear filter modeling, parameter identification of hydraulic turbine governing systems, and synthesis of thinned scanned concentric ring array antennas, respectively. Theoretically, GSA belongs to the class of swarm-based heuristic algorithms. Rashedi et al. [36] carried out a comparative study between GSA and a number of well-known swarm algorithms such as PSO; the results suggest that GSA, inspired by the law of gravity, has merit in the field of optimization. In the current paper, a prototype-based classification approach that uses GSA is proposed. For simplicity, this approach uses one prototype per class. The main objective of the approach is to find the appropriate position of each prototype (class representative); once the class representatives are found, an instance is assigned the class of the representative at the closest distance (using Euclidean distance). Finding the proper positions of the class representatives is itself an optimization task, and GSA is used to tackle it. Since GSA is a heuristic optimization algorithm, it requires fitness functions to guide its search; in this paper the same fitness functions as those applied in [34] are used. To verify the effectiveness of the proposed GSA-based classifier, it is applied to 12 benchmark data sets, and the results are compared with those of 11 well-known classification techniques reported in the literature. The remainder of this paper is organized as follows: in Section 2 the structure of GSA is described; in Section 3 GSA is modified so that it conducts a classification task using the three fitness functions mentioned above.
In Section 4, the twelve data sets selected from the UCI machine learning repository [41] are first described; then the proposed classifier is applied to them and compared with the 11 well-known classification techniques, including PSO and ABC, over the selected data sets. Finally, in Section 5 the conclusions are stated.
2. A prologue to gravitational search algorithm

In GSA, agents are considered as objects, and their performance is measured by their masses. All objects attract each other by the force of gravity, and this force causes a global movement of all objects towards the objects with heavier masses. The heavy masses correspond to good solutions of the problem. The position of an agent corresponds to a solution of the problem, and its mass is determined using a fitness function. With the lapse of time, the masses are attracted by the heaviest mass, which we hope will present an optimum solution in the search space. GSA can be considered as an isolated system of masses; it is like a small artificial world of masses obeying the Newtonian laws of gravitation and motion [36]. To describe GSA, consider a system with N masses (agents) in which the position of the ith mass is defined as follows:

X_i = (x_i^1, ..., x_i^d, ..., x_i^n),  i = 1, 2, ..., N    (1)

where x_i^d is the position of the ith mass in the dth dimension and n is the dimension of the search space. Note that the positions of the masses correspond to solutions of the problem. Based on [36], the mass of each agent is calculated after computing the current population's fitness, as follows:

M_i(t) = (fit_i(t) - worst(t)) / Σ_{j=1}^{N} (fit_j(t) - worst(t))    (2)
where fit_i(t) represents the fitness value of agent i at time t, and worst(t) is defined as follows:

worst(t) = min_{j ∈ {1,...,N}} fit_j(t) for maximization problems; max_{j ∈ {1,...,N}} fit_j(t) for minimization problems    (3)

At a specific time t, the force acting on mass i from mass j is defined as follows:

F_{ij}^d(t) = G(t) (M_i(t) × M_j(t) / (R_{ij}(t) + ε)) (x_j^d(t) - x_i^d(t))    (4)
To compute the acceleration of an agent, the total force applied to it by all other masses is obtained from the law of gravity (Eq. (5)), followed by the calculation of the agent's acceleration using the law of motion (Eq. (6)):

F_i^d(t) = Σ_{j=1, j≠i}^{N} r_j F_{ij}^d(t) = G(t) M_i(t) Σ_{j=1, j≠i}^{N} r_j (M_j(t) / (R_{ij}(t) + ε)) (x_j^d(t) - x_i^d(t))    (5)

a_i^d(t) = F_i^d(t) / M_i(t) = G(t) Σ_{j=1, j≠i}^{N} r_j (M_j(t) / (R_{ij}(t) + ε)) (x_j^d(t) - x_i^d(t))    (6)
where a_i^d(0) = 0. Afterward, the next velocity of an agent is calculated as a fraction of its current velocity added to its acceleration (Eq. (7)), and its next position is then obtained from Eq. (8):

v_i^d(t + 1) = r_i × v_i^d(t) + a_i^d(t)    (7)

x_i^d(t + 1) = x_i^d(t) + v_i^d(t + 1)    (8)
where v_i^d(0) = 0, r_i and r_j are two uniformly distributed random numbers in the interval [0, 1], and ε is a small value. G(t) = G(G_0, t) is the gravitational constant at time t, a decreasing function of time that is set to G_0 at the beginning (t = 0) and decreases exponentially [36] or linearly [37] towards zero by the last iteration. R_{ij}(t) is the Euclidean distance between two agents i and j, defined as:
R_{ij}(t) = ||X_i(t), X_j(t)||_2    (9)

Fig. 1. The pseudo code of GSA.

The pseudo code of GSA is given by Fig. 1.

3. The proposed GSA-based classifier

Assume we have a C-class classification problem in a D-dimensional feature space; each instance should be classified into one of the existing C classes. The prototype classifier based on GSA is designed in two successive phases: a training phase and a performance phase. For this purpose the data set is divided into two sections: a training set and a test set. In the training phase, GSA uses the training data to provide the best representative for each class according to a specific fitness function. The block diagram of the proposed classifier in the training phase is given by Fig. 2. GSA is initialized with N masses in an n-dimensional space, where n = C × D. In other words, GSA is asked to find one representative per class, C representatives in total, where each representative is a D-dimensional vector used by the classifier as the class prototype. The initial positions of the masses are selected randomly from the data of the training set, and the positions of the masses form the candidate solutions to the problem. The fitness value of each mass is computed, and the results are fed into GSA for mass calculation (Eq. (2)), followed by computing Eqs. (4)–(8). The search for the best representatives continues until the stopping condition is met. At the end of the training phase, the best obtained representatives are delivered to the second phase to set the initial parameters of the prototype classifier. Based on the information obtained in the training phase, the final prototype classifier can start its work: we have C prototypes, each with a class label, and each new instance (from the test set) is classified by finding the closest prototype under a defined distance measure (typically Euclidean distance). The block diagram of the proposed classifier in the performance phase is given by Fig. 3.

3.1. Fitness functions

To compare GSA with PSO and ABC in terms of classification accuracy, the same fitness functions as reported in [34] are picked and applied to GSA, namely fit1, fit2, and fit3.
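As a concrete illustration of the update rules of Eqs. (1)–(9), the GSA loop for a minimization problem can be sketched in Python. This is a minimal sketch, not the authors' implementation; it uses the linearly decreasing G of [37], and the default parameter values mirror the settings reported later in Section 4.2:

```python
import numpy as np

def gsa_minimize(fitness, n_dim, n_agents=20, iters=50, g0=1.0, lo=0.0, hi=1.0, seed=0):
    """Minimal GSA loop for a minimization problem, following Eqs. (1)-(9)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_agents, n_dim))   # agent positions, Eq. (1)
    V = np.zeros_like(X)                              # velocities, v(0) = 0
    best_x, best_f, eps = None, float("inf"), 1e-10
    for t in range(iters):
        fit = np.array([fitness(x) for x in X])
        if fit.min() < best_f:                        # track the best agent so far
            best_f, best_x = float(fit.min()), X[fit.argmin()].copy()
        G = g0 * (1.0 - t / iters)                    # linearly decreasing G, as in [37]
        worst = fit.max()                             # Eq. (3), minimization case
        m = worst - fit                               # Eq. (2), rearranged so masses are >= 0
        M = m / (m.sum() + eps)
        A = np.zeros_like(X)                          # accelerations, Eqs. (5)-(6)
        for i in range(n_agents):
            for j in range(n_agents):
                if j == i:
                    continue
                R = np.linalg.norm(X[i] - X[j])       # Eq. (9)
                A[i] += rng.random() * G * M[j] * (X[j] - X[i]) / (R + eps)
        V = rng.random(size=X.shape) * V + A          # Eq. (7)
        X = X + V                                     # Eq. (8)
    return best_x, best_f
```

Minimizing the 2-D sphere function, for instance: `gsa_minimize(lambda x: float((x ** 2).sum()), n_dim=2, lo=-1.0, hi=1.0)`.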
3.1.1. Fitness function fit1
According to this function, the coordinates of the prototypes are calculated so that the percentage of misclassification (on the training set, not the test set) is minimized. The percentage of misclassification is calculated as follows [34]:

fit1 = (100 / D_Train) Σ_{j=1}^{D_Train} m(I_j)    (10)

Fig. 2. The schematic of the proposed GSA-based prototype classifier in the training phase.
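Eq. (10) can be evaluated directly for a candidate set of prototypes. The following is an illustrative Python sketch (function and parameter names are ours, and NumPy is assumed):

```python
import numpy as np

def fit1(prototypes, proto_labels, X_train, y_train):
    """Eq. (10): percentage of misclassified training instances."""
    # Squared Euclidean distance of every instance to every prototype.
    d = ((X_train[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    predicted = proto_labels[np.argmin(d, axis=1)]   # label of the nearest prototype
    return 100.0 * np.mean(predicted != y_train)     # sum of m(I_j), normalized, in percent
```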
where D_Train is the number of training examples, each represented by an ordered pair consisting of a vector I_j and an output label chosen from the label set of the relevant data set. The term m(I_j) indicates whether a misclassification has occurred: its value is 1 in the case of a misclassification and 0 in the case of a correct classification. The summation over all instances gives the number of misclassifications. Note that the denominator D_Train normalizes the number of misclassifications, and the factor 100 converts it to a percentage, so the fitness function varies in the interval [0, 100].

Fig. 3. The schematic of the proposed GSA-based prototype classifier in the performance phase.

3.1.2. Fitness function fit2
The second fitness function is defined in terms of the Euclidean distance between each representative (prototype) and each of its associated instances I_j. The formula is as follows [34]:

fit2 = (1 / D_Train) Σ_{j=1}^{D_Train} d(I_j, P_j)    (11)
where d(·) is the Euclidean distance between a generic instance I_j and P_j, the prototype to which the instance belongs. The denominator D_Train normalizes the objective value. Note that fit2 has a key advantage over fit1: it is far closer to continuous, owing to the continuous nature of the Euclidean distance, whereas fit1 can only change in steps of size 1/D_Train.

3.1.3. Fitness function fit3
In the third fitness function, each training instance is first assigned to the prototype at the closest distance. Then fit3 is calculated as a linear combination of Eqs. (10) and (11) [34]:

fit3 = (1/2) (fit1/100 + fit2)    (12)

To normalize fit3, fit1 is first divided by 100, and the sum with fit2 is then divided by 2. Having defined these three fitness functions, the classification task is shaped into a minimization problem, and three versions of the GSA-based prototype classifier, one per fitness function, are used to perform the classification. The performance of each version is measured by the misclassification percentage of instances achieved by the best found agent, obtained by comparing the assigned labels with the correct labels of the examples.
4. Experimental results

4.1. UCI selected data sets

Again, to make the comparison possible, the same classification data sets from the UCI machine learning repository [41] as used in [34,35] are selected to be classified by the three versions of the GSA-based classifier. Note that in 9 of the data sets, the first 75% of the elements are used as the training set and the remaining 25% for testing after the training process. In the data sets glass, thyroid, and wine, however, the classes appear in sequential order; since the training and testing phases require elements with similar class distributions, we first shuffle the elements, similar to [34] and [35], and then proceed as before to pick the training and test sets. The data set attributes are also normalized before the three versions of GSA are run on them. The characteristics of these 12 data sets, summarized in Table 1, are as follows:

Balance data set: This data set is used to model psychological experimental results. It encompasses 625 elements. The instances are classified into three classes: balance scale tips to the right, balance scale tips to the left, or balanced. The 4 attributes are the left weight, the left distance, the right weight, and the right distance. Of these elements, 469 are used for training and 156 for testing.

Cancer data set: This data set is generated from the "breast cancer Wisconsin-Diagnostic" data set. It encompasses the symptoms by which a tumor is classified into two classes, benign or malignant. A benign tumor is one that is not likely to cause death, whereas a malignant tumor is advanced, uncontrollable, and likely to lead to death. The data set includes 569 elements, and each instance uses 30 attributes out of 32.

Cancer-Int data set: This data set is generated from the "breast cancer Wisconsin-Original" data set. Its description is the same as the previous one.
However, it contains 699 elements, and each instance uses 9 attributes out of 10.

Credit data set: This data set is used for credit approval (the Australian version of credit). The names and values of the attributes have been chosen to be meaningless, for information security purposes. This data set has the advantage of using a blend of continuous, small-valued nominal, and large-valued nominal attributes. The attributes are formed into 51 input values [42].

Dermatology data set: This data set is used for the recognition of skin diseases (dermatology is the scientific study of skin diseases). Two major difficulties with the recognition of these diseases are that, first, a microscopic tissue test may be required to manifest a disease, and second, a disease in its first stages may show the characteristics of another disease. The data set contains 366 elements, 34 inputs, and the following 6 classes: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris.

Diabetes data set: This data set is used for the recognition of diabetes. Constraints such as the subjects being females at least 21 years old of Pima Indian heritage were imposed when gathering the data. The data set has 768 elements; the first 576 elements are used as the training set and the remaining 192 as the test set. Each sample has 8 attributes.

E. coli data set: This data set includes protein localization sites for E. coli bacteria. Originally it has 336 members classified into 8 classes, but since 3 of the classes were represented by only 2, 2, and 5 instances, we set them aside. The resulting data set has 5 classes and 327 members, from which we picked 245 members for training and 82 for testing, as can be seen in Table 1.
Glass data set: This data set is used for classification of glass types as being float processed building windows, non-float processed building windows, vehicle windows, containers, tableware
and head lamps. This data set can be used in investigations in which a crime has been committed and a piece of glass has been left behind; having classified the glass type correctly, it can later be used as evidence against the criminal. The input attributes are based on 9 chemical measurements, and the 6 glass types have 70, 76, 17, 13, 9, and 29 instances of the 214 total instances, respectively. Of all the elements, 161 are selected for training and the remaining 53 for testing.

Horse data set: This data set is used for classification (prediction) of a horse's fate when it has a colic problem and suffers severe pain. The fate can be classified into the following classes: the horse will be euthanized (killed painlessly), the horse will survive, or it will die. The data set has 364 elements; each element has 58 inputs formed from 27 attributes, and there are 3 classes.

Iris data set: This data set is the most popular UCI data set and is used for classification of Iris flowers into one of three species, namely Setosa, Versicolor, and Virginica. The classification is carried out based on 4 attributes: sepal length, sepal width, petal length, and petal width. The data set has 150 elements, 50 from each species.

Thyroid data set: This data set is used to classify the state of the thyroid gland into three classes: over function, normal function, or under function. It is based on the new-thyroid data, which contains 215 elements; each instance has 5 attributes.

Wine data set: This data set is used for classification of wines from three cultivators based on a chemical analysis. It has 178 elements, and each instance has 13 attributes.

Table 1
Characteristics of the 12 UCI data sets.

Data set      # of elements  # of training data  # of testing data  # of input attributes  # of classes
Balance            625             469                 156                   4                  3
Cancer             569             427                 142                  30                  2
Cancer-Int         699             524                 175                   9                  2
Credit             690             518                 172                  51                  2
Dermatology        366             274                  92                  34                  6
Diabetes           768             576                 192                   8                  2
E. coli            327             245                  82                   7                  5
Glass              214             161                  53                   9                  6
Horse              364             273                  91                  58                  3
Iris               150             112                  38                   4                  3
Thyroid            215             162                  53                   5                  3
Wine               178             133                  45                  13                  3

4.2. Results

To classify using GSA, the following settings are applied: the population size N and the number of iterations T are set to 20 and 50, respectively. G is set using Eq. (13), where G_0 is set to 1 and T is the total number of iterations [37]:

G = G_0 (1 - t/T)    (13)

Therefore, for the GSA-based classifier the number of fitness evaluations equals 1000, whereas in PSO and ABC, as reported in [35], the numbers of fitness evaluations are 50,000 and 20,000, respectively. Taking a small number of fitness evaluations for GSA keeps the comparison as reliable as possible and avoids the results being dominated by errors that could originate from giving GSA more time than the other algorithms. For the comparison, the percentage of misclassification is used as the criterion, computed as follows: first, the entire test set is classified and the number of misclassifications is counted by comparison with the correct labels of the test data; second, this number is divided by the cardinality of the test set; finally, it is multiplied by 100 to obtain a percentage.
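The data preparation described in Section 4.1 (75%/25% split, shuffling for data sets whose classes are stored sequentially, attribute normalization) might be sketched as follows. Min-max scaling is an assumption here, since the paper does not specify the normalization scheme, and the function name is ours:

```python
import numpy as np

def prepare(X, y, shuffle=False, train_frac=0.75, seed=0):
    """75/25 train/test split with optional shuffling and min-max normalization."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    if shuffle:  # used for glass, thyroid, and wine, whose classes are stored sequentially
        order = np.random.default_rng(seed).permutation(len(y))
        X, y = X[order], y[order]
    # Min-max normalize each attribute to [0, 1] (an assumed normalization scheme).
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    n_train = int(train_frac * len(y))
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]
```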
Table 2 shows the average misclassification percentages and rankings of the classifiers PSO, ABC, and GSA over the 12 chosen data sets. Each table slot, belonging to a specific classifier and data set, contains the misclassification percentage averaged over 20 runs of that classifier on that data set; the number in brackets shows the ranking of the classifier on that data set, and the best results are shown in bold. Note that the PSO and ABC results are those reported in [34] and [35], respectively. As can be seen in Table 2, the GSA versions obtain acceptable results in comparison with PSO and ABC: only on the Iris and Wine data sets does ABC take the first rank, with GSA second and third, while in all other cases GSA ranks first; on the Cancer-Int data set, ABC and GSA obtain the same result. The ABC results are present only for fitness function fit2 because no experiments with fit1 and fit3 were reported in [35]; for this reason, we also limit the comparison to fit2 in the later tables. In Table 3, the results of PSO, ABC, and GSA based on fitness function fit2 are displayed for a better comparison. As can be seen, PSO takes the first rank only on the Balance data set, ABC on five data sets (Cancer, Cancer-Int, Diabetes, Iris, and Wine), and GSA on seven data sets (including Cancer-Int). In Table 4, the average misclassification percentages along with the corresponding rankings are shown for each of 12 different classifiers running on the selected data sets. The table gives a comprehensive list of classifiers and provides a better comparative framework for the current work. Except for GSA, the presented results are those reported in [34,35].
More details about the classifiers presented in Table 4 can be found in [34,35]. In Table 5, the misclassification percentages averaged over the entire set of data sets are shown for each of the 12 classifiers, along with the resulting ranks: first, the average of each column of Table 4 is calculated, and then a ranking is conducted (shown in brackets). Among all the classifiers, GSA, MLP-ANN, and Bayes Net take the first, second, and third ranks, respectively. In Table 6, the comparison is made according to the sum of the ranks in each column of Table 4. Although this quantity has a lower degree of precision for reporting results in some cases, it is common in nonparametric statistics; by this measure, MLP-ANN, GSA, and ABC take the first, second, and third ranks, respectively. At this point, it should be mentioned that although there is no universal classifier that achieves the best results on all available benchmarks, the results obtained by GSA confirm that the proposed GSA-based prototype classifier can be regarded as a worthwhile classifier alongside other practical classifiers that have proved their efficiency thus far.
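The rank aggregation used in Tables 4 and 6 (per-data-set ranks, then rank sums) can be reproduced with a short sketch. The tie convention, equal errors sharing the lower rank, is inferred from Table 2; the sample numbers below are the Balance and Cancer rows of Table 3 in the column order GSA fit2, ABC fit2, PSO fit2:

```python
import numpy as np

def ranks_per_dataset(errors):
    """Rank classifiers (1 = best, i.e. lowest error) on each data set.
    `errors` has shape (n_datasets, n_classifiers); ties share the lower rank."""
    ranks = np.zeros_like(errors, dtype=int)
    for i, row in enumerate(errors):
        order = np.sort(row)
        # Position of each value in the sorted row gives its rank (1-based).
        ranks[i] = [int(np.searchsorted(order, v)) + 1 for v in row]
    return ranks

# Balance and Cancer rows of Table 3, columns GSA fit2, ABC fit2, PSO fit2.
errors = np.array([[18.88, 15.38, 13.24],
                   [ 4.16,  2.81,  3.49]])
rank_sums = ranks_per_dataset(errors).sum(axis=0)  # rank sums, as in Table 6
```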
Table 2
Average misclassification percentages and rankings of each of the three classifiers PSO, ABC, and GSA executed on the 12 UCI data sets. The number in brackets in each table slot shows the ranking of the classifier.

Data set      PSO fit1    PSO fit2    PSO fit3    ABC fit2    GSA fit1    GSA fit2    GSA fit3
Balance       25.47 (7)   13.24 (4)   13.12 (3)   15.38 (5)   12.67 (1)   18.88 (6)   12.92 (2)
Cancer         5.80 (7)    3.49 (4)    3.49 (4)    2.81 (3)    1.53 (2)    4.16 (6)    1.39 (1)
Cancer-Int     2.87 (7)    2.75 (6)    2.64 (5)    0 (1)       0.97 (4)    0 (1)       0.91 (3)
Credit        22.96 (7)   22.19 (6)   18.77 (5)   13.37 (2)   14.36 (3)   10.94 (1)   14.42 (4)
Dermatology    5.76 (5)   19.67 (7)    6.08 (6)    5.43 (4)    3.80 (1)    5.27 (3)    4.12 (2)
Diabetes      22.50 (5)   23.22 (6)   21.77 (3)   22.39 (4)   21.40 (2)   23.59 (7)   20.88 (1)
E. coli       14.63 (6)   23.65 (7)   13.90 (5)   13.41 (4)    7.37 (2)   11.22 (3)    7.31 (1)
Glass         40.18 (5)   40.18 (5)   38.67 (4)   41.50 (7)   33.01 (3)   32.07 (2)   31.89 (1)
Horse         40.98 (7)   37.69 (5)   35.16 (4)   38.26 (6)   33.91 (3)   33.31 (2)   33.15 (1)
Iris           2.63 (3)    3.68 (6)    5.26 (7)    0 (1)       1.84 (2)    2.63 (3)    2.89 (5)
Thyroid        5.55 (6)    6.66 (7)    3.88 (5)    3.77 (4)    3.14 (2)    1.85 (1)    3.14 (2)
Wine           2.22 (4)    6.22 (7)    2.88 (6)    0 (1)       0.90 (2)    2.27 (5)    1.58 (3)
Table 3
Average misclassification percentages and rankings of PSO, ABC, and GSA using fit2 on each of the 12 chosen UCI data sets. The number in brackets in each table slot shows the ranking of each classifier.

Data set      PSO fit2   ABC fit2   GSA fit2
Balance       13.24 (1)  15.38 (2)  18.88 (3)
Cancer         3.49 (2)   2.81 (1)   4.16 (3)
Cancer-Int     2.75 (3)   0.00 (1)   0.00 (1)
Credit        22.19 (3)  13.37 (2)  10.94 (1)
Dermatology   19.67 (3)   5.43 (2)   5.27 (1)
Diabetes      23.22 (2)  22.39 (1)  23.59 (3)
E. coli       23.65 (3)  13.41 (2)  11.22 (1)
Glass         40.18 (2)  41.50 (3)  32.07 (1)
Horse         37.69 (2)  38.26 (3)  33.31 (1)
Iris           3.68 (3)   0.00 (1)   2.63 (2)
Thyroid        6.66 (3)   3.77 (2)   1.85 (1)
Wine           6.22 (3)   0.00 (1)   2.27 (2)
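The two comparison procedures used in this section (averaging each classifier's misclassification percentage over all data sets, and summing its per-data-set ranks) can be sketched in a few lines. The following Python snippet is an illustrative sketch rather than part of the original study; it applies both procedures to the fit2 results of Table 3. Ties share the same rank and the next rank is skipped, as in the tables.

```python
# Illustrative sketch (not from the paper) of the two comparison procedures:
# (1) averaging each classifier's misclassification percentage over all data
# sets, and (2) summing its per-data-set ranks.  Data: fit2 columns of Table 3.

def competition_ranks(values):
    """Ascending competition ranking: 1 + number of strictly smaller values.
    Ties share the same rank and the next rank is skipped."""
    return [1 + sum(1 for u in values if u < v) for v in values]

# Average misclassification percentages per data set (Balance ... Wine), Table 3.
errors = {
    "PSO fit2": [13.24, 3.49, 2.75, 22.19, 19.67, 23.22,
                 23.65, 40.18, 37.69, 3.68, 6.66, 6.22],
    "ABC fit2": [15.38, 2.81, 0.00, 13.37, 5.43, 22.39,
                 13.41, 41.50, 38.26, 0.00, 3.77, 0.00],
    "GSA fit2": [18.88, 4.16, 0.00, 10.94, 5.27, 23.59,
                 11.22, 32.07, 33.31, 2.63, 1.85, 2.27],
}
names = list(errors)
n_sets = len(errors[names[0]])

# Comparison 1 (as in Table 5): mean error over all data sets.
averages = {c: sum(e) / n_sets for c, e in errors.items()}

# Comparison 2 (as in Table 6): rank the classifiers on every data set,
# then accumulate the ranks per classifier.
rank_sums = dict.fromkeys(names, 0)
for i in range(n_sets):
    for c, r in zip(names, competition_ranks([errors[c][i] for c in names])):
        rank_sums[c] += r

print(rank_sums)  # {'PSO fit2': 30, 'ABC fit2': 21, 'GSA fit2': 20}
```

On these data, GSA with fit2 obtains both the lowest average error (about 12.2%, versus roughly 13.0% for ABC and 16.9% for PSO) and the smallest rank sum, matching the bracketed ranks of Table 3.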
Table 4
Average misclassification percentages along with relative rankings for each of the 12 different classifiers running on the UCI data sets. The number in brackets in each table slot shows the ranking of each classifier.

Data set      Bayes Net   MLP ANN     RBF          KStar       Bagging     MultiBoost   NBTree      Ridor       VFI          PSO fit3    ABC fit2    GSA fit3
Balance       19.74 (7)    9.29 (1)   33.61 (11)   10.25 (2)   14.77 (5)   24.20 (10)   19.74 (7)   20.63 (9)   38.85 (12)   13.12 (4)   15.38 (6)   12.92 (3)
Cancer         4.19 (6)    2.93 (4)   20.27 (12)    2.44 (2)    4.47 (7)    5.59 (8)     7.69 (11)   6.36 (9)    7.34 (10)    3.49 (5)    2.81 (3)    1.39 (1)
Cancer-Int     3.42 (4)    5.25 (8)    8.17 (12)    4.57 (6)    3.93 (5)    5.14 (7)     5.71 (10)   5.48 (9)    5.71 (10)    2.64 (3)    0.00 (1)    0.91 (2)
Credit        12.13 (2)   13.81 (6)   43.29 (12)   19.18 (11)  10.68 (1)   12.71 (4)    16.18 (8)   12.65 (3)   16.47 (9)    18.77 (10)  13.37 (5)   14.42 (7)
Dermatology    1.08 (1)    3.26 (3)   34.66 (11)    4.66 (6)    3.47 (4)   53.26 (12)    1.08 (1)    7.92 (10)   7.60 (9)     6.08 (8)    5.43 (7)    4.12 (5)
Diabetes      25.52 (4)   29.16 (8)   39.16 (12)   34.05 (10)  26.87 (6)   27.08 (7)    25.52 (4)   29.31 (9)   34.37 (11)   21.77 (2)   22.39 (3)   20.88 (1)
E. coli       17.07 (6)   13.53 (3)   24.38 (11)   18.29 (9)   15.36 (5)   31.70 (12)   20.73 (10)  17.07 (6)   17.07 (6)    13.90 (4)   13.41 (2)    7.31 (1)
Glass         29.62 (5)   28.51 (4)   44.44 (11)   17.58 (1)   25.36 (3)   53.70 (12)   24.07 (2)   31.66 (6)   41.11 (9)    38.67 (8)   41.50 (10)  31.89 (7)
Horse         30.76 (2)   32.19 (5)   38.46 (10)   35.71 (8)   30.32 (1)   38.46 (10)   31.86 (3)   31.86 (3)   41.75 (12)   35.16 (7)   38.26 (9)   33.15 (6)
Iris           2.63 (7)    0.00 (1)    9.99 (12)    0.52 (5)    0.26 (4)    2.63 (7)     2.63 (7)    0.52 (5)    0.00 (1)     5.26 (11)   0.00 (1)    2.89 (10)
Thyroid        6.66 (6)    1.85 (1)    5.55 (5)    13.32 (11)  14.62 (12)   7.40 (7)    11.11 (9)    8.51 (8)   11.11 (9)     3.88 (4)    3.77 (3)    3.14 (2)
Wine           0.00 (1)    1.33 (3)    2.88 (7)     3.99 (9)    2.66 (6)   17.77 (12)    2.22 (5)    5.10 (10)   5.77 (11)    2.88 (7)    0.00 (1)    1.58 (4)
Table 5
Average misclassification percentages over the entire set of data sets for each of the 12 different classifiers, along with the resulting ranks.

Classifier   Bayes Net  MLP ANN    RBF         KStar      Bagging    MultiBoost  NBTree     Ridor      VFI         PSO        ABC        GSA
Average      12.73 (3)  11.75 (2)  25.40 (12)  13.71 (6)  12.73 (3)  23.30 (11)  14.04 (8)  14.75 (9)  18.92 (10)  13.80 (7)  13.02 (5)  11.21 (1)
Table 6
Sum of the rankings of Table 4 in each column.

Classifier     Bayes Net  MLP ANN  RBF       KStar   Bagging  MultiBoost  NBTree  Ridor   VFI       PSO     ABC     GSA
Sum of ranks   51 (3)     47 (1)   126 (12)  80 (8)  59 (5)   108 (10)    77 (7)  87 (9)  109 (11)  73 (6)  51 (3)  49 (2)
5. Conclusion

The overall goal of this paper has been to propose a novel prototype classifier that employs GSA as a global searcher to find the best positions of the representatives (prototypes). The proposed method has been applied to the classification of benchmark problems, and the performance of the GSA-based classifier has been compared with that of ABC, PSO, and nine other well-known classifiers from the literature. The experimental results confirm the effectiveness and efficiency of the proposed method and show that it can successfully be applied as a classifier to classification problems.

Acknowledgements

The authors would like to extend their appreciation to Saber Zahedi and Abdol-hamid Haeri for proofreading the manuscript and providing valuable comments. In addition, the authors would like to thank the ASOC Editorial Board and the anonymous reviewers for their very helpful suggestions.

References

[1] T. Mitchell, Machine Learning, McGraw-Hill, 1997, ISBN 0070428077.
[2] A. Unler, A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, European Journal of Operational Research 206 (3) (2010) 528–539.
[3] S.D. Bhavani, T.S. Rani, R.S. Bapi, Feature selection using correlation fractal dimension: issues and applications in binary classification problems, Applied Soft Computing 8 (1) (2008) 555–563.
[4] N. García-Pedrajas, D. Ortiz-Boyer, An empirical study of binary classifier fusion methods for multiclass classification, Information Fusion 12 (2) (2011) 111–130.
[5] K. Polat, S. Güneş, A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems, Expert Systems with Applications 36 (2) (2009) 1587–1592.
[6] M.W. Kurzynski, The optimal strategy of a tree classifier, Pattern Recognition 16 (1983) 81–87.
[7] F. Seifi, M.R. Kangavari, H. Ahmadi, E. Lotfi, S. Imaniyan, S. Lagzian, Optimizing twins decision tree classification using genetic algorithms, in: 7th IEEE International Conference on Cybernetic Intelligent Systems, 2008, pp. 1–6.
[8] P. Knagenhjelm, P. Brauer, Classification of vowels in continuous speech using MLP and a hybrid net, Speech Communication 9 (1) (1990) 31–34.
[9] J.-Y. Wu, MIMO CMAC neural network classifier for solving classification problems, Applied Soft Computing 11 (2) (2011) 2326–2333.
[10] C.R. De Silva, S. Ranganath, L.C. De Silva, Cloud basis function neural network: a modified RBF network architecture for holistic facial expression recognition, Pattern Recognition 41 (4) (2008) 1241–1253.
[11] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[12] Y. Liu, Z. You, L. Cao, A novel and quick SVM-based multi-class classifier, Pattern Recognition 39 (11) (2006) 2258–2264.
[13] H. Qian, Y. Mao, W. Xiang, Z. Wang, Recognition of human activities using SVM multi-class classifier, Pattern Recognition Letters 31 (2) (2010) 100–111.
[14] R. Kumar, A. Kulkarni, V.K. Jayaraman, B.D. Kulkarni, Symbolization assisted SVM classifier for noisy data, Pattern Recognition Letters 25 (4) (2004) 495–504.
[15] R. Kumar, V.K. Jayaraman, B.D. Kulkarni, An SVM classifier incorporating simultaneous noise reduction and feature selection: illustrative case examples, Pattern Recognition 38 (1) (2005) 41–49.
[16] E.J.R. Justino, F. Bortolozzi, R. Sabourin, A comparison of SVM and HMM classifiers in the off-line signature verification, Pattern Recognition Letters 26 (9) (2005) 1377–1385.
[17] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer Verlag, New York, 2009.
[18] C.-L. Liu, M. Nakagawa, Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition, Pattern Recognition 34 (2001) 601–615.
[19] F. Shen, O. Hasegawa, A fast nearest neighbor classifier based on self-organizing incremental neural network, Neural Networks 21 (10) (2008) 1537–1547.
[20] F. Chang, C.-H. Chou, C.-C. Lin, C.-J. Chen, A prototype classification method and its application to handwritten character recognition, in: IEEE International Conference on Systems, Man, and Cybernetics, vol. 5, March 2004, pp. 4738–4743.
[21] S.R.H. Hatem, A. Fayed, A.F. Atiya, Self-generating prototypes for pattern classification, Pattern Recognition 40 (5) (2007) 1498–1509.
[22] C.-H. Chou, C.-C. Lin, Y.-H. Liu, F. Chang, A prototype classification method and its use in a hybrid solution for multiclass pattern recognition, Pattern Recognition 39 (4) (2006) 624–634.
[23] C.-C. Hung, L. Wan, Hybridization of particle swarm optimization with the K-means algorithm for image classification, in: IEEE Symposium on Computational Intelligence for Image Processing, 2009, pp. 60–64.
[24] S. Su, Image classification based on particle swarm optimization combined with K-means, in: International Conference on Test and Measurement (ICTM), Hong Kong, 5–6 December 2009, pp. 367–370.
[25] H.-H. Song, S.-W. Lee, LVQ combined with simulated annealing for optimal design of large-set reference models, Neural Networks 9 (2) (1996) 329–336.
[26] D.T. Pham, S. Otri, A. Ghanbarzadeh, E. Koç, Application of the bees algorithm to the training of learning vector quantization networks for control chart pattern recognition, in: Information and Communication Technologies (ICTTA), 2006, pp. 1624–1629.
[27] C.M. Bishop, Pattern Recognition and Machine Learning, first ed., Springer, 2006.
[28] Z. Botev, D.P. Kroese, Global likelihood optimization via the cross-entropy method, with an application to mixture models, in: R.G. Ingalls, M.D. Rossetti, J.S. Smith, B.A. Peters (Eds.), IEEE Proceedings of the 2004 Winter Simulation Conference, December, Washington, DC, 2004, pp. 529–535.
[29] A. Bessadok, P. Hansen, A. Rebai, EM algorithm and variable neighborhood search for fitting finite mixture model parameters, in: Proceedings of the International Multiconference on Computer Science and Information Technology, 2009, pp. 725–733.
[30] X. Zhou, X. Wang, Optimisation of Gaussian mixture model for satellite image classification, IEE Proceedings – Vision, Image and Signal Processing 153 (3) (2006) 349–356.
[31] Y. Liao, V.R. Vemuri, Use of K-nearest neighbor classifier for intrusion detection, Computers & Security 21 (5) (2002) 439–448.
[32] C.-L. Liu, M. Nakagawa, Prototype learning algorithms for nearest neighbor classifier with application to handwritten character recognition, in: ICDAR'99: IEEE Proceedings of the Fifth International Conference on Document Analysis and Recognition, Washington, DC, USA, 1999, pp. 378–381.
[33] L. Jiang, Z. Cai, D. Wang, S. Jiang, Survey of improving k-nearest-neighbor for classification, in: Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 1, Haikou, China, August 24–27, 2007, pp. 679–683.
[34] I. De Falco, A. Della Cioppa, E. Tarantino, Facing classification problems with particle swarm optimization, Applied Soft Computing 7 (3) (2007) 652–658.
[35] D. Karaboga, C. Ozturk, A novel clustering approach: artificial bee colony (ABC) algorithm, Applied Soft Computing 11 (1) (2011) 652–657.
[36] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, GSA: a gravitational search algorithm, Information Sciences 179 (2009) 2232–2248.
[37] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, BGSA: binary gravitational search algorithm, Natural Computing 9 (3) (2010) 727–745.
[38] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi, Filter modeling using gravitational search algorithm, Engineering Applications of Artificial Intelligence 24 (1) (2011) 117–122.
[39] C. Li, J. Zhou, Parameters identification of hydraulic turbine governing system using improved gravitational search algorithm, Energy Conversion and Management 52 (1) (2011) 374–381.
[40] A. Chatterjee, G.K. Mahanti, Comparative performance of gravitational search algorithm and modified particle swarm optimization algorithm for synthesis of thinned scanned concentric ring array antenna, Progress in Electromagnetics Research B 25 (2010) 331–348.
[41] C.L. Blake, C.J. Merz, University of California at Irvine Repository of Machine Learning Databases, 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[42] L. Prechelt, Proben1: a set of neural network benchmark problems and benchmarking rules, Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, 1994. Available via ftp.ira.uka.de.