Information Sciences 177 (2007) 476–489

Support vector machines with genetic fuzzy feature transformation for biomedical data classification

Bo Jin, Y.C. Tang, Yan-Qing Zhang *

Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
Abstract

In this paper, we present a genetic fuzzy feature transformation method for support vector machines (SVMs) to achieve more accurate data classification. The given data are first transformed into a high-dimensional feature space by a fuzzy system; SVMs then map the data into a still higher-dimensional feature space and construct the hyperplane that makes the final decision. Genetic algorithms are used to optimize the fuzzy feature transformation so that the newly generated features help SVMs classify biomedical data more accurately under uncertainty. The experimental results show that the new genetic fuzzy SVMs have better generalization abilities than the traditional SVMs in terms of prediction accuracy.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Support vector machines; Feature transformation; Fuzzy logic; Genetic algorithms; Data classification; Bioinformatics
1. Introduction

In the last decade, support vector machines (SVMs) [4,17,18], as a kind of machine learning algorithm, have been widely used for data classification and pattern recognition in many fields, such as bioinformatics [15]. SVMs are structural risk minimization (SRM) based learning algorithms and have better generalization abilities than traditional empirical risk minimization (ERM) based learning algorithms. With the help of nonlinear kernel functions, SVMs can map the original input data into a high-dimensional feature space and find a separating hyperplane there. However, current SVMs have two limitations. One is that kernel functions should be symmetric and positive semidefinite (the Mercer conditions); otherwise SVMs might fail to find a global solution. SVMs' nonlinear processing ability and performance are mainly determined by the kernel functions. Zhang et al. [21] extended the range of usable kernels to those that are not required to be positive definite and presented an algorithm called hidden space support vector machines (HSSVMs), which inherits both the nonlinear processing ability of neural networks and the good generalization ability of SVMs.
This work is supported in part by NIH under P20 GM065762.
* Corresponding author. Tel.: +1 404 651 0682; fax: +1 404 463 9912.
E-mail addresses: [email protected] (B. Jin), [email protected] (Y.C. Tang), [email protected], [email protected] (Y.-Q. Zhang).

0020-0255/$ - see front matter © 2006 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2006.03.015
The other limitation of SVMs is the lack of feature preprocessing. The data to be classified are represented by feature vectors that are fed directly to SVMs without any modification. The support vectors (SVs) and the related Lagrange multipliers used to construct the optimal hyperplane appear to implement some kind of implicit feature scaling; however, this effect is rather restricted by the inner operation of the kernel functions, for example, in RBF kernels

\exp(-\gamma \|\vec{x} - \vec{y}\|^2).    (1)
Two feature vectors are operated on as units, and each feature is treated with the same importance. In many real-world applications such as biomedical data classification, the features might be noisy, imprecise, or unrelated, so the objects cannot be described precisely by these features. Under such uncertainty, it is difficult for SVMs to learn knowledge effectively from the raw data.

Considering the SVMs limitations mentioned above, in this paper we present an evolutionary fuzzy feature transformation method that gives SVMs feature-preprocessing abilities. In the system, the data are first transformed into a high-dimensional feature space through a fuzzy system, and then SVMs map the data into a still higher-dimensional feature space and construct the hyperplane there. To make the fuzzy system cooperate with SVMs, genetic algorithms are used to optimize the fuzzy feature transformation. With the help of the genetic algorithm, the fuzzy feature transformation provides extra, effective control over feature preprocessing, which may lighten the kernel's burden in SVMs classification and help SVMs classify data more accurately under uncertainty. The fuzzy feature transformation is a feature-preprocessing technique, so it is not restricted by the Mercer conditions. Furthermore, the transformation operates directly on each feature; it is a kind of explicit feature mapping and may be viewed as a complement to kernel operations, which act on whole vectors. Such feature transformations benefit SVMs classification.

Some other works have also combined SVMs with computational intelligence technologies. In [6,11], fuzzy logic is used to build fuzzy SVMs in which different training data make different contributions to their own classes. In [7], a polyhedral pyramid fuzzy membership function is defined to resolve unclassifiable regions. Fröhlich et al. [5] present a feature selection method in which theoretical bounds on the generalization error serve as fitness values to evaluate SVMs performance. In [2,20], clustering-based SVMs systems are presented that try to exploit the merits of data clustering to achieve high SVMs performance.

The rest of the paper is organized as follows. The basic SVMs theory for classification is introduced in Section 2. The fuzzy feature transformation is presented in Section 3. Section 4 describes the genetic fuzzy SVMs. Section 5 shows simulation results for different biomedical data classification applications. Finally, Section 6 gives conclusions and directions for future work.

2. SVMs

For a binary classification problem, let S = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_l, y_l)\} represent a training data set, where \vec{x}_i, i = 1, \ldots, l are vectors and y_i \in \{-1, +1\}, i = 1, \ldots, l are the class labels of \vec{x}_i. SVMs try to find an optimal hyperplane

\langle \vec{w}, \vec{x} \rangle + b = 0,    (2)
where \vec{w} \in R^n and b \in R. For the linearly separable case, the optimal hyperplane is found by solving the following constrained optimization problem:

Minimize    \frac{1}{2}\|\vec{w}\|^2,    (3)
Subject to    y_i(\langle \vec{w}, \vec{x}_i \rangle + b) \geq 1,    (4)

where i = 1, \ldots, l. For the linearly non-separable case, a set of nonnegative slack variables \xi_1, \ldots, \xi_l is introduced to penalize training errors. The constrained optimization problem is rewritten as
Minimize    \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i,    (5)
Subject to    y_i(\langle \vec{w}, \vec{x}_i \rangle + b) \geq 1 - \xi_i,    (6)

where C is the regularization parameter that controls the trade-off between the training error and the margin. According to the Kuhn–Tucker theorem, the problem described by Eqs. (5) and (6) can be transformed to

Maximize    \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle,    (7)
Subject to    \sum_i \alpha_i y_i = 0,    (8)

where 0 \leq \alpha_i \leq C, i, j = 1, \ldots, l, and the \vec{x}_i with \alpha_i \neq 0 are called support vectors. The decision function is

f(\vec{x}) = \operatorname{sign}(\langle \vec{w}, \vec{x} \rangle + b) = \operatorname{sign}\Big(\sum_i \alpha_i y_i \langle \vec{x}_i, \vec{x} \rangle + b\Big).    (9)
For a nonlinear problem, SVMs map the data from the original space R^n into a higher-dimensional feature space H,

\Phi: R^n \mapsto H.    (10)

The optimal separating hyperplane is found in the feature space H. Instead of calculating \Phi, the kernel function K is used to implement the mapping

K(\vec{x}, \vec{y}) = \langle \Phi(\vec{x}), \Phi(\vec{y}) \rangle.    (11)

The hyperplane is found by solving

Maximize    \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j),    (12)
Subject to    \sum_i \alpha_i y_i = 0.    (13)

The related decision function is

f(\vec{x}) = \operatorname{sign}\Big(\sum_i \alpha_i y_i K(\vec{x}, \vec{x}_i) + b\Big).    (14)
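To make Eqs. (11)–(14) concrete, the following short Python sketch evaluates the decision function of Eq. (14) with the RBF kernel of Eq. (1). The names are illustrative only, and the support vectors, multipliers alpha_i and bias b are assumed to come from an already trained model such as one produced by SVMlight.

import numpy as np

def rbf_kernel(x, y, gamma):
    # RBF kernel of Eqs. (1)/(11): K(x, y) = exp(-gamma * ||x - y||^2).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma):
    # Decision function of Eq. (14): f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b).
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)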
3. Fuzzy feature transformation

The motivation for using fuzzy logic techniques for feature transformation is that fuzzy logic provides human-like reasoning capabilities for interpreting imprecise and incomplete information, which is difficult to describe with mathematical models. With the fuzzy feature transformation, some useful knowledge may be extracted from the features; the newly generated vectors may tolerate possibly uncertain information and help SVMs achieve better classification accuracies.

3.1. General model for fuzzy feature transformation

Suppose each vector in the data set S = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_l, y_l)\} has n features z_{ui} \in R, u = 1, \ldots, n, i = 1, \ldots, l. These features are grouped into m groups G_j, j = 1, \ldots, m. Each group G_j has p_j features z^j_{ki} \in G_j, k = 1, \ldots, p_j, i = 1, \ldots, l. For clarity, we show an example. Suppose we have a data set with 3 vectors, each with 4 features: v_1 = (a_1, a_2, a_3, a_4), v_2 = (b_1, b_2, b_3, b_4), v_3 = (c_1, c_2, c_3, c_4). The features are grouped into 2 groups (see Fig. 1). Since different ways of grouping the features may have different effects on system performance, in the experiments of this paper we randomly shuffle the features several times and group them in a random way.
Fig. 1. Example of feature groups.
Which features should be grouped together is a challenging topic in the literature; in this paper we do not discuss it and assume that all features have already been assigned to groups.

Let Z^j_k be the collection of the features z^j_{ki}, i = 1, \ldots, l in group G_j. The fuzzy set A^j_k in Z^j_k is defined as

A^j_k = \{(z^j_{ki}, \mu_{A^j_k}(z^j_{ki})) \mid z^j_{ki} \in Z^j_k\},    (15)

where \mu_{A^j_k}(z^j_{ki}) is the grade of the membership function (MF) of z^j_{ki} and \mu_{A^j_k}: R \to [0, 1]. Linguistic terms L^{jr}_k, r = 1, \ldots, q are used to describe each feature z^j_{ki}, and b^j_h, h = 1, \ldots, q^{p_j} is the consequence value of one rule. The fuzzy rule base of group G_j is defined in Table 1; the rules enumerate all q^{p_j} combinations of linguistic terms, with the term index of the first feature varying fastest.

Table 1
Fuzzy rule base for group G_j

Rule 1: If z^j_{1i} is L^{j1}_1 And z^j_{2i} is L^{j1}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_1
Rule 2: If z^j_{1i} is L^{j2}_1 And z^j_{2i} is L^{j1}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_2
...
Rule q: If z^j_{1i} is L^{jq}_1 And z^j_{2i} is L^{j1}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_q
Rule q + 1: If z^j_{1i} is L^{j1}_1 And z^j_{2i} is L^{j2}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_{q+1}
Rule q + 2: If z^j_{1i} is L^{j2}_1 And z^j_{2i} is L^{j2}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_{q+2}
...
Rule 2q: If z^j_{1i} is L^{jq}_1 And z^j_{2i} is L^{j2}_2 And ... And z^j_{p_j i} is L^{j1}_{p_j} Then b^j_{2q}
...
Rule q^{p_j}: If z^j_{1i} is L^{jq}_1 And z^j_{2i} is L^{jq}_2 And ... And z^j_{p_j i} is L^{jq}_{p_j} Then b^j_{q^{p_j}}

The fuzzy feature transformation \tilde{F} transforms the p_j features of group G_j from the space R^{p_j} to a higher feature space R^{q^{p_j}}:

\tilde{F}: R^{p_j} \mapsto R^{q^{p_j}}.    (16)

Each Rule h, h = 1, \ldots, q^{p_j} generates a new feature \hat{z}^j_h by using the product conjunction operator:

\hat{z}^j_h = \prod_{k=1}^{p_j} \mu_{A^j_k}(z^j_{ki}) \, b^j_h.    (17)

Here fuzzy rule aggregation and defuzzification are not defined; we may instead think of SVMs as a special kind of aggregation and defuzzification function that implements the fuzzy input–output mapping W:

W: R^{\sum_{j=1}^{m} q^{p_j}} \mapsto \hat{H} \mapsto \{-1, +1\},    (18)

where the SVMs kernel implements the following mapping:

R^{\sum_{j=1}^{m} q^{p_j}} \mapsto \hat{H}.    (19)
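For instance, with q = 2 linguistic terms and groups of size p_j = 2 (the setting used in Section 3.2 below), each group's rule base contains q^{p_j} = 2^2 = 4 rules, so Eq. (16) maps every pair of raw features to four new features; a data set with n = 8 features, such as the Pima data set of Section 5, is thus transformed into 4 × ⌈8/2⌉ = 16 features.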
3.2. Experimental model for fuzzy feature transformation

In this paper, for simplicity, triangular MFs are used to map each z^j_{ki} to a grade between 0 and 1, and only two linguistic terms, Small and Large, are used to describe each feature (see Fig. 2). In Fig. 2, z^j_{ki(\max)} is the maximal value of z^j_{ki}, i = 1, \ldots, l.

Fig. 2. Membership functions of the linguistic terms of each feature z^j_{ki} in G_j.

The features are grouped into ⌈n/2⌉ groups of size 2 according to the feature number. For example, the first two features form the first group, the third and fourth features form the second group, and so on. When n is odd, the last feature and the first feature form the ⌈n/2⌉th group. Four rules are generated in each group (see Table 2).

Table 2
Four generated rules in each group of size 2

Rule 1: If z^j_{1i} is Small And z^j_{2i} is Small Then b^j_1
Rule 2: If z^j_{1i} is Small And z^j_{2i} is Large Then b^j_2
Rule 3: If z^j_{1i} is Large And z^j_{2i} is Small Then b^j_3
Rule 4: If z^j_{1i} is Large And z^j_{2i} is Large Then b^j_4

A fuzzy system can represent knowledge with fuzzy if–then rules; however, it lacks the adaptability to cooperate with SVMs automatically [9]. Hence, in the next section we use genetic algorithms to optimize the fuzzy system. There are many tunable parameters in the fuzzy system, but more flexibility may make the system more complex. In our work, only the consequence values b^j_h, h = 1, \ldots, 4 are optimized.
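To make the experimental model concrete, the following Python sketch (the names are ours, not from the paper's software) implements the transformation of Table 2 and Eq. (17) for groups of size 2. It assumes the complementary linear ramps suggested by Fig. 2 for the Small and Large terms; the exact shape of the published MFs may differ slightly.

import numpy as np

def mu_small(z, z_max):
    # "Small" membership: grade 1 at z = 0, falling linearly to 0 at z_max
    # (our reading of the triangular MFs of Fig. 2).
    return np.clip(1.0 - z / z_max, 0.0, 1.0)

def mu_large(z, z_max):
    # "Large" membership: grade 0 at z = 0, rising linearly to 1 at z_max.
    return np.clip(z / z_max, 0.0, 1.0)

def transform_group(z1, z2, z1_max, z2_max, b):
    # The four rules of Table 2 applied to one feature pair, Eq. (17):
    # each rule's product of membership grades is scaled by its consequence
    # value. Rule order: (Small,Small), (Small,Large), (Large,Small), (Large,Large).
    pairs = [(mu_small(z1, z1_max), mu_small(z2, z2_max)),
             (mu_small(z1, z1_max), mu_large(z2, z2_max)),
             (mu_large(z1, z1_max), mu_small(z2, z2_max)),
             (mu_large(z1, z1_max), mu_large(z2, z2_max))]
    return [g1 * g2 * bh for (g1, g2), bh in zip(pairs, b)]

def fuzzy_transform(x, x_max, consequences):
    # Transform one vector: each consecutive feature pair yields 4 features.
    # Assumes an even n; for odd n the paper pairs the last feature with the
    # first. x_max holds the per-feature maxima (nonzero after normalization).
    out = []
    for j in range(0, len(x) - 1, 2):
        out.extend(transform_group(x[j], x[j + 1], x_max[j], x_max[j + 1],
                                   consequences[j // 2]))
    return np.array(out)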
4. Genetic fuzzy SVMs

Many approaches have been proposed to make fuzzy logic based or related systems adaptable. One kind is Neuro–Fuzzy systems [8,9,12,19], which incorporate neural network learning methods into fuzzy inference systems. The other is genetic algorithm based methods for optimizing fuzzy logic systems [3,16]. Genetic algorithms are optimization algorithms based on the principles of evolution. Due to their simplicity and robustness, they are also employed to optimize other systems, such as the parameters of SVMs [5]. In this section, we present the genetic fuzzy SVMs, in which genetic algorithms make the fuzzy feature transformation more adaptive for SVMs-based classification.

4.1. Chromosome definition

In the experiment, the given data set S is split into a training set S_1 and an unseen testing set S_2 in some proportion. Let X_i denote the population in generation i, where i = 1, \ldots, n_1. The population X_i has m_1 chromosomes o_{ij}, j = 1, \ldots, m_1, and each chromosome represents one combination of the consequence values of the fuzzy rules. The format of chromosome o_{ij} is shown in Table 3; each gene is the consequence value of one fuzzy rule. Let X_{ij} denote the set of consequence values decoded from chromosome o_{ij}, and let b^c_{t(ij)} denote the consequence value b^c_t decoded from gene 4c - 4 + t of chromosome o_{ij}, where t = 1, \ldots, 4 and c = 1, \ldots, ⌈n/2⌉. In the experiment, the value of each gene in chromosome o_{ij} is a real value between -1 and 1:
Table 3
Format of chromosome o_{ij}

Gene number:  1      2      3      4      ...  4⌈n/2⌉-3        4⌈n/2⌉-2        4⌈n/2⌉-1        4⌈n/2⌉
o_{ij}:       b^1_1  b^1_2  b^1_3  b^1_4  ...  b^{⌈n/2⌉}_1     b^{⌈n/2⌉}_2     b^{⌈n/2⌉}_3     b^{⌈n/2⌉}_4

X_{ij} = \{\{b^1_{1(ij)}, b^1_{2(ij)}, b^1_{3(ij)}, b^1_{4(ij)}\}, \ldots, \{b^{⌈n/2⌉}_{1(ij)}, b^{⌈n/2⌉}_{2(ij)}, b^{⌈n/2⌉}_{3(ij)}, b^{⌈n/2⌉}_{4(ij)}\}\}.    (20)
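A minimal sketch of the chromosome encoding of Table 3 and Eq. (20), assuming a simple NumPy representation (the function names are illustrative):

import numpy as np

def random_chromosome(n_features, rng=None):
    # Table 3: four consequence-value genes per feature group, each a real
    # value in [-1, 1]; there are ceil(n/2) groups of size 2 (Section 3.2).
    rng = rng or np.random.default_rng()
    n_groups = -(-n_features // 2)          # ceil(n/2)
    return rng.uniform(-1.0, 1.0, size=4 * n_groups)

def decode(chromosome):
    # Eq. (20): gene 4c - 4 + t holds consequence b_t of group c, so row c - 1
    # of the reshaped array is that group's [b1, b2, b3, b4].
    return chromosome.reshape(-1, 4)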
4.2. Fitness and performance evaluation of SVMs

There are two main approaches to evaluating SVMs performance. One is to use the theoretical bounds on the generalization error; the other is to use k-fold cross-validation. In this paper, we use k-fold cross-validation.

Let \tilde{S}_1(X_{ij}) denote the training set S_1 transformed by the fuzzy rule base system with the consequence value set X_{ij}. \tilde{S}_1(X_{ij}) is separated into k disjoint subsets \tilde{S}_{1t}(X_{ij}), t = 1, \ldots, k. To perform k-fold cross-validation on the consequence value set X_{ij}, each time the data set K_t(X_{ij}) is used for SVMs training and the data set \tilde{S}_{1t}(X_{ij}) is used to test the trained SVMs model and generate the prediction accuracy Acc_t:

K_t(X_{ij}) = \tilde{S}_1(X_{ij}) - \tilde{S}_{1t}(X_{ij}).    (21)

The same training–testing procedure is executed k times on the data set \tilde{S}_1(X_{ij}), yielding k prediction accuracies, one per subset \tilde{S}_{1t}(X_{ij}). The fitness degree f_{ij} of chromosome o_{ij} is calculated as

f_{ij} = \frac{1}{k}\sum_{t=1}^{k} Acc_t.    (22)

4.3. Operations of genetic algorithms

In the system, the genetic algorithm operations follow the algorithms described in [14]. To generate a new population X_{i+1}, elitist and select operations are executed. In the elitist operation, if the best chromosome in generation i is worse than the best chromosome in generation i - 1, the best chromosome in generation i - 1 replaces the worst chromosome in generation i. In the select operation, chromosomes are selected according to the cumulative fitness \tilde{q}_{ij}, where F_i is the total fitness of population X_i:

\tilde{q}_{ij} = \sum_{k=1}^{j} \frac{f_{ik}}{F_i},    (23)

F_i = \sum_{j=1}^{m_1} f_{ij}.    (24)
Each time, a random number r is first generated in the range [0, 1]. If r is smaller than \tilde{q}_{i1}, chromosome o_{i1} is selected; otherwise chromosome o_{ik}, 2 \leq k \leq m_1 is selected under the condition

\tilde{q}_{i,k-1} < r \leq \tilde{q}_{i,k}.    (25)
After running the above selection procedure m_1 times, the new population is generated. Then the crossover and mutation operations are applied to the chromosomes in the new population. In the crossover operation, two chromosomes are first selected randomly as the parents, and then a single point is randomly chosen; the two parents are crossed over at that point to generate two children. The crossover rate in the experiment is 0.7. In the mutation operation, some chromosomes are chosen randomly, and some genes of the chosen chromosomes are randomly selected and replaced by a random value in the range [-1, 1]. The mutation rate is 0.1 in the experiment.
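The selection, crossover and mutation steps might be sketched as follows. This is a simplified reading of Eqs. (23)–(25) and the rates above; details such as the pairing of parents are our assumptions, since the paper delegates them to [14].

import numpy as np

rng = np.random.default_rng()

def select(population, fitness_values):
    # Roulette-wheel selection via the cumulative fitness of Eqs. (23)-(25):
    # chromosome k is picked when q_{k-1} < r <= q_k.
    q = np.cumsum(fitness_values) / np.sum(fitness_values)
    return [population[np.searchsorted(q, rng.random())].copy()
            for _ in range(len(population))]

def crossover(pop, rate=0.7):
    # Single-point crossover on randomly ordered parent pairs.
    rng.shuffle(pop)
    for a, b in zip(pop[::2], pop[1::2]):
        if rng.random() < rate:
            point = rng.integers(1, len(a))
            a[point:], b[point:] = b[point:].copy(), a[point:].copy()
    return pop

def mutate(pop, rate=0.1):
    # Replace a randomly selected gene of randomly chosen chromosomes by a
    # random value in [-1, 1].
    for chrom in pop:
        if rng.random() < rate:
            g = rng.integers(len(chrom))
            chrom[g] = rng.uniform(-1.0, 1.0)
    return pop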
4.4. Architecture and learning procedure

The learning procedure of the genetic fuzzy SVMs is shown in Table 4. The architecture of the genetic fuzzy SVMs is shown in Fig. 3, including the learning phase and the classifying phase for an even number of features. It would also be reasonable to optimize the SVMs parameters together with the fuzzy system in the genetic fuzzy SVMs; in this paper, however, we test the genetic fuzzy SVMs with several groups of SVMs parameters in order to evaluate the fuzzy feature transformation as a complement to a given kernel.
Table 4
Learning procedure of the genetic fuzzy SVMs

For each generation i from 1 to n_1
    Do select operation
    Do crossover operation
    Do mutation operation
    Do k-fold cross-validation {
        For each fuzzy rule consequence value set X_{ij} in generation i, where j from 1 to m_1
            Transform S_1 to \tilde{S}_1(X_{ij})
            Split \tilde{S}_1(X_{ij}) into k disjoint subsets \tilde{S}_{1t}(X_{ij}), t = 1, \ldots, k
            Repeat t from 1 to k
                Train SVMs on K_t(X_{ij}) = \tilde{S}_1(X_{ij}) - \tilde{S}_{1t}(X_{ij})
                Test SVMs on \tilde{S}_{1t}(X_{ij})
                Calculate Acc_t
            Repeat end
            Calculate fitness f_{ij} = \frac{1}{k}\sum_{t=1}^{k} Acc_t
        For end
    }
    Do elitist operation
For end
Train SVMs on the data set \tilde{S}_1(\hat{X}), which is transformed using the best consequence value set \hat{X} among the n_1 generations
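Reusing the helper functions sketched in Sections 3.2 and 4.1–4.3 (fuzzy_transform, decode, random_chromosome, select, crossover, mutate), the loop of Table 4 might be condensed as below. scikit-learn's SVC is used as a stand-in for SVMlight, and the elitist operation is omitted for brevity, so this is an outline under those assumptions rather than the authors' implementation.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC  # stand-in for SVMlight (an assumption)

def fitness(chromosome, X1, y1, C, gamma, k=4):
    # Eq. (22): mean accuracy over k-fold cross-validation on the
    # fuzzy-transformed training set (k = 4 in the experiments).
    conseq = decode(chromosome)                  # Section 4.1 sketch
    x_max = X1.max(axis=0)
    Xt = np.array([fuzzy_transform(x, x_max, conseq) for x in X1])
    accs = []
    for tr, te in KFold(n_splits=k).split(Xt):   # K_t of Eq. (21)
        clf = SVC(C=C, gamma=gamma).fit(Xt[tr], y1[tr])
        accs.append(clf.score(Xt[te], y1[te]))
    return float(np.mean(accs))

def evolve(X1, y1, C, gamma, pop_size=1000, n_generations=30):
    # Outline of Table 4 (elitist operation omitted for brevity).
    pop = [random_chromosome(X1.shape[1]) for _ in range(pop_size)]
    for _ in range(n_generations):
        fits = np.array([fitness(c, X1, y1, C, gamma) for c in pop])
        pop = mutate(crossover(select(pop, fits)))
    fits = np.array([fitness(c, X1, y1, C, gamma) for c in pop])
    return pop[int(np.argmax(fits))]             # best consequence value set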
Fig. 3. Architecture of genetic fuzzy SVMs.
5. Experiment and analysis

In the experiment, we compare the genetic fuzzy SVMs with the traditional SVMs on four biomedical data sets in terms of prediction accuracy. All of these biomedical data sets are public and available from the UCI Machine Learning Repository [1].

5.1. Experimental data

The first data set is the Pima Indians Diabetes two-class data set (Pima data set), created by the National Institute of Diabetes and Digestive and Kidney Diseases for diabetes prediction. The Pima data set has 768 examples with eight features and one binary class label. The second data set is the Bupa liver disorders two-class data set (Bupa data set), created by Bupa Medical Research Ltd. for liver disease prediction. The Bupa data set has 345 examples with six features. The third data set is a processed heart disease diagnosis data set (Cleveland data set), collected by the Cleveland Clinic Foundation, which contains 303 examples with 13 features and five classes. In the experiment, we omit six examples with missing feature values and concentrate on distinguishing between presence (values 1, 2, 3, 4) and absence (value 0). The fourth data set is the Wisconsin breast cancer data set [13] (Wisconsin data set), obtained from the University of Wisconsin Hospitals. This data set contains 699 examples with nine useful features and binary class labels. We omit 16 examples with missing feature values.

5.2. Experimental setup

For each data set, each feature is first assigned a feature number, and then the features are randomly shuffled to generate new feature orders. The feature information and the new feature orders are shown in Tables 5–8; for more feature information, please see [1]. The genetic fuzzy SVMs are evaluated on each data set with five different feature orders. Note that the performance of the traditional SVMs is not affected by the feature orders.

In the experiment, all feature values are normalized to [0, 1]. Each data set is randomly shuffled and split into a training set and a testing set in the proportion of 3:1 (a minimal preprocessing sketch is given after Table 5). The testing set is unseen during training and is only used to evaluate the performance of the classifiers. During training, 4-fold cross-validation on the training data is used to determine the fitness degree of each individual. The population size is 1000, and the number of generations is 30. The initial value of each gene is randomly generated from the range [-1, 1]. RBF kernel functions are used in both the genetic fuzzy SVMs and the traditional SVMs. Six groups of SVMs parameters (C, γ) are randomly generated in (0, 100) and listed in Table 9. The software of the genetic fuzzy SVMs is developed based on Joachims' SVMlight 4.0 [10] and Denis Cormier's GA code [14].

Table 5
Pima data set
Feature description                    Feature number
Number of times pregnant               0
Plasma glucose concentration test      1
Diastolic blood pressure               2
Triceps skin-fold thickness            3
Serum insulin                          4
Body mass index                        5
Diabetes pedigree function             6
Age                                    7

Feature orders:
Shuffle0: 0 1 2 3 4 5 6 7
Shuffle1: 6 3 2 7 4 5 0 1
Shuffle2: 6 1 5 0 3 7 4 2
Shuffle3: 5 3 6 0 1 7 4 2
Shuffle4: 3 1 5 0 7 6 4 2
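As a small illustration of the setup in Section 5.2, the following sketch normalizes the features to [0, 1], applies one of the shuffle orders of Tables 5–8, and makes the 3:1 train/test split. The seeded generator and function name are our assumptions; the paper does not specify them.

import numpy as np

def prepare(X, y, order, rng=None):
    # Section 5.2 setup: per-feature min-max scaling, feature reordering,
    # then a random 3:1 split into training and testing examples.
    rng = rng or np.random.default_rng(0)        # seed is an assumption
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X = X[:, order]                  # e.g. Pima Shuffle1: [6, 3, 2, 7, 4, 5, 0, 1]
    idx = rng.permutation(len(X))    # random shuffle of the examples
    cut = int(len(X) * 3 / 4)        # 3:1 proportion
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]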
Table 6
Bupa data set

Feature description    Feature number
mcv                    0
alkphos                1
sgpt                   2
sgot                   3
gammagt                4
drinks                 5

Feature orders:
Shuffle0: 0 1 2 3 4 5
Shuffle1: 0 1 2 5 3 4
Shuffle2: 5 1 2 0 3 4
Shuffle3: 5 4 1 3 0 2
Shuffle4: 1 5 0 3 2 4
Table 7
Wisconsin data set

Feature description            Feature number
Clump thickness                0
Uniformity of cell size        1
Uniformity of cell shape       2
Marginal adhesion              3
Single epithelial cell size    4
Bare nuclei                    5
Bland chromatin                6
Normal nucleoli                7
Mitoses                        8

Feature orders:
Shuffle0: 0 1 2 3 4 5 6 7 8
Shuffle1: 0 4 5 3 6 7 1 2 8
Shuffle2: 8 2 3 7 5 0 6 1 4
Shuffle3: 8 6 0 2 4 3 1 7 5
Shuffle4: 1 6 2 5 7 3 0 4 8
Table 8
Cleveland data set

Feature description    Feature number
age                    0
sex                    1
cp                     2
trestbps               3
chol                   4
fbs                    5
restecg                6
thalach                7
exang                  8
oldpeak                9
slope                  10
ca                     11
thal                   12

Feature orders:
Shuffle0: 0 1 2 3 4 5 6 7 8 9 10 11 12
Shuffle1: 9 12 1 10 3 5 6 7 8 4 0 11 2
Shuffle2: 9 12 4 10 8 3 1 11 5 6 0 7 2
Shuffle3: 1 12 6 5 11 3 4 2 10 9 8 7 0
Shuffle4: 1 12 10 8 5 6 7 2 4 11 3 0 9
Table 9
Six groups of SVMs parameters (C, γ)

     Group 1   Group 2   Group 3   Group 4   Group 5   Group 6
C    73.3      68.2      2         5.7       92.7      48.4
γ    41.5      22.7      66.3      63.8      46.6      34.3
5.3. Experimental results

The experimental results are shown in Tables 10–21. Tables 10, 13, 16 and 19 give the testing accuracies of the genetic fuzzy SVMs and the traditional SVMs on the four testing sets. Tables 11, 14, 17 and 20 give the training accuracies on the training sets. Tables 12, 15, 18 and 21 list the best fitness values after 30 generations.
Table 10
Testing accuracies on the Pima testing set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4   Average
1              0.71   0.76       0.77       0.73       0.69       0.72       0.73
2              0.7    0.71       0.78       0.74       0.7        0.73       0.73
3              0.7    0.72       0.78       0.77       0.72       0.75       0.75
4              0.69   0.73       0.71       0.713      0.73       0.76       0.73
5              0.71   0.77       0.77       0.7        0.71       0.71       0.73
6              0.69   0.73       0.76       0.75       0.68       0.77       0.74
Average        0.7    0.74       0.76       0.73       0.71       0.74       0.735
Table 11
Training accuracies on the Pima training set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              1.0    0.89       0.87       0.86       0.95       0.96
2              1.0    0.87       0.89       0.86       0.95       0.86
3              1.0    0.86       0.86       0.89       0.92       0.86
4              1.0    0.92       0.86       0.95       0.94       0.88
5              1.0    0.88       0.88       1.0        1.0        1.0
6              1.0    0.92       0.85       0.95       0.98       0.82
Table 12
Best fitness values on the Pima training set after 30 generations

(C, γ) group   Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              0.75       0.75       0.74       0.74       0.74
2              0.75       0.75       0.76       0.75       0.76
3              0.77       0.77       0.77       0.76       0.76
4              0.76       0.76       0.75       0.76       0.77
5              0.74       0.73       0.74       0.73       0.73
6              0.74       0.76       0.75       0.74       0.76
Table 13
Testing accuracies on the Bupa testing set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4   Average
1              0.62   0.67       0.67       0.67       0.6        0.78       0.68
2              0.66   0.74       0.7        0.73       0.72       0.76       0.73
3              0.69   0.71       0.72       0.69       0.72       0.73       0.71
4              0.67   0.71       0.71       0.74       0.7        0.73       0.72
5              0.61   0.69       0.66       0.76       0.57       0.71       0.68
6              0.66   0.72       0.73       0.76       0.74       0.72       0.73
Average        0.65   0.71       0.7        0.73       0.68       0.74       0.708
From Tables 10, 13, 16 and 19, we find that the genetic fuzzy SVMs achieve higher average classification accuracies than the traditional SVMs on all four data sets. For the Cleveland data set in particular, the testing accuracy is increased by about 21% by the genetic fuzzy SVMs. Referring to Table 17, we can see that the
Table 14
Training accuracies on the Bupa training set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              0.74   0.73       0.72       0.74       0.73       0.74
2              0.74   0.74       0.74       0.74       0.74       0.74
3              0.75   0.73       0.73       0.74       0.73       0.75
4              0.74   0.76       0.73       0.73       0.74       0.74
5              0.74   0.72       0.74       0.73       0.73       0.74
6              0.74   0.74       0.74       0.76       0.73       0.74
Table 15
Best fitness values on the Bupa training set after 30 generations

(C, γ) group   Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              0.74       0.73       0.72       0.74       0.73
2              0.74       0.74       0.74       0.74       0.74
3              0.75       0.73       0.73       0.74       0.73
4              0.74       0.76       0.73       0.73       0.74
5              0.74       0.72       0.74       0.73       0.73
6              0.74       0.74       0.74       0.76       0.73
Table 16
Testing accuracies on the Cleveland testing set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4   Average
1              0.6    0.78       0.82       0.68       0.8        0.77       0.77
2              0.62   0.78       0.84       0.84       0.82       0.78       0.81
3              0.54   0.82       0.76       0.8        0.84       0.8        0.80
4              0.55   0.81       0.77       0.85       0.82       0.73       0.8
5              0.55   0.7        0.76       0.73       0.78       0.76       0.75
6              0.62   0.8        0.81       0.81       0.78       0.77       0.79
Average        0.58   0.78       0.79       0.79       0.81       0.77       0.787
Table 17
Training accuracies on the Cleveland training set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              1.0    1.0        1.0        1.0        1.0        1.0
2              1.0    1.0        1.0        1.0        1.0        1.0
3              1.0    0.99       1.0        0.98       0.99       0.97
4              1.0    1.0        1.0        1.0        0.99       1.0
5              1.0    1.0        1.0        1.0        1.0        1.0
6              1.0    1.0        1.0        1.0        1.0        1.0
traditional SVMs always achieve 100% training accuracy with every group of parameters but obtain very low testing accuracies. This is an overfitting problem: data with uncertainty are likely not recognized well by the traditional SVMs. The genetic fuzzy SVMs work well because the evolutionary fuzzy feature transformation may tolerate some of the potential uncertainty and help SVMs classify better.
Table 18
Best fitness values on the Cleveland training set after 30 generations

(C, γ) group   Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              0.8        0.8        0.8        0.81       0.81
2              0.81       0.81       0.82       0.82       0.82
3              0.79       0.79       0.78       0.8        0.81
4              0.78       0.77       0.79       0.79       0.79
5              0.79       0.79       0.79       0.8        0.8
6              0.8        0.8        0.8        0.81       0.83
Table 19
Testing accuracies on the Wisconsin testing set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4   Average
1              0.94   0.98       0.96       0.96       0.97       0.97       0.97
2              0.95   0.98       0.95       0.96       0.96       0.98       0.97
3              0.93   0.96       0.97       0.98       0.96       0.96       0.97
4              0.93   0.98       0.95       0.96       0.96       0.96       0.96
5              0.94   0.98       0.98       0.97       0.97       0.97       0.97
6              0.94   0.96       0.96       0.96       0.97       0.96       0.96
Average        0.94   0.97       0.96       0.97       0.97       0.97       0.967
Table 20
Training accuracies on the Wisconsin training set

(C, γ) group   SVMs   Genetic fuzzy SVMs
                      Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              1.0    1.0        1.0        1.0        1.0        1.0
2              1.0    1.0        1.0        1.0        1.0        1.0
3              1.0    1.0        1.0        1.0        1.0        0.99
4              1.0    1.0        1.0        1.0        1.0        1.0
5              1.0    1.0        1.0        1.0        1.0        1.0
6              1.0    1.0        1.0        1.0        1.0        1.0
Table 21
Best fitness values on the Wisconsin training set after 30 generations

(C, γ) group   Shuffle0   Shuffle1   Shuffle2   Shuffle3   Shuffle4
1              0.97       0.97       0.97       0.96       0.96
2              0.97       0.97       0.97       0.97       0.97
3              0.97       0.96       0.96       0.96       0.97
4              0.97       0.96       0.97       0.97       0.96
5              0.97       0.97       0.97       0.97       0.96
6              0.97       0.97       0.97       0.96       0.96
For the Wisconsin data set, the accuracy improvement is only about 2–3%. The traditional SVMs already perform very well there: all testing accuracies achieved with the six groups of parameters are at or above 93%, and there are no errors in any training. Even in this situation, the genetic fuzzy SVMs still show better performance, with some testing accuracies reaching 98%.
Table 22
Example of fuzzy rules

Rule 1: If z^1_1 is Small And z^1_2 is Small Then b^1_1 = 0.1196
Rule 2: If z^1_1 is Small And z^1_2 is Large Then b^1_2 = 0.2364
Rule 3: If z^1_1 is Large And z^1_2 is Small Then b^1_3 = 0.117
Rule 4: If z^1_1 is Large And z^1_2 is Large Then b^1_4 = 0.011
Rule 5: If z^2_3 is Small And z^2_4 is Small Then b^2_1 = 0.1242
Rule 6: If z^2_3 is Small And z^2_4 is Large Then b^2_2 = 0.138
Rule 7: If z^2_3 is Large And z^2_4 is Small Then b^2_3 = 0.058
Rule 8: If z^2_3 is Large And z^2_4 is Large Then b^2_4 = 0.0648
Rule 9: If z^3_5 is Small And z^3_6 is Small Then b^3_1 = 0.629
Rule 10: If z^3_5 is Small And z^3_6 is Large Then b^3_2 = 0.1226
Rule 11: If z^3_5 is Large And z^3_6 is Small Then b^3_3 = 0.0704
Rule 12: If z^3_5 is Large And z^3_6 is Large Then b^3_4 = 0.4016
Rule 13: If z^4_7 is Small And z^4_8 is Small Then b^4_1 = 0.1662
Rule 14: If z^4_7 is Small And z^4_8 is Large Then b^4_2 = 0.2658
Rule 15: If z^4_7 is Large And z^4_8 is Small Then b^4_3 = 0.2496
Rule 16: If z^4_7 is Large And z^4_8 is Large Then b^4_4 = 0.0992
For the Bupa and Pima data sets, total average accuracy improvements of about 3.5% and 5.8%, respectively, are achieved. However, a few testing accuracies of the genetic fuzzy SVMs are lower than those of the traditional SVMs with the same configuration of SVMs parameters. Most of these cases occurred with the third shuffle (Shuffle3), whose average accuracies were lower than the overall average accuracies. The reason is probably that the features are grouped in a random way, which can produce weak feature groups. Nevertheless, on average our system still performs better than the traditional SVMs.

Finally, Table 22 shows a group of fuzzy rules generated with the SVMs parameters C = 92.7, γ = 46.6 by training the system on the Pima training set with the Shuffle0 feature order.

6. Conclusion and future work

In this paper, we first discuss the limitations of SVMs and then propose an evolutionary fuzzy feature transformation method to help SVMs learn knowledge from data with uncertainty. The experimental results show that the new genetic fuzzy SVMs have better generalization abilities than the traditional SVMs in the classification of biomedical data. The experimental results also show that different groupings of features may have different effects on system performance. In the future, we will consider how to optimize the membership of the fuzzy feature groups. We will also investigate other kinds of fuzzy feature transformations using more effective fuzzy inference methods, such as the TSK method.

References

[1] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, 1998.
[2] J.-H. Chiang, P.-Y. Hao, A new kernel-based fuzzy clustering approach: support vector clustering with cell growing, IEEE Transactions on Fuzzy Systems 11 (4) (2003) 518–527.
[3] O. Cordon, F. Herrera, F. Hoffmann, L. Magdalena, Genetic fuzzy systems: evolutionary tuning and learning of fuzzy knowledge bases, in: K. Hirota, G.J. Klir, E. Sanchez, P.-Z. Wang, R.R. Yager (Eds.), Advances in Fuzzy Systems – Applications and Theory, vol. 19, World Scientific, Singapore, 2001.
[4] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[5] H. Fröhlich, O. Chapelle, B. Schölkopf, Feature selection for support vector machines using genetic algorithms, in: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, 2003, pp. 142–148.
[6] H.-P. Huang, Y.-H. Liu, Fuzzy support vector machines for pattern recognition and data mining, International Journal of Fuzzy Systems 4 (3) (2002) 826–835.
[7] T. Inoue, S. Abe, Fuzzy support vector machines for pattern classification, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN'01), 2001, pp. 1449–1454.
[8] J.-S.R. Jang, ANFIS: Adaptive-network-based fuzzy inference systems, IEEE Transactions on Systems, Man, and Cybernetics 23 (3) (1993) 665–685.
[9] J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, New Jersey, 1996.
[10] T. Joachims, Making large-scale SVM learning practical, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[11] C.-F. Lin, S.-D. Wang, Fuzzy support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 464–471.
[12] C.-T. Lin, C.S.G. Lee, Neural-network-based fuzzy logic control and decision system, IEEE Transactions on Computers 40 (12) (1991) 1320–1336.
[13] O.L. Mangasarian, W.H. Wolberg, Cancer diagnosis via linear programming, SIAM News 23 (5) (1990) 1–18.
[14] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, Berlin, 1996.
[15] W.S. Noble, Support vector machine applications in computational biology, in: B. Schölkopf, K. Tsuda, J.-P. Vert (Eds.), Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.
[16] J.D. Schaffer, D. Whitley, L.J. Eshelman, Combinations of genetic algorithms and neural networks: a survey of the state of the art, in: Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992, pp. 1–37.
[17] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[18] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[19] L.-X. Wang, J.M. Mendel, Back-propagation fuzzy systems as nonlinear dynamic system identifiers, in: Proceedings of the IEEE International Conference on Fuzzy Systems, San Diego, March 1992, pp. 1409–1418.
[20] H. Yu, J. Yang, J. Han, Classifying large data sets using SVMs with hierarchical clusters, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, Washington, DC, 2003, pp. 306–315.
[21] L. Zhang, W.-D. Zhou, L.-C. Jiao, Hidden space support vector machines, IEEE Transactions on Neural Networks 15 (6) (2004) 1424–1434.