Predictor output sensitivity and feature similarity-based feature selection


Fuzzy Sets and Systems 159 (2008) 422 – 434 www.elsevier.com/locate/fss

A. Verikas a,b,*, M. Bacauskiene a, D. Valincius a, A. Gelzinis a

a Department of Applied Electronics, Kaunas University of Technology, LT-51368 Kaunas, Lithuania
b Intelligent Systems Laboratory, Halmstad University, Box 823, S 301 18 Halmstad, Sweden

Received 8 August 2006; received in revised form 1 May 2007; accepted 31 May 2007 Available online 6 June 2007

Abstract

This paper is concerned with a feature selection technique capable of generating an efficient feature set in a few selection steps. The proposed feature saliency measure is based on two factors, namely, the fuzzy derivative of the predictor output with respect to the feature and the similarity between the feature being considered and the feature set. The use of the fuzzy derivative enables modelling the vagueness that occurs in estimating the predictor output sensitivity. The feature similarity measure employed helps to avoid using very redundant features. Experimental investigations performed on five real-world problems have shown the effectiveness of the proposed feature selection technique. The technique removed a large number of features from the original data sets without reducing the classification accuracy of a classifier. In fact, the accuracy of the classifiers utilizing the reduced feature sets was higher than that of the classifiers exploiting all the original features.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Feature selection; Classification; Fuzzy sets; Support vector machine

1. Introduction

Numerous features can usually be measured in many pattern recognition or machine learning applications. Not all of the features, however, are important for a specific task. Some of the variables may be redundant or even irrelevant, and better performance may usually be achieved by discarding such variables. Moreover, as the number of features used grows, the number of training samples required to estimate the model parameters grows exponentially [12]. Therefore, in many practical applications we need to reduce the data dimensionality. Principal component analysis (PCA) [12] is the traditional technique for reducing dimensionality. The reduction is achieved by projecting the original data onto the first few principal directions. The new features are linear combinations of the original ones, so none of the original variables can be discarded. This can be a serious disadvantage, since some variables, though irrelevant or redundant, can be very costly to obtain in some applications. Although feature selection is the main concern of this paper, we use the PCA technique to make some comparisons.

We gratefully acknowledge the support we have received from the Lithuanian State Science and Studies Foundation (EUREKA Project E!3681).

∗ Corresponding author. Intelligent Systems Laboratory, Halmstad University, Box 823, S 301 18 Halmstad, Sweden. Tel.: +47 35 167 140.

E-mail addresses: [email protected] (A. Verikas), [email protected] (M. Bacauskiene), [email protected] (D. Valincius), [email protected] (A. Gelzinis).

0165-0114/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.fss.2007.05.020


In a general case, only an exhaustive search can guarantee an optimal solution to the feature selection problem. Two choices have to be made when solving the problem. The first concerns the search algorithm applied, while the second relates to the feature saliency measure. A large variety of search techniques that result in a sub-optimal feature set have been proposed [16], ranging from sequential forward selection or backward elimination [18,15] to sequential forward floating selection [24], genetic [35,20], tabu [36], or branch and bound algorithm based search [8]. To assess feature saliency, measures based on predictor output sensitivity [4,23,28,38,9,1,13], mutual information between the feature space and the classes [7], fuzzy sets [21], the Fisher index [17], the reaction of the validation set classification error to the removal of individual features [26,30,2], and the magnitude of the neural network input layer weights (in the case of a neural network based feature selector) [26,3] have been applied. The weights-based feature saliency measures rely on the idea that weights connected to important features attain large absolute values, while weights connected to unimportant features attain values near zero. However, a saliency measure alone does not indicate how many of the candidate features should be used. Therefore, some feature selection procedures are based on comparing the saliency of a candidate feature with the saliency of a noise feature [23,28,3,5]. A formal hypothesis test of the statistical significance of the difference between the saliency of a candidate and a noise feature is sometimes applied to assess the comparison result [28]. The number of features to be chosen is often identified by a significant decrease of the classification accuracy on the validation data set when eliminating a feature [26,30,2] or by applying a formal hypothesis test [29]. A feature selection technique proposed by Weston et al. [33] tries to find a feature subset minimizing a bound on the classification error probability. The technique is limited to separable classification problems solved by a support vector machine (SVM).

A significant number of feature saliency measures used for feature selection are based on the predictor's output sensitivity [4,23,28,38,9,1,13,25]. Eq. (1) [28] and Eq. (2) [25] exemplify two such measures:

$$\Lambda_i = \sum_{j=1}^{Q}\sum_{p=1}^{P}\left|\frac{\partial y_j}{\partial x_{ip}}\right|, \qquad (1)$$

$$\Lambda_i = \frac{1}{QP}\sum_{p=1}^{P}\sum_{j=1}^{Q}\sum_{x_i \in D_i}\left|\frac{\partial y_{jp}}{\partial x_{ip}}\right|, \qquad (2)$$

where y is the predictor output, Q the number of predictor outputs, P the number of training samples, x_ip stands for the ith component of the pth input vector x_p, and D_i is the set of sample points for the ith feature. It was suggested in [25] to make D_i a set of the most important points, whereas in Eq. (1) only the available training data points are considered as the important points. Pal [21] and De et al. [9] discuss the difficulties arising when searching for such points. The robustness of the predictor output sensitivity estimate depends heavily on the choice of the estimation points. To mitigate this problem, we use the fuzzy derivative instead of the ordinary one for assessing the predictor output sensitivity in this work. A definition of the fuzzy derivative is given shortly.

The feature saliency measure proposed in this work is based on two factors, namely, the fuzzy derivative of the predictor output with respect to the feature and the similarity between the feature and the feature set. By using the concept of the fuzzy derivative instead of the ordinary derivative, we model the vagueness that occurs in estimating the predictor output sensitivity. Very often, the features used for solving the problem at hand are highly correlated. Thus, some feature similarity measure is required to avoid using very redundant features. The correlation coefficient is one such measure. However, the correlation coefficient is invariant to scaling of the features, so two pairs of features exhibiting different variance may have the same correlation coefficient value. A measure with this property is not very suitable for feature selection. Therefore, to assess the similarity of features, we resorted to the so-called maximal information compression index proposed in [19]. The new way of assessing the predictor output sensitivity and the feature saliency measure based on both the predictor output sensitivity and the similarity between a feature and a feature set are the main novel aspects of the proposed technique.

Many feature selection techniques eliminate features one by one. In the case of large feature sets, such a procedure is rather time consuming. We demonstrate that, when using the proposed feature selection technique, the size and effectiveness of the feature sets selected in only two feature elimination steps are very close to the size and effectiveness of the feature sets obtained by the one-by-one elimination procedure.


To make a decision about the inclusion of a candidate feature into a feature set, a paired t-test comparing the saliency of the candidate and the noise feature is used.

The rest of the paper is organized as follows. In the next section, the measure used to assess feature saliency is presented. Section 3 describes the proposed feature selection procedure. The results of the experimental investigations are given in Section 4. Finally, Section 5 presents the conclusions of the work.

2. Feature saliency

Having a feature set F, we define the following feature saliency score φ_i assigned to the ith feature:

$$\varphi_i = \delta\,\frac{\Upsilon_i}{\max_{j=1,\ldots,N}\Upsilon_j} + (1-\delta)\,\lambda_{i,F}, \qquad (3)$$

where Υ_i is the predictor output sensitivity-based feature saliency measure, λ_{i,F} stands for the average similarity between the ith feature and the set F, N is the number of features, and δ is a parameter. We use the fuzzy derivative to estimate the predictor output sensitivity.

2.1. Fuzzy derivative

Dubois and Prade [10] considered differentiation of fuzzy functions. In this work, we are concerned with differentiation of a function f at a fuzzy location, a "fuzzy point" $\tilde X_0$. According to [10], a "fuzzy point" is a convex fuzzy subset of the real line. Thus, a "fuzzy point" can be considered as the possibility distribution of a point x whose location is only approximately known [37]. The uncertainty about the location of the point x amounts to an uncertainty about the derivative f'(x) at that point. The derivative of a real-valued function f at a fuzzy point $\tilde X_0$ can therefore be considered as a fuzzy set $f'(\tilde X_0)$. The membership function of the fuzzy set $f'(\tilde X_0)$ expresses the degree to which a particular f'(x) is the first derivative of f at $\tilde X_0$. The membership function is defined as [37]

$$\mu_{f'(\tilde X_0)}(z) = \sup_{x \in f'^{-1}(z)} \mu_{\tilde X_0}(x). \qquad (4)$$

If, for example, $\tilde X_0 = \{(-2, 0.5), (-1, 0.8), (0, 0.9), (1, 0.7), (2, 0.8)\}$ and $f'(x) = x^2$, the fuzzy derivative $f'(\tilde X_0)$ is given by $\{(4, 0.8), (1, 0.8), (0, 0.9)\}$.
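For a discretely specified fuzzy point, Eq. (4) can be evaluated by taking, for every attained derivative value, the largest membership over its preimage. A minimal Python sketch reproducing the example above (the function name and the dictionary representation are illustrative, not taken from the paper):

```python
def fuzzy_derivative(f_prime, X0):
    """Eq. (4): fuzzy derivative of f at the fuzzy point X0, given as {x: membership}.
    For each derivative value z, keep the sup of the memberships over its preimage."""
    result = {}
    for x, mu in X0.items():
        z = f_prime(x)
        result[z] = max(result.get(z, 0.0), mu)
    return result

# The example from the text: f'(x) = x^2 at X0 = {(-2,0.5), (-1,0.8), (0,0.9), (1,0.7), (2,0.8)}
X0 = {-2: 0.5, -1: 0.8, 0: 0.9, 1: 0.7, 2: 0.8}
print(fuzzy_derivative(lambda x: x ** 2, X0))   # {4: 0.8, 1: 0.8, 0: 0.9}
```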

It was discussed in the Introduction that the predictor output sensitivity should be estimated at the most important points of the feature space. However, the exact location of such points is not known. Therefore, differentiation of a function at a fuzzy location seems to be a sound argument for using the fuzzy derivative to estimate the predictor output sensitivity.

2.2. Fuzzy locations

Given the data point x_p, we obtain the membership function $\mu_{\tilde D_p}(x)$ defining the fuzzy location $\tilde D_p$ by applying a t-norm operator to the membership functions $\mu_N(x_p)$ and $\mu_{DB}(x_p)$, namely

$$\mu_{\tilde D_p}(x) = t\{\mu_N(x_p), \mu_{DB}(x_p)\} \qquad (5)$$

with t being the t-norm operator. The usefulness of the algebraic product and the min t-norm operators has been studied in this work. The membership function μ_N(x_p) defines the neighbourhood of x_p, while the value of μ_DB(x_p) is inversely proportional to the distance of x_p to the decision boundary. We define the function μ_N(x_p) as a π-function centred on x_p. A π-function, taking values in the range [0, 1], for x ∈ R^N, where N is the number of features, is defined to be [22]


$$\mu_N(x) = \begin{cases} 2\left(1 - \dfrac{\|x - c_i\|}{r}\right)^{2} & \text{for } \dfrac{r}{2} \le \|x - c_i\| \le r, \\[1mm] 1 - 2\left(\dfrac{\|x - c_i\|}{r}\right)^{2} & \text{for } 0 \le \|x - c_i\| \le \dfrac{r}{2}, \\[1mm] 0 & \text{otherwise}, \end{cases} \qquad (6)$$

with r > 0 being the radius of the π-function and c_i the central point, at which π(c_i; c_i, r) = 1. In our case, the central point is given by the actual data point x_p. The membership function μ_DB(x_p) is defined as

$$\mu_{DB}(x_p) = \exp\left\{-\beta \max_j |y_{pj}(x_p)|\right\}, \qquad (7)$$

where β is a parameter.
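The following sketch shows one way Eqs. (5)-(7) could be evaluated for a point in the neighbourhood of a training point x_p; the symbols r and β follow the reconstruction above, the classifier outputs at x_p are passed in as an array, and all names are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def pi_function(x, c, r):
    """Pi-function of Eq. (6): 1 at the centre c, 0 beyond the radius r."""
    d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(c, dtype=float))
    if d <= r / 2.0:
        return 1.0 - 2.0 * (d / r) ** 2
    if d <= r:
        return 2.0 * (1.0 - d / r) ** 2
    return 0.0

def mu_db(y_outputs, beta):
    """Eq. (7): close to 1 near the decision boundary, small far from it."""
    return float(np.exp(-beta * np.max(np.abs(y_outputs))))

def fuzzy_location_membership(x, x_p, y_at_xp, r, beta, tnorm=lambda a, b: a * b):
    """Eq. (5): t-norm (here the algebraic product; min is the other option
    studied in the paper) of the neighbourhood and boundary memberships."""
    return tnorm(pi_function(x, x_p, r), mu_db(y_at_xp, beta))
```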

2.3. Predictor output sensitivity-based feature saliency measure

In this work, we use the following predictor output sensitivity-based saliency measure for the ith feature:

$$\Upsilon_i = \frac{1}{PQ}\sum_{p=1}^{P}\sum_{j=1}^{Q} \left|\bar y'_j(\tilde D_{ip})\right|, \qquad (8)$$

where P is the number of data points, $\tilde D_{ip}$ the pth fuzzy location for the ith feature, Q the number of classes (outputs), and $\bar y'_j(\tilde D_{ip})$ the defuzzified value of the derivative $y'_j(\tilde D_{ip})$ of the jth output with respect to the input feature x_i at the fuzzy location $\tilde D_{ip}$. The derivative $y'_j(\tilde D_{ip})$ is a fuzzy set, the membership function of which is defined according to Eq. (4).

The defuzzified value $\bar y'_j(\tilde D_{ip})$ of the derivative $y'_j(\tilde D_{ip})$ is obtained by using the centre of area (COA) defuzzification method:

$$\bar y'_j(\tilde D_{ip}) = \frac{\sum_{z_k \in y'_j(\tilde D_{ip})} \mu_{y'_j(\tilde D_{ip})}(z_k)\, z_k}{\sum_{z_k \in y'_j(\tilde D_{ip})} \mu_{y'_j(\tilde D_{ip})}(z_k)}, \qquad (9)$$

where

$$z_k = \left.\frac{\partial y_j}{\partial x_i}\right|_{X_k \in \tilde D_{ip}}. \qquad (10)$$

As pointed out by the Editor, the midpoint of the defuzzified interval introduced by Dubois and Prade [11] and studied in detail by Fortemps and Roubens [14] can be used for the defuzzification instead of COA. Linearity with respect to fuzzy addition is the main advantage of the midpoint defuzzification method.
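Since the fuzzy derivative is a discrete fuzzy set, the COA defuzzification of Eq. (9) reduces to a membership-weighted average of the sampled derivative values z_k. A minimal sketch of Eqs. (8) and (9) under that assumption (names illustrative; the fuzzy sets may be produced, e.g., by the fuzzy_derivative sketch above):

```python
def coa_defuzzify(fuzzy_set):
    """Centre-of-area defuzzification of Eq. (9); fuzzy_set maps z_k -> membership."""
    num = sum(mu * z for z, mu in fuzzy_set.items())
    den = sum(fuzzy_set.values())
    return num / den

def upsilon_i(derivative_fuzzy_sets):
    """Eq. (8) for one feature i: derivative_fuzzy_sets[p][j] is the fuzzy set
    y'_j(D_ip), i.e. the derivative of output j at the pth fuzzy location."""
    vals = [abs(coa_defuzzify(d_pj)) for d_p in derivative_fuzzy_sets for d_pj in d_p]
    return sum(vals) / len(vals)
```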

2.4. Feature similarity

Let Σ be the covariance matrix of two features i and j. The maximal information compression index λ(i, j) is then defined as [19]

$$\lambda(i, j) = \text{smallest eigenvalue of } \Sigma = \tfrac{1}{2}\left(\mathrm{var}(i) + \mathrm{var}(j) - \sqrt{[\mathrm{var}(i) + \mathrm{var}(j)]^{2} - 4\,\mathrm{var}(i)\,\mathrm{var}(j)\,[1 - \rho(i, j)^{2}]}\right), \qquad (11)$$

where var(i) stands for the variance of feature i and ρ(i, j) is the coefficient of correlation between the features i and j.


The measure λ(i, j) is symmetric and invariant to translation and rotation of the variables about the origin. It satisfies 0 ≤ λ(i, j) ≤ 0.5[var(i) + var(j)], and λ(i, j) = 0 if and only if i and j are linearly related. In the multi-class case, we calculate the average value of the measure:

$$\lambda(i, j) = \frac{1}{Q}\sum_{k=1}^{Q} \lambda_k(i, j), \qquad (12)$$

where Q is the number of classes and λ_k(i, j) stands for the measure value calculated using data coming from the kth class. The measure is normalized:

$$\lambda_n(i, j) = \frac{\lambda(i, j)}{\max_{i,j \in F} \lambda(i, j)} \qquad (13)$$

with F being a feature set. The similarity between the feature i and the set F is then given by

$$\lambda_{i,F} = \min_{j \in F} \lambda_n(i, j). \qquad (14)$$
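A sketch of Eqs. (11) and (14), and of how the two terms are combined in Eq. (3); the per-class averaging of Eq. (12) and the normalization of Eq. (13) are assumed to have been applied already to the lam_n argument, and all names are illustrative:

```python
import numpy as np

def compression_index(xi, xj):
    """Eq. (11): maximal information compression index of two feature samples xi, xj
    (the smallest eigenvalue of their 2x2 covariance matrix)."""
    vi, vj = np.var(xi), np.var(xj)
    rho = np.corrcoef(xi, xj)[0, 1]
    return 0.5 * (vi + vj - np.sqrt((vi + vj) ** 2 - 4.0 * vi * vj * (1.0 - rho ** 2)))

def similarity_to_set(i, F, lam_n):
    """Eq. (14): lam_n[(i, j)] holds the class-averaged, normalized index of Eqs. (12)-(13)."""
    return min(lam_n[(i, j)] for j in F)

def saliency_score(upsilon, i, F, lam_n, delta):
    """Eq. (3): trade-off delta between the sensitivity and similarity terms."""
    return delta * upsilon[i] / max(upsilon) + (1.0 - delta) * similarity_to_set(i, F, lam_n)
```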

3. Feature selection procedure

The feature selection procedure is summarized in the following steps:

(1) Randomly assign the available data points to learning S_l, validation S_v, and test S_t data sets, for example 50% for learning and 25% each for validation and testing. If the number of data points available is small, the size of the S_v and S_t sets can be reduced. For the public databases used in this study, the size of the S_l, S_v, and S_t sets was chosen to coincide with the size used by other researchers.
(2) Increase the dimension of the input vectors by adding one additional noise feature with mean m = 0 and a given standard deviation s. The value of s depends on the number of initial features: the larger the number, the smaller the s value. The standard deviation s = 1 worked well in all our tests with a relatively small number of features (up to about 50). For feature sets containing 150-200 features, s = 0.3 was a good choice.
(3) Train the model. The learning set is used to train the model, the validation set is used to select the hyper-parameters of the model, and the test set is used to assess the generalization error of the model chosen. Since we used an SVM with a Gaussian kernel as the model in our tests, the hyper-parameters considered were the kernel width and the regularization constant. Any model (classifier) lending itself to calculation of the output sensitivity can be used.
(4) Set δ = 1. Calculate the saliency scores φ_i, i = 1, ..., N.
(5) Repeat Steps 3 and 4 K times using different random data partitionings into training, validation, and test sets.
(6) Eliminate the features whose saliency does not exceed the saliency of the noise feature. Use the paired t-test to compare the saliencies.
(7) Train the model using the reduced set of features and the noise feature.
(8) Set 0 < δ < 1. Calculate the saliency scores φ_i, i = 1, ..., N.
(9) Repeat Steps 7 and 8 K times using different random data partitionings into training, validation, and test sets.
(10) Eliminate the features whose saliency does not exceed the saliency of the noise feature. Use the paired t-test to compare the saliencies.
(11) Train the model using the reduced set of features.

3.1. The paired t-test

To assess the equality of the mean saliency of the ith feature, φ_i, and of the noise feature, φ_n, the paired t-test is defined as suggested in [28]. Null hypothesis: the mean of D_i is zero; alternative hypothesis: the mean of D_i is greater than zero, where D_i = φ_i − φ_n. To test the null hypothesis, the statistic

$$t^{*} = \frac{\bar D_i}{s_{\bar D_i}} \qquad (15)$$

is evaluated, where $\bar D_i = K^{-1}\sum_{j=1}^{K} D_{ij}$, $D_{ij} = \varphi_{ij} - \varphi_{nj}$, with φ_ij and φ_nj being the saliency scores computed using Eq. (3) for the ith and the noise feature, respectively, in the jth loop (see Step 5 of the feature selection procedure), K is the number of loops, and

$$s_{\bar D_i} = \sqrt{\frac{\sum_{j=1}^{K} (D_{ij} - \bar D_i)^{2}}{K(K-1)}}. \qquad (16)$$

Under the null hypothesis, the t* statistic is t distributed. If t* > t_crit, the hypothesis that the difference in the means is zero is rejected, where t_crit is the critical value of the t distribution with ν = K − 1 degrees of freedom for a significance level α: t_crit = t_{1−α, ν}.
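A sketch of the decision rule of Eqs. (15)-(16) as applied in Steps 6 and 10, assuming the K per-loop saliency scores of the candidate and of the noise feature are available; scipy is used only to obtain the critical value, and the names are illustrative:

```python
import numpy as np
from scipy.stats import t as t_dist

def keep_feature(scores_i, scores_noise, alpha=0.05):
    """Paired t-test of Eqs. (15)-(16): keep feature i only if its mean saliency
    significantly exceeds that of the noise feature."""
    D = np.asarray(scores_i, dtype=float) - np.asarray(scores_noise, dtype=float)
    K = len(D)
    D_bar = D.mean()
    s_D = np.sqrt(np.sum((D - D_bar) ** 2) / (K * (K - 1)))   # Eq. (16)
    t_star = D_bar / s_D                                      # Eq. (15)
    t_crit = t_dist.ppf(1.0 - alpha, df=K - 1)
    return t_star > t_crit
```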

3.2. Selecting features for SVM: an example

The output y(x) of an SVM is given by

$$y(x) = \sum_{j=1}^{N_s} \alpha_j^{*} d_j\, \kappa(x_j, x) + b, \qquad (17)$$

where N_s is the number of support vectors, the threshold b and the parameter values α*_j are found as the solution of the optimization problem, κ(x_j, x) is a kernel, and d_j is a target value (d_j = ±1). With the jth input vector x_j presented to the input, the derivative of the output with respect to the ith feature is given by

$$\frac{\partial y(x_j)}{\partial x_{ji}} = \frac{\partial\left(\sum_{k=1}^{N_s} \alpha_k^{*} d_k\, \kappa(x_j, x_k) + b\right)}{\partial x_{ji}} = \sum_{k=1}^{N_s} \alpha_k^{*} d_k\, \frac{\partial \kappa(x_j, x_k)}{\partial x_{ji}}. \qquad (18)$$

For a Gaussian kernel $\kappa(x_j, x_k) = \exp\{-\|x_j - x_k\|^{2}/\sigma\}$, where σ is the width of the Gaussian, the derivatives are given by

$$\frac{\partial \kappa(x_j, x_k)}{\partial x_{ji}} = -\frac{2}{\sigma}\,(x_{ji} - x_{ki}) \exp\left\{-\frac{\sum_{n=1}^{N} (x_{jn} - x_{kn})^{2}}{\sigma}\right\} \qquad (19)$$

and

$$\frac{\partial y(x_j)}{\partial x_{ji}} = -\frac{2}{\sigma}\sum_{k=1}^{N_s} \alpha_k^{*} d_k\,(x_{ji} - x_{ki}) \exp\left\{-\frac{\sum_{n=1}^{N} (x_{jn} - x_{kn})^{2}}{\sigma}\right\}. \qquad (20)$$
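Eq. (20) can be evaluated directly from the support vectors of a trained SVM. A numpy sketch under the kernel convention κ(x_j, x_k) = exp{−||x_j − x_k||²/σ} used above; the packing of α*_k d_k into a single array and all names are illustrative:

```python
import numpy as np

def svm_output_gradient(x, support_vectors, alpha_d, sigma):
    """Eq. (20): gradient of the SVM output y(x) with respect to the input features.
    support_vectors: (Ns, N) array of x_k; alpha_d: (Ns,) array holding alpha*_k d_k."""
    diffs = x - support_vectors                               # rows: x - x_k
    kernels = np.exp(-np.sum(diffs ** 2, axis=1) / sigma)     # kappa(x, x_k)
    return -(2.0 / sigma) * (alpha_d * kernels) @ diffs       # (N,) vector of dy/dx_i
```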

4. Experimental investigations

In all the tests, we ran each experiment 30 times with different random partitionings of a data set into learning S_l, validation S_v, and test S_t sets. The mean values and standard deviations of the correct classification rate presented in this paper were calculated from these 30 trials. In this work, we used the 1-norm soft margin SVM with a Gaussian kernel [27]. The width of the kernel and the regularization constant were found by cross-validation.

Apart from the parameters of the classifier, there are four parameters governing the behaviour of the feature selection procedure: the type of the t-norm operator used in Eq. (5) and the parameters r, β, and δ. The usefulness of the algebraic product and the min t-norm operators has been studied. The value of the π-function radius r was chosen such that, on average, for a given x there were five data points yielding μ_N(x) > 0. To allow the use of relatively small values of r, five data points were generated in the neighbourhood of each training data point and used to evaluate the fuzzy derivative. The points were generated by adding a normally distributed random noise vector to x. The value of β determines the sensitivity of the membership function μ_DB(x) (Eq. (7)). The parameter 0 ≤ δ ≤ 1 defines the trade-off between the two terms of the feature saliency measure.


The influence of the parameters β and δ on the effectiveness of the selected feature set has been studied experimentally. The results of these studies are summarized in Section 4.2. The value β = 2 worked well in all the tests.

4.1. Data used

To test the approach we used one artificial and five real-world problems. Data characterizing the artificial problem and four of the real-world problems (Waveform, the US congressional voting records problem, the diabetes diagnosis problem, the Wisconsin breast cancer problem, and the Wisconsin diagnostic breast cancer problem) are available at www.ics.uci.edu/~mlearn/. The fifth problem concerns classification of laryngeal images [31].

Waveform problem: The ability of the technique to detect pure noise features and the usefulness of the fuzzy derivative in assessing the predictor output sensitivity have been studied using the 40-dimensional Waveform data [6]. There are three classes of waves with an equal number of instances in each class, where each class is generated from a combination of 2 of 3 "base" waves. All of the 40 attributes include noise with mean 0 and variance 1; the latter 19 attributes are all noise attributes with mean 0 and variance 1. Out of 5000 data points, 2500 were assigned to the S_l set and 1250 to each of the S_v and S_t sets.

US congressional voting records problem: The United States Congressional Voting Records Data Set consists of the voting records of 435 congressmen on 16 major issues in the 98th Congress. The votes are categorized into one of three types: (1) Yea, (2) Nay, and (3) Unknown. The task is to predict the correct political party affiliation of each congressman. Out of the 435 data points available, 197 samples were assigned to the set S_l, 21 to the set S_v, and 217 to the set S_t.

The diabetes diagnosis problem: The Pima Indians Diabetes Data Set contains 768 samples taken from patients who may show signs of diabetes. Each sample is described by eight features. There are 500 samples from patients who do not have diabetes and 268 samples from patients who are known to have diabetes. From the data set, 345 samples were assigned to the set S_l, 39 to the set S_v, and 384 to the set S_t.

Wisconsin breast cancer database (WBCD): The University of Wisconsin Breast Cancer Data Set consists of 699 patterns, among them 458 benign samples and 241 malignant ones. Each of these patterns consists of nine measurements taken from fine needle aspirates from a patient's breast. The number of data points assigned to the sets S_l, S_v, and S_t was 315, 35, and 349, respectively.

Wisconsin diagnostic breast cancer (WDBC): There are 30 real-valued features. The features are computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image. There are 569 instances, 357 benign and 212 malignant. The number of data points assigned to the sets S_l, S_v, and S_t was 269, 30, and 270, respectively.

Laryngeal images: The task is the automated categorization of colour laryngeal images into the healthy, nodular, and diffuse decision classes. Aiming to obtain a comprehensive description of the laryngeal images, multiple feature sets exploiting information on image colour, texture, geometry, image intensity gradient direction, and frequency content are extracted [32]. In this work, the frequency content based features are utilized.
Let Z(u, v) be the Fourier transform of the image

$$z(x, y) = L^{*}(x, y)\exp\{\mathrm{j}H_{ab}(x, y)\}, \qquad (21)$$

where z(x, y) is a complex colour representation of the colour image L*(x, y), a*(x, y), b*(x, y), and H_ab(x, y) = arctan[b*(x, y)/a*(x, y)] is the hue angle [34]. We define the Fourier spectrum of the image z(x, y) as

$$P(u, v) = |Z(u, v)|^{2} = R^{2}(u, v) + I^{2}(u, v), \qquad (22)$$

where R(u, v) and I(u, v) are the real and imaginary parts of Z(u, v), respectively. To compute the feature vector, the upper part of the frequency plane is divided into N equidistant wedges W_i and the average power

$$\bar P_i = \frac{1}{N_{W_i}} \sum_{u,v \in W_i} P(u, v), \qquad i = 1, \ldots, N, \qquad (23)$$

is computed in each of the wedges, where N_{W_i} is the number of distinct frequencies in the wedge W_i.
A. Verikas et al. / Fuzzy Sets and Systems 159 (2008) 422 – 434

429

Table 1 Features selected by the technique proposed to discriminate between the pairs of classes of the “Waveform” data set for two t-norm operator types Pair of classes

Number of features

Features

Algebraic product t-norm, = 2 1/2 1/3 2/3

14 15 15

13 12 17 8 5 18 16 9 4 19 10 20 15 11 18 40 8 3 14 17 9 7 4 10 12 5 6 11 13 20 12 10 18 4 5 9 16 8 6 14 17 13 15 7

Min t-norm, = 2 1/2 1/3 2/3

14 15 13

13 4 18 19 8 20 5 17 12 16 9 15 11 10 14 13 8 9 17 6 10 4 7 11 5 3 18 40 12 10 18 4 5 9 14 8 13 17 6 16 15 7

is computed in each of the wedges, where NW i is the number of distinct frequencies in the wedge Wi . The P i values constitute components of the feature vector . The number of wedges (features) N used was equal to N = 180. There are 49 images from the healthy class, 406 from the nodular class, and 330 from the diffuse class. Out of the 785 images available, 550 images were assigned to the set Sl , 100 images were assigned to the set Sv , and 135 to the test set St . 4.2. Results In the first set of experiments, the usefulness of the fuzzy derivative in assessing the predictor output sensitivity has been studied using the set of artificial “Waveform” data. The influence of the type of the t-norm operator and the parameter value affecting the fuzzy derivative estimation result have also been investigated using this data set. There are three decision classes in the data set. Since the SVM performs binary classification, in these tests, we selected separate feature sets for discriminating between different pairs of the decision classes. First, the influence of the t-norm operator type used has been studied. Table 1 presents the feature sets selected by the technique proposed when using the algebraic product and the min t-norm operators in the fuzzy derivative estimation process. In Table 1, the features are ranked in an ascending order from the left to the right according to their saliency estimated using (3). The optimal value of  (the one leading to a feature set that provides the highest correct classification rate) found for the different pairs of classes varied between 0.2 and 0.8. The experiments have shown that the value of  = 0.5 can be adopted for all the class pairs without any significant deterioration of the classification accuracy. We remind that features with numbers larger than 21 are all noise features. Observe that features 1 and 21 are also equivalent to noise features. Noise features are emphasized in the table. As can be seen from Table 1, while the ranking of features obtained using the two t-norm operators is different, the features selected are exactly the same except features 20 and 12, which have not been selected to discriminate the pair of classes 2/3 in the min t-norm operator case. Only one pure noise feature, feature 40 has been included in the selected feature sets. Observe that all features characterizing the problem are contaminated by noise and the saliency of the meaningful features does not differ much. In further studies we have chosen to use the algebraic product t-norm operator. When using this operator, the pure noise feature 40 has been deemed to be less salient than in the case of the min operator. In the next experiment, the influence of the parameter (Eq. (7)) on the feature selection results has been studied using the “Waveform” data set. Apart from the value of = 2 used to obtain the results presented in Table 1, the utility of other values has been explored. Table 2 summarizes the feature selection results obtained for = 1 and 3. As can be seen, the feature sets selected for = 1, 2, and 3 are very similar. The only difference is that feature 20 has not been selected to discriminate the pair of classes 2/3 in the case of = 1. Thus, the feature selection technique is rather robust against the imprecise choice of the parameter value. The last experiment with the set of artificial data concerns comparison of the feature selection results obtained using the fuzzy and ordinary derivatives in the feature saliency measure. 
Table 3 presents the features selected according to the fuzzy and ordinary derivative based feature saliency measure to discriminate between the pairs of classes of the “Waveform” data set. As can be seen from Table 3, the fuzzy derivative based approach clearly outperforms the


Table 2
Features selected by the technique proposed to discriminate between the pairs of classes of the "Waveform" data set for two different values of the parameter β

Pair of classes    Number of features    Features

Algebraic product t-norm, β = 1
1/2                14                    4 19 20 13 8 18 12 17 5 9 16 10 15 11
1/3                15                    14 17 8 18 13 9 4 7 10 3 5 12 40* 11 6
2/3                14                    12 10 4 18 5 9 8 6 17 15 14 13 16 7

Algebraic product t-norm, β = 3
1/2                14                    13 12 4 18 19 8 20 16 15 5 17 11 9 10
1/3                15                    14 13 9 8 17 6 10 4 7 11 5 3 18 40* 12
2/3                15                    10 20 12 18 4 5 9 8 6 17 14 13 16 15 7

Noise features are marked with an asterisk.

Table 3
Features selected according to the fuzzy and the ordinary derivative based feature saliency measure for discriminating between the pairs of classes of the "Waveform" data set

Pair of classes    Number of features    Features

Fuzzy derivative
1/2                14                    13 12 17 8 5 18 16 9 4 19 10 20 15 11
1/3                15                    18 40* 8 3 14 17 9 7 4 10 12 5 6 11 13
2/3                15                    20 12 10 18 4 5 9 16 8 6 14 17 13 15 7

Ordinary derivative
1/2                16                    3 35* 19 21* 13 20 30* 34* 4 8 5 18 12 15 17 10
1/3                16                    2 27* 28* 35* 18 40* 3 14 17 8 9 4 7 5 12 10
2/3                16                    3 4 10 33* 27* 20 37* 12 18 5 9 6 17 14 8 16

Noise features are marked with an asterisk.

Table 4
The average correct classification rate (%) obtained from the SVM for the different data sets when using all the original features

Data set      Number of features    Training set    Test set

Diabetes      8                     79.98 (1.63)    76.87 (1.60)
WBCD          9                     97.80 (0.52)    96.86 (0.79)
Voting        16                    98.19 (0.73)    95.49 (1.03)
WDBC          30                    98.96 (0.42)    97.23 (1.01)
Laryngeal     180                   82.23 (2.82)    79.18 (3.74)

There are 11 pure noise features included in the feature sets selected when using the ordinary derivative, while there is only one such feature in the fuzzy derivative case.

In the next set of experiments, the real-world data sets have been used. Table 4 presents the average correct classification rate obtained from the SVM for the different data sets when using all the original features in the classification process. The number of available features is also given in the table. In parentheses, the standard deviation of the correct classification rate is provided.

In the next experiment, we studied the influence of the parameter δ on the discrimination power of the selected feature set. The step size used to change the δ value was set to 0.1. In these tests, the ordinary derivative has been used in the feature saliency measure instead of the fuzzy one. Observe that in this experiment, for each data set partitioning into the S_l, S_v, and S_t subsets, the feature elimination process is accomplished in only two steps, as described in the feature selection procedure. Table 5 summarizes the results of the test. Apart from the correct classification rate obtained using the selected feature sets, the number of selected features and the optimal value of the parameter δ are provided in the table. The value of δ providing the highest correct classification rate is considered optimal.


Table 5
The average correct classification rate (%) obtained for the different data sets when using features selected according to the ordinary derivative based feature saliency measure

Data set      Number of selected features    Training set    Test set        Optimal value of δ

Diabetes      6                              78.21 (1.72)    77.06 (1.68)    0.2-1.0
WBCD          7                              97.62 (0.55)    97.00 (0.78)    0.2-1.0
Voting        7                              97.68 (0.70)    95.81 (1.28)    0.2-1.0
WDBC          8                              98.10 (0.61)    97.17 (0.96)    0.6-0.9
Laryngeal     60                             99.99 (0.02)    87.41 (2.71)    0.6-0.8

Table 6
The highest average correct classification rate (%) obtained for the different data sets when using features selected according to the ordinary derivative based feature saliency measure

Data set      Number of selected features    Training set    Test set        Optimal value of δ

Diabetes      5                              78.32 (1.70)    77.51 (1.73)    0.1
WBCD          6                              97.48 (0.60)    97.10 (0.75)    0.2-0.6
Voting        5                              96.51 (0.92)    95.85 (1.19)    0.4-0.6
WDBC          6                              97.75 (0.61)    97.25 (0.75)    0.6-0.9
Laryngeal     49                             99.99 (0.02)    87.67 (2.74)    0.6-0.8

The range of δ values provided in Table 5 means that the same correct classification rate was obtained for all the tested δ values from that range. As can be seen from Table 5, for the first three data sets, which are characterized by a small number of features, the interval of optimal δ values includes δ = 1. This means that for those data sets the use of the feature similarity term in the feature saliency index did not improve the discrimination power of the selected feature sets. In fact, for all the δ values belonging to the interval, including δ = 1, the same features were included in the final feature set. Comparing the test set correct classification rates presented in Tables 4 and 5, we see that the classification rate obtained using the selected features is higher (Laryngeal data set) than, or approximately the same as, that obtained exploiting all the original features.

Next, the effectiveness of the feature selection procedure has been studied from the viewpoint of the size of the feature set found. The feature set found has been further subjected to the feature elimination process. Features have been eliminated one by one based on the value of the feature saliency measure. After each feature elimination, the classifier was retrained using the reduced feature set. Table 6 summarizes the results of this experiment. The number of selected features shown in Table 6 is the number providing the highest correct classification rate on the test set data. Comparing the results presented in Tables 5 and 6, we can see that the number of features found in the two-step elimination process is very close to that providing the highest correct classification rate. It is interesting to note that now, when using the feature sets with fewer features, the value δ = 1 does not belong to the set of optimal values even for the first three data sets. For example, for the Diabetes data set, for all values of δ ∈ [0.2, 1.0] the highest correct classification rate of 77.19% (1.62) is achieved when using five features. The features selected are 2, 6, 1, 7, 5. However, in the case of δ = 0.1 the features selected are 2, 6, 1, 8, 3. This set of features provided the highest average classification accuracy on the test set data (see Table 6).

In the last experiment, the usefulness of the fuzzy derivative for assessing the feature saliency, as compared to the ordinary derivative, was studied. Now the fuzzy derivative was used in the feature saliency measure. The results of the experiment are presented in Table 7; they should be compared with those shown in Table 5. Though the test set correct classification rates presented in Tables 5 and 7 are similar, the feature sets found when using the fuzzy derivative contain fewer features. Thus, the classifier complexity is reduced. This is especially obvious for the Laryngeal data set, characterized by a large number of features. The average power of the Fourier spectrum in the equidistant wedges of the frequency plane constitutes the components of the feature vector characterizing the Laryngeal data.


Table 7
The average correct classification rate (%) obtained for the different data sets when using features selected according to the technique proposed

Data set      Number of selected features    Training set    Test set        Optimal value of δ

Diabetes      5                              77.77 (1.67)    77.20 (1.62)    0.4-0.6
WBCD          6                              97.49 (0.61)    97.10 (0.75)    0.4-1.0
Voting        5                              96.46 (0.97)    95.83 (1.01)    0.2-0.8
WDBC          7                              98.41 (0.56)    97.34 (0.81)    0.4-0.8
Laryngeal     34                             99.93 (0.08)    87.48 (2.77)    0.2-0.6

The fuzzy derivative has been used in the feature saliency measure.

Fig. 1. The test data set correct classification rate as a function of the number of principal components used to classify the laryngeal (left) and diabetes (right) data. (Figure not reproduced; the maxima marked in the plots are 86.96% for the laryngeal data and 77.26% for the diabetes data.)

The number of wedges is relatively large, and there are many wedges with approximately the same average power for all three decision classes. Thus, amongst the features characterizing the Laryngeal data there are many whose discrimination power does not differ from that of the pure noise feature. The large feature reduction obtained for the Laryngeal data substantiates this fact. When using the fuzzy approach, the estimate of the classifier output sensitivity was more immune to noise in the data. Therefore, a higher repeatability of the feature sets selected in different training sessions was observed. As can be seen from Table 7, the value δ = 0.5 worked well for all the data sets. The technique can be applied to all types of classifiers and features lending themselves to calculation of the feature saliency measure utilized.

One may wonder what correct classification rate can be achieved by using PCA to reduce the data dimensionality. When using PCA, none of the original variables can be discarded. Although feature selection is the main concern of this paper, we used the PCA technique for comparison. Figs. 1-3 plot the test set correct classification rate as a function of the number of principal components used to classify the data coming from the five real-world problems. The highest correct classification rate achieved is shown in the figures. As can be seen from Figs. 1-3 and Table 7, for all the problems, the number of selected features is smaller than the number of principal components providing the highest correct classification rate. Moreover, the correct classification rate obtained using the selected features is approximately the same as, or higher than, that achieved in the principal components case.
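For the PCA comparison, one possible experimental setup using scikit-learn (the standardization step and all names are assumptions; the paper does not specify its implementation):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def accuracy_vs_components(X_train, y_train, X_test, y_test, max_components):
    """Test-set correct classification rate (%) as a function of the number of
    principal components, as plotted in Figs. 1-3."""
    rates = []
    for k in range(1, max_components + 1):
        clf = make_pipeline(StandardScaler(), PCA(n_components=k), SVC(kernel='rbf'))
        clf.fit(X_train, y_train)
        rates.append(100.0 * clf.score(X_test, y_test))
    return rates
```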


Fig. 2. The test data set correct classification rate as a function of the number of principal components used to classify the WDBC (left) and WBCD (right) data. (Figure not reproduced; the maxima marked in the plots are 97.42% and 96.93%.)

Fig. 3. The test data set correct classification rate as a function of the number of principal components used to classify the Voting data. (Figure not reproduced; the maximum marked in the plot is 95.31%.)

5. Conclusions

A feature selection technique capable of generating an efficient feature set in a few selection steps has been presented in this paper. Two factors, the fuzzy derivative of the predictor output with respect to the feature and the similarity between the feature and the feature set, are combined into the feature saliency measure. To make a decision about the inclusion of a candidate feature into a feature set, a paired t-test comparing the saliency of the candidate and the noise feature is used. The use of the fuzzy derivative enables modelling the vagueness that occurs in estimating the predictor output sensitivity. The fuzzy approach based estimate of the classifier output sensitivity was less noisy; therefore, a higher repeatability of the feature sets selected in different training sessions was observed. The feature similarity measure employed helps to avoid using redundant features. This property of the measure was clearly observed for the data sets characterized by a large number of features. The effectiveness of the proposed feature selection technique was demonstrated on five real-world problems.


The size and effectiveness of the feature sets selected in only two feature elimination steps were very close to the size and effectiveness of the feature sets obtained in the one-by-one elimination procedure. Classifiers utilizing the reduced feature sets demonstrated a higher classification accuracy than those exploiting all the original features.

References

[1] N. Acir, C. Guzelis, Automatic recognition of sleep spindles in EEG via radial basis support vector machine based on a modified feature selection algorithm, Neural Comput. Appl. 14 (2005) 56-65.
[2] M. Bacauskiene, A. Verikas, Selecting salient features for classification based on neural network committees, Pattern Recognition Lett. 25 (16) (2004) 1879-1891.
[3] K.W. Bauer, S.G. Alsing, K.A. Greene, Feature screening using signal-to-noise ratios, Neurocomputing 31 (2000) 29-44.
[4] L.M. Belue, K.W. Bauer, Determining input features for multilayer perceptrons, Neurocomputing 7 (1995) 111-121.
[5] J. Bi, K.P. Bennett, M. Embrechts, C.M. Breneman, M. Song, Dimensionality reduction via sparse support vector machines, J. Mach. Learning Res. 3 (2003) 1229-1243.
[6] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman & Hall, Amsterdam, 1993.
[7] S. Cang, D. Partridge, Feature ranking and best feature subset using mutual information, Neural Comput. Appl. 13 (2004) 175-184.
[8] X.W. Chen, An improved branch and bound algorithm for feature selection, Pattern Recognition Lett. 24 (12) (2003) 1925-1933.
[9] R.K. De, N.R. Pal, S.K. Pal, Feature analysis: neural network and fuzzy set theoretic approaches, Pattern Recognition 30 (10) (1997) 1579-1590.
[10] D. Dubois, H. Prade, Towards fuzzy differential calculus, Part 2: differentiation, Fuzzy Sets and Systems 8 (3) (1982) 225-233.
[11] D. Dubois, H. Prade, The mean-value of a fuzzy number, Fuzzy Sets and Systems 24 (3) (1987) 279-300.
[12] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[13] T. Evgeniou, M. Pontil, C. Papageorgiou, T. Poggio, Image representations and feature selection for multimedia database search, IEEE Trans. Knowledge Data Eng. 15 (4) (2003) 911-920.
[14] P. Fortemps, M. Roubens, Ranking and defuzzification methods based on area compensation, Fuzzy Sets and Systems 82 (3) (1996) 319-330.
[15] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learning 46 (2002) 389-422.
[16] M. Kudo, J. Sklansky, Comparison of algorithms that select features for pattern classifiers, Pattern Recognition 33 (1) (2000) 25-41.
[17] Y. Liu, Y.F. Zheng, FS-SFS: a novel feature selection method for support vector machines, Pattern Recognition 39 (7) (2006) 1333-1345.
[18] K.Z. Mao, Orthogonal forward selection and backward elimination algorithms for feature subset selection, IEEE Trans. Systems Man Cybernet. 34 (1) (2004) 629-634.
[19] P. Mitra, C.A. Murthy, S.K. Pal, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 301-312.
[20] I.S. Oh, J.S. Lee, B.R. Moon, Hybrid genetic algorithms for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1424-1437.
[21] N.R. Pal, Soft computing for feature analysis, Fuzzy Sets and Systems 103 (1999) 201-221.
[22] S.K. Pal, P.K. Pramanik, Fuzzy measures in determining seed points in clustering, Pattern Recognition Lett. 4 (1986) 159-164.
[23] K.L. Priddy, S.K. Rogers, D.W. Ruck, G.L. Tarr, M. Kabrisky, Bayesian selection of important features for feedforward neural networks, Neurocomputing 5 (1993) 91-103.
[24] P. Pudil, J. Novovicova, P. Somol, Feature selection toolbox software package, Pattern Recognition Lett. 23 (2002) 487-492.
[25] D.W. Ruck, S.K. Rogers, M. Kabrisky, Feature selection using a multilayer perceptron, J. Neural Network Comput. 2 (2) (1990) 40-48.
[26] R. Setiono, H. Liu, Neural-network feature selector, IEEE Trans. Neural Networks 8 (3) (1997) 654-662.
[27] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[28] J.M. Steppe, K.W. Bauer, Improved feature screening in feedforward neural networks, Neurocomputing 13 (1996) 47-58.
[29] J.M. Steppe, K.W. Bauer, S.K. Rogers, Integrated feature and architecture selection, IEEE Trans. Neural Networks 7 (4) (1996) 1007-1014.
[30] A. Verikas, M. Bacauskiene, Feature selection with neural networks, Pattern Recognition Lett. 23 (11) (2002) 1323-1335.
[31] A. Verikas, A. Gelzinis, M. Bacauskiene, V. Uloza, Towards a computer-aided diagnosis system for vocal cord diseases, Artificial Intelligence Medicine 36 (1) (2006) 71-84.
[32] A. Verikas, A. Gelzinis, D. Valincius, M. Bacauskiene, V. Uloza, Multiple feature sets based categorization of laryngeal images, Comput. Methods Programs Biomedicine 85 (3) (2007) 257-266.
[33] J. Weston, S. Mujherjee, O. Chapelle, M. Pontil, T. Poggio, V. Vapnik, Feature selection for SVMs, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, Vol. 13, MIT Press, Cambridge, MA, 2000, pp. 668-674.
[34] G. Wyszecki, W.S. Stiles, Color Science. Concepts and Methods, Quantitative Data and Formulae, second ed., Wiley, New York, 1982.
[35] S. Yu, S.G. Backer, P. Scheunders, Genetic feature selection combined with composite fuzzy nearest neighbor classifiers for hyperspectral satellite imagery, Pattern Recognition Lett. 23 (1-3) (2002) 183-190.
[36] H. Zhang, G. Sun, Feature selection using tabu search method, Pattern Recognition 35 (2002) 701-711.
[37] H.J. Zimmermann, Fuzzy Set Theory—and Its Applications, third ed., Kluwer Academic Publishers, Boston, 1996.
[38] J.M. Zurada, A. Malinowski, S. Usui, Perturbation method for deleting redundant inputs of perceptron networks, Neurocomputing 14 (1997) 177-193.