A filter model for feature subset selection based on genetic algorithm


Knowledge-Based Systems 22 (2009) 356–362


M.E. ElAlami, Department of Computer Science, Mansoura University, Mansoura 35111, Egypt

Article history: Received 15 January 2008; Received in revised form 27 September 2008; Accepted 17 February 2009; Available online 26 February 2009.

Keywords: Feature subset selection; Relevant feature; Genetic algorithm; Artificial neural networks; Non-linear optimization; Fitness function.

Abstract: This paper describes a novel feature subset selection algorithm, which utilizes a genetic algorithm (GA) to optimize the output nodes of a trained artificial neural network (ANN). The new algorithm does not depend on the ANN training algorithms and does not modify the training results. The two groups of weights between the input-hidden and hidden-output layers are extracted after training the ANN on a given database. The general formula for each output node (class) of the ANN is then generated. This formula depends only on the input features, because the two groups of weights are constant. This dependency is represented by a non-linear exponential function. The GA is used to find the optimal relevant features, which maximize the output function for each class. The dominant features in all classes are the feature subset to be selected from the input feature group. © 2009 Elsevier B.V. All rights reserved.

1. Introduction

Reducing the dimensionality of a problem is, in many real-world problems, an essential step before any analysis of the data. The general criterion for reducing the dimensionality is the desire to preserve most of the relevant information of the original data according to some optimality criteria [1]. Dimensionality reduction or feature selection has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature selection is to choose a subset of input features by eliminating features with little or no predictive information. In particular, feature selection removes irrelevant features, increases the efficiency of learning tasks, improves learning performance and enhances the comprehensibility of learned results [2,3]. The feature selection problem can be viewed as a special case of the feature-weighting problem. The weight associated with a feature measures its relevance or significance in the classification task [4]. If we restrict the weights to be binary valued, the feature-weighting problem reduces to the feature selection problem. Feature selection algorithms fall into two broad categories, the filter model or the wrapper model [5]. Filter models use an evaluation function that relies solely on properties of the data, and thus are independent of any particular induction algorithm. Wrapper models use the inductive algorithm itself to estimate the value of a given subset. Most algorithms for feature selection perform either heuristic or exhaustive search [6]. Heuristic feature selection algorithms estimate a feature's quality with a heuristic measure such as information gain [7], the Gini

index [8], discrepancy measures [9] and the chi-square test [10]. Other examples of heuristic algorithms include the Relief algorithm and its extensions [11]. Exhaustive feature selection algorithms search all possible combinations of features and aim at finding a minimal combination of features that is sufficient to construct a model consistent with a given set of instances, such as the FOCUS algorithm [12]. Various approaches have been proposed for finding irrelevant features and removing them from the feature set. The C4.5 decision tree presented in [13] finds relevant features by keeping only those features that appear in the decision tree. The cross-validation method is applied in [14] to filter irrelevant features before constructing ID3 and C4.5 decision trees. A neural network is used in [15] to estimate the relative importance of each feature (with respect to the classification task) and assign it a corresponding weight. When properly weighted, an important feature receives a larger weight than less important or irrelevant features. In general, feature selection refers to the study of algorithms that select an optimal subset from the input feature set, where optimality normally depends on the evaluation criteria or the application's needs. Genetic algorithms (GAs) have therefore received much attention because of their ability to solve difficult optimization problems. GAs are search methods that have been widely used in feature selection where the size of the search space is large [16]. The most important differences between GAs and traditional optimization algorithms are that genetic algorithms work with a coded version of the parameters, and that they do not search from one single point but from a population of points. A crucial issue in the design of a genetic algorithm is the choice of the fitness function. This function is used to evaluate the quality of each


hypothesis, and it is the function to be optimized in the target problem. This paper presents a novel algorithm for feature subset selection from a trained neural network using a genetic algorithm. It does not depend on the ANN training algorithms and it does not modify the training results. The GA is used to find the optimal (relevant) input features, which maximize the output functions of the trained neural network. The organization of this paper is as follows. The problem formulation is described in Section 2. The data preprocessing is performed in Section 3. The proposed feature selection algorithm is outlined in Section 4. An initial experiment is described in Section 5 to demonstrate the feasibility of the proposed algorithm. The application and results are reported in Section 6. The conclusion and future work are presented in Section 7.

2. Problem description

The proposed feature subset selection algorithm starts with training the artificial neural network on the input features and the corresponding class. The ANN is trained until a satisfactory error level is reached. Each input unit typically corresponds to a single feature and each output unit corresponds to a class value. The main objective of our approach is to encode the network in such a way that a genetic algorithm can run over it. After training the ANN, the weights between the input-hidden and hidden-output layers are extracted. Therefore, each output node of the ANN can be represented as a general function of the input features and the extracted weights. The activation function used in the hidden and output nodes of the ANN is a sigmoid function. Therefore, each output function is a non-linear exponential function, which has a maximum output value of one. For each output node, the GA is used to find the optimal values of the input features, which maximize the output function. The obtained features represent the relevant features for each class value. The dominant features in all classes are the overall relevant features for a given database. The proposed algorithm for feature selection is shown in Fig. 1.

3. Data preprocessing

Some learning algorithms, such as neural networks, are often trained more successfully and more quickly when discrete input features are used. Therefore, the features which have numerical values in a given database must be treated using a discretization technique. The discretization technique splits the continuous feature values into small sets of intervals, where each interval has lower and upper bounds [X_lower, X_upper]. These intervals are then transformed into linguistic terms. The discretization of features is formulated as an optimization problem: the GA is used to obtain the optimal boundaries of these intervals, which maximize the density of the predominant class while minimizing the densities of the other classes in the given interval. As a result, all features in the given database can be transformed into discrete values by substituting each interval with a linguistic term such as short (S), medium (M) or large (L). The mathematical model of this technique and its applications is presented in [17]. A minimal sketch of the final mapping step is given below.
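The following Python fragment is only an illustrative sketch of the last step of this preprocessing; it is not the optimization model of [17]. It assumes two cut points for a numeric feature have already been found (the cut-point values in the example are hypothetical) and simply maps each value to one of the linguistic terms S, M or L.

```python
def discretize(value, lower_cut, upper_cut):
    """Map a numeric feature value to a linguistic term, given two interval boundaries."""
    if value <= lower_cut:
        return "S"          # short
    if value <= upper_cut:
        return "M"          # medium
    return "L"              # large

# Example with hypothetical cut points for a temperature-like feature:
print(discretize(18.3, lower_cut=15.0, upper_cut=25.0))   # -> "M"
```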



4. The proposed algorithm

A supervised ANN uses a set of M examples or records. These records include N features. Each feature, F_n (n = 1, 2, ..., N), can be encoded into a fixed-length binary sub-string {x_1 ... x_i ... x_{V_n}}, where V_n is the number of possible values of the nth feature. The element x_i = 1 if its corresponding feature value exists, while all the other elements are 0. Therefore, the proposed number of input nodes, I, in the input layer of the ANN is given as:

I = \sum_{n=1}^{N} V_n    (1)

Consequently, the input features vector, X_m, to the input layer can be rewritten as:

X_m = {x_1 ... x_i ... x_I}    (2)

The output class vector, T_m, is encoded as a bit vector of fixed length K as follows:

T_m = {O_1 ... O_k ... O_K}    (3)

where m = (1, 2, ..., M), M is the total number of input training patterns, k = (1, 2, ..., K), and K is the number of different possible classes. If the output vector belongs to class_k, then the element O_k is equal to one while all the other elements in the vector are zeros. Therefore, the proposed number of output nodes in the output layer of the ANN is K. Accordingly, the input and output nodes of the ANN are determined, and the structure of the ANN is shown in Fig. 2.

Fig. 1. The framework of the proposed feature subset selection algorithm: input database; preprocessing the input features using the discretization technique; encoding the input features and the corresponding class by bit strings; training the ANN on the encoded input and output patterns; extracting the weights between the input-hidden and hidden-output layers; generating a general function for each output node of the ANN; using the GA to find the relevant features which maximize each output function; obtaining the overall relevant features.

Fig. 2. The structure of the artificial neural network: an input layer of I nodes (x_1 ... x_I, grouped by features F_1 ... F_N), a hidden layer of J nodes, and an output layer of K nodes (O_1 ... O_K), with weights W_ij between the input and hidden layers and W_jk between the hidden and output layers.
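As a concrete illustration of the encoding of Eqs. (1)-(3), the short Python sketch below one-hot encodes a nominal record and its class label. It is not code from the paper; the helper name encode_record is made up for this example, and the feature names anticipate the Play Tennis data of Section 5.

```python
def encode_record(record, feature_values, class_values, label):
    """One-hot encode a record (Eqs. (1)-(2)) and its class label (Eq. (3)).

    record: dict mapping feature name -> observed value
    feature_values: dict mapping feature name -> ordered list of its V_n values
    class_values: ordered list of the K class labels
    """
    x = []
    for feature, values in feature_values.items():
        x += [1 if record[feature] == v else 0 for v in values]   # one sub-string per feature
    t = [1 if label == c else 0 for c in class_values]            # one-hot class vector
    return x, t

# Example with the Play Tennis features used later in Section 5 (I = 3+3+2+2 = 10, K = 2):
feature_values = {"Outlook": ["Sunny", "Overcast", "Rain"],
                  "Temperature": ["Hot", "Mild", "Cool"],
                  "Humidity": ["High", "Normal"],
                  "Wind": ["Weak", "Strong"]}
x, t = encode_record({"Outlook": "Sunny", "Temperature": "Hot",
                      "Humidity": "High", "Wind": "Weak"},
                     feature_values, ["No", "Yes"], "No")
# x == [1,0,0, 1,0,0, 1,0, 1,0] and t == [1, 0], matching row X1/T1 of Table 2
```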


The ANN is trained on the encoded input features' vectors and the corresponding output classes' vectors. Training proceeds until the desired convergence between the actual and the desired output is achieved. The convergence rate can be improved by changing the number of iterations, the number of hidden nodes (J), the learning rate and the momentum rate. After training the ANN, two groups of weights are obtained. The first group, W_ij, contains the weights between input node i and hidden node j. The second group, W_jk, contains the weights between hidden node j and output node k. The activation function used in the hidden and output nodes of the ANN is a sigmoid function. The input to the jth hidden node, IH_j, is given by:

IH_j = \sum_{i=1}^{I} x_i \cdot W_{ij}    (4)

Consequently, the output of the jth hidden node is given by:

OH_j = 1 / (1 + \exp(-IH_j))    (5)

The total input to the kth output node can be written as:

IO_k = \sum_{j=1}^{J} W_{jk} \cdot OH_j    (6)

Therefore, the final value of the kth output node, O_k, is given by:

O_k = 1 / (1 + \exp(-IO_k))    (7)

The function O_k = f(x_i, W_{ij}, W_{jk}) is an exponential function in x_i, since W_{ij} and W_{jk} are constants, and its maximum output value is one. Therefore, the input features vector X_m belongs to class_k iff the element O_k of T_m is equal to one and all other elements in T_m are zeros. To extract the input features relevant to a specific class_k, one must find the input features vector which maximizes O_k. This is an optimization problem and can be stated as:

Maximize:

O_k = 1 / (1 + \exp(- \sum_{j=1}^{J} W_{jk} \cdot [1 / (1 + \exp(- \sum_{i=1}^{I} x_i \cdot W_{ij}))]))    (8)

Subject to: x_i are binary values (0 or 1)    (9)

Since the objective function O_k(x_i) is non-linear and the constraints are binary, this is a non-linear integer optimization problem, and the GA can be used efficiently to solve it. The GA is an iterative procedure that searches for an optimal solution in a solution space. Since the solution space is usually huge, the GA adopts a heuristic approach. In each iteration, a fixed-size set of candidate solutions, called the population, is examined. Each member of this population is encoded as a finite string called a chromosome. Each chromosome of the population contains one bit for each value of each feature: a chromosome of length I corresponds to an I-dimensional binary feature vector X, where each bit represents the elimination or inclusion of the associated feature value, with x_i = 0 representing elimination and x_i = 1 indicating inclusion of the ith feature value. Therefore, all possible chromosomes form the set of possible solutions in the given problem space. The standard GA procedure imitates biological evolution. First, a random or heuristic process is conducted to produce the initial population. Second, each member of this population is evaluated according to the fitness function, O_k, as presented in Eq. (8). Finally, those which score higher fitness values are assigned higher probabilities of being selected for the creation of the next generation. Thus, individuals with high fitness are more likely to be retained for reproduction, while those with low fitness values are more likely to disappear as the evolution proceeds. This procedure is called selection; basically, selection prepares the population for the later reproduction. Reproduction is implemented by special operations, the two best known of which are crossover and mutation. Crossover is a process between two chromosomes, named parents, in a population: the parents exchange parts of their chromosomes to form two new individuals, called offspring. The mutation operation flips bits of individual chromosomes at random with some small probability. In general, the GA is a stochastic search process and is not guaranteed to converge, so some termination condition should be specified to stop the iteration, for example stopping after some fixed number of generations. For extracting the relevant features which belong to class_k, the best chromosome is decoded as follows:

- The best chromosome is divided into N segments.
- Each segment represents one feature, F_n (n = 1, 2, ..., N), and has a corresponding bit length V_n, which represents its values.
- A feature exists if any of its corresponding bits in the best chromosome is equal to one, and vice versa.
- A feature involving all of its values is equivalent to being dropped (i.e., the feature is irrelevant).

Fig. 3 explains how the ANN and GA can be used to obtain the best input features vector (chromosome), which maximizes the output function O_k(x_i) of the trained ANN. The steps of Fig. 3 are repeated for each output node (class) to find a corresponding best chromosome. The dominant features in all classes are the feature subset to be selected from the input feature group. In general, the GA searches for a subset of features which maximizes the objective function, O_k, at each output node of the trained ANN. Usually, these are the features associated with the highest values of the weights between the layers of the trained ANN (relevant features). Conversely, the GA eliminates the features associated with the lowest values of the weights (irrelevant features), which dampen the objective function.

Fig. 3. The pseudo-code of the ANN and GA for selecting the relevant features.

Begin
  - Collect a set of data.
  - Divide the data into training and test data sets.
  - Set the training parameters (such as learning rate, momentum, etc.).
  - Train the different neural network structures on the training data set.
  - Choose the trained network with the highest accuracy rate and obtain the weights between the layers.
  - Construct the general formula for each output node (class) of the ANN, O_k(x_i).
  - Take O_k(x_i) as the fitness function of the GA.
  - Create a chromosome structure as follows:
      * Generate a number of slots equal to I, representing the input features vector X.
      * Put a random value, 0 or 1, in each slot.
  - G = 0, where G is the generation number.
  - Create the initial population, P(t)_G, of T chromosomes, where t = 1 to T.
  - Evaluate the fitness function on P(t)_G.
  - While the termination conditions are not satisfied do:
      * G = G + 1.
      * Select a number of chromosomes from the population P(t)_G according to the roulette-wheel procedure.
      * Recombine the selected chromosomes using crossover and mutation.
      * Update the population from P(t)_{G-1} to P(t)_G.
      * Evaluate the fitness function on P(t)_G.
  - Display the best chromosome which satisfies the conditions.
  - Decode the best chromosome to extract the relevant feature set.
End
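To make Eqs. (4)-(9) and the procedure of Fig. 3 concrete, the following Python sketch shows one possible implementation of the fitness function, of a simple roulette-wheel GA with one-point crossover and bit-flip mutation, and of the chromosome decoding rule above. This is an illustrative reconstruction, not the author's code; all function names and the elitism-free GA loop are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_node(x, W_ij, W_jk, k):
    """Eqs. (4)-(7): forward pass of the trained ANN for a binary vector x, returning O_k."""
    hidden = 1.0 / (1.0 + np.exp(-(x @ W_ij)))     # OH_j, Eqs. (4)-(5); W_ij has shape (I, J)
    out = 1.0 / (1.0 + np.exp(-(hidden @ W_jk)))   # O_k,  Eqs. (6)-(7); W_jk has shape (J, K)
    return out[k]

def ga_maximize(W_ij, W_jk, k, pop_size=10, generations=1000, p_cross=0.25, p_mut=0.01):
    """Maximize O_k(x) over binary chromosomes of length I (Eqs. (8)-(9))."""
    I = W_ij.shape[0]
    pop = rng.integers(0, 2, size=(pop_size, I))
    for _ in range(generations):
        fitness = np.array([output_node(c, W_ij, W_jk, k) for c in pop])
        # roulette-wheel selection: probability proportional to fitness (always > 0 here)
        parents = pop[rng.choice(pop_size, size=pop_size, p=fitness / fitness.sum())]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):                      # one-point crossover per pair
            if rng.random() < p_cross:
                cut = rng.integers(1, I)
                children[i, cut:], children[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        flips = rng.random(children.shape) < p_mut               # bit-flip mutation
        children[flips] = 1 - children[flips]
        pop = children
    fitness = np.array([output_node(c, W_ij, W_jk, k) for c in pop])
    return pop[fitness.argmax()], fitness.max()

def decode(chromosome, value_counts):
    """Split the best chromosome into per-feature segments of length V_n and apply the
    decoding rule above: a feature is kept for this class only if some, but not all,
    of its value bits are set."""
    relevant, start = [], 0
    for n, v in enumerate(value_counts):
        segment = chromosome[start:start + v]
        if 0 < segment.sum() < v:
            relevant.append(n)
        start += v
    return relevant
```

With the weight matrices of a trained network, calling ga_maximize once per output node and decoding each winner follows the workflow of Fig. 3; the overall relevant subset is then taken from the decoded features across all classes.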



5. Initial experiment

The experiment described in this section is chosen to demonstrate the simplicity of the proposed feature subset selection algorithm. A given database, which has four features and two different output classes, is shown in Table 1 [18]. The encoding of the given database is shown in Table 2. The ANN is trained on the encoded input features vectors, X_m, and the corresponding output classes vectors, T_m. The number of input nodes is given by:

I = \sum_{n=1}^{N} V_n = V_1 + V_2 + V_3 + V_4 = 10    (10)

The number of output nodes is:

K = 2    (11)

The convergence between the actual and the desired output is achieved with four hidden nodes, a 0.55 learning coefficient, a 0.65 momentum coefficient and 300 iterations. The allowable error equals 0.000001. Table 3 shows the first group of weights, W_ij, between each input node and the hidden nodes. The second group of weights, W_jk, between each hidden node and the output nodes is shown in Table 4. The GA is then applied to maximize the function O_k(x_i) for each class in order to obtain the relevant features. The GA consists of a population of 10 individuals evolving over 1300 generations; the crossover and mutation rates were 0.25 and 0.01, respectively. The best chromosome (most relevant features) and the corresponding fitness value for each class are shown in Table 5. The temperature feature shows all of its values in class1 and does not appear in class2; therefore it is an irrelevant feature. The dominant features in the two classes are {Outlook, Humidity, Wind}. These features are the overall relevant features for the given database. The ID3 algorithm [18] and the RITIO algorithm [19] use the same database for extracting rules.

Table 1. Example for the target concept Play Tennis [18].

Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis
D1  | Sunny    | Hot         | High     | Weak   | No
D2  | Sunny    | Hot         | High     | Strong | No
D3  | Overcast | Hot         | High     | Weak   | Yes
D4  | Rain     | Mild        | High     | Weak   | Yes
D5  | Rain     | Cool        | Normal   | Weak   | Yes
D6  | Rain     | Cool        | Normal   | Strong | No
D7  | Overcast | Cool        | Normal   | Strong | Yes
D8  | Sunny    | Mild        | High     | Weak   | No
D9  | Sunny    | Cool        | Normal   | Weak   | Yes
D10 | Rain     | Mild        | Normal   | Weak   | Yes
D11 | Sunny    | Mild        | Normal   | Strong | Yes
D12 | Overcast | Mild        | High     | Strong | Yes
D13 | Overcast | Hot         | Normal   | Weak   | Yes
D14 | Rain     | Mild        | High     | Strong | No

Table 4. The weights between the hidden nodes (j) and the output nodes (k).

Output node (k) | j = 1   | j = 2   | j = 3   | j = 4
1               | 9.20896 | 9.01273 | 1.21132 | 0.90564
2               | 9.22879 | 9.00487 | 0.77388 | 1.21893

Table 2. Encoding of the given database. Columns: x1-x3 = Outlook (V1 = 3: Sunny, Overcast, Rain), x4-x6 = Temperature (V2 = 3: Hot, Mild, Cool), x7-x8 = Humidity (V3 = 2: High, Normal), x9-x10 = Wind (V4 = 2: Weak, Strong); O1-O2 = Play Tennis (No, Yes).

I/P pattern X_m, O/P pattern T_m | x1 x2 x3 | x4 x5 x6 | x7 x8 | x9 x10 | O1 O2
X1 / T1   | 1 0 0 | 1 0 0 | 1 0 | 1 0 | 1 0
X2 / T2   | 1 0 0 | 1 0 0 | 1 0 | 0 1 | 1 0
X3 / T3   | 0 1 0 | 1 0 0 | 1 0 | 1 0 | 0 1
X4 / T4   | 0 0 1 | 0 1 0 | 1 0 | 1 0 | 0 1
X5 / T5   | 0 0 1 | 0 0 1 | 0 1 | 1 0 | 0 1
X6 / T6   | 0 0 1 | 0 0 1 | 0 1 | 0 1 | 1 0
X7 / T7   | 0 1 0 | 0 0 1 | 0 1 | 0 1 | 0 1
X8 / T8   | 1 0 0 | 0 1 0 | 1 0 | 1 0 | 1 0
X9 / T9   | 1 0 0 | 0 0 1 | 0 1 | 1 0 | 0 1
X10 / T10 | 0 0 1 | 0 1 0 | 0 1 | 1 0 | 0 1
X11 / T11 | 1 0 0 | 0 1 0 | 0 1 | 0 1 | 0 1
X12 / T12 | 0 1 0 | 0 1 0 | 1 0 | 0 1 | 0 1
X13 / T13 | 0 1 0 | 1 0 0 | 0 1 | 1 0 | 0 1
X14 / T14 | 0 0 1 | 0 1 0 | 1 0 | 0 1 | 1 0

Table 3. The weights between the input nodes (i) and the hidden nodes (j).

Hidden node (j) | i = 1  | i = 2  | i = 3  | i = 4  | i = 5  | i = 6  | i = 7  | i = 8  | i = 9  | i = 10
1               | 4.0969 | 6.1545 | 0.8252 | 0.4222 | 4.1281 | 2.732  | 4.9341 | 5.2821 | 3.0601 | 3.6301
2               | 3.7412 | 4.566  | 1.1149 | 0.2961 | 3.0774 | 2.5952 | 4.0053 | 4.3672 | 3.1160 | 2.2842
3               | 1.2106 | 0.3498 | 0.1533 | 0.1970 | 0.1549 | 0.5676 | 1.1703 | 0.2353 | 1.1067 | 1.3633
4               | 1.4285 | 1.1095 | 0.4791 | 0.5540 | 0.6519 | 0.3253 | 0.8969 | 0.6167 | 0.5679 | 1.0215
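As an illustrative driver only (not part of the paper), the snippet below shows how the ga_maximize and decode helpers sketched in Section 4 would be run for this experiment, one GA per output node, using the population size, generation count, crossover and mutation rates quoted above. Random stand-in weights are used so the fragment runs on its own; in the actual experiment W_ij and W_jk would be filled from Tables 3 and 4 (as 10 x 4 and 4 x 2 matrices).

```python
import numpy as np

rng = np.random.default_rng(1)
W_ij = rng.normal(size=(10, 4))      # stand-in for Table 3 (input-to-hidden weights)
W_jk = rng.normal(size=(4, 2))       # stand-in for Table 4 (hidden-to-output weights)
value_counts = [3, 3, 2, 2]          # V_n for Outlook, Temperature, Humidity, Wind

for k in range(2):                   # one GA run per output node (class)
    best, fit = ga_maximize(W_ij, W_jk, k, pop_size=10, generations=1300,
                            p_cross=0.25, p_mut=0.01)
    print(f"class {k + 1}: chromosome {best}, fitness {fit:.6f}, "
          f"relevant feature indices {decode(best, value_counts)}")
# The overall relevant subset is the union of the decoded features over the two classes (Table 5).
```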


Table 5. The relevant features for the Play Tennis database (best chromosome and fitness per class).

Class  | Outlook (Sunny, Overcast, Rain): x1 x2 x3 | Temperature (Hot, Mild, Cool): x4 x5 x6 | Humidity (High, Normal): x7 x8 | Wind (Weak, Strong): x9 x10 | Fitness
Class1 | 1 0 1 | 1 1 1 | 1 0 | 0 1 | 0.999874
Class2 | 0 1 0 | 0 0 0 | 0 1 | 1 0 | 0.999983

The confidence and supporting levels of these rules are shown in Table 6 and can be calculated according to the following equations:

Confidence level = A / (A + B)    (12)

Supporting level = A / T    (13)

where A is the number of correct instances which satisfy the rule, B is the number of incorrect instances which do not satisfy the rule, and T is the total number of instances in the given database. These rules have a high confidence level (100%) and cover all instances (total supporting level = 100%) in the given database. However, these rules indicate that the temperature feature is neglected (an irrelevant feature). This result agrees with the proposed algorithm, which neglects the same feature. Another method for determining the most relevant feature set is presented in [20]. The evaluation function (ES) of this method depends on the correlation among the features of the set (C_AA) and between the features and the corresponding target (C_AT). The evaluation function is given by:

ES = (n \cdot C_AT) / \sqrt{n + n(n - 1) \cdot C_AA}    (14)

where n is the number of features, C_AA is the average of the feature-feature correlations, and C_AT is the average of the feature-target correlations. Table 7 illustrates the evaluation function for all possible sets for the same database. From group (1), it is clear that the preferable initial set is {Humidity}, which has the largest evaluation function.
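A small sketch of Eqs. (12)-(14) is given below; it is not from the paper, and the function names are chosen for this example. The closing comment checks one row of Table 6.

```python
from math import sqrt

def confidence_level(correct, incorrect):
    """Eq. (12): A / (A + B)."""
    return correct / (correct + incorrect)

def supporting_level(correct, total):
    """Eq. (13): A / T."""
    return correct / total

def evaluation_function(n, c_at, c_aa):
    """Eq. (14): merit of an n-feature set with average feature-target correlation c_at
    and average feature-feature correlation c_aa."""
    return (n * c_at) / sqrt(n + n * (n - 1) * c_aa)

# The rule "if outlook = overcast then play" covers 4 of the 14 records with no
# counter-examples, so confidence_level(4, 0) == 1.0 (100%) and
# supporting_level(4, 14) ~= 0.2857 (28.571%), as reported in Table 6.
```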

Table 6. The confidence and supporting levels of the extracted rules.

Rule                                                   | Confidence level (%) | Supporting level (%)
If outlook = sunny and humidity = high then don't play | 100 | 21.428
If outlook = sunny and humidity = normal then play     | 100 | 14.285
If outlook = rain and wind = strong then don't play    | 100 | 14.285
If outlook = overcast then play                        | 100 | 28.571
If outlook = rain and wind = weak then play            | 100 | 21.428

Table 7. The evaluation function of all possible sets.

Group     | Possible set                           | Evaluation function
Group (1) | {Outlook}                              | 0.130
Group (1) | {Temperature}                          | 0.025
Group (1) | {Humidity}                             | 0.185
Group (1) | {Wind}                                 | 0.081
Group (2) | {Humidity, Outlook}                    | 0.220
Group (2) | {Humidity, Temperature}                | 0.133
Group (2) | {Humidity, Wind}                       | 0.188
Group (3) | {Humidity, Outlook, Temperature}       | 0.175
Group (3) | {Humidity, Outlook, Wind}              | 0.226
Group (4) | {Humidity, Outlook, Wind, Temperature} | 0.191

Group (2) shows that the set with the highest evaluation function is {Humidity, Outlook}. Group (3) shows that the largest evaluation function is obtained from the set {Humidity, Outlook, Wind}. Group (4) contains the feature set {Humidity, Outlook, Wind, Temperature}, and its evaluation function is less than that of group (3). As a result, the most relevant feature set is {Humidity, Outlook, Wind}, and the temperature feature is rejected. This result agrees with the proposed algorithm, which neglects the same feature.

6. Application and results

The proposed algorithm is evaluated on two different databases, the Monk1's database and the Car Evaluation database, for extracting the relevant features. These databases are drawn from the UCI machine learning database repository [21].

6.1. Monk1's database

The Monk1's problem is a benchmark binary classification task in which robots are described in terms of six features and two classes. The six features and their values are shown in Table 8. The target concept associated with the Monk1's problem is (Head Shape = Body Shape) or (Jacket Color = red). This means that the relevant features are A1, A2 and A5. The ANN is trained on 124 input vectors, X_m, and the corresponding output class vectors, T_m. The number of input nodes, I, and the number of output nodes, K, of the ANN equal 17 and 2, respectively. The convergence between the actual and the desired output is achieved with six hidden nodes, a 0.25 learning coefficient, a 0.85 momentum coefficient and 320 iterations. The allowable error equals 0.0000001. The two groups of weights, W_ij and W_jk, are extracted. The GA has a population of 10 individuals evolving over 1225 generations; the crossover and mutation rates are 0.28 and 0.002, respectively. The best output chromosome for each class is shown in Table 9. The "Smiling" and "Holding" features have all of their values in class1 and do not appear in class2; therefore, these features are irrelevant. The "Tie" feature does not appear in class1 or class2, so it is also irrelevant. The dominant features in the two classes are {Head Shape, Body Shape, Jacket Color}. These features are the overall relevant features for the Monk1's database. The HC algorithm [22] uses the same database and indicates that the features {Head Shape, Body Shape, Jacket Color} are the only

Table 8. The features and their values of Monk1's database [21].

Robot feature     | Nominal values
A1 = Head Shape   | Round, square, octagon
A2 = Body Shape   | Round, square, octagon
A3 = Is Smiling   | Yes, no
A4 = Holding      | Sword, flag, balloon
A5 = Jacket Color | Red, yellow, green, blue
A6 = Has Tie      | Yes, no


Table 9. The relevant features for Monk1's database (best chromosome and fitness per class).

Class  | Head Shape (Round, Square, Octagon): x1-x3 | Body Shape (Round, Square, Octagon): x4-x6 | Smiling (Yes, No): x7-x8 | Holding (Sword, Flag, Balloon): x9-x11 | Jacket Color (Red, Yellow, Green, Blue): x12-x15 | Tie (Yes, No): x16-x17 | Fitness
Class1 | 1 0 1 | 1 0 1 | 1 1 | 1 1 1 | 0 1 1 1 | 0 0 | 0.999443
Class2 | 0 0 0 | 0 0 0 | 0 0 | 0 0 0 | 1 0 0 0 | 0 0 | 0.999999

relevant features in the Monk1's database. Table 10 shows a comparison between the different feature selection algorithms FCBF, CFS, ReliefF and FOCUS for the Monk1's database [23]. The features extracted by each algorithm are compared with the standard relevant features for the Monk1's database presented in [21]. The results indicate that the FCBF and CFS algorithms perform similarly and find extra (irrelevant) features, while the ReliefF and FOCUS algorithms agree with the proposed algorithm and find the true relevant features. To explain how feature selection affects neural network performance, one hundred neural networks were constructed for the Monk1's problem. The Monk1's problem was solved 100 times with the full set of six features and another 100 times with only the three selected features. The neural network construction algorithm was terminated only when 100% classification accuracy on the training data was achieved. The results of the classification accuracy are summarized in Tables 11 and 12. When all features were used as input for the constructed neural networks, thirty of the 100 runs stopped, as shown in Table 11. The average number of iterations and its standard deviation are given in columns 2 and 3; the number of iterations indicates the total number of times the weights are updated. The last four columns of the table show the average and standard deviation of the accuracy of the constructed networks on the training and testing sets. We can see from Table 11 that the algorithm

Table 10. The comparison between the feature selection algorithms for Monk1's database.

Algorithm | Relevant features found
FCBF      | A1, A2, A3, A4, A5
CFS       | A1, A2, A3, A4, A5
ReliefF   | A1, A2, A5
FOCUS     | A1, A2, A5

Table 11. The classification accuracy of the ANN with the entire data set (all features).

Frequency | Iterations (average) | Iterations (std. dev.) | Training accuracy, average (%) | Training accuracy, std. dev. | Testing accuracy, average (%) | Testing accuracy, std. dev.
30        | 556                  | 109                    | 100                            | 0.00                         | 96.03                         | 1.67

Table 12. The classification accuracy of the ANN with the selected feature subset.

Frequency | Iterations (average) | Iterations (std. dev.) | Training accuracy, average (%) | Training accuracy, std. dev. | Testing accuracy, average (%) | Testing accuracy, std. dev.
13        | 376                  | 50                     | 100.00                         | 0.00                         | 100.00                        | 0.00


was successful in constructing networks with a 100% accuracy rate on the training set, while the predictive accuracy on the testing set is 96.03%. When only the three selected features {Head Shape, Body Shape, Jacket Color} were used, the predictive accuracy of the constructed networks improved to 100% on the testing set, as shown in Table 12. In addition, a comparison of the average number of iterations in Tables 11 and 12 shows that it is reduced from 556 to 376, i.e. by (556 - 376)/556, or about 32.4%, so the training time of the neural network is reduced accordingly. In general, the feature subset selection algorithm can be extremely useful in reducing the dimensionality of the data to be processed, reducing execution time and improving the overall performance of the constructed neural network.

6.2. Car Evaluation database

The Car Evaluation database is described in terms of six features. There are 1728 instances, which completely cover the feature space. The Car Evaluation database is classified into four classes: Unacceptable, Acceptable, Good and Very Good. The features and their values are shown in Table 13. The ANN is trained on the 1728 input vectors and the corresponding output class vectors. The number of input nodes, I, and the number of output nodes, K, of the ANN equal 21 and 4, respectively. The convergence between the actual and the desired output is achieved with 9 hidden nodes, a 0.19 learning coefficient, a 0.92 momentum coefficient and 450 iterations. The allowable error equals 0.000001. The two groups of weights, W_ij and W_jk, are extracted. The GA has a population of 20 individuals evolving over 1750 generations; the crossover and mutation rates are 0.27 and 0.0018, respectively. The best output chromosome for each class is shown in Table 14. The "Doors" feature has all of its values in class1 (Unacceptable) and class4 (Very Good) and does not appear in class2 (Acceptable) or class3 (Good). The "Lug_Boot" feature has all of its values in class3 and class4 and does not appear in class1 or class2. Therefore, the "Doors" and "Lug_Boot" features are irrelevant. The dominant features in the four classes are {Buying, Maint., Persons, Safety}. These features are the overall relevant features for the Car Evaluation database. This selected feature subset is identical to the features extracted by the algorithm presented in [24].

Table 13. The features and their values of the Car Evaluation database [21].

Car Evaluation feature | Nominal values
Buying                 | VHigh, high, med, low
Maint.                 | VHigh, high, med, low
Doors                  | 2, 3, 4, 5more
Persons                | 2, 4, more
Lug_Boot               | Small, med, big
Safety                 | Low, med, high


Table 14. The relevant features for the Car Evaluation database (best chromosome and fitness per class).

Class  | Buying (VHigh, High, Medium, Low): x1-x4 | Maint. (VHigh, High, Medium, Low): x5-x8 | Doors (2, 3, 4, 5More): x9-x12 | Persons (2, 4, More): x13-x15 | Lug_Boot (Small, Medium, Big): x16-x18 | Safety (Low, Medium, High): x19-x21 | Fitness
Class1 | 0 0 0 0 | 1 1 1 1 | 1 1 1 1 | 1 0 0 | 0 0 0 | 1 0 0 | 0.99993
Class2 | 0 0 0 1 | 0 0 0 0 | 0 0 0 0 | 0 1 0 | 0 0 0 | 1 0 0 | 0.99989
Class3 | 0 0 1 0 | 0 1 0 1 | 0 0 0 0 | 0 0 1 | 1 1 1 | 0 1 0 | 0.99996
Class4 | 1 1 0 0 | 1 0 1 0 | 1 1 1 1 | 0 1 0 | 1 1 1 | 0 0 1 | 0.99997

7. Conclusion and future work

A novel feature subset selection algorithm based on a trained artificial neural network and a genetic algorithm is presented in this paper. The proposed algorithm is applied to two different applications, the Monk1's database and the Car Evaluation database. The results demonstrate that the proposed algorithm reduces the dimensionality of the two databases by 50% and 33.33%, respectively. Therefore, it is very effective in reducing dimensionality, removing irrelevant features and improving result comprehensibility. The quality of the proposed algorithm has been confirmed by comparing its results with those of other data mining algorithms on the same databases, which gives a good indication of the algorithm's stability. Future work should consist of more experiments with other data sets (mixes of nominal and continuous features), as well as more elaborate experiments to optimize the GA parameters of the proposed algorithm.

References

[1] Bovas Abraham, Giovanni Merola, Dimensionality reduction approach to multivariate prediction, Computational Statistics and Data Analysis 48 (1) (2005) 5-16.
[2] Zhiping Chen, Kevin Lü, A preprocess algorithm of filtering irrelevant information based on the minimum class difference, Knowledge-Based Systems 19 (6) (2006) 422-429.
[3] Fangming Zhu, Steven Guan, Feature selection for modular GA-based classification, Applied Soft Computing 4 (4) (2004) 381-393.
[4] Muhammad Atif Tahir, Ahmed Bouridane, Fatih Kurugollu, Simultaneous feature selection and feature weighting using hybrid tabu search/K-nearest neighbor classifier, Pattern Recognition Letters 28 (4) (2007) 438-446.
[5] Marc Sebban, Richard Nock, A hybrid filter/wrapper approach of feature selection using information theory, Pattern Recognition 35 (4) (2002) 835-846.
[6] Manoranjan Dash, Huan Liu, Consistency-based search in feature selection, Artificial Intelligence 151 (1-2) (2003) 155-176.
[7] Changki Lee, Gary Geunbae Lee, Information gain and divergence-based feature selection for machine learning-based text categorization, Information Processing and Management 42 (1) (2006) 155-165.

[8] Mohammed A. Muharram, George D. Smith, Evolutionary feature construction using information gain and Gini index, in: Genetic Programming: 7th European Conference (EuroGP) Proceedings, 2004, pp. 379-388.
[9] N.L. Fernández-García, R. Medina-Carnicer, A. Carmona-Poyato, F.J. Madrid-Cuevas, M. Prieto-Villegas, Characterization of empirical discrepancy evaluation measures, Pattern Recognition Letters 25 (1) (2004) 35-47.
[10] Xin Jin, Anbang Xu, Rongfang Bie, Ping Guo, Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles, in: Data Mining for Biomedical Applications, PAKDD Workshop BioDM Proceedings, vol. 3916, Springer, 2006, pp. 106-115.
[11] Yuhang Wang, Fillia Makedon, Application of Relief-F feature filtering algorithm to selecting informative genes for cancer classification using microarray data, in: CSB, IEEE Computer Society, 2004, pp. 497-498.
[12] Jaekyung Yang, Sigurdur Olafsson, Optimization-based feature selection with adaptive instance sampling, Computers and Operations Research 33 (11) (2006) 3088-3106.
[13] S. Ruggieri, Efficient C4.5, IEEE Transactions on Knowledge and Data Engineering 14 (2) (2002) 438-444.
[14] C.R. Rao, Y. Wu, Linear model selection by cross-validation, Journal of Statistical Planning and Inference 128 (1) (2005) 231-240.
[15] Harinder Sawhney, B. Jeyasurya, A feed-forward artificial neural network with enhanced feature selection for power system transient stability assessment, Electric Power Systems Research 76 (12) (2006) 1047-1054.
[16] Riyaz Sikora, Selwyn Piramuthu, Framework for efficient feature selection in genetic algorithm based data mining, European Journal of Operational Research 180 (2) (2007) 723-737.
[17] M.E. ElAlami, Improving similarity measure via an optimal discretization and weighing of database features, Scientific Bulletin, Ain Shams University, Faculty of Engineering 42 (4) (2007) 369-385, ISSN: 1110-1385.
[18] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 52-81.
[19] Xindong Wu, David Urpani, Induction by attribute elimination, IEEE Transactions on Knowledge and Data Engineering 11 (5) (1999) 805-812.
[20] E. Elalfi, R. Elkamar, M. Sharawy, M.E. ELAlmi, Dimensionality reduction for machine learning algorithms, The Egyptian Computer Journal ISSR, Cairo University, 2001.
[21] UCI Machine Learning Repository.
[22] Marco Muselli, Diego Liberati, Hamming clustering: a new approach to rule extraction, in: Proceedings of the Third ICSC Symposia on Intelligent Industrial Automation (IIA'99) and Soft Computing (SOCO'99), 1-4 June 1999.
[23] Zheng Zhao, Huan Liu, Searching for interacting features, in: The International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[24] Jiye Li, Nick Cercone, Introducing a rule importance measure, Transactions on Rough Sets V, vol. 4100, Springer, Berlin/Heidelberg, 2006, pp. 167-189.