Feature selection using genetic algorithm and cluster validation

Yi-Leh Wu (a), Cheng-Yuan Tang (b,*), Maw-Kae Hor (c), Pei-Fen Wu (b)

(a) Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
(b) Dept. of Information Management, Huafan University, Taipei, Taiwan
(c) Dept. of Computer Science, National Chengchi University, Taipei, Taiwan
* Corresponding author. E-mail: [email protected]

Abstract

Feature selection plays an important role in image retrieval systems. A better selection of features usually results in higher retrieval accuracy. This work selects the best feature set from a total of 78 low-level image features, including regional, color, and texture features, using genetic algorithms (GA). However, the GA is known to be slow to converge. In this work we propose two directions for improving the convergence time of the GA. First, we employ the Taguchi method to reduce the number of offspring that must be tested in every generation of the GA. Second, we propose an alternative measure, Hubert's Γ statistic, to evaluate the fitness of each offspring instead of evaluating the retrieval accuracy directly. The experimental results show that the proposed techniques improve the feature selection results of the GA in both time and accuracy.

Keywords: Feature selection; Image retrieval; Genetic algorithms; Taguchi method; Hubert's Γ statistic

1. Introduction

Today's image retrieval systems usually employ low-level color, shape, and texture features to represent the contents of a given image. However, these low-level image features are usually not consistent with human perception of semantics, which results in less satisfactory image retrieval performance. In recent years many researchers have proposed image retrieval systems that better match the human visual system, such as the Region-Based Image Retrieval (RBIR) systems (Carson, Belongie, Greenspan, & Malik, 2002; Li, Dai, Xu, & Er, 2008; Ma & Manjunath, 1997; Wang, 2001; Wang, Li, & Wiederhold, 2001), which employ objects or similar image regions as the basis for retrieval. When the whole image is the retrieval target, if there are many objects in the image or the background is not related to the foreground objects, the retrieval results will be unsatisfactory. Image retrieval systems based on RBIR include Berkeley Blobworld (Carson et al., 2002), UCSB Netra (Ma & Manjunath, 1997), and SIMPLIcity (Wang, 2001). RBIR systems first segment images into many regions and then extract image features from each segmented region. Each region is represented by a feature vector, and feature vectors may have different dimensionality depending on the number of image features used to represent the given region. In this work, we employ the Blobworld (Carson et al., 2002) method to segment images. For each segmented region we then extract the low-level image features. Our initial image feature set includes three categories of features.

The first category represents regional information, which includes the region position, the circumference, the area, etc. The second category represents color information, which includes the Lab values, the invariant moments, the color moments, the color coherence vectors, etc. The third category represents texture information, which includes the edge orientation histogram, the edge density, the anisotropy, the contrast, etc. A total of 78 image features are included in the initial image feature set. The similarity of two given image regions is computed by the Euclidean distance of their corresponding feature vectors. To improve the accuracy of an image retrieval system, it is important to have a proper image feature set that describes the contents of an image; a more suitable image feature set results in higher retrieval accuracy. The main contributions of this work are summarized as follows:

- We propose to employ the Hybrid Taguchi-Genetic Algorithm (HTGA) to perform feature selection for RBIR systems.
- Instead of using the direct retrieval accuracy, which is expensive to compute, to select better offspring in every generation of the HTGA, we propose to use Hubert's Γ statistic, which estimates cluster validity, as the fitness measure.
- We propose to use the Halton quasi-random sampling method, which greatly reduces the computation time of Hubert's Γ statistic.

Our experimental results support that the proposed improvements over the original GA can perform feature set selection efficiently from a large image feature set.
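As a minimal sketch of the similarity computation described above, region similarity reduces to the Euclidean distance between feature vectors; the random 78-dimensional vectors below are placeholders for illustration, not the paper's actual features:

```python
import numpy as np

def region_distance(f1, f2):
    """Euclidean distance between two region feature vectors.

    A smaller distance means the two regions are more similar.
    """
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    return float(np.linalg.norm(f1 - f2))

# Hypothetical 78-dimensional feature vectors for two segmented regions.
rng = np.random.default_rng(0)
region_a = rng.random(78)
region_b = rng.random(78)
print(region_distance(region_a, region_b))
```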

2. The Hybrid Taguchi-Genetic Algorithm (HTGA)

The Taguchi method (Tsai, Liu, & Chou, 2004) is commonly applied in quality management to improve the quality and stability of production. The Taguchi method can reduce the influence of the environment on production. From the manufacturing-cost point of view, the Taguchi method allows the use of low-grade materials and inexpensive equipment while maintaining a certain quality level. From the development-cost point of view, the Taguchi method aims to shorten the construction period and to reduce the required resources. The Taguchi method is a robust design approach characterized by high quality, short development time, and low cost. The two major tools employed by the Taguchi method are the orthogonal array (OA) and the signal–noise ratio (SNR). We briefly discuss each in turn.

2.1. The orthogonal array (OA)

In factorial design, the number of required experiments increases with the number of factors. The Taguchi method utilizes the OA to collect the experimental data directly, and the result is a more robust factor estimator with a smaller number of required experiments. The OA is an important tool for conducting a robust experiment design. A general orthogonal array is denoted as

L_a(b^c),

where a is the number of experimental runs, b is the number of levels for each factor, and c is the number of columns in the orthogonal array. The quality of the OA greatly influences the accuracy and the objectivity of the experiments. To construct a qualified OA we use the following general principles:

1. All factors are assumed to be independent: during the numerical calculation we do not take the interrelation among factors into account. If there are dependent factors, we create a polynomial term to represent them; e.g., if factor A and factor B are dependent, we create a new factor A * B separately as an independent factor.
2. The number of appearances of each level must be equal: to maintain the objectivity of the experiments, the occurrences of the levels must be equal; e.g., in a level-2 orthogonal array, if factor 1 has four 0's, then it must also have four 1's to preserve objectivity.
3. The stronger the orthogonal array, the more reliable the experiment results; however, stronger orthogonal arrays are harder to construct and require more experiments. The strength of an orthogonal array is defined as follows: an OA of level 2 (only 1 and 2 appear) and strength 3 has the characteristic that, selecting any three columns, each of the eight combinations (111, 112, 121, 122, 211, 212, 221, 222) appears at least once. A sample OA of level 2 and strength 3 is shown in Table 1.

Table 1. L8(2^4) orthogonal array.

Run   A   B   C   D
1     1   1   1   1
2     1   1   2   2
3     1   2   1   2
4     1   2   2   1
5     2   1   1   2
6     2   1   2   1
7     2   2   1   1
8     2   2   2   2
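To make the strength property concrete, here is a small verification sketch (ours, not from the paper) that encodes Table 1 and checks that every choice of three columns covers all eight level combinations:

```python
from itertools import combinations, product

# L8(2^4) orthogonal array from Table 1 (levels are 1 and 2).
OA = [
    (1, 1, 1, 1), (1, 1, 2, 2), (1, 2, 1, 2), (1, 2, 2, 1),
    (2, 1, 1, 2), (2, 1, 2, 1), (2, 2, 1, 1), (2, 2, 2, 2),
]

def has_strength(oa, t, levels=(1, 2)):
    """True if every choice of t columns covers all level combinations."""
    n_cols = len(oa[0])
    for cols in combinations(range(n_cols), t):
        seen = {tuple(row[c] for c in cols) for row in oa}
        if seen != set(product(levels, repeat=t)):
            return False
    return True

print(has_strength(OA, 3))  # True: all 8 triples (111, ..., 222) appear
```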

2.2. The signal–noise ratio (SNR)

The Taguchi method employs the SNR to estimate the degree to which each factor, at each level, contributes to the objective function. The formulation of the SNR is derived from unbiasedness in statistics; it estimates how the samples deviate from the center of the population. The general formulation of the SNR is as follows:


SNR = -10 \log_{10}\left[(\bar{y} - m)^2 + S^2\right],

where \bar{y} is the sample mean, m is the object (target) mean, and S is the sample standard deviation. Equivalently,

SNR = -10 \log_{10}\left[(\bar{y} - m)^2 + S^2\right] = -10 \log_{10}\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - m)^2\right],

where n is the number of samples in the population.

In 2004, Tsai et al. proposed the Hybrid Taguchi-Genetic Algorithm (HTGA) (Tsai et al., 2004), which combines the Taguchi method and the GA and results in faster convergence. The main difference between the HTGA and the original GA is that the offspring produced by the crossover operation must pass an additional Taguchi-method test, which yields the optimal offspring in each generation. A diagram of the HTGA is shown in Fig. 1. Through this optimization process, the GA can converge earlier and with improved precision.

Fig. 1. Architecture of the Hybrid Taguchi-Genetic Algorithm (HTGA).

The HTGA is detailed as follows:

1. Initialization (parameter setting): the population size is M chromosomes, the crossover rate is PC, the mutation rate is PM, and the number of generations is N.
2. Fitness: calculate the objective value of each individual and the fitness value of each population.
3. Selection: use the roulette-wheel approach or a similar method to select the individuals with higher fitness to perform crossover.
4. Crossover: determined by the probability PC, select the set of individuals that should cross over. From the set, select two individuals at random, then apply the one-cut-point method to generate two offspring.
5. Taguchi test: with a 2-level orthogonal array appropriate for the experiment, take the offspring of step 4 and calculate their fitness and SNR. Then calculate the effective degree of each factor in the objective function to generate the best offspring.
6. Repeat steps 3 and 4 until the number of better offspring reaches 1/2 * M * PC.
7. Mutation: the probability of mutation is determined by the mutation rate PM.
8. Replacement: sort the parents and offspring by their fitness measures, then select the best M chromosomes as the parents of the next generation.
9. Repeat steps 2-8 until one of the following two stopping conditions is met: the HTGA converges to the optimal solution, or the number of executed generations exceeds the pre-defined threshold.
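The following sketch illustrates how the Taguchi test of step 5 might be realized. It is our own simplified reading, not the authors' code; the per-run SNR proxy and the level-effect rule are assumptions for illustration:

```python
import numpy as np

def taguchi_offspring(child1, child2, oa, fitness):
    """Compose a better offspring gene-by-gene with a 2-level OA and SNR.

    oa has shape (runs, genes) with entries 1 or 2: level 1 copies the gene
    from child1, level 2 from child2. Larger fitness is assumed better.
    """
    child1, child2 = np.asarray(child1), np.asarray(child2)
    oa = np.asarray(oa)
    runs = np.where(oa == 1, child1, child2)           # candidate chromosomes
    y = np.array([fitness(r) for r in runs], dtype=float)
    eta = 10.0 * np.log10(np.maximum(y, 1e-12) ** 2)   # per-run SNR proxy
    best = []
    for g in range(oa.shape[1]):                       # effect of each factor
        e1 = eta[oa[:, g] == 1].mean()                 # mean SNR at level 1
        e2 = eta[oa[:, g] == 2].mean()                 # mean SNR at level 2
        best.append(child1[g] if e1 >= e2 else child2[g])
    return np.array(best)

# Example with the L8(2^4) array of Table 1 and a toy fitness (sum of genes):
L8 = np.array([[1, 1, 1, 1], [1, 1, 2, 2], [1, 2, 1, 2], [1, 2, 2, 1],
               [2, 1, 1, 2], [2, 1, 2, 1], [2, 2, 1, 1], [2, 2, 2, 2]])
print(taguchi_offspring([0, 0, 0, 1], [0, 1, 1, 0], L8, sum))
```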

3. Cluster validity

Cluster validity measures the adequacy of a structure recovered by cluster analysis in a way that can be interpreted objectively. The adequacy of a clustering structure refers to the degree to which the structure reflects the intrinsic character of the data (Bel Mufti, Bertrand, & El Moubarki, 2005; Dubes, 1993; Halkidi, Batistakis, & Vazirgiannis, 2002; Jain & Dubes, 1988; Liu, Jiang, & Kot, 2009; Santos, Marques de Sa, & Alexandre, 2008). In general, there are three criteria for investigating cluster validity, namely external, internal, and relative. Hypothesis tests are used to determine whether a recovered structure is appropriate for the data. When the external and internal criteria are used, the hypothesis tests check whether the value of the index is either very large or very small. Many statistical tools can be employed for cluster validity, e.g., Monte Carlo methods, Hubert's Γ, and the Goodman-Kruskal γ (Jain & Dubes, 1988). The steps for testing the validity of a clustering structure are as follows:

Step 1: Select the clustering structure, the validation criteria, and the index.
Step 2: Obtain the distribution of the index under the no-structure hypothesis.
Step 3: Compute the index for the clustering structure.
Step 4: Statistically test the no-structure hypothesis by determining whether the index from Step 3 is unusually large or unusually small.

3.1. The Hubert's Γ statistic

To validate a computed clustering structure, one can compare it to an a priori structure. Hubert's Γ statistic was designed to measure the fit between data and a priori structures. Let X = [X(i, j)] and Y = [Y(i, j)] be two n × n proximity matrices on n objects. X(i, j) is the observed proximity between objects i and j, and Y(i, j) is defined as:

Y(i, j) = \begin{cases} 0 & \text{if objects } i \text{ and } j \text{ belong to the same category,} \\ 1 & \text{if not.} \end{cases}

Hubert's Γ statistic measures the point-by-point correlation between the two matrices X and Y. When X and Y are symmetric, we have

\Gamma = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i, j) Y(i, j).

However, the Γ computed from the above equation is in its raw form. The normalized Γ statistic is

\Gamma = \frac{(1/M) \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} [X(i, j) - m_x][Y(i, j) - m_y]}{S_x S_y},

where M = n(n - 1)/2, m_x and m_y denote the sample means of the entries in X and Y, and S_x and S_y denote the sample standard deviations of the entries in X and Y. The normalized Γ statistic ranges between -1 and 1. If the two matrices agree with each other in structure, then the absolute value of the Γ statistic will be unusually large. One of the most common applications of the Γ statistic is to test the random-label hypothesis, i.e., could the values in one of the two matrices X and Y have been inserted at random? To test the random-label hypothesis, the distribution of Γ under that hypothesis must be known in advance. This distribution is the accumulated histogram of Γ over all n! permutations of the row and column numbers of Y.
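A direct transcription of the normalized statistic as a sketch (our code, following the definitions above; np.std gives the population standard deviation used here):

```python
import numpy as np

def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma between a proximity matrix X and labels.

    X: (n, n) symmetric proximity matrix. Y(i, j) is 0 when objects i and j
    share a category and 1 otherwise, matching the paper's definition.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    iu = np.triu_indices(n, k=1)              # pairs with i < j, M = n(n-1)/2
    x = X[iu]
    y = (labels[:, None] != labels[None, :]).astype(float)[iu]
    return float(((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std()))
```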

3.2. The Halton Quasi-random Numbers

When testing whether the value of Γ is unusually large, the distribution must be found by evaluating Γ for all n! permutations in advance. With six objects, 6! = 720 values of Γ must be computed; with nine objects, 9! = 362,880 values must be found. This leads to a computationally expensive procedure. We propose to employ the Halton quasi-random numbers technique (Press, Teukolsky, Vetterling, & Flannery, 2002) as a solution to this high computational cost. The random samples generated by the Halton quasi-random numbers technique are distributed uniformly in n-dimensional space. We employ the technique to reduce computation by generating sample distributions for Hubert's Γ statistic. The distributions of the Halton quasi-random numbers in a two-dimensional space are shown in Fig. 2.

Fig. 2. Random samples using Halton quasi-random numbers: (a) points 1-128, (b) points 129-512, (c) points 513-1024, and (d) points 1-1024.
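A minimal Halton generator is sketched below (the standard radical-inverse construction; bases 2 and 3 are our choice for a 2-D example like Fig. 2):

```python
def halton(index, base):
    """The radical-inverse (van der Corput) value of `index` in `base`."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * f
        f /= base
    return result

def halton_2d(n):
    """First n points of the 2-D Halton sequence with bases 2 and 3."""
    return [(halton(i, 2), halton(i, 3)) for i in range(1, n + 1)]

# Each coordinate can be mapped to, e.g., the index of a permutation of Y,
# giving a small, well-spread sample of the n! relabelings.
print(halton_2d(5))
```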


4. Measure of retrieval efficiency

4.1. The precision and recall

Recall and precision are the measurements of search efficiency in information retrieval. Recall refers to the ratio of the number of relevant images retrieved to the total number of relevant images in the collection. Precision refers to the ratio of the number of relevant images retrieved to the total number of images retrieved. Higher recall and precision mean better search efficiency. The definitions of recall and precision are depicted in Fig. 3.

Fig. 3. Recall and precision: a + b + c + d is all the images in the database, a + c is the relevant images, and a + b is the retrieved images.

Recall = (number of relevant images retrieved) / (total number of relevant images in the collection),
Precision = (number of relevant images retrieved) / (total number of images retrieved).
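In code, with hypothetical sets of retrieved and relevant image identifiers, the two ratios are:

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for one query, per the definitions above."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # 'a' in Fig. 3
    recall = hits / len(relevant)             # a / (a + c)
    precision = hits / len(retrieved)         # a / (a + b)
    return recall, precision

print(recall_precision(retrieved={1, 2, 3, 4}, relevant={2, 4, 6}))
# (0.666..., 0.5)
```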

4.2. The F-measure

The F-measure (Hanza, 2003) combines precision and recall into a single measure. Let D denote all the data in the database, let C = {C_1, ..., C_k} denote the k clusters discovered by the clustering algorithm, and let C* = {C*_1, ..., C*_l} denote the target clusters from the same data, where l is the number of target clusters. The F-measure is defined as:

F = \sum_{i=1}^{l} \frac{|C^*_i|}{|D|} \max_{j=1,\dots,k} \{F_{i,j}\}, \qquad F_{i,j} = \frac{2}{\frac{1}{prec(i,j)} + \frac{1}{rec(i,j)}},

where prec(i, j) = |C_j ∩ C*_i| / |C_j| and rec(i, j) = |C_j ∩ C*_i| / |C*_i|. A larger F-measure indicates that the cluster structure produced by the clustering algorithm is more similar to the target cluster structure.
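A sketch of the F-measure under the definitions above, with clusters and targets encoded as sets of item identifiers (an assumption for illustration):

```python
def f_measure(clusters, targets, n_items):
    """Clustering F-measure: weighted best per-target F over found clusters."""
    total = 0.0
    for t in targets:
        best = 0.0
        for c in clusters:
            overlap = len(c & t)
            if overlap == 0:
                continue
            prec, rec = overlap / len(c), overlap / len(t)
            best = max(best, 2.0 / (1.0 / prec + 1.0 / rec))
        total += (len(t) / n_items) * best
    return total

# Toy example: two found clusters against two target clusters of 6 items.
print(f_measure([{1, 2, 3}, {4, 5, 6}], [{1, 2}, {3, 4, 5, 6}], 6))
```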


4.3. The CS measure

Suppose a clustering algorithm generates the clustering structure X = {x_j; j = 1, 2, ..., N}, where N is the number of data points. The CS measure (Chou, Su, & Lai, 2003) is defined as:

CS(c) = \frac{\frac{1}{c}\sum_{i=1}^{c}\left\{\frac{1}{|A_i|}\sum_{x_j \in A_i} \max_{x_k \in A_i} d(x_j, x_k)\right\}}{\frac{1}{c}\sum_{i=1}^{c}\left\{\min_{j \in c,\, j \neq i} d(v_i, v_j)\right\}} = \frac{\sum_{i=1}^{c}\left\{\frac{1}{|A_i|}\sum_{x_j \in A_i} \max_{x_k \in A_i} d(x_j, x_k)\right\}}{\sum_{i=1}^{c}\left\{\min_{j \in c,\, j \neq i} d(v_i, v_j)\right\}},

where v_i = \frac{1}{|A_i|}\sum_{x_j \in A_i} x_j, c is the number of clusters, A_i is the data grouped into the ith cluster, |A_i| is the cardinality of the ith cluster, and d is the distance function. A smaller CS measure indicates a better cluster structure produced by the clustering algorithm.
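A sketch of the CS measure under the definitions above, assuming Euclidean distance:

```python
import numpy as np

def cs_measure(clusters):
    """CS measure for a list of clusters, each an (n_i, d) array of points."""
    clusters = [np.asarray(A, dtype=float) for A in clusters]
    centroids = [A.mean(axis=0) for A in clusters]
    # Numerator: mean over clusters of the average max intra-cluster distance.
    intra = 0.0
    for A in clusters:
        d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        intra += d.max(axis=1).mean()
    # Denominator: mean over clusters of the nearest other-centroid distance.
    sep = 0.0
    for i, vi in enumerate(centroids):
        sep += min(np.linalg.norm(vi - vj)
                   for j, vj in enumerate(centroids) if j != i)
    return intra / sep   # the two 1/c factors cancel

a = [[0, 0], [0, 1], [1, 0]]
b = [[5, 5], [5, 6], [6, 5]]
print(cs_measure([a, b]))   # small value: compact, well-separated clusters
```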

Fig. 4. Images from the leopard category.

Fig. 5. Images from the bird category.

5. Experiments

5.1. Feature selection

In this section we conduct experiments combining the HTGA and Hubert's Γ statistic to select a better image feature set. We use the Corel image dataset and select images from two categories, leopards and birds, for these experiments. We randomly take five images from each category, as shown in Figs. 4 and 5, and conduct a total of 15 experiments. We employ the Blobworld method to segment images. The total number of initial image features is 78; the detailed numbers of image features in each category are shown in Table 2.

Table 2. Initial image features set.

We then employ the HTGA to automatically select features, with Hubert's Γ statistic used to evaluate fitness. The fitness function is defined as Γ(sample) minus the 95% critical value; larger values are better. In Table 3, the indices are highlighted according to the color, texture, and region feature categories of Table 2. Table 3 shows, for each feature, in how many of the 15 experiments it was selected. Table 4 lists the features of Table 3 sorted by that count. Table 5 shows the final fitness values and the number of features selected in each experiment. From Table 5, we observe that the first experiment produces the largest fitness value and that the average number of features selected is 32.

Table 3. Feature selection results.

Table 4. Feature selection results (sorted).

Table 5. Details of the 15 feature selection experiments.
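To make the fitness function above concrete, here is a hedged sketch that reuses the `hubert_gamma` function from the Section 3.1 sketch. The permutation count is our choice, and we substitute pseudo-random permutations where the paper samples the permutation space with Halton quasi-random numbers:

```python
import numpy as np

def gamma_fitness(X, labels, n_perm=200, seed=0):
    """Fitness = Gamma(sample) - empirical 95% critical value.

    The critical value is the 95th percentile of Gamma over permuted labels,
    approximating the null distribution without enumerating all n! cases.
    Requires hubert_gamma from the Section 3.1 sketch.
    """
    rng = np.random.default_rng(seed)
    observed = hubert_gamma(X, labels)
    null = [hubert_gamma(X, rng.permutation(labels)) for _ in range(n_perm)]
    return observed - float(np.percentile(null, 95))
```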

5.2. Indexing accuracy

After selecting the features, the next experiment evaluates the indexing accuracy of the selected features. In this experiment, we continue to use the same two categories of images (leopards and birds), with 100 images from each category, so the total number of test images is 200. We employ the k-means algorithm for clustering. We compare the results of using: (1) all 78 features, (2) the 27 features selected by experiment 1 in Table 5, and (3) the 32 (the average number of features selected) highest-ranking features in Table 4. The experimental results are shown in Table 6.

Table 6. Comparison of indexing accuracy.

From Table 6, we conclude that the features selected by the proposed HTGA method produce higher indexing accuracy than using all features without any selection process. The experimental results suggest that the proposed method can produce a better image feature set and thus higher retrieval accuracy.

6. Conclusion

This work presents a feature selection scheme based on the GA for Region-Based Image Retrieval systems. We show that the feature set selected by the proposed feature selection scheme produces higher retrieval accuracy than the one without feature selection. We also show that Hubert's Γ statistic can be used as the fitness measure in the evolution of the HTGA. The experimental results also suggest that the proposed method can select a smaller image feature set and produce higher retrieval accuracy.

Acknowledgements

This work was partially supported by the National Science Council, Taiwan, under Grants No. NSC99-2221-E-011-124, NSC98-2631-H-211-001, and NSC99-2631-H-211-001.

References

Bel Mufti, G., Bertrand, P., & El Moubarki, L. (2005). Determining the number of groups from cluster stability. In Proceedings of ASMDA (pp. 404-414).
Carson, C., Belongie, S., Greenspan, H., & Malik, J. (2002). Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), 1026-1038.
Chou, C. H., Su, M. C., & Lai, E. (2003). A new cluster validity measure for clusters with different densities. In 2003 IASTED international conference on intelligent systems and control, Salzburg (pp. 276-281).
Dubes, R. C. (1993). Clustering analysis and related issues. In Handbook of pattern recognition and computer vision (2nd ed., pp. 3-32). World Scientific.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2002). Cluster validity methods: Part I. SIGMOD Record, 31(2).
Hanza, M. H. (2003). On cluster validity and the information need of users. In The 3rd IASTED international conference on artificial intelligence and applications (AIA03), Spain (pp. 216-221).
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
Li, F., Dai, Q., Xu, W., & Er, G. (2008). Multilabel neighborhood propagation for region-based image retrieval. IEEE Transactions on Multimedia, 10(8), 1592-1604.
Liu, M., Jiang, X., & Kot, A. C. (2009). A multi-prototype clustering algorithm. Pattern Recognition, 42(5), 689-698.
Ma, W. Y., & Manjunath, B. (1997). Netra: A toolbox for navigating large image databases. In Proceedings of the IEEE international conference on image processing (pp. 568-571).
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2002). Numerical recipes in C++. Cambridge University Press.
Santos, J. M., Marques de Sa, J., & Alexandre, L. A. (2008). LEGClust—A clustering algorithm based on layered entropic subgraphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 62-75.
Tsai, J. T., Liu, T. K., & Chou, J. H. (2004). Hybrid Taguchi-genetic algorithm for global numerical optimization. IEEE Transactions on Evolutionary Computation, 8(4), 365-377.
Wang, J. Z. (2001). Integrated region-based image retrieval. Kluwer Academic Publishers.
Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 947-963.