Journal of China University of Mining & Technology, Dec. 2007, 17(4): 473–478

Research into a Feature Selection Method for Hyperspectral Imagery Using PSO and SVM

YANG Hua-chao, ZHANG Shu-bi, DENG Ka-zhong, DU Pei-jun
School of Environment & Spatial Informatics, China University of Mining & Technology, Xuzhou, Jiangsu 221008, China

Abstract: Classification and recognition of hyperspectral remote sensing images is not the same as that of conventional multi-spectral remote sensing images. We propose a novel feature selection and classification method for hyperspectral images that combines the global optimization ability of the particle swarm optimization (PSO) algorithm with the superior classification performance of a support vector machine (SVM). The global search performance of PSO is improved by a chaotic optimization search technique, and a granularity-based grid search strategy is used to optimize the SVM model parameters. Parameter optimization and classification of the SVM are addressed using the training data corresponding to the feature subset, and the false classification rate is adopted as the fitness function. Tests of feature selection and classification are carried out on a hyperspectral data set, and classification performance is compared with that of feature extraction methods in common use today. Results indicate that this hybrid method has a higher classification accuracy and can effectively extract optimal bands, providing a feasible approach for feature selection and classification of hyperspectral image data.

Key words: hyperspectral remote sensing; particle swarm optimization; support vector machine; feature extraction
CLC number: TG 156

Received 04 January 2007; accepted 06 May 2007
Project 40401038 supported by the National Natural Science Foundation of China
Corresponding author. Tel: +86-516-82175045; E-mail address: [email protected]

1 Introduction

Hyperspectral remote sensing makes it possible to obtain images with high spectral resolution, which provides an important basis for quantitative remote sensing. However, processing hyperspectral remote sensing images is a difficult task, because the increased spectral resolution comes with a large number of narrow, contiguous bands over the existing spectral range. There are further limitations on classification and recognition in spectral space, because it is hard to construct a standard spectral library and spectral matching techniques are not yet mature. Classification of hyperspectral images is therefore usually addressed in feature space using statistical pattern recognition methods. In a pattern recognition system, dimensionality reduction is an essential step, especially for hyperspectral data with tens or hundreds of spectral bands. There are two main approaches to dimensionality reduction: feature selection and feature extraction[1]. Feature selection chooses a certain number of bands according to classification requirements and a predetermined selection strategy. Feature extraction applies a coordinate transformation to the original high dimensional feature space and then chooses a proper subspace as the feature space. In contrast to feature selection, feature extraction based dimensionality reduction can lose part of the radiated or reflected information of the objects contained in the original bands, which is disadvantageous for hyperspectral remote sensing image classification. Feature selection methods, on the other hand, retain all of the original band information but are difficult to realize, because there are so many bands and strong correlations exist between them. Consequently, many feature selection methods cannot be applied effectively, either because of their heavy computational burden or because of the strong correlations between features[1–4]. In this paper we propose a novel feature selection and classification method for hyperspectral images that combines the global optimization ability of the particle swarm optimization (PSO) algorithm with the superior classification performance of the support vector machine (SVM).


Global optimal search performance of PSO is improved by using a chaotic optimization search technique, and a granularity-based grid search strategy is used to optimize the SVM model parameters. Feature selection is carried out using binary PSO (BPSO). Tests of feature selection and classification are carried out on a hyperspectral data set, and classification performance is compared with that of feature extraction methods in common use today. Results indicate that this hybrid method has higher classification accuracy and can effectively extract optimal bands, providing a feasible approach for feature selection and classification of hyperspectral image data.

2 PSO Algorithm

2.1 Standard PSO (SPSO) [5]

Initially, a population of n particles is randomly generated in the particle swarm optimization algorithm. Each particle represents a potential solution and has a position represented by a position vector. Particle swarms have two primary operators: velocity update and position update. During each generation each particle is accelerated toward its previous best position and toward the global best position. At each iteration, a new velocity value for each particle is calculated based on its current velocity, the distance from its previous best position and the distance from the global best position. The new velocity value is then used to calculate the next position of the particle. In the d dimensional search space, we assume that x_i = (x_{i1}, x_{i2}, ..., x_{id})^T denotes the current position of the ith particle, v_i = (v_{i1}, v_{i2}, ..., v_{id})^T denotes the flight velocity of the ith particle, y_i = (y_{i1}, y_{i2}, ..., y_{id})^T represents the best position it has reached previously and g = (g_1, g_2, ..., g_d)^T denotes the global best position, i.e., the current optimal solution in the population. Each particle updates its own velocity and position according to the following equations:

$$\begin{cases} v_{id}^{k+1} = w^k v_{id}^k + b_1 r_1 \left( p_{id}^k - x_{id}^k \right) + b_2 r_2 \left( p_{gd}^k - x_{id}^k \right) \\ x_{id}^{k+1} = x_{id}^k + a v_{id}^{k+1} \end{cases} \quad (1)$$

where k denotes the kth iteration; i = 1, 2, …, n; b_1 and b_2 are learning factors, here b_1 = b_2 = 2; r_1 and r_2 are positive random numbers drawn from a uniform distribution between 0.0 and 1.0; a is the constraint factor used to control the velocity weight; x_{id} is the current position of individual i; v_{id} is the current velocity of individual i; p_{id} is individual i's best position found so far and p_{gd} is the best neighborhood position found so far; w is the inertia weight, typically reduced linearly at each iteration. In the search space, particles track the individual best values and the global best value. This process is iterated a set number of times, or until a minimum error is achieved.
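For concreteness, equation (1) can be realized in a few lines; the following is a minimal Python sketch (the vectorized shapes, function name and v_max clipping are our assumptions, not the authors' implementation):

```python
import numpy as np

def spso_step(x, v, p_best, g_best, w, b1=2.0, b2=2.0, a=1.0, v_max=200.0):
    """One SPSO iteration per equation (1).
    x, v: (n, d) arrays of particle positions and velocities;
    p_best: (n, d) individual best positions; g_best: (d,) global best."""
    n, d = x.shape
    r1 = np.random.rand(n, d)   # uniform random numbers r1, r2 in [0, 1]
    r2 = np.random.rand(n, d)
    v_new = w * v + b1 * r1 * (p_best - x) + b2 * r2 * (g_best - x)
    v_new = np.clip(v_new, -v_max, v_max)   # bound the velocity by v_max
    x_new = x + a * v_new                   # a is the constraint factor
    return x_new, v_new
```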

2.2 Binary PSO (BPSO)

The PSO algorithm above is the standard PSO, i.e., each dimension of a particle can only take real values, so it is hard to apply to discrete optimization problems such as feature selection for hyperspectral imagery. The BPSO algorithm adopts a binary encoding, in which x_i and p_i are limited to 1 or 0 in each dimension; this limitation is not applied to the particle velocity. The formulation of velocity updating is unchanged from equation (1), but the position is updated through a sigmoid function of the velocity:

$$s(v_{id}) = \frac{1}{1 + \exp(-v_{id})} \quad (2)$$

$$x_{id}^k = \begin{cases} 1, & \text{rand}() < s(v_{id}^k) \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

where rand() is a random number drawn uniformly from [0.0, 1.0]. The maximal velocity v_max can be used to keep the probability s(v_{id}) from saturating at 0 or 1.
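A minimal sketch of the binary position update of equations (2) and (3); the velocity update itself is the same as in the SPSO step above:

```python
import numpy as np

def bpso_position_update(v):
    """Binary position update per equations (2)-(3): bit x_id is set to 1
    with probability s(v_id) = 1 / (1 + exp(-v_id)), otherwise to 0."""
    s = 1.0 / (1.0 + np.exp(-v))                        # equation (2)
    return (np.random.rand(*v.shape) < s).astype(int)   # equation (3)
```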

2.3 Improved PSO algorithm

The value of the inertia weight w is an important parameter affecting the convergence of the PSO algorithm: it controls how much the previous velocity affects the current velocity. Previous studies indicate that a relatively large value of w at the early stage of evolution improves the global search ability, while gradually reducing w during evolution then yields an accurate solution. In the PSO algorithm, the global best particle reflects the social learning ability of the population: it drags all particles toward its position, and the population eventually converges there. This convergence ability is gained at the expense of robustness. Performance of the PSO algorithm can therefore be greatly improved by a limited mutation operation on this particle. We introduced a chaotic optimization search strategy into the PSO algorithm[6]. The main idea is that a chaotic optimization search with a predetermined number of steps is carried out on the best particle in the population, in order to remedy the disadvantages of slow convergence and of easily falling into a local minimum. The optimization variables of the best particle are thus perturbed by a chaotic search with small probability in the early stages of the evolutionary process and with a probability approaching 1 at later stages. The probability γ_t of applying the chaotic search adjusts itself adaptively according to γ_t = 1 − 1/(1 + ln t), where t is the number of iterations.
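To make this concrete, the sketch below perturbs the global best particle with a logistic-map chaotic sequence, applied with probability γ_t = 1 − 1/(1 + ln t). The logistic map, step count and perturbation scale are our own illustrative choices (the paper does not specify the chaotic generator), and the sketch works on a continuous position vector; in the binary case the perturbation would act on the velocities instead:

```python
import numpy as np

def chaotic_search(g_best, fitness, t, n_steps=20, scale=0.1):
    """Chaotic local search around the global best particle, applied
    with adaptive probability gamma_t = 1 - 1/(1 + ln t)."""
    gamma_t = 1.0 - 1.0 / (1.0 + np.log(t))   # small early, near 1 late
    if np.random.rand() >= gamma_t:
        return g_best
    best, best_fit = g_best.copy(), fitness(g_best)
    z = np.random.uniform(0.01, 0.99, size=g_best.shape)  # chaotic state
    for _ in range(n_steps):
        z = 4.0 * z * (1.0 - z)                  # logistic map (mu = 4)
        cand = best + scale * (2.0 * z - 1.0)    # small perturbation
        f = fitness(cand)
        if f < best_fit:                         # fitness is minimized
            best, best_fit = cand.copy(), f
    return best
```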


3 Multi-class SVM and Parameter Optimization

SVM is an effective machine learning method based on structural risk minimization (SRM). Its main idea is that, while keeping the empirical risk small, the upper bound of the confidence risk is minimized. The SVM maps the input vector into a high-dimensional feature space and constructs the optimal separating hyperplane in that space. It learns a separating hyperplane that maximizes the boundary margin, which produces good generalization ability and avoids over-fitting[7–11]. The radial basis function (RBF) K(x_i, x_j) = exp(−γ||x_i − x_j||²) is selected as the SVM kernel function. When training the SVM classifier, two important parameters, the penalty value C and the kernel parameter γ, must be predetermined using the training data. These two parameters are the main factors affecting the classification accuracy of the SVM, and different parameter values can lead to very different classification performance. Conventional methods for selecting the two parameters mainly include the leave-one-out (LOO) method and the cross-validation (CV) method. Unfortunately, the LOO method usually searches for the optimal model parameters in an approximate region of the parameter space predetermined by personal experience, which leads to a heavy computational burden. Moreover, the CV method always requires a large number of training samples and uses an exhaustive search with high computational complexity[11–14]. We used a new granularity-based grid search strategy to find the optimal SVM model parameters. Its basic idea is to create an M×N parameter grid from different values of C and γ, train an SVM model for each grid cell, and select the parameter combination with the strongest generalization ability. This process is repeated, creating a finer parameter grid around the combination found in the previous step, until the change in generalization performance is small or the maximum number of iterations is reached.

Remote sensing image classification is a multi-class problem, whereas the SVM was originally developed for binary classification. Classification of data into more than two classes, called multi-class classification, is what remote sensing applications such as land use/land cover mapping require. A number of methods have been proposed to build a multi-class SVM from binary classifiers, such as 1VSA, ECOC, 1VS1, MOC and DDAG[1]. We adopted the DDAG method, which uses a Directed Acyclic Graph structure. For a k class problem, k(k−1)/2 binary classifiers are created, forming k(k−1)/2 nodes in the graph. The nodes are organized as a triangle, with a single root node at the top and one additional node at each successive level, down to the last level with k−1 nodes. The DDAG evaluates input data starting at the root node and moves to the next level according to the output value; the binary classifier at that node then evaluates the input data[4]. The advantages of the DDAG method are that its results can be analyzed and that its testing speed is rapid[8].
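The coarse-to-fine grid search can be sketched as follows, here with scikit-learn's SVC and cross-validated accuracy as the generalization measure; the grid sizes, shrink factor, number of rounds and the lower bound 0.01 for γ (geometric spacing cannot include 0) are illustrative assumptions rather than the authors' exact settings:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_search_svm(X, y, c_range=(1.0, 1000.0), g_range=(0.01, 20.0),
                    m=5, n=5, n_rounds=3):
    """Granularity-based (coarse-to-fine) grid search over (C, gamma).
    Each round lays an m x n grid over the current ranges, keeps the
    best cell, then shrinks the ranges around it."""
    best_c, best_g, best_score = None, None, -np.inf
    for _ in range(n_rounds):
        for C in np.geomspace(c_range[0], c_range[1], m):
            for gamma in np.geomspace(g_range[0], g_range[1], n):
                score = cross_val_score(
                    SVC(C=C, gamma=gamma, kernel='rbf'), X, y, cv=5).mean()
                if score > best_score:
                    best_c, best_g, best_score = C, gamma, score
        # refine the grid around the best combination found so far
        c_range = (best_c / 3.0, best_c * 3.0)
        g_range = (best_g / 3.0, best_g * 3.0)
    return best_c, best_g, best_score
```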

4 Feature Selection and Classification Method Based on BPSO and SVM

The basic steps of the proposed BPSO and SVM based feature selection and classification model for hyperspectral remote sensing imagery are as follows:

Step 1: Selection of training samples. According to the number of classes to be classified, training and test data sets are constructed by selecting a certain number of training and testing samples for each class;

Step 2: Encoding. Considering the nature of feature selection for hyperspectral imagery, each particle is encoded as a one dimensional discrete binary variable whose length equals the number of features, i.e., the number of original bands in the hyperspectral image data set. Each dimension corresponds to a feature to be selected: the ith feature is selected if the value of the ith bit equals 1, and hidden otherwise (Fig. 1). Each particle thus corresponds to a selected feature subset;

Fig. 1  Encoding of a 1D discrete binary variable of a particle (a bit string, e.g. 1 0 1 1 0 0 0 1 … 1, whose length is the number of original bands; 1 = band selected, 0 = band hidden)
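In code, this encoding simply masks columns of the data matrix; a minimal sketch (zero-based band indexing assumed):

```python
import numpy as np

def select_bands(X, particle):
    """Keep only the bands whose bit equals 1 in the particle.
    X: (n_pixels, n_bands) data matrix; particle: 0/1 vector of
    length n_bands (the encoding of Fig. 1)."""
    mask = np.asarray(particle, dtype=bool)
    return X[:, mask]   # the feature subset encoded by this particle
```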

Step 3: Parameter initialization. The initial population, initial positions x_i(0) and initial velocities v_i(0) are generated randomly. The learning factors b_1 and b_2, the inertia weight w(0), the maximum number of iterations and the maximum velocity v_max of the PSO algorithm are assigned in advance;

Step 4: Fitness evaluation. The most important task is to evaluate each particle with a fitness function; the choice of this fitness function is a crucial point in using the PSO algorithm, since it determines what the PSO optimizes. In general, the purpose of feature subset selection is to achieve the same or improved classification performance, so the fitness function is chosen to reflect classification performance directly:

fitness = 1 − accuracy    (6)


where accuracy denotes the overall classification accuracy obtained for each individual of the population using ten-fold cross-validation with SVM classifiers; this value guides the optimization of the particles. The task of the BPSO algorithm is thus to find the global minimum of the fitness function;

Step 5: Updating the best values. If the individual's current fitness, fitness(x_i(t)), is better than its previous best, fitness(p_i(t−1)), then the individual best value is updated as p_i(t) = x_i(t); otherwise p_i(t) = p_i(t−1), where t is the number of iterations. The global best value is updated as g(t) = p_k(t), where fitness(p_k(t)) = min{fitness(p_1(t)), …, fitness(p_n(t))}. Of all the individual best values, the optimal one is the global best value, and the feature subset corresponding to the global best value is the optimal solution of the population;

Step 6: Update each particle's velocity and position according to equation (1);

Step 7: Calculate the mutation probability γ_t and carry out a chaotic optimization search with probability γ_t;

Step 8: Judge the stop criterion. If the maximum number of generations is reached or no better parameter vector has been found for a long time (about 100 steps), stop; the optimal feature subset is the one determined by the global optimal particle found by the BPSO algorithm. Otherwise, return to step 4 and repeat steps 4 to 8;

Step 9: Classification. Use the training data corresponding to the optimal feature subset obtained in step 8 to train the SVM classifier. The trained SVM classifier is then used to classify the hyperspectral remote sensing data.
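Steps 4 and 5 admit a compact reading, sketched below: each particle's band subset is scored by ten-fold cross-validated SVM accuracy, and fitness = 1 − accuracy (equation (6)) is minimized. The fixed (C, γ) values are taken from Table 5 for illustration; in the paper the SVM parameters are themselves re-optimized by the grid search, and the function names are our own:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(particle, X, y, C=12.647, gamma=1.320):
    """Equation (6): fitness = 1 - accuracy, with accuracy the ten-fold
    cross-validated SVM accuracy on the bands selected by the particle."""
    mask = np.asarray(particle, dtype=bool)
    if not mask.any():
        return 1.0   # empty band subset: worst possible fitness
    acc = cross_val_score(SVC(C=C, gamma=gamma, kernel='rbf'),
                          X[:, mask], y, cv=10).mean()
    return 1.0 - acc

def update_bests(particles, fits, p_best, p_fits, g_best, g_fit):
    """Step 5: track each particle's best position and the global best
    (the minimum-fitness individual best)."""
    for i, f in enumerate(fits):
        if f < p_fits[i]:
            p_best[i], p_fits[i] = particles[i].copy(), f
            if f < g_fit:
                g_best, g_fit = particles[i].copy(), f
    return p_best, p_fits, g_best, g_fit
```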

5 Tests and Results Analysis

Test data acquired from AVIRIS, available at ftp://ftp.ecn.purdue.edu/biehl/MultiSpec/, were used for our demonstration. The spectral resolution and spatial resolution of the test image are 10 nm and 20 m respectively. One of the bands of this image is shown in Fig. 2. The image was acquired on June 12, 1992 over the northern part of Indiana (USA). Two-thirds of the scene is covered with agricultural land while one-third is forest or otherwise occupied. A field survey map consisting of sixteen classes and one unclassified land use was also available (Fig. 3) and has been used as reference data (ground truth) for training, testing and accuracy assessment. Four of the bands do not contain any data. In our experiments, similar to other studies, 20 water absorption bands numbered [104–108], [150–163] and 220 were removed from the original image. In addition, 15 noisy bands [1–3], 103, [109–112], [148–149], [164–165] and [217–219], identified by visual inspection, were also discarded. The numbers of training and testing pixels for each class were randomly selected and are presented in Table 1.

Fig. 2  A single-band AVIRIS image (band 100 [1.3075–1.3172 µm])

Table 1  Training and testing samples (# of pixels)

Class No.  Class name              Training samples  Testing samples
1          Alfalfa                       42                46
2          Corn-no till                 712              1428
3          Corn-min                     423               830
4          Corn                         116               237
5          Grass/pasture                241               483
6          Grass/trees                  362               730
7          Grass/pasture-mowed           27                28
8          Hay-windrowed                268               478
9          Oats                          20                20
10         Soybeans-no till             485               972
11         Soybeans-min                1228              2455
12         Soybeans-clean               296               593
13         Wheat                        104               205
14         Woods                        632              1265
15         Bldg-grass-tree-drives       191               386
16         Stone-steel towers            47                93
           Total                       5194             10249
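For reference, the band-removal preprocessing described above amounts to a simple mask over the 220 original bands (1-based band numbers as listed in the text; the variable names are ours):

```python
import numpy as np

# 20 water-absorption bands and 15 noisy bands (1-based numbering)
water = list(range(104, 109)) + list(range(150, 164)) + [220]
noisy = [1, 2, 3, 103] + list(range(109, 113)) + [148, 149, 164, 165, 217, 218, 219]

keep = np.array([b for b in range(1, 221) if b not in set(water + noisy)])
print(len(keep))        # 185 bands remain
# X_clean = X[:, keep - 1]   # X: (n_pixels, 220); shift to zero-based indices
```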

For the PSO algorithm, 30 particles were generated randomly. The learning factors b_1 and b_2 were set to 2, the inertia weight w was reduced linearly from 0.9 to 0.5, the maximum number of iterations was set to 200, the dimension of the solution space was set to 185 (the number of remaining bands) and the maximum flight velocity of each particle was set to 200. Following a previous study[3], when optimizing the SVM model parameters C and γ, the grid search ranges were limited to [1, 1000] and [0, 20] respectively.

Fig. 3  Reference data


Fig. 4 illustrates the relationship between the global optimal fitness of the particles and the number of iterations for the improved BPSO and the general BPSO algorithm. From Fig. 4 we can see that the improved BPSO algorithm overcomes the tendency of the general BPSO to settle into a local minimum, and it also converges rapidly. The global optimal fitness stabilizes at the 4.9% level after 110 generations. The corresponding nine feature bands are 10, 17, 19, 32, 69, 123, 130, 158 and 179. To further demonstrate the efficiency of our feature selection method, classification tests using the SVM algorithm described in this paper were carried out with both the nine feature bands and the original 185 bands; the resulting confusion matrices are listed in Tables 2 and 3 respectively (note: the larger the diagonal elements of a confusion matrix, the higher the classification accuracy).

Fig. 4  Relationship between global optimal fitness values and number of iterations

Table 2  Confusion matrix for nine feature bands

Class No.     1     2    3    4    5    6    7    8   9    10    11   12   13    14   15   16  Total
1            37     0    0    0    0    0    0    0   0     0     2    0    1     1    2    0     43
2             0  1358    1    0    3    0    0    0   0    57    77    9    0     3    0    0   1508
3             0     0  809    3    7    3    0    3   0     9     1   23    5     0    0    0    863
4             0     1    2  214    2    1    0    0   1     0     0    0    0     1    0    0    222
5             1     0    3    0  447    1    1    3   0     0     0    0    0     0    2    0    458
6             0     0    0    0    0  720    0    0   0     0     1    0    3     1    1    2    728
7             0     1    0    0   13    0   24    5   0     0     0    0    0     1    0    0     44
8             0     0    0    0    0    0    0  459   0     1     0    0    0     0    3    1    464
9             3     0    0    0    4    3    3    0  19     0     0    0    2     1    0    0     35
10            0    33    2    0    0    0    1    2   0   932     0   22    0     0    0    0    992
11            4    45    0    3    2    3    0    0   0     7  2316    0    2     3    0    0   2385
12            0     1   21    0    1    0    1    3   0     3     0  571    0     0    0    0    601
13            0     0    0    5    1    0    0    0   3     0     0    0  198     5    0    0    212
14            1     1    1    0    3    2    2    0   3     0     2    3    1  1197    2    0   1218
15            0     1    0    2    1    2    0    0   1     0     0    1    1     1  369    3    382
16            1     0    0    0    0    1    3    0   1     0     0    0    0     1    0   87     94
Total        47  1441  839  227  484  736   35  475  28  1009  2399  629  213  1215  379   93  10249

Table 3  Confusion matrix for original 185 bands

Class No.     1     2    3    4    5    6    7    8   9    10    11   12   13    14   15   16  Total
1            31     0    2    0    0    3    1    0   0     0     2    0    3     0    4    0     46
2             0  1271    1    1    7    0    0    2   0   142   190    1    0     1    0    0   1616
3             0     0  778    3    5    7    1   11   0     4     1   48    1     0    0    0    859
4             0     1    2  201    2    1    1    0   1     0     0    0    0     1    0    0    210
5             1     0    4    0  406    3    1    3   0     0     0    0    0     0    1    0    419
6             2     0    0    0    0  687    0    0   0     0     3    0    1     1    1    1    696
7             0     3    0    1   29    0   21   23   0     0     0    0    0     1    0    0     78
8             2     0    0    0    0    0    0  418   0     0     0    0    0     0    7    1    427
9             3     0    0    1    0    5    2    0  13     0     0    0    1     1    0    0     26
10            0    98    2    1    0    0    1    2   0   861     0   21    0     0    0    0    986
11            7   125    0    5    4    7    0    7   0     5  2201    0    1     1    0    0   2363
12            0     1   59    0    2    0    1    5   0     7     0  560    0     0    2    0    637
13            0     0    0    4    3    0    2    0   1     0     0    0  198     7    0    0    215
14            2     2    3    0    5    4    2    2   0     1     3    1    1  1164    2    0   1192
15            0     1    1    2    1    2    1    2   0     0     0    1    1     1  361    1    375
16            1     2    0    2    1    0    5    1   1     0     2    0    2     0    3   84    104
Total        49  1504  852  221  465  719   39  476  16  1020  2402  632  209  1178  381   87  10249

Overall classification accuracy is 95.2% for Table 2 and 90.3% for Table 3; the classification accuracy is therefore improved by 4.9% by our proposed feature selection method.

For purposes of comparison, the performance of the proposed algorithm is also compared with that of principal component analysis (PCA), discriminant analysis feature extraction (DAFE) and decision boundary feature extraction (DBFE). The overall classification accuracy of the various feature extraction techniques applied to the AVIRIS data is presented in Table 4.


Table 4  Accuracy comparison

Feature extraction methods  Overall classification accuracy (%)  Kappa coefficient
PCA                         89.4                                 0.78
DAFE                        92.3                                 0.85
DBFE                        90.7                                 0.82
PSO-SVM                     95.2                                 0.94
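The overall accuracy and kappa coefficient reported in Tables 2–5 can be recomputed directly from a confusion matrix; a minimal sketch (applied to Table 2 it reproduces the 95.2% overall accuracy):

```python
import numpy as np

def oa_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a confusion matrix
    (one axis: reference classes; the other: classified classes)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                                   # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    return oa, (oa - pe) / (1.0 - pe)
```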

Table 4 shows that the PSO and SVM based feature selection method improves the overall accuracy by 2.9% to 5.8%. To test the optimization performance of the granularity-based grid search strategy for the SVM model parameters, the LOO method was also run and compared with the proposed method. The LOO method calculates the minimum loss function on the training samples corresponding to the nine feature bands, over the parameter space determined by the value ranges of C and γ, which ensures the minimization of the structural risk; the optimized SVM model parameters are then those attaining the minimum loss function. The optimization performance of the two algorithms is listed in Table 5 (computer configuration: 2.4 GHz CPU, 256 MB memory). Table 5 shows that the classification accuracy of the proposed parameter optimization method is 3.9% higher than that of the LOO method, and its computational time is also lower.


Table 5  Comparison of optimization performance

Optimization methods  C        γ      Overall classification accuracy (%)  Time (s)
LOO                   358.612  9.237  91.3                                  428.7
Granular grid         12.647   1.320  95.2                                  108.4

6 Conclusions

1) The velocity-position search model of the PSO algorithm is operationally simple and has low complexity. By balancing global and local search through the inertia weight, it can reach an optimal solution with high probability. Moreover, the chaotic optimization technique overcomes the tendency to fall easily into a local minimum.

2) The granularity-based grid parameter search strategy improves the SVM classification performance and greatly increases the optimization speed.

3) The proposed feature selection and classification method for hyperspectral images, which combines the global optimization ability of the particle swarm optimization (PSO) algorithm with the superior classification performance of the support vector machine (SVM), is feasible. The results indicate that this hybrid method has higher classification accuracy and can effectively extract optimal bands.

References

[1] Yang Z H, Zhang Y Z, Gong D P, et al. Feature selection in hyperspectral classification based on tabu search algorithm. Hydrographic Surveying and Charting, 2006, 26(4): 11–14. (In Chinese)
[2] Gustavo C V, Luis G C, Javier C M, et al. Robust support vector method for hyperspectral data classification and knowledge discovery. IEEE Transactions on Geoscience and Remote Sensing, 2004, 42(7): 1530–1541.
[3] Zhu G B, Blumberg D G. Classification using ASTER data and SVM algorithms: the case study of Beer Sheva, Israel. Remote Sensing of Environment, 2002, 80(2): 233–240.
[4] Kennedy J, Eberhart R. Particle swarm optimization. In: IEEE International Conference on Neural Networks. Perth: IEEE Neural Networks Society, 1995: 1942–1948.
[5] Liu S S, Hou Z J. Weighted gradient direction based chaos optimization algorithm for nonlinear programming problem. In: Proceedings of the 4th World Congress on Intelligent Control and Automation. Shanghai: East China University of Science and Technology Press, 2002: 1779–1783.
[6] Chen G M, Jia J Y, Huang Q. Study on the strategy of decreasing inertia weight in particle swarm optimization algorithm. Journal of Xi'an Jiaotong University, 2006, 40(1): 53–56. (In Chinese)
[7] Lu B, Wei X K, Bi D Y. Applications of support vector machines in classification. Journal of Image & Graphics, 2005, 10(8): 1029–1035. (In Chinese)
[8] Bengio Y. Gradient-based optimization of hyperparameters. Neural Computation, 2000, 12(8): 1889–1900.
[9] Keerthi S S. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 2002, 13(5): 1225–1229.
[10] Zheng C H, Jiao L C, Zheng G W. Genetic algorithm-based SVM for automatic target classification of remote sensing images. Control and Decision, 2005, 20(11): 1212–1216. (In Chinese)
[11] Xu P, Chan A K. An efficient algorithm on multi-class support vector machine model selection. In: Proceedings of the International Joint Conference on Neural Networks. Portland: IEEE, 2003: 3229–3232.
[12] Chapelle O, Vapnik V, Bousquet O. Choosing multiple parameters for support vector machines. Machine Learning, 2002, 46(1): 131–159.
[13] Landgrebe D. Hyperspectral image data analysis. IEEE Signal Processing Magazine, 2002, 19: 17–28.
[14] Jin W D, Zhang G X, Hu L Z. Radar emitter signal recognition using wavelet packet transform and support vector machines. Journal of Southwest Jiaotong University (English Edition), 2006, 14(1): 15–20.