Knowledge-Based Systems 24 (2011) 40–48
An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine

Shijin Li, Hao Wu, Dingsheng Wan, Jiali Zhu

School of Computer and Information Engineering, Hohai University, No. 1, Xikang Road, Nanjing 210098, China
Article history: Received 15 February 2010; received in revised form 11 July 2010; accepted 13 July 2010; available online 16 July 2010.

Keywords: Hyperspectral remote sensing; Band selection; Conditional mutual information; Support vector machine; Genetic algorithm; Branch and bound algorithm
Abstract

With the development and popularization of remote-sensing imaging technology, there are more and more applications of hyperspectral image classification, such as target detection and land cover investigation. Selecting a minimal and effective subset from the mass of bands is a very challenging issue of urgent importance. This paper proposes a hybrid feature selection strategy based on the genetic algorithm and the support vector machine (GA–SVM), which forms a wrapper that searches for the best combination of bands with higher classification accuracy. In addition, band grouping based on the conditional mutual information between adjacent bands is utilized to cope with the high correlation between bands and to further reduce the computational cost of the genetic algorithm. During the post-processing phase, the branch and bound algorithm is employed to filter out the irrelevant band groups. Experimental results on two benchmark data sets show that the proposed approach is very competitive and effective.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

With the development of remote-sensing imaging technology and hyperspectral sensors, the use of hyperspectral images is becoming more and more widespread, for applications such as target detection and land cover investigation. Due to the dense sampling of the spectral signatures of land covers, hyperspectral images discriminate better among similar ground cover classes than traditional multispectral scanners [19]. At the same time, these images are usually composed of tens or hundreds of close spectral bands, which results in high redundancy and a great amount of computation in hyperspectral image classification. Therefore, the most important and urgent issue is how to greatly reduce the number of bands with little loss of information or classification accuracy.

The commonly used dimension reduction methods fall into two categories: feature selection and feature extraction. Because every band of hyperspectral data has its own corresponding image, feature extraction, in which the high-dimensional feature space is mapped to a low-dimensional space by a linear or nonlinear transformation, cannot preserve the original physical interpretation of the image. Thus feature extraction approaches are not suitable for the dimensionality reduction of hyperspectral images. As the spectral distance between two adjacent bands in hyperspectral data is only about 10 nm and the correlation between them is
so high [21], there is considerable redundancy, which should be largely reduced by feature selection (band selection) methods so that classification efficiency and accuracy can be greatly improved.

According to how they combine with the adopted machine learning algorithm, feature selection approaches are divided into wrappers, filters, and embedded methods [4]. Embedded methods perform feature selection in the process of training and are usually specific to given learning algorithms, such as decision tree classifiers. Filters select subsets of features as a pre-processing step, independent of the chosen classifier [8,22]. Since they only involve the training data, their computational complexity is much lower. However, in the case of highly correlated features, such as hyperspectral image bands, such feature selectors tend to fail, as they only consider individual feature rankings without considering combinations of features. Wrappers utilize the learning machine of interest as a black box to score subsets of features according to their predictive power. Although the search space and complexity of wrappers are larger than those of the former two kinds of methods, they avoid the bias that arises when the evaluation criterion is independent of the learning algorithm. Furthermore, they take the dependency between features into consideration, so the result is more accurate.

In fact, band selection for hyperspectral remote sensing images is a very complex combinatorial optimization problem, as some bands with less information may play an important role in classification. For such a problem, the more effective way to obtain an optimal subset of bands is based on a search strategy
and evaluation criteria. In this paper, a search strategy combining the support vector machine and the genetic algorithm (GA–SVM) is adopted.

In the feature selection literature, there are many search algorithms, such as best individual features, sequential forward search, and sequential forward floating search. However, a hyperspectral image has hundreds of bands, and the efficiency of these search algorithms deteriorates greatly at that scale. Serpico and Bruzzone [27] proposed a new sub-optimal search strategy suitable for hyper-dimensional feature selection, based on the search for constrained local extremes in a discrete binary space. In later work, Serpico and Moser [28] put forward a procedure to extract spectral channels of variable bandwidths and spectral positions from the hyperspectral image so as to optimize the accuracy for a specific classification problem; each spectral channel was obtained by averaging a group of contiguous channels of the original hyperspectral image. The search methods they employed included sequential forward selection, steepest ascent, and a fast constrained search, the last two proposed in Serpico and Bruzzone [27]. Guo et al. [12] found that selection based on the simple criterion of retaining only features with high associated mutual information (MI) values can be problematic when the features are highly correlated, and that the direct way of selecting features by jointly maximizing MI suffers from combinatorial explosion; they therefore proposed a fast feature selection scheme based on a 'greedy' optimization strategy.

To resolve the high correlation between adjacent bands, band grouping and selection have been proposed by several researchers. Sun and Gao [30] proposed separating the original data set into groups based on fuzzy set theory, reducing the dimension by selecting a representative band from each group or using a linear fusion of its bands, and finally finding the band subset with a genetic algorithm and rough set theory. Huang and He [15] presented a feature weighting method for band selection, based on a pairwise separability criterion and matrix coefficient analysis through principal component analysis (PCA). In another method, proposed by Wang et al. [31], all the bands were separated into subspaces according to the correlation between adjacent bands, and the band with the best SVM sensitivity was selected from each subspace. According to Peng et al. [26], it has been recognized in feature selection that a combination of individually good features does not necessarily lead to good classification performance. Jensen and Solberg [16] proposed an optimal constant function representation of hyperspectral signature curves in the mean square sense; their method divided the spectral curves into contiguous regions by piecewise constant function approximations, and the extracted constants were then used as new features. Serpico and Moser [28], as described above, extracted variable-bandwidth spectral channels by averaging groups of contiguous channels. Sotoca and Pla [29] presented a band selection method using correlation among bands based on MI, in which the relationship among bands was represented by means of a transformation matrix.
A process based on deterministic annealing (DA) optimization was then applied to reduce this matrix, looking for image bands that are as uncorrelated as possible. Huang et al. [14] proposed a hybrid genetic algorithm for feature selection consisting of local and global searches: mutual information was used as an independent measure for feature ranking in the local search, while the mutual information between the predicted labels of a trained classifier and the true classes was used as the fitness function in the global search. Bazi and Melgani [2] put forward a GA–SVM based system to optimize SVM classification accuracy for hyperspectral imagery. The authors proposed to determine the best SVM parameters and detect the most discriminative features (bands) in a completely automatic way, exploiting fitness criteria intrinsically related to the generalization capabilities of SVM classifiers, including the simple support vector count and the radius-margin bound. Comparisons with other feature selection methods showed that their method is one of the state-of-the-art approaches in the GA–SVM family of algorithms [10,11,33–36]. Tan et al. [33] proposed a GA-based framework for feature subset selection that combined various existing feature selection methods, in which the existing algorithms provided candidate features for the GA. Estévez et al. [9] first proposed a filter method of feature selection based on normalized mutual information. Their method was an enhancement of Battiti's mutual-information-based feature selection (MIFS) [1], but its performance degraded in problems where groups of features were relevant; hence a second method, the genetic algorithm guided by mutual information for feature selection (GAMIFS), was proposed, a hybrid filter/wrapper method that combined a GA with their first method.

Since there are hundreds of bands in hyperspectral imagery, the search space for a GA running directly on the original band space is huge. Our method differs from the previous works [12,31,29,14,2,33,9] in three aspects. First, conditional mutual information is employed to partition the bands into disjoint subspaces, yielding an irredundant set of bands and reducing the search space at the same time. Second, GA–SVM is adopted to search for the optimal combination of bands. Lastly, the branch and bound (BB) search algorithm is used to prune the irrelevant bands from the result of GA–SVM, so that the minimal set of bands relevant to the classification task is obtained. The flowchart of our system is illustrated in Fig. 1. Our system not only reduces the search space of the genetic algorithm and the computational complexity, but also avoids the error that results from eliminating bands based only on the information of a single band. In this way, we can further remove irrelevant and ineffective band groups with the branch and bound algorithm so that the classification accuracy is improved.

The rest of this paper is organized as follows: Section 2 presents the band grouping method based on conditional mutual information. Section 3 then describes the combined support vector machine and genetic algorithm (GA–SVM) search adopted in this paper, and Section 4 introduces the branch and bound algorithm. Experimental results and a comprehensive comparison with other state-of-the-art methods are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. Initial band grouping based on conditional mutual information

2.1. Mutual information and conditional mutual information

Entropy is a measure of the uncertainty of random variables and a frequently used evaluation criterion in feature selection [18]. If a discrete random variable X takes values in an alphabet U with probability density function p(x), x ∈ U, the entropy of X is defined as:
$$H(X) = -\sum_{x \in U} p(x)\log p(x) \qquad (1)$$
For two discrete random variables X and Y with alphabets U and W and joint probability density function p(x, y), x ∈ U, y ∈ W, the joint entropy of X and Y is defined as:
$$H(X,Y) = -\sum_{x \in U}\sum_{y \in W} p(x,y)\log p(x,y) \qquad (2)$$
When certain variables Y are known and others X are not, the remaining uncertainty is measured by the conditional entropy:
[Fig. 1. The overall structure of the proposed method: hyperspectral image data → preprocessing → band grouping → band selection by GA–SVM → adaptive branch and bound (ABB) pruning → GA–SVM → classification result.]
$$H(X|Y) = -\sum_{x \in U}\sum_{y \in W} p(x,y)\log p(x|y) \qquad (3)$$
The mutual information is usually used to measure the correlation between two random variables and it is defined as [24]:
$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) \qquad (4)$$
If we assume that two features are represented by two discrete variables X and Y and the class is represented by a discrete variable C, then, given Y, the conditional mutual information of X and C is defined as:
$$I(C;X|Y) = H(X|Y) - H(X|C,Y) = I(C;X,Y) - I(C;Y) = I(C;X) - \left[I(X;Y) - I(X;Y|C)\right] \qquad (5)$$
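As a concrete illustration of these quantities, the following minimal Python sketch (our own, not part of the original paper) estimates Eqs. (1)–(5) from sample frequencies, assuming the band values have already been quantized to a small integer alphabet, e.g. by histogram binning:

```python
import numpy as np

def entropy(*variables):
    """Joint entropy H(X1, ..., Xn) of discrete variables, Eqs. (1)-(2),
    estimated from sample frequencies."""
    joint = np.stack(variables, axis=1)                 # one row per sample
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), Eq. (4)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mi(c, x, y):
    """I(C;X|Y) = H(X|Y) - H(X|C,Y), Eq. (5), expanded into joint entropies:
    H(X,Y) + H(C,Y) - H(Y) - H(X,C,Y)."""
    return entropy(x, y) + entropy(c, y) - entropy(y) - entropy(x, c, y)

# Usage sketch with synthetic quantized bands and class labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 7, size=5000)                  # class variable C
band_k = (labels * 2 + rng.integers(0, 4, 5000)) % 16   # quantized band k
band_k1 = (band_k + rng.integers(0, 2, 5000)) % 16      # correlated neighbour k+1
print(mutual_information(band_k, band_k1))
print(conditional_mi(labels, band_k, band_k1))
```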
2.2. Band grouping by analyzing conditional mutual information

For the task of hyperspectral image classification, the basic principle of band grouping is that adjacent bands with high correlation should be merged into one group, while bands with little mutual redundancy should be placed in different groups. A representative band can then be selected from each group. In this way, the loss of useful information is minimized and the reduction of redundant information is maximized.

Both MI and conditional MI can measure the correlation between adjacent bands. The redundancy between two bands is greater when the MI value is larger, and the information the bands provide for classification is greater when the conditional MI value is smaller. While MI only expresses the relationship between the bands themselves, conditional MI expresses both the correlation between the bands and the relationship between the class and the features (bands). Based on this analysis, we choose conditional MI as the basis of band grouping.

Here we take the "Washington DC Mall" data set from the HYDICE spectrometer as an example to illustrate the band grouping method. There are 191 bands in this data set after discarding the water absorption bands. Fig. 2 shows the conditional MI between adjacent bands; the local maxima points are marked with red diamonds. During band grouping based on conditional MI, the basic principle is that the bands are divided into groups at these local maxima points. In this paper, the local maxima points are extracted manually, though they can also be obtained automatically by comparing the neighborhoods of every point. Table 1 shows the grouping result.

From Table 1, it can be observed that there are usually tens of bands in each group, but the first and last groups contain 37 and 51 bands, respectively. In this paper, we divide the first band group into two groups and the eighth into three, in order to avoid omitting information important for discrimination, as there are too many bands in these two groups. Table 2 gives the grouping result
with the additional groups. Owing to the pruning by the BB algorithm described in Section 4, no redundant or irrelevant band group remains in the final result.
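As noted above, the local maxima can also be extracted automatically by comparing the neighborhoods of every point. The following sketch illustrates one such automated grouping; the `min_gap` guard against degenerate one-band groups is our own assumption, not specified in the paper:

```python
def group_bands(cmi, min_gap=2):
    """Split N bands into contiguous groups at local maxima of the
    conditional MI curve between adjacent bands (cf. Fig. 2).
    cmi[k] is the conditional MI between bands k and k+1, k = 0..N-2."""
    n_bands = len(cmi) + 1
    # Local maxima: points strictly higher than both neighbours.
    cuts = [k for k in range(1, len(cmi) - 1)
            if cmi[k] > cmi[k - 1] and cmi[k] > cmi[k + 1]]
    groups, start = [], 0
    for c in cuts:
        if c + 1 - start >= min_gap:          # skip cuts that would leave a tiny group
            groups.append(list(range(start, c + 1)))
            start = c + 1
    groups.append(list(range(start, n_bands)))  # trailing group up to the last band
    return groups

# On a curve like Fig. 2, this would reproduce groupings of the kind in Table 1.
```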
3. Search algorithm based on the combination of SVM and GA (GA–SVM)

The search space of a traditional band selection algorithm based on the genetic algorithm (GA) is too large to search, because it is the entire band space. In this paper, the search space is much smaller than in the traditional method because we choose only one band from each band group organized as described in Section 2. As stated in [25], a simple GA sometimes provides solutions inferior or merely comparable to classical heuristic algorithms; a practical and effective way of overcoming this limitation is to hybridize the GA by incorporating domain-specific knowledge. In this paper, the special characteristics of hyperspectral bands serve as the domain knowledge, and we have divided the whole mass of bands into several groups by analyzing the corresponding conditional MI.

Assume there are N bands in the hyperspectral image, divided into K groups. We do not choose the common binary coding as the genetic coding mode: with binary coding, we could not ensure that exactly one band from each band group remains in the band combination after mutation and crossover. Accordingly, we use integer coding in our algorithm and define corresponding mutation and crossover operators to satisfy the requirements of band selection under the prior band grouping. More specifically, a string of K integers is used. Let C denote a chromosome, C = (c_1, c_2, ..., c_K). Each c_i takes a value from {B_{i1}, ..., B_{iK_i}}, where B_{ij} is the jth band number in the ith band group, j = 1, 2, ..., K_i, and K_i is the number of bands in the ith band group.

As far as the crossover operator is concerned, we use the standard single-point crossover operation. The mutation operation is a bit different. Since our coding scheme is integer-valued, we cannot use the 0-1 flip operator of binary coding. Instead, we randomly choose one c_i from the offspring of the crossover operation and generate a random value in [0, 1]; if this value is smaller than the user-defined mutation probability, c_i is changed to another integer in the ith band group. For example, if c_i = 65 and the ith band group is [56, 78], then c_i is randomly set to another integer in [56, 78]. The whole procedure of GA–SVM is similar to those in [2,36]; the differences are that we adopt a different chromosome encoding scheme, search a different space, and modify the mutation operator according to the integer encoding mechanism. In addition, another key element of the genetic algorithm is the design of the fitness function.
[Fig. 2. The conditional MI of Washington DC Mall data set (x-axis: band number, 1–191; y-axis: conditional MI, 0–0.4; local maxima marked with red diamonds).]
Table 1
Initial band grouping for Washington DC Mall data set.

Groups   1      2       3       4       5        6         7         8
Bands    1–37   38–56   57–72   73–86   87–102   103–133   134–140   141–191
Table 2
Additional band grouping for Washington DC Mall data set.

Groups   1      2       3       4       5       6        7         8         9         10        11
Bands    1–18   19–37   38–56   57–72   73–86   87–102   103–133   134–140   141–158   159–175   176–191
The traditional band selection algorithm based on a GA requires the designer to pay attention not only to the classification accuracy but also to the size of the band subset. In the method of this paper, the group number K is the final band number; hence we use only the classification accuracy as the fitness function. In this way, we can both control the complexity of the algorithm gracefully and obtain the band combination with the highest accuracy. The adopted classifier is SVM, as it is one of the most competitive classifiers for small-sample problems. Since it is very common nowadays, we omit its principles here and use the LibSVM [5] implementation. The kernel function is a radial basis function (RBF), and the two SVM parameters (i.e., C and γ) are selected by 5-fold cross-validation during the training phase. For the GA, the population size is 30, the mutation probability is 0.2, the crossover probability is 0.8, and the maximum number of generations is 100.
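To make the integer encoding, the modified operators, and the accuracy-only fitness concrete, here is a minimal sketch of the GA–SVM wrapper. It assumes scikit-learn's `SVC` (which wraps LibSVM) as the classifier and numpy arrays `X`, `y`; the selection scheme (elitism plus truncation selection) is our own simplification, since the paper does not specify one:

```python
import random
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ga_svm(X, y, groups, pop_size=30, generations=100, p_cx=0.8, p_mut=0.2, seed=0):
    """Integer-coded GA: a chromosome C = (c_1, ..., c_K) picks exactly one
    band from each of the K groups; fitness is cross-validated SVM accuracy."""
    rng = random.Random(seed)

    def fitness(chrom):
        clf = SVC(kernel="rbf", C=1.0, gamma=1.0)   # fixed C = gamma = 1 during the search
        return cross_val_score(clf, X[:, chrom], y, cv=5).mean()

    population = [[rng.choice(g) for g in groups] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]                        # elitism (a simplification)
        while len(next_gen) < pop_size:
            a, b = rng.sample(ranked[:pop_size // 2], 2)
            if rng.random() < p_cx:                  # standard single-point crossover
                cut = rng.randrange(1, len(groups))
                child = a[:cut] + b[cut:]
            else:
                child = list(a)
            if rng.random() < p_mut:                 # integer mutation: resample one gene
                i = rng.randrange(len(groups))       # within its own band group
                alternatives = [band for band in groups[i] if band != child[i]]
                if alternatives:
                    child[i] = rng.choice(alternatives)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)

# Usage sketch with the Table 1 grouping (0-based band indices):
# groups = [range(0, 37), range(37, 56), range(56, 72), range(72, 86),
#           range(86, 102), range(102, 133), range(133, 140), range(140, 191)]
# best_bands = ga_svm(X, y, groups)
```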
4. Band pruning with the branch and bound algorithm

With regard to feature redundancy and relevance, band grouping handles the redundancy among adjacent bands, and GA–SVM searches for the relatively best combination of irredundant bands, one from each group. However, we do not know whether the obtained band combination has the minimum number of bands. Hence we propose to use the branch and bound (BB) algorithm to further search for the band combination with the minimal number of bands.

The only optimal feature selection algorithms are exhaustive search and the branch and bound (BB) algorithm and its variants [13]. An exhaustive search finds the best subset of features by evaluating a criterion function for all possible feature combinations
and selecting the best subset with the maximum criterion value. The number of possible feature sets that must be searched grows excessively large as the dimensionality of the original feature space increases; an exhaustive search is thus only applicable to low-dimensional problems (in hyperspectral image classification, the band combination produced by state-of-the-art methods usually has fewer than 30 bands, as in Serpico and Moser [28] and Benediktsson et al. [3]). The BB algorithm (and its variations) is more efficient than an exhaustive search because it avoids exploring the entire search space: the search space is organized so that many subsets guaranteed to be sub-optimal are rejected. Many modified versions of the BB algorithm have been proposed to improve its speed by reducing redundant criterion evaluations.

Since the conditional MI only considers the correlation between bands, GA–SVM searches for the best combination of bands with little redundant information. From the point of view of feature selection with minimum redundancy and maximum relevance [26], we still need to search for the minimal set of bands with maximum relevance. Hence we use the adaptive branch and bound (ABB) algorithm proposed in Nakariyakul and Casasent [23] as the pruning method to find the minimal band combination with maximal relevance. Note that all BB algorithms are optimal and yield the same optimal feature subset.

More specifically, we use the BB algorithm to prune one band at a time, and an SVM classifier is trained on the remaining bands to classify the test set. After all the bands have been pruned or the minimum number of bands is reached, the band combination with the highest classification accuracy is chosen as the final result. In this paper we stop the pruning process when three bands are left, since fewer bands cannot fulfill the classification task with satisfactory accuracy.
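The following sketch illustrates this pruning loop. For brevity it substitutes greedy backward elimination (at each step, drop whichever single band hurts test accuracy least) for the full adaptive branch and bound search of [23], and again uses scikit-learn's `SVC` in place of LibSVM:

```python
from sklearn.svm import SVC

def prune_bands(X_train, y_train, X_test, y_test, bands, min_bands=3):
    """Prune one band at a time from the GA-SVM result, retraining an SVM
    on the remaining bands, and return the subset with the best test
    accuracy (cf. Fig. 3). A greedy stand-in for the ABB search of [23]."""
    def accuracy(subset):
        clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
        clf.fit(X_train[:, subset], y_train)
        return clf.score(X_test[:, subset], y_test)

    current = list(bands)
    best_subset, best_acc = current[:], accuracy(current)
    while len(current) > min_bands:
        # Evaluate every one-band removal and keep the least harmful one.
        acc, drop = max((accuracy([b for b in current if b != d]), d) for d in current)
        current = [b for b in current if b != drop]
        if acc > best_acc:
            best_subset, best_acc = current[:], acc
    return best_subset, best_acc
```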
[Fig. 3. The fitness (overall accuracy, %) of each band combination with different numbers of bands (3–11) after pruning; y-axis range 94–98.]
Table 3
Final band grouping for Washington DC Mall data set after pruning.

Groups   1      2       3       4       5        6         7
Bands    1–37   38–56   57–72   73–86   87–102   103–133   134–191
Again, we take the "Washington DC Mall" data set from the HYDICE spectrometer as an illustrative example. Fig. 3 shows the overall accuracy of each band combination based on the SVM classifier after 1–8 bands have been pruned, respectively. From Fig. 3, we find that the band combination with seven bands has the largest fitness, so we select this combination as the final result. In this combination, not only the 2nd, 9th and 10th groups, which were added to avoid losing important information, but also the 8th group are cut out (the group numbers refer to Table 2). As can be seen in Fig. 2, there are fewer bands in the 8th group than in the others and the conditional MI between these bands is not very consistent; accordingly, this group is merged into the 9th group. Table 3 shows the final band grouping result after pruning by the BB algorithm.

5. Experimental results and analysis

In this paper, our experimental analysis is conducted on two well-known benchmark data sets, the Washington DC Mall data set and the Indian Pine data set. In the former, 210 bands were collected in the 0.4–2.4 μm region of the visible and infrared spectrum. The water absorption bands are then deleted, resulting in
Table 4
The training/test samples of Washington DC Mall data set.

Class        Training   Test
1. Grass       1849     1985
2. Roofs        184      232
3. Road          72      103
4. Water       1114      814
5. Trees        136      269
6. Trail        999      225
7. Shadow        74       23
191 channels. The data set contains 1280 scan lines with 307 pixels in each line and has been studied extensively by many research groups [19,3,7,17]. The Washington DC Mall data set is available on the student CD-ROM of Landgrebe [19]. The second data set is a 145 × 145 pixel portion of the AVIRIS image acquired over NW Indian Pine in June 1992 [19,27,28]. Not all of the 220 original bands are employed in the experiments, since 18 bands are affected by atmospheric absorption phenomena and are consequently discarded; hence, the dimensionality of the Indian Pine data set is 202 [28].

As stated in Serpico and Moser [28], the subdivision of the ground truth into training and test data is not performed randomly, but by defining spatially disjoint training and test fields for each class, to reduce as much as possible the correlation between the samples used to train the system and those employed to test its performance. The Washington DC Mall data set was manually annotated in Landgrebe [19], and there are 137 patches of seven classes with ground truth. We use the pixels from the even-numbered patches as the training set and those from the odd-numbered patches as the test set. The MultiSpec project file provided on the student CD-ROM [19] is used to select the training/test samples in our experiments. The detailed partition of the training and test samples is listed in Table 4. For the Indian Pine data set, we adopt the same partition of training/test sets as Serpico and Moser [28], given in Table 5.

In the following subsections, a group of experimental results is presented on these two benchmark data sets. When presenting results in the rest of the paper, classification accuracy means overall accuracy. Though overall accuracy favors the class(es) with the largest number of samples, there are no severe imbalances among the land cover classes in the two benchmark data sets. As far as model selection of the SVM classifier is concerned, we do not try to find the optimal parameters during the GA–SVM search procedure, because it is too time-consuming; after some trials, we simply choose C = 1 and γ = 1 empirically. For the final band combination in each experiment, we use 5-fold cross-validation to obtain the optimal parameters; the search range for C is [2^-3, 2^10] and that for γ is [2^-8, 2^2]. Thus we can reduce the experimental time.
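As an illustration of this model-selection step, here is a minimal sketch using scikit-learn's `GridSearchCV`; `X_train`, `y_train`, and `final_bands` are hypothetical placeholders, and the exponent grids follow the ranges stated above (which should be treated as assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 5-fold CV grid over C in [2^-3, 2^10] and gamma in [2^-8, 2^2].
param_grid = {
    "C": [2.0 ** e for e in range(-3, 11)],
    "gamma": [2.0 ** e for e in range(-8, 3)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train[:, final_bands], y_train)  # final_bands: subset kept after pruning
# print(search.best_params_)
```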
Table 5
The training/test samples of Indian Pine data set (the same as Serpico and Moser [28]).

Class                   Training   Test
1. Corn-no till             762     575
2. Corn-min                 435     326
3. Grass/pasture            232     225
4. Grass/trees              394     283
5. Hay-windrowed            235     227
6. Soybean-no till          470     443
7. Soybean-min             1428     936
8. Soybean-clean till       328     226
9. Woods                    728     487
Table 6
Classification accuracy of three genetic algorithms on Washington DC Mall data set.

Algorithms   Band number   Minimal accuracy   Maximum accuracy   Average accuracy
Non-Sel      191           93.40              93.40              93.40
GS           24–28         88.28              95.04              92.14
CGGS         8             95.72              98.57              97.26
5.1. Experimental results on Washington DC data set

5.1.1. Comparison I

Here we implement two other hyperspectral image classification methods to compare with the algorithm proposed in this paper. The first uses all 191 bands without band selection as the input of the SVM (denoted the Non-Sel algorithm). The second uses all 191 bands without band grouping as the input of GA–SVM [36] (denoted the GS algorithm). Table 6 shows the overall classification accuracies of the three algorithms averaged over 10 runs. The classification accuracy of our algorithm (denoted CGGS) is better than that of the other two algorithms (for a fair comparison, the CGGS result is based on the initial band grouping shown in Table 1), and the number of selected bands is much smaller. The search space of our algorithm is about 2.7 × 10^10, while that of GS is 2^191 − 1: without band grouping, the search space of GS is about 10^47 times larger than that of our algorithm. Because there are fewer bands in the band combination, the SVM computation time of our algorithm is also much less than that of the GS algorithm. On a typical PC with a 2.0 GHz CPU and 512 MB RAM running Windows XP, the CGGS algorithm needs about 300 min for 100 generations. We therefore believe that our algorithm is an efficient feature selection method for hyperspectral image classification in the genetic family, with low computational complexity.

5.1.2. Comparison II

In order to demonstrate the necessity and efficiency of BB pruning for the initial band grouping, we search for the best band combination with the GA–SVM algorithm based on the 8-group division (Table 1) and the 11-group division (Table 2) without pruning, and on the 7-group division (Table 3) after pruning. Table 7 gives the results averaged over 10 runs. As shown in Table 7, the band grouping after pruning lets the final band combination achieve a higher classification accuracy by cutting out the irrelevant groups. The reduction of groups also reduces the heavy computational burden of GA–SVM, making the algorithm more efficient.

5.1.3. Comparison III

In order to show the merits of our algorithm (CGGS + BB), especially the band grouping scheme based on the analysis of conditional MI, we implement two uniform grouping methods for comparison: we partition the bands equally into 19 and 38 groups (Group Averages), respectively, and then select the first band of each group as the input of the SVM. To compare against the correlation structure, we conduct two further experiments: based on the conditional mutual information between the bands, we extract the first band of each group (GS8_First) and the middle band of each group (GS8_Medium), respectively, and then classify with the SVM. Table 8 shows the results of these five methods.

From Table 8, it can be observed that our band grouping scheme based on conditional mutual information is effective. For example, our grouping with the first band from each group achieves nearly the same classification accuracy as the 19 uniformly extracted bands. As these results show, our algorithm attains a higher accuracy with fewer bands than the others. Benediktsson et al. [3] used the same experimental data set. In that paper, the authors first applied principal component analysis to the hyperspectral image,
Table 7
Classification accuracy before/after pruning on Washington DC Mall data set.

Algorithms         Band number   Minimal accuracy   Maximum accuracy   Average accuracy
Before pruning 1   8             95.72              98.57              97.26
Before pruning 2   11            94.99              96.55              96.02
After pruning      7             97.37              98.79              98.25
Table 8
The classification accuracies of different grouping schemes on Washington DC Mall data set.

Algorithms         Band number   Minimal accuracy   Maximum accuracy   Average accuracy
Group averages 1   19            96.80              96.80              96.80
Group averages 2   38            93.48              93.48              93.48
GS8_First          8             96.14              96.14              96.14
GS8_Medium         8             96.48              96.48              96.48
CGGS + BB          7             97.37              98.79              98.25
Table 9
Initial band grouping for Indian Pine data set by conditional mutual information.

Groups   1      2      3       4       5       6       7       8       9       10
Bands    1–4    5–18   19–33   34–44   45–57   58–63   64–77   78–85   86–94   95–102

Groups   11        12        13        14        15        16        17        18        19        20
Bands    103–105   106–125   126–131   132–143   144–147   148–152   153–157   158–170   171–198   199–202
Table 10
Band grouping for Indian Pine data set after pruning.

Groups   1      2       3       4       5       6        7         8         9         10        11        12
Bands    1–18   19–33   34–44   45–57   58–77   78–105   106–125   126–131   132–147   148–157   158–170   171–202
Table 11
Classification accuracy before/after pruning on Indian Pine data set.

Algorithms       Band number   Minimal accuracy   Maximum accuracy   Average accuracy
Before pruning   20            81.20              83.37              82.30
After pruning    12            83.32              84.66              83.95
and then conducted mathematical morphological analysis on the first two principal image components, after which nonparametric weighted feature extraction (NWFE) and a neural network classifier were used for training and classification. Benediktsson et al. [3] achieved an overall accuracy of 98.9% with 26 bands and 98.5% with 14 bands, the latter with feature extraction. We achieve an average accuracy of 98.25% and a maximum accuracy of 98.79% with only seven bands and without feature extraction.

5.2. Experimental results on Indian Pine data set

5.2.1. Comparison IV

As stated at the beginning of this section, there are 202 bands in the Indian Pine data set for our experiments, which are initially divided into 20 groups according to conditional MI. The detailed grouping is given in Table 9, and the final band grouping after BB pruning is given in Table 10. In order to demonstrate the necessity and efficiency of BB pruning for the original band grouping, we search for the best band combination with the GA–SVM algorithm based on the 20-group division (Table 9) without pruning and the 12-group division (Table 10) after pruning. Table 11 gives the results averaged over 10 runs. As shown in Table 11, on the Indian Pine data set the band grouping after pruning also lets the final band combination reach a higher classification accuracy by cutting out the irrelevant groups.

5.2.2. Comparison V

Serpico and Moser [28] proposed three search algorithms for band selection and performed experiments on the same Indian Pine data set. Other researchers have also conducted experiments on the Indian Pine data set, such as Bazi and Melgani [2] and Guo et al. [12], albeit with different partitions of the training/test data. Serpico and Moser [28] also compared their algorithms with other state-of-the-art methods in the literature; in this section, we therefore only compare our method with theirs on the same training/test data set. Serpico and Moser [28]
employed a MAP classifier in their paper, whereas we utilize the SVM classifier. For a fair comparison, we conduct the training/test experiments on the final bands found by their methods, but using the same SVM classifier. In addition, we also give the overall accuracy with all the original bands using an SVM classifier. Table 12 gives the detailed results.

In band selection for hyperspectral data classification, there are two objectives to optimize: maximization of the classification accuracy and minimization of the number of selected bands. The two objectives sometimes conflict and must be traded off. From Table 12, it can be observed that the methods presented in Serpico and Moser [28] achieve higher accuracies when combined with the SVM, but with more bands, while our method is very competitive, with satisfactory accuracy but fewer bands.

5.3. Discussion

In Dash and Liu [6], the authors compared five feature selection strategies: exhaustive, complete, heuristic, probabilistic, and a hybrid of complete and probabilistic search methods. The branch and bound algorithm belongs to the complete category, while the genetic algorithm belongs to the probabilistic one; the method proposed in this paper belongs to the hybrid category. However, ours is different from the quick branch and bound (QBB) search method in [6]. QBB is a two-phase algorithm that runs LVF (Las Vegas Filter) in the first phase and ABB (automatic branch and bound) in the second. According to Dash and Liu [6], LVF adopts the inconsistency rate as the evaluation measure: it generates feature subsets randomly with equal probability, and once a consistent feature subset is obtained, the size of the generated subsets is pegged to the size of that subset. LVF is fast in reducing the number of features in the early stages and can produce optimal solutions if computing resources permit. The maximum possible number of subsets (the search space) of LVF is 2^200 if there are 200 bands in the original hyperspectral data. Another key issue for QBB performance is the crossing point at which ABB takes over from LVF: if the crossing point is too early, ABB will suffer from a large-size input; otherwise, LVF will
Table 12
Classification accuracy by different algorithms on Indian Pine data set.

Algorithms        Band number   Minimal accuracy   Maximum accuracy   Average accuracy
All bands (SVM)   202           83.29              83.29              83.29
SFBE [28]         26            81.30              81.30              81.30
FCBE [28]         26            81.46              81.46              81.46
SABE [28]         28            81.57              81.57              81.57
SFBE (SVM)        26            86.59              86.59              86.59
FCBE (SVM)        26            86.48              86.48              86.48
SABE (SVM)        28            86.64              86.64              86.64
CGGS + BB         12            83.32              84.66              83.95
take a long time to yield a small feature subset. For the problem of band selection for hyperspectral data classification, there are about 200 bands, so it is an even more challenging issue. In this paper, we first divide the bands into several groups based on conditional mutual information, and then GA–SVM is employed to search for the optimal combination of bands from the different groups; after band grouping, the number of bands to be selected by GA–SVM is fixed. Lastly, the branch and bound method is used to eliminate the irrelevant bands from the preceding GA–SVM results. The underlying rationale is that GA–SVM searches for the best combination of bands with little redundant information, while the BB algorithm searches for the minimal set of bands with maximum relevance.

The Washington DC data is a somewhat "easy" data set, while the Indian Pine data set is more difficult to classify, since several of its classes have similar spectral characteristics. The proposed approach has achieved higher accuracy than the previous works on both data sets. In summary, comprehensive experiments on the two benchmark data sets show that our method is superior.

6. Concluding remarks

An innovative band selection algorithm for hyperspectral images, with band grouping based on conditional MI, pruning by BB, and GA–SVM evaluation, has been proposed in this paper. With band grouping based on conditional MI and pruning by BB, we can find an irredundant band combination with a minimal number of bands, rule out the irrelevant groups, and reduce the number of band groups, so that the classification accuracy becomes higher. In addition, we use GA–SVM to obtain a better band combination than the traditional algorithms. Comparisons with other state-of-the-art algorithms on two benchmark data sets show that the proposed method is superior in classification accuracy with fewer selected bands.

The conditional MI can represent the correlation between the bands and the classes. How to mine more information for band grouping of hyperspectral images from conditional MI and further improve the classification accuracy is an important topic for future study. In this paper, the final number of chosen bands is set once the conditional mutual information of the training samples is computed. However, the existing literature [28,3,2,12] suggests that a suitable number of bands producing satisfactory results may be more than 20. Hence, how to reformulate the band grouping scheme to attain higher classification accuracy is another interesting topic for future study. A further concern is how to reduce the computational complexity of GA–SVM and make the search procedure faster. Zhang et al. [35] introduced high-performance computing (HPC) techniques, such as data parallelization and multi-threading, to speed up the GA–SVM procedure. In our previous work [20], we found that clustering the training samples can speed up the GA search process by a factor of 10, with little degradation of the classification accuracy. Further work in this direction is currently under way.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. The Washington DC Mall data set was obtained from the student CD-ROM which accompanies Prof. Landgrebe's book. The authors are also grateful to Prof. Serpico and Dr. Moser for providing the training/test samples of the Indian Pine data set used in their paper.
A preliminary version of this paper has appeared in the Proceedings of the Seventh International Symposium on Neural Networks (ISNN 2010) [32].
References

[1] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.
[2] Y. Bazi, F. Melgani, Toward an optimal SVM classification system for hyperspectral remote sensing images, IEEE Trans. Geosci. Remote Sens. 44 (11) (2006) 3374–3385.
[3] J.A. Benediktsson, J.A. Palmason, J.R. Sveinsson, Classification of hyperspectral data from urban areas based on extended morphological profiles, IEEE Trans. Geosci. Remote Sens. 43 (3) (2005) 480–491.
[4] A.L. Blum, P. Langley, Selection of relevant features and examples in machine learning, Artif. Intell. 97 (1–2) (1997) 245–271.
[5] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2009. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] M. Dash, H. Liu, Consistency-based search in feature selection, Artif. Intell. 151 (1–2) (2003) 155–176.
[7] M.M. Dundar, D.A. Landgrebe, Toward an optimal supervised classifier for the analysis of hyperspectral data, IEEE Trans. Geosci. Remote Sens. 42 (1) (2004) 271–277.
[8] M.E. ElAlami, A filter model for feature subset selection based on genetic algorithm, Knowledge-Based Syst. 22 (5) (2009) 356–362.
[9] P.A. Estévez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Trans. Neural Networks 20 (2) (2009) 189–201.
[10] H. Fröhlich, O. Chapelle, B. Schölkopf, Feature selection for support vector machines by means of genetic algorithm, in: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, 2003, pp. 142–148.
[11] N. Ghoggali, F. Melgani, Y. Bazi, A multiobjective genetic SVM approach for classification problems with limited training samples, IEEE Trans. Geosci. Remote Sens. 47 (6) (2009) 1707–1718.
[12] B. Guo, R.I. Damper, S.R. Gunn, et al., A fast separability-based feature-selection method for high-dimensional remotely sensed image classification, Pattern Recognit. 41 (5) (2008) 1653–1662.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[14] J. Huang, Y. Cai, X. Xu, A wrapper for feature selection based on mutual information, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), vol. 02, IEEE Computer Society, Washington, DC, 2006, pp. 618–621.
[15] R. Huang, M. He, Band selection based on feature weighting for classification of hyperspectral data, IEEE Geosci. Remote Sens. Lett. 2 (2) (2005) 156–159.
[16] A.C. Jensen, A.S. Solberg, Fast hyperspectral feature reduction using piecewise constant function approximations, IEEE Geosci. Remote Sens. Lett. 4 (4) (2007) 547–551.
[17] B.C. Kuo, K.Y. Chang, Feature extractions for small sample size classification problem, IEEE Trans. Geosci. Remote Sens. 45 (3) (2007) 756–764.
[18] N. Kwak, C.-H. Choi, Input feature selection for classification problems, IEEE Trans. Neural Networks 13 (1) (2002) 143–159.
[19] D.A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing, Wiley, Hoboken, NJ, 2003.
[20] L. Lin, S.-J. Li, Y.-L. Zhu, et al., A novel approach to band selection for hyperspectral image classification, in: Proceedings of the Chinese Conference on Pattern Recognition (CCPR 2009), 2009, pp. 298–303.
[21] C. Liu, C. Zhao, Y. Zhang, A new method of hyperspectral remote sensing image dimensional reduction, J. Image Graph. 10 (2) (2005) 218–224.
[22] J. Lu, T. Zhao, Y. Zhang, Feature selection based on genetic algorithm for image annotation, Knowledge-Based Syst. 21 (8) (2008) 887–891.
[23] S. Nakariyakul, D.P. Casasent, Adaptive branch and bound algorithm for selecting optimal features, Pattern Recognit. Lett. 28 (2007) 1415–1427.
[24] J. Novovicova, P. Somol, M. Haindl, et al., Conditional mutual information based feature selection for classification task, in: CIARP 2007, LNCS, vol. 4756, 2007, pp. 417–426.
[25] I.-S. Oh, J.-S. Lee, B.-R. Moon, Hybrid genetic algorithms for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1424–1437.
[26] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1226–1238.
[27] S.B. Serpico, L. Bruzzone, A new search algorithm for feature selection in hyperspectral remote sensing images, IEEE Trans. Geosci. Remote Sens. 39 (7) (2001) 1360–1367.
[28] S.B. Serpico, G. Moser, Extraction of spectral channels from hyperspectral images for classification purposes, IEEE Trans. Geosci. Remote Sens. 45 (2) (2007) 484–495.
[29] J.M. Sotoca, F. Pla, Hyperspectral data selection from mutual information between image bands, in: D.-Y. Yeung et al. (Eds.), SSPR&SPR 2006, LNCS, vol. 4109, 2006, pp. 853–861.
[30] L. Sun, W. Gao, Selecting the optimal classification bands based on rough sets, Pattern Recognit. Artif. Intell. 13 (2) (2000) 181–186.
[31] L. Wang, Y. Gu, Y. Zhang, Band selection method based on combination of support vector machines and subspatial partition, Syst. Eng. Electron. 27 (6) (2005) 974–977.
[32] H. Wu, J. Zhu, S.-J. Li, et al., A hybrid evolutionary approach to band selection for hyperspectral image classification, in: Proceedings of the Seventh International Symposium on Neural Networks (ISNN 2010), LNEE, vol. 67, 2010, pp. 329–336.
[33] F. Tan, X. Fu, Y. Zhang, A.G. Bourgeois, A genetic algorithm-based method for feature subset selection, Soft Comput. 12 (2) (2007) 111–120.
[34] C.K. Zhang, H. Hu, An effective feature selection scheme via genetic algorithm using mutual information, in: L. Wang, Y. Jin (Eds.), FSKD 2005, Lecture Notes in Artificial Intelligence, vol. 3614, 2005, pp. 73–80.
[35] T. Zhang, X. Fu, R.S. Goh, et al., A GA–SVM feature selection model based on high performance computing techniques, in: Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, 2009, pp. 2653–2658.
[36] L. Zuo, J. Zheng, F. Wang, et al., A genetic algorithm based wrapper feature selection method for classification of hyperspectral data using support vector machine, Geograph. Res. 27 (3) (2008) 493–501.