An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges



Information Fusion 59 (2020) 59–83

Contents lists available at ScienceDirect

Information Fusion journal homepage: www.elsevier.com/locate/inffus

Maryam Imani, Hassan Ghassemian∗

Image Processing and Information Analysis Lab, Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Keywords: Hyperspectral image; Feature fusion; Decision fusion; Classification

Abstract

Hyperspectral images (HSIs) form a data cube with spatial information in two dimensions and rich spectral information in the third. The large number of spectral bands allows different materials to be discriminated in fine detail. Moreover, land cover discrimination improves when spatial features of the image such as shape, texture and geometrical structures are used. Fusion of spectral and spatial information can therefore significantly improve HSI classification. In this work, the spectral-spatial information fusion methods are categorized into three main groups. The first group contains segmentation based methods, where objects or superpixels are classified instead of pixels, or the obtained segmentation map is used to relax the pixel-wise classification map. The second group consists of feature fusion methods, which are divided into six subgroups: feature stacking, joint spectral-spatial feature extraction, kernel based classifiers, representation based classifiers, 3D spectral-spatial feature extraction and deep learning based classifiers. The third group consists of decision fusion based approaches, where the complementary information of several classifiers is combined to obtain the final classification map. Different methods in each category are reviewed, and the advantages and difficulties or disadvantages of each group are discussed. The performance of various fusion methods is assessed in terms of classification accuracy and running time using experiments on three popular hyperspectral images. The results show that the feature fusion methods, although time consuming, can provide superior classification accuracy compared to the other methods. This work can be useful for researchers interested in HSI feature extraction, fusion and classification.

1. Introduction

The development of hyperspectral sensors provides hyperspectral images (HSIs) containing hundreds of spectral bands. The spectral signature of each image pixel, constituted by hundreds of spectral bands, acts as a fingerprint for identifying its material type. An HSI is a cube of images acquired from the same scene at different electromagnetic wavelengths, where each slice of the cube is associated with a specific wavelength (see Fig. 1). In other words, each pixel of the HSI (spatial sample) located in row i and column j, denoted as p(i, j), has a spectral signature composed of the reflections of that position of the scene at various wavelengths (a feature vector containing the values of the different spectral bands). This rich spectral information simplifies distinguishing between different materials and thus allows material recognition and land cover classification with high accuracy. HSIs are useful in various applications and fields such as mineralogy, agriculture, land cover classification and target detection [1].

Although spectral features alone may be useful, they are not sufficient in many cases. When two different objects have the same spectral signatures, they can be discriminated through their shapes and textures [2]. Thus, spectral and spatial information can be fused to improve HSI classification. The basic idea behind using spatial information is that, in local regions, neighboring pixels have similar spectral features and belong to the same class with high probability [3]. To better understand the value of spatial information, consider Fig. 2, where the positions of the pixels are randomly changed while the spectral features of each pixel remain unchanged. The result of spectral classification is equivalent for both of these figures. However, Fig. 2(a) contains valuable spatial information about the shape and texture of objects, which can be used in a spectral-spatial classifier. A significant improvement in HSI classification can be achieved by applying an appropriate spectral-spatial fusion method. To extract the spatial information, a local window is usually considered around each pixel of the image. By applying a spatial transform or by computing the statistics of the local dependency, spatial features are extracted and assigned to the central pixel [4].

Fig. 1. A hyperspectral cube (left) and a typical spectral feature (right) [2].

Fig. 2. (Left) A hyperspectral cube; (right) the same cube after randomly changing the positions of the pixels, which removes the spatial information.

Fig. 3. Categorization of the spectral-spatial fusion methods.

∗ Corresponding author. E-mail addresses: [email protected] (M. Imani), [email protected] (H. Ghassemian).

https://doi.org/10.1016/j.inffus.2020.01.007 Received 11 October 2019; Received in revised form 18 January 2020; Accepted 20 January 2020; Available online 21 January 2020. 1566-2535/© 2020 Elsevier B.V. All rights reserved.

The spectral-spatial fusion methods are generally categorized into three main groups (segmentation based, feature fusion based, and decision fusion based), each of which also contains some subgroups. This categorization is shown in Fig. 3 and presented as follows:

A) Segmentation based methods

This category of spectral-spatial fusion methods produces segments (objects, super-pixels or pixons) over the HSI. The technique is based on the fundamental assumption that the scene is segmented into objects such that all samples (pixels) from an object are members of the same class; hence, each object in the scene can be represented by a single suitably chosen feature set. Typically, the size and shape of objects in the scene vary randomly, while the sampling rate, and therefore the pixel size, is fixed. It is reasonable to assume that the sample data (pixels) from a simple object have a common characteristic. A complex scene consists of simple objects. Any scene can thus be described by classifying the objects in terms of their features and by recording the relative position and orientation of the objects in the scene.


In the segmentation methods, the spatial information is used to generate segments. Each segment contains adjacent pixels with similar spectral features. The obtained segmentation maps can be exploited in two ways:

A-1. The obtained objects are classified instead of pixels. In other words, the same label is assigned to all pixels belonging to an object.

A-2. The HSI is classified pixel-wise. Then, the obtained segmentation map is used as a mask to improve the pixel-wise classification map. Usually, the majority voting rule is used to assign the same label to all pixels located in a segment.

The segmentation based methods remove the noisy pixels of the classification maps, but selecting an appropriate segmentation algorithm and generating suitable objects with fitting sizes and shapes is a challenging task.

B) Feature fusion based methods

In the spectral-spatial feature fusion category, the spectral and spatial features are extracted individually or simultaneously. Then, the obtained spectral-spatial feature cube is fed to a suitable classifier to obtain the classification map. The various feature fusion methods are as follows:

B-1. Feature stacking

In these methods, the spectral features and the spatial ones are extracted individually, and then simply stacked together to generate the spectral-spatial cube. These methods are relatively simple, but because the spectral and spatial features are extracted independently, the hidden information in the joint spectral and spatial features is lost. Moreover, the stacked spectral-spatial feature vector assigned to each pixel has a high dimension, which leads to the curse of dimensionality when only a limited number of training samples is available (Hughes phenomenon) [5].

B-2. Joint spectral-spatial feature extraction

Instead of extracting spectral and spatial features individually, some fusion methods extract them jointly. The advantages of these methods include avoiding the long fused vectors produced by feature stacking and taking into account the joint contribution of spectral and spatial information. These advantages come at the cost of more computation and the loss of some information from the original spectral bands.

B-3. Kernel based classifiers

The spectral and spatial features can be combined by applying multiple kernels or composite kernels. The high potential of kernels for extracting non-linear features makes it possible to handle non-linear class boundaries. However, designing an appropriate kernel and selecting its parameters is a hard task.

B-4. Representation based classifiers

The representation based methods are non-parametric and require neither assumptions about the data distribution nor estimation of statistics. The best-known methods of this category are sparse representation (SR) and collaborative representation (CR). These methods are based on the idea that each image pixel can be represented through a linear combination of atoms of an appropriate dictionary. Composing (or learning) the dictionary and solving the optimization problem are difficult tasks.

B-5. 3D spectral-spatial feature extraction

Because the HSI is inherently 3D, simultaneous extraction of spectral and spatial features preserves the joint dependencies of the spectral and spatial information. 3D filters are usually selected for extraction of the 3D spectral-spatial cube. The high volume of computations, the selection of appropriate 3D filters and their parameter settings are the difficulties of these methods.

B-6. Deep learning based classifiers

Deep learning methods such as convolutional neural networks (CNNs) extract joint spectral-spatial features layer by layer, where the feature map of each layer is extracted from the feature map of the previous layer. The high potential of deep learning methods for extracting non-linear and hidden features is their main advantage. However, deep networks have hyper-parameters that need to be set, and training the network requires a large training set. Otherwise, over-fitting causes lower classification accuracy in the testing phase than in the training stage.

C) Decision fusion based methods

In the decision fusion methods, the classification map is obtained multiple times: by applying different classifiers to the same feature set, by applying the same classifier individually to various feature sets, or by applying various classifiers to various feature sets. The final classification map is obtained by applying a decision fusion rule such as majority voting or the joint measures method. The difficulties of the decision fusion methods are the selection of feature sets or classifiers that contain complementary information, and the high computation time due to running multiple classification processes.

A review of different information fusion methods is given in this paper. Several state-of-the-art methods from each group are introduced. The advantages and disadvantages of each group are also discussed.

2. Segmentation based (object) methods

There are two types of segmentation methods. In the first type, a segmentation algorithm is applied to the HSI to extract objects, and then the objects are classified. In the second type, the obtained segmentation map is used as a mask for relaxation of a pixel-wise classification map. The main challenge of the object based methods is selecting an appropriate segmentation algorithm that extracts a sufficient number of valid objects and avoids over-segmentation or under-segmentation [4]. A hierarchical statistical region merging (HSRM) segmentation algorithm is proposed in [3]. A fuzzy no-border/border map is generated to provide weighting coefficients for modifying the spatial prior of the Markov random field (MRF) based multi-level logistic model [4]. The proposed MRF+HSRM method deals with the over-segmentation of the classification output, which is a common problem of MRF based classifiers. Statistical region merging (SRM) segmentation not only has a simple merging formulation and a fast implementation but also has strong mathematical support. Moreover, SRM is robust to texture and image noise, which results in the identification of meaningful edges. The MRF+HSRM method is more robust than object-regularized approaches such as majority voting.

2.1. Object classification

Generally, an image can be analyzed pixel-wise or object-wise. The object based methods usually apply a segmentation algorithm to an image for object detection. The spectral-spatial features of each object are then extracted. Finally, the objects are classified using the object features [2]. An object based classification method is proposed in [5] that significantly reduces the complexity of a multispectral image by compressing it with a compaction coefficient larger than 20. The classification running time is also reduced by a factor larger than 20. The proposed method, called the automatic multispectral image compaction algorithm (AMICA), uses the gradient vector of image pixels within objects as well as contextual information to generate the object features. A specific adjacency relation and a similarity measure are defined mathematically to form an object. Spectral-spatial object features are used instead of the original spectral features of individual pixels to reduce data redundancy. The AMICA algorithm has three steps. First, the data cube is partitioned into an exhaustive set of objects. Second, all pixels belonging to the same object are characterized by an object feature vector. Third, the object features are used rather than the pixel features for analysis, classification or transmission of the data.

Fig. 4. Two examples of classification map relaxation.

An object based classifier that automatically tunes its free parameters is proposed in [6]. The process is done band by band. First, partial differential equations (PDEs) [7], in both real and complex versions, are used to smooth the HSI data. To tune the parameters of the PDE, a genetic algorithm with a new fitness function is used. In the second step, the objects are extracted from the smoothed bands. The introduced distance matrix is substituted by a conventional distance metric. The proposed method uses the sum of the gradient values around each considered pixel in addition to the difference between the pixel and the surrounding object. In the third step, the average of the spectral features is computed and fed to an SVM classifier for object classification. The classification outputs of the different HSI bands are then fused through the majority voting rule, a popular decision fusion method. Because individual bands are processed independently, the running time of the proposed method with parallel processing equals the elapsed time for a single band, whereas with sequential processing it equals the sum of the processing times of all single bands.
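The object-level idea discussed above (one feature vector per object, one label per object) can be illustrated with a minimal NumPy sketch. It assumes a segmentation map is already available and uses the mean spectrum as the object feature. The function names `object_mean_features` and `broadcast_object_labels` are hypothetical and are not taken from [5] or [6]; the gradient and contextual terms used by those methods are not reproduced here.

```python
import numpy as np

def object_mean_features(cube, segments):
    """Return (object_ids, mean spectrum of each object).

    cube:     (rows, cols, bands) hyperspectral array
    segments: (rows, cols) integer object id of each pixel (assumed given)
    """
    bands = cube.shape[-1]
    pixels = cube.reshape(-1, bands).astype(float)
    object_ids, inverse = np.unique(segments.ravel(), return_inverse=True)
    sums = np.zeros((object_ids.size, bands))
    np.add.at(sums, inverse, pixels)              # accumulate spectra per object
    counts = np.bincount(inverse, minlength=object_ids.size)
    return object_ids, sums / counts[:, None]

def broadcast_object_labels(segments, object_ids, object_labels):
    """Give every pixel the label predicted for its object (approach A-1)."""
    # object_ids is sorted (it comes from np.unique), so searchsorted maps ids to rows
    index = np.searchsorted(object_ids, segments)
    return np.asarray(object_labels)[index]
```

The mean spectra returned by `object_mean_features` would be passed to any classifier (for example an SVM), and the predicted object labels mapped back to pixels with `broadcast_object_labels`.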

2.2. Classification map relaxation

Spatial information can be utilized after spectral based pixel-wise classification to improve the classification results. To this end, a label relaxation process uses the spatial contextual information on the pixel-wise classification map to remove noisy labels and redundant parts of the map. In this case, the spectral information is used first, in the classification procedure, and the spatial information is used second, in the post-processing procedure. Two examples of classification map relaxation can be seen in Fig. 4. As seen, the relaxed (regularized) classification maps contain less salt-and-pepper noise.

In the relaxation methods, the segmentation map is usually used for correction and regularization of the pixel-wise classification map [8,9]. In [10], the classification is relaxed using a segmentation map obtained by generating super-pixels. In this method, the uniform local binary pattern (ULBP) is first applied to the HSI to extract local features, and then the support vector machine (SVM) classifier is applied to the ULBP feature cube to find the initial probabilistic classification map. The principal component analysis (PCA) transform is applied to the original HSI cube. The first three principal components are used to form a composite image, and the entropy rate segmentation (ERS) method [11] is applied to this composite image to over-segment it into homogeneous regions. After that, to obtain super-pixels with more homogeneity, a region merging process is applied to the over-segmented map. Finally, the classification map obtained by the SVM classifier is guided by the super-pixel map to perform a soft decision fusion.

Probabilistic label relaxation, as a post-processing approach, incorporates the contextual information of a fixed local window. In [12], the SVM method is used both for the initial classification and for obtaining the class probability estimates for label relaxation in the post-processing stage.

A super-pixel based 3D deep neural network is proposed in [13] to improve HSI classification in different structures and boundaries. The super-pixel construction results in an over-segmentation in which the HSI is partitioned into non-overlapping regions. Each homogeneous region reveals the local structures of the HSI with adaptive shapes and sizes. The use of a 3D convolutional neural network (3D-CNN) may cause noisy classification maps. To cope with this problem, a weighted feature image (WFI) is constructed via super-pixels to enforce spectral-spatial consistency in the classification output. To construct the WFI, the spectral pixels are linearly combined within each super-pixel. The WFI provides more spectral similarity between pixels within the super-pixels while also maintaining pixel diversity. Thus, the WFI not only preserves regional consistency but also avoids eliminating the effects of mixed pixels. In addition, to cope with the misclassification of mixed pixels, the 3D recurrent CNN (3D-RCNN) is proposed for extracting 3D features from the WFI. Moreover, the spectral-spatial information contained in each super-pixel is used to fill the super-pixel boundary in the 3D local neighborhood cube. The filled samples preserve the spectral-spatial similarity to the central pixel and therefore deal with misclassification at super-pixel boundaries. The HSI contains rich structural features but yields noisy classification maps. In contrast, the WFI lacks structural features but provides good spatial continuity. So, to achieve a balance between structure and homogeneous regions, both the HSI and the WFI are used to construct the 3D samples. The proposed super-pixel based 3D deep learning method has four steps. First, it creates super-pixels and constructs the WFI. Second, it constructs the 3D super-pixel based samples. Third, it explores 3D spectral-spatial features from the HSI via 3D-CNN and from the WFI via 3D-RCNN. Fourth, it classifies the HSI using multi-feature learning.

Kernel based classifiers such as SVM are well suited to handling high dimensional data. Moreover, using an adaptive similarity measure instead of an unweighted one such as the Euclidean distance emphasizes the features relevant to a specific task such as classification. The best-known such metric for improving classification accuracy is the Mahalanobis distance. The SVM classifier with a Mahalanobis distance based kernel is used for the initial classification in [14]. The introduced classification method has two steps. First, SVM as a kernel based classifier is used to provide a spectral based classification map. A Mahalanobis metric based kernel is used in the proposed SVM classifier to increase the classification accuracy and reduce the computations. The second step is segmentation, where the posterior probabilities acquired from the SVM in the previous step are used to reconstruct the spatial relationships in the HSI. A kernel transformation is introduced to evaluate the deviations among the pixels of each region. Finally, a graph cut method is used to segment the pseudo-image. The main contribution of [14] is the combination of kernel based classification with segmentation through the multi-level logistic obtained by the SVM classifier.

The object based classifiers are preferred over the pixel-wise classifiers because they provide a smooth and noiseless classification map, which is more applicable in real scenarios. On the other hand, the main disadvantage of the object-oriented classifiers is that they produce an inaccurate classification map if the objects are inaccurately extracted. The image segmentation error and the classification error accumulate. If an object is misclassified, all pixels of that object are misclassified, which causes a large error. Segmentation methods such as watershed [15] and partitional clustering [16] are simple, have low computational complexity and reveal the spatial structures well. However, they have two main disadvantages: first, the number of segments has to be set by the user; second, the segmentation result is not robust because it depends on the initialization values. In contrast, other segmentation methods such as statistical region merging (SRM) are not only easy to implement but also do not require the number of segments to be set and provide robust segmentation results [17].
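The map relaxation of Section 2.2, in its simplest form (approach A-2 with the majority voting rule), can be sketched as follows. This is a minimal illustration that assumes both a pixel-wise label map and a segmentation map are already available. The function name `relax_by_majority_vote` is hypothetical, and the probabilistic and MRF based relaxations cited above are not reproduced.

```python
import numpy as np

def relax_by_majority_vote(label_map, segments):
    """Replace each pixel label by the most frequent label in its segment.

    label_map: (rows, cols) pixel-wise classification result
    segments:  (rows, cols) integer segment id of each pixel (assumed given)
    """
    relaxed = np.empty_like(label_map)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        values, counts = np.unique(label_map[mask], return_counts=True)
        relaxed[mask] = values[np.argmax(counts)]   # ties go to the smallest label
    return relaxed
```

Isolated labels inside a segment are overwritten by the dominant label, which is how salt-and-pepper noise is removed. The same mechanism also shows the risk noted above: if the segment itself is wrong, the error spreads to all of its pixels.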

3. Feature fusion

HSI classification at the feature fusion level can follow two general approaches. In the first approach, the spatial features are extracted from the HSI and then combined with the spectral features through a combination method such as feature stacking or kernel based methods. In the second approach, the spectral-spatial features are extracted jointly to preserve the correlated nature of the HSI cube, in which the spectral and spatial information are jointly contained in a 3D structure. Joint spectral-spatial feature extraction, representation based classifiers, 3D spectral-spatial feature extraction and deep learning based classifiers belong to this group. Different feature extraction methods can be used to extract geometric structures, shape and texture from the HSI. To extract spatial features, a window with a fixed or adaptive size is usually considered around each pixel, and the spatial features are extracted from this neighborhood region. Two main challenges of the feature fusion methods are the selection of an appropriate window size for the neighborhood region and the need for a large number of training samples. Six different types of feature fusion methods are presented and discussed as follows.

3.1. Feature stacking

In the stacking approach, the spatial features are extracted from the HSI and then stacked on the spectral feature vector (the original spectral bands or the spectral features extracted by different feature extraction methods). Then, the stacked feature vectors are fed to an appropriate classifier to obtain the classification map. Fig. 5 shows a general form of feature fusion using the stacking approach. The PCA transform is applied to the original HSI to find its principal components (PCs), and m PCs are chosen. Then, n different spatial feature extraction methods are applied to each PC. Finally, all of the extracted spatial features are stacked on the original spectral bands to produce a long feature vector. Note that, instead of the original spectral bands, spectral features produced by a spectral feature extraction method may be used. GLCM, Gabor, morphological and attribute filters are examples of spatial feature extraction methods [18,19]; and PCA, linear discriminant analysis (LDA), binary coding based feature extraction (BCFE) [20] and feature space discriminant analysis (FSDA) [21] are examples of spectral feature extraction methods.

Fig. 5. General form of stacking approach for feature fusion.

As an example of feature fusion through stacking, consider the attribute profile based FSDA (APFSDA) method [22], an extension of FSDA. FSDA is originally a supervised spectral feature extraction method that simultaneously considers three measures: maximizing the between-class scatter, minimizing the within-class scatter, and maximizing the between-band scatter. While the first two measures increase class discrimination, the third increases the differences between the extracted features, which decreases feature overlap and redundancy. The APFSDA method adds the spatial information of attribute filters (AFs) to the FSDA method. AFs offer great flexibility in the definition of attributes, which gives them a high capability for extracting and modelling various contextual features. However, AFs can only be applied to single-band grey-level images. The conventional way to apply AFs to the HSI is to reduce the HSI dimensionality using the PCA transform and then apply AFs to each principal component individually. However, PCA is an unsupervised feature extraction method based on the mean square error (MSE) measure, which is appropriate for representation based applications rather than classification. To deal with this difficulty, APFSDA uses the FSDA method to find components with high class discrimination and low overlap. Fig. 6 illustrates the flowchart of the APFSDA method, which is explained in the following (for more details, the interested reader is referred to [22]).

Fig. 6. The APFSDA method as an example of stacking approach [22].

FSDA is applied to the HSI to find m components of the HSI. Then, the attribute profile (AP) of each component, which contains the attribute spatial features, is obtained. On each FSDA component $y_j$, $j = 1, \ldots, m$, the AP with attribute $a_k$ ($k = 1, 2, \ldots, s$), denoted by $AP_{a_k}(y_j)$, is obtained by applying a series of attribute thinning ($\gamma_i$) and attribute thickening ($\varphi_i$) filters with thresholds $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$:

$$AP_{a_k}(y_j) = \{\varphi_n(y_j), \ldots, \varphi_1(y_j), y_j, \gamma_n(y_j), \ldots, \gamma_1(y_j)\}; \quad j = 1, 2, \ldots, m; \; k = 1, 2, \ldots, s \qquad (1)$$

For each pixel of the HSI associated with attribute k, the extended AP (EAP) is obtained by:

$$EAP_{a_k}(x) = \{AP_{a_k}(y_1), AP_{a_k}(y_2), \ldots, AP_{a_k}(y_m)\}; \quad k = 1, 2, \ldots, s \qquad (2)$$

The extended multi-AP (EMAP) is acquired by applying s attributes:

$$EMAP(x) = \{EAP_{a_1}(x), EAP_{a_2}(x), \ldots, EAP_{a_s}(x)\} \qquad (3)$$

The EMAP features stacked on the original spectral features are given to the multinomial logistic regression (MLR) classifier for classification. By jointly fusing spectral and spatial information, in addition to maximizing the class discrimination and minimizing the redundancy in the FSDA process, the APFSDA method provides superior classification results compared to MFL and GCK.

The feature fusion method proposed in [23] extracts spectral features using traditional feature extractors and extracts spatial features using a deep CNN model. It stacks the spectral and spatial features and feeds them to a classifier such as SVM. A popular feature extraction method widely used in classification problems is linear discriminant analysis (LDA), which maximizes the inter-class scatter while minimizing the intra-class scatter. Various versions of LDA such as nonparametric weighted feature extraction (NWFE) [24], Imani method 1 [25], local Fisher discriminant analysis [26], Imani method 2 [27], and local discriminant embedding (LDE) [28] have been introduced. LDE maximizes the inter-class scatter while keeping neighboring samples of different classes apart by means of a graph embedding structure. A new version of LDE called balanced LDE (BLDE) is proposed in [23] for spectral feature extraction. BLDE considers the intra-class criterion in addition to the inter-class one in the objective function through the graph embedding structure. To distinguish various materials with the same spectral features, spatial information must be incorporated alongside the spectral information. Although there are various hand-crafted methods for spatial feature extraction, such as the grey-level co-occurrence matrix (GLCM), geometrical filters, Gabor filters, morphological profiles, attribute profiles and various types of wavelet transforms, they are limited by their parameter configuration: with a specific parameter setting, only objects with a specific size, shape and texture are detected. So, with the limited parameter settings of traditional spatial feature extraction methods, the great variety present at low levels cannot be fully captured. In contrast to hand-engineered feature extraction methods, deep learning methods can automatically extract high-level, robust and efficient spatial features. To this end, the spectral features obtained by BLDE are combined with spatial features extracted by the CNN model in [23].

The spectral features and the spatial ones can be extracted individually and then simply stacked together to form a long feature vector, which can be given to a classifier such as SVM. In [29], PCA, LDA and NWFE are used for spectral feature extraction, and the morphological profile (MP), Gabor filters and GLCM are used for spatial feature extraction. On the one hand, the various spectral-spatial features provide different views of the HSI. Gabor, MP and GLCM provide diverse features of the HSI (directionality, shape and size, and randomness, respectively) that complement each other. On the other hand, the resulting long feature vector contains redundant information and also causes over-fitting due to the curse of dimensionality and limited training samples.
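The general stacking pipeline of Section 3.1 (PCA, spatial features per principal component, concatenation with the spectral bands) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the local mean and local standard deviation in a square window stand in for the GLCM, Gabor, morphological or attribute features named in the text, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pca_components(cube, m):
    """Project a (rows, cols, bands) cube onto its first m principal components."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)   # astype copies, cube is untouched
    X -= X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (X @ vt[:m].T).reshape(rows, cols, m)

def local_mean_std(band, window=3):
    """Stand-in spatial features: mean and std in an odd-sized square window."""
    pad = window // 2
    padded = np.pad(band, pad, mode="reflect")
    patches = sliding_window_view(padded, (window, window))
    return patches.mean(axis=(-2, -1)), patches.std(axis=(-2, -1))

def stack_spectral_spatial(cube, m=3, window=3):
    """Stack spatial features of m PCs on the original spectral bands."""
    pcs = pca_components(cube, m)
    spatial = []
    for k in range(m):
        spatial.extend(local_mean_std(pcs[..., k], window))
    spatial = np.stack(spatial, axis=-1)        # (rows, cols, 2 * m)
    return np.concatenate([cube.astype(float), spatial], axis=-1)
```

With n spatial extractors applied to m components, the stacked vector grows by at least n × m entries per pixel (here n = 2), which makes the dimensionality concern raised above concrete.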

3.2. Joint spectral-spatial feature extraction

The multi-scale feature fusion is done by a Gaussian pyramid decomposition in [30]. At first, the segmented PCA is applied to the HSI for feature reduction. To this end, the spectral bands of the HSI are partitioned into subsets of adjacent spectral bands, and PCA is applied to each subset. The Gaussian pyramid of each subset is then obtained by successively applying a Gaussian kernel and a down-sampling operator to the segmented PCA outputs. The segmented PCA is applied again to the obtained Gaussian pyramid to increase the differences among HSI pixels and hence their discriminability. The final extracted features are fed to the SVM classifier to achieve the classification map.

An unsupervised spectral-spatial feature extraction method, which is an extended version of locality preserving projection (LPP), is proposed in [31]. The method is implemented in two steps. In the first step, the HSI is filtered and a homogeneous neighborhood region is considered around each pixel for selection of the unlabeled samples. In the second step, the spectral-spatial features of the unlabeled samples are used to calculate the projection matrix with the LPP approach.

The authors of [32] utilize adaptive total variation filtering (ATVF) for de-noising and spectral-spatial feature extraction of HSI. At first, the PCA transform is applied to the HSI for feature reduction. Then, ATVF is applied to each component to extract noise-free spectral-spatial features. Finally, the extracted features are given to an extreme learning machine (ELM) for classification. The general total variation minimization problem, which was originally proposed for image de-noising, consists of two terms. The first term is the total variation and the second term is the squared absolute difference between the original image and the noisy one. The two terms are related through a regularization parameter. The total variation can be calculated by using the gradient operators.

The hierarchical guidance filtering (HGF) is proposed in [33] for joint extraction of spectral-spatial features. HGF is an extension of rolling guidance filtering and guided filtering. HGF produces a series of spectral-spatial feature sets, and the different hierarchical features provide contextual information at different scales. Then, a measure matrix, called the matrix of spectral angle distance, is defined to evaluate the quality of the extracted features in each hierarchy. Finally, the weighted voting rule, a popular ensemble strategy, is used to obtain the classification result.

Due to the presence of various geometrical features with different shapes in different locations of an image, the use of a fixed structural element (SE) for providing the MP is not so efficient. To deal with this problem, the patch image-based morphological profile (EPIMP) is proposed in [34], which adaptively considers a specific SE for each area (patch) of the image. The SE chosen for each patch corresponds to the shape or edge image of that patch. The spatial features extracted by EPIMP provide more morphological information than the conventional MP. Combination of the original spectral features with the MP, together with the spatial information contained in the neighborhood region, is also proposed in [35].

Weber local descriptor (WLD) and quaternion representation (QR) are used for joint extraction of spectral and spatial features in [36]. To apply WLD and QR, the PCA method is first applied to the HSI to reduce the data dimensionality to three components. Then, the spectral-spatial features are extracted from the principal components. QR uses the quaternion algebra, where a quaternion has four parts: a real part together with three imaginary parts. WLD provides two categories of features: differential excitation and orientation. WLD is not only robust to illumination changes and noise but also effectively extracts oriented edges and texture features. Quaternion WLD (QWLD) is introduced based on QR. QWLD is obtained for each pixel of the HSI in a local region surrounding it. QWLD features are computed based on (1) the intensity of the center pixel and (2) the intensity variations of the local pixels in the neighborhood. The relation between the center pixel and its neighbors is obtained by (1) the differential excitation, which is the difference between the intensity of the central pixel and that of its neighbors, and (2) the orientation feature, which is obtained from the gradient and contains horizontal, vertical and diagonal orientation features. To enhance the discriminant ability of the QWLD features, the obtained features are fused through construction of a feature histogram in a local neighborhood. The histogram vector represents the gradient information of the central pixel in a local neighborhood region. Due to the presence of homogeneous regions with different shapes and sizes in a HSI, multi-scale analysis is proposed in [36] to extract more intrinsic and accurate spatial information. To this end, the radius of the square window around each central pixel is changed to provide more spatial details from the neighborhood region. SVM and the sparse representation classifier (SRC) are finally used for classification of the fused features.
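The differential excitation used by WLD can be illustrated with a small sketch. The survey does not state the formula, so the code below assumes the commonly used single-band WLD definition, $\xi = \arctan\big(\sum_i (x_i - x_c)/x_c\big)$ over the eight neighbors $x_i$ of the center $x_c$. It is not the quaternion variant of [36], and the function name and the small constant `eps` are assumptions introduced here.

```python
import math

def differential_excitation(patch, eps=1e-6):
    """Differential excitation of the center pixel of a 3x3 patch.

    Assumed definition: atan(sum_i (x_i - x_c) / x_c) over the 8 neighbours.
    eps is a hypothetical guard against a zero-valued center pixel.
    """
    if len(patch) != 3 or any(len(row) != 3 for row in patch):
        raise ValueError("expected a 3x3 patch")
    center = patch[1][1]
    neighbours = [patch[r][c] for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    ratio = sum(n - center for n in neighbours) / (center + eps)
    return math.atan(ratio)

# A flat patch carries no excitation; a bright isolated pixel gives a negative value.
flat = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
spot = [[1.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 1.0]]
xi_flat = differential_excitation(flat)
xi_spot = differential_excitation(spot)
```

Because the arctangent bounds the response to the interval $(-\pi/2, \pi/2)$, the values can be binned directly into the local histogram mentioned above.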

3.3. Kernel based classifiers

An elegant and efficient way of solving non-linear classification problems is to use kernel methods. The basic idea of a kernel method is to map the data from the original feature space to a more convenient feature space (often with higher dimensionality) by applying a non-linear mapping function. A kernel function is used to compute inner products in the high dimensional feature space without explicitly knowing the mapping function and without explicitly transforming the data to that space. The benefit of the kernel trick is that non-linear problems can be solved with linear algorithms in the obtained feature space. The best-known kernel based machine, widely used in HSI classification, is SVM. Different versions of SVM such as subspace based SVM [37] and adaptive boosting [38] have been introduced. Other classifiers have also used the kernel method to improve their efficiency. For example, the kernelized versions of the representation based classifiers such as SRC [39] and the collaborative representation based classifier (CRC) [40,41] have been used to improve HSI classification. Transforming to a higher dimensional space can increase the discrimination ability. An illustration of mapping to a higher dimensional feature space is shown in Fig. 7 [42]. According to this figure, two classes that are not linearly separable in the original space become linearly separable in the mapped space. So, kernel approaches can be useful for solving non-linear problems.

Assume a HSI containing N pixels with d spectral bands, $\{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$. Given the non-linear mapping function $\varphi(x)$, the data samples are mapped from the input space to a high dimensional feature space as follows [43]:

$$\varphi : \mathbb{R}^d \to \mathbb{H}, \qquad x \mapsto \varphi(x)$$

where $\mathbb{H}$ is the Hilbert space, i.e., the mapped feature space. The main idea of the kernel based methods, known as the kernel trick, is to define a kernel function in the input space that indirectly computes the dot product in the mapped feature space:

$$K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{4}$$

where $K(\cdot,\cdot)$ denotes the kernel function and $\langle \cdot,\cdot \rangle$ represents the dot product. The kernel function satisfies the conditions of Mercer's theorem, i.e., it is symmetric, positive semi-definite and continuous [44]. Some of the popular kernel functions are [45]:

Linear kernel:

$$K(x_i, x_j) = \langle x_i, x_j \rangle \tag{5}$$

Polynomial kernel:

$$K(x_i, x_j) = \left(\langle x_i, x_j \rangle + 1\right)^d; \quad d \in \mathbb{Z}^+ \tag{6}$$
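The kernel trick of (4) can be checked numerically on a small case. The sketch below is an illustration added here, not taken from the cited works: for two-dimensional inputs and polynomial degree 2, the explicit feature map `phi_poly2` (a name introduced for this example) reproduces the polynomial kernel of (6) as an ordinary dot product in a six-dimensional space.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    """Linear kernel of Eq. (5)."""
    return dot(x, z)

def polynomial_kernel(x, z, degree=2):
    """Polynomial kernel of Eq. (6)."""
    return (dot(x, z) + 1.0) ** degree

def phi_poly2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = polynomial_kernel(x, z, degree=2)   # computed in the input space
k_explicit = dot(phi_poly2(x), phi_poly2(z))     # computed in the mapped space
```

Both values agree, although the implicit form never builds the six-dimensional vectors. For higher degrees and for hundreds of spectral bands the mapped space grows quickly, which is where avoiding the explicit mapping pays off.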


Fig. 7. An illustration of mapping to a higher dimensional feature space in the kernel based methods.

Gaussian or radial basis function (RBF) kernel:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right); \quad \sigma \in \mathbb{R}^+ \tag{7}$$

The most widely used kernel in HSI classification problems is RBF because its Fourier transform is also Gaussian and it is translation invariant. Assume each sample $x_i$ has a label $y_i$, where the training set is $\{(x_i, y_i);\ i = 1, \ldots, N\}$ and $y_i \in \{-1, +1\}$. The aim of kernel based classifiers such as SVM is to find a classification hyper-plane with maximum margin in the Hilbert space. For example, the optimization objective function of the standard binary SVM can be formulated as [46]:

$$\begin{aligned}
\max\ & L(\alpha_i, \alpha_j) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{N} \alpha_i \\
\text{s.t.}\ & \alpha_i, \alpha_j \in [0, c],\ \forall i, j = 1, \ldots, N; \quad \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned} \tag{8}$$

where $\alpha_i$ and $\alpha_j$ denote the Lagrange multipliers. The samples $x_i$ whose associated $\alpha_i$ are non-zero are called support vectors. The support vectors determine the decision hyper-plane. The binary SVMs can be implemented in a parallel way for multi-class problems.

The standard SVM uses a single kernel, which does not have the generalization capability to cope with multi-class and multi-dimensional data. Because of the limitations of a single kernel, and to better fit the kernel to the complex structure of the data, multiple kernel learning (MKL) methods have been introduced [47]. MKL has been proposed to explore the information of the HSI with more flexibility than the single kernel based methods [48]. MKL is one of the fusion approaches for combining different sub-features extracted by different operators or acquired by different sensors. The MKL algorithms combine various features to be used in a kernel based task such as regression or classification. The aim of MKL is to generate a composite kernel through a linear or non-linear combination of some base kernels. Each base kernel can exploit a subset of the features or the full set of them. The weights of the basis kernels are all non-negative and sum to one, which keeps the composite kernel positive semi-definite and normalized. In a MKL problem, there are two categories of unknown parameters that have to be solved: (1) the unknown parameters of the original learning problem and (2) the combining weights. The chosen learning approach determines the former, and there are several strategies for determining the combining weights of the basis kernels. These strategies are divided into: (1) criterion based approaches, where a criterion function is used for obtaining the kernel weights, (2) optimization approaches, where the kernel weights are computed by solving an optimization problem, and (3) ensemble approaches, where a new base kernel is iteratively added to the composite kernel until the cost function reaches its minimum. In MKL, some basis kernels are linearly combined to form a convex function where each kernel exploits a subset or the full set of features:

$$K(x_i, x_j) = \sum_{m=1}^{M} \beta_m K_m(x_i, x_j) \quad \text{s.t.} \quad \sum_{m=1}^{M} \beta_m = 1,\ \beta_m \ge 0 \tag{9}$$

where $\beta_m$ indicates the weight of kernel m and M is the number of basis kernels. The objective function of MKL is given by:

$$\begin{aligned}
\max\ & L(\alpha_i, \alpha_j) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \beta_m K_m(x_i, x_j) + \sum_{i=1}^{N} \alpha_i \\
\text{s.t.}\ & \alpha_i, \alpha_j \in [0, c],\ \forall i, j = 1, \ldots, N; \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \quad \sum_{m=1}^{M} \beta_m = 1,\ \beta_m \ge 0
\end{aligned} \tag{10}$$

To combine the basis kernels and obtain a composite kernel, the following three steps are introduced in [58]:

(1) Pixel definition: the pixel is redefined by its spectral features $x_i^w \in \mathbb{R}^{N_w}$ and its spatial features $x_i^s \in \mathbb{R}^{N_s}$, where $N_w$ and $N_s$ are the number of spectral features and the number of spatial ones, respectively.
(2) Kernel construction: any type of kernel can be constructed on $x_i^w$ and $x_i^s$.
(3) Kernel combination: the composite kernel can be computed by a simple summation of the basis kernels in different ways. The kernel containing the spectral features is denoted by $K_w$ and the kernel containing the spatial features by $K_s$. The kernels containing the cross-information between spectral and spatial features are denoted by $K_{sw}$ and $K_{ws}$.

Four different combinations of kernels are reported in [49]:

(1) Stacked features:

$$K_{\{s,w\}} \equiv K(x_i, x_j) \tag{11}$$

where $x_i \equiv \{x_i^w, x_i^s\}$ is the stacked spectral and spatial feature vector.

(2) Direct summation:

$$K(x_i, x_j) = K_s(x_i^s, x_j^s) + K_w(x_i^w, x_j^w) \tag{12}$$

(3) Weighted summation:

$$K(x_i, x_j) = \mu K_s(x_i^s, x_j^s) + (1 - \mu) K_w(x_i^w, x_j^w) \tag{13}$$

where $0 < \mu < 1$ is used to provide a tradeoff between using the spectral features and the spatial ones.

(4) Cross-information:

$$K(x_i, x_j) = K_s(x_i^s, x_j^s) + K_w(x_i^w, x_j^w) + K_{sw}(x_i^s, x_j^w) + K_{ws}(x_i^w, x_j^s) \tag{14}$$
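The first three combinations can be sketched in a few lines. The example below is an illustration under stated assumptions, not the implementation of [49]: both base kernels are taken to be RBF kernels, each pixel is stored as a dictionary with a spectral part `"w"` and a spatial part `"s"`, and the function names and toy values are introduced here. The cross-information kernel of (14) is left out because it requires spectral and spatial vectors of equal length.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel on two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def stacked_kernel(p, q, sigma=1.0):
    """Eq. (11): one kernel on the concatenated spectral + spatial vector."""
    return rbf_kernel(p["w"] + p["s"], q["w"] + q["s"], sigma)

def direct_sum_kernel(p, q, sigma_w=1.0, sigma_s=1.0):
    """Eq. (12): unweighted sum of the spatial and spectral kernels."""
    return rbf_kernel(p["s"], q["s"], sigma_s) + rbf_kernel(p["w"], q["w"], sigma_w)

def weighted_sum_kernel(p, q, mu=0.5, sigma_w=1.0, sigma_s=1.0):
    """Eq. (13): convex combination controlled by 0 < mu < 1."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    k_s = rbf_kernel(p["s"], q["s"], sigma_s)
    k_w = rbf_kernel(p["w"], q["w"], sigma_w)
    return mu * k_s + (1.0 - mu) * k_w

def gram_matrix(pixels, kernel):
    """Kernel matrix over all pixel pairs, as passed to a kernel classifier."""
    return [[kernel(p, q) for q in pixels] for p in pixels]

# Hypothetical pixels: 3 spectral features ("w") and 2 spatial features ("s").
pixels = [
    {"w": [0.2, 0.4, 0.1], "s": [1.0, 0.0]},
    {"w": [0.3, 0.5, 0.1], "s": [0.9, 0.1]},
    {"w": [0.9, 0.1, 0.7], "s": [0.0, 1.0]},
]
K = gram_matrix(pixels, lambda p, q: weighted_sum_kernel(p, q, mu=0.3))
```

One consequence of choosing RBF base kernels with a shared width is that the stacked kernel equals the product of the spectral and spatial kernels, so stacking and summation encode genuinely different similarity notions.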


Generally, in a MKL algorithm, an optimal composite kernel is obtained by combining a few basis kernels constructed from various feature subsets. Finding the optimal kernel is a complex optimization problem. To simplify the problem, a data-dependent MKL framework is proposed in [50]. The framework defines three measures to estimate the goodness of the composite kernel. The measures are based on the similarity between an ideal kernel and the composite one. In addition, meta-heuristic algorithms, which are accurate and fast to run, are used to solve the optimization problem. A review of different types of MKL methods and their solutions is given in [51]. It is shown that MKL results in good performance for heterogeneous features under ill-posed conditions. Various features extracted from different sources convey different meanings and have different statistical significance. Therefore, their roles in the classification problem are different. Consequently, simple feature stacking is not an appropriate choice for HSI classification, while MKL is suitable for handling heterogeneous features.

The generalized composite kernel (GCK) method uses composite kernels with great flexibility in combining spectral and spatial information, without requiring any weight parameters [52]. GCK uses the MLR classifier, which is very flexible in the construction of non-linear kernels and has high control over the generalization capacity through logistic regressors. The GCK method uses the following function as the input of a MLR classifier [53]:

$$h(x_i) = \left[1, K^T(x_i, x_1), \ldots, K^T(x_i, x_N)\right]^T \tag{15}$$

where N is the number of training samples. The kernel $K(x_i, x_j)$ can be a simple stacking of spectral and spatial kernels:

$$K(x_i, x_j) = \left[K_w(x_i^w, x_j^w), K_s(x_i^s, x_j^s)\right]^T \tag{16}$$

in which case $h(x_i)$ will be:

$$h(x_i) = \left[1, K_w(x_i^w, x_1^w), \ldots, K_w(x_i^w, x_N^w), K_s(x_i^s, x_1^s), \ldots, K_s(x_i^s, x_N^s)\right]^T \tag{17}$$

To include the cross-information between the spectral and spatial features, the following cross kernel can be used instead of (16):

$$K(x_i, x_j) = \left[K_w(x_i^w, x_j^w), K_s(x_i^s, x_j^s), K_{ws}(x_i^w, x_j^s), K_{sw}(x_i^s, x_j^w)\right]^T \tag{18}$$

The conventional composite kernels or MKL methods need a convex combination of kernels, while GCK has no convexity restriction. The linear combination of the basis kernels used in the composite kernel, which is included in the MLR objective function, is more flexible than the convex combination of kernels in MKL, which assigns fixed weights to the kernels. GCK allows more freedom for balancing the spatial and spectral information. The multiple feature learning (MFL) method integrates various features extracted by different linear and non-linear transformations to deal with the linear and non-linear boundaries of the present classes [54]. Similar to GCK, MFL also uses the MLR classifier. The MLR classifier and its different versions have some advantages: (1) fast computation, (2) good generalization capability, and (3) an open and flexible structure under which MKL and composite kernels can be simply modeled [55–57]. MLR classifiers directly learn the posterior class probabilities and can effectively cope with the high dimensionality of the HSI.

The kernel based classifiers such as SVM have some advantages [58]: (1) less sensitivity to the number of training samples, since only the samples close to the class boundaries, i.e., the support vectors, are considered, (2) being non-parametric, with no need to know the data distribution, (3) easy implementation, (4) self-adaptivity, (5) a convex cost function that results in an optimal solution, and (6) a fast training stage. The main drawback of the kernel based methods is their sensitivity to the kernel parameters, where an inappropriate choice may cause over-fitting or over-smoothing. Selecting a small value for the width parameter of the kernel may lead to over-fitting. In contrast, assigning a large value may lead to over-smoothing. In addition, appropriate selection of the regularization parameter, which provides a tradeoff between minimizing the training error and maximizing the margin, is highly important.

3.4. Representation based classifiers

The nearest regularized subspace (NRS) classifier incorporates the distance weighted regularization with the nearest subspace classification [59]. In NRS, each testing sample is approximated by a linear combination of the training samples of each class. The class that results in the minimum residual is assigned to the testing pixel. The L2 norm used in the collaborative representation of the samples to be classified provides a closed-form solution. In addition, the L2 norm regularization term copes with the ill-posed condition of the inverse problem. The main disadvantage of NRS is that it ignores the spatial information of the HSI, where two adjacent pixels likely belong to the same class. The joint collaborative representation (JCR) method [60] and the weighted JCR (WJCR) [61] have been proposed to deal with this problem. In JCR and WJCR, the neighboring pixels of the testing sample are simultaneously approximated through a collaborative representation of training samples. While the same weight is assigned to all neighboring pixels in JCR, larger weights are assigned in WJCR to the neighboring pixels that are more similar to the central pixel. JCR/WJCR calculates an average/weighted average of the pixels in a neighborhood window. Although they include the spatial information in the classification process in homogeneous and smooth areas, they may degrade the classification performance in neighborhood regions containing edges and class boundaries. To deal with this disadvantage of JCR and WJCR, the edge-preserving-based collaborative representation (EPCR) has been proposed in [62], which utilizes an edge image for correction of the weights and residual values in the collaborative representation. The edge image is calculated by estimating the discontinuity through all spectral bands.

Two other versions of WJCR, WJCR based on angular separation (WJCR-AS) and WJCR based on median-mean line (WJCR-MML), have been proposed in [63]. WJCR-AS uses the angular separation metric, i.e., the cosine distance, for calculating the weights in WJCR. Although the neighboring pixels more similar to the central pixel should be assigned larger weights, the neighboring samples highly correlated with the center may cause redundancy, leading to classification degradation. So, the weights of the neighbors should be inversely related to the AS metric calculated between the central pixel and its neighbors. The presence of outlier samples in the neighborhood region may cause the weighted mean of the local area to deviate from its real value. To deal with this problem, the WJCR-MML method uses the median-mean line metric instead of the simple mean metric for calculating the weighted mean of each neighboring region. The MML metric can rectify the position of outlying neighbors, and so improves the classification performance.

Sparse representation (SR), as a powerful tool, has recently been widely used in different applications of image processing such as HSI classification. SR is based on the idea that pixels with the same class label have similar spectral characteristics, and thus a testing pixel can be linearly approximated by a small number of training samples of a class. The conventional SR based HSI classifier considers only the spectral information, which is not enough for an accurate classification map. To deal with this problem, the joint SR (JSR) based classifier has been proposed in [64]. JSR is based on the assumption that the pixels in a local region are likely composed of the same materials with similar spectral signatures. So, the JSR classifier, by considering the pixels in a local region and incorporating the spatial information, improves the classification accuracy. However, the conventional JSR classifier ignores the fact that the pixels of a local region may not belong to the same class, in which case the classifier performance is seriously degraded. To deal with this difficulty, it is proposed to use the correlation coefficient measure beside the SR one in the classification process [65]. To this end, the spectral similarity among the testing samples and the training ones is calculated and used beside the residual obtained by the JSR measure to find the class label of the unseen testing samples.

The representation based classifiers do not make any assumption about the statistical distribution of the data and do not need to compute any statistics of the HSI. Thus, they are appropriate classifiers when there is no prior knowledge about the image distribution or there are not sufficient training samples for estimating statistics. The representation based classifiers are non-parametric methods that directly determine the label of a testing pixel using a structured dictionary. The basic idea is that a testing sample can be linearly approximated by the training dictionary. The computed coefficients of this approximation represent how important each dictionary atom is. The representation based classifiers are generally divided into two main categories: the sparse representation based classifier (SRC) and the collaborative representation based classifier (CRC). In the SRC method, the testing sample is sparsely approximated by a few atoms of the dictionary through an L1 minimization problem. In the CRC method, the testing sample is collaboratively represented by all atoms of the dictionary using an L2 minimization problem. SRC provides a compact representation of the HSI at a high computational burden due to the L1-norm optimization problem. In contrast, CRC benefits from the information of all atoms and achieves a closed-form solution through the L2-norm optimization problem. Since SRC and CRC each capture some good characteristics of the HSI, fusion methods can be used to integrate the advantages of both.

In contrast to conventional classifiers, the kernel based methods can significantly improve the classification accuracy by utilizing the complex structure of the given data in the kernel space [66]. In a kernel based method, the samples are projected into a high dimensional space by applying a non-linear mapping. In the new high dimensional feature space, the complex non-linear structure of the data samples, which may not be accurately represented by linear models, is exploited. To benefit from the kernel trick, the kernel based SRC (KSRC) and the kernel based CRC (KCRC) have been proposed and fused together in [67] to achieve the benefits of both sparse and collaborative representation in the kernel space. To this end, the data is first mapped to the kernel feature space. Then, the coefficients of the sparse and collaborative representations are separately computed to find each residual individually. Finally, the obtained residuals are combined through an adjusting parameter. The fused residual is used to determine the class label of each testing sample. The fused method shows higher discriminative ability than the simple SRC and CRC and their kernelized versions. In the following, the two main representation based classifiers, the collaborative representation based classifier (CRC) and the sparse representation based classifier (SRC), are presented.

3.4.1. Collaborative representation based classifier (CRC)

Each testing pixel y can be approximated by using the dictionary of each of the c given classes. The dictionary of each class is composed of atoms constituted by the spectral features, spatial features or spectral-spatial features of the training samples of that class. Let $X_i$ be the subspace or dictionary of class $i$ ($i = 1, \ldots, c$). The testing sample y is estimated individually by each of the subspaces. The class whose subspace (dictionary) best approximates the testing sample is assigned to the testing pixel. The representation of the testing sample y through the subspace of class i is obtained by solving the following objective function [68]:

$$\arg\min_{\alpha_i} \|y - X_i \alpha_i\|_2^2 + \lambda_i \|\alpha_i\|_2^2; \quad i = 1, \ldots, c \tag{19}$$

where $\alpha_i$ and $\lambda_i$ are the weight vector and the regularization parameter, respectively. The regularization parameter provides a tradeoff between the residual term and the regularization one. After some computations, (19) is simplified as follows:

$$\arg\min_{\alpha_i} \left[\alpha_i^T \left(X_i^T X_i + \lambda_i I\right) \alpha_i - 2 \alpha_i^T X_i^T y\right]; \quad i = 1, \ldots, c \tag{20}$$

The derivative of the above function is taken and set to zero to compute the weight vector:

$$\hat{\alpha}_i = \left(X_i^T X_i + \lambda_i I\right)^{-1} X_i^T y; \quad i = 1, \ldots, c \tag{21}$$

Some of the samples of the dictionary are more similar to the testing pixel y. So, it is proposed that larger weights be assigned to the samples that are more similar to the testing one. In other words, the more similar atoms of the dictionary have a larger role in the representation of y. To this end, the distance weighted Tikhonov matrix is defined to adjust and regularize the weight vector by:

$$\Gamma_{y,i} = \begin{bmatrix} \|y - x_{i,1}\|_2 & & 0 \\ & \ddots & \\ 0 & & \|y - x_{i,n_i}\|_2 \end{bmatrix}; \quad i = 1, \ldots, c \tag{22}$$

where $x_{i,k};\ k = 1, \ldots, n_i$ are the samples of dictionary $X_i$ and $n_i$ denotes the number of atoms in $X_i$. The matrix $\Gamma_{y,i}$ contains the Euclidean distances between the testing pixel y and each of the atoms in $X_i$. Then, the optimization problem in (19) is updated as follows:

$$\arg\min_{\alpha_i} \|y - X_i \alpha_i\|_2^2 + \lambda_i \|\Gamma_{y,i} \alpha_i\|_2^2; \quad i = 1, \ldots, c \tag{23}$$

The above problem has the following closed-form solution:

$$\hat{\alpha}_i = \left(X_i^T X_i + \lambda_i \Gamma_{y,i}^T \Gamma_{y,i}\right)^{-1} X_i^T y; \quad i = 1, \ldots, c \tag{24}$$

The representation error in subspace i is computed by calculating the residual [69]:

$$r_i(y) = \|y - \hat{y}_i\|_2 = \|y - X_i \hat{\alpha}_i\|_2; \quad i = 1, \ldots, c \tag{25}$$

The class label of the testing sample, $l_y$, is determined by:

$$l_y = \arg\min_{i = 1, \ldots, c} r_i(y) \tag{26}$$

3.4.2. Sparse representation based classifier (SRC)

Each testing sample y can be sparsely represented by the training samples of the class to which it belongs. The sparse representation of y can be achieved by a linear combination of the c dictionaries of the available classes, $X_i;\ i = 1, \ldots, c$, as follows:

$$y = \sum_{i=1}^{c} X_i \beta_i = X \beta \tag{27}$$

where X denotes the union dictionary composed of the dictionaries of all classes and $\beta$ is the concatenation of all sparse vectors, containing only a few nonzero entries. The sparse vector $\beta$ can be calculated by [70]:

$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 \quad \text{s.t.} \quad \|\beta\|_0 \le L \tag{28}$$

where $\|\cdot\|_0$ indicates the $l_0$ norm, i.e., the number of nonzero entries (called the sparsity level), and L is the given upper bound on the sparsity level. Any greedy pursuit algorithm such as orthogonal matching pursuit [71] can be used for solving this optimization problem. The dictionary matrix X can be decomposed into $X_i;\ i = 1, \ldots, c$ and the sparse vector $\hat{\beta}$ into $\hat{\beta}_i;\ i = 1, \ldots, c$ to obtain the partial estimate of y by individually using each class dictionary. The representation error of each subspace can be computed by:

$$r_i(y) = \|y - \hat{y}_i\|_2 = \|y - X_i \hat{\beta}_i\|_2; \quad i = 1, \ldots, c \tag{29}$$

Similar to CRC, the class label of the testing sample y is given by $l_y = \arg\min_{i = 1, \ldots, c} r_i(y)$.

The representation based classifiers are non-parametric methods that do not require any knowledge about the data distribution. They avoid the heavy computations of a training process and operate directly on the given dictionary. The two main representation based classifiers are CRC and SRC. The CRC method, due to using the $l_2$-norm in its optimization process, is simpler than SRC and results in a closed-form solution. In contrast, SRC, by including the sparsity constraint through an $l_1$-norm in its objective function, avoids involving unrelated and redundant training atoms in the pixel reconstruction.
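The two residual-based decision rules can be sketched with the standard library alone. The code below is a minimal illustration under stated assumptions rather than the implementation of the cited works: each class dictionary is a list of atoms (one list of band values per training sample), the collaborative coefficients use the distance weighted Tikhonov regularization described for this classifier, and the sparse coefficients come from a plain orthogonal matching pursuit. All function names, the values of `lam` and `sparsity`, and the toy spectra are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def residual(atoms, coeffs, y):
    """Euclidean norm of y minus the weighted sum of the atoms."""
    approx = [sum(c * a[d] for c, a in zip(coeffs, atoms)) for d in range(len(y))]
    return math.sqrt(sum((yi - ai) ** 2 for yi, ai in zip(y, approx)))

def crc_residual(atoms, y, lam=0.1):
    """Closed-form coefficients with the distance weighted Tikhonov term."""
    gamma2 = [sum((yi - ai) ** 2 for yi, ai in zip(y, a)) for a in atoms]
    A = [[dot(aj, ak) + (lam * gamma2[j] if j == k else 0.0)
          for k, ak in enumerate(atoms)] for j, aj in enumerate(atoms)]
    alpha = solve(A, [dot(a, y) for a in atoms])
    return residual(atoms, alpha, y)

def classify_crc(dictionaries, y, lam=0.1):
    """Assign the class whose dictionary gives the smallest residual."""
    return min(dictionaries, key=lambda lab: crc_residual(dictionaries[lab], y, lam))

def omp(atoms, y, sparsity):
    """Orthogonal matching pursuit; returns one coefficient per atom."""
    support, coeffs, r = [], [], y[:]
    for _ in range(sparsity):
        # Correlations are normalized because the atoms are not unit-norm.
        scores = [abs(dot(a, r)) / math.sqrt(dot(a, a)) for a in atoms]
        best = max((k for k in range(len(atoms)) if k not in support),
                   key=lambda k: scores[k])
        support.append(best)
        sub = [atoms[k] for k in support]
        gram = [[dot(p, q) for q in sub] for p in sub]
        coeffs = solve(gram, [dot(p, y) for p in sub])  # least squares on the support
        approx = [sum(c * p[d] for c, p in zip(coeffs, sub)) for d in range(len(y))]
        r = [yi - ai for yi, ai in zip(y, approx)]
    beta = [0.0] * len(atoms)
    for k, c in zip(support, coeffs):
        beta[k] = c
    return beta

def classify_src(dictionaries, y, sparsity=2):
    """Sparse code over the union dictionary, then class-wise residuals."""
    labels = [lab for lab, atoms in dictionaries.items() for _ in atoms]
    union = [a for atoms in dictionaries.values() for a in atoms]
    beta = omp(union, y, sparsity)
    def class_residual(lab):
        idx = [k for k, l in enumerate(labels) if l == lab]
        return residual([union[k] for k in idx], [beta[k] for k in idx], y)
    return min(dictionaries, key=class_residual)

# Hypothetical 3-band training spectra for two classes and two test pixels.
dictionaries = {
    "grass": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.1]],
    "soil": [[0.1, 0.8, 0.9], [0.2, 0.7, 1.0]],
}
y_grass = [0.95, 0.25, 0.12]
y_soil = [0.15, 0.75, 0.95]
```

The sketch assumes linearly independent atoms, so the small linear systems stay non-singular; a practical implementation would rely on a numerical library and handle rank deficiency explicitly.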


where 𝑁 (𝑥, 𝑦, 𝜆) =

1 3 2

(2𝜋) 𝜎 3

𝑒

( ) − 12 𝑥2 +𝑦2 +𝜆2 2𝜎

is the Gaussian component and ( )) ( 𝐸 (𝑥, 𝑦, 𝜆) = 𝑒𝑥𝑝 𝑗2𝜋 𝑓𝑥 𝑥 + 𝑓𝑦 𝑦 + 𝑓𝜆 𝜆

(32)

represents the sinusoidal component. (x, y) indicates the spatial indices and 𝜆 denotes the wavelength variable. parameter 𝜎 is used to control the width of the Gaussian function that determines the ﬁlter scale. (fx , fy , f𝜆 ) is the frequency of E(x, y, 𝜆) that shows the central frequency of the 3D Gabor ﬁlter [78]. Although the spectral-spatial features extracted by 3D Gabor ﬁlters leads to an accurate classiﬁcation map, the high number of features and heavy computations limit applicability of 3D Gabor ﬁlter bank. To deal with this diﬃculty, a 3D Gabor phase coding (3D GPC) is introduced in [79] that is used together with a hamming distance based matching for HSI classiﬁcation. To overcome the large volume of Gabor features, the 3D GPC method just exploits the phase features of Gabor instead of the magnitude features. In addition, it just uses the Gabor ﬁlter with certain orientations, i.e., only directions parallel to the spectral axis. These directions involve more discriminative features compared to other ones. The extracted Gabor features are then encoded by a quadrant bit coding algorithm. For classiﬁcation, the nearest neighbor classiﬁer is utilized where the similarity between pixels is measured by the normalized hamming distance matching method. The experiments show good performance of 3D GPC in terms of both generalization ability and computational complexity. In [74], 3D MP, 3D Gabor and 3D LBP [80] are introduced to extract joint spectral-spatial features. The extracted features are then fused through a multi-task sparse representation framework to achieve the classiﬁcation map. According to the sparse theory, each testing pixel can be sparsely approximated by the subspace containing the training samples of the class that belong to it. The class label of the testing sample is determined by checking which class yields the smallest reconstruction error. 
In the multi-task sparse classiﬁer proposed in [74], the label of the testing sample is determined according to the least reconstruction error over the three sets of the obtained 3D spectral-spatial features. The one/two dimensional empirical mode decomposition (EMD) method is extended to three dimensional EMD (3D-EMD) to treat a HSI as a cube [81]. 3D EMD decomposes the HSI into 3D intrinsic mode functions (3D-IMFs) where each of them is a varying oscillation and an extracted 3D feature. Due to the increased burden caused by added dimensions, two approaches are taken. The use of 3D Delaunay triangulation which determiners the extrema distances and the use of separable ﬁlters for envelops generation. In other words, rather than implementation of sophisticated 3D ﬁlter, a 1D ﬁlter is performed three times to acquire the same results as the 3D ﬁlter. Thus, the computational burden is signiﬁcantly reduced. The extracted 3D features are given to a robust multi-task learning classiﬁer where each IMF is taken as a task. 3D implementation of wavelet transforms is proposed for extraction of contextual features from the HSI cube [21] of 3D. In [82], 3D discrete wavelet transform (3D-DWT) is used for spatial feature extraction. The 3D extracted features are fed to the SVM classiﬁer to provide the probabilistic classiﬁcation map. Then, the MRF is used for exploration of local spatial dependencies and correlation among neighboring pixels. After that, the maximum a posterior (MAP) classiﬁcation is formulated. The Bayesian optimization problem is solved by 𝛼-Expansion min-cut algorithm. The 3D-DWT transform can encode the spatial details and approximation of cube in diﬀerent frequencies, scales and orientations. The 3D-DWT transform can be implemented by applying three 1D-DWT in three dimensions of HSI cube: weight and height of spatial dimensions and the spectral dimension (see Fig. 9). 
The 3D scattering wavelet transform is proposed for spatial ﬁltering of HSI through applying 1- a cascade of wavelet decompositions, 2complex modulus, and 3-local weighted averaging [83]. Compared to

Fig. 8. A Gabor ﬁlter bank containing 13 ﬁlters in the frequency domain.

3.5. 3D spectral-spatial feature extraction To exploit the spatial information, various feature extraction methods have been introduced [72]. Among diﬀerent spatial feature extraction methods, it can be referred to the MP, local binary pattern (LBP) [73] and Gabor ﬁlters. MP by exploiting two mathematical operations called erosion and dilation on the principal components of HSI, extracts the geometrical structures with diﬀerent shapes from the HSI. For applying LBP to each single band image, a local region is considered around each pixel of image. Then, a binary code is assigned by comparing the grey level of the central pixel with that of the surrounding pixels. Then, by accounting the occurrence repetition of the obtained pattern over the neighborhood region, the statistical histogram is achieved. The Gabor wavelet transform is a powerful ﬁlter for extraction of texture from a single band image by providing the optimal joint space-frequency resolutions. As said, each of MP, LBP and Gabor methods can be applied to a 2D image. In another words, for implementation of each of them on the HSI, each spectral channel should be treated individually. Due to independent extraction of spatial features from each spectral band, the joint spectral-spatial information contained in the HSI cannot be exploited. To deal with this diﬃculty, 3D spectral-spatial feature extraction is required to reveal the 3D inherent structure of HSI. Morphological ﬁlters analyze an image by applying a 2D structuring element (SE) with speciﬁc shape and size. The conventional morphological ﬁlters are two dimensional and ignore the spectral-spatial dependencies in 3D structure of HSI. To explore the joint spectral-spatial morphological information of HSI, 3D morphological proﬁle (3D-MP) method is introduced in [74]. 3D-MP which is an extension of 2D-MP is directly implemented on 3D HSI cube through using the 3D SEs. Two basics operators of 3D-MP are erosion and dilation. 
The 3D erosion of a HSI with a 3D SE is defined as the minimum of the pixel values inside the 3D SE. The dual operator, 3D dilation, is defined as the maximum of the pixel values contained in the 3D SE. The 3D opening and closing operators are then defined based on the 3D erosion and dilation filters [74]. Due to the 3D structure of the HSI and the tight correlation between spectral and spatial information, 3D Gabor filters are preferred to 2D Gabor filters for HSI analysis [75]. A Gabor filter is computed by modulating a Gaussian function by a sinusoidal one. A filter bank containing 13 filters in the frequency domain is shown in Fig. 8 [76]. A 3D Gabor filter can be defined in the spectral-spatial feature space as follows [77]:

$$G_{f,\varphi,\theta}(x, y, \lambda) = N(x, y, \lambda)\, E(x, y, \lambda) \qquad (30)$$
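As a minimal sketch of the 3D erosion and dilation operators described above (the minimum and maximum of the values covered by a 3D SE), the following pure-Python example uses a flat, box-shaped SE. The function names, the box shape with odd side lengths and the clipping of the SE at the cube borders are illustrative assumptions, not the implementation of [74].

```python
from itertools import product


def morph3d(cube, se_shape, reduce_fn):
    """Slide a flat box SE of shape (rows, cols, bands) over cube[r][c][b].

    reduce_fn is min for erosion and max for dilation. The SE is clipped
    at the cube borders (an assumption made for this sketch).
    """
    rows, cols, bands = len(cube), len(cube[0]), len(cube[0][0])
    hr, hc, hb = (s // 2 for s in se_shape)
    out = [[[0.0] * bands for _ in range(cols)] for _ in range(rows)]
    for r, c, b in product(range(rows), range(cols), range(bands)):
        # Collect every value covered by the SE centred on (r, c, b).
        window = [
            cube[i][j][k]
            for i in range(max(0, r - hr), min(rows, r + hr + 1))
            for j in range(max(0, c - hc), min(cols, c + hc + 1))
            for k in range(max(0, b - hb), min(bands, b + hb + 1))
        ]
        out[r][c][b] = reduce_fn(window)
    return out


def erode3d(cube, se_shape=(3, 3, 3)):
    """3D erosion: minimum value inside the SE."""
    return morph3d(cube, se_shape, min)


def dilate3d(cube, se_shape=(3, 3, 3)):
    """3D dilation: maximum value inside the SE."""
    return morph3d(cube, se_shape, max)


def open3d(cube, se_shape=(3, 3, 3)):
    """3D opening: erosion followed by dilation."""
    return dilate3d(erode3d(cube, se_shape), se_shape)


def close3d(cube, se_shape=(3, 3, 3)):
    """3D closing: dilation followed by erosion."""
    return erode3d(dilate3d(cube, se_shape), se_shape)
```

Because the SE spans the spectral axis as well as the two spatial axes, a bright value that is isolated in both space and wavelength is removed by the opening, which a band-by-band 2D filter would not take into account.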


CNN models become over-trained if there are not enough training samples. This means that the network performs well only on the seen training samples while performing poorly on unseen testing samples. One of the strategies to overcome this problem is to use the generative adversarial network (GAN) [90]. GAN, as a regularization method, involves two models: a generative model (G) and a discriminative one (D). While G generates fake samples that are as similar as possible to the real samples, D classifies samples as real or fake. G and D are trained in an adversarial way where each of them tries to reach its optimal result (the best classification of real and fake samples for D and the generation of fake data that is as realistic as possible for G). The samples generated by GAN can be utilized as virtual samples to improve the classification accuracy [91]. The use of GAN beside deep networks is investigated in [92] and [93]. A 3D GAN in combination with a CNN model is introduced in [94] for spectral-spatial HSI classification. CNN together with GAN can provide better classification performance than the conventional CNN. 3D GLCM [95] and 3D shearlet transforms [96] are other examples of 3D feature extraction from the HSI cube. Tensors are also appropriate mathematical tools for processing 3D images such as HSIs [97].

3.6. Deep learning based classifiers

Deep networks can jointly extract spectral and spatial features from the HSI data. Deep neural networks work by extracting features from the raw input data through layer-by-layer processing. However, conventional deep neural networks have some difficulties: they require a large number of training samples and much effort for tuning the hyper-parameters. These difficulties are overcome by the deep network proposed in [98]. It uses multi-grained cascade forests where the output of each cascade level is passed to the next level for further processing. Two types of forests are used in each level to increase diversity. Training the multi-grained cascade forest is much easier than training a conventional deep neural network. Several different features are first extracted for each pixel, and then the obtained features are given to the deep random forest classifier. By utilizing the information of neighboring pixels, the spectral-spatial information is fused effectively to improve the class discrimination. The last layer of the network determines the classification probabilities. The original spectral features, the features extracted by the discrete cosine transform, the features of a wavelet transform and the extended morphological profile are used as the input of the deep random forest classifier. There are different types of deep learning networks. CNN is among the best-known networks used for HSI feature extraction and classification [99]. The convolutional and pooling layers are stacked alternately, where the output of each layer is given as input to the subsequent layer. In the end, the final feature map is given to a fully connected layer (FCL) to form the final feature vector for classification through a softmax layer [100].
While the shallower convolutional layers explore the detailed structures of objects such as edges, more abstract features are mined from the deeper convolutional layers. An example of patch based processing of HSI for pixel based classification by using the CNN model is shown in Fig. 10. A CNN based spectral-spatial feature fusion method is proposed in [101]. The framework consists of three steps: the use of a local spatial constraint and a non-local spectral constraint for sample augmentation; feature fusion; and classification. In the first step, the local spatial constraint utilizes the contextual information of adjacent pixels in a local neighborhood region, while the non-local spectral constraint uses the spectral similarity of pixels in a non-local way. A multi-layered CNN is used for spectral-spatial feature fusion in the second step. The softmax layer is finally used for multi-decision classification. A unified loss function is used to jointly optimize the classification step and the preceding step that learns the fused spectral-spatial features.

Fig. 9. Structure of a 3D-DWT transform.

3D DWT and 3D Gabor filters, the 3D scattering wavelet transform has two benefits. First, due to the cascade of wavelet decompositions in multiple orientations and scales, rich descriptions of sophisticated structures are provided for HSI classification. Second, the local weighted averaging reduces the feature variability and leads to local consistency of pixel labels in the neighborhood regions. Although CNNs have a high capability for extracting features from low to high levels, they lack multi-resolution filtering. On the other hand, 3D wavelets provide 3D characterizations of the HSI at multiple resolutions in the frequency-space domains. So, the authors in [84] combine the advantages of 3D wavelets and CNN to adaptively extract 3D features at different scales and depths. 3D deep networks have been utilized for joint extraction of spectral and spatial features [85]. The 3D CNN has been introduced for direct extraction of deep spectral-spatial features of the HSI [86,87]. These networks process the raw HSI. Although they provide promising results, their performance degrades when deeper networks are used. To overcome this problem, a spectral-spatial residual network (SSRN) is proposed in [88] that is composed of consecutive learning blocks. The residual blocks, as an extension of the convolutional layers used in CNN models, are designed for extraction of discriminative spectral-spatial features. The SSRN model allows a deeper structure compared to previous 3D-CNNs. By providing shortcut connections between every other convolutional layer, the SSRN model leads to robust learning of spectral-spatial representations of the HSI. The SSRN method proposed in [88] forms a contextual CNN by incorporating residual learning into fully convolutional layers, and investigates appropriate residual architectures to provide robustness in various scenarios.
The authors in [13] have proposed that the spatial feature map constituted by the super-pixels can be further processed for spectral-spatial feature extraction by applying a 3D recurrent CNN. The framework exploits the continuity of image pixels and suppresses noise. One of the serious problems of CNN methods is over-fitting [89]. This problem is due to the huge number of learnable parameters in the network, which require a lot of training samples. Because gathering training samples is expensive and time consuming in the remote sensing community, the availability of only limited training samples is a common situation.


Fig. 10. An example of patch based processing of HSI classiﬁcation by using the CNN model.
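The patch based processing of Fig. 10 can be illustrated with a short sketch: for every pixel to be classified, a small spatial window containing all spectral bands is cut from the cube and passed to the CNN. The function name `extract_patch`, the nested-list cube layout `cube[row][col][band]` and the mirror reflection used at the image border are assumptions made for this example only; the surveyed methods may handle borders differently.

```python
def extract_patch(cube, row, col, size):
    """Return the size x size x bands neighbourhood centred on (row, col).

    cube[r][c] is the list of band values of pixel (r, c). Pixels outside
    the image are filled by mirror reflection (an assumption); size must
    be odd and smaller than both spatial dimensions.
    """
    rows, cols = len(cube), len(cube[0])
    half = size // 2

    def reflect(i, n):
        # Map an out-of-range index back into [0, n - 1] by mirroring.
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    return [
        [
            list(cube[reflect(row + dr, rows)][reflect(col + dc, cols)])
            for dc in range(-half, half + 1)
        ]
        for dr in range(-half, half + 1)
    ]


def build_patch_dataset(cube, labelled_pixels, size):
    """Pair each labelled pixel (row, col, label) with its spectral-spatial patch."""
    return [
        (extract_patch(cube, r, c, size), label)
        for r, c, label in labelled_pixels
    ]
```

The label of the central pixel is attached to the whole patch, so the network sees the spectral signature of the pixel together with its spatial context.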

A method for using a pre-trained CNN such as VGG-VD16 or AlexNet is introduced in [102]. Different layers of a CNN contain complementary information extracted from the input image. While the shallower layers capture low-level visual features such as edges, the deeper layers reflect more abstract or semantic features. A CNN model containing multiple layers therefore conveys complementary information that can be used to improve HSI classification. The multi-layer stacked covariance pooling (MSCP) method proposed in [102] is implemented in three steps. First, a pre-trained CNN model is applied to the HSI to extract the feature maps in multiple layers. Second, the covariance matrix of the stacked feature maps is calculated. The covariance matrix implicitly fuses the feature maps containing the complementary information, where each entry represents the covariance between two different and likely complementary feature maps. Third, the calculated covariance matrix is used as the extracted feature fed to an SVM classifier. An unsupervised deep learning based feature extraction method is proposed in [103] that fuses multi-scale spectral-spatial features. At first, the pre-trained VGG16 network is used for extraction of multi-scale spatial information. Then, a sparse auto-encoder network is used for fusing the raw spectral bands with the extracted spatial features. The extracted features are fed to the SVM classifier to find the classification map. DeepLab was first introduced in [104] for semantic segmentation. Because HSI classification is very similar to semantic segmentation, DeepLab is chosen for HSI classification in [105]. To implement DeepLab for HSI, at first, the maximum noise fraction (MNF) method [106] is applied to the HSI to find its first several principal components (PCs). The first PCs are used as the label image for DeepLab training. DeepLab extracts the spatial features of the HSI.
After that, the z-score is used to normalize both the original spectral bands and the extracted spatial features. Then, a weighted fusion rule is applied to combine the spectral and spatial information. The fused features are finally given to an SVM classifier. This deep feature fusion method has some advantages with respect to other deep learning methods: it extracts spatial features at multiple scales; it does not use patch based feature learning; and it avoids spatial resolution reduction. The experiments show the superior performance of the DeepLab based feature extraction method, especially when there are small scale classes containing limited pixels. Guided filters are used for extraction of spatial features with multiple scales from the HSIs in [107]. The extracted spectral-spatial features are then given to a CNN model for classification. The guided filter, which involves a local linear model, acts as an edge preserving smoothing operator. To compute the filter output, a guidance image is required. The content and structures of the guidance image are transferred to the filtered output. Three principal components of the HSI are obtained and considered as the guidance image. The PCA transformation is also applied to the HSI for feature reduction. The reduced dimensionality HSI is transformed by applying several guided filters with different scales. The extracted spatial feature maps with different scales are stacked together to generate an image cube containing the spatial features. The spatial feature vector of each pixel is reshaped to form a 2D image. This image is given as the input of a CNN for classification. The CNN adopts regularization and dropout to deal with the over-fitting problem in limited training situations. The guided filter based spectral-spatial feature fusion method has a simple implementation and shows good classification accuracy. In addition, the multi-scale spatial features extracted by filters with different scales, fed to the CNN model, make full use of the spatial features.

The advantages of the neural network based classifiers are as follows: they do not require any prior knowledge about the statistical distribution of data, they are data-driven, and they have a high capability for non-linear feature extraction and classification. The neural network based classifiers also have the following drawbacks: 1- they have a heavy training process where a high volume of training samples must be given to the neural network over many epochs to allow good learning (however, a neural network is fast in the testing stage); 2- the neural networks are not stable for the same training set. In other words, in each repetition of the classification step, a different result with a different classification accuracy is achieved. So, it is usually necessary to run the classifiers multiple times and report the average results. Deep learning methods belong to the family of neural networks. Among various deep networks, CNNs are very popular in HSI classification. Although 1D-CNN provides less accurate classification results than conventional classifiers such as SVM, 2D- and 3D-CNNs can effectively explore the spatial dependencies of a HSI by utilizing the local connections [58]. Among the drawbacks of deep networks, it can be mentioned that designing a suitable deep network is an open issue, where the layer structure, the number of filters, the cost function and the settings of the hyper-parameters have a strong effect on the output.

4. Decision fusion method

Due to the intrinsic limitation of each single feature set, HSI classification methods that use just a single feature set ignore some valuable information and lose fine details. To improve the classification accuracy, it is proposed to use several feature sets containing complementary information to avoid information loss. In decision fusion, which is a high level fusion, separate decisions are drawn based on individual feature sets, and then the results are combined to reach a global decision. The general block diagram of the decision fusion approach is shown in Fig. 11. In the general form, the spectral bands of a HSI are given to M different feature extraction methods to extract features with various views [108]. In other words, M feature extractors can be used to find M different feature subsets. Then, each obtained subset is given to a classifier to find a local decision. The final classification map is achieved by applying a decision fusion rule such as the majority voting (MV) rule. In another form, one feature extractor can be used instead of M feature extraction methods to obtain one subset of features, and then the decision is obtained by M different classifiers. The block diagram of the MBFSDA method [109] is shown in Fig. 12. At first, n PCs of the HSI with d spectral bands are extracted by the PCA transform. An MP is constituted from each PC. Each MP contains spatial


Fig. 11. General block diagram of the decision fusion approach.

Fig. 12. Block diagram of the MBFSDA method as a decision fusion approach [109].

features such as size and shape information of contextual structures. Then, the FSDA transform containing three measures is implemented in three steps: 1- maximizing the between-band scatters, where the bands are the morphological spatial features here, 2- maximizing class discrimination through maximizing the between-class scatters and minimizing the within-class scatters and 3- extraction of spatial features by using the projection matrix computed based on the three aforementioned measures. The extracted features of each subset are given to a classifier to find the local decision. The SVM and nearest neighbor (NN) classifiers are used in this step. The final classification map is acquired by the MV rule (for more details, the interested reader is referred to [109]). Another decision fusion approach has been proposed in [110], where band partitioning is used instead of feature extraction for producing the feature subsets. At first, the HSI cube is partitioned into some smaller sub-cubes with the same spatial size but with a lower number of adjacent spectral bands. The benefit of band partitioning in the band selection method with respect to the feature extraction methods is that the physical meaning of the spectral channels is preserved. In addition, no feature reduction is applied here. Then, the redundant and noisy information of each sub-cube is removed by applying the defined smoothing filters. Then, the useful spatial features of each cleaned sub-cube are obtained by applying morphological filters. The SVM classifier is individually used for classification of each subset of features; and finally, the classification map is obtained by the MV rule. In this example of decision fusion, the same feature extraction method (morphological filter) and the same classifier (SVM) are used for classification of each subset of the HSI cube. Fig. 13 illustrates what is explained above. A decision level fusion method has been proposed for HSI classification in [111]. Two categories of features are used for HSI classification. The first category contains the spectral reflectance curves, which provide a global view of the HSI. The second category includes absorptions, which are considered as a local view of the HSI. Absorptions are the valleys in the spectral reflectance curve that result from absorption by the constituent molecules or atoms of materials. The absorption features are binary values assigned to each spectral band of the HSI. In other words, a binary vector is associated with each pixel, where a value of 1 is assigned to a band if an absorption valley appears in that band and a value of 0 is assigned if no absorption is detected [112]. To extract absorption features, at first, the spectral curves of the HSI are normalized to [0, 1]; and then, a peak detection algorithm is applied to detect the absorption dips. To avoid noise, two criteria are considered for absorption determination: 1- an absorption point must have a depth of more than 0.005 and 2- an absorption point must appear in more than half of the samples in each class. Since the reflectance features show more accurate classification results than the absorption ones, the classification is first done using the reflectance features and the SVM classifier. If the result is satisfactory, it is reported as the final result; otherwise, the classification result obtained by the absorption features using a multi-label classification method is reported as the final result. The satisfaction measure for classification accuracy is based on an entropy measure, where higher uncertainty (entropy) means lower accuracy.
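The entropy based acceptance step described above can be sketched as follows. This is only an illustration: the use of Shannon entropy over the class probabilities of the reflectance based classifier, the threshold value `max_entropy` and the function names are assumptions, since the exact satisfaction measure of [111] is not given here.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (natural log) of a class probability vector."""
    return abs(-sum(p * math.log(p) for p in probs if p > 0.0))


def fuse_decision(reflectance_probs, absorption_label, max_entropy=0.5):
    """Keep the reflectance based decision when it is certain enough.

    reflectance_probs: class probabilities from the reflectance classifier.
    absorption_label: class index predicted from the absorption features.
    max_entropy: hypothetical threshold; higher entropy means higher
    uncertainty, so the absorption based decision is used instead.
    """
    if shannon_entropy(reflectance_probs) <= max_entropy:
        # Low uncertainty: report the most probable reflectance class.
        return max(range(len(reflectance_probs)),
                   key=reflectance_probs.__getitem__)
    return absorption_label
```

A confident probability vector such as [0.9, 0.05, 0.05] keeps the reflectance decision, while a flat one such as [0.4, 0.3, 0.3] falls back to the absorption based label.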


Fig. 13. A decision fusion approach using band partitioning [110].
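The band partitioning step of Fig. 13 can be sketched as below: the cube is cut along the spectral axis into sub-cubes that keep the full spatial size and contain adjacent bands. Splitting into groups of (nearly) equal width, the function name `partition_bands` and the nested-list layout `cube[row][col][band]` are assumptions of this example; [110] may choose the partition boundaries differently.

```python
def partition_bands(cube, n_parts):
    """Split cube[r][c][b] into n_parts sub-cubes of adjacent bands.

    Every sub-cube keeps the spatial size of the input. When the number of
    bands is not divisible by n_parts, the first sub-cubes get one extra band.
    """
    bands = len(cube[0][0])
    base, extra = divmod(bands, n_parts)
    edges = [0]
    for p in range(n_parts):
        edges.append(edges[-1] + base + (1 if p < extra else 0))
    return [
        [[pixel[edges[p]:edges[p + 1]] for pixel in row] for row in cube]
        for p in range(n_parts)
    ]
```

Each returned sub-cube would then be filtered and classified on its own, and the resulting label maps fused by a voting rule.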

A decision fusion based spectral-spatial HSI classification method is proposed in [113]. There is a high probability that adjacent pixels in a neighborhood region belong to the same class. So, to exploit the spatial information contained in the neighborhood regions, several cubes with different scales are considered around each central pixel. Then, the matrix of each sub-cube that contains information related to a specific spectral band is reshaped to a vector. In other words, the spectral features of each pixel are reshaped to a row vector, and then each sub-cube is converted to a matrix containing spectral-spatial features. A matrix is thus obtained from each sub-cube associated with each pixel, containing its spectral-spatial information. The robust matrix discriminative analysis (RMDA) [113,114] is then applied for joint spectral-spatial feature extraction from each matrix. Due to various degradations such as missing data, noise contamination, and calibration errors, each matrix associated with pixel (i, j) can be decomposed as $X_{ij} = Y_{ij} + E_{ij}$, where $Y_{ij}$ denotes the clean data and $E_{ij}$ represents noise. A de-noising model inspired by the unmixing method is applied to each $X_{ij}$ to obtain the clean matrix $Y_{ij}$. Then, the MDA model is used for feature extraction from $Y_{ij}$, where, like LDA, it maximizes the between-class scatters and minimizes the within-class scatters. The features extracted by RMDA are then given to an SVM classifier. Eventually, the classification maps obtained from the different sub-cubes with different scales are fused together through an MV rule to generate the final classification map. MLR is an appropriate classifier for ill-posed conditions where a low number of training samples is available. MLR models the posterior class distribution over the whole image in a Bayesian framework [115]. The subspace version of MLR, called MLRsub [116], is based on the idea that the samples of each class can be approximated in a subspace with a lower dimension.
It uses a projection to find that subspace. Since the HSI normally lies in a much lower dimensional space, MLRsub provides good results for this type of data. The authors in [117] use MLRsub for locally and globally learning the posterior probabilities for each HSI pixel. A probabilistic SVM is used as an indicator to detect the number of mixed components in each pixel. The obtained number of mixed components is used for local probability learning in MLRsub. Then, a decision fusion rule is used to fuse the global probabilities and the local ones obtained by the MLRsub method. Finally, the MRF regularization is applied to achieve the classification map. Different spectral feature extraction methods are used for extraction of diverse features of the HSI with various views [118]. The morphological filters are applied to each set of the extracted features to implicitly fuse the spatial features with the spectral ones. Each obtained MP is given to a classifier and the classification outputs are fused through the MV rule to find the final output map. Support vector data description (SVDD) was first proposed for one-class classification or target detection problems. SVDD is inspired by the two-class SVM. SVDD obtains a minimum boundary as a hyper-sphere around the target samples. The obtained hyper-sphere is used to determine whether a new sample belongs to the targets or not. The sphere volume is minimized while trying to include all training samples in the sphere. SVDD is sparse, is capable of using kernels and has good generalization. The SVDD classifier is used for multi-class HSI classification by applying an ensemble of multiple one-class classifiers and performing decision fusion in [119]. An ensemble method uses multiple learning techniques to improve the classification performance. Several classifications are done and their results are combined using a specific fusion rule. The main point in an ensemble method is to integrate the results of individually accurate and diverse classifiers. The ensemble methods are generally divided into three main categories: 1- data (sample) level combination where different training sets are given to the same classifier, 2- feature level combination where various feature sets are extracted or selected, and then combined and given to a classifier, 3- classification level combination where different classifiers with the same training set are used. In all ensemble methods, using an appropriate fusion rule is very important. There are two general fusion techniques: linear combination and non-linear combination. The non-linear methods involve a non-linear function such as power, multiplication or exponential. However, the simple linear methods, where a weighted or non-weighted combination is used, are more common. In the voting rule, the alternative with the majority of votes is selected. In the un-weighted MV rule, the same weight is assigned to all base classifiers, while in the weighted MV rule, a different weight is assigned to each base voter. The weights of the base classifiers should be proportional to their classification accuracies. In [119], the weights are defined based on the average of all correlation coefficients obtained by each part of the predicted class labels obtained by individual SVDDs.
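The un-weighted and weighted MV rules described above can be sketched in a few lines. The function names and the tie-breaking behaviour (the label that received a vote first wins a tie) are assumptions of this example, and the weights are simply passed in by the caller; the correlation based weights of [119] are not reproduced here.

```python
from collections import defaultdict


def majority_vote(labels, weights=None):
    """Fuse the labels proposed by several classifiers for one pixel.

    With weights=None every classifier has the same weight (un-weighted MV);
    otherwise weights[i] is the vote strength of classifier i.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    # Ties are resolved in favour of the label that was voted for first.
    return max(scores, key=scores.get)


def fuse_maps(label_maps, weights=None):
    """Apply the voting rule pixel by pixel to a list of 2D label maps."""
    rows, cols = len(label_maps[0]), len(label_maps[0][0])
    return [
        [majority_vote([m[r][c] for m in label_maps], weights)
         for c in range(cols)]
        for r in range(rows)
    ]
```

For example, with three classifiers voting 1, 2 and 2, the un-weighted rule returns 2, whereas weights of 0.9, 0.3 and 0.3 let the single more reliable classifier win and return 1.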
The proposed weighted voting rule provides better classification results compared to the conventional MV rule.

5. Experiments

A brief summary of the advantages and disadvantages of different spectral-spatial fusion methods for hyperspectral image classification is given in Table 1. For each subgroup of the three types, one method is given as an instance. The performance of these methods is assessed in terms of classification accuracy and computation time using three real and popular hyperspectral images: the well-known Indian Pines, University of Pavia and Salinas. The Indian scene was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Northwestern Indiana in June of 1992. This image, with a size of 145 × 145 pixels, contains 16 agricultural and forest classes, of which the 10 largest classes are chosen for our experiments. This dataset comprises 224 spectral channels, where 200 spectral bands remain after removing 20 water absorption bands. The Pavia image was collected by the Reflective Optics System Imaging Spectrometer (ROSIS). It contains 610 × 340 pixels with nine classes from an urban area. Its 115 spectral bands are reduced to 103 bands after removal of the noisy bands. The Salinas image was acquired by AVIRIS over the valley of Salinas located in Southern California in 1998. It is a 512 × 217 image containing 16 classes and 204 bands after removing 20 absorption bands. Average accuracy, overall accuracy and kappa coefficient measures [22] are used to assess the classification accuracy. The methods given as instances in Table 1 are compared with each other. The performance of the classifiers is assessed in two different cases: 1) using a small training set (10 samples per class) and 2) using a relatively large training set (50 samples per class). The classification results for the Indian dataset obtained by 10 and 50 training samples are reported in Tables 2 and 3, respectively. The ground truth map (GTM) and the corresponding classification maps are also shown in Figs. 14 and 15.

Table 1 Advantages and disadvantages of different fusion methods for hyperspectral image classification.

| Group | Sub-group | Example method | Reference and year | Advantages | Difficulties/Disadvantages |
|---|---|---|---|---|---|
| Segmentation based methods | Object-based classification | Pixon-based classifier | [6] (2016) | The noisy pixels are deleted in the classification maps and an applicable smooth map is achieved for land cover. | Anomaly pixels may be removed (anomalies are important but rare pixels with a different spectral signature with respect to the background); determination of super-pixels or objects with appropriate shape and size and edge preserving is a challenging problem. |
| Segmentation based methods | Relaxation of classification map | MRF+HSRM | [3] (2016) | (as in the row above) | (as in the row above) |
| Feature fusion | Features stacking | APFSDA | [22] (2017) | Simple implementation; efficient if appropriate spatial features are selected. | High dimensionality of the fused feature vector may need feature reduction; high computations. |
| Feature fusion | Joint spectral-spatial feature extraction | MSPP | [35] (2019) | The use of high correlation among spectral and spatial information. | High computations. |
| Feature fusion | Kernel based classifiers | GCK | [52] (2013) | Less sensitive to the number of training samples; having a convex cost function; fast training stage. | Sensitive to the kernel parameters. |
| Feature fusion | Representation based classifiers | WJCR-AS | [63] (2017) | Without any assumption about the statistical distribution of data; the use of relations among pixels from both local and global points of view. | High computations of solving the optimization problem, especially if the l0-norm or l1-norm is used. |
| Feature fusion | 3D spectral-spatial feature extraction | 3D-Gabor | [76] (2010) | Preserves the intrinsic 3D structure of the hyperspectral image. | High computations. |
| Feature fusion | Deep learning based classifiers | RPNet | [142] (2018) | Simultaneous feature extraction and classification in a unified framework; high ability in feature extraction (detailed features in shallow layers and semantic ones in deep layers). | Overfitting problem with insufficient training samples; high computations in the training phase if the used network is deep. |
| Decision fusion | Decision fusion | MBFSDA | [109] (2018) | The use of complementary information and votes of several powerful classifiers. | Selection of appropriate feature extractors or classifiers (decision makers) with minimum overlapping and redundancy is a challenging problem. |

Table 2 Classification results for Indian dataset (10 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Corn-no till | 1434 | 41.42 | 62.34 | 64.71 | 77.34 | 63.46 | 76.50 | 48.12 | 60.32 | 70.57 |
| 2 | Corn-min till | 834 | 33.93 | 61.15 | 95.56 | 88.73 | 89.93 | 93.41 | 55.16 | 73.38 | 81.06 |
| 3 | Grass/pasture | 497 | 71.43 | 64.19 | 82.09 | 81.29 | 79.68 | 88.33 | 63.78 | 93.56 | 81.09 |
| 4 | Grass/trees | 747 | 23.16 | 64.12 | 99.06 | 90.09 | 96.92 | 95.72 | 93.04 | 84.20 | 82.06 |
| 5 | Hay-windrowed | 489 | 87.73 | 99.39 | 99.59 | 100.00 | 99.59 | 99.80 | 94.89 | 96.73 | 99.39 |
| 6 | Soybeans-no till | 968 | 65.81 | 90.70 | 73.35 | 90.29 | 70.35 | 92.98 | 50.21 | 86.47 | 77.48 |
| 7 | Soybeans-min till | 2468 | 76.09 | 74.31 | 88.01 | 76.01 | 72.49 | 81.77 | 52.23 | 81.48 | 66.37 |
| 8 | Soybeans-clean till | 614 | 22.96 | 44.46 | 74.59 | 92.67 | 66.45 | 83.71 | 44.79 | 60.10 | 73.94 |
| 9 | Woods | 1294 | 95.44 | 97.53 | 98.84 | 98.45 | 88.79 | 99.46 | 65.84 | 58.73 | 90.80 |
| 10 | Bldg-Grass-Tree-Drives | 380 | 91.58 | 93.42 | 96.84 | 91.32 | 93.95 | 99.74 | 97.89 | 95.79 | 59.21 |
| | Average accuracy | | 60.96 | 75.16 | 87.27 | 88.62 | 82.16 | 91.14 | 66.59 | 79.08 | 78.20 |
| | Overall accuracy | | 62.45 | 74.96 | 85.83 | 85.91 | 78.67 | 88.60 | 60.67 | 75.94 | 76.42 |
| | Kappa | | 56.70 | 71.15 | 83.63 | 83.87 | 75.56 | 86.87 | 55.06 | 72.32 | 72.95 |
| | Computation time (seconds) | | 1.28 | 0.20 | 9.49 | 129.82 | 0.97 | 16.30 | 64.56 | 2.32 | 3.42 |

Table 3 Classification results for Indian dataset (10 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Corn-no till | 1434 | 57.11 | 79.64 | 90.73 | 91.21 | 85.63 | 92.75 | 77.68 | 94.07 | 92.19 |
| 2 | Corn-min till | 834 | 61.87 | 94.00 | 98.80 | 99.76 | 99.28 | 99.88 | 87.65 | 92.93 | 92.57 |
| 3 | Grass/pasture | 497 | 72.43 | 96.98 | 96.58 | 98.99 | 94.77 | 97.18 | 92.76 | 97.79 | 96.18 |
| 4 | Grass/trees | 747 | 98.39 | 98.93 | 99.87 | 99.87 | 97.86 | 100.00 | 100.00 | 99.06 | 97.72 |
| 5 | Hay-windrowed | 489 | 87.73 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 98.98 | 99.39 | 100.00 |
| 6 | Soybeans-no till | 968 | 80.68 | 81.30 | 86.47 | 91.74 | 88.64 | 94.21 | 85.54 | 91.43 | 88.22 |
| 7 | Soybeans-min till | 2468 | 74.76 | 82.74 | 95.54 | 89.14 | 86.18 | 97.33 | 72.53 | 89.47 | 78.00 |
| 8 | Soybeans-clean till | 614 | 50.49 | 94.95 | 93.49 | 95.77 | 92.02 | 99.19 | 82.90 | 92.18 | 92.35 |
| 9 | Woods | 1294 | 96.14 | 89.57 | 99.61 | 98.84 | 99.07 | 99.69 | 92.89 | 96.60 | 95.52 |
| 10 | Bldg-Grass-Tree-Drives | 380 | 53.16 | 95.79 | 100.00 | 97.89 | 98.16 | 100.00 | 99.47 | 98.95 | 86.32 |
| | Average accuracy | | 73.28 | 91.39 | 96.11 | 96.32 | 94.16 | 98.02 | 89.04 | 95.19 | 91.91 |
| | Overall accuracy | | 74.46 | 88.13 | 95.40 | 94.54 | 92.05 | 97.43 | 84.77 | 93.79 | 89.47 |
| | Kappa | | 70.65 | 86.34 | 94.67 | 93.69 | 90.83 | 97.02 | 82.57 | 92.82 | 87.88 |
| | Computation time (seconds) | | 2.15 | 0.20 | 13.99 | 161.95 | 2.48 | 46.31 | 66.11 | 6.64 | 5.41 |

According to the obtained results, the following conclusions can be drawn:

1) The best classification results in both cases of 10 and 50 training samples are achieved by WJCR-AS. The WJCR-AS classifier is a weighted version of the joint collaborative representation based method that benefits from the angular separation metric. It is a non-parametric classifier that is less sensitive to the number of training samples, uses the spatial information of local regions with appropriate weights and decreases the redundant information of the highly correlated spatial neighbors.

2) After WJCR-AS, MSPP ranks second with 10 training samples. MSPP utilizes a structure-preserving projection for extracting spectral-spatial features, where morphological filters are applied for shape and contextual feature extraction. Because the training set is extended by utilizing the neighborhood information of the local region, MSPP is not only less sensitive to the number of training samples but also includes richer spectral-spatial features in the classification process.

3) With 50 training samples, APFSDA ranks second after WJCR-AS. Although APFSDA is a stacking feature fusion method, due to the use of spectral-spatial features with maximum class discrimination and minimum overlapping and redundancy, it is not very sensitive to the training set size. By applying the FSDA projection to the attribute profiles, APFSDA maximizes the between-spectral scatters and the class discrimination.

4) Although RPNet is not so efficient with 10 training samples, it has high performance with 50 training samples. This is expected because deep learning networks need sufficient labeled samples in the training phase to achieve generalization ability in the testing phase.

5) The Indian dataset is a cluttered and multi-modal image. So, the use of object-based approaches such as Pixon-based or segmentation based relaxation methods such as MRF+HSRM leads to high misclassification in the Indian image.

6) Pixel-based classifiers such as GCK and MBFSDA provide classification maps that are noisier than those of the other competitors.

7) 3D Gabor leads to over-smoothed classification maps.
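The three accuracy measures reported in Tables 2 and 3 (overall accuracy, average accuracy and the kappa coefficient) can be computed from a confusion matrix as sketched below. The function name and the convention that rows are true classes and columns are predicted classes are assumptions of this example; the values are returned as fractions, whereas the tables list them as percentages.

```python
def accuracy_measures(confusion):
    """Return (overall accuracy, average accuracy, kappa).

    confusion[i][j] is the number of test samples of true class i that were
    assigned to class j. Every class must have at least one test sample.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(row[j] for row in confusion) for j in range(n)]
    correct = sum(confusion[i][i] for i in range(n))

    # Overall accuracy: fraction of all samples that are correctly labelled.
    overall = correct / total
    # Average accuracy: mean of the per-class accuracies.
    average = sum(confusion[i][i] / row_sums[i] for i in range(n)) / n
    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa
```

Overall accuracy is dominated by the large classes, while average accuracy weights every class equally, which is why the two can differ noticeably on an imbalanced scene such as the Indian dataset.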

The classification results for the Pavia dataset achieved with 10 training samples are shown in Table 4 and Fig. 16. Among the different classifiers, APFSDA, MSPP, WJCR-AS and GCK provide the best classification results, in that order. The results obtained with 50 training samples (Table 5 and Fig. 17) show that GCK is the best candidate. After that, RPNet and MSPP rank second, and APFSDA and WJCR-AS rank third. The classification results for the Salinas dataset with 10 and 50 training samples are reported in Tables 6 and 7 and Figs. 18 and 19. The ranking of the best classifiers for 10 training samples is: APFSDA, MSPP, WJCR-AS and GCK. The ranking of the best methods for 50 training samples is: APFSDA, MRF+HSRM, WJCR-AS, MSPP and GCK.


Fig. 14. Classiﬁcation maps for Indian dataset achieved by 10 training samples.

Fig. 15. Classification maps for Indian dataset achieved by 50 training samples.

Table 4. Classification results for Pavia dataset (9 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asphalt | 6631 | 45.20 | 93.12 | 93.77 | 74.71 | 94.83 | 81.35 | 50.26 | 83.52 | 86.68 |
| 2 | Meadows | 18649 | 82.00 | 1.43 | 90.88 | 86.09 | 81.91 | 78.65 | 66.53 | 54.19 | 89.22 |
| 3 | Gravel | 2099 | 90.23 | 98.09 | 94.00 | 86.37 | 85.47 | 91.33 | 59.98 | 70.70 | 87.28 |
| 4 | Trees | 3064 | 74.05 | 79.67 | 67.10 | 96.31 | 87.99 | 97.81 | 59.66 | 91.64 | 79.34 |
| 5 | Painted metal sheets | 1345 | 85.13 | 99.03 | 81.26 | 97.25 | 97.55 | 99.93 | 89.44 | 100.00 | 99.33 |
| 6 | Bare Soil | 5029 | 18.55 | 98.59 | 87.73 | 90.85 | 71.55 | 83.69 | 72.74 | 65.30 | 89.52 |
| 7 | Bitumen | 1330 | 92.48 | 56.47 | 99.62 | 98.72 | 99.77 | 99.92 | 72.78 | 92.41 | 88.12 |
| 8 | Self-Blocking Bricks | 3682 | 84.17 | 87.24 | 87.75 | 79.74 | 79.90 | 95.06 | 58.66 | 74.93 | 68.88 |
| 9 | Shadows | 947 | 97.57 | 98.84 | 97.47 | 99.26 | 96.30 | 99.68 | 85.74 | 93.24 | 82.89 |
| | Average accuracy | | 74.38 | 79.16 | 88.84 | 89.92 | 88.36 | 91.94 | 68.42 | 80.66 | 85.70 |
| | Overall accuracy | | 69.63 | 51.74 | 89.25 | 86.12 | 84.50 | 84.86 | 64.59 | 68.81 | 86.45 |
| | Kappa | | 57.72 | 45.96 | 86.01 | 82.18 | 79.86 | 80.75 | 55.56 | 61.78 | 82.31 |
| | Computation time (seconds) | | 3.34 | 1.98 | 1542.81 | 24.84 | 8.79 | 155.54 | 268.74 | 11.82 | 21.64 |
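The result tables report three summary scores per method: average accuracy (AA), overall accuracy (OA) and the kappa coefficient. As a reminder of how these are commonly computed from a confusion matrix, a minimal Python sketch follows. It uses the standard textbook definitions, which are assumed (not confirmed here) to match those used in the experiments; the function name `classification_scores` and the toy matrix are illustrative only. The tables list the scores as percentages, whereas the sketch returns fractions.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] = number of test pixels of true class i predicted as class j.
    Assumes every class has at least one test pixel.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[k][k] for k in range(n_classes)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    # Overall accuracy: fraction of all test pixels that are correctly labeled.
    oa = sum(diagonal) / total
    # Average accuracy: mean of the per-class accuracies (recalls).
    aa = sum(diagonal[k] / row_sums[k] for k in range(n_classes)) / n_classes
    # Kappa: agreement corrected for the agreement expected by chance.
    chance = sum(row_sums[k] * col_sums[k] for k in range(n_classes)) / (total * total)
    kappa = (oa - chance) / (1 - chance)
    return oa, aa, kappa


# Hypothetical two-class example, not taken from the paper.
oa, aa, kappa = classification_scores([[8, 2], [1, 9]])
print(round(oa, 4), round(aa, 4), round(kappa, 4))
```

AA weights every class equally while OA is dominated by large classes, which is why a method can have a high AA and a much lower OA when it fails on one large class (for example the Meadows class for MRF+HSRM in Table 4).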

Fig. 16. Classification maps for Pavia dataset achieved by 10 training samples.

Table 5. Classification results for Pavia dataset (9 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asphalt | 6631 | 48.48 | 93.65 | 95.90 | 93.32 | 97.35 | 96.94 | 78.60 | 85.88 | 93.44 |
| 2 | Meadows | 18649 | 75.70 | 78.89 | 92.66 | 93.35 | 95.26 | 89.53 | 76.14 | 97.73 | 89.55 |
| 3 | Gravel | 2099 | 88.23 | 64.75 | 93.71 | 97.52 | 92.76 | 98.90 | 72.56 | 90.47 | 96.05 |
| 4 | Trees | 3064 | 70.76 | 92.75 | 90.50 | 94.42 | 97.06 | 98.86 | 75.75 | 97.81 | 88.09 |
| 5 | Painted metal sheets | 1345 | 85.13 | 99.70 | 99.85 | 99.85 | 98.74 | 99.85 | 95.69 | 100.00 | 99.78 |
| 6 | Bare Soil | 5029 | 98.27 | 98.63 | 97.59 | 98.79 | 97.53 | 98.39 | 79.52 | 94.13 | 94.77 |
| 7 | Bitumen | 1330 | 92.78 | 99.62 | 99.17 | 96.39 | 99.70 | 99.77 | 82.63 | 98.57 | 94.29 |
| 8 | Self-Blocking Bricks | 3682 | 84.11 | 92.29 | 95.60 | 97.66 | 96.52 | 98.18 | 77.62 | 95.90 | 87.45 |
| 9 | Shadows | 947 | 97.57 | 100.00 | 100.00 | 99.89 | 100.00 | 100.00 | 86.38 | 98.94 | 94.93 |
| | Average accuracy | | 82.34 | 91.14 | 96.11 | 96.80 | 97.21 | 97.83 | 80.54 | 95.49 | 93.15 |
| | Overall accuracy | | 76.43 | 86.72 | 94.49 | 95.08 | 96.32 | 94.47 | 77.88 | 95.09 | 91.39 |
| | Kappa | | 70.11 | 83.04 | 92.80 | 93.56 | 95.16 | 92.82 | 71.84 | 93.51 | 88.78 |
| | Computation time (seconds) | | 6.25 | 2.11 | 1735.20 | 44.50 | 11.91 | 438.38 | 273.39 | 27.87 | 27.48 |

Note that MRF+HSRM, which is a segmentation based relaxation method, performs well on the Salinas dataset, in contrast to the Indian and Pavia datasets, especially when sufficient training samples are available. The reason is that the Salinas scene contains homogeneous regions with few spatial details, so relaxation does not have the destructive effect of removing details. As a result, MRF+HSRM leads to few misclassified pixels in the Salinas image.

The running times of the different methods are also reported in the given tables. Among the 9 spectral-spatial classification methods, which belong to the 3 main groups of segmentation based, feature fusion and decision fusion methods, the highest computation times belong to MSPP, 3D Gabor, WJCR-AS and APFSDA, all of which are feature fusion methods. This holds for all datasets, although the ranking of these methods differs slightly from one dataset to another. The high running time of feature fusion methods is due to the heavy computations of the various feature extraction processes. However, effective features are the basis of a good classification. Although feature fusion methods impose a high computational cost, they usually lead to superior classification results in various agricultural and urban land cover scenes. As seen from the obtained results, feature fusion methods such as WJCR-AS, APFSDA and MSPP provide high performance in all experimented datasets.

Some useful links to source codes for hyperspectral image classification are given in the following:
RPNet classifier: https://github.com/YonghaoXu/RPNet
MRF+HSRM classifier: https://www.researchgate.net/publication/287491511_MRFHSRM_Matlab_Code
GCK and MFL classifiers, extended attribute profiles and some other hyperspectral image analysis methods: http://www.lx.it.pt/~jun/demos.html

Some other useful links:
https://personal.utdallas.edu/~cxc123730/research.html
http://ssp.dml.ir/research/sadl/1/
https://paperswithcode.com/task/hyperspectral-image-classification/latest
https://github.com/gokriznastic/HybridSN
https://github.com/custom-computing-ic/CNN-Based-Hyperspectral-Image-Classification
https://github.com/mhaut/pResNet-HSI
https://github.com/leeguandong/FSKNet-for-HSI
https://github.com/leeguandong/3D-DenseNet-for-HSI
https://github.com/Hsuxu/Two-branch-CNN-Multisource-RS-classification
https://github.com/eecn/Hyperspectral-Classification
https://github.com/syamkakarla98/Dimensionality-reduction-and-classification-on-Hyperspectral-Images-Using-Python
https://github.com/shuguang-52/FDSSC
https://github.com/zilongzhong/SSRN
https://github.com/henanjun/EPs-F
https://github.com/henanjun/demo_MCMs

Some popular hyperspectral image datasets are also available at the following link:
http://www.ehu.es/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes


Fig. 17. Classification maps for Pavia dataset achieved by 50 training samples.

Table 6. Classification results for Salinas dataset (16 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Brocoli_green_weeds_1 | 2009 | 99.90 | 100.00 | 100.00 | 100.00 | 97.41 | 99.90 | 86.86 | 98.01 | 99.95 |
| 2 | Brocoli_green_weeds_2 | 3726 | 99.38 | 99.46 | 99.68 | 98.74 | 96.43 | 98.93 | 93.48 | 99.19 | 99.65 |
| 3 | Fallow | 1976 | 96.20 | 100.00 | 100.00 | 99.90 | 99.80 | 100.00 | 98.13 | 96.91 | 99.60 |
| 4 | Fallow_rough_plow | 1394 | 91.61 | 99.86 | 99.21 | 99.71 | 96.56 | 99.21 | 93.33 | 99.21 | 100.00 |
| 5 | Fallow_smooth | 2678 | 86.89 | 98.43 | 98.28 | 96.94 | 98.81 | 98.32 | 90.66 | 98.36 | 96.30 |
| 6 | Stubble | 3959 | 88.79 | 99.87 | 99.44 | 99.39 | 97.15 | 99.17 | 97.37 | 99.60 | 99.14 |
| 7 | Celery | 3579 | 97.32 | 99.58 | 99.66 | 93.63 | 98.88 | 99.61 | 90.16 | 96.42 | 93.55 |
| 8 | Grapes_untrained | 11271 | 87.20 | 97.91 | 84.61 | 75.49 | 73.87 | 69.75 | 75.96 | 76.01 | 78.33 |
| 9 | Soil_vineyard_develop | 6203 | 99.84 | 100.00 | 100.00 | 99.89 | 99.97 | 99.98 | 94.24 | 96.58 | 97.81 |
| 10 | Corn_senesced_green_weeds | 3278 | 46.34 | 98.17 | 94.63 | 91.67 | 91.95 | 93.41 | 83.92 | 84.66 | 94.23 |
| 11 | Lettuce_romaine_4weeks | 1068 | 52.90 | 95.22 | 90.07 | 98.88 | 92.13 | 93.82 | 77.43 | 96.25 | 97.38 |
| 12 | Lettuce_romaine_5weeks | 1927 | 30.10 | 100.00 | 99.64 | 98.86 | 100.00 | 100.00 | 92.53 | 97.87 | 99.27 |
| 13 | Lettuce_romaine_6weeks | 916 | 89.74 | 98.14 | 98.69 | 97.93 | 99.02 | 98.80 | 90.39 | 93.56 | 98.69 |
| 14 | Lettuce_romaine_7weeks | 1070 | 84.21 | 96.45 | 96.64 | 96.36 | 95.70 | 93.64 | 90.19 | 97.76 | 75.70 |
| 15 | Vineyard_untrained | 7268 | 94.79 | 0.43 | 79.80 | 91.28 | 83.21 | 86.68 | 86.63 | 66.26 | 80.38 |
| 16 | Vineyard_vertical_trellis | 1807 | 92.09 | 97.51 | 97.73 | 98.84 | 93.25 | 99.83 | 69.84 | 97.57 | 99.89 |
| | Average accuracy | | 83.58 | 92.56 | 96.13 | 96.09 | 94.63 | 95.69 | 88.20 | 93.39 | 94.37 |
| | Overall accuracy | | 87.15 | 85.65 | 93.19 | 92.28 | 90.55 | 90.97 | 87.01 | 88.16 | 90.96 |
| | Kappa | | 85.63 | 83.86 | 92.43 | 91.45 | 89.52 | 90.00 | 85.57 | 86.83 | 89.95 |
| | Computation time (seconds) | | 3.63 | 1.82 | 59.14 | 23.65 | 5.13 | 124.28 | 280.58 | 13.06 | 12.56 |

Fig. 18. Classification maps for Salinas dataset achieved by 10 training samples.

Fig. 19. Classification maps for Salinas dataset achieved by 50 training samples.

Table 7. Classification results for Salinas dataset (16 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Brocoli_green_weeds_1 | 2009 | 99.90 | 100.00 | 100.00 | 100.00 | 99.25 | 99.80 | 93.53 | 99.20 | 100.00 |
| 2 | Brocoli_green_weeds_2 | 3726 | 99.68 | 99.65 | 99.92 | 99.36 | 99.65 | 99.84 | 91.25 | 99.09 | 99.89 |
| 3 | Fallow | 1976 | 97.27 | 100.00 | 100.00 | 97.32 | 99.80 | 99.80 | 96.76 | 98.58 | 98.33 |
| 4 | Fallow_rough_plow | 1394 | 91.61 | 99.86 | 99.28 | 99.14 | 98.49 | 99.35 | 96.99 | 99.71 | 99.00 |
| 5 | Fallow_smooth | 2678 | 86.97 | 99.37 | 99.70 | 98.02 | 99.59 | 99.37 | 97.46 | 98.02 | 98.58 |
| 6 | Stubble | 3959 | 88.79 | 99.95 | 99.44 | 99.90 | 99.57 | 99.44 | 97.52 | 99.80 | 99.47 |
| 7 | Celery | 3579 | 97.32 | 99.47 | 99.92 | 96.31 | 99.44 | 99.75 | 95.81 | 99.66 | 99.02 |
| 8 | Grapes_untrained | 11271 | 99.36 | 98.54 | 99.05 | 91.60 | 89.74 | 96.34 | 87.72 | 75.97 | 96.47 |
| 9 | Soil_vineyard_develop | 6203 | 92.37 | 99.84 | 99.77 | 98.63 | 99.87 | 99.19 | 97.10 | 99.92 | 99.10 |
| 10 | Corn_senesced_green_weeds | 3278 | 46.16 | 98.84 | 96.74 | 97.22 | 96.06 | 99.57 | 93.26 | 98.54 | 98.93 |
| 11 | Lettuce_romaine_4weeks | 1068 | 52.90 | 99.44 | 100.00 | 99.72 | 99.06 | 99.63 | 96.72 | 99.53 | 99.34 |
| 12 | Lettuce_romaine_5weeks | 1927 | 30.10 | 100.00 | 100.00 | 99.90 | 100.00 | 100.00 | 99.01 | 100.00 | 99.79 |
| 13 | Lettuce_romaine_6weeks | 916 | 89.74 | 99.34 | 99.13 | 99.13 | 99.67 | 99.78 | 99.78 | 98.80 | 97.38 |
| 14 | Lettuce_romaine_7weeks | 1070 | 84.77 | 96.36 | 97.01 | 95.42 | 96.45 | 96.45 | 98.50 | 98.79 | 86.17 |
| 15 | Vineyard_untrained | 7268 | 94.76 | 98.57 | 98.39 | 96.44 | 85.37 | 97.58 | 84.53 | 78.58 | 84.91 |
| 16 | Vineyard_vertical_trellis | 1807 | 96.57 | 98.23 | 98.89 | 98.67 | 98.95 | 100.00 | 90.43 | 99.61 | 100.00 |
| | Average accuracy | | 84.27 | 99.22 | 99.20 | 97.92 | 97.56 | 99.12 | 94.77 | 96.49 | 97.27 |
| | Overall accuracy | | 89.04 | 99.16 | 99.17 | 96.77 | 95.33 | 98.58 | 92.55 | 91.67 | 96.46 |
| | Kappa | | 87.58 | 99.07 | 99.07 | 96.41 | 94.80 | 98.42 | 91.72 | 90.75 | 96.06 |
| | Computation time (seconds) | | 6.16 | 1.25 | 54.94 | 85.95 | 14.53 | 373.50 | 73.76 | 26.63 | 18.91 |

6. Trends and advanced fusion methods

In the following, some advanced methods in each of the above groups are briefly introduced.

Three main trends have been seen in the recent literature:
1) Design of new feature extraction methods that generate rich spectral-spatial features with a high ability in class discrimination and preserve the 3D local and global structure of hyperspectral images.
2) Hybrid fusion methods, where two or more types of fusion are combined: feature fusion, decision fusion and classification map relaxation (regularization).
3) Deep learning methods for joint spectral-spatial feature generation, which extract detailed features in shallow layers and semantic ones in deep layers.

6.1. Design of feature extraction methods

Many recent feature extraction methods perform non-linear feature learning or rely on sparse representation, graph based approaches and super-pixel based approaches. Some instances are cited in the following. A nonlinear manifold learning method is presented in [127] to extract the intrinsic topology of a hyperspectral image. A graph based feature extraction method is proposed in [128]. Conventional approaches use single vector-based graphs, which fail to capture spatial features. The multigraph embedding proposed in [128] is based on patch tensors, where three different sub-graphs are designed to describe the intrinsic geometrical structures in the hyperspectral image. A super-pixel and graph based feature reduction is introduced in [129]. At first, the hyperspectral image is segmented into non-overlapping super-pixels. Then, the super-pixel based linear discriminant analysis is applied to learn a super-pixel-guided graph. A label-guided graph is also constructed to explore spectral similarity. The two graphs are integrated to learn the discriminant projection. A semi-supervised dimensionality reduction is introduced in [130]. It jointly considers labeled and unlabeled samples to learn the low dimensional subspace. The labels are dynamically propagated on a learnable graph to progressively refine the pseudo-labels, which provides a proper feedback system. Sparse coding provides an efficient representation of hyperspectral images. However, due to the high inter-band correlation of spectral channels, sparse analysis on individual spectral bands is not appropriate. In [131], convolutional frameworks are investigated to simultaneously learn dictionaries and sparse codes and to achieve an appropriate spectral-spatial representation of the hyperspectral image. Because of the high dimensionality of the hyperspectral cube, a convolutional encoder-decoder network is selected to this end. Also, 3D convolutions are adopted for modelling the spectral-spatial features.

6.2. Hybrid fusion methods

Hyperspectral image classification is done using both feature fusion and decision fusion in [132]. Various edge preserving features are extracted using multiple operations and then improved with the assistance of super-pixel segmentation. The obtained edge preserving features are fused with the spectral ones to generate a composite kernel. The various classification maps are finally fused using the majority voting rule.

Feature fusion, decision fusion and classification map regularization (relaxation) using a segmentation map are used to obtain the hyperspectral image classification map in [133]. The first order deviation of Gabor magnitude images is used to extend 2D Gabor features to the 3D structure of the hyperspectral cube. The PCA transform is then used for dimensionality reduction of each extracted 3D Gabor cube. Each reduced cube is fed to an SVM classifier. Then, the majority voting strategy is applied for decision fusion. Finally, a super-pixel map obtained by simple linear iterative clustering is used for regularization of the classification map.

Another integration of feature fusion with decision fusion is introduced in [101]. The proposed framework uses CNNs for joint spectral-spatial feature extraction. There are two main difficulties in the implementation of CNNs for hyperspectral image analysis. The first is insufficient labeled samples for training, which leads to overfitting. The second is ignoring the complementary spectral-spatial information among the low and high level features extracted by shallow and deep layers, respectively. The authors in [101] use two solutions to deal with these difficulties. They use multi-layer spectral-spatial features to learn the complementary information of shallow and deep layers. In addition, they apply sample augmentation by adding local and non-local constraints. A soft-max based multi-decision approach is used to find the final classification map from the different extracted sub-features. A multi-object CNN decision fusion method is proposed in [134].

In [135], 3D Gabor features are convolved with EMAP features to provide EMAP-Gabor features containing rich spatial and texture features. Then, the collaborative representation based classifier (CRC) is used for classification of the EMAP-Gabor features. Separately, EMAP features are employed for generating multi-scale super-pixel maps, where the number of super-pixels is found automatically with a heuristic strategy. The super-pixel maps are used for regularization of the classification maps. Then, the regularized maps are fused to provide the final classification map.

A hybrid fusion method for hyperspectral image classification is proposed in [136]. At first, Gabor filters are applied to the hyperspectral cube to generate Gabor features including magnitude and phase. The magnitude features are fed to SVM classifiers, while the phase features are passed to quadrant bit coding and Hamming distance for measuring sample similarity. The two kinds of generated features are then combined to obtain a weight for each sample belonging to a given class. Also, some super-pixel maps are generated from the raw hyperspectral image to regularize the weighted cube obtained from the previous step. Maximum value classification is then applied to the regularized maps to find the final classification map.

6.3. Deep learning methods

With the fast development and great progress of deep learning methods, and especially with the wide success of convolutional neural networks (CNNs) in image processing problems, deep learning has attracted great interest from remote sensing researchers. Recently, a large number of studies on hyperspectral image classification have been built around deep learning. A review of deep learning based classification methods for hyperspectral images is given in [137]. Due to the overfitting problem, the structure of neural networks designed for hyperspectral images should not be too deep. In line with this requirement, a cascade dual-scale crossover neural network is introduced in [138]. It is able to extract more discriminant contextual information by applying convolution kernels of different spatial and spectral sizes. CNN together with data augmentation using pixel-block pairs [139], regularized CNN [107], Deeplab based CNN [105], 3D CNN and 3D dense network [140], and adaptive multi-scale deep fusion residual network [141] are some instances of deep learning based hyperspectral image classification methods.
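Two steps recur in the hybrid pipelines of Section 6.2: fusing several classification maps by majority voting, and regularizing a classification map with a super-pixel (segmentation) map. The Python sketch below illustrates both on small label grids. It is a minimal illustration under stated assumptions, not the implementation of any cited method: the function names are hypothetical, the super-pixel map is assumed to be given (for instance from a clustering step), and assigning each super-pixel its most frequent label is one common regularization rule that the cited works may refine.

```python
from collections import Counter


def majority_vote(label_maps):
    """Fuse several classification maps (lists of rows) pixel by pixel.

    Ties are resolved in favour of the label that appears first in the
    order of label_maps (behaviour of Counter.most_common).
    """
    n_rows, n_cols = len(label_maps[0]), len(label_maps[0][0])
    fused = []
    for r in range(n_rows):
        fused_row = []
        for c in range(n_cols):
            votes = Counter(label_map[r][c] for label_map in label_maps)
            fused_row.append(votes.most_common(1)[0][0])
        fused.append(fused_row)
    return fused


def regularize_with_superpixels(label_map, segment_map):
    """Give every pixel the most frequent class label of its super-pixel."""
    votes = {}
    for label_row, segment_row in zip(label_map, segment_map):
        for label, segment in zip(label_row, segment_row):
            votes.setdefault(segment, Counter())[label] += 1
    winner = {segment: counts.most_common(1)[0][0]
              for segment, counts in votes.items()}
    return [[winner[segment] for segment in segment_row]
            for segment_row in segment_map]


# Hypothetical toy data: three 2x3 classification maps and one super-pixel map.
maps = [
    [[1, 1, 2], [3, 3, 3]],
    [[1, 2, 2], [3, 3, 1]],
    [[1, 1, 1], [3, 2, 3]],
]
segments = [[0, 0, 0], [1, 1, 1]]

fused = majority_vote(maps)
smoothed = regularize_with_superpixels(fused, segments)
print(fused)
print(smoothed)
```

The regularization step removes isolated labels inside a super-pixel. This is also the mechanism behind the over-smoothing risk discussed for relaxation methods in scenes with fine spatial detail.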

7. Conclusion

The fusion topic in the image processing field has been discussed from two main views in the literature. In the first view, the useful features from two or more individual source images are fused to provide an image with all the beneficial characteristics of the source images [120–124]. In the second view, an image containing various valuable characteristics is explored from different aspects. The useful features are extracted and fused together to allow more powerful decision making from the given image [125,126]. The spectral-spatial fusion for hyperspectral image classification, which is reviewed in this work, belongs to the second group.

Three general categories of spectral-spatial fusion methods are reviewed and discussed in this paper. The first group consists of segmentation based classifiers, which are divided into object based classification methods and relaxed pixel-wise classification methods. The second group contains feature fusion methods. The six types of methods in this group act in two general ways. In the first way, the spectral and spatial features are individually extracted and then used in the classification phase, as in feature stacking methods and kernel based classifiers. In the second way, the spectral and spatial features are simultaneously extracted and classified, as in representation based classifiers, 3D feature extraction methods and deep learning based classifiers. The third group of spectral-spatial fusion methods is the decision fusion framework, where a decision fusion rule is used to integrate the various decisions provided by complementary classification maps.

Appropriate selection of segmentation algorithms, feature transforms, kernels and their parameter settings, 3D filters and their parameter settings, solving the optimization problems in representation based classifiers and tuning the hyper-parameters in deep learning methods are among the main difficulties of the various fusion methods. The appropriate fusion method is chosen by considering the available training samples, the HSI spectral and spatial resolution, and the trade-off between classification accuracy and computation time. In general, however, spectral-spatial HSI classification yields a significant improvement over HSI classification using only the spectral features.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] C.H. Chen, Frontiers of remote sensing information processing, World Scientific, 2003.
[2] H. Ghassemian, D.A. Landgrebe, On-line object feature extraction for multispectral scene representation, NASA technical reports, NASA-CR-187006, NAS 1.26:187006, TR-EE-88-34, Aug. 1988.
[3] M. Golipour, H. Ghassemian, F. Mirzapour, Integrating hierarchical segmentation maps with MRF prior for classification of hyperspectral images in a Bayesian framework, IEEE Trans. Geosci. Remote Sens. 54 (2) (2016) 805–816.
[4] J. Li, J.M. Bioucas-Dias, A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823.
[5] H. Ghassemian, D.A. Landgrebe, Object-oriented feature extraction method for image data compaction, IEEE Control Syst. Mag. 8 (3) (1988) 42–48.
[6] A. Zehtabian, H. Ghassemian, Automatic object-based hyperspectral image classification using complex diffusions and a new distance metric, IEEE Trans. Geosci. Remote Sens. 54 (7) (2016) 4106–4114.
[7] A. Zehtabian, H. Ghassemian, An adaptive pixon extraction technique for multispectral/hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (4) (2015) 831–835.
[8] Y. Tarabalka, J.A. Benediktsson, J. Chanussot, Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques, IEEE Trans. Geosci. Remote Sens. 47 (8) (2009) 2973–2987.
[9] Y. Tarabalka, J. Chanussot, J.A. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognit. 43 (7) (2010) 2367–2379.
[10] S. Jia, B. Deng, J. Zhu, X. Jia, Q. Li, Local binary pattern-based hyperspectral image classification with superpixel guidance, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 749–759.
[11] M.-Y. Liu, O. Tuzel, S. Ramalingam, R. Chellappa, Entropy rate superpixel segmentation, IEEE Conf. Comp. Vision Pattern Recognit. (CVPR) (Jun. 2011) 2097–2104.
[12] M. Khodadadzadeh, H. Ghassemian, Contextual classification of hyperspectral remote sensing images using SVM-PLR, Aust. J. Basic Appl. Sci. 5 (8) (2011) 374–382.
[13] C. Shi, C.-M. Pun, Superpixel-based 3D deep neural networks for hyperspectral image classification, Pattern Recognit. 74 (2018) 600–616.
[14] L. Li, C. Sun, L. Lin, J. Li, S. Jiang, J. Yin, A dual-kernel spectral-spatial classification approach for hyperspectral images based on Mahalanobis distance metric learning, Inf. Sci. (Ny) 429 (2018) 260–283.
[15] Y. Tarabalka, J. Chanussot, J.A. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognit. 43 (7) (2010) 2367–2379.
[16] Y. Tarabalka, J.A. Benediktsson, J. Chanussot, Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques, IEEE Trans. Geosci. Remote Sens. 47 (8) (2009) 2973–2987.
[17] Z. Miao, W. Shi, A new methodology for spectral-spatial classification of hyperspectral images, J. Sensors 2016 (2016), Article ID 1538973, 12 pages.
[18] M. Imani, H. Ghassemian, Morphology-based structure-preserving projection for spectral–spatial feature extraction and classification of hyperspectral data, IET Image Proc. 13 (2) (2019) 270–279.
[19] F. Mirzapour, H. Ghassemian, Fast GLCM and Gabor filters for texture classification of very high resolution remote sensing images, Int. J. Inform. Commun. Tech. Res. 7 (3) (2015) 21–30.
[20] M. Imani, H. Ghassemian, Binary coding based feature extraction in remote sensing high dimensional data, Inf. Sci. (Ny) 342 (2016) 191–208.
[21] M. Imani, H. Ghassemian, Feature space discriminant analysis for hyperspectral data feature reduction, ISPRS J. Photogramm. Remote Sens. 102 (2015) 1–13.
[22] M. Imani, H. Ghassemian, Attribute profile based feature space discriminant analysis for spectral-spatial classification of hyperspectral images, Comput. Electr. Eng. 62 (2017) 555–569.
[23] W. Zhao, S. Du, Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4544–4554.
[24] C. Kuo, D.A. Landgrebe, Nonparametric weighted feature extraction for classification, IEEE Trans. Geosci. Remote Sens. 42 (5) (2004) 1096–1105.
[25] M. Imani, H. Ghassemian, Two dimensional linear discriminant analysis for hyperspectral data, Photogramm. Eng. Remote Sens. 81 (10) (2015) 777–786.
[26] Z. Wang, Q. Ruan, G. An, Facial expression recognition using sparse local Fisher discriminant analysis, Neurocomputing 174 (Part B) (2016) 756–766.
[27] M. Imani, H. Ghassemian, Feature extraction using median-mean and feature line embedding, Int. J. Remote Sens. 36 (17) (2015) 4297–4314.
[28] P. Huang, Z. Yang, C. Chen, Fuzzy local discriminant embedding for image feature extraction, Comput. Electr. Eng. 46 (2015) 231–240.
[29] F. Mirzapour, H. Ghassemian, Improving hyperspectral image classification by combining spectral, texture, and shape features, Int. J. Remote Sens. 36 (4) (2015) 1070–1096.
[30] S. Li, Q. Hao, X. Kang, J.A. Benediktsson, Gaussian pyramid based multiscale feature fusion for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (9) (2018) 3312–3324.
[31] A. Kianisarkaleh, H. Ghassemian, Spatial-spectral locality preserving projection for hyperspectral image classification with limited training samples, Int. J. Remote Sens. 37 (21) (2016) 5045–5059.
[32] G. Zhang, J. Wang, X. Zhang, H. Fei, B. Tu, Adaptive total variation-based spectral-spatial feature extraction of hyperspectral image, J. Vis. Commun. Image Represent. 56 (2018) 150–159.
[33] B. Pan, Z. Shi, X. Xu, Hierarchical guidance filtering-based ensemble classification for hyperspectral images, IEEE Trans. Geosci. Remote Sens. 55 (7) (2017) 4177–4189.
[34] M. Imani, H. Ghassemian, Edge patch image-based morphological profiles for classification of multispectral and hyperspectral data, IET Image Proc. 11 (3) (2017) 164–172.
[35] M. Imani, H. Ghassemian, Morphology-based structure-preserving projection for spectral–spatial feature extraction and classification of hyperspectral data, IET Image Proc. 13 (2) (2019) 270–279.
[36] H. Li, H. Li, L. Zhang, Quaternion-based multiscale analysis for feature extraction of hyperspectral images, IEEE Trans. Signal Process. 67 (6) (2019) 1418–1430.
[37] L. Gao, et al., Subspace-based support vector machines for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (2) (2015) 349–353.
[38] P. Ramzi, F. Samadzadegan, P. Reinartz, Classification of hyperspectral data using an AdaBoostSVM technique applied on band clusters, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2066–2079.
[39] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification via kernel sparse representation, IEEE Trans. Geosci. Remote Sens. 51 (1) (2013) 217–231.
[40] J. Li, H. Zhang, L. Zhang, Column-generation kernel nonlocal joint collaborative representation for hyperspectral image classification, ISPRS J. Photogramm. Remote Sens. 94 (2014) 25–36.


[41] A. Cornuéjols, C. Wemmert, P. Gançarski, Y. Bennani, Collaborative clustering: why, when, what and how, Inform. Fusion 39 (2018) 81–95. [42] R. Zhao, B. Du, L. Zhang, A robust nonlinear hyperspectral anomaly detection approach, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (4) (2014) 1227–1234. [43] Y. Gu, C. Wang, D. You, Y. Zhang, S. Wang, Y. Zhang, Representative multiple kernel learning for classiﬁcation in hyperspectral imagery, IEEE Trans. Geosci. Remote Sens. 50 (7) (2012) 2852–2865. [44] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. R. Soc. Lond. A Math. Phys. Sci. 209 (441–458) (1909) 415–446. [45] D. Tuia, G. Camps-Valls, Semisupervised remote sensing image classiﬁcation with cluster kernels, IEEE Geosci. Remote Sens. Lett. 6 (2) (2009) 224–228. [46] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Disc. 2 (2) (1998) 121–167. [47] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268. [48] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learning the kernel matrix with semideﬁnite programming, J. Mach. Learn. Res. 5 (2004) 27–72. [49] G. Camps-Valls, L. Gomez-Chova, J. Muéz-Mari, J. Vila-Frances, J. Calpe-Maravilla, Composite kernels for hyperspectral image classiﬁcation, IEEE Geosci. Remote Sens. Lett. 3 (1) (2006) 93–97. [50] S. Niazmardi, A. Safari, S. Homayouni, A novel multiple kernel learning framework for multiple feature classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 10 (8) (2017) 3734–3743. [51] Y. Gu, J. Chanussot, X. Jia, J.A. Benediktsson, Multiple kernel learning for hyperspectral image classiﬁcation: a review, IEEE Trans. Geosci. Remote Sens. 55 (11) (2017) 6547–6565. [52] J. Li, P. Marpu, A. Plaza, J. Bioucas-Dias, J.A. 
Benediktsson, Generalized composite kernel framework for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4816–4829. [53] B. Suchetana, B. Rajagopalan, J. Silverstein, Investigating regime shifts and the factors controlling total inorganic nitrogen concentrations in treated wastewater using non-homogeneous Hidden Markov and multinomial logistic regression models, Sci. Total Environ. 646 (2019) 625–633. [54] J. Li, X. Huang, P. Gamba, J.M. Bioucas-Dias, L. Zhang, J.A. Benediktsson, A. Plaza, Multiple feature learning for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 53 (3) (2015) 1592–1606. [55] J. Li, J. Bioucas-Dias, A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random ﬁelds, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823. [56] Y. Zhang, S. Prasad, Locality preserving composite kernel feature extraction for multi-source geospatial image analysis, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 8 (3) (2015) 1385–1392. [57] J. Li, P. Marpu, A. Plaza, J. Bioucas-Dias, J.A. Benediktsson, Generalized composite kernel framework for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4816–4829. [58] P. Ghamisi, J. Plaza, Y. Chen, J. Li, A.J. Plaza, Advanced spectral classiﬁers for hyperspectral images: a review, IEEE Geosci. Remote Sens. Mag. 5 (1) (2017) 8–32. [59] W. Li, E.W. Tramel, S. Prasad, J.E. Fowler, Nearest regularized subspace for hyperspectral classiﬁcation, IEEE Trans. Geosci. Remote Sens. 52 (1) (2014) 477–489. [60] W. Li, Q. Du, Joint within-class collaborative representation for hyperspectral image classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2200–2208. [61] M. Xiong, Q. Ran, W. Li, J. Zou, Q. Du, Hyperspectral image classiﬁcation using weighted joint collaborative representation, IEEE Geosci. Remote Sens. Lett. 12 (6) (2015) 1209–1213. [62] M. 
Imani, H. Ghassemian, Edge-preserving-based collaborative representation for spectral-spatial classiﬁcation, Int. J. Remote Sens. 38 (20) (2017) 5524–5545. [63] M. Imani, H. Ghassemian, Weighted joint collaborative representation based on median-mean line and angular separation, IEEE Trans. Geosci. Remote Sens. 55 (10) (2017) 5612–5624. [64] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classiﬁcation using dictionary-based sparse representation, IEEE Trans. Geosci. Remote Sens., 49 (10) 3973–3985. [65] B. Tu, X. Zhang, X. Kang, G. Zhang, J. Wang, J. Wu, Hyperspectral image classiﬁcation via fusing correlation coeﬃcient and joint sparse representation, IEEE Geosci. Remote Sens. Lett. 15 (3) (2018) 340–344. [66] M. Borhani, H. Ghassemian, Spectral-spatial graph kernel machines in the context of hyperspectral remote sensing image classiﬁcation, CSI J. Comp. Sci. Eng. 11 (2013) 31–42 2 & 4 (b). [67] L. Gan, P. Du, J. Xia, Y. Meng, Kernel fused representation-based classiﬁer for hyperspectral imagery, IEEE Geosci. Remote Sens. Lett. 14 (5) (2017) 684–688. [68] M. Imani, Attribute proﬁle based target detection using collaborative and sparse representation, Neurocomputing 313 (2018) 364–376. [69] M. Imani, Anomaly detection using morphology-based collaborative representation in hyperspectral imagery, Eur. J. Remote Sens. 51 (1) (2018) 457–471. [70] G. Goswami, P. Mittal, A. Majumdar, M. Vatsa, R. Singh, Group sparse representation based classiﬁcation for multi-feature multimodal biometrics, Inform. Fusion 32 (Part B) (2016) 3–12. [71] B. Yang, S. Li, Pixel-level image fusion with simultaneous orthogonal matching pursuit, Inform. Fusion 13 (1) (2012) 10–19. [72] M. Imani, Anomaly detection from hyperspectral images using clustering based feature reduction, J. Indian Soc. Remote Sens. 46 (9) (2018) 1389–1397.

[73] F. Yuan, X. Xia, J. Shi, Mixed co-occurrence of local binary patterns and Hamming-distance-based local binary patterns, Inf. Sci. (Ny) 460–461 (2018) 202–222. [74] J. Zhu, J. Hu, S. Jia, X. Jia, Q. Li, Multiple 3-D feature fusion framework for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 56 (4) (2018) 1873–1886. [75] L. He, J. Li, A. Plaza, Y. Li, Discriminative low-rank Gabor ﬁltering for spectral–spatial hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 55 (3) (2017) 1381–1395. [76] T.C. Bau, S. Sarkar, G. Healey, Hyperspectral region classiﬁcation using a three-dimensional Gabor ﬁlterbank, IEEE Trans. Geosci. Remote Sens. 48 (9) (2010) 3457–3464. [77] M. Imani, 3D Gabor based hyperspectral anomaly detection, AUT J. Model. Simul. 50 (2) (2018) 101–110. [78] T.C. Bau, S. Sarkar, G. Healey, Hyperspectral region classiﬁcation using a three-dimensional Gabor ﬁlterbank, IEEE Trans. Geosci. Remote Sens. 48 (9) (2010) 3457–3464. [79] S. Jia, L. Shen, J. Zhu, Q. Li, A 3-D Gabor phase-based coding and matching framework for hyperspectral imagery classiﬁcation, IEEE Trans. Cybern. 48 (4) (2018) 1176–1188. [80] S. Jia, J. Hu, J. Zhu, X. Jia, Q. Li, Three-dimensional local binary patterns for hyperspectral imagery classiﬁcation, IEEE Trans. Geosci. Remote Sens. 55 (4) (2017) 2399–2413. [81] Z. He, L. Liu, Robust multitask learning with three-dimensional empirical mode decomposition-based features for hyperspectral classiﬁcation, ISPRS J. Photogramm. Remote Sens. 121 (2016) 11–27. [82] X. Cao, L. Xu, D. Meng, Q. Zhao, Z. Xu, Integration of 3-dimensional discrete wavelet transform and Markov random ﬁeld for hyperspectral image classiﬁcation, Neurocomputing 226 (2017) 90–100. [83] Y.Y. Tang, Y. Lu, H. Yuan, Hyperspectral image classiﬁcation based on three-dimensional scattering wavelet transform, IEEE Trans. Geosci. Remote Sens. 53 (5) (2015) 2467–2480. [84] C. Shi, C.-M.
Pun, 3D multi-resolution wavelet convolutional neural networks for hyperspectral image classiﬁcation, Inf. Sci. (Ny) 420 (2017) 49–65. [85] B. Hamida, A. Benoit, P. Lambert, C. Ben Amar, 3-D deep learning approach for remote sensing image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 56 (8) (2018) 4420–4434. [86] Y. Chen, H. Jiang, C. Li, X. Jia, P. Ghamisi, Deep feature extraction and classiﬁcation of hyperspectral images based on convolutional neural networks, IEEE Trans. Geosci. Remote Sens. 54 (10) (2016) 6232–6251. [87] Y. Li, H. Zhang, Q. Shen, Spectral–spatial classiﬁcation of hyperspectral imagery with 3D convolutional neural network, Remote Sens. 9 (1) (2017) 67. [88] Z. Zhong, J. Li, Z. Luo, M. Chapman, Spectral–spatial residual network for hyperspectral image classiﬁcation: a 3-D deep learning framework, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 847–858. [89] J. Yang, W. Xiong, S. Li, C. Xu, Learning structured and non-redundant representations with deep neural networks, Pattern Recognit. 86 (2019) 224–235. [90] I. Rodriguez, J. María Martínez-Otzeta, I. Irigoien, E. Lazkano, Spontaneous talking gestures using generative adversarial networks, Rob. Auton. Syst. 114 (2019) 57–65. [91] I. Goodfellow, et al., Generative adversarial nets, in: Proc. NIPS, Montreal, QC, Canada, 2014, pp. 2672–2680. [92] M. Zhang, M. Gong, Y. Mao, J. Li, Y. Wu, Unsupervised feature extraction in hyperspectral images based on Wasserstein generative adversarial network, IEEE Trans. Geosci. Remote Sens. (2018) In Press. [93] J. Feng, H. Yu, L. Wang, X. Cao, X. Zhang, L. Jiao, Classiﬁcation of hyperspectral images based on multiclass spatial-spectral generative adversarial networks, IEEE Trans. Geosci. Remote Sens. (2019) In Press. [94] L. Zhu, Y. Chen, P. Ghamisi, J.A. Benediktsson, Generative adversarial networks for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 56 (9) (2018) 5046–5063. [95] J.L.
Tsai, Feature extraction of hyperspectral image cubes using three-dimensional gray-level cooccurrence, IEEE Trans. Geosci. Remote Sens. 51 (6) (2013) 3504–3513. [96] M. Zaouali, S. Bouzidi, E. Zagrouba, 3-D Shearlet transform based feature extraction for improved joint sparse representation HSI classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (4) (2018) 1306–1314. [97] B.T. Zhao, H. Fei, N. Li, X. Yang, Spatial-spectral classiﬁcation of hyperspectral image via group tensor decomposition, Neurocomputing 316 (2018) 68–77. [98] X. Cao, R. Li, L. Wen, J. Feng, L. Jiao, Deep multiple feature fusion for hyperspectral image classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (10) (2018) 3880–3891. [99] J. Yang, Y. Zhao, J.C. Chan, Learning and transferring deep joint spectral–spatial features for hyperspectral classiﬁcation, IEEE Trans. Geosci. Remote Sens. 55 (8) (2017) 4729–4742. [100] G.L. Zhao, L. Fang, B. Tu, P. Ghamisi, Multiple convolutional layers fusion framework for hyperspectral image classiﬁcation, Neurocomputing 339 (2019) 149–160. [101] J. Feng, et al., CNN-based multilayer spatial–spectral feature fusion and sample augmentation with local and nonlocal constraints for hyperspectral image classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (4) (2019) 1299–1313. [102] N. He, L. Fang, S. Li, A. Plaza, J. Plaza, Remote sensing scene classiﬁcation using multilayer stacked covariance pooling, IEEE Trans. Geosci. Remote Sens. 56 (12) (2018) 6899–6910. [103] X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, B. Zhang, Multisource remote sensing data classiﬁcation based on convolutional neural network, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 937–949.


[104] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, [Online]. Available: https://arxiv.org/abs/1412.7062 (Dec. 2014). [105] Z. Niu, W. Liu, J. Zhao, G. Jiang, DeepLab-based spatial feature extraction for hyperspectral image classiﬁcation, IEEE Geosci. Remote Sens. Lett. 16 (2) (2019) 251–255. [106] L. Sun, J. Rieger, H. Hinrichs, Maximum noise fraction (MNF) transformation to remove ballistocardiographic artifacts in EEG signals recorded during fMRI scanning, Neuroimage 46 (1) (2009) 144–153. [107] Y. Guo, H. Cao, J. Bai, Y. Bai, High eﬃcient deep feature extraction and classiﬁcation of spectral-spatial hyperspectral image using cross domain convolutional neural networks, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (1) (2019) 345–356. [108] M. Imani, Manifold structure preservative for hyperspectral target detection, Adv. Space Res. 61 (2018) 2510–2520. [109] M. Imani, H. Ghassemian, Discriminant analysis in morphological feature space for high-dimensional image spatial–spectral classiﬁcation, J. Appl. Remote Sens. 12 (1) (2018) 016024-1–016024-28. [110] M. Imani, H. Ghassemian, Hyperspectral images classiﬁcation by spectral-spatial processing, in: 8th International Symposium on Telecommunications (IST’2016), Tehran, Iran, 27–29 Sept. 2016. [111] B. Guo, H. Shen, M. Yang, Improving hyperspectral image classiﬁcation by fusing spectra and absorption features, IEEE Geosci. Remote Sens. Lett. 14 (8) (2017) 1363–1367. [112] D.G. Stavrakoudis, E. Dragozi, I.Z. Gitas, C.G. Karydas, Decision fusion based on hyperspectral and multispectral satellite imagery for accurate forest species mapping, Remote Sens. 6 (8) (2014) 6897–6928. [113] R. Hang, et al., Robust matrix discriminative analysis for feature extraction from hyperspectral images, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 10 (5) (2017) 2002–2011. [114] R. Hang, Q. Liu, H. Song, Y.
Sun, Matrix-based discriminant subspace ensemble for hyperspectral image spatial–spectral feature fusion, IEEE Trans. Geosci. Remote Sens. 54 (2) (2016) 783–794. [115] J. Li, J. Bioucas-Dias, A. Plaza, Hyperspectral image segmentation using a new Bayesian approach with active learning, IEEE Trans. Geosci. Remote Sens. 49 (10) (2011) 3947–3960. [116] J. Li, J. Bioucas-Dias, A. Plaza, Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random ﬁelds, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823. [117] J.L. Khodadadzadeh, A. Plaza, H. Ghassemian, J.M. Bioucas-Dias, X. Li, Spectral–spatial classiﬁcation of hyperspectral data using local and global probabilities for mixed pixel characterization, IEEE Trans. Geosci. Remote Sens. 52 (10) (2014) 6298–6314. [118] M. Imani, H. Ghassemian, Spectral-spatial feature transformations with controlling contextual information through smoothing ﬁltering and morphological analysis, Int. J. Inform. Commun. Tech. Res. 10 (1) (2018) 1–12. [119] F.S. Uslu, H. Binol, M. Ilarslan, A. Bal, Improving SVDD classiﬁcation performance on hyperspectral images via correlation based ensemble technique, Opt. Lasers Eng. 89 (2017) 169–177. [120] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: a survey of the state of the art, Inform. Fusion 33 (2017) 100–112. [121] Y. Liu, X. Chen, Z. Wang, Z.J. Wang, R.K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Inform. Fusion 42 (2018) 158–173. [122] X. Ma, S. Hu, S. Liu, J. Fang, S. Xu, Multi-focus image fusion based on joint sparse representation and optimum theory, Signal Process. Image Commun. (2019) In Press.

[123] Q. Zhang, Y. Liu, R.S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: a review, Inform. Fusion 40 (2018) 57–75. [124] B. Meher, S. Agrawal, R. Panda, A. Abraham, A survey on region based image fusion methods, Inform. Fusion 48 (2019) 119–132. [125] C. Liu, J. Li, L. He, Superpixel-Based semisupervised active learning for hyperspectral image classiﬁcation, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (1) (2019) 357–370. [126] K. Fotiadou, G. Tsagkatakis, P. Tsakalides, Spectral super resolution of hyperspectral images via coupled dictionary learning, IEEE Trans. Geosci. Remote Sens. 57 (5) (2019) 2777–2797. [127] P. Zhang, H. He, L. Gao, A nonlinear and explicit framework of supervised manifold-feature extraction for hyperspectral image classiﬁcation, Neurocomputing 337 (2019) 315–324. [128] Y. Deng, H. Li, X. Song, Y. Sun, X. Zhang, Q. Du, Patch tensor-based multigraph embedding framework for dimensionality reduction of hyperspectral images, IEEE Trans. Geosci. Remote Sens. (2020) In Press. [129] H. Xu, H. Zhang, W. He, L. Zhang, Superpixel-based spatial-spectral dimension reduction for hyperspectral imagery classiﬁcation, Neurocomputing 360 (2019) 138–150. [130] D. Hong, N. Yokoya, J. Chanussot, J. Xu, X.X. Zhu, Learning to propagate labels on graphs: an iterative multitask regression framework for semi-supervised hyperspectral dimensionality reduction, ISPRS J. Photogramm. Remote Sens. 158 (2019) 35–49. [131] P.V. Arun, B. Krishna Mohan, A. Porwal, Spatial-spectral feature based approach towards convolutional sparse coding of hyperspectral images, Comput. Vision Image Understanding 188 (2019) 102797. [132] P. Duan, X. Kang, S. Li, P. Ghamisi, J.A. Benediktsson, Fusion of multiple edge-preserving operations for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 57 (12) (2019) 10336–10349. [133] S. Jia, K. Wu, J. Zhu, X. 
Jia, Spectral–spatial Gabor surface feature fusion approach for hyperspectral imagery classiﬁcation, IEEE Trans. Geosci. Remote Sens. 57 (2) (2019) 1142–1154. [134] Y. Hu, J. Zhang, Y. Ma, J. An, G. Ren, X. Li, Hyperspectral coastal wetland classiﬁcation based on a multiobject convolutional neural network model and decision fusion, IEEE Geosci. Remote Sens. Lett. 16 (7) (2019) 1110–1114. [135] S. Jia, X. Deng, J. Zhu, M. Xu, J. Zhou, X. Jia, Collaborative representation-based multiscale superpixel fusion for hyperspectral image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 57 (10) (2019) 7770–7784. [136] S. Jia, Z. Lin, B. Deng, J. Zhu, Q. Li, Cascade superpixel regularized Gabor feature fusion for hyperspectral image classiﬁcation, IEEE Trans. Neural Netw. Learn. Syst. (2020) In Press. [137] M.E. Paoletti, J.M. Haut, J. Plaza, A. Plaza, Deep learning classiﬁers for hyperspectral imaging: a review, ISPRS J. Photogramm. Remote Sens. 158 (2019) 279–317. [138] F. Cao, W. Guo, Cascaded dual-scale crossover network for hyperspectral image classiﬁcation, Knowl. Based Syst. (2019) 105122. [139] W. Li, C. Chen, M. Zhang, H. Li, Q. Du, Data augmentation for hyperspectral image classiﬁcation with deep CNN, IEEE Geosci. Remote Sens. Lett. 16 (4) (2019) 593–597. [140] C. Zhang, G. Li, S. Du, Multi-scale dense networks for hyperspectral remote sensing image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 57 (11) (2019) 9201–9222. [141] G. Li, L. Li, H. Zhu, X. Liu, L. Jiao, Adaptive multiscale deep fusion residual network for remote sensing image classiﬁcation, IEEE Trans. Geosci. Remote Sens. 57 (11) (2019) 8506–8521. [142] Y. Xu, B. Du, F. Zhang, L. Zhang, Hyperspectral image classiﬁcation via a random patches network, ISPRS J. Photogramm. Remote Sens. 142 (2018) 344–357.
