An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges


Information Fusion 59 (2020) 59–83


Information Fusion journal homepage: www.elsevier.com/locate/inffus

Maryam Imani, Hassan Ghassemian∗
Image Processing and Information Analysis Lab, Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Keywords: Hyperspectral image, Feature fusion, Decision fusion, Classification

Abstract

Hyperspectral images (HSIs) have a cube form, containing spatial information in two dimensions and rich spectral information in the third. The large number of spectral bands allows detailed discrimination between various materials. Moreover, utilizing spatial features of the image such as shape, texture and geometrical structures improves land cover discrimination. Fusion of spectral and spatial information can therefore significantly improve HSI classification. In this work, the spectral-spatial information fusion methods are categorized into three main groups. The first group contains segmentation based methods, where objects or super-pixels are used instead of pixels for classification, or the obtained segmentation map is used for relaxation of the pixel-wise classification map. The second group consists of feature fusion methods, which are divided into six subgroups: feature stacking, joint spectral-spatial feature extraction, kernel based classifiers, representation based classifiers, 3D spectral-spatial feature extraction and deep learning based classifiers. The third group consists of decision fusion based approaches, where the complementary information of several classifiers is combined to achieve the final classification map. A review of different methods in each category is presented. Moreover, the advantages and difficulties/disadvantages of each group are discussed. The performance of various fusion methods is assessed in terms of classification accuracy and running time using experiments on three popular hyperspectral images. The results show that the feature fusion methods, although time consuming, can provide superior classification accuracy compared to other methods. This work can be useful for researchers interested in HSI feature extraction, fusion and classification.

∗ Corresponding author. E-mail addresses: [email protected] (M. Imani), [email protected] (H. Ghassemian).

https://doi.org/10.1016/j.inffus.2020.01.007 Received 11 October 2019; Received in revised form 18 January 2020; Accepted 20 January 2020; Available online 21 January 2020. 1566-2535/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

The development of hyperspectral sensors provides hyperspectral images (HSIs) containing hundreds of spectral bands. The spectral signature of each image pixel, constituted by hundreds of spectral bands, acts as a fingerprint for identification of its material type. An HSI is a cube constituted of images acquired from the same scene but at different electromagnetic wavelengths, where each slice of this cube is associated with a specific wavelength (see Fig. 1). In other words, each pixel of the HSI (spatial sample) located in row i and column j, denoted as p(i, j), has a spectral signature composed of the reflections of that position of the scene at various wavelengths (a feature vector containing the associated values of the different spectral bands). The huge amount of spectral information simplifies distinguishing between different materials. Thus, it allows material recognition and land cover classification with high accuracy. HSIs with rich spectral information are useful in various applications and fields such as mineralogy, agriculture, land cover classification and target detection [1].

Fig. 1. A hyperspectral cube on the left and a typical spectral feature on the right [2].

Although spectral features alone may be useful, they are not sufficient in many cases. When two different objects have the same spectral signatures, they can be discriminated through their shapes and textures [2]. Thus, one can fuse the spectral and spatial information to improve HSI classification. The basic idea of using spatial information is that, in local regions, neighboring pixels have similar spectral features and belong to the same class with a high probability [3]. To better understand the value of spatial information, consider Fig. 2, where the positions of the pixels are randomly changed while the spectral features of each pixel remain unchanged. The result of a purely spectral classification is equivalent for both images. However, Fig. 2(a) contains valuable spatial information about the shape and texture of objects, which can be used in a spectral-spatial classifier. A significant HSI classification improvement can be achieved by applying an appropriate spectral-spatial fusion method.

Fig. 2. (Left) A hyperspectral cube; (right) the result of randomly changing the position of the pixels, i.e., removing the spatial information.

Fig. 3. Categorization of the spectral-spatial fusion methods.

To extract the spatial information, a local window is usually considered around each pixel of the image. By applying a spatial transform or by computing statistics of the local dependency, spatial features are extracted and assigned to the central pixel [4]. The spectral-spatial fusion methods are generally categorized into three main groups (segmentation based, feature fusion based, and decision fusion based), each of which also contains some subgroups. This categorization is shown in Fig. 3 and described as follows:

A) Segmentation based methods

This category of spectral-spatial fusion methods produces segments (objects, super-pixels or pixons) across the HSI. The technique is based on the fundamental assumption that the scene is segmented into objects such that all samples (pixels) from an object are members of the same class; hence, each of the scene's objects can be represented by a single suitably chosen feature set. Typically, the size and shape of objects in the scene vary randomly, while the sampling rate, and therefore the pixel size, is fixed. It is reasonable to assume that the sample data (pixels) from a simple object have a common characteristic. A complex scene consists of simple objects. Any scene can thus be described by classifying the objects in terms of their features and by recording the relative position and orientation of the objects in the scene.


In the segmentation methods, the spatial information is used to generate segments. Each segment contains adjacent pixels with similar spectral features. Two approaches can be used to exploit the obtained segmentation maps:

A-1. The obtained objects are classified instead of pixels. In other words, the same label is assigned to all pixels belonging to an object.

A-2. The HSI is classified pixel-wise. Then, the obtained segmentation map is used as a mask to improve the pixel-wise classification map. Usually, the majority voting rule is used to assign the same label to all pixels located in a segment.

The segmentation based methods remove the noisy pixels of the classification maps, but selecting an appropriate segmentation algorithm and generating suitable objects with fitting sizes and shapes is a challenging task.

B) Feature fusion based methods

In the spectral-spatial feature fusion category, the spectral and spatial features are extracted individually or simultaneously. Then, the obtained spectral-spatial feature cube is fed to a suitable classifier to achieve the classification map. Various feature fusion methods are described as follows:

B-1. Feature stacking

In these methods, the spectral features and the spatial ones are extracted individually, and then simply stacked together to generate the spectral-spatial cube. These methods are relatively simple, but because the spectral and spatial features are extracted independently, the information hidden in joint spectral-spatial features is lost. Moreover, the stacked spectral-spatial feature vector assigned to each pixel has a high dimension, which leads to the curse of dimensionality when only a limited number of training samples is available (Hughes phenomenon) [5].

B-2. Joint spectral-spatial feature extraction

Some fusion methods extract the spectral and spatial features jointly instead of individually. Advantages of these methods include avoiding the long fused vectors produced by feature stacking and considering the joint contribution of spectral and spatial information. This comes at the cost of more computation and the loss of some information from the original spectral bands.

B-3. Kernel based classifiers

The spectral and spatial features can be combined by applying multiple kernels or composite kernels. The high potential of kernels for extracting non-linear features makes it possible to handle non-linear class boundaries. However, designing an appropriate kernel and selecting its parameters is a hard task.

B-4. Representation based classifiers

The representation based methods are non-parametric and require neither assumptions about the data distribution nor estimation of statistics. The most well-known methods of this category are sparse representation (SR) and collaborative representation (CR). These methods are based on the idea that each image pixel can be represented through a linear combination of atoms of an appropriate dictionary. Composing (or learning) the dictionary and solving the optimization problem are difficult tasks.

B-5. 3D spectral-spatial feature extraction

Due to the inherent 3D nature of the HSI, simultaneous extraction of spectral and spatial features preserves the joint dependencies of spectral and spatial information. 3D filters are usually selected for extraction of the 3D spectral-spatial cube. The high volume of computations, the selection of appropriate 3D filters and their parameter settings are the difficulties of these methods.

B-6. Deep learning based classifiers

Deep learning methods such as convolutional neural networks (CNNs) extract joint spectral-spatial features layer by layer, where the feature map of each layer is extracted from the feature map of the previous layer. Their main advantage is the high potential for extracting non-linear and hidden features. However, deep networks have hyper-parameters that need to be set, and training the network requires a large training set. Otherwise, over-fitting causes lower classification accuracy in the testing phase than in the training stage.

C) Decision fusion based methods

In the decision fusion methods, the classification map is obtained multiple times by applying different classifiers to the same feature set, by individually applying the same classifier to various feature sets, or by applying various classifiers to various feature sets. The final classification map is obtained by applying a decision fusion rule such as majority voting or the joint measures method. The difficulties of the decision fusion methods are the selection of feature sets or classifiers containing complementary information and the high computation time due to the multiple classification processes.

A review of different information fusion methods is given in this paper. Several state-of-the-art methods from each group are introduced. The advantages and disadvantages of each group are also discussed.

2. Segmentation based (object) methods

There are two types of segmentation methods. In the first type, a segmentation algorithm is applied to the HSI for object extraction. Then, the objects are classified. In the second type, the obtained segmentation map is used as a mask for relaxation of a pixel-wise classification map. The main challenge of the object based methods is the selection of an appropriate segmentation algorithm that extracts a sufficient number of valid objects and avoids over-segmentation or under-segmentation [4].

A hierarchical statistical region merging (HSRM) segmentation algorithm is proposed in [3]. A fuzzy no border/border map is generated to provide weighting coefficients for modifying the spatial prior of the Markov random field (MRF) based multi-level logistic model [4]. The proposed MRF+HSRM method deals with the over-segmentation of the classification output, which is a common problem of MRF based classifiers. The statistical region merging (SRM) segmentation not only has a simple merging formulation and a fast implementation but also has strong mathematical support. Moreover, SRM is robust to texture and image noise, which results in the identification of meaningful edges. The MRF+HSRM method is more robust than object-regularized approaches such as majority voting.

2.1. Object classification

Generally, an image can be analyzed pixel-wise or object-wise. The object based methods usually apply a segmentation algorithm to an image for object detection. The spectral-spatial features of each object are then extracted. Finally, the objects are classified by utilizing the object features [2].

An object based classification method is proposed in [5] that significantly reduces the complexity of a multispectral image by compressing it with a compaction coefficient larger than 20. The classification running time is reduced as well, by a factor larger than 20. The proposed method, called automatic multispectral image compaction algorithm (AMICA), utilizes the gradient vector of image pixels within objects and also the contextual information to generate the object features. A specific adjacency relation and a similarity measure are introduced, with their mathematical tools, to form an object. The spectral-spatial object features are used instead of the original spectral features of individual pixels to reduce data redundancy. The AMICA algorithm has three steps. First, the data cube is partitioned into an exhaustive set of objects. Second, all pixels belonging to the same object are characterized with an object feature vector. Third, the object features are used rather than the pixel features for analysis, classification or transmission of data.

Fig. 4. Two examples of classification map relaxation.

An object based classifier that automatically tunes the free parameters is proposed in [6]. The process is done band by band. First, partial differential equations (PDEs) [7], in both their real and complex versions, are used to smooth the HSI data. To tune the parameters of the PDE, a genetic algorithm with a new fitness function is utilized. In the second step, the objects are extracted from the smoothed bands. The introduced distance matrix is substituted by a conventional distance metric. The proposed method uses the summation of the gradient values around each considered pixel in addition to the difference between the pixel and the surrounding object. In the third step, the average of the spectral features is computed and fed to an SVM classifier for object classification. The classification outputs of the different HSI bands are then fused through the majority voting rule, a popular decision fusion method. Because the bands are processed independently, the running time of the method under parallel processing equals the elapsed time for a single band, whereas under sequential processing it equals the sum of the processing times of all single bands.
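The object-feature step described above, in which the spectral features of the pixels in each object are averaged before the object is classified, can be sketched in a few lines. This is only a minimal illustration and not the implementation of [5] or [6]: the function name `object_mean_spectra`, the toy cube and the segment label map are assumptions made for the example, and the segmentation map is taken as given.

```python
import numpy as np

def object_mean_spectra(cube, segments):
    """Average the spectra of all pixels that share a segment label.

    cube     : (rows, cols, bands) array, the HSI (hypothetical toy data here)
    segments : (rows, cols) integer label map from any segmentation algorithm
    Returns (labels, means), where means[k] is the mean spectrum of labels[k].
    """
    cube = np.asarray(cube, dtype=float)
    pixels = cube.reshape(-1, cube.shape[-1])        # one row per pixel
    flat_labels = np.asarray(segments).ravel()
    labels, inverse = np.unique(flat_labels, return_inverse=True)
    sums = np.zeros((labels.size, pixels.shape[1]))
    np.add.at(sums, inverse, pixels)                 # accumulate spectra per object
    counts = np.bincount(inverse, minlength=labels.size)
    return labels, sums / counts[:, None]

# Toy example: a 2 x 2 image with 3 bands and two objects (top row, bottom row).
cube = np.arange(12).reshape(2, 2, 3)
segments = np.array([[0, 0], [1, 1]])
labels, means = object_mean_spectra(cube, segments)
```

Each row of `means` could then be passed to any classifier (an SVM in [6]), and the predicted label copied back to every pixel of the corresponding object.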

2.2. Classification map relaxation

The spatial information can be utilized after spectral based pixel-wise classification to improve the classification results. To this end, a label relaxation process is applied to the pixel-wise classification map, utilizing the spatial contextual information to remove noisy labels and redundant parts of the map. In this case, the spectral information is used first, in the classification procedure, and the spatial information is used second, in the post-processing procedure. Two examples of classification map relaxation can be seen in Fig. 4. As seen, the relaxed (regularized) classification maps contain less salt and pepper noise. In the relaxation methods, the segmentation map is usually used for correction and regularization of the pixel-wise classification map [8,9].

In [10], relaxation of the classification is done by using a segmentation map obtained by generating super-pixels. In this method, the uniform local binary pattern (ULBP) is first applied to the HSI for extraction of local features, and then the support vector machine (SVM) classifier is applied to the ULBP feature cube to find the initial probabilistic classification map. The principal component analysis (PCA) transform is applied to the original HSI cube. The first three principal components are used to form a composite image; then, the entropy rate segmentation (ERS) method [11] is applied to the composite image to over-segment it into homogeneous regions. After that, to achieve super-pixels with more homogeneity, a region merging process is applied to the over-segmented map. Finally, the classification map obtained by the SVM classifier is guided by the super-pixel map to perform a soft decision fusion.

Probabilistic label relaxation, as a post-processing approach, incorporates the contextual information of a fixed local window. In [12], the SVM method is used both for the initial classification and for obtaining the class probability estimates for label relaxation in the post-processing stage.

A super-pixel based 3D deep neural network is proposed in [13] to improve the HSI classification in different structures and boundaries. The super-pixel construction results in an over-segmentation where the HSI is partitioned into non-overlapping regions. Each homogeneous region reveals the local structures of the HSI with adaptive shapes and sizes. The use of a 3D convolutional neural network (3D-CNN) may cause noisy classification maps. To cope with this problem, a weighted feature image (WFI) is constructed via super-pixels to allow spectral-spatial consistency in the classification output. To construct the WFI, the spectral pixels are linearly combined in each super-pixel. The WFI provides more spectral similarity between pixels within the super-pixels, and it also maintains the pixel diversity. Thus, the WFI not only preserves the regional consistency but also avoids eliminating the effects of the mixed pixels. In addition, to cope with the misclassification of the mixed pixels, the 3D recurrent CNN (3D-RCNN) is proposed for extraction of 3D features from the WFI. Moreover, the spectral-spatial information contained in each super-pixel is used to fill the super-pixel boundary in the 3D local neighborhood cube. The filled samples preserve the spectral-spatial similarity to the central pixel and therefore deal with misclassification at super-pixel boundaries. The HSI contains rich structural features but provides noisy classification maps. In contrast, the WFI lacks structural features but provides good spatial continuity. So, to achieve a balance between structure and homogeneous regions, both the HSI and the WFI are utilized to construct the 3D samples. The proposed super-pixel based 3D deep learning method has four steps. First, it creates super-pixels and constructs the WFI. Second, it constructs the 3D super-pixel based samples. Third, it explores 3D spectral-spatial features from the HSI via the 3D-CNN and from the WFI via the 3D-RCNN. Fourth, it classifies the HSI using multi-feature learning.

Kernel based classifiers such as SVM have a high ability to handle high dimensional data. Moreover, the use of an adaptive similarity measure instead of an un-weighted similarity measure such as the Euclidean distance involves the appropriate features related to a specific task such as classification. The most well-known metric for improving classification accuracy is the Mahalanobis distance. The SVM classifier with a Mahalanobis distance based kernel is used for the initial classification in [14]. The introduced classification method has two steps. First, SVM as a kernel based classifier is used to provide a spectral based classification map. A Mahalanobis metric based kernel is used in the proposed SVM classifier to achieve higher classification accuracy and reduce the computations. The second step is segmentation, where the posterior probabilities acquired from the SVM in the previous step are used for reconstruction of the spatial relationships in the HSI. For evaluation of the deviations among the pixels of each region, a kernel transformation is introduced. Finally, a graph cut method is used to segment the pseudo-image. The main contribution of [14] is the combination of a kernel based classification with segmentation through the multi-level logistic obtained by the SVM classifier.

The object based classifiers are preferred over the pixel-wise classifiers because they provide a smooth and noiseless classification map, which is more applicable in real scenarios. On the other hand, the main disadvantage of the object-oriented classifiers is that they result in an inaccurate classification map if the objects are inaccurately extracted. The image segmentation error and the classification error accumulate. If an object is misclassified, all pixels of that object are misclassified, which causes a large error. Segmentation methods such as watershed [15] and partitional clustering [16] are easy to use, have low computational complexity and reveal the spatial structures well. But they have two main disadvantages: first, the number of segments has to be set by the user; second, the segmentation result is not robust because it depends on the initialization values. In contrast, other segmentation methods such as statistical region merging (SRM) are not only easy to implement but also need no preset number of segments and provide robust segmentation results [17].
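The simplest form of the relaxation discussed in this section, in which a segmentation map is used as a mask and every segment receives the majority label of its pixels, can be sketched as follows. The function name `relax_by_majority_vote` and the toy maps are assumptions made for illustration; the segmentation map may come from any algorithm, and ties are broken here in favor of the smallest label, which is a choice of this sketch rather than something the reviewed methods prescribe.

```python
import numpy as np

def relax_by_majority_vote(class_map, segments):
    """Assign to every pixel the most frequent class label of its segment.

    class_map : (rows, cols) pixel-wise classification map
    segments  : (rows, cols) segmentation map (one integer id per segment)
    """
    class_map = np.asarray(class_map)
    segments = np.asarray(segments)
    relaxed = class_map.copy()
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        labels, counts = np.unique(class_map[mask], return_counts=True)
        # argmax returns the first maximum, so ties go to the smallest label.
        relaxed[mask] = labels[np.argmax(counts)]
    return relaxed

# Toy example: two segments (top row, bottom row), each with one noisy label.
class_map = np.array([[1, 2, 1],
                      [3, 3, 1]])
segments = np.array([[0, 0, 0],
                     [1, 1, 1]])
relaxed = relax_by_majority_vote(class_map, segments)
```

The sketch also makes the error accumulation noted above concrete: if a segment wrongly merges two classes, the vote overwrites the correctly classified minority pixels as well.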

3. Feature fusion

HSI classification at the feature fusion level can be done with two general approaches. In the first approach, the spatial features are extracted from the HSI and then combined with the spectral features through a combination method such as feature stacking or kernel based methods. In the second approach, the spectral-spatial features are extracted jointly to preserve the correlated nature of the HSI cube, where the spectral and spatial information is dependently and jointly contained in the 3D structure. Joint spectral-spatial feature extraction, representation based classifiers, 3D spectral-spatial feature extraction and deep learning based classifiers belong to this group. Different feature extraction methods can be used for extraction of geometric structures, shape and texture from the HSI. To extract spatial features, a window with a fixed or adaptive size is usually considered locally around each pixel; then, the spatial features are extracted from the neighborhood region. Two main challenges of the feature fusion methods are the selection of an appropriate window size for the neighborhood region and the need for a large number of training samples. Six different types of feature fusion methods are described and discussed as follows.

3.1. Feature stacking

In the stacking approach, the spatial features of each PC are extracted from the HSI and then stacked on the feature vector (the original spectral bands or the spectral features extracted by different feature extraction methods). Then, the stacked feature vectors are fed to an appropriate classifier to obtain the classification map. Fig. 5 shows a general form of feature fusion using the stacking approach. The PCA transform is applied to the original HSI to find its principal components (PCs), and m principal components are chosen. Then, n different spatial feature extraction methods are applied to each PC. Finally, all of the extracted spatial features are stacked on the original spectral bands to produce a long feature vector. Note that, instead of the original spectral bands, some spectral feature extraction methods may be applied to mine the spectral features. GLCM, Gabor, morphological and attribute filters are examples of the spatial feature extraction methods [18,19]; and PCA, linear discriminant analysis (LDA), binary coding based feature weighting (BCFE) [20] and feature space discriminant analysis (FSDA) [21] are examples of spectral feature extraction methods.

Fig. 5. General form of stacking approach for feature fusion.

As an example of feature fusion through stacking, we can refer to the attribute profile based FSDA (APFSDA) method [22]. The APFSDA method is an extension of FSDA. The FSDA method is originally a supervised spectral feature extraction method that simultaneously considers three measures: maximizing the between-class scatters, minimizing the within-class scatters, and maximizing the between-band scatters. While the first two measures increase class discrimination, the third measure increases the differences between the extracted features, which decreases overlapped features and redundancy. The APFSDA method adds the spatial information of attribute filters (AFs) to the FSDA method. AFs have great flexibility in the definition of attributes, which allows a high capability in extraction and modelling of various contextual features. However, AFs can be applied only to single-band grey-level images. The conventional way to apply AFs to the HSI is to reduce the HSI dimensionality using the PCA transform, and then apply AFs to each principal component individually. But PCA is an unsupervised feature extraction method based on the mean square error (MSE) measure, which is appropriate for representation based applications, not classification ones. To deal with this difficulty, APFSDA uses the FSDA method to find components with high class discrimination and low overlap. Fig. 6 illustrates the flowchart of the APFSDA method, which is explained in the following (for more details, the interested reader is referred to [22]).

Fig. 6. The APFSDA method as an example of stacking approach [22].

FSDA is applied to the HSI to find m components of the HSI. Then, the attribute profile (AP) of each component, which contains the attribute spatial features, is obtained. On each FSDA component $y_j$, $j = 1, \ldots, m$, the AP with attribute $a_k$ ($k = 1, 2, \ldots, s$), denoted by $AP_{a_k}(y_j)$, is achieved by applying a series of attribute thinning ($\gamma_i$) and attribute thickening ($\varphi_i$) filters with thresholds $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$:

\[ AP_{a_k}(y_j) = \left\{ \varphi_n(y_j), \ldots, \varphi_1(y_j), y_j, \gamma_n(y_j), \ldots, \gamma_1(y_j) \right\}; \quad j = 1, 2, \ldots, m; \; k = 1, 2, \ldots, s \tag{1} \]

For each pixel of the HSI associated with attribute k, the extended AP (EAP) is achieved by:

\[ EAP_{a_k}(x) = \left\{ AP_{a_k}(y_1), AP_{a_k}(y_2), \ldots, AP_{a_k}(y_m) \right\}; \quad k = 1, 2, \ldots, s \tag{2} \]

The extended multi-AP (EMAP) is acquired by applying s attributes:

\[ EMAP(x) = \left\{ EAP_{a_1}(x), EAP_{a_2}(x), \ldots, EAP_{a_s}(x) \right\} \tag{3} \]

The EMAP features stacked on the original spectral features are given to the multinomial logistic regression (MLR) classifier for classification. By jointly fusing spectral and spatial information, in addition to maximizing the class discrimination and minimizing redundancy in the FSDA process, the APFSDA method provides superior classification results compared to MFL and GCK.

The feature fusion method proposed in [23] extracts spectral features using traditional feature extractors and extracts spatial features using a deep CNN model. It stacks the spectral and spatial features and feeds them to a classifier such as SVM. A popular feature extraction method widely used in classification problems is linear discriminant analysis (LDA), which maximizes the inter-class scatters while minimizing the intra-class scatters. Various versions of LDA such as nonparametric weighted feature extraction (NWFE) [24], Imani method 1 [25], local fisher discriminant analysis [26], Imani method 2 [27], and local discriminant embedding (LDE) [28] have been introduced. LDE maximizes the inter-class scatters while keeping apart the neighboring samples of different classes by utilizing a graph embedding structure. A new version of LDE called balanced LDE (BLDE) is proposed in [23] for spectral feature extraction. BLDE considers the intra-class criterion in addition to the inter-class one in the objective function through the graph embedding structure used. To distinguish various materials with the same spectral features, it is necessary to incorporate the spatial information alongside the spectral information. Although there are various hand-engineered methods for spatial feature extraction, such as the grey-level co-occurrence matrix (GLCM), geometrical filters, Gabor filters, the morphological profile, the attribute profile and various types of wavelet transforms, they are limited in parameter configuration. That is, with specific parameter settings, only objects with a specific size, shape and texture are detected. So, with the limited parameter settings of traditional spatial feature extraction methods, the great variety present at low levels cannot be fully captured. In contrast to hand-engineered feature extraction methods, deep learning methods can automatically extract high-level, robust and efficient spatial features. To this end, the spectral features obtained by BLDE are combined with spatial features extracted by the CNN model in [23].

The spectral features and the spatial ones can be extracted individually and then simply stacked together to form a long feature vector. The obtained feature vector can be given to a classifier such as SVM. PCA, LDA and NWFE are used for spectral feature extraction, and the morphological profile (MP), Gabor filters and GLCM are utilized for spatial feature extraction in [29]. On the one hand, various spectral-spatial features provide different views of the HSI. Gabor, MP and GLCM provide diverse features of the HSI (directionality, shape and size, and randomness, respectively) that are complementary to each other. On the other hand, the obtained long feature vector contains redundant information and also causes over-fitting due to the curse of dimensionality and the limited number of training samples.
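The general stacking pipeline of this section (PCA, spatial feature extraction on each of the m chosen PCs, then stacking on the original bands) can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the local mean and local standard deviation in a square window are hypothetical stand-ins for the GLCM, Gabor, morphological or attribute features named in the text, and the function names, window size and random toy cube are illustrative choices, not part of any reviewed method.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def principal_components(cube, m):
    """Project every pixel spectrum onto the first m PCA directions."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    X -= X.mean(axis=0)
    # The right singular vectors of the centered data are the PCA directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (X @ vt[:m].T).reshape(rows, cols, m)

def local_mean_std(band, window=3):
    """Stand-in spatial features: mean and std in an odd-sized square window."""
    pad = window // 2
    padded = np.pad(band, pad, mode="reflect")
    patches = sliding_window_view(padded, (window, window))
    return patches.mean(axis=(-2, -1)), patches.std(axis=(-2, -1))

def stack_features(cube, m=2, window=3):
    """Stack n = 2 spatial features per PC on the original spectral bands."""
    pcs = principal_components(cube, m)
    spatial = []
    for j in range(m):
        spatial.extend(local_mean_std(pcs[..., j], window))
    spatial = np.stack(spatial, axis=-1)
    return np.concatenate([cube.astype(float), spatial], axis=-1)

# Toy cube: 6 x 5 pixels, 8 bands; the stacked vector has 8 + 2 * 2 features.
cube = np.random.default_rng(0).random((6, 5, 8))
features = stack_features(cube, m=2, window=3)
```

With m components and n spatial extractors, each pixel receives m · n additional features on top of the original bands. This is the growth in dimensionality that, with few training samples, leads to the over-fitting problem noted above.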

3.2. Joint spectral-spatial feature extraction

The multi-scale feature fusion is done by a Gaussian pyramid decomposition in [30]. First, segmented PCA is applied to the HSI for feature reduction: the spectral bands of the HSI are partitioned into subsets of adjacent spectral bands, and PCA is applied to each subset. The Gaussian pyramid of each subset is then obtained by successively applying a Gaussian kernel and a down-sampling operator to the segmented PCA outputs. Segmented PCA is applied again to the obtained Gaussian pyramid to increase the differences among HSI pixels and the discriminability of different pixels. The final extracted features are fed to the SVM classifier to achieve the classification map.

An unsupervised spectral-spatial feature extraction method, which is an extended version of locality preserving projection (LPP), is proposed in [31]. The method is implemented in two steps. In the first step, the HSI is filtered and a homogeneous neighborhood region is considered around each pixel for selection of the unlabeled samples. In the second step, the spectral-spatial features of the unlabeled samples are used to calculate the projection matrix with the LPP approach.

The authors of [32] utilize adaptive total variation filtering (ATVF) for de-noising and spectral-spatial feature extraction of the HSI. First, the PCA transform is applied to the HSI for feature reduction. Then, ATVF is applied to each component to extract noise-free spectral-spatial features. Finally, the extracted features are given to an extreme learning machine (ELM) for classification. The general total variation minimization model, which was originally proposed for image de-noising, consists of two terms. The first term is the total variation and the second term is the squared difference between the estimated (de-noised) image and the noisy one. The two terms are linked through a regularization parameter. The total variation can be calculated by using gradient operators.

The hierarchical guidance filtering (HGF) is proposed in [33] for joint extraction of spectral-spatial features. HGF is an extension of rolling guidance filtering and guided filtering. It produces a series of spectral-spatial feature sets, and the different hierarchical features provide contextual information at different scales. Then, a measure matrix, called the matrix of spectral angle distance, is defined to evaluate the quality of the extracted features in each hierarchy. Finally, the weighted voting rule, a popular ensemble strategy, is used to obtain the classification result.

Due to the presence of various geometrical features with different shapes in different locations of an image, the use of a fixed structuring element (SE) for building the MP is not efficient. To deal with this problem, the patch image-based morphological profile (EPIMP) is proposed in [34], which adaptively considers a specific SE for each area (patch) of the image. The SE chosen for each patch corresponds to the shape or edge image of that patch. The spatial features extracted by EPIMP provide more morphological information than the conventional MP. Combining the original spectral features with the MP while also utilizing the spatial information contained in the neighborhood region is proposed in [35].

Weber local descriptor (WLD) and quaternion representation (QR) are used for joint extraction of spectral and spatial features in [36]. To apply WLD and QR, PCA is first applied to the HSI to reduce the data dimensionality to three components. Then, the spectral-spatial features are extracted from the principal components. QR uses quaternion algebra, where a quaternion has four parts: a real part together with three imaginary parts. WLD provides two categories of features: differential excitation and orientation. WLD is not only robust to illumination changes and noise but also effectively extracts orientation, edge and texture features. Quaternion WLD (QWLD) is introduced based on QR. QWLD is obtained for each pixel of the HSI in a local region surrounding it. QWLD features are computed based on (1) the intensity of the center pixel and (2) the intensity variations of the local pixels in the neighborhood. The relation between the center pixel and its neighbors is described by (1) the differential excitation, which is the difference between the intensity of the central pixel and that of its neighbors, and (2) the orientation feature, which is obtained from the gradient and contains horizontal, vertical and diagonal orientation components. To enhance the discriminant ability of the QWLD features, the obtained features are fused through construction of a feature histogram in a local neighborhood. The histogram vector reflects the gradient information of the central pixel in a local neighborhood region. Due to the presence of homogeneous regions with different shapes and sizes in a HSI, multi-scale analysis is proposed in [36] to extract more intrinsic and accurate spatial information: the radius of the square window around each central pixel is varied to provide more spatial details from the neighborhood region. SVM and the sparse representation classifier (SRC) are finally used for classification of the fused features.

3.3. Kernel based classifiers

An elegant and efficient way of solving non-linear classification problems is to use kernel methods. The basic idea of a kernel method is to map the data from the original feature space to a convenient feature space (often with higher dimensionality) by applying a non-linear mapping function. A kernel function is used to compute inner products in the high dimensional feature space without explicitly knowing the mapping function and without explicitly transforming the data to that space. The benefit of the kernel trick is that non-linear problems can be solved with linear algorithms in the obtained feature space. The best-known kernel based machine, widely used in HSI classification, is the SVM. Different versions of SVM, such as subspace based SVM [37] and adaptive boosting [38], have been introduced. Other classifiers have also used the kernel method to improve their efficiency. For example, the kernelized versions of representation based classifiers such as SRC [39] and the collaborative representation based classifier (CRC) [40,41] have been used to improve HSI classification. Transforming to a higher dimensional space can increase the discrimination ability. An illustration of mapping to a higher dimensional feature space is shown in Fig. 7 [42]. According to this figure, two classes that are not linearly separable in the original space become linearly separable in the mapped space. So, kernel approaches can be useful for solving non-linear problems. Assume a HSI containing N pixels with d spectral bands, $\{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$. Given the non-linear mapping function $\varphi(x)$, the data samples are mapped from the input space to a high dimensional feature space as follows [43]:

$$\varphi : \mathbb{R}^d \rightarrow \mathbb{H}, \quad x \mapsto \varphi(x)$$

where $\mathbb{H}$ is the Hilbert space, i.e., the mapped feature space. The main idea of kernel based methods, known as the kernel trick, is to define a kernel function in the input space.
The kernel function indirectly computes the dot product in the mapped feature space:

$$K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{4}$$

where $K(\cdot,\cdot)$ denotes the kernel function and $\langle \cdot,\cdot \rangle$ represents the dot product. The kernel function satisfies the conditions of Mercer's theorem, i.e., it is symmetric, positive semi-definite and continuous [44]. Some of the popular kernel functions are [45]:

Linear kernel:
$$K(x_i, x_j) = \langle x_i, x_j \rangle \tag{5}$$

Polynomial kernel:
$$K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d; \quad d \in \mathbb{Z}^+ \tag{6}$$

Gaussian or radial basis function (RBF) kernel:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right); \quad \sigma \in \mathbb{R}^+ \tag{7}$$

Fig. 7. An illustration of mapping to a higher dimensional feature space in the kernel based methods.

The most widely used kernel in HSI classification problems is the RBF kernel, because its Fourier transform is also Gaussian and it is translation invariant. Assume each sample $x_i$ has a label $y_i$, where the training set is $\{(x_i, y_i); i = 1, \ldots, N\}$ and $y_i \in \{-1, +1\}$. The aim of kernel based classifiers such as SVM is to find a classification hyper-plane in the Hilbert space with maximum margin. For example, the optimization objective function of the standard binary SVM can be formulated as [46]:

$$\max L(\alpha_i, \alpha_j) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{N} \alpha_i$$
$$\text{s.t.} \quad \alpha_i, \alpha_j \in [0, c], \ \forall i, j = 1, \ldots, N; \quad \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{8}$$

where $\alpha_i$ and $\alpha_j$ denote the Lagrange multipliers. The samples $x_i$ whose associated $\alpha_i$ are non-zero are called support vectors. The support vectors determine the decision hyper-plane. Binary SVMs can be run in parallel for multi-class problems. The standard SVM uses a single kernel, which lacks the generalization capability needed to cope with multi-class and multi-dimensional data. Because of the limitations of choosing a single kernel, and in order to better fit the kernel to the complex structure of the data, multiple kernel learning (MKL) methods have been introduced [47]. MKL has been proposed to explore the information of the HSI with more flexibility than single kernel based methods [48]. MKL is one of the fusion approaches for combining different sub-features extracted by different operators or acquired by different sensors. MKL algorithms combine various features to be used in a kernel based task such as regression or classification. The aim of MKL is to generate a composite kernel through a linear or non-linear combination of some base kernels. Each base kernel can exploit a subset of the features or the full set of them. The weights of the basis kernels are all non-negative and sum to one, which keeps the composite kernel positive semi-definite and normalized. In an MKL problem, there are two categories of unknown parameters to be solved: (1) the unknown parameters of the original learning problem and (2) the combining weights. The chosen learning approach determines the former, and there are several strategies for determining the combining weights of the basis kernels. These strategies are divided into: (1) criterion based approaches, where a criterion function is used for obtaining the kernel weights; (2) optimization approaches, where the kernel weights are computed by solving an optimization problem; and (3) ensemble approaches, where a new base kernel is iteratively added to the composite kernel until the cost function reaches its minimum. In MKL, some basis kernels are linearly combined to form a convex function, where each kernel exploits a subset or the full set of features:

$$K(x_i, x_j) = \sum_{m=1}^{M} \beta_m K_m(x_i, x_j) \quad \text{s.t.} \quad \sum_{m=1}^{M} \beta_m = 1, \ \beta_m \geq 0 \tag{9}$$

where $\beta_m$ indicates the weight of kernel m and M is the number of basis kernels. The objective function of MKL is given by:

$$\max L(\alpha_i, \alpha_j) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \beta_m K_m(x_i, x_j) + \sum_{i=1}^{N} \alpha_i$$
$$\text{s.t.} \quad \alpha_i, \alpha_j \in [0, c], \ \forall i, j = 1, \ldots, N; \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \quad \sum_{m=1}^{M} \beta_m = 1, \ \beta_m \geq 0 \tag{10}$$

To combine the basis kernels and obtain a composite kernel, the following three steps are introduced in [58]:

(1) Pixel definition: the pixel is redefined by its spectral features $x_i^w \in \mathbb{R}^{N_w}$ and its spatial features $x_i^s \in \mathbb{R}^{N_s}$, where $N_w$ and $N_s$ are the number of spectral features and the number of spatial ones, respectively.
(2) Kernel construction: any type of kernel can be constructed on $x_i^w$ and $x_i^s$.
(3) Kernel combination: the composite kernel can be computed by a simple summation of basis kernels in different ways. The kernel containing the spectral features is denoted by $K_w$ and the kernel containing the spatial features by $K_s$. The kernels containing the cross-information between spectral and spatial features are denoted by $K_{sw}$ and $K_{ws}$.

Four different combinations of kernels are reported in [49]:

(1) Stacked features:
$$K_{\{s,w\}} \equiv K(x_i, x_j) \tag{11}$$
where $x_i \equiv \{x_i^w, x_i^s\}$ is the stacked spectral and spatial feature vector.

(2) Direct summation:
$$K(x_i, x_j) = K_s(x_i^s, x_j^s) + K_w(x_i^w, x_j^w) \tag{12}$$

(3) Weighted summation:
$$K(x_i, x_j) = \mu K_s(x_i^s, x_j^s) + (1 - \mu) K_w(x_i^w, x_j^w) \tag{13}$$
where $0 < \mu < 1$ is used to provide a tradeoff between using the spectral features and the spatial ones.

(4) Cross-information:
$$K(x_i, x_j) = K_s(x_i^s, x_j^s) + K_w(x_i^w, x_j^w) + K_{sw}(x_i^s, x_j^w) + K_{ws}(x_i^w, x_j^s) \tag{14}$$
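As a rough illustration of the weighted summation kernel in Eq. (13), the following NumPy sketch builds one RBF kernel on spectral features and one on spatial features and mixes them with the weight $\mu$. The feature matrices, the kernel widths and the function names (`rbf_kernel`, `weighted_composite_kernel`) are hypothetical choices for this example and are not taken from [49].

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def weighted_composite_kernel(Xw_a, Xs_a, Xw_b, Xs_b, mu, sigma_w, sigma_s):
    """Weighted summation mu * K_s + (1 - mu) * K_w, with 0 < mu < 1."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    K_w = rbf_kernel(Xw_a, Xw_b, sigma_w)   # spectral kernel
    K_s = rbf_kernel(Xs_a, Xs_b, sigma_s)   # spatial kernel
    return mu * K_s + (1.0 - mu) * K_w

rng = np.random.default_rng(0)
n = 30
Xw = rng.normal(size=(n, 10))   # hypothetical spectral features per pixel
Xs = rng.normal(size=(n, 4))    # hypothetical spatial features per pixel

K = weighted_composite_kernel(Xw, Xs, Xw, Xs, mu=0.4, sigma_w=3.0, sigma_s=1.5)
# A convex combination of valid kernels should stay positive semi-definite.
min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
```

The resulting Gram matrix can be handed to any kernel machine that accepts a precomputed kernel, for instance an SVM. Setting `mu` close to 0 or 1 recovers an almost purely spectral or purely spatial classifier, which is the tradeoff described for Eq. (13).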


Generally, in an MKL algorithm, an optimal composite kernel is obtained by combining a few basis kernels constructed from various feature subsets. Finding the optimal kernel is a complex optimization problem. To simplify the problem, a data-dependent MKL framework is proposed in [50]. The framework defines three measures to estimate the goodness of the composite kernel. The measures are based on the similarity between an ideal kernel and the composite one. In addition, meta-heuristic algorithms, which are accurate and fast to run, are used to solve the optimization problem. A review of different types of MKL methods and their solutions is given in [51]. It is shown that MKL performs well for heterogeneous features under ill-posed conditions. Features extracted from different sources convey different meanings and have different statistical significance, so their roles in the classification problem are different. Therefore, feature stacking is not an appropriate choice for HSI classification, while MKL is a suitable approach for handling heterogeneous features. The generalized composite kernel (GCK) method uses composite kernels with great flexibility in combining spectral and spatial information without requiring any weight parameters [52]. GCK uses the MLR classifier, which is very flexible in non-linear kernel construction and offers good control of the generalization capacity through the logistic regressors. The GCK method uses the following function as the input of an MLR classifier [53]:

$$h(x_i) = [1, K^T(x_i, x_1), \ldots, K^T(x_i, x_N)]^T \tag{15}$$

where N is the number of training samples. The kernel $K(x_i, x_j)$ can be a simple stacking of the spectral and spatial kernels:

$$K(x_i, x_j) = [K_w(x_i^w, x_j^w), K_s(x_i^s, x_j^s)]^T \tag{16}$$

in which case $h(x_i)$ becomes:

$$h(x_i) = [1, K_w(x_i^w, x_1^w), \ldots, K_w(x_i^w, x_N^w), K_s(x_i^s, x_1^s), \ldots, K_s(x_i^s, x_N^s)]^T \tag{17}$$

To include the cross-information between the spectral and spatial features, the following cross kernel can be used instead of (16):

$$K(x_i, x_j) = [K_w(x_i^w, x_j^w), K_s(x_i^s, x_j^s), K_{ws}(x_i^w, x_j^s), K_{sw}(x_i^s, x_j^w)]^T \tag{18}$$

Conventional composite kernel or MKL methods need a convex combination of kernels, while GCK has no convexity restriction. The linear combination of the basis kernels used in the composite kernel, which is included in the MLR objective function, is more flexible than the convex combination of kernels in MKL, which assigns fixed weights to the kernels. GCK allows more freedom for balancing the spatial and spectral information. The multiple feature learning (MFL) method integrates various features extracted by different linear and non-linear transformations to handle the linear and non-linear boundaries of the present classes [54]. Similar to GCK, MFL also uses the MLR classifier. The MLR classifier and its different versions have some advantages: (1) fast computation, (2) good generalization capability, and (3) an open and flexible structure in which MKL and composite kernels can be simply modeled [55–57]. MLR classifiers directly learn the posterior class probabilities and can effectively cope with the high dimensionality of the HSI.

The kernel based classifiers such as SVM have some advantages [58]: (1) less sensitivity to the number of training samples, because only the samples close to the class boundaries, i.e., the support vectors, are considered; (2) being non-parametric, with no need to know the data distribution; (3) easy implementation; (4) self-adaptivity; (5) a convex cost function that results in an optimal solution; and (6) a fast training stage. The main drawback of the kernel based methods is their sensitivity to the kernel parameters, where an inappropriate choice may cause over-fitting or over-smoothing. Selecting a small value for the kernel width parameter may lead to over-fitting. In contrast, assigning a large value may lead to over-smoothing. In addition, appropriate selection of the regularization parameter, which provides a tradeoff between minimizing the training error and maximizing the margin, is highly important.

3.4. Representation based classifiers

The nearest regularized subspace (NRS) classifier incorporates distance weighted regularization into nearest subspace classification [59]. In NRS, each testing sample is approximated by a linear combination of the training samples of each class. The class that yields the minimum residual is assigned to the testing pixel. The L2 norm used in the collaborative representation of the samples to be classified provides a closed form solution. In addition, the L2 norm regularization term copes with the ill-posedness of the inverse problem. The main disadvantage of NRS is that it ignores the spatial information of the HSI, where two adjacent pixels likely belong to the same class. The joint collaborative representation (JCR) method [60] and the weighted JCR (WJCR) [61] have been proposed to deal with this problem. In JCR and WJCR, the neighboring pixels of the testing sample are simultaneously approximated through a collaborative representation of training samples. While the same weight is assigned to all neighboring pixels in JCR, larger weights are assigned in WJCR to the neighboring pixels that are more similar to the central pixel. JCR/WJCR calculates an average/weighted average of the pixels in a neighborhood window. Although they include the spatial information in the classification process in homogeneous and smooth areas, they may degrade the classification performance in neighborhood regions containing edges and class boundaries. To deal with this disadvantage of JCR and WJCR, the edge-preserving-based collaborative representation (EPCR) has been proposed in [62], which utilizes an edge image for correction of the weights and residual values in the collaborative representation. The edge image is calculated by estimating the discontinuity across all spectral bands.

Two other versions of WJCR, WJCR based on angular separation (WJCR-AS) and WJCR based on median-mean line (WJCR-MML), have been proposed in [63]. WJCR-AS uses the angular separation metric, i.e., the cosine distance, for calculating the weights in WJCR. Although the neighboring pixels that are more similar to the central pixel should be assigned larger weights, neighboring samples that are highly correlated with the center may cause redundancy, leading to classification degradation. So, the weights of the neighbors should have an inverse relationship to the AS metric calculated between the central pixel and its neighbors. The presence of outlier samples in the neighborhood region may shift the weighted mean of the local area away from its real value. To deal with this problem, the WJCR-MML method uses the median-mean line metric instead of the simple mean metric for calculating the weighted mean of each neighboring region. The MML metric can rectify the position of outlying neighbors and so improves the classification performance.

Sparse representation (SR), as a powerful tool, has recently been widely used in different applications of image processing such as HSI classification. SR is based on the idea that pixels with the same class label have similar spectral signatures, and thus a testing pixel can be linearly approximated by a small number of training samples of a class. The conventional SR based HSI classifier considers only the spectral information, which is not enough to obtain an accurate classification map. To deal with this problem, the joint SR (JSR) based classifier has been proposed in [64]. JSR is based on the assumption that the pixels in a local region are likely composed of the same materials with similar spectral signatures. So, the JSR classifier, by considering the pixels in a local region and exploiting the spatial information, improves the classification accuracy. However, the conventional JSR classifier ignores the fact that the pixels of a local region may not belong to the same class, in which case the classifier performance is seriously degraded. To deal with this difficulty, it is proposed to use the correlation coefficient measure beside the SR one in the classification process [65]. To this end, the spectral similarity among the testing samples and the training ones is calculated and used beside the residual obtained by the JSR measure to find the class label of the unseen testing samples.

The representation based classifiers do not make any assumption about the statistical distribution of the data and do not need to compute any statistics of the HSI. Thus, they are appropriate classifiers when there is no prior knowledge about the image distribution or there are not sufficient training samples for estimating statistics. The representation based classifiers are non-parametric methods that directly determine the label of a testing pixel using a structured dictionary. The basic idea is that a testing sample can be linearly approximated by the training dictionary. The computed coefficients of this approximation represent how important each dictionary atom is. The representation based classifiers are generally divided into two main categories: the sparse representation based classifier (SRC) and the collaborative representation based classifier (CRC). In the SRC method, the testing sample is sparsely approximated by a few atoms of the dictionary through an L1 minimization problem. In the CRC method, the testing sample is collaboratively represented by all atoms of the dictionary using an L2 minimization problem. SRC provides a compact representation of the HSI at a high computational cost due to the L1-norm optimization problem. In contrast, CRC exploits the information of all atoms and achieves a closed form solution through the L2-norm optimization problem. Since SRC and CRC each capture some good characteristics of the HSI, fusion methods can be used to integrate the advantages of both. In contrast to conventional classifiers, the kernel based methods can significantly improve the classification accuracy by utilizing the complex structure of the given data in the kernel space [66]. In a kernel based method, the samples are projected into a high dimensional space by applying a non-linear mapping.
In the new high dimensional feature space, the complex non-linear structure of the data samples, which may not be accurately represented by linear models, is exploited. To benefit from the kernel trick, the kernel based SRC (KSRC) and the kernel based CRC (KCRC) have been proposed and fused together in [67] to achieve the benefits of both sparse and collaborative representation in the kernel space. To this end, the data is first mapped to the kernel feature space. Then, the coefficients of the sparse and collaborative representations are computed separately to find each of the residuals individually. Finally, the obtained residuals are combined through an adjusting parameter. The fused residual is used to determine the class label of each testing sample. The fused method shows higher discriminative ability than simple SRC and CRC and their kernelized versions. In the following, the two main representation based classifiers, the collaborative representation based classifier (CRC) and the sparse representation based classifier (SRC), are presented.

3.4.1. Collaborative representation based classifier (CRC)

Each testing pixel y can be approximated by using the dictionary of each of the c given classes. The dictionary of each class is composed of atoms constituted by the spectral features, spatial features or spectral-spatial features of the training samples of that class. Let $X_i$ be the subspace or dictionary of class $i$ ($i = 1, \ldots, c$). The testing sample y is estimated individually by each of the subspaces. The class whose subspace (dictionary) best approximates the testing sample is assigned to the testing pixel. The representation of the testing sample y through the subspace of class i is obtained by solving the following objective function [68]:

$$\arg\min_{\alpha_i} \|y - X_i \alpha_i\|_2^2 + \lambda_i \|\alpha_i\|_2^2; \quad i = 1, \ldots, c \tag{19}$$

where $\alpha_i$ and $\lambda_i$ are the weight vector and the regularization parameter, respectively. The regularization parameter provides a tradeoff between the residual term and the regularization one. After some computations (and dropping the term $y^T y$, which does not depend on $\alpha_i$), (19) is simplified as follows:

$$\arg\min_{\alpha_i} \left[\alpha_i^T (X_i^T X_i + \lambda_i I) \alpha_i - 2\alpha_i^T X_i^T y\right]; \quad i = 1, \ldots, c \tag{20}$$

The derivative of the above function is taken and set to zero to compute the weight vector:

$$\hat{\alpha}_i = (X_i^T X_i + \lambda_i I)^{-1} X_i^T y; \quad i = 1, \ldots, c \tag{21}$$

Some of the dictionary samples are more similar to the testing pixel y than others. So, it is proposed that larger weights are assigned to the samples that are more similar to the testing one. In other words, the more similar atoms of the dictionary play a larger role in the representation of y. To this end, the distance weighted Tikhonov matrix is defined to adjust and regularize the weight vector:

$$\Gamma_{y_i} = \begin{bmatrix} \|y - x_{i,1}\|_2 & & 0 \\ & \ddots & \\ 0 & & \|y - x_{i,n_i}\|_2 \end{bmatrix}; \quad i = 1, \ldots, c \tag{22}$$

where $x_{i,k}$; $k = 1, \ldots, n_i$ are the samples of dictionary $X_i$ and $n_i$ denotes the number of atoms in $X_i$. The matrix $\Gamma_{y_i}$ contains the Euclidean distances between the testing pixel y and each of the atoms in $X_i$. Then, the optimization problem in (19) is updated as follows:

$$\arg\min_{\alpha_i} \|y - X_i \alpha_i\|_2^2 + \lambda_i \|\Gamma_{y_i} \alpha_i\|_2^2; \quad i = 1, \ldots, c \tag{23}$$

The above problem has the following closed-form solution:

$$\hat{\alpha}_i = (X_i^T X_i + \lambda_i \Gamma_{y_i}^T \Gamma_{y_i})^{-1} X_i^T y; \quad i = 1, \ldots, c \tag{24}$$

The representation error in subspace i is computed by calculating the residual [69]:

$$r_i(y) = \|y - \hat{y}_i\|_2 = \|y - X_i \hat{\alpha}_i\|_2; \quad i = 1, \ldots, c \tag{25}$$

The class label of the testing sample, $l_y$, is determined by:

$$l_y = \arg\min_{i=1,\ldots,c} r_i(y) \tag{26}$$

3.4.2. Sparse representation based classifier (SRC)

Each testing sample y can be sparsely represented by the training samples of the class to which it belongs. The sparse representation of y can be achieved by a linear combination of the c dictionaries of the available classes, $X_i$; $i = 1, \ldots, c$, as follows:

$$y = \sum_{i=1}^{c} X_i \beta_i = X\beta \tag{27}$$

where X denotes the union dictionary composed of the dictionaries of all classes and $\beta$ is the concatenation of all sparse vectors, containing a few nonzero entries. The sparse vector $\beta$ can be calculated by [70]:

$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 \quad \text{s.t.} \quad \|\beta\|_0 \leq L \tag{28}$$

where $\|\cdot\|_0$ indicates the $l_0$ norm, i.e., the number of nonzero entries (called the sparsity level), and L is the given upper bound on the sparsity level. Any greedy pursuit algorithm such as orthogonal matching pursuit [71] can be used for solving this optimization problem. The dictionary matrix X can be decomposed into $X_i$; $i = 1, \ldots, c$, and the sparse vector $\hat{\beta}$ is likewise decomposed into $\hat{\beta}_i$; $i = 1, \ldots, c$, to obtain a partial estimate of y from the dictionary of each class individually. The representation error of each subspace can be computed by:

$$r_i(y) = \|y - \hat{y}_i\|_2 = \|y - X_i \hat{\beta}_i\|_2; \quad i = 1, \ldots, c \tag{29}$$

Similar to CRC, the class label of the testing sample y is given by $l_y = \arg\min_{i=1,\ldots,c} r_i(y)$.

The representation based classifiers are non-parametric methods that do not require any knowledge about the data distribution. They avoid the heavy computations of a training process and operate directly on the given dictionary. The two main representation based classifiers are CRC and SRC. The CRC method, due to using the $l_2$-norm in its optimization process, is simpler than SRC and results in a closed form solution. In contrast, SRC, by including the sparsity constraint through an $l_1$-norm in its objective function, avoids involving unrelated and redundant training atoms in the pixel reconstruction.
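The closed-form collaborative representation of Eqs. (21) and (24), the residual rule of Eqs. (25) and (26), and the sparse representation of Eqs. (28) and (29) can be sketched in NumPy as follows. This is only an illustrative sketch: the synthetic class spectra (Gaussian bumps), the noise level, the value of the regularization parameter and the helper names (`crc_residuals`, `omp`, `src_residuals`) are assumptions made for the example, and the small orthogonal matching pursuit routine is a textbook variant rather than the exact solver of [71].

```python
import numpy as np

def crc_residuals(y, dictionaries, lam=1e-2, distance_weighted=True):
    """Residual r_i(y) of Eq. (25) for each class dictionary X_i (columns are atoms)."""
    residuals = []
    for X in dictionaries:
        if distance_weighted:
            # Diagonal Tikhonov matrix of Eq. (22): distance from y to each atom.
            gamma = np.linalg.norm(X - y[:, None], axis=0)
            reg = np.diag(gamma**2)             # Gamma^T Gamma in Eq. (24)
        else:
            reg = np.eye(X.shape[1])            # plain ridge term of Eq. (21)
        alpha = np.linalg.solve(X.T @ X + lam * reg, X.T @ y)
        residuals.append(np.linalg.norm(y - X @ alpha))
    return np.array(residuals)

def omp(X, y, sparsity):
    """Greedy solver for Eq. (28); assumes unit-norm columns and sparsity >= 1."""
    residual, support = y.copy(), []
    coef = np.zeros(X.shape[1])
    for _ in range(sparsity):
        scores = np.abs(X.T @ residual)
        if support:
            scores[support] = -np.inf           # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef[support] = sol
    return coef

def src_residuals(y, dictionaries, sparsity=3):
    """Residual r_i(y) of Eq. (29) using the union dictionary X = [X_1, ..., X_c]."""
    X = np.hstack(dictionaries)
    beta = omp(X, y, sparsity)
    splits = np.cumsum([D.shape[1] for D in dictionaries])[:-1]
    return np.array([
        np.linalg.norm(y - D @ b)
        for D, b in zip(dictionaries, np.split(beta, splits))
    ])

# Synthetic example: three classes whose mean spectra peak at different bands.
rng = np.random.default_rng(0)
bands, atoms_per_class = 20, 8
band_idx = np.arange(bands)
means = np.array([np.exp(-(band_idx - c) ** 2 / 8.0) for c in (4, 10, 16)])
dictionaries = []
for m in means:
    D = m[:, None] + 0.05 * rng.normal(size=(bands, atoms_per_class))
    dictionaries.append(D / np.linalg.norm(D, axis=0))    # unit-norm atoms

y = means[1] + 0.05 * rng.normal(size=bands)               # test pixel from class 1
y /= np.linalg.norm(y)

label_crc = int(np.argmin(crc_residuals(y, dictionaries)))  # Eq. (26)
label_src = int(np.argmin(src_residuals(y, dictionaries)))
```

Setting `distance_weighted=False` gives the plain ridge solution of Eq. (21), while the default uses the distance weighted Tikhonov matrix. In both classifiers the decision is simply the class with the smallest residual, so a fused rule such as a weighted sum of the two residual vectors could be built on top of the same two functions.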


where 𝑁 (𝑥, 𝑦, 𝜆) =

1 3 2

(2𝜋) 𝜎 3

𝑒

( ) − 12 𝑥2 +𝑦2 +𝜆2 2𝜎

is the Gaussian component and ( )) ( 𝐸 (𝑥, 𝑦, 𝜆) = 𝑒𝑥𝑝 𝑗2𝜋 𝑓𝑥 𝑥 + 𝑓𝑦 𝑦 + 𝑓𝜆 𝜆

(32)

represents the sinusoidal component. (x, y) indicates the spatial indices and 𝜆 denotes the wavelength variable. parameter 𝜎 is used to control the width of the Gaussian function that determines the filter scale. (fx , fy , f𝜆 ) is the frequency of E(x, y, 𝜆) that shows the central frequency of the 3D Gabor filter [78]. Although the spectral-spatial features extracted by 3D Gabor filters leads to an accurate classification map, the high number of features and heavy computations limit applicability of 3D Gabor filter bank. To deal with this difficulty, a 3D Gabor phase coding (3D GPC) is introduced in [79] that is used together with a hamming distance based matching for HSI classification. To overcome the large volume of Gabor features, the 3D GPC method just exploits the phase features of Gabor instead of the magnitude features. In addition, it just uses the Gabor filter with certain orientations, i.e., only directions parallel to the spectral axis. These directions involve more discriminative features compared to other ones. The extracted Gabor features are then encoded by a quadrant bit coding algorithm. For classification, the nearest neighbor classifier is utilized where the similarity between pixels is measured by the normalized hamming distance matching method. The experiments show good performance of 3D GPC in terms of both generalization ability and computational complexity. In [74], 3D MP, 3D Gabor and 3D LBP [80] are introduced to extract joint spectral-spatial features. The extracted features are then fused through a multi-task sparse representation framework to achieve the classification map. According to the sparse theory, each testing pixel can be sparsely approximated by the subspace containing the training samples of the class that belong to it. The class label of the testing sample is determined by checking which class yields the smallest reconstruction error. 
In the multi-task sparse classifier proposed in [74], the label of the testing sample is determined according to the least reconstruction error over the three sets of the obtained 3D spectral-spatial features. The one/two dimensional empirical mode decomposition (EMD) method is extended to three dimensional EMD (3D-EMD) to treat a HSI as a cube [81]. 3D EMD decomposes the HSI into 3D intrinsic mode functions (3D-IMFs) where each of them is a varying oscillation and an extracted 3D feature. Due to the increased burden caused by added dimensions, two approaches are taken. The use of 3D Delaunay triangulation which determiners the extrema distances and the use of separable filters for envelops generation. In other words, rather than implementation of sophisticated 3D filter, a 1D filter is performed three times to acquire the same results as the 3D filter. Thus, the computational burden is significantly reduced. The extracted 3D features are given to a robust multi-task learning classifier where each IMF is taken as a task. 3D implementation of wavelet transforms is proposed for extraction of contextual features from the HSI cube [21] of 3D. In [82], 3D discrete wavelet transform (3D-DWT) is used for spatial feature extraction. The 3D extracted features are fed to the SVM classifier to provide the probabilistic classification map. Then, the MRF is used for exploration of local spatial dependencies and correlation among neighboring pixels. After that, the maximum a posterior (MAP) classification is formulated. The Bayesian optimization problem is solved by 𝛼-Expansion min-cut algorithm. The 3D-DWT transform can encode the spatial details and approximation of cube in different frequencies, scales and orientations. The 3D-DWT transform can be implemented by applying three 1D-DWT in three dimensions of HSI cube: weight and height of spatial dimensions and the spectral dimension (see Fig. 9). 
The 3D scattering wavelet transform is proposed for spatial filtering of HSI through applying 1- a cascade of wavelet decompositions, 2complex modulus, and 3-local weighted averaging [83]. Compared to

Fig. 8. A Gabor filter bank containing 13 filters in the frequency domain.

3.5. 3D spectral-spatial feature extraction To exploit the spatial information, various feature extraction methods have been introduced [72]. Among different spatial feature extraction methods, it can be referred to the MP, local binary pattern (LBP) [73] and Gabor filters. MP by exploiting two mathematical operations called erosion and dilation on the principal components of HSI, extracts the geometrical structures with different shapes from the HSI. For applying LBP to each single band image, a local region is considered around each pixel of image. Then, a binary code is assigned by comparing the grey level of the central pixel with that of the surrounding pixels. Then, by accounting the occurrence repetition of the obtained pattern over the neighborhood region, the statistical histogram is achieved. The Gabor wavelet transform is a powerful filter for extraction of texture from a single band image by providing the optimal joint space-frequency resolutions. As said, each of MP, LBP and Gabor methods can be applied to a 2D image. In another words, for implementation of each of them on the HSI, each spectral channel should be treated individually. Due to independent extraction of spatial features from each spectral band, the joint spectral-spatial information contained in the HSI cannot be exploited. To deal with this difficulty, 3D spectral-spatial feature extraction is required to reveal the 3D inherent structure of HSI. Morphological filters analyze an image by applying a 2D structuring element (SE) with specific shape and size. The conventional morphological filters are two dimensional and ignore the spectral-spatial dependencies in 3D structure of HSI. To explore the joint spectral-spatial morphological information of HSI, 3D morphological profile (3D-MP) method is introduced in [74]. 3D-MP which is an extension of 2D-MP is directly implemented on 3D HSI cube through using the 3D SEs. Two basics operators of 3D-MP are erosion and dilation. 
The 3D erosion of an HSI with a 3D SE is defined as the minimum of the pixel values inside the 3D SE. Its dual operator, 3D dilation, is defined as the maximum of the pixel values contained in the 3D SE. The 3D opening and closing operators are then defined based on these 3D erosion and dilation filters [74]. Due to the 3D structure of the HSI and the tight correlation between spectral and spatial information, 3D Gabor filters are preferred to 2D Gabor filters for HSI analysis [75]. A Gabor filter is computed by modulating a Gaussian function by a sinusoidal one. A filter bank containing 13 filters in the frequency domain is shown in Fig. 8 [76]. A 3D Gabor filter can be defined in the spectral-spatial feature space as follows [77]:

$$G_{f,\varphi,\theta}(x, y, \lambda) = N(x, y, \lambda)\, E(x, y, \lambda) \tag{31}$$
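The product form of the 3D Gabor filter given above can be illustrated with a short sketch. It assumes that N is an isotropic Gaussian envelope with standard deviation `sigma` and that E is a complex sinusoid whose frequency vector has magnitude f and a direction set by the angles φ and θ in spherical coordinates. This parameterization, the function names and the default values are assumptions made for illustration; the exact definitions are those of [77].

```python
import cmath
import math


def gabor3d(x, y, lam, f, phi, theta, sigma=1.0):
    """Value of an assumed 3D Gabor filter at the point (x, y, lam)."""
    # Frequency vector of magnitude f pointing along (phi, theta).
    u = f * math.sin(phi) * math.cos(theta)
    v = f * math.sin(phi) * math.sin(theta)
    w = f * math.cos(phi)
    # N(x, y, lam): Gaussian envelope (assumed isotropic and unnormalized).
    envelope = math.exp(-(x * x + y * y + lam * lam) / (2.0 * sigma ** 2))
    # E(x, y, lam): complex sinusoidal carrier.
    carrier = cmath.exp(2j * math.pi * (u * x + v * y + w * lam))
    return envelope * carrier


def gabor_kernel(size, f, phi, theta, sigma=1.0):
    """Sample the filter on a size x size x size grid centered at the origin.

    size is assumed to be odd so that the grid has a central voxel.
    """
    half = size // 2
    coords = range(-half, half + 1)
    return [
        [[gabor3d(x, y, lam, f, phi, theta, sigma) for lam in coords] for y in coords]
        for x in coords
    ]
```

A filter bank is obtained by evaluating `gabor_kernel` for several combinations of f, φ and θ; convolving the HSI cube with each kernel and taking the magnitude of the response gives one spectral-spatial feature cube per filter.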



CNN models become over-trained when there are not enough training samples. This means that the network performs well only on the seen training samples and performs poorly on unseen testing samples. One strategy to overcome this problem is to use the generative adversarial network (GAN) [90]. GAN, as a regularization method, involves two models: a generative model (G) and a discriminative one (D). While G generates fake samples that are as similar as possible to the real samples, D discriminates between real and fake samples. G and D are trained in an adversarial way in which each tries to reach its optimal result (the best classification result for D and the generation of fake data that are as realistic as possible for G). The samples generated by GAN can be utilized as virtual samples to improve the classification accuracy [91]. The use of GAN alongside deep networks is investigated in [92] and [93]. A 3D GAN in combination with a CNN model is introduced in [94] for spectral-spatial HSI classification. CNN together with GAN can provide better classification performance than the conventional CNN. 3D GLCM [95] and 3D shearlet transforms [96] are other examples of 3D feature extraction from the HSI cube. Tensors are also appropriate mathematical tools for processing 3D images such as HSIs [97].

3.6. Deep learning based classifiers

Deep networks can jointly extract spectral and spatial features from the HSI data. Deep neural networks work by extracting features from the raw input data through layer-by-layer processing. However, conventional deep neural networks have some difficulties: they require a large number of training samples and considerable effort for tuning the hyper-parameters. These difficulties are addressed by the deep network proposed in [98], which uses multi-grained cascade forests where the output of each cascade level is passed to the next level for further processing. Two types of forests are used in each level to increase diversity. Training the multi-grained cascade forest is much easier than training a conventional deep neural network. Several different features are first extracted for each pixel, and the obtained features are then given to the deep random forest classifier. By utilizing the information of neighboring pixels, the spectral-spatial information is fused effectively to improve the class discrimination. The last layer of the network determines the classification probabilities. The original spectral features, the features extracted by the discrete cosine transform, the features of a wavelet transform and the extended morphological profile are used as the input of the proposed deep random forest classifier. There are different types of deep learning networks. CNN is among the best-known networks used for HSI feature extraction and classification [99]. The convolutional and pooling layers are stacked alternately, where the output of each layer is given as input to the subsequent layer. Finally, the last feature map is given to a fully connected layer (FCL) to form the final feature vector for classification through a softmax layer [100].
While the shallower convolutional layers explore the detailed structures of objects such as edges, more abstract features are mined by the deeper convolutional layers. An example of patch based processing of HSI for pixel based classification using the CNN model is shown in Fig. 10. A CNN based spectral-spatial feature fusion method is proposed in [101]. The proposed framework consists of three steps: the use of a local spatial constraint and a non-local spectral constraint for sample augmentation; feature fusion; and classification. In the first step, the local spatial constraint utilizes the contextual information of adjacent pixels in a local neighborhood region, and the non-local spectral constraint uses the spectral similarity of pixels in a non-local way. A multi-layered CNN is used for spectral-spatial feature fusion in the second step. The softmax layer is finally used for multi-decision classification. A unified loss function is used to jointly optimize the classification step and the preceding step of learning the fused spectral-spatial features.
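Patch based processing for pixel based classification amounts to cutting a small spatial window, with all spectral bands, around each labeled pixel and using the label of the central pixel as the target. The sketch below shows only this sample preparation step. The cube layout `cube[row][col][band]`, the mirror handling of image borders and the function names are illustrative assumptions; the CNN itself is not shown.

```python
def extract_patch(cube, row, col, size):
    """Return the size x size spatial patch (all bands) centered on (row, col).

    size is assumed to be odd and smaller than the image. Pixels beyond the
    image border are filled by mirror reflection, one common choice that is
    assumed here.
    """
    rows, cols = len(cube), len(cube[0])
    half = size // 2

    def reflect(i, n):
        # Mirror an out-of-range index back into [0, n).
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    return [
        [
            list(cube[reflect(row + dr, rows)][reflect(col + dc, cols)])
            for dc in range(-half, half + 1)
        ]
        for dr in range(-half, half + 1)
    ]


def training_patches(cube, labeled_pixels, size):
    """Build (patch, label) pairs from a dict mapping (row, col) to a class label."""
    return [
        (extract_patch(cube, r, c, size), label)
        for (r, c), label in labeled_pixels.items()
    ]
```

Each returned patch has shape size x size x bands, so it can be fed to a 2D CNN with the bands as input channels or to a 3D CNN as a small cube.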

Fig. 9. Structure of a 3D-DWT transform.

Compared with 3D DWT and 3D Gabor filters, the 3D scattering wavelet transform has two benefits. First, due to the cascade of wavelet decompositions in multiple orientations and scales, rich descriptions of sophisticated structures are provided for HSI classification. Second, the local weighted averaging reduces the feature variability and leads to local consistency of pixel labels in the neighborhood regions. Although CNNs have a high capability for extracting features from low to high levels, they lack multi-resolution filtering. On the other hand, 3D wavelets provide 3D characterizations of the HSI at multiple resolutions in the frequency-space domains. So, the authors in [84] combine the advantages of 3D wavelets and CNN to adaptively extract 3D features at different scales and depths. 3D deep networks have been utilized for joint extraction of spectral and spatial features [85]. 3D CNN has been introduced for direct extraction of deep spectral-spatial features of the HSI [86,87]. These networks process the raw HSI. Although they provide promising results, their performance degrades as the networks become deeper. To overcome this problem, a spectral-spatial residual network (SSRN) composed of consecutive learning blocks is proposed in [88]. The residual blocks, as an extension of the convolutional layers used in CNN models, are designed to extract discriminative spectral-spatial features. The SSRN model allows a deeper structure than previous 3D-CNNs. By providing shortcut connections among the convolutional layers, the SSRN model leads to robust learning of spectral-spatial representations of the HSI. The SSRN method proposed in [88] forms a contextual CNN by incorporating residual learning for fully convolutional layers, and investigates appropriate residual architectures to provide robustness in various scenarios.
The authors in [13] have proposed that the spatial feature map constituted by the super-pixels can be further processed for spectral-spatial feature extraction by applying a 3D recurrent CNN. The framework exploits the continuity of image pixels and suppresses noise. One of the serious problems of CNN methods is over-fitting [89]. This problem is due to the huge number of learnable parameters in the network, which require a lot of training samples. Because gathering training samples is expensive and time consuming in the remote sensing community, the availability of only limited training samples is a common situation.


Fig. 10. An example of patch based processing of HSI classification by using the CNN model.

A method for using a pre-trained CNN such as VGG-VD16 or AlexNet is introduced in [102]. Different layers of a CNN contain complementary information extracted from the input image. While the shallower layers capture low-level visual features such as edges, the deeper layers reflect more abstract or semantic features. A CNN model containing multiple layers therefore conveys complementary information that can be used to improve HSI classification. The multi-layer stacked covariance pooling (MSCP) method proposed in [102] is implemented in three steps. First, a pre-trained CNN model is applied to the HSI to extract the feature maps in multiple layers. Second, the covariance matrix of the stacked feature maps is calculated. The covariance matrix implicitly fuses the feature maps containing the complementary information, where each entry represents the covariance between two different and likely complementary feature maps. Third, the calculated covariance matrix is used as the extracted features fed to an SVM classifier. An unsupervised deep learning based feature extraction method that fuses multi-scale spectral-spatial features is proposed in [103]. At first, the pre-trained VGG16 network is used to extract multi-scale spatial information. Then, a sparse auto-encoder network is used to fuse the raw spectral bands with the extracted spatial features. The extracted features are fed to the SVM classifier to find the classification map. DeepLab was first introduced in [104] for semantic segmentation. Because HSI classification is very similar to semantic segmentation, DeepLab is chosen for HSI classification in [105]. To implement DeepLab for HSI, at first, the maximum noise fraction (MNF) method [106] is applied to the HSI to find its first several principal components (PCs). The first PCs are used as the label image for DeepLab training. DeepLab extracts the spatial features of the HSI.
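The covariance pooling step of the MSCP method described earlier in this passage can be sketched as follows. Each feature map is assumed to be flattened to a list of N values over the same spatial positions, so that M stacked maps yield an M x M covariance matrix whose upper triangle serves as the feature vector. The function names and the plain vectorization of the matrix are illustrative assumptions rather than the exact procedure of [102].

```python
def covariance_pooling(feature_maps):
    """Covariance matrix of M stacked feature maps.

    feature_maps is a list of M maps, each flattened to N values taken at the
    same spatial positions (N >= 2). Entry (i, j) of the result is the sample
    covariance between map i and map j.
    """
    m = len(feature_maps)
    n = len(feature_maps[0])
    means = [sum(fm) / n for fm in feature_maps]
    centered = [[v - mu for v in fm] for fm, mu in zip(feature_maps, means)]
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
            for j in range(m)
        ]
        for i in range(m)
    ]


def upper_triangle(cov):
    """Vectorize a symmetric matrix by keeping its upper triangle."""
    m = len(cov)
    return [cov[i][j] for i in range(m) for j in range(i, m)]
```

Since the covariance matrix is symmetric, the upper triangle holds all M(M+1)/2 distinct entries, and its length does not depend on the spatial size of the feature maps, which makes it a fixed-length input for a classifier such as SVM.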
After DeepLab has extracted the spatial features, z-score normalization is applied to both the original spectral bands and the extracted spatial features. Then, a weighted fusion rule is used to combine the spectral and spatial information. The fused features are finally given to an SVM classifier. The proposed deep feature fusion method has some advantages over other deep learning methods: it extracts spatial features at multiple scales; it does not use patch based feature learning; and it avoids spatial resolution reduction. The experiments show the superior performance of the proposed DeepLab based feature extraction method, especially when there are small scale classes containing limited pixels. Guided filters are used in [107] to extract spatial features at multiple scales from the HSIs. The extracted spectral-spatial features are then given to a CNN model for classification. The guided filter, which involves a local linear model, acts as an edge preserving smoothing operator. To compute the filter output, a guided image is required. The content and structures of the guided image are transferred to the filtered output. Three principal components of the HSI are obtained and considered as the guided image. The PCA transformation is also applied to the HSI for feature reduction. The reduced dimensionality HSI is transformed by applying several guided filters with different scales. The extracted spatial feature maps with different scales are stacked together to generate an image cube containing the spatial features. The spatial feature vector of each pixel is reshaped to form a 2D image. This image

is given as the input of a CNN for classification. The CNN adopts regularization and dropout to deal with the over-fitting problem in limited training situations. The guided filter based spectral-spatial feature fusion method has a simple implementation and shows good classification accuracy. In addition, the multi-scale spatial features extracted by filters with different scales and fed to the CNN model make full use of the spatial information. The advantages of the neural network based classifiers are as follows: they do not require any prior knowledge about the statistical distribution of the data, they are data-driven, and they have a high capability for non-linear feature extraction and classification. The neural network based classifiers also have the following drawbacks: 1- they have a heavy training process in which a high volume of training samples must be given to the neural network over many epochs to allow good learning (however, a neural network is fast in the testing stage); 2- the neural networks are not stable for the same training set. In other words, each repetition of the classification step yields a different result with a different classification accuracy. So, it is usually necessary to run the classifiers multiple times and report the average results. Deep learning methods belong to the family of neural networks. Among the various deep networks, CNNs are very popular in HSI classification. Although 1D-CNN provides less accurate classification results than conventional classifiers such as SVM, 2D- and 3D-CNNs can effectively explore the spatial dependencies of an HSI by utilizing the local connections [58]. A drawback specific to deep networks is that designing a suitable deep network remains an open issue, since the layer structure, the number of filters, the cost function and the hyper-parameter settings all have a strong effect on the output.

4. Decision fusion method

Due to the intrinsic limitation of each single feature set, HSI classification methods that use just a single feature set ignore some valuable information and lose fine details. To improve the classification accuracy, it is proposed to use several feature sets containing complementary information to avoid information loss. At the decision fusion level, which is a high level fusion, separate decisions are drawn based on the individual feature sets, and the results are then combined into a global decision. The general block diagram of the decision fusion approach is shown in Fig. 11. In the general form, the spectral bands of an HSI are given to M different feature extraction methods to extract features with various views [108]. In other words, M feature extractors can be used to find M different feature subsets. Then, each obtained subset is given to a classifier to find a local decision. The final classification map is obtained by applying a decision fusion rule such as the majority voting (MV) rule. In another form, one feature extractor can be used instead of M feature extraction methods to obtain one subset of features, and the decision is then obtained by M different classifiers. The block diagram of the MBFSDA method [109] is shown in Fig. 12. At first, n PCs of the HSI with d spectral bands are extracted by the PCA transform. An MP is constituted from each PC. Each MP contains spatial


Fig. 11. General block diagram of the decision fusion approach.

Fig. 12. Block diagram of the MBFSDA method as a decision fusion approach [109].

features such as the size and shape information of contextual structures. Then, the FSDA transform, which involves three measures, is implemented in three steps: 1- maximizing the between-band scatters, where the bands here are the morphological spatial features; 2- maximizing class discrimination by maximizing the between-class scatters and minimizing the within-class scatters; and 3- extracting spatial features by using the projection matrix computed based on the three aforementioned measures. The extracted features of each subset are given to a classifier to find the local decision. The SVM and nearest neighbor (NN) classifiers are used in this step. The final classification map is obtained by the MV rule (for more details, the interested reader is referred to [109]). Another decision fusion approach has been proposed in [110], where band partitioning is used instead of feature extraction for producing the feature subsets. At first, the HSI cube is partitioned into smaller sub-cubes with the same spatial size but a lower number of adjacent spectral bands. The benefit of band partitioning, as in band selection methods, over feature extraction methods is that the physical meaning of the spectral channels is preserved. In addition, no feature reduction is applied here. Then, the redundant and noisy information of each sub-cube is removed by applying the defined smoothing filters, and the useful spatial features of each cleaned sub-cube are obtained by applying morphological filters. The SVM classifier is used individually to classify each subset of features, and finally the classification map is obtained by the MV rule. In this example of decision fusion, the same feature extraction method (morphological filter) and the same classifier (SVM) are used for classification of each subset of the HSI cube. Fig. 13 illustrates this approach. A decision level fusion method has been proposed for HSI classification in [111]. Two categories of features are used. The first category contains the spectral reflectance curves, which provide a global view of the HSI. The second category includes absorptions, which are considered as a local view of the HSI. Absorptions are the valleys in the spectral reflectance curve that result from absorption by the constituent molecules or atoms of materials. The absorption features are binary values assigned to each spectral band of the HSI. In other words, a binary vector is associated with each pixel, where a value of 1 is assigned to a band if an absorption valley appears in that band and a value of 0 is assigned if no absorption is detected [112]. To extract absorption features, at first, the spectral curves of the HSI are normalized to [0, 1], and then a peak detection algorithm is applied to detect the absorption dips. To avoid noise, two criteria are considered for absorption determination: 1- an absorption point must have a depth of more than 0.005 and 2- an absorption point must appear in more than half of the samples in each class. Since the reflectance features show more accurate classification results than the absorption ones, the classification is first done using the reflectance features and the SVM classifier. If the result is satisfactory, it is reported as the final result; otherwise, the classification result obtained from the absorption features using a multi-label classification method is reported as the final result. The satisfaction measure for the classification accuracy is based on an entropy measure, where higher uncertainty (entropy) means lower accuracy.
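The entropy based switching between the two classifiers can be sketched as a simple gate. The sketch assumes that the reflectance based classifier returns a vector of class posterior probabilities for a pixel and that the result is accepted when the Shannon entropy of this vector does not exceed a threshold. The threshold value and the function names are hypothetical; [111] should be consulted for the actual satisfaction measure.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def gated_decision(reflectance_probs, absorption_label, threshold):
    """Pick the final label of one pixel.

    The reflectance based label (arg max of the posterior) is kept when its
    entropy is at most `threshold` (a hypothetical tuning parameter).
    Otherwise the label from the absorption features is used.
    """
    if entropy(reflectance_probs) <= threshold:
        return max(range(len(reflectance_probs)), key=reflectance_probs.__getitem__)
    return absorption_label
```

For example, with a threshold of 0.5 a confident posterior such as [0.9, 0.05, 0.05] keeps the reflectance based label, while a flat posterior such as [0.4, 0.3, 0.3] falls back to the absorption based label.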


Fig. 13. A decision fusion approach using band partitioning [110].

A decision fusion based spectral-spatial HSI classification method is proposed in [113]. There is a high probability that adjacent pixels in a neighborhood region belong to the same class. So, to exploit the spatial information contained in the neighborhood regions, several cubes with different scales are considered around each central pixel. Then, the matrix of each sub-cube that contains the information related to a specific spectral band is reshaped to a vector. In other words, the spectral features of each pixel are reshaped to a row vector, and each sub-cube is thus converted to a matrix containing spectral-spatial features. In this way, a matrix containing the spectral-spatial information is obtained from each sub-cube associated with each pixel. The robust matrix discriminative analysis (RMDA) [113,114] is then applied for joint spectral-spatial feature extraction from each matrix. Due to various degradations such as missing data, noise contamination and calibration errors, each matrix associated with pixel (i, j) can be decomposed as $X_{ij} = Y_{ij} + E_{ij}$, where $Y_{ij}$ denotes the clean data and $E_{ij}$ represents noise. A de-noising model inspired by the unmixing method is applied to each $X_{ij}$ to obtain the clean matrix $Y_{ij}$. Then, the MDA model is used for feature extraction from $Y_{ij}$; like LDA, it maximizes the between-class scatters and minimizes the within-class scatters. The features extracted by RMDA are then given to an SVM classifier. Eventually, the classification maps obtained from the sub-cubes with different scales are fused through the MV rule to generate the final classification map. MLR is an appropriate classifier for ill-posed conditions where only a low number of training samples is available. MLR models the posterior class distribution over the whole image in a Bayesian framework [115]. The subspace version of MLR, called MLRsub [116], is based on the idea that the samples of each class can be approximated in a subspace with lower dimension.
It uses a projection to find that subspace. Since HSI data normally lie in a much lower dimensional space, MLRsub provides good results for this type of data. The authors in [117] use MLRsub to learn the posterior probabilities locally and globally for each HSI pixel. A probabilistic SVM is used as an indicator to detect the number of mixed components in each pixel. The obtained number of mixed components is used for local probability learning in MLRsub. Then, a decision fusion rule is used to fuse the global probabilities with the local ones obtained by the MLRsub method. Finally, MRF regularization is applied to obtain the classification map. Different spectral feature extraction methods are used in [118] to extract diverse features of the HSI with various views. Morphological filters are applied to each set of extracted features to implicitly fuse the spatial features with the spectral ones. Each obtained MP is given to a classifier, and the classification outputs are fused through the MV rule to find the final output map. Support vector data description (SVDD) was first proposed for one-class classification or target detection problems. SVDD is inspired by the two-class SVM. SVDD obtains a minimum boundary as a hyper-sphere around the target samples. The obtained hyper-sphere is used to determine whether a new sample belongs to the targets or not. The sphere volume is minimized while trying to include all training samples in the sphere. SVDD is sparse, is capable of using kernels and has good generalization. In [119], SVDD is used for multi-class HSI classification by applying an ensemble of multiple one-class classifiers and performing decision fusion. An ensemble method uses multiple learning techniques to improve the classification performance. Several classifications are performed and their results are combined using a specific fusion rule. The main point in an ensemble method is to integrate the results of individually accurate and diverse classifiers. The ensemble methods are generally divided into three main categories: 1- data (sample) level combination, where different training sets are given to the same classifier; 2- feature level combination, where various feature sets are extracted or selected, then combined and given to a classifier; and 3- classification level combination, where different classifiers with the same training set are used. In all ensemble methods, using an appropriate fusion rule is very important. There are two general fusion techniques: linear combination and non-linear combination. The non-linear methods involve a non-linear function such as power, multiplication or exponential. However, the simple linear methods, which use a weighted or non-weighted combination, are more common. In the voting rule, the alternative with the majority is selected. In the un-weighted MV rule, the same weight is assigned to all base classifiers, while in the weighted MV rule, a different weight is assigned to each base voter. The weights of the base classifiers should be proportional to their classification accuracies.
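The un-weighted and weighted MV rules can be sketched in a few lines. Each base classifier is assumed to provide one label per pixel, and the weights, when given, are assumed to be non-negative scores such as the accuracies of the base classifiers. The function names and the tie-breaking rule (the label voted first wins) are illustrative assumptions.

```python
from collections import defaultdict


def weighted_vote(labels, weights=None):
    """Fuse the labels given to one pixel by several base classifiers.

    With weights=None every classifier counts equally (un-weighted MV).
    Ties go to the label that was voted first, an assumed convention.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    score = defaultdict(float)
    for label, weight in zip(labels, weights):
        score[label] += weight
    return max(score, key=score.get)


def fuse_maps(label_maps, weights=None):
    """Fuse several classification maps, each a flat list of per-pixel labels."""
    return [weighted_vote(votes, weights) for votes in zip(*label_maps)]
```

With votes ['a', 'b', 'b'] the un-weighted rule selects 'b', whereas the weights [0.9, 0.3, 0.3] let the single more accurate classifier override the other two and select 'a'.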
In [119], the weights are defined based on the average of all correlation coefficients obtained by each part of the predicted class labels obtained by individual SVDDs. The proposed weighted voting rule provides better classification results than the conventional MV rule.

5. Experiments

A brief summary of the advantages and disadvantages of different spectral-spatial fusion methods for hyperspectral image classification is given in Table 1. For each subgroup of the three types, one method is given as an instance. The performance of these methods is assessed in terms of classification accuracy and computation time using three real and popular hyperspectral images: the well-known Indian Pines, University of Pavia and Salinas. The Indian scene was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Northwestern Indiana in June of 1992. This image, with a size of 145 × 145 pixels, contains 16 agricultural and forest classes, of which the 10 largest are chosen for our experiments. This dataset comprises 224 spectral channels, where 200 spectral bands remain after removing 20 water absorption bands. The Pavia image was collected by the Reflective Optics System Imaging Spectrometer (ROSIS). It contains 610 × 340 pixels with nine classes from an urban area. Its 115 spectral bands are reduced to 103 bands after removal of the noisy bands. The Salinas image was acquired by AVIRIS over the Salinas Valley in Southern California in 1998. It is a 512 × 217 image containing 16 classes and 204 bands after removing 20 absorption bands. Average accuracy, overall accuracy and kappa coefficient measures [22] are used to assess the classification accuracy. The methods given as instances in Table 1 are compared with each other. The performance of the classifiers is assessed in two different cases: 1) using a small training set (10 samples per class) and 2) using a relatively large training set (50 samples per class). The classification results for the Indian dataset obtained with 10 and 50 training samples are reported in Tables 2 and 3, respectively. The ground truth map (GTM) and the corresponding classification maps are also shown in Figs. 14 and 15.

Table 1 Advantages and disadvantages of different fusion methods for hyperspectral image classification.

| Group | Sub-group | Example method | Reference and year | Advantages | Difficulties/Disadvantages |
|---|---|---|---|---|---|
| Segmentation based methods | Object-based classification; Relaxation of classification map | Pixon-based classifier; MRF+HSRM | [6](2016); [3](2016) | The noisy pixels are deleted in the classification maps and an applicable smooth map is achieved for land cover. | • Anomaly pixels may be removed. Anomalies are important but rare pixels with different spectral signature with respect to background. • Determination of super-pixels or objects with appropriate shape and size and edge preserving is a challenging problem. |
| Feature fusion | Features stacking | APFSDA | [22](2017) | • Simple implementation • Efficient if appropriate spatial features are selected. | • High dimensionality of the fused feature vector may need feature reduction. • High computations |
| Feature fusion | Joint spectral-spatial feature extraction | MSPP | [35](2019) | The use of high correlation among spectral and spatial information | High computations |
| Feature fusion | Kernel based classifiers | GCK | [52](2013) | • Less sensitive to the number of training samples • Having a convex cost function • Fast training stage | Sensitive to the kernel parameters |
| Feature fusion | Representation based classifiers | WJCR-AS | [63](2017) | • Without any assumption about the statistical distribution of data • The use of relations among pixels from both local and global points of view | High computations of solving the optimization problem especially if the $\ell_0$-norm or $\ell_1$-norm is used |
| Feature fusion | 3D spectral-spatial feature extraction | 3D-Gabor | [76](2010) | Preserve the intrinsic 3D structure of hyperspectral image | High computations |
| Feature fusion | Deep learning based classifiers | RPNet | [142](2018) | • Simultaneously feature extraction and classification in a unified framework • High ability in feature extraction (detailed features in shallow layers and semantic ones in deep layers) | • Overfitting problem with insufficient training samples • High computations in training phase if the used network is deep |
| Decision fusion | Decision fusion | MBFSDA | [109](2018) | The use of complementary information and votes of several powerful classifiers | Selection of appropriate feature extractors or classifiers (decision makers) with minimum overlapping and redundancy is a challenging problem. |

Table 2 Classification results for Indian dataset (10 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Corn-no till | 1434 | 41.42 | 62.34 | 64.71 | 77.34 | 63.46 | 76.50 | 48.12 | 60.32 | 70.57 |
| 2 | Corn-min till | 834 | 33.93 | 61.15 | 95.56 | 88.73 | 89.93 | 93.41 | 55.16 | 73.38 | 81.06 |
| 3 | Grass/pasture | 497 | 71.43 | 64.19 | 82.09 | 81.29 | 79.68 | 88.33 | 63.78 | 93.56 | 81.09 |
| 4 | Grass/trees | 747 | 23.16 | 64.12 | 99.06 | 90.09 | 96.92 | 95.72 | 93.04 | 84.20 | 82.06 |
| 5 | Hay-windrowed | 489 | 87.73 | 99.39 | 99.59 | 100.00 | 99.59 | 99.80 | 94.89 | 96.73 | 99.39 |
| 6 | Soybeans-no till | 968 | 65.81 | 90.70 | 73.35 | 90.29 | 70.35 | 92.98 | 50.21 | 86.47 | 77.48 |
| 7 | Soybeans-min till | 2468 | 76.09 | 74.31 | 88.01 | 76.01 | 72.49 | 81.77 | 52.23 | 81.48 | 66.37 |
| 8 | Soybeans-clean till | 614 | 22.96 | 44.46 | 74.59 | 92.67 | 66.45 | 83.71 | 44.79 | 60.10 | 73.94 |
| 9 | Woods | 1294 | 95.44 | 97.53 | 98.84 | 98.45 | 88.79 | 99.46 | 65.84 | 58.73 | 90.80 |
| 10 | Bldg-Grass-Tree-Drives | 380 | 91.58 | 93.42 | 96.84 | 91.32 | 93.95 | 99.74 | 97.89 | 95.79 | 59.21 |
| | Average accuracy | | 60.96 | 75.16 | 87.27 | 88.62 | 82.16 | 91.14 | 66.59 | 79.08 | 78.20 |
| | Overall accuracy | | 62.45 | 74.96 | 85.83 | 85.91 | 78.67 | 88.60 | 60.67 | 75.94 | 76.42 |
| | Kappa | | 56.70 | 71.15 | 83.63 | 83.87 | 75.56 | 86.87 | 55.06 | 72.32 | 72.95 |
| | Computation time (seconds) | | 1.28 | 0.20 | 9.49 | 129.82 | 0.97 | 16.30 | 64.56 | 2.32 | 3.42 |

Table 3 Classification results for Indian dataset (10 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Corn-no till | 1434 | 57.11 | 79.64 | 90.73 | 91.21 | 85.63 | 92.75 | 77.68 | 94.07 | 92.19 |
| 2 | Corn-min till | 834 | 61.87 | 94.00 | 98.80 | 99.76 | 99.28 | 99.88 | 87.65 | 92.93 | 92.57 |
| 3 | Grass/pasture | 497 | 72.43 | 96.98 | 96.58 | 98.99 | 94.77 | 97.18 | 92.76 | 97.79 | 96.18 |
| 4 | Grass/trees | 747 | 98.39 | 98.93 | 99.87 | 99.87 | 97.86 | 100.00 | 100.00 | 99.06 | 97.72 |
| 5 | Hay-windrowed | 489 | 87.73 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 98.98 | 99.39 | 100.00 |
| 6 | Soybeans-no till | 968 | 80.68 | 81.30 | 86.47 | 91.74 | 88.64 | 94.21 | 85.54 | 91.43 | 88.22 |
| 7 | Soybeans-min till | 2468 | 74.76 | 82.74 | 95.54 | 89.14 | 86.18 | 97.33 | 72.53 | 89.47 | 78.00 |
| 8 | Soybeans-clean till | 614 | 50.49 | 94.95 | 93.49 | 95.77 | 92.02 | 99.19 | 82.90 | 92.18 | 92.35 |
| 9 | Woods | 1294 | 96.14 | 89.57 | 99.61 | 98.84 | 99.07 | 99.69 | 92.89 | 96.60 | 95.52 |
| 10 | Bldg-Grass-Tree-Drives | 380 | 53.16 | 95.79 | 100.00 | 97.89 | 98.16 | 100.00 | 99.47 | 98.95 | 86.32 |
| | Average accuracy | | 73.28 | 91.39 | 96.11 | 96.32 | 94.16 | 98.02 | 89.04 | 95.19 | 91.91 |
| | Overall accuracy | | 74.46 | 88.13 | 95.40 | 94.54 | 92.05 | 97.43 | 84.77 | 93.79 | 89.47 |
| | Kappa | | 70.65 | 86.34 | 94.67 | 93.69 | 90.83 | 97.02 | 82.57 | 92.82 | 87.88 |
| | Computation time (seconds) | | 2.15 | 0.20 | 13.99 | 161.95 | 2.48 | 46.31 | 66.11 | 6.64 | 5.41 |

According to the obtained results, the following conclusions can be drawn:

1) The best classification results in both cases of 10 and 50 training samples are achieved by WJCR-AS. The WJCR-AS classifier is a weighted version of the joint collaborative representation based method that benefits from the angular separation metric. It is a non-parametric classifier that is less sensitive to the number of training samples, uses the spatial information of local regions with appropriate weights and decreases the redundant information of the highly correlated spatial neighbors.

2) After WJCR-AS, MSPP ranks second with 10 training samples. MSPP utilizes a structure-preserving projection for extraction of spectral-spatial features, where morphological filters are applied for shape and contextual feature extraction. Because it extends the training set by utilizing the neighborhood information of local regions, MSPP is not only less sensitive to the number of training samples but also includes richer spectral-spatial features in the classification process.

3) With 50 training samples, APFSDA ranks second after WJCR-AS. Although APFSDA is a stacking feature fusion method, it is not very sensitive to the training set size because it uses spectral-spatial features with maximum class discrimination and minimum overlapping and redundancy. By applying the FSDA projection to the attribute profiles, APFSDA maximizes the between-spectral scatters and the class discrimination.

4) Although RPNet is not so efficient with 10 training samples, it has high performance with 50 training samples. This is expected because deep learning networks need sufficient labeled samples in the training phase to achieve generalization ability in the testing phase.

5) The Indian dataset is a cluttered and multi-modal image. So, the use of object-based approaches such as Pixon-based, or segmentation based relaxation methods such as MRF+HSRM, leads to high misclassification in the Indian image.

6) Pixel-based classifiers such as GCK and MBFSDA provide classification maps that are noisier than those of the other competitors.

7) 3D Gabor leads to over-smoothed classification maps.
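The three accuracy measures used in the tables can be computed from a confusion matrix. The following is a minimal sketch in plain Python; the function name and the 3-class toy matrix are assumptions made for illustration and are not taken from the paper.

```python
def classification_metrics(confusion):
    """Return (overall accuracy, average accuracy, kappa) as fractions.

    confusion[i][j] = number of test samples of true class i labelled as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all test samples labelled correctly.
    overall = correct / total

    # Average accuracy: mean of the per-class accuracies.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    average = sum(per_class) / n_classes

    # Cohen's kappa: observed agreement corrected for chance agreement.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa


# Hypothetical 3-class confusion matrix, for illustration only.
toy = [
    [50, 2, 3],
    [5, 30, 5],
    [0, 1, 4],
]
oa, aa, kappa = classification_metrics(toy)
print(f"OA={oa:.4f}  AA={aa:.4f}  kappa={kappa:.4f}")
```

The tables report these quantities multiplied by 100. Because the average accuracy weights every class equally, it can differ noticeably from the overall accuracy when the class sizes are unbalanced.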

The classification results for the Pavia dataset achieved with 10 training samples are shown in Table 4 and Fig. 16. Among the different classifiers, APFSDA, MSPP, WJCR-AS and GCK, in that order, provide the best classification results. The results obtained with 50 training samples (Table 5 and Fig. 17) show that GCK is the best candidate. After that, RPNet and MSPP rank second, and APFSDA and WJCR-AS rank third. The classification results for the Salinas dataset with 10 and 50 training samples are reported in Tables 6 and 7 and Figs. 18 and 19. The ranking of the best classifiers for 10 training samples is: APFSDA, MSPP, WJCR-AS and GCK. The ranking of the best methods for 50 training samples is: APFSDA, MRF+HSRM, WJCR-AS, MSPP and GCK.


Fig. 14. Classification maps for Indian dataset achieved by 10 training samples.

Fig. 15. Classification maps for Indian dataset achieved by 50 training samples.

Table 4 Classification results for Pavia dataset (9 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asphalt | 6631 | 45.20 | 93.12 | 93.77 | 74.71 | 94.83 | 81.35 | 50.26 | 83.52 | 86.68 |
| 2 | Meadows | 18649 | 82.00 | 1.43 | 90.88 | 86.09 | 81.91 | 78.65 | 66.53 | 54.19 | 89.22 |
| 3 | Gravel | 2099 | 90.23 | 98.09 | 94.00 | 86.37 | 85.47 | 91.33 | 59.98 | 70.70 | 87.28 |
| 4 | Trees | 3064 | 74.05 | 79.67 | 67.10 | 96.31 | 87.99 | 97.81 | 59.66 | 91.64 | 79.34 |
| 5 | Painted metal sheets | 1345 | 85.13 | 99.03 | 81.26 | 97.25 | 97.55 | 99.93 | 89.44 | 100.00 | 99.33 |
| 6 | Bare Soil | 5029 | 18.55 | 98.59 | 87.73 | 90.85 | 71.55 | 83.69 | 72.74 | 65.30 | 89.52 |
| 7 | Bitumen | 1330 | 92.48 | 56.47 | 99.62 | 98.72 | 99.77 | 99.92 | 72.78 | 92.41 | 88.12 |
| 8 | Self-Blocking Bricks | 3682 | 84.17 | 87.24 | 87.75 | 79.74 | 79.90 | 95.06 | 58.66 | 74.93 | 68.88 |
| 9 | Shadows | 947 | 97.57 | 98.84 | 97.47 | 99.26 | 96.30 | 99.68 | 85.74 | 93.24 | 82.89 |
| | Average accuracy | | 74.38 | 79.16 | 88.84 | 89.92 | 88.36 | 91.94 | 68.42 | 80.66 | 85.70 |
| | Overall accuracy | | 69.63 | 51.74 | 89.25 | 86.12 | 84.50 | 84.86 | 64.59 | 68.81 | 86.45 |
| | Kappa | | 57.72 | 45.96 | 86.01 | 82.18 | 79.86 | 80.75 | 55.56 | 61.78 | 82.31 |
| | Computation time (seconds) | | 3.34 | 1.98 | 1542.81 | 24.84 | 8.79 | 155.54 | 268.74 | 11.82 | 21.64 |
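The gap between average and overall accuracy in Table 4 can be checked directly from the per-class values. The sketch below uses the Pixon-based column; it assumes that the number of test samples per class is proportional to the listed class size, which is an assumption of this example rather than a statement from the paper.

```python
# Class sizes and per-class accuracies (%) of the Pixon-based column in Table 4.
class_sizes = [6631, 18649, 2099, 3064, 1345, 5029, 1330, 3682, 947]
class_acc = [45.20, 82.00, 90.23, 74.05, 85.13, 18.55, 92.48, 84.17, 97.57]

# Average accuracy: every class counts equally.
average_accuracy = sum(class_acc) / len(class_acc)

# Overall accuracy approximated as the size-weighted mean of class accuracies
# (assumption: test samples per class are proportional to the class sizes).
overall_accuracy = (
    sum(n * a for n, a in zip(class_sizes, class_acc)) / sum(class_sizes)
)

print(f"AA = {average_accuracy:.2f}%  OA = {overall_accuracy:.2f}%")
```

Under this assumption the computed values agree with the average accuracy (74.38) and overall accuracy (69.63) listed for the Pixon-based method. The overall accuracy is lower because the large Asphalt and Bare Soil classes are classified poorly, while the small, well-classified classes carry little weight.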

Fig. 16. Classification maps for Pavia dataset achieved by 10 training samples.

Table 5 Classification results for Pavia dataset (9 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asphalt | 6631 | 48.48 | 93.65 | 95.90 | 93.32 | 97.35 | 96.94 | 78.60 | 85.88 | 93.44 |
| 2 | Meadows | 18649 | 75.70 | 78.89 | 92.66 | 93.35 | 95.26 | 89.53 | 76.14 | 97.73 | 89.55 |
| 3 | Gravel | 2099 | 88.23 | 64.75 | 93.71 | 97.52 | 92.76 | 98.90 | 72.56 | 90.47 | 96.05 |
| 4 | Trees | 3064 | 70.76 | 92.75 | 90.50 | 94.42 | 97.06 | 98.86 | 75.75 | 97.81 | 88.09 |
| 5 | Painted metal sheets | 1345 | 85.13 | 99.70 | 99.85 | 99.85 | 98.74 | 99.85 | 95.69 | 100.00 | 99.78 |
| 6 | Bare Soil | 5029 | 98.27 | 98.63 | 97.59 | 98.79 | 97.53 | 98.39 | 79.52 | 94.13 | 94.77 |
| 7 | Bitumen | 1330 | 92.78 | 99.62 | 99.17 | 96.39 | 99.70 | 99.77 | 82.63 | 98.57 | 94.29 |
| 8 | Self-Blocking Bricks | 3682 | 84.11 | 92.29 | 95.60 | 97.66 | 96.52 | 98.18 | 77.62 | 95.90 | 87.45 |
| 9 | Shadows | 947 | 97.57 | 100.00 | 100.00 | 99.89 | 100.00 | 100.00 | 86.38 | 98.94 | 94.93 |
| | Average accuracy | | 82.34 | 91.14 | 96.11 | 96.80 | 97.21 | 97.83 | 80.54 | 95.49 | 93.15 |
| | Overall accuracy | | 76.43 | 86.72 | 94.49 | 95.08 | 96.32 | 94.47 | 77.88 | 95.09 | 91.39 |
| | Kappa | | 70.11 | 83.04 | 92.80 | 93.56 | 95.16 | 92.82 | 71.84 | 93.51 | 88.78 |
| | Computation time (seconds) | | 6.25 | 2.11 | 1735.20 | 44.50 | 11.91 | 438.38 | 273.39 | 27.87 | 27.48 |

Note that MRF+HSRM, which is a segmentation based relaxation method, has high performance on the Salinas dataset, in contrast to the Indian and Pavia datasets, especially when sufficient training samples are available. The reason is that the Salinas scene contains homogeneous regions with few spatial details. So, applying relaxation to it does not have the destructive effect of removing details, and MRF+HSRM yields few misclassified pixels in the Salinas image.

The running times of the different methods are also reported in the given tables. Among the 9 spectral-spatial classification methods, which belong to the 3 main groups of segmentation based methods, feature fusion methods and decision fusion ones, the highest computation times are those of MSPP, 3D Gabor, WJCR-AS and APFSDA, all of which belong to the feature fusion group. This result is obtained in all datasets, although the ranking of these methods differs slightly between datasets. The high running time of feature fusion methods is due to the heavy computations of the various feature extraction processes. However, effective features are the basis of a good classification. Although feature fusion methods impose high computations, they usually lead to superior classification results in various agricultural or urban land cover scenes. As seen from the obtained results, feature fusion methods such as WJCR-AS, APFSDA and MSPP provide high performance in all experimented datasets.

Some useful links for source codes of hyperspectral image classification are given in the following:
RPNet classifier: https://github.com/YonghaoXu/RPNet
MRF+HSRM classifier: https://www.researchgate.net/publication/287491511_MRFHSRM_Matlab_Code
GCK and MFL classifiers, extended attribute profiles and some other hyperspectral image analysis methods: http://www.lx.it.pt/~jun/demos.html

Some other useful links:
https://personal.utdallas.edu/~cxc123730/research.html
http://ssp.dml.ir/research/sadl/1/
https://paperswithcode.com/task/hyperspectral-image-classification/latest
https://github.com/gokriznastic/HybridSN
https://github.com/custom-computing-ic/CNN-Based-Hyperspectral-Image-Classification
https://github.com/mhaut/pResNet-HSI
https://github.com/leeguandong/FSKNet-for-HSI
https://github.com/leeguandong/3D-DenseNet-for-HSI
https://github.com/Hsuxu/Two-branch-CNN-Multisource-RS-classification
https://github.com/eecn/Hyperspectral-Classification
https://github.com/syamkakarla98/Dimensionality-reduction-and-classification-on-Hyperspectral-Images-Using-Python
https://github.com/shuguang-52/FDSSC
https://github.com/zilongzhong/SSRN
https://github.com/henanjun/EPs-F
https://github.com/henanjun/demo_MCMs

Some popular hyperspectral image datasets are also available in the following link:
http://www.ehu.es/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes
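The trade-off between classification accuracy and running time discussed in this section can be made explicit by keeping only the methods that no other method beats on both criteria. The sketch below applies this to the overall accuracies and computation times reported for the Pavia dataset with 50 training samples (Table 5); the function name `pareto_front` is an assumption of this example.

```python
# (overall accuracy in %, computation time in seconds), Pavia, 50 training samples.
results = {
    "Pixon-based": (76.43, 6.25),
    "MRF+HSRM": (86.72, 2.11),
    "APFSDA": (94.49, 1735.20),
    "MSPP": (95.08, 44.50),
    "GCK": (96.32, 11.91),
    "WJCR-AS": (94.47, 438.38),
    "3D-Gabor": (77.88, 273.39),
    "RPNet": (95.09, 27.87),
    "MBFSDA": (91.39, 27.48),
}


def pareto_front(scores):
    """Return the methods that are not dominated in accuracy and time."""
    front = []
    for name, (acc, seconds) in scores.items():
        dominated = any(
            other_acc >= acc
            and other_seconds <= seconds
            and (other_acc > acc or other_seconds < seconds)
            for other, (other_acc, other_seconds) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(results))
```

For this particular table only MRF+HSRM (fastest) and GCK (most accurate) remain. The outcome depends on the dataset and on the training set size, so the same check would have to be repeated for the other tables before drawing a general conclusion.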


Fig. 17. Classification maps for Pavia dataset achieved by 50 training samples.

Table 6 Classification results for Salinas dataset (16 classes) achieved by 10 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Brocoli_green_weeds_1 | 2009 | 99.90 | 100.00 | 100.00 | 100.00 | 97.41 | 99.90 | 86.86 | 98.01 | 99.95 |
| 2 | Brocoli_green_weeds_2 | 3726 | 99.38 | 99.46 | 99.68 | 98.74 | 96.43 | 98.93 | 93.48 | 99.19 | 99.65 |
| 3 | Fallow | 1976 | 96.20 | 100.00 | 100.00 | 99.90 | 99.80 | 100.00 | 98.13 | 96.91 | 99.60 |
| 4 | Fallow_rough_plow | 1394 | 91.61 | 99.86 | 99.21 | 99.71 | 96.56 | 99.21 | 93.33 | 99.21 | 100.00 |
| 5 | Fallow_smooth | 2678 | 86.89 | 98.43 | 98.28 | 96.94 | 98.81 | 98.32 | 90.66 | 98.36 | 96.30 |
| 6 | Stubble | 3959 | 88.79 | 99.87 | 99.44 | 99.39 | 97.15 | 99.17 | 97.37 | 99.60 | 99.14 |
| 7 | Celery | 3579 | 97.32 | 99.58 | 99.66 | 93.63 | 98.88 | 99.61 | 90.16 | 96.42 | 93.55 |
| 8 | Grapes_untrained | 11271 | 87.20 | 97.91 | 84.61 | 75.49 | 73.87 | 69.75 | 75.96 | 76.01 | 78.33 |
| 9 | Soil_vineyard_develop | 6203 | 99.84 | 100.00 | 100.00 | 99.89 | 99.97 | 99.98 | 94.24 | 96.58 | 97.81 |
| 10 | Corn_senesced_green_weeds | 3278 | 46.34 | 98.17 | 94.63 | 91.67 | 91.95 | 93.41 | 83.92 | 84.66 | 94.23 |
| 11 | Lettuce_romaine_4weeks | 1068 | 52.90 | 95.22 | 90.07 | 98.88 | 92.13 | 93.82 | 77.43 | 96.25 | 97.38 |
| 12 | Lettuce_romaine_5weeks | 1927 | 30.10 | 100.00 | 99.64 | 98.86 | 100.00 | 100.00 | 92.53 | 97.87 | 99.27 |
| 13 | Lettuce_romaine_6weeks | 916 | 89.74 | 98.14 | 98.69 | 97.93 | 99.02 | 98.80 | 90.39 | 93.56 | 98.69 |
| 14 | Lettuce_romaine_7weeks | 1070 | 84.21 | 96.45 | 96.64 | 96.36 | 95.70 | 93.64 | 90.19 | 97.76 | 75.70 |
| 15 | Vineyard_untrained | 7268 | 94.79 | 0.43 | 79.80 | 91.28 | 83.21 | 86.68 | 86.63 | 66.26 | 80.38 |
| 16 | Vineyard_vertical_trellis | 1807 | 92.09 | 97.51 | 97.73 | 98.84 | 93.25 | 99.83 | 69.84 | 97.57 | 99.89 |
| | Average accuracy | | 83.58 | 92.56 | 96.13 | 96.09 | 94.63 | 95.69 | 88.20 | 93.39 | 94.37 |
| | Overall accuracy | | 87.15 | 85.65 | 93.19 | 92.28 | 90.55 | 90.97 | 87.01 | 88.16 | 90.96 |
| | Kappa | | 85.63 | 83.86 | 92.43 | 91.45 | 89.52 | 90.00 | 85.57 | 86.83 | 89.95 |
| | Computation time (seconds) | | 3.63 | 1.82 | 59.14 | 23.65 | 5.13 | 124.28 | 280.58 | 13.06 | 12.56 |

6. Trends and advanced fusion methods

Three main trends have been seen in the recent literature:

1) Design of new feature extraction methods that generate rich spectral-spatial features with a high ability in class discrimination and preserve the 3D local and global structure of hyperspectral images.

2) Hybrid fusion methods where two or more types of fusion methods are used for feature fusion, decision fusion and classification map relaxation (regularization).

3) Deep learning methods for joint spectral-spatial feature generation, with extraction of detailed features in shallow layers and semantic ones in deep layers.

In the following, some advanced methods in each of the above groups are briefly introduced.

Fig. 18. Classification maps for Salinas dataset achieved by 10 training samples.

Fig. 19. Classification maps for Salinas dataset achieved by 50 training samples.

Table 7 Classification results for Salinas dataset (16 classes) achieved by 50 training samples.

| No | Name of class | # samples | Pixon-based | MRF+HSRM | APFSDA | MSPP | GCK | WJCR-AS | 3D-Gabor | RPNet | MBFSDA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Brocoli_green_weeds_1 | 2009 | 99.90 | 100.00 | 100.00 | 100.00 | 99.25 | 99.80 | 93.53 | 99.20 | 100.00 |
| 2 | Brocoli_green_weeds_2 | 3726 | 99.68 | 99.65 | 99.92 | 99.36 | 99.65 | 99.84 | 91.25 | 99.09 | 99.89 |
| 3 | Fallow | 1976 | 97.27 | 100.00 | 100.00 | 97.32 | 99.80 | 99.80 | 96.76 | 98.58 | 98.33 |
| 4 | Fallow_rough_plow | 1394 | 91.61 | 99.86 | 99.28 | 99.14 | 98.49 | 99.35 | 96.99 | 99.71 | 99.00 |
| 5 | Fallow_smooth | 2678 | 86.97 | 99.37 | 99.70 | 98.02 | 99.59 | 99.37 | 97.46 | 98.02 | 98.58 |
| 6 | Stubble | 3959 | 88.79 | 99.95 | 99.44 | 99.90 | 99.57 | 99.44 | 97.52 | 99.80 | 99.47 |
| 7 | Celery | 3579 | 97.32 | 99.47 | 99.92 | 96.31 | 99.44 | 99.75 | 95.81 | 99.66 | 99.02 |
| 8 | Grapes_untrained | 11271 | 99.36 | 98.54 | 99.05 | 91.60 | 89.74 | 96.34 | 87.72 | 75.97 | 96.47 |
| 9 | Soil_vineyard_develop | 6203 | 92.37 | 99.84 | 99.77 | 98.63 | 99.87 | 99.19 | 97.10 | 99.92 | 99.10 |
| 10 | Corn_senesced_green_weeds | 3278 | 46.16 | 98.84 | 96.74 | 97.22 | 96.06 | 99.57 | 93.26 | 98.54 | 98.93 |
| 11 | Lettuce_romaine_4weeks | 1068 | 52.90 | 99.44 | 100.00 | 99.72 | 99.06 | 99.63 | 96.72 | 99.53 | 99.34 |
| 12 | Lettuce_romaine_5weeks | 1927 | 30.10 | 100.00 | 100.00 | 99.90 | 100.00 | 100.00 | 99.01 | 100.00 | 99.79 |
| 13 | Lettuce_romaine_6weeks | 916 | 89.74 | 99.34 | 99.13 | 99.13 | 99.67 | 99.78 | 99.78 | 98.80 | 97.38 |
| 14 | Lettuce_romaine_7weeks | 1070 | 84.77 | 96.36 | 97.01 | 95.42 | 96.45 | 96.45 | 98.50 | 98.79 | 86.17 |
| 15 | Vineyard_untrained | 7268 | 94.76 | 98.57 | 98.39 | 96.44 | 85.37 | 97.58 | 84.53 | 78.58 | 84.91 |
| 16 | Vineyard_vertical_trellis | 1807 | 96.57 | 98.23 | 98.89 | 98.67 | 98.95 | 100.00 | 90.43 | 99.61 | 100.00 |
| | Average accuracy | | 84.27 | 99.22 | 99.20 | 97.92 | 97.56 | 99.12 | 94.77 | 96.49 | 97.27 |
| | Overall accuracy | | 89.04 | 99.16 | 99.17 | 96.77 | 95.33 | 98.58 | 92.55 | 91.67 | 96.46 |
| | Kappa | | 87.58 | 99.07 | 99.07 | 96.41 | 94.80 | 98.42 | 91.72 | 90.75 | 96.06 |
| | Computation time (seconds) | | 6.16 | 1.25 | 54.94 | 85.95 | 14.53 | 373.50 | 73.76 | 26.63 | 18.91 |

6.1. Design of feature extraction methods

Many recent feature extraction methods perform non-linear feature learning or use sparse representation, graph based approaches and super-pixel based ones. Some instances are cited in the following. A nonlinear manifold learning method is presented in [127] to extract the intrinsic topology of a hyperspectral image. A graph based feature extraction method is proposed in [128]. Conventional approaches are single vector-based graphs, which fail to capture spatial features. The multigraph embedding proposed in [128] is based on patch tensors, where three different sub-graphs are designed for description of the intrinsic geometrical structures in the hyperspectral image. A super-pixel and graph based feature reduction is introduced in [129]. At first, the hyperspectral image is segmented into non-overlapping super-pixels. Then, the super-pixel based linear discriminant analysis is applied to learn a super-pixel-guided graph. A label-guided graph is also constructed for exploration of spectral similarity. The two graphs are integrated for learning the discriminant projection. A semi-supervised dimensionality reduction is introduced in [130]. It jointly considers labeled and unlabeled samples to learn the low dimensional subspace. The labels are dynamically propagated on a learnable graph to progressively refine the pseudo-labels, providing a proper feedback system.

Sparse coding provides an efficient representation of hyperspectral images. However, due to the high inter-band correlation of spectral channels, sparse analysis on individual spectral bands is not appropriate. In [131], convolutional frameworks are investigated to simultaneously learn dictionaries and sparse codes and achieve an appropriate spectral-spatial representation of the hyperspectral image. Because of the high dimensionality of the hyperspectral cube, a convolutional encoder-decoder network is selected to this end. Also, 3D convolutions are adopted for modelling the spectral-spatial features.
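The idea of propagating labels over a graph to obtain pseudo-labels, mentioned above for semi-supervised dimensionality reduction, can be illustrated with a generic label propagation loop. This is a minimal sketch on a fixed, hand-made similarity graph; it is not the method of [130], where the graph itself is learned, and the function name and toy weights are assumptions of this example.

```python
def propagate_labels(weights, labels, n_classes, n_iter=20):
    """Spread class scores from labelled samples to unlabelled ones.

    weights[i][j] is a non-negative similarity between samples i and j;
    labels[i] is a class index, or None for an unlabelled sample.
    Every unlabelled sample is assumed to have at least one neighbour.
    """
    n = len(weights)
    # One-hot scores for labelled samples, zeros for unlabelled ones.
    scores = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
              for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # keep known labels fixed
                continue
            degree = sum(weights[i])
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / degree
                for c in range(n_classes)
            ])
        scores = updated
    # Pseudo-label: class with the highest propagated score.
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Toy chain of five samples; the link between samples 2 and 3 is weak.
weights = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
labels = [0, None, None, None, 1]
pseudo_labels = propagate_labels(weights, labels, n_classes=2)
print(pseudo_labels)
```

In this toy graph the weak link acts as a class boundary, so the unlabelled samples take the label of the labelled sample on their own side of it.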

6.2. Hybrid fusion methods

Hyperspectral image classification is done using both feature fusion and decision fusion in [132]. Various edge preserving features are extracted using multiple operations and then improved with the assistance of super-pixel segmentation. The obtained edge preserving features are fused with the spectral ones to generate a composite kernel. The various classification maps are finally fused using the majority voting rule.

Feature fusion, decision fusion and classification map regularization (relaxation) using a segmentation map are used to obtain the hyperspectral image classification map in [133]. The first order deviation of Gabor magnitude images is used to extend 2D Gabor features to the 3D structure of the hyperspectral cube. The PCA transform is then used for dimensionality reduction of each extracted 3D Gabor cube. Each reduced cube is fed to an SVM classifier. Then, the majority voting strategy is applied for decision fusion. Finally, a super-pixel map obtained by simple linear iterative clustering is used for regularization of the classification map.

Another integration of feature fusion with decision fusion is introduced in [101]. The proposed framework uses CNNs for joint spectral-spatial feature extraction. There are two main difficulties in the implementation of CNNs for hyperspectral image analysis. The first is insufficient labeled samples for training, which leads to the overfitting problem. The second is ignoring the complementary spectral-spatial information among low and high level features extracted by shallow and deep layers, respectively. The authors of [101] use two solutions to deal with these difficulties. They use multi-layer spectral-spatial features for learning the complementary information of shallow and deep layers. In addition, they apply sample augmentation by adding local and non-local constraints. A soft-max based multi-decision approach is used to find the final classification map from the different extracted sub-features. A multi-object CNN decision fusion method is proposed in [134].

In [135], 3D Gabor features are convolved with EMAP features to provide EMAP-Gabor features containing rich spatial and texture features. Then, the collaborative representation based classifier (CRC) is used for classification of the EMAP-Gabor features. Separately, EMAP features are employed for generating multi-scale super-pixel maps, where the number of super-pixels is automatically found with a heuristic strategy. The super-pixel maps are used for regularization of the classification maps. Then, the regularized maps are fused to provide the final classification map.

A hybrid fusion method for hyperspectral image classification is proposed in [136]. At first, Gabor filters are applied to the hyperspectral cube to generate Gabor features including magnitude and phase. The magnitude features are fed to SVM classifiers, while the phase features are given to quadrant bit coding and Hamming distance for measuring sample similarity. The two kinds of generated features are then combined to obtain a weight for each sample belonging to a given class. Also, some super-pixel maps are generated from the raw hyperspectral image to regularize the weighted cube obtained from the previous step. Maximum value classification is then applied to the regularized maps to find the final classification map.
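Two building blocks recur in the hybrid methods of this subsection: fusing several classification maps with the majority voting rule, and regularizing a classification map with a super-pixel (segmentation) map. The sketch below shows a generic form of both steps on flattened label maps; the function names and toy data are assumptions of this example, and the cited papers may implement these steps differently.

```python
from collections import Counter


def majority_vote(label_maps):
    """Fuse several classification maps pixel by pixel by majority voting."""
    fused = []
    for pixel_votes in zip(*label_maps):
        # Ties are resolved in favour of the label encountered first.
        fused.append(Counter(pixel_votes).most_common(1)[0][0])
    return fused


def regularize_with_segments(labels, segments):
    """Give every pixel the most frequent label inside its segment."""
    votes = {}
    for label, seg in zip(labels, segments):
        votes.setdefault(seg, Counter())[label] += 1
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [winner[seg] for seg in segments]


# Three hypothetical classification maps of the same 8 pixels (flattened).
maps = [
    [1, 1, 1, 2, 2, 1, 2, 2],
    [1, 1, 2, 2, 2, 1, 2, 1],
    [1, 2, 1, 1, 2, 2, 2, 2],
]
# Hypothetical super-pixel index of each pixel.
segments = [0, 0, 0, 0, 1, 1, 1, 1]

fused = majority_vote(maps)
regularized = regularize_with_segments(fused, segments)
print(fused)
print(regularized)
```

The regularization step removes isolated labels inside a segment. This is helpful in homogeneous scenes but can erase small structures when the segments are too coarse.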

6.3. Deep learning methods

With the fast development of deep learning methods, and especially with the wide success of convolutional neural networks (CNNs) in image processing problems, deep learning has attracted great interest from remote sensing researchers. Recently, a huge volume of studies on hyperspectral image classification has been built around deep learning. A review of deep learning based classification methods for hyperspectral images is given in [137]. Due to the overfitting problem, the structure of neural networks designed for hyperspectral images should not be too deep. Related to this requirement, a cascade dual-scale crossover neural network is introduced in [138]. It is able to extract more discriminant contextual information by applying convolution kernels of different spatial and spectral sizes. CNN together with data augmentation using pixel-block pairs [139], regularized CNN [107], Deeplab based CNN [105], 3D CNN and 3D dense network [140], and adaptive multi-scale deep fusion residual network [141] are some instances of deep learning based hyperspectral image classification methods.
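Spectral-spatial CNNs are commonly fed with a small spatial window around each labelled pixel, keeping all spectral bands, so that the convolution kernels see both the spectrum and its spatial context. The following sketch shows such a patch extraction step in plain Python. It is a generic preprocessing illustration, not a step taken from any of the cited networks; the function name, the edge replication at the image border and the toy cube are assumptions of this example.

```python
def extract_patch(cube, row, col, size):
    """Return the size x size neighbourhood of (row, col) with all bands.

    cube[r][c] is the spectrum (list of band values) of the pixel at (r, c).
    size is assumed to be odd. Border pixels are handled by clamping the
    indices, i.e. by replicating the image edge.
    """
    height, width = len(cube), len(cube[0])
    half = size // 2
    patch = []
    for dr in range(-half, half + 1):
        r = min(max(row + dr, 0), height - 1)
        patch_row = []
        for dc in range(-half, half + 1):
            c = min(max(col + dc, 0), width - 1)
            patch_row.append(list(cube[r][c]))
        patch.append(patch_row)
    return patch


# Toy 4 x 4 cube with 3 bands; the first band encodes the pixel position.
cube = [[[10 * r + c, 0, 1] for c in range(4)] for r in range(4)]

patch = extract_patch(cube, row=0, col=0, size=3)
print(len(patch), len(patch[0]), len(patch[0][0]))
```

Larger windows provide more spatial context but increase the input size and the risk of mixing several classes in one patch, which matters when only a few labelled samples are available.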

7. Conclusion

The fusion topic in the image processing field has been discussed from two main views in the literature. In the first view, the useful features from two or more individual source images are fused to provide an image with all beneficial characteristics of the source images [120–124]. In the second view, an image containing various valuable characteristics is explored from different aspects. The useful features are extracted and fused together to allow more powerful decision making from the given image [125,126]. The spectral-spatial fusion for hyperspectral image classification, which is reviewed in this work, belongs to the second group.

Three general categories of spectral-spatial fusion methods are reviewed and discussed in this paper. The first group is segmentation based classifiers, which are divided into object based classification methods and relaxed pixel-wise classification ones. The second group contains feature fusion methods. The six types of methods in this group act in two general ways. In the first way, the spectral and spatial features are individually extracted and then used in the classification phase, as in feature stacking methods and kernel based classifiers. In the second way, the spectral and spatial features are simultaneously extracted and classified, as in representation based classifiers, 3D feature extraction methods and deep learning based classifiers. The third group of spectral-spatial fusion methods is the decision fusion framework, where a decision fusion rule is used to integrate various decisions acquired from complementary classification maps.

Appropriate selection of segmentation algorithms, feature transforms, kernels and their parameter settings, 3D filters and their parameter settings, solving optimization problems in representation based classifiers, and tuning of hyper-parameters in deep learning methods are among the main difficulties of the various fusion methods. The appropriate fusion method should be chosen by considering the available training samples, the HSI spectral and spatial resolution, and the trade-off between classification accuracy and computation time. In general, however, spectral-spatial HSI classification results in a significant improvement with respect to HSI classification using just the spectral features.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] C.H. Chen, Frontiers of remote sensing information processing, World Scientific, 2003.
[2] H. Ghassemian, D.A. Landgrebe, On-line object feature extraction for multispectral scene representation, NASA technical reports, NASA-CR-187006, NAS 1.26:187006, TR-EE-88-34, Aug. 1988.
[3] M. Golipour, H. Ghassemian, F. Mirzapour, Integrating hierarchical segmentation maps with MRF prior for classification of hyperspectral images in a bayesian framework, IEEE Trans. Geosci. Remote Sens. 54 (2) (2016) 805–816.
[4] J. Li, J.M. Bioucas-Dias, A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823.
[5] H. Ghassemian, D.A. Landgrebe, Object-oriented feature extraction method for image data compaction, IEEE Control Syst. Mag. 8 (3) (1988) 42–48.
[6] A. Zehtabian, H. Ghassemian, Automatic object-based hyperspectral image classification using complex diffusions and a new distance metric, IEEE Trans. Geosci. Remote Sens. 54 (7) (2016) 4106–4114.
[7] A. Zehtabian, H. Ghassemian, An adaptive pixon extraction technique for multispectral/hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (4) (2015) 831–835.
[8] Y. Tarabalka, J.A. Benediktsson, J. Chanussot, Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques, IEEE Trans. Geosci. Remote Sens. 47 (8) (2009) 2973–2987.
[9] Y. Tarabalka, J. Chanussot, J.A. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognit. 43 (7) (2010) 2367–2379.
[10] S. Jia, B. Deng, J. Zhu, X. Jia, Q. Li, Local binary pattern-based hyperspectral image classification with superpixel guidance, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 749–759.
[11] M.-Y. Liu, O. Tuzel, S. Ramalingam, R. Chellappa, Entropy rate superpixel segmentation, IEEE Conf. Comp. Vision Pattern Recognit. (CVPR) (Jun. 2011) 2097–2104.
[12] M. Khodadadzadeh, H. Ghassemian, Contextual classification of hyperspectral remote sensing images using SVM-PLR, Aust. J. Basic Appl. Sci. 5 (8) (2011) 374–382.
[13] C. Shi, C.-M. Pun, Superpixel-based 3D deep neural networks for hyperspectral image classification, Pattern Recognit. 74 (2018) 600–616.
[14] L. Li, C. Sun, L. Lin, J. Li, S. Jiang, J. Yin, A dual-kernel spectral-spatial classification approach for hyperspectral images based on Mahalanobis distance metric learning, Inf. Sci. (Ny) 429 (2018) 260–283.
[15] Y. Tarabalka, J. Chanussot, J.A. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognit. 43 (7) (2010) 2367–2379.
[16] Y. Tarabalka, J.A. Benediktsson, J. Chanussot, Spectral-Spatial classification of hyperspectral imagery based on partitional clustering techniques, IEEE Trans. Geosci. Remote Sens. 47 (8) (2009) 2973–2987.
[17] Z. Miao, W. Shi, A new methodology for spectral-spatial classification of hyperspectral images, J. Sensors 2016 (2016), Article ID 1538973, 12 pages.
[18] M. Imani, H. Ghassemian, Morphology-based structure-preserving projection for spectral–spatial feature extraction and classification of hyperspectral data, IET Image Proc. 13 (2) (2019) 270–279.
[19] F. Mirzapour, H. Ghassemian, Fast GLCM and gabor filters for texture classification of very high resolution remote sensing images, Int. J. Inform. Commun. Tech. Res. 7 (3) (2015) 21–30.
[20] M. Imani, H. Ghassemian, Binary coding based feature extraction in remote sensing high dimensional data, Inf. Sci. (Ny) 342 (2016) 191–208.
[21] M. Imani, H. Ghassemian, Feature space discriminant analysis for hyperspectral data feature reduction, ISPRS J. Photogramm. Remote Sens. 102 (2015) 1–13.
[22] M. Imani, H. Ghassemian, Attribute profile based feature space discriminant analysis for spectral-spatial classification of hyperspectral images, Comput. Electr. Eng. 62 (2017) 555–569.
[23] W. Zhao, S. Du, Spectral–Spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4544–4554.
[24] C. Kuo, D.A. Landgrebe, Nonparametric weighted feature extraction for classification, IEEE Trans. Geosci. Remote Sens. 42 (5) (2004) 1096–1105.
[25] M. Imani, H. Ghassemian, Two dimensional linear discriminant analysis for hyperspectral data, Photogramm. Eng. Remote Sens. 81 (10) (2015) 777–786.
[26] Z. Wang, Q. Ruan, G. An, Facial expression recognition using sparse local Fisher discriminant analysis, Neurocomputing 174 (Part B) (2016) 756–766.
[27] M. Imani, H. Ghassemian, Feature extraction using median-mean and feature line embedding, Int. J. Remote Sens. 36 (17) (2015) 4297–4314.
[28] P. Huang, Z. Yang, C. Chen, Fuzzy local discriminant embedding for image feature extraction, Comput. Electr. Eng. 46 (2015) 231–240.
[29] F. Mirzapour, H. Ghassemian, Improving hyperspectral image classification by combining spectral, texture, and shape features, Int. J. Remote Sens. 36 (4) (2015) 1070–1096.
[30] S. Li, Q. Hao, X. Kang, J.A. Benediktsson, Gaussian pyramid based multiscale feature fusion for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (9) (2018) 3312–3324.
[31] A. Kianisarkaleh, H. Ghassemian, Spatial-spectral locality preserving projection for hyperspectral image classification with limited training samples, Int. J. Remote Sens. 37 (21) (2016) 5045–5059.
[32] G. Zhang, J. Wang, X. Zhang, H. Fei, B. Tu, Adaptive total variation-based spectral-spatial feature extraction of hyperspectral image, J. Vis. Commun. Image Represent. 56 (2018) 150–159.
[33] B. Pan, Z. Shi, X. Xu, Hierarchical guidance filtering-based ensemble classification for hyperspectral images, IEEE Trans. Geosci. Remote Sens. 55 (7) (2017) 4177–4189.
[34] M. Imani, H. Ghassemian, Edge patch image-based morphological profiles for classification of multispectral and hyperspectral data, IET Image Proc. 11 (3) (2017) 164–172.
[35] M. Imani, H. Ghassemian, Morphology-based structure-preserving projection for spectral–spatial feature extraction and classification of hyperspectral data, IET Image Proc. 13 (2) (2019) 270–279.
[36] H. Li, H. Li, L. Zhang, Quaternion-Based multiscale analysis for feature extraction of hyperspectral images, IEEE Trans. Signal Process. 67 (6) (2019) 1418–1430.
[37] L. Gao, et al., Subspace-based support vector machines for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 12 (2) (2015) 349–353.
[38] P. Ramzi, F. Samadzadegan, P. Reinartz, Classification of hyperspectral data using an AdaBoostSVM technique applied on band clusters, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2066–2079.
[39] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification via kernel sparse representation, IEEE Trans. Geosci. Remote Sens. 51 (1) (2013) 217–231.
[40] J. Li, H. Zhang, L. Zhang, Column-generation kernel nonlocal joint collaborative representation for hyperspectral image classification, ISPRS J. Photogramm. Remote Sens. 94 (2014) 25–36.


[41] A. Cornuéjols, C. Wemmert, P. Gançarski, Y. Bennani, Collaborative clustering: why, when, what and how, Inform. Fusion 39 (2018) 81–95.
[42] R. Zhao, B. Du, L. Zhang, A robust nonlinear hyperspectral anomaly detection approach, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (4) (2014) 1227–1234.
[43] Y. Gu, C. Wang, D. You, Y. Zhang, S. Wang, Y. Zhang, Representative multiple kernel learning for classification in hyperspectral imagery, IEEE Trans. Geosci. Remote Sens. 50 (7) (2012) 2852–2865.
[44] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. R. Soc. Lond. A Math. Phys. Sci. 209 (441–458) (1909) 415–446.
[45] D. Tuia, G. Camps-Valls, Semisupervised remote sensing image classification with cluster kernels, IEEE Geosci. Remote Sens. Lett. 6 (2) (2009) 224–228.
[46] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Disc. 2 (2) (1998) 121–167.
[47] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[48] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learning the kernel matrix with semidefinite programming, J. Mach. Learn. Res. 5 (2004) 27–72.
[49] G. Camps-Valls, L. Gomez-Chova, J. Muñoz-Marí, J. Vila-Frances, J. Calpe-Maravilla, Composite kernels for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 3 (1) (2006) 93–97.
[50] S. Niazmardi, A. Safari, S. Homayouni, A novel multiple kernel learning framework for multiple feature classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 10 (8) (2017) 3734–3743.
[51] Y. Gu, J. Chanussot, X. Jia, J.A. Benediktsson, Multiple kernel learning for hyperspectral image classification: a review, IEEE Trans. Geosci. Remote Sens. 55 (11) (2017) 6547–6565.
[52] J. Li, P. Marpu, A. Plaza, J. Bioucas-Dias, J.A. Benediktsson, Generalized composite kernel framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4816–4829.
[53] B. Suchetana, B. Rajagopalan, J. Silverstein, Investigating regime shifts and the factors controlling total inorganic nitrogen concentrations in treated wastewater using non-homogeneous Hidden Markov and multinomial logistic regression models, Sci. Total Environ. 646 (2019) 625–633.
[54] J. Li, X. Huang, P. Gamba, J.M. Bioucas-Dias, L. Zhang, J.A. Benediktsson, A. Plaza, Multiple feature learning for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 53 (3) (2015) 1592–1606.
[55] J. Li, J. Bioucas-Dias, A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823.
[56] Y. Zhang, S. Prasad, Locality preserving composite kernel feature extraction for multi-source geospatial image analysis, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 8 (3) (2015) 1385–1392.
[57] J. Li, P. Marpu, A. Plaza, J. Bioucas-Dias, J.A. Benediktsson, Generalized composite kernel framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4816–4829.
[58] P. Ghamisi, J. Plaza, Y. Chen, J. Li, A.J. Plaza, Advanced spectral classifiers for hyperspectral images: a review, IEEE Geosci. Remote Sens. Mag. 5 (1) (2017) 8–32.
[59] W. Li, E.W. Tramel, S. Prasad, J.E. Fowler, Nearest regularized subspace for hyperspectral classification, IEEE Trans. Geosci. Remote Sens. 52 (1) (2014) 477–489.
[60] W. Li, Q. Du, Joint within-class collaborative representation for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2200–2208.
[61] M. Xiong, Q. Ran, W. Li, J. Zou, Q. Du, Hyperspectral image classification using weighted joint collaborative representation, IEEE Geosci. Remote Sens. Lett. 12 (6) (2015) 1209–1213.
[62] M. Imani, H. Ghassemian, Edge-preserving-based collaborative representation for spectral-spatial classification, Int. J. Remote Sens. 38 (20) (2017) 5524–5545.
[63] M. Imani, H. Ghassemian, Weighted joint collaborative representation based on median-mean line and angular separation, IEEE Trans. Geosci. Remote Sens. 55 (10) (2017) 5612–5624.
[64] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification using dictionary-based sparse representation, IEEE Trans. Geosci. Remote Sens. 49 (10) (2011) 3973–3985.
[65] B. Tu, X. Zhang, X. Kang, G. Zhang, J. Wang, J. Wu, Hyperspectral image classification via fusing correlation coefficient and joint sparse representation, IEEE Geosci. Remote Sens. Lett. 15 (3) (2018) 340–344.
[66] M. Borhani, H. Ghassemian, Spectral-spatial graph kernel machines in the context of hyperspectral remote sensing image classification, CSI J. Comp. Sci. Eng. 11 (2 & 4 (b)) (2013) 31–42.
[67] L. Gan, P. Du, J. Xia, Y. Meng, Kernel fused representation-based classifier for hyperspectral imagery, IEEE Geosci. Remote Sens. Lett. 14 (5) (2017) 684–688.
[68] M. Imani, Attribute profile based target detection using collaborative and sparse representation, Neurocomputing 313 (2018) 364–376.
[69] M. Imani, Anomaly detection using morphology-based collaborative representation in hyperspectral imagery, Eur. J. Remote Sens. 51 (1) (2018) 457–471.
[70] G. Goswami, P. Mittal, A. Majumdar, M. Vatsa, R. Singh, Group sparse representation based classification for multi-feature multimodal biometrics, Inform. Fusion 32 (Part B) (2016) 3–12.
[71] B. Yang, S. Li, Pixel-level image fusion with simultaneous orthogonal matching pursuit, Inform. Fusion 13 (1) (2012) 10–19.
[72] M. Imani, Anomaly detection from hyperspectral images using clustering based feature reduction, J. Indian Soc. Remote Sens. 46 (9) (2018) 1389–1397.

[73] F. Yuan, X. Xia, J. Shi, Mixed co-occurrence of local binary patterns and Hamming-distance-based local binary patterns, Inf. Sci. (Ny) 460–461 (2018) 202–222.
[74] J. Zhu, J. Hu, S. Jia, X. Jia, Q. Li, Multiple 3-D feature fusion framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 56 (4) (2018) 1873–1886.
[75] L. He, J. Li, A. Plaza, Y. Li, Discriminative low-rank gabor filtering for spectral–spatial hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 55 (3) (2017) 1381–1395.
[76] T.C. Bau, S. Sarkar, G. Healey, Hyperspectral region classification using a three-dimensional gabor filterbank, IEEE Trans. Geosci. Remote Sens. 48 (9) (2010) 3457–3464.
[77] M. Imani, 3D Gabor based hyperspectral anomaly detection, AUT J. Model. Simul. 50 (2) (2018) 101–110.
[78] T.C. Bau, S. Sarkar, G. Healey, Hyperspectral region classification using a three-dimensional gabor filterbank, IEEE Trans. Geosci. Remote Sens. 48 (9) (2010) 3457–3464.
[79] S. Jia, L. Shen, J. Zhu, Q. Li, A 3-D gabor phase-based coding and matching framework for hyperspectral imagery classification, IEEE Trans. Cybern. 48 (4) (2018) 1176–1188.
[80] S. Jia, J. Hu, J. Zhu, X. Jia, Q. Li, Three-Dimensional local binary patterns for hyperspectral imagery classification, IEEE Trans. Geosci. Remote Sens. 55 (4) (2017) 2399–2413.
[81] Z. He, L. Liu, Robust multitask learning with three-dimensional empirical mode decomposition-based features for hyperspectral classification, ISPRS J. Photogramm. Remote Sens. 121 (2016) 11–27.
[82] X. Cao, L. Xu, D. Meng, Q. Zhao, Z. Xu, Integration of 3-dimensional discrete wavelet transform and Markov random field for hyperspectral image classification, Neurocomputing 226 (2017) 90–100.
[83] Y.Y. Tang, Y. Lu, H. Yuan, Hyperspectral image classification based on three-dimensional scattering wavelet transform, IEEE Trans. Geosci. Remote Sens. 53 (5) (2015) 2467–2480.
[84] C. Shi, C.-M. Pun, 3D multi-resolution wavelet convolutional neural networks for hyperspectral image classification, Inf. Sci. (Ny) 420 (2017) 49–65.
[85] B. Hamida, A. Benoit, P. Lambert, C. Ben Amar, 3-D Deep learning approach for remote sensing image classification, IEEE Trans. Geosci. Remote Sens. 56 (8) (2018) 4420–4434.
[86] Y. Chen, H. Jiang, C. Li, X. Jia, P. Ghamisi, Deep feature extraction and classification of hyperspectral images based on convolutional neural networks, IEEE Trans. Geosci. Remote Sens. 54 (10) (2016) 6232–6251.
[87] Y. Li, H. Zhang, Q. Shen, Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sens. 9 (1) (2017) 67.
[88] Z. Zhong, J. Li, Z. Luo, M. Chapman, Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 847–858.
[89] J. Yang, W. Xiong, S. Li, C. Xu, Learning structured and non-redundant representations with deep neural networks, Pattern Recognit. 86 (2019) 224–235.
[90] I. Rodriguez, J. María Martínez-Otzeta, I. Irigoien, E. Lazkano, Spontaneous talking gestures using generative adversarial networks, Rob. Auton. Syst. 114 (2019) 57–65.
[91] I. Goodfellow, et al., Generative adversarial nets, in: Proc. NIPS, Montreal, QC, Canada, 2014, pp. 2672–2680.
[92] M. Zhang, M. Gong, Y. Mao, J. Li, Y. Wu, Unsupervised feature extraction in hyperspectral images based on wasserstein generative adversarial network, IEEE Trans. Geosci. Remote Sens. (2018) In Press.
[93] J. Feng, H. Yu, L. Wang, X. Cao, X. Zhang, L. Jiao, Classification of hyperspectral images based on multiclass spatial-spectral generative adversarial networks, IEEE Trans. Geosci. Remote Sens. (2019) In Press.
[94] L. Zhu, Y. Chen, P. Ghamisi, J.A. Benediktsson, Generative adversarial networks for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 56 (9) (2018) 5046–5063.
[95] J.L. Tsai, Feature extraction of hyperspectral image cubes using three-dimensional gray-level cooccurrence, IEEE Trans. Geosci. Remote Sens. 51 (6) (2013) 3504–3513.
[96] M. Zaouali, S. Bouzidi, E. Zagrouba, 3-D Shearlet transform based feature extraction for improved joint sparse representation hsi classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (4) (2018) 1306–1314.
[97] B.T. Zhao, H. Fei, N. Li, X. Yang, Spatial-spectral classification of hyperspectral image via group tensor decomposition, Neurocomputing 316 (2018) 68–77.
[98] X. Cao, R. Li, L. Wen, J. Feng, L. Jiao, Deep multiple feature fusion for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 11 (10) (2018) 3880–3891.
[99] J. Yang, Y. Zhao, J.C. Chan, Learning and transferring deep joint spectral–spatial features for hyperspectral classification, IEEE Trans. Geosci. Remote Sens. 55 (8) (2017) 4729–4742.
[100] G.L. Zhao, L. Fang, B. Tu, P. Ghamisi, Multiple convolutional layers fusion framework for hyperspectral image classification, Neurocomputing 339 (2019) 149–160.
[101] J. Feng, et al., CNN-based multilayer spatial–spectral feature fusion and sample augmentation with local and nonlocal constraints for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (4) (2019) 1299–1313.
[102] N. He, L. Fang, S. Li, A. Plaza, J. Plaza, Remote sensing scene classification using multilayer stacked covariance pooling, IEEE Trans. Geosci. Remote Sens. 56 (12) (2018) 6899–6910.
[103] X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, B. Zhang, Multisource remote sensing data classification based on convolutional neural network, IEEE Trans. Geosci. Remote Sens. 56 (2) (2018) 937–949.


[104] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, [Online]. Available: https://arxiv.org/abs/1412.7062 (Dec. 2014).
[105] Z. Niu, W. Liu, J. Zhao, G. Jiang, DeepLab-based spatial feature extraction for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett. 16 (2) (2019) 251–255.
[106] L. Sun, J. Rieger, H. Hinrichs, Maximum noise fraction (MNF) transformation to remove ballistocardiographic artifacts in EEG signals recorded during fMRI scanning, Neuroimage 46 (1) (2009) 144–153.
[107] Y. Guo, H. Cao, J. Bai, Y. Bai, High efficient deep feature extraction and classification of spectral-spatial hyperspectral image using cross domain convolutional neural networks, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (1) (2019) 345–356.
[108] M. Imani, Manifold structure preservative for hyperspectral target detection, Adv. Space Res. 61 (2018) 2510–2520.
[109] M. Imani, H. Ghassemian, Discriminant analysis in morphological feature space for high-dimensional image spatial–spectral classification, J. Appl. Remote Sens. 12 (1) (2018) 016024-1–016024-28.
[110] M. Imani, H. Ghassemian, Hyperspectral images classification by spectral-spatial processing, 8th International Symposium on Telecommunications (IST'2016), Tehran, Iran, 27–29 Sept. 2016.
[111] B. Guo, H. Shen, M. Yang, Improving hyperspectral image classification by fusing spectra and absorption features, IEEE Geosci. Remote Sens. Lett. 14 (8) (2017) 1363–1367.
[112] D.G. Stavrakoudis, E. Dragozi, I.Z. Gitas, C.G. Karydas, Decision fusion based on hyperspectral and multispectral satellite imagery for accurate forest species mapping, Remote Sens. 6 (8) (2014) 6897–6928.
[113] R. Hang, et al., Robust matrix discriminative analysis for feature extraction from hyperspectral images, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 10 (5) (2017) 2002–2011.
[114] R. Hang, Q. Liu, H. Song, Y. Sun, Matrix-based discriminant subspace ensemble for hyperspectral image spatial–spectral feature fusion, IEEE Trans. Geosci. Remote Sens. 54 (2) (2016) 783–794.
[115] J. Li, J. Bioucas-Dias, A. Plaza, Hyperspectral image segmentation using a new bayesian approach with active learning, IEEE Trans. Geosci. Remote Sens. 49 (10) (2011) 3947–3960.
[116] J. Li, J. Bioucas-Dias, A. Plaza, Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Trans. Geosci. Remote Sens. 50 (3) (2012) 809–823.
[117] J.L. Khodadadzadeh, A. Plaza, H. Ghassemian, J.M. Bioucas-Dias, X. Li, Spectral–spatial classification of hyperspectral data using local and global probabilities for mixed pixel characterization, IEEE Trans. Geosci. Remote Sens. 52 (10) (2014) 6298–6314.
[118] M. Imani, H. Ghassemian, Spectral-spatial feature transformations with controlling contextual information through smoothing filtering and morphological analysis, Int. J. Inform. Commun. Tech. Res. 10 (1) (2018) 1–12.
[119] F.S. Uslu, H. Binol, M. Ilarslan, A. Bal, Improving SVDD classification performance on hyperspectral images via correlation based ensemble technique, Opt. Lasers Eng. 89 (2017) 169–177.
[120] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: a survey of the state of the art, Inform. Fusion 33 (2017) 100–112.
[121] Y. Liu, X. Chen, Z. Wang, Z.J. Wang, R.K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Inform. Fusion 42 (2018) 158–173.
[122] X. Ma, S. Hu, S. Liu, J. Fang, S. Xu, Multi-focus image fusion based on joint sparse representation and optimum theory, Signal Process. Image Commun. (2019) In Press.

[123] Q. Zhang, Y. Liu, R.S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: a review, Inform. Fusion 40 (2018) 57–75.
[124] B. Meher, S. Agrawal, R. Panda, A. Abraham, A survey on region based image fusion methods, Inform. Fusion 48 (2019) 119–132.
[125] C. Liu, J. Li, L. He, Superpixel-Based semisupervised active learning for hyperspectral image classification, IEEE J. Select. Topics Appl. Earth Observ. Remote Sens. 12 (1) (2019) 357–370.
[126] K. Fotiadou, G. Tsagkatakis, P. Tsakalides, Spectral super resolution of hyperspectral images via coupled dictionary learning, IEEE Trans. Geosci. Remote Sens. 57 (5) (2019) 2777–2797.
[127] P. Zhang, H. He, L. Gao, A nonlinear and explicit framework of supervised manifold-feature extraction for hyperspectral image classification, Neurocomputing 337 (2019) 315–324.
[128] Y. Deng, H. Li, X. Song, Y. Sun, X. Zhang, Q. Du, Patch tensor-based multigraph embedding framework for dimensionality reduction of hyperspectral images, IEEE Trans. Geosci. Remote Sens. (2020) In Press.
[129] H. Xu, H. Zhang, W. He, L. Zhang, Superpixel-based spatial-spectral dimension reduction for hyperspectral imagery classification, Neurocomputing 360 (2019) 138–150.
[130] D. Hong, N. Yokoya, J. Chanussot, J. Xu, X.X. Zhu, Learning to propagate labels on graphs: an iterative multitask regression framework for semi-supervised hyperspectral dimensionality reduction, ISPRS J. Photogramm. Remote Sens. 158 (2019) 35–49.
[131] P.V. Arun, B. Krishna Mohan, A. Porwal, Spatial-spectral feature based approach towards convolutional sparse coding of hyperspectral images, Comput. Vision Image Understanding 188 (2019) 102797.
[132] P. Duan, X. Kang, S. Li, P. Ghamisi, J.A. Benediktsson, Fusion of multiple edge-preserving operations for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 57 (12) (2019) 10336–10349.
[133] S. Jia, K. Wu, J. Zhu, X. Jia, Spectral–spatial gabor surface feature fusion approach for hyperspectral imagery classification, IEEE Trans. Geosci. Remote Sens. 57 (2) (2019) 1142–1154.
[134] Y. Hu, J. Zhang, Y. Ma, J. An, G. Ren, X. Li, Hyperspectral coastal wetland classification based on a multiobject convolutional neural network model and decision fusion, IEEE Geosci. Remote Sens. Lett. 16 (7) (2019) 1110–1114.
[135] S. Jia, X. Deng, J. Zhu, M. Xu, J. Zhou, X. Jia, Collaborative representation-based multiscale superpixel fusion for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 57 (10) (2019) 7770–7784.
[136] S. Jia, Z. Lin, B. Deng, J. Zhu, Q. Li, Cascade superpixel regularized gabor feature fusion for hyperspectral image classification, IEEE Trans. Neural. Netw. Learn. Syst. (2020) In Press.
[137] M.E. Paoletti, J.M. Haut, J. Plaza, A. Plaza, Deep learning classifiers for hyperspectral imaging: a review, ISPRS J. Photogramm. Remote Sens. 158 (2019) 279–317.
[138] F. Cao, W. Guo, Cascaded dual-scale crossover network for hyperspectral image classification, Knowl. Based Syst. (2019) 105122.
[139] W. Li, C. Chen, M. Zhang, H. Li, Q. Du, Data augmentation for hyperspectral image classification with deep CNN, IEEE Geosci. Remote Sens. Lett. 16 (4) (2019) 593–597.
[140] C. Zhang, G. Li, S. Du, Multi-scale dense networks for hyperspectral remote sensing image classification, IEEE Trans. Geosci. Remote Sens. 57 (11) (2019) 9201–9222.
[141] G. Li, L. Li, H. Zhu, X. Liu, L. Jiao, Adaptive multiscale deep fusion residual network for remote sensing image classification, IEEE Trans. Geosci. Remote Sens. 57 (11) (2019) 8506–8521.
[142] Y. Xu, B. Du, F. Zhang, L. Zhang, Hyperspectral image classification via a random patches network, ISPRS J. Photogramm. Remote Sens. 142 (2018) 344–357.
