Journal Pre-proof
CNN based spectral super-resolution of remote sensing images
P.V. Arun, K.M. Buddhiraju, A. Porwal, J. Chanussot
PII: S0165-1684(19)30446-3
DOI: https://doi.org/10.1016/j.sigpro.2019.107394
Reference: SIGPRO 107394
To appear in: Signal Processing
Received date: 1 August 2019
Revised date: 18 November 2019
Accepted date: 19 November 2019
Please cite this article as: P.V. Arun, K.M. Buddhiraju, A. Porwal, J. Chanussot, CNN based spectral super-resolution of remote sensing images, Signal Processing (2019), doi: https://doi.org/10.1016/j.sigpro.2019.107394
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.
Highlights
- Novel sparse-coding-based pixel-spectra enhancement
- Collaborative unmixing for refining the results with respect to the coarse image
- Spatial-spectral prior based transformation
- Endmember similarities and spectral image prior used as loss functions
CNN based spectral super-resolution of remote sensing images Arun, P.V., Buddhiraju, K.M., Porwal, A., and Chanussot, J.
Abstract
Spectral super-resolution techniques attempt to re-project spectrally coarse images onto a set of finer wavelength bands. However, the complexity of the mapping between coarser- and finer-scale spectra, the large variability of spectral signatures, and the difficulty in simultaneously modeling spatial and spectral contexts make the problem highly ill-posed. Our main hypothesis is that considering spatial as well as spectral aspects is essential for spectral enhancement of remote sensing images. In this regard, this paper proposes a framework consisting of sparse-coding-based pixel-spectra enhancement, collaborative unmixing, and spatial-spectral prior based transformation. Two sparse-coding-based architectures are proposed to project the coarser-scale pixel-spectra to the target scale. These models facilitate simultaneous optimization of sparse codes and dictionaries with regard to the spectral super-resolution objective. A CNN based encoding-decoding architecture is explored to model the spatial-spectral prior for improving the fidelity of the reconstructions. Endmember similarities and a spectral image prior are considered while designing the proposed loss functions. The experiments, over standard as well as AVIRIS-NG and drone-derived datasets, confirm the better accuracy of the proposed frameworks as compared to prominent approaches. In addition, the proposed CNN models for spectral upscaling and spatial-spectral transformation are found to be less sensitive to variation in network parameter values.
Index Terms—Spectral-super-resolution, hyperspectral, remote sensing
P.V. Arun ([email protected]), K.M. Buddhiraju, and A. Porwal are with the Indian Institute of Technology Bombay, Mumbai, India. J. Chanussot is with the Grenoble Institute of Technology, France.
1. INTRODUCTION
Nowadays, hyperspectral images are widely employed for various applications as they can better distinguish the objects or materials on the ground. The finer the spectral resolution, the better the ability to classify the materials. The recovery of a finer-scale spectrum from a coarser one, known as spectral super-resolution, is important for improving the separability of object/material classes. Although different spatial super-resolution approaches have given satisfactory results, very few published works are available on spectral super-resolution of remote sensing images. The lack of regular spatial structure of aerial image features, large spectral signature variability, the large number of spectral bands, and the difficulty in modelling the spatial-spectral context make the problem highly ill-posed. Recently, different learning techniques [1-3] have been adopted to model the mapping between coarser- and finer-scale spectra from a large number of training data. Arad et al. [1] and Aeschbacher et al. [2] employed a sparse-coding based approach to facilitate the coarser-to-finer scale mapping. Instead of using dictionary learning, Akhtar and Mian [3] employed Gaussian processes and utilized clustering in data processing. Heikkinen [4] proposed a new estimation model based on learning spectral subspace components via scalar-valued Gaussian process regressions. Convolutional neural network (CNN) based deep learning frameworks have also been explored for this purpose and have reported state-of-the-art results [5-7]. Although Xiong et al. [5] proposed a CNN based spectral-super-resolution framework, the approach required prior knowledge of the spectral response function for an explicit upsampling operation. In their seminal paper, Galliani et al. [6] adopted a semantic-segmentation based CNN framework for spectral super-resolution, achieving satisfactory results. The Generative Adversarial Network (GAN) was explored by Alvarez-Gila et al.
[7] for modeling the joint spectral-spatial distribution of hyperspectral datasets. Shi et al. [8] adopted convolutional residual blocks for improving the fidelity of spectral reconstruction. They illustrated that the use of dense blocks with a novel fusion scheme, instead of the residual blocks, yields better results. It may be noted that, in both models, dynamic upscaling was proposed instead of static upscaling. Recently, Yan et al. [9] explored a U-Net based framework for jointly encoding the local and non-local image information through symmetrically downsampling and upsampling the intermediate feature maps. Unlike most of the approaches discussed above, we explore spatial and spectral information derived from both the input image and the training set to formulate an effective transformation. Although not specifically for super-resolution, prominent CNN based architectures available for similar image transformations are discussed in [10-17]. Recently, the concept of the 3D-CNN has been proposed, where the filters move along the spectral (depth) dimension in addition to the 2D spatial dimensions [10, 18, 19]. The neural network inversion techniques, which attempt to reconstruct inputs from corresponding network
outputs, are employed for various image enhancement tasks [20]. In this paper, various advancements in convolutional architectures are explored for addressing the issues prevalent in spectral super-resolution of remote sensing images. Recently, neural network pruning has been widely employed to remove the least useful filters and channels from a network without affecting its accuracy. Generally, these approaches sparsify the filters and channels, and prune out the unimportant ones based on certain criteria. In this regard, low-rank matrix approximation, dynamic learning strategies, and structured sparse regularization have been employed to sparsify a given network. He et al. [21] achieved state-of-the-art results using LASSO-regression-based channel and filter pruning. Kiaee et al. [22] proposed an alternating direction method of multipliers (ADMM) based approach to simultaneously induce sparsity and minimize the classification based network loss. This research adopts a few of these approaches for pruning the proposed frameworks. In this paper, we explore an optimal spectral super-resolution framework for remote sensing datasets. Our approach ensures spectral as well as spatial fidelity of the reconstructions with a minimum number of samples. The specific aspects of aerial images that affect the conventional strategies are considered in the proposed approach. It may be noted that the loss functions and pruning strategies for the proposed frameworks are designed considering the characteristics of the datasets.
2. PROPOSED APPROACH Consider a remote sensing image Xc with spatial resolution s and number of spectral bands b. Let Xf be its high spectral-resolution counterpart with number of bands r (r>>b), and let NE be the number of endmembers in Xc and Xf. In this research, we attempt to recover a high spectral resolution image Y (Y should be as close as possible to original ground truth Xf) from its low-resolution version Xc. The proposed framework is summarized in Fig. 1. Each pixel spectrum of the input image (Xc) is spectrally enhanced (Fig. 2 & sec. 2.1) to yield the spectrally super-resolved image (X’). The input Xc and its spectrally enhanced version X’ are jointly unmixed (Fig. 3 & sec. 2.2) to refine the endmember and fractional abundance estimates of X’. The spectrally refined estimate of X’, thus obtained, is transformed using a spatial-spectral transformation network (Fig. 4 & sec. 2.3) to obtain the final result (Y).
[Fig. 1 flowchart: the input image Xc is passed through a CNN framework for spectral super-resolution of each pixel vector xic, yielding X'; Xc and X' are collaboratively unmixed to refine the endmember matrix (E'') and abundance estimate (A') of X' with respect to those of Xc, giving the spectral super-resolution estimate Y' = E''A'; Y' is refined using the spatial-spectral prior derived using a 3D-CNN framework to produce the result Y.]
Fig. 1. Proposed spectral super-resolution framework-1
2.1. Proposed spectral-super-resolution of pixel-spectra
This section discusses the proposed convolutional architectures for spectrally upscaling the coarser pixel-spectra. In this regard, a 1D-CNN based sparse-coding framework is employed. Extending our studies on a spatial super-resolution framework [26], the spectral super-resolution of pixel-spectra can be formulated as:
x_i' = d_hr * z*,   z* = argmin_z ( ‖f_i − d_lr * z‖_2^2 + λ‖z‖_1 )      (1)

where x_i' is the spectrally super-resolved version of the coarse-resolution input spectrum (x_i^c), d_lr and d_hr are respectively the low- and high-resolution dictionaries, * denotes the convolution operation, and f_i is the deconvolved feature representation derived from x_i^c. Inspired by the ISTA approach [27-29], a majorization-minimization algorithm is employed to estimate the next minimal point of F(z) by minimizing a surrogate function G_k(z) at each z_k such that G_k(z) > F(z) ∀ z ≠ z_k and F(z_k) = G_k(z_k). The function G_k(z), at each iteration k, is solved using the soft-thresholding operator. Hence, the sparse code optimization in (1) can be reformulated in an iterative form as:

z^(k+1) = h_θ( w * x_i^c + s * z^k ),   θ = λ/β      (2)

where s and w are convolutional units, λ is the scaling constant, h_θ(·) is the shrinkage thresholding function with a threshold θ, and β is the step size. A proximal regularization strategy on the linearized function F(z) was also explored to reach an iterative formulation; however, (2) is preferred due to its computational efficiency. From (2), it is clear that the sparse code optimization can be accomplished using a recurrent convolutional network. Assuming the threshold to be positive, the shrinkage thresholding function for the kth iteration can be expressed as:

h_θ(z^k) = sign(z^k) max(|z^k| − θ, 0) = θ sign(z^k) max(|z^k|/θ − 1, 0)      (3)

According to (3), h_θ(·) can be implemented using two linear scaling layers (with weights θ and 1/θ respectively) separated by a unit-thresholded ReLU layer. The proposed implementation of (2) in a network setting is summarized in Fig. 2. The coarse-resolution input spectrum is projected onto a set of convolution (d_lr) and deconvolution (d_lr and d_hr) filters to implement the dynamic upscaling. The sub-network consisting of the convolutional filters w and s optimizes the sparse codes. The optimized sparse code (z*) is projected onto the high-resolution dictionary (d_hr), and the latent representation thus obtained is inverted to get the super-resolved spectrum.
[Fig. 2 schematic: the coarse-resolution input spectrum x_i^c feeds two streams (d_lr followed by w, and d_lr'1/d_lr'2 followed by s); a recurrent soft-thresholding block, built from scaling layers θ and 1/θ around a unit-thresholded ReLU, yields the sparse code that d_hr maps to the super-resolved spectrum x_i'.]
Fig. 2. Spectral super-resolution architecture-1
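The recurrence in (2), together with the scaling/ReLU realization of the soft-thresholding operator in (3), can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the learned 1D convolutional units w and s are replaced by dense matrices initialized in the classic ISTA fashion, and the low- and high-resolution dictionaries are random stand-ins.

```python
import numpy as np

def soft_threshold(z, theta):
    """Eq. (3): realized as scaling by 1/theta, a unit-thresholded ReLU,
    and scaling back by theta (sign restored explicitly)."""
    return theta * np.sign(z) * np.maximum(np.abs(z) / theta - 1.0, 0.0)

def lista_sparse_code(x_c, W, S, theta, n_iter=10):
    """Eq. (2): recurrent refinement z_{k+1} = h_theta(W x + S z_k)."""
    b = W @ x_c                   # projection of the input (role of the unit w)
    z = soft_threshold(b, theta)  # first iterate
    for _ in range(n_iter):
        z = soft_threshold(b + S @ z, theta)
    return z

# toy usage with random dictionaries standing in for the learned d_lr / d_hr
rng = np.random.default_rng(0)
b_coarse, r_fine, n_atoms = 30, 174, 64
d_lr = rng.standard_normal((b_coarse, n_atoms))
d_hr = rng.standard_normal((r_fine, n_atoms))
beta = np.linalg.norm(d_lr, 2) ** 2            # step size (Lipschitz constant)
W = d_lr.T / beta                              # classic ISTA initialization of w ...
S = np.eye(n_atoms) - (d_lr.T @ d_lr) / beta   # ... and of s
x_c = rng.standard_normal(b_coarse)
z_star = lista_sparse_code(x_c, W, S, theta=0.05, n_iter=10)
x_fine = d_hr @ z_star                         # projection onto the HR dictionary
```

In the paper these matrices are learned end-to-end, so the ISTA initialization above is only a reasonable starting point, not the trained operator.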
It may be noted that, instead of using pre-defined thresholds (based on λ and β), the network is made to learn them. The proposed architecture-1 improves the learning capability and hence the reconstruction accuracy significantly. Inspired by the FISTA approach [30], in order to improve the convergence of (2), a specific linear combination of the previous two points (z^k and z^(k−1)) is employed for computing a given optimal point (z^(k+1)). Hence, the sparse code optimization is reformulated as:
t_(k+1) = (1 + √(1 + 4t_k²)) / 2      (4)

y^(k+1) = z^k + ((t_k − 1)/t_(k+1)) (z^k − z^(k−1))      (5)

z^(k+1) = h_θ( w * x_i^c + s * y^(k+1) )      (6)

where h_θ(·) is the soft-thresholding block and t_1 is set to 1. The modified formulation (4)-(6) is embedded in the super-resolution pipeline as shown in Fig. 3. The approach significantly reduces the average number of recurrences required. Also, the network is much lighter when compared to the previous
strategies. Moreover, the sensitivity of the approach towards the length of the recurrent network is reduced.
[Fig. 3 schematic: the coarse-resolution input spectrum x_i^c is projected through the filters (1/β)w^T, d_lr, d_lr'1, and d_lr'2; the soft-thresholding block h_θ(·) and a momentum branch combining z^k and z^(k−1) feed d_hr, which produces the super-resolved spectrum x_i'.]
Fig. 3. Spectral super-resolution architecture-2
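Reading the accelerated iteration (4)-(6) as FISTA [30], a minimal NumPy sketch is given below; the learned convolutional units are replaced here by the analytic gradient step of the lasso objective, so this illustrates the momentum scheme rather than the trained network.

```python
import numpy as np

def soft_threshold(z, theta):
    """Soft-thresholding block h_theta(.)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def fista_sparse_code(x_c, d_lr, lam, n_iter=50):
    """Eqs. (4)-(6): momentum built from the previous two iterates; t_1 = 1."""
    beta = np.linalg.norm(d_lr, 2) ** 2  # step size (Lipschitz constant of grad)
    z = np.zeros(d_lr.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        grad = d_lr.T @ (d_lr @ y - x_c)
        z_next = soft_threshold(y - grad / beta, lam / beta)  # eq. (6)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # eq. (4)
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)      # eq. (5)
        z, t = z_next, t_next
    return z
```

The momentum factor (t_k − 1)/t_(k+1) is exactly the linear combination of z^k and z^(k−1) described in the text, which is what cuts the number of recurrences relative to (2).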
A reversible learning based hyper-parameter optimization [23] is adopted to fine-tune the network parameters. The numbers of filters in the first, second, and third layers of both the architectures are set to be 128, 256, and 512 respectively. Similarly, the fourth, fifth, sixth, seventh and eighth layers have 512, 256, 64, 32 and 1 filters respectively. It may be noted that equal number of multi-sized 1D kernels having sizes of 1×3, 1×5, 1×7, 1×9, 1×11, and 1×13 are employed in all the layers. In both the architectures, mean squared error (MSE) based and spectral dissimilarity based loss functions are minimized through back propagation to train the network weights. The MSE loss (LMSE), between the reconstructed spectrum (xi’) and its corresponding ground truth (xif), is computed as:
L_MSE = (1/M) Σ_(m=1)^M ( x_i'(m) − x_i^f(m) )²      (7)

where M denotes the total number of spectral bands. Similarly, the spectral-dissimilarity loss (L_spectral) is estimated as the inverse of the average spectral similarity between x_i' and x_i^f, i.e.,

L_spectral = ( ‖x_i'‖ ‖x_i^f‖ ) / ⟨x_i', x_i^f⟩      (8)
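The per-pixel losses (7) and (8) are straightforward to express; a NumPy sketch follows, where the small eps guard against division by zero is an added safeguard, not part of the paper's formulation.

```python
import numpy as np

def mse_loss(x_rec, x_gt):
    """Eq. (7): mean squared error over the M spectral bands."""
    return np.mean((x_rec - x_gt) ** 2)

def spectral_dissimilarity_loss(x_rec, x_gt, eps=1e-8):
    """Eq. (8): inverse of the cosine (spectral) similarity between the
    reconstructed spectrum and its ground truth."""
    cos = np.dot(x_rec, x_gt) / (np.linalg.norm(x_rec) * np.linalg.norm(x_gt) + eps)
    return 1.0 / (cos + eps)
```

For identical spectra the dissimilarity loss approaches its minimum of 1, so minimizing it drives the reconstructed spectrum toward the ground-truth direction independently of its magnitude.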
2.2. Proposed collaborative unmixing to improve the fidelity of the reconstruction In order to finetune the spectrally enhanced result with respect to the endmember and abundance estimates (derived from the coarser input), a joint spectral unmixing based optimization strategy is formulated as:
{E'', A'} = argmin_(E, E', A ≥ 0) ( ‖X_c − E A‖_F² + ‖X' − E' A‖_F² )      (9)

where E and E' are the endmember matrices of X_c and X' respectively, and A is the corresponding fractional abundance matrix; E'' and A' denote the refined versions of E' and A. The additive (sum-to-one) constraint is enforced using the approach discussed in Heinz and Chang [24]. Using the multiplicative update rule, the cost function L(E, E', A) in (9) is optimized by iteratively updating each factor Θ as:

Θ ← Θ ⊙ [∇_Θ⁻ L] ⊘ [∇_Θ⁺ L]      (10)

where ∇_Θ⁻ L and ∇_Θ⁺ L denote the negative and positive coefficients respectively of the derivative of the loss function with respect to Θ, and ⊙ and ⊘ denote element-wise multiplication and division. In this regard, partially differentiating L with respect to A, we have,

∂L/∂A = ∂/∂A ( tr((X_c − EA)ᵀ(X_c − EA)) + tr((X' − E'A)ᵀ(X' − E'A)) )      (11)

From (11), we have,

∂L/∂A = ∂/∂A ( −2 tr(AᵀEᵀX_c) + tr(AᵀEᵀEA) − 2 tr(AᵀE'ᵀX') + tr(AᵀE'ᵀE'A) )      (12)

Using the standard results related to the derivative of the trace, we have,

∂/∂A tr(AᵀEᵀX_c) = EᵀX_c,   ∂/∂A tr(AᵀE'ᵀX') = E'ᵀX'      (13)

∂/∂A tr(AᵀEᵀEA) = 2 EᵀEA      (14)

∂/∂A tr(AᵀE'ᵀE'A) = 2 E'ᵀE'A      (15)

From (12), (13), (14) and (15), the multiplicative update rules can be formulated as:

E ← E ⊙ (X_c Aᵀ) ⊘ (E A Aᵀ)      (16)

E' ← E' ⊙ (X' Aᵀ) ⊘ (E' A Aᵀ)      (17)

A ← A ⊙ (EᵀX_c + E'ᵀX') ⊘ (EᵀE A + E'ᵀE' A)      (18)

The updates (16), (17) and (18) are iterated until the relative change in the cost function between two successive iterations falls below a threshold value (empirically set to 10⁻²). Finally, the spectrally refined super-resolved image (Y') is obtained as:

Y' = E'' A'      (19)
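The update loop (16)-(18) with the relative-change stopping rule can be sketched as below. This is a schematic NumPy version: the sum-to-one constraint of Heinz and Chang [24] is omitted for brevity, and the small eps terms guard against division by zero.

```python
import numpy as np

def collaborative_unmixing(Xc, Xp, E, Ep, A, n_iter=200, tol=1e-2, eps=1e-9):
    """Multiplicative updates (16)-(18) for the joint cost
    ||Xc - E A||_F^2 + ||X' - E' A||_F^2; nonnegativity is preserved
    by construction. Xp plays the role of X', Ep the role of E'."""
    def cost():
        return (np.linalg.norm(Xc - E @ A) ** 2
                + np.linalg.norm(Xp - Ep @ A) ** 2)
    prev = cost()
    for _ in range(n_iter):
        E  *= (Xc @ A.T) / (E @ A @ A.T + eps)                # eq. (16)
        Ep *= (Xp @ A.T) / (Ep @ A @ A.T + eps)               # eq. (17)
        A  *= (E.T @ Xc + Ep.T @ Xp) / (E.T @ E @ A + Ep.T @ Ep @ A + eps)  # eq. (18)
        cur = cost()
        if abs(prev - cur) / (prev + eps) < tol:              # stopping rule
            break
        prev = cur
    return Ep @ A, E, Ep, A   # Y' = E'' A' and the refined factors
```

As in standard NMF, the ratio form of each update makes the cost non-increasing while keeping all factors nonnegative, which is why no explicit projection step is needed.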
2.3. Proposed spatial-spectral transformation This section discusses the proposed transformation for improving the spectrally super-resolved image using spatial-spectral prior derived from the training data. In this regard, a 3D-CNN based encoder-decoder model, as presented in Fig. 4., is employed.
[Fig. 4 schematic: an encoder-decoder in which the spectrally super-resolved image Y' passes through convolutional units of 256, 512 and 1024 filters (each followed by ReLU and max-pooling on the encoder side), mirrored by un-pooling and convolutional units of 512 and 256 filters, and a final convolutional unit of 174 filters, to yield the spatial-spectral prior based transformed image Y.]
Fig. 4. CNN framework (architecture-3) for spectral-spatial transformation
The architecture has 256, 512, 1024, 512, 256 and 174 filters respectively in the first, second, third, fourth, fifth and sixth layers. The size of filters in the first, second, third, fourth, fifth, and sixth layers are set to be 3×3, 5×5, 5×5, 5×5, 3×3, and 1×1 respectively. It may be noted that 2×2 max-pooling units are employed for downscaling the features. Also, un-pooling is implemented by replacing each pixel with a 2×2 grid having the actual value replicated along the main diagonal. The 1×1 convolutional unit maps the output features to the reconstructed 174-channel hyperspectral image. It may be noted that, in addition to the feedforward path, the model employs skip connections to concatenate the corresponding feature maps of the encoder and decoder. Along with the MSE based losses, the spectral dissimilarity and endmember dissimilarity based losses are employed to enhance the spectral fidelity of the reconstruction. In this regard, the spectral-dissimilarity loss (Lspectral) is estimated as:
L_spectral = λ_s ( (1/MN) Σ_(i=1)^M Σ_(j=1)^N ( ‖Y_ij‖ ‖X^f_ij‖ ) / ⟨Y_ij, X^f_ij⟩ )      (20)

where λ_s is the scaling constant, M and N are the spatial dimensions of the reconstructed image (Y) and its corresponding ground truth (X^f), and Y_ij and X^f_ij denote the pixel spectra at location (i, j). The endmember dissimilarity loss (L_end) facilitates the network to favor solutions having valid endmembers and is defined as:

L_end = λ_e ( (1/NE) Σ_(i=1)^NE ( ‖e_i‖ ‖e_i^f‖ ) / ⟨e_i, e_i^f⟩ )      (21)

where λ_e is the scaling constant, e_i and e_i^f are the ith endmembers derived from Y and X^f respectively, and NE is the number of endmembers. In addition to the MSE and spectral fidelity based losses, a total variation (TV) based regularization is also employed. In order to reduce the computational complexity, instead of using the entire hypercube for TV computation, the top three PCs are adopted, i.e.,

L_TV(Y) = λ_TV Σ_(i,j) ( (Y_p(i+1, j) − Y_p(i, j))² + (Y_p(i, j+1) − Y_p(i, j))² )^(φ/2)      (22)
where Yp denotes the top three PCs of Y, λTV denotes the normalizing constant and φ is a scalar whose value determines the trade-off between denoising and sharpness preservation. As discussed in sec. 2.1, a reversible-learning-based hyper-parameter optimization is employed to tune the network parameters.
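The TV regularizer (22), computed on the top three principal components, might be sketched as follows; the SVD-based PCA and the hypercube layout (bands × rows × cols) are assumptions of this illustration, not details specified in the paper.

```python
import numpy as np

def tv_regularizer(Y, lam_tv=0.28, phi=2.0, n_pcs=3):
    """Eq. (22): total-variation penalty evaluated on the top principal
    components Yp of the reconstructed hypercube Y (bands x rows x cols)."""
    b, h, w = Y.shape
    flat = Y.reshape(b, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)   # center each band
    # PCA via SVD: project the spectra onto the top n_pcs components
    U, _, _ = np.linalg.svd(flat, full_matrices=False)
    Yp = (U[:, :n_pcs].T @ flat).reshape(n_pcs, h, w)
    dx = np.diff(Yp, axis=1)[:, :, :-1]   # vertical finite differences
    dy = np.diff(Yp, axis=2)[:, :-1, :]   # horizontal finite differences
    return lam_tv * np.sum((dx ** 2 + dy ** 2) ** (phi / 2.0))
```

Restricting the penalty to three components keeps the cost roughly proportional to three bands instead of the full 174, which is the computational saving the text refers to.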
2.4. Design of network pruning strategy This section discusses a few of the approaches adopted for pruning the proposed sub-frameworks (architectures 3 and 4). The trade-off between computation time and generalization capability of the network is considered while formulating the pruning strategy. Initially, a subset of the training data is employed to train the network so as to sparsify a set of filters and channels that are redundant or not useful with respect to the given data. In this regard, the structured sparsity regularization [25] is adopted. Further to that, criteria such as L0 norm of the filters and channels, L0 norm and standard deviation of activation maps, and mutual information between activation maps and network outputs are employed for ascertaining the importance of filters. The unimportant filters, based on a threshold, are removed. Finally, the pruned network is retrained to finetune the filter weights. The three stages are repeated for a pre-decided number of iterations. A normalizing constant (λp) is employed for deciding the trade-off between pruning
regularizations and reconstruction loss minimization.
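The filter-importance criteria named above (L0 norm and standard deviation of the activation maps) can be sketched as below; the mutual-information criterion and the sparsify-prune-retrain loop are omitted, and the pruning threshold is a free parameter of this illustration.

```python
import numpy as np

def filter_importance(activations, criterion="l0"):
    """Score the filters of one layer by the criteria named in the text.
    activations: (n_filters, n_samples) array of flattened activation maps.
    Returns one importance score per filter."""
    if criterion == "l0":   # fraction of non-zero responses of each filter
        return np.count_nonzero(activations, axis=1) / activations.shape[1]
    if criterion == "std":  # variability of each filter's activation map
        return np.std(activations, axis=1)
    raise ValueError(criterion)

def prune_mask(scores, threshold):
    """Boolean keep-mask: True for filters whose importance exceeds the threshold."""
    return scores > threshold
```

A filter whose activations are almost always zero (low L0 score) or nearly constant (low standard deviation) contributes little to the reconstruction and is a candidate for removal before retraining.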
2.5. Alternate framework
In addition to framework-1 (Fig. 1) discussed above, an alternate approach, as summarized in Fig. 5, has also been experimented with. In this approach, the input image is first spectrally unmixed and the endmembers are spectrally enhanced using the spectral super-resolution framework (Fig. 3 & sec. 2.1). It may be noted that here only the endmembers are spectrally super-resolved.
[Fig. 5 flowchart: the input image Xc is spectrally unmixed into E and A; each endmember spectrum Ei is spectrally super-resolved by the CNN framework to yield Ei'; a joint unmixing based optimization refines E' and A into E'' and A', giving the spectral super-resolution estimate Y' = E''A'; Y' is refined using the spatial-spectral prior derived using a 3D-CNN framework to produce the result Y.]
Fig. 5. Proposed spectral super-resolution framework-2
In order to fine-tune the reconstructed endmember spectra (E') with respect to the initial endmember and abundance estimates, a joint spectral unmixing based optimization strategy is formulated as:

{E'', A'} = argmin_(E'', A' ≥ 0) ( ‖X_c − E A'‖_F² + ‖X' − E'' A'‖_F² ),   X' = E' A      (23)

where E and E' are the endmember matrices of X_c and its spectrally enhanced estimate X' respectively, A is the corresponding fractional abundance matrix, and A' and E'' are the refined versions of A and E' respectively. The additive (sum-to-one) constraint is enforced using the approach discussed in Heinz and Chang [24]. Using the multiplicative update rule, the cost function in (23) is optimized by iteratively updating E'' and A' as:

E'' ← E'' ⊙ (X' A'ᵀ) ⊘ (E'' A' A'ᵀ)      (24)

A' ← A' ⊙ (EᵀX_c + E''ᵀX') ⊘ (EᵀE A' + E''ᵀE'' A')      (25)

The updates (24) and (25) are iterated until the relative change in the cost function between two successive iterations falls below a threshold value (empirically set to 10⁻²). From (23), the spectrally refined super-resolved hypercube (Y') is obtained as:

Y' = E'' A'      (26)
Finally, the spatial-spectral prior derived from the training data, using the framework proposed in sec. 2.3, is employed to improve Y’ to yield the final result Y.
2.6. Accuracy measures
The peak signal to noise ratio (PSNR) [26], spectral similarity measure (SSM) [26, 31-32], endmember similarity measure (ESM), and structural similarity index measure (SSIM) [33-34], computed between the reconstructed image (Y) and its corresponding high-resolution counterpart (X^f), are used as quantitative measures for estimating the accuracy of the proposed spectral super-resolution approaches. The PSNR does not accurately represent human perception; hence SSIM as well as classification accuracies (Kappa statistics) are also reported for a better comparison. The Kappa statistic is computed over the classification results generated from the reconstructed image. The SSIM measures the average perceptual similarity between the spectral bands of Y and X^f and is computed for each band as:

SSIM(x, y) = ( (2 μ_x μ_y + c_1)(2 σ_xy + c_2) ) / ( (μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2) )      (27)

where x and y denote co-located spatial patches of size l×l from Y and X^f respectively, μ and σ² denote the mean and variance of a patch, σ_xy denotes the covariance between the patches, and c_1 and c_2 are constants. It may be noted that both the SSM and ESM estimate the spectral fidelity of the reconstruction. The SSM is computed as the average of the spectral similarity between the pixels of Y and X^f, i.e.,

SSM = (1/MN) Σ_(i=1)^M Σ_(j=1)^N ⟨Y_ij, X^f_ij⟩ / ( ‖Y_ij‖ ‖X^f_ij‖ )      (28)

where M and N are the spatial dimensions of the reconstructed image and its corresponding ground truth, and Y_ij and X^f_ij denote the pixel spectra at location (i, j). The ESM measures the similarity of the endmembers derived from Y and X^f, and is computed as:

ESM = (1/NE) Σ_(i=1)^NE ⟨e_i, e_i^f⟩ / ( ‖e_i‖ ‖e_i^f‖ )      (29)

where e_i and e_i^f are the ith endmembers derived from Y and X^f respectively, and NE is the number of endmembers.
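The band-wise SSIM (27) and the SSM (28) can be sketched as below; the values of c1 and c2 and the eps guard are illustrative choices, not the constants used in the paper.

```python
import numpy as np

def ssim_band(x, y, c1=1e-4, c2=9e-4):
    """Eq. (27): structural similarity between two co-located patches."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def ssm(Y, Xf, eps=1e-8):
    """Eq. (28): average cosine similarity between the pixel spectra of
    Y and Xf (both laid out as bands x rows x cols)."""
    yf = Y.reshape(Y.shape[0], -1)
    xf = Xf.reshape(Xf.shape[0], -1)
    cos = np.sum(yf * xf, axis=0) / (
        np.linalg.norm(yf, axis=0) * np.linalg.norm(xf, axis=0) + eps)
    return float(np.mean(cos))
```

Both measures attain their maximum of 1 for a perfect reconstruction, with SSIM capturing per-band spatial structure and SSM capturing per-pixel spectral shape.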
3. EXPERIMENTS
The proposed spectral super-resolution framework is analyzed over standard benchmark datasets such as Indian Pines, Pavia, Salinas and KSC. The high-resolution image patches are spectrally down-scaled using different interpolation strategies, such as bilinear, bi-cubic, and nearest neighbor interpolation, to generate high- and low-spectral-resolution training/testing pairs. Throughout the experiments, the mini-batch size for training is set to 200; the momentum and weight decay for backpropagation are set to 0.8 and 10⁻³ respectively (obtained through cross validation); the learning rate is initially set to 0.6 and is decreased by a factor of 10 after every 200 epochs. All the models are analyzed for 600 epochs.
3.1. Study of CNN architectures
A summary of the analysis of possible alternatives for the proposed framework is presented in Table 1. As is evident, the encoder-decoder architecture, similar to the semantic segmentation framework, did not yield satisfactory results. The deep learning based sparse-coding strategies, including the 1D, 2D and 3D convolutional approaches, also failed to project the spectrally coarser images to the finer scale. As can be observed from Table 1, the 3D convolutional architecture does not perform well as compared to the other architectures. This can be attributed to the inability of the 3D convolutions to simultaneously model the spectral-spatial feature space, especially with few training samples. The performance of the 3D framework is found to deteriorate with an increase in the spectral dimension of the input. Of the two proposed frameworks, Framework-1 yields much better results than Framework-2. The worse results of Framework-2 in comparison with Framework-1 indicate that the initial spectral unmixing cannot model the spectral base of the input image; the pixel-wise enhancement and the collaborative unmixing adopted in Framework-1 resolve this issue. For a zoom factor of two, as well as for images with low intra-class variances, both frameworks yield comparable results. Also, in comparison with Framework-2, Framework-1 is computationally more complex and may be preferred when the number of samples is small. As is evident from Table 1, the lack of the spectral-spatial transformation (discussed in sec. 2.3) limits the generalization capability of both frameworks. Although deeper and more complex alternatives to architecture-3 have been experimented with, the proposed one is found to be optimal. In the context of spectral super-resolution of endmembers (discussed in sec. 2.1), architecture-2 yields faster convergence when compared to architecture-1.
Table 1. Comparative analysis of possible architectural variations for the proposed framework on Indian Pines dataset
Architecture | Scale | PSNR | SSM | SSIM | Kappa | End member similarity measure
3D convolutional spectral-spatial sparse-coding framework | 2 | 19.59 | 0.9924 | 0.8739 | 0.72 | 0.9891
3D convolutional spectral-spatial sparse-coding framework | 3 | 16.67 | 0.9927 | 0.8552 | 0.68 | 0.9885
1D convolutional spectral sparse-coding framework | 2 | 21.38 | 0.9918 | 0.8776 | 0.75 | 0.9883
1D convolutional spectral sparse-coding framework | 3 | 15.54 | 0.9891 | 0.8521 | 0.66 | 0.9848
Semantic segmentation framework like architecture | 2 | 21.03 | 0.9914 | 0.8854 | 0.74 | 0.9865
Semantic segmentation framework like architecture | 3 | 17.26 | 0.9906 | 0.8637 | 0.69 | 0.9849
Proposed framework-1 | 2 | 25.90 | 0.9996 | 0.8976 | 0.84 | 0.9994
Proposed framework-1 | 3 | 23.76 | 0.9990 | 0.8805 | 0.82 | 0.9991
Proposed framework-1 without spatial-spectral transformation | 2 | 25.08 | 0.9992 | 0.8879 | 0.83 | 0.9984
Proposed framework-1 without spatial-spectral transformation | 3 | 22.95 | 0.9984 | 0.8730 | 0.79 | 0.9991
Proposed framework-2 | 2 | 25.19 | 0.9980 | 0.8954 | 0.84 | 0.9978
Proposed framework-2 | 3 | 22.30 | 0.9971 | 0.8512 | 0.79 | 0.9914
3.2. Analysis of loss functions
An analysis of the effect of losses on the proposed approach for the Indian Pines dataset is summarized in Table 2. A similar trend has been observed for the other datasets. As is evident from Table 2, for the proposed convolutional architectures (architectures 2 and 3), the use of the spectral dissimilarity loss along with the MSE based loss function improves the results. In the case of architecture-3, the use of the endmember dissimilarity loss along with the MSE and spectral dissimilarity based loss functions results in better SSM values. Also, the use of the TV regularizer along with these loss functions improves the spatial fidelity, resulting in higher SSIM values. The computation of the MSE loss in latent space is found to improve the SSIM when compared to the same computed in image space. Based on a grid search based optimization and cross-validation over various datasets, the optimal values of the normalizing constants λ_s and λ_e are found to be 0.8 and 0.1 respectively. For TV regularization, the optimal values for φ and the normalizing constant (λ_TV) are observed to be 2 and 0.28 respectively. The considerable impact of the loss functions on the performance of the proposed architecture-3 can mainly be attributed to the consideration of the entire input image as an entity for the computation of these losses. It may be noted that L_spectral and L_end consider the input image as a whole rather than as a stack of independent spectral bands. When the input spectral dimension and zoom factor are low, the effects of the loss functions are not significant. In the case of architecture-2, the spectral dissimilarity loss models the vector similarity of the spectra more accurately than the mean squared error. Also, the former considers the spectral features better than the latter. The influence of TV regularization is also significant for higher zoom factors, as the effect of noise is significant in hyperspectral image reconstruction.
Table 2. Analysis of the effect of loss functions on the proposed framework on Indian Pines dataset
Architecture | Loss function | Scale | PSNR | SSM | SSIM | Kappa | End member similarity measure
Architecture-2 | LMSE | 2 | 24.86 | 0.9988 | 0.8670 | 0.82 | 0.9983
Architecture-2 | LMSE | 3 | 22.57 | 0.9981 | 0.8703 | 0.78 | 0.9961
Architecture-2 | LMSE+Lspectral | 2 | 25.90 | 0.9996 | 0.8976 | 0.84 | 0.9994
Architecture-2 | LMSE+Lspectral | 3 | 23.76 | 0.9990 | 0.8805 | 0.82 | 0.9991
Architecture-3 | LMSE | 2 | 22.92 | 0.9976 | 0.8432 | 0.78 | 0.9930
Architecture-3 | LMSE | 3 | 21.05 | 0.9960 | 0.8218 | 0.75 | 0.9924
Architecture-3 | LMSE+Lspectral | 2 | 23.27 | 0.9983 | 0.8759 | 0.80 | 0.9971
Architecture-3 | LMSE+Lspectral | 3 | 22.18 | 0.9974 | 0.8432 | 0.76 | 0.9952
Architecture-3 | LMSE+Lspectral+Lend | 2 | 25.47 | 0.9992 | 0.8895 | 0.83 | 0.9990
Architecture-3 | LMSE+Lspectral+Lend | 3 | 23.64 | 0.9987 | 0.8798 | 0.80 | 0.9989
Architecture-3 | LMSE+Lspectral+Lend+LTV | 2 | 25.90 | 0.9996 | 0.8976 | 0.84 | 0.9994
Architecture-3 | LMSE+Lspectral+Lend+LTV | 3 | 23.76 | 0.9990 | 0.8805 | 0.82 | 0.9991
In order to further investigate the effect of the loss functions on other network architectures, different CNN models have been analyzed. It was observed that the hyperspectral reconstruction networks are sensitive to the loss functions especially when the zoom factor or input spectral dimension is high. A similar trend was also observed for simple hypercube reconstruction as well as for spatial resolution enhancement. However, the effects of losses are not significant for images with low spectral dimension or when individual bands are used.
3.3. Analysis of network parameter settings
Experiments over standard datasets indicate that the proposed convolutional sub-frameworks (architectures 1, 2, and 3) are less sensitive to the network depth. As is evident from Figs. 6-7, for all these architectures, an increase in network depth only slightly improves the peak signal to noise ratio (PSNR). For architectures 1 and 2, the depth of the deconvolution layers is determined based on the zoom factor. Based on a grid search based optimization, the optimal number of convolutional layers is found to be 3-5. The numbers of recurrent cycles for architectures 1 and 2 are empirically set to 10 and 5 respectively. An increase in the number of inversion blocks (above the empirically found optimal value of 3) does not affect the reconstruction accuracy. The lower sensitivity of architectures 1 and 2 to variation in network depth can be attributed to the stability achieved through sparse code optimization. For architecture-3, an increase in the number of layers improves the accuracy up to a limit, beyond which it deteriorates. In all the architectures, an increase in the number of filters in each layer improves the PSNR (Figs. 8-9) but increases the running time; hence an optimal choice depends on the trade-off between the two. Experiments indicate that an increase in the size of the filters improves the PSNR up to a threshold, beyond which it deteriorates. This trend is more evident for the transformation sub-framework (architecture-3) than for the super-resolution sub-frameworks (architectures 1 and 2). This can be attributed to the underfitting of the encoder-decoder framework, resulting in poor reconstruction of small features. Hence, the optimal filter sizes should be determined with reference to the scene homogeneity/complexity. It is further observed that an increase in the diversity of kernel sizes improves the accuracy and yields better reconstructions even at shorter depths.
Fig. 6. Variation in reconstruction PSNR with respect to network-depths for architecture-2
Fig. 7. Variation in reconstruction PSNR with respect to network-depths for architecture-3
Fig. 8. Variation in reconstruction PSNR with respect to no. of filters in layer-1 for architecture-2
Fig. 9. Variation in reconstruction PSNR with respect to no. of filters in layer-1 for architecture-3
3.4. Evaluation of the proposed method
The PSNR, SSM, and SSIM values, estimated between the original and reconstructed images for different approaches on standard datasets, are presented in Table 3. As is evident, the proposed approach predicts finer spectral features and better discriminates even spectrally similar classes as compared to the bilinearly interpolated coarser spectra. A comparison of the support vector machine (SVM) based classification of the super-resolved outputs is presented in Table 4 and Figs. 10-12. Many of the state-of-the-art spectral-super-resolution approaches that give satisfactory results for RGB images do not give the expected results for their remote sensing counterparts. In this regard, some recent spectral-super-resolution approaches, such as Arad and Ben-Shahar [1], Xiong et al. [5], Shi et al. [8] and Yan et al. [9], are used as benchmarks. Although most of these approaches use CNNs, the spectral-super-resolution of aerial datasets is highly ill-posed, causing over- or under-fitting. The irregular spatial structure of land cover features also affects the reconstructions. From the results (Tables 3-4), it can be observed that the proposed framework outperforms the other methods. The spectral-prior based optimization reduces the ill-posedness, resulting in better results even with minimum training samples. It may be noted that the proposed approach simultaneously improves the SSM as well as the SSIM values. This can be attributed to the unmixing prior based
spectral upscaling as well as the spatial-spectral-prior-based transformation. The proposed loss functions also improve both the spatial and the spectral fidelity of the reconstructions. Although the comparative analyses discussed above illustrate the better accuracy and performance of the proposed approaches, the limited number of standard datasets affects the effectiveness and reliability of such comparisons. For instance, the high accuracy obtained in the above experiments may be due to overlap between the testing and training image patches. For a fair evaluation, the networks have been trained and tested with mutually exclusive unmanned aerial vehicle (UAV) and AVIRIS-NG image patches. Both these experiments, over 18030 training patches and 80 test images, indicate that the proposed approaches perform better than the prominent spectral-super-resolution techniques.
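The reconstruction metrics reported in Table 3 can be computed along the following lines. The exact definitions of the SSM and the endmember similarity measure used in the paper are not reproduced here, so this sketch shows PSNR and a mean spectral-angle similarity as illustrative stand-ins over a toy hypercube:

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio between a reference and a reconstructed cube."""
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mean_spectral_angle(ref, rec, eps=1e-12):
    """Mean spectral angle (radians) over all pixels of two (H, W, B) cubes."""
    a = ref.reshape(-1, ref.shape[-1])
    b = rec.reshape(-1, rec.shape[-1])
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy cube: a near-perfect reconstruction should score a high PSNR
# and a near-zero mean spectral angle.
rng = np.random.default_rng(0)
ref = rng.random((8, 8, 30))
rec = np.clip(ref + 0.01 * rng.standard_normal(ref.shape), 0.0, 1.0)
quality = psnr(ref, rec)            # roughly 40 dB for ~1% additive noise
angle = mean_spectral_angle(ref, rec)
```

A small mean spectral angle indicates that pixel spectra are preserved in shape, which is the property the endmember-similarity-based loss functions target; PSNR alone would not capture this.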
4. CONCLUSION

Spectral super-resolution, involving the recovery of finer-scale hypercubes from their spectrally coarser versions, is a highly ill-posed problem. Most conventional approaches give blurred and inaccurate results for spectral super-resolution of remote sensing datasets. In this regard, this study explores convolutional neural networks and collaborative spectral unmixing to address these issues. The proposed soft-thresholding-based recurrent sub-network, dynamic upscaling, and spectral-dissimilarity-based loss functions are found to be optimal for spectral super-resolution of the endmember spectra. The proposed collaborative unmixing of the spectrally coarser input and its finer-scale estimate improves the spectral-spatial fidelity of the reconstructions. In addition, the spatial-spectral prior, derived from the training data using a CNN framework, is employed to further fine-tune the results. Experiments indicate that the proposed framework improves the results when compared to the existing spectral super-resolution strategies. The better results can be attributed to the three-stage approach, which considers both spatial and spectral characteristics, making the problem less ill-posed. Augmentation of the training data is adopted to improve the generalization capability of the framework. The proposed models for endmember enhancement and spatial-spectral-prior-based transformation are found to be less sensitive to the network parameters. Parallelization can be explored to improve the computational efficiency of the proposed framework.
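The soft-thresholding-based recurrent sub-network builds on iterative shrinkage-thresholding for sparse coding (cf. [27], [30]). Purely as an illustration of that building block, and not of the paper's actual architecture, a minimal ISTA sketch over a random unit-norm dictionary:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, y, lam=0.1, n_iter=500):
    """Basic ISTA for sparse coding: min_z 0.5||y - D z||^2 + lam ||z||_1."""
    L = np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the data-fit gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + D.T @ (y - D @ z) / L, lam / L)
    return z

# Toy problem: recover a 2-sparse code from 20 measurements of a 50-atom dictionary.
rng = np.random.default_rng(1)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)      # unit-norm dictionary atoms
z_true = np.zeros(50)
z_true[3], z_true[17] = 1.5, -2.0
y = D @ z_true
z = ista(D, y, lam=0.05)
```

Unrolling such iterations into network layers, with the threshold and dictionary learned end-to-end (as in [29]), is what makes the sparse codes jointly optimizable with the super-resolution objective.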
Table 3. Comparative analysis of the proposed method with prominent spectral-super-resolution approaches for zoom factor = 3

Dataset      | Approach                | PSNR  | SSIM   | SSM    | Endmember similarity measure
-------------|-------------------------|-------|--------|--------|-----------------------------
Indian Pines | Arad and Ben-Shahar [1] | 17.50 | 0.7583 | 0.9879 | 0.9901
Indian Pines | Xiong et al. [5]        | 16.79 | 0.7965 | 0.9856 | 0.9895
Indian Pines | Yan et al. [9]          | 19.37 | 0.8214 | 0.9899 | 0.9912
Indian Pines | Shi et al. [8]          | 21.28 | 0.8680 | 0.9954 | 0.9940
Indian Pines | Proposed framework      | 23.76 | 0.8805 | 0.9980 | 0.9984
Salinas      | Arad and Ben-Shahar [1] | 19.40 | 0.7042 | 0.9934 | 0.9918
Salinas      | Xiong et al. [5]        | 21.76 | 0.7662 | 0.9941 | 0.9943
Salinas      | Yan et al. [9]          | 23.49 | 0.7797 | 0.9962 | 0.9957
Salinas      | Shi et al. [8]          | 25.09 | 0.7965 | 0.9974 | 0.9970
Salinas      | Proposed framework      | 26.89 | 0.8497 | 0.9987 | 0.9996
Pavia        | Arad and Ben-Shahar [1] | 19.47 | 0.6129 | 0.9848 | 0.9890
Pavia        | Xiong et al. [5]        | 20.04 | 0.6950 | 0.9866 | 0.9873
Pavia        | Yan et al. [9]          | 20.78 | 0.6438 | 0.9909 | 0.9927
Pavia        | Shi et al. [8]          | 21.93 | 0.7052 | 0.9922 | 0.9934
Pavia        | Proposed framework      | 24.48 | 0.7874 | 0.9991 | 0.9992
KSC          | Arad and Ben-Shahar [1] | 17.72 | 0.6482 | 0.9896 | 0.9903
KSC          | Xiong et al. [5]        | 16.45 | 0.6316 | 0.9890 | 0.9897
KSC          | Yan et al. [9]          | 19.56 | 0.6678 | 0.9902 | 0.9946
KSC          | Shi et al. [8]          | 20.14 | 0.7046 | 0.9923 | 0.9973
KSC          | Proposed framework      | 23.79 | 0.7692 | 0.9982 | 0.9989
Table 4. Comparative analysis of the proposed method with prominent spectral-super-resolution approaches for zoom factor = 3

Dataset      | Approach                | Overall Accuracy | Kappa
-------------|-------------------------|------------------|------
Indian Pines | Arad and Ben-Shahar [1] | 72.91            | 0.68
Indian Pines | Xiong et al. [5]        | 74.59            | 0.71
Indian Pines | Yan et al. [9]          | 77.82            | 0.74
Indian Pines | Shi et al. [8]          | 80.34            | 0.76
Indian Pines | Proposed framework      | 86.15            | 0.82
Salinas      | Arad and Ben-Shahar [1] | 72.51            | 0.70
Salinas      | Xiong et al. [5]        | 76.84            | 0.72
Salinas      | Yan et al. [9]          | 77.09            | 0.74
Salinas      | Shi et al. [8]          | 83.26            | 0.78
Salinas      | Proposed framework      | 92.58            | 0.86
Pavia        | Arad and Ben-Shahar [1] | 70.49            | 0.65
Pavia        | Xiong et al. [5]        | 76.05            | 0.68
Pavia        | Yan et al. [9]          | 80.39            | 0.72
Pavia        | Shi et al. [8]          | 81.62            | 0.77
Pavia        | Proposed framework      | 87.49            | 0.83
KSC          | Arad and Ben-Shahar [1] | 74.39            | 0.69
KSC          | Xiong et al. [5]        | 75.91            | 0.68
KSC          | Yan et al. [9]          | 80.27            | 0.74
KSC          | Shi et al. [8]          | 82.68            | 0.79
KSC          | Proposed framework      | 85.42            | 0.81
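The overall accuracy and kappa values in Table 4 follow from the confusion matrix of the SVM predictions against the reference labels. A minimal sketch of that computation, using toy labels rather than the actual classification maps:

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred, n_classes):
    """Overall accuracy and Cohen's kappa from reference vs. predicted labels."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # rows: reference, cols: prediction
    n = cm.sum()
    po = np.trace(cm) / n                          # observed agreement = overall accuracy
    pe = np.sum(cm.sum(0) * cm.sum(1)) / n ** 2    # agreement expected by chance
    return po, (po - pe) / (1.0 - pe)

# Toy labels standing in for the per-pixel SVM output and the ground truth.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]
oa, kappa = overall_accuracy_and_kappa(y_true, y_pred, 3)
# oa = 0.875 (7 of 8 correct); kappa = 17/21, about 0.81
```

Kappa discounts chance agreement, which is why it falls below the overall accuracy in Table 4 and is the more conservative indicator when class proportions are imbalanced.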
Fig. 10. Results of the support vector machine (SVM) classifier on the spectral-super-resolution outputs of prominent approaches on the Indian Pines dataset for zoom factor = 3: (i) ground truth, (ii) Yan et al. [9], (iii) Shi et al. [8], (iv) proposed method
Fig. 11. Results of the SVM classifier on the spectral-super-resolution outputs of prominent approaches on the Salinas dataset for zoom factor = 3: (i) ground truth, (ii) Yan et al. [9], (iii) Shi et al. [8], (iv) proposed approach
Fig. 12. Results of the SVM classifier on the spectral-super-resolution outputs of prominent approaches on the KSC dataset for zoom factor = 3: (i) ground truth, (ii) Yan et al. [9], (iii) Shi et al. [8], (iv) proposed approach
5. REFERENCES

[1] B. Arad and O. Ben-Shahar, "Sparse recovery of hyperspectral signal from natural RGB image," in Proceedings of the ECCV, pp. 1-7, 2016.
[2] J. Aeschbacher, J. Wu, and R. Timofte, "In defense of shallow learned spectral reconstruction from RGB images," in Proceedings of the ICCVW, pp. 1-2, 2017.
[3] N. Akhtar and A. Mian, "Hyperspectral recovery from RGB images using Gaussian processes," arXiv preprint, arXiv:1801.04654, 2018.
[4] V. Heikkinen, "Spectral reflectance estimation using Gaussian processes and combination kernels," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3358-3373, July 2018.
[5] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu, "HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections," in Proceedings of the ICCVW, pp. 1-7, 2017.
[6] S. Galliani, C. Lanaras, D. Marmanis, E. Baltsavias, and K. Schindler, "Learned spectral super-resolution," arXiv preprint, arXiv:1703.09470, 2017.
[7] A. Alvarez-Gila, J. van de Weijer, and E. Garrote, "Adversarial networks for spatial context-aware spectral image reconstruction from RGB," in Proceedings of the ICCVW, pp. 1-2, 2017.
[8] Z. Shi, C. Chen, Z. Xiong, D. Liu, and F. Wu, "HSCNN+: Advanced CNN-based hyperspectral recovery from RGB images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[9] Y. Yan, L. Zhang, J. Li, W. Wei, and Y. Zhang, "Accurate spectral super-resolution from single RGB image using multi-scale CNN," Pattern Recognition and Computer Vision, Lecture Notes in Computer Science, vol. 11257, 2018.
[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, Jan. 2013.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint, arXiv:1512.03385, Dec. 2015.
[12] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), vol. 37, pp. 448-456, Lille, France, July 2015.
[13] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint, arXiv:1602.07261 [cs.CV], Feb. 2016.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," arXiv preprint, arXiv:1406.2661 [cs.CV], June 2014.
[15] M.S.M. Sajjadi, B. Schölkopf, and M. Hirsch, "EnhanceNet: Single image super-resolution through automated texture synthesis," arXiv preprint, arXiv:1612.07919 [cs.LG], Dec. 2016.
[16] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 4829-4837, 2016.
[17] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint, arXiv:1609.04802 [cs.CV], Sept. 2016.
[18] S. Mei, J. Ji, J. Hou, X. Li, and Q. Du, "Learning sensor-specific spatial-spectral features of hyperspectral images via convolutional neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4520-4533, Aug. 2017.
[19] P.V. Arun, I. Herrmann, K.M. Buddhiraju, and A. Karnieli, "Convolutional network architectures for super-resolution/sub-pixel mapping of drone-derived images," Pattern Recognition, Dec. 2018.
[20] M.T. McCann, K.H. Jin, and M. Unser, "Convolutional neural networks for inverse problems in imaging: A review," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 85-95, June 2017.
[21] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[22] F. Kiaee, C. Gagné, and M. Abbasi, "Alternating direction method of multipliers for sparse convolutional neural networks," arXiv preprint, arXiv:1611.01590v3 [cs.NE], Jan. 2017.
[23] D. Maclaurin, D. Duvenaud, and R.P. Adams, "Gradient-based hyperparameter optimization through reversible learning," in Proceedings of the 32nd International Conference on Machine Learning (ICML), vol. 37, pp. 2113-2122, July 2015.
[24] D.C. Heinz and C.I. Chang, "Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 3, pp. 529-545, March 2001.
[25] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proceedings of the NIPS, pp. 2074-2082, Dec. 2016.
[26] P.V. Arun, K.M. Buddhiraju, and A. Porwal, "Spatial-spectral feature based approach towards convolutional sparse coding of hyperspectral images," Computer Vision and Image Understanding, 2019.
[27] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413-1457, Nov. 2003.
[28] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen, "Sparse coding via thresholding and local competition in neural circuits," Neural Computation, vol. 20, no. 10, pp. 2526-2563, Oct. 2008.
[29] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, pp. 399-406, June 2010.
[30] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009.
[31] P.V. Arun, K.M. Buddhiraju, and A. Porwal, "Spatial-spectral feature based approach towards convolutional sparse coding of hyperspectral images," Computer Vision and Image Understanding, vol. 188, no. 11, p. 102797, Nov. 2019.
[32] P.V. Arun and K.M. Buddhiraju, "A deep learning based spatial dependency modelling approach towards super-resolution," in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, pp. 6533-6536, 2016.
[33] Z. Wang, Z. Wang, and Q. Li, "Information content weighting for perceptual image quality assessment," IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185-1198, May 2011.
[34] A. Hore and D. Ziou, "Image quality metrics: PSNR vs. SSIM," in Proceedings of the 20th International Conference on Pattern Recognition, pp. 2366-2369, Aug. 2010.
Author Contributions
P.V. Arun: Conceptualization, Methodology, Software, Data curation, Formal analysis, Investigation, Writing - Original draft preparation
K.M. Buddhiraju: Supervision, Methodology, Project administration, Funding acquisition, Writing - Reviewing and Editing, Resources
A. Porwal: Supervision, Methodology, Validation, Writing - Reviewing and Editing, Resources
J. Channusot: Investigation, Supervision, Writing - Reviewing and Editing, Formal analysis, Resources
Conflict of Interest
All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version. This manuscript has not been submitted to, nor is it under review at, another journal or other publishing venue. The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.