Medical image fusion based on sparse representation of classified image patches


Biomedical Signal Processing and Control 34 (2017) 195–205


Jing-jing Zong a,b, Tian-shuang Qiu a,∗

a Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China
b School of Electrical & Information Engineering, Dalian Jiaotong University, Dalian 116028, China
∗ Corresponding author at: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China. E-mail addresses: [email protected] (J.-j. Zong), [email protected] (T.-s. Qiu).
http://dx.doi.org/10.1016/j.bspc.2017.02.005

Article info

Article history: Received 23 April 2016; received in revised form 20 September 2016; accepted 5 February 2017.

Keywords: Medical image fusion; Sparse representation; Patch classification; Online dictionary learning (ODL); Least angle regression (LARS)

Abstract

Medical image fusion is one of the hot research topics in the field of medical imaging and radiation medicine, and it is widely recognized in both the medical and engineering fields. In this paper, a new fusion scheme for medical images based on sparse representation of classified image patches is proposed. In this method, first, the registered source images are divided into classified patches according to the patch geometrical direction, a corresponding sub-dictionary is trained for each class via the online dictionary learning (ODL) algorithm, and the least angle regression (LARS) algorithm is used to sparsely code each patch; second, the sparse coefficients are combined with the "choose-max" fusion rule; finally, the fused image is reconstructed from the combined sparse coefficients and the corresponding sub-dictionaries. The experimental results show that the proposed method outperforms other methods in terms of both visual perception and objective evaluation metrics. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Multi-modal medical image fusion, which aims to combine complementary multi-source information into a single fused image for visual perception or computer analysis [1], has been emerging as a promising research area. For example, combined MRI/CT imaging is effective for diagnosis because it displays both soft-tissue and bony-structure information [2]; combined PET/MRI imaging can extract both functional and structural information for clinical diagnosis and treatment [1,3]. Fusion not only helps in diagnosing diseases but also reduces storage cost, and it is widely recognized in both the medical and engineering fields.

As one of the medical image post-processing techniques that emerged in the 1990s [4], medical image fusion can generally be divided into spatial-domain and transform-domain techniques [5]. Spatial-domain techniques use localised spatial features to guide the fusion. They may simply take the pixel average of the source images, which is simple and fast but leads to undesirable effects such as reduced contrast, artifacts and spectral distortion in the fused image [3,6]; or they may fuse the source images over divided blocks or segmented regions [7,8] and use spatial frequency or gradient as the activity level measurement (ALM) of the regional saliency, which is less sensitive to noise but still suffers from block effects [7].

Transform-domain techniques use the transform coefficients of the source images in some basis as the features for fusion. As the salient features of an image are depicted more clearly in the transform domain than in the spatial domain [9], the performance of transform-domain methods is generally better than that of spatial-domain methods, although they are more time consuming. Typical transforms, including the Discrete Wavelet Transform (DWT) [10] and the Non-Subsampled Contourlet Transform (NSCT) [11,12], have been used to decompose the source images.

Recently, with the rise of compressed sensing [13], image fusion based on sparse representation (SR) has attracted great attention [14–19]. SR-based fusion uses the sparse coefficients in the sparse domain as the features participating in the fusion, and it can also be classified as a transform-domain method. The fusion framework proposed by Mitianoudis and Stathaki [9] can be regarded as the predecessor of SR-based fusion; in [9], the source images are divided into patches by the "sliding window" technique, which has been shown to perform better in capturing local salient features and keeping shift invariance. Yang and Li [14] proposed a multi-focus image fusion method with sparse representation, which took the first step for SR-based fusion algorithms. Yu and Qiu [15] proposed an image fusion method based on joint sparse representation (JSR). As the choice of dictionary plays a crucial role in sparse representation, some studies have focused on developing effective dictionary learning algorithms for image fusion; Zhang et al. [19] presented a dictionary learning method based on JSR.


Many dictionary learning methods in these works aim at learning a universal over-complete dictionary to represent various image structures. However, algorithms that exploit a single dictionary to represent all image patches may not reflect the differences among various patch types, and sparse decomposition over a highly redundant dictionary tends to generate visual artifacts [20,21]. Moreover, these methods address only gray-level image fusion, and their computational efficiency is low. To address the issues above, and motivated by the powerful ability of clustering-based sparse coding in image deblurring, super-resolution, etc. [20,22,23], a novel fusion method for medical images based on sparse representation of classified image patches is proposed. In this method, image patches are first classified according to their geometrical directions; then, rather than a single dictionary, multi-class dictionaries are learned, one within each class; and the most relevant sub-dictionary is adaptively selected for each image patch to be coded during the fusion process.

The rest of this paper is organized as follows. In Section 2, the basic theory of sparse representation is briefly reviewed. The detailed procedures of the proposed fusion scheme are described in Section 3. Experimental results and comparisons are presented in Section 4. Finally, we conclude this work and discuss future research in Section 5.

2. Sparse representation and dictionary learning

Sparse representation has proven to be an extraordinarily powerful statistical signal modeling tool and has a good reputation in both theoretical research and practical applications [24]. It has achieved great success in image de-noising, super-resolution, image classification, etc. [20,25–27]. The sparse representation model expresses a signal over a given over-complete dictionary in order to obtain a more concise representation of the signal. Suppose that x ∈ R^w is the signal, and D = [d_1, ..., d_K] ∈ R^{w×K} (w < K) is a given dictionary of atoms d_l. The sparse representation of x over D is to find a sparse vector s = [s_1; ...; s_K] ∈ R^K such that x = Ds exactly or x ≈ Ds approximately, satisfying \|x - Ds\|_2 ≤ ε [28]. The sparse vector s can be found by

\hat{s} = \arg\min_{s}\ \|x - Ds\|_2^2 + \lambda\|s\|_0    (1)

where λ > 0 is a regularization parameter and \|·\|_p denotes the l_p-norm. The l_0-minimization is an NP-hard problem, and approximate solutions including greedy algorithms (e.g., MP, OMP) [29,30] and convex relaxation techniques (e.g., BP, LARS) [31,32] have been proposed in the past decade. Among these, the convex relaxation techniques replace the non-convex l_0-norm with the l_1-norm to make the optimization problem convex [33].

In the sparsity-based model, the dictionary generally falls into two categories. The first is the analytic dictionary, including the Wavelet, Contourlet and Curvelet dictionaries, etc. [34–36]. This type of dictionary usually has simple and fast algorithms, but sometimes the atomic forms are not rich enough to offer an optimal representation of complex image structures. The second is the trained dictionary, which is learned from a training set. This type of dictionary (e.g., MOD, K-SVD or ODL) [28,37,38] can better express the image structure in most cases, leading to state-of-the-art results in many practical image processing applications. In this work, the ODL [38] algorithm is used to train the dictionary.

The ODL technique considers a finite training set of signals X = [x_1, ..., x_n] ∈ R^{w×n} and optimizes the empirical cost function

\min_{D,S}\ \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{2}\|x_i - Ds_i\|_2^2 + \lambda\|s_i\|_1\right)    (2)

where D is the dictionary mentioned above and S = [s_1, ..., s_n] ∈ R^{K×n} are the corresponding sparse decomposition coefficients. The optimization problem (2) is not jointly convex, but it is convex with respect to each of the two variables D and s_i when the other is fixed. The main procedures of the algorithm are initialization, sparse coding and dictionary update, as follows.

First, initialization: initialize the dictionary D_0 ∈ R^{w×K} and reset the "past" information A_0 ∈ R^{K×K} ← 0, B_0 ∈ R^{w×K} ← 0.

Second, sparse coding: draw one element x_t at iteration t; as shown in (3), the sparse coefficient s_t of x_t over the dictionary D_{t−1} obtained at the previous iteration is computed by the LARS [32] algorithm. Here t is the iteration index, t ≤ T, and T is the total number of iterations.

s_t \triangleq \arg\min_{s}\ \frac{1}{2}\|x_t - D_{t-1}s\|_2^2 + \lambda\|s\|_1    (3)

Third, dictionary update: update the "past" information A_t and B_t as in Eq. (4), and obtain the new dictionary D_t by solving Eq. (5).

A_t \leftarrow A_{t-1} + s_t s_t^T, \quad B_t \leftarrow B_{t-1} + x_t s_t^T    (4)

D_t \triangleq \arg\min_{D}\ \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|x_i - Ds_i\|_2^2 + \lambda\|s_i\|_1\right)    (5)

u_j \leftarrow \frac{1}{A_{jj}}\left(b_j - Da_j\right) + d_j, \quad d_j \leftarrow \frac{1}{\max(\|u_j\|_2, 1)}\,u_j    (6)

Concretely, the j-th (1 ≤ j ≤ K) column d_j of the input dictionary D is updated by Eq. (6), where A = [a_1, ..., a_K] ∈ R^{K×K} and B = [b_1, ..., b_K] ∈ R^{w×K}. Each column is updated sequentially until convergence, and then the updated dictionary D_t is returned. One advantage of this dictionary update process is that it is parameter-free. Finally, the learned dictionary D_T is obtained when the iterations are over.

The ODL algorithm only needs to access one element or a small subset of the training set at each iteration, so it uses little memory and its computational efficiency is high. Experiments in [38] demonstrate that the algorithm is faster than, and outperforms, its batch alternatives for a desired precision, that it does not require the careful learning-rate tuning of regular stochastic gradient descent methods, and that its convergence is guaranteed [38].
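For readers who want to experiment with this step, the sketch below (not the authors' code) shows dictionary training and l1 sparse coding with scikit-learn, whose MiniBatchDictionaryLearning follows an online algorithm in the spirit of ODL [38]; the patch dimension w = 64, dictionary size K = 256 and λ = 0.15 mirror the settings reported later in Section 4.1, and the random training matrix is only a stand-in for real patch vectors.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))      # stand-in training set: one 8x8 patch per row
X -= X.mean(axis=1, keepdims=True)       # remove the mean of each patch vector

# Online dictionary learning in the spirit of Eqs. (2)-(6);
# the iteration budget is left at the library defaults.
odl = MiniBatchDictionaryLearning(n_components=256, alpha=0.15,
                                  fit_algorithm='lars', batch_size=64,
                                  random_state=0)
D = odl.fit(X).components_               # learned dictionary, shape (256, 64)

# l1-regularized sparse coding over the learned dictionary, cf. Eq. (3)
codes = sparse_encode(X[:10], D, algorithm='lasso_lars', alpha=0.15)
approx = codes @ D                       # x ~ D s for each coded patch
```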

3. Proposed method

At present, a single dictionary is usually exploited to represent the image patches in SR-based fusion, which cannot reflect the differences among various types of image patches [20,21]. As the local features of different classes of patches differ, a single dictionary may not represent the underlying image patch well. Motivated by the promising results of clustering-based sparse coding in preserving texture, edges and details in image de-blurring and super-resolution [20,22,23], and building on the fast and efficient ODL algorithm, a novel fusion method for medical images based on sparse representation of classified image patches is proposed, which aims to provide a fast and efficient algorithm while preserving the image structure better.

3.1. Overview

In this section, the medical image fusion algorithm based on sparse representation of classified image patches is presented in detail. The diagram of the proposed method is shown in Fig. 1. In this work, we assume that Single Photon Emission Computed Tomography (SPECT) images are displayed in pseudo-color, and we treat them as color images. Suppose that there are two pre-registered source images I1c, I2 ∈ R^{M×N}, where I1c denotes the color image and I2 the gray image.


Fig. 1. Overall framework of the proposed method.

Fig. 2. Source images used in our experiments. 12 pairs of multimodal medical images: (a) 4 pairs of MRI and SPECT images; left: MRI; right: SPECT. (b) 4 pairs of CT and MRI images; left: CT; right: MRI. (c) 4 pairs of CT and PET images; left: CT; right: PET.

As shown in Fig. 1, first, the intensity-hue-saturation (IHS) transform [1,39] is employed to convert the color source image I1c from the RGB to the IHS color space; then the intensity component I1 (which also represents an input gray image) joins the gray-level image fusion process; finally, the intensity component is substituted by the fused image IF, and the final color fusion image IFc is generated by the inverse IHS transform. Note that the IHS transform and its inverse are not required when both source images are gray images (e.g., CT and MRI). The gray-level image fusion process is shown in the red dashed rectangular box in Fig. 1. In this process, first, the sliding-window technique is used to divide each source image into patches; the patches are then classified according to their geometrical directions, and all classified patches are converted into the corresponding lexicographic-ordering vectors, which constitute the corresponding matrices Vj (j = 1, 2, ..., L). Second, multi-class sub-dictionaries are learned by the ODL algorithm [38], and for the corresponding column vectors in the patch sets of the source images, their sparse representations are calculated using the LARS method [32]. Third, the corresponding columns of the sparse coefficient matrices S1, S2 of the source images are fused to generate SF according to their ALM, which yields the vector representation VF of the fused image. Last, the fused image IF is reconstructed from VF.
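As an illustration of the color handling in Fig. 1, the sketch below (a minimal example, not the authors' code) uses one common linear IHS-style transform [1,39] to split the pseudo-color image into an intensity component and two chrominance components, replaces the intensity with the gray-level fusion result, and inverts the transform; the maximum rule passed in at the end is only a placeholder for the CSR fusion of Sections 3.2–3.5.

```python
import numpy as np

# Forward linear IHS-like transform (intensity + two chrominance channels).
M = np.array([[1.0 / 3.0,        1.0 / 3.0,       1.0 / 3.0],
              [-np.sqrt(2) / 6, -np.sqrt(2) / 6,  np.sqrt(2) / 3],
              [1 / np.sqrt(2),  -1 / np.sqrt(2),  0.0]])
M_inv = np.linalg.inv(M)                      # exact inverse of the forward matrix

def fuse_color_with_gray(rgb, gray, fuse_intensity):
    """rgb: (H, W, 3) image in [0, 1]; gray: (H, W) image; fuse_intensity: callable."""
    ihs = rgb @ M.T                                    # per-pixel forward transform
    ihs[..., 0] = fuse_intensity(ihs[..., 0], gray)    # substitute the intensity channel
    return np.clip(ihs @ M_inv.T, 0.0, 1.0)            # inverse transform back to RGB

# Placeholder intensity-fusion rule; the CSR scheme of Sections 3.2-3.5 goes here.
rgb, gray = np.random.rand(64, 64, 3), np.random.rand(64, 64)
fused_rgb = fuse_color_with_gray(rgb, gray, lambda a, b: np.maximum(a, b))
```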

3.2. Patch classification

Using the sliding-window technique with a step length of 1, the source images are divided into patches of size √w × √w (w = 64); this reduces the computational complexity of sparse representation and reconstruction. The patches are then classified into several categories based on the dominant gradient direction of each patch. Concretely, assume that the ith patches in source images I1 and I2 are p1i and p2i, respectively, where 1 ≤ i ≤ Q and Q = (M − √w + 1)(N − √w + 1). The human eye is more sensitive to the high-frequency details of an image, and the local variance of an image can be used to describe these details, so the patch with the larger variance of p1i and p2i is picked out to join the classification. Different medical images provide complementary information; for example, CT images depict bone very clearly but have relatively low contrast for soft tissue, while MRI images better show soft tissue and the relevant vessels. Choosing the patch with the larger variance therefore means choosing the patch that is more representative of the salient features of the image pair, so the classification result will be more accurate.
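A minimal sketch of the patch handling just described (assumptions: NumPy's sliding_window_view and raster ordering; this is not the authors' implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patch_vectors(img, w=8):
    """All w x w sliding-window patches (step 1) of img as rows of a (Q, w*w) matrix."""
    windows = sliding_window_view(img.astype(float), (w, w))   # (M-w+1, N-w+1, w, w)
    return windows.reshape(-1, w * w)

# The patch with the larger variance in each corresponding pair joins the classification.
V1 = patch_vectors(np.random.rand(64, 64))
V2 = patch_vectors(np.random.rand(64, 64))
use_first = V1.var(axis=1) >= V2.var(axis=1)   # boolean selector per patch position
```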


Fig. 3. Example of MRI and SPECT image fusion. Close-up views from (a) to (h) are sequentially shown at the bottom of the fusion results for better visualization and comparison. (a) MRI image; (b) SPECT image; (c) DWT; (d) NSCT; (e) SR128; (f) SR256; (g) CSR128; (h) CSR256.

Fig. 4. Example of CT and MRI image fusion. Magnified regions extracted from (a) to (h) are sequentially shown in the bottom of the fusion results for better visualization and comparison. (a) CT image; (b) MRI image; (c) DWT; (d) NSCT; (e) SR128; (f) SR256; (g) CSR128; (h) CSR256.

A Sobel edge operator is then applied to obtain the gradient magnitude g_p(x, y) and orientation α_p(x, y) for each pixel p(x, y) in the selected patch:

g_p(x, y) = \sqrt{s_p^x(x, y)^2 + s_p^y(x, y)^2}    (7)

\alpha_p(x, y) = \tan^{-1}\left(\frac{s_p^y(x, y)}{s_p^x(x, y)}\right)    (8)


Fig. 5. Example of CT and PET image fusion. (a) CT image; (b) PET image; (c) DWT; (d) NSCT; (e) SR128; (f) SR256; (g) CSR128; (h) CSR256.

where s_p^x and s_p^y are the results of convolving the patch with the 3 × 3 horizontal and vertical Sobel templates. Similar to the orientation assignment in the Histogram of Oriented Gradients (HOG) descriptor [40], a gradient orientation histogram is then built for each patch. Suppose that there are L orientation bins {θ_1, θ_2, ..., θ_L} evenly spaced over 0°–180° (unsigned gradient). Let G_i(θ_j) (j = 1, 2, ..., L) denote the accumulated gradient magnitude in the jth orientation bin for the ith selected patch p_i, and let J_i be the class of patch p_i. The classification rule is

J_i = \begin{cases} 0, & \text{if } \dfrac{G_i^{\max}}{\sum_{j=1}^{L} G_i(\theta_j)} < Th \\[4pt] J, & \text{otherwise} \end{cases}    (9)

where G_i^{\max} = \max_j\{G_i(\theta_j)\} is the peak of the histogram of patch p_i, J = \arg\max_j\{G_i(\theta_j)\} is the index of G_i^{\max}, and the pre-classification threshold Th (0 < Th < 1) is used to distinguish irregular patches from orientation patches; the optimal threshold should be determined by experiment. J_i = 0 means that the patch is an irregular patch that does not belong to any orientation class; otherwise, the patch belongs to category J.

An image patch of size √w × √w corresponds to a lexicographic-ordering vector of size w × 1 [9]. The classified patches are therefore ordered lexicographically as vectors, and the corresponding column vector set V = {V_j | j = 0, 1, 2, ..., L} is obtained.
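The sketch below (not the authors' code) illustrates this classification step: Sobel gradients, an L-bin unsigned-orientation histogram weighted by the gradient magnitude, and the threshold rule of Eq. (9); the exact Sobel normalization and bin boundaries are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def classify_patch(patch, n_bins=6, th=0.25):
    """Return 0 for an irregular patch, or 1..n_bins for the dominant orientation class."""
    gx = convolve(patch.astype(float), SOBEL_X)
    gy = convolve(patch.astype(float), SOBEL_Y)
    mag = np.hypot(gx, gy)                            # Eq. (7)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0      # Eq. (8), unsigned gradient
    bins = np.minimum((ang * n_bins / 180.0).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    total = hist.sum()
    if total == 0 or hist.max() / total < th:         # Eq. (9): irregular patch
        return 0
    return int(hist.argmax()) + 1                     # orientation class J

# The patch with the larger variance of each corresponding pair is the one classified.
p1, p2 = np.random.rand(8, 8), np.random.rand(8, 8)
label = classify_patch(p1 if p1.var() >= p2.var() else p2)
```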

3.3. Dictionary learning and sparse coding

L + 1 corresponding sub-dictionaries D = {D_j | j = 0, 1, 2, ..., L} are learned from V with the ODL algorithm [38]. Assume that the column vectors corresponding to the ith patches p_{1i}, p_{2i} in source images I1 and I2 are v_{1i}, v_{2i}, respectively, and that the patches belong to category J_i. The sparse coefficient vectors s_{1i}, s_{2i} of v_{1i}, v_{2i} are then calculated with the LARS algorithm [32] over the adaptively selected sub-dictionary D_{J_i} ∈ R^{w×L_0} (w < L_0); these quantities satisfy (10) and (11).

v_{1i} = D_{J_i} s_{1i} = [d_1, ..., d_{L_0}][s_{1i}(1), ..., s_{1i}(L_0)]^T = \sum_{t=1}^{L_0} s_{1i}(t)\, d_t    (10)

v_{2i} = D_{J_i} s_{2i} = [d_1, ..., d_{L_0}][s_{2i}(1), ..., s_{2i}(L_0)]^T = \sum_{t=1}^{L_0} s_{2i}(t)\, d_t    (11)
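A compact sketch of this step under stated assumptions (the helper names are hypothetical, and a class is skipped here when it has fewer training vectors than atoms, a detail the paper does not discuss):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

def train_sub_dictionaries(V_by_class, n_atoms=256, lam=0.15):
    """V_by_class: {class j: array (n_j, w)} -> {class j: dictionary of shape (n_atoms, w)}."""
    dicts = {}
    for j, Vj in V_by_class.items():
        if len(Vj) >= n_atoms:                         # enough vectors to train class j
            odl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=lam,
                                              fit_algorithm='lars', random_state=0)
            dicts[j] = odl.fit(Vj).components_
    return dicts

def code_pair(v1, v2, class_label, dicts, lam=0.15):
    """Sparse codes s1i, s2i of two corresponding patch vectors over D_{J_i}, Eqs. (10)-(11)."""
    D = dicts[class_label]
    s = sparse_encode(np.vstack([v1, v2]), D, algorithm='lasso_lars', alpha=lam)
    return s[0], s[1]
```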

3.4. Fusion

3.4.1. Activity level measurement

Image fusion aims to combine the salient features of the source images into a fused image for visual perception or computer analysis. In this paper, the l_1-norm of the sparse coefficients of each patch is used as the ALM to measure the patch's saliency. The intuition behind this choice comes from [41], where the l_1-norm of an image patch in PCA coordinates is defined as its pattern distinctness. Similarly, we use the l_1-norm of the sparse coefficients as the patch's pattern distinctness, which is considered the salient feature and represents the structural features of the image patch.

3.4.2. Fusion rule

Let A_{1i}, A_{2i} be the ALMs of p_{1i}, p_{2i}; then A_{1i} = \|s_{1i}\|_1 and A_{2i} = \|s_{2i}\|_1. The corresponding fused sparse coefficient vector s_{Fi} is obtained by the "choose-max" fusion rule, which means that the patch with the maximum pattern distinctness takes part in the fusion, so the fused image is expected to preserve the structural information well.

s_{Fi} = s_{k^*i}, \quad k^* = \arg\max_{k}\,(A_{ki} \mid k = 1, 2)    (12)

The ith column vector v_i of the fused image can then be obtained by

v_i = D_{J_i} s_{Fi}    (13)

Repeating steps (10)–(13) for all Q pairs of image patches, the vector representation V_F of the fused image I_F is obtained as

V_F = \{v_i \mid i = 1, 2, ..., Q\}    (14)
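A minimal sketch of the fusion rule of Eqs. (12)–(13) (not the authors' code):

```python
import numpy as np

def fuse_codes(s1, s2):
    """'Choose-max' rule: keep the code with the larger l1-norm activity level (Eq. (12))."""
    return s1 if np.abs(s1).sum() >= np.abs(s2).sum() else s2

def fused_vector(s1, s2, D):
    """v_i = D_{J_i} s_{F_i} (Eq. (13)); D has shape (n_atoms, w), codes have shape (n_atoms,)."""
    return fuse_codes(s1, s2) @ D
```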


Fig. 6. The source images and fusion results obtained by different methods. (a) Source Image 1; (b) Source Image 2; the rest of the columns show the fused images produced by each method: (c) DWT; (d) NSCT; (e) SR128; (f) SR256; (g) CSR128; (h) CSR256.

3.5. Reconstruction

Each vector v_i is reshaped into a √w × √w patch and placed back at its original position. As the patches overlap during this process, a simple averaging operation is applied to all overlapping patches to form the fused image I_F.
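A short sketch of this overlap-averaging reconstruction (assuming 8 × 8 patches and a 1-pixel step, as in Section 4.1):

```python
import numpy as np

def reconstruct(fused_vectors, image_shape, patch=8):
    """fused_vectors: iterable of (row, col, v) with v of length patch*patch."""
    acc = np.zeros(image_shape, dtype=float)
    cnt = np.zeros(image_shape, dtype=float)
    for r, c, v in fused_vectors:
        acc[r:r + patch, c:c + patch] += v.reshape(patch, patch)
        cnt[r:r + patch, c:c + patch] += 1.0
    return acc / np.maximum(cnt, 1.0)      # simple averaging of overlapping patches
```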


Table 1
Objective assessment of different methods. Best results are labeled in bold.

Metric   DWT       NSCT      SR128     SR256     CSR128    CSR256
SF       19.1385   19.6258   19.3479   19.1545   20.3265   20.2888
SD       55.9754   62.2694   72.8648   71.474    74.158    74.2958
MI       3.2147    4.1694    4.3917    4.53      5.0948    5.186
QAB/F    0.4767    0.5388    0.483     0.4995    0.5567    0.556
MSSIM    0.61      0.5711    0.5845    0.5891    0.5874    0.5874
Q        0.4695    0.4844    0.4916    0.4955    0.5056    0.5062

Table 2
Objective assessment of different methods. Best results are labeled in bold.

Metric   DWT       NSCT      SR128     SR256     CSR128    CSR256
SF       27.9905   28.9909   23.9541   26.9998   29.7864   30.0229
SD       64.4489   74.0058   73.6083   78.1735   84.8276   88.3809
MI       2.5557    2.8367    3.2388    3.2452    3.4508    3.6075
QAB/F    0.4142    0.4905    0.4633    0.4771    0.5442    0.5522
MSSIM    0.7043    0.5732    0.7024    0.7099    0.7317    0.7371
Q        0.5855    0.3406    0.5791    0.5907    0.6666    0.6734

Table 3
Objective assessment of different methods. Best results are labeled in bold.

Metric   DWT       NSCT      SR128     SR256     CSR128    CSR256
SF       21.063    22.6398   17.7967   19.697    22.6259   22.7375
SD       36.6837   45.7151   40.408    43.0768   49.5475   49.9269
MI       3.0137    3.0865    3.3816    3.4247    4.1763    4.2844
QAB/F    0.6086    0.7382    0.6819    0.7013    0.7648    0.7648
MSSIM    0.7953    0.7783    0.8095    0.8107    0.8098    0.8101
Q        0.6891    0.5692    0.7067    0.7069    0.7493    0.7507

4. Experiments

In this section, experiments on 12 registered medical image pairs from different modalities are presented to evaluate the effectiveness of the algorithm. As shown in Fig. 2, they include 4 pairs of SPECT and MRI brain images, 4 pairs of MRI and CT brain images, and 4 pairs of lung PET and CT images. The brain imaging data of size 256 × 256 pixels used in this paper are from Harvard Medical School (http://www.med.harvard.edu/aanlib/home.html#userconsent#). The whole brain atlas of Harvard Medical School was created by Keith A. Johnson and J. Alex Becker, mainly based on brain MRI imaging, combined with CT and SPECT imaging. It provides images of the normal brain and of some common diseases, such as cerebrovascular disease, brain tumors, degenerative diseases and central infectious diseases. Almost all of the images are indexed images, and they can be treated as color images or converted to other image types as needed. An advantage of this database is that any slice can be captured, so the characteristics of each layer can be inspected; it also includes clinical case descriptions, so it is a good choice for clinical neuroscience workers. The pulmonary imaging data of size 168 × 168 pixels are provided by the cancer diagnosis and treatment center of the Normandy region of France, namely the comprehensive cancer Center Henri Becquerel (CHB) in Rouen.

4.1. Experiment setups

The proposed CSR-based method is compared with two representative multi-scale transform methods, namely the DWT [10] and NSCT-based [11] fusion algorithms, and with the recent fusion method based on sparse representation (SR) [14]. The parameters of each algorithm are set as follows. For the DWT-based fusion method, the wavelet basis "sym4" is applied, the decomposition level is fixed to 3, and the averaging and maximum-coefficient fusion rules are used for the low- and high-frequency sub-bands, respectively. For the NSCT-based fusion method, the 'pyrexc' filter is used as the pyramid filter and the 'vk' filter as the directional filter, with decomposition levels [0, 1, 3, 4, 4]; the PCNN parameters are set as W = [1/√2, 1, 1/√2; 1, 0, 1; 1/√2, 1, 1/√2], αL = 0.06931, αθ = 0.2, VL = 1, Vθ = 20, β = 3, and a maximum iteration number of n = 200. For the SR and CSR-based fusion methods, two dictionary sizes, 128 and 256, are tested, corresponding to the SR128, SR256, CSR128 and CSR256 methods. The SR-based method uses the fixed over-complete redundant DCT dictionary and estimates the sparse coefficients by OMP. For the CSR-based method, the number of patch geometrical directions is fixed to 6 (i.e., L = 6), so there are 7 sub-dictionaries; the threshold Th for pre-classification is set to 0.25; the method trains the over-complete dictionaries by ODL and calculates the sparse coefficients by LARS; the number of iterations is set to 50; and the regularization parameter λ = 1.2/√w = 0.15 is used in all of our experiments. For a fair comparison, parameters with similar meanings are set to the same values in the sparse-representation-based fusion schemes: the source images are divided into image patches of size 8 × 8, which has been shown to be an appropriate setting for sparse-representation-based image processing applications [14], and the step size of the sliding-window technique is 1 pixel. The other main parameters of the compared fusion methods are the recommended values reported in the respective publications. All the experiments are carried out in Matlab R2015b on an Intel(R) 3.5 GHz PC with 4 GB RAM.

4.2. Objective evaluation metrics

As an important research branch in the field of image fusion, fusion quality assessment is of great significance for the selection of fusion algorithms and the optimization of parameters. Since a reference image (or ground truth) is not always available in practical applications, blind or non-referenced assessment is generally preferred.


Table 4
The average objective assessment of the different methods. Best results are labeled in bold. The number in parentheses is the number of image pairs on which the method ranked first.

Metric   DWT          NSCT         SR128        SR256        CSR128       CSR256
SF       23.2530(1)   24.3951(0)   20.4915(0)   22.0979(0)   24.8291(3)   24.8686(8)
SD       51.1395(0)   61.2032(0)   58.2094(0)   62.1828(0)   67.5031(0)   68.7610(12)
MI       2.8150(0)    3.1965(0)    3.5460(0)    3.5838(0)    4.0892(0)    4.1962(12)
QAB/F    0.5115(0)    0.6041(0)    0.5563(0)    0.5773(0)    0.6354(6)    0.6362(6)
MSSIM    0.7222(1)    0.6787(0)    0.7228(1)    0.7270(2)    0.7316(3)    0.7335(5)
Q        0.5976(0)    0.4748(0)    0.5967(0)    0.6028(0)    0.6484(1)    0.6522(11)

Table 5
Objective assessment when L = 1.

L   SF        SD        MI       QAB/F    MSSIM    Q
1   30.2786   88.3582   3.6782   0.5519   0.7342   0.6732

Table 6
Quantitative comparison between different orientation numbers and thresholds. Best result for each parameter is indicated in red bold, and better group results are labeled in bold. (For interpretation of the references to colour in this table legend, the reader is referred to the web version of this article.)

Although a number of objective evaluation metrics have been proposed [42], the objective evaluation standard is still not unified. Generally, researchers choose the evaluation indices of fusion quality according to experience, and multiple indices are usually selected for a comprehensive evaluation. In this work, six metrics are selected to evaluate the fusion quality. Let A and B be two M × N source images and F the fused image; the six metrics are introduced as follows.

(1) Spatial frequency (SF). The SF of the fused image is defined as

SF = \sqrt{RF^2 + CF^2}    (15)

RF = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=2}^{N}\left[F(i,j) - F(i,j-1)\right]^2}    (16)

CF = \sqrt{\frac{1}{MN}\sum_{j=1}^{N}\sum_{i=2}^{M}\left[F(i,j) - F(i-1,j)\right]^2}    (17)

Spatial frequency reflects the rate of change of the image intensity and is used to measure the clarity of the image.

(2) Standard deviation (SD). The SD of the fused image is defined as

SD = \sqrt{\frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(F(i,j) - \bar{F}\right)^2}{M \times N}}    (18)

where \bar{F} is the mean value of the fused image. SD can be used to measure the richness of image information and usually reflects the contrast of the fused image.
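A minimal sketch of these two statistics-based metrics (Eqs. (15)–(18)):

```python
import numpy as np

def spatial_frequency(F):
    F = F.astype(float)
    rf = np.sqrt(np.sum(np.diff(F, axis=1) ** 2) / F.size)   # row frequency, Eq. (16)
    cf = np.sqrt(np.sum(np.diff(F, axis=0) ** 2) / F.size)   # column frequency, Eq. (17)
    return np.sqrt(rf ** 2 + cf ** 2)                        # Eq. (15)

def standard_deviation(F):
    F = F.astype(float)
    return np.sqrt(np.mean((F - F.mean()) ** 2))             # Eq. (18)
```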


(3) Mutual information (MI) [43]. The definition of MI is

MI = MI_{FA}(f,a) + MI_{FB}(f,b)    (19)

MI_{FA}(f,a) = \sum_{f,a} p_{FA}(f,a)\log_2\frac{p_{FA}(f,a)}{p_A(a)\,p_F(f)}    (20)

MI_{FB}(f,b) = \sum_{f,b} p_{FB}(f,b)\log_2\frac{p_{FB}(f,b)}{p_B(b)\,p_F(f)}    (21)

where p_A(a), p_B(b) and p_{FA}(f,a), p_{FB}(f,b) represent the marginal and joint probability distributions, respectively. MI_{FA}(f,a) and MI_{FB}(f,b) represent the amount of information that F contains about A and B, respectively, so MI reflects the total amount of information that the fused image F contains about A and B.
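A minimal sketch of the MI metric (Eqs. (19)–(21)), using 256-bin joint histograms; the bin count is an assumption for 8-bit images:

```python
import numpy as np

def mutual_information(src, fused, bins=256):
    joint, _, _ = np.histogram2d(fused.ravel(), src.ravel(), bins=bins)
    p_joint = joint / joint.sum()
    p_f = p_joint.sum(axis=1, keepdims=True)     # marginal of the fused image
    p_s = p_joint.sum(axis=0, keepdims=True)     # marginal of the source image
    nz = p_joint > 0
    return np.sum(p_joint[nz] * np.log2(p_joint[nz] / (p_f @ p_s)[nz]))

def mi_metric(A, B, F):
    return mutual_information(A, F) + mutual_information(B, F)   # Eq. (19)
```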

(4) Xydeas's fusion metric Q^{AB/F} [44]. This is an edge-information-based objective evaluation index used to quantify how well the fused image retains the edges of the source images. It is calculated by

Q^{AB/F} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left[Q^{AF}(i,j)\,w^{A}(i,j) + Q^{BF}(i,j)\,w^{B}(i,j)\right]}{\sum_{i=1}^{M}\sum_{j=1}^{N}\left[w^{A}(i,j) + w^{B}(i,j)\right]}    (22)

Q^{xF}(i,j) = Q_{g}^{xF}(i,j)\,Q_{\alpha}^{xF}(i,j), \quad w^{x}(i,j) = \left[g^{x}(i,j)\right]^{L}, \quad x = A, B    (23)

where Q^{AF}(i,j) and Q^{BF}(i,j) represent the edge information preservation values, while w^{A}(i,j) and w^{B}(i,j) are their weighting coefficients, respectively. Q_{g}^{xF}(i,j) and Q_{\alpha}^{xF}(i,j) denote the edge strength and orientation preservation values of a pixel p(i,j) in x that is transferred into the fused image F; L is a constant, and g^{x}(i,j) is the edge strength at each pixel p(i,j) in x.

(5) Mean structural similarity (MSSIM). The index MSSIM is the average value of SSIM [45], a full-reference image quality metric that combines comparisons of luminance, contrast and structure. MSSIM is calculated by

MSSIM = 0.5\,\big(SSIM(A,F) + SSIM(B,F)\big)    (24)

SSIM(x,F) = \frac{(2\mu_x\mu_F + C_1)(2\sigma_{xF} + C_2)}{(\mu_x^2 + \mu_F^2 + C_1)(\sigma_x^2 + \sigma_F^2 + C_2)}, \quad x = A, B    (25)

where SSIM(x,F) is the structural similarity index between signals x and F, and x represents source image A or B.

(6) Piella's fusion metric Q [46]. This index, built on the Wang–Bovik image quality index Q_0 [47], is defined as

Q = Q(A,B,F) = \frac{1}{|W|}\sum_{w\in W}\big(\lambda(w)\,Q_0(A,F|w) + (1-\lambda(w))\,Q_0(B,F|w)\big)    (26)

Q_0(x,F) = \frac{4\sigma_{xF}\,\mu_x\mu_F}{(\mu_x^2 + \mu_F^2)(\sigma_x^2 + \sigma_F^2)}, \quad x = A, B    (27)

\lambda(w) = \frac{s(A|w)}{s(A|w) + s(B|w)}    (28)

where s(A|w) and s(B|w) are the local saliencies of the two input images A and B, and λ(w) is a local weight indicating the relative importance of image A compared to B. Here the local variances of images A and B are taken as s(A|w) and s(B|w), respectively.
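A sketch of the two human-visual-system-based metrics above, assuming scikit-image's default SSIM settings for Eq. (25) and non-overlapping 8 × 8 windows for Eqs. (26)–(28) (the original metric uses sliding windows):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mssim(A, B, F, data_range=255):
    """Eq. (24); pass data_range=1.0 for float images in [0, 1]."""
    return 0.5 * (ssim(A, F, data_range=data_range) + ssim(B, F, data_range=data_range))

def q0(x, f):
    """Wang-Bovik index of Eq. (27) on one window."""
    mx, mf = x.mean(), f.mean()
    vx, vf = x.var(), f.var()
    cov = ((x - mx) * (f - mf)).mean()
    den = (mx ** 2 + mf ** 2) * (vx + vf)
    return 4 * cov * mx * mf / den if den > 0 else 1.0   # guard for flat windows

def piella_q(A, B, F, win=8):
    vals = []
    for r in range(0, A.shape[0] - win + 1, win):
        for c in range(0, A.shape[1] - win + 1, win):
            a, b, f = (im[r:r + win, c:c + win].astype(float) for im in (A, B, F))
            sa, sb = a.var(), b.var()
            lam = sa / (sa + sb) if sa + sb > 0 else 0.5   # Eq. (28)
            vals.append(lam * q0(a, f) + (1 - lam) * q0(b, f))
    return float(np.mean(vals))                            # Eq. (26)
```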

According to Liu et al. [42], fusion metrics can be classified into four categories. Among the above indices, SD and SF are statistics-based metrics, MI is an information-based metric, and QAB/F, MSSIM and Q are human-visual-system-based metrics. For each metric, the larger the value, the better the fusion result. All the parameter settings of these quality metrics are the default values reported in the related publications.

4.3. Experimental results and discussion

4.3.1. Overall comparison

In this subsection, the fusion results of the 12 groups of multimodal medical images are shown in Fig. 6, and the average objective assessments of the different methods over all image pairs are listed in Table 4. Three representative examples of the fusion results are shown in detail in Figs. 3–5. In these figures, subfigures (a) and (b) are the two source images, and the fused results obtained by the DWT, NSCT, SR and CSR fusion methods are shown in subfigures (c)–(h), respectively. The objective assessments of the fusion methods are listed in the corresponding tables, in which a value in bold indicates the best result over all methods on the corresponding fusion metric.

A fusion example for an MRI/SPECT image pair is shown in Fig. 3. As shown at the bottom of the figure, two magnified regions are extracted as close-up views for better comparison. It can be seen that the DWT-based method suffers from serious luminance and color distortions, so that the focal point is not obvious in the fused image. The other methods achieve better color retention than the DWT-based method, but structural distortion still appears in some regions (see the magnified regions of Fig. 3(d)–(f)), where some texture details are smoothed, whereas the proposed method preserves them well (see the magnified regions of Fig. 3(g)–(h)).

A fusion example for a CT/MRI image pair is shown in Fig. 4. As seen from the results, the CT contour information is not well integrated into the fused image by the DWT and NSCT-based methods, while both the SR and CSR-based methods perform well in keeping the integrity of the structure. However, the enlarged details show that the proposed CSR-based method yields higher luminance contrast and clearer structure information than the SR-based method (see the magnified regions of Fig. 4(e)–(h)).

Fig. 5 shows a fusion example for a CT/PET image pair. It can be seen that the salient information in the PET image is not well fused into the fused image by the DWT and NSCT-based methods, and their results suffer from artifacts, while the SR and CSR-based methods obtain better visual quality. Compared with the SR-based method, the proposed CSR-based method shows smaller luminance and contrast distortions (see the magnified regions of Fig. 5(e)–(h)).


Fig. 7. Fusion results under different parameters L and Th. (a) L = 1; (b) (L,Th) = (2,0.6); (c) (L,Th) = (5,0.5); (d) (L,Th) = (5,0.6); (e) (L,Th) = (6,0.5); (f) (L,Th) = (6,0.7).

Fig. 8. The trained sub-dictionaries with (L, Th) = (6,0.5).

In addition to the subjective assessment, the six objective evaluation indexes are employed to compare the results of the different methods. From Tables 1–3, it can be clearly seen that the proposed CSR-based method outperforms the other methods on five metrics (see Table 1) or on all six metrics (see Tables 2 and 3); that is, the proposed method beats the other methods on the six metrics in most cases.

All the experimental results are shown in Fig. 6, and some common regularities can be found. As shown in the figure, the DWT and NSCT-based methods generally produce low contrast and structural distortion, and the DWT-based method also suffers from serious brightness and color distortion. The SR and CSR-based methods obtain better results in color, brightness and contrast, but a loss of fine details is present in the SR-based examples (see the third, sixth and eighth groups of experiments in Fig. 6), whereas the proposed method preserves edges, texture and details well. The average objective evaluation indexes are listed in Table 4, in which the best result is shown in bold and the number in parentheses gives the number of image pairs on which the corresponding method ranked first. From Table 4, it can be clearly seen that the proposed method beats the compared methods on all six metrics, which supports the validity of the proposed method in many respects. Furthermore, for each metric in Table 4, the number in parentheses for the proposed method is greater than those of the other methods, which reflects the reliability and robustness of the proposed method to some extent.

Considering both the subjective visual effect and the objective evaluation indexes, the proposed method clearly outperforms the other algorithms compared in this paper; generally, the performance order is CSR > SR > NSCT > DWT, where the symbol '>' denotes 'better than'. The reason is that the DWT has limited directionality and cannot optimally represent line or plane singularities, so its overall performance is the worst. The NSCT is multi-scale, multi-directional and shift-invariant, so it performs better than the DWT-based method; however, the NSCT belongs to the non-adaptive multi-scale geometric analysis tools, and its basis functions are independent of the image content, so the transform cannot represent the image features sparsely enough, which leads to poor nonlinear approximation ability. Sparse representation based on dictionary learning produces an over-complete redundant dictionary that can effectively extract the geometric structure of the source images, so the SR-based and CSR-based methods perform better than the DWT and NSCT-based methods. Moreover, since the SR-based method uses a fixed DCT dictionary as the over-complete dictionary, whereas the CSR-based method uses adaptive dictionary learning, which is more effective in image feature extraction, the proposed CSR-based method performs better than the SR-based method in both subjective and objective respects.

4.3.2. Investigation of the number of orientation bins

Considering the differences between different types of patches, the CSR-based method trains multi-class dictionaries, one for each class. In this subsection, to further evaluate the effect of the number of orientation bins (denoted the orientation number) on the CSR-based fusion method, experiments are carried out with different orientation numbers. Here, the dictionary size is set to 256, the orientation number L is varied from 1 to 6, the pre-classification threshold Th is varied from 0.3 to 0.7, and the other parameters are set as described in Section 4.1. An example based on the 5th pair of source images in Fig. 6 is given to reveal some regularities of the experiment. The objective evaluation indexes are listed in Tables 5 and 6, and some visual results are shown in Fig. 7.

As shown in Table 6, compared with the case L = 1, five groups of experiments exceed its results: (L, Th) = (2, 0.6), (5, 0.6), (5, 0.7), (6, 0.6) and (6, 0.7), where the notation (L, Th) represents the combination of the values L and Th. The best results for each parameter (in red bold) are concentrated in the cases (L, Th) = (5, 0.5), (5, 0.6) and (6, 0.7). From a theoretical point of view, the larger the orientation number, the finer the patch classification, so the trained sub-dictionaries are more targeted and the fusion results should be better. However, because the number of distinct orientations in an image is limited, the orientation number L needs to be a compromise.

In addition to the objective evaluation, some visual fusion results with higher objective indexes are compared subjectively. As shown in Fig. 7, these preliminary experiments demonstrate that the CSR-based method with L > 1 performs slightly better than with L = 1, as the overall contours and texture details in these fused images are maintained better. Furthermore, to balance the subjective and objective results, the choice of the parameters L and Th is usually a trade-off. The combination L = 6, Th = 0.5 gives the best comprehensive results, so


(L, Th) = (6, 0.5) may be selected as the optimal parameter setting. Meanwhile, it can be seen from Fig. 8 that the trained sub-dictionaries have obvious directionality, which not only shows that using the gradient direction as the patch-classification criterion is valid, but also ensures a more accurate sparse representation of the directional patches, and this benefits the fusion.

From the above experiments and discussion, it can be seen that the proposed CSR-based fusion method represents the underlying image well, so it achieves much better results than its counterparts in terms of both visual perception and objective metrics.

5. Conclusions

A new fusion scheme for medical images based on sparse representation of classified image patches has been proposed. Image patches are classified according to their geometrical gradient direction, and multi-class dictionaries are trained within each class via the ODL algorithm. Experiments on MRI/SPECT, CT/MRI and CT/PET images demonstrate that the proposed method can effectively integrate the salient information of the source images into the fused image, and that image features such as texture details, edges and color information are well preserved. The superiority of the proposed CSR-based fusion method over its counterparts can be observed in terms of both visual perception and objective metrics. However, this work is only an elementary attempt, and there is still much room for improvement. For example, classifying patches with more effective features, or with multiple features, to provide a more effective sparse representation is still an open problem; in that case, a fast and efficient dictionary learning algorithm for fusion would also be a good choice. Besides, it is worth investigating the applicability of the algorithm to other types of medical images.

Acknowledgements

The authors are very grateful to the technical editor and the anonymous reviewers for their valuable and constructive comments, which resulted in much improvement of the quality of this paper. This work was supported in part by the Natural Science Foundation of China (NSFC) (Grant Numbers 61671105 and 61471080), by the Scientific Research Project of Liaoning Provincial Department of Education under Grant JDL2016024, and by the Fundamental Research Project of Liaoning Key Laboratory under Grant LZ2015008.

References

[1] S. Daneshvar, H. Ghassemian, MRI and PET image fusion by combining IHS and retina-inspired models, Inf. Fusion 11 (2) (2010) 114–123. [2] S. Singh, D. Gupta, R.S. Anand, V. Kumar, Nonsubsampled shearlet based CT and MR medical image fusion using biologically inspired spiking neural network, Biomed. Signal Process. Control 18 (2015) 91–101. [3] N. Amini, E. Fatemizadeh, H. Behnam, MRI-PET image fusion based on NSCT transform using local energy and local variance fusion rules, J. Med. Eng. Technol. 38 (4) (2014) 211–219. [4] A.P. James, B.V. Dasarathy, Medical image fusion: a survey of the state of the art, Inf. Fusion 19 (2014) 4–19. [5] P.R. Hill, C.N. Canagarajah, D.R. Bull, Image fusion using complex wavelets, BMVC (2002) 1–10. [6] Z. Zhang, R.S. Blum, A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application, Proc. IEEE 87 (1999) 1315–1326. [7] S. Li, J.T. Kwok, Y. Wang, Combination of images with diverse focuses using the spatial frequency, Inf. Fusion 2 (3) (2001) 169–176. [8] S. Li, B. Yang, Multifocus image fusion using region segmentation and spatial frequency, Image Vis.
Comput. 26 (7) (2008) 971–979. [9] N. Mitianoudis, T. Stathaki, Pixel-based and region-based image fusion schemes using ICA bases, Inf. Fusion 8 (2) (2007) 131–142. [10] H. Li, B. Manjunath, S.K. Mitra, Multisensor image fusion using the wavelet transform, Graph. Models Image Process. 57 (3) (1995) 235–245. [11] S. Das, M.K. Kundu, NSCT-based multimodal medical image fusion using pulse-coupled neural network and modified spatial frequency, Med. Biol. Eng. Comput. 50 (10) (2012) 1105–1114.

205

[12] P. Ganasala, V. Kumar, CT and MR image fusion scheme in nonsubsampled contourlet transform domain, J. Digit. Imaging 27 (3) (2014) 407–418. [13] D.L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306. [14] B. Yang, S. Li, Multifocus image fusion and restoration with sparse representation, IEEE Trans. Instrum. Meas. 59 (4) (2010) 884–892. [15] N. Yu, T. Qiu, F. Bi, A. Wang, Image features extraction and fusion based on joint sparse representation, IEEE J. Sel. Top. Signal Process. 5 (5) (2011) 1074–1082. [16] H. Yin, S. Li, Multimodal image fusion with joint sparsity model, Opt. Eng. 50 (6) (2011) 409–421. [17] N. Yu, T. Qiu, W. Liu, Medical image fusion based on sparse representation with ksvd, World Congress Med. Phys. Biomed. Eng. (2013) 550–553. [18] Y. Liu, Z. Wang, Simultaneous image fusion and denoising with adaptive sparse representation, IET Image Process. 9 (5) (2015) 347–357. [19] Q. Zhang, Y. Fu, H. Li, J. Zou, Dictionary learning method for joint sparse representation-based image fusion, Opt. Eng. 52 (5) (2013) 532–543. [20] W. Dong, L. Zhang, G. Shi, X. Wu, Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization, IEEE Trans. Image Process. 20 (7) (2011) 1838–1857. [21] M. Elad, I. Yavneh, A plurality of sparse representations is better than the sparsest one alone, IEEE Trans. Inf. Theory 55 (10) (2009) 4701–4714. [22] I. Ramirez, P. Sprechmann, G. Sapiro, Classification and clustering via dictionary learning with structured incoherence and shared features, IEEE Conf. Comput. Vis. Pattern Recognit. (2010) 3501–3508. [23] B. Wen, S. Ravishankar, Y. Bresler, Structured overcomplete sparsifying transform learning with convergence guarantees and applications, Int. J. Comput. Vis. 114 (2–3) (2014) 137–167. [24] Z. Zhang, Y. Xu, J. Yang, X. Li, D. Zhang, A survey of sparse representation: algorithms and applications, Access IEEE 3 (2015) 490–530. [25] J. Yang, J. Wright, T.S. Huang, Y. Ma, Image super-resolution via sparse representation, IEEE Trans. Image Process. 19 (11) (2010) 2861–2873. [26] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227. [27] M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Trans. Image Process. 15 (12) (2006) 3736–3745. [28] M. Aharon, M. Elad, A. Bruckstein, K-SVD. An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322. [29] S.G. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE Trans. Signal Process. 41 (12) (1993) 3397–3415. [30] S. Chen, S.A. Billings, W. Luo, Orthogonal least squares methods and their application to non-linear system identification, Int. J. Control 50 (5) (1989) 1873–1896. [31] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43 (1) (2001) 129–159. [32] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499. [33] D.L. Donoho, For most large underdetermined systems of linear equations, the minimal l1 -norm solution is also the sparsest solution, Commun. Pure Appl. Math. 59 (6) (2006) 797–829. [34] I.W. Selesnick, R.G. Baraniuk, N.G. Kingsbury, The dual-tree complex wavelet transform, IEEE Signal Process. Mag. 22 (6) (2005) 123–151. [35] M.N. Do, M. 
Vetterli, The contourlet transform: an efficient directional multiresolution image representation, IEEE Trans. Image Process. 14 (12) (2005) 2091–2106. [36] E. Candès, L. Demanet, D. Donoho, L. Ying, Fast discrete curvelet transforms, Multiscale Model. Simul. 5 (3) (2006) 861–899. [37] K. Engan, S.O. Aase, J. Hakon Husoy, Method of optimal directions for frame design, IEEE Int. Conf. Acoustics Speech Signal Process. 5 (1999) 2443–2446. [38] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010) 19–60. [39] T.M. Tu, S.C. Su, H.C. Shyu, P.S. Huang, A new look at IHS-like image fusion methods, Inf. Fusion 2 (3) (2001) 177–186. [40] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1 (2005) 886–893. [41] R. Margolin, A. Tal, L. Zelnik-Manor, What makes a patch distinct? IEEE Conf. Comput. Vis. Pattern Recognit. (2013) 1139–1146. [42] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, W. Wu, Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 94–109. [43] G. Qu, D. Zhang, P. Yan, Information measure for performance of image fusion, Electron. Lett. 38 (7) (2002) 313–315. [44] C. Xydeas, V. Petrovic, Objective image fusion performance measure, Electron. Lett. 36 (4) (2000) 308–309. [45] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612. [46] G. Piella, H. Heijmans, A new quality metric for image fusion, Int. Conf. Image Process. 3 (2003) 173–176 (III). [47] Z. Wang, A.C. Bovik, A universal image quality index, IEEE Signal Process. Lett. 9 (3) (2002) 81–84.