Computers and Electrical Engineering 68 (2018) 366–380
Combining extreme learning machine with modified sine cosine algorithm for detection of pathological brain

Deepak Ranjan Nayak^a,*, Ratnakar Dash^a, Banshidhar Majhi^a, Shuihua Wang^b

^a Pattern Recognition Lab, Department of Computer Science and Engineering, NIT Rourkela, 769 008, India
^b School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu 210 046, China
ARTICLE INFO

Keywords: Pathological brain detection; Magnetic resonance imaging; Fast discrete curvelet transform; Extreme learning machine; Modified sine cosine algorithm

ABSTRACT
The development of automated diagnosis systems has become a major focus of current research, with the aim of assisting medical experts in decision-making. This paper presents a new automatic system for the detection of pathological brains through magnetic resonance imaging (MRI). The proposed system first enhances the contrast of input MR images using contrast limited adaptive histogram equalization (CLAHE). Then, curve-like features are computed from the preprocessed MR brain images using the fast discrete curvelet transform via unequally-spaced FFT (FDCT-USFFT). Subsequently, a combined technique known as PCA+LDA is employed to derive more discriminative and reduced feature sets. Finally, a novel learning approach dubbed extreme learning machine with modified sine cosine algorithm (MSCA-ELM) is proposed by combining ELM and MSCA for the classification of MR images into two categories: pathological and healthy. MSCA is obtained by introducing a mutation operator into the basic sine cosine algorithm (SCA). In MSCA-ELM, MSCA is used to optimize the input weights and hidden biases of a single-hidden-layer feed-forward neural network (SLFN), while an analytical procedure computes the output weights. The proposed scheme is rigorously evaluated on three standard datasets and the results are compared against other competent schemes. The experimental results demonstrate that the proposed scheme outperforms its counterparts in terms of classification accuracy and the number of features required. It has also been observed that MSCA-ELM yields performance superior to conventional learning methods. Hence, the proposed system can effectively recognize pathological brains in real time and could potentially be installed on medical robots.
1. Introduction

Across the globe, the death rate of individuals across all age groups is rising sharply due to several brain diseases [1]. Pathological brain detection (PBD) plays a vital role in the early diagnosis of diseases such as Alzheimer's disease, mild cognitive impairment, autism spectrum disorder, multiple sclerosis, hearing loss, and microbleeding. The main objective of PBD is to help radiologists make correct and quick clinical decisions. Magnetic resonance imaging (MRI), an advanced neuroimaging technique, is frequently used in PBD because of its ability to provide better resolution of brain tissues and its radiation-free nature [2]. However, manual interpretation is tedious and may be error-prone because of the large volume of image content [3,4]. Thus there is a strong demand for identification, evaluation, and classification support tools in the diagnostic procedure. Pathological brain detection system (PBDS) development is a growing research area that aims at meeting these demands. Through PBDS we can speed up
Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. Y. Zhang.
* Corresponding author. E-mail addresses: [email protected] (D.R. Nayak), [email protected] (R. Dash), [email protected] (B. Majhi), [email protected] (S. Wang).
https://doi.org/10.1016/j.compeleceng.2018.04.009 Received 1 November 2017; Received in revised form 13 April 2018; Accepted 16 April 2018 0045-7906/ © 2018 Elsevier Ltd. All rights reserved.
the clinical decisions and reduce diagnostic errors. Work on PBD started in the early 2000s [1,5] with the efforts of Chaplot et al. [6], in which the 2D discrete wavelet transform (2D DWT) and support vector machine (SVM) were used for feature extraction and classification. El-Dahshan et al. [7] employed 2D DWT and two classifiers, the k-nearest neighbor (KNN) and the feed-forward back-propagation artificial neural network (FP-ANN); to reduce the feature dimensionality, they applied principal component analysis (PCA). The authors in [2,8,9] used scaled conjugate gradient (SCG), particle swarm optimization (PSO), adaptive chaotic PSO (ACPSO), and scaled chaotic artificial bee colony (SCABC) to train a feed-forward neural network (FNN) classifier. Zhang et al. [10] combined DWT, PCA, and kernel SVM (KSVM). In [3], a system based on the ripplet transform (RT) and least squares SVM (LS-SVM) is suggested. In [11], the authors harnessed wavelet entropy (WE) to extract features, with a probabilistic neural network (PNN) for classification. Later, in [1], the authors combined a feedback pulse coupled neural network (FPCNN), DWT, PCA, and FNN to detect pathological brains. Dong et al. [12] utilized wavelet packet Shannon entropy (WPSE) and wavelet packet Tsallis entropy (WPTE) separately as features, with GEPSVM as the classifier. Nayak et al. [4] utilized 2D DWT, probabilistic PCA (PPCA), and AdaBoost with random forests (ADBRF) to identify pathological brains. Zhang et al. [13] offered a PBDS that combines the stationary wavelet transform (SWT), PCA, and GEPSVM. In [14], a PCA+LDA technique is applied to the 2D DWT features. In [15], a Naive Bayes classifier (NBC) based PBDS is proposed that makes use of WE features. Sun et al. [16] utilized a GEPSVM+RBF classifier on WE and Hu moment invariant (HMI) features. Wang et al.
[17] proposed a novel feature called fractional Fourier entropy (FRFE) and performed Welch's t-test (WTT) to select the relevant features, with a twin SVM (TSVM) classifier for classification. Later, in [18], a PBDS based on FRFE features and a multilayer perceptron (MLP) was proposed; an adaptive real-coded BBO (ARCBBO) approach was employed to train the MLP, and the number of hidden neurons was found using three separate pruning methods, namely Bayesian detection boundaries (BDB), dynamic pruning (DP), and the Kappa coefficient (KC). Chen et al. [19] utilized Minkowski-Bouligand dimension (MBD) features and proposed an improved PSO (IPSO) to train a single-hidden-layer feed-forward neural network. Later on, Wang et al. [20] combined the variance and entropy (VE) values of the dual-tree complex wavelet transform (DTCWT) with TSVM to detect pathological brains. Li et al. [21] employed wavelet packet Tsallis entropy (WPTE) and an FNN with real-coded biogeography-based optimization (RCBBO) for pathological brain detection. The literature reveals that 2D DWT and its variants (SWT, DTCWT, DWPT, etc.) are the most common feature extractors. However, these transforms have limited capability for representing 2D singularities (edges and textures of an image); in other words, they cannot efficiently capture the curve-like features inherent in MR images. Further, most PBDSs employ classifiers such as FNN and SVM, yet traditional FNN training algorithms such as Levenberg-Marquardt (LM) and back-propagation (BP) are slow and prone to becoming trapped in local minima, and the computational complexity of the standard SVM is very high. Furthermore, several PBDSs demand a large number of features. To resolve these issues, a novel framework for pathological brain detection is proposed.
The main contributions of this study are summarized as follows:
(a) The fast discrete curvelet transform via unequally-spaced FFT (FDCT-USFFT) is harnessed as the feature extractor since it efficiently captures 2D singularities along a group of curves.
(b) To combat the issues faced by conventional learning algorithms, a simple and non-iterative learning technique known as the extreme learning machine (ELM) is employed.
(c) The concept of mutation is introduced into the basic sine cosine algorithm (SCA) to enhance its global search capability; the result is referred to as the modified sine cosine algorithm (MSCA).
(d) A novel learning algorithm known as MSCA-ELM is proposed based on MSCA and ELM to further enhance the performance of basic ELM.
(e) To evaluate the performance of the proposed scheme, extensive experiments are carried out on three well-known datasets, and its performance is compared against its counterparts.
The remainder of this article is organized as follows. Section 2 describes the datasets used in this study. Section 3 discusses the details of the proposed methodology. In Section 4, the evaluation results on standard datasets and comparisons with existing schemes are presented. Finally, Section 5 concludes the work and suggests some possible future research directions.

2. Datasets used

The proposed PBDS has been evaluated on three benchmark datasets, namely DS-I, DS-II, and DS-III, which contain 66, 160, and 255 brain MR images, respectively [3,4,12]. The datasets comprise T2-weighted brain MR images of size 256 × 256 in the axial view plane, downloaded from the Medical School of Harvard University website [22]. Both DS-I and DS-II hold samples of seven disease categories, namely sarcoma, glioma, meningioma, AD plus visual agnosia (VA), Pick's disease (PD), AD, and Huntington's disease (HD), plus healthy brain samples.
DS-III additionally includes four more diseases: cerebral toxoplasmosis (CTP), multiple sclerosis (MS), herpes encephalitis (HE), and chronic subdural hematoma (CSH).

3. Proposed methodology

The proposed framework includes four vital components: contrast limited adaptive histogram equalization (CLAHE) based preprocessing, FDCT-USFFT based feature extraction, PCA+LDA based feature dimensionality reduction, and MSCA-ELM based
Fig. 1. Overview of the proposed framework.
classification. The input of the system is an MR image and the output is a class label (healthy or pathological). An overview of the proposed framework is depicted in Fig. 1.

3.1. Preprocessing based on CLAHE

Most of the images in the datasets considered in this study are of low contrast. Therefore, a standard technique named CLAHE is employed for contrast enhancement. CLAHE first evaluates a histogram of gray values over the contextual region surrounding each pixel and thereafter allocates a value to each pixel intensity within the display range [5]. Additionally, it uses a fixed value dubbed the clip limit, which clips the histogram prior to the computation of the cumulative distribution function (CDF); CLAHE redistributes the parts of the histogram that surpass the clip limit equally among all histogram bins.

3.2. Feature extraction based on FDCT via USFFT

The wavelet transform has received much attention from researchers due to properties such as time-frequency localization and multiresolution. Wavelets perform well in representing 1D singularities; however, they are unable to capture 2D singularities (lines, curves, etc.) in images. The ridgelet transform was later proposed to handle line singularities, but it cannot effectively deal with curve singularities. In contrast, the first-generation curvelet transform handles 2D singularities efficiently and additionally offers multiresolution, stronger directional selectivity, anisotropy, and localization [5]. More recently, the second-generation curvelet transform was introduced, which resolves the problems faced by first-generation curvelets, such as the unclear geometry of ridgelets and their high computational cost [23]. Let g be a signal; the curvelet transform can then be defined via the inner product as
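To make the clipping-and-redistribution step concrete, the following sketch applies clipped histogram equalization to a single contextual region with NumPy. This is a single-tile simplification (full CLAHE computes one such mapping per region and bilinearly interpolates between neighboring mappings), and the fractional clip_limit parameterization used here is an assumption for illustration, not the paper's exact implementation.

```python
import numpy as np

def clahe_tile(img, n_bins=256, clip_limit=0.01):
    """Clipped histogram equalization for one contextual region (tile)."""
    hist, _ = np.histogram(img, bins=n_bins, range=(0, n_bins))
    clip = max(1, int(clip_limit * img.size))         # histogram ceiling
    excess = int(np.sum(np.maximum(hist - clip, 0)))  # mass above the ceiling
    hist = np.minimum(hist, clip) + excess // n_bins  # redistribute equally
    cdf = np.cumsum(hist).astype(np.float64)          # cumulative distribution
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min() + 1e-12)
    lut = np.round(cdf * (n_bins - 1)).astype(np.uint8)
    return lut[img]                                   # map pixels through the CDF

rng = np.random.default_rng(0)
tile = rng.integers(40, 80, size=(64, 64), dtype=np.uint8)  # low-contrast region
out = clahe_tile(tile)                                # contrast is stretched
```

A larger clip limit lets histogram peaks through and approaches plain histogram equalization; a smaller one suppresses noise amplification, which is the motivation for clipping.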
C(α, β, γ) = ⟨g, ϕ_{α,β,γ}⟩    (1)

Here, ϕ_{α,β,γ} indicates the curvelet basis function, and α, γ, and β denote the scale, position, and direction (orientation) parameters, respectively. The curvelet transform decomposes the image into numerous windows at various scales and orientations. The discrete form of the curvelet transform for an input Cartesian array g[x1, y1] with 0 ≤ x1, y1 < n is defined as [23]

C^D(α, β, γ) = Σ_{0 ≤ x1, y1 < n} g[x1, y1] ϕ^D_{α,β,γ}[x1, y1]    (2)
where ϕ^D_{α,β,γ} denotes a digital curvelet waveform. The proposed PBDS uses a second-generation curvelet transform, also called the fast discrete curvelet transform (FDCT), for feature extraction. There are two implementations of the FDCT: FDCT via wrapping (FDCT-WR) and FDCT via unequally spaced fast Fourier transform (FDCT-USFFT). In contrast to first-generation curvelets, both are fast, simple, and less redundant. We choose FDCT-USFFT as the feature extractor in this study as it provides a proper discretization of the continuous definition. The steps to obtain the curvelet coefficients using FDCT-USFFT are listed in Algorithm 1.
Feature generation. To generate a feature vector, the coefficients of FDCT-USFFT at each scale α and orientation β are collected. The number of scales (s) for an image of size nr × nc is decided as
Algorithm 1. FDCT via USFFT.
Require: Input image g[x1, y1]; 0 ≤ x1, y1 < n
Ensure: Curvelet coefficients C^D(α, β, γ)
1: Apply the 2D FFT to g[x1, y1] to generate the Fourier coefficients
   ĝ[n1, n2] = Σ_{x1,y1=0}^{n−1} g[x1, y1] e^{−i2π(n1 x1 + n2 y1)/n};  −n/2 ≤ n1, n2 < n/2
2: For each scale/angle pair (α, β), interpolate ĝ[n1, n2] to generate the sampled values ĝ[n1, n2 − n1 tan θ_β]
3: Multiply the interpolated object ĝ by the parabolic window Ũ_α: g̃_{α,β}[n1, n2] = ĝ[n1, n2 − n1 tan θ_β] Ũ_α[n1, n2]
4: Obtain the discrete curvelet coefficients C^D(α, β, γ) by applying the inverse 2D FFT to each g̃_{α,β}.
s = ⌈log2(min(nr, nc)) − 3⌉    (3)
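A quick numerical check of Eq. (3) and of the sub-band counts given in the text (the per-scale angle counts are those reported for the 256 × 256 setting):

```python
import math

def curvelet_scales(nr, nc):
    # Eq. (3): number of FDCT decomposition scales for an nr x nc image
    return math.ceil(math.log2(min(nr, nc)) - 3)

s = curvelet_scales(256, 256)  # 5 scales for the 256 x 256 MR images

# Sub-bands per scale: the first and last scales are isotropic (one sub-band
# each), while scales 2, 3 and 4 carry 32, 32 and 64 orientations.
angles = [1, 32, 32, 64, 1]
total_subbands = sum(angles)   # 130 sub-bands in total
# Curvelets at angles theta and theta + pi yield the same coefficients, so the
# symmetric half of each directional scale can be discarded:
kept_subbands = angles[0] + sum(a // 2 for a in angles[1:-1]) + angles[-1]  # 66
```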
Since the MR images are of size 256 × 256, s is 5, where each scale except the first and last contains information along different orientations (sub-bands). Scales 2, 3, and 4 possess 32, 32, and 64 angles, respectively. It is worth pointing out that curvelets at angles θ and θ + π generate the same coefficients. Therefore, the coefficients of the symmetric sub-bands at scales 2, 3, and 4 are discarded to remove redundancy from the original feature vector. However, the resultant feature vector is still of large dimension even after discarding the symmetric bands, which necessitates the employment of feature reduction techniques.

3.3. Feature reduction based on PCA+LDA

Feature reduction methods play a vital role in reducing the computational burden, understanding the data, and improving classification performance. Both PCA and linear discriminant analysis (LDA) have received considerable attention from researchers in the past decades. PCA transforms high-dimensional input data to a lower-dimensional space while keeping the maximum variation of the data. In contrast, LDA attempts to find a feature subspace that best discriminates between the classes. However, conventional LDA performs poorly on high-dimensional, small-sample-size problems, where the within-class scatter matrix (Sw) is always singular [24]. To address this issue, a popular approach called PCA+LDA is applied in this study, where D-dimensional data are first reduced to M dimensions using PCA and then to L dimensions using LDA, with L << M < D. The optimal number of features (L) required in our system is selected using the normalized cumulative sum of variances (NCSV) measure. The NCSV value for the a-th feature is calculated as
NCSV(a) = (Σ_{u=1}^{a} λ(u)) / (Σ_{u=1}^{D} λ(u));    1 ≤ a ≤ D    (4)
where λ(u) represents the eigenvalue of the u-th feature and D denotes the total number of eigenvectors (features), sorted in descending order of eigenvalue. A threshold is set manually, and the smallest number of features (L) for which the NCSV value surpasses the threshold is selected. The L best eigenvectors are retained for extracting features from unknown test MR images. The overall steps involved in the feature reduction stage are listed in Algorithm 2.
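Given these definitions, the NCSV-based selection of L reduces to a cumulative-sum threshold on the sorted eigenvalues. A minimal NumPy sketch (the eigenvalue spectrum below is illustrative):

```python
import numpy as np

def select_by_ncsv(eigvals, threshold=0.95):
    """Smallest L whose NCSV (Eq. (4)) reaches `threshold`.

    The eigenvalues are sorted in descending order before the cumulative
    sum, matching the text.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ncsv = np.cumsum(lam) / lam.sum()                 # Eq. (4) for a = 1..D
    return int(np.searchsorted(ncsv, threshold) + 1)  # first a past threshold

# Toy spectrum where the first two components dominate:
L = select_by_ncsv([9.0, 0.6, 0.2, 0.1, 0.1])  # NCSV = 0.90, 0.96, ... -> L = 2
```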
3.4. Classification based on MSCA-ELM

In this section, we first discuss the extreme learning machine (ELM) and the modified sine cosine algorithm (MSCA), and thereafter present the proposed MSCA-ELM algorithm in detail.

3.4.1. Extreme learning machine (ELM)

The extreme learning machine (ELM) is one of the simplest and most effective approaches for training single-hidden-layer feed-forward neural networks (SLFNs), and it avoids the limitations of gradient-based learning schemes [25,26]. It has achieved notable success in problems such as multi-label classification and regression. In contrast to conventional learning approaches such as BP, SVM, and LS-SVM, ELM learns faster with better generalization performance. In ELM, the hidden node parameters (the input weights and hidden biases) are randomly generated, while the output weights of the SLFN are determined analytically by a simple inverse operation on the hidden layer output matrix. Given N distinct training samples (x_j, t_j), where x_j = [x_j1, x_j2, …, x_jL]^T ∈ R^L and t_j = [t_j1, t_j2, …, t_jC]^T ∈ R^C, a hidden node number n_h, and an activation function ϕ(·), the steps of the basic ELM algorithm are as follows.
Algorithm 2. Feature reduction using PCA+LDA.
Require: Feature matrix F_M of size N × D
Ensure: Reduced feature matrix F_r of size N × L
(Functions pca() and lda() reduce the dimension using PCA and LDA, respectively.)
1: Choose a dimension M
2: F (N × M) ← pca(F_M, M)        ▹ reduced dimension using PCA
3: Select a dimension L using the NCSV measure
4: F_r (N × L) ← lda(F, L)        ▹ reduced dimension using LDA
5: Output the reduced matrix F_r
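Algorithm 2 can be sketched from scratch for the two-class case. This is a toy NumPy sketch, not the paper's implementation: pca() projects onto the top singular vectors of the centered data, lda_two_class() computes the single Fisher direction (LDA yields at most C − 1 = 1 dimension for two classes), and the synthetic data and the small ridge added to Sw are our assumptions.

```python
import numpy as np

def pca(X, m):
    """Project X (N x D) onto its top-m principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T

def lda_two_class(F, y):
    """Project onto the Fisher direction w = Sw^{-1}(m1 - m0)."""
    m0, m1 = F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)
    Sw = np.cov(F[y == 0], rowvar=False) + np.cov(F[y == 1], rowvar=False)
    w = np.linalg.solve(Sw + 1e-8 * np.eye(F.shape[1]), m1 - m0)
    return F @ w

rng = np.random.default_rng(1)
N, D, M = 40, 100, 10                  # more features than samples, as in PBD
y = np.repeat([0, 1], N // 2)
X = rng.normal(size=(N, D))
X[y == 1, 0] += 4.0                    # one class-separating direction
F = pca(X, M)                          # D -> M; sidesteps the singular-Sw issue
z = lda_two_class(F, y)                # M -> L = 1 discriminative feature
```

Running PCA first makes Sw in the reduced space well conditioned, which is exactly why PCA+LDA is preferred over plain LDA in the small-sample-size regime described above.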
1. Generate the hidden node parameters (w_i^h, b_i), i = 1, 2, …, n_h, randomly.
2. Compute the hidden layer output matrix H.
3. Compute the output weight matrix using the minimal-norm least-squares solution w^o = H† T.
Here, w_i^h = [w_{i1}^h, w_{i2}^h, …, w_{iL}^h]^T represents the weight vector linking the i-th hidden neuron and the input neurons, w_i^o = [w_{i1}^o, w_{i2}^o, …, w_{iC}^o]^T the weight vector connecting the i-th hidden neuron and the output neurons, and b_i the bias of the i-th hidden neuron. H† indicates the Moore-Penrose (MP) generalized inverse of the matrix H. The sizes of H, w^o, and T are N × n_h, n_h × C, and N × C, respectively. As the ELM solution is obtained analytically, without iterative parameter tuning, it converges faster than traditional learning algorithms.

3.4.2. Proposed modified sine cosine algorithm (MSCA)

The sine cosine algorithm (SCA) is a recently proposed population-based optimization technique that uses two trigonometric functions to search for the global optimum; in particular, SCA uses the sine and cosine functions to update a set of candidate solutions [27]. The solutions in SCA are updated as follows.
s_i(t+1) = s_i(t) + r1 × sin(r2) × |r3 s_best − s_i(t)|,   if r4 < 0.5
s_i(t+1) = s_i(t) + r1 × cos(r2) × |r3 s_best − s_i(t)|,   if r4 ≥ 0.5    (5)
where t denotes the current generation, s_i(t) the current solution, s_best the best solution (with the best fitness) achieved so far, and |·| the absolute value. r1, r2, r3, and r4 are random variables. The parameter r1 determines the position of the next solution, which may lie either in the space between s_i(t) and s_best or outside it. To balance exploration and exploitation, r1 is changed adaptively as follows
r1 = q − t (q / MaxItr)    (6)
where MaxItr is the maximum number of generations and q is a constant. The parameter r2 defines the direction of movement of the next solution, towards or away from s_best. The parameter r3 gives a random weight to s_best in order to stochastically emphasize (r3 > 1) or deemphasize (r3 < 1) the effect of the destination in defining the distance. The parameter r4, a random number between 0 and 1, switches between the sine and cosine functions. It has been observed that conventional SCA has several disadvantages, such as getting trapped at local optima, slow convergence, and high computational cost [28]. To enhance its performance, we introduce a mutation operator into traditional SCA and name the result modified SCA (MSCA). In general, a mutation operator provides additional diversity and hence improves the search toward the global best solution. In MSCA, we select a random candidate solution and, with a mutation probability, add a random perturbation (mutation step size) to it. The general steps of the proposed MSCA method are described in Algorithm 3. In the algorithm, r11, r12, rand1(.), and rand2(.) are four separate random numbers in the range [0,1]; rnd denotes a random index between 1 and the number of candidate solutions; Pm indicates the mutation probability; MAXlimit indicates the maximum limit of a variable in the solution; and step size dictates the mutation step size.

3.4.3. Proposed MSCA-ELM method

Due to the random choice of the input weights and hidden biases, standard ELM poses two critical problems. First, it needs more hidden neurons, so ELM responds slowly to unseen testing data [29]. Second, with many hidden neurons it can produce an ill-conditioned hidden layer output matrix H, which leads to poor generalization performance. The condition number has been shown to be an effective qualitative measure of the conditioning of a matrix [30].
An ill-conditioned system has a large condition number, while a well-conditioned system has a small one. To overcome the issues of basic ELM, a few research efforts have been reported in past years in which population-based optimization schemes such as genetic algorithms (GA), differential evolution (DE), and PSO are used to optimize the hidden node parameters of ELM. In this paper, however, a modified SCA algorithm is introduced to train ELM (MSCA-ELM), which markedly improves on these recent results. In MSCA-ELM, MSCA is used to optimize the hidden node parameters, whereas the MP generalized inverse is utilized to find the solution analytically. In the current study, we first verify the effectiveness of ELM trained by conventional SCA (referred to as SCA-ELM) and thereafter verify our proposed MSCA-ELM. It is worth mentioning that, unlike SCA-ELM, the MSCA-ELM approach searches for the global optimum by considering both the root-mean-squared error (RMSE) and the norm of the output weights of the SLFN, which leads to potential improvements in generalization performance and conditioning. The main goal of MSCA-ELM is to minimize the norm of the output weights and to bound the hidden node parameters within a specific range, with an aim to enhance the convergence performance of ELM. It is known from Bartlett's theory that, for neural networks reaching a smaller training error, the smaller the norm of the weights, the better the generalization performance the network tends to acquire. The steps of the proposed MSCA-ELM are as follows:
(a) Randomly initialize all the candidate solutions (z = 1, …, Ps) such that each solution consists of a set of input weights and hidden biases within the range [−1, 1]:

s_z = [w_{11}^h, w_{12}^h, …, w_{1L}^h, w_{21}^h, w_{22}^h, …, w_{2L}^h, …, w_{n_h1}^h, w_{n_h2}^h, …, w_{n_hL}^h, b_1, b_2, …, b_{n_h}]    (7)
Algorithm 3. General steps of the proposed MSCA algorithm.
1: Initialize a set of random candidate solutions (s)
2: Calculate the fitness of each solution
3: Find the best candidate solution (s_best)
4: while (t < maximum number of iterations) do
5:    for each candidate solution do
6:        Update r1, r2, r3, and r4
7:        Update the solution using Eq. (5)
8:    end for
9:    Compute the fitness of each updated solution
10:   Update the best solution achieved so far (s_best)
11:   Select a random candidate solution (s_rnd)            ▹ Mutation
12:   if (r11 < Pm) then
13:       if (r12 < 0.5) then
14:           s′_rnd = s_rnd + rand1(.) × MAXlimit / step size
15:       else
16:           s′_rnd = s_rnd + rand2(.) × MAXlimit / step size
17:       end if
18:       if (fitness(s′_rnd) is better than fitness(s_rnd)) then
19:           s_rnd = s′_rnd
20:           if (fitness(s′_rnd) is better than fitness(s_best)) then
21:               s_best = s′_rnd
22:           end if
23:       else
24:           s_rnd = s_rnd
25:       end if
26:   end if
27: end while
28: Return the best solution achieved so far as the global optimum
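Algorithm 3 can be sketched in NumPy on a toy objective. This is a sketch under stated assumptions: the sphere function, the [−1, 1] initialization, and the fixed mutation perturbation of 0.1 stand in for the problem-specific choices (MAXlimit / step size) in the pseudocode, and the greedy acceptance mirrors the fitness tests there.

```python
import numpy as np

def msca_minimize(f, dim, n_pop=20, max_iter=30, q=2.0, pm=0.8, seed=0):
    """Minimize f with the SCA update (Eq. (5)) plus a greedy mutation step."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1, 1, size=(n_pop, dim))        # candidate solutions
    fit = np.apply_along_axis(f, 1, s)
    best = s[fit.argmin()].copy()
    for t in range(max_iter):
        r1 = q - t * q / max_iter                    # Eq. (6): linear decay
        r2 = rng.uniform(0, 2 * np.pi, size=(n_pop, dim))
        r3 = rng.uniform(0, 2, size=(n_pop, dim))
        r4 = rng.uniform(size=(n_pop, dim))
        trig = np.where(r4 < 0.5, np.sin(r2), np.cos(r2))
        s = s + r1 * trig * np.abs(r3 * best - s)    # Eq. (5)
        fit = np.apply_along_axis(f, 1, s)
        if fit.min() < f(best):                      # keep the best so far
            best = s[fit.argmin()].copy()
        if rng.uniform() < pm:                       # mutation (Algorithm 3)
            i = rng.integers(n_pop)
            cand = s[i] + rng.uniform(size=dim) * 0.1
            if f(cand) < fit[i]:                     # greedy acceptance
                s[i] = cand
                if f(cand) < f(best):
                    best = cand.copy()
    return best, f(best)

sphere = lambda x: float(np.sum(x ** 2))
best, val = msca_minimize(sphere, dim=5)             # best fitness found
```

Since s_best is replaced only when a strictly better candidate appears, the returned fitness never exceeds that of the initial population's best member.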
(b) Evaluate the output weights and the fitness of each candidate solution, and find s_best in the population. For fitness evaluation, we calculate the RMSE over the validation set. The fitness is stated as

f(s) = √( (Σ_{j=1}^{N_v} ‖ Σ_{i=1}^{n_h} w_i^o ϕ(w_i^h · x_j + b_i) − t_j ‖₂²) / N_v )    (8)
where N_v indicates the number of validation samples.
(c) For each candidate solution, update r1, r2, r3, and r4, and update the solutions using Eq. (5).
(d) Bound the new solutions using the following expression

s_z(t+1) = −1 if s_z(t+1) < −1;   s_z(t+1) = +1 if s_z(t+1) > +1    (9)
and find the new best solution s_bestnew.
(e) Update s_best using the fitness value and the norm of the output weights as follows

s_best = s_bestnew   if ( f(s_best) − f(s_bestnew) < ϵ f(s_best) and ‖w^o_{s_bestnew}‖ < ‖w^o_{s_best}‖ );
s_best = s_best      otherwise    (10)
where f(s_best) and f(s_bestnew) denote the fitness of the best solution so far and of the new best solution, respectively; ‖w^o_{s_best}‖ and ‖w^o_{s_bestnew}‖ represent the corresponding output weights; and ϵ > 0 is a user-defined tolerance rate.
(f) Randomly select an updated solution in the population, apply mutation to it (using the equations in Algorithm 3), and update s_best if a better solution is found.
(g) Repeat (c)-(f) until the maximum number of iterations is reached, eventually obtaining the optimal hidden node parameters.
MSCA uses Eq. (10) to find the optimal input weights and hidden biases, and therefore tends to yield a lower norm of the output weights of the SLFN; a lower norm in turn leads to a lower condition number of the hidden layer output matrix. In summary, the key advantages of the proposed MSCA-ELM algorithm are that (i) it improves the conditioning and (ii) it produces better generalization performance with a much more compact network. Unlike gradient-based methods and classical ELM, the MSCA-ELM algorithm does not require the activation function to be infinitely differentiable. Because the proposed framework combines FDCT-USFFT, PCA+LDA, and MSCA-ELM, it is hereafter referred to as FDCT-USFFT + PCA+LDA + MSCA-ELM.

4. Experimental results and analysis

The proposed method was implemented in MATLAB on a PC with 16 GB RAM, a 3.5 GHz processor, and the Windows 10 OS. The parameters and the statistical setup were kept similar to those of other competent schemes to allow fair comparisons.

4.1. Experimental design

To validate the proposed scheme FDCT-USFFT + PCA+LDA + MSCA-ELM, simulations have been carried out on three benchmark datasets, namely DS-I, DS-II, and DS-III. For statistical analysis, cross-validation (CV) has been employed, which avoids over-fitting.
In this work, we have incorporated stratification into CV (SCV), which splits the folds in such a way that each fold has a similar class distribution. Fig. 2 depicts the setting of a 5-fold CV for a single run. In each trial, one fold is used for testing, one for validation, and the rest for training. The validation set is used to find the parameters of MSCA-ELM, i.e., it tells us when to stop training. The test set is used to evaluate the performance over a run of five trials. The statistical setting for all three datasets is kept similar to the literature [3,4,18], as shown in Table 1. For DS-I we employ a 6-fold SCV strategy, while for the other two datasets we employ a 5-fold SCV strategy. Additionally, we run the SCV procedure ten times to avoid
Fig. 2. Illustration of the 5-fold cross-validation setting for a single run.
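The stratified split used here is straightforward to reproduce; the following from-scratch sketch deals samples of each class round-robin across folds (scikit-learn's StratifiedKFold offers the same behavior). The DS-II-style label counts are taken from Table 1.

```python
import random

def stratified_kfold(labels, k, seed=0):
    """Split indices into k folds with near-identical class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):        # deal each class round-robin
            folds[j % k].append(i)
    return folds

# DS-II: 20 healthy (H) and 140 pathological (P) samples, 5-fold SCV
labels = ["H"] * 20 + ["P"] * 140
folds = stratified_kfold(labels, k=5)
# every fold ends up with 4 H and 28 P samples, matching Table 1
```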
Table 1. Specification of the three benchmark datasets [3,4,18]. H = healthy, P = pathological.

Dataset | k-fold SCV | Total samples (H / P) | Training (H / P) | Validation (H / P) | Testing (H / P)
DS-I    | 6          | 18 / 48               | 12 / 32          | 3 / 8              | 3 / 8
DS-II   | 5          | 20 / 140              | 12 / 84          | 4 / 28             | 4 / 28
DS-III  | 5          | 35 / 220              | 21 / 132         | 7 / 44             | 7 / 44
randomness.

4.2. Performance metrics

The performance of the proposed framework is evaluated using four benchmark metrics: sensitivity (Se), specificity (Sp), precision (Pr), and accuracy (Acc). Se is the fraction of pathological MR samples successfully predicted, while Sp is the fraction of healthy MR samples successfully predicted. Acc is the fraction of correctly predicted samples (both pathological and healthy) among all testing samples. Moreover, to compare the proposed MSCA-ELM against other methods such as DE-ELM, PSO-ELM, basic ELM, and BPNN, two additional measures are used: the condition number (K₂) and the norm of the output weights. The 2-norm condition number of the matrix H is calculated as
K₂(H) = √( λ_max(HᵀH) / λ_min(HᵀH) )    (11)

where λ_max(HᵀH) and λ_min(HᵀH) denote the largest and smallest eigenvalues of HᵀH, respectively.
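Eq. (11) is easy to compute directly (and to cross-check against NumPy's built-in condition number, which works from the singular values of H):

```python
import numpy as np

def cond2(H):
    """2-norm condition number of H from the eigenvalues of H^T H (Eq. (11))."""
    eig = np.linalg.eigvalsh(H.T @ H)      # real eigenvalues, ascending order
    return float(np.sqrt(eig[-1] / eig[0]))

H = np.array([[3.0, 0.0],
              [0.0, 1.0]])                 # singular values 3 and 1
k2 = cond2(H)                              # -> 3.0
```

A K₂ near 1 indicates a well-conditioned hidden layer output matrix; the large value reported for basic ELM in Table 2 is exactly the ill-conditioning that the norm minimization in MSCA-ELM suppresses.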
4.3. Results analyses

In the following, we discuss the results obtained at the various stages of the proposed scheme.

4.3.1. Preprocessing and feature extraction results

In the preprocessing stage, CLAHE is utilized, which relies on proper settings of its parameters. Here, the original MR image is divided into 64 contextual regions, and the number of bins and the clip limit (β) are set to 256 and 0.01, respectively. Representative enhanced images corresponding to four original MR images are depicted in Fig. 3; the affected lesions are clearer in the enhanced images than in the originals. Subsequently, a 5-level FDCT-USFFT is employed to extract features from the preprocessed images. The 5-level FDCT-USFFT decomposition of a healthy image is depicted in Fig. 4. We consider only the coefficients of 66 sub-bands (excluding symmetric sub-bands) out of a total of 130 sub-bands for feature extraction. The feature vector for a single MR image is constructed by collecting these coefficients, giving 125,952 features, which is very large.
Fig. 3. Preprocessing using CLAHE. Row 1 lists the original MR samples. Row 2 lists the corresponding contrast-enhanced images using CLAHE.
Fig. 4. Coefficients at level 5 decomposition of FDCT-USFFT.
4.4. Feature reduction results

In this study, PCA+LDA has been harnessed to reduce the dimension of the derived feature vectors. The value of M in Algorithm 2 is set to N − 1, where N is the number of training samples. The number of significant features is obtained from the NCSV values of the different features; the NCSV threshold was set to 0.95. The simulation results show that PCA requires more features than PCA+LDA to preserve the same amount of information. The classification accuracies with respect to an increasing number of features for PCA and PCA+LDA over the three datasets are shown in Fig. 5. From the figure, it is clear that the PCA-based scheme achieves
Fig. 5. Classification accuracy with respect to number of features for three datasets. 375
Computers and Electrical Engineering 68 (2018) 366–380
D.R. Nayak et al.
Table 2 Performance comparison of different algorithms on DS-I. Classifiers
Acc (%)
Hidden neurons (nh)
Norm
Condition number (K2 )
BPNN ELM PSO-ELM DE-ELM SCA-ELM MSCA-ELM
100.00 100.00 99.85 100.00 100.00 100.00
4 5 3 3 3 3
– 30.4136 20.4912 18.4813 12.7985 9.2384
– 4.1260e+03 60.0386 51.3119 42.1588 38.6785
higher accuracy with 13 features over all the three datasets, while PCA+LDA based scheme yields higher accuracy with only two features. 4.5. Classification results The proposed system employs MSCA-ELM for classification of MR images as healthy or pathological. In this study, the performance of the proposed MSCA-ELM is compared against other learning algorithms such as SCA-ELM, DE-ELM, PSO-ELM, ELM, and BPNN. The activation function used in all the algorithms was kept same for all the algorithms i.e., sigmoidal function and the inputs to the networks were normalized into the range [−1,1]. Further, we set the population size to 20 and the maximum number of iterations to 30 for MSCA-ELM, SCA-ELM, DE-ELM, and PSO-ELM algorithm. The ϵ value in the proposed MSCA-ELM was tested between a range [0.01,0.2] at equally spaced intervals. However, it has been found that the proposed scheme achieves highest performance with ϵ value as 0.02. The parameters r1, r2, r3, and r4 were initialized as follows. r1 is selected using Eq. (6) and q in the equation is set to 2, r2 is a random number in the range [0, 2π], r3 is a random number in the range [0,2], and r4 is a random number in the range [0,1]. The mutation probability (Pm) was set to 0.8 and the step size was set to MAXlimit to 0.1MAXlimit. In case of PSOELM, the value of c1 and c2 were set to 2, while in DE-ELM, the crossover rate (Cr) and scaling factor (fs) were set to 0.7 and 0.8 respectively. Tables 2– 4 show the results obtained by MSCA-ELM, SCA-ELM, DE-ELM, PSO-ELM, ELM and BPNN on three benchmark datasets. From the tables, it is clear that MSCA-ELM outperforms others with less hidden neurons over all the datasets. It can also be noticed that SCA-ELM earns perfect classification on DS-I and DS-II, however, it earns lower accuracy over DS-III. As compared to other algorithms, standard ELM demands more hidden neurons. 
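A single MSCA iteration, with the parameter settings listed above, might look as follows. This is a hedged sketch based on the standard SCA position update of Mirjalili [27]: the Gaussian mutation with a fixed step and the per-candidate mutation test are plausible readings of the paper's modification, not a verbatim reproduction of its algorithm.

```python
import numpy as np

def msca_iteration(pop, best, t, T, q=2.0, pm=0.8, step=0.1, rng=None):
    """One modified sine cosine algorithm (MSCA) step (illustrative sketch).
    pop: (n, d) candidate solutions; best: (d,) best solution found so far.
    r1 decreases linearly (Eq. (6) with q = 2); r2, r3, r4 are drawn from the
    ranges given in the text; a mutation operator perturbs an updated
    candidate with probability pm."""
    rng = rng or np.random.default_rng()
    r1 = q - t * (q / T)                         # linearly decreasing amplitude
    new_pop = np.empty_like(pop)
    for i, x in enumerate(pop):
        r2 = rng.uniform(0.0, 2.0 * np.pi, x.shape)
        r3 = rng.uniform(0.0, 2.0, x.shape)
        r4 = rng.random(x.shape)
        sine = x + r1 * np.sin(r2) * np.abs(r3 * best - x)
        cosine = x + r1 * np.cos(r2) * np.abs(r3 * best - x)
        cand = np.where(r4 < 0.5, sine, cosine)  # SCA position update
        if rng.random() < pm:                    # mutation operator (the "M" in MSCA)
            cand = cand + step * rng.standard_normal(x.shape)
        new_pop[i] = cand
    return new_pop
```

In MSCA-ELM, each candidate encodes the input weights and hidden biases of the SLFN, and fitness is the validation error after the output weights are computed analytically.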
The comparative analyses also indicate that the condition number of the matrix H obtained by the MSCA-ELM, SCA-ELM, DE-ELM, and PSO-ELM algorithms is much smaller than that of the conventional ELM. Therefore, the networks trained by all these algorithms are far better conditioned than the basic ELM. Further, their corresponding norm values are much smaller than those of the basic ELM; hence, these algorithms provide better generalization performance than traditional ELM. Moreover, it can be seen that a smaller norm of wo leads to a smaller condition number. Compared to PSO-ELM, DE-ELM, and SCA-ELM, MSCA-ELM obtains smaller condition and norm values. Therefore, it can be concluded that the proposed MSCA-ELM provides better generalization performance with a compact network structure. It is worth mentioning that the results reported in the tables are the average values of 50 trials, and that the parameters of all the schemes were determined through experimental evaluation.

To further demonstrate the efficacy of the suggested MSCA-ELM classifier, an additional experiment was performed in which its accuracy was compared with other standard classifiers, namely BPNN, KNN, random forest (RF), and SVM, over the three datasets; the results are depicted in Fig. 6. For DS-I, KNN, BPNN, SVM, RF, ELM, and SCA-ELM yield accuracies of 99.39%, 100.00%, 100.00%, 99.85%, 100.00%, and 100.00%, respectively; for DS-II, these classifiers obtain accuracies of 99.44%, 99.94%, 99.88%, 99.81%, 100.00%, and 100.00%, respectively. The accuracies yielded by KNN, BPNN, SVM, RF, ELM, and SCA-ELM on DS-III are 99.25%, 99.37%, 99.49%, 99.41%, 99.49%, and 99.61%, respectively. The proposed algorithm (MSCA-ELM), in contrast, earns ideal classification on DS-I and DS-II, and an accuracy of 99.73% on DS-III. This shows that the proposed algorithm outperforms all other classifiers on DS-III and provides ideal results on the other two datasets.
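The quantities compared in Tables 2–4, the norm of the output weights wo and the condition number κ2(H) of the hidden-layer output matrix, fall out directly from basic ELM training as described by Huang et al. [25]. The following is a minimal sketch (random sigmoidal hidden layer, analytic output weights via the Moore–Penrose pseudo-inverse), not the authors' implementation:

```python
import numpy as np

def elm_fit(X, T, n_hidden, rng=None):
    """Basic ELM sketch: random input weights W and biases b, sigmoidal
    hidden-layer output matrix H, and output weights wo computed
    analytically as wo = pinv(H) @ T. Returns wo together with the norm
    of wo and the condition number kappa_2(H) reported in Tables 2-4."""
    rng = rng or np.random.default_rng()
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))   # input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                 # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # sigmoidal activations
    wo = np.linalg.pinv(H) @ T                           # analytic output weights
    return wo, np.linalg.norm(wo), np.linalg.cond(H)
```

In MSCA-ELM, only W and b are produced by the optimizer instead of being drawn at random; the analytic solve for wo is unchanged, which is why a smaller ||wo|| and a smaller κ2(H) indicate a better-conditioned network.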
Table 3
Performance comparison of different algorithms on DS-II.

Classifier   Acc (%)   Hidden neurons (nh)   Norm      Condition number (κ2)
BPNN         99.88     4                     –         –
ELM          100.00    5                     33.5066   7.8646e+03
PSO-ELM      100.00    3                     17.3990   66.8353
DE-ELM       99.94     3                     20.4078   70.1131
SCA-ELM      100.00    3                     13.6705   51.6276
MSCA-ELM     100.00    3                     10.6514   33.8798

Table 4
Performance comparison of different algorithms on DS-III.

Classifier   Acc (%)   Hidden neurons (nh)   Norm       Condition number (κ2)
BPNN         99.37     4                     –          –
ELM          99.49     5                     102.6282   7.9502e+03
PSO-ELM      99.61     3                     22.0464    103.9707
DE-ELM       99.57     3                     33.7713    121.7516
SCA-ELM      99.61     3                     16.7121    94.0292
MSCA-ELM     99.73     3                     13.5815    72.4073

Fig. 6. Classification accuracy achieved by different classifiers for three datasets.

Table 5 indicates the number of correctly classified MR images obtained by the proposed scheme (FDCT-USFFT + PCA+LDA + MSCA-ELM) over DS-III in each trial of a 10 × 5-fold SCV. It is found that the proposed scheme correctly classifies 2543 MR images out of 2550 samples (2200 pathological and 350 healthy MR images). In particular, 2196 pathological samples are successfully classified by our scheme and the remaining four samples are misclassified as healthy, while 347 healthy MR images are correctly predicted and the remaining three samples are misclassified as pathological. From these results, the sensitivity (Se), specificity (Sp), and precision (Pr) of the proposed scheme are computed as 99.82%, 99.14%, and 99.86%, respectively.
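These three metrics follow directly from the confusion counts stated above (TP = 2196, FN = 4, TN = 347, FP = 3, with pathological taken as the positive class):

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and precision (%) from confusion counts."""
    se = 100.0 * tp / (tp + fn)   # sensitivity: recall on the pathological class
    sp = 100.0 * tn / (tn + fp)   # specificity: recall on the healthy class
    pr = 100.0 * tp / (tp + fp)   # precision: correctness of pathological calls
    return round(se, 2), round(sp, 2), round(pr, 2)

# Counts for DS-III from the 10 x 5-fold SCV described above
print(diagnostic_metrics(2196, 4, 347, 3))  # (99.82, 99.14, 99.86)
```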
Table 5
Correctly classified samples of the proposed scheme on DS-III.

Run     F-1       F-2       F-3       F-4       F-5       Total         Acc (%)
1       51 (51)   51 (51)   51 (51)   51 (51)   51 (51)   255 (255)     100.00
2       51 (51)   51 (51)   51 (51)   51 (51)   50 (51)   254 (255)     99.61
3       50 (51)   51 (51)   51 (51)   51 (51)   51 (51)   254 (255)     99.61
4       51 (51)   51 (51)   51 (51)   51 (51)   51 (51)   255 (255)     100.00
5       51 (51)   50 (51)   51 (51)   51 (51)   51 (51)   254 (255)     99.61
6       51 (51)   51 (51)   51 (51)   51 (51)   51 (51)   255 (255)     100.00
7       51 (51)   51 (51)   50 (51)   51 (51)   51 (51)   254 (255)     99.61
8       51 (51)   51 (51)   51 (51)   51 (51)   51 (51)   255 (255)     100.00
9       50 (51)   51 (51)   51 (51)   51 (51)   51 (51)   254 (255)     99.61
10      51 (51)   51 (51)   51 (51)   49 (51)   51 (51)   253 (255)     99.22
Total                                                     2543 (2550)   99.73

x (y) indicates that x brain images are correctly classified out of y brain images.
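The 10 × 5-fold stratified cross-validation protocol behind Table 5 can be sketched as follows. This is an illustrative harness, not the authors' code; `evaluate_fold` is a hypothetical callback standing in for training and testing the full FDCT-USFFT + PCA+LDA + MSCA-ELM pipeline on one train/test split.

```python
import numpy as np

def stratified_folds(y, k, rng):
    """Partition sample indices into k folds, preserving class proportions."""
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for f, part in enumerate(np.array_split(idx, k)):
            folds[f].extend(part.tolist())
    return [np.array(f) for f in folds]

def run_scv(X, y, evaluate_fold, n_runs=10, k=5, seed=0):
    """n_runs x k-fold stratified cross-validation (SCV) sketch.
    `evaluate_fold(Xtr, ytr, Xte, yte)` is assumed to return the number of
    correctly classified test samples; the grand total over all runs and
    folds corresponds to the bottom-right cell of Table 5."""
    correct = 0
    for r in range(n_runs):
        rng = np.random.default_rng(seed + r)   # a fresh shuffle per run
        folds = stratified_folds(y, k, rng)
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            correct += evaluate_fold(X[train], y[train], X[test], y[test])
    return correct
```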
4.6. Comparison with PCA based scheme

An additional experiment has been performed over the three datasets in order to test the effectiveness of the PCA+LDA approach over PCA alone. The performances of the two schemes, namely FDCT-USFFT+PCA+MSCA-ELM and FDCT-USFFT+PCA+LDA+MSCA-ELM, are listed in Table 6. It may be noticed that the proposed FDCT-USFFT+PCA+LDA+MSCA-ELM scheme achieves better sensitivity, precision, and accuracy than FDCT-USFFT+PCA+MSCA-ELM over all the datasets with relatively fewer features. It can also be observed that FDCT-USFFT+PCA+LDA+MSCA-ELM obtains slightly lower specificity than FDCT-USFFT+PCA+MSCA-ELM on DS-III; however, a CAD system with higher sensitivity offers better clinical performance. Therefore, it can be concluded that the proposed FDCT-USFFT+PCA+LDA+MSCA-ELM scheme holds greater potential for accurate clinical decision-making.
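The PCA → LDA chain compared above can be sketched with scikit-learn. Note this is an illustrative analogue, not the paper's Algorithm 2: scikit-learn's LDA yields at most C − 1 components (a single discriminant direction for this two-class problem), whereas the paper reports two features for its PCA+LDA variant, so the exact reduction differs. The NCSV threshold of 0.95 maps naturally onto PCA's fractional `n_components` cutoff.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_lda_reduce(X_train, y_train, X_test, ncsv=0.95):
    """PCA retains components up to the NCSV (normalized cumulative sum of
    variance) threshold used in the text; LDA then projects the retained
    components onto its class-discriminant direction(s)."""
    pca = PCA(n_components=ncsv, svd_solver="full")   # variance-based cutoff
    Xtr = pca.fit_transform(X_train)
    Xte = pca.transform(X_test)
    lda = LinearDiscriminantAnalysis()
    return lda.fit_transform(Xtr, y_train), lda.transform(Xte)
```

Fitting PCA and LDA on the training portion only, then applying the learned projections to the test portion, avoids leaking test information into the reduced features.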
4.7. Comparison with previous works

To benchmark the performance of the suggested scheme in terms of the number of features required and classification accuracy, an extensive comparison with twenty existing schemes has been performed over the three datasets; the results are shown in Table 7. It is found that most of the earlier PBDSs yield ideal classification on DS-I; however, only three PBDSs, namely RT + PCA + LS-SVM [3], WPTE + FNN + RCBBO [21], and WPTE + GEPSVM [12], offer ideal classification on DS-II. Further, no existing PBDS yields perfect classification over DS-III, whereas our proposed PBDS obtains the highest accuracy of 99.73% compared to the state of the art while requiring the fewest features.

Based on the computational results, we can summarize the key advantages of our scheme: (i) it efficiently captures the texture features from the MR images; (ii) the proposed MSCA method enhances the global search capability via the introduction of a mutation operator; (iii) MSCA-ELM earns better generalization performance and responds faster to unknown testing data; and (iv) it obtains better classification accuracy with the fewest features. The proposed framework has the following limitations, which can be addressed in future work: (i) it was tested on three openly accessible datasets containing images from patients in the late and middle stages of disease, so a larger dataset with images from all stages should be used to validate its generalization performance; (ii) the present study solves a two-class classification problem, whereas multi-class brain disease classification is more challenging; and (iii) MSCA demands more parameters to tune, so investigating an optimization scheme that requires fewer parameters is another possible direction.

Table 6
Classification performance (%) of the proposed scheme based on PCA and PCA+LDA.

Dataset   Metric   FDCT-USFFT+PCA+MSCA-ELM   FDCT-USFFT+PCA+LDA+MSCA-ELM
                   (13 features)             (2 features)
DS-I      Se       100.00                    100.00
          Sp       100.00                    100.00
          Pr       100.00                    100.00
          Acc      100.00                    100.00
DS-II     Se       99.86                     100.00
          Sp       100.00                    100.00
          Pr       100.00                    100.00
          Acc      99.88                     100.00
DS-III    Se       99.64                     99.82
          Sp       99.43                     99.14
          Pr       99.91                     99.86
          Acc      99.61                     99.73
Table 7
Performance comparison with previous works on three standard datasets.

Existing PBDSs                               Feature size   Run   Acc (%)
                                                                  DS-I     DS-II    DS-III
DWT + SVM + POLY [6]                         4761           5     98.00    97.15    96.37
DWT + PCA + BPNN + SCG [2]                   19             5     100.00   98.29    97.14
DWT + PCA + FNN + SCABC [9]                  19             5     100.00   98.93    97.81
DWT + PCA + FNN + ACPSO [8]                  19             5     100.00   98.75    97.38
DWT + PCA + KSVM [10]                        19             5     100.00   99.38    98.82
WPSE + GEPSVM [12]                           16             10    99.85    99.62    98.78
WPTE + GEPSVM [12]                           16             10    100.00   100.00   99.33
WPTE + FNN + RCBBO [21]                      16             10    100.00   100.00   99.49
WE + HMI + GEPSVM [16]                       14             10    100.00   99.56    98.63
DWT + PCA + ADBRF [4]                        13             10    100.00   99.18    98.35
FRFE + WTT + SVM [17]                        12             10    100.00   99.69    98.98
DTCWT + VE + GEPSVM [20]                     12             10    100.00   99.75    99.25
FRFE + WTT + DP-MLP + ARCBBO [18]            12             10    100.00   99.19    98.24
RT + PCA + LS-SVM [3]                        9              5     100.00   100.00   99.39
DWT + PCA + k-NN [7]                         7              5     98.00    97.54    96.79
FPCNN + DWT + PCA + FNN [1]                  7              10    100.00   98.88    98.43
SWT + PCA + GEPSVM [13]                      7              10    100.00   99.62    99.02
WE + NBC [15]                                7              10    92.58    91.87    90.51
DWT + PCA + LDA + RF [14]                    7              10    100.00   99.75    99.14
MBD + SLFN + IPSO [19]                       5              10    100.00   98.19    98.08
FDCT-USFFT + PCA + MSCA-ELM                  13             10    100.00   99.88    99.61
FDCT-USFFT + PCA+LDA + MSCA-ELM (Proposed)   2              10    100.00   100.00   99.73
5. Conclusion

In this paper, we have proposed an efficient system for detection of pathological brain. The system derives features using the fast discrete curvelet transform via unequally-spaced fast Fourier transform. For classification, we have proposed a hybrid algorithm, the modified sine cosine algorithm - extreme learning machine (MSCA-ELM). The modified sine cosine algorithm introduces a mutation operator that helps in optimizing the hidden node parameters of the extreme learning machine. The simulation results over three standard datasets demonstrate that the proposed system obtains higher accuracy than other competent schemes with the fewest features. Moreover, the proposed classifier obtains good generalization performance, and the network trained by it is well conditioned. The proposed learning algorithm can be applied to multi-class classification and regression problems. The proposed system has been validated over various accessible datasets of smaller size; a larger dataset collected online would further prove its effectiveness. In future work, we plan to hybridize other promising meta-heuristic algorithms with the extreme learning machine to improve the performance, to apply deep learning algorithms on a larger dataset, and to investigate the images of the BrainWeb database.

References

[1] El-Dahshan EA, Mohsen HM, Revett K, Salem ABM. Computer-aided diagnosis of human brain tumor through MRI: a survey and a new algorithm. Expert Syst Appl 2014;41(11):5526–45.
[2] Zhang Y, Dong Z, Wu L, Wang S. A hybrid method for MRI brain image classification. Expert Syst Appl 2011;38(8):10049–53.
[3] Das S, Chowdhury M, Kundu K. Brain MR image classification using multiscale geometric analysis of ripplet. Progr Electromagn Res 2013;137:1–17.
[4] Nayak DR, Dash R, Majhi B. Brain MR image classification using two-dimensional discrete wavelet transform and AdaBoost with random forests. Neurocomputing 2016;177:188–97.
[5] Nayak DR, Dash R, Majhi B, Prasad V. Automated pathological brain detection system: a fast discrete curvelet transform and probabilistic neural network based approach. Expert Syst Appl 2017;88:152–64.
[6] Chaplot S, Patnaik LM, Jagannathan NR. Classification of magnetic resonance brain images using wavelets as input to support vector machine and neural network. Biomed Signal Process Control 2006;1(1):86–92.
[7] El-Dahshan ESA, Honsy T, Salem ABM. Hybrid intelligent techniques for MRI brain images classification. Digit Signal Process 2010;20(2):433–41.
[8] Zhang Y, Wang S, Wu L. A novel method for magnetic resonance brain image classification based on adaptive chaotic PSO. Progr Electromagn Res 2010;109:325–43.
[9] Zhang Y, Wu L, Wang S. Magnetic resonance brain image classification by an improved artificial bee colony algorithm. Progr Electromagn Res 2011;116:65–79.
[10] Zhang Y, Wu L. An MR brain images classifier via principal component analysis and kernel support vector machine. Progr Electromagn Res 2012;130:369–88.
[11] Saritha M, Joseph KP, Mathew AT. Classification of MRI brain images using combined wavelet entropy based spider web plots and probabilistic neural network. Pattern Recognit Lett 2013;34(16):2151–6.
[12] Zhang Y, Dong Z, Wang S, Ji G, Yang J. Preclinical diagnosis of magnetic resonance (MR) brain images via discrete wavelet packet transform with Tsallis entropy and generalized eigenvalue proximal support vector machine (GEPSVM). Entropy 2015;17(4):1795–813.
[13] Zhang Y, Dong Z, Liu A, Wang S, Ji G, Zhang Z, Yang J. Magnetic resonance brain image classification via stationary wavelet transform and generalized eigenvalue proximal support vector machine. J Med Imaging Health Inform 2015;5(7):1395–403.
[14] Nayak DR, Dash R, Majhi B. Classification of brain MR images using discrete wavelet transform and random forests. Fifth national conference on computer vision, pattern recognition, image processing and graphics (NCVPRIPG). IEEE; 2015. p. 1–4.
[15] Zhou X, Wang S, Xu W, Ji G, Phillips P, Sun P, Zhang Y. Detection of pathological brain in MRI scanning based on wavelet-entropy and naive Bayes classifier. Bioinformatics and biomedical engineering. 2015. p. 201–9.
[16] Zhang Y, Wang S, Sun P, Phillips P. Pathological brain detection based on wavelet entropy and Hu moment invariants. Biomed Mater Eng 2015;26(s1):S1283–S1290.
[17] Wang S, Zhang Y, Yang X, Sun P, Dong Z, Liu A, Yuan TF. Pathological brain detection by a novel image feature fractional Fourier entropy. Entropy 2015;17(12):8278–96.
[18] Zhang Y, Sun Y, Phillips P, Liu G, Zhou X, Wang S. A multilayer perceptron based smart pathological brain detection system by fractional Fourier entropy. J Med Syst 2016;40(7):1–11.
[19] Zhang YD, Chen XQ, Zhan TM, Jiao ZQ, Sun Y, Chen ZM, Yao Y, Fang LT, Lv YD, Wang SH. Fractal dimension estimation for developing pathological brain detection system based on Minkowski–Bouligand method. IEEE Access 2016;4:5937–47.
[20] Wang S, Lu S, Dong Z, Yang M, Zhang Y. Dual-tree complex wavelet transform and twin support vector machine for pathological brain detection. Appl Sci 2016;6(6):169.
[21] Wang S, Li P, Chen P, Phillips P, Liu G, Du S, Zhang Y. Pathological brain detection via wavelet packet Tsallis entropy and real-coded biogeography-based optimization. Fundam Inform 2017;151(1–4):275–91.
[22] Johnson KA, Becker JA. The whole brain atlas. http://www.med.harvard.edu/AANLIB/.
[23] Candes E, Demanet L, Donoho D, Ying L. Fast discrete curvelet transforms. Multiscale Model Simul 2006;5(3):861–99.
[24] Martínez AM, Kak AC. PCA versus LDA. IEEE Trans Pattern Anal Mach Intell 2001;23(2):228–33.
[25] Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing 2006;70(1):489–501.
[26] Huang GB, Zhou H, Ding X, Zhang R. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst, Man, Cybern, Part B (Cybern) 2012;42(2):513–29.
[27] Mirjalili S. SCA: a sine cosine algorithm for solving optimization problems. Knowl Based Syst 2016;96:120–33.
[28] Elaziz MA, Oliva D, Xiong S. An improved opposition-based sine cosine algorithm for global optimization. Expert Syst Appl 2017;90:484–500.
[29] Zhu QY, Qin AK, Suganthan PN, Huang GB. Evolutionary extreme learning machine. Pattern Recognit 2005;38(10):1759–63.
[30] Zhao G, Shen Z, Miao C, Man Z. On improving the conditioning of extreme learning machine: a linear case. 7th international conference on information, communications and signal processing. ICICS, IEEE; 2009. p. 1–5.

Deepak Ranjan Nayak is currently pursuing a PhD in Computer Science and Engineering at the National Institute of Technology, Rourkela, India. His current research interests include medical image analysis, pattern recognition, and cellular automata. He serves as a reviewer for several international journals and conferences. He is a student member of the IEEE.

Ratnakar Dash is currently working as an Assistant Professor in the Department of Computer Science and Engineering at the National Institute of Technology, Rourkela, India. His fields of interest include signal processing, image processing, and steganography. He is a professional member of the IEEE, IE, and CSI. He has authored more than 50 research papers.

Banshidhar Majhi is a full Professor at the Department of Computer Science and Engineering of the National Institute of Technology, Rourkela, India. His research interests include image processing, data compression, cryptography and security, soft computing, and biometrics. He has co-authored more than 150 journal and conference papers. He has served as a reviewer for many international journals and conferences.

Shuihua Wang is currently an Assistant Professor at the School of Electronic Science and Engineering, Nanjing University, China. Her research interests focus on machine learning, deep learning, and biomedical image processing. She has published over 30 papers in peer-reviewed international journals and conferences. She serves as an editor and reviewer for many well-reputed journals and conferences. She is a member of the IEEE.