Conditional mutual information-based feature selection for congestive heart failure recognition using heart rate variability

Computer Methods and Programs in Biomedicine 108 (2012) 299–309
Sung-Nien Yu*, Ming-Yuan Lee
Department of Electrical Engineering, National Chung Cheng University, Chia-Yi County, Taiwan

Article history: Received 2 March 2011; received in revised form 17 December 2011; accepted 22 December 2011.

Keywords: Feature selection; Mutual information; Congestive heart failure; Heart rate variability

Abstract

Feature selection plays an important role in pattern recognition systems. In this study, we explored the problem of selecting effective heart rate variability (HRV) features for recognizing congestive heart failure (CHF) based on mutual information (MI). The MI-based greedy feature selection approach proposed by Battiti was adopted in the study. The mutual information conditioned by the first-selected feature was used as a criterion for feature selection. The uniform distribution assumption was used to reduce the computational load, and a logarithmic exponent weighting was added to model the relative importance of the MI with respect to the number of already-selected features. The CHF recognition system contained a feature extractor that generated four categories of features, 50 in total, from the input HRV sequences. The proposed feature selector, termed UCMIFS, proceeded to select the most effective features for the succeeding support vector machine (SVM) classifier. Prior to feature selection, the 50 features produced a high accuracy of 96.38%, which confirmed the representativeness of the original feature set. The performance of the UCMIFS selector was demonstrated to be superior to that of the other MI-based feature selectors, including MIFS-U, CMIFS, and mRMR. When compared to other outstanding selectors published in the literature, the proposed UCMIFS outperformed them with an accuracy as high as 97.59% in recognizing CHF using only 15 features. The results demonstrated the advantage of using the recruited features in characterizing HRV sequences for CHF recognition. The UCMIFS selector further improved the efficiency of the recognition system with substantially lowered feature dimensions and an elevated recognition rate.

© 2012 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Heart rate variability (HRV) is a widely used tool for studying cardiovascular diseases and the influence of various afflictions. Recently, numerous studies have focused on using HRV measurements for diagnostic purposes, especially for recognizing congestive heart failure (CHF) from normal sinus rhythm (NSR) [1–3]. CHF is a sign of cardiac morbidity: a dysfunction of the cardiovascular system in which the heart is unable to pump blood adequately. CHF is usually accompanied by chest tightness, abdominal swelling, and difficulty breathing. However, patients usually do not experience pain in daily life, so the symptoms may be overlooked.

∗ Corresponding author at: Department of Electrical Engineering, National Chung Cheng University, 168 University Road, Ming-Hsiung Township, Chia-Yi County 621, Taiwan. Tel.: +886 5 2720411x33205; fax: +886 5 2720862. E-mail address: [email protected] (S.-N. Yu). 0169-2607/$ – see front matter © 2012 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2011.12.015


In recent years, numerous methods have been developed to recognize CHF on the basis of HRV [4–6]. In these studies, different categories of features calculated from long-term HRV sequences were recruited in an attempt to improve the performance of the classifier. This practice results in increased feature dimensions and an elevated computational load. It therefore becomes important to select the most representative features from the original feature set, such that the recognition rate is retained with considerably reduced feature dimensions.

In practice, the optimal subset of features is usually unknown, and it is common to have irrelevant or redundant features at the beginning of a pattern classification task. To tackle this problem, two main dimension-reduction approaches, namely feature extraction and feature selection, are usually applied [4]. Feature extraction creates new features based on transformations or weighted combinations of the original feature set. In contrast, feature selection refers to methods that select the best subset of features from the original feature set. Feature selection can be further categorized into filters and wrappers [5]. A filter relies on a predefined performance measure that is independent of the subsequent classifier. A wrapper, on the other hand, requires a specific learning machine and uses its classification accuracy as the performance measure to search for an optimal feature subset. Although wrappers usually produce better accuracy than filters, they are criticized as being computationally expensive and prone to over-fitting to specific classifiers. Consequently, filters are often preferred to wrappers.

A number of measures, such as distance [6,7], correlation [8], and mutual information (MI) [9], have been applied in filters to evaluate the efficacy of a feature. Linear discriminant analysis between features and classes [10], the fast correlation-based filter using an approximate Markov blanket for feature relevance calculation [11], and filters using entropy and other information-theoretic concepts [11] are some examples of feature selection successfully applied in clinical practice. Among these measures, mutual information (MI) has been reported to be effective in selecting features for a broad category of pattern classification problems [9,11]. The main advantages of using MI as a criterion for feature selection are twofold. First, MI is capable of measuring relationships among attributes, and between attributes and classes, that may not be easily characterized by other measures. Second, MI is invariant under space transformations. These advantages distinguish MI from other measures. In this study, we tackle the problem of how to improve the approximation of MI in a high-dimensional feature space and how to effectively use MIs as criteria for selecting the most representative features for CHF recognition.

Battiti is one of the major pioneers who applied a greedy algorithm based on MI to select relevant features from the original feature set [9]. The greedy algorithm sequentially selects optimal features from the remaining feature set. The criterion for selecting the next feature is based on maximizing the conditional mutual information between the candidate feature and the class attribute. It is apparent that this process

is complicated and computationally expensive as the number of features increases. To cope with these problems, Battiti's algorithm, termed mutual information feature selection (MIFS), approximates the conditional MI with the summation of the paired MIs between the candidate feature and each of the features in the already-selected feature subset. However, a great deal of information is lost with this approximation.

In view of MIFS's potential in feature selection, several attempts have been made to improve its performance. Kwak and Choi [12] assumed a uniform distribution of the information of the input features and proposed the MIFS-U algorithm, which remedies the neglect of the joint information term in MIFS. Cheng et al. [13] proposed a conditional mutual information feature selector (CMIFS), which conditions the calculation of mutual information on the first selected feature and also takes into account the last feature selected just prior to the candidate feature. In this manner, the conditional MI required in MIFS is more reasonably approximated. Other techniques, including min-redundancy max-relevance (mRMR) [14] and normalized mutual information feature selection (NMIFS) [15], were also proposed to improve the performance of MIFS.

Inspired by MIFS-U and CMIFS, we propose to modify CMIFS and to apply the uniform distribution approximation exploited in MIFS-U to simplify the calculation of the conditional MI. The result is a modified conditional mutual information feature selector with a uniform distribution assumption (UCMIFS). First, considering the significance of the first-selected feature f1 in the greedy algorithm, we employ the mutual information conditioned by f1 in the approximation; however, unlike CMIFS, all the features in the already-selected feature subset, instead of only the last feature, are considered. Second, the uniform distribution assumption is recruited to simplify the calculation. Finally, a weighting parameter, represented as a logarithmic function of the number of already-selected features, is added to model the relative importance of the individual terms in the calculation of the MIs. The original feature set contains typical features, including personal data and features calculated from the time statistics, Poincare plots, and frequency-domain distribution of the RR interval (RRI) sequences, as well as features calculated from the third-cumulant spectra (bispectra) [16] of the RRI sequences. In this study, the efficiency of the proposed UCMIFS algorithm in selecting features for CHF recognition is evaluated and compared to that of other MI-based feature selectors. The performance of the proposed system is also compared to that of other outstanding CHF classifiers published in the literature.

Section 2 reviews the background of using MI for feature selection and discusses some of the popular and effective MI-based feature selectors. Section 3 presents the modified conditional mutual information feature selector with uniform distribution assumption (UCMIFS). Section 4 establishes the original feature set applied in this study. Section 5 describes the experimental design and Section 6 presents the experimental results with some critical discussions. Section 7 provides further discussion, and conclusions are drawn in Section 8.

2. Theoretical background

2.1. Entropy and mutual information related to feature selection

Shannon's information theory provides an approach to quantify the information of random variables with entropy and mutual information (MI). In this section, we summarize the theoretical background required to calculate the MIs between features, and between features and classes, as quantitative measures for feature selection. Please refer to [9,14] for details.

Assume p(fi) represents the probability density function (pdf) of fi; the entropy of a feature fi is defined as

H(fi) = −Σ_{fi} p(fi) log p(fi)    (1)

The conditional entropy of fi given another feature fj is defined as

H(fi|fj) = −Σ_{fj} p(fj) Σ_{fi} p(fi|fj) log p(fi|fj),  i ≠ j    (2)

where p(fi|fj) is the conditional pdf of fi given the value of fj. The mutual information between two features fi and fj can be calculated from Eqs. (1) and (2) as follows:

I(fi; fj) = H(fi) − H(fi|fj)    (3)

When considering the relationship, in terms of MI, between a feature fi and the output class C, we need the supplementary quantities I(fi; C), H(C), and H(C|fi). Assume the class set C contains m classes, C = {c1, c2, ..., cm}; the entropy of C is defined as

H(C) = −Σ_{l=1..m} p(cl) log p(cl)    (4)

Then, the conditional entropy of C given the knowledge of the feature set F is

H(C|F) = −Σ_{fi∈F} p(fi) Σ_{l=1..m} p(cl|fi) log p(cl|fi)    (5)

where p(cl), l = 1, ..., m, is the probability of class cl and p(cl|fi) is the pdf of cl given fi. The MI between F and C is

I(C; F) = H(C) − H(C|F)    (6)

Eq. (3) formulates the quantitative relationship (mutual information, MI) between two features and Eq. (6) represents the MI between a feature set F and the output class C. Fano [17] has pointed out that maximizing the MI between the features and the target achieves a lower bound on the probability of error. Therefore, if we intend to optimally select a subset S from the initial feature set F, we are led to the typical FRn-k problem, which can be formally formulated as follows [9].

Definition FRn-k. Given an initial set F with n features and the class set C, the subset S ⊂ F with k features is said to be the optimal feature subset iff I(C;S) is maximized or H(C|S) is minimized [9].
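As a concrete illustration of Eqs. (1)–(6), the sketch below estimates entropy and mutual information from histogram counts, in the same spirit as the histogram-based probability estimates used later in this paper. It is a minimal example and not the authors' code; the bin count, the use of base-2 logarithms, and the toy data are illustrative assumptions.

    import numpy as np

    def entropy_from_counts(counts):
        """Shannon entropy of a discretized variable, as in Eqs. (1) and (4)."""
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y, bins=10):
        """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Eqs. (3) and (6)."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        h_x = entropy_from_counts(joint.sum(axis=1))
        h_y = entropy_from_counts(joint.sum(axis=0))
        h_xy = entropy_from_counts(joint.ravel())
        return h_x + h_y - h_xy

    # toy usage: MI between a noisy feature and a binary class label
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)
    feature = labels + 0.5 * rng.standard_normal(200)
    print(mutual_information(feature, labels, bins=8))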

2.2. Greedy feature selection algorithms for the FRn-k problem

Finding the optimal feature subset requires generating all combinations of features and comparing the MI between each subset and C. This is practically infeasible because of the large number of feature combinations. Therefore, the greedy feature selection algorithm was proposed to solve the FRn-k problem [9]. Starting from an empty set of selected features S, the best available features are added to S one by one until a preset number of k features is reached. From the initial feature set F and the already-selected feature set S, the next feature fi to be selected is the one that maximizes I(C; fi, S), where fi ∈ {F − S}. By using the chain rule for information [18], I(C; fi, S) can be formulated as follows:

I(C; fi, S) = I(C; S) + I(C; fi|S)    (7)

For a given feature subset S, I(C;S) is a constant, and the maximization of I(C; fi, S) is equivalent to the maximization of the conditional mutual information I(C; fi|S), which can be reformulated as [13]

I(C; fi|S) = I(C; fi) − [I(fi; S) − I(fi; S|C)]    (8)

The calculation of mutual information between a candidate feature fi and the already-selected feature set S is a demanding, and almost impossible, task, especially when S is of high dimension. Battiti [9] is probably the pioneer who applied a greedy algorithm based on mutual information to estimate the conditional mutual information. Battiti proposed an algorithm, termed MIFS, which estimates the MI between the candidate feature fi and the already-selected feature set S with the summation of the MIs between fi and each of the features in S, such that I(C; fi|S) is approximated as

I(C; fi|S) ≈ I(C; fi) − β Σ_{fs∈S} I(fs; fi)    (9)

where the parameter β is a quality factor that regulates the relative importance between the candidate feature fi and the already-selected features fs ∈ S. The MIFS algorithm works well. However, MIFS only approximates the first term in the square brackets of Eq. (8) and ignores the second term. To tackle this problem, Kwak and Choi [12] modified the MIFS algorithm under the assumption that the information of the input features is distributed uniformly, so that the ratio of I(fi; S|C) to I(fi; S) is equal to the ratio of H(S|C) to H(S), i.e.

I(fi; S|C) / I(fi; S) = H(S|C) / H(S)    (10)

and

I(fi; S|C) = [H(S|C) / H(S)] I(fi; S)    (11)

They proposed an algorithm, termed MIFS-U, and estimated the conditional MI I(C; fi|S) with

I(C; fi) − β Σ_{fs∈S} [I(C; fs) / H(fs)] I(fs; fi)    (12)

Other MI-based methods, such as the feature selector based on the min-redundancy max-relevance criterion (mRMR) [14] and the normalized mutual information feature selector (NMIFS) [15], were also proposed to improve the performance of MIFS by modifying Battiti's approximation in Eq. (9).

Both the MIFS and MIFS-U algorithms consider only the MI components between the candidate feature fi and each of the features in the already-selected feature subset S. To more comprehensively consider the MI between fi and S, Cheng et al. [13] reformulated Battiti's framework and proposed a conditional MIFS (CMIFS) that recruited the first and the last features selected into the already-selected feature subset S and used them as criteria for selecting the next feature. With f1 and fk representing the first and last features selected into S, the conditional MI of a candidate feature fi ∈ {F − S} was approximated as

I(C; fi|S) ≈ I(C; fi|f1) − β I(fi; fk|f1)    (13)

The inventors claimed that CMIFS could detect the relevant feature combinations to some degree with low memory storage and computation cost [13].

3. UCMIFS—the proposed modified conditional mutual information feature selector with uniform distribution assumption

The idea, proposed in the CMIFS algorithm, of using the first and the last features selected into S for calculating the MI is inspirational. The first feature f1 is definitely the most significant feature in the selected feature set. Compared with MIFS, which considers only the paired relationships between the candidate feature and each of the features in the already-selected feature set S, recruiting f1 in the approximation of I(C; fi|S) undoubtedly enhances the precision of the estimation. However, the use of only the last feature fk selected into S as the second criterion may be unconvincing. In reality, when the number of features in S increases, the significance and contribution of the last selected feature fk to the mutual information decreases dramatically. Consequently, it is favorable to consider all the features in S, as in MIFS, while recruiting the first feature f1 in the calculation of I(C; fi|S) for the selection of the next feature:

I(C; fi|S) ≈ I(C; fi|f1) − β Σ_{fs∈S} I(fi; fs|f1)    (14)

The MIs in the approximation can be calculated by applying the information chain rule, such that

I(C; fi|f1) = I(C; fi) − [I(fi; f1) − I(fi; f1|C)]    (15)

and

I(fi; fs|f1) = I(fi; fs) − [I(fi; f1) − I(fi; f1|fs)]    (16)

Moreover, to cope with the rise in calculation load caused by this modification, we adopted the uniform distribution assumption proposed in MIFS-U to simplify the calculation. Assume the information of the input features is distributed uniformly, such that

I(fi; f1|C) / I(fi; f1) = H(f1|C) / H(f1)    (17)

and

I(fi; f1|fs) / I(fi; f1) = H(f1|fs) / H(f1)    (18)

Thus,

I(fi; f1|C) = [H(f1|C) / H(f1)] I(fi; f1)    (19)

and

I(fi; f1|fs) = [H(f1|fs) / H(f1)] I(fi; f1)    (20)

By substituting Eqs. (19) and (20) into Eqs. (15) and (16) and applying some simple manipulations based on the information-theoretic identity for two variables X and Y,

H(X) − H(X|Y) = I(X; Y)    (21)

the two equations can be reformulated as

I(C; fi|f1) = I(C; fi) − [I(C; f1) / H(f1)] I(fi; f1)    (22)

and

I(fi; fs|f1) = I(fi; fs) − [I(fs; f1) / H(f1)] I(fi; f1)    (23)

Further substituting Eqs. (22) and (23) into Eq. (14), the approximation of the conditional MI is reformulated as

I(C; fi|S) ≈ I(C; fi) − [I(C; f1) / H(f1)] I(fi; f1) − β Σ_{fs∈S} { I(fi; fs) − [I(fs; f1) / H(f1)] I(fi; f1) }    (24)

When compared with the original MIFS algorithm, the proposed method, besides considering the paired MIs between fi and each of the features in the already-selected feature subset S, also takes into account the terms associated with the first feature f1 selected into S, which is believed to be the most significant in providing information for classification.

In both Eqs. (9) and (12), the parameter β is set heuristically to weight the relative contribution of the mutual information

between the candidate feature fi and the already-selected feature subset S. According to information theory, the mutual information of a feature selected by the greedy algorithm decreases dramatically with the number of features in S, because of the rise in correlation with the features already in S. To cope with this phenomenon, Huang et al. [19] proposed a logarithmic function to model this tendency of variation in the second summative term. We adopt a similar methodology to model the MI variations in the third and fourth summative terms. The β parameter is modified as β^(log2 |S|), where |S| is the number of features in S, which operates as a forgetting factor that reduces the contribution of the third and fourth summative terms as the number of features selected into S increases. The base of the logarithm (two) was determined empirically to achieve satisfactory results. After this modification, the criterion for selecting the next feature in the greedy algorithm is the remaining feature that contributes the largest I(C; fi|S), which is approximated as

I(C; fi|S) ≈ I(C; fi) − [I(C; f1) / H(f1)] I(fi; f1) − β^(log2 |S|) Σ_{fs∈S} { I(fi; fs) − [I(fs; f1) / H(f1)] I(fi; f1) }    (25)

In summary, the proposed UCMIFS algorithm includes the following steps:

Step 1: (Initialization) Set F ← "initial set of n features", S ← "empty set".
Step 2: (Computation of the MI with the output class) ∀fi ∈ F, compute I(C; fi).
Step 3: (Selection of the first feature) Find the feature fi that maximizes I(C; fi); set F ← F\{fi}, S ← {fi}.
Step 4: (Greedy algorithm) Repeat until |S| = k:
  (a) Computation of the MI between variables: for all pairs (fi, fs) with fi ∈ F and fs ∈ S, compute I(fi; fs) if it is not yet available.
  (b) Selection of the next feature: choose the feature fi ∈ F that maximizes the criterion of Eq. (25).
  (c) Set F ← F\{fi}; set S ← S ∪ {fi}.
Step 5: Output the set S containing the selected features.

The probabilities of the features and classes required to calculate the (conditional) MI were estimated by first partitioning the histogram of each variable into equal segments and then calculating the relative frequencies.
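To make Steps 1–5 concrete, the sketch below re-implements the greedy loop with the Eq. (25) criterion. It is a schematic reading of the algorithm rather than the authors' code: the histogram-based estimators (entropy, mutual_info), the bin count, and the default β = 0.8 are assumptions, and the redundancy term is damped by the forgetting factor β^(log2 |S|) exactly as written above.

    import numpy as np
    from math import log2

    def _entropy_from_counts(counts):
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    def entropy(v, bins=10):
        counts, _ = np.histogram(v, bins=bins)
        return _entropy_from_counts(counts)

    def mutual_info(a, b, bins=10):
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        return (_entropy_from_counts(joint.sum(axis=1))
                + _entropy_from_counts(joint.sum(axis=0))
                - _entropy_from_counts(joint.ravel()))

    def ucmifs_select(X, y, k, beta=0.8, bins=10):
        """Greedy UCMIFS selection (Steps 1-5), scoring candidates with Eq. (25)."""
        remaining = list(range(X.shape[1]))
        # Steps 2-3: the first feature maximizes I(C; f_i)
        f1 = max(remaining, key=lambda i: mutual_info(X[:, i], y, bins))
        remaining.remove(f1)
        selected = [f1]
        h_f1 = entropy(X[:, f1], bins)
        i_c_f1 = mutual_info(X[:, f1], y, bins)
        # Step 4: add one feature at a time until |S| = k
        while len(selected) < k and remaining:
            damp = beta ** log2(len(selected))      # forgetting factor beta**log2(|S|)
            def score(i):
                i_fi_f1 = mutual_info(X[:, i], X[:, f1], bins)
                relevance = mutual_info(X[:, i], y, bins) - (i_c_f1 / h_f1) * i_fi_f1
                redundancy = sum(mutual_info(X[:, i], X[:, s], bins)
                                 - (mutual_info(X[:, s], X[:, f1], bins) / h_f1) * i_fi_f1
                                 for s in selected)
                return relevance - damp * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected                              # Step 5

Under these assumptions, calling ucmifs_select(F, y, k=15) on a normalized feature matrix would return 15 feature indices in selection order for the SVM stage described in Section 5.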

4. Features exploited in this study

Two categories of features associated with heart rate variability (HRV) were used in this study to test the feature selection power of the proposed greedy algorithm in discriminating congestive heart failure (CHF) from normal sinus rhythm (NSR). The first category contained typical features, including personal data and features calculated from the time statistics, Poincare plot, and frequency-domain distribution of the RR interval (RRI) sequences. The second category of features was calculated from the third-cumulant spectra (bispectra) of the RRI sequences, which had been demonstrated to be powerful in characterizing HRV for recognizing cardiovascular diseases [20].

4.1. Typical features

Two items of personal data, the age and gender (female or male) of the subject, were included as regular features. A value of "0" or "1" was assigned to the subject's gender feature for female or male, respectively.

The time-domain HRV features were calculated by applying statistical analysis to the RRI sequences. A total of nine time-domain features were recruited in the study, including (1) the mean of all RR intervals (Mean), (2) the standard deviation of all RR intervals (SDRR), (3) the root mean square of the differences between adjacent RR intervals (RMSSD), (4) the number of adjacent RR intervals differing by more than +50 ms in the entire data (NN50-1), (5) the number of adjacent RR intervals differing by more than −50 ms in the entire data (NN50-2), (6) the sum of NN50-1 and NN50-2 divided by the total number of RR intervals (pNN50), (7) the standard deviation of the differences between adjacent RR intervals (SDSD), (8) the standard deviation of the averages of the RR intervals in all 5-min segments of the entire data (SDARR), and (9) the mean of the standard deviations of the RR intervals in all 5-min segments of the entire data (SDRR ind).

The Poincare plot arranges the RRI sequence into pairs of adjacent RR intervals and plots them on a two-dimensional plane. With the present and the next RR intervals on the x and y axes, respectively, the Poincare plot transforms the one-dimensional RRI sequence into a two-dimensional relationship between adjacent RRI data. An ellipse can be fitted to these data points. The width (SD1) and length (SD2) of the ellipse were then calculated from two of the time-domain features, namely (1) the standard deviation of the differences between adjacent RR intervals (SDSD) and (2) the standard deviation of all RR intervals (SDRR).

The RRI sequences were also represented in the frequency domain. The power in the low (0.04–0.15 Hz) and high (0.15–0.4 Hz) frequency bands is believed to be closely related to the physiological activities of the autonomic nervous system (ANS) [21]. After re-sampling the RRI sequence at 4 Hz and applying a detrending preprocessor to remove ectopic beats and baseline wander (trend), the Fourier transform was used to calculate the power spectral density of the RRI sequence. The powers of the LF and HF bands were calculated and denoted as the power of the low frequency band (PLF) and the power of the high frequency band (PHF), respectively. Additionally, PLF and PHF were normalized to their sum and the resulting parameters were denoted as NLF and NHF, respectively. The ratio between PHF and PLF, represented as Ratio, was also used to characterize the relative contribution of the parasympathetic and sympathetic neural branches of the ANS.

In summary, the features of the first category included two items of personal data, nine features calculated from time statistics, two from the Poincare plot, and five from frequency-domain analysis, for a total of 18 typical features, as summarized in Table 1.
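To make these definitions concrete, the sketch below computes a few of the Table 1 features from an RR-interval sequence given in seconds. It is a plausible reading of the descriptions above rather than the authors' implementation: the Welch PSD, the spline resampling details, and the SD1/SD2 formulas are common conventions assumed here.

    import numpy as np
    from scipy import interpolate, signal

    def typical_hrv_features(rri):
        """A few of the Table 1 features from an RR-interval sequence (seconds)."""
        diff = np.diff(rri)
        feats = {
            "Mean": np.mean(rri),
            "SDRR": np.std(rri, ddof=1),
            "SDSD": np.std(diff, ddof=1),
            "RMSSD": np.sqrt(np.mean(diff ** 2)),
            "NN50-1": np.sum(diff > 0.05),      # successive increase of more than +50 ms
            "NN50-2": np.sum(diff < -0.05),     # successive decrease of more than -50 ms
        }
        feats["pNN50"] = (feats["NN50-1"] + feats["NN50-2"]) / len(rri)
        # Poincare ellipse axes derived from SDSD and SDRR (one common formulation)
        feats["SD1"] = feats["SDSD"] / np.sqrt(2)
        feats["SD2"] = np.sqrt(2 * feats["SDRR"] ** 2 - feats["SD1"] ** 2)
        # frequency-domain features: resample the tachogram at 4 Hz, then estimate the PSD
        t = np.cumsum(rri)
        tck = interpolate.splrep(t, rri)
        t_even = np.arange(t[0], t[-1], 1 / 4.0)
        rri_even = interpolate.splev(t_even, tck)
        f, pxx = signal.welch(rri_even - np.mean(rri_even), fs=4.0, nperseg=1024)
        lf = (f >= 0.04) & (f < 0.15)
        hf = (f >= 0.15) & (f < 0.40)
        plf, phf = np.trapz(pxx[lf], f[lf]), np.trapz(pxx[hf], f[hf])
        feats.update({"PLF": plf, "PHF": phf,
                      "NLF": plf / (plf + phf), "NHF": phf / (plf + phf),
                      "Ratio": plf / phf})       # one convention; the paper leaves the orientation implicit
        return feats

In practice, the artifact removal and detrending described in Section 5 would be applied before computing these quantities.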


Table 1 – Summary of the typical features used in the study.

Feature extraction method | Features
Personal data | 1. AGE, 2. GENDER
Time statistical | 3. SDRR, 4. SDSD, 5. RMSSD, 6. NN50-1, 7. pNN50, 8. NN50-2, 9. Mean, 10. SDANN, 11. SDRR index
Poincare plot | 12. SD1, 13. SD2
Frequency domain analysis | 14. PLF, 15. PHF, 16. Ratio, 17. NLF, 18. NHF

Fig. 1 – Segmentation of the bispectrum for subband feature calculation.

4.2. Features calculated from bispectrum

The bispectrum is the Fourier transform of the third-order cumulant, which provides information supplementary to the power spectrum [22]. To extract information from the bispectrum for characterizing HRV sequences, it is valuable to take into account the changes in subband components due to autonomic nervous control of the heart rate. As has been emphasized in the literature [21,22], the HRV components in the low-frequency (LF; 0.04–0.15 Hz) and high-frequency (HF; 0.15–0.4 Hz) bands reflect the controlled and balanced behavior of the two branches, i.e. sympathetic and parasympathetic, of the autonomic nervous system. Fig. 1 specifies the region of interest (ROI) inside the non-redundant bispectral triangle that contains the components spanning the 0.04–0.4 Hz frequency range on the two axes f1 and f2. We further segmented three subband regions inside the ROI according to the LF and HF bands, namely LF-LF (LL), LF-HF (LH), and HF-HF (HH). Eight features were calculated from each of the three subbands and from the ROI to characterize the bispectrum. The eight bispectrum-related features were the average magnitude of the bispectrum (Mavg), the average power of the bispectrum (Pavg), the normalized bispectral entropy (E1), the normalized bispectral squared entropy (E2) [23], the sum of the logarithmic magnitudes of the bispectrum (H1), the sum of the logarithmic magnitudes of the diagonal elements (H2) [16], and the two frequency-axis coordinates of the weight center of the bispectrum (Z1m and Z2m) [24]. Consequently, a total of 8 (features) × 4 (regions) = 32 bispectrum-related features were calculated from the ROI and the subband regions of the HRV bispectrum.
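A rough sketch of how such bispectral descriptors might be obtained is given below, using a direct (segment-averaged FFT) bispectrum estimate and two of the eight descriptors (Mavg and E1) over a rectangular subband. The segment length, window, and entropy normalization are illustrative assumptions and not the exact settings of [16,23,24].

    import numpy as np

    def bispectrum(x, seg_len=256, fs=4.0):
        """Direct bispectrum estimate: average of X(f1) X(f2) X*(f1+f2) over segments."""
        n_seg = max(len(x) // seg_len, 1)
        B = np.zeros((seg_len // 2, seg_len // 2), dtype=complex)
        for k in range(n_seg):
            seg = x[k * seg_len:(k + 1) * seg_len]
            X = np.fft.fft(seg * np.hanning(len(seg)))
            for i in range(seg_len // 2):
                for j in range(seg_len // 2):
                    B[i, j] += X[i] * X[j] * np.conj(X[i + j])
        freqs = np.arange(seg_len // 2) * fs / seg_len
        return B / n_seg, freqs

    def subband_features(B, freqs, f_lo, f_hi):
        """Mavg and normalized bispectral entropy E1 over one rectangular subband."""
        idx = (freqs >= f_lo) & (freqs < f_hi)
        mag = np.abs(B[np.ix_(idx, idx)]).ravel()
        p = mag / mag.sum()
        e1 = -np.sum(p * np.log(p + 1e-12)) / np.log(len(p))   # normalized to [0, 1]
        return {"Mavg": mag.mean(), "E1": e1}

    # e.g. LF-LF (LL) subband of a 4-Hz resampled, detrended RRI sequence `rri_even`:
    # B, freqs = bispectrum(rri_even - np.mean(rri_even))
    # print(subband_features(B, freqs, 0.04, 0.15))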

5. Experimental design

Data for this research were selected from the congestive heart failure (CHF) and normal sinus rhythm (NSR) databases, both of which are available on PhysioNet [25]. Recordings from 29 CHF and 54 NSR subjects were selected from the CHF and NSR databases, respectively, for analysis. The data sampling rate was 128 samples/s. Each record was accompanied by a beat annotation file that gave the occurrence times of the R peaks confirmed by specialists.

The HRV sequences were generated by first extracting a one-hour data segment recorded in the early morning from each of the records and then calculating the RR intervals based on the annotation file. The one-hour data length was inspired by a recent study [26] that explored the influence of segment length in differentiating CHF from NSR and concluded that record segments of one hour in length were sufficient for recognizing CHF. We further confined the one-hour data segments to the same period of the day in an attempt to minimize the influence of the natural daily cycle.

Preprocessors were designed to remove the ectopic beats, as described in [27], and the trend in the original RRI sequences [28]. Undoubtedly, arrhythmias can be an index of CHF. However, extremely short or long RRI data can be the effect of artifacts or missed beats of the recording system [27]. Therefore, we empirically set an RR-interval (RRI) range of 0.4–1.2 s as thresholds and removed the ectopic beats with RRIs outside this range [28]. This procedure aimed to eliminate outliers, especially the extremely small-valued data possibly induced by artifacts, while providing an adequate range for both normal and arrhythmic RRI variations. The results in [28] demonstrated the effectiveness of the thresholding filter in reducing the effect of artifacts while preserving the major properties of the RRI sequences for CHF recognition. The filtered RRI sequence was re-sampled at 4 Hz by using cubic-spline interpolation, and the trend was removed by using a discrete wavelet filter [28]. After these preprocessing steps, the first 16,384 data points (around 4096 s) of each RR-interval sequence were extracted for analysis.

As described in the previous sections, a total of 50 features were calculated to characterize each record. These features include the two personal features (age and gender), 11 time-domain features, 5 frequency-domain features, and 32 bispectral features. Each feature was normalized by first subtracting the mean and dividing by the standard deviation (s.d.) and then passing the result through a tangent sigmoid function, such that all of the features were bounded in the same range of [−1, +1]. The normalization was performed prior to classification to eliminate the bias introduced by different feature scales. The support vector machine (SVM) classifier was then employed to discriminate the CHF from the NSR HRV records.
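A minimal sketch of this preprocessing and normalization chain is given below; the wavelet detrending of [28] is replaced by simple mean removal for brevity, and the function names are illustrative, not taken from the original work.

    import numpy as np
    from scipy import interpolate

    def preprocess_rri(rri, lo=0.4, hi=1.2, fs=4.0, n_keep=16384):
        """Threshold ectopic/artifact beats, resample at 4 Hz, keep the first 16,384 samples."""
        rri = np.asarray(rri, dtype=float)
        rri = rri[(rri >= lo) & (rri <= hi)]             # 0.4-1.2 s threshold filter
        t = np.cumsum(rri)
        t_even = np.arange(t[0], t[-1], 1 / fs)
        rri_even = interpolate.splev(t_even, interpolate.splrep(t, rri))  # cubic spline
        rri_even = rri_even - rri_even.mean()            # stand-in for wavelet detrending [28]
        return rri_even[:n_keep]

    def normalize_features(F):
        """Z-score each feature column, then squash to [-1, +1] with a tangent sigmoid."""
        z = (F - F.mean(axis=0)) / F.std(axis=0)
        return np.tanh(z)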


Table 2 – Performance comparison of the proposed CHF system to the other methods – rows 1–3 without feature selection and rows 4–9 with the UCMIFS feature selector.

Algorithm | Classifier | Cross-validation | NS^a | SEN (%) | SPE (%) | ACC (%) | NF^b
Proposed method (all) | SVM (linear) | Leave-one-out | 83 | 93.10 | 98.14 | 96.38 | 50
Asyali's method [1] | Bayesian | Leave-one-out | 74 | 81.82 | 98.08 | 93.24 | 9
Isler's method (without GA) [2] | KNN (k = 3) | Leave-one-out | 83 | 84.75 | 83.33 | 84.34 | 30
UCMIFS | SVM (linear) | Leave-one-out | 83 | 96.55 | 98.14 | 97.59 | 15
MIFS-U [12] | SVM (linear) | Leave-one-out | 83 | 93.10 | 98.14 | 96.38 | 20
CMIFS [13] | SVM (linear) | Leave-one-out | 83 | 93.10 | 100.00 | 97.59 | 41
mRMR [14] | SVM (linear) | Leave-one-out | 83 | 93.10 | 98.14 | 96.38 | 31
Isler's method (with GA) [2] | KNN (k = 5) | Leave-one-out | 83 | 96.43 | 96.36 | 96.39 | 11–16
Melillo's method (with CART) [3] | CART | Ten-fold | 110 | 89.74 | 100.00 | 96.36 | 3

a Number of samples.
b Number of features.

In the experiments, we set the bounds on the Lagrange multipliers in the SVM to infinity, the kernel function to a Gaussian with a standard deviation of 19.89, and the stopping criterion for the quadratic programming (QP) optimization to 0.000001.

To assess the performance of the methods, leave-one-out cross-validation was employed in this study. In each trial, the leave-one-out procedure reserved one sample for testing and used the remaining samples for training the classifier. This procedure was repeated until every sample had been reserved once as the testing sample, and the percentage of correct results was calculated as a measure of the classifier's performance. This method gives each sample the same opportunity to serve as a training and a testing sample. Leave-one-out cross-validation was used because most of the outstanding CHF recognition systems published in the literature were evaluated with this method [1–3]; adopting the same protocol allows their performances to be compared in a more reasonable manner. The performance of the proposed feature selection algorithm UCMIFS was compared to that of the other MI-based methods, including MIFS-U [12], CMIFS [13], and mRMR [14], and to other well-known CHF recognition systems published in the literature [1–3].
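The evaluation protocol can be sketched with scikit-learn as follows. The mapping of the stated SVM settings onto SVC parameters is approximate and not the authors' actual code: a very large C stands in for unbounded Lagrange multipliers, the Gaussian kernel width of 19.89 is converted to gamma = 1/(2·19.89²), and tol approximates the QP stopping criterion.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    def loo_accuracy(F, y, sigma=19.89):
        """Leave-one-out accuracy of an SVM on a (normalized) feature matrix F."""
        clf = SVC(C=1e6, kernel="rbf", gamma=1.0 / (2 * sigma ** 2), tol=1e-6)
        correct = 0
        for train_idx, test_idx in LeaveOneOut().split(F):
            clf.fit(F[train_idx], y[train_idx])
            correct += int(clf.predict(F[test_idx])[0] == y[test_idx][0])
        return correct / len(y)

    # usage sketch: F is the 83 x 50 normalized feature matrix, y the CHF/NSR labels
    # selected = ucmifs_select(F, y, k=15)      # from the earlier UCMIFS sketch
    # print(loo_accuracy(F[:, selected], y))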

6. Experimental results

6.1. Performance of the CHF classifiers using the original feature set

The entire 83 records (29 CHF and 54 NSR) in the database were used in the study. To test the baseline performance of the features and the classifier, all 50 features were used in the simulation, and the discriminating capability of the classifier was compared to that of two well-known CHF classifiers proposed by Asyali [1] and by Isler and Kuntalp [2], respectively, which also differentiate CHF without feature selection; they are referred to as Asyali's and Isler's methods in the following discussion. The comparative results, detailed as sensitivity (SEN), specificity (SPE), and accuracy (ACC), are summarized in rows 1–3 of Table 2. It is obvious from the table that, by using the entire 50 features, the proposed method achieves high recognition rates of 93.10%, 98.14%, and 96.38% in SEN, SPE, and ACC, respectively. The results showed a surplus of 11.28%, 0.06%, and 3.14% over Asyali's method and 9.35%, 14.82%, and 12.04% over Isler's method in SEN, SPE, and ACC, respectively. The high discrimination power confirmed the effectiveness of combining HRV features calculated from the time domain, frequency domain, Poincare plot, and bispectra for CHF recognition.

6.2. Effect of the β parameter

Before comparing the performance of the proposed feature selection method to that of other methods, the value of the β parameter needed to be determined. We compared the performance of the feature selector with three β values, namely 0.5, 0.8, and 1. The comparative results are depicted in Fig. 2. Among the three β values, the proposed UCMIFS algorithm with β = 0.8 achieved the highest accuracy of 97.59% with the lowest feature dimension of 15 and was demonstrated to be the most efficient. Comparatively, the algorithm with β = 1 peaked at a slightly higher dimension of 16 with the same accuracy, yet showed inferior classification rates at lower dimensions. The algorithm with the lower β value of 0.5 showed deteriorated results when compared with the other two settings.

Fig. 2 – Performance comparison of different β parameters (β = 0.5, 0.8, 1) in recognizing CHF as a function of feature dimensions (accuracy (%) versus number of features).


Table 3 – Performance comparison of different MI-based feature selectors in CHF recognition based on the average accuracy of the selected 1–50 features.

Algorithm | NS^a | SEN (%) | SPE (%) | Averaged ACC (%)
UCMIFS | 83 | 89.51 | 95.92 | 93.68
MIFS-U [12] | 83 | 88.62 | 95.25 | 92.93
CMIFS [13] | 83 | 86.06 | 92.37 | 90.16
mRMR [14] | 83 | 88.06 | 94.96 | 92.55

a Number of samples.

Fig. 3 – Performance comparison of different MI-based feature selectors (UCMIFS, MIFS-U, CMIFS, mRMR) in recognizing CHF as a function of feature dimensions (accuracy (%) versus number of features).

The feature selector with β = 0.8 significantly outperformed that with β = 1 at lower dimensions, especially at dimensions 5–6 and 11–12, where a 2–5% difference in accuracy can be observed. At these feature dimensions, the performance of the feature selector with β = 1 dropped significantly, to a level even worse than that with β = 0.5. These observations demonstrate the need for the β parameter in the algorithm. Although it may not be effective at high feature dimensions, the β parameter contributes considerably at the lower feature dimensions that really matter for feature selectors. Furthermore, it is interesting to note that, with β = 0.8, the proposed feature selector chose as few as 15 features to achieve a recognition accuracy higher than that obtained using the entire 50 features. Therefore, we set β to 0.8 in the following studies.

6.3. Performance of the proposed feature selector

Setting β = 0.8, we first compared the efficiency of the proposed feature selector UCMIFS to that of three other MI-based feature selectors, namely MIFS-U [12], CMIFS [13], and mRMR [14], with the SVM as the classifier. Since the three selected MI-based selectors had been demonstrated to outperform the original MIFS algorithm, we did not include MIFS in the comparison. The discrimination capabilities of these methods, as functions of the number of selected features, are shown in Fig. 3 for comparison. The performance of the four MI-based selectors was ranked, from best to worst, as UCMIFS > MIFS-U > mRMR > CMIFS. Among them, the proposed UCMIFS outperformed the three other MI-based feature selectors in terms of the lowest optimal feature dimension and higher accuracies at most of the dimensions under study.

To quantitatively assess the efficiency of the feature selectors, the SEN, SPE, and ACC of the four MI-based selectors (MIFS-U, CMIFS, mRMR, and UCMIFS) were compared. The optimal accuracies of the different methods depicted in Fig. 3 are summarized in rows 4–7 of Table 2. Also included for comparison are Isler's method, which used a genetic algorithm (GA) as the feature selector [2], and Melillo's method [3], which used exhaustive search in combination with a classification and regression tree (CART) as the feature selector; their results are listed in rows 8 and 9, respectively.

The results in Table 2 show that UCMIFS outperformed the other MI-based methods (MIFS-U, CMIFS, and mRMR) and the two well-known CHF classifiers (Isler's method and Melillo's method) that also possess a feature selector. The UCMIFS with the optimal feature subset (row 4) showed a high accuracy of 97.59% in recognizing CHF, which was superior to the 96.38% accuracy achieved using the original feature set (row 1). Specifically, a 3.45% increase in SEN was observed when applying the UCMIFS feature selector. When compared to the other MI-based selectors (MIFS-U, CMIFS, and mRMR), the UCMIFS selector produced increases in ACC of 1.21%, 0%, and 1.21%, respectively, in recognizing CHF with substantially reduced feature dimensions. As few as 15 features were selected by the UCMIFS to attain the highest accuracy of 97.59%. Compared to the other MI-based methods, the 15 features selected by UCMIFS were markedly fewer than the 20 features selected by MIFS-U, the 41 features selected by CMIFS, and the 31 features selected by mRMR. This reduction in feature dimension reduces the load on the classifier and enhances the performance of the entire system. Compared to the two well-known CHF recognition systems that also include a feature selector, i.e. Isler's method and Melillo's method, the proposed system with UCMIFS outperformed both of them by about a 1.2% increase in ACC. The advantage of UCMIFS becomes more evident if we highlight the 3.45% increase in SEN relative to the three MI-based methods and the 0.12% and 6.81% increases in SEN relative to Isler's and Melillo's methods, respectively.

To further analyze the robustness of the four MI-based feature selectors in choosing the most effective features, the accuracy averaged over the entire (1–50) range of feature dimensions was used. The averaged accuracy is equivalent to numerically calculating the area under the curves of Fig. 3 and dividing by the number of feature dimensions (50). Therefore, the higher the averaged accuracy, the greater the capability of the feature selector to pick the most effective features early. The comparative results are summarized in Table 3. Again, UCMIFS was demonstrated to be superior to the other MI-based methods in average ACC. More specifically, UCMIFS outperforms MIFS-U, CMIFS, and mRMR with increases in average ACC of 0.75%, 3.52%, and 1.13%, respectively.


Table 4 – Ranking of the first fifteen features selected by different MI-based methods.

Rank | MIFS-U [12] | CMIFS [13] | mRMR [14] | UCMIFS
1 | NHF | NHF | NHF | NHF
2 | SDARR | SDARR | SDARR | SDARR
3 | Z2 ROI | RATIO | SD2 | Z2 ROI
4 | SD2 | E1 HH | Z2 ROI | Ratio
5 | AGE | NLF | GENDER | NLF
6 | GENDER | AGE | AGE | SD2
7 | E1 HH | GENDER | E1 HH | GENDER
8 | TLF | E2 LH | E1 LL | AGE
9 | E1 LL | Z1 LL | TLF | E1 HH
10 | NN50-2 | Pavg LL | Ratio | E1 LL
11 | SDRR ind | Z2 LL | NN50-2 | NN50-2
12 | E2 LH | SD2 | E2 LH | H1 LH
13 | Z1 HH | Pavg HH | SDRR ind | Z1 LH
14 | Pavg HH | Z2 HH | Z1 ROI | TLF
15 | Z1 LH | E2 ROI | Pavg HH | Mean

7. Discussion

The results in the previous section demonstrated the superiority of the proposed feature selector UCMIFS over the other MI-based feature selectors. To assess the advantages of using UCMIFS, the features selected by the different MI-based selectors were compared. With the leave-one-out procedure, a subset of features that resulted in the optimal recognition rate was found in each trial; with a total of 83 records, we therefore obtained 83 subsets of optimal features. After statistical analysis of the selected features, the most frequently selected features were ranked in descending order, and the first 15 features chosen by the four MI-based methods are listed in Table 4 for comparison. By ranking the selected features, the relative importance of individual features for the CHF recognition task is readily identifiable. It is obvious that, except for the first two features, which were consistent among the four methods, the selected features differed from selector to selector. This observation highlights the importance of the first two features, NHF and SDARR, in CHF recognition and explains the significant differences in performance among these methods. Although all of these methods are based on information theory, the way the mutual information is approximated determines the performance of a selector.

To further verify that the enhancement in performance comes from the feature selector and not from the use of a specific classifier, in this case the SVM, the features ordered by the four MI-based selectors were tested with three other classifiers: two simple and well-known classifiers, the k-nearest neighbor (KNN) classifier and the linear classifier (LC), and a somewhat more complex quadratic classifier (QC) [29]. The same 83 records and the leave-one-out cross-validation method were employed in the test. The optimal performances of the different combinations of feature selectors and classifiers are summarized in Table 5. Also included for comparison are the classification rates of the individual classifiers with all 50 features. It is interesting to note that the classifier type indeed influences the classification rates and the determination of the optimal feature numbers. Using all 50 features, QC outperforms

Table 5 – Performance comparison of different combinations of classifiers and MI-based feature selectors.

Classifier | Algorithm | SEN (%) | SPE (%) | ACC (%) | NF^a
KNN (k = 3) | None | 72.41 | 94.44 | 86.74 | 50
KNN (k = 3) | UCMIFS | 89.65 | 96.29 | 93.97 | 8
KNN (k = 3) | MIFS-U [12] | 86.20 | 96.29 | 92.77 | 5
KNN (k = 3) | CMIFS [13] | 86.20 | 94.44 | 91.56 | 8
KNN (k = 3) | mRMR [14] | 89.65 | 94.44 | 92.77 | 3
LC | None | 81.48 | 79.31 | 80.72 | 50
LC | UCMIFS | 93.10 | 98.14 | 96.38 | 15
LC | MIFS-U [12] | 93.10 | 94.44 | 93.97 | 6
LC | CMIFS [13] | 96.55 | 94.44 | 95.18 | 12
LC | mRMR [14] | 96.55 | 94.44 | 95.18 | 11
QC | None | 94.44 | 93.10 | 93.97 | 50
QC | UCMIFS | 96.55 | 100.00 | 98.79 | 16
QC | MIFS-U [12] | 96.55 | 98.14 | 97.59 | 38
QC | CMIFS [13] | 96.55 | 98.14 | 97.59 | 42
QC | mRMR [14] | 96.55 | 98.14 | 97.59 | 32

KNN: k-nearest neighbor classifier; LC: linear classifier; QC: quadratic classifier.
a Number of features.

both KNN and LC, with surpluses of 7.23% and 13.25% in ACC, respectively. However, the use of feature selectors reduces this difference in classification rates. Among the four MI-based feature selectors, UCMIFS outperforms the other three. Compared to the ACCs obtained using the entire 50 features, UCMIFS raises the ACCs of the three classifiers KNN, LC, and QC from 86.74% to 93.97%, from 80.72% to 96.38%, and from 93.97% to 98.79%, respectively, with the feature dimensions significantly reduced from 50 to 8, 15, and 16, respectively. Therefore, it is evident that, although different classifiers may influence the classification rates, the use of MI-based feature selectors is capable of raising the performance of the classifier with significantly lowered feature dimensions. Among the four MI-based feature selectors, the UCMIFS method is the most effective selector for CHF classification.

Computational load is an important issue in judging the efficiency of a feature selector. As mentioned in Section 3, the proposed feature selector UCMIFS stems from the conditional MIFS (CMIFS) [13]. We adopted the MI conditioned by the first feature selected into S in the calculation while considering all the features in S, as in MIFS, for the selection of the next feature. Moreover, we adopted the uniform distribution assumption from MIFS-U [12] to simplify the calculation. The former increases the computational load while the latter reduces it. The forgetting parameter, represented as a logarithmic function of the number of already-selected features, adds some, but negligible, load to the computational cost. Therefore, the computational complexity of UCMIFS is comparable to that of CMIFS but higher than those of MIFS and MIFS-U. However, the increased computational load is compensated by the superiority achieved in recognizing CHF with reduced feature dimensions, as depicted in Fig. 3 and Tables 2 and 3.

As discussed in Section 1, the proposed UCMIFS algorithm, like the other MI-based algorithms, applies the "filter" approach [5], which selects features based solely on a performance measure that is independent of the subsequent classifier. On the contrary,


Isler’s method and Melillo’s method were designated to cooperate with a specific classifier in the feature selection process and were usually referred to as “wrapper” methods [5]. A wrapper requires an iterative procedure to determine an optimal feature subset and is usually computational extensive. Therefore, the UCMIFS is considered to be superior to the other two methods in effectiveness of selecting the most representative features for CHF recognition, as depicted in Tables 2 and 5, but also in computational efficiency to calculate the results.

8. Conclusion

This paper proposed a mutual information-based feature selector for congestive heart failure (CHF) recognition using heart rate variability (HRV). The proposed algorithm, UCMIFS, takes advantage of conditional mutual information, extracting information from all the selected features conditioned on the first selected one. The uniform distribution assumption was adopted to simplify the calculation of mutual information, and a logarithmic exponent weighting was applied to model the relative importance of the order of the selected features. A total of fifty features spanning five categories (personal, time-domain, frequency-domain, Poincare, and bispectral), together with a support vector machine (SVM) classifier, were recruited to evaluate the performance of the feature selector. When tested using a leave-one-out cross-validation protocol, the proposed method outperformed the other MI-based feature selectors with an elevated recognition rate and lowered feature dimensions. Compared to the other CHF classifiers published in the literature, the proposed method showed an impressive recognition rate as high as 97.59% using a feature dimension as low as 15. The results demonstrated the advantage of using the recruited five-category features in characterizing HRV sequences for CHF recognition. The UCMIFS algorithm improved the capability of the recognition system with substantially lowered feature dimensions and an elevated recognition rate, and its advantage in selecting effective features for CHF recognition was demonstrated to be valid across a broad range of different classifiers.

Acknowledgements This study was supported in part by the grants NSC 97-2220E-194-010, NSC 98-2220-E-194-003, and NSC 99-2220-E-194-002 from the National Science Council, Taiwan, R.O.C.

References

[1] M.H. Asyali, Discrimination power of long-term heart rate variability measures, in: Proceedings of the 25th Annual International Conference of the IEEE EMBS, Cancun, Mexico, 2003, pp. 17–21.
[2] Y. Isler, M. Kuntalp, Combining classical HRV indices with wavelet entropy measures improves to performance in diagnosing congestive heart failure, Computers in Biology and Medicine 37 (10) (2007) 1502–1510.
[3] P. Melillo, R. Fusco, M. Sansone, M. Bracale, L. Pecchia, Discrimination power of long-term heart rate variability measures for chronic heart failure detection, Medical and Biological Engineering and Computing 49 (1) (2011) 67–74.
[4] A.K. Jain, R.P. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 4–37.
[5] J. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Machine Learning: Proceedings of the Eleventh International Conference, 1994, pp. 121–129.
[6] M. Dash, H. Liu, Feature selection for classification, Intelligent Data Analysis (3) (1997) 131–156.
[7] J. Bins, B. Draper, Feature selection from huge feature set, in: Proceedings of the 8th IEEE International Conference on Computer Vision, 2001, pp. 159–165.
[8] M.A. Hall, Correlation-based feature selection for machine learning, Ph.D. Dissertation, Department of Computer Science, University of Waikato, New Zealand, 1999.
[9] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5 (4) (1994) 537–550.
[10] A. Schuman, N. Wessel, A. Schirdewan, K.J. Osterziel, A. Voss, Potential of feature selection methods in heart rate variability analysis for the classification of different cardiovascular diseases, Statistics in Medicine 21 (2002) 2225–2242.
[11] M.B. Malarvili, M. Mesbah, B. Boashash, HRV feature selection based on discriminant and redundancy analysis for neonatal seizure detection, in: Proceedings of the 6th International Conference on Information, Communications and Signal Processing, 2007, pp. 1–5.
[12] N. Kwak, C.-H. Choi, Input feature selection for classification problems, IEEE Transactions on Neural Networks 5 (4) (2002) 143–159.
[13] H. Cheng, Z. Qin, W. Qian, W. Liu, Conditional mutual information based feature selection, in: International Symposium on Knowledge Acquisition and Modeling, 2008, pp. 103–107.
[14] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1226–1238.
[15] P.A. Estevez, M. Tesmer, C.A. Perez, J.M. Zurada, Normalized mutual information feature selection, IEEE Transactions on Neural Networks 20 (2) (2009) 189–201.
[16] S.M. Shang, J.Q. Gan, S. Francisco, Classifying mental tasks based on features of higher-order statistics from EEG signals in brain–computer interface, Information Sciences 178 (2008) 1629–1640.
[17] R.M. Fano, Transmission of Information: A Statistical Theory of Communications, Wiley, New York, USA, 1961.
[18] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 2006.
[19] J.J. Huang, Y.Z. Cai, X.M. Xu, A wrapper for feature selection based on mutual information, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), 2006.
[20] K.C. Chua, V. Chandran, U.R. Acharya, C.M. Lin, Cardiac state diagnosis using higher order spectra of heart rate variability, Journal of Medical Engineering & Technology 32 (2) (2008) 145–155.
[21] A. Hoseen, A.G. Bader, Identification of patients with congestive heart failure by recognition of sub-bands spectral, Proceedings of World Academy of Science, Engineering and Technology 34 (2008) 21–24.
[22] Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology, Heart rate variability: standards of measurement, physiological interpretation and clinical use, Circulation 93 (1996) 1043–1065.
[23] T. Inouye, K. Shinosaki, H. Sakamoto, S. Toi, S. Ukai, A. Iyama, Y. Katsuda, M. Hirano, Quantification of EEG irregularity by use of the entropy of the power spectrum, Electroencephalography and Clinical Neurophysiology 79 (3) (1991) 204–210.
[24] J.W. Zhang, C.X. Zheng, A. Xie, Bispectrum analysis of focal ischemic cerebral EEG signal using third-order recursion method, IEEE Transactions on Biomedical Engineering 47 (3) (2000) 352–359.
[25] PhysioNet PhysioBank: http://www.physionet.org/physiobank/database.
[26] R.-G. Yeh, G.-Y. Chen, J.-S. Shieh, C.-D. Kuo, Parameter investigation of detrended fluctuation analysis for short-term human heart rate variability, Journal of Medical and Biological Engineering 30 (5) (2009) 277–282.
[27] M. Costa, A.L. Goldberger, C.-K. Peng, Multiscale entropy analysis of biological signals, Physical Review E 71 (2005) 021906.
[28] M.-Y. Lee, S.-N. Yu, Improving discriminality in heart rate variability analysis using simple artifact and trend removal preprocessors, in: IEEE EMBS 32nd Annual International Conference, Buenos Aires, Argentina, 2010.
[29] Statistical Pattern Recognition Toolbox for Matlab: http://cmp.felk.cvut.cz/cmp/software/stprtool/.