Pattern Recognition Letters 32 (2011) 337–341
Feature selection using mutual information in CT colonography

Ju Lynn Ong *, Abd-Krim Seghouane

College of Engineering and Computer Science, The Australian National University, Canberra, ACT 2601, Australia
NICTA, 7 London Circuit, Canberra, ACT 2601, Australia (1)

* Corresponding author. Fax: +61 2 6267 6220. E-mail addresses: [email protected], [email protected] (J.L. Ong), [email protected] (A.-K. Seghouane).
(1) NICTA is funded by the Australian Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Centre of Excellence Program.

doi:10.1016/j.patrec.2010.09.012
Article history: Received 6 October 2009; available online 25 September 2010. Communicated by S. Aksoy.

Keywords: Feature selection; Computed tomography; Support vector classifier; Mutual information
Abstract

Computed tomographic (CT) colonography is a promising alternative to traditional invasive colonoscopic methods used in the detection and removal of cancerous growths, or polyps, in the colon. Existing computer-aided diagnosis (CAD) algorithms used in CT colonography typically employ a classifier to discriminate between true and false positives generated by a polyp candidate detection system, based on a set of features extracted from the candidates. However, these classifiers often suffer from a phenomenon termed the curse of dimensionality, whereby there is a marked degradation in the performance of a classifier as the number of features used is increased. In addition, an increase in the number of features used also increases computational complexity and storage demands. This paper investigates the benefits of feature selection on a polyp candidate database, with the aim of increasing specificity while preserving sensitivity. Two new mutual information methods for feature selection are proposed in order to select a subset of features for optimum performance. Initial results show that the performance of the widely used support vector machine (SVM) classifier is indeed better with a small set of features, with area under the receiver operating characteristic curve (AUC) measures reaching 0.78–0.88.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Computer-aided diagnosis (CAD) for CT colonography is used in the detection of possible polyp candidates with the aim of removing them at an early stage, before they turn cancerous (Yoshida and Dachman, 2005). CAD tools are currently used as a second reader, with the potential to improve sensitivity in cases where polyps could be missed by radiologists tediously screening through each slice. The key elements of a CAD tool include measuring, describing and comparing different lesions in order to detect polyp candidates and reject other structures. However, this can prove to be a challenging process given the wide variety of polyp shapes and sizes, and because many naturally occurring normal colonic structures occasionally imitate polyp shapes. As such, the resulting polyp candidates typically include many false positives (FPs). Fig. 1 gives an example of different colonic structures with significant bulbous protrusions highlighted in red: Fig. 1(a) shows correctly identified polyp regions, while the others are false positives that should be discarded in order to increase the specificity of the system. It should be noted that this figure is used only to illustrate the different colonic structures; in the actual detection system, all protrusions would be detected and highlighted.

Once these candidate lesions have been identified, a false positive filtering step is typically employed, and many methods have been proposed to extract features for this purpose. Göktürk et al. (2001) extracted random triples of mutually orthogonal sections through a candidate polyp volume to calculate sets of geometric attributes from each plane; a histogram of these attributes from each triple was then used as a feature vector to represent the lesion shape. Acar et al. (2002) instead proposed a CAD scheme using edge-displacement fields to capture changes in local image gradients across consecutive image planes. Wang et al. (2005) computed geometrical, morphological and textural internal features in order to eliminate FPs from the detected candidate patches. Most of these methods extract large numbers of features in order to discriminate between polyp and non-polyp lesions. Once this is done, classification is required to reduce the number of FPs. For example, linear and quadratic classifiers are employed in (Yoshida et al., 2002; Yoshida and Näppi, 2001; Acar et al., 2002; Jerebko et al., 2006), while Göktürk et al. (2001) and Zheng et al. (2006) used support vector machines for classification.
Fig. 1. Polyp and non-polyp regions. (a) Polyp. (b) Semi-planar. (c) Large fold. (d) Narrow fold. Candidate regions that belong to the elliptic class of the peak subtype have been highlighted. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
In (Suzuki et al., 2006; Näppi and Yoshida, 2007), neural network classifiers are used instead, while a novel logistic linear classifier was developed in (van Ravesteijn et al., in press) for this purpose. Classification is performed by first extracting a set of features (e.g. curvature, shape index, intensity) from each lesion, followed by the application of a statistical classifier to the feature space to discriminate false positives from actual polyps. Such CAD schemes tend to show high sensitivities in the detection of polyps; however, they also tend to produce a much larger number of FPs than human readers (Yoshida and Dachman, 2005).

The goal of an improved classifier is to retain the true polyps detected (TPs), i.e. maintain sensitivity, while significantly reducing the FPs, i.e. increase specificity. This is not a straightforward task, as it is possible to extract more than 500 features from a polyp candidate, usually by a feature extraction algorithm that creates new features based on transformations or combinations of the original feature set. Although the original features provide a better physical interpretation of an object (e.g. Gaussian curvature, shape index), these transformed or combined features may sometimes provide better discriminative ability than the best combination of the original features (Jain et al., 2000). These features are input into a statistical classifier which is then trained for use on future test samples, likely different from the ones used for training. A common problem suffered by many classifiers, e.g. linear discriminant analysis (LDA), artificial neural networks (ANN) and support vector machines (SVM), is termed the curse of dimensionality, whereby the performance of the classifier degrades when the number of features used is too large relative to the number of training samples; this is usually the case in CAD algorithms for CT colonography, where there is typically only a very small number of TPs to work with. This is because, for a fixed sample size, the number of unknown parameters to estimate increases as the number of features increases, making it difficult to learn a good decision boundary. However, an arbitrary reduction in the number of features can lead to a loss in the discrimination ability of the system, thus lowering the accuracy of the classifier.

In a feature selection method, a criterion or score C(k), k = 1, 2, ..., m, measuring the discriminative ability of each of the m features is computed and ranked. A feature vector with the features corresponding to the best values of C(k) is then selected. Examples of score computation include the F-score (Chen and Lin, 2006), Pearson's linear correlation coefficient, the χ² coefficient (Biesiada et al., 2005) and methods based on information entropy (Battiti, 1994; Mantaras, 1991). A high score indicates a high dependency between that feature and the separation of the two classes. The first three methods treat features individually, while the methods in (Battiti, 1994; Mantaras, 1991) take into account the fact that some features may have high mutual correlation with each other. This may be useful, as Guyon and Elisseeff (2003) have shown that noise reduction and better class separation may sometimes be obtained by adding variables that are presumably redundant, and a feature that is useless by itself can be useful when taken with others. For this reason, we choose to investigate methods that exploit the mutual information between the features in order to select an optimal set of features that will maximize the performance of our classifier. We choose the SVM classifier, widely used for its good generalization characteristics, which make it more robust to the problem of dimensionality; reducing the number of features improves prediction performance while also reducing computation and storage requirements, as well as training and test times.

The next section introduces a feature selection method using mutual information before proposing two new mutual information methods. Section 3 briefly describes the support vector machine classifier, while Section 4 demonstrates the application and performance of these two criteria for feature selection on our CAD system using this classifier. Concluding remarks are given in Section 5.

2. Mutual information

Mutual information methods are able to evaluate the "information content" of each individual feature with regard to the output class, and also with respect to each other. For example, the method proposed by Battiti (1994) measures arbitrary relations between variables and does not depend on transformations (e.g. scaling, translation) of the variables. It also allows for nonlinear relations between variables, which gives it an advantage over traditional linear measures of association such as the correlation coefficient. The basic idea is to measure the amount by which the knowledge provided by a feature vector decreases the uncertainty about a class, by "consuming" the information contained in the input vector. If we consider the class variable c and the feature vector f, the mutual information I(C; F) between the variables C and F is given by
$$I(C;F) = H(C) - H(C\mid F)\tag{1}$$

whereby, if P(c), c = 1, ..., N_c, gives the initial uncertainty in the output class, the entropy H(C) is given by

$$H(C) = -\sum_{c=1}^{N_c} P(c)\log P(c)\tag{2}$$
and the conditional entropy measuring the uncertainty after knowing the feature vector f with Nf components is
HðCjFÞ ¼
Z f
PðfÞ
Nc X
! PðcjfÞlogPðcjfÞ df
ð3Þ
c¼1
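To make Eqs. (1)-(3) concrete for the discrete case, the following minimal Python sketch computes H(C), H(C|F) and I(C;F) from a joint contingency table of counts; the table shape, variable names and example numbers are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

def entropies_and_mi(counts):
    """Compute H(C), H(C|F) and I(C;F) = H(C) - H(C|F) (Eqs. (1)-(3))
    from a table of joint counts n(c, f) of shape (Nc, Nf)."""
    joint = counts / counts.sum()          # P(c, f): normalize counts
    p_c = joint.sum(axis=1)                # marginal P(c)
    p_f = joint.sum(axis=0)                # marginal P(f)

    # H(C) = -sum_c P(c) log P(c), Eq. (2)
    h_c = -np.sum(p_c[p_c > 0] * np.log(p_c[p_c > 0]))

    # H(C|F) = -sum_f P(f) sum_c P(c|f) log P(c|f), discrete form of Eq. (3)
    p_c_given_f = np.where(p_f > 0, joint / p_f, 0.0)
    inner = np.where(p_c_given_f > 0, p_c_given_f * np.log(p_c_given_f), 0.0)
    h_c_given_f = -np.sum(p_f * inner.sum(axis=0))

    return h_c, h_c_given_f, h_c - h_c_given_f   # I(C;F), Eq. (1)

# Example: 2 classes (polyp / non-polyp) x 3 feature bins, hypothetical counts
counts = np.array([[30.0, 10.0, 5.0],
                   [5.0, 10.0, 30.0]])
h_c, h_cf, mi = entropies_and_mi(counts)
print(f"H(C)={h_c:.3f}, H(C|F)={h_cf:.3f}, I(C;F)={mi:.3f}")
```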
This function I(C; F) is symmetric with respect to C and F, and can be reduced to the following expression if we take P(c|f) = P(c, f)/P(f) from the definition of conditional probability:
$$
\begin{aligned}
I(C;F) &= H(C) - H(C\mid F)\\
&= -\sum_{c=1}^{N_c} P(c)\log P(c) + \int_f P(f)\left(\sum_{c=1}^{N_c} P(c\mid f)\log P(c\mid f)\right)df\\
&= -\sum_{c=1}^{N_c} P(c)\log P(c) + \int_f P(f)\sum_{c=1}^{N_c}\frac{P(c,f)}{P(f)}\log\frac{P(c,f)}{P(f)}\,df
\end{aligned}
$$

and by using the law of total probability to expand the first term, we have

$$
\begin{aligned}
I(C;F) &= -\sum_{c=1}^{N_c}\int_f P(c,f)\log P(c)\,df + \int_f\sum_{c=1}^{N_c} P(c,f)\log\frac{P(c,f)}{P(f)}\,df\\
&= \int_f\sum_{c=1}^{N_c} P(c,f)\log\frac{P(c,f)}{P(f)P(c)}\,df
\end{aligned}\tag{4}
$$

which is a function of the joint probability distribution of the two variables c and f. As such, given a set F of n features, we want to find a subset S ⊂ F with k features that gives the most information about the class, i.e. minimizes H(C|S) and maximizes the mutual information I(C; S). However, this computation becomes prohibitively expensive when the dimensionality of the feature vector f is large. In (Battiti, 1994), an approximation using only I(C; f) and I(f; f′) is computed, where f and f′ are individual features. This means that, given a set S of already selected features, the algorithm chooses the next feature as the one that maximizes the information about the class, corrected by subtracting an amount proportional to the average MI with the selected features:

$$\max_{f\in F\setminus S}\left(I(C;f) - \beta\sum_{s\in S} I(f;s)\right)\tag{5}$$
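As an illustration of Eq. (5), the sketch below implements Battiti-style greedy selection, assuming the pairwise quantities have already been estimated: a vector `mi_fc` of I(C; f) values and a matrix `mi_ff` of I(f; f′) values. These names, and the choice to seed the selection with the single best feature, are our assumptions rather than details given in the paper.

```python
import numpy as np

def mifs_select(mi_fc, mi_ff, k, beta=0.5):
    """Greedy feature selection after Battiti (1994), Eq. (5):
    repeatedly pick the f maximizing I(C;f) - beta * sum_{s in S} I(f;s).
    mi_fc: (m,) array of I(C;f); mi_ff: (m, m) array of I(f;f')."""
    m = len(mi_fc)
    selected, remaining = [], list(range(m))
    first = int(np.argmax(mi_fc))          # start with the most informative feature
    selected.append(first)
    remaining.remove(first)
    while len(selected) < k and remaining:
        scores = [mi_fc[f] - beta * sum(mi_ff[f, s] for s in selected)
                  for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```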
2.0.1. A new mutual information algorithm

If we suppose, however, that a new feature f does not add any new information, then the conditional probabilities verify

$$P(c\mid s,f) = P(c\mid s)$$

In the absence of information from the feature f on the prediction of the class c, the variable f has no influence on the conditional probability P(c|s, f). Therefore, to measure the 'extra' information that f provides, we need to measure the deviation of P(c|s, f) from P(c|s). This can be done via the Kullback–Leibler divergence (Kullback, 1959), as shown in the following equation:

$$\sum_{c,s,f} P(c,s,f)\log\frac{P(c\mid s,f)}{P(c\mid s)} = H(C\mid s) - H(C\mid s,f) = I(C;f\mid s)$$

However, in Eq. (5), only the mutual information between f and s is measured, without taking into account the objective of predicting the class C. Furthermore, there is no direct method for estimating the regularization parameter β, and finding an optimal β may be computationally expensive. As such, we propose a new mutual information criterion that takes into account the prediction of the class C from the features f and s and eliminates the β parameter. This can be summarized in the following equation:

$$\max_f\left(I(C;f) - \frac{1}{|S|}\sum_{s\in S} I(C;f\mid s)\right)\tag{6}$$

To expand this, we take

$$I(C;f) = H(C) - H(C\mid f) = \sum_{c,f} P(c,f)\log\frac{P(c,f)}{P(c)P(f)}$$

and

$$
\begin{aligned}
I(C;f\mid s) &= H(C\mid s) - H(C\mid f,s)\\
&= -\sum_{c,s} P(c,s)\log\frac{P(c,s)}{P(s)} + \sum_{c,f,s} P(c,f,s)\log\frac{P(c,f,s)}{P(f,s)}
\end{aligned}\tag{7}
$$

An alternative method to incorporate this information would be to compute

$$\max_f\ \frac{1}{|S|}\sum_{s\in S}\left(I(s,f;C) - I(f;s)\right)\tag{8}$$

where

$$
\begin{aligned}
I(s,f;C) &= I(s;C\mid f) + I(f;C\mid s)\\
&= H(s\mid f) - H(s\mid C,f) + H(f\mid s) - H(f\mid C,s)\\
&= -\sum_{f,s} P(s,f)\log\frac{P(s,f)}{P(f)} + \sum_{c,f,s} P(c,f,s)\log\frac{P(c,f,s)}{P(c,f)}\\
&\quad -\sum_{f,s} P(f,s)\log\frac{P(f,s)}{P(s)} + \sum_{c,f,s} P(c,f,s)\log\frac{P(c,f,s)}{P(c,s)}
\end{aligned}\tag{9}
$$

Here, the terms I(s; C|f) and I(f; C|s) give us the information that s and f provide, respectively, in predicting C. These algorithms are summarized by the selection criteria in Eqs. (6) and (8).
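A sketch of how these criteria could be scored in practice, assuming discretized features: the conditional mutual information I(C; f|s) is computed from a three-way count table following Eq. (7), and the candidate score of Eq. (6) averages it over the selected set. The helper names, table layout and `cmi` lookup structure are our assumptions.

```python
import numpy as np

def cond_mutual_info(counts_cfs):
    """I(C; f | s) from a 3-way count table n[c, f, s], following Eq. (7),
    rewritten as sum_{c,f,s} P(c,f,s) log[ P(c,f,s) P(s) / (P(c,s) P(f,s)) ]."""
    p = counts_cfs / counts_cfs.sum()
    p_s = p.sum(axis=(0, 1))          # P(s)
    p_cs = p.sum(axis=1)              # P(c, s), shape (Nc, Ns)
    p_fs = p.sum(axis=0)              # P(f, s), shape (Nf, Ns)
    mi = 0.0
    for c in range(p.shape[0]):
        for f in range(p.shape[1]):
            for s in range(p.shape[2]):
                if p[c, f, s] > 0:
                    mi += p[c, f, s] * np.log(
                        p[c, f, s] * p_s[s] / (p_cs[c, s] * p_fs[f, s]))
    return mi

def score_eq6(f, selected, mi_fc, cmi):
    """Score of candidate f under Eq. (6): I(C;f) - (1/|S|) sum_s I(C;f|s).
    cmi[f][s] holds precomputed I(C;f|s) values for feature pairs."""
    if not selected:
        return mi_fc[f]
    return mi_fc[f] - np.mean([cmi[f][s] for s in selected])
```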
For our application, since we are estimating the probability densities from a finite number of samples, we approximate them using histograms and count the number of cases falling into each histogram bin. For example, if we have N examples in the training set, we take P_c = n_c/N, P_f = n_f/N, P_{c,f} = n_{c,f}/N and P_{c,f,s} = n_{c,f,s}/N, where n is the number of occurrences in the corresponding bin.
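A minimal sketch of these histogram estimates: each continuous feature is discretized into a fixed number of bins, and the required probabilities are obtained as bin counts divided by N, as described above. The bin count and function names are our illustrative choices, not values stated in the paper.

```python
import numpy as np

def discretize(feature_values, n_bins=10):
    """Map continuous feature values to histogram bin indices in [0, n_bins)."""
    edges = np.histogram_bin_edges(feature_values, bins=n_bins)
    return np.clip(np.digitize(feature_values, edges[1:-1]), 0, n_bins - 1)

def joint_counts(classes, f_bins, s_bins, n_classes=2, n_bins=10):
    """Three-way count table n[c, f, s]; P(c,f,s) = n[c,f,s] / N as in the text."""
    counts = np.zeros((n_classes, n_bins, n_bins))
    for c, f, s in zip(classes, f_bins, s_bins):
        counts[c, f, s] += 1
    return counts
```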
3. Support vector classification

Support vector machines (SVMs) (Vapnik, 1995) are a well-known tool for data classification which allows for the efficient
use of kernels and the absence of local minima. The basic idea is to map a set of data inputs into a high-dimensional space and find a separating hyperplane with a maximal margin in order to discriminate between a set of classes. Given a training data set of feature vectors D = (x_1, y_1), ..., (x_n, y_n), where x_i is a feature vector and y_i ∈ {+1, −1}, with +1 corresponding, for example, to a polyp label and −1 to a non-polyp, the SVM classifier aims to solve the quadratic optimization problem:
$$\min_{\omega,b,\xi}\ \frac{1}{2}\omega^T\omega + C\sum_{k=1}^{m}\xi_k\tag{10}$$

subject to

$$y_k\left(\omega^T\phi(x_k) + b\right) \ge 1 - \xi_k,\qquad \xi_k \ge 0,\quad k = 1,\ldots,m\tag{11}$$

Fig. 2. Mutual information diagram for the features extracted from the polyp candidates.

where the training data are mapped by the function φ and C is a penalty parameter on the training error. For a testing instance x, the decision function is given by

$$f(x) = \operatorname{sgn}\left(\omega^T\phi(x) + b\right)\tag{12}$$

where ω^T φ(x) + b = 0 is the equation of the hyperplane separating the data with the largest possible margin, and ω and b are weights that need to be optimized. More details can be found in (Chen and Lin, 2006). For the purposes of our experiment, we use the LibSVM toolbox (Chen and Lin, 2006) with an RBF kernel and 5-fold cross-validation to find the optimal settings of C and of the kernel parameter γ, where the RBF kernel is given by K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), γ > 0.
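The paper uses the LibSVM toolbox directly; the sketch below reproduces the same setup (RBF kernel, 5-fold cross-validated search over C and γ) with scikit-learn's SVC, which wraps LIBSVM. The grid values are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """5-fold CV grid search over C and gamma for an RBF-kernel SVM,
    K(xi, xj) = exp(-gamma * ||xi - xj||^2), as in Section 3."""
    param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),       # assumed search grid
                  "gamma": 2.0 ** np.arange(-15, 4, 2)}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)   # X: (n, k) selected features; y in {+1, -1}
    return search.best_estimator_, search.best_params_
```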
4. Experimental results

Fig. 3. Mutual information function showing the MI between feature #16 and the other features.
CT data of polyp candidates (TPs and FPs) obtained from the Walter Reed Army Medical Center database (Walter Reed Army Medical Center, 2006), acquired using 4- and 8-channel GE LightSpeed scanners with collimation 1.25–2.5 mm, table speed 15 mm/rotation, reconstruction interval 1 mm, slice thickness 1.25 mm, tube voltage 120 kVp and effective tube current 100 mA, were used in our simulations. Of the entire database, we used 45 polyp and 45 non-polyp samples for training and testing purposes and extracted features from these; 50 of these samples were selected randomly for training, and the remainder for testing.
Fig. 4. Effect of the number of features on AUC.
These features are explained in (Yoshida and Näppi, 2001; Yoshida et al., 2002) and include:
- Intensity of the structure
- Shape index
- Curvedness value
- Mean curvature
- Gaussian curvature
- Elongation
- Size
- Directional gradient concentration
together with their corresponding maximum, minimum, mean, standard deviation and skewness over both surface and volumetric patches. The explanation and computation of these features can be found in (Yoshida and Näppi, 2001). In total, 100 features extracted from this database were used; they were ranked using the three MI criteria in Eqs. (5) (with β = 0.5), (6) and (8), respectively, and then put through the SVM classifier with an RBF kernel and 5-fold cross-validation. Fig. 2 shows the MI diagram I(F; C) obtained for each feature in the database, while Fig. 3 shows the MI function relating a particular feature (feature #16 in this case) to all other features; clearly it correlates most with itself, with a sharp spike at f = 16. As a measure of performance, Fig. 4 shows the area under the curve (AUC) of the receiver operating characteristic (ROC), a widely used statistic for model comparison, for the two methods discussed. Two things can be quickly noted. Firstly, for a small number of training samples, the addition of features first increases the performance of the classifier and then subsequently degrades it, depending on whether the added features are efficient for discriminating between the classes. Secondly, the performance of SVM + Method 1 appears to be slightly better for a small number of features, and on the whole similar to Battiti's method, without the need to estimate the β parameter. On the whole, the AUC measures reach 0.78–0.88 for the top-ranked features using these methods. If no feature selection method were used, the worst-case scenario could yield AUC results below 0.60, assuming weaker, less discriminatory features were selected. The top features giving excellent class separability were those derived from the mean values of the shape index and CT intensity.
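For completeness, a hedged sketch of how AUC values such as those in Fig. 4 could be computed for a given feature subset, using scikit-learn's ROC utilities with the SVM decision margin as the ranking score. The names `train_svm` and `ranked_features` refer to the earlier sketches and are our assumptions.

```python
from sklearn.metrics import roc_auc_score

def auc_for_subset(model, X_test, y_test):
    """Area under the ROC curve on held-out candidates; the signed distance
    to the separating hyperplane serves as the ranking score."""
    scores = model.decision_function(X_test)
    return roc_auc_score(y_test, scores)

# Example: evaluate increasing numbers of top-ranked features
# for k in range(1, 31):
#     cols = ranked_features[:k]            # indices from an MI criterion
#     model, _ = train_svm(X_train[:, cols], y_train)
#     print(k, auc_for_subset(model, X_test[:, cols], y_test))
```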
5. Conclusion

In this paper, we investigated the benefits of feature selection on our polyp candidate database, with the aim of increasing specificity while preserving sensitivity. Two new mutual information methods that do not require the estimation of a hyperparameter were proposed, and it was noted that, for a small number of training samples, the addition of features first increases the performance of the classifier and then subsequently degrades it, depending on whether the added features are efficient for the discrimination process. On the whole, the AUC measures reach 0.78–0.88 for the top-ranked features using these methods.

References

Acar, B., Beaulieu, C., Gokturk, S., Tomasi, C., Paik, D.S., Jeffrey Jr., R.B., Yee, J., Napel, S., 2002. Edge displacement field-based classification for improved detection of polyps in CT colonography. IEEE Trans. Med. Imaging 21 (12), 1461–1467.
Battiti, R., 1994. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks 5 (4), 537–550.
Biesiada, J., Duch, W., Kachel, A., Maczka, K., Palucha, S., 2005. Feature ranking methods based on information entropy with Parzen windows. In: Proc. Internat. Conf. on Research in Electrotechnology and Applied Informatics (REI-05), Poland, pp. 109–119.
Chen, Y., Lin, C., 2006. Combining SVMs with various feature selection strategies. In: Feature Extraction, Foundations and Applications. Springer.
Göktürk, S., Tomasi, C., Acar, B., Beaulieu, C., Paik, D., Jeffrey, B., Yee, J., Napel, S., 2001. A statistical 3-D pattern processing method for computer-aided detection of polyps in CT colonography. IEEE Trans. Med. Imaging 20 (12), 1251–1260.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Machine Learning Res. 3, 1157–1182.
Jain, A.K., Duin, R.P., Mao, J., 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 4–37.
Jerebko, A., Lakare, S., Cathier, P., Periaswamy, S., Bogoni, L., 2006. Symmetric curvature patterns for colonic polyp detection. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2006), Copenhagen, pp. 169–176.
Kullback, S., 1959. Information Theory and Statistics. John Wiley and Sons, New York.
Mantaras, R., 1991. A distance-based attribute selection measure for decision tree induction. Machine Learning 6, 81–92.
Näppi, J., Yoshida, H., 2007. Fully automated three-dimensional detection of polyps in fecal-tagging CT colonography. Acad. Radiol. 14, 287–300.
Suzuki, K., Yoshida, H., Näppi, J., Dachman, A., 2006. Massive training artificial neural network (MTANN) for reduction of false positives in computer-aided detection of polyps. Med. Phys. 33 (10), 3814–3824.
van Ravesteijn, V.F., van Wijk, C., Vos, F.M., Truyen, R., Peters, J.F., Stoker, J., van Vliet, L.J., in press. Computer-aided detection of polyps in CT colonography using logistic regression. IEEE Trans. Med. Imaging.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer.
Walter Reed Army Medical Center, 2006.
Wang, Z., Liang, Z., Li, L., Li, X., Li, B., Anderson, J., Harrington, D., 2005. Reduction of false positives by internal features for polyp detection in CT-based virtual colonoscopy. Med. Phys. 32 (12), 3602–3616.
Yoshida, H., Dachman, A.H., 2005. CAD techniques, challenges, and controversies in computed tomographic colonography. Abdominal Imaging 30, 26–41.
Yoshida, H., Näppi, J., 2001. Three-dimensional computer-aided diagnosis scheme for detection of colonic polyps. IEEE Trans. Med. Imaging 20 (12), 1261–1274.
Yoshida, H., Masutani, Y., MacEneaney, P., Rubin, D., Dachman, A.H., 2002. Computerized detection of colonic polyps in CT colonography based on volumetric features: A pilot study. Radiology 222, 327–336.
Zheng, Y., Yang, X., Beddoe, G., 2006. Reduction of false positives in polyp detection using weighted support vector machines. In: Proc. 29th Annual Internat. Conf. of the IEEE EMBS, Lyon, France, pp. 4433–4436.