Artificial Intelligence in Medicine (2009) 46, 267—276
http://www.intl.elsevierhealth.com/journals/aiim
A model-free ensemble method for class prediction with application to biomedical decision making

Ralph L. Kodell a,*, Bruce A. Pearce b, Songjoon Baek c, Hojin Moon d, Hongshik Ahn e, John F. Young c, James J. Chen c

a Department of Biostatistics, #781, University of Arkansas for Medical Sciences, 4301 W. Markham St., COPH 3218, Little Rock, AR 72205, United States
b Information Technology Staff, National Center for Toxicological Research, Jefferson, AR 72079, United States
c Division of Personalized Nutrition and Medicine, National Center for Toxicological Research, Jefferson, AR 72079, United States
d Department of Mathematics and Statistics, California State University - Long Beach, Long Beach, CA 90840, United States
e Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, United States

Received 17 April 2008; received in revised form 30 October 2008; accepted 3 November 2008
KEYWORDS Cancer; Disease classification; Convex hull; Gene imprinting; Genomics; k-Nearest-neighbor; Medical screening
Summary
Objective: A classification algorithm that utilizes two-dimensional convex hulls of training-set samples is presented.
Methods and material: For each pair of predictor variables, separate convex hulls of positive and negative samples in the training set are formed, and these convex hulls are used to classify test points according to a nearest-neighbor criterion. An ensemble of these two-dimensional convex-hull classifiers is formed by trimming the m(m - 1)/2 possible classifiers derived from the m predictors to a set of classifiers composed of only unique predictor variables. Because only two-dimensional spaces are required to be populated by training-set samples, the "curse of dimensionality" is not an issue. At the same time, the power of ensemble voting is exploited by combining the classifications of the unique two-dimensional classifiers to reach a final classification.
Results: The algorithm is illustrated by application to three publicly available biomedical data sets with genomic predictors and is shown to have prediction accuracy that is competitive with a number of published classification procedures.
* Corresponding author. Tel.: +1 501 686 5353; fax: +1 501 526 6729. E-mail addresses: [email protected], [email protected] (R.L. Kodell).
© 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2008.11.001
Conclusion: Because of its superior performance in terms of sensitivity and negative predictive value compared to its competitors, the convex-hull ensemble classifier demonstrates good potential for medical screening, where often the major emphasis is placed on having reliable negative predictions.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction

Class prediction is an area of scientific endeavor that involves the use of statistical learning techniques to develop algorithms for classifying unknown samples through supervised training on samples of known class. Although the classification of unknowns based on pattern recognition is a mature field, it has received renewed attention in recent years due to advancements in biotechnology that facilitate the use of high-dimensional genomics information in biomedical decision making. Class prediction has been proposed, for example, for identifying which individuals have a disease and which are disease-free based on patterns of gene expression [1], and for predicting which cancer patients will benefit from chemotherapy and which will experience unnecessary toxic side effects based on gene-expression profiling [2]. The utility of high-dimensional genomics classifiers in the diagnosis and treatment of disease has been discussed by Simon [3]. In recent years, ensemble methods have become popular in the development of algorithms for class prediction. An ensemble method behaves like an expert committee in predicting the class to which a sample belongs. Even though individual classifiers might be somewhat weak and error-prone in making decisions, combinations of classifiers can form a powerful, highly accurate committee [4]. In most cases the class prediction is determined by a majority vote of the members of the committee, or by an average of the "opinions" of committee members. Ensembles may be assembled from virtually any type of predictive method (e.g., decision tree, neural network). Some of the better-known ensemble methods are bagging [5] and boosting [6]. With bagging (bootstrap aggregating), the N samples in the training set are re-sampled K times with replacement, and with equal probability [7], to form an ensemble of K classifiers each based on N samples [8].
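The bootstrap resampling step behind bagging can be sketched as follows. This is a minimal illustration, not code from any of the cited implementations; the function name and arguments are ours:

```python
import random

def bagging_samples(train, K, seed=0):
    """Draw K bootstrap samples: each consists of N draws with
    replacement, with equal probability, from the N training samples."""
    rng = random.Random(seed)
    N = len(train)
    return [[train[rng.randrange(N)] for _ in range(N)] for _ in range(K)]

# Each of the K resampled sets has the same size N as the original
# training set, but individual samples may repeat or be omitted.
resamples = bagging_samples(list(range(10)), K=5)
```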
With boosting, the N samples in the training set are re-sampled K times to build up an ensemble of K classifiers, with hard-to-classify samples given more sampling weight as classifiers are added to the ensemble [8]. With both of these types of ensembles, members of the ensemble are trained on different (generally, overlapping) subsets of the training set, but they are allowed to select predictor variables from the same set. Recently, a different
approach has been proposed whereby each member of the ensemble (i.e., each decision tree) uses all the samples in the training set, but bases its predictions on completely different (non-overlapping) predictor variables [9—11]. Various other methods of combining classifiers to form ensembles have been discussed [12]. In this paper, a nonparametric method is proposed for developing an ensemble classifier. First, two-dimensional convex hulls of pairs of predictor variables are constructed for each two-variable combination of predictor variables, one convex hull for positive samples and one for negative samples. Each such pair of convex hulls is trained as a classifier based on the k-nearest-neighbor (k-NN) criterion, and a subset of these individual convex-hull classifiers is selected to form an ensemble classifier. Criteria are established for determining which of the individual classifiers should be included in the ensemble. The method is applied to the colon cancer data of Alon et al. [1], the breast cancer data of van’t Veer et al. [2], and the gene imprinting data of Greally [13]. The cross-validated classification accuracy (ACC), sensitivity (SEN), specificity (SPC), positive predictive value (PPV) and negative predictive value (NPV) are compared to corresponding values for several other classification procedures. ACC is the overall proportion of correct predictions, SEN is the proportion of correctly predicted positives, SPC is the proportion of correctly predicted negatives, PPV is the proportion of correct predictions among positive predictions and NPV is the proportion of correct predictions among negative predictions.
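Given the counts of true and false positives and negatives from a set of predictions, the five measures defined above can be computed directly. The sketch below uses hypothetical counts for illustration:

```python
def performance(tp, fp, tn, fn):
    """ACC, SEN, SPC, PPV and NPV as defined in the text."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # overall proportion correct
        "SEN": tp / (tp + fn),  # correct among true positives
        "SPC": tn / (tn + fp),  # correct among true negatives
        "PPV": tp / (tp + fp),  # correct among positive predictions
        "NPV": tn / (tn + fn),  # correct among negative predictions
    }

# Hypothetical confusion counts for a 62-sample problem:
m = performance(tp=35, fp=4, tn=18, fn=5)
```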
2. Methods

2.1. Ensemble classifiers

Many methods have been proposed for combining the outputs of classifiers in an ensemble, including methods for combining class labels and methods for combining continuous outputs [14]. The ensemble classifier proposed in this paper combines class labels. Among the methods for combining class labels (simple-majority voting, weighted majority voting, behavior knowledge space and Borda count), simple-majority voting is optimal for the characteristics sought here: (1) an odd number of ensemble members, (2) independence among members, and (3) equal classification accuracy of ensemble members [14]. Thus, a simple majority of the individual classifiers' votes determines the ensemble's classification of a sample. Let n be an odd positive integer greater than 1. Suppose that there are n independent classifiers, each with probability p (0 < p < 1) of a correct classification. All n classifiers are used to predict the class of a particular subject or sample, based on a majority vote of the classifiers. (Note: making n odd prevents ties.) Because the n classifiers are independent, the binomial expansion of [p + (1 - p)]^n can be used to calculate the probability of a correct classification by the ensemble, where only the (n + 1)/2 terms with powers of p greater than or equal to (n + 1)/2 are summed together. That is, the probability of a correct classification by the ensemble is
$$\binom{n}{n} p^{n} + \binom{n}{n-1} p^{n-1}(1-p) + \cdots + \binom{n}{(n+1)/2} p^{(n+1)/2}(1-p)^{(n-1)/2}$$
$$= p \left\{ \binom{n}{n} p^{n-1} + \binom{n}{n-1} p^{n-2}(1-p) + \cdots + \binom{n}{(n+1)/2} p^{(n-1)/2}(1-p)^{(n-1)/2} \right\} \qquad (1)$$

For the ensemble to have probability greater than p of correct classification, the expression in braces in (1) must exceed 1. Hansen and Salamon [15] noted that the gain in ensemble accuracy can be shown inductively provided p > 1/2. Later, Lam and Suen [16] provided a formal derivation of the result, showing that the probability given by (1) will be equal to p when p = 1/2, less than p and strictly decreasing with n when p < 1/2, and greater than p and strictly increasing with n when p > 1/2. Table 1 gives a few examples to illustrate the property. Note that for p = 0.45 (p < 1/2), the ensemble's prediction accuracy is always less than 0.45 and declines toward 0 as n increases, i.e., as more independent classifiers are added to the ensemble. On the other hand, if p > 1/2, then even when p is near 1/2, the ensemble's prediction accuracy always exceeds p and continues to increase toward 1 as more independent classifiers are added. If the accuracy of the individual classifiers is high enough, say 0.75, then the ensemble's accuracy can be made arbitrarily close to 1 provided that enough individual independent classifiers are available. From the perspective of the prediction error as opposed to the prediction accuracy, Perrone and Cooper [17] showed that the mean squared error of the ensemble can be reduced by a factor of n compared to the error of its constituent members. Thus, under the assumption that ensemble members are mutually independent, the prediction error of the ensemble can theoretically be made arbitrarily small as members are added. Clearly, independence among classifiers in an ensemble can enhance prediction accuracy. It has also been shown that having high disagreement among classifiers is a desirable property [18], perhaps even more desirable than independence [19]. It is usually stated informally that individual classifiers should all be highly accurate, while disagreeing as much as possible in their class predictions. Generally, this involves a bias-variance trade-off (BVTO) [18]. The customary requirement of low variance is "traded off" in favor of achieving low bias, given that diversity (variance) among classifiers in an ensemble is a desirable feature. Achieving a higher variance without increasing the bias is generally accomplished via correlation reduction [20]. In fact, methods have been devised to form ensembles of negatively correlated members by introducing a correlation penalty term into the error function of each member [21]. The focus of this paper will be the development of an ensemble classifier whose members tend to be independent, given that independence is clearly a desirable property that can be exploited in a defined quantitative way in building an ensemble.
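The majority-vote probability in (1) is simply an upper tail of a binomial distribution, so the entries of Table 1 can be reproduced directly. A minimal sketch using only the standard library:

```python
from math import comb

def ensemble_accuracy(n, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes correctly (n odd)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range((n + 1) // 2, n + 1))

# Reproduces Table 1: e.g. n = 5 with p = 0.65 gives about 0.765,
# while n = 5 with p = 0.45 gives about 0.407.
```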
2.2. Two-dimensional convex-hull classifiers

A model-free approach is proposed for developing individual classifiers to be included in the ensemble. In order to exploit second-order interactions among the predictor variables [22], all pairs of predictors are examined as possible classifiers. For each pair of predictors, the two-dimensional convex hull [23] is formed for each of the two classes of samples (e.g., positive and negative values of an outcome variable). To ensure that distances used to determine nearest neighbors are uniformly measured on a normalized scale, the values of each predictor variable are first mapped to the unit interval, that is,

    normalized value = (observed value - min) / (max - min),

where max and min are the maximum and minimum observed values of the predictor variable.

Table 1  Illustration of decrease (p < 0.5), no change (p = 0.5), or increase (p > 0.5) in prediction accuracy of ensembles of n independent classifiers, each having prediction accuracy p.

     n    p = 0.45   p = 0.5   p = 0.55   p = 0.65   p = 0.75
     5    0.407      0.5       0.593      0.765      0.896
    13    0.356      0.5       0.644      0.871      0.976
    21    0.321      0.5       0.679      0.922      0.994
    55    0.228      0.5       0.772      0.989      0.999

An illustration is provided in Fig. 1. Each convex hull is the linear enclosure of the respective sets of positive and negative samples in the training set, i.e., solid squares for the positive predictor pairs and solid circles for the negative predictor pairs. The intersection of the two two-dimensional convex hulls (enclosure A, Fig. 1) is an indicator of a classifier's ability. If the pair of predictors is a good classifier, then the two convex hulls will tend to have little overlap, i.e., a small intersection. If not, then there will be substantial overlap, i.e., a large intersection. Portions of the convex hull for positive outcomes that are not in the intersection are referred to collectively as the exclusive hull for positives (enclosure B, Fig. 1), while portions of the convex hull for negative outcomes that are not in the intersection are referred to collectively as the exclusive hull for negatives (enclosure C, Fig. 1).

It is possible that individual predictor variables will show sufficient separation between the two classes of samples to serve as members of the ensemble. Operationally, to include these individual variables as classifiers in the two-dimensional convex-hull strategy, each is paired with a dummy variable, which has no effect on the classification.

In Phase I, all m(m - 1)/2 classifiers formed by all pairwise combinations of the m predictor variables are evaluated for classification accuracy. For each such classifier, all training samples residing within either of that classifier's exclusive hulls are automatically classified accordingly. Using Fig. 1 for illustration, all solid squares in enclosure B and all solid circles in enclosure C are automatically classified as cancer and non-cancer samples, respectively. For all other training samples, i.e., all solid squares and solid circles in enclosure A of Fig.
1, the distance from all training points in each of the exclusive hulls for positives (B) and negatives (C) is determined. Using a k-NN criterion (k being an odd positive integer), a sample is classified by the (one- or) two-dimensional convex-hull classifier according to which exclusive hull, B or C, contains the majority of the k nearest
neighbors. Fig. 1 illustrates this for the test points (open squares and circles) in A, using a 1-NN criterion. A classifier is retained if both its sensitivity and specificity for the training set equal or exceed a prespecified level. In Phase II, the classifiers retained in Phase I are ranked according to the number of times the two predictors that define them occur with other predictors to form other classifiers. Classifiers made up of predictors occurring least often are ranked highest. Beginning with the classifier having the highest ranking, classifiers are deleted if they share a common predictor variable with any classifier that is ranked higher. With this approach, only unique convex-hull classifiers are retained, and the ranking approach ensures the largest possible such set. Because the classifiers do not have any predictors in common, they will be more likely to operate independently from one another. If the number of retained classifiers is even, then the last classifier included is deleted in order to avoid ties in the voting. The set of unique classifiers selected this way makes up the ensemble classifier.
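The geometric machinery of a single two-dimensional classifier, the convex hull of each class plus a point-in-hull test to locate the exclusive regions, can be sketched as follows. This uses Andrew's monotone-chain algorithm and assumes points have already been normalized to the unit square; the function names are ours, not the authors':

```python
def convex_hull(points):
    """2-D convex hull (Andrew's monotone chain); returns the hull
    vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:  # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(point, hull):
    """True if `point` lies inside or on a hull given as CCW vertices."""
    n = len(hull)
    if n < 3:
        return point in hull
    for i in range(n):
        o, a = hull[i], hull[(i + 1) % n]
        # Point must not be strictly to the right of any CCW edge.
        if (a[0]-o[0])*(point[1]-o[1]) - (a[1]-o[1])*(point[0]-o[0]) < 0:
            return False
    return True
```

A test point inside exactly one class's hull falls in an exclusive region and is classified immediately; a point inside both hulls (the intersection) falls back to the k-NN vote against the two exclusive hulls described above.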
Figure 1 Two-dimensional convex hulls formed by normalized values of expression of two representative genes for tissue samples classified as cancerous (linear enclosure of solid squares, dashed lines) or non-cancerous (linear enclosure of solid circles, solid lines). Enclosure A defines the intersection or overlap area for the positive and negative convex hulls; enclosure B defines the exclusive hull for positives; and enclosure C defines the exclusive hull for negatives. Lines connecting test points denote single nearest-neighbor (1-NN) classifications of test points. Test points 1—3 are classified correctly, while test points 4 and 5 are classified incorrectly. Test points 6 and 7 are also classified correctly by residing in the correct exclusive hulls.
In Phase III, the ensemble classifier developed in Phase II is used to classify all samples in the test set. The accuracy of the ensemble is evaluated. In general, K-fold cross-validation is employed in order to assess the ensemble’s ability to generalize (i.e., to classify samples outside the training set), where commonly K = N (the total number of samples), or K = 10. N-Fold cross-validation is commonly referred to as leave-one-out cross-validation. Both N-fold and 10-fold cross-validations have become generally accepted.
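The cross-validation scheme can be sketched as generic scaffolding (not the authors' C++ implementation); here `classify` stands in for training the convex-hull ensemble on the training indices and predicting one test sample:

```python
import random

def kfold_indices(n, K, seed):
    """Randomly partition n sample indices into K disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::K] for i in range(K)]

def cross_validate(samples, labels, classify, K=10, repeats=20):
    """Average accuracy over `repeats` random K-fold splits.
    classify(train_idx, i) returns the predicted label of sample i
    after training on the samples indexed by train_idx."""
    accs = []
    for r in range(repeats):
        correct = 0
        for test in kfold_indices(len(samples), K, seed=r):
            test_set = set(test)
            train = [i for i in range(len(samples)) if i not in test_set]
            correct += sum(classify(train, i) == labels[i] for i in test)
        accs.append(correct / len(samples))
    return sum(accs) / repeats
```

Setting K = len(samples) gives leave-one-out cross-validation as a special case.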
3. Application to problems in biomedical decision making For each of the three data sets analyzed, the results of twenty repetitions of 10-fold cross-validation are reported here. Each data set was randomly split into 10 subsets. Each subset was used once as the test set, with the remaining nine subsets making up the training set for that test set. The ACC, SEN, SPC, PPV and NPV across the ten test sets, based on classifications by each convex-hull ensemble constructed in Phase II for each corresponding training set, were calculated. This 10-fold cross-validation was repeated twenty times with a different random split of the samples each time. For any of the twenty repetitions, because the ACC, SEN, SPC, PPV and NPV were measured over all 10 test subsets, the estimates should not be very dependent on a single test set. In addition, given that the cross-validated results are the averages of 20 repetitions of these 10-fold runs, the results can be viewed with confidence, even though the data sets themselves have relatively small numbers of patients. In addition to the convex-hull classifier, several other published classification methods were applied to each data set. These included Classification by Ensembles from Random Partitions (CERP), both
Classification-Tree (C-T) CERP [24] and Logistic Regression-Tree (LR-T) CERP [11]; Random Forest [25]; AdaBoost [6]; LogitBoost [26]; k-NN [27]; Shrunken Centroid [28]; Support Vector Machine (SVM) [29]; Diagonal Linear Discriminant Analysis (DLDA) [30]; and Fisher's Linear Discriminant Analysis (FLDA) [31]. Results for the convex-hull algorithm were produced by a program written in C++. Results for classifiers other than the convex-hull classifier were produced by Ahn et al. [11] and Moon et al. [24] and are also based on twenty 10-fold cross-validations. C-T CERP was implemented with a C/C++ program, while LR-T CERP was implemented with an R program that uses rpart to construct base trees. The R package randomForest was used to implement the Random Forest algorithm, with the default number of features in each node, floor(m^(1/2)). The value of ntree was 200 for the colon data and 500 for the breast data and gene imprinting data. AdaBoost and LogitBoost were implemented with boost in R, which uses the classification tree rpart with a single split as the base classifier. The number of iterations was mfinal = 100. The R package class was used to implement k-NN. The value of k and the number of predictors were selected using nested CV in the training phase with ranking based on the ratio of between-to-within group sums of squares (BW ratio). The Shrunken Centroid method was implemented with the R package pamr using a soft thresholding option and the BW ratio. The R package e1071 was used for SVM with the linear kernel. DLDA and FLDA were implemented with sma and MASS, respectively, in R, with the BW ratio for feature selection in the training phase using a nested CV.

Alon et al. [1] presented gene-expression data on 62 colon tissue samples, 40 being from patients with colon adenocarcinoma (positive cases) and 22 from subjects with normal colons (negative cases). The goal in this application is to develop a classification
Table 2  Performance (%) of classification algorithms for colon cancer data (62 samples: 40 positives, 22 negatives)^a.

Algorithm           ACC          SEN          SPC          PPV          NPV
Convex hull         86.0 (1.1)   88.0 (1.0)   82.3 (2.5)   90.0 (1.3)   79.0 (1.5)
C-T CERP            84.4 (1.4)   86.9 (2.0)   79.7 (3.3)   88.8 (1.3)   76.8 (2.9)
LR-T CERP           84.8 (2.3)   87.1 (1.5)   80.5 (4.9)   89.1 (2.5)   77.4 (2.5)
Random forest       81.2 (2.7)   87.8 (1.1)   69.3 (6.4)   83.9 (2.9)   75.6 (3.0)
AdaBoost            74.4 (2.9)   82.1 (3.0)   60.2 (4.9)   79.0 (2.3)   65.0 (4.5)
LogitBoost          74.0 (2.4)   82.3 (3.2)   59.1 (4.9)   78.6 (2.0)   64.9 (4.2)
k-NN                83.5 (3.5)   88.3 (2.3)   75.0 (8.0)   86.6 (3.7)   77.8 (4.4)
Shrunken centroid   84.7 (2.3)   86.9 (2.3)   80.7 (7.2)   89.3 (3.4)   77.2 (2.8)
SVM-linear          84.1 (2.1)   86.9 (1.6)   79.1 (4.8)   88.4 (2.4)   76.8 (2.5)
DLDA                85.1 (2.4)   86.0 (1.7)   83.4 (4.7)   90.4 (2.6)   76.6 (2.9)
FLDA                87.4 (2.1)   88.6 (1.7)   85.2 (4.6)   91.7 (2.4)   80.5 (2.6)

a Average (standard deviation) of twenty repetitions of 10-fold CV for each method.
algorithm to screen for colon cancer based on patient-specific high-dimensional genomic data to enable streamlined classification of new unlabeled tissue samples in a clinical setting. An initial set of 6500 genes whose expression levels were measured with an Affymetrix oligonucleotide array was reduced by Alon et al. to 2000 genes having the highest intensity levels across the 62 tissue samples. The cross-validated results for this subset of 2000 genes are shown in Table 2, where results in all but the first row were generated by Ahn et al. [11]. The results for the convex-hull ensemble classifier are based on a pre-specified training-set sensitivity and specificity of 0.85 with k = 1 (1-NN). As a check on sensitivity of the algorithm to these training parameters, ten 10-fold CVs were run for (0.9, 1) and ten 10-fold CVs were run for (0.85, 3), resulting in the same accuracy of 86% as for training parameters (0.85, 1) to two decimal places. One reason for choosing k = 1 for small samples like the colon data is that, with only 20 expected colon-cancer negatives in a 10-fold CV training set, the exclusive hull for negatives might not be highly populated, so that a choice of k > 1 could limit the number of classifiers achieving the desired specificity. With (0.85, 1) the numbers of unique members of an ensemble ranged from 53 to 159 over the twenty 10-fold cross-validations. To gain a measure of repeatability of individual predictions, the proportion of correct predictions by the convex-hull classifier among the twenty CV runs for each of the 62 colon samples was calculated. Fifty-one samples (16 negatives and 35 positives) were correctly predicted by the ensemble classifier in all twenty runs. Five samples (2 negatives and 3 positives) were incorrectly predicted in all twenty runs. The remaining 4 negatives were correctly predicted 4, 6, 14 and 18 times out of 20 while the remaining 2 positives were correctly predicted 1 and 3 times out of 20. 
Table 3  Performance (%) of classification algorithms for breast cancer data (78 patients: 34 positives, 44 negatives)^a.

Algorithm           ACC          SEN          SPC          PPV          NPV
Convex hull         63.2 (3.6)   54.1 (4.8)   70.2 (4.5)   58.4 (4.3)   66.4 (2.7)
C-T CERP            65.3 (2.1)   54.3 (3.9)   73.8 (3.6)   61.6 (3.1)   67.6 (1.8)
LR-T CERP           60.6 (3.0)   55.1 (5.6)   64.6 (3.7)   54.7 (3.6)   65.1 (3.1)
Random forest       62.5 (1.9)   46.8 (3.2)   74.7 (3.2)   58.9 (2.9)   64.5 (1.4)
AdaBoost            58.8 (4.1)   32.1 (8.9)   79.4 (6.9)   55.0 (9.4)   60.3 (2.8)
LogitBoost          65.2 (4.9)   55.6 (8.4)   72.6 (6.1)   61.1 (6.7)   68.0 (4.3)
k-NN                61.7 (3.6)   50.6 (8.4)   70.3 (4.7)   56.8 (4.5)   65.0 (3.7)
Shrunken centroid   60.9 (1.9)   50.6 (2.6)   68.9 (2.3)   55.7 (2.4)   64.3 (1.6)
SVM-linear          56.5 (2.9)   39.6 (5.3)   69.7 (2.7)   50.1 (4.2)   59.9 (2.5)
DLDA                62.5 (1.9)   52.4 (2.3)   70.3 (2.6)   57.8 (2.6)   65.6 (1.5)
FLDA                62.3 (2.6)   55.1 (4.9)   67.8 (2.0)   56.9 (2.9)   66.3 (2.6)

a Average (standard deviation) of twenty repetitions of 10-fold CV for each method.

Table 4  Performance (%) of classification algorithms for gene imprinting data (131 samples: 43 positives, 88 negatives)^a.

Algorithm           ACC          SEN          SPC          PPV          NPV
Convex hull         82.6 (1.5)   77.4 (4.3)   85.1 (2.2)   71.7 (3.2)   88.5 (2.0)
C-T CERP            88.0 (1.1)   71.6 (2.6)   96.0 (1.2)   89.8 (2.8)   87.4 (1.0)
LR-T CERP           88.7 (1.3)   71.7 (3.6)   97.0 (1.6)   92.1 (3.9)   87.5 (1.4)
Random forest       87.9 (1.2)   65.1 (3.0)   99.0 (0.8)   97.0 (2.4)   85.3 (1.1)
AdaBoost            71.5 (3.5)   44.2 (9.0)   84.9 (3.2)   58.9 (7.1)   75.7 (3.0)
LogitBoost          84.1 (2.6)   71.6 (3.2)   90.2 (2.8)   78.1 (4.9)   86.7 (1.4)
k-NN                78.4 (2.4)   66.7 (4.6)   84.1 (3.1)   67.2 (4.6)   83.8 (1.9)
Shrunken centroid   83.5 (1.7)   70.1 (3.1)   90.0 (1.8)   77.4 (3.2)   86.0 (1.3)
SVM-linear          84.6 (1.7)   70.0 (3.6)   91.7 (1.7)   80.5 (3.3)   86.2 (1.4)
DLDA                85.9 (1.6)   62.7 (4.3)   97.3 (1.1)   91.9 (3.1)   84.2 (1.5)
FLDA                79.4 (2.4)   62.8 (5.9)   87.5 (2.8)   71.1 (5.0)   82.8 (2.3)

a Average (standard deviation) of twenty repetitions of 10-fold CV for each method.

Thus, 82% of the samples were easy to classify correctly, 8% were impossible to classify correctly, and 10% had intermediate degrees of difficulty. Note that the degree of repeatability of predictions is not necessarily an indicator of the degree of confidence in the accuracy of the predictions.

van't Veer et al. [2] presented data on gene expression on 78 primary breast cancer patients who had undergone surgery. A 70-gene expression signature identified by van't Veer and colleagues for distinguishing patients with good and bad prognoses in terms of distant metastasis-free survival is being compared to a common clinical-pathological prognostic tool in a multi-center, prospective, phase III randomized study of node-negative breast cancer patients. The study, termed the MINDACT trial (Microarray In Node negative Disease may Avoid ChemoTherapy), is described at http://www.breastinternationalgroup.org/TRANSBIG/MINDACT (Accessed: 29 October 2008). The objective of the study is to validate the 70-gene profile for increasing the percentage of node-negative patients who are spared chemotherapy from the current level of 15-20%, which is based on conventional criteria. Thus, although it is customary to subject a high percentage of post-surgery patients to adjuvant chemotherapy, it may be possible to identify patients with a good prognosis for whom the chemotherapy would not necessarily be beneficial, in which case these patients could be spared the toxic side effects of the adjuvant treatment. This is the goal of the classification analysis. In the study of van't Veer et al. [2], there were 34 patients classified with a poor prognosis (likely to develop distant metastases within 5 years: positive cases) and 44 patients classified with a good prognosis (unlikely to develop distant metastases within 5 years: negative cases). Out of approximately 25,000 genes, the expression levels of approximately 5000 genes determined by van't Veer
et al. to be significantly regulated among the 78 patients were selected for the classification analysis. The cross-validated results are shown in Table 3, where results in all but the first row were generated by Moon et al. [24]. The results for the convex-hull ensemble classifier are based on a pre-specified training-set sensitivity and specificity of 0.8 with k = 1 (1-NN). Here too, k = 1 was chosen considering that if the exclusive hull for positives were not highly populated in any given 10-fold CV training set, a choice of k > 1 could limit the number of classifiers achieving the desired sensitivity. With data sets like the breast data, where the classification accuracy is relatively low, setting the training sensitivity and specificity too high could result in too few classifiers that satisfy the criterion; on the other hand, setting the value too low will eventually result in deterioration of the accuracy. The numbers of unique members of an ensemble for (0.8, 1) ranged from 15 to 175 over the twenty 10-fold CVs. Imprinted genes give rise to numerous human diseases because of the silencing of expression of one of the two homologs at an imprinted locus. Greally [13] described the first characteristic sequence parameter that discriminates imprinted regions, and provided data on 131 samples at http://greallylab.aecom.yu.edu/greally/imprinting_data.txt (Accessed: 1 September 2006), which were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu; Accessed: 1 September 2006). The data set consists of 43 imprinted genes and 88 non-imprinted genes, along with 1446 predictors. The data set used here is the reduced set used by Ahn et al. [11], in which the set of predictors was reduced to 1248 by eliminating predictors that had identical values for more than 98% of the samples. The objective in this application is to classify the samples according to imprinting status in order to determine predisposition to disease. 
The cross-validated results are shown in Table 4, where results
in all but the first row were generated by Ahn et al. [11]. The results for the convex-hull ensemble classifier are based on a pre-specified training-set sensitivity and specificity of 0.8 with k = 1. The numbers of unique members of an ensemble ranged from 1 to 13 over the twenty 10-fold cross-validations. Here, even though the classification accuracy is fairly high, the ensemble sizes are smaller, perhaps due to a smaller number of predictors from which to choose.

As can be seen in Tables 2-4, the standard deviations indicate fairly good repeatability of predictions for most methods, with a few exceptions. For AdaBoost, LogitBoost and k-NN, the standard deviations for all performance measures tended to be among the largest across the three data sets. The convex-hull procedure tended to have standard deviations in line with the majority of classifiers. To further evaluate the performance of the convex-hull classifier relative to that of the other classifiers across the three data sets, a rank analysis was performed. First, the ACCs, SENs, SPCs, PPVs and NPVs of the classifiers were individually ranked within each data set, with a rank of 1 assigned to the highest value of each measure. For each measure, the average rank across the three data sets was then calculated for each classifier and tabulated in Table 5. Second, the ranks of SEN and SPC were averaged for each classifier within each data set, and then averaged again across the three data sets. A similar average was calculated for PPV and NPV. These average ranks are also given in Table 5. The latter two 'unweighted' averages of ranks give a measure of balance between the respective components. The accuracy, because it is a weighted average of SEN and SPC (as well as of PPV and NPV), can be high even when there is a large discrepancy between the respective components. The top three performers in each column of Table 5 are marked with an asterisk.

Table 5  Average ranks of performance measures across the three data sets: colon cancer, breast cancer, and gene imprinting^a.

Algorithm           ACC     SEN     SPC     AVGSS^b   PPV     NPV     AVGPN^c
Convex hull         4.3*    3.0*    6.3     4.7*      5.5     2.0*    3.5*
C-T CERP            3.0*    4.8     4.3*    4.6*      3.7*    3.8*    3.8*
LR-T CERP           4.7     3.2*    6.3     4.8*      5.7     4.0*    4.8*
Random forest       5.5     7.0     4.0*    5.5       4.3*    8.0     6.2
AdaBoost            10.3    11.0    7.0     9.0       10.0    10.3    10.2
LogitBoost          6.3     4.8     7.0     5.9       6.3     5.3     5.8
k-NN                8.3     5.5     8.2     6.8       8.3     6.3     7.3
Shrunken centroid   6.7     6.5     6.7     6.6       6.3     6.7     6.5
SVM-linear          7.7     7.7     6.7     7.2       7.7     7.5     7.6
DLDA                3.8*    8.3     3.2*    5.8       3.3*    7.0     5.2
FLDA                5.3     4.2*    6.3     5.3       5.3     5.0     5.2

a Top three average ranks in each column are marked with an asterisk (bold-faced type in the original).
b AVGSS is the average across the three data sets of the average rank of SEN and SPC within each data set.
c AVGPN is the average across the three data sets of the average rank of PPV and NPV within each data set.

The convex-hull procedure was consistently among the top three classifiers, and ranked highest on average for SEN, NPV and the average of PPV and NPV. It should be noted that Demsar [32] has recommended formal nonparametric statistical tests based on ranks for comparing two or more classifiers over multiple data sets.
4. Discussion

The proposed convex-hull ensemble classifier performed well relative to the competing procedures against which it was compared here. Its accuracy across the three biomedical data sets ranked, on average, in the top three among the classifiers considered. Importantly, the convex-hull classifier had the highest average rank for both SEN and NPV. In medical screening, the major emphasis is often placed on having reliable negative predictions (e.g., the PSA test for prostate cancer screening): positive results can always be followed up by additional, confirmatory tests, but negative results provide no compelling reason for follow-up. The fact that the convex-hull procedure showed high SEN and NPV relative to the other classifiers for data sets having two-thirds positive cases (colon cancer), two-thirds negative cases (gene imprinting), and a near-even split of positives and negatives (breast cancer) indicates that it is not strongly influenced by the positive-to-negative ratio of the training set.

An important structural advantage of the convex-hull classifier is that it does not depend on complete predictor data. Missing predictor values for some samples, which can occur in medical settings, do not preclude those samples from being used to develop an ensemble classifier or from being classified by a trained classifier. Many classifiers require complete data (e.g., k-NN),
which can force either the elimination of incomplete samples or the imputation of their missing values. The convex-hull classifier may have an additional advantage in biomedical decision making in that it is easy for physicians to understand and to explain to their patients. As patients become increasingly involved in decisions regarding their course of treatment for cancer and other diseases, the ability of physicians to explain confidently to their patients the analytical tools available to help them choose among the available treatment options, such as whether or not to undergo adjuvant chemotherapy following surgery, becomes critical. The convex-hull classifier's geometrical interpretation makes it a translational-research tool that is easy to visualize and explain, even for complex genomic information like gene-expression data from microarray screens. There is no esoteric mathematical algorithm hidden in a black box. For this reason, it should have broad appeal to researchers, practitioners and patients.

The proposed convex-hull approach is model-free. It does not employ a regression model, but instead exploits differences between the classes with respect to the two-dimensional geometry of the predictors. It is straightforward to extend the two-dimensional convex-hull procedure to higher dimensions. However, computational time would increase dramatically, and it is not anticipated that a great improvement in accuracy would be achieved; that is, it is conjectured that the first- and second-order interactions capture most of the predictive information [22]. In addition, by considering only two dimensions at a time, the proposed approach tends to alleviate the need for ever-increasing sample sizes as the dimension increases in order to maintain the same density of the sample space. For fixed sample sizes, this ‘‘curse of dimensionality’’ [33] causes training-set prediction spaces to be sparsely populated as the dimension increases, because of the ever-increasing distance between sample points, which can lead to poor out-of-sample prediction by classifiers. With the two-dimensional approach, the prediction space for each classifier in the ensemble needs to be populated with training-set points in only two dimensions at a time. For data sets having many more samples than predictors, Bay [34] showed that an ensemble classifier made up of multiple nearest-neighbor classifiers, each based on a random subset of predictors, performed well.

Although the convex-hull classifier showed good balance between SEN and SPC and between PPV and NPV, it can be seen from Tables 2—4 that all classifiers, including the convex-hull classifier, tend to favor the majority class. The issue of balance between the two classes of samples is addressed elsewhere [35,36]: Young et al. [35] showed that unbalanced class sizes lead to unbalanced sensitivity and specificity, while Chen et al. [36] proposed methods for achieving balance between sensitivity and specificity when classifying unknown samples in data sets with unbalanced class sizes.

An ensemble approach to class prediction can produce a classifier with higher accuracy than any single classifier in the ensemble. Of course, higher accuracy is guaranteed only for the training data, and the convex-hull ensemble's accuracy may drop below the pre-specified accuracy of the individual classifiers when applied to test data. In fact, there is no guarantee that an individual classifier's accuracy for a particular test set will not drop below 0.5, the level needed to ensure an increase in accuracy, no matter how high the training accuracy is set.
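The role of the 0.5 threshold can be illustrated with the classical majority-vote calculation. This idealized sketch assumes independent, equally accurate voters, which real ensemble members only approximate; it is not a property derived from the convex-hull classifier itself.

```python
# Majority-vote accuracy of n independent classifiers, each correct
# with probability p. Above p = 0.5 voting amplifies accuracy; below
# 0.5 it degrades it -- the point made in the text.
from math import comb

def majority_vote_accuracy(n, p):
    """P(a majority of n independent voters is correct); n odd avoids ties."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

print(round(majority_vote_accuracy(11, 0.60), 3))  # → 0.753
print(round(majority_vote_accuracy(11, 0.45), 3))  # → 0.367
```

With eleven voters, individual accuracy 0.60 is amplified to about 0.75, while individual accuracy 0.45 is degraded to about 0.37, which is why a test-set accuracy below 0.5 for individual members undermines the ensemble.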
For the convex-hull classifier, the cross-validated accuracy for the colon cancer data was 86.0% with the training accuracy set to either 85% or 90%; for the gene imprinting data the cross-validated accuracy was 82.6% with the training accuracy set to 80%; however, the cross-validated accuracy for the breast cancer data set was only 63.2% with the training accuracy set to 80%. Conceptually, any ensemble's predictivity can be enhanced by the addition of still more independent classifiers. Hence, one could envision an ensemble classifier made up of genomic descriptors combined with any other type of predictor variable, including categorical variables; the approach used in this paper could be used to build the expanded ensemble. Of course, there is only so much information in the data that can be used for class prediction. If the classification of the ‘‘known’’ samples is subject to substantial uncertainty, then a plateau of predictive accuracy might be reached that cannot be exceeded even with the addition of more ensemble members.

The convex-hull approach provides a springboard for several avenues of future research. In particular, its use for the important problem of variable selection [37—39] is being explored by the authors. Its geometric underpinnings make it amenable to selective voting based on the location of test points relative to training points, in terms of both distance and direction. It is anticipated that selective voting is a way to deal with population heterogeneity in variable selection, so that a ‘‘one-size-fits-all’’ algorithm need not be applied in the prediction of unknown samples. The algorithm is CPU-intensive relative to other classification algorithms because of the need to construct two convex hulls for each of the mC2 potential classifiers. However, the authors are exploring ways to increase the speed of the algorithm, and preliminary results are encouraging.
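As an illustration of the geometric core discussed above, and not the authors' implementation, one two-dimensional member of the ensemble can be sketched as follows: build separate convex hulls (here via Andrew's monotone chain) for the positive and negative training points of a single predictor pair, then vote for the class whose hull is nearer to the test point, with distance zero when the point falls inside a hull. With m predictors there are m(m-1)/2 such candidate pairs, which is the source of the CPU cost noted above. All function names and data are illustrative.

```python
# Illustrative sketch of one two-dimensional convex-hull classifier:
# two hulls per predictor pair, nearest hull wins.

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def seg_dist(p, a, b):
    """Euclidean distance from point p to segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == dy == 0:
        return ((p[0]-a[0])**2 + (p[1]-a[1])**2) ** 0.5
    t = max(0.0, min(1.0, ((p[0]-a[0])*dx + (p[1]-a[1])*dy) / (dx*dx + dy*dy)))
    cx, cy = a[0] + t*dx, a[1] + t*dy
    return ((p[0]-cx)**2 + (p[1]-cy)**2) ** 0.5

def hull_dist(p, hull):
    """0 if p lies inside the hull, else distance to the nearest edge."""
    n = len(hull)
    if n == 1:
        return seg_dist(p, hull[0], hull[0])
    edges = [(hull[i], hull[(i + 1) % n]) for i in range(n)]
    if n >= 3 and all(cross(a, b, p) >= 0 for a, b in edges):
        return 0.0  # inside a counter-clockwise hull
    return min(seg_dist(p, a, b) for a, b in edges)

def classify(p, pos_pts, neg_pts):
    """Vote for the class whose convex hull is nearer to p."""
    d_pos = hull_dist(p, convex_hull(pos_pts))
    d_neg = hull_dist(p, convex_hull(neg_pts))
    return "positive" if d_pos <= d_neg else "negative"

pos = [(0, 0), (2, 0), (1, 2)]   # toy positive class, one predictor pair
neg = [(5, 5), (7, 5), (6, 7)]   # toy negative class
print(classify((1, 1), pos, neg))  # inside the positive hull → "positive"
print(classify((6, 6), pos, neg))  # inside the negative hull → "negative"
```

In the ensemble version, one such vote would be collected from each selected predictor pair and the final class taken by majority, so the hull constructions above are repeated for every pair, which is what makes the procedure CPU-intensive.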
References

[1] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 1999;96:6745—50. [2] van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530—6. [3] Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology 2005;96:1—10. [4] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference and prediction. New York: Springer; 2001. [5] Breiman L. Bagging predictors. Machine Learning 1996;24:123—40. [6] Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997;55:119—39. [7] Efron B. Nonparametric estimates of standard error: the jackknife, bootstrap and other methods. Biometrika 1981;68:589—99. [8] Bauer E, Kohavi R. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 1999;36:105—39. [9] Tong W, Hong H, Fang H, Xie Q, Perkins R. Decision forest: combining the predictions of multiple independent decision tree models. Journal of Chemical Information and Computer Sciences 2003;43:525—31. [10] Moon H, Ahn H, Kodell RL, Lin C-J, Chen JJ. Classification methods for the development of genomic signatures from high-dimensional genomic data. Genome Biology 2006;7:R121.1—7. [11] Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL. Classification by ensembles from random partitions of high-dimensional data. Computational Statistics and Data Analysis 2007;51:6166—79.
[12] Kim H. Combining decision trees using systematic patterns. Computing Science and Statistics 2002;33:608—17. [13] Greally JM. Short interspersed transposable elements (SINEs) are excluded from imprinted regions in the human genome. Proceedings of the National Academy of Sciences 2002;99:327—32. [14] Polikar R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 2006;6(3):21—45. [15] Hansen LK, Salamon P. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 1990;12:993—1001. [16] Lam L, Suen CY. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems Man and Cybernetics 1997;27:553—68. [17] Perrone MP, Cooper LN. When networks disagree: ensemble methods for hybrid neural networks. In: Mammone RJ, editor. Neural networks for speech and image processing. New York: Chapman-Hall; 1993. p. 126—42. [18] Krogh A, Vedelsby J. Neural network ensembles, cross validation and active learning. In: Tesauro G, Touretzky DS, Leen TK, editors. Advances in neural information processing systems, vol. 7. Cambridge: MIT Press; 1995. p. 231—8. [19] Opitz DW, Shavlik JW. Generating accurate and diverse members of a neural-network ensemble. In: Touretzky DS, Mozer MC, Hasselmo ME, editors. Advances in neural information processing, vol. 8. Cambridge: MIT Press; 1996. p. 535—41. [20] Tumer K, Ghosh J. Error correlation and error reduction in ensemble classifiers. Connection Science 1996;8(3/4):385—404. [21] Liu Y, Yao X, Higuchi T. Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation 2000;4(4):380—7. [22] Foster DP, Stine RA. Variable selection in data mining: building a predictive model for bankruptcy. Journal of the American Statistical Association 2004;99:303—13. [23] Barber CB, Dobkin DP, Huhdanpaa HT. The quickhull algorithm for convex hulls. 
ACM Transactions on Mathematical Software 1996;22:469—83. [24] Moon H, Ahn H, Kodell RL, Baek S, Lin C-J, Chen JJ. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine 2007;41:197—207.
[25] Breiman L. Random forests. Machine Learning 2001;45:5—32. [26] Friedman J, Hastie T, Tibshirani R. Adaptive logistic regression: a statistical view of boosting. Annals of Statistics 2000;28:337—74. [27] Cover TM, Hart PE. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 1967;IT-13:21—7. [28] Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 2002;99:6567—72. [29] Vapnik V. The nature of statistical learning theory. New York: Springer; 1995. [30] Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002;97:77—87. [31] Mardia KV, Kent JT, Bibby JM. Multivariate analysis. San Diego: Academic Press; 1979. [32] Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2006;7:1—30. [33] Bellman RE. Dynamic programming. Princeton, NJ: Princeton University Press; 1957 [Republished Dover: ISBN 0486428095, 2003]. [34] Bay SD. Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis 1999;3:191—209. [35] Young JF, Tsai C-A, Chen JJ, Latendresse JR, Kodell RL. Database composition can affect the structure-activity relationship prediction. Journal of Toxicology and Environmental Health Part A 2006;69:1527—40. [36] Chen JJ, Tsai C-A, Young JF, Kodell RL. Classification ensembles for unbalanced class sizes in predictive toxicology. SAR and QSAR in Environmental Research 2005;16:517—29. [37] Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998;20:832—44. [38] Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. 
Computational Statistics and Data Analysis 2005;48:869—85. [39] Tsymbal A, Pechenizkiy M, Cunningham P. Diversity in search strategies for ensemble feature selection. Information Fusion 2005;6:83—98.