Pattern Recognition 42 (2009) 1274 -- 1283
Ensemble component selection for improving ICA based microarray data prediction models

Kun-Hong Liu a,∗, Bo Li b, Jun Zhang c, Ji-Xiang Du d

a School of Software, Xiamen University, 361005, Xiamen, Fujian, P.R. China
b School of Computer Science and Technology, Wuhan University of Science and Technology, 430081, 947 Heping Road, Wuhan, Hubei, P.R. China
c School of Electronic Science and Technology, Anhui University, Hefei, Anhui, P.R. China
d Department of Computer Science and Technology, Huaqiao University, Quanzhou 362021, Fujian, P.R. China

∗ Corresponding author at: School of Software, Xiamen University, China. E-mail address: [email protected] (K.-H. Liu).
ARTICLE INFO

Article history:
Received 12 October 2007
Received in revised form 9 January 2009
Accepted 19 January 2009

Keywords:
Microarray data
Independent component analysis (ICA)
Multiple classifier system (MCS)
Ensemble component selection
Multi-objective genetic algorithm (MOGA)
Global-recording technique

ABSTRACT
Independent component analysis (ICA) has been widely used to tackle the microarray dataset classification problem, but there remains an unsolved problem: the independent component (IC) sets may not be reproducible after different ICA transformations. Inspired by the idea of ensemble feature selection, we design an ICA based ensemble learning system to fully utilize the differences among different IC sets. In this system, several IC sets are first generated by different ICA transformations. A multi-objective genetic algorithm (MOGA) is designed to select different biologically significant IC subsets from these IC sets, which are then applied to build base classifiers. Three schemes are used to fuse these base classifiers. The first fusion scheme is to combine all individuals in the final generation of the MOGA. In addition, during the evolution, we design a global-recording technique to record the best IC subsets of each IC set in a global-recording list. The IC subsets in the list are then deployed to build base classifiers so as to implement the second fusion scheme. Furthermore, by pruning about half of the less accurate base classifiers obtained by the second scheme, a compact and more accurate ensemble system is built, which is regarded as the third fusion scheme. Three microarray datasets are used to test the ensemble systems, and the corresponding results demonstrate that these ensemble schemes can further improve the performance of the ICA based classification model, and that the third fusion scheme leads to the most accurate ensemble system with the smallest ensemble size.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Recent advances in microarray techniques allow scientists to diagnose and classify some particular cancers directly based on DNA microarray data. A variety of algorithms and mathematical models are currently used for the management, analysis and interpretation of these high-density microarray data. The independent component analysis (ICA) transformation has been applied to the analysis of microarray data with great success, and many ICA based algorithms and methodologies have been proposed to analyze microarray data [1–7]. In these papers, the authors mainly paid attention to the biological interpretation of ICA results, while the discussion of how to select proper independent components (ICs) for different prediction systems is weak or missing altogether. However, it was found that the dominant ICs are related
to particular biological or experimental effects, and the component weights are either tumor cluster or chromosomal aberration specific [1,6]. IC selection is therefore a necessary preprocessing step for ICA based prediction models: by using proper IC subsets, the performance of an ICA based prediction system can be further improved. However, no universal rule for IC selection is available so far. The reasons lie in several aspects. First, neither the energies nor the biological significance of different ICs can be determined immediately, so the simple principle used for principal component (PC) selection in the principal component analysis (PCA) transformation cannot be applied to IC selection. Second, different ICA algorithms are designed based on different estimation rules and objective functions, so they will generate different IC sets even for the same source data. Third, the results obtained from an ICA algorithm are not `ordered', and different source matrices will be obtained if different numbers of ICs are set for the same observed signal. What is more, it is not clear whether the contribution of a given IC to the final prediction result varies greatly across different ICA based classification systems.
In short, IC selection is still an unsolved problem, and it is impossible to set up a simple and universal rule to guide IC selection for all classification systems. Similar to the feature selection problem, however, applying efficient selection algorithms is a promising way to select a proper IC subset from an IC set. In [8,9], the sequential floating forward selection (SFFS) algorithm and the genetic algorithm (GA) were successfully used to deal with the IC selection problem for different ICA based models. With these methods, an optimal IC subset is selected from an IC set and then used to build an accurate classifier. However, it has been pointed out that one classifier cannot always guarantee high classification accuracy. Instead, a multiple classifier system (MCS) is often a better choice, and it has been shown in many fields that an MCS is more robust than an excellent single classifier [10].

Ensemble feature selection (EFS) is an efficient method to construct an MCS [11], which is implemented by training base classifiers with different feature subsets. There are two methods to implement an EFS system: (1) the evolutionary method, which builds base classifiers using optimal feature subsets selected by evolutionary approaches [11–13]; (2) the random subspace method, which randomly selects different feature subsets to construct different base classifiers [14]. The second method consumes less time, but due to the randomization its performance may vary with the increase of the ensemble size. In contrast, the first method requires more time but can usually achieve stable and accurate results. So we only address the first method in this paper. Inspired by EFS, it is obvious that when building base classifiers with different IC subsets extracted from different IC sets, high diversity among the base classifiers will be achieved. So an ensemble learning scheme for ICA is promising. It should be pointed out that there are two completely different concepts of ensemble learning for ICA: the first is Bayesian ICA, where an ICA model is optimized over a parametric distribution that approximates the intractable true posterior distribution [15,16]; the other is to build an MCS using different IC subsets. In this paper, ensemble learning refers only to the second one. Similar to EFS, it is named ensemble independent component selection (EICS) in this paper.

It was found that ICA is not always reproducible when used to analyze gene expression data, even when the same ICA algorithm is used [1,5]. That is, different ICA transformations would generate different IC sets. The instability of the ICA algorithm is hard to deal with, and in this paper we do not try to solve this problem. On the contrary, we propose a multi-objective genetic algorithm (MOGA) based EICS approach, which utilizes the differences among IC sets to generate diverse base classifiers. In this approach, ICA transformations are applied to the training set with random initializations so as to produce different IC sets. A MOGA is applied to select informative IC subsets from the different IC sets to build base classifiers, and these classifiers are then selectively combined to construct an ensemble system according to the majority vote rule. The goals of the MOGA are to minimize the error rate of the classifiers and, at the same time, to maximize the covering of the IC sets. The second goal is used to extract different IC subsets from the different IC sets equally, so that the diversity of the classifiers can be guaranteed. A global-recording technique is designed to record, in a global-recording list, the best two IC subsets discovered for each IC set during the evolution. After the MOGA stops, three fusion schemes are used to combine the classifiers. The first scheme is to combine all the individuals in the final generation of the MOGA, and the second is to combine all individuals in the global-recording list. In addition, by combining only the individuals with above-average accuracy in the list, an efficient and compact ensemble system is built, which implements the third ensemble scheme. Testing these schemes on three microarray datasets, we find that they are efficient and effective.

This paper is organized as follows. Section 2 presents the ICA based microarray dataset classification model. The framework of the
MOGA based EICS is described in Section 3. In Section 4, experimental results are shown to demonstrate the effectiveness of the EICS methods along with corresponding discussions. Then Section 5 concludes this paper.

2. The ICA based classification model

To illustrate a typical ICA processing scheme, assume an n×p data matrix X, whose rows r_i (i = 1, ..., n) correspond to observational variables and whose columns c_j (j = 1, ..., p) are the individuals of the corresponding variables. Then the ICA model of X is:

X = AS    (1)
Without loss of generality, A is an n×n matrix, and S is an n×p source matrix whose rows are as statistically independent as possible. The variables in the rows of S are the ICs, and the statistical independence between the variables is quantified by the mutual information I = Σ_k H(S_k) − H(S), where H(S_k) is the marginal entropy of the variable S_k and H(S) is the joint entropy. Estimating the ICs is accomplished according to the following formula:

U = S = A^{-1} X = W X    (2)
Let matrix X denote the gene expression data; it is then described as a linear mixture of statistically independent basis snapshots (eigenassays) S combined by an unknown mixing matrix A. In this approach, ICA is used to find a weight matrix W such that the rows of U are as statistically independent as possible. The independent eigenassays estimated by the rows of U are then used to represent the snapshots. The representation of the snapshots consists of their corresponding coordinates with respect to the eigenassays defined by the rows of U, i.e.,

r_j = a_{j1} u_1 + a_{j2} u_2 + ... + a_{jn} u_n    (3)
The original training data sets X_tn and test data sets X_tt are transposed so that they can be applied to evaluate the ICs with the following formulae:

U = W_tn X_tn = A_tn^{-1} X_tn    (4)

X_tn = A_tn U    (5)

The rows of A_tn contain the coefficients of the linear combination of statistical sources that comprise X_tn. Then the representation of the test set X_tt is calculated as:

A_tt = X_tt U^{-1}    (6)
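To make Eqs. (4)–(6) concrete, the following is a minimal sketch of this representation step, assuming scikit-learn's FastICA; the function name and the default of 50 components are illustrative, and transform() is used for the test projection, which corresponds (up to mean-centering) to the pseudo-inverse projection of Eq. (6). It is not the authors' original code.

```python
# A minimal sketch of the ICA representation of Eqs. (4)-(6), assuming
# scikit-learn's FastICA; names (ica_representation, n_ics) are illustrative.
import numpy as np
from sklearn.decomposition import FastICA

def ica_representation(X_tn, X_tt, n_ics=50, seed=None):
    """X_tn, X_tt: (samples x genes) training and test expression matrices.

    Returns the IC coordinates of the training and test samples, playing the
    roles of A_tn (Eq. (5)) and A_tt (Eq. (6)), plus the estimated sources U.
    """
    ica = FastICA(n_components=n_ics, random_state=seed, max_iter=1000)
    A_tn = ica.fit_transform(X_tn)  # training-sample coordinates w.r.t. the ICs
    U = ica.mixing_.T               # estimated independent eigenassays (rows of U)
    A_tt = ica.transform(X_tt)      # test-sample coordinates (the role of A_tt)
    return A_tn, A_tt, U
```

A base classifier is then trained on selected columns of A_tn and evaluated on the same columns of A_tt.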
After selecting some special ICs, formulas (1)–(6) are still applicable by adjusting A_tn to n×m, S to m×p and A_tt to k×m if m ICs are selected. The ICA model is then constructed based on the selected ICs. In detail, after an ICA transformation, an IC subset is selected from the IC set to construct an IC subspace; a classifier is trained and then used to classify new samples in this IC subspace. In this study, we first employ FastICA [17] on gene expression data to generate different IC sets. It is quite possible that each generated IC set contains some ICs with special biological significance, so it is important to find these ICs. Unfortunately, this is still a hard task. In [6], the authors suggested that the most reliable way to select components is to manually inspect and evaluate ICs according to the corresponding component loading and the distribution of the mixing matrix A. Although we can discover which ICs are of biological significance using some visualizing and clustering techniques, it is impractical to analyze all ICs every time an ICA based model is used. What is more, it is important to note that the best components do not always constitute the best subset
for classification, and a given component can provide more information when combined with certain other components than when considered alone [19]. So the best IC subset for classification may also contain some seemingly less important ICs. Based on this observation, all generated ICs are equal candidates in our ensemble systems. Whether an IC is included in an IC subset is directly determined by its contribution to the accuracy of the corresponding base classifier. In this way, it is easy to discover optimal IC subsets.

3. The design of EICS

3.1. The framework of EICS

To the best of our knowledge, there is only one paper discussing EICS so far [18]. In [18], Cheng et al. proposed an EICS based on the random subspace method, which randomly selects different IC subsets to construct base classifiers. The efficiency of this EICS was verified in a face recognition task. However, this method is not applicable to the classification of microarray datasets. Because a microarray dataset contains thousands of genes, the ICA algorithms must search for the maxima of a target function in a high dimensional space. As a result, most algorithms may get stuck in local maxima during this search, and ICA is not always reproducible when used to analyze gene expression data, even with the same ICA algorithm [1]. Moreover, the results may be sensitive to the initializations. To avoid the problem of converging to local optima, a widely used method is `credibility', which was proposed in [7]. In this method, the independent source estimation is run several times with different random initializations so as to obtain several IC sets. By doing so,
the ICs in one IC set are usually quite different from those in other IC sets. Usually, some ICs will appear in several different IC sets, but some may appear only once. With this method, only the eigenassays with a frequency larger than a certain threshold are conserved, and the appearance frequencies of these eigenassays are used as credibility indices to evaluate the stability of the results. For the i-th IC, the credibility is calculated by credibility_i = a_i/n, where a_i is the frequency with which it appears in all IC sets, and n is the total number of IC sets. This method has been used as a preprocessing step when applying IC selection algorithms [8,9]. However, as pointed out in [6,7], low credibility does not necessarily mean low biological significance, and vice versa. So the frequency of an IC is not a perfect criterion for IC selection.

Instead of solving this problem, we design an ensemble IC selection system that utilizes the differences among the generated IC sets. The framework of our EICS is illustrated in Fig. 1. In this system, several different IC sets are first generated by different ICA transformations with random initializations, and then an algorithm is deployed to search for biologically significant IC subsets in each IC set. For an IC, the chance of being selected is based on its contribution to the classification, instead of its appearance frequency in different IC sets. As a result, potentially important ICs can be found even when their appearance frequencies are very low. The selected IC subsets are used to build classifiers, and only accurate classifiers are used to construct an MCS with the majority vote rule. In this scheme, the diversity among the IC sets is no longer a troublesome problem to deal with. On the contrary, it benefits the EICS scheme because it provides an easy way to maintain the diversity among base classifiers, which is important to the generalization capability of an ensemble system. As the work in [18] is based on a single ICA transformation, it can be viewed as a special case of this scheme.
Fig. 1. The process of the ensemble independent component selection system. First, different IC sets are obtained by applying the ICA algorithm to a training set with random initializations. Then a selection algorithm is used to discover informative IC subsets in the respective IC sets, which are deployed to train base classifiers. Finally, base classifiers are selectively combined to construct an ensemble system. When a new sample arrives, all selected base classifiers are deployed to classify it, and their outputs are fused to produce the final decision with the majority vote rule.
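The fusion step at the end of this pipeline is a plain majority vote over the selected base classifiers. A minimal sketch, assuming integer-coded class labels and classifiers with a scikit-learn style predict(); the function and argument names are illustrative:

```python
# A sketch of the majority-vote fusion in Fig. 1; test_reprs[k] is the test
# set projected into the IC subspace used by base classifier k, and labels
# are assumed to be non-negative integers. Names are illustrative.
import numpy as np

def majority_vote(base_classifiers, test_reprs):
    votes = np.array([clf.predict(Xk)
                      for clf, Xk in zip(base_classifiers, test_reprs)])
    # For each test sample (one column of `votes`), pick the most frequent label.
    return np.array([np.bincount(sample_votes).argmax()
                     for sample_votes in votes.T])
```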
Fig. 2. The chromosome design scheme of the GA. Each chromosome represents a selected IC subset and consists of two parts: the first part represents the index of the IC set, and the second part represents the mask for the IC subset. If there are N IC sets, the length of the first part is ⌈log2(N)⌉, i.e., the minimum integer not smaller than log2(N). In the process of decoding, these bits are converted to an integer, which indicates the index of the selected IC set. The length of the second part is equal to the number of ICs, which may vary for different microarray datasets. In this part, each gene takes the value 1/0 to represent whether the corresponding IC is/is not selected.
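A small sketch of the decoding step described in Fig. 2, assuming 50 IC sets and a fixed number of ICs per set; the modulo wrapping of unused bit patterns is our own illustrative choice, not specified in the paper:

```python
# A sketch of the chromosome decoding of Fig. 2. The first ceil(log2(N)) bits
# give the IC-set index, the remaining bits form the IC mask; names and the
# modulo handling of out-of-range codes are illustrative assumptions.
import math

def decode_chromosome(bits, n_ic_sets=50, n_ics=50):
    head = math.ceil(math.log2(n_ic_sets))              # 6 bits when n_ic_sets = 50
    set_index = int("".join(map(str, bits[:head])), 2) % n_ic_sets
    mask = [i for i, b in enumerate(bits[head:head + n_ics]) if b == 1]
    return set_index, mask                              # selected IC set and IC indices
```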
Although PCA is a more widely deployed linear transformation method, the PC sets generated by different PCA transformations remain consistent because of the stability of the PCA algorithm, so this ensemble method cannot be applied to PCA based prediction methods. It would be necessary to generate a large number of IC sets to analyze the relationships among the ICs in different IC sets statistically. However, when the number of IC sets is too large, the scale of the ensemble system obtained by combining IC subsets selected from different IC sets also becomes too large. Based on this consideration, 50 different IC sets are generated in our experiments.

3.2. The design of MOGA

In the design of EICS, different IC sets are generated by running the FastICA algorithm with random initializations, and they are all indexed. Different IC subsets are selected from each candidate set and then used to build base classifiers. It is obvious that an efficient selection method is the key to an efficient ensemble system. The GA is inspired by the mechanisms of evolution in nature. Compared with widely deployed sequential search algorithms such as SFFS, the GA has a great advantage when searching for proper feature subsets in that it always evaluates a subset as a whole, represented as a chromosome. Furthermore, many GA based ensemble systems have been proposed to tackle classification problems with great success [11–13,20,21]. So a GA is deployed to implement the ensemble scheme here.

The GA based ensemble scheme is outlined as follows. A binary coding scheme is applied. Each chromosome represents an IC subset, which is used to train a base classifier, as illustrated in Fig. 2. The selection operator is roulette selection, which allows chromosomes with low fitness values a chance to enter the next generation. A double-point recombination operator is adopted to exchange a randomly selected part of individuals in pairs. Simple inversion mutation is employed as the mutation operator: it randomly selects two points in a parent and produces offspring by reversing the genes between the two points. These operators guarantee the diversity of the population, which is important to an ensemble system.

None of the existing operators embedded in the standard GA encourages the covering of different IC sets. But it is obvious that if the base classifiers are built from IC subsets selected from different IC sets, high diversity among these classifiers will easily be achieved, which is the key to building an effective ensemble system. So it is necessary to maintain the covering of different IC sets during evolution, and the GA should evolve towards two desired goals: minimizing the error rate and maximizing the covering of IC sets. It seems that the niching technique would be applicable to achieve the covering goal, because its searching mechanism can detect the basin of each individual; by defining an IC set as a niche, the individuals would search for the optimal IC subsets of each IC set. However, we do not apply the niching technique here. The main reason is that the niching technique would drive the individuals to spread over all niches equally.
As there are 50 different IC sets, searching optimal subsets for all of them would require a group of individuals for each IC set, and the population size would have to be very large, which would greatly increase the computational burden. At the same time, it is not necessary to obtain optimal subsets for all IC sets when building an efficient MCS. As pointed out in [22], using only a part of the generated base classifiers can result in an efficient ensemble system. So the algorithm designed here searches for optimal results for only a part of the IC sets, which requires far less computation.

The fitness design scheme used in [20] is based on the idea of the sharing method. In detail, each individual represents a neural network, and the GA evolves to generate a neural network ensemble system. The fitness scheme encourages the individuals to cover all samples: when a sample is recognized by k individuals, each of them shares 1/k of the score, and the others get zero score. The fitness value of each individual is the sum of its covering scores. Although the GA will not converge under this fitness scheme, it produces diverse individuals, and the accuracy of each individual is guaranteed by a negative correlation learning method. As a result, the final ensemble system is composed of diverse and accurate classifiers. Inspired by this scheme, we apply a multi-objective GA (MOGA) here. The first objective is to generate accurate individuals, and the second is to encourage the covering of the IC sets. In this way, both the covering and the accuracy are achieved at the same time. Although this MOGA will not converge because of the second goal, the divergence does not deteriorate its efficiency, just like the GA designed in [20]. Despite the MOGA framework, this MOGA can be regarded as an implicit niching GA, because the second optimization goal is used to maintain the covering among IC sets.

The bootstrap .632+ method is widely deployed for the estimation of generalization error [23]. Although it is time-consuming, its results are close to unbiased for small-sample-size problems. It is deployed to evaluate the performance of the base classifiers built from the selected IC subsets in each generation, and the first optimization goal of the MOGA is to minimize the bootstrap .632+ error rates.

The IC subsets represented by the individuals in a population are called a group here if they belong to the same IC set. In order to encourage the covering, we use a method similar to the sharing scheme of [20]. That is, if there are n IC subsets in a group, the fitness values of these IC subsets are multiplied by 1/n. Assume two individuals, x_i and x_j, belong to the same group; the Hamming distance is used to evaluate the difference between them. The second fitness function of an individual is the sum of its Hamming distances to the other members of its group. When there is only a single subset in a group, the Hamming distance cannot be calculated, and we simply assign it a very large value, for example 1000. This fitness value is calculated by (7). According to this formula, if a group is of small scale, all individuals in this group will have a high fitness value on the second goal. This optimization goal is used to encourage the covering of IC sets by keeping the size of each group small in each population.
F_i = Σ_{j=1, j≠i}^{n} xor(x_i, x_j)/n,  if n > 1;    F_i = 1000,  if n = 1    (7)

A widely deployed fusion scheme for an evolutionary ensemble system is to combine all the individuals in the final generation [11,20]; this scheme is also deployed in our ensemble system and is named EICS-1. Because 50 different IC sets are used in our experiments, the number of individuals in a generation cannot be too small if optimal subsets are to be found for the IC sets. At the same time, it cannot be too large, considering the scale of the ensemble. So we set the population size to 100, which allows, on average, two subsets per IC set to survive in a population. There are therefore 100 base classifiers in EICS-1.
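To illustrate the second (covering) objective of Eq. (7), the sketch below computes it for a whole population; the group of an individual is determined by the IC-set index decoded from its chromosome, and all names are illustrative:

```python
# A sketch of the covering objective of Eq. (7): the sum of Hamming distances
# to the other members of the individual's group, divided by the group size,
# or a large constant (1000) for singleton groups. Names are illustrative.
import numpy as np

def covering_fitness(set_indices, masks, singleton_value=1000.0):
    """set_indices[i]: IC-set index of individual i; masks[i]: its 0/1 IC mask."""
    set_indices = np.asarray(set_indices)
    masks = np.asarray(masks)
    fitness = np.empty(len(masks))
    for i in range(len(masks)):
        group = np.where(set_indices == set_indices[i])[0]
        n = len(group)
        if n == 1:
            fitness[i] = singleton_value                         # Eq. (7), n = 1
        else:
            others = group[group != i]
            hamming = np.sum(masks[others] != masks[i], axis=1)  # xor(x_i, x_j)
            fitness[i] = hamming.sum() / n                       # Eq. (7), n > 1
    return fitness
```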
Due to the encouragement of diversity, it is quite probable that some accurate individuals will be replaced during evolution if they are in the same group. If an ensemble is constructed with accurate individuals coming from all the different IC sets, the differences among the IC sets will be fully utilized. So a global-recording technique is designed in our MOGA to record the best IC subsets of each IC set globally. In detail, a global-recording list is used to record the best two subsets of each IC set obtained during evolution. By combining the individuals in this list, an efficient ensemble system is built, which is denoted EICS-2. The ensemble size of this system is also 100, just as in EICS-1, so the two ensemble systems can be compared objectively. During evolution, the global-recording list may be updated in each generation by replacing the IC subsets in the list with IC subsets that achieve a lower error rate in the current generation. It should be noted that only subsets belonging to the same IC set are compared, and only the best two subsets of each IC set are kept in the list.

The framework of this MOGA is based on NSGA-II [24], which implements a diversity-preserving mechanism. The chromosomes in a population are first sorted by nondominated sorting and assigned to different ranks, and each is then assigned a crowding distance according to the differences among the objective values. The selection is performed using crowded tournament selection, in which chromosomes with lower rank and larger crowding distance are chosen. Based on this MOGA, two ensemble schemes are implemented, and the comparisons between them are presented in the next section.

4. Experimental results and discussions

Three publicly available microarray datasets are deployed in our experiments: the hepatocellular carcinoma dataset [25], the prostate cancer dataset [26] and the breast cancer dataset [27]. In these datasets, all samples have already been assigned to a training set or a test set, and Table 1 summarizes these datasets. Preprocessing of the datasets is done exactly as in [28]: the raw data are transformed to natural logarithmic values, and each sample is then standardized to zero mean and unit variance. Besides the original division, each dataset is reshuffled with nine randomizations, and for all datasets each randomized training and test set contains the same number of samples of each class as the original training and test set. In all our experiments, the classifiers are built using the training samples, and the classification accuracies are estimated on the independent test set.

4.1. The analysis of ICs

For ICA based microarray dataset analysis, how many ICs should be generated after an ICA transformation is still an open problem. One method is to set the number of ICs to the number of samples, owing to the small training sample size of microarray datasets, as done in [5,9]. This method is applied to the hepatocellular dataset. But for the prostate and breast cancer datasets, whose training sample sizes are 102 and 78, respectively, it takes quite a long time to transform the datasets with so many ICs, and the larger number of ICs makes it more difficult to find globally optimal results. So we simply set the number of ICs to 50 for these two datasets to simplify our discussion.
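Returning to the preprocessing step described above (natural-logarithm transform followed by per-sample standardization, as in [28]), a minimal sketch could be:

```python
# A small sketch of the preprocessing described above, assuming a
# (samples x genes) array of positive raw intensities; names are illustrative.
import numpy as np

def preprocess(raw):
    logged = np.log(raw)                          # natural logarithm of the raw values
    mean = logged.mean(axis=1, keepdims=True)     # per-sample mean
    std = logged.std(axis=1, keepdims=True)       # per-sample standard deviation
    return (logged - mean) / std                  # zero mean, unit variance per sample
```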
We analyze the relationships among the ICs produced by different ICA transformations using the method proposed in [7]. First, all ICs are combined to construct a matrix X whose rows are variables and whose columns are ICs. Then a correlation coefficient matrix R is calculated from X. In this way, each column of R represents a corresponding IC, and if the square of an element r_ij of R is larger than a threshold, it is considered that the j-th IC appears once in an IC set. The sum of each column thus represents the appearance frequency of the corresponding IC.
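A sketch of this appearance-frequency computation, with the 0.7 threshold suggested in [7]; the stacking of the IC sets and the variable names are illustrative assumptions:

```python
# A sketch of the correlation-based appearance counting described above:
# stack all ICs as columns, threshold the squared correlations, and count
# how often each IC "appears". Names (ic_sets, threshold) are illustrative.
import numpy as np

def ic_credibility(ic_sets, threshold=0.7):
    """ic_sets: list of (n_ics x genes) source matrices from repeated ICA runs."""
    n_runs = len(ic_sets)
    X = np.vstack(ic_sets).T            # rows are genes, columns are ICs
    R = np.corrcoef(X, rowvar=False)    # correlations between all pairs of ICs
    appears = R ** 2 > threshold        # the j-th IC "appears" wherever r_ij^2 > threshold
    freq = appears.sum(axis=0)          # column sums = appearance frequencies a_i
    return freq / n_runs                # credibility_i = a_i / n
```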
Table 1
The summary of the datasets.

Dataset          Training set    Test set    Genes    Microarray technology
Hepatocellular   33              27          7129     Oligonucleotide
Prostate         102             34          12600    Oligonucleotide
Breast           78              19          24188    cDNA
It is obvious that different thresholds result in different numbers of unique ICs; as the threshold increases, the number of unique ICs also increases. The total number of ICs is 1650 for the hepatocellular dataset and 2500 for the other two datasets. When the threshold is set to 0.7, as suggested in [7], there are about 800 unique ICs for the hepatocellular and prostate datasets, and more than 100 unique ICs for the breast dataset, in the 50 IC sets. But when only ICs with high appearance frequency are kept, only 33 ICs remain for the hepatocellular dataset and 50 ICs for the other two datasets, so a huge proportion of unique ICs would be discarded. Fig. 3 shows the distribution of the ICs' credibility in all IC sets when the threshold is set to 0.7. We find that even for the breast dataset, whose proportion of unique ICs is the smallest among the three datasets, there are still about 40% of ICs appearing only once, and the proportion of ICs with lower than 10% credibility is higher than 90% for the other two datasets. As analyzed above, a low appearance frequency of an IC does not necessarily mean low biological significance, and such ICs can help to discover potentially important biological information. So it is obvious that preserving only the ICs with high appearance frequency discards a large proportion of ICs, which would unavoidably lead to the loss of potentially important ICs that could contribute a lot to the classification.

4.2. The analysis of experimental results

According to the chromosome design scheme, six bits are used for the first part of the chromosomes, because there are 50 different IC sets in our experiments. As the population size is 100, in the first generation the integers represented by the first part of the chromosomes take values in the range [1,50] in sequence for the first and the second 50 individuals. In this way, all the IC sets appear twice in the first generation, so that they have an equal chance to compete with the others at the initial stage. The second part of each chromosome, representing the IC mask, is randomly initialized. In the experiments, for the original division and for each randomization of each dataset, the MOGA runs five independent times, so in all the results obtained by the MOGA are based on 50 runs for each dataset.

The bootstrap sample size is set to 100 in all experiments. Despite the relatively small bootstrap size, bootstrap .632+ is quite time consuming when used in the MOGA for fitness evaluation. Based on this consideration, although more accurate classifiers, such as neural networks and support vector machines, could also be employed for the classification task, only the nearest neighbor classifier (1-NN), a relatively weak learner, is deployed in all our experiments owing to its small computational cost. It should be noted that the method is independent of the base classifier: when more accurate prediction systems are deployed as base classifiers, better performance will be achieved with our method.

The results shown in Table 2 are the mean and standard deviation of the results based on each original dataset and its nine randomizations. For comparison, we list in Table 2 the results of nine different methods: PCA and kernel PCA with FDA, and LS-SVM [28]. These results were all obtained with a single classifier. We also list the results obtained by an ICA based 1-NN.
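As described above, the first MOGA objective is estimated with the bootstrap .632+ method on a 1-NN base learner. The sketch below follows the general recipe of Efron and Tibshirani [23] (resubstitution error, leave-one-out bootstrap error, no-information rate); it is an illustrative reading of that estimator rather than the authors' exact implementation, and it assumes numpy arrays with integer-coded labels:

```python
# A rough sketch of the bootstrap .632+ error estimate for a 1-NN classifier,
# following the general recipe of [23]; names and details are illustrative.
# X: (samples x selected ICs) array; y: array of non-negative integer labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_632_plus(X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    err_bar = np.mean(clf.predict(X) != y)          # resubstitution error
    oob_errs = []                                   # leave-one-out bootstrap errors
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)       # out-of-bag samples
        if oob.size == 0:
            continue
        b = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
        oob_errs.append(np.mean(b.predict(X[oob]) != y[oob]))
    err_1 = np.mean(oob_errs)
    p = np.bincount(y) / n                          # observed class proportions
    q = np.bincount(clf.predict(X), minlength=p.size) / n
    gamma = np.sum(p * (1 - q))                     # no-information error rate
    R = (err_1 - err_bar) / (gamma - err_bar) if gamma > err_bar else 0.0
    R = float(np.clip(R, 0.0, 1.0))                 # relative overfitting rate
    w = 0.632 / (1 - 0.368 * R)
    return (1 - w) * err_bar + w * min(err_1, gamma)
```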
Fig. 3. The credibility distribution of ICs on the three datasets: (a) the hepatocellular dataset; (b) the prostate dataset; (c) the breast dataset. (In each panel, the x-axis shows the credibility in % and the y-axis the corresponding proportion of ICs.)
From Table 2, it is found that, when comparing only the results obtained by a single classifier, the ICA based 1-NN is not superior to the other nine methods, and no method shows an obvious advantage in classification. So it is hard to choose a best method for the classification of all three microarray datasets, which reveals the limitation of single classifier systems.

It is found that, by setting the covering of the IC sets as one optimization goal, there are usually about twenty different IC sets represented in each generation. In the evolutionary process, usually fewer than twenty IC sets appear with high frequency, because of the random search mechanism of the MOGA. As analyzed before, once a group happens to contain more individuals, the individuals in this group get more opportunities to be optimized on the first goal, despite receiving relatively low scores on the second goal. Then only the IC subsets obtained from the IC sets with high appearance frequency are fully optimized.

From Table 2, it is interesting to find that although the average test results of the base classifiers of each ensemble system are slightly worse than those obtained by a single classifier using a whole IC set, the ensemble based classification results are much better than all the others in most cases. The success of the ensemble systems lies in the diversity among the base classifiers. As different base classifiers produce different outputs by projecting the samples into different IC subspaces, the samples that cannot be correctly classified in one IC subspace may be recognized in other subspaces. Even when some base classifiers make a wrong decision on a sample, other classifiers still have a chance to correct this result, so the ensemble system can produce a correct output. In this way, the final results are better than both the average results of the base classifiers and the results of a single classifier.

With the global-recording technique, the best subsets of each IC set found so far are recorded. Due to the computational cost, the MOGA runs only a relatively small number of generations and explores a very small part of the search space, so it is impossible for the MOGA to fully investigate all the IC sets. Some subsets in the recording list are not fully optimized, which unavoidably deteriorates the performance of the ensemble. As a result, when all the individuals in the global-recording list are combined, the results are not always better than those of EICS-1, as shown in Table 2.

As proved in [22], by pruning some classifiers in an ensemble system, a compact and still accurate (or even more accurate) ensemble system can be built, and the idea of `overproduce and choose' has proved successful in the design of ensemble systems. In [13,22], after generating a set of base classifiers, the authors applied a GA to choose the best team of classifiers for the ensemble. However, this selective combination obviously requires much more time. An alternative is to use the individuals with above-average accuracy in the global-recording list to construct a compact ensemble system, where the diversity is maintained by the global-recording list.
Table 2
The prediction results on test sets based on fifty independent runs for each dataset. In the table, Aver_1 represents the average prediction results in the final generation, and Aver_2 represents the average prediction results in the global-recording list.

Method                                       Hepatocellular   Prostate         Breast
LS-SVM linear kernel                         68.43 ± 4.52     84.31 ± 13.66    67.92 ± 8.58
LS-SVM RBF kernel                            68.61 ± 6.32     88.10 ± 4.93     68.42 ± 7.62
LS-SVM linear kernel (no regularization)     49.56 ± 12.60    48.18 ± 10.25    57.14 ± 9.08
PCA+FDA (unsupervised PC selection)          68.25 ± 7.37     83.89 ± 13.63    57.39 ± 15.57
PCA+FDA (supervised PC selection)            66.67 ± 9.96     82.49 ± 13.35    66.92 ± 9.90
kPCA lin+FDA (unsupervised PC selection)     68.25 ± 7.37     85.01 ± 9.07     60.90 ± 14.49
kPCA lin+FDA (supervised PC selection)       66.67 ± 9.96     82.49 ± 13.35    65.41 ± 7.54
kPCA RBF+FDA (unsupervised PC selection)     61.20 ± 12.91    85.01 ± 11.00    51.38 ± 15.91
kPCA RBF+FDA (supervised PC selection)       69.49 ± 3.94     28.71 ± 10.02    36.84 ± 0.00
ICA+1-NN                                     66.68 ± 7.15     92.06 ± 8.26     65.63 ± 7.33

MOGA based ensemble IC selection
30 Generations    EICS-1                     65.15 ± 5.80     96.47 ± 3.04     77.37 ± 6.59
                  EICS-2                     63.11 ± 3.60     99.71 ± 0.93     68.42 ± 4.30
                  EICS-3                     61.52 ± 2.34     99.71 ± 0.93     78.42 ± 4.61
                  Aver_1                     58.78 ± 8.53     91.50 ± 5.00     64.89 ± 10.04
                  Aver_2                     58.67 ± 9.43     88.41 ± 8.42     59.53 ± 11.39
50 Generations    EICS-1                     65.41 ± 6.11     96.47 ± 2.73     76.84 ± 7.10
                  EICS-2                     63.74 ± 3.12     99.71 ± 0.93     70.53 ± 6.18
                  EICS-3                     62.89 ± 2.73     99.71 ± 0.93     78.42 ± 6.30
                  Aver_1                     57.89 ± 8.12     91.24 ± 5.35     65.32 ± 9.93
                  Aver_2                     58.44 ± 9.73     88.62 ± 8.15     60.11 ± 11.23
100 Generations   EICS-1                     67.11 ± 4.36     96.00 ± 2.79     77.37 ± 8.25
                  EICS-2                     64.89 ± 3.68     100 ± 0.00       72.11 ± 6.10
                  EICS-3                     64.19 ± 4.35     99.82 ± 1.52     82.11 ± 5.08
                  Aver_1                     58.70 ± 6.08     90.82 ± 4.83     65.89 ± 8.93
                  Aver_2                     59.05 ± 9.10     88.94 ± 7.90     61.11 ± 11.39
150 Generations   EICS-1                     67.04 ± 5.68     95.71 ± 3.34     77.89 ± 8.52
                  EICS-2                     67.63 ± 4.08     100 ± 0.00       74.74 ± 6.47
                  EICS-3                     69.75 ± 5.05     99.24 ± 1.52     79.47 ± 5.79
                  Aver_1                     59.81 ± 6.15     90.41 ± 5.08     68.68 ± 8.96
                  Aver_2                     59.56 ± 8.74     89.76 ± 7.98     61.95 ± 11.09
200 Generations   EICS-1                     69.52 ± 4.20     95.12 ± 3.92     79.47 ± 8.40
                  EICS-2                     68.52 ± 4.92     99.41 ± 1.24     75.79 ± 4.44
                  EICS-3                     70.15 ± 6.50     99.65 ± 2.32     82.11 ± 4.44
                  Aver_1                     59.11 ± 5.62     90.26 ± 5.55     68.79 ± 9.41
                  Aver_2                     61.44 ± 8.16     90.06 ± 7.27     62.63 ± 11.13
With this method, about 50 individuals are selected, originating from about twenty IC sets. As the number of fully optimized IC sets is close to twenty in the evolutionary process, keeping the individuals with above-average classification accuracy includes all (or at least most) of the top individuals. This new ensemble scheme, denoted EICS-3, can therefore achieve good performance with a smaller ensemble size. This simple pruning method cannot be applied to EICS-1, because some large-scale groups dominate the population to a certain degree during evolution, so the individuals with above-average accuracy in the final generation belong to only about ten IC sets. Consequently, the diversity is not high enough to guarantee ideal performance of the ensemble.

Due to the random search mechanism embedded in the MOGA, we find that for all the ensemble schemes the results obtained in the last generation are not always the best. The results of EICS-1 fluctuate more strongly than those of the other ensemble systems during evolution. As analyzed above, the IC sets represented in one generation may differ from those in another generation, and then quite different individuals are produced in different generations. Similar behavior can also be observed in other evolutionary algorithm based ensemble systems [29].
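A sketch of the pruning rule that defines EICS-3 (keep only the global-recording-list members with above-average accuracy, then fuse them as in EICS-2); the accuracies are the fitness estimates recorded during evolution and all names are illustrative:

```python
# A sketch of the EICS-3 pruning step: keep only the recorded individuals
# whose (bootstrap-estimated) accuracy is above the list average; the kept
# individuals are then combined by majority vote as in Fig. 1.
import numpy as np

def prune_above_average(recorded_individuals, recorded_accuracies):
    acc = np.asarray(recorded_accuracies)
    keep = acc > acc.mean()                 # above-average accuracy only
    return [ind for ind, k in zip(recorded_individuals, keep) if k]
```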
Table 3
The proportion of the employed IC sets obtained by EICS-3.

Number of generations    Prostate         Hepatocellular    Breast
30                       43.18 ± 9.43     44.45 ± 8.76      46.47 ± 5.06
50                       45.82 ± 10.76    40.23 ± 10.65     48.75 ± 7.53
100                      47.80 ± 10.54    47.58 ± 9.68      47.83 ± 8.70
150                      46.82 ± 12.76    44.23 ± 10.54     46.75 ± 8.53
200                      48.53 ± 9.64     48.58 ± 7.86      48.83 ± 8.70
It is therefore hard to decide in which generation the evolutionary process should be stopped to achieve the best ensemble system, but this ensemble scheme still works well compared with the other results. The results of EICS-2 and EICS-3 are more stable and usually keep increasing during evolution. From Table 2, it is obvious that EICS-2 cannot always guarantee the best performance, except on the prostate dataset. When only a small number of generations is run, for example up to the 50th generation, the corresponding results are not always good enough.
Fig. 4. The centroids of the kappa-error diagrams for different numbers of generations of the three EICS schemes: (a) the hepatocellular dataset; (b) the prostate dataset; (c) the breast dataset. In the legend, G represents the number of generations. In all three panels, the x-axis represents the average kappa statistic and the y-axis the average error rate over the classifier pairs. The centroids are based on all the results obtained in the 50 runs.
But after running some more generations, this scheme can lead to better results than EICS-1. The reason is that for EICS-2 the performance is mainly affected by the accuracies of the base classifiers; as the base classifiers in the global-recording list become more and more accurate during the evolution, EICS-2 achieves better and better performance, and the same holds for EICS-3. When more generations are run, the results of EICS-2 and EICS-3 become much better, as validated by the results shown in Table 2. But since the results obtained within 200 generations are usually satisfactory, we do not evolve the MOGA further.

As shown in Table 3, there are about 40 base classifiers in EICS-3, whereas there are always 100 base classifiers in the other two ensemble systems. EICS-3 is thus usually more efficient than EICS-2 and EICS-1, achieving comparable or higher classification accuracy with fewer base classifiers, which proves that the simple pruning method works very well. So EICS-3 is the best choice for microarray dataset classification.

4.3. The analysis of accuracy and diversity among base classifiers

In order to gain a deeper insight into the change of diversity and accuracy as the number of generations increases, we apply the kappa-error measure to further compare the performance of the different ensemble systems. Kappa-error is a widely deployed visualization method [30]. The diversity measure is named kappa (k), and it measures the level of agreement between two classifier outputs. Given two classifiers, D_a and D_b, for a c-class problem, k is defined on a contingency matrix M of size c×c. The element m_{i,j} of the matrix stands for the proportion of the data which D_a labels as the i-th class and D_b labels as the j-th class. The agreement between D_a and D_b is calculated by

k_{a,b} = (Σ_i m_{i,i} − ABC) / (1 − ABC)    (8)

where Σ_i m_{i,i} is the probability that the two classifiers agree, and `ABC' is the `agree-by-chance' term, which is given by

ABC = Σ_i (Σ_j m_{i,j} × Σ_j m_{j,i})    (9)

k = 1 only when the classifiers produce identical class labels. If k = 0, the classifiers are completely independent and the agreement of the two classifiers equals that expected by chance. It is even more desirable when k < 0, which indicates `negative dependence': in such a case, when one classifier fails, the other has more than a random chance of correcting it. For an ensemble of L classifiers, there are (L−1)×L/2 pairs of classifiers. In a kappa-error diagram, the x-axis is the k of a pair, and the y-axis is the averaged individual error, calculated by E_{a,b} = (E_a + E_b)/2, where E_a and E_b stand for the errors of the two classifiers of the pair, D_a and D_b, respectively.
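For completeness, a direct transcription of Eqs. (8) and (9) into code, assuming integer-coded predictions from the two classifiers; the function name is illustrative:

```python
# A sketch of the pairwise kappa of Eqs. (8)-(9); labels_a and labels_b are
# the two classifiers' predictions on the same samples, coded as integers in
# {0, ..., c-1}. Names are illustrative.
import numpy as np

def pairwise_kappa(labels_a, labels_b, n_classes):
    n = len(labels_a)
    M = np.zeros((n_classes, n_classes))
    for i, j in zip(labels_a, labels_b):
        M[i, j] += 1.0 / n                       # proportion labelled i by D_a and j by D_b
    agree = np.trace(M)                          # sum_i m_ii
    abc = np.sum(M.sum(axis=1) * M.sum(axis=0))  # Eq. (9)
    return (agree - abc) / (1.0 - abc)           # Eq. (8)
```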
Table 4
Comparison of the time requirements of the MOGA for 200 generations on the different datasets (10^3 seconds).

Dataset    Prostate         Hepatocellular    Breast
Time       4.272 ± 0.134    1.287 ± 0.146     5.282 ± 0.249
As a small value of k indicates better diversity in the ensemble and a small value of E_{a,b} indicates better classification results, the diagram of an ideal ensemble should have its points in the bottom left corner. When drawing the kappa-error diagrams, we found that the clouds of kappa-error points obtained in different generations usually overlap heavily for each dataset, so only the centroids are shown in Fig. 4; in this way, the relative positions can be evaluated visually. From the figures, we find that for the prostate dataset the diversity among the base classifiers of all ensemble systems is lower than for the other two datasets. This is because the average accuracy of the base classifiers for the prostate dataset is close to 90%, much higher than for the other two datasets; as a result, most of the base classifiers produce correct outputs for the training samples, which lowers the diversity. But even in this case, it is obvious that the ensemble schemes still benefit from the diversity when the ensemble outputs are compared with the corresponding average results of the base classifiers. Comparing Fig. 4(a) and (c), it is obvious that the base classifiers for the breast dataset are more accurate and more diverse; as a result, the ensemble systems work much more efficiently for the breast dataset than for the hepatocellular dataset. In most cases, the more generations the MOGA runs, the more accurate the base classifiers of each ensemble system become; consequently, the average error rate and the diversity of the classifier pairs decrease at the same time. In the experiments, the kappa-error diagrams validate the results shown in Table 2, and EICS-3 usually obtains the best trade-off between diversity and accuracy among the three ensemble systems and achieves the best performance.

4.4. The time consumption of the GA

The time requirements of the GA for running 200 generations on all datasets are listed in Table 4. The program is implemented in Matlab 2007 and run on a computer with a Core 2 E4300 CPU and 2 GB of memory. For these datasets, the average time for a run ranges from about half an hour to one and a half hours, which is tolerable considering the increase in classification accuracy.

5. Conclusions

For microarray datasets, different IC sets are generated by ICA transformations with different initializations, so how to obtain a set of stable ICs for microarray dataset analysis remains an unsolved problem. In this study, the diversity among different IC sets is instead utilized to design a MOGA based ensemble system. The MOGA evolves towards two goals: minimizing the error rate and maximizing the covering of the IC sets. As 50 different IC sets are used with only 100 individuals in a population, it is impossible for all the IC sets to appear in each generation. So a global-recording technique is designed, which records the best two results for each IC set in a global-recording list. Two ensemble schemes are then implemented to fuse the base classifiers generated by the MOGA: combining the individuals in the final generation and combining the individuals in the global-recording list. By keeping only the classifiers with above-average accuracy in the global-recording list, the third ensemble scheme is implemented.

We test our ensemble system on three microarray datasets. Compared with the results obtained by other methods, it is found that
our ensemble schemes are effective and efficient in classifying normal and tumor samples from the three human tissues, and the third scheme achieves the highest prediction accuracy with the smallest ensemble scale. In conclusion, the study of ensemble IC selection systems is still at an early stage. In future work, we will further study how to apply ensemble systems based on different ICA classification models, and try to solve other practical problems with different ensemble schemes.

References

[1] W. Liebermeister, Linear modes of gene expression determined by independent component analysis, Bioinformatics 18 (2002) 51–60.
[2] X.W. Zhang, Y.L. Yap, D. Wei, F. Chen, A. Danchin, Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis, European Journal of Human Genetics 13 (12) (2005) 1303–1311.
[3] C.H. Zheng, Y. Chen, X.X. Li, Y.X. Li, Y.P. Zhu, Tumor classification based on independent component analysis, International Journal of Pattern Recognition and Artificial Intelligence 20 (2) (2006) 297–310.
[4] S.I. Lee, S. Batzoglou, Application of independent component analysis to microarrays, Genome Biology 4 (R76) (2003).
[5] D.S. Huang, C.H. Zheng, Independent component analysis-based penalized discriminant method for tumor classification using gene expression data, Bioinformatics 22 (2006) 1855–1862.
[6] A. Frigyesi, S. Veerla, D. Lindgren, M. Hoglund, Independent component analysis reveals new and biologically significant structures in microarray data, BMC Bioinformatics 7 (2006) 290.
[7] P. Chiappetta, M.C. Roubaud, B. Torresani, Blind source separation and the analysis of microarray data, Journal of Computational Biology 11 (2004) 1090–1109.
[8] C.H. Zheng, D.S. Huang, L. Shang, Feature selection in independent component subspace for microarray data classification, Neurocomputing 69 (16–18) (2006) 2407–2410.
[9] K.H. Liu, D.S. Huang, J. Zhang, Improving the performance of ICA based microarray data prediction model with genetic algorithm, in: IEEE CEC 2007, 2007, pp. 606–611.
[10] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, New York, 2004.
[11] D. Opitz, Feature selection for ensembles, in: Proceedings of 16th National Conference on Artificial Intelligence (AAAI), 1999, pp. 379–384.
[12] L.I. Kuncheva, L.C. Jain, Designing classifier fusion systems by genetic algorithms, IEEE Transactions on Evolutionary Computation 4 (4) (2000) 327–336.
[13] L.S. Oliveira, M. Morita, R. Sabourin, Feature selection for ensembles using the multi-objective optimization approach, Studies in Computational Intelligence, vol. 16, Springer, Berlin, 2006, pp. 49–74.
[14] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844.
[15] J.W. Miskin, D.J.C. MacKay, Ensemble learning for blind source separation, Independent Component Analysis: Principles and Practice, Cambridge University Press, Cambridge, 2001, pp. 209–233.
[16] H. Lappalainen, Ensemble learning for independent component analysis, in: Proceedings of International Workshop on Independent Component Analysis and Signal Separation (ICA'99), 1999, pp. 7–12.
[17] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Transactions on Neural Networks 10 (1999) 626–634.
[18] J. Cheng, Q. Liu, H. Lu, Y.-W. Chen, Ensemble learning for independent component analysis, Pattern Recognition 39 (1) (2006) 81–88.
[19] J.D. Elashoff, R.M. Elashoff, G.E. Goldman, On the choice of variables in classification problems with dichotomous variables, Biometrika 54 (1967) 668–670.
[20] Y. Liu, X. Yao, T. Higuchi, Evolutionary ensembles with negative correlation learning, IEEE Transactions on Evolutionary Computation 4 (4) (2000) 381.
[21] X. Wang, H. Wang, Classification by evolutionary ensembles, Pattern Recognition 39 (4) (2006) 595–607.
[22] Z.H. Zhou, J. Wu, W. Tang, Ensembling neural networks: Many could be better than all, Artificial Intelligence 137 (1–2) (2002) 239–263.
[23] B. Efron, R.J. Tibshirani, Improvements on cross-validation: the .632+ bootstrap method, Journal of the American Statistical Association 92 (1997) 548–560.
[24] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6 (2) (2002) 182–197.
[25] N. Iizuka, M. Oka, H. Yamada-Okabe, M. Nishida, Y. Maeda, N. Mori, T. Takao, T. Tamesa, A. Tangoku, H. Tabuchi, Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection, The Lancet 361 (2003) 923–929.
[26] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, W.R. Sellers, Gene expression correlates of clinical prostate cancer behavior, Cancer Cell 1 (2) (2002) 203–209.
[27] L.J. van't Veer, H.Y. Dai, M.J. van de Vijver, Y.D.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, S.H. Friend, Gene expression
profiling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536.
[28] N. Pochet, F. De Smet, J.A.K. Suykens, B.L.R. De Moor, Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction, Bioinformatics 20 (17) (2004) 3185–3195.
[29] Y. Liu, How to stop the evolutionary process in evolving neural network ensembles, ICNC 2006, Part I, in: Lecture Notes in Computer Science, vol. 4221, Springer, Berlin, 2006, pp. 185–194.
[30] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proceedings of 14th International Conference on Machine Learning, 1997, pp. 211–218.
About the Author—KUN-HONG LIU received the B.Sc. and M.Sc. degrees from the School of Physics and OptoElectronics Technology of Fujian Normal University in 1999 and 2004, respectively, and the Ph.D. degree from the University of Science and Technology of China in 2008. He is now with the School of Software, Xiamen University. His research interests include evolutionary computation, machine learning, bioinformatics, WDM optical networks, and traffic grooming.

About the Author—BO LI received the M.Sc. and Ph.D. degrees in Mechanical & Electronic Engineering and in Pattern Recognition & Intelligent Systems from Wuhan University of Technology and the University of Science and Technology of China in 2003 and 2008, respectively. His research interests include pattern recognition, manifold learning and image processing.

About the Author—JUN ZHANG was born in Anhui province, China, in 1971. He received the M.Sc. degree in Pattern Recognition & Intelligent Systems in 2004 from the Institute of Intelligent Machines, Chinese Academy of Sciences, and the Ph.D. degree in Pattern Recognition & Intelligent Systems from the University of Science and Technology of China in 2007. His current research interests include swarm intelligence, artificial neural networks, and classifier ensembles.

About the Author—JI-XIANG DU received the B.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 1999 and the M.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 2002. He received the Ph.D. degree in Pattern Recognition & Intelligent Systems from the University of Science and Technology of China (USTC), Hefei, China, in December 2005. From 2006, he was engaged in postdoctoral study in the Department of Automation of USTC. He is now working at the Department of Computer Science & Technology of Huaqiao University.