Molecular Genetics and Metabolism 73, 239–247 (2001) doi:10.1006/mgme.2001.3193, available online at http://www.idealibrary.com
Feature (Gene) Selection in Gene Expression-Based Tumor Classification
Momiao Xiong,*,1 Wuju Li,* Jinying Zhao,* Li Jin,* and Eric Boerwinkle*,†
*Human Genetics Center, and †Institute of Molecular Medicine, University of Texas–Houston Health Science Center, Houston, Texas
Received December 6, 2000; published online June 27, 2001
There is increasing interest in changing the emphasis of tumor classification from morphologic to molecular. Gene expression profiles may offer more information than morphology and provide an alternative to morphology-based tumor classification systems. Gene selection involves a search for gene subsets that are able to discriminate tumor tissue from normal tissue, and may have either clear biological interpretation or some implication in the molecular mechanism of the tumorigenesis. Gene selection is a fundamental issue in gene expression-based tumor classification. In the formation of a discriminant rule, the number of genes is large relative to the number of tissue samples. Too many genes can harm the performance of the tumor classification system and increase the cost as well. In this report, we discuss criteria and illustrate techniques for reducing the number of genes and selecting an optimal (or near optimal) subset of genes from an initial set of genes for tumor classification. The practical advantages of gene selection over other methods of reducing the dimensionality (e.g., principal components) include its simplicity, future cost savings, and higher likelihood of being adopted in a clinical setting. We analyze the expression profiles of 2000 genes in 22 normal and 40 colon tumor tissues, 5776 sequences in 14 human mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia samples. Through these three examples, we show that using 2 or 3 genes can achieve more than 90% accuracy of classification. This result implies that after initial investigation of tumor classification using microarrays, a small number of selected genes may be used as biomarkers for tumor classification, or may have some relevance in tumor development and serve as a potential drug target. In this report we also show that stepwise Fisher’s linear discriminant function is a practicable method for gene expression-based tumor classification. © 2001 Academic Press
Key Words: gene expression; gene selection; Monte Carlo; microarray; tumor classification.
Diagnostic pathology has traditionally relied on macro- and microscopic histology and tumor morphology as the basis for classifying tumors. Current classification frameworks, however, are unable to discriminate among tumors with similar histopathologic features that vary in clinical course and response to treatment (1). Recently, there is increasing interest in changing the emphasis of tumor classification from morphologic to molecular. In the past decade, array technologies have been developed that can be used to simultaneously assess the level of expression of thousands of genes (2–10). Several studies have used arrays to analyze gene expression in colon, breast, and other tumors, and these studies have demonstrated the potential utility of expression profiling for classifying tumors (11,12). Gene expression profiles may offer more information than classic morphology and provide an alternative to morphology-based tumor classification systems. Feature (gene) selection is an important component for gene expression-based tumor classification systems. The great advantage of microarray is that it is able to simultaneously monitor the expression of thousands or even tens of thousands of genes and provide extremely useful biological information. Theoretically, having more features should give us
1 To whom correspondence and reprint requests should be addressed at Human Genetics Center, University of Texas–Houston, P.O. Box 20334, Houston, TX 77225. E-mail: [email protected].
1096-7192/01 $35.00 Copyright © 2001 by Academic Press All rights of reproduction in any form reserved.
more discriminating power. However, there are several reasons to reduce the number of features to a sufficient minimum. First, large numbers of features increase computational complexity and cost. Second, although two features may each carry good classification information when treated separately, there is little gain from combining them in a feature vector if they are highly correlated with each other. Thus, complexity increases without much gain. Third, a large number of features will compromise the generalization properties of the classifier. It is recognized that the higher the ratio of the number of training samples to the number of free classifier parameters, the better the generalization properties of the resulting classifier (14). A large number of features translates directly into a large number of classifier parameters (e.g., weights in a linear classifier, synaptic weights in a neural network). In general, the number of tissue samples, often 10–100, is two or three orders of magnitude smaller than the number of genes. Thus, for a limited number of training samples, keeping the number of features as small as possible is in line with our desire to design classifiers with good generalization capabilities. Fourth, a large number of features will harm the estimation of the classification error. One important step in the design of a classification system is the performance evaluation stage, in which the classification error probability of the designed classifier is estimated. A small number of features will improve the estimation of classification error. Therefore, reducing the dimensionality of the gene expression information is a key issue in developing a successful gene expression-based tumor classification system.
In addition to reducing noise and improving the accuracy of tumor classification, selected subsets of genes with high accuracy of classification may be involved in the pathways or biological processes leading to tumor development. The selected subsets of genes may have important biological interpretation and may be used for drug target discovery or for identifying possible future research directions. Recently, Getz et al. (18) presented a coupled two-way clustering analysis of gene expression data to identify subsets of genes and discussed feature selection in the cluster analysis. In this report, we address the gene selection issue under a classification framework that may be more relevant to clinical application in diagnosis. Many sophisticated methods for classification, such as support vector machines, neural networks, and Bayesian classification based on Gaussian processes, and for feature selection, such as mixed-integer programming and genetic algorithms, have been developed in the past several decades. The purpose of this report is to illustrate the principle of gene selection in tumor classification, not to investigate and compare various sophisticated statistical and computational methods for classification and feature selection. To do so, we present the use of a simple Fisher linear method for classification and heuristic stepwise and Monte Carlo methods for selecting an optimal subset of genes that provides high accuracy for tumor classification. By analyzing the expression profiles of 2000 genes in 22 normal and 40 tumor tissues, 5776 cDNA clones in 14 human mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) samples, we show that only a few genes can achieve high accuracy of classification and that the selected genes can be used as biomarkers for tumor classification.

MATERIALS AND METHODS

Gene expression data from tumor and normal tissues. Three gene expression data sets were used to illustrate the gene selection methods. The first data set consists of expression profiles of 2000 genes using an Affymetrix oligonucleotide array from 22 normal and 40 colon tumor tissues, which were originally retrieved from www.molbio.princeton.edu/colondata (11) and are now posted at www.sph.uth.tmc.edu/hgc. The second data set consists of expression profiles of 5776 human sequences using a cDNA microarray from 14 human mammary epithelial cells and 13 breast tumors and was retrieved from http://genome-www.stanford.edu/sbcmp (12). The third data set consists of expression profiles of 6817 genes from 47 ALL and 25 AML samples and was retrieved from http://waldo.wi.mit.edu/MPR/data-set-ALL-AML.html.

Discriminant analysis. We will use the Fisher linear discriminant function for classifying tumor and normal tissues.
The Fisher approach does not assume that the observations are normally distributed. But it does implicitly assume that the covariance matrices of the observed variables in the populations (normal and tumor) are equal (16). Tumor and normal tissues are classified on the basis of k selected feature variables. The feature variables can represent the level of expression of k genes or k summary measures representing the level of gene expression of any number of genes or other measures. Suppose that $n_N$ normal and $n_T$ tumor tissue samples are examined. For tissue sample i, we have the vector $Y_i' = (Y_{i1}, Y_{i2}, \ldots, Y_{ik})$. The $Y_i$’s for the normal (N) and tumor (T) samples constitute the following data matrices,

$$Y_N = [Y_{N1}, Y_{N2}, \ldots, Y_{N n_N}] \quad (k \times n_N) \tag{1}$$

$$Y_T = [Y_{T1}, Y_{T2}, \ldots, Y_{T n_T}] \quad (k \times n_T). \tag{2}$$

From these data matrices, the sample mean vectors and covariance matrices are determined by

$$\bar{Y}_N = \frac{1}{n_N} \sum_{i=1}^{n_N} Y_{Ni}, \qquad S_N = \frac{1}{n_N - 1} \sum_{i=1}^{n_N} (Y_{Ni} - \bar{Y}_N)(Y_{Ni} - \bar{Y}_N)' \tag{3}$$

$$\bar{Y}_T = \frac{1}{n_T} \sum_{i=1}^{n_T} Y_{Ti}, \qquad S_T = \frac{1}{n_T - 1} \sum_{i=1}^{n_T} (Y_{Ti} - \bar{Y}_T)(Y_{Ti} - \bar{Y}_T)' \tag{4}$$

$$S = \frac{(n_N - 1) S_N + (n_T - 1) S_T}{n_N + n_T - 2}. \tag{5}$$
Fisher’s idea was to transform the multivariate observations $Y_N$ and $Y_T$ into univariate observations $Z_N$ and $Z_T$ such that the Z’s were separated as much as possible. Fisher suggested taking linear combinations of the Y’s to generate Z’s, which can be easily manipulated mathematically. The midpoint, $\hat{m}$, between the two univariate sample means, $\bar{Z}_N = (\bar{Y}_N - \bar{Y}_T)' S^{-1} \bar{Y}_N$ and $\bar{Z}_T = (\bar{Y}_N - \bar{Y}_T)' S^{-1} \bar{Y}_T$, is given by

$$\hat{m} = \frac{1}{2} (\bar{Y}_N - \bar{Y}_T)' S^{-1} (\bar{Y}_N + \bar{Y}_T). \tag{6}$$

The classification rule based on Fisher’s linear discriminant function for an unknown sample, $Y_0$, is as follows:

Assign $Y_0$ to N if $(\bar{Y}_N - \bar{Y}_T)' S^{-1} Y_0 \ge \hat{m}$, and
assign $Y_0$ to T if $(\bar{Y}_N - \bar{Y}_T)' S^{-1} Y_0 < \hat{m}$.
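The classification rule above can be illustrated with a minimal sketch in plain Python. It assumes each tissue sample is given as a list of k expression values (one row per sample, i.e., the transpose of the k × n data matrices in Eqs. 1 and 2). The function names (`fit_fisher`, `classify`) and the toy two-gene data are hypothetical and are not taken from the original analysis.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(samples: Sequence[Vector]) -> Vector:
    """Component-wise sample mean (the mean vectors of Eqs. 3 and 4)."""
    return [sum(col) / len(samples) for col in zip(*samples)]


def scatter(samples: Sequence[Vector], mean: Vector) -> List[Vector]:
    """Sum of (Y_i - mean)(Y_i - mean)', i.e. (n - 1) times the sample covariance."""
    k = len(mean)
    out = [[0.0] * k for _ in range(k)]
    for y in samples:
        d = [yi - mi for yi, mi in zip(y, mean)]
        for a in range(k):
            for b in range(k):
                out[a][b] += d[a] * d[b]
    return out


def solve(matrix: List[Vector], rhs: Vector) -> Vector:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (aug[r][k] - tail) / aug[r][r]
    return x


def fit_fisher(normal: Sequence[Vector], tumor: Sequence[Vector]) -> Tuple[Vector, float]:
    """Return (w, m_hat), where w = S^{-1}(mean_N - mean_T) and m_hat is Eq. 6."""
    mean_n, mean_t = mean_vector(normal), mean_vector(tumor)
    s_n, s_t = scatter(normal, mean_n), scatter(tumor, mean_t)
    dof = len(normal) + len(tumor) - 2
    k = len(mean_n)
    # Pooled covariance S of Eq. 5: the scatter sums already equal (n - 1) * S_N etc.
    pooled = [[(s_n[a][b] + s_t[a][b]) / dof for b in range(k)] for a in range(k)]
    w = solve(pooled, [a - b for a, b in zip(mean_n, mean_t)])
    m_hat = 0.5 * sum(wi * (a + b) for wi, a, b in zip(w, mean_n, mean_t))
    return w, m_hat


def classify(w: Vector, m_hat: float, y0: Vector) -> str:
    """Assign y0 to N if w'y0 >= m_hat, otherwise to T."""
    return "N" if sum(wi * yi for wi, yi in zip(w, y0)) >= m_hat else "T"


# Toy two-gene data (hypothetical values, for illustration only).
normal = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 2.1]]
tumor = [[3.0, 0.5], [3.2, 0.9], [2.8, 0.4], [3.5, 0.7]]
w, m_hat = fit_fisher(normal, tumor)
print(classify(w, m_hat, [1.0, 2.0]), classify(w, m_hat, [3.1, 0.6]))
```

Because S is symmetric, the discriminant $(\bar{Y}_N - \bar{Y}_T)' S^{-1} Y_0$ equals the dot product of `w` with the new sample, so the sketch solves one linear system rather than forming the inverse explicitly.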
To evaluate the accuracy of the Fisher linear discriminant function with K selected genes for tumor classification, defined as the percentage of correctly classified tissue samples, the tissue samples (both normal and tumor) were randomly divided into two sets. The Fisher linear discriminant function was trained on the first set (training set), where the characteristic (e.g., normal vs tumor, subtypes of tumor) of the tissue samples was assumed to be known, and then tested on the second set (test set), where the characteristic of the tissue samples was assumed to be unknown. Three different proportional partitionings of the tissue samples were used for assessing the accuracy of the Fisher linear discriminant function for tumor classification: (1) 50% for training and 50% for testing; (2) 68% for training and 32% for testing; and (3) 95% for training and 5% for testing. To evaluate the performance of selected genes, we randomly partitioned the total tissue samples into training and test sets 200 times for each of the three partitionings and then averaged the results.

Gene selection for tumor classification. The ultimate purpose of these investigations is to form a rule for the allocation of subsequent unclassified tissue samples into similar classes (e.g., tumor vs normal, subtypes of tumor). Therefore, the accuracy of classification, which is defined as the percentage of correctly classified normal and tumor tissues, or subtypes of tumor tissues, will be used as a criterion for comparing gene selection algorithms. The goal of gene selection is to find an optimal (or near optimal) subset of K genes which has the highest (or near highest) accuracy of classification from the initial complete set of p genes.

Stepwise selection and Monte Carlo methods.
Four methods were applied to the full array of data to select a subset of genes to be used in the discriminant analyses: a stepwise procedure, a Monte Carlo method, the t test, and the prediction strength (PS) statistic suggested by Golub et al. (13). In the forward stepwise selection procedure, we began with an optimal subset of two genes which had the highest accuracy of classification among all possible combinations of two genes. The accuracy of classification was then computed for the previously selected optimal subset of genes combined with each of the remaining p − 2 genes in the data set, and the gene that gave the highest accuracy of classification was the next to enter the optimal subset. The procedure continued to select one additional gene at a time until the
accuracy of classification reached a predetermined threshold or a fixed number K of genes was reached. Using the Monte Carlo method, we randomly selected a subset of k genes n times (k begins with 1). For this study, we set n = 200,000. We next identified the subset of k genes with the highest accuracy of classification among all n subsets of k genes. The value of k was then increased by 1 and the procedure was repeated until the accuracy of classification reached the same predetermined threshold or a fixed value of k was reached. Since the search by either the stepwise procedure or the Monte Carlo method is not exhaustive, the selected subsets of genes may not have the highest accuracy of classification among all possible combinations of genes. Therefore, such selected subsets of genes are referred to as optimal (or near optimal) subsets of genes. It should also be pointed out that the optimal subsets of genes may not be unique.

t test. Those genes whose expression differs significantly between normal and tumor tissues or between subtypes of tumor tissues are also candidates for selection. A simple t test statistic can be used to measure the degree of gene expression difference between normal and tumor tissues. We selected those K genes with the largest t statistic for inclusion in the discriminant analyses.

Prediction strength and gene selection. Golub et al. proposed the use of a collection of known samples to generate a “class predictor” which is then able to assign a new sample to one of two classes (13). This predictor is created with the aid of “prediction strength.” Let $[\mu_1(g), s_1(g)]$ and $[\mu_2(g), s_2(g)]$ denote the means and SD (standard deviation) of the log of the expression levels of gene g in the normal and tumor tissues, respectively. Then, PS is defined as

$$PS(g) = \frac{\mu_1(g) - \mu_2(g)}{s_1(g) + s_2(g)}.$$
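The four selection strategies described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the t statistic is taken to be the pooled-variance two-sample form, genes are ranked by the absolute value of their score, `monte_carlo` handles a single subset size k (the outer loop over k is omitted), and the search functions accept any `accuracy(subset)` callable. For the toy demonstration, a nearest-class-mean rule stands in for the Fisher classifier; all function names and data values are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[int, ...]
Accuracy = Callable[[Subset], float]


def t_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample t statistic with pooled variance (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def ps_score(a: Sequence[float], b: Sequence[float]) -> float:
    """PS(g) = (mu_1 - mu_2) / (s_1 + s_2); inputs are assumed to be log expression."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k genes with the largest absolute score (an assumption)."""
    return sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)[:k]


def forward_stepwise(p: int, accuracy: Accuracy, k_max: int, threshold: float = 1.0) -> Subset:
    """Start from the best pair of genes, then add the single best gene per step."""
    best = max(combinations(range(p), 2), key=accuracy)
    while len(best) < k_max and accuracy(best) < threshold:
        rest = [g for g in range(p) if g not in best]
        best = max((best + (g,) for g in rest), key=accuracy)
    return best


def monte_carlo(p: int, accuracy: Accuracy, k: int, n_draws: int, rng: random.Random) -> Subset:
    """Best of n_draws randomly drawn k-gene subsets."""
    draws = (tuple(sorted(rng.sample(range(p), k))) for _ in range(n_draws))
    return max(draws, key=accuracy)


def make_accuracy(normal: List[List[float]], tumor: List[List[float]]) -> Accuracy:
    """Resubstitution accuracy of a nearest-class-mean rule (stand-in classifier)."""
    def accuracy(subset: Subset) -> float:
        mu_n = [mean(normal[g]) for g in subset]
        mu_t = [mean(tumor[g]) for g in subset]

        def looks_normal(data: List[List[float]], j: int) -> bool:
            y = [data[g][j] for g in subset]
            d_n = sum((a - b) ** 2 for a, b in zip(y, mu_n))
            d_t = sum((a - b) ** 2 for a, b in zip(y, mu_t))
            return d_n <= d_t

        hits = sum(looks_normal(normal, j) for j in range(len(normal[0])))
        hits += sum(not looks_normal(tumor, j) for j in range(len(tumor[0])))
        return hits / (len(normal[0]) + len(tumor[0]))
    return accuracy


# Toy data: rows are genes, columns are samples (hypothetical values).
normal = [[1.0, 1.2, 0.9, 1.1], [2.0, 2.1, 1.9, 2.0], [5.0, 4.0, 5.5, 4.5], [0.5, 0.7, 0.6, 0.4]]
tumor = [[3.0, 3.1, 2.9, 3.2], [2.0, 1.9, 2.1, 2.05], [4.8, 4.2, 5.2, 4.4], [1.5, 0.55, 1.6, 1.4]]
acc = make_accuracy(normal, tumor)
by_t = top_k([t_score(n, t) for n, t in zip(normal, tumor)], 1)
by_ps = top_k([ps_score(n, t) for n, t in zip(normal, tumor)], 1)
stepwise_set = forward_stepwise(len(normal), acc, k_max=3)
mc_set = monte_carlo(len(normal), acc, k=1, n_draws=200, rng=random.Random(0))
print(by_t, by_ps, stepwise_set, mc_set)
```

The t and PS rankings score each gene in isolation, whereas the stepwise and Monte Carlo searches score whole subsets through the classifier. The initial pair search in `forward_stepwise` evaluates every pair of genes, which is the costly step for arrays with thousands of genes.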
Genes with large PS are “informative” and are selected for tumor classification. We selected those K genes with the largest PS statistic for inclusion in the discriminant analyses.

RESULTS

Number of genes with a certain accuracy of classification. We initially evaluated the accuracy of prediction for each gene individually. Table 1 shows the number of genes whose accuracy of prediction fell within the specified intervals. These data demonstrate that most genes individually do not offer good predictive ability, but a few genes do individually provide very strong accuracy of prediction. For example, for predicting tumor versus normal breast, 15 genes individually provided more than 90% accuracy of prediction.

TABLE 1
Number of Genes with a Certain Accuracy of Classification

Interval of                 Number of genes (ratio)
classification accuracy     Colon cancer       Breast cancer      ALL-AML
[0.0, 0.5)                  541 (27.05%)       1122 (19.43%)      843 (11.82%)
[0.5, 0.6)                  1049 (52.45%)      1813 (31.39%)      3672 (51.51%)
[0.6, 0.7)                  366 (18.30%)       1290 (22.33%)      2077 (29.13%)
[0.7, 0.8)                  40 (2.00%)         1176 (20.36%)      468 (6.56%)
[0.8, 0.9)                  4 (0.20%)          360 (6.23%)        66 (0.93%)
[0.9, 1.0]                  0 (0.00%)          15 (0.26%)         3 (0.042%)
[0.0, 1.0]                  2000 (100.00%)     5776 (100.00%)     7129 (100.00%)

Optimal and near optimal sets of genes for tumor classification. Table 2 shows the optimal or near optimal one-gene, two-gene, and three-gene subsets for tumor classification using the stepwise, Monte Carlo, t statistic, and PS methods of gene selection. Accuracy of classification was evaluated for the total collection of tissue samples. These data have two implications. First, using a small number of genes can achieve a high accuracy of classification. Second, the stepwise and Monte Carlo methods of gene selection performed consistently better than the t test and PS methods. For example, considering only three genes, both the stepwise and Monte Carlo methods provided 93.5% accuracy of classification for colon cancer.

Evaluation of performance of selected optimal sets of genes for tumor classification. Table 3 summarizes the accuracy of classification of selected optimal or near optimal sets of genes using training and test subsets of the tissue samples and the stepwise and Monte Carlo methods. The data presented in Table 3 underscore the fact that using selected optimal or near optimal sets of genes can achieve a high accuracy of classification. In fact, using one, two, or three genes yielded more than 85% accuracy of classification in all three proportions of the partitioning. To further illustrate this point, Figs. 1A and 1B show the expression of two selected genes in normal and tumor tissues. Figure 1C shows the expression of two selected genes in ALL and AML
TABLE 2
Optimal or Near Optimal Sets of Genes for Tumor Classification

Gene name                     Number   Methods       Accuracy

A. Colon
DES                           1        Stepwise      85.5%
DES                           1        Monte Carlo   85.5%
GUCA2B                        1        t test        75.8%
PLK                           1        PS            66.1%
DES, PBP                      2        Stepwise      91.9%
HS.35496*, GUCA2B             2        Monte Carlo   91.9%
GUCA2B, KCNMB1                2        t test        74.2%
PLK, G3BP                     2        PS            67.7%
DES, GUCA2B, PBP              3        Stepwise      93.5%
HS.35496*, GUCA2B, GAS6       3        Monte Carlo   93.5%
GUCA2B, KCNMB1, SSBP          3        t test        74.2%
PLK, G3BP, HS.71968*          3        PS            79.0%

B. Breast
LAMC2                         1        Stepwise      96.3%
LAMC2                         1        Monte Carlo   96.3%
NO.1711†                      1        t test        92.6%
MYPT1                         1        PS            92.6%
MYPT1, LAMC2                  2        Stepwise      100.0%
APLP2, MSR1                   2        Monte Carlo   100.0%
NO.1711†, DLG1                2        t test        96.3%
MYPT1, NO.3112†               2        PS            88.9%
DLG1, MYPT1, LAMC2            3        Stepwise      100.0%
SLC4A1, ALOX5, NO.4137†       3        Monte Carlo   100.0%
NO.1711†, DLG1, DDX21         3        t test        96.3%
MYPT1, NO.3112†, DDX21        3        PS            92.6%

C. ALL-AML
ZYX                           1        Stepwise      91.7%
ZYX                           1        Monte Carlo   91.7%
PSMA6                         1        t test        83.3%
ZYX                           1        PS            91.7%
MLP, ZYX                      2        Stepwise      100.0%
ZYX, ATP2A3                   2        Monte Carlo   95.8%
CCND3, PSMA6                  2        t test        81.9%
DF, ZYX                       2        PS            93.8%
MLP, PCCB, ZYX                3        Stepwise      100.0%
KIAA0176, ZYX, ITK            3        Monte Carlo   97.2%
CCND3, PSMA6, TCF3            3        t test        83.3%
CD33, DF, ZYX                 3        PS            94.4%

* The gene names here are represented by the UniGene System because the designations are unique. The “*” in Table 3A and Figure 1A has the same meaning.
† The gene names here are represented by the order number in the original retrieved gene expression file.
tissues. It is apparent that normal and tumor tissues or ALL and AML tissues are well separated. Comparing among methods, the accuracy of classification by the stepwise and Monte Carlo methods is similar. However, the accuracy of classification by the t statistic and PS methods is lower than that provided by the stepwise and Monte Carlo methods.
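The training and test accuracies of Table 3 come from repeated random partitioning of the samples. A minimal sketch of that evaluation loop is given below for a single gene, where Fisher's rule reduces to a threshold at the midpoint of the two class means. It assumes each class is partitioned separately so that both classes appear in every training set (the source does not state whether the split was stratified); the function names and toy expression values are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, str]]  # (expression of one gene, "N" or "T")


def fit_one_gene_rule(train: Labeled) -> Callable[[float], str]:
    """Fisher's rule for k = 1: threshold at the midpoint of the class means."""
    mu_n = mean(x for x, lab in train if lab == "N")
    mu_t = mean(x for x, lab in train if lab == "T")
    diff, mid = mu_n - mu_t, 0.5 * (mu_n + mu_t)
    # The (positive) variance factor in Eq. 6 does not change the inequality.
    return lambda x: "N" if diff * (x - mid) >= 0 else "T"


def accuracy(rule: Callable[[float], str], data: Labeled) -> float:
    """Fraction of samples in data that the rule labels correctly."""
    return sum(rule(x) == lab for x, lab in data) / len(data)


def split(values: List[float], train_frac: float, rng: random.Random):
    """Randomly partition one class, keeping at least one sample on each side."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    n_train = min(max(1, round(train_frac * len(shuffled))), len(shuffled) - 1)
    return shuffled[:n_train], shuffled[n_train:]


def mean_accuracy(normal: List[float], tumor: List[float], train_frac: float,
                  n_repeats: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Average training and test accuracy over n_repeats random partitions."""
    rng = random.Random(seed)
    train_acc, test_acc = [], []
    for _ in range(n_repeats):
        n_tr, n_te = split(normal, train_frac, rng)
        t_tr, t_te = split(tumor, train_frac, rng)
        train = [(x, "N") for x in n_tr] + [(x, "T") for x in t_tr]
        test = [(x, "N") for x in n_te] + [(x, "T") for x in t_te]
        rule = fit_one_gene_rule(train)
        train_acc.append(accuracy(rule, train))
        test_acc.append(accuracy(rule, test))
    return mean(train_acc), mean(test_acc)


# Toy, perfectly separable one-gene data (hypothetical values).
normal = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
tumor = [3.0, 3.1, 2.9, 3.2, 2.7, 3.3]
for frac in (0.50, 0.68, 0.95):
    print(frac, mean_accuracy(normal, tumor, frac))
```

With more than one gene, `fit_one_gene_rule` would be replaced by a multivariate Fisher fit; the partition-and-average loop stays the same.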
Optimal size of selected sets of genes. Figure 2 shows the relationship between the accuracy of tumor classification and the number of genes in a selected optimal (or near optimal) gene set using the acute leukemia tumor data. For all three partitionings, three genes is an optimal number for tumor classification. As the number of genes in an optimal
TABLE 3
Classification Accuracies for Different Gene Sets Using the Stepwise and Monte Carlo Methods

Number  Gene name                  Method        Proportion  Training  Test     Cross    Average

A. Colon
1       DES                        Stepwise      50%:50%     85.7%     85.3%    85.5%    85.5%
                                                 68%:32%     85.3%     86.2%    85.3%    85.6%
                                                 95%:5%      85.6%     83.8%    85.6%    85.0%
1       DES                        Monte Carlo   50%:50%     85.7%     85.3%    85.5%    85.5%
                                                 68%:32%     85.3%     86.2%    85.3%    85.6%
                                                 95%:5%      85.6%     83.8%    85.6%    85.0%
2       DES, PBP                   Stepwise      50%:50%     87.5%     86.0%    85.9%    86.5%
                                                 68%:32%     87.7%     86.4%    86.2%    86.7%
                                                 95%:5%      88.4%     87.2%    86.9%    87.5%
2       GUCA2B, HS.35496*          Monte Carlo   50%:50%     92.7%     90.7%    90.4%    91.3%
                                                 68%:32%     93.0%     91.0%    91.2%    91.7%
                                                 95%:5%      93.2%     93.0%    91.8%    92.7%
3       DES, GUCA2B, PBP           Stepwise      50%:50%     91.1%     89.3%    89.0%    89.8%
                                                 68%:32%     91.4%     89.5%    90.2%    90.3%
                                                 95%:5%      91.9%     89.3%    90.2%    90.5%
3       HS.35496*, GUCA2B, GAS6    Monte Carlo   50%:50%     92.3%     89.1%    88.9%    90.1%
                                                 68%:32%     92.6%     89.4%    89.7%    90.6%
                                                 95%:5%      92.7%     90.5%    90.0%    91.0%

B. Breast
1       LAMC2                      Stepwise      50%:50%     95.6%     93.0%    93.9%    94.2%
                                                 68%:32%     95.4%     96.4%    94.6%    95.5%
                                                 95%:5%      96.3%     96.8%    96.1%    96.4%
1       LAMC2                      Monte Carlo   50%:50%     95.6%     93.0%    93.9%    94.2%
                                                 68%:32%     95.4%     96.4%    94.6%    95.5%
                                                 95%:5%      96.3%     96.8%    96.1%    96.4%
2       LAMC2, MYPT1               Stepwise      50%:50%     98.5%     93.4%    94.3%    95.4%
                                                 68%:32%     98.8%     94.1%    94.7%    95.9%
                                                 95%:5%      99.3%     94.0%    95.6%    96.3%
2       APLP2, MSR1                Monte Carlo   50%:50%     99.5%     95.1%    95.4%    96.7%
                                                 68%:32%     99.9%     97.9%    97.1%    98.3%
                                                 95%:5%      100.0%    100.0%   99.7%    99.9%
3       LAMC2, MYPT1, DLG1         Stepwise      50%:50%     99.5%     94.2%    92.9%    95.5%
                                                 68%:32%     99.6%     94.9%    95.1%    96.5%
                                                 95%:5%      100.0%    96.0%    95.8%    97.2%
3       SLC4A1, ALOX5, NO.4137†    Monte Carlo   50%:50%     99.6%     97.8%    96.3%    97.9%
                                                 68%:32%     99.9%     98.8%    98.0%    98.9%
                                                 95%:5%      100.0%    100.0%   99.6%    99.9%

C. ALL-AML
1       ZYX                        Stepwise      50%:50%     92.1%     92.1%    91.6%    92.0%
                                                 68%:32%     92.3%     91.2%    92.0%    91.8%
                                                 95%:5%      91.6%     93.2%    91.6%    92.1%
1       ZYX                        Monte Carlo   50%:50%     92.1%     92.1%    91.6%    92.0%
                                                 68%:32%     92.3%     91.2%    92.0%    91.8%
                                                 95%:5%      91.6%     93.2%    91.6%    92.1%
2       MLP, ZYX                   Stepwise      50%:50%     99.4%     98.6%    98.8%    98.9%
                                                 68%:32%     99.7%     98.6%    99.0%    99.1%
                                                 95%:5%      99.9%     98.8%    98.8%    99.2%
2       ZYX, ATP2A3                Monte Carlo   50%:50%     96.1%     94.4%    95.5%    95.3%
                                                 68%:32%     95.8%     95.0%    95.3%    95.4%
                                                 95%:5%      95.8%     96.7%    95.7%    96.0%
3       MLP, PCCB, ZYX             Stepwise      50%:50%     99.9%     99.1%    99.0%    99.3%
                                                 68%:32%     100.0%    99.4%    99.3%    99.6%
                                                 95%:5%      100.0%    100.0%   100.0%   100.0%
3       KIAA0176, ZYX, ITK         Monte Carlo   50%:50%     96.5%     95.1%    95.1%    95.5%
                                                 68%:32%     96.7%     95.7%    95.7%    96.2%
                                                 95%:5%      97.2%     97.7%    97.0%    97.3%
FIG. 2. Comparison of ALL-AML tissue classification accuracies for different numbers of genes selected using Monte Carlo simulation for different partition ratios of test and training data.
(or near optimal) gene set exceeds 3, the accuracy of classification decreases. The colon and breast tumor data had similar patterns (data not shown).

Optimal or near optimal sets of genes form clusters. An optimal set of genes for classifying tumor and normal tissue may not be unique. There may exist a number of sets of genes whose accuracy of classification is quite close. For example, there are 15 pairs of genes whose expression provides an accuracy of prediction greater than 90.3% for the colon cancer data set. The 15 pairs of genes fall into two groups, and each pair is made up of one member from Group 1 and one member from Group 2. Figure 3 shows that these genes indeed form two clusters of genes. The genes marked with an asterisk are in Group 1 and the genes without an asterisk are in Group 2. Similar patterns were observed for the breast cancer data set (data not shown).

FIG. 1. Gene expression levels in tumor and normal tissue for GUCA2B and HS.35496 in colon (A) and HS.136016 and HB58 in breast (B) and PSCD1 and ZYX in acute leukemia (C). The straight lines in the figures represent the Fisher linear discriminant function and demonstrate the separation of tissue types by only two genes.

FIG. 3. Fifteen pairs of genes with each having more than 90% accuracy of classification in the colon cancer data set form two clusters.

DISCUSSION

Simultaneously monitoring the expression of thousands of genes holds great promise for better understanding cancer biology and developing accurate tumor classification schemes. However, the very large amount of gene expression information provided by contemporary microarray technology (6–10) raises problems for both basic research and clinical application. First, the high cost of large-scale microarray experiments dictates that the sample size is usually several orders of magnitude smaller than the number of genes being monitored. As a result, it is mathematically infeasible to use all of the gene expression information to develop a classification algorithm for a relatively small number of tumors (14). It is also well documented in the statistical literature that too many feature variables (i.e., genes) can harm the performance of the classifier (15). Therefore, development of an accurate tumor classification scheme must begin with selection of a subset of the initially observed genes for tumor classification. In the parlance of the classification literature, this is known as “pruning” or “feature set reduction.” Second, application of gene expression analyses for tumor classification and pathology requires cost-effective and streamlined methodology. In other words, the cost and complexity of monitoring the expression of thousands of genes will not be necessary in a clinical setting if a handful of genes
provides an accurate tumor classification scheme. Third, only small numbers of genes are relevant for tumor development. The selected subset of genes with high accuracy of classification may have some implications in elucidating the molecular mechanism for tumor development. In this report, we propose and characterize methods for selecting an optimal subset of genes for tumor classification. Selecting an optimal subset of genes poses two related problems: first, determining the number of genes to be selected; second, determining which genes belong to the set. To begin to address these problems, we have analyzed available expression data from 2000 genes in 22 normal and 40 colon cancer samples, 5776 cDNA clones in 14 mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 ALL and 25 AML samples. Our results indicate that expression information from three or four genes is optimal for tumor classification in these three data sets. In fact, as few as two genes achieve more than 90% accuracy in many instances. This result is appealing and has profound implications for clinical applications. It bodes well for the following scenario. Initial basic research and clinical trials will monitor the expression of thousands of genes using microarrays to identify a handful of genes
providing optimal tumor classification information. Clinical applications will then monitor only this small subset of genes, thus avoiding the cost and complexity of large-scale gene expression arrays. Of course, the number of selected genes and the optimal set of genes will likely differ according to tumor type, and thus the clinical laboratory will still need the ability to monitor a variety of genes.

There is no single best procedure for selecting an optimal subset of genes for tumor classification. It is not the purpose of this paper to survey the various methods for feature variable (gene) selection, but rather to illustrate how gene selection can be useful to tumor classification and investigation of tumorigenesis. Any successful procedure will require a mix of statistical decision rules and expert knowledge. In this report, four statistical procedures were compared for their accuracy of tumor classification: stepwise discriminant analysis, Monte Carlo methods, the t statistic, and a PS statistic suggested by Golub et al. (13). The results indicate that the stepwise and Monte Carlo methods performed similarly, and both methods performed better than the t or PS statistics. However, the stepwise method demands much less computational time than the Monte Carlo method, and, therefore, stepwise discriminant analysis provides a practical and accurate method for tumor classification using gene expression profiles.

The levels of expression of sets (i.e., clusters) of genes across tissue samples are correlated (11). As a result, it is likely that the information contained in a large number of genes can be captured by a smaller number without significant loss of information. This is a direct result of the fact that sets of genes are similarly regulated and, hence, play a similar role in tumor classification.
It is known from pattern recognition theory that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. In this report, we show that there are a number of gene subsets that have similarly high accuracies for tumor classification, and the genes in these subsets form two clusters. The expression levels of genes within the same cluster have a high correlation, and the expression levels of genes in different clusters have a low correlation. We speculate that the genes within a cluster lie in a single pathway or coregulated pathways, and genes in different clusters lie in different pathways or pathways that are not coregulated.
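The within-cluster versus between-cluster correlation pattern described above can be checked with a Pearson correlation between gene expression profiles. The following is a small sketch with hypothetical profiles; it is not an analysis of the colon data set.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two expression profiles of equal length."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)


# Hypothetical profiles across four tissue samples.
gene_a = [1.0, 2.0, 3.0, 4.0]    # cluster 1
gene_b = [2.1, 3.9, 6.2, 8.0]    # cluster 1, tracks gene_a closely
gene_c = [1.0, -1.0, -1.0, 1.0]  # cluster 2, unrelated to gene_a
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Under the criterion stated above, a pair such as (`gene_a`, `gene_c`) would be a better two-gene feature set than (`gene_a`, `gene_b`), provided both members are informative about the class.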
REFERENCES

1. Stephenson J. Human genome studies expected to revolutionize cancer classification. JAMA 282:927–928, 1999.
2. Tlsty TD, Margolin BH, Lum K. Differences in the rates of gene amplification in nontumorigenic and tumorigenic cell lines as measured by Luria-Delbruck fluctuation analysis. Proc Natl Acad Sci USA 86:9441–9445, 1989.
3. Theillet C. Full speed ahead for tumor screening. Nature Med 4:767–768, 1998.
4. Strausberg RL, Austin MJF. Functional genomics: Technological challenges and opportunities. Physiol Genomics 1:25–32, 1999.
5. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO. The transcriptional program in the response of human fibroblasts to serum. Science 283:83–87, 1999.
6. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnol 14:1675–1680, 1996.
7. Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnol 15:1359–1367, 1997.
8. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:3273–3297, 1998.
9. Yang GP, Ross DT, Kuang WW, Brown PO, Weigel RJ. Combining SSH and cDNA microarrays for rapid identification of differentially expressed genes. Nucleic Acids Res 27:1517–1523, 1999.
10. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genet 14:457–460, 1996.
11. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96:6745–6750, 1999.
12. Perou CM, Jeffrey SS, Rijn MVD, Rees CA, Eisen MB, Ross RT, Pergamenschikov A, Williams CF, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 96:9212–9217, 1999.
13. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531–537, 1999.
14. Theodoridis S. Pattern Recognition. San Diego: Academic Press, 1999.
15. McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, 1992.
16. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1982.
17. Draper NR, Smith H. Applied Regression Analysis, 2nd ed. New York: Wiley, 1981.
18. Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci USA 97:12079–12084, 2000.