Handbook of Statistics, Vol. 23, Chapter 37
ISSN: 0169-7161 © 2004 Elsevier B.V. All rights reserved. DOI: 10.1016/S0169-7161(03)23037-6

Survival Analysis with Gene Expression Arrays

Donna K. Pauler, Johanna Hardin, James R. Faulkner, Michael LeBlanc and John J. Crowley

1. Introduction

In this paper we discuss the identification and measurement of gene expression levels as prognostic indicators for survival times. As the technology and field of bioinformatics have exploded in recent years, so too has the need for tools to analyze outcome data with covariates of extremely high dimension, such as arise from measuring expression levels of large numbers of genes. As a particular example, a study in progress at the Myeloma Institute for Research and Therapy in Little Rock, Arkansas has measured 12 625 gene expression levels on 156 patients before standard treatment for myeloma. Patients are currently being followed for response to treatment and survival. Part of our collaboration with this institute will focus on the identification of genes and genetic profiles that serve as pre-treatment prognostic indicators for response and survival. Here we outline a variety of data-reduction and analysis techniques for correlating survival times with high-dimensional covariates. We perform simulation studies to compare techniques for use in the upcoming analysis of the Arkansas data.

1.1. The measurement of gene expression levels

A key to the genomic revolution has been the development of the microarray chip, a technology which permits the researcher to study literally thousands of genes in a single assay. There are several approaches to developing chips, but the basics are that known genes, or key parts of genes such as sequences of nucleotides, are fixed on a small slide, and a sample from a patient is prepared in such a way that gene transcripts present in the sample will attach to the gene targets on the slide. The output for each gene is compared to either an internal or external control and is expressed either as a categorical variable, indicating whether the gene is expressed or not, or as a measured variable, typically a log ratio of expression in the experimental sample relative to the control, which quantifies the degree of over- or under-expression. Chips can be created to target genes hypothesized to be involved in a specific cancer, such as the lymphochip for studying lymphoma (Alizadeh et al., 2000), or can be quite general, such as the Affymetrix chip (Affymetrix, Inc., http://www.affymetrix.com/).


There are two types of chips in common use, which produce oligonucleotide and cDNA arrays. In an oligonucleotide array, single strands of oligos, or partial gene sequences, are printed on a chip. Multiple oligos are printed for each gene, and a mismatch strand, an oligo with one base changed, is simultaneously hybridized as a control. The sample is labeled with fluorescent dye and measured for pairing to both the oligo and the mismatch. In a spotted cDNA array, one basic sequence representing all or part of a gene is printed on the chip. The experimental sample is labeled with fluorescent red dye while the reference sample is labeled with fluorescent green dye, and both are simultaneously hybridized to the chip. The reference sample acts as a control for the experimental sample. For both types of arrays the amount of fluorescent dye measured is quantified. Many questions can be addressed with microarray technology, such as which genes, combinations of genes, or entire genetic profiles differentiate normal subjects from cancer patients, predict survival, or refine histological subsets of specific cancers. As a long-term goal, identification of responsible genes can potentially lead to the development of targeted therapy against the gene products or proteins.

1.2. The statistical analysis of genes

Routine statistical inference correlating gene expression with clinical outcome is complicated by the extremely high number of genes. Even in the optimistic case of a large complete clinical databank, the number of genes will certainly outnumber the patients, making model selection or multivariable modeling difficult. To combat this problem many approaches have focused on performing a large number of simultaneous hypothesis tests, such as t-tests or Wilcoxon tests, one for each gene; see, for example, the "class prediction" approach of Golub et al. (1999). In principle the same simultaneous testing procedure can be performed for analyzing survival outcomes, using Cox's proportional hazards model for each gene. There are more advanced screening techniques, including the significance analysis of microarrays (SAM) (Tusher et al., 2001) and the directed indices for exploring gene expression proposed by LeBlanc et al. (2003), which might better serve to polish the crude multiple hypothesis testing approach. To account for the thousands of simultaneous hypothesis tests, a Bonferroni correction can be applied to retain a family-wide error rate of 5%. However, due to the high dimensionality involved, this approach can be so conservative that no genes are declared significant. Therefore, a number of more liberal testing procedures which retain a pre-specified family-wide error rate or false discovery rate have been proposed; see, for example, the permutation method of Westfall and Young (1993) and the adjustment method of Benjamini and Hochberg (1995). The simultaneous hypothesis-testing approach can be used as the first stage in a two-stage procedure to reduce the covariate space to a manageable size for a second-stage multivariable model. Stepwise regression, tree-based (Breiman et al., 1984; Segal, 1988; LeBlanc and Crowley, 1993) or neural network (Khan et al., 2001) methods can then be applied to further optimize the predictive capability of the model. The marginal simultaneous hypothesis testing approach does not take into consideration the correlation structure of the genes. Alternative data reduction techniques attempt


to first understand the structure of the covariate space and utilize dependencies before performing independent or multivariable modeling. An old familiar solution is principal component decomposition, which reduces the predictor space to a smaller set of orthogonal linear combinations of the original variables that explain a pre-specified portion of the total variation. Multivariate modeling can then proceed using the reduced set of principal component scores. For gene expression data the method is not completely straightforward to implement because of the high dimensionality. Many software packages use standard approaches to matrix inversion, which do not work efficiently in high dimensions. The covariate matrix may be singular or nearly singular, in which case principal component decomposition will not work. Another drawback of principal component decomposition for gene expressions is interpretation. Investigators are often interested in the roles of individual genes in order to understand mechanism, and the linear combinations comprising the principal components may not easily translate back to individual influential genes. It has been shown in simple cases that principal components may obscure groupings of interest (Chang, 1983). However, a number of authors are investigating methodologies to address this issue. In a recent report, Hastie et al. (2000) outline a method for using principal components in a clustering framework to identify subsets of genes. For additional applications see Alter et al. (2000), Quackenbush (2001), and Yeung and Ruzzo (2001).

An alternative to principal component decomposition for capitalizing on the dependency structure is to first perform a clustering algorithm, such as hierarchical (e.g., Ward, 1963; Weinstein et al., 1997; Eisen et al., 1998), K-means (e.g., MacQueen, 1967; Herwig et al., 1999; Tavazoie et al., 1999) or model-based (e.g., Bock, 1996; Fraley and Raftery, 2002) clustering, on the gene space and then use the cluster centers or medoids as the predictors in the regression model. A small pre-specified number of clusters may be chosen so that a multivariable model with stepwise selection can feasibly be used. Alternatively, the number of clusters may be determined using a criterion such as the average silhouette width (Kaufman and Rousseeuw, 1990). The clustering approach retains some interpretability since individual genes in the influential clusters may be easily traced.

With so many potential predictors in addition to the multitude of clinical variables, there is a high probability of overfitting the data. Therefore, it is important to examine criteria such as predictive capability in addition to goodness of fit when selecting a model on which to base inference. Criteria such as Akaike's information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978), which penalize high dimensionality and overfitting, are part of most standard software packages. A number of authors provide recommendations for measuring predictive accuracy in Cox's proportional hazards models; see, for example, Harrell et al. (1984, 1996), Korn and Simon (1990), Henderson (1995), Raftery et al. (1996), Volinsky et al. (1997), and O'Quigley and Xu (2001). Verweij and van Houwelingen (1993) propose a cross-validated log likelihood based on the partial likelihood, which reduces to Allen's PRESS (Allen, 1974) and Mallows' Cp (Mallows, 1973) statistics in normal linear models.
Methods described in Efron and Tibshirani (1997) can be used to reduce variability in cross-validation.
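To make the screening step concrete, the sketch below fits one marginal Cox model per gene and applies the Bonferroni and Benjamini–Hochberg adjustments. It is written in R (the open-source descendant of the Splus environment used later in the paper) and is only a minimal illustration: the objects `expr` (a patients-by-genes expression matrix), `time` and `status` are assumed names, not objects from our analysis.

```r
library(survival)

## One marginal Cox proportional hazards model per gene; record the
## Wald p-value for each gene's coefficient.
pvals <- apply(expr, 2, function(g) {
  fit <- coxph(Surv(time, status) ~ g)
  summary(fit)$coefficients[, "Pr(>|z|)"]
})

## Family-wide control via Bonferroni versus the more liberal
## Benjamini-Hochberg false discovery rate adjustment.
hits_bonferroni <- which(p.adjust(pvals, method = "bonferroni") < 0.05)
hits_fdr        <- which(p.adjust(pvals, method = "BH") < 0.05)
```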


1.3. Motivating application

Our motivating application comes from an ongoing study at the Myeloma Institute for Research and Therapy in Little Rock, Arkansas. In this study 156 patients are enrolled on a protocol designed to determine the efficacy of thalidomide added to high-dose chemotherapy and tandem transplants for multiple myeloma. For individuals in the study, blood sera were collected immediately pre-treatment for gene expression level measurement. Expression levels of 12 625 genes were obtained from each of the samples using the Affymetrix oligonucleotide chip. The patients are currently being followed for response and survival. When these data become available, it will be of interest to examine the prognostic significance of expression levels in conjunction with other clinical parameters such as stage of disease, age and treatment. In this paper we perform simulation studies to compare some of the approaches outlined in Section 1.2 for correlating survival times with gene expression levels in preparation for our analysis of the myeloma dataset. For feasibility of simulation, we restrict attention to 500 genes, choosing the 500 with highest variance in the myeloma sample.
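As a small illustration, this variance filter amounts to a single line in R; `expr` is an assumed name for the full patients-by-genes expression matrix.

```r
## Keep the 500 genes with the largest sample variance across patients.
top500 <- expr[, order(apply(expr, 2, var), decreasing = TRUE)[1:500]]
```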

2. Methods

For our simulations we use Cox's proportional hazards model, which specifies that the hazard of failing at time $t$ is equal to

$$\lambda_0(t) \exp(\beta' x), \tag{1}$$

where $\lambda_0(t)$ denotes the baseline hazard function, $\beta$ a vector of regression coefficients, and $x$ a vector of predictors. We use the Splus function coxph to fit (1) (Mathsoft Inc., 1999). The function provides an estimate of the AIC as $-2$ times the difference between the partial log likelihood evaluated at the maximum likelihood estimate and the number of regression coefficients in the model.

2.1. Data reduction and model fitting

We consider the following four methods for reducing and analyzing the data.

Stepwise selection. First perform 500 simultaneous marginal Cox proportional hazards regressions, one for each gene. Choose the 50 genes with largest absolute t-statistic for further analysis. On the set of 50 selected genes, perform a forward stepwise AIC procedure to select the main effects model with lowest AIC, allowing the model size to vary from 1 to 50.

Top 5. First perform 500 simultaneous Cox proportional hazards regressions. Choose the 5 genes with largest absolute test statistic values and include them as main effects in a second Cox proportional hazards model.

Clustering. Perform Splus CLARA medoid clustering on the set of 500 genes of dimension 156 (Mathsoft Inc., 1999). Choose the cluster size with maximum average silhouette width. Fit a Cox proportional hazards model with one predictor for each cluster, with value equal to the median of all genes assigned to the cluster.


Principal components. First do a principal component analysis of the 156 by 500 matrix of gene expression levels to obtain a list of principal component scores ranked by percent variability explained. Fit a Cox proportional hazards model using the top 5 principal components as predictors.

For the 500 simultaneous Cox proportional hazards regressions we use the method of fast simultaneous regressions proposed by Hastie and Tibshirani (1990) and outlined in Appendix A. We also invoke this procedure for the forward stepwise AIC selection, since a large number of models have to be fit at each forward step. We cluster the 500 genes of dimension 156 using the Splus program CLARA (Clustering LARge Applications) (Mathsoft Inc., 1999). CLARA is an extension of the partitioning around medoids (PAM) method to accommodate large numbers of objects, as often encountered in the analysis of microarrays; see Kaufman and Rousseeuw (1990) or Han and Kamber (2001, pp. 352–354). PAM was one of the first K-medoid clustering algorithms for partitioning n objects into K groups. After an initial random selection of K medoids, PAM repeatedly tries to make a better choice of medoids by minimizing the sum of dissimilarities of all objects to their nearest medoid. PAM does not work efficiently for numbers of objects, n, greater than 200. CLARA extends PAM by drawing multiple samples of the dataset, applying PAM to each, and returning the best clustering as the output. The effectiveness of CLARA depends on the number of random samples drawn; for our simulations we use five samples. For additional relief of the computational burden, we fix the cluster size within simulations. In principle the cluster size can be selected by repeatedly running CLARA and choosing the K corresponding to the cluster analysis with largest average silhouette width. The average silhouette width is a function of individual object dissimilarities and measures how well the clustering performs, with well-separated clusters having values near 1; for a precise definition see the plot.partition command in Splus or Chapter 2 of Kaufman and Rousseeuw (1990). For the principal component analysis we apply the Splus function princomp to the 156 by 500 matrix of gene expression levels, which produces a list of 500 principal component scores for each of the observations and the orthogonal loading matrix for the scores (Mathsoft Inc., 1999).

2.2. Assessing predictive accuracy

To assess the predictive accuracy of the methods we perform 6-fold cross-validation. That is, we randomly divide the sample of 156 cases into six groups of 26 cases. We apply the four model-fitting procedures of Section 2.1 to 5/6 of the sample and then assess them on the remaining 1/6 using the average cross-validated log likelihood (ACVL). Repeating for $K = 6$ random partitions of the data, the ACVL is given by

$$\mathrm{ACVL} = \frac{1}{K} \sum_{k=1}^{K} \ell_k\bigl(\hat{\beta}_{(-k)}\bigr), \tag{2}$$

where

$$\ell_k(\beta) = \ell(\beta) - \ell_{(-k)}(\beta) \tag{3}$$


is the difference between the partial log likelihood for the entire sample and that with the kth group of observations excluded, and $\hat{\beta}_{(-k)}$ is the value of $\beta$ that maximizes $\ell_{(-k)}(\beta)$, for $k = 1, \ldots, K$. See Verweij and van Houwelingen (1993) for an explicit formula for the leave-one-out cross-validated log likelihood in the context of Cox's partial likelihood. For the clustering and principal component procedures it is necessary to retain both the cluster assignment indices and loading matrices from the fit to the training data (the 5/6 sample) in addition to the estimated regression coefficients $\hat{\beta}_{(-k)}$. These indices and loading matrices are then applied to both the genes in the full dataset and the reduced training set, as required in the two terms of (3).

2.3. Simulations

We simulated both gene and survival data for the 12 scenarios listed in Table 1. The scenarios are grouped into three types corresponding to the underlying joint distribution for the set of 500 genes for 156 patients. For the first subset of simulations, the Independent case, the genes are assumed to follow a Normal distribution with zero mean and variance–covariance matrix equal to the identity matrix. For the second subset, Myeloma, the genes are again assumed to follow a Normal distribution, but with mean vector and variance–covariance matrix set equal to those observed for the 500 genes with most variability in the Arkansas myeloma dataset. In this dataset, the mean log expression level across all genes was 5.27 with a range of 2.36 to 9.92. The standard deviations for the first four genes were 5.31, 3.65, 2.92, and 2.30, respectively. The median standard deviation for the 500 genes was 1.28 with range 1.11 to 5.31. The median correlation was 0.02 with a range of −0.69 to 0.97. The variance–covariance matrix was borderline singular, so we added a small constant, $\varepsilon = 10^{-13}$, to the diagonal elements to obtain a nonsingular matrix. For the third subset, 3-Cluster, genes are simulated from a mixture of Normals with three clusters. Genes from different clusters are assumed independent. Means, variances and correlation coefficients within the clusters are listed in the caption of Table 1.

We generated survival data from an exponential model with baseline hazard specified so that the median survival time was approximately 4 years. We varied the number of designated genes with non-zero $\beta$ coefficients over 0, 1, 10 and 50. Values of the designated $\beta$ parameters are given in the fourth column of Table 1; all other $\beta$ parameters are set to zero. For the case of a single gene we set $\beta$ equal to $-\log 2$, corresponding to a doubling of the hazard of failing with a unit decrease in gene expression. For the cases of 10 and 50 designated genes we dampened the values of $\beta$ to control the prognostic signal and produce survival times in the range observed for the default case of median survival at 4 years. We used a standardization to control the magnitude of the predictor, $\beta' x$, which yielded the $\beta$ coefficient values listed for the 10 and 50 gene cases in Table 1. We assumed a uniform censoring mechanism on the interval 4 to 10 years, which yielded approximately 35% censoring under all of the scenarios considered. We simulated 100 datasets for each scenario; average median survival times and censoring proportions for each scenario are reported in the last two columns of Table 1.
For each of the 100 datasets we calculated (2) for a single random partition and then averaged the ACVL over the 100 datasets.
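A minimal R sketch of the ACVL of (2)–(3) follows, using the top 5 rule of Section 2.1 as the model-fitting step. The names `expr`, `time` and `status` are assumptions for illustration, and the zero-iteration call to coxph is one way (not necessarily the paper's) to evaluate the partial log likelihood at a fixed coefficient vector.

```r
library(survival)

## Partial log likelihood at a fixed coefficient vector: start coxph at
## `beta` and take zero Newton steps.
pll_at <- function(X, time, status, beta) {
  fit <- coxph(Surv(time, status) ~ X, init = beta,
               control = coxph.control(iter.max = 0))
  fit$loglik[2]
}

acvl <- function(expr, time, status, K = 6) {
  folds <- sample(rep(seq_len(K), length.out = nrow(expr)))
  contrib <- numeric(K)
  for (k in seq_len(K)) {
    train <- folds != k
    ## Model-fitting step on the training 5/6ths: marginal screen, top 5.
    z <- apply(expr[train, ], 2, function(g)
      summary(coxph(Surv(time[train], status[train]) ~ g))$coefficients[, "z"])
    top5 <- order(abs(z), decreasing = TRUE)[1:5]
    b <- coef(coxph(Surv(time[train], status[train]) ~ expr[train, top5]))
    ## l_k(b) = l(b) - l_(-k)(b): full-sample minus training-sample log likelihood.
    contrib[k] <- pll_at(expr[, top5], time, status, b) -
                  pll_at(expr[train, top5], time[train], status[train], b)
  }
  mean(contrib)
}
```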


Table 1. Parameters and survival data used in the simulation study. Gene dist. indicates the distribution of the individual gene vectors; No. des. genes indicates the number of genes that have non-zero β coefficients in the survival model; Des. param. values indicates the value of the non-zero β coefficients; Med. surv. and Med. % cens. indicate the mean of the median survival time and censoring proportion, respectively, over the 100 simulated datasets. Under Des. param. values, $1_x$ indicates a column vector of ones of length $x$. Under Gene dist., Independent indicates each gene is identically distributed as $N(0 \cdot 1_{156}, I_{156})$, where $I_{156}$ denotes the 156 × 156 identity matrix; Myeloma indicates each gene follows a 156-variate normal distribution with mean and variance–covariance matrix equal to those for the top 500 genes in the Arkansas myeloma project; and 3-Cluster indicates 3 clusters of size 25, 25, and 450, with $N(8 \cdot 1_{156}, \Sigma_1)$, $N(4 \cdot 1_{156}, \Sigma_2)$, and $N(0 \cdot 1_{156}, I_{156})$ marginal distributions, respectively, where $\Sigma_1$ is a matrix with unity across the diagonal and all pairwise correlations equal to 0.8 and $\Sigma_2$ is similarly defined with pairwise correlations equal to 0.5.

Sim. no.  Gene dist.    No. des. genes  Des. param. values  Med. surv.  Med. % cens.
 1        Independent    0              --                  4.05        0.31
 2        Independent    1              -0.69               3.81        0.33
 3        Independent   10              -0.32 x 1_10        3.71        0.34
 4        Independent   50              -0.14 x 1_50        3.66        0.34
 5        Myeloma        0              --                  4.10        0.32
 6        Myeloma        1              -0.69               3.65        0.43
 7        Myeloma       10              -0.32 x 1_10        3.66        0.42
 8        Myeloma       50              -0.14 x 1_50        3.63        0.41
 9        3-Cluster      0              --                  3.98        0.31
10        3-Cluster      1              -0.69               3.87        0.33
11        3-Cluster     10              -0.32 x 1_10        3.53        0.38
12        3-Cluster     50              -0.14 x 1_50        3.57        0.43
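As a concrete illustration of the generating mechanism summarized in Table 1, the following R sketch produces survival data for one scenario. The gene matrix `expr` (156 × 500) and coefficient vector `beta` (mostly zero) are assumed to have been drawn already; the sketch omits the standardization we used to dampen the signal in the 10 and 50 gene cases.

```r
## Exponential proportional hazards survival times with baseline rate
## chosen so that median survival is about 4 years when beta'x = 0.
n <- 156
eta <- drop(expr %*% beta)               # linear predictor beta'x
lambda0 <- log(2) / 4                    # median of Exp(lambda) is log(2)/lambda
t_event <- rexp(n, rate = lambda0 * exp(eta))
t_cens  <- runif(n, min = 4, max = 10)   # uniform censoring on [4, 10]
time    <- pmin(t_event, t_cens)
status  <- as.numeric(t_event <= t_cens) # 1 = failure, 0 = censored
```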

For the clustering method, we performed initial smaller trial runs which, based on the average silhouette width, indicated that the cluster size should be no greater than two for the Independent and Myeloma scenarios and should be 3 for the 3-Cluster scenarios. We therefore fixed the cluster size at 2 for the first 8 simulations and at 3 for the remaining 4. For the principal components procedure we also ran a small pre-test to see which of several rules provided the maximal ACVL: the top 5, 10, or 50 principal components, or the number that explained at least 90% of the variability. We found that restricting to the top 5 principal components produced ACVLs as high as the other rules, even though the percent variation explained may be very low. Therefore we fixed the number of principal components at 5 for all simulations. A sketch of both tuning steps is given below.
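The sketch shows both steps in R, using the cluster package's clara function (the modern home of the CLARA algorithm); `expr` is the assumed 156 × 500 expression matrix, transposed so that genes are the objects being clustered.

```r
library(cluster)

## Choose the number of clusters K by the largest average silhouette
## width over a small grid, with five random samples per CLARA run.
widths <- sapply(2:10, function(k)
  clara(t(expr), k, samples = 5)$silinfo$avg.width)
K <- (2:10)[which.max(widths)]

## Principal component scores for the patients; the top 5 are retained
## regardless of the percent variation they explain.
scores <- prcomp(expr)$x[, 1:5]
```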

3. Results

The results for the different simulations of Table 1 are shown in Figure 1. For each of the four methods we report the difference between the ACVL for the method and the ACVL for a naive approach of simply taking the median of all genes as the sole predictor in the model.


Fig. 1. Average cross-validated log likelihoods (ACVL) for the Stepwise (S), Top 5 (T), Cluster (C) and Principal Component (P) methods from the simulation study of 100 randomly generated datasets corresponding to the scenarios in Table 1. ACVL(median), the ACVL for the case where the only predictor in the model is the median across all genes, has been subtracted from each ACVL. Across all scenarios, the average standard errors for points on the graph were approximately 1.9, 0.64, 0.67 and 0.7 for the Stepwise, Top 5, Cluster and Principal Component methods, respectively.

Stepwise

As shown in the figure, the stepwise method performed much more poorly than any of the other procedures, including the naive median approach, under all simulations except the 1 Beta case under Myeloma. The standard errors for the stepwise estimates in the figure were approximately 1.9 on average and varied from 0.9 to 2.9 across simulations. Standard errors for all other procedures were smaller in magnitude and variation, near 0.7. An explanation for the poor performance of the stepwise procedure is given in Table 2. The last column of Table 2 shows that the stepwise procedure had a tendency to select oversized models, even when there were no designated genes. As seen in the fourth column of the table, except for the case of a single gene with large absolute β coefficient, the procedure was not able to detect designated genes. The true positive rate, or proportion of designated genes that made it into the final stepwise model, declined as the magnitude of the β coefficients decreased for all scenarios except the 3-Cluster scenario (simulations 9–12). Finally, as seen in the third column, except for the 3-Cluster scenario the stepwise procedure was doomed from the start, since many of the designated genes did not make it past the top 50 marginal regression screen.


Table 2. Performance of the stepwise model. Top 50 is the number of designated genes that were selected in the initial marginal screen for the top 50 genes; True positive rate (TPR) is the proportion of designated genes that appeared in the stepwise model; Model size is the number of predictors in the model selected by stepwise. All summaries are given as the average and range over the 600 fits of the model.

Sim. no.  No. des. genes  Top 50 (range)   TPR (range)         Model size (range)
 1         0              --               --                  27.37 (12, 50)
 2         1              1.00 (1, 1)      1.0 (1.0, 1.0)      26.22 (8, 50)
 3        10              6.56 (2, 10)     0.53 (0.10, 1.0)    28.39 (12, 50)
 4        50              11.00 (4, 18)    0.14 (0.02, 0.28)   28.24 (14, 50)
 5         0              --               --                  15.35 (6, 38)
 6         1              1.00 (1, 1)      1.0 (1.0, 1.0)      10.50 (1, 50)
 7        10              6.52 (3, 9)      0.53 (0.10, 1.0)    25.83 (7, 50)
 8        50              12.85 (4, 20)    0.15 (0.02, 0.34)   25.12 (8, 50)
 9         0              --               --                  26.77 (13, 50)
10         1              1.00 (1, 1)      0.89 (0, 1.0)       21.98 (8, 50)
11        10              9.93 (8, 10)     0.51 (0.10, 1.0)    16.67 (3, 50)
12        50              48.93 (41, 50)   0.64 (0.12, 1.0)    32.56 (6, 50)

Top 5

Surprisingly, simply choosing a model with the top 5 genes in terms of largest absolute test statistics from an initial screen of the 500 genes produced a model that always outperformed the stepwise procedure in cross-validation. For the case where all β's were set to zero, under all three gene distribution types the top 5 approach performed worse than the principal components, clustering and naive median approaches. Note again that the standard errors for all three methods are approximately 0.7. For the 1 Beta case the top 5 approach performed as well as the principal component and clustering methods for the Independent and 3-Cluster scenarios. For the Myeloma 1 Beta case it performed markedly better than all of the other methods. This simulation is somewhat different from the others in that we placed a large β coefficient on the gene with highest variability (standard deviation 5.31), so that the signal was strong. The principal component method could not recover the signal as effectively as the top 5 approach. For the 10 Beta case, the top 5 approach did not perform as well as clustering and principal components in the Independent and 3-Cluster scenarios. It did, however, match the principal component approach in the Myeloma case and outperform the clustering method there. Finally, the top 5 approach did not perform as well as the principal component and clustering approaches in the 50 Beta case, although for the 3-Cluster scenario it performed better than expected. The top 5 approach is handicapped because its fixed size does not allow it to capture the majority of designated genes; a top 10 or 20 approach might have higher predictive power for the 50 Beta case.

Clustering

The clustering approach does not beat the naive median approach in either the Independent or Myeloma case because there was little evidence for clusters in these scenarios.


For the Independent and Myeloma scenarios, 2 clusters were used but the average silhouette widths were only 0.009 and 0.28, respectively. For the 3-Cluster scenario, 3 clusters were used and the average silhouette width was 0.69, reflecting the three well-separated clusters. In this case the clustering method performed neck and neck with principal components and greatly outperformed the top 5, stepwise and median approaches.

Principal components

The top 5 principal components explained on average only 8%, 28% and 13% of the total variability for the Independent, Myeloma and 3-Cluster scenarios, respectively. Yet this method either outperformed or matched the best method in all simulations except the Myeloma 1 Beta case.

4. Discussion

We have compared four methods for correlating high-dimensional gene expression levels with survival outcome in the context of Cox's proportional hazards model. Each method consists of an initial data reduction step and a secondary model-fitting step. Under all three types of dependency structures considered, stepwise selection consistently fit overly complex models, leading to a reduction in predictive power. In the case of a single influential gene, the stepwise model did consistently capture the gene in the optimal model, but unfortunately it included a number of false positive genes as well. Increasing the number of designated genes while dampening their correlation with survival led to a decrease in the true positive rate and little to no improvement in predictive performance as measured by the ACVL. Perhaps relying on a more stringent criterion such as the BIC for stepwise selection would choose more parsimonious models and hence improve predictive ability.

Some authors have found that principal component analysis or partial least squares does not necessarily improve prediction (Yeung and Ruzzo, 2001; Nguyen and Rocke, 2002). We did not find this to be the case in any of the scenarios we considered, except the 1 Beta case under Myeloma, where a very strong signal was placed on a single gene with high variance. We did not explore the next step of isolating influential genes through the loading matrices. This is important because frequently an aim of a microarray experiment is to identify smaller subsets of genes for further exploration of co-regulation or pathway regulation.

We were impressed that the clustering algorithm performed as well as the principal component algorithm, even in the Independent and Myeloma cases. As in the principal component approach, we did not further explore how one proceeds from the cluster to the single gene or genes driving the relationship to survival. The K-medoids algorithm gives medians of the genes in each cluster, so in principle one can identify genes that are close to the median value. Hierarchical clustering could also be used to further refine clusters.

There are certainly a variety of more complex techniques that can be used for data reduction. As an example, van der Laan and Bryan (2000) provide asymptotic properties of general subsetting techniques using the parametric bootstrap. Hastie et al. (2000)


outline a particular form of clustering based on principal components and show how it may be used to identify influential subsets of genes. If there are specific genes with known function, it may be of interest to perform a supervised clustering to locate similar genes that may or may not correlate with survival. Finally, other baseline or prognostic variables can also be included in the final gene model to see whether the explanatory capability of the genes supersedes better known indicators. For example, Rosenwald et al. (2002) examined the additive explanatory power of individual genes over the independently obtained international prognostic index for the prediction of survival after chemotherapy in diffuse large B-cell lymphoma.

Appendix A

We use a weighted least squares approximation to full maximum likelihood. Hastie and Tibshirani (1990) use a similar weighted least squares method for estimating smooth covariate effects in the proportional hazards model. Cox's model can be defined with hazard function

$$\lambda(t \mid z) = \lambda_0(t) \exp\{\eta(z)\},$$

where $\lambda_0(t)$ is an unspecified baseline hazard function. Assume there are $n$ patients indexed by $i = 1, \ldots, n$. Let $\eta_i = \eta(z_i)$, where $z_i = (z_{i1}, \ldots, z_{ip})$ is the vector of covariates (genes) for the $i$th patient. Let $\Delta_i = 1$ if patient $i$ is a failure and $\Delta_i = 0$ if patient $i$ is censored. Define the score vector as $U = dl/d\eta$ and the observed information matrix as $I = -d^2 l / d\eta\, d\eta^T$. An adjusted dependent variable is

$$y = \eta + I^{-1} U.$$

For ease of presentation assume the data have been ordered so that $T_i \leq T_{i'}$ for $i < i'$. The elements of the score vector and information matrix can be expressed as follows:

$$dl/d\eta_i = \Delta_i - \exp(\eta_i) A(\eta, T_i),$$
$$d^2 l/d\eta_i^2 = -\exp(\eta_i) A(\eta, T_i) + \exp(2\eta_i) B(\eta, T_i),$$
$$d^2 l/d\eta_i\, d\eta_j = -\exp(\eta_i) \exp(\eta_j) B\bigl(\eta, \min(T_i, T_j)\bigr), \quad i = 1, \ldots, n,$$

where $l$ represents the logarithm of the partial likelihood and where

$$A(\eta, T_i) = \sum_{k \in R_i} \frac{\Delta_k}{\sum_{j \in R_k} \exp(\eta_j)}, \qquad B(\eta, T_i) = \sum_{k \in R_i} \frac{\Delta_k}{\bigl(\sum_{j \in R_k} \exp(\eta_j)\bigr)^2},$$


and where $R_i$ is the risk set at time $T_i$. We drop all second-order terms $B(\eta, T_i)$, which leaves a diagonal approximation to the information matrix. The diagonal elements of $I$ are used as the weight vector. Therefore, we use

$$y_i = \eta_i + \frac{\Delta_i - \exp(\eta_i) A(\eta, T_i)}{A(\eta, T_i)}$$

as the adjusted dependent variables, with weights $w_i = A(\eta, T_i)$. Typically we set $\eta_i = 0$, although multiple iterations are possible. Since we need to apply the algorithm for each of several thousand genes, it is computationally efficient to start by sweeping out the mean, centering both the adjusted dependent variables and the covariates by their weighted means. The univariate regressions then do not require many 2 × 2 matrix inversions. We note a close connection to score test statistics based on the partial likelihood to choose basis functions. For instance, for $\eta_i = 0$ the component of the score vector for covariate $g$ is

$$U_g = \sum_i z_{gi} w_i y_i.$$
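An illustrative R translation of this algorithm at $\eta = 0$ is sketched below, assuming distinct survival times for simplicity; `Z` (an n × p gene matrix), `time` and `status` are assumed names, and the guard for zero weights is our addition, not part of the appendix.

```r
## Fast simultaneous univariate Cox score components at eta = 0.
fast_cox_scores <- function(Z, time, status) {
  o <- order(time)
  time <- time[o]; status <- status[o]; Z <- Z[o, , drop = FALSE]
  n <- length(time)
  ## At eta = 0, sum_{j in R_k} exp(eta_j) is the risk-set size n - k + 1,
  ## so A(eta, T_i) is the Breslow cumulative hazard up to T_i.
  A <- cumsum(status / (n - seq_len(n) + 1))
  w <- A                                   # weights from the diagonal of I
  y <- ifelse(A > 0, (status - A) / A, 0)  # adjusted dependent variable;
                                           # A = 0 terms carry zero weight anyway
  ## Sweep out weighted means of y and of each gene, then form the score
  ## component U_g = sum_i z_gi w_i y_i for every gene at once.
  yc <- y - sum(w * y) / sum(w)
  Zc <- sweep(Z, 2, colSums(w * Z) / sum(w))
  colSums(Zc * (w * yc))
}
```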

References

Akaike, H. (1973). Information theory and an extension of the entropy maximization principle. In: Petrov, B.N., Csaki, F. (Eds.), Proceedings of the Second International Symposium on Information Theory. Akademiai Kiado, Budapest.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson Jr., J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.
Allen, D.M. (1974). The relation between variable selection and data augmentation and a method for prediction. Technometrics 16, 125–127.
Alter, O., Brown, P.O., Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Nat. Acad. Sci. 97, 10101–10106.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289–300.
Bock, H.H. (1996). Probabilistic models in cluster analysis. Comput. Statist. Data Anal. 23, 5–28.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Chang, W.C. (1983). On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist. 32, 267–275.
Efron, B., Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. J. Amer. Statist. Assoc. 92, 548–560.
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. 95, 14863–14868.
Fraley, C., Raftery, A.E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97, 611–631.


Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Han, J., Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco.
Harrell Jr., F.E., Lee, K.L., Califf, R.M., Pryor, D.B., Rosati, R.A. (1984). Regression modelling strategies for improved prognostic prediction. Statist. Medicine 3, 143–152.
Harrell Jr., F.E., Lee, K.L., Mark, D.B. (1996). Tutorial in biostatistics. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist. Medicine 15, 361–387.
Hastie, T., Tibshirani, R. (1990). Exploring the nature of covariate effects in the proportional hazards model. Biometrics 46, 1005–1016.
Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., Brown, P. (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1, research0003.1–0003.21.
Henderson, R. (1995). Problems and prediction in survival-data analysis. Statist. Medicine 14, 161–184.
Herwig, R., Poustka, A.J., Muller, C., Bull, C., Lehrach, H., O'Brien, J. (1999). Large-scale clustering of cDNA fingerprinting data. Genome Research 9, 1093–1105.
Kaufman, L., Rousseeuw, P. (1990). Finding Groups in Data. Wiley, New York.
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679.
Korn, E.L., Simon, R. (1990). Measures of explained variation for survival data. Statist. Medicine 9, 487–503.
LeBlanc, M., Crowley, J. (1993). Survival trees by goodness of split. J. Amer. Statist. Assoc. 88, 457–467.
LeBlanc, M., Kooperberg, C., Grogan, T.M., Miller, T.P. (2003). Directed indices for exploring gene expression data. Bioinformatics 19, 686–693.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, CA.
Mallows, C.L. (1973). Some comments on Cp. Technometrics 15, 661–675.
Mathsoft Inc. (1999). S-PLUS 2000 Professional Release 1. Mathsoft Inc.
Nguyen, D., Rocke, D. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, 39–50.
O'Quigley, J., Xu, R. (2001). Explained variation in proportional hazards regression. In: Crowley, J. (Ed.), Handbook of Statistics in Clinical Oncology. Marcel Dekker, New York.
Quackenbush, J. (2001). Computational analysis of microarray data. Nature Rev. Genetics 2, 418–427.
Raftery, A.E., Madigan, D., Volinsky, C.T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics 5. Oxford University Press, Oxford.
Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R., Gascoyne, R.D., Muller-Hermelink, H.K., Smeland, E.B., Staudt, L.M., et al. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England J. Medicine 346, 1937–1947.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Segal, M.R. (1988). Regression trees for censored data. Biometrics 44, 35–48.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281–285.
Tusher, V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Nat. Acad. Sci. 98, 5116–5121.
van der Laan, M.J., Bryan, J. (2000). Gene expression analysis with the parametric bootstrap. Biostatistics 1, 1–19.
Verweij, P.J.M., van Houwelingen, H.C. (1993). Cross-validation in survival analysis. Statist. Medicine 12, 2305–2314.
Volinsky, C.T., Madigan, D., Raftery, A.E., Kronmal, R.A. (1997). Bayesian model averaging in proportional hazard models: Assessing the risk of a stroke. Appl. Statist. 46, 433–448.


Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. J. Amer. Statist. Assoc. 58, 234–244.
Weinstein, J.N., Myers, T.G., O'Connor, P.M., Friend, S.H., Fornace Jr., A.J., Kohn, K.W., Fojo, T., Bates, S.E., Rubinstein, L.V., Anderson, N.L., Buolamwini, J.K., van Osdol, W.W., Monks, A.P., Scudiero, D.A., Sausville, E.A., Zaharevitz, D.A., Bunow, B., Viswanadhan, V.N., Johnson, G.S., Wittes, R.E., Paull, K.D. (1997). An information-intensive approach to the molecular pharmacology of cancer. Science 275, 343–349.
Westfall, P.H., Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. Wiley, New York.
Yeung, K.Y., Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774.