0022-5347/03/1706-0006/0
THE JOURNAL OF UROLOGY®
Copyright © 2003 by AMERICAN UROLOGICAL ASSOCIATION
Vol. 170, S6–S10, December 2003
Printed in U.S.A.
DOI: 10.1097/01.ju.0000094764.56269.2d
COMPARISON OF COX REGRESSION WITH OTHER METHODS FOR DETERMINING PREDICTION MODELS AND NOMOGRAMS

MICHAEL W. KATTAN*

From the Health Outcomes Research Group, Department of Epidemiology and Biostatistics, and Department of Urology, Memorial Sloan-Kettering Cancer Center, New York, New York

* Corresponding author: Memorial Sloan-Kettering Cancer Center, 1275 York Ave., Box 27, New York, New York 10021 (telephone: 646-422-4386; FAX: 630-604-3605; email: [email protected]).
ABSTRACT
Purpose: There is controversy as to whether artificial neural networks and other machine learning methods provide predictions that are more accurate than those provided by traditional statistical models when applied to censored data.

Materials and Methods: Several machine learning prediction methods are compared with Cox proportional hazards regression using 3 large urological datasets. As a measure of predictive ability, discrimination similar to the area under the receiver operating characteristic curve is computed for each method.

Results: In all 3 datasets Cox regression provided predictions comparable or superior to those of neural networks and other machine learning techniques. In general, this finding is consistent with the literature.

Conclusions: Although theoretically attractive, artificial neural networks and other machine learning techniques do not often provide an improvement in predictive accuracy over Cox regression.

KEY WORDS: statistical model, machine learning, neural network models, predictive value of tests
Predictions are central to medical decision making, and clinicians are always in need of good prediction tools. There are numerous prediction models in the urological literature,1 and it can be difficult for the clinician to choose among competing prognostic models. A question that often arises is whether a prognostic model derived from a machine learning method (eg trees and neural networks) is better than one derived from a traditional statistical method (eg Cox or logistic regression). Although the latter are easy to perform and routinely available in standard software packages, the former are thought possibly to predict more accurately because of greater model-fitting flexibility.2 While comparisons of machine learning methods with statistical methods are common in the binary outcome variable setting,3,4 applications to survival data are somewhat rare. Experience using these predictive methods on 3 time-to-recurrence (ie survival type) datasets is reported. The primary concern was to determine which modeling method produced the most accurate prognostic model, as these studies had as their purpose the development of predictive tools for use in clinical counseling.

METHODS

Datasets. Eight modeling techniques were compared in these datasets. Prediction accuracy was the measure of interest. All analyses were performed using S-Plus Professional Software version 4.5 or 2000 (Insightful Corp., Seattle, Washington). Each dataset consisted of an observational cohort from Memorial Sloan-Kettering Cancer Center following a particular form of treatment for a particular form of cancer. The first dataset contained 1,042 patients treated with radiation therapy for prostate cancer. There were 5 predictor variables, and 268 patients had the target event, disease recurrence (see table). A second dataset comprised a cohort of patients treated with seed implants for prostate cancer and followed until recurrence. This dataset was of similar size with slightly fewer predictors and fewer target events. The third dataset followed patients treated with surgery for renal cell carcinoma. The end point here was also disease recurrence, and this dataset was the smallest. The datasets have been described in more detail previously.5–7

Descriptions of datasets

Dataset                        No. Predictors                                No. Pts    No. Events
Renal cell Ca                  4 (categorical 3, continuous 1)                  601         66
Prostate Ca, external beam     5 (categorical 1, continuous 2)                1,042        268
Prostate Ca, brachytherapy     3 (categorical 1, ordinal 1, continuous 1)       920        124

Prediction methods. The standard statistical approach for multivariable survival prediction is Cox regression. Two different Cox regression models were fit, including 1 model that required all ordinal or higher scale predictors to have linear effects and 1 model that relaxed this assumption by fitting restricted cubic splines to these variables.8 No interactions were fit or tested. No variable selection was performed, in light of the work of Harrell showing that variable selection methods tend to harm predictive accuracy relative to fitting the full model.9 Four tree-based methods were fit. Recursive partitioning (rpart)10 and tree structured survival analysis (tssa)11 derive trees from censored data. The rpart method builds large trees and then automatically prunes them to an optimal size. The 2 additional tree-based methods compared were classification and regression trees (CART)12 and multivariate adaptive regression splines (MARS).13 These do not work for censored data, and so the adaptation suggested by Therneau et al was used.14 Similar to rpart, CART also uses cross-validation to find the optimal tree size. MARS is similar to CART, except that, unlike CART, MARS does not necessarily predict the same outcome for all patients in a risk group of the tree. Instead, MARS allows regression models in the terminal nodes of the tree.
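To illustrate the restricted cubic splines mentioned above, a minimal sketch follows. It implements the truncated power basis described by Harrell;8 the knot placement, variable names and toy values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis after Harrell.

    Returns x itself plus k - 2 nonlinear terms, so the fitted
    effect is constrained to be linear beyond the outermost knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    norm = (t[-1] - t[0]) ** 2          # keeps spline terms on the scale of x
    plus = lambda u: np.maximum(u, 0.0) ** 3
    cols = [x]
    for j in range(k - 2):
        term = (plus(x - t[j])
                - plus(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + plus(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(term / norm)
    return np.column_stack(cols)

# Hypothetical usage: expand a continuous predictor (eg PSA) with 4 knots,
# then feed the resulting columns to any Cox regression routine.
psa = np.array([1.2, 4.5, 6.0, 9.8, 15.0, 22.0, 40.0])
X = rcs_basis(psa, knots=np.percentile(psa, [5, 35, 65, 95]))
print(X.shape)  # (7, 3): linear term plus 4 - 2 = 2 spline terms
```

Because the extra columns are just additional covariates, this relaxation costs only k - 2 degrees of freedom per continuous predictor.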
The 2 networks compared were the proportional hazards neural network (phnet) developed for survival analysis15 and a standard artificial neural network,16 which, like some of the tree-based methods, does not directly accommodate censored data. Thus, once again the modification of Therneau et al14 was used, as demonstrated in Kattan et al.17 Briefly, this involved calculating a new outcome variable and building a neural network to predict the new variable. Cross-validation was used to identify the level of weight decay associated with minimum error for all neural networks, as this form of penalty is more effective than adjusting the number of hidden layer neurons.16

Predictive accuracy measure. Computing the accuracy of prognostic models when the evaluation uses censored data involves special challenges and was reviewed thoroughly by Begg et al.18 Their conclusion is similar to that of Harrell et al19 in that a concordance index, similar to the area under a receiver operating characteristic curve, is attractive and, thus, was used in these analyses. The concordance index is the probability that, in a randomly selected pair of cases, the case that fails first had the higher predicted probability of failure. It is computed by first constructing all possible pairs of patients and deleting those pairs in which the patient with the shorter followup time was censored. Then, among the remaining pairs, the concordance index is the proportion of pairs in which the prediction method correctly identified the case that failed first.
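The pairwise computation just described can be written out directly. The following is a minimal sketch rather than the code used in the study; the quadratic scan over all pairs, the skipping of tied followup times and the half-credit for tied predictions are simplifying conventions assumed here.

```python
from itertools import combinations

def concordance_index(time, event, risk):
    """Concordance index for censored data, following the pairwise
    definition in the text.

    time  -- followup time for each patient
    event -- 1 if the event occurred, 0 if the patient was censored
    risk  -- predicted risk score (higher means predicted earlier failure)
    """
    usable = 0
    concordant = 0.0
    for i, j in combinations(range(len(time)), 2):
        # Order the pair so that patient a has the shorter followup.
        a, b = (i, j) if time[i] < time[j] else (j, i)
        if time[a] == time[b] or not event[a]:
            continue  # drop pairs in which the shorter followup is censored
        usable += 1
        if risk[a] > risk[b]:
            concordant += 1.0   # model correctly ranks the earlier failure
        elif risk[a] == risk[b]:
            concordant += 0.5   # tied predictions count half, by convention
    return concordant / usable

# Toy example: 0.5 is a coin toss, 1.0 is perfect discrimination.
print(concordance_index([2, 5, 7, 9], [1, 0, 1, 0], [0.9, 0.4, 0.6, 0.1]))
```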
Resampling. The various prediction techniques were compared on the basis of their expected accuracy when applied to future cases. It is important not to simply compare the apparent accuracy of each technique in the dataset at hand, as that assessment is vulnerable to overfitting. In 2 datasets repeated split sample methods were used, in which the dataset was randomly split into two-thirds training and one-third testing, with the entire process repeated 50 times. This method proved to be computationally burdensome, and so 10-fold cross-validation was used in the third comparison. Nested within each training sample, internal cross-validation was performed for some techniques to set tuning parameters, such as tree size or weight decay. In this manner the test set was never used for tuning, and the prediction model was fixed by the time the one-third testing set or one-tenth fold was used for prediction.
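The nesting matters: tuning must happen inside each training sample, never on the held-out third. Below is a minimal sketch of one replication of this scheme; fit_model, score and tune_grid are hypothetical stand-ins for whichever technique is being evaluated, not the S-Plus routines actually used.

```python
import random

def one_split_replication(data, fit_model, tune_grid, score, seed):
    """One repetition of the repeated split-sample scheme: 2/3 train,
    1/3 test, with tuning done only inside the training portion.

    fit_model(rows, param) returns a fitted model; score(model, rows)
    returns a higher-is-better accuracy measure such as concordance.
    """
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    cut = (2 * len(rows)) // 3
    train, test = rows[:cut], rows[cut:]

    # Inner cross-validation on the training set only, to pick the
    # tuning parameter (eg tree size or weight decay).
    def inner_cv_score(param, folds=10):
        total = 0.0
        for f in range(folds):
            held = train[f::folds]
            kept = [r for i, r in enumerate(train) if i % folds != f]
            total += score(fit_model(kept, param), held)
        return total / folds

    best = max(tune_grid, key=inner_cv_score)

    # The model is now fixed; the test third is touched exactly once.
    return score(fit_model(train, best), test)

# Repeating this 50 times yields the distribution of test-set scores
# summarized by the box plots of figures 1 to 3:
# scores = [one_split_replication(data, fit, grid, score, seed=s)
#           for s in range(50)]
```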
RESULTS

Technique comparisons appear in figures 1 to 3. In the radiation therapy dataset the Cox regression model with restricted cubic splines appeared to be superior by a considerable margin to the next best method, which was a neural network, closely followed in third place by MARS (fig. 1). For the renal cell carcinoma dataset the 2 Cox models appeared to perform the best, with virtually no advantage provided by restricted cubic splines (fig. 2). In the brachytherapy dataset the 2 Cox models narrowly outperformed the other methods, although MARS and an artificial neural network had similar performance (fig. 3). Thus, in all 3 datasets Cox regression was never outperformed on average.

DISCUSSION
In these real world datasets the standard Cox proportional hazards regression model predicted at least as accurately as any of the 4 tree-based methods as well as the 2 neural networks. This finding is important when the goal is to develop a prognostic model for clinical use and predictive accuracy is the primary objective. Cox proportional hazards regression is available in many statistical software packages and is relatively fast to perform. Furthermore, it yields other insights into the prediction model, such as hazard ratios and tests of significance for the predictors. It has the additional advantage of reproducibility. The same result is achieved each time it is run on a particular dataset, which is not necessarily true for machine learning techniques because they use random processes for sampling and/or coefficient estimation.
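For reference, the following is a minimal sketch of fitting a Cox model and reading off the hazard ratios and significance tests mentioned above, using the Python lifelines library as a stand-in for the S-Plus routines used in the study; the cohort, column names and values are entirely hypothetical toy data.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical cohort: followup in months, recurrence indicator and
# two illustrative predictors.
df = pd.DataFrame({
    "months":   [12, 30, 45, 7, 60, 22, 18, 51],
    "recurred": [1, 0, 0, 1, 0, 1, 1, 0],
    "psa":      [18.0, 6.5, 4.1, 5.5, 9.2, 15.5, 3.0, 12.0],
    "gleason7": [1, 0, 0, 1, 1, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="recurred")

# Hazard ratios, confidence intervals and p values come for free,
# which is one of the interpretability advantages noted in the text.
cph.print_summary()
```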
FIG. 1. Technique comparison in 3-dimensional conformal radiation therapy dataset. Y-axis is concordance index, which ranges from 0.5 (coin toss) to 1.0 (perfect ability to discriminate). Boxes indicate 25th and 75th percentiles of the 50 concordance index estimates, with clear band indicating median. Whiskers extending from boxes denote range of estimates. Note that Cox with splines appears to predict best. Cox, Cox proportional hazards. Coxrcs, Cox proportional hazards with restricted cubic splines. rpart, recursive partitioning. tssa, tree structured survival analysis. phnet, neural network adapted to survival data and free of proportional hazards assumption. ANN, artificial neural network adapted to censored data.
FIG. 2. Technique comparison in renal cell carcinoma dataset. There are 10 replications in each box plot. Cox models appear to predict best. Cox, Cox proportional hazards. Coxrcs, Cox proportional hazards with restricted cubic splines. rpart, recursive partitioning. tssa, tree structured survival analysis. phnet, neural network adapted to survival data and free of proportional hazards assumption. ANN, artificial neural network adapted to censored data. Adapted with permission from Kattan et al.6
FIG. 3. Technique comparison in brachytherapy dataset. There are 50 replications in each box plot. Cox models appear to predict best. Cox, Cox proportional hazards. Coxrcs, Cox proportional hazards with restricted cubic splines. rpart, recursive partitioning. tssa, tree structured survival analysis. phnet, neural network adapted to survival data and free of proportional hazards assumption. ANN, artificial neural network adapted to censored data.
For the busy analyst it may not always be worth the effort to build the additional machine learning models. When building models designed for predictive accuracy, Cox regression has demonstrated its ability to compete with trendier alternatives. Unless it achieves inferior predictive accuracy, the Cox model should be favored over a machine learning approach for the aforementioned reasons when a prediction model is the goal of the analysis.
The present results are similar to the conclusions reached in the literature reviews by Schwarzer et al20 and Sargent,4 namely that machine learning methods have often failed to perform better than traditional statistical methods. These authors reviewed survival and binary outcome comparison studies. In particular, the review by Schwarzer et al documents numerous design flaws in the studies that show the superiority of neural networks.20
For example, on numerous occasions the neural network was provided with additional data not available to the statistical method, or many different neural networks were compared with a single statistical model, possibly contributing to a chance finding. In some cases the wrong statistical model was used as the benchmark.

There are several noteworthy limitations to our study and our conclusions. The analyses were based on only 3 datasets, which is too small a study to draw definitive conclusions, and it is difficult to do formal testing to compare methods in this setting. All that can really be said is that it is not uncommon for the Cox model to perform just as well as, if not better than, the more flexible machine learning methods. Apparently these datasets did not contain highly predictive nonlinear or interactive effects, which the machine learning methods tend to detect automatically.17 Certainly this may not always be the case, but one wonders how often exotic relationships may be found in real world (nonsimulated) data.

Another major limitation is the manner in which some of the methods were adapted to accommodate censored data by use of the null martingale residual14 (see the sketch at the end of this section). The null martingale residual is a variable that combines the followup time with the censoring indicator (ie whether treatment has failed or not). It is a continuous variable proportional to the risk of the event for the individual patient, and it represents the difference between the observed number of events and the number expected to have occurred by that point in time. Others may find alternative methods for adapting the techniques and, thus, these results would not be applicable to those approaches or to the methods generally. Rather, these findings are restricted to this particular approach of adapting CART, MARS and the neural network to censored data. However, simple methods of adapting these techniques to censored data do not work.20 It is important to highlight the problem of modeling survival data by simply using the patients in whom treatment fails along with those who are long-term survivors. This biased approach will yield incorrect predictions.20 Consequently, we used a different approach.

A final limitation is the fact that the role of the analyst in our studies was limited. A professional analyst would most likely prefer to perform numerous modeling activities, including particular transformations and other preprocessing of the data, manual exploration of polynomials and/or interactions in the statistical models, and adjustment of tuning parameters in the machine learning techniques. Most settings were left at their defaults, except for the decay of the neural network, which was set automatically by internal 10-fold cross-validation. This lack of involvement by the analyst was intentional because of the difficulty of performing a scientific study with an involved analyst, who cannot be independent across holdout samples (ie testing datasets). In other words, 10-fold cross-validation assumes independence across the tenth of the data held out in each repetition. This cannot be achieved with analyst involvement, as he or she will remember what was done in previous replications and how well it worked. The analyst can undoubtedly remember cases and, thus, defeat the purpose of cross-validation. It would be practically impossible to perform our study with an involved analyst, although doing so would be interesting and valuable.
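To make the adaptation concrete, here is a minimal sketch of the null martingale residual of Therneau et al:14 the observed event indicator minus the Nelson-Aalen estimate of the cumulative hazard at each patient's followup time, computed with no covariates. This illustrates the published formula under simplifying assumptions and is not the study's actual code.

```python
def null_martingale_residuals(time, event):
    """Null martingale residual M_i = delta_i - H(t_i), where H is the
    Nelson-Aalen cumulative hazard estimated with no covariates.

    time  -- followup time for each patient
    event -- 1 if the event occurred, 0 if the patient was censored
    """
    n = len(time)
    order = sorted(range(n), key=lambda i: time[i])

    # Nelson-Aalen estimate: at each distinct time, the hazard
    # increment is (events at that time) / (number still at risk).
    cum_hazard = {}
    at_risk, H = n, 0.0
    i = 0
    while i < n:
        t = time[order[i]]
        deaths = ties = 0
        while i < n and time[order[i]] == t:
            deaths += event[order[i]]
            ties += 1
            i += 1
        H += deaths / at_risk
        cum_hazard[t] = H
        at_risk -= ties

    # Residual: observed events minus events expected by time t_i.
    return [event[i] - cum_hazard[time[i]] for i in range(n)]

# These residuals become the continuous outcome that CART, MARS and the
# standard neural network are trained to predict.
print(null_martingale_residuals([2, 3, 3, 5, 8], [1, 1, 0, 0, 1]))
```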
CONCLUSION

Analysts should recognize that traditional statistical models, when constructed optimally and flexibly, may well provide predictions as accurate as those obtained by machine learning methods. Moreover, an artificial neural network prediction model is not guaranteed to be more accurate than one derived using a traditional statistical technique such as logistic or Cox regression.
REFERENCES
1. Ross, P. L., Scardino, P. T. and Kattan, M. W.: A catalog of prostate cancer nomograms. J Urol, 165: 1562, 2001
2. Kattan, M. W. and Cooper, R. B.: A simulation of factors affecting machine learning techniques: an examination of partitioning and class proportions. Omega Int J Mgmt Sci, 28: 501, 2000
3. Kattan, M. W. and Cooper, R. B.: The predictive accuracy of computer-based classification decision techniques. A review and research directions. Omega Int J Mgmt Sci, 26: 467, 1998
4. Sargent, D. J.: Comparison of artificial neural networks with other statistical approaches: results from medical data sets. Cancer, 91: 1636, 2001
5. Kattan, M. W., Potters, L., Blasko, J. C., Beyer, D. C., Fearn, P., Cavanagh, W. et al: Pretreatment nomogram for predicting freedom from recurrence after permanent prostate brachytherapy in prostate cancer. Urology, 58: 393, 2001
6. Kattan, M. W., Reuter, V., Motzer, R. J., Katz, J. and Russo, P.: A postoperative prognostic nomogram for renal cell carcinoma. J Urol, 166: 63, 2001
7. Kattan, M. W., Zelefsky, M. J., Kupelian, P. A., Scardino, P. T., Fuks, Z. and Leibel, S. A.: Pretreatment nomogram for predicting the outcome of three-dimensional conformal radiotherapy in prostate cancer. J Clin Oncol, 18: 3352, 2000
8. Harrell, F. E., Jr., Lee, K. L. and Mark, D. B.: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med, 15: 361, 1996
9. Harrell, F. E., Jr.: Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer-Verlag, p. 568, 2001
10. Atkinson, E. and Therneau, T.: rpart: recursive partitioning and regression trees. http://lib.stat.cmu.edu/S/rpart
11. Segal, M. R.: Regression trees for censored data. Biometrics, 44: 35, 1988
12. Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J.: Classification and Regression Trees. Monterey: Wadsworth, 1984
13. Kooperberg, C., Bose, S. and Stone, C. J.: Polychotomous regression. J Am Stat Assoc, 92: 117, 1997
14. Therneau, T. M., Grambsch, P. M. and Fleming, T. R.: Martingale-based residuals for survival models. Biometrika, 77: 147, 1990
15. Ripley, B. D. and Ripley, R. M.: Neural networks as statistical methods in survival analysis. In: Artificial Neural Networks: Prospects for Medicine. Edited by R. Dybowski and V. Gant. Austin, Texas: Landes Biosciences, pp. 1–13, 1997
16. Venables, W. N. and Ripley, B. D.: Modern Applied Statistics with S, 4th ed. New York: Springer-Verlag, p. 495, 2002
17. Kattan, M. W., Hess, K. R. and Beck, J. R.: Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Comput Biomed Res, 31: 363, 1998
18. Begg, C. B., Cramer, L. D., Venkatraman, E. S. and Rosai, J.: Comparing tumor staging and grading systems: a case study and a review of the issues, using thymoma as a model. Stat Med, 19: 1997, 2000
19. Harrell, F. E., Jr., Califf, R. M., Pryor, D. B., Lee, K. L. and Rosati, R. A.: Evaluating the yield of medical tests. JAMA, 247: 2543, 1982
20. Schwarzer, G. and Schumacher, M.: Artificial neural networks for diagnosis and prognosis in prostate cancer. Semin Urol Oncol, 20: 89, 2002
DISCUSSION

Dr. Philip W. Kantoff. It seems as if one of the problems with prostate cancer is that the enormous amount of biological variability cannot be accounted for by the current factors. We are trying to set up methodologies that may not be robust enough, given the factors that we throw into the system to predict. Are there artificial scenarios that one can create with all of the factors that can predict 100% of the variability and then compare these different methodologies? Because I suspect that during the next 5 to 10 years there will be a movement toward explaining the majority of the biological variability based on factors that go beyond the current clinical factors. So we should be poised and ready to use the appropriate instruments to explain the variability.

Dr. Michael Kattan. If you take the situation you mentioned to the extreme, in which we have all of the predictors, we shift from a probabilistic world to a deterministic world in which we have everything we need to know. It is just a matter of putting it together. In that instance I would say that the neural network is the way to go because it will drive it to a 100% fit. You just have to grow a big enough tree to get homogeneous leaves all the way around it, and there are techniques that do that. The problem is, what if you are approaching that degree of completeness but are not quite there yet? Then your choice is not so clear because you still have some small region where there is a mix. When you have a mix, you get into trouble with the techniques that push it all the way.

Doctor Kantoff. What percentage of the biological variability that predicts recurrence or biological behavior, which is to say which cancers progress and which do not, do we have in hand right now?

Doctor Kattan. It depends on how you want to measure it. When you use the area under the ROC curve, preoperatively we are roughly in the middle of that scale in predicting recurrence. If you use something like percent of variance explained or an R square type of measure, we are dramatically lower than that.

Dr. Joel B. Nelson. To what extent have these tools actually changed behavior, and specifically patient behavior? As a rule, the public has absolutely no concept of what probability and odds mean. Has this been applied in a population to change behavior?

Doctor Kattan. My hypothesis is that if you take a patient, show him the odds with each treatment, odds of having confined disease, recurrence, metastasis, complications or death, and let him make his choice, he is less likely to show regret if treatment fails. Compared with the patient who was told that treatment X is guaranteed to be curative, there is no question that the informed patient is less likely to show regret.

Dr. Peter R. Carroll. I think clinicians and patients use this information. We do stratify patients based on probabilities of being cured. Patients may look at the recommendations we make and disagree, but we have come a long way in the last 15 years in terms of stratifying patients using nomograms or other systems to help us sort out who needs to be treated in which way.

Dr. Mack Roach, III. With evolving therapy and patient populations, you will have to keep updating the model to reflect those changes. For example, in Hodgkin's disease histology used to be an important factor until effective chemotherapy was available.
In your model with radiotherapy you do not address whether the patients received radiotherapy to the whole pelvis or only to the prostate. There are other aspects of therapy that are changing and producing better results. It seems that the model can be affected by factors that used to be, but are no longer, considered important.

Doctor Kattan. There is always that risk.

Doctor Carroll. I have found your nomogram to be very durable. However, given the kind of risk migration that we are seeing, have you looked at your nomogram over time to see how predictive it is per year of diagnosis? Has that held up?

Doctor Kattan. We were not allowed to include this information in our recent preoperative validation paper because it was not statistically significant. However, we found that the patients who were treated more recently were predicted more accurately by the tool.

Dr. Celestia S. Higano. Speaking as a medical oncologist, I sometimes think that as clinicians we may focus too much on cure. We should also be looking at long-term morbidity. Although it is great to use these nomograms to help patients know what their outcomes might be, I personally do not use them to dictate what treatment a patient should receive. Because of my experiences in dealing with local relapse with bladder and/or ureteral obstruction, bleeding and pain, I may be inclined to advise certain patients to consider surgery despite the likelihood of relapse predicted by the nomogram.

Doctor Kattan. I completely agree with that. I know some people who misuse the nomograms to dictate treatment by simply choosing the method that comes out on top. We have added information clarifying the proper use of the nomogram in the "Frequently Asked Questions" section of our on-line version.