EUROPEAN UROLOGY 62 (2012) 597–600
independent of what his personal a priori decision threshold was. For those who do not want to work with a simple threshold model for personal decisions and still are interested in assessing the utility of the risk prediction model, we should mention that it is possible to introduce into the Brier score a potential imbalance between the benefits of the surgery performed for a diseased man and the costs of treating a healthy man [7]. Note first that the Brier score is the population average of the patients' individual residuals, which are defined as the squared difference between actual binary outcome (0 = SVI is not present, 1 = SVI is present) and the risk of SVI predicted by the model. In its usual form, the Brier score weighs all residuals equally, relative to the sample size. The Brier score can be modified to put relatively more weight (eg, double weight) on residuals from men who are diseased (and therefore need surgery) than on residuals from those who are disease-free. It should be noted that the risk prediction model that is optimal with respect to such a cost–benefit-modified Brier score cannot be calibrated at the same time. This shows that utility and calibration are fundamentally different.
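The cost–benefit-modified Brier score described above can be sketched in a few lines. This is a minimal illustration, not the exact formulation of reference [7]: the function name, the default weights, and the normalization by the sum of weights are assumptions made for the sketch.

```python
def weighted_brier(y, p, w_diseased=2.0, w_healthy=1.0):
    """Cost-benefit-weighted Brier score: average of squared residuals
    (y - p)^2, with residuals from diseased men (y == 1) up-weighted
    relative to residuals from disease-free men (y == 0)."""
    num = den = 0.0
    for yi, pi in zip(y, p):
        w = w_diseased if yi == 1 else w_healthy
        num += w * (yi - pi) ** 2
        den += w
    return num / den

# With equal weights this reduces to the usual Brier score; doubling
# the weight on diseased men penalizes underestimating SVI risk more.
y = [0, 0, 1, 1]  # toy outcomes: 0 = no SVI, 1 = SVI
p = [0.1, 0.3, 0.6, 0.9]  # toy predicted risks
usual = weighted_brier(y, p, w_diseased=1.0)  # 0.0675
```

Note that with unequal weights the score rewards a model that systematically overstates risk in the up-weighted group, which is why, as the editorial observes, the optimal model under this score cannot also be calibrated.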
Conclusions
Risk prediction models can be assessed on one or all of three fundamental levels: discrimination, calibration, and utility. The AUC summarizes discrimination only; the Brier score reflects both calibration and discrimination. Naturally, these traditional measures will pick the same winning model only if all models are calibrated. A recalibrated version of the Gallina nomogram will likely perform similarly to the Partin tables; hence the decision curve analyses will likely yield crossing curves and thus cannot pick a clear winner.

Conflicts of interest: The authors have nothing to disclose.

References
[1] Lughezzani G, Zorn KC, Budäus L, et al. Comparison of three different tools for prediction of seminal vesicle invasion at radical prostatectomy. Eur Urol 2012;62:590–6.
[2] Zorn KC, Capitanio U, Jeldres C, et al. Multi-institutional external validation of seminal vesicle invasion nomograms: head-to-head comparison of Gallina nomogram versus 2007 Partin tables. Int J Radiat Oncol Biol Phys 2009;73:1461–7.
[3] Senn SJ. Dichotomania: an obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. Presented at: 55th Session of the International Statistical Institute; April 5–12, 2005; Sydney, Australia.
[4] Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127–41.
[5] Savage LJ. Elicitation of personal probabilities and expectations. J Am Stat Assoc 1971;66:783–801.
[6] Hand DJ. Construction and assessment of classification rules. Chichester, UK: John Wiley; 1997.
[7] Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biometrical J 2008;50:457–79.

http://dx.doi.org/10.1016/j.eururo.2012.04.053
Platinum Priority

Reply from Authors re: Michael W. Kattan, Thomas A. Gerds. Stages of Prediction Model Comparison. Eur Urol 2012;62:597–9

Andrew J. Vickers a,*, Giovanni Lughezzani b,c

a Memorial Sloan-Kettering Cancer Center, New York, NY, USA; b Cancer Prognostics and Health Outcomes Unit, University of Montreal Health Center, Montreal, QC, Canada; c Department of Urology, Vita-Salute San Raffaele University, Milan, Italy
We would like to thank Drs. Kattan and Gerds for their comments [1] on our paper [2]. Much of their editorial consists of a discussion of the different measures available to evaluate prediction models. This is well written and relatively uncontroversial. Other aspects of the editorial
DOIs of original articles: http://dx.doi.org/10.1016/j.eururo.2012.04.022, http://dx.doi.org/10.1016/j.eururo.2012.04.053
* Corresponding author. Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, Box 44, New York, NY 10065, USA. E-mail address: [email protected] (A.J. Vickers).
make direct reference to our paper. We would like to question some of the conclusions drawn by Kattan and Gerds.

The authors claim that decision curve analysis makes a restrictive and simplistic assumption about the decision-making process, which is that patients and doctors have a single unchanging threshold for an intervention, such as seminal vesicle resection. We find this a rather strange claim, quite outside of common understandings of decision curve analysis. The assumptions involved in decision curve analysis are twofold. First, patients and doctors do have some kind of threshold that can be used to turn the results of a prediction model into action. Note that if this assumption is incorrect, then prediction modeling itself becomes null and void: If I do not know whether a particular level of risk is high enough to prompt action, then you might as well not tell me my risk. The second assumption in decision curve analysis is that an experienced clinician can state a reasonable range of thresholds. Sensible individuals can have different thresholds within this range, depending on personal preferences about outcomes, such as sexual function; an individual with a threshold outside this range would, however, be considered irrational. For example, a
doctor who refused to resect the seminal vesicles unless there was a 50% probability of seminal vesicle invasion would be thought of by peers as having a highly inappropriate approach to cancer surgery.

This leads to a second unusual claim by Kattan and Gerds, which is that decision curve analysis is limited because ''the distribution of the personal thresholds in the population is unknown.'' But the way to interpret a decision curve is entirely independent of the distribution of thresholds. If, as is the case with our seminal vesicle invasion paper, one model is superior across the full range of reasonable threshold probabilities, then we can advise clinicians to use that model to make clinical decisions for all patients. If different models are favored at different threshold probabilities, what Kattan and Gerds describe as ''crossing curves,'' then no model is favored. In the original paper introducing decision curve analysis [3], it is explicitly stated that decision curve analysis will sometimes come to ambiguous conclusions and, if so, a more formal decision analysis may be required: ''We do not propose decision curve analysis as a substitute for existing decision-analytic methods, though it may help indicate where such methods may be of benefit.''

To use a very simple analogy, imagine that I own a business that sells a gadget for $100. I am fully aware that shoppers vary in their willingness to pay for my gadget. However, if all reasonable shoppers would pay $120 or more, or if no one would be willing to pay more than $90, then I do not have to worry about the distribution of the thresholds. If the thresholds include $100, then obviously I will have to do more research.
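The threshold-range reasoning above can be made concrete with the net benefit calculation on which decision curve analysis is built (defined in the paper cited as [3]): NB(t) = TP/n − (FP/n) × t/(1 − t) at threshold probability t. This is a minimal sketch under stated assumptions: the `pick_winner` helper, its simple tie handling, and the toy data are illustrative, not part of the original analyses.

```python
def net_benefit(y, p, t):
    """Net benefit of treating patients whose predicted risk p meets
    threshold probability t: NB(t) = TP/n - (FP/n) * t / (1 - t)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 0)
    return tp / n - (fp / n) * t / (1 - t)

def pick_winner(y, models, thresholds):
    """Compare strategies over the clinically reasonable threshold range.
    'Treat all' and 'treat none' are included as default strategies.
    Returns the strategy with the highest net benefit at every threshold,
    or None when the decision curves cross (no clear winner)."""
    strategies = dict(models)
    strategies["treat all"] = [1.0] * len(y)   # everyone meets any threshold
    strategies["treat none"] = [0.0] * len(y)  # no one meets any threshold
    winner = None
    for t in thresholds:
        nb = {name: net_benefit(y, p, t) for name, p in strategies.items()}
        leader = max(nb, key=nb.get)
        if winner is None:
            winner = leader
        elif leader != winner:
            return None  # crossing curves: no clear winner
    return winner

# Toy data: 3 of 8 men have SVI and the model separates them well,
# so it dominates the default strategies across the whole range.
y = [1, 1, 1, 0, 0, 0, 0, 0]
model = {"model": [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.2, 0.1]}
winner = pick_winner(y, model, [0.2, 0.3, 0.4])  # "model"
```

This mirrors the interpretation rule stated above: if one strategy has the highest net benefit at every reasonable threshold, it can be recommended for all patients without knowing the distribution of individual thresholds; otherwise a more formal decision analysis is needed.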
As it happens, decision curve analysis is more often like the first case than the second: If you look up published decision curves, they often do find a clear answer, such as that a model is preferable across all (or almost all) reasonable threshold probabilities [4–8], or that a model is preferable (if at all) within such a limited range that it should not be used [9,10]. In other cases, an indeterminate finding constitutes a clear negative result. For example, when a novel marker improves net benefit for only part of the range of threshold probabilities [11], the conclusion must be that the marker has not been shown to be of clear benefit and that clinical use remains unjustified.

A third unusual claim of the paper involves a severe case of subjunctivitis: Had the results of one model been adjusted so that they were better (recalibration) and had this led to the decision curves crossing, then the method would not have been able to pick a clear winner. This would be like criticizing randomized trial methodology on the basis of a trial showing drug A to be superior to drug B, arguing that had a different dose of drug B been used, and had this led to a nonsignificant difference between groups, then the randomized trial would not have been of use.

To recap our paper, we wished to compare three published prediction tools for seminal vesicle invasion.
We investigated a variety of statistical methodologies, including discrimination, calibration, Brier score, predictiveness curves, and reclassification metrics. None was able to give a consistent and interpretable answer as to which model would lead to the best clinical decision making, and whether that model was better than an alternative, such as resection in all or no patients. In contrast, decision curve analysis was able to identify not only that the Partin tables would lead to the best clinical outcomes of the three models presented, but also that use of the Partin tables would lead to better clinical outcomes than deciding treatment without a model. We concluded that only decision curve analysis was able to provide clinically meaningful results. Nothing in the editorial of Kattan and Gerds contradicts this conclusion.

Conflicts of interest: The authors have nothing to disclose.
References
[1] Kattan MW, Gerds TA. Stages of prediction model comparison. Eur Urol 2012;62:597–9.
[2] Lughezzani G, Zorn KC, Budäus L, et al. Comparison of three different tools for prediction of seminal vesicle invasion at radical prostatectomy. Eur Urol 2012;62:590–6.
[3] Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26:565–74.
[4] Perdonà S, Cavadas V, Di Lorenzo G, et al. Prostate cancer detection in the ''grey area'' of prostate-specific antigen below 10 ng/ml: head-to-head comparison of the updated PCPT calculator and Chun's nomogram, two risk estimators incorporating prostate cancer antigen 3. Eur Urol 2011;59:81–7.
[5] Cavadas V, Osório L, Sabell F, Teves F, Branco F, Silva-Ramos M. Prostate Cancer Prevention Trial and European Randomized Study of Screening for Prostate Cancer risk calculators: a performance comparison in a contemporary screened cohort. Eur Urol 2010;58:551–8.
[6] Di Napoli M, Godoy DA, Campi V, et al. C-reactive protein level measurement improves mortality prediction when added to the spontaneous intracerebral hemorrhage score. Stroke 2011;42:1230–6.
[7] Cooperberg MR, Hilton JF, Carroll PR. The CAPRA-S score. Cancer 2011;117:5039–46.
[8] Yang Q, Liu T, Valdez R, Moonesinghe R, Khoury MJ. Improvements in ability to detect undiagnosed diabetes by using information on family history among adults in the United States. Am J Epidemiol 2010;171:1079–89.
[9] Nam RK, Kattan MW, Chin JL, et al. Prospective multi-institutional study evaluating the performance of prostate cancer risk calculators. J Clin Oncol 2011;29:2959–64.
[10] Vickers AJ, Wolters T, Savage CJ, et al. Prostate-specific antigen velocity for early detection of prostate cancer: result from a large, representative, population-based cohort. Eur Urol 2009;56:753–60.
[11] Klatte T, Waldert M, de Martino M, Schatzl G, Mannhalter C, Remzi M. Age-specific PCA3 score reference values for diagnosis of prostate cancer. World J Urol 2012;30:405–10.

http://dx.doi.org/10.1016/j.eururo.2012.05.067