CLINICAL THERAPEUTICS®/VOL. 23, NO. 10, 2001
Letters to the Editor

Dealing with Skewed Cost Data

Dear Dr. Walson:

The problem of highly skewed cost data in health economic studies is an important one that deserves careful consideration. Complex statistical issues, as well as issues arising from the health economic context, are involved; it is therefore essential that contributors to the literature have statistical expertise. Unfortunately, a recent publication in Clinical Therapeutics® by Rascati et al¹ is full of statistical blunders, and their article therefore poses a health risk to this important topic.

1. Probably the biggest single blunder is to take logarithms of negative data. Of course, this is strictly impossible, yet this is what the authors claim to have done. Their data are differences between costs incurred during the 6 months before and after beginning treatment, and these differences can be negative. What I presume the authors have done is to add an arbitrary amount to the data, to make them positive, before taking logarithms. This is an absurd thing to do, not least because the answers they obtain will then depend on the choice of constant. Logarithms should never be used in this way without very good justification.

2. The next most significant blunder concerns the use of the bootstrap. The authors take 150 bootstrap samples from each treatment group. Why 150? This is too small to remove the additional sampling variation of bootstrapping, and it is almost as easy to take 15,000 as 150. They calculate the mean of each bootstrap sample and consider these means as comprising a new dataset. That is already a strange way to view the bootstrap samples, but the authors then compare the 2 treatments by applying a 2-sample t test to the "bootstrap datasets." This is not how the bootstrap should be used. Moreover, it is guaranteed to produce exactly the same result as applying the 2-sample t test to the original data, given a big enough bootstrap dataset.
The only reason the authors report slightly different t statistics for the original data and the bootstrap dataset is because of additional bootstrapping variation due to the small size of the latter.

3. Another error is in using the 2-sample t test in the first place. The original data are obtained by matching patients from the 2 treatment groups, for age, sex, and other factors. If the assumption is that the data are not too skewed for normal-theory methods to apply, the analysis should have been based on the matched-pairs t test.

4. More statistical naivete is seen in the analysis of sample size variation. The authors seem surprised that increasing the sample size does not increase the variance. Why should 3 times as many patients from the same population have a higher variance, particularly when they comprise 3 matches to each of the original patients? The final comment in the same section is that "the differences between the untransformed t test and the bootstrap t test results are important," because in 1 case they reach a P value of
0.05 and in another the P value is higher, 0.06. Not only are these differences completely spurious, as pointed out previously, but to attach importance to this trivial difference in P values demonstrates the worst kind of P-value mentality.

These are all serious errors of statistical understanding and judgment. The authors correctly identify that the proper treatment of skewed cost data is an important concern. They might usefully read some other contributions to this topic.²,³

Anthony O'Hagan, PhD
University of Sheffield
Sheffield, United Kingdom
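Point 1 of the letter, that results obtained from logarithms of shifted data depend on the constant added, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cost differences are invented, the function names are hypothetical, and the unequal-variance (Welch) form of the 2-sample t statistic is used only for convenience, since the letter does not say which variant was applied.

```python
import math
from statistics import mean, variance

# Hypothetical cost differences (after minus before) for two treatment
# groups. These numbers are invented and are NOT taken from the study.
group_a = [-120.0, -40.0, 15.0, 60.0, 300.0, 2500.0]
group_b = [-200.0, -90.0, -10.0, 35.0, 80.0, 900.0]


def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch form)."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def shifted_log_t(x, y, constant):
    """t statistic after adding `constant` to every value and taking natural logs."""
    log_x = [math.log(value + constant) for value in x]
    log_y = [math.log(value + constant) for value in y]
    return welch_t(log_x, log_y)


# Choose constants so that the smallest shifted value is 1, 10, or 1000.
smallest = min(group_a + group_b)
constants = [1 - smallest, 10 - smallest, 1000 - smallest]

# The untransformed t statistic is unaffected by adding a constant ...
raw_t = welch_t(group_a, group_b)
print(f"untransformed           t = {raw_t:.3f}")

# ... but the t statistic on the log scale changes with the constant.
results = {c: shifted_log_t(group_a, group_b, c) for c in constants}
for c, t in results.items():
    print(f"log(x + {c:7.1f})        t = {t:.3f}")
```

Running the sketch prints one t statistic per constant. With these invented data the values differ from one another, whereas the untransformed statistic is the same whatever constant is added.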
References

1. Rascati KL, Smith MJ, Neilands T. Dealing with skewed data: An example using asthma-related costs of Medicaid clients. Clin Ther. 2001;23:481-498.
2. Manning WG, Mullahy J. Estimating log models: To transform or not to transform? J Health Econ. 2001;20:461-494.
3. O'Hagan A, Stevens JW. Assessing and comparing costs: How robust are the bootstrap and methods based on asymptotic normality? Research Report No. 506/00, Department of Probability and Statistics, University of Sheffield. Available at: http://www.shef.ac.uk/-stlao/pdf/nonpar3.PDF.
The authors reply: Anthony O’Hagan raises several questions and concerns regarding the methodology we used in our paper. We appreciate Dr. O’Hagan’s interest in our work and his observations and suggestions. We respond to each of the points he raises, and where appropriate, we supply additional information that may be of use to other investigators who encounter skewed data in future applications.
1. Use of Logarithm Transformations of Difference Scores
Dr. O'Hagan correctly intuits that some of our difference score data were negative, so it was necessary to add a constant to each score to obtain a positive number so that logarithmic transformations of the difference scores could be performed. The number selected resulted in the smallest possible value of 1.00 for the transformed variable, which results in the natural logarithm value of zero. Dr. O'Hagan rightly notes that the end result from an analysis of log-transformed data will depend on the value of the constant chosen, so it is important to note the type of constant employed in such analyses. We appreciate Dr. O'Hagan's calling this omitted detail to our attention. More generally, we are in full agreement with Dr. O'Hagan's primary point regarding log transformations, which is that they should not be used without substantial justifica-