Radiotherapy and Oncology 33 (1994) 171-176
Technical note
Potential pitfalls in the use of p-values and in interpretation of significance levels

Hans-Peter Beck-Bornholdt*, Hans-Hermann Dubben
Institute of Biophysics and Radiobiology, University of Hamburg, Martinistraße 52, 20246 Hamburg, Germany
Received 19 July 1994; revision received 18 October 1994; accepted 25 October 1994
* Corresponding author.
Abstract

In multiparameter analysis of clinical data the likelihood of obtaining significant results just by chance increases considerably with the overall number of tests performed. This can be compensated for by adjusting the p-values. Two tables are given from which adjusted p-values may be read off, providing a simple procedure to test the reliability of the results of clinical studies. Multivariate analyses often impress with extremely low p-values, but these results frequently turn out to be non-significant when the influence of multiple testing is considered. As a consequence, the truly relevant results might be overlooked because of the large number of spurious results accumulating in the literature. The use of inappropriate statistics hampers progress in clinical research. It is concluded that more care in the use of p-values in the analysis and interpretation of clinical data is required.
Keywords: Significance level; p-value; Multivariate analysis; Multiple testing
1. Introduction
In clinical research the application of adequate statistical analysis is indispensable. Present publication policy is to accept a paper when statistically significant results are reported. Consequently, the quotation of p-values has become very common practice in radiooncological studies. In general, a result is called statistically significant if p ≤ 0.05 is obtained. This level is nothing but a convention. It implies that if 20 irrelevant parameters are tested, on average one (= 5%) will come out as significant simply by chance (type I error). Whether this rate of error is really acceptable depends on the consequences that might arise from a potential mistake. The risk of being wrong increases when more than one parameter is tested for significance [2]. For example, if two independent parameters are tested there is a 5% risk in either test of obtaining a 'significant' result by chance. Instead of a 5% risk of deriving an erroneous conclusion, the risk is now almost doubled, to approximately 10%. The present paper deals with some potential pitfalls in the interpretation of data arising from this type of risk accumulation.
2. Materials and methods

The overall significance level $\alpha_c$ can be calculated from the individual significance levels $\alpha_i$ by

$\alpha_c = 1 - (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3) \cdots (1 - \alpha_n),$

where $n$ represents the number of independent tests performed. When all significance levels are equal, i.e. for $\alpha_1 = \alpha_2 = \alpha_3 = \cdots = \alpha_n = \alpha_i$, we obtain

$\alpha_c = 1 - (1 - \alpha_i)^n.$

If an overall significance level $\alpha_c$ for the entire study is not to be exceeded, the p-values $p_i$ that are required for every individual test have to be adjusted to the number of tests $n$:

$p_i = 1 - (1 - \alpha_c)^{1/n}.$

Using a rather simple rule of thumb, the individual value $p_i$ required for significance can roughly be approximated by dividing the cumulative value $\alpha_c$ (which is usually 0.05) by the number of tests $n$: $p_i \approx \alpha_c / n = 0.05/n$.
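As an illustration (not part of the original paper), the adjustment can be evaluated in a few lines of Python; the function names are ours and the snippet simply computes the two expressions above:

    def adjusted_p(alpha_c: float, n: int) -> float:
        """Exact per-test threshold p_i = 1 - (1 - alpha_c)**(1/n)."""
        return 1.0 - (1.0 - alpha_c) ** (1.0 / n)

    def rule_of_thumb(alpha_c: float, n: int) -> float:
        """Bonferroni-type approximation alpha_c / n."""
        return alpha_c / n

    for n in (1, 5, 14, 22):
        print(f"n = {n:2d}: exact {adjusted_p(0.05, n):.4f}, "
              f"rule of thumb {rule_of_thumb(0.05, n):.4f}")
    # n = 14 reproduces the values 0.0037 (exact) and 0.0036 (rule of thumb)
    # that are discussed in the Results section below.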
This adjustment of the significance level is called the Bonferroni method [1]. Since this method is conservative when the parameters tested are not independent, less conservative methods for multiple tests have been developed that attempt to keep the overall significance at the intended level while making it less likely that a true effect will be missed (see Ref. [1] for review).
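A well-known example of such a sequentially rejective procedure is Holm's step-down method: the smallest p-value is compared with α_c/n, the next smallest with α_c/(n − 1), and so on, stopping at the first non-rejection. The sketch below is our illustration (function name and example p-values are invented) and is not necessarily one of the methods reviewed in Ref. [1]:

    def holm_bonferroni(p_values, alpha_c=0.05):
        """Holm's step-down procedure: compare the k-th smallest p-value
        (k = 0, 1, ...) with alpha_c / (n - k); stop at the first failure."""
        n = len(p_values)
        order = sorted(range(n), key=lambda i: p_values[i])
        rejected = [False] * n
        for k, idx in enumerate(order):
            if p_values[idx] <= alpha_c / (n - k):
                rejected[idx] = True
            else:
                break  # larger p-values cannot be rejected either
        return rejected

    # Five tests: plain Bonferroni (0.05/5 = 0.01) rejects only the first p-value,
    # whereas Holm additionally rejects the second.
    print(holm_bonferroni([0.004, 0.011, 0.030, 0.430, 0.800]))
    # [True, True, False, False, False]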
3. Results

The consequences of multiple testing are illustrated in Fig. 1. Chart A in Fig. 1 shows 100 squares. Every square represents a study in which only one parameter was tested. The filled squares represent the studies in which a significant result, i.e. p < 0.05, was obtained by chance. On average, five out of 100 studies give a false positive result, corresponding to the accepted risk of 5%. However, clinical studies in which only one parameter is tested for significance are rare. Chart B shows a more realistic example. Every large square represents one study in which 16 parameters (represented by the small squares) were tested. The filled squares again represent the false positive parameters. Although the total proportion of erroneously significant parameters is once again 5% (13/256), 9 out of 16 studies (56%) give an incorrect overall result. This is much more than the accepted risk of 5%. Chart C shows four large squares, each containing 81 small squares, representing four studies in which 81 independent parameters were tested for significance. In the example shown, 16 parameters were found to be significant by chance; this again corresponds to a rate of 5%. The probability of obtaining an incorrect result is 1 − 0.95^81 = 98.4%, i.e. virtually all studies testing such a large number of parameters will come up with false conclusions.

Fig. 1. Illustration of the increasing risk of finding 'significant' parameters by chance with increasing number of parameters (small squares). Filled squares represent parameters with a 'significant' (p < 0.05) result obtained by chance. Chart A: 100 studies testing only one parameter each. Chart B: 16 studies testing 16 parameters each. Chart C: four studies testing 81 parameters each.
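The percentages quoted for Fig. 1 follow directly from α_c = 1 − (1 − α_i)^n. A short Python check (our illustration, not part of the original paper; the function name is ours):

    def familywise_error(alpha_i: float, n: int) -> float:
        """Probability of at least one chance 'significant' result
        among n independent tests, each performed at level alpha_i."""
        return 1.0 - (1.0 - alpha_i) ** n

    for n in (1, 16, 81):
        print(f"n = {n:2d}: {familywise_error(0.05, n):.1%}")
    # 5.0% for a single test, 56.0% for 16 tests and 98.4% for 81 tests,
    # the figures quoted for the three charts of Fig. 1.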
Table 1
Adjusted individual p-values (p_i) required to obtain an overall significance level of 5%, 1%, or 0.1%, respectively. n represents the number of independent significance tests performed. Example: when 12 parameters are tested, p_i < 0.0043 is required to obtain significance for the individual parameter if the overall significance level must not exceed 5%. The latter is the probability of finding at least one significant parameter by chance.

        Overall significance level α_c
n       0.05 (5%)    0.01 (1%)    0.001 (0.1%)
1       0.05         0.01         0.001
2       0.025        0.005        0.00050
3       0.017        0.0033       0.00033
4       0.013        0.0025       0.00025
5       0.0102       0.00201      0.000200
6       0.0085       0.00167      0.000167
7       0.0073       0.00143      0.000143
8       0.0064       0.00126      0.000125
9       0.0057       0.00112      0.000111
10      0.0051       0.00100      0.000100
11      0.0047       0.00091      0.000091
12      0.0043       0.00084      0.000083
13      0.0039       0.00077      0.000077
14      0.0037       0.00072      0.000071
15      0.0034       0.00067      0.000067
16      0.0032       0.00063      0.000063
17      0.0030       0.00059      0.000059
18      0.0028       0.00056      0.000056
19      0.0027       0.00053      0.000053
20      0.0026       0.00050      0.000050
25      0.00205      0.00040      0.000040
30      0.00171      0.00034      0.000033
35      0.00146      0.00029      0.000029
40      0.00128      0.00025      0.000025
45      0.00114      0.00022      0.000022
50      0.00103      0.000201     0.000020
60      0.00086      0.000167     0.0000167
70      0.00073      0.000144     0.0000143
80      0.00064      0.000126     0.0000125
90      0.00057      0.000112     0.0000111
100     0.00051      0.000100     0.0000100
Table 2
Probability (as a percentage) of finding at least x significant results by chance when n independent tests were performed at a significance level of 5% (p_i < 0.05). Example: there is a probability of 13% of finding three or more significant parameters when 25 parameters were tested.

n       x=1    x=2    x=3    x=4    x=5    x=6    x=7    x=8    x=9    x=10
1       5
2       9.75   0.25
3       14     0.73   0.01
4       19     1.4    0.05
5       23     2.3    0.12
6       26     3.3    0.22   0.01
7       30     4.4    0.38   0.02
8       34     5.7    0.58   0.04
9       37     7.1    0.84   0.06
10      40     8.6    1.2    0.10   0.01
11      43     10     1.5    0.16   0.01
12      46     12     2.0    0.22   0.02
13      49     14     2.5    0.31   0.03
14      51     15     3.0    0.42   0.04
15      54     17     3.6    0.55   0.06   0.01
16      56     19     4.3    0.70   0.09   0.01
17      58     21     5.0    0.88   0.12   0.01
18      60     23     5.8    1.1    0.15   0.02
19      62     25     6.7    1.3    0.20   0.02
20      64     26     7.6    1.6    0.26   0.03
25      72     36     13     3.4    0.72   0.12   0.02
30      79     45     18     6.1    1.6    0.33   0.06   0.01
35      83     53     25     9.6    2.9    0.72   0.15   0.03
40      87     60     32     14     4.8    1.4    0.34   0.07   0.01
45      90     67     39     19     7.3    2.4    0.66   0.16   0.03   0.01
50      92     72     46     24     10     3.8    1.2    0.32   0.08   0.02
60      95     81     58     35     18     7.9    3.0    0.98   0.29   0.07
70      97     87     69     47     27     14     6.0    2.3    0.80   0.25
80      98     91     77     57     37     21     11     4.7    1.8    0.65
90      99     94     83     66     47     29     16     8.1    3.6    1.5
100     99.4   96     88     74     56     38     23     13     6.3    2.8
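The entries of Table 2 are upper-tail probabilities of a binomial distribution with success probability 0.05. Readers who need values of n or x that are not listed can regenerate them with a few lines of Python (our sketch, not part of the original paper; the function name is ours):

    from math import comb

    def prob_at_least(x: int, n: int, alpha: float = 0.05) -> float:
        """P(at least x chance 'significant' results in n independent tests
        at level alpha): upper tail of a binomial(n, alpha) distribution."""
        return sum(comb(n, k) * alpha**k * (1 - alpha)**(n - k)
                   for k in range(x, n + 1))

    # Reproduce the example given with Table 2: three or more chance
    # significances among 25 tests.
    print(f"{prob_at_least(3, 25):.0%}")   # 13%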
Table 1 lists the adjusted individual p-values required for overall significance levels of 5%, 1%, or 0.1%. If, for example, 14 parameters are tested and an overall risk of 5% must not be exceeded, the required p-value is 0.0037. Using the rule of thumb given above, a value of 0.0036 is obtained, which is not very different from the exact value. In some publications variables are only reported as being significant or non-significant, but the corresponding p-values are not quoted. In this case, risk accumulation can be allowed for by using Table 2, which lists the probability of finding a certain minimum number of falsely significant parameters after multiple testing. For example, if 14 parameters that are completely irrelevant for treatment outcome are tested, the probability of finding at least one 'significant' parameter by chance is 51%, and the probability of two or more 'significant' variables is 15% (Table 2). In other words, every second study that explores 14 meaningless parameters will show up with at least one 'significant' result; every seventh study will even yield two or more 'significant' effects. These considerations hold not only for tests of multiple parameters but generally for the overall number of tests. For example, if the impact of two parameters (e.g. total dose and overall treatment time) on two endpoints (e.g. local control and distant metastases) is tested for the four T stages, the number of tests is n = 2 × 2 × 4 = 16. The probability of obtaining a false significant result is α_c = 1 − (1 − α_i)^n = 1 − 0.95^16 = 0.56. This again differs considerably from the usually accepted risk of 0.05. Of course, the number of tests n that has to be considered when applying Tables 1 and 2 comprises all tests that were performed and not only those that were selected for publication (Fig. 2). Therefore, it is necessary to quote all parameters, subgroups and endpoints that were analysed for significance at any time.

4. Discussion

To illustrate some of the potential pitfalls in the use of p-values, the following three examples taken from this journal are discussed.
Fig. 2. Illustration of the selection of 'significant' parameters and endpoints ('cut-and-glue' technique), shown in three stages: TEST, SELECTION and REPORT. Every square represents a significance test. Filled squares represent 'significant' results (p < 0.05) obtained by chance. Selection leads to a higher proportion of 'significant' results. Note that non-significant parameters are also reported, because it would look odd if they were missing, e.g. T-stage or nodal involvement.
Example a: Use of p_i = 0.05 while testing many parameters
In a retrospective analysis of clinical data 22 independent parameters were tested. While 19 turned out to show no significant effect, three parameters were found to be significant with p_i < 0.05, p_i < 0.05, and p_i < 0.01. Table 1 shows that when 22 parameters are tested, an individual p_i < 0.0023 is required for a cumulative p < 0.05. Using this criterion, none of the parameters tested was, in fact, significant. As shown in Table 2, the probability of finding at least three significant parameters by chance is about 9% when 22 tests are performed.
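Both numbers quoted for Example a can be reproduced with the formulas given above; a brief check (our code, assuming 22 independent tests at the 5% level):

    from math import comb

    # Required individual threshold for 22 tests and an overall level of 5%.
    print(f"{1 - (1 - 0.05) ** (1 / 22):.4f}")          # 0.0023

    # Probability of at least three chance significances among 22 tests.
    p_at_least_3 = sum(comb(22, k) * 0.05**k * 0.95**(22 - k) for k in range(3, 23))
    print(f"{p_at_least_3:.3f}")                         # 0.095, i.e. about 9%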
Example b: Testing subgroups
In a multivariate analysis of side effects, five parameters were tested, leading to four non-significant parameters, while for the fifth parameter an individual p-value of 0.004 was obtained. Since according to Table 1 only p_i ≤ 0.01 is required, this parameter is indeed significant. Subsequently the analysis was repeated for two subgroups of patients that were stratified according to the fifth parameter. This corresponds to eight additional tests of significance. Five of these tests turned out to be non-significant, whereas two were quoted as trends (p_i = 0.066 and p_i = 0.065) and one as significant (p_i = 0.013). According to Table 1, an individual p_i ≤ 0.0064 is required for significance. Thus, none of the parameters tested in the subgroup analysis was significant. It is also obvious that the quoted trends are invalid.
Example c: Unknown number of tests but quotation of only some
Analysis of clinical data showed no significant impact of overall treatment time on treatment outcome in most of the analysed subgroups. However, in some combinations of pooled subgroups treatment time became significant. From the text it is apparent that the authors analysed far more combinations of subgroups and parameters than those specified in the published table. Thus, only a selection of results was reported. The estimation of the statistical significance is not possible, since the total number of tests was not quoted.

It is a rather frequent procedure in retrospective analyses of clinical data to test the influence of any available variable on as many endpoints as possible. Then the most promising variables and endpoints are selected for publication (Fig. 2), suggesting an artificially high proportion of significant results. It is not possible to correct the significance level by using Tables 1 or 2 when the total number of significance tests performed is not known. Thus, as an absolute minimum, all individual p-values should be quoted as numbers, to enable readers to perform their own calculations, and the multiple comparison problem should be addressed carefully in drawing conclusions from data and mentioned in the discussion.

It is remarkable that the incorrect use of p-values in the analysis of clinical data is a rather frequent phenomenon. The assessment of the statistical quality of articles published in various journals [1,2], e.g. the British Medical Journal or the New England Journal of Medicine, revealed inaccurate statistics in about 50% of all published papers reporting numerical data. However, this is not a peculiarity of the above mentioned journals. Similar findings can also be obtained from other journals, including Radiotherapy and Oncology. This is illustrated in Table 3, where a number of statements are listed which were taken from the abstracts or titles of articles published in Radiotherapy and Oncology from January 1993 through August 1994. For these findings significance was reported, but their cumulative risk of being false is greater than 5%. The list is probably not complete, but contains all publications in which the incorrect use of p-values was apparent at first glance. The 18 papers compiled in Table 3 represent a considerable portion of all the clinical studies published in this journal during the time period considered.

The statistical considerations presented in this paper are neither new nor original but merely basic statistics. Recently Tannock pointed out that '... 1 in every 20 p-values will be less than 0.05 by chance alone. Thus, if retrospective analysis is to be undertaken at all, any apparent differences must be regarded as hypothesis generating, not hypothesis testing. Such analysis can provide ideas for new trials but nothing more' [3].
Table 3
Statements taken from abstracts and/or titles of articles published in Radiotherapy and Oncology from January 1993 through August 1994 for which the overall significance level of being wrong is greater than 5%. For each statement the reported p-value is given, followed by the p-value after consideration of the overall significance level (denoted 'corrected').

(a) After limited surgery and radiotherapy of craniopharyngioma, age^a and technique of radiotherapy^b are independent prognostic factors for long-term survival. Reported: 0.01^a, 0.01^b; corrected: >>0.1^a, >>0.1^b.
(b) In brachytherapy of oral cancers necrosis was significantly related to activity of wires. Reported: 0.013; corrected: >0.1.
(c) Median pO2^a and clinical stage^b (FIGO) are independent highly significant predictors of recurrence-free survival in uterine cervix cancer following irradiation or multimodality therapy including irradiation. Reported: 0.023^a, 0.014^b; corrected: >>0.1^a, >0.1^b.
(d) The results of a randomized clinical trial comparing two radiation schedules in the palliative treatment of brain metastases showed that the presence of extracerebral metastases^a and multiple brain metastases^b appeared to be prognostic of overall survival. Reported: <0.01^a, <0.01^b; corrected: <0.06^a, <0.06^b.
(e) A significant increase in the incidence of rectal^a, skin^b, and bone^c tumours was observed in patients treated with radiotherapy for breast cancer. Reported: <0.05^a, <0.05^b, <0.01^c; corrected: <0.68^a, <0.68^b, <0.2^c.
(f) Long-term results of adult craniopharyngiomas treated with combined surgery and radiation showed that the 5- and 10-year survival for adults <50 years was significantly better than for older patients. Reported: 0.03; corrected: >>0.1.
(g) In radiotherapy for supratentorial low-grade gliomas, Karnofsky score significantly determined the patients' outcome. Reported: 0.015; corrected: >0.05.
(h) Albumin^a <35 g/l and lactate dehydrogenase^b >400 U/l, as determined by commonly recorded blood tests, as well as age >75 years^c, are significant prognostic factors in bladder carcinomas treated with definitive radiotherapy. Reported: 0.05^a, 0.02^b, 0.02^c; corrected: >>0.1^a, >>0.1^b, >>0.1^c.
(i) In cancer patients with advanced disease psychological distress is related to pain. Reported: <0.01; corrected: <0.19.
(j) T-stage of the primary lesion is predictive for nodal failure in squamous cell carcinoma of the lip. Reported: 0.03; corrected: >0.1.
(k) Local tumour control in patients with posterior uveal melanomas treated with ruthenium-106 plaques was predicted by total dose to the sclera. Reported: 0.04; corrected: >>0.1.
(l) Persistence of malignant cells in the cerebrospinal fluid^a and non-radical surgery^b are significantly correlated with an adverse outcome for overall survival in adult patients with medulloblastoma. Reported: <0.05^a, <0.01^b; corrected: <0.46^a, <0.11^b.
(m) The serum prostate-specific antigen level 3 months after radiotherapy for prostate cancer provides a remarkably early assessment of response to treatment. Reported: <0.005; corrected: <0.07.
(n) T-stage^a and pathological margins^b are independent factors predictive for recurrence after radiotherapy in breast conserving treatment for early stage breast cancer. Reported: 0.04^a, 0.03^b; corrected: >>0.1^a, >>0.1^b.
(o) Both series are consistent in showing for T2/T3 laryngeal tumours the presence of a mean time factor of 0.6-0.8 Gy per day required to abrogate the decrease in tumour control concomitant with an increase in overall treatment time. Reported: <0.05; corrected: [illegible in the original].
(p) The local control rate of T1 carcinoma of the mobile tongue and/or floor of the mouth was better after exclusive brachytherapy than after combination with external irradiation. Reported: <0.01; corrected: <0.4.
(q) The risk of death from differentiated thyroid cancer was significantly influenced by histological type of tumour^a, clinical-pathological stage^b of disease and cervical lymph node^c status. Reported: <0.05^a, <0.01^b, <0.005^c; corrected: <0.9^a, <0.4^b, <0.2^c.
(r) There is a volume effect in radiation-induced diarrhea^a in patients treated for rectal carcinoma. The type of rectal surgery^b significantly influenced the incidence of chronic diarrhea and malabsorption. Reported: 0.025^a, 0.04^b; corrected: >0.2^a, >0.2^b.
Multivariate analyses that include dozens of variables and several endpoints usually impress with extremely small p-values, which often turn out to be non-significant after considering the overall significance level. This is a severe problem for the interpretation of radiooncological data. It is the purpose of this paper to draw attention to this problem by indicating that a result labelled 'statistically significant' by a routine statistics computer program may, in many instances, be the result of chance, not of science.

One might object that after correction for multiple tests, investigators are punished rather than rewarded for extensive evaluation of data sets. If, for example, a given set of data yields a significant dependence of treatment outcome on total dose (p = 0.03), this significance vanishes if we test for another, perhaps irrelevant, parameter, since the p-value required for significance drops to 0.025.
However, testing two rather than one parameter increases the probability of chance results, and therefore the significance level has to be adjusted. If not, then those who proceed as illustrated in Fig. 2 are rewarded and those who use statistics adequately are punished. Therefore, it is important to decide a priori how many and which statistical tests will be performed with a given data set. The use of inappropriate cumulative significance levels does not only increase the daily load of useless information but also leads to disinformation that hampers progress in clinical research. It is therefore concluded that considerably more care in the use of p-values in the analysis and interpretation of data is required, not only for scientific but also for ethical reasons.
References

[1] Godfrey, K. Comparing the means of several groups. N. Engl. J. Med. 313: 1450-1456, 1985.
[2] Jamart, J. Statistical tests in medical research. Acta Oncol. 31: 723-727, 1992.
[3] Tannock, I.F. Some problems related to the design and analysis of clinical trials. Int. J. Radiat. Oncol. Biol. Phys. 22: 881-885, 1992.