GYNECOLOGIC ONCOLOGY
23, 275-283 (1986)
Analysis of Clinical Trials in Gynecologic Cancer: Timing and Interpretation¹,²
JOHN A. BLESSING, PH.D.,* AND BARRIE ANDERSON, M.D.†
*Roswell Park Memorial Institute, Buffalo, New York; and †Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, University of Iowa Hospitals and Clinics, Iowa City, Iowa
Received March 5, 1985

When and how often a clinical trial is analyzed is as important as the use of appropriate methodology. Premature analysis may result in inaccurate estimation of tumor response and adverse effects as well as misrepresentation of survival. Moreover, bias may be introduced or the study may be abandoned entirely. Of parallel importance to the timing is the emphasis given to the analyses. If a randomized comparative trial of two regimens is conducted and a significant therapeutic effect is not discerned, the study is generally classified as a negative study. This view is oversimplified; the negative conclusion drawn must have a much more limited scope than is generally appreciated. Lastly, the pitfalls of interpreting time variable curves such as survival are of special note. Quite often the "tails" of the curves totally belie the true implications of the analysis. © 1986 Academic Press, Inc.
¹ Research supported by National Cancer Institute Grant CA 37517 (Gynecologic Oncology Group Statistical Office). Presented at the Society of Gynecologic Oncologists Annual Meeting, February, 1985.
² Address reprint requests to: GOG Statistical Office, RPMI Apartments, 666 Elm Street, Buffalo, N.Y. 14263.

0090-8258/86 $1.50 Copyright © 1986 by Academic Press, Inc. All rights of reproduction in any form reserved.

INTRODUCTION

When and how often a clinical trial is analyzed is as important as the use of appropriate methodology. The purpose of the present paper is to discuss some of the lesser understood and frequently misapplied aspects of the timing and interpretation of clinical trials analysis, using as examples various studies conducted by the Gynecologic Oncology Group (GOG) in recent years. Much of the data employed comes from unpublished preliminary analyses of various protocols. These analyses were not published and should not have been. Their use in this presentation is intended solely to exemplify the problems that could have arisen had they erroneously been disseminated. Other results are from appropriately conducted final analyses of several GOG studies and will be referenced accordingly.

MATERIAL AND METHODS

Tumor response, grading of toxicity, progression-free interval, and survival are defined according to GOG criteria [1]. Standard statistical techniques were
employed. Pearson's χ² test of contingency tables was used to examine differences in response rates. Survival, progression-free interval, and response duration were compared using the Breslow test [2].

PREMATURE ANALYSIS

A substantial number of patients is required to conduct most phase III studies. Results based upon the early stages of accrual should not be considered representative of the final results. Many patients featured in such early analyses are still on study, and hence their final outcome is subject to change. That is, further response may be noted or additional toxic effects might occur. Moreover, the overall results are subject to considerable fluctuation as additional patients are entered.

A dramatic example of the erroneous suggestion of early analysis can be seen if premature analysis is retrospectively made of the data in GOG Protocol No. 42, "Treatment of Recurrent or Advanced Uterine Sarcoma: A Randomized Comparison of Adriamycin versus Adriamycin plus Cyclophosphamide." Figures 1-4 depict a comparison by treatment of the duration of progression-free interval based upon patient data at 18, 24, 30, and 36 months after study activation, respectively. Figures 1-3 are unpublished, in-house results prepared in the course of monitoring the study; treatment comparisons were not appropriate at those times and were not attempted. Note that a casual reader could easily, and erroneously, infer a likely therapeutic effect from Fig. 1. The appropriate final analysis
FIG. 1. GOG protocol No. 42. Progression-free interval by treatment 18 months after activation of study. [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. Values: 7, 18, 25, 3.1 and 15, 18, 33, 7.6.]
FIG. 2. GOG protocol No. 42. Progression-free interval by treatment 24 months after activation of study. [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. Values: 10, 23, 33, 3.9 and 15, 22, 37, 7.3.]
is reflected in Fig. 4, which shows no apparent difference between the regimens (P = 0.22) [3].

In a similar fashion, early results can inaccurately estimate the incidence of toxicity. GOG Protocol No. 48 featured a randomized comparison of Adriamycin (A) versus Adriamycin plus cyclophosphamide (AC) in patients with advanced endometrial carcinoma after hormonal failure. Toxicity on this study was routinely monitored throughout. Eighteen months after the study was initiated, no cases of grade 4 leukopenia had been observed among the first 20 patients evaluated
FIG. 3. GOG protocol No. 42. Progression-free interval by treatment 30 months after activation of study. [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. Values as extracted: 16, 31, 47, 5.0 and 31, 17, 40, 7.3.]
FIG. 4. GOG protocol No. 42. Progression-free interval by treatment 36 months after activation of study. [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. Values: 15, 38, 53, 5.0 and 12, 43, 55, 4.9.]
on the AC regimen. However, as Table 1 clearly demonstrates, the true incidence of this particular adverse effect was somewhat higher when additional patients were entered and additional follow-up was obtained.

TABLE 1
GOG Protocol No. 48. Percentage Experiencing Grade 4 Leukopenia at Various Timepoints Following Activation (11/19/79)

Date examined    No. receiving ACᵃ    No. with grade 4 leukopenia    %
May, 1981         20                   0                              0.0
May, 1982         51                   2                              3.9
May, 1983         80                   6                              7.5
May, 1984        108                  11                             10.2

ᵃ AC = Adriamycin plus cyclophosphamide.

In GOG Protocol No. 43, "A Randomized Comparison of cis-Platinum 50 mg/m² versus cis-Platinum 100 mg/m² versus cis-Platinum 20 mg/m²/day × 5 in the Treatment of Patients with Advanced Carcinoma of the Cervix," a difference of 15% in objective response rate was sought. Early results indicated that a substantial difference favoring the 100 mg/m² regimen was possible. However, this was not verified as the study matured (see Table 2). Instead, the final conclusion was that a 15% difference could not be documented, although an 11% difference between the 50 mg/m² and 100 mg/m² regimens was ultimately confirmed [4].

TABLE 2
GOG Protocol No. 43. Response Rates for Each Regimen Yearly Following Activation (11/27/78)

                         Regimen
Date examined     50 mg/m²    100 mg/m²    20 mg/m²/day × 5
November, 1979    20.7%       42.4%        23.3%
November, 1980    20.3%       29.2%        28.0%
November, 1981    23.0%       27.5%        24.0%
November, 1982    20.5%       31.0%        25.0%
November, 1983    20.7%       31.8%        25.0%

Premature presentation of data can not only depict the results erroneously, but can also induce considerable bias toward or against particular regimens. Resulting problems can include improper randomization of patients with subsequent withdrawal. Additionally, a lack of enthusiasm may result which jeopardizes the chance of fulfilling study objectives. Carried to the extreme, early termination
of the study can occur. In this instance, therapeutic differences may go unproven. Moreover, unproven differences may be widely publicized.

THE NEGATIVE STUDY
If a randomized clinical trial is conducted and a significant treatment difference is not demonstrated, the study is sometimes stated to be a negative study. This does not mean that no therapeutic difference exists. Rather, a difference of the magnitude considered clinically significant when the study was designed does not exist. The existence of a therapeutic difference of a lesser magnitude cannot be determined, since this was not part of the original study design and thus insufficient patients were accrued to test such a hypothesis. Consequently, the negative implication of the study must be stated in terms of the original goal. Hence, a negative study does not proclaim treatment equivalence, but merely fails to confirm a therapeutic difference initially thought clinically meaningful and possible to detect.

Table 3 depicts the number of patients required to detect therapeutic differences of various magnitudes if the response rate for the standard has been documented to be 20%. Assuming a one-sided test is warranted and the probabilities of false-positive and false-negative errors are 0.05 and 0.20, respectively, a substantially larger number is required to detect smaller differences.
TABLE 3
Number of Patients Required to Detect Therapeutic Differences of Various Magnitudes; α = 0.05, β = 0.20; One-Sided Test

Difference sought    Number of patients required per regimen
10%                  249
15%                  121
20%                   73
25%                   49
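The relationship in Table 3 between the difference sought and the number of patients can be sketched with a standard normal-approximation formula. The source does not state which method produced Table 3, so the formula below (pooled variance under the null hypothesis, with a continuity correction) is an assumption, and the function name `patients_per_regimen` is hypothetical. Under that assumption the results should land within a few patients of the tabulated values rather than match them exactly.

```python
import math
from statistics import NormalDist


def patients_per_regimen(p_standard, difference, alpha=0.05, beta=0.20):
    """Approximate patients per regimen for a one-sided comparison
    of two response rates (assumed formula, not the paper's own)."""
    p_new = p_standard + difference
    p_bar = (p_standard + p_new) / 2

    # One-sided normal quantiles for the false-positive and
    # false-negative error probabilities.
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)

    # Uncorrected sample size from the normal approximation.
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_standard * (1 - p_standard) + p_new * (1 - p_new))
    n = (z_alpha * null_sd + z_beta * alt_sd) ** 2 / difference ** 2

    # Continuity correction applied to the uncorrected size.
    n_corrected = n / 4 * (1 + math.sqrt(1 + 4 / (n * difference))) ** 2
    return math.ceil(n_corrected)


# Standard response rate of 20%, as assumed for Table 3.
for difference in (0.10, 0.15, 0.20, 0.25):
    print(f"{difference:.0%} difference: "
          f"{patients_per_regimen(0.20, difference)} patients per regimen")
```

Halving the difference sought roughly quadruples the required accrual, which is why a study sized for a 20% difference has too few patients to test a 10% difference.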
TABLE 4
GOG Protocol No. 48. Response by Therapy

Therapy    No. entered    No. responses    % Responses
Aᵃ          97            22               22%
ACᵇ        105            34               32%

ᵃ A = Adriamycin. ᵇ AC = Adriamycin plus cyclophosphamide.
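As an illustration of the mechanics of Pearson's χ² test named in the methods, the counts in Table 4 can be arranged as a 2 × 2 contingency table of responders and non-responders by therapy. This is a minimal sketch: the function name is hypothetical, no continuity correction is applied (the source does not say whether one was used), and the P value printed is two-sided. How far any such statistic can be interpreted still depends on the difference the study was designed to detect.

```python
import math


def pearson_chi_square_2x2(responses_a, entered_a, responses_b, entered_b):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) and two-sided P value for two response rates."""
    a, b = responses_a, entered_a - responses_a   # regimen A: responders, non-responders
    c, d = responses_b, entered_b - responses_b   # regimen B: responders, non-responders
    n = entered_a + entered_b

    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from Table 4: A = 22 of 97, AC = 34 of 105.
chi2, p_value = pearson_chi_square_2x2(22, 97, 34, 105)
print(f"chi-square = {chi2:.2f}, two-sided P = {p_value:.3f}")
```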
GOG Protocol No. 48 was designed to seek a therapeutic difference of 20% in response rates. Recent analysis disclosed an observed difference of 10% (22 versus 32%) in overall response rates [6] (Table 4). This may well reflect a true difference of this magnitude. However, since this study was not designed to detect differences less than 20%, it is not possible to address the significance of the observed difference. We can only conclude that a 20% difference does not exist.

TIME VARIABLE CURVES

Graphs depicting treatment comparisons of survival, progression-free interval, and response duration are frequently misinterpreted due to the "tails" of the curves. These segments are generally based upon few patients and consequently are subject to considerable fluctuation. Quite often investigators read too much into the "tails" and diminish the impact of the overall comparison being illustrated. It should be noted that the purpose of the analytical comparison is to examine the appropriate parameter for all patients treated, not just those with long-term follow-up. Consequently, all conclusions must be based upon all data and not merely the "tails."

It is not uncommon for curves displaying no difference to be visibly distinct in the "tails." However, this does not represent a delayed therapeutic advantage. Figure 5 depicts a comparison of response duration for two of the regimens of GOG Protocol No. 43, which was described earlier. Although the curves appear to diverge beyond 6 months, there is not a significant difference (P > 0.5). Median response duration for both regimens is 4.8 months. Moreover, of 63 patients featured in this analysis, only 22 were observed longer than 6 months; only 10 were observed longer than 1 year.

Format is also of interest in presenting such data. For example, if the same data were displayed as shown in Fig. 6, depicting only the first 6 months rather than 24 months, no question would arise.
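The instability of the "tails" can be made concrete with a product-limit (Kaplan-Meier) estimate, a common way of drawing such curves; the source does not name the estimator behind its figures, so its use here is an assumption. The twelve follow-up times below are hypothetical, invented purely for illustration, and `kaplan_meier` is a hypothetical helper. Each failure lowers the curve by the current estimate divided by the number of patients still at risk, so a single late failure moves the curve far more than an early one.

```python
def kaplan_meier(times, failed):
    """Product-limit estimate. Returns (time, number at risk, estimate)
    for each distinct time at which at least one failure occurs."""
    observations = sorted(zip(times, failed))
    at_risk = len(observations)
    estimate = 1.0
    curve = []
    i = 0
    while i < len(observations):
        t = observations[i][0]
        failures = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(observations) and observations[i][0] == t:
            failures += int(observations[i][1])
            leaving += 1
            i += 1
        if failures:
            estimate *= 1 - failures / at_risk
            curve.append((t, at_risk, estimate))
        at_risk -= leaving
    return curve


# Hypothetical data: months of follow-up; 1 = failure, 0 = still failure-free.
times = [1, 2, 2, 3, 4, 4, 5, 6, 8, 11, 15, 20]
failed = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

curve = kaplan_meier(times, failed)
for t, at_risk, estimate in curve:
    print(f"month {t:2d}: {at_risk:2d} at risk, estimate {estimate:.3f}")

# Change only the longest-followed patient from failure-free to failure.
altered = failed[:-1] + [1]
altered_curve = kaplan_meier(times, altered)
print("final estimate, original:", round(curve[-1][2], 3))
print("final estimate, one case altered:", round(altered_curve[-1][2], 3))
```

In this sketch the first failure is taken with twelve patients at risk and the failure at month 11 with only three, so the later step is visibly larger; altering the single longest-followed case changes where the curve ends without changing anything about the first eleven patients.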
Conversely, graphs depicting highly significant differences can cross in the "tails." In this situation, true differences must not be discounted, as this phenomenon can occur due to as few as one late-occurring failure! GOG Protocol No. 47, "A Phase III Randomized Study of Adriamycin Plus Cyclophosphamide (AC) versus Adriamycin plus Cyclophosphamide plus Cis-Platinum (PAC) in Patients with Advanced Ovarian Carcinoma: Suboptimal Stage III, Stage IV,
FIG. 5. GOG protocol No. 43. Response duration by therapy (plotted to 24 months). [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. 50 mg/m²: 9, 22, 31, 4.0. 20 mg/m²/day: 3, 29, 32, 4.0.]
and Recurrent," detected statistically significant differences favoring PAC in response, response duration, progression-free interval in all patients, and survival in patients with measurable disease [7]. Figure 7 depicts duration of progression-free interval by therapy for patients with measurable disease. This highly significant difference (P < 0.0001) is not surprising in light of the other results just mentioned. However, note in Fig. 8 that the curves have been altered and now cross. In such situations it has been incorrectly argued that the treatment difference is artificial since ultimately both curves wind up at the same place. To see the fallacy of such arguments, consider that the only difference in the analyses that
FIG. 6. GOG protocol No. 43. Response duration by therapy (plotted to 6 months). [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. 50 mg/m²: 9, 22, 31, 4.0. 20 mg/m²/day: 3, 29, 32, 4.0.]
FIG. 7. GOG protocol No. 47. Progression-free interval by treatment. Measurable disease. [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. AC: 8, 113, 121, 7.3. PAC: 17, 90, 107, 13.2.]
produced Figs. 7 and 8 is that the status of the last observed case on the PAC regimen was artificially changed from progression-free to failure. That is, one failure has caused the PAC regimen to cross the AC and drop to 0! Of 228 patients featured in this analysis only 29 were observed beyond 2 years; only 10 were observed beyond 3 years.

CONCLUSIONS
To avoid inaccurate estimates and potential bias, premature analysis and presentation should be avoided. Toxicity and safety monitoring should and can be
FIG. 8. GOG protocol No. 47. Progression-free interval by treatment. Measurable disease; one case altered (*). [Plot not reproduced. Horizontal axis: months from onset of protocol. Legend columns, per treatment (Rx): progression-free, failures, total, median. AC: 8, 113, 121, 7.3. PAC: 16*, 91*, 107, 13.2.]
accomplished without disseminating inconclusive results. When presenting studies which do not depict significant differences, an effort must be made not to generalize beyond the scope of the original design. Lastly, in examining graphs of time variables, undue emphasis must not be given to the "tails" of the curves.

REFERENCES

1. Thigpen, T., Shingleton, H., Homesley, H., Lagasse, L., and Blessing, J. cis-Dichlorodiammineplatinum (II) in the treatment of gynecologic malignancies: Phase II trials by the Gynecologic Oncology Group, Cancer Treat. Rep. 63, 1549-1555 (1979).
2. Breslow, N. A generalized Kruskal-Wallis test for comparing k samples subject to unequal patterns of censorship, Biometrika 57, 579 (1970).
3. Muss, H. B., Bundy, B., DiSaia, P. J., Homesley, H. D., Fowler, W. C., Jr., Creasman, W., and Yordan, E. Treatment of recurrent or advanced uterine sarcoma: A randomized trial of doxorubicin versus doxorubicin and cyclophosphamide (A phase III trial of the Gynecologic Oncology Group), Cancer 55, 1648-1653 (1985).
4. Bonomi, P. Personal communication.
5. Peto, R., Pike, M. C., Armitage, P., Breslow, N. E., Cox, D. R., Howard, S. V., Mantel, N., McPherson, K., Peto, J., and Smith, P. G. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design, Brit. J. Cancer 34, 585-612 (1976).
6. Thigpen, T., Blessing, J., DiSaia, P., and Ehrlich, C., for the Gynecologic Oncology Group. A randomized comparison of adriamycin with or without cyclophosphamide in the treatment of advanced or recurrent endometrial carcinoma, Proceedings of ASCO, March, 1985.
7. Omura, G. Personal communication.