The American Journal of Medicine
DECEMBER 1973, VOLUME 55, NUMBER 6

EDITORIALS

Statistical Problems in Clinical Trials: The UGDP Study Revisited

STANLEY S. SCHOR, Ph.D.
Philadelphia, Pennsylvania

From the Department of Biometrics, Temple University Medical School, Philadelphia, Pennsylvania. Requests for reprints should be addressed to Dr. Stanley S. Schor. Manuscript accepted April 19, 1973.

After the University Group Diabetes Program (UGDP) first reported its mortality results [1] in November 1970, both the statistical and clinical analyses became the subject of considerable controversy. In September 1971, the Journal of the American Medical Association published a critical statistical review [2] of the results, together with new data and a response [3] provided by one of the UGDP’s advisory statisticians. Despite this apparent rebuttal to the criticism, the basic controversy has continued to smolder without resolution.

The issues involved in this controversy are of fundamental importance to contemporary physicians, regardless of the particular problems involved in the UGDP project. With the increasing use of large-scale, multiclinic trials of therapeutic agents, physicians will be increasingly confronted with the problem of evaluating the results of such studies. These evaluations will require an appraisal of the design and analysis of complex investigative activities. For this reason, a recapitulation of some of the arguments and counterarguments in the UGDP controversy may help lead to better understanding of the problems involved in this particular investigation of diabetes mellitus, and in analogous therapeutic trials in the future.

The most controversial conclusion in the UGDP study was the investigators’ decision to discontinue tolbutamide therapy because it was associated with apparently excessive cardiovascular mortality. The validity of this conclusion will be the main topic discussed here.
BASE LINE FACTORS AND DEFINITION OF THE RESEARCH OBJECTIVE

The purpose of the UGDP study was to determine the comparative efficacy of various forms of treatment in preventing vascular complications in patients with recent maturity-onset diabetes mellitus. The purpose was not to determine if more cardiovascular deaths occur in one treatment group than in another. Since death was not initially planned as an “end point,” no data were collected for important base line characteristics that might affect cardiovascular mortality. Such characteristics included cigarette smoking, major disease other than diabetes, familial longevity and differences in duration of diabetes before diagnostic detection. If there were more heavy cigarette smokers, more people with major diseases other than diabetes and more people with greater duration of diabetes in the group receiving tolbutamide therapy (tolbutamide group) than in the group receiving placebo therapy (placebo group), this disparity could account for the excess mortality observed in the tolbutamide group. This excess mortality may have been associated with an excess in the combination of all base line risk factors, many of which were omitted in the UGDP analysis. This possibility is not adequately considered in the UGDP contention that even large differences in smoking habits between the tolbutamide and placebo groups (a single factor only) would not be likely to explain the greater differences in cardiovascular mortality.

DEFINITION OF THE TARGET POPULATION
The target population in a research study consists of all subjects about whom conclusions are to be drawn. In order to be admitted into the study each patient was supposed to be healthy enough at time of entry to be expected to live another 5 to 10 years. From the base line characteristics of people who died (the only group for which the UGDP supplied detailed information), several people who entered this study did not meet the criteria for “healthiness.” Although these violations can be detected in the base line details in the fatal cases, the fatalities provide “numerators” but not “denominators.” To determine the frequency with which the admission criteria were either ignored or overruled would require the examination of detailed data for all patients, not just for those who died. How many people were improperly accepted into each of the treatment groups and how many of these people survived? The UGDP’s noncompliance with its own admission criteria was defended by the claim that the criticism was not appropriate because the two patients described actually survived for 5 years before death. The basic question remains unanswered.
DEFINITION OF THE STATISTICAL UNIT
The statistical unit in a research project is the item to be counted or measured. In the UGDP study, the rate of ascertainment of this unit (death due to cardiovascular disease) was different in each treatment group because hospitalization and/or autopsy results were required for ascertainment. This method of ascertainment can lead to a bias in results. If autopsy is performed more frequently among those who died in one treatment group than in another, the rate of cardiovascular death may be higher in the first group merely because the cardiovascular lesions are more likely to be found at autopsy [4]. Moreover, if a larger proportion of those who died are examined at autopsy in one treatment group than in another, cardiovascular deaths are more likely to be detected in that group. Since in the cohort of patients under investigation there were more cardiovascular deaths in one treatment group than in the others, it is important to determine whether the excess deaths in that group could be attributed to an excess proportion of people in that group with cardiovascular risk factors or to an already demonstrated excess proportion subjected to autopsy or hospitalization. In response to this problem the UGDP stated that differences in hospitalization rates and in autopsy rates among the four treatment groups had a probability of occurrence of greater than 0.1 by chance alone. This response does not seem relevant to the main issue, which has nothing to do with drawing inferences about a target population, and which is not answered by a calculation of probabilities.

CHANGES IN DEFINITION
Certain base line characteristics which were defined before the UGDP study began, and which were maintained during the first few years of the investigation, were later redefined after the data were collected and analyzed. Such a change occurred in the definition of major electrocardiographic abnormality. Under the original definition [5], 25 per cent of the tolbutamide group but only 15 per cent of the placebo group had this abnormality. Under the “final” version of the published definition, the corresponding percentages were reduced to 4 and 3 per cent. The explanation for this statistical alteration of base line characteristics was that the revised definition contained abnormalities believed to be most closely related to prognosis. Why the later definition, if valid, was not used originally, or why the revised definition is medically any better than the original one for patients with diabetes, is not explained.

ADMINISTRATION OF TREATMENT
Physicians treating diabetic patients usually watch the blood sugar levels to determine whether the treatment should be changed or even discontinued. The various UGDP treatments were not tested in a manner consistent with this customary clinical procedure. All patients received the same fixed dose of tolbutamide for the entire duration of the study. This undesirable therapeutic practice was defended by a statement that “despite the virtually identical base line characteristics on blood glucose level control, they (the tolbutamide and standard insulin groups) differed widely in mortality experience.” This response does not deal with the problem posed.

SELECTION OF THE SAMPLE

Base Line Differences. A basic requirement of the sampling procedure in a trial is that the random allocation of patients to the various treatments should produce groups which are proportionally similar in at least certain important characteristics. This equality of proportions did not occur in the UGDP treatment groups. As noted in my earlier paper [2], the tolbutamide group contained a preponderance of patients with all risk factors for cardiovascular disease save one (hypertension). The “rebuttal” to this point actually served to confirm it. As shown in Table I of Professor Cornfield’s paper [3], there was a total of 306 risk factors among the 185 patients in the placebo group, and 362 risk factors among the 189 patients in the tolbutamide group. The tolbutamide group thus had a larger number of risk factors than the placebo group, a difference that is statistically significant at the 5 per cent level by the “t” test (mean difference = 0.27; t = 2.19; p < 0.05). The statistical significance of the difference was not mentioned in the UGDP “rebuttal,” and its importance was downgraded with the remark that its magnitude is “only one fourth of a risk factor” per patient.
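The per-patient figures behind this comparison can be checked from the two totals alone. The following is a minimal sketch, not part of the original analysis; the function and variable names are illustrative assumptions, and only the four counts are taken from the text. The “t” statistic itself cannot be reproduced this way, because it depends on how the risk factors were distributed among individual patients, which the totals do not show.

```python
# Counts quoted in the text from Table I of reference [3].
# All names below are illustrative; they do not come from the UGDP reports.
PLACEBO_RISK_FACTORS, PLACEBO_PATIENTS = 306, 185
TOLBUTAMIDE_RISK_FACTORS, TOLBUTAMIDE_PATIENTS = 362, 189


def mean_per_patient(total_risk_factors, patients):
    """Average number of base line risk factors carried by each patient."""
    return total_risk_factors / patients


placebo_mean = mean_per_patient(PLACEBO_RISK_FACTORS, PLACEBO_PATIENTS)
tolbutamide_mean = mean_per_patient(TOLBUTAMIDE_RISK_FACTORS, TOLBUTAMIDE_PATIENTS)
mean_difference = tolbutamide_mean - placebo_mean

print(f"placebo:     {placebo_mean:.2f} risk factors per patient")
print(f"tolbutamide: {tolbutamide_mean:.2f} risk factors per patient")
print(f"difference:  {mean_difference:.2f} risk factors per patient")
```

With the totals as quoted, the means come to about 1.65 and 1.92 risk factors per patient, a difference of roughly one fourth of a risk factor.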
One might wonder whether a base line excess of 62 risk factors might not readily account for the excess of 16 cardiovascular deaths (26 for tolbutamide and 10 for placebo) in the tolbutamide group. Another analysis of the same data discloses that the percentage of people with three or more risk factors was much higher in the tolbutamide group than in the placebo group (35 per cent versus 21 per cent). This difference is not only statistically significant at the 5 per cent level (using Chi-square) but also, if Dr. Jeremiah Stamler’s hypothesis [6] is correct concerning the synergistic effect of multiple risk factors, the risk of cardiovascular death was substantially greater in the tolbutamide group than in the placebo group. These major, statistically significant differences in the base line risk factors between the tolbutamide and placebo groups do not seem compatible with the contention that “all in all the luck of the draw does not seem to have been too bad” [3].

Differences in Long-term Patients. The UGDP group also concluded that the excess cardiovascular mortality in the tolbutamide group did not begin to appear until 4 years after treatment was initiated. Because of this temporal distinction, it becomes important to examine the differences in the base line characteristics for the “long-term patients” who were followed for at least 4 years. The pertinent data, not published in the original UGDP report, were presented in Table II of Dr. Cornfield’s paper, which shows the base line characteristics for the patients who were admitted before October 7, 1965, and who were thus available for at least 4 years of follow-up. The results demonstrate that the “long-term” tolbutamide treated group was more prone to the development of cardiovascular disease than the corresponding placebo group. For all but one cardiovascular risk factor, the proportion of people with the risk factor was higher in the tolbutamide group than in the placebo group. A cholesterol value of greater than 300 mg/100 ml was found in 15.5 per cent of the tolbutamide group compared with 8.8 per cent in the placebo group. For “significant electrocardiographic abnormalities,” the corresponding percentages were 4.1 and 3.1 per cent; for history of angina pectoris, 7.2 and 4.5 per cent; for history of digitalis, 7.8 and 4.5 per cent; and for abnormal fasting blood glucose, 71.6 and 63.8 per cent.
For relative body weight greater than 1.25, the values were 58.9 per cent for the tolbutamide group and 53.2 per cent for the placebo group; for arterial calcification, 19.3 and 14.1 per cent; for age over 55, 48.2 and 42.3 per cent; for males, 31.5 and 30.8 per cent; and for whites, 52.8 and 50.2 per cent. The only risk factor that was distributed in the opposite direction was hypertension (29.5 per cent for the tolbutamide group and 37.1 per cent for the placebo group). It was claimed that the excesses in all these risk factors in the tolbutamide groups would be balanced by the greater deficit in the single factor of hypertension. This claim depends on the medically untenable assumption that the risk for cardiovascular death in a person with abnormal blood pressure is the same as in a patient with a significant electrocardiographic abnormality, a history of the use of digitalis and/or a history of angina pectoris.

Technics of Adjustment. When a randomization procedure yields disparities in important base line characteristics, the investigator can either (1) discard the entire experiment and start again or (2) try to correct for these base line differences with various adjustment technics. When such adjustments are used, the reader must be made aware of the caveats of each technic. The UGDP analysts used three methods of adjustment, each of which requires close scrutiny.

The first adjustment was performed by expressing the fatality rate according to the presence or absence of one “risk factor.” The adjustments were then extended to the presence or absence of two “risk factors” and sometimes three. Even when three characteristics are “adjusted,” however, there still remain the other four risk factors that are more prevalent in the tolbutamide group and that could account for the excess cardiovascular deaths.

A second attempt at adjustment involved the application of a multiple logistic regression procedure. In the regression technic, it is assumed that the “shape” of the relationship between the dependent variable (survival) and each of the 17 independent variables (base line characteristics and treatments) is similar. Because of this probably erroneous assumption, some of the effects (the net regression coefficients) are bound to be overstated and some understated. Evidence that such peculiarities occurred in the UGDP analyses is provided by some of the strange medical relationships produced in the regression procedure. According to the negative signs found for the net regression coefficients, the lower the diastolic blood pressure or the lower the weight, the greater was the probability of death when the effects of the 16 other variables were held constant. The “explanation” for this peculiar result for blood pressure was that since the algebraic sign of the coefficient for systolic pressure was positive, the combined effect of systolic and diastolic pressure would give a more medically meaningful result than diastolic pressure alone. This “explanation” does not explain why either measurement should produce a bizarre coefficient. The model also does not take into account the possible synergistic effects of combination of risk factors mentioned previously.

A third method of adjustment used in the UGDP
analyses was to cross-classify the patients into two groups: one with none of five base line cardiovascular risk factors and the other with at least one of these five factors. When the cardiovascular death rates among those who received tolbutamide versus those who received placebos were compared within these two groups, the excess deaths in the tolbutamide group persisted. After it was pointed out that two possibly important cardiovascular risk factors were omitted from the five used in the analysis, the UGDP workers constructed a new table based on seven rather than five cardiovascular risk factors [3]. The differences in cardiovascular mortality between the placebo and tolbutamide groups persisted, even among patients with none of the seven risk factors, although the numbers were small and the differences were not statistically significant. Regardless of the statistical significance, the excess of deaths in the tolbutamide group within this seven factor “low risk” group is an important finding. The finding is not consistent, however, with the results obtained in the “low risk” group delineated from the same data by the logistic regression form of adjustment. The contradiction can be noted in Professor Cornfield’s Table IX, which cites the number of deaths observed for each treatment, with the patients divided into categories according to the probabilities of cardiovascular death that were calculated from the logistic function. The “low risk” group, in which the probability of cardiovascular death was smallest, consisted of two categories for which the calculated probabilities were under 0.014. In this logistically calculated “low risk” group the cardiovascular death rates were the same in both placebo and tolbutamide groups. No explanation was provided for this striking inconsistency in results when two different adjustment procedures were applied to the same data. 
The “low risk group” could be delineated either by the absence of seven cardiovascular risk factors or by the logistic regression function. In the first type of delineation, the cardiovascular death rates were higher in the tolbutamide group than in the placebo group; in the second type of delineation, the rates were the same in both groups. One possible explanation for the discrepancy is an ineffective selection of variables for either or both forms of analysis. Other possibilities are that the risk factors were chosen incorrectly and that the logistic adjustment was either inappropriate or calculated erroneously. The main point is that a major inconsistency in the results remains unexplained.

The Extraordinary Good Health of the Placebo Group. Another important demonstration of the validity of a controlled clinical trial is that the outcome in the placebo group, which received no active therapy, particularly for mortality, resembles that in the untreated groups. In the UGDP study the patients in the placebo group were in extraordinary cardiovascular health in that none of the patients in this group died of myocardial infarction. The explanation offered for this remarkable event was that all the patients admitted to the study were particularly healthy in terms of cardiovascular risk factors. This explanation is refuted by the data presented in Table II of the Cornfield report, which shows the diverse proportions of cardiovascular risk factors and deaths due to myocardial infarction in various treatment groups. Since only the placebo group enjoyed this special type of “immunity” to death by myocardial infarction, an inevitable bias exists in all comparisons of cardiovascular deaths between this placebo group and any other treatment group.

COLLECTION OF THE DATA
The Pooling of “Early” and “Late” Clinics. In any multiclinic study the question of pooling data among clinics is always perplexing. To validate such pooling it must be determined that the protocol was carried out in essentially the same manner in all clinics and that the patients were admitted under approximately the same circumstances. Inadvertent errors that can create systematic bias may arise because of changes in procedures or technics concomitant with secular change in any study performed over a long period of time. To test for the latter possibility, many statisticians will contrast the results obtained in the first secular half of the study with those of the second half. In the UGDP project, the original seven clinics used four treatment groups; the next five clinics, which joined the study subsequently, used five treatment groups. This division of clinics provides a convenient demarcation for testing for systematic error. If no systematic error is present, it would be expected that the results in the first seven clinics would be similar to those in the second five clinics. Nevertheless, the noted differences in mortality among treatment groups in the first seven clinics did not occur among corresponding groups in the last five clinics. In addition, the overt differences noted in cardiovascular risk factors in the first seven clinics were not so readily apparent in the last five clinics. These data suggest that the major differences in mortality noted in the UGDP study were due to the base line differences in the first seven clinics. An adequate examination of this possibility has not yet been performed in the UGDP analysis.
Management of Transfers and Dropouts. Another problem in the UGDP analysis was the assessment of data for patients who transferred from one treatment group to another. The UGDP investigators compared the mortality experience in the original randomized groups, regardless of how long each patient received that particular treatment before being dropped or transferred. There are several alternate ways of dealing with this problem: (1) analyze the results only for people who remained in the same treatment group throughout the study, deleting results for dropouts and transfers; (2) look at what would have happened if the transfers had been in their final group throughout the study; and (3) determine the reasons for the transfers or dropouts and see how they affect the characteristics being measured. None of these alternative methods of analysis was employed. The data on all dropouts and transfers were analyzed as though they had remained in their original groups throughout the study. This procedure was defended with the claim that it is a “conservative” statistical practice, diluting whatever beneficial or adverse treatment effects are present.

USE OF PROBABILITIES

The kinds of probability values calculated by the UGDP investigators for differences in base line “risk factors” are necessary for drawing inferences about a target population on the basis of a sample of observations. These probabilities also shed light on the possible violations of the truly random allocation of subjects into treatment groups. These probabilities, however, cannot tell us whether the differences in base line characteristics affected the differences observed in a response variable in the same sample. Thus, a statistically insignificant difference in a base line characteristic could explain a difference in an observed cardiovascular death rate regardless of the particular p value that was calculated for either the base line characteristic or the death rate.
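The last point can be made concrete with invented numbers. The sketch below is an illustration only: the group sizes, the proportions of high-risk patients and the death rates are all hypothetical assumptions chosen for arithmetic convenience, not UGDP data, and the names are mine. The death rate within each risk stratum is set to be identical in both groups, so any difference in the crude death rates arises from the base line imbalance alone.

```python
# Hypothetical illustration only: none of these numbers come from the UGDP data.


def pearson_chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Tabulated critical value of chi-square at the 5 per cent level, 1 degree of freedom.
CHI2_CRITICAL_5_PERCENT_1_DF = 3.841

PATIENTS_PER_GROUP = 100
HIGH_RISK = {"drug": 35, "placebo": 25}    # assumed counts of high-risk patients
DEATH_RATE = {"high": 0.20, "low": 0.02}   # assumed; same in both groups (no drug effect)


def expected_deaths(high_risk_count, total):
    """Expected deaths when mortality depends only on the risk stratum."""
    low_risk_count = total - high_risk_count
    return high_risk_count * DEATH_RATE["high"] + low_risk_count * DEATH_RATE["low"]


baseline_chi2 = pearson_chi_square_2x2(
    HIGH_RISK["drug"], PATIENTS_PER_GROUP - HIGH_RISK["drug"],
    HIGH_RISK["placebo"], PATIENTS_PER_GROUP - HIGH_RISK["placebo"],
)
baseline_is_significant = baseline_chi2 > CHI2_CRITICAL_5_PERCENT_1_DF
deaths = {group: expected_deaths(count, PATIENTS_PER_GROUP)
          for group, count in HIGH_RISK.items()}

print(f"base line chi-square = {baseline_chi2:.2f}, "
      f"significant at 5 per cent: {baseline_is_significant}")
for group, count in deaths.items():
    print(f"{group}: {count:.1f} expected deaths per {PATIENTS_PER_GROUP} patients")
```

In this invented example the base line imbalance (35 versus 25 per cent high-risk patients) falls short of significance at the 5 per cent level, yet by itself it yields expected death rates of about 8.3 versus 6.5 per cent with no treatment effect at all.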
CONCLUSION

Because clinical medicine has now entered the era of multiclinic therapeutic trials, clinicians and biostatisticians will be increasingly confronted with evaluating the complex issues and data resulting from such trials. When well conducted and well analyzed, the trials can offer powerful leads toward the discovery of scientific truths. No unique guidelines have yet been developed, however, for ensuring that the trials will be well conducted and well analyzed. The disputes that arise about the limitation of a particular study, and the arguments that are offered pro and con, can therefore be helpful in eliminating the issues that require additional attention for better design and a clearer analysis of future trials. For this reason, the UGDP study, as a possible “prototype” of future problems in therapeutic trials, deserves continuing attention. The more clearly we begin to understand why the UGDP results have been so controversial, the more effectively will we be able to eliminate or solve those problems in the future.
REFERENCES

1. The University Group Diabetes Program: Mortality results. Diabetes 19 (suppl 2): 1, 1970.
2. Schor S: The University Group Diabetes Program: a statistician looks at the mortality results. JAMA 217: 1671, 1971.
3. Cornfield J: The University Group Diabetes Program: a further statistical analysis of the mortality findings. JAMA 217: 1676, 1971.
4. Beadenkopf WG, Polan AK, Marks RU: Some demographic characteristics of an autopsied population. J Chron Dis 19: 333, 1965.
5. Seltzer HS: A summary of criticisms of the findings and conclusions of the UGDP. Diabetes 21: 977, 1972.
6. Stamler J, Berkson DM, Lindberg H, Miller W, Hall Y, Mojonnier L, Levinson M, Cohen D, Young Q: Coronary risk factors. Med Clin N Amer 50: 229, 1966.