Workshop 8: Data analysis


Robert C. Elston, Chairperson, and Richard R. Evans III, Cochairperson

MEMBERS: William C. Blackwelder, C. Ralph Buncher, Janet D. Elashoff, and Mary F. Johnson

I. General principles for the analysis of clinical trials
   A. Statistical methods appropriate to the measures of outcome
   B. Problems of multiple comparisons
   C. Comparing like with like
   D. Including appropriate and complete information in the analysis
   E. Results to report
II. Analysis for particular designs
   A. Parallel or matched designs
   B. Crossover designs
   C. Serial measures
III. Summary

GENERAL PRINCIPLES FOR THE ANALYSIS OF CLINICAL TRIALS

This workshop addresses a number of topics of general relevance in the analysis of data from randomized controlled clinical trials, as well as some of particular importance in trials of NBAAD. The treatment of any one subject is necessarily brief, with the emphasis on the identification of important issues and considerations. Merely identifying an important issue should be sufficient motivation for an investigator to seek statistical assistance with the analysis; professional consultation under such circumstances is highly recommended. A few general references may be particularly helpful in further exploration of the situations mentioned below.1-4

Statistical methods appropriate to the measures of outcome

In the comparison of treatment groups, parametric methods may be suitable for continuous measures of outcome, such as FEV1, FVC, and other measures of pulmonary function. Depending on the distribution of the variable of interest, a logarithmic or other transformation may be appropriate before the groups are compared. In particular, quantitative data can be transformed to their ranks; this leads to nonparametric methods,5,6 such as the Wilcoxon rank-sum test. Such methods have traditionally been used less frequently than parametric methods, but they may be more useful because they are appropriate both for continuous variables and for discrete variables such as scores relating to disease status. Even if the variable of interest is distributed normally, there is little loss of efficiency when the rank-sum test is used to compare two groups.
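
As a brief illustration of this choice, the following Python sketch (not part of the original workshop report; the FEV1 values are invented) compares two treatment groups both parametrically, on a log scale, and nonparametrically with the Wilcoxon rank-sum test.

```python
# A minimal sketch with hypothetical data: parametric comparison on a log
# scale versus the nonparametric Wilcoxon rank-sum test.
import numpy as np
from scipy import stats

fev1_drug = np.array([2.1, 2.4, 1.9, 2.8, 2.2, 2.6])      # hypothetical FEV1 (L)
fev1_placebo = np.array([1.8, 2.0, 1.7, 2.3, 1.9, 2.1])

# Parametric comparison after a logarithmic transformation
t_stat, p_param = stats.ttest_ind(np.log(fev1_drug), np.log(fev1_placebo))

# Nonparametric comparison based on ranks (Wilcoxon rank-sum test)
w_stat, p_rank = stats.ranksums(fev1_drug, fev1_placebo)

print(f"t test on log scale: p = {p_param:.3f}")
print(f"Wilcoxon rank-sum:   p = {p_rank:.3f}")
```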


Special methods are available for dealing with problems such as the analysis of time to some event, ordered categorical variables,8 and variables measured at repeated points in time (this last situation is considered in further detail later). Whatever method of analysis is used, a necessary step is to determine that the data are compatible with the assumptions, such as normality and homoscedasticity, underlying that method of analysis.
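
A minimal sketch of such a compatibility check, using the Shapiro-Wilk test of normality and the Levene test of equal variances; the data below are invented for illustration.

```python
# A minimal sketch: checking normality and homoscedasticity before
# relying on a parametric analysis.
import numpy as np
from scipy import stats

group_a = np.array([2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.3, 2.0])
group_b = np.array([1.8, 2.0, 1.7, 2.3, 1.9, 2.1, 2.2, 1.6])

# Shapiro-Wilk test of normality within each group
for name, g in [("A", group_a), ("B", group_b)]:
    stat, p = stats.shapiro(g)
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Levene test of equal variances (homoscedasticity)
stat, p = stats.levene(group_a, group_b)
print(f"Levene test of equal variances: p = {p:.3f}")
```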

Problems of multiple comparisons

In many trials it will be necessary to pay some attention to the problems involved in drawing inferences when many comparisons are made, such as comparisons at various points in time for monitoring purposes, comparisons based on various outcome measures, or comparisons in several groups of patients. It is especially important in such cases not to place undue importance on observed significance levels (p values) for individual comparisons, such as those based on subgroups of patients. If multiple comparisons are made, there is an increased probability of finding "significant" differences by chance alone when there are actually no treatment differences.9,10 A simple adjustment based on the well-known Bonferroni inequality is to multiply the observed p value for a single comparison by the number of comparisons made. If the result is significant, then it will still be significant if a more exact procedure is used. In some situations, however, this procedure will be overly conservative. The problem can be largely avoided by identifying a few specific key hypotheses before the trial and by limiting significance testing to them.11 A way to deal with the situation in which there are multiple outcomes, such as pulmonary function tests or symptom and drug scores, is to perform a multivariate analysis first.12 Here it is determined whether there is any significant effect on any of the outcomes, allowing for the fact that many outcomes are being examined. Only if this overall test shows a significant effect are the individual outcomes then examined univariately. However, care should be taken not to include in this multivariate analysis many outcomes that are a priori unlikely to show any difference, because the inclusion of "noise" variables automatically weakens the power of any multivariate test.
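
The Bonferroni adjustment described above can be done by hand; the short Python sketch below (with hypothetical p values) simply multiplies each p value by the number of comparisons and caps the result at 1.

```python
# A minimal sketch: Bonferroni adjustment of several hypothetical p values.
p_values = [0.012, 0.300, 0.048, 0.740]   # illustrative only
m = len(p_values)

# Multiply each observed p value by the number of comparisons (capped at 1)
p_adjusted = [min(1.0, p * m) for p in p_values]

for p, p_adj in zip(p_values, p_adjusted):
    flag = "significant at 0.05" if p_adj < 0.05 else "not significant"
    print(f"raw p = {p:.3f} -> Bonferroni-adjusted p = {p_adj:.3f} ({flag})")
```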


Another approach that can be used is first to combine the outcome measures into one or a few principal components. The first principal component is determined as that linear function of the measures that accounts for most of the variability in the entire data set. The second is that linear function, uncorrelated with the first, that accounts for most of the remaining variability. Subsequent principal components, always uncorrelated with the previous ones, account for even smaller proportions of the variability. The first or first few principal components are then analyzed as though they were the original outcome measures. Other methods of analysis that take account of the same outcome measured at multiple points in time will be considered later.
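
A minimal numpy sketch of this principal-components approach, using simulated data; in practice the component would be computed from the standardized trial outcomes (e.g., FEV1, FVC, PEFR).

```python
# A minimal sketch with simulated outcome measures: extract the first
# principal component so it can be analyzed as a single summary outcome.
import numpy as np

rng = np.random.default_rng(0)
# Rows are patients; columns are outcome measures, simulated and correlated.
X = rng.normal(size=(30, 3))
X[:, 1] += 0.8 * X[:, 0]
X[:, 2] += 0.5 * X[:, 0]

Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each measure
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1_scores = Xc @ eigvecs[:, 0]      # first principal component, one score per patient
explained = eigvals[0] / eigvals.sum()
print(f"First principal component explains {explained:.1%} of the total variability")
```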

Comparing like with like

It will usually be important to consider numerous patient characteristics (age, sex, race, socioeconomic status, status at entry into the study, and whether the patient's disease is seasonal or perennial) in the interpretation of clinical trial data. It is important to allow for variables that are associated with the outcome of treatment, even if the treatment groups appear similar with respect to these variables. Appropriate statistical tests for interaction should be performed to evaluate whether the influences of these patient characteristics on outcome are additive. When sufficient data are available, it can be very valuable to examine measures of treatment outcome according to a cross-tabulation of relevant patient characteristics. This analysis is especially important if any interaction is present. It may be important for exploratory purposes to examine subgroups of patients, both separately and in overall analyses that take strata (i.e., the subgrouping) into account.

Regression techniques, both linear and nonlinear, are often helpful in allowing for effects of continuous variables on outcome and in allowing for numerous variables simultaneously. Thus baseline values of pulmonary function and measures of compliance are usually best included in the analysis as covariates. If a principal component derived from all the measures of compliance is included as a covariate, for example, the resulting analysis tells (1) to what extent the outcome measures are determined by compliance, and (2) what the differences in outcome measures are predicted to be if all comparison groups have the same degree of compliance. However, if the drug that is being evaluated causes poor compliance, it is inappropriate to adjust for compliance in this manner. Baseline values measured before patients are randomized to treatments cannot be affected by the subsequent treatment, so it is always statistically better to include these in the analysis as covariates rather than, for example, expressing pulmonary function measures as percentages of baseline values. If the variables are log transformed, then expressing pulmonary function as a percentage of baseline is equivalent to analyzing as the dependent variable: log (pulmonary function measure during the study) minus log (baseline pulmonary function measure). Including the baseline value as a covariate, on the other hand, is tantamount to analyzing as the dependent variable: log (pulmonary function measure during the study) minus b log (baseline pulmonary function measure), where the regression coefficient b is estimated from the data rather than assumed to equal unity. In this way the data determine how best to allow for the fact that individuals in the various comparison groups have different baseline values. Because these regression analyses make the assumption of parallel regression lines in the various groups, this assumption must be tested to ensure that the analysis is valid.
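
A minimal sketch of the two analyses contrasted above, assuming (hypothetically) a data frame with columns fev1_base, fev1_post, and group; the covariate model estimates the baseline coefficient b, whereas the percentage-of-baseline (log-ratio) model forces it to equal 1.

```python
# A minimal sketch with invented data; column names are assumptions for
# illustration, not from the original report.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "fev1_base": [1.9, 2.2, 2.0, 1.8, 2.4, 2.1, 1.7, 2.3],
    "fev1_post": [2.3, 2.5, 2.1, 1.9, 2.9, 2.6, 1.8, 2.4],
    "group":     ["drug", "drug", "drug", "drug",
                  "placebo", "placebo", "placebo", "placebo"],
})
df["log_post"] = np.log(df["fev1_post"])
df["log_base"] = np.log(df["fev1_base"])

# Percentage-of-baseline analysis: the baseline coefficient is forced to be 1
df["log_ratio"] = df["log_post"] - df["log_base"]
fit_ratio = smf.ols("log_ratio ~ C(group)", data=df).fit()

# Covariate analysis: the baseline coefficient b is estimated from the data
fit_ancova = smf.ols("log_post ~ log_base + C(group)", data=df).fit()

# The parallel-slopes assumption can be examined with a group x baseline term
fit_interaction = smf.ols("log_post ~ log_base * C(group)", data=df).fit()

print(fit_ancova.params)
```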

Including appropriate and complete information in the analysis

An important general principle is to include in the analysis all randomized patients. The theory that allows one to draw inferences from clinical trial data is based on samples of patients allocated at random to treatment groups, and it is the comparison of the groups as randomized that is thus justified. Complete adherence to this principle means, for example, that a patient randomized to receive treatment A should be included in the "treatment A" group for analysis, even if the patient in fact received treatment B by mistake. But this is a situation in which there is some question as to whether the general principle should be adhered to strictly. A frequent and important argument for deleting patient data from the analysis is that some patients did not receive the treatment intended (because of intentional noncompliance, toxicity, illness, or death) and that the investigators are interested in comparing groups of patients who actually received the intended treatments. A counterargument, however, is that the reasons for noncompliance may be related to treatment, and omission of such patients may thus bias the comparison. A compromise that is often helpful is to present the data from both points of view. Another procedure is to include all the patients, but with appropriate measures of compliance, withdrawal, and so forth included as covariates in the analysis. For this method to be valid, the covariates that are included must not be variables that are differentially affected by the treatments.

A related issue is the problem of missing data, especially loss to follow-up because of dropouts. The investigator should make a careful determination as to whether the dropout is related to the treatment under study. If at the end of the study there is no substantial imbalance between the treatment groups in the number of patients who withdrew for reasons not related to treatment, such patients can be excluded from the analysis of the final results without introducing bias. Alternatively, for studies with multiple assessment points, data from such patients can be included in the analysis of variables recorded until the point of withdrawal. If there is a substantial difference between treatment groups in the number of patients withdrawn for reasons not related to treatment, the reasons for withdrawal should again be reviewed to assure that some aspect of the treatment is not in some way a cause for the withdrawals. Also, even if there is no imbalance between treatment groups, a substantial number of dropouts for any reason may raise questions concerning the design and conduct of the study.

Results to report

A typical drug trial will report whether there is a significant difference between treatment groups with respect to the most important outcome measures in relation to a prespecified level of significance. Results should also include the actual p values, especially if the differences are "significant," because this additional information allows the reader to gauge the strength of the findings. In the case of nonsignificant results, the power of the statistical tests to detect clinically important differences should also be stated. It is sometimes of interest to determine how large a sample would be needed for a particular observed difference to be found significant, if in fact it is a true difference and not merely a chance effect. Whether the patients receiving the drug are compared with a placebo group or an active control group, mean group differences and their confidence intervals should be reported for the most important outcome measures. This informs the reader how large or how small the true group differences could be and still be reasonably compatible with the study findings. If both active control and placebo groups are included in the trial, these should be compared with each other. Reporting confidence intervals is especially useful if there is no placebo group and if the purpose of the trial is to demonstrate that the new drug is as effective as an active control.13
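
A minimal sketch (with invented outcome values) of reporting a mean group difference with its 95% confidence interval alongside the corresponding t test p value.

```python
# A minimal sketch: 95% confidence interval for the difference in mean
# outcome between two groups, based on the pooled variance.
import numpy as np
from scipy import stats

drug = np.array([2.4, 2.7, 2.2, 2.9, 2.5, 2.8, 2.3])
control = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 2.5, 1.9])

n1, n2 = len(drug), len(control)
diff = drug.mean() - control.mean()
sp2 = ((n1 - 1) * drug.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

t_stat, p_value = stats.ttest_ind(drug, control)
print(f"mean difference = {diff:.2f}, "
      f"95% CI ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}), "
      f"p = {p_value:.3f}")
```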

ANALYSIS FOR PARTICULAR DESIGNS

Typically, the long-term repetitive dose studies of NBAAD are conducted in patients with chronic asthma undergoing a randomized, double-blind, parallel, placebo- or active-controlled study with a 2-week baseline (washout) period followed by 8 to 12 weeks of treatment. Primary efficacy variables include pulmonary function tests, patients' daily symptom scores, multiple daily PEFR measures, concomitant medication use, and prevention of subsequent asthmatic episodes. Emphasis is often placed on subjective assessment of changes in the severity of symptoms such as wheezing, breathlessness, chest tightness, and cough, as well as on more objective improvement in pulmonary function variables such as FEV1, FVC, PEFR, and FEF25-75. Thus analysis and interpretation of results from these long-term trials are often encumbered by many outcome variables as well as by the frequent follow-up evaluations obtained from patient diary information and pulmonary function tests.

In what follows, particular designs are analyzed for the case of a single outcome measure and no covariates. It will thereby be simpler to see how the experimental design determines the type of analysis required. It should be understood, however, that in practice there are multiple measures and covariates, and hence more sophisticated extensions of the indicated analyses are desirable.

Parallel or matched designs

The most straightforward designs are those known as parallel or matched designs. In these designs a single patient takes only one medication, so the analysis is not confounded by the effects of more than one drug. Either a single patient is matched with another single patient, or a group of persons receiving the test medication is compared with a group of control persons. In the one-for-one matching experiment, the model (and therefore the actual randomization) is a coin-tossing procedure. The 2n individuals to be studied are grouped into n pairs of individuals who are very similar. The results of the trial may be measured as a continuous variable (e.g., FEV1 or another measure of pulmonary function) or as a qualitative variable (e.g., success or failure, or improved versus not improved). Consider qualitative results, such as which member of the pair had the better outcome, when drug A is compared with drug B. The null hypothesis of no treatment difference states that the only reason one group may differ from the other is the random coin tossing that decided which member of the pair received the test drug. Therefore, one uses the binomial distribution, with the probability of success in a single trial equal to 0.5, to test whether the results of the experiment are within the realm of statistical chance. In the case of two groups of n individuals each with a continuous measurement, the statistical analysis might be a pooled-variance t test. This can be used whether all of the patients are known before the beginning of the study or whether they are accumulated as the study goes on. In the latter case, the random code just stipulates that the next patient who enters the trial will be given the treatment dictated by the random code.
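
A minimal sketch of the two analyses just described, with hypothetical numbers: the binomial (sign) test for a one-for-one matched design, and a pooled-variance t test for two parallel groups with a continuous outcome. (scipy.stats.binomtest requires SciPy 1.7 or later; older versions provide binom_test.)

```python
# A minimal sketch with invented data for a matched design and a parallel design.
import numpy as np
from scipy import stats

# Matched pairs: suppose drug A had the better outcome in 14 of 20 pairs.
n_pairs, a_better = 20, 14
binom_result = stats.binomtest(a_better, n=n_pairs, p=0.5)
print(f"Binomial (sign) test: p = {binom_result.pvalue:.3f}")

# Two parallel groups with a continuous outcome (e.g., FEV1), pooled variance
fev1_a = np.array([2.4, 2.7, 2.2, 2.9, 2.5, 2.8])
fev1_b = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 2.5])
t_stat, p_value = stats.ttest_ind(fev1_a, fev1_b, equal_var=True)
print(f"Pooled-variance t test: t = {t_stat:.2f}, p = {p_value:.3f}")
```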


TABLE I. Analysis of a two-period crossover design with continuous outcome variables

Summary of information            Differences                          Totals
                                  Group 1        Group 2               Group 1        Group 2
  Number                          n1             n2                    n1             n2
  Mean                            mean(d1)       mean(d2)              mean(t1)       mean(t2)
  Standard deviation              sd(d1)         sd(d2)                sd(t1)         sd(t2)
  Pooled standard deviation*      s_d                                  s_t

Tests                                                   t statistic                                            df
  1. Test of residual effects                           [mean(t1) - mean(t2)] / [s_t sqrt(1/n1 + 1/n2)]        n1 + n2 - 2
  2. Test of period effects with residual
     effects confounded                                 [mean(d1) - mean(d2)] / [s_d sqrt(1/n1 + 1/n2)]        n1 + n2 - 2
  3. Test of treatment effect (no residual effects)     [mean(d1) + mean(d2)] / [s_d sqrt(1/n1 + 1/n2)]        n1 + n2 - 2
  4. Test of treatment effect if residual
     effects present†                                   [mean(XA1) - mean(XB2)] / [s_x sqrt(1/n1 + 1/n2)]      n1 + n2 - 2

*Computed as the square root of the pooled variance, i.e., s_d^2 = [(n1 - 1)sd(d1)^2 + (n2 - 1)sd(d2)^2]/(n1 + n2 - 2) and s_t^2 = [(n1 - 1)sd(t1)^2 + (n2 - 1)sd(t2)^2]/(n1 + n2 - 2).
†s_x is computed by pooling the standard deviation of the responses to A in group 1 and the standard deviation of the responses to B in group 2, analogously.

The two-group parallel design can also be used with subgroups, or strata, and is a form of compromise between the strictly paired study and the strictly pooled study. In this case the treatments are compared within each stratum and then the results are pooled across strata in an analysis of variance. The underlying model and specific sources of variation are determined by the particular experimental design. In the matched experiment it is also possible to use two or more controls for every test subject. The appropriate analyses are then somewhat more complicated.14,15 In clinical trials in which randomization is used and all participants must give their formal consent, there is little advantage to such designs.

Crossover designs

The simplest crossover design is one that compares two treatments and in which each subject receives both treatment A and treatment B; half the subjects receive A first and then B, and the other half receive B first and then A. An important part of the analysis is to assess whether there are in fact period, carryover, or residual effects, although a crossover design should only be used in those cases in which such effects are believed to be absent.16 Specifically, the goal is to learn whether the response to treatment A given in the first period is the same as the response to treatment A given in the second period after treatment B. If the first drug has not washed out of the patient's system before the second drug is given, if it produces a long-term change in the patient's status, or if the end of its dosing creates a rebound effect, then a carryover or residual effect exists.


Table I illustrates the basic steps in the analysis of a crossover design in which the outcome variable is continuous (FEV1 and average weekly symptom scores are examples) and can be assumed, after transformation if necessary, to have a normal distribution. Group 1 consists of all subjects who received A first and B second, and group 2 consists of all subjects who received B first and A second. Let XA1i be the response to treatment A in the ith subject in group 1 and XB1i be the response to treatment B in the same subject, and let XA2i and XB2i be the responses to treatments A and B in the ith subject in group 2. For each subject, compute the difference (d) between the response to A and the response to B as d1i = XA1i - XB1i for subjects in group 1 and d2i = XA2i - XB2i for subjects in group 2. Also, compute the total (t) of the responses to treatments A and B for each subject as t1i = XA1i + XB1i and t2i = XA2i + XB2i. Then compute the mean and standard deviation of the differences and the totals for each group. To test whether there is any difference in the residual effects of treatments A and B, a two-sample t test between group 1 and group 2 totals is computed (test 1 in Table I). A two-sample t test comparing group 1 and group 2 differences (test 2 in Table I) provides a test of period effects confounded with any residual effects; that is, it tests whether there is any difference in the treatment comparison depending on which drug was given first. If neither test suggests significant residual effects, the investigator may test for differences between treatments A and B by the paired t test (test 3 in Table I). If there are significant residual effects, treatments A and B can be validly compared only by restricting attention to the first treatment given (test 4 in Table I). If the response variable is not normally distributed, or is categorical or dichotomous, the tests follow the same logic but rank tests or contingency table tests17,18 are substituted for the t tests. If baseline responses before or between treatments are available, period and carryover effects may be easier to assess. Refer to Klein et al.19 for an illustration of carryover effects seen in a crossover study and to Koch et al.20 for an illustration of the methods of analysis appropriate when baseline measures are available.
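
A minimal sketch of the four tests of Table I computed from simulated per-subject data; the group sizes, means, and use of equal-variance t tests (mirroring the pooled standard deviations in the table) are invented for illustration.

```python
# A minimal sketch with simulated data: group 1 received A then B,
# group 2 received B then A.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2 = 12, 12
xa1, xb1 = rng.normal(2.5, 0.3, n1), rng.normal(2.3, 0.3, n1)   # group 1 responses
xa2, xb2 = rng.normal(2.5, 0.3, n2), rng.normal(2.3, 0.3, n2)   # group 2 responses

d1, d2 = xa1 - xb1, xa2 - xb2      # within-subject differences (A minus B)
t1, t2 = xa1 + xb1, xa2 + xb2      # within-subject totals

def report(label, res):
    # res is the result of a two-sample t test (statistic, p value)
    print(f"{label}: t = {res.statistic:.2f}, p = {res.pvalue:.3f}")

report("test 1, residual effects ", stats.ttest_ind(t1, t2, equal_var=True))
report("test 2, period effects   ", stats.ttest_ind(d1, d2, equal_var=True))
# Test 3: d1 and -d2 both estimate the A-minus-B effect when there are no
# residual effects, so their pooled comparison tests the treatment effect.
report("test 3, treatment effect ", stats.ttest_ind(d1, -d2, equal_var=True))
# Test 4: first-period data only (A in group 1 versus B in group 2)
report("test 4, first period only", stats.ttest_ind(xa1, xb2, equal_var=True))
```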


Serial measures

In a long-term trial, symptom severity may be evaluated daily and pulmonary function tests may be recorded at baseline, at weekly intervals for the first few weeks, and then perhaps biweekly thereafter. At each follow-up time the FEV1 may be recorded before dosing and then at intervals for the next 4 to 6 hours (depending on the half-life and dosing interval of the drug). With this design, there may be interest in how well the short-term acute effects of a drug on pulmonary function are maintained over time. Statistically, one must address the clinical question: do patients develop a tolerance to the acute drug effect after long-term repetitive dosing? The analysis directed at this question should take advantage of the repeated-measures aspect of the design. There are a number of analytic options available to handle this type of hypothesis, and because each method has its own advantages and disadvantages under certain conditions, no single technique can be put forth as optimal.

A simple approach that is often applied to these data is referred to as "slope analysis." Least-squares regression lines are fitted to each patient's data for a given variable across visits. With use of a one-sample t test, the average estimated slope within each treatment group is tested against a mean of zero (negative slopes implying tachyphylaxis). Likewise, with a two-sample t test, the average slope can be compared with the corresponding estimate for a concurrent control group to determine whether the rate of change in response over time differs between treatment groups. Nonparametric counterparts to the t test can also be applied. This analysis is a useful exploratory approach, but it has severe drawbacks, in that (1) it is relatively insensitive to anything but gross and sustained shifts in pulmonary performance or other response measures over time; and (2) it assumes that the pattern of response is linear over time, thus imposing a linear relationship between duration of therapy (or cumulative dose) and response.
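
A minimal sketch of this slope analysis with simulated diary data: an ordinary least-squares slope is fitted to each patient's serial values, and the mean slopes are examined with one-sample and two-sample t tests.

```python
# A minimal sketch: per-patient least-squares slopes over follow-up visits,
# simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
weeks = np.arange(12)                      # follow-up visits (weeks 0-11)

def patient_slopes(n_patients, drift):
    """Fit an OLS slope to each patient's series of visit values."""
    slopes = []
    for _ in range(n_patients):
        y = 2.5 + drift * weeks + rng.normal(0, 0.2, len(weeks))
        slope, intercept = np.polyfit(weeks, y, deg=1)
        slopes.append(slope)
    return np.array(slopes)

drug_slopes = patient_slopes(20, drift=-0.01)      # slight loss of effect over time
control_slopes = patient_slopes(20, drift=0.0)

# One-sample t test: is the mean slope in the drug group different from zero?
print(stats.ttest_1samp(drug_slopes, popmean=0.0))
# Two-sample t test: does the rate of change differ between groups?
print(stats.ttest_ind(drug_slopes, control_slopes, equal_var=True))
```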


TABLE II. Repeated-measures analysis of variance for n patients in each of p treatment groups measured at each of q visits

Source of variation                         Degrees of freedom
Among subjects                              np - 1
  Treatments                                p - 1
  Subjects within treatments                p(n - 1)
Within subjects                             np(q - 1)
  Visits                                    q - 1
  Treatments × visits                       (p - 1)(q - 1)
  Visits × subjects within treatments       p(n - 1)(q - 1)

As an alternative approach, one might consider a study of this type as a two-factor experiment with repeated measures on one factor.21 For example, in a study with p treatments, q visits, and n patients per treatment group, the total variation would be partitioned as in Table II. A test of the main effect for treatments would indicate whether the treatment groups differ with respect to response evaluations averaged across follow-up visits. A test of the treatments × visits interaction term would address treatment differences in the pattern of response over time. Of course, one must check that the data satisfy the assumptions of such a model, including the assumptions that the q × q variance-covariance matrices are homogeneous over treatment groups and that the pattern of elements in the common matrix takes a certain form (compound symmetry). This type of analysis should only be considered when a high percentage of patients has a complete set of follow-up information. A graph of average values of efficacy variables, as well as relevant laboratory data, in each treatment group at each follow-up visit would supplement any repeated-measures analysis and highlight increasing or decreasing trends in response (or toxicity) with chronic drug use. Analysis of treatment effects at each follow-up visit, or graphic displays of visit-by-visit results, should be based on all patients with data at each visit as well as on the subset of patients who complete the study, to determine the impact of dropouts on conclusions regarding efficacy and drug tolerance.
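
A minimal numpy sketch of the partition in Table II for a balanced, complete layout (p groups, n patients per group, q visits); the data are simulated, and dropouts or unequal group sizes would require more elaborate methods than shown here.

```python
# A minimal sketch: sums-of-squares partition of Table II for an array y
# of shape (p, n, q), simulated and assumed complete (no dropouts).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p, n, q = 2, 10, 4
y = 2.5 + rng.normal(0, 0.3, size=(p, n, q))
y[1] += 0.2                                   # hypothetical treatment effect

gm = y.mean()
ss_total = ((y - gm) ** 2).sum()

subj_means = y.mean(axis=2)                    # shape (p, n)
treat_means = y.mean(axis=(1, 2))              # shape (p,)
visit_means = y.mean(axis=(0, 1))              # shape (q,)
cell_means = y.mean(axis=1)                    # shape (p, q)

ss_between_subj = q * ((subj_means - gm) ** 2).sum()
ss_treat = n * q * ((treat_means - gm) ** 2).sum()
ss_subj_within = ss_between_subj - ss_treat

ss_within_subj = ss_total - ss_between_subj
ss_visits = n * p * ((visit_means - gm) ** 2).sum()
ss_tv = n * ((cell_means - gm) ** 2).sum() - ss_treat - ss_visits
ss_error = ss_within_subj - ss_visits - ss_tv   # visits x subjects within treatments

# F tests corresponding to Table II
f_treat = (ss_treat / (p - 1)) / (ss_subj_within / (p * (n - 1)))
f_tv = (ss_tv / ((p - 1) * (q - 1))) / (ss_error / (p * (n - 1) * (q - 1)))
print("treatments:          F =", round(f_treat, 2),
      " p =", round(stats.f.sf(f_treat, p - 1, p * (n - 1)), 3))
print("treatments x visits: F =", round(f_tv, 2),
      " p =", round(stats.f.sf(f_tv, (p - 1) * (q - 1), p * (n - 1) * (q - 1)), 3))
```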


Another clinical objective of a long-term antiasthmatic drug trial may be to assess the degree of protection that a new drug provides in preventing the onset of asthmatic episodes; that is, in addition to alleviating the symptoms of asthma and improving pulmonary function, the drug may be investigated for its prophylactic effects relative to a placebo or active control. The type of data collected to substantiate the latter claim should include measures of asthma attack frequency derived from patient diary forms. These might include the biweekly total (or daily average) number of sudden coughing attacks or the number of isolated attacks of bronchospasm per month. Other indirect measures would include the number of days missed from work or school, the number of sleep disturbances, the number of rescue medication doses, or the number of hospitalizations caused by asthma during the test period. For these variables, statistical tests would compare treatment groups during each follow-up interval of the test period with respect to the change from baseline. To investigate a prophylactic claim, the analysis should address the extent to which chronic drug therapy is capable of maintaining a reduced level of symptom severity as well as a reduced incidence of asthmatic attacks over the course of a long-term trial. A visit-by-visit graphic or tabular display and a treatment comparison of average attack rates during each follow-up interval would therefore be informative.

INFORMATION FOR FUTURE DESIGNS

A complete and thorough analysis of the data from a given clinical trial can answer more questions than were originally asked, and the answers to these questions can be used in the design and analysis of future trials. What is the best way to combine the various outcome measures to maximize the differences among the drugs? Which measures are most important, and which are superfluous? How frequently should the measures be taken? What are the best transformations to use to obtain normally distributed outcome indices, and which covariates are important to include in the analyses? Although some of these questions can be answered at the time a particular drug is evaluated, a more powerful and efficient evaluation would be possible if the answers preceded the study.

SUMMARY

It is clear that the analysis of data from trials to evaluate NBAAD can be performed at various levels of sophistication. A variety of statistical methods may be equally valid for a given set of data, but whatever method is used should be justified in terms of the clinical questions being addressed and the correctness of the statistical assumptions underlying that method. The most powerful analysis will almost certainly require specialized statistical expertise; only simple guidelines have been presented here.

In choosing a method of data analysis for a particular drug trial, consideration should be given to (1) the design of the trial, (2) the relevant concomitant variables, and (3) the specific hypotheses the trial was designed to test. All methods of analysis used should be justified in terms of the clinical questions being asked and the statistical assumptions underlying the methods. The results reported should include p values for the statistical tests performed and confidence intervals for estimates of differences between treatment groups. The power of the statistical tests performed should be stated if the results are nonsignificant. After the questions a particular drug trial was designed to investigate are answered, further detailed exploratory analyses can be performed to guide the design and analysis of future trials.

REFERENCES

1. Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. Littleton, Mass: John Wright PSG, 1983.
2. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585-612. II. Analysis and examples. Br J Cancer 1977;35:1-39.
3. Shapiro SH, Louis TA, eds. Clinical trials: issues and approaches. New York: Marcel Dekker, 1983.
4. Tygstrup N, Lachin JM, Juhl E, eds. The randomized clinical trial and therapeutic decisions. New York: Marcel Dekker, 1982.
5. Lehmann EL. Nonparametrics: statistical methods based on ranks. San Francisco: Holden-Day, 1975.
6. Siegel S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956.
7. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley & Sons, 1981.
8. Moses LE, Emerson JD, Hosseini H. Analyzing data from ordered categories. N Engl J Med 1984;311:442-8.
9. Miller RG. Simultaneous statistical inference. New York: McGraw-Hill, 1966.
10. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc A 1969;132:235-44.
11. Dunnett C, Goldsmith C. When and how to do multiple comparisons. In: Buncher CR, Tsay J-Y, eds. Statistics in the pharmaceutical industry. New York: Marcel Dekker, 1981:397-433.
12. Morrison DF. Multivariate statistical methods. 2nd ed. New York: McGraw-Hill, 1967.
13. Blackwelder WC. "Proving the null hypothesis" in clinical trials. Controlled Clin Trials 1982;3:345-53.
14. Schlesselman JJ. Case-control studies: design, conduct, analysis. New York: Oxford University Press, 1982.
15. Miettinen OS. Individual matching with multiple controls in the case of all-or-none responses. Biometrics 1969;25:339-55.
16. Brown BW. The crossover experiment for clinical trials. Biometrics 1980;36:69-79.
17. Koch GG. The use of non-parametric methods in the statistical analysis of the two-period change-over design. Biometrics 1972;28:577-84.
18. Dunsmore IR. Analysis of preferences in crossover designs. Biometrics 1981;37:575-8.
19. Klein R, Waldman D, Kershnar H, et al. Treatment of chronic childhood asthma with beclomethasone dipropionate aerosol. I. A double-blind crossover trial in nonsteroid-dependent patients. Pediatrics 1977;60:7-13.
20. Koch GG, Elashoff JD, Amara IA. Repeated measurements studies: design and analysis. In: Johnson NL, Kotz S, eds. Encyclopedia of statistical sciences. New York: John Wiley & Sons (in press).
21. Winer BJ. Statistical principles of experimental design. 2nd ed. New York: McGraw-Hill, 1962:514-39.