APPLIED STATISTICAL ANALYSIS

Multivariate Analysis and Repeated Measurements: A Primer

Thomas E. Rudy, John A. Kubinski, and J. Robert Boston
Most research designs in critical care medicine inherently are multivariate in that experimental manipulations are expected to produce changes on several dependent variables, including measurements repeated over time. Increased availability of easy-to-use computer programs has made it practical for investigators to benefit from the advantages of multivariate analyses, such as increased control of experiment-wise type I error rates and enhanced interpretation of treatment effects. Researchers need not understand the mathematical underpinnings of these analytic techniques to make good use of them, and it is the practical application of these statistical methods that is addressed in the present paper. The assumptions necessary for appropriate use of multivariate approaches, as well as discussion of the interpretations to be drawn from the information provided by computer programs, are presented. The multivariate analysis of variance (MANOVA) is discussed as an extension of its univariate counterpart, the analysis of variance (ANOVA), with the added advantage of assessing differential treatment effects simultaneously across multiple dependent measurements. Discriminant analysis, including examples of how to interpret discriminant weights and canonical loadings, is presented as a particularly useful method of providing an in-depth and unique representation of the results of a significant MANOVA. Finally, the utility of MANOVA is extended to repeated measurements experimental designs, including how to analyze and interpret designs that involve both between-subjects and within-subjects factors. Copyright © 1992 by W.B. Saunders Company
THE PRIMARY TASK of interpreting experimental research findings is to decide whether the manipulated or explanatory independent variables led to a significant change in the outcome or dependent variables. In the first paper in this series,1 we highlighted experimental design problems that can hinder the researcher's ability to conclude that the obtained results in a study were due to the experimental manipulations. Without careful attention to design issues, the internal validity of a study becomes suspect. The second paper in this series2 focused on the obstacles related to computing multiple statistical tests and how multiple tests can have an adverse impact on type I and type II decision errors. We described several statistical techniques that can be used to decrease type II errors (failure to reject the null hypothesis when it is false), which increases statistical power, while simultaneously controlling for type I errors (rejecting the null hypothesis when it is true).

In this paper, we will extend our discussion of statistical hypothesis testing to situations in which the researcher collects multiple dependent measurements or multiple, repeated measurements on the same variable at successive points in time. Rarely will a study be designed to compare different experimental groups on a single response variable. Frequently, theoretical considerations dictate that an experimental manipulation or treatment effect should impact several outcome or dependent variables. Additionally, in many areas of medical research, measurements are less than perfectly reliable; thus, multiple measurements are used to provide a more accurate evaluation of the treatment effects. Demonstrating significant differences on multiple measurements is viewed as providing a more valid assessment of experimental effects than showing change on a single measurement. In studies comparing different treatments for hypertension, for example, treatment effects on systolic blood pressure may be as important as treatment effects on diastolic blood pressure. For studies in which several dependent variables are of equal interest or importance, a multivariate analysis that tests for significant differences on multiple dependent variables simultaneously can have distinct advantages over a series of separate analyses, one for each variable.

Most research designs in critical care medicine are inherently multivariate in that experimental manipulations are expected to change more than one dependent measurement. In fact, it is common practice to evaluate changes on multiple measurements to demonstrate that the intended experimental protocol or procedures were executed successfully. Another common multivariate situation in critical care research involves collecting the same dependent measurement repeatedly throughout an experiment. Repeated measurements designs are particularly useful to evaluate how changes in the level or intensity of the independent variable influence the dependent variables.

Before we describe multivariate statistical testing methods, a word of caution is needed. The indiscriminate use of multiple dependent measurements is never an appropriate substitute for a well-conceived study with a carefully selected number of variables. Conducting a study that measures numerous variables related to one's area of interest is a poor substitute for a better-controlled design in which several reliable measurements are selected based on theory and previous research. The measurements for any study should be carefully selected to test the hypothesized experimental effects.

From the Departments of Anesthesiology/Critical Care Medicine and Electrical Engineering, University of Pittsburgh, Pittsburgh, PA. Supported in part by grant no. 2R01AR38698 from the National Institute of Arthritis and Musculoskeletal and Skin Diseases and grant no. 2R01DE07514 from the National Institute of Dental Research. Address reprint requests to Thomas E. Rudy, PhD, Pain Evaluation and Treatment Institute, Department of Anesthesiology/CCM, University of Pittsburgh Medical Center, Baum Blvd at Craig St, Pittsburgh, PA 15213. Copyright © 1992 by W.B. Saunders Company. 0883-9441/92/0701-0007$05.00/0

Journal of Critical Care, Vol 7, No 1 (March), 1992: pp 30-41

MULTIVARIATE ANALYSIS OF VARIANCE
Multivariate analysis of variance (MANOVA) is becoming an increasingly popular and useful statistical tool in medical research. Although the mathematical basis of MANOVA and related techniques was developed in the 1930s, only in recent years have these techniques been applied in medical and other areas of research. The increase in the use of multivariate procedures is largely due to the recent availability of easy-to-use computer programs, including some for microcomputers, that relieve the investigator of the tedious and complex calculations required by these procedures. However, a frequent problem that arises when medical investigators are interested in applying multivariate techniques to their data is that most multivariate statistical texts devote far more space to the mathematical underpinnings of these procedures than to their practical applications. It is our belief that to conduct multivariate analyses and apply them correctly to practical research situations you do not need to plow through long matrix equations, understand eigenvalues and eigenvectors, and so forth. Most of the statistical software packages that are widely available at academic computing centers have been tested extensively by statisticians and perform the correct calculations with a high degree of accuracy. What is important is for the applied researcher to know under what conditions the data collected in an experiment are appropriate for multivariate analyses and how to correctly interpret the information provided by these computer programs. Examples are provided throughout this paper to illustrate the practical application of several multivariate procedures and the usual types of printouts that result from multivariate software programs.

THE RELATIONSHIP BETWEEN ANALYSIS OF VARIANCE AND MULTIVARIATE ANALYSIS OF VARIANCE
Multivariate analysis of variance is conceptually a straightforward extension of the well-known univariate analysis of variance (ANOVA) technique, which was discussed in the second paper in this series.2 An ANOVA approach is used to test the null hypothesis that the means of k groups (where k represents the number of groups) are all equal to one another. When mean differences from only two groups are tested, an ANOVA approach produces results that are identical with those provided by a t test. Regardless of the number of groups, an ANOVA tests whether there are significant mean differences between any of the groups on a single dependent variable. In sum, the null hypothesis tested by ANOVA can be expressed as follows:

H0: M1 = M2 = ... = Mk

In the MANOVA situation, the null hypothesis tested is that the sample means of the k groups are equal to one another for each of the p dependent variables. The null hypothesis tested by MANOVA can be expressed as follows:

H0: M11 = M12 = ... = M1k
    M21 = M22 = ... = M2k
    ...
    Mp1 = Mp2 = ... = Mpk
The null hypothesis states that all groups (1 to k) have the same population mean for each variable (1 to p). The MANOVA null hypothesis is simply an extension of the ANOVA null hypothesis described above in that multiple ANOVA null hypotheses are tested together rather than separately.

THE ADVANTAGES OF MULTIVARIATE ANALYSIS OF VARIANCE
There are several important reasons to use MANOVA in studies designed to test for mean differences. First and foremost, the use of MANOVA allows researchers to examine the relationships among multiple measurements rather than considering each of them in isolation. MANOVA becomes a particularly useful data analytic technique when the investigator wants to evaluate mean differences on multiple dependent variables while simultaneously controlling for the intercorrelations among them. In this situation, MANOVA provides information that cannot be obtained from separate ANOVAs. Specifically, MANOVA analyses can be used to determine which dependent variables contribute most to group differences. These findings not only help clarify and enhance the interpretation of experimental effects, but also provide the investigator with important information that can be used to design future studies (eg, which measurements provide redundant information and can be eliminated). Thus, MANOVA reveals more about the data by examining the variables in some combination or pattern rather than individually.

Another important advantage is that in some situations a MANOVA approach provides a more powerful test than do separate ANOVAs.3 For example, a MANOVA test may find significant differences among the experimental groups even if none of the ANOVA tests that can be performed on the individual variables are statistically significant. However, there are some situations in which MANOVA is unnecessary or may even be inferior to separate ANOVAs. If all the dependent variables of interest are uncorrelated, there is little advantage to using MANOVA. Furthermore, MANOVA may not be as statistically powerful as separate ANOVAs. Adequate statistical power can generally be maintained when measurements are at least moderately correlated and the number of subjects per experimental group is two to three times larger than the number of dependent measurements.4

Finally, since most studies involve multiple dependent measurements, MANOVA can be used to help control the experiment-wise type I error rate, discussed in the second paper in this series.2 Even if the researcher is only interested in testing for significant differences on each variable individually, MANOVA can be used to control the overall alpha level at the desired level (usually 0.05). Thus, for example, a nonsignificant MANOVA for several dependent variables raises concerns about the appropriateness of conducting multiple ANOVAs to interpret group differences.

ASSUMPTIONS OF MULTIVARIATE ANALYSIS OF VARIANCE
As with all inferential statistical techniques, the mathematical underpinnings of MANOVA are based on a set of assumptions. The assumptions needed for MANOVA closely parallel those needed for ANOVA and can be summarized as follows:

1. The subjects have been randomly sampled from the population of interest.
2. The dependent measurements provided by each subject are statistically independent of the measurements obtained from other subjects.
3. The dependent measurements are continuous or interval-level scales (eg, age, systolic blood pressure).
4. The dependent measurements have a multivariate normal distribution. In practice, this implies that each variable follows a normal distribution, although univariate normality is necessary but not sufficient for multivariate normality.5
5. The k groups have a common within-group covariance matrix. This implies that (a) as in ANOVA, the homogeneity of variance assumption (ie, the groups being compared have the same variances) must be met for each dependent variable, and (b) the covariance (unstandardized correlation) between any two dependent variables is the same in all k groups.

With real data, these mathematical requirements are unlikely to be met precisely. As in
ANOVA, violating some assumptions does not necessarily invalidate the results, but violations of any of the first three assumptions listed above can produce highly misleading results. However, care in formulating a research design can assure that these three assumptions are met. Departures from multivariate normality generally have only a slight effect on type I error rates, although this condition in both MANOVA and ANOVA can lead to reduced statistical power. A possible solution for variables that display high nonnormality (eg, marked skewness) is to use transformations to achieve normality or at least to approximate it more closely. Common transformations for this purpose, such as taking the log or the reciprocal of the data, are described by Fleiss6 and by Armitage and Berry.7

The effect of violating the equality of covariance matrices assumption is a more complicated issue. When sample sizes are equal, MANOVA has been found to produce robust results under violations of this assumption unless sample sizes are small or the number of variables is large and the differences in the covariance matrices are large.8 As the sample sizes across experimental groups become increasingly discrepant, the negative impact of nonhomogeneous covariance matrices on the robustness of MANOVA becomes more pronounced, which results in reduced statistical power. Although a test of homogeneity of covariance matrices is widely available (Box's M), this test is generally not useful because the test itself is extremely sensitive to departures from normality.9 A simple and effective alternative is to have the software program, as part of computing a MANOVA, print the covariance matrices for each group and then compare the covariance values of each element across the experimental groups. If this visual inspection reveals highly discrepant values, expert statistical consultation should be obtained prior to accepting the results of the MANOVA.
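As a sketch of the visual inspection just described, the per-group covariance matrices can be printed for side-by-side comparison. The data and variable layout below are hypothetical, and numpy is assumed to be available:

```python
import numpy as np

# Hypothetical data: rows = subjects, columns = two dependent measurements
group1 = np.array([[3.0, 2.5], [4.1, 3.0], [2.8, 2.2], [3.5, 2.9]])
group2 = np.array([[5.2, 3.8], [6.0, 4.1], [5.5, 3.5], [4.9, 3.9]])

# np.cov expects variables in rows, so the data matrices are transposed;
# ddof=1 gives the usual unbiased sample covariance
for label, g in (("Group 1", group1), ("Group 2", group2)):
    print(label, "within-group covariance matrix:")
    print(np.cov(g.T, ddof=1))
```

Elements in the same position of the two printed matrices can then be compared for gross discrepancies, as suggested above.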
A NUMERICAL EXAMPLE OF MULTIVARIATE ANALYSIS OF VARIANCE
An example should help to illustrate how MANOVA can be applied to experimental data. In this hypothetical example, assume that we are interested in testing the increased efficacy of patient-controlled analgesia (PCA, only self-administered) versus PCA plus basal opioid infusion (PCA+BI) in a group of 20 patients following surgery. We might select meperidine as our analgesic agent and randomly assign 10 patients to each of the two experimental groups, PCA and PCA+BI. Further, assume that one of the investigator's primary hypotheses is that the PCA+BI treatment should lead to increased patient "comfort" when compared with the PCA-only condition. "Comfort" cannot be directly measured, but may be indirectly assessed by measuring several different variables. Thus, "comfort" is a multidimensional construct and requires multivariate operationalization. For example, the investigator might select pain intensity, sleepiness, nausea, and patients' beliefs about the adequacy of the pain relief received as indirect indicators of the patient's level of comfort. In this example, 11-point rating scales ranging from 0 to 10, with anchor descriptors for the extremes (eg, 0 = "no pain" and 10 = "very severe pain"), are created and administered to patients 10 hours after they begin using the PCA device. Because the person collecting the patients' ratings can potentially bias the obtained results, we probably would use a double-blinded study design. The obtained means and standard deviations for the four dependent measurements are displayed in the "descriptive statistics" section of Table 1.

The next section in Table 1, labeled "multivariate test statistics," presents the MANOVA results for these data. In most software packages, the MANOVA test can be obtained by entering several commands. One statement usually is required to specify which variable in the data set is to be considered the subject grouping or classification factor. In this example, a variable named GROUP was used: the first 10 patients were assigned the value of 1 to indicate that they received the PCA+BI treatment, and the remaining 10 were assigned the value of 2 (PCA only).
At least one additional statement usually is required to specify which variables in the data set should be used as dependent variables in the MANOVA analysis (in this example, the dependent variables were termed PAIN, SLEEPY, NAUSEA, and ADEQUATE).

Table 1. Multivariate Analysis of Variance Results

Descriptive Statistics
                                          Pain     Sleepy   Nausea   Adequate
Group 1 (n = 10; group label = PCA+BI)
  Mean                                    3.100    2.800    4.000    8.300
  SD                                      1.912    1.135    1.886    1.337
Group 2 (n = 10; group label = PCA only)
  Mean                                    5.500    3.600    7.100    6.400
  SD                                      2.321    1.838    1.663    2.221

Multivariate Test Statistics
Wilks' lambda = 0.344             F-statistic = 7.141   DF = 4, 15   Prob = 0.002
Pillai trace = 0.656              F-statistic = 7.141   DF = 4, 15   Prob = 0.002
Hotelling-Lawley trace = 1.904    F-statistic = 7.141   DF = 4, 15   Prob = 0.002

Univariate F Tests
Source        SS        DF    MS        F         P
Pain          28.800    1     28.800    6.369     .021
  Error       81.400    18    4.522
Sleepy        3.200     1     3.200     1.371     .257
  Error       42.000    18    2.333
Nausea        48.050    1     48.050    15.200    .001
  Error       56.900    18    3.161
Adequate      18.050    1     18.050    5.370     .032
  Error       60.500    18    3.361

Abbreviations: DF, degrees of freedom; Prob, probability level; SS, sums of squares; MS, mean sums of squares.

The three multivariate test statistics (labeled Wilks' lambda, Pillai trace, and Hotelling-Lawley trace) displayed in Table 1 may appear strange or unfamiliar to the reader. These are the three most widely used methods to test the multivariate null hypothesis. Each provides a different method with the same basic intent: to compute an approximation to the F statistic, the same statistic that is used in ANOVA. In this example, all three multivariate statistics have the same value for the F approximation. This is because the approximation is exact when there are only two groups (in this situation, some software packages will print Hotelling's T², which produces identical results). When the number of experimental groups is greater than two, the multivariate tests may produce different results, although the differences are usually small. Although most software packages print several multivariate F approximation tests, the Pillai-Bartlett trace offers the best control of type I error rates because it is more robust to violations of assumptions 4 and 5 described above.9

Returning to Table 1, the F statistic is 7.14 with 4 degrees of freedom in the numerator and 15 in the denominator, which is significant at P = .002. Thus, if we had set our alpha level at 0.05 for this experiment, we would reject the multivariate null hypothesis and conclude that
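All three test statistics in Table 1 are functions of the within-group and total sum-of-squares-and-cross-products (SSCP) matrices. As an illustration only, the following numpy sketch computes Wilks' lambda and its exact F transformation for the two-group case; the function names and data layout are ours, not those of any particular package:

```python
import numpy as np

def wilks_lambda(groups):
    """Wilks' lambda = det(W) / det(T), where W is the pooled
    within-group SSCP matrix and T is the total SSCP matrix.
    groups: list of (n_i x p) arrays of dependent measurements."""
    allobs = np.vstack(groups)
    centered_total = allobs - allobs.mean(axis=0)
    T = centered_total.T @ centered_total
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    return np.linalg.det(W) / np.linalg.det(T)

def exact_f_two_groups(lam, n_total, p):
    """When k = 2 the F approximation is exact:
    F = ((1 - lambda)/lambda) * df2/df1, df1 = p, df2 = n_total - p - 1."""
    df1, df2 = p, n_total - p - 1
    return ((1 - lam) / lam) * (df2 / df1), df1, df2
```

For the design in Table 1 (20 subjects, two groups, four measurements), the degrees of freedom work out to 4 and 15, matching those reported in the table.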
the mean values for the PCA+BI and PCA groups were significantly different on at least one of the four dependent measurements.

INTERPRETING A SIGNIFICANT MULTIVARIATE ANALYSIS OF VARIANCE
Once a significant overall MANOVA is found, the next step is to investigate which dependent measurements contributed most to the obtained difference or separation of the experimental groups. Additionally, similar to ANOVA, when there are more than two experimental groups, the investigator also must find the groups responsible for the significant difference found by the MANOVA.

Univariate F Tests
A common practice after a significant MANOVA has been found is to compute ANOVAs for each of the dependent variables. Each univariate F statistic that reaches the specified alpha level is considered significant and available for interpretation.10 The results of the univariate ANOVAs for our PCA example are presented in Table 1 under the section labeled "univariate F tests." These findings suggest that the two PCA groups had significantly different
means for the PAIN, NAUSEA, and ADEQUATE dependent measurements, but not for the SLEEPY measurement. It should be recognized, however, that although the univariate F tests are insensitive to the correlations among the variables, the F tests that result from the separate ANOVAs are not statistically independent because of correlations among the measurements. This condition is analogous to the test of non-orthogonal between-group contrasts in ANOVA. Because of the non-independence of separate ANOVAs, some statisticians (for example, Harris11) recommend using the Bonferroni procedure to adjust the alpha level for each ANOVA computed.

Discriminant Analysis

The use of separate univariate F tests ignores the relationships among the dependent variables. Thus, potentially useful information, such as suppressor effects or redundancies, cannot be determined by this approach. Discriminant analysis can be used to provide additional information concerning the contribution of individual dependent measurements to the overall difference among experimental groups. Discriminant analysis assigns weights to the dependent variables to find the linear combination that best separates the k groups by maximizing the between-group variance of the linear combinations relative to the within-group variance. This procedure produces s = min(p, k − 1) discriminant functions, some or all of which may be significant at the specified alpha level. Usually, however, it is the first discriminant function that contains most of the information about experimental group differences. When k = 2 or p = 1, interpreting multiple discriminant functions is not an issue since only one discriminant function is mathematically possible. Most software packages have options to produce discriminant functions as part of the MANOVA analysis.
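The maximization just described reduces to an eigen-analysis of W⁻¹B, where W and B are the within- and between-group SSCP matrices. A minimal numpy sketch, with a function name and data layout of our own choosing:

```python
import numpy as np

def discriminant_functions(groups):
    """Eigenvalues and raw weight vectors of the discriminant functions
    that maximize between-group relative to within-group variance.
    groups: list of (n_i x p) arrays of dependent measurements."""
    allobs = np.vstack(groups)
    grand = allobs.mean(axis=0)
    p = allobs.shape[1]
    W = np.zeros((p, p))  # pooled within-group SSCP matrix
    B = np.zeros((p, p))  # between-group SSCP matrix
    for g in groups:
        m = g.mean(axis=0)
        W += (g - m).T @ (g - m)
        d = (m - grand).reshape(-1, 1)
        B += len(g) * (d @ d.T)
    # Eigenvectors of W^-1 B define the discriminant functions;
    # only s = min(p, k - 1) eigenvalues are nonzero
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]
```

With two groups, only the first eigenvalue is nonzero, which is why a single discriminant function results in the example below.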
Two measurements related to the discriminant functions are particularly useful for further interpretation of significant MANOVAs: standardized discriminant weights and canonical loadings. The weight for each variable represents the relative contribution of that measurement to the discriminant function and, therefore, to the separation among the experimental groups. These weights, however, are highly influenced by the correlations among the dependent measurements. For example, if two variables in a MANOVA are highly correlated, one measurement may receive a large weight and the other a small weight, or both may receive only modest weights, although both measurements make a major contribution to group separation. Canonical loadings are correlations that represent how much variance a specific measurement shares with the discriminant function. Thus, these loadings are more useful than weights when the investigator wants an unbiased estimate of how much each variable contributed to the significant differences among the experimental groups. In contrast, discriminant weights can be used to evaluate whether some measurements are providing redundant information. If so, the investigator may want to drop some of these measurements in future research or create a composite score from highly correlated measurements, which would lead to increased reliability.12

The results of a discriminant analysis for our PCA example are presented in Table 2.

Table 2. Discriminant Analysis Results to Interpret Significant Multivariate Analysis of Variance Results in Table 1

Test of Discriminant Functions
Function 1: χ² statistic = 17.060   DF = 4   Prob = 0.002
Canonical correlations
Function 1: 0.810

            Standardized Discriminant Coefficients    Canonical Loadings
Pain        0.661                                     0.431
Sleepy      0.215                                     0.200
Nausea      0.957                                     0.666
Adequate    -0.087                                    -0.396

Pearson Correlation Matrix
            Pain      Sleepy    Nausea    Adequate
Pain        1.000
Sleepy      0.380     1.000
Nausea      0.118     0.026     1.000
Adequate    -0.689    -0.107    -0.274    1.000

Abbreviations: DF, degrees of freedom; Prob, probability level.

As noted above, only one discriminant function resulted since there were only two experimental groups. When a MANOVA results in more than
one significant discriminant function, the interpretation of each function is the same as the procedures that we will describe (see ref. 3 for a conceptual description of multiple discriminant functions). Before printing discriminant coefficients and canonical loadings, most software programs will print a chi-square test for each discriminant function and its associated canonical correlation. Of course, none of the functions will be statistically significant if the MANOVA is not significant. The canonical correlation provides an indicator of how much of the variance among the group means can be accounted for by the discriminant function. In this example, the canonical correlation was 0.81, indicating that the discriminant function accounted for 65.6% (0.81²) of the variance between the group means.

The standardized discriminant coefficients and canonical loadings also are displayed in Table 2. Comparison of these two values for each dependent variable suggests the presence of a substantial correlation between the PAIN and ADEQUATE measurements. This conclusion was reached because the loadings for these two measurements are roughly equivalent in absolute value, but the coefficients are not. Note that the negative sign for the ADEQUATE variable in Table 2 indicates that the direction of the group mean differences for this measurement is opposite from the mean differences for the other three measurements (in general, most software packages will change variable signs so that most variables have positive signs). The Pearson correlations presented in Table 2 confirm this finding. Thus, as might be expected, patients who reported higher levels of pain also were less satisfied with the adequacy of the pain treatment they received.

In sum, the results of the discriminant analysis displayed in Table 2 provide a concise interpretation of the significant MANOVA.
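Canonical loadings can be computed directly as the correlations between each dependent measurement and subjects' scores on the discriminant function. A small illustrative sketch, using made-up data and weights rather than the values from Table 2:

```python
import numpy as np

# Hypothetical data: 6 subjects x 3 dependent measurements, plus an
# illustrative weight vector from a previously fitted discriminant function
X = np.array([[3., 2., 4.], [5., 3., 7.], [4., 2., 5.],
              [6., 4., 8.], [2., 1., 3.], [5., 3., 6.]])
weights = np.array([0.7, 0.2, 0.9])  # illustrative values only

scores = X @ weights  # each subject's discriminant score

# Canonical loading for variable j = correlation of column j with the scores
loadings = np.array(
    [np.corrcoef(X[:, j], scores)[0, 1] for j in range(X.shape[1])]
)
```

Because loadings are ordinary correlations, their squares give the variance each measurement shares with the function, just as the squared canonical correlation (0.81² = 0.656 in our example) gives the variance among the group means that the function accounts for.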
The loadings indicate that the NAUSEA measurement made the largest contribution to group differences, that the PAIN and ADEQUATE measurements made approximately equal contributions, and that the SLEEPY measurement made the least contribution. Additionally, the comparison of discriminant coefficients and canonical loadings indicated that the PAIN and ADEQUATE measurements provided redundant information in terms of group differences. Finally, the discriminant analysis provided important information regarding how well we did with operationalizing the "comfort" construct and tested whether PCA+BI was superior to PCA only in terms of this index. It appears that nausea level was the most important factor in patient comfort, with PCA+BI being superior in this regard, and that patients' perceptions of their level of sedation were not relevant to distinguishing response differences to these two methods of PCA delivery.

Multivariate Contrasts

When the number of experimental groups is greater than two, multivariate contrasts can be used to compare specific pairs or weighted combinations of groups. The methods described to test a priori or planned comparisons in the second paper in this series2 can be extended directly to the multivariate case. Most MANOVA software programs have specific options that permit the researcher to specify multivariate contrasts between independent groups. These contrasts compare the vectors of the p dependent variables for the specified groups. The most common statistic reported for these contrasts is Hotelling's T², which is the multivariate generalization of the t test.

For example, in our PCA study we might have selected a different experimental design that evaluated not only whether PCA+BI was superior to PCA alone, but also whether the quantity of meperidine infused at a continuous rate led to significant differences. One PCA+BI group might receive meperidine 5 mg/mL by continuous infusion set at 10 mg/hr and the other group continuous infusion set at 5 mg/hr. Thus, this design would result in three experimental groups. A significant MANOVA on our "comfort" measurements described above would only show that at least one of the four dependent measurements was significantly different in at least one of the experimental groups. As in ANOVA when k > 2, further analyses would be computed.
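For two groups, the Hotelling's T² contrast mentioned above can be sketched as follows (pooled-covariance form; the function name and data layout are ours):

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 and its exact F transform.
    x, y: (n x p) arrays for the two groups being contrasted."""
    nx, ny = len(x), len(y)
    p = x.shape[1]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled within-group covariance matrix
    S = ((nx - 1) * np.cov(x.T, ddof=1)
         + (ny - 1) * np.cov(y.T, ddof=1)) / (nx + ny - 2)
    t2 = (nx * ny / (nx + ny)) * diff @ np.linalg.solve(S, diff)
    # T^2 converts to an exact F with (p, nx + ny - p - 1) df
    f = t2 * (nx + ny - p - 1) / (p * (nx + ny - 2))
    return t2, f
```

A contrast of two identical groups yields T² = 0, and T² grows with the multivariate distance between the group mean vectors.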
Also similar to ANOVA, some steps to control the experiment-wise alpha level would be needed. Harris11 provides a good discussion of the use of multivariate contrasts and suggests a set of decision rules to control
type I error rates based on whether the hypotheses being tested are a priori or post hoc. Finally, if the researcher wants to investigate group differences among the dependent variables individually when k > 2, the usual ANOVA multiple comparison techniques (eg, Scheffé, Tukey) can be used.2

REPEATED MEASUREMENTS ANALYSIS WITH MULTIVARIATE ANALYSIS OF VARIANCE
Another multivariate situation that frequently arises is multiple measurements on each subject for different trials or times using the same dependent measurement at each trial or time. This experimental design is referred to as a within-subjects or repeated measurements design. In addition to repeated measurements, a common experimental design is to include a between-subjects factor; that is, subjects also are assigned randomly to independent groups (eg, a treatment or control group). The combination of between and within factors is called a mixed-model design. Because repeated observations on the same dependent measurement are almost never independent of each other, which is a critical assumption for regular ANOVA, repeated measurements designs require special techniques for proper analysis.

Historically, mixed-model ANOVA has been used to analyze repeated measurements designs. Although it is a powerful method when all the assumptions are met, unfortunately, the assumptions of this type of analysis are rarely met in applied research situations. Among other assumptions, mixed-model ANOVA requires that all variances of the repeated measurements be equal and that all correlations between the pairs of repeated measurements be equal (referred to as compound symmetry). This implies, for example, that the correlation between measurements taken at times 1 and 2 is the same as the correlation between measurements taken at times 1 and 4. In general, this assumption is usually not valid in repeated measurements and commonly is violated in most experimental designs with more than two repeated measurements.13 An important consequence of these violations is that type I error rates and statistical power are markedly affected. Frequently, the type I error rate exceeds the intended alpha level.14 As a result, the application of mixed-model ANOVA to repeated measurements can lead to highly inaccurate conclusions.

Two basic approaches to repeated measurements analysis have been proposed to avoid the assumptions described above: modifying the traditional mixed-model method and using MANOVA methods. Several procedures have been used to modify the F statistic produced by mixed-model ANOVAs. Typically, these modifications involve systematically reducing the degrees of freedom for the F statistic. The two most common methods are the Greenhouse-Geisser and Huynh-Feldt statistics. The Huynh-Feldt approach is a more recent adjustment that corrects possible biases in the Greenhouse-Geisser approach.

Similar to mixed-model ANOVA, the MANOVA approach to repeated measurements still controls for individual differences and produces a more powerful test of hypotheses related to within factors than would a between-subjects design. However, applying MANOVA to repeated measurements does not require any additional assumptions beyond those reviewed earlier. O'Brien and Kaiser15 suggest that perhaps the greatest virtue of the MANOVA approach is that specific error terms are clearly defined for specific contrasts, in comparison to the problems of computing general or averaged error terms in mixed-model ANOVA.

AN EXAMPLE OF A ONE-FACTOR REPEATED MEASUREMENTS ANALYSIS
The basic components of a repeated measurements analysis can be illustrated by considering a simple, but extremely useful, experimental design in which there is only one within factor and no between factor. The analysis of experimental designs with only one within factor can be viewed as an extension of the paired t test. A primary advantage, however, of obtaining more than two repeated measurements on the same dependent measurement is that this permits the investigator to determine whether certain trends over time (or experimental trials) occur. A significant trend in the data not only demonstrates that the measurement changed, but also facilitates testing theoretical models about the behavior of certain measurements in controlled experimental situations.

Returning to our PCA example, suppose that we are interested in determining changes over time in the total amount of meperidine used, regardless of experimental group assignment. For the PCA+BI group, this would require adding the continuous infusion amount to the amount the patient self-administered with the PCA device. Four-hour sums of the amount of meperidine received for the first 16 hours were calculated. Thus, four repeated measurements were created for each patient. The typical output produced by software programs for this type of design is displayed in Table 3. In many software packages, this analysis can be performed within the same procedure used to calculate between-subjects MANOVAs. Usually, all that is needed is to specify that the variables defined as dependent in the analysis are to be considered as repeated measurements. Because of the increasing evidence that violating mixed-model ANOVA assumptions can lead to major decision errors in hypothesis
Table 3. Within-Subjects Repeated Measures Results (n = 20)

Descriptive Statistics
            Time 1     Time 2     Time 3     Time 4
  Mean     101.186     74.436     68.686     56.936
  SD        40.981     35.342     36.446     23.557

Univariate Repeated Measures Analysis
  Source   SS           DF   MS          F       P      G-G P   H-F P
  Time     21,036.250    3   7,012.083   10.507  .001   .001    .001
  Error    38,038.748   57     667.346

Multivariate Repeated Measures Analysis
  Statistic                    Hypoth. DF   Error DF   F       P
  Wilks' lambda = 0.379            3           17      9.305   .001
  Pillai trace  = 0.621            3           17      9.305   .001
  H-L trace     = 1.642            3           17      9.305   .001

Single Degree of Freedom Polynomial Contrasts
  Source                       SS           DF   MS           F        P
  Polynomial test of order 1
    (linear)                   19,182.250    1   19,182.250   25.862   .001
    Error                      14,092.750   19      741.724
  Polynomial test of order 2
    (quadratic)                 1,125.000    1    1,125.000    1.386   .254
    Error                      15,424.998   19      811.842

Abbreviations: SS, sums of squares; DF, degrees of freedom; MS, mean sums of squares; G-G, Greenhouse-Geisser; H-F, Huynh-Feldt; Hypoth, hypothesis; H-L, Hotelling-Lawley.
testing, many programs automatically will produce modified univariate and MANOVA statistics or will provide options to calculate these statistics. These statistics should always be printed. The multivariate statistics displayed in Table 3 are the same statistics described earlier. Table 3 also contains the traditional mixed-model ANOVA (univariate) statistics, as well as the adjusted probability values produced by the Greenhouse-Geisser and Huynh-Feldt approaches described above. Our bias is to rely on the multivariate statistics, based on the recommendations of many statisticians (ie, Harris11 and O'Brien and Kaiser12). If the univariate results are reported, several factors should be considered. O'Brien and Kaiser12 suggest that if the Huynh-Feldt probability values are different from the probability values produced by the unadjusted ANOVA (in Table 3, the probability value next to the F value), the data may not adhere to the compound symmetry assumption. Further, if the Huynh-Feldt-adjusted probability values do not agree with the probability values produced by the MANOVA approach, MANOVA should be used. The example displayed in Table 3 shows that, at least in these data, all three approaches lead to the same conclusion, that is, the null hypothesis of no mean differences over time in the amount of meperidine received should be rejected.

Since the overall multivariate repeated measurements analysis was found to be statistically significant, additional tests can be computed to describe the differences further. These additional tests can include post hoc pairwise mean comparisons of the repeated measurements, provided that a method is used to control the type I error rate for multiple comparison procedures.2 Keselman et al14 reviewed four pairwise comparison procedures designed to control the type I error rate for repeated measurements.
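The Greenhouse-Geisser adjustment reported in Table 3 scales the univariate degrees of freedom by an estimate of a quantity called epsilon, which measures how far the covariance matrix of the repeated measurements departs from sphericity. As an illustrative sketch of the general technique (not the output of any particular package; the function name and data layout are our own), epsilon can be estimated from raw data in a few lines of numpy:

```python
import numpy as np

def greenhouse_geisser_epsilon(data):
    """Estimate the Greenhouse-Geisser epsilon from an
    n-subjects x p-measurements array of repeated measurements.
    Epsilon ranges from 1/(p - 1) (maximal violation of
    sphericity) up to 1.0 (sphericity holds exactly)."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    S = np.cov(data, rowvar=False)  # p x p sample covariance
    # Double-center the covariance matrix, which projects out the
    # constant (grand mean) direction, leaving the contrast space.
    Sc = (S - S.mean(axis=1, keepdims=True)
            - S.mean(axis=0, keepdims=True) + S.mean())
    # Epsilon = (sum of eigenvalues)^2 / ((p-1) * sum of squared eigenvalues)
    return np.trace(Sc) ** 2 / ((p - 1) * np.sum(Sc ** 2))
```

The adjusted test then refers the observed F to an F distribution with epsilon*(p - 1) and epsilon*(p - 1)*(n - 1) degrees of freedom; the Huynh-Feldt statistic is a further correction of this estimate.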
When more than two repeated measurements are used, a significant repeated measurements effect can be interpreted by computing special single degree-of-freedom contrasts that test whether differences among the means follow linear or quadratic trends. These contrast tests, sometimes called “orthogonal polynomials,” apply a special set of weights to the means and test how well the observed data fit a
specific trend. The weights used are determined by the type of trend being tested as well as the number of repeated measurements in the data set. These values usually are computed automatically for the user by the software program. If not, they are available in many statistics texts (see, for example, refs 6 and 7). As displayed in Table 3, the linear but not the quadratic trend was found to be statistically significant. In fact, the linear trend accounted for the majority of the variance among the four means. This can be determined by comparing the overall sum of squares (SS) for the time factor (21,036) with the SS value for each polynomial contrast. The linear SS value displayed in Table 3 was 19,182, which is 91.2% of the total SS. More generally, the SS values for all possible polynomials for repeated measurements sum to the total SS value obtained for the within factor. The number of orthogonal or statistically independent trends that can be tested is one less than the number of repeated measurements (p - 1). In our PCA example, a cubic trend, which would suggest a rising and falling or periodic pattern of PCA usage, also could have been computed. However, as in many other areas of medical research, theoretical support related to trend analyses usually is limited to linear and curvilinear (quadratic) patterns.
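The SS decomposition described above can be sketched in a few lines of numpy. This illustrates the general technique for equally spaced time points, not the internals of any particular statistics package; the function name and data layout are our own:

```python
import numpy as np

def polynomial_contrast_ss(data):
    """Decompose the within-factor (time) sum of squares into p - 1
    single degree-of-freedom orthogonal polynomial contrasts
    (linear, quadratic, ...) for equally spaced time points.
    data: n-subjects x p-times array of repeated measurements."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    means = data.mean(axis=0)  # mean at each time point
    # Orthonormal polynomial weights: QR-decompose the columns
    # 1, t, t^2, ... evaluated at the time points.
    t = np.arange(p, dtype=float)
    Q, _ = np.linalg.qr(np.vander(t, p, increasing=True))
    contrasts = Q[:, 1:]  # drop the constant column
    # SS for each contrast: n * (weighted sum of the means)^2
    ss_poly = n * (contrasts.T @ means) ** 2
    # Total within-factor SS among the time means
    ss_time = n * np.sum((means - means.mean()) ** 2)
    return ss_poly, ss_time
```

Because the contrasts are orthogonal, the individual SS values in ss_poly always sum to ss_time, which is what permits statements such as "the linear trend accounted for 91.2% of the total SS."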
ADDING A BETWEEN-SUBJECTS FACTOR TO REPEATED MEASUREMENTS
The addition of a between-subjects factor to a repeated measurements factor is a common experimental design used in treatment outcome research. In its simplest form, subjects are assigned randomly to one of two groups (eg, control and experimental) and are measured twice, before and after the experimental group receives the treatment intervention. Regardless of the number of repeated measurements taken, if this type of design is conceptualized as a two-factor design, with one within-subjects and one between-subjects factor, three primary statistical tests can be performed. To illustrate each of these tests and how they are interpreted, we will continue with the PCA example described earlier. As in the one-factor repeated measurements analysis described above, suppose we are interested in evaluating
not only PCA usage as a function of time following surgery, but also whether the amount of self-administered meperidine differed according to the experimental group assignment of patients. Thus, we would have a one-between (GROUP, with two levels: PCA+BI and PCA only) and one-within (TIME, with four levels: first, second, third, and fourth 4-hour time blocks postsurgery) factorial design. The means and standard deviations for the eight cells of this study (2 x 4) are displayed in Table 4. The commands used by most software packages to analyze this type of design are not complicated and are simply a combination of those described earlier. That is, one statement is used to declare which variable in the data set represents the coding index for the between-subjects factor, and another statement is used to specify those variables that are to be used as dependent variables and that are declared as repeated measurements. The first of the three primary tests is the ANOVA test of the GROUP or between-subjects factor. This is simply a univariate ANOVA that tests whether there are mean
Table 4. One-Between and One-Within Subjects Results

Descriptive Statistics
              Time 1    Time 2    Time 3    Time 4    Total
  Group = 1 (N = 10; group label = PCA+BI)
    Mean      70.000    56.000    39.400    28.000    193.400
    SD        24.037    17.127    14.998    10.593     46.455
  Group = 2 (N = 10; group label = PCA only)
    Mean      97.000    72.000    66.000    67.400    302.400
    SD        29.078    19.889    17.764    17.450     74.603

Univariate Analyses: Between Subjects
  Source   SS           DF   MS           F        P
  Group    14,851.250    1   14,851.250   15.383   .001
  Error    17,378.200   18      965.456

Multivariate Repeated Measures Analysis
  Effect                       Hypoth. DF   Error DF   F        P
  Time
    Wilks' lambda = 0.232          3           16      17.658   .001
    Pillai trace  = 0.768          3           16      17.658   .001
    H-L trace     = 3.311          3           16      17.658   .001
  Time x Group
    Wilks' lambda = 0.443          3           16       6.708   .004
    Pillai trace  = 0.557          3           16       6.708   .004
    H-L trace     = 1.258          3           16       6.708   .004

Abbreviations: SS, sums of squares; DF, degrees of freedom; MS, mean sums of squares; Hypoth, hypothesis; H-L, Hotelling-Lawley.
differences between the experimental groups when the sum (some software programs compute an average) of the p dependent (repeated) measurements is formed. That is, the repeated measurements are collapsed into a single measurement. As displayed in Table 4 under the "univariate analyses" section, overall the two PCA groups self-administered significantly different amounts of meperidine (see "descriptive statistics" section, Table 4), with the PCA-only group, as expected, using more over this 16-hour time period. The second primary test evaluates the within factor, ignoring the between factor. This is called the "flatness" hypothesis and tests whether the profile of means of the repeated measurements, collapsing across experimental group assignments, is a horizontal line or "flat." Thus, in our PCA example, this is a test of the time effect, and the null hypothesis is that patients used the same amount of meperidine at each 4-hour time period following surgery. As displayed in Table 4 under the label "Time" for the multivariate analysis, the null hypothesis was rejected, which indicated that significant changes occurred over time in the amount of meperidine used. Finally, the third and perhaps most interesting hypothesis that can be tested with this experimental design is the "parallelism" hypothesis, which evaluates whether the pattern or trend in the repeated measurements is the same for each experimental group. This is equivalent to testing the interaction of the between and within factors. This is labeled "Time x Group" in Table 4, and was found to be statistically significant (P = .004). As with any significant interaction, our interpretation of the results found for the separate group and time tests (ie, main effects), described above, must be modified in light of the significant interaction between these two factors. In this example, interpretation of the significant group by time interaction takes precedence over interpreting the group and time main effects.
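The parallelism hypothesis for two groups can be computed directly as a two-sample Hotelling T-squared test on within-subject difference scores, which is one standard way of formulating the multivariate interaction test. The following sketch uses a hypothetical data layout and our own function name; with two groups of 10 subjects and p = 4 time points, the resulting F has 3 and 16 degrees of freedom, matching the Time x Group test in Table 4:

```python
import numpy as np
from scipy import stats

def parallelism_test(g1, g2):
    """Two-sample Hotelling T^2 test of the parallelism hypothesis.
    g1, g2: n1 x p and n2 x p arrays of repeated measurements.
    The test operates on adjacent difference scores, so a pure
    level shift between groups does not affect it."""
    d1, d2 = np.diff(g1, axis=1), np.diff(g2, axis=1)  # n x (p-1)
    n1, q = d1.shape
    n2 = d2.shape[0]
    diff = d1.mean(axis=0) - d2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(d1, rowvar=False)
              + (n2 - 1) * np.cov(d2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(pooled, diff)
    # Convert T^2 to an F statistic with (q, n1 + n2 - q - 1) df
    f = (n1 + n2 - q - 1) / (q * (n1 + n2 - 2)) * t2
    p = stats.f.sf(f, q, n1 + n2 - q - 1)
    return f, p
```

A significant result indicates that the time profiles differ in shape between the groups, not merely in overall level.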
The significant interaction thus indicates that changes in the amount of meperidine used over time were not the same for the two experimental groups. Interpreting a significant interaction requires that the investigator isolate the sources of the interaction. The best way to begin is to plot the means for the independent variables involved in the significant interaction and evaluate the form of the functional relationships between experimental groups. Figure 1 plots the means for meperidine usage over time for both the PCA+BI and PCA-only groups.

[Fig 1. Graphic illustration of a significant between- and within-subjects interaction: mean meperidine usage plotted against the four 4-hour time blocks for the PCA-only and PCA+BI groups.]

As can be seen in Fig 1, the pattern of meperidine usage over time for these two groups of patients is not parallel. As displayed in Table 4, the PCA+BI group self-administered less meperidine than the PCA-only group. This finding is not surprising, since this group of patients also was receiving meperidine by continuous infusion. What is important to recognize in Fig 1 is that the significant group by time interaction did not occur because the PCA+BI patients self-administered significantly less meperidine. Rather, the interaction indicated that the pattern of usage over time was significantly different between these groups, as is readily apparent in Fig 1. This example highlights that the common strategy of computing separate tests (eg, t tests or ANOVAs) for each repeated measurement in order to interpret a significant interaction can result in an inadequate interpretation of the interaction. In this example, t tests for each of the four time periods used to evaluate meperidine usage would be statistically significant, but would be uninformative in interpreting the source of the significant interaction. A better strategy with these types of data is to compute "simple main effects." Simple main effects consist of computing main effect tests for one
factor at each of the levels of the second factor involved in a significant interaction. In our PCA example, the best strategy to use to interpret the group by time interaction would be to compute one-factor repeated measurements analyses (described earlier) separately for the PCA+BI and PCA-only groups. Additionally, as suggested by inspecting Fig 1, we also may want to restrict these simple main effects analyses to times 2 through 4, since it appears that the amount of meperidine used remained constant during these time periods for the PCA-only group but not for the PCA+BI group. Repeated measurements MANOVA for the PCA-only group indicated no significant differences among the means for times 2 to 4 (P = .12), but mean differences for these time periods were found for the PCA+BI patient group (P = .003). Thus, we can conclude that the amount of meperidine used by PCA+BI patients significantly declined during the first 16 hours postsurgery, whereas the significant decline in the amount used by PCA-only patients occurred between 4 and 8 hours postsurgery, after which usage remained constant over the next 8 hours.

CONCLUSION
Throughout this paper we have contended that most research designs in critical care medicine are inherently multivariate in that experimental manipulations are expected to produce changes simultaneously on several dependent measurements or repeated measurements of the same measurement over time. Our goal was to provide the reader with an introductory understanding of how multivariate statistical approaches can be used to analyze these types of research designs. Rather than focusing on the mathematical basis of these procedures, we attempted to illustrate these statistical procedures with practical research examples. We do not assume, however, that this brief primer of multivariate statistics provides sufficient background for the reader to begin applying multivariate statistical procedures in his or her own research. Expert statistical consultation and additional reading is highly recommended. Fortunately, several good conceptually oriented sources that describe the procedures outlined in this paper are available (eg, Bray and Maxwell3 and Klecka16). Readers interested in a more mathematical treatment of multivariate statistical procedures can consult Mardia et al17 or Morrison.18 It is our hope, however, that this primer has persuaded the reader to consider research situations in which a multivariate approach may have distinct advantages over more commonly used univariate statistical procedures.

ACKNOWLEDGMENT

The authors thank Dr Michael R. Pinsky, editor of the Journal of Critical Care, for his support of this series on applied statistical analysis and his helpful suggestions on an earlier draft of this article.
REFERENCES

1. Kubinski JA, Rudy TE, Boston JR: Research design and analysis: The many faces of validity. J Crit Care 6:143-151, 1991
2. Boston JR, Rudy TE, Kubinski JA: Multiple statistical comparisons: Fishing with the right bait. J Crit Care 6:211-220, 1991
3. Bray JH, Maxwell SE: Multivariate Analysis of Variance. Beverly Hills, CA, Sage Publications, 1985
4. Stevens JP: Power of the multivariate analysis of variance tests. Psychol Bull 88:728-737, 1980
5. Carroll JB: The nature of the data, or how to choose a correlation coefficient. Psychometrika 26:347-372, 1961
6. Fleiss JL: The Design and Analysis of Clinical Experiments. New York, NY, Wiley, 1986
7. Armitage P, Berry G: Statistical Methods in Medical Research (ed 2). Oxford, UK, Blackwell Scientific, 1987
8. Olson CL: On choosing a test statistic in multivariate analysis of variance. Psychol Bull 83:579-586, 1976
9. Olson CL: Comparative robustness of six tests in multivariate analysis of variance. J Am Stat Assoc 69:894-908, 1974
10. Bock RD: Multivariate Statistical Methods in Behavioral Research. New York, NY, McGraw-Hill, 1975
11. Harris RJ: A Primer of Multivariate Statistics. New York, NY, Academic, 1975
12. O'Brien RG, Kaiser MK: MANOVA method for analyzing repeated measures designs: An extensive primer. Psychol Bull 97:316-333, 1985
13. Huynh H, Feldt LS: Performance of traditional F tests in repeated measures designs under variance heterogeneity. Comm Stat: Series A 9:61-74, 1980
14. Keselman HJ, Keselman JC, Shaffer JP: Multiple pairwise comparisons of repeated measures means under violation of multisample sphericity. Psychol Bull 110:162-170, 1991
15. Keppel G: Design and Analysis: A Researcher's Handbook. Englewood Cliffs, NJ, Prentice-Hall, 1973
16. Klecka WR: Discriminant Analysis. Beverly Hills, CA, Sage Publications, 1980
17. Mardia KV, Kent JT, Bibby JM: Multivariate Analysis. London, UK, Academic, 1979
18. Morrison DF: Multivariate Statistical Methods (ed 2). New York, NY, McGraw-Hill, 1976