Research Without Tears
Analysis of repeated measurements in physical therapy research
Greg Atkinson
In this paper, I attempt to introduce physical therapists to the most common statistical tests for analysing differences between repeated measurements over time. Using the example of 'whole-body flexibility' recorded at six different times of day and the Statistical Package for the Social Sciences (SPSS), I discuss the advantages and disadvantages of the various approaches for analysing a simple one-factor design. The most important issues in test selection for repeated measures are the exploration of, and correction for, the violation of 'sphericity' when employing a univariate general linear model (GLM), as well as the sample size when adopting a multivariate GLM. I summarize current advice on choice of test with the aid of a 'decision tree', based on the results of documented statistical simulations which have investigated how the various statistical tests 'perform' in certain situations. Lastly, I comment on the most appropriate ways to present and interpret data drawn from serial measurements.
© 2001 Harcourt Publishers Ltd
Overview
Greg Atkinson, Research Institute for Sport and Exercise Sciences, Liverpool John Moores University, Henry Cotton Building, Webster Street, Liverpool L3 2ET, UK. Tel: 44(0) 151 231 4333; Fax: 44(0) 151 231 4353; E-mail: g.atkinson@livjm.ac.uk
Physical Therapy in Sport (2001) 2, 194–208. doi: 10.1054/ptsp.2001.0071, available online at http://www.idealibrary.com
As a teacher of research methods and statistics, I am told by students that one of my more frustrating responses to any of their questions is, 'It depends' (students might like to liven up a statistics lecture by counting how many times the statistician says this phrase). When a statistician responds in this way, he or she really means that the choice of a particular analysis depends on the type of data that has been collected, and whether characteristics of the data obey certain rules and regulations called 'statistical assumptions'. To complicate matters, students and researchers may read different texts and find that there are conflicting views amongst statisticians about how important statistical assumptions are to the accuracy of some statistical tests. In the past, some statisticians were so worried about the implications of breaking underlying assumptions that they promoted the use of a family of tests, called 'non-parametric' statistics, that are generally associated with fewer rules. Other statisticians argued that the more widely
known 'parametric' tests (e.g. the t-test) are 'robust', in that they are relatively unaffected by violations of any assumptions. Bland (1995) discussed the dangers of holding such generalized and opposing views about choice of statistics and, thankfully, such views are now rare. Most statisticians now agree that a course in statistics should stress the importance of examining whether underlying assumptions hold for a particular statistical test, before that test is applied to data. Moreover, a decision needs to be made about how serious any violation of statistical assumptions is. My first aim in this paper is to communicate how statisticians know whether a violation of a certain statistical assumption is serious or not, since this knowledge is ultimately used to rationalize the choice of a statistical test. Primarily, the availability of powerful computers has gone a long way in making the 'It depends' response more objective. Statisticians can now simulate large amounts of different data types and observe how different statistical tests 'perform' when underlying assumptions are violated to varying degrees. There is even a Journal of Statistical Computation and Simulation, which is devoted especially to the examination of how different statistical tests perform under different circumstances (see, for example, Al-Subaihi 2000). This investigative nature of statistics might not be very well appreciated by researchers working in an applied setting. One job of the statistician is to synthesize the results of data simulations into coherent advice that helps researchers select the most appropriate statistical tests.
Why are statistical assumptions important? What is meant when it is said that a particular test 'performs' well, or poorly, in certain circumstances? In essence, if a test is applied to data that break the underlying assumptions for that particular test, then the test might well be considered not to 'perform' well, in that research conclusions could be compromised. The researcher endeavours to arrive at the 'true' research conclusion, which in statistical terms is the conclusion that would be reached if all subjects in the whole population could be observed. Obviously, this ideal situation rarely arises. So the researcher's conclusions, made from a smaller sample of the population, might be incorrect. Table 1 shows the possible situations that may arise in the statistical testing process, relative to what might be the 'true' case in the population. At first glance, one may appreciate why this table has been called the 'confusion matrix'!

Table 1 Type I and type II errors in the null hypothesis testing process. The nominal type I error rate is conventionally 5% and the nominal type II error rate is conventionally 10–20%

                                       The 'real' population conclusion
Researcher's conclusion from sample    H0 false (an effect)                      H0 true (no effect)
H0 false (an effect)                   Correct conclusion (true 'positive')      Type I error (alpha) (false 'positive')
H0 true (no effect)                    Type II error (beta) (false 'negative')   Correct conclusion (true 'negative')

To summarize Table 1, a researcher, in selecting an inappropriate statistical test, could make either a 'type I' or a 'type II' error. A type I error, or 'false positive', arises when the researcher has concluded that an effect or difference is present, i.e. null hypothesis rejected, when in reality (in the population) there is no effect or difference. A researcher would make a type II error, or a 'false negative',
if it is concluded that no effect or difference is present, i.e. null hypothesis not rejected, when in truth there really is an effect. In returning to the question posed previously, statistical assumptions are important because, if they are not upheld, the chances of making a type I or type II error might increase. In other words, choosing the wrong test for the data might lead a researcher to a completely erroneous conclusion. An example, cited as one of Altman's (1991) 'basic errors in analysis', would be a researcher who chooses to analyse the difference between two serial measurements with a statistical test designed for comparisons between two separate samples (e.g. a two-sample t-test). In this situation, it is more likely that the researcher would make a type II error, since it would be probable that the measurements over time are correlated, and this characteristic affects the magnitude of the 'error variance' used in the calculations. A two-sample t-test is used on the assumption that the two data sets have been sampled from completely independent populations. The most appropriate choice of test in the above situation, with less chance of the researcher making a type II error, would be a paired or 'correlated' t-test, since this approach takes into account that the measurements over time are related. In other words, the paired t-test 'knows' that some of the variation in the data can be attributed to between-subjects differences, and so does not include this component in the calculations. As one will appreciate later, it is this likelihood of serial measurements being correlated that leads on to the consideration of several related statistical assumptions for the analysis of more than two tests over time and, ultimately, makes the choice of analysis of serial measurements over time one of the most complicated issues in statistics.
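The contrast between the two t-tests can be made concrete with a small calculation. The sketch below uses invented scores for six subjects measured twice (the data, variable names and function names are hypothetical, not taken from the study discussed here) and computes both statistics by hand with the Python standard library.

```python
import math
import statistics

# Hypothetical flexibility scores (cm) for the same six subjects on two occasions.
morning = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0]
evening = [12.0, 15.0, 21.0, 24.0, 27.0, 33.0]


def two_sample_t(a, b):
    """Independent-samples t statistic (pooled variance); ignores the pairing."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / standard_error


def paired_t(a, b):
    """Paired t statistic; the error term is based on within-subject differences."""
    diffs = [y - x for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


t_independent = two_sample_t(morning, evening)
t_paired = paired_t(morning, evening)
print(f"two-sample t = {t_independent:.3f}, paired t = {t_paired:.3f}")
```

With these invented data the between-subjects spread is large relative to the change over time, so the two-sample statistic is small while the paired statistic is large. This is the situation in which the wrong test would most plausibly produce a type II error.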
As an illustration of how complicated 'repeated measures' analyses can seem, I present the 'basic' output that is provided by version 9.0 of the Statistical Package for the Social Sciences (SPSS) for a relatively simple analysis of six serial measurements over time on a sample of six subjects (Table 2). In fact, I have removed some of the default output that pertains to pre-planned specific contrasts for the sake of the reader's sanity. What does all this output mean? The aim of this paper is to gradually unravel such statistical output and answer the above question by discussing which particular types of analysis are most appropriate, depending on whether or not statistical assumptions are upheld.
Basic definitions
As already mentioned, the terms 'serial measurements' or 'repeated measures' refer to any variable that has been measured in the same sample of subjects over time. This type of examination is often referred to as a 'within-subjects' design. Within-subjects designs encompass three main research scenarios: observing the same subjects under different treatment conditions, comparing measurements between different tests or measuring instruments (as in the previous paper in PTiS on validity by George et al. 2000) or observing subjects longitudinally over time (Maxwell & Delaney 1990). As an example of the latter design, the output shown in Table 2 resulted from a larger study by Atkinson (1994), and involved the measurement of 'whole-body' flexibility (using a 'Takei' forward-flexometer) at six times of day (02:00, 06:00, 10:00, 14:00, 18:00 and 22:00 h) for six young adults. The measurements taken over time are termed 'levels', so there are six levels (times of day) in our example. One can see this information presented in the first part of the output in Table 2. Sometimes, a research design is more complicated than the basic scenarios above. A researcher might desire to take repeated measures over time and compare these measurements between different samples or conditions in an experiment. Such designs are termed 'multifactorial' (or just 'factorial'), since the researcher desires to examine the influence
of several factors on the outcome variable of interest. For example, the effects of time of day on whole-body flexibility might in turn be thought to differ between samples of young adult and middle-aged people (Atkinson 1994). What the researcher has done in this design is add a 'between-subjects' factor of 'age' with two levels (young adult and middle-aged). This design could also be described as 'mixed', in that there are both within- and between-subjects factors present. A researcher might also be interested in examining the influence of other within-subjects factors, as well as time of day, on flexibility. For example, it might be hypothesized that stretching prior to the morning (06:00) measurement ameliorates the reduction in flexibility at this time of day. Therefore, the researcher might take the six measurements on two different days under two different conditions (stretching and no-stretching before the morning test), ideally administered in a counter-balanced order. In this scenario, the researcher has added another within-subjects factor (besides time of day) that might be called 'treatment', containing two levels. In this, my first paper of a series of two, I will concentrate on the simpler situation of a design having only one within-subjects factor. Issues such as interpretation of main effects and interactions in factorial designs will be discussed in the second paper presented in a later issue of Physical Therapy in Sport. However, I haven't quite finished with the nomenclature yet. Most of the other terms are specific to a particular type of analysis, so I now want to introduce the general names of the different analyses for serial measurements that the researcher might come across. Two procedures which can both be considered as 'parametric' methods, but differ considerably in the way they treat repeated measures data, are 'univariate' and 'multivariate' general linear modelling (GLM).
Readers might be more familiar with the term 'analysis of variance' (ANOVA) with respect to univariate and multivariate methods, but statisticians noticed how similar ANOVA is to some linear regression methods, hence the all-encompassing term of GLM is now preferred. Although this issue might be academic to applied researchers, interested readers might like to consult Field (2000), who provides more rationale for adopting the term GLM. Before looking at each of the above methods in more detail, I have to stress that there are even more approaches that have been suggested for analysing repeated measures, but the relative performance of these approaches is only just starting to be investigated. Readers interested in a recent comparison of the non-parametric tests developed by Friedman and Conover are directed towards Al-Subaihi (2000), as well as the first investigation (to my knowledge) on the influence of data transformations (e.g. logarithmic) on repeated measures analyses (Boik et al. 2000). Some readers may have also heard of the 'mixed model' approach to analysing repeated measures. This type of analysis will also not be considered here, since its relative performance is, again, still under investigation and may be suspect (Keselman et al. 1998; 1999). Yet another approach is to analyse 'summary statistics' calculated from each subject's repeated measurements (Matthews et al. 1990). This method is relevant mainly to factorial designs and so will be discussed in the second paper.

Table 2 Default output from SPSS (version 9.0) for a repeated measures general linear model with one within-subjects factor of 'time of day' having six levels (02:00, 06:00, 10:00, 14:00, 18:00 and 22:00). The outcome variable of interest is 'whole-body flexibility' taken from a larger study by Atkinson (1994)

Within-subjects factors
Measure: Measure_1
Time    Dependent variable
1       TWOAM
2       SIXAM
3       TENAM
4       TWOPM
5       SIXPM
6       TENPM

Multivariate tests (b)
Effect  Statistic             Value     F           Hypothesis df   Error df   Sig.
Time    Pillai's Trace        0.922     2.380 (a)   5.000           1.000      0.455
        Wilks' Lambda         0.078     2.380 (a)   5.000           1.000      0.455
        Hotelling's Trace     11.902    2.380 (a)   5.000           1.000      0.455
        Roy's Largest Root    11.902    2.380 (a)   5.000           1.000      0.455
(a) Exact statistic.
(b) Design: intercept. Within-subjects design: Time.

Mauchley's test of sphericity (b)
Measure: Measure_1
                                                                      Epsilon (a)
Within-subjects effect   Mauchley's W   Approx. chi-square   df   Sig.    Greenhouse-Geisser   Huynh-Feldt   Lower-bound
Time                     0.022          11.886               14   0.736   0.439                0.795         0.200
Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
(a) May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of within-subjects effects table.
(b) Design: intercept. Within-subjects design: Time.

Tests of within-subjects effects
Measure: Measure_1
Source                              Type III sum of squares   df       Mean square   F        Sig.
Time          Sphericity assumed    155.035                   5        31.007        25.762   0.000
              Greenhouse-Geisser    155.035                   2.193    70.687        25.762   0.000
              Huynh-Feldt           155.035                   3.976    38.993        25.762   0.000
              Lower-bound           155.035                   1.000    155.035       25.762   0.004
Error (Time)  Sphericity assumed    30.090                    25       1.204
              Greenhouse-Geisser    30.090                    10.966   2.744
              Huynh-Feldt           30.090                    19.880   1.514
              Lower-bound           30.090                    5.000    6.018

Tests of between-subjects effects
Measure: Measure_1
Transformed variable: average
Source      Type III sum of squares   df   Mean square   F        Sig.
Intercept   6958.340                  1    6958.340      47.970   0.001
Error       725.285                   5    145.057
Analysis using univariate general linear modelling
In the simplest scenario of a one-factor within-subjects design (e.g. Table 2), the commonly employed univariate method is basically an extension of the 'paired' t-test. Indeed, anyone who performs both a paired t-test and a repeated measures GLM on some data obtained from a design with one within-subjects factor having only two levels might notice that the t-value is the square root of the F-value obtained from the GLM. What the GLM can be employed to do that the t-test cannot (or should not) is to examine hypotheses concerning multiple serial observations. Use of multiple t-tests in such situations increases the chance of making a type I error in one's research (Table 1). For example, it would take 15 t-tests to examine all pair-wise differences between the six times of day in our example, and the subsequent type I error rate can be calculated to be 54%, rather than the nominal level of 5% (Maxwell & Delaney 1990). Therefore, there would be a good chance that we
would find a difference between test-times that is merely a false-positive result.
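The 54% figure can be reproduced with a short calculation. The sketch below assumes that the 15 tests are independent, which is only an approximation for pair-wise comparisons made on the same subjects; the function name is illustrative.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_levels = 6                     # six times of day
n_pairs = comb(n_levels, 2)      # number of pair-wise comparisons
rate = familywise_error_rate(n_pairs)
print(f"{n_pairs} comparisons, family-wise type I error rate = {rate:.0%}")
```

The same function shows how quickly the problem grows: even three levels (three comparisons) push the rate well above the nominal 5%.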
Basic considerations
Use of a univariate repeated measures GLM is associated with the usual assumptions that pertain to any parametric analyses, namely:
1. Observations are independent. In the case of repeated measures, this assumption means that the cases or subjects should be mutually exclusive. The measurements made within each subject are, of course, not independent from each other in repeated measures designs.
2. Data should be sampled from a population that is normally distributed.
3. The population variances for the various levels within a factor should be similar.
Violation of assumption 1 can lead to statistical errors, and is something that should be considered at the design phase of the experiment, rather than being an analysis issue. Besides avoiding the mistake of using the same subject more than once in any experiment (Hurlbert (1984) called this statistical crime 'pseudo-replication'), a researcher should ensure that subjects cannot influence each other's results during the data collection period. For example, when measuring the physical performance of a group of subjects in the same laboratory session, knowledge of results, and consequent competition between subjects, should be discouraged to avoid an increased source of experimental error variance.
In the introduction, I mentioned the dangers of selecting a particular test without checking the various assumptions associated with its use. Nevertheless, I also mentioned that there is some flexibility in the assumptions, depending on evidence from data simulations. Indeed, there is good evidence to suggest that the univariate GLM is very 'robust' to departures from the second and third assumptions cited above. Maxwell and Delaney (1990) discussed this evidence in detail. To summarize the results of the statistical simulations, only with data that are considerably non-normal does the type I error rate depart from the nominal 5%, or the type II error rate become a cause for concern (Glass et al. 1972).
Similarly, only when the
sample size in each level of a factor is different do unequal variances become a problem in univariate GLM. Obviously, in a repeated measures design, there should be equal numbers of cases between the levels of the within-subjects factor, since this is the nature of the design, unless there are missing data, which is a different issue entirely. Both Maxwell and Delaney (1990) and Field (2000) de-emphasize the importance of the above second and third assumptions in the analysis of serial measurements, but place much emphasis on a particular assumption that is unique to repeated measures designs. This assumption is related to the previously mentioned fact that serial measurements are likely to be correlated, and is called 'sphericity' or 'circularity'.
What is the sphericity assumption?
The assumption of sphericity is a difficult issue, both to understand and to communicate. Understanding has not been helped by some confusion in the literature between sphericity and the related, but different, assumption of 'compound symmetry' (e.g. Vincent 1995). Most students switch off soon after reading such statements as 'the variance-covariance matrix should have a certain form', which nevertheless is relevant to the issue of sphericity. I refer readers to the note that accompanies the component of output in Table 2 labelled 'Mauchley's test of sphericity' to illustrate my point here. The best discussions of sphericity I have found are in Maxwell and Delaney (1990) and Field (2000), not least because they provide the deductive reasoning behind how sphericity can be explained in a much more accessible manner than that
involving matrix algebra. Maxwell and Delaney (1990) showed how sphericity can be thought of as equality of the population variances calculated for the differences between all pairs of serial measurements. What does this statement mean? Table 3 shows the sample variances for all 15 pair-wise differences relevant to our example (six times of day). To calculate each variance, a column of difference scores ('residuals') was calculated for each pair of tests in the Microsoft Excel spreadsheet package. One can see that there is evidence that not all the difference variances are similar, so one would suspect from this 'eyeball check' that the sphericity assumption might be violated. Some readers may have also heard of the assumption mentioned above that is related to sphericity called 'compound symmetry'. Compound symmetry is a special case of sphericity and, indeed, if the assumption of compound symmetry is upheld, so is the sphericity assumption. Nevertheless, there is a technical difference between these two assumptions. Compound symmetry refers to two conditions being true: that the population variances of data in the levels (not the differences) of the factor are the same (assumption 3 above) and that the correlations between the levels are the same. Maxwell and Delaney (1990) provided an example to show how the variances of differences might be the same (thus the sphericity assumption is upheld), but the variances of the data in each level of the within-subjects factor might be different (thus breaking the compound symmetry assumption). To summarize, the most important assumption in repeated measures GLM is the equivalence of the difference variances, which is called sphericity. In Table 3, we saw that some of the difference variances might be unequal, but how unequal do the difference variances need to be for it to be suspected that the sphericity assumption has been violated? Firstly, I need to communicate why the sphericity assumption is so important and discuss whether anything can be done about it at the design stage of research.

Table 3 Sample variances for all possible pair-wise differences between the six times of day. The time of day of the first of the serial measurements was counterbalanced, but there is still evidence that some variances are different from others (e.g. 02:00–10:00 h vs. 06:00–22:00 h).

         02:00   06:00   10:00   14:00   18:00   22:00
02:00    *       *       *       *       *       *
06:00    3.4     *       *       *       *       *
10:00    1.0     2.5     *       *       *       *
14:00    1.6     6.0     1.1     *       *       *
18:00    0.8     3.6     1.7     2.0     *       *
22:00    1.1     6.2     1.5     1.5     2.3     *
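The 'eyeball check' behind Table 3 amounts to computing a column of differences for every pair of levels and taking the variance of each column. A minimal sketch of that procedure follows; the scores are invented for three times of day and four subjects (the raw data of the study are not reproduced in this paper), and the function name is hypothetical.

```python
from itertools import combinations
from statistics import variance

# Hypothetical scores: one list per level, one entry per subject (same order).
scores = {
    "06:00": [20, 22, 25, 27],
    "14:00": [24, 25, 29, 30],
    "22:00": [23, 27, 26, 33],
}


def difference_variances(data):
    """Sample variance of the within-subject differences for each pair of levels."""
    result = {}
    for first, second in combinations(data, 2):
        diffs = [a - b for a, b in zip(data[first], data[second])]
        result[(first, second)] = variance(diffs)
    return result


variances = difference_variances(scores)
for pair, value in variances.items():
    print(pair, round(value, 2))

# A crude summary of how unequal the difference variances are.
ratio = max(variances.values()) / min(variances.values())
print("largest / smallest:", round(ratio, 1))
```

Under sphericity the population values of these variances are equal, so a large spread among the sample values is a warning sign rather than a formal test.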
What happens when sphericity is violated?
The assumption of sphericity is important for the accuracy of the examination of what is called the 'omnibus hypothesis' in the GLM. The 'null' omnibus hypothesis is that all level means within the factor of interest, and for the population of interest, are equal. In essence, statistical hypotheses are examined by comparing any variability between level means in relation to the general background error or 'residual' variability that is present. One can see this concept in Table 2, by observing that the F-value for the univariate 'tests of within-subjects effects' is calculated from the mean square variability for the time factor divided by the mean square variability for error, i.e.
$$F = \frac{MS_{\text{time}}}{MS_{\text{error}}}$$
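The F-value in Table 2 can be checked from the sums of squares and degrees of freedom printed in the 'sphericity assumed' rows. The short sketch below does only that arithmetic; the helper name is illustrative.

```python
def mean_square(sum_of_squares, df):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


# Values from the 'Sphericity assumed' rows of Table 2.
ss_time, df_time = 155.035, 5
ss_error, df_error = 30.090, 25

ms_time = mean_square(ss_time, df_time)
ms_error = mean_square(ss_error, df_error)
f_ratio = ms_time / ms_error
print(f"MS_time = {ms_time:.3f}, MS_error = {ms_error:.3f}, F = {f_ratio:.3f}")
```

Note that the corrected rows of Table 2 divide both sums of squares by degrees of freedom that have been scaled by the same epsilon, which is why the F-value is identical in every row and only the significance level changes.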
Thus, a general level of residual variability is assumed across all levels, called the 'pooled error term'. For examinations of all the differences between level means within a factor to be accurate, this pooled error term should be relatively stable across those level means. When the assumption of sphericity is not upheld, it follows that the population error variance will be inaccurate (i.e. could be higher or lower) for some comparisons of level means. On average, the results of GLM with violation of the sphericity assumption might lead the researcher to conclude that some level means are different, when in reality (in the population) they are not. In other words, it would be more likely that the researcher would make a type I error (Table 1). Statisticians have calculated that the nominal type I error rate of 5% could be as high as 15% if the assumption of sphericity is not met (McCall & Appelbaum 1973). Moreover, a researcher who might think he/she is rejecting an omnibus null hypothesis at the
oft-stated 'highly significant' 1% level might in reality be rejecting the null hypothesis with a type I error rate of 6%, if the sphericity assumption is not upheld. Whether one should go as far as disregarding all research conclusions if sphericity has not been explored, as Huck and Cormier (1996) maintained, is debatable. Nevertheless, it is obvious that sphericity is important for accuracy of research conclusions and, therefore, the assumption needs to be explored and, if violated, corrected for. However, maybe there is something one can do at the design phase, i.e. before collecting any data, to reduce the chances of the sphericity assumption not being upheld? I attempt to examine this question in the next section.
Sphericity and design issues
If sphericity is concerned with inequality of difference variances, what can be done to ensure that these variances are similar? If one is examining a repeated measures factor other than time itself (e.g. a treatment factor of 'warm up' with three levels of no warm up, short warm up and prolonged warm up), then counterbalancing the order of treatments may prove useful. Obviously, such a tactic is well advised anyway to control for any learning or training effects on the measurement of the dependent variable of interest (e.g. flexibility). Nevertheless, counterbalancing may also mean that the difference variances between the pairs of levels are similar, and the sphericity assumption may be upheld. The logic behind this advice is that it is likely that difference variances are larger between measurements made far apart in time compared to difference variances between measurements made close together in time. If it is possible for the order of tests to be counterbalanced, then this effect will be controlled for. It should be noted that counterbalancing may not control for 'differential carry-over effects', which if present may also influence equality of difference variances. Differential carry-over effects mean that one treatment in particular influences subsequent treatments. For example, in a repeated-measures experiment which examines different doses of caffeine on performance, maybe the
higher doses of caffeine have prolonged carry-over effects, but the lower doses do not. In our example experiment involving time of day influences on flexibility, the order of the first time of measurement was randomized, but there is still evidence that the difference variances between the six times of day are not equal (Table 3). This illustrates how complicated relationships between serial measurements may make violation of sphericity difficult to avoid at the design phase. If the repeated measures factor involves trends over time, then control of difference variances is more difficult, since counterbalancing is impossible. It is highly likely that the variances of the differences between levels that are close together in time would be smaller than the variances of the differences between levels that are far apart (Maxwell & Delaney 1990). Astute readers may recognize a link here between sphericity and reliability of measurements. Reliability also involves the within-subjects variability between repeated tests (Atkinson & Nevill 1998), and reliability may be better the less time there is between repeated tests (as long as there are no residual fatigue influences). The important point is that adequate familiarization before experimental data are collected may lessen the chance of the sphericity assumption being violated, since reliability would be maximized (the within-subjects variability might be minimized to a 'baseline' before the experiment begins).
Exploring and correcting for violation of sphericity
Besides the 'eyeball test' detailed above for assessing whether the sphericity assumption is upheld or not, there is a significance test for sphericity called 'Mauchley's test for sphericity'. The calculations for this test are complicated but, thankfully, statistical packages like SPSS provide the test (Table 2). The test statistic is Mauchley's W, which has an approximate chi-squared distribution. The idea is that, if the P-value associated with this test is below 0.05, then there is evidence that the sphericity assumption has not been upheld (i.e. the population difference variances are not
similar). In Table 2, Mauchley's test for sphericity gives a P-value of 0.736, suggesting that the sphericity assumption is upheld. Now we have a problem in that there is disagreement between what we suspect from an eyeball check (sphericity violated) and what a statistical test tells us (sphericity upheld). Why? The answer again lies in the 'performance' of Mauchley's test for sphericity. Statisticians (e.g. Keselman et al. 1980) found that the results of the Mauchley test can be affected greatly by such data characteristics as a non-normal distribution. Consequently, the general consensus is that the test has little value for testing the assumption of sphericity (Maxwell & Delaney 1990; Hays 1994). The preferred method of examining the assumption of sphericity is not through a dichotomous decision process as for Mauchley's test, but through the magnitude of 'epsilons', which can also be used to correct or 'adjust' the GLM for the effects of violation of sphericity. An epsilon is interpreted like a correlation coefficient in that the closer an epsilon is to one, the less the departure from sphericity there is. Unfortunately, as one can see from the relevant part of Table 2, there are three types of epsilons to choose from, which are quite different from each other (this is a common finding when working with small sample sizes). There is the Lower-bound epsilon, the Greenhouse–Geisser epsilon and the Huynh-Feldt epsilon (a name which is difficult to spell seems to be a requirement of a good statistician who invents a new statistic!). The output from SPSS probably provides all epsilons because statisticians used to be unsure which was the most accurate. The Lower-bound epsilon is the most stringent (conservative), assumes maximum violation of sphericity and will always be lower than the other two epsilons (Table 2). Vincent (1999) incorrectly stated that the Greenhouse–Geisser epsilon is the most stringent epsilon available.
What confuses some people is that Greenhouse and Geisser are responsible for giving us both the Lower-bound epsilon and their namesake epsilon (Geisser & Greenhouse 1958). So, the Lower-bound epsilon has also been referred to as the Greenhouse-Geisser method! Geisser & Greenhouse (1958) followed up the work of Box (1954) by realising that the Lower-bound epsilon overestimated the degree of violation in many cases, so they attempted to provide a statistic which was more accurate. It is this latter statistic which is labelled as the Greenhouse-Geisser epsilon in many statistical packages. As a good example of progress in statistical investigation, the statisticians Huynh & Feldt (1976) noticed that the Greenhouse-Geisser epsilon still overestimated the violation of sphericity, especially when true violation was only slight (epsilon above 0.75). These statisticians consequently provided yet another epsilon that was more accurate for assessing slight violation. Unfortunately, it was subsequently found that the Huynh-Feldt epsilon is less accurate than the Greenhouse-Geisser epsilon when violation is marked (Maxwell & Delaney 1990)! So, in the absence of a `gold-standard' epsilon that is accurate across all degrees of sphericity, what should the researcher `pick out' from the output in Table 2? Some researchers (e.g. Stevens 1992) have suggested taking an average of the Greenhouse-Geisser and Huynh-Feldt epsilons. The disadvantage of this procedure is that one would need to calculate the correction of the GLM by hand (the correction produces fractional degrees of freedom, which would be difficult to use in statistical tables). I prefer the advice of Girden (1992), which is much more in line with what is known about the accuracy of epsilons in certain conditions; that is, if the Greenhouse-Geisser epsilon is below 0.75, use this epsilon to correct the analysis, and if the Greenhouse-Geisser epsilon is above 0.75, use the Huynh-Feldt epsilon. What do I mean by `use' and `correct'? Well, epsilons are not just used as a general indicator of the degree of violation of sphericity; they may also be employed to correct the GLM for the effects of violation of sphericity, so that the overall type I error rate is not too high.
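To make the three epsilons less mysterious, the following is a minimal sketch of how they might be calculated by hand, using only the Python standard library. It assumes the usual textbook formulae (Box's estimate for the Greenhouse-Geisser epsilon and the Huynh-Feldt adjustment of it); the function names, the handling of the degenerate small-sample case and the treatment of an epsilon of exactly 0.75 are my own assumptions and are not taken from SPSS or from the papers cited above.

```python
from statistics import mean


def covariance_matrix(data):
    """Sample covariance matrix (k x k) for a table of n subjects (rows) by k levels (columns)."""
    n, k = len(data), len(data[0])
    col_means = [mean(row[j] for row in data) for j in range(k)]
    return [[sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in data) / (n - 1)
             for j in range(k)] for i in range(k)]


def lower_bound_epsilon(k):
    """Most conservative epsilon: assumes maximal violation of sphericity."""
    return 1.0 / (k - 1)


def greenhouse_geisser_epsilon(cov):
    """Box's epsilon estimate, computed from the covariance matrix of the k levels."""
    k = len(cov)
    diag_mean = mean(cov[i][i] for i in range(k))
    row_means = [mean(row) for row in cov]
    grand_mean = mean(row_means)
    sum_sq = sum(v * v for row in cov for v in row)
    numerator = (k * (diag_mean - grand_mean)) ** 2
    denominator = (k - 1) * (sum_sq - 2 * k * sum(m * m for m in row_means)
                             + (k * grand_mean) ** 2)
    return numerator / denominator


def huynh_feldt_epsilon(gg, n, k):
    """Huynh-Feldt adjustment of the Greenhouse-Geisser epsilon, capped at 1."""
    denominator = (k - 1) * (n - 1 - (k - 1) * gg)
    if denominator <= 0:
        # Assumption: in this degenerate small-sample case, fall back to the cap of 1.
        return 1.0
    return min(1.0, (n * (k - 1) * gg - 2) / denominator)


def choose_epsilon(gg, hf, threshold=0.75):
    """Girden's (1992) advice as described in the text.

    Assumption: an epsilon of exactly 0.75 is treated as 'not above' the threshold.
    """
    return ("Huynh-Feldt", hf) if gg > threshold else ("Greenhouse-Geisser", gg)
```

With a table of scores (one row per subject, one column per time of day), one would call `covariance_matrix` first, pass the result to `greenhouse_geisser_epsilon`, and then hand that epsilon to `huynh_feldt_epsilon` and `choose_epsilon`. A covariance matrix that satisfies sphericity gives an epsilon of one, whereas the worst possible case gives the Lower-bound value of 1/(k - 1).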
To correct for a violation of sphericity, the idea is that the degrees of freedom in the GLM are multiplied by the epsilon. If sphericity has been violated, this inevitably means that the degrees of freedom will reduce. If the degrees of freedom reduce, then a given F-value will be less likely to be found to be significant (i.e. the P-value associated with the F-value will increase). One can see this calculation in the portion of the output in Table 3 labelled `Tests of within-subjects effects'. The top row of the output represents the analysis assuming no violation of sphericity and has no epsilon. The fourth row represents the analysis with maximal violation of sphericity using the Lower-bound epsilon. In line with the advice above, because the Greenhouse-Geisser epsilon was found to be below 0.75, we should select this epsilon for correction. The degrees of freedom reduce from 5,25 to 2,11 (approximately) following correction, but the F-value is still statistically significant in this case. It should be noted that a P-value (significance level) of 0.000 does not mean that it is actually zero, merely that the P-value is too small to be shown with the three decimal places that SPSS works to (such a P-value would be more accurately written as P < 0.0005). So, in our example, there would in reality be a slight increase in the P-value after correction, but of course there is still good evidence that we should reject the null hypothesis of no differences between the times of day (i.e. time of day does seem to influence flexibility measurements). I need to make a few more points before I leave the univariate GLM. Firstly, if the null hypothesis cannot be rejected from the `sphericity assumed' GLM, then correcting for sphericity is a waste of time, since this would not change the conclusion reached (correction will only make a large and non-significant P-value even larger). Secondly, sphericity is only an issue when there are more than two repeated measurements in any factor. Thirdly, readers may be wondering what the `between subjects' component is doing in the output of Table 2, when there were no between-subjects factors present in the GLM in our example. Some researchers might be interested in the general variability between subjects in the sample, and SPSS provides an examination of this even if there is no formal between-subjects factor present (e.g. age).
In this respect, if the between-subjects factor is significant, it means that one has a varied or `heterogeneous' sample on the dependent variable of interest. It makes sense that there should be inter-individual differences in the general degree of flexibility in a sample of healthy young adults. I will discuss later the implications of this between-subjects variability for presentation of the data.
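The epsilon correction of the degrees of freedom described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of what SPSS does internally: the P-value is obtained from the F distribution through a hand-written regularized incomplete beta function (a standard continued-fraction evaluation), so that nothing beyond the standard library is needed, and the F-value and epsilon in the usage lines are hypothetical numbers chosen for illustration, not values from the tables in this paper.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, tol=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_survival(f_value, df1, df2):
    """P-value: probability of an F at least as large as f_value with (df1, df2) df."""
    if f_value <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f_value)
    return regularized_incomplete_beta(df2 / 2.0, df1 / 2.0, x)


def corrected_test(f_value, k, n, epsilon):
    """Multiply both degrees of freedom by epsilon and return (df1, df2, P)."""
    df1 = epsilon * (k - 1)
    df2 = epsilon * (k - 1) * (n - 1)
    return df1, df2, f_survival(f_value, df1, df2)


# Hypothetical F-value and epsilon, for illustration only.
df1, df2, p = corrected_test(f_value=5.0, k=6, n=6, epsilon=0.4)
print(f"corrected df = {df1:.1f}, {df2:.1f}; P = {p:.4f}")
```

Setting `epsilon=1.0` reproduces the `sphericity assumed' row, so calling the function twice with the same F-value shows directly how the correction leaves F untouched but makes the P-value larger.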
Analysis using multivariate general linear modelling
One may think that the difference between univariate and multivariate analyses is that the former is appropriate for only one variable of interest, while the latter can be employed for the simultaneous analysis of several variables of interest. This is indeed the difference between univariate and multivariate analyses in general. Nevertheless, there is no rule that says that multivariate methods cannot be employed when there is only one variable that has been measured. Indeed, the analysis of changes in one variable over time using multivariate methods has several potential advantages (as long as the sample size is relatively large; see below), not least because all the above work on exploring and controlling for sphericity is not needed when a multivariate GLM is used to analyse repeated measures data. Although this advantage sounds appealing, there are still assumptions associated with the multivariate GLM, the degree of violation of which determines its performance for analysing repeated measures.
Basic considerations
The difference between the univariate and multivariate techniques for analysing repeated measures is in the way the differences data between levels of the repeated measures factor are conceptualised. In the univariate method, the differences data are used to calculate an overall pooled error term and this pooled error may, as already discussed, be inaccurate for some comparisons between levels (i.e. violation of sphericity). In the multivariate method, the differences data between levels are, in essence, treated like different variables of interest, each with its own specific error term. These specific error terms are maintained throughout the analysis by utilizing complicated matrix calculations (Namboodiri 1984). While such calculations are the crux of the multivariate method, and the reason why the sphericity assumption does not apply, I would not want to burden anyone too much with matrices and determinants. Indeed, the use of the multivariate method has increased in line with the increased availability of powerful computers to do the complex calculations. In brief, the important point is that the multivariate method considers not only separate error terms for the differences data, but also the inherent relationship between the differences data. Such a characteristic makes the sphericity assumption irrelevant to the multivariate approach. The fact that the multivariate approach is not associated with the sphericity assumption makes interpretation of the analysis easier than for the univariate approach. In Table 2, one can see the `multivariate tests' component of the SPSS output. As with the univariate method, it seems as though one has a bewildering number of test statistics to choose from, such as Pillai's Trace, Wilks' Lambda, Hotelling's Trace and Roy's largest root (a student once asked me if one planted Roy's largest root in Geisser's Greenhouse!). Thankfully, the choice of a specific test statistic only becomes important when the multivariate method is applied to between-subjects factors in an experiment. One can see that, in the case of repeated measures, all these statistics give the same F-value and P-value. In our example, the P-value of 0.455 suggests that the null hypothesis of no differences over time cannot be rejected. The fact that this conclusion disagrees with that made from the univariate analysis makes one wonder about how the multivariate method compares with the univariate method, and which conclusion is the correct one. This important issue is discussed below.
Relative power of multivariate versus univariate modelling
The multivariate approach does not require the assumption of sphericity to hold, but there is an assumption of `multivariate normality', which in the case of the repeated measures design is that all data in the separate levels of the within-subjects factor are drawn from a population that is normally distributed. Simulations have shown that any violation of this assumption is much less serious than the violation of the sphericity assumption in univariate GLM (Maxwell & Delaney 1990). This adds weight to the argument that the multivariate method might be the preferred approach. However, there is a factor which affects the performance of the multivariate method to a much greater degree: sample size. The fact that sample sizes can be low in research involving humans and performance variables makes the discussion of how this factor influences choice of analysis very important. If all the assumptions of the univariate GLM are satisfied, the first thing that can be said is that this approach is `better' (in terms of not arriving at `false negative' conclusions, i.e. too many type II errors) than the multivariate approach. However, many authors maintain that the univariate assumption of sphericity rarely holds in practice (Maxwell & Delaney 1990). Therefore, it may well be that for many situations, the multivariate approach is more appropriate than the univariate approach when sphericity is not upheld. Unfortunately, a complicating factor is sample size. In general, the type II error rate of the multivariate method relative to the univariate method increases as sample size decreases, i.e. with small sample sizes the multivariate method is less accurate than the univariate method, even when the latter has been corrected for violation of sphericity. Maxwell & Delaney (1990) provided a `rule of thumb' based on the work of Davidson (1972), which says that if the number of subjects is less than 10 plus the number of repeated measures levels (k), then the multivariate method should probably not be used at all. In our example, there were six levels and six subjects in the design, so it would be expected that the multivariate method would lead to more type II errors on average than the univariate method. In support of this notion, readers are reminded that the null hypothesis could not be rejected with the multivariate approach (P = 0.455) but could be rejected with the univariate approach, even after correcting for violation of sphericity (P < 0.0005).
More importantly, if the number of subjects is less than the number of levels in the design, then the multivariate method is impossible to compute (the cells in the `multivariate tests' output of SPSS would be completely empty). To synthesize the results of data simulations into general advice, a decision tree is presented in Figure 1 to help the researcher choose the most appropriate approach as well as, in the case of selecting the univariate method, the most appropriate correction factor for violation of sphericity. It is stressed that the order of the steps in this decision tree is important; it is inappropriate to choose an approach on the basis that it is the one that provides the significant F-value!
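For readers who think more easily in code than in flow charts, the order of the decisions described above might be sketched as follows. This is only an illustration of the sequence of checks: the function name and its arguments are hypothetical, the significance level of 0.05 is an assumption, and the P-values in the usage line are stand-ins consistent with the example in the text rather than exact figures from the SPSS output.

```python
def choose_analysis(n, k, p_multivariate, p_uncorrected,
                    gg_epsilon, p_gg, p_hf, alpha=0.05):
    """Walk through the checks in the order described in the text.

    n: number of subjects; k: number of levels of the repeated measures factor.
    p_multivariate: P-value from the multivariate GLM.
    p_uncorrected: P-value from the univariate GLM with sphericity assumed.
    gg_epsilon: Greenhouse-Geisser epsilon.
    p_gg, p_hf: univariate P-values after Greenhouse-Geisser or Huynh-Feldt correction.
    alpha: significance level (0.05 is an assumption, not a rule from the paper).
    """
    if n > 10 + k:
        # Sample large enough for the multivariate approach (rule of thumb).
        approach, p = "multivariate GLM", p_multivariate
    elif p_uncorrected >= alpha:
        # Correction can only increase P, so there is no point in going further.
        approach, p = "univariate GLM, sphericity assumed", p_uncorrected
    elif gg_epsilon > 0.75:
        # Slight violation: Huynh-Feldt epsilon (Girden 1992).
        approach, p = "univariate GLM, Huynh-Feldt corrected", p_hf
    else:
        # Marked violation: Greenhouse-Geisser epsilon (Girden 1992).
        approach, p = "univariate GLM, Greenhouse-Geisser corrected", p_gg
    decision = "reject H0" if p < alpha else "do not reject H0"
    return approach, decision


# Six subjects and six levels, with hypothetical P-values and epsilon.
print(choose_analysis(n=6, k=6, p_multivariate=0.455, p_uncorrected=0.0004,
                      gg_epsilon=0.43, p_gg=0.001, p_hf=0.0005))
```

The point of writing the checks as a single chain is that the approach is fixed by the sample size and the epsilon before the final P-value is consulted, which is exactly the discipline that the decision tree is meant to enforce.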
Presentation of repeated measures data
Most researchers are aware of the importance of presenting some indication of the amount of variability in data, alongside the presentation of the central tendency statistic that has been selected (e.g. the mean). An often quoted humorous example is a person who puts one hand in a bucket of water at 2°C and the other hand in another bucket of water at 60°C and concludes that he or she is comfortably warm, on average! When several measurements have been made over time, the mean and the measure of variability in the data can be presented in table form or, to get a better idea of trends over time, in the form of a figure. For example, Figure 2 presents the means and standard deviations (SD), shown as `error bars', for the flexibility data in our example (six times of day). I would like to offer a few notes of caution in presenting repeated measures data like those shown in Figure 2, and also offer some more informative methods of presentation. Firstly, it is stressed that the standard deviation of each level mean in Figure 2 is largely influenced by the differences between, not within, subjects. In this respect, it is a mistake to think that there is something inherently `bad' about large error bars in a repeated measures design. I have witnessed professors at conferences challenge the accuracy of a research student's conclusions on the basis that the error bars are `too large' for the differences between means in a repeated measures study to be significant. If such a way of thinking is acceptable at all, it would only be appropriate for comparison of means between separate samples (groups) of subjects (not changes within subjects), and only if 95% confidence intervals, rather than standard deviations, are compared for overlap between levels. Nevertheless, it is stressed that even then, comparison of overlap of error bars would be merely an approximation of significance.
Is n > 10 + k (k = number of levels)?
  yes: Select the multivariate GLM approach. Is F-value significant?
    yes: Reject H0
    no: Not reject H0
  no: Select the univariate GLM approach. Is uncorrected F-value significant?
    no: Not reject H0
    yes: Is Greenhouse-Geisser epsilon > 0.75?
      yes: Use Huynh-Feldt epsilon. Is corrected F-value significant?
        yes: Reject H0
        no: Not reject H0
      no: Use Greenhouse-Geisser epsilon. Is corrected F-value significant?
        yes: Reject H0
        no: Not reject H0
Fig. 1 A decision tree to help with test selection and assumption checking in the analysis of repeated measures designs.
[Figure 2: plot of Flexibility (cm) against Time of day (h) at 02:00, 06:00, 10:00, 14:00, 18:00 and 22:00.]
Fig. 2 Presentation of flexibility data measured at six times of day in six subjects. Data are means and SDs calculated at each time of day.
Similarly, I have also observed some researchers being so worried about the magnitude of error bars relative to the differences between means in a repeated measures design that they use the error bars to represent the standard error of the mean (SEM), solely for the reason that this statistic will be smaller than the standard deviation, and so will not look as `bad' when presented in a figure. The choice of presentation of SD versus SEM should be based on statistical rather than cosmetic reasons, since even though these statistics are mathematically related, they have quite different meanings.
Is there a more informative way of presenting repeated measures data? The simplest method is to plot the profiles over time for individual subjects, as in Figure 3. When the data are plotted in this way, one is showing the reader clearly the between-subjects differences that are present, but the trends over time for the individual subjects can still be observed.
[Figure 3: plot of Flexibility (cm) against Time of day (h) at 02:00, 06:00, 10:00, 14:00, 18:00 and 22:00, one profile per subject.]
Fig. 3 Presentation of flexibility data measured at six times of day for the six individual subjects.
[Figure 4: plot of Flexibility (cm) against Time of day (h) at 02:00, 06:00, 10:00, 14:00, 18:00 and 22:00.]
Fig. 4 Mean and 95% confidence intervals (corrected for between-subjects variability) for the flexibility data.
An even better, but more tedious, method of presenting repeated measures data was promoted by Loftus & Masson (1994), and discussed in relation to the use of SPSS by Field (2000). The approach involves the conceptualization of the trends over time for the `average person' by normalizing subject means and expressing all changes relative to the same mean. The basic steps to follow are:
Step 1: Calculate the mean across the levels for each subject.
Step 2: Calculate the `grand' mean of all the above subject means.
Step 3: Subtract each subject's mean (calculated in step 1) from the `grand' mean (calculated in step 2) to give an `adjustment factor' for each subject.
Step 4: Add each subject's `adjustment factor' (calculated in step 3) to that subject's score at each level to form a `new' set of corrected data.
Step 5: Calculate the mean and standard deviation (or better, the 95% confidence interval) for each level on the new data in the conventional way.
Step 6: Plot the mean and standard deviation (or 95% confidence interval) in the same way as in Figure 2.
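The steps above can be sketched in Python as follows. This is a minimal illustration of the arithmetic, not the exact procedure used to produce the figures in this paper: the scores are invented, the function names are my own, and the critical t-value has to be supplied by the user (4.303 is the two-tailed 5% value for two degrees of freedom from standard t-tables, matching the three hypothetical subjects). Note also that the interval below is simply calculated from the standard deviation of the adjusted scores at each level, as in step 5, which is a simplification.

```python
from statistics import mean, stdev


def normalise_subjects(data):
    """Steps 1 to 4: remove between-subjects differences from the scores.

    data: one row per subject, one column per level of the repeated measures factor.
    """
    subject_means = [mean(row) for row in data]             # step 1
    grand_mean = mean(subject_means)                        # step 2
    adjustments = [grand_mean - m for m in subject_means]   # step 3
    return [[score + adj for score in row]                  # step 4
            for row, adj in zip(data, adjustments)]


def level_summary(data, t_critical):
    """Step 5: mean, SD and 95% confidence interval half-width for each level."""
    n = len(data)
    summary = []
    for level_scores in zip(*data):
        sd = stdev(level_scores)
        summary.append((mean(level_scores), sd, t_critical * sd / n ** 0.5))
    return summary


# Hypothetical flexibility scores (cm) for three subjects at three times of day.
raw = [[10.0, 12.0, 14.0],
       [20.0, 21.0, 25.0],
       [30.0, 33.0, 33.0]]
adjusted = normalise_subjects(raw)

for label, table in (("raw", raw), ("adjusted", adjusted)):
    for level_mean, sd, half_width in level_summary(table, t_critical=4.303):
        print(f"{label}: mean = {level_mean:.1f}, SD = {sd:.2f}, "
              f"95% CI = +/-{half_width:.2f}")
```

Running the sketch on the raw and the adjusted tables side by side shows the two properties that matter: the level means do not change, whereas the error bars shrink because every subject now has the same overall mean.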
Figure 4 presents the flexibility data `corrected' for between-subjects variability according to the methods of Loftus & Masson (1994). One can immediately see that the level means are the same in Figure 4 as they are in Figure 2, but the error bars are much smaller in Figure 4 than in Figure 2. This is so even though I have chosen to present the more appropriate 95% confidence intervals in Figure 4 rather than the SDs shown in Figure 2. These smaller error bars reflect mainly the `eradication' of the between-subjects variability in the data at each level. One can also see that the likelihood of the differences between levels being significant (the degree to which the 95% confidence intervals overlap) is now clearer. In the case of our example, there is a suggestion that the significant time of day effect is described by flexibility at the 06:00 test being significantly lower than flexibility at the other times of day, and that the 10:00 test is `midway' between the 06:00 test and the tests at the other times of day. I stress that this `eyeball check' of where differences lie following the testing of the omnibus hypothesis is only an approximation. Besides, it might make more sense to test whether the data fit a specific trend or shape, rather than try to compare all pairs of level means. In my next paper, I will discuss more formal methods of following up a general hypothesis and making multiple comparisons between levels of a within-subjects factor.
Synopsis
The main purpose of this paper was to bridge the gap between statisticians who work hard to identify the `best' statistical analyses, and physical therapists who also work hard in a very applied science. Although it is very easy in these days of powerful computers to think of statistics as a `black box' science and not to be interested in statistical theory and exploration, I hope I have shown, through this account of repeated measures analyses, that this way of thinking is dangerous. Selection of the wrong statistical test could just as easily lead to the physical therapist missing an important finding as it could mislead researchers into thinking that a useless treatment is important. Obviously, researchers strive to avoid both errors, and knowledge about statistics can help in this respect. I realise that the reader might have several unanswered, yet very pertinent, questions such as `how do I know for sure which level mean is different to other level means?' or `what if I have more than one factor in the analysis?' I have, I hope, provided an introduction to repeated measures analysis. My decision to cover the whole topic in two papers is, I hope, justified if I tell the reader that answering these additional questions may treble the amount of information that is provided in the output from a statistical package such as SPSS. Nevertheless, I hope to answer such questions in my next paper.
Acknowledgement
I thank Dr Mark Scott and Mr Nigel Balmer from Liverpool John Moores University for their comments on an earlier draft of this paper.
References
Al-Subaihi A A 2000 A Monte Carlo study of the Friedman and Conover tests in the single-factor repeated measures design. Journal of Statistical Computation and Simulation 65: 203-223
Altman D G 1991 Practical statistics for medical research. London: Chapman and Hall
Atkinson G, Nevill A M 1998 Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine 26: 217-238
Atkinson G 1994 Effects of age on human circadian rhythms in physiological and performance variable. Unpublished Doctoral thesis, Liverpool John Moores University
Bland J M 1995 An introduction to medical statistics. Oxford: Oxford Medical Publications
Boik R J, Todd C, Hyde S K 2000 Response transformations in repeated measures and growth curve models. Communications in Statistics: Theory and Methods 29: 699-733
Box G E P 1954 Some theorems on quadratic forms applied in the study of analysis of variance problems: II. Effects of inequality of variance and of correlation between errors in the two-way classification. Annals of Mathematical Statistics 25: 484-498
Davidson M L 1972 Univariate versus multivariate tests in repeated measures experiments. Psychological Bulletin 77: 446-452
George K, Batterham A, Sullivan I 2000 Validity in clinical research: a review of basic concepts and definitions. Physical Therapy in Sport 1: 19-27
Geisser S, Greenhouse S W 1958 An extension of Box's results on the use of the F-distribution in multivariate analysis. Annals of Mathematical Statistics 29: 885-891
Girden E R 1992 ANOVA: repeated measures. Sage university paper series on quantitative applications in the social sciences, 07-084. Newbury Park: Sage
Glass G V, Peckham P D, Sanders J R 1972 Consequences of failure to meet assumptions underlying analysis of variance and covariance. Review of Educational Research 42: 237-288
Hays W L 1994 Statistics. Fort Worth: Harcourt
Huck S W, Cormier W H 1996 Reading statistics and research (2nd edition). New York: Harper Collins
Hurlbert S H 1984 Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187-211
Huynh H, Feldt L S 1976 Estimation of the Box correction for degrees of freedom from sample data in randomised block and split-plot designs. Journal of Educational Statistics 1: 69-82
Field A 2000 Discovering statistics using SPSS for Windows. London: Sage
Keselman H J, Algina J, Kowalchuk B K, Wolfinger R D 1999 A comparison of the recent approaches to the analysis of repeated measurements. British Journal of Mathematical and Statistical Psychology 52: 63-78
Keselman H J, Algina J, Kowalchuk B K, Wolfinger R D 1998 A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Communications in Statistics: Simulation and Computation 27: 591-604
Keselman H J, Rogan J C, Mendoza J L, Breen L J 1980 Testing the validity conditions of repeated measures F-tests. Psychological Bulletin 87: 479-481
Loftus G R, Masson M E J 1994 Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review 1: 476-490
Matthews J N S, Altman D G, Campbell M J, Royston P 1990 Analysis of serial measurements in medical research. British Medical Journal 300: 230-235
McCall R B, Appelbaum M I 1973 Bias in the analysis of repeated measures designs: some alternative approaches. Child Development 44: 401-415
Maxwell S E, Delaney H D 1990 Designing experiments and analysing data: a model comparison perspective. Belmont: Wadsworth
Namboodiri K 1984 Matrix algebra: an introduction. Sage university paper series on quantitative applications in the social sciences, 07-38. Beverly Hills: Sage
Stevens J P 1992 Applied multivariate statistics for the social sciences. Hillsdale: Erlbaum
Vincent W J 1999 Statistics in kinesiology. Champaign: Human Kinetics Publishers