International Journal of Psychophysiology 31 (1999) 155-161
Dealing with baseline differences: two principles and two dilemmas

John Jamieson*

Department of Psychology, Lakehead University, Thunder Bay, ON, Canada P7B 5E1

Received 16 June 1998; received in revised form 1 September 1998; accepted 1 September 1998
Abstract

Two principles are presented that describe directional confounds associated with baseline differences: Principle 1: Change scores are confounded with baseline whenever data are skewed. Principle 2: When baseline differences are real, ANCOVA has a directional bias that magnifies differences in one direction and masks those in the other direction. Both principles involve a directional bias that is related to the direction of baseline difference and the direction of the hypothesized difference in change. Ethical dilemmas arise if decisions (whether or not to transform, whether or not to use ANCOVA) are chosen in order to maximize power by capitalizing on the directional bias and Type 1 error. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Baseline; Change scores; ANCOVA; Ethical dilemma; Skewness
* Corresponding author. Tel.: +1 807 3438738; fax: +1 807 3467734; e-mail: [email protected]

0167-8760/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-8760(98)00048-8

1. Introduction

A common problem facing psychophysiologists is how to control for baseline differences between groups or conditions. When groups differ in baseline levels, one feels obliged to ensure that other group differences, such as amount of change from baseline to levels during a task, are not an artefact of these baseline differences. An apparent solution is to use ANCOVA, with the baseline value as a covariate, in order to remove its possible influence, but this is often the wrong decision. ANCOVA does not simply ‘control for’ or ‘eliminate’ baseline differences; rather it has specific, directionally sensitive effects which are not widely recognized. The purpose of this note is to present a readable, non-mathematical description of the effects of baseline differences in order to provide researchers with a better framework on which to base their statistical decisions. Unfortunately this knowledge leads to ethical dilemmas because it can be used to maximize power at the risk of increased Type 1 error.

For the past decade this author has been trying to understand the factors which cause a relationship between baseline and change (Jamieson and Howk, 1992; Jamieson, 1993), and the implication of these factors for statistical analyses comparing groups (Jamieson, 1995) or looking for correlates of change (Jamieson, 1994). This work was initially driven by the Law of Initial Values (Wilder, 1967), which postulated that psychophysiological measures were governed by biological ‘floors’ and ‘ceilings’ that created a relationship between change and baseline. It is now clear that the Law of Initial Values is not a major influence on psychophysiological measures, and that the relationship between change and baseline is generally due to measurement error, i.e. regression to the mean (Myrtek and Foerster, 1986; Jin, 1992; Geenen and van de Vijver, 1993). However, the question remains why baseline differences might affect the amount of change. What factors can cause the amount of change to be related to baseline differences, i.e. what are we controlling for? At present there appear to be only two established reasons why the comparison of changes between groups may be related to (confounded by) baseline differences. These two factors are skewness and inappropriate use of ANCOVA. They are described below from the perspective of two principles.

2. Principle 1: Change scores are confounded with baseline whenever data are skewed

Some measures, such as response latency, are inherently skewed; others may be skewed because of ceilings or floors. In an earlier paper, Jamieson and Howk (1992) illustrated how, with skewed distributions, when the mean changes, the shape of the distribution also changes. With a positively skewed distribution, higher scores change more than lower scores. With a negatively skewed distribution, the opposite occurs. The rule is that scores in the tail of the distribution change more than scores in the head of the distribution.
For example, with latency scores which are positively skewed, a low mean is closer to the floor (near zero) and has less room to decrease. A higher mean is further from the floor and has more room to decrease. Thus the amount of change is a function of the baseline level.

Fig. 1. Skewed distributions change shape as the mean changes. Scores near the tail (A1-A2) will change more than scores near the middle (B1-B2).

Fig. 1 illustrates this principle for a positively skewed distribution. Skewed distributions change shape as the mean shifts. An increase in the mean during the task is represented by the shift from the dotted line (baseline) to the solid line (task). Conversely a task which produces a decrease would be represented by a shift from the solid line (baseline) to the dotted line (task). Two scores which maintain their ordinal position as the curve shifts are represented as A and B. A, which is in the upper tail of the distribution, changes more (A1-A2) than B, which is in the middle of the distribution (B1-B2). The point of this illustration is that the amount of change is a function of the baseline level when distributions are skewed.

Another way to illustrate this principle is to consider a measurement model in which the true dimension is normally distributed, but the observed measure is positively skewed. Suppose a square root transformation will change the observed measure to a normal distribution, eliminating the skewness and thereby reflecting the true underlying process. Consider a numerical illustration, in which observed scores of 25 and 100 correspond to true scores of 5 and 10 (the square roots). If a task produced an increase of one in each true score, from 5 to 6 and from 10 to 11, this would correspond to changes in the observed scores from 25 to 36 and from 100 to 121. The changes for the observed scores are 36 - 25 = 11 and 121 - 100 = 21. Thus the observed scores show a difference (21 - 11 = 10) in the amount of change, even though the true scores both have the identical increase of one unit. The positive skewness caused the higher score to change more.

If two groups have different baselines, then they can be expected to change differently, simply as a function of the skewness of the measure. The group with its baseline mean closer to the tail will change more. For example, with a positively skewed distribution, just by chance, the group with the higher baseline mean will increase more (if the task produces an increase) or will decrease more (if it produces a decrease) than the group with the lower baseline. With negatively skewed data, just by chance, the group with the lower baseline mean will decrease more (if the task produces a decrease) or increase more (if it produces an increase) than the group with the higher baseline mean. Thus simply by looking at the direction of baseline differences, and the direction of the skewness, it is possible to predict which group will change more as a function of the skewness.

Fig. 2 illustrates how skewness will interact with baseline differences to produce apparent differences in changes from baseline to task. The effects illustrated in Fig. 2 are simply caused by skewness. Failure to transform the data could result in these differences being identified as significant, when they are in fact just artefacts of skewness and baseline differences.

Often one is faced with a decision of whether or not to apply a transformation to eliminate the skewness. An ethical dilemma arises if one makes this decision based on whether the transformation will assist or hinder detection of a hypothesized effect. Fig. 2 illustrates the changes that will result simply from skewness, and that might be detected as significant if the data are not transformed.
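The square-root illustration in this section can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming (as in the text) that the observed score is simply the square of the true score; the function name `observed_change` is hypothetical and introduced only for this example.

```python
import math

def observed_change(true_before, true_gain):
    """Change on the observed (squared) scale for a given gain on the true scale."""
    observed_before = true_before ** 2
    observed_after = (true_before + true_gain) ** 2
    return observed_after - observed_before

# The same one-unit gain on the true scale, starting from true scores of 5 and 10.
low = observed_change(5, 1)     # observed 25 -> 36
high = observed_change(10, 1)   # observed 100 -> 121
print(low, high, high - low)    # 11 21 10

# Square-root transforming the observed scores recovers equal changes.
low_t = math.sqrt(36) - math.sqrt(25)
high_t = math.sqrt(121) - math.sqrt(100)
print(low_t, high_t)            # 1.0 1.0
```

On the raw scale the higher baseline shows the larger change, while on the transformed scale the two changes are identical, which is the confound that Principle 1 describes.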
Using this knowledge of skewness effects to decide whether or not to transform will yield more significant findings, but at the cost of an increased rate of Type 1 errors. For example, with positively skewed data and a task which produces an increase (e.g. a stressor), suppose you hypothesize a greater increase in Group A than in Group B. If you look at the baselines and find that Group A has the higher baseline, you might be tempted not to transform the data since the skewness ‘works for’ an effect in this direction. This case is represented as the top left illustration in Fig. 2. The group with the higher baseline will increase more just due to the skewness. On the other hand, if Group A had the lower baseline, a real effect would have to overcome the difference in the opposite direction caused by skewness, so a transformation would greatly improve the chances of detecting a statistically significant difference. Moreover, as there is often a choice of transformations (square root, reciprocal, log), one may be tempted to use a transformation that overcorrected, to yield a slightly negatively skewed distribution, since a negatively skewed distribution would ‘work for’ the direction of effect you would like to ‘find’.

Fig. 2. Differential change in means from baseline to task simply due to skewness interacting with baseline differences.

Clearly the above scenarios are highly inappropriate, as they are capitalizing on Type 1 errors. Moreover, statistical decisions should never be made based on the direction of baseline differences and should be made before the data are collected. Of course, the appropriate guideline is always to transform skewed data (Tabachnick and Fidell, 1996).

A further point is that the effects of skewness are magnified if difference scores (task minus baseline) or, equivalently, repeated measures ANOVA are used to compare groups. ANCOVA, with baseline as a covariate, is much better, although it will not remove the effect of skewness as well as a transformation. However, ANCOVA is often inappropriate for reasons described below.

3. Principle 2: When baseline differences are real, ANCOVA has a directional bias that magnifies differences in one direction and masks them in the other direction

The term ‘real’ is used to include within-subjects designs in which different treatments, for example different drugs, may produce different baselines, as well as between-subjects designs, in which naturally occurring groups have different baselines. The important distinction is between real baseline differences that will be maintained, vs. baseline differences produced by random assignment that are due to chance and will be affected by regression to the mean. This principle is an extension, to include within group conditions, of the guideline that ANCOVA should not be used with naturally occurring groups (Huitema, 1980).

ANCOVA applies an adjustment for baseline differences that ‘expects’ the baselines to regress toward the mean [see Huitema (1980), or Reichardt (1979), for a more rigorous explanation]. Thus high scores are expected to decrease upon retesting. This is an appropriate correction when baseline differences are due to chance (randomized experiment), but it creates a directional bias when baseline differences are real.

The most dramatic demonstration of this bias is known as Lord’s Paradox (Lord, 1967). Lord used an example of male and female adolescents weighed on 2 consecutive years. Males and females started at different mean baseline weights but both groups gained an identical amount of weight. Even though both genders had identical mean weight gains, they were found by ANCOVA to have significantly different changes. Males, with the higher baseline weight, were found by ANCOVA to increase more than females, even though the amount of weight gain was identical.
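Lord’s Paradox can be reproduced in a small simulation. The sketch below is only an illustration under stated assumptions: each person has a stable weight plus independent measurement error at each weighing, both groups gain the same amount, and the ANCOVA effect is taken as the follow-up difference adjusted by the pooled within-group slope. The group means (100 and 140 lbs), standard deviations, gain, sample size, seed and all function names are hypothetical choices, not values from Lord (1967).

```python
import random
from statistics import fmean

def simulate_group(n, true_mean, rng, true_sd=10.0, error_sd=5.0, gain=10.0):
    """Return (baseline, followup) lists; every group receives the same mean gain."""
    baseline, followup = [], []
    for _ in range(n):
        stable = rng.gauss(true_mean, true_sd)            # person's stable weight
        baseline.append(stable + rng.gauss(0, error_sd))  # year 1, with error
        followup.append(stable + gain + rng.gauss(0, error_sd))  # year 2
    return baseline, followup

def pooled_within_slope(groups):
    """Pooled within-group regression slope of follow-up on baseline."""
    sxy = sxx = 0.0
    for x, y in groups:
        mx, my = fmean(x), fmean(y)
        sxy += sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx += sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)  # fixed seed so the sketch is repeatable
females = simulate_group(5000, 100.0, rng)
males = simulate_group(5000, 140.0, rng)

baseline_gap = fmean(males[0]) - fmean(females[0])
followup_gap = fmean(males[1]) - fmean(females[1])
slope = pooled_within_slope([females, males])

# Difference-score question: did males gain more than females?
gain_difference = followup_gap - baseline_gap
# ANCOVA question: follow-up difference after adjusting for baseline.
ancova_difference = followup_gap - slope * baseline_gap

print(round(gain_difference, 2), round(slope, 2), round(ancova_difference, 2))
```

Under these assumptions the difference in gains is close to zero, while the ANCOVA-adjusted difference is clearly positive for the group with the higher baseline, because the within-group slope is below one.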
The explanation for Lord’s Paradox is that ANCOVA asks a conditional question that focuses on how much the groups would be expected to change if they had both come from a population with the same baseline mean, that is, the average of the observed baseline means. Thus if the baseline mean for females was 100 lbs and for males 140 lbs, how would the two groups have been expected to change if they had both come from a population with a mean of 120 lbs? As 140 lbs is above the population mean, it would be expected to decrease (regression toward the mean), while females starting at a mean of 100 would have been expected to increase.

Fig. 3 illustrates Lord’s Paradox. The solid lines connect the (hypothetical) observed mean weights, and the dotted lines connect baseline to the values predicted by ANCOVA under the null hypothesis of no differential change in the two groups. Note the regression to the mean in the predicted values. The observed mean for males is greater than expected while the observed mean for females is less than expected. ANCOVA yields the conclusion that males increased significantly more than did females. Thus the identical mean change is interpreted by ANCOVA as relatively greater for males (who were expected to decrease) and relatively less for females (who were expected to increase). However, this conditional question is artificial and esoteric, and the answer is rarely of interest when the baseline differences are real. Moreover, it produces a directionally biased adjustment when interest is simply in which group increased more.

Fig. 3. Illustration of Lord’s Paradox: hypothetical observed weights for males and females over 2 years (solid lines) and the weights predicted by ANCOVA (dotted lines) under the null hypothesis.

The rule governing this directional bias is that ANCOVA magnifies ‘consistent’ differences (the group with the higher baseline increasing more or the group with the lower baseline decreasing more), and masks ‘inconsistent’ differences (the group with the higher baseline decreasing more or the group with the lower baseline increasing more). ANCOVA ‘expects’ the means to come closer together just by chance. To see this, it is useful to visualize the usual plot of an interaction, with separate lines for each group, the dependent variable on the y-axis, and two points on the x-axis: baseline and task. Fig. 4 illustrates four cases of the identical sized differential changes. In the left hand column, ANCOVA will be more likely than difference scores to detect the differential change as significant. Conversely, for the two cases in the right hand column, difference scores will be more likely than ANCOVA to detect the differential change as significant. ANCOVA is biased to find significance in means which stay apart (parallel lines, as in Lord’s Paradox) or which diverge. Conversely it is biased against detecting significant effects when the lines converge.

Fig. 4. Illustration of differential changes which are more likely to be detected as significant by ANCOVA (first column) or difference scores (second column).

Frequently one is faced with a choice between ANCOVA or difference score (repeated measures ANOVA) methods for comparing the changes between groups. Guidelines on when to use each of these methods are not readily available (Wainer, 1991) although a number of sources caution against use of ANCOVA with naturally occurring groups (e.g. Huitema, 1980; Rogosa, 1988; Gottman and Rushe, 1993). An ethical dilemma arises if one makes the decision between these two methods based on which will have more power for detecting a difference in the hypothesized direction. Fig. 4 illustrates the patterns of findings for which ANCOVA or difference scores are each more likely to yield significance. Using this knowledge to decide between ANCOVA and difference scores will yield more significant findings, but at the cost of an increased rate of Type 1 errors.

For example, with a task which produces an increase (e.g. a stressor) suppose you hypothesize that Group A will increase more than Group B. If you look at the baselines and find that Group A has the higher baseline, you might be tempted to use ANCOVA since it will have more power to detect a difference in this direction. On the other hand, if Group A had the lower baseline, then difference scores would have more power to detect the hypothesized effect. Clearly the above scenario represents inappropriate behavior and capitalizes on Type 1 error. Choice of statistical method should be made before the data are collected and based on more substantive criteria. The criterion (guideline) should be: ‘Do not use ANCOVA when baseline differences are real’.

Unfortunately, the requirement of making the decision before the data are collected is not sufficient to avoid ethical problems. This follows because there are a number of applications in psychophysiology where the direction of baseline differences is predictable beforehand. For example, males generally have lower average resting heart rates than females, physically fit people have lower resting heart rates than the unfit, those with a family history of hypertension have higher resting blood pressure, etc.
Thus it is possible to combine this knowledge with the hypothesized direction of group differences to identify which of ANCOVA or difference scores will have more power, and to make this decision before the data are collected. However, such a decision is clearly capitalizing on Type 1 error. Moreover ANCOVA is not appropriate for any of these cases because the baseline differences are all real, not chance differences.

It should be acknowledged that ANCOVA is the correct model to answer a conditional question: ‘If the groups had come from a population with the same baseline level, which would have increased (or decreased) more?’ Unfortunately, when baseline differences are real, the conditional question addresses an esoteric issue that is rarely of interest, and is more often misleading. In contrast, the difference or change score (task minus baseline) question asks: ‘Which group increased (or decreased) more?’ So long as the data are not skewed, the difference score question, which is consistent with a linear additive model, provides a more appropriate answer to whether the groups changed differently.

There is increasing acceptance of the value of difference scores as measures of change, and earlier reservations about lack of reliability of difference scores are no longer viewed as valid limitations (e.g. Llabre et al., 1991). Moreover it has recently been pointed out that the issue of reliability may be quite irrelevant, since it ignores individual variation. Thus reliability could be zero but measures of change could be perfectly valid: ‘who cares whether change scores are reliable, if this says nothing about whether they are precise measures of change’ (Collins, 1996, p. 290).

The above discussion dealt with the issue of comparing changes between groups. The directional bias of ANCOVA also applies when the covariate and dependent variables are different scales. In this case, the correlation between the covariate and dependent variable may be positive or negative. When the correlation is positive, ANCOVA has more power to detect consistent effects (e.g. the group with the higher covariate mean having a relatively higher mean on the dependent variable) than inconsistent effects. When the correlation is negative, it has more power to detect inconsistent effects (e.g. the group with the higher mean on the covariate having a relatively lower mean on the dependent variable) than consistent effects. Again, when covariate differences are real, this adjustment can produce misleading conclusions.
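The directional rule of this section can be sketched by comparing the two effect estimates for the same sized differential change. This is a rough illustration, not the paper’s own computation: it assumes the ANCOVA effect is the task difference minus the pooled within-group slope times the baseline difference, and the means, the slope of 0.5 and the function names are all hypothetical. It compares effect sizes only and says nothing about standard errors.

```python
def difference_score_effect(base_a, task_a, base_b, task_b):
    """Group A's change minus Group B's change."""
    return (task_a - base_a) - (task_b - base_b)

def ancova_effect(base_a, task_a, base_b, task_b, slope):
    """Baseline-adjusted task difference (A minus B) for a pooled within-group slope."""
    return (task_a - task_b) - slope * (base_a - base_b)

slope = 0.5  # assumed within-group slope; typically below 1 when baseline contains error

# 'Consistent' pattern: Group A has the higher baseline and increases more.
consistent = dict(base_a=80, task_a=100, base_b=70, task_b=80)
# 'Inconsistent' pattern: Group A has the lower baseline and increases more.
inconsistent = dict(base_a=70, task_a=90, base_b=80, task_b=90)

for label, means in (("consistent", consistent), ("inconsistent", inconsistent)):
    d = difference_score_effect(**means)
    a = ancova_effect(**means, slope=slope)
    print(label, d, a)
# consistent 10 15.0
# inconsistent 10 5.0
```

In both patterns Group A increases by 10 units more than Group B, yet the adjusted estimate is larger than 10 in the consistent case and smaller in the inconsistent case. With a negative slope (covariate and dependent variable negatively correlated) the direction of the adjustment reverses.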
These two principles are known to statisticians and can be shown mathematically to be true. However, the fact that both principles lead to a predictable directional bias is less widely known. By looking at the baseline differences between groups (or conditions), and knowing the expected (hypothesized) direction of difference in changes, it is always possible to identify which of ANCOVA or difference scores will have more power, by applying the rule included in this paper. This leads to an ethical dilemma since this increased ‘power’ is due to an increased risk of Type 1 error.

In the case of skewness, the confound between baseline and change is necessarily true. It is always possible to determine which group will change more, simply due to the skewness. It is only necessary to look at the baseline differences between the groups, the direction of the skewness (positive or negative) and apply the rule included in this paper. Again, increased ‘power’ based on increased Type 1 error will result if this information is used in deciding whether or not to transform the data.

It is interesting to observe the parallels between the two principles and their ethical dilemmas. Both involve a directional bias that is related to the direction of baseline difference and direction of hypothesized differences in change. In each case there is a guideline for the correct action: transform skewed data, avoid ANCOVA with real baseline differences. However, in each case there is also an apparently acceptable alternative action (not transforming; using ANCOVA) that might be chosen in order to maximize power by capitalizing on the directional bias and Type 1 error. The purpose of this note is to draw attention to this directional bias and the importance of clear guidelines for dealing with baseline differences.

Acknowledgements

I would like to thank Robert J. Barry and his colleagues at the University of Wollongong for their assistance with this paper and for providing such a stimulating experience during my sabbatical.
References

Collins, L.M., 1996. Is reliability obsolete? A commentary on ‘Are simple gain scores obsolete?’. Appl. Psychol. Meas. 20, 289-292.
Geenen, R., van de Vijver, F.J.R., 1993. A simple test of the law of initial values. Psychophysiology 30, 525-530.
Gottman, J.M., Rushe, R.H., 1993. The analysis of change: issues, fallacies and new ideas. J. Consult. Clin. Psychol. 61, 907-910.
Huitema, B.E., 1980. The Analysis of Covariance and Alternatives. Wiley, New York.
Jamieson, J., 1993. The law of initial values: five factors or two? Int. J. Psychophysiol. 14, 233-239.
Jamieson, J., 1994. Correlates of reactivity: problems with regression based methods. Int. J. Psychophysiol. 17, 73-78.
Jamieson, J., 1995. Measurement of change and the law of initial values: a computer simulation study. Educ. Psychol. Meas. 55, 38-46.
Jamieson, J., Howk, S., 1992. The law of initial values: a four factor theory. Int. J. Psychophysiol. 12, 53-61.
Jin, P., 1992. Toward a reconceptualization of the law of initial value. Psychol. Bull. 111, 176-184.
Llabre, M.M., Spitzer, S.S., Saab, P.G., Ironson, G.H., Schneiderman, N., 1991. The reliability and specificity of delta versus residualized change as measures of cardiovascular reactivity to behavioral challenges. Psychophysiology 28, 701-711.
Lord, F.M., 1967. A paradox in the interpretation of group comparisons. Psychol. Bull. 68, 304-305.
Myrtek, M., Foerster, F., 1986. The law of initial value: a rare exception. Biol. Psychol. 22, 227-237.
Reichardt, C.S., 1979. The statistical analysis of data from non equivalent group designs. In: Cook, T.D., Campbell, D.T. (Eds.), Quasi-Experimentation. Rand McNally, Chicago.
Rogosa, D., 1988. Myths about longitudinal research. In: Schaie, K.W., Campbell, R.T., Meredith, W., Rawlings, S.C. (Eds.), Methodological Issues in Aging Research. Springer, New York.
Tabachnick, B.G., Fidell, L.S., 1996. Using Multivariate Statistics, 3rd ed. Harper Collins, New York.
Wainer, H., 1991. Adjusting for differential base rates: Lord’s Paradox again. Psychol. Bull. 109, 147-151.
Wilder, J., 1967. Stimulus and Response: The Law of Initial Value. Wright, Bristol.