INTELLIGENCE 15, 151-156 (1991)
EDITORIAL
Causal Inferences From Observational Data: Use a Redesigned Cross-Lagged Methodology

Lloyd G. Humphreys
University of Illinois, Urbana-Champaign
Ten years ago Rogosa's (1980) critique of the cross-lagged correlation methodology largely contributed to a virtual halt in its use by psychological researchers, but his conclusions were too sweeping. It is essential to recognize the effects on the cross-correlations of differences in the stability of true scores from one occasion to another, as Rogosa clearly demonstrated, as well as the effects of differences in measurement error, a long-recognized constraint. If parallel forms' reliabilities are obtained for each measure on each occasion, the observed intercorrelations can be corrected for attenuation in a dependable fashion. A possible causal relation is revealed by positive nonzero multiple regression weights or causal paths from the estimated true scores for both measures on the first occasion in predicting one or the other, but not both, on the second occasion. Given a positive finding, an interpretation that individual differences on one measure are anticipating individual differences on the other is conservative. Stronger causal inference requires additional research.

This research was supported by the Department of Psychology, University of Illinois, Urbana-Champaign. Correspondence and requests for reprints should be sent to Lloyd G. Humphreys, 603 East Daniel, Champaign, IL 61820.
Many important human problems cannot be studied experimentally because random assignment of persons to treatments, the core of experimental control, is not possible. In other cases, when attempts are made to bring a complex problem into the laboratory for study, the treatment in the experiment becomes a pale shadow of the treatment that occurs in the social milieu. Quasi-experimental designs are important because research on problems not conducive to laboratory control is important. Research should not be avoided because random assignment to treatments is lacking, but the limitations of other designs must be understood. A corollary is that causal inferences should be advanced with due regard for those limitations. One such design, the cross-lagged correlation, is a currently neglected methodology.
THE BASIC CROSS-LAGGED DESIGN
The simplest design involves identical or parallel forms' measures of each of two variables administered on two occasions. The resulting four score distributions generate six correlations: two concurrent ($r_{X_1Y_1}$ and $r_{X_2Y_2}$), two trait stabilities ($r_{X_1X_2}$ and $r_{Y_1Y_2}$), and two cross-lagged ($r_{X_1Y_2}$ and $r_{X_2Y_1}$). A difference in the size of the two cross-lagged correlations was expected to reveal a causal connection from X to Y or from Y to X, depending on the direction of the difference.
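For readers who want to see the design concretely, the following is a minimal sketch (added here, with hypothetical variable names) of how the six correlations would be computed from four score vectors; nothing in it goes beyond the description above.

```python
# Minimal sketch of the basic two-variable, two-occasion design.
# x1, x2, y1, y2 are hypothetical arrays of scores for the same persons on two occasions.
import numpy as np

def cross_lagged_correlations(x1, x2, y1, y2):
    """Return the six correlations generated by measuring X and Y on two occasions."""
    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    return {
        "concurrent_t1": r(x1, y1),  # r_X1Y1
        "concurrent_t2": r(x2, y2),  # r_X2Y2
        "stability_X":   r(x1, x2),  # r_X1X2
        "stability_Y":   r(y1, y2),  # r_Y1Y2
        "cross_X1_Y2":   r(x1, y2),  # r_X1Y2
        "cross_X2_Y1":   r(x2, y1),  # r_X2Y1
    }

# The classical test compared the two cross-lags, e.g.:
#   rs = cross_lagged_correlations(x1, x2, y1, y2)
#   difference = rs["cross_X1_Y2"] - rs["cross_X2_Y1"]
```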
RECENT HISTORY

Ten years ago, burgeoning psychological research based on the cross-lagged methodology came to a virtual halt. This was largely due to the conclusions reached in a critique of the methodology by Rogosa (1980) that appeared in a widely read journal. Consider the following quotations (p. 257): "No justification was found for the use of CLC." "CLC is best forgotten." "CLC should be set aside as a dead end."

A contributing cause may well have been the rapidly growing popularity of the LISREL methodology. LISREL allows investigators to test causal path models in cross-sectional data with estimates of error-free latent variables. Cross-sectional data are easier to find or gather than longitudinal data, which adds to their attractiveness and the attractiveness of LISREL.

It is the thesis of this editorial that a longitudinal facet in a research design has a great deal more potential for valid causal inferences than does the cross-sectional design. This potential does not disappear, contrary to Rogosa's conclusions, even if measures of only two variables, for which LISREL is not applicable, are studied longitudinally.

CONSTRAINTS ON THE METHODOLOGY

Prior to Rogosa's article, one constraint on the use of the cross-lagged correlations was widely known: A difference in the reliabilities of X and Y could produce a spurious difference in the size of the cross-correlations. Kenny (1975) developed a methodology for correcting for differences in reliability that was applicable in the absence of independently determined parallel forms' reliabilities in the sample (a typical characteristic of the data commonly used). Even if X and Y were standardized tests, reliabilities from test manuals are not accurate estimates of reliabilities in new samples.

Humphreys and Parsons (1979) used an alternative methodology for correcting correlations for the effects of measurement error that was applicable only to measures administered on four or more occasions. Their methodology also has a second weakness, discussed analytically by Rogosa and Willett (1985), in that it depended on the assumption that longitudinal growth data are fit by Guttman's (1954) simplex model. A satisfactory fit for one model does not, of course, preclude satisfactory fits for other untried models.

Rogosa's (1980) methodological contribution can be described verbally and numerically, omitting his equations. He showed analytically that correcting for the effects of measurement error was not sufficient. A cross-lagged difference
can arise from a difference in the two true score stabilities of X and Y that has no implication for a causal connection from X to Y or from Y to X. After correcting for the effects of measurement error, if individual differences in X change more rapidly than individual differences in Y, $r_{X_1Y_2}$ will be greater than $r_{X_2Y_1}$ in the absence of a causal connection of the sort that the methodology was expected to detect. It is also of interest that the Humphreys and Parsons (1979) article contained an accurately interpreted numerical example of this effect, anticipating Rogosa by a year. Those authors, however, merely warned of the artifact and did not reject the design.

NUMERICAL EXAMPLES

Table 1 provides two numerical examples of effects on the differences in cross-lagged correlations of a difference in the stabilities of individual differences in true scores of X and Y. Both examples are similar to those used by Rogosa. The correlations on the left illustrate a difference in cross-lagged correlations without the expected causal significance; those on the right illustrate a directional effect from Y to X in the absence of a difference. The multiple regressions that document these statements appear below the correlations upon which they are based. In the left-hand equations, the predictability of neither measure at Time 2 is improved by knowledge of both measures at Time 1, but both measures at Time 1 have positive regression weights in one of the two equations on the right.

The directional effect of Y on X in the R-matrix on the right required not only a nonzero regression weight of $Y_1$ in the multiple with $X_1$ when predicting $X_2$, but satisfaction of two psychometric assumptions mentioned but not emphasized earlier: X and Y must each be the same or a parallel test of itself on each occasion of measurement, and all correlations must be corrected for the effects of random measurement error. Because these conditions are satisfied in this example, a form of causal inference from the nonzero regression weight of $Y_1$ in predicting, in conjunction with $X_1$, performance on $X_2$ is supported. Rogosa's equations were accurate, but his conclusions were too sweeping.
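To make the arithmetic explicit (an illustration added here, not part of Rogosa's or Humphreys and Parsons' presentations): if the change in each variable, that is, its Time 2 true score residualized on its own Time 1 true score, is uncorrelated with the other variable at Time 1, then each cross-lagged correlation is simply the concurrent correlation multiplied by the relevant true score stability. The left-hand values in Table 1 behave exactly this way:

\[
r_{X_1Y_2} = r_{X_1Y_1}\, r_{Y_1Y_2} = (.700)(.950) = .665, \qquad
r_{X_2Y_1} = r_{X_1Y_1}\, r_{X_1X_2} = (.700)(.850) = .595 .
\]

The cross-lags therefore differ only because the stabilities differ.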
TABLE 1
Two Examples of Cross-Lagged Outcomes Between Measures of Unequal True Score Stabilities

A Cross-Lagged Difference

          X1      X2      Y1
  X2     .850
  Y1     .700    .595
  Y2     .665    .681    .950

$R^2_{X_2 \cdot X_1 Y_1} = (.850)(.850) + (.000)(.595) = .7225$
$R^2_{Y_2 \cdot X_1 Y_1} = (.950)(.950) + (.000)(.665) = .9025$

No Cross-Lagged Difference

          X1      X2      Y1
  X2     .850
  Y1     .662    .629
  Y2     .629    .693    .950

$R^2_{X_2 \cdot X_1 Y_1} = (.772)(.850) + (.117)(.629) = .7298$
$R^2_{Y_2 \cdot X_1 Y_1} = (.950)(.950) + (.000)(.629) = .9025$
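The regression weights and multiple correlations in Table 1 follow from the correlations by the standard two-predictor formulas. The sketch below (added here for illustration; the function name is my own) reproduces the values reported above.

```python
# Standardized regression weights and R^2 for two predictors of a Time 2 criterion,
# computed from correlations alone (the Table 1 entries are treated as corrected values).

def two_predictor_regression(r_1c, r_2c, r_12):
    """Betas and R^2 for predictors 1 and 2 with criterion c.

    r_1c, r_2c: correlations of predictors 1 and 2 with the criterion;
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    b1 = (r_1c - r_2c * r_12) / denom
    b2 = (r_2c - r_1c * r_12) / denom
    return b1, b2, b1 * r_1c + b2 * r_2c

# Left-hand matrix, predicting X2 from X1 and Y1:
print(two_predictor_regression(.850, .595, .700))  # approx. (.850, .000, .7225)

# Right-hand matrix, predicting X2 from X1 and Y1:
print(two_predictor_regression(.850, .629, .662))  # approx. (.772, .118, .730); Table 1 rounds the weight to .117
```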
THE FORM OF CAUSAL INFERENCE
Individual differences in any behavioral characteristic are never completely stable. That is, the stability of individual differences over an appreciable interval of time is never equal to the square root of the product of the reliabilities of measurement on the two occasions. There are, therefore, determinants of change both for X and for Y in the R-matrix on the left in Table 1. There are also, in all probability, overlapping determinants of change for the two variables in that matrix, but there are no cross-effects. For the R-matrix on the right there is a cross-effect, but it is an exaggeration to conclude that Y causes X. It does not tax credulity, however, to conclude that individual differences in Y are anticipating individual differences in X. Y1 does not add a great deal to the prediction of X2 over that obtainable from X1 in these examples, but Y1 is showing early the effects of determinants of change that affect X later.

Humphreys and Parsons (1979) had access to data on 16 tests administered on four occasions between the 5th and 11th grades to a constant sample. Using their simplex process model they concluded that individual differences in aural comprehension of language anticipated individual differences on a broad composite of 15 printed cognitive tests by 2 to 4 years during that time period. The composite was heavily weighted with complexly worded academic achievement and aptitude tests and would have been highly correlated with a standard printed test of general intelligence. Carroll and Maxwell (1979) suggested that the printed tests required a level of reading skill that individual examinees were approaching at different rates at this stage of development, which was a viable interpretation of the finding. Subsequently, however, Humphreys and Davey (1983) completed an analysis in which the second measure was a composite of simply worded information tests, not all of them academic, with similar results. There may be something more fundamental involved than reading lagging behind listening in these grade groups.

Both of these examples made use of a method of estimating reliabilities that creates a degree of uncertainty about outcomes, but the degree is quite limited. Patterns in the observed correlations over four occasions, psychometric expectations concerning likely levels of reliabilities of the individual tests used in the research, and the increased levels of reliability expected from composites formed by those tests drastically restrict the uncertainty.

RECOMMENDATIONS CONCERNING DESIGN

Nevertheless, in contrast to past practice of myself and others, it seems clear that a cross-lagged analysis should be designed and the data gathered for that express purpose. Let X and Y each be composed of the sum of two separately timed and carefully selected parallel forms. The two forms can be printed in the same booklet, and only seconds need elapse between the expiration of the time limit on
Form A and the start of work on Form B. This format was used without observable artifact on all experimental tests in one of the Aviation Psychology Research Units of the Army Air Forces during WWII (Humphreys, 1947). The correlations between parallel forms are available, after extension to tests of double length by the Spearman-Brown formula, for the correction for attenuation of all correlations in the cross-lagged design. (This reliability estimation procedure could be adopted with profit in other research as well.)

Compulsive statisticians object to using correlations corrected for attenuation in multiple regression, but the problem is trivial when there are only two predictors and obtained score reliabilities are high: The higher the reliability, the smaller the correction and the lower the probability of obtaining an impossible pattern among the three corrected correlations. Standard errors for the regression weights are also nonexistent, but the problem is not insurmountable, especially when the preceding conditions hold. As a first pass, use the standard errors of the regression weights as if the correlations had been obtained rather than estimated, and impose a substantially more stringent level of alpha. Most psychological data violate one or more assumptions of sampling error formulas to some degree anyway. Also, being able to place sample statistics into significant and nonsignificant categories is vastly overvalued as the way to do good science (Meehl, 1990). Outcomes can be replicated.

Although I had to make do with two versions of simplex-derived reliabilities, I have computed regression weights for six pairs of multiple regressions for the estimated true score data of Humphreys and Parsons, and Humphreys and Davey. The weights for aural comprehension and the composite in predicting later composites, compared with their weights in predicting later aural comprehension, support, on average, the conclusions based on the simplex process model. There is evidence that individual differences in aural comprehension anticipate individual differences in a broad intellectual composite and a narrower information composite, but the regression analyses are not neat and tidy. There are too many negative regression weights for either composite, of which some appear to be significant, in predicting aural comprehension. Parallel forms' reliabilities might produce outcomes more consistent with expectations.

It is also possible to adapt the two-variable, two-occasion model for use with LISREL. Obtain three or more parallel measures, in place of the two previously recommended, of each of the two variables. The latent variables estimated are the true scores of X and Y, and the estimated path coefficients are the regression weights linking true scores from Occasion 1 to Occasion 2. If the true scores of X and Y are too narrow to meet the needs of the research, one can select three measures having different "methods" variance. The error-free latent variables estimated are now the effects common to the three separate measures of each construct. In this latter design I would still advise that each of the six measures used, three for each hypothetical construct, be administered in independently timed parallel forms for use in checking the validity of the measurement model.
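As a concrete sketch of the recommended analysis (the steps come from the text above; the function and variable names are my own illustrative assumptions): step each occasion's parallel-form correlation up to double length with the Spearman-Brown formula, correct all six correlations for attenuation, and compute the two-predictor regression weights on the estimated true scores.

```python
# Sketch of the recommended pipeline: Spearman-Brown -> correction for attenuation ->
# two-predictor regressions on estimated true scores. Names are illustrative assumptions.
import math

def spearman_brown(r_forms):
    """Reliability of the double-length test from the correlation between its two parallel forms."""
    return 2 * r_forms / (1 + r_forms)

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for attenuation due to measurement error."""
    return r_xy / math.sqrt(rel_x * rel_y)

def cross_lagged_regressions(obs, rel):
    """obs: observed correlations keyed by ordered pairs such as ("X1", "Y2");
    rel: double-length reliabilities keyed by measure ("X1", "X2", "Y1", "Y2").

    Returns the standardized weights of X1 and Y1 in predicting X2 and Y2 true scores."""
    c = {pair: disattenuate(r, rel[pair[0]], rel[pair[1]]) for pair, r in obs.items()}

    def betas(r_1c, r_2c, r_12):
        denom = 1.0 - r_12 ** 2
        return (r_1c - r_2c * r_12) / denom, (r_2c - r_1c * r_12) / denom

    b_x1, b_y1 = betas(c[("X1", "X2")], c[("Y1", "X2")], c[("X1", "Y1")])      # predicting X2
    b_y1_y, b_x1_y = betas(c[("Y1", "Y2")], c[("X1", "Y2")], c[("X1", "Y1")])  # predicting Y2
    return {"X2": {"X1": b_x1, "Y1": b_y1}, "Y2": {"Y1": b_y1_y, "X1": b_x1_y}}
```

A clearly nonzero weight for Y1 in predicting X2, with no corresponding weight for X1 in predicting Y2, would then be read conservatively as individual differences in Y anticipating individual differences in X.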
CONCLUSIONS

There is a continuing need to obtain clues concerning causation in problem areas in which controlled experimentation is impossible or inadequate. A well-designed and analyzed cross-lagged study is still a promising methodology for this purpose. The prime feature of the cross-lagged methodology upon which this evaluation is based is the longitudinal facet in the design. I have no bias against complex LISREL designs per se, but LISREL also has the attractive feature that it can be used just as readily for longitudinal data on two variables as for cross-sectional data on many variables.

REFERENCES

Carroll, J., & Maxwell, S. (1979). Individual differences in cognitive abilities. Annual Review of Psychology, 30, 603-640.

Guttman, L. (1954). A generalized simplex for factor analysis. Psychometrika, 20, 173-192.

Humphreys, L.G. (1947). Some commonly used statistical techniques. In J.P. Guilford & J.I. Lacey (Eds.), Research with printed tests (chap. 3). Washington, DC: U.S. Government Printing Office.

Humphreys, L.G., & Davey, T.C. (1983). Anticipation of gains in general information: A comparison of verbal aptitude, reading comprehension, and listening (Tech. Rep. No. 282). Champaign, IL: Center for the Study of Reading.

Humphreys, L.G., & Parsons, C.K. (1979). A simplex process model for describing differences between cross-lagged correlations. Psychological Bulletin, 86, 325-334.

Kenny, D.A. (1975). Cross-lagged panel correlation: A test for spuriousness. Psychological Bulletin, 82, 887-903.

Meehl, P.E. (1990). Appraising and amending theories: The strategy of Lakatosian defence and two principles that warrant using it. Psychological Inquiry, 1, 108-141.

Rogosa, D. (1980). A critique of cross-lagged correlations. Psychological Bulletin, 88, 245-258.

Rogosa, D., & Willett, J.B. (1985). Satisfying a simplex structure is simpler than it should be. Journal of Educational Statistics, 10, 99-107.