JOURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR 15, 257-266 (1976)

Discussion of Wike and Church's Comments
The following views of the immediately preceding article by Wike and Church (1976) are by Herbert H. Clark, Jacob Cohen, J. E. Keith Smith, and Geoffrey Keppel. They were written at the invitation of the editor.
Reply to Wike and Church
Wike and Church (1976) take issue with Clark (1973) on (a) what constitutes a random effect and (b) how reasonable it is to use approximate F tests and, more generally, random effect models. On the basis of (b), Wike and Church suggest (c) that "investigators continue using fixed factor designs about which more is known and seek nonstatistical generality by means of various modes of replication."1 In this note I will argue that my earlier recommendations are still optimal. First, Wike and Church have exaggerated the problems of approximate F tests and random effects models. Second, even if there were problems, the solution Wike and Church offer does not follow logically from their arguments and would foster more costly errors than it would avoid. Third, Wike and Church's characterization of what constitutes a random effect needs revision.

Wike and Church's three main complaints against random effects models are these: they require approximate F tests; they increase Type II errors; and they require strong assumptions. I will take up these three points in turn.

First, Wike and Church argue that the quasi F-ratio, F' = (MS_T + MS_TxSxW)/(MS_TxS + MS_TxW), for testing the treatments effect in a Treatments x Subjects x Words analysis of variance (see Clark, 1973, Table 1), is only approximate and therefore too risky to use.2 As evidence, Wike and Church point to a recent numerical study by Davenport and Webster (1973) of exact Type I error rates for F' under various configurations of this analysis of variance. Davenport and Webster themselves, however, conclude that F' (and a related quasi F-ratio) "do well and about equally well" except under very special conditions. With α = .05, the Type I error rates for F' varied between .038 and .053 (all should be .05) except for three outliers at .028, .033, and .035 for three highly improbable experimental configurations.3 Even though Davenport and Webster considered only small degrees of freedom (from three to seven treatments, subjects, or words), the more subjects and words there were (that is, the closer to real experimental configurations) the better was the approximation. Using Monte Carlo (not exact) techniques, Forster and Dickinson (1976)

H. H. C.'s work was supported in part by Grant MH-20021 from the National Institute of Mental Health. Dr. Clark is indebted to A. R. Jonckheere for discussion and advice on several aspects of his note. The preparation of G. K.'s paper was facilitated by Grant MH 10249 from the Public Health Service. Send reprint requests to Dr. Edwin Martin, Department of Psychology, University of Kansas, Lawrence, Kansas 66045.

1 I heartily agree that investigators should seek additional generality by replication; thus, the latter half of this suggestion per se is not at issue.

2 For convenience throughout, I will speak of "words" as the language material being sampled even though the arguments apply equally to other language materials. Also, I will illustrate the arguments only for the Treatments (T) x Subjects (S) x Words (W) factorial design (except where noted); the arguments apply to other designs in a fairly obvious way.

3 In all three configurations the variance of the unexplained error was 100 times as large as the variance of any of the interactions.

Copyright © 1976 by Academic Press, Inc. All rights of reproduction in any form reserved. Printed in Great Britain.
258
CLARK ET AL.
drew similar conclusions and in addition suggested a useful procedure for improving the approximation under the more unusual configurations. These two studies, along with others (Box, 1954; Cochran, 1954; Davenport, 1975; Hudson & Krutchkoff, 1968; Myers & Howe, 1971), show that F' is a very good approximation, at least for those conditions most likely to arise in psychological experiments.

Actually, there is no need to use approximations at all, since there are exact tests as alternatives to F'. Here I mention just three.

1. Imhof (1960) derived an exact test, based on Hotelling's T², for testing Treatments in a Treatments x Subjects x Words design.

2. Naik (1974) recently derived a more practical test that can always be used in place of F'. Let F1 = MS_T/MS_TxS and F2 = MS_T/MS_TxW, and let F1* and F2* be their critical values at the stated alpha level (e.g., .05). Then the statistic to compute is this: Z = MS_T/(F1*·MS_TxS + F2*·MS_TxW). If Z is greater than 1, the hypothesis of no difference among treatments can be rejected. The actual Type I error rate is always somewhat less than alpha, but it can be calculated exactly. Without this calculation, Z appears to be conservative compared to min F' = MS_T/(MS_TxS + MS_TxW), which uses the same three mean squares (see Clark, 1973).

3. With one degree of freedom in the numerator, min F', or rather its square root, is equivalent to an exact t test between two means with unequal variances. This test has been studied by Welch (1947), and critical values for α = .05 and .01 have been provided by Aspin (1949).

For many designs in which words are treated as a random effect, the problem of approximations does not even arise. These designs require only the ordinary F-ratio, or the Kruskal-Wallis, Mann-Whitney, Wilcoxon, or sign test (Clark, 1973, p. 348). Thus, since F' is a very good approximation for most practical purposes, since there are exact tests available to replace F' anyway, and since many designs require only the ordinary statistics, the problem of approximations is a bogus issue.

Wike and Church's next complaint is that F' is unduly conservative. By using F' to avoid Type I errors, investigators might instead make Type II errors (failing to detect differences that are actually there) and thereby become discouraged and stop investigations. "A stray Type I error in language research is not a catastrophic event." Two points here. First, it is not clear that F' is conservative (see Davenport & Webster, 1973; Forster & Dickinson, 1976, for studies on the power of F'). Second, Wike and Church's argument about Type I and Type II errors is backwards. It has long been known that Type I errors occur far more frequently than the stated level of significance (e.g., α = .05). Because journals do not accept nonsignificant findings, investigators tend to report only significant ones and discard data that do not confirm their hypotheses (Bakan, 1966; Cohen, 1962; McNemar, 1960). And because investigators generally cannot get their failures to replicate published, Type I errors, once made, are very difficult to correct. Type II errors do not have these liabilities. Bakan (1966) has further argued that "the danger to science of the Type I error is much more serious than the Type II error" (p. 427) because highly significant results appear definitive and tend to discourage further investigation. What is more serious, to my mind, is that too many Type I errors have served as foundation stones for highly influential theories or found their way into textbooks as "definitive" findings.
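To make these statistics concrete, the following sketch computes min F' and Naik's Z from a set of mean squares. All numbers are hypothetical, chosen only for easy arithmetic; in particular, the critical values F1* and F2* depend on the actual degrees of freedom and are placeholders here.

```python
# Hypothetical mean squares from a Treatments x Subjects x Words design.
# These numbers are invented for illustration; they are not from any real study.
ms_t, ms_txs, ms_txw = 48.0, 6.0, 4.0

f1 = ms_t / ms_txs  # by-subjects F-ratio, F1 = MS_T / MS_TxS
f2 = ms_t / ms_txw  # by-words F-ratio,    F2 = MS_T / MS_TxW

# min F' can be computed either from the mean squares or from F1 and F2;
# the two forms are algebraically identical.
min_f_ms = ms_t / (ms_txs + ms_txw)
min_f_f = (f1 * f2) / (f1 + f2)

# Naik's Z, using assumed critical values F1* and F2* (placeholders; the real
# values would be read from an F table at the appropriate degrees of freedom).
f1_crit, f2_crit = 4.0, 3.5
z = ms_t / (f1_crit * ms_txs + f2_crit * ms_txw)

print(min_f_ms)  # 4.8
print(min_f_f)   # 4.8
print(z > 1.0)   # True: reject the hypothesis of no treatment difference
```

The identity min F' = F1F2/(F1 + F2) is what lets a reader recover min F' whenever F1 and F2 are both reported.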
A "stray Type I error" can indeed be "catastrophic." Under Wike and Church's recommended solution, however, Type I errors would not even be "stray." When words are treated as fixed when they should be treated as random, Type I errors can occur as much as 60% of the time even though the alpha level is only 5% (see Forster & Dickinson, 1976; Naik, 1974).4 It seems unwise to let the stated "level of significance" become a meaningless measure of Type I error rate.

4 Theoretically, the Type I error rate under these conditions can approach 100%. Paradoxically, when there is no real difference, the investigator is more likely to report that there is a real difference the more subjects he uses. This paradox leads to great difficulties in interpreting experiments in which words are inappropriately treated as fixed (see below).

Wike and Church's last complaint is that random effects models assume statistical independence in the population effects of Words, Subjects, and their interaction, and they argue that this assumption may not be justified. However, analysis of variance models all rely on assumptions that in practice are rarely justified. In the past, these models have been shown to be fairly robust under moderate violations of such assumptions. What little is known about violations of these assumptions of independence (see Box, 1954) suggests that random effects models may be fairly robust too. Wike and Church's own recommendation that we draw "nonstatistical" generalizations from replicated experiments, however, is just as vulnerable to their criticism as are the random effects models. On close examination, the logic behind generalizations from successive experiments, even when the generalizations are called "nonstatistical," is ultimately the same as the logic behind generalizations from random effects models (see below).

Let us assume with Wike and Church for the moment that treating both subjects and words as random effects is too risky and that there must be only one random effect. The question is, which effect should be random: subjects or words? Wike and Church opt, arbitrarily, for subjects. The argument, however, is symmetrical, and they could just as well have opted for words. Under Wike and Church's option, the investigator would report F1 = MS_T/MS_TxS (or the equivalent). Under the alternative option, he would report F2 = MS_T/MS_TxW (or the equivalent). Both options assume models with just one random effect, exactly as Wike and Church require.

Once the two options have been spelled out, however, Wike and Church's logic no longer stands. If the investigator could report either F1 or F2, there is no reason not to report both. But if he does that, he needs a rule for deciding whether his finding is "significant" or not. The obvious rule would be to judge a finding significant only if F1 and F2 are both significant. Yet this rule consistently leads to Type I error rates that are spuriously high (Clark, 1973; Forster & Dickinson, 1976). A rule that combines F1 and F2 and yet yields Type I error rates very close to the stated alpha level is this: Judge a finding significant only if min F' = F1F2/(F1 + F2) is significant (Clark, 1973; Forster & Dickinson, 1976). But this rule follows just as directly from treating subjects and words simultaneously as random effects. Thus, even if we were to accept Wike and Church's conclusions about approximations and random effects models, we are still led to the same decision rule: F' and min F' are the most appropriate statistics to use in this design.

The problems Wike and Church see in the random models would not really be ameliorated by their recommendation to seek nonstatistical generality through various modes of replication. If anything, the problems would be exacerbated. Imagine that Investigator A reported a finding with a significant F1, as prescribed by this recommendation. How many more such experiments with fresh subjects and words would we need to see before we could be confident this finding was genuine? F1 does not give us any clue. With a Type I error rate of 50% (and this could easily happen) replications by Investigators B, C, and D still leave us with a one-in-eight chance it was not real. If A had reported F2, we could assess these chances far more accurately (though probably not exactly). So even under Wike and Church's recommendation to seek generality by replication, it does not make sense for Investigator A to deprive others of F2. But if he does report F2, the earlier argument holds, and there is no reason not to report F', min F', or the equivalent.
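The inflated Type I error rates that arise when words are treated as fixed are easy to reproduce by simulation. The sketch below is a hypothetical Monte Carlo check, not a reanalysis of any cited study: it generates data with no true treatment effect but with random treatment-by-word effects that are shared across subjects (the same word lists reused for every subject), then runs only the by-subjects test. All parameters (20 subjects, 8 words per condition, unit variances) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_words, n_sims = 20, 8, 2000
t_crit = 2.093  # two-sided .05 critical value for t with 19 df

rejections = 0
for _ in range(n_sims):
    # Random treatment-by-word effects: the same two word sets are reused
    # for every subject, so their sampling error is shared by all subjects.
    tw = rng.normal(0.0, 1.0, size=(2, n_words))
    subj = rng.normal(0.0, 1.0, size=n_subjects)
    noise = rng.normal(0.0, 1.0, size=(2, n_subjects, n_words))
    # No true treatment effect appears anywhere in the model.
    scores = tw[:, None, :] + subj[None, :, None] + noise
    # Collapse over words and run the by-subjects test alone (a paired t,
    # equivalent to F1 with two treatments).
    cond_means = scores.mean(axis=2)
    diff = cond_means[0] - cond_means[1]
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n_subjects))
    if abs(t) > t_crit:
        rejections += 1

print(rejections / n_sims)  # far above the nominal .05 with these settings
```

Because the by-subjects error term never sees the word-sampling variability, the nominal 5% level wildly understates the actual rejection rate under the null, which is the point of the 50-60% figures cited above.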
In Clark (1973, pp. 352-355), I suggested that under many circumstances the investigator may want to proceed by the "method of single cases" in which words are treated as fixed effects. In using this method, the investigator has at least one obligation: He must list the words he used and the scores he found for each. That is, he must show words the same respect he would show any other fixed effect. It would be unthinkable for a medical researcher, say, to test an effect under 10 separate drugs without listing the drugs and the results for each. The same goes for words.5

5 Note that under the method of single cases, the reader can compute F2 and hence min F' even if the investigator himself did not think it was warranted. So no matter whether the investigator treated words as a random effect or worked by the method of single cases, the reader then has the information needed to judge the generality of the finding.

What I have said so far, except for the last paragraph, applies only when words ought to be treated as a random effect. The question is, when should they be so treated? In Clark (1973) the way I answered this question, unhappily, was open to misinterpretation (the interpretation Wike and Church chose). I suggested that words ought to be treated as random only if (i) the levels chosen do not deplete the population from which they were sampled, (ii) the levels are sampled at random from the population, and (iii) the investigator wants to generalize to the population from which the levels are drawn. Taken out of context, the statements Wike and Church quote seem to emphasize (i), yet I clearly stated (ii) and (iii).

The confusion may have arisen because of a further point I was trying to make. It concerns the two meanings of the word random. For the statistician, the word random has a very special meaning (call it random-s). A factor in an analysis of variance is random-s if its levels, a finite number of them, are drawn without replacement from an infinite population of levels so that each has an equal chance of being drawn. This ideal, on analogy with balls being drawn from an urn, makes the mathematics neat and tractable. For the practicing experimentalist, however, random-s can only be a model. He can rarely, if ever, hope to specify each level in the full population ahead of time and draw a sample giving each population level an equal chance. He has to make do with approximate procedures for defining random (call it random-p). Then, just as the assumption of normality can serve as a model for his real data, so can the assumption of a random-s sample serve as a model for his random-p sample. Random-s is an ideal. Random-p is reality.

What, then, is a random effect? Wike and Church remark about Rubenstein et al.'s experiment that "there appears to be no evidence that the words within homophones and nonhomophones were selected at random." Wike and Church here mean random-s, but this misses the point. Practically speaking, Rubenstein et al. could not have listed all homophones and nonhomophones, put them in a hat, and drawn out a sample at random-s. They had to use an approximate procedure that led to a "quasi-random" sample "representative" of the homophones and nonhomophones they wished to generalize to. Indeed, they drew inferences from their random-p sample to the population as if it were a random-s sample. This is common practice (see Landauer & Meyer, 1972). Note that by Wike and Church's criteria subjects ought to be treated as fixed most of the time too. They are almost never drawn at random-s from their population, even when the population is defined as narrowly, say, as "Stanford University students taking introductory psychology." In most experiments, subjects are selected no more at random-s than words, and when this holds, it hardly seems justified to treat subjects as random and words as fixed.

Even when the procedure for selecting words is very rough and could hardly be called random-s, there are good practical grounds for treating words as random anyway (assuming that the method of single cases is not practical to use). Imagine that A has used such a loose selection procedure and that B reads A's experiment, including the procedure, and wants to build on A's findings. The first thing B will want to know is, even if he selects words by A's procedure, can he duplicate A's finding or would it be a waste of time to try? To make even a rough assessment, B needs information about the distribution of scores for A's sample of words. This information would not be available if A has followed Wike and Church's recommendation, treating words as a fixed effect and reporting only F1. (Recall that Type I error rates can soar with an F1 of any size.) However, it will be available if A treats words as a random effect and reports F', min F', or the like. Thus, even when the selection procedure is dubious, Wike and Church's recommendation could be very costly of time and effort.

In summary, Wike and Church's recommendation should be weighed against my own (Clark, 1973) on the basis of costs and benefits. I have argued that the costs of the arbitrarily high and unknown Type I error rates under Wike and Church's recommendation are enormous compared to the slight costs of approximation and potential Type II errors under my recommendation.

HERBERT H. CLARK
Stanford University
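Clark's footnote on the method of single cases observes that a reader can recompute F2, and from it min F', whenever the investigator lists the words and the score found for each. The sketch below shows that computation for a hypothetical design with two treatments crossed with four words, one (subject-averaged) score per cell; the numbers are invented purely to show the arithmetic.

```python
# Hypothetical per-word means (averaged over subjects) for two treatments
# crossed with the same four words; invented numbers for illustration only.
scores = [
    [1.0, 2.0, 3.0, 4.0],  # treatment 1, words 1-4
    [3.0, 4.0, 5.0, 8.0],  # treatment 2, words 1-4
]

n_t, n_w = len(scores), len(scores[0])
grand = sum(sum(row) for row in scores) / (n_t * n_w)
t_means = [sum(row) / n_w for row in scores]
w_means = [sum(scores[t][w] for t in range(n_t)) / n_t for w in range(n_w)]

# Two-way ANOVA with one observation per cell: the Treatments x Words
# interaction serves as the error term for the by-words test.
ss_t = n_w * sum((m - grand) ** 2 for m in t_means)
ms_t = ss_t / (n_t - 1)
ss_txw = sum(
    (scores[t][w] - t_means[t] - w_means[w] + grand) ** 2
    for t in range(n_t) for w in range(n_w)
)
ms_txw = ss_txw / ((n_t - 1) * (n_w - 1))

f2 = ms_t / ms_txw  # the by-words F-ratio, F2 = MS_T / MS_TxW
print(f2)  # 25.0
```

With F1 available from the published by-subjects analysis, min F' then follows from F1F2/(F1 + F2), which is why listing the per-word scores gives the reader everything needed to judge generality.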
Random Means Random
I find myself in essentially complete agreement with Wike and Church. I would marshal my argument with Clark along two lines, covering the cases of randomness accomplished and randomness assumed.

1. To the extent to which Clark's position is the advocacy that language materials used in experiments actually be randomly sampled from enumerated categories, my objection is only moderate. One practical difficulty with such an approach lies in the enumeration of
the language population which defines the category. Assuming this can be done in any given experiment and that the theoretical issues can be addressed in terms of the populations being randomly sampled, there then arise the thorny statistical problems which are associated with hypothesis testing and estimation of random effects. In addition to the uncertainties associated with quasi F-ratios and error term pooling decisions, there are the consequences of failure of the basic analysis of variance assumptions. It is not sufficiently appreciated that the well-advertised robustness of F tests holds only for fixed-effects models; in random-effects models, nonnormality plays hob with parameter estimation and conformity to tabled α and power rates. It turns out that departures from normal kurtosis in particular do heavy damage: platykurtosis (short, thin tails) makes the actual α smaller than the nominal α and thus also reduces power, and leptokurtosis (fat, heavy tails) does the reverse. The central limit theorem does not supply its usual succor here, nor is there much else that can be done about this. Scheffé, who discusses the problem of nonnormality in random models with characteristic thoroughness and clarity (1959, Chap. 10), sadly concludes, "The situation is not very hopeful, and normal-theory inferences about variance components must be accepted as being much less reliable than those about means" (p. 363).

2. But Clark goes far beyond advocating the actual random sampling of words, mandating the analysis of any set of words as though they were a random sample from the word population in that category on the simple grounds that the words used do not exhaust the population. My objection here is, if anything, even stronger than that of Wike and Church. It is difficult to find any basis for Clark's position other than wishful thinking: he proceeds as though his need to generalize to the language population confers upon him not only the right but the duty to do so.
But random means random, and the effects of words selected on any other basis will not have
the properties that justify referring the resulting "F"-ratios (let alone quasi "F"-ratios) to a standard F table for assessment in the usual manner. (Parameter estimation is obviously completely out of the question.) The quotation from Scheffé at the end of the preceding paragraph continues, "This conclusion is reinforced by the consideration that models with variance components have, even without the normality assumptions, a rather tenuous relation to those frequent applications where nothing is done to insure the random sampling of the effects which is assumed in the model" (p. 363).

Wike and Church reject selection on the grounds that the luck of the draw may produce a set of stimuli which is not representative or does not have large enough effects. This goes to the heart of the matter. If one wishes to make inferences about a population of effects, random sampling is a prerequisite for statistical induction. But theory usually imposes structural demands which make simple random sampling unsuitable. One wishes to select a sample which is representative, not of a language population, but of a theoretical framework. One thing which random sampling assumes is the unrepresentativeness of some samples, the denizens of the tails of sampling distributions. When population effects are normally distributed, random samples will be unrepresentative to a precisely defined degree, namely that of the medium tails of the normal distribution, "mesokurtosis." The F distribution allows precisely for this degree of unrepresentativeness. But what is the kurtosis of the effects in the "population" in the experimenter's head? Who knows? But it seems a reasonable speculation that the goal of representativeness will tend to produce effects like the temperature of Goldilocks' little bear's porridge, "not too hot, not too cold, but just right," that is, a distribution deficient in "unrepresentative" effects: a short-tailed, platykurtic distribution.
But this is the kind whose actual α is less than that of the tabled value; if representative enough, that is, rectangular, the tabled α = .05 F value is actually at about α = .001. Since power is a function of the true α, such "representative" distributions will result in less (perhaps much less) power than for any old random sample. (Thus is virtue rewarded!) Perhaps this explains why Clark's random-effects reanalysis of the Rubenstein et al. data wiped out so many significant effects.

JACOB COHEN
New York University
The Assuming-Will-Make-It-So Fallacy
Lincoln once posed the question, "How many legs does a horse have, if we assume his tail to be a leg?" Assuming a collection of verbal materials to be a random sample provides no more support for a random-effects model than the horse's fifth leg provides for him. Nevertheless it does seem useful to me to consider whether, had these materials been a sample, the main effect would still be large enough to justify confidence in its reality. The answer to this question should not be considered conclusive, however it turns out. The haphazard choice of experimental materials might as well result in uncharacteristically small interactions as in uncharacteristically large main effects. In the end one really cannot reap the benefits, such as they are, of random sampling without sampling randomly.

It has often seemed to me that workers in this field counterbalance and constrain word lists to such an extreme that there may in fact be no other lists possible within the current English language. Perhaps the experimenter should be required to provide another, unused, set of materials which he feels to be equally appropriate. Certainly a random-factor interpretation is questionable if this small generality is impossible.

Secondly, I would comment on the point made by Wike and Church that excesses in the name of conservative analysis have their costs. Frequently the experimental analysis ignores
the counterbalancing mentioned above, which generates a set of suppressed factors. These factors are typically independent of the explicit factors (that is what good control is) but are not separated from error. To the extent that the counterbalancing was necessary, the error term is thereby inflated to an unknown extent, and conservatism wins again. I have nothing against conservatism, at least in experimental inference, but should it not be out in the open where the experimenter can decide how much power to spend for how much conservatism?

Finally, I do not opt for either way of making F-tests exclusively. For me, Clark's contribution was to remind us forcefully that generalization across materials is as important as generalization across subjects, and to publicize techniques for considering that generalization. No one should accept his dogma or that of his critics. Authors will continue to be responsible for defending the generality of their finding across subjects and materials, statistically where appropriate, logically always.

J. E. KEITH SMITH
University of Michigan
Words as Random Variables
Clark (1973) brought to our attention potential problems in the analysis of experiments in which linguistic materials provide the basis for the treatment manipulations. The main thrust of Clark’s thesis was to argue for the adoption of a random-effects model rather than a fixed-effects model when the results of linguistic manipulations are to be generalized beyond any given experiment. The other reviewers in this current exchange have concentrated their comments on the statistical basis of Clark’s argument. These discussions reveal that we should be wary of any set of prescribed procedures or rules that are to be applied to all data analyses and that we should continually remind ourselves that the “state of the art” in statistics is often much more
tentative and controversial than the authors of statistical texts and statistical articles would imply. In the present situation, we see that there is controversy, which simply means that we should exert caution and informed thoughtfulness when analyzing and interpreting data generated from experiments of the sort with which Clark is concerned. I will not add to this controversy, but instead restrict my comments to a brief consideration of some of the limitations and difficulties associated with the assumption of words as a random independent variable.

Before we get involved in this discussion, we should keep in mind that the statistical model which underlies or justifies any statistical analysis is often nothing more than an approximation to the particular data at hand. A researcher must certainly be prepared to justify whatever position is taken on fixed versus random variables as well as on such matters as the selection of a per comparison significance level, the necessity to control experiment-wise error rate when conducting multiple comparisons, the substitution of planned comparisons for omnibus F tests, the use of univariate versus multivariate approaches, and the appropriateness of an additive or nonadditive model with repeated measures. But a researcher should also understand that a statistical analysis is designed to assist us in arriving at conclusions concerning the operation of psychological processes. A clear specification of what analysis was done and why should not mislead a sophisticated researcher. On the contrary, it should remind us that there may be some question concerning the reliability of a particular finding and suggest that an early replication is in order.

Let us now turn to the current problem. Suppose we decide to manipulate a linguistic variable and feel that random sampling of materials is the appropriate thing to do. While it may not be obvious, random sampling of stimulus materials is a fairly complicated procedure.
What is usually not recognized is that there are frequently critical theoretical
decisions that must be made in order to define precisely the population to be sampled. With the Clark noun-verb example, we might have to decide whether or not certain types of nouns, like proper nouns, or certain types of verbs, like intransitive verbs, or words with very low frequency of occurrence, and so on, should be included in the sampling. All of these questions must be resolved and made specific before a researcher can begin to sample randomly from the final pools of materials.

We will assume that the definitional problems have been solved and that we have randomly selected a list of nouns and a list of verbs to be employed in the experiment. Since the two sets of words were randomly sampled from two clearly defined populations, the random model would presumably be appropriate. But who would be satisfied with this particular design even though it is possible to analyze the outcome? To be more specific, we should be reluctant to accept any experiment in which a single list of material is used to define a linguistic manipulation. While the use of the same materials for all subjects may be economical of time and money, the procedure is actually reckless due to the distortion that may occur as a result of randomly selecting the words for the experiment. This danger is significantly reduced if we include, at a minimum, a second set of materials in the experiment. The problem could essentially be eliminated if we greatly expanded the random selection and drew a different random list of nouns or of verbs for each subject. Moreover, this particular arrangement of the materials might result in an additional important benefit, namely, the use of a less controversial statistical analysis. That is, according to Clark (1973, p. 348), this latter type of design allows the use of much simpler statistics and may even avoid the use of quasi F-ratios or of the statistical tests with low power recommended by Clark in his 1973 paper.
It would seem, therefore, that a researcher would be well advised to construct a separate list for each subject and thus avoid the complicated solutions Clark describes in his paper, which as we see in this present exchange of views are not yet embraced by all statisticians.6

6 An even better solution might be to assign more than one subject to each of several lists. This arrangement would allow the assessment of list differences and the possibility of individual item analyses, which often provide useful supplementary information. A design falling between the two extremes of one list and many lists is not exactly fit by the corresponding statistical models.

Clark's original discussion concentrated primarily on the appropriateness of a particular statistical analysis and not on the more fundamental question of whether or not a researcher should sample linguistic materials randomly. I would like to argue that in most cases random sampling is not the best choice of methods for constructing lists of verbal materials. In essence, the problem with random sampling is the possible covariance of other characteristics of verbal materials and the resulting ambiguity that such a confounding produces. Suppose, for instance, that we were interested in the detectability of "pleasant" versus "unpleasant" words. Since it is known that pleasant and unpleasant words differ on such relevant attributes as frequency of occurrence in the language, word length, the initial letter of the word, and the predictability of letter pairs within a word, a random sample of pleasant and unpleasant words will reflect these differences in addition to pleasantness, the variable under study. Consequently, any differences in detectability observed in this experiment as a function of pleasantness may be hopelessly confounded by the presence of these other differences. If our only purpose in this experiment were to determine whether words differing in pleasantness (and anything else) affected detectability, the design would be appropriate. But if we wanted to determine whether pleasantness per se affected detectability, the design is inappropriate. Thus, random sampling of verbal materials is of limited value to the researcher when what essentially consists
of a multidimensional variable is being sampled. The results of such an experiment are generally not analytical and often are only suggestive. The solution, of course, is to construct lists in which words are matched on all relavant attributes other than pleasantness so that the independent operation of this variable may be observed with the confoundings neutralized. Lists constructed in this analytical way are obviously not randomly selected, and the fixed-effects model would certainly appear to be more appropriate than the random model. Again, a careful attention to design sidesteps the problems of analysis raised in Clark’s paper. A final point to be considered is that even with the most precise matching of words all that one is left with is a correlation between a linguistic attribute, like pleasantness, and some response measure, like detectability. While the establishment of such a correlation may be important. it is only a first step in the analysis of a particular phenomenon. The next stage of such an analysis is not to sample more words, but to subject the phenomenon to careful experimental analysis. These analyses consist of manipulations that are designed to illuminate the operation of assumed processes. In the pleasantness example, a critical experiment might be one in which a researcher creates in the laboratory nonwords that are “pleasant” and “unpleasant” by pairing them consistently with pleasant and unpleasant events, respectively. Differences in detectability observed with these artificially created pleasant and unpleasant words would help to establish the causal link between the variable and the behavior under study. In short, while the random seIection of linguistic materials will produce results that are representative of the population from which the materials were sampled, all that we will probably end up with is a correlation between a nonunitary stimulus dimension and the dependent variable. Unless we can
neutralize the variation of correlated stimulus attributes and establish a unitary dimension, any conclusions about underlying processes will be difficult if not impossible to draw. At this stage, however, experimental analysis will take over, and we will design studies which point directly to the operation of particular processes and which are engineered to rule out alternative explanations and interpretations.
In conclusion, it appears that Clark's thoughtful discussion of the research designs utilized when linguistic variables are manipulated is of limited applicability. More specifically, random selection of stimulus materials is usually uninformative unless great care is taken to rule out the covariation of other relevant stimulus attributes; under those circumstances, however, the adoption of a random model hardly seems appropriate. Moreover, even an experiment in which materials are, in fact, randomly sampled may not require complicated (and controversial) statistical analysis if one includes a different list of stimulus materials for each subject. Finally, the correlational approach is only a first step in the analysis of cognitive processes. For the experimental psychologist, the purpose of using words is to tap or to arouse differentially certain cognitive processes. Eventually, these processes will be studied more directly by means of a variety of converging experimental operations. We assume that it is these latter investigations that will bring us eventually to an understanding of the behavior under study.
GEOFFREY KEPPEL
University of California, Berkeley
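The matching strategy Keppel describes can be sketched computationally. The word lists, frequency values, and tolerance below are hypothetical illustrations, not data from the literature; the point is simply that each "pleasant" word is paired with an "unpleasant" word of the same length and similar frequency, so that pleasantness is the only attribute left free to vary.

```python
# Sketch of matched-list construction (hypothetical data and tolerance).
# Each word carries attributes that could otherwise confound pleasantness:
# frequency of occurrence and word length.

def build_matched_lists(pleasant, unpleasant, max_freq_diff=5):
    """Greedily pair each pleasant word with an unpleasant word
    of identical length and similar frequency."""
    pairs = []
    available = list(unpleasant)
    for p_word, p_freq in pleasant:
        for u_word, u_freq in available:
            if len(p_word) == len(u_word) and abs(p_freq - u_freq) <= max_freq_diff:
                pairs.append((p_word, u_word))
                available.remove((u_word, u_freq))
                break
    return pairs

# Hypothetical (word, frequency-per-million) attribute lists.
pleasant = [("sunny", 40), ("happy", 120), ("gem", 18)]
unpleasant = [("grimy", 38), ("angry", 115), ("rot", 20)]

matched = build_matched_lists(pleasant, unpleasant)
# Each pair is matched on length and (approximately) frequency,
# leaving pleasantness as the attribute under study.
```

Lists built this way are deliberately selected rather than randomly sampled, which is exactly why the fixed-effects model seems the more natural fit for such designs.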
REFERENCES
ASPIN, A. A. Tables for use in comparisons whose accuracy involves two variances separately estimated. Biometrika, 1949, 36, 290-293.
BAKAN, D. The test of significance in psychological research. Psychological Bulletin, 1966, 66, 423-437.
BOX, G. E. P. Some theorems on quadratic forms applied in the study of analysis of variance problems: I and II. Annals of Mathematical Statistics, 1954, 25, 290-302, 484-498.
CLARK, H. H. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 1973, 12, 335-359.
COCHRAN, W. G. Testing linear relations among variances. Biometrics, 1954, 7, 17-32.
COHEN, J. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 1962, 65, 145-153.
DAVENPORT, J. M. Two methods of estimating the degrees of freedom of an approximate F. Biometrika, 1975, 62, 682-684.
DAVENPORT, J. M., & WEBSTER, J. T. A comparison of some approximate F-tests. Technometrics, 1973, 15, 779-789.
FORSTER, K. I., & DICKINSON, R. G. More on the language-as-fixed-effect fallacy: Monte Carlo estimates of error rates for F1, F2, F', and min F'. Journal of Verbal Learning and Verbal Behavior, 1976, 15, 135-142.
HUDSON, J. D., & KRUTCHKOFF, R. C. A Monte Carlo investigation of the size and power of tests employing Satterthwaite's synthetic mean squares. Biometrika, 1968, 55, 431-433.
IMHOF, J. P. A mixed model for the complete three-way layout with two random-effect factors. Annals of Mathematical Statistics, 1960, 31, 906-925.
LANDAUER, T. K., & MEYER, D. E. Category size and semantic-memory retrieval. Journal of Verbal Learning and Verbal Behavior, 1972, 11, 539-549.
MCNEMAR, Q. At random: Sense and nonsense. American Psychologist, 1960, 15, 295-300.
MYERS, R. H., & HOWE, R. B. On alternative approximate F-tests for hypotheses involving variance components. Biometrika, 1971, 58, 393-396.
NAIK, U. D. On tests of main effects and interactions in higher-way layouts in the analysis of variance random effects model. Technometrics, 1974, 16, 17-25.
WELCH, B. L. The generalization of Student's problem when several different population variances are involved. Biometrika, 1947, 34, 28-35.
WIKE, E., & CHURCH, J. Comments on Clark's "The language-as-fixed-effect fallacy." Journal of Verbal Learning and Verbal Behavior, 1976, 15, 249-255.
(Received February 23, 1976)