Physiology & Behavior, Vol. 25, pp. 159-161. Pergamon Press, 1982. Printed in the U.S.A.
A Comment on Confusion in Open-field Studies: Abuse of Null-Hypothesis Significance Test
TOSHIAKI TACHIBANA
Institute for Developmental Research, Aichi Prefectural Colony, Kasugai, Aichi 480-03, Japan
Received 5 August 1981
TACHIBANA, T. A comment on confusion in open-field studies: Abuse of null-hypothesis significance test. PHYSIOL. BEHAV. 29(1) 159-161, 1982.--The confusion in open-field studies that has resulted from misinterpretation of statistical results was examined. Re-examination of several results that appeared discrepant on the basis of significance tests revealed that the discrepancy was not as serious as had been thought. Examples of highly reliable but substantially small experimental effects were presented to indicate the erroneous impression they give. The expectation that consistent results can always be obtained even from a small sample was criticized. In order to reduce some of this confusion, the use of strength of association (η²), together with consideration of statistical power, in the interpretation of open-field results was recommended.
Open-field studies   Null-hypothesis test
THE open-field test (OFT) has been widely used in the behavioral study of rodents. However, there have been many conflicting results in OFT studies, and two critical reviews have appeared [1,16]. Differences in OFT results may be, as Walsh and Cummins [16] pointed out, a consequence of differences in variables such as species, strain, age, experimental situation, etc. However, discrepancies resulting from differences in experimental methods are inevitable to some extent. Furthermore, in order to establish the generality of OFT, studies using different methods are needed. In this note, I attempt to draw attention to another type of discrepancy, a discrepancy in results due to the misinterpretation of statistical tests. In current OFT studies, the null-hypothesis test of significance has been the primary standard in the analysis of OFT data. In view of the positive correlation between statistical power and sample size, it is clear that the significance test alone is not a sufficient indicator of the importance of results. Indeed, when large samples are used, almost any difference among means and almost any correlation different from zero are likely to be statistically significant [6]. However, most OFT researchers have neglected this fact and use the significance test as the sole criterion in the evaluation of experimental effects. This has resulted in increased confusion in the interpretation of OFT results. The object of this note is, therefore, to point out this confusion and to propose the use of an index of strength of association, in conjunction with consideration of statistical power, as a remedy. In so doing, this note will comment critically on two OFT reviews [1,16]. The discussion that follows is not new, but it may be salutary as a fresh warning of the effect of misinterpreting statistical tests in OFT studies.
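The dependence of significance on sample size can be made concrete with a small sketch. It assumes the Fisher z approximation to the sampling distribution of r, which is adopted here for illustration only and is not a procedure taken from the studies discussed. The function name and the sample sizes are hypothetical choices.

```python
from math import sqrt, tanh
from statistics import NormalDist


def approx_critical_r(n, alpha=0.05):
    """Approximate smallest |r| that is significant (two-tailed) for sample size n.

    Uses the Fisher z approximation: atanh(r) is treated as normal with
    standard error 1 / sqrt(n - 3). This is an approximation, not an exact t-test.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3 for the Fisher z approximation")
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return tanh(z_crit / sqrt(n - 3))


# Illustrative sample sizes (assumed, not taken from any cited study).
for n in (20, 50, 100, 1000):
    r_crit = approx_critical_r(n)
    print(f"n = {n:4d}   critical |r| ~ {r_crit:.3f}   r^2 ~ {r_crit ** 2:.4f}")
```

Under this approximation a sample of 20 needs an |r| of about 0.44 to reach p<0.05, whereas a sample of 1000 needs only about 0.06, which corresponds to less than 1% of variance accounted for.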
SIGNIFICANCE TEST AND STRENGTH OF ASSOCIATION
In spite of the frequent use of the product-moment correlation (r) in OFT studies, the square of the correlation coefficient (r², the coefficient of determination) has seldom been used. The meaning of r² is intuitively clearer than that of r. For example, an r² of 0.25 between variables A and B indicates that 25% of the variance in A can be accounted for (or predicted) by variable B. In the case of t- or F-tests, we can use eta squared (η²) as a relative of r². The interpretation of η² is also straightforward: it represents the proportion of variance in the dependent variable accounted for (or predicted) by the independent variable. The value of η² is an index of the strength of association, that is, of the magnitude of an experimental effect. It is a descriptive statistic based on a sample. If an inferential measure of association is desired, omega squared (ω²) is appropriate. This is an estimate of the strength of association in the population. In order to keep the matter simple, we have restricted our attention to r² and η² in the present note.
CORRELATION STUDIES OF OFT
It is known that defecation in the open field occurs at a higher frequency than in a familiar environment, and wanes with repeated exposure to the open field. Because of these facts, open-field defecation has been regarded as the most valid index of so-called emotionality, but not without qualification [11,13]. Some studies have been conducted to investigate whether any consistency in defecation can be obtained in various stress situations. Ley [5] reported that the correlation between defecation during OFT and defecation during fear conditioning was significantly positive (r=0.24,
Copyright © 1982 Pergamon Press--0031-9384/82/070159-03$03.00/0
p<0.05), a finding that supported emotionality as an index of general drive. On the other hand, Seliger [8] reported that the correlation between open-field defecation and water-wading defecation was not significant (r=0.15), and concluded that defecation alone was not a sufficient description of emotionality. The interpretations of these results were apparently contradictory. Since statistical significance simply implies that the r of the population is not zero, the coefficient of determination (r²) might help to interpret the findings of Ley and those of Seliger. The r² values of 0.06 for Ley and 0.02 for Seliger are both small, and the two do not differ much in the magnitude of the effect in question. The problem of whether 0.06 (or 0.02) is sufficiently large (or small) to confirm (or deny) the relationship will be mentioned later.
There are other studies which have investigated the possible correlation between open-field defecation and other measures assumed to be an index of emotionality. Bindra and Thompson [2] reported that the correlation between open-field defecation and timidity-test scores (cage emergency test) was not significant (r=0.26), and concluded that emotionality is not a generally valid measure of fearfulness. On the other hand, Snowdon, Bell, and Henderson [10] found that the correlation between defecation and heart rate was significant (r=-0.30, p<0.05), and concluded that a high heart rate corresponds to a low level of emotionality. In these studies, the values of r² (0.07 and 0.09) are close to being equal.
The correlation between defecation and ambulation in the open field has frequently been reported to be negative. However, a few studies have reported a nonsignificant relationship. For example, Pare [7] found that the correlation between defecation and ambulation was not significant (r=-0.13).
He interpreted the result as suggesting that ambulation and defecation in the open field are independent and, therefore, that ambulation is not a useful index of emotionality. On the other hand, Walden [15] reported a significant negative correlation between the two measures (r=-0.27, p<0.05), and concluded that innate timidity (reflected by open-field defecation) suppresses exploration in a novel environment. The values of r² (0.02 and 0.07) are again similarly small, and the two results do not differ much in the importance that their interpretation deserves.
It is apparent from the above-mentioned examples that results which appear discrepant in terms of a significance test are not necessarily discrepant in terms of r².
ANALYSIS OF VARIANCE STUDIES OF OFT
In correlational analysis, although the significance test is regarded as of primary importance, the value of r is also considered as an index of the strength of association. But in the analysis of variance (ANOVA), which has also been used widely in OFT studies, most investigators have looked only at p values in order to decide whether the obtained experimental effects were meaningful. The size of the F value itself does not directly reflect the magnitude of the experimental effect. In this case, we have no information about the magnitude or importance of the experimental effect. Thus, erroneous interpretations and confusion in ANOVA (and t-test) studies of OFT are expected to be even greater than in correlational studies. Attention to η² can be a safeguard against erroneous interpretation of a result that is highly reliable but reflects a very small experimental effect. Also, attention to η² may reveal that results which seem discrepant on the basis of a significance test are in fact substantially similar.
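The point that F does not directly reflect the magnitude of an effect can be illustrated with a short sketch. It assumes the standard identity for a fixed-effect design, η² = F·df_effect / (F·df_effect + df_error), which gives η² for a one-way design and the partial η² for a factorial design; for a t-test, F = t² with df_effect = 1. The function name and the numerical values are hypothetical and are not taken from any study.

```python
def eta_squared_from_f(f_value, df_effect, df_error):
    """Recover eta squared from an F ratio and its degrees of freedom.

    Since F = (SS_effect / df_effect) / (SS_error / df_error), the ratio
    SS_effect / SS_error equals F * df_effect / df_error, and therefore
    eta^2 = SS_effect / (SS_effect + SS_error)
          = F * df_effect / (F * df_effect + df_error).
    """
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    scaled = f_value * df_effect
    return scaled / (scaled + df_error)


# The same hypothetical F ratio of 10 with different error df.
for df_error in (20, 100, 500):
    eta2 = eta_squared_from_f(10.0, 1, df_error)
    print(f"F(1,{df_error}) = 10.0   eta^2 = {eta2:.3f}")
```

In this sketch an identical F of 10 corresponds to about one third of the variance when the error df is 20, but to about 2% when the error df is 500, so the F value alone says little about the size of the effect.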
Regrettably, most OFT studies have not reported η² or ω². Thus, it is very difficult to point out the above-mentioned confusion with concrete examples. If an analysis was conducted by t-test, one-way ANOVA, or factorial two-way ANOVA with a fixed-effect model, we can estimate η² from its F value (or t value) and its df. However, most OFT studies have been conducted with a more complex design, namely, a repeated-measures design. Therefore, we must discuss the problem with the very restricted set of examples for which we can estimate the η² value.
One fault derived from the supposed superiority of the significance test is that we have no safeguard against seemingly important results which show a highly statistically significant but substantially small experimental effect. Singh [9], for example, reported that rats from an emotionally reactive strain acquired a greater fear response than those from a nonreactive strain. The strain difference in acquisition was highly statistically significant, F(1,256)=13.92, p<0.001. These strains had been selectively bred on the basis of defecation in the open field. Therefore, he concluded that open-field defecation is a valid measure of emotional reactivity. The η² of the strain difference was 0.05. Despite the highly significant result, an η² of this magnitude might not be enough to insist that open-field defecation is a valid measure of emotional reactivity.
Whittier and McReynolds [17] reported that mice spend more time on a previously occupied side of an apparatus than on a previously empty one, implying an attraction to the odor cues left by another mouse (z=3.08, p<0.002). They insisted, on the basis of the highly statistically significant effect, that experiments using rodents should take account of the odor effect, and proposed the disquieting thought that previous results may have been influenced by the uncontrolled odor effect. Surely, we must acknowledge the odor effect.
However, we should see the results not only in terms of the p value but also in terms of the η² value. The value of z (3.08) corresponds to an η² of 0.09. Thus, 9% of the total variance of the dependent variable is accounted for by the odor effect; the remaining 91% is not. An η² of 0.09 suggests that some uncontrolled effects were involved in previous studies, but that these effects were not as serious as the statistical significance (p<0.002) implied.
In addition to the misinterpretation of significance tests in OFT studies, little attention has been paid to type II errors in the interpretation of the results of experiments which do not agree. One of the pitfalls is our tendency to attribute inconsistency in results to differences in experimental procedure. However, because of the low power usual in so much behavioral research, even if the same experiment is repeated exactly, significant results are not always obtained. Therefore, it seems more reasonable to attribute an inconsistency in results to the low probability of detecting a statistically significant experimental effect than to differences in experimental procedure [12]. Denenberg and Rosenberg [3] found no significant difference in defecation in the open field between offspring of two types of mother (handled in infancy or not). This result contradicted the results of a previous study, which obtained a significant difference [4]. They interpreted this discrepancy in terms of a difference in the age of the rats. Surely, the possibility of an age difference cannot be denied. However, this interpretation of discrepant results seems to be too easy when we consider the power situation of such an experiment. Statistical power is rarely so large that we can expect
consistent statistically significant results. In interpreting inconsistent results, we must consider the strong possibility that a statistically significant experimental effect failed to be detected by chance. The two reviews [1,16] referred to in this paper emphasize the unreliability of OFT results. However, it should be noted that consistent statistically significant results cannot readily be expected with sample sizes as small as those of the typical OFT study, because small sample size results in small power. The two reviews seem to suffer from "belief in the law of small numbers," that is, the conviction that any sample, regardless of size, must necessarily be highly representative of its population [14]. In addition, since the two reviews do not take η² into consideration, the contradiction and confusion in OFT studies may not be as serious as these reviews suggest.
SETTING A CRITERION OF STRENGTH OF ASSOCIATION
Although the η² approach has been stressed in this paper, it is important to note that a large value of η² in a small-sample study is not always meaningful, because the reliability of η² depends on sample size. When η² is large but the effect is not statistically significant, η² lacks reliability, and another experiment with a larger sample is needed.
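How the reliability of a strength-of-association estimate depends on sample size can be sketched for the simplest case, a correlation. The sketch assumes the Fisher z approximation for a confidence interval on r; the function name, the value r=0.50 (r²=0.25), and the two sample sizes are hypothetical, and an interval for η² in an ANOVA design would call for a different method.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci_for_r(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    atanh(r) is treated as normally distributed with standard error
    1 / sqrt(n - 3); the interval is built on that scale and mapped back
    to the r scale with tanh.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / sqrt(n - 3)
    z = atanh(r)
    return tanh(z - half_width), tanh(z + half_width)


# The same hypothetical r = 0.50 (r^2 = 0.25) observed in a small and a large sample.
for n in (12, 200):
    low, high = fisher_ci_for_r(0.50, n)
    print(f"n = {n:3d}   95% CI for r: {low:.2f} to {high:.2f}")
```

With n=12 the approximate interval spans zero, so an r² of 0.25 cannot be distinguished from no association at all, whereas with n=200 the interval runs from roughly 0.39 to 0.60.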
Therefore, it is necessary to examine the p value obtained by a null-hypothesis test as well. It should also be noted that if sample sizes are very different from each other, a direct comparison of the values of η² cannot be made.
Where, then, should we set an acceptable criterion of η²? How large an η² is important practically? There is no simple criterion. The choice of the criterion depends heavily on the nature and context of the field of application. In some cases, even a small η² may be meaningful, whereas in others, a rather large η² may be needed to substantiate an experimental effect as having meaning. The seeming arbitrariness in the selection of a criterion for η² might be considered a weakness of this approach. However, if the primary aim of statistical analysis is to assess the magnitude of an obtained experimental effect rather than to make a decision to accept or reject a null hypothesis, then careful consideration will have to be given to the selection of a criterion on the basis of practical considerations related to the experiment and the implications of its findings.
ACKNOWLEDGEMENT
I thank Dr. Ronald Ley, State University of New York, Albany, for his kind help in improving the manuscript.
REFERENCES
1. Archer, J. Tests for emotionality in rats and mice: A review. Anim. Behav. 21: 205-235, 1973.
2. Bindra, D. and W. R. Thompson. An evaluation of defecation and urination as measures of fearfulness. J. comp. physiol. Psychol. 46: 43-45, 1953.
3. Denenberg, V. H. and K. M. Rosenberg. Programming life histories: Effect of maternal and environmental variables upon open-field behavior. Devl Psychobiol. 1: 93-96, 1968.
4. Denenberg, V. H. and A. E. Whimbey. Behavior of adult rats is modified by the experiences their mothers had as infants. Science 142: 1477-1479, 1963.
5. Ley, R. Open-field behavior, emotionality during fear conditioning, and fear-motivated instrumental performance. Bull. Psychon. Soc. 6: 598-600, 1975.
6. Nunnally, J. The place of statistics in psychology. Educ. psychol. Measur. 20: 641-650, 1960.
7. Pare, W. P. Relationship of various behaviors in the open-field test of emotionality. Psychol. Rep. 14: 19-22, 1964.
8. Seliger, D. L. Open field and water wading as tests of emotionality in rats. Psychol. Rep. 42: 175-182, 1978.
9. Singh, S. D. Conditioned emotional response in the rats: I. Constitutional and situational determinants. J. comp. physiol. Psychol. 52: 574-578, 1959.
10. Snowdon, C. T., D. D. Bell and N. D. Henderson. Relationships between heart rate and open-field behavior. J. comp. physiol. Psychol. 58: 423-428, 1964.
11. Tachibana, T. The open-field test: An approach from multivariate analysis. Anim. Learn. Behav. 8: 465-467, 1980.
12. Tachibana, T. Persistent erroneous interpretation of negative data and assessment of statistical power. Percept. Mot. Skills 51: 37-38, 1980.
13. Tachibana, T. Open-field test for rats: Correlational analysis. Psychol. Rep. 50: 899-910, 1982.
14. Tversky, A. and D. Kahneman. Belief in the law of small numbers. Psychol. Bull. 76: 105-110, 1971.
15. Walden, A. M. Studies of exploratory behavior in the albino rat. Psychol. Rep. 22: 483-493, 1968.
16. Walsh, R. N. and R. A. Cummins. The open-field test: A critical review. Psychol. Bull. 83: 482-504, 1976.
17. Whittier, J. L. and P. McReynolds. Persisting odours as a biasing factor in open-field research with mice. Can. J. Psychol. 19: 224-230, 1965.