On the number of subjects used in animal behaviour experiments


Anim. Behav., 1982, 30, 873-880

ON THE NUMBER OF SUBJECTS USED IN ANIMAL BEHAVIOUR EXPERIMENTS
BY A. W. STILL
Department of Psychology, Science Laboratories, Durham DH1 3LE

Abstract. Experiments published in Animal Behaviour in 1979 are reviewed, and it is concluded that there are grounds for thinking that the numbers of animals used could sometimes be reduced without loss of scientific rigour. Principles underlying the choice of number of subjects (N) are outlined, and illustrated with examples from the 1979 sample. Several alternatives to increasing N in order to increase power are given, including better design, larger α, and sequential methods. The logic of inference from experiments with small N is discussed.

This paper is a response to a discussion in the ethical committee of the Association for the Study of Animal Behaviour. Whatever one's views on the ethics of experiments on animals, one would surely agree that the number of animals used should be reduced where that can be done without sacrificing scientific goals. One way of achieving such a reduction may be through more attention to statistical considerations in deciding upon the number of subjects to be used in an experiment. From the animal's point of view this may be a two-edged sword, since the recommendation may sometimes be that more rather than fewer subjects should be used. But even if that is true of a particular experiment, the information from experiments will be acquired more efficiently, there will be less need for wasteful replication, and in general the number of animals used will be reduced. Ideally a statistician should be consulted, preferably before the experimental design is decided upon. But the scientist must communicate his requirements to the statistician, and some familiarity with the principles governing the selection of numbers used can be an important aid to this communication.

A Survey of Papers Published in Animal Behaviour, 1979

For this study I looked at all the full-length articles published in Animal Behaviour during 1979. There were 117, of which 38 were observational studies of wild animals or domestic animals in their normal habitat, and did not involve painful experimental procedures (except, in some cases, tagging). Nineteen were experimental studies on invertebrates, 10 were theoretical papers, one paper was on human infants, and 49 were experimental studies on vertebrates.

Of these 49, 18 involved experimental manipulations causing mild suffering (including food deprivation, ovariectomy, hormone injections, and exposure to bright light), while in six cases the treatment was more severe (electric shock in rats and chickens, 48-h food deprivation in cats, pigs and rats, and 24-h water deprivation in rats). Grounds exist, therefore, for asking whether the numbers used in animal experiments cannot sometimes be reduced.

Table I shows the mean number of subjects for the different species used, together with the range and number of experiments from which the mean is taken. Only the first experiment from each paper was used, in order to avoid giving too much weight to authors reporting several experiments. The first experiment is fairly typical of an author's practice in all cases. There is some indication that smaller numbers are used with more exotic and expensive species (such as mink and seal) than with those that are familiar and easily obtained, such as mice, pigeons, and chickens. If numbers can be reduced on economic grounds without sacrificing scientific rigour, they can also be reduced on ethical grounds. Some designs may require large numbers, and it may be that the design determines the species used, rather than the other way round. Also the numbers used in an experiment are divided into varying numbers of groups or treatments. But this cannot explain the differences in numbers in experiments on predatory behaviour in mink (Dunstone & O'Connor (1979a, b) used only 2 subjects) and cats (Biben (1979) used a total of 24 subjects in two experiments); or on the sexual behaviour of seals (Beier & Wartzok (1979) used only one pair) and other species studied in this sample (where large numbers of pairs were used).

In the case of predatory behaviour it is the prey that suffers, so the number of animals actually involved, and hence the difference between the two cases, is considerably amplified.

Table I. Numbers of Animals Used in the First Experiments of the Experimental Studies Published in Animal Behaviour, 1979

Class      Genus               Mean no. used   No. of papers   Range
Mammal     Rat                 20.8            5               7-40
           Mouse               120.0           3               96-156
           Hamster             24.7            3               12-32
           Guinea pig          18.0            3               10-30
           Cat                 24.5            2               8-41
           Monkey              9.5             2               8-11
           Mink                2.0             2               2-2
           Pig                 20              1
           Rabbit              10              1
           Seal                2               1
           Lemming             40              1
           Echidna             6               1
           Gerbil              24              1
Bird       Chicken             81.0            5               6-180
           Tit                 28.0            3               12-60
           Duck                34.0            2               20-48
           Pigeon              110.5           2               6-215
           Finch               40              1
           Weaverbird          64              1
Fish       Stickleback         70.0            3               5-152
           Blenny              36              1
           Bream               35              1
           Butterfly fish      46              1
Amphibia   Salamander          300             1
           Bullfrog tadpoles   75              1

Some Basic Principles

Individual members of a species vary in their behaviour, both amongst themselves (between subjects) and, in the case of a single individual, over time (within subjects). The scientist wishes to reveal the significant patterns that remain when such sources of variation are allowed for, and when this is achieved the reported results are said to have validity. An ideal experiment has both internal and external validity (Campbell & Stanley 1963). Internal validity follows from proper control, so that observed effects are due to the independent variables of interest to the experimenter, and not to extraneous factors, such as unintended cues in a discrimination experiment, or sudden noises in an experiment on exploratory activity. External validity follows from proper design and sampling, so that generalization is possible from the sample to a population of subjects (population validity) and to other stimulus situations (ecological validity). The concern of this paper is mainly with the effect of sample size on external validity, especially where the conclusions are supported by statistical tests.

The usual practice in experiments on animal behaviour is to pose the question to be answered in the form of a null hypothesis, which is then tested by experiment. Thus if the question is whether a species has colour vision, the null hypothesis (H0) will be that it does not, and this will be tested by a discrimination experiment and a statistical test on the results of this experiment. If the results are sufficiently unlikely by chance alone, then H0 is rejected, and H1, the alternative hypothesis, is accepted.

Textbooks tell us that in testing a statistical hypothesis we should try to minimize two types of error: type I, where H0 is true but rejected, and type II, where H0 is false but accepted. The probabilities of these two kinds of error are α and β respectively. The researcher's aim should therefore be to minimize α and β, or, equivalently, to minimize α and maximize the power, 1 - β, an aim whose achievement is complicated by the fact that α and β are usually inversely related. In practice we usually know the value of α but are very hazy about β, and cannot control it exactly. This is because, while H0 is almost always precise (e.g. that the difference between the effect of two treatments is zero: A = B), H1 is imprecise or composite (e.g. that one treatment will have a greater effect than another: A > B). This asymmetry has directed attention to type I errors at the expense of type II errors, and the criterion for a satisfactory positive result seems often to be simply that α is sufficiently small. An experimenter will wish, when he commences an experiment, to use a test powerful enough to reject a false H0, but once the results have been obtained and H0 has been rejected on the basis of an acceptably small value of α, it will seem to be of little importance, in retrospect, whether or not the test used was powerful.

Such an attitude on the part of individual experimenters has its dangers. The individual may be anxious to produce as many significant (and publishable) results as possible, but anyone whose care is for the health of the science itself (such as the journal editor) will be concerned to see that a high proportion of published conclusions are valid, and a universally small α is not in itself a guarantee of this. What is required is that the probability of H1 being true, given that H0 has been rejected at the acceptable α level, is as large as possible. Write this probability as p(H1 | p ≤ α), where p is the observed significance probability (the probability, under H0, of a result at least as extreme as that obtained), and α is the probability at which the decision to reject H0 is acceptable. Then, by Bayes' theorem (Hays 1973),

$$p(H_1 \mid p \le \alpha) = \frac{(1-\beta)\,p(H_1)}{(1-\beta)\,p(H_1) + \alpha\,p(H_0)} \qquad (1)$$

where p(H1) (which equals 1 - p(H0)) is the a priori probability of H1 being true (roughly, how expected it would be). It is clear from this formula that p(H1 | p ≤ α) will increase as α decreases, as the power, 1 - β, increases, and as p(H1) increases. p(H1) is not usually considered in evaluating experimental results, presumably because it is thought to be too hard to assess. This leaves 1 - β, or power, as well as α, and an important aim, therefore, is for large power: not just in order to get significant results in a particular experiment, but to contribute to a corpus of valid research findings. To complete the picture, we should also take account of the probability of H0 being true when it is accepted:

$$p(H_0 \mid p > \alpha) = \frac{(1-\alpha)\,p(H_0)}{(1-\alpha)\,p(H_0) + \beta\,p(H_1)} \qquad (2)$$

This will also increase with (1 - β), which reveals another reason for requiring high power: to give reliability to negative results. A simple way of increasing power is to increase the number (N) of subjects, but this does not mean that statistical considerations will always urge for more subjects. There are several reasons for this.

1. Power increase is subject to the law of diminishing returns. If a test is already very powerful (say 1 - β = 0.99), any further increase in power is unnecessary and therefore, if it involves increasing N, wasteful. To act upon this some estimate of power will be necessary.
2. Power can be increased by means other than increasing N.
   (i) By greater precision; e.g. by more accurate measurement or by better control of those extraneous variables which increase random error. Failure to achieve adequate precision is hard to detect in published reports without specialist knowledge of the techniques described. It is in any case an already recognized goal of good experimental technique, and will not be discussed further in this paper.
   (ii) By use of a more powerful test; e.g. a parametric test, where appropriate, rather than a non-parametric test.
   (iii) By better experimental design; e.g. by using repeated measures rather than independent groups.
   (iv) By choosing a larger α; e.g. 0.05 instead of 0.001.
3. A peculiar feature of a composite H1, such as that A > B, is that if H1 is true the power of the test will usually depend upon the exact value of the difference between A and B. The smaller the actual difference, the less powerful the test. This means that if the difference is small, we may need a relatively large N to achieve a given level of power. But small differences are often of little interest to researchers, who should be reluctant to use resources merely to avoid missing trivial effects. This is another instance of the law of diminishing returns.
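
The behaviour of equations (1) and (2) can be explored numerically. The Python sketch below is an illustration added for that purpose and is not part of the original analysis. It assumes the standard Bayes form of both expressions; the function names and the example values of α, power and p(H1) are hypothetical.

```python
def prob_h1_given_rejection(alpha, power, prior_h1):
    """Equation (1): probability that H1 is true, given that H0 was rejected."""
    prior_h0 = 1.0 - prior_h1
    true_positive = power * prior_h1    # H1 true and H0 rejected
    false_positive = alpha * prior_h0   # H0 true but rejected (type I error)
    return true_positive / (true_positive + false_positive)


def prob_h0_given_acceptance(alpha, power, prior_h1):
    """Equation (2): probability that H0 is true, given that H0 was accepted."""
    prior_h0 = 1.0 - prior_h1
    beta = 1.0 - power                  # type II error probability
    true_negative = (1.0 - alpha) * prior_h0
    false_negative = beta * prior_h1    # H1 true but H0 accepted
    return true_negative / (true_negative + false_negative)


if __name__ == "__main__":
    # Hypothetical settings: conventional alpha, and H1 as likely as H0 a priori.
    alpha, prior_h1 = 0.05, 0.5
    for power in (0.3, 0.6, 0.9):
        positive = prob_h1_given_rejection(alpha, power, prior_h1)
        negative = prob_h0_given_acceptance(alpha, power, prior_h1)
        print(f"power={power:.1f}  p(H1|rejected)={positive:.3f}  "
              f"p(H0|accepted)={negative:.3f}")
```

With these illustrative values, raising power from 0.3 to 0.9 raises the probability that a positive result is valid from about 0.86 to about 0.95, and the probability that a negative result is valid from about 0.58 to about 0.90, so the gain from high power is largest for negative results.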

By increasing N, we may increase the range of targets likely to be hit, but as we do this so the new potential targets become less and less interesting. A similar point is made by Meehl (1967). With this danger in mind, an index of magnitude of effect is sometimes used, such as the proportion of variance explained (e.g. Waring & Perper 1979).

In elaborating upon some of these considerations, a few of the papers surveyed from Animal Behaviour, 1979, will be examined in more detail. These papers have not been selected for criticism, but to serve as illustrations; a different set of papers, from a different journal, would have done as well.

The Use of Power Estimation in order to Choose N

It is possible to estimate power approximately by estimating the expected extent and standard deviation of the effect being tested from previous published experiments or pilot experiments. The sample size is then chosen accordingly (Cohen 1977). There was no explicit statement that this was done in any of the 1979 papers. Perhaps the relevant data were not available, though this is not true when there is a series of experiments on the same theme. But in such cases (and in general) the sample size may actually have been chosen quite carefully (but informally) on the basis of results expected from previous experiments or an unstatable 'feel' for the situation. There is no way of knowing where this was so, although it is unlikely if the levels of significance reported are consistently very low, as happened in one paper (Van der Poel 1979), where special mention was made of a significance level well below 0.001. A chi-square test is reported as significant at p < 10-2~ and although the other test results are only reported as p < 0.001, it appears, when full details are given, that p is again well below 0.001. The paper is on 'stretched attention' in rats following electric shock, a phenomenon that is familiar to anyone who has watched a rat receiving its first shock. Thus the a priori probability of the phenomenon is very high, which means, as we have seen, that it is unnecessary to aim for high power by using, as Van der Poel did, N's of 20 or 30. The scientific community would have been just as convinced with N's of only 5.
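
As a rough illustration of choosing N from an estimate of power, the Python sketch below uses a normal approximation for a one-tailed, one-sample test with known standard deviation. This is a simplifying assumption made here, not the tabulated procedure of Cohen (1977). The argument `effect_size` is the expected difference divided by its standard deviation, as it might be estimated from a pilot experiment; the function names and default values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a one-tailed z test with n subjects.

    effect_size is the expected difference in standard-deviation units.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    return _STD_NORMAL.cdf(effect_size * sqrt(n) - z_crit)


def smallest_n(effect_size, target_power=0.8, alpha=0.05, n_max=10_000):
    """Smallest N whose approximate power reaches target_power, or None."""
    for n in range(1, n_max + 1):
        if approx_power(effect_size, n, alpha) >= target_power:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical effect sizes: a large and a moderate expected difference.
    for d in (1.0, 0.5):
        print(f"effect size {d}: N = {smallest_n(d)}")
```

Under these assumptions a large expected effect (one standard deviation) needs far fewer subjects than a moderate one (half a standard deviation) to reach the same power, and relaxing α lowers the required N further. Exact figures for a t test with unknown variance would be somewhat higher at small N.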

Use of More Powerful Tests

If more powerful tests are used, the same power may be attained with fewer subjects. In some cases non-parametric and parametric tests are used in the same paper, and on similar data (Banks et al. 1979; Birke et al. 1979; Porter & Wyrick 1979). If the conditions required for the use of a parametric test are met (and presumably this is thought to be the case if it is used), then it will be more powerful than the corresponding non-parametric test. Hence by using the parametric test throughout, power could have been retained with fewer subjects.

In some cases a more powerful test can result from a more precise statement of the hypotheses. For example, sometimes a chi-square test is used to test the null hypothesis that the frequencies fall randomly into the different classes. If the classes are ordered, and the theory predicts that there will be a trend, a test sensitive to trend would be more powerful. Wada et al. (1979) provide a possible example of this in their analysis of the tables on page 363. They are looking for a change in colour preference in a definite direction, as a function of colour exposure at different stages of incubation. Their H0 is that there is no change, tested against any departure from it, rather than against the more specific alternative of a change in a particular direction. A similar and familiar argument applies to the use of one-tailed rather than two-tailed tests, when these correspond to the null hypotheses 'A is not greater than B', and 'A = B', respectively.

Use of Better Designs

In at least one case a factorial analysis of variance could have been used to test the effects of several treatments in one experiment, rather than testing the effects of different treatments in separate experiments. Payne (1979) carried out several experiments in which the attractiveness of female (oestrus and mid-cycle) and male Harderian smears and control smears of quadriceps muscle and protoporphyrin solution was tested in experienced and inexperienced male golden hamsters. If the experiment had been planned as a single factorial analysis of variance, approximately one third of the subjects actually used would have been needed to achieve the same power, since this would have given the same degrees of freedom in the error term as Payne had in each experiment (or would have done if he had used an F test; in fact the Friedman one-way analysis of variance was used, which is probably less powerful still).
Of course, if Payne had carried out the experiment in this way, he could not have presented a series of experiments following naturally one after the other as the analysis penetrated more deeply. A series is obviously necessary if the questions eventually asked could not have been anticipated in advance.

Another common way of using design to reduce N is to use repeated measures rather than randomized groups. This should be done when possible, but there are some well-known difficulties with repeated measures. In particular, traumatic procedures cannot usually be used because of carryover effects, and learning or boredom may make interpretation difficult in behaviour experiments.

It may be noted here that external validity should not only involve generalization to other members of the species, but also to other environments. In operant work this is ensured by systematic replication: replication of a basic result, but under different circumstances (Sidman 1960). Thus more generality may be obtained by varying conditions rather than keeping conditions constant. Ultimately this may save animals that would otherwise be used to explain why apparently similar experiments give different results. In general, the use of exactly controlled conditions can reduce external validity, and so can the use of very pure strains. Using pure strains decreases variance, and hence the number of subjects required for given power, but that can also be achieved, with greater external validity, by using matched pairs from more diverse stock.

Choice of Larger α

It is very rare to find a researcher justifying the use of a relatively large α (e.g. 0.05 or even 0.1, rather than 0.001 or 0.01) on the grounds that it would increase power, or permit a smaller N (and hence save animals) without reducing power. If researchers did do this it would certainly interfere with α in its present use as a kind of index of acceptability. And yet it is most unlikely that science would suffer if, say, an α of 0.1 was sometimes acceptable, namely when p(H1) was judged to be high, and it was deemed desirable to reduce N.

Sequential Decision Making

Suppose an experimenter wishes to determine whether an animal has a preference for blue over red. He continues to give choices (one per animal, say), until the difference between the number of choices to the two alternatives has reached a predetermined value de, or up to a predetermined number (ne) of subjects or choices. The null hypothesis is rejected if de is reached before ne subjects are tested; otherwise it is accepted. If the preference is large, fewer subjects (or choices) will be needed than with conventional designs. But more subjects may be necessary if the preference is small or zero. Sequential tests of this kind, where the decision to end an experiment is based on the current value of some statistic, can often replace the more conventional test where the length of the experiment is fixed in advance (Hays 1973; Wetherill 1976).

The technique could only be used to reduce numbers when all testing of one animal (or block of animals) is completed before going on to the next, and if a large pool of subjects is available. This last requirement might seem to defeat the object of the exercise, but the technique could have been used, with a reduction in the number of subjects, in at least one of the 1979 papers. Wada et al. (1979) tested chicks for preference between red and green following different conditions of illumination during incubation. Preferences were generally clear-cut, of the order of 80%, but relatively large numbers of chicks were used for each condition, and this number was perhaps necessary to provide a test powerful enough to be convincing on the few occasions when no preference was found. By using a sequential test far fewer subjects would have been needed to detect the preference when, as was usual, it occurred, but probably more subjects would have been necessary when there was no preference. Overall there could have been a large saving on subjects, and it would not, with chicks as subjects, have been necessary to keep a large pool available.

Experiments or Observations when N = 1

It will be argued that under certain conditions it is possible to make inferences about populations from very small samples, even when N = 1. This is commonly done in psychophysical or physiological experiments where there is a body of evidence or a set of assumptions from which it follows that the dependent variables under investigation do not vary greatly within the population. Thus Dunstone & Clements (1979) use only 2 mink in their psychophysical studies, and yet their findings can be generalized, not because there is much prior data available on mink psychophysics but because the techniques used are adapted versions of those well tried on a large number of species and found to be reliable. Usually N is 2 or 3 rather than 1 in such studies, and this guards against the possibility of misleading results from a single freak, such as a mink with anomalous vision in an experiment on visual psychophysics. Visual inspection of the data from the 2 or 3 individuals is usually enough to confirm that the pattern is sufficiently similar for generalization. The logic behind such inferences is not usually made formal, although it has been discussed informally for the similar logic at work in experiments on operant conditioning (Sidman 1960).

A good example to illustrate the principles of this paragraph is the experiment of Buchmann & Rhodes (1979). They used 6 echidnas, divided into 3 groups of 2, each group being given a different set of successive reversal conditions in an operant discrimination. They plotted trials to criterion over successive reversals for each animal, and the similarity of the data for animals within groups, contrasted with the clear-cut difference between groups, makes formal statistical testing redundant. This is helped by the fact that the results are more or less what one would expect from studies on other species, so that p(H1) is large.

A special case is where the question to be answered by experiment is whether some discrimination is within the capacity of the species, or whether some distinctive piece of behaviour is part of the repertoire. In the former case it is generally sufficient to show the capacity in a single member of the species, and let it be assumed that we have not accidentally tested a unique genius. Thus chimpanzees' capacity for language learning could in principle be demonstrated by the successful performance of one individual (if in practice this has not been so, it is because there is uncertainty about what constitutes language, not because of the sample sizes). Colour vision in the domestic cat could be demonstrated by generalization from the performance of a single cat; if one cat can discriminate, then the necessary mechanisms are likely to be present in all cats of that breed, unless there are colour-blindness anomalies. The same argument does not apply to the demonstration of colour preferences. A single cat's preference for green over red may well be idiosyncratic, and little can be inferred about the population of cats. Gaughwin (1979) reports a distinctive response, flehmen, in a single hairy-nosed wombat, backed up by occasional observations on other wombats.
The implicit generalization (that this species shows flehmen) is reasonable here. Such examples, where a capacity, or the inclusion of a behaviour in the species' repertoire, is being demonstrated, are best thought of as attempts to falsify negative generalizations. Thus a single colour-discriminating domestic cat falsifies the generalization that no domestic cats have colour vision, and flehmen in a single hairy-nosed wombat falsifies the generalization that hairy-nosed wombats do not show flehmen. This avoids any unwarranted and unnecessary commitment to the conclusion that all domestic cats have colour vision, or all hairy-nosed wombats show flehmen.

Sometimes when N = 1 statistical tests of a null hypothesis are carried out, as in the study by Beier & Wartzok (1979) of the mating behaviour of a pair of seals. The interpretation of tests in such cases requires care. The use of parametric tests, such as the F test in analysis of variance, in order to make inferences about a population sampled, assumes random sampling from the population, and although the single subject may be chosen randomly, it cannot be the basis for estimating error in order to generalize to the population of subjects. The error is in fact estimated from variation over several trials of the same subject, so that the generalization based upon the statistical test must be to the population of all possible trials. But the 'sample' of trials is certainly not random.

A solution to this problem of generalization from a single subject is to base inference not upon the strict requirement of random sampling (which can rarely be achieved in any case), but upon random assignment (Edgington 1980). This is the basis of randomization tests (Edgington 1980), which compute, given random assignment of subjects (or, more generally, experimental units) to treatments, the probability of a result as extreme as that observed, if the observed arrangement of treatments and individual scores is due to chance. The tests make use of the fact that if the observed arrangement is due to chance, any rearrangement (within certain restraints) would have been equally likely. Since random assignment only is assumed, this provides a reasonable interpretation of statistical tests when N = 1, even of parametric tests which normally assume random sampling, such as the F test, since the latter may be taken as an approximation to the corresponding randomization test.

Any generalization carried out following a randomization test is at the discretion of the experimenter, and must be justified by the nature of the techniques used and accepted knowledge of the area under investigation. It may even be argued that this is the true logic of all statistical testing, since random sampling is not really possible in animal experiments.

In certain important cases the question of generalization seems irrelevant. This is often true, for instance, in medical case histories, and it may also be true of the behaviour reported in studies on a group of social animals (e.g. Harcourt 1979). The social structure and the behaviour expressing this structure may be described, but these are presented as facts in themselves, and no generalization to other groups may be invited. This is not to say that the observations form a kind of 'sport', since common principles of adaptation or learning may be appealed to, to show that the behaviour, though unique, is nonetheless the result of lawful forces. It does not follow that every observation on the behaviour of a single animal or a group is of scientific interest, but it is scientific rather than statistical criteria that determine what observations are of interest.

Use of Techniques Other Than Null Hypothesis Testing

In the long run the use of null hypothesis testing may be wasteful of subjects. Other possibilities include: (a) the use of multivariate analysis, which appears occasionally in the 1979 sample (e.g. Dunstone & O'Connor 1979b), and which may need to be supplemented by some null hypothesis testing, and (b) more emphasis on the estimation of parameters. This appears in the use of confidence limits, but fiducial inference (Finney 1968), in which a probability distribution for the parameter being estimated is derived from the data, or Bayesian inference (Hays 1973), in which the data are combined with a priori probability distributions to give a new, a posteriori, distribution for the parameter, could also be used. In practice these estimation techniques are scarcely ever used in studies of animal behaviour, and their widespread use would require a very different way of thinking about experiments. The advantage, from the animals' point of view, might be that it would be easier to combine the results of experiments in a precise way, and there could be less wasteful overlap.

A Brief Conclusion and a Recommendation

Perhaps the most widely used experimental technique in animal behaviour is what might be called the analytic series, in which successive experiments are designed to narrow down the possibilities in the manner of the parlour game Twenty Questions (a comparison made by Broadbent 1958). In retrospect this might sometimes seem to have missed the opportunity of a power-saving factorial design (as suggested above in the case of Payne (1979)), but this is not generally the case, and savings in subjects may be produced simply by demanding a less stringent α level (hence retaining power) in the individual experiments, and relying upon systematic replication to accumulate reliability for the series as a whole. The problem is that there would no longer be a simple criterion for the statistical acceptability of a result, which would place a heavy burden upon the statistical intuition of referees and editors as well as authors. Perhaps the answer is to cultivate that intuition from the beginning, and to teach statistics as one would aesthetics, by the careful consideration of good and bad examples, and not just by the complicated rules of thumb for all occasions that are purveyed in many textbooks.

Acknowledgments

I would like to thank members of the A. S. A. B. ethical committee for comments on an earlier version of this paper; especially Robert Drewett, Felicity Huntingford, and Peter Slater.

REFERENCES

Banks, E. M., Huck, U. W. & Mankovich, N. J. 1979. Interspecific aggression in captive male lemmings. Anim. Behav., 27, 1014-1021.
Beier, J. C. & Wartzok, D. 1979. Mating behaviour of spotted seals (Phoca largha). Anim. Behav., 27, 772-781.
Biben, M. 1979. Predation and predatory play behaviour of domestic cats. Anim. Behav., 27, 81-94.
Birke, L. I. A., Andrew, R. J. & Best, S. M. 1979. Distractibility changes during the oestrous cycle of the rat. Anim. Behav., 27, 597-601.
Broadbent, D. E. 1958. Perception and Communication. London: Pergamon Press.
Buchmann, O. L. K. & Rhodes, J. 1979. Instrumental conditioning: reversal learning in the monotreme Tachyglossus aculeatus setosus. Anim. Behav., 27, 1048-1053.
Campbell, D. T. & Stanley, J. C. 1963. Experimental and quasi-experimental designs for research. In: Handbook of Research on Teaching (Ed. by N. L. Gage), pp. 171-246. Chicago: Rand McNally.
Cohen, J. 1977. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.
Dunstone, N. & Clements, A. 1979. The threshold for high-speed directional movement detection in the mink, Mustela vison Schreber. Anim. Behav., 27, 613-620.
Dunstone, N. & O'Connor, R. J. 1979a. Optimal foraging in an amphibious mammal. I. The aqualung effect. Anim. Behav., 27, 1182-1194.
Dunstone, N. & O'Connor, R. J. 1979b. Optimal foraging in an amphibious mammal. II. A study using principal component analysis. Anim. Behav., 27, 1195-1201.
Edgington, E. S. 1980. Randomization Tests. New York: Marcel Dekker.
Finney, D. J. 1968. Statistics for Mathematicians: An Introduction. Edinburgh: Oliver & Boyd.
Gaughwin, M. D. 1979. The occurrence of flehmen in a marsupial - the hairy-nosed wombat (Lasiorhinus latifrons). Anim. Behav., 27, 1063-1065.
Harcourt, A. H. 1979. Social relationships between adult male and female mountain gorillas in the wild. Anim. Behav., 27, 325-342.
Hays, W. L. 1973. Statistics for the Social Sciences. New York: Holt, Rinehart & Winston.
Meehl, P. E. 1967. Theory testing in psychology and physics: a methodological paradox. Philosophy of Science, 34, 103-115.
Payne, A. P. 1979. The attractiveness of Harderian gland smears to sexually naive and experienced male golden hamsters. Anim. Behav., 27, 897-904.
Porter, R. H. & Wyrick, M. 1979. Sibling recognition in spiny mice (Acomys cahirinus): influence of age and isolation. Anim. Behav., 27, 761-766.
Sidman, M. 1960. Tactics of Scientific Research. New York: Basic Books.
Van der Poel, A. M. 1979. A note on 'stretched attention', a behavioural element indicative of an approach-avoidance conflict in rats. Anim. Behav., 27, 446-450.
Wada, M., Goto, I., Nishiyama, H. & Nobukuni, K. 1979. Colour exposure of incubating eggs and colour preference of chicks. Anim. Behav., 27, 359-364.
Waring, A. & Perper, T. 1979. Parental behaviour in the Mongolian gerbil (Meriones unguiculatus). I. Retrieval. Anim. Behav., 27, 1091-1097.
Wetherill, G. B. 1976. Sequential Methods in Statistics. London: Methuen.

(Received 20 July 1981; revised 28 October 1981; MS. number: 2148)