Pseudoreplication: a widespread problem in primate communication research

Animal Behaviour 86 (2013) 483–488

Contents lists available at SciVerse ScienceDirect

Animal Behaviour journal homepage: www.elsevier.com/locate/anbehav

Commentary

B. M. Waller a,*, L. Warmelink a, K. Liebal a,b, J. Micheletta a, K. E. Slocombe c

a Centre for Comparative and Evolutionary Psychology, Department of Psychology, University of Portsmouth, Portsmouth, U.K.
b Cluster Languages of Emotion, Evolutionary Psychology, Freie Universität Berlin, Berlin, Germany
c Department of Psychology, University of York, York, U.K.

Article info

Article history:
Received 27 February 2013
Initial acceptance 29 March 2013
Final acceptance 15 May 2013
Available online 28 June 2013
MS. number: 13-00188

Keywords: ape; facial expression; gesture; monkey; pooling fallacy; pseudoreplication; statistics; vocalization

Pseudoreplication (the pooling fallacy) is a widely acknowledged statistical error in the behavioural sciences. Taking a large number of data points from a small number of animals creates a false impression of a better representation of the population. Studies of communication may be particularly prone to artificially inflating the data set in this way, as the unit of interest (the facial expression, the call or the gesture) is a tempting unit of analysis. Primate communication studies (551) published in scientific journals from 1960 to 2008 were examined for the simplest form of pseudoreplication (taking more than one data point from each individual). Of the studies that used inferential statistics, 38% presented at least one case of pseudoreplicated data. An additional 16% did not provide enough information to rule out pseudoreplication. Generalized linear mixed models determined that one variable significantly increased the likelihood of pseudoreplication: using observational methods. Actual sample size (number of animals) and year of publication were not associated with pseudoreplication. The high prevalence of pseudoreplication in the primate communication research articles, and the fact that there has been no decline since key papers warned against pseudoreplication, demonstrates that the problem needs to be more actively addressed.

© 2013 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.

* Correspondence: B. M. Waller, Department of Psychology, University of Portsmouth, King Henry Building, Portsmouth PO1 2DY, U.K.

0003-3472/$38.00 © 2013 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.anbehav.2013.05.038

[pooling multiple observations from each individual] reflects a fundamental error in the logic underlying random sampling since it implicitly assumes that the purpose of data gathering in ethology is to obtain large ‘samples of behaviour’ rather than samples of behavior from a large number of individuals. (Machlis et al. 1985, page 201)

The goal of the majority of behavioural scientists is to draw conclusions about a specific population of individuals (usually a species) by examining a sample of individuals from this group. Scientists seek the largest samples they can achieve, in order to increase the reliability of extrapolating findings from their sample to the population, and thus increase the accuracy of their conclusions. Reliability, however, can be increased only by increasing sample size in terms of the number of individuals that make up the sample, not by taking multiple samples from each individual and pooling them to create a larger data set. In the 1980s, two key papers highlighted the problem of drawing conclusions from these artificially inflated samples, terming it ‘pseudoreplication’ or the ‘pooling fallacy’ (Hurlbert 1984; Machlis et al. 1985), and it is now widely acknowledged as an error to be carefully avoided.

The simplest sampling error that can lead to pseudoreplication is extracting more than one data point per individual, and then adding these data to the main data set without using appropriate repeated measures statistical techniques. The data points are thus not independent. The pooling procedure then results in an artificially inflated sample size, which falsely raises statistical power and increases the chances of making a type I error (a false positive: rejecting the null hypothesis when it is true). Mundry & Sommer (2007) demonstrated (using mock analyses on real data: contact calls of the Arabian babbler, Turdoides squamiceps) that conducting discriminant function analysis (DFA) on nonindependent data increases the chance of rejecting the null hypothesis. Specifically, the discriminability of groups (species, sexes, contexts, etc.) was overestimated when multiple samples were taken per animal (10 calls per individual), and the factor ‘subject’ was not taken into account.

Pseudoreplication can also occur when data points result from the same stimulus treatment and/or temporal, spatial or social


group, and so are not statistically independent from each other. In animal behaviour, experimental playback designs may be particularly prone to this form of pseudoreplication, as a single exemplar (e.g. one example of a type of call) is often used, but then multiple responses to this exemplar are analysed (Kroodsma 1990; Kroodsma et al. 2001). This latter type of pseudoreplication can be more difficult to identify (and avoid, but see Zuur et al. (2009) for methods to avoid temporal nonindependence), but the same principle applies: the replicates are not independent from each other and/or they do not increase the reliability of generalizations from the sample to the wider population. Some scientists argue that by identifying anything connected through spatiotemporal proximity or physical boundaries as nonindependent, Hurlbert (1984) sets impossible standards for the design of experiments (Schank & Koehnle 2009). The authors argue that drawing lines around the experimental unit in this way can be arbitrary, and that the decision about whether units are independent or not needs to be made in light of the specific research question (and, if possible, through empirical analysis). Blanket acceptance of the relationship between statistical independence and spatiotemporal proximity may render pseudoreplication as a ‘pseudoproblem’ (Schank & Koehnle 2009, page 421). Others have argued that the need for replicates within an experiment is unnecessary, unless the predicted response is weak or highly prone to noise (Oksanen 2001). However, even the harshest critics of pseudoreplication tend to concur that taking multiple data points from one animal and treating them as independent is incorrect: ‘If the point is that treating repeated measures, say from the same animals, as independent data points is an error, then we completely agree’ (Schank & Koehnle 2009, page 422). 
Others predict that this simple pooling form of pseudoreplication must be rare given the wide dissemination of the classic papers that urged avoidance: ‘An important result of Hurlbert’s article (and others, e.g. Machlis et al. 1985) is that authors today are unlikely to publish articles with obviously pseudoreplicated data’ (Freeberg & Lucas 2009, page 450). Several reviews have documented the incidence of pseudoreplication in specific subfields of ecological and behavioural science. Hurlbert’s original paper (Hurlbert 1984) reported an incidence of 48% among 101 ecological studies (that used inferential statistics) published between 1960 and 1980. In a review of invertebrate field experiments nearly a decade later, Hurlbert & White (1993) reported a lower pseudoreplication incidence of 32%. Later still, Heffner et al. (1996) found 12% pseudoreplication in a sample of articles akin to Hurlbert’s original review. Thus, at least in ecology, the frequency of pseudoreplication seems to be on the decline. However, in other fields pseudoreplication is only starting to be highlighted, and so the problem may still occur at a greater frequency. A more recent review of pseudoreplication in zoo biology studies, for example, found an incidence of 40% (Kuhar 2006). It is possible that studies of animal communication are particularly prone to artificially inflating the data set, as the unit of communication (the call, the facial expression, etc.) is a tempting unit of analysis. Using the unit of communication as the unit of analysis could often lead to pseudoreplication, unless: (1) each individual contributes only one data point averaged from their sampled communication; (2) appropriate within-subject statistical analyses are used (e.g. paired t test, Wilcoxon signed-ranks test, repeated measures ANOVA); or (3) individual-level membership is taken into account by the statistical test (e.g. by using hierarchical modelling techniques; Pinheiro & Bates 2009). 
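The effect of pooling on the type I error rate, and of remedy (1) above, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not an analysis from this study: two groups of five hypothetical animals each contribute 10 calls, there is no true group difference, between- and within-individual variation are both set to a standard deviation of 1, and the function names and critical t values are illustrative choices.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


def simulate_group(rng, n_animals, calls_per_animal):
    """One list of call measurements per animal; every animal has its own mean."""
    group = []
    for _ in range(n_animals):
        animal_mean = rng.gauss(0.0, 1.0)  # between-individual variation (assumed SD = 1)
        group.append([rng.gauss(animal_mean, 1.0) for _ in range(calls_per_animal)])
    return group


def false_positive_rates(n_sims=1000, n_animals=5, calls_per_animal=10, seed=1):
    """Share of simulations with |t| above the critical value when no group effect exists."""
    rng = random.Random(seed)
    # Approximate two-sided 5% critical values for the default settings only.
    crit_pooled = 1.98    # df = 98 (50 calls per group treated as independent)
    crit_averaged = 2.31  # df = 8 (5 animals per group)
    pooled_hits = averaged_hits = 0
    for _ in range(n_sims):
        g1 = simulate_group(rng, n_animals, calls_per_animal)
        g2 = simulate_group(rng, n_animals, calls_per_animal)
        # Pooling fallacy: every call is treated as an independent data point.
        pooled_1 = [x for animal in g1 for x in animal]
        pooled_2 = [x for animal in g2 for x in animal]
        if abs(welch_t(pooled_1, pooled_2)) > crit_pooled:
            pooled_hits += 1
        # Remedy (1): one averaged data point per individual.
        means_1 = [statistics.mean(animal) for animal in g1]
        means_2 = [statistics.mean(animal) for animal in g2]
        if abs(welch_t(means_1, means_2)) > crit_averaged:
            averaged_hits += 1
    return pooled_hits / n_sims, averaged_hits / n_sims


pooled_rate, averaged_rate = false_positive_rates()
print(f"pooled: {pooled_rate:.2f}, averaged per animal: {averaged_rate:.2f}")
```

Under these assumptions the pooled analysis should reject the (true) null hypothesis far more often than the nominal 5%, whereas the per-individual analysis should stay close to it.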
The pervasiveness of this issue within communication research, however, has not been explored. Here, we focused on primate communication. The aim of this study was (1) to examine the prevalence of pseudoreplication in a systematic review of peer-reviewed articles published in the primate communication field and (2) to determine which factors (if any) were associated with pseudoreplication.

METHODS

Article Database

We reviewed a database of articles systematically collected for a previous study (see Slocombe et al. 2011 for a detailed description of the search criteria). The database contained 551 empirical, peer-reviewed research articles published in English and conducted on naturalistic, conspecific primate communication (excluding studies on communication with humans) from 1960 to 2008. Each article was coded for the primary modality of communication under investigation (vocal, gestural, facial or multimodal), method (whether the study used observational methods), taxa (great ape, lesser ape, monkey or prosimian), research environment (wild or not), impact factor of the journal (in 2011) and citations per year (at time of search: April 2013).

Coding for Pseudoreplication

Each article was read and the statistical analysis coded for the presence of pseudoreplication. Each article was classified as: (1) presenting no statistics; (2) undeterminable; (3) not including pseudoreplicated data; or (4) including pseudoreplicated data. Articles coded as presenting no statistics did not use inferential statistics (i.e. did not contain tests that yield P values). They may still have used descriptive statistics such as frequencies, percentages or similar, so although it is possible that these data were pseudoreplicated, any pseudoreplication of this sort was not counted in the coding system. Undeterminable articles either (1) did not present enough data for us to determine whether pseudoreplication had taken place, (2) stated that information was not available to the researcher (e.g. the researcher could not reliably track the number of animals) or (3) focused on a level below the individual (e.g. the neuron). The appropriate unit of analysis has also been questioned in neuroscientific studies (Lazic 2010), but we felt that comprehensive treatment of these papers was beyond this review.

To classify each article as pseudoreplicating or not pseudoreplicating data, the use of statistics was examined. For each statistic, the reported sample size (number of animals), the statistical test used and the degrees of freedom were noted. If this information was not mentioned in the text, the coder checked figures, tables, captions, footnotes or additional published material. This information was not always stated specifically in numbers, but if the text made clear how the data had been treated, this information was also accepted. A statistic was classified as pseudoreplicating data if the stated degrees of freedom were higher than the stated sample size, or if the analysis was explicitly conducted on the number of observations, and actual sample size was not included. Some tests (e.g. repeated measures ANOVAs) create higher degrees of freedom than the sample size. If such a test was used, the coder checked whether a subject contributed more than one data point per condition. If so, the statistic also qualified as pseudoreplication. For some statistics, the sample size, the test used or the degrees of freedom could not be determined. These statistics were listed as undeterminable.

An article was classified as having pseudoreplicated if at least one of the statistical tests presented included pseudoreplicated data, regardless of the presence of any other statistics that did not pseudoreplicate. An article was classified as undeterminable only if all statistics used in the article were undeterminable.
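The coding rules described above can be summarized as a short decision procedure. The sketch below is an illustrative reading of those rules, not the coding instrument actually used: the `Statistic` record, its field names and the two functions are hypothetical, and the handling of borderline cases (for example, the order in which the checks are applied) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statistic:
    """One reported inferential test (hypothetical record layout)."""
    n_animals: Optional[int]  # reported sample size (number of animals), if stated
    df: Optional[int]         # reported degrees of freedom, if stated
    on_observations: bool = False                # analysis explicitly run on number of observations
    repeated_measures: bool = False              # test legitimately yields df above sample size
    multiple_points_per_condition: bool = False  # a subject gave >1 data point per condition


def code_statistic(s: Statistic) -> str:
    # Analysis run on observations with no actual sample size included.
    if s.on_observations and s.n_animals is None:
        return "pseudoreplicated"
    # Sample size or degrees of freedom could not be determined.
    if s.n_animals is None or s.df is None:
        return "undeterminable"
    # Repeated measures tests: check data points per subject per condition instead of df.
    if s.repeated_measures:
        return "pseudoreplicated" if s.multiple_points_per_condition else "not pseudoreplicated"
    # Default rule: degrees of freedom higher than the number of animals.
    return "pseudoreplicated" if s.df > s.n_animals else "not pseudoreplicated"


def code_article(statistics: List[Statistic]) -> str:
    if not statistics:
        return "no statistics"
    codes = [code_statistic(s) for s in statistics]
    if "pseudoreplicated" in codes:  # one pseudoreplicated test is enough
        return "pseudoreplicated"
    if all(c == "undeterminable" for c in codes):
        return "undeterminable"
    return "not pseudoreplicated"


# Example: 8 animals but 79 degrees of freedom in one of two tests.
print(code_article([Statistic(n_animals=8, df=7), Statistic(n_animals=8, df=79)]))
```

Note that a single pseudoreplicated test classifies the whole article, whereas an article counts as undeterminable only when every test is undeterminable.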


Articles that contained pseudoreplicated data were further classified as (1) avoidable or (2) unavoidable. Avoidable pseudoreplication referred to studies that explicitly took more than one observation from a subject. Unavoidable pseudoreplication referred to studies where the design did not allow the researcher to determine which animal from the study group was the subject, or even how many animals were the subjects (i.e. in wild settings with unhabituated groups). In such cases avoiding pseudoreplication is almost impossible.

Statistical Analyses

To examine the relationship between the explanatory variables and the occurrence of pseudoreplication (as a binary response variable: yes or no), we used generalized linear mixed models (GLMMs) with binomial error structure and logit link function. The categorical explanatory variables were taxa, modality (multimodal was removed from the modality category as there were no multimodal papers in our final sample), method (used observational methods or not) and research environment (wild or not). Note that some studies used both observational and experimental methods, so we could not directly compare the two methods. Instead, we coded all articles that used observational methods as observational, regardless of whether they also used experimental methods (which was only a few cases). Continuous predictor variables were impact factor of the journal, sample size (if several studies were reported in the article, a mean sample size was taken), year of publication (adjusted to number of years post-1960, e.g. an article published in 2000 was coded as 40) and mean number of citations per year since publication (from Google Scholar: total number of citations divided by the number of years since publication). Continuous predictors were standardized to a mean of 0 and a standard deviation of 1 to make the estimates comparable. We controlled for repeated sampling of research from the same researchers by including categorical coding of the identity of the first and last authors as random factors (Pinheiro & Bates 2009).

We checked the distribution of the continuous explanatory variables (sample size, impact factor and citations per year) to identify outliers (Zuur et al. 2009). After visual inspection, we removed six data points with extreme values (four data points with impact factor >11 and two data points with N > 150). Sample size was not available for 20 articles and impact factor was not available for one article, so these were also removed from the data set (running a model without sample size and impact factor did not fit the data better). This left 309 data points for analysis. We then checked for collinearity between the predictors using variance inflation factors (VIF, calculated with the vif function of the package car (Fox & Weisberg 2010) for R 2.15.1 (R Development Core Team 2007)). None of the variables had a VIF > 1.75, indicating that there was low collinearity, so all variables were retained in the analysis (Zuur et al. 2009).

We fitted GLMMs using the function glmer provided by the package lme4 (Bates & Maechler 2009) for R 2.15.1. To assess the overall significance of the model, we compared it to a null model including only the intercept and the random variables by performing a likelihood-ratio test comparing the log-likelihoods of both models (Dobson 2002; Forstmeier & Schielzeth 2011). Significant effects were considered only if the model with predictors was more informative than the null model. For categorical predictors with more than two levels (taxa and modality), overall P values were obtained with likelihood-ratio tests comparing the full model with a reduced model without the particular predictor. We also conducted post hoc multiple comparisons with the function glht of the package multcomp (Bretz et al. 2011) for R 2.15.1. Estimates and their standard errors are given, alongside z scores and P values (α = 0.05), as measures of the effect of each explanatory variable on the occurrence of pseudoreplication.
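The preparation of the continuous predictors described above (year recoded as years post-1960, citations per year, and standardization to a mean of 0 and a standard deviation of 1) can be sketched as follows. This is a minimal illustration rather than the authors' script, which was written in R: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the default search year of 2013 is taken from the stated search date (April 2013).

```python
import statistics


def years_post_1960(year_published):
    """Recode year of publication, e.g. 2000 -> 40."""
    return year_published - 1960


def citations_per_year(total_citations, year_published, search_year=2013):
    """Total citations divided by years since publication (search year is an assumption)."""
    return total_citations / (search_year - year_published)


def standardize(values):
    """Rescale to mean 0 and SD 1 (sample SD, n - 1 in the denominator, assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical sample sizes from five articles.
sample_sizes = [4, 12, 7, 30, 9]
print([round(z, 2) for z in standardize(sample_sizes)])
print(years_post_1960(2000))
```

Because all continuous predictors end up on the same scale, their model estimates can be compared directly, which is the stated purpose of the standardization.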

RESULTS

Incidence of Pseudoreplication

All 551 articles were coded. Of these, 131 (23.77%) did not contain any inferential statistics and so were excluded from any further analysis. Table 1 shows a breakdown of the remaining 420 articles in terms of whether they did or did not pseudoreplicate data. Only 336 articles presented enough information for pseudoreplication to be confirmed or ruled out. Overall, 162 (38.6%) of the 420 articles reported findings generated from pseudoreplicated data. Pseudoreplication was classified as avoidable in the majority of articles that pseudoreplicated (141 articles: 88%).

Variables Associated with Pseudoreplication

The set of predictor variables used in our model had a significant influence on the probability of pseudoreplication (likelihood-ratio test comparing the full model with the null model: $\chi^2_{11} = 57.01$, P < 0.001). Table 2 summarizes the results for each predictor. One significant predictor of pseudoreplication was whether the study used observational methods. Such studies were significantly more likely to report findings from pseudoreplicated data than those that did not (Fig. 1a). Overall, the effect of taxa was significant (likelihood-ratio test comparing the full model with a model without taxon: $\chi^2_{3} = 12.56$, P < 0.01), so follow-up multiple comparison tests were conducted to find out whether studies of any taxon were more likely to pseudoreplicate than others (see Table 3). Studies of monkeys were more likely to pseudoreplicate than studies of lesser apes, and there was a trend for studies of prosimians to pseudoreplicate more than studies of lesser apes (but note the very small samples of prosimian and lesser ape papers: Fig. 1b). The modality of the communication under study did not affect likelihood of pseudoreplication (likelihood-ratio test comparing the full model with a model without modality: $\chi^2_{2} = 0.479$, P = 0.787), so whether the focus was facial expression, vocalization or gesture had no effect (Fig. 1c). The research environment did not affect likelihood of pseudoreplication (Fig. 1d). Year of publication did not predict likelihood of pseudoreplication, and Fig. 2 illustrates that the proportion of articles that pseudoreplicated did not change after publication of the seminal papers (Hurlbert 1984; Machlis et al. 1985). Finally, sample size, impact factor of journal and mean number of citations per year did not predict likelihood of pseudoreplication (Fig. 3). Figure 3 illustrates that although impact factor was not a significant predictor, there was a slight trend for articles

Table 1. Incidence of pseudoreplication in the sample of 420 primate communication articles that used inferential statistics

| Category | Number of articles (%) |
|---|---|
| No pseudoreplication | 174 (41.43) |
| Pseudoreplication | 162 (38.57) |
| Not enough information reported in article | 56 (13.33) |
| Not enough information available to the authors | 8 (1.90) |
| Explicit focus on a level other than individual (i.e. neuron) | 20 (4.76) |
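The percentages in Table 1 follow directly from the article counts, and the choice of denominator matters when reading them. The short calculation below only re-derives figures from the counts reported in Table 1; the dictionary keys are shortened labels introduced here for illustration.

```python
# Article counts as reported in Table 1 (labels shortened for illustration).
counts = {
    "no pseudoreplication": 174,
    "pseudoreplication": 162,
    "not enough information reported": 56,
    "information unavailable to authors": 8,
    "level other than individual": 20,
}

total = sum(counts.values())  # all articles that used inferential statistics
percent_of_total = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Articles in which pseudoreplication could be confirmed or ruled out.
determinable = counts["no pseudoreplication"] + counts["pseudoreplication"]
undeterminable = total - determinable

print(total, determinable, undeterminable)
print(percent_of_total["pseudoreplication"])                          # share of all 420 articles
print(round(100 * undeterminable / total, 2))                         # share that could not be assessed
print(round(100 * counts["pseudoreplication"] / determinable, 2))     # share of assessable articles
```

The 38.6% figure is therefore a share of all 420 articles; restricted to the 336 articles that could be assessed, the share of pseudoreplicating articles is higher.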


Table 2. Influence of the predictor variables on the probability of pseudoreplication in primate communication articles (N = 309)

| Predictor variable | Estimate* | SE | z | P |
|---|---|---|---|---|
| Intercept | 1.632 | 0.489 | 3.334 | 0.001 |
| Taxon: lesser ape vs great ape | 1.709 | 0.995 | 1.718 | 0.086 |
| Taxon: monkey vs great ape | 0.820 | 0.402 | 2.038 | 0.042 |
| Taxon: prosimian vs great ape | 1.387 | 0.946 | 1.466 | 0.143 |
| Modality: gesture vs facial | 0.406 | 0.575 | 0.706 | 0.480 |
| Modality: vocal vs facial | 0.073 | 0.388 | 0.189 | 0.850 |
| Observational method: yes vs no | 1.477 | 0.301 | 4.901 | <0.001 |
| Research environment: wild vs not wild | 0.502 | 0.370 | 1.359 | 0.1741 |
| Year published | 0.173 | 0.150 | 1.150 | 0.250 |
| Sample size | 0.127 | 0.153 | 0.829 | 0.407 |
| Impact factor | 0.337 | 0.183 | 1.846 | 0.065 |
| Citations per year | 0.182 | 0.170 | 1.077 | 0.281 |

* Estimates for categorical variables represent the change in the dependent variable relative to the baseline category of each predictor variable, and so show the strength and direction of each variable’s influence on the probability of pseudoreplication. Estimates for quantitative variables represent the change in the dependent variable per unit of the predictor variable.
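The z and P columns of Table 2 can be related to the Estimate and SE columns with a short calculation. The sketch below assumes that the reported values are Wald statistics (z = estimate / SE, with a two-sided P value from the standard normal distribution), which is the usual summary output for models of this kind; the function names are illustrative, and small discrepancies from the table are expected because the published estimates are rounded.

```python
from math import erfc, sqrt


def wald_z(estimate, se):
    """Wald z score: the estimate divided by its standard error."""
    return estimate / se


def two_sided_p(z):
    """Two-sided P value under the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


# Rounded values taken from Table 2.
rows = {
    "Monkey vs great ape": (0.820, 0.402),
    "Observational method": (1.477, 0.301),
    "Impact factor": (0.337, 0.183),
}

for label, (estimate, se) in rows.items():
    z = wald_z(estimate, se)
    print(f"{label}: z = {z:.3f}, P = {two_sided_p(z):.4f}")
```

Recomputing the columns in this way is a quick consistency check on a reconstructed table, since each row's z should be close to its estimate divided by its standard error.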

published in lower impact journals to pseudoreplicate at higher frequency.

DISCUSSION

Over one-third (38.6%) of our sample of empirical papers on primate communication presented at least one case of pseudoreplication (due to pooling data across individuals). An additional 20% did not provide enough information (or could not be assessed reliably) to confirm the absence of pseudoreplication. Such a high prevalence in published, peer-reviewed scientific articles is worrying, and given that pseudoreplicating data increases the chance of finding a false positive result, it means that some published findings that are believed to be genuine may not be.

Table 3. Multiple comparisons between the different categories of taxa

| Comparison | Estimate | SE | z | P |
|---|---|---|---|---|
| Lesser ape – great ape | 1.709 | 0.995 | 1.718 | 0.289 |
| Monkey – great ape | 0.820 | 0.402 | 2.038 | 0.156 |
| Prosimian – great ape | 1.387 | 0.946 | 1.466 | 0.429 |
| Monkey – lesser ape | 2.529 | 0.952 | 2.657 | 0.034 |
| Prosimian – lesser ape | 3.097 | 1.284 | 2.412 | 0.065 |
| Prosimian – monkey | 0.568 | 0.895 | 0.635 | 0.912 |

The GLMM analyses show which variables were significantly associated with the occurrence of pseudoreplicated data, and thus may help identify and avoid factors that may lead scientists to pseudoreplicate. Only one variable was significantly more associated with pseudoreplication: studies using observational methods were more likely to pseudoreplicate. The implication of this finding is that papers presenting experimental methods were less likely to pseudoreplicate. Thus, when taking repeated measures from individuals in the form of observations (in contrast to taking repeated measures in an experiment), scientists are more prone to using these units as the units of analysis and treating them as independent data points.

The modality of the communicative system under study (facial expression, vocalization or gesture) did not affect the probability of pseudoreplication, which is surprising given that these three communicative modalities can be studied in very different ways (Slocombe et al. 2011). Studies of lesser apes were less likely to include pseudoreplication, but the number of these papers is small (N = 11), so it is hard to speculate on why this might be the case. Most likely it is the result of the statistical practices of the specific research groups working on these species.

Impact factor of the journal was also not associated with pseudoreplication, suggesting that better quality journals (if impact factor is a genuine reflection of quality, which, of course, it may not be: Tressoldi et al. 2013) are as likely to accept pseudoreplication as journals with lower impact factors. Similarly, the mean number of citations per year was not associated with pseudoreplication, meaning that studies

[Figure 1: four bar-chart panels, (a) to (d); y-axis: Pseudoreplicated (%), 0 to 100. Numbers shown in the panels, each panel summing to 309: (a) not observational 136, observational 173; (b) great apes 81, lesser apes 11, monkeys 208, prosimians 9; (c) facial 63, gesture 45, vocal 201; (d) not wild 222, wild 87.]

Figure 1. Percentage and number of primate communication articles that presented pseudoreplicated data as a function of (a) research method, (b) taxa under study, (c) communicative modality and (d) research environment.

[Figure 2: x-axis: Year of publication (1965 to 2010); y-axis: Number of publications (0 to 30); series: No pseudoreplication, Pseudoreplication.]

Figure 2. Year of publication and the number of articles published that do and do not report pseudoreplicated data. The dotted line shows the point at which seminal articles (Hurlbert 1984; Machlis et al. 1985) highlighting the problem with pseudoreplication were published.

presenting pseudoreplicated data do not have a higher impact than studies that do not. This is reassuring but worrying at the same time: ideally, pseudoreplicated papers should be cited less often because the results and conclusions they present are likely to be unreliable. Actual sample size (number of animals) did not predict pseudoreplication, which is highly surprising as it seemed obvious that the pooling fallacy would be most seductive when dealing with small samples. Therefore, this error is likely to be taking place as a genuine (and rectifiable) misunderstanding of statistics

[Figure 3: three panels; x-axis: Pseudoreplication (No, Yes); y-axes: (a) Impact factor, (b) Sample size, (c) Citations per year.]

Figure 3. Comparison of (a) journal impact factor, (b) sample size and (c) citations per year between articles that do and do not report pseudoreplicated data. Means are shown ± 95% confidence intervals.

(Hurlbert 2009). Finally, the year of publication (from 1960 to 2008) did not affect the probability of pseudoreplication at all. If the early warnings of Hurlbert (1984) and Machlis et al. (1985) had affected how primate communication research is being conducted, the likelihood of pseudoreplication would have reduced as year of publication increased. Given that this is not the case, we can conclude that the decline seen in ecological research (Hurlbert 1984; Hurlbert & White 1993; Heffner et al. 1996) has not occurred in primate communication research. To remedy the problem of pseudoreplication, statistical methods that analyse repeated data points from individuals while avoiding pseudoreplication should be employed. In many cases, simple within-subject analyses (e.g. paired t tests, Friedman’s ANOVA) can be used and are sufficient to answer the research question. Alternatively, and to avoid summarizing the data and losing power, GLMMs can control for group membership of data points at the level of the individual and are becoming more common in primate studies (e.g. Gomes et al. 2009; Engelhardt et al. 2012). GLMMs can also be used to control for higher levels of group membership (above the level of individual), which can also render data nonindependent (e.g. group, geographical area, etc.). The extent to which data points need to be independent, however, is a difficult issue, and intrinsically linked to the specific research question. Here, for example, we were concerned that some articles were related to each other by virtue of being written by the same authors, and so we used first and last author as random factors (thus grouping articles by author) in our model. It is possible, however, that articles are similarly connected by journal, editor, research group, institutional affiliation and so on. 
Permutated discriminant function analysis has also been developed for use with nonindependent data (Mundry & Sommer 2007), and allows subject to be included as a factor in addition to other factors of interest (e.g. sex, species, context). These alternative methods (along with repeated measures statistics on summary data per individual) should be used as a matter of course when analysing multiple data points per individual. The high prevalence of pseudoreplication in the large sample of primate communication research articles examined is a problem. The conclusions derived from pseudoreplicated data are (sometimes grossly) unreliable (Mundry & Sommer 2007) and should not be accepted, and so our current understanding of primate communication is likely to be incorrect at some level. Thus, when citing existing articles that are based on pseudoreplicated data, we should be extremely cautious over our interpretation of the stated findings, perhaps even stating explicitly that the data have been pseudoreplicated. In some cases, of course, it is possible that the same result would still be found if analyses avoiding pseudoreplication were conducted on the same data, but if this has not been


demonstrated, caution is essential. Articles received for peer review that are yet to be published do not, however, present us with the same difficulties, and the only way to improve the reliability of the literature is to avoid endorsing articles that present pseudoreplicated data. Authors, editors and referees should all be vigilant of the problem and should not publish findings if they have been derived from pseudoreplicated data.

Primate communication may not be any more likely than studies of other taxa to pseudoreplicate, and likewise, other aspects of behaviour may be equally prone to pseudoreplication errors. Further research is needed to clarify the prevalence of such problems to continue improving the quality of our research. It is vital that published findings are generated from appropriate methodology and analysis, and (above all) are reliable.

Acknowledgments

We thank the University of Portsmouth, Department of Psychology Research Committee, the University of York Pump-priming fund and the BBSRC for funding. We also thank Julia Friedrich, Carla Pritsch, Margarete Überfuhr and Alejandra Picard for assistance with the organization and coding of the article database, and the editor and two anonymous referees for helping us to improve the manuscript.

References

Bates, D. & Maechler, M. 2009. lme4: Linear Mixed-effects Models using S4 Classes. R package version 0.999375-32. http://CRAN.R-project.org/package=lme4.
Bretz, F., Hothorn, T. & Westfall, P. 2011. Multiple Comparisons using R. Boca Raton, Florida: CRC Press.
Dobson, A. J. 2002. An Introduction to Generalized Linear Models. Boca Raton, Florida: Chapman & Hall/CRC Press.
Engelhardt, A., Fischer, J., Neumann, C., Pfeifer, J.-B. & Heistermann, M. 2012. Information content of female copulation calls in wild long-tailed macaques (Macaca fascicularis). Behavioral Ecology and Sociobiology, 66, 121–134.
Forstmeier, W. & Schielzeth, H. 2011. Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner’s curse. Behavioral Ecology and Sociobiology, 65, 47–55.
Fox, J. & Weisberg, S. 2010. An R Companion to Applied Regression. London: Sage Publications.
Freeberg, T. M. & Lucas, J. R. 2009. Pseudoreplication is (still) a problem. Journal of Comparative Psychology, 123, 450–451.
Gomes, C. M., Mundry, R. & Boesch, C. 2009. Long-term reciprocation of grooming in wild West African chimpanzees. Proceedings of the Royal Society B, 276, 699–706.
Heffner, R. A., Butler, M. J. & Reilly, C. K. 1996. Pseudoreplication revisited. Ecology, 77, 2558–2562.
Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological experiments. Ecological Monographs, 54, 187–211.
Hurlbert, S. H. 2009. The ancient black art and transdisciplinary extent of pseudoreplication. Journal of Comparative Psychology, 123, 434–443.
Hurlbert, S. H. & White, M. D. 1993. Experiments with fresh-water invertebrate zooplanktivores: quality of statistical analyses. Bulletin of Marine Science, 53, 128–153.
Kroodsma, D. E. 1990. Using appropriate experimental designs for intended hypotheses in song playbacks, with examples for testing effects of song repertoire sizes. Animal Behaviour, 40, 1138–1150.
Kroodsma, D. E., Byers, B. E., Goodale, E., Johnson, S. & Liu, W. C. 2001. Pseudoreplication in playback experiments, revisited a decade later. Animal Behaviour, 61, 1029–1033.
Kuhar, C. W. 2006. In the deep end: pooling data and other statistical challenges of zoo and aquarium research. Zoo Biology, 25, 339–352.
Lazic, S. E. 2010. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience, 11, 5.
Machlis, L., Dodd, W. D. & Fentress, J. C. 1985. The pooling fallacy: problems arising when individuals contribute more than one observation to the data set. Zeitschrift für Tierpsychologie, 68, 201–214.
Mundry, R. & Sommer, C. 2007. Discriminant function analysis with nonindependent data: consequences and an alternative. Animal Behaviour, 74, 965–976.
Oksanen, L. 2001. Logic of experiments in ecology: is pseudoreplication a pseudoissue? Oikos, 94, 27–38.
Pinheiro, J. C. & Bates, D. M. 2009. Mixed-Effects Models in S and S-Plus. New York: Springer-Verlag.
R Development Core Team. 2007. R: a Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org.
Schank, J. C. & Koehnle, T. J. 2009. Pseudoreplication is a pseudoproblem. Journal of Comparative Psychology, 123, 421–433.
Slocombe, K. E., Waller, B. M. & Liebal, K. 2011. The language void: the need for multimodality in primate communication research. Animal Behaviour, 81, 919–924.
Tressoldi, P. E., Giofré, D., Sella, F. & Cumming, G. 2013. High impact = high statistical standards? Not necessarily so. PLoS One, 8, e56180.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. & Smith, G. M. 2009. Mixed Effects Models and Extensions in Ecology With R. New York: Springer-Verlag.