Be careful what you wish for: Commentary on Ebersole et al. (2016)

Benoît Monin
Stanford University, United States
Keywords: Credentials; Licensing; Replication predictions
Abstract

Ebersole et al.'s (2016) attempt to replicate Monin and Miller (2001) raises important questions about choosing beforehand which statistical test is the target of a replication. While our original theory a priori predicted only a main effect of the credentials manipulation, we had observed in the study reproduced here an unexpected interaction with participant gender. The current paper fails to replicate this originally unpredicted interaction, which it initially codes as a failure (Table 3), but significantly replicates the main effect of the credentials manipulation as an “added effect.” Ebersole et al. graciously acknowledge that the latter effect is of greater theoretical importance to our original theory, but the fact remains that the main pre-registered prediction for this replication (the interaction with gender) failed to garner empirical support. In this brief commentary I discuss issues with deciding a priori what should count as a replication, and own up to my responsibility in making the ill-fated gender interaction the centerpiece of this replication attempt.
I thank Ebersole et al. for choosing Study 1 of our first moral credentials paper (Monin & Miller, 2001). As when my work was selected for ManyLabs 1 (Klein et al., 2014; Monin & Oppenheimer, 2014) and the Reproducibility Project (Open Science Collaboration, 2015), I appreciated this independent test of our hypotheses. I was impressed by how Ebersole et al. consulted me in this process, including inviting edits to their report, and I applaud their even-handedness and rigor throughout. I only wish to complement their article by highlighting a complexity that is likely to resurface in other replications, and is thus worth discussing, especially as I may ironically have played a role in bringing it about.

1. Yelling while tied to the mast

An emerging difficulty in the replication literature lies in deciding which predictions matter for any given study. Ebersole et al. dropped the baseline condition from our three-cell design, and set out to test a difference between the two experimental cells. Our original prediction was that the opportunity to disagree with sexist statements phrased as “Most women…” would give respondents moral credentials to later favor a man, whereas items phrased as “Some women…” would not afford this opportunity. But when Charlie Ebersole originally contacted me, I was quick to stress that we had observed an unexpected interaction with gender: only male participants showed a simple effect of credentials. My concern was that ignoring this moderation in the replication might
conceal an effect among men. Accordingly, their “Formal Analysis Plan” document stipulates that “The primary effect of interest for this replication is the (…) interaction, with an expected difference between conditions among males only.” In the 2001 paper, we had attributed this interaction, unpredicted by our theory, to the threatening nature of the manipulation for women (perhaps over-interpreting normal variability in observed effects; Tversky & Kahneman, 1971). We revised the manipulation in our Study 2, and the interaction went away.

Ebersole et al. failed to replicate our observed interaction by gender (listed as one of their 10 target effects in Table 3), but do find a statistically robust main effect of credentials (listed as an “added effect”). After originally clamoring that they should be testing for the interaction, I naturally insisted once the results came in that the main effect was the central theory-testing prediction... Fortunately Ebersole et al., like Odysseus' sailors, stuck with their originally agreed-upon prediction and kept on rowing while I, tied to the mast, pleaded with them to revise their course towards the Seductive Sirens of Significance. Theirs was clearly the right choice here, but it does raise issues for future replications: should replicators attempt to reproduce originally obtained patterns of data, or a priori predictions? Had I initially insisted on the latter, the present results would seem perfectly supportive of our theory, but ironically my insistence on the interaction yields an apparent failure to replicate.
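To make the two candidate target tests concrete, here is a minimal sketch in Python with simulated data (my own illustration, not Ebersole et al.'s analysis code): in a 2 (credentials) × 2 (gender) between-subjects design, the theory's a priori prediction corresponds to the credentials main effect row of the ANOVA table, whereas the pre-registered target of the replication was the interaction row. The variable names, cell size, and simulated effect size below are all hypothetical.

```python
# Minimal sketch (simulated data, not the original analysis): two candidate
# "target tests" in a 2 (credentials) x 2 (gender) between-subjects design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n_per_cell = 250  # hypothetical cell size

df = pd.DataFrame({
    "credentials": np.repeat(["some_women", "most_women"], 2 * n_per_cell),
    "gender": np.tile(np.repeat(["male", "female"], n_per_cell), 2),
})
# Simulate a small credentials effect that is the same for men and women.
df["favor_man"] = (
    0.15 * (df["credentials"] == "most_women").astype(float)
    + rng.normal(0, 1, len(df))
)

model = smf.ols("favor_man ~ credentials * gender", data=df).fit()
table = anova_lm(model, typ=2)

# The credentials row is the theory's a priori prediction (main effect);
# the credentials:gender row is the pre-registered target (interaction).
print(table.loc[["credentials", "gender", "credentials:gender"]])
```

Declaring one of these two rows the criterion for "successful replication" in advance is exactly the choice at issue here.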
2. Are Ebersole et al.'s results actually stronger than ours?

As I recovered from kicking myself, a second, more encouraging irony crept in: that the present replication might provide stronger support than the 2001 demonstration. After all, our original gender
interaction resulted from the fact that only men showed a significant effect of credentials, t(54) = 3.3, p = .002, while women did not, t(61) = .40, p = .69. In this replication, the effect replicates for men, t(1058) = 2.03, p = .04, but now as a bonus it also obtains for women, t(2072) = 3.76, p = .0002, yielding no interaction. If we understood the effect to be replicated as the simple effect of credentials among men, it thus would seem to replicate, with the simple effect among women being gravy.

So are these replication results ironically stronger support for our predictions than our original data? While this would be a seductive conclusion, a closer look suggests otherwise. The observed effect sizes in the replication for both men (d = .12) and women (d = .17) are quite small, and in fact on par with what we observed for women in the original study (d = .10), not with the large effect size observed for men (d = .87). Rather than women looking more like men, the interaction apparently disappeared because men in the replication looked more like the women in the original. However, we also find that this small effect size for women, originally dismissed as unreliable, now appears robust and significant given adequate statistical power.

3. So did our effect replicate?

As Ebersole et al. knowingly discuss, the answer is not straightforward. It crucially depends on what is being targeted for replication:

(1) The Gender × Credentials interaction? With an observed F of .0004 (Table 3), this would qualify as a complete failure to replicate. Under the null hypothesis, the expected value of an F ratio is approximately 1, not 0, so this vanishingly small F actually suggests an uncanny similarity between the simple effects here.

(2) The large simple effect for men reported in the 2001 paper? Not by a long shot: the observed effect size for men in the replication (d = .12) looks nothing like the original report (d = .87). This puts us in good company, as Ebersole et al. similarly only observe an effect size of d = .09 for the availability heuristic (original d = .82; Tversky & Kahneman, 1973). It also suggests that our early demonstration is a statistical outlier, as pointed out by Blanken, van de Ven, and Zeelenberg (2015, esp. Figs. 1 and 2) in their meta-analysis, which finds an average d of .31 for a range of manipulations of moral licensing in 91 studies (including ours) with 7397 participants.

(3) The qualitative claim that credentials matter? Here we are on stronger footing. To a more modest degree than initially observed, but statistically significant and independently supported for both men and women, the replication undeniably supports the central claim of our Study 1, as stated in our abstract: “Participants given the opportunity to disagree with blatantly sexist statements were later more willing to favor a man for a stereotypically male job” (Monin & Miller, 2001, p. 33).
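For readers who want to check these numbers, the following back-of-the-envelope sketch (my own, assuming two roughly equal-sized groups per test) approximately recovers the reported effect sizes from the published t values and degrees of freedom via d ≈ 2t/√df, and estimates the per-group sample size needed to detect an effect of d = .17 with 80% power; this makes clear why cells of the original study's size could not reliably detect the effect now observed among women.

```python
# Rough check of the reported effect sizes and of the power needed to detect them.
from math import sqrt
from statsmodels.stats.power import TTestIndPower

# Reported simple effects of credentials: (t value, degrees of freedom).
reported = {
    "original, men":      (3.30, 54),
    "original, women":    (0.40, 61),
    "replication, men":   (2.03, 1058),
    "replication, women": (3.76, 2072),
}
for label, (t, dof) in reported.items():
    # d ~= 2t / sqrt(df) for two roughly equal-sized independent groups.
    print(f"{label:20s} d ~= {2 * t / sqrt(dof):.2f}")

# Per-group n needed to detect d = .17 (the replication's effect among women)
# in a two-sample t test with alpha = .05 and 80% power (roughly 500+).
n_needed = TTestIndPower().solve_power(effect_size=0.17, alpha=0.05, power=0.80)
print(f"n per group needed for d = .17: ~{n_needed:.0f}")
```

With roughly 30 participants per cell, as in the original study, effects of this size would go undetected far more often than not, which is consistent with reading the original d = .87 among men as an overestimate.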
References

Blanken, I., van de Ven, N., & Zeelenberg, M. (2015). A meta-analytic review of moral licensing. Personality and Social Psychology Bulletin, 41(4), 540–558.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., ... Nosek, B. A. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45, 142–152.
Monin, B., & Miller, D. T. (2001). Moral credentials and the expression of prejudice. Journal of Personality and Social Psychology, 81(1), 33–43.
Monin, B., & Oppenheimer, D. M. (2014). The limits of direct replications and the virtues of stimulus sampling [Commentary on Klein et al., 2014]. Social Psychology, 45, 299–300.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105–110.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2), 207–232.