Journal of Behavioral and Experimental Economics 86 (2020) 101525
Classroom experiments as a replication device
Tony So
Economics Division, International Business School Suzhou, Xi'an Jiaotong-Liverpool University, 111 Ren'ai Rd, Dushu Lake Science and Innovation District, Suzhou Industrial Park, Suzhou, Jiangsu Province, China
ARTICLE INFO
Keywords: Classroom experiments; Replication; Induced value theory; Intrinsic motivation; Guessing game
JEL classification: A20, C73, C90, D83

ABSTRACT
A string of failed experimental replications in many disciplines has shed light on the low replicability of published research. There are increasing calls for more replications to be conducted to restore credibility to academic research. Despite this, there are few incentives for researchers to conduct replication studies: they are costly in time and money, and they are difficult to publish given the competitive nature of publication, where journals seek a high degree of novelty and contribution. This paper proposes a low-cost method of replication: conducting replication experiments in a classroom context. As a case in point, we present results from a simple replication of Weber's (2003) "'Learning' with no feedback in a competitive guessing game".
1. Introduction

There has been much recent attention in the social sciences on the replicability of experimental results. In the field of psychology, the Open Science Collaboration (2015) found that of attempts to replicate 100 studies published in top psychology journals, only 39% were successful, and the mean effect size of the replications was half the magnitude of the original studies. In a recent large-scale replication, Camerer et al. (2018) found that 13 of 21 (62%) social science experiments published in Nature and Science between 2010 and 2015 were replicable. The low rate of replicability of empirical research raises questions about its credibility and robustness. Camerer et al. (2016) conducted replications of experimental studies published in the American Economic Review and the Quarterly Journal of Economics from 2011 to 2014, and showed that 11 of 18 replications (61%) were successful. This is in part due to publication bias: significant results from these papers are more than 30 times more likely to be published than insignificant ones (Andrews & Kasy, 2019). In a large-scale attempt to gauge the incidence of successful replication in the pre-existing literature, Maniadis, Tufano and List (2017) found that 43% of 85 replication studies were successful.

The low rates of successful replication could be due to various factors, including systematic factors such as publication bias and results being false positives; the innocent occurrence of errors in the data or analyses; and questionable research practices such as p-hacking and outright fraud. Some of these factors can be identified through a fact-checking exercise of 'reproduction': using the same data (or source data) of an original study, and the author-provided code, to see whether the original findings can be reproduced by a third party. This 'reproduction' serves not only as a check of procedural correctness, but also as a check that the data or results have not been falsified to yield 'interesting' findings that make a paper more attractive for publication. On this note, there have been longstanding concerns about the reproducibility of results in economics (see, for example, Dewald, Thursby & Anderson, 1986; McCullough, McGeary & Harrison, 2006).

This notion of 'reproduction' should be distinguished from that of 'replication'. Disciplines that study human behaviour, such as psychology, behavioural and experimental economics, and the social sciences, rely on methodology which utilises human participants. Since people are highly heterogeneous in their upbringing, values, social networks, etc., replication carries a different meaning. Replication seeks to identify the set of parameters which are necessary for a particular finding to be confirmed, allowing for a better understanding of the generalizability of the original finding (see Schmidt, 2009). As we discuss below, the notion of replication itself is subject to various confounds, so one should not take successful replication for granted. Most surprising are not the rates of (un)successful replication per se, but rather the small number of replication attempts.
This paper focuses specifically on the field of experimental economics, and the unique challenges of replication in this discipline. Following the methodology of Makel, Plucker and Hegarty (2012), Maniadis et al. (2017) find that only 4.2% of published experimental studies in the field of economics attempt to replicate pre-existing findings. While the low incidence of replication studies is concerning, it is not at all surprising, due to two factors. Firstly, replicating studies – successful or not – are difficult to publish, since they are perceived to provide little contribution over and above the original.1 Moreover, there have been reports of rivalry and aggression between replicators and original authors (Bohannon, 2014; Gertler, Galiani & Romero, 2018). Secondly, replication requires significant resources – namely time and money – which could be utilised on projects with better publication prospects. This is especially true in the field of experimental economics, where it is protocol for participants to receive monetary payment. Studies are usually not considered for publication if the experiments are not incentivised.

This paper proposes a simple solution which aims to increase the incidence of replication: conducting classroom experiments as a means of replication. Classroom experiments have been widely advocated as a pedagogical tool to provide students deeper insights into various economics topics (Dixit, 2005; Holt, 1999; Shubik, 2002). If an original experimental study is adaptable to the classroom setting, classroom replication provides several advantages. It overcomes a key barrier to replication since classroom experiments do not require subject payment. Furthermore, since many experimental economists already carry out classroom experiments as part of their teaching duties, conducting a replication imposes little additional time and effort.

Replication in a classroom setting is not without confounds. The biggest advantage of non-payment is also likely to be its biggest drawback. Experimental economics is founded on the premise that people's decisions are guided by their preference for larger experimental payment, so non-payment raises the question of why people would make sensible, consistent and incentive-compatible decisions at all. Later on, we propose that "psychological incentives", including hypothetical payments and intrinsic motivation, play an important role in mitigating these concerns.

We follow up with findings from a classroom replication of Roberto Weber's (2003) "'Learning' with no feedback in a competitive guessing game", both with and without monetary incentives offered to student participants. We have two main findings. Firstly, we are unable to replicate the main finding of Weber's paper in the classroom setting. We speculate that this could be explained by differences in participants' cognitive ability. Secondly, we find empirical regularities between the incentivised and unincentivised classroom experiments – suggesting that monetary incentives play a much smaller role than many would expect. This could be seen as evidence which validates the use of classroom experiments as a replication device.

In the sections that follow, we discuss the notion of replication, elaborate on the idea of low-cost classroom experiments, and discuss potential concerns regarding non-incentivised experiments. We follow up by discussing the classroom replication that we conduct, and potential factors which may have contributed to our failed replication.
2. What is replication?

Replication serves two important purposes. Firstly, to verify the truth associated with a particular finding. This checks for the accuracy and correctness of the finding: whether any errors have been made, and/or whether the finding is a false positive or results from selective sampling, reporting or publication (Di Tillio, Ottaviani & Sørensen, 2017). Secondly, replications identify the extent to which the finding is indeed true, and identify the set of boundary conditions under which the result will hold or not. In this paper, we confine our attention to the latter.

To understand what constitutes a replication, we first conceptualise an experiment as a bundle of characteristics and procedures (see Lancaster, 1966), which lead to a particular finding when combined together. An experiment E is therefore:

E = {K, C, F}

K = (K1, K2, …, Ki, …, KI) is the vector of key attributes and characteristics, such as the features that comprise a treatment comparison, which are thought to be strictly required to bring about the finding F. C = (C1, C2, …, Cj, …, CJ) represents other features of the experiment which are not thought to influence the finding F, but are nevertheless required to conduct the experiment.2 Procedures, the sample, and characteristics of the sample and experimenters are some examples of these C characteristics. The array of K and C characteristics comprises an experiment, which leads to the finding F. This finding F may or may not represent the underlying 'truth'.

Drawing from Schmidt (2009), a replication is only required to re-establish the array of necessary characteristics K which featured in the original. A perfect replication, whereby the replication is identical to the original in every aspect of both K and C, is not useful as it has no replicating value (Schmidt, 2009). To give an analogy, consider an experiment which is conducted in front of a mirror: the same experiment is run in the same time-space in the same environment with the same subjects – each with the same state of mind. All mirror-copy participants make the same decisions as their actual counterparts, and the analysis subsequently yields the same finding as the original. What do we learn from this perfect mirror-copy replication in terms of the boundaries of robustness? Nothing.

Since experiments are designed with the principle of ceteris paribus in mind, it is only natural to expect replications to embrace this principle by changing only one of the C characteristics. Whether the replication brings about the finding F or not, the confirmatory or conflicting finding can be attributed to this single change of characteristic. In practice, however, a true ceteris paribus replication is impossible. Using a different subject pool would itself violate ceteris paribus, since subjects themselves can be decomposed into various characteristics – such as age, gender, ethnicity, cultural norms, language, income, etc. Likewise, if the replication is conducted by different experimenters from the original, this would be an additional source of variation. Despite this, we argue that it is not necessary for replications to fully embrace the principle of ceteris paribus, as long as a large enough number of replicating studies is conducted. The value of replication comes from the number of times replications are conducted with the key characteristics K. Coffman and Niederle (2015) show that the underlying truth behind a finding can emerge from three to five replications. Consider the following original experiment E and a replication R1:

E = {K, C1, C2, C3, C4, F}
R1 = {K, C1′, C2′, C3, C4, F′}

The replication R1 differs in characteristics 1 and 2 from the original, and leads to a different finding F′. What do we make of this single replication? Since this is not a clean ceteris paribus comparison, we do not know whether the discrepant finding is due to characteristic 1 or 2, or both. However, the interpretation changes with a greater number of replicating studies:

R2 = {K, C1′, C2, C3, C4′, F′}
R3 = {K, C1′, C2′, C3′, C4, F′}
R4 = {K, C1′, C2, C3′, C4, F′}

When there are several different ceteris non paribus replications of the same experiment, each with a different set of C characteristics from the original, but with a consistent finding F′, there will be increased confidence that the original finding F does not represent the underlying truth – that is, that F is not driven by the key features K alone. In other words, K was mis-specified: the original finding F might have been jointly affected by some C characteristics too. In this example, C1 should instead have been an element of K, and the replications should have controlled for this characteristic.

While Schmidt (2009) believed that a replication only requires the common set of key characteristics K to be identical to the original study, a practical problem arises: one does not know the composition of these key characteristics. One does not know upfront whether a given characteristic should be classified as K or C – whether a characteristic or procedure of an experiment is one that drives the findings or not. While the replications Ri in the above example sought to keep the key characteristics K the same, what are these K characteristics? For example, if the finding is specific to a particular sample, the sample should be considered a key characteristic Ki, but one does not know whether or not the finding is sample-specific when the experiment only draws from a single sample. Replication with different parameters would be able to identify the elements that constitute K, identifying whether the finding is robust to a wider, or narrower, set of characteristics than originally thought.

The value of replication comes from how many times an original study is replicated. As the number of replications of one particular study increases, the average finding from the replications converges to the underlying truth (see the illustrative sketch below). This suggests that one should be very careful when interpreting and making judgements about the underlying truth based on a single replication failure, or indeed a large number of failures based on a single replication of many different studies – as in the Open Science Collaboration (2015).

1 To this end, many authors have called for journals to embrace replications, and for the creation of new outlets which focus on replication – such as the Journal of the Economic Science Association. See Coffman, Niederle, and Wilson (2017).
2 To keep this conceptualisation simple, we refer to a singular finding. Naturally, an experiment could yield various findings. The key insights made in this section do not change with multiple findings.
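The convergence argument above can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes each replication measures the underlying truth plus independent noise contributed by its own idiosyncratic C characteristics and sampling error, and shows the running average across replications settling near the truth.

import numpy as np

rng = np.random.default_rng(0)

TRUE_EFFECT = 0.5        # the underlying truth behind finding F (assumed value)
N_REPLICATIONS = 5       # Coffman and Niederle (2015): three to five replications

def run_replication():
    """One ceteris non paribus replication: the measured effect is the true
    effect plus noise from its own C characteristics and sampling error."""
    noise_from_C = rng.normal(0, 0.3)    # different C's in each replication
    sampling_error = rng.normal(0, 0.2)
    return TRUE_EFFECT + noise_from_C + sampling_error

estimates = [run_replication() for _ in range(N_REPLICATIONS)]
for i, est in enumerate(estimates, start=1):
    print(f"replication {i}: estimate = {est:.2f}, "
          f"running average = {np.mean(estimates[:i]):.2f}")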
3. Classroom experiments

3.1. Frequency of replication

The power of replication comes from its frequency. However, as discussed earlier, the incidence of experimental replication is low: the incentives to replicate are weak, while replication is costly at the same time. Given this, how can the incidence of replication be expected to increase? On the premise that replications will be imperfect however they are carried out, due to unavoidable violations of ceteris paribus, we propose classroom experiments as a means of replication to make the most of these imperfections.3 Classroom experiments do not require subject payment, especially when students do not expect to be paid when they show up to class.4 This reduces the monetary cost associated with conducting a replication, and as a result should increase its incidence.

The reduced cost of replication would also solve a second-order problem: only studies that are deemed important, surprising and/or are published in top journals warrant replication. This is highlighted by the fact that, for example, Camerer et al. (2016) decided to replicate papers from the top general-interest economics journals, the American Economic Review and the Quarterly Journal of Economics, as opposed to papers from field journals such as Games and Economic Behavior or Experimental Economics. The majority of published studies would be neglected – undermining the very purpose of replication itself. The prioritisation of scarce financial resources gives rise to this pecking order, but the issue would be lessened if replication did not require monetary outlay.

3.2. Non-payment

The obvious rebuttal to classroom replication experiments is the notion of non-payment. Monetary payment is required in economic experiments to ensure that experimental subjects make consistent, reliable and incentive-compatible decisions, rather than deciding at random. Vernon Smith's Induced Value Theory (1976, p. 275) states that: "Given a costless choice between two alternatives, identical except that the first yields more of the reward medium (usually currency) than the second, the first will always be chosen (preferred) over the second, by an autonomous individual, i.e., utility is a monotone increasing function of the monetary reward, U(M), U′ > 0." This implies that experimental subjects will make choices consistent with making more money if their objective is to maximise their monetary earnings.5

We note, however, that Smith referred only to a "reward medium" rather than money specifically – with "currency" as an example. If decision alternatives yield non-monetary utility, and this non-monetary preference ordering is consistent with that in the presence of a monetary reward, then this non-monetary utility would substitute for monetary payment. We propose that psychological incentives could potentially act as a reward medium that ensures incentive-compatibility. If people are intrinsically motivated (Bowles & Polanía-Reyes, 2012; Deci, Koestner & Ryan, 1999; Festré & Garrouste, 2015; Frey & Jegen, 2001; Gneezy, Meier & Rey-Biel, 2011) and want to do as well as they can, then their hypothetical monetary earnings – which are not paid out – could be interpreted as a 'score' which measures their performance in the experimental game. Maximising these hypothetical payments would ensure that subjects' decisions are analogous to those they would make if they were paid. In a review of 74 experimental studies on the effect of different monetary incentives, Camerer and Hogarth (1999) find that the most common finding is that different monetary stakes have no effect on performance across a range of experimental tasks.6

3 Classroom experiments have been widely documented as a useful pedagogical tool (Dixit, 2005; Holt, 1999; Shubik, 2002) which assists with teaching and provides students a hands-on understanding of the mechanisms and processes that underpin economic models. Classroom experiments have been found to improve students' academic achievement, interest and knowledge retention (Durham, McKinnon, & Schulman, 2007; Emerson & Taylor, 2004; Frank, 1997).
4 Classroom experiments also help to mitigate a potential self-selection problem with the standard subject recruitment procedure associated with lab experiments, where participation is voluntary. In the classroom experiment, all students participate in class – although there will nevertheless be selection in the sense that some students do not attend class. Exadaktylos, Espín, and Brañas-Garza (2013) provide an in-depth discussion of self-selection effects in economic experiments. We thank an anonymous referee for raising this point.
5 In a study by Veszteg and Funaki (2018), 57% of subjects report that their objective when playing 2 × 2 matrix games is to maximise monetary payoffs. 27% of subjects play according to some other objective, unrelated to payoffs. The 57% share of money-maximisers seems much lower than what Smith's Induced Value Theory would presume.
6 Camerer and Hogarth's (1999) review of the role of monetary incentives makes a number of more nuanced points: for example, that monetary incentives reduce the variance of experimental decisions, reduce "presentation" effects (e.g. warm glow, generosity, risk seeking), and that any effects are task dependent. Our main thesis that unincentivised classroom experiments could potentially work as a means of replication needs to be interpreted against this background, and the reader should exercise judgement to see whether they are viable given the specific task at hand.
Furthermore, DellaVigna and Pope (2018) show that various types of psychological incentives are effective in soliciting effort in a real-effort task, though monetary incentives are even more effective. For example, if people care about favourable relative comparisons or 'status' (Card, Mas, Moretti & Saez, 2012; Kuhn, Kooreman, Soetevent & Kapteyn, 2011), then their behaviour will be driven by the objective of outperforming their peers. This behaviour could well be present, though rarely discussed, in prior studies. For example, the classic guessing game (Nagel, 1995) pays winners a fixed prize, so if subjects take a win-loss frame, it is plausible for them to choose smaller numbers over time just out of the utility derived from winning (Charness, Masclet & Villeval, 2014; Kräkel, 2008; Kuhnen & Tymula, 2012), independent of the monetary reward associated with it.

The existence of hypothetical payments is most apparent when we consider classroom games used to support teaching. Without monetary incentives in place for students to behave in an incentive-compatible manner, why would they do so? Anticipating this, why would teachers run classroom experiments at all if they could not reasonably expect students to behave in the way they are expected to? In fact, the early market experiments by Chamberlin (1948) and Smith (1962) were unincentivised classroom experiments – both of which pre-date Induced Value Theory.7

We are not alone in suggesting that monetary payment may not be required to derive meaningful findings. Rubinstein (1999) analysed responses from pre- and post-class problem sets that were given to students in his game theory class and found that responses in a variety of games were consistent with those of incentivised, controlled lab experiments. This was true even in ultimatum bargaining games,8 where money takes a central role in the interpretation of the game itself.9 Rubinstein subsequently questioned the role of monetary payment in experiments.

The most convincing validation of classroom experiments, not only as an empirical tool but as a replication device, comes from a large-scale study by Lin et al. (2018). Lin et al. (2018) use data from over 2000 classroom experiments conducted in 7 different countries using the MobLab experimental platform. With approximately 10,000 observations in both ultimatum and double auction experiments, they find that classroom experiments are able to replicate the widely accepted findings of each of these classic experiments. Similar to the main thesis of this paper, Lin et al. (2018) also advocate the use of classroom experiments "to serve as another venue of replication" (p. 11).

7 See also the classroom market experiments by Bergstrom and Kwok (2005).
8 See also Tompkinson and Bethwaite (1995), who conduct ultimatum games using unincentivised questionnaires and still manage to replicate generic ultimatum game findings.
9 In another task where money plays a central role, Brañas-Garza and Prissé (2020) find that time preferences are unaffected by whether their elicitation is incentivised with money or not.

4. Replication of Weber (2003)

4.1. Design and procedures

Before we describe the study which we ultimately replicate in the classroom environment, we first discuss the criteria which we applied when selecting which study to replicate. The main inclusion criterion is that replication is feasible within the environment and context of the classroom. One of the constraints is therefore time: each tutorial class is scheduled for a single hour, within which the entire experiment needs to be completed. The second major constraint is implementability. Many experiments nowadays use customised software which may be difficult to access in the classroom environment. For these reasons, a simple experiment is desirable. A third consideration is that the classroom experiment should deliver educational value to students, given the teaching context.

For these reasons, we decided to replicate Weber (2003), a 10-round guessing game10 that experimentally manipulates whether end-of-round feedback is provided or not. The guessing game is easy to explain, the experiment is relatively short, and it is easily implementable using pen-and-paper procedures.11 Furthermore, the guessing game was chosen as it allows for critical discussion of several common misunderstandings of game theory amongst students. Weber posits and shows that learning can occur in Nagel's (1995) guessing game even in the absence of round-by-round feedback on others' guesses: the average and the 'target' number. He hypothesises that learning may occur through repeated experience with an environment or a set of procedures, even in the absence of feedback. This is counter-intuitive from the perspective of most learning models (for example, Roth and Erev's (1995) reinforcement learning), which assume that people update their beliefs and/or adjust their decisions based on the feedback provided to them. If people have no prior feedback to anchor their beliefs on, it is not clear how inexperienced subjects form their beliefs and make decisions in subsequent rounds. Strategic uncertainty should persist in the absence of feedback. Since subjects' information set is identical round after round, one possibility is that they guess the same number repeatedly. Another possibility is that players randomise their guesses, hoping to win by chance.

We deem a replication to be successful if it yields statistically significant findings in the same direction as the original study, even if the effects are smaller in magnitude. Firstly, this lowers the bar for successful replication, being charitable to the original study. Secondly, in light of the replication framework presented in Section 2, where replicating studies are necessarily ceteris non paribus, one should not expect the findings of a successful replication to be exact point estimates of the original. In the context of our classroom replication of Weber, a successful replication should find that a) there is evidence of learning when feedback is provided after each round – that guesses decrease over rounds; and b) there is evidence of feedback-free learning. The latter is the main finding of Weber, and is the main one to be replicated. The former serves as a validity check, since it is the standard finding in generic guessing games.

In Weber's guessing game, groups of 8–10 players simultaneously choose a number in the closed interval [0, 100]. The winner is the player who has chosen the number closest to two-thirds of the average of these numbers, and receives a monetary reward of $6, while everyone else receives nothing. The game has a unique Nash equilibrium in which all players select 0. The game is played for 10 rounds. In Weber's Control treatment, participants are provided feedback on the average guess, 2/3 of the average, and the winners' participant numbers after each round. None of this feedback was provided in Weber's No-Feedback No-Priming treatment.12 The main finding is that learning occurs in the No-Feedback No-Priming treatment even in the absence of feedback, albeit at a rate slower than in the Control treatment where such feedback is provided. Weber's findings are illustrated in Fig. 1, a reproduction of Fig. 1 of Weber (2003, p. 139) with original data kindly provided by Weber.
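For concreteness, the round mechanics described above can be written in a few lines. The sketch below is ours rather than part of Weber's or our experimental materials; the function and the example guesses are purely illustrative.

def play_round(guesses):
    """One round of the two-thirds guessing game on [0, 100].
    Returns the average, the target (2/3 of the average), and the
    indices of the winner(s): the guess(es) closest to the target."""
    average = sum(guesses) / len(guesses)
    target = (2 / 3) * average
    best_distance = min(abs(g - target) for g in guesses)
    winners = [i for i, g in enumerate(guesses) if abs(g - target) == best_distance]
    return average, target, winners

# Example: a hypothetical group of eight players
guesses = [50, 33, 22, 67, 10, 45, 30, 25]
avg, target, winners = play_round(guesses)
print(f"average = {avg:.1f}, target = {target:.1f}, winners = players {winners}")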
10 See Nagel (2008) for a review.
11 Weber's study was also implemented with pen-and-paper. The original instructions were kindly provided by Weber upon request.
12 Weber (2003) also includes No-Feedback Low-Priming and No-Feedback High-Priming treatments. For parsimony, these are not discussed, as they are not strictly necessary to support Weber's main thesis.
Fig. 1. Median guess in Weber (2003).
This result is corroborated by experiment 2 in Rick and Weber (2010), where there is learning in both the Feedback and No-Feedback treatments, but learning occurs more rapidly in the presence of feedback. In this regard, Rick and Weber may be considered a replication13 of Weber (2003), using a different subject pool.

The classroom replication was conducted with undergraduate students in an Intermediate Microeconomics class at Xi'an Jiaotong-Liverpool University.14,15 The study was conducted using pen-and-paper during students' weekly tutorial, where the experimenters split the tutorial groups into subgroups of 6–11 students (average 9.1), depending on the number of students who showed up to tutorials.16 The instruction and record sheets, adapted from Weber's original materials, were handed out to each student.17 Instructions were read out loud, and students were provided an opportunity to ask clarifying questions before the experiment began. The instruction and record sheets are in the Appendix.18

Similar to Weber (2003), students recorded their guess on the record sheet for the first round. After students had chosen their guess for the round, the experimenter recorded this number for each student. In the 'Feedback' (Weber's Control) condition, students were informed of the average guess, two-thirds of the average guess and the winner(s) at the end of each round. In the 'No Feedback' (Weber's No-Feedback No-Priming) condition, students were asked to proceed to the next round without receiving any of the aforementioned feedback. The game proceeded for 10 rounds. In the No Feedback condition, groups were provided the entire stock of feedback at the end of round 10.

We ran two versions of each of the two conditions: one version without monetary payment in October 2017, and an incentivised version in September 2018 with the following cohort of the same class. The latter aims to alleviate potential concerns associated with the unincentivised classroom experiments. In the incentivised classroom experiments, all students received a show-up fee of 25 RMB.19 The winning 'prize' was 20 RMB in each round, split equally if there were joint winners in a particular round.20 In the No Feedback condition, the identity of these winners was not revealed until the end of the final round.

Our classroom replication experiment therefore consists of a 2 × 2 design, manipulating both feedback (F and NF, for Feedback and No Feedback respectively) and incentives (P and NP, for Payment and No Payment respectively). We also supplement the forthcoming analyses with data from Weber's original study (W). For ease of exposition, we refer to our four replication treatments as F-NP, NF-NP, F-P and NF-P, and to Weber's incentivised treatments with and without feedback as F-P-W and NF-P-W respectively.21 Our replication consists of 193 participants in 22 groups spread across the four replication treatments; this compares to 56 participants in 6 groups in the corresponding treatments in Weber's original.22
13 Rick and Weber (2010) focuses on a different research question from Weber (2003), but the essence of the design is similar enough in both studies to allow meaningful comparisons as a replication.
14 The classroom experiments were conducted with the purpose of making it easier to learn game theory, and this was indeed their primary purpose. Students were unaware that they were participating in a replication study.
15 Xi'an Jiaotong-Liverpool University is the largest joint-venture university in China. Its students are mostly Chinese, but all teaching is conducted in English. The classroom experiment, accordingly, was conducted in English.
16 We aimed to have groups of 8 to 10 players as in Weber (2003), but in some instances this was not possible due to the number of people who actually turned up to the tutorial sessions. Tutorials are compulsory, but attendance is not enforced, so non-attendance is commonplace. In lab experiments it would be possible to turn participants away, but this would not be ethically possible in the classroom setting. Ex post, we show that group size does not affect our replication findings.
17 Two small adjustments were made. First, we added a restriction so that subjects only chose integer numbers. This was made to make the pen-and-paper experiment procedurally simpler – we do not believe it has any impact on the findings. Second, references to money were removed in the non-incentivised treatments.
18 Each record sheet is identified only by a unique, but meaningless, 'player id' number. Students do not write their names on the record sheet, nor provide any personal information which may identify them. This ensures anonymity, and such anonymity mitigates concerns around experimenter demand effects in trying to impress the teacher with their responses.
19 To put the payoffs into perspective, 20–25 RMB was sufficient for a standard meal at the time of the experiments.
20 Student participants were not informed about the game and the potential monetary payment they would receive before arriving at the tutorials.
21 The relevant data that we use from Weber's original study, F-P-W and NF-P-W, were referred to in the original study as the "Control" and "No Feedback No Priming" treatments respectively.
22 We conduct a two-sample power analysis to give an indication of how large our sample should be, using first-round mean guesses and standard deviations from Weber's study; these are reported in Table 1. We justify the use of first-round statistics since Weber's study (and this replication) focuses on learning across rounds: in this sense, the first-round means and standard deviations are disentangled from the learning which might or might not occur. Furthermore, power analysis is usually an a priori tool to determine the required sample size, but it requires information about means and standard deviations which are not available a priori. In the spirit of a priori analysis, we input the observed means and standard deviations of Weber's treatments into the power calculation. A two-sample power analysis with means (st. dev.) of 24.6 (17.5) in Weber's Feedback treatment and 33.4 (20.6) in the No Feedback treatment indicates that we need a sample size of 152 (76 in each treatment) to attain an α of 5% and 80% power. We have 193 participants in total, so our replication should have sufficient power.
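The sample-size calculation in footnote 22 can be checked from the reported summary statistics alone. A minimal sketch, assuming a two-sided two-sample t-test, Cohen's d computed with the pooled standard deviation, and the availability of statsmodels:

import numpy as np
from statsmodels.stats.power import TTestIndPower

# First-round statistics from Weber (2003), as reported in Table 1
n_f, mean_f, sd_f = 26, 24.6, 17.5       # Feedback (Control)
n_nf, mean_nf, sd_nf = 30, 33.4, 20.6    # No-Feedback No-Priming

# Cohen's d with the pooled standard deviation
pooled_var = ((n_f - 1) * sd_f**2 + (n_nf - 1) * sd_nf**2) / (n_f + n_nf - 2)
d = abs(mean_nf - mean_f) / np.sqrt(pooled_var)

n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(f"d = {d:.2f}, required n per treatment = {np.ceil(n_per_group):.0f}")
# about 76 per treatment (152 in total), consistent with footnote 22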
Table 1
Mean first round guess.

                   Replication – No Payment            Replication – Payment               Weber (2003, p. 138) – Payment
                   n (groups)  Mean (sd) 1st round      n (groups)  Mean (sd) 1st round      n (groups)  Mean (sd) 1st round
Feedback           42 (5)      F-NP: 44.4 (24.1)        55 (6)      F-P: 40.9 (20.6)         26 (3)      F-P-W: 24.6 (17.5)+
No Feedback        39 (5)      NF-NP: 40.5 (18.5)       57 (6)      NF-P: 36.7 (21.1)        30 (3)      NF-P-W: 33.4 (20.6)+
Total              81 (10)                              112 (12)                             56 (6)
F vs NF                        |t| = 0.82, p = 0.41                 |t| = 1.08, p = 0.28                 |t| = 1.70, p = 0.094

|t| and p compare first-round guesses between the Feedback and No Feedback treatments within each sample.
+ Statistics not reported in Weber (2003), but calculated with author-provided data.
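The pairwise first-round comparisons in Table 1 can be reproduced directly from the summary statistics, assuming equal-variance two-sample t-tests (consistent with the reported values). A sketch:

from scipy.stats import ttest_ind_from_stats

# Replication, No Payment: F-NP vs NF-NP (values from Table 1)
t, p = ttest_ind_from_stats(mean1=44.4, std1=24.1, nobs1=42,
                            mean2=40.5, std2=18.5, nobs2=39)
print(f"F-NP vs NF-NP:   |t| = {abs(t):.2f}, p = {p:.2f}")   # roughly 0.8, p = 0.41

# Weber (2003): F-P-W vs NF-P-W
t, p = ttest_ind_from_stats(mean1=24.6, std1=17.5, nobs1=26,
                            mean2=33.4, std2=20.6, nobs2=30)
print(f"F-P-W vs NF-P-W: |t| = {abs(t):.2f}, p = {p:.3f}")   # roughly 1.70, p = 0.094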
Fig. 2. Median guess in non-incentivised replications.
The classroom replication took approximately 40 minutes to complete. Summary statistics are presented in Table 1. Table 1 shows that the first-round mean guess in the replication sample tends to be higher than in Weber (2003). Amongst other explanations, this could reflect a difference in the cognitive abilities of the student samples. Interestingly, there is also a marginal difference in the first-round guesses between treatments in the original, but not in the replication.

4.2. Findings

4.2.1. Non-incentivised replication treatments

Fig. 2 shows the median guesses in the non-incentivised treatments – F-NP and NF-NP – across time. Comparing Figs. 1 and 2, it is clear that there are a few discrepancies. Firstly, when feedback is provided, convergence towards the Nash equilibrium of 0 occurs in both the original and the replication. However, the speed and depth of learning are both different – convergence to 0 occurs within four rounds in the original, but in the replication the median guess only reaches 16 by round 10.23 Although the speed and depth of learning differ, some degree of learning nevertheless occurs in the replication.

The observation of learning in the F-NP treatment in our replication is important for a number of reasons. First, it is consistent with both Weber and findings from a typical guessing game. This should mitigate concerns related to the notion of classroom replication. Second, this familiar finding is derived in the absence of monetary payment, supporting the notion that monetary payment may not strictly be required to attain meaningful results. Given that the treatment is unincentivised, it seems unlikely that such monotonically decreasing guesses occur by chance. Most likely, students have some latent objective, independent of money.

Our second finding is that we observe no learning when feedback is suppressed. In fact, from Fig. 2, the median guess in the NF-NP treatment is identical (37) in rounds 1 and 10 (Wilcoxon signed-rank test: p = 0.696). Regression results presented in Table 2 tell a similar story. When feedback is suppressed, there is no significant learning in the NF-NP treatment – and this is robust to group and session effects. These regressions do not substantially differ if we allow for non-linearities in the trend with a quadratic term (see Appendix Table A1).

23 In Rick and Weber (2010, p. 723, Fig. 1a), convergence in the Feedback treatment occurs at a slower rate than in Weber (2003), although learning is still deeper than what is observed in our replication.
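The round 1 versus round 10 comparisons reported in this subsection (and in Section 4.2.2) are within-subject Wilcoxon signed-rank tests. A minimal sketch with invented guesses, assuming one paired observation per subject:

from scipy.stats import wilcoxon

# Hypothetical round-1 and round-10 guesses for the same nine subjects
# (values are illustrative, not the replication data)
round_1  = [37, 50, 20, 66, 42, 45, 28, 70, 33]
round_10 = [36, 47, 26, 58, 40, 33, 39, 55, 29]

stat, p = wilcoxon(round_1, round_10)   # paired, within-subject test
print(f"signed-rank statistic = {stat}, p = {p:.3f}")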
Table 2
Random effects regressions. Dependent variable: guess.

                         (1)               (2)               (3)
F-NP                     (reference)       (reference)       (reference)
F-P                      −8.452 (0.009)    −1.291 (0.894)    −24.47 (0.005)
F-P-W                    −27.12 (0.000)    −21.19 (0.020)    −47.63 (0.000)
NF-NP                    −6.914 (0.050)    2.021 (0.803)     −21.20 (0.006)
NF-P                     −6.195 (0.081)    −8.993 (0.362)    −30.06 (0.002)
NF-P-W                   −11.04 (0.016)    −0.129 (0.990)    −26.57 (0.014)
F-NP * Round             −2.422 (0.000)    −2.422 (0.000)    −2.422 (0.000)
F-P * Round              −2.453 (0.000)    −2.453 (0.000)    −2.453 (0.000)
F-P-W * Round            −1.166 (0.011)    −1.166 (0.011)    −1.166 (0.011)
NF-NP * Round            0.392 (0.348)     0.392 (0.350)     0.392 (0.349)
NF-P * Round             −0.582 (0.131)    −0.582 (0.132)    −0.582 (0.131)
NF-P-W * Round           −1.774 (0.000)    −1.774 (0.000)    −1.774 (0.000)
Group Size               −0.692 (0.227)    −1.964 (0.323)    4.645 (0.024)
Constant                 50.74 (0.000)     57.26 (0.000)     17.60 (0.172)
Group Fixed Effects      No                Yes               No
Session Fixed Effects    No                No                Yes
Observations             2490              2490              2490
Subjects                 249               249               249
R²                       0.180             0.217             0.200

P-values in parentheses. Standard errors (clustered by subject) are not reported.
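As a rough guide to how specifications of this kind can be estimated, the sketch below fits a subject-level random-intercept model with treatment-specific round trends using statsmodels. It is illustrative only: the data file and column names are hypothetical, and it does not reproduce the paper's exact estimator (random-effects GLS with subject-clustered standard errors and optional group or session fixed effects).

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per subject and round, with columns
# 'guess', 'round_number', 'treatment' (F-NP, F-P, F-P-W, NF-NP, NF-P, NF-P-W),
# 'group_size' and 'subject'
df = pd.read_csv("guessing_game_long.csv")   # assumed file, not provided

# Random intercept for each subject; treatment dummies, treatment-specific
# round trends, and group size, in the spirit of Table 2, model (1)
model = smf.mixedlm(
    "guess ~ C(treatment, Treatment(reference='F-NP')) * round_number + group_size",
    data=df,
    groups=df["subject"],
)
result = model.fit()
print(result.summary())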
Given that we find learning in the F-NP treatment and no learning in the NF-NP treatment, we subsequently find that there are statistical differences between these two treatments (Table 2, model 1: F-NP*Round = NF-NP*Round: p < 0.001). Since Weber finds evidence of feedback-free learning, we regard this replication as a failed one.

4.2.2. Incentivised replication treatments

In the non-incentivised replication treatments, we have shown that learning occurs with the provision of feedback, but not when feedback is suppressed. Since these findings are not aligned with the original, it is reasonable to ask whether this discrepancy arises as a result of not paying students in the classroom experiments. When people are not paid, they may not have the incentive to exert cognitive effort by thinking about what to play in order to win the game – and such effort would be especially important when feedback is suppressed. In order to alleviate such concerns about the incentive-compatibility of the non-incentivised treatments, we present findings from a parallel pair of incentivised treatments. The incentivised treatments are a more faithful replication of Weber (2003).

Fig. 3 shows the median guesses in the F-P and NF-P treatments across the ten rounds of play. When subjects are paid, learning occurs in the F-P treatment when feedback is provided – mirroring the unincentivised equivalent. From the regressions in Table 2, the average guess in the F-P treatment falls by 2.45 points every round. Fig. 3 also mirrors Fig. 2 in that there is no significant learning in the NF-P treatment. Although there visually appears to be a small negative trend, the regressions in Table 2 indicate that it is small in magnitude and statistically insignificant. Furthermore, a within-subject Wilcoxon signed-rank test shows that round 10 guesses are not significantly different from round 1 guesses (p = 0.393).

We find consistent patterns of learning when feedback is provided and no significant learning in its absence, irrespective of whether the treatments are incentivised or not. This is further confirmed by statistically testing the patterns of learning across the Payment and No-Payment conditions. From the regressions in Table 2, there are no differences in learning associated with incentivisation in either the Feedback (model 1: F-NP*Round = F-P*Round: p = 0.950) or the No-Feedback conditions (model 1: NF-NP*Round = NF-P*Round: p = 0.086). The consistent findings from our classroom experiment with and without monetary incentives suggest that incentive-compatibility should not be a concern for unincentivised classroom experiments.

4.2.3. Confusion

Anecdotally, when running the experimental sessions we observed a high degree of 'confusion' amongst participants when no feedback was provided to them, whether or not they were incentivised. We refer to 'confusion' in the sense that subjects were not too sure of how to 'play' the game. When feedback is provided, subjects can evaluate such feedback and learn from it, clarifying any misconceptions they may have regarding gameplay. However, when feedback was suppressed, subjects did not seem to make large deviations from their previous round's guess – and in many instances chose the same number repeatedly. This makes sense from the perspective of typical learning models. In the absence of feedback, the information set does not change across rounds and strategic uncertainty remains, so guesses should not be too different across rounds.

We provide evidence of this by looking at a simple measure of "misunderstanding". Following Rubinstein (2007), we classify a guess as one based on misunderstanding if it is greater than or equal to 50. Table 3 shows the frequency of these guesses in each treatment, and how they change between rounds 1 and 10. Since players' first-round information set is identical across treatments – before any feedback is provided – the incidence of misunderstanding should not be significantly different across the Feedback and No-Feedback conditions. This is indeed the case, except for the marginal difference between the incentivised F-P and NF-P treatments (rank-sum test: p = 0.078).

When we compare the incidence of misunderstanding in each treatment between the first and last rounds of the game, a clear pattern emerges in the replication treatments. When feedback is provided, learning occurs, and the incidence of misunderstanding falls. This is shown by the stark reduction in misunderstanding between rounds 1 and 10 in both the F-NP and F-P treatments. On the other hand, we observe no change in the incidence of misunderstanding between rounds 1 and 10 in the NF-NP and NF-P treatments, showing no signs of feedback-free learning. This is consistent with what we anecdotally observed as experimenters: strategic uncertainty persists when the information set does not change. Our findings contrast with the incidence of misunderstanding in Weber's sample,24 where the patterns of misunderstanding show learning in the absence of feedback. It is worth noting that the incidence of misunderstanding in Weber's sample does not significantly change between rounds 1 and 10 when feedback is provided – most likely because it was already very low to start with.
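The misunderstanding measure reported in Table 3 is simple to compute. A sketch with invented guesses, using the 50-or-above cut-off from Rubinstein (2007):

def misunderstanding_rate(guesses, cutoff=50):
    """Share of guesses classified as 'misunderstanding' (>= cutoff)."""
    flagged = sum(1 for g in guesses if g >= cutoff)
    return flagged / len(guesses)

# Hypothetical round-1 and round-10 guesses for one treatment
round_1_guesses  = [55, 33, 70, 48, 62, 20, 51, 44]
round_10_guesses = [12, 18, 35, 10, 50, 8, 22, 16]

print(f"round 1:  {misunderstanding_rate(round_1_guesses):.1%} of guesses >= 50")
print(f"round 10: {misunderstanding_rate(round_10_guesses):.1%} of guesses >= 50")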
5. Reconciliation of a failed replication

Our attempt to replicate Weber's (2003) main finding in the classroom has been unsuccessful, in that we were unable to find evidence of feedback-free learning in a standard guessing game. Consistent with our view of replication, whereby we are trying to understand the boundaries of robustness of the original study, we discuss a factor which could reconcile our replication with Weber.

A divergent characteristic of our replication is that we use a different subject pool. While subject-pool differences represent an underlying array of differences, we suspect that there exists a difference in cognitive ability between our replication sample and Weber's (2003) sample of graduate and undergraduate students from the California Institute of Technology.

24 Misunderstanding is not reported in Weber (2003). Here we calculate it with data from the original paper.
Fig. 3. Median guess in incentivised replications.
Table 3
Misunderstanding (guesses ≥ 50).

                      All rounds                   Round 1             Round 10            Signed-rank test (round 1 vs 10)
Replication – No Payment
F-NP (n = 42)         75 of 420 guesses (17.9%)    15 of 42 (35.7%)    5 of 42 (11.9%)     p = 0.012
NF-NP (n = 39)        135 of 390 guesses (34.6%)   13 of 39 (33.3%)    19 of 39 (48.7%)    p = 0.134
Replication – Payment
F-P (n = 55)          51 of 550 guesses (9.3%)     21 of 55 (38.2%)    4 of 55 (7.3%)      p = 0.0002
NF-P (n = 57)         139 of 570 guesses (24.4%)   13 of 57 (22.8%)    13 of 57 (22.8%)    p = 1.00
Original (Weber, 2003)
F-P-W (n = 26)        13 of 260 guesses (5.0%)     2 of 26 (7.7%)      1 of 26 (3.8%)      p = 0.564
NF-P-W (n = 30)       32 of 300 guesses (10.7%)    7 of 30 (23.3%)     2 of 30 (6.7%)      p = 0.059

Prior studies have found that people with higher cognitive ability tend to choose smaller numbers in the guessing game (Brañas-Garza, García-Muñoz & González, 2012; Burnham, Cesarini, Johannesson, Lichtenstein & Wallace, 2009; Gill & Prowse, 2016), and are more likely to predict the cognitive ability of other members in their group (Fehr & Huck, 2016). Different starting conditions will impact the subsequent dynamics of play. While neither Weber nor this replication explicitly measured the cognitive ability of subjects, two simple observations point to the higher cognitive ability of participants in the original study. Firstly, the first-round mean guess (Table 1) is much lower in the original than in our replication. Secondly, there is a lower incidence of first-round misunderstanding in Weber's sample (Table 3). The guess in the first round is unaffected by both the history of play and the (non-)provision of feedback, so differences in first-round guesses point towards differences in cognitive ability.

A second, related factor is language. Students in the replication sample are almost entirely Chinese, while the replication was conducted in English: their second language. The use of English-language instructions could potentially increase their cognitive load, depleting the pool of cognitive resources available when they play the game (see the literature on ego depletion, for example Hagger, Wood, Stiff & Chatzisarantis, 2010). Even if the replication sample is equally matched in terms of cognitive ability, the impairment associated with English-language instructions (relative to instructions provided in the native language) could reduce the level of cognition in the guessing game and bring about a similar effect as previously discussed. Mækelæ and Pfuhl (2019) offer an alternative perspective and null results in this regard.

Each of these explanations could account for the discrepant findings between Weber and this replication: the discrepancy could be due to differences in cognitive ability, or to increased cognitive load associated with language. In terms of boundaries of robustness, we suspect that Weber's findings may require cognitively sophisticated participants, and further attempts at replication may be successful with such subjects.

Nevertheless, one thing that our classroom replications have identified is that monetary incentives do not seem to matter in the guessing game. When feedback was provided to facilitate learning, participants made smaller guesses over time even when they were not incentivised for winning. This would be difficult to explain if people's sole objective were money, adding to recent evidence which suggests that experimental participants have motives other than money (DellaVigna & Pope, 2018; Veszteg & Funaki, 2018).
6. Conclusion

While our classroom replication did not confirm Weber's (2003) finding of feedback-free learning, it should not be interpreted in isolation. As discussed earlier, the power of replications comes from the frequency with which they are carried out. Since Rick and Weber (2010) confirm Weber's (2003) original finding, the balance of evidence is currently in favour of the original: that learning occurs even in the absence of feedback. More replications of Weber – and potentially of our classroom experiments – would be required to tease out the boundaries of robustness, uncovering the underlying truth.

While our classroom replication was unable to replicate the original finding, it has provided some preliminary insights into the boundaries
of its robustness. The findings of the replication highlight possible conditions on which Weber's finding of feedback-free learning may rely. We suspect that subjects with high cognitive ability may be required to replicate feedback-free learning. Further experimentation with different replication parameters would allow us to better understand the extent of its robustness.

We reiterate that the value of replication arises from the number of times replications are independently conducted. The benefits of classroom replication should be seen against the backdrop that it required zero monetary outlay. Our incentivised replications should have alleviated basic concerns about the incentive compatibility of non-incentivised classroom experiments – at least in the domain of guessing games.

Replication in the classroom will face the same constraints as any classroom experiment. First, time is constrained in the classroom, so there is limited scope to conduct experiments that have a sophisticated design or are repeated for a large number of rounds. Second, it may be difficult to implement experiments which are highly customised and/or have complex procedures in the classroom. While this may appear to be a critical limitation of the notion of classroom replication, it reflects a wider issue, as experiments have become increasingly complex. As Camerer et al. (2016, p. 1436) lament, replication can be laborious, and they call for experiments to be designed with replication in mind, not making them needlessly complex. This should facilitate classroom experiments as a replication device.
Acknowledgements

This paper would not have been remotely possible had I not received support from Roberto Weber, who provided original data, instructions, and comments on early drafts. I also thank Ananish Chaudhuri, John Hey and Marek Hudik for useful comments, and Fan Liu, Hao Lan, Marek, Xinyan Cai, Jenny Wang and Yanning Zeng for assistance. No research funding was received for this research. Any errors are my own.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.socec.2020.101525.

References

Andrews, I., Kasy, M., 2019. Identification of and correction for publication bias. American Economic Review 109 (8), 2766–2794.
Bergstrom, T.C., Kwok, E., 2005. Extracting valuable data from classroom trading pits. Journal of Economic Education 36 (3), 220–235.
Bohannon, J., 2014. Replication effort provokes praise – and 'bullying' charges. Science 344 (6186), 788–789.
Bowles, S., Polanía-Reyes, S., 2012. Economic incentives and social preferences: Substitutes or complements? Journal of Economic Literature 50 (2), 368–425.
Brañas-Garza, P., García-Muñoz, T., González, R.H., 2012. Cognitive effort in the beauty contest game. Journal of Economic Behavior & Organization 83, 254–260.
Brañas-Garza, P., Prissé, B., 2020. Eliciting time preferences with continuous MPL. Mimeo.
Burnham, T.C., Cesarini, D., Johannesson, M., Lichtenstein, P., Wallace, B., 2009. Higher cognitive ability is associated with lower entries in a p-beauty contest. Journal of Economic Behavior & Organization 72, 171–175.
Camerer, C.F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al., 2016. Evaluating replicability of laboratory experiments in economics. Science 351 (6280), 1433–1436.
Camerer, C.F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., et al., 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2, 637–644.
Camerer, C.F., Hogarth, R.M., 1999. The effects of financial incentives in experiments: A review and capital-labor-production framework. Journal of Risk and Uncertainty 19 (1), 7–42.
Card, D., Mas, A., Moretti, E., Saez, E., 2012. Inequality at work: The effect of peer salaries on job satisfaction. American Economic Review 102 (6), 2981–3003.
Chamberlin, E.H., 1948. An experimental imperfect market. Journal of Political Economy 56 (2), 95–108.
Charness, G., Masclet, D., Villeval, M.C., 2014. The dark side of competition for status. Management Science 60 (1), 38–55.
Coffman, L.C., Niederle, M., 2015. Pre-analysis plans have limited upside, especially where replications are feasible. Journal of Economic Perspectives 29 (3), 81–98.
Coffman, L.C., Niederle, M., Wilson, A., 2017. A proposal to organize and promote replications. American Economic Review 107 (5), 41–45.
Deci, E.L., Koestner, R., Ryan, R.M., 1999. A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin 125 (6), 627–668.
DellaVigna, S., Pope, D., 2018. What motivates effort? Evidence and expert forecasts. Review of Economic Studies 85, 1029–1069.
Dewald, W.G., Thursby, J.G., Anderson, R.G., 1986. Replication in empirical economics: The Journal of Money, Credit and Banking project. American Economic Review 76 (4), 587–603.
Di Tillio, A., Ottaviani, M., Sørensen, P.N., 2017. Persuasion bias in science: Can economics help? Economic Journal 127 (605), F266–F304.
Dixit, A., 2005. Restoring fun to game theory. Journal of Economic Education 36 (3), 205–219.
Durham, Y., McKinnon, T., Schulman, C., 2007. Classroom experiments: Not just fun and games. Economic Inquiry 45 (1), 162–178.
Emerson, T.L.N., Taylor, B.A., 2004. Comparing student achievement across experimental and lecture-oriented sections of a principles of microeconomics course. Southern Economic Journal 70 (3), 672–693.
Exadaktylos, F., Espín, A.M., Brañas-Garza, P., 2013. Experimental subjects are not different. Scientific Reports 3 (1213), 1–6.
Fehr, D., Huck, S., 2016. Who knows it is a game? On strategic awareness and cognitive ability. Experimental Economics 19, 713–726.
Festré, A., Garrouste, P., 2015. Theory and evidence in psychology and economics about motivation crowding out: A possible convergence. Journal of Economic Surveys 29 (2), 339–356.
Frank, B., 1997. The impact of classroom experiments on the learning of economics: An empirical investigation. Economic Inquiry 35, 763–769.
Frey, B.S., Jegen, R., 2001. Motivation crowding theory. Journal of Economic Surveys 15 (5), 589–611.
Gertler, P., Galiani, S., Romero, M., 2018. How to make replication the norm. Nature 554, 417–419.
Gill, D., Prowse, V., 2016. Cognitive ability, character skills, and learning to play equilibrium: A level-k analysis. Journal of Political Economy 124 (6), 1619–1676.
Gneezy, U., Meier, S., Rey-Biel, P., 2011. When and why incentives (don't) work to modify behavior. Journal of Economic Perspectives 25 (4), 191–210.
Hagger, M.S., Wood, C., Stiff, C., Chatzisarantis, N.L.D., 2010. Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin 136 (4), 495–525.
Holt, C.A., 1999. Teaching economics with classroom experiments: A symposium. Southern Economic Journal 65 (3), 603–610.
Kräkel, M., 2008. Emotions in tournaments. Journal of Economic Behavior & Organization 67 (1), 204–214.
Kuhn, P., Kooreman, P., Soetevent, A., Kapteyn, A., 2011. The effects of lottery prizes on winners and their neighbours: Evidence from the Dutch postcode lottery. American Economic Review 101 (5), 2226–2247.
Kuhnen, C.M., Tymula, A., 2012. Feedback, self-esteem, and performance in organizations. Management Science 58 (1), 94–113.
Lancaster, K.J., 1966. A new approach to consumer theory. Journal of Political Economy 74 (2), 132–157.
Lin, P.-H., Brown, A.L., Imai, T., Wang, J.T.-Y., Wang, S., Camerer, C.F., 2018. General economic principles of bargaining and trade: Evidence from 2,000 classroom experiments. Available at SSRN: https://ssrn.com/abstract=3250495.
Mækelæ, M.J., Pfuhl, G., 2019. Deliberate reasoning is not affected by language. PLoS ONE 14 (1), e0211428.
Makel, M.C., Plucker, J.A., Hegarty, B., 2012. Replications in psychology research: How often do they really occur? Perspectives on Psychological Science 7 (6), 537–542.
Maniadis, Z., Tufano, F., List, J.A., 2017. To replicate or not to replicate? Exploring reproducibility in economics through the lens of a model and a pilot study. Economic Journal 127, F209–F235.
McCullough, B.D., McGeary, K.A., Harrison, T.D., 2006. Lessons from the JMCB archive. Journal of Money, Credit and Banking 38 (4), 1093–1107.
Nagel, R., 1995. Unraveling in guessing games: An experimental study. American Economic Review 85 (5), 1313–1326.
Nagel, R., 2008. Experimental beauty contest games: Levels of reasoning and convergence to equilibrium. In: Plott, C.R., Smith, V.L. (Eds.), Handbook of Experimental Economics Results 1. Elsevier, Amsterdam, pp. 391–410.
Open Science Collaboration, 2015. Estimating the reproducibility of psychological science. Science 349 (6251), aac4716.
Rick, S., Weber, R.A., 2010. Meaningful learning and transfer of learning in games played repeatedly without feedback. Games and Economic Behavior 68, 716–730.
Roth, A.E., Erev, I., 1995. Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term. Games and Economic Behavior 8, 164–212.
Rubinstein, A., 1999. Experience from a course in game theory: Pre- and postclass problem sets as a didactic device. Games and Economic Behavior 28, 155–170.
Rubinstein, A., 2007. Instinctive and cognitive reasoning: A study of response times. Economic Journal 117 (523), 1243–1259.
Schmidt, S., 2009. Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology 13 (2), 90–100.
Shubik, M., 2002. The uses of teaching games in game theory classes and some experimental games. Simulation and Gaming 33 (2), 139–156.
Smith, V.L., 1962. An experimental study of competitive market behavior. Journal of Political Economy 70 (2), 111–137.
Smith, V.L., 1976. Experimental economics: Induced value theory. American Economic Review 66 (2), 274–279.
Tompkinson, P., Bethwaite, J., 1995. The ultimatum game: Raising the stakes. Journal of Economic Behavior & Organization 27 (3), 439–451.
Veszteg, R.F., Funaki, Y., 2018. Monetary payoffs and utility in laboratory experiments. Journal of Economic Psychology 65, 108–121.
Weber, R.A., 2003. 'Learning' with no feedback in a competitive guessing game. Games and Economic Behavior 44 (1), 134–144.