Conceptualizing and Evaluating the Replication of Research Results

Leandre R. Fabrigar, Duane T. Wegener

PII: S0022-1031(15)00096-7
DOI: 10.1016/j.jesp.2015.07.009
Reference: YJESP 3343

To appear in: Journal of Experimental Social Psychology

Received date: 16 September 2014
Revised date: 19 July 2015
Accepted date: 30 July 2015

Please cite this article as: Fabrigar, L.R. & Wegener, D.T., Conceptualizing and Evaluating the Replication of Research Results, Journal of Experimental Social Psychology (2015), doi: 10.1016/j.jesp.2015.07.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Conceptualizing and Evaluating the Replication of Research Results

Leandre R. Fabrigar
Department of Psychology
Queen’s University
Kingston, Ontario, K7L 3N6
Canada

Duane T. Wegener
Department of Psychology
1835 Neil Avenue
The Ohio State University
Columbus, OH, 43210

Corresponding Author:
Leandre R. Fabrigar
Department of Psychology
Queen’s University
Kingston, Ontario, K7L 3N6
Canada
[email protected]
613-533-6492 (voice)
613-533-2499 (fax)

July 19, 2015

Abstract

Many recent discussions have focused on the role of replication in psychological science. In this article, we examine three key issues in evaluating the conclusions that follow from results of studies at least partly aimed at replicating previous results: the evaluation and status of exact versus conceptual replications, the statistical evaluation of replications, and the robustness of research findings to potential existing or future “non-replications.” In the first section of the article, we discuss the sources of ambiguity in evaluating failures to replicate in exact as well as conceptual replications. In addressing these ambiguities, we emphasize the key role of psychometric invariance of the independent and dependent variables in evaluations of replications. In the second section of the article, we use a meta-analytic framework to discuss the statistical status of replication attempts. We emphasize meta-analytic tools that have been used too sparingly, especially in evaluation of sets of studies within a single article or focused program of research. In the final section of the article, we extend many of these meta-analytic tools to the evaluation of the robustness of a body of research to potential existing or future failures to replicate previous statistically significant results.

Conceptualizing and Evaluating the Replication of Research Results

Reproducibility of research findings is a central feature of the scientific method. Thus, many classic (e.g., Cook & Campbell, 1979) and contemporary discussions (e.g., Brewer & Crano, 2014; Smith, 2014) of research design in social psychology feature commentary on replications and their implications for various types of validity. Indeed, the importance of replication in psychological research has been especially salient in recent years. Sizeable sections or entire issues of journals have been devoted to topics related to the current “replication crisis” in psychology (e.g., Perspectives on Psychological Science in 2011, 2012, and 2014; Journal of Mathematical Psychology in 2013; Journal of Experimental Social Psychology in 2015). The current article is not about whether such a crisis exists. Rather, it is about a number of criteria directly related to evaluating replication efforts – criteria that, in our views at least, have been too infrequently applied to the evaluation of such efforts.

Despite the importance psychologists place on replication, a number of ambiguities exist with respect to how replications should be conducted and evaluated (e.g., see Schmidt, 2009). The present article explores three such ambiguities. Specifically, we begin by examining the distinction between exact and conceptual replications and discuss ambiguities that arise in interpreting failures to replicate related to both approaches. Next, we explore the use of meta-analytic techniques to statistically evaluate the extent to which a given study has successfully replicated the findings of the original study. Finally, we discuss approaches to assessing the robustness of a given set of studies to potential or future failures to replicate.

Strategies for Conducting Replications

Distinguishing between Exact and Conceptual Replication

Although there is widespread consensus among psychologists regarding the value of replication, the precise strategy that should be adopted when replicating a prior study has long been a matter of debate. Over the years, a variety of categories have been proposed to describe differences in objectives and approaches used to conduct replication studies (for a review, see Schmidt, 2009). However, perhaps the categorization that has been most prevalent is the distinction between exact (often referred to as direct) and conceptual replication. The term exact replication typically refers to a study aimed at repeating as closely as possible the procedures used in a prior study. Conceptual replication is generally aimed at reproducing an effect from a prior study using different operationalizations of the independent (predictor) and/or dependent (outcome) variable(s) than in the prior study. In the current article, we primarily focus on replication of experiments, but the issues discussed are equally applicable to nonexperimental studies. Methodologists have long recognized that both exact and conceptual replication can be useful and have a role to play in any program of research. However, in the current debate regarding replication in psychology, there has been considerable discussion regarding the relative value of these two approaches.

Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014). As noted by Cook and Campbell (1979) in their influential discussion of experimental design, any experimental manipulation or measure of a given construct is likely to reflect various “irrelevancies” (i.e., the influence of constructs other than the intended construct). An alternative operationalization of a given construct, especially one that is procedurally quite different from the original operationalization, is likely to introduce different “irrelevancies”. If the same pattern of findings nonetheless emerges in the conceptual replication, the work has not only demonstrated that an earlier pattern of findings is reproducible but has also provided stronger evidence that the construct(s) of interest are responsible for the findings.

Despite their appeal from the standpoint of construct validity, conceptual replications have also faced criticisms. In particular, a number of commentators have expressed reservations regarding the utility of conceptual replications in evaluating the reliability of previously documented effects (e.g., LeBel & Peters, 2011; Nosek, Spies, & Motyl, 2012; Pashler & Harris, 2012; Simons, 2014). Central to these reservations has been the concern that any failure to reproduce an earlier finding in a conceptual replication is ultimately open to a variety of interpretations. Such a failure could of course indicate that the original finding was spurious. However, a conceptual replication might also fail because the new operationalizations of the independent and/or dependent variables less effectively represent the constructs of interest or reflect a somewhat different set of constructs than in the prior experiment. Thus, skeptics of conceptual replication argue that failed conceptual replications are typically dismissed as uninformative and seldom lead researchers to question the reliability of the earlier finding.

In contrast, an exact replication follows as precisely as possible the procedures used in the original experiment. Indeed, when feasible, the same experimental materials used in the original experiment are used in the replication experiment. This approach is exemplified by the Reproducibility Project, a large-scale collaborative effort to replicate studies published in prominent journals (Open Science Collaboration, 2012). One key feature of the standardized protocol used in this initiative is the “use of original materials, if they are available” (p. 658, Open Science Collaboration, 2012). Advocates of exact replication argue that such experiments are less ambiguous from an interpretational standpoint because the same operationalizations have been used. Thus, these studies are more difficult to dismiss as uninformative with respect to the reliability of the effect demonstrated in the original experiment.

Psychometric Invariance and the Interpretation of Replications

Although the case for the interpretational clarity of exact replications intuitively seems persuasive, it is not clear that this line of reasoning is as compelling as it might appear at first glance. Central to interpreting any replication experiment, be it exact or conceptual, is the assumption of what we term psychometric invariance. Psychometric invariance is present when the psychometric properties of the operationalizations of independent variables and dependent variables are the same across the samples and contexts in which the original experiment and replication experiment were conducted. Stated another way, psychometric invariance can be said to hold when the operationalizations used in the original experiment and replication experiment are equally effective in representing the same underlying constructs in both experiments.

We use psychometric invariance as a broad term to capture the full set of traditional psychometric properties commonly discussed in the research literature -- including but not limited to reliability, convergent validity, discriminant validity, construct validity, and predictive validity. The concept of psychometric invariance is also intended to be broad in that it is applicable to both nonexperimental and experimental designs. For purposes of simplicity, we discuss psychometric invariance mostly in terms of a dichotomous distinction. However, in reality it is more accurate to think of it as a continuum with increasingly strict levels of invariance. Psychometric invariance might well be satisfied in some respects but not others. The sorts of inferences one can make in comparing studies will be dependent on the level at which psychometric invariance holds. In this context, it is important to note that psychometric invariance does not necessarily imply that the operationalizations are valid in an absolute sense. It merely implies that the level of validity in the two studies is comparable and that the natures of any threats to validity are similar across the two studies.

The critical role of psychometric invariance for comparisons within and across studies has long been recognized in the measurement literature. For example, in work on factorial invariance, methodologists have long recognized that clear interpretations of differences in associations across groups require that the underlying factor structure of the measures is the same across those groups (see Millsap & Meredith, 2007; Widaman & Reise, 1997; Widaman & Grimm, 2014). If the factor structure of a given set of measures is not invariant across groups, then cross-group comparisons cannot be readily interpreted. That is, in such cases, the measures do not necessarily reflect the same underlying constructs in the two groups. Methodologists have identified varying levels of strictness with which factorial invariance can be said to hold and discussed the inferences that are reasonable to reach when each of these levels is satisfied. They have also developed statistical procedures for evaluating the extent to which these levels of factorial invariance hold. A detailed discussion of these issues goes beyond the scope of the present paper. For now, it is sufficient to note that many of the same inferential issues highlighted in the factorial invariance literature are relevant when comparing results across studies.

To illustrate this point, consider the simple case of a non-experimental study in which the predictor variable was measured using a scale consisting of 4 items and the outcome variable was measured at a later point in time using a different scale also consisting of 4 items. Analyses have indicated that the scores on the predictor variable are significantly and positively related to scores on the outcome variable. Figure 1 illustrates in path diagram form the assumptions that have been made (perhaps implicitly) by the researchers conducting this hypothetical study. The researchers have assumed that the 4 items comprising their measure of the predictor variable reflect a single underlying construct and that the 4 items comprising the outcome variable also reflect a single underlying construct. Each item is also assumed to have some unique variance associated with it (i.e., variance reflecting systematic influences unique to that item and variance reflecting random measurement error). They have further assumed that the predictor variable exerts a directional influence on the outcome variable. It is the significant coefficient obtained for this directional effect that is the substantive finding of interest in the study.

Imagine now that another group of researchers attempts to replicate this study. Although the primary goal of these researchers is to replicate the directional effect between the predictor variable and the outcome variable, meaningful comparison of this effect across the studies also requires replication of the underlying factor structure of the two scales. For instance, if the underlying construct serving as the independent variable had a strong positive effect on all 4 items in the original study (i.e., large positive factor loadings), but had a strong positive effect on only 2 of the items in the replication study, meaningful comparisons between studies could be compromised. Such comparisons would be ambiguous because the scale designed to measure the predictor variable might not be assessing the same construct in the two studies. Thus, failure to replicate the directional effect would not necessarily imply that the original effect of the predictor variable on the outcome variable is unreliable. Because the replication study has not necessarily assessed the same predictor variable, there is no reason to expect that the same directional effect on the outcome variable should emerge. Conversely, even if a significant directional effect was obtained in the replication study, it could not necessarily be interpreted as a replication of the effect demonstrated in the original study. Similar interpretational ambiguities would arise if differences across the two studies emerged in the scale assessing the outcome variable.

In nonexperimental studies, the concept of psychometric invariance can be viewed as equivalent to the concept of factorial invariance (also sometimes referred to as measurement invariance). However, in the context of experimental studies, psychometric invariance captures a somewhat broader set of properties. Consider a second hypothetical example in which researchers conduct a simple experiment in which the independent variable consists of two conditions and the dependent variable is assessed using a scale consisting of 4 items. Figure 2 illustrates the assumptions typically made (perhaps implicitly) in such an experiment. This figure indicates that these researchers have assumed that assignment to experimental condition is interchangeable with the underlying construct the independent variable is presumed to represent. The researchers have assumed that the four items assessing the dependent variable reflect a single underlying construct. They have also assumed that the independent variable exerts a directional influence on the dependent variable. Once again the substantive effect of interest is this directional effect of the independent variable on the dependent variable.

If a second group of researchers were to replicate this experiment, just as in our nonexperimental example, meaningful comparison between the two studies would require psychometric invariance. In the case of the dependent variable, the issues are precisely the same as in the nonexperimental study. Factorial invariance would need to hold for the scale assessing the dependent variable. However, in the case of an experimentally manipulated independent variable, factorial invariance as it is traditionally discussed is not precisely applicable. As Figure 2 illustrates, because experimental condition is presumed to be a perfect indicator of the independent variable, one cannot meaningfully assess factorial invariance for independent variables across experiments. Nonetheless, experimental manipulations can be conceptualized as having psychometric properties, and these properties must be invariant across studies if straightforward comparisons of effects across studies are to be made.

Figure 3 illustrates the sorts of psychometric assumptions commonly assumed in an experiment. Specifically, a given experimental manipulation is typically intended to produce differences on some underlying construct of interest. Thus, in the context of Figure 3, one would hope for a strong directional influence of the independent variable on the construct it is intended to influence (which could be conceptualized as a form of predictive or convergent validity). Ideally the manipulation would have negligible influence on other constructs not of interest in the current experiment, but which might plausibly influence the dependent variable. Thus, in Figure 3, one would hope for no directional influences of the independent variable on the two unintended constructs depicted in the figure (which could be conceptualized as a form of discriminant validity). The psychometric properties of a given manipulation might be characterized by the strength of its effect on the intended construct as well as its (lack of) effects on other unintended constructs.

For example, if we had an experimental manipulation designed to instantiate fear, we might view the independent variable in theoretical terms as level of exposure to a fear-inducing stimulus. This latent variable would be presumed to be perfectly represented by assignment to experimental condition (control condition vs. fear condition). Our hope would be that this independent variable would exert a strong directional effect on a measure of fear and have little or no effect on measures of other emotions we would not want to be influenced by our manipulation (e.g., a measure of sadness and a measure of anger). This manipulation could be said to be invariant if a replication experiment produced a comparable set of directional effects on these measures of emotions as the original experiment. If it did not (e.g., if the manipulation in the replication experiment produced a weaker effect on fear and stronger effect on sadness), clear interpretation of differences across the studies would be difficult to make. The independent variable could no longer be assumed to represent the same construct in the two experiments. Of course meaningful comparisons of these directional effects across experiments would only be possible if the scales assessing each of the three emotions (fear, sadness, and anger) are factorially invariant across experiments. Thus, invariance of an experimental manipulation involves invariance at the level of the measurement model (factorial invariance of measures of constructs potentially influenced by the independent variable), but also invariance at the level of the structural model (i.e., invariance of directional effects of the independent variable on its intended construct and unintended constructs potentially influenced by the independent variable).

In summary, straightforward interpretation of replication studies requires an assumption of psychometric invariance across the original and replication experiments. If psychometric invariance is severely violated, there would be little reason to expect a replication experiment to reproduce the findings of the original experiment. Indeed, even if the replication experiment did reproduce the findings of the original experiment, this replication could not necessarily be interpreted as truly reproducing the original effects if it was shown that psychometric invariance was severely violated. The results of the replication experiment could reflect a very different psychological phenomenon than that demonstrated in the original experiment.

Psychometric Invariance of an Exact Replication

Implicit in the logic of exact replication is the assumption that using the original experimental materials is the best strategy for ensuring psychometric invariance. Intuitively, it would seem that a researcher using precisely the same materials as in the original experiment would be much more likely to recreate the same psychological conditions that were present in the original experiment. Alterations of experimental materials would only seem to enhance the risk that operationalizations of the independent variables and/or dependent variables will no longer capture the same underlying psychological constructs. Although this assumption is likely true in many cases, it should not be accepted without careful evaluation. Even when a researcher wishes to replicate a prior experiment as closely as possible (as opposed to a conceptual replication in which the goal is to explore operationalizations that are procedurally distinct from the original experiment), there are many cases in which using the original experimental materials is likely to be a poor strategy for achieving psychometric invariance.

To illustrate this point, it is useful to explore this issue in the context of a concrete example. Consider a well-known experiment in the persuasion literature originally reported by Petty and Cacioppo (1979). In this experiment (Experiment 2), the researchers were interested in exploring the role of issue involvement in persuasion. Undergraduate students at the University of Missouri were randomly assigned to conditions in a 2 (Issue Involvement: High vs. Low) X 2 (Argument Quality: Strong vs. Weak) between-participants factorial design. Participants listened to a message on the topic of senior comprehensive exams. Those in the high-involvement condition received a version of the message in which the speaker advocated that the policy be implemented at the University of Missouri. Those in the low-involvement condition received a version of the message in which the speaker advocated that the policy be implemented at North Carolina State University. Participants assigned to receive the strong-argument version of the message received arguments that pre-testing had indicated produced predominantly favorable thoughts toward the policy. Participants in the weak-argument condition received arguments that pre-testing had indicated produced predominantly negative thoughts toward the policy. The participants then completed 4 semantic differential scales and a single-item Likert scale to assess their attitudes toward the general policy of senior comprehensive exams.

If a researcher wished to conduct an exact replication of this experiment in 2015, would using the original experimental materials of Petty and Cacioppo (1979) be the best strategy to ensure psychometric invariance? First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt, 2009; Stroebe & Strack, 2014). Nonetheless, the traditional approach to exact replication would suggest that using the original research materials would be the optimal strategy for coming as close as possible to recreating the psychological conditions present in the original experiment. However, a careful examination of the experimental manipulations and the logic that governed their original construction might suggest otherwise.

Turning first to the manipulation of issue involvement, would it make sense to use the University of Missouri as the school for implementing the policy in the high-involvement condition? Even strong advocates of using original materials would probably concede that this strategy would not make sense if the replication study is not being conducted at the University of Missouri. The logic underlying this manipulation would suggest that the university where the data will be collected in the new study would be the most appropriate choice and would be much more likely to recreate the psychological conditions present in the original experiment. Even then, many aspects of the university where a new study is conducted might differ from the University of Missouri in the late 1970s.

The situation is all the more ambiguous in the low-involvement condition. Would it make sense to retain the original condition in which the focal university was North Carolina State University (NCSU)? The answer to this question can best be formulated by considering why NCSU was used in the original study. For students at the University of Missouri, NCSU was geographically distant and not a university that had an established history of competition with their own university (e.g., in the context of sports). Thus, it was unlikely to serve as a salient point of comparison. Similarly, though a well-known university, NCSU was not sufficiently famous that the participants likely had a well-developed stereotype of the school or viewed it as a school that should perhaps be emulated (as might be the case for more famous universities such as Yale or Cambridge). All of these factors make it likely that a policy being considered for implementation at NCSU would probably not trigger strong involvement in the issue. However, whether the same criteria would be met in a replication experiment would very much depend on where that study was being conducted. If conducted at the University of Iowa, using NCSU as the low-involvement message target might be reasonable. However, if the replication experiment was conducted at Clemson University, using the original materials for this condition might be a very poor choice. NCSU would obviously be much closer in physical distance, much more familiar to participants, and a more salient school of comparison given the schools’ membership in the same athletic conference.

What about the manipulation of argument quality? It would be tempting to assume that retaining the original arguments used in Petty and Cacioppo (1979) would be the best strategy for ensuring the psychometric invariance of this manipulation. However, an examination of the sorts of arguments used in this experiment and related experiments (see pp. 54-59, Petty & Cacioppo, 1986) suggests that this might not be the case. For example, if one were to conduct the study in Europe, many of the arguments might not be similarly effective as they use American universities as examples and reference features of the American university system that might not be fully applicable outside of this context. Moreover, even if the replication experiment was conducted in the United States, it is not clear that all of these arguments would have the same effect. For example, arguments citing specific corporations might not have the same effect if these corporations are no longer as familiar or seen as especially successful exemplars of their industries. Likewise, arguments citing specific amounts of money could be interpreted very differently in 2015 than they were in the late 1970s. One argument commonly used in the strong versions of the senior comprehensive exam message was that graduates of universities implementing the exams had an increase in average starting salary of $4,000. This increase might have been perceived as much larger in the late 1970s and early 1980s than it would be today (inflation calculators would suggest using a value of about $13,000 in 2015).

In short, there are a number of reasons that using the original arguments in a replication experiment might fail to produce as effective a manipulation of argument quality as it did in the original experiment. Indeed, in their discussion of argument quality manipulations, Petty and Cacioppo (1986) discussed the fact that the conceptualization of argument quality was not tied to specific arguments but instead was constructed on the basis of an empirical criterion (i.e., prevalence of positive versus negative thoughts when carefully evaluating the argument). They noted that the success of a given argument in meeting this criterion should not be assumed to be invariant and recommended that researchers using argument quality manipulations conduct pretesting to verify the success of their manipulations. In point of fact, one might argue that a complete replication of Experiment 2 in Petty and Cacioppo (1979) would be to fully replicate their pretesting of materials prior to conducting the replication of the main experiment.

A final consideration would be whether the attitude measure used in the original experiment (the primary dependent variable) would be likely to satisfy psychometric invariance in a replication experiment. The measures used in the original experiment are comparatively well-established general measures of attitudes (e.g., semantic differential scales) whose psychometric properties have been extensively evaluated over the years. They have generally been found to perform well across a range of contexts and populations. Thus, retaining them for use in a replication experiment would seem entirely reasonable, though one could not be confident that invariance was achieved without formally evaluating the psychometric properties of the measures. Furthermore, if the study was to be replicated in a non-English speaking population, the measures would obviously need to be translated and this process could alter the psychometric properties of the measure. For example, when translating the evaluative adjective pairs in a semantic differential item, it might be ambiguous as to which terms in the new language most closely approximate the original adjectives, and it is possible that these new terms might not have the same level of evaluative connotation.

Reassessing the Distinction between Exact and Conceptual Replications

As the above example illustrates, using original experimental materials is no guarantee that psychometric invariance will be achieved between an original experiment and a replication experiment. Indeed, in some cases, there would be a strong a priori basis to predict that exact replications would fail to produce the original effects and that this failure would not be particularly informative with respect to the reliability of the original effect. Importantly, the example we have provided is not particularly unique. The operationalizations of independent variables, and in some cases dependent variables, are often formulated with reference to specific features of the context and population in which they will be used. Thus, these operationalizations would not be expected to be psychometrically invariant if used in contexts and populations substantially different from those for which they were originally developed.

Stroebe and Strack (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by Stroebe and Strack. For instance, a positive mood manipulation using an episode from an American sitcom or a routine from an American comedian might be less effective in a different era or cultural context. A self-threat manipulation involving poor performance on a task might be less effective if the participants in the replication experiment had lower expectations of performance on the task and/or the task was less central to their self-concepts. In summary, the specific operationalizations used in many experiments are not constructed with the intent of making the operationalization per se broadly applicable across samples and contexts. Instead, the operationalizations are constructed to be optimal within the population and context in which they will be used.

Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by Stroebe and Strack.

Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.

Consider a replication of Petty and Cacioppo (1979) along the lines we have previously discussed. In such a replication, to achieve psychometric invariance, it might be necessary for the researcher to substantially revise both levels of the issue involvement manipulation and substantially revise or replace some arguments in both the strong and weak versions of the message. Should this experiment be regarded as a conceptual or exact replication? It is not a conceptual replication in the purest sense of the term in that the researcher has not deliberately attempted to adopt operationalizations that are as procedurally distinct as possible from the original operationalizations. Rather the researcher has tried to precisely follow the rationales that governed the construction of the original operationalizations and closely mimic the original materials within the confines of these rationales. However, the end result of this process has been a set of operationalizations that differ from the original experiment in terms of many of the specifics. Thus, many researchers might be reluctant to call it an exact replication.

Our own position is that this experiment would be closer in its goals to the objectives of an exact replication than it is to the objectives of a conceptual replication. An experiment of this type would provide only a very modest advance in terms of construct validity because at the general procedural level it is very close to the original experiment. However, we do think that this experiment would be potentially informative with respect to evaluating the reliability of the originally demonstrated effects. Indeed, it might be much more informative than an experiment that uses the original materials (an exact replication) without considering the psychometric differences that might exist across studies.

In the context of psychological science, the fundamental goal of “exact replication” should be to recreate, as exactly as possible, the psychological conditions present in the original experiment. Using the original experimental materials is often a highly imperfect proxy for recreating the psychological conditions of the original experiment.

Conclusions and Recommendations

We believe that exact replications can be useful. However, these exact replications should be conducted with certain conceptual considerations in mind. The objective of an exact replication is to recreate the psychological conditions of the original experiment (i.e., to achieve psychometric invariance). The original materials used in a prior experiment should certainly be the starting point for achieving this goal, but not necessarily the end point. Each aspect of the original materials should be carefully considered in the context of the logic that governed its development and judged with respect to how likely that logic is to apply to the context and population in which the replication experiment will be conducted. When this process suggests that the logic does not extend to the new sample and/or context, revisions should be undertaken and not be seen as detracting from the central objective of exact replication.

One potential objection to this position might be to claim that the goal of exact replication is not to recreate the same psychological conditions as the original experiment but rather to see if the experimental procedures themselves produce the same outcome. Although such a goal might be true in some cases (e.g., an applied study), in general we do not think that is the goal of most theoretically-oriented experiments. The overarching goal of these experiments is rarely to test whether a specific manipulation effectively captures a given construct or to test if a specific manipulation creates the same outcome separate from the psychological mechanisms hypothesized to underlie the effect. Rather the overarching research goal is to examine whether a particular construct influences another construct. Thus, a researcher studying the effects of mood is not interested in whether a particular sitcom episode produces a positive mood state, nor is the researcher interested in the broader question of whether sitcom episodes in general produce positive mood. The use of a particular sitcom episode is a means to an end (i.e., creating a particular psychological state) and has little inherent interest beyond that.

Regardless of the researcher’s goals, clear interpretation of any failed replication experiment requires differentiating between failures that reflect the operationalizations not representing the same psychological constructs in the two studies and failures that reflect the effect of one construct on another not replicating. Thus, each of these potential components of the replication failure should be directly tested whenever possible.

First, the efficacy of the operationalizations used in replication experiments should be empirically evaluated with reference to the original experiment. In the case of dependent variables, if the dependent variable is an aggregate of multiple indicators and the data from the original experiment are available, this could be accomplished by formally testing factorial invariance across the studies using methods such as multi-sample confirmatory factor analysis. If the original data are not available but factor analytic results were reported, the same factor analysis should also be performed in the replication experiment and its results compared in a more descriptive fashion. The match of the two sets of factor analytic results can even be more formally quantified using an index such as a coefficient of congruence (see Gorsuch, 1983).
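To make this kind of descriptive comparison concrete, the short sketch below computes Tucker’s coefficient of congruence between two sets of factor loadings. The loading values are hypothetical, and the Python implementation is our own illustrative choice rather than anything taken from Gorsuch (1983) or from the studies discussed above.

```python
import numpy as np

def congruence_coefficient(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two factor-loading vectors.

    Values near 1.0 indicate that the items relate to the factor in a highly
    similar way in the two studies; appreciably lower values suggest the
    factor may not represent the same construct across studies.
    """
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

# Hypothetical loadings for a 4-item scale in the original study and in a
# replication; two items load much more weakly in the replication.
original = [0.78, 0.74, 0.70, 0.72]
replication = [0.75, 0.71, 0.31, 0.28]

print(round(congruence_coefficient(original, replication), 3))
```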

If factor analytic results were not reported in the original experiment and the original data are not available, the properties of the dependent variable should still be compared to the extent possible. For example, comparison of reliability indices across studies can be useful. Although similar levels of reliability do not guarantee that factorial invariance holds, substantially different levels of reliability would be a warning sign that this condition has not been met. Even comparing the dependent variable in terms of its distributional properties (e.g., mean, standard deviation, skew, and kurtosis) could be useful. Comparable distributional properties certainly would not provide any confirmation that factorial invariance has been satisfied. However, very different distributional properties could be a result of differences in the underlying psychometric properties of the dependent variable or at the very least could indicate that the new population differs substantially on the construct of interest when compared with the population sampled in the original experiment. Unfortunately, in the case of dependent variables based on a single indicator, there is comparatively little that can be done to evaluate assumptions of invariance other than to compare the distributional properties of the dependent variable across experiments.
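As one way such simpler checks might be carried out, the following sketch computes Cronbach’s alpha and basic distributional summaries for a multi-item dependent variable in each of two studies. The data are simulated and the helper names are our own assumptions; the sketch is illustrative only.

```python
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def describe_dv(scores):
    """Distributional summary (mean, SD, skew, kurtosis) of an aggregated dependent variable."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": round(float(scores.mean()), 2),
        "sd": round(float(scores.std(ddof=1)), 2),
        "skew": round(float(stats.skew(scores, bias=False)), 2),
        "kurtosis": round(float(stats.kurtosis(scores, bias=False)), 2),
    }

def simulate_items(n, true_mean, noise_sd, k=4, seed=0):
    """Simulate correlated item responses as a latent true score plus item-level noise."""
    rng = np.random.default_rng(seed)
    true_scores = rng.normal(true_mean, 1.0, size=(n, 1))
    return true_scores + rng.normal(0.0, noise_sd, size=(n, k))

original_items = simulate_items(40, true_mean=5.0, noise_sd=0.6, seed=1)
replication_items = simulate_items(40, true_mean=4.2, noise_sd=1.2, seed=2)

for label, items in (("original", original_items), ("replication", replication_items)):
    print(label, round(cronbach_alpha(items), 2), describe_dv(items.mean(axis=1)))
```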

With respect to evaluating independent variables, if pre-testing procedures guided the development and/or evaluation of the efficacy of the original operationalizations, the same pre-testing procedures should be replicated to examine whether the original operationalizations retain their psychometric properties in the new population and context. When data from the original pre-testing study are available, direct comparisons should be made to more formally evaluate the assumption of psychometric invariance. For example, in a comparatively ideal case such as that illustrated in Figure 3, multi-sample structural equation modeling could be done to formally test that factorial invariance holds for the measures of constructs potentially influenced by the independent variable. Assuming that condition is met, the invariance of the directional effects on the intended and unintended constructs across the studies could then be tested.

If the original data are not available but results of pre-testing have been reported, results can be compared in a more descriptive fashion, and in some cases even more formally evaluated. For instance, imagine in our example that the researcher conducting the original pre-test study had simply reported three t-tests assessing the impact of the experimental manipulation of fear on the aggregate scores for the intended construct of fear as well as on the unintended constructs of sadness and anger. The same t-tests could be computed for the replication pre-testing study. Effect sizes could then be calculated for both the original study and the replication study and meta-analytic tests could be conducted assessing if the corresponding effect sizes differed across the two studies (e.g., see Borenstein, Hedges, Higgins, & Rothstein, 2009; Rosenthal, 1991). Although obtaining equivalent effect sizes across the two studies would not provide as compelling evidence of psychometric invariance as the structural equation modeling approach discussed previously (there would be no way to be certain that the measures of fear, sadness, and anger were factorially invariant across studies), it would nonetheless provide useful information for evaluating the viability of the assumption of psychometric invariance. Significantly different effect sizes would be a warning that psychometric invariance might not have been achieved.
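One concrete way to carry out such a comparison is sketched below: hypothetical pre-test t statistics are converted to Cohen’s d, and a Z test assesses whether the two effect sizes differ. The t values, cell sizes, and function names are illustrative assumptions rather than results from any actual pre-test, and the variance approximation follows common meta-analytic practice (e.g., Borenstein et al., 2009) rather than any procedure specific to this article.

```python
import math
from scipy import stats

def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def var_d(d, n1, n2):
    """Approximate sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def compare_effect_sizes(d1, n1a, n1b, d2, n2a, n2b):
    """Z test for the difference between two independent effect sizes."""
    z = (d1 - d2) / math.sqrt(var_d(d1, n1a, n1b) + var_d(d2, n2a, n2b))
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical t statistics for the fear manipulation's effect on the fear
# measure in the original pre-test and in the replication pre-test (n = 20 per cell).
d_orig = d_from_t(t=3.10, n1=20, n2=20)
d_rep = d_from_t(t=1.20, n1=20, n2=20)
z, p = compare_effect_sizes(d_orig, 20, 20, d_rep, 20, 20)
print(round(d_orig, 2), round(d_rep, 2), round(z, 2), round(p, 3))
```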

In other cases, formal pre-testing of the independent variables might not have taken place, but the original experiment might have included manipulation check measures. In such cases, comparisons of analyses of manipulation checks across the two experiments can provide some information relevant to assessing psychometric invariance. For the most part, the analytical strategies outlined for comparing pre-testing studies can also be used for comparing results of manipulation checks across studies.

Analyses of a replication pre-testing study or analyses of the manipulation checks in the replication experiment might suggest that the original operationalizations of the independent variables and/or dependent variables no longer have the same psychometric properties. If so, revisions should be undertaken in an effort to establish psychometric invariance between the original experiment and the replication experiment. These revised materials should then be assessed to evaluate if invariance has been achieved.1

Obviously, the extent to which a researcher can successfully follow these guidelines is to some degree determined by choices made by the individuals conducting the original experiment. Unfortunately, researchers are often not sufficiently attentive to the psychometric properties of their measures and manipulations when conducting original experiments (LeBel & Peters, 2011). Measures and/or manipulations are sometimes based largely on intuition and/or face validity. In other cases, more formal rationales might have guided the construction of the measures/manipulations and empirical evaluation of the measures/manipulations might have been conducted. However, because of space constraints, such information sometimes goes unreported.

Clearly such practices make replication of prior experiments much more challenging. With these challenges in mind, researchers publishing new experiments should attempt to facilitate meaningful replications of their experiments. Making the original data and the original experimental materials publicly available or available upon request is clearly important. Additionally, the logic underlying the development of the operationalizations used in experiments should be explicitly articulated in the article reporting the experiment. If it is not feasible to provide detailed information in the article, a supplementary document outlining the rationales underlying the development of the operationalizations should be available to other researchers. Similarly, if pre-testing of operationalizations and other forms of psychometric evaluation were conducted, this information should also be provided in the article or a supplementary document available to researchers.

Finally, our focus on exact replication in the current section is not meant to imply that conceptual replication is unimportant. To the contrary, we see real value in such studies. Issues of construct validity should not be ignored in an effort to establish the reliability of a previously documented effect. Both issues are important. Moreover, we do not see conceptual replications as inherently more vulnerable to alternative interpretations and thus easy to dismiss. If there are compelling theoretical rationales underlying the new operationalizations and empirical evidence to support their validity, conceptual replications can be difficult to dismiss out of hand. At the very least, they may serve to challenge the interpretation of the originally demonstrated effect and/or to suggest that it may be more contextually bound than originally recognized.2 Conversely, an exact replication using original materials might lack compelling rationales for its operationalizations in the new context and population or might lack evidence for the validity of its operationalizations in the new context and population. If so, the results of the study might be open to many interpretations and it might be very easy to dismiss this study as uninformative. At the end of the day, the value of both exact and conceptual replications rests in large part on the strength of the theoretical and empirical foundations of their key operationalizations.

Evaluating the Statistical Status of Replications

Defining Success and Failure in Replication Experiments

Even if we have reason to conclude that psychometric invariance has been achieved, ambiguity exists regarding how to evaluate whether that experiment has been a successful versus failed replication. Moreover, if the experiment is judged a failed replication, ambiguity exists regarding how a failure should alter our views of the originally demonstrated effect.

Traditionally, by far the most pervasive definition of a successful replication has been an experiment that obtains a statistically significant effect in the same direction as the original experiment. Thus, an experiment producing a non-significant effect in the expected direction, no effect at all, a non-significant effect in the opposite direction, or a significant effect in the opposite direction would all be regarded as “failures to replicate.” Most recent discussions of replication (e.g., Francis, 2012, 2013b; Schimmack, 2012) have explicitly or implicitly defined success versus failure purely in terms of whether statistical significance has been achieved and have not differentiated among different patterns of non-significant effects that might emerge. Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis, 2012; Schimmack, 2012). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).

At first glance, the traditional definition of successful versus failed replications might seem sensible. One could certainly imagine that obtaining a non-significant result in a replication experiment would decrease confidence in the existence of an effect and that the existence of several non-significant replications would be especially troubling when considering the reliability of the original effect. However, a more formal assessment of the statistical implications of non-replications indicates that, in many situations, such an inference would not be warranted. Specifically, some methodologists have long argued that the broader statistical implications of “failed” replications are best evaluated from a meta-analytic perspective (see Mullen, Muellerleile, & Bryant, 2001; Rosenthal, 1990). Moreover, this meta-analytic perspective has begun to be featured in more contemporary discussions of replication (e.g., Braver, Thoemmes, & Rosenthal, 2014; Galak & Meyvis, 2012; Stanley & Spence, 2014).

Meta-Analytic Illustrations of Failed Replications

To illustrate this alternative viewpoint, it is useful to consider a few simple hypothetical examples using some basic meta-analytic calculations (Borenstein, Hedges, Higgins, & Rothstein, 2009).3 Imagine a situation in which a researcher has published a series of 3 simple experiments each involving two conditions and a single dependent variable. Each experiment was based on a sample of 40 participants. The experiments produced effect sizes of d = .67, .57, and .77. The two-tailed significance levels of these experiments were p = .04, .08, and .02. Assuming an effect size of .67 (the average of the three studies), each experiment has only modest power (.54). When viewed in isolation, each experiment does not appear particularly impressive at a statistical level. Only two of the experiments reach statistical significance and in neither of these cases was the critical threshold of .05 greatly exceeded. Moreover, the modest power of each study and the comparatively modest number of participants per experimental condition might give readers a basis for concern regarding the reliability of these findings (20 participants per cell would just meet the minimum threshold of acceptability according to some guidelines; Simmons, Nelson, & Simonsohn, 2011). In light of these facts, one might imagine that a failure to replicate would constitute a substantial basis for concern with respect to our confidence in the existence of this effect.

Imagine that a researcher at a different lab attempts 3 replications of these experiments using the same sample size as the original experiments. This researcher obtains effect sizes of d = .32, .24, and .40. The corresponding two-tailed significance levels for these 3 experiments are p = .32, .45, and .21. Traditionally, each of these 3 replication experiments would be viewed as fairly dramatic failures to replicate the original findings. The effect sizes are clearly smaller and the effects are not even close to conventional standards of statistical significance. Moreover, when considered on the whole, we now have more non-significant (4) than significant results (2) demonstrating the effect. Presuming that the replication experiments have achieved psychometric invariance, it might seem that these failed replications should dramatically undermine our confidence in the existence of the effect. Alternatively, if we are optimists, perhaps we might conclude that the original experiments and the replication experiments differ as a function of some unknown moderator.

Interestingly, a meta-analytic evaluation of the implications of these new studies would in many respects suggest a different view of these studies. First, it is useful to recognize that although the original 3 experiments might appear somewhat “fragile” at the individual level, they appear more robust at the meta-analytic level. A meta-analytic test of the overall significance of the original 3 studies produces a Z = 3.52, p = .0004 (two-tailed) and a failsafe number of 10.74. Thus, at the meta-analytic level, the original 3 experiments indicate a very low probability of occurrence due to chance if the population effect is zero and these were all the data collected. Also, the failsafe number indicates that 10.74 additional experiments with an average sample size of 40 (i.e., total N = 430) and an average effect size of d = .00 would be needed to reduce the meta-analytic test statistic to non-significance at a one-tailed level.4
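To show the kind of arithmetic that produces these values, the sketch below combines the three two-tailed p values with Stouffer’s method, computes Rosenthal’s failsafe number, and approximates the per-study power mentioned above. This is an assumed Python implementation written for illustration, not code from this article or from Borenstein et al. (2009), and minor rounding differences from the values reported in the text are expected.

```python
import math
from scipy import stats

def stouffer_z(p_two_tailed):
    """Combine two-tailed p values (all effects in the same direction) via Stouffer's method."""
    zs = [stats.norm.isf(p / 2) for p in p_two_tailed]  # two-tailed p -> one-tailed z
    return sum(zs) / math.sqrt(len(zs)), zs

def failsafe_n(zs, alpha_one_tailed=0.05):
    """Rosenthal's failsafe N: how many additional average-z-of-zero studies would
    pull the combined test down to the one-tailed alpha threshold."""
    z_alpha = stats.norm.isf(alpha_one_tailed)
    return (sum(zs) / z_alpha) ** 2 - len(zs)

original_p = [0.04, 0.08, 0.02]                        # the 3 original experiments
z_combined, zs = stouffer_z(original_p)
print(round(z_combined, 2), round(failsafe_n(zs), 1))  # roughly Z = 3.5 and failsafe ~ 11

# Approximate power of a single study for d = .67 with 20 participants per cell
# (two-tailed alpha = .05), using the noncentral t distribution.
d, n = 0.67, 20
ncp = d * math.sqrt(n * n / (n + n))
t_crit = stats.t.isf(0.025, df=2 * n - 2)
power = stats.nct.sf(t_crit, 2 * n - 2, ncp) + stats.nct.cdf(-t_crit, 2 * n - 2, ncp)
print(round(power, 2))                                 # approximately .54-.55
```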

ACCEPTED MANUSCRIPT Replication of Research Results 28

confidence that the effect differs from a population value of zero (as many researchers in the field seem to think).
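To make the calculations behind these numbers concrete, the following is a minimal illustrative sketch (not code from the original analyses) of a fixed-effect meta-analytic test and a Rosenberg (2005)-style weighted failsafe number for the three hypothetical original studies. The function name and the large-sample variance formula for d are our own illustrative choices (cf. Borenstein et al., 2009); because the exact variance formula or small-sample corrections may differ from those used for the values reported above, the output only approximately reproduces Z = 3.52 and the failsafe number of 10.74.

```python
# Illustrative sketch only: fixed-effect combination of standardized mean differences
# and a Rosenberg (2005)-style weighted failsafe number.
import numpy as np
from scipy.stats import norm

def fixed_effect_summary(d, n_per_cell, alpha_one_tailed=0.05):
    d = np.asarray(d, dtype=float)
    n = np.asarray(n_per_cell, dtype=float)     # participants per condition (n1 = n2 = n)
    var = 2.0 / n + d**2 / (4.0 * n)            # large-sample variance of d
    w = 1.0 / var                               # fixed-effect weights
    z = np.sum(w * d) / np.sqrt(np.sum(w))      # combined test statistic
    p_two_tailed = 2 * norm.sf(abs(z))
    # Weight of zero-effect studies needed to pull the combined test down to the
    # one-tailed criterion, expressed as a count of studies of average observed weight.
    z_crit = norm.isf(alpha_one_tailed)
    failsafe_n = ((np.sum(w * d) / z_crit) ** 2 - np.sum(w)) / np.mean(w)
    return z, p_two_tailed, failsafe_n

print(fixed_effect_summary([0.67, 0.57, 0.77], [20, 20, 20]))
# approximately Z = 3.5, p < .001, failsafe number near 11
```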

On one hand, these failures should certainly cause us to revise our estimate of the magnitude of the effect size. The average effect size across the six studies is smaller than the original estimate. However, in two other respects a meta-analytic view of the studies suggests that we should increase rather than decrease our confidence in the existence of the effect (i.e., we should be more confident that the effect size in the population is positive rather than 0). The addition of these 3 "failed" experiments actually slightly increases the meta-analytic test statistic to Z = 3.69, p = .0002 (two-tailed) and substantially increases the failsafe number to 24.18. That is, it would now require 967 additional participants' worth of data producing an average effect size of 0 to reach borderline significance (one-tailed). Thus, these 3 experiments, which would be regarded by many as failures, have actually slightly reduced the likelihood of a Type I error and substantially increased the number of (average) null-effect studies that would be required to eliminate the effect at the meta-analytic level.
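This cumulative point can be seen by re-running the same kind of summary with and without the three directionally consistent but non-significant replications. The sketch below is again only an illustrative approximation under the same assumptions noted for the previous sketch; the combined test statistic and failsafe number both move in the direction described above.

```python
# Illustrative sketch: a cumulative fixed-effect summary before and after adding
# the three non-significant but directionally consistent replications.
import numpy as np
from scipy.stats import norm

def z_and_failsafe(d, n_per_cell, z_crit=norm.isf(0.05)):
    d, n = np.asarray(d, float), np.asarray(n_per_cell, float)
    w = 1.0 / (2.0 / n + d**2 / (4.0 * n))      # fixed-effect weights for d
    z = np.sum(w * d) / np.sqrt(np.sum(w))
    nfs = ((np.sum(w * d) / z_crit) ** 2 - np.sum(w)) / np.mean(w)
    return round(z, 2), round(nfs, 1)

original = [0.67, 0.57, 0.77]
replications = [0.32, 0.24, 0.40]
print(z_and_failsafe(original, [20] * 3))                  # roughly (3.5, 11)
print(z_and_failsafe(original + replications, [20] * 6))   # roughly (3.7, 25)
```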

Although this finding might seem counter-intuitive, if one considers it from a meta-analytic perspective, it is entirely sensible. If the effect size in the population is truly 0, we would expect experiments testing this effect to mostly produce effect sizes close to 0 and for these effects to form a roughly symmetrical distribution on both sides of 0 (i.e., to have a roughly equal number of positive and negative effects). As our original 3 experiments demonstrated at the meta-analytic level, if an effect size is 0 in the population, it is extremely unlikely that one would obtain 2 significant effects and 1 nearly significant effect in the same direction (unless they are selectively chosen from a much larger set of studies; we address that possibility later in the manuscript). The addition of even 3 non-significant, more modest-sized effects in the same direction would only increase our confidence in the direction of the effect. If the real population effect size were actually 0, the increased asymmetry in the direction of effects (i.e., 6 effects in one direction) would be even less likely than the asymmetry observed in the first three studies.

Moreover, it is not even clear whether these three weaker replication experiments should prompt us to look for moderators that might account for differences between the original experiments and the replication experiments. The test of heterogeneity of effect sizes is nowhere close to significant for the set of 6 experiments, Q (df = 5) = 2.04, p = .84. Thus, the distribution of these effect sizes does not suggest that they were drawn from different populations. Moreover, if we conduct a formal meta-analytic test comparing the average effect size in the original experiments with the average effect size in the replication experiments, this test is also not significant, Q (df = 1) = 1.73, p = .19. Thus, although it might be sensible on conceptual grounds to explore potential moderators in future research, these 3 replication experiments are not sufficiently discrepant from the original studies to provide a clear statistical foundation to suggest that the sets of experiments might differ as a function of one or more moderators. The differences could well be attributed to chance variations in effect sizes even though superficial "eyeballing" of the effects might make the set of six studies seem less supportive of the effect than the original set of three studies.

Contrary to the assumption made by many researchers, this example clearly illustrates that some experiments regarded as "failures" might actually strengthen the evidence arguing for the existence of an effect (see also Braver et al., 2014). Furthermore, even replication experiments producing very small effects in the expected direction will often have only modest implications at the meta-analytic level. For instance, imagine that the researcher conducting the 3 replication experiments had obtained an even weaker set of effect sizes: d = .16, .12, and .20 (with corresponding p values of .62, .71, and .53). The addition of these experiments would produce a modestly reduced but still significant meta-analytic test statistic, Z = 3.07, p = .002 (two-tailed), and would actually increase the failsafe number to 14.95. The test of heterogeneity of effect sizes would still be non-significant, Q (df = 5) = 3.92, p = .56, and a meta-analytic test comparing the original experiments and the replication experiments would fall just short of significance, Q (df = 1) = 3.61, p = .06. Thus, from a meta-analytic standpoint, it would take a more extreme set of "failure" studies to statistically undermine the effect than many researchers would assume.
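The heterogeneity statistics reported in the last two paragraphs can be computed in the same framework. The sketch below is illustrative only and uses the same effect-size variance approximation as the earlier sketches; it approximately reproduces the values reported above for both the first replication set (d = .32, .24, .40) and the weaker set (d = .16, .12, .20).

```python
# Illustrative sketch: fixed-effect heterogeneity (Q) tests for the hypothetical studies.
import numpy as np
from scipy.stats import chi2

def weights(d, n_per_cell=20.0):
    d = np.asarray(d, float)
    return 1.0 / (2.0 / n_per_cell + d**2 / (4.0 * n_per_cell))

def q_stat(d, w):
    d = np.asarray(d, float)
    mean = np.sum(w * d) / np.sum(w)
    return np.sum(w * (d - mean) ** 2)

original = [0.67, 0.57, 0.77]
for reps in ([0.32, 0.24, 0.40], [0.16, 0.12, 0.20]):
    d = np.array(original + reps)
    w = weights(d)
    q_total = q_stat(d, w)                                   # heterogeneity across all 6
    q_within = q_stat(d[:3], w[:3]) + q_stat(d[3:], w[3:])
    q_between = q_total - q_within                           # original vs. replication sets
    print(round(q_total, 2), round(chi2.sf(q_total, 5), 2),
          round(q_between, 2), round(chi2.sf(q_between, 1), 2))
# roughly Q(5) = 2.1, p = .84 and Q(1) = 1.8 for the first replication set;
# roughly Q(5) = 4.0, p = .55 and Q(1) = 3.8 for the weaker set
```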

Of course, this does not mean that any set of failed replications will have negligible effects on the meta-analytic case for the existence of an effect. Had our researcher conducting the 3 replication experiments obtained non-significant negative effect sizes of d = -.16, -.12, and -.20 (with corresponding p values of .62, .71, and .53), the meta-analytic case for the existence of the effect would have been substantially undermined. The addition of these experiments to the original three experiments would reduce the meta-analytic test statistic to Z = 1.78, p = .07 (two-tailed) and decrease the failsafe number to 1.06. Additionally, although the test of heterogeneity of effect sizes would fall just a bit short of significance, Q (df = 5) = 10.22, p = .07, a formal meta-analytic test comparing the original experiments and the replication experiments would be significant, Q (df = 1) = 10.00, p = .002. Thus, there would be a statistical basis to argue that the replication experiments and original experiments differed in some meaningful way. Furthermore, this third example helps illustrate an important fact about the failsafe number of the original set of experiments (10.74). When the effect sizes from a set of replication experiments are in the opposite direction to the original results, the meta-analytic case for an effect can be undermined with far fewer studies than the typical failsafe number would suggest (e.g., Iyengar & Greenhouse, 1988).

Conclusions and Recommendations

Traditionally, a replication experiment has been considered a success if it produces a statistically significant effect in the expected direction. This definition seems sensible in that one can argue the experiment provides evidence for the existence of the effect at the individual level and it is guaranteed to strengthen the case for the existence of the effect at the meta-analytic level.5 However, the prior meta-analytic examples illustrate that what should be considered a "failed" replication is much more difficult to determine. Not all failed replications are created equal. As our first example illustrated, sometimes studies that might be considered "failures" could actually strengthen the case for the existence of the effect at the meta-analytic level (see also Braver et al., 2014). Although one might be reluctant to declare these studies successful replications, it seems inappropriate to label a study that strengthens the meta-analytic case for an effect as a failed replication. It would perhaps be more appropriate to label such studies as directionally consistent and only label studies that are sufficiently discrepant to weaken the meta-analytic case for an effect as failed replications. From this perspective, one might view the success versus failure of a replication experiment as more of a continuum than a simple dichotomous distinction.

Irrespective of the label assigned to a given study, we believe that the examples highlight the value of evaluating replication experiments in terms of their meta-analytic implications. Thus, once an effect has been initially demonstrated, the results of each new attempt to replicate this effect (presuming a credible case exists for assuming psychometric invariance) should be integrated with the original demonstrations and earlier replication attempts in a "cumulative meta-analysis" that provides an updated assessment of the overall statistical evidence for the effect (Braver et al., 2014; Mullen et al., 2001).

It is worth noting that the approach we advocate is a somewhat different use of meta-analysis than what most readers are likely accustomed to seeing. Traditionally, meta-analysis is used to evaluate large numbers of studies conducted by researchers from many different labs. Such analyses are conducted to characterize an overall effect of interest as well as to understand variations in the magnitude of that effect across studies. In contrast, the approach we have illustrated uses meta-analysis to evaluate a single experiment or a small set of experiments to determine how their addition to a prior set of experiments alters meta-analytic indices. As such, the use of meta-analytic indices in this context is primarily comparative in nature. For instance, although the issue of whether the overall meta-analytic test statistic for an effect is statistically significant is certainly of interest, the "success" or "failure" of a replication experiment is primarily evaluated with respect to whether its addition increases or decreases this test statistic. Likewise, in this context, the failsafe number is used in a comparative sense to see if the addition of a replication experiment has increased or decreased the robustness of the meta-analytic case for an effect to unreported or future non-replications.

Finally, it should also be acknowledged that the meta-analytic approach we have discussed is not the only alternative to the traditional approach to evaluating replications. Although the traditional view of evaluating replications remains pervasive, a number of researchers have recently discussed its limitations and proposed new methods of evaluating replications (e.g., Simonsohn, 2015; Verhagen & Wagenmakers, 2014). There is certainly much to recommend about the meta-analytic approach we have discussed, as meta-analytic techniques are familiar to most researchers, have been extensively evaluated in the methodological literature, and have been adapted for use with a wide range of different study designs and test statistics. That being said, many of these new approaches show promise and provide useful alternative perspectives on assessing replications. Indeed, a researcher could certainly evaluate a given replication study or set of studies from multiple perspectives. Such a multiple-perspective approach might provide insights not readily apparent when assessed using a single approach.

Evaluating the Robustness of Research Findings to Non-Replications

Evaluating the Credibility of Research Findings

As discussed in the prior section, it can be very useful to evaluate the statistical implications of a non-significant replication experiment from a meta-analytic perspective. Such an analysis can suggest very different conclusions than those often drawn by researchers when considering each experiment in isolation. This meta-analytic approach also has potential utility when considering another issue related to replication that has generated considerable debate.

Recently, concern has been expressed regarding the credibility of psychology articles reporting multiple experiments with statistically significant effects when these experiments have comparatively modest power (e.g., Francis, 2012, 2013b; Schimmack, 2012). These commentators note that when the power of individual experiments is modest, it is very unlikely that a researcher could conduct multiple experiments without obtaining at least a few experiments with non-significant results. Thus, if a set of modestly powered experiments is reported and all these experiments are statistically significant, it is likely that the researcher in question has conducted, but chosen not to report, one or more non-significant experiments. These commentators argue that the fact that such experiments likely exist is a basis to seriously question the existence of the effect demonstrated in the published experiments. For example, commenting on a set of 12 experimental findings relating the color red to women's ratings of various targets from Elliot et al. (2010), Francis (2013a) suggested that, because of the likelihood of non-significant unreported studies, "it remains an open question whether the color red influences women's ratings of men's attributes" (p. 292). That is, Francis (2013a) suggested that "the results may be entirely made up of Type I errors" and that for a new researcher interested in the effects of color "the best approach is to ignore the findings in Elliot et al. and start over" (p. 296). Accordingly, advocates of this viewpoint argue that articles containing only one or two very highly powered experiments are much more convincing and less likely to be undermined by unreported non-significant experiments (Schimmack, 2012).

A Meta-Analytic Perspective on Robustness to Non-Replication

As with the previous issue, we believe it is useful to evaluate this position from a meta-analytic perspective using some hypothetical examples. Consider the first example presented in the previous section, in which our original researcher conducted 3 experiments, each based on a sample of 40 participants and producing effect sizes of d = .67, .57, and .77 (with p values of .04, .08, and .02). As we previously noted, if we conduct a power analysis of these studies using the average effect size of the three studies, the power of each experiment is quite modest (.54). The "total power" of this set of studies is approximately .16 (see Schimmack, 2012). That is, given the power of each individual study, there is only a 16% chance that a researcher conducting these 3 studies could have obtained a significant effect for all 3 experiments. Certainly such an outcome is not impossible, but it is comparatively unlikely. Thus, it would not be unreasonable for us to expect that this researcher might have had some failures to obtain a significant effect that have gone unreported.
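As a concrete illustration (a sketch under standard two-sample t-test assumptions, not the authors' own calculation), per-study power and the "total power" of obtaining significance in all three studies can be computed as follows. The helper name is illustrative.

```python
# Illustrative sketch: power of a two-condition experiment with d = .67 and 20
# participants per cell, and the "total power" of three such studies all reaching
# significance (Schimmack, 2012).
import numpy as np
from scipy.stats import nct, t

def two_sample_power(d, n_per_cell, alpha=0.05):
    df = 2 * n_per_cell - 2
    nc = d * np.sqrt(n_per_cell / 2.0)        # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)
    # probability of landing beyond the critical value in either tail
    return nct.sf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)

power = two_sample_power(0.67, 20)
print(round(power, 2), round(power ** 3, 2))   # roughly .54 per study, .16 for all three
```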

A key question is whether the possible existence of these non-significant experiments should cause us to view the published data as uninformative and the effect as illusory (cf. Francis, 2013a). Some commentaries have certainly challenged the notion that reason to suspect some unreported non-significant studies implies that the published studies have no evidentiary value (e.g., Simonsohn, 2012, 2013). Our prior meta-analytic examples support this dissenting view in that they demonstrate that the existence of non-significant studies could have a negligible impact on the meta-analytic evidence for an effect or could actually strengthen the meta-analytic evidence for an effect. Thus, from a meta-analytic perspective, there is no reason to assume that the existence of non-significant studies must increase the likelihood that the significant published studies are a result of Type I error.

Moreover, our meta-analytic illustrations also highlight that multi-study publications might often require a large number of non-replications before the meta-analytic evidence would lead one to question the existence of the effect. For instance, the failsafe number in our original example was 10.74. Thus, if the unreported studies averaged an effect size of zero, it would take a fairly large number of additional studies to bring the meta-analytic result to non-significance. Of course, if the reported studies really represent Type I errors, then they come only from one tail of a larger distribution of effects, and the remaining studies would average something less than zero. As noted by Iyengar and Greenhouse (1988), it is possible to estimate the mean of the unreported studies by assuming that they cover the truncated distribution of the other 95% of the distribution (if the reported studies cover the 5% tail in the predicted direction). That assumption would seem conservative in the current case. Even so, using this approach in the context of our original example leads to an estimated average effect size of d = -.03454 for the unreported studies and a failsafe N of 6.85. Thus, with this comparatively conservative estimate, the researcher in our original example would require more than twice as many failed studies as reported studies to undermine the meta-analytic case for the effect.
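A sketch of the kind of calculation behind that estimate is shown below: if the reported studies are taken to occupy the upper 5% tail of a normal distribution of effects centered on zero, the unreported studies are assumed to come from the remaining lower 95%, whose mean is slightly negative. The standard error here assumes 20 participants per cell and an effect size near zero; the result is close to, but not exactly, the d = -.03454 used in the text.

```python
# Illustrative sketch: mean of the lower 95% of a normal distribution of effects
# centered on zero (an Iyengar & Greenhouse, 1988, style assumption about the
# unreported studies).
import numpy as np
from scipy.stats import norm

se = np.sqrt(2.0 / 20)                # SE of d near zero with 20 participants per cell
cutoff = norm.isf(0.05)               # standardized upper 5% boundary
truncated_mean = -se * norm.pdf(cutoff) / norm.cdf(cutoff)
print(round(truncated_mean, 4))       # roughly -0.034
```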

More generally, it would seem important to directly identify the distribution of effects that is implied by assuming that the reported effects represent Type I errors. In the current case, assuming that the three presented studies represent the 4 most extreme percent of obtained effects (i.e., half of the two-tailed probability of the weakest reported study) in a normal distribution of effects would suggest that the researcher collected a very large number of studies. That is, the distribution centered on a population effect of zero would imply that the complete set of studies included approximately three studies of equal extremity in the opposite direction of the reported studies as well as 75 additional studies (92% of the distribution) that fell in between the two tails. Thus, assuming that the reported studies were Type I errors would imply in the current case that the researcher collected roughly 81 studies in order to publish the 3 that were reported (and that the researcher withheld 3 equally extreme studies in the opposite direction of those that were published). Those assumptions seem rather extreme and unlikely to characterize the actual practices of the researcher.
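The arithmetic behind that implied "file drawer" is simple enough to sketch directly. The exact totals depend on how the tail proportion is rounded (the text arrives at roughly 81 studies; the version below, which treats the three reported studies as exactly the upper 4% tail, lands in the mid-70s), but either way the implied number of studies is implausibly large.

```python
# Illustrative sketch: how many studies a zero-centered distribution of effects would
# imply if the three reported studies are assumed to be its most extreme upper tail.
reported = 3
tail = 0.08 / 2.0                      # half the two-tailed p of the weakest reported study

total = reported / tail                # implied total number of studies conducted
opposite_tail = total * tail           # equally extreme studies in the opposite direction
middle = total * (1 - 2 * tail)        # studies falling between the two tails
print(round(total), round(opposite_tail), round(middle))   # roughly 75, 3, 69
```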

When considered from a meta-analytic perspective, the views advocated by Schimmack (2012) and Francis (2012, 2013b) seem to have two potential inferential problems. First, these perspectives assume that the statistical implications of all non-significant experiments are the same: a non-significant experiment should reduce our confidence in the existence of the effect. As the prior meta-analytic examples show, this assumption is highly problematic. Indeed, if we assume that researchers' non-replications are a result of sampling error in insufficiently powered experiments, the expectation would be that most of these non-significant experiments would fall in the same direction as the published experiments. That is, with a normal distribution of effects and a study of power equal to .5, 97.5% of all the effects would fall in the same direction as the population effect. In such settings, then, it is highly unlikely that the unreported studies would oppose the reported studies or come close to an average effect size of zero. Thus, the most common meta-analytic outcome of including these unreported studies would be to lower the estimate of the effect size but to increase both the meta-analytic test statistic and failsafe number (i.e., to increase confidence in the existence and direction of the effect).

The second problem with these perspectives is that they do not directly address the strength of the meta-analytic case made by multiple significant studies. It is true that the likelihood of a non-significant experiment existing does increase as the number of significant published studies increases. Indeed, this is true even when studies are highly powered at the individual level. The rate of the increase is simply slower than that of a modestly powered set of studies. However, more important meta-analytic implications are that the likelihood of the reported effects reflecting Type I errors from a population effect size of 0 decreases dramatically as significant effects are added, and the number of non-significant studies needed to eliminate the effect at the meta-analytic level increases dramatically as the number of significant studies increases. Current perspectives pointing to a lack of "credibility" of multiple significant but modestly powered studies do not take into account the implications of the increasing strength of the meta-analytic case as the number of significant studies increases.

To further illustrate some of the inferential problems that can arise when ignoring the meta-analytic case, it is useful to briefly consider two additional meta-analytic examples. Imagine that our researcher had actually conducted 5 experiments rather than 3, based on sample sizes of 40 and producing effect sizes of d = .66, .77, .55, .65, and .72. The corresponding two-tailed significance levels for these 5 experiments are p = .04, .02, .09, .05, and .03. This set of studies would once again have an average effect size of .67. Using this average as our basis for a power analysis, the total power for this set of 5 studies would be approximately .05. Thus, this set of studies would be seen as even less credible (and would be more likely to be dismissed) if following the rationale presented by Francis (2012, 2013b) or Schimmack (2012). However, at the meta-analytic level, these studies would be much more convincing that the population effect is not 0 than our original example with 3 studies. The meta-analytic test statistic for the 5-study set is Z = 4.55, p < .000005 (two-tailed). The Rosenberg (2005) failsafe number is 33.19. Even assuming that the unreported studies average an effect of d = -.03454 (because they cover the truncated distribution of the remaining 95% of studies after the 5% most consistent with the hypothesis have been reported, Iyengar & Greenhouse, 1988), the failsafe number would be 17.85.6

The likelihood of at least one non-significant study having occurred is higher in this set of five studies than in the previous 3-study set. However, the meta-analytic evidence is quite a bit stronger for the 5-study set in terms of the test statistic and both types of failsafe numbers. Thus, although the total power value suggests a greater likelihood of one or more unreported studies, the meta-analytic results suggest that the 5-study set would be much more robust to the potential existence of any unreported studies that are not significant.

Alternatively, imagine that our researcher had conducted a single experiment based on a sample size of 80 and producing an effect size of d = .67. This study would have a power of .84. This study would not provide any basis for concern when viewed from the perspectives outlined by Schimmack (2012) or Francis (2012, 2013b). However, from a meta-analytic perspective, it would provide far less evidence for the existence of the effect than either of the prior examples. Its meta-analytic test statistic would be Z = 2.88, p = .004. Its Rosenberg (2005) failsafe number would be 2.06 (i.e., 165 subjects' worth of data showing an effect size of 0 would undermine the meta-analytic significance of the effect), and assuming that any unreported studies average an effect of d = -.03454 results in a failsafe number of 1.46.
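The contrast between the two examples can be reproduced with the same kind of fixed-effect summary used in the earlier sketches. As before, this is illustrative code rather than the authors' own, and the printed values only approximately match those reported in the text.

```python
# Illustrative sketch: combined test statistic and Rosenberg-style failsafe number
# for the 5-study set (n = 40 each) versus the single higher-powered study (n = 80).
import numpy as np
from scipy.stats import norm

def z_and_failsafe(d, n_per_cell, z_crit=norm.isf(0.05)):
    d, n = np.asarray(d, float), np.asarray(n_per_cell, float)
    w = 1.0 / (2.0 / n + d**2 / (4.0 * n))
    z = np.sum(w * d) / np.sqrt(np.sum(w))
    nfs = ((np.sum(w * d) / z_crit) ** 2 - np.sum(w)) / np.mean(w)
    return round(z, 2), round(nfs, 1)

print(z_and_failsafe([0.66, 0.77, 0.55, 0.65, 0.72], [20] * 5))   # roughly (4.6, 34)
print(z_and_failsafe([0.67], [40]))                               # roughly (2.9, 2.1)
```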

When considering the potential risks that might arise from a researcher having conducted multiple studies and only choosing to report the significant studies, our 5-modest-powered-study and 1-high-powered-study examples provide an interesting contrast. In the single high-power case, it is quite possible that the researcher only conducted the study reported. However, if the "file drawer" of this researcher is not empty, there would not need to be much in that file drawer to undermine evidence for the effect (only 2 studies with sample sizes of 80 and an average effect size of 0, or only 1.5 studies with sample sizes of 80 and an average effect size of -.03454). Certainly it does not seem impossible that a researcher could have run 3 studies (for a total of 200 to 240 participants) and simply picked the one study that produced an effect. Moreover, if we allow for the possibility of researchers using opportunistic data analytic practices to exaggerate the effect (see Simmons, Nelson, & Simonsohn, 2011), researchers could do so in any single study and not worry about the consistency of the direction of the effect or consistency of the specific practices across the studies because only one study is selected for publication.

In our example consisting of 5 studies, we have a pretty compelling basis to believe that the file drawer is not empty. However, to undermine evidence for the effect, that file drawer would require quite a number of studies. Moreover, because multiple studies are reported, it is not enough that these 5 studies be significant (or close to it). All 5 studies must be significant in the same direction. If opportunistic data analytic practices are being suggested as the method by which the 5 studies were obtained, presumably the strategy would need to be consistent across the 5 studies so as not to attract suspicion (see Galak & Meyvis, 2012). That would not be necessary for the one high-powered study because only one approach to the data is being presented.7

Conclusions and Recommendations

In summary, when evaluating the credibility of a set of findings, it is certainly reasonable to be concerned about the likelihood that a researcher has failed to report all the relevant experiments. However, simply computing the likelihood that an unreported non-significant experiment exists is not enough to evaluate the statistical credibility of the claims based on a set of experiments. It is also important, and perhaps more important, to consider the robustness of the reported studies to the possible existence of these non-significant studies. We believe that meta-analytic statistics provide a useful approach to evaluating the strength of the case a set of studies makes as well as the robustness of the set of studies to unreported non-significant studies. These meta-analytic tools are also particularly valuable in making comparisons between sets of studies or evaluating how the case changes when new studies are added to a previous set.

A researcher reporting a single reasonably well-powered experiment might be comparatively unlikely to be concealing one or more non-significant studies. However, if other experiments do exist, the reported experiment might be relatively vulnerable to those non-replications. Additionally, even if we assume that the researcher has reported all the experiments, one would certainly need to recognize that the effect reported in this study might potentially be vulnerable to future non-significant replications (depending on the sample size in that high-powered study). In comparison, a researcher reporting multiple experiments of moderate power could have conducted some additional studies, but the studies that have been reported are also likely to be relatively robust to the existence of non-significant studies. Moreover, this set of studies might also be relatively robust to future non-significant replications by other researchers (again, depending on the sample size across all the studies and how they compare with the overall size of a comparison study or set of studies). When considering which researcher has provided the more compelling statistical case for the existence of their effect, it is often possible that an argument can be made in favor of the researcher reporting multiple modestly-powered studies over the researcher reporting a single higher-powered study.

That being said, our goal here is not to suggest that statistical power at the level of individual studies is unimportant. Indeed, our multi-study examples would have been even more compelling at the meta-analytic level had each study had higher power. Instead, our goal is to note that individual study power (or total power across a set of studies, Schimmack, 2012) is not the only factor that should be considered. The implications of non-significant studies for the existence of an effect are much more complex than many recent discussions of this issue have acknowledged.

We also want to make clear that, although we see meta-analytic statistics as very useful, they obviously do not address all relevant considerations when evaluating the potential implications of unreported studies. For example, as we have already noted, although the various failsafe indices have some value in gauging robustness of findings to non-replication (especially changes in that robustness as studies are added), they must be interpreted in the context of their assumptions regarding the average effect size used to create the index. If there is a basis to seriously question the effect size being used for that calculation, inferences based on the index must be adjusted accordingly.8

It is also important to point out that there are some cases in which one might argue that unreported studies should not be considered at all when evaluating statistical evidence for the existence of an effect. Many of the same issues regarding psychometric invariance that we have previously discussed are also relevant when evaluating the appropriateness of including unreported studies in a meta-analytic assessment. If analyses suggest that the operationalizations used in a given study failed to achieve their intended psychometric properties, it might well be inappropriate to include that study in a meta-analysis. For instance, if a researcher conducted several studies in which operationalizations were unsuccessful before arriving at the final set of experimental procedures that achieved their psychometric objectives, a case might be made that the most accurate estimate of the effect of interest would be based on a meta-analytic evaluation that excludes these earlier studies.

In closing, it should also be noted that meta-analytic statistics are only as good as the data and the primary analyses upon which they are based. If inappropriate data collection practices and/or data analytic practices have been undertaken, meta-analytic results will be misleading. Of course, these same caveats also apply to the other approaches to evaluating the statistical credibility of a set of research findings (e.g., Francis, 2012, 2013b; Schimmack, 2012).

General Discussion

Replication has long been regarded as a central feature of scientific research. Yet, recent discussions of replication in psychology have left much ambiguity regarding how replications should be carried out and how the implications of their results should be evaluated. The fact that such ambiguity exists is perhaps not surprising given that many of the traditional distinctions made in discussions of replication are more nebulous than discussions of them would often suggest. Accordingly, a central theme that we have tried to stress is that although these distinctions have some conceptual utility, they are more complex and fuzzy than is often appreciated. The distinction between exact and conceptual replication is not as clear as it might seem, and the relative strengths and limitations of these approaches are not as straightforward as some have asserted. Similarly, defining when a replication study has succeeded versus failed is not as simple as many have assumed. The broader statistical implications of the existence or possible existence of non-significant experiments are not as clear as has been asserted. We do not see these ambiguities as insurmountable. Rather, we believe they suggest the need to adopt new and more sophisticated ways of approaching these problems. We have suggested a few general approaches that might prove useful in dealing with these challenges (see Table 1 for a summary of these recommendations).

Another theme that readers might draw from our discussion is that concerns about a "replication crisis" in psychology are exaggerated. In a number of respects, one might conclude that much of what we have said is reassuring for the field. For example, reports of exact replications using original materials failing to produce significant effects might be less troubling to the extent that psychometric invariance of the manipulations and measures has not been addressed (or to the extent that unmanipulated background factors have varied across the settings in which the original and replication research have been conducted). If such issues have not been addressed, perhaps "failures to replicate" in such settings should not be seen as evidence of the illusory nature of the original findings; rather, the implications of such failures simply remain unclear. Moreover, some studies that were judged as failures might not actually have undermined the statistical case for the effect and could even have enhanced it. Finally, even if some published articles failed to include all the relevant experiments, these omissions do not necessarily imply that there is no evidence for the effects that were claimed.9

Much of what we have said could be viewed as challenging some of the more extreme claims that have been made regarding the state of the field. We certainly feel that some of the claims have been based on questionable assumptions or incomplete analyses. That being said, our goal is not to advocate complacency. Rather, we believe that it is crucial for researchers to be educated and conversant in the concerns that have been raised and the costs or benefits of the various solutions that have been offered. We also believe that more frequent and thorough use of tools designed to analyze sets of studies, such as meta-analysis, should become a standard part of the research process. This seems especially true because replicability is often required within top-tier publications and because sets of studies, rather than single studies in isolation, constitute the empirical case for an effect. As a field, we can clearly do better to educate ourselves about these issues and to prepare ourselves to scrutinize and explain the nature of the empirical case supporting (or failing to support) the predictions we test in our research. Thus, our goal in this article was to examine and unpack some of the solutions that are currently being advocated. We hope that the discussion helps to highlight some of the questions that reviewers, editors, and research consumers might ask themselves as they scrutinize attempts at replication and evaluate multi-study empirical articles.

References

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Chichester, West Sussex, UK: John Wiley & Sons Ltd.

Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9, 333-342.

Brewer, M. B., & Crano, W. D. (2014). Research design and issues of validity. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (Second Edition, pp. 11-26). New York, NY: Cambridge University Press.

Cesario, J. (2014). Priming, replication, and the hardest science. Perspectives on Psychological Science, 9, 40-48.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand McNally College Publishing Company.

Elliot, A. J., Greitemeyer, T., Gramzow, R. H., Kayser, D. N., Lichtenfeld, S., Maier, M. A., & Liu, H. (2010). Red, rank, and romance in women viewing men. Journal of Experimental Psychology: General, 139, 399-417.

Francis, G. (2012). The psychology of replication and replication in psychology. Perspectives on Psychological Science, 7, 585-594.

Francis, G. (2013a). Publication bias in "Red, rank, and romance in women viewing men," by Elliot et al. (2010). Journal of Experimental Psychology: General, 142, 292-296.

Francis, G. (2013b). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57, 153-169.

Galak, J., & Meyvis, T. (2012). You could have just asked: Reply to Francis (2012). Perspectives on Psychological Science, 7, 595-596.

Gorsuch, R. L. (1983). Factor analysis (2nd Edition). Hillsdale, NJ: Lawrence Erlbaum Associates.

Iyengar, S., & Greenhouse, J. B. (1988). Selection models and the file drawer problem. Statistical Science, 3, 109-135.

LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's evidence of Psi as a case study of deficiencies in modal research practices. Review of General Psychology, 15, 371-379.

Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 131-152). Mahwah, NJ: Lawrence Erlbaum Associates.

Mullen, B., Muellerleile, P., & Bryant, B. (2001). Cumulative meta-analysis: A consideration of indicators of sufficiency and stability. Personality and Social Psychology Bulletin, 27, 1450-1462.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660.

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 631-636.

Petty, R. E., & Cacioppo, J. T. (1979). Issue-involvement can increase or decrease persuasion by enhancing message-relevant cognitive responses. Journal of Personality and Social Psychology, 37, 1915-1926.

Petty, R. E., & Cacioppo, J. T. (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York, NY: Springer-Verlag.

Rosenberg, M. S. (2005). The file-drawer problem revisited: A general weighted method for calculating fail-safe numbers in meta-analysis. Evolution, 59, 464-468.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosenthal, R. (1990). Replication in behavioral research. In J. W. Neulip (Ed.), Handbook of replication research in the behavioral and social sciences (pp. 1-30). Corte Madera, CA: Select Press.

Rosenthal, R. (1991). Meta-analytic procedures for social research (Revised Edition). Newbury Park, CA: Sage Publications.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566.

Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90-100.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9, 76-80.

Simonsohn, U. (2012). It does not follow: Evaluating the one-off publication bias critiques by Francis (2012a, 2012b, 2012c, 2012d, 2012e, in press). Perspectives on Psychological Science, 7, 597-599.

Simonsohn, U. (2013). It really does not follow, comments on Francis (2013). Journal of Mathematical Psychology, 57, 174-176.

Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26, 559-569.

Stanley, D. J., & Spence, J. R. (2014). Expectations for replications: Are yours realistic? Perspectives on Psychological Science, 9, 305-318.

Stroebe, W. (2015). Why in social psychology most published research findings may not be false. Journal of Experimental Social Psychology.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59-71.

Verhagen, J., & Wagenmakers, E. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143, 1457-1475.

Widaman, K. F., & Grimm, K. J. (2014). Advanced psychometrics: Confirmatory factor analysis, item response theory, and the study of measurement invariance. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (Second Edition, pp. 534-570). New York, NY: Cambridge University Press.

Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281-324). Washington, DC: American Psychological Association.

Authors' Note

Preparation of this manuscript was supported by two Insight Grants (410-11-0346; 435-2015-0114) from the Social Sciences and Humanities Research Council (SSHRC) of Canada to the first author. Inquiries regarding this manuscript can be directed to Leandre R. Fabrigar, Department of Psychology, Queen's University, Kingston, ON, K7L 3N6, Canada, Email: [email protected]

Table 1

Summary of Recommendations for Facilitating and Conducting Replications
______________________________________________________________________________
Recommendations for Original Experiments
    Articulate Principles Underlying Construction of Operationalizations
    Evaluate Psychometric Properties of Operationalizations
    Fully Report Results of Psychometric Evaluation of Operationalizations
    Make Experimental Materials Available
    Make All Pre-Testing and Primary Experiment Data Available
    Report Meta-Analytic Statistics for Key Effects across Primary Experiments to Assess Robustness to Non-Replication

Recommendations for Replication Experiments
    Evaluate Original Experimental Materials with Reference to New Context and Population
    Justify Retention/Revision of Original Materials for New Context and Population
    Evaluate Psychometric Properties of Operationalizations in New Context and Population
    Compare Psychometric Properties of New and Original Experiment to Assess Invariance
    Evaluate Meta-Analytic Implications of Replication Experiment
______________________________________________________________________________

Footnotes

1. These recommendations regarding psychometric invariance would primarily apply to exact replications. In the case of a conceptual replication, a researcher would not necessarily expect psychometric invariance to hold. Indeed, by deliberately adopting a procedurally distinct operationalization, the researcher would expect that the manipulation might have a somewhat different pattern of effects on unintended constructs (i.e., different "irrelevancies") and would not necessarily expect that its impact on the intended construct would be of the same magnitude as the original manipulation. Even in such cases, however, differences in the links between the manipulation and the intended construct could be informative about whether the conceptual replication represented a good test of the hypothesis of interest.

2. There can sometimes be ambiguity regarding what constitutes a conceptual replication versus a demonstration of a related but new phenomenon (see Stroebe, 2015). Strictly speaking, an experiment can only be considered a conceptual replication if both the independent and dependent variables can be considered alternative operationalizations of the same constructs examined in the original experiment.

3. All the meta-analytic calculations presented in this example and subsequent examples are based on the fixed-effect model discussed by Borenstein et al. (2009). In the context of exact replications in which psychometric invariance has been achieved, the underlying assumptions of the fixed-effect model are very defensible. Although all our calculations are based on the fixed-effect model, it is important to recognize that all of the general principles we illustrate would also be true using the random-effects model or an alternative meta-analytic approach (e.g., Rosenthal, 1991).

4. All failsafe numbers are based on a failsafe index proposed by Rosenberg (2005). This failsafe index is conceptually similar to the original failsafe index proposed by Rosenthal (1979). However, the Rosenberg (2005) index is computationally compatible with contemporary meta-analytic models, whereas the original failsafe index proposed by Rosenthal (1979) was based on an older meta-analytic approach.

5. Although a successful replication has traditionally been defined as obtaining a significant effect in the expected direction, an alternative approach might be to define successful replication in terms of whether the experiment produces an effect size in the same direction and of comparable magnitude. Although this definition might appear appealing in some respects, it would have to be evaluated in light of its underlying assumptions. First, expecting equivalent effect sizes across studies requires assuming that psychometric invariance has been satisfied at a very strict level. This might be plausible for some exact replications, but it might not be for many conceptual and some exact replications. Second, this approach associates meaning with particular effect sizes of often arbitrary manipulations or measures. Hypotheses in psychology are rarely formulated with respect to a specific effect size, in part because the particular manipulations and measures are often arbitrary instantiations rather than representative of a population of manipulations or measures. Thus, it would be rare to judge a theory as supported with an effect size of d = .60 but rejected with an effect size of d = .30, for example, if both effects were found to be significant.

6. Moreover, if one wanted to assert that the 5 presented studies actually come from a population of effects averaging zero, one is really asserting that the 5 studies reflect one tail covering 4.5% of the distribution (half the two-tailed probability of the weakest presented study). If the distribution of effects is normal, this would imply 5 equally extreme studies in the other tail and 111 studies that fall between the two tails (the remaining 91% of the effects in the distribution). Thus, one is really suggesting that the researcher conducted roughly 121 studies to obtain the 5 that were published.

7. Two of the critical factors in determining the robustness of a given set of studies to additional non-significant studies are the average effect size and the total sample size across these original studies. Thus, our five-study set is more robust because its effect size is the same as the single-study example, but the total sample size across the studies is larger (200 versus 80). Were we to compare our 5-study set to a single study with the same effect size and a sample of 200, the meta-analytic statistics would be extremely similar. Thus, a set of a few well-powered studies will not be more robust than a set of many modestly powered studies if the set of many studies has a comparable effect size and a larger total sample size. In short, when evaluating robustness of findings at the meta-analytic level, the total sample size is more important than the per-study sample size.

8. Although the failsafe index is recognized as having heuristic value by many methodologists and continues to be widely used in published meta-analyses, it has long been recognized that evaluation of publication bias in a research literature requires a number of approaches beyond simply the failsafe number (see Borenstein et al., 2009, for a discussion). Some of these additional methods might be used when evaluating the statistical credibility of a given set of studies, though in many cases there might not be enough individual studies for some of these methods to be feasible.

9. Though a full discussion would require more space than the current manuscript allows, as noted earlier, the vast majority of empirical papers in psychology are testing directional hypotheses that are consistent with the directional predictions generated from a psychological theory. In those theories, hypotheses are rarely formulated with respect to a specific effect size of an arbitrary operationalization of the independent or dependent variable. Therefore, if one were to suggest that the real problem identified by "credibility" critics was that the effect size is incorrect (despite their claims about the presented studies representing Type I errors), it is important to scrutinize the meaning of an effect size per se in the context of testing directional hypotheses. If the evidence for an effect is entirely accurate regarding direction but somewhat inaccurate regarding size, many researchers would quite reasonably believe that the evidence for the existence and direction of the effect is exactly what the study was designed to identify. Indeed, because effect sizes are in part determined by specific features of the operationalizations used and characteristics of the context and sample in which they are employed, it is rare for researchers engaged in basic theory testing to make strong claims regarding effect sizes obtained in experiments. Rather, their claims typically involve generalizations that the effect size will be a non-zero value in a specific direction. Claims seldom assert that a specific effect size value should be reified and thus treated as the "true" or "most representative" effect size of a given psychological phenomenon that would be expected to broadly generalize beyond the study or set of studies in which it was originally obtained.
