Behavioural Processes 89 (2012) 187–195
Review
Comparative psychology and the grand challenge of drug discovery in psychiatry and neurodegeneration

Dani Brunner a,∗, Fuat Balcı b, Elliot A. Ludvig c

a PsychoGenics Inc, 765 Old Saw Mill River Road, Tarrytown, NY 10591, USA
b Koç University, Department of Psychology, İstanbul, Turkey
c Princeton University, Princeton Neuroscience Institute, Princeton, NJ, USA
Article info

Article history: Received 16 June 2011; received in revised form 14 October 2011; accepted 17 October 2011.

Keywords: Behavioral neuroscience; Comparative psychology; Drug discovery; False positives; Psychopharmacology; Publication biases; Translational research
Abstract

Drug discovery for brain disorders is undergoing a period of upheaval. Faced with an empty drug pipeline and numerous failures of potential new drugs in clinical trials, many large pharmaceutical companies have been shrinking or even closing down their research divisions that focus on central nervous system (CNS) disorders. In this paper, we argue that many of the difficulties facing CNS drug discovery stem from a lack of robustness in pre-clinical (i.e., non-human animal) testing. There are two main sources for this lack of robustness. First, there is the lack of replicability of many results from the pre-clinical stage, which we argue is driven by a combination of publication bias and inappropriate selection of statistical and experimental designs. Second, there is the frequent failure to translate results in non-human animals to parallel results in humans in the clinic. This limitation can only be overcome by developing new behavioral tests for non-human animals that have predictive, construct, and etiological validity. Here, we present these translational difficulties as a "grand challenge" to researchers from comparative cognition, who are well positioned to provide new methods for testing behavior and cognition in non-human animals. These new experimental protocols will need to be both statistically robust and target behavioral and cognitive processes that allow for better connection with human CNS disorders. Our hope is that this downturn in industrial research may represent an opportunity to develop new protocols that will re-kindle the search for more effective and safer drugs for CNS disorders.
Contents

1. Introduction
1.1. Cosmic habituation
1.2. Publication biases
2. Sociological factors
2.1. Conflict of interest
2.2. Better drugs for a better future
3. Statistical issues
3.1. Lack of experimental and statistical design prior to experimentation and replication
3.2. Susceptibility to false positives
4. Translational factors
4.1. Animal models of psychiatric and neurodegenerative disorders
4.2. Preclinical cognitive assessment and clinical translation failure
5. Consequences of false-positive publication
6. Summary and conclusion
Acknowledgements
Appendix A
References
∗ Corresponding author. Tel.: +1 914 406 8000; fax: +1 914 593 0645. E-mail address: [email protected] (D. Brunner).
doi:10.1016/j.beproc.2011.10.011
1. Introduction

"Keep the mind open, or at least ajar." (Edwards et al., 1963)

Over the past few years we have seen a flurry of mergers and takeovers in big Pharma. Much of this turmoil is the consequence of an empty pipeline, starved by frequent failures of new drugs in clinical trials and by patent expirations, which have created a growing gulf between drug and target discovery and market success. In the short term, these mergers create large, inefficient bodies with redundant departments and overlapping functions; in the end, through layoffs and reorganization, they create streamlined companies whose success is yet to be judged. Accompanying these reorganizations, in what promises to be a true paradigm shift in central nervous system (CNS) drug discovery, many companies have either considerably reduced or entirely shut down their internal research and development (R&D) programs, turning instead to academia and smaller biotechnology companies for the discovery of new scientific leads (Kissinger, 2011). Although some critics may cheer the demise of big Pharma and the advent of cheaper generic versions of many blockbuster drugs, there is a significant risk that these cuts in total R&D spending may mean fewer new and safer drugs for CNS indications.

How, you may wonder, do these corporate dealings affect the comparative psychologist? In this paper, we, as others have (Markou et al., 2008), argue that one of the main factors leading to this paradigm shift in drug discovery is the failure to translate the apparent successes of preclinical neuroscience (in particular, behavioral neuroscience) to the clinic and the market. That is, early findings about the potential efficacy of a new drug in non-human animals do not stand up to scrutiny, either through subsequent replication or when tested on humans. There are several key reasons for this failure: a lack of robustness in the preclinical findings, a bias in the reporting of preclinical successes and failures, and a lack of robustness in the clinical trials. In this review, we examine some of the factors contributing to the first two reasons, leaving a detailed scrutiny of clinical trials to the expert clinician.

In the drug discovery pipeline for CNS indications, new compounds are often tested in behavioral tasks using non-human animals (usually mice or rats) to verify efficacy before being tested for safety in other species and for efficacy in humans. Table 1 presents an outline of the typical steps in this drug discovery and development pipeline. For CNS disorders, the utility of this process for evaluating potential efficacy depends very strongly on having reliable behavioral tasks in non-human animals with predictive validity for the effects of new compounds on human behavior and cognition. Finding such tests has been problematic. In particular, to find potential new CNS drugs, pre-clinical in vivo assessment often relies on behavioral tests (e.g., the tail suspension test for antidepressants, pre-pulse inhibition of startle for antipsychotics) that are sensitive to older drugs already on the market, searching through pharmacological isomorphisms rather than through an understanding of the comparative principles of behavior (Markou et al., 2008).
Similarly, for cognitive disorders, new drugs are often evaluated through simple learning tasks in animals, such as fear conditioning, novel object recognition, or the Morris water maze, that may not engage the particular cognitive process of interest (such as deficits in executive function for neurodegenerative diseases). Here is where drug discovery needs the technological know-how of comparative psychologists to complement the understanding of neuropathology provided by neuroscientists and clinicians.

In the spirit of the grand challenges that motivate some of the best research in other science and engineering disciplines (http://www.netflixprize.com; http://www.xprize.org/; Thrun et al., 2006; Dandapani and Marcaurelle, 2010), we propose this failure in preclinical drug studies as a potential challenge for all comparative psychologists to tackle.
Table 1
The phases of drug discovery and development.

Preclinical:
Target identification/validation
Lead identification/optimization
Compound screening
Hit identification and confirmation
Hit expansion
Lead optimization
Early ADME/PK
Target profiling
Preclinical efficacy
Cellular and molecular pharmacology
Target engagement readouts
Cellular disease models
Animal models of disease/other testing
Pathophysiology
Behavioral endpoints
General health
Pharmacological sensitivity
ADME/PK
Preclinical tox and safety pharmacology/AMES
Formulation

Clinical:
Safety
Toxicity
Pharmacokinetics
Metabolism
Efficacy
Superiority

Note: ADME/PK: absorption, distribution, metabolism, excretion and pharmacokinetics.
To re-build a thriving pipeline of potential treatments and cures for psychological and neurological disorders, we need new and better tasks that bridge the divide between human and animal psychology. We need tasks that are statistically robust, reliable, repeatable, and, most of all, have predictive validity in the translation to human clinical studies. Recent advances in comparative psychology (Shettleworth, 2009) have led to a rapprochement with many other fields. Here we suggest that a particularly promising venue for future interaction, and one of practical consequence for researchers in comparative psychology, is the field of behavioral pharmacology and the process of drug discovery.

In this paper, we lay out two specific challenges with which the process of preclinical (i.e., non-human animal) testing in CNS drug discovery is currently struggling. First, there is the issue of replicability of results. Many findings on the effects of new drugs on behavior and cognition in animals do not withstand repeated replication. We illustrate our major points with a statistical case study: a comparison of a large body of published results on motor impairment in mice with a mostly unpublished in-house set of studies from industry. This analysis graphically demonstrates how badly skewed the published record is toward positive results. We then identify two key sociological factors (conflicts of interest and the potential benefits of new drugs) and two key statistical factors (inadequate experimental planning and susceptibility to false positives), all of which contribute to a bias in the publication record toward positive results. Second, we discuss how tests of "cognition" in non-human animals have failed to translate into successful predictors of efficacy in humans for cognitive disorders.

1.1. Cosmic habituation

The problem of finding effective compounds for treating CNS disorders extends both to potential new drugs that fail in the clinic and to older, established compounds, already on the market, which fail retrospectively. At the end of 2010, an article by Jonah Lehrer in the New Yorker summarized a number of findings exemplifying this potentially worrisome trend of waning efficacy (Lehrer, 2010).
One of the examples detailed was a comparison of the antipsychotics Abilify, Seroquel, and Zyprexa, which had been deemed efficacious in their respective clinical trials but now showed very little efficacy. The same pattern was found in the reported efficacy of benzodiazepines over time, for which chronic efficacy needs to be weighed against the well-known cognitive and sedative side effects and addictive potential (Lader, 2008). This decrease in effectiveness seems to occur despite extensive clinical testing before FDA approval. How is this possible? It seems that newer clinical trials tend to show less and less efficacy for already marketed compounds. This loss of efficacy is what some have called "cosmic habituation" or the "decline effect": the apparent loss of an experimental effect that once seemed robust. Cosmic habituation is the idea that the universe adjusts to the experimental intervention of scientists, thereby reducing the size of an experimental effect in repeated testing. Though it lacks any potential mechanism and runs counter to basic principles of scientific inquiry, such as a commitment to materialism and testability, the idea has gained traction in some circles (Lehrer, 2010). Here, we try to debunk the fallacy of cosmic habituation by delineating some of the issues behind this difficulty in replicating results, including the various factors that drive publication bias. This investigation should allow us to improve the truthful dissemination of results and protect ourselves from illusions of scientific grandeur.

We argue that current publication practices have had far-reaching unintended consequences. For one, some apparently positive pre-clinical studies have led to premature clinical trials with drugs that had little therapeutic potential. In other cases, combined with the publication bias surrounding clinical trial results, these practices have contributed to the approval of minimally efficacious drugs with potentially dangerous side effects. The lack of replication of published findings, in both the preclinical and the clinical realm, has also resulted in some distrust of all things pharmacological and, in particular, of all things related to behavioral pharmacology. There is no question that animal models are useful, but how useful they are clearly depends on the question being asked and the type of validity that the model offers (see Section 4). In certain circles, using behavior, or even in vivo tests more generally, as a means to assess the potential functional benefit of a treatment is entirely disregarded. The lack of translatability between the preclinical and the clinical, combined with a weak economy, has had major consequences: the empty pipeline of big Pharma, the shutting down of many R&D departments, the layoff of thousands of employees and, possibly, a grim future for the development of safer and more effective new drugs in psychiatry. We will examine here some of the factors contributing to these developments.
1.2. Publication biases

The bias against the publication of negative effects (i.e., studies in which there was no effect of the experimental manipulation) has been extensively discussed in the literature and at scientific conferences, and luckily, as a consequence, new venues have opened up for the publication of negative findings (Dirnagl and Lauritzen, 2010; Kundoor and Ahmed, 2010; Gupta and Stopfer, 2011). The difficulty of publishing does not stop at non-significant results; it also extends to significant findings that run contrary to the consensus or dogma of the moment. When novel discoveries do find their way into a journal, they can create waves of research in specific scientific microcosms. In the ideal case, when the discovery receives sufficient support, the microcosm expands into a scientific consensus, even leading (though rarely) to paradigm change (Kuhn, 1996); when the evidence is robustly against the initial report, the microcosm implodes. Some microcosms, however, can linger, unchallenged, because of the bias against publications that show no effects, especially if the initial publication has aged and ceased to generate excitement (Ioannidis, 2006).

To better quantify these publication biases, we compared a large dataset on the effects of different compounds on motor performance from the published literature against a similar body of (unpublished) experiments collected in-house at PsychoGenics Inc. We reviewed 42 peer-reviewed research articles that tested R6/2 mice, a genetically modified mouse model of Huntington's disease, on the rotarod test, which evaluates motor ability. We included all R6/2 lines, including different CAG repeat lengths (the induced genetic anomaly in these mice), and examined findings from all ages on all variations of the rotarod test. As Huntington's disease is a progressive neurodegenerative disease, the motor deficits typically increase with age in both humans and the R6/2 mouse model. Most of these studies were gathered from the extensive review of Gil and Rego (2009). This exercise should be taken as an illustration and not as an attempt at meta-analysis, which the reporting practices described next would render futile. Many of the published studies reported only the results of pairwise comparisons (i.e., for different ages and rotarod speeds), in the absence of statistics about the overall differences; from these studies we selected the highest or "least significant" p value, irrespective of the directionality of the effect or the age at which the effect was found. For the experiments conducted at PsychoGenics Inc., in contrast, we used the simple-main-effect p values, and only when an overall statistically significant interaction of treatment with age was found. This approach to selecting the representative p values is deliberately statistically conservative.

Fig. 1A depicts how, compared to the unpublished dataset, the p values for the published studies were substantially smaller and skewed toward a beneficial effect of the experimental manipulations. These results graphically exemplify how, for a similar body of research, the published literature contains a disproportionate number of significant positive results, thereby unduly emphasizing studies that "worked" over studies that did not yield significant results.

On the flip side, we also explored the p values for a known, reliable positive effect: the standard antidepressant drug sertraline, which is used as a routine positive control in the preclinical rodent forced swim test. Fig. 1B plots the results of a comparison between the published studies with this drug (examining the highest dose in each study) and two dozen tests conducted at PsychoGenics Inc. In this case, the vast majority of the studies in both the published and unpublished datasets produced the expected positive results. The even greater robustness observed in the unpublished dataset is probably due to the very standardized protocol used in this laboratory, as compared to the variety of methods used in the published dataset (e.g., different pretreatment times, strains, and rodent species).
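To make this kind of contrast concrete, the short Python sketch below shows one way two sets of representative p values could be compared. It is only an illustration under our own assumptions: the arrays hold made-up placeholder values rather than the actual p values extracted from the 42 R6/2 rotarod articles or from the in-house studies, and the paper itself does not prescribe this particular test.

import numpy as np
from scipy import stats

# Hypothetical placeholder values; the real representative p values were
# extracted from the published R6/2 rotarod studies and the unpublished
# in-house studies as described in the text.
published = np.array([0.001, 0.004, 0.010, 0.020, 0.030, 0.040, 0.045])
unpublished = np.array([0.03, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95])

# Two-sample Kolmogorov-Smirnov test: do the two sets of p values come from
# the same underlying distribution?
ks_stat, ks_p = stats.ks_2samp(published, unpublished)
print(f"KS statistic = {ks_stat:.2f}, p = {ks_p:.3f}")

# Fraction of studies reaching the conventional 0.05 threshold in each set.
for name, pvals in (("published", published), ("unpublished", unpublished)):
    print(f"{name}: {np.mean(pvals < 0.05):.0%} of studies with p < 0.05")

A skew like the one in Fig. 1A would show up here as a small KS p value together with a much larger fraction of sub-0.05 results in the published set.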
2. Sociological factors

2.1. Conflict of interest
One key reason for these publication biases stems from the incentive structure of modern science, in both academia and industry. There is a significant incentive to publish positive results, which runs counter to a full, balanced accounting of all experiments undertaken and can produce a conflict of interest for scientists. Fig. 2 schematically illustrates this pressure to highlight and focus on statistically significant, positive results at all costs. Though scientists are often asked to declare any conflict, or appearance of conflict, of interest when publishing a paper, this declaration normally covers financial interests or overt scientific competition with another investigator. The other conflicts that arise from the incentive structure of science, however, are rampant in the competitive environment of academia and industry.
Fig. 1. (A) Distribution of p values for studies assessing the effects of various experimental drugs in the R6/2 mouse model of Huntington’s disease for published and unpublished datasets. (B) Distribution of p values for published (highest doses) and unpublished forced swim test studies in rodents, using the standard antidepressant and reference drug sertraline.
In academia in particular, until one has obtained tenure (and even afterward), there is enormous pressure to publish new and exciting stories that attract students to work in the lab, generate requests to collaborate with other teams, trigger invitations to present results, and enable success in the grant system, which further feeds discovery and recognition. The situation in industry is not very different, as there is pressure on department heads and project leaders to show that successful projects are generated and advanced from early discovery toward the clinic. Career advancement often depends on the number of milestones reached or the number of publications per year. Company profits, however, depend on having efficacious drugs on the market that do not have serious side effects. Therefore, moving undeserving projects up the development ladder is beneficial for project directors in the short term, but detrimental for the company, stakeholders, managers, and society at large in the longer term (Markou et al., 2008).

2.2. Better drugs for a better future

In addition to the above career-driven factors, there is pressure from the patient community to bring better compounds to the market. All these factors combined may bias the first clinical drug studies leading to FDA approval, yielding results that do not fare well under subsequent scrutiny by agencies other than the sponsor. Discrepancies between the first published results and later drug-to-drug comparisons can thus also contribute to the apparent reduction in the size of the treatment effect. Despite initially being hailed as better and safer drugs, several studies have shown, for example, that the advantages of second-generation antipsychotics (e.g., clozapine, risperidone) over first-generation antipsychotics (e.g., chlorpromazine, haloperidol) are often small or limited to secondary outcomes (Sikich et al., 2008).
Fig. 2. The t statistic distribution, conflict of interest, and publication bias in science.
Second-generation drugs do not seem to be more efficacious than first-generation drugs, even for negative symptoms, nor better across the board with respect to major side effects than low doses of first-generation antipsychotics. Of additional concern is the finding that many studies comparing antipsychotics tend to favor the antipsychotic marketed by the sponsor of the study (Heres et al., 2006). Other factors contributing to misleading publications include small sample sizes, short treatment durations, few comparisons with reference drugs, and comparisons with a relatively high dose of a first-generation antipsychotic (Keefe et al., 2007). Although these examples refer specifically to the bias in the literature on antipsychotic treatment, similar situations exist for antidepressants and other CNS drugs (Trikalinos et al., 2004; Kirsch et al., 2008; Turner et al., 2008).
3. Statistical issues

3.1. Lack of experimental and statistical design prior to experimentation and replication

For approval of a new drug through clinical trials, thanks to the close regulation exerted by government agencies, both the full experimental design and all endpoint measures to be used as proof of efficacy must be specified a priori. This controlled experimental environment stands in sharp contrast with most research in basic science, as well as preclinical animal research, where there is no requirement to nominate primary endpoint measures or to specify other design details. There is therefore a risk that exploration of the various endpoint measures available to the experimenter may reveal an effect that was not particularly expected. For exploratory research, this lack of pre-specification may be a virtue and lead to new and exciting findings, but the flexibility comes at a cost. When analyses and endpoints are flexible, the chance that a significant difference between experimental groups is due solely to chance is larger (the experiment-wise alpha exceeds the nominal value of 0.05), especially when sample and effect sizes are small (Ioannidis, 2005). The same limitation applies to the statistical analysis of the data if different statistical approaches are attempted and only the approach that leads to a significant result is used and reported. In some cases, experimental subjects are added until statistical significance is achieved, and if only significance in the expected direction is pursued, then it is clear that these practices inflate the odds of publishing false-positive results.
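The inflation is easy to see in a simulation. The Python sketch below is our own illustration, not an analysis from the paper: it assumes a true null throughout, ten independent endpoints per experiment, and optional stopping at a few arbitrary group sizes, and it counts how often at least one endpoint crosses p < 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_EXPERIMENTS = 2000
N_ENDPOINTS = 10                      # outcome measures recorded per animal
CHECKPOINTS = (10, 15, 20, 25, 30)    # group sizes at which significance is checked

def false_positive_rate(multiple_endpoints, optional_stopping):
    """Fraction of null experiments declared 'significant' under flexible analysis."""
    k = N_ENDPOINTS if multiple_endpoints else 1
    checkpoints = CHECKPOINTS if optional_stopping else CHECKPOINTS[:1]
    hits = 0
    for _ in range(N_EXPERIMENTS):
        # Both groups come from the same distribution, so any "effect" is spurious.
        control = rng.normal(size=(CHECKPOINTS[-1], k))
        treated = rng.normal(size=(CHECKPOINTS[-1], k))
        for n in checkpoints:
            pvals = stats.ttest_ind(control[:n], treated[:n], axis=0).pvalue
            if np.any(pvals < 0.05):  # some endpoint "works": stop and report it
                hits += 1
                break
    return hits / N_EXPERIMENTS

print("one endpoint, fixed n:           ", false_positive_rate(False, False))
print("ten endpoints, fixed n:          ", false_positive_rate(True, False))
print("ten endpoints, optional stopping:", false_positive_rate(True, True))

With these arbitrary settings, the nominal 5% error rate climbs to roughly 40% with ten independent endpoints (1 - 0.95^10) and rises further once optional stopping is added.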
Fig. 3. Prior knowledge and the updating effect of experience. The Bayesian view is that prior expectations should be updated with new data. Psychologically speaking, however, strong contrasts between expectations and experience may lead to irrational reassessments (an interesting frustration, or surprise, effect that is beyond the scope of this paper).
Some of these mistakes of over-optimism could be corrected if results were replicated prior to publication. But, of course, much research is very expensive and time consuming, so replication in the originating lab is often not feasible. In addition, given the sociological pressures described above, there is little incentive for scientists to replicate a positive result prior to publication; if anything, there is a strong disincentive to do so.

3.2. Susceptibility to false positives

Recently, an article in the New York Times highlighted an ongoing discussion among statisticians (Carey, 2011). Some statisticians have argued that the classical methods used by social scientists, neuroscientists, and psychologists (methods based on assigning p values to observed differences, i.e., "significance testing") produce a bias toward false positives. They suggest that experimenters should calculate the probability of finding exactly the observed result due to chance, instead of the probability of finding an effect at least this large, or, alternatively, rely on a more conservative threshold. Others have taken an even stronger stance and contended that all empirical research should use only Bayesian statistics (Goodman, 1999), leading to a long, still active debate.

Classical statistics honors "the urge to know how sure you are after looking at the data, while outlawing the question of how sure you were before" (Edwards et al., 1963). It explicitly denies all prior knowledge a role in the statistical decision process. Bayesian inference, in contrast, makes use of all the information available to the decision-maker, including any prior information (e.g., analytical considerations) or possibly prior odds (see footnote 1).
Fig. 3 depicts an example where prior expectations are updated with experience using a Bayesian approach (bottom left panel), in contrast to a possible frustration effect that could result from reality being much worse than prior expectations (bottom right panel). The latter, of course, is inspired by a known psychological phenomenon and is not a mathematical approach. For a mathematical example of the Bayesian approach, see Appendix A.

In classical statistics, two different techniques are used to assess whether the null hypothesis can be rejected (hypothesis testing with p values) and whether the alternative hypothesis has been given a fair chance (power analysis). In contrast, Bayesian inference requires an explicit, quantitative formulation of two (or more) alternatives, plus a way to decide which hypothesis is favored by the data at hand. Both alternatives are evaluated in the light of the data, and the alternative with the higher likelihood is chosen. One advantage of the Bayesian approach is that the measure of the strength of the evidence delivered by the Bayesian analysis (the Bayes factor) can be, but need not be, mathematically integrated with the prior odds (i.e., the relative likelihood of two hypotheses before one considers the implications of some new data) to give a ratio that incorporates prior knowledge.
1 An interesting example of prior knowledge, the important concept of "personal probabilities" in Bayesian inference, is given by Salsburg (2001). In the Three Mile Island nuclear near-disaster, operators blatantly ignored alarm lights that showed imminent problems because those same alarm lights had been faulty in the past. This prior knowledge of false alarms made the operators insensitive to the actual data being presented. All scientists are influenced by such priors: we do not even test hypotheses that we consider extremely unlikely in view of our prior knowledge. We only test hypotheses for which uncertainty about their truth remains.
Furthermore, integration with a pay-off matrix (benefits and costs) can even give the ratio of the expected values of deciding one way or another, taking into account the cost-benefit trade-off for different types of errors (misses or false positives). Note that the p values from classical methods cannot readily be used in this way, as they do not measure the relative likelihood of the two alternatives in light of the data. The exact formulation of the prior odds, however, is a research subject unto itself, as quantifying the prior odds can be difficult when the research concerns a novel question (either clinical or preclinical) and there is thus no public consensus about the relative likelihood of different hypotheses prior to the test. In such cases, the solution can be to "ignore" the prior odds and consider only the Bayes factor.

One problem that appears to hold for both the classical and Bayesian approaches is the requirement that the robustness of the results be subjectively evaluated. Whereas the alpha level is a simple a priori threshold for significance, there is nothing sacred about this threshold, which can be adjusted when necessary. In fact, Fisher himself, after developing the p-value approach, objected to Neyman's proposal to fix the probability of a false positive and favored adjusting the use of p values according to the circumstances (Salsburg, 2001). What counts as robust for confidence intervals or for Bayesian statistics is similarly a matter of choice. For example, if the odds that a drug had an effect are estimated to be 4:1, one may not be convinced that the effect is "robust". But what about 6:1? 12:1? A criterion is needed here as well, and this criterion is what makes the test more or less conservative. Ideas for such criteria for Bayesian statistics have been developed in several papers (Jeffreys, 1961; Gallistel, 2009). The decision criterion can be manipulated to make the testing more or less conservative (conservative for efficacy, liberal for safety), according to the relative risks of errors (being wrong when we call an inactive drug efficacious, or an active drug inactive, respectively). Consider the question "Is the drug efficacious?" Here, false positives are very expensive, as they lead us down the wrong path and can cause lost opportunities to do better, as discussed earlier. Using a stricter criterion, requiring replication, or switching to more conservative statistics is appropriate. Now consider the question "Is the drug safe?" Here, a liberal criterion is acceptable and should be preferred, as this puts us in a risk-averse position: concerning safety, a false positive is safer than a false negative.

To demonstrate the relationship between the Bayesian and classical approaches, Fig. 4 plots the p values and corresponding odds ratios (Jeffreys, 1961; Rouder et al., 2009) from the unpublished set of forced-swim studies. In this dataset, the discrepancies between the two analyses were minimal, with only 1 out of 24 studies being "significant" according to the p value but "anecdotal" for the odds ratio. In general, evidence supported by p values between 0.05 and 0.01 corresponds to "anecdotal" evidence in terms of the odds ratio, whereas evidence supported by p values less than 0.01 ranges from "substantial" to "decisive" odds ratios.
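For readers who want to see the two outputs side by side, the sketch below converts a single two-group comparison into both a p value and an approximate Bayes factor. It is our own illustration under explicit assumptions: the data are made-up placeholder scores, the Bayes factor uses the rough BIC-based approximation BF01 ~ sqrt(N)(1 + t^2/df)^(-N/2) rather than the Jeffreys/Rouder odds ratios plotted in Fig. 4, and the verbal labels follow commonly used Jeffreys-style cutoffs.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical immobility scores for a vehicle group and a sertraline group.
vehicle = rng.normal(loc=100.0, scale=15.0, size=12)
sertraline = rng.normal(loc=80.0, scale=15.0, size=12)

t, p = stats.ttest_ind(vehicle, sertraline)
N = len(vehicle) + len(sertraline)
df = N - 2

# Rough BIC approximation to the Bayes factor for "no difference" (BF01);
# its reciprocal (BF10) quantifies the evidence for a drug effect.
bf01 = np.sqrt(N) * (1.0 + t**2 / df) ** (-N / 2.0)
bf10 = 1.0 / bf01

# Jeffreys-style verbal labels for BF10 (values below 1 favor the null).
labels = ((3, "anecdotal"), (10, "substantial"), (30, "strong"),
          (100, "very strong"), (float("inf"), "decisive"))
label = next(name for cutoff, name in labels if bf10 < cutoff)

print(f"t({df}) = {t:.2f}, p = {p:.4f}")
print(f"approximate BF10 = {bf10:.1f} ({label} evidence for an effect)")

Run with different placeholder means, such a sketch reproduces the qualitative pattern described in the text: p values near 0.05 tend to map onto merely anecdotal odds, whereas p values below 0.01 usually correspond to substantial or stronger evidence.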
Following Neyman's tradition, some journals require only a categorical statement of whether p is less than the set value of alpha. Other levels, however, can be and generally are reported, namely 0.01, 0.001, and 0.0001, showing that many authors consider those p values more robust than 0.05, a view more in line with Fisher's own (Salsburg, 2001). This increase in perceived significance is noted in the figure with "very" significant and "extremely" significant labels. The consistency in the sertraline dataset is due to the robustness of the effects of this drug (considered a true positive, as it is widely used in the clinic). In general, however, using a p-value criterion of less than 0.01 would also ensure consistency between the two methods for most studies (see also Rouder et al., 2009). This consistency suggests that the large number of positive significant findings in the published R6/2 literature reviewed above (see Fig. 1A) was not solely due to the use of more liberal p values, as the majority of those studies had p values less than 0.01.
Fig. 4. A comparison of p values against the odds ratio (Jeffreys, 1961; Rouder et al., 2009) for the unpublished set of forced swim test studies, using the common antidepressant reference drug sertraline. The output of these two statistical methods is non-linearly related (see also Wetzels et al., 2011).
As mentioned earlier, the risk of false positives increases when studies use multiple endpoints aimed at measuring the same phenomenon (e.g., "health" in a mutant mouse model). One way to assess whether the obtained p values, as a group, support rejecting the null hypothesis is to check whether they are uniformly distributed. Because p values must be uniformly distributed between 0 and 1 when the null hypothesis is true (but not when it is false), a deviation from this distribution in the expected direction supports the view that the collection of endpoints and corresponding p values is not a set of false positives (see also the related false discovery rate analysis in Benjamini et al., 2001). Simply stated, if 10 measures of health are taken, and 5 of them are "significant" (p < .05) and the other 5 are so-called "marginally" significant (.05 < p < .07), then the evidence is overwhelmingly in favor of rejecting the null hypothesis, because this set of 10 p values is clearly not uniformly distributed between 0 and 1. A meta-p value can even be assigned to the probability of erroneously concluding that this set of p values is not uniformly distributed (see footnote 2). Problems arise only when significant findings are selectively presented while the remaining results are not.
2 And here we go again collecting p values... Some may say that it is incorrect to assign p values to other p values, but here we are assigning a p value to the hypothesis that the observed set of p values (which could be any other quantity) is not uniform, when in fact it is uniform because the null was true. To our knowledge, this method for correcting for multiple comparisons is novel and distinct from other multiple-comparison correction analyses.
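As a concrete illustration of this uniformity check, the sketch below runs a one-sample Kolmogorov-Smirnov test of a set of p values against the uniform distribution on (0, 1), which yields a meta-p value of the kind described in the footnote, and, for comparison, combines the same p values with Fisher's classical method. The ten values are the hypothetical "health battery" example from the text; the choice of the KS test and of Fisher's method is ours, not the authors'.

from scipy import stats

# Ten hypothetical endpoint p values: five "significant" and five "marginal".
p_values = [0.01, 0.02, 0.03, 0.04, 0.045, 0.052, 0.055, 0.060, 0.065, 0.07]

# One-sample KS test against the uniform(0, 1) distribution expected under the null.
ks_stat, meta_p = stats.kstest(p_values, "uniform")
print(f"KS statistic = {ks_stat:.2f}, meta-p = {meta_p:.2e}")

# Fisher's method is a classical way to combine p values:
# -2 * sum(log p_i) follows a chi-square distribution with 2k df under the null.
fisher_stat, fisher_p = stats.combine_pvalues(p_values, method="fisher")
print(f"Fisher chi-square = {fisher_stat:.1f}, combined p = {fisher_p:.2e}")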
4. Translational factors

4.1. Animal models of psychiatric and neurodegenerative disorders

Is the experimental model valid for the disease processes in question? How does the performance of rodents in a given task relate to the performance of humans on a disease-relevant dimension? Can the behavioral endpoint detect disease onset and track its progression, remission, and relapse (Day et al., 2008)? Answers to such questions about the validity of experimental models should ideally be considered in the light of convergent evidence from both comparative psychology and neuroscience. Comparative psychology emphasizes the importance of understanding the behavior, its goal, and the context of its instantiation (e.g., aggression vs. predation), which can be particularly useful when different neural circuits might underlie similar behavior. Neurobiological evidence, in turn, can be very informative about the functional relationships between different behaviors, or aspects of a given behavior, that are not apparent from purely behavioral observations. Reliance on convergent evidence provides the methodological and theoretical framework for a more comprehensive understanding of behavioral processes and their neural substrates.

This integrative approach contrasts fundamentally with the superficial conception and study of behavior sometimes seen in neuroscience, which has hampered much translational research. It emphasizes the need not only for predictive validity, but also for construct and etiological validity in animal models used in translational research. Face validity (the use of superficial similarities between the target and the modeled phenomenon) should play less of a role in translational research, apart from the very obvious face validity which indicates, for example, that to study cognition one should not measure body temperature but rather changes in behavior in response to experimental contingencies. Furthermore, this approach contends that pharmacological isomorphism, or predictive validity in the pharmacological realm, is only of value when accompanied by construct validity.

The project of establishing this construct and etiological validity in animal models will require the joint efforts of comparative psychologists and behavioral neuroscientists. Both construct and etiological validity need to be carefully defined for each particular animal model or disorder. One approach would be to model a symptom of a disorder, assuming that focusing on the unit that contributes to a disorder would help achieve robustness in the translation. Symptoms in human disorders, however, are often defined with reference to processes, calling for a deeper definition. A homology between the behavioral processes of an animal model and the human patient population can be established through parallel experimentation or through study of the underlying laws, phenomena, and circuits. Further, sensitivity to pathology, which can be included under etiological validity, is of fundamental importance for animal models of psychiatric and neurodegenerative disorders (Geyer and Markou, 1995). Etiological validity can often be established through pharmacological insult (i.e., a treatment that recreates an abnormal neuronal or neurotransmitter deficit known to be causative in the disorder) or genetic manipulation (when the genetic link or risk factor is known).

Whenever possible, the principle of neurobiological correspondence between humans and animal models should be augmented by evidence that human and animal behavioral processes in a given task serve similar functions, are underlain by similar dynamics, and follow the same laws (e.g., Weber's Law). To that end, psychophysical and neurophysiological characterization of the underlying processes, as well as computational models with clear quantitative predictions about behavioral performance, may prove instrumental. These issues point to the importance of following basic neuroscientific and psychological research closely and of using this information to address questions in translational research.
To avoid the risk of finding results that rely heavily on species-specific mechanisms, it is standard in classic drug discovery to use at least two different species to prove both efficacy and toxicity. In certain areas, however, such as efficacy testing with animal models of particular disorders, this approach is not possible, as often only murine models are available (although other species, including rats, monkeys, and even sheep, are becoming more common).

This integrative approach will enable an understanding of information processing at the three hierarchical levels of analysis proposed by Marr (1982): computational theory (the goal of the computation), the representation or algorithm for the transformation, and the implementation or physical realization.
Building on Marr's hierarchical framework, the integrative approach to translational research that we promote favors, and benefits from, an informative interaction between the different levels of analysis. This interactive framework constrains the problem by describing behavior at multiple levels and thus makes it more tractable.

4.2. Preclinical cognitive assessment and clinical translation failure

The failure to bring new drugs to the clinic has been particularly pronounced for cognitive enhancers. Antagonists of the type-3 serotonin receptor, for example, were shown to enhance cognition in both rats and non-human primates. Despite these very promising preclinical results, a subsequent clinical trial failed to find any significant effects on cognitive endpoints in patients with Alzheimer's disease (Dysken et al., 2002). In addition to the inherent difficulty of translating processes across species as diverse as rodents and humans, translation to the clinic for cognitive disorders is hindered by the common strategy of demonstrating beneficial behavioral effects in preclinical tests with little or no attention to the underlying psychological mechanisms. Little attention is paid not only to which particular cognitive domain is being assessed, but also to whether other processes, such as motivation, arousal, or behavioral inhibition, can explain the results. Although a warning against such lack of attention was published long ago (Sarter et al., 1992), we do not seem to do much better nowadays, as exemplified by the near-universal use of tests such as novel object recognition or the water maze, independent of the specific disorder or cognitive process being targeted. The time is ripe for a new breed of standardized behavioral tests of cognition in non-human animals that target particular cognitive processes and embody the key virtues of predictive, construct, and etiological validity discussed in Section 4.1.

Another related issue has preclinical methodological implications: behavioral experiments typically require standard procedures (e.g., handling of subjects and housing) to be completed prior to the critical test, and these necessary procedures make animal behavior particularly vulnerable to cross-experimenter and cross-laboratory variability in implementing even very basic protocols. Even rigorous standardization does not always eliminate systematic differences in behavior across laboratories (Crabbe et al., 1999). The issue of variability, of course, is most relevant for phenomena that are less robust, and thus the question should be raised concerning the translatability of such non-robust phenomena. The serious issue here is the significant variability in effect size with changes in protocol detail, not so much the changes in baseline rates or absolute endpoint values. This problem points to the need for new, automated approaches to behavioral testing in translational research, rather than to a fundamental flaw of animal testing (Brunner et al., 2002; Gallistel et al., 2010). These approaches to limiting the role of humans in the running and analysis of experiments through automation may include advances in self-contained living quarters for animals, computer- or robot-controlled experimental apparatus, and specialized software for parsing video and other output data.
These methodological undertakings might be instrumental in limiting across-lab variability in the implementation of procedures and might enable different research groups to more closely replicate the same experimental conditions. Note that standardization of testing and flexibility to adjust to the experimental question at hand are orthogonal, rather than paradoxical, necessities. Flexibility is needed at the level of test design, so that we can have a battery of tests that target specific cognitive processes and translate well from animals to humans.
Standardization is needed so that those tests are replicable from lab to lab and within the same lab over time.

5. Consequences of false-positive publication

One unintended consequence of the bias in the preclinical literature is the pressure on medical researchers to proceed with clinical trials that have little justification other than a putative hypothesis and an as-yet-to-be-replicated finding. In Alzheimer's disease, for example, it could be tempting to explore inhibitors of β-amyloid deposition or aggregation, growth factors, tau phosphorylation inhibitors, secretase inhibitors, anti-inflammatory drugs, HDAC inhibitors, and many others (Choi, 2002), but the cost of such trials would be stratospheric, forcing a strong prioritization of the different targets to maximize the chance of success. Many times, however, these clinical trials move ahead, and either no positive results are found or even deleterious effects are registered. These failures carry not only economic costs (wasted money) and opportunity costs (lost chances to do better), but may even constitute failures against humanity. Nobody should underestimate the harm done to patients and their families who enroll in and follow clinical protocols only to see their hopes vanish. This human cost is unjustified and even unethical, as a more comprehensive and orderly dissemination of preclinical results could have discouraged clinical researchers from pursuing another ineffective, or even detrimental, experimental treatment. The ethical principle of "equipoise" states that all treatments tested must be equal in their likelihood of success; otherwise, those assigned to the least promising treatment would have been treated unethically (Heemskerk et al., 2002). The biases in publication create only the illusion of equipoise, so those who receive drugs with little hope of success are treated unfairly. If we had known better, we would never have subjected humans to such experiments in futility.
6. Summary and conclusion

Ever since Herbert Spencer publicly put forward the idea that humans' mental functioning may have arisen from that of primordial ancestors (Spencer, 1855), a proposal famously echoed by Darwin (1871), comparative psychologists have argued about how best to further knowledge of the principles underlying animal behavior. Excessive zeal in the application of Morgan's canon of parsimony (Morgan, 1894) and the influence of radical behaviorism left cognition out of comparative psychology. Tolman's efforts to simplify the complex network of observed relations between stimuli, context, and behavior with his "intervening variables" (Tolman, 1936) started the path away from the limited realm of behaviorism toward a more productive cognitive psychology, culminating with the arrival of the cognitive revolution (Wasserman, 1981). Extension of Tolman's intervening variables into the more current "hypothetical constructs" stimulated further investigation aimed at explaining observed behavior by reference to hidden phenomena such as cognitive or computational processes, physiological states, or the workings of brain circuitry. We are now left with a multiplicity of constructs, some drawing on theoretical efforts to operationalize behavior (e.g., expectancy, discrimination, choice) and others borrowed from information-processing approaches (e.g., working memory, divided attention).

Perhaps the greatest promise for the delivery of robust translational constructs lies in those areas where non-invasive techniques for understanding brain connectivity and function (e.g., diffusion tensor imaging, transcranial magnetic stimulation, functional neuroimaging) are being developed, alongside other physiological and pharmacological tools, biomarkers, and sophisticated analysis methods (e.g., systems biology). These tools can be used to probe both the human and non-human central nervous systems to provide a mapping of the necessary homologies. The new term endophenotypes captures those hypothetical constructs (neurophysiological, biochemical, endocrine, neuroanatomical, cognitive, or neuropsychological) that can be used as units of study in neuropathology (Gould and Gottesman, 2006). The value of these endophenotypes is twofold: to link the functional readouts to a molecular mechanism that is causal to the disease or lies downstream from pathophysiological causes, and to provide a readout of the engagement of the drug target by the candidate drug. Endophenotypes need to be validated in these two aspects to be truly useful. The integration of these non-invasive techniques with behavioral analysis, the adoption of suitable endophenotypical translational constructs, and the proper use of novel animal models of disease mark the way forward for a new century of discovery and, hopefully, more robust translation to the clinic. Although many lead the way (Dunnett, 1990; Robbins, 1990; Weinberger, 2001), a true marriage of comparative cognition and behavioral neuroscience has yet to be consummated.

Acknowledgements

We are grateful to David Howland, CHDI, for allowing us to publish the statistical results from the R6/2 studies; to Carol Murphy and Liliana Menalled for conducting the R6/2 studies; to Taleen Hanania and Barbara Caldarone for conducting the forced swim studies; to Charles R. Gallistel for his valuable input about Bayesian statistics; and to Berivan Ece Usta and Ercenur Unal for their help with the literature review. Finally, we are grateful to Alex Kacelnik, who taught us, directly or through his work, not to be afraid of asking the right question.

Appendix A.

Bayes' rule establishes that:
p(A|B) p(B) = p(B|A) p(A)    (1)
We can exemplify Bayesian inference through a simple thought experiment: Imagine trying to decide whether a coin randomly picked from a pile is fair or biased (because it has two heads, in this example). Let us assume that, before observing any tosses, we think there is a 95% chance that the coin is fair (p(fair) = 0.95) and only a 5% chance that the coin is biased (p(biased) = 0.05). We then toss the coin three times and observe three heads: What is now our posterior confidence that the coin is fair? Bayes' rule, after rearrangement, spells out exactly how confident we should be that the coin is fair, based on the observed tosses and our prior confidence:

p(fair|head) = p(head|fair) p(fair) / [p(head|fair) p(fair) + p(head|biased) p(biased)]    (2)
Using this equation, and given the additional information that p(head|fair) = 0.5 and p(head|biased) = 1, after the first toss we should be only 0.90 confident that the coin is fair: p(fair|head) = 0.5 × 0.95/(0.5 × 0.95 + 1 × 0.05) = 0.9048, down from 0.95 at the outset. After the first toss, we are thus bound to update our belief regarding the fairness of the coin, lowering p(fair) to 0.9048. When the second toss also turns up heads, we should be only 0.83 confident that the coin is fair: p(fair|head) = 0.5 × 0.9048/(0.5 × 0.9048 + 1 × 0.0952) = 0.8262. Similarly, after the third toss also shows heads, we should be only 0.70 confident that the coin is fair: p(fair|head) = 0.5 × 0.8262/(0.5 × 0.8262 + 1 × 0.1739) = 0.7038. The same updating process would apply after each subsequent coin toss to determine the posterior confidence in the hypothesis that the coin is fair.
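The same sequential updating can be written out in a few lines of code. The sketch below is simply a transcription of this thought experiment (the function and variable names are ours); the final posterior odds reproduce the 2.38:1 value derived next, and small differences in the last decimal of the intermediate values reflect rounding in the hand calculations above.

def update_p_fair(p_fair, p_head_fair=0.5, p_head_biased=1.0):
    """Posterior probability that the coin is fair after observing one head (Eq. 2)."""
    return (p_head_fair * p_fair) / (p_head_fair * p_fair + p_head_biased * (1.0 - p_fair))

p_fair = 0.95                            # prior belief that the coin is fair
print(f"prior: p(fair) = {p_fair:.4f}, odds fair:biased = {p_fair / (1 - p_fair):.0f}:1")
for toss in range(1, 4):                 # three tosses, all of them heads
    p_fair = update_p_fair(p_fair)
    print(f"after head {toss}: p(fair) = {p_fair:.4f}")

posterior_odds = p_fair / (1 - p_fair)   # odds of fair versus biased, as in Eqs. (3)-(4)
print(f"posterior odds fair:biased = {posterior_odds:.2f}:1")   # approximately 2.38:1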
For the odds ratio, we want to compare the odds of the coin being fair versus the odds of the coin being biased after the three heads have been observed. The odds ratio is:

p(fair|head) : p(biased|head)    (3)

Substituting (2) into each term, the common denominator p(head|fair) p(fair) + p(head|biased) p(biased) cancels, which simplifies the ratio to:

p(head|fair) p(fair) : p(head|biased) p(biased) = p(head|fair) p(fair) : p(head|biased) (1 − p(fair))    (4)

where p(fair) here is the value already updated after the first two heads (0.8262), so that the third head enters through the likelihoods,
letting us finally calculate the odds ratio for the coin being fair as 0.5 × 0.8262 : 1 × (1 − 0.8262) = 0.4131 : 0.1738, or 2.38:1. Whereas before the experiment started the odds were 0.95:0.05 (19:1), after three heads the odds of the coin being fair versus biased become 2.38:1 (i.e., our confidence that the coin is fair has been considerably eroded). If we had started the thought experiment believing that there was only a 50% chance that the coin was fair (p(fair) = 0.5) instead of 95%, then after observing three heads we should be only 0.11 confident that the coin is fair: p(fair|heads) = 0.5 × 0.125/(0.5 × 0.125 + 0.5 × 1) = 0.11. In this case, after three tosses, the odds of the coin being fair to being biased would be about 1:8 (i.e., about 8:1 in favor of the coin being biased). Our prior belief about the fairness of the coin thus filters through the updating process and influences our confidence about the fairness of the coin even after the observations have been taken into account. In either case, however, the more heads we observe, the higher our confidence will be that the coin is biased.

This example used only point probabilities. Bayesian inference often uses probability distributions for parameters; for instance, the prior probability distribution of a parameter can be centered at some value, uniform over a given interval, and so on. The shape of this prior distribution reflects our knowledge and uncertainty about the parameters before seeing the data. The posterior distribution reflects our updated knowledge and uncertainty about the parameters after seeing the data.

References

Benjamini, Y., Drai, D., et al., 2001. Controlling the false discovery rate in behavior genetics research. Behav. Brain Res. 125 (1–2), 279–284.
Brunner, D., Nestler, E., et al., 2002. In need of high-throughput behavioral systems. Drug Discov. Today 7 (18 Suppl.), S107–S112.
Carey, B., 2011. You might already know this. The New York Times.
Choi, D.W., 2002. Exploratory clinical testing of neuroscience drugs. Nat. Neurosci. 5 (Suppl.), 1023–1025.
Crabbe, J.C., Wahlsten, D., et al., 1999. Genetics of mouse behavior: interactions with laboratory environment. Science 284 (5420), 1670–1672.
Dandapani, S., Marcaurelle, L.A., 2010. Grand challenge commentary: accessing new chemical space for 'undruggable' targets. Nat. Chem. Biol. 6 (12), 861–863.
Darwin, C., 1871. The Descent of Man and Selection in Relation to Sex, vol. 1.
Day, M., Balci, F., et al., 2008. Cognitive endpoints as disease biomarkers: optimizing the congruency of preclinical models to the clinic. Curr. Opin. Investig. Drugs 9 (7), 696–706.
Dirnagl, U., Lauritzen, M., 2010. Fighting publication bias: introducing the Negative Results section. J. Cereb. Blood Flow Metab. 30 (7), 1263–1264.
Dunnett, S.B., 1990. Role of prefrontal cortex and striatal output systems in short-term memory deficits associated with ageing, basal forebrain lesions, and cholinergic-rich grafts. Can. J. Psychol./Rev. Can. Psychol. 44 (2), 210.
Dysken, M., Kuskowski, M., et al., 2002. Ondansetron in the treatment of cognitive decline in Alzheimer dementia. Am. J. Geriatr. Psychiatry 10 (2), 212–215.
Edwards, W., Lindman, H., et al., 1963. Bayesian statistical inference for psychological research. Psychol. Rev. 70 (3), 193.
Gallistel, C.R., 2009. The importance of proving the null. Psychol. Rev. 116 (2), 439–453.
Gallistel, C.R., King, A.P., et al., 2010. Screening for learning and memory mutations: a new approach. Xin Li Xue Bao 42 (1), 138–158.
Geyer, M.A., Markou, A., 1995. Animal models of psychiatric disorders. In: Bloom, F., Kupfer, D. (Eds.), Psychopharmacology: The Fourth Generation of Progress. Raven Press, New York, pp. 787–798.
Gil, J.M., Rego, A.C., 2009. The R6 lines of transgenic mice: a model for screening new therapies for Huntington's disease. Brain Res. Rev. 59 (2), 410–431.
Goodman, S.N., 1999. Toward evidence-based medical statistics. 1: The P value fallacy. Ann. Intern. Med. 130 (12), 995.
Gould, T., Gottesman, I., 2006. Psychiatric endophenotypes and the development of valid animal models. Genes Brain Behav. 5 (2), 113–119.
Gupta, N., Stopfer, M., 2011. Negative results need airing too. Nature 470 (7332), 39.
Heemskerk, J., Tobin, A.J., et al., 2002. From chemical to drug: neurodegeneration drug screening and the ethics of clinical trials. Nat. Neurosci. 5 (Suppl.), 1027–1029.
Heres, S., Davis, J., et al., 2006. Why olanzapine beats risperidone, risperidone beats quetiapine, and quetiapine beats olanzapine: an exploratory analysis of head-to-head comparison studies of second-generation antipsychotics. Am. J. Psychiatry 163 (2), 185.
Ioannidis, J.P.A., 2005. Why most published research findings are false. PLoS Med. 2 (8), e124.
Ioannidis, J.P.A., 2006. Evolution and translation of research findings: from bench to where. PLoS Hub Clin. Trials 1 (7), e36.
Jeffreys, H., 1961. The Theory of Probability. Oxford.
Keefe, R.S., Bilder, R.M., et al., 2007. Neurocognitive effects of antipsychotic medications in patients with chronic schizophrenia in the CATIE Trial. Arch. Gen. Psychiatry 64 (6), 633–647.
Kirsch, I., Deacon, B.J., et al., 2008. Initial severity and antidepressant benefits: a meta-analysis of data submitted to the Food and Drug Administration. PLoS Med. 5 (2), e45.
Kissinger, P.T., 2011. Too big to succeed, too entrenched to listen. Drug Discov. News 7.
Kuhn, T.S., 1996. The Structure of Scientific Revolutions. University of Chicago Press.
Kundoor, V., Ahmed, M.K., 2010. Uncovering negative results: introducing an open access journal "Journal of Pharmaceutical Negative Results". Pharmacogn. Mag. 6 (24), 345–347.
Lader, M., 2008. Effectiveness of benzodiazepines: do they work or not? Expert Rev. Neurother. 8 (8), 1189–1191.
Lehrer, J., 2010. The truth wears off: is there something wrong with the scientific method? The New Yorker.
Markou, A., Chiamulera, C., et al., 2008. Removing obstacles in neuroscience drug discovery: the future path for animal models. Neuropsychopharmacology 34 (1), 74–89.
Marr, D., 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, San Francisco.
Morgan, C., 1894. An Introduction to Comparative Psychology. London: Scott, 1909 (original date of publication).
Robbins, T.W., 1990. The case for frontostriatal dysfunction in schizophrenia. Schizophr. Bull. 16 (3), 391.
Rouder, J.N., Speckman, P.L., et al., 2009. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16 (2), 225–237.
Salsburg, D., 2001. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. W.H. Freeman, New York.
Sarter, M., Hagan, J., et al., 1992. Behavioral screening for cognition enhancers: from indiscriminate to valid testing: Part I. Psychopharmacology (Berl.) 107 (2–3), 144–159.
Shettleworth, S.J., 2009. The evolution of comparative cognition: is the snark still a boojum? Behav. Process. 80 (3), 210–217.
Sikich, L., Frazier, J.A., et al., 2008. Double-blind comparison of first- and second-generation antipsychotics in early-onset schizophrenia and schizo-affective disorder: findings from the treatment of early-onset schizophrenia spectrum disorders (TEOSS) study. Am. J. Psychiatry 165 (11), 1420.
Spencer, H., 1855. The Principles of Psychology. Longman, Brown, Green, and Longmans.
Thrun, S., Montemerlo, M., et al., 2006. Stanley, the robot that won the DARPA Grand Challenge. J. Field Robot. 23 (9), 661–692.
Tolman, E.C., 1936. Operational Behaviorism and Current Trends in Psychology.
Trikalinos, T.A., Churchill, R., et al., 2004. Effect sizes in cumulative meta-analyses of mental health randomized trials evolved over time. J. Clin. Epidemiol. 57 (11), 1124–1130.
Turner, E.H., Matthews, A.M., et al., 2008. Selective publication of antidepressant trials and its influence on apparent efficacy. N. Engl. J. Med. 358 (3), 252–260.
Wasserman, E.A., 1981. Comparative psychology returns: a review of Hulse, Fowler, and Honig's cognitive processes in animal behavior. J. Exp. Anal. Behav. 35 (2), 243.
Weinberger, D.R., 2001. Implications of normal brain development for the pathogenesis of schizophrenia. Sci. Mental Health Schizophr. 3, 18.