
10 Introduction to Bayesian Analysis

10.1. Introduction

This chapter contains an introduction to Bayesian analysis, from Bayes' theorem to the Bayes factor, as well as Bayesian inference and updating. Skipping the 18th Century story of Reverend Bayes, the method experienced a rise in popularity from the year 2000 onward, infiltrating all fields of science: 10 years later, approximately 25,000 scientific publications per year referenced "Bayes"1 (with a total of over 1,500 psychology articles by 2015, see van de Schoot et al. (2017)). This impressive figure suggests that any scientist should know the basics of Bayesian analysis in order to avoid missing out on a tool that can be applied to a wide variety of phenomena. This large number of articles is indicative of a paradigm shift, and as such, several authors cited in this chapter already present Bayesian inference as a revolution in statistics and in the sciences in general. Some of these authors, including the likes of John K. Kruschke, have directly addressed science editors and researchers in an open letter.2 The letter lists the fields that are moving toward adopting Bayesian inference as a method and suggests encouraging the movement in order to move as quickly as possible beyond the previous paradigm, which we shall present later (section 10.6). Bayesian inference has a very large range of potential applications, as its objective is simply the quantification of the relative merit of several hypotheses.

For a color version of the code appearing in this chapter, see www.iste.co.uk/mathy/experiments.zip.
1 Listen to: https://www.franceculture.fr/emissions/science-publique/une-formule-mathematique-universelle-existe-t-elle.
2 http://www.indiana.edu/~kruschke/AnOpenLetter.htm.


As the statistician Dennis Lindley once said: "Inside every non-Bayesian, there is a Bayesian struggling to get out". The key point is that this method is an integral part of our natural cognition, and it is used spontaneously in intuitive reasoning. Why not then train it to adapt to scientific reasoning, albeit with more technical computational tools? Bayesian inference can be found at three different levels in psychology, potentially making it one of the fields with the most to gain from Bayesian analysis, which is probably why Bayesian models have been extensively developed in psychology. The three levels in question are development, cognition and science, and they combine very nicely: development, learning (a point explored later), naïve reasoning (Griffiths and Tenenbaum 2006) and scientific reasoning all fit well into a Bayesian description, since all of them relate to a context of problem solving. In cognitive science, this method also allows us to make interesting links with artificial intelligence, for example, the extraction of structures from within data (for example, Griffiths and Steyvers 2004).

In a Bayesian context, all knowledge that is being built is governed by a priori knowledge. However, it is important first of all to understand, and then, in a Bayesian context, to assume, that this a priori knowledge corresponds to "beliefs", each associated with a particular degree of certainty (for example, 100% of animals move). A priori knowledge of a system does not reflect a process of precise internalization of the world (Feldman 2013) comprising objective frequencies (the fact that x% of animals are mobile, for example). For instance, Feldman's work shows that the perception of visual contours simply invokes an a priori belief based on the (Richard) von Mises distribution of the angles of a contour; this is not necessarily an accurate representation of the contours really found in the world. Beliefs are unique to a system, and even when they are unfounded or biased, they remain a base from which inferences can be drawn. A classic example in the domain of perception is the belief that the origin of a light is most likely a source located above (for example, the Sun), resulting in shapes being irrepressibly perceived as concave or convex depending on whether the dark part of the object is at the top or at the bottom. Another similar belief, that all faces are convex, makes the inside of a mask look convex (search: hollow-face illusion). Still in the domain of perception, Feldman and Manish (2015) give the example of a Bayesian analysis of perceptual grouping, providing a satisfying estimation of the contours of shapes.


10.2. Conditional release

Although we did not want to admit it in the sub-title, one has to come to terms with the fact that Bayes' theorem implies dealing with the concept of conditional probability. A simple example of a conditional probability would be distinguishing between the probability of being a boy (roughly one in two) and the probability of being a boy knowing that the individual in question has short hair (greater than one in two). The term knowing that is symbolized by "/" and refers to a conditional probability.

Conditional probability is not a simple concept, as the inversion of two terms linked by knowing that is not always obvious. For example, it is not easy to tell the difference between the probability of a car being involved in an accident knowing that it is gray and the probability of a car being gray knowing that it has been involved in an accident. The second of these two probabilities could be high simply because of the high number of gray cars, leading insurance companies to conclude that they are less easy to spot. However, the probability of a car being involved in an accident knowing that it is gray might not be any higher than for another color. This example shows that inverting the terms on either side of the "knowing that" quickly turns it into a bit of a puzzle. An easy one for the legal neophyte: does conditional release result in freedom knowing that the objective of the prison sentence has been accomplished, or freedom knowing that the individual has respected the penitentiary establishment's leaving conditions?

Furthermore, these conditional probabilities are so hard to grasp that they can even become deleterious. In a 1990 Sunday supplement, Marilyn vos Savant published an analysis in response to a reader's question regarding a game similar to Let's Make a Deal. The question refers to what is now known as the Monty Hall (presenter of the show in question at the time) problem. The game involved three doors behind which one car and two goats were hidden, respectively. The player had to first choose a door, which was left closed. The presenter, who knew where the car was, then opened one of the other doors, always revealing a goat. The key question was whether the player should stick with their initial choice, or change door. Once the choice was made, the chosen door was opened and the player won the prize that was behind it. Intuitively, it would seem that once the goat door has been opened by the presenter, each of the two closed doors has just as high a chance of hiding the car (so one in two), but the less intuitive answer is that there is a two in three chance of the car actually being hidden behind the other door, and not behind the door chosen initially. Bayes' law can help derive the calculations, which show that it is


in our interest to always change our choice once the door has been opened by the presenter. Interestingly, some of the best mathematicians and statisticians in the USA did not initially come to terms with this intuition, and Marilyn vos Savant came under a lot of criticism. To summarize the opposing voices, they sounded a bit like: "You are wrong and very much so! You do not need to be the smartest person in the United States to spread false concepts of mathematics". Fortunately for the psychologists that we are – up to our eyes in IQ tests – Marilyn's IQ ended up being right after all.
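For readers who prefer to convince themselves by simulation, here is a minimal MATLAB sketch (ours, not from the original text; the door numbering and variable names are arbitrary) that estimates the winning probability of each strategy:

Nsim = 100000;
winsStay = 0; winsSwitch = 0;
for i = 1:Nsim
    car = randi(3);                            % door hiding the car
    choice = randi(3);                         % player's initial choice
    % the presenter opens a door hiding a goat, never the chosen one
    remaining = setdiff(1:3, [car choice]);
    opened = remaining(randi(length(remaining)));
    switched = setdiff(1:3, [choice opened]);  % the other closed door
    winsStay = winsStay + (choice == car);
    winsSwitch = winsSwitch + (switched == car);
end
winsStay/Nsim      % about 1/3
winsSwitch/Nsim    % about 2/3: always switch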

10.3. Bayes' law

All in all, Bayes' rule is nothing more than a continuous reversal of the four probabilities that make up the product rule:

p(H/D) × p(D) = p(D/H) × p(H)

In the following table, there are two hypotheses and two pieces of data, representing all of the hypotheses H and all of the pieces of data D (row and column totals are shown in the margins):

        d1    d2    Tot
h1      1     1     2
h2      2     3     5
Tot     3     4     7

Table 10.1. For a color version of this table, see www.iste.co.uk/mathy/experiments.zip

The cell containing the number 2 means that under hypothesis h2, we get 2 events for data d1. We calculate: p(h2/d1) × p(d1). The table indicates that p(h2/d1) is equal to a 2 in 3 chance, since there are 3 cases of d1 overall, but only 2 under h2. The probability p(d1) totals 3 chances out of 7 (7 being obtained by adding up all the cases in the table). In the end, p(h2/d1) × p(d1) = 2/3 × 3/7 = 2/7, which we call the joint probability of h2 and d1.
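These calculations can be checked in a few lines of MATLAB (a minimal sketch; the matrix simply transcribes the counts of Table 10.1):

counts = [1 1; 2 3];                           % rows: h1, h2; columns: d1, d2
total = sum(counts(:));                        % 7 cases in all
p_d1 = sum(counts(:,1))/total                  % p(d1) = 3/7
p_h2_given_d1 = counts(2,1)/sum(counts(:,1))   % p(h2/d1) = 2/3
joint = p_h2_given_d1 * p_d1                   % p(h2,d1) = 2/7 = 0.2857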


Similarly: p(d1/h2) × p(h2) = 2/5 × 5/7 = 2/7.

Bayesian inference simply involves focusing on one of the four terms by rewriting the equation in the following fashion:

p(H/D) = p(D/H) × p(H) / p(D)

For example: p(h2/d1) = p(d1/h2) × p(h2) / p(d1) = (2/5 × 5/7)/(3/7) = 2/3

Bayesian reasoning is really an intuitive calculation, and for this reason, it stands in opposition to Tversky and Kahneman's theory (1974). While Tversky and Kahneman defend the idea that individuals' intuitive conceptions of probability in day-to-day reasoning are far from the axioms of mathematics (and therefore wrong), supporters of Bayesian inference defend the opposite idea, namely that the use of Bayesian principles by individuals accounts for human reasoning as well as its development with age without any contradiction. Here, we refer back to the class by Stanislas Dehaene on the statistician's brain (at the Collège de France, 2011–2012), which explored the idea that Bayesian inference provides a unified view of psychological processes. This concept is developed further in section 10.10.

10.4. Principle of Bayesian inference

A posterior probability p(H/D) relies on a likelihood p(D/H) and a prior probability p(H). The evidence is represented by p(D), often omitted from the calculations, which are then summarized for reasons of practicality as p(H/D) ≈ p(D/H) × p(H). The term p(H/D) corresponds to the probability of a hypothesis after having gathered a piece of data; as a reminder, the symbol '/' means 'knowing that' in the calculation of probabilities. The term p(D/H) is the likelihood of observing a given piece of data if we assume a hypothesis to be true. The term p(H) corresponds to the probability of a hypothesis without knowing any information. For example, p(H) is the probability of finding high-quality diamonds in a hectare of land when randomly drilling to extract 10 m³ of rubble; D is the presence of a plant X; this plant X grows in the presence of kimberlite, a rock that can contain diamonds; p(D/H) is the probability of plant X growing if there are diamonds present (that is, due to the presence of kimberlite), and p(H/D) is the final probability of finding high-quality diamonds in such a terrain knowing that plant X grows there. The symbol ≈ indicates that the calculation identifies only the most probable hypothesis, without the precise value of its probability being known. In reality, Bayes' law allows us to carry out precise calculations using the four terms of the equation, but these require additional computations that are not really necessary: three terms are enough to identify the most likely hypothesis.

The most common introductory example is that of a medical diagnosis. The patient has a cough. What is the diagnosis? In a first instance, without the physician auscultating the patient (without having any data regarding their health), the likelihood p(Cough/H1) of lung cancer causing a bad cough is high; this is also true of the common cold, p(Cough/H2); the likelihood p(Cough/H3) of a cough in the case of gastroenteritis is very low, although vomiting can cause irritation in the throat, resulting in a cough. However, the final diagnosis also depends on the prior probabilities and, in this example, the probability of having a simple cold is so high that the product p(D/H) × p(H) is maximal for p(D/H2) × p(H2). The term p(H2/D) is therefore the largest. We have just calculated the MAP hypothesis (maximum a posteriori hypothesis), using the principle of the maximum a posteriori. We shall later see how to calculate a probability ratio in order to show that a common cold is, for example, 1,000 times, or 10,000 times more probable than a cancer a posteriori (a posteriori meaning here: once the patient has been observed coughing). This provides a quantitative indication of the plausibility of the hypotheses in relation to each other.

Bayes' law can be used to carry out Bayesian inferences. In a Bayesian inference, the probability of obtaining the data under each theory or hypothesis is known or assumed. As a result, we can see which hypothesis best accounts for the data collected. Bayesian learning is mainly used to make a decision between competing hypotheses. The goal is to determine the best hypothesis based on a data sample, with the best hypothesis being the most plausible one once all the real data has been gathered.


The previous graph shows the distribution of scores in a test providing data D, in a population of schizophrenic patients S or paranoid patients P. Here, we were aiming to build a diagnosis knowing that the result of the test is between 50 and 60. We can see that while such a score is more typical of patients belonging to the P group, it still corresponds to a higher number of S patients. Bayes' theorem provides the following calculation method:

p(h/D) = p(D/h) p(h) / p(D)

It is implied that there are several hypotheses h within the set of hypotheses H, and several possible pieces of data d within D. The constant p(D) is also called the marginal likelihood. We want to calculate HMAP, the maximum a posteriori hypothesis. For this, we use:

HMAP = argmax_h p(D/h) p(h) (since p(h/D) ≈ p(D/h) p(h))

HMAP is the value of h for which the product p(D/h) p(h) is maximal. For example, P versus S:

Statement:
p(h = S) = .8
p(h = P) = .2
p(50 < D < 60 / S) = .10
p(50 < D < 60 / P) = .20

Calculations:
p(S/D) ≈ p(50 < D < 60 / S) × p(S) = .10 × .8 = .08
p(P/D) ≈ p(50 < D < 60 / P) × p(P) = .20 × .2 = .04

.08/.04 = 2. This ratio means that it is twice as likely that the patient is S rather than P with such a test result. As a result: HMAP = S.
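The same result can be obtained in a few lines of MATLAB (a minimal sketch using the values above):

priors = [.8 .2];                   % p(S), p(P)
likelihoods = [.10 .20];            % p(50 < D < 60 / S), p(50 < D < 60 / P)
unnormalized = likelihoods.*priors  % [.08 .04]: maximal for S
unnormalized(1)/unnormalized(2)     % ratio of 2 in favor of S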


Another example involving medical diagnosis:
1) The patient has some form of cancer or not.
2) The laboratory provides an analysis with two possible detection results: + or –.
3) The prior probability is that .008 (0.8%; this figure is made up) of the population has this form of cancer.
4) The test is not perfect. It returns a positive result in 98% of cases when the cancer is indeed present and only returns a negative result 97% of the time when the cancer is not present.

Summary:
D, the data, here the test whose result is + or –; H (hypotheses), here C for cancer, whose value is c or ~c
p(c) = .008
p(~c) = .992
p(+/c) = .98, which makes 2% false negatives, also called omissions: p(–/c) (a serious mistake in the diagnosis)
p(–/~c) = .97, meaning 3% false positives, also called false alarms: p(+/~c) (a less serious mistake in medical diagnosis, but one that can be serious in other domains such as in a court of law)

Question: We observe a patient with a positive test (we acquire a piece of information) – what is the most probable hypothesis? That they have cancer or that they do not? p(c/+) or p(~c/+)? This is different from asking: what is the probability of the patient having cancer?

Solution and calculations:
p(+/c) p(c) = (.98)(.008) = .0078
p(+/~c) p(~c) = (.03)(.992) = .0298
Nota bene: .0298/.0078 = 3.8 (posterior odds)


Therefore: the most likely hypothesis, knowing that the test is +, is that the patient does not have cancer: HMAP = ~c.

The exact probabilities can be obtained by using the constant p(+) in the denominator, calculated by marginalization as follows:

p(+) = p(+,c) + p(+,~c) = p(+/c) p(c) + p(+/~c) p(~c) = .0078 + .0298 = .0376
p(c/+) = p(+,c)/p(+) = .0078/(.0078 + .0298) = .21
p(~c/+) = p(+,~c)/p(+) = .0298/(.0078 + .0298) = .79

Nota bene: .79/.21 = 3.8, which is the posterior odds ratio obtained previously without all the calculations of p(+). We can say that the ratio of the posterior odds depends on the ratio of the prior odds, via the Bayes factor (written here as BF~c.c to denote that it is the Bayes factor in favor of ~c in comparison with c), as follows:

Posterior odds = Bayes factor × Prior odds, that is:

p(~c/+) / p(c/+) = BF~c.c × p(~c) / p(c)

Here, the term on the left is equal to 3.8 and the term on the far right is .992/.008 = 124, so the Bayes factor is roughly .0306: this factor is p(+/~c)/p(+/c) = .03/.98 = .0306. In other words, now that the test result is in, the evidence itself favors c over ~c by a factor of about 33 (1/.0306), but this is not enough to overturn prior odds of 124 to 1 against c, which is why the posterior odds remain at 3.8 to 1 in favor of ~c.
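A minimal MATLAB sketch of the whole calculation (values from the text):

pC = .008; pNotC = .992;                       % priors
pPosGivenC = .98;                              % p(+/c)
pPosGivenNotC = .03;                           % p(+/~c), i.e. 1 - .97
pPos = pPosGivenC*pC + pPosGivenNotC*pNotC;    % marginalization: p(+) = .0376
pCGivenPos = pPosGivenC*pC/pPos                % p(c/+) = .21
pNotCGivenPos = pPosGivenNotC*pNotC/pPos       % p(~c/+) = .79
pNotCGivenPos/pCGivenPos                       % posterior odds = 3.8
BF = pPosGivenNotC/pPosGivenC                  % Bayes factor BF~c.c = .0306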

The next example is based on two test results; this time, they are used to discriminate between two different populations. We can imagine a sample of 60 individuals who either have paranoia P (n = 20) or schizophrenia S (n′ = 40). We obtain their scores in two personality tests. The distribution of the results is shown in the subsequent figure, with the scores on the first test along the x-axis and the scores on the second along the y-axis.


The shading distinguishes the patients with paranoia (dark circles) from the schizophrenic patients (white circles). The sample contains 60 patients. The prior probabilities of belonging to group P and to group S are 20 out of 60 and 40 out of 60, respectively. We obtain the scores of a new patient, and we would like to put forward a diagnosis based on the results obtained from the 60-person sample. The new patient's score is represented by the gray circle in the following figure. The other scores closest to the patient's are contained in the dashed-line circle (the choice of a circle here is arbitrary: it might as well have been a square, or a circle of any other size).

The likelihoods of the scores closest to the patient's (contained in the dashed-line circle) are p(D/P) = 3/20 and p(D/S) = 1/40, since the circle contains 3 of the 20 P patients and 1 of the 40 S patients. We can then calculate the most probable hypothesis with regard to the patient. Are they more likely to be schizophrenic or paranoid, given the test results? Here, p(P/D) ≈ 2/6 × 3/20 = 1/20 and p(S/D) ≈ 4/6 × 1/40 = 1/60.


The posterior ratio is:

p(P/D) / p(S/D) = (1/20) / (1/60) = 3

A posteriori, we get a ratio of 3 to 1 that the patient is paranoid rather than schizophrenic: although the prior favored schizophrenia, the three neighboring P scores outweigh it. On viewing the points contained in the circle, the calculation seems very intuitive: knowing the result of the tests changes our point of view.
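Again, a minimal MATLAB sketch of this diagnosis (counts from the text):

priorP = 20/60; priorS = 40/60;     % prior probabilities of P and S
likeP = 3/20; likeS = 1/40;         % p(D/P), p(D/S): scores inside the circle
postP = priorP*likeP                % 1/20
postS = priorS*likeS                % 1/60
postP/postS                         % = 3, in favor of paranoia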

10.5. Updating hypotheses

The term Bayesian inference is used in the specific case where the information is updated by the repeated use of Bayes' theorem. The idea is that what is posterior today will be prior tomorrow. Going back to the example from Angner (2016), imagine that John and West are arguing over whether a coin is a trick coin or not. We shall see how we can carry out a belief calculation using a Bayesian update. A student claims that the coin is a trick coin. John reckons that the hypothesis H that the coin has two 'heads' sides has only a one in 100 (p = .01) chance of being correct, as the student in question does not usually carry out magic tricks. The student tosses the coin and it lands on 'heads'. If you were John, would you maintain the estimate of 1% now that this new piece of data has appeared? Bayes' law, simplified to two alternative hypotheses, is as follows: p(H/D) = p(D/H) × p(H) / p(D), but p(D) must be marginalized as follows:

p(h/D) = p(D/h) p(h) / [p(D/h) p(h) + p(D/~h) p(~h)]

Hypothesis ~H is the alternative one, and for simplicity's sake here, we consider that the coin normally comprises a 'heads' side and a 'tails' side. In logic, the symbol '~' represents negation, which means that ~H is the opposite of hypothesis H. More generally, Bayes' law allows us to compare a set of i hypotheses by putting the marginal likelihood in the denominator in order to find the probability of a hypothesis (here the first one, h1), knowing that a piece of data D has been acquired:

p(h/D) = p(D/h) p(h) / p(D) ≡ p(h1/D) = p(D/h1) p(h1) / Σi p(D/hi) p(hi)
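Before returning to the coin example, here is a minimal sketch of this general update for any number of hypotheses (the priors and likelihoods are made-up values for illustration):

priors = [.2 .3 .5];          % p(h_i), summing to 1
likelihoods = [.9 .5 .1];     % p(D/h_i) for the observed data D
posteriors = likelihoods.*priors/sum(likelihoods.*priors)
% returns [0.4737 0.3947 0.1316]: normalization by the marginal likelihood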

Going back to the data from the problem at hand, let us stick to two alternatives. A priori, John thinks that p(h) = .01. He must therefore also believe that p(~h) = .99. According to the statement of the problem, p(D_the_coin_lands_on_heads/h) = 1 and p(D_the_coin_lands_on_heads/~h) = .50.

After the first coin toss, which resulted in 'heads', we get:
p(h/D) = (p(D/h) × p(h)) / (p(D/h) × p(h) + p(D/~h) × p(~h)) = (1 × .01) / (1 × .01 + .50 × .99) = .02

If John's thought process is at all in line with Bayes', he should double his starting estimate that the coin is a trick coin. His belief can continue to evolve, though. Imagine now that a second toss again results in 'heads'. We must now take p(h) = .02 as the new prior.

After the second coin toss, which results in 'heads', we get:
p(h/D) = (p(D/h) × p(h)) / (p(D/h) × p(h) + p(D/~h) × p(~h)) = (1 × .02) / (1 × .02 + .50 × .98) = .04

Does John's estimate double after each 'heads' result? To find out, the following piece of code follows the evolution of John's and West's beliefs as a function of their starting points. Seeing as the coin is indeed a trick coin, we shall see how many tosses it takes for West and John to converge toward a common belief that the coin is tricked.


Code:
pH_John = .01;            % John's prior that the coin is a trick coin
pH_West = .50;            % West's prior
pHead_givenH = 1;         % p('heads' / trick coin)
pHead_givenNormal = .50;  % p('heads' / normal coin)
beliefsJohn = [pH_John];
beliefsWest = [pH_West];
Ntrials = 17;
for trialNum = 1:Ntrials
    % Bayesian update after each new 'heads' result
    new_pH_John = (pHead_givenH*pH_John)/ ...
        (pHead_givenH*pH_John + pHead_givenNormal*(1-pH_John));
    beliefsJohn = [beliefsJohn, new_pH_John];
    pH_John = new_pH_John;
    new_pH_West = (pHead_givenH*pH_West)/ ...
        (pHead_givenH*pH_West + pHead_givenNormal*(1-pH_West));
    beliefsWest = [beliefsWest, new_pH_West];
    pH_West = new_pH_West;
end
plot(beliefsJohn,'LineWidth',2)
hold on
plot(beliefsWest,'r:','LineWidth',3)
hold off
axis([1 Ntrials 0 1])
text(7,.2,'John','color','blue','FontSize',14)
text(4,.75,'West','color','red','FontSize',14)
Output

Figure 10.1. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip


It is interesting to note that West, being more open to the possibility that it is a trick coin, becomes convinced much quicker than John, who starts from further back. In any case, this would be the psychological interpretation that we would provide if a real individual's estimates followed these predictions!

Furthermore, it is interesting to look at how the ratio between a priori and a posteriori operates globally. When the a priori distributions are defined vaguely enough, only small amounts of data are needed in order to obtain a reliable a posteriori distribution, but when the a priori distributions are defined too specifically, more data are needed in order to bring beliefs back in line with the data. Very different a priori distributions can lead to the same a posteriori distributions, as the data absorb the a priori distributions when there are enough data. This is why one should not dwell too much on the issue of which prior probabilities are assigned to the model. Going further, we recommend the works by Kruschke (2015) and Lee and Wagenmakers (2013). Between the two, they cover more complex calculations as well as an introduction to the programs R, JAGS and WinBUGS.
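To illustrate this point, here is a minimal sketch (a made-up example of ours, not taken from the works cited): two very different priors on the probability p of a coin landing on 'heads', updated with the same data, end up producing very similar posteriors.

ps = 0:.01:1;                          % grid of candidate values for p
prior1 = normpdf(ps,.3,.1); prior1 = prior1/sum(prior1);  % believes p is low
prior2 = normpdf(ps,.7,.1); prior2 = prior2/sum(prior2);  % believes p is high
k = 60; n = 100;                       % observed data: 60 'heads' in 100 tosses
like = ps.^k .* (1-ps).^(n-k);         % binomial likelihood (unnormalized)
post1 = like.*prior1/sum(like.*prior1);
post2 = like.*prior2/sum(like.*prior2);
plot(ps,post1,'b',ps,post2,'r:')       % both posteriors peak near p = .6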

10.6. Statistics: going past rejecting the null hypothesis

It is becoming increasingly apparent that the 21st Century will see Bayesian inference replace the 20th Century's null hypothesis test as the standard statistical method. The latter is centered around testing a null hypothesis and calculates the probability of a result being due to chance (the null hypothesis). This is often abbreviated as NHST (Null Hypothesis Significance Testing). It is designed to say whether a result is significant or not, using the statistics of the null hypothesis. For example, according to the null hypothesis of chance, there is a 1 in 32 chance that a couple has 5 children who are all boys if p(boy) = .5. One chance in 32 is about 3%, which lies below the 5% criterion needed to reject the null hypothesis. For instance, if the alternative hypothesis is that eating a lot of French fries increases the probability of your baby being a boy, you could put a couple on a fries-only diet and test the hypothesis. If they have 5 boys, you could reject the null hypothesis (H0), by stating that such a result would be rare under chance. This supposedly "significant" result implies that the fries-only method works (the alternative hypothesis is true). Clearly, this thought process is flawed, as the alternative is not plausible.
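As a quick check of this arithmetic:

0.5^5    % probability of 5 boys out of 5 births under the null: 0.0313, about 3%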

A valid deduction using the modus tollens works as follows: (1) if H0, then not D; (2) we observe D; (3) therefore, not H0. The NHST is often used in the following way, which is not valid: (1) if H0, then D is very unlikely; (2) we observe D; (3) therefore, H0 is very unlikely. It is easier to see how this logic is flawed in the following example given by Rouder (Bayesian Inference in Psychology: A Workshop; see also Rouder (2014)): (1) if a person is French, it is very unlikely that they are a minister; (2) the person is a minister; (3) therefore, they are very unlikely to be French. Not only is the reasoning invalid, it does not allow us to say what the alternative is (Wagenmakers et al. 2015).

Bayesian analysis not only presents a great number of advantages, it avoids some of the drawbacks of the null hypothesis test and avoids all of the issues that NHST can lead to (Munafò et al. 2017). Current debate has tended to conclude that traditional statistical tests relating to NHST should be abandoned and replaced by Bayesian analysis. Dienes (2011) has provided a very good introduction to this issue, covering a number of problems associated with NHST. Dienes shows, for instance, that the classical process focuses on p(data/theory) when deriving statistical calculations under the null hypothesis, while scientists tend to search for p(theory/data), which is more in line with the spirit of Bayesian inference.

All in all, the NHST framework provides bad tools, and every time scientists do not follow the principles of this framework (for example, by increasing the amount of data accumulated when the result is not significant, also known as p-hacking; reformulating hypotheses based on new results, known as HARKing – Hypothesizing After the Results are Known; hesitating between a post-hoc test and a planned comparison test; or inflating family-wise error rates by running too many unplanned tests), they are reasoning in a way that would not necessarily be flawed under a Bayesian framework. This framework is therefore closer to the rationality of scientists. As a result, there are three possible attitudes (Lecoutre 2005): (1) use the NHST framework and respect its rules, (2) use the NHST framework, but with flawed interpretations, as our mind would appear to work more in line with Bayesian reasoning, and (3) move to a Bayesian framework. The most rational choice appears to be the last!

The key point here is to understand that within a classic NHST framework, rejecting H0 in no way means that the alternative is more plausible; moreover, this framework does not help to estimate the plausibility of H0 itself (that is, it provides no evidence for rejecting the alternative hypothesis); see Gallistel (2009), Kruschke and Liddell (2017), or even Mulder and Wagenmakers (2016). As explained subsequently, there is no point in focusing on a rejected null hypothesis if no alternative model has been specified (Rouder et al. 2016).

10.7. What alternative for an implausible null hypothesis?

Let us go back to the example by Jeff Rouder. The question is whether a coin is more likely to land on tails if we blow on it beforehand. After 1,000 tosses, it turns out that this is indeed the case, as the coin has landed 527 times on tails, which is significant if we apply a classic test using a binomial distribution (a one-tailed test, for the specialists). Seeing as the odds of observing at least 527 'tails' due to chance are lower than the traditional criterion of 5%, the hypothesis (called 'null' by statisticians) that blowing on the coin has no effect is rejected. This would not have been the case if the number had been 526!

What is the alternative then? That blowing on the coin has an effect? Does the test therefore imply that the alternative is more likely? Say that the hypothesis (called the 'alternative' hypothesis by statisticians) is that the probability of getting tails increases to p = .527 when the coin is blown on. A simple probability calculation using the binomial distribution gives a ratio of 4.3 to 1 for the alternative. In Bayesian analysis, a ratio as low as 4.3 is far from conclusive in terms of choosing one hypothesis over the other. A ratio of 3 is basically anecdotal; a value of more than 10 would be considered high. However, in this particular case, we can hardly hope for more than an alternative hypothesis corresponding to the mode of the likelihood! Testing the hypothesis that the coin lands on 'tails' exactly 527 times out of 1,000 is not very satisfying. What if we just did not have any luck, and blowing on the coin actually increases the number of tails even more? What if we tested the alternative hypothesis that p = .55 when we blow properly? The odds supporting the alternative in this case fall to 1.5 to 1. As a simplification, we could test the alternative that blowing results in a rate of tails between p = .5 and p = .6; the odds in favor of the alternative still sit at about 1.5 to 1. The calculations are complicated and involve integration over either a beta distribution of the a priori hypotheses or a uniform a priori distribution (if you think anything can happen). There are many introductions available to the binomial distribution and the beta-binomial model, which can help the reader acquire the mathematical tools involved in Bayesian analysis (Griffiths and Yuille 2008; Kruschke 2014). For a simpler approach, we recommend Morey et al. (2016).

An example of code and output using a binomial distribution follows (the functions binocdf and binopdf give the precise calculation; since they belong to the Statistics Toolbox and are not available in base MATLAB®, a solution for approximating the binomial distribution through a normal distribution is also provided here):
>> 1-binocdf(527,1000,.5)
ans = 0.0410
>> 1-binocdf(527,1000,.527)
ans = 0.4876
>> binopdf(527,1000,.5)
ans = 0.0059
>> binopdf(527,1000,.527)
ans = 0.0253
>> .0253/.0059
ans = 4.2881, so 4.3.

Approximation of a binomial distribution through a normal distribution first requires the calculation of a z-score through the formula z = (k − np)/(np(1−p))^0.5. The term (np(1−p))^0.5 is the standard deviation of the distribution. For 527 tails, we get:


>> (527-1000*.5)/(1000*.5*.5)^.5
ans = 1.7076

This value is used with the normal density function normpdf, available in the Statistics Toolbox.
>> normpdf(1.7076)
ans = 0.0928

Similarly, to test the hypothesis that p = .527, we calculate
>> (527-1000*.527)/(1000*.527*.473)^.5
ans = 0
>> normpdf(0)
ans = 0.3989
>> .3989/.0928
ans = 4.2985, so 4.3. Note that 4.2985 is the normal approximation of the 4.2881 obtained previously.

In the following graph, we can see the ratio of 4.3 when the abscissa is equal to 527.
Code:
X = [400:1:600];
StdDev1 = (1000*.5*.5)^.5;             % SD of the null distribution (p = .5)
plot(X,normpdf(X,500,StdDev1))
ptest = .527;
StdDev2 = (1000*ptest*(1-ptest))^.5;   % SD under the alternative
hold on
plot(X,normpdf(X,1000*ptest,StdDev2),'Color','r')
line([527 527],[0 0.03],'color','k')   % vertical line at the observed count
hold off
Output


Figure 10.2. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip

We can see that the black vertical line intersects the two curves at the values 0.0253 and 0.0059. Dividing .0253 by .0059 results in 4.3. In order to obtain the ratio of 1.5 to 1 by testing p = .55, we do
>> (527-1000*.55)/(1000*.55*.45)^.5
ans = -1.4620

Then
>> normpdf(-1.46)/normpdf(1.7076)
ans = 1.4802, so roughly 1.5.

This ratio can be seen here:


Code: (replace ptest = .527 with ptest = .55 in the previous code)
Output

Figure 10.3. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip

10.8. More complex distributions for calculating whether toast lands more often on the buttered side

Murphy's law is the best candidate for explaining why your toast always seems to land on the buttered side. This section draws on Morey et al. (2016), which we recommend reading before going any further, and which uses this example as a way of explaining Bayesian inference. We assume that there is a probability p = .5 of the toast landing on the wrong side, with p normally distributed around .5. As a result, we shall not settle for a single a priori hypothesis such as p = .5 – instead, we assume that .5 is the value with the greatest prior probability within a distribution of other possible values (here, we have chosen a normal distribution, but other distributions are possible). The idea is to have a continuous distribution of plausible values close to .5, which become progressively less plausible the further they get from .5. The following code uses calculations similar to the ones by Morey et al. (2016).


The goal of some of the calculations is to ensure that, within each distribution, the likelihoods sum to 1, as you can see in the following program:
>> sum(BETAlikeVectorOfPriors)
ans = 1.0000

The same goes for the posterior distribution:
>> sum(py)
ans = 1.0000

The following program first calculates the likelihood of getting values between 1 and 50 for values of p varying between .01 and .99. Note that the extreme values are less likely. This distribution could actually be ignored: we could directly build a normal distribution of the plausible values using normpdf when creating BETAlikeVectorOfPriors, without going through the likelihood calculation, which is of little interest with the values ps=[.01:.01:.99], as most of them are close to 1.


However, this calculation is extremely relevant when using the values ps=[.50:.01:.99].

When this distribution is multiplied by a normal distribution around p = .5 (the most plausible value), we get an a priori distribution of the values with a slightly less symmetrical appearance, as follows:

The following full programs therefore calculate the Bayes factors of two models, knowing that the toast has been seen to fall 30 times on the buttered side, out of 50 falls in total.


Here, we test the alternative hypothesis that p = .50. The null hypothesis leaves p unspecified, but we nonetheless expect a plausibility gradient centered approximately on p = .50. The distribution of the null hypothesis (in black in the following figures) is therefore more spread out than that of the alternative hypothesis (in green).

Code:
% To understand:
% Morey, R. D., Romeijn, J. W., and Rouder, J. N. (2016). The philosophy of
% Bayes factors and the quantification of statistical evidence. Journal of
% Mathematical Psychology, 72, 6-18.
% Calculations p. 11
clc
clear all
ps = [.01:.01:.99];
Nps = length(ps);
yValues = [1:1:50];
NyValues = length(yValues);
parameterForNormalPrior = 2.45;
% normal-shaped prior over the 50 possible counts (z from -2.45 to 2.45),
% centered on y = 25, that is, on a rate of .50
BETAlikeVectorOfPriors = normpdf(linspace(-parameterForNormalPrior, ...
    parameterForNormalPrior,NyValues));
BETAlikeVectorOfPriors = BETAlikeVectorOfPriors/sum(BETAlikeVectorOfPriors);
likelihood = [];
incr = 0;
for p = ps
    incr = incr+1;
    for y = yValues
        likelihood(incr,y) = binopdf(y,50,p);   % p(y buttered falls out of 50 / p)
    end
end
% p(y/H0): likelihood averaged over p, weighted by the prior, normalized
pyGivenH0 = mean(likelihood).*BETAlikeVectorOfPriors;
pyGivenH0 = pyGivenH0/sum(pyGivenH0);
plot(yValues,pyGivenH0+.001,'color','black')    % +.001 offsets the curve slightly
hold on
pyGivenH1 = [];
for y = yValues
    pyGivenH1 = [pyGivenH1,binopdf(y,50,.50)];  % p(y/H1), with p = .50 exactly
end
plot(yValues,pyGivenH1/sum(pyGivenH1),'color','green')
line([30 30],[0 pyGivenH1(30)],'color','black')
pyGivenH1(30)
pyGivenH0(30)
Ev = pyGivenH1(30)/pyGivenH0(30)


Output
ans = 0.0419
ans = 0.0365
Ev = 1.1460

Visually, we can see the ratio 1.146 at the level of the vertical line, when x = 30.

Figure 10.4. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip


Next, we test the alternative hypothesis that p = .50 against the null hypothesis that p > .50.

Code:
% Morey, R. D., Romeijn, J. W., and Rouder, J. N. (2016). The philosophy of
% Bayes factors and the quantification of statistical evidence. Journal of
% Mathematical Psychology, 72, 6-18.
% p. 13
clc
clear all
ps = [.50:.01:.99];                 % the null now only considers p > .50
Nps = length(ps);
yValues = [1:1:50];
NyValues = length(yValues);
parameterForNormalPrior = 2.45;
% normal-shaped prior over the 50 possible counts (z from -2.45 to 2.45)
BETAlikeVectorOfPriors = normpdf(linspace(-parameterForNormalPrior, ...
    parameterForNormalPrior,NyValues));
BETAlikeVectorOfPriors = BETAlikeVectorOfPriors/sum(BETAlikeVectorOfPriors);
likelihood = [];
incr = 0;
for p = ps
    incr = incr+1;
    for y = yValues
        likelihood(incr,y) = binopdf(y,50,p);
    end
end
pyGivenH0 = mean(likelihood).*BETAlikeVectorOfPriors;
pyGivenH0 = pyGivenH0/sum(pyGivenH0);
plot(yValues,pyGivenH0+.001,'color','black')
hold on
pyGivenH1 = [];
for y = yValues
    pyGivenH1 = [pyGivenH1,binopdf(y,50,.50)];
end
plot(yValues,pyGivenH1/sum(pyGivenH1),'color','green')
line([30 30],[0 pyGivenH0(30)],'color','k')
pyGivenH1(30)
pyGivenH0(30)
Ev = pyGivenH0(30)/pyGivenH1(30)

Output
ans = 0.0419
ans = 0.0643
Ev = 1.5354


Figure 10.5. For a color version of this figure, see www.iste.co.uk/mathy/experiments.zip

10.9. Model selection

In order to compare a model M1 to other models, we go back to the base equation:

p(M1/D) = p(D/M1) p(M1) / p(D)

Comparing two alternative models is the same as computing:

p(M1/D) = p(D/M1) p(M1) / [p(D/M1) p(M1) + p(D/M2) p(M2)]

As we saw previously, we can rewrite the equation in order to find the Bayes factor (BF), as follows:

p(M1/D) / p(M2/D) = [p(D/M1) / p(D/M2)] × [p(M1) / p(M2)], that is, Posterior odds = BF × Prior odds

In essence, the Bayes factor (BF) helps state whether the data are in favor of one of the models (if BF12 = 5, then Model 1 is 5 times more likely than Model 2). The Bayes factor tells us how much more plausible a model is compared to another once D is taken into account. It represents the evolution of the prior odds into the posterior odds. When the models are equally likely at the start, the relative posterior plausibility of the models is simply the Bayes factor (the ratio of the likelihoods):

p(M1/D) / p(M2/D) = p(D/M1) / p(D/M2)

In terms of the value of the Bayes factor, there are simple conventional scales for interpreting it: 0 to 3, not worth mentioning (in other words, anecdotal); 3 to 10, moderate; more than 10, high.

The BIC (Bayesian Information Criterion) is an approximation of the logarithm of the Bayes factor. It was developed to help assess the probability ratio provided by two models that are compared in order to account for scientific data (Busemeyer and Diederich 2010). For a given model, it is a function of k, the number of free parameters of the model, N, the number of data points used in the fit, and the fit to the data calculated using the maximum likelihood L (we shall not go into detail here on likelihood calculations, which vary from one model to the next): BIC = k ln(N) - 2 ln(L). We must then choose the model that has the smallest BIC. The BIC takes into account both the quality of the fit and the parsimony of the model, as it penalizes a model that contains too many free parameters. A comparison of the BIC with other criteria such as the AIC can be found in the highly instructive article by Pitt and Myung (2002). Similar but more complex articles include Shiffrin et al. (2008), or Wagenmakers et al. (Part I and Part II; 2018a, 2018b).
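As an illustration, here is a minimal sketch of a BIC-based comparison (the log-likelihoods, k and N are made-up values):

N = 100;                    % number of data points
logL1 = -120; k1 = 2;       % hypothetical model 1: worse fit, fewer parameters
logL2 = -117; k2 = 5;       % hypothetical model 2: better fit, more parameters
BIC1 = k1*log(N) - 2*logL1  % = 249.2
BIC2 = k2*log(N) - 2*logL2  % = 257.0
% the smallest BIC wins: here model 1, whose parsimony outweighs its poorer fit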

10.9.1. Bayesian tests with JASP

JASP (https://jasp-stats.org/) is a very well designed piece of software. Not only does it allow us to carry out classic tests like in SPSS, it can also carry out these tests as part of a Bayesian framework. Its designers are not supporters of the classic NHST framework, but they believe that by offering both options, users can more easily discover Bayesian concepts. The gamble appears to have paid off, as the software is well regarded and well known (perhaps because it provides a free alternative to SPSS, but the idea of introducing the concept of Bayesian analysis appears to be working too). A fictional dataset organized into two modalities x and y of the variable cat is subsequently shown.

cat.    perf.
x       3
x       5
x       4
x       6
x       5
x       4
x       7
x       2
x       4
x       5
y       1
y       2
y       1
y       2
y       3
y       2
y       3
y       1
y       3
y       1

JASP lets us carry out a Bayesian Student t-test for independent samples.


The descriptive statistics indicate an effect size (Cohen's d) of (4.5 − 1.9)/((1.434 + 0.876)/2) = 2.25. This effect size is compared to an effect size of zero: under the null hypothesis, an effect size of zero is given the greatest credibility through a prior distribution that has the shape of a Cauchy distribution, similar to a normal distribution but with heavier tails. JASP returns a Bayes factor (BF10) of 173, thus favoring the alternative hypothesis (the index 10 indicates a BF in favor of the alternative over the null; if the index is 01, then the BF favors the null hypothesis; in any case, we move from one to the other through the ratio BF01 = 1/BF10 = .006). The ratio is indicated subsequently by the heights of the two grayed-in points on the prior and posterior distributions. Once the data have been analyzed, an effect size of zero has a posterior density close to zero. The mode of the posterior distribution corresponds to Cohen's d.
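These descriptive statistics can be checked in MATLAB (a minimal sketch using the dataset above):

x = [3 5 4 6 5 4 7 2 4 5];
y = [1 2 1 2 3 2 3 1 3 1];
mean(x)                                      % 4.5
mean(y)                                      % 1.9
std(x)                                       % 1.434
std(y)                                       % 0.876
d = (mean(x)-mean(y))/((std(x)+std(y))/2)    % Cohen's d = 2.25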

This summary graph shows the sequential analysis, in which the analysis receives the data one piece after the other. Generally speaking, Bayesian analysis is only sensitive to the choice of prior when there are few data points. As the amount of data increases, the posterior results converge, no matter the priors.


Other programs that allow Bayesian analysis are STATA (http://www.stata.com) and WinBUGS (https://en.wikipedia.org/wiki/WinBUGS).

10.10. Cognitive psychology

The idea that the human mind could be viewed as an optimal calculator based on Bayesian inference might not be very promising for some (Marcus and Davis 2013). Nevertheless, the Bayesian model has constituted a veritable revolution in cognitive psychology (see the classes by S. Dehaene at the Collège de France, 2011–2012, open access). This approach accounts for a large number of phenomena from the domains of perception, learning and reasoning (for example, Tauber et al. 2017; Tenenbaum et al. 2011). As an example, Griffiths and Tenenbaum (2006) modeled the optimal character of reasoning, and Tenenbaum et al. (2006) modeled the acquisition of knowledge.

We are convinced that this can be used in developmental psychology to develop a simple and elegant model of Piagetian structuralism. The Bayesian model can be perfectly adapted to the idea that a child tests a simple hypothesis, gathers data through experience, updates their hypotheses, and so on and so forth (Gopnik and Tenenbaum 2007). The Piagetian view that children are little growing scientists is supported by this model, which allows us to carry out calculations that are far more realistic than the Piagetian notions (assimilation, accommodation, equilibration), which have become outdated as they are hard to model. A good starting point for becoming acquainted with Bayesian models in developmental psychology is the tutorial by Perfors et al. (2011). Lastly, this approach reconsiders learnability from an empiricist perspective, following the idea that a few fundamental principles (more precise than the simple notion of associationism usually connected with empiricism) are what allow learning to take place (Chater et al. 2015).

In psycholinguistics, a precursor model of language comprehension was developed in 2008 by Norris and McQueen; this model, along with its predecessor (14 years prior), has been cited over 1,500 times. Another example is the model by Xu and Tenenbaum (2007), which looks at the inductive learning of word meanings. One of its advantages is to account for the generalization of a word from a small sample (known as fast mapping). The method thus provides a rational approach to the issue of the poverty of the stimulus in linguistics and to the blessing that is abstraction. Simple assumptions regarding hierarchical structures of meaning help individuals generalize words to new exemplars. For example, hearing one word applied to three Labradors is enough to assume that the word does not refer to a higher category: if it did, it would be an unlikely coincidence for three randomly chosen exemplars to all be Labradors, rather than, say, a Labrador, a Basset Hound and a Dalmatian, or a dog, a parrot and a crocodile. For this reason, only a few positive exemplars are needed to get the meaning of a word, and negative exemplars are not required. The authors also demonstrated a developmental effect in their data, as children tend to overestimate the specificity of a new word. For example, when meeting a Labrador called Yep, children do not usually consider that Yep could refer to a superior level, such as "dogs".