Caution in exploring the effects of distant past outcomes on sequential choices

Kenji Morita a,b,∗, Asako Mitsuto Nagase c,d

a Physical and Health Education, Graduate School of Education, The University of Tokyo, Japan
b International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Japan
c Research Fellow of Japan Society for the Promotion of Science, Japan
d Division of Neurology, Department of Brain and Neurosciences, Faculty of Medicine, Tottori University, Japan

Article history: Received 3 September 2019; Received in revised form 16 November 2019; Accepted 3 December 2019; Available online 23 December 2019

Keywords: Decision making; Choice; Win-Stay-Lose-Switch; Win-Stay-Lose-Shift; Reinforcement learning; Simulation; Oscillology

Abstract

We sometimes make sequential decisions depending solely on the immediate past outcomes, e.g., according to the Win-Stay-Lose-Switch rule. On other occasions, we make decisions depending also on the distant past outcomes. It appears of interest to distinguish these two cases based on the generated choice sequences. At first glance, it may seem straightforward to distinguish them by examining whether the rate of reselecting the same option that was chosen in the distant past, for example at two trials before, depends on the outcome obtained there. However, such a naive analysis can theoretically lead to the detection of spurious dependence of three different types. Whereas two of them can easily be avoided by calculating the rate of reselection separately for each case sorted by the choice and outcome at the previous trial, the third type of spurious dependence appears even after such sorting. Here we show how such spurious dependence appears. This article exemplifies the need for caution in analyzing a limited number of sequential choices.

© 2019 Published by Elsevier B.V.

1. Introduction

We sometimes make sequential decisions depending solely on the immediate past outcomes, e.g., according to the Win-Stay-Lose-Switch or Win-Stay-Lose-Shift (WSLS) rule, in which we choose the same option as in the previous trial again if its outcome was good and switch to the other option if the outcome was bad. We might consciously adopt such a strategy because it appears reasonable: not necessarily optimal for outcome maximization, but perhaps striking a good balance between performance benefit and the cognitive effort cost of deciding. We might also adopt strategies like WSLS unconsciously. On other occasions, we make decisions depending not only on the immediate past outcomes but also on the distant past outcomes. Specifically, we might choose one option more frequently when that option resulted in a good outcome on the previous two choice opportunities than when it resulted in a good outcome the previous time but in a bad outcome two trials before.

∗ Corresponding author at: Physical and Health Education, Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan. E-mail address: [email protected] (K. Morita).
https://doi.org/10.1016/j.neures.2019.12.011

Again, we might consciously adopt such a strategy. Meanwhile, dependence on the distant past outcomes is inherent in reinforcement learning (RL) algorithms, which have been suggested to approximate various kinds of learning and choice behavior of humans, from reward learning to the learned avoidance of pain (Seymour et al., 2004) or of physical or mental effort cost (Skvortsova et al., 2014; Nagase et al., 2018), as well as of animals (e.g., Samejima et al., 2005). Inferring the types or strategies of decision making from observed behavior is important for basic cognitive neuroscience research on the mechanisms of decision making, as well as for the clinical diagnosis of neuropsychiatric disorders and for applications such as neuromarketing. Therefore, it appears of interest to distinguish the two cases, namely the case where choice depends solely on the immediate past outcomes and the case where it depends also on the distant past outcomes, based on the observed choice behavior, or more specifically, on the choice sequences generated in experimental or real situations where people have repeated opportunities to choose from a certain set of options. Although there are systematic ways to analyze this issue, in particular logistic regression analysis (Katahira, 2015), here we concentrate on a simpler, intuitive approach.


Fig. 1. Tasks and decision rules. (A) Example trial sequences of the two tasks. Black and white indicate good and bad outcomes, respectively. (B) Schematic diagrams of the three decision rules.

At first glance, it may seem straightforward to distinguish the two cases by examining whether the rate of reselecting the same option that was chosen in the distant past, for example at two trials before, depends on the outcome obtained there. However, there are in fact several ways in which such a naive analysis can lead to the detection of spurious dependence. Here we introduce these possibilities, from a simple one to a rather complicated one, showing how the spurious dependence appears, and discuss its implications.

2. Methods

2.1. Tasks, decision rules, simulations, and calculation of the rate of choices

We considered two types of two-alternative forced-choice tasks consisting of 180 trials (Fig. 1A). In Task 1, choosing either option 1 or option 2 always results in a good or bad outcome with 50 % probability each. In Task 2, during the initial 30 trials, choosing option 1 results in a good or bad outcome with 90 % and 10 % probabilities, respectively, whereas choosing option 2 results in a good or bad outcome with 10 % and 90 % probabilities, respectively; the probabilistic associations between choices and outcomes are then reversed in an oscillatory manner every 30 trials. Note that these probabilities were the values set in the program code used to generate trial sequences with pseudo-random numbers, and the actual ratio of good to bad outcomes in 30 or 180 trials of an individual generated task could deviate considerably from these probabilities.

Three decision rules were considered (Fig. 1B). Rule 1 is a probabilistic Win-Stay-Lose-Switch (pWSLS) rule, in which Win (i.e., obtaining a good outcome) is followed by Stay (i.e., choosing the same option) with 90 % probability, whereas Lose (i.e., obtaining a bad outcome) is followed by Switch (i.e., choosing the other option) with 90 % probability. Rule 2 is a probabilistic selection rule, in which obtaining a good outcome is followed by Stay with 90 % probability, whereas obtaining a bad outcome is followed by Stay or Switch with 50 % probability each. Rule 3 is a reinforcement learning rule, in which the probability of choosing either option depends on the option's "value" (V1 and V2 for option 1 and option 2, respectively). At each trial, the difference between the obtained outcome R (good: R = 1, bad: R = 0) and the chosen option's value (VX, where X is 1 or 2), called the prediction error (PE, δ = R − VX), is calculated, and the chosen option's value is updated according to the PE (VX → VX + αδ, where α is the learning rate). Then, the probability of choosing option 1 is calculated as P1 = exp(βV1)/(exp(βV1) + exp(βV2)), where β is the parameter determining the degree of exploitation over exploration. The parameters α and β were set to 0.8 and 3, respectively, and the initial values of V1 and V2 were both set to 0 in the simulations shown in the present article. Notably, Rules 1 and 2 are decision strategies that depend solely on the immediate past outcome, whereas Rule 3 is a strategy that depends also on the distant past outcomes.

For each combination of the tasks and decision rules, we performed 10,000 simulations using MATLAB, where trial sequences generally differed from simulation to simulation (i.e., we generated 10,000 × 3 trial sequences for each task using pseudo-random numbers). In each trial, the choice was made according to the probability set as above using a pseudo-random number. The pseudo-random numbers were generated by the "rand" function of MATLAB. We calculated the rate of choices satisfying particular conditions (described in the Results) in each simulation where relevant data existed, and calculated the average and the standard error of the mean (SEM) across those simulations. The results of analyses based on these simulations are shown in Figs. 1–4. For the results shown in Fig. 5, we separately performed 10,000 simulations of Tasks 1 and 2 with Rule 1, each with 46,080 (= 180 × 2^8) trials.
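For concreteness, the following Python sketch illustrates the simulation setup described above (the original simulations were run in MATLAB; the function names such as simulate and outcome_prob_good, and all implementation details, are our own illustrative choices and not the authors' code):

```python
import numpy as np

def outcome_prob_good(task, option, trial):
    """Probability that choosing `option` (1 or 2) yields a good outcome at `trial` (0-based)."""
    if task == 1:
        return 0.5                                   # Task 1: 50 % for either option
    # Task 2: option 1 starts at 90 % good, option 2 at 10 %; contingencies reverse every 30 trials
    p_opt1 = 0.9 if (trial // 30) % 2 == 0 else 0.1
    return p_opt1 if option == 1 else 1.0 - p_opt1

def simulate(task, rule, n_trials=180, alpha=0.8, beta=3.0, rng=None):
    """Simulate one task with Rule 1 (pWSLS), Rule 2 (probabilistic selection), or Rule 3 (RL)."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.zeros(2)                                  # option values, used by Rule 3 only
    choices, outcomes = [], []
    for t in range(n_trials):
        if t == 0:
            choice = int(rng.integers(1, 3))         # first choice: either option with 50 %
        elif rule == 1:                              # Win-Stay-Lose-Switch with 90 % compliance
            p_stay = 0.9 if outcomes[-1] else 0.1
            choice = choices[-1] if rng.random() < p_stay else 3 - choices[-1]
        elif rule == 2:                              # Stay with 90 % after Win, 50 % after Lose
            p_stay = 0.9 if outcomes[-1] else 0.5
            choice = choices[-1] if rng.random() < p_stay else 3 - choices[-1]
        else:                                        # Rule 3: softmax over learned values
            p1 = np.exp(beta * V[0]) / (np.exp(beta * V[0]) + np.exp(beta * V[1]))
            choice = 1 if rng.random() < p1 else 2
        good = rng.random() < outcome_prob_good(task, choice, t)
        if rule == 3:                                # update the chosen option's value by the prediction error
            delta = (1.0 if good else 0.0) - V[choice - 1]
            V[choice - 1] += alpha * delta
        choices.append(choice)
        outcomes.append(bool(good))
    return choices, outcomes
```

Running simulate(task, rule) 10,000 times per task-rule combination and tallying rates of particular choice patterns would reproduce the style of analysis reported in the Results.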
3. Results

3.1. Spurious dependence of the choice on the distant past outcome

We simulated choices made in the two tasks (Fig. 1A) using each of the three decision rules (Fig. 1B), and for each combination of tasks and rules, we calculated the rate of reselecting the same option as chosen at two trials before, when a good or bad outcome was obtained there, averaged across simulations. Fig. 2 shows the results. In the case with Task 1 and Rule 1 (Fig. 2A), the rate looks independent of the outcome at two trials before. In the case with Task 1 and Rule 2 (Fig. 2B), however, the rate of reselection was higher when a good outcome was obtained at two trials before than when a bad outcome was obtained. The reason for this is straightforward. The rate of reselection when a good outcome, or a bad outcome, was obtained at two trials before is calculated to be

P_{Stay|Win} (P_{Win} P_{Stay|Win} + P_{Lose} P_{Stay|Lose}) + P_{Switch|Win} (P_{Win} P_{Switch|Win} + P_{Lose} P_{Switch|Lose}), and

P_{Stay|Lose} (P_{Win} P_{Stay|Win} + P_{Lose} P_{Stay|Lose}) + P_{Switch|Lose} (P_{Win} P_{Switch|Win} + P_{Lose} P_{Switch|Lose}),

respectively, where P_{Win} and P_{Lose} are the probabilities that a good or bad outcome was obtained at the previous trial, and P_{Stay|Win}, P_{Switch|Win}, P_{Stay|Lose}, and P_{Switch|Lose} are the conditional probabilities that Stay or Switch is taken after a good or bad outcome; thus, P_{Stay|Win} P_{Win} P_{Stay|Win} in the first term of the first formula, for example, indicates the probability, after a good outcome was obtained at two trials before, that Stay was chosen and a good outcome was obtained at the previous trial and Stay is chosen at the current trial. In the case with Task 1 and Rule 1, these rates are 0.9 × (0.5 × 0.9 + 0.5 × 0.1) + 0.1 × (0.5 × 0.1 + 0.5 × 0.9) = 0.5, and 0.1 × (0.5 × 0.9 + 0.5 × 0.1) + 0.9 × (0.5 × 0.1 + 0.5 × 0.9) = 0.5, whereas in the case with Task 1 and Rule 2, they are 0.9 × (0.5 × 0.9 + 0.5 × 0.5) + 0.1 × (0.5 × 0.1 + 0.5 × 0.5) = 0.66, and 0.5 × (0.5 × 0.9 + 0.5 × 0.5) + 0.5 × (0.5 × 0.1 + 0.5 × 0.5) = 0.5. These calculations appear to match the simulation results (Fig. 2A,B), and given these calculations, the observed spurious dependence of the choice on the outcome at two trials before in the case of Rule 2 can be said to come from the Win-vs-Lose asymmetry in Rule 2.

Fig. 2. The rate of reselecting the same option as chosen at two trials before when a good or bad outcome was obtained there. The k-th and (k-2)-th trials indicate the current and two-back trials, respectively. 10,000 simulations were conducted for each combination of the tasks (Task 1: A–C, Task 2: D–F) and decision rules (Rule 1: A,D, Rule 2: B,E, Rule 3: C,F), and the average rates ± SEM across the simulations are indicated by the bars and error-bars.

Fig. 2D shows the results of the case with Task 2 and Rule 1. As shown in this figure, in Task 2, spurious dependence on the outcome at two trials before appears even with Rule 1, which has Win-vs-Lose symmetry. The reason for this is also simple. In the case of Task 1 described above, P_{Win} and P_{Lose} are constant (0.5). However, in Task 2, these probabilities change over time (Fig. 1A), and critically, they generally differ between trials following Win and trials following Lose, in such a way that if one takes the same option as in the previous trial, the same outcome (good or bad) tends to repeat. More specifically, except for the small number of trials at which the probabilistic option-outcome contingency is reversed (i.e., between the 30th-31st, 60th-61st, 90th-91st, 120th-121st, and 150th-151st trials, because the probabilistic associations were reversed every 30 trials in Task 2 as mentioned above), if an option leads to a good outcome, or a bad outcome, in a given trial, that option is likely to lead to a good outcome, or a bad outcome, again in the next trial with a high probability, whereas the opposite option is likely to lead to the opposite outcome, also with a high probability. Although the actual values of these probabilities may not be obvious, if we assume that they are 80 % and ignore the effect of contingency reversal, the rate of reselection of the option chosen at two trials before when a good outcome, or a bad outcome, was obtained there is calculated to be 0.9 × (0.8 × 0.9 + 0.2 × 0.1) + 0.1 × (0.2 × 0.1 + 0.8 × 0.9) = 0.74, and 0.1 × (0.2 × 0.9 + 0.8 × 0.1) + 0.9 × (0.8 × 0.1 + 0.2 × 0.9) = 0.26, respectively.
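As a sanity check, these closed-form values can be evaluated directly. The following small Python function is our own illustration, not part of the original analysis; the parameter p_repeat encodes the simplifying 80 % assumption used above for Task 2. It reproduces 0.5/0.5, 0.66/0.5, and 0.74/0.26:

```python
def reselection_rates(p_stay_win, p_stay_lose, p_repeat=0.5):
    """Rate of reselecting, at trial k, the option chosen at trial k-2, given the outcome there,
    for a strategy defined only by P(Stay|Win) and P(Stay|Lose).
    p_repeat: probability that the same option yields the same outcome again on the next trial
    (0.5 in Task 1; roughly 0.8 in Task 2, ignoring contingency reversals)."""
    p_switch_win, p_switch_lose = 1.0 - p_stay_win, 1.0 - p_stay_lose

    def reselect(p_first_stay, p_win_if_stayed, p_win_if_switched):
        # Reselection requires either Stay at k-1 and Stay again at k, or Switch at k-1 and Switch back at k.
        stay_twice = p_first_stay * (p_win_if_stayed * p_stay_win + (1 - p_win_if_stayed) * p_stay_lose)
        switch_back = (1 - p_first_stay) * (p_win_if_switched * p_switch_win + (1 - p_win_if_switched) * p_switch_lose)
        return stay_twice + switch_back

    after_win = reselect(p_stay_win, p_repeat, 1 - p_repeat)    # good outcome at k-2
    after_lose = reselect(p_stay_lose, 1 - p_repeat, p_repeat)  # bad outcome at k-2
    return after_win, after_lose

print(reselection_rates(0.9, 0.1))                 # Task 1, Rule 1 -> (0.5, 0.5)
print(reselection_rates(0.9, 0.5))                 # Task 1, Rule 2 -> (0.66, 0.5)
print(reselection_rates(0.9, 0.1, p_repeat=0.8))   # Task 2 (approx.), Rule 1 -> (0.74, 0.26)
```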

Although the effect of contingency reversal is in fact not negligible and the simulation results (Fig. 2D) deviate from the above calculations, it can be said that the inter-trial correlations of outcomes underlie this type of spurious dependence of the choice on the outcome at two trials before. As we have seen so far, the rate of reselection of the option chosen at two trials before can differ depending on the outcome obtained there even if choices were made according to a decision strategy that depends solely on the immediate past outcome (Fig. 2B,D,E). Indeed, these results look similar to the results for the cases where reinforcement learning, a strategy that depends also on the distant past outcome, was used (Fig. 2C,F). Therefore, in order to determine whether the decision strategy used truly depends on the distant past outcome, additional effort is needed. A natural way would be to examine whether the rate of reselection of the option chosen at two trials before differs depending on the outcome obtained there, separately for each case with a particular combination of choice (Stay or Switch) and outcome (good or bad) at the previous trial. If a decision strategy depending on the immediate, but not the distant, past outcome was used, the combination of the choice and outcome at the previous trial should completely determine the probability of the current choice, and so the reselection rate is expected to be the same regardless of the outcome at two trials before. In contrast, if a strategy depending on the distant past outcome was used, the reselection rates should differ depending on the outcome at two trials before even after sorting by the previous choice and outcome.
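A minimal sketch of this sorted analysis for a single choice sequence is given below (Python; the function name and the representation of choices and outcomes as parallel lists are our own assumptions, matching the hypothetical simulate sketch in the Methods):

```python
def sorted_reselection_rates(choices, outcomes):
    """For each combination of (Stay/Switch, good/bad) at trial k-1 and (good/bad) at trial k-2,
    return the rate of reselecting at trial k the option chosen at k-2 (None if no relevant trials)."""
    rates = {}
    for prev_stay in (True, False):              # Stay or Switch at k-1
        for prev_good in (True, False):          # good or bad outcome at k-1
            for back2_good in (True, False):     # good or bad outcome at k-2
                numerator = denominator = 0
                for k in range(2, len(choices)):
                    if ((choices[k - 1] == choices[k - 2]) == prev_stay
                            and outcomes[k - 1] == prev_good
                            and outcomes[k - 2] == back2_good):
                        denominator += 1
                        numerator += int(choices[k] == choices[k - 2])
                rates[(prev_stay, prev_good, back2_good)] = (
                    numerator / denominator if denominator else None)
    return rates
```

Averaging each rate across the simulations in which relevant trials existed (skipping the empty entries) yields bars like those in Fig. 3; as discussed below, it is exactly this per-simulation averaging that can introduce a subtle bias.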


Fig. 3. The rate of reselecting the same option as chosen at two trials before when good or bad outcome was obtained there, separately for each case sorted by the choice and outcome at the previous trial. k-th, (k-1)-th, and (k-2)-th trials indicate the current, previous, and two-back trials, respectively. The bars and error-bars indicate the average rates ± SEM across simulations in which relevant data existed.

Fig. 3 shows the reselection rates sorted by the previous choice and outcome, averaged across simulations, for each task and decision rule. As expected, the reselection rates differed depending on the outcome at two trials before in the cases with Rule 3 (reinforcement learning) (Fig. 3C,F). In contrast, in the cases with Rule 1 or 2, which do not depend on the distant past outcome, the rates of reselection took values similar to those set in the decision rules: 90 % or 10 % in Rule 1 (Fig. 3A,D) and 90 %, 50 %, or 10 % in Rule 2 (Fig. 3B,E), again conforming to the above expectation. However, looking more closely at these results, the rate of reselection appears to differ somewhat depending on the outcome at two trials before in the cases where the previous choice and outcome were Stay and bad or Switch and good (i.e., the middle two pairs of bars in Fig. 3A,B,D,E). In order to make this point more visible, focusing on the cases with Rule 1 (probabilistic Win-Stay-Lose-Switch), we plot the rate of "non-default" choice (i.e., Switch after Win or Stay after Lose) (Fig. 4A,B), instead of the rate of reselection of the option chosen at two trials before. According to Rule 1 (Fig. 1B), the rate of non-default choice is set to 10 %, and this rate is expected not to vary depending on the outcome at two trials before. In reality, however, this rate was on average less than 10 % in the cases where the previous choice and outcome and the outcome at two trials before were Stay, bad, and bad, or Switch, good, and good (the fourth and fifth bars in Fig. 4A,B), especially in Task 2.

3.2. Origin of the spurious dependence

What causes this spurious dependence of the choice on the outcome at two trials before? The cases where the rate of non-default choice was less than 10 % correspond to situations in which the same non-default choice, either Stay after Lose or Switch after Win, was repeated. It is as if the simulated subjects were particularly unwilling to make a non-default choice after having made the same non-default choice once. A more sober observation is that the number of occasions for repeated non-default choices during the task should be limited, given that there are only 180 trials in total and the probability of a non-default choice set in the rule is only 10 %. Indeed, trials contributing to the denominator of the rates shown in the fourth and fifth bars in Fig. 4B existed in only 9400 and 8376 of the 10,000 simulations, respectively, and when such trials existed, their average number was 3.2 and 2.2 trials, respectively. This small number of relevant trials seems potentially related to the observed spurious dependence. However, there are two other cases in which non-default choices were also repeated: the second and seventh bars in Fig. 4B. Trials contributing to the denominator of the rates shown in these bars existed in 8114 and 10,000 of the 10,000 simulations, respectively, and their average number, when they existed, was 2.0 and 11.3 trials, respectively. As such, the number of relevant trials was rather small in the case of the second bar in Fig. 4B, too, while the rate of non-default choice indicated by this bar was close to 10 %.


Fig. 4. Spurious dependence appears after sorting by the previous choice and outcome. (A,B) The rate of non-default choice (i.e., Switch after Win or Stay after Lose) in the case with Rule 1 (Win-Stay-Lose-Switch) and Task 1 (A) or Task 2 (B) when a good or bad outcome was obtained at two trials before, separately for each case sorted by the choice and outcome at the previous trial. The bars and error-bars indicate the average rates ± SEM across simulations in which relevant data existed. (C–E) Across-simulation relationship between the rate of non-default choice and the number of trials used for calculating the rate. (C), (D), and (E) correspond to the cases of the second, fourth, and fifth bars in (B). The size (area) of the circles indicates the number of simulations (there were 10,000 simulations in total).

Fig. 5. Increasing the number of trials dilutes the spurious dependence. The results of 10,000 simulations with 46,080 (= 180 × 2^8) trials are shown.

Thus, the smallness of the number of relevant trials cannot by itself explain the deviations of the rates from 10 % in the fourth and fifth bars. What, then, explains the deviations? A difference between the case of the second bar in Fig. 4B, which was close to 10 %, and the cases of the fourth and fifth bars, which were less than 10 %, is that, although a non-default choice was repeated in all of these cases, the same non-default choice (either Stay after Lose or Switch after Win) was repeated only in the latter cases, whereas different non-default choices occurred in succession (Stay after Lose followed by Switch after Win) in the former case. Let us consider a situation in which a simulated subject repeated the same non-default choice, for example Stay after Lose, twice and then obtained a bad outcome (i.e., Lose → Stay → Lose → Stay → Lose). Because this contains the sequence "Lose → Stay → Lose" twice, with an overlap at the middle "Lose", it contributes to the calculated rate shown in the fourth bar in Fig. 4B twice.


It then follows that if there are more repetitions of the same non-default choice, the number of relevant trials for calculating the rates in the fourth and fifth bars increases, and thereby there should be positive correlations, across simulations, between the number of relevant trials and the calculated rate in each simulation in the cases of the fourth and fifth bars. We examined this, and found that positive correlations indeed exist, though they are not strong and not well described as linear, in the cases of the fourth and fifth bars (r = 0.29, p = 1.7 × 10^−185 (Fig. 4D) and r = 0.19, p = 2.9 × 10^−72 (Fig. 4E), respectively) (such very small p values are reasonable given the very large sample sizes), but not in the case of the second bar (r = 0.01, p = 0.21 (Fig. 4C)). Through averaging across simulations, such a positive correlation results in (a sort of) less frequent sampling of high rates than of low rates, and can thereby explain why the average rates were smaller than the theoretically expected value (10 %). Let us see how this mechanism operates by using an extreme example, which is independent of the simulated tasks and rules that we have so far examined but captures an aspect of the specific situation described in the previous paragraph; a small numerical sketch is also given below. Consider a task consisting of 180 trials that was simulated 100 times. In 50 out of these 100 simulations, 8 (out of 180) trials were of a particular type (referred to as "type-A" trials), and a particular event (referred to as "event B") occurred at 1 of those 8 trials, whereas in the remaining 50 simulations, 2 (out of 180) trials were type-A, and event B never occurred in those 2 type-A trials (here we implicitly assume that event B occurs once whenever there are 8 type-A trials within a single task simulation, whereas it never occurs when there are only 2 type-A trials). Now consider calculating the rate of occurrence of event B within type-A trials. If we first count the numbers of type-A trials and event-B occurrences across all 100 simulations and then divide the latter by the former, the rate is calculated as (1 × 50 + 0 × 50) / (8 × 50 + 2 × 50) = 0.1, i.e., 10 %. In contrast, if we first calculate the rate in each of the 100 simulations and then take an average across simulations, the result is ((1/8) × 50 + (0/2) × 50) / 100 = 0.0625, i.e., 6.25 %. As such, different ways of calculating the rate give different values, and, returning to our original point, this is considered to underlie the spurious dependence of the choice on the outcome at two trials before that appeared after sorting by the previous choice and outcome.
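The toy calculation above can be written out explicitly; this self-contained snippet (our own illustration, using the numbers given in the text) contrasts the two ways of computing the rate:

```python
# 100 simulated tasks: in 50 of them there are 8 "type-A" trials with one "event B";
# in the other 50 there are 2 type-A trials and no event B.
n_typeA = [8] * 50 + [2] * 50
n_eventB = [1] * 50 + [0] * 50

# Pooled rate: count everything first, divide once.
pooled = sum(n_eventB) / sum(n_typeA)                                    # (1*50 + 0*50) / (8*50 + 2*50) = 0.10

# Per-simulation rates averaged afterwards (as done across subjects/simulations).
averaged = sum(b / a for a, b in zip(n_typeA, n_eventB)) / len(n_typeA)  # ((1/8)*50 + 0*50) / 100 = 0.0625

print(pooled, averaged)   # 0.1 0.0625
```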

4. Discussion

In order to reveal the mechanisms of decision making, it is important to identify the underlying decision strategy among various candidates, such as WSLS, probabilistic selection, or RL. However, even if choices are generated according to a strategy based solely on the immediate past outcome, spurious dependence of the choice on the distant past outcome can appear. We have introduced three types of such spurious dependence in the previous sections. The first two types are caused by an inappropriate way of sorting the sequential choice data. If we sort the k-th choices (same option or not) according only to the outcome of the (k-2)-th trial (good or bad), asymmetry in the choice strategy, as well as in the task rule, causes spurious dependence (Fig. 2). If we sort the k-th choices according also to the choice and outcome of the (k-1)-th trial, most of the spurious dependence disappears.

Nevertheless, some, albeit little, dependence would still remain, as shown in Figs. 3 and 4. Based on our analyses, the fundamental reason for this remaining spurious dependence is the small occurrence rate of the events of interest, coupled with the calculation of the rate by averaging the rates of individual task simulations. Specifically, with the pWSLS strategy and under a situation with a limited number of trials such as 180, the number of relevant trials used as the denominator in the calculation of the choice rate is quite small (centered around 2 or 3 trials out of 180, as shown in Fig. 4C–E). Such smallness makes the individual estimates of the choice rate unreliable, and when an average is taken across simulations (i.e., across subjects in the case of real experimental data), the issue of disparity in the relative weight of "one vote" becomes apparent. Increasing the number of relevant trials should dilute this issue, as shown in our simulations with 46,080 (= 180 × 2^8) trials in Fig. 5, although conducting experiments with such a huge number of trials in humans would be impractical.

Sorting the trials according to the choices and outcomes in the previous trials is a simple, intuitive way to analyze experimental data on sequential choices. However, it carries a potential bias, as we have shown in this paper. In a somewhat similar vein, though in a different field, it has been pointed out that the so-called instantaneous firing frequency, which is defined as the reciprocal of the current inter-spike interval and is a simple, intuitive measure calculated from experimental spiking data, is always higher than the firing rate in the usual sense; what underlies this is the inequality between the arithmetic and harmonic means (Lánský et al., 2004). Although situations in which the potential bias that we have described becomes practically large may be rather limited, we think it is still good to keep in mind the possibility of such bias in simple, intuitive analyses.
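As a small numerical illustration of that inequality (our own toy example, not data from Lánský et al., 2004):

```python
import numpy as np

isi = np.array([0.05, 0.20, 0.05, 0.20])      # toy inter-spike intervals, in seconds

firing_rate = 1.0 / isi.mean()                # rate in the usual sense (reciprocal of the mean interval): 8.0 Hz
mean_instantaneous = (1.0 / isi).mean()       # average of reciprocal intervals: 12.5 Hz

# The arithmetic mean of reciprocals is never smaller than the reciprocal of the arithmetic mean
# (arithmetic vs. harmonic mean), so the mean "instantaneous frequency" overestimates the rate.
print(firing_rate, mean_instantaneous)
```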

Funding

This work was supported by the Grant-in-Aid for Scientific Research (No. 15H05876) of the Ministry of Education, Culture, Sports, Science and Technology in Japan to K.M. and A.M.N. and by the Research Fellowships for Young Scientists from the Japan Society for the Promotion of Science (JSPS) to A.M.N.

References

Katahira, K., 2015. The relation between reinforcement learning parameters and the influence of reinforcement history on choice behavior. J. Math. Psychol. 66, 59–69.
Lánský, P., Rodriguez, R., Sacerdote, L., 2004. Mean instantaneous firing frequency is always higher than the firing rate. Neural Comput. 16, 477–489.
Nagase, A.M., Onoda, K., Foo, J.C., Haji, T., Akaishi, R., Yamaguchi, S., Sakai, K., Morita, K., 2018. Neural mechanisms for adaptive learned avoidance of mental effort. J. Neurosci. 38, 2631–2651.
Samejima, K., Ueda, Y., Doya, K., Kimura, M., 2005. Representation of action-specific reward values in the striatum. Science 310, 1337–1340.
Seymour, B., O'Doherty, J.P., Dayan, P., Koltzenburg, M., Jones, A.K., Dolan, R.J., Friston, K.J., Frackowiak, R.S., 2004. Temporal difference models describe higher-order learning in humans. Nature 429, 664–667.
Skvortsova, V., Palminteri, S., Pessiglione, M., 2014. Learning to minimize efforts versus maximizing rewards: computational principles and neural correlates. J. Neurosci. 34, 15621–15630.