ABSTRACT

Horner, J. M. and Staddon, J. E. R., 1987. Probabilistic choice: A simple invariance. Behav. Processes, 15: 59-92.

When subjects must choose repeatedly between two or more alternatives, each of which dispenses reward on a probabilistic basis (two-armed bandit), their behavior is guided by the two possible outcomes, reward and nonreward. The simplest stochastic choice rule is that the probability of choosing an alternative increases following a reward and decreases following a nonreward (reward following). We show experimentally and theoretically that animal subjects behave as if the absolute magnitudes of the changes in choice probability caused by reward and nonreward do not depend on the response which produced the reward or nonreward (source independence), and that the effects of reward and nonreward are in constant ratio under fixed conditions (effect-ratio invariance), properties that fit the definition of satisficing. Our experimental results are either not predicted by, or are inconsistent with, other theories of free-operant choice such as Bush-Mosteller, molar maximization, momentary maximizing, and melioration (matching).

Key words: satisficing, stochastic equilibrium model, ratio schedules, effect-ratio analysis, pigeon, Bush-Mosteller, invariance, reward
INTRODUCTION

Understanding how reward and nonreward affect behavior is one of the oldest and most recurrent problems in psychology. A simple way to study it is the symmetrical two-choice situation, in which the subject chooses repeatedly between two identical responses, each of which is rewarded probabilistically. In such a situation the information available to the subject boils down to two parts: the outcome of each choice, reward or nonreward, and the rule by which each outcome changes the subject's future choices. The process that sets the reward probabilities may be quite complicated, but the rule linking each outcome to future choices can nevertheless be simple, and it is this rule that we would like to discover. A theory of free-operant choice must therefore specify two things: the property of reward to which the organism is presumed to be sensitive, and the rule by which each outcome changes behavior.

Two main kinds of theories have been proposed, with rather different emphases. Theories of the first kind take some property of reward, such as rate, probability or value, as a guiding variable and assume that the animal acts so as to maximize it. Many molar optimality treatments of free-operant behavior have been widely discussed in recent years; they assume that animals act so as to maximize overall reward rate, computed over a time window that may be the whole session or some shorter period (e.g., Rachlin, Green, Kagel & Battalio, 1976; Staddon & Motheral, 1978; Rachlin & Burkhard, 1978; Staddon, 1980). Momentary maximizing assumes that animals are sensitive to local (momentary) reward probability and always choose the alternative for which the momentary probability of reward is higher (e.g., Shimp, 1966, 1969; Silberberg, Hamilton, Ziriax & Casey, 1978; Hinson & Staddon, 1983a, 1983b). Melioration, the mechanism proposed as responsible for matching, assumes that animals are sensitive to the local reward rates associated with the alternatives and always tend to choose the alternative with the higher local rate, thus acting to equalize the local reward rates (Herrnstein & Vaughan, 1980; Vaughan, 1981). Theories of the second kind are kinetic, or process, theories rather than optimality theories: stochastic learning models, proposed by Bush and Mosteller (1955), and their many derivatives (e.g., Sternberg, 1963; Rescorla & Wagner, 1972; Killeen, 1982; Daly & Daly, 1982; Myerson & Miezin, 1980), assume that each reward and nonreward acts directly on the probability of the response that produced it. Such models say nothing about optimality; they describe only the behavior-change process.

Matching, the equality of response ratios (responses Right/responses Left) and obtained reward ratios, is the result that most of these theories have been devoted to explaining. It is reliably found when pigeons respond on concurrent variable-interval variable-interval (conc VI VI) schedules, and melioration, molar maximizing and momentary maximizing have all been proposed as accounts of it. But matching by itself cannot carry all this weight. Deviations from matching occur under many conditions and are often systematic: for example, when the scheduled VI values are extreme, either both small (high scheduled reward rates) or both large (low scheduled reward rates), preferences are often more, or less, extreme than matching requires (cf. Myers & Myers, 1977). Matching is also an aggregative property of the data: quite different local processes can yield it, so that matching by itself gives little guidance about the mechanism responsible for individual choices. And on other procedures, such as concurrent variable-interval variable-ratio schedules (cf. Herrnstein & Heyman, 1979), spaced-responding schedules, and schedules involving probabilistic reward, the matching, melioration and maximizing accounts make different predictions, and none of them accounts for all the results.

The approach we develop in this paper is in the stochastic-learning-model tradition, in that we assume that reward and nonreward act directly on response probability. We are less concerned with the aggregate property of the data that emerges, matching, maximizing or something else, than with the simple rule by which each outcome changes the animal's subsequent choices.
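As a rough illustration of the linear-operator idea behind Bush-Mosteller-style models, here is a minimal sketch in Python; the update form is the generic textbook one and the learning-rate values are arbitrary, not a parameterization taken from this paper or from any of the models cited above.

```python
# Sketch of a Bush-Mosteller-style linear operator for two-choice learning.
# s is the probability of choosing Right; alpha and beta are illustrative
# learning rates, not values fitted to any data discussed in this paper.

def linear_operator_update(s, chose_right, rewarded, alpha=0.10, beta=0.05):
    """Return the updated probability of choosing Right.

    A reward moves s a fixed fraction (alpha) toward the rewarded side;
    a nonreward moves s a fixed fraction (beta) away from the chosen side.
    """
    if rewarded:
        target = 1.0 if chose_right else 0.0
        return s + alpha * (target - s)
    target = 0.0 if chose_right else 1.0
    return s + beta * (target - s)
```

Variants of this scheme differ mainly in how the increments and decrements are allowed to depend on the current choice probability and on which response produced the outcome.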
In exploratory experiments in our laboratory (Horner, 1986) we looked at free-operant choice performance under conditions in which absolute reward probabilities, reward rates and other variables were systematically varied: on single-key schedules, on schedules that involve a choice between a simple probabilistic alternative and an adjusting alternative, and on choice between two identical probabilistic alternatives. In several of these experiments, results turned up that do not seem explicable by any of the current theories of choice: molar maximizing, momentary maximizing, or melioration (matching). All of these theories assume that the animal is sensitive to relative reward rate or relative reward probability. When these relative variables are the same for both choices, each theory either predicts indifference or offers no prediction at all, whatever the absolute values; yet in our experiments performance often appeared to vary systematically with the absolute value of reward probability.

The simplest situation of this kind is choice between two identical probabilistic alternatives: every peck on the right key is rewarded with probability p, every peck on the left key with probability q, and p = q. This is the procedure termed a two-armed bandit in the decision-theory literature; in the terminology of operant conditioning it is a concurrent random-ratio (variable-ratio) schedule. Our first illustration comes from an experiment of this type.
In one experiment, four pigeons were repeatedly exposed to a choice between two such alternatives, with the same reward probability p scheduled on both keys. The value of p was varied across conditions in an ABA sequence: p = 1/75 (days 1-3), p = 1/20 (days 4-8), and p = 1/75 again (days 9-10). Sessions terminated after the 48th food delivery. Figure 1 shows each animal's preference, the proportion of right-key choices (R/(R+L)), plotted across daily sessions for all four pigeons. When p = 1/75, every animal showed close to exclusive preference for one or the other alternative. When p = 1/20, preferences shifted toward indifference and each animal showed only a partial preference; and, as the figure shows, the extreme preferences were recoverable across days after the p = 1/20 condition.

Note that when p and q are equal, none of the standard accounts of free-operant choice offers any guidance about the preference to be expected. If x and y are the response rates on the two keys, the obtained reward rates are just R(x) = px and R(y) = py, so that the ratio of reward rates, R(x)/R(y) = px/py = x/y, is equal to the ratio of response rates for any pattern of choices: matching is therefore consistent with any preference, exclusive or partial, and no particular preference is forced by this procedure. Since the scheduled reward probability is the same for both responses and independent of the animal's behavior, overall reward rate is also independent of preference, so molar maximizing is silent. And momentary maximizing, which prescribes choosing the higher-probability-of-reward alternative, prescribes neither alternative (or picking at random) when the two momentary probabilities are identical.
FIGURE 1. The effect of absolute reward probability on choice between identical probabilistic alternatives ("two-armed bandit"). The figure plots the proportion of choices of the right-hand alternative, s, across daily sessions for each animal, for two conditions, p = 1/75 and p = 1/20, in ABA sequence. Open squares, Bird 096; closed diamonds, Bird 145; closed squares, Bird 151; and closed triangles, Bird 156.
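The point that matching imposes no constraint when the two reward probabilities are equal is easy to check with a short simulation (a sketch only; the probability, bias and trial count below are arbitrary values, not parameters from the experiment): whatever fixed preference the simulated subject adopts, the obtained reward ratio tracks the response ratio.

```python
import random

def bandit_session(p=1/20, bias=0.8, trials=5000, seed=1):
    """Two-armed bandit with equal reward probability p on both keys and an
    arbitrary fixed bias toward the Right key; returns response and reward ratios."""
    rng = random.Random(seed)
    right = left = rew_right = rew_left = 0
    for _ in range(trials):
        if rng.random() < bias:               # choose Right with probability `bias`
            right += 1
            rew_right += rng.random() < p     # reward delivered with probability p
        else:
            left += 1
            rew_left += rng.random() < p
    return right / left, rew_right / rew_left

# Response ratio and obtained reward ratio come out (approximately) equal for
# any value of `bias`, so matching is consistent with any preference here.
print(bandit_session())
```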
The result shown in Figure 1 is an example of the commonest pattern we have seen with this procedure. In various replications we have studied a variety of p-values, equal and unequal, and the full range of results is more complicated than the simple summary we are going to give: animals will sometimes switch from one side to the other, or will persist on one side for some time and then switch, and we have not been able to isolate all the conditions under which each pattern occurs (cf. Allison, 1983). But the majority result is clear from the figure: when both reward probabilities are low (reward infrequent), pigeons fixate, persistently choosing one or the other alternative almost exclusively; when both probabilities are high, exclusive choice gives way to partial preference or indifference.

Neither matching (melioration), nor molar maximizing, nor momentary maximizing can accommodate this pattern. All are sensitive only to relative reward rates or reward probabilities, which are equal here; hence each either predicts indifference or makes no prediction at all, and none predicts that preference should depend on the absolute value of the reward probabilities. It was our reflections on these results that led us to a theory of probabilistic choice built on what we call reward following and its property of ratio-invariance. We first describe the theory and isolate the assumptions that are necessary and sufficient for exclusive choice and for indifference, with equiprobable and with unequal (p ≠ q) reward probabilities, and then derive predictions that can be tested with standard probabilistic choice procedures.
The first experiment tests predictions for an equal-probability (symmetrical) interdependent probabilistic schedule. The second experiment tests these predictions for an asymmetrical interdependent schedule.
RATIO-INVARIANCE

Our preliminary results with equal-reward-probability choice rule out both local reward rate and local reward probability as the guiding variables of choice. Accounts such as melioration and momentary maximizing might nevertheless be modified to fit these data, by adding averaging windows and the like. We derive our predictions instead from the most primitive assumption of the law of effect, namely that a rewarded choice increases the probability of that choice while an unrewarded choice decreases this probability. The idea seems intuitively obvious; it needs, however, to be made explicit. A simple version that is consistent with our initial assumption, and with standard stochastic learning models, is this: the increment in choice probability owing to each reward is a quantity a(s) that will in general be some function of the current choice proportion s, and the decrement owing to each nonreward is a quantity b(s). We give an informal construction here; a more formal development is given in Appendix A.

Consider responding allocated in proportions s and 1-s to the Right and Left keys, respectively, and let the reward probabilities on Right and Left be p and q, respectively. There are four possible choice-outcome combinations. The expected contributions to delta(s), the expected change in choice proportion, of the four possible outcomes are as follows:

Reward on R:       a(s)ps            (1a)
Reward on L:       -a(s)q(1-s)       (1b)
Nonreward on R:    -b(s)(1-p)s       (1c)
Nonreward on L:    b(s)(1-q)(1-s)    (1d)
Note that s, the proportion of R (right) choices, is incremented by both reward on R and nonreward on L, and decremented by the other two possibilities. In words, Eq. 1a says that the expected contribution to delta(s) of a rewarded R response is equal to the probability of such a co-occurrence, ps, multiplied by the change associated with a rewarded R response, a(s); and similarly for the other equations. The expected change in choice proportion, delta(s), is just the sum of the terms associated with the four possible outcomes, Eqs. 1a-d. Summing these four terms yields:

delta(s) = a(s)ps - a(s)q(1-s) - b(s)(1-p)s + b(s)(1-q)(1-s),        (2a)

which yields after rearrangement the following expression for delta(s) as a function of s, p, q, a(s) and b(s):

delta(s) = s[(p+q)(a(s)+b(s)) - 2b(s)] + b(s) - q[a(s)+b(s)].
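To make the expected-change calculation concrete, the sketch below evaluates Eq. 2a for the simplest special case of constant increments and decrements (a(s) = a, b(s) = b; all numerical values are illustrative only, not estimates from the data) and checks it against a direct Monte Carlo average of the per-trial changes defined by Eqs. 1a-d.

```python
import random

def delta_expected(s, p, q, a, b):
    """Eq. 2a with constant a(s) = a and b(s) = b: expected one-trial change in s."""
    return s * ((p + q) * (a + b) - 2 * b) + b - q * (a + b)

def delta_simulated(s, p, q, a, b, n=200_000, seed=0):
    """Monte Carlo average of the per-trial changes defined by Eqs. 1a-d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        if rng.random() < s:                          # Right response (probability s)
            total += a if rng.random() < p else -b    # reward on R: +a; nonreward on R: -b
        else:                                         # Left response (probability 1-s)
            total += -a if rng.random() < q else b    # reward on L: -a; nonreward on L: +b
    return total / n

s, p, q, a, b = 0.6, 1/20, 1/20, 0.02, 0.01           # illustrative values only
print(delta_expected(s, p, q, a, b))                  # -0.0017
print(delta_simulated(s, p, q, a, b))                 # should agree closely
```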
Another way to think of Eq. 2a is in terms of the absolute