LEARNING AND MOTIVATION 4, 237-246 (1973)

Delayed Reward Learning: Disproof of the Traditional Theory¹

BOW TONG LETT

Memorial University of Newfoundland
Without the aid of secondary rewards to bridge the temporal gap, each of 15 rats learned to select the rewarded side of a T-maze although the reward was delayed until 1 min after the response was emitted. Similar results were obtained from another group of eight rats for which the length of the delay was 5 min. In a final experiment using the same basic procedure, five groups of rats were trained for 25 days with delays of 0.5, 1.0, 2.0, 4.0, or 8.0 min. The percentage of correct responses did not significantly differ among groups. According to prevailing psychological theory, these results are impossible.
Garcia (Revusky & Garcia, 1970) has shown that associations readily occur between events separated by as much as an hour or more. In those experiments, rats were made to consume a flavored substance; some time later, toxicosis was induced by independent means such as injection of toxins or exposure to X-irradiation. After recovery, the rats had aversions to the flavored substance. These aversions must be learned because they are not obtained with any of the usual control procedures (e.g., presenting the flavored substance alone, producing toxicosis alone, and doing both but on separate days). Moreover, the traditional parametric manipulations yield functions similar to those obtained from other learning situations except, of course, that the delay gradient is measured in hours rather than seconds (Revusky & Garcia, 1970). Since mediating events such as lingering aftertastes have been excluded as means of bridging the temporal gap (Revusky & Garcia, 1970; Rozin & Kalat, 1971), these findings of learned flavor aversions severely challenge the principle that temporal contiguity is necessary for learning.

¹ This research was supported in part by Grant MH 16643 from the National Institute of Mental Health to S. Revusky while the author was a research associate at Northern Illinois University. S. Revusky suggested the procedure of rewarding the rats in the start box rather than in a separate goal box and helped extensively in writing this paper. W. Hershberger suggested the truncation of the usual T-maze. R. Brenart, S. Dwyer, R. Greenwood, S. Radtke, K. Rusiniak, P. Weinhold, and M. Wolfe performed conscientiously and well in running the rats. Requests for reprints should be sent to the author, Department of Psychology, Memorial University of Newfoundland, St. John's, Newfoundland, Canada.

Copyright © 1973 by Academic Press, Inc. All rights of reproduction in any form reserved.
In more conventional situations, it has been demonstrated that rats can associate a stimulus that is presented on one trial with a response emitted on the next trial even when these events are separated by long intertrial intervals lasting hours. For example, rats were trained to run down a straight alley to a goal box that contained food only on alternate trials (Tyler, Wortz, & Bitterman, 1953; Capaldi, 1967). The rats learned to run slowly on unrewarded trials and quickly on rewarded trials, indicating that the presence or absence of food at the end of one trial can function as a stimulus for fast or slow running on the next trial. These intertrial associations are not limited to situations involving alternation of presence and absence of reward. Pschirrer (1972) has shown that the particular type of reward (i.e., chow pellets or milk) received at the end of one trial can be the stimulus for fast or slow running on the next trial with an intertrial interval of 15-20 min. He also extended his findings to include the control of qualitatively different responses such as running to the right or left in a T-maze. For instance, rats learned to run left after receiving a milk reward at the end of the preceding trial and right after a pellet reward with an intertrial interval of 3 min. Finally, Petrinovich and Bolles (1957) have shown that the rat's own behavior on the preceding trial can function as the stimulus for which of two directions to go in a T-maze on the next trial.

Since learning involving substantial delay intervals is not limited to a particular situation, class of stimuli, or responses, Revusky and Garcia (1970) and Revusky (1971) have argued that the associative processes underlying long-delay learning are the same as those underlying the more traditional short-delay paradigms. If so, it follows that temporal contiguity of the events to be associated is not intrinsically necessary for learning to occur.
Revusky (1971) explained the usually deleterious effects of a delay as follows. Increasing the interval of time between event A and event B increases the number of intervening events. An increase in the number of intervening events increases the probability that A or B or both will become associated with one of these extraneous events. Such extraneous associations have been shown to interfere with the association of A with B (Honig, 1970; Kamin, 1969; Revusky, 1971; Wagner, 1969). Thus, learned flavor aversions are possible only because the many events occurring during a long delay, such as the animal's own movements and the sights and sounds of the environment, cannot become strongly associated with either the ingested flavor or with toxicosis (Braveman & Capretta, 1965; Garcia & Koelling, 1966; Garcia, McGowan, Ervin, & Koelling, 1968). In extending his interference analysis to the learning of intertrial associations, Revusky (1971) argued that the many events occurring during the intertrial interval must not have become strongly
associated with the experimentally defined stimulus and response; otherwise no learning would have occurred. Although there is no direct empirical evidence delineating the mechanism, he conjectured that these events do not interfere because they occur outside the experimental apparatus in the home cage, whereas the association under observation is between a stimulus and a response occurring inside the apparatus.

The present study used this interference analysis to obtain learning of a simple position habit with relatively long-delayed reward. Since it is no longer tenable to suppose that reward makes learning occur (Bolles, 1972), a reward was considered as an event that increases the probability of the response with which it becomes associated. Presumably then, the same rules of association account for the detrimental effects on learning of both an enforced delay between a stimulus and response and a delay between a response and reward. Thus, a delay of reward reduces the likelihood of an association between the response and reward by increasing the number of intervening events and hence the probability that they will become associated with either the response or the reward. If so, learning of a position habit with relatively long-delayed reward should be possible if the potential for interfering associations is minimized.

A means of minimizing interfering associations was extrapolated from the experiments on intertrial associations. If removing the animal from the experimental apparatus permitted learning when there was a delay between the stimulus and response, it should also permit learning when the delay is between the response and reward. In the present experiments, the rat was immediately removed after it chose one side of a T-maze. Whether the choice was correct or incorrect, the rat was placed in its home cage to spend the delay. The reward for a correct response was administered when the rat was returned from its home cage to the apparatus after the delay.
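The interference analysis can be reduced to a toy probability model. The sketch below is purely illustrative and is not part of the original analysis: it assumes that each extraneous event during the delay independently "captures" the response or the reward with some fixed probability p, so that the chance the response-reward association survives is (1 - p)^n, where n is the number of associable intervening events. If the delay is spent in the home cage, where events are argued not to compete, n stays constant and the model predicts no effect of delay length.

```python
# Toy model of the interference analysis (illustrative only; the capture
# probability p and the event counts are assumptions, not data from the paper).

def association_survival(n_events: int, p_capture: float) -> float:
    """Probability that the response-reward association forms, assuming each
    of n_events intervening events independently captures an association
    with probability p_capture."""
    return (1.0 - p_capture) ** n_events

p = 0.05  # hypothetical per-event capture probability

# Delay spent in the apparatus: a longer delay means more associable events.
for delay_min, n in [(0.5, 5), (1, 10), (4, 40), (8, 80)]:
    print(f"{delay_min} min in apparatus: P(assoc) = {association_survival(n, p):.3f}")

# Delay spent in the home cage: extraneous events there are assumed not to
# compete, so n is constant and the prediction is flat across delay lengths.
for delay_min in (0.5, 1, 4, 8):
    print(f"{delay_min} min in home cage: P(assoc) = {association_survival(5, p):.3f}")
```

The first loop reproduces the usual delay-of-reward gradient; the second reproduces the flat function that Experiment 4 below was designed to test.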
EXPERIMENTS 1, 2, AND 3
Method

A total of 23 adult male Wistar rats, weighing an average of 410 g at the start of the experiment, were gradually reduced to 75% of their free-feeding weights. There were seven rats in Experiment 1 and eight each in Experiments 2 and 3. They were maintained in polyethylene home cages with water continuously available. The cages were in the same room as the experimental apparatus. In each of the three experiments, half the rats were trained to run to the right chamber of the T-maze shown in Fig. 1 and half were trained to run to the left chamber. No pretraining experience was given in the
FIG. 1. The experimental apparatus was a truncated T-maze. On either side of a gray wooden start box were remotely controlled, vertically sliding, transparent doors that led to different choice chambers. Although the task was a position habit, the choice chambers were made distinctively different to facilitate learning. The left chamber was narrow with white plastic walls; the right chamber was wide with black wooden walls.
apparatus. At the start of each trial, the rat was removed from its home cage and placed head first through the front door of the start box. Ten seconds later the side doors were simultaneously raised. After the rat went into one of the choice chambers, the sliding doors were lowered and the rat was removed and placed in its home cage for the appropriate delay interval. After the delay interval, the rat was returned to the start box. If it had previously made the correct choice, a food cup containing wet mash (40% by weight of Purina rat chow) was present. The rat was allowed enough time to eat its full daily ration of 25-30 g of wet mash. If the choice had been incorrect, the start box was empty and the training procedure was repeated. That is, 10 sec later, both doors were raised; the rat made another choice followed by another delay interval in the home cage. Thus, each rat was given one rewarded trial per day preceded by an indeterminate number of unrewarded trials, with nothing to distinguish a correct run from an incorrect run until after the delay interval.

Throughout Experiment 1, the delay was 1 min. During Days 1-60, the rats were trained to select one side of the maze; and during Days 61-130, they were similarly trained to select the opposite side.

In Experiment 2, Days 1-40 were a replication of Experiment 1. During Days 41-50, the procedure was modified to assess the possibility that an experimenter was more likely to close the door to a choice chamber when the rat was partly in the correct chamber than when it was partly in the incorrect chamber, since such a bias could produce a spurious learning curve. Under extinction conditions, the rats received one and only one trial per day, which was administered by a naive experimenter who was not informed as to which side was correct. Since the rats were not fed in the start box during this phase, a maintenance diet of several chow pellets was given to them in the home cages an hour after the trial was completed.
In Experiment 3, the rats were trained for 50 days with a 5-min
delay of reward using the same acquisition procedure of one reinforced trial per day as in Experiments 1 and 2. At the beginning of training, there were instances in which a rat balked and did not go into a choice chamber sufficiently far to allow door closure. When this happened, the rat was guided into the correct chamber after some fixed amount of time. In Experiments 1 and 2, a rat was allowed 2000 sec on the first trial of the day, 1500 sec on the second trial, and thereafter 1000 sec. In Experiment 3, a rat was allowed 1500 sec on the first trial, 1200 sec on the second trial, and 1000 sec thereafter. The rat was then treated as if it had made the correct choice and was fed in the start box after the appropriate delay.

In the statistical analysis, only the response on the first trial of the day was considered because later correct responses might be due to spontaneous alternation tendencies (Estes & Schoeffler, 1955); guided responses were ignored. The mean percentage of correct responses during each block of 10 days was computed and used in a one-way analysis of variance with repeated measures (Winer, 1962) to test for learning.

Results

In Experiment 1, the delay was 1 min throughout acquisition and reversal learning. Figure 2 shows the original learning (F(5,30) = 10.83; p < .001) and the reversal learning (F(6,36) = 11.63; p < .001). Days 1-40 of Experiment 2 were a replication of Experiment 1 and yielded similar results: 56% correct responses during Block 1, 81% during Block 2, 95% during Block 3, and 90% during Block 4 (F(3,21) = 13.72; p < .001). During Days 41-50, the rats were given one unreinforced trial a day administered by a naive experimenter. The result was 86% correct choices, which differed significantly from a chance level of 50% (t(7) = 7.20; p < .001); thus, it seems unlikely that the learning curve was an artifact of experimenter bias.
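The test against chance in the bias-control phase is an ordinary one-sample t test on the eight rats' individual percentages of correct choices. The fragment below shows the computation in modern form; the per-rat scores are hypothetical stand-ins (the paper reports only the 86% group mean and t(7) = 7.20), so the resulting t value is illustrative, not a reproduction of the published one.

```python
# One-sample t test of percent-correct against the 50% chance level.
# The per-rat scores are hypothetical; only the group mean and the t value
# appear in the paper.
import math

scores = [80, 90, 90, 80, 90, 90, 80, 88]  # hypothetical per-rat % correct
n = len(scores)
mean = sum(scores) / n
var = sum((x - mean) ** 2 for x in scores) / (n - 1)  # unbiased variance
t = (mean - 50.0) / math.sqrt(var / n)                # df = n - 1 = 7
print(f"mean = {mean:.1f}%, t({n - 1}) = {t:.2f}")
```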
FIG. 2. Percentage of correct responses for blocks of 10 days during original learning (left of vertical line) and reversal training (right side) with a 1-min delay of reward. The broken vertical lines indicate the range for individual rats.
FIG. 3. Percentage of correct responses for blocks of 10 days with a 5-min delay of reward. The broken vertical lines indicate the range for individual rats.
In Experiment 3 (Fig. 3), learning occurred with a 5-min delay of reward (F(4,28) = 6.33; p < .001). Performance was poorer than in the earlier experiments, but because the experiments were run at different times with slightly different procedures, this does not necessarily imply the existence of a delay-of-reward gradient measurable in minutes.

EXPERIMENT 4
According to the interference approach, a delay of reward affects learning only insofar as it contains extraneous events that may become associated with the response or reward, thereby decreasing the probability of an association between the response and reward. An increase in the length of the delay increases the number of extraneous events and hence the probability of interfering associations. However, if the delay were to be varied without also changing the number of associable extraneous events, variations in the length of the delay would not affect learning. The purpose of this experiment was to test this hypothesis. Using the same basic procedure that presumably minimizes associative interference during the delay, five groups of rats were trained with a delay of 0.5, 1.0, 2.0, 4.0, or 8.0 min.

Method

Sixty male Wistar rats weighing an average of 265 g were reduced to approximately 80% of their normal weights. Since the results of the previous experiments indicated that substantial learning occurred by the end of 20 days, the rats were given only 25 days of training. Half the rats, approximately 60-70 days old at the start of training, were divided into five equal groups and trained for 25 days; then the remaining animals were run so that they were 25 days older than the first half at the beginning of training. During the course of the experiment, one rat died; four others (three from the 2.0-min delay group and one from the 4.0-min group) were eliminated because they failed to make at least three voluntary choices during the final 10 days of training.
The same basic procedure was used, except that the rules for guided trials were modified. Rats were guided to a choice chamber after 300 sec. On the first guided trial of a day, regardless of what number trial of the day it was, the rat was guided to the incorrect side, and on the second guided trial the rat was guided to the correct side. After a guided response the rat was treated exactly as it would have been after a voluntary choice. The rat was removed to the home cage for the appropriate interval of time and then replaced in the start box. After an incorrect choice, guided or voluntary, the rat was given an opportunity to make another choice response; after a correct choice, guided or voluntary, the rat was fed its entire daily ration of food. The rats were usually given enough time to eat approximately 25-30 g of wet mash (40% by weight of ground rat chow). After a guided trial, however, the time was slightly reduced so that the rats ate approximately 20-25 g of wet mash in the hope of facilitating performance on the next day. Using only the first response of the day, the percentage of correct responses was computed for each rat during each block of 5 days. Guided responses were ignored. An analysis of variance was used to assess the effects of replications, blocks, and length of delay.
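The analysis of variance named above can be illustrated with a stripped-down example. The sketch below computes the F ratio for a one-way repeated-measures layout over blocks only; the scores are hypothetical and the design is simplified from the paper's full replications x delay x blocks analysis.

```python
# One-way repeated-measures ANOVA over blocks (illustrative; the data are
# hypothetical and the design is simplified from the paper's full analysis).
# Rows = subjects, columns = blocks of days; entries = % correct.
data = [
    [50, 60, 70, 80, 85],
    [55, 65, 75, 80, 90],
    [45, 60, 65, 75, 85],
    [50, 55, 70, 85, 90],
]
n = len(data)          # subjects
b = len(data[0])       # blocks
grand = sum(sum(row) for row in data) / (n * b)

# Partition the total sum of squares into subjects, blocks, and residual error.
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_subj = b * sum((sum(row) / b - grand) ** 2 for row in data)
block_means = [sum(row[j] for row in data) / n for j in range(b)]
ss_block = n * sum((m - grand) ** 2 for m in block_means)
ss_err = ss_total - ss_subj - ss_block

df_block, df_err = b - 1, (n - 1) * (b - 1)
F = (ss_block / df_block) / (ss_err / df_err)
print(f"F({df_block},{df_err}) = {F:.2f}")
```

Removing the subjects sum of squares from the error term is what makes the test "repeated measures": each rat serves as its own control across blocks.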
Results

Since there was no significant effect of replications, the data were pooled; Fig. 4 shows the percentage correct during each block of 5 days for each delay group. Inspection of Fig. 4 suggests that there were no marked differences among groups, and the statistical analysis supported this conclusion. The only statistically significant effect was that percentage correct increased over blocks (F(4,180) = 8.23; p < .001). Thus, it would appear that the length of the delay accounts for surprisingly little variance.

DISCUSSION
According to the generally accepted theory of delayed reward (Spence, 1947), learning cannot occur unless the reward follows the response
FIG. 4. Percentage of correct responses during blocks of 5 days for Ss receiving a delay of reward of 0.5, 1.0, 2.0, 4.0, or 8.0 min.
nearly immediately. All apparent exceptions are attributed to mediation by secondary rewards which bridge the temporal gap. Experiments by Perkins (1947) and Grice (1948) seemed to provide definitive confirmation of this theory.

In the present experiments, the rats learned to select the rewarded side of the T-maze despite delays of reward lasting up to 8 min. These results cannot be readily explained in terms of a mediational hypothesis, since there was nothing in the procedure or experimental situation that provided a basis for differential secondary reward. The animals were administered the same delay treatment after every response whether the response was correct or incorrect. Any differential stimuli that might have served to mediate the delay would have had to come from within the animal.

Traditionally, it has been supposed that afferent traces of the stimulus situation may persist through the delay interval. If so, the afferent traces produced by the physical characteristics of the choice chamber used in the experiment (e.g., the blackness or whiteness of the compartment) may have persisted even after the animal was removed from it. However, such traces have been assumed to last no more than a few seconds (Grice, 1948) and hence should not have been present at the time of reward 8 min or even 1 min later to bridge the delay. Another possible internally produced mediator is the proprioceptive trace of the response. Although it has been assumed that a proprioceptive trace lasts longer than a pure sensory trace (Spence, 1947; Perkins, 1947), it is difficult to conceive of how any trace could have survived the rigors of the present experimental procedure. The rats were handled twice during the delay of reward, the first time when being transferred to the home cage and again when being returned to the start box. Finally, if a fading trace is responsible for bridging the delay, then increasing the length of the delay should decrease the amount of learning.
Presumably, increasing the delay results in an increase in the amount of fading, thereby decreasing the similarity between the trace immediately after the response and the trace at the time of reward. The less the amount of generalization, the less the secondary reward, and hence less learning should occur as the length of the delay increases. The results of Experiment 4, in which the length of the delay was varied from 0.5 to 8.0 min, provide little support for this deduction.

The present findings complement those of Garcia and his co-workers, who have obtained learning with much longer delays of reward using the paradigm of learned flavor aversions. It has been argued (Rozin & Kalat, 1971) that these flavor aversions, while they are indeed learned, represent a unique adaptive specialization, a form of learning involving stimuli, responses, and rewards that are inherently related to one another.
By implication, temporal contiguity is the basic mechanism of association for all other forms of learning in which the stimuli, responses, and rewards are arbitrarily related by the experimental procedure. If flavor aversions were the only exception to the principle of contiguity, then the discrepancy between them and findings obtained in more conventional learning situations could be considered resolved. The results of the present experiments and those demonstrating intertrial associations over long delays suggest that this conservative solution is not appropriate.

REFERENCES

BOLLES, R. C. Reinforcement, expectancy, and learning. Psychological Review, 1972, 79, 394-409.
BRAVEMAN, N., & CAPRETTA, P. J. The relative effectiveness of two experimental techniques for the modification of food preferences in rats. Proceedings of the 73rd Annual Convention of the American Psychological Association. Washington, DC: American Psychological Association, 1965. Pp. 129-130.
CAPALDI, E. J. A sequential hypothesis of instrumental learning. In K. W. Spence & J. T. Spence (Eds.), The psychology of learning and motivation: Advances in theory and research. Vol. 1. New York: Academic Press, 1967. Pp. 67-157.
ESTES, W. K., & SCHOEFFLER, M. S. Analysis of variables influencing alternation after forced trials. Journal of Comparative and Physiological Psychology, 1955, 48, 357-362.
GARCIA, J., & KOELLING, R. A. Relation of cue to consequence in avoidance learning. Psychonomic Science, 1966, 4, 123-124.
GARCIA, J., MCGOWAN, B. K., ERVIN, F. R., & KOELLING, R. A. Cues: Their relative effectiveness as a function of the reinforcer. Science, 1968, 160, 794-795.
GRICE, G. R. The relation of secondary reinforcement to delayed reward in visual discrimination learning. Journal of Experimental Psychology, 1948, 38, 1-16.
HONIG, W. K. Attention and the modulation of stimulus control. In D. Mostofsky (Ed.), Attention: Contemporary studies and analyses. New York: Appleton, 1970. Pp. 193-238.
KAMIN, L. J. Selective association and conditioning. In N. J. Mackintosh & W. K. Honig (Eds.), Fundamental issues in associative learning. Halifax: Dalhousie Univ. Press, 1969. Pp. 42-64.
PERKINS, C. C. The relation of secondary reward to gradients of reinforcement. Journal of Experimental Psychology, 1947, 37, 377-392.
PETRINOVICH, L., & BOLLES, R. C. Delayed alternation: Evidence for symbolic processes in the rat. Journal of Comparative and Physiological Psychology, 1957, 50, 363-365.
PSCHIRRER, M. E. Goal events as discriminative stimuli over extended intertrial intervals. Journal of Experimental Psychology, 1972, 96, 425-432.
REVUSKY, S. The role of interference in association over a delay. In W. K. Honig & H. James (Eds.), Animal memory. New York: Academic Press, 1971. Pp. 155-213.
REVUSKY, S., & GARCIA, J. Learned associations over long delays. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in theory and research. Vol. 4. New York: Academic Press, 1970. Pp. 1-84.
ROZIN, P., & KALAT, J. W. Specific hungers and poison avoidance as adaptive specializations of learning. Psychological Review, 1971, 78, 459-486.
SPENCE, K. W. The role of secondary reinforcement in delayed reward learning. Psychological Review, 1947, 54, 1-8.
TYLER, D. W., WORTZ, E. D., & BITTERMAN, M. E. The effect of random and alternating partial reinforcement on resistance to extinction in the rat. American Journal of Psychology, 1953, 66, 57-65.
WAGNER, A. R. Stimulus selection and a "modified continuity theory." In G. H. Bower & J. T. Spence (Eds.), The psychology of learning and motivation: Advances in theory and research. Vol. 3. New York: Academic Press, 1969. Pp. 1-41.
WINER, B. J. Statistical principles in experimental design. New York: McGraw-Hill, 1962.

(Received July 5, 1972)