Simulation of sequential data: An enhanced reinforcement learning approach


Expert Systems with Applications 36 (2009) 8032–8039


Marlies Vanhulsel, Davy Janssens, Geert Wets *, Koen Vanhoof

Hasselt University – Campus Diepenbeek, Transportation Research Institute, Wetenschapspark 5 bus 6, B-3590 Diepenbeek, Belgium

* Corresponding author. Tel.: +32 (0) 11 26 91 58; fax: +32 (0) 11 26 91 99. E-mail address: [email protected] (G. Wets).

doi:10.1016/j.eswa.2008.10.056


Keywords: Reinforcement learning; Regression tree; Function approximation; Activity-based travel demand modelling

Abstract

The present study aims at contributing to the current state of the art of activity-based travel demand modelling by presenting a framework to simulate sequential data. To this end, the suitability of a reinforcement learning approach to reproduce sequential data is explored. Additionally, as traditional reinforcement learning techniques are not capable of learning efficiently in large state and action spaces with respect to memory and computational time requirements on the one hand, and of generalizing based on infrequent visits of all state-action pairs on the other hand, the reinforcement learning technique as used in most applications is enhanced by means of regression tree function approximation. Three reinforcement learning algorithms are implemented to validate their applicability: traditional Q-learning and Q-learning with bucket-brigade updating are tested against the improved reinforcement learning approach with a CART function approximator. These methods are applied to data of 26 diary days. The results are promising and show that the proposed techniques offer great opportunities for simulating sequential data. Moreover, the reinforcement learning approach improved by introducing a regression tree function approximator learns a better solution much faster than the two traditional Q-learning approaches.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Models are the result of the human urge to organize facts and behaviour. Moreover, while outlining policies, governments and policy makers wish to be supported by models in order to estimate the impact of their decisions on society as a whole. Travel demand models comprise a major example of such decision-supporting models, as they can be applied to evaluate the influence of both transport- and non-transport-related policies on mobility, as well as to assess the impact of mobility on non-transport-related issues such as air quality (Fried, Havens, & Thall, 1977; Shiftan & Surhbier, 2002; Shiftan et al., 2003; Stead & Banister, 2001). To serve this purpose, activity-based transportation models have entered the area of transportation modelling during the past decade. These models are founded on four basic concepts. First, activity-based transportation models assume that travel is derived from the demand for activities in space and time, which are executed in an attempt to meet individual goals and needs (Chapin, 1974). Next, humans face a number of temporal-spatial constraints that restrict the individual's action space (Hägerstrand, 1970). Furthermore, activity and travel decisions cannot be disconnected from the household context in which the individual operates (Jones, Dix, Clarke, & Heggie, 1983). Last but not least, activity and travel decisions are affected to a large extent by past and anticipated future events (Bowman, 1995). Such activity-based transportation models offer the opportunity of predicting travel demand more accurately, as they provide more profound insight into individual activity-travel behaviour (Algers, Eliasson, & Mattsson, 2005; Kitamura, 1996).

Before being able to account for policy changes, a model needs to be formulated, the core of which consists of simulating daily activity-travel schedules (Arentze & Timmermans, 2004). To this end, the current research explores the use of artificial intelligence techniques to extract information from sequential data. More particularly, this study focuses on the application of reinforcement learning within this area of research, as it is founded on the rather straightforward principles of human learning by trial-and-error interactions in a dynamic environment (Kaelbling, Littman, & Moore, 1996). Consequently, the contributions of the present study to the current state of the art are twofold. The first added value is the application of Q-learning, a well-known reinforcement learning technique, to simulate sequential data. However, when implementing this technique, both memory and computational time requirements increase rapidly with the dimensionality and granularity of the state-action space (Sutton & Barto, 1998; Vanhulsel, Janssens, Wets, & Vanhoof, 2008). Therefore, an improvement of this reinforcement learning technique is needed to overcome these limitations.


The second contribution of this research thus consists of defining a regression tree-based function approximation to enhance the traditional reinforcement learning approach. To summarize, the research goals consist of:

- presenting a framework to simulate sequential data by means of reinforcement learning;
- enhancing the performance of traditional reinforcement learning algorithms by means of regression tree function approximation;
- developing a first prototype model based on two Q-learning approaches as well as this improved reinforcement learning algorithm; and
- validating the applicability of these approaches when simulating sequential data.

The remainder of this paper is organised as follows. Section 2 discusses the background of processing sequential data, which underpins the analysis of the present research. The third section introduces three versions of reinforcement learning which are applied in the empirical section: the first is the well-known Q-learning approach (Kaelbling et al., 1996), the second introduces bucket-brigade updating within this Q-learning approach (Mitchell, 1997), and the third and novel version combines the basics of reinforcement learning with an inductive learning technique – in particular regression tree induction – in order to obtain a suitable function approximator. The empirical part presents both the data used in this research and the results obtained by the three proposed reinforcement learning algorithms. Conclusions and topics for further research are formulated in the fifth and final section.

2. Background

The current research thus concentrates on contributing to the development of an activity-based model by presenting a framework for simulating individual decision-making behaviour. This behaviour is revealed in a number of successive activity-travel episodes executed in the course of a day by a certain individual. Each activity-travel episode is characterized by a number of dimensions which need to be considered simultaneously: an individual needs to decide which activity is performed at which location, when and for how long, which transport mode is used in order to get to the desired location, and who is accompanying the individual. Because both spatial and temporal dimensions play an important role, these activity-travel decisions are combined into so-called activity-travel schedules, which constitute the basis of the assignment of individual routes to the transportation network when estimating aggregate travel demand (Arentze & Timmermans, 2004; Ettema & Timmermans, 1997). In its attempt to formulate an approach to reproduce these daily activity-travel schedules, the present research starts from observed activity-travel schedules which are captured in activity-travel diaries. Furthermore, as the modelling framework presented here aims at exploring the suitability of reinforcement learning to reproduce sequential data, this research focuses on a single dimension, in particular the activity type. The simulation of the remaining dimensions can be incorporated afterwards.

When studying daily activity patterns, it is important to define a measure to compare and analyze observed and simulated activity schedules. After all, activity-travel data do not contain a collection of unlinked activity-travel records: the order in which activity episodes occur within a day is of great importance. Therefore, a major requirement of such a measure is the ability to take into account the sequential relationships within the data. To this end, sequence alignment methods – often applied in molecular biology – are used to determine the similarity of activity schedules (Wilson, 1998). The purpose of such a sequence alignment method or SAM (Joh, Arentze, & Timmermans, 2001) consists of matching two activity schedules by carrying out a number of operations (indel or substitution) to transform one schedule into the other. Indel refers to inserting an activity into one schedule, which corresponds to deleting an activity from the other schedule. Substitution signifies replacing an activity in one schedule by another activity. The outcome of this method is a distance measure indicating how much effort is needed to align two schedules. The higher the SAM score, the more operations have to be performed in order to equalize the patterns and thus the less similar the schedules are (Joh et al., 2001). The authors refer to Wilson (1998) for a more detailed background on sequence alignment methods.
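For illustration, the sketch below computes such an alignment score for two activity-type sequences, assuming unit costs for the indel and substitution operations; the multidimensional SAM of Joh et al. (2001) may weight operations differently, so this is only a simplified stand-in, and the function name sam_distance is ours.

```python
def sam_distance(schedule_a, schedule_b, indel_cost=1.0, subst_cost=1.0):
    """Simplified sequence-alignment distance between two activity schedules.

    Schedules are sequences of activity-type codes, e.g. ['W', 'H', 'S'].
    The score is the cheapest combination of insertions, deletions and
    substitutions turning one schedule into the other (higher = less similar).
    """
    n, m = len(schedule_a), len(schedule_b)
    # dp[i][j] = cost of aligning the first i activities of a with the first j of b
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if schedule_a[i - 1] == schedule_b[j - 1] else subst_cost
            dp[i][j] = min(dp[i - 1][j] + indel_cost,   # deletion
                           dp[i][j - 1] + indel_cost,   # insertion
                           dp[i - 1][j - 1] + match)    # substitution or match
    return dp[n][m]


# Two schedules differing only in the order of the last two activities:
print(sam_distance(list("SWHE"), list("SWEH")))  # 2.0 with unit costs
```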

3. Techniques

3.1. Reinforcement learning

The approach used in this paper is called reinforcement learning (Kaelbling et al., 1996). To illustrate the basis of this technique, consider a child learning to ride a bike. The result of the learning process is either that the child is able to continue riding his bike or that he falls. Remembering the results of previous attempts, the child will adjust his operations by resuming manoeuvres producing a positive result (not falling) and avoiding manoeuvres producing a negative result (falling). In the course of this learning process, the child aims at minimizing the number of bumps due to crashes. The remainder of this section introduces the reader to the theoretical concepts of reinforcement learning; for a more comprehensive overview, the authors refer to Sutton and Barto (1998) and Smart and Kaelbling (2000).

In a reinforcement learning setting, the agent refers to the decision-making unit under consideration, i.e. the child in the example illustrated above. This agent operates in an environment recorded in a set of states S, determined by a number of observable dimensions of the environment which are considered to be relevant to the learning problem. The set of states in the example above could be defined by the following characteristics: the position of the child with respect to the bike, the speed of the bike, and the weather and road conditions. In each of these states, the agent is faced with a decision; for example accelerate, brake, steer left, steer right or put feet on the ground. These decisions are called actions, and are assembled in the set of actions A(s). As a result of executing an action a in a certain state s, a new state s′ will emerge. This conversion is captured in a transition function δ, which is unknown to the agent. Furthermore, each action will produce a result, which can be either positive, i.e. a reward (not falling), or negative, i.e. a punishment (falling). These results are recorded in a (scalar) reward function R(s, a), which is based on the state s and the action a and is also unknown to the agent. In addition, within reinforcement learning not only immediate rewards can be taken into account, but also delayed rewards; for example, the child falls after a series of "wrong" actions from which he is unable to recover. To incorporate the effect of delayed rewards, a discounting factor γ (< 1) is introduced to reflect the weight assigned to immediate versus future rewards, and the value of an action a in a state s is defined as the sum of the discounted rewards. The goal of the reinforcement learning agent then consists of maximizing the total reward or minimizing the total punishment received. In the example stated above, the child tries to bike as long as possible without falling. To this aim, the agent attempts to define for each possible state the "best" action, i.e. the action which maximizes the sum of the (discounted) rewards.


The final outcome of the reinforcement learning process is a so-called optimal policy, defining the best action to be taken in each state (Kaelbling et al., 1996; Sutton & Barto, 1998). The present research implements Q-learning, a model-free reinforcement learning technique (Watkins, 1989). The key concept of this approach is the Q-value, reflecting the value of executing an action a in a state s and selecting the best actions from then on. The Q-value can be defined as follows:

Q(s, a) = r(s, a) + \gamma Q^{*}(s', a')    (1)

where Q(s, a) is the Q-value of the state-action pair (s, a) and Q*(s′, a′) is the best Q-value which can be obtained by selecting action a′ in state s′, i.e. the state resulting from executing action a in state s; r(s, a) is the reward received when executing action a in state s; and γ is the discounting factor, reflecting the weight assigned to future rewards.

The Q-values are stored in a so-called Q-table, which contains one entry for each state-action pair. When attempting to find the best action a* in a state s, the algorithm searches the Q-table for the action that corresponds to the highest Q-value for the given state s (Kaelbling et al., 1996; Sutton & Barto, 1998; Watkins & Dayan, 1992). The Q-table is populated in the course of the learning process as summarized in Table 1. The learning process takes place over a number of learning episodes. Each learning episode starts in a random state s; the agent selects and executes an action, receives the immediate reward and observes the next state. Based on this information, the agent updates the Q-value corresponding to this state-action couple according to the formula below:

Q_{t+1}(s, a) = (1 - \alpha) Q_t(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q_t(s', a') \right]    (2)

where Q_{t+1}(s, a) is the updated Q-value, Q_t(s, a) is the Q-value previously stored in the Q-table and which needs to be updated, and α is the step-size parameter or learning rate of the algorithm, which expresses the weight assigned to the "newly" calculated Q-value r(s, a) + γ max_{a′} Q_t(s′, a′) compared to the "old", saved estimate of the Q-value Q_t(s, a). This learning rate usually decreases as the number of learning episodes increases, reflecting the fact that the Q-value actually equals a weighted average of all experiences. The agent continues to select an action, receive a reward, observe a new state and update the corresponding Q-value until the end of the learning episode is reached (Kaelbling et al., 1996; Sutton & Barto, 1998; Watkins & Dayan, 1992).

In each state the agent needs to select an action. But, as the agent initially has no knowledge of the result of his actions in a given state, he will select actions randomly at the beginning of the learning process. He will thus start by exploring the action space. This way, the agent gathers knowledge on the results of his actions.

Table 1
Reinforcement learning process (Kaelbling et al., 1996).
  Initialize Q-values.
  Repeat N times (N = number of learning episodes):
    Select a random state s.
    Repeat until the end of the learning episode:
      Select an action a.
      Receive an immediate reward r(s, a).
      Observe the next state s′.
      Update the Q-table for the state-action pair (s, a) according to update rule (2).
      Set s = s′.
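To make the loop in Table 1 and update rule (2) concrete, the sketch below implements tabular Q-learning with an ε-greedy choice between exploration and exploitation (the exploration rate p_explore discussed in the text). The environment interface (reset, actions, step), the parameter values and the fixed exploration rate are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.9, p_explore=0.3):
    """Tabular Q-learning following Table 1 and update rule (2).

    `env` is assumed to expose reset() -> state, actions(state) -> list of
    actions, and step(state, action) -> (reward, next_state, done).
    """
    q = defaultdict(float)  # Q-table: (state, action) -> Q-value

    for episode in range(n_episodes):
        state = env.reset()          # each episode starts in a (random) state
        done = False
        while not done:
            actions = env.actions(state)
            # epsilon-greedy trade-off between exploration and exploitation
            if random.random() < p_explore:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            reward, next_state, done = env.step(state, action)
            best_next = 0.0 if done else max(
                q[(next_state, a)] for a in env.actions(next_state))
            # update rule (2): weighted average of the old estimate and the new target
            q[(state, action)] = ((1 - alpha) * q[(state, action)]
                                  + alpha * (reward + gamma * best_next))
            state = next_state
    return q
```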

Consequently, after a while, the agent will be able to select the action in a given state that is known to yield the highest reward. This approach is called exploitation. However, it can still pay off to explore once in a while, as exploration may lead to a state which might not be visited otherwise and which produces a higher reward than the best action found so far (Janssens & Wets, 2008; Mitchell, 1997; Sutton & Barto, 1998). A method to incorporate this trade-off between exploration and exploitation is based on a parameter, the so-called exploration rate p_explore, reflecting the probability of selecting a random action instead of the optimal one. Exploration generally occurs more at the beginning of the learning process – at that moment the agent still needs to "discover" his environment – and often decreases with increasing learning episodes (Janssens & Wets, 2008).

3.2. Reinforcement learning with bucket-brigade updating

To accelerate convergence to the optimal policy, the Q-learning approach discussed above can be modified to include bucket-brigade updating, as shown in Table 2. This means that the Q-values are not updated in the course of the learning episode. Instead, the triplets containing state, action and reward are accumulated until the end of the learning episode. The triplets are then used in reverse order of occurrence to update the Q-table (Mitchell, 1997).

Table 2
Reinforcement learning process incorporating bucket-brigade updating (Mitchell, 1997).
  Initialize Q-values.
  Repeat N times (N = number of learning episodes):
    Select a random state s.
    Repeat until the end of the learning episode:
      Select an action a.
      Receive an immediate reward r(s, a).
      Observe the next state s′.
      Store the state-action-reward triplet (s, a, r).
      Set s = s′.
    Repeat for the stored (s, a, r)-triplets, starting with the last one:
      Update the Q-table for the state-action pair (s, a) according to update rule (2).
      Delete the (s, a, r)-triplet.
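A minimal sketch of the bucket-brigade variant of Table 2 is given below; the only change with respect to the previous sketch is that the (state, action, reward) triplets are buffered and the Q-table is updated in reverse order of occurrence at the end of each episode. The same hypothetical environment interface and parameter values are assumed.

```python
import random
from collections import defaultdict

def q_learning_bucket_brigade(env, n_episodes=5000, alpha=0.1,
                              gamma=0.9, p_explore=0.3):
    """Q-learning with bucket-brigade updating (Table 2): (s, a, r) triplets
    are buffered during the episode and the Q-table is updated in reverse
    order of occurrence at the end of the episode."""
    q = defaultdict(float)

    for episode in range(n_episodes):
        state, done, triplets = env.reset(), False, []
        while not done:
            actions = env.actions(state)
            if random.random() < p_explore:            # explore
                action = random.choice(actions)
            else:                                      # exploit
                action = max(actions, key=lambda a: q[(state, a)])
            reward, next_state, done = env.step(state, action)
            triplets.append((state, action, reward, next_state, done))
            state = next_state

        # end of episode: apply update rule (2) starting with the last triplet,
        # so that delayed rewards propagate backwards quickly
        for state, action, reward, next_state, ended in reversed(triplets):
            best_next = 0.0 if ended else max(
                q[(next_state, a)] for a in env.actions(next_state))
            q[(state, action)] = ((1 - alpha) * q[(state, action)]
                                  + alpha * (reward + gamma * best_next))
    return q
```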

3.3. Disadvantages of Q-learning

3.3.1. Curse of dimensionality

The Q-learning algorithms described above need to store the Q-values of all feasible state-action pairs encountered in the course of the iterative learning process in a look-up table. Furthermore, in order to ensure convergence to the optimal policy, these Q-learning approaches require all such state-action pairs to be visited at least once, and preferably an infinite number of times, during the training process (Sutton & Barto, 1998). These algorithms are therefore only practical for small state-action problems, because the look-up table grows exponentially with the dimensionality of the state and action spaces. Large problems will thus take both a huge amount of memory to store the large Q-tables and a huge amount of time and data to estimate the Q-values accurately.

3.3.2. Limited applicability

Moreover, when changes in the environment of the agent occur, the traditional reinforcement learning methods require retraining the Q-function from scratch, as they do not recover previously acquired knowledge, even though the majority of the state-action settings remain unchanged (Driessens, 2004). For instance, when the child receives a new bike for his birthday, it is very likely that he can apply most of the techniques learned on his old bike when riding the new one. The Q-learning approach, however, does not allow reusing this knowledge and requires relearning for this new situation (Driessens, 2004). In addition, it is not realistic to assume that an agent will visit every feasible state-action pair in the course of the learning algorithm. The agent should be able to use experience of only a limited subset of the state-action space to represent all state-action pairs, even the ones that have never been visited (Sutton & Barto, 1998).

3.4. Function approximation based on regression trees

The key issue is thus one of generalization of the state-action space. To this end, variable resolution discretization can be applied to aggregate either the state or the action space: states or actions are joined together to reduce the resolution of the state and action spaces (Uther & Veloso, 1998). However, a bad discretization may introduce a hidden state into the problem, whereas a too fine discretization does not solve the issue of the large amount of data and time required to learn the optimal policy (Smart & Kaelbling, 2000). A better solution is to replace the discrete look-up tables by function approximators capable of handling continuous variables with several dimensions and of generalizing across similar states and actions (Smart & Kaelbling, 2000). For the purpose of function approximation, existing generalization techniques from the area of supervised learning can be used, for instance artificial neural networks, statistical curve fitting and pattern recognition (Sutton & Barto, 1998). Most research incorporating function approximation in reinforcement learning focuses on generalizing either the state or the action space, aiming to reduce the number of feasible state or action dimensions. However, the goal of the function approximation presented here consists of generalizing both the state and the action space simultaneously. Therefore, the current approach aims at estimating reward values based on experienced (state, action, reward)-triplets, which make up the training instances of the supervised learning technique. In the present study, tree induction serves the role of reinforcement learning function approximator. Since the target variable of the tree consists of the continuous Q-value, classification trees do not provide an acceptable solution and regression tree induction emerges as the natural choice. An outline of the reinforcement learning approach incorporating function approximation based on a regression tree can be found in Table 3.

Table 3
Reinforcement learning algorithm incorporating function approximation based on a regression tree.
  Initialize the regression tree.
  Repeat N times (N = number of learning episodes):
    Select a random state s.
    Repeat until the end of the learning episode:
      Select an action a.
      Receive an immediate reward r(s, a).
      Observe the next state s′.
      Store the state-action-reward triplet (s, a, r).
      Set s = s′.
    Select a sample of the stored data.
    Fit the regression tree based on this sample.
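As a rough illustration of the procedure in Table 3, the sketch below refits a regression tree on the most recently experienced (state, action, reward) triplets at the end of every learning episode and ranks actions by the tree's predictions. scikit-learn's DecisionTreeRegressor stands in for the CART induction, and the feature encoding, rolling-window size and tree parameters are assumptions rather than the authors' settings; the motivation for subsampling the stored triplets is discussed in the paragraph that follows.

```python
import random
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rl_with_tree_approximator(env, encode, n_episodes=1000,
                              p_explore=0.3, window=5000):
    """Reinforcement learning with a regression-tree function approximator (Table 3).

    `encode(state, action)` turns a state-action pair into a numeric feature
    vector, so the tree generalizes over both states and actions and unvisited
    pairs still receive a value estimate.
    """
    tree = None
    experience = []  # rolling buffer of (feature vector, reward) examples

    def value(state, action):
        if tree is None:
            return 0.0
        return float(tree.predict([encode(state, action)])[0])

    for episode in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if tree is None or random.random() < p_explore:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: value(state, a))
            reward, next_state, done = env.step(state, action)
            experience.append((encode(state, action), reward))
            state = next_state

        # keep only the most recent triplets and refit the tree on them
        experience = experience[-window:]
        X = np.array([features for features, _ in experience])
        y = np.array([reward for _, reward in experience])
        tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)
    return tree
```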

This approach resembles the approach incorporating bucket-brigade updating, in the sense that state-action-reward triplets are saved to a database in the course of the learning episode and that the re-estimation of the regression tree only occurs at the end of each learning episode, based on the stored dataset. However, because agents face a continuously changing environment, (state, action, reward)-triplets may become outdated, obsolete or even invalid. Moreover, adding all experienced examples to the database causes the number of examples to rise with the number of learning episodes. As a consequence, the amount of memory required to save these examples, as well as the amount of time needed to re-estimate the regression tree, increases rapidly. Therefore, in the current research, only a subsample of this database, containing the 5000 most recently experienced triplets, is used to fit the regression tree.

3.5. Application of reinforcement learning to simulate activity schedules

All three reinforcement learning approaches discussed above are implemented to simulate sequential diary data. To this end, the algorithm starts with an empty schedule. The goal of the learning algorithm is to learn to decide either to add an activity to the schedule or to stop scheduling. This decision comprises the action dimension of the reinforcement learning problem. The latter action (stop scheduling) ends the learning episode. The state variables include the current schedule position and the previous activity in the schedule. An example of a learning episode, determined by a sequence of states and actions, is visualised in Fig. 1. The reward of the action taken by the reinforcement learning agent reflects the improvement of the (temporary) schedule in matching the input schedules:

R_t(s, a) = \sum_{j} \left( SAM_j^{t-1} - SAM_j^{t} \right)    (3)

R_t(s, a), the reward to be calculated, is based on the schedule seq_t, which is obtained by adding the activity defined by action a to the previous schedule seq_{t-1}. Both schedules seq_{t-1} and seq_t are compared to each of the input schedules j by means of the sequence alignment method described in Section 2, resulting in SAM_j^{t-1} and SAM_j^{t} respectively.
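To make Eq. (3) concrete, the sketch below computes the reward for appending one activity to the current schedule, reusing the sam_distance helper sketched in Section 2; the list-based schedule representation and the function names are illustrative assumptions, not the authors' code.

```python
def schedule_reward(previous_schedule, activity, input_schedules):
    """Reward of Eq. (3): change in alignment cost against all input schedules
    when the activity chosen by action a is appended to the current schedule.
    The reward is positive when the extended schedule matches the observed
    input schedules more closely than the previous one."""
    new_schedule = previous_schedule + [activity]
    reward = 0.0
    for observed in input_schedules:
        sam_before = sam_distance(previous_schedule, observed)  # SAM_j^(t-1)
        sam_after = sam_distance(new_schedule, observed)        # SAM_j^t
        reward += sam_before - sam_after
    return reward

# e.g. appending a work episode (W) to a partial schedule:
# schedule_reward(['S', 'H'], 'W', [list("SHWHE"), list("SHWHS")])
```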


Fig. 1. Example of the scheduling process.

4. Empirical section

4.1. Diary data

The data used in this research stem from a project entitled "An Activity-based Approach for Surveying and Modelling Travel Behaviour" executed by the Transportation Research Institute, aiming at collecting activity-travel diaries from 2500 households across Flanders (Belgium) (Bellemans, Kochan, Janssens, & Wets, 2008; Janssens & Wets, 2005). The data used in this research embrace only a part of the individual activity-travel diaries gathered within the project. Beside these activity-travel diaries, the respondents were asked to fill out an individual and household survey covering socio-demographic variables, such as age, household composition, income, educational background and work situation. Because the current research focuses mainly on examining the suitability of reinforcement learning for the scheduling problem described above, only the data of a group of individuals whose activity-travel diaries are assumed to be homogeneous are taken into account. For this purpose, the activity-travel diary data of single full-time working individuals on working days are selected: 26 diary days covering 158 activity episodes serve as input schedules to this prototype study.

Given this setting, the current research applies leave-one-out cross-validation to train and validate the modelling framework (Witten & Frank, 2000, pp. 127–128). Each of the three reinforcement learning techniques is estimated 26 times based on a training set containing 25 diary days, and the resulting model is validated on the data of the remaining diary day. The activities used in this research are classified into thirteen activity types: grocery shopping (G), non-daily shopping (N), education (D), social activities (O), leisure (L), bring/get activities (B), touring (T), working (W), services (V), out-of-home eating (E), sleeping (S), in-home activities (H) and a remainder category (R).

4.2. Descriptives

The observed schedules include on average 6.08 activities. Each of these schedules records at least one working and one in-home activity episode. In addition, the majority of the schedules (23 of 26) contain exactly one sleeping activity at the end of the schedule. The average mutual dissimilarity of the observed schedules, as defined by the SAM measure discussed in Section 2, equals 5.86.
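A minimal sketch of this leave-one-out set-up (not the authors' code) is given below; diaries, train_rl and evaluate are hypothetical placeholders for the 26 observed schedules, one of the learning procedures sketched above, and a SAM-based scoring function.

```python
def leave_one_out(diaries, train_rl, evaluate):
    """Leave-one-out cross-validation over the diary days: train on all but one
    diary and validate the simulated schedule against the held-out diary."""
    scores = []
    for i, held_out in enumerate(diaries):
        training_set = diaries[:i] + diaries[i + 1:]   # 25 of the 26 diary days
        model = train_rl(training_set)
        scores.append(evaluate(model, held_out))       # e.g. SAM of simulated vs. observed
    return scores
```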


Fig. 2. Sample output (5000 learning episodes).

Table 4
Descriptive statistics of the simulation results (5000 learning episodes).

                                                            Observed   RIL      RIL including BBU   RIL using CART FA
Average schedule length                                     6.08       5.15     5.12                5.19
Standard deviation of schedule length                       1.1974     0.3679   0.3258              0.4019
Average SAM of simulated vs. observed validation schedule   NA         6.38     6.38                4.77
Standard deviation of SAM of simulated vs. observed
validation schedule                                         NA         1.7906   1.7906              1.9737

4.3. Results

Fig. 2 shows an example of the outcome learned in the course of 5000 episodes. The results of the analyses are promising, as can be deduced from Table 4. On average the simulated schedules include around 5.00 activities, whereas the average schedule length of the observed ones equals 6.08. The schedules simulated by the traditional Q-learning algorithms all record exactly one working episode and on average 1.31 sleeping episodes. With respect to the in-home activities, on average 2.03 and 2.04 episodes are counted for the Q-learning and the Q-learning with bucket-brigade updating algorithms respectively. For the simulation results of the reinforcement learning algorithm incorporating a CART function approximator, the schedules record at least one, and on average 1.42, working activity episodes, on average 1.19 sleeping activity episodes and 2.19 in-home activity episodes. When focussing on the quality of the schedules as a whole, as defined by the sequence alignment method, the reinforcement learning algorithm incorporating CART as a function approximator outperforms both Q-learning algorithms (average SAM of 4.77 vs. 6.38 for both Q-learning algorithms).

4.4. Learning progress

The progress of the average quality of the learning algorithm can be measured by comparing, for each learning episode, the best schedule so far to the corresponding observed validation schedule by means of the sequence alignment method. This progress is visualized in Fig. 3.

This graph shows that the reinforcement learning algorithm incorporating a CART regression tree function approximator learns the optimal schedule more quickly than the two Q-learning algorithms. Additionally, the algorithms are adapted to enable the learning agent to determine autonomously when to stop learning. To reach this goal, the agent calculates the best schedule so far at the end of each learning episode and compares this schedule to all input schedules based on SAM. When the average value of these SAM measures does not improve by more than a predefined threshold in the course of a predefined number of learning episodes, the agent terminates the learning process. For the results shown in Table 5, the improvement threshold equals 2 and the number of learning episodes over which this rate of improvement must hold is set to 200.
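The sketch below illustrates one possible reading of this stopping rule, assuming that the average SAM of the best schedule so far is recorded after every learning episode; the threshold and patience values mirror the 2 and 200 mentioned above, and the function name is ours.

```python
def should_stop(average_sam_history, threshold=2.0, patience=200):
    """Stop learning when the average SAM of the best schedule so far has not
    improved by more than `threshold` over the last `patience` episodes.

    `average_sam_history` holds one average SAM value per completed episode
    (lower = better match with the input schedules)."""
    if len(average_sam_history) <= patience:
        return False
    recent_best = min(average_sam_history[-patience:])
    earlier_best = min(average_sam_history[:-patience])
    return earlier_best - recent_best < threshold
```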

Fig. 3. Progress of average SAM during the learning process (5000 learning episodes). [Line chart: average distance measure (y-axis) versus learning episode number (x-axis) for the Observed, RIL, RIL-BBU and RIL-CART series.]


Table 5
Descriptive statistics of the simulation results (number of learning episodes autonomously determined).

                                                            Observed   RIL      RIL including BBU   RIL using CART FA
Average number of learning episodes                         NA         776      847                 413
Standard deviation of number of learning episodes           NA         58.64    113.55              88.75
Average schedule length                                     6.08       5.27     5.35                6.88
Standard deviation of schedule length                       1.1974     0.4523   0.4852              0.4315
Average SAM of simulated vs. observed validation schedule   NA         6.46     6.92                6.54
Standard deviation of SAM of simulated vs. observed
validation schedule                                         NA         1.9846   2.0576              1.0670

These results clearly demonstrate that the novel reinforcement learning algorithm is able to learn much faster than the Q-learning approaches. These findings are due to the fact that the regression tree function approximator allows generalization of both the state and action spaces. Moreover, the novel reinforcement learning algorithm allows extrapolating reward experiences to (state, action)-pairs that have not been visited before.

5. Conclusions

It has been shown in this paper that – while several other scheduling algorithms exist – the presented approach, using algorithms originating from the research area of reinforcement learning, offers a great opportunity of becoming a reliable learning algorithm for micro-simulation in dynamic activity-based transportation models. The current research has improved reinforcement learning by introducing a function approximator based on inductive learning, allowing as such the incorporation of a more extensive set of variables as well as an increased granularity of both the explanatory and the predicted variables. After a profound examination of the extended reinforcement learning algorithm, three reinforcement learning approaches – a Q-learning approach, a Q-learning approach including bucket-brigade updating and a reinforcement learning approach improved with a regression tree-based function approximator – have been applied to fit activity schedules based on data of 26 diary days. The results have been validated by means of the distance measure generated by a sequence alignment method (SAM), indicating the (dis)similarity of activity patterns.

The results of this study have proven to be very promising, as they have shown that, after the initial learning phase, both Q-learning approaches as well as the reinforcement learning approach in combination with a regression tree function approximator are able to autonomously determine activity schedules that match the input schedules well. In addition, the analysis has revealed that the reinforcement learning approach enhanced by a regression tree function approximator is able to reach a better solution than the two Q-learning approaches. Moreover, the results have shown that the quality of the solution rises more quickly in the course of the learning process in the case of the improved reinforcement learning algorithm.

In spite of these favourable results, some issues remain to be examined. First, the authors are examining the possibility of implementing an incremental regression tree induction algorithm, since the CART induction algorithm also involves some drawbacks. After all, as the learning process progresses, both the amount of memory used to store the examples and the amount of time needed to estimate the Q-tree and retrieve information from it increase. In addition, previously experienced examples become outdated or even invalid in the continuously changing environment, and should thus be excluded from the Q-tree estimation. To overcome these limitations, the incorporation of an incremental regression tree induction algorithm, which allows updating rather than re-estimating the regression tree, is currently being investigated. The analyses that have already been conducted show that less computational time will be required, without loss of quality. Furthermore, the parameter settings can be subject to additional tests in order to improve the algorithm's performance. Finally, in order to arrive at a dynamic online learning algorithm, the sensitivity of the reinforcement learning approaches to changes in environmental parameters or to changes in the input schedules in the course of the learning phase will have to be examined.

References

Algers, S., Eliasson, J., & Mattsson, L.-G. (2005). Is it time to use activity-based urban transport models? A discussion of planning needs and modelling possibilities. The Annals of Regional Science, 39(4), 767–789.
Arentze, T. A., & Timmermans, H. J. P. (2004). A learning-based transportation oriented simulation system. Transportation Research Part B: Methodological, 38(7), 613–633.
Bellemans, T., Kochan, B., Janssens, D., & Wets, G. (2008). In the field evaluation of the impact of a GPS-enabled personal digital assistant on activity-travel diary data quality. In Proceedings of the 87th annual meeting of the transportation research board (TRB), Washington, DC, USA.
Bowman, J. (1995). Activity based travel demand model system with daily activity schedules. Massachusetts: Department of Civil and Environmental Engineering, Massachusetts Institute of Technology.
Chapin, F. S. (1974). Human activity patterns in the city. New York: John Wiley and Sons.
Driessens, K. (2004). Relational reinforcement learning. PhD thesis. Leuven, Belgium: Department of Computer Science, Katholieke Universiteit Leuven.
Ettema, D., & Timmermans, H. (1997). Activity-based approaches to travel analysis. Oxford, UK: Elsevier Science Ltd.
Fried, M., Havens, J., & Thall, M. (1977). Travel behavior – A synthesized theory. National Cooperative Highway Research Program. Boston, Massachusetts: Transportation Research Board, National Research Council.
Hägerstrand, T. (1970). What about people in regional science? Papers of the Regional Science Association, 24(1), 7–24.
Janssens, D., & Wets, G. (2005). The presentation of an activity-based approach for surveying and modelling travel behaviour. In Proceedings of the 32nd Colloquium Vervoersplanologisch Speurwerk 2005: Duurzame mobiliteit: Hot or not? Antwerp, Belgium.
Janssens, D., & Wets, G. (2008). The allocation of time and location information to activity-travel sequence data by means of reinforcement learning. In C. Weber, M. Elshaw, & N. M. Mayer (Eds.), Reinforcement learning: Theory and applications (pp. 359–378). Vienna, Austria: I-Tech Education and Publishing.
Joh, C.-H., Arentze, T. A., & Timmermans, H. J. P. (2001). Pattern recognition in complex activity-travel patterns: A comparison of Euclidean distance, signal processing theoretical, and multidimensional sequence alignment methods. In Proceedings of the 80th annual meeting of the transportation research board (TRB), Washington, DC, USA.
Jones, P., Dix, M., Clarke, M., & Heggie, I. (1983). Understanding travel behaviour. Aldershot, England: Gower Publishing Company Limited.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kitamura, R. (1996). Applications of models of activity behavior for activity based demand forecasting. In Proceedings of the activity-based travel forecasting conference: Summary, recommendations and compendium of papers, Texas Transportation Institute, Arlington, Texas.
Mitchell, T. M. (1997). Machine learning. USA: The McGraw-Hill Companies, Inc.
Shiftan, Y., Ben-Akiva, M., Proussaloglou, K., de Jong, G., Popuri, Y., Kasturirangan, K., & Bekhor, S. (2003). Activity-based modeling as a tool for better understanding travel behaviour. In Proceedings of the 10th international conference on travel behaviour research (IATBR), Lucerne, Switzerland.
Shiftan, Y., & Surhbier, J. (2002). The analysis of travel and emission impacts of travel demand management strategies using activity-based models. Transportation, 29(2), 145–168.
Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proceedings of the 17th international conference on machine learning (ICML'00), Stanford, CA.
Stead, D., & Banister, D. (2001). Influencing mobility outside transport policy. Innovation: The European Journal of Social Sciences, 14(4), 315–330.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, Massachusetts, USA/London, England: The MIT Press.
Uther, W. T. B., & Veloso, M. M. (1998). Tree based discretization for continuous state space reinforcement learning. In Proceedings of AAAI '98/IAAI '98: 15th national and 10th conference on artificial intelligence/innovative applications of artificial intelligence, Madison, Wisconsin.
Vanhulsel, M., Janssens, D., Wets, G., & Vanhoof, K. (2008). Implementing an improved reinforcement learning algorithm for the simulation of weekly activity-travel sequences. In Proceedings of the 87th annual meeting of the transportation research board (TRB), Washington, DC, USA.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Cambridge, England: University of Cambridge. p. 234.
Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3–4), 279–292.
Wilson, W. C. (1998). Activity pattern analysis by means of sequence-alignment methods. Environment and Planning A, 30(6), 1017–1038.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, USA: Morgan Kaufmann.