A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem


Simulation Modelling Practice and Theory 13 (2005) 389–406 www.elsevier.com/locate/simpat

Carlos D. Paternina-Arboleda a,*, Tapas K. Das b

a Department of Industrial Engineering, Universidad del Norte, Km 5 via Puerto Colombia, Barranquilla, Colombia
b Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL 33620, USA

Received 26 March 2003; received in revised form 15 October 2004; accepted 10 December 2004 Available online 22 January 2005

Abstract

This paper presents a methodology that, for the problem of scheduling of a single server on multiple products, finds a dynamic control policy via intelligent agents. The dynamic (state dependent) policy optimizes a cost function based on the WIP inventory, the backorder penalty costs and the setup costs, while meeting the productivity constraints for the products. The methodology uses a simulation optimization technique called Reinforcement Learning (RL) and was tested on a stochastic lot-scheduling problem (SELSP) having a state–action space of size 1.8 × 10^7. The dynamic policies obtained through the RL-based approach outperformed various cyclic policies. The RL approach was implemented via a multi-agent control architecture where a decision agent was assigned to each of the products. A Neural Network based approach (the least mean square (LMS) algorithm) was used to approximate the reinforcement value function during the implementation of the RL-based methodology. Finally, the dynamic control policy over the large state space was extracted from the reinforcement values using a commercially available tree classifier tool.
© 2004 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +57 300 800 1327; fax: +57 359 8852. E-mail addresses: [email protected] (C.D. Paternina-Arboleda), [email protected] (T.K. Das).

1569-190X/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.simpat.2004.12.003


Keywords: SELSP; Scheduling; Reinforcement learning; Simulation optimization

1. Introduction

The problem of economic production scheduling of multiple items on a single facility under a random production environment is generally referred to as the stochastic lot scheduling problem (SELSP). Other specific features of this problem include random sequence-dependent production change-over times and costs, random asymmetric demand arrival patterns for the items, inventory holding costs, backordering/lost customer costs, and, perhaps, specified service level constraints for the items. Industrial applications where the SELSP can be found are auto manufacturing (stamping, forging, painting, etc.), paper mills, and bottling operations. The important question associated with controlling a SELSP is "how to dynamically switch production from one item to another, depending on the buffer levels and demand characteristics of the current item being produced as well as the others, so as to minimize the overall cost per unit time while, perhaps, satisfying the service level constraints for these items?" An excellent review of the SELSP literature can be found in Sox et al. [26]. The SELSP literature can be primarily classified into two classes: cyclic scheduling and dynamic scheduling. As far as the solution approach is concerned, there are two primary classifications in the literature: discrete time control [3,9,16,19,25] and queuing-based control [1,8,21]. The latter approach is again classified into three categories: the pooling systems approach, heavy traffic approximation, and the simulation-based approach. A good application using simulation agents can be seen in Gelenbe et al. [10]. The approach adopted in this paper is a combination of discrete simulation and reinforcement learning, of which the latter is derived from dynamic programming and stochastic approximation. Our approach essentially consists of the following steps:

(1) Develop a mathematical model for the SELSP problem.
(2) Develop a multi-agent RL algorithm.
(3) Develop a simulation model of the system.
(4) Run the system simulation (with service level constraints built into it) and apply the RL algorithm during the simulation to arrive at a dynamic scheduling policy.

The multi-agent system uses reinforcement learning algorithms to perform unsupervised learning. An excellent review of reinforcement learning agents can be seen in [18,22,27]. We give a brief introduction to reinforcement learning in the next section.

2. Reinforcement learning

Fig. 1. Typical reinforcement learning scheme: a learning agent, guided by a learning algorithm, selects actions on a real or simulated environment and receives the system response.

Reinforcement learning (RL) is a way of teaching agents (decision-makers) near-optimal control policies. This is accomplished by assigning rewards and punishments


for their actions based on the temporal feedback obtained during active interactions of the learning agents with dynamic systems. The agent should choose actions that tend to improve the measure of system performance [18]. Such an incremental learning procedure suitable for prediction and control problems was developed by Sutton [28] and is referred to as temporal-difference (TD) methods. A typical single-agent learning model (as depicted in Fig. 1) contains four elements, which are the environment, the learning agent, a set of actions, and the environmental response (sensory input). The learning agent selects an action for the system, which leads the system evolution along a unique path till the system encounters another decision-making state. At that time, the system consults with the learning agent for the next action. After a state transition, the learning agent gathers sensory inputs from the environment, and from it derives information about the new state, immediate reward, and the time spent during the most recent state-transition. Using this information and the algorithm, the agent updates its knowledge base and selects the next action. This completes one step in the iteration process. As this process repeats, the learning agent continues to improve its performance. A simulation model of the system provides the environment component of the model. For a more detailed explanation of the working of a reinforcement learning model, refer to Bertsekas and Tsitsiklis [2], Das et al. [5], Das and Sarkar [6]. Although there are several different types of reinforcement learning models in the open literature, our methodology uses an infinite horizon model with the average reward performance metric.
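As a concrete illustration of this interaction cycle (not taken from the paper; the environment and agent interfaces below are hypothetical placeholders for the simulation model and the learning agent described in the following sections), a semi-Markov learning loop can be sketched as follows:

```python
# Minimal sketch of the sense-act-update cycle described above.  The `env` and
# `agent` objects are hypothetical placeholders; in the paper this role is played
# by an ARENA simulation model and the RL agents defined in Sections 3-5.

def run_learning_loop(env, agent, max_steps):
    """env.reset() -> state; env.step(action) -> (next_state, cost, sojourn_time)."""
    state = env.reset()
    for _ in range(max_steps):
        action = agent.select_action(state)                  # pick an action at this decision epoch
        next_state, cost, tau = env.step(action)             # simulate until the next decision epoch
        agent.update(state, action, next_state, cost, tau)   # temporal-difference style update
        state = next_state
    return agent
```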

3. Multi-agent RL

The learning procedure for the SELSP is modeled as a multi-agent system in which each agent represents a product type. If the product type being produced is


i, after the product buffer reaches the base stock target inventory level of R(i), the ith RL agent takes an action. The actions available for this agent are to switch to another product type j, and thus transfer control to the jth agent, or to keep the machine idle until a demand arrival epoch at which switching to another product type is worthwhile. Upon switching, the jth RL agent continues production of product type j until the buffer reaches R(j). At this point the reinforcement value of the most recent action by agent i is updated and the jth agent takes a suitable action (once again, either to switch to another product type and transfer control or to keep the machine idle). This process repeats until the reinforcement values of all state–action combinations for all agents reach a steady state and a clear control policy emerges. Note that the RL agents do not decide on the base stock target inventory levels (R(i), for all i). Their decisions only include whether to switch to another product or to keep the machine idle. A second-level search is performed to find an optimal combination of base stock levels for all products i = 1, . . . , N. This is done using a commercial optimization tool (OptQuest) in conjunction with the ARENA simulation software.

4. Mathematical model for the SELSP

The system considered in this paper has a single machine that produces N products, having individual finished product/backordered demand buffers. The time between demand arrivals for the products and the unit production times are random variables. The system state changes at the production completion and demand arrival epochs. A decision to switch production for product type i is made only when the buffer reaches the target basestock level R(i). However, if switching is not worthwhile, the machine remains idle until the next arrival epoch at which the machine finds switching to a new product valuable. Let X_m(i) = Q_m(i) − D_m(i) be the combined inventory/backorder level of the ith product at decision-making epoch m, where Q_m(i) is the ith inventory level and D_m(i) is the ith product backorder level. The state of the system at the mth decision-making epoch can be expressed as

E_m = \{(X_m(i), M_m) : i = 1, 2, \ldots, N\},     (1)

where M_m indicates the product for which the machine is set up at the mth decision epoch. Let max{D(i)} be the maximum backorder level of the ith product. The total number of system states is given by

|E| = N \prod_{i=1}^{N} \big(R(i) + \max\{D(i)\}\big).     (2)

For example, for a system with three products, basestock (target) levels of 100, 50, and 50 for products 1, 2 and 3, respectively, and maximum backorder queue sizes of 100, 50 and 50, respectively, the total number of states would be 6 × 10^6. For a three-product system, when the machine is set up for product type 1, the actions available for the decision agent are switch to product 2, switch to product 3, or


remain idle. Hence, the number of actions available at any state for an N-product problem is N. The complete state–action space over which a learning agent searches for an optimal policy is of size |E| × N, which for the 3-product example would be 1.8 × 10^7. In order to measure the agent's learning, a cost (reward) structure function is defined as follows. If the system is run for a very long period of τ time units under a fixed sequencing policy π, while h_i is the unit holding cost, b_i is the unit backordering cost and s_i is the setup cost when the station switches to product i, then the average cost function for N product types is defined as

\rho = \sum_{i=1}^{N} \left[ h_i\, l_\pi(i) + b_i\, b_\pi(i) + \frac{s_i\, k_\pi(i)}{\tau} \right],     (3)

where l_π(i) is the average work-in-process in the ith queue and b_π(i) is the average backorder queue in the system for product type i under policy π. The term k_π(i) denotes the number of setups performed over time τ for product type i. Our objective is to find a policy that minimizes the average cost ρ.

The average cost presented above is built upon average values for the buffer and backorder queue levels. These average values are time-based statistics and are collected at the end of each simulation run that lasts for a time interval equal to τ. Let t_m be the simulation time at the mth decision epoch. Given a control policy π, the average queue level for product type i at time t_m is given by

l^\pi_m(i) = l^\pi_{m-1}(i)\,\frac{t_{m-1}}{t_m} + Q_m(i)\,\frac{t_m - t_{m-1}}{t_m}, \quad i = 1, 2, \ldots, N,     (4)

where N is the total number of products in the system. Also, the queue levels Q(i) are restricted to be less than or equal to the maximum (basestock) level R(i), i.e., 0 ≤ Q(i) ≤ R(i). If the number of simulation steps tends to a large number (t_m → τ), then l^π_m(i) can be written as l_π(i), i.e.,

l^\pi_m(i) = l_\pi(i) \quad (t_m \to \tau).     (5)

The average backorder queue level b_π(i) can be found in a manner similar to l_π(i).
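To make Eqs. (3) and (4) concrete, the following Python sketch (class and variable names are our own illustration, not the authors' code) maintains the time-averaged WIP and backorder levels per product during a run and evaluates the average cost at the end:

```python
# Illustrative bookkeeping for Eqs. (3)-(4); names are ours, not the paper's.

class CostTracker:
    def __init__(self, n_products, h, b, s):
        self.h, self.b, self.s = h, b, s          # unit holding, backorder, setup costs
        self.l = [0.0] * n_products               # time-averaged WIP, Eq. (4)
        self.bq = [0.0] * n_products              # time-averaged backorder queue
        self.k = [0] * n_products                 # setup counts k_pi(i)
        self.t_prev = 0.0

    def record_epoch(self, t_now, Q, D):
        """Update the running time averages at decision epoch m, as in Eq. (4)."""
        if t_now <= 0.0:
            return
        w_old = self.t_prev / t_now               # weight t_{m-1}/t_m
        w_new = (t_now - self.t_prev) / t_now     # weight (t_m - t_{m-1})/t_m
        for i in range(len(self.l)):
            self.l[i] = self.l[i] * w_old + Q[i] * w_new
            self.bq[i] = self.bq[i] * w_old + D[i] * w_new
        self.t_prev = t_now

    def record_setup(self, i):
        self.k[i] += 1

    def average_cost(self, horizon):
        """Average cost rho over a run of length tau = horizon, as in Eq. (3)."""
        return sum(self.h[i] * self.l[i] + self.b[i] * self.bq[i]
                   + self.s[i] * self.k[i] / horizon
                   for i in range(len(self.l)))
```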

5. The RL algorithm (Relaxed-SMART [5,13,14])

This algorithm is followed by all the agents of the multi-agent system independently. Suppose the system is in state i and action a is selected, so that the system moves to state j at the next decision-making epoch. Let c_imm(i, a, j) be the immediate cost (reward) generated by going from state i to state j under action a, ρ the average cost, and τ(i, a, j) the time spent during the system transition.

Step 1. Let the time step m = 0 and the current state be i. Initialize the reinforcement values R_m(i, a) = 0 for all i ∈ E and a ∈ A_i (the action space for state i). Set the cumulative parameters of cost (c_m) and time (t_m) equal to zero and the current average cost ρ_m = 0. Choose the initial values of the exploration rate (p_m) and the learning rates (α_m, β_m).


Step 2. While m < MaxSteps do

(a) With high probability (1 − p_m), choose an action a that minimizes R_m(i, a); otherwise choose another action from the set A_i \ {a}, in which all actions have an equal probability of being chosen.

(b) Allow simulation of the system in state i with the chosen action to continue until the next decision epoch is reached. Let the state at the next decision epoch be j, τ(i, a, j) be the transition time due to action a, and c_imm(i, a, j) be the immediate cost incurred as a result of taking action a in state i. Then update the reinforcement value for the state–action combination (i, a) as follows:

R_{m+1}(i, a) \leftarrow (1 - \alpha_m)\, R_m(i, a) + \alpha_m \Big[ c_{imm}(i, a, j) - \rho_m\, \tau(i, a, j) + \min_{b \in A_j} R_m(j, b) \Big].     (6)

(c) In case a nonrandom action was chosen in step 2(a):
1. Update cumulative time: t_{m+1} ← t_m + τ(i, a, j).
2. Update total cost for the agent: c_{m+1} ← c_m + c_imm(i, a, j).
3. Update the average reward (cost):

\rho_{m+1} \leftarrow (1 - \beta_m)\, \rho_m + \beta_m\, \frac{t_m\, \rho_m + c_{imm}(i, a, j)}{t_{m+1}}.     (7)

The transition time between consecutive decision-making epochs is computed as follows. Every time an agent takes control of the simulation run it sets a temporary variable to TNOW (the current simulation time). At the next decision-making epoch, the time variable computes the difference between the current time and the value previously recorded and assigns it to τ(i, a, j). The immediate cost incurred during the transition (c_imm) is computed according to the cost structure presented later in Section 6.

Step 3. Decrease α_m, β_m, and p_m using the Darken and Moody scheme (Eq. (8)) [4]. Set i ← j, m ← m + 1.

Note that the updating scheme of the RL algorithm requires storing a reinforcement value R(i, a) for each state–action pair. For a problem with a small state space, the values for each state–action pair can possibly be stored explicitly, and the optimal action to be taken in each state can then be decided based upon the lowest reinforcement value. But for a reasonably sized problem there could be millions of states. This huge state space makes it infeasible to store all the action values explicitly. Hence there is a need for a function approximation scheme that can be used to maintain the reinforcement values for each state–action pair.
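For a state space small enough to store R(i, a) explicitly, one iteration of Steps 2-3 might look like the tabular Python sketch below. The function and dictionary layout are our own illustration; the paper instead embeds this logic in the ARENA simulation and approximates R with LMS neurons, and the decay of α_m, β_m and p_m via Eq. (8) is omitted here:

```python
import random

# One learning step of the Relaxed-SMART-style update (Eqs. (6)-(7)) in tabular
# form; the data structures and names are illustrative, not the paper's code.

def smart_step(R, state, actions, simulate, rho, t_cum, alpha, beta, p_explore):
    """R: dict mapping (state, action) -> reinforcement value (cost-to-go)."""
    greedy = min(actions[state], key=lambda a: R.get((state, a), 0.0))
    alternatives = [x for x in actions[state] if x != greedy]
    explore = bool(alternatives) and random.random() < p_explore
    a = random.choice(alternatives) if explore else greedy

    next_state, c_imm, tau = simulate(state, a)        # run until next decision epoch

    best_next = min(R.get((next_state, b), 0.0) for b in actions[next_state])
    R[(state, a)] = ((1.0 - alpha) * R.get((state, a), 0.0)
                     + alpha * (c_imm - rho * tau + best_next))          # Eq. (6)

    if not explore:                                    # update averages on greedy moves only
        t_new = t_cum + tau
        rho = (1.0 - beta) * rho + beta * (t_cum * rho + c_imm) / t_new  # Eq. (7)
        t_cum = t_new
    return next_state, rho, t_cum
```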


5.1. A neural network based scheme for reinforcement value function approximation

Artificial Neural Networks (ANN) [17] provide a general and practical method for learning real-valued and discrete-valued functions from examples. A primitive class of ANN, consisting of a single neuron operating under the assumption of linearity, is presented next. This class of network is trained with the Least Mean Square (LMS) algorithm, also known as the delta rule or the Widrow–Hoff rule [17]. The algorithm is based on the use of instantaneous estimates of the environment (simulation) response to the learning agent. The method is very simple and allows for incremental learning, which makes it well suited to the kind of problems that are solved with the RL methodology. A summary of the LMS algorithm follows:

Step 1. Initialization. Set \hat{w}_k(1) = 0 for k = 1, 2, \ldots, p.

Step 2. Filtering. For time m = 1, 2, \ldots, compute

y_m = \sum_{j=1}^{p} \hat{w}_m(j)\, x_m(j),
e_m = d_m - y_m,
\hat{w}_{m+1}(k) = \hat{w}_m(k) + \eta\, e_m\, x_m(k), \quad k = 1, 2, \ldots, p,

where p is the number of neuron inputs (weights) that undergo updating during the learning process (equivalent to the number of partitions for the problem), \hat{w}_m(k) are the estimates of the neuron weights at time step m, y_m is the actual output of the neuron, d_m is the desired output, e_m is the error, and η is the learning rate. The main difficulty encountered with the LMS algorithm may be attributed to the fact that the learning rate parameter is maintained constant throughout the computation, i.e., η_m = η for all m. Darken and Moody [4] proposed the use of a so-called search-then-converge schedule, defined by

\eta_m = \frac{\eta_0}{1 + (m/\tau)},     (8)

where η_0 and τ are constants. In the early stages of adaptation, the learning rate parameter is approximately the same as the constant η_0 and the algorithm mainly explores. As the number of time steps approaches the constant τ, the algorithm rather converges. For a number of iterations m sufficiently large compared to the search time constant τ, the learning rate parameter operates as in a traditional stochastic approximation algorithm. After each action choice in step 2 of the Relaxed-SMART algorithm presented in Section 5, the weights of the corresponding action LMS neuron are updated as follows:


\Delta w_x = \eta_m\, \alpha_m\, e_m\, x(i),     (9)

where η_m is the neuron convergence parameter at decision-making epoch m, α_m is the learning rate of the Relaxed-SMART algorithm, x(i) is the variable for which the weights are being updated, and e_m is the temporal-difference error, defined as

e_m = c_{imm}(i, a, j) - \rho_m\, \tau(i, a, j) + \min_{a' \in A_j} R_m(j, a', \mathbf{w}) - R_m(i, a, \mathbf{w}),     (10)

where w is the vector of weights of the neuron. Other approaches to learning with networks may be seen in [11,12], where the performance of cognitive networks is analyzed.
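A minimal sketch of this approximation scheme, assuming one linear LMS neuron per action, the temporal-difference error of Eq. (10) scaled as in Eq. (9), and the search-then-converge rate of Eq. (8), is given below; the feature encoding and parameter values are assumptions of ours, not the paper's implementation:

```python
import numpy as np

# One linear (LMS) neuron per action approximates R(state, action); the weights
# of the chosen action's neuron are nudged by the temporal-difference error of
# Eq. (10), scaled as in Eq. (9).  Feature vectors x are NumPy arrays.

class LMSValueFunction:
    def __init__(self, n_features, n_actions, eta0=0.1, search_time=1.0e4):
        self.w = np.zeros((n_actions, n_features))
        self.eta0, self.search_time = eta0, search_time

    def eta(self, m):
        # Darken-Moody search-then-converge schedule, Eq. (8)
        return self.eta0 / (1.0 + m / self.search_time)

    def value(self, x, a):
        return float(self.w[a] @ x)              # y_m = sum_k w_m(k) x_m(k)

    def update(self, m, x, a, x_next, actions_next, c_imm, rho, tau, alpha):
        # Temporal-difference error, Eq. (10)
        best_next = min(self.value(x_next, b) for b in actions_next)
        e = c_imm - rho * tau + best_next - self.value(x, a)
        # Weight change for the neuron of the chosen action, Eq. (9)
        self.w[a] += self.eta(m) * alpha * e * x
        return e
```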

6. Numerical example

We consider a three-product example problem as a vehicle for providing further details on the solution procedure (Fig. 2). We note that the choices of the types of probability distributions for the example problem are somewhat arbitrary and do not imply any limitations of our modeling process. These probability distributions were chosen in order to use the problems that were studied in [1].

Fig. 2. A three-product flexible production station (a single flexible station serving product buffers P1, P2 and P3, each facing its own demand stream).

In what follows, a description of the single-server multi-product system under study is given. In addition to those of a classical SELSP problem, the following characteristics apply to this problem:

• Demands arrive in sets with a general distribution of time between arrivals.
• Setups are not incurred unless necessary. When the machine is idling, the system waits for the next demand event that gives a product type a positive demand (i.e., the D(i) variable that first gets a positive value triggers a production setup for product type i).
• Each item type i has a setup time st(i), a setup cost s_i, a constant production rate, a holding cost h_i per unit time, and a backorder penalty cost b_i per unit.


• Customers (orders) are homogeneous. Service is given on a first-in, first-out basis.

The system is simulated as a discrete-event model using the ARENA simulation software. The simulation model was first validated against the problem and corresponding results reported in [1]. Table 1 shows the production, demand and cost parameters for the problems studied in [1]. The values for the performance measures considered were averaged over 30 replicates. Each replicate was run for 12,000 time units with a warm-up period of 2400 time units. Two cases were chosen for validation of the simulation model; in particular, the cases with a squared coefficient of variation (scv) of 0.25 and 1.0 for the demand size, modeled with a truncated lognormal distribution, i.e., ⌊lognormal(μ_i + 0.5, σ_i)⌋ with σ_i² = scv·μ_i². The results for the above cases are shown in Table 2. In the validation study, to obtain the best estimates for the basestock parameters, an exhaustive search over ±5 units of the optimal values reported in [1] was performed and the cost recorded. Since the cyclic switching policy is fixed, the search was performed on one variable at a time. There are about 10^3 cases in total, among which 10 × 3 were evaluated. In most cases the basestock values resulted in the rounded-up integer value corresponding to the optimal basestock levels reported. All results (basestock parameters) were within 2.5% of the continuous values reported.
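For reference, the demand-size model used in the validation (a floored lognormal with mean μ_i + 0.5 and variance scv·μ_i²) can be reproduced in a few lines. The conversion from the distribution's mean and variance to the underlying normal parameters below is our own reconstruction, assuming the lognormal is specified by its own mean and standard deviation (as in ARENA's LOGN):

```python
import numpy as np

def sample_demand_size(mu, scv, rng=None):
    """Draw one floored-lognormal demand size with mean mu + 0.5 and variance scv * mu**2."""
    rng = np.random.default_rng() if rng is None else rng
    mean = mu + 0.5
    var = scv * mu ** 2
    sigma_ln2 = np.log(1.0 + var / mean ** 2)   # variance of the underlying normal
    mu_ln = np.log(mean) - 0.5 * sigma_ln2      # mean of the underlying normal
    return int(np.floor(rng.lognormal(mu_ln, np.sqrt(sigma_ln2))))

# Example: product 1 of Table 1 (mean demand 25 units) under scv = 0.25
# sizes = [sample_demand_size(25.0, 0.25) for _ in range(1000)]
```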

Table 1
Parameters for the numerical problems studied in [1]

Mean time between demand arrivals: 2.0 time units
SCV of time between demand arrivals: 0.25

Product   Setup time   Production rate   Holding cost    Backlog cost    Setup cost   Mean demand
                       (units/time)      ($/unit-time)   ($/unit-time)   ($/setup)    (units)
1         0.1          45.0              1.0             25.0            5.0          25.0
2         0.2          30.0              1.0             12.5            10.0         10.0
3         0.15         35.0              1.0             20.0            10.0         10.0

Table 2
Validation results for the ABAC schedule simulation model (results from Anupindi and Tayur are marked by the reference number [1])

             scv     Cost ($)   Basestock levels (units)      Service levels (% point)
                                z1       z2       z3          t1       t2       t3
[1]          1       178.12     76.68    30.54    40.55       0.94     0.91     0.94
SIMAN        1       175.22     78       31       41          0.92     0.91     0.94
Error on μ           1.63%      1.72%    1.51%    1.11%       1.67%    0.14%    0.06%

[1]          0.25    100.85     48.50    18.54    23.6        0.94     0.90     0.93
SIMAN        0.25    100.75     49       19       24          0.96     0.93     0.95
Error on μ           0.10%      1.03%    2.48%    1.69%       2.26%    3.07%    1.65%


Similarly, the values of the total cost obtained for both cases under study were within 1.7% of the optimal cost reported in [1]. The cost figures vary slightly from the optimal cost reported in [1]. The confidence interval (95% confidence level) for the case of high variance (scv = 1.0) is 175.22 ± 3.34, which includes the value (178.12) reported in [1]. For the case of scv = 0.25, the confidence interval is 100.75 ± 0.65, which also includes the value of 100.85. Notice the higher variability of the first case, obviously resulting from a higher-variance process. The remaining estimates for both cases were within 3.1% of the values reported. Based on the results in Table 2, we conclude that the simulation model developed was able to reproduce the existing results. Hence, this model was used as a vehicle for implementing the RL-based solution approach to derive dynamic control policies for the SELSP and to compare them with the optimal results for the ABAC cyclic scheduling policy.

In what follows we compare the results of the RL schedule with those obtained from the ABAC schedule. To make a fair comparison, the same basestock levels were set in the learning controller. In addition, for both cases of scv, results with reduced basestock levels for the RL schedule, for which the average total cost is better than that of the ABAC schedule, are also presented. Recall that the reason for having the same basestock levels is to compare the results with those reported in [1]. Table 3 shows the results for the RL agent along with the results from the SIMAN implementation of the ABAC scheduling policy.

It may be noticed from Table 3 that, for the same basestock levels, the implementation of the RL algorithm performs somewhat better than the fixed dynamic cyclic schedule ABAC in terms of average total cost. The RL average cost results are about 2.2% lower for the same basestock levels (in the case of scv = 0.25), and about 0.3% lower (for the case of scv = 1.0). However, at a 95% level of confidence, the results show no evidence of significant improvement in the high-variance case. The confidence interval for the case of high variance is 174.57 ± 1.70, which includes the value of 175.22. For the case of low variance, the improvement is evident: the confidence interval is 98.51 ± 0.51, which does not include the reported value for the ABAC switching policy (i.e., 100.75). Perhaps the choice of a

Table 3
Numerical results for the RL multi-agent scheme

Schedule        scv     Cost ($)   Basestock levels (units)      Service levels (% point)
                                   z1      z2      z3            u1      u2      u3
ABAC            1       175.22     78      31      41            0.92    0.91    0.94
RL schedule^a   1       174.57     78      31      41            0.92    0.88    0.95
RL schedule^b   1       172.12     78      28      40            0.92    0.87    0.95

ABAC            0.25    100.75     49      19      24            0.96    0.93    0.95
RL schedule^a   0.25    98.51      49      19      24            0.96    0.91    0.95
RL schedule^b   0.25    93.58      45      15      20            0.95    0.88    0.93

^a Results for the schedule with the same basestock levels as for the ABAC schedule.
^b Results for the schedule with better cost and different basestock levels.


better partitioning scheme may further improve these results. Notice also that the variability produced by the RL system is lower than that of the cyclic schedule.

It is observed that, after fixing the RL policy learned by the multi-agent system, it is possible to find a suitable combination of basestock levels that improves the average total cost over the results obtained with basestock levels equal to those in [1]. The choice of this parameter requires a direct search over the basestock levels. Once again, the search for the basestock parameters is performed over a space of ±5 units around the ones previously found as optimal (for the ABAC cycle). Since the policy learned by the multi-agent RL system is not cyclic but dynamic, a gradient-based procedure would not work appropriately. A direct search over 1000 cases was performed with a small sample of each case, where the number of simulation replicates during the search was set to 5. For both cases it is observed that higher basestock levels actually increase the average cost function; therefore the search space was greatly reduced by considering lower values of basestock. The best average cost achieved by the RL-based procedure in the case of high-variance demand arrivals is 1.3% better than the optimal cost achieved by the fixed cyclic policy ABAC. The confidence interval is given as 172.12 ± 1.88, which shows some improvement over both ABAC and RL (same basestock levels), but the confidence interval still overlaps with that of the cyclic policy (see Table 3). For the case of medium-variance demand arrivals (scv = 0.25), the modified-basestock RL outperforms the policies reported in [1], yielding results for the average total cost that are 7.1% lower. In this case, the confidence interval is 93.58 ± 0.61.

We also noted the following advantages of the RL policy over the ABAC schedule:

• The total number of setups is reduced (for all products), making the operation of the manufacturing facility smoother.
• The actual physical space needed to allocate work in process is reduced, thus improving storage allocation.
• The system controlled by the RL policy is leaner, with less accumulation of WIP.

We note that the scope for improvements in performance for a three-product SELSP with respect to the cyclic ABAC policy is quite limited. The RL-based procedure was still able to find control policies that are superior to ABAC. However, it is expected that for problems with a larger number of products, the RL agents will be able to produce schedules far superior to cyclic schedules. A disadvantage of the RL-based procedure is that the policy learned by the RL agents is not readily known, since the actions depend on comparisons based on a certain combination of the WIP values and the corresponding weights of many LMS neurons. We refer to this phenomenon as the black-box effect. This undesirable aspect of the RL policy can be remedied in part by means of another learning procedure, i.e., decision tree learning, which is discussed later in Section 6.1.

Fig. 3. Convergence plots of the multi-agent RL scheme (average reward versus simulation steps, showing a transient exploration phase followed by the steady-state policy).

A plot of the learning curves of the three agents is shown in Fig. 3. Notice that there are three learning curves, one for each agent in the scheme. The curves have


different convergent average reward values, the reason being that the choice of cost parameters is different for each agent. The following is an explanation of the cost function assigned as rewards or penalties to the agents. The action-value updating scheme accounts for a time-based cost of all products that are introduced to the system during the production run, plus all the demands that are backlogged during the same period. Assume that the machine is set to produce product j ∈ {1, 2, 3}. The cost expression (1/τ)[h_j·t_j + Σ_j b_j·t_dj] is assigned to all product entities that are in the system during two consecutive decision-making epochs, where t_j is the time spent by a product j entity in the system and t_dj is the time that demand entities of product type j have been sitting in the backorder queues. On the other hand, if the entity that is updating the corresponding agent is a logical simulation entity created for the remain-idle actions, then the reward scheme is represented as (1/τ)Σ_j b_j·t_dj. The agent is required to make a decision either at a production completion event at which a basestock level is reached or, in case all queues have positive inventory, at the next demand event that generates a positive demand for any of the products. If the event is a production completion, the decision agent is called by a product-type entity. If the event is a positive demand, the decision agent is called by a logical wait entity.

Computational convergence is achieved long before the simulation run ends. Notice, however, that the addition of an approximation scheme improves the agents' updating procedure. The convergence plot appears to be much smoother than that of the single-product RL agent studied in [23]. Also, as compared with the look-up


table approach presented in [23], the addition of an approximation scheme seems to increase the convergence rate of the average reward (cost) function.

6.1. Discovering knowledge from the RL policy: a decision tree learning approach

The implementation of procedures such as RL-based decision-making agents would be very attractive to some industries, but for some manufacturing environments the use of sophisticated intelligent structures may be overwhelming. To some extent, the RL-based optimization procedure generates a policy that is not clear to the human eye, and its application relies on a high level of computer integration. This problem can be overcome by developing a fixed set of rules that summarizes the RL actions. As mentioned earlier in this paper, suitable alternatives, also coming from the artificial intelligence literature, appear to help in gaining some understanding of the actions executed by the RL agents. Among these techniques, we chose decision tree learning because of its simplicity and classification power [7].

Decision tree learning is a method for approximating discrete-valued target functions, in which a decision tree represents the learned function. Decision tree learning is one of the most widely used and practical methods for inductive inference [22]. Given a set of training data consisting of input variables (or attributes, as they are called in the decision tree learning literature) and a desired response (action), a general tree classifier partitions the space R^n into regions, often hypercubes, each one containing data with attributes that possess similar values. Fig. 4 shows an example of a two-attribute, three-action decision space with a rectangular partition. Clearly the classification is not perfect. In fact, no classifier can perform better than the optimal Bayes classifier [7]. Some points will be wrongly classified, but the majority will be classified correctly. Decision tree classifiers fall in the category of data mining methods. One of the most popular decision tree classifiers is the C4.5 software from Quinlan [24], or its more recent versions C5.0 for Unix and See5 for Windows 95/98/NT. In what follows, an application of decision tree classifiers to one of the cases for the SELSP in this study is presented.

Fig. 4. Partition of an action space for a decision-making problem.

Consider the results provided by the RL-based controller on the SELSP with demand batch size scv = 1.0, with improved basestock levels. At all decision epochs


of a significant simulation run, the system records the state of the system and the action suggested by the controller. By a significant simulation run we mean a replicate with sufficient time to record enough data for the decision tree to be learned. For this example, the model for the SELSP was run for 24,000 time units. Over 33,000 training cases were recorded and presented to the decision tree learning program. The following is a sample of the decision tree that was obtained.

Fig. 5. Output report from See5.

Fig. 5 shows that the corresponding decision tree has 30 branches, but it classified the data almost perfectly. It can be noticed from the output report that only two cases out of 33,082 were misclassified after the tree tested its classification


efficiency. This results in over 99.99% accuracy in the training data for this experiment; a very robust classifier. In addition, the time spent by the classifier is negligible compared to the time spent on the entire simulation-based optimization procedure. The importance of this information lies in the understanding of the actions that each learning agent is suggesting. Before the implementation of the decision tree learning program, the actions suggested by the agents were not easily followed by the normal human eye. Hence, the use of data mining techniques in addition to the RL-based optimization procedure is strongly suggested. In what follows, a brief explanation of the decision tree described in Fig. 5 is presented.


Table 4
Set of rules for product agent (RL agent) 3

Branch   X(1) (units)   X(2) (units)   Action
1        (1, 79]        (27, 28]       Switch to 1
2        (79, 80]       (27, 28]       Remain idle
3        (1, 79]        (1, 27]        Switch to 1
4        (2, 79]        (1, 1]         Switch to 2
5        (1, 2]         (20, 1]        Switch to 1
6        (1, 53]        (1, 20]        Switch to 1
7        (53, 2]        (1, 20]        Switch to 2

Notice that the main branches of the decision tree correspond to the three product agents (RL agents) embedded in the system. Each product agent i then subdivides into child branches, with split criteria based on the status of the X(j)'s for products j ∈ {1, 2, 3} such that j ≠ i. As an example, consider the first set of branches from the decision tree, i.e., those related to the decisions made by product agent (RL agent) 3. Agent 3 has seven branches, all related to variables X(1) and X(2). Recall that X(i) is the difference between the product i buffer Q(i) and its associated demand queue D(i). Table 4 expresses as rules what each of the seven branches means to decision-making agent number 3. The rules are sequential, i.e., in order for branch (rule) 5 to be chosen, all previous four branches (rules) have to be evaluated and rejected. A similar analysis can be performed over the remaining branches of the decision tree, those related to product agents (RL agents) 1 and 2. It is now clear, with the help of the resulting decision tree, how the agents choose to act when they are required to make a decision, according to the state of the system at that moment.
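The same rule-extraction step can be reproduced with any modern decision-tree learner; the sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for the See5/C5.0 tool used in the paper, with a handful of hypothetical logged (state, action) cases in place of the roughly 33,000 recorded ones:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in for the See5/C5.0 step: learn readable switching rules from logged
# (state, suggested action) pairs.  The arrays below are placeholders for the
# recorded simulation cases; feature names and values are illustrative only.

# X rows: [X(1), X(2), X(3), machine_setup]; y: action suggested by the RL agents
X_log = [[60, 25, 30, 3], [80, 28, 35, 3], [40, -5, 20, 3], [70, 10, -8, 1]]
y_log = ["switch_to_1", "remain_idle", "switch_to_2", "switch_to_3"]

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1)
tree.fit(X_log, y_log)

# Print the learned branches as if-then rules, analogous to the See5 output of Fig. 5
print(export_text(tree, feature_names=["X1", "X2", "X3", "setup"]))
```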

7. Concluding remarks and directions for future research

A simulation optimization methodology using the reinforcement learning (RL) approach is developed and implemented on a single-server multi-product lot scheduling problem. The test problem used is a 3-product stochastic lot-scheduling problem (SELSP). The proposed methodology proved computationally feasible and quite efficient. The multi-agent RL scheme is able to reduce the basestock levels in the system while keeping the product demand backorders at low levels. Though the consideration of backlog (backorders) significantly enlarges the search state space, the increase is handled via the use of multiple single-neuron adaptive filter (LMS) neural networks to approximate the value function for the RL algorithm. The value function approximation scheme considerably increased the speed of convergence and evened out the effects of the system's transient state, helping the RL-based agents to make more consistent decisions quickly. The methodology can be easily extended to deal with larger SELSP problems. Applications to much larger problem settings are reported in [5,15,20,23].


A disadvantage of using neural networks is that they still require trial and error in the process of selecting the network parameters, such as the architecture and the space partition, among others. It was noticed that the choice of an appropriate state space partition is critical for the agent to perform well. Perhaps the use of data mining techniques, such as rule-based multivariate regression procedures, can improve this process and produce better approximation functions. However, incremental rule-based regression models are still in their early stages and remain open for future research.

For a certain combination of basestock levels (previously found by means of direct search), there exists a dynamic state-dependent policy that performs better than the cyclic policies with corresponding optimal basestock levels. These state-dependent policies do not follow a fixed sequence (schedule) rule but rather a dynamic one that depends on the sensory information gathered from the system. State-dependent policies reduce the number of unnecessary setups in the system and allow for a smoother operation and product flow.

It is possible that the RL procedure presented in this study may seem inconvenient or hard to implement because the policy that it generates is not readily known. This inconvenience can be handled by the use of data mining classification techniques. A brief study of the classification technique referred to as decision tree classification, and its implementation for the SELSP problem, is presented. Decision tree classifiers appear to be a feasible alternative for state-dependent problems. The application of the decision tree classifier to a recorded set of the actions suggested by the RL agents helps in gaining understanding of the performance of the controller.

The application of artificial intelligence procedures to perform simulation-based optimization provides a feasible framework for the analysis of complex production systems in order to improve the operating conditions and, consequently, the productivity of such systems. The flexibility provided by the off-line optimization analysis allows the implementation of embedded intelligent agents that are capable of responding to conditions in very dynamic production settings, such as logistics (supply chain), capacity planning, and demand processes. The proposed optimization architecture is generic and can be applied to very diverse production systems that involve dynamic control policies. However, every time a system is modified, it is recommended to update the control policy through a new learning simulation run.

References

[1] R. Anupindi, S. Tayur, Managing stochastic multiproduct systems: model, measures, and analysis, Operations Research 46 (3) (1998) S98–S111.
[2] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, first ed., Athena Scientific, 1996.
[3] K.E. Bourland, C.A. Yano, The strategic use of capacity slack in the economic lot scheduling problem with random demands, Management Science 40 (12) (1994) 1690–1704.
[4] C. Darken, J.E. Moody, Towards faster stochastic gradient search, in: J.E. Moody, S.J. Hanson, R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems, vol. 4, Morgan Kaufmann, San Mateo, CA, 1992, pp. 1009–1016.


[5] T.K. Das, A. Gosavi, S. Mahadevan, N. Marchellack, Solving semi-Markov decision problems using average reward reinforcement learning, Management Science 45 (4) (1999) 560–574.
[6] T.K. Das, S. Sarkar, Optimal preventive maintenance in a production/inventory system, IIE Transactions 31 (6) (1999) 537–551.
[7] L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1996.
[8] A. Federgruen, Z. Katalan, The stochastic economic lot scheduling problem: cyclical base-stock policies with idle times, Management Science 42 (6) (1996) 783–796.
[9] G. Gallego, Scheduling the production of several items with random demands in a single facility, Management Science 36 (12) (1990) 1579–1592.
[10] E. Gelenbe, E. Seref, Z. Xu, Simulation with learning agents, Proceedings of the IEEE 89 (2) (2001) 148–157.
[11] E. Gelenbe, R. Lent, Z. Xu, Measurement and performance of a cognitive packet network, Computer Networks 37 (2001) 691–791.
[12] E. Gelenbe, Review of experiments in self-aware networks (Invited Paper), in: 18th International Symposium on Computer and Information Sciences, Lecture Notes in Computer Science, LNCS 2869, Springer, Berlin/New York, 2003, pp. 1–8.
[13] A. Gosavi, An Algorithm for Solving Semi-Markov Decision Problems Using Reinforcement Learning: Convergence Analysis and Numerical Results, Unpublished Ph.D. Thesis, Department of Industrial Engineering, University of South Florida, 1999.
[14] A. Gosavi, A convergent reinforcement learning algorithm for solving semi-Markov decision problems under average cost for an infinite time horizon, European Journal of Operational Research, in press.
[15] A. Gosavi, N. Bandla, T.K. Das, A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking, IIE Transactions 34 (9) (2002) 729–742.
[16] S.C. Graves, The multi-product production cycling problem, AIIE Transactions 12 (1980) 233–240.
[17] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Englewood Cliffs, NJ, 1994.
[18] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey, Journal of Artificial Intelligence Research 4 (1996) 237–285.
[19] R.C. Leachman, A. Gascon, A heuristic scheduling policy for multi-item, single-machine production systems with time-varying, stochastic demands, Management Science 34 (1988) 377–390.
[20] S. Mahadevan, G. Theochaurus, Optimizing production manufacturing using reinforcement learning, in: Eleventh International FLAIRS Conference, AAAI Press, 1998, pp. 372–377.
[21] D.M. Markowitz, M.I. Reiman, L.M. Wein, The stochastic economic lot scheduling problem: heavy traffic analysis of dynamic cyclic policies, Working paper 3863-95-MSA, Sloan School of Management, MIT, Cambridge, MA, 1995.
[22] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[23] C.D. Paternina-Arboleda, T.K. Das, Intelligent dynamic control of single-product serial production lines, IIE Transactions 33 (1) (2001) 65–77.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[25] C.R. Sox, J.A. Muckstadt, Optimization-based planning for the stochastic lot scheduling problem, IIE Transactions 29 (5) (1997) 349–357.
[26] C.R. Sox, P.L. Jackson, A. Bowman, J.A. Muckstadt, A review of the stochastic lot scheduling problem, International Journal of Production Economics 62 (3) (1999) 181–200.
[27] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press, Cambridge, MA, 1998.
[28] R.S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning 3 (1988) 9–44.