Online preference learning for adaptive dispatching of AGVs in an automated container terminal

Applied Soft Computing 38 (2016) 647–660

Ri Choe a, Jeongmin Kim b, Kwang Ryel Ryu b,∗

a Material Handling Automation Group, Giheung Hwaseong Complex, Samsung Electronics, 1, Samsunggjeonja-ro, Hwaseong-si, Gyeonggi-do 445-701, Republic of Korea
b Department of Electrical and Computer Engineering, Pusan National University, 2, Busandaehak-ro 63-gil, Geumjeong-gu, Busan 609-735, Republic of Korea
∗ Corresponding author. E-mail addresses: [email protected] (R. Choe), [email protected] (J. Kim), [email protected] (K.R. Ryu).

Article info

Article history: Received 2 December 2014; Received in revised form 1 August 2015; Accepted 13 September 2015; Available online 28 September 2015

Keywords: Vehicle dispatching; Automated container terminal; Machine learning; Genetic algorithm; Artificial neural network

Abstract

This paper proposes an online preference learning algorithm named OnPL that can dynamically adapt the policy for dispatching AGVs to changing situations in an automated container terminal. The policy is based on a pairwise preference function that can be repeatedly applied to multiple candidate jobs to sort out the best one. An adaptation of the policy is therefore made by updating this preference function. After every dispatching decision, each of all the candidate jobs considered for the decision is evaluated by running a simulation of a short look-ahead horizon. The best job is then paired with each of the remaining jobs to make training examples of positive preferences, and the inversions of these pairs are each used to generate examples of negative preferences. These new training examples, together with some additional recent examples in the reserve pool, are used to relearn the preference function implemented by an artificial neural network. The experimental results show that OnPL can relearn its policy in real time, and can thus adapt to changing situations seamlessly. In comparison to OnPL, other methods cannot adapt well enough or are not applicable in real time owing to the very long computation time required.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction The automated guided vehicles (AGV) in automated container terminals transport containers between the quay cranes (QC) at the quayside and the automated stacking cranes (ASC) at the storage yard to support the discharging and loading operations. In a discharging operation, an inbound container is picked up from a vessel by a QC and is handed over to an AGV. The container is then delivered by the AGV to an ASC at a storage block in the yard, where it is stacked and stored until it is claimed for road transportation. The container move in a loading operation is in the opposite direction. First, an outbound container to be loaded onto a vessel is picked up by an ASC from the block where it resided. Next, it is handed over to an AGV and then delivered to the destination QC servicing the target vessel. To maximize the productivity of a terminal, both the discharging and loading operations should be carried out efficiently so that the turn-around time of each vessel is minimized. One of the key factors influencing the quayside productivity is the degree


of synchronization between the AGVs and cranes, i.e., the QCs and ASCs. If the AGVs can collaborate with the cranes in such a way that the cranes do not need to wait for the AGVs, the loading and discharging operations can be performed seamlessly without a delay. Such collaboration is not easily achieved by simply increasing the number of AGVs because of the traffic congestion incurred, which results in a delivery delay. What is needed is a good dispatching scheme that allows the AGVs to be scheduled more efficiently. AGV dispatching can be initiated by either a vehicle or a job. For vehicle-initiated dispatching, the vehicle that just finished its previous assignment selects a job to be done. For job-initiated dispatching, the most urgent job to which no vehicle has yet been assigned selects a vehicle. In this paper, we propose a vehicleinitiated AGV dispatching method. Many previous works on AGV dispatching have aimed at the automation of manufacturing systems. A variety of methods have been proposed, ranging from the adoption of simple heuristic rules [1–3] to the use of a Markov decision process [4], fuzzy logic [5], and neural networks [6]. Owing to the relatively simple path layout and small number of AGVs in their manufacturing system environment, simple heuristics such as the nearest-workcenter-first rule and variations of the modified firstcome-first-served rule have shown good performances. Although not many, there are some previous studies on the AGV dispatching


problem in automated container terminals. These works propose not only the use of simple rule-based methods but also optimization methods based on mathematical programming [7–11]. The problem we deal with in this paper for the dispatching of AGVs in an automated container terminal has two objectives. One is to maximize the QC productivity, and the other is to minimize the CO2 emissions. The reduction of CO2 emissions, which has received more attention recently owing to environmental concerns, can be achieved by reducing the empty-travel distances of the AGVs. Although reducing the empty-travel distance may enhance the efficiency of the AGV operation and thus contributes to a reduction in the QC delays, a very strong bias toward an empty-travel reduction leads to a sacrifice of services to the QCs, resulting in increased QC delays. By taking the reduction of empty travel as an explicit objective, we are prepared to trade QC productivity for a reduction of CO2 emissions. Note that simple rule-based methods show a limitation in achieving multiple objectives because of their simplistic decision-making process. Kim et al. [12] solved this problem by introducing a multi-criteria scoring function to evaluate and select candidate jobs. Their scoring function calculates the score of a candidate job through a weighted sum of the evaluations based on various criteria. Each criterion is designed to evaluate a job’s status from the standpoint of either QC productivity or the empty-travel distance. Since the resulting score depends on the weights of these criteria, and thus a different weight vector brings about a different best candidate, the weight vector is viewed as a dispatching policy. Pursuing the best policy that works well under various conditions, Kim et al.’s method conducts a search in the policy space to find one that shows best average performance when simulated with a set of various training scenarios. However, the policies thus obtained fail to show the best performance when they are applied to scenarios that differ from what they were trained on. Moreover, their policy search demands hours of CPU time, making it impossible to adjust their policies to new situations in real time. Another aspect distinguishing our work from previous efforts is that our dispatching policy is intended to work in an uncertain and dynamic environment. The work progress in a container terminal often deviates from the expectation for unpredictable reasons. The time taken to load a container onto a vessel depends on the skill level of the QC operator and/or the weather conditions. The retrieval time of a container by an ASC from a storage block can be lengthened if some other containers are stacked on top of the target container. The travel time of an AGV depends not only on the travel distance but also on the traffic congestion it may face en route. Under these circumstances, the terminal situation varies continuously with time, and thus we virtually face an infinite number of different situations. It should be noted that dispatching policies based on simple heuristics or static optimization approaches will not easily achieve synchronization between the AGVs and cranes under this type of environment. However, the dispatching method proposed in this paper can adapt to changing situations by learning and adjusting the dispatching policy in real time. The AGV dispatching method proposed in this paper uses a preference function [13] that returns a real number as a preference value for a given pair of candidates. 
When k candidates are given, the best one can be sorted out by applying this pairwise preference function to every possible pair. Conflicts among different preferences, if any, can be resolved using the heuristics developed by Cohen et al. [13]. To dispatch AGVs to their appropriate delivery jobs, the preference function is represented by a set of attributes [6,12,14–16] whose values are obtained by evaluating the candidate jobs based on various criteria regarding the achievement of the two objectives described above. Some of these criteria include the urgency of the candidate job, the empty-travel distance to the target container, and the loaded travel distance required by the job. One of the key

features of our approach is that, to adapt to the ever-changing situations, the preference function is relearned after every dispatching action. In preparation for the learning, each candidate job seen at the time of a decision is evaluated through a simulation of a short look-ahead horizon. A set of training examples is then generated from these candidate jobs. Each training example is a pair of jobs, in which the first is preferred to the second for a positive example, and vice versa for a negative example. To alleviate the effect of noisy examples that can cause preference conflicts, the examples are weighted differently depending on the degree of difference in preference between the pair of jobs in each example. How much better one job is than the other for a selection is revealed by the simulation result. The experimental results show that our online learning method can really make the dispatching adapt to changing situations. Other methods compared cannot adapt well enough or are not applicable in real time owing to the very long computation time demanded. Although the adaptability does not seem to be essential when the workloads of AGVs are not too heavy, it does make a difference, particularly when the workload is high but there is an insufficient number of AGVs available. The remainder of this paper is organized as follows. The next section describes the AGV dispatching problem in detail through an illustration of the layout of an automated container terminal. Section 3 reviews previous works on the AGV dispatching problem. Section 4 describes our online preference learning algorithm. Section 5 evaluates the performance of the proposed algorithm from a comparison with other algorithms through a series of experiments. Finally, Section 6 provides a summary and some concluding remarks.

2. AGV operation in an automated container terminal Fig. 1 shows the layout of an automated container terminal, where AGVs move around in the apron area to deliver containers between the quay and stacking yard. The hinterland area is where the external trucks enter and exit to deliver containers to/from their inland destinations/sources. The stacking yard consists of many blocks of container stacks, in each of which a pair of ASCs are operated for container stacking and retrieval. In front of each block in the stacking yard are a few handover points (HP) where a container transfer between an AGV and ASC takes place. A container transfer between an external truck and an ASC occurs at the HPs at the back of each block. An HP also exists under the back-reach of each QC for a container transfer to/from an AGV. AGVs are not allowed to stay idle at any HP under the QC. They can either stay at an HP in front of a block or in the waiting area in the middle zone of the apron between the blocks and the quayside. Some of the AGVs shown in the vertical direction in the figure are in a waiting status. The loading and discharging of containers for a vessel are both carried out in predetermined sequences that are carefully preplanned by taking into account various constraints and conditions. Building a loading sequence is particularly complicated because the weights of the containers, their destination ports, and their stacking status in the yard should all be considered simultaneously. The heavier containers should be loaded before the lighter ones for the weight balance of the vessel. Those to farther destinations should be loaded before those to nearer to avoid rehandlings. Fig. 2(a) shows that the containers in a vessel are stacked in bays arranged in a longitudinal direction. QCs usually work starting from a bay, and move to the next bays in consecutive order to minimize gantry travel. The containers within a ship bay are stored in groups according to their destination ports, as illustrated by A, B, and C in Fig. 2(b). The loading and discharging plans are built under the constraint imposed by this stowage plan in the ship bays. Another aspect to take into account is that those containers stacked higher in the yard


Fig. 1. Layout of an automated container terminal.

should be loaded before those stacked lower whenever possible to minimize the rehandling required during the retrieval from a yard block. Since a delay of any single job in a sequence can disturb the entire process, terminal operators try their best to make the loading operation go smoothly, which is critical for the productivity of the whole terminal. Any delay of a QC or ASC caused by the late arrival of an AGV should be avoided to maximize the productivity. Given sequences of container loading and discharging that are being undergone in parallel, the dispatching of AGVs to determine which delivery jobs to be done next and by which AGVs directly influences the productivity of the cranes and the operational cost of the AGVs. A delivery job of an AGV consists of four steps: empty travel to an HP for container reception, the reception (or pickup) of the container, loaded travel to the destination HP, and release (or dropoff) of the container. Suppose an AGV is assigned a loading job right after finishing its previous job. It first has to make an empty travel to one of the HPs of the block where the target container to be loaded is stored. It then waits there for the ASC of the block to retrieve and put down the container on top of it. After the container is put on, it starts a loaded travel to the HP under the QC that is supposed to load the container onto the target vessel. As soon as the container is picked up by the QC, the AGV moves to the waiting area for the next job assignment. If an idling AGV is assigned to a discharging

job, it travels to the HP under the QC that is supposed to discharge the target container. After receiving the container from the QC, the loaded AGV travels to an HP of the block where the container is planned to be stacked. After the container on top is picked up by the ASC of the block, the AGV waits there for its next job assignment. After completing a job upon releasing a container to an ASC or QC, if the AGV receives another container from the same crane as its next delivery job, it can accomplish this job without the need for any empty travel. This so-called dual cycle operation contributes significantly to a reduction of the empty travel by a fleet of AGVs in operation. Many operators of modern container terminals try to build a loading and discharging plan that can maximize the dual cycle because an efficient use of the AGVs leads to a reduction of crane’s waiting time for the AGVs. However, effort to reduce the empty travel can sometimes cause longer crane delays. As an example, suppose an ASC of a block needs to release an export container to an AGV for delivery to the QC for loading onto a vessel. If an AGV just freed from its job is far from the ASC, and there is another AGV at an HP of a neighbor block that is still busy with its current job, this second AGV may be a better candidate for reducing the emptytravel distance. However, if the expected arrival time of the second AGV to the ASC is later than that of the first AGV, the ASC has to suffer from a longer wait for the benefit of a shorter empty-travel distance of the AGV. A longer wait by the ASC eventually leads to a

Fig. 2. Ship bays (a) and stowage plan of a bay (b).


longer QC delay. The choice depends on the relative importance of the two objectives, i.e., enhancing the QC productivity or reducing the empty-travel distance of the AGVs.

3. Related work The majority of previous works on AGV dispatching have focused on the automation of the manufacturing systems or logistics centers. Such approaches can be roughly categorized into two groups: one based on simple heuristic dispatching rules [1–3], and the other based on the optimal plans [7–9,11,17–19]. Simple heuristic rules are advantageous for real-time applications because they do not require much computation. However, their effectiveness for the enhancement of the operational efficiency is limited due to the narrow and shortsighted nature of their decisionmaking. Methods building optimal plans make better decisions by foreseeing future jobs in addition to the current jobs, but the computational demand increases exponentially with the number of jobs to be planned. Another aspect to be pointed out is that the optimal plans are not of much use if the environment changes dynamically under uncertainty because any plan will become obsolete not long after its execution in such an environment. The rolling-horizon methods attempt to overcome this problem by iteratively rescheduling the current and future jobs within a relatively short horizon in regular intervals ([20–24]). However, as the interval between consecutive rescheduling becomes shorter to cope with a higher degree of uncertainty or dynamicity, the length of the horizon should accordingly shorten due to a harder real-time constraint (because not much time is allowed to compute a new job schedule). This inevitably leads to a sacrifice of the quality of the generated schedule or plan. The inventory-based dispatching method by Briskorn et al. [21] combines heuristics with an optimization method to reduce the computation demanded for scheduling the jobs in a horizon, thus making it possible to schedule a horizon of a reasonable length. However, the main objective here is improving the productivity of the QCs. Reducing the amount of empty travel is not their explicit objective and is only considered for helping speed up the QC processing. Other previous works on AGV dispatching advocate the use of a policy. A policy can be viewed as a mapping from a set of states to a set of actions. In the context of AGV dispatching, a state is a given job situation in the terminal, and an action corresponds to either directly selecting a job among the candidate jobs or applying one dispatching rule among other alternative rules. An advantage of using such a policy is that it can take into account various aspects in making a decision by representing a state with a set of relevant features. Another advantage is that it usually takes little computation time, although longer than with simple heuristic rules, in applying a policy under any given situation. Methods for obtaining a good policy take a few different approaches. One such approach searches directly for a good policy in a policy space by using a simulated annealing or genetic algorithm [12,14–16,25–27]. In this approach, each candidate policy is evaluated by applying it to various situations through simulations and observing the resulting performances. One problem with this approach is that the search algorithms merely find a policy that shows the best average performance for the set of simulation scenarios provided for the evaluations during the search. 
While it may be possible to obtain a policy specialized to a certain situation by providing the search algorithms with only the scenarios of that particular situation, it is not feasible to derive all sorts of policies a priori for a set of virtually infinite situations. Another approach conducts offline supervised learning to obtain a good policy from a set of decisionmaking examples collected under various situations [27–30]. This

approach also suffers from the problem of an infinite number of situations. Yet another viable approach to obtaining policies is to apply a reinforcement learning method. The task of reinforcement learning is to learn an optimal policy by reflecting the observed rewards fed back after performing a sequence of actions. When learning to play chess, for example, the goodness or badness of each individual move is not clearly seen in the middle of the game. Only when the game is over are certain positive or negative rewards made available. In such an environment, reinforcement learning is the only feasible way to learn from experience. Early applications of reinforcement learning were attempted to solve small-scale and well-formalized problems, such as toy-game playing or robot control in which the numbers of both states and actions are rather small [31–35]. More recent works have reported applications to larger-scale real-world problems [36–38], among which the study by Zeng et al. [38] is particularly relevant to our own. In this study, a reinforcement learning algorithm called Q-learning is used to learn optimal dispatching policies for yard cranes and yard trailers in a conventional container terminal. A state or situation is represented simply by a single attribute, which is the number of waiting QCs for trailer dispatching and the number of waiting trailers for crane dispatching. An action is not the selection of a job to be assigned but the selection of a dispatching rule to be applied among the three given heuristic rules. While such an over-simplification results in a worse performance than ours, reinforcement learning itself seems to be a good option for learning policies in an uncertain and dynamic environment because it does not demand a long CPU time and is thus appropriate for real-time adaptation. However, reinforcement learning algorithms using a nonlinear function of features often diverge, as discussed by Tsitsiklis and Van Roy [39]. We also observed that our trial of reinforcement learning of AGV dispatching policies using a more realistic and complicated state representation failed to converge. Achieving convergence calls for additional major research effort, particularly due to the fact that the policy to be learned for AGV dispatching is dynamic, unlike the static policies pursued in many previous studies.

4. Online preference learning for AGV dispatching

Our dispatching scheme adopts the preference function of the type proposed by Cohen et al. [13] to select a job for an AGV. The preference function adapts to changing situations by being continually updated through learning from the preference relation examples collected through simulations. In the following, we first describe the dispatching policy based on the preference function, and then describe how we can learn and update the preference function online.

4.1. AGV dispatching using a policy

Each time an AGV finishes its previous job, a dispatching policy is invoked for a selection of the best next job assignment among a set of candidate jobs under the current situation. The candidate jobs are collected from the loading or discharging sequence of each QC; if the first job remaining in the ongoing sequence of a QC is not yet assigned an AGV, it becomes a candidate job. A dispatching policy recommends a job by evaluating the candidate jobs based on various criteria. Some of these criteria are designed to minimize the empty-travel distance of the AGV, and others are designed to minimize the average QC processing time.
In the following, we first describe how a dispatching policy can be made up of a preference function that compares a pair of candidate jobs. We then describe how a job is represented by a set of attributes obtained from multi-criteria evaluations. After comparing the preference function with


the scoring function of Kim et al. [12], we describe how the performance evaluation of applying policies for AGV dispatching is made in regard to achieving the objectives.

4.1.1. Dispatching policy based on a preference function

A preference function F returns a real number in [0,1] given a pair of candidate jobs, each of which is represented by a d-dimensional real vector indicating the scores of the job in various aspects under the current situation, i.e.,

F : R^d × R^d → [0, 1].   (1)

A return value closer to 1 indicates a stronger preference of the first candidate job to the second. When k different candidate jobs are given, the most preferred job can be identified by applying this pairwise preference function to every possible pair, and combining the results. However, conflicts arise if the preferences are inconsistent. For example, if the preferences for three given candidate jobs x, y, and z are x → y (prefer x to y), y → z, and z → x, then there cannot be a single most-preferred job. The greedy ordering heuristic by Cohen et al. [13] resolves such conflicts by rating the given candidate jobs based on their potential values of preferences. The potential value v(x|J_σ) of the preference of a candidate job x given a set J_σ of candidate jobs under situation σ is calculated by

v(x|J_σ) = Σ_{y ∈ J_σ−{x}} (F(x, y) − F(y, x)).   (2)

Then, the candidate job with the highest potential value of preference is selected as the most preferred job in J_σ. Hence, the dispatching policy π for a given set J_σ of candidate jobs under situation σ can be formally represented as

π(J_σ) = arg max_{x ∈ J_σ} v(x|J_σ).   (3)
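To make the selection rule of Eqs. (1)–(3) concrete, the sketch below applies the greedy ordering heuristic to a list of candidate feature vectors. It is an illustrative reconstruction only: the `preference` callable stands in for the learned ANN-based preference function F of Eq. (1), and the names used here are ours, not the authors'.

```python
from typing import Callable, List, Sequence

def select_best_job(candidates: List[Sequence[float]],
                    preference: Callable[[Sequence[float], Sequence[float]], float]) -> int:
    """Return the index of the most preferred candidate.

    Implements Eqs. (2)-(3): each candidate x is rated by its potential value
    v(x|J) = sum over y != x of (F(x, y) - F(y, x)), and the candidate with the
    highest potential value is selected.
    """
    best_index, best_value = 0, float("-inf")
    for i, x in enumerate(candidates):
        v = sum(preference(x, y) - preference(y, x)
                for j, y in enumerate(candidates) if j != i)
        if v > best_value:
            best_index, best_value = i, v
    return best_index
```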

4.1.2. Representation of a candidate job

Using the preference function F of Eq. (1) for AGV dispatching requires providing the d-dimensional vectors that represent the candidate jobs to be compared. Let x be the container pertaining to a candidate job. The d-dimensional vector x that represents this job is then built from the evaluations of x based on d different criteria, i.e.,

x = (C1(x), C2(x), . . ., Cd(x)),   (4)

where Ci(x) is the evaluation value of x under the current situation based on the ith criterion. Our implementation employs nine criteria, which are essentially the same as those used by Kim et al. [12] in their scoring function. The adoption of all of these criteria as features to represent candidate jobs was justified empirically. When any one of the criteria was missing, the scoring-function-based dispatching policies showed degraded performances. Since the scoring function by Kim et al. [12] is designed for a minimization problem, smaller scores indicate higher preferences. The details of the nine criteria are given below.

• C1(x) indicates the urgency of container x, which is obtained by measuring the time remaining until the due time of x and then subtracting from it the minimum such value in the candidate set.
• C2(x) is the difference between the expected arrival time of the current AGV to the pickup HP of x and that of the other quickest AGV. It indicates how advantageous the current AGV is in processing x over the other competing AGVs in terms of the arrival time. This criterion enables a dispatching decision to be made considering not only the current AGV that has already finished its job but also other AGVs that will finish their jobs in the near future.
• C3(x) is the time taken for the crane (either a QC or an ASC) to become ready to handover x to an AGV. If the crane is scheduled to do other jobs of higher priority than x, the estimated processing time for each of these jobs is included.
• C4(x) is the empty-travel distance of the current AGV to the pickup HP of x.
• C5(x) is the negation of the required loaded-travel distance to the destination HP of x.
• C6(x) is the negation of the average delay per container of the QC planned to process x. The purpose of taking a negation is to make the scores smaller for higher preferences.
• C7(x) is −1 if x is a loading container, and is 1 if it is a discharging container.
• C8(x) is the relative remaining workload of the ASC that will process x, i.e.,

  C8(x) = −Wx / (2·Wavg)  if x is a loading container;  −1/2  otherwise,   (5)

  where Wx is the remaining workload of the ASC of the block where x is stored, and Wavg is the average remaining workload of all ASCs in all blocks. C8 gives a higher priority to the loading container in a busier block. The value becomes −1/2 when the workload is equally distributed. For a discharging container, the value is fixed to −1/2 because they are preplanned by the terminal operating system to be headed to the blocks with a low workload.
• C9(x) is the likelihood of a dual cycle when the job for x is assigned to the current AGV. Let xp be the container of the previous job the current AGV just finished, and let xf be the container of the future job available at the HP where the AGV will finish the job for x. If handling x after xp forms a dual cycle, then C9(x) = −1. If handling x after xp does not form a dual cycle but handling x followed by xf does, then C9(x) = −0.5. Otherwise, C9(x) = 0.

The score based on each of these nine criteria as calculated above is normalized to a value in [0,1] using the following fuzzy function:

Φ[l(i),u(i)](vi) = 1  if vi > u(i);  0  if vi < l(i);  (vi − l(i)) / (u(i) − l(i))  otherwise,   (6)

where l(i) and u(i) are the empirical lower and upper bounds of the ith criterion, respectively, and vi is the score of the ith criterion before normalization. The empirical bounds are obtained through separate simulations with a number of randomly generated scenarios of AGV operation. Note that the execution of these scenarios involves the dispatching of AGVs. Since we have not yet been provided with a good dispatching policy at this stage, we simply use the earliest-deadline-first heuristic, which is a frequently used dispatching policy in many conventional container terminals. However, more investigation on the impact of simulating with other dispatching policies for deriving the bounds would be needed as future work.
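As a concrete illustration of Eqs. (4) and (6), the following sketch normalizes a raw nine-criterion score vector into the [0,1] feature vector that is fed to the preference function. The raw criterion scores are assumed to be computed elsewhere; the bounds l(i), u(i) and the sample values are hypothetical, not the ones used in the paper.

```python
from typing import List, Sequence

def normalize(v: float, lower: float, upper: float) -> float:
    """Fuzzy normalization of Eq. (6): clip to [0,1] between empirical bounds."""
    if v > upper:
        return 1.0
    if v < lower:
        return 0.0
    return (v - lower) / (upper - lower)

def job_vector(raw_scores: Sequence[float],
               lower_bounds: Sequence[float],
               upper_bounds: Sequence[float]) -> List[float]:
    """Build the d-dimensional representation of Eq. (4) from raw criterion scores."""
    return [normalize(v, lo, up)
            for v, lo, up in zip(raw_scores, lower_bounds, upper_bounds)]

# Example with hypothetical bounds for the nine criteria C1..C9.
raw = [120.0, -15.0, 30.0, 180.0, -260.0, -12.0, -1.0, -0.5, -1.0]
low = [0.0, -300.0, 0.0, 0.0, -600.0, -60.0, -1.0, -1.0, -1.0]
up = [600.0, 300.0, 300.0, 600.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(job_vector(raw, low, up))
```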

4.1.3. Preference function versus scoring function

It is worth pointing out that a dispatching policy can be obtained in a more straightforward manner if the scoring function of the type adopted by Kim et al. [12] is used instead of the pairwise preference function. According to Kim et al. [12], score s(x) for a candidate job x is calculated by a weighted sum of all the criteria described above, i.e.,

s(x) = w · x = Σ_{i=1}^{n} wi Ci(x).   (7)

Their dispatching policy M then simply selects the best job x* from the set J of all the candidate jobs with the minimum score:

M(J) = x* = arg min_{x ∈ J} s(x).   (8)
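For contrast with the preference-based policy above, the weighted-sum policy of Eqs. (7)–(8) reduces to a few lines; in the authors' setup the weight vector w would come from the offline policy search of Kim et al. [12], so the sketch below is only a structural illustration.

```python
from typing import List, Sequence

def scoring_policy(candidates: List[Sequence[float]],
                   weights: Sequence[float]) -> int:
    """Select the candidate with the minimum weighted-sum score (Eqs. (7)-(8))."""
    scores = [sum(w * c for w, c in zip(weights, x)) for x in candidates]
    return min(range(len(candidates)), key=lambda i: scores[i])
```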

The advantage of using this policy is that the scoring function, unlike the preference function, does not lead to any conflicts in ranking the jobs. Kim et al. [12] optimized this policy by directly searching for a good weight vector for the scoring function of Eq. (7), with each candidate weight vector evaluated through a simulation of applying the corresponding policy to various scenarios. However, any learning method other than this direct search can hardly be used to obtain a good weight vector because it is not easy to prepare training examples labeled with appropriate scores. For this reason, our online learning method adopts a preference function for the dispatching policy.

4.1.4. Performance measurement of AGV dispatching

Since the objectives of our AGV dispatching problem are to minimize the average QC processing time and minimize the empty travel by AGVs, the goodness of the dispatching decisions should also be measured in that regard. Suppose n jobs, where n is a sufficiently large number, have been processed using a dispatching policy or a series of policies adapted to changes in the situation. The following weighted sum is the objective function according to which the long-term performance of the dispatching decisions on these n jobs is measured:

wT · Tn + wD · Dn,   (9)

where Tn is the average QC processing time per container up to n jobs, Dn is the average empty-travel distance up to n jobs, and wT and wD are the respective weights. Here, Tn and Dn are calculated as follows:

Tn = (|Q| / n) · (tn − s),   (10)

Dn = (1/n) · Σ_{q ∈ Q} Σ_{j ∈ Fq,n} ej,   (11)

where Q is the set of QCs used, tn is the time when the nth job is completed, s is the start time of all jobs, Fq,n is the set of jobs completed by QC q up to time tn, and ej is the empty-travel distance of an AGV for job j. Tn is obtained by dividing the total job processing time by the average number of jobs processed per QC. Tn can therefore be viewed as the average QC makespan per job. Dn is obtained by dividing the total empty-travel distance needed for all jobs by the total number of jobs. The values of wT and wD can vary depending on the relative importance of the two objectives.

4.2. Training examples

After each time a job is assigned to an AGV using the policy based on the preference function described above, the preference function is updated for use in the next round of job assignment. This is achieved by collecting new training examples and learning an updated preference function from the recently collected examples. A training example e consists of a pair of candidate jobs (x, y) and a binary label r indicating the preference, i.e., e = ((x, y), r), where r is 1 if x is preferred, and is 0 otherwise. Recall that each candidate job is represented by a vector of the nine attributes described above.
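Given the simulation-based evaluations of the candidates (described in Section 4.2.2), the pairing scheme just described can be sketched as follows. The names and types are ours, not the authors' code; smaller evaluation values are assumed to be better, matching the minimization objective of Eq. (12).

```python
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[Sequence[float], Sequence[float]], int]

def make_examples(candidates: List[Sequence[float]],
                  evaluations: List[float]) -> List[Example]:
    """Pair the best-evaluated job with every other candidate.

    Produces k-1 positive examples ((best, other), 1) and their inversions
    ((other, best), 0) as negative examples, as described in Section 4.2.
    """
    best = min(range(len(candidates)), key=lambda i: evaluations[i])
    examples: List[Example] = []
    for i, other in enumerate(candidates):
        if i == best:
            continue
        examples.append(((candidates[best], other), 1))
        examples.append(((other, candidates[best]), 0))
    return examples
```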

Fig. 3. Change in the survival probability in a pool with the number of rounds.
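Fig. 3 itself is not reproduced by the extraction, but the survival probability it plots (discussed in Section 4.2.1 below) is easy to recompute. The snippet contrasts sampling without replacement, where an example survives m rounds with probability (R/(R+q))^m, against truncation, using the same R = 30,000 and q = 10 quoted in the text.

```python
R, q = 30_000, 10  # reserve-pool bound and examples added per round (from the text)

def survival_sampling(m: int) -> float:
    """Probability that an example survives m rounds of sampling without
    replacement: geometric decay (R / (R + q)) ** m."""
    return (R / (R + q)) ** m

def survival_truncation(m: int) -> float:
    """Truncation keeps exactly the newest R examples: survival is 1 until the
    example is older than R/q rounds, then 0."""
    return 1.0 if m <= R // q else 0.0

for m in (0, 1_000, 3_000, 5_000, 10_000):
    print(m, round(survival_sampling(m), 3), survival_truncation(m))
```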

Suppose a job assignment is simply made among k candidate jobs. Our online preference learning scheme (OnPL) first evaluates each of these k jobs through a simulation of a short look-ahead horizon, which will be explained later, and creates a set of training examples based on this evaluation. It generates k − 1 positive examples (r = 1) by pairing the best job with each of the rest, and k − 1 negative examples (r = 0) by inverting the positive pairs. Since 2(k − 1), the total number of training examples generated, is usually not sufficiently large to enable a reliable learning of a whole new function, OnPL maintains a pool of recent training examples to which new training examples are accumulated. All examples in this pool are then used for learning an updated preference function. As more past examples are reserved in the pool, however, the function learned becomes less sensitive to new changes in the environment. 4.2.1. Reserve pool of recent training examples OnPL keeps training examples in its reserve pool, where the lifetime of an example is determined probabilistically based on its age; older examples have a lower probability of dwelling in the pool. Once the pool grows to and exceeds a predetermined size, examples to remain in the pool are selected by sampling without replacement to maintain the size. Let R be the predetermined bound on the number of reserved examples in the pool. If the size of the pool exceeds R for the first time by adding q new examples, all of these examples are fed to the learner for an update of the preference function. In the next round of update, however, R examples are selected from the pool by sampling without replacement before adding q new examples into the pool. This repeated sampling without replacement from then on has the effect of making older examples harder to survive in the pool. For the simplicity of calculation, suppose that the number of examples newly added to the pool is always the same, i.e., q. The probability for an example to be selected in a certain round is then R/(R + q), and the probability for the same example to survive after m rounds is (R/(R + q))m . Fig. 3 shows the exponential decay of this probability with the number of rounds for R = 30,000 and q = 10. The figure also shows the survival probability of the truncation selection according to which the oldest q examples are deleted in every round. As described in Section 5.2, it has been shown empirically that sampling without replacement is better than truncation for the learning of a good preference function online. 4.2.2. Evaluation of candidate jobs After a dispatching decision is made using the current preference function, OnPL evaluates each of the candidate jobs seen at the time of this decision. Based on this evaluation, OnPL generates a set of training examples that will be used to update the current


preference function. Let x be one of the jobs to be evaluated. To foresee the goodness of assigning x to a just-freed AGV, some subsequent job assignments in a short look-ahead horizon are simulated, and the overall performances are measured. In this simulation, all dispatching decisions are made using the current preference function. The AGV routing and traffic control are done in realistic detail by using the simulator developed by Bae et al. [40]. To save the time required for the simulation, the number of future jobs simulated including x is set to less than 20 in our implementation. Note that the evaluation of a candidate job is made on a short-term basis by running the simulation of a small number of subsequent job assignments in a short look-ahead horizon. Consequently, the evaluation criteria are slightly different from those of the original objective function shown by Eqs. (9)–(11). After L additional jobs are processed, the candidate job evaluation is made using the following objective function:

wT · Tn+L + wD · Dn+L,   (12)

where Tn+L and Dn+L are calculated as

Tn+L = (tn+L − s) / min_{q ∈ Q} |Fq,n+L|,   (13)

Dn+L = (1/(n+L)) · Σ_{q ∈ Q} Σ_{j ∈ Fq,n+L} ej.   (14)

Although Dn+L is calculated basically in the same way as Dn , Tn+L is obtained by calculating the average job processing time of the most retarded QC whose number of jobs processed is the smallest. Calculating Tn+L differently from Tn brings about the effect of promoting the equalization of the job progress among all QCs. The reason is that Tn+L gets smaller when the number of jobs processed by the most retarded QC becomes closer to the average number of jobs processed per QC. If Tn style measure is insisted upon in shortterm evaluations, jobs with a shorter processing time are preferred so that the average QC processing time per container gets smaller. However, this bias toward jobs with a short processing time leaves jobs with a long processing time unselected until a later time, eventually causing delays of the QCs to which these jobs are assigned. Note that the values of wT and wD in Eq. (12) should remain the same as those used in the original objective function of Eq. (9) because the relative importance of the two objectives in any short-term evaluation cannot be different from that of a long-term evaluation. 4.3. OnPL algorithm Fig. 4 is the pseudocode of the OnPL algorithm. In the first step, OnPL selects a job from among the candidate jobs by using the current dispatching policy, and then assigns this job to the AGV that requested a new job. Note that the job selected by the current dispatching policy is not necessarily the real best job. In particular, at the very beginning, OnPL selects an arbitrary job because it starts with a random initial policy. In the second step, OnPL evaluates each of the candidate jobs by simulating L consecutive job assignments in preparation for generating the training examples. The current dispatching policy is used in this simulation. In the third step, the best job is identified by examining the evaluation values of all candidate jobs. Then, (k − 1) positive and (k − 1) negative examples are generated by forming ordered pairs of the best job with each of the remaining jobs in one and in reverse order, respectively, where k is the number of candidate jobs evaluated in the previous step. Each of these examples is tagged with a weight to indicate the reliability. Since the evaluation by a short simulation using the current dispatching policy may be inaccurate, the preference relation represented by an example can be erroneous. Such noisy examples become the major source of preference conflicts explained in

Fig. 4. Pseudocode of OnPL algorithm.
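The pseudocode of Fig. 4 does not survive the extraction, so the sketch below condenses the five steps described in the surrounding text (Section 4.3) into runnable form. It reuses select_best_job from the earlier sketch; the evaluation, example-generation, and relearning components are injected as callables because they stand for parts of the system explained elsewhere in the paper, not for the authors' actual code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Job = Sequence[float]
WeightedExample = Tuple[Tuple[Job, Job], int, float]   # ((x, y), label, weight)

def onpl_round(candidates: List[Job],
               preference: Callable[[Job, Job], float],
               evaluate: Callable[[Job], float],
               make_weighted_examples: Callable[[List[Job], List[float]], List[WeightedExample]],
               relearn: Callable[[List[WeightedExample]], None],
               pool: List[WeightedExample],
               R: int = 30_000) -> int:
    """One dispatching round of OnPL, following the five steps of Fig. 4:
    select, evaluate by look-ahead simulation, generate weighted examples,
    renew the reserve pool by sampling without replacement, and relearn."""
    chosen = select_best_job(candidates, preference)       # step 1: policy of Eqs. (2)-(3)
    values = [evaluate(x) for x in candidates]              # step 2: short simulation, Eq. (12)
    new_examples = make_weighted_examples(candidates, values)  # step 3: pairs weighted by Eq. (15)
    if len(pool) > R:                                        # step 4: reserve pool, Section 4.2.1
        pool[:] = random.sample(pool, R)
    pool.extend(new_examples)
    relearn(pool)                                            # step 5: RPROP update of the ANN
    return chosen
```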

Section 4.1. OnPL tries to alleviate this noise problem by giving higher weights to more reliable examples. The weight w for the example made up of a pair (j*, j) of candidate jobs from the set J of all current candidate jobs is calculated as follows:

w = ( |V[j*] − V[j]| / max_{i,k ∈ J} |V[i] − V[k]| )^β,   (15)

where β is a control parameter. The value of w gets larger as the difference in the normalized evaluation values between the two jobs j* and j is larger, and a larger β makes w more discriminative. In the fourth step, the examples just generated are merged with the examples reserved in the pool. Before the merge operation,


Table 1
Settings used in the scenarios.

Total number of delivery jobs     3000
Number of QCs at work             6
Number of yard blocks involved    14
Number of ship bays per QC        4
Number of jobs per ship bay       62.5 < 125 + N(0, 2.6²) < 187.5
Total number of AGVs              12, 18, 24

however, the size of the pool is checked. If it exceeds the predetermined bound R, the pool is reduced by sampling R examples without replacement. In the final step, the dispatching policy is updated by learning a new preference function from the renewed pool of training examples labeled 0 or 1. Owing to the learning error, the learned preference function can output any real value in [0,1]. The preference function is implemented as an artificial neural network, which is advantageous for incremental learning. The learning algorithm used is the resilient backpropagation (RPROP) algorithm developed by Riedmiller and Braun [41]. RPROP is known to be a fast learner that requires almost no parameter tuning. Unlike other backpropagation learning algorithms, it takes into account not the magnitude but the sign of the partial derivative of the total error function. In each iteration of the weight update, RPROP checks whether the sign of the partial derivative is different from that in the previous iteration. If the sign is changed, the update value used in the previous iteration is multiplied by a factor η− < 1. If the sign remains the same, the update value is multiplied by another factor η+ > 1. The weight is then changed by this new update value. The direction of change is opposite to the direction of the weight's partial derivative, so as to minimize the total error function. The recommended values for the two parameters are η+ = 1.2 and η− = 0.5.

5. Experimental results

Two sets of experiments were conducted to validate the performance of OnPL. First, the effect of accumulating training examples in a pool was investigated by trying different pool sizes and replacement strategies. In addition, the best value for parameter β in Eq. (15) was determined empirically to best discriminate reliable examples from unreliable ones. Second, the performance of OnPL was compared with other methods for AGV dispatching. These experiments were all conducted by running simulations with various scenarios of discharging and loading operations, and then measuring the makespan of the QCs (Eq. (10)) and the empty-travel distances (Eq. (11)) of the AGVs. All of the experiments were conducted on a PC with an Intel® Core™ i7-2600 CPU (3.40 GHz) and 8.0 GB of RAM under Windows 7. The language used for the implementation is C++ in Visual Studio 2005. In the following, we begin with a detailed description of how we generated the simulation scenarios used in these experiments.

5.1. Simulation scenarios

Table 1 summarizes the basic settings of our simulation scenarios in an automated container terminal. It is assumed that six QCs are at work servicing a vessel at berth, with a total of 3000 jobs. The number of yard blocks the containers are delivered to/from the QCs is 14. All these blocks are in consecutive locations. Each QC services four ship bays consecutively with the discharging jobs being completed before the loading jobs at each bay. The number of jobs per ship bay is normally distributed with a mean of 125 (note that 125 × 4 × 6 = 3000) and a standard deviation of 2.6, but this number is restricted to be no less than 63 and no more than 187. The jobs at each ship bay are divided into 3–6 groups, within each of which the individual jobs are done in sequence (see Fig. 2). The number of AGVs at work is either 12, 18, or 24. In the experiments with

24 AGVs, OnPL shows a shorter empty-travel distance but lower QC productivity than the policy search method by Kim et al. [12], which is the strongest competitor to OnPL in terms of performances although it cannot be applied in real time due to the very long computation time required for its search. When the number of AGVs is 18 and 12, OnPL shows the best performance for both the empty-travel distance and QC productivity. It appears that the advantage of OnPL relative to other competing methods becomes clearer as the AGVs face tougher working conditions with higher job loads. All of the experimental results given below assume the use of 12 AGVs. However, the results with 18 and 24 AGVs are also briefly reported toward the end of Section 5. The ASCs at the yard block serve the AGVs on a first-come-first-served basis. The blocks where discharged containers are to be stored are pre-planned in the planning stage by considering the travel distances for delivery and the expected workloads of the ASCs. In addition, the containers to be loaded onto a same vessel are stored together at the blocks close to the berth where the vessel is planned to anchor. Therefore, in our scenarios, the containers belonging to a same group are assumed to be placed at the nearby blocks following a normal distribution. Fig. 5 illustrates how the containers are distributed among the blocks at the yard. The containers of a same group handled by QCi are placed at the blocks following the normal distribution N(μi, σi²), where μi is again normally distributed by N(mi, si²). In the figure, the mean value μ2 = 5 indicates that the containers belonging to a group to be handled by QC2 are distributed around block B5. As σ2 gets larger, the containers are widely spread out from B5. Note that μ2 itself is also normally distributed by following N(m2, s2²), where m2 is set to the nearest block from QC2. If s2 is small, μ2 is very likely to be the nearest block B4. If s2 is large, μ2 can be a block far from QC2. In short, we can generate scenarios of different difficulty levels by adjusting σi and si. Large variances lead to long loaded-travel distances and unequal job-load distributions. We prepared two test cases T1 and T2, each consisting of 50 scenarios in a sequence, where there are 3000 delivery jobs in each scenario. The ratio between the discharging and loading containers in each scenario is set to 1:1. The scenarios in T1 are generated by setting the parameters as (m1, m2, . . ., m6) = (2, 4, 6, 9, 11, 13), si = 0.5, and σi = 1.0. In these scenarios, the containers to be handled by each QC are narrowly distributed around the closest block most of the times. Therefore, the distances of loaded travel of the AGVs are all relatively short and uniform. Test case T2 consists of more pessimistic scenarios. The parameter settings for the scenario generation are (m1, m2, . . ., m6) = (4, 5, 6, 7, 8, 9), si = 10, and σi = 5. With a high probability, in these scenarios, the containers are widely scattered among the blocks, and the required loaded-travel distances are non-uniform. To make certain that T2 is really different from T1, the scenarios that look similar to those in T1 are discarded through the following filtering process. First, a scenario s1 is randomly selected from T1. Next, a dispatching policy π1 optimized for s1 is obtained by applying the policy search method of Kim et al. [12]. Whereas Kim et al. [12] provided their search algorithm with a set of scenarios to find a policy that shows a good average performance for these scenarios, we provide the search algorithm with only one scenario s1 to derive a policy π1 that is specialized to s1. Then, to each scenario generated for T2, both π1 and the simple earliest-deadline-first dispatching heuristic are applied and the results compared. If π1 shows a better performance than the earliest-deadline-first heuristic, the scenario is discarded. In this way, only the scenarios in which π1 does not work well are collected for T2. For the performance test of OnPL, all scenarios in each test case are serially combined to form a single long sequence of scenarios. Notice that T2 as a test case imposes a much higher degree of dynamically changing situations than T1 because of the much larger variances in its scenario generation.
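The block-assignment scheme just described is straightforward to simulate. The sketch below draws container block positions for one QC group in the way the text describes (μi drawn from N(mi, si²), then block indices drawn from N(μi, σi²)); rounding and clipping to the 14 blocks is our assumption, since the paper does not spell that detail out, and the parameter values shown are the T1 settings quoted above.

```python
import random
from typing import List

NUM_BLOCKS = 14  # yard blocks B1..B14 (Table 1)

def draw_blocks(m_i: float, s_i: float, sigma_i: float, n_containers: int,
                rng: random.Random) -> List[int]:
    """Sample storage blocks for the containers of one QC group.

    mu_i ~ N(m_i, s_i^2) picks the centre block of the group, and each
    container's block index is drawn from N(mu_i, sigma_i^2), then rounded
    and clipped to the valid block range (an assumption for illustration).
    """
    mu_i = rng.gauss(m_i, s_i)
    blocks = []
    for _ in range(n_containers):
        b = round(rng.gauss(mu_i, sigma_i))
        blocks.append(min(max(b, 1), NUM_BLOCKS))
    return blocks

rng = random.Random(0)
# T1-style settings for QC2: m2 = 4, s_i = 0.5, sigma_i = 1.0
print(draw_blocks(4, 0.5, 1.0, 10, rng))
```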


Fig. 5. Distribution of containers at different blocks.

5.2. Adjustment of the design parameters of OnPL

All the experiments described in this section were conducted using test case T2. To implement the preference function using an artificial neural network (ANN), both a single-layer perceptron (OnPL-SLP) and a multi-layer perceptron (OnPL-MLP) were attempted and compared. The number of hidden nodes in OnPL-MLP was empirically determined to be 11. The weights for both ANNs were initially set to random values in [−0.01, 0.01]. For learning the ANN, the RPROP algorithm described in Section 4.3 was used. There are four parameters in this algorithm: Δmax and Δmin are the maximum and minimum sizes of the weight update allowed, respectively, and η+ and η− are the increase and decrease rates of the weight update, respectively. Table 2 shows the popularly used parameter settings for RPROP. In this table and others that follow, the numbers in bold indicate the best results. Under these settings, RPROP uses all examples in the reserve pool to train the ANN for one epoch in each round of learning after every dispatching decision. Since the examples in the pool are replaced by only a very small portion in each round, most of the examples are reused for training for many epochs throughout the rounds that follow until they are removed from the pool.

Table 2
Parameter settings for RPROP.

Parameter                        Value
Update increase rate (η+)        1.2
Update decrease rate (η−)        0.5
Maximum size of update (Δmax)    1.0
Minimum size of update (Δmin)    1.0E−20

To see which values are good for parameter β of Eq. (15), which controls the weights for the training examples, integer values from 0 to 10 were attempted for both OnPL-SLP and OnPL-MLP. In these experiments, the upper bound R for the number of examples reserved in the pool was set to 30,000, the length L of the simulation for evaluating a candidate job was set to 20 jobs, and the period of update I of the dispatching policy was set to 1 iteration (i.e., updated at every iteration). Weights wT and wD in Eq. (9) were both set to 1. Table 3 summarizes the results of the experiments showing the performances in terms of our two objectives. When there are multiple objectives, a case ci is said to dominate another case cj if ci is better than cj in at least one objective and ci is not worse than cj in all of the other objectives. For OnPL-SLP, the result with β = 5 dominates the results using other values. In contrast, for OnPL-MLP, the result with β = 6 dominates the others. The fact that all nonzero values of β provide better results than zero implies that the weighting scheme based on a candidate evaluation certainly contributes to an improvement in the learning performance. Figs. 6 and 7 compare two strategies for replacing the reserved examples in the pool, i.e., truncation and sampling without replacement. The parameter settings for these experiments are R = 30,000, L = 20, and I = 1 as before, and β = 5 for OnPL-SLP and β = 6 for OnPL-MLP. For each replacement strategy, the performance was tested by varying the weights wT and wD of Eq. (9) from 1:1 to 2:1, 5:1,

Table 3
Performances with different values of weighting parameter β.

           β     Makespan (s)    Empty-travel distance (m)
OnPL-SLP   0     188.2           240.7
           1     186.7           240.3
           2     187.0           240.5
           4     186.9           240.1
           5     185.6           238.7
           6     186.4           239.9
           7     186.7           238.9
           10    186.3           239.5
OnPL-MLP   0     188.1           237.5
           1     187.3           237.8
           2     187.0           237.2
           4     186.7           237.4
           5     188.1           237.9
           6     186.6           237.0
           7     186.9           237.3
           10    187.8           237.0
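The comparisons in Tables 3 and 4 use the dominance relation defined in the text (better in at least one objective and not worse in any). A minimal check for two (makespan, empty-travel distance) pairs, where smaller is better in both objectives:

```python
from typing import Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if case a dominates case b: a is no worse than b in every
    objective and strictly better in at least one (both objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Example: the OnPL-SLP rows for beta = 5 and beta = 0 in Table 3.
print(dominates((185.6, 238.7), (188.2, 240.7)))  # True
```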

Fig. 6. Comparison of truncation and sampling for OnPL-SLP.


Table 5
Parameter settings for the compared methods.

Method   Parameter                        Value
RH       Lookahead horizon                12 jobs
         Rescheduling interval            6 jobs
         Crossover (rate)                 Uniform (1.0 with exchange rate 0.2)
         Mutation (rate)                  Bit-flip (1/(3 bit × 12 jobs) = 1/36)
         Parent selection                 Binary tournament
         Fitness sharing (radius)         Yes (0.1)
         Population size × generations    25 × 20, 50 × 20, 50 × 50
PS       Crossover (rate)                 SBX (1.0 with spread factor 2.0)
         Mutation (rate)                  Non-uniform mutation (1/(9 weights) = 1/9)
         Parent selection                 Binary tournament
         Fitness sharing (radius)         Yes (0.1)
         Population size × generations    50 × 300
RL       Learning rate (α)                0.01
         Discount rate (γ)                0.1
OnPL     Inc./dec. rate (η+, η−)          1.2, 0.5
         Max./min. update (Δmax, Δmin)    1.0, 1.0E−20
         Number of hidden nodes           11 (only for MLP)
         Update interval                  1 iteration
         Weighting parameter (β)          SLP: 5.0; MLP: 6.0
         Pool size (R)                    SLP: 30,000; MLP: 60,000
         Lookahead length (L)             SLP: 20 jobs; MLP: 10 jobs

Fig. 7. Comparison of truncation and sampling for OnPL-MLP.

10:1, 20:1, 50:1, and 100:1. The observed average QC makespan and average empty-travel distance pairs (T, D) were then plotted in the objective space. Clearly, the results by sampling almost always dominate the results by truncation. Table 4 shows how the performance of OnPL changes with the upper bound R on the number of examples reserved in the pool. The other parameter settings are the same as in the previous experiments except that the weights wT and wD are both fixed to 1. For OnPL-SLP, the performance with R = 30,000 dominates all other values. When too many examples are reserved in the pool, the performance degrades because it becomes difficult to catch up with the changing situations owing to the overly large inertia. OnPL-MLP requires more examples than OnPL-SLP because of its topological complexity. It shows good performances when the number of examples is within the range of 60,000–120,000. A pool size of 30,000 examples corresponds to one scenario on the timeline. Recall that 3000 dispatching actions occur in one scenario, and 10 examples are generated after each dispatching action because there are six candidate jobs given from six QCs. If truncation is used as the replacement strategy, the pool will maintain the examples generated from the 3000 most recent jobs, which corresponds to the length of one scenario. While sampling without replacement maintains some very old examples, as indicated in Fig. 3, truncation completely forgets examples older than one scenario. The better performance by sampling seems to be because it allows learn-

Table 4
Performances using different pool sizes R.

           Pool size   Makespan (s)   Empty-travel distance (m)   CPU time for learning (s)
OnPL-SLP   1000        473.0          306.5                       0.32
           5000        364.3          283.4                       0.33
           10,000      186.8          241.0                       0.42
           20,000      187.1          240.1                       0.47
           30,000      185.6          238.7                       0.53
           60,000      186.0          239.1                       0.67
           90,000      185.7          239.2                       0.67
           120,000     186.3          239.3                       0.73
OnPL-MLP   1000        192.8          248.7                       0.38
           5000        188.6          240.3                       0.39
           10,000      185.9          238.2                       0.40
           20,000      186.3          236.8                       0.47
           30,000      185.8          236.5                       0.70
           60,000      185.4          236.1                       1.38
           90,000      185.3          236.1                       1.80
           120,000     185.9          235.8                       2.16


ing from not only recent but also long-term experiences. Table 4 also shows the CPU times taken to update the preference function through learning from the examples in the pool. The learning time includes the time taken to both generate training examples and calculate the weights for the examples. Although the time increases with the number of examples (i.e., pool size), it is still less than 1 s in most cases, which is sufficiently short for real-time processing in the context of AGV dispatching.

5.3. Performance comparison of OnPL with other methods

The performance of OnPL has been compared with that of some of the other methods reviewed in Section 3, i.e., the rolling horizon (RH) method, the policy search (PS) method, and the reinforcement learning (RL) method. As summarized in Table 5, RH uses a genetic algorithm to search for good dispatching decisions for the next 12 jobs in the look-ahead horizon and repeats the search after every six jobs. A uniform crossover and bit-flip mutation are used for reproduction, and binary tournament selection is used for the parent selection. To promote population diversity, a fitness-sharing scheme was used. Different population sizes and numbers of generations, namely, 25 × 20, 50 × 20, and 50 × 50, were attempted to compare the performances with increasing investment in the search. PS also uses a genetic algorithm for its policy search. Since each policy is represented by a vector of real numbers, a simulated binary crossover (SBX) and non-uniform mutation are used for reproduction. As with RH, binary tournament selection and fitness sharing were adopted. The population size and number of generations were set empirically to 50 and 300, respectively. To test the performance of PS, one scenario si was randomly selected

Table 6
Performance comparison on test case T1.

  Algorithm       Makespan (s)    Empty-travel distance (m)
  RH (25 × 20)    154.6 (4)       221.8 (7)
  RH (50 × 20)    154.1 (3)       213.1 (6)
  RH (50 × 50)    152.7 (1)       201.7 (5)
  PS              154.0 (2)       182.9 (1)
  RL              156.4 (7)       198.5 (4)
  OnPL SLP        155.1 (5)       186.2 (3)
  OnPL MLP        155.4 (6)       183.8 (2)
Then, πi was applied to all scenarios in Ti in the given sequence, and the overall performance was measured. RL uses the Q-learning algorithm to update the policy. The learning rate α and the discount rate γ were set to 0.01 and 0.1, respectively; more details regarding Q-learning, the learning rate, and the discount rate can be found in Zeng et al. [38]. The parameter settings for OnPL were described in Section 5.2 and are summarized again in Table 5. It should additionally be pointed out that the pool size was set to 30,000 and 60,000 for OnPL-SLP and OnPL-MLP, respectively, and that the look-ahead length, which is the length of the simulation used for job evaluation, was set to 20 and 10 jobs for OnPL-SLP and OnPL-MLP, respectively.

Table 6 compares the average QC makespan and average empty-travel distance obtained by applying each of the dispatching methods to test case T1. The weights wT and wD were both set to 1 in this set of experiments. The appropriate ratio of wT to wD should in practice be determined by terminal operators according to their own operational objectives; since we have no specific bias toward either objective, we simply set both weights to the same value. The numbers in parentheses indicate the ranks in each category. As expected, the performance of RH improved as more computational resources were invested: RH 50 × 50 ranks first in QC makespan, although only fifth in empty-travel distance. Among the policy-based methods, i.e., PS, RL, and OnPL, PS dominated, and RL was dominated by all the other methods. Since T1 consists of rather uniform scenarios, the online learning method shows no advantage here and fails to derive as good a policy as the direct policy-search method.

Performance comparisons on test case T2 are summarized in Table 7. Considering the way the scenarios in T2 are generated, we expect it to be more difficult for PS to obtain a policy that works well for all scenarios with high variance. Unlike the results on T1, PS fails to dominate OnPL-MLP, and the difference in their performances is not as large as before. RL shows the worst performance on T2, perhaps because it cannot keep up with the large variations in the scenarios given the simplified state definitions in its Q-learning, and because of its simplistic decision making for action selection, as discussed in Section 3.
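The RL baseline described above relies on the Q-learning update with α = 0.01 and γ = 0.1. Its state, action, and reward design follows Zeng et al. [38] and is not reproduced here; the sketch below only shows the generic tabular form of the update, which is our simplification rather than the exact implementation compared in the experiments.

```python
def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.01, gamma=0.1):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    # Q is a dict mapping (state, action) pairs to estimated values
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = Q.get((state, action), 0.0)
    # target: immediate reward plus the discounted best value of the next state
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q
```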

Table 8
Times taken for a dispatching decision, and CPU times for learning.

  Dispatching method    Dispatching time    Learning time
  RH (25 × 20)          <10 s               –
  RH (50 × 20)          <20 s               –
  RH (50 × 50)          <40 s               –
  PS                    <0.001 s            <12 h
  RL                    <0.001 s            <0.001 s
  OnPL SLP              <0.001 s            <1 s
  OnPL MLP              <0.001 s            <2 s

Table 8 compares the times taken to make a dispatching decision and to derive a dispatching policy. The long time taken by RH for dispatching makes it inappropriate for a real-time application. Since RH searches for good decisions at decision time, no additional learning time is required. The other methods all make a dispatching decision very quickly and thus pose no problem for real-time application. PS, however, takes up to 12 h of CPU time to derive an optimal policy through search, so it cannot prepare for changing situations in real time. Considering that in most container terminals the loading/discharging plan for a vessel becomes available only a few hours before the operation starts, PS will be unable to provide a custom-designed policy for that vessel. Still, its policy derived for general use is very competitive, as evidenced by the results in Table 7. The learning time of less than 2 s for OnPL-MLP is not much of a burden for real-time processing. Although the learning process is invoked after every dispatching decision, it runs only once the decision has been completed, which is not a problem unless the interval between consecutive dispatches is shorter than 2 s. The dispatching interval in our simulation settings is about 10 s on average. On the rare occasions when consecutive dispatching decisions must be made within a very short interval, one option would be to abort the learning process (see the sketch below).

The adaptability of OnPL to changes in the environment can be confirmed by the performance plots of OnPL and PS in the objective space shown in Figs. 8 and 9. For each dispatching method, the performance is measured with the ratio of the weights wT and wD of Eq. (9) varied from 1:1 up to 1000:1, and the resulting pairs of average QC makespan and empty-travel distance (T, D) are plotted. OnPL-S and OnPL-M in the legends indicate the policies learned online by OnPL-SLP and OnPL-MLP, respectively. PS-T1 and PS-T2 indicate the policies obtained offline by PS using a scenario randomly selected from test cases T1 and T2, respectively. Fig. 8 shows the performance of these policies on test case T2. Naturally, PS-T2 provides the best performance, whereas PS-T1 performs much worse than the OnPL variants. The results on test case T1, shown in Fig. 9, are the opposite.
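One way to organize such "dispatch now, learn afterwards" processing is sketched below. The interfaces (a prefers method on the policy and a relearn method on the learner) are hypothetical; the sketch only illustrates skipping a background update when the previous one has not yet finished, which stands in for aborting the learning process.

```python
import threading


class DispatcherWithBackgroundLearning:
    """Dispatch immediately; relearn the preference function in the background."""

    def __init__(self, policy, learner):
        self.policy = policy            # object with a prefers(job_a, job_b) method
        self.learner = learner          # object with a relearn(examples) method
        self._learning_thread = None

    def dispatch(self, candidate_jobs):
        # the decision itself is fast: repeatedly apply the pairwise
        # preference function to sort out the best candidate
        best = candidate_jobs[0]
        for job in candidate_jobs[1:]:
            if self.policy.prefers(job, best):
                best = job
        return best

    def after_dispatch(self, new_examples):
        # start relearning only if the previous update has completed;
        # otherwise skip this update (a stand-in for aborting the learning)
        if self._learning_thread is None or not self._learning_thread.is_alive():
            self._learning_thread = threading.Thread(
                target=self.learner.relearn, args=(new_examples,))
            self._learning_thread.start()
```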

Table 7
Performance comparison on test case T2.

  Algorithm       Makespan (s)    Empty-travel distance (m)
  RH (25 × 20)    178.9 (3)       260.3 (6)
  RH (50 × 20)    178.5 (2)       256.9 (5)
  RH (50 × 50)    177.6 (1)       250.8 (4)
  PS              183.9 (4)       237.0 (2)
  RL              190.3 (7)       321.6 (7)
  OnPL SLP        185.6 (5)       238.7 (3)
  OnPL MLP        185.8 (6)       236.5 (1)

Fig. 8. Performances on test case T2 with 12 AGVs.



Fig. 9. Performances on test case T1 with 12 AGVs.

Fig. 12. Performances of OnPL-MLP and PS-T1 on T2 for five different weight ratios.

Fig. 10. Performances on test case T2 with 18 AGVs.

On test case T1, PS-T1 is the clear winner, and PS-T2 performs worse than OnPL-MLP. To summarize, the offline method PS works well when the given scenarios are similar to the one from which its policy is learned, but its performance degrades when significantly different scenarios are given. In such situations, the online learning method OnPL shows performances that dominate those of PS, implying that OnPL adapts better to new scenarios. Fig. 10 shows the same performance plot for test case T2 when the number of AGVs is increased to 18. As in Fig. 8, OnPL dominates PS-T1; note that the average makespans are shorter overall than those in Fig. 8 because of the increased number of AGVs. Fig. 11 shows the plot for test case T2 when the number of AGVs is further increased to 24.

The average makespans there are even shorter than those in Fig. 10. In Fig. 11, neither OnPL nor PS-T1 dominates: OnPL is stronger at reducing the empty-travel distance, while in three cases PS-T1 achieves a shorter average makespan than OnPL, although the differences are less than 0.5 s.

Fig. 12 compares the performances of PS-T1 and OnPL-MLP on test case T2 with five different wT:wD ratios, namely 1:1, 5:1, 10:1, 50:1, and 100:1. The number of AGVs is fixed at 12 in all of these experiments. To test the performance of PS-T1, a scenario was randomly picked from test case T1, and a dispatching policy was derived from that scenario for each weight ratio. Each of these five policies was then applied to test case T2, and the pair of average QC makespan and average empty-travel distance (T, D) was measured. Plotting these five performance results in the objective space gives points similar to the five white squares shown in Fig. 12, from which the corresponding hyper-volume ratio (HVR) can be calculated (a sketch of this calculation is given below). By repeating these experiments ten times with different training scenarios randomly picked from T1, we obtained 10 different HVR values, whose average, 0.746, is indicated in the figure. Each white square in Fig. 12 in fact represents the average of these 10 experimental results for the corresponding weight ratio; Table 9 lists the specific values of all the averaged results. A similar series of experiments was conducted to measure the performance of OnPL-MLP on test case T2. Note, however, that there is no pre-training step for OnPL because it learns online. Instead, we reshuffle the order of the constituent scenarios in T2, because the performance of online learning depends on the order of the examples observed by the learner. After each reshuffling, the performance of OnPL-MLP on T2 was measured for the five weight ratios and the resulting HVR was calculated.
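For the bi-objective minimization setting used here, the hypervolume of a set of (T, D) points and its ratio against a reference box can be computed as sketched below. The choice of reference and ideal points is an assumption of this sketch, since the paper does not spell out the exact normalization behind its HVR values.

```python
def hypervolume_2d(points, ref):
    """Hypervolume dominated by bi-objective minimization points (T, D) tuples
    with respect to a reference point ref = (T_ref, D_ref); all points are
    assumed to be dominated by ref."""
    pts = sorted(set(points))                 # ascending in T, then in D
    front, best_d = [], float("inf")
    for t, d in pts:
        if d < best_d:                        # non-dominated point
            front.append((t, d))
            best_d = d
    # sum the rectangular strips between consecutive front points and ref
    hv, prev_t = 0.0, ref[0]
    for t, d in reversed(front):              # from largest T to smallest
        hv += (prev_t - t) * (ref[1] - d)
        prev_t = t
    return hv


def hypervolume_ratio(points, ref, ideal):
    """HVR as the fraction of the box [ideal, ref] dominated by the points."""
    total = (ref[0] - ideal[0]) * (ref[1] - ideal[1])
    return hypervolume_2d(points, ref) / total
```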

Fig. 11. Performances on test case T2 with 24 AGVs.

Table 9
Test results of OnPL-MLP and PS-T1 on T2 for five different weight ratios.

  Dispatching method    wT:wD    Makespan (s), mean ± std. dev.    Empty-travel distance (m), mean ± std. dev.
  PS-T1                 1:1      209.3 ± 44.9                      255.9 ± 10.2
                        5:1      194.6 ± 32.2                      265.1 ± 9.6
                        10:1     189.2 ± 9.1                       268.1 ± 9.4
                        50:1     185.1 ± 1.8                       280.3 ± 12.9
                        100:1    184.7 ± 1.2                       282.7 ± 14.2
  OnPL-MLP              1:1      185.9 ± 0.67                      236.5 ± 1.93
                        5:1      181.6 ± 0.38                      240.8 ± 1.41
                        10:1     180.5 ± 0.50                      243.9 ± 2.44
                        50:1     179.5 ± 0.62                      252.2 ± 2.00
                        100:1    178.8 ± 0.61                      255.6 ± 1.71

Table 10
Results of non-paired t-tests with 95% confidence comparing OnPL-MLP against PS-T1 on test case T2 for each weight ratio shown in Table 9 ("−": OnPL-MLP significantly shorter; "=": no significant difference).

  wT:wD    Comparison of makespan (probability of null hypothesis)    Comparison of empty-travel distance (probability of null hypothesis)
  1:1      = (1.5E−01)                                                 − (2.6E−04)
  5:1      = (2.6E−01)                                                 − (2.8E−05)
  10:1     − (1.9E−02)                                                 − (1.9E−05)
  50:1     − (2.0E−06)                                                 − (9.4E−05)
  100:1    − (5.3E−09)                                                 − (2.8E−04)

Repeating these experiments 10 times, we again obtained an average HVR, in this case 0.713, as indicated in Fig. 12. Each black triangle in the figure represents the average of these ten experimental results for the corresponding weight ratio, the specific values of which are listed in Table 9. A non-paired t-test confirmed that the average HVRs of 0.713 and 0.746 are significantly different at a confidence level of 95%. Table 10 shows the results of non-paired t-tests with 95% confidence comparing OnPL-MLP with PS-T1 on test case T2 for each weight ratio shown in Table 9. Both the makespan and the empty-travel distance of OnPL-MLP are significantly shorter than those of PS-T1 in most cases.
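Such a non-paired t-test can be reproduced, for example, with SciPy. Whether equal variances were assumed in the paper is not stated, so the sketch below uses Welch's variant (equal_var=False) as an assumption.

```python
from scipy import stats


def compare_methods(samples_a, samples_b, alpha=0.05):
    """Non-paired (Welch's) t-test: are two sets of results significantly different?"""
    t_stat, p_value = stats.ttest_ind(samples_a, samples_b, equal_var=False)
    return p_value, p_value < alpha

# usage sketch: pass the ten HVR (or makespan) values obtained per method, e.g.
# p, significant = compare_methods(hvr_onpl_mlp, hvr_ps_t1)
```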

6. Conclusion

This paper presented an online preference learning algorithm named OnPL that can dynamically adapt the policy for dispatching AGVs to changes in the environment of an automated container terminal. The policy is based on a preference function that is used to single out the best job among multiple candidate jobs, and policy adaptation is achieved by updating this preference function after every dispatching decision. For each update, OnPL generates new training examples and renews the example pool by replacing some old examples with new ones; the preference function is then relearned from the examples in the renewed pool. If the pool is too large, it retains too many very old examples, making it difficult to adapt the preference function to changing situations; an overly large pool also slows down the learning process, which must complete in real time. An appropriate pool size was therefore determined empirically by trying various sizes. For the replacement of old examples with new ones, sampling without replacement was shown to be better than simple truncation based on the age of the examples.

The preference function used for job selection is a pairwise preference function. Despite the difficulty arising from preference conflicts, a pairwise preference function was adopted for the policy because training examples of pairwise preferences are easy to generate. Online learning would not be possible with other functions, such as the scoring function of Kim et al. [12], because training examples labeled with real-valued scores cannot be generated easily. The training examples of pairwise preferences are generated after every dispatching decision: each candidate job considered for the decision is evaluated through a simulation of continued dispatching over a short look-ahead horizon, and the best job is then paired with each of the remaining jobs to form positive examples, while the inversions of these pairs form negative examples. Some of these pairs may consist of jobs with little difference in their evaluation values; since the reliability of an evaluation obtained from a short simulation is not very high, such pairs may be noisy. To minimize the potential preference conflict caused by such noisy examples, every generated example is given a weight that depends on the difference in the evaluation values of its pair.
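A minimal sketch of this example-generation step is given below. The direction of the evaluation values (lower is better) and the exponential form of the weighting are assumptions of the sketch; only the weighting parameter β is taken from Table 5.

```python
import math


def make_preference_examples(candidates, evaluate, beta=6.0):
    """Generate weighted pairwise preference examples from one dispatching decision.

    candidates: feature vectors of the candidate jobs
    evaluate:   look-ahead simulation returning an evaluation value per job
                (assumed here: lower is better)
    beta:       weighting parameter from Table 5; the weighting formula itself
                is an assumption of this sketch
    """
    values = [evaluate(job) for job in candidates]
    best_idx = min(range(len(candidates)), key=lambda i: values[i])
    best, best_val = candidates[best_idx], values[best_idx]
    examples = []
    for i, (job, val) in enumerate(zip(candidates, values)):
        if i == best_idx:
            continue
        # weight grows with the evaluation gap, so near-ties (noisy pairs)
        # contribute little to learning
        weight = 1.0 - math.exp(-beta * abs(val - best_val))
        examples.append(((best, job), +1, weight))   # best preferred over job
        examples.append(((job, best), -1, weight))   # inverted pair as negative
    return examples
```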


The performance of OnPL was compared against the rolling-horizon (RH) method, the reinforcement learning (RL) method, and the policy search (PS) method through a series of simulation experiments. Since RH conducts a search for a good dispatching decision at decision time, it cannot easily be scaled up to make high-quality decisions in real time (Table 8). The RL method [38] shows limited online-learning quality because of its simplified state definitions and its simplistic decision making for action selection (Tables 6 and 7); in terms of real-time applicability, however, it is one of the best options to consider (Table 8). The PS method [12] conducts an offline search to learn the policy with the best average performance over a given set of scenarios. In an arbitrary situation, this policy may not work as well as a policy custom-designed for that particular situation. The experimental results show that the policy obtained by PS is better than, or at least competitive with, OnPL (Tables 6 and 7) when applied to scenarios similar to those from which the policy was learned. However, when a policy learned by PS from one type of scenario is applied to different types of scenarios, its performance is not as satisfactory as that of OnPL (Figs. 8 and 9). Another disadvantage of PS is that it demands a rather long CPU time to derive a policy offline, so it will have difficulty coping with quick changes in the situation. It is also not easy for PS to decide when to update its policy in an environment with an essentially unlimited number of continuously changing situations. OnPL, in contrast, is quick in both making dispatching decisions and learning online (Table 8), and its policy, based on the preference function obtained through online learning, adapts well to changing situations (Figs. 8 and 9). All of these aspects make OnPL a viable option for AGV dispatching in an automated container terminal, where the operation of the equipment is highly uncertain and the working environment changes dynamically.

Acknowledgments

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H850115-1011) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] P.J. Egbelu, J.M.A. Tanchoco, Characterization of automatic guided vehicle dispatching rules, Int. J. Prod. Res. 22 (3) (1984) 359–374.
[2] M.M. Srinivasan, Y.A. Bozer, M. Cho, Trip-based material handling systems: throughput capacity analysis, IIE Trans. Inst. Ind. Eng. 26 (1) (1994) 70–89.
[3] H. Yamashita, Analysis of dispatching rules of AGV systems with multiple vehicles, IIE Trans. Inst. Ind. Eng. 33 (10) (2001) 889–895.
[4] T.J. Hodgson, R.E. King, S.K. Monteith, S.R. Schultz, Developing control rules for an AGV using Markov decision processes, in: Proceedings of the IEEE Conference on Decision and Control (CDC1985), 1985, pp. 1817–1821.
[5] K.K. Tan, K.Z. Tang, Vehicle dispatching system based on Taguchi-tuned fuzzy rules, Eur. J. Oper. Res. 128 (3) (2001) 545–557.
[6] B.H. Jeong, S.U. Randhawa, A multi-attribute dispatching rule for automated guided vehicle systems, Int. J. Prod. Res. 39 (13) (2001) 2817–2832.
[7] D.Y. Lee, F. DiCesare, Integrated scheduling of flexible manufacturing systems employing automated guided vehicles, IEEE Trans. Ind. Electron. 41 (6) (1994) 602–610.
[8] G. Ulusoy, F. Sivrikaya-Şerifoğlu, Ü. Bilge, A genetic algorithm approach to the simultaneous scheduling of machines and automated guided vehicles, Comput. Oper. Res. 24 (4) (1997) 335–351.
[9] K.H. Kim, J.W. Bae, A look-ahead dispatching method for automated guided vehicles in automated port container terminals, Transp. Sci. 38 (2) (2004) 224–234.
[10] M. Grunow, H.O. Günther, M. Lehmann, Dispatching multi-load AGVs in highly automated seaport container terminals, in: K.H. Kim, H.O. Günther (Eds.), Container Terminals and Automated Transport Systems: Logistics Control Issues and Quantitative Decision Support, Springer Berlin Heidelberg, 2005, pp. 231–255.


[11] L. Lin, S.W. Shin, M. Gen, H. Hwang, Network model and effective evolutionary approach for AGV dispatching in manufacturing system, J. Intell. Manuf. 17 (4) (2006) 465–477.
[12] J. Kim, R. Choe, K.R. Ryu, Multi-objective optimization of dispatching strategies for situation-adaptive AGV operation in an automated container terminal, in: Proceedings of the 2013 Research in Adaptive and Convergent Systems (RACS2013), 2013, pp. 1–6.
[13] W.W. Cohen, R.E. Schapire, Y. Singer, Learning to order things, J. Artif. Intell. Res. 10 (1999) 243–270.
[14] D. Naso, B. Turchiano, Multicriteria meta-heuristics for AGV dispatching control based on computational intelligence, IEEE Trans. Syst. Man Cybern. B: Cybern. 35 (2) (2005) 208–226.
[15] T. Park, M. Sohn, K.R. Ryu, Optimizing stacking policies using an MOEA for an automated container terminal, in: Proceedings of the 40th International Conference on Computers & Industrial Engineering (CIE2010), 2010, pp. 1–6.
[16] T. Park, R. Choe, Y.H. Kim, K.R. Ryu, Dynamic adjustment of container stacking policy in an automated container terminal, Int. J. Prod. Econ. 133 (1) (2011) 385–392.
[17] W.C. Ng, K.L. Mak, Y.X. Zhang, Scheduling trucks in container terminals using a genetic algorithm, Eng. Optim. 39 (1) (2007) 33–47.
[18] V.D. Nguyen, K.H. Kim, A dispatching method for automated lifting vehicles in automated port container terminals, Comput. Ind. Eng. 56 (3) (2009) 1002–1020.
[19] L.H. Lee, E.P. Chew, K.C. Tan, Y. Wang, Vehicle dispatching algorithms for container transshipment hubs, OR Spectr. 32 (3) (2010) 663–685.
[20] E.Y. Ahn, K. Park, B. Kang, K.R. Ryu, Real time scheduling by coordination for optimizing operations of equipments in a container terminal, in: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI2007), 2007, pp. 44–48.
[21] D. Briskorn, A. Drexl, S. Hartmann, Inventory-based dispatching of automated guided vehicles on container terminals, OR Spectr. 28 (4) (2006) 611–630.
[22] R. Choe, H. Cho, T. Park, K.R. Ryu, Queue-based local scheduling and global coordination for real-time operation control in a container terminal, J. Intell. Manuf. 23 (6) (2012) 2179–2192.
[23] T. Park, R. Choe, S.M. Ok, K.R. Ryu, Real-time scheduling for twin RMGs in an automated container yard, OR Spectr. 32 (3) (2010) 593–615.
[24] H. Yuan, R. Choe, K.R. Ryu, Twin-ASC scheduling in an automated container terminal using evolutionary algorithms, in: Proceedings of the 7th International Conference on Intelligent Manufacturing & Logistic Systems, 2011, CD.
[25] N. Umashankar, V. Karthik, Multi-criteria intelligent dispatching control of automated guided vehicles in FMS, in: Proceedings of the 2006 IEEE Conference on Cybernetics and Intelligent Systems (CIS2006), 2006, pp. 1–6.

[26] X. Guan, X. Dai, Deadlock-free multi-attribute dispatching method for AGV systems, Int. J. Adv. Manuf. Technol. 45 (5–6) (2009) 603–615.
[27] H. Jang, R. Choe, K.R. Ryu, Deriving a robust policy for container stacking using a noise-tolerant genetic algorithm, in: Proceedings of the 2012 ACM Research in Applied Computation Symposium (RACS2012), 2012, pp. 31–36.
[28] Y. Arzi, L. Iaroslavitz, Operating an FMC by a decision-tree-based adaptive production control system, Int. J. Prod. Res. 38 (3) (2000) 675–697.
[29] P. Priore, D. Fuente, J. Puente, J. Parreño, A comparison of machine-learning algorithms for dynamic scheduling of flexible manufacturing systems, Eng. Appl. Artif. Intell. 19 (3) (2006) 247–255.
[30] Y.R. Shiue, Data-mining-based dynamic dispatching rule selection mechanism for shop floor control systems using a support vector machine approach, Int. J. Prod. Res. 47 (13) (2009) 3669–3690.
[31] A.G. Barto, R.S. Sutton, P.S. Brouwer, Associative search network: a reinforcement learning associative memory, Biol. Cybern. 40 (3) (1981) 201–211.
[32] G. Tesauro, Temporal difference learning and TD-Gammon, Commun. ACM 38 (3) (1995) 58–68.
[33] R. Sutton, Generalization in reinforcement learning: successful examples using sparse coarse coding, in: Advances in Neural Information Processing Systems, vol. 8, MIT Press, 1996, pp. 1038–1044.
[34] J.A. Bagnell, J.G. Schneider, Autonomous helicopter control using reinforcement learning policy search methods, in: Proceedings of the IEEE International Conference on Robotics and Automation 2001 (ICRA2001), 2001, pp. 1615–1620.
[35] P. Abbeel, M. Coates, M. Quigley, A.Y. Ng, An application of reinforcement learning to aerobatic helicopter flight, in: Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS2006), 2006, pp. 1–8.
[36] S. Proper, P. Tadepalli, Scaling model-based average-reward reinforcement learning for product delivery, in: J. Fürnkranz, T. Scheffer, M. Spiliopoulou (Eds.), Lecture Notes in Computer Science, vol. 4212, Springer Berlin Heidelberg, 2006, pp. 735–742.
[37] G. Tesauro, N.K. Jong, R. Das, M.N. Bennani, A hybrid reinforcement learning approach to autonomic resource allocation, in: Proceedings of the 3rd International Conference on Autonomic Computing (ICAC2006), 2006, pp. 65–73.
[38] Q. Zeng, Z. Yang, X. Hu, A method integrating simulation and reinforcement learning for operation scheduling in container terminals, Transport 26 (4) (2011) 383–393.
[39] N.J. Tsitsiklis, B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Autom. Control 42 (5) (1997) 674–690.
[40] H.Y. Bae, R. Choe, T. Park, K.R. Ryu, Comparison of operations of AGVs and ALVs in an automated container terminal, J. Intell. Manuf. 22 (3) (2011) 413–426.
[41] M. Riedmiller, H. Braun, Direct adaptive method for faster backpropagation learning: the RPROP algorithm, in: Proceedings of the 1993 IEEE International Conference on Neural Networks (NeuralNetwork1993), 1993, pp. 586–591.