Reinforcement learning based resource allocation in business process management


Data & Knowledge Engineering 70 (2011) 127–145


Zhengxing Huang a,b, W.M.P. van der Aalst b, Xudong Lu a,⁎, Huilong Duan a

a College of Biomedical Engineering and Instrument Science of Zhejiang University, The Key Laboratory of Biomedical Engineering, Ministry of Education, PR China
b Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands


Article history: Received 26 November 2009; Received in revised form 15 September 2010; Accepted 15 September 2010; Available online 20 October 2010.
Keywords: Resource allocation; Business process; Markov decision process; Reinforcement learning; Q-learning.

Abstract: Efficient resource allocation is a complex and dynamic task in business process management. Although a wide variety of mechanisms are emerging to support resource allocation in business process execution, these approaches do not consider performance optimization. This paper introduces a mechanism in which the resource allocation optimization problem is modeled as Markov decision processes and solved using reinforcement learning. The proposed mechanism observes its environment to learn appropriate policies which optimize resource allocation in business process execution. The experimental results indicate that the proposed approach outperforms well-known heuristic or hand-coded strategies, and may improve the current state of business process management.

1. Introduction

Business Process Management (BPM) provides a broad range of tools and techniques to enact and manage operational business processes. More and more organizations use BPM techniques and tools to promote business effectiveness and efficiency. These techniques and tools focus on control-flow in combination with mature support for data in the form of XML and database technology. As a result, the control-flow and data-flow perspectives are well addressed by existing techniques and tools. Unfortunately, less attention has been devoted to the resource management perspective [1].

Resource management is important for the performance of an organization. In a BPM context, a resource is defined as an actor or agent that carries out business process activities. Typically, resources are humans. However, depending on the application domain, resources can also be machines, manpower, money, software, etc. [2]. In this paper, we consider durable resources, i.e., resources that are claimed and released during execution, but are neither created nor destroyed.

Resource allocation has been recognized as an important issue in business process execution. In practice, several aspects drive the need for resource allocation. Resources may be allocated to satisfy different and sometimes contradictory goals, such as sustaining a high utilization of the available resource capacity (possibly resulting in bottlenecks) or a smooth throughput of business process cases (possibly resulting in idle resources and higher costs). As mentioned by Kumar and van der Aalst [3], proper resource allocation is a key issue in providing efficient usage of resources in business process execution. It ensures that each work item is performed by the correct resource at the correct time, so as to balance the demand for process execution facilities against the availability of these resources.

Choosing the "right" resources to perform work items in business process execution is not a simple task. To achieve this, we propose the adoption of intelligent techniques for learning, reasoning, and planning resource allocation from experience, making reasonable decisions to optimize resource allocation in different process situations. We anticipate that such techniques will improve the allocation of resources.

⁎ Corresponding author. E-mail address: [email protected] (X. Lu).
doi:10.1016/j.datak.2010.09.002


There are two basic mechanisms for resource allocation [3], push and pull, which have been widely used in practice. The push mechanism operates by pushing a work item to a single resource having a required role; the qualified resource depends on criteria explicitly defined by the manager of the whole process. The pull mechanism refers to the solution where a resource is allowed or requested to pull work items from a common pool of work items. The push and pull mechanisms are simple, but do not actually consider performance issues [1,4]. It is unlikely that a simple push or pull model is capable of generating an optimal solution or adequately dealing with the complexity caused by resource unavailability, overloading, and other sources of uncertainty. As a result, a resource is offered too few, too many, or even inappropriate work items [1,5,6], which results in poor performance of the resource allocation mechanism.

Analysis of the literature shows that machine learning approaches are preferred to mine resource allocation rules [1,7–9]. These approaches provide strict deduction, which usually involves complicated notations and equations under assumptions. However, they are mostly used to mine static resource allocation rules and only work well in one or a few scenarios. No single approach can outperform all others in various scenarios, especially in an evolving environment, where current resource allocation decisions may not be suitable at another time. Thus, continuous effort is required in order to make reasonable decisions.

From a more general perspective, resource allocation has been widely studied in the area of operations management, e.g., the job-shop scheduling problem, where jobs are assigned to resources at particular times [10]. Although many researchers in the operations management domain focus on problems related to resource allocation [10–12], they seldom consider the resource allocation problems of structured business processes, in which tasks and their execution ordering are coordinated through different constructors, e.g., sequence, choice, parallelism, and synchronization [13]. Furthermore, business process execution is dynamic in nature. First, the execution path of a business process is determined at run time, and different processes may have different execution paths in particular process execution scenarios. In addition, resource behaviors, e.g., preference, capability, and availability, which have an important impact on resource allocation decisions in BPM [1], are dynamic and change during business process execution. Finally, there are multiple business processes that must be serviced concurrently. Constructing a model that assigns resources to every process would suffer from a combinatorial explosion. The problem of allocating resources to tasks from multiple concurrent business processes has to be considered.

The optimal resource allocation problem in business process execution can be seen as a sequential decision making problem. In this problem, the cost of a resource performing a task is not an isolated decision, but only one element in a sequence of decisions. In order to minimize the total cost, the decision maker may have to sacrifice immediate costs such that less cost is incurred later.
Thus, finding a policy for making good resource allocation decisions in business process execution is a challenging problem. Ideally, the policy should indicate what the best decision is in each possible situation the decision maker may encounter. Markov Decision Processes (MDPs) have been widely used to model sequential decision making problems [14–16]. The most important property of MDPs is that an optimal decision in a given state is independent of the earlier states the decision maker encountered. For MDPs, there exist a number of algorithms that are guaranteed to find optimal policies. For example, dynamic programming methods can be used to derive such optimal policies. A problem with dynamic programming methods is deciding the moment when the transition probabilities are sufficiently reliable to solve the problem classically. A relatively new class of algorithms, known as reinforcement learning (RL) techniques, may help to overcome some of the problems associated with dynamic programming methods [17,18]. RL has been applied in a number of areas [19–22] and has been successful in many practical applications, ranging from robotics and control to industrial manufacturing and combinatorial search problems.

From the operational point of view, resource allocation decision making is an interactive problem. To assign a particular work item to an appropriate resource at a particular time, there must be an interaction with the process execution environment, i.e., one must be able to observe how the process execution environment responds to the decisions made. RL provides a straightforward framework for learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to those actions and presenting new situations to the agent. The environment also gives rise to rewards or costs, special numerical values that the agent tries to maximize or minimize, respectively, over time [18].

In this paper, we present a Reinforcement Learning based Resource Allocation Mechanism (RLRAM) for BPM. RLRAM aims at making appropriate decisions to allocate resources by trying to minimize long-term cost and to improve the performance of business process execution.

The remainder of this paper is organized as follows: Section 2 introduces the preliminary concepts needed for RLRAM. Section 3 presents the formal mathematical model and implementation details of RLRAM. In particular, the mapping of concepts from reinforcement learning theory to resource allocation optimization in business process execution is presented in Section 3.1. A Q-learning based approach to optimize resource allocation of a partial process case is provided in Section 3.2, and a Q-value based queuing approach to optimize resource allocation among a set of partial process cases is provided in Section 3.3. Experimental results are presented in Section 4. Related work is presented in Section 5, and conclusions are provided in Section 6.

2. Preliminaries

In this section we introduce some basic concepts needed for the remainder of the discussion. We discuss Markov decision processes, business processes, resource behavior, and resource allocation problems in BPM, so as to set up the necessary context for describing our approach.


2.1. Markov decision process

In the reinforcement learning framework, resource allocation decisions, and the choice of long-term optimal actions based upon delayed rewards from the environment, are modeled as Markov Decision Processes (MDPs) [14,15].

Definition 1. (Markov decision process): A Markov decision process model is a tuple 〈S, A, P, C〉, where:
• S is a set of possible states of the environment;
• A is a set of possible actions available to the system;
• P : S × A × S → [0, 1] is a state transition function, giving for each state and action a probability distribution over states (P(s′ | (s, a)) is the probability of ending in state s′, given that the system starts in state s and takes action a); and,
• C : S × A → ℝ+ is a real-valued immediate cost function, giving the expected immediate cost incurred by the system for taking each action in each state (C(s, a) is the expected cost for taking action a in state s).

The essence of an MDP is that the effects of an action taken in a state depend only on that state and not on the history. In an MDP, the agent and the environment interact in a sequence of discrete steps, τ = 0, 1, 2, … The state and the action at one time step, sτ ∈ S and aτ ∈ A, determine the probability distribution for the state at the next time step, sτ+1 ∈ S, and, jointly, the distribution for the next cost cτ+1.

As mentioned before, we are looking for the best policy, that is, a function π that associates an action a ∈ A to each state s ∈ S. The MDP theoretical framework assigns to each policy π a value function Vπ that associates to each state s ∈ S a global cost Vπ(s), obtained by applying π beginning with s. This value function allows one to compare policies. A policy π outperforms another policy π′ if:

$$\forall s \in S,\quad V^{\pi}(s) \le V^{\pi'}(s) \tag{1}$$

The expected sum of costs is weighted by a parameter γ in order to limit the influence of infinitely distant costs (especially in the case where S is infinite):

$$\forall s \in S,\quad V^{\pi}(s) = E\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} c_{\tau} \,\middle|\, s_0 = s\right] \tag{2}$$

For each state, this value function gives the expected sum of future costs that can be obtained if the policy π is applied from this state onward. This value function allows one to formalize the search for the optimal policy π⁎: it is the one associated with the optimal value function V⁎ = Vπ⁎ that minimizes the accumulated costs. Bellman's optimality equations [23] characterize the optimal value function V⁎ and an optimal policy π⁎ that can be obtained from it. In the case of the γ-weighted criterion, they can be written as:

$$V^{*}(s) = \min_{a \in A}\left[C(s,a) + \gamma \sum_{s' \in S} P(s' \mid (s,a))\, V^{*}(s')\right] \tag{3}$$

$$\forall s \in S,\quad \pi^{*}(s) = \arg\min_{a \in A}\left[C(s,a) + \gamma \sum_{s' \in S} P(s' \mid (s,a))\, V^{*}(s')\right] \tag{4}$$
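When the transition probabilities P are known, Bellman's equations (3) and (4) can be solved directly, for instance by value iteration. The following is a minimal sketch under that assumption, with S, A, P, and C given as illustrative dictionary encodings of the tuple in Definition 1; it is included only to make the equations concrete, since the approach proposed later (Section 3.2) deliberately avoids relying on known transition probabilities.

```python
# Minimal value-iteration sketch for an MDP <S, A, P, C> as in Definition 1.
# Illustrative only: the paper itself uses Q-learning (Section 3.2) precisely
# because P is not known in advance for business process execution.

def value_iteration(S, A, P, C, gamma=0.9, eps=1e-6):
    """S: list of states; A: list of actions;
    P[(s, a)]: dict mapping next state s2 -> probability;
    C[(s, a)]: expected immediate cost of taking a in s."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup of Eq. (3): minimize expected cost.
            best = min(
                C[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # Greedy policy extraction of Eq. (4).
    policy = {
        s: min(A, key=lambda a: C[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in S
    }
    return V, policy
```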

2.2. Business process

A business process or business method is a collection of related, structured activities or tasks that produce a specific service or product (serve a particular goal) for a particular customer or customers. It can often be visualized with a flowchart as a sequence of activities [24,25].

Definition 2. (Business process model): Let T be a finite set of tasks and pm = (i, o, T, F) be a business process model, where:
• i is the input condition and o is the output condition;
• T is a finite set of tasks; and,
• F ⊆ ({i} × T) ∪ (T × {o}) ∪ (T × T) is a finite set of flow relations.

As shown in Fig. 1 (A), an example business process model consists of one input condition (i), one output condition (o), two tasks (t1 and t2), and eight flow relations (f_{i,t1}, f_{i,t2}, f_{t1,t1}, f_{t1,t2}, f_{t2,t1}, f_{t2,t2}, f_{t1,o}, and f_{t2,o}).

Definition 3. (Transition probability): Let F be the set of flow relations and R be the set of resources. Let ρ : F × R → [0, 1] be the task transition probability function, representing the transition probability from the source task to the target task of a flow relation under the condition that the source task is executed by a qualified resource.


Fig. 1. An example of a business process model.

For example, suppose that resources r1 and r2 are qualified to execute tasks t1 and t2 of the example business process model in Fig. 1. Then we can label each flow relation, which relates a source task to a target task, with a numerical value. This value represents the transition probability from the source task to the target task when the source task is executed by a qualified resource; e.g., as shown in Fig. 1 (B), ρ(f_{t1,t2}, r1) = 0.4 represents the probability (0.4) of the flow relation from task t1 to task t2 under the condition that t1 is executed by resource r1. In particular, we define two specific notations: ρ(f_{i,t}), the probability that a process case starts with task t, and ρ(f_{t,o}, r), the probability of the flow relation from task t to the output condition under the condition that t is executed by resource r. For example, the transition probability ρ(f_{i,t1}) = 0.5 in Fig. 1 (A) means that the probability of a business process case starting with task t1 is 0.5.
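To make Definition 3 concrete, ρ can be viewed as a lookup table keyed by source task, target task, and executing resource. The sketch below is purely illustrative and only contains the two values quoted above for Fig. 1; all other entries of the example model are omitted.

```python
# Illustrative lookup-table encoding of the transition probability function rho
# of Definition 3. Only the two values quoted in the text for Fig. 1 are shown;
# the remaining entries of the model are omitted here.
rho = {
    ("i", "t1", None): 0.5,   # rho(f_{i,t1}): a case starts with t1 with probability 0.5
    ("t1", "t2", "r1"): 0.4,  # rho(f_{t1,t2}, r1): t1 -> t2 when t1 is executed by r1
}

def routing_probability(source, target, resource=None):
    """Return rho(f_{source,target}, resource); entries not listed default to 0.0."""
    return rho.get((source, target, resource), 0.0)
```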


The transition probability matrix can be represented as:

$$
\begin{bmatrix}
\rho(f_{i,t_1}) & \rho(f_{i,t_2}) & \cdots & \rho(f_{i,t_n}) & 0 \\
\rho(f_{t_1,t_1}, r_1) & \rho(f_{t_1,t_2}, r_1) & \cdots & \rho(f_{t_1,t_n}, r_1) & \rho(f_{t_1,o}, r_1) \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\rho(f_{t_1,t_1}, r_m) & \rho(f_{t_1,t_2}, r_m) & \cdots & \rho(f_{t_1,t_n}, r_m) & \rho(f_{t_1,o}, r_m) \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\rho(f_{t_n,t_1}, r_1) & \rho(f_{t_n,t_2}, r_1) & \cdots & \rho(f_{t_n,t_n}, r_1) & \rho(f_{t_n,o}, r_1) \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
\rho(f_{t_n,t_1}, r_m) & \rho(f_{t_n,t_2}, r_m) & \cdots & \rho(f_{t_n,t_n}, r_m) & \rho(f_{t_n,o}, r_m)
\end{bmatrix} \tag{5}
$$

where ρ(f_{tj,tk}, rl) represents the probability of a flow relation from task tj to another task tk under the condition that tj is executed by resource rl. The transition probability matrix defines a Markov chain in which the states represent the tasks, with one source state i and one target state o representing the input condition and the output condition of a business process model. As mentioned before, it is possible to determine the probability of a process case starting from a particular type of task; e.g., the probability that a process case starts from task tj is represented by ρ(f_{i,tj}). Suppose there are λ process cases initialized concurrently, then the number of process cases starting from tj is λρ(f_{i,tj}). Note that a process case must pass through a particular type of task before it is forwarded to the output condition o, which means that ρ(f_{i,o}) = 0.

Definition 4. (Enabled work item): Let T be a finite set of tasks, CID be a finite set of case identifiers, K be a finite set of attribute keys, V be a finite set of attribute values, and TS be the set of all possible time stamps. Let EWItem = {(cid, t, dv, τe) ∈ CID × T × (K ↛ V) × TS} be the set of enabled work items, where (1) cid is the identifier of a process case, (2) t is the name of the task, (3) dv is a set of application data, and (4) τe is the enabled time of an enabled work item.

In the following discussions, we use ewi.cid, ewi.t, ewi.dv and ewi.τe to represent the case identifier, task name, application data, and enabled time of an enabled work item ewi. For example, Table 1 shows an example for the example process model of Fig. 1. In the seventh line of Table 1, there is one enabled work item, which is represented as (c2, t2, {〈type, emergent〉}, τ).

Definition 5. (Completed work item): Let T be a finite set of tasks, let CID be a finite set of case identifiers, let R be a finite set of resources, let K be a finite set of attribute keys, and V a finite set of attribute values. Let TS be the set of all possible time stamps. Let CWItem = {(cid, t, r, dv, τs, τc) ∈ CID × T × R × (K ↛ V) × TS × TS | τc ≥ τs} be the set of completed work items, where (1) cid is the identifier of a process case, (2) t is the name of the task, (3) r is the performer, (4) dv is a set of application data, (5) τs is the start time, and (6) τc is the completion time of a completed work item.

An enabled work item becomes a completed work item when it is performed by a particular resource. In the following discussions, we use cwi.cid, cwi.t, cwi.r, cwi.dv, cwi.τs, and cwi.τc to represent, respectively, the case identifier, task name, performer, application data, start time and completion time of a completed work item cwi. As shown in Table 1, there are five completed work items. Taking the first completed work item (in Line 2) as an example, it is represented as (c1, t1, r1, {〈type, normal〉}, τ − 4, τ − 3).

Definition 6. (Process case): Let Case ⊆ (CWItem ∪ EWItem)⁎ be the set of process cases, such that:
• ∀ σ ∈ Case, ∀ wi1, wi2 ∈ σ: wi1.cid = wi2.cid, and
• ∀ σ1, σ2 ∈ Case, ∃ wi1 ∈ σ1, wi2 ∈ σ2: wi1.cid = wi2.cid ⇒ σ1 = σ2.

Note that there are two types of process cases. If there exist enabled work items in a process case σ, then σ is a partial process case. If all elements of a process case σ are completed work items, then σ is a full process case.

Definition 7. (Partial process case): Let σ = 〈wi1, wi2, ⋯, win〉 ∈ Case be a partial process case, i.e., 〈wi1, wi2, ⋯, win−1〉 ∈ CWItem⁎ and win ∈ EWItem.
Furthermore, let pre(σ) = {wi1, wi2, ⋯, win−1} be the set of completed work items and cur(σ) = win be the current enabled work item.

Table 1. An example containing two business process cases, five completed work items, and one enabled work item.

Case id | Task | Resource | Application data   | Enabled time | Start time | Completion time
c1      | t1   | r1       | {〈type, normal〉}   | τ − 4        | τ − 4      | τ − 3
c1      | t1   | r1       | {〈type, normal〉}   | τ − 3        | τ − 3      | τ − 2
c1      | t1   | r1       | {〈type, normal〉}   | τ − 2        | τ − 2      | τ − 1
c2      | t2   | r2       | {〈type, emergent〉} | τ − 2        | τ − 2      | τ − 1
c2      | t1   | r1       | {〈type, emergent〉} | τ − 1        | τ − 1      | τ
c2      | t2   |          | {〈type, emergent〉} | τ            |            |

As shown in Table 1, there are two business process cases, c1 and c2, each consisting of three work items. Note that c2 is a partial case since it has an enabled work item waiting for execution.
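For illustration, the work item and process case notions of Definitions 4–7 can be encoded as plain data containers. The sketch below is one possible encoding, not the authors' implementation; the field names mirror the notation used in the text.

```python
# Illustrative containers for Definitions 4-7; field names follow the notation
# used in the text (cid, t, r, dv, tau_e, tau_s, tau_c).
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class EnabledWorkItem:            # Definition 4
    cid: str                      # case identifier
    t: str                        # task name
    dv: Dict[str, str]            # application data (key -> value)
    tau_e: float                  # enabled time

@dataclass
class CompletedWorkItem:          # Definition 5
    cid: str
    t: str
    r: str                        # performing resource
    dv: Dict[str, str]
    tau_s: float                  # start time
    tau_c: float                  # completion time (tau_c >= tau_s)

WorkItem = Union[EnabledWorkItem, CompletedWorkItem]   # a process case is a list of these

def pre(case: List[WorkItem]) -> List[CompletedWorkItem]:
    """Completed work items of a (partial) process case (Definition 7)."""
    return [wi for wi in case if isinstance(wi, CompletedWorkItem)]

def cur(case: List[WorkItem]) -> Optional[EnabledWorkItem]:
    """Current enabled work item of a partial process case, if any."""
    enabled = [wi for wi in case if isinstance(wi, EnabledWorkItem)]
    return enabled[-1] if enabled else None
```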


2.3. Resource behavior

Resource behavior refers to the actions or reactions of a resource in relation to business process execution. Measuring resource behavior is highly relevant for the performance of processes [26], suggesting that comprehensive support for it is needed. In business process execution, the many factors involved in measuring resource behavior, on the one hand, and the lack of sufficient information to evaluate resource behavior, on the other, make resource behavior measurement a "wicked" problem. It is unlikely that approaches exist which are capable of considering the full variety of factors involved. However, it is possible to measure resource behavior from one or several important perspectives by referring to work from the social sciences [27,28]:

• Preference: an acquired or developed component of knowledge or attitude to do a certain kind of activity at a certain level. For example, if a resource bids for a type of activity frequently, it implies that the resource prefers that activity;
• Availability: indicates whether a resource is available to perform an activity within a specific time limitation;
• Competence: the ability to perform a certain type of activity. In business process execution, if a resource performs a certain type of activity at a lower cost than the others, the resource has a higher competence level and is more competitive than the others in performing the activity;
• Cooperation: the process of working with other resources. In its simplest form, it involves resources working in harmony with each other.

Resource preference, competence, and cooperation change slowly during business process execution, while resource availability changes frequently. Resource availability has a great impact on process performance [26]. For instance, if resources are in scarce supply, the cost of starting and executing newly arrived work items is high. Thus, in this paper, we mainly focus on resource availability, which is reflected by the workload of resources at each decision instant.

Definition 8. (Workload): At each decision instant τ, for a particular resource r, we assume that a normalized resource workload rτ.wl can be measured as a particular value from {Available, Low, Normal, High, Overallocated}.

Note that the workload is the normalized amount of enabled work items distributed to a resource between the time instants τ − 1 and τ. For example, if there are no enabled work items assigned to the resource r, then rτ.wl = Available, and if too many work items are assigned to the resource r, then rτ.wl = Overallocated. As shown in Table 1, at the time instant τ, there are no work items allocated to r1 or r2. Thus, the workload of both r1 and r2 is Available at the time instant τ.

2.4. Resource allocation problems in business process management

Resource allocation in business process management can be understood, in a broad sense, from the following perspectives [3,26,29–31]:

• From the control-flow perspective (or process perspective), a business process model describes tasks and their execution ordering through different constructors, which permit the flow of execution control, e.g., sequence, choice, parallelism, and synchronization.
In addition, business process execution is dynamic in nature. This means that the execution path of a business process case is determined at run time, and different process cases may have different execution paths in different process execution scenarios.
• From the resource perspective, resource behaviors have a great impact on the performance of business process execution. Note that some kinds of resource behaviors, e.g., availability, are always changing during business process execution. One must interact with the process execution environment to measure those dynamic resource behaviors, so as to make appropriate resource allocation decisions and safeguard against inadequate and busy resources receiving work items, thereby avoiding the occurrence of bottlenecks and improving process performance.
• Resource allocation decisions and business process execution impact each other. On the one hand, in business process execution, one needs to choose appropriate resources to perform work items according to the requirements from both the authorization and performance perspectives [3]. On the other hand, resource behaviors have an impact not only on the performance, but also on the execution paths, of business process cases [31]. For example, considering an academic paper review process, even for the same academic paper, different reviewers, due to their specific domain capabilities and experiences, may make different decisions and cause different process execution paths (e.g., paper acceptance, minor revision, major revision, or rejection). Another typical example is the diagnostic and treatment process: a universal clinical process for a particular diagnosis, procedure, or symptom may not be practical, because clinical care for patients may vary with physicians (or other clinical resources), practice preferences, and styles, even if the patients have the same diagnosis, procedure need, or symptom. Thus, as mentioned in Definition 3, the process transition probability depends not only on the business process structure, but also on the resource behaviors.
• In business process execution, there are multiple business processes that must be serviced concurrently. Constructing a model that assigns resources to every process would suffer from a combinatorial explosion. As a matter of fact, business processes do not all have equal importance. The optimization of less important cases might not be appropriate because they use resources that would be better reserved for more important process cases.


With respect to the main issues above, the classical resource allocation algorithms widely used in the operations management domain, such as linear programming, integer programming [32], simulated annealing [33], tabu search, or genetic algorithms [34,35], cannot provide analytical answers to the resource allocation problems in BPM. Although dynamic programming and reinforcement learning have been proposed to solve resource allocation problems, these approaches are mainly aimed at classical resource allocation problems, such as job-shop scheduling [10], grid computing [36,37], activity networks [38], or autonomic systems [39]. These approaches focus on specific problems and cannot be directly applied to business process resource allocation.

Business process resource allocation, as a special problem, should be considered at the strategic and tactical levels, rather than at the operational level. At the strategic level (input), business process managers set the performance objective, as well as the relative importance of each process case. At the tactical level (output), resource allocation decisions, depending on the run time information of the work item (e.g., task type, application data, etc.) and resource behaviors (e.g., availability, etc.), allocate work items to appropriate resources. At the operational level, if the work items are always executed by the recommended resources during business process execution, then, over the long run, the performance of business process execution will be optimized.

3. Reinforcement learning based resource allocation

Achieving good performance of resource allocation in BPM using fixed strategies and hand-coded heuristics is very difficult and typically very inflexible [4]. In this study, we design and implement RLRAM, which learns the best resource allocation policies to be adopted in process states, based on feedback obtained from the environment [17,18]. Fig. 2 shows an overview of the RLRAM approach, which is a straightforward framing of the resource allocation optimization problem as learning from interaction to minimize the total cost in business process execution. The agent corresponds to a decision maker for resource allocation. The thing it interacts with is the environment, which includes the unknown dynamics of the process execution, e.g., how long resources need to execute a particular work item. These interact continually in business process execution, i.e., the agent retrieves the process states of newly arrived work items and assigns those work items to appropriate resources. The environment responds to the resource allocation decisions by giving rise to costs, special numerical values that the agent tries to minimize over time.

Because there may be multiple business process cases that must be serviced concurrently, constructing a model that assigns resources to every process case would suffer from a combinatorial explosion. Thus we propose a Q-learning approach to learn resource allocation policies for individual business processes, and then stitch those policies together using a heuristic method, Q-value based queuing, for ordering the queue of work items allocated to each resource.
3.1. Mapping the resource allocation decision process to an MDP

One of the most important keys to achieving good performance in business process execution is the representation of the input space. Here, we describe how we design the elements of process state, action, and cost to be manageable in the RLRAM framework.

3.1.1. State and action

Definition 9. (State space): Let T be a finite set of tasks, R be a finite set of resources, K be a finite set of application data keys, and V be a finite set of application data values. We denote by S the set of states, S ⊆ T × (K ↛ V) × {Available, Low, Normal, High, Overallocated}.

For example, consider a particular state s = (t, dv, wl), where:
• t is the task name of a particular work item to be executed,
• dv are the application data keys and values, and
• wl is the workload.

Note that a partial case σ can be mapped onto a particular state at a particular time instant, where t = cur(σ).t, dv = cur(σ).dv, and wl is the workload of a possible resource r that can perform cur(σ). As shown in Fig. 1, both resources r1 and r2 are qualified to perform tasks t1 and t2, and r1 and r2 are available at the time instant τ. Thus, the enabled work item (c2, t2, {〈type, emergent〉}, τ) is in the process state (t2, {〈type, emergent〉}, Available) at the time instant τ.

The features of a process state are always changing in business process execution. In this research, we define a process state containing four features that indicate in what general direction a process case is headed (and therefore what resources will likely be distributed to it). Note that an action-dependent feature [40] is used in this research. As the name implies, action-dependent features cause an agent's state to change as different actions are considered. In this case, a resource-dependent feature, the resource workload, always describes the current workload of whatever particular resource the work item may be allocated to. Note that this approach might make the process state transition probabilities insufficiently reliable; however, it does not violate the Markov property because RLRAM only knows the workload of a particular resource at a particular decision time instant.


Fig. 2. Overview of the RLRAM approach.

It is able to describe the possible states that a particular process case could be in at a particular decision time instant. Thus, it can be guaranteed that the algorithm converges to an optimal behavior.

In this research, we assume that there is one and only one enabled work item of a partial process case at a particular decision instant. As a matter of fact, if there are multiple enabled work items of a partial process case, these work items can be assigned in sequence. Thus, at a particular decision instant, we can assume that there is only one enabled work item of a partial process case. This guarantees that a particular process case is in a particular state at a particular decision instant.

3.1.2. Stochastic dynamics

Prior to setting the transition probabilities among the states in S, one must define P(sτ+1 | (sτ, a)). The probability P(sτ+1 | (sτ, a)) that the system moves from the state sτ to sτ+1, given that action a is adopted at the decision time instant τ, can be obtained as follows:

$$P(s_{\tau+1} \mid (s_\tau, a)) = P(wi_{\tau+1}.t,\; wi_{\tau+1}.dv,\; r_{\tau+1}.wl \mid (s_\tau, a)) \tag{6}$$

Note that wiτ+1.dv and rτ+1.wl are independent of (sτ, a), thus:

$$P(wi_{\tau+1}.t,\; wi_{\tau+1}.dv,\; r_{\tau+1}.wl \mid (s_\tau, a)) = P(wi_{\tau+1}.t \mid (s_\tau, r_\tau)) \times P(r_{\tau+1}.wl) \times P(wi_{\tau+1}.dv) = \rho(f_{wi_\tau.t,\, wi_{\tau+1}.t},\, r_\tau) \times P(r_{\tau+1}.wl) \times P(wi_{\tau+1}.dv) \tag{7}$$
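Under the independence assumptions behind Eq. (7), the transition probability factorizes into a routing term and two exogenous distributions. The sketch below only illustrates this factorization; the workload distribution p_wl and the application-data distribution p_dv are modeller-supplied assumptions, not quantities given in the paper.

```python
# Factored state transition probability of Eq. (7): routing probability rho,
# times the (assumed independent) distributions of the next resource workload
# and of the next application data. p_wl and p_dv are hypothetical,
# modeller-supplied dictionaries; next_dv is a hashable representation of the
# application data, e.g. (("type", "normal"),).

def state_transition_probability(rho, p_wl, p_dv,
                                 cur_task, next_task, resource,
                                 next_workload, next_dv):
    routing = rho.get((cur_task, next_task, resource), 0.0)   # rho(f_{t,t'}, r)
    return routing * p_wl.get(next_workload, 0.0) * p_dv.get(next_dv, 0.0)
```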


3.1.3. Cost function

The cost function is a quality-oriented evaluation of the performance of business process execution. The cost function should be a generic structure on which the cost of the resource allocation decision process can be defined, based on one or several fixed objectives. In reality, resource allocation in business process management is typically a multi-objective problem [3,31,41,42]. Time and execution cost are two of the most basic objectives; others can be timely throughput, concurrency, resource utilization, etc. For example, the objective can be to minimize the execution costs of business process cases. Then, the cost of adopting action a in the process state s is:

$$C(s, a) = cost(wi.t,\; wi.dv,\; wi.r) \tag{8}$$

where s = (wi.t, wi.dv, r.wl), a = r, and cost : T × (K ↛ V) × R → ℝ+ is the task execution cost function, which specifies the costs of executing particular tasks by particular resources. For example, if the average cost of executing a task t with application data 〈type, normal〉 by resource r is 1.1 units, then cost(t, {〈type, normal〉}, r) = 1.1 (cf. Table 2).

The objective can also be to minimize the flow time of business process cases:

$$C(s, a) = time(wi.\tau_c - wi.\tau_e) \tag{9}$$

where wi.τe and wi.τc are the enabled time and the completion time of a work item, respectively.

The execution cost of a particular business process case σ is given as:

$$c(\sigma) = \sum_{wi \in \sigma} C(s, a) \tag{10}$$

For a set of process cases Ξ executed in a limited period of time, the objective function is:

$$\text{Minimize} \sum_{\sigma \in \Xi} c(\sigma) \tag{11}$$
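Read operationally, Eqs. (8)–(11) amount to summing the per-work-item costs within a case and then over all cases. Below is a small sketch under the assumption of a task_cost table standing in for the cost function of Eq. (8) and of work items that expose the attributes used in the definitions above.

```python
# Cost objective of Eqs. (8)-(11). `task_cost[(task, dv_key, resource)]` plays
# the role of cost(wi.t, wi.dv, wi.r) in Eq. (8) and is assumed to be given;
# dv_key is a hashable form of the application data.

def case_cost(sigma, task_cost):
    """Eq. (10): the cost of a process case is the sum of its work item costs."""
    return sum(task_cost[(wi.t, tuple(sorted(wi.dv.items())), wi.r)] for wi in sigma)

def total_cost(cases, task_cost):
    """Objective of Eq. (11): total cost over a set of process cases."""
    return sum(case_cost(sigma, task_cost) for sigma in cases)

def case_flow_time(sigma):
    """Alternative flow-time objective based on Eq. (9); assumes each work item
    records both its enabled time tau_e and completion time tau_c."""
    return sum(wi.tau_c - wi.tau_e for wi in sigma)
```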

During the execution of a process case, the process state is observed and an action is adopted to allocate an enabled work item to a qualified resource. From the current state and the adopted action, it is possible to determine the next state probabilities and the expected cost incurred in the next period.

3.1.4. Resource allocation decision model

Definition 10. (Resource allocation decision model): A resource allocation decision model is represented by the tuple 〈S, A, P, C〉, where S is the set of states; A is the set of possible actions, in this study A = R, i.e., the set of qualified resources which participate in business process execution; P is the state transition function, P : S × R → P(S), with P(s, r, s′) the probability of moving from state s to s′ when a particular resource r ∈ R is selected in state s; and C is the cost function.

Claim 1. The process of resource allocation in business process execution is an MDP.

Proof. According to the properties of Markov decision processes, we only need to prove that the process constructed by the states sτ of business process resource allocation is a Markov decision process. According to Definition 1, it suffices to show that the state sτ+1 depends only on (sτ, a). Note that at time instant τ + 1, the process state transition probability P(sτ, a, sτ+1) depends only on (sτ, a), as shown in Eqs. (6) and (7). Thus, this process is a Markov decision process. □

3.2. Reinforcement learning based resource allocation

In this research, we propose a reinforcement learning based resource allocation approach for business process management.

Table 2. Resource allocation assumptions about the example business process model.

                          | t1              |                   | t2              |
                          | 〈type, normal〉  | 〈type, emergent〉  | 〈type, normal〉  | 〈type, emergent〉
Executing time            |                 |                   |                 |
  r1                      | 1.0             | 0.8               | 1.0             | 0.8
  r2                      | 1.1             | 1.0               | 1.1             | 1.0
Costs for task execution  |                 |                   |                 |
  r1                      | 1.2             | 1.5               | 1.5             | 1.6
  r2                      | 1.0             | 1.3               | 1.4             | 1.5


As one of the main methods able to solve sequential decision problems for which the transition probabilities are not known (lack of a "model"), or for MDPs with a very large state space, the RL approach can be applied to find the optimal resource allocation policies in business process management. There are several RL algorithms, such as Q-learning, R-learning, and Sarsa [18]. As mentioned before, resource availabilities (measured by resource workloads) are always changing in business process execution, which makes it impossible to decide the moment at which the transition probabilities are sufficiently reliable to classically solve the problem. One way to get around the lack of a model is to use an indirect method and estimate the transition probabilities from simulations; the main difficulty then is to decide when the estimated probabilities are sufficiently reliable to classically solve the problem. A direct method, on the contrary, only tries to estimate the value function and does not bother with the probabilities. In this work, we choose the latter and use the most famous (but also the simplest) algorithm: Q-learning [43]. The Q-learning algorithm is an RL method that is able to solve Bellman's equations for the γ-weighted criterion. It uses simulations to iteratively estimate the value function V⁎, based on the observations of instantaneous transitions and their associated cost. For this purpose, Puterman [14] introduces a function Q that carries a significance similar to that of V but makes it easier to extract the associated policy: there is no longer any need for the transition probabilities of the Markov model. To a given policy π and its value function Vπ, we associate the new function, called the "Q-value":

$$\forall s \in S,\, a \in A,\quad Q^{\pi}(s, a) = C(s, a) + \gamma \sum_{s'} P(s' \mid (s, a))\, V^{\pi}(s') \tag{12}$$

It is easy to see that, in spite of the lack of transition probabilities, we can easily "track back" to the optimal policy:

$$\forall s \in S,\quad V^{*}(s) = \min_{a \in A} Q^{*}(s, a) \tag{13}$$

and

$$\pi^{*}(s) = \arg\min_{a \in A} Q^{*}(s, a) \tag{14}$$

The principle of applying the Q-learning algorithm to resource allocation during business process execution is as follows. An enabled work item ewi, ewi ∈ σ, is in a particular state s at a decision instant. The state s maintains a table of estimates of the cost of different actions. Each Q(s, r) is an estimate of how much a process case will spend to complete if the resource r is selected to perform ewi in the current state s. When r is selected and the work item is performed, σ is forwarded to a newly enabled work item ewi′. Suppose the enabled work item ewi′ is in the state s′; then the agent immediately receives back an estimate V of the cost of s′, based on the values in the Q-table of s′:

$$V(s') = \min_{r' \in A} Q(s', r') \tag{15}$$

With this information, s can update its estimated cost for the process cases that are sent to the state s′. If c is the cost of performing ewi, then the following update rule applies:

$$Q(s, r) = c + \gamma \sum_{s'} P(s' \mid (s, r))\, V(s') \tag{16}$$

where c represents the immediate cost and V is the estimated value of the next state s′. Note that c cannot depend stochastically on s and r, and Eq. (16) requires perfect knowledge of the state transition function in advance. In reality, however, as indicated in Eq. (7), it is usually difficult for the agent to predict in advance the exact state transition probability, because resource workloads are non-stationary. Thus, a revised training rule as suggested by Watkins, which is sufficient to assure convergence, is shown below [43]:

$$Q(s, a) = (1 - \alpha)\, Q(s, a) + \alpha\, (c + \gamma\, V(s')) \tag{17}$$

where α is a learning rate, 0 < α < 1. The revised rule is appropriate for a stochastic environment such as business process management. The key idea in this revised rule is that the revisions to Q are made more gradually than in the deterministic case. By reducing α at an appropriate rate during training, we can achieve convergence to the correct Q function. One must mention that Eqs. (16) and (17) are used conditionally to update Q-values. For example, if resources are always available to perform work items, so that the state transition function is independent of resource workloads, Eq. (16) can appropriately be used to update Q-values. Otherwise, Eq. (17) is more appropriate.

The actual learning process can be described as follows [43,44]:

1. Initialize the Q-values;
2. Select a random starting state s which has at least one possible action from which to select;
3. Select one of the possible actions, which leads to the next state s′;
4. Update the Q-value of the state-action pair (s, a) according to the update rule above; and,
5. Let s = s′ and continue with step 3 if the new state has at least one possible action; if it has none, go to step 2.

Note that in this work the parameter γ = 1, because business process cases eventually complete. Thus, for each policy there is an absorbing state of null cost (the output condition o), and the function Q converges to Q⁎.
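The five-step learning process above can be summarized in a few lines of code. The sketch below illustrates the tabular update of Eq. (17) with ε-greedy action selection; it is not the authors' implementation, and the simulator env with its methods reset, done, qualified_resources, and step is a hypothetical stand-in for the business process execution environment.

```python
# Illustrative tabular Q-learning loop implementing the update rule of Eq. (17)
# with epsilon-greedy exploration. `env` is a hypothetical process execution
# simulator: reset() returns an initial state, qualified_resources(s) lists the
# admissible actions, step(s, r) performs the work item and returns the next
# state and the observed cost, and done(s) tests for the output condition o.
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, resource)], initialized to 0
    for _ in range(episodes):
        s = env.reset()                          # start a new process case
        while not env.done(s):
            actions = env.qualified_resources(s)
            if random.random() < epsilon:        # explore
                r = random.choice(actions)
            else:                                # exploit: minimal estimated cost
                r = min(actions, key=lambda a: Q[(s, a)])
            s_next, cost = env.step(s, r)        # execute the work item, observe cost
            if env.done(s_next):
                v_next = 0.0                     # absorbing output condition o has cost 0
            else:
                v_next = min(Q[(s_next, a)] for a in env.qualified_resources(s_next))
            # Revised update rule of Eq. (17), suited to a stochastic environment.
            Q[(s, r)] = (1 - alpha) * Q[(s, r)] + alpha * (cost + gamma * v_next)
            s = s_next
    return Q
```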


3.3. Q-value based work item queuing

The Q-learning based strategy discussed above attempts to minimize the cost, so as to optimize resource allocation. However, cost is not the only factor in overall process performance. As a matter of fact, work items do not all have equal importance; minimizing the cost of less important work items can be dramatically suboptimal, because it uses resources that would be better reserved for more important work items. In this sense, the Q-learning explained above is greedy: it attempts to minimize the cost of a given process case, but does not consider how doing so affects the cost of other process cases. In principle, this shortcoming can be addressed by revising the values upon which RLRAM learns and bases its decisions. For example, if the Q-values represented the global cost, instead of the cost of a particular process case, RLRAM would have no incentive to favor the current work item and could eventually learn to allocate resources in a way that minimizes the global cost.

Given an estimate of the cost of any work item, one can compute the total cost of any ordering of work items in a queue. Suppose there are n work items in the queue of a particular resource. The challenge is how to efficiently select a good ordering from among the n! possibilities. Clearly, enumerating each possibility is not computationally feasible. Thus, a simple, fast heuristic called the Q-value based queuing algorithm is given as follows (a code sketch of this heuristic is given at the end of this subsection):

1. Let ewi be an enabled work item, ℜ the set of qualified resources, i = 1, m = Card(ℜ), and Ω = ∅.
2. Let r be the i-th resource in ℜ to which ewi may be distributed, ℑ the set of enabled work items in the queue of r, n = Card(ℑ), j = 1, and v = 0.
3. Insert ewi into position j of ℑ and let h be the new queue of work items after insertion.
4. For each enabled work item ewi′ in h, get its corresponding Q-value from the Q-table according to its state and action, and let v ⇐ v + Q(ewi′.t, r).
5. Let ω = (h, v) and Ω ⇐ Ω ∪ {ω}.
6. Let j ⇐ j + 1; if j ≤ n + 1, GOTO (3).
7. Let i ⇐ i + 1; if i ≤ m, GOTO (2).
8. Let ω⁎ be the optimal result, i.e., the element with the minimal v in Ω, and return the corresponding queue h of ω⁎.

Steps 2–7 represent a queuing process that orders a newly arrived work item while considering all qualified resources. The resource selected is the one whose queue has the minimal sum of Q-values over all its work items. Thus, this algorithm is able to (1) estimate the cost of an ordering of work items and (2) efficiently select the best ordering from among the m × (n + 1) contenders, where there are m resources and n work items in the queue of each resource.

In principle, an optimal resource allocation strategy should take both work item queuing and distribution decisions into account. Based on RLRAM, a resource can see an ordered list of work items that it can carry out. Moreover, it can sort the work items based on Q-values. By determining which work items are in the most pressing need of completion, and processing them first, RLRAM can minimize the total cost and achieve optimal performance in business process execution.
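The eight steps above can be sketched as follows. This is an illustrative rendering, not the authors' code: Q is the learned Q-table, queues[r] is the current work-item queue of each qualified resource r, and state_of(wi, r) is a hypothetical helper that maps a work item and a candidate resource to the process state of Definition 9.

```python
# Illustrative sketch of the Q-value based queuing heuristic of Section 3.3.
# For each qualified resource and each insertion position, the total Q-value of
# the resulting queue is computed, and the minimal-cost ordering is returned.

def q_value_based_queuing(ewi, resources, queues, Q, state_of):
    best = None                                    # (total Q-value, resource, queue)
    for r in resources:                            # steps 2 and 7: every qualified resource
        current = queues[r]
        for j in range(len(current) + 1):          # steps 3 and 6: every insertion position
            candidate = current[:j] + [ewi] + current[j:]
            # step 4: sum the Q-values of all work items in the candidate queue
            v = sum(Q[(state_of(wi, r), r)] for wi in candidate)
            if best is None or v < best[0]:        # step 8: keep the minimal-v ordering
                best = (v, r, candidate)
    _, r_star, queue_star = best
    return r_star, queue_star                      # selected resource and its new queue
```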
4. Case study

This section describes two case studies. The first concerns the example business process shown in Fig. 1; we introduce it to establish a proof of concept of RLRAM. The second is inspired by a real setting. It addresses the problem of optimizing resource allocation for a radiology CT-scan examination process that exhibits the complications RLRAM attempts to address, and it provides some confidence that RLRAM will scale up.

4.1. Case 1

4.1.1. Experiment setting

Fig. 1 illustrates the example process model. Furthermore, Table 2 lists several basic assumptions of the simulation experiment in this case study, such as the average costs and times of resources executing tasks. In addition, we added the following assumptions:

• One application data element is defined: "type". The possible values are "emergent" and "normal". For a particular process case, the probability of application data 〈type, normal〉 or 〈type, emergent〉 is 0.5 each.
• There is only one process case executing at a particular time instant. This means that the workload of resources r1 and r2 is Available.
• To illustrate how the proposed RLRAM tackles multiple objectives (like time and cost), we set two objectives: 1) minimizing the average execution costs of business process cases, and 2) minimizing the average flow time of business process cases.
• All the values of the state-resource pairs, Q(s, r), are initialized to zero, since all the actions for each state are assumed to be an equally valid choice.

To put the results into perspective, we compared RLRAM with the greedy strategy. The greedy strategy prescribes an action that minimizes the current expected cost, without taking into account future possibilities.

4.1.2. Experiment results and evaluation

We apply RLRAM in order to obtain optimal resource allocation policies in business process execution. The actions for four observed states are shown in Table 3, which exemplifies the decisions determined by the two resource allocation strategies, RLRAM and Greedy.


Table 3. RLRAM and Greedy strategies for the observed process states.

                                      | Action
                                      | Objective 1: Cost | Objective 2: Flow time
Observed state                        | Greedy | RLRAM    | Greedy | RLRAM
(t1, {〈type, normal〉}, Available)     | r2     | r1       | r1     | r1
(t2, {〈type, normal〉}, Available)     | r2     | r2       | r1     | r2
(t1, {〈type, emergent〉}, Available)   | r2     | r1       | r1     | r1
(t2, {〈type, emergent〉}, Available)   | r2     | r2       | r1     | r2

Table 4. Performance measures of the example process model.

                    | Greedy | RLRAM
Average cost        | 3.55   | 3.20
Average flow time   | 2.77   | 2.51

For instance, the observed state s = (t1, {〈type, normal〉}, Available) indicates an enabled work item wi (with wi.t = t1 and wi.dv = {〈type, normal〉}) waiting for allocation. For this observed state, the best policy is to allocate resource r1 to perform wi if the objective is to minimize the cost. Although cost(t1, {〈type, normal〉}, r1) is larger than cost(t1, {〈type, normal〉}, r2), this is an optimal policy since the total cost of executing the whole process case is minimized. On the other hand, if the objective is to minimize the flow time, then for the observed state s = (t2, {〈type, normal〉}, Available), an enabled work item wi (with wi.t = t2 and wi.dv = {〈type, normal〉}) should be allocated to r2, because this sacrifices immediate time consumption in order to minimize the global flow time of the whole process. This example indicates that an optimal resource allocation in business process execution does not depend on an isolated decision, but rather on a sequence of decisions. For a particular objective (cost or time), the decision maker may have to sacrifice immediate cost such that less cost is incurred later. Thus, this shows that RLRAM has the ability to optimize resource allocation in business process execution.

The simulation results are listed in Table 4. Note that Eq. (16) is adopted in this case study to update the Q-values. We observe that the average cost and the average flow time of the business process cases are about 3.20 and 2.51, respectively, under the control of RLRAM. This illustrates that RLRAM outperforms the greedy strategy. Thus, we can conclude that RLRAM is able to generate an optimal resource allocation policy with respect to different objectives in business process execution.

4.2. Case 2

4.2.1. Experiment setting

In this case, we take a realistic business process, the radiology CT-scan examination, as an example [45]. In Fig. 3, the process model is expressed in terms of a Workflow net [46]. The process starts with the "Receive an examination order" task (A): the physician of a clinical department sends a particular CT examination order for a patient to the radiology department. After receiving the order, the patient information and the radiology order with multiple requested exam procedures are registered, and an examination is arranged (B). Then the CT examination is performed according to the requested exam procedures (C). After that, the images are processed and stored in the Image Archive (D). Then radiologists query the images and create or edit report files (E). These documents are the results of the radiology exam process, and their content holds the interpretation and the impressions of the radiologist. The report files are subsequently audited by a radiologist (F). In parallel, an examination bill is processed (G). Finally, the report files are submitted to the server (H) and the case is closed.

Fig. 3. Radiology CT-scan examination process.


Note that tasks (E) and (F) may be performed one or several times iteratively: after auditing the report (F), the report file might be sent back for editing again (E).

There were two issues that needed to be considered before simulation. One was the objective. In our experiment, we made the assumption that the examination process always has to be completed as quickly as possible. Thus, the objective was to minimize the flow time of the process. The other issue was how to set up a simulation experiment that reflects various aspects of real scenarios. For this issue, we followed the daily work process of a real radiology department to motivate the requirements for radiology process management. We analyzed the radiology process logs from 2009-01-01 to 2009-12-31 of the radiology department of the Huzhou hospital in China to derive the following assumptions for the simulations:

• In a time horizon (i.e., from 8:00 AM to 5:00 PM), the number of radiology CT-scan examination process cases is the sum of appointments (scheduled cases) and unplanned examinations (e.g., emergencies), minus the number of absent cases. The number of appointments is known in advance. The number of emergent cases follows a Poisson distribution with parameter λ. In practice, patients sometimes do not show up for appointments. Thus, we assume that each appointment examination has a probability ρabs of absence. Note that ρabs is the same for all appointment examinations and that these examinations are independent.
• Each radiology CT-scan examination process case consists of a set of tasks in a prescribed order.
• Each radiology CT-scan examination process case cannot start before its appointment time (or arrival time for an emergent case).
• The expected flow time of each CT-scan examination process case is 2.5 h.

Table 5. Average task execution time of the qualified resources.

Examination type          | Task | CT1    | CT2    | r1     | r2     | r3     | r4
Craniocerebral scan       | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 20 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Craniocerebral enhanced   | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Cervical scan             | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 20 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Cervical enhanced         | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Chest scan                | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 20 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Chest enhanced            | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Cardiac scan              | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Cardiac enhanced          | (C)  | 30 min | 20 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 40 min | 40 min | 30 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 20 min | 20 min
Abdominal scan            | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 30 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Abdominal enhanced        | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Limb scan                 | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 20 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Limb enhanced             | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min
Spine scan                | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 20 min | 20 min | 15 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 10 min | 10 min
Spine enhanced            | (C)  | 20 min | 15 min | ∖      | ∖      | ∖      | ∖
                          | (E)  | ∖      | ∖      | 30 min | 30 min | 20 min | ∖
                          | (F)  | ∖      | ∖      | ∖      | ∖      | 15 min | 15 min

(∖ indicates that the resource is not qualified to perform the task.)


• There is a set of resources. For each type of task, a subset of resources is qualified to perform it, and the execution time is fixed and known in advance.
• Each work item is performed without interruption by a single resource.
• Two application data elements are defined: "patient type" and "examination type". The possible values of "patient type" are "emergent" and "normal". The possible values of "examination type" are listed in Table 5; for example, the probability that a process case is a craniocerebral scan examination is 0.0714 (i.e., 1/14).
• Appointment examination cases arrived every 12 min with absence probability ρ = 0.1, and emergent examinations arrived according to a Poisson process with rate λ = 1.5/hour.
• The Radiology Information System (RIS) was qualified to perform tasks (A), (B), and (H). Two CT machines, CT1 (16PIE) and CT2 (64PIE), were qualified to perform task (C). The Picture Archiving and Communication System (PACS) was qualified to perform task (D). Three radiologists r1, r2 and r3 were qualified to perform task (E), two radiologists r3 and r4 were qualified to perform task (F), and the Hospital Information System (HIS) was qualified to perform task (G).
• The execution time of each type of task differed according to examination type and resource ability. After discussion with the hospital managers and analysis of the process logs, we assumed that the service time of tasks (A), (B), (D), (G), and (H) is zero. For tasks (C), (E) and (F), the service times of the qualified resources are listed in Table 5.
• We assume that the transition probabilities are independent of resources except for tasks (C), (E), and (F). The resource-dependent transition probabilities of tasks (E) and (F) are as follows (an illustrative simulation sketch of these arrival and routing parameters is given after this list):

$$
\begin{bmatrix}
\rho(f_{E,E}, r_1) = 0.25 & \rho(f_{E,F}, r_1) = 0.75 & \\
\rho(f_{E,E}, r_2) = 0.25 & \rho(f_{E,F}, r_2) = 0.75 & \\
\rho(f_{E,E}, r_3) = 0.1 & \rho(f_{E,F}, r_3) = 0.9 & \\
\rho(f_{F,E}, r_3) = 0.1 & \rho(f_{F,F}, r_3) = 0.1 & \rho(f_{F,H}, r_3) = 0.8 \\
\rho(f_{F,E}, r_4) = 0.1 & \rho(f_{F,F}, r_4) = 0.1 & \rho(f_{F,H}, r_4) = 0.8
\end{bmatrix}
\qquad (18)
$$

• The objective of the radiology CT-scan examination process is to optimize resource allocation with minimal process flow time.
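To make the stochastic setting above concrete, the following sketch shows one way the case arrivals and the resource-dependent routing of Eq. (18) could be simulated. All names (TRANSITIONS, next_task, appointment_arrivals, emergent_arrivals) are our own assumptions; this is an illustration of the stated parameters, not the authors' simulator.

```python
import random

# Resource-dependent transition probabilities from Eq. (18):
# after task (E) or (F), the next task depends on which radiologist performed it.
TRANSITIONS = {
    ("E", "r1"): {"E": 0.25, "F": 0.75},
    ("E", "r2"): {"E": 0.25, "F": 0.75},
    ("E", "r3"): {"E": 0.10, "F": 0.90},
    ("F", "r3"): {"E": 0.10, "F": 0.10, "H": 0.80},
    ("F", "r4"): {"E": 0.10, "F": 0.10, "H": 0.80},
}

def next_task(task: str, resource: str) -> str:
    """Sample the successor task given the resource that just finished `task`."""
    dist = TRANSITIONS[(task, resource)]
    tasks, probs = zip(*dist.items())
    return random.choices(tasks, weights=probs, k=1)[0]

def appointment_arrivals(horizon_min: float, interval=12.0, absence_prob=0.1):
    """Scheduled cases every 12 minutes; each patient is absent with probability 0.1."""
    t = 0.0
    while t < horizon_min:
        if random.random() > absence_prob:
            yield t
        t += interval

def emergent_arrivals(horizon_min: float, rate_per_hour=1.5):
    """Emergent cases as a Poisson process with rate 1.5 per hour
    (exponential inter-arrival times with a mean of 40 minutes)."""
    t = random.expovariate(rate_per_hour / 60.0)
    while t < horizon_min:
        yield t
        t += random.expovariate(rate_per_hour / 60.0)
```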

The key step is to learn and reason based on RL. The performance goal of this experiment was to minimize the makespan of the radiology CT-scan examination process. In this case study, Eq. (17) is adopted to update Q-values. When starting the simulation, the values of Q(s, a) for each state–action pair can be initialized arbitrarily or assigned specific relative values to represent the confidence in each possible alternative. In our study, we initialized all Q(s, a) to zero, since every action in each process state was assumed to be an equally valid choice at the beginning of the experiment. Using this approach, the system started from a neutral position, assuming no a priori knowledge of which policy is best in any process state; RLRAM must learn from scratch. The step-size parameter α, a small positive fraction, influences the learning rate. The ε-greedy method was adopted to balance exploration and exploitation. With ε set to 0.1, an action is selected at random with probability 10%, independent of its Q(s, a) value, while with probability 90% the action with the best Q(s, a) is selected. In the simulations, we applied the Q-learning algorithm with ε = 0.1 (an illustrative code sketch of these settings is given further below).

4.2.2. Resource allocation strategies

To put the results in perspective, we compare our approach with three existing strategies:

• FIFO. The first-in, first-out policy implements an unbiased conflict solver because it neglects the properties of work items and the state of resources.
• SPT. The shortest processing time policy gives priority to the work item with the shortest imminent processing time. A work item of a particular process case σ waiting in a queue may cause the resources of the dedicated successor work items of σ to be idle; SPT alleviates this risk by reducing the length of the queue in the fastest possible way.
• SLACK. The slack of a process case is defined as the time span left until its due date, assuming that the remaining work items are performed without any delay. Since process cases may wait in front of each resource, the policy "slack per number of work items remaining" gives priority to the case with the minimum ratio of slack to the number of remaining work items. In this experiment, we set the expected flow time of the radiology CT-scan examination process to 2.5 h.

4.2.3. Experiment results and evaluation

The results of our approach and the comparison with the three existing strategies are given in Fig. 4. For each strategy, 500 process cases were executed. The flow time data were recorded beginning after the 100th process case was launched and continuing until the 500th process case was executed (i.e., over 400 process cases). This ensures that the process flow time is recorded only after the strategy has reached steady state in the simulations. Table 6 presents the evaluation results for each strategy. For most evaluation parameters, RLRAM outperforms the others, with a lower mean squared error (M.S.E) of the average process flow time. An improvement of about 17.3% in the average process makespan is possible compared with the SPT strategy (28.6% and 27.6% compared with the FIFO and SLACK strategies).
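The sketch below restates the learning configuration described above in code: Q-values initialized to zero, an ε-greedy choice with ε = 0.1, and a step-size α. We show the textbook one-step Q-learning update [43]; the exact form of the paper's Eq. (17) and of the reward signal is not reproduced here, and the discount factor is an assumption, so this should be read purely as an illustrative sketch.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # step-size (illustrative value; the paper only states it is a small fraction)
GAMMA = 0.9    # discount factor (assumption; not stated in this excerpt)
EPSILON = 0.1  # exploration probability used in the experiments

Q = defaultdict(float)  # Q(s, a) initialized to zero, as in the experiments

def choose_action(state, candidate_actions):
    """epsilon-greedy: with probability 0.1 pick a random allocation,
    otherwise the allocation with the best Q(s, a)."""
    if random.random() < EPSILON:
        return random.choice(candidate_actions)
    return max(candidate_actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """One-step Q-learning update (textbook form; the paper's Eq. (17) may differ in detail)."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

In RLRAM the reward would presumably be tied to flow time so that maximizing return corresponds to minimizing makespan; the precise reward definition is given earlier in the paper and is not repeated here.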


Fig. 4. Simulations of radiology CT-scan examination process.

The experimental results clearly indicate that RLRAM offers a substantial advantage in optimizing the performance of resource allocation in BPM, which confirms the usefulness of RLRAM and demonstrates its applicability. It not only dispatches work items efficiently, but also prioritizes those work items whose effect on cost is most decisive. Since the results presented here were obtained using simulation, it remains to be seen how the approach will perform in real systems. In particular, while the simulator strives to capture many of the intricacies that make efficient resource allocation a challenge in business process execution, it also glosses over many complicating aspects of real scenarios. For example, in the simulator the time required to execute a work item is a deterministic function of the resource's ability, whereas in a real scenario it depends on resource availability, preference, competence, cooperation with other resources, and so on. In addition, the simulator assumes that work items never require multiple resources simultaneously, avoiding the need to consider locks, etc. These and many other issues will need to be addressed before RLRAM can be deployed in practice. Nonetheless, our experimental results suggest that an optimal resource allocation solution can play an essential role in improving the performance of business process execution.

5. Related work

To position our work on resource allocation, two background areas must be discussed: operation management and the BPM context.

5.1. Operation management

Resource allocation is one of the classic problems studied in operation management, for example in job-shop scheduling [47], grid computing [36,37], activity networks [38], and autonomic systems [39]. In particular, job-shop scheduling, an important and complex activity, is similar to the resource allocation problem in business process execution.

Table 6
Comparing RLRAM with the other three strategies in the simulations of the radiology CT-scan examination process.

RA strategy | Average flow time (min) | S.T.D  | M.S.E (with 5% confidence interval)
------------|-------------------------|--------|------------------------------------
FIFO        | 86.220                  | 23.578 | 83.903, 88.539
SLACK       | 84.995                  | 20.237 | 83.006, 86.984
SPT         | 74.042                  | 12.752 | 72.789, 75.296
RLRAM       | 61.575                  | 11.563 | 60.438, 62.711


A job-shop model, typical in manufacturing, can be described as a set of jobs composed of sequences of operations that are processed on a set of machines [47]. Job-shop scheduling is a decision-making process that allocates limited resources over time to perform a set of jobs while meeting objectives [47]. This decision-making process can be broken into two functions: resource allocation and operation sequencing. Many methodologies have been developed to solve the problem, such as dispatching rules [48], search algorithms [33–35], and artificial intelligence techniques. The dispatching-rule approach uses information related to a specific job or operation to decide the next job to be processed on a machine. Because dispatching rules provide a straightforward method that exploits characteristics of the jobs, operations, and machines, their computation is simple and quick. Yet each dispatching rule is limited to one specific objective or data element. Panwalker and Iskander [48] classified dispatching rules into three categories: priority rules, heuristic scheduling rules, and others. Priority rules use information related to jobs, operations, or machines to determine the priority of resource allocation. Heuristic rules involve more complex considerations, such as machine load, alternative routings, alternative operations, etc. Search algorithms, such as simulated annealing and genetic algorithms, provide strategies to explore the solution space efficiently. However, their computation is sometimes slow and complicated. Another problem with search algorithms is that they may only produce a solution without explaining how it was derived. Reinforcement learning has also been applied to learn domain-specific heuristics for job-shop scheduling. Zhang and Dietterich [10] proposed an RL-based approach to solve synthetic job-shop scheduling problems, which was evaluated on problems from a NASA space shuttle payload processing task. Their results indicate that reinforcement learning is effective for constructing high-performance scheduling systems. Since industrial scheduling problems abound and general-purpose solutions to them probably do not exist, reinforcement learning methods have the potential to quickly find high-quality solutions to these scheduling problems.

5.2. Business process management

The study of resource allocation in the BPM context is still in its early stages. Nonetheless, there already exists a broad body of related work, which can be divided into two branches: authorization considerations and performance considerations.

5.2.1. Authorization considerations

Early work focused on the authorization aspects of resource allocation, and corresponding mechanisms have been provided, including the Role-Based Access Control (RBAC) model [49], which tends to focus on security considerations and neglects other organizational aspects, such as resource availability. From an industry point of view, studies focus on extensions of business process modeling languages such as BPEL4WS and the Business Process Modeling Notation (BPMN) to express resource authorizations in business process management. In [50], a formal framework was proposed that integrates RBAC into BPEL and allows authorization constraints to be expressed using temporal logic. In this framework, model checking can be applied to verify that a given BPEL process satisfies the security constraints.
Wolter et al. presented an extension of BPMN to express authorizations within the workflow model [51]. It enables the support of resource allocation patterns in BPMN, such as separation of duty and role-based allocation. In [52], a team-enabled resource allocation reference model was presented, accompanied by detailed Object Constraint Language (OCL) constraints. OCL allows for the specification of generic (i.e., at the meta level) and specific (i.e., at the organizational/process level) constraints. Based on this team-enabled reference model, various aspects of resource allocation in the presence of teams can be explored.

5.2.2. Performance considerations

Some researchers have paid attention to balancing resource allocation requirements and process performance. In [3], a dynamic resource allocation mechanism was introduced. Compared to push or pull approaches, this mechanism allows on-the-fly balancing of security and performance considerations. It integrates work distribution and delegation into a common framework, thus obviating the need for a separate delegation mechanism. Russell et al. introduced workflow resource patterns, which can be viewed as a comprehensive study of resource management in contemporary workflow products [1]. They identified 43 workflow resource patterns that describe the manner in which resources are allocated in BPMSs. These patterns can be viewed as a guideline for BPMS construction and a basis for investigating advanced resource allocation mechanisms. Investigating resource allocation requires the analysis, evaluation, and comparison of BPMS models; in [4], the models of Staffware, FileNet, and FLOWer are compared in terms of resource patterns, and a reference model is provided based on Colored Petri Net (CPN) Tools [53,54]. This work shows that one of the core elements of resource management is an advanced resource allocation mechanism. As observed in [4], current BPMSs lack the ability to learn, mine, and reason about suitable resource allocation knowledge in business process execution. To the best of our knowledge, the approaches most similar to ours are presented in [8,55]; they are mainly based on supervised machine learning algorithms. Liu et al. [7] provide a semi-automated approach to assigning human resources in BPMSs at run time; their approach distributes selected target work items based on precedent resource allocation results and supervised machine learning algorithms. In [55], Rinderle and van der Aalst introduce an approach for staff assignment mining based on event logs, knowledge about organizational structures, and a decision tree learning method. Their work focuses on mining staff assignment rules and seldom considers resource allocation optimization in a dynamic environment. Business process execution does not only require learning and obeying resource allocation rules; it should also answer the question of how to allocate resources efficiently to improve process performance.


Resource allocation in business process execution can be seen as an interactive problem. It is impractical to obtain examples of desired behavior that are both correct and representative of all situations in which the agent has to act. To assign a particular work item to an appropriate resource at a particular time, the agent must interact with the process execution environment: it must be aware of how the environment responds to its actions, and it must be able to influence what happens through process and resource behaviors. Although a number of works have dealt with resource allocation in business process management, to the best of our knowledge none provides an intelligent mechanism to optimize resource allocation in business process execution. From this point of view, our approach can be viewed as building on and extending this previous work.

6. Conclusion

The main contribution of this paper is a reinforcement learning based approach for dealing with resource allocation optimization problems in BPM. RLRAM performs dynamic resource allocation based on interactions with the environment and uses Q-learning to make allocation decisions in real time. RLRAM has been implemented in the ProM framework [56] and, as a proof of concept, it can cooperate with business process management systems such as YAWL [57]. Simulation experiments were conducted to evaluate the approach. The experimental results show that RLRAM is effective for the considered problem, clearly outperforming simple strategies in optimizing resource allocation in business process execution. This is a good starting point for implementing RLRAM in practice. We believe that RLRAM can be used at the business process instantiation and execution stages at run time to improve the current state of business process management. Future work includes investigating additional sources of information and validating the experimental results on real system data. It has to be mentioned that the presented experiments are restricted to a single evaluation dimension, such as monetary cost or flow time. In reality, resource allocation in BPM is a typical multi-objective problem: within particular scenarios, resource allocation decisions might satisfy some objectives while sacrificing others. How to tackle multiple, potentially contradictory, objectives will therefore be addressed in our future work. Another interesting direction for further research concerns how resource allocation and the business process structure impact each other. In the research presented here, resource allocation performance is optimized given a certain process structure; the adjustment of the process structure is not within the learner's control. Ideally, the structure of a business process is improved so that resources can be utilized in a more optimal way [31]. In a more sophisticated system, RL would help to determine a more efficient business process structure when it is initially designed, when it needs to be upgraded, or when it is repaired.

Acknowledgements

This work was supported by the Brain-Bridge Project: Workflow Management and Process Mining in Healthcare, from Philips (China) Investment Co., Ltd. (Philips), Zhejiang University, and Eindhoven University of Technology. The authors would like to thank the anonymous reviewers for their constructive comments on an earlier draft of this paper.
References

[1] N. Russell, A.H.M. ter Hofstede, D. Edmond, W.M.P. van der Aalst, Workflow resource patterns, BETA Working Paper Series, WP 127, 2004.
[2] K. van Hee, A. Serebrenik, N. Sidorova, M. Voorhoeve, J. van der Wal, Scheduling-free resource management, Data & Knowledge Engineering 61 (1) (2007) 59–75.
[3] A. Kumar, W.M.P. van der Aalst, E.M.W. Verbeek, Dynamic work distribution in workflow management systems: how to balance quality and performance, Journal of Management Information Systems 18 (3) (2002) 157–193.
[4] M. Pesic, W.M.P. van der Aalst, Modeling work distribution mechanisms using Colored Petri Nets, International Journal on Software Tools for Technology Transfer 9 (3–4) (2007) 327–352.
[5] M. Dumas, W.M.P. van der Aalst, A.H.M. ter Hofstede, Process-Aware Information Systems: Bridging People and Software through Process Technology, Wiley & Sons, 2005.
[6] N. Russell, W.M.P. van der Aalst, Work distribution and resource management in BPEL4People: capabilities and opportunities, Lecture Notes in Computer Science, vol. 5074, 2008.
[7] Y. Liu, J. Wang, Y. Yang, J. Sun, A semi-automatic approach for workflow staff assignment, Computers in Industry 59 (2008) 465–476.
[8] L.T. Ly, S. Rinderle, P. Dadam, M. Reichert, Mining staff assignment rules from event-based data, Lecture Notes in Computer Science, vol. 3812, 2006, pp. 177–190.
[9] B. Turhan, A. Bener, Analysis of naive Bayes' assumptions on software fault data: an empirical study, Data & Knowledge Engineering 68 (2) (2009) 278–290.
[10] W. Zhang, T. Dietterich, A reinforcement learning approach to job-shop scheduling, IJCAI, 1995.
[11] M. Gravel, J.M. Martel, R. Nadeau, W. Price, R. Tremblay, A multicriterion view of optimal resource allocation in job-shop production, European Journal of Operational Research 61 (1–2) (1992) 230–244.
[12] D.A. Koonce, S.-C. Tsai, Using data mining to find patterns in genetic algorithm solutions to a job shop schedule, Computers & Industrial Engineering 38 (3) (2000) 361–374.
[13] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros, Workflow patterns, Distributed and Parallel Databases 14 (3) (2003) 5–51.
[14] M.L. Puterman, Markov Decision Processes, Wiley, 1994.
[15] J. Filar, K. Vrieze, Competitive Markov Decision Processes, Springer, 1997.
[16] J. Tang, J. Zhang, Modeling the evolution of associated data, Data & Knowledge Engineering 69 (9) (2010) 965–978.
[17] S. Mahadevan, L.P. Kaelbling, The NSF workshop on reinforcement learning: summary and observations, AI Magazine, Winter 1996.
[18] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
[19] A.B. Hillel, A.D. Nur, L. Ein, R.G. Bachrach, Y. Ittach, Workstation capacity tuning using reinforcement learning, ACM/IEEE Conference on Supercomputing, 2007.
[20] K.H. Quah, C. Quek, Maximum reward reinforcement learning: a non-cumulative reward criterion, Expert Systems with Applications 31 (2) (2006) 351–359.


[21] F. Liu, G. Zeng, Study of genetic algorithm with reinforcement learning to solve the TSP, Expert Systems with Applications 36 (3) (2009) 6695–7001.
[22] G. Gómez-Pérez, J.D. Martín-Guerrero, et al., A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients, Expert Systems with Applications 36 (6) (2009) 9737–9742.
[23] R.E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
[24] Business process, http://en.wikipedia.org/wiki/Business_process.
[25] G. Decker, J. Mendling, Process instantiation, Data & Knowledge Engineering 68 (9) (2009) 777–792.
[26] J. Nakatumba, W.M.P. van der Aalst, Analyzing resource behavior using process mining, Business Process Management Workshops: BPM 2009 International Workshops, 2009.
[27] W.W. Hudson, et al., Measurement issues in human behavior theory, Journal of Human Behavior in the Social Environment (1998).
[28] M.D. Feit, J.S. Wodarski, W.R. Nugent, Approaches to Measuring Human Behavior in the Social Environment, Routledge, February 2006.
[29] S. Jablonski, C. Bussler, Workflow Management: Modeling Concepts, Architecture, and Implementation, International Thomson Computer Press, London, UK, 1996.
[30] W.M.P. van der Aalst, K.M. van Hee, Workflow Management: Models, Methods, and Systems, MIT Press, Cambridge, MA, 2002.
[31] J. Xu, C. Liu, X. Zhao, Resource allocation vs. business process improvement: how they impact on each other, 6th International Conference, BPM 2008, Lecture Notes in Computer Science, vol. 5240, 2008.
[32] S.A. Burns, L. Liu, C.W. Feng, The LP/IP hybrid method for construction time-cost trade-off analysis, Construction Management & Economics 14 (3) (1996) 265–276.
[33] R.W. Eglese, Simulated annealing: a tool for operational research, European Journal of Operational Research 46 (3) (1990) 271–281.
[34] E.G. Coffman, Computer and Job-Shop Scheduling Theory, John Wiley & Sons, 1976.
[35] G. Viloct, J.C. Billaut, A tabu search and a genetic algorithm for solving a bicriteria general job shop scheduling problem, European Journal of Operational Research 190 (2008) 398–411.
[36] D. Vengerov, A reinforcement learning approach to dynamic resource allocation, Engineering Applications of Artificial Intelligence 20 (3) (2007) 383–390.
[37] D. Vengerov, L. Mastroleon, D. Murphy, N. Bambos, Adaptive data-aware utility-based scheduling in resource-constrained systems, Journal of Parallel and Distributed Computing 70 (9) (2010) 871–879.
[38] S.E. Elmaghraby, Resource allocation via dynamic programming in activity networks, European Journal of Operational Research 64 (1993) 199–215.
[39] G. Tesauro, R. Das, W.E. Walsh, J.O. Kephart, Utility-function-driven resource allocation in autonomic systems, Proceedings of the 2nd IEEE International Conference on Autonomic Computing, 2005.
[40] P. Stone, M. Veloso, Team-partitioned, opaque-transition reinforcement learning, in: M. Asada, H. Kitano (Eds.), RoboCup-98: Robot Soccer World Cup II, Springer, 1999.
[41] H.A. Reijers, et al., Resource allocation in workflows, Design and Control of Workflow Processes, Lecture Notes in Computer Science, vol. 2617, 2003, pp. 177–206.
[42] Y.H.V. Lun, K.-H. Lai, T.C.E. Cheng, Shipping and Logistics Management, Springer, London, 2010.
[43] C.J.C.H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.
[44] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[45] J. Zhang, X. Lu, H. Nie, Z. Huang, W.M.P. van der Aalst, Radiology information system: a workflow-based approach, International Journal of Computer Assisted Radiology and Surgery 4 (5) (2009) 509–516.
[46] W.M.P. van der Aalst, The application of Petri nets to workflow management, Journal of Circuits, Systems, and Computers 8 (1) (1998) 21–66.
[47] K.R. Baker, Introduction to Sequencing and Scheduling, Wiley, New York, 1974.
[48] S.S. Panwalker, W. Iskander, A survey of scheduling rules, Operations Research 25 (1977) 45–61.
[49] D.F. Ferraiolo, R. Sandhu, S. Gavrila, D.R. Kuhn, R. Chandramouli, Proposed NIST standard for role-based access control, ACM Transactions on Information and System Security 4 (3) (2001) 224–274.
[50] Z. Xiangpeng, A. Cerone, P. Krishnan, Verifying BPEL workflows under authorization constraints, in: S. Dustdar, J.L. Fiadeiro, A.P. Sheth (Eds.), Business Process Management, 4th International Conference, BPM 2006, vol. 4102, 2006, pp. 439–444.
[51] C. Wolter, A. Schadd, Modeling of task-based authorization constraints in BPMN, in: G. Alonso, P. Dadam, M. Rosemann (Eds.), Business Process Management, 5th International Conference, BPM 2007, vol. 4714, 2007, pp. 64–79.
[52] W.M.P. van der Aalst, A. Kumar, A reference model for team-enabled workflow management systems, Data & Knowledge Engineering 38 (3) (2001) 335–363.
[53] K. Jensen, Coloured Petri Nets: Basic Concepts, Analysis Methods and Practical Use, EATCS Monographs on Theoretical Computer Science, vol. 1, 1997.
[54] CPN Group, University of Aarhus, Denmark, CPN Tools Home Page, http://wiki.daimi.au.dk/cpntools/.
[55] S. Rinderle, W.M.P. van der Aalst, Life-cycle support for staff assignment rules in process-aware information systems, Technical Report, TU Eindhoven, 2007.
[56] B. van Dongen, A.K. Alves de Medeiros, H.M.W. Verbeek, A.J.M.M. Weijters, W.M.P. van der Aalst, The ProM framework: a new era in process mining tool support, in: G. Ciardo, P. Darondeau (Eds.), Application and Theory of Petri Nets, Lecture Notes in Computer Science, vol. 3536, 2005.
[57] W.M.P. van der Aalst, A.H.M. ter Hofstede, YAWL: yet another workflow language, Information Systems 30 (4) (2005) 245–275.

Zhengxing Huang received his B.S. degree in 2003 from the College of Biomedical Engineering and Instrument Science at Zhejiang University, P.R. China. At present he is a Ph.D. candidate in the College of Biomedical Engineering and Instrument Science at Zhejiang University, P.R. China. His research interests include computer-aided medical decision support using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.

Xudong Lu is an associate professor in the College of Biomedical Engineering and Instrument Science at Zhejiang University, Hangzhou, P.R. China. Currently he is working on (i) workflow management and process mining in healthcare and (ii) fundamental research in the domain of medical informatics and clinical decision support. He is the author of many scientific publications in the mentioned research fields.


W. M. P. van der Aalst is a full professor of Information Systems at the Technische Universiteit Eindhoven (TU/e). He is also an adjunct professor at Queensland University of Technology (QUT) working within the BPM group there. His research interests include workflow management, process mining, Petri nets, business process management, process modeling, and process analysis.

Huilong Duan received his B.S. in Medical Instrumentation from Zhejiang University, P.R. China in 1985, his M.S. in Biomedical Engineering from Zhejiang University, P.R. China in 1988, and his Ph.D. in Engineering (Evoked Potential) from Zhejiang University, P.R. China in 1991. He is currently a Professor in the Department of Biomedical Engineering and Dean of the College of Biomedical Engineering & Instrument Science, Zhejiang University. His research interests are in medical image processing, medical information systems, and biomedical informatics. He has published over 90 scholarly research papers in these areas. He is a Program Committee Member of Computer Aided Radiology and Surgery; an Editorial Board member of Space Medicine & Medical Engineering, the Chinese Journal of Medical Instruments, and the Chinese Journal of Biomedical Engineering; Secretary-General of the BME Education Steering Committee, Chinese Ministry of Education; and a Member of the Brain-Bridge Program Committee, Philips, TU/e & ZJU.