Discovering workflow models from activities' lifespans

Shlomit S. Pinter*, Mati Golani
IBM Research Laboratory in Haifa, Haifa 31905, Israel

Computers in Industry 53 (2004) 283–296. doi:10.1016/j.compind.2003.10.004

Abstract

Workflow systems utilize a process model for managing business processes. The model is typically a directed graph annotated with activity names. We view the execution of an activity as a time interval, and present two new algorithms for synthesizing process models from sets of system executions (an audit log). A model graph generated by each of the algorithms for a process captures all of its executions and dependencies that are present in the log, and preserves existing parallelism. We compare the model graphs synthesized by our algorithms to those of Agrawal et al. [Mining process models from workflow logs, in: Proceedings of the Advances in Database Technology (EDBT'98), 6th International Conference on Extending Database Technology, Valencia, Spain, 23–27 March 1998, Lecture Notes in Computer Science, vol. 1377, Springer, Berlin, 1998] by running them on simulated data. We observe that our graphs are more faithful, in the sense that the number of excess and absent edges is consistently smaller and depends on the size and quality of the log. In other words, we show that our time interval approach permits the reconstruction of more accurate workflow model graphs from a log. © 2003 Elsevier B.V. All rights reserved.

Keywords: Workflow mining; Process discovery; Event log; Activity lifespan

1. Introduction

Constructing business processes is a central issue for companies [8,20]. Managing processes in an automatic or semi-automatic fashion results in a significant reduction of cost, improves the efficiency of business operations, and enables fast adaptation to changing requirements. As a result, developing techniques for constructing and managing business processes is an active research area [2,6]. Workflow systems utilize a visual model of information flow that is used for monitoring and managing systems that execute actions (also called activities or tasks) in predefined situations. The actions, together with constraints on the execution order between them, define the business process [11]. Commercial workflow systems and management consoles need a model of the business process for scheduling agents (e.g., computers) to execute the actions, control production, etc. (see [11,8]). For modeling a business process, most ERP/CRM products use an embedded workflow model [7,19].

* Corresponding author. Tel.: +972-4-829-6541; fax: +972-4-829-6116. E-mail address: [email protected] (S.S. Pinter).

Many organizations that run their systems on legacy applications do not have a model of the processes within the organization. Current tools for model detection operate on the resource level only. Thus, there is a need for tools that build models at the business process level, especially since executive-level measures such as "return on investment" or SLA quality are derived from this level rather than from the resource level. There are a few methods for constructing a business process model from information stored in a workflow log (a collection of process execution logs). We follow the approach that represents the model as a directed graph (workflow graph) with nodes representing activities; an edge from node A to node B represents that there is a process execution in which A must finish executing before B starts. In practice, a single business process model can permit one execution that includes a given activity and another execution that does not include it. Thus, for each process execution the participating edges are selected by a Boolean function associated with each edge. The function determines whether or not control flows along that edge. Another paradigm is workflow evolution, which updates process models according to the logs [1,3,7,12,13,15,17].

1.1. Contribution

Unlike event-based models, in which the execution of an activity is represented as a single event, we view the execution of an activity as a time interval (life span) based on its start and end events in the workflow log. With this view we can recognize concurrent activities within a single process execution, since intersecting life span intervals in a process execution represent concurrent activities. In event-based models, concurrent activities cannot be recognized from a single process execution.

We present a new algorithm based on our interval model. Given a workflow log, we show that, compared with the event-based approach, the interval approach enables the reconstruction of a more accurate process model graph. We define causality dependence between activities with respect to the process execution log and describe an algorithm that, given a process execution log, generates a workflow graph that guarantees the following:

- Completeness: Every process execution in the log can be generated from the workflow graph.
- Correctness: All the dependencies with respect to the execution log exist in the workflow graph.
- Preserving parallelism: If two activities, A and B, are concurrent with respect to the log, then there are two paths from start to end such that one includes A and does not include B, and the other includes B and does not include A.

A workflow graph generated by our algorithm resembles the original workflow graph (the input for the system that generated the log) by having a minimum number of excess and absent edges. This minimum depends on the size and quality of the log. The time complexity of the algorithm is O(|ex|·|V|²), where |ex| and |V| are the number of executions and activities in the log, respectively. We measure the quality of the algorithms by computing their precision and recall values, where precision is the ratio of correctly identified edges over the total number of generated edges, and recall is the ratio of correctly identified edges over the total number of edges in the original workflow graph (the targeted workflow graph).

In their seminal paper [2], the authors build a workflow graph by considering each activity as an atomic event; this was achieved by taking the finish event of an activity to be a node in the graph. We show that when the interval view is taken, there is more information that can be used for reconstructing a more accurate model. We compare our results with those of [2] and show that the quality of the generated models is better with our algorithm.

1.2. Background and related work

The topic of process discovery has been studied for some time [4–6,9,10,21]. Our work can be viewed as an extension of the event-based approach taken by [2,9], which uses a single event to denote the execution of an activity and reconstructs directed acyclic graphs as models. The common approach in those studies was to identify the different business processes in the execution log and gather the participating activities of each execution. In [5,6], Cook and Wolf searched for software processes by investigating events in a concurrent environment. They present three methods for process discovery: a method based on neural networks; an algorithmic approach, which builds a finite state machine in which states are fused if their futures (in terms of possible behavior in the next k steps) are identical; and a Markovian approach, which uses a mixture of algorithmic and statistical methods and takes noise into consideration. The authors propose specific metrics (entropy, event type counts, periodicity, and causality) and use these metrics to discover models out of event streams. However, they do not provide an approach for generating explicit process models. In [4], they provide a measure to quantify discrepancies between a process model and the actual behavior as registered using event-based data.

The models generated by [5,6,18] are Petri nets and finite automata. For the reconstruction of the models, they use the frequencies of event sequences in the log. A learning method for process mining is presented in [14]. A comprehensive survey of process mining techniques is presented in [16].

The remainder of this paper is organized as follows: in Section 2, we describe the process model and introduce our interval model. In Section 3, we present our new algorithms for reconstructing a workflow graph from a rich execution log. The algorithms are demonstrated on an eBusiness process example. Then, in Section 4, we present our simulation environment, followed by experiment results and discussion in Section 5. Finally, in Section 6, we conclude with a short summary.

2. Process model

The common way to model a workflow is a directed graph composed of a set of nodes, each labeled by the name of an activity (two of the labels are start and end), and a set of directed edges with a Boolean control function associated with each one of them. There is an edge from a node labeled by activity A to a node labeled by activity B only if there is an execution where B can start executing immediately after the termination of activity A; in such a case the control function on the edge evaluates to one (TRUE). If there is an edge from A to B, we say that B is a successor of A and A is a predecessor of B. Whether B must execute following the execution of A depends on the value of the control function applied to data available when A is done. For more details on the model see [2,10]. In this common approach, the execution of an activity is taken to be atomic (instantaneous) and no two activities may ever start at the same time; a process execution in this model is a list of activities. In Section 2.1, we extend the model to capture a more accurate view in which the execution of an activity is an interval along the time axis, and two intervals may intersect.

We assume that there are no directed cycles in the graph and that every label appears at most once in each execution. These assumptions can be removed by relabeling activities.

A legal flow is a maximal connected subgraph of the workflow model graph such that the control function evaluates to one on each edge of the subgraph (all nodes are of AND type), both the start and the end activities are in the subgraph, and every activity (node) is on a directed path from start to end. A legal flow graph, over a set of activities, is a partial order representing all possible ways to schedule the selected activities (all possible executions). The meaning of the order is that an activity can be executed only when all of its predecessor activities in the flow graph have finished executing (AND join), and only when it is done can its successor activities start executing (AND split). Since the successor activities can execute concurrently, we say that they are immediately concurrent.

We define an execution over a workflow graph to be a consistent linearization of a legal flow (i.e., one that preserves edge ordering), represented by a list of activities A = a1, a2, ..., an starting with the start activity a1 and ending with the end (target) activity an. From the assumptions on the workflow graph, we know that no activity appears more than once in an execution list.

Fig. 1 shows an example of a workflow model graph and two legal flow subgraphs. In flow (a), the control function evaluates to 0 on the edge from A to C and on the edge from B to D. In flow (b), the value is 0 on the edge from B to C. Execution (A,B,C,D) is a legal execution in both flows, but (A,C,B,D) is a legal execution only in flow (b). These are also the only legal executions in our example.

For a given workflow model graph we define two conditions:

(CA) Activities ai and aj are concurrent activities if there are two executions of a legal flow such that ai appears before aj in one execution, and ai appears after aj in the other execution.

(NS) Activity ai is not an immediate successor of aj if ai ≠ aj+1 in all legal executions. From the edge semantics, this means that ai never immediately succeeds aj.

Note that concurrent activities are defined with respect to a single flow. Both conditions will be refined for the interval view.
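To make the definitions concrete, here is a minimal Python sketch (the encoding and names are illustrative, not from the paper) of a legal flow as successor lists, with a check that an activity list is a consistent linearization of it:

def is_legal_execution(flow, execution):
    """True iff `execution` is a consistent linearization of `flow`;
    `flow` maps each activity to its successor list (AND semantics)."""
    pos = {a: i for i, a in enumerate(execution)}
    if len(pos) != len(execution) or set(pos) != set(flow):
        return False                      # repeated activity, or wrong activity set
    # every edge of the flow must be respected by the ordering
    return all(pos[u] < pos[v] for u in flow for v in flow[u])

# Flow (b) of Fig. 1: A splits to B and C, which join at D.
flow_b = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
assert is_legal_execution(flow_b, ["A", "B", "C", "D"])      # legal
assert is_legal_execution(flow_b, ["A", "C", "B", "D"])      # legal
assert not is_legal_execution(flow_b, ["A", "D", "B", "C"])  # violates B -> D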

Fig. 1. A sample workflow model graph and two legal flows (the workflow graph, flow (a), and flow (b), over activities A, B, C, D).

In Fig. 1, B and C are concurrent activities due to flow (b), for which both executions are legal, and D is not an immediate successor of A. There is no way to reconstruct the workflow model graph from the two executions, since they are also the only legal executions of a workflow model graph that has no edge from B to C. The task of reconstruction is even harder when only some of the executions are known. Assume that we know only of (A,B,C,D); then we can only reconstruct a straight-line graph. In the interval model we get, in addition, executions of the form (A, B ∥ C, D), indicating that B and C were executed concurrently. In such a case, we can infer the existence of flow (b) from a single execution. Graph (b) cannot replace (a) as the model, since (a) also comprises the option for the control function from B to D to evaluate to 0.

Given two activities ai and aj in a workflow graph, activity ai depends on activity aj iff whenever ai appears in some legal execution over the workflow graph, aj must appear in that same execution some time earlier (i.e., j < i and the time of the termination event of aj is smaller than the time of the ready event of ai). This definition of dependence implies that ai cannot run unless aj ran some time earlier (causality). In Fig. 1, activity D depends on A. As another example, given a workflow graph for a car-building and verification process that contains the activities "ship a car" and "final checks", we know that "ship a car" depends on "final checks". Indeed, our definition implies that in every execution of the flow graph, if "ship a car" appears then "final checks" must appear some time earlier.

Claim 1. Given a workflow model graph and two of its activities ai and aj such that ai depends on aj, there is a path from activity aj to activity ai in the graph and in every legal flow.

The proof can be derived from the definitions and the above discussion.

2.1. Log structure and the interval model

The log of a workflow contains monitored data that refers to process executions. Each execution of a process is composed of executed activities ordered on a time axis. In some systems, the monitored data is a single event per executed activity and the time at which it occurred. Other systems generate richer audit trails that contain more events per executed activity. One example is the MQWorkflow audit trail, which follows the WFMC audit standard [22] and provides logs that contain start and stop events for each activity (in addition to other events), the details of the running business process instance, and its business process template.

For each process instance (execution) we consider the logged records that contain the following events of an activity: {ready, started, restarted, ended normally, force finished, aborted (failed)}. Clearly, one can remove executions that contain noisy events; e.g., if a force finished event indicates noise in the process workflow, the corresponding executions can be removed. We say that an event is a start event if it is one of {started, restarted}, and it is a termination event if it is neither a ready nor a start event. Each record in the log contains additional data such as time, process name, process ID (an execution), activity name, activity ID, and user ID.

An activity has to be ready before it can start. Once it has started, it can end normally, be forced to finish, or abort unsuccessfully. The life span of an executed activity is defined to be the time interval from its start event to its termination event, and the extended life span is defined to be the time interval from its ready event to its termination event. Each event indicates a change in the state of the system; thus, for example, a system in a ready state will stay in that state until a start event occurs. In what follows we refer to events of a single process ID only and view it as an execution.

In a distributed system the time stamps of events are given locally by each component. As a result, we may have a case where the ordering of two events, as indicated by their time stamps, is incorrect. We assume that the time stamps of events imply that if activity B is a successor of activity A (there is an edge from A to B in the graph), then the time of the ready event of activity B is greater than the time of the termination event of A. This assumption does not impose a total order on all the events of an execution; rather, it says that the clocks keep causality order. The assumption on the time stamps can easily be maintained either with a global clock, when one exists, or by adding the time of the termination event to the data used in the control functions that trigger the successor's activities. This time can be used to adjust the time of the successor's ready event (at least in the log) to follow the termination of its predecessor.

Since errors, like exceptions, may create execution instances that are not compatible with the business process, the corresponding log can contain noisy data. Such noise is handled during the construction of the workflow graph. Note that the successor relationship between activities is maintained both by the times in a richer audit trail and by the execution generated from a log that has a single event per executed activity. The following condition can be derived from the information in a rich audit trail, in addition to CA and NS above:

(rCA) For a given list of activity events (execution) of some process, activities ai and aj are concurrent activities if there is an overlap between the extended life spans of ai and aj (e.g., there is a time slot in which both activities are in an active state).
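As a sketch of how these notions can be read off a rich audit trail, the following example extracts extended life spans and tests the rCA overlap condition; the record layout and all names are assumptions for illustration, not the WFMC audit format:

START_EVENTS = {"started", "restarted"}
TERMINATION_EVENTS = {"ended normally", "force finished", "aborted"}

def extended_life_spans(records):
    """records: (time, activity, event) triples of a single process ID.
    Returns {activity: (ready_time, termination_time)}."""
    ready, term = {}, {}
    for t, act, ev in records:
        if ev == "ready":
            ready[act] = t
        elif ev in TERMINATION_EVENTS:
            term[act] = t
    return {a: (ready[a], term[a]) for a in ready if a in term}

def concurrent_rca(spans, a, b):
    """rCA: the extended life spans of a and b overlap."""
    (ra, ta), (rb, tb) = spans[a], spans[b]
    return ra < tb and rb < ta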

3. Reconstructing a workflow graph

We next describe the problem of reconstructing a model workflow graph from a given set of executions (process log) that are generated in a workflow system.


Given two activities ai and aj in a process log, activity ai depends on aj with respect to the process log iff whenever ai appears in some execution in the log, aj appears in that execution some time earlier and the time of the termination event of aj is smaller than the time of the ready event of ai. Note that the definition of dependence over the process log in [2] does not imply our notion of dependence over the log, or even over the workflow graph, since they, for example, permit the execution of the dependent activity without its cause. Since many legal executions may not be present in the log, and since parallel activities appear sequentially in an execution, some non-dependent activities can be considered dependent with respect to the log.

For a given process log L we define:

(CAL) Activities ai and aj are concurrent activities with respect to L if one of the following conditions is satisfied:
- There are two executions in L, over the same set of activities, such that ai appears before aj in one execution and ai appears after aj in the other execution.
- There is an execution in L such that the extended life spans of ai and aj overlap.

(NSL) Activity ai is not a successor of aj with respect to L if for every execution in L at most one of aj or ai is present.

A reconstructed workflow graph G is consistent with a process log L if the following conditions hold:

- Completeness: Every execution in L can be generated from G.
- Correctness: All the dependencies with respect to L exist in G; i.e., for every dependency with respect to L, there is a corresponding path in G.
- Preserving parallelism: If two activities, A and B, are concurrent with respect to L, then there are two paths in G from start to end such that one includes A and does not include B, and the other includes B and does not include A.

A consistent workflow graph of a given process log is optimal if it has the smallest number of excess and absent edges (with respect to the original model graph) over all consistent workflow graphs defined by that log.
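The dependence test over a log can be rendered directly; in this hedged sketch, each execution is assumed to map an activity to its (ready, termination) time pair, and the function name is illustrative:

def depends_on(log, ai, aj):
    """ai depends on aj with respect to `log`: in every execution that
    contains ai, aj also appears and terminates before ai becomes ready."""
    runs = [ex for ex in log if ai in ex]
    # vacuously false if ai never appears in the log
    return bool(runs) and all(aj in ex and ex[aj][1] < ex[ai][0]
                              for ex in runs)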


3.1. Constructing the process execution graph

We present an algorithm for reconstructing a workflow graph with the interval approach. The workflow model graph is generated by combining graphs that are generated from the process executions in the log. In the first step, a graph is generated for each execution (execution graph). The interleaving of intervals in a single execution makes this graph nontrivial. During this step we keep information about overlapping intervals. Next, a single graph is generated for all the execution graphs that have the same set of activities. In the last step, the graphs are merged together and strongly connected components are handled.

Combining execution graphs is thus done in two steps. The reason for first combining executions that run over the same set of activities stems from the observation that, in general, they were all run on a single subgraph of the workflow graph (i.e., the parameters of the business process that are used to select the participating edges were the same). Thus, the resulting graph corresponds to a flow graph.

Given an execution from a process log where activity events are sorted by time, the following algorithm builds an execution graph. During the generation of the graph we maintain two sets of nodes: the current frontier and the new frontier. The nodes in the current frontier are the latest nodes that were added to the graph; their out-degree is zero at that stage. In addition, we maintain two markers along the time axis: current time and next time.

3.1.1. Algorithm for generating an execution graph

The first node in the graph is the start activity (the first in every execution). Let the start node be the current frontier set and the time of its finish event be the current time. The following steps are executed until the time of the finish event of the end (target) activity is the value of current time.

1. Let the time of the first finish event that follows current time be next time.
2. Add to the graph a node for each activity that has its ready event between current time and next time. This set of nodes is the new frontier.
3. Add an edge from each node in the current frontier to every node in the new frontier.
4. Change the current frontier to be the set of activities that includes the activity that finished at next time and all the activities that finished between next time and the first ready event that follows next time (when one exists). Let current time be the time of the latest finish event among the activities in this set.

We demonstrate the reconstruction algorithm on the eBusiness process example presented in Fig. 2. The process describes a system for making hotel, flight, and car reservations. The possible scenarios are: (i) car reservation only, (ii) hotel and flight reservations, or (iii) both options. In addition, the client can indicate if he/she would like to join the customers club; this is effective only when a flight is reserved.

Fig. 2. A sample workflow graph.

Activity A—the client enters the system, inserts his personal details, and indicates the type of reservation needed. This defines the scenario, or legal flow, that will be used.
Activity B—handles membership registration.
Activity C—the client inserts the required destination and dates.
Activity D—the client chooses the car type, pickup date, and location.
Activity E—performs a query to locate appropriate hotels and the client makes his choice.
Activity F—performs a query to locate suitable flights and the client makes his choice.
Activity G—the client indicates return date and location, then the system performs a query for the available cars and the client makes his choice.


Activity H—the client confirms his order and the reservation transactions are committed.

The following rich executions were logged by the business process. In this log, a letter corresponds to a ready event and a primed (tagged) letter is the termination event of the corresponding activity:

1. (A, A′, B, C, D, C′, E, B′, F, D′, F′, G, G′, E′, H, H′),
2. (A, A′, C, B, D, D′, B′, G, C′, E, F, F′, G′, E′, H, H′),
3. (A, A′, C, C′, F, E, E′, F′, H, H′),
4. (A, A′, D, D′, G, G′, H, H′).

In Figs. 3 and 4, we provide the process execution graphs of the four executions. It is simple to show that an execution graph is a legal flow graph for that execution. Nevertheless, from a single execution we may not be able to find all the available parallelism. In such a case, at this stage, we may have redundant edges that can be removed when combining the graph with other execution graphs over the same set of activities.

Fig. 3. Execution graphs for executions 1 and 2.
Fig. 4. Execution graphs for executions 3 and 4.

To handle noise we maintain a weight on each edge. The weight indicates how many executions are coherent with the edge. Thus, assuming that noise is relatively rare (e.g., less than 5%), edges with small weight can be filtered out.
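The frontier-based construction of Section 3.1.1 can be sketched as follows. Here an execution is the event list above, the position of an event in the list serves as its time stamp (an assumption that matches the example log), and all names are illustrative:

def build_execution_graph(events):
    """events: ordered (activity, kind) pairs, kind in {"ready", "finish"};
    the position of an event in the list is taken as its time stamp."""
    time = {ev: t for t, ev in enumerate(events)}
    finishes = [(t, a) for (a, k), t in time.items() if k == "finish"]
    readies = [(t, a) for (a, k), t in time.items() if k == "ready"]

    start, end = events[0][0], events[-1][0]
    edges = set()
    frontier = {start}                        # current frontier
    current = time[(start, "finish")]         # current time
    while current != time[(end, "finish")]:
        # 1. next time: first finish event after current time
        nxt = min(t for t, _ in finishes if t > current)
        # 2. new frontier: activities whose ready event lies in between
        new_frontier = {a for t, a in readies if current < t < nxt}
        # 3. connect every current-frontier node to every new node
        edges |= {(u, v) for u in frontier for v in new_frontier}
        # 4. new current frontier: activities finishing from next time up to
        #    the first ready event that follows it (if any)
        later = [t for t, _ in readies if t > nxt]
        bound = min(later) if later else float("inf")
        frontier = {a for t, a in finishes if nxt <= t < bound}
        current = max(t for t, a in finishes if a in frontier)
    return edges

# Execution 3 of the example log: (A, A', C, C', F, E, E', F', H, H')
ex3 = [("A", "ready"), ("A", "finish"), ("C", "ready"), ("C", "finish"),
       ("F", "ready"), ("E", "ready"), ("E", "finish"), ("F", "finish"),
       ("H", "ready"), ("H", "finish")]
# Yields A->C, C->E, C->F, E->H, F->H, matching Fig. 4.
print(sorted(build_execution_graph(ex3)))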

3.2. Combining execution graphs

In this step, we first combine execution graphs that have the same set of activities. We refer to the new graph as the reconstructed flow graph. During the scan of the events (in the step above) we also generated a set, fedges, of pairs that comprise the "forbidden" edges: if two activities overlap, the corresponding pair (forbidden edge) is in this set. For example, in execution 1 we can see that activity B overlaps each of C, D, and E.

For a given set of execution graphs over the same set of nodes, the reconstructed flow graph is a graph F = (V, E) such that:

- V is the set of activities comprising the nodes of the execution graphs.
- The set of edges E is generated by taking the union of all the edges in the execution graphs and removing the forbidden edges.

The type of concurrency represented by a forbidden edge always holds, since it is derived from a single execution; other forms of concurrency are derived from multiple executions.

To show that all the executions can still be generated from the reconstructed flow graph we use two observations: (i) by the assumption that all the executions over the same set of edges were generated from a single flow graph (the same set of parameters, i.e., values of the control functions, was used to select the graph), all the combined executions are consistent with a single partial order, so there are no cycles; (ii) if an edge was removed, then there is an execution, namely the one responsible for putting the edge into the forbidden set, that exhibits the relevant parallelism, and its continuation can be used to compensate for the removed edge.



Fig. 5. The resulting flow graph.

If the log does not contain all the possible executions over a set of activities, then the set of edges in the reconstructed flow graph may differ from that of the corresponding original legal flow graph, even though the two are compatible (both can be used for generating the given set of executions). Note that at this stage all the nodes are of type AND, as in the corresponding legal flow graph. In our running example, the reconstructed flow graph of executions 1 and 2 is given in Fig. 5. Since G and F overlap in execution 2, the pair {F, G} is in the forbidden set, and thus the edge from F to G of the first execution graph is not in the combined graph.

Next we merge all the reconstructed flow graphs to get the merged flow graph. Following this merge, some of the nodes are changed to have OR semantics. The merge is done as in the previous step, but the forbidden set is now empty. In our example, the graph in Fig. 5 does not change when the three graphs are merged; the only difference is that some of the nodes, like A, C, and B, now have OR semantics. It is simple to see that the graph is compatible with all the executions, and only a single edge (from B to G) is not in the original workflow graph. Condition NSL and the second part of condition CAL are kept during the merge, but this may not be the case for the first part of CAL. As a result, we may get cycles in the merged flow graph.
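Before turning to cycle breaking, the combining step just described admits a compact sketch (names assumed, edge sets hypothetical fragments for illustration):

def combine_execution_graphs(graphs, forbidden):
    """graphs: iterable of edge sets over the same activity set;
    forbidden: set of frozenset pairs whose life spans overlapped."""
    union = set().union(*graphs)
    return {(u, v) for (u, v) in union if frozenset((u, v)) not in forbidden}

# Running example: F and G overlap in execution 2, so the edge F->G
# contributed by the first execution graph is dropped.
g1 = {("A", "B"), ("F", "G")}       # hypothetical fragment of graph 1
g2 = {("A", "B")}                   # hypothetical fragment of graph 2
print(combine_execution_graphs([g1, g2], {frozenset(("F", "G"))}))
# -> {('A', 'B')}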

In the last step we break the strongly connected components in the merged flow graph to obtain the workflow graph. For a given graph G, we denote by V(G) the set of nodes in G, by E(G) the set of its edges, and by N(v), for v ∈ V(G), the set of neighboring nodes of v (on both incoming and outgoing edges).

3.2.1. Algorithm for breaking strongly connected components

Given a merged flow graph F = (V, E):

1. The nodes in each strongly connected component H in F are partitioned into four sets:
   M_h = {v ∈ V(H) | N(v) ⊆ V(H)}, i.e., all the adjacent nodes and edges are in the strongly connected component;
   B_h = {v ∈ V(H) | there exist x, y ∈ V(F) \ V(H) with (x, v), (v, y) ∈ E(F)}, i.e., nodes with at least one incoming edge and one outgoing edge that are not in the strongly connected component;
   I_h = {v ∈ V(H) \ B_h | there exists x ∈ V(F) \ V(H) with (x, v) ∈ E(F)};
   O_h = {v ∈ V(H) \ B_h | there exists x ∈ V(F) \ V(H) with (v, x) ∈ E(F)}.
   Then: (i) remove the edges in E(B_h) (the edges within B_h) and the edges going from B_h to I_h; (ii) remove the edges going from M_h to I_h.
2. The procedure is applied repeatedly until there are no strongly connected components.

Claim 2. Every execution in the log can be generated from the merged flow graph.

We next illustrate the algorithm for breaking strongly connected components. Initially the graph, presented in Fig. 6, has a strongly connected component of size 5 (nodes 3, 4, 8, 9, and 11), with M = {4, 8}, B = {3, 11}, I = {} and O = {9}; thus edge (3, 11) is removed. Following the first call to the procedure, the strongly connected component is reduced to nodes 3, 4, 8, and 9, with M = {4}, B = {3}, I = {8} and O = {9}; thus edge (3, 8) is removed. In the last call, the strongly connected component is reduced to nodes 3, 4, and 9, with M = {}, B = {3}, I = {4} and O = {9}; thus edge (3, 4) is removed. The result is the graph in Fig. 7.
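A sketch of the component-breaking procedure, written on top of networkx (an assumed dependency; the paper does not prescribe an implementation). Note that O is not needed for the removals themselves:

import networkx as nx

def break_sccs(G):
    """Repeatedly split every non-trivial strongly connected component H
    by partitioning its nodes into M, B, I (O is implicit) and removing
    (i) edges within B and from B to I, and (ii) edges from M to I."""
    while True:
        sccs = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
        if not sccs:
            return G
        for H in sccs:
            has_in = {v for v in H if any(u not in H for u in G.predecessors(v))}
            has_out = {v for v in H if any(w not in H for w in G.successors(v))}
            B = has_in & has_out
            I = has_in - B
            M = H - has_in - has_out          # all neighbours inside H
            drop = [(u, v) for u in B for v in (B | I) if G.has_edge(u, v)]
            drop += [(u, v) for u in M for v in I if G.has_edge(u, v)]
            G.remove_edges_from(drop)

On the component of Fig. 6, three rounds of this loop would remove edges (3, 11), (3, 8), and (3, 4), as in the worked illustration above.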


Fig. 6. A graph with a strongly connected component.

Fig. 7. Breaking a strongly connected component.

3.3. Discussion—comparing with the event-based approach

In our algorithm, the executions, as graphs, are consistently combined to get a workflow graph, and strongly connected components are handled such that no execution is lost. In contrast, in the algorithm used by [2] for the general case, all the edges in the transitive closure of each execution sequence are inserted into the graph, followed by a cleaning phase. Thus, as we later see in our simulation results, for a given workflow log the resulting graph of [2] has more excess edges than our solution. Another major difference between the two approaches is the notion of dependence. In the event-based approach, two activities can be dependent even when there are executions in which only one of them is present. We view dependence as causality and require that two dependent activities be present in the same executions and in a single order.

4. Simulation environment

For comparing the different algorithms, we wrote a process graph generator and used it to generate a process graph and a set of legal flow subgraphs. Executions were generated from the flow graphs to serve as input to the reconstruction algorithms.

4.1. Generating a process graph for the simulation

Business processes do not take an arbitrary shape; they are usually more structured than random graphs, and we use that knowledge to guide some decisions when generating workflow graphs for testing our algorithms. Every graph is a DAG with one source and one sink, and every node has at most four adjacent edges, which forces an out/in degree of 3 or less on each node. The edges are generated by randomly selecting a node, u, and then a second node, v. If there is an edge from u to v we remove it; otherwise we add an edge from u to v only if it complies with the restrictions on process graphs (e.g., the in/out degree is smaller than 3, and there is no path from v to u). The number of times this step is carried out is bounded by a number proportional to a polynomial in the number of nodes (the type of the graph). To complete the process graph, the source and target nodes are added, with edges from the source node to nodes with in-degree zero and edges from nodes with out-degree zero to the target node.


We use the process graph generator to generate three types of process graphs, with 10, 17, and 25 nodes (activities). For the experiment we used 50 graphs of each of the three types, giving a collection of 150 process graphs. For each of the 150 process graphs we generated a set of legal flow graphs. The number of flow graphs depends on the type of the graph: 5, 10, and 10 flow graphs are generated for process graphs of size 10, 17, and 25 nodes, respectively.

4.2. Generating a flow graph for the simulation

A legal flow graph is generated from each process graph with the following procedure (a code sketch appears at the end of this subsection):

1. Put the source node in a working set.
2. Select a node, v, from the working set.
3. Randomly select an outgoing edge from v and mark it. Each of the remaining outgoing edges from v is chosen with probability 0.4, and the chosen edges are marked. This guarantees that at least one outgoing edge from v is marked.
4. The destination nodes of marked edges are added to the working set if they were never there before.
5. Delete from the graph the outgoing edges from v that were not marked, and remove v from the working set.
6. Recursively remove from the graph nodes that have no incoming edge (other than the source node).
7. Repeat this procedure from Step 2 until the working set is empty.

The motivation for selecting each additional outgoing edge with probability 0.4 comes from the observation that many real-life flow graphs have an out-degree of 2; as a result, our simulated graphs are similar to real-life scenarios.

Due to the probabilistic nature of the procedure for generating flow graphs, some of the nodes and edges of the simulated process graph are never selected. We therefore take the simulated process model graph to be the union of all the flow graphs generated from a single process graph. This process model graph is fully covered by the flow graphs, and it is used for comparing the quality of the reconstruction algorithms.
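A simplified sketch of the sampling procedure above; the recursive pruning of step 6 is approximated here by only expanding nodes reached through marked edges, and all names are illustrative:

import random

def sample_flow(process_graph, source, keep_prob=0.4):
    """process_graph: dict node -> list of successor nodes (a DAG).
    Returns the set of marked edges of one sampled legal flow."""
    marked, seen, working = set(), {source}, [source]
    while working:
        v = working.pop()                                  # step 2
        out = process_graph.get(v, [])
        if not out:
            continue
        chosen = {random.choice(out)}                      # step 3: keep one edge
        chosen |= {w for w in out if random.random() < keep_prob}
        for w in chosen:                                   # step 4
            marked.add((v, w))
            if w not in seen:
                seen.add(w)
                working.append(w)
    return marked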

4.3. Generating a simulated execution log

For each flow graph we generate an execution log that contains random executions (lists of events). The number of executions per flow graph varied from 100 to 150,000. We observe that legacy systems that have been operating for many years store some hundreds of business process executions in a single day; large logs are also produced by systems that handle American stocks and by many other systems.

4.3.1. Generating an execution

1. Put in a working set a ready activity event corresponding to the beginning of the source node.
2. Randomly select an event from the working set.
3. If it is a ready activity event: remove it from the working set and add it to the log; add the respective finish event (of the same activity) to the working set. Else /* ending event */ (a) take the event out of the working set, put it in the log, and mark the successor nodes of the corresponding activity; (b) for each marked node, if all of its predecessors' finish events are in the log or in the working set, put its ready (starting) activity event in the working set and remove the marking.
4. Repeat (from Step 2) until the working set is empty.

A code sketch of this generator appears below.
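In this sketch, succ and pred give the successor and predecessor lists of a flow graph, and step 3(b)'s "in the log or in the working set" test is implemented literally; names are illustrative:

import random

def generate_execution(succ, pred, source):
    """Generate one random execution as a list of (activity, kind) events."""
    log, working = [], [(source, "ready")]
    released = {source}                 # activities whose ready event was issued

    def finish_issued(p):
        # step 3(b): predecessor's finish event is in the log or working set
        return (p, "finish") in log or (p, "finish") in working

    while working:
        act, kind = working.pop(random.randrange(len(working)))   # step 2
        log.append((act, kind))
        if kind == "ready":             # step 3: schedule the finish event
            working.append((act, "finish"))
        else:                           # ending event: try to release successors
            for s in succ.get(act, []):
                if s not in released and all(finish_issued(p)
                                             for p in pred.get(s, [])):
                    released.add(s)
                    working.append((s, "ready"))
    return log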

5. Experiment results

We tested three algorithms: our interval algorithm (interval sorted) described in Sections 3.1 and 3.2; a modified version (interval) of our algorithm, in which we merge all the execution graphs into one graph in a single step; and the non-interval algorithm, which is the second algorithm described in [2]. For the last algorithm we use the finish events to represent the executions of activities.

Fig. 8. Comparing the algorithms on logs from graphs with 10 nodes (recall and precision vs. log size, in executions).

Every value presented in the following histograms is the average over 50 runs of each algorithm. The quality of an algorithm is measured by comparing the graphs synthesized by the reconstruction algorithms with the target process graph (the reference) used for generating the logs of the experiments. The edges that are in the target graph but not in the synthesized graph are the absent edges; their percentage with respect to the target graph is the complement of the recall value. The edges that are in the synthesized graph but not in the target graph are the excess edges; their percentage with respect to the synthesized graph is the complement of the precision value. We next show, for each of the three algorithms, the average precision and recall values as a function of the log size.
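In symbols, writing E_syn for the edge set of a synthesized graph and E_ref for that of the target graph (notation introduced here for convenience, consistent with the definitions in Section 1.1):

\[
\text{precision} = \frac{|E_{\mathrm{syn}} \cap E_{\mathrm{ref}}|}{|E_{\mathrm{syn}}|},
\qquad
\text{recall} = \frac{|E_{\mathrm{syn}} \cap E_{\mathrm{ref}}|}{|E_{\mathrm{ref}}|}.
\]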

In Fig. 8, each input graph has 10 nodes. The average recall of each of the three algorithms is relatively stable and is independent of the log size. The interval sorted algorithm gives the best results. The average precision is very high and stable for both the interval and the interval sorted algorithms. The precision of the non-interval algorithm is more sensitive to the size of the log but is relatively high.

In Fig. 9, each input graph has 17 nodes. The average recall of each of the three algorithms is relatively stable and is independent of the log size. The interval sorted algorithm gives the best results (its recall is between 89 and 92%). The average precision is relatively high for the three algorithms and is not too sensitive to the log size.

In Fig. 10, each input graph has 25 nodes. The average recall and the average precision are relatively stable for the non-interval algorithm.

Fig. 9. Comparing the algorithms on logs from graphs with 17 nodes (recall and precision vs. log size, in executions).


Fig. 10. Comparing the algorithms on logs from graphs with 25 nodes (recall and precision vs. log size, in executions).

Fig. 11. Performance results (execution time vs. log size for 50-node graphs, and vs. graph size for 10,000 executions).

The recall of the interval sorted algorithm is sensitive to very small logs and improves when the logs contain more executions; it provides the best recall results. The average recall is relatively stable for the interval algorithm, yet its precision is somewhat sensitive to the log size. The interval algorithm gives the best precision results for logs with more than 1000 executions. In Fig. 11, we compare execution times versus log size and versus the size of the input graph.

6. Summary

We have presented a new approach for reconstructing a process model, as a graph, from audit trail logs. The results generated by the interval-based algorithms are very accurate and make our approach useful and practical for mining processes from given logs. The algorithms can be used to initiate the implementation of a workflow system in an existing and running enterprise (e.g., by discovering most of the business process topologies instead of starting the modeling from scratch). We defined the notion of flows to separate the executions of different scenarios that are derived from the same business process. Thus, we can process the executions of each flow separately (with the interval sorted algorithm) and get better results. We showed that our approach, based on activities' lifespan intervals and the separation of flows, is relatively efficient compared with other methods and results in a model that most resembles the original workflow. The algorithms have been implemented and tested on synthetic data, on a Pentium IV, 1.4 GHz, 1 GB RAM. The results from runs over various logs verify the usability and scalability of the proposed algorithms.


Both interval algorithms are more accurate (in resemblance to the original model) than the non-interval algorithm. One issue of concern is the recall value: all three algorithms miss some fraction of the original edges, and the size of this fraction does not depend on the number of executions (the size of the log); there is a threshold of about 90% even with very large logs. The policy for choosing which algorithm to use depends on the number of activities in the system and the size of the log.

Acknowledgements

The authors would like to thank the anonymous referees for their comments.

References

[1] A. Agostini, G. De Michelis, Improving flexibility of workflow management systems, in: W.M.P. van der Aalst, J. Desel, A. Oberweis (Eds.), Lecture Notes in Computer Science, vol. 1806, Springer-Verlag, Berlin, Germany, 2000, pp. 218–234.
[2] R. Agrawal, D. Gunopulos, F. Leymann, Mining process models from workflow logs, in: Proceedings of the Advances in Database Technology (EDBT'98), 6th International Conference on Extending Database Technology, Valencia, Spain, 23–27 March 1998, Lecture Notes in Computer Science, vol. 1377, Springer-Verlag, Berlin, 1998, pp. 469–483.
[3] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Workflow evolution, in: Proceedings of ER 96, Cottbus, Germany, pp. 438–455.
[4] J.E. Cook, A.L. Wolf, Software process validation: quantitatively measuring the correspondence of a process to a model, ACM Transactions on Software Engineering and Methodology 8 (2) (1999) 147–176.
[5] J.E. Cook, A.L. Wolf, Discovering models of software processes from event-data, ACM Transactions on Software Engineering and Methodology (1998) 215–249.
[6] J.E. Cook, A.L. Wolf, Event-based detection of concurrency, in: Proceedings of the Sixth International Symposium on the Foundations of Software Engineering (FSE-6), 1998, pp. 35–45.
[7] C.A. Ellis, K. Keddara, A workflow change is a workflow, in: W.M.P. van der Aalst, J. Desel, A. Oberweis (Eds.), Business Process Management: Models, Techniques, and Empirical Studies, Lecture Notes in Computer Science, vol. 1806, Springer-Verlag, Berlin, Germany, 2002, pp. 201–217.
[8] D. Georgakopoulos, M. Hornick, An overview of workflow management: from process modeling to workflow automation infrastructure, Distributed and Parallel Databases 3 (2) (1995) 119–153.
[9] J. Herbst, D. Karagiannis, An inductive approach to the acquisition and adaptation of workflow models, in: Proceedings of the IJCAI'99 Workshop on Intelligent Workflow and Process Management: The New Frontier for AI in Business, Stockholm, Sweden, 1999, pp. 52–57.
[10] J. Herbst, D. Karagiannis, Integrating machine learning and workflow management to support acquisition and adaptation of workflow models, International Journal of Intelligent Systems in Accounting, Finance and Management 9 (2000) 67–92.
[11] D. Hollingsworth, The workflow reference model, Technical report TC00-1003, issue 1.1, Workflow Management Coalition, UK, January 1995.
[12] M. Klein, C. Dellarocas, A. Bernstein, Towards adaptive workflow systems, in: Proceedings of the CSCW-98 Workshop Towards Adaptive Workflow Systems, Seattle, Washington, 1998.
[13] M. Klein, C. Dellarocas, A. Bernstein (Eds.), A knowledge-based approach to handling exceptions in workflow systems, Journal of Computer Supported Cooperative Work 9 (3–4) (2000) 399–412 (special issue).
[14] L. Maruster, A.J.M.M. Weijters, W.M.P. van der Aalst, A. van den Bosch, Process mining: discovering direct successors in process logs, in: Proceedings of the 5th International Conference on Discovery Science (Discovery Science 2002), Lecture Notes in Artificial Intelligence, vol. 2534, Springer-Verlag, Berlin, Germany, 2002, pp. 364–373.
[15] G. Schimm, Process miner—a tool for mining process schemes from event-based data, in: Proceedings of the 8th European Conference on Artificial Intelligence (JELIA), Lecture Notes in Computer Science, vol. 2424, Springer-Verlag, Berlin, 2002, p. 525.
[16] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, A.J.M.M. Weijters, Workflow mining: a survey of issues and approaches, Technical report, January 2003.
[17] W.M.P. van der Aalst, S. Jablonski, Dealing with workflow change: identification of issues and solutions, International Journal of Computer Systems, Science, and Engineering 15 (5) (2000) 267–276.
[18] W.M.P. van der Aalst, B.F. van Dongen, Discovering workflow performance models from timed logs, in: Y. Han, S. Tai, D. Wikarski (Eds.), Proceedings of the International Conference on Engineering and Deployment of Cooperative Information Systems (EDCIS 2002), Lecture Notes in Computer Science, vol. 2480, Springer-Verlag, Berlin, Germany, 2002, pp. 45–63.
[19] W.M.P. van der Aalst, K.M. van Hee, Workflow Management: Models, Methods, and Systems, MIT Press, Cambridge, MA, 2002.
[20] D. Wastell, P. White, P. Kawalek, A methodology for business process re-design: experiences and issues, Journal of Strategic Information Systems 3 (1) (1994) 23–40.
[21] A.J.M.M. Weijters, W.M.P. van der Aalst, Rediscovering workflow models from event-based data, in: Proceedings of the 11th Dutch–Belgian Conference on Machine Learning, 2001, pp. 93–100.
[22] Workflow Management Coalition, Interface 5—audit data specification, Technical report WFMC-TC-1015, issue 1.1, Workflow Management Coalition, UK, September 1998.

Shlomit S. Pinter is a Research Staff Member in the Systems and Software Department at the IBM Research Laboratory, Haifa, Israel, and an adjunct professor with the Computer Science Department at Haifa University. She received her PhD degree in computer science from Boston University, Boston, MA, in 1984, and her BSc and MSc degrees in computer science from the Technion, Israel Institute of Technology. Before joining IBM, she was a faculty member with the Department of Electrical Engineering at the Technion for 12 years, a Research Specialist with the Laboratory for Computer Science at MIT, and a visiting scholar at Yale and Stanford Universities. Dr. Pinter is a member of the editorial board of the International Journal of Parallel Programming and has published extensively in international journals and conference proceedings. Her research interests include compilation techniques for advanced computer architectures, program analysis and parallelization, and distributed and parallel computing.

Mati Golani has been a research staff member in the IBM research laboratories in Israel since 2000. He works in the Active Middleware Technologies department on the development of new technologies and tooling, and holds a master's degree in information management engineering from the Technion—Israel Institute of Technology (1999). He is currently also a PhD student in the Information Systems department at the Technion, where his research focuses on dynamic workflow management. Prior to joining IBM, he provided consulting and integration services for the industry in various projects, especially in ERP environments. His research interests are rule management systems, databases, software design, and information systems analysis methodologies.