Interactive mining and retrieval from process traces


Accepted Manuscript

Interactive mining and retrieval from process traces

Alessio Bottrighi, Luca Canensi, Giorgio Leonardi, Stefania Montani, Paolo Terenziani

PII: S0957-4174(18)30346-4
DOI: 10.1016/j.eswa.2018.05.041
Reference: ESWA 11994

To appear in: Expert Systems With Applications

Received date: 20 February 2018
Revised date: 11 May 2018
Accepted date: 31 May 2018

Please cite this article as: Alessio Bottrighi, Luca Canensi, Giorgio Leonardi, Stefania Montani, Paolo Terenziani, Interactive mining and retrieval from process traces, Expert Systems With Applications (2018), doi: 10.1016/j.eswa.2018.05.041

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• We combine process model construction, path retrieval from model and trace retrieval
• Model construction aims at balancing overfitting vs underfitting
• Model construction is interactive and incremental
• We allow path/trace retrieval in response to partially specified patterns
• Trace retrieval is made efficient by using an index


Interactive mining and retrieval from process traces


Alessio Bottrighi(a), Luca Canensi(b), Giorgio Leonardi(a), Stefania Montani(a), Paolo Terenziani(a)

(a) Computer Science Institute, DISIT, Università del Piemonte Orientale, viale Teresa Michel 11, Alessandria, 15121, Italy
(b) Department of Computer Science, Università di Torino, Via Pessinetto, 12, Torino, 10149, Italy


Abstract


The traces of past process executions are maintained in many contexts, since they constitute a strategic source of information. Different tasks on such data can be supported. In particular, we focus on process model discovery, by proposing an approach that helps the analyst in identifying a good balance between overfitting and underfitting. To achieve such a goal, we have designed SIM (Semantic Interactive Miner), an innovative interactive and incremental tool, which starts from a non-generalized model, and provides the user with a path retrieval facility to analyse the current model, and with semantic abstractions to build increasingly more generalized models (through the selective merging of retrieved paths). Additionally, the tool exploits the path retrieval facility and an indexing strategy to support efficient trace retrieval. As a consequence, our framework represents the first literature contribution able to integrate in a synergic approach process model discovery, path retrieval, and trace retrieval. We experimentally compare our tool to two well-known process mining algorithms, namely inductive miner (Leemans et al., 2013) and heuristic miner (Weijters et al., 2006). The comparison highlights the main innovative aspect of our approach, i.e., its ability to facilitate the analyst in directly using her/his domain knowledge to lead process model discovery, a feature that can be extremely advantageous in knowledge-rich applications, such as the medical ones.

Email addresses: [email protected] (Alessio Bottrighi), [email protected] (Luca Canensi), [email protected] (Giorgio Leonardi), [email protected] (Stefania Montani), [email protected] (Paolo Terenziani)

Preprint submitted to Elsevier

May 31, 2018

Keywords: Business Process Management, Information Search and Retrieval, Knowledge representation and reasoning

1. Introduction


Many commercial information systems and enterprise resource planning tools, like those provided by, e.g., Oracle and SAP, record information about the executed process instances in the form of an event log. The event log stores the sequences (traces henceforth) of actions that have been completed at the organization. These traces constitute a rich source of experiential knowledge, fundamental to support different tasks such as: (1) process model discovery; (2) path retrieval from the process model; (3) trace retrieval. The main focus of our approach is (1) above: the discovery of a process model from the event log is a key activity, which allows analysts to understand the procedures actually implemented at a given organization and to analyze them, e.g., by identifying bottlenecks, differences with respect to an existing gold standard, and non-compliances with business rules, with the aim of improving their correctness and efficiency in practice. Such a task has been widely pursued by the process mining research area (van der Aalst, 2016). In van der Aalst (2016) five families of process model discovery approaches have been identified (see Section 2). Roughly speaking, our work belongs to the family termed “two-step approaches”, which has the paper in van der Aalst et al. (2010) as its main representative. As in van der Aalst et al. (2010), our approach focuses on the critical issue of identifying a “good” balance between underfitting and overfitting in the mined process model. Given an event log L and a model M, M is overfitting L if M does not generalize and is sensitive to particularities in L. In complex processes, it is unlikely that the instances in the log cover all possible executions, so that a certain degree of generalization is important, to allow for more behavior than the one recorded in L.
On the other hand, too much generalization can lead to a model M underfitting the log L, in the sense that M allows for “too much behavior” that is not supported by L (van der Aalst et al., 2010). Indeed, “the main problem is to find the balance between overfitting and underfitting” (van der Aalst et al., 2010), while “none of the existing techniques enables the user to control the balance” (van der Aalst et al., 2010). The work in van der Aalst et al. (2010) proposes a two-step approach in which first a “low-level


model” (called “transition system”) is constructed, and then it is converted into a “high-level model” that can express more advanced control-flow patterns. To support users in the identification of a “good” balance, the authors propose a configurable approach for the first step, so that different abstraction techniques can be applied, providing a set of “low-level models”, which are evaluated by the analyst, who can use her/his knowledge to choose the one that best fits the balance. One of the main innovative aspects of such an approach is that the analyst plays an active role in controlling the level and nature of the abstraction: the user can choose between three types of syntactic (our terminology) abstractions (i.e., maximal horizon, maximum number of filtered events, and sequence/bag/sets) and two types of semantic abstractions (filter and visible activity). The former are independent of the specific actions (but consider their number, their order, and so on), while the latter are semantic, in the sense that the analyst can indicate specific actions to be disregarded from states or transitions. In our work, we move further in the direction started by van der Aalst et al. (2010), giving the analyst additional power and control: our tool, named SIM (Semantic Interactive Miner), provides the analyst with a way of flexibly interacting with the process model construction (instead of just choosing between models autonomously built by the system), directly exploiting her/his domain (semantic) knowledge, in a synergic approach where also trace retrieval and path retrieval can be considered and adopted. In a nutshell, our approach is based on three main ideas:


• Starting from a model with precision = 1 (i.e., a model whose paths correspond to at least one trace in the log), and replay-fitness = 1 (i.e., every trace can be replayed in the model with no errors) (Buijs et al., 2012). This is justified by the fact that, in certain domains (e.g., in some medical applications), a maximal precision or replay-fitness might be required. We start with such a non-generalized model, and let the analyst drive the generalizations (which, in most cases, correspond to a loss of precision and/or replay-fitness).

• Incremental and selective application of abstractions. To facilitate the analysts in the task of determining a suitable balance between overfitting and underfitting, we provide them with the possibility of operating step-by-step; in such a way, the final model can be built by applying different abstraction steps, each one leading to a progressively more generalized version. At each step, the analyst may evaluate


the result (also considering quantitative parameters such as precision and replay-fitness), and approve it, or move to a further abstraction (if more generalizations are desirable), or even backtrack to some previous version (e.g., if the latest abstractions have led to an underfitting model). Additionally, we provide analysts with the possibility of applying abstractions in a selective way (i.e., the analyst can decide which occurrences of a selected path to abstract);


• Semantic abstraction by merging paths. To give the analyst the possibility of exploiting her/his knowledge, we focus on “semantic” abstraction facilities (i.e., abstractions that may focus on specific actions in the domain). In particular, we explore one of them, which is of paramount importance in many domains and applications: generalizing the model by merging the instances of some paths occurring in it.


To facilitate the semantic analysis of the models being (incrementally) built, as a support to the merging abstraction we provide the analyst with an advanced path retrieval facility, helping her/him to retrieve significant patterns of actions within the model (which might be the object of future merge operations). Last, but not least, we enrich our initial (precision 1) model to operate as an index of the log traces, and we exploit the path retrieval facility also as a support for trace retrieval, able to efficiently return all the traces that conform to an input (possibly partially specified) pattern. As a consequence, our framework represents the first literature contribution able to integrate in a synergic approach process model discovery, path retrieval, and trace retrieval. The paper is organized as follows: in Section 2 we introduce the existing literature; in Section 3 we describe the SIM architecture and overall behavior. Then, we organize our presentation in a bottom-up fashion. First, we describe the different components that we integrate in our synergic approach. Specifically, in Section 4 we present the algorithm to construct our initial model from the log. In Section 5 we illustrate our path/trace retrieval approach. In Section 6, we detail our support to the generalization of models through semantic abstraction by merging paths, and in Section 7 we illustrate the facilities we provide to support interactive sessions of work. Then, in Section 8, we propose an experimental evaluation, in which we show how the different components are integrated in order to support flexible,


incremental and interactive model discovery, driven by the analysts’ knowledge and analysis, and we compare our approach with two outstanding ones in the literature, i.e., inductive miner (Leemans et al., 2013) and heuristic miner (Weijters et al., 2006). Notably, a top-down reading of the paper is also possible, starting from Section 8 to grasp the overall behaviour of our process discovery approach, and then moving back to Sections 4-7 to check the details of the different components. Finally, Section 9 is devoted to an extensive and accurate comparison to related work, and Section 10 addresses our concluding remarks.


2. Related work


The discipline of process mining (van der Aalst, 2016) has proven to be capable of providing deep insights into process-related problems that contemporary enterprises face. Within process mining, several contributions have been described to support process model discovery (van der Aalst, 2016). In particular, in van der Aalst (2016), five main families of approaches have been identified. (a) The first family includes direct approaches, such as the alpha-algorithm (van der Aalst & van Dongen, 2002) and its evolutions like, e.g., the heuristic miner (Weijters et al., 2006), that extract some footprint from the event log and use this footprint to directly construct a process model. (b) The second family of approaches uses a two-step procedure in which first a “low-level model” is constructed, then the low-level model is converted into a “high-level model” that can express more advanced control-flow patterns. An example of such an approach is described in van der Aalst et al. (2010). (c) The third family breaks the problem into smaller problems. The inductive miner (Leemans et al., 2013), for instance, splits the event log recursively into sublogs. The sublogs are decomposed until they refer to a single activity. The way the log is decomposed provides a structured process model. (d) Techniques originating from the field of computational intelligence form the basis for the fourth family. The genetic process mining approach described in Bratosin et al. (2010), which exploits evolutionary computation, is an example. Another example exploits planning techniques and is described in Haigh & Yaman (2011). (e) The fifth family focuses on rules or frequent patterns. In declarative process model mining (Maggi et al., 2013), in particular, rules instead of procedural processes are extracted. All rules are explicitly supported by some trace; thus, this approach is context-aware (as is our approach, see Section 4.2).


In the case of very large event logs, process model discovery can also be sped up by introducing techniques that pass over the log only once, and quickly evaluate the model quality by applying a divide-and-conquer strategy (Leemans et al., 2018). Other approaches (see, e.g., de Leoni et al. (2016)) also make it possible to reconstruct process states from the log and visualise them in succession, leading to an animated history of the process. Specific applications of process mining in healthcare are receiving more and more attention in recent years (see, e.g., Rojas et al. (2016); Mans et al. (2009, 2008); Perimal-Lewis et al. (2012)). This application domain indeed presents particular challenges and issues (Mans et al., 2013), mostly related to the fact that different types of patients exhibit different characteristics, that the patient’s state dynamically evolves, and that different hospital settings may have different resource constraints. All these peculiarities lead to the need to properly contextualize medical processes and/or process patterns. The survey in Rojas et al. (2016) demonstrates that medical process mining is primarily applied in the control flow discovery perspective. However, there is increasing attention to the issue of conformance checking (Munoz-Gama, 2016; Spiotta et al., 2017), another area of process mining. Conformance checking techniques are a way to visualize the differences between the assumed process represented in the model and the real process expressed by the traces in the event log, pinpointing possible problems to address (Munoz-Gama, 2016). In medicine, conformance checking facilitates the analysis of processes for verifying their compliance with best practices (as dictated, e.g., by clinical guidelines) and with other “basic” medical knowledge (Spiotta et al., 2017).
Once possible deviations have been discovered by conformance checking techniques, it is possible to repair the process model with respect to the event log, such that the resulting model can replay the log. The approach in Fahland & van der Aalst (2012), for instance, decomposes the log into several sublogs of non-conformant traces. For each sublog, a subprocess is derived and then added to the original model at the appropriate location. In case of an entirely unfitting model, the old model is effectively replaced by a rediscovered model. The issue of trace retrieval, on the other hand, has been recently considered in the Case-Based Reasoning (CBR) (Aamodt & Plaza, 1994) literature. In CBR, traces are exploited as sources for retrieving and reusing users’ experience. The work in Cordier et al. (2013), for instance, proposes trace-based reasoning, a CBR approach where cases are not explicitly stored in a library,


but are implicitly recorded as “episodes” within traces. The works in Huang et al. (2013) and Montani & Leonardi (2014) propose two trace retrieval approaches, where different metrics, based on extensions of the edit distance, are exploited. However, no indexing strategy is provided to make retrieval faster. Finally, path retrieval has been partially treated in the area of process querying, aiming at retrieving models from large business process model repositories, typically for supporting a new model definition as an adaptation of a similar existing one. The work in Hornung et al. (2008), for instance, proposes a textual query interface to search for process models or fragments within a repository. A very interesting contribution in this field is the work in Jin et al. (2013): in this approach, the user is allowed to express a query in the form of a subgraph, with possibly incomplete control flow relations. The query is then used as a filter, to pre-select only those process models where the query actions can be identified. Among the filtered models, the ones actually answering the query are finally extracted, by exploiting a subgraph isomorphism approach. Other approaches in the area of process querying, such as the ones in La Rosa et al. (2013); Dijkman et al. (2009); Montani et al. (2015), on the other hand, do not consider paths/subgraphs, but use the whole process model as a query.


3. System architecture

Our system architecture consists of three main components (see Fig. 1):


• a mining module (“mine” in Fig. 1) to build an initial process model that perfectly adheres to the event log input traces, granting precision = 1 and replay-fitness = 1 (Buijs et al., 2012), and that provides an explicit indexing to traces;


• a trace and path retrieval module (“retrieve” in Fig. 1) which searches for patterns on process models (not just the initial one, but also more abstract models, see below); such a module relies on a flexible and high-level query language, which supports the definition of partially specified patterns; • a general abstraction module (“merge” in Fig. 1), which supports the merging of (possibly user-selected) patterns, to define more abstract (but less precise) models. 8



Figure 1: General architecture of the approach


In Fig. 1, we show how such modules can be used in order to support analysts in the discovery of a process model with a good balance between overfitting and underfitting, through incremental steps of analysis (dotted box on the right of the figure) and abstraction (dotted box on the left). As a starting step, given a log, the mining module builds the initial process model, called the log-tree. The log-tree is already a process model. However, it is a possibly overfitting one, in that it perfectly adheres to the input traces (since all its paths correspond to at least one trace in the log). Thus, users may want to analyse it, and to apply abstractions to generalize the model. The retrieval module supports the analysis: the analysts can give as input a (possibly partially specified) pattern of actions, and the retrieval module outputs all the occurrences of such a pattern in the model (termed “paths” in Fig. 1). On the other hand, the merge module supports abstraction: it takes as input a set of paths in the model (as retrieved in the analysis step), and merges them into a unique occurrence (notably, merge is “selective”, in the sense that the analyst may choose to merge only part of the paths), thus generating a more abstract model. This process can be iterated back and forth (in the sense that we also provide analysts with facilities to backtrack to previous


less abstracted models), until the analyst is satisfied with the current balance between overfitting and underfitting (see a concrete example of interleaved analysis and abstraction steps in model discovery in Section 8). Finally, even if such an aspect is not explicitly depicted in Fig. 1, the retrieval module, when applied to the log-tree, can be exploited for trace retrieval, to return all the traces satisfying any pattern provided as input by the user. To this end, it exploits the log-tree as an index, as we will show in Section 5.1. While the overall procedure of interactively and incrementally discovering a process model with a good balance between overfitting and underfitting is demonstrated experimentally in Section 8, in the following sections we proceed bottom-up, by describing the different modules in technical detail.

4. Mining the log-tree

In this section, we describe how the log-tree, the starting point of our process model discovery approach, can be mined, and we discuss its properties.


4.1. Representation formalism and semantics

Our mining facility takes as input the event log, and outputs the log-tree, a tree whose nodes represent actions, and whose arcs represent a precedence relation between them. Every node can represent a single action, or a set of actions. In the latter case, the interpretation is that the actions in the set can be executed in any order, and the node will also be referred to as an any-order node henceforth. When multiple arcs exit a node, the node itself represents a XOR splitting point. Specifically, in the log-tree, each node is represented as a pair ⟨P, T⟩, where:


• P denotes a (possibly singleton) set of actions; actions in the same node are in the any-order relation. Thus each path from the root of the tree to a given node N denotes a set of possible process patterns (called support patterns of N henceforth), obtained by following the order represented by the arcs in the path to visit the log-tree, and ordering in every possible way the actions in each any-order node (for instance, the path {A, B} → {C} represents the support patterns “ABC” and “BAC”).

• T represents a set of pointers to all and only those traces in the log whose prefixes exactly match the path from the root to one of the patterns in P (called support traces henceforth). In the case of an


any-order node, T is composed of several sets of support traces, each one corresponding to a possible action permutation. T is typically a subset of the event log. Specifically, every time a XOR splitting point is encountered, the support traces are properly split among the various alternative paths. Every edge connecting a node A to another node B also stores the edge frequency of the sequence A > B (A immediately preceding B), defined in the next subsection.
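To make the ⟨P, T⟩ structure concrete, the following is a minimal Python sketch (with an encoding of our own choosing, not SIM’s actual code) of an any-order node whose support-trace pointers are partitioned by permutation:

```python
# Hypothetical encoding of a log-tree node <P, T>.
# P: a set of actions executed in any order; T: support-trace pointers
# (here, trace indices in the log), one set per observed permutation of P.
node = {
    "P": {"Blood test", "Coagulation screening", "ECG"},   # any-order node
    "T": {
        ("Blood test", "Coagulation screening", "ECG"): {0, 3, 7},
        ("ECG", "Blood test", "Coagulation screening"): {1, 5},
        # ... one entry per permutation actually observed in the log
    },
}

# The union of the per-permutation sets gives all support traces of the node.
all_support = set().union(*node["T"].values())
print(sorted(all_support))  # [0, 1, 3, 5, 7]
```

Under this encoding, a XOR split simply partitions a node’s support set among its children, which is what makes the log-tree usable as a trace index (Property 1 below).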



4.2. Data structure and algorithm

The input event log is stored as a matrix with n rows and m columns, where n is the number of traces in the log and m is the maximum length of these traces. Each cell Matrix[i, j] contains the j-th action of trace i. Actions in the different traces are aligned on the basis of their order of execution (i.e., the j index). All traces start with a dummy common action #. Algorithm 1 builds the log-tree. The function Build-Tree in Algorithm 1 takes as input a variable index, representing a given position in the traces (i.e., a column in the input matrix), and a node. Initially, it is called on the first position, and on the root of the tree (which is a dummy node, corresponding to the # action; thus, initially, index = 0, P = {#} and T is the set of all the traces).
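The matrix layout described above can be sketched in a few lines of Python (our illustration; names and the padding value are assumptions, not part of SIM):

```python
def log_to_matrix(traces):
    """Store the event log as an n x m matrix: row i = trace i, column j =
    j-th action; every trace starts with the dummy action '#', and shorter
    traces are padded with None so all rows have the same length."""
    m = max(len(t) for t in traces) + 1  # +1 for the dummy '#'
    return [["#"] + t + [None] * (m - len(t) - 1) for t in traces]

matrix = log_to_matrix([["A", "B", "C"], ["A", "D"]])
print(matrix)  # [['#', 'A', 'B', 'C'], ['#', 'A', 'D', None]]
```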


Algorithm 1: Build-Tree pseudocode
1 Build-Tree(index, ⟨P, T⟩)
2   nextP ← getNext(index + 1, T, α)
3   if nextP not empty then
4     nextActions ← XORvsANY(nextP, T, β)
5     foreach node ⟨P', T'⟩ ∈ nextActions do
6       AppendSon(⟨P', T'⟩, ⟨P, T⟩)
7       Build-Tree(index + |P'|, ⟨P', T'⟩)
8     end
9   end
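As an illustration of the recursion, here is a heavily simplified Python sketch of Build-Tree in which every node holds a single action (i.e., no any-order detection, and α = 0, so every observed next action is kept); under these assumptions the log-tree reduces to a prefix tree over the traces. All names are ours, not SIM’s:

```python
from collections import defaultdict

def build_tree(traces, index=0):
    """Minimal log-tree sketch: recursively group traces by their next
    action. Each node keeps its support traces (cf. Property 1); grouping
    by distinct next actions realizes the XOR split of the support set."""
    node = {"actions": None, "support": traces, "children": []}
    groups = defaultdict(list)          # next action -> its support traces
    for t in traces:
        if index < len(t):
            groups[t[index]].append(t)  # traces are partitioned (XOR split)
    for action, support in groups.items():
        child = build_tree(support, index + 1)
        child["actions"] = {action}
        node["children"].append(child)
    return node

# Toy log: '#' is the dummy start action of the paper's matrix encoding.
log = [["#", "A", "B", "C"], ["#", "A", "B", "D"], ["#", "E", "F"]]
root = build_tree(log, index=1)         # index 0 holds the dummy '#'
print([sorted(c["actions"]) for c in root["children"]])  # [['A'], ['E']]
```

The real algorithm additionally calls getNext (to drop rare successors via the α threshold) and XORvsANY (to merge successors into any-order nodes via the β threshold), as detailed next.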

The function getNext inspects the traces in T to find all possible next actions. At this stage, “rare” patterns can be ruled out. Specifically, if


P = {A}, and B is a possible next action, B will be provided in output by getNext only if the edge frequency EF of the sequence A > B is above a user-defined threshold α, where:

$$EF(A > B) = \frac{|A > B|}{|T_A|} \qquad (1)$$

where |A > B| is the number of traces in T in which A is immediately followed by B (i.e., the cardinality of the support traces of B), and |T_A| is the cardinality of the support traces of A in T. Setting α > 0 allows ruling out noisy patterns. Note that, if rare/noisy patterns are ruled out, the resulting log-tree is not guaranteed to still have precision = 1 and replay-fitness = 1; however, in this paper, we will set α = 0, and thus consider all the traces in the initial model construction. On the remaining next actions, the function XORvsANY applies formula (2) below in order to identify which actions are in any-order and which are in XOR relation. We calculate the dependency frequency A → B between every action pair ⟨A, B⟩ in nextP × nextP:

$$A \rightarrow B = \frac{1}{2}\left(\frac{|A > B|}{\sum_{X \in Act_T} |A > X|} + \frac{|A > B|}{\sum_{Y \in Act_T} |Y > B|}\right) \qquad (2)$$
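Formula (1) can be computed directly from the traces; the sketch below (our helper names, trace-level counts as in the paper) illustrates it:

```python
def edge_frequency(traces, a, b):
    """Formula (1) sketch: EF(A > B) = |A > B| / |T_A|, where both counts
    are numbers of traces (not occurrences)."""
    follows = sum(  # traces where a is immediately followed by b
        any(t[i] == a and t[i + 1] == b for i in range(len(t) - 1))
        for t in traces
    )
    support_a = sum(a in t for t in traces)  # |T_A|: traces containing a
    return follows / support_a if support_a else 0.0

traces = [["A", "B"], ["A", "B"], ["A", "C"], ["B", "A"]]
print(edge_frequency(traces, "A", "B"))  # 2 of the 4 traces containing A -> 0.5
```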


where, always considering the traces in T, |A > B| is the number of traces in which A is immediately followed by B, |A > X| is the number of traces in which A is immediately followed by some action X (with X ∈ Act_T, where Act_T is the set of all the actions appearing in the traces in T), and |Y > B| is the number of traces in which B is immediately preceded by some action Y (with Y ∈ Act_T). The rationale of the formula is to consider the relative frequency of the traces where an action A is immediately followed by an action B, with respect to other patterns, where A is followed by a generic action, or B is preceded by a generic action, to assess whether the pattern A → B is strongly represented in the event log, and thus supported by the data. After evaluating the dependency frequency values A → B and B → A, we can have three possible situations:

• if both values are below a given (user-defined) threshold β, this means that A and B rarely appear in the same trace; therefore they are in XOR relation;

• if A → B is above the threshold and B → A is below, then A precedes B, and vice versa;


• if both values are above the threshold, then A and B are in the any order relation.
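The three-case decision above can be sketched as follows (our own helper names and an arbitrary β value, for illustration only; counts are trace-level, as in formula (2)):

```python
def dependency_frequency(traces, a, b, actions):
    """Formula (2) sketch: mean of |A>B| normalized by all successors of A
    and by all predecessors of B."""
    def follows(x, y):  # number of traces where x is immediately followed by y
        return sum(
            any(t[i] == x and t[i + 1] == y for i in range(len(t) - 1))
            for t in traces
        )
    ab = follows(a, b)
    out_a = sum(follows(a, x) for x in actions)  # A followed by any action
    in_b = sum(follows(y, b) for y in actions)   # B preceded by any action
    return 0.5 * ((ab / out_a if out_a else 0) + (ab / in_b if in_b else 0))

def classify(traces, a, b, actions, beta=0.4):
    """Apply the three cases with a user-defined threshold beta."""
    ab = dependency_frequency(traces, a, b, actions)
    ba = dependency_frequency(traces, b, a, actions)
    if ab < beta and ba < beta:
        return "XOR"
    if ab >= beta and ba >= beta:
        return "any-order"
    return "A precedes B" if ab >= beta else "B precedes A"

acts = {"A", "B", "C"}
traces = [["A", "B", "C"], ["B", "A", "C"]]
print(classify(traces, "A", "B", acts))  # A and B occur in both orders -> "any-order"
```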


The output nextActions of the function XORvsANY is a set of nodes ⟨P', T'⟩, one for each maximal set P' of actions in any order. For each one of such sets P', the corresponding set T' of support traces is also computed. More precisely, T' is composed of several sets of support traces, each one corresponding to a possible permutation of the actions in P'. Finally, each new node is appended to the log-tree (function AppendSon), and Build-Tree is recursively applied to each new node (with the parameter index properly set). A posteriori, we create a dummy node $, and connect all the leaves to it. The log-tree can be seen as a basic process model¹.

¹ Indeed, citing van der Aalst (2016), “The goal of a process model is to decide which activities need to be executed and in what order”, and the log-tree is able to represent this information.

Example 1. Given our previous experience in the area, we consider a running example from the medical domain of stroke management. Indeed, in several medical contexts the trade-off between overfitting and underfitting is crucial, and domain experts (e.g., physicians) can exploit their domain knowledge to explore such a trade-off. In Fig. 2 we show the output of the application of Algorithm 1 on a real world log (details of our graphical interface are discussed in Section 7). The log contained 200 traces, expressed as sequences of 10 actions on average, counting 12 different action types. Note that the log was properly filtered, and the algorithm thresholds were properly set, in order to obtain a compact and readable log-tree for our running example.

Figure 2: The log-tree representation

As can be observed from the log-tree in the figure, the first tests (Blood test, Coagulation screening and Electro CardioGram (ECG)) are executed on all patients. Then, Brain Computed Tomography (Brain CT) is requested, before neurological evaluation. After the latter evaluation, the log-tree splits into three main branches (1 to 3 from left to right). Branch 1, as an example, starts with the administration of TPA ev (intra-venous anti-thrombotic), continues with another brain CT and an evaluation by the surgeon. Some patients can then be treated, after further exams (Ecodoppler (EcoDDS), Angiographic Magnetic Resonance (AngioRMN) and Angiographic Computerized Tomography (AngioCT)), with TPA ia (intra-arterial anti-thrombotic), and/or with Recanalization, to fully restore the complete functionality of the blood vessels. It is worth noting that the first arrival tests (Blood test, Coagulation screening and Electro CardioGram) must all be executed, but their ordering is not relevant. This situation is well captured by the any-order node offered by our formalism for representing the log-tree.

By construction, the properties below hold for the log-tree.

Property 1: Trace Indexing. Each path (and sub-path) in the log-tree indexes all and only its support traces in the log.

As a consequence of Property 1, Property 2 also holds.

Property 2: Precision and replay-fitness. For each path in the log-tree there is at least one support trace in the log. By construction, every trace can be replayed in the log-tree with no errors. Thus, the precision and the replay-fitness of the log-tree are both 1 (Buijs et al., 2012).

Finally, Property 3 also holds.

Property 3: Context awareness. Since each node is a pair ⟨P, T⟩ where T maintains the support traces, the log-tree is context aware; indeed, the support traces of each alternative path implicitly define the execution


context of the corresponding model branch.

5. Trace and path retrieval


As described in Section 3, in our architecture a single module implements both trace retrieval, operating on the log-tree, and path retrieval, operating on the log-tree or on a more abstract process model. Both tasks are provided by means of a common query language (illustrated in Section 5.1), and by a retrieval procedure (presented in Section 5.2), articulated into three steps: automaton generation (see Section 5.2.1), search (see Section 5.2.3), and filtering (see Section 5.2.4).


5.1. Query language

We provide a unified query language to allow users to search for paths or traces. In Fig. 3 we formally present the extended BNF grammar describing it². Variables are contained between ⟨ and ⟩, terminal symbols are in bold, and tokens are underlined. The { and } symbols denote optionality. We distinguish between two main types of queries: queries asking for the paths in the model satisfying a pattern (PathQuery), and queries asking for the traces satisfying a pattern (TraceQuery). The former just specify a query pattern (QPattern), while the latter also include the keyword :traces to specify that all the traces satisfying the pattern should be reported in output (lines 1-3 in Fig. 3). A QPattern is a list of terms (TermList) preceded and followed by two special symbols (# and $ respectively - line 4). A TermList (line 5) is a list of one or more terms (Term), each one optionally preceded and/or followed by a delay (Delay). A delay is specified by a pair of natural numbers (n1, n2), where n1 ≤ n2 (line 6). Delays are used to specify parts of the pattern which are not relevant. For instance, a delay (2,4) means that at that point of the pattern, one expects to have between two and four actions (and does not care about what such actions are). We deal with two types of terms: actions (action) or actions in any order (AnyOrder) - line 7. AnyOrder is specified as a set of actions separated by the keyword &, and represents any possible permutation of the actions in the set (line 8).

² In this section, we provide only the syntax of our query language; the semantics is provided in Appendix A.


Figure 3: BNF of the query language


Notably, queries in our approach are “intensional”, since they express in a compact (implicit) way a set of explicit “ground” queries (i.e., a set of fully instantiated patterns). Each explicit (ground) query can be obtained by choosing one of the possible permutations of the actions to be executed in any order, selecting a specific delay value in every given range, and substituting the delay itself with as many actions as the delay value.

Example 2 Referring to the stroke application domain, a physician may issue the query:

Q1: #(5, 8) EcoDDS&AngioCT&AngioRMN TPAia Recanalization (0, 3)$

looking for three exams to be executed in any order (namely EcoDDS, AngioCT and AngioRMN), followed by the sequence of two additional atomic actions (TPAia and Recanalization), which can be placed after 5 to 8 actions one does not care about, and which can be followed by up to 3 actions one does not care about. Given N as the cardinality of the set of actions existing in our medical domain, and given that 6 permutations have to be considered, the query illustrated above corresponds to (N⁵ + N⁶ + N⁷ + N⁸) ∗ 6 ∗ (1 + N + N² + N³) totally explicit queries. 
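As a back-of-the-envelope check of the expansion above, the following minimal sketch counts the ground queries encoded by query Q1 for a given alphabet size N (the helper function and the sample alphabet size are illustrative, not part of the system):

```python
from math import factorial

# Each delay (n1, n2) expands to sum_{k=n1}^{n2} N**k wildcard fillings; an
# any-order term over m actions expands to factorial(m) permutations.

def ground_query_count(n_actions, delay1=(5, 8), any_order_len=3, delay2=(0, 3)):
    d1 = sum(n_actions ** k for k in range(delay1[0], delay1[1] + 1))
    d2 = sum(n_actions ** k for k in range(delay2[0], delay2[1] + 1))
    return d1 * factorial(any_order_len) * d2

# With N = 10: (10^5 + 10^6 + 10^7 + 10^8) * 6 * (1 + 10 + 10^2 + 10^3)
print(ground_query_count(10))
```

Even for a modest alphabet, the count shows why answering the intensional query directly, rather than its ground expansions one by one, is essential.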


5.2. Retrieval

5.2.1. Automaton generation

In our approach to query answering, we first generate a Nondeterministic Finite Automaton (NFA), which represents the query. To build the NFA, we have adapted the Thompson construction algorithm (Thompson, 1968). During the construction, every state in the NFA is labeled by the corresponding Delay or Term, since this information will be used by the search step


(see section 5.2.3). We introduce a dummy symbol ∗ to represent any action, and we manage it directly as a new symbol, which matches any possible action. Then, in order to optimize efficiency, we keep the dummy symbol ∗ in the NFA: a transition labeled ∗ automatically fires on any input action. The cost of NFA generation is O(m) (Thompson, 1968), where m is the length (including operators) of the regular expression corresponding to the query.
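A deterministic sketch of the wildcard mechanism, simplified from the Thompson-style construction described above for the case of a single delay term (all names are illustrative):

```python
# The dummy symbol '*' labels transitions that fire on any input action.

def build_delay_nfa(n1, n2):
    """Automaton fragment for a delay (n1, n2): it accepts any sequence of
    between n1 and n2 actions. Returns (transitions, start, finals)."""
    transitions = {i: {'*': i + 1} for i in range(n2)}
    finals = set(range(n1, n2 + 1))  # accept after consuming n1..n2 actions
    return transitions, 0, finals

def accepts(nfa, actions):
    transitions, state, finals = nfa
    for a in actions:
        edges = transitions.get(state, {})
        # an exact-label edge wins; otherwise the '*' edge fires on any action
        state = edges.get(a, edges.get('*'))
        if state is None:
            return False
    return state in finals

nfa = build_delay_nfa(2, 4)
print(accepts(nfa, ['BrainCT', 'ECG', 'TPAev']))  # three actions: accepted
print(accepts(nfa, ['BrainCT']))                  # too few actions: rejected
```

Keeping ∗ as an explicit symbol avoids expanding each delay into one transition per concrete action in the alphabet.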


5.2.2. Input to the NFA

The NFA represents the query pattern, which has been issued in a (i) TraceQuery, or in a (ii) PathQuery (see Fig. 3).

(i) To answer a TraceQuery, it would be possible to exploit the NFA in a classical way, by providing all the event log traces as input to it, to verify which of them satisfy the query pattern. However, this solution would be inefficient. Instead, we provide the log-tree as input to the NFA. The log-tree is used as an index, since all of its nodes explicitly reference their support traces (see section 4); each path in the log-tree may index several identical support traces, or common trace prefixes, that will be considered only once, speeding up retrieval with respect to a flat search into the log.

(ii) To answer a PathQuery, we provide a process model (which may be the log-tree - see section 4 - or a graph resulting from the application of abstraction operations - see section 6) as input to the NFA. Also in this case, common prefixes/sub-paths of various lengths may be shared by different traces/paths, and will be executed on the NFA only once.

It is worth noting that providing a tree/graph as input to the NFA represents a significantly novel contribution, since in the formal language literature the input to an NFA is typically a string.
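The saving obtained by feeding the log-tree, rather than the flat log, to the automaton can be illustrated with a small prefix-tree sketch (helper names are hypothetical; the real log-tree also stores support-trace references, omitted here):

```python
# Identical trace prefixes are consumed by the automaton once, instead of once
# per trace as in a flat scan of the event log.

def build_prefix_tree(traces):
    root = {}
    for trace in traces:
        node = root
        for action in trace:
            node = node.setdefault(action, {})
    return root

def count_steps_tree(node):
    """Actions consumed when the tree is fed to the automaton: one per edge."""
    return sum(1 + count_steps_tree(child) for child in node.values())

traces = [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'C']]
flat_steps = sum(len(t) for t in traces)                  # every trace rescanned
tree_steps = count_steps_tree(build_prefix_tree(traces))  # shared prefix once
print(flat_steps, tree_steps)
```

The duplicated trace and the shared prefix A B collapse in the tree, so the automaton performs far fewer transition firings than in the flat scan.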


5.2.3. Search

The search step aims at finding all the paths/traces in the process model (graph or log-tree) that verify the query pattern represented by the NFA. To this end, our Search procedure takes as input the process model and the NFA and, given the initial node # in the model, starting from the initial NFA state, it tries to reach the final state by executing a sequence of transitions, whose labels correspond to the sequence of actions found along a path in the model. The dummy transition ∗ can be used for any type of action.


Algorithm 2: Search pseudo-code.
 1 Search(Mnode, Astate, M, A, locout, output)
 2   if final(Astate) then
 3     output ← locout ∪ output
 4   end
 5   else
 6     if atomic(Mnode) then
 7       loc ← AtomicExecution(A, Astate, Mnode)
 8     end
 9     else
10       loc ← AnyOrderExecution(A, Astate, Mnode)
11     end
12   end
13   foreach ⟨state, res⟩ ∈ loc do
14     foreach node ∈ next(Mnode, M) do
15       Search(node, state, M, A, append(locout, res), output)
16     end
17   end


The Search algorithm is shown in Algorithm 2. The algorithm takes as input a process model node (Mnode), an NFA state (Astate), the process model (M), the NFA (A), and two variables locout and output (passed by reference and initially empty); locout is meant to store the output of a single recursive call, while output will contain the final output of the algorithm. The output of Search is a set of lists of ⟨QTerm, Mnode, Act⟩ triples. Each list corresponds to an answer, i.e., to a path in the model that satisfies the query. Each triple specifies the match between the query term (QTerm) at hand, the model node (Mnode) in the path, and the matched action (Act). The recursion ends when Astate is a final state, i.e., when the dummy symbol $ has been analysed: in this situation, the local output locout of the last recursive call is added to the global output output (line 3). Otherwise, in the general case, we must distinguish between the situations in which Mnode is an atomic action (lines 6-7), or a set of actions to be executed in any order (line 10). In the first case (line 7), we call the AtomicExecution procedure, which


executes in the NFA, if possible, a transition from the current state Astate, following the edge labeled as the action in Mnode, or as the dummy action ∗ (remember that, in our approach, ∗ matches any action). The variable loc is a set containing a single pair ⟨state, res⟩, where state is the new state reached in the NFA after the transition, and res is a list (of length 1 in this case) of triples ⟨QTerm, Mnode, Act⟩. In turn, QTerm is the Term or Delay that labels state (see section 5.2.1), and Act is the name of the action that has just been consumed. In the second case (line 10), we call the AnyOrderExecution procedure, which iterates the execution on all possible permutations of the any-order actions in Mnode. At the end, loc contains a set of pairs, where every pair stores the result of the execution on a single permutation, coupled with the new state reached in the NFA after the execution. Specifically, every pair is in the form ⟨state, res⟩ defined as above; this time, res is a list whose length is equal to the number of actions in any order, where each element in the list corresponds to a single recognized action in the permutation. Finally, for each pair ⟨state, res⟩ in loc, and for each node that follows the current node Mnode in the process model M, we recursively call Search passing the properly updated parameters: in particular, res is appended to locout (lines 13-17).

Example 3 Referring to our running example, we consider the output generated by the Search algorithm on the query Q1 in section 5.1 and on the log-tree in Fig. 2. The Search algorithm generates a set of lists of triples ⟨QTerm, Mnode, Act⟩, one for each answer to the query. Notably, multiple answers can be generated in correspondence to any-order nodes (i.e., one answer for each permutation satisfying the query). Below, we show just one answer, taken from the rightmost path of Branch 2 of the log-tree.


⟨#, #, #⟩
⟨(5,8), Bloodtest&CoagulScreening&ECG, Bloodtest⟩
⟨(5,8), Bloodtest&CoagulScreening&ECG, CoagulScreening⟩
⟨(5,8), Bloodtest&CoagulScreening&ECG, ECG⟩
⟨(5,8), Brain CT, Brain CT⟩
⟨(5,8), NeurologicalEvaluation, NeurologicalEvaluation⟩
⟨(5,8), TPAev, TPAev⟩
⟨EcoDDS&AngioCT&AngioRMN, AngioCT, AngioCT⟩
⟨EcoDDS&AngioCT&AngioRMN, EcoDDS&AngioRMN, EcoDDS⟩
⟨EcoDDS&AngioCT&AngioRMN, EcoDDS&AngioRMN, AngioRMN⟩
⟨TPAia, TPAia, TPAia⟩
⟨Recanalization, Recanalization, Recanalization⟩


⟨$, $, $⟩


As an example, the second, third and fourth triples in this example represent the fact that the first three actions of the QTerm “(5,8)” (corresponding to a delay consisting of 5 to 8 actions we do not care about) have matched the node “Bloodtest&CoagulScreening&ECG” in the process model, in the following order: “Bloodtest”, “CoagulScreening”, “ECG” (one of the possible permutations; the remaining permutations appear in other answers). The eighth triple, instead, represents the match between one of the actions of the QTerm “EcoDDS&AngioCT&AngioRMN” and the node “AngioCT” in the model. 


5.2.4. Filtering

In the case of trace retrieval queries (TraceQuery in Fig. 3) the search algorithm is applied to the log-tree, and a filtering step is required to post-process the output of Search. Indeed, every node in the log-tree references its support traces (see section 4). When the search step provides in output a path which includes one or more nodes with actions to be executed in any order, it is possible that only some of the permutations of these actions are acceptable to answer the input query. For instance, if we search for the sequence AB, and we find in the model the any-order node A&B, its support traces containing the sequence BA must be filtered out.
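A minimal sketch of this filtering idea, assuming support traces are plain action lists and using illustrative action names: only the traces whose actual execution order matches the queried sequence survive.

```python
def contains_in_order(trace, required):
    """True if the actions in `required` occur in `trace` in that order
    (not necessarily contiguously); order alone is checked in this sketch."""
    it = iter(trace)
    return all(action in it for action in required)

support_traces = [
    ['Triage', 'A', 'B', 'BrainCT'],  # executes A then B: kept
    ['Triage', 'B', 'A', 'BrainCT'],  # executes B then A: filtered out
]
answer = [t for t in support_traces if contains_in_order(t, ['A', 'B'])]
print(answer)
```

Both traces are support traces of the same any-order node A&B, but only the first one answers the query AB.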


5.3. Experimental results on trace retrieval

We studied the efficiency of our retrieval facility, in the specific case of trace retrieval, in comparison with an existing regular expression processor provided by the Java Regex APIs³. The choice of Java Regex for comparison is motivated by the fact that it adopts a classical approach for pattern matching with regular expressions; on the other hand, it was not specifically implemented to work with traces⁴. We propose two experiments: (1) we tested the impact of the use of the log-tree as an index structure, on different event log dimensions and compositions (i.e., with different percentages

³ The interested reader can find information about Regex at http://www.regular-expressions.info/engine.html - last accessed on June 16, 2016.
⁴ Indeed, we did not identify any trace retrieval mechanism directly comparable to ours in the business process management literature. Also, Case Based Reasoning approaches to trace retrieval (see section 2) are only loosely related to our approach.


Figure 4: Trace retrieval time comparison. Left: comparison on event logs of different dimensions and compositions; Right: comparison in response to queries with any-order symbols, corresponding to different numbers of permutations


of replicated traces) - since Java Regex is not coupled with any indexing strategy; (2) we tested the impact of any-order symbols in the query - since Java Regex requires making all the permutations explicit, and answering all the resulting queries separately.

For the experiments, we used a machine equipped with an Intel i7-4810MQ CPU @ 2.80GHz, 8GB RAM, and an SSHD hybrid drive with 64MB cache. The experimental event log was taken from the BPI Challenge 2012 repository⁵. The chosen event log contained 13087 traces and 262200 actions overall, expressed as sequences of 3 to 175 actions. It counted 24 different action types.

In the first experiment, we created an initial event log by randomly selecting 1000 distinct traces from the original BPI Challenge event log mentioned above. Then, we progressively replicated this initial dataset, obtaining event logs of 2000, 4000, 8000 and 16000 traces, respectively. We recorded the mean trace retrieval time when answering a set of 10 queries with different delay lengths (from 0 to 9) on each event log. Results are shown in Fig. 4 (left). As can be observed, retrieval times when adopting our method are basically constant, while they grow linearly with the dimension of the dataset when adopting Java Regex.

In the second experiment, we worked on the original BPI Challenge event log of 13087 traces. We issued four different types of queries, and executed 10 queries for each type. We then calculated average query answering times.

⁵ http://data.4tu.nl/repository/uuid:3926db30-f712-4394-aebc-75976070e91f - last accessed on June 16, 2016.


The first query type included no any-order symbols; the second query type included one any-order symbol of length 2 (thus corresponding to 2 explicit queries, one for each possible permutation). The third query type presented one any-order symbol of length 3 (thus corresponding to 6 explicit queries), and the fourth presented one any-order symbol of length 4 (corresponding to 24 explicit queries). Results are shown in Fig. 4 (right). As can be observed, once again retrieval times when adopting our method are basically constant, while they grow linearly with the number of permutations when adopting Java Regex. The impact of delay values in the query in comparison to Java Regex was already successfully tested in Bottrighi et al. (2016).


6. Abstracting the process model: the merge facility


The log-tree perfectly adheres to the input traces, since all of its paths correspond to at least one trace in the event log. As discussed in the Introduction, when such a possibly overfitting model is not required/desired, users may want to generate more abstract models. In our framework, this goal can be achieved by means of a novel semantic abstraction facility, named QTerm merge. When adopting QTerm merge, abstraction is driven by the user, who starts an interactive process by asking a query. The paths in the input model that satisfy the query are then retrieved (see Section 5.2), and merged. The user may select which of the retrieved paths have to be merged together.

The input of this merge facility requires a transformation of the output of the Search algorithm, which ignores the Act component, and concatenates the Mnodes in each list only if they share the same QTerm, producing a set (Answers) of lists of ⟨QTerm, NodeList⟩ pairs. All lists in Answers are composed of N items (where N is the number of QTerms that compose the query issued by the user); thus, the merging facility can work on all answers, fusing them QTerm by QTerm.

Example 4 The output generated by the Search algorithm in Example 3 is transformed as illustrated below, in order to be provided as an input to the QTerm merge facility.

⟨#, (#)⟩
⟨(5,8), (Bloodtest&CoagulScreening&ECG BrainCT NeurologicalEvaluation TPAev)⟩
⟨EcoDDS&AngioCT&AngioRMN, (AngioCT EcoDDS&AngioRMN)⟩
⟨TPAia, (TPAia)⟩
⟨Recanalization, (Recanalization)⟩
⟨$, ($)⟩



For instance, the eighth, ninth and tenth triples in Example 3, which correspond to the same QTerm “EcoDDS&AngioCT&AngioRMN”, are collapsed into a single pair (the third pair in this example), where the two nodes “AngioCT” and “EcoDDS&AngioRMN” are concatenated, and the Act components (“AngioCT”, “EcoDDS”, and “AngioRMN”, respectively) are removed. 
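This transformation can be sketched as follows (the helper name is hypothetical; it assumes the Search output is a list of (QTerm, Mnode, Act) tuples, collapses consecutive tuples with the same QTerm, and drops the Act component):

```python
from itertools import groupby

def to_merge_input(answer):
    pairs = []
    for qterm, triples in groupby(answer, key=lambda t: t[0]):
        nodes = []
        for _, mnode, _ in triples:
            if not nodes or nodes[-1] != mnode:  # one entry per model node,
                nodes.append(mnode)              # even across any-order actions
        pairs.append((qterm, nodes))
    return pairs

answer = [
    ('#', '#', '#'),
    ('(5,8)', 'Bloodtest&CoagulScreening&ECG', 'Bloodtest'),
    ('(5,8)', 'Bloodtest&CoagulScreening&ECG', 'CoagulScreening'),
    ('(5,8)', 'Bloodtest&CoagulScreening&ECG', 'ECG'),
    ('(5,8)', 'BrainCT', 'BrainCT'),
    ('$', '$', '$'),
]
print(to_merge_input(answer))
```

The three triples matching the same any-order node collapse into a single node entry, mirroring how Example 3's output becomes Example 4's pairs.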


Two versions of QTerm merge have been implemented in our framework: a strict version and a generalized version. In fact, different paths may satisfy the same query, when the query includes imprecise delays or contains any-order symbols (e.g., A&B). In the latter case, for instance, all the paths that instantiate one of the symbol permutations will be retrieved by our pattern retrieval facility (e.g., AB and BA, and even any-order nodes A&B). However, the user does not necessarily want to merge non-identical paths. We thus provide her/him with two options:

• when the strict version is applied, the user can only merge identical retrieved paths;


• when the generalized version is applied, the user can merge non-identical paths as well.


Example 5 To exemplify the difference between the two merge versions, consider the model in Fig. 5(A) and the query #(0, 2)A&B(0, 0)$. The retrieval facility identifies three paths which satisfy such a query (i.e., branches 1 to 3 from left to right). If strict merge is applied, the output is the one shown in Fig. 5(B): the sequences AB in branches 1 and 2 are fused, while branch 3 contains a BA path and is kept separate. Instead, if generalized merge is applied, all three branches are fused: as shown in Fig. 5(C), a new node A&B is created, and the three branches are fused into it. 

In this section we describe only the algorithm for the strict version of QTerm merge; the interested reader can find details on the generalized version in Appendix B. The algorithm StrictVersion (see Algorithm 3) takes as input the process model M and Answers, i.e., the output of the Search algorithm, preprocessed as described above. The algorithm iterates on the number N of QTerms (lines 4-19), and fuses in the model the identical NodeLists of different answers corresponding to the same QTerm (lines 6-18). The i-th pair ⟨QTerm, NodeList⟩ is extracted from RES list (line 7). Then, if QTerm is


Figure 5: (A) The input model of Example 5; (B) the output of the application of strict merge on the pattern A&B; (C) the output of the application of generalized merge


a delay, it is simply ignored (line 8). Otherwise, the function getPath retrieves from the variable newPaths the path exactly composed of the list of nodes in NodeList, and stores it in the variable path (line 9); if such a path does not exist, NodeList is added to newPaths (line 11). Otherwise, all incoming/outgoing edges of NodeList are redirected to path (line 14). Moreover, in order to eliminate possible duplicated paths in the model, the original set of nodes belonging to NodeList is added to the set of nodes to delete (line 15). Such nodes are not immediately deleted, because they could be part of other answers, not processed yet. Deletion of nodes and corresponding edges takes place at the end of the algorithm (line 20), when all QTerms have been examined, and all edges have been properly redirected.
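The deduplication at the heart of the strict version can be sketched as follows (illustrative names; edge redirection on the model graph is omitted, and nodes are represented as (action name, node id) pairs):

```python
def fuse_identical(node_lists):
    new_paths = {}     # plays the role of the newPaths variable in Algorithm 3
    to_delete = set()  # duplicated nodes, deleted only after all answers are seen
    for node_list in node_lists:
        key = tuple(name for name, _ in node_list)
        if key not in new_paths:
            new_paths[key] = node_list  # first occurrence: kept as canonical
        else:
            # duplicate path: its edges would be redirected to new_paths[key],
            # and its nodes are scheduled for deletion
            to_delete.update(node_id for _, node_id in node_list)
    return list(new_paths.values()), to_delete

paths = [[('A', 1), ('B', 2)], [('A', 3), ('B', 4)], [('B', 5), ('A', 6)]]
kept, deleted = fuse_identical(paths)
print(len(kept), sorted(deleted))
```

As in Example 5, the two AB paths collapse into one canonical path, while the BA path is kept separate; deferring deletion mirrors the fact that a node may still belong to an answer not yet processed.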


7. Supporting interactive sessions of work


In the previous sections, we have illustrated our initial process model mining facility (Section 4), our path/trace retrieval facility (Section 5), and our merging techniques to provide semantic abstraction (Section 6). In this section, we propose an innovative approach which exploits such "ingredients" to facilitate analysts in determining a suitable balance between overfitting and underfitting in the definition of the process model. To achieve such a goal, we provide analysts with two innovative features in the area of process mining: the possibility of (i) operating "step-by-step", and of (ii) applying


Algorithm 3: StrictVersion pseudo-code.
 1 Let N be the number of QTerms
 2 StrictVersion(M, Answers)
 3   nodeToDelete ← {}
 4   for i ← 1 to N do
 5     newPaths ← {}
 6     foreach RES list ∈ Answers do
 7       ⟨QTerm, NodeList⟩ ← get(RES list, i)
 8       if QTerm is not a delay then
 9         path ← getPath(newPaths, NodeList)
10         if path = null then
11           newPaths ← newPaths ∪ {NodeList}
12         end
13         else
14           redirect(M, NodeList, path)
15           nodeToDelete ← nodeToDelete ∪ listToSet(NodeList)
16         end
17       end
18     end
19   end
20   delete(M, nodeToDelete)


abstraction facilities (i.e., merge) in a selective way. The key idea is to introduce explicit support for process model versioning. Indeed, we propose a tree of models as a data structure to represent the different models obtained at each abstraction step, to support the generation of new, more abstract models (through merge operations) or the backtracking to (possibly less abstract) previously built models. Each node in the tree represents a model version, and each arc from a node N1 to a node N2 represents a step of abstraction, obtained by applying a path retrieval operation on N1 and by merging the retrieved paths (possibly in a selective way). As motivated in the Introduction, the root node of the tree of models is the log-tree, which has precision=1 and replay-fitness=1. Each node is also annotated with a set of quantitative values (i.e., precision, replay-fitness, generalization (Buijs et al., 2012)). On the basis of such values, and of her/his expertise in the domain, the user can analyse the current model (possibly taking also advantage of the


path retrieval facilities we provide), and decide whether it achieves the desired trade-off between overfitting and underfitting, or whether it is necessary to move forward to further abstractions (in case of overfitting), or even back to previously built models. In Section 7.1, we provide the data structure and the algorithm to support such an interactive definition of the process model; in Section 7.2 we illustrate our graphical interface, and in Section 8 we experiment with the overall approach on a real event log.


7.1. An algorithm to support the interactive generation of the process model

In the tree of models, nodes are records consisting of different fields: Model, containing the graph representing a process model version; Precision, Replay-Fitness, and Generalization, representing the homonymous quantitative parameters⁶. Arcs are labelled with a Pattern (see QPattern in Section 5), representing the pattern we merge, a Modality ("strict" or "generalized"), the boolean field Selective? (which is true if a selective merge has been applied), and, in the case Selective? is true, Partitions, indicating the partitions used in order to apply selective merge.

The interactive discovery of the process model is described by Algorithm 4. The function GenerateModel has two parameters: the tree of models (Tree) and the log (L). Before the application of GenerateModel, the log-tree is mined, and the root node of the tree of models is generated, containing the log-tree and the values assumed by the quantitative parameters on it. The algorithm repeatedly proposes to the analyst four different operations, until the "approve" one is selected (meaning that the current node in the tree of models contains the model chosen by the analyst - to be returned as output). The "analyse" operation allows the analyst to search for a pattern in the model contained in the current node, by activating the procedures illustrated in Section 5. The "merge" operation supports the merge abstraction on the retrieved paths in the current model, by resorting to the procedures presented in Section 6. It first allows the analyst to select the merge modality ("strict" or "generalized"), and to specify whether the merge is selective or not. A new node is generated (initialized as a copy of

⁶ In the current implementation, we adopt the definitions in Buijs et al. (2012) to evaluate precision, replay-fitness and generalization. However, we stress that our general methodology is independent not only of the evaluation methods for the quantitative parameters, but also of the chosen parameters themselves. For example, simplicity (Buijs et al., 2012) might be evaluated as well.


the current node) and appended to the current node. In the non-selective case, the merge is applied to all the retrieved paths (see line 23); otherwise, it is applied to each one of the subsets of paths specified by the analyst through the "Partition" function (see lines 16-21). Specifically, given the set of all paths satisfying the input pattern, the function Partition allows analysts to group them into (disjoint and exhaustive) subsets. Finally, the quantitative parameters are evaluated for the new node, which becomes the current node (lines 25-27). The "back" operation allows the analyst to choose any node in the tree as the new current node (line 30).
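A minimal sketch of the tree-of-models records described above; field names mirror the paper's description, but the concrete layout is an assumption, and the sample values are the M1/M2 figures reported in the experiments of Section 8:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelVersion:
    model: object            # the process-model graph of this version
    precision: float
    replay_fitness: float
    generalization: float
    children: list = field(default_factory=list)  # (AbstractionStep, ModelVersion)

@dataclass
class AbstractionStep:       # labels the arc from a parent version to a child
    pattern: str             # the QPattern that was merged
    modality: str            # "strict" or "generalized"
    selective: bool
    partitions: Optional[list] = None  # set only when selective is True

root = ModelVersion("log-tree", precision=1.0, replay_fitness=1.0,
                    generalization=0.76)
m2 = ModelVersion("M2", precision=1.0, replay_fitness=1.0, generalization=0.80)
root.children.append(
    (AbstractionStep("#(6,11)Recanalization(0,0)$", "strict", False), m2))
print(len(root.children))
```

Backtracking then amounts to selecting any existing ModelVersion as the current node, and a new merge simply appends another child to it.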

M

AN US

7.2. System graphical interface

In our approach, interactivity in process model discovery is supported by means of a user-friendly graphical interface, shown in Fig. 7 and Fig. 8. The system interface is subdivided into two main panels. The leftmost panel shows the tree of model versions (see Section 7.1), whose root (M1) corresponds to the log-tree. Quantitative parameters of each model version are shown next to the corresponding node, with FIT, PREC and GEN as short forms of, respectively, replay-fitness, precision and generalization. At the bottom of the left panel, the interface provides the button to possibly approve one of the models generated during the working session. The rightmost panel provides three kinds of information:


• a table summarizing the operations already applied to generate the current model; in turn, each row in the table specifies: the type of merging that was required (whether a strict or a generalized merge), the query inputted by the user, whether the merging operator was applied to all the retrieved paths or just to a subset of them and, in the latter case, a button to show the parts chosen for merge;


• the current process model (represented as a graph), obtained from the log-tree by the ordered application of all the operations listed in the table above;

• a section for analyzing the current model by retrieving and visualizing specific paths in the model itself, and for possibly inputting the details of the next merge step on the retrieved paths or on a subset of them. This frame is split into two panels. The panel on the left allows the user to write the query needed for path retrieval or trace retrieval. The retrieved paths are highlighted in bold in the current process model.


Algorithm 4: GenerateModel pseudo-code.
 1 GenerateModel(Tree, L)
   Result: Node
 2   Pattern ← NULL
 3   Search Out ← NULL
 4   Curr Node ← root(Tree)
 5   Operation ← Ask op(Curr Node)
 6   while Operation ≠ "approve" do
 7     switch Operation do
 8       case "analyze"
 9         Pattern ← Ask Pattern()
10         Search Out ← Pattern Search(Curr Node.Model, Pattern)
11       end
12       case "merge"
13         Merge Modality ← Ask Modality()
14         Selective Merge? ← Ask Selectivity()
15         New Node ← copy(Curr Node)
16         if Selective Merge? then
17           Parts ← Partition(Search Out, Curr Node.Model)
18           foreach P ∈ Parts do
19             merge(Merge Modality, P, New Node.Model)
20           end
21         end
22         else
23           merge(Merge Modality, Search Out, New Node.Model)
24         end
25         Eval Parameters(New Node, L)
26         Append Son(Curr Node, New Node, Pattern, Merge Modality, Selective Merge?, Parts)
27         Curr Node ← New Node
28       end
29       case "back"
30         Curr Node ← Ask cur(Tree)
31       end
32     endsw
33     Operation ← Ask op(Curr Node)
34   end
35   return Curr Node


Later, using the panel on the right, the user is allowed to specify from a drop-down menu whether the retrieved paths should be merged through the strict or the generalized operator and, by means of proper buttons, instruct the system to merge all of the retrieved paths, or activate a further pop-up window, from which s/he will be enabled to select a subset of the paths to be merged. Once the paths have been selected, they will be highlighted in bold in the current process model. Note, however, that it is also possible for the user to use the path retrieval facility to analyse the model, without asking for a subsequent merge.


At any stage of the process model discovery, the user can select a node in the tree of versions in the leftmost panel: as a result, the corresponding model will be plotted on the right, together with the operation history that led to its definition. In this way, the user can backtrack to that version and, by writing a new query in the analysis panel and applying a new merging operator, s/he will be able to create an alternative version, which will generate a new branch in the tree. In the next section, we will showcase this procedure by means of an experiment.

8. Experiments and comparisons


In this section, we will experimentally compare the results obtained by our tool (SIM) to the output generated by two well-known miners: the inductive miner (IM) (Leemans et al., 2013), and the heuristic miner (HM) (Weijters et al., 2006). HM, in particular, is one of the most used algorithms for the analysis of medical processes (Rojas et al., 2016), thanks to its robustness to noise, which typically affects the event logs in these domains. We ran the experiments in the field of stroke, using the dataset described in Section 4.2. Such a choice depends on the fact that our approach focuses on the interactions with experts/analysts, so that we had to select a domain where we could count on existing collaborations and on experts' cooperation. Specifically, we conducted our tests working with physicians of the Stroke Unit Network of Regione Lombardia, Italy. The experiments were run on a machine equipped with an Intel i7-7700HQ processor and 16GB RAM. In the experiments, we compared the three approaches considering three different dimensions:

(i) a quantitative analysis of the execution times;


(ii) a quantitative (by resorting to precision, replay-fitness and generalization) and qualitative (by resorting to domain experts' collaboration) evaluation of the mined model;


(iii) an evaluation of the user/system interaction in the mining process.

In particular, dimension (iii) is crucial to our approach, which mainly aims at exploiting the experts’ knowledge in an incremental way. In the following, we will first consider dimension (i) (Section 8.1), and then (jointly) dimensions (ii) and (iii) (Section 8.2).


8.1. Execution times and scalability

To analyse execution times and scalability, we started from the event log presented in Section 4.2, and artificially created logs of progressively growing dimensions, by replicating the original traces. As regards our tool SIM, we have considered the time needed to mine the log-tree, since the merging steps are independent of the dimension of the event log (they only depend on the dimension of the current process model), and their computational times are orders of magnitude smaller than that of the log-tree mining phase. Fig. 6 shows the execution times (expressed in seconds) of our approach (log-tree), IM and HM, working on logs containing from 200 to 30000 traces. The figure shows that all the approaches scale linearly, with SIM performing slightly better than the others.


8.2. Model mining comparisons

In this section, we compare the three approaches working on the event log presented in Section 4.2, considering not only the mined model, but also the process used to discover such a model, in cooperation with the domain experts.


8.2.1. Model mining in SIM

During our experimental session, as a first step, the experts generated the log-tree from the event log. The result of this first step is shown in Fig. 2 (model M1). From the quantitative point of view, the log-tree has precision and replay-fitness equal to 1 (by construction), and generalization equal to 0.76. Besides such a quantitative analysis, from a "semantic" point of view, the experts noticed the presence of multiple occurrences of actions, and patterns of actions, and thus decided to move towards a more generalized


Figure 6: Results of the scalability test on the three approaches


model. In particular, they observed that many different branches in the log-tree ended with a Recanalization action. They thus decided to focus on such an aspect, and used our analysis facility to retrieve all the occurrences of Recanalization, through the query:

#(6, 11)Recanalization(0, 0)$

The experts executed a strict merge on all the retrieved paths (OP1). As a result, they unified all the Recanalization nodes into one, as shown in Fig. 7. Here, the model (M2) is equivalent to the previous model M1 (thus maintaining the same replay-fitness and precision), but more compact and general (generalization increases to 0.80). At this point, the experts observed that TPAia is typically executed after a patient assessment by means of some exams (EcoDDS, AngioCT and AngioRMN), which are generally executed in any order in clinical practice. However, medical knowledge (see e.g., Mair & Wardlaw (2014)) suggests focusing on the results of the AngioRMN, since the AngioRMN is more suitable for excluding any contraindications to the thrombolysis treatment. On the basis of such knowledge, the experts decided to investigate this direction. They therefore formalized the following query:

#(5, 9)AngioRMN(0, 2)TPAia(1, 6)$

After inputting this query in the right panel of Fig. 7 and highlighting the corresponding retrieved paths using the search button, they decided to

31

AN US

CR IP T

ACCEPTED MANUSCRIPT

M

Figure 7: The process model obtained after applying operation OP1 shown by the graphical user interface

AC

CE

PT

ED

generalise the current model by applying a strict merge on all the retrieved paths (OP2). They thus obtained a third version of the model (M3), shown in Fig. 8 A. Observing the model M3 and its quantitative parameters in Fig. 8 A, the experts identified a number of important issues. First, from a quantitative viewpoint, although the generalization of M3 slightly increases with respect to M2, the new model declines sharply in replay-fitness and, even if to a lesser degree, in precision. This is due to the fact that not all the traces can be replayed correctly in M3, since the merge operation (which, among all the merging steps, required a split of the node EcoDDS&AngioRMN) diverted the control flow and added arcs to the model which are never traversed by any of the traces in the log. Moving towards a more "semantic" analysis, the experts noticed that M3 contains two paths which are incorrect on the basis of the domain knowledge, since they present a Recanalization that does not follow any TPA. Such a procedure would have a very dangerous effect on patients, pushing the

thrombus even closer to the brain, and causing additional damages [7]. As a first step, the experts used our trace retrieval facility to find whether there was any occurrence of such paths in the input log. Fortunately, no occurrences were found. As a consequence, the experts understood that the last generalization step had led to an underfitting model, describing an unacceptable behavior. They thus decided to restart from the model M2 and try a different strategy. They achieved this goal by clicking on the node M2 in the left part of the user interface, and considering the three exams (EcoDDS, angioCT and angioRMN) as executed in any order before performing the TPAia. They therefore issued the query:

#(5,8)EcoDDS&AngioCT&AngioRMN TPAia(1,4)$

and, after searching for the resulting paths in M2, applied a generalized merge operator (instead of a strict one) on all paths (OP3), obtaining the model M4 shown in Fig. 8 B. The model M4 is very general and compact, as can be observed both from visual inspection and by considering the generalization, which moved from 0.80 in M2 to 0.84 in M4. On the other hand, with respect to M2, the replay-fitness of M4 is much lower and there is also a slight loss in precision. Moreover, the experts noted that the main "semantic" problem already present in M3 (i.e., a path where a Recanalization does not follow any TPA) was still present in M4. Therefore, they decided to re-analyze the situation, considering model M2 and running the last query again. They noticed that, among the three paths retrieved by the query, two (the ones on the left in the model) were preceded by the action TPAev, and one (the rightmost path) was not. Given their medical knowledge, they realized that the first two paths concerned patients arriving within 4.5 hours of stroke onset (therefore eligible for TPAev), while the third one concerned the other patients.
They concluded that merging two such significantly different procedures was totally inappropriate, and took advantage of our facility to perform the merge in a selective way (thus merging only the two paths following TPAev). The resulting model M5 is shown in Fig. 8 C. M5 is more abstract than the model in Fig. 2, while retaining the property of being "semantically correct", in the sense that it generalizes the initial model M1 without introducing any dangerous therapeutic path not supported by the event log. The remaining redundancy makes sense, as two distinct therapeutic pathways are evident and preserved: these two paths refer to patients arriving at the emergency department in different conditions, who must be treated accordingly. This process model is the most general among all the obtained models, while still maintaining very acceptable values of replay-fitness and precision. The final process model thus keeps the right balance between overfitting and underfitting, and the interactive session could be closed by pressing the button Approve in the interface shown in Fig. 8 C.

[7] ISO-SPREAD guidelines, http://www.iso-spread.it/index.php, last accessed 12/04/2016.
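The queries issued in this session (e.g., #(5,9)AngioRMN(0,2)TPAia(1,6)$) combine action terms, any-order sets ('&') and bounded delays. The following is a minimal, illustrative matcher, not the system's actual implementation: it translates such a pattern into a regular expression over comma-separated traces. The action names and delay bounds in the example are placeholders, not the real query values.

```python
import re
from itertools import permutations

ACTION = r"[^,]+"  # one action = one comma-terminated token

def pattern_to_regex(terms, tail=(0, 0)):
    """Translate a query pattern into a regex over comma-separated traces.
    terms: list of ((lo, hi), actions) pairs, where (lo, hi) is the delay
    (number of arbitrary actions) allowed before the term, and actions is
    a list matched in any order (the '&' construct). tail is the final
    delay before the '$' terminator."""
    parts = ["^"]
    for (lo, hi), actions in terms:
        parts.append(rf"(?:{ACTION},){{{lo},{hi}}}")  # bounded wildcard delay
        alts = "|".join(",".join(p) + "," for p in permutations(actions))
        parts.append(rf"(?:{alts})")
    parts.append(rf"(?:{ACTION},){{{tail[0]},{tail[1]}}}$")
    return re.compile("".join(parts))

def matches(trace, terms, tail=(0, 0)):
    return bool(pattern_to_regex(terms, tail).match(",".join(trace) + ","))

# A toy version of '#(0,2)AngioRMN(0,1)TPAia(0,3)$' (delays shrunk to fit
# the short illustrative trace below).
query = [((0, 2), ["AngioRMN"]), ((0, 1), ["TPAia"])]
trace = ["Admission", "AngioRMN", "TPAia", "Recanalization"]
print(matches(trace, query, tail=(0, 3)))  # True
```

The translation mirrors the semantics given in Appendix A: delays become bounded repetitions of a wildcard, and any-order sets become unions of permutations.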


8.2.2. Model mining in IM

With IM, the experts applied the tool, using the default values for the filters, to the same log used in Section 7. They received as output, in a "one-shot" step, the process model illustrated in Fig. 9(a). Quantitatively speaking, the precision of the model is 0.66, the replay-fitness is 0.95, and the generalization is 0.99. From the "semantic" point of view, the experts found the resulting model quite clear (indeed, in the initial part, it is quite similar to our log-tree), but they noticed two main types of problems. First, they detected a cycle that introduces the iterative execution of BrainCT and TPAev. On the basis of their medical knowledge, such a cycle should not be present (since it has no medical meaning). And, indeed, by using the facilities provided by ProM, they checked that such an iteration was not recorded in the input traces. Second, they noticed that, in the mined model, several "crucial" actions appeared in optional branches, and the model did not represent the contextual information for branch selection (and thus for action execution). For example, medical experts know that Surgical Evaluation in the treatment of stroke must be planned in particular situations, and executed only if strictly necessary. A surgeon is an important resource for a hospital; therefore moving her/him improperly wastes time both for the specialist and for the patient under treatment. In the process model in Fig. 9(a), Surgical Evaluation has been mined as a single optional action, and there is no way to obtain information about the context and conditions for its execution. From the model, the experts only see that, at a certain point in the process, they can choose to request an expert surgeon's evaluation, or not, independently of what happened before. In the model, similar problems also arise concerning other important actions, such as TPAia and Recanalization.
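The check the experts performed, verifying whether a mined iteration is actually recorded in the log, can be sketched as a simple scan counting repeated actions per trace. The traces below are toy examples, not the real event log.

```python
def cycle_supported(traces, cycle_actions):
    """Return True if some trace executes any of the given actions more
    than once, i.e. the log contains evidence for an iteration."""
    for trace in traces:
        if any(trace.count(a) > 1 for a in cycle_actions):
            return True
    return False

# Toy log (hypothetical traces): no trace repeats BrainCT or TPAev,
# so a mined BrainCT/TPAev loop is not supported by the data.
log = [
    ["BrainCT", "NeurologicalEvaluation", "TPAev", "Recanalization"],
    ["BrainCT", "TPAev"],
]
print(cycle_supported(log, {"BrainCT", "TPAev"}))  # False
```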
The experts tried to overcome such underfitting problems by applying IM again to the event log traces, with different values for the input filters. However, the filters only allow the user to rule out less frequent actions and paths from the model, so that, after several attempts, they concluded that such filters were not useful to solve the problems.


8.2.3. Model mining in HM

With HM, the experts applied the tool, using the default values for the parameters, to the same log used in Section 7. They received as output, in a "one-shot" step, the process model illustrated in Fig. 9(b). Quantitatively speaking, the precision of the model is 0.44, the replay-fitness is 0.92, and the generalization is 0.99. From the "semantic" point of view, the experts found the resulting model quite clear (as for IM, in the initial part, the model is quite similar to our log-tree), but found two main problems. The first problem is again the presence of a cycle. In this case, the cycle introduces the iterative execution of BrainCT, Neurological Evaluation and TPAev. Once again, such a cycle was not supported by the experts' knowledge. And, as for IM, they could check that such an iteration was not present in the input traces. An additional "semantic" problem was pointed out by the experts: in the mined model the action Surgical Evaluation was placed very early in the process, precisely after the BrainCT, and concurrently with the Neurological Evaluation. Indeed, the experts know that the Surgical Evaluation must be planned and potentially executed at a particular point of the treatment, and only in certain conditions, after a second BrainCT control in the context of the administration of TPAev. They tried to overcome such an underfitting by applying HM again, with different values for the input parameters. We explained to the experts that cycles can be tackled in HM by setting "proper" values for the length-one-loops and length-two-loops parameters. On the other hand, we suggested that the context of actions in HM can be tackled by setting "proper" values for the Long Distance Dependencies (LDD) parameter. Together with the experts, we re-ran HM with different combinations of values for the LDD and loop parameters. The "best" of the resulting models, according to the experts' evaluation, is the one shown in Fig. 9(c).
Notably, the quantitative parameters for this model are very similar to those of the model in Fig. 9(b). From the "semantic" point of view, the experts noted that the iterative execution of BrainCT, Neurological Evaluation and TPAev was still present in the new model (as well as in all the models generated with the LDD and loop parameters). On the other hand, they observed that, in the new model, Surgical Evaluation had been put in a more proper context, but at the price of generating, especially in the first part of the model, a large number of alternative paths, which they could not motivate in terms of medical knowledge.


8.3. Comparisons

While the comparisons concerning the execution times have already been presented in Section 8.1, in this section we compare SIM, IM and HM as regards the evaluation of the mined models and the user/system interaction in the mining process. Concerning the quantitative analysis of the mined model, it is important to remark that the three approaches provided similar results in replay-fitness, while they achieved a different balance between precision and generalization. Indeed, with SIM, the experts could obtain a model that was slightly less general (generalization is 0.86 in SIM, 0.99 in IM, and 0.99 in HM) but much more precise (precision is 0.99 in SIM, 0.66 in IM, and 0.45 in HM). This is an important result in a knowledge-intensive domain such as the medical one, and was achieved through an intensive "knowledge-based" interaction with experts within the mining process (see below). As regards the qualitative analysis of the model, the experts pointed out two main types of problems with IM and HM: the presence of undesired cycles, and a limited capability of capturing the actions' contexts. The former problem is probably due to the fact that both HM and IM impose the constraint of allowing the representation of only one instance of each action in the model (uniqueness of actions henceforth): when more than one occurrence appears (as, e.g., for BrainCT in the experiment), their merger into a unique node leads to the generation of cycles. In SIM, such a problem did not occur (indeed, we do not force uniqueness of actions). As regards contextualization, like most current miners, HM mines the model mostly considering "local" dependencies, and both IM and HM operate "global" (i.e., on the whole model) syntactic generalizations.
On the other hand, we start from the log-tree, which has precision 1 (so that all contextual dependencies are maintained), and then we provide experts with "semantic" merge operators and let them interactively drive the generalizations, also supporting a selective application of the merge operators themselves (so that it is up to the user to retain or to merge different contexts). Indeed, the main difference between HM/IM and our approach lies in the interaction with the experts during the mining process. Both HM and IM provide "one-shot" approaches, where the miners directly provide the output model, given the chosen parameter setting. Different models can be obtained by changing the parameters, but the gap between the experts' knowledge and the setting of the parameters is not easy to fill. On the other hand, the experts appreciated the interactivity of our approach, since they could directly use their knowledge to lead the mining process incrementally, possibly backtracking to previous model versions. They also appreciated the SIM facilities for analysing the current model and, above all, the possibility of applying the merge operation in a selective way.


9. Literature comparisons


Our approach provides a unifying framework, where a suite of different facilities is properly integrated, to support process model discovery at different levels of abstraction, but also efficient trace retrieval (also responding to partially specified patterns) and path retrieval, in an interactive session of work. To the best of our knowledge, no similar system is described in the literature. However, the approach presents original contributions also when focusing on one task at a time. As regards process model discovery, in particular, we belong to the family of two-step approaches (van der Aalst, 2016), and the closest work to ours is the one in van der Aalst et al. (2010), which aims at identifying a "good" balance between underfitting and overfitting in the mined process model; identifying such a balance is a core idea of our approach as well. As already observed, that work adopts a two-step method: first, a transition system is constructed; then, using the theory of regions (Cortadella et al., 1998), a Petri Net model is provided. In particular, since the balance between underfitting and overfitting seems to be domain and/or task dependent, the authors propose a configurable approach for step one, so that different transition systems can be obtained and evaluated by the analyst, who can use her/his knowledge to choose the one that best fits the balance. Specifically, the authors allow the analyst to build the transition system on the basis of prefixes, postfixes, or both, and identify different forms of abstraction to produce a transition system from the log. Notably, although other existing process mining algorithms use some forms of abstraction, in those cases the level and nature of the abstraction cannot be controlled or adapted by the user.
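The configurable abstraction idea of van der Aalst et al. (2010) can be sketched by building a transition system whose states are trace prefixes abstracted to their last k actions: a large k stays close to a prefix tree (overfitting), a small k merges states and generalizes more. A minimal sketch with toy traces, not the original algorithm:

```python
from collections import defaultdict

def transition_system(log, k):
    """States: the last k actions of each prefix (a configurable
    abstraction); transitions are labelled by the next action."""
    transitions = defaultdict(set)
    for trace in log:
        state = ()
        for action in trace:
            nxt = (state + (action,))[-k:]
            transitions[state].add((action, nxt))
            state = nxt
    return dict(transitions)

log = [["a", "b", "a", "c"], ["a", "b", "a", "d"]]
ts_fine = transition_system(log, k=3)    # close to a prefix tree
ts_coarse = transition_system(log, k=1)  # coarser states, more general
print(len(ts_fine), len(ts_coarse))  # 4 3
```

With k=1, the two occurrences of "a" collapse into one state, exactly the kind of merging that, taken too far, produces the cycles and context loss discussed above.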


It is worth noting that our log-tree could be easily converted into a prefix transition system by a syntactic transformation, that reduces to a simple isomorphism if no any-order nodes are present. Our choice of introducing the any-order nodes was motivated by the goal of making the starting point of our incremental and interactive approach to model abstraction more compact and readable for end users. With respect to a prefix transition system, the log-tree also includes pointers to the log traces: this decision was due to the goal of exploiting this data structure as an index, to speed up trace retrieval. Despite the fact that the approach in van der Aalst et al. (2010) is the closest to ours, and that we were influenced by its core ideas in defining our framework, our work presents significant innovative characteristics with respect to it, since it provides: • new forms of semantic abstraction facilities (namely, strict and generalized merge); • the possibility of applying abstractions in a selective way (i.e., the analyst can decide which occurrences of the patterns in the model have to be merged and which not);


• the possibility of operating step-by-step; in this way, the final model can be built by applying different merge steps, each one leading to a more abstract version. At each step, the analyst may evaluate the result (also considering quantitative parameters such as precision, replay-fitness and generalization (Buijs et al., 2012)), and approve it, move on to further merges, or backtrack to some previous solution.
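The dual role of the log-tree, described above, as both starting model and trace index can be sketched as a prefix tree whose nodes store pointers to the traces traversing them. This is a simplified illustration (no any-order nodes), not the paper's actual data structure:

```python
class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.trace_ids = []  # pointers to the traces traversing this node

def build_log_tree(log):
    root = Node()
    for tid, trace in enumerate(log):
        node = root
        for action in trace:
            node = node.children.setdefault(action, Node())
            node.trace_ids.append(tid)
    return root

def traces_with_prefix(root, prefix):
    """Index-based retrieval: follow the prefix, return trace pointers."""
    node = root
    for action in prefix:
        if action not in node.children:
            return []
        node = node.children[action]
    return node.trace_ids

log = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
tree = build_log_tree(log)
print(traces_with_prefix(tree, ["a", "b"]))  # [0, 1]
```

Retrieval follows one path of the tree instead of scanning every trace, which is the source of the efficiency gains reported for trace retrieval.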


Conformance checking approaches, on the other hand, could be integrated into our framework in the near future. Indeed, our retrieval capability could be deployed as a first step towards verifying the presence of specific paths of actions required by clinical guidelines (which encode the "best", evidence-based, process model), and the conformance with such paths could be studied exploiting the techniques introduced in Spiotta et al. (2017). Moreover, we could adopt the techniques in Spiotta et al. (2017) or the metrics in Montani et al. (2015) to compare the interactively constructed process model to the existing clinical guidelines, representing the reference process to be implemented in the medical application at hand. Model repair techniques (Fahland & van der Aalst, 2012), instead, are only loosely related to our solution, since they rely on a fully automatic (vs. interactive) approach.

As regards trace retrieval, the literature approaches developed in the CBR area implement forms of reasoning on traces, but they do not aim at providing support in business process management. The goal of these tools is therefore usually very different from ours. The ideas in Huang et al. (2013) and Montani & Leonardi (2014), however, could be further considered in our framework. In particular, we plan to extend the merging procedure by allowing the merger of non-identical paths, selecting the candidates for merging on the basis of their edit distance, and properly setting a similarity threshold on the basis of domain knowledge. As regards path retrieval, finally, with respect to the work in Jin et al. (2013) we adopt a unifying approach, where the very same querying mechanism we exploit for trace retrieval is also used to support path retrieval itself. Indeed, as already observed, the smooth integration of different facilities is one of the most distinctive and innovative characteristics of our framework. The contribution in Jin et al. (2013) is however interesting and, in a sense, supports abstract queries as well, since in the filtering step that system allows the user to provide subgraphs where the control flow relations are only partially specified. Possible integrations of this work may be considered in our future research.

10. Conclusions
In many contexts, process traces are maintained as a strategic source of information. Event logs constitute a very rich source of experiential knowledge, fundamental to support different tasks, with the final goal of improving the quality or efficiency of (business) processes. While in the literature different tasks have been pursued independently of each other (often in different areas of research), we propose a comprehensive approach, aiming at integrating three key tasks, namely: process model discovery from the event log, path retrieval from the process model, and trace retrieval. For each of these tasks, we provide advances with respect to the state of the art. Indeed, path retrieval on a process model is an area that deserves more attention, and we propose a flexible approach to it, also supporting the retrieval of partially specified patterns. As regards trace retrieval, our approach is computationally efficient, since it exploits indexing, and flexible, in the sense that it is the first one that supports the retrieval of traces corresponding to partially specified patterns as well. Concerning model discovery, our work follows one of the mainstreams of research in process mining pointed out in van der Aalst (2016), namely the two-step approaches (see Section 2), whose main goal is to support analysts in the fundamental objective of obtaining a process model that properly balances overfitting and underfitting (van der Aalst et al., 2010). With respect to such a mainstream, we (i) provide new semantic abstraction facilities, and support an (ii) incremental and (iii) selective construction of the model. In this way, as shown in the experiment in Section 8, our approach allows analysts to fully exploit their expertise. Such expertise is typically adopted in two-step process mining approaches; we further exploit it by providing:
• the possibility of using the experts' analysis to directly guide the definition of the process model (balancing overfitting and underfitting) through a proper application of the merge abstractions; notably, in our approach, the experts' analysis is facilitated by the path retrieval facilities;


• the possibility of operating "step-by-step" (i.e., in an incremental way), so that, at each step, the expert can obtain both a qualitative and a quantitative evaluation of the current model, and choose to move towards a further abstraction or back to a less abstract model. Notably, the possibility of applying merge operations in a selective way provides additional flexibility to the whole approach.
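The step-by-step workflow with backtracking can be supported by keeping the whole version history of the model. A minimal sketch, where models and merge operations are represented abstractly (strings and callables are placeholders, not the system's actual types):

```python
class InteractiveSession:
    """Each abstraction step yields a new model version; the analyst can
    evaluate it, continue, or backtrack (as with M1..M5 in Section 8)."""
    def __init__(self, initial_model):
        self.history = [initial_model]

    def apply(self, operation):
        """Apply a merge operation to the current model; keep the old one."""
        self.history.append(operation(self.history[-1]))
        return self.history[-1]

    def backtrack(self, steps=1):
        """Discard the last versions and return to an earlier model."""
        del self.history[-steps:]
        return self.history[-1]

session = InteractiveSession("M1")
session.apply(lambda m: m + "->merge")  # a first abstraction step
session.apply(lambda m: m + "->merge")  # a step that turns out underfitting
print(session.backtrack())  # M1->merge
```

Keeping every version makes the evaluate/abstract/backtrack loop cheap, which is what enables the exploratory sessions described in Section 8.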


Interestingly, we believe that our approach is somehow complementary to several other approaches in the process mining literature. On the one hand, our "step-by-step" procedure to obtain the process model, and the possibility of applying abstraction facilities in a selective way, could be applied also by other approaches (e.g., in the determination of transition systems in van der Aalst et al. (2010)). On the other hand, we could easily import into our approach some abstraction facilities taken from the literature. Specifically, we strongly believe that our approach could be integrated with the one in van der Aalst et al. (2010), since (i) on one side, we share several goals and the underlying philosophy, and (ii) on the other side, we propose quite complementary techniques to achieve such goals. As a specific example, we envision the possibility of taking, as a starting point of our procedure, models derived from the different transition systems provided by van der Aalst et al. (2010) (and not just the prefix transition system; however, this would come at the price of losing the indexing capabilities of our starting model), and of adopting (in a selective and step-by-step way) also the filter and visible abstractions in van der Aalst et al. (2010).


Finally, we want to emphasize that limiting the attention to the advances (with respect to the state of the art) that we provide in each of the three tasks mentioned above would lead to a partial and limited evaluation and understanding of our approach. Indeed, the definition of a homogeneous and integrated framework supporting users in all three tasks, and exploiting the synergies between such tasks, is the major and most innovative contribution of our work, motivating most of the strategic choices we made in the design of our approach (consider, for instance, (a) the definition of an architecture in which path retrieval can be synergically used for the exploration of the process model, and as the basis for semantic abstraction facilities, and (b) the choice of defining an initial model with precision = 1 and replay-fitness = 1, easily adoptable as an index of the traces, to support efficient trace retrieval). While the initial experimental evaluation in the stroke management context has provided satisfactory results, in the future we plan to conduct a more extensive evaluation, focusing on healthcare contexts, where domain experts (i.e., physicians) typically have a "deep" domain knowledge; we thus expect that our framework, providing experts with suitable tools for analysing the model, semantic abstraction facilities to generalize it, and an interactive and step-by-step abstraction process, can provide critical advances with respect to traditional approaches.

Acknowledgement


The authors would like to thank Prof. Wil van der Aalst for his work as a reviewer of Luca Canensi's Ph.D. thesis, which is the basis of this work, and for his constructive comments on the preliminary version of this paper. We are also very grateful to the physicians of the Stroke Unit Network of Regione Lombardia, Italy, for their cooperation in the experimental evaluation.


References

van der Aalst, W. (2016). Process Mining. Data Science in Action. Springer.

van der Aalst, W., & van Dongen, B. (2002). Discovering workflow performance models from timed logs. In Y. Han, S. Tai, & D. Wikarski (Eds.), Proc. International Conference on Engineering and Deployment of Cooperative Information Systems (EDCIS 2002) (pp. 45–63). Springer, Berlin, volume 2480 of Lecture Notes in Computer Science.


van der Aalst, W., Rubin, V., Verbeek, H., van Dongen, B. F., Kindler, E., & Günther, C. W. (2010). Process mining: a two-step approach to balance between underfitting and overfitting. Software and Systems Modeling, 9, 87–111.


Aamodt, A., & Plaza, E. (1994). Case-based reasoning: foundational issues, methodological variations and system approaches. AI Communications, 7, 39–59.

Bottrighi, A., Canensi, L., Leonardi, G., Montani, S., & Terenziani, P. (2016). Trace retrieval for business process operational support. Expert Syst. Appl., 55, 212–221.


Bratosin, C., Sidorova, N., & van der Aalst, W. M. P. (2010). Discovering process models with genetic algorithms using sampling. In R. Setchi, I. Jordanov, R. J. Howlett, & L. C. Jain (Eds.), Knowledge-Based and Intelligent Information and Engineering Systems - 14th International Conference, KES 2010, Cardiff, UK, September 8-10, 2010, Proceedings, Part I (pp. 41–50). Springer volume 6276 of Lecture Notes in Computer Science.


Buijs, J., van Dongen, B., & van der Aalst, W. (2012). On the role of fitness, precision, generalization and simplicity in process discovery. In On the Move to Meaningful Internet Systems: OTM 2012 (pp. 305–322). Springer.


Cordier, A., Lefevre, M., Champin, P., Georgeon, O., & Mille, A. (2013). Trace-based reasoning - modeling interaction traces for reasoning on experiences. In C. Boonthum-Denecke, & G. M. Youngblood (Eds.), Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2013, St. Pete Beach, Florida, May 22–24, 2013. AAAI Press.

Cortadella, J., Kishinevsky, M., Lavagno, L., & Yakovlev, A. (1998). Deriving Petri nets from finite transition systems. IEEE Trans. Computers, 47, 859–882.


Dijkman, R., Dumas, M., & García-Bañuelos, R. (2009). Graph matching algorithms for business process model similarity search. In U. Dayal, J. Eder, J. Koehler, & H. Reijers (Eds.), Proc. International Conference on Business Process Management (pp. 48–63). Springer, Berlin, volume 5701 of Lecture Notes in Computer Science.

Fahland, D., & van der Aalst, W. M. P. (2012). Repairing process models to reflect reality. In A. P. Barros, A. Gal, & E. Kindler (Eds.), Business Process Management - 10th International Conference, BPM 2012, Tallinn, Estonia, September 3-6, 2012. Proceedings (pp. 229–245). Springer volume 7481 of Lecture Notes in Computer Science.


Haigh, K. Z., & Yaman, F. (2011). RECYCLE: learning looping workflows from annotated traces. ACM TIST, 2, 42:1–42:32.


Hornung, T., Koschmider, A., & Lausen, G. (2008). Recommendation based process modeling support: Method and user experience. In Q. Li, S. Spaccapietra, E. S. K. Yu, & A. Olivé (Eds.), Conceptual Modeling - ER 2008, 27th International Conference on Conceptual Modeling, Barcelona, Spain, October 20-24, 2008. Proceedings (pp. 265–278). Springer, volume 5231 of Lecture Notes in Computer Science.


Huang, Z., Juarez, J. M., Duan, H., & Li, H. (2013). Length of stay prediction for clinical treatment process using temporal similarity. Expert Syst. Appl., 40, 6330–6339.


Jin, T., Wang, J., Rosa, M. L., ter Hofstede, A., & Wen, L. (2013). Efficient querying of large process model repositories. Computers in Industry, 64, 41–49.


La Rosa, M., Dumas, M., Uba, R., & Dijkman, R. (2013). Business process model merging: An approach to business process consolidation. ACM Trans. Softw. Eng. Methodol., 22, 11.

Leemans, S., Fahland, D., & van der Aalst, W. (2013). Discovering block-structured process models from event logs containing infrequent behaviour. In N. Lohmann, M. Song, & P. Wohed (Eds.), Business Process Management Workshops - BPM International Workshops, Beijing, China, August 26, 2013, Revised Papers (pp. 66–78). Springer, volume 171 of Lecture Notes in Business Information Processing.


Leemans, S., Fahland, D., & van der Aalst, W. (2018). Scalable process discovery and conformance checking. Software and Systems Modeling, 17, 599–631.


de Leoni, M., Suriadi, S., ter Hofstede, A. H. M., & van der Aalst, W. M. P. (2016). Turning event logs into process movies: animating what has really happened. Software and Systems Modeling, 15, 707–732.


Maggi, F. M., Dumas, M., García-Bañuelos, L., & Montali, M. (2013). Discovering data-aware declarative process models from event logs. In F. Daniel, J. Wang, & B. Weber (Eds.), Business Process Management - 11th International Conference, BPM 2013, Beijing, China, August 26-30, 2013. Proceedings (pp. 81–96). Springer, volume 8094 of Lecture Notes in Computer Science.

Mair, G., & Wardlaw, J. M. (2014). Imaging of acute stroke prior to treatment: current practice and evolving techniques. The British Journal of Radiology, 87, 20140216. PMID: 24936980.


Mans, R., van der Aalst, W., Vanwersch, R., & Moleman, A. (2013). Process mining in healthcare: Data challenges when answering frequently posed questions. In R. Lenz, S. Miksch, M. Peleg, M. Reichert, D. Riaño, & A. ten Teije (Eds.), ProHealth/KR4HC (pp. 140–153). Springer, Berlin, volume 7738 of Lecture Notes in Computer Science.


Mans, R., Schonenberg, H., Leonardi, G., Panzarasa, S., Cavallini, A., Quaglini, S., et al. (2008). Process mining techniques: an application to stroke care. In S. Andersen, G. O. Klein, S. Schulz, & J. Aarts (Eds.), Proc. MIE (pp. 573–578). IOS Press, Amsterdam, volume 136 of Studies in Health Technology and Informatics.


Mans, R., Schonenberg, H., Song, M., van der Aalst, W., & Bakker, P. (2009). Application of process mining in healthcare - a case study in a Dutch hospital. In A. Fred, J. Filipe, & H. Gamboa (Eds.), Proc. Biomedical Engineering Systems and Technologies (pp. 425–438). Springer, Berlin, volume 25 of Communications in Computer and Information Science.

Montani, S., & Leonardi, G. (2014). Retrieval and clustering for supporting business process adjustment and analysis. Information Systems, 40, 128–141.


Montani, S., Leonardi, G., Quaglini, S., Cavallini, A., & Micieli, G. (2015). A knowledge-intensive approach to process similarity calculation. Expert Syst. Appl., 42, 4207–4215.


Munoz-Gama, J. (2016). Conformance Checking and Diagnosis in Process Mining - Comparing Observed and Modeled Processes, volume 270 of Lecture Notes in Business Information Processing. Springer.


Perimal-Lewis, L., Qin, S., Thompson, C., & Hakendorf, P. (2012). Gaining insight from patient journey data using a process-oriented analysis approach. In K. Butler-Henderson, & K. Gray (Eds.), Proc. Workshop on Health Informatics and Knowledge Management (HIKM) (pp. 59–66). Australian Computer Society, volume 129 of Conferences in Research and Practice in Information Technology.

Rojas, E., Munoz-Gama, J., Sepulveda, M., & Capurro, D. (2016). Process mining in healthcare: A literature review. Journal of Biomedical Informatics, 61, 224–236.


Spiotta, M., Terenziani, P., & Dupré, D. T. (2017). Temporal conformance analysis and explanation of clinical guidelines execution: An answer set programming approach. IEEE Trans. Knowl. Data Eng., 29, 2567–2580.


Thompson, K. (1968). Regular expression search algorithm. Communications of the ACM, 11, 419–422.


Weijters, A., van der Aalst, W., & de Medeiros, A. A. (2006). Process Mining with the Heuristic Miner Algorithm, WP 166. Eindhoven University of Technology, Eindhoven.

Appendix A. The semantics of the query language


In Section 5.1 we introduced the query language that allows the user to search for paths or traces, and we described its syntax. In this Appendix, we provide a description of the semantics. The core of a query is the query pattern (Qpattern). The semantics of our query patterns (see Fig. A.10) can be formally specified through the function Sem([ ]), which, for each construct of the grammar, gives its translation into the regular expression representing its meaning. In the formulae in Fig. A.10, • represents concatenation, { and } represent optionality, and we suppose that Σ = {α₁, ..., αₖ} is the set of all the possible actions. Notice that, in particular, the meaning of a set of actions in AnyOrder is the union of all the possible concatenations, each representing a possible permutation of the actions (function Perm). On the other hand, the meaning of a delay (n₁, n₂) is that there are between n₁ and n₂ actions (any of the possible actions in Σ); in other words, it represents any one of the possible concatenations of n₁, or n₁+1, ..., or n₂ actions.

In our approach, for the sake of efficiency, we adopt an optimization: we introduce a dummy symbol ∗ to represent any action in Σ, and we directly manage it as a new symbol, which matches any possible action. Technically speaking, we adopt the following mapping of delays into regular expressions:

$$Sem_{Opt}([(n_1, n_2)]) = \underbrace{(* \bullet \ldots \bullet *)}_{n_1} \cup \underbrace{(* \bullet \ldots \bullet *)}_{n_1+1} \cup \ldots \cup \underbrace{(* \bullet \ldots \bullet *)}_{n_2}$$
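As an illustration, this optimized mapping can be sketched in Python regular expressions (a hypothetical sketch, not the paper's implementation: we assume single-character action names, and use '.' as the dummy wildcard symbol ∗, so that a delay compiles to a bounded repetition, the compact form of the union of n₁, n₁+1, ..., n₂ wildcard concatenations):

```python
import re
from itertools import permutations

def delay_to_regex(n1, n2):
    # delay (n1, n2): between n1 and n2 arbitrary actions
    return ".{%d,%d}" % (n1, n2)

def any_order_to_regex(actions):
    # Sem of AnyOrder: the union of all permutations of the action set
    return "(?:" + "|".join("".join(p) for p in permutations(actions)) + ")"

# Query pattern: action A, a delay of 1-2 actions, then B and C in any order
pattern = re.compile("A" + delay_to_regex(1, 2) + any_order_to_regex("BC") + "$")
print(bool(pattern.match("AXBC")))   # one intervening action, order B, C
print(bool(pattern.match("AXYCB")))  # two intervening actions, order C, B
print(bool(pattern.match("ABC")))    # no intervening action: no match
```

The bounded repetition `.{n1,n2}` avoids materializing the explicit union of all wildcard concatenations, in the spirit of the optimization above.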


Appendix B. Generalized version


In this Appendix, we describe the generalized version of QTerm merge, which we introduced in Section 6. The generalized version of QTerm merge operates similarly to the strict version (see Section 6): it iterates over the QTerms, and fuses in the model the parts of different answers corresponding to the same QTerm. However, in this case, the answers to be fused do not need to be identical. Given a QTerm, and considering a specific answer, the algorithm distinguishes between two situations: (1) the retrieved path in the model, corresponding to the QTerm at hand, consists of a single node; two sub-cases are possible: (a) the set of actions in the QTerm is identical to the set of actions in the retrieved node (e.g., the QTerm is A&B, and the retrieved node is the any-order node A&B); (b) the set of actions in the QTerm is included in the set of actions in the retrieved node (e.g., the QTerm is A&B, and the retrieved node is the any-order node A&B&C). In both cases, the path in the process model is already identical to or more general than the query term; therefore, we do not need to create new nodes in the model;
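The case analysis can be sketched as follows (a hypothetical illustration: the data structures and names are ours, not the paper's implementation; a retrieved path is a list of nodes, each modeled as a set of actions, and the function returns None when the model already covers the QTerm, or the action set of the more abstract any-order node to be created):

```python
def any_order_node_to_create(qterm_actions, retrieved_path):
    path_actions = set().union(*retrieved_path)
    # the path answers the QTerm, so its actions include the QTerm's
    assert qterm_actions <= path_actions
    if len(retrieved_path) == 1:
        return None            # cases (1a)/(1b): keep the existing node
    return path_actions        # case (2): abstract to one any-order node

# QTerm A&B answered by the single any-order node A&B&C: nothing to create
print(any_order_node_to_create({"A", "B"}, [{"A", "B", "C"}]))
# QTerm A&B answered by the sequence A&C, B&D: abstract any-order node
print(sorted(any_order_node_to_create({"A", "B"}, [{"A", "C"}, {"B", "D"}])))
```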


Algorithm 5: GeneralizedVersion pseudo-code.
1  Let N be the number of QTerms
2  GeneralizedVersion(M, Answers)
3    nodeToSubstitute ← {}
4    for i ← 1 to N do
5      newNodes ← {}
6      foreach RES_list ∈ Answers do
7        ⟨QTerm, NodeList⟩ ← get(RES_list, i)
8        if QTerm is not delay then
9          foreach node | (node ∈ NodeList ∧ ⟨node, newN⟩ ∈ nodeToSubstitute) do
10           replace(node, newN, NodeList)
11         end
12         if length(NodeList) = 1 then
13           patt ← first(NodeList)
14           n ← getNode(newNodes, patt)
15           if n = null then
16             n ← patt
17           end
18           else
19             redirect(M, NodeList, n)
20             nodeToSubstitute ← nodeToSubstitute ∪ {⟨patt, n⟩}
21           end
22         end
23         else
24           m ← createAny(NodeList)
25           n ← getNode(newNodes, m)
26           if n = null then
27             n ← m
28             addRedirect(M, NodeList, n)
29           end
30           else
31             redirect(M, NodeList, n)
32           end
33           foreach node ∈ NodeList do
34             nodeToSubstitute ← nodeToSubstitute ∪ {⟨node, n⟩}
35           end
36         end
37         newNodes ← newNodes ∪ {n}
38       end
39     end
40   end
41   delete(M, old(nodeToSubstitute))


(2) the retrieved path in the model consists of more than one node; once again, two sub-cases are possible: (a) the set of actions in the QTerm is identical to the set of actions in the retrieved list of nodes (e.g., the QTerm is A&B, and the retrieved list of nodes is the sequence A B); (b) the set of actions in the QTerm is included in the set of actions in the retrieved list of nodes (e.g., the QTerm is A&B, and the retrieved list of nodes is the sequence A&C followed by B&D). In both cases, we create a more abstract any-order node, involving all the actions in the retrieved list of nodes.

In detail, the algorithm GeneralizedVersion (see Algorithm 5) takes as input the process model M and Answers, as in the strict version. The algorithm iterates over the number N of QTerms (lines 4-40), and, for each QTerm, it examines the different RES_list answers (lines 6-39). The i-th pair ⟨QTerm, NodeList⟩ is extracted from RES_list (line 7). As in the strict version, delays are ignored (line 8). The algorithm checks whether any node in NodeList has to be substituted by a new node (created in a previous iteration; see comments below): if this is the case, the node at hand is replaced in NodeList (lines 9-11). If NodeList consists of a single node, we manage situation (1) above (lines 12-22). The first (and unique) node in NodeList is extracted into patt (line 13), and the function getNode looks it up in newNodes, storing the result in n (line 14). If no such node exists in newNodes, n is set to patt (line 16). If the node is present in newNodes, all incoming/outgoing edges of NodeList are redirected to n (line 19), and the pair composed of the original node (patt) in NodeList and the new node n is added to the set of nodes to substitute (line 20). The substituted node is not immediately deleted from the model, because it could be part of other answers, not yet processed. If NodeList consists of more than one node, we manage situation (2) above (lines 23-36). In this situation, we create an any-order node (m) involving all the actions of the nodes in NodeList (line 24). Then the function getNode looks up m in newNodes and stores the result in n (line 25). If no such node exists in newNodes, n is set to m (line 27); n is added to the model, and all incoming/outgoing edges of NodeList are redirected to n (line 28). Otherwise, all incoming/outgoing edges of NodeList are simply redirected to n (line 31). The pairs composed of each original node in NodeList and the new node n are then added to the set of nodes to substitute (line 34).


Substituted nodes are not immediately deleted from the model, because they could be part of other answers, not yet processed. In all cases, n is added to newNodes (set union does not change newNodes if n was already present; line 37). Deletion of the nodes that have been substituted by other nodes takes place at the end of the algorithm (line 41), when all QTerms have been examined, and all edges have been properly redirected.
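The deferred-deletion bookkeeping can be sketched on a simplified model (a hypothetical illustration, not the paper's implementation: the model is reduced to an adjacency dict mapping each node name to its set of successors, and all names are ours). Substituted nodes are first recorded in a map; edges are redirected and the replacement inherits the old node's outgoing edges; the old nodes are removed only at the very end, once no pending answer can still reference them:

```python
def apply_substitutions(graph, node_to_substitute):
    # the new node inherits the successors of the node it replaces
    for old, new in node_to_substitute.items():
        graph.setdefault(new, set()).update(graph.get(old, set()))
    # redirect every edge that still points at a substituted node
    for node in graph:
        graph[node] = {node_to_substitute.get(s, s) for s in graph[node]}
    # deletion happens last (as on line 41 of Algorithm 5)
    for old in node_to_substitute:
        graph.pop(old, None)
    return graph

g = {"start": {"x"}, "x": {"end"}, "end": set()}
print(apply_substitutions(g, {"x": "X"}))  # "x" replaced by abstract node "X"
```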




Figure 8: The process model obtained after applying (A) operation OP2, (B) operation OP3, and (C) operation OP4


Figure 9: The process models obtained by the inductive miner and the heuristic miner



Figure A.10: The semantics of our query patterns
