A probabilistic approach to event log completeness


Expert Systems With Applications 80 (2017) 263–272

Contents lists available at ScienceDirect

Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa

Femi Emmanuel Ayo∗, Olusegun Folorunso, Friday Thomas Ibharalu

Department of Computer Science, Federal University of Agriculture, Abeokuta, Nigeria

Article info

Article history:
Received 18 January 2016
Revised 12 March 2017
Accepted 18 March 2017
Available online 20 March 2017

Keywords: Bayesian scoring functions; Process discovery; Fuzzy logic; Process aware information systems

Abstract

Researchers have recently observed that the major problem in mining event logs is discovering a simple, sound and complete process model. Because mining techniques can only reproduce the behaviour recorded in the log, the fitness of the reproduced model is a function of the completeness of the event log. In this paper, a Fuzzy-Genetic Mining model based on Bayesian Scoring Functions (FGM-BSF), which we call the probabilistic approach, is developed to tackle the problems that arise from incomplete event logs. The main motivation for using genetic mining for process discovery is to benefit from the global search performed by the algorithm. Incompleteness in processes is a form of uncertainty, and it is tackled using the probabilistic nature of the scoring functions of a Bayesian network, guided by a fuzzy logic value prediction. The global search performed by the genetic approach copes with a population that contains both good and bad individuals. The proposed approach therefore strengthens the fitness function of the genetic algorithm by supplying high-lift traces that represent only good individuals, which a mining model without an intelligent component would not detect. The approach was implemented on the Java platform with MySQL for event log parsing and preprocessing, while the actual discovery was done in ProM. The results showed that the proposed approach achieved 0.98% fitness when compared with existing schemes.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Records of operational processes are stored in a special repository known as an event log, which provides the means of tracking, monitoring and enhancing business operations. The main idea of an event log is to allow organizations to monitor their daily operations and to make effective decisions based on the event log attributes (Abawajy, 2015; Ouyang, Adams, Wynn, & ter Hofstede, 2015; van der Aalst & Verbeek, 2014). The importance that organizations attach to event log information has caused a shift from data aware information systems to process aware information systems (Conforti, de Leoni, La Rosa, van der Aalst, & ter Hofstede, 2015). A Process Aware Information System (PAIS) is a software system that manages and executes procedural and executable operations on the basis of process models (Cognini, Corradini, Gnesi, Polini, & Re, 2016; Görg, 2016; Ma, 2007). Hence, a PAIS assists not only in automating business operations but also in keeping records of business processes. Process mining is a method for gaining knowledge about the operational processes recorded in logs by the PAIS, and it depends on the repository that the PAIS maintains (Rosemann & vom Brocke, 2015; van der Aalst & Verbeek, 2014). In this work, the focus is on the most pressing challenge of process mining techniques, which is the inability to produce a sound and complete process model from incomplete event logs.

Process discovery, which reproduces an individual model from log traces, is one of the most challenging tasks of process mining (Leemans, Fahland, & van der Aalst, 2014b; van der Aalst, 2016). Most mining techniques have problems mining a complete process model, since the log might contain insufficient information to discover one (Leemans, Fahland, & van der Aalst, 2014a; Li et al., 2016a). Several techniques have been developed for process mining, but most of them are unable to detect process models that are complete and of high fitness. Detecting a complete model from an event log therefore needs a sound approach. Many organizations rely on human experts to manually monitor their daily operational processes for effective decision making. Most of the mining techniques that are intended to help detect these operational processes from event logs require some expert elements to overcome their challenges.

This paper proposes a fuzzy-genetic mining method based on Bayesian Scoring Functions, known as the probabilistic approach, that is robust to the problem of incompleteness in event logs prior to the application of process mining techniques. Bayesian Scoring Functions are used to preprocess the log in order to detect and fix cases of incompleteness before the genetic mining is applied. Fuzzy logic is used to decide which of the preprocessed traces will be loaded into the genetic process mining. Association rules over the activities in the processes are estimated using the probabilistic machine learning of the Bayesian network.

∗ Corresponding author.
http://dx.doi.org/10.1016/j.eswa.2017.03.039
0957-4174/© 2017 Elsevier Ltd. All rights reserved.


Compared with most theoretical studies (Buijs, Van Dongen, & van Der Aalst, 2012; Goedertier, De Weerdt, Martens, Vanthienen, & Baesens, 2011; Low, van der Aalst, ter Hofstede, Wynn, & De Weerdt, 2017; Vidal, Vázquez-Barreiros, Lama, & Mucientes, 2016), the proposed research model enhances the fitness function of genetic mining by selecting good individuals from the high-lift traces produced by the Bayesian scoring functions, based on the fuzzy logic value prediction. One of the most interesting aspects of our research model is that it does not need large amounts of training data and does not need to be retrained when new rules are added. However, the strength of the system can degrade if the rules are not well defined (Akerkar & Sajja, 2016; Dima, Antonopoulos, & Koubias, 2017; Malinowska, 2017; Sabzi, Humberson, Abudu, & King, 2016). The proposed research model also serves as an alternative to the filtering method of preprocessing event logs, because of the disadvantages of the filtering method (Weber, Bordbar, Tiňo, & Majeed, 2011). The approach is expected to produce a sound model that can replay all the information recorded in the event log.

In the remainder of this work, Section 2 explores some related work. In Section 3, we introduce the concepts of event logs, Petri nets, completeness in logs, Bayesian networks, fuzzy logic and genetic process mining. The proposed approach, implementation and algorithms are presented in Section 4. Section 5 presents the analysis and evaluation of the proposed approach, and Section 6 concludes the work.

2. Related work

In view of space limitations, only the core related work is discussed here.

Wen, Wang, van der Aalst, Huang, and Sun (2010) were motivated by the incompleteness that exists in an event log and a process model (Ghasemi & Amyot, 2016; Guo, Wen, Wang, Yan, & Philip, 2015; Outmazgin & Soffer, 2016). Wen et al. (2010) observed that it is difficult to mine an event log with infrequent behaviour, because some tasks do not appear in any, or in only some, of the traces in the log. For this purpose, they developed an algorithm for constructing invisible tasks, called the α++ algorithm, a modification of the classical α-algorithm.

van der Aalst (2012) proposed a computational intelligence technique for process mining to discover incompleteness in the logs of a PAIS. The author used the alpha algorithm (Ashoori, 2017; Cheng and Kumar, 2015; Garg & Agarwal, 2016; Guo et al., 2015; Lu, Zeng, & Duan, 2016; Porouhan, Jongsawat, & Premchaiswadi, 2014) to construct dependencies between tasks in a log, and used this ordering to construct a process model for the given event log. The process model discovered by this method is incomplete because of the presence of invisible tasks in the log and the inability to deal with parallelism between tasks.

Leemans et al. (2014b) investigated the possible impact that the attributes of an event log can have on process discovery techniques (Reguieg, Benatallah, Nezhad, & Toumani, 2015; de Murillas, van der Aalst, & Reijers, 2015; Di Ciccio et al., 2015). The authors observed that the most prevalent challenge in process mining is to discover a process model that can reproduce all the information contained in the log. Their work analyzes the impact of log incompleteness on the soundness of the discovered model. They developed a probabilistic behavioral relations algorithm that, unlike other process discovery algorithms, can rediscover models from incomplete event logs.

van der Aalst and Verbeek (2014) observed that the two most challenging tasks of process mining are reconstructing a process model from an event log and detecting the differences between the observed and the modeled behavior, as also argued in Dagliati et al. (2017), Li, Thomas, & Osei-Bryson (2016b), Ly, Maggi, Montali, Rinderle-Ma, & van der Aalst (2015), and Suriadi, Andrews, ter Hofstede, & Wynn (2017). To diagnose the relationship between event logs and discovered models efficiently, van der Aalst and Verbeek (2014) proposed a passage approach that divides the set of activities into two non-empty sets, which speeds up process discovery and makes conformance checking much easier per passage.

De Weerdt, Vanthienen, and Baesens (2013) provide a comprehensive analysis of the effects of different mining techniques on the characteristics of different event logs, using statistical tools such as ANOVA and regression analysis. A series of artificial logs was generated and mined with a wide collection of process mining techniques. The process models mined by the different techniques were subjected to conformance checking (Baier, Rogge-Solti, Mendling, & Weske, 2015; Becker, Lütjen, & Porzel, 2017; Centobelli et al., 2015; Rogge-Solti, Senderovich, Weidlich, Mendling, & Gal, 2016; Shershakov & Rubin, 2015) using several evaluation metrics, and the results were analyzed to explain the significant differences between the process models produced by the different techniques in terms of the varying characteristics of the generated event logs. It was found that the quality of the discovered process models is related to the characteristics of the given event logs.

The related work above justifies the need for this study. According to the literature, most of the problems arise because many current techniques are based on local information in the event log (van der Aalst and Verbeek, 2014). Hence, the unique contribution of this paper lies in using the probabilistic nature of a Bayesian network, guided by fuzzy rules, to enhance the fitness of genetic mining. The overall idea is to provide a new approach to event log preprocessing that resolves incompleteness in the log. The next section explains some preliminaries that form the basic concepts of this work. Table 1 gives a more detailed summary of the existing methods.

2.1. Event log, traces, Petri nets, completeness in log

Let an event log be a four-tuple $V = \{I, A, O, T\}$, where $I$ is the set of traces, $A$ is the set of activities, $O$ is the set of originators and $T$ is the set of timestamps for $A$. The set of traces {abcd, acbd, abcd} is an event log consisting of 3 traces over four activities, where $A^*$ is the set of possible execution sequences of the activities in $A$ for each trace in the log. Let $D_i^r = \{\, i^r \mid i \in I,\ r \in \mathbb{N} \,\}$ be the number of occurrences of a particular trace $i \in I$. For example, in our set of traces, $D_{abcd}^r = 2$ and $D_{acbd}^r = 1$. The dimension, symbolized by $L$, is the total number of traces contained in the log; $L = 3$ in our sample traces. Let $O = \{(u, a) \mid u \in O,\ a \in A\}$ represent the set of persons, where each person $u \in O$ performs an activity $a \in A$, and let $L(O)$ denote the total number of persons in the log. Let $T = \{(t, a) \mid t \in T,\ a \in A\}$ be the set of timestamps for the set of tasks $A$ in the log.

A Petri net $PN = (P, T, F)$ is a 3-tuple consisting of a defined set of places $P$, a defined set of transitions $T$ such that $P \cap T = \emptyset$, and a set of directed arcs $F$, called the flow relation, such that $F \subseteq (P \times T) \cup (T \times P)$. In the context of a workflow net, transitions can be interpreted as tasks or activities, and places as conditions that govern the firing of tokens between places and transitions. Fig. 1 shows an example of a Petri net discovered from the log L = [abh, ach, adefgh, adfegh].

Fig. 1. Petri net discovered based on the event log L (adapted from van der Aalst & Weijters, 2004).


Table 1. Summary of previous approaches.

Author

Method

Pros

Cons

Wen et al. (2010)

α++ algorithm

Simple and easy to understand

van der Aalst (2012)

α algorithm

a. Address discovery of concurrency.

De Weerdt et al. (2013)

comprehensive benchmarking framework (CoBeFra)

a. The mined model is not sound due to deadlocks. b. The algorithm can only work with work flow net and not event log. a. Difficulty of dealing with loop, noise and non-local dependencies. b. Unrealistic assumption of completeness a. It cannot detect process model

b. It discover the invisible in logs. a. Fault tolerant during process execution b. faster processing

Leemans et al. (2014b) Aalst and Verbeek (2014)

Divide-and-conquer Divide and Conquer

Discover sound model fast. Faster rate of processing and efficiency

Maggi et al. (2014)

Predictive Monitoring Approach (Decision Tree)

It can prevent violation by prediction.

Broucke et al. (2014)

Statistical Analysis

It provide answers to questions on the infrequent behaviour of logs

Hong et al. (2014)

Genetic- Fuzzy Mining

Efficient and faster fitness value evaluation.

Sharma and Kaur (2014)

Clustering and Ranking

Rubia (2014)

Weighted Association Rule

It improve efficiency and execution time. a. Efficient association rule generation

De Leoni et al. (2015)

Mining Rule Based Merging & Rule Suggestion method Alignment method

It can repair log

Fahland and Aalst (2015)

Alignment

Able to discover loop in a process

Schönig, Cabanillas, Jablonski, and Mendling (2016)

Resource-aware and declarative process mining

a. Increased efficiency

vanden Broucke, Caron, Lismont, Vanthienen, and Baesens (2016) Suriadi et al. (2017)

Business process analytics framework

do Valle, Santos, and de FR Loures (2017)

Extended process mining method

Claes and Poels (2014)

patterns-based approach

b. Less memory use Robust event log formation

b. Can only check process model discovered by other methods Lower fitness a. Speed and passage trade-off. b. Merging overhead c. It lack the method to deal with incompleteness in log a. Exhibit lower accuracy b. The decision tree classification is weak in handling large data sets. The results of the analysis may not be valid since synthesized log was used instead of reallife log a. Less efficient with large number of population. b. The method can only run on a master- slave architecture The problem of finding the clusters that contain relevant document. Problem of producing association rule from huge data item with least confidence. It didn’t address incompleteness in logs The need for a prior declarative process model for alignment a. Cannot handle generalization and precision. b. The problem of finding the appropriate point of alignment Inter-case dependencies cannot be discovered

b. simplified process models Provides a clear and naming scheme for business process

Assume event log completeness

Increase quality of event log and process model Improve event log limitations

No formal framework to specify existing and yet to be describe patterns No standard evaluation metrics

In this work, a Petri net denotes a workflow net that contains the initial activity A and the stop activity H, and that can replay all the traces in log L. Let L be an event log and M be a process model discovered from L. L is said to be complete if it contains no running (that is, incomplete) traces, and M is said to be complete if it can replay all the traces recorded in L. The Petri net in Fig. 1 is a complete process model, since it can replay all the traces in log L.
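The log notions of this section can be illustrated with a short Python sketch. It is not part of the original implementation (which used Java and MySQL); the function names are hypothetical, and treating a trace as "running" when it does not end in the stop activity is an assumption made here for illustration.

```python
from collections import Counter

# Sample traces used in the text: three traces over the activities a, b, c, d.
sample_traces = ["abcd", "acbd", "abcd"]

# Log used for Fig. 1, with start activity "a" and stop activity "h".
fig1_log = ["abh", "ach", "adefgh", "adfegh"]


def trace_multiplicities(traces):
    """Count how often each distinct trace occurs in the log."""
    return Counter(traces)


def activity_set(traces):
    """Collect every activity that appears in at least one trace."""
    return {activity for trace in traces for activity in trace}


def is_complete_log(traces, start, stop):
    """Assumption: a trace is complete if it begins with `start` and ends with `stop`."""
    return all(t.startswith(start) and t.endswith(stop) for t in traces)


multiplicities = trace_multiplicities(sample_traces)
log_size = len(sample_traces)
```

With the sample traces, the multiplicities are 2 for abcd and 1 for acbd, and the log size is 3, as stated above. Appending a running trace such as "ad" to the Fig. 1 log makes `is_complete_log` return `False`.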

2.2. Bayesian network

A Bayesian network is a probabilistic technique for modeling and reasoning in domains that involve uncertainty (Adnan, 2008; Constantinou, Fenton, Marsh, & Radlinski, 2016). Common approaches that use the concept of Bayesian networks include particle filters and hidden Markov models. In the context of this work, a Bayesian network is a probability distribution that can be used to predict the membership or causal dependencies of tasks in the traces of an event log.

Table 2. Event log (van der Aalst, De Medeiros, & Weijters, 2005).

| Case id | Activity id | Originator | Timestamp |
|---|---|---|---|
| Case 1 | Activity a | John | 09-3-2004:15.01 |
| Case 2 | Activity a | John | 09-3-2004:15.12 |
| Case 3 | Activity a | Sue | 09-3-2004:16.03 |
| Case 3 | Activity d | Carol | 09-3-2004:16.07 |
| Case 1 | Activity b | Mike | 09-3-2004:18.25 |
| Case 1 | Activity h | John | 10-3-2004:09.23 |
| Case 2 | Activity c | Mike | 10-3-2004:10.34 |
| Case 4 | Activity a | Sue | 10-3-2004:10.35 |
| Case 2 | Activity h | John | 10-3-2004:10.34 |
| Case 3 | Activity e | Pete | 10-3-2004:12.50 |
| Case 3 | Activity f | Carol | 11-3-2004:10.12 |
| Case 4 | Activity d | Pete | 11-3-2004:10.14 |
| Case 3 | Activity g | Sue | 11-3-2004:10.44 |
| Case 3 | Activity h | Pete | 11-3-2004:11.03 |
| Case 4 | Activity f | Sue | 11-3-2004:11.18 |
| Case 4 | Activity e | Clare | 11-3-2004:12.22 |
| Case 4 | Activity g | Mike | 11-3-2004:14.34 |
| Case 4 | Activity h | Clare | 11-3-2004:14.38 |

2.3. Fuzzy logic

Fuzzy logic has become a household concept in machine intelligence (Suganthi et al., 2015; Zadeh, 2015). Fuzzy logic is normally used to express partial degrees of truth between false and true [0 or 1]. Ordinary logic deals only with false or true, represented by 0 and 1 respectively, but there are situations in which the outcome of an event must fall between 0 and 1, which makes it difficult for ordinary logic to capture these overlapping points. Hence the addition of the word "fuzzy" to "logic". Fuzzy logic was first proposed in a paper by Zadeh (1965) of the University of California at Berkeley. Zadeh (1975) elaborated on this earlier work in a paper that introduced the concept of linguistic variables, which most academics and researchers now refer to as fuzzy sets. The linguistic variables in this work use the triangular membership fuzzification method shown in Eq. (1).

$$
\mu_A(x; [a, b, c]) =
\begin{cases}
0, & \text{if } x = a \\
\dfrac{x - a}{c - a}, & \text{if } x \in [a, c] \\
\dfrac{b - x}{c - b}, & \text{if } x \in [b, c] \\
0, & \text{if } x \ge c
\end{cases}
\tag{1}
$$
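For readers who want to experiment with triangular membership, the following Python sketch implements the common textbook form of the function, which rises linearly from a to the peak b and falls linearly from b to c. This is an assumption made for illustration: it is the standard definition rather than a line-by-line transcription of Eq. (1), and the function name and the sample parameters are hypothetical.

```python
def triangular(x, a, b, c):
    """Textbook triangular membership: 0 outside (a, c), 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        # Rising edge between the left foot a and the peak b.
        return (x - a) / (b - a)
    # Falling edge between the peak b and the right foot c.
    return (c - x) / (c - b)


# Illustrative parameters only: feet at 0.1 and 0.3, peak at 0.2.
peak_membership = triangular(0.2, 0.1, 0.2, 0.3)
half_membership = triangular(0.15, 0.1, 0.2, 0.3)
```

The membership degree is 1 at the peak, falls to 0 at both feet, and is about 0.5 halfway up either edge.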

2.4. Genetic process mining

Genetic process mining is a mining technique that uses the concept of the genetic algorithm to solve most of the problems of other mining techniques. Genetic algorithms were first proposed by Goldberg & Holland (1988) and have been successfully applied to optimization, machine learning, neural networks, fuzzy logic controllers, and so on (Alcalá, Gacto, Herrera, & Alcalá-Fdez, 2007; Fernandez, Lopez, del Jesus, & Herrera, 2015; Gautam & Goyal, 2010). The algorithm starts with an initial population of process models. Every individual process model is assigned a fitness value that indicates its quality in terms of completeness. In the context of this work, an individual is a possible process model, and the fitness is a function that measures the degree of completeness of an individual process model. When genetic algorithms are used to mine process models, there are three main steps:

(i) Define the internal representation. The internal representation defines the search space of the genetic algorithm.
(ii) Fitness measure. This evaluates the quality of a process model in the search space against the event log. The fitness value of an individual process model that parses all the traces in the event log is normally 1.
(iii) Genetic operators (crossover and mutation). All individuals in the search space defined by the internal representation should be reachable when the genetic algorithm runs.

To apply a genetic algorithm to process mining, one needs to generate individual process models (Salcedo-Sanz, 2016). The usual method of representing a process model by a Petri net does not work with genetic process mining. The main reason is that Petri nets contain places whose occurrences are not recorded in the log (van der Aalst & Weijters, 2004). This makes it cumbersome to generate an initial population, to define the genetic operators (crossover and mutation), and to describe combinations of AND/OR-splits/joins (van der Aalst & Weijters, 2004). Therefore, an internal representation called the causal matrix is defined, which includes only the relations between transitions and no additional places. Table 2 shows an example of an event log consisting of 4 cases and 8 activities {A, B, C, D, E, F, G, H}. Tables 3 and 4 show two randomly generated individuals from the event log, represented as causal matrices of potential process models.

Table 3. Individual process model 1.

| ACTIVITY | INPUT | OUTPUT |
|---|---|---|
| A | {} | {{B, C, D}} |
| B | {{A}} | {{H}} |
| C | {{A}} | {{H}} |
| D | {{A}} | {{E}} |
| E | {{D}} | {{G}} |
| F | {} | {{G}} |
| G | {{E}, {F}} | {{H}} |
| H | {{C, B, G}} | {} |

Table 4. Individual process model 2.

| ACTIVITY | INPUT | OUTPUT |
|---|---|---|
| A | {} | {{B, C, D}} |
| B | {{A}} | {{H}} |
| C | {{A}} | {{H}} |
| D | {{A}} | {{E, F}} |
| E | {{D}} | {{G}} |
| F | {{D}} | {{G}} |
| G | {{E}, {F}} | {{H}} |
| H | {{C}, {B}, {G}} | {} |

Given a log, all causal-matrix individuals in any population of the genetic algorithm have the same set of activities A. This set contains the tasks that appear in the event log. Nevertheless, the causality relation C and the I (input) and O (output) condition functions may differ between individuals in the population.

3. The dataset, probabilistic approach architecture, implementation and algorithms

3.1. Dataset

The proposed model was evaluated using a real-life log of 1104 cases, which is a modified version of a dataset created by van der Aalst (2011). The log contains 104 running cases and 1000 completed cases, representing incomplete and complete process instances respectively. The log also contains 1 start event, "Register", and 1 end event, "Archive Repair". The log is based on the process of repairing telephones in a company.

3.2. The probabilistic approach architecture

In this section, the proposed approach to event log completeness and process discovery is presented. The proposed approach uses Fuzzy Genetic Mining based on Bayesian Scoring Functions (FGM-BSF). The proposed architecture is shown in Fig. 2.

Fig. 2. Probabilistic approach for process discovery.

The preprocessing module is written in Java with MySQL and computes the association rules of all the activities in the event log. These association rules show the dependency relations between the tasks in the log. The Bayesian Scoring Functions are then used to predict the probability that a task always follows a given task in the log. After the prediction, the CompleteEventLog algorithm introduced by this work is used to fix the incomplete traces, based on the values of the scoring functions. The processing module then uses fuzzy logic to obtain the minimum lift value for the complete traces that will be used for process discovery (performed by the genetic mining method in ProM) in the evaluation module. The basic modules in the architecture are described below.

3.3. The Bayesian network

The Bayesian network was constructed from the association rules in the log, and the scoring functions for each association rule were calculated with Eqs. (2), (3) and (4) below. This serves as a preprocessing technique and also predicts the dependency relations between tasks in the log, in order to fix the problem of incompleteness in the event log. If the support of an association rule, say A⇒B, is more than 0.5, then the completeness algorithm described later in this section runs through the rules in the log to see whether any association rule with predecessor A occurs without the successor B. The activity B is inserted into any trace containing the predecessor A until all the traces are complete. The scoring functions are defined below.

Definition 1. Degree of support
The degree of support, denoted Supp(A∪B), of an association rule A⇒B is the percentage of traces that contain all the items present in the association rule. Mathematically, we can define Supp(A∪B) as:

$$
\mathrm{Supp}(A, B) = S(A \cup B) = \frac{x}{n}; \quad A \cap B = \emptyset
\tag{2}
$$

where $x$ is the number of traces having A and B, and $n$ is the total number of log traces.

Definition 2. Degree of confidence
The confidence of an association rule A⇒B is a fraction that gives the frequency of B among all instances in which A is present. Mathematically, we can define Conf(A, B) as:

$$
\mathrm{Conf}(A, B) = \frac{S(A, B)}{S(A)}
\tag{3}
$$

Definition 3. Lift
The lift value of an association rule A⇒B is the ratio of the rule's confidence to the support of B. It can be given mathematically as:

$$
\mathrm{Lift} = \frac{\mathrm{Confidence}}{S(B)}
\tag{4}
$$
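Eqs. (2) to (4) can be sketched in a few lines of Python. This is an illustration only, not the authors' Java implementation; the function names and the toy log are hypothetical, and a trace is treated simply as the set of activities it contains.

```python
def support(traces, items):
    """Eq. (2): fraction of traces that contain every activity in `items`."""
    wanted = set(items)
    return sum(wanted <= set(trace) for trace in traces) / len(traces)


def confidence(traces, a, b):
    """Eq. (3): S(A, B) / S(A). Assumes A occurs in at least one trace."""
    return support(traces, set(a) | set(b)) / support(traces, a)


def lift(traces, a, b):
    """Eq. (4): confidence of A => B divided by the support of B."""
    return confidence(traces, a, b) / support(traces, b)


# Hypothetical toy log of four traces, used only to exercise the formulas.
toy_log = ["ab", "ab", "ac", "b"]

supp_ab = support(toy_log, "ab")        # 2 of 4 traces contain both a and b
conf_ab = confidence(toy_log, "a", "b")  # 0.5 / 0.75
lift_ab = lift(toy_log, "a", "b")        # (0.5 / 0.75) / 0.75
```

In this toy log the rule a⇒b has support 0.5, confidence of about 0.667 and lift of about 0.889. A lift below 1 means that b is slightly less frequent after a than it is in the log overall.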

3.4. Fuzzification

Fuzzy logic was then used for the fuzzification of the scoring functions from all the association rules of the complete event log being constructed. The purpose of the fuzzy logic is to obtain the minimum threshold for the lift value, which represents the user-defined value for all the complete traces that take part in the process discovery technique. This approach helps to select the best traces for the discovery technique and also enhances the fitness calculation of the genetic algorithm by reducing the search space from both fit and less fit individuals to only the more fit individuals. Our fuzzy logic uses the triangular membership function with {support, confidence, lift} as the membership set. Triangular membership was used because the membership set has 3 members. The degrees of the linguistic variables are {low, average, high, very_high}. The fuzzy range of values for the fuzzification process is shown in Table 5 and computed by Eq. (5). The fuzzification process is represented in Table 6.

$$
\text{Fuzzy value} = \frac{\sum x_i}{n}, \quad i = \{1, \ldots, n\}
\tag{5}
$$

where $x_i$ is the individual weight factor for our linguistic variables and $n$ is the total number of weight factors.

Table 5. Fuzzy range of values.

| Linguistic variable | Fuzzy range of value |
|---|---|
| Low | 0.1 ≤ x < 0.3 |
| Average | 0.3 ≤ x < 0.6 |
| High | 0.6 ≤ x < 0.8 |
| Very_high | 0.8 ≤ x ≤ 1.0 |

Fig. 3. Progress of building generation.

Fig. 4. FGM-BSF and GM process model in ProM framework.

3.5. Knowledge base

After the fuzzification, the rules for the knowledge base of our fuzzy logic were defined using the knowledge of experts in the domain. A total of 16 fuzzy rules were defined for this work by rule of thumb: since there are 4 linguistic variables, we have $2^4 = 16$ rules. The rules are shown in Table 7. Our fuzzy logic uses the AND function to evaluate its rules by taking the minimum values. The fuzzy rules, in the form of IF-THEN rules, are used to set up the target output of the fuzzy system. For example, rule 1 in Table 7 can be interpreted as: IF degree of Support is low AND degree of Confidence is low THEN Lift is low. Similarly, rule 2 can be interpreted as: IF degree of Support is low AND degree of Confidence is average THEN Lift is low, and so on.

3.6. Inference engine

The fuzzy inference method used for this work is the Root Sum Square (RSS). The RSS formula is given by Eq. (6).

$$
\mathrm{RSS} = \sqrt{\sum R^2} = \sqrt{R_1^2 + R_2^2 + R_3^2 + \cdots + R_n^2}
\tag{6}
$$

Here $R_1^2 + R_2^2 + R_3^2 + \cdots + R_n^2$ are the values of the different rules with the same conclusion in the fuzzy rule base of Table 7. RSS combines the results of all the firing rules that share a conclusion and computes the center of the aggregate area.

3.7. Defuzzification

The defuzzification method used is the Centre of Gravity (CoG), given by Eq. (7). The CoG method was adopted in this work because of its simplicity and accuracy.

$$
\mathrm{CoG}(Y^*) = \frac{\sum \mu_y(x_i)\, x_i}{\sum \mu_y(x_i)}
\tag{7}
$$

Defuzzification is the transformation from fuzzy values to a crisp value that is easier for humans to interpret. The value from our defuzzification is used to decide which complete traces are passed to the process discovery stage. The values of the different rules with the same conclusion in the fuzzy rule base of Table 7 are computed below.

Low:
$$\sqrt{R_1^2 + R_2^2 + R_3^2 + R_5^2} = \sqrt{0.25^2 + 0.25^2 + 0.25^2 + 0.25^2} = \sqrt{0.25} = 0.5$$

Average:
$$\sqrt{R_4^2 + R_6^2 + R_9^2 + R_{13}^2 + R_{14}^2} = \sqrt{0.25^2 + 0.5^2 + 0.25^2 + 0.25^2 + 0.5^2} = \sqrt{0.6875} = 0.83$$

High:
$$\sqrt{R_7^2 + R_8^2 + R_{10}^2 + R_{11}^2} = \sqrt{0.5^2 + 0.5^2 + 0.5^2 + 0.75^2} = \sqrt{1.3125} = 1.15$$

Very_high:
$$\sqrt{R_{12}^2 + R_{15}^2 + R_{16}^2} = \sqrt{0.75^2 + 0.75^2 + 0.9^2} = \sqrt{1.935} = 1.39$$

$$
\mathrm{CoG}(Y^*) = \frac{\sum \mu_y(x_i)\, x_i}{\sum \mu_y(x_i)}
= \frac{0.5 \times 0.2 + 0.83 \times 0.45 + 1.15 \times 0.7 + 1.39 \times 0.9}{0.5 + 0.83 + 1.15 + 1.39}
= \frac{2.5295}{3.87} = 0.654
$$

Hence, all the preprocessed traces whose lift value is equal to or greater than 0.6 are output to the genetic process mining for process discovery.

3.8. Implementation and the highlift traces algorithm

The Java programming language and MySQL were used to parse the dataset into an event log. The association rules from this event log were computed with the Java program, and the Bayesian scoring functions were computed for all these association rules with Eqs. (2), (3) and (4). Each association rule whose degree of confidence is greater than 0.5 is compared with all the other rules to determine whether any of them has the predecessor of that rule without its successor. If one is found, the successor of that rule is inserted into the trace of every other rule that has the predecessor of that rule. The result is a set of complete traces, with incompleteness fixed where necessary, together with their degree of support, degree of confidence and lift, as shown in Table 8. The traces with high lift values are then the traces that enter the genetic process mining, based on the fuzzy logic output, for the discovery of a sound process model. In this work, our fuzzy logic output predicted a lift value equal to or greater than 0.6, as indicated in the defuzzification process.

Because of space limitations, only the first 10 traces in the association rules are displayed in Table 8. In the table, 0 represents no predecessor, 1 represents Register, 2 represents Analyze Defect, 3 represents Repair (Complex), 4 represents Test Repair, 5 represents Inform users, 6 represents Archive Repair, 7 represents Repair (Simple) and 8 represents Restart Repair. For example, the second association rule is interpreted as Register being the predecessor of Analyze Defect.
Similarly, in the third association rule, Register + Analyze Defect is the predecessor of Repair (Complex), and so on. The traces vary in length; for example, the last trace in the log has the association rule 0 -> 1 -> 2 -> 7 -> 4 -> 5 -> 8 -> 3 -> 6, which is one of the longest traces in the complete event log. Table 9 shows the traces with high lift values from Table 8, based on the fuzzy logic value prediction. The CompleteEventLog and high-lift traces algorithms are shown in Algorithms 1 and 2, respectively.

4. Analysis and evaluation of the probabilistic approach

The complete traces with high lift values (the output of Algorithm 2) were used as the input to the “genetic process mining plug-in” in the ProM framework. The result of our mining called


F.E. Ayo et al. / Expert Systems With Applications 80 (2017) 263–272

Table 6
Fuzzification process view.

| Linguistic variables | 0, if x = a | (x − a)/(c − a), if x ∈ [a, c] | (b − x)/(c − b), if x ∈ [b, c] | 0, if x ≥ c |
|---|---|---|---|---|
| Low | 0, if x = 0.1 | (x − 0.1)/0.2, if x ∈ [0.1, 0.3] | (0.2 − x)/0.1, if x ∈ [0.2, 0.3] | 0, if x ≥ 0.3 |
| Average | 0, if x = 0.3 | (x − 0.3)/0.3, if x ∈ [0.3, 0.6] | (0.45 − x)/0.15, if x ∈ [0.45, 0.6] | 0, if x ≥ 0.6 |
| High | 0, if x = 0.6 | (x − 0.6)/0.2, if x ∈ [0.6, 0.8] | (0.7 − x)/0.1, if x ∈ [0.7, 0.8] | 0, if x ≥ 0.8 |
| Very_high | 0, if x = 0.8 | (x − 0.8)/0.2, if x ∈ [0.8, 1.0] | (0.9 − x)/0.1, if x ∈ [0.9, 1.0] | 0, if x ≥ 1.0 |

Table 7
Fuzzy rule base.

| Rule no | Degree of supp. | Degree of conf. | Lift (Conclude) | Non zero min no |
|---|---|---|---|---|
| 1 | 0.25 | 0.25 | Low | 0.25 |
| 2 | 0.25 | 0.5 | Low | 0.25 |
| 3 | 0.25 | 0.75 | Low | 0.25 |
| 4 | 0.25 | 0.9 | Average | 0.25 |
| 5 | 0.5 | 0.25 | Low | 0.25 |
| 6 | 0.5 | 0.5 | Average | 0.5 |
| 7 | 0.5 | 0.75 | High | 0.5 |
| 8 | 0.5 | 0.9 | High | 0.5 |
| 10 | 0.75 | 0.5 | High | 0.5 |
| 11 | 0.75 | 0.75 | High | 0.75 |
| 12 | 0.75 | 0.9 | Very_high | 0.75 |
| 13 | 0.9 | 0.25 | Average | 0.25 |
| 14 | 0.9 | 0.5 | Average | 0.5 |
| 15 | 0.9 | 0.75 | Very_high | 0.75 |
| 16 | 0.9 | 0.9 | Very_high | 0.9 |
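The aggregation and defuzzification arithmetic can be checked with a short script. The following is a minimal Python sketch, not the authors' Java implementation. It assumes that the "Non zero min no" column of Table 7 is the minimum of the degree of support and the degree of confidence (which agrees with every row listed), that the strengths of rules sharing a conclusion are combined by root-sum-square, and that 0.2, 0.45, 0.7 and 0.9 are the centres used in the CoG computation. The function names are hypothetical.

```python
import math


def firing_strength(support, confidence):
    """Assumed reading of the 'Non zero min no' column: min of the two inputs."""
    return min(support, confidence)


def root_sum_square(strengths):
    """Combine the strengths of all rules that share one conclusion."""
    return math.sqrt(sum(s * s for s in strengths))


def centre_of_gravity(strengths, centres):
    """Weighted average of the output centres, weighted by aggregated strength."""
    numerator = sum(s * c for s, c in zip(strengths, centres))
    return numerator / sum(strengths)


# Rules 12, 15 and 16 of Table 7 conclude Very_high: (support, confidence).
very_high_rules = [(0.75, 0.9), (0.9, 0.75), (0.9, 0.9)]
very_high = root_sum_square(firing_strength(s, c) for s, c in very_high_rules)

# Aggregated strengths (Low, Average, High, Very_high) and centres as used
# in the CoG computation of the text.
strengths = [0.5, 0.83, 1.15, 1.39]
centres = [0.2, 0.45, 0.7, 0.9]
cog = centre_of_gravity(strengths, centres)

print(round(very_high, 2), round(cog, 3))
```

Rounded to the precision used in the text, the sketch gives 1.39 for the Very_high strength and 0.654 for the centre of gravity.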

Table 8
Scoring functions for the first 10 traces in the association rules.

| Association rules | Deg Of Supp | Deg Of Conf | Lift value |
|---|---|---|---|
| 0 -> 1 | 1.0 | 1.0 | 1.0 |
| 0 -> 1 -> 2 | 1.0 | 1.0 | 1.0 |
| 0 -> 1 -> 2 -> 3 | 0.478 | 0.478 | 0.737 |
| 0 -> 1 -> 2 -> 3 -> 4 | 0.215 | 0.45 | 0.45 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 | 0.2 | 0.93 | 0.93 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 | 0.188 | 0.94 | 0.944 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 8 | 0.012 | 0.06 | 0.226 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 8 | 0.012 | 1.0 | 1.004 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 8 -> 6 | 0.014 | 0.065 | 0.245 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 8 -> 5 | 0.014 | 1.0 | 1.0 |
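As a sanity check on the confidence column of Table 8, the sketch below uses the textbook definition conf(A -> B) = supp(A -> B) / supp(A). This is an assumption: Eqs. (2), (3) and (4) are not reproduced in this section, and the lift column is not recomputed here. It also assumes that each of the first six rules extends the previous rule by exactly one task, so the previous rule's support is the antecedent support.

```python
# Degree of support and degree of confidence for the first six rules of Table 8.
supports = [1.0, 1.0, 0.478, 0.215, 0.2, 0.188]
reported_confidence = [1.0, 1.0, 0.478, 0.45, 0.93, 0.94]


def confidence(rule_support, antecedent_support):
    """Textbook confidence of a rule (assumed, not taken from Eqs. (2)-(4))."""
    return rule_support / antecedent_support


# Rule i is assumed to have rule i-1 as its antecedent (one task added each time).
computed = [
    confidence(supports[i], supports[i - 1]) for i in range(1, len(supports))
]

for value, reported in zip(computed, reported_confidence[1:]):
    print(f"computed {value:.3f}  reported {reported}")
```

Under these assumptions the computed values agree with the reported ones to two decimal places, for example 0.215 / 0.478 ≈ 0.45 for the fourth rule.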

Table 9
Association rules of complete traces with high lift values.

| Association rules | Lift values |
|---|---|
| 0 -> 1 | 1.0 |
| 0 -> 1 -> 2 | 1.0 |
| 0 -> 1 -> 2 -> 3 | 0.737 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 | 0.93 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 | 0.944 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 | 1.004 |
| 0 -> 1 -> 2 -> 3 -> 4 -> 8 -> 5 | 1.0 |

Algorithm 2 HighLift traces.
Input: CompleteEventLog
Output: Traces with high lift value: HighLiftLog
Process:
1. Input CompleteEventLog
2. for i = CompleteEventLog association rule length
3.    compute scoring function
4.    If (lift value >= 0.6)
5.       Add CompleteEventLog[i] to HighLiftLog
6. Return HighLiftLog
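The filtering step of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' Java code: the dictionary layout and the function name `high_lift_traces` are hypothetical, and the lift values are taken as already computed (the ten values listed in Table 8) instead of being recomputed from the scoring functions.

```python
LIFT_THRESHOLD = 0.6  # value predicted by the fuzzy logic output


def high_lift_traces(complete_event_log, threshold=LIFT_THRESHOLD):
    """Keep only the entries whose lift value is at least the threshold."""
    return [entry for entry in complete_event_log if entry["lift"] >= threshold]


# Lift values of the first 10 association rules, in the order listed in Table 8.
table8_lifts = [1.0, 1.0, 0.737, 0.45, 0.93, 0.944, 0.226, 1.004, 0.245, 1.0]
complete_event_log = [
    {"rule_no": number, "lift": lift}
    for number, lift in enumerate(table8_lifts, start=1)
]

high_lift_log = high_lift_traces(complete_event_log)
print([entry["rule_no"] for entry in high_lift_log])
print([entry["lift"] for entry in high_lift_log])
```

Seven of the ten rules pass the threshold, and their lift values are the seven values listed in Table 9.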

FGM-BSF model using our proposed approach was compared with the result of the Genetic Mining (GM) approach in the ProM framework. The comparison showed that the proposed FGM-BSF approach discovered its model in an elapsed time of 3 min 58 s with a proper completion value of 1.0, whereas the GM approach discovered its model in 13 min 38 s with a proper completion of 0.50. Comparison with another process mining technique, Heuristics Miner, also confirmed that our approach is better and sound. Heuristics Miner was chosen because, like genetic process mining, it can deal with noise and loops. Overall, the results indicate that the proposed probabilistic approach using the FGM-BSF method is better than the other approaches.

Algorithm 1 CompleteEventLog.
Input: Event Log
Output: CompleteEventLog
Process:
1. Input Event log
2. Number of Association Rule N
3. for i = CasesAtLog length
4.    for j = ActivitiesAtLog length
5.       Compute Association rule of the form A, A->B, AB->C, ABC->D etc.
6. for each association rule v
7.    Compute Scoring function
8.    if Deg Of Conf >= 0.5
9.       for k = 1 to N
10.         if (predecessor(v) exists without successor(v))
11.            insert successor(v) at i
12. Return CompleteEventLog
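The insertion step of Algorithm 1 (steps 8 to 11) can be illustrated with a small Python sketch. It is one possible reading of the pseudocode, not the authors' implementation. It assumes that a rule is a triple (predecessor sequence, successor task, degree of confidence), that a trace "has the predecessor" when it starts with that sequence, and that the successor is inserted directly after the predecessor. The rules, confidence values and traces below are hypothetical examples that reuse the task codes from the text (1 = Register, 2 = Analyze Defect, 3 = Repair (Complex), 4 = Test Repair, 5 = Inform users, 6 = Archive Repair, 8 = Restart Repair).

```python
def complete_event_log(traces, rules, min_confidence=0.5):
    """Return copies of the traces with missing successors inserted."""
    completed = []
    for trace in traces:
        trace = list(trace)  # work on a copy so the input log is unchanged
        for predecessor, successor, confidence in rules:
            if confidence < min_confidence:
                continue  # only rules with Deg Of Conf >= 0.5 are used
            n = len(predecessor)
            starts_with_predecessor = trace[:n] == list(predecessor)
            if starts_with_predecessor and successor not in trace:
                # Assumed position: directly after the predecessor sequence.
                trace.insert(n, successor)
        completed.append(trace)
    return completed


# Hypothetical rules: (predecessor sequence, successor, degree of confidence).
rules = [
    ((1, 2, 3), 4, 0.9),         # confident rule, used for completion
    ((1, 2, 3, 4, 5), 8, 0.06),  # low-confidence rule, ignored
]
# Hypothetical traces; the first one is missing Test Repair (4).
traces = [
    [1, 2, 3, 5, 6],
    [1, 2, 3, 4, 5, 6],
]

completed = complete_event_log(traces, rules)
print(completed)
```

In this example the first trace gains task 4 after the sequence 1, 2, 3, while the second trace is left unchanged because it already contains the successor.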


Table 10
Evaluation results.

| Process model | Bp | Br | Dp | Dr | Proper completion | Time elapsed (min:s) | Extra behavioral punishment |
|---|---|---|---|---|---|---|---|
| FGM-BSF | 1.0 | 1.0 | 1.0 | 1.0 | 1.00 | 03:58 | 0.98 |
| GM | 1.0 | 1.0 | 1.0 | 1.0 | 0.50 | 13:38 | 0.97 |
| Heuristic miner | 0.91 | 0.91 | 1.0 | 1.0 | 0.00 | 00:00 | 0.65 |

It should be noted that the mined model is as precise as the original model whenever Br and Bp are both equal to 1. Br indicates the percentage of correctly parsed positive events in the log, and Bp indicates the ability of the mined model to disallow unseen events. Dp and Dr measure the ability of a mined model to avoid redundancy, with values close or equal to 1 being preferable. Table 10 shows that FGM-BSF and GM both detect a precise model from the event log, whereas Heuristic Miner does not. Similarly, the proper completion metric shows that FGM-BSF has the best percentage of traces for which a fitness value of 1 can be obtained, compared to the other schemes. This supports the claim that FGM-BSF yields a more complete model in terms of fitness. Moreover, a fitness of 0.98 based on extra behavioral punishment indicates that FGM-BSF provides the best process model compared to the other schemes.

Table 10 shows the behavioral precision (Bp), behavioral recall (Br), duplicate precision (Dp), duplicate recall (Dr), proper completion, time elapsed and extra behavioral punishment of the discovered process model for both our approach and the other schemes. Due to space limitations, we show only the parameters used for building the generations in the “genetic algorithm plug-in” in ProM (Fig. 3); the Petri nets for our discovered model, called FGM-BSF, and for the GM model are shown in Fig. 4. It is evident from the two Petri nets that our proposed approach discovered a simpler model than the GM process model, and simplicity is one of the evaluation criteria of process discovery.

5. Conclusions and future work

In this work, we have outlined a model that enhances the detection of a precise and complete process model from a real-life event log. We investigated the mining of a complete process model and the selection of highlift traces (fittest individuals) for genetic mining, based on fuzzy logic value prediction on the Bayesian scoring functions. The experimental results showed that our mining model is precise and capable of detecting, from event logs, a complete process model that is not detected by methods without an expert and intelligent system. The approach also serves as an alternative to the filtering method of preprocessing event logs. The proposed research model enhances the fitness function of genetic mining through the selection of good individuals from the highlift traces produced by the Bayesian scoring functions and based on the fuzzy logic value prediction. One of the most interesting parts of our research model is that it does not need large amounts of data to train, and the system does not need to be retrained when new rules are added. However, the strength of the system can be degraded if the rules are not well-defined. This work could be extended in several ways.
First, multiobjective approaches can be applied to the simultaneous task of tuning the membership functions and selecting appropriate rules for predicting the traces that take part in the genetic mining. Secondly, neural networks could be hybridized with evolutionary computation to learn the task of trace selection based on the Bayesian scoring functions. Finally, the approach could be formulated to handle the distributed computation involved in merging two or more event logs.