Behavioral model identification and classification of multi-component systems

Science of Computer Programming 177 (2019) 41–66
Zeynab Sabahi-Kaviani, Fatemeh Ghassemi *
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Islamic Republic of Iran

Article history: Received 14 February 2018; Received in revised form 21 November 2018; Accepted 8 March 2019; Available online 22 March 2019
Keywords: Passive automata learning; State-merge condition; Behavioral equivalence relation; Traffic classification; Runtime verification


Abstract

Learning models from execution traces has been considered in several domains, such as traffic classification, malware detection, and software engineering, since traces reveal which processes actually executed. Behavioral model learning, and classification in terms of the learned models, are used to overcome the shortcomings of models derived from non-behavioral features and to improve the resulting classifications. So far, no general method has been proposed to derive such behavioral models automatically. To this aim, we assume that the model of an application can be defined abstractly in terms of how it executes the components it depends on, which are well known in the domain. To derive such models automatically, we extend passive automata learning by considering the behavior of the depending components in addition to the observed behaviors. The state-merging algorithm of the learning process is equipped with a new equivalence relation which aggregates states modulo the counter abstraction of symmetry reduction. To improve the generated model so that it covers unobserved behaviors, we leverage complex event processing to complete the model with unseen interleavings of actions caused by the concurrent execution of components. The derived models, specified as parametrized transition systems, can distinguish different executions of the instances of each component by assigning a unique symbolic identifier to each instantiation and parameterizing actions with these identifiers. The learned models are used to distinguish the executions of applications within an interleaved execution trace of different systems. The detection procedure is more complicated for parametric models because the information in the trace must be related to the symbolic identifiers that parameterize the models. We utilize runtime verification techniques in a novel three-step approach to enhance the performance of the matching process for a trace. To illustrate the applicability of our approach, we have employed it for traffic classification in the network domain and applied it to some real applications. To demonstrate its effectiveness in this domain, we compare it to related approaches in terms of true positive rate, false positive rate, and test time. Our results indicate that our technique prevents the inclusion of invalid traces while covering unobserved behaviors with acceptable precision.
© 2019 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (Z. Sabahi-Kaviani), [email protected] (F. Ghassemi).

https://doi.org/10.1016/j.scico.2019.03.003 0167-6423/© 2019 Elsevier B.V. All rights reserved.


Fig. 1. Java code for a ticket agency system and its components.

1. Introduction

Learning models from execution traces has been considered in several domains, such as traffic classification, malware detection, and software engineering, since traces reveal which processes actually executed. Due to the use of random ports and encrypted traffic by network applications and the emergence of code obfuscation in malware, pattern-matching and statistical approaches are no longer sufficient for classification. Furthermore, when ready-to-use components, libraries, or services are used, or documentation is imperfect, a complete software specification may not be available from which to derive a model of the system and benefit from model-based approaches to software testing or analysis. Behavioral model learning, and classification in terms of the learned models, are therefore taken into account [1–6]. Some works rely on specific characteristics of an application's behavior by which its traces are identified; detecting such behavioral characteristics is largely application-dependent. For instance, the topology of connection establishment has been used as a distinguishing metric to identify P2P-TV traffic [2] in the network domain. Hence, no automated general method based on the behavior of applications has been proposed.

To derive behavioral models automatically, we present a passive automata learning technique that infers abstract behavioral models of applications from their observed behaviors. Our method is general because it does not rely on any pre-assumption about behavioral characteristics of systems that reflect the category of the domain to which they belong. We consider a system as the composition of basic components with known behaviors. For instance, Fig. 1 shows the Java code of a ticket agency system, relying on three component classes TicketReservation, User, and AccountTransactions, responsible for ticket reservation, user registration, and payment processing, respectively. The application invokes the instances of its components in a specific order depending on the values of its variables. First, an instance of TicketReservation and an instance of AccountTransactions are executed, which let a user start either reserving a ticket or topping up an account, in any order. However, a selected ticket is not reserved (by calling ticketReserved) unless the account has previously obtained enough credit (signaled by AccountTransactions). An instance of User is also executed if the user is new. Another program may invoke instances of the above-mentioned classes in a different way. In other words, the abstract models of various applications differ in how they invoke their component instances, and hence the behaviors of different programs, captured as execution traces, differ from each other.

We consider the execution trace of an application as the sequence of its function calls. Since each component may be instantiated more than once, we symbolically distinguish the outputs of each instance from the others by a number, using their object identifiers in this domain. For instance, selectTicket(1) requestNewTrans(2) ticketReserved(1) updateAccountInfo(2) showTicket(1) printTicket(1) is an execution trace of this application. We abstract away from the object identifiers by using integer values, called flow identifiers, such that each object identifier in each trace is uniquely mapped to an integer value. For simplicity, we rename the function names to the letters shown as comments at the end of each function call statement in Fig. 1.


Therefore, the above trace is renamed to a(1) x(2) b(1) y(2) c(1) d(1). Also, the traces x(3) a(4) x(5) a(4) y(5) b(4) c(4) d(4) y(3) and x(6) m(7) n(7) a(8) b(8) y(6) c(8) d(8) can be the result of two different executions of this code. The flow identifiers identify co-related outputs in a control-flow execution (flow for short) of an instance. For instance, in the second trace two instances of AccountTransactions are involved, and the first and the last outputs x and y are generated by the same instance. Since objects are assigned different identifiers in different executions, occurrences of x and y in different traces may be the result of the same instantiation. For example, x and y with flow identifier 2 in the first trace, those with flow identifier 3 in the second trace, and those with flow identifier 6 in the last trace belong to the same instantiation of AccountTransactions (line 13 of the code), while x and y with flow identifier 5 in the second trace belong to another instantiation (line 15).

Intuitively, we assume that the behavior of an application can be identified in terms of how it executes its component instances, abstracting away from its state variables such as isRegistered and enoughCredit. Knowing the behavior of the components TicketReservation, User, and AccountTransactions, we aim to derive the model of the TicketAgencySystem class, shown as the transition system of Fig. 4c, in terms of how it executes their instances, by observing the generated traces. By recognizing the source of the outputs (the equivalent flow identifiers) in different traces, the model can be refined to become more accurate. To this aim, we apply automata learning to component-based systems and use the behavior of the involved components to direct the learning process.

The first problem we tackle is deriving automata-based models for multi-component applications from their sets of observed behaviors, captured as execution traces. The automata learning problem aims at inferring an automaton which accepts a set of given words. If the execution traces are considered as the given words of a language, then the desired behavioral model is the identified automaton. This learned model indicates the sequential relation between occurrences of events. Passive techniques complete and compact an initial model, derived from the observed behavior of the system, by iteratively merging sets of states. However, the classical approaches in the automata learning literature are not effective at deriving the most complete model, one that subsumes as many unobserved traces as possible when no invalid traces are given, because their state-merge condition is the similarity of the K next actions of states, and states are merged irrespective of the domain. To account for the effect of the domain in the state-merging condition, we additionally take into account the behavior of a set of basic components used as parts of the application. We present a behavioral pre-order equivalence relation to determine equivalent states such that, after merging them, the behavior of the basic components is preserved while invalid behaviors are prevented. By considering the behavior of components, the concepts of the domain are thus used to tailor the learning algorithm through suitable state-merging constraints.
Furthermore, we apply the counter abstraction technique of symmetry reduction in defining our behavioral pre-order relation, so as to aggregate more states and accept more behaviors. We leverage the concepts of complex event processing [7] to complete the model with unseen interleavings of actions caused by the concurrent execution of components. The derived models are specified by labeled transition systems whose actions are parametrized by flow identifiers to distinguish different executions of the instances of each component. In general, our passive automata learning algorithm can be applied to learn a model of every multi-component system whose components run concurrently and may communicate with each other either through shared variables or through synchronous/asynchronous mechanisms. To apply the proposed method to a new domain, the only effort needed is to identify the basic components, the set of events in terms of which the behaviors of the system and components are specified, and a suitable approach to capture the behavior of the system. For example, our technique is applicable in other problem areas such as malware signature generation; in that domain, the basic components are the operating system services, whose behaviors are captured as system call sequences.

The second problem we address is classifying a given interleaved execution trace in terms of the learned parametrized models. The interleaved trace of a number of applications leads to uncertainty in determining the source of each action. The detection procedure is more complicated for parametric models because the flow identifiers of the trace must be related to the symbolic identifiers of the models in order to detect behaviors belonging to the same instantiation. The matching problem is essentially the problem of deciding whether an automaton accepts a word, which has mainly been addressed in the runtime monitoring (verification) domain [4,8,9]; however, those works do not consider interleaved sequences and parametric actions together. Since the parameters denote flow identifiers and have no absolute value, we need to find an appropriate binding between the flow identifiers of the model and those of the input trace. The number of alternatives to consider is thus exponential, and examining all of them is time- and memory-consuming. We utilize runtime verification techniques in a novel three-step approach to enhance the performance of the matching process for a trace. First, we reduce the set of models for further inspection by validating the trace against the models while ignoring action parameters. Second, a set of candidate bindings between the parameters of the trace and of the reduced set of models is inferred. Third, each nominated binding is used to validate the trace against the corresponding model, taking parameters into account.

The main components of our framework are shown in Fig. 2: Preprocessor, Model Generator, and Trace Classifier. The Preprocessor component is responsible for preparing the input traces by applying a mapper function which represents the generated events symbolically. Among these, only the approach for capturing application traces and the Preprocessor component are domain-specific. To illustrate the applicability of our approach, we have customized the capturing approach and the Preprocessor component for the traffic classification problem in the network domain and then applied them to some real network applications.
To demonstrate the effectiveness of our approach in the network domain, we compare it to custom statistical approaches in terms of their true positive rate, false positive rate, and test time. Our results indicate that our technique prevents the inclusion of invalid traces while covering unobserved behaviors with acceptable precision.


Fig. 2. The components of our framework.

The proposed methodology for model learning was first introduced in [10]. We have extended it by

• considering the flow identifiers of actions to obtain more accurate models by generating parametrized models; furthermore, we provide an extra step that infers equivalent flow identifiers for the actions of generated models to make them more general;
• presenting the procedure of automatically generalizing the model by relaxing unnecessary orders and including such orders as a consequence of the concurrent execution of the components.

Also, the classification of interleaved traces of applications, the models of some of which may be unknown, according to the learned parametrized models is first introduced in this paper.

This paper is organized as follows. We start with an overview of automata learning approaches, preliminary definitions, and the techniques applied in this research in Section 2. Our methodology for generating the behavioral models of applications is discussed in Section 3. The proposed parametric classifier is introduced in Section 4. We present the evaluation of the proposed methodology in the network domain with experimental results in Section 5. Section 6 is dedicated to related work. Section 7 concludes the paper and contains directions for future work.

2. Preliminaries

First, we provide an overview of automata learning in Section 2.1, which is the basis of our algorithm. We define the main concepts related to transition systems in Section 2.2, since in our methodology the inferred behavioral model is a transition system. The counter abstraction technique, used to find equivalent states, is introduced in Section 2.3. Section 2.4 is dedicated to runtime verification as the basis of our classification method.

2.1. Automata learning

Equivalent terms for automata learning in the literature include grammar inference, regular inference, and language or automata identification. The goal of automata learning is to find one of the smallest automata consistent with a set of given samples [11]. The samples can be sets of accepted or rejected words, called positive or negative samples, respectively. It has been proved that, for given sets of positive and negative samples, this problem is NP-complete when the alphabet is finite and the number of states of the output automaton is fixed [12]. If all of the words of size less than or equal to an arbitrary number n are given, then the problem can be solved in polynomial time. Otherwise, an automaton, not necessarily the smallest one, can be found which accepts/rejects the given input but may also accept additional words. The algorithms presented for this problem are divided into two categories: active and passive. The active techniques are based on Angluin's L* algorithm, which solves the problem in polynomial time by posing membership and equivalence queries [13]. This approach starts from an empty model and then iteratively adds a state and a number of transitions to the model. The correctness of each refinement step is validated by inspecting the membership of


the newly added words. These iterations continue until the resulting model is accepted in response to an equivalence query; it is assumed that there is an oracle which answers the required queries. The main focus of this approach is to reduce the number of queries by posing the most effective ones.

Passive techniques build a tree-like automaton, called a prefix tree automaton, from the input examples and then merge its states according to some heuristic evidence to reach the smallest deterministic finite automaton. Depending on the order of merging, different solutions to a DFA identification problem can be found. In this automaton, positive words end in an accepting state while negative ones lead to a rejecting state. This technique is called Evidence Driven State Merging (EDSM) [14]. In this category there is no oracle, and the algorithm has to find the solution only from the positive and negative words of the language. If negative samples are given, the merge condition ensures that only states leading to consistent states (all accepting or all rejecting) are merged. In the absence of negative samples, the merging criterion is defined in another way; e.g., in [15] it is defined by maximizing a heuristic score function in terms of the extent to which the suffixes of two states overlap, and the merging process is performed in the order of the scores of the pairs.

In this paper, we aim to infer the behavioral model only from the executions of systems (passive samples), when there is no oracle. Hence, the most important challenge is how to find the consistent states that should be merged. There are algorithms in the literature, such as the KTail algorithm, which merge states having a common K-future (i.e., states that accept the same set of strings of length K) [16]. However, the state-merging condition of these algorithms is irrespective of the domain. We present a novel domain-aware state-merging algorithm that considers the basic components of the system in addition to the executions.

For finding the consistent states, passive algorithms such as KTail need to search the state space in O(n^2). Some research has been conducted to reduce this search. Red-Blue is a well-known framework which limits the number of states searched by restricting the merge condition to states with opposite colors [14]. In this framework, a core of red states is always surrounded by a row of blue states. At first, the initial state is the only red state; after each merge, the resulting state becomes red while its neighbors become blue. The procedure lasts until all states are red. This framework reduces the candidate states to those with opposite colors. In our approach, we find consistent states in O(1) by defining their vectors and storing them in a hash structure; states with the same hash value are then merged, and there is no need for any O(n^2) search.
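To make this classical, domain-independent merge condition concrete, the following Java sketch (our own illustration, not taken from any particular KTail implementation; the map-based transition encoding is an assumption) computes the set of action strings of length at most K readable from a state of a prefix-tree automaton; two states are KTail merge candidates exactly when these sets coincide.

    import java.util.*;

    // Minimal sketch of the KTail merge condition: two states are merge candidates
    // iff the sets of action strings of length at most K readable from them coincide.
    class KTailSketch {
        // transitions.get(s) maps an action label to the successor of state s
        static Set<List<String>> kFuture(Map<String, Map<String, String>> transitions,
                                         String state, int k) {
            Set<List<String>> futures = new HashSet<>();
            futures.add(List.of());                       // the empty string is always readable
            if (k == 0) return futures;
            for (Map.Entry<String, String> t :
                    transitions.getOrDefault(state, Map.of()).entrySet()) {
                for (List<String> tail : kFuture(transitions, t.getValue(), k - 1)) {
                    List<String> word = new ArrayList<>();
                    word.add(t.getKey());
                    word.addAll(tail);
                    futures.add(word);
                }
            }
            return futures;
        }

        static boolean kTailEquivalent(Map<String, Map<String, String>> transitions,
                                       String s1, String s2, int k) {
            return kFuture(transitions, s1, k).equals(kFuture(transitions, s2, k));
        }
    }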

2.2. Transition system

We specify the model of a system and its sub-systems in the form of transition systems.

Definition 1 (Transition system [17]). A transition system is a tuple TS = (S, Act, →, s0, ↓) where S is a set of states, Act is a set of actions, → ⊆ S × Act × S is a transition relation, s0 is the initial state, and ↓ ⊆ S is a set of final states. We use s −α→ t to denote (s, α, t) ∈ →. TS is called action-deterministic if there are no s ∈ S, α ∈ Act, and t, v ∈ S with t ≠ v such that (s, α, t) ∈ → and (s, α, v) ∈ →.

By this definition, the transition system of Fig. 3a is action-deterministic, since there is only one outgoing transition per action from each state.

Definition 2 (Execution fragment and execution trace [17]). Let TS = (S, Act, →, s0, ↓) be a transition system.

• A finite execution fragment η = s0 α0 s1 α1 ... αn sn+1 of TS is an alternating sequence of states and actions starting with the initial state and ending in a final state, i.e., sn+1 ∈ ↓, such that (si, αi, si+1) ∈ → for 0 ≤ i ≤ n.
• A finite sequence of actions π = α0 α1 ... αn of TS is an execution trace if there exist s0, ..., sn+1 ∈ S such that η = s0 α0 s1 α1 ... αn sn+1 is a finite execution fragment.

We define Frags(TS) as the set of all finite execution fragments and Traces(TS) as the set of all finite execution traces of the transition system TS. As we only consider finite execution fragments and traces, we use "finite execution fragments/traces" and "execution fragments/traces" interchangeably in the following. For instance, s0 x s1 a s2 x s3 a s4 y s5 b s6 c s7 d s8 y s9 is an execution fragment of the transition system in Fig. 3a. By eliminating the states from this sequence, the execution trace x a x a y b c d y is obtained.

An abstraction operator hides the details of a transition system by turning some actions into the unobservable action τ, so as to derive a more abstract and general model. All actions except τ are called observable. Observability is used to define different equivalence relations that distinguish various systems.

Definition 3 (Abstraction operator [18]). Let TS = (S, Act, →, s0, ↓) be a transition system. The abstraction of TS via a set of actions L ⊆ Act, denoted by τL(TS), is (S, Act \ L, →', s0, ↓) such that
→' = {(s, α, t) | (s, α, t) ∈ →, α ∉ L} ∪ {(s, τ, t) | (s, α, t) ∈ →, α ∈ L}.
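As a small illustration of Definition 3, the Java sketch below (our own code; the Transition record and the TAU constant are assumptions, not part of the paper) relabels every transition whose action belongs to the hidden set L with τ and keeps the remaining transitions unchanged.

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch of Definition 3: tau_L(TS) renames every action in L to the
    // unobservable action tau; states, initial state and final states are kept.
    class AbstractionSketch {
        static final String TAU = "tau";

        record Transition(String source, String action, String target) { }

        static Set<Transition> abstractActions(Set<Transition> transitions, Set<String> hidden) {
            return transitions.stream()
                    .map(t -> hidden.contains(t.action())
                            ? new Transition(t.source(), TAU, t.target())
                            : t)
                    .collect(Collectors.toSet());
        }
    }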


Fig. 3. A transition system and an example of a weak simulation relation.

The left transition system in Fig. 3b is achieved by abstracting away the actions {x, y, m, n, a, b, c, d, x, y} from Fig. 3a. To compare transition systems, several behavioral pre-order and equivalence relations have been proposed, ranging from strict to loose ones [19]. The simulation relation is one of the finest pre-order relations; it requires a transition system to precisely mimic the transitions of another one. In the presence of internal actions, the weak simulation relation is defined instead, matching each observable α-transition to one preceded/followed by a number of τ-transitions.

Let −τ*→ be the reflexive and transitive closure of τ-transitions:

• t −τ*→ t;
• if t −τ*→ s and s −τ→ r, then t −τ*→ r.

Definition 4 (Weak simulation relation [18]). A binary relation R on the set of states S is a weak simulation relation if for any s1, s1', and t1 ∈ S and α ∈ Act, s1 R t1 implies:

• s1 −α→ s1' ⇒ (α = τ ∧ s1' R t1) ∨ (∃ t1' ∈ S : t1 −τ*→ −α→ −τ*→ t1' ∧ s1' R t1');
• s1 ∈ ↓ ⇒ (∃ t1' ∈ ↓ : t1 −τ*→ t1').

For given transition systems TSi = (Si, Acti, →i, s0i, ↓i), where i ∈ {1, 2}, TS1 is weakly simulated by TS2 (or TS2 simulates TS1), denoted by TS1 ⪯w TS2, if s01 R s02 for some weak simulation relation R. A weak simulation relation R witnessing TS1 ⪯w TS2 is called minimal if there exists no weak simulation relation R' witnessing TS1 ⪯w TS2 such that R' ⊂ R. A minimal weak simulation relation is not necessarily unique. An example of a weak simulation relation between two transition systems, which is minimal, is illustrated in Fig. 3b. We remark that if the pair (s4, t2) substitutes for the pair (s5, t2), another minimal weak simulation is obtained.

2.3. Counter abstraction

Counter abstraction is a technique to abstract the states of a system. The idea is to represent each state as a vector of counters, one per value, instead of a vector of state-variable values; the counter of a value records the number of state variables holding that value. For instance, consider three variables x, y, and z over the domain {1, 2, 3}. The states {x = 1, y = 2, z = 1}, {x = 2, y = 1, z = 1}, and {x = 1, y = 1, z = 2} all have the identical counter-abstracted state {c1 = 2, c2 = 1, c3 = 0}, where ci denotes the counter of the value i. States with an identical abstraction can be merged together, and hence the model of a system can be reduced substantially.
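A minimal Java sketch of this counter abstraction (our own illustration; all names are ours): the states {x = 1, y = 2, z = 1} and {x = 2, y = 1, z = 1} both map to the counter vector {1 → 2, 2 → 1, 3 → 0}, so they fall into the same abstract class.

    import java.util.*;

    // Counter abstraction sketch: a concrete state is a map from variables to
    // values; its abstraction counts how many variables hold each value.
    class CounterAbstractionSketch {
        static Map<Integer, Integer> abstractState(Map<String, Integer> state, Set<Integer> domain) {
            Map<Integer, Integer> counters = new TreeMap<>();
            for (int value : domain) counters.put(value, 0);
            for (int value : state.values()) counters.merge(value, 1, Integer::sum);
            return counters;
        }

        public static void main(String[] args) {
            Set<Integer> domain = Set.of(1, 2, 3);
            Map<String, Integer> s1 = Map.of("x", 1, "y", 2, "z", 1);
            Map<String, Integer> s2 = Map.of("x", 2, "y", 1, "z", 1);
            // Both abstract to {1=2, 2=1, 3=0}, so s1 and s2 can be merged.
            System.out.println(abstractState(s1, domain).equals(abstractState(s2, domain))); // true
        }
    }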


The soundness of the counter abstraction technique has been proved in [20] by showing that states with an identical abstraction are behaviorally equivalent (with respect to strong bisimilarity [21]) and thus satisfy the same set of properties. This technique has been used in various applications. In symmetry reduction, which is a technique to avoid the state-space explosion problem, counter abstraction plays the role of finding identical clusters of the state space so as to reduce the symmetrical states and decrease the cost of model checking [22]. This concept is also used in [23] to abstract a parameterized system of unbounded size into a verifiable finite-state system. In our proposed method, we use this technique to abstractly define the states of an application in terms of the number of flows corresponding to each state of a component. Therefore, two states can be merged if their corresponding abstract states are identical.

2.4. Runtime verification

Runtime verification covers those verification techniques that inspect a finite execution trace of a system to determine whether one or more given desired properties are satisfied. Runtime verification fills the gap between model checking and testing: it examines a finite trace of a system at runtime (like testing, no model of the system under consideration is needed), as opposed to model checking, which applies to a model of the system. Hence, runtime verification does not suffer from state-space explosion [24]. To apply runtime verification, a monitor observes the events of the system and examines the desired properties to raise proper verdicts. We utilize runtime verification techniques in our classifier so as to enhance the performance of the matching process: given a trace and a set of models of some systems, the classifier discovers which application(s) may have generated a part of the input trace. Hence, to apply runtime verification techniques, the behavioral models are considered as the regular properties that should hold for the given input trace.

3. Model identification

In this section, our proposed methodology for learning a model of a multi-component system from its set of traces is discussed. To derive this model, it is assumed that the inputs are a set of execution traces and a set of component specifications. The given set does not necessarily contain the specifications of all the constituent components; hence, the input traces only contain events related to the considered components, and the others are removed. The derived model is then defined in terms of the behavior of its known components, those common with the given set. Reducing the problem to the automata learning problem, we propose an evidence-driven state-merging technique based on an equivalence relation. The goal is to derive a model that subsumes as many unobserved traces as possible.

3.1. Problem statement

Since we consider multi-component systems, the first input of our problem is a set of specifications of K components. We assume that the specifications are provided in the form of action-deterministic transition systems Ci = (Si, Acti, →i, s0i, ↓i), where 1 ≤ i ≤ K, τ ∉ Acti, and ∀ i, j ≤ K : (i ≠ j ⇒ Acti ∩ Actj = ∅). The second input of our problem is N execution traces of the system. Each trace is an interleaving of a number of execution traces of component instances. The execution traces of two instances of a component are identified by their unique parameters, e.g., the object identifiers in the Java code of Fig. 1.
Hence, the co-related events of an instance can be detected. These parameters can be symbolically represented by a unique identifier to distinguish their component instantiations. Each execution trace π can be considered as an interleaving of πf1, πf2, ..., where each subsequence πfi is an execution trace of an instance of some component Cj (j ≤ K). The symbolic identifier fi, called a flow identifier, uniquely distinguishes its component instantiation by parameterizing the actions of the trace.

Definition 5 (Flow). A flow is a trace of a component generated from an instantiation of the component, whose actions are all parametrized by a unique symbolic identifier to distinguish that instantiation.

For instance, if two execution traces πf ≡ α0(f) α1(f) ... αn(f) and πf' ≡ α0'(f') α1'(f') ... αn'(f') are the flows of two instances of component Cj, then they are uniquely distinguished by the identifiers f and f'. Therefore, we assign an identifier to each flow, ranged over by f, and assume that there are in total F flows in the given traces. Since each flow belongs to an instance of a component, we define a function Component : Nat → Nat which identifies the component index of a given flow identifier, such that ∀ f ≤ F : Component(f) ≤ K.

A mapping function Mapper is defined to represent the events of the input execution traces by symbolic actions, and to map each action α ∈ Actj belonging to the flow πf to a parametric action α(f), where obviously Component(f) = j. For example, for the code of Fig. 1, the functions selectTicket, ticketReserved, showTicket, and printTicket are mapped to the symbolic actions a, b, c, and d, respectively. After applying the mapping function, the second input of the problem can be considered as N parametric action traces, denoted by AT and ranged over by π. Let πi indicate the i-th action of the trace π, and len(π) the length of the trace. We remark that the lengths of action traces may vary.
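The Mapper can be pictured as a small renaming layer over raw events. The sketch below (our own illustration, not the paper's implementation; the renaming table and event fields are assumptions) turns an observed function call with an object identifier into a parametric action α(f), assigning a fresh flow identifier the first time an object identifier is seen.

    import java.util.*;

    // Sketch of the Mapper: raw events carry a function name and an object id;
    // the mapper renames functions to symbolic actions and replaces object ids
    // by small integer flow identifiers, e.g. selectTicket on obj42 -> a(1).
    class MapperSketch {
        private final Map<String, String> actionOf = Map.of(
                "selectTicket", "a", "ticketReserved", "b",
                "showTicket", "c", "printTicket", "d");
        private final Map<String, Integer> flowOf = new HashMap<>();

        String map(String function, String objectId) {
            Integer flow = flowOf.get(objectId);
            if (flow == null) {
                flow = flowOf.size() + 1;          // fresh flow identifier
                flowOf.put(objectId, flow);
            }
            return actionOf.getOrDefault(function, function) + "(" + flow + ")";
        }

        public static void main(String[] args) {
            MapperSketch m = new MapperSketch();
            System.out.println(m.map("selectTicket", "obj42"));   // a(1)
            System.out.println(m.map("ticketReserved", "obj42")); // b(1)
        }
    }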


Fig. 4. Applying the proposed method on the running example.

The goal of our problem is to derive a transition system M = (SM, ActM, →M, s0M, ↓M), where ActM is a set of parametric actions ActM = {α(f) | ∃ j ≤ K : α ∈ Actj ∧ Component(f) = j}, such that AT ⊆ Traces(M). There is a trade-off between the number of input traces and the computation time of the learning algorithm: increasing the number of input traces to observe more unseen behaviors yields a more accurate model, but requires more computation time.

Running example: With regard to the code given in Fig. 1, the specifications of the three sample components are provided in Fig. 4a, and the three input traces x(1) a(2) x(3) a(2) y(3) b(2) c(2) d(2) y(1), a(4) x(5) b(4) y(5) c(4) d(4), and x(6) m(7) n(7) a(8) b(8) y(6) c(8) d(8) are given (for simplicity, TicketReservation, User, and AccountTransactions are referred to as C1, C2, and C3, respectively). The integer parameters of the actions are the flow identifiers. In the first trace there are three flows, two of component C3 and one of component C1; the flow identifiers indicate that the first x and the last y belong to one flow and the second x is related to the first y. In the second trace there are two flows, from components C1 and C3. In the last trace there are three flows, each of which is a finite trace of one of the sample components. We use this example in the following.

3.2. Projection relation

Intuitively, each application needs to invoke a number of component instances to provide a functionality, and each invocation follows the corresponding component specification. As mentioned before, we call an execution trace of a component a flow. Initially, a tree-like automaton which consists of all action traces is generated; a transition system has a tree-like structure if and only if any two of its distinct states are connected by at most one simple path. Each state of the initial transition system can be considered as a vector of states, each of which identifies a state of a component; the size of the vector equals the number of flows. To generalize the initial transition system so that it includes behaviors of the system not observed during its execution, states with the same behavior are merged together. Such states are called projection equivalent: two states are projection equivalent if their vectors (of flow states) are identical with respect to the counter abstraction technique. Before describing the method, some definitions and theorems are needed.

Definition 6 (Projection relation under a transition system). Let TSi = (Si, Acti, →i, s0i, ↓i), for i = 1, 2, be transition systems. Two states s1 and s2 of S1 have the projection relation under TS2 if TS1 ⪯w TS2, witnessed by a minimal weak simulation relation R, such that ∃ t ∈ S2 : s1 R t ∧ s2 R t. Then s1 and s2 are said to be the same projection of t under the transition system TS2, denoted by s1 ∼TS2 s2.

The above definition is used to find consistent states, i.e., states preserving the same behavior. For instance, in the running example, the states s1 and s16 of Fig. 4b are the same projection of the state v2 under the transition system C3 of Fig. 4a; in other words, these states preserve the behavior of the state v2. To find states that are projection equivalent, the following lemma identifies the conditions under which the projection relation is an equivalence relation and can consequently partition the states.


Lemma 1. Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). If TS2 weakly simulates TS1, witnessed by a minimal weak simulation R, then each state of S1 relates to only one state of S2 under R:

∀ s ∈ S1, ∀ t1, t2 ∈ S2 : s R t1 ∧ s R t2 ⇒ t1 = t2.

Proof. We prove by induction on the length of the path from the root to the state s ∈ S1. For the base case, when the length is zero, s is the root of TS1, i.e., s = s01. By Definition 6, the roots of TS1 and TS2 must be related. We assume that t1 ≠ t2 and (s02 = t1) ∨ (s02 = t2); we assume that s02 = t1, as the other case can be proved by the same argument. The minimality of R implies that for any pair (s', t') ∈ R there must be paths π and π' from s01 and s02 to s' and t', respectively, such that any transition s1 −α→ s1' over π is matched to a sub-path of π'. In other words, if s1 −α→ s1' and s1 R t1*, then either (α = τ ∧ s1' R t1*) or (α ≠ τ ∧ t1* −α→ t2* ∧ s1' R t2*), by Definition 4 and the assumption τ ∉ Act2. It can be concluded that if s1 −α→ s1' and s1 R t1*, then there is t2* over π' such that s1' R t2*. Thus, s R t1 and s R s02, together with the minimality of R, imply that there must be a loop in TS1, which contradicts our assumption that TS1 has a tree-like structure. We now assume that the lemma holds for all states whose path from the root has length less than n, and prove it for a state s with a path of length n + 1. The tree-like structure of TS1 indicates that each state of S1 is the target of exactly one transition which is not a self-loop. Therefore, since s has a unique parent, denoted by p(s), two cases can be distinguished: 1) either s is reachable through a τ-transition, in which case τ ∉ Act2, the minimality of R, and Definition 4 imply that p(s) R t1 and p(s) R t2; 2) or s is reachable through an observable action, in which case the minimality of R and Definition 4 together imply that the parents of t1 and t2 must be related to p(s), i.e., p(s) R p(t1) and p(s) R p(t2). In the former case, by induction t1 = t2; in the latter case, by induction p(t1) = p(t2) and hence, by action-determinism, t1 = t2. □

Since our purpose is to find candidate states to be merged, we use the above lemma to prove in the next theorem that the projection relation is an equivalence relation and can therefore be used to partition the states; each equivalence class contains candidate states to be merged.

Theorem 2. Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). The projection relation under the transition system TS2 over the states of TS1 is an equivalence relation.

Proof. We assume that TS1 ⪯w TS2 is witnessed by a minimal weak simulation relation R. We prove that the projection relation is reflexive, symmetric, and transitive. Reflexivity follows from the fact that (s, t) ∈ R ∧ (s, t) ∈ R ⇒ s ∼TS2 s. For symmetry, s1 ∼TS2 s2 implies that ∃ t ∈ S2 : (s1, t) ∈ R ∧ (s2, t) ∈ R; consequently, (s2, t) ∈ R ∧ (s1, t) ∈ R implies s2 ∼TS2 s1. Transitivity follows from the facts that s1 ∼TS2 s2 ⇒ ∃ t ∈ S2 : (s1, t) ∈ R ∧ (s2, t) ∈ R, and s2 ∼TS2 s3 ⇒ ∃ r ∈ S2 : (s2, r) ∈ R ∧ (s3, r) ∈ R. According to Lemma 1, it holds that t = r, and hence s1 ∼TS2 s3. □

As a consequence of Theorem 2, the states of TS1 can be partitioned into equivalence classes by a projection relation under TS2. As all states in an equivalence class are the projection of a unique state of TS2, each equivalence class can be represented by a unique state of TS2. The equivalence classes of a projection relation are defined as follows.

Definition 7 (Projection relation partitioning). Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). The states of TS1 are partitioned under the projection relation under TS2 into equivalence classes, each identified by its projecting unique state t ∈ S2, such that [t]_{TS1 ∼ TS2} = {s ∈ S1 | s R t}, where R is a minimal weak simulation relation witnessing TS1 ⪯w TS2.

3.3. Model learning steps

Step 1: Building the initial transition system. From Definition 2, there is an execution fragment for each execution trace. For each input trace π, we generate its corresponding execution fragment ηπ = s0 π1 s_1^π π2 ... π_len(π) s_len(π)^π by introducing the states s0 and s_i^π, where 1 ≤ i ≤ len(π). Note that the initial state in the fragments of all traces is intentionally identical. In this first step, the tree-like transition system M0 is built by aggregating the execution fragments of the input traces. Therefore, the initial transition system M0 = (S, Act, →, s0, ↓) is obtained as follows:

• S = {s_i^π | 1 ≤ i ≤ len(π), π ∈ AT} ∪ {s0},
• Act = {π_i | 1 ≤ i ≤ len(π), π ∈ AT},
• → = ∪_{π ∈ AT} ({(s_k^π, π_{k+1}, s_{k+1}^π) | 1 ≤ k < len(π)} ∪ {(s0, π_1, s_1^π)}),
• ↓ = {s_{len(π)}^π | π ∈ AT}.
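Step 1 admits a direct implementation. The following Java sketch (our own simplification; the class and field names are not from the paper) builds the tree-like initial transition system by creating a fresh state for every position of every trace and sharing only the common initial state s0.

    import java.util.*;

    // Sketch of Step 1: the initial transition system M0 is a set of linear
    // branches hanging off one shared initial state, one branch per trace.
    class InitialModelSketch {
        record Transition(int source, String label, int target) { }

        static class TS {
            final int initial = 0;
            int stateCount = 1;                         // state 0 is the shared s0
            final Set<String> actions = new HashSet<>();
            final List<Transition> transitions = new ArrayList<>();
            final Set<Integer> finals = new HashSet<>();
        }

        static TS buildInitialModel(List<List<String>> traces) {
            TS m0 = new TS();
            for (List<String> trace : traces) {
                int current = m0.initial;
                for (String action : trace) {           // parametric action, e.g. "a(2)"
                    int next = m0.stateCount++;         // a fresh state keeps M0 tree-like
                    m0.actions.add(action);
                    m0.transitions.add(new Transition(current, action, next));
                    current = next;
                }
                m0.finals.add(current);                 // the last state of each trace is final
            }
            return m0;
        }
    }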


Fig. 4b shows the result of performing this step. For simplicity, we enumerate the states from s0 to s_{Σπ∈AT len(π)} instead of using trace-indexed names such as s_i^π. This initial transition system does not include any execution traces of the application that are not subsumed by the input. To generalize the initial transition system so that it covers more execution traces, the next steps (steps 2 to 5) provide a set of operations toward this goal.

Step 2: Generalizing by counter abstraction. First, we generalize the initial model to accept more behaviors with an evidence-driven state-merging technique. This technique takes advantage of the behaviors of the components to identify equivalent states to be merged. Each state of the initial transition system is unfolded in terms of the progress of the flows; the progress is computed with regard to the behavior of the basic components by using our projection relation. States in which the flows have progressed similarly, modulo counter abstraction, are then subject to merging.

1. Finding equivalent states: As we classify applications in terms of how they execute several instances of the given components, each state s of M0 can be abstractly denoted by a vector ⟨t1, ..., tF⟩, where state ti represents the progress of flow π_fi according to the transition system Ck (Component(fi) = k). In other words, abstracting away the actions of the other flows from M0, denoted by τ_f̄(M0), the state s is simulated by ti ∈ Sk. Assume that the flows π_fi and π_fj are of the same component, and consider two states s1 and s2 of M0 such that their i-th and j-th items in the abstracted vector are swapped. We consider these two states equivalent under counter abstraction. Therefore, each state s of M0 can be represented as a vector of counters ⟨n1, ..., n_|S*|⟩, where S* = ∪_{j≤K} Sj and |S| denotes the size of the set S, with one counter for each state of the components; the counter n for the state t indicates how many flows have progressed to the state t.

The abstraction operator is used to determine the progress of each flow according to the transition system of its originating component by making the actions of all other flows unobservable. Since all the remaining observable actions belong to the same flow, the flow identifiers of the actions are also removed by this operator.

Definition 8 (Abstraction operator under a flow identifier). Let TS = (S, Act, →, s0, ↓) be a transition system. Then τ_f̄(TS) = (S, Act', →', s0, ↓) such that

→' = {(s, α, t) | (s, α(f), t) ∈ →} ∪ {(s, τ, t) | ∃ (s, α(f'), t) ∈ → ∧ f' ≠ f},

and Act' ⊆ Acti, where Component(f) = i and Ci = (Si, Acti, →i, s0i, ↓i).

For instance, the abstraction of the transition system of Fig. 4b under the flow with identifier 2 (which contains the trace a a b c d) is illustrated on the left side of Fig. 3b. This operator is helpful for computing the number of flows preserving the behavior of a common state of the components, and hence for generating the vector of counters ⟨n1, ..., n_|S*|⟩. Let count(s, t) denote the number of flows preserving the behavior of the component state t in the state s of the initial transition system; this function takes the behavior of the depending components into account.

Definition 9 (Counter function). Given the initial transition system M0 = (S, Act, →, s0, ↓) and a set of component transition systems Ci = (Si, Acti, →i, s0i, ↓i), where 1 ≤ i ≤ K, the counter function count(s, t) computes, for the state s ∈ S, the number of flows π_fi with Component(fi) = j that have progressed to the state t ∈ Sj, i.e., such that t weakly simulates s in the abstraction of M0 under fi:

count(s, t) = |{ fi ≤ F | s ∈ [t]_{τ_f̄i(M0) ∼ Cj} }|.

Since we define the projection relation on the abstracted version of the initial transition system, which has a tree-like structure, all observable actions belonging to a flow of an input trace are located on a single branch of the tree with no further branching structure (see Fig. 3b). Therefore, for simplicity, we have used the weak simulation relation in the definition of the projection relation (it yields the same result as the branching simulation relation on transition systems with no branching structure). We remark that each state s is uniquely simulated by a state t as a result of our projection relation (Lemma 1). Two states s1 and s2 are called equivalent due to the counter abstraction technique if and only if ∀ j ≤ K, t ∈ Sj : count(s1, t) = count(s2, t).

The results of applying the counter abstraction to the states of the initial transition system of the running example are presented in Table 1. To obtain each row, the projection relation under the abstraction of each flow identifier is first computed; after that, the number of flows in each state of the components is counted. The first row shows that the initial state of the abstraction of M0 under every flow identifier is simulated by the initial states of all components; in other words, no flow has made any progress with regard to any component. By the transition (s0, x(1), s1), flow 1 progresses with regard to component C3, so the state s1 of τ_1̄(M0) is simulated by the state v2 of the component C3 in Fig. 4a. Hence, the counter of flows in the state v1 decreases by one and the counter of flows in the state v2 increases by one. After calculating the counters for each state, the sets of equivalent states are obtained as shown in Table 2.
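The count vectors can be computed in a single pass per trace, because M0 is a tree and the component models are action-deterministic and τ-free: while a trace is replayed, only the acting flow advances in its own component model, and every other flow stays where it is. The Java sketch below (our own simplification; it assumes globally unique component state names and component models that define a successor for every observed action, and it is not meant to reproduce the exact bookkeeping behind Table 1) records the counter-abstracted snapshot of the flow states after every action.

    import java.util.*;

    // Sketch of the per-flow traversal behind Definition 9: each flow starts in
    // the initial state of its component and is advanced only by its own actions;
    // after every action, the counter abstraction of all flow states is recorded.
    class FlowProgressSketch {
        record ParametricAction(String symbol, int flow) { }

        static List<Map<String, Integer>> countVectorsAlongTrace(
                List<ParametricAction> trace,
                Map<Integer, String> initialStateOfFlow,           // flow id -> initial component state
                Map<String, Map<String, String>> delta) {           // component state -> action -> state
            Map<Integer, String> flowState = new HashMap<>(initialStateOfFlow);
            List<Map<String, Integer>> snapshots = new ArrayList<>();
            snapshots.add(counters(flowState));                     // snapshot for s0
            for (ParametricAction a : trace) {
                String current = flowState.get(a.flow());
                flowState.put(a.flow(), delta.get(current).get(a.symbol())); // advance only this flow
                snapshots.add(counters(flowState));                 // snapshot for the next M0 state
            }
            return snapshots;
        }

        private static Map<String, Integer> counters(Map<Integer, String> flowState) {
            Map<String, Integer> count = new HashMap<>();
            for (String componentState : flowState.values()) count.merge(componentState, 1, Integer::sum);
            return count;
        }
    }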


Table 1
Valuation of the count function in each state of the initial transition system (c abbreviates count).

        Component C1                                   Component C2        Component C3
State   c(s,t1) c(s,t2) c(s,t3) c(s,t4) c(s,t5)        c(s,w1) c(s,w2)     c(s,v1) c(s,v2)
s0      8       0       0       0       0              8       0           8       0
s1      8       0       0       0       0              8       0           7       1
s2      7       1       0       0       0              8       0           7       1
s3      7       1       0       0       0              8       0           6       2
s4      7       1       0       0       0              8       0           6       2
s5      7       1       0       0       0              8       0           7       1
s6      7       0       1       0       0              8       0           7       1
s7      7       0       0       1       0              8       0           7       1
s8      7       0       0       0       1              8       0           7       1
s9      7       0       0       0       1              8       0           8       0
s10     7       1       0       0       0              8       0           8       0
s11     7       1       0       0       0              8       0           7       1
s12     7       0       1       0       0              8       0           7       1
s13     7       0       1       0       0              8       0           8       0
s14     7       0       0       1       0              8       0           8       0
s15     7       0       0       0       1              8       0           8       0
s16     8       0       0       0       0              8       0           7       1
s17     8       0       0       0       0              7       1           7       1
s18     8       0       0       0       0              8       0           7       1
s19     7       1       0       0       0              8       0           7       1
s20     7       0       1       0       0              8       0           7       1
s21     7       0       1       0       0              8       0           8       0
s22     7       0       0       1       0              8       0           8       0
s23     7       0       0       0       1              8       0           8       0

Table 2
The set of equivalent states with their new associated state number.

New state   Equivalent states        New state   Equivalent states
s0          {s0}                     s6          {s13, s21}
s1          {s10}                    s7          {s7}
s2          {s1, s16, s18}           s8          {s14, s22}
s3          {s2, s5, s11, s19}       s9          {s8}
s4          {s6, s12, s20}           s10         {s9, s15, s23}
s5          {s3, s4}                 s11         {s17}

2. Merging equivalent states: After finding the sets of equivalent states of M0, the merging process is performed. Let [s] denote the equivalence class of the state s, i.e., ∀ s' ∈ S : s' ∈ [s] ⇔ ∀ Ci = (Si, Acti, →i, s0i, ↓i), ∀ ti ∈ Si : count(s, ti) = count(s', ti). A merged state inherits the union of the incoming and outgoing transitions of its origin states in M0. By merging these states, the next transition system M1 = (S', Act, →', s0', ↓') is obtained, where

• S' = {[s] | s ∈ S},
• →' = {([s], α, [t]) | ∃ s', t' ∈ S : s' ∈ [s], t' ∈ [t], (s', α, t') ∈ →},
• s0' = [s0],
• ↓' = {[s] | s ∈ ↓}.

Since M1 does not remove any transition or state of M0, it contains all the traces of M0 (i.e., Traces(M0) ⊆ Traces(M1)). Furthermore, M1 weakly simulates M0, witnessed by the relation R = {(s', [s]) | s' ∈ [s]}. This relation is a weak simulation since, for any arbitrary pair (s', [s]), if s' −α→ t', then by construction of →' we have [s] −α→ [t], where t' ∈ [t] and t' R [t].
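The merge itself can be driven by a hash map keyed by the counter vector, which is how the O(1) grouping mentioned in Section 2.1 can be realized. The sketch below (our own Java illustration; the types are simplified) groups states by their count vectors and rewires every transition to the representatives of its source and target classes.

    import java.util.*;

    // Sketch of Step 2.2: states with identical count vectors fall into one class
    // via a hash map; every transition is redirected to class representatives.
    class MergeSketch {
        record Transition(int source, String label, int target) { }

        static List<Transition> merge(List<Transition> transitions,
                                      Map<Integer, Map<String, Integer>> countVectorOf) {
            // states with the same counter-abstraction key share one representative
            Map<Map<String, Integer>, Integer> representative = new HashMap<>();
            Map<Integer, Integer> classOf = new HashMap<>();
            for (Map.Entry<Integer, Map<String, Integer>> e : countVectorOf.entrySet()) {
                int rep = representative.computeIfAbsent(e.getValue(), key -> e.getKey());
                classOf.put(e.getKey(), rep);
            }
            // a merged state inherits all incoming and outgoing transitions
            Set<Transition> merged = new LinkedHashSet<>();
            for (Transition t : transitions) {
                merged.add(new Transition(classOf.get(t.source()), t.label(), classOf.get(t.target())));
            }
            return new ArrayList<>(merged);
        }
    }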

Step 3: Flow identifier inference. The obtained model contains a number of flow identifiers on its transitions. Depending on the values of its state variables, different executions of a multi-component application generate different numbers of component instances, which may be executed concurrently. As explained in Section 1, two different flow identifiers from different application executions may have been derived from the same instantiation of a component, so it is important to find the equivalent flow identifiers and unify them in the resulting model. In this step, we discover equivalent flows by finding transitions between two states that differ only in their flow identifiers: when two flows trigger transitions with the same action between the same pair of states of the model, they play the same role in those states. For instance, as a consequence of merging the equivalent states, the states in the sets {s1, s16, s18} and {s2, s5, s11, s19} are aggregated together, while two transitions with the labels a(8) and a(2) connect them; therefore, the flow identifiers 8 and 2 can be unified. Table 3 illustrates how the equivalent flow identifiers of the obtained model for the running example are inferred and renamed.


Table 3
The inferred equivalence rules between flow identifiers after merging the transitions of equivalent states.

Source states          Destination states      Merged transitions                    Inferred rule   New flow ID
{s1, s16, s18}         {s2, s5, s11, s19}      (s1, a(2), s2), (s18, a(8), s19)      f: 2 = f: 8     1
{s0}                   {s1, s16, s18}          (s0, x(1), s1), (s0, x(6), s16)       f: 1 = f: 6     2
{s2, s5, s11, s19}     {s6, s12, s20}          (s11, b(4), s12), (s19, b(8), s20)    f: 4 = f: 8     1
{s6, s12, s20}         {s13, s21}              (s12, y(5), s13), (s20, y(6), s21)    f: 5 = f: 6     2
–                      –                       –                                     f: 3            3
–                      –                       –                                     f: 7            4
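Step 3 amounts to taking the transitive closure of the inferred equalities between flow identifiers, for which a union-find structure is a natural fit. The sketch below (our own Java illustration, not the paper's implementation) unifies identifiers according to the rules of Table 3 and then renumbers the resulting classes, reproducing the last column of the table.

    import java.util.*;

    // Sketch of Step 3: flow identifiers related by inferred equality rules are
    // unified with union-find and then renumbered consecutively.
    class FlowUnificationSketch {
        private final Map<Integer, Integer> parent = new HashMap<>();

        int find(int f) {
            parent.putIfAbsent(f, f);
            int root = parent.get(f);
            if (root != f) {
                root = find(root);
                parent.put(f, root);          // path compression
            }
            return root;
        }

        void union(int f1, int f2) { parent.put(find(f1), find(f2)); }

        // After all unions, assign a compact new identifier per equivalence class.
        Map<Integer, Integer> renumber(Collection<Integer> flows) {
            Map<Integer, Integer> newIdOfRoot = new LinkedHashMap<>();
            Map<Integer, Integer> newIdOfFlow = new LinkedHashMap<>();
            for (int f : flows) {
                int root = find(f);
                newIdOfRoot.putIfAbsent(root, newIdOfRoot.size() + 1);
                newIdOfFlow.put(f, newIdOfRoot.get(root));
            }
            return newIdOfFlow;
        }

        public static void main(String[] args) {
            FlowUnificationSketch u = new FlowUnificationSketch();
            u.union(2, 8); u.union(4, 8);     // rules f:2 = f:8 and f:4 = f:8 from Table 3
            u.union(1, 6); u.union(5, 6);     // rules f:1 = f:6 and f:5 = f:6
            System.out.println(u.renumber(List.of(2, 8, 4, 1, 6, 5, 3, 7)));
        }
    }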

Fig. 5. Relaxing unnecessary orders between two sync points.

Fig. 4c (without the thick transitions, the two states s12 and s13, and the dashed transitions) is the final result of performing steps 2 and 3 on M0. This approach helps reduce the resulting model in cases where the application has two instances of a component that may be active exclusively but behave similarly in some situations. The behavior of each component instance is a part of the component's behavior; if we replace the partial behavior of an instance with the behavior of the component, the model still includes the previous traces. We remark that by applying this heuristic, we integrate the behavior of both instances into one instance.

Step 4: Generalizing by completing transitions. The next generalization idea is to complete the transition set according to the transition systems of the components. We add the self-loops of each component state t ∈ Si, for some i ≤ K, to the state [s] if count(s, t) > 0. In other words, if the state s of the model simulates a state t of the transition system of the component Ci and t has a self-loop transition, we add this transition to the state s as well. Adding such transitions does not alter the equivalence classes of M1, since such self-loops change neither the state of the component nor, consequently, the model. As a result, the model accepts more traces without changing its states. Since a loop containing more than one state of one component may be interleaved in a specific order with the behavior of the other involved components of a multi-component system, we cannot blindly complete the model with such multi-state loops. After applying this step, the resulting generalized transition system is Mg = (S', Act, →g, s0', ↓') such that:

→g = →' ∪ {([s], α(f), [s]) | ∀ j ≤ K, t ∈ Sj, ∀ s ∈ S : s ∈ [t]_{τ_f̄(M0) ∼ Cj} ∧ (t, α, t) ∈ →j}.

After applying this step, the two thick self-loops labeled a on the states s1 and s3 are added to Fig. 4c.

Step 5: Generalizing by relaxing unnecessary orders. The interleaving semantics of parallel composition leads to various orders of actions, which result in diamond-like patterns in the model. Since components run concurrently and may communicate with each other either through shared variables (as in the example of Fig. 1) or through synchronous/asynchronous mechanisms, the total ordering among the actions of such components may be limited. In this step, we aim to relax unnecessary orders resulting from the concurrent structure of applications by identifying such diamond-like patterns in the model produced by the previous steps. For instance, consider the left transition system in Fig. 5 as a sub-automaton of the generated model. This figure shows that there are two possible subsequences between the flows with identifiers 1 and 2, generated by component instances of C1 and C3 of Fig. 4a. To find the diamond-like patterns of the system, we introduce the notion of a sync point.

Definition 10 (Sync point). Sync points are those states of the transition system with at least two incoming or at least two outgoing transitions. Two sync points constitute a pair if there are at least two execution paths between them.

Each pair of sync points represents the start and the end of a diamond pattern. For instance, the states s4 and s10 can constitute a sync-point pair. Indeed, this subsequence is the result of the parallel composition of two instances, one of which executes


c and d in sequence while the other executes y, under the constraints of the application. Therefore, to derive a more general model, the unobserved interleavings should be inferred. To this aim, we utilize the method proposed in [7] from the complex event processing (CEP) domain. CEP deals with processing huge numbers of events to find causality relations between their occurrences. The causality relation α ⇒ β denotes that every occurrence of α must come before β in all execution traces. For instance, {y(2) ⇒ c(1), y(2) ⇒ d(1), c(1) ⇒ d(1)} and {c(1) ⇒ d(1), c(1) ⇒ y(2), d(1) ⇒ y(2)} are the causality relations for the left and right branches of the left LTS of Fig. 5, respectively. We list all precedence rules for each branch between two sync points separately and then compare these sets to determine which rules are mandatory (those rules that are not violated in any other rule set). In the example, the only mandatory rule is c(1) ⇒ d(1). The orders which preserve the mandatory rules are then added as new branches between the two sync points. For example, the subsequence c(1) y(2) d(1) is a valid order while d(1) y(2) c(1) is not. After generalizing the model based on this idea, the new sub-automaton is the one shown on the right side of Fig. 5.

To automate this inference, the sync points must be detected. To this aim, we reduce our problem to the lowest common ancestor (LCA) problem [25] to find sub-automata which have at least two paths with the same source and destination. After finding the desired sub-automaton, the set of precedence rules for each trace between the two sync points is extracted and compared with the other sets to derive the mandatory rules. Then we append new paths which contain new orders of actions while preserving the strict rules. In effect, we automatically induce the essential orders among transitions and relax the unnecessary ones. After applying this step, the two new states s12 and s13 and the dashed transitions are added to Fig. 4c. Trivially, the transition systems resulting from these two generalization steps (steps 4 and 5) weakly simulate M1, because no state or transition was removed.

3.4. Pseudocode of behavioral model identification algorithm

Algorithm 1 shows the pseudocode of our methodology. Building the initial transition system takes linear time in the total number of input actions (lines 2-6). For each input trace π, the projection partitioning of its states is computed for all flows in O(len(π)); therefore, the corresponding component states of all the states are obtained in O(Σ_{π∈AT} len(π)) (lines 8-13). Owing to the tree-like structure of the initial transition system and the absence of both τ and nondeterministic actions in the transition systems of the components, we implicitly obtain a minimal weak simulation relation by simultaneously traversing the initial transition system and the transition systems of the components: we relate each state of the initial transition system with exactly one state of each component, and at each traversal step each transition of the initial transition system is matched with at most one transition of a component. For the second step of the method, we use the vectors v and count to store the state vectors before and after applying the counter abstraction technique, respectively.
3.4. Pseudocode of behavioral model identification algorithm

Algorithm 1 presents the pseudocode of our methodology. Obviously, building the initial transition system takes linear time in the total number of input actions (lines 2-6). For each input trace π, the projection partitioning of its states is computed for all flows in O(len(π)). Therefore, the corresponding component states of all the states are computed in O(Σ_{π∈AT} len(π)) (lines 8-13). Due to the tree-like structure of the initial transition system and the absence of both τ and nondeterministic actions in the transition systems of the components, we implicitly obtain a minimal weak simulation relation by simultaneously traversing the initial transition system and the transition systems of the components: we relate each state of the initial transition system with exactly one state of each component, and at each traversal step, each transition of the initial transition system is matched with exactly one transition of a component.

For the second step of the method, we use the vectors v and count for storing the state vectors before and after applying the counter abstraction technique, respectively. The counting operation, which calculates for each state the number of flows residing in the same component state, is performed during the previous step (line 13). Finding the equivalent states (states with the same counter abstraction key) and merging them is achieved with a hash structure; therefore, it is done in linear time in the number of states (lines 15-18). Let parents(s) denote the states with a transition whose destination is s, and incomes(s) denote the incoming transitions of the state s. Due to the tree-like structure of M0, the result of applying these functions to the states in this step has a single member. Finding the equivalent flow identifiers (step 3) is done during the merging step (lines 20-23). After the merging phase, new flow identifiers are assigned according to their equivalence relation by the method renumberingFIDs() (line 23). The fourth step adds the self-loops in O(SL × Σ_{π∈AT} len(π)), where SL is the number of self-loops in all the components; it is assumed that the self-loops are found as the result of preprocessing and stored in a set called selfloops (lines 25-28). We can find an upper bound M on the number of flows of the execution traces of an application, namely M = Max_{π∈AT}(F_π), where Max is the maximum operator. Then, the total time complexity of steps 1-4 equals O((M + SL) × Σ_{π∈AT} len(π)). Since M can be assumed to be a constant for an application and SL is not application-dependent, the total time complexity of these steps is linear in the size of the input.

The last step is an extra step to generalize the model. In this step, finding the sync points and generalizing each sub-automaton takes exponential time in the size of the input (lines 30-37). In the pseudocode, LCA is the abbreviation for least common ancestor. The method allPrecedenceRules() extracts all the precedence rules between the actions, and the method essentialRules() selects the necessary ones by eliminating two-way precedence rules. The method add() adds the new inferred branches (newBranches), derived with regard to the essential rules, to the automaton (line 37). All the steps except step 5 take linear time in the input size. As the learning process for each model is performed only once, the exponential time complexity of the last step is tolerable. Moreover, in our experiments, this step increases the precision of the result by about 35% in some cases.

4. Trace classification

In this section, we elaborate on how the identified models can be used to discover which applications have generated a given action trace.


Algorithm 1 Pseudocode of model identification.
Input: action traces (AT) with F flows, K components TS_i = (S_i, Act_i, →_i, s0_i, ↓_i)
Output: transition system TS = (S, Act, →, s0, ↓)
1:                                                        ▷ Building the initial transition system
2:  s0 = NewState(), S = {s0}, Act = ∅, → = ∅, ↓ = ∅
3:  for all π ∈ AT do
4:      for all j ∈ [1, len(π) − 1] do
5:          Act = Act ∪ {π_j}, S = S ∪ {s_{π_j}}
6:          → = → ∪ {(s_{j−1}, π_j, s_{π_j})}
7:
8:  new vector v[][] with the size of |S| × F, initialized by 0
9:  new vector count[][] with the size of |S| × |∪_{i≤K} S_i|, initialized by 0
10: for all s ∈ S do
11:     for all f ≤ F do
12:         v[s][f] = StateofComponent(v[parents(s)][f], Component(f), incomes(s))
13:         count[s][v[s][f]]++
14:
15: for all s ∈ S do                                      ▷ Finding the equivalent states
16:     HashMap.add(count[s], s)
17: for all key ∈ HashMap do                              ▷ Merging the equivalent states
18:     MergeStates(HashMap.get(key))
19:
20: for all s ∈ S do                                      ▷ Inferring flow identifiers
21:     if ∃ t ∈ S, (s, α(f_x), t), (s, α(f_y), t) then
22:         set f_x ≡ f_y
23: renumberingFIDs()
24:
25: for all (state sl, transition α) ∈ selfloops do       ▷ Completing the transitions by adding self-loops
26:     for all s ∈ S do
27:         if count[s][sl] > 0 then
28:             → = → ∪ {(s, α, s)}
29:
30: for all s ∈ S do                                      ▷ Generalizing by relaxing unnecessary orders
31:     if inDegree(s) > 1 then
32:         for all p1, p2 ∈ parents(s) do
33:             s_anc = LCA(p1, p2)
34:             aut = subAutomaton(s_anc, s)
35:             AC = allPrecedenceRules(aut)
36:             EO = essentialRules(AC)
37:             TS.add(newBranches(aut, EO))

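To make lines 8-18 of Algorithm 1 concrete, the following minimal Java sketch (ours; the state and vector encodings are illustrative assumptions, not the authors' released code) groups the states of the initial transition system by their counter-abstraction key — the vector counting, for each component state, how many flows reside in it — using the key as a hash-map key, so that equivalent states can be collected for merging in linear time.

import java.util.*;

// Sketch of the counter-abstraction grouping: states whose flows populate the
// component states with the same multiplicities share a key and are merged.
class CounterAbstractionMerge {

    // count.get(s) is the vector count[s][componentState] for state s.
    static Map<List<Integer>, List<Integer>> groupByKey(Map<Integer, int[]> count) {
        Map<List<Integer>, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : count.entrySet()) {
            List<Integer> key = Arrays.stream(e.getValue()).boxed().toList();
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Toy example: three states over two component states; s1 and s2 share the key (1, 1).
        Map<Integer, int[]> count = Map.of(
                0, new int[]{0, 0},
                1, new int[]{1, 1},
                2, new int[]{1, 1});
        groupByKey(count).forEach((key, states) ->
                System.out.println("merge states " + states + " with key " + key));
    }
}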
Indeed, given an input trace π = π1 π2 ... πn and a set of behavioral models {M1, M2, ..., Mm}, respectively for the applications {App1, App2, ..., Appm}, the problem is to identify those applications ({Appx, Appy, ..., Appz}) whose models have a subsequence in the input trace such that there is a consistent binding between the flow identifiers of the subsequence and the model. Since applications may exploit common components, they may have actions in common, so there is uncertainty in determining the source of each action in the interleaved traffic of a number of applications. Moreover, the first action of the execution trace of each application is not definitely known in the given input. Besides, models are parametrized by flow identifiers, which are symbolic values, so we need to find an appropriate binding between the flow identifiers of the model and the input trace. Thus, the number of different alternatives which should be considered is exponential, and examining all these possible situations is time- and memory-consuming.

We utilize runtime verification techniques in a novel three-step approach, instead of considering all possible bindings, to enhance the performance of the matching process for a trace. First, we ignore the flow identifiers of the input trace and the behavioral models and use runtime verification techniques to determine a set of candidate models which are non-parametrically satisfied by the input trace. These models have the chance of being the origin of a subsequence of the input trace if they also pass the parametric consistency check in the further steps. As discussed before, the number of different bindings of each flow identifier of a model to a flow identifier of the input trace is exponential with respect to the number of flows. As the models are large, this process is highly time-consuming. Hence, at the second step, we construct an abstract symbolic representation of each model to find the candidate bindings. This abstraction only contains the first transition of the flows of its corresponding model, so as to express the possible flow initiation orders in the related application. We compare these orders with those of the input trace and report as candidate orders the ones for which a parameter binding with a subsequence of the input can be found. Therefore, the number of different bindings to be considered decreases. Finally, at the third step, we extract a set of sub-automata from each candidate model that subsume a candidate order of the initial actions of flows found in the previous step. Such a reduction limits the parametric consistency checking process to just the part of the model relevant to a candidate order.


The checking process inspects the validity of the parameter binding suggested by the candidate order by verifying each sub-automaton against the input trace with a runtime verification technique.

Since deriving the abstract symbolic representations of the models is independent of the input trace, we perform this step in a preprocessing phase. To this aim, we scan each model to obtain its flow orders in terms of their initial actions as a tree-like graph called a flow dependency graph (FDG). In the following, we address each step in detail. In the next subsections, we assume that the transition system of Fig. 4c is given as the behavioral model and the trace of Fig. 6c is the input trace that should be checked, and we apply each step to this running example.

Preprocessing: building a symbolic representation abstraction. In this phase, we abstract all the behavioral models to obtain their symbolic representations (as flow dependency graphs (FDGs)), which show the dependencies among the possible flows of their applications. As mentioned before, each trace of a model is an interleaving of different flows, which can be characterized by an ordering among the initiating actions of its flows. If the input trace has a subsequence that is an execution trace of a model, then the initiation order of flows of such a subsequence should be a member of the initiation orders of flows of the traces of the origin model. Therefore, each model can be abstracted by an FDG to represent such orderings of its traces. Hence, the derived FDGs can be used as blueprints to find the right association between the flows of an input subsequence and the ones of a model.

To derive an FDG from a model, we abstract the model away from the non-first actions of flows in order to capture the initiation orders of flows. For each identified model M, we assume that the total number of its flows is F_M. We define the first action (with its source and destination states) of a flow f ≤ F_M in an execution fragment η ∈ Frags(M) as initiator(f, η) = {(s, α(f), s') | ∃ η_i = (s, α(f), s') ∧ ∄ j < i : η_j = (t, α(f'), t') ∧ f' = f}. Also, we assign an index to each initiator initiator(f, η) according to its location amongst the other initiators of the execution fragment η, denoted by 1 ≤ Index(initiator(f, η)) ≤ F_M. Then, we define the set of nodes and the set of edges of the FDG of M = (S_M, Act_M, →_M, s0_M, ↓_M) as follows:

Nodes(FDG_M) = {init} ∪ {(s, α(f), s') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s')}

Edges(FDG_M) = {init → (s, α(f), s') | ∃ η ∈ Frags(M), ∃ initiator(f, η) = (s, α(f), s'), Index(initiator(f, η)) = 1}
  ∪ {(s, α(f), s') → (t, α(f'), t') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s') ∧ initiator(f', η) = (t, α(f'), t') ∧ Index(initiator(f', η)) = Index(initiator(f, η)) + 1}

Final(FDG_M) = {(s, α(f), s') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s'), ∀ f' ≠ f ∧ f' ≤ F_M : Index(initiator(f, η)) > Index(initiator(f', η))}

The derived FDG of the running example (the transition system of Fig. 4c) is shown in Fig. 6a. Each node in this graph shows the first transition of a flow in an execution fragment. The final nodes (members of Final(FDG_M)) are marked in the figure. We remark that an FDG can be derived from a model by traversing the model in a depth-first fashion. We apply this procedure to all the behavioral models of the applications and obtain a set of FDGs to be used in the second step of the classification.
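The depth-first derivation just mentioned can be sketched as follows; this is a minimal illustration of ours (class and method names are hypothetical, not the authors' released code), and it omits the computation of the Final set for brevity. An FDG edge is recorded whenever a transition initiates a flow that has not yet been seen on the current path.

import java.util.*;

// Sketch of FDG derivation: traverse the learned model depth-first and, whenever a
// transition carries a flow identifier not seen on the current path, record it as the
// next initiator, linked to the previous initiator of that path.
class FdgBuilder {

    record Transition(int src, String symbol, int flow, int dst) { }

    // adjacency: outgoing transitions per state of the learned model
    static Map<String, Set<String>> buildFdg(Map<Integer, List<Transition>> adjacency, int initial) {
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        dfs(adjacency, initial, "init", new HashSet<>(), new HashSet<>(), edges);
        return edges;
    }

    static void dfs(Map<Integer, List<Transition>> adj, int state, String lastInitiator,
                    Set<Integer> initiatedFlows, Set<Integer> onPath, Map<String, Set<String>> edges) {
        if (!onPath.add(state)) return;                         // avoid cycles along the current path
        for (Transition t : adj.getOrDefault(state, List.of())) {
            boolean isInitiator = initiatedFlows.add(t.flow()); // first action of this flow on the path
            String next = lastInitiator;
            if (isInitiator) {
                next = "(" + t.src() + "," + t.symbol() + "(" + t.flow() + ")," + t.dst() + ")";
                edges.computeIfAbsent(lastInitiator, k -> new LinkedHashSet<>()).add(next);
            }
            dfs(adj, t.dst(), next, initiatedFlows, onPath, edges);
            if (isInitiator) initiatedFlows.remove(t.flow());   // backtrack
        }
        onPath.remove(state);
    }
}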
Step 1 – Non-parametric categorization. For each application, it is required to determine whether it has the chance of being an origin of the input trace or not. A model has a chance if a subsequence of the input belongs to the traces of the model while their parameters are ignored. As a consequence of this step, the set of candidate models is reduced. Hence, for this step, the inputs are the given set of generated models {M1, M2, ..., Mm} and a finite action trace π, and the output is the reduced set of candidate models {Mi, Mj, ..., Mk}.

There are some challenges that we should consider in this phase. Because of the interleaving of the execution traces of several applications, a trace of a model is observed as a subsequence of the input trace. It means that some extra actions may appear between two consecutive actions of a trace of an application as a result of the concurrent execution of various applications. We have utilized the runtime verification technique to detect a subsequence of the input as a trace of the model. The behavioral models can be considered as regular properties that should hold for the given input trace. Therefore, we present a runtime monitor for each model that detects the satisfaction of its corresponding completed property by the input trace. To this aim, the monitor traverses the model and changes its state by matching the actions of the trace with the transitions of the property. As a traversed transition may belong to another application, the monitor should also track the possibility of not matching an action. Therefore, the monitor maintains a set of states during its traversal as its current states. Initially, the set of current states only contains the initial state of the model. Then, the monitor takes each action of the input iteratively and updates its set of current states based on the following main principles, which overcome the interleaved essence of the input:

1. Maintain all possible states: Due to uncertainty in matching an action of the input trace with some actions of a model, the origin state of each matched transition should again be included in the set of current states if the matched input was not its sole transition. Furthermore, since the first action of the execution trace of an application is not clearly specified, the initial state always remains in the set of current states.


Table 4
The step-wise application of the non-parametric classifier on the running example.

Iteration No. | Input | Current states | Triggered?
0  | – | [s0] | No
1  | a | [s0, s1] | No
2  | m | [s0, s1] | No
3  | x | [s0, s2, s3] | No
4  | a | [s0, s1, s2, s3] | No
5  | x | [s0, s2, s3, s5] | No
6  | y | [s0, s2, s3] | No
7  | a | [s0, s1, s2, s3] | No
8  | b | [s0, s1, s2, s4] | No
9  | a | [s0, s1, s2, s3, s4] | No
10 | y | [s0, s1, s2, s3, s4, s6] | No
11 | n | [s0, s1, s2, s3, s4, s6] | No
12 | c | [s0, s1, s2, s3, s4, s6, s7, s8] | No
13 | d | [s0, s1, s2, s3, s4, s6, s9, s10] | Yes

2. Maximal progress: If an action of the input trace can be triggered by some states in the set of current states, it should be triggered. It means that the set of current states should be updated as soon as possible to satisfy the maximal progress constraint.

A model is selected as a candidate if its monitor visits one of the final states during its traversal.

Algorithm 2 Pseudocode of the proposed non-parametric runtime monitor.
Input: inferred model TS = (S, Act, →, s0, ↓), input trace π
Output: true or false
1: currStates = {s0}
2: for all j ∈ [1, len(π)] do
3:     for all s ∈ currStates do
4:         if ∃ (s, π_j, s') then
5:             if outDegree(s) = 1 then currStates.remove(s)
6:             currStates.add(s')
7:             if s' ∈ ↓ then return true
8: return false
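For concreteness, the following is a small, self-contained Java rendering of Algorithm 2 written by us for illustration (the state and transition encodings are assumptions, not the authors' released code). It works on a copy of the current-state set in each iteration so that the remove/add operations of lines 5-6 do not disturb the set being traversed.

import java.util.*;

// Sketch of the non-parametric monitor of Algorithm 2: keep a set of current states,
// drop the source of a matched transition only if the matched action was its sole
// outgoing transition, and report success as soon as a final state is reached.
class NonParametricMonitor {

    record Transition(int src, String symbol, int dst) { }

    static boolean matches(List<Transition> transitions, Set<Integer> finalStates,
                           int initialState, List<String> trace) {
        Set<Integer> currStates = new HashSet<>(List.of(initialState));
        for (String action : trace) {
            Set<Integer> next = new HashSet<>(currStates);   // work on a copy of the current set
            for (int s : currStates) {
                List<Transition> out = transitions.stream().filter(t -> t.src() == s).toList();
                for (Transition t : out) {
                    if (!t.symbol().equals(action)) continue;
                    if (out.size() == 1) next.remove(s);      // sole transition: drop the source state
                    next.add(t.dst());
                    if (finalStates.contains(t.dst())) return true;
                }
            }
            currStates = next;
        }
        return false;
    }
}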

Algorithm 2 describes the pseudocode of the non-parametric runtime monitor. It can be easily shown that the given algorithm is complete, i.e., if a given input trace has a subsequence that is a trace of a model, the monitor of the model will detect it.

Proposition 1. Given an input trace containing a subsequence of a model, its monitor, described by Algorithm 2, will detect it.

Proof. The proof is by induction on the length of the subsequence. Assume that the proposition holds for subsequences of length less than or equal to n. Now we prove it for subsequences of length n + 1. Assume that for the given trace, the monitor has matched the first n elements of the subsequence. If we assume that the state visited after matching the first n elements of the subsequence is s, then by the induction hypothesis s ∈ currStates (as we can make this state final to accept the prefix of the subsequence of length n). The next actions π_j are ignored until ∃ (s, π_j, s') with s' ≠ s. According to line 6 of the algorithm, s' ∈ currStates, which makes the monitor move to s' irrespective of the fact that π_j may not be the next action in the subsequence. If π_j is the sole action for which s has a transition, then π_j is definitely the next action in the subsequence. Otherwise, according to line 5, s still remains in currStates. So s remains in currStates until a π_j is received which is symbolically the same as the (n + 1)-th action of the subsequence. Therefore, the algorithm definitely returns true. □

The step-wise application of the proposed non-parametric runtime monitor on the running example is shown in Table 4. After thirteen iterations and before reading the last action of the trace, it is recognized that the given trace, without considering its parameters, contains a subsequence that is a trace of the model. The table shows the changes of the set of current states at each step.

Step 2 – Determining the candidate flow orders. This step leverages the symbolic representation abstraction of the candidate model of each application to provide hints for relating its flows to the flows of the input trace. Indeed, an FDG illustrates the acceptable orders of the flows of its corresponding application. Since the flow identifiers in an FDG state are symbolic, we first have to relate the flow identifiers of the FDGs to their corresponding ones in the input trace. This step is performed for the applications separately and in parallel.


Table 5
The list of candidate flow orders, the flow identifier bindings, and the result of validation for the running example. The left and middle columns are obtained as the result of step 2 and the right column is achieved after applying step 3 of the classifier.

Candidate flow order | Flow identifiers binding | Result of validation
(s0, a(11), s1) → (s1, x(13), s3) → (s3, x(15), s5) | {(1 → 11), (2 → 13), (3 → 15)} | ×
(s0, a(11), s1) → (s1, x(13), s3) | {(1 → 11), (2 → 13)} | ×
(s0, a(11), s1) → (s1, x(15), s3) | {(1 → 11), (3 → 15)} | ×
(s0, a(14), s1) → (s1, x(15), s3) | {(1 → 14), (3 → 15)} | ✓
(s0, x(13), s2) → (s2, a(14), s3) | {(2 → 13), (1 → 14)} | ×
(s0, x(13), s2) → (s2, a(14), s3) → (s3, x(15), s5) | {(2 → 13), (1 → 14), (3 → 15)} | ×
(s0, x(13), s2) → (s2, a(11), s3) | {(2 → 13), (1 → 11)} | ×
(s0, x(15), s2) → (s2, a(14), s3) | {(2 → 15), (1 → 14)} | ✓
(s0, x(15), s2) → (s2, a(11), s3) | {(2 → 15), (1 → 11)} | ×

For each application, the inputs are the set of derived FDGs of the candidate models {FDG_Mi, FDG_Mj, ..., FDG_Mk} and the given action trace π, and the output for each given FDG is the set of candidate flow orders together with their flow identifier bindings. To this aim, we simultaneously traverse an FDG with regard to the input trace in a depth-first fashion and annotate the states of the FDG to find the correspondence between the flows of the input trace and the flows of the FDG. For the running example, we assume that the input trace is the one of Fig. 6c and the FDG is the one of Fig. 6a. The first action of the input trace is a(11), and by a depth-first search (DFS) of the FDG, after the init state, the state (s0, a(1), s1) is selected as their actions are the same. The state is annotated by the flow identifier 11 to represent a possible binding for the symbolic flow identifier of the state, i.e., (1 → 11). Then, the next trace action m(12) is ignored, as the next FDG state is labeled by x(2), and we proceed with the next trace action, x(13). This action is matched with the next state (s1, x(2), s3), and the state is annotated by 13. As this state is a final state, the subsequence a(11) x(13) of the input trace can be a possible initiation order for the model of the FDG. To show which transitions of the model have been matched by this subsequence, the candidate initiation order (s0, a(11), s1) → (s1, x(13), s3) is generated by applying the flow identifier bindings {(1 → 11), (2 → 13)} to the matched states of the FDG. This candidate initiation order expresses which part of the model should be considered in the next step. After ignoring a(14), x(15) can be matched by the next state (s3, x(3), s5), which is annotated by 15. As this state is again a final state, the candidate initiation order (s0, a(11), s1) → (s1, x(13), s3) → (s3, x(15), s5) with the identifier bindings {(1 → 11), (2 → 13), (3 → 15)} is generated. For each action of the input trace being matched with an FDG state, the alternative of ignoring that action is also considered during the DFS traversal. For instance, by ignoring x(13) in the state (s0, a(1), s1), x(15) can be matched with the state (s1, x(2), s3), and hence the candidate initiation order (s0, a(11), s1) → (s1, x(15), s3) results. The left column of Table 5 shows the candidate flow orders of the running example. Note that no binding for the flow identifier of the state (s2, m(4), s11) is generated, because the input trace does not have any subsequence in which action m follows x.
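A simplified sketch of ours of this step-2 traversal follows (the data types and the restriction to a single linear FDG branch are our simplifications; the authors' FDG is a tree and their implementation may differ). At each input action the traversal either binds the action to the next FDG node when the symbols match — recording a candidate order whenever a final node is reached — or ignores the action; duplicate candidates, if any, can be filtered afterwards.

import java.util.*;

// Sketch of the step-2 traversal over one FDG branch: collect flow-identifier
// bindings for every prefix of the branch that ends in a final FDG node.
class CandidateOrders {

    // One FDG branch, e.g. [a(1), x(2), x(3)]: symbols[i] is the action name and
    // flowVars[i] the symbolic flow identifier of the i-th initiator; finals holds
    // the indices of final FDG nodes.
    record Branch(List<String> symbols, List<Integer> flowVars, Set<Integer> finals) { }

    // An observed input action, e.g. a(11): symbol "a", concrete flow identifier 11.
    record Event(String symbol, int flowId) { }

    static List<Map<Integer, Integer>> candidates(Branch branch, List<Event> trace) {
        List<Map<Integer, Integer>> result = new ArrayList<>();
        dfs(branch, trace, 0, 0, new LinkedHashMap<>(), result);
        return result;
    }

    static void dfs(Branch b, List<Event> trace, int node, int pos,
                    Map<Integer, Integer> binding, List<Map<Integer, Integer>> out) {
        if (node == b.symbols().size() || pos == trace.size()) return;
        Event e = trace.get(pos);
        if (b.symbols().get(node).equals(e.symbol())) {     // match: bind the symbolic flow id
            binding.put(b.flowVars().get(node), e.flowId());
            if (b.finals().contains(node))                  // a candidate flow order ends here
                out.add(new LinkedHashMap<>(binding));
            dfs(b, trace, node + 1, pos + 1, binding, out);
            binding.remove(b.flowVars().get(node));
        }
        dfs(b, trace, node, pos + 1, binding, out);         // alternatively, ignore this input action
    }
}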
Step 3 – Validating the parametric bindings. Each candidate flow initiation order obtained in the previous step suggests a binding between the parameters of a model and the input trace. The suggested flow identifier bindings for our running example are illustrated in the middle column of Table 5. Hence, for this step, the inputs for each application with model M are the set of candidate flow orders together with their flow identifier bindings and the given action trace π, and the output is whether the model M can be an origin of π or not. The overall output of this step is the set of applications {Appx, Appy, ..., Appz} whose models match the given input trace. To validate each parameter binding of a model M, we again use the runtime verification technique, this time considering a larger part of the model (not just the initials of flows) together with the parameters.

Beforehand, for each flow initiation order o_i, we prune the model M such that only the behaviors that are a consequence of the given initiation order o_i are possible. We remark that if a model accepts a subsequence of the given input, then the model obtained by removing its self-loops also accepts a subsequence. Thus, we can extract the largest self-loop-free sub-automaton aut_i of the behavioral model M which has been symbolically abstracted by o_i. During the extraction, the parameter binding suggested by o_i is applied by adjusting the parameters accordingly. If the input trace satisfies aut_i as the property, the order o_i is confirmed. We check this satisfaction problem via a parametric runtime monitor and execute it for all the candidate orders to determine the accepted ones. Our parametric runtime monitor is exactly the same as the non-parametric runtime monitor described by Algorithm 2, except that the parameters of actions are also considered in line 4.

To extract the largest self-loop-free sub-automaton, we first remove all the transitions which are not involved in the candidate initiation flow order, e.g., the transitions (s0, x(2), s2), (s2, m(4), s11), (s2, a(1), s3) and (s3, x(3), s5) for the order (s0, a(11), s1) → (s1, x(13), s3) of the running example. Hence, the sub-automaton only contains the transitions with those flow identifiers which exist in the given flow order. Subsequently, the model is pruned by including only the states and transitions reachable in a depth-first traversal from the initial state, while the suggested bindings are applied to the actions. The sub-automaton of the model of Fig. 4c for the order (s0, a(11), s1) → (s1, x(13), s3) is shown in Fig. 6b.
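The pruning just described can be sketched as follows (our illustration; names are hypothetical, not the authors' released code): keep only the transitions of the bound flows, drop self-loops, substitute the concrete flow identifiers of the binding, and retain the part reachable from the initial state.

import java.util.*;

// Sketch of the step-3 pruning: restrict the model to the flows of a candidate order,
// drop self-loops, rename the symbolic flow identifiers according to the suggested
// binding, and keep only what is reachable from the initial state.
class SubAutomatonExtractor {

    record Transition(int src, String symbol, int flow, int dst) { }

    static List<Transition> extract(List<Transition> model, int initial, Map<Integer, Integer> binding) {
        // keep transitions of bound flows only, excluding self-loops, with concrete flow ids
        List<Transition> kept = new ArrayList<>();
        for (Transition t : model)
            if (binding.containsKey(t.flow()) && t.src() != t.dst())
                kept.add(new Transition(t.src(), t.symbol(), binding.get(t.flow()), t.dst()));

        // keep only the part reachable from the initial state (depth-first traversal)
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>(List.of(initial));
        while (!stack.isEmpty()) {
            int s = stack.pop();
            if (!reachable.add(s)) continue;
            for (Transition t : kept)
                if (t.src() == s) stack.push(t.dst());
        }
        List<Transition> result = new ArrayList<>();
        for (Transition t : kept)
            if (reachable.contains(t.src())) result.add(t);
        return result;
    }
}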


Fig. 6. FDG of the model given in Fig. 4c and its sub-automaton considered for classification of the given input.

The result of the parametric binding validation for the running example is shown in the right column of Table 5. The verification fails for the order (s0, a(11), s1) → (s1, x(13), s3), since the input does not include the three expected actions y(13), c(11) and d(11) of the automaton in Fig. 6b after the occurrence of b(11). We discovered that, for this example, the flow orders (s0, a(14), s1) → (s1, x(15), s3) and (s0, x(15), s2) → (s2, a(14), s3) are accepted.

4.1. Completeness of the approach

We prove that our approach is complete: if there is a subsequence of an application model in the input, our approach will definitely detect it. In contrast, other approaches such as statistical methods may not even detect the traces of their own training dataset because of their statistical measurements.

Theorem 3. The proposed classification is complete. In other words, there is no application A whose model contains an execution trace π as a subsequence of the input while A has not been reported by the classifier.

Proof. Assume that there exists an application A and an execution trace π of A as a subsequence of the input. Then, by Proposition 1, it passes the first step of our approach, and hence the model of A is selected as a candidate model. Recall that ignoring an action is a possible operation for matching each trace action during the depth-first traversal of the FDG of the model of A. Therefore, all actions except the ones related to the initiation flow order of π can be ignored, and the order is definitely reported by the second step. As the parametric runtime monitor is similar to the non-parametric one, again by an application of Proposition 1, application A will be reported by the classifier. □

It can be shown that our approach is not sound. Soundness would mean that if an application model is reported as an origin of the input, then there exists a trace of the model as a subsequence of the input. Since the input trace is an interleaving of the executions of different applications, some actions of another application may be wrongly recognized as actions of the reported application and, consequently, a subsequence of the input is recognized as a trace of the model (a false positive result). The other classification approaches are not sound either. What distinguishes our approach from the others is classification in terms of application behaviors. Our experiments show that our false positive rate is better in comparison with competing approaches (see Section 5.3). The precision of our behavioral model minimizes our false positive results substantially.

5. Application to the network domain: traffic classification

To illustrate the applicability of our approach, we have applied our method to the traffic classification problem in the network domain. The importance of traffic classification for network administration tasks, such as ensuring the security and quality of service of applications in computer networks, has long been acknowledged [26]. Port-based classification became inefficient due to the use of random or non-standard ports. To this aim, payload inspection [27,28] and statistical methods [29,30] were proposed. These techniques suffer from high computation cost and inefficiency, which led to the development of behavioral classifiers. The payload inspection approaches examine the content of packets.


Fig. 7. Layers of the TCP/IP model and flow concept.

Therefore, they are useless for encrypted traffic. The statistical methods classify traffic quickly, according to statistical information of packet traces, but with lower accuracy in comparison with the others. The behavioral classifiers, however, rely on behavioral aspects of applications and are useful for encrypted traffic or when we have no pre-assumption about the application protocol. In this paper, some network concepts have been utilized, which are explained in Section 5.1. We present the customized Preprocessor component for the network domain in Section 5.2 and the obtained results in Section 5.3.

5.1. Network background

Network communications take place according to defined protocols, which over the Internet follow the TCP/IP layered architecture [31]. This model consists of four layers: Application, Transport, Internet, and Link. The layered architecture means that each layer provides services to its upper layer. The packet content of each layer is built by its upper layers, as visualized in Fig. 7a. For each layer, a number of different protocols are standardized, some of which are enumerated in Fig. 7a. Each packet transferred across a network is composed of the information of the different layers, divided into two parts: the header and the content. The header includes the control information needed by the corresponding protocol and is appended to the beginning of the content. For encrypted traffic, the content parts of packets are unreadable.

Protocols are divided into connection-less and connection-oriented categories. The connection-oriented ones establish a connection before a data transmission; thus, there are handshake (initialization) and finalization phases in these protocols. These phases are not required in the connection-less protocols, which just send a request packet for each desired piece of data. Therefore, we divide the operation of each protocol into a set of control phases to abstractly capture its progress. The phases are assumed to be Init, Data, and Fin for connection-oriented protocols and Init and Data for connection-less ones. Intuitively, Init indicates the establishment of a connection or a query packet, Data shows the transmission of data, and Fin indicates the termination of a connection. We exploit these control phases to specify the behavioral models of the protocols.

A network application relies on different protocols at the different layers, which can be viewed as the components of a multi-component system. Thus, a flow is a trace of a protocol model, symbolically representing a sequence of packets which have the same value for the parameters source IP, source port, destination IP, destination port, and protocol name. An execution of an application gives rise to a number of flows. These flows are the connections which are established between the initiating system and another end system. For instance, Fig. 7b illustrates the execution of a software application which has established four flows. The flows with the protocol names TCP and HTTP have the same destination, while the flows with the protocol names FTP and DNS also differ in the destination of their connections. By applying our approach, the relations amongst the flows are considered to classify traffic, as opposed to statistical classification methods, which only employ statistics of flows.

5.2. Customizing the Preprocessor component

To apply the proposed method to the traffic classification problem, the basic components and events are recognized as the well-known protocols and the transferred packets, respectively.
Thus, we have only customized the capturing approach and the Preprocessor component. Intuitively, each network application needs to establish a number of connections with other systems in order to perform a function. Each connection follows a protocol specification. For instance, an execution of the Map application of Windows 8 contains four flows, where two are of the DNS protocol, one is of the TCP protocol, and one is of the TLS protocol. Hence, the well-known protocols can be considered as the components of a network application, and the transition systems of these protocols should be provided to the component Model Generator (cf. Fig. 2).
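As an illustration only — this is a hand-written toy example of ours, not one of the KTail-learned protocol specifications used in the experiments — such a connection-oriented component could be supplied to the Model Generator as a small transition system over its control phases and directions:

import java.util.*;

// Toy transition system of a connection-oriented protocol component over its control
// phases (Init, Data, Fin) and directions (I = received, O = sent). Illustrative only.
class TcpLikeComponent {

    record Transition(String src, String action, String dst) { }

    static List<Transition> transitions() {
        return List.of(
                new Transition("idle",        "TCPInitO", "handshaking"),  // initiate the handshake
                new Transition("handshaking", "TCPInitI", "established"),  // handshake completes
                new Transition("established", "TCPDataO", "established"),  // data transfer self-loops
                new Transition("established", "TCPDataI", "established"),
                new Transition("established", "TCPFinO",  "closing"),      // finalization phase
                new Transition("closing",     "TCPFinI",  "closed"));
    }
}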


We employ Wireshark^1 to capture the generated traffic of different executions of an application. The captured traffic is the sequence of packets sent or received as the result of executing an application during a specified time. Each packet contains a data part and multiple headers because of layer encapsulation. We only consider the information of the uppermost layer of each packet instead of the whole packet. For instance, we only take the application-layer information of HTTP packets into account, although these packets also subsume information of the TCP layer. In the absence of application-layer data, the transport-layer information is considered instead.

The component Preprocessor of Fig. 2 reassembles packets and omits useless and application-independent packets (e.g., DNS packets) to prepare the captured traffic for the component Model Generator. It also applies a mapping function to translate the packets to their corresponding actions. The function Mapper is defined as Mapper : Packets → Act × N, where Packets is the set of possible captured packets. Each packet is represented by a symbol (indicating the abstraction of the packet information) and a natural number (indicating its flow identifier). The symbols are defined as the concatenation of the packet protocol name, the name of its control phase, and its direction. Recall that the control phase can be either Init, Data, or Fin. The direction is a binary tag which can be either I or O, indicating that the packet is received or sent, respectively. For example, a received TCP packet in the handshake phase is mapped to TCPInitI, which is a member of the model symbol set. The amount of detail about packets carried over into the corresponding actions determines how sensitive the final generated model is to packet variations.

Each packet of the application execution belongs to a flow. To determine the flow identifier of a packet, we examine whether there is at least one former packet of a flow i with the same values for the five attributes source IP, source port, destination IP, destination port, and protocol name; in that case, we associate the flow identifier i with the new packet. Otherwise, we increase the number of flows and associate a new identifier with the new packet.
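The Mapper and the flow-identifier assignment described above can be sketched as follows (our illustration; the class, record, and field names are assumptions, not the authors' released code): the action symbol concatenates the protocol name, the control phase, and the direction tag, and a packet inherits the flow identifier of an earlier packet with the same five-tuple or receives a fresh one.

import java.util.*;

// Sketch of the Preprocessor's Mapper: Packets -> Act x N. The symbol is the protocol
// name + control phase + direction tag; the flow identifier is shared by all packets
// with the same (srcIP, srcPort, dstIP, dstPort, protocol) five-tuple.
class PacketMapper {

    record Packet(String srcIp, int srcPort, String dstIp, int dstPort,
                  String protocol, String phase, boolean incoming) { }

    record Action(String symbol, int flowId) { }

    private final Map<List<Object>, Integer> flowIds = new HashMap<>();
    private int nextFlowId = 1;

    Action map(Packet p) {
        String symbol = p.protocol() + p.phase() + (p.incoming() ? "I" : "O");
        List<Object> fiveTuple = List.of(p.srcIp(), p.srcPort(), p.dstIp(), p.dstPort(), p.protocol());
        int flowId = flowIds.computeIfAbsent(fiveTuple, k -> nextFlowId++);
        return new Action(symbol, flowId);
    }

    public static void main(String[] args) {
        PacketMapper mapper = new PacketMapper();
        // A received TCP handshake packet is mapped to TCPInitI with a fresh flow identifier.
        Packet p = new Packet("10.0.0.2", 51514, "93.184.216.34", 443, "TCP", "Init", true);
        System.out.println(mapper.map(p));   // prints Action[symbol=TCPInitI, flowId=1]
    }
}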

5.3. Experimental results

To evaluate our approach, we formulate four research questions:

• RQ1: Is our approach, tailored for the network domain, applicable to real-world applications?
• RQ2: How much do the steps of our learning process improve the learned model? How much does parameterizing the learned models with flow identifiers improve our results?

• RQ3: How successful is our classification approach on interleaved packet traces rather than pure traffic?
• RQ4: How useful is our behavioral model in classifying pure traces in comparison with other classification techniques?

To investigate RQ1, we have implemented the model identification and classification in the Java programming language. The code is available at https://github.com/zsabahi/multi-component-identification and https://github.com/zsabahi/multi-component-classification, respectively. Three categories of applications, version control systems, remote desktop sharing, and on-line chat, are selected for our experiments. For the first category, the two applications TortoiseSVN client of SVN^2 and SourceTree client of GIT^3 are selected. The traffic of the update command of these applications is gathered as their captured packet traces. We have also selected two remote desktop sharing applications, namely TeamViewer^4 and JoinMe^5, because their traffic is encrypted; their traffic cannot easily be identified by signature-based approaches that read the content of packets. For the last category, Skype^6 and VSee^7 are chosen. Skype has been considered in many traffic classification studies because it uses an unknown application-layer protocol [32,33]. Each application is run 100 times and its network traces are captured using the Wireshark tool. By increasing the number of experiments, the generated model is definitely improved, as the probability of observing a new behavior is increased. Some preprocessing operations have been performed to eliminate repetitive and truncated packets. We have also reassembled the segments of fragmented packets. Packets of the application- and transport-layer protocols used by these applications, namely TCP, SSL, SSLv2, TLSv1, TLSv1.2, HTTP, and UDP, have been considered and the others are filtered out. We construct the specifications of these protocols by an implementation of the KTail automata learning algorithm with K = 3, which merges states that have the same 3-future (values between 2 and 4 are often used [34–36]).

To evaluate the result of our approach for each application, we have considered the true positive rate (TPR), the false positive rate (FPR), and the time for generating and testing the model. To this end, we use the cross-validation technique for 100 iterations to calculate our metrics. Since measuring the real value of the false positive rate is impossible (it is not possible to gather all negative traces), we consider the traces of another application which has the same functionality; thus, we use the traces of the applications in the same category crosswise to calculate the false positive rates. To measure the TPR metric, we test the learned model of each application with its own traces.

1 https://www.wireshark.org/
2 https://tortoisesvn.net/
3 https://www.atlassian.com/software/sourcetree
4 https://www.teamviewer.com/en/
5 https://www.join.me/
6 https://www.skype.com/en/
7 https://vsee.com/


Table 6
The average result for step-wise applying the proposed network model identifier, run on a system with a Core i7 CPU and 8 GB RAM. The train and test times are given once per step.

Step | App. | States num. | FPR (non-param.) | TPR (non-param.) | FPR (param.) | TPR (param.) | Train time | Test time
Initial transition system | SVN | 3982 | 0 | 0.02 | 0 | 0.02 | <2 sec | <1 sec
 | GIT | 4115 | 0 | 0.01 | 0 | 0.01 | |
 | TeamViewer | 8637 | 1 | 0 | 0 | 0 | |
 | JoinMe | 34484 | 0 | 0 | 0 | 0 | |
 | Skype | 14450 | 0 | 0 | 0 | 0 | |
 | VSee | 14630 | 0 | 0 | 0 | 0 | |
Applying counter abstraction | SVN | 78 | 0.95 | 0.55 | 0 | 0.55 | <5 min | <1 sec
 | GIT | 45 | 0 | 1 | 0 | 0.90 | |
 | TeamViewer | 407 | 0 | 0.36 | 0 | 0.36 | |
 | JoinMe | 5458 | 0 | 0.25 | 0 | 0.25 | |
 | Skype | 64 | 0 | 0.88 | 0 | 0.77 | |
 | VSee | 38 | 0 | 0.91 | 0 | 0.90 | |
Completing self-loop transitions | SVN | 78 | 0.97 | 1 | 0 | 0.94 | <2 min | <1 sec
 | GIT | 45 | 0.97 | 1 | 0 | 0.95 | |
 | TeamViewer | 407 | 0 | 0.98 | 0 | 0.88 | |
 | JoinMe | 5458 | 0 | 0.56 | 0 | 0.56 | |
 | Skype | 64 | 0 | 0.95 | 0 | 0.78 | |
 | VSee | 38 | 0 | 0.99 | 0 | 0.92 | |
Relaxing unnecessary orders | SVN | 130 | 0 | 1 | 0 | 0.99 | <20 hrs | <1 sec
 | GIT | 70 | 0 | 1 | 0 | 0.99 | |
 | TeamViewer | 600 | 0 | 0.99 | 0 | 0.88 | |
 | JoinMe | >7400 | 0 | 0.91 | 0 | 0.81 | |
 | Skype | 96 | 0 | 0.95 | 0 | 0.81 | |
 | VSee | 55 | 0 | 1 | 0 | 0.95 | |
We built the models 100 times, each time considering 99 traces as the train set and one trace for testing; finally, we compute the average of the test results. To measure the FPR metric, we test the learned model in each iteration with the traces of another similar application; so in each iteration, we test a model with 100 traces of another application, and we compute the average FPR over the 100 iterations.

To address RQ2, we have measured the metrics for each step of our learning process to evaluate how much each step improves the results. Furthermore, we have measured the results of our approach in two settings: in one setting, the flow identifiers are ignored, and in the other they are considered. To evaluate the effect of each step on the learned model, we have also measured the number of states. Table 6 shows the final result of our experiments. For instance, the TPR results for SVN show that for the initial non-parametrized transition system, only 2 of the 100 iterations could detect the test trace, while this number increases to 55 after applying the counter abstraction and reaches 100 after adding the self-loop transitions. We measured the FPR of SVN with the GIT traces. The results show that the initial transition system of SVN correctly rejects the traces of the other application. After generalizing the model, this property is preserved. The major point is that, by applying our proposed generalization steps, the false positive rate does not grow for the parametric classifier; our conservative approach prevents over-generalization. Each generalization step improves the completeness of the model. Adding the self-loop transitions increases our precision for all applications. Relaxing unnecessary orders increases the precision (TPR) to 91% and 81% for non-parametric and parametric models in the worst case, respectively. Some TPR results of the last step are still less than 1; for instance, for VSee, it is 0.95, meaning that our approach fails to recognize 5 percent of its test traces, which are mainly new, unpredictable traces with respect to the train set. The non-parametrized columns of the table (columns 4 and 5) illustrate the result of applying our approach without considering the flow identifiers of actions. Comparing these results with those of the parametrized models, it can be concluded that although the non-parametrized models get better TPRs, the parametrized models get lower FPRs while their TPR results are still acceptable.

To inspect RQ3, we captured the traffic of a running system for a while as the background traffic (it includes 500 packets). We injected the pure traffic of each application into the background traffic based on a uniform distribution. We have also mixed the pure traffic of the applications of each category with the background. This procedure was performed 100 times for the 100 traces of an application: at each iteration, we learned the model of the application from the remaining 99 traces and then applied our parametric classifier to the interleaved traffic. Finally, we calculated the average of the results, as shown in Table 7. By comparing the columns of this table, we can infer that the classifier precision does not decrease for the interleaved traffic, because of the completeness of our approach. However, the FPR metric increases from 0 to 0.14 in the worst case.

To answer RQ4, we first considered classical automata learning. To this aim, we have implemented the KTail algorithm and one of its improved extensions, the GKTail algorithm. The average TPR metric for K = 2, 3, 4 for all selected applications was zero, because both algorithms merge states based on their identical K-suffixes. This result indicates that these approaches merge states that are inconsistent with respect to the protocols.


Table 7
The result of the traffic classification in the wild (avg is the abbreviation for average).

Tested traffic | TPR (pure traffic) | FPR (pure traffic) | avg TPR (interleaved traffic) | avg FPR (interleaved traffic)
SVN + background | 1 | 0 | 1 | 0
GIT + background | 1 | 0 | 1 | 0
SVN + GIT + background | – | – | 1 | 0
TeamViewer + background | 0.99 | 0 | 0.99 | 0.01
JoinMe + background | 1 | 0 | 1 | 0
TeamViewer + JoinMe + background | – | – | 0.99 | 0.01
Skype + background | 1 | 0 | 1 | 0
VSee + background | 1 | 0 | 1 | 0.14
Skype + VSee + background | – | – | 1 | 0.14

Table 8
The result of statistical classification.

Application | TPR (J48) | FPR (J48) | TPR (Naive Bayes) | FPR (Naive Bayes) | TPR (Adaboost) | FPR (Adaboost) | Test time
SVN | 0.94 | 0.03 | 0.90 | 0.33 | 0.95 | 0.031 | <1 sec
GIT | 0.97 | 0.06 | 0.66 | 0.10 | 0.97 | 0.05 | <1 sec
TeamViewer | 0.95 | 0.01 | 0.76 | 0.02 | 0.78 | 0.01 | <8 sec
JoinMe | 0.98 | 0.05 | 0.98 | 0.24 | 0.99 | 0.22 | <8 sec
Skype | 1 | 1 | 1 | 1 | 0.39 | 0.10 | <19 sec
VSee | 0 | 0 | 0 | 0 | 0.90 | 0.61 | <19 sec

To clarify the applicability of our method in the network domain, we compare it to other packet classification techniques. Port-based detection and payload inspection methods do not work on our datasets because of the usage of random or non-standard ports, unknown protocols, and encrypted traffic. Furthermore, the previous behavioral classification methods are application-specific, e.g., for Skype or P2P applications like Emule, relying on their special features; hence, they are not general enough to be applied to our selected applications. Thus, statistical classification methods are the only related work comparable to our method. To this aim, Netmate^8 is used to obtain feature vectors with 45 features of the flows of the captured traffic. Then, using the Weka tool-set,^9 the TPR and FPR metrics were measured for the algorithms mostly used in the network classification literature: J48 (C4.5), Naive Bayes, and Adaboost [37,32,33,38]. The final results of these metrics are reported in Table 8. These results indicate that the precision of each statistical classifier varies from one application to another and does not follow a certain rule. For instance, Adaboost does not have an acceptable result for Skype, while it achieves the best precision for VSee amongst the classifiers. Furthermore, the maximum false positive rate of these approaches is 1 in the worst case (for Skype), which is not acceptable for a classifier. Hence, it is not reliable to select one of these algorithms as the classifier for an interleaved traffic of different applications. Comparing these results with ours confirms the applicability of our approach because of its better false positive and true positive results.

6. Related work

We categorize the related work into two sections: model learning and model matching.

6.1. Model learning

The need to identify a model has been raised in many domains in the literature, namely automata learning, protocol reverse engineering, specification mining, and process mining. We briefly explain the works in these domains.

Automata learning. We have previously reviewed the basic algorithms of passive and active approaches like Angluin's L*, KTail, and Red-Blue in Section 2. Some research has been conducted to extend the expressiveness of the models inferred by these algorithms. The KTail algorithm is extended in [39], into an algorithm called GKTail, with the aim of generating models from method invocation traces. It is assumed that the invocations of methods are parametric and that negative traces are not given. This approach proceeds in four steps. First, the traces with an identical sequence of methods (but different parameter values) are merged together. Next, constraints on parameters are obtained via the Daikon invariant detector [40]. In the third step, a prefix tree automaton is built. Finally, the states are merged according to a criterion based on the similarity of their k-suffixes. However, this approach cannot be employed in our framework, as our parameters are not concrete values.

8 https://dan.arndt.ca/projects/netmate-flowcalc/
9 http://www.cs.waikato.ac.nz/ml/weka/


In [41], the automata learning problem is extended to infer deterministic timed automata. It was proved that identifying timed automata with two or more timed elements is not efficiently achievable. Thus, a real-time identification approach, based on evidence-driven state merging, was proposed, which infers models by considering the time between consecutive events of a system. Extending our approach with this technique to consider the timing behavior of actions is one of our future works. LearnLib is a Java open-source framework for automata learning [42]. It includes some active and passive learning techniques, such as evidence-driven state merging (EDSM), which is the basis of our algorithm. However, it assumes that the inputs are both positive and negative traces, and a state merge is performed when two states reach consistent accepting or rejecting states. In contrast, we assume that we do not have negative traces and define the state-merge criterion based on preserving the behavior of the basic components. Thus, the EDSM algorithm implemented by this library cannot be used in our work. Some studies address applications of the automata learning problem. Among these, [43] is the most related to our work, as it elaborates on inferring Mealy machines of communication protocols. In that approach, the parameters in the message format of a protocol, such as sequence numbers, configuration parameters and session identifiers, result in an infinite-state model. Hence, to minimize the state space, an abstract representation of protocol states is derived automatically in terms of the operations that a requester and a responder may perform. In other words, they assume that the format of messages and some basic rules between the requester and responder (as the protocol rules) are given. Hence, they have an assumption similar to ours, namely the existence of protocol specifications. Their algorithm is based on query evaluation (active automata learning), while we have extended passive automata learning.

Reverse engineering of protocol specifications. In this part, we enumerate works that focus on inferring protocol specifications from traffic. These works are related to ours because of their restriction to inferring a model by observing the behavior of the application as a black box. In [44], a probabilistic method was investigated to obtain a finite state machine of a protocol. It was assumed that the format of protocol messages is not given. In the first step, messages are segmented into l-length byte chunks and clustered with the aim of recognizing their control parts. Next, the most frequent patterns are selected as the message units by statistical analysis. Then, the main messages of the protocol are defined by computing the centers of the clusters. Finally, the finite state machine is constructed, whose states are the main messages and whose probabilistic transitions are the frequencies of each pair of messages. In the ReverX algorithm [45], a prefix tree automaton is built from the traces and then the states which are the destination of identical transitions are merged. Therefore, transitions with the same source and destination are created. The authors claim that if these transitions are merged, the parameters of message headers are inferred. By continuing this operation for all the states, the state machine of the protocol is obtained. Although their work is similar to ours in using passive automata learning, we differ in the conditions for state merging: the states similar in their 1-future action are merged in that approach, while the states preserving the same behavior of well-known protocol specifications are merged in our approach, which leads to a false positive rate of 0.

Specification mining. This area is a collection of techniques that attempt to mine a specification of a program from some of its artifacts [4]. Some of these works are mentioned in the following. In [5], a model is generated from an input log file in two steps. First, a trace graph is built; after that, this graph is refined to satisfy the invariants that are inferred from the traces. Invariants show the precedence rules between the occurrences of events. The method of [6] mines the specification of a software system in terms of the APIs of its libraries. It compares four algorithms: traces-only (KTail), invariants-only (Contractor [46]), and two novel algorithms, invariant-enhanced-traces and trace-enhanced-invariants, and concludes that trace-enhanced-invariants gives the best result amongst them. By extending our approach to consider state variables, these techniques could be used. An algorithm which tolerates noisy inputs is presented in [47]; it computes the number of changes required to accept a noisy input trace and accepts the trace if this value is lower than a threshold. This technique can be used in our approach to make it more robust to noisy inputs.

Process mining. The process mining area [48], also referred to as workflow mining, elaborates on the problem of learning a process description from recorded logs. This technique is mainly used for highly concurrent systems such as business processes and usually extracts the final model as a Petri net [49]. The process mining algorithms fall into three categories: discovery, conformance, and enhancement. In the first category, it is assumed that there is no prior model; the goal is to find the causality relations between the events recorded in the work log, in the form of a Petri net model. The alpha algorithm is a well-known algorithm of this category. The goal of the second category is conformance checking of an a priori model against the given log of process events. Finally, the third category addresses how to improve the performance of an existing model using new information about the system. ProM [50] and Disco [51] are examples of process mining tools. Though the learning phase of our work is comparable with the discovery algorithms, the level of abstraction of workflow mining is higher and is not suitable for network packet traces or function call traces. Furthermore, in these methods, domain information, such as the specification of components, is not used. Therefore, it seems that the precision of these methods would be lower than that of our approach and similar to the KTail approach. Also, the second category, the conformance algorithms, is comparable with our classifier; however, we consider an interleaved traffic while they assume that the input is pure. Comparing our method with the process mining algorithms in more detail is amongst our future works.


6.2. Model matching

Runtime verification. Monitoring a trace to check property satisfaction is addressed thoroughly in the runtime verification domain. Recently, dealing with the parameters of properties or traces has received more attention, and a great deal of research has been conducted. JavaMOP [9] is the pioneering work on parametric trace slicing; after that, quantified event automata (QEA) [52] were presented. These approaches provide efficient monitors by conceptually partitioning (slicing) a trace into subtraces with a specific value for each parameter combination. QEA allows existential and universal quantification over the variables of a property. Since our only parameter is the flow identifier, we do not need quantifiers in our monitor. Monitorability for μHML, automata-like properties with branching structures, was first addressed in [53]; its formal basis was then provided in [54], and some studies complement this foundational framework, namely [55], which extended monitors to consider internal actions. Since we generate the models from the observed behaviors, all our actions are external. Another extension was introduced in [56] to address temporal properties: the monitor considers a set of conditions encoded in the trace, generated by another process, which may describe events that could not have happened or that may happen at specific points in the execution of a system. However, since our behavioral models, acting as properties, are terminating, our monitored properties are regular and we do not need to consider more expressive properties. Rule-based monitors such as LogFire [8] consist of a set of rules: if the facts on the left-hand side of a rule hold, then the right-hand side can be applied, which may add or delete some facts. Although these methods and other works propose rich techniques to overcome the issues of parametric monitoring, our problem has different assumptions and we need a customized solution. In our problem, there are several models (as properties) to be considered in parallel, and the actions of each model have symbolic parameters, the flow identifiers. In fact, a binding between the symbolic flow identifiers of the models and the parameters of the actions of the input trace should be found. The interleaving of the traces of different application models and the uncertainty about the location of their initial actions in the trace are our other assumptions.

7. Conclusion

Classification approaches based on the behavioral patterns of applications are the new trend for this problem, but no general and automated method to derive behavioral models had been provided. We proposed a method to reach this goal based on the automata identification problem and the evidence-driven state merging technique, combined with transition system theories, namely behavioral pre-order relations and counter abstraction, and with complex event processing. Intuitively, we assumed that the behavior of an application can be identified in terms of how it executes its depending components, well-known in a domain, abstracting from its state variables. Hence, we have introduced our merging conditions to identify the equivalent states based on the specifications of such components: two states preserving the same behavior (in each flow) with respect to all of the components are considered equivalent. To improve the condition, we take advantage of counter abstraction to merge states with the same number of flows preserving the same behavior. To cover unobserved behaviors and relax the unnecessary orders of actions due to the concurrent development of an application, we exploited complex event processing to derive the essential causality relations (between two sync points) to steer the completion step. We also implemented and evaluated our framework in the network domain, where it does not require human inspection. The experiments show very encouraging results: the generalization steps significantly increase the accuracy from 0% to 91% and to 81% for non-parametric and parametric models in the worst case, respectively. Existing model identification approaches such as KTail can be used for deriving the models of stand-alone components. However, for complex applications depending on several components, these approaches are not beneficial, and exploiting domain-specific information (like the component specifications) substantially improves the result. Our experimental results in the network domain confirm the applicability of our approach in terms of the false positive and true positive rate metrics in comparison with statistical approaches, while the classical automata learning algorithms suffer from a zero true positive rate. We aim to consider the timing behavior following the approach of [41], in combination with techniques from the specification mining literature such as [4–6], to derive timing invariants. Another future work is to generalize our technique and apply it to other areas such as malware detection or reverse engineering of software specifications (specification mining). Extending our experiments to evaluate the scalability of our approach in the network domain is another direction.

Acknowledgements

We would like to thank Niloofar Naderian, Fateme Bajelan, and Mohammad Behzadifar for their help in the implementation of the method and Hamed Rahmatollahi for his help in gathering the datasets.

References

[1] K. Xu, Z. Zhang, S. Bhattacharyya, Profiling internet backbone traffic: behavior models and applications, SIGCOMM Comput. Commun. Rev. 35 (4) (2005) 169–180.

[2] P. Bermolen, M. Mellia, M. Meo, D. Rossi, S. Valenti, Abacus: accurate behavioral classification of P2P-TV traffic, Comput. Netw. 55 (6) (2011) 1394–1411.
[3] J. Kinable, O. Kostakis, Malware classification based on call graph clustering, J. Comput. Virol. 7 (4) (2011) 233–245.
[4] G. Reger, Automata Based Monitoring and Mining of Execution Traces, Ph.D. thesis, University of Manchester, 2014.
[5] I. Beschastnikh, J. Abrahamson, Y. Brun, M. Ernst, Synoptic: studying logged behavior with inferred models, in: Proc. ESEC/FSE '11, ACM, 2011, pp. 448–451.
[6] I. Krka, Y. Brun, N. Medvidovic, Automatic mining of specifications from invocation traces and method invariants, in: Proc. FSE 2014, ACM, 2014, pp. 178–189.
[7] M. Alessandro, C. Gianpaolo, T. Giordano, Learning from the past: automated rule generation for complex event processing, in: Proc. DEBS '14, ACM, 2014, pp. 47–58.
[8] K. Havelund, Rule-based runtime verification revisited, Int. J. Softw. Tools Technol. Transf. 17 (2) (2015) 143–170.
[9] P. Meredith, D. Jin, D. Griffith, F. Chen, R. Grigore, An overview of the MOP runtime verification framework, Int. J. Softw. Tools Technol. Transf. 14 (3) (2012) 249–289.
[10] Z. Sabahi-Kaviani, F. Ghassemi, F. Bajelan, Automatic transition system model identification for network applications from packet traces, in: Proc. FSEN 2017, 2017, pp. 212–227.
[11] M. Heule, S. Verwer, Exact DFA identification using SAT solvers, in: Grammatical Inference: Theoretical Results and Applications, Springer, 2010, pp. 66–79.
[12] E. Gold, Language identification in the limit, Inf. Control 10 (5) (1967) 447–474.
[13] D. Angluin, Learning regular sets from queries and counterexamples, Inf. Comput. 75 (2) (1987) 87–106.
[14] K. Lang, B. Pearlmutter, R. Price, Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm, in: Proc. 4th ICGI, Springer, 1998, pp. 1–12.
[15] N. Walkinshaw, J. Derrick, Q. Guo, Iterative refinement of reverse-engineered models by model-based testing, in: Proc. 2nd FM, Springer, 2009, pp. 305–320.
[16] A. Biermann, J. Feldman, On the synthesis of finite-state machines from samples of their behavior, IEEE Trans. Comput. C-21 (6) (1972) 592–597.
[17] C. Baier, J.-P. Katoen, Principles of Model Checking, MIT Press, Cambridge, ISBN 026202649X, 2008.
[18] J.F. Groote, M.R. Mousavi, Modeling and Analysis of Communicating Systems, The MIT Press, 2014.
[19] R. van Glabbeek, The linear time – branching time spectrum, in: Lecture Notes in Computer Science, vol. 458, Springer, 1990, pp. 278–297.
[20] G. Basler, M. Mazzucchi, T. Wahl, D. Kroening, Symbolic counter abstraction for concurrent software, in: Computer Aided Verification, Springer, 2009, pp. 64–78.
[21] R. Milner, Communication and Concurrency, Prentice-Hall, 1989.
[22] A. Emerson, R. Trefler, From asymmetry to full symmetry: new techniques for symmetry reduction in model checking, in: Proc. 10th CHARME, Springer, 1999, pp. 142–156.
[23] A. Pnueli, J. Xu, L. Zuck, Liveness with (0, 1, ∞)-counter abstraction, in: Proc. 14th CAV, Springer, 2002, pp. 107–122.
[24] M. Leucker, C. Schallhart, A brief account of runtime verification, J. Log. Algebraic Program. 78 (5) (2009) 293–303.
[25] B. Schieber, U. Vishkin, On finding lowest common ancestors: simplification and parallelization, SIAM J. Comput. 17 (6) (1988) 1253–1262, https://doi.org/10.1137/0217079.
[26] S. Valenti, D. Rossi, A. Dainotti, A. Pescapè, A. Finamore, M. Mellia, Data Traffic Monitoring and Analysis, Springer, Berlin, Heidelberg, 2013, pp. 123–147 (Ch. Reviewing traffic classification).
[27] A. Moore, K. Papagiannaki, Toward the accurate identification of network applications, in: Passive and Active Network Measurement, Springer, 2005, pp. 41–54.
[28] S. Sen, O. Spatscheck, D. Wang, Accurate, scalable in-network identification of p2p traffic using application signatures, in: Proc. 13th WWW, ACM, 2004, pp. 512–521.
[29] A. Moore, D. Zuev, Internet traffic classification using Bayesian analysis techniques, in: ACM SIGMETRICS Performance Evaluation Review, vol. 33, ACM, 2005, pp. 50–60.
[30] A. McGregor, M. Hall, P. Lorier, J. Brunskill, Flow clustering using machine learning techniques, in: Passive and Active Network Measurement, Springer, 2004, pp. 205–214.
[31] K. Fall, R. Stevens, TCP/IP Illustrated, vol. 1: The Protocols, Addison-Wesley, 2011.
[32] R. Alshammari, A.N. Zincir-Heywood, Machine learning based encrypted traffic classification: identifying SSH and Skype, in: Proc. CISDA'09, IEEE Press, Piscataway, NJ, USA, 2009, pp. 289–296.
[33] R. Alshammari, A.N. Zincir-Heywood, Unveiling Skype encrypted tunnels using GP, in: Proc. CEC'10, 2010, pp. 1–8.
[34] S.P. Reiss, M. Renieris, Encoding program executions, in: Proc. ICSE '01, IEEE, 2001, pp. 221–230.
[35] L. Mariani, M. Pezzè, Dynamic detection of COTS component incompatibility, IEEE Softw. 24 (5) (2007) 76–85.
[36] J.E. Cook, A.L. Wolf, Discovering models of software processes from event-based data, ACM Trans. Softw. Eng. Methodol. 7 (3) (1998) 215–249.
[37] N. Williams, S. Zander, G. Armitage, A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification, SIGCOMM Comput. Commun. Rev. 36 (5) (2006) 5–16.
[38] S. Ubik, P. Žejdl, Evaluating application-layer classification using a machine learning technique over different high speed networks, in: Proc. ICSNC'10, 2010, pp. 387–391.
[39] D. Lorenzoli, L. Mariani, M. Pezzè, Automatic generation of software behavioral models, in: Proc. 30th ICSE, ACM, 2008, pp. 501–510.
[40] M. Ernst, J. Cockrell, W. Griswold, D. Notkin, Dynamically discovering likely program invariants to support program evolution, IEEE Trans. Softw. Eng. 27 (2) (2001) 99–123.
[41] S. Verwer, Efficient Identification of Timed Automata: Theory and Practice, Ph.D. thesis, TU Delft, Delft University of Technology, 2010.
[42] LearnLib – an open framework for automata learning, https://learnlib.de/.
[43] F. Aarts, B. Jonsson, J. Uijen, Generating models of infinite-state communication protocols using regular inference with abstraction, in: Testing Software and Systems, Springer, 2010, pp. 188–204.
[44] Y. Wang, Z. Zhang, D. Yao, B. Qu, L. Guo, Inferring protocol state machine from network traces: a probabilistic approach, in: Proc. 9th International Conference on ACNS, Springer, 2011, pp. 1–18.
[45] J. Antunes, N. Neves, P. Verissimo, Reverse engineering of protocols from network traces, in: Proc. 18th WCRE, 2011, pp. 169–178.
[46] G. de Caso, V. Braberman, D. Garbervetsky, S. Uchitel, Automated abstractions for contract validation, IEEE Trans. Softw. Eng. 38 (1) (2012) 141–162.
[47] G. Reger, H. Barringer, D. Rydeheard, Automata-based pattern mining from imperfect traces, SIGSOFT Softw. Eng. Notes 40 (1) (2015) 1–8.
[48] W. Van Der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer Science & Business Media, 2011.
[49] W.M.P. van der Aalst, The application of Petri nets to workflow management, J. Circuits Syst. Comput. 08 (01) (1998) 21–66.
[50] B.F. Dongen, A.A. Medeiros, H.M.W. Verbeek, A.J.M.M. Weijters, W.M.P. Aalst, The ProM framework: a new era in process mining tool support, in: Proc. ICATPN, Springer, 2005, pp. 444–454.
[51] Process mining and automated process discovery software for professionals, https://fluxicon.com/disco/.
[52] H. Barringer, Y. Falcone, K. Havelund, G. Reger, D. Rydeheard, Quantified event automata: towards expressive and efficient runtime monitors, in: Proc. FM, Springer, 2012, pp. 68–84.

[53] A. Francalanza, L. Aceto, A. Ingólfsdóttir, Monitorability for the Hennessy–Milner logic with recursion, Form. Methods Syst. Des. 51 (1) (2017) 87–116.
[54] A. Francalanza, L. Aceto, A. Achilleos, D.P. Attard, I. Cassar, D.D. Monica, A. Ingólfsdóttir, A foundation for runtime monitoring, in: Proc. RV, 2017.
[55] L. Aceto, A. Achilleos, A. Francalanza, A. Ingólfsdóttir, Monitoring for silent actions, in: Proc. FSTTCS, 2017.
[56] L. Aceto, A. Achilleos, A. Francalanza, A. Ingólfsdóttir, A framework for parameterized monitorability, in: Proc. FOSSACS, 2018.