Behavioral model identification and classification of multi-component systems

Science of Computer Programming 177 (2019) 41–66
Zeynab Sabahi-Kaviani, Fatemeh Ghassemi *
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Islamic Republic of Iran

Article history: Received 14 February 2018; Received in revised form 21 November 2018; Accepted 8 March 2019; Available online 22 March 2019
Keywords: Passive automata learning; State-merge condition; Behavioral equivalence relation; Traffic classification; Runtime verification


Abstract

Learning models from execution traces has been considered in several domains, such as traffic classification, malware detection, and software engineering, since traces reveal which processes actually executed. Behavioral model learning, and classification in terms of the learned models, are used to overcome the shortcomings of models derived from non-behavioral features and to improve the resulting classifications. So far, no general method has been proposed to derive such behavioral models automatically. To this aim, we assume that the model of an application can be defined abstractly in terms of how it executes the components it depends on, which are well known in the domain. To derive such models automatically, we extend passive automata learning by considering the behavior of the depending components in addition to the observed behaviors. The state-merging algorithm of the learning process is equipped with a new equivalence relation which aggregates states modulo the counter abstraction of symmetry reduction. To improve the generated model so that it covers unobserved behaviors, we leverage complex event processing to complete the model with unseen interleavings of actions caused by the concurrent execution of components. The derived models, specified as parametrized transition systems, can distinguish different executions of the instances of each component by assigning a unique symbolic identifier to each instantiation and parameterizing actions with these identifiers. The learned models are used to distinguish the executions of applications within an interleaved execution trace of different systems. The detection procedure is more complicated for parametric models because the information in the trace must be related to the symbolic identifiers that parameterize the models. We utilize runtime verification techniques in a novel three-step approach to enhance the performance of the matching process for a trace. To illustrate the applicability of our approach, we have employed it for traffic classification in the network domain and applied it to some real applications. To demonstrate its effectiveness in this domain, we compare it to related approaches in terms of true positive rate, false positive rate, and test time. Our results indicate that our technique prevents the inclusion of invalid traces while covering unobserved behaviors with acceptable precision.
© 2019 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (Z. Sabahi-Kaviani), [email protected] (F. Ghassemi).

https://doi.org/10.1016/j.scico.2019.03.003 0167-6423/© 2019 Elsevier B.V. All rights reserved.


Fig. 1. Java code for a ticket agency system and its components.

1. Introduction

Learning models from execution traces has been considered in several domains, such as traffic classification, malware detection, and software engineering, since traces reveal which processes actually executed. Due to the use of random ports and encrypted traffic by network applications and the emergence of code obfuscation in malware, pattern-matching and statistical approaches are no longer sufficient for classification. Furthermore, when ready-to-use components, libraries, or services are used, or documentation is imperfect, a complete software specification may not be available from which to derive a model of the system and benefit from model-based approaches to software testing or analysis. Behavioral model learning, and classification in terms of the learned models, are therefore taken into account [1–6]. Some works rely on specific characteristics of an application's behavior by which its traces are identified; detecting such behavioral characteristics is largely application-dependent. For instance, the topology of connection establishment has been used as a distinguishing metric to identify P2P-TV traffic [2] in the network domain. Hence, no automated general method based on the behavior of applications has been proposed.

To derive behavioral models automatically, we present a passive automata learning technique that infers abstract behavioral models of applications from their observed behaviors. Our method is general because it does not rely on any pre-assumption about behavioral characteristics of systems that reflect the category of the domain to which they belong. We consider a system as the composition of basic components with known behaviors. For instance, Fig. 1 shows the Java code of a ticket agency system, relying on three component classes TicketReservation, User, and AccountTransactions, responsible for ticket reservation, user registration, and payment processing, respectively. The application invokes the instances of its components in a specific order depending on the values of its variables. First, an instance of TicketReservation and an instance of AccountTransactions are executed, which let a user start either reserving a ticket or topping up an account, in any order. However, a selected ticket is not reserved (by calling ticketReserved) unless the account has previously obtained enough credit (signaled by AccountTransactions). An instance of User is also executed if the user is new. Another program may invoke instances of the above-mentioned classes in a different way. In other words, the abstract models of various applications differ in how they invoke their component instances, and hence the behaviors of different programs, captured as execution traces, differ from each other.

We consider the execution trace of an application as the sequence of its function calls. Since each component may be instantiated more than once, we symbolically distinguish the outputs of each instance from the others by a number, using their object identifiers in this domain. For instance, selectTicket(1) requestNewTrans(2) ticketReserved(1) updateAccountInfo(2) showTicket(1) printTicket(1) is an execution trace of this application. We abstract away from the object identifiers by using integer values, called flow identifiers, such that each object identifier in each trace is uniquely mapped to an integer value. For simplicity, we rename the function names to the letters shown as comments at the end of each function call statement in Fig. 1.


Therefore, the above trace is renamed to a(1) x(2) b(1) y(2) c(1) d(1). Also, the traces x(3) a(4) x(5) a(4) y(5) b(4) c(4) d(4) y(3) and x(6) m(7) n(7) a(8) b(8) y(6) c(8) d(8) can be the result of two different executions of this code. The flow identifiers identify co-related outputs in a control-flow execution (flow for short) of an instance. For instance, in the second trace two instances of AccountTransactions are involved, and the first and the last outputs x and y are generated by the same instance. Since objects are assigned different identifiers in different executions, occurrences of x and y in different traces may be the result of the same instantiation. For example, x and y with flow identifier 2 in the first trace, those with flow identifier 3 in the second trace, and those with flow identifier 6 in the last trace belong to the same instantiation of AccountTransactions (line 13 of the code), while x and y with flow identifier 5 in the second trace belong to another instantiation (line 15).

Intuitively, we assume that the behavior of an application can be identified in terms of how it executes its component instances, abstracting away from its state variables such as isRegistered and enoughCredit. Knowing the behavior of the components TicketReservation, User, and AccountTransactions, we aim to derive the model of the TicketAgencySystem class, shown as the transition system of Fig. 4c, in terms of how it executes their instances, by observing the generated traces. By recognizing the source of the outputs (the equivalent flow identifiers) in different traces, the model can be refined to become more accurate. To this aim, we apply automata learning to component-based systems and use the behavior of the involved components to direct the learning process.

The first problem we tackle is deriving automata-based models for multi-component applications from their sets of observed behaviors, captured as execution traces. The automata learning problem aims at inferring an automaton which accepts a set of given words. If the execution traces are considered as the given words of a language, then the desired behavioral model is the identified automaton. This learned model indicates the sequential relation between occurrences of events. Passive techniques complete and compact an initial model, derived from the observed behavior of the system, by iteratively merging sets of states. However, the classical approaches in the automata learning literature are not effective at deriving the most complete model, one that subsumes as many unobserved traces as possible when no invalid traces are given, because their state-merge condition is the similarity of the K next actions of states, and states are merged irrespective of the domain. To account for the effect of the domain in the state-merging condition, we additionally take into account the behavior of a set of basic components used as parts of the application. We present a behavioral pre-order equivalence relation to determine equivalent states such that, after merging them, the behavior of the basic components is preserved while invalid behaviors are prevented. By considering the behavior of components, the concepts of the domain are thus used to tailor the learning algorithm through suitable state-merging constraints.
Furthermore, we apply the counter abstraction technique of symmetry reduction in defining our behavioral pre-order relation, so as to aggregate more states and accept more behaviors. We leverage the concepts of complex event processing [7] to complete the model with unseen interleavings of actions caused by the concurrent execution of components. The derived models are specified by labeled transition systems whose actions are parametrized by flow identifiers to distinguish different executions of the instances of each component. In general, our passive automata learning algorithm can be applied to learn a model of every multi-component system whose components run concurrently and may communicate with each other either through shared variables or through synchronous/asynchronous mechanisms. To apply the proposed method to a new domain, the only effort needed is to identify the basic components, the set of events in terms of which the behaviors of the system and components are specified, and a suitable approach to capture the behavior of the system. For example, our technique is applicable in other problem areas such as malware signature generation; in that domain, the basic components are the operating system services, whose behaviors are captured as system call sequences.

The second problem we address is classifying a given interleaved execution trace in terms of the learned parametrized models. The interleaved trace of a number of applications leads to uncertainty in determining the source of each action. The detection procedure is more complicated for parametric models because the flow identifiers of the trace must be related to the symbolic identifiers of the models in order to detect behaviors belonging to the same instantiation. The matching problem is essentially the problem of deciding whether an automaton accepts a word, which has mainly been addressed in the runtime monitoring (verification) domain [4,8,9]; however, those works do not consider interleaved sequences and parametric actions together. Since the parameters denote flow identifiers and have no absolute value, we need to find an appropriate binding between the flow identifiers of the model and those of the input trace. The number of alternatives to consider is thus exponential, and examining all of them is time- and memory-consuming. We utilize runtime verification techniques in a novel three-step approach to enhance the performance of the matching process for a trace. First, we reduce the set of models for further inspection by validating the trace against the models while ignoring action parameters. Second, a set of candidate bindings between the parameters of the trace and of the reduced set of models is inferred. Third, each nominated binding is used to validate the trace against the corresponding model, taking parameters into account.

The main components of our framework are shown in Fig. 2: Preprocessor, Model Generator, and Trace Classifier. The Preprocessor component is responsible for preparing the input traces by applying a mapper function which represents the generated events symbolically. Among these, only the approach for capturing application traces and the Preprocessor component are domain-specific. To illustrate the applicability of our approach, we have customized the capturing approach and the Preprocessor component for the traffic classification problem in the network domain and then applied them to some real network applications.
To demonstrate the effectiveness of our approach in the network domain, we compare it to custom statistical approaches in terms of their true positive rate, false positive rate, and test time. Our results indicate that our technique prevents the inclusion of invalid traces while covering unobserved behaviors with acceptable precision.


Fig. 2. The components of our framework.

The proposed methodology for model learning was first introduced in [10]. We have extended it by

• considering the flow identifiers of actions to obtain more accurate models by generating parametrized models; furthermore, we provide an extra step that infers equivalent flow identifiers for the actions of generated models to make them more general;
• presenting the procedure of automatically generalizing the model by relaxing unnecessary orders and including such orders as a consequence of the concurrent execution of the components.

Also, the classification of interleaved traces of applications, the models of some of which may be unknown, according to the learned parametrized models is first introduced in this paper.

This paper is organized as follows. We start with an overview of automata learning approaches, preliminary definitions, and the techniques applied in this research in Section 2. Our methodology for generating the behavioral models of applications is discussed in Section 3. The proposed parametric classifier is introduced in Section 4. We present the evaluation of the proposed methodology in the network domain with experimental results in Section 5. Section 6 is dedicated to related work. Section 7 concludes the paper and contains directions for future work.

2. Preliminaries

First, we provide an overview of automata learning in Section 2.1, which is the basis of our algorithm. We define the main concepts related to transition systems in Section 2.2, since in our methodology the inferred behavioral model is a transition system. The counter abstraction technique, used to find equivalent states, is introduced in Section 2.3. Section 2.4 is dedicated to runtime verification as the basis of our classification method.

2.1. Automata learning

Equivalent terms for automata learning in the literature include grammar inference, regular inference, and language or automata identification. The goal of automata learning is to find one of the smallest automata consistent with a set of given samples [11]. The samples can be sets of accepted or rejected words, called positive or negative samples, respectively. It has been proved that, for given sets of positive and negative samples, this problem is NP-complete when the alphabet is finite and the number of states of the output automaton is fixed [12]. If all of the words of size less than or equal to an arbitrary number n are given, then the problem can be solved in polynomial time. Otherwise, an automaton, not necessarily the smallest one, can be found which accepts/rejects the given input but may also accept additional words. The algorithms presented for this problem are divided into two categories: active and passive. The active techniques are based on Angluin's L* algorithm, which solves the problem in polynomial time by posing membership and equivalence queries [13]. This approach starts from an empty model and then iteratively adds a state and a number of transitions to the model. The correctness of each refinement step is validated by inspecting the membership of


the newly added words. These iterations continue until the resulting model is accepted in response to an equivalence query; it is assumed that there is an oracle which answers the required queries. The main focus of this approach is to reduce the number of queries by posing the most effective ones.

Passive techniques build a tree-like automaton, called a prefix tree automaton, from the input examples and then merge its states according to some heuristic evidence to reach the smallest deterministic finite automaton. Depending on the order of merging, different solutions to a DFA identification problem can be found. In this automaton, positive words end in an accepting state while negative ones lead to a rejecting state. This technique is called Evidence Driven State Merging (EDSM) [14]. In this category there is no oracle, and the algorithm has to find the solution only from the positive and negative words of the language. If negative samples are given, the merge condition ensures that only states leading to consistent states (all accepting or all rejecting) are merged. In the absence of negative samples, the merging criterion is defined in another way; e.g., in [15] it is defined by maximizing a heuristic score function in terms of the extent to which the suffixes of two states overlap, and the merging process is performed in the order of the scores of the pairs.

In this paper, we aim to infer the behavioral model only from the executions of systems (passive samples), when there is no oracle. Hence, the most important challenge is how to find the consistent states that should be merged. There are algorithms in the literature, such as the KTail algorithm, which merge states having a common K-future (i.e., states that accept the same set of strings of length K) [16]. However, the state-merging condition of these algorithms is irrespective of the domain. We present a novel domain-aware state-merging algorithm that considers the basic components of the system in addition to the executions.

For finding the consistent states, passive algorithms such as KTail need to search the state space in O(n^2). Some research has been conducted to reduce this search. Red-Blue is a well-known framework which limits the number of states searched by restricting the merge condition to states with opposite colors [14]. In this framework, a core of red states is always surrounded by a row of blue states. At first, the initial state is the only red state; after each merge, the resulting state becomes red while its neighbors become blue. The procedure lasts until all states are red. This framework reduces the candidate states to those with opposite colors. In our approach, we find consistent states in O(1) by defining their vectors and storing them in a hash structure; states with the same hash value are then merged, and there is no need for any O(n^2) search.
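To make this classical, domain-independent merge condition concrete, the following Java sketch (our own illustration, not taken from any particular KTail implementation; the map-based transition encoding is an assumption) computes the set of action strings of length at most K readable from a state of a prefix-tree automaton; two states are KTail merge candidates exactly when these sets coincide.

    import java.util.*;

    // Minimal sketch of the KTail merge condition: two states are merge candidates
    // iff the sets of action strings of length at most K readable from them coincide.
    class KTailSketch {
        // transitions.get(s) maps an action label to the successor of state s
        static Set<List<String>> kFuture(Map<String, Map<String, String>> transitions,
                                         String state, int k) {
            Set<List<String>> futures = new HashSet<>();
            futures.add(List.of());                       // the empty string is always readable
            if (k == 0) return futures;
            for (Map.Entry<String, String> t :
                    transitions.getOrDefault(state, Map.of()).entrySet()) {
                for (List<String> tail : kFuture(transitions, t.getValue(), k - 1)) {
                    List<String> word = new ArrayList<>();
                    word.add(t.getKey());
                    word.addAll(tail);
                    futures.add(word);
                }
            }
            return futures;
        }

        static boolean kTailEquivalent(Map<String, Map<String, String>> transitions,
                                       String s1, String s2, int k) {
            return kFuture(transitions, s1, k).equals(kFuture(transitions, s2, k));
        }
    }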

2.2. Transition system

We specify the model of a system and its sub-systems in the form of transition systems.

Definition 1 (Transition system [17]). A transition system is a tuple TS = (S, Act, →, s0, ↓) where S is a set of states, Act is a set of actions, → ⊆ S × Act × S is a transition relation, s0 is the initial state, and ↓ ⊆ S is a set of final states. We use s −α→ t to denote (s, α, t) ∈ →. TS is called action-deterministic if there are no s ∈ S, α ∈ Act, and t, v ∈ S with t ≠ v such that (s, α, t) ∈ → and (s, α, v) ∈ →.

By this definition, the transition system of Fig. 3a is action-deterministic, since there is only one outgoing transition per action from each state.

Definition 2 (Execution fragment and execution trace [17]). Let TS = (S, Act, →, s0, ↓) be a transition system.

• A finite execution fragment η = s0 α0 s1 α1 ... αn sn+1 of TS is an alternating sequence of states and actions starting with the initial state and ending in a final state, i.e., sn+1 ∈ ↓, such that (si, αi, si+1) ∈ → for 0 ≤ i ≤ n.
• A finite sequence of actions π = α0 α1 ... αn of TS is an execution trace if there exist s0, ..., sn+1 ∈ S such that η = s0 α0 s1 α1 ... αn sn+1 is a finite execution fragment.

We define Frags(TS) as the set of all finite execution fragments and Traces(TS) as the set of all finite execution traces of the transition system TS. As we only consider finite execution fragments and traces, we use "finite execution fragments/traces" and "execution fragments/traces" interchangeably in the following. For instance, s0 x s1 a s2 x s3 a s4 y s5 b s6 c s7 d s8 y s9 is an execution fragment of the transition system in Fig. 3a. By eliminating the states from this sequence, the execution trace x a x a y b c d y is obtained.

An abstraction operator hides the details of a transition system by turning some actions into the unobservable action τ, so as to derive a more abstract and general model. All actions except τ are called observable. Observability is used to define different equivalence relations that distinguish various systems.

Definition 3 (Abstraction operator [18]). Let TS = (S, Act, →, s0, ↓) be a transition system. The abstraction of TS via a set of actions L ⊆ Act, denoted by τL(TS), is (S, Act \ L, →', s0, ↓) such that
→' = {(s, α, t) | (s, α, t) ∈ →, α ∉ L} ∪ {(s, τ, t) | (s, α, t) ∈ →, α ∈ L}.
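As a small illustration of Definition 3, the Java sketch below (our own code; the Transition record and the TAU constant are assumptions, not part of the paper) relabels every transition whose action belongs to the hidden set L with τ and keeps the remaining transitions unchanged.

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch of Definition 3: tau_L(TS) renames every action in L to the
    // unobservable action tau; states, initial state and final states are kept.
    class AbstractionSketch {
        static final String TAU = "tau";

        record Transition(String source, String action, String target) { }

        static Set<Transition> abstractActions(Set<Transition> transitions, Set<String> hidden) {
            return transitions.stream()
                    .map(t -> hidden.contains(t.action())
                            ? new Transition(t.source(), TAU, t.target())
                            : t)
                    .collect(Collectors.toSet());
        }
    }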


Fig. 3. A transition system and an example of a weak simulation relation.

The left transition system in Fig. 3b is achieved by abstracting away the actions {x, y, m, n, a, b, c, d, x, y} from Fig. 3a. To compare transition systems, several behavioral pre-order and equivalence relations have been proposed, ranging from strict to loose ones [19]. The simulation relation is one of the finest pre-order relations; it requires a transition system to precisely mimic the transitions of another one. In the presence of internal actions, the weak simulation relation is defined instead, matching each observable α-transition to one preceded/followed by a number of τ-transitions.

Let −τ*→ be the reflexive and transitive closure of τ-transitions:

• t −τ*→ t;
• if t −τ*→ s and s −τ→ r, then t −τ*→ r.

Definition 4 (Weak simulation relation [18]). A binary relation R on the set of states S is a weak simulation relation if for any s1, s1', and t1 ∈ S and α ∈ Act, s1 R t1 implies:

• s1 −α→ s1' ⇒ (α = τ ∧ s1' R t1) ∨ (∃ t1' ∈ S : t1 −τ*→ −α→ −τ*→ t1' ∧ s1' R t1');
• s1 ∈ ↓ ⇒ (∃ t1' ∈ ↓ : t1 −τ*→ t1').

For given transition systems TSi = (Si, Acti, →i, s0i, ↓i), where i ∈ {1, 2}, TS1 is weakly simulated by TS2 (or TS2 simulates TS1), denoted by TS1 ⪯w TS2, if s01 R s02 for some weak simulation relation R. A weak simulation relation R witnessing TS1 ⪯w TS2 is called minimal if there exists no weak simulation relation R' witnessing TS1 ⪯w TS2 such that R' ⊂ R. A minimal weak simulation relation is not necessarily unique. An example of a weak simulation relation between two transition systems, which is minimal, is illustrated in Fig. 3b. We remark that if the pair (s4, t2) substitutes for the pair (s5, t2), another minimal weak simulation is obtained.

2.3. Counter abstraction

Counter abstraction is a technique to abstract the states of a system. The idea is to represent each state as a vector of counters, one per value, instead of a vector of state-variable values; the counter of a value records the number of state variables holding that value. For instance, consider three variables x, y, and z over the domain {1, 2, 3}. The states {x = 1, y = 2, z = 1}, {x = 2, y = 1, z = 1}, and {x = 1, y = 1, z = 2} all have the identical counter-abstracted state {c1 = 2, c2 = 1, c3 = 0}, where ci denotes the counter of the value i. States with an identical abstraction can be merged together, and hence the model of a system can be reduced substantially.
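A minimal Java sketch of this counter abstraction (our own illustration; all names are ours): the states {x = 1, y = 2, z = 1} and {x = 2, y = 1, z = 1} both map to the counter vector {1 → 2, 2 → 1, 3 → 0}, so they fall into the same abstract class.

    import java.util.*;

    // Counter abstraction sketch: a concrete state is a map from variables to
    // values; its abstraction counts how many variables hold each value.
    class CounterAbstractionSketch {
        static Map<Integer, Integer> abstractState(Map<String, Integer> state, Set<Integer> domain) {
            Map<Integer, Integer> counters = new TreeMap<>();
            for (int value : domain) counters.put(value, 0);
            for (int value : state.values()) counters.merge(value, 1, Integer::sum);
            return counters;
        }

        public static void main(String[] args) {
            Set<Integer> domain = Set.of(1, 2, 3);
            Map<String, Integer> s1 = Map.of("x", 1, "y", 2, "z", 1);
            Map<String, Integer> s2 = Map.of("x", 2, "y", 1, "z", 1);
            // Both abstract to {1=2, 2=1, 3=0}, so s1 and s2 can be merged.
            System.out.println(abstractState(s1, domain).equals(abstractState(s2, domain))); // true
        }
    }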


The soundness of the counter abstraction technique has been proved in [20] by showing that states with an identical abstraction are behaviorally equivalent (with respect to strong bisimilarity [21]) and thus satisfy the same set of properties. This technique has been used in various applications. In symmetry reduction, which is a technique to avoid the state-space explosion problem, counter abstraction plays the role of finding identical clusters of the state space so as to reduce the symmetrical states and decrease the cost of model checking [22]. This concept is also used in [23] to abstract a parameterized system of unbounded size into a verifiable finite-state system. In our proposed method, we use this technique to abstractly define the states of an application in terms of the number of flows corresponding to each state of a component. Therefore, two states can be merged if their corresponding abstract states are identical.

2.4. Runtime verification

Runtime verification covers those verification techniques that inspect a finite execution trace of a system to determine whether one or more given desired properties are satisfied. Runtime verification fills the gap between model checking and testing: it examines a finite trace of a system at runtime (like testing, no model of the system under consideration is needed), as opposed to model checking, which applies to a model of the system. Hence, runtime verification does not suffer from state-space explosion [24]. To apply runtime verification, a monitor observes the events of the system and examines the desired properties to raise proper verdicts. We utilize runtime verification techniques in our classifier so as to enhance the performance of the matching process: given a trace and a set of models of some systems, the classifier discovers which application(s) may have generated a part of the input trace. Hence, to apply runtime verification techniques, the behavioral models are considered as the regular properties that should hold for the given input trace.

3. Model identification

In this section, our proposed methodology for learning a model of a multi-component system from its set of traces is discussed. To derive this model, it is assumed that the inputs are a set of execution traces and a set of component specifications. The given set does not necessarily contain the specifications of all the constituent components; hence, the input traces only contain events related to the considered components, and the others are removed. The derived model is then defined in terms of the behavior of its known components, those common with the given set. Reducing the problem to the automata learning problem, we propose an evidence-driven state-merging technique based on an equivalence relation. The goal is to derive a model that subsumes as many unobserved traces as possible.

3.1. Problem statement

Since we consider multi-component systems, the first input of our problem is a set of specifications of K components. We assume that the specifications are provided in the form of action-deterministic transition systems Ci = (Si, Acti, →i, s0i, ↓i), where 1 ≤ i ≤ K, τ ∉ Acti, and ∀ i, j ≤ K : (i ≠ j ⇒ Acti ∩ Actj = ∅). The second input of our problem is N execution traces of the system. Each trace is an interleaving of a number of execution traces of component instances. The execution traces of two instances of a component are identified by their unique parameters, e.g., the object identifiers in the Java code of Fig. 1.
Hence, the co-related events of an instance can be detected. These parameters can be symbolically represented by a unique identifier to distinguish their component instantiations. Each execution trace π can be considered as an interleaving of πf1, πf2, ..., where each subsequence πfi is an execution trace of an instance of some component Cj (j ≤ K). The symbolic identifier fi, called a flow identifier, uniquely distinguishes its component instantiation by parameterizing the actions of the trace.

Definition 5 (Flow). A flow is a trace of a component generated from an instantiation of the component, whose actions are all parametrized by a unique symbolic identifier to distinguish that instantiation.

For instance, if two execution traces πf ≡ α0(f) α1(f) ... αn(f) and πf' ≡ α0'(f') α1'(f') ... αn'(f') are the flows of two instances of component Cj, then they are uniquely distinguished by the identifiers f and f'. Therefore, we assign an identifier to each flow, ranged over by f, and assume that there are in total F flows in the given traces. Since each flow belongs to an instance of a component, we define a function Component : Nat → Nat which identifies the component index of a given flow identifier, such that ∀ f ≤ F : Component(f) ≤ K.

A mapping function Mapper is defined to represent the events of the input execution traces by symbolic actions, and to map each action α ∈ Actj belonging to the flow πf to a parametric action α(f), where obviously Component(f) = j. For example, for the code of Fig. 1, the functions selectTicket, ticketReserved, showTicket, and printTicket are mapped to the symbolic actions a, b, c, and d, respectively. After applying the mapping function, the second input of the problem can be considered as N parametric action traces, denoted by AT and ranged over by π. Let πi indicate the i-th action of the trace π, and len(π) the length of the trace. We remark that the lengths of action traces may vary.
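The Mapper can be pictured as a small renaming layer over raw events. The sketch below (our own illustration, not the paper's implementation; the renaming table and event fields are assumptions) turns an observed function call with an object identifier into a parametric action α(f), assigning a fresh flow identifier the first time an object identifier is seen.

    import java.util.*;

    // Sketch of the Mapper: raw events carry a function name and an object id;
    // the mapper renames functions to symbolic actions and replaces object ids
    // by small integer flow identifiers, e.g. selectTicket on obj42 -> a(1).
    class MapperSketch {
        private final Map<String, String> actionOf = Map.of(
                "selectTicket", "a", "ticketReserved", "b",
                "showTicket", "c", "printTicket", "d");
        private final Map<String, Integer> flowOf = new HashMap<>();

        String map(String function, String objectId) {
            Integer flow = flowOf.get(objectId);
            if (flow == null) {
                flow = flowOf.size() + 1;          // fresh flow identifier
                flowOf.put(objectId, flow);
            }
            return actionOf.getOrDefault(function, function) + "(" + flow + ")";
        }

        public static void main(String[] args) {
            MapperSketch m = new MapperSketch();
            System.out.println(m.map("selectTicket", "obj42"));   // a(1)
            System.out.println(m.map("ticketReserved", "obj42")); // b(1)
        }
    }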


Fig. 4. Applying the proposed method on the running example.

The goal of our problem is to derive a transition system M = (SM, ActM, →M, s0M, ↓M), where ActM is a set of parametric actions ActM = {α(f) | ∃ j ≤ K : α ∈ Actj ∧ Component(f) = j}, such that AT ⊆ Traces(M). There is a trade-off between the number of input traces and the computation time of the learning algorithm: increasing the number of input traces to observe more unseen behaviors yields a more accurate model, but requires more computation time.

Running example: With regard to the code given in Fig. 1, the specifications of the three sample components are provided in Fig. 4a, and the three input traces x(1) a(2) x(3) a(2) y(3) b(2) c(2) d(2) y(1), a(4) x(5) b(4) y(5) c(4) d(4), and x(6) m(7) n(7) a(8) b(8) y(6) c(8) d(8) are given (for simplicity, TicketReservation, User, and AccountTransactions are referred to as C1, C2, and C3, respectively). The integer parameters of the actions are the flow identifiers. In the first trace there are three flows, two of component C3 and one of component C1; the flow identifiers indicate that the first x and the last y belong to one flow and the second x is related to the first y. In the second trace there are two flows, from components C1 and C3. In the last trace there are three flows, each of which is a finite trace of one of the sample components. We use this example in the following.

3.2. Projection relation

Intuitively, each application needs to invoke a number of component instances to provide a functionality, and each invocation follows the corresponding component specification. As mentioned before, we call an execution trace of a component a flow. Initially, a tree-like automaton which consists of all action traces is generated; a transition system has a tree-like structure if and only if any two of its distinct states are connected by at most one simple path. Each state of the initial transition system can be considered as a vector of states, each of which identifies a state of a component; the size of the vector equals the number of flows. To generalize the initial transition system so that it includes behaviors of the system not observed during its execution, states with the same behavior are merged together. Such states are called projection equivalent: two states are projection equivalent if their vectors (of flow states) are identical with respect to the counter abstraction technique. Before describing the method, some definitions and theorems are needed.

Definition 6 (Projection relation under a transition system). Let TSi = (Si, Acti, →i, s0i, ↓i), for i = 1, 2, be transition systems. Two states s1 and s2 of S1 have the projection relation under TS2 if TS1 ⪯w TS2, witnessed by a minimal weak simulation relation R, such that ∃ t ∈ S2 : s1 R t ∧ s2 R t. Then s1 and s2 are said to be the same projection of t under the transition system TS2, denoted by s1 ∼TS2 s2.

The above definition is used to find consistent states, i.e., states preserving the same behavior. For instance, in the running example, the states s1 and s16 of Fig. 4b are the same projection of the state v2 under the transition system C3 of Fig. 4a; in other words, these states preserve the behavior of the state v2. To find states that are projection equivalent, the following lemma identifies the conditions under which the projection relation is an equivalence relation and can consequently partition the states.


Lemma 1. Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). If TS2 weakly simulates TS1, witnessed by a minimal weak simulation R, then each state of S1 relates to only one state of S2 under R:

∀ s ∈ S1, ∀ t1, t2 ∈ S2 : s R t1 ∧ s R t2 ⇒ t1 = t2.

Proof. We prove by induction on the length of the path from the root to the state s ∈ S1. For the base case, when the length is zero, s is the root of TS1, i.e., s = s01. By Definition 6, the roots of TS1 and TS2 must be related. We assume that t1 ≠ t2 and (s02 = t1) ∨ (s02 = t2); we assume that s02 = t1, as the other case can be proved by the same argument. The minimality of R implies that for any pair (s', t') ∈ R there must be paths π and π' from s01 and s02 to s' and t', respectively, such that any transition s1 −α→ s1' over π is matched to a sub-path of π'. In other words, if s1 −α→ s1' and s1 R t1*, then either (α = τ ∧ s1' R t1*) or (α ≠ τ ∧ t1* −α→ t2* ∧ s1' R t2*), by Definition 4 and the assumption τ ∉ Act2. It can be concluded that if s1 −α→ s1' and s1 R t1*, then there is t2* over π' such that s1' R t2*. Thus, s R t1 and s R s02, together with the minimality of R, imply that there must be a loop in TS1, which contradicts our assumption that TS1 has a tree-like structure. We now assume that the lemma holds for all states whose path from the root has length less than n, and prove it for a state s with a path of length n + 1. The tree-like structure of TS1 indicates that each state of S1 is the target of exactly one transition which is not a self-loop. Therefore, since s has a unique parent, denoted by p(s), two cases can be distinguished: 1) either s is reachable through a τ-transition, in which case τ ∉ Act2, the minimality of R, and Definition 4 imply that p(s) R t1 and p(s) R t2; 2) or s is reachable through an observable action, in which case the minimality of R and Definition 4 together imply that the parents of t1 and t2 must be related to p(s), i.e., p(s) R p(t1) and p(s) R p(t2). In the former case, by induction t1 = t2; in the latter case, by induction p(t1) = p(t2) and hence, by action-determinism, t1 = t2. □

Since our purpose is to find candidate states to be merged, we use the above lemma to prove in the next theorem that the projection relation is an equivalence relation and can therefore be used to partition the states; each equivalence class contains candidate states to be merged.

Theorem 2. Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). The projection relation under the transition system TS2 over the states of TS1 is an equivalence relation.

Proof. We assume that TS1 ⪯w TS2 is witnessed by a minimal weak simulation relation R. We prove that the projection relation is reflexive, symmetric, and transitive. Reflexivity follows from the fact that (s, t) ∈ R ∧ (s, t) ∈ R ⇒ s ∼TS2 s. For symmetry, s1 ∼TS2 s2 implies that ∃ t ∈ S2 : (s1, t) ∈ R ∧ (s2, t) ∈ R; consequently, (s2, t) ∈ R ∧ (s1, t) ∈ R implies s2 ∼TS2 s1. Transitivity follows from the facts that s1 ∼TS2 s2 ⇒ ∃ t ∈ S2 : (s1, t) ∈ R ∧ (s2, t) ∈ R, and s2 ∼TS2 s3 ⇒ ∃ r ∈ S2 : (s2, r) ∈ R ∧ (s3, r) ∈ R. According to Lemma 1, it holds that t = r, and hence s1 ∼TS2 s3. □

As a consequence of Theorem 2, the states of TS1 can be partitioned into equivalence classes by a projection relation under TS2. As all states in an equivalence class are the projection of a unique state of TS2, each equivalence class can be represented by a unique state of TS2. The equivalence classes of a projection relation are defined as follows.

Definition 7 (Projection relation partitioning). Let transition systems TSi = (Si, Acti, →i, s0i, ↓i), for i ∈ {1, 2}, be given such that TS1 is a tree-like transition system and TS2 is an action-deterministic transition system without any τ-transition (i.e., τ ∉ Act2). The states of TS1 are partitioned under the projection relation under TS2 into equivalence classes, each identified by its projecting unique state t ∈ S2, such that [t]_{TS1 ∼ TS2} = {s ∈ S1 | s R t}, where R is a minimal weak simulation relation witnessing TS1 ⪯w TS2.

3.3. Model learning steps

Step 1: Building the initial transition system. From Definition 2, there is an execution fragment for each execution trace. For each input trace π, we generate its corresponding execution fragment ηπ = s0 π1 s_1^π π2 ... π_len(π) s_len(π)^π by introducing the states s0 and s_i^π, where 1 ≤ i ≤ len(π). Note that the initial state in the fragments of all traces is intentionally identical. In this first step, the tree-like transition system M0 is built by aggregating the execution fragments of the input traces. Therefore, the initial transition system M0 = (S, Act, →, s0, ↓) is obtained as follows:

• S = {s_i^π | 1 ≤ i ≤ len(π), π ∈ AT} ∪ {s0},
• Act = {π_i | 1 ≤ i ≤ len(π), π ∈ AT},
• → = ∪_{π ∈ AT} ({(s_k^π, π_{k+1}, s_{k+1}^π) | 1 ≤ k < len(π)} ∪ {(s0, π_1, s_1^π)}),
• ↓ = {s_{len(π)}^π | π ∈ AT}.
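Step 1 admits a direct implementation. The following Java sketch (our own simplification; the class and field names are not from the paper) builds the tree-like initial transition system by creating a fresh state for every position of every trace and sharing only the common initial state s0.

    import java.util.*;

    // Sketch of Step 1: the initial transition system M0 is a set of linear
    // branches hanging off one shared initial state, one branch per trace.
    class InitialModelSketch {
        record Transition(int source, String label, int target) { }

        static class TS {
            final int initial = 0;
            int stateCount = 1;                         // state 0 is the shared s0
            final Set<String> actions = new HashSet<>();
            final List<Transition> transitions = new ArrayList<>();
            final Set<Integer> finals = new HashSet<>();
        }

        static TS buildInitialModel(List<List<String>> traces) {
            TS m0 = new TS();
            for (List<String> trace : traces) {
                int current = m0.initial;
                for (String action : trace) {           // parametric action, e.g. "a(2)"
                    int next = m0.stateCount++;         // a fresh state keeps M0 tree-like
                    m0.actions.add(action);
                    m0.transitions.add(new Transition(current, action, next));
                    current = next;
                }
                m0.finals.add(current);                 // the last state of each trace is final
            }
            return m0;
        }
    }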


Fig. 4b shows the result of performing this step. For simplicity, we enumerate the states from s0 to s_{Σπ∈AT len(π)} instead of using trace-indexed names such as s_i^π. This initial transition system does not include any execution traces of the application that are not subsumed by the input. To generalize the initial transition system so that it covers more execution traces, the next steps (steps 2 to 5) provide a set of operations toward this goal.

Step 2: Generalizing by counter abstraction. First, we generalize the initial model to accept more behaviors with an evidence-driven state-merging technique. This technique takes advantage of the behaviors of the components to identify equivalent states to be merged. Each state of the initial transition system is unfolded in terms of the progress of the flows; the progress is computed with regard to the behavior of the basic components by using our projection relation. States in which the flows have progressed similarly, modulo counter abstraction, are then subject to merging.

1. Finding equivalent states: As we classify applications in terms of how they execute several instances of the given components, each state s of M0 can be abstractly denoted by a vector ⟨t1, ..., tF⟩, where state ti represents the progress of flow π_fi according to the transition system Ck (Component(fi) = k). In other words, abstracting away the actions of the other flows from M0, denoted by τ_f̄(M0), the state s is simulated by ti ∈ Sk. Assume that the flows π_fi and π_fj are of the same component, and consider two states s1 and s2 of M0 such that their i-th and j-th items in the abstracted vector are swapped. We consider these two states equivalent under counter abstraction. Therefore, each state s of M0 can be represented as a vector of counters ⟨n1, ..., n_|S*|⟩, where S* = ∪_{j≤K} Sj and |S| denotes the size of the set S, with one counter for each state of the components; the counter n for the state t indicates how many flows have progressed to the state t.

The abstraction operator is used to determine the progress of each flow according to the transition system of its originating component by making the actions of all other flows unobservable. Since all the remaining observable actions belong to the same flow, the flow identifiers of the actions are also removed by this operator.

Definition 8 (Abstraction operator under a flow identifier). Let TS = (S, Act, →, s0, ↓) be a transition system. Then τ_f̄(TS) = (S, Act', →', s0, ↓) such that

→' = {(s, α, t) | (s, α(f), t) ∈ →} ∪ {(s, τ, t) | ∃ (s, α(f'), t) ∈ → ∧ f' ≠ f},

and Act' ⊆ Acti, where Component(f) = i and Ci = (Si, Acti, →i, s0i, ↓i).

For instance, the abstraction of the transition system of Fig. 4b under the flow with identifier 2 (which contains the trace a a b c d) is illustrated on the left side of Fig. 3b. This operator is helpful for computing the number of flows preserving the behavior of a common state of the components, and hence for generating the vector of counters ⟨n1, ..., n_|S*|⟩. Let count(s, t) denote the number of flows preserving the behavior of the component state t in the state s of the initial transition system; this function takes the behavior of the depending components into account.

Definition 9 (Counter function). Given the initial transition system M0 = (S, Act, →, s0, ↓) and a set of component transition systems Ci = (Si, Acti, →i, s0i, ↓i), where 1 ≤ i ≤ K, the counter function count(s, t) computes, for the state s ∈ S, the number of flows π_fi with Component(fi) = j that have progressed to the state t ∈ Sj, i.e., such that t weakly simulates s in the abstraction of M0 under fi:

count(s, t) = |{ fi ≤ F | s ∈ [t]_{τ_f̄i(M0) ∼ Cj} }|.

Since we define the projection relation on the abstracted version of the initial transition system, which has a tree-like structure, all observable actions belonging to a flow of an input trace are located on a single branch of the tree with no further branching structure (see Fig. 3b). Therefore, for simplicity, we have used the weak simulation relation in the definition of the projection relation (it yields the same result as the branching simulation relation on transition systems with no branching structure). We remark that each state s is uniquely simulated by a state t as a result of our projection relation (Lemma 1). Two states s1 and s2 are called equivalent due to the counter abstraction technique if and only if ∀ j ≤ K, t ∈ Sj : count(s1, t) = count(s2, t).

The results of applying the counter abstraction to the states of the initial transition system of the running example are presented in Table 1. To obtain each row, the projection relation under the abstraction of each flow identifier is first computed; after that, the number of flows in each state of the components is counted. The first row shows that the initial state of the abstraction of M0 under every flow identifier is simulated by the initial states of all components; in other words, no flow has made any progress with regard to any component. By the transition (s0, x(1), s1), flow 1 progresses with regard to component C3, so the state s1 of τ_1̄(M0) is simulated by the state v2 of the component C3 in Fig. 4a. Hence, the counter of flows in the state v1 decreases by one and the counter of flows in the state v2 increases by one. After calculating the counters for each state, the sets of equivalent states are obtained as shown in Table 2.
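The count vectors can be computed in a single pass per trace, because M0 is a tree and the component models are action-deterministic and τ-free: while a trace is replayed, only the acting flow advances in its own component model, and every other flow stays where it is. The Java sketch below (our own simplification; it assumes globally unique component state names and component models that define a successor for every observed action, and it is not meant to reproduce the exact bookkeeping behind Table 1) records the counter-abstracted snapshot of the flow states after every action.

    import java.util.*;

    // Sketch of the per-flow traversal behind Definition 9: each flow starts in
    // the initial state of its component and is advanced only by its own actions;
    // after every action, the counter abstraction of all flow states is recorded.
    class FlowProgressSketch {
        record ParametricAction(String symbol, int flow) { }

        static List<Map<String, Integer>> countVectorsAlongTrace(
                List<ParametricAction> trace,
                Map<Integer, String> initialStateOfFlow,           // flow id -> initial component state
                Map<String, Map<String, String>> delta) {           // component state -> action -> state
            Map<Integer, String> flowState = new HashMap<>(initialStateOfFlow);
            List<Map<String, Integer>> snapshots = new ArrayList<>();
            snapshots.add(counters(flowState));                     // snapshot for s0
            for (ParametricAction a : trace) {
                String current = flowState.get(a.flow());
                flowState.put(a.flow(), delta.get(current).get(a.symbol())); // advance only this flow
                snapshots.add(counters(flowState));                 // snapshot for the next M0 state
            }
            return snapshots;
        }

        private static Map<String, Integer> counters(Map<Integer, String> flowState) {
            Map<String, Integer> count = new HashMap<>();
            for (String componentState : flowState.values()) count.merge(componentState, 1, Integer::sum);
            return count;
        }
    }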


Table 1
Valuation of the count function in each state of the initial transition system (c abbreviates count).

        Component C1                                   Component C2        Component C3
State   c(s,t1) c(s,t2) c(s,t3) c(s,t4) c(s,t5)        c(s,w1) c(s,w2)     c(s,v1) c(s,v2)
s0      8       0       0       0       0              8       0           8       0
s1      8       0       0       0       0              8       0           7       1
s2      7       1       0       0       0              8       0           7       1
s3      7       1       0       0       0              8       0           6       2
s4      7       1       0       0       0              8       0           6       2
s5      7       1       0       0       0              8       0           7       1
s6      7       0       1       0       0              8       0           7       1
s7      7       0       0       1       0              8       0           7       1
s8      7       0       0       0       1              8       0           7       1
s9      7       0       0       0       1              8       0           8       0
s10     7       1       0       0       0              8       0           8       0
s11     7       1       0       0       0              8       0           7       1
s12     7       0       1       0       0              8       0           7       1
s13     7       0       1       0       0              8       0           8       0
s14     7       0       0       1       0              8       0           8       0
s15     7       0       0       0       1              8       0           8       0
s16     8       0       0       0       0              8       0           7       1
s17     8       0       0       0       0              7       1           7       1
s18     8       0       0       0       0              8       0           7       1
s19     7       1       0       0       0              8       0           7       1
s20     7       0       1       0       0              8       0           7       1
s21     7       0       1       0       0              8       0           8       0
s22     7       0       0       1       0              8       0           8       0
s23     7       0       0       0       1              8       0           8       0

Table 2
The set of equivalent states with their new associated state number.

New state   Equivalent states        New state   Equivalent states
s0          {s0}                     s6          {s13, s21}
s1          {s10}                    s7          {s7}
s2          {s1, s16, s18}           s8          {s14, s22}
s3          {s2, s5, s11, s19}       s9          {s8}
s4          {s6, s12, s20}           s10         {s9, s15, s23}
s5          {s3, s4}                 s11         {s17}

2. Merging equivalent states: After finding the sets of equivalent states of M0, the merging process is performed. Let [s] denote the equivalence class of the state s, i.e., ∀ s' ∈ S : s' ∈ [s] ⇔ ∀ Ci = (Si, Acti, →i, s0i, ↓i), ∀ ti ∈ Si : count(s, ti) = count(s', ti). A merged state inherits the union of the incoming and outgoing transitions of its origin states in M0. By merging these states, the next transition system M1 = (S', Act, →', s0', ↓') is obtained, where

• S' = {[s] | s ∈ S},
• →' = {([s], α, [t]) | ∃ s', t' ∈ S : s' ∈ [s], t' ∈ [t], (s', α, t') ∈ →},
• s0' = [s0],
• ↓' = {[s] | s ∈ ↓}.

Since M1 does not remove any transition or state of M0, it contains all the traces of M0 (i.e., Traces(M0) ⊆ Traces(M1)). Furthermore, M1 weakly simulates M0, witnessed by the relation R = {(s', [s]) | s' ∈ [s]}. This relation is a weak simulation since, for any arbitrary pair (s', [s]), if s' −α→ t', then by construction of →' we have [s] −α→ [t], where t' ∈ [t] and t' R [t].
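The merge itself can be driven by a hash map keyed by the counter vector, which is how the O(1) grouping mentioned in Section 2.1 can be realized. The sketch below (our own Java illustration; the types are simplified) groups states by their count vectors and rewires every transition to the representatives of its source and target classes.

    import java.util.*;

    // Sketch of Step 2.2: states with identical count vectors fall into one class
    // via a hash map; every transition is redirected to class representatives.
    class MergeSketch {
        record Transition(int source, String label, int target) { }

        static List<Transition> merge(List<Transition> transitions,
                                      Map<Integer, Map<String, Integer>> countVectorOf) {
            // states with the same counter-abstraction key share one representative
            Map<Map<String, Integer>, Integer> representative = new HashMap<>();
            Map<Integer, Integer> classOf = new HashMap<>();
            for (Map.Entry<Integer, Map<String, Integer>> e : countVectorOf.entrySet()) {
                int rep = representative.computeIfAbsent(e.getValue(), key -> e.getKey());
                classOf.put(e.getKey(), rep);
            }
            // a merged state inherits all incoming and outgoing transitions
            Set<Transition> merged = new LinkedHashSet<>();
            for (Transition t : transitions) {
                merged.add(new Transition(classOf.get(t.source()), t.label(), classOf.get(t.target())));
            }
            return new ArrayList<>(merged);
        }
    }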

Step 3: Flow identifier inference. The obtained model contains a number of flow identifiers on its transitions. Depending on the values of its state variables, different executions of a multi-component application generate different numbers of component instances, which may be executed concurrently. As explained in Section 1, two different flow identifiers from different application executions may have been derived from the same instantiation of a component, so it is important to find the equivalent flow identifiers and unify them in the resulting model. In this step, we discover equivalent flows by finding transitions between two states that differ only in their flow identifiers: when two flows trigger transitions with the same action between the same pair of states of the model, they play the same role in those states. For instance, as a consequence of merging the equivalent states, the states in the sets {s1, s16, s18} and {s2, s5, s11, s19} are aggregated together, while two transitions with the labels a(8) and a(2) connect them; therefore, the flow identifiers 8 and 2 can be unified. Table 3 illustrates how the equivalent flow identifiers of the obtained model for the running example are inferred and renamed.


Table 3
The inferred equivalence rules between flow identifiers after merging the transitions of equivalent states.

Source states          Destination states      Merged transitions                    Inferred rule   New flow ID
{s1, s16, s18}         {s2, s5, s11, s19}      (s1, a(2), s2), (s18, a(8), s19)      f: 2 = f: 8     1
{s0}                   {s1, s16, s18}          (s0, x(1), s1), (s0, x(6), s16)       f: 1 = f: 6     2
{s2, s5, s11, s19}     {s6, s12, s20}          (s11, b(4), s12), (s19, b(8), s20)    f: 4 = f: 8     1
{s6, s12, s20}         {s13, s21}              (s12, y(5), s13), (s20, y(6), s21)    f: 5 = f: 6     2
–                      –                       –                                     f: 3            3
–                      –                       –                                     f: 7            4
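Step 3 amounts to taking the transitive closure of the inferred equalities between flow identifiers, for which a union-find structure is a natural fit. The sketch below (our own Java illustration, not the paper's implementation) unifies identifiers according to the rules of Table 3 and then renumbers the resulting classes, reproducing the last column of the table.

    import java.util.*;

    // Sketch of Step 3: flow identifiers related by inferred equality rules are
    // unified with union-find and then renumbered consecutively.
    class FlowUnificationSketch {
        private final Map<Integer, Integer> parent = new HashMap<>();

        int find(int f) {
            parent.putIfAbsent(f, f);
            int root = parent.get(f);
            if (root != f) {
                root = find(root);
                parent.put(f, root);          // path compression
            }
            return root;
        }

        void union(int f1, int f2) { parent.put(find(f1), find(f2)); }

        // After all unions, assign a compact new identifier per equivalence class.
        Map<Integer, Integer> renumber(Collection<Integer> flows) {
            Map<Integer, Integer> newIdOfRoot = new LinkedHashMap<>();
            Map<Integer, Integer> newIdOfFlow = new LinkedHashMap<>();
            for (int f : flows) {
                int root = find(f);
                newIdOfRoot.putIfAbsent(root, newIdOfRoot.size() + 1);
                newIdOfFlow.put(f, newIdOfRoot.get(root));
            }
            return newIdOfFlow;
        }

        public static void main(String[] args) {
            FlowUnificationSketch u = new FlowUnificationSketch();
            u.union(2, 8); u.union(4, 8);     // rules f:2 = f:8 and f:4 = f:8 from Table 3
            u.union(1, 6); u.union(5, 6);     // rules f:1 = f:6 and f:5 = f:6
            System.out.println(u.renumber(List.of(2, 8, 4, 1, 6, 5, 3, 7)));
        }
    }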

Fig. 5. Relaxing unnecessary orders between two sync points.

Fig. 4c (without the thick transitions, the two states s12 and s13, and the dashed transitions) is the final result of performing steps 2 and 3 on M0. This approach helps reduce the resulting model in cases where the application has two instances of a component that may be active exclusively but behave similarly in some situations. The behavior of each component instance is a part of the component's behavior; if we replace the partial behavior of an instance with the behavior of the component, the model still includes the previous traces. We remark that by applying this heuristic, we integrate the behavior of both instances into one instance.

Step 4: Generalizing by completing transitions. The next generalization idea is to complete the transition set according to the transition systems of the components. We add the self-loops of each component state t ∈ Si, for some i ≤ K, to the state [s] if count(s, t) > 0. In other words, if the state s of the model simulates a state t of the transition system of the component Ci and t has a self-loop transition, we add this transition to the state s as well. Adding such transitions does not alter the equivalence classes of M1, since such self-loops change neither the state of the component nor, consequently, the model. As a result, the model accepts more traces without changing its states. Since a loop containing more than one state of one component may be interleaved in a specific order with the behavior of the other involved components of a multi-component system, we cannot blindly complete the model with such multi-state loops. After applying this step, the resulting generalized transition system is Mg = (S', Act, →g, s0', ↓') such that:

→g = →' ∪ {([s], α(f), [s]) | ∀ j ≤ K, t ∈ Sj, ∀ s ∈ S : s ∈ [t]_{τ_f̄(M0) ∼ Cj} ∧ (t, α, t) ∈ →j}.

After applying this step, the two thick self-loops labeled a on the states s1 and s3 are added to Fig. 4c.

Step 5: Generalizing by relaxing unnecessary orders. The interleaving semantics of parallel composition leads to various orders of actions, which result in diamond-like patterns in the model. Since components run concurrently and may communicate with each other either through shared variables (as in the example of Fig. 1) or through synchronous/asynchronous mechanisms, the total ordering among the actions of such components may be limited. In this step, we aim to relax unnecessary orders resulting from the concurrent structure of applications by identifying such diamond-like patterns in the model produced by the previous steps. For instance, consider the left transition system in Fig. 5 as a sub-automaton of the generated model. This figure shows that there are two possible subsequences between the flows with identifiers 1 and 2, generated by component instances of C1 and C3 of Fig. 4a. To find the diamond-like patterns of the system, we introduce the notion of a sync point.

Definition 10 (Sync point). Sync points are those states of the transition system with at least two incoming or at least two outgoing transitions. Two sync points constitute a pair if there are at least two execution paths between them.

Each pair of sync points represents the start and the end of a diamond pattern. For instance, the states s4 and s10 can constitute a sync-point pair. Indeed, this subsequence is the result of the parallel composition of two instances, one of which executes


c and d in sequence while the other executes y, under the constraints of the application. Therefore, to derive a more general model, the unobserved interleavings should be inferred. To this aim, we utilize the method proposed in [7] from the complex event processing (CEP) domain. CEP deals with processing huge numbers of events to find causality relations between their occurrences. The causality relation α ⇒ β denotes that every occurrence of α must come before β in all execution traces. For instance, {y(2) ⇒ c(1), y(2) ⇒ d(1), c(1) ⇒ d(1)} and {c(1) ⇒ d(1), c(1) ⇒ y(2), d(1) ⇒ y(2)} are the causality relations for the left and right branches of the left LTS of Fig. 5, respectively. We list all precedence rules for each branch between two sync points separately and then compare these sets to determine which rules are mandatory (those rules that are not violated in any other rule set). In the example, the only mandatory rule is c(1) ⇒ d(1). The orders which preserve the mandatory rules are then added as new branches between the two sync points. For example, the subsequence c(1) y(2) d(1) is a valid order while d(1) y(2) c(1) is not. After generalizing the model based on this idea, the new sub-automaton is the one shown on the right side of Fig. 5.

To automate this inference, the sync points must be detected. To this aim, we reduce our problem to the lowest common ancestor (LCA) problem [25] to find sub-automata which have at least two paths with the same source and destination. After finding the desired sub-automaton, the set of precedence rules for each trace between the two sync points is extracted and compared with the other sets to derive the mandatory rules. Then we append new paths which contain new orders of actions while preserving the strict rules. In effect, we automatically induce the essential orders among transitions and relax the unnecessary ones. After applying this step, the two new states s12 and s13 and the dashed transitions are added to Fig. 4c. Trivially, the transition systems resulting from these two generalization steps (steps 4 and 5) weakly simulate M1, because no state or transition was removed.

3.4. Pseudocode of behavioral model identification algorithm

Algorithm 1 shows the pseudocode of our methodology. Building the initial transition system takes linear time in the total number of input actions (lines 2-6). For each input trace π, the projection partitioning of its states is computed for all flows in O(len(π)); therefore, the corresponding component states of all the states are obtained in O(Σ_{π∈AT} len(π)) (lines 8-13). Owing to the tree-like structure of the initial transition system and the absence of both τ and nondeterministic actions in the transition systems of the components, we implicitly obtain a minimal weak simulation relation by simultaneously traversing the initial transition system and the transition systems of the components: we relate each state of the initial transition system with exactly one state of each component, and at each traversal step each transition of the initial transition system is matched with at most one transition of a component. For the second step of the method, we use the vectors v and count to store the state vectors before and after applying the counter abstraction technique, respectively.
3.4. Pseudocode of behavioral model identification algorithm

Algorithm 1 presents the pseudocode of our methodology. Obviously, building the initial transition system takes linear time in the total number of input actions (lines 2-6). For each input trace π, the projection partitioning of its states is computed for all flows in O(len(π)). Therefore, the corresponding component states of all the states are computed in O(Σ_{π∈AT} len(π)) (lines 8-13). Due to the tree-like structure of the initial transition system and the absence of both τ and nondeterministic actions in the transition systems of the components, we implicitly obtain a minimal weak simulation relation by simultaneously traversing the initial transition system and the transition systems of the components: we relate each state of the initial transition system with exactly one state of each component, and at each traversal step, each transition of the initial transition system is matched with exactly one transition of a component.

For the second step of the method, we use the vectors v and count for storing the state vectors before and after applying the counter abstraction technique, respectively. The counting operation, which calculates for each state the number of flows residing in the same component state, is performed during the previous step (line 13). Finding the equivalent states (states with the same counter abstraction key) and merging them is achieved with a hash structure; therefore, it is done in linear time in the number of states (lines 15-18). Let parents(s) denote the states with a transition whose destination is s, and incomes(s) denote the incoming transitions of the state s. Due to the tree-like structure of M0, the result of applying these functions to the states in this step has a single member. Finding the equivalent flow identifiers (step 3) is done during the merging step (lines 20-23). After the merging phase, new flow identifiers are assigned according to their equivalence relation by the method renumberingFIDs() (line 23). The fourth step adds the self-loops in O(SL × Σ_{π∈AT} len(π)), where SL is the number of self-loops in all the components; it is assumed that the self-loops are found as the result of preprocessing and stored in a set called selfloops (lines 25-28). We can find an upper bound M on the number of flows of the execution traces of an application, namely M = Max_{π∈AT}(F_π), where Max is the maximum operator. Then, the total time complexity of steps 1-4 equals O((M + SL) × Σ_{π∈AT} len(π)). Since M can be assumed to be a constant for an application and SL is not application-dependent, the total time complexity of these steps is linear in the size of the input.

The last step is an extra step to generalize the model. In this step, finding the sync points and generalizing each sub-automaton takes exponential time in the size of the input (lines 30-37). In the pseudocode, LCA is the abbreviation for least common ancestor. The method allPrecedenceRules() extracts all the precedence rules between the actions, and the method essentialRules() selects the necessary ones by eliminating two-way precedence rules. The method add() adds the new inferred branches (newBranches), derived with regard to the essential rules, to the automaton (line 37). All the steps except step 5 take linear time in the input size. As the learning process for each model is performed only once, the exponential time complexity of the last step is tolerable. Moreover, in our experiments, this step increases the precision of the result by about 35% in some cases.

4. Trace classification

In this section, we elaborate on how the identified models can be used to discover which applications have generated a given action trace.


Algorithm 1 Pseudocode of model identification.
Input: action traces (AT) with F flows, K components TS_i = (S_i, Act_i, →_i, s0_i, ↓_i)
Output: transition system TS = (S, Act, →, s0, ↓)
1:                                                        ▷ Building the initial transition system
2:  s0 = NewState(), S = {s0}, Act = ∅, → = ∅, ↓ = ∅
3:  for all π ∈ AT do
4:      for all j ∈ [1, len(π) − 1] do
5:          Act = Act ∪ {π_j}, S = S ∪ {s_{π_j}}
6:          → = → ∪ {(s_{j−1}, π_j, s_{π_j})}
7:
8:  new vector v[][] with the size of |S| × F, initialized by 0
9:  new vector count[][] with the size of |S| × |∪_{i≤K} S_i|, initialized by 0
10: for all s ∈ S do
11:     for all f ≤ F do
12:         v[s][f] = StateofComponent(v[parents(s)][f], Component(f), incomes(s))
13:         count[s][v[s][f]]++
14:
15: for all s ∈ S do                                      ▷ Finding the equivalent states
16:     HashMap.add(count[s], s)
17: for all key ∈ HashMap do                              ▷ Merging the equivalent states
18:     MergeStates(HashMap.get(key))
19:
20: for all s ∈ S do                                      ▷ Inferring flow identifiers
21:     if ∃ t ∈ S, (s, α(f_x), t), (s, α(f_y), t) then
22:         set f_x ≡ f_y
23: renumberingFIDs()
24:
25: for all (state sl, transition α) ∈ selfloops do       ▷ Completing the transitions by adding self-loops
26:     for all s ∈ S do
27:         if count[s][sl] > 0 then
28:             → = → ∪ {(s, α, s)}
29:
30: for all s ∈ S do                                      ▷ Generalizing by relaxing unnecessary orders
31:     if inDegree(s) > 1 then
32:         for all p1, p2 ∈ parents(s) do
33:             s_anc = LCA(p1, p2)
34:             aut = subAutomaton(s_anc, s)
35:             AC = allPrecedenceRules(aut)
36:             EO = essentialRules(AC)
37:             TS.add(newBranches(aut, EO))

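To make lines 8-18 of Algorithm 1 concrete, the following minimal Java sketch (ours; the state and vector encodings are illustrative assumptions, not the authors' released code) groups the states of the initial transition system by their counter-abstraction key — the vector counting, for each component state, how many flows reside in it — using the key as a hash-map key, so that equivalent states can be collected for merging in linear time.

import java.util.*;

// Sketch of the counter-abstraction grouping: states whose flows populate the
// component states with the same multiplicities share a key and are merged.
class CounterAbstractionMerge {

    // count.get(s) is the vector count[s][componentState] for state s.
    static Map<List<Integer>, List<Integer>> groupByKey(Map<Integer, int[]> count) {
        Map<List<Integer>, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : count.entrySet()) {
            List<Integer> key = Arrays.stream(e.getValue()).boxed().toList();
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Toy example: three states over two component states; s1 and s2 share the key (1, 1).
        Map<Integer, int[]> count = Map.of(
                0, new int[]{0, 0},
                1, new int[]{1, 1},
                2, new int[]{1, 1});
        groupByKey(count).forEach((key, states) ->
                System.out.println("merge states " + states + " with key " + key));
    }
}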
Indeed, given an input trace π = π1 π2 ... πn and a set of behavioral models {M1, M2, ..., Mm}, respectively for the applications {App1, App2, ..., Appm}, the problem is to identify those applications ({Appx, Appy, ..., Appz}) whose models have a subsequence in the input trace such that there is a consistent binding between the flow identifiers of the subsequence and the model. Since applications may exploit common components, they may have actions in common, so there is uncertainty in determining the source of each action in the interleaved traffic of a number of applications. Moreover, the first action of the execution trace of each application is not definitely known in the given input. Besides, models are parametrized by flow identifiers, which are symbolic values, so we need to find an appropriate binding between the flow identifiers of the model and the input trace. Thus, the number of different alternatives which should be considered is exponential, and examining all these possible situations is time- and memory-consuming.

We utilize runtime verification techniques in a novel three-step approach, instead of considering all possible bindings, to enhance the performance of the matching process for a trace. First, we ignore the flow identifiers of the input trace and the behavioral models and use runtime verification techniques to determine a set of candidate models which are non-parametrically satisfied by the input trace. These models have the chance of being the origin of a subsequence of the input trace if they also pass the parametric consistency check in the further steps. As discussed before, the number of different bindings of each flow identifier of a model to a flow identifier of the input trace is exponential with respect to the number of flows. As the models are large, this process is highly time-consuming. Hence, at the second step, we construct an abstract symbolic representation of each model to find the candidate bindings. This abstraction only contains the first transition of the flows of its corresponding model, so as to express the possible flow initiation orders in the related application. We compare these orders with those of the input trace and report as candidate orders the ones for which a parameter binding with a subsequence of the input can be found. Therefore, the number of different bindings to be considered decreases. Finally, at the third step, we extract a set of sub-automata from each candidate model that subsume a candidate order of the initial actions of flows found in the previous step. Such a reduction limits the parametric consistency checking process to just the part of the model relevant to a candidate order.


The checking process inspects the validity of the parameter binding suggested by the candidate order by verifying each sub-automaton against the input trace with a runtime verification technique.

Since deriving the abstract symbolic representations of the models is independent of the input trace, we perform this step in a preprocessing phase. To this aim, we scan each model to obtain its flow orders in terms of their initial actions as a tree-like graph called a flow dependency graph (FDG). In the following, we address each step in detail. In the next subsections, we assume that the transition system of Fig. 4c is given as the behavioral model and the trace of Fig. 6c is the input trace that should be checked, and we apply each step to this running example.

Preprocessing: building a symbolic representation abstraction. In this phase, we abstract all the behavioral models to obtain their symbolic representations (as flow dependency graphs (FDGs)), which show the dependencies among the possible flows of their applications. As mentioned before, each trace of a model is an interleaving of different flows, which can be characterized by an ordering among the initiating actions of its flows. If the input trace has a subsequence that is an execution trace of a model, then the initiation order of flows of such a subsequence should be a member of the initiation orders of flows of the traces of the origin model. Therefore, each model can be abstracted by an FDG to represent such orderings of its traces. Hence, the derived FDGs can be used as blueprints to find the right association between the flows of an input subsequence and the ones of a model.

To derive an FDG from a model, we abstract the model away from the non-first actions of flows in order to capture the initiation orders of flows. For each identified model M, we assume that the total number of its flows is F_M. We define the first action (with its source and destination states) of a flow f ≤ F_M in an execution fragment η ∈ Frags(M) as initiator(f, η) = {(s, α(f), s') | ∃ η_i = (s, α(f), s') ∧ ∄ j < i : η_j = (t, α(f'), t') ∧ f' = f}. Also, we assign an index to each initiator initiator(f, η) according to its location amongst the other initiators of the execution fragment η, denoted by 1 ≤ Index(initiator(f, η)) ≤ F_M. Then, we define the set of nodes and the set of edges of the FDG of M = (S_M, Act_M, →_M, s0_M, ↓_M) as follows:

Nodes(FDG_M) = {init} ∪ {(s, α(f), s') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s')}

Edges(FDG_M) = {init → (s, α(f), s') | ∃ η ∈ Frags(M), ∃ initiator(f, η) = (s, α(f), s'), Index(initiator(f, η)) = 1}
  ∪ {(s, α(f), s') → (t, α(f'), t') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s') ∧ initiator(f', η) = (t, α(f'), t') ∧ Index(initiator(f', η)) = Index(initiator(f, η)) + 1}

Final(FDG_M) = {(s, α(f), s') | ∃ η ∈ Frags(M), initiator(f, η) = (s, α(f), s'), ∀ f' ≠ f ∧ f' ≤ F_M : Index(initiator(f, η)) > Index(initiator(f', η))}

The derived FDG of the running example (the transition system of Fig. 4c) is shown in Fig. 6a. Each node in this graph shows the first transition of a flow in an execution fragment. The final nodes (members of Final(FDG_M)) are marked in the figure. We remark that an FDG can be derived from a model by traversing the model in a depth-first fashion. We apply this procedure to all the behavioral models of the applications and obtain a set of FDGs to be used in the second step of the classification.
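The depth-first derivation just mentioned can be sketched as follows; this is a minimal illustration of ours (class and method names are hypothetical, not the authors' released code), and it omits the computation of the Final set for brevity. An FDG edge is recorded whenever a transition initiates a flow that has not yet been seen on the current path.

import java.util.*;

// Sketch of FDG derivation: traverse the learned model depth-first and, whenever a
// transition carries a flow identifier not seen on the current path, record it as the
// next initiator, linked to the previous initiator of that path.
class FdgBuilder {

    record Transition(int src, String symbol, int flow, int dst) { }

    // adjacency: outgoing transitions per state of the learned model
    static Map<String, Set<String>> buildFdg(Map<Integer, List<Transition>> adjacency, int initial) {
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        dfs(adjacency, initial, "init", new HashSet<>(), new HashSet<>(), edges);
        return edges;
    }

    static void dfs(Map<Integer, List<Transition>> adj, int state, String lastInitiator,
                    Set<Integer> initiatedFlows, Set<Integer> onPath, Map<String, Set<String>> edges) {
        if (!onPath.add(state)) return;                         // avoid cycles along the current path
        for (Transition t : adj.getOrDefault(state, List.of())) {
            boolean isInitiator = initiatedFlows.add(t.flow()); // first action of this flow on the path
            String next = lastInitiator;
            if (isInitiator) {
                next = "(" + t.src() + "," + t.symbol() + "(" + t.flow() + ")," + t.dst() + ")";
                edges.computeIfAbsent(lastInitiator, k -> new LinkedHashSet<>()).add(next);
            }
            dfs(adj, t.dst(), next, initiatedFlows, onPath, edges);
            if (isInitiator) initiatedFlows.remove(t.flow());   // backtrack
        }
        onPath.remove(state);
    }
}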
Step 1 – Non-parametric categorization. For each application, it is required to determine whether it has the chance of being an origin of the input trace or not. A model has a chance if a subsequence of the input belongs to the traces of the model while their parameters are ignored. As a consequence of this step, the set of candidate models is reduced. Hence, for this step, the inputs are the given set of generated models {M1, M2, ..., Mm} and a finite action trace π, and the output is the reduced set of candidate models {Mi, Mj, ..., Mk}.

There are some challenges that we should consider in this phase. Because of the interleaving of the execution traces of several applications, a trace of a model is observed as a subsequence of the input trace. It means that some extra actions may appear between two consecutive actions of a trace of an application as a result of the concurrent execution of various applications. We have utilized the runtime verification technique to detect a subsequence of the input as a trace of the model. The behavioral models can be considered as regular properties that should hold for the given input trace. Therefore, we present a runtime monitor for each model that detects the satisfaction of its corresponding completed property by the input trace. To this aim, the monitor traverses the model and changes its state by matching the actions of the trace with the transitions of the property. As a traversed transition may belong to another application, the monitor should also track the possibility of not matching an action. Therefore, the monitor maintains a set of states during its traversal as its current states. Initially, the set of current states only contains the initial state of the model. Then, the monitor takes each action of the input iteratively and updates its set of current states based on the following main principles, which overcome the interleaved essence of the input:

1. Maintain all possible states: Due to uncertainty in matching an action of the input trace with some actions of a model, the origin state of each matched transition should again be included in the set of current states if the matched input was not its sole transition. Furthermore, since the first action of the execution trace of an application is not clearly specified, the initial state always remains in the set of current states.


Table 4
The step-wise application of the non-parametric classifier on the running example.

Iteration No. | Input | Current states | Triggered?
0  | – | [s0] | No
1  | a | [s0, s1] | No
2  | m | [s0, s1] | No
3  | x | [s0, s2, s3] | No
4  | a | [s0, s1, s2, s3] | No
5  | x | [s0, s2, s3, s5] | No
6  | y | [s0, s2, s3] | No
7  | a | [s0, s1, s2, s3] | No
8  | b | [s0, s1, s2, s4] | No
9  | a | [s0, s1, s2, s3, s4] | No
10 | y | [s0, s1, s2, s3, s4, s6] | No
11 | n | [s0, s1, s2, s3, s4, s6] | No
12 | c | [s0, s1, s2, s3, s4, s6, s7, s8] | No
13 | d | [s0, s1, s2, s3, s4, s6, s9, s10] | Yes

2. Maximal progress: If an action of the input trace can be triggered by some states in the set of current states, it should be triggered. It means that the set of current states should be updated as soon as possible to satisfy the maximal progress constraint.

A model is selected as a candidate if its monitor visits one of the final states during its traversal.

Algorithm 2 Pseudocode of the proposed non-parametric runtime monitor.
Input: inferred model TS = (S, Act, →, s0, ↓), input trace π
Output: true or false
1: currStates = {s0}
2: for all j ∈ [1, len(π)] do
3:     for all s ∈ currStates do
4:         if ∃ (s, π_j, s') then
5:             if outDegree(s) = 1 then currStates.remove(s)
6:             currStates.add(s')
7:             if s' ∈ ↓ then return true
8: return false
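For concreteness, the following is a small, self-contained Java rendering of Algorithm 2 written by us for illustration (the state and transition encodings are assumptions, not the authors' released code). It works on a copy of the current-state set in each iteration so that the remove/add operations of lines 5-6 do not disturb the set being traversed.

import java.util.*;

// Sketch of the non-parametric monitor of Algorithm 2: keep a set of current states,
// drop the source of a matched transition only if the matched action was its sole
// outgoing transition, and report success as soon as a final state is reached.
class NonParametricMonitor {

    record Transition(int src, String symbol, int dst) { }

    static boolean matches(List<Transition> transitions, Set<Integer> finalStates,
                           int initialState, List<String> trace) {
        Set<Integer> currStates = new HashSet<>(List.of(initialState));
        for (String action : trace) {
            Set<Integer> next = new HashSet<>(currStates);   // work on a copy of the current set
            for (int s : currStates) {
                List<Transition> out = transitions.stream().filter(t -> t.src() == s).toList();
                for (Transition t : out) {
                    if (!t.symbol().equals(action)) continue;
                    if (out.size() == 1) next.remove(s);      // sole transition: drop the source state
                    next.add(t.dst());
                    if (finalStates.contains(t.dst())) return true;
                }
            }
            currStates = next;
        }
        return false;
    }
}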

Algorithm 2 describes the pseudocode of the non-parametric runtime monitor. It can be easily shown that the given algorithm is complete, i.e., if a given input trace has a subsequence that is a trace of a model, the monitor of the model will detect it.

Proposition 1. Given an input trace containing a subsequence of a model, its monitor, described by Algorithm 2, will detect it.

Proof. The proof is by induction on the length of the subsequence. Assume that the proposition holds for subsequences of length less than or equal to n. Now we prove it for subsequences of length n + 1. Assume that for the given trace, the monitor has matched the first n elements of the subsequence. If we assume that the state visited after matching the first n elements of the subsequence is s, then by the induction hypothesis s ∈ currStates (as we can make this state final to accept the prefix of the subsequence of length n). The next actions π_j are ignored until ∃ (s, π_j, s') with s' ≠ s. According to line 6 of the algorithm, s' ∈ currStates, which makes the monitor move to s' irrespective of the fact that π_j may not be the next action in the subsequence. If π_j is the sole action for which s has a transition, then π_j is definitely the next action in the subsequence. Otherwise, according to line 5, s still remains in currStates. So s remains in currStates until a π_j is received which is symbolically the same as the (n + 1)-th action of the subsequence. Therefore, the algorithm definitely returns true. □

The step-wise application of the proposed non-parametric runtime monitor on the running example is shown in Table 4. After thirteen iterations and before reading the last action of the trace, it is recognized that the given trace, without considering its parameters, contains a subsequence that is a trace of the model. The table shows the changes of the set of current states at each step.

Step 2 – Determining the candidate flow orders. This step leverages the symbolic representation abstraction of the candidate model of each application to provide hints for relating its flows to the flows of the input trace. Indeed, an FDG illustrates the acceptable orders of the flows of its corresponding application. Since the flow identifiers in an FDG state are symbolic, we first have to relate the flow identifiers of the FDGs to their corresponding ones in the input trace. This step is performed for the applications separately and in parallel.


Table 5
The list of candidate flow orders, the flow identifier bindings, and the result of validation for the running example. The left and middle columns are obtained as the result of step 2 and the right column is achieved after applying step 3 of the classifier.

Candidate flow order | Flow identifiers binding | Result of validation
(s0, a(11), s1) → (s1, x(13), s3) → (s3, x(15), s5) | {(1 → 11), (2 → 13), (3 → 15)} | ×
(s0, a(11), s1) → (s1, x(13), s3) | {(1 → 11), (2 → 13)} | ×
(s0, a(11), s1) → (s1, x(15), s3) | {(1 → 11), (3 → 15)} | ×
(s0, a(14), s1) → (s1, x(15), s3) | {(1 → 14), (3 → 15)} | ✓
(s0, x(13), s2) → (s2, a(14), s3) | {(2 → 13), (1 → 14)} | ×
(s0, x(13), s2) → (s2, a(14), s3) → (s3, x(15), s5) | {(2 → 13), (1 → 14), (3 → 15)} | ×
(s0, x(13), s2) → (s2, a(11), s3) | {(2 → 13), (1 → 11)} | ×
(s0, x(15), s2) → (s2, a(14), s3) | {(2 → 15), (1 → 14)} | ✓
(s0, x(15), s2) → (s2, a(11), s3) | {(2 → 15), (1 → 11)} | ×

For each application, the inputs are the set of derived FDGs of the candidate models {FDG_Mi, FDG_Mj, ..., FDG_Mk} and the given action trace π, and the output for each given FDG is the set of candidate flow orders together with their flow identifier bindings. To this aim, we simultaneously traverse an FDG with regard to the input trace in a depth-first fashion and annotate the states of the FDG to find the correspondence between the flows of the input trace and the flows of the FDG. For the running example, we assume that the input trace is the one of Fig. 6c and the FDG is the one of Fig. 6a. The first action of the input trace is a(11), and by a depth-first search (DFS) of the FDG, after the init state, the state (s0, a(1), s1) is selected as their actions are the same. The state is annotated by the flow identifier 11 to represent a possible binding for the symbolic flow identifier of the state, i.e., (1 → 11). Then, the next trace action m(12) is ignored, as the next FDG state is labeled by x(2), and we proceed with the next trace action, x(13). This action is matched with the next state (s1, x(2), s3), and the state is annotated by 13. As this state is a final state, the subsequence a(11) x(13) of the input trace can be a possible initiation order for the model of the FDG. To show which transitions of the model have been matched by this subsequence, the candidate initiation order (s0, a(11), s1) → (s1, x(13), s3) is generated by applying the flow identifier bindings {(1 → 11), (2 → 13)} to the matched states of the FDG. This candidate initiation order expresses which part of the model should be considered in the next step. After ignoring a(14), x(15) can be matched by the next state (s3, x(3), s5), which is annotated by 15. As this state is again a final state, the candidate initiation order (s0, a(11), s1) → (s1, x(13), s3) → (s3, x(15), s5) with the identifier bindings {(1 → 11), (2 → 13), (3 → 15)} is generated. For each action of the input trace being matched with an FDG state, the alternative of ignoring that action is also considered during the DFS traversal. For instance, by ignoring x(13) in the state (s0, a(1), s1), x(15) can be matched with the state (s1, x(2), s3), and hence the candidate initiation order (s0, a(11), s1) → (s1, x(15), s3) results. The left column of Table 5 shows the candidate flow orders of the running example. Note that no binding for the flow identifier of the state (s2, m(4), s11) is generated, because the input trace does not have any subsequence in which action m follows x.
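A simplified sketch of ours of this step-2 traversal follows (the data types and the restriction to a single linear FDG branch are our simplifications; the authors' FDG is a tree and their implementation may differ). At each input action the traversal either binds the action to the next FDG node when the symbols match — recording a candidate order whenever a final node is reached — or ignores the action; duplicate candidates, if any, can be filtered afterwards.

import java.util.*;

// Sketch of the step-2 traversal over one FDG branch: collect flow-identifier
// bindings for every prefix of the branch that ends in a final FDG node.
class CandidateOrders {

    // One FDG branch, e.g. [a(1), x(2), x(3)]: symbols[i] is the action name and
    // flowVars[i] the symbolic flow identifier of the i-th initiator; finals holds
    // the indices of final FDG nodes.
    record Branch(List<String> symbols, List<Integer> flowVars, Set<Integer> finals) { }

    // An observed input action, e.g. a(11): symbol "a", concrete flow identifier 11.
    record Event(String symbol, int flowId) { }

    static List<Map<Integer, Integer>> candidates(Branch branch, List<Event> trace) {
        List<Map<Integer, Integer>> result = new ArrayList<>();
        dfs(branch, trace, 0, 0, new LinkedHashMap<>(), result);
        return result;
    }

    static void dfs(Branch b, List<Event> trace, int node, int pos,
                    Map<Integer, Integer> binding, List<Map<Integer, Integer>> out) {
        if (node == b.symbols().size() || pos == trace.size()) return;
        Event e = trace.get(pos);
        if (b.symbols().get(node).equals(e.symbol())) {     // match: bind the symbolic flow id
            binding.put(b.flowVars().get(node), e.flowId());
            if (b.finals().contains(node))                  // a candidate flow order ends here
                out.add(new LinkedHashMap<>(binding));
            dfs(b, trace, node + 1, pos + 1, binding, out);
            binding.remove(b.flowVars().get(node));
        }
        dfs(b, trace, node, pos + 1, binding, out);         // alternatively, ignore this input action
    }
}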
Step 3 – Validating the parametric bindings. Each candidate flow initiation order obtained in the previous step suggests a binding between the parameters of a model and the input trace. The suggested flow identifier bindings for our running example are illustrated in the middle column of Table 5. Hence, for this step, the inputs for each application with model M are the set of candidate flow orders together with their flow identifier bindings and the given action trace π, and the output is whether the model M can be an origin of π or not. The overall output of this step is the set of applications {Appx, Appy, ..., Appz} whose models match the given input trace. To validate each parameter binding of a model M, we again use the runtime verification technique, this time considering a larger part of the model (not just the initials of flows) together with the parameters.

Beforehand, for each flow initiation order o_i, we prune the model M such that only the behaviors that are a consequence of the given initiation order o_i are possible. We remark that if a model accepts a subsequence of the given input, then the model obtained by removing its self-loops also accepts a subsequence. Thus, we can extract the largest self-loop-free sub-automaton aut_i of the behavioral model M which has been symbolically abstracted by o_i. During the extraction, the parameter binding suggested by o_i is applied by adjusting the parameters accordingly. If the input trace satisfies aut_i as the property, the order o_i is confirmed. We check this satisfaction problem via a parametric runtime monitor and execute it for all the candidate orders to determine the accepted ones. Our parametric runtime monitor is exactly the same as the non-parametric runtime monitor described by Algorithm 2, except that the parameters of actions are also considered in line 4.

To extract the largest self-loop-free sub-automaton, we first remove all the transitions which are not involved in the candidate initiation flow order, e.g., the transitions (s0, x(2), s2), (s2, m(4), s11), (s2, a(1), s3) and (s3, x(3), s5) for the order (s0, a(11), s1) → (s1, x(13), s3) of the running example. Hence, the sub-automaton only contains the transitions with those flow identifiers which exist in the given flow order. Subsequently, the model is pruned by including only the states and transitions reachable in a depth-first traversal from the initial state, while the suggested bindings are applied to the actions. The sub-automaton of the model of Fig. 4c for the order (s0, a(11), s1) → (s1, x(13), s3) is shown in Fig. 6b.
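The pruning just described can be sketched as follows (our illustration; names are hypothetical, not the authors' released code): keep only the transitions of the bound flows, drop self-loops, substitute the concrete flow identifiers of the binding, and retain the part reachable from the initial state.

import java.util.*;

// Sketch of the step-3 pruning: restrict the model to the flows of a candidate order,
// drop self-loops, rename the symbolic flow identifiers according to the suggested
// binding, and keep only what is reachable from the initial state.
class SubAutomatonExtractor {

    record Transition(int src, String symbol, int flow, int dst) { }

    static List<Transition> extract(List<Transition> model, int initial, Map<Integer, Integer> binding) {
        // keep transitions of bound flows only, excluding self-loops, with concrete flow ids
        List<Transition> kept = new ArrayList<>();
        for (Transition t : model)
            if (binding.containsKey(t.flow()) && t.src() != t.dst())
                kept.add(new Transition(t.src(), t.symbol(), binding.get(t.flow()), t.dst()));

        // keep only the part reachable from the initial state (depth-first traversal)
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>(List.of(initial));
        while (!stack.isEmpty()) {
            int s = stack.pop();
            if (!reachable.add(s)) continue;
            for (Transition t : kept)
                if (t.src() == s) stack.push(t.dst());
        }
        List<Transition> result = new ArrayList<>();
        for (Transition t : kept)
            if (reachable.contains(t.src())) result.add(t);
        return result;
    }
}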


Fig. 6. FDG of the model given in Fig. 4c and its sub-automaton considered for classification of the given input.

The result of the parametric binding validation for the running example is shown in the right column of Table 5. The verification fails for the order (s0, a(11), s1) → (s1, x(13), s3), since the input does not include the three expected actions y(13), c(11) and d(11) of the automaton in Fig. 6b after the occurrence of b(11). We discovered that, for this example, the flow orders (s0, a(14), s1) → (s1, x(15), s3) and (s0, x(15), s2) → (s2, a(14), s3) are accepted.

4.1. Completeness of the approach

We prove that our approach is complete: if there is a subsequence of an application model in the input, our approach will definitely detect it. In contrast, other approaches such as statistical methods may not even detect the traces of their own training dataset because of their statistical measurements.

Theorem 3. The proposed classification is complete. In other words, there is no application A whose model contains an execution trace π as a subsequence of the input while A has not been reported by the classifier.

Proof. Assume that there exists an application A and an execution trace π of A as a subsequence of the input. Then, by Proposition 1, it passes the first step of our approach, and hence the model of A is selected as a candidate model. Recall that ignoring an action is a possible operation for matching each trace action during the depth-first traversal of the FDG of the model of A. Therefore, all actions except the ones related to the initiation flow order of π can be ignored, and the order is definitely reported by the second step. As the parametric runtime monitor is similar to the non-parametric one, again by an application of Proposition 1, application A will be reported by the classifier. □

It can be shown that our approach is not sound. Soundness would mean that if an application model is reported as an origin of the input, then there exists a trace of the model as a subsequence of the input. Since the input trace is an interleaving of the executions of different applications, some actions of another application may be wrongly recognized as actions of the reported application and, consequently, a subsequence of the input is recognized as a trace of the model (a false positive result). The other classification approaches are not sound either. What distinguishes our approach from the others is classification in terms of application behaviors. Our experiments show that our false positive rate is better in comparison with competing approaches (see Section 5.3). The precision of our behavioral model minimizes our false positive results substantially.

5. Application to the network domain: traffic classification

To illustrate the applicability of our approach, we have applied our method to the traffic classification problem in the network domain. The importance of traffic classification for network administration tasks, such as ensuring the security and quality of service of applications in computer networks, has long been acknowledged [26]. Port-based classification became inefficient due to the use of random or non-standard ports. To this aim, payload inspection [27,28] and statistical methods [29,30] were proposed. These techniques suffer from high computation cost and inefficiency, which led to the development of behavioral classifiers. The payload inspection approaches examine the content of packets.


Fig. 7. Layers of the TCP/IP model and flow concept.

Therefore, they are useless for encrypted traffic. The statistical methods classify traffic quickly, according to statistical information of packet traces, but with lower accuracy in comparison with the others. The behavioral classifiers, however, rely on behavioral aspects of applications and are useful for encrypted traffic or when we have no pre-assumption about the application protocol. In this paper, some network concepts have been utilized, which are explained in Section 5.1. We present the customized Preprocessor component for the network domain in Section 5.2 and the obtained results in Section 5.3.

5.1. Network background

Network communications take place according to defined protocols, which over the Internet follow the TCP/IP layered architecture [31]. This model consists of four layers: Application, Transport, Internet, and Link. The layered architecture means that each layer provides services to its upper layer. The packet content of each layer is built by its upper layers, as visualized in Fig. 7a. For each layer, a number of different protocols are standardized, some of which are enumerated in Fig. 7a. Each packet transferred across a network is composed of the information of the different layers, divided into two parts: the header and the content. The header includes the control information needed by the corresponding protocol and is appended to the beginning of the content. For encrypted traffic, the content parts of packets are unreadable.

Protocols are divided into connection-less and connection-oriented categories. The connection-oriented ones establish a connection before a data transmission; thus, there are handshake (initialization) and finalization phases in these protocols. These phases are not required in the connection-less protocols, which just send a request packet for each desired piece of data. Therefore, we divide the operation of each protocol into a set of control phases to abstractly capture its progress. The phases are assumed to be Init, Data, and Fin for connection-oriented protocols and Init and Data for connection-less ones. Intuitively, Init indicates the establishment of a connection or a query packet, Data shows the transmission of data, and Fin indicates the termination of a connection. We exploit these control phases to specify the behavioral models of the protocols.

A network application relies on different protocols at the different layers, which can be viewed as the components of a multi-component system. Thus, a flow is a trace of a protocol model, symbolically representing a sequence of packets which have the same value for the parameters source IP, source port, destination IP, destination port, and protocol name. An execution of an application gives rise to a number of flows. These flows are the connections which are established between the initiating system and another end system. For instance, Fig. 7b illustrates the execution of a software application which has established four flows. The flows with the protocol names TCP and HTTP have the same destination, while the flows with the protocol names FTP and DNS also differ in the destination of their connections. By applying our approach, the relations amongst the flows are considered to classify traffic, as opposed to statistical classification methods, which only employ statistics of flows.

5.2. Customizing the Preprocessor component

To apply the proposed method to the traffic classification problem, the basic components and events are recognized as the well-known protocols and the transferred packets, respectively.
Thus, we have only customized the capturing approach and the Preprocessor component. Intuitively, each network application needs to establish a number of connections with other systems in order to perform a function. Each connection follows a protocol specification. For instance, an execution of the Map application of Windows 8 contains four flows, where two are of the DNS protocol, one is of the TCP protocol, and one is of the TLS protocol. Hence, the well-known protocols can be considered as the components of a network application, and the transition systems of these protocols should be provided to the component Model Generator (cf. Fig. 2).
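As an illustration only — this is a hand-written toy example of ours, not one of the KTail-learned protocol specifications used in the experiments — such a connection-oriented component could be supplied to the Model Generator as a small transition system over its control phases and directions:

import java.util.*;

// Toy transition system of a connection-oriented protocol component over its control
// phases (Init, Data, Fin) and directions (I = received, O = sent). Illustrative only.
class TcpLikeComponent {

    record Transition(String src, String action, String dst) { }

    static List<Transition> transitions() {
        return List.of(
                new Transition("idle",        "TCPInitO", "handshaking"),  // initiate the handshake
                new Transition("handshaking", "TCPInitI", "established"),  // handshake completes
                new Transition("established", "TCPDataO", "established"),  // data transfer self-loops
                new Transition("established", "TCPDataI", "established"),
                new Transition("established", "TCPFinO",  "closing"),      // finalization phase
                new Transition("closing",     "TCPFinI",  "closed"));
    }
}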


We employ Wireshark^1 to capture the generated traffic of different executions of an application. The captured traffic is the sequence of packets sent or received as the result of executing an application during a specified time. Each packet contains a data part and multiple headers because of layer encapsulation. We only consider the information of the uppermost layer of each packet instead of the whole packet. For instance, we only take the application-layer information of HTTP packets into account, although these packets also subsume information of the TCP layer. In the absence of application-layer data, the transport-layer information is considered instead.

The component Preprocessor of Fig. 2 reassembles packets and omits useless and application-independent packets (e.g., DNS packets) to prepare the captured traffic for the component Model Generator. It also applies a mapping function to translate the packets to their corresponding actions. The function Mapper is defined as Mapper : Packets → Act × N, where Packets is the set of possible captured packets. Each packet is represented by a symbol (indicating the abstraction of the packet information) and a natural number (indicating its flow identifier). The symbols are defined as the concatenation of the packet protocol name, the name of its control phase, and its direction. Recall that the control phase can be either Init, Data, or Fin. The direction is a binary tag which can be either I or O, indicating that the packet is received or sent, respectively. For example, a received TCP packet in the handshake phase is mapped to TCPInitI, which is a member of the model symbol set. The amount of detail about packets carried over into the corresponding actions determines how sensitive the final generated model is to packet variations.

Each packet of the application execution belongs to a flow. To determine the flow identifier of a packet, we examine whether there is at least one former packet of a flow i with the same values for the five attributes source IP, source port, destination IP, destination port, and protocol name; in that case, we associate the flow identifier i with the new packet. Otherwise, we increase the number of flows and associate a new identifier with the new packet.
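The Mapper and the flow-identifier assignment described above can be sketched as follows (our illustration; the class, record, and field names are assumptions, not the authors' released code): the action symbol concatenates the protocol name, the control phase, and the direction tag, and a packet inherits the flow identifier of an earlier packet with the same five-tuple or receives a fresh one.

import java.util.*;

// Sketch of the Preprocessor's Mapper: Packets -> Act x N. The symbol is the protocol
// name + control phase + direction tag; the flow identifier is shared by all packets
// with the same (srcIP, srcPort, dstIP, dstPort, protocol) five-tuple.
class PacketMapper {

    record Packet(String srcIp, int srcPort, String dstIp, int dstPort,
                  String protocol, String phase, boolean incoming) { }

    record Action(String symbol, int flowId) { }

    private final Map<List<Object>, Integer> flowIds = new HashMap<>();
    private int nextFlowId = 1;

    Action map(Packet p) {
        String symbol = p.protocol() + p.phase() + (p.incoming() ? "I" : "O");
        List<Object> fiveTuple = List.of(p.srcIp(), p.srcPort(), p.dstIp(), p.dstPort(), p.protocol());
        int flowId = flowIds.computeIfAbsent(fiveTuple, k -> nextFlowId++);
        return new Action(symbol, flowId);
    }

    public static void main(String[] args) {
        PacketMapper mapper = new PacketMapper();
        // A received TCP handshake packet is mapped to TCPInitI with a fresh flow identifier.
        Packet p = new Packet("10.0.0.2", 51514, "93.184.216.34", 443, "TCP", "Init", true);
        System.out.println(mapper.map(p));   // prints Action[symbol=TCPInitI, flowId=1]
    }
}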

5.3. Experimental results

To evaluate our approach, we formulate four research questions:

• RQ1: Is our approach, tailored for the network domain, applicable to real-world applications?
• RQ2: How much do the steps of our learning process improve the learned model? How much does parameterizing the learned models with flow identifiers improve our results?

• RQ3: How successful is our classification approach on interleaved packet traces rather than pure traffic?
• RQ4: How useful is our behavioral model in classifying pure traces in comparison with other classification techniques?

To investigate RQ1, we have implemented the model identification and classification in the Java programming language. The code is available at https://github.com/zsabahi/multi-component-identification and https://github.com/zsabahi/multi-component-classification, respectively. Three categories of applications, version control systems, remote desktop sharing, and on-line chat, are selected for our experiments. For the first category, the two applications TortoiseSVN client of SVN^2 and SourceTree client of GIT^3 are selected. The traffic of the update command of these applications is gathered as their captured packet traces. We have also selected two remote desktop sharing applications, namely TeamViewer^4 and JoinMe^5, because their traffic is encrypted; their traffic cannot easily be identified by signature-based approaches that read the content of packets. For the last category, Skype^6 and VSee^7 are chosen. Skype has been considered in many traffic classification studies because it uses an unknown application-layer protocol [32,33]. Each application is run 100 times and its network traces are captured using the Wireshark tool. By increasing the number of experiments, the generated model is definitely improved, as the probability of observing a new behavior is increased. Some preprocessing operations have been performed to eliminate repetitive and truncated packets. We have also reassembled the segments of fragmented packets. Packets of the application- and transport-layer protocols used by these applications, namely TCP, SSL, SSLv2, TLSv1, TLSv1.2, HTTP, and UDP, have been considered and the others are filtered out. We construct the specifications of these protocols by an implementation of the KTail automata learning algorithm with K = 3, which merges states that have the same 3-future (values between 2 and 4 are often used [34–36]).

To evaluate the result of our approach for each application, we have considered the true positive rate (TPR), the false positive rate (FPR), and the time for generating and testing the model. To this end, we use the cross-validation technique for 100 iterations to calculate our metrics. Since measuring the real value of the false positive rate is impossible (it is not possible to gather all negative traces), we consider the traces of another application which has the same functionality; thus, we use the traces of the applications in the same category crosswise to calculate the false positive rates. To measure the TPR metric, we test the learned model of each application with its own traces.

1 https://www.wireshark.org/
2 https://tortoisesvn.net/
3 https://www.atlassian.com/software/sourcetree
4 https://www.teamviewer.com/en/
5 https://www.join.me/
6 https://www.skype.com/en/
7 https://vsee.com/


Table 6
The average result for step-wise applying the proposed network model identifier, run on a system with a Core i7 CPU and 8 GB RAM. The train and test times are given once per step.

Step | App. | States num. | FPR (non-param.) | TPR (non-param.) | FPR (param.) | TPR (param.) | Train time | Test time
Initial transition system | SVN | 3982 | 0 | 0.02 | 0 | 0.02 | <2 sec | <1 sec
 | GIT | 4115 | 0 | 0.01 | 0 | 0.01 | |
 | TeamViewer | 8637 | 1 | 0 | 0 | 0 | |
 | JoinMe | 34484 | 0 | 0 | 0 | 0 | |
 | Skype | 14450 | 0 | 0 | 0 | 0 | |
 | VSee | 14630 | 0 | 0 | 0 | 0 | |
Applying counter abstraction | SVN | 78 | 0.95 | 0.55 | 0 | 0.55 | <5 min | <1 sec
 | GIT | 45 | 0 | 1 | 0 | 0.90 | |
 | TeamViewer | 407 | 0 | 0.36 | 0 | 0.36 | |
 | JoinMe | 5458 | 0 | 0.25 | 0 | 0.25 | |
 | Skype | 64 | 0 | 0.88 | 0 | 0.77 | |
 | VSee | 38 | 0 | 0.91 | 0 | 0.90 | |
Completing self-loop transitions | SVN | 78 | 0.97 | 1 | 0 | 0.94 | <2 min | <1 sec
 | GIT | 45 | 0.97 | 1 | 0 | 0.95 | |
 | TeamViewer | 407 | 0 | 0.98 | 0 | 0.88 | |
 | JoinMe | 5458 | 0 | 0.56 | 0 | 0.56 | |
 | Skype | 64 | 0 | 0.95 | 0 | 0.78 | |
 | VSee | 38 | 0 | 0.99 | 0 | 0.92 | |
Relaxing unnecessary orders | SVN | 130 | 0 | 1 | 0 | 0.99 | <20 hrs | <1 sec
 | GIT | 70 | 0 | 1 | 0 | 0.99 | |
 | TeamViewer | 600 | 0 | 0.99 | 0 | 0.88 | |
 | JoinMe | >7400 | 0 | 0.91 | 0 | 0.81 | |
 | Skype | 96 | 0 | 0.95 | 0 | 0.81 | |
 | VSee | 55 | 0 | 1 | 0 | 0.95 | |
We built the models 100 times, each time considering 99 traces as the train set and one trace for testing; finally, we compute the average of the test results. To measure the FPR metric, we test the learned model in each iteration with the traces of another similar application; so in each iteration, we test a model with 100 traces of another application, and we compute the average FPR over the 100 iterations.

To address RQ2, we have measured the metrics for each step of our learning process to evaluate how much each step improves the results. Furthermore, we have measured the results of our approach in two settings: in one setting, the flow identifiers are ignored, and in the other they are considered. To evaluate the effect of each step on the learned model, we have also measured the number of states. Table 6 shows the final result of our experiments. For instance, the TPR results for SVN show that for the initial non-parametrized transition system, only 2 of the 100 iterations could detect the test trace, while this number increases to 55 after applying the counter abstraction and reaches 100 after adding the self-loop transitions. We measured the FPR of SVN with the GIT traces. The results show that the initial transition system of SVN correctly rejects the traces of the other application. After generalizing the model, this property is preserved. The major point is that, by applying our proposed generalization steps, the false positive rate does not grow for the parametric classifier; our conservative approach prevents over-generalization. Each generalization step improves the completeness of the model. Adding the self-loop transitions increases our precision for all applications. Relaxing unnecessary orders increases the precision (TPR) to 91% and 81% for non-parametric and parametric models in the worst case, respectively. Some TPR results of the last step are still less than 1; for instance, for VSee, it is 0.95, meaning that our approach fails to recognize 5 percent of its test traces, which are mainly new, unpredictable traces with respect to the train set. The non-parametrized columns of the table (columns 4 and 5) illustrate the result of applying our approach without considering the flow identifiers of actions. Comparing these results with those of the parametrized models, it can be concluded that although the non-parametrized models get better TPRs, the parametrized models get lower FPRs while their TPR results are still acceptable.

To inspect RQ3, we captured the traffic of a running system for a while as the background traffic (it includes 500 packets). We injected the pure traffic of each application into the background traffic based on a uniform distribution. We have also mixed the pure traffic of the applications of each category with the background. This procedure was performed 100 times for the 100 traces of an application: at each iteration, we learned the model of the application from the remaining 99 traces and then applied our parametric classifier to the interleaved traffic. Finally, we calculated the average of the results, as shown in Table 7. By comparing the columns of this table, we can infer that the classifier precision does not decrease for the interleaved traffic, because of the completeness of our approach. However, the FPR metric increases from 0 to 0.14 in the worst case.

To answer RQ4, we first considered classical automata learning. To this aim, we have implemented the KTail algorithm and one of its improved extensions, the GKTail algorithm. The average TPR metric for K = 2, 3, 4 for all selected applications was zero, because both algorithms merge states based on their identical K-suffixes. This result indicates that these approaches merge states that are inconsistent with respect to the protocols.


Table 7
The result of the traffic classification in the wild (avg is the abbreviation for average).

Tested traffic | TPR (pure traffic) | FPR (pure traffic) | avg TPR (interleaved traffic) | avg FPR (interleaved traffic)
SVN + background | 1 | 0 | 1 | 0
GIT + background | 1 | 0 | 1 | 0
SVN + GIT + background | – | – | 1 | 0
TeamViewer + background | 0.99 | 0 | 0.99 | 0.01
JoinMe + background | 1 | 0 | 1 | 0
TeamViewer + JoinMe + background | – | – | 0.99 | 0.01
Skype + background | 1 | 0 | 1 | 0
VSee + background | 1 | 0 | 1 | 0.14
Skype + VSee + background | – | – | 1 | 0.14

Table 8
The result of statistical classification.

Application | TPR (J48) | FPR (J48) | TPR (Naive Bayes) | FPR (Naive Bayes) | TPR (Adaboost) | FPR (Adaboost) | Test time
SVN | 0.94 | 0.03 | 0.90 | 0.33 | 0.95 | 0.031 | <1 sec
GIT | 0.97 | 0.06 | 0.66 | 0.10 | 0.97 | 0.05 | <1 sec
TeamViewer | 0.95 | 0.01 | 0.76 | 0.02 | 0.78 | 0.01 | <8 sec
JoinMe | 0.98 | 0.05 | 0.98 | 0.24 | 0.99 | 0.22 | <8 sec
Skype | 1 | 1 | 1 | 1 | 0.39 | 0.10 | <19 sec
VSee | 0 | 0 | 0 | 0 | 0.90 | 0.61 | <19 sec

To clarify the applicability of our method in the network domain, we compare it to other packet classification techniques. Port-based detection and payload inspection methods do not work on our datasets because of the usage of random or non-standard ports, unknown protocols, and encrypted traffic. Furthermore, the previous behavioral classification methods are application-specific, e.g., for Skype or P2P applications like Emule, relying on their special features; hence, they are not general enough to be applied to our selected applications. Thus, statistical classification methods are the only related work comparable to our method. To this aim, Netmate^8 is used to obtain feature vectors with 45 features of the flows of the captured traffic. Then, using the Weka tool-set,^9 the TPR and FPR metrics were measured for the algorithms mostly used in the network classification literature: J48 (C4.5), Naive Bayes, and Adaboost [37,32,33,38]. The final results of these metrics are reported in Table 8. These results indicate that the precision of each statistical classifier varies from one application to another and does not follow a certain rule. For instance, Adaboost does not have an acceptable result for Skype, while it achieves the best precision for VSee amongst the classifiers. Furthermore, the maximum false positive rate of these approaches is 1 in the worst case (for Skype), which is not acceptable for a classifier. Hence, it is not reliable to select one of these algorithms as the classifier for an interleaved traffic of different applications. Comparing these results with ours confirms the applicability of our approach because of its better false positive and true positive results.

6. Related work

We categorize the related work into two sections: model learning and model matching.

6.1. Model learning

The need to identify a model has been raised in many domains in the literature, namely automata learning, protocol reverse engineering, specification mining, and process mining. We briefly explain the works in these domains.

Automata learning. We have previously reviewed the basic algorithms of passive and active approaches like Angluin's L*, KTail, and Red-Blue in Section 2. Some research has been conducted to extend the expressiveness of the models inferred by these algorithms. The KTail algorithm is extended in [39], into an algorithm called GKTail, with the aim of generating models from method invocation traces. It is assumed that the invocations of methods are parametric and that negative traces are not given. This approach proceeds in four steps. First, the traces with an identical sequence of methods (but different parameter values) are merged together. Next, constraints on parameters are obtained via the Daikon invariant detector [40]. In the third step, a prefix tree automaton is built. Finally, the states are merged according to a criterion based on the similarity of their k-suffixes. However, this approach cannot be employed in our framework, as our parameters are not concrete values.

8 https://dan.arndt.ca/projects/netmate-flowcalc/
9 http://www.cs.waikato.ac.nz/ml/weka/


In [41], the automata learning problem is extended to infer deterministic timed automata. It was proved that identifying timed automata with two or more timed elements is not efficiently achievable. Thus, a real-time identification approach, based on evidence-driven state merging, was proposed, which infers models by considering the time between consecutive events of a system. Extending our approach with this technique to consider the timing behavior of actions is one of our future works. LearnLib is a Java open-source framework for automata learning [42]. It includes some active and passive learning techniques, such as evidence-driven state merging (EDSM), which is the basis of our algorithm. However, it assumes that the inputs are both positive and negative traces, and a state merge is performed when two states reach consistent accepting or rejecting states. In contrast, we assume that we do not have negative traces and define the state-merge criterion based on preserving the behavior of the basic components. Thus, the EDSM algorithm implemented by this library cannot be used in our work. Some studies address applications of the automata learning problem. Among these, [43] is the most related to our work, as it elaborates on inferring Mealy machines of communication protocols. In that approach, the parameters in the message format of a protocol, such as sequence numbers, configuration parameters and session identifiers, result in an infinite-state model. Hence, to minimize the state space, an abstract representation of protocol states is derived automatically in terms of the operations that a requester and a responder may perform. In other words, they assume that the format of messages and some basic rules between the requester and responder (as the protocol rules) are given. Hence, they have an assumption similar to ours, namely the existence of protocol specifications. Their algorithm is based on query evaluation (active automata learning), while we have extended passive automata learning.

Reverse engineering of protocol specifications. In this part, we enumerate works that focus on inferring protocol specifications from traffic. These works are related to ours because of their restriction to inferring a model by observing the behavior of the application as a black box. In [44], a probabilistic method was investigated to obtain a finite state machine of a protocol. It was assumed that the format of protocol messages is not given. In the first step, messages are segmented into l-length byte chunks and clustered with the aim of recognizing their control parts. Next, the most frequent patterns are selected as the message units by statistical analysis. Then, the main messages of the protocol are defined by computing the centers of the clusters. Finally, the finite state machine is constructed, whose states are the main messages and whose probabilistic transitions are the frequencies of each pair of messages. In the ReverX algorithm [45], a prefix tree automaton is built from the traces and then the states which are the destination of identical transitions are merged. Therefore, transitions with the same source and destination are created. The authors claim that if these transitions are merged, the parameters of message headers are inferred. By continuing this operation for all the states, the state machine of the protocol is obtained. Although their work is similar to ours in using passive automata learning, we differ in the conditions for state merging: the states similar in their 1-future action are merged in that approach, while the states preserving the same behavior of well-known protocol specifications are merged in our approach, which leads to a false positive rate of 0.

Specification mining. This area is a collection of techniques that attempt to mine a specification of a program from some of its artifacts [4]. Some of these works are mentioned in the following. In [5], a model is generated from an input log file in two steps. First, a trace graph is built; after that, this graph is refined to satisfy the invariants that are inferred from the traces. Invariants show the precedence rules between the occurrences of events. The method of [6] mines the specification of a software system in terms of the APIs of its libraries. It compares four algorithms: traces-only (KTail), invariants-only (Contractor [46]), and two novel algorithms, invariant-enhanced-traces and trace-enhanced-invariants, and concludes that trace-enhanced-invariants gives the best result amongst them. By extending our approach to consider state variables, these techniques could be used. An algorithm which tolerates noisy inputs is presented in [47]; it computes the number of changes required to accept a noisy input trace and accepts the trace if this value is lower than a threshold. This technique can be used in our approach to make it more robust to noisy inputs.

Process mining. The process mining area [48], also referred to as workflow mining, elaborates on the problem of learning a process description from recorded logs. This technique is mainly used for highly concurrent systems such as business processes and usually extracts the final model as a Petri net [49]. The process mining algorithms fall into three categories: discovery, conformance, and enhancement. In the first category, it is assumed that there is no prior model; the goal is to find the causality relations between the events recorded in the work log, in the form of a Petri net model. The alpha algorithm is a well-known algorithm of this category. The goal of the second category is conformance checking of an a priori model against the given log of process events. Finally, the third category addresses how to improve the performance of an existing model using new information about the system. ProM [50] and Disco [51] are examples of process mining tools. Though the learning phase of our work is comparable with the discovery algorithms, the level of abstraction of workflow mining is higher and is not suitable for network packet traces or function call traces. Furthermore, in these methods, domain information, such as the specification of components, is not used. Therefore, it seems that the precision of these methods would be lower than that of our approach and similar to the KTail approach. Also, the second category, the conformance algorithms, is comparable with our classifier; however, we consider an interleaved traffic while they assume that the input is pure. Comparing our method with the process mining algorithms in more detail is amongst our future works.


6.2. Model matching

Runtime verification. Monitoring a trace to check property satisfaction is addressed thoroughly in the runtime verification domain. Recently, dealing with the parameters of properties or traces has received more attention, and a great deal of research has been conducted. JavaMOP [9] is the pioneering work on parametric trace slicing; after that, quantified event automata (QEA) [52] were presented. These approaches provide efficient monitors by conceptually partitioning (slicing) a trace into subtraces with a specific value for each parameter combination. QEA allows existential and universal quantification over the variables of a property. Since our only parameter is the flow identifier, we do not need quantifiers in our monitor. Monitorability for μHML, automata-like properties with branching structures, was first addressed in [53]; its formal basis was then provided in [54], and some studies complement this foundational framework, namely [55], which extended monitors to consider internal actions. Since we generate the models from the observed behaviors, all our actions are external. Another extension was introduced in [56] to address temporal properties: the monitor considers a set of conditions encoded in the trace, generated by another process, which may describe events that could not have happened or that may happen at specific points in the execution of a system. However, since our behavioral models, acting as properties, are terminating, our monitored properties are regular and we do not need to consider more expressive properties. Rule-based monitors such as LogFire [8] consist of a set of rules: if the facts on the left-hand side of a rule hold, then the right-hand side can be applied, which may add or delete some facts. Although these methods and other works propose rich techniques to overcome the issues of parametric monitoring, our problem has different assumptions and we need a customized solution. In our problem, there are several models (as properties) to be considered in parallel, and the actions of each model have symbolic parameters, the flow identifiers. In fact, a binding between the symbolic flow identifiers of the models and the parameters of the actions of the input trace should be found. The interleaving of the traces of different application models and the uncertainty about the location of their initial actions in the trace are our other assumptions.

7. Conclusion

Classification approaches based on the behavioral patterns of applications are the new trend for this problem, but no general and automated method to derive behavioral models had been provided. We proposed a method to reach this goal based on the automata identification problem and the evidence-driven state merging technique, combined with transition system theories, namely behavioral pre-order relations and counter abstraction, and with complex event processing. Intuitively, we assumed that the behavior of an application can be identified in terms of how it executes its depending components, well-known in a domain, abstracting from its state variables. Hence, we have introduced our merging conditions to identify the equivalent states based on the specifications of such components: two states preserving the same behavior (in each flow) with respect to all of the components are considered equivalent. To improve the condition, we take advantage of counter abstraction to merge states with the same number of flows preserving the same behavior. To cover unobserved behaviors and relax the unnecessary orders of actions due to the concurrent development of an application, we exploited complex event processing to derive the essential causality relations (between two sync points) to steer the completion step. We also implemented and evaluated our framework in the network domain, where it does not require human inspection. The experiments show very encouraging results: the generalization steps significantly increase the accuracy from 0% to 91% and to 81% for non-parametric and parametric models in the worst case, respectively. Existing model identification approaches such as KTail can be used for deriving the models of stand-alone components. However, for complex applications depending on several components, these approaches are not beneficial, and exploiting domain-specific information (like the component specifications) substantially improves the result. Our experimental results in the network domain confirm the applicability of our approach in terms of the false positive and true positive rate metrics in comparison with statistical approaches, while the classical automata learning algorithms suffer from a zero true positive rate. We aim to consider the timing behavior following the approach of [41], in combination with techniques from the specification mining literature such as [4–6], to derive timing invariants. Another future work is to generalize our technique and apply it to other areas such as malware detection or reverse engineering of software specifications (specification mining). Extending our experiments to evaluate the scalability of our approach in the network domain is another direction.

Acknowledgements

We would like to thank Niloofar Naderian, Fateme Bajelan, and Mohammad Behzadifar for their help in the implementation of the method and Hamed Rahmatollahi for his help in gathering the datasets.

References

[1] K. Xu, Z. Zhang, S. Bhattacharyya, Profiling internet backbone traffic: behavior models and applications, SIGCOMM Comput. Commun. Rev. 35 (4) (2005) 169–180.

[2] P. Bermolen, M. Mellia, M. Meo, D. Rossi, S. Valenti, Abacus: accurate behavioral classification of P2P-TV traffic, Comput. Netw. 55 (6) (2011) 1394–1411.
[3] J. Kinable, O. Kostakis, Malware classification based on call graph clustering, J. Comput. Virol. 7 (4) (2011) 233–245.
[4] G. Reger, Automata Based Monitoring and Mining of Execution Traces, Ph.D. thesis, University of Manchester, 2014.
[5] I. Beschastnikh, J. Abrahamson, Y. Brun, M. Ernst, Synoptic: studying logged behavior with inferred models, in: Proc. ESEC/FSE '11, ACM, 2011, pp. 448–451.
[6] I. Krka, Y. Brun, N. Medvidovic, Automatic mining of specifications from invocation traces and method invariants, in: Proc. FSE 2014, ACM, 2014, pp. 178–189.
[7] M. Alessandro, C. Gianpaolo, T. Giordano, Learning from the past: automated rule generation for complex event processing, in: Proc. DEBS '14, ACM, 2014, pp. 47–58.
[8] K. Havelund, Rule-based runtime verification revisited, Int. J. Softw. Tools Technol. Transf. 17 (2) (2015) 143–170.
[9] P. Meredith, D. Jin, D. Griffith, F. Chen, R. Grigore, An overview of the MOP runtime verification framework, Int. J. Softw. Tools Technol. Transf. 14 (3) (2012) 249–289.
[10] Z. Sabahi-Kaviani, F. Ghassemi, F. Bajelan, Automatic transition system model identification for network applications from packet traces, in: Proc. FSEN 2017, 2017, pp. 212–227.
[11] M. Heule, S. Verwer, Exact DFA identification using SAT solvers, in: Grammatical Inference: Theoretical Results and Applications, Springer, 2010, pp. 66–79.
[12] E. Gold, Language identification in the limit, Inf. Control 10 (5) (1967) 447–474.
[13] D. Angluin, Learning regular sets from queries and counterexamples, Inf. Comput. 75 (2) (1987) 87–106.
[14] K. Lang, B. Pearlmutter, R. Price, Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm, in: Proc. 4th ICGI, Springer, 1998, pp. 1–12.
[15] N. Walkinshaw, J. Derrick, Q. Guo, Iterative refinement of reverse-engineered models by model-based testing, in: Proc. 2nd FM, Springer, 2009, pp. 305–320.
[16] A. Biermann, J. Feldman, On the synthesis of finite-state machines from samples of their behavior, IEEE Trans. Comput. C-21 (6) (1972) 592–597.
[17] C. Baier, J.-P. Katoen, Principles of Model Checking, MIT Press, Cambridge, ISBN 026202649X, 2008.
[18] J.F. Groote, M.R. Mousavi, Modeling and Analysis of Communicating Systems, The MIT Press, 2014.
[19] R. van Glabbeek, The linear time – branching time spectrum, in: Lecture Notes in Computer Science, vol. 458, Springer, 1990, pp. 278–297.
[20] G. Basler, M. Mazzucchi, T. Wahl, D. Kroening, Symbolic counter abstraction for concurrent software, in: Computer Aided Verification, Springer, 2009, pp. 64–78.
[21] R. Milner, Communication and Concurrency, Prentice-Hall, 1989.
[22] A. Emerson, R. Trefler, From asymmetry to full symmetry: new techniques for symmetry reduction in model checking, in: Proc. 10th CHARME, Springer, 1999, pp. 142–156.
[23] A. Pnueli, J. Xu, L. Zuck, Liveness with (0, 1, ∞)-counter abstraction, in: Proc. 14th CAV, Springer, 2002, pp. 107–122.
[24] M. Leucker, C. Schallhart, A brief account of runtime verification, J. Log. Algebraic Program. 78 (5) (2009) 293–303.
[25] B. Schieber, U. Vishkin, On finding lowest common ancestors: simplification and parallelization, SIAM J. Comput. 17 (6) (1988) 1253–1262, https://doi.org/10.1137/0217079.
[26] S. Valenti, D. Rossi, A. Dainotti, A. Pescapè, A. Finamore, M. Mellia, Data Traffic Monitoring and Analysis, Springer, Berlin, Heidelberg, 2013, pp. 123–147 (Ch. Reviewing traffic classification).
[27] A. Moore, K. Papagiannaki, Toward the accurate identification of network applications, in: Passive and Active Network Measurement, Springer, 2005, pp. 41–54.
[28] S. Sen, O. Spatscheck, D. Wang, Accurate, scalable in-network identification of p2p traffic using application signatures, in: Proc. 13th WWW, ACM, 2004, pp. 512–521.
[29] A. Moore, D. Zuev, Internet traffic classification using Bayesian analysis techniques, in: ACM SIGMETRICS Performance Evaluation Review, vol. 33, ACM, 2005, pp. 50–60.
[30] A. McGregor, M. Hall, P. Lorier, J. Brunskill, Flow clustering using machine learning techniques, in: Passive and Active Network Measurement, Springer, 2004, pp. 205–214.
[31] K. Fall, R. Stevens, TCP/IP Illustrated, vol. 1: The Protocols, Addison-Wesley, 2011.
[32] R. Alshammari, A.N. Zincir-Heywood, Machine learning based encrypted traffic classification: identifying SSH and Skype, in: Proc. CISDA'09, IEEE Press, Piscataway, NJ, USA, 2009, pp. 289–296.
[33] R. Alshammari, A.N. Zincir-Heywood, Unveiling Skype encrypted tunnels using GP, in: Proc. CEC'10, 2010, pp. 1–8.
[34] S.P. Reiss, M. Renieris, Encoding program executions, in: Proc. ICSE '01, IEEE, 2001, pp. 221–230.
[35] L. Mariani, M. Pezzè, Dynamic detection of COTS component incompatibility, IEEE Softw. 24 (5) (2007) 76–85.
[36] J.E. Cook, A.L. Wolf, Discovering models of software processes from event-based data, ACM Trans. Softw. Eng. Methodol. 7 (3) (1998) 215–249.
[37] N. Williams, S. Zander, G. Armitage, A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification, SIGCOMM Comput. Commun. Rev. 36 (5) (2006) 5–16.
[38] S. Ubik, P. Žejdl, Evaluating application-layer classification using a machine learning technique over different high speed networks, in: Proc. ICSNC'10, 2010, pp. 387–391.
[39] D. Lorenzoli, L. Mariani, M. Pezzè, Automatic generation of software behavioral models, in: Proc. 30th ICSE, ACM, 2008, pp. 501–510.
[40] M. Ernst, J. Cockrell, W. Griswold, D. Notkin, Dynamically discovering likely program invariants to support program evolution, IEEE Trans. Softw. Eng. 27 (2) (2001) 99–123.
[41] S. Verwer, Efficient Identification of Timed Automata: Theory and Practice, Ph.D. thesis, TU Delft, Delft University of Technology, 2010.
[42] LearnLib – an open framework for automata learning, https://learnlib.de/.
[43] F. Aarts, B. Jonsson, J. Uijen, Generating models of infinite-state communication protocols using regular inference with abstraction, in: Testing Software and Systems, Springer, 2010, pp. 188–204.
[44] Y. Wang, Z. Zhang, D. Yao, B. Qu, L. Guo, Inferring protocol state machine from network traces: a probabilistic approach, in: Proc. 9th International Conference on ACNS, Springer, 2011, pp. 1–18.
[45] J. Antunes, N. Neves, P. Verissimo, Reverse engineering of protocols from network traces, in: Proc. 18th WCRE, 2011, pp. 169–178.
[46] G. de Caso, V. Braberman, D. Garbervetsky, S. Uchitel, Automated abstractions for contract validation, IEEE Trans. Softw. Eng. 38 (1) (2012) 141–162.
[47] G. Reger, H. Barringer, D. Rydeheard, Automata-based pattern mining from imperfect traces, SIGSOFT Softw. Eng. Notes 40 (1) (2015) 1–8.
[48] W. Van Der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer Science & Business Media, 2011.
[49] W.M.P. van der Aalst, The application of Petri nets to workflow management, J. Circuits Syst. Comput. 08 (01) (1998) 21–66.
[50] B.F. Dongen, A.A. Medeiros, H.M.W. Verbeek, A.J.M.M. Weijters, W.M.P. Aalst, The ProM framework: a new era in process mining tool support, in: Proc. ICATPN, Springer, 2005, pp. 444–454.
[51] Process mining and automated process discovery software for professionals, https://fluxicon.com/disco/.
[52] H. Barringer, Y. Falcone, K. Havelund, G. Reger, D. Rydeheard, Quantified event automata: towards expressive and efficient runtime monitors, in: Proc. FM, Springer, 2012, pp. 68–84.

[53] A. Francalanza, L. Aceto, A. Ingólfsdóttir, Monitorability for the Hennessy–Milner logic with recursion, Form. Methods Syst. Des. 51 (1) (2017) 87–116.
[54] A. Francalanza, L. Aceto, A. Achilleos, D.P. Attard, I. Cassar, D.D. Monica, A. Ingólfsdóttir, A foundation for runtime monitoring, in: Proc. RV, 2017.
[55] L. Aceto, A. Achilleos, A. Francalanza, A. Ingólfsdóttir, Monitoring for silent actions, in: Proc. FSTTCS, 2017.
[56] L. Aceto, A. Achilleos, A. Francalanza, A. Ingólfsdóttir, A framework for parameterized monitorability, in: Proc. FOSSACS, 2018.