Improving custom-tailored variability mining using outlier and cluster detection


David Wille, Önder Babur, Loek Cleophas, Christoph Seidl, Mark van den Brand, Ina Schaefer

PII: S0167-6423(18)30138-2
DOI: https://doi.org/10.1016/j.scico.2018.04.002
Reference: SCICO 2204

To appear in: Science of Computer Programming

Received date: 14 August 2017
Revised date: 19 February 2018
Accepted date: 22 April 2018

Please cite this article in press as: D. Wille et al., Improving Custom-Tailored Variability Mining Using Outlier and Cluster Detection, Sci. Comput. Program. (2018), https://doi.org/10.1016/j.scico.2018.04.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• A variability mining approach to merge a set of models into a single representation.
• The approach allows customization for different languages using the presented framework.
• Cluster and outlier detection narrows the input down to a sensible subset.
• A feasibility study using two industrial case studies.

Improving Custom-Tailored Variability Mining Using Outlier and Cluster Detection

David Wille a,∗, Önder Babur b, Loek Cleophas b,c, Christoph Seidl a, Mark van den Brand b, Ina Schaefer a

a Technische Universität Braunschweig, Germany
b Eindhoven University of Technology, The Netherlands
c Stellenbosch University, South Africa

Abstract

To satisfy demand for customized software solutions, companies commonly use so-called clone-and-own approaches to reuse functionality by copying existing realization artifacts and modifying them to create new product variants. Lacking clear documentation about the variability relations (i.e., the common and varying parts), the resulting variants have to be developed, maintained and evolved in isolation. In previous work, we introduced a semi-automatic mining algorithm allowing custom-tailored identification of distinct variability relations for block-based model variants (e.g., MATLAB/Simulink models or statecharts) using user-adjustable metrics. However, variants completely unrelated to other variants (i.e., outliers) can negatively influence the usefulness of the generated variability relations for developers maintaining the variants (e.g., erroneous relations might be identified). In addition, splitting the compared models into smaller sets (i.e., clusters) can be sensible to provide developers with separate viewpoints on different variable system features. In further previous work, we proposed statistical clustering capable of identifying such outliers and clusters. The contribution of this paper is twofold. First, we present guidelines and a generic implementation that both ease adaptation of our variability mining algorithm to new languages. Second, we integrate our clustering approach as a preprocessing step for the mining. This allows users to remove outliers prior to executing variability mining on suggested clusters. Using models from two industrial case studies, we show the feasibility of the approach and discuss how our clustering can support our variability mining in identifying sensible variability information.

Keywords: variability mining, block-based language, clustering, outlier detection, conceptual framework, clone-and-own

∗ Corresponding author
Email addresses: [email protected] (David Wille), [email protected] (Önder Babur), [email protected] (Loek Cleophas), [email protected] (Christoph Seidl), [email protected] (Mark van den Brand), [email protected] (Ina Schaefer)

Preprint submitted to Science of Computer Programming, April 24, 2018

1. Introduction

To satisfy demand for customized products, companies often develop variants of their software that are specifically tailored to new requirements. These variants form product families with largely similar functionality, in which only small parts are newly implemented or slightly adapted compared to other variants [18]. For example, in the automotive domain different variants of software for electronic control units (ECUs) are needed as additional functionality, such as driver assistance systems or comfort features, can be selected by customers. Developing each of the software variants in isolation is a tedious task and in most cases infeasible because of the size and complexity of the developed systems [18]. Thus, strategies for reuse of existing functionality from previous variants are needed.

A common strategy is using so-called clone-and-own approaches that copy the code base from an existing variant and modify it to meet changed requirements. This approach allows easy reuse of implementation solutions for existing features; only the additional functionality has to be realized [18]. However, clone-and-own can have severe consequences during different maintenance tasks, as relations between variants are often not documented and no managed reuse strategy exists [18]. Thus, keeping an overview of a growing set of variants becomes almost impossible. For example, duplicate variants might exist that are maintained by different teams completely unaware of the redundant solutions and the unnecessary costs involved. Overall, managing large sets of variants that were created using clone-and-own is a tedious, error-prone and costly task [18, 26, 42].

Software product lines (SPLs) introduce managed reuse by maintaining a single code base that allows derivation of all contained variants using suitable generation facilities [13, 14, 43].
One of the benefits is that errors can be fixed in a single location and afterwards only the affected variants have to be regenerated. Thus, developers do not have to manually search for and fix errors in each variant individually. While SPLs are a possible solution to clone-and-own related problems, adopting such a reuse strategy without abolishing existing variants is a complex and time-consuming task, as variability relations (i.e., common and varying parts) between all variants have to be identified to support generation of all variants from a single code base.

Block-based languages, such as The Mathworks MATLAB/Simulink [57] or IBM Rational Rhapsody [25] statecharts, are commonly used during development of software in domains with high complexity (e.g., the automotive domain). By providing suitable strategies to abstract from complex problems, these languages allow solving these problems on a less complex and more manageable level [15]. While different approaches exist in the literature to merge such model variants into a single model containing annotations about the origin of the different parts (e.g., [23, 39, 45, 48]), these approaches do not make variability relations visible to developers. Such relations explicitly categorize parts into mandatory parts (i.e., common to all variants), alternative parts (i.e., mutually exclusive across all variants), or optional parts (i.e., only part of certain variants). To overcome these limitations, in previous work we introduced our family mining approach, which is capable of semi-automatically reverse-engineering such relations for model variants [24, 63, 64, 66]. The approach allows for custom-tailored variability mining as it uses user-adjustable metrics and, thus, provides the possibility to integrate domain knowledge (e.g., about the relevance of certain elements for the model functionality) in the comparisons of models. The identified variability is stored in a single aggregated 150% model by merging all artifacts from the analyzed models and annotating them with information about their origin (i.e., their parent models) and their explicit variability relations.

Different block-based languages are applied for different tasks or domains, and our variability mining approach is not directly applicable to all of them due to their individual characteristics, but has to be adapted for the respective language. Thus, we developed guidelines to adapt our algorithm for new languages (e.g., domain specific languages (DSLs) used in companies or industrial languages, such as IBM Rational Rhapsody statecharts) [67]. Overall, the identified variability information allows developers to understand relations between the variants and to get an overview of the existing functionality. While such information eases maintenance of the variants (e.g., variants containing an erroneous part can be easily identified), it most importantly allows introducing managed reuse strategies from the SPL domain by reusing implementations from variants that were created using clone-and-own approaches [65]. However, as our approach depends on the compared variants, problems can arise when relations between these possibly large sets of models are unknown, for example, after acquiring another company or expanding to a new market (e.g., from truck diesel ECUs to car diesel ECUs). So-called outliers (i.e., models whose implementation is unrelated to other models) can negatively influence the results, as undesired or unexpected variability might be identified.
In addition, it could be more sensible to split a large set of variants into multiple smaller clusters to compare only variants that are functionally related. This way, specific variability information for these variants can be provided to the developers maintaining them, and unrelated variants can be neglected. In previous work, we developed a statistical clustering technique that is capable of identifying clusters of closely related model variants and ruling out outliers [6, 7, 8, 9].

In this paper, we adapt our clustering techniques to remove outliers prior to executing our variability mining approach on selected clusters of closely related variants. Furthermore, we improve our mining algorithm by providing a generalization allowing its adaptation for new languages with largely reduced effort. In particular, we make the following contributions:

• We consolidate our efforts for a custom-tailored variability mining algorithm from previous work [24, 63, 64, 66] and generalize our data model to reduce the adaptation effort for new languages using our guidelines from [67].

• We adapt our clustering technique [6, 7, 8, 9] to identify sensible clusters and to remove outliers from the input models prior to applying our variability mining approach.

• We evaluate whether our adapted clustering technique is able to improve the results of our variability mining algorithm. For our feasibility case study we use industrial-scale model variants from one IBM Rational Rhapsody case study and one MATLAB/Simulink case study.

This paper is structured as follows: Section 2 motivates the problem we solve. Section 3 explains background on block-based languages and our existing clustering and variability mining. Section 4 and Section 5 describe our cluster and outlier detection together with guidelines to adapt our mining for different block-based languages. Section 6 gives implementation details and Section 7 evaluates our solution to demonstrate its feasibility. Section 8 discusses related work and Section 9 concludes with an outlook on future work.

2. Motivating Example and Overall Workflow

To give a clear motivation of the problems that we solve in this paper, we introduce a running example. We first motivate the need for automatic and fine-grained variability mining between large sets of model variants (cf. Section 2.1) and then discuss challenges with sets of models consisting of multiple subclusters and containing outliers (cf. Section 2.2). From these observations, we derive our workflow to solve these challenges (cf. Section 2.3) in the remainder of the paper.

2.1. Fine-grained Variability Analysis of Model Variants

To explain the challenges in analyzing the variability between large sets of related model variants, we present in Figure 1 two statechart implementations for the central locking system (CLS) feature of the body comfort system (BCS) of a car depending on the used power window (PW). The ManPW variant (cf. Figure 1a) is equipped with a manual power window that requires the user to push the up or down button until the window is completely opened or closed, whereas the AutoPW variant (cf. Figure 1b) uses a power window automatically closing or opening the window after pushing one of the corresponding buttons. The only major difference between the implementations manifests itself when the user locks the car and the transition from state cls_unlock to state cls_lock is executed.
The implementation of the ManPW variant distinguishes between cases where the window is closed (pw_pos == 1) or still open (pw_pos != 1). In case the window is already closed, the system also disables the power window (pw_enabled = false). Otherwise, the user can still manually close the window after locking the car (e.g., when sitting inside the locked car). In case of the CLS implementation for the AutoPW variant no distinction is needed and

[Figure: both statecharts consist of the states cls_unlock and cls_lock; (a) the ManPW variant transitions via key_pos_lock [pw_pos == 1] / cls_locked=true; pw_enabled=false; or key_pos_lock [pw_pos != 1] / cls_locked=true;, (b) the AutoPW variant via key_pos_lock / cls_locked=true; pw_enabled=false; GEN(pw_but_up); – both return via key_pos_unlock / cls_locked=false; pw_enabled=true;]

(a) ManPW variant of the CLS feature. (b) AutoPW variant of the CLS feature.

Figure 1: Differing statechart implementations of the central locking system (CLS) feature for a body comfort system (BCS) of a car depending on the applied power window (PW).
V  | EM | FP | HMI | PW | AutoPW | ManPW | RCK | CLS | Heat | AS | LED | LED_AS | LED_Heat
V1 | X  | X  | X   | X  |        | X     |     |     |      |    |     |        |
V2 | X  | X  | X   | X  | X      |       | X   | X   |      |    |     |        |
V3 | X  | X  | X   | X  | X      |       | X   | X   | X    |    | X   |        | X
V4 | X  | X  | X   | X  | X      |       | X   | X   |      | X  | X   | X      |
V5 | X  | X  | X   | X  | X      |       | X   | X   | X    | X  | X   | X      | X

Table 1: A basic implementation of a BCS forming an outlier (variant V1 – highlighted) in contrast to a set of highly related more sophisticated BCS implementations (variants V2 – V5).

the car automatically disables the power window and closes the windows when locking the car by generating a command (GEN(pw_but_up)).

Manually identifying the fine-grained variability relations (i.e., the common and varying elements) between the two implementations of the CLS feature (e.g., to apply bugfixes) might be feasible. However, in industry a large variety of different implementations exists. After the initial clone-and-own process these models evolve independently from each other, and developers without knowledge about their relations can only reverse-engineer them with large manual effort. In addition, implementations of related features are not directly accessible because they are encapsulated in the hierarchy of large models consisting of hundreds or thousands of model elements (e.g., states or transitions in statecharts). For example, the variants presented in Figure 1 are part of a BCS implementation for a car. In these cases a manual approach fails, as time-wise it is infeasible to manually analyze and compare large sets of such complex models in detail. Thus, we argue that automatic variability mining is inevitable to provide developers with detailed variability information about the analyzed variants.

2.2. Selecting the Right Variants for Variability Analysis

With the large number of models that were created using clone-and-own approaches, further challenges arise as relations between them are not well documented, if at all. As a consequence, selecting the right variants for comparison depends on domain knowledge that might not be available (e.g., after experts left the company or when no documentation exists) and has to be regained using tedious manual techniques. In Tables 1 and 2, we present two scenarios that show the drawbacks of selecting variants for variability analysis without knowledge of the concrete implementation.
Both tables show product variants V consisting of different feature combinations that were implemented to realize their functionality (marked with Xs). These features are part of the same BCS as in the previous example in Figure 1 and comprise exterior mirrors (EM), finger protection (FP) for the windows, a human machine interface (HMI), a PW (either ManPW or AutoPW), a remote control key (RCK), a CLS, heatable exterior mirrors (Heat), an alarm system (AS) and corresponding status LEDs. All variants share the EM, FP, HMI and PW features as the core implementation of the BCS.

2.2.1. Outliers

In Table 1, we show a scenario where two product lines of the BCS exist. A very basic variant exists (i.e., V1), only consisting of the core implementation

C  | V  | EM | FP | HMI | PW | AutoPW | ManPW | RCK | CLS | Heat | AS | LED | LED_AS | LED_Heat
C1 | V1 | X  | X  | X   | X  |        | X     |     |     |      |    |     |        |
C1 | V6 | X  | X  | X   | X  |        | X     |     | X   |      |    |     |        |
C1 | V7 | X  | X  | X   | X  |        | X     |     | X   | X    |    |     |        |
C2 | V2 | X  | X  | X   | X  | X      |       | X   | X   |      |    |     |        |
C2 | V3 | X  | X  | X   | X  | X      |       | X   | X   | X    |    | X   |        | X
C2 | V4 | X  | X  | X   | X  | X      |       | X   | X   |      | X  | X   | X      |
C2 | V5 | X  | X  | X   | X  | X      |       | X   | X   | X    | X  | X   | X      | X

Table 2: Two clusters of basic implementations (cluster C1: V1, V6 & V7) and more sophisticated implementations (cluster C2: V2 – V5) for a BCS – distinct differences are highlighted.

and the ManPW feature. Apart from the core implementation, the other four variants realize a more sophisticated BCS (i.e., V2 – V5) for a more expensive car and contain the RCK and CLS together with the possible combinations of the Heat and AS features (plus their corresponding LEDs). In this scenario, V1 can be regarded as an outlier because it has almost no relation to the other four variants (apart from the common core implementation). When comparing variants V1 – V5 unexpected or undesired variability might be identified as, for example, developers implementing the RCK and CLS features for the more sophisticated variants V2 – V5 might not be aware of the basic variant V1. As a result, the developers are not only confused by the identified unexpected variability relations but also the maintenance of these variants is hampered. On the one hand, unexpected high-level variability exists because the developers would regard the RCK and CLS features as part of the core implementation. However, in contrast to the developers’ expectations, a variability mining algorithm would identify the RCK and CLS features as optional when comparing variants V1 – V5, because V1 does not contain these features. In addition, low-level changes to feature implementations (e.g., their states and transitions) might exist because of necessary dependencies between different features. For example, the RCK and CLS features might require changes to the HMI feature (e.g., to realize additional control buttons) that are not present in V1. A variability mining algorithm would also identify these low-level changes only present in certain variants and, thus, classifies them as optional. While this is a correct representation of the variability between all developed variants, such variability details confuse developers unaware of these relations. In addition, the identified variability information is bloated with details that are only relevant when considering all developed variants (i.e., V1 – V5). 
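As a toy illustration (not our actual mining algorithm), the shift described above can be sketched in a few lines: a feature is mandatory with respect to a considered set of variants if every considered variant contains it, and optional otherwise. The feature sets follow Table 1.

```python
# Illustrative sketch only: classifies features as mandatory or optional
# relative to a considered set of variants (not the authors' algorithm).

def classify(feature_variants, considered):
    """feature_variants: dict feature -> set of variants containing it."""
    return {f: ("mandatory" if considered <= vs else "optional")
            for f, vs in feature_variants.items()}

# Feature occurrences as in Table 1 (core feature EM plus RCK and CLS).
features = {
    "EM":  {"V1", "V2", "V3", "V4", "V5"},
    "RCK": {"V2", "V3", "V4", "V5"},
    "CLS": {"V2", "V3", "V4", "V5"},
}
all_variants = {"V1", "V2", "V3", "V4", "V5"}

with_outlier = classify(features, all_variants)              # RCK, CLS: optional
without_outlier = classify(features, all_variants - {"V1"})  # RCK, CLS: mandatory
```

Removing the outlier V1 from the considered set turns RCK and CLS back into mandatory features, matching the expectations of developers who only maintain V2 – V5.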
However, the work of developers might be hampered by details concerning the relation to variant V1, because they need to analyze variability that is irrelevant for maintaining variants V2 – V5. Thus, identifying such possible outliers and ruling them out prior to the variability mining allows us to provide developers with information specifically focused on their current task.

2.2.2. Clusters

In Table 2, we show another scenario, where the manufacturer decided to add two variants V6 and V7 with the ManPW feature in addition to variant V1 from the outlier scenario. Both have the same features as V1 but in addition contain

[Figure: a set of model variants is first processed by the clustering step (Section 4), yielding found clusters (C1, C2, . . . ) and outliers; variability mining (Section 5) is then executed per cluster, producing a 150% model for each cluster.]

Figure 2: Overall workflow for the combined clustering and variability mining approach.

the CLS and the Heat feature. The four variants V2 – V5 are exactly the same as in the outlier scenario. In this scenario, we can identify two clusters C1 and C2 of variants that are closely related amongst each other but only have minor similarities across the two clusters (i.e., the common core implementation and similar features across the variants). However, larger differences exist because of feature implementations that are only present in one cluster (highlighted in the table). Similar to the outlier scenario, it might be desirable to identify variability for specific single clusters instead of the complete product family. This way, developers can concentrate on variability information that is specifically focused on their work. In addition, aligning and improving the architecture per cluster can reduce the overall effort to create a common architecture for all variants.

2.3. Workflow of the Proposed Solution

From the scenarios discussed in Section 2.2, we identified the need for:

a) Algorithms to identify fine-grained variability information between related model variants (e.g., created using clone-and-own approaches) to make these relations visible to developers.

b) Clustering and outlier detection algorithms that are executed prior to the fine-grained variability analysis to improve the results in situations with unclear relations between models (e.g., due to missing documentation).

In Figure 2, we show the workflow that we derived from these observations. Starting with a large set of model variants, we first execute our clustering step (cf. Section 4) to identify smaller clusters of highly related models and to identify and remove outliers. For each of the clusters, we execute our variability mining (cf. Section 5) to identify so-called 150% models to store the identified variability (i.e., common and varying parts) for the clusters’ models.
This 150% model can be visualized for a detailed analysis of the variability (e.g., when fixing bugs in multiple variants). Furthermore, managed reuse in the form of an SPL can be introduced by reusing implementation parts from the existing variants [65].

3. Background

In this section, we give background information on block-based languages (cf. Section 3.1) and our previous work on identifying clusters of related models (cf. Section 3.2) [6, 7, 8, 9] as well as identifying fine-grained variability information for related block-based models (cf. Section 3.3) [24, 63, 64, 66].

3.1. Block-based Languages

A common means to develop solutions for complex problems in industry are model-based languages, as they allow describing domain knowledge and concrete implementations on an abstract level with reduced complexity [15]. Such descriptions can be used to automatically generate executable code for different platforms or to perform model-based testing. In the scope of this paper, we refer to block-based modeling languages as a subgroup of these languages that represent the functionality of software in the form of directed graphs. In most of these languages, the graph’s nodes execute code written in a specific programming language (e.g., Java or C++) or represent atomic functions defined by the node’s language (e.g., mathematical operations). Usually, the execution can be passed from one node to another by triggering edges. These edges often provide means to exchange data between the nodes and connect their inports and outports. These ports define the nodes’ interfaces. A large part of block-based languages provides additional concepts to define abstractions from complex functionality by using hierarchical nodes encapsulating sub-models that define the functionality on the corresponding hierarchy level. Depending on the used language, the terms used for the model elements (i.e., nodes, edges, . . . ) might differ.

(a) MATLAB/Simulink (b) Statechart

Figure 3: Example model instances for two block-based modeling languages.

In Figure 3, we show two concrete model instances for exemplary block-based modeling languages. In Figure 3a, we can see an exemplary MATLAB/Simulink model consisting of blocks linked with connectors. This model receives a data signal via the In1 block, processes this data using the functionality specified in the subsystem block (i.e., a hierarchical node), and emits the resulting value via the Out1 block.
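The generic elements just introduced (typed nodes with ports, edges linking them, and hierarchical nodes encapsulating sub-models) can be captured in a minimal data-structure sketch. All class and field names below are illustrative and do not correspond to our actual data model.

```python
from dataclasses import dataclass, field

# Minimal illustrative data model for the common core of block-based
# languages: typed nodes with ports, edges between nodes, and hierarchical
# nodes containing sub-models.

@dataclass
class Node:
    name: str
    type: str                      # e.g. "Inport", "Subsystem", "State"
    inports: list[str] = field(default_factory=list)
    outports: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)  # sub-model (hierarchy)

@dataclass
class Edge:
    source: Node
    target: Node

# The MATLAB/Simulink example of Figure 3a in this representation:
integrator = Node("Integrator", "Integrator", ["in"], ["out"])
subsystem = Node("Subsystem", "Subsystem", ["In1"], ["Out1"], [integrator])
in1 = Node("In1", "Inport", [], ["out"])
out1 = Node("Out1", "Outport", ["in"], [])
model = [Edge(in1, subsystem), Edge(subsystem, out1)]
```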
Using hierarchical nodes similar to the subsystem in this example, developers can encapsulate complex functionality into reusable elements (i.e., in this example the integrator block pipeline). By arbitrarily nesting such hierarchical nodes it is possible to develop complex systems on a more understandable level by refining abstract functionality with each added hierarchy level. In Figure 3b, we present an exemplary statechart implementation for an LED with two colors (i.e., blue and red) consisting of system states. For each


hierarchy level the execution start is unambiguously defined by the initial state (i.e., the black bullet). Transitions between the states allow changing the system state. For example, by pushing button b1 the LED transitions from state off to the parallel state on. This state contains two regions (separated by a dashed line) that define the states of the LED’s color and its current light mode. In contrast to hierarchical states, these parallel states allow the execution to be in multiple states at once as they can have more than one region. For example, the LED can be in any combination of the states {on, blink} and {blue, red} depending on the inputs of the user via the buttons b1 and b2. In case the user pushes button b1 for more than 3 seconds, the LED is turned off again.

Looking at the presented examples, we can see that although both languages serve different purposes, they consist of nodes and edges linking them. However, they differ in the used paradigm (i.e., data-flow oriented language vs. state-oriented language). In addition, not all languages provide all discussed elements. While, for example, statecharts allow modeling of parallel execution using parallel states with multiple regions, MATLAB/Simulink only allows abstraction from complex functionality using subsystems. As a result, we define our guidelines in this paper to allow developers to correctly consider such differences during adaptation of our family mining for new languages.

3.2. Model Clustering

Model comparison is a fundamental operation in model-driven engineering (MDE), used for tackling problems such as model merging [11] and versioning [5]. Addressing the increasing size and complexity of models, many techniques have been developed based on pairwise and ’deep’ comparison of models (such as [38]).
However, another problem emerges when there is a large set of (e.g., hundreds of) models to compare: pairwise techniques are too expensive to execute on all models and more scalable techniques are required. This aspect of model comparison has been addressed in our previous work [8], noting the need for treating a large set of models holistically, e.g., for getting an overview of the dataset and potential relations, such as proximities, groups, and outliers. The proposed approach, i.e., model clustering, is inspired by document indexing and clustering in information retrieval (IR) [34], notably vector space models (VSMs) with the major components of (1) a vector representation of term frequencies in a document, (2) zones (e.g., ’author’ or ’title’), (3) weighting schemes, such as inverse document frequency (idf), and zone weights, and (4) natural language processing (NLP) techniques for handling compound terms and synonyms. The VSM allows transforming each document into an n-dimensional vector, resulting in an m×n matrix for m documents. Over the VSM, document similarity can be defined as the distance (e.g., Euclidean or Cosine) between vectors; this can be used for identifying similar groups of documents. We have applied this technique to models, and developed a model clustering framework. As input to our framework, we obtain a set of models of the same type, e.g., Ecore meta-models in the Eclipse Modeling Framework (EMF) [19], class diagrams in the Unified Modeling Language (UML) or feature models. The major steps of the workflow are:

• IR-Feature extraction: The approach starts with the meta-model-based extraction of IR-features from the models. The framework currently supports extracting typed identifiers of model elements (e.g., class or attribute names), metrics (e.g., the number of attributes for a class) and n-grams of those for capturing structural relations between model elements (e.g., a bigram for n = 2, encoding two classes with an association in between) [6, 8].

• IR-Feature comparison and NLP techniques: The framework has several parameters to configure (1) the use of the NLP techniques (tokenization, synonym checking, etc.) for comparing model identifiers, and (2) schemes for comparing IR-features (e.g., whether to consider the types of model elements, or to just ignore them) and weighting IR-features (e.g., whether to weight classes more than attributes).

• Computing the VSM: Each model, consisting of a set of IR-features, is compared against the maximal set of IR-features collected from all models. The result is a matrix, similar to the term frequency matrix in IR, where each model is represented by a vector. Consequently, we reduce the model similarity problem to a distance measurement of the corresponding vectors.

• Distance measurement and clustering: The framework allows several parameters for (1) distance measures among vectors (e.g., Euclidean or Cosine), (2) different clustering algorithms, such as k-means and hierarchical agglomerative clustering (HAC), and (3) further clustering-specific parameters, such as the linkage criterion that specifies inter-cluster distance.

There are two use cases applicable as the last step of the HAC workflow. The clusters can be automatically computed to be used directly, e.g., for the purpose of data selection and filtering. Alternatively, the resulting dendrogram can be visualized and inspected manually for gaining insight into the dataset, leaving the task of cluster and outlier identification to the user.
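Under strong simplifying assumptions (plain string IR-features instead of typed n-grams, no NLP techniques or weighting, and a minimal complete-linkage HAC with a distance cut-off), the workflow above can be sketched end to end as follows; all names are illustrative and do not reflect our framework’s API.

```python
import math
from collections import Counter

def build_vsm(models):
    """models: dict name -> list of IR-feature strings.
    Returns dict name -> term-frequency vector over the union of features."""
    vocab = sorted({f for feats in models.values() for f in feats})
    return {name: [Counter(feats)[f] for f in vocab]
            for name, feats in models.items()}

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def hac(vsm, threshold):
    """Complete-linkage agglomerative clustering with a distance cut-off."""
    clusters = [[name] for name in vsm]
    def linkage(a, b):  # complete linkage: maximal pairwise distance
        return max(cosine_distance(vsm[x], vsm[y]) for x in a for y in b)
    while len(clusters) > 1:
        i, j, d = min(((i, j, linkage(clusters[i], clusters[j]))
                       for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
                      key=lambda t: t[2])
        if d > threshold:
            break  # closest pair is above the cut-off: stop merging
        clusters[i] += clusters.pop(j)
    return clusters

models = {
    "V1": ["State:off", "State:on", "Event:b1"],
    "V2": ["State:cls_lock", "State:cls_unlock", "Event:key_pos_lock"],
    "V3": ["State:cls_lock", "State:cls_unlock", "Event:key_pos_unlock"],
}
clusters = hac(build_vsm(models), threshold=0.5)
# V2 and V3 share two of three features and merge; V1 remains a singleton.
```

Singleton clusters that remain after cutting at the threshold, such as V1 here, are natural outlier candidates.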
3.3. Family Mining

In previous work, we introduced family mining as a reverse-engineering algorithm to provide developers with fine-grained variability information between sets of block-based model variants [24, 63, 64, 66]. In Figure 4, we show its

[Figure: the set of model variants is divided into a base model and n−1 compare models; in each iteration a compare model is selected and processed by the Compare, Match, and Merge phases; the resulting 150% model serves as base model while unprocessed compare models exist.]

Figure 4: Iterative workflow for the family mining approach.

workflow consisting of the three phases Compare, Match, and Merge. Before executing the first phase, we divide the input models into a single base model (e.g., a smallest model) representing the basis for all comparisons and a set of compare models that are iteratively compared and merged with this base model for a total of n − 1 comparisons. During the Compare phase we iterate through the models by analyzing the data-flow and identify possible variability relations between the model elements. Usually, the resulting set of comparison elements is ambiguous, as for each model element multiple possible counterparts in the compared model exist. The Match phase analyzes this list of preliminary relations to identify for each element from a model at most one counterpart in the other model. The resulting list of distinct relations is then used in the Merge phase to merge the identified variability relations into a single 150% model. In case more compare models exist that were not yet processed by the algorithm, another iteration is started where the created 150% model serves as input for the base model. This way, we iteratively generate a 150% model of all compared models that can be visualized or further processed to generate an SPL [65].

4. Clustering for Variability Mining

We have used and extended (via domain-specific extraction schemes and new distance measures) the clustering technique, briefly introduced in Section 3.2, to preprocess statechart variants for variability mining. For the overall workflow of clustering given in Figure 5, we detail each of the major steps in the following subsections. Note that the technique is completely language agnostic, in the sense that we can design meta-model-based extraction schemes to generate IR-features for clustering a set of structural (typically graph-based) models conforming to such a meta-model.
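The iterative Compare/Match/Merge loop of Section 3.3 can be summarized structurally as follows. This is a sketch of the iteration only: the three phase functions are stubs standing in for the metric-based phases of the actual algorithm, and the toy instantiation treats models as plain sets of element names.

```python
# Structural sketch of the iterative family-mining workflow (phases stubbed).

def family_mining(models, compare, match, merge):
    """models: non-empty list; a smallest model is the initial base model.
    Returns the 150% model after n-1 compare/match/merge iterations."""
    models = sorted(models, key=len)           # pick a smallest model as base
    base, rest = models[0], models[1:]
    for model in rest:
        candidates = compare(base, model)      # ambiguous pairwise relations
        relations = match(candidates)          # at most one counterpart each
        base = merge(base, model, relations)   # fold into the 150% model
    return base

# Toy instantiation: models are sets of element names; merging unions them,
# so the resulting "150% model" contains every element of every variant.
def compare(base, model): return (base, model)
def match(candidates): return candidates
def merge(base, model, _relations): return base | model

m150 = family_mining([{"a", "b"}, {"a", "c"}, {"a"}], compare, match, merge)
```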
We have so far used the framework for other types of models, including EMF, UML, and feature models, with a corresponding extractor plug-in and suitable parsers and domain-specific handling for each.

Figure 5: General overview of the clustering workflow.
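The Extraction and VSM steps of this workflow can be sketched as follows. This is a simplified illustration that treats IR-features as plain strings; the actual framework (cf. [7, 8]) is considerably richer.

```python
# Sketch of the Extraction -> VSM steps in Figure 5: given the IR-features
# extracted per model variant, build a term-frequency matrix (one row per
# variant, one column per distinct feature).
def build_vsm(variant_features):
    vocab = sorted({f for feats in variant_features for f in feats})
    index = {f: i for i, f in enumerate(vocab)}
    matrix = []
    for feats in variant_features:
        row = [0] * len(vocab)
        for f in feats:
            row[index[f]] += 1  # term frequency of this feature in this variant
        matrix.append(row)
    return vocab, matrix
```

Each row of the resulting matrix is the vector representation of one model variant, on which the distance computation of Section 4.3 operates.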

4.1. Extracting IR-Features from Statecharts

The first step is to inspect the meta-model of the language (cf. Figure 7) and decide on the IR-features to extract. Depending on the problem at hand, one can, e.g., choose to extract just the names of the model elements for a domain analysis scenario, while ignoring the types of nodes or the graph structure, or completely ignore the names and consider the types and graph structure for a clone detection scenario (specifically type-II clones [44]). Here, we are interested in finding variants close to each other in terms of element names, types, and structure. The translation of this into our clustering framework is setting the IR-feature type to typed bigrams (please refer to [7] for more details).

Next, we inspect the meta-model of the language and design an extraction scheme. This is to be done by a domain expert who decides which parts of the meta-models are relevant for the problem at hand. Following the meta-model, we use the EMF API to code a simplified extraction scheme in Java including:

• Regions, States, Events: include their names and types.
• Transitions: include TransitionActions and Conditions along with their expressions/statements and types.
• Relations (to be encoded as bigrams):
  – Region → State (containment),
  – State → Event/TransitionAction/Condition (via outgoing Transitions of State),
  – Event/TransitionAction/Condition → State (via target States of Transition).

To demonstrate, example typed bigrams extracted from Figure 1a would be: [State,cls_unlock]-[Event,key_pos_lock], [Event,key_pos_lock]-[State,cls_lock], and so on.

4.2. Comparing the IR-Features

Once the IR-features are extracted, we decide on how to process and compare them to build the VSM. Again, this is done by a domain expert who knows the modeling language and chooses the parameters of the framework considering the problem at hand.
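The bigram extraction for statecharts can be sketched as follows. This is a minimal Python illustration (the actual extractor is implemented in Java against the EMF API); encoding states as (type, name) tuples and transitions as (source, event, target) triples is a hypothetical simplification.

```python
# Minimal sketch of typed-bigram extraction from a statechart; the real
# extractor works on EMF meta-model instances and also covers Regions,
# TransitionActions, and Conditions.
def typed_bigrams(states, transitions):
    """states: list of (type, name); transitions: list of (src, event, tgt)."""
    node = {name: (typ, name) for typ, name in states}
    grams = []
    for src, event, tgt in transitions:
        grams.append((node[src], ("Event", event)))  # State -> Event (outgoing)
        grams.append((("Event", event), node[tgt]))  # Event -> State (target)
    return grams

states = [("State", "cls_unlock"), ("State", "cls_lock")]
transitions = [("cls_unlock", "key_pos_lock", "cls_lock")]
bigrams = typed_bigrams(states, transitions)
```

For Figure 1a this yields [State,cls_unlock]-[Event,key_pos_lock] and [Event,key_pos_lock]-[State,cls_lock], matching the example above.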
With reference to the details on parameters in [7, 8], we have chosen to simplify the setting by turning off certain framework settings, e.g., weighting and advanced NLP, such as using WordNet for semantic relatedness. However, we do check the types, i.e., when comparing [State-A] with [Region-A], the resulting similarity is set to 0.5 due to the type mismatch, despite the exactly identical names. The NLP capabilities of the framework have also been extended to handle the names of the model elements in the statecharts. Inspecting the example statechart in Figure 1a, one immediately notices that names are typically given in snake case (token1_token2_...). We have used the built-in tokenization capabilities of the framework to process those names. We have further employed a trick to avoid parsing full expressions/statements in the conditions and actions for transitions: operators, boolean primitives, parentheses, and brackets are given to the framework as stop-words, so that they are ignored when comparing names. Overall, the NLP allows the framework to detect similarities of

event names (key_pos_lock vs. key_pos_unlock) or conditional expressions (pw_enabled = true vs. pw_enabled = false). Further NLP includes Levenshtein similarity for typos and stemming for cutting off affixes.

4.3. Vector Space Model Computation and Clustering

The IR-feature comparison scheme explained in the previous section is used to build the term-frequency matrix for the VSM. The next step is to choose a distance measure and a clustering technique, which again depends on the problem at hand. We have used the Bray-Curtis distance [16] as an approximate measure of the normalized (roughly percentage) distance between two vectors. For two n-dimensional vectors p and q with non-negative components p_i and q_i, the Bray-Curtis distance is:

bray(p, q) = (∑_{i=1}^{n} |p_i − q_i|) / (∑_{i=1}^{n} (p_i + q_i)).

The hierarchical agglomerative clustering (HAC) algorithm with average linkage (i.e., the inter-cluster distance equals the average pairwise distance of all the contained data points) is used on top of the vector distances to compute a dendrogram as the result. Inspecting the dendrogram, we detect outliers and cluster formations in the dataset. For our case study, we show a sample dendrogram in Figure 9a on page 30. The leaves of the dendrogram, with labels V*, represent data points, i.e., individual statechart model variants. Each joint in the dendrogram corresponds to a height (y axis, with possible values in the range [0..1]), which is the normalized distance between two leaves or subtrees. A possible interpretation of the dendrogram would be that variants V11611 and V11616 are outliers (marked in red frames), with variants V000* forming a big cluster (marked in a blue frame). Note that how far a cluster can be decomposed into subclusters (e.g., into two subclusters in this case) is a matter of interpretation.
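The distance computation and average-linkage clustering can be sketched as follows. This is a pure-Python illustration of both formulas; it is not the framework's actual implementation, which relies on established clustering libraries.

```python
from itertools import combinations

def bray(p, q):
    """Bray-Curtis distance between two non-negative vectors (0 = identical)."""
    num = sum(abs(a - b) for a, b in zip(p, q))
    den = sum(a + b for a, b in zip(p, q))
    return num / den if den else 0.0

def hac_average(vectors):
    """Naive hierarchical agglomerative clustering with average linkage.
    Returns the merge history as (members_a, members_b, height) tuples,
    i.e., the information a dendrogram visualizes."""
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    while len(clusters) > 1:
        def linkage(a, b):
            # average pairwise distance between all members of two clusters
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            return sum(bray(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        merges.append((sorted(clusters[a]), sorted(clusters[b]),
                       round(linkage(a, b), 4)))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges
```

Running `hac_average` on the rows of the term-frequency matrix yields the merge heights from which outliers (late, high-distance joins) and clusters (early, low-distance joins) can be read off.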
5. Variability Mining for Block-based Languages

To adapt our family mining algorithm, briefly described in Section 3.3, for a new language, we identified guidelines comprising four steps. Figure 6 shows these steps and the sections in which they are discussed. Our approach is not completely language-agnostic, as it takes details of the used modeling language into account during the analysis. Thus, the first three steps

Figure 6: Guideline steps that are needed to adapt family mining for a new block-based language. Hand symbols highlight steps that have to be executed manually, while gears indicate steps that can be automated. Steps with both symbols can only be partially automated.

are concerned with gaining knowledge that should be considered in the mining process and with providing this knowledge in a form suitable for the algorithm. Step 1 is dedicated to manually analyzing the used modeling language (cf. Section 5.1) to identify details that can be used to create a suitable meta-model representation of the language in Step 2 (cf. Section 5.2). Such a representation builds an abstract view on the analyzed model elements and allows users to exchange the meta-model depending on the language analyzed during family mining. Besides manually building such a new meta-model, reusing existing meta-model representations of well-known languages is possible. Based on the gained insights and the user's domain knowledge, a custom-tailored similarity metric can be defined in Step 3, specifically describing the domain's perception of similarity for compared models and model elements (cf. Section 5.3). Afterwards, the existing family mining algorithms (cf. Section 3.3) can be adapted in Step 4, allowing their execution for the analyzed language (cf. Section 5.4). A major improvement over our existing guidelines [67] is the exploitation of the previously identified generic structures of block-based languages. Based on this observation, we are able to reuse generic mining algorithms and reduce the adaptation effort of our family mining approach for new languages.

5.1. Analyze the Block-based Language

The foundation of our family mining approach is formed by model-based techniques that abstract from the concrete block-based language of the compared model variants. Thus, analyzing the language enables us to create such a meta-model representation for any new language. By identifying the right level of abstraction from the used language, we can create fitting meta-models that reduce the structures of models to the desired level of abstraction. This allows our family mining approach to focus only on details that are absolutely necessary to identify variability information for such models. Finding the right level of abstraction is a fine line between providing too little information for the algorithm to produce sensible results and considering too many details, resulting in unnecessary effort to gather the corresponding information and long execution times of the algorithm. Thus, profound domain knowledge of experts is needed to create a suitable meta-model for languages that should be analyzed with our family mining (cf. Section 5.2). As a result, the detailed analysis of languages to gain such knowledge is considered a completely manual step.

5.1.1. Search for Existing Meta-Models and Analyze the Language

For particular modeling languages, suitable meta-models might already exist. Reusing such meta-models allows developers to exploit existing knowledge and, overall, to reduce the effort and cost of adapting our family mining approach for a new language. Thus, we recommend that developers search for existing representations, as this basically allows them to skip to Step 3 of the adaptation in Section 5.3. For modeling languages without an available meta-model, a structured analysis of the language's artifacts has to be executed. This allows developers to get an overview and a deeper understanding of all language elements


(e.g., states and transitions for statecharts) and properties (e.g., state names and transition guards) that might need to be considered during the family mining of corresponding model implementations. Suitable sources for such a detailed analysis are different types of documentation for the language, such as developer guides or language specifications. Even with such sources available, discussions and interviews with developers who use the modeling language can provide additional insights. If no such documentation exists, these might even be the only available sources, besides analyzing existing implementations in the language. In previous work, we demonstrated that the described analysis can also be used to create a single meta-model representation for multiple language dialects (in that case, statechart dialects) based on a structured analysis [66].

5.1.2. Select Relevant Language Elements and Properties

After identifying and analyzing all language elements and corresponding properties, the gained insights should be used to classify the elements according to their relevance for the variability identification. We distinguish between relevant and irrelevant elements. Language elements or corresponding properties that contribute to the overall functionality of model instances (e.g., blocks and their functions in MATLAB/Simulink) or allow to distinguish between instances (e.g., their names) are regarded as relevant parts because they make the models comparable. For most languages, element names normally do not influence the functionality of executed models, as they are mostly used as identifiers. However, we argue that they should still be considered with at least a low influence, as developers select them for specific reasons (e.g., to specifically name a system state in a statechart). Thus, these elements have to be considered to enable our family mining to identify relations between model instances.
Other language elements or properties that only represent syntactic sugar and can be transformed to an equivalent representation, or that do not contribute to the functionality at all, are regarded as “irrelevant”. An example of such a language element are ModelReference blocks in MATLAB/Simulink models, which allow to include additional models from external files and can easily be transformed to a representation directly including the file contents. Examples of “irrelevant” element properties are the position or color of elements, as this information changes neither the execution behavior nor the semantic meaning of the models. When creating a meta-model for multiple dialects of a language, it is important to identify equivalent concepts with different names and align them to prevent redundant modeling of concepts under different names. It is also important to identify the differences between dialects to make sure that all dialect-specific concepts can be expressed with the created meta-model. During the selection of relevant elements, the developer has to find a suitable trade-off between cluttering the meta-model with additional elements and unnecessary transformations. For example, adding the ModelReference block to a MATLAB/Simulink meta-model adds complexity for the family mining, as this element has to be processed in addition to normal blocks. On the other hand, transforming such a ModelReference block into its equivalent representation adds complexity to the processing of model instances.

Figure 7: Excerpt from the meta-model for the family mining of statecharts.

Here, an importer has to execute the transformation during the import of models into a meta-model instance, and the exporter has to transform the results back to the original representation so as not to confuse developers with a notation differing from the one they are used to. Overall, the relevant elements and their properties have to be selected carefully, because otherwise the created meta-models might not be expressive enough to model concrete implementations in the analyzed language. On the other hand, a meta-model storing too much information might result in long execution times for the family mining approach, as too many details have to be processed.

5.2. Build a Meta-Model

Based on the selected relevant model elements and corresponding properties, it is now possible to build a meta-model or to modify an existing one. The first step is to select a suitable meta-modeling language. For example, we selected EMF with its notation Ecore (an implementation of the Essential Meta-Object Facility (EMOF) [58]), as it provides seamless integration with the Eclipse infrastructure used to realize our algorithms. Relevant model elements should be modeled using classes (e.g., EClasses in Ecore), with attributes (e.g., EAttributes in Ecore) modeling their relevant properties, and inheritance applied where applicable (e.g., states are a generalization of initial states). As we observed a common structure for block-based languages (cf. Section 3.1), we created a basic meta-model providing abstract classes for these structures (i.e., models, nodes, edges, and containers) and also including structures to annotate variability information for them. Inheriting from these classes enables us to create generic family mining algorithms, largely reducing the adaptation effort for new languages (cf. Section 5.4). In Figure 7, we show an excerpt from our

meta-model for statecharts (e.g., not showing transition labels and the possibility to include multiple regions in States) [66]. ModelEntity, ContainerEntity, NodeEntity, and EdgeEntity are base meta-model classes. By inheriting from them, we created statecharts providing means to model parallel and hierarchical execution (i.e., using regions). To ease the creation of meta-models inheriting from our base meta-model, we created a DSL to define languages with their relevant elements and corresponding properties. Based on specifications created with this DSL, we are able to generate a meta-model [67]. While using such a DSL might constrain developers to an unfamiliar way of describing a meta-model, the provided information allows us to automate the adaptation of family mining for new languages in later phases (e.g., to create metrics in Section 5.3). Apart from that, it is still possible to manually create meta-models using classic approaches (e.g., a visual editor). In previous work [67], our approach required inheriting from our base meta-model and using the provided Entity classes. Instead, we now additionally allow using EAnnotations to explicitly annotate the type of the created classes without introducing external dependencies. Thus, we eliminated a major limitation of previous work while still retaining the high degree of automation for the adaptation (cf. Section 5.4). The created meta-model is able to express semantically correct models from the language, provided the users executed a complete language analysis with a suitable selection of relevant elements and realized a well-formed meta-model (i.e., all elements and properties are modeled correctly) based on these results. To allow processing of the models by our family mining algorithms, developers have to realize importers that create corresponding meta-model instances.
In addition, suitable exporters have to be realized to transform the resulting annotated meta-model instances back to the original language notation.

5.3. Define a Family Mining Metric

The next step of the adaptation is to create a metric that allows comparison of elements and corresponding properties of meta-model instances. A sensible way to start is to rank the selected element properties according to their influence on the overall functionality. Properties with a high impact (e.g., the function of a MATLAB/Simulink block) should be ranked higher than properties with less expressiveness (e.g., the name of a MATLAB/Simulink block). Based on the ranking, developers can then assign corresponding weights. The summed weights for the properties of an element should equal 1 and, thus, be normalized in the interval [0..1]. This ensures the comparability of similarity values. For examples of concrete metrics used in the comparison of MATLAB/Simulink models and statecharts, we refer to [64] and [66], respectively. As the common graph structure for block-based languages (cf. Section 3.1) and the corresponding basic meta-model types (cf. Section 5.2) consist of nodes, edges, and containers, we recommend defining a metric per type to allow their comparison. Cross-type comparisons (e.g., comparing a node with an edge) are not sensible, as these different types represent completely unrelated functionalities (e.g., an execution state vs. a transition between these states) and, thus,

do not have to be considered during the modeling of the metrics. To realize our family mining algorithms in a generic way, metrics can be realized as an additional abstraction layer over the comparison of the concrete types defined by the meta-model. Our current family mining implementation provides an interface to define metrics. Using inheritance, users can define their metric based on this interface and can distinguish between subtypes internally in the created metric (e.g., between initial states and normal states). This realizes the concrete comparison of the specific types without changing the metric's interface by adding compare methods per subtype. The step of defining a concrete metric for the family mining approach can be automated by using one of the following two approaches. On the one hand, users can enrich the DSL description previously used to generate the meta-model for their new block-based language (cf. Section 5.2). By adding additional attributes defining concrete metric weights for element properties, the necessary information for a concrete metric can be provided. On the other hand, users can also directly annotate the attributes of their manually created meta-model with EAnnotations defining these weights. Using the described interface for concrete metrics, we are capable of automatically deriving a concrete metric implementation for the different types from the annotated meta-model or our DSL. The generated metric allows to compare the element properties (e.g., string, integer, or boolean values) and, based on the results, calculates an overall similarity value using the defined weights. By using the interface for the generated metric, we can also integrate it into the graphical user interface (GUI) of our family mining framework and allow for manual adjustment of the user-defined weights to create a custom-tailored metric before executing the mining.
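A weighted metric of this kind can be sketched as follows. The concrete weights (0.7/0.3) and the token-overlap string similarity are illustrative assumptions for this sketch, not the actual metrics from [64, 66].

```python
# Hypothetical weighted similarity metric for statechart nodes; the property
# weights are illustrative and must sum to 1 so results stay in [0..1].
WEIGHTS = {"name": 0.7, "type": 0.3}

def string_sim(a, b):
    """Toy string similarity: shared-token ratio over snake_case tokens."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)

def node_similarity(n1, n2):
    """Weighted sum of per-property similarities for two nodes."""
    sim = WEIGHTS["name"] * string_sim(n1["name"], n2["name"])
    sim += WEIGHTS["type"] * (1.0 if n1["type"] == n2["type"] else 0.0)
    return sim
```

For cls_unlock vs. cls_lock (both of type State), the shared token cls gives a name similarity of 1/3, so the overall similarity is 0.7 · 1/3 + 0.3 ≈ 0.53.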
5.4. Adapt the Family Mining Algorithms

After creating a meta-model representation for the analyzed language and defining a corresponding metric to allow comparison of corresponding meta-model instances, it is now possible to adapt our family mining algorithms. The major improvement over our previously introduced guidelines [67] is the generic family mining implementation that allows adapting the algorithms for new block-based languages with low implementation effort for the user. Thus, in our description of the approach, we focus on the corresponding peculiarities. In the following subsections, we give concrete details on the three family mining phases Compare, Match, and Merge (cf. Figure 4 in Section 3.3).

5.4.1. Compare Phase

During the Compare phase, we identify possible relations between two compared models. For the first iteration, we have to select, from the set of input models, a base model that serves as the basis for the comparisons. All remaining models serve as compare models and are iteratively compared and merged with the selected base model. We allow for two approaches to select the base model: either the user selects the base model manually based on domain knowledge, or an automatic algorithm is used. In our current implementation, this algorithm


selects the smallest model, as we argue that extending an existing variant with additional functionality is a common clone-and-own approach. Thus, the smallest variant most likely represents the common core of the analyzed variants. To compare the selected base model and one of the compare models, we iterate over the data-flow of the analyzed models. We start this data-flow analysis at the highest hierarchy level and compare the models' entry points, which are defined by language-specific model elements (e.g., the initial states of the statecharts in Figure 1). Here, it is important to note that not all languages or model implementations define such a clear entry point. For example, although MATLAB/Simulink has a clear notion for modeling the start of a data flow (e.g., Inport blocks or Constant blocks), it is common in industrial models that control circuits are used to model algorithms that are executed in a simulated environment. In these cases, the user either has to manually select the entry points for each of the compared models, or we can use heuristics to automatically select them (e.g., by selecting blocks with identical names). For a detailed discussion of such automatic heuristics, we refer to [67]. The comparisons are continued by virtually separating the compared models into stages, with each stage containing only elements that have the same distance (i.e., number of nodes or edges) on their shortest path to the model's entry points. For instance, all transitions from state cls_unlock to state cls_lock have distance 1. In this case, the stage for the ManPW variant contains two transitions, while the stage for the AutoPW variant contains only a single transition. Stages with the same distance in relation to the entry points are compared by creating all possible combinations of their elements.
Each combination is represented by a so-called comparison element storing the compared elements and a similarity value calculated according to the user-adjustable metric (cf. Section 5.3). In case no counterpart stage exists, the elements from the currently processed stage are compared with null to indicate that they are optional. While the described approach only considers comparisons of non-hierarchical elements, many block-based modeling languages use hierarchical elements (cf. Section 3.1). Here, we only give a brief overview of the ideas that apply to including the model hierarchy in the comparisons, and refer to [67] for a more detailed discussion. We identified two scenarios that have to be considered in addition to the described approach to realize a complete comparison of hierarchical models. The first scenario considers the comparison of two hierarchical nodes. Here, it is important to realize a comparison of all elements in the model hierarchy. The basic idea is to recursively execute the model comparisons for elements in lower hierarchies and store the results in sub-comparison elements to allow for traceability (e.g., during the merging of the 150% model). The second scenario considers the comparison of a hierarchical node (either parallel or single hierarchy level) with a non-hierarchical node. Here, we only compare the nodes' attributes that are independent of the functionality realized in the hierarchy (e.g., their names or interfaces). To realize complete comparisons of hierarchical models, all three scenarios (i.e., comparison of two non-hierarchical or two hierarchical nodes, and comparison of a non-hierarchical with a hierarchical node) should be covered by concrete metrics comparing nodes.
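The stage construction can be sketched as a breadth-first traversal from the entry points. The adjacency-list encoding of the model graph is a hypothetical simplification of the meta-model instances the implementation actually traverses.

```python
from collections import deque

# Sketch of stage construction for the Compare phase: a stage groups all
# elements with the same shortest-path distance to the model's entry points.
def build_stages(entry_points, successors):
    """successors: adjacency list {element: [next elements]}.
    Returns {distance: [elements at that distance]}."""
    dist = {e: 0 for e in entry_points}
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in dist:  # first visit = shortest distance
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    stages = {}
    for node, d in dist.items():
        stages.setdefault(d, []).append(node)
    return stages
```

On the CLS example, the two ManPW transitions between cls_unlock and cls_lock would end up in the same stage and thus be combined with the single AutoPW transition of that stage during the comparison.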

Both the entry point selection and the creation of stages are executed on the generic level of our current implementation. When using the base meta-model classes or EAnnotations (cf. Section 5.2), we are able to identify the nodes, edges, and containers in model instances without knowing their concrete type (e.g., whether they are a block or a state). To identify the entry points, we first pass the list of all nodes on the highest hierarchy level to the concrete user implementation extending our generic algorithm. The resulting list is used to execute the first comparisons using the user-defined metric. Next, the compared nodes are passed again to the concrete user implementation to identify all subsequent edges that need to be processed in the next phase. In case the language analysis (cf. Section 5.1) identified that edges influence the behavior of the models (e.g., transitions in statecharts with their events and actions), we compare the corresponding edges from the two models using the user-specified metric. Otherwise (e.g., for MATLAB/Simulink models, where connectors do not directly influence the behavior), we directly continue with the identification of the subsequent nodes in the concrete user implementation for the next comparisons. This process is continued until we reach the end of the data-flow. Using this switch between a generic implementation and an extending concrete implementation with small user-defined methods allows adapting our family mining algorithm with low effort (cf. Section 7). This is possible because large parts of the algorithms are language-agnostic and do not need to know about concrete objects. All language-specific parts (e.g., the identification of subsequent elements) can be implemented in small methods by the user. In addition, using an abstract class for the metric with language-specific concrete implementations (cf. Section 5.3) allows reusing large parts of the comparison logic; only small parts of the concrete similarity calculation have to be realized by the user. Given that during the Merge phase (cf. Section 5.4.3) all variability is correctly merged into a 150% model, our generic family mining algorithms are able to walk through all alternative paths for model comparisons with n > 2 models. Thus, our compare algorithm can identify possible relations between elements on these alternative paths and elements from the new compare model that is analyzed during the current iteration of the mining algorithm.

5.4.2. Match Phase

Usually, the resulting comparison elements are ambiguous, as multiple possible counterparts in the compared model exist for each model element (e.g., for the transitions between state cls_unlock and state cls_lock in Figure 1, two possible matches exist). During the Match phase, the list of all created comparison elements is processed to match each element from a model to at most one counterpart in the compared model. For each comparison element CE_i, we first identify all other comparison elements sharing either the base model element or the compare model element with CE_i. From the resulting list of comparison elements, we select the element with the highest similarity value as a direct match and rule out all other possibilities by removing them from the list. This step is repeated until all model elements are matched. This matching algorithm is completely agnostic about the elements stored inside the comparison elements

and only considers them as containers storing a similarity value for their contents (i.e., compared elements). Thus, the algorithm can be executed on any comparison element independently of its contents (nodes, edges, containers, or any other object), as it was realized as a central element in our framework: it can store any object independent of concrete model logic. In case no distinct match can be selected based on the similarity value (i.e., the comparison elements have the same similarity), we sort all corresponding elements to the end of the comparison element list. This solution implicitly solves most conflicts by matching other comparison elements first. For cases where this strategy fails, we use a decision wizard that allows the user to either manually select the best match or apply additional user-defined automatic resolution strategies defined prior to the execution. For example, our current automatic resolution strategy selects the comparison elements that contain model elements with the same name. In case no such comparison element exists, we select the first comparison element from the list of conflicting elements. Although this is a very naive resolution strategy, we found during our case studies that domain experts solve conflicts in a similar manner. Our DSL, or EAnnotations in the manually created meta-model, allow us to automatically generate large parts of the decision wizard. For example, we can automatically generate all GUI parts that are needed for the presentation of the conflicting model elements to the user. In addition, we allow selection of an automatic resolution strategy in the DSL or the EAnnotations to automatically generate classes for the concrete strategy. Currently, we only provide the described solution, but other algorithms can be implemented and integrated to provide additional choices.
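The matching described above can be sketched as a simplified global-greedy variant that repeatedly keeps the highest-similarity pair and discards conflicting candidates; tie handling and the decision wizard are omitted for brevity.

```python
# Sketch of the Match phase: from ambiguous comparison elements, keep for each
# model element at most one counterpart, preferring the highest similarity.
def match(comparisons):
    """comparisons: list of (base_elem, compare_elem, similarity).
    Returns the distinct matches; each element is used at most once."""
    pending = sorted(comparisons, key=lambda c: c[2], reverse=True)
    matched, used_base, used_cmp = [], set(), set()
    for base, cmp_elem, sim in pending:
        if base not in used_base and cmp_elem not in used_cmp:
            matched.append((base, cmp_elem, sim))  # direct match
            used_base.add(base)                    # rule out other candidates
            used_cmp.add(cmp_elem)
    return matched
```

For example, with two transitions per model and four candidate pairings, the two highest-similarity non-conflicting pairs survive as distinct matches.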
5.4.3. Merge Phase

After processing the matched list of comparison elements and establishing distinct relations between the compared models, we can merge a 150% model storing all identified variability relations. The first step of the merging is to categorize the relations identified for the model elements contained in each comparison element CE_i from the list. Here, we analyze the calculated similarity sim(CE_i) and use the following mapping function rel(CE_i). The thresholds proved themselves in practice with different real-world models from our partners [24, 66], but we allow for adjustments by the user to identify the expected relations.

rel(CE_i) ← mandatory,   if sim(CE_i) ≥ 0.95
            alternative, if 0 < sim(CE_i) < 0.95
            optional,    if sim(CE_i) = 0

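As a minimal sketch, the categorization above can be expressed as follows; the class and enum names are illustrative and not taken from our framework, and the mandatory threshold corresponds to the default of 0.95, which is user-adjustable.

```java
// Minimal sketch of the threshold-based categorization described above. The
// class and enum names are illustrative only; the threshold is the default
// from the mapping function rel(CE_i) and is user-adjustable in the framework.
public class VariabilityCategorizer {

    public enum Relation { MANDATORY, ALTERNATIVE, OPTIONAL }

    private final double mandatoryThreshold;

    public VariabilityCategorizer(double mandatoryThreshold) {
        this.mandatoryThreshold = mandatoryThreshold; // default: 0.95
    }

    /** Maps the similarity value of a comparison element to a variability relation. */
    public Relation categorize(double similarity) {
        if (similarity >= mandatoryThreshold) {
            return Relation.MANDATORY;   // (near-)identical in both models
        } else if (similarity > 0.0) {
            return Relation.ALTERNATIVE; // partially similar elements
        } else {
            return Relation.OPTIONAL;    // contained in only one model
        }
    }
}
```

Adjusting the threshold passed to the constructor mirrors the manual adjustment offered in the framework GUI.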
Depending on the identified relations, we have to apply different strategies to merge the compared models into a copy of the base model. In case of mandatory elements, we mark the existing element in the base model as mandatory and annotate that it is contained in both the base model and the compare model. Minor deviations between the compared elements also have to be stored, as our threshold of 95% permits such deviations: we argue that minor differences (e.g., a slight change of names) do not necessarily result in differing functionality. Without these additional annotations, we would lose information as the

[Figure 8: 150% model for the ManPW and AutoPW variants of the CLS feature in Figure 1.]

deviations would not be stored in the 150% model. In case of an optional element, we either annotate the already existing model element in the base model as optional or first merge the corresponding element from the compare model and annotate it afterwards. Here, we only add one of the compared models as the element’s origin because it is not contained in both models. In case of alternative model elements, we only merge the element into the 150% model that is not already contained in the base model copy, mark both as alternatives to each other and annotate their origin models. To distinguish between different alternative elements across the 150% model, we also add unique group ids that allow us to unambiguously identify which elements are alternative to each other. For comparisons with n > 2 models, we have to consider the already existing variability for all iterations i > 1 as here the 150% model from the previous iteration serves as base model (cf. Figure 4). For example, when identifying an additional alternative element in an alternative group, we have to merge it into the 150% model, mark it as alternative, annotate its source model and add the correct group id. Another example is elements that were previously identified as mandatory and that are not contained in the currently merged model. In this case, the existing variability has to be changed from mandatory to optional. The resulting 150% model for the two CLS features from Figure 1 can be found in Figure 8. We can see for the three transitions that only one of them is contained in the AutoPW variant while the other two are contained in the ManPW variant. The transition from the AutoPW variant and the upper transition from the ManPW variant are regarded as alternatives because the additional guard in the ManPW variant is their only difference. The remaining ManPW transition is regarded as an optional element. Both states and the transition from cls_lock to cls_unlock are regarded as mandatory.
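The bookkeeping for alternative groups and the mandatory-to-optional demotion described above can be sketched as follows; all class, field and method names are hypothetical and only illustrate the idea, not the framework's actual data structures.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the 150% model bookkeeping described above: every
// merged element records its variability class, its origin models and, for
// alternatives, a group id identifying which elements exclude each other.
// None of these names are taken from the actual framework implementation.
public class VariabilityBookkeeping {

    public enum Variability { MANDATORY, ALTERNATIVE, OPTIONAL }

    public static class Element {
        public final String name;
        public Variability variability;
        public final Set<String> origins = new LinkedHashSet<>();
        public int groupId = -1; // -1: not part of an alternative group

        public Element(String name, Variability variability, String origin) {
            this.name = name;
            this.variability = variability;
            this.origins.add(origin);
        }
    }

    private int nextGroupId = 0;

    /** Marks two elements as alternatives to each other, reusing an existing
     *  group id where possible so that groups stay stable across iterations. */
    public void markAlternative(Element a, Element b) {
        int id = a.groupId >= 0 ? a.groupId : (b.groupId >= 0 ? b.groupId : nextGroupId++);
        a.variability = Variability.ALTERNATIVE;
        b.variability = Variability.ALTERNATIVE;
        a.groupId = id;
        b.groupId = id;
    }

    /** For n > 2 input models: an element previously categorized as mandatory
     *  that is absent from the currently merged model becomes optional. */
    public void demoteIfAbsent(Element e, Set<String> currentModelElementNames) {
        if (e.variability == Variability.MANDATORY
                && !currentModelElementNames.contains(e.name)) {
            e.variability = Variability.OPTIONAL;
        }
    }
}
```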
For clarity reasons, we did not annotate the concrete variability classes (i.e., mandatory, alternative and optional) in Figure 8 and neglected the source model annotations for mandatory elements. The merging algorithm for models from the newly adapted language has to be manually implemented by the developers adapting our family mining, as the structures of block-based languages are too diverse for a generic merging algorithm to be possible. Starting from the highest hierarchy level, such algorithms have to allow merging of each hierarchy level with all contained elements. For example, in the case of MATLAB/Simulink models, the


algorithm has to merge blocks and connectors on the highest hierarchy level and can be recursively called inside of subsystem blocks. In the case of statecharts, the algorithms are more complex because the highest hierarchy level is realized by the root region with states and transitions (cf. Figure 7). Here, different strategies also have to be applied to merge hierarchical and parallel states. Most importantly, the connections between nodes have to be correctly merged using corresponding edges. For example, in the case of alternative nodes, corresponding connections to their previous and subsequent nodes have to be created to correctly allow walking through all alternative paths in the final 150% model. Although the merging algorithms have to be implemented per adapted language, we allow generation of the categorization using thresholds defined in our meta-model DSL or using EAnnotations in a manually created meta-model. For this purpose, our abstract metric class defines methods to categorize comparison elements according to their similarity value. Using the concrete metric generated from the DSL or EAnnotations, these methods are overloaded and the user-specified thresholds are used. In addition, the thresholds are integrated in our framework GUI to allow their manual adjustment prior to executing the mining. The final 150% model can now be used to export the results to a graphical representation or for further processing by additional algorithms (e.g., to generate an SPL [65]). Furthermore, it serves as input for the next iteration of the family mining algorithms in case of n > 2 input models (cf. Figure 4).

6. Implementation

In this section, we give an overview of the implementations for our clustering framework (cf. Section 6.1) and our family mining framework (cf. Section 6.2).

6.1. The Clustering Framework

The clustering framework is implemented partly in Java (IR-feature extraction and comparison) and partly in R (clustering).
While it uses Eclipse and EMF, the framework at the moment does not have an explicit Eclipse plug-in architecture; the components should be interpreted as conceptually distinct and modular steps of the workflow. Indeed, for the study in this paper, the default importer and extractor parts have been replaced with new ones for statecharts.
• Language/Base Meta-Models: The Ecore meta-models to be used for loading the statechart models and extracting IR-features.
• Importer: Standard importer for EMF resources.
• Schemes & Parameters: Specification of extraction, matching and comparison schemes and other framework parameters [7, 8].
• Extract: IR-feature extraction code, using the EMF API in Java. Here, the extraction logic discussed in Section 4.1 is implemented.
• Compare: IR-feature comparison to build the VSM (cf. Sections 4.2 and 4.3). As we extract IR-features into the internal representation of the framework, this component is used as-is with appropriate settings.
• Cluster: Distance calculation and clustering in R, using hclust for HAC and the vegan package for the Bray-Curtis distance (cf. Section 4.3).

• Visualization: Export of the cluster hierarchy plot for visual identification of clusters and outliers.

6.2. The Family Mining Framework

Our family mining framework is implemented using Java and the Eclipse plug-in mechanisms to allow for easy extension with customized algorithms.
• Family Mining Core: The framework’s core plug-in provides all basic classes for user interaction (e.g., to configure or trigger algorithms for selected files) and the execution of the family mining workflow. Developers can customize this workflow by integrating additional plug-ins via the provided extension points for the following framework plug-ins.
• Importer & Exporter: These plug-ins allow importing models from block-based modeling language files (e.g., MATLAB/Simulink model files) to the internal meta-model representation or exporting the results (e.g., in the form of reports or the original modeling language). As the import and export of models to/from an internal meta-model involves model-to-model transformations, realization of such plug-ins cannot be automated. However, we provide basic import and export plug-ins for Ecore files.
• Base Meta-Model: The base meta-model plug-in contains artifacts that can be used to create a language-specific meta-model for our generic mining algorithms (cf. Section 5.2). It is realized using EMF Ecore.
• Language Meta-Model: This meta-model is needed to have an internal representation of imported models. Its generation can be automated after a manual analysis of the language (cf. Section 5.1). It is realized using EMF. Its connection to the base meta-model is optional (cf. Section 5.2).
• Compare: Our compare algorithm is part of the core plug-ins and uses the generic structure specified by meta-models (cf. Section 5.4.1).
• Metric: Metric plug-ins can be automatically generated after analyzing new languages and specifying corresponding weights (cf. Section 5.3).
• Match: Our match algorithm is completely language-agnostic (cf. Section 5.4.2) and, thus, part of the core plug-ins.
• Decision Wizard: Decision wizards can be automatically generated with an extended DSL description or meta-model annotations (cf. Section 5.4.2).
• Merge: Merge algorithms have to be manually implemented as block-based languages are too diverse for generating such algorithms (cf. Section 5.4.3).

7. Case Study

We combined our clustering technique from Section 4 with our family mining from Section 5 to execute fine-grained variability mining on clusters of related models, excluding outliers. We used the workflow described in Figure 2 and focused on the question whether our generic realization allows family mining for different languages and how our cluster and outlier detection can improve such results:
• RQ1 – Adapting Family Mining: Is it possible to successfully execute family mining for block-based languages using our generic implementation?


• RQ2 – Reduce Implementation Effort: Does using our generic framework reduce the implementation effort during the adaptation of family mining for new languages compared to writing a custom family mining solution?
• RQ3 – Outlier Detection: Is the clustering technique capable of eliminating outliers in input models before executing the family mining approach?
• RQ4 – Cluster Detection: Is the clustering technique capable of identifying sensible clusters of related input models for the family mining?
• RQ5 – Improvement of Results: Are the results of the family mining improved by applying it only to identified clusters and neglecting outliers?

7.1. Case Study Subjects

For our case study, we selected different subjects to evaluate whether our generic algorithm can be extended for new languages to analyze variability relations between corresponding models (cf. Section 7.1.1) and whether the described clustering is capable of improving such mining results (cf. Section 7.1.2).

7.1.1. Extension of the Generic Family Mining Implementation

We selected the block-based languages MATLAB/Simulink and statecharts for the evaluation of our adaptation guidelines using our generic implementation (Generic). For both languages, we had already realized manual implementations (Manual) using our guidelines from previous work, without exploiting their common structures and implementing the algorithms for each language separately [67]. Here, we only reused the interfaces of the family mining framework (e.g., for algorithms, metrics and decision wizards) and did not use a generic implementation of the mining algorithms. For our current case study regarding our family mining, we focus on extensibility by exploiting the common structure of block-based languages (cf. Section 3.1) to reduce the adaptation effort using our Generic implementation (cf. Sections 5 and 6.2) with language-specific extensions.
The goal is to show that the Generic implementation is a) capable of creating the same valid results as our previous Manual implementation (cf. RQ1) and b) reduces the adaptation effort for new languages compared to writing a new solution (cf. RQ2). We use the same subjects as in our previous evaluation [65, 66]. Thus, for the evaluation of the statechart family mining, we concentrate on the body comfort system (BCS) also used as a motivating example in this paper (cf. Section 2). The BCS implementation represents a real-world system from the automotive domain that was realized using IBM Rational Rhapsody statecharts and was decomposed into an SPL [33, 41]. The resulting SPL comprises 27 reusable features and allows generation of 11,616 valid variants. These features encapsulate the functionality of different system parts (e.g., the central locking system or alarm system – cf. Section 2). Depending on the feature selection, the BCS statechart variants comprise up to 70 states, 40 regions and 94 transitions. In particular, we concentrate on 17 BCS variants that were derived from the BCS SPL to cover a wide range of functionality in the feature combinations [33, 41]. For the evaluation of the MATLAB/Simulink family mining, we use the MATLAB/Simulink variants derived from a driver assistance system model in the

SPES_XT project [59]. By combining the model’s features EmergencyBreak, FollowToStop, SpeedLimiter, CruiseControl and Distronic, we are able to derive 18 variants with up to 941 blocks and 980 connectors when observing the relations between the features (e.g., the Distronic feature always requires the CruiseControl feature). For the evaluation, we use eight pairwise variant combinations with an increasing number of differing features. Comparing the mining results for the selected MATLAB/Simulink and statechart combinations generated by the Generic implementation with those of the Manual implementations allows us to evaluate whether the results are the same. In addition, as the results of the completely Manual implementations were evaluated to be correct in previous work [65, 66], we have a ground truth against which to compare the results of the Generic implementation. Furthermore, we compare both implementations with each other to identify which parts have to be implemented manually and which parts can be generated partially or completely. To measure the actual implementation effort, we compare how the completely Manual and the Generic realization extended by language-specific parts relate in terms of lines of code (LOC). Thus, we analyze whether the Generic approach actually reduces the implementation effort to adapt family mining for new languages.

7.1.2. Cluster and Outlier Detection

For the evaluation of our cluster and outlier detection, we use five scenarios in the context of the BCS SPL (cf. Section 2 and Section 7.1.1) that we outline in Table 3: To evaluate the capability of the outlier detection algorithms to identify a small set of outliers in a set of otherwise highly related variants (cf. RQ3), we selected three outlier detection scenarios (OD1 – OD3). These outlier variants differ from the cluster of related variants as large parts of their selected features differ.
Apart from the outlier detection, these scenarios also evaluate whether the algorithms are capable of identifying the expected cluster of related variants. Furthermore, to evaluate the capability of the clustering algorithms without outliers present (cf. RQ4), we selected two cluster detection scenarios (CD1 & CD2) with two delimitable clusters of variants. Each selected scenario comprises clusters of valid feature selections from the BCS SPL to generate corresponding variants. Analyzing the selected scenarios, we also evaluate the benefit of the outlier and cluster detection (cf. RQ5). In Table 3, we show the number of variants contained in the different scenarios. Each scenario consists of two clusters. While Cluster 1 always represents one of the expected clusters, Cluster 2 either represents the set of outliers for the OD scenarios or the expected second cluster for the CD scenarios. The table shows, for each scenario, the number of shared features (i.e., contained in both clusters), the number of mutually exclusive features (i.e., contained only in one of the clusters) and the number of alternating features (i.e., features that cannot clearly be assigned to variants from a particular cluster). Thus, we evaluate our clustering technique using variants with different degrees of similarity (i.e., number of shared or mutually exclusive features). In addition, we evaluate the resistance against “noise” (i.e., alternating features), which might negatively influence the clustering as variants might be assigned to unexpected clusters.

Scenario   Variants                         Shared   Mutually Exclusive   Alternating
           Cl. 1   Cl. 2∗   Total
OD1        8       2        10              7        14                   6
OD2        8       2        10              10       12                   5
OD3        8       2        10              6        0                    21
CD1        8       8        16              17       2                    8
CD2        12      12       24              12       7                    8

∗ containing the outliers for the OD scenarios and the second cluster for the CD scenarios

Table 3: High-level overview of the five scenarios from the body comfort system (BCS) used to evaluate the cluster detection (CD1 & CD2) and outlier detection (OD1 – OD3).

Using the selected scenarios for the cluster and outlier detection provides us with a ground truth as we clearly labeled variants as outliers or assigned them to clusters. As a consequence, we can evaluate whether the detection is capable of generating results confirming this ground truth and, thus, meets the expectations of experts well familiar with the BCS implementation.

7.2. Methodology

For our evaluation, we first concentrate on RQ1 and RQ2 to evaluate the capabilities of our Generic family mining implementation before concentrating on RQ3 – RQ5 to evaluate the benefit of the cluster and outlier detection. After adapting family mining for the selected languages, we compare the results generated by the corresponding realizations with the results of our completely Manual implementations from previous work. For both implementations (i.e., Manual and Generic), we count the LOC for the relevant realization artifacts Compare, Match and Merge to get a concrete number for the adaptation effort reduction. After adapting variability mining for statecharts, we execute our proposed approach for each scenario selected in Table 3. For each scenario, we measure the execution time of all executed steps. To account for inaccurate runtime measuring, we repeat the execution of each scenario 10 times and calculate the average runtime. We use the following execution flow for each scenario:
1. Execution of the cluster and outlier detection
2. Evaluation of the identified clusters and outliers
3. Execution of the family mining algorithms for the identified clusters
4. Evaluation of the 150% models generated for these clusters

7.3. Results and Discussion

In this section, we report the results of our case study and discuss them with respect to our research questions.

RQ1 – Adapting Family Mining.
When comparing the results generated by our Generic implementation with the results of our Manual implementation for the same input models, we find them to be correct in the sense that they are the same as our previous results [65, 66]. Thus, as these results were previously evaluated to be correct and served as a ground truth, we conclude that the

Scenario      Cluster/Outlier Detection   Compare      Match       Merge      Overall
OD1           23631.3 ms                  3948.6 ms    256.2 ms    265.0 ms   28101.1 ms
OD2           27882.5 ms                  17243.6 ms   973.8 ms    569.5 ms   46669.4 ms
OD3           33268.7 ms                  14455.8 ms   690.4 ms    508.9 ms   48923.8 ms
CD1   Cl. 1   47866.3 ms                  9810.5 ms    1102.5 ms   755.3 ms   59534.6 ms
      Cl. 2                               9785.9 ms    939.3 ms    536.8 ms   59128.3 ms
CD2   Cl. 1   37581.1 ms                  7919.8 ms    639.4 ms    468.6 ms   46608.9 ms
      Cl. 2                               8497.4 ms    914.1 ms    501.1 ms   47493.7 ms

Table 4: Average execution times (in milliseconds) of the cluster and outlier detection as well as the generic family mining (phases Compare, Match and Merge) for the scenarios selected in Table 3.

Generic family mining is capable of identifying correct variability information. In addition, we do not identify significant increases in the execution time and, thus, conclude that using the Generic implementation with language-specific extensions does not negatively influence the performance of our algorithm. In addition to the comparison with results from previous work, we executed the scenarios in Table 3 to assess the performance of our family mining algorithm in combination with our cluster and outlier detection. In Table 4, we present the corresponding execution times. For each scenario, we show the execution time of the cluster and outlier detection together with the times of the family mining phases Compare, Match and Merge. For the CD scenarios, we show these execution times per cluster. Overall, we conclude that, considering the size of the analyzed model variants, these execution times are in an acceptable range. In addition, we argue that they outperform any manual fine-grained variability analysis of the same models, especially when performing an additional manual cluster and outlier detection beforehand. Thus, we answer RQ1 positively as the algorithms correctly identify variability for large models in reasonable time (even when using the cluster and outlier detection as a preprocessing step).

RQ2 – Reduce Implementation Effort. In Table 5, we present the LOC that we counted for the Manual and Generic implementations of our family mining algorithms. We separated these statistics into the parts for the Compare, Match and Merge algorithms. In the case of the Compare implementation, we distinguish between a generic part reused by all implementations and a concrete part that has to be realized as an extension for a specific language. Each row shows the LOC for the Manual or Generic family mining implementations for MATLAB/Simulink (MS) or statecharts (SC).
All other parts (i.e., metrics, decision wizards and meta-models) can be generated at least partially (cf. Section 5). Looking at the LOC for the different implementations, we can see that the Match algorithm can be reused across all implementations and the Merge algorithms are language-specific, but can be reused across the Manual and Generic implementations. Thus, the major difference lies in the Compare algorithm. Here, a generic part is reused across the Generic implementations and is extended with a language-specific concrete part to realize mining for

Approach   Language   Compare (Generic∗)   Compare (Concrete)   Match∗   Merge∗∗
Manual     MS         –                    213                  347      1271
           SC         –                    310                  347      1520
Generic    MS         324                  111                  347      1271
           SC         324                  143                  347      1520

∗ reused in all implementations   ∗∗ reused in language-specific implementations

Table 5: Lines of code (LOC) needed to adapt family mining for MS = MATLAB/Simulink and SC = statecharts using our guidelines and a Manual or a Generic implementation.

the new languages. As the generic part is provided by our framework (cf. Section 6.2) and can be used directly for a concrete implementation, we neglect this part in the adaptation effort analysis. Looking at the remaining concrete part, we identify a decrease of 47.89% and 53.87% in the LOC for the MATLAB/Simulink and statechart realizations compared to the Manual implementations. Although the Merge parts still have to be manually implemented, we argue that this is a significant reduction, especially when considering that the complexity of the implementations differs. In the case of the completely Manual implementation, the developer has to reimplement all family mining algorithms with the traversal of the models and all compare logic. In the case of the implementation of language-specific extensions for the Generic family mining, the developer only has to implement interface methods returning clearly defined sets (e.g., subsequent edges for a given list of nodes). Thus, we answer RQ2 positively as the effort for developers can be reduced when using our Generic family mining implementation in combination with all provided generation facilities to automate adaptation steps (e.g., the provided DSL).

RQ3 – Outlier Detection. We applied our clustering technique with the settings outlined in Section 4 for each scenario. By inspecting the dendrograms generated by the clustering, we identified the outliers in scenarios OD1 – OD3. Figures 9a to 9c depict the corresponding dendrograms. The interpretation of the figures is as follows: (1) a big coherent cluster of data points with high similarity, marked by a blue frame; and (2) individual data points with relatively low similarity to the main cluster, marked by red frames. An important point to discuss relates to the distinction between a feature and its implementation in the variant models.
While we designed the scenarios based on selecting/deselecting features to comprise a notion of similarity between models, we ignored their implementation, especially how big their corresponding realizations in the models are. This may lead to situations where, e.g., a common feature with a very large implementation offsets the selection of all other minor features and dominates our similarity calculation. This is partly reflected in Figure 9c, where a considerable number of differing features (6 out of 27) between V11616 and the large cluster contributes only around 10% difference. Elaborate weighting schemes, e.g., based on the importance of features or model elements, are omitted considering the scope of this work.
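The dominance effect can be illustrated with the Bray-Curtis dissimilarity itself; the framework delegates the actual computation to the R package vegan, so the Java sketch below is for illustration only:

```java
// Illustrative computation of the Bray-Curtis dissimilarity over IR-feature
// count vectors (the framework delegates this to the R package vegan). The
// example mirrors the effect discussed above: a single large shared feature
// implementation dominates the distance contributed by several small
// differing features.
public final class BrayCurtis {

    /** d(u, v) = sum(|u_i - v_i|) / sum(u_i + v_i); lies in [0, 1] for
     *  non-negative count vectors of equal length. */
    public static double distance(double[] u, double[] v) {
        double diff = 0.0;
        double total = 0.0;
        for (int i = 0; i < u.length; i++) {
            diff += Math.abs(u[i] - v[i]);
            total += u[i] + v[i];
        }
        return total == 0.0 ? 0.0 : diff / total;
    }
}
```

For instance, two variants differing in six small features (one occurrence each) that share one large feature (100 occurrences each) are only 6/206 ≈ 0.03 apart, whereas without the large shared feature the same six differences would yield the maximum distance of 1.0.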

[Figure 9: Dendrograms for the cluster detection scenario CD1 and the outlier detection scenarios OD1 to OD3: (a) dendrogram for OD1, (b) dendrogram for OD2, (c) dendrogram for OD3, (d) dendrogram for CD1. Clusters are marked with blue frames and outliers with red frames.]

RQ4 – Cluster Detection. For cluster detection, we adopted an approach similar to the outlier detection; namely, we inspected the resulting dendrogram and this time tried to find large groups of similar models rather than isolated outliers. The dendrograms for CD1 and CD2 are largely similar; thus, we only depict the former for the sake of space. Looking at the dendrogram in Figure 9d, it can be seen clearly that there are two distinct sets of data points with high similarity among each other (distances around 0.10), yet the inter-cluster distance is around 0.30; high enough to comprise separate groups. As mentioned previously, especially in the case where a much larger number of models is considered, it is arguable whether to obtain a few large clusters or many small (sub-)clusters; it is up to the domain expert to make this design decision. The results of the outlier and cluster detection scenarios confirm that our clustering technique is able to perform with sufficient accuracy with respect to the expectations of the experts building up the ground truth for this study. This also holds in the presence of the alternating features explained previously as being potentially challenging. Consequently, we answer RQ3 and RQ4 positively.
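A simplified, purely illustrative way to mimic this visual inspection programmatically is to flag every model whose nearest neighbour lies above a chosen cut height; the helper below and its threshold are assumptions for illustration, as the actual workflow relies on visual inspection of the hclust dendrograms:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of reading outliers off a model distance matrix: a model
// is flagged when even its nearest neighbour lies above a chosen cut height
// (roughly the height separating the red frames from the blue clusters in the
// dendrograms). This helper and its threshold are illustrative assumptions,
// not part of the clustering framework.
public final class OutlierFlagging {

    /** Returns the indices of all models whose nearest-neighbour distance
     *  exceeds the given cut height. */
    public static List<Integer> flagOutliers(double[][] distances, double cutHeight) {
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < distances.length; i++) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int j = 0; j < distances.length; j++) {
                if (i != j) {
                    nearest = Math.min(nearest, distances[i][j]);
                }
            }
            if (nearest > cutHeight) {
                outliers.add(i);
            }
        }
        return outliers;
    }
}
```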


RQ5 – Improvement of Results. To examine the results of the family mining algorithms with and without cluster and outlier detection, we distinguish two situations: a) cases with outliers and b) cases without outliers. In cases where outliers exist, we identified that our outlier detection improves the fine-grained variability information generated by our family mining algorithm. The main reason is that outlier variants represent models that have little or, at worst, no relationship to the remaining input models. Executing our family mining without detecting these outliers might result in unexpected variability relations in the 150% model or even elements that have no relation to the rest of the 150% model. Thus, we argue that using the outlier detection is essential in situations where users are not fully familiar with the input models and the relations between these models are unclear (e.g., after developers left the company). In case no outliers exist, we do not necessarily need to execute our cluster detection as the results generated for the complete set of input models represent valid variability information. However, executing family mining on particular clusters might allow users to focus their analysis on the corresponding models. Furthermore, detecting clusters reduces the chance of unexpected variability information (e.g., induced by variants from other clusters; cf. Section 2.2). Thus, detecting clusters prior to executing the fine-grained variability mining can improve the experience of users. However, we think that these clusters should at least be evaluated at a high level by users as, depending on their focus, it might be interesting to combine multiple clusters for a bigger picture of the overall system. For example, this could be interesting for senior developers working on multiple projects that are identified as separate clusters. Overall, we answer RQ5 positively as using the cluster and outlier detection improves the experience of users.
The generated 150% models are tailored towards their expectations as outliers are removed and the identified variability information can be focused on particular clusters.

7.4. Threats to Validity

Although we designed, implemented and evaluated our technique with great care, different threats to validity are inherently present. Our feasibility studies contain models from the automotive domain only and the used case studies are limited to MATLAB/Simulink models and IBM Rational Rhapsody statecharts. This limits the generalizability of our approach as we can only say with certainty that our algorithms and generic implementations work for these particular case studies and languages. However, we implemented our generic family mining and the outlier and cluster detection algorithms without having a particular domain in mind and prior to our evaluation. Thus, we kept ourselves from being biased and are confident that the algorithms are applicable to other block-based languages and models from other domains. In addition, we claim that our generic implementation allows developers to adapt family mining with reduced effort. The corresponding evaluation is also limited to the two selected languages and we only measured the needed effort in terms of LOC. This measure is limited to a quantitative evaluation and we did not conduct user studies to evaluate this claim. However, our generic framework allows developers to adapt family mining without reimplementing our algorithms and only requires them to define a meta-model for their language and to implement clearly defined interfaces for the Compare phase (e.g., to return the entry points to the analyzed models), the used metric and the decision wizard. While these tasks still involve manual effort, we provide different supporting approaches for the adaptation. Developers can reuse existing meta-models by generating all remaining artifacts (i.e., the needed metric and decision wizard) from annotations or by using our DSL to generate a new meta-model together with these artifacts. Thus, we argue that the adaptation effort is largely reduced by these measures compared to a complete reimplementation of our family mining algorithms. On the clustering part of this work, there are also several threats to validity. First of all, the technique itself aims to deliver a fast but approximate overview of the given data; hence, it may not be ideal if very high accuracy is required. The scenarios considered in this paper are relatively simple and further scenarios with varying size and complexity, preferably in real industrial settings, could be investigated to test our claims. Another point, already addressed above, is that the ground truth, and hence the expectation of the domain experts, is shaped with respect to the high-level features. This contrasts with the clustering technique working on the level of implementation, i.e., the model variants. Elaborate weighting schemes based on features and/or model elements could be introduced to mitigate this situation, where the domain experts associate the corresponding parts with varying importance to guide the clustering.

8. Related Work

In this section, we discuss related work for our presented clustering algorithm (cf. Section 8.1) and our variability mining (cf. Section 8.2).

8.1. Clustering Techniques

The existing approaches for model comparison typically consist of expensive pairwise techniques that focus on accurate comparison/matching of just two models [55], with EMFCompare being one of the most widely used industry standards. Only in recent work [7, 8, 10] have scalable techniques based on information retrieval and clustering (i.e., machine learning) for comparing large numbers of models been introduced. To the best of the authors’ knowledge, there is no further related work in the MDE domain with a comparable perspective, i.e., a holistic and scalable treatment of models for analysis and visualization. In the SPL domain, Zhang et al. [70] use model comparison (i.e., EMFCompare) to synthesize a product line model from variant models. Our approach is different in the sense that the mining/merging technique does not use model comparison directly; rather, clustering is meant as an individual step of data preprocessing, selection and filtering before family mining. Furthermore, model clustering is faster and more scalable than pairwise techniques [8, 10], possibly with a trade-off in accuracy. This trade-off fits our workflow perfectly, where we need a rough overview of a potentially large number of models, whereas family
mining operates by itself in a precise way in the next step. Finally, Martinez et al. apply NLP to suggest names for feature identification [35].

8.2. Variability Mining Techniques

Comparing different types of models for various purposes (e.g., variability identification or model versioning) has been extensively investigated [55] and can be categorized into clone detection, differencing and variability identification.

Clone Detection. Different clone detection algorithms exist for various software artifacts and languages, including models [31, 44, 54]. For example, graph-based algorithms allow detecting syntactic clones [15, 32, 42], semantic clones [1] and near-miss clones (i.e., clones that have minor differences) [42]. Further algorithms translate models to textual representations for text-based clone detection [2]. By clustering identified clones, Alalfi et al. identify varying parts in MATLAB/Simulink models, inferring that all remaining parts represent variability [3]. While the approach can theoretically be adapted for other languages, its applicability has only been demonstrated for MATLAB/Simulink, and no corresponding guidance exists. In contrast, our family mining framework provides a generic implementation and guidelines for adapting it to further languages.

Differencing. For a complete analysis of variability information in models, focusing on the cloned parts only is not sufficient, as the differences describe their variability. A large number of differencing algorithms exist for various languages and software artifacts in the literature (e.g., [12, 27, 28, 29, 69]) and in commercial tools (e.g., DiffPlug [17], SimDiff [20]). These algorithms often identify the commonalities of models and derive the differences from the set of unmatched elements. For example, Kehrer et al. use a signature matching algorithm to initialize candidates prior to using a similarity-based matching similar to our user-adjustable metric [28].
However, as these algorithms do not identify explicit variability information (i.e., mandatory, alternative or optional parts), they are not directly applicable to fine-grained variability mining.

Variability Identification. While clone detection and differencing only identify one side of the variability (i.e., either the common or the varying parts), variability identification combines both dimensions. In addition, classic clone and difference detection algorithms mostly compare only two models, while a complete variability analysis of a set of models requires all models to be compared. An essential part of variability identification is merging the compared models and storing the identified information in a unified representation. A large number of model merging algorithms exist for various contexts (e.g., [4, 37, 52, 60, 62]). These algorithms solely focus on merging the information of the models and neglect their variability. In contrast, additional variability-aware algorithms exist that merge the compared models and annotate the elements’ source models [23, 45, 47] or visualize the identified variability [36]. Unlike our algorithm, these lack explicit variability information (i.e., about mandatory, alternative or optional parts) and, thus, are not applicable to fine-grained variability analysis.

Nejati et al. describe an approach that, similar to ours, uses heuristics (e.g., metrics) for comparing and matching elements from statechart variants [39, 40]. However, in contrast to our approach, Nejati et al. only merge model elements with annotations about their parent models and neglect fine-grained variability information, which limits their solution to the generation of the contained variants. In contrast, our approach not only allows a transition to an SPL enabling generation of these variants [65], but also a detailed analysis of the fine-grained variability. Another variability identification approach similar to ours is that of Ryssel et al. [49]. However, unlike our approach, they do not focus on storing the identified variability in a unified representation but on extracting reusable library elements [51]. Work by Font et al. focuses on identifying variability information of models [22] and on incorporating the developers’ domain knowledge to identify larger reusable model parts, similar to Ryssel et al. [21]. While these approaches store the identified artifacts in an SPL using the Common Variability Language (CVL), we focus on generating 150% models that can be translated to different SPL representations (e.g., delta-oriented SPLs [65]). Klatt et al. focus in their work on identifying variability between related source code artifacts [30]. While, similar to us, they operate on a graph-based representation, their algorithm relies on abstract syntax trees (ASTs) of the analyzed code and, thus, is limited to the underlying data structure. In contrast, our presented model-based approach is applicable to different block-based languages, and we also showed that similar algorithms can be applied to source code [68]. Most mining techniques in the literature use so-called pairwise approaches. In contrast, n-way algorithms are capable of merging a potentially arbitrary number of model variants at the same time.
In the literature, such algorithms exist to merge different artifacts, such as UML models into 150% models [47] and model-transformation rules into variability-based rules [56]. While we showed in this paper that our family mining approach can easily be adapted for different languages with varying paradigms (i.e., MATLAB/Simulink and statecharts) using our provided guidelines and framework, these approaches were evaluated for single languages and currently lack such capabilities. While all these approaches concentrate on the concrete realization artifacts, other techniques exist to extract high-level configuration options in the form of feature models [46] or CVL models. Examples are approaches that analyze natural-language requirements [61], product maps [50, 53], or existing products [70]. During an SPL generation from the identified 150% models (e.g., [65]), such information could provide a configuration model for the generated SPL.

9. Conclusion and Future Work

In this paper, we explained in detail how fine-grained variability information can be identified for large sets of models in different block-based languages using our family mining approach. For this purpose, we presented a set of guidelines that can be used to easily adapt our algorithms for new block-based languages. A major improvement over our previous work is the introduction of a generic
implementation that allows an even easier adaptation of the algorithms by providing clearly defined interfaces to developers, reducing their implementation effort. In addition, we demonstrated and discussed how our language-agnostic cluster and outlier detection can improve the variability information generated by our family mining. Using the presented extension, it is now possible to remove outliers (e.g., completely unrelated variants) from a set of input models and to cluster them into more meaningful sets (e.g., relevant for particular users).

In future work, we plan to implement a direct link between our family mining framework and our clustering framework. Currently, these two frameworks are realized independently, and after executing the cluster and outlier detection the user has to manually interpret the dendrograms. Using this information, outliers have to be manually removed and sensible clusters have to be selected prior to the detailed variability analysis using the family mining. By applying additional distance measures and algorithms to cut the dendrograms at sensible positions, we plan to (semi-)automate this step by at least automatically providing suitable suggestions. In addition, we plan to evaluate clustering techniques that can help to automatically select a base model for the comparisons using other (possibly more accurate) heuristics than selecting the smallest model.

Acknowledgments

This work was partially supported by the European Commission within the project HyVar under grant agreement H2020-644298.

References

[1] B. Al-Batran, B. Schätz, B. Hummel, Semantic Clone Detection for Model-Based Development of Embedded Systems, in: Intl. Conf. on Model Driven Engineering Languages and Systems (MODELS), vol. 6981 of LNCS, Springer, 2011, pp. 258–272.

[2] M. Alalfi, J. Cordy, T. Dean, M. Stephan, A. Stevenson, Models are code too: Near-miss clone detection for Simulink models, in: Intl. Conf. on Software Maintenance (ICSM), IEEE, 2012, pp. 295–304.
[3] M. Alalfi, E. Rapos, A. Stevenson, M. Stephan, T. Dean, J. Cordy, Semi-automatic Identification and Representation of Subsystem Variability in Simulink Models, in: Intl. Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2014, pp. 486–490.

[4] M. Alanen, I. Porres, Difference and Union of Models, in: «UML» 2003 - The Unified Modeling Language. Modeling Languages and Applications, vol. 2863 of LNCS, Springer, 2003, pp. 2–17.

[5] K. Altmanninger, M. Seidl, M. Wimmer, A Survey on Model Versioning Approaches, Intl. Journal of Web Inf. Systems 5 (3) (2009) 271–304.


[6] Ö. Babur, Statistical Analysis of Large Sets of Models, in: Intl. Conf. on Automated Software Engineering (ASE), ACM, 2016, pp. 888–891.

[7] Ö. Babur, L. Cleophas, Using n-grams for the Automated Clustering of Structural Models, in: Intl. Conf. on Current Trends in Theory and Practice of Computer Science (SOFSEM), Springer, 2017, pp. 510–524.

[8] Ö. Babur, L. Cleophas, M. van den Brand, Hierarchical Clustering of Metamodels for Comparative Analysis and Visualization, in: European Conf. on Modeling Foundations and Applications (ECMFA), Springer, 2016, pp. 3–18.

[9] Ö. Babur, L. Cleophas, T. Verhoeff, M. van den Brand, Towards Statistical Comparison and Analysis of Models, in: Intl. Conf. on Model-Driven Engineering and Software Development (MODELSWARD), 2016, pp. 361–367.

[10] F. Basciani, J. Di Rocco, D. Di Ruscio, L. Iovino, A. Pierantonio, Automated Clustering of Metamodel Repositories, in: Intl. Conf. on Advanced Information Systems Engineering (CAiSE), Springer, 2016, pp. 342–358.

[11] G. Brunet, M. Chechik, S. Easterbrook, S. Nejati, N. Niu, M. Sabetzadeh, A manifesto for model merging, in: Intl. Workshop on Global Integrated Model Management (GaMMa), ACM, 2006, pp. 5–12.

[12] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom, Change Detection in Hierarchically Structured Information, in: Intl. Conf. on Management of Data (MOD), ACM, 1996, pp. 493–504.

[13] P. C. Clements, L. M. Northrop, Software Product Lines: Practices and Patterns, Addison-Wesley, 2001.

[14] K. Czarnecki, U. W. Eisenecker, Generative Programming: Methods, Tools, and Applications, Addison-Wesley, 2000.

[15] F. Deissenboeck, B. Hummel, E. Jürgens, B. Schätz, S. Wagner, J.-F. Girard, S. Teuchert, Clone Detection in Automotive Model-based Development, in: Intl. Conf. on Software Engineering (ICSE), ACM, 2008, pp. 603–612.

[16] M. M. Deza, E. Deza, Encyclopedia of Distances, Springer, 2009.

[17] DiffPlug Simulink, https://www.diffplug.com/features/simulink.

[18] Y. Dubinsky, J. Rubin, T. Berger, S. Duszynski, M. Becker, K. Czarnecki, An Exploratory Study of Cloning in Industrial Software Product Lines, in: European Conf. on Software Maintenance and Reengineering (CSMR), IEEE, 2013, pp. 25–34.

[19] Eclipse Modeling Framework, http://www.eclipse.org/modeling/emf/.

[20] EnSoft SimDiff, http://www.ensoftcorp.com/simdiff/.

[21] J. Font, L. Arcega, Ø. Haugen, C. Cetina, Building Software Product Lines from Conceptualized Model Patterns, in: Intl. Software Product Line Conf. (SPLC), ACM, 2015, pp. 46–55.

[22] J. Font, M. Ballarín, Ø. Haugen, C. Cetina, Automating the Variability Formalization of a Model Family by Means of Common Variability Language, in: Intl. Software Product Line Conf. (SPLC), ACM, 2015, pp. 411–418.

[23] H. Frank, J. Eder, Towards an Automatic Integration of Statecharts, in: Intl. Conf. on Conceptual Modeling (ER), vol. 1728 of LNCS, Springer, 1999, pp. 430–445.

[24] S. Holthusen, D. Wille, C. Legat, S. Beddig, I. Schaefer, B. Vogel-Heuser, Family Model Mining for Function Block Diagrams in Automation Software, in: Intl. Workshop on Reverse Variability Engineering (REVE), ACM, 2014, pp. 36–43.

[25] IBM Rational Rhapsody, http://www.ibm.com/software/awdtools/rhapsody/.

[26] C. Kapser, M. W. Godfrey, "Cloning Considered Harmful" Considered Harmful, in: Working Conf. on Reverse Engineering (WCRE), IEEE, 2006, pp. 19–28.

[27] T. Kehrer, U. Kelter, M. Ohrndorf, T. Sollbach, Understanding Model Evolution through Semantically Lifting Model Differences with SiLift, in: Intl. Conf. on Software Maintenance (ICSM), IEEE, 2012, pp. 638–641.

[28] T. Kehrer, U. Kelter, P. Pietsch, M. Schmidt, Adaptability of Model Comparison Tools, in: Intl. Conf. on Automated Software Engineering (ASE), ACM, 2012, pp. 306–309.

[29] U. Kelter, J. Wehren, J. Niere, A Generic Difference Algorithm for UML Models, in: Software Engineering, vol. 64 of LNI, Gesellschaft für Informatik e.V. (GI), 2005, pp. 105–116.

[30] B. Klatt, M. Küster, K. Krogmann, A Graph-Based Analysis Concept to Derive a Variation Point Design from Product Copies, in: Intl. Workshop on Reverse Variability Engineering (REVE), ACM, 2013, pp. 1–8.

[31] R. Koschke, Survey of Research on Software Clones, in: Duplication, Redundancy, and Similarity in Software, No. 06301 in Dagstuhl Seminar Proceedings, Schloss Dagstuhl, Germany, 2007.

[32] Z. Liang, Y. Cheng, J. Chen, A Novel Optimized Path-Based Algorithm for Model Clone Detection, Journal of Software 9 (7) (2014) 1810–1817.

[33] S. Lity, R. Lachmann, M. Lochau, I. Schaefer, Delta-oriented Software Product Line Test Models – The Body Comfort System Case Study, Tech. Rep. 2012-07, Technische Universität Braunschweig, Germany (2012).

[34] C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

[35] J. Martinez, T. Ziadi, T. F. Bissyandé, J. Klein, Y. Le Traon, Name Suggestions during Feature Identification: The VariClouds Approach, in: Intl. Software Product Line Conf. (SPLC), ACM, 2016, pp. 119–123.

[36] J. Martinez, T. Ziadi, J. Klein, Y. Le Traon, Identifying and Visualising Commonality and Variability in Model Variants, in: European Conf. on Modeling Foundations and Applications (ECMFA), vol. 8569 of LNCS, Springer, 2014, pp. 117–131.

[37] A. Mehra, J. Grundy, J. Hosking, A Generic Approach to Supporting Diagram Differencing and Merging for Collaborative Design, in: Intl. Conf. on Automated Software Engineering (ASE), ACM, 2005, pp. 204–213.

[38] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching, in: Intl. Conf. on Data Engineering (ICDE), IEEE, 2002, pp. 117–128.

[39] S. Nejati, M. Sabetzadeh, M. Chechik, S. Easterbrook, P. Zave, Matching and merging of statecharts specifications, in: Intl. Conf. on Software Engineering (ICSE), IEEE, 2007, pp. 54–64.

[40] S. Nejati, M. Sabetzadeh, M. Chechik, S. Easterbrook, P. Zave, Matching and Merging of Variant Feature Specifications, IEEE Transactions on Software Engineering (TSE) 38 (6) (2012) 1355–1375.

[41] S. Oster, M. Zink, M. Lochau, M. Grechanik, Pairwise Feature-interaction Testing for SPLs: Potentials and Limitations, in: Intl. Software Product Line Conf. (SPLC), ACM, 2011, pp. 6:1–6:8.

[42] N. H. Pham, H. A. Nguyen, T. T. Nguyen, J. M. Al-Kofahi, T. N. Nguyen, Complete and Accurate Clone Detection in Graph-based Models, in: Intl. Conf. on Software Engineering (ICSE), IEEE, 2009, pp. 276–286.

[43] K. Pohl, G. Böckle, F. J. van der Linden, Software Product Line Engineering: Foundations, Principles and Techniques, Springer, 2005.

[44] C. K. Roy, J. R. Cordy, A Survey on Software Clone Detection Research, Tech. Rep. 541, School of Computing, Queen’s University, Kingston, Ontario, Canada (2007).

[45] J. Rubin, M. Chechik, Combining Related Products into Product Lines, in: Intl. Conf. on Fundamental Approaches to Software Engineering (FASE), vol. 7212 of LNCS, Springer, 2012, pp. 285–300.

[46] J. Rubin, M. Chechik, Domain Engineering: Product Lines, Languages, and Conceptual Models, chap. A Survey of Feature Location Techniques, Springer, 2013, pp. 29–58.

[47] J. Rubin, M. Chechik, N-way Model Merging, in: European Software Engineering Conf./Foundations of Software Engineering (ESEC/FSE), ACM, 2013, pp. 301–311.

[48] J. Rubin, M. Chechik, Quality of Merge-Refactorings for Product Lines, in: Intl. Conf. on Fundamental Approaches to Software Engineering (FASE), vol. 7793 of LNCS, Springer, 2013, pp. 83–98.

[49] U. Ryssel, J. Ploennigs, K. Kabitzsch, Automatic Variation-point Identification in Function-block-based Models, in: Intl. Conf. on Generative Programming and Component Engineering (GPCE), ACM, 2010, pp. 23–32.

[50] U. Ryssel, J. Ploennigs, K. Kabitzsch, Extraction of Feature Models from Formal Contexts, in: Intl. Software Product Line Conf. (SPLC), ACM, 2011, pp. 4:1–4:8.

[51] U. Ryssel, J. Ploennigs, K. Kabitzsch, Automatic library migration for the generation of hardware-in-the-loop models, Science of Computer Programming 77 (2) (2012) 83–95.

[52] M. Sabetzadeh, S. Easterbrook, Analysis of Inconsistency in Graph-Based Viewpoints: A Category-Theoretic Approach, in: Intl. Conf. on Automated Software Engineering (ASE), IEEE, 2003, pp. 12–21.

[53] S. She, R. Lotufo, T. Berger, A. Wasowski, K. Czarnecki, Reverse Engineering Feature Models, in: Intl. Conf. on Software Engineering (ICSE), IEEE, 2011, pp. 461–470.

[54] M. Stephan, J. R. Cordy, A Survey of Methods and Applications of Model Comparison, Tech. Rep. 582, School of Computing, Queen’s University, Kingston, Ontario, Canada (2011).

[55] M. Stephan, J. R. Cordy, A Survey of Model Comparison Approaches and Applications, in: Intl. Conf. on Model-Driven Engineering and Software Development (MODELSWARD), 2013, pp. 265–277.

[56] D. Strüber, J. Rubin, T. Arendt, M. Chechik, G. Taentzer, J. Plöger, RuleMerger: Automatic Construction of Variability-Based Model Transformation Rules, in: Intl. Conf. on Fundamental Approaches to Software Engineering (FASE), Springer, 2016, pp. 122–140.

[57] The Mathworks MATLAB/Simulink, http://www.mathworks.com/products/simulink/.

[58] The Object Management Group, http://www.omg.org/mof/.

[59] TU München, SPES_XT, http://spes2020.informatik.tu-muenchen.de/spes_xt-home.html.


[60] S. Uchitel, M. Chechik, Merging Partial Behavioural Models, in: Intl. Symposium on the Foundations of Software Engineering (FSE), ACM, 2004, pp. 43–52.

[61] N. Weston, R. Chitchyan, A. Rashid, A Framework for Constructing Semantically Composable Feature Models from Natural Language Requirements, in: Intl. Software Product Line Conf. (SPLC), ACM, 2009, pp. 211–220.

[62] J. Whittle, J. Schumann, Generating Statechart Designs From Scenarios, in: Intl. Conf. on Software Engineering (ICSE), ACM, 2000, pp. 314–323.

[63] D. Wille, Managing Lots of Models: The FaMine Approach, in: Intl. Symposium on the Foundations of Software Engineering (FSE), ACM, 2014, pp. 817–819.

[64] D. Wille, S. Holthusen, S. Schulze, I. Schaefer, Interface Variability in Family Model Mining, in: Intl. Workshop on Model-Driven Approaches in Software Product Line Engineering (MAPLE), ACM, 2013, pp. 44–51.

[65] D. Wille, T. Runge, C. Seidl, S. Schulze, Extractive Software Product Line Engineering Using Model-based Delta Module Generation, in: Intl. Workshop on Variability Modeling in Software-intensive Systems (VaMoS), ACM, 2017, pp. 36–43.

[66] D. Wille, S. Schulze, I. Schaefer, Variability Mining of State Charts, in: Intl. Workshop on Feature-Oriented Software Development (FOSD), ACM, 2016, pp. 63–73.

[67] D. Wille, S. Schulze, C. Seidl, I. Schaefer, Custom-Tailored Variability Mining for Block-Based Languages, in: Intl. Conf. on Software Analysis, Evolution, and Reengineering (SANER), vol. 1, IEEE, 2016, pp. 271–282.

[68] D. Wille, M. Tiede, S. Schulze, C. Seidl, I. Schaefer, Identifying Variability in Object-Oriented Code Using Model-Based Code Mining, in: Intl. Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISoLA), vol. 9953 of LNCS, Springer, 2016, pp. 547–562.

[69] Z. Xing, E. Stroulia, UMLDiff: An Algorithm for Object-oriented Design Differencing, in: Intl. Conf. on Automated Software Engineering (ASE), ACM, 2005, pp. 54–65.

[70] X. Zhang, Ø. Haugen, B. Møller-Pedersen, Model Comparison to Synthesize a Model-Driven Software Product Line, in: Intl. Software Product Line Conf. (SPLC), IEEE, 2011, pp. 90–99.
