Test coverage criteria for software product line testing: Systematic literature review

Jihyun Lee†, Sungwon Kang*, Pilsu Jung*

†Department of Software Engineering, Jeonbuk National University, 567 Baekje-daero, Deokjin-gu, Jeonju-si, Korea
[email protected]
*School of Computing, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea
{sungwon.kang, psjung}@kaist.ac.kr

Abstract

Context: In software product line testing (SPLT), the test coverage criterion is an important concept, as it provides a means of measuring the extent to which domain testing has been performed and of avoiding redundant application testing based on the test coverage level achieved in domain testing. However, no previous literature review on SPLT has addressed test coverage criteria in SPLT.

Objective: The objectives of this paper are as follows: (1) to clarify the notions of test basis and test coverage criterion for SPLT; (2) to identify the test coverage criteria currently used for SPLT; (3) to investigate how various SPLT aspects, such as the SPLT method, variability implementation mechanism, and variability management approach, affect the choice of test coverage criterion for SPLT; and (4) to analyze the limitations of the test coverage criteria currently used for SPLT.

Method: This paper conducts a systematic review of test coverage criteria in SPLT with 78 selected studies.

Results: We have several findings that can guide future research on SPLT. One important finding is that the choice of test coverage criterion in SPLT is independent of the variability implementation mechanism, variability management, SPL approach, and binding time, but depends on the variability representation used in development artifacts. Another finding, which is easily overlooked, is that SPL test coverage criteria bearing the same names as test coverage criteria of single system testing neither adequately convey what should be covered by the test methods applying them, nor can they generally be regarded as extensions or generalizations for SPLT of the corresponding single system test coverage criteria.

Conclusion: This study showed that SPL test coverage criteria should be defined or redefined so that they clearly convey the target properties to be satisfied by SPLT.

1. Introduction

Test coverage level is the degree to which specified coverage items have been exercised by a test suite [ISO 2013]. For effective testing, the notion of test coverage criteria is important because it provides a means of measuring the extent to which a set of test cases exercises a program [Jorgensen 14].

Software product line engineering (SPLE) consists of two distinct processes: the domain engineering process and the application engineering process. The domain testing stage of the domain engineering process performs testing for common parts and produces reusable test artifacts, called domain test assets, such as test plans, test cases, test data, and test scenarios. The testing stage of the application engineering process has to achieve efficient reuse of test assets while testing product-specific parts and performing regression testing for the parts of the product tested during the domain testing stage. Software product line testing (SPLT) is conducted in these two distinct phases, handling from as few as dozens to as many as millions of test items for product lines. Thus, selecting SPL test coverage criteria that reflect the two distinct processes for the target software product line is crucial to achieving the respective goals of the two processes and to prioritizing and/or reducing the number of test cases for the numerous test items of a product line.

In addition to the inherent importance of test coverage criteria to testing in general and to SPLT in particular, a systematic study of test coverage criteria in SPLT can shed light on the differences between test coverage criteria in single system testing and those in SPLT, which result from the fundamental differences between single system testing and SPLT. Since a test coverage criterion is a measure of the extent to which domain testing has been performed, it allows an estimate of the effort to be put into application testing. Because test cases and the implementation under test should deal with variability when they are designed and executed, test strategies for SPLT should differ from those used for single system testing [Pohl 05], leading to different test design techniques. This implies that test coverage criteria in SPLT can also be quite different from those used in single system testing.

Several literature reviews on SPLT have been conducted to date [Lamancha 09, Engstrom 11, Lee 12, Machado 14, Neto 11]. Among these, the literature reviews conducted by [Engstrom 11] and [Neto 11] systematically reviewed the state of the art and practices in terms of test levels, test types, test process, variability management in SPLT, and SPLT strategies. Those of [Lamancha 09] and [Lee 12], which were published as conference papers, address similar topics. [Machado 14] focuses on the selection of products to test and the testing of the products produced. These researchers have thoroughly reviewed the current state of the art and best practices in test strategies for the testing of all feature interactions, the specification of variability in test assets, the reuse of the test assets, and the use of bindings in testing.

From the standpoint of test coverage criteria, however, the existing reviews of SPLT have three limitations. In the first place, they focus on SPL test strategies, the SPL test process, and variability management, all of which are important topics since they are key factors that differentiate SPLT from single system testing. But the reviews have not provided answers as to whether the tests performed using these methods met pre-determined adequacy criteria. Secondly, they do not address aspects of SPL test coverage measurement that differ from those of single system testing. Specifically, they described test coverage criteria at the feature and framework levels, which are very different from test coverage in testing for a single system. Thirdly, they do not analyze SPLT methods in terms of the resulting test coverage levels. Knowing how SPLT methods define test coverage criteria and how they derive test cases is important because it is closely related to achieving the test adequacy that a test is aiming for. In addition, as McGregor points out in [McGregor 01, McGregor 10], test coverage criteria for SPLT have not been precisely defined based on the notions of core assets, variation points, or variants, which are essential for reusability in SPLE.

In spite of these limitations, these reviews have provided results for test strategies, test processes, and variability management, all of which can affect SPLT research and test coverage criteria. They can therefore provide a starting point for a further review of the state of the art and best practices of SPLT with regard to test coverage criteria. Accordingly, our systematic review has the following goals: (1) to clarify the notions of SPL test basis, which is an artifact or a model used as a basis for deriving test cases, and test coverage criteria, in contrast to those of single system testing, as well as to provide a classification of test bases for SPLT; (2) to identify the test coverage criteria currently used for SPLT; (3) to investigate how certain aspects of SPL, such as the testing method, variability implementation mechanism, and variability management approach, affect the choice of test coverage criteria for SPLT; and (4) to analyze the limitations of the test coverage criteria currently used for SPLT. For the purposes of this review, we use test basis as the central concept for classifying test coverage criteria in SPLT, for the first time in systematic reviews of SPLT.

In this paper, we synthesize the findings of our study through the following procedure:
- Analyze test basis and test coverage criteria in single system testing;
- Define test basis in SPLT;
- Define research questions;
- Select studies;
- Derive answers to the research questions;
- Summarize findings and discuss their implications for the test coverage criteria currently used in SPLT.

The remainder of the paper is structured as follows: Section 2 discusses test bases and test coverage criteria used in single system testing and in SPLT as background; Section 3 describes the review methods used in this paper; Section 4 presents our findings and their implications; Section 5 discusses threats to the validity of this review; finally, in Section 6 we offer concluding remarks and suggest directions for future work.

2. Background

This section describes the background knowledge necessary to discuss test coverage criteria in SPLT. Since the notion of test basis in SPLT has not been used in any previous study, in this paper we categorize the test bases of SPLT based on the test bases and test coverage criteria used in single system testing, together with the test coverage criteria of SPLT collected through a preliminary analysis of existing studies. In Section 2.1, then, we examine the test bases and test coverage criteria in single system testing. Section 2.2 discusses how SPL test coverage criteria differ from test coverage criteria in single system testing through a review of SPL-specific testing issues in the existing SPLT literature. Building on Sections 2.1 and 2.2, Section 2.3 defines the notion of test basis in SPLT as a reference point for differentiating between test coverage criteria in single system testing and test coverage criteria in SPLT. The definition of test basis in Section 2.3 is used as a foundation to identify and analyze the test coverage criteria of SPLT in this paper.

2.1 Test bases and test coverage criteria for single system testing

In single system testing, test coverage has been used as a measure of thoroughness, as well as a criterion for completion, of the test. For the discussion in this paper, we use the following four basic terms:
- Test basis is an artifact or a model (e.g., requirements specification, architectural design, or detailed design for unit) that is used as a basis for deriving test requirements and test cases.
- Test requirement is a specific element of a software artifact that a test case must test or cover. Test requirements are described with respect to a variety of test bases.
- Test coverage criterion is a rule or a collection of rules that imposes test requirements on a test set.
- Test coverage level (of a test set T with respect to a set of test requirements TR) is the ratio of the number of test requirements in TR that are satisfied by T to the size of TR.

The ISO/IEC/IEEE 29119 Software Testing Standard defines test basis as a "body of knowledge from which the requirements for a component or system can be inferred" [ISO 2013]. This paper restricts the use of the term to dynamic testing, with "system requirements", "architectural design", "detailed design for unit", and "structure of source code" as examples of a test basis. The ISO/IEC/IEEE 29119 Standard states in detail that "which test case design technique to use depends on the nature of the test basis and test requirements and test cases are derived from the test basis with the test case design technique" [ISO 2013]. The definition of test basis used in this paper was redefined based on this definition of [ISO 2013]. These definitions indicate that the nature of the test basis affects the test coverage criterion. For example, if the test basis is "structure of source code" and the test goal is "all decisions of a program", then each decision leads to two test requirements: one for the decision to evaluate to false, the other for the decision to evaluate to true [Amman 16]. "Branch coverage" yields the aforementioned test requirements [Myers 11, Amman 16]. Table 1 shows the various types of test coverage criteria used in single system testing, classified according to the kinds of test case design approaches used in each case. We follow the definitions of test requirements, coverage criterion, and coverage level in [Amman 16].
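Restating the coverage-level definition above as a formula (with T the test set and TR the set of test requirements, as in the list of terms):

\[
\text{coverage level}(T, TR) = \frac{\lvert \{\, tr \in TR \mid tr \text{ is satisfied by } T \,\} \rvert}{\lvert TR \rvert}
\]

A coverage criterion is then met by T exactly when this ratio equals 1 for the test requirements that the criterion imposes.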

Table 1. Test coverage criteria used for single system testing

Test basis classification | Examples of test basis | Examples of test coverage criteria
Specification-based test case design | Functional/non-functional requirements [ISO 2013] | equivalence partitioning coverage, boundary value coverage, cause-effect coverage
Design-based test case design | Models of a system under test [Utting 12] | coverage of all disjuncts in the post-condition, node coverage, transition coverage, mutation coverage of the mutated model
Program-based test case design | Structure of source code [ISO 2013, Jorgensen 14] | statement coverage, decision (branch) coverage, condition coverage, decision/condition coverage, multiple-condition coverage
Program-based test case design | Control flow graph [Amman 16] | node coverage, edge coverage, edge-pair coverage, prime path coverage, simple round trip coverage, complete round trip coverage, complete path coverage
Program-based test case design | Data flow graph [Rapps 85] | all paths, all DU paths, all uses, all C-uses, all P-uses, all Defs, all edges, all nodes

As described in the first and second columns of Table 1, we can use as a test basis the specifications of functional or non-functional requirements, models of a system under test [Utting 12] in the case of the design-based test design technique, and the structure of source code [ISO 2013, Jorgensen 14] or directed graphs [Rapps 85, Amman 16] in the case of the program-based test design technique. The third column of Table 1 lists examples of possible test coverage criteria by test basis. If the test requirements are selected as "all decisions of a program", then the test case set T can be selected in accordance with the branch coverage criterion in the third column of Table 1. If T meets all of the test requirements, T can be said to meet the test coverage criterion [Amman 16]. If the selected test case set T executes all the conditions of the program at least once, the test coverage level for the given test coverage criterion "all conditions of a program" is 100% [Amman 16]. In the case of model-based testing, the test case design technique is determined in accordance with the modeling notation used in the test basis, and several different test requirements and test coverage criteria are possible. For example, if the test basis is described as a transition-based model involving nodes and edges ([Utting 12] classifies this model as a structural model), a graph-based test design technique can be used. For this technique, nodes or edges are the basis of the test requirements, and node coverage or edge coverage can be selected as the test coverage criterion. For details of test coverage criteria, see [Utting 12].
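The following minimal sketch makes the decision example concrete; the program and test inputs are hypothetical and not taken from the cited references, but the computation mirrors the definitions above: each decision yields a true and a false test requirement, and the coverage level is the fraction of requirements exercised by the test set.

# Hypothetical program under test with two decisions (D1, D2).
def classify(x, y):
    labels = []
    if x > 0:           # decision D1
        labels.append("x-positive")
    if y % 2 == 0:      # decision D2
        labels.append("y-even")
    return labels

# "All decisions of a program": two test requirements per decision.
test_requirements = {("D1", True), ("D1", False), ("D2", True), ("D2", False)}

def requirements_hit(x, y):
    # Outcomes of each decision for one test input (in practice observed by
    # instrumenting classify; mirrored directly here for brevity).
    return {("D1", x > 0), ("D2", y % 2 == 0)}

test_set = [(3, 4), (3, 5)]   # T
covered = set().union(*(requirements_hit(x, y) for x, y in test_set))

coverage_level = len(covered & test_requirements) / len(test_requirements)
print(f"decision (branch) coverage level: {coverage_level:.0%}")  # 75%: ("D1", False) is never exercised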

2.2 Test coverage criteria for SPLT

According to [Machado 12], two issues have attracted the most attention in the SPLT research community: (1) How do the studies address the selection of products to test in a software product line? and (2) How do the studies deal with testing of end-product functions in a software product line?

The first issue is to construct samples to drive systematic testing of software system configurations, which must treat constraints between specific configuration parameters that render certain combinations of parameter values invalid. This issue has been considered not only because testing all possible potential products of an SPL is impractical, but also because a small subset of them can be effectively used to find most of the faults in the SPL. [Cohen 06] applied the combinatorial interaction testing (CIT) method used in single system testing to select an optimal product set amid the multiplicity of possible products that combinations of features and constraints can create. The second issue is testing of the functions of actual products. The first issue focuses on the selection of products for testing, which does not address the testing of actual products. The second issue does not specify how SPLT differs from testing for a single system.

[Pohl 05] presents four fundamental test strategies that take into account the variability of SPLT and the distinction between domain engineering and application engineering: the brute force strategy, the pure application strategy, the sample application strategy, and the commonality and reuse strategy. The brute force strategy tests all possible products at all test levels during domain testing. In the pure application strategy, on the other hand, testing is performed only in application engineering and does not generate reusable test artifacts. The sample application strategy uses one or several sample products to test domain artifacts. The commonality and reuse strategy tests common parts in domain testing and prepares reusable test artifacts for variable parts. In this strategy, application testing reuses the predefined and variability-included domain test artifacts for testing of a specific application.

As with single system testing, the test basis in SPLT affects the test case design technique and the test coverage criterion. However, in SPLT, the test basis is selected from possible test bases after the SPLT issue and test strategies, along with the corresponding SPLT methods, have been determined. Table 2 shows an SPLT classification framework for test coverage criteria of SPLT as defined in this context. The first column is based on the SPL test issue classification of [Machado 12]. The second issue of [Machado 12] was re-categorized into the four fundamental test strategies of [Pohl 05], as in column 2. The third through fifth columns of Table 2 are the results of classifying several well-known test methods according to this classification framework.
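To make the feature combination coverage criterion used with CIT concrete, the sketch below (hypothetical feature names and products, not the sampling algorithm of [Cohen 06]) measures 2-wise (pairwise) coverage of a set of selected products over a small feature model with one constraint; Table 2 then places CIT and the other representative methods in the classification framework.

from itertools import combinations, product

# Hypothetical feature model: three optional features and one constraint (Gps requires Camera).
features = ["Camera", "Gps", "Bluetooth"]

def is_valid(config):
    return not (config["Gps"] and not config["Camera"])

# Test requirements: all pairs (feature_i = value_i, feature_j = value_j)
# that occur in at least one valid product of the product line.
all_configs = [dict(zip(features, vals)) for vals in product([True, False], repeat=len(features))]
valid_configs = [c for c in all_configs if is_valid(c)]

def pairs(config):
    return {((f1, config[f1]), (f2, config[f2])) for f1, f2 in combinations(features, 2)}

test_requirements = set().union(*(pairs(c) for c in valid_configs))

# Products selected for testing (e.g., by a CIT sampling tool).
selected_products = [
    {"Camera": True,  "Gps": True,  "Bluetooth": True},
    {"Camera": False, "Gps": False, "Bluetooth": False},
    {"Camera": True,  "Gps": False, "Bluetooth": False},
]
covered = set().union(*(pairs(c) for c in selected_products))

pairwise_coverage = len(covered & test_requirements) / len(test_requirements)
print(f"pairwise feature-combination coverage: {pairwise_coverage:.0%}")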

Table 2. Test basis and test coverage criterion examples classified by SPLT issues and strategies

Two issues of SPLT [Machado 12] | Types of SPL testing strategy [Pohl 05] | Representative test methods | Test bases | Test coverage criterion
Selection of products-to-test | - | CIT [Cohen 06] | feature model | feature combination coverage criterion
Testing of products of a product line | Brute force strategy | Variability-aware execution for testing [Nguyen 14, Meinicke 16] | source code | statement coverage criterion of all possible products
Testing of products of a product line | Pure application strategy | - | - | -
Testing of products of a product line | Sample application strategy | Delta-based testing [Lochau 14, Lity 18] | delta model (in the form of specifications or models) | delta coverage criterion (in the form of a specification- or model-based test coverage criterion)
Testing of products of a product line | Commonality and reuse strategy | ScenTED [Reuys 10] | adapted activity diagram; data-annotated activity diagram | branch coverage criterion; data flow coverage criterion

There are currently no known test methods for the pure application strategy of the second issue. One method that exploits variability-aware execution [Nguyen 14, Meinicke 16] can be regarded as an example of the brute force strategy because it executes all test cases on all possible products, called configurations. This method is the only instance of the brute force strategy. A method exploiting variability-aware execution runs a test case on code containing variability without configuring a possible product, and its effect is equivalent to that of the brute force strategy. If we added the brute force strategy and its related types of test basis, then most entries of the classification results would have the value "-"; the more detailed our analysis of the studies, the more conspicuous this phenomenon would become, which would not be useful for finding how various SPL aspects affect the choice of test coverage criteria. Therefore, this method has not been included in our systematic review.

The delta-based testing method [Lochau 14, Lity 18] tests domain assets and product-specific parts while testing one sample product. This method uses a delta model, which describes deltas between a product that has been tested and a product to be tested. In this case, the test basis is a delta model, the test requirements are deltas, and the test coverage criterion can be delta coverage, in the sense of deltas covered through testing. Because the delta model is derived from the specification or model that describes a product, the test basis has the same nature as specifications or models, so the test coverage criterion name is the same as that used for single systems.

One example of the commonality and reuse strategy is the SCEnario based Test Case Derivation (ScenTED) approach [Reuys 10]. ScenTED uses the decision point of the activity diagram as the variation point and the possible variable values of the decision as alternative flows of the activity diagram. The test basis associated with this approach is the activity diagram, so it is natural that ScenTED chooses branch coverage, which covers the commonality and variability of a system at least once as long as all branches are covered. Although the names are the same as those used in single system testing, these test coverage criteria have different meanings in SPLT. For example, in single system testing, branch coverage involves the flow of control; in the ScenTED approach, however, branch coverage involves selecting values for variation points as well as the flow of control.
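A small sketch illustrates this difference (the activity-diagram edges and test paths are hypothetical, not taken from [Reuys 10]): when some branches of the domain activity diagram are variants attached to a variation point, covering all branches also covers every variant at least once.

# Hypothetical domain-level activity diagram: edges are branches, and some branches
# are variants attached to a variation point (VP). Branch coverage over this diagram
# therefore also measures how many variants have been exercised.
edges = {
    ("Start", "Login"):        None,           # common branch
    ("Login", "PayByCard"):    "VP_Payment",   # variant branch
    ("Login", "PayByVoucher"): "VP_Payment",   # variant branch
    ("PayByCard", "End"):      None,
    ("PayByVoucher", "End"):   None,
}

# Paths exercised by a (hypothetical) domain test suite, one path per test case.
executed_paths = [
    ["Start", "Login", "PayByCard", "End"],
    ["Start", "Login", "PayByVoucher", "End"],
]
covered = {edge for path in executed_paths for edge in zip(path, path[1:])}

branch_coverage = len(covered & set(edges)) / len(edges)
variant_edges = {e for e, vp in edges.items() if vp is not None}
variant_coverage = len(covered & variant_edges) / len(variant_edges)

print(f"branch coverage:  {branch_coverage:.0%}")   # 100%
print(f"variant coverage: {variant_coverage:.0%}")  # 100%: both variants of VP_Payment exercised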

2.3 Test bases for SPLT

Test bases for SPLT can be classified as in Table 3, which is obtained by integrating Tables 1 and 2 of Sections 2.1 and 2.2. That is, the first and second columns of Table 3 are the test basis group and its classification obtained by integrating the results of Sections 2.1 and 2.2, while the last column lists examples of SPL test bases. The ID in the third column of Table 3 is the name given to each test basis by the test basis classification (Test Basis ID, TB-ID). In this paper, this ID is used for classifying the status of test coverage criteria of SPLT. In Table 3, "Sel-Pro" refers to a test basis for the methods of selecting products to be tested, while the remaining test bases are used for the methods of testing end-product functionalities.

Table 3. Test bases for SPLT

Test basis group | Classification | TB-ID | Description | Examples of test basis
Selection of products-to-test | - | Sel-Pro | A product, which is regarded as a test case in this classification, is selected based on the variability-specific concepts that make up the variability model, and test coverage is measured using variability-specific concepts. | variability model
Commonality and Reuse | Specification-based | CR-SB | Test cases are derived based on the product line specification, which includes variation points and variants, and test coverage is measured as the functional/non-functional requirements that domain or application testing, respectively, must cover. | functional/non-functional specification of a product line or a particular product
Commonality and Reuse | Design-based | CR-DB | Test cases are derived based on the design model of the product line containing the variation points and variants, and test coverage is measured in terms of the design-model-specific concepts that domain or application testing must cover. | design model of a product line or a particular product
Commonality and Reuse | Program-based | CR-PB | Test cases are derived from the source code of the product line containing the variation points and variants, and test coverage is measured in terms of the source-code-specific concepts that each domain or application testing must cover. | source code of a product line or a particular product
Sample Application | Specification-based | SA-SB | Test cases are derived based on the difference between the specification of a sample product already tested and the specification of a product to be tested, and test coverage is measured as the functional/non-functional requirements of the product. | functional/non-functional specifications of sample products
Sample Application | Design-based | SA-DB | Test cases are derived based on the difference between the design model of the already tested sample product and the design model of a product to be tested, and test coverage is measured in terms of design-model-specific concepts. | design models of sample products
Sample Application | Program-based | SA-PB | Test cases are derived based on the difference between the source code of a sample product already tested and the source code of a product to be tested, and test coverage is measured in terms of source-code-specific concepts. | source code of sample products

3. Review methodology

In Section 3.1, we derive the research questions that this paper investigates through a systematic review, and in Section 3.2 we describe our process of searching and selecting works in the existing literature for review.

3.1 Research questions

To define the research questions, we referred to the research questions used for reviewing SPLT in the existing systematic reviews mentioned in the Introduction. In addition, we examined the background knowledge of Section 2, which provides the test coverage criteria for single system testing and the classification frameworks of SPLT that differentiate test coverage criteria of SPLT from those of single system testing. The results led us to the following three research questions:

RQ1. What test coverage criteria exist for each test basis?
As mentioned in the Introduction, the first interest of this review is to answer the questions "What test coverage criteria are used in SPLT?", "Are the same test coverage criteria used as in single system testing?" and, if not, "What is different?" Through this question we can obtain an answer regarding what to choose as a test coverage criterion when testing a product line, with the sample application strategy, that has design models annotated with relevant features. In Section 2, we defined SPL test bases on the assumption that test coverage criteria of SPLT methods may vary depending on the SPL test strategy employed. To obtain an answer to this question we use the test bases for SPLT in Table 3.

RQ2. What factors affect test coverage criteria decisions?
A key problem in determining an SPL test strategy is how testing deals with variability in product line artifacts. In other words, a fundamental question is whether the SPLT method and the variability implementation mechanism used (that is, how an SPLT method handles variability) affect the choice of test coverage criterion. A test coverage criterion may or may not depend on the particular variability implementation mechanisms or SPLT methods used, and there may be other dimensions that affect test coverage criteria in SPLT. To investigate these interests in our study, we divide RQ2 into the following sub research questions:
- RQ2-1. Does the SPLT method affect test coverage criteria?
- RQ2-2. Does the variability implementation mechanism affect test coverage criteria?
- RQ2-3. Does the variability management approach affect test coverage criteria?
- RQ2-4. Are there any other dimensions, other than those of RQ2-1, RQ2-2, and RQ2-3, which affect test coverage criteria in SPLT?
For each of these questions, when the answer is affirmative, we discuss how test coverage criteria are affected by the aspect in question. From the answers to these questions, we will be able to determine whether test coverage criteria can be recommended for each SPLT method, variability implementation mechanism, and variability management approach.

RQ3. What level of evidence is provided for the SPL test coverage criteria used (strength of evidence in support of the used SPL test coverage criteria)?
This question investigates the reliability of the SPLT studies that have produced these results.

To obtain an answer to RQ1, the test coverage criteria in the selected studies will be classified by the test bases for SPL defined in Section 2.3. For RQ2-1 through RQ2-3, we classify test coverage criteria using the categories for SPLT method, variability implementation mechanism, and variability management approach, the details of which will be presented in Section 4.2. Based on this classification, we examine whether there are significant results for characterizing the test coverage criterion. To answer RQ2-4, we select additional dimensions of SPL other than those of RQ2-1, RQ2-2, and RQ2-3, such as the type of SPL approach, binding time of variability, and variability representation in artifacts, and map the selected studies using these additional dimensions to test coverage criteria. For RQ3, we assign the evidence level by the source of the software system in the studies, i.e., no evidence, working examples, software systems obtained from academic studies, software systems obtained from industrial cases, and industrial practices.

Our final interest is whether the current test coverage criteria are sufficient to guarantee the quality of an SPL and reflect SPLT's distinctiveness. We analyze and discuss the selected studies with regard to the test coverage criterion decision and subsequent test case design and execution, the adaptation of single system testing methods to SPLT, and the elements exercised by the execution of test cases. These point towards future directions for SPLT research.

3.2 Study selection Figure 1 shows the study selection procedure used to avoid duplication with existing systematic reviews, while achieving the purpose of the paper as stated in the Introduction.

Fig. 1. Study selection procedure

3.2.1 Phase 1: analysis of existing reviews Figure 2 shows the number of studies reviewed by [Engstrom 11], [Neto 11], and [Machado 14], and shows the overlap between them.

Fig. 2. Overlap between existing reviews

As Figure 2 shows, these three studies partially overlapped one another, but each author reviewed numerous studies. Prior to 2014, a total of 110 studies had been reviewed in these three systematic review studies, of which 61 were reviewed in [Engstrom 11] and [Neto 11] only; among these 61 studies, the studies discussing test coverage criteria are [McGregor 01], [Nebut 03], [Kauppinen 04], [Tevanlinna 04], and [Cohen 06]. Therefore, among the studies analyzed by [Engstrom 11] and [Neto 11], these five studies are included in our first review. [Machado 14] reviewed SPL testing strategies and grouped them into ones "to handle the selection of products to test" and ones "to handle the test of end-product functionalities". The 24 studies using the first strategy are classified into the "selection of products to test" group in Table 3, while the remaining 25 studies are classified as belonging to the other test basis groups. We will analyze these studies in terms of test coverage criteria using the classification of [Machado 14]. Therefore, altogether 49 studies selected by [Machado 14] are included in our first review.

3.2.2 Phase 2: Gathering recent publications

We gathered publications in ScienceDirect, ACM Digital Library, IEEE Xplore, and SpringerLink, because these databases provide many of the leading publications in the Software Engineering field. From the research questions in Section 3.1, we identified keywords to use in the search process. As listed in Table 4, we applied variants of the terms "software product line", "software product family", "product line", and "test" to compose the search queries. To define search strings consistent with the selected existing literature reviews, we checked the appropriateness of the defined search strings against the existing systematic review papers, i.e., [Engstrom 11], [Neto 11] and [Machado 14]. Our search strings were appropriate for the existing systematic review papers, except that we did not use search strings used by [Neto 11] such as "static analysis", "dynamic analysis", "test effort" and "performance". We excluded these terms because their coverage of SPLT is too broad. Search strings were coded in accordance with the syntax requirements of each search engine (Table 4). In this phase, articles retrieved from more than one search engine were counted only once; we regarded such a paper as one found by the search engine that detected it first. We did not use the keyword "test coverage" as a search string. This was because the search was targeting studies dealing with test coverage for SPL testing, and we intended to collect such studies by reviewing all the studies relating to SPL testing, based on our assumption that all studies on test coverage for SPL testing would include the keywords "SPL" and "test". This also has the effect of avoiding possible inconsistencies, because we used the studies collected by the existing review papers.

3.2.3 Phase 3: Selection of primary studies

To select the primary studies of this review, we used the following inclusion criteria:
- Peer-reviewed articles
- Studies on test case generation for SPLT
- Studies for which the test bases are clear and the test coverage criteria are specified

To exclude studies, we used the following exclusion criteria:
- Secondary studies that reexamine the primary studies
- Prefaces, editorials, summaries of tutorials, panels, poster papers and extended abstracts
- Studies published in doctoral symposia
- Comparative papers, position papers, literature reviews, mapping studies and tertiary reviews of primary studies
- Duplicated articles of a study that appears in different versions, such as book chapters, journal papers, and conference or workshop papers

In order to maximize discovery of relevant work, the appropriateness of the inclusion and exclusion criteria was checked against the existing systematic review papers, i.e., [Engstrom 11], [Neto 11] and [Machado 14]. After the set of search strings of Table 4 was applied to the search engines, the exclusion criteria were applied first; the title, authors, and abstract of each paper were examined. Next, according to the inclusion criteria, by analyzing the introduction section of each article we excluded papers that did not deal with test case generation for SPLT, studies that addressed the product validation problem or test effort, and studies for which it was unclear which test bases were used. After that, we investigated whether the paper in question considered test coverage criteria when generating test cases, and we also excluded papers that did not specify what kind of test coverage criterion is used. In all, 39 publications remained for selection, as shown in Table 4.

Table 4. Search strings and selections for studies from 2014 and after

Engine | Search string | Raw | Selections | IDs for selected papers
IEEE Xplore (IEEE, IET) | ((((("software product line") OR "software product family") OR "product line") OR SPL) AND test); refined by publisher: IEEE, IET; content type: Conference Publications, Journals & Magazines; year: 2014-2017 (July) | 105 | 8 | IEEE-P1 ~ IEEE-P8
Springer | "software product line" OR "software product family" OR "product line" AND test, between 2014-2017 (July, Language: English) | 946 | 11 | SP-P1 ~ SP-P11
ScienceDirect | pub-date > 2013 and TITLE-ABSTR-KEY("software product line" OR "software product lines" OR "software product family" OR SPL OR "product line") and TITLE-ABSTR-KEY(testing OR test OR validate OR verify OR assess OR validation OR validating OR verification) [Journals (Computer Science)] | 38 | 5 | SD-P1 ~ SD-P5
ACM DL | (+"software product line", "software product family", "product line" +test), published since 2014 | 75 | 15 | ACM-P1 ~ ACM-P15
Total | | | 39 |

Among the studies selected for review in [Engstrom 11] and [Neto 11], [Cohen 06] is the first attempt to apply the CIT method but is the same work as P33 of [Machado 14]; [Kauppinen 04] is a position paper; [Tevanlinna 04] is a review paper; [McGregor 01] gives an overview of SPL testing; and [Nebut 03] is a work that overlaps with P11 of [Machado 14]. Therefore, these studies were not selected. In the case of [Machado 14], 36 studies were selected out of 49 studies, by excluding the 13 studies that did not discuss test coverage criteria. However, we cannot be sure that all important research papers published up to 2014 are included in the studies selected in the three literature reviews above. Accordingly, another search for papers published before 2014 was conducted using the same method as that for collecting papers from 2014 and after, in order to identify any missing papers. As a result, three more papers were selected as relevant for the purpose of our review. Table 5 accordingly shows the 39 studies dating from before 2014 that were selected for review.

Table 5. Selections for studies before 2014

Source | Study group | Raw | Selections | IDs for selected papers
[Engstrom 11] and [Neto 11] | studies not included in [Machado 14] | 5 | 0 | -
[Machado 14] | studies to handle the test of end-product functionalities | 25 | 15 | IDs in [Machado 14] used
[Machado 14] | studies to handle the selection of products to test | 24 | 21 | IDs in [Machado 14] used
Additional studies not included in the previous three reviews | | | 3 | ACM-P16, IEEE-P9, SP-P12
Total | | | 39 |

Therefore, our literature review comprises the 39 publications obtained from the literature published in 2014 and after, the 36 studies in [Machado 14], and the additional three studies that were not included in the previous three reviews, a total of 78 studies in all. Annex 1 provides the details of these studies selected.

3.3 Data extraction

The data for classification were extracted by following the keywords of each research question. We stored the extracted data in a spreadsheet after reading each study. We analyzed and classified the selected studies based on
- the SPL approach, SPLT method, variability management, variability implementation mechanism, variability representation in artifacts, binding time used, test basis, test coverage criteria, and tool support; and
- the evidence level of each study.

For evidence evaluation, we use the following five levels, which are the evidence levels used by [Alves 10] and [Machado 14] except for "expert opinions or observations":
- L1: No evaluation
- L2: Evaluation with working examples
- L3: Evaluation with subjects obtained from academic studies
- L4: Evaluation with subjects obtained from industrial cases
- L5: Evaluated by industrial practices

The first author conducted the classification with each keyword, and the classification results were finalized after discussion with the second author.

4. Systematic review results

In this section, we present the results obtained through our systematic review. We give an answer to each research question based on our review of the literature. We start in Section 4.1 with answers to RQ1, the test coverage criteria used in SPLT and their distributions by test basis. In Section 4.2, we describe the distributions of the test coverage criteria used in SPLT by the dimensions selected in RQ2. Table 6 shows the distribution of studies by year and publication type. A majority of the papers (60 out of the total of 78, which is 77%) were published at conferences or workshops. This is the same as in [Machado 14], because a majority of the 49 studies selected there, other than three journal papers, were conference or workshop papers.

Table 6. Distribution of studies by publication year and type

Type | ~2013 | 2014 | 2015 | 2016 | 2017 | Total
Conference | 25 | 11 | 5 | 7 | 2 | 50
Workshop | 10 | - | - | 3 | 2 | 15
Journal | 3 | - | 2 | 4 | 2 | 11
Book chapter | 2 | - | - | - | - | 2
Total | 40 | 11 | 7 | 14 | 6 | 78

4.1 SPL test coverage criteria used in studies (RQ1)

Table 7 summarizes example test coverage criteria used in the SPL studies according to the classified test bases. As it shows, many studies belonging to the Sel-Pro test basis group use the feature model as a test basis and apply feature combination coverage. Studies belonging to the CR test basis group generate test cases from various types of test basis that have been modified to reflect variability, whereas for those belonging to the SA test basis group there were no differences from single system testing. All studies in the CR and SA groups defined the same test coverage criteria as those used in single system testing, such as "all-states", "all-transitions", or "branch coverage". However, the elements "states", "transitions", and "branches" in the test bases belonging to the SA group have the same meanings as in single system testing, whereas in the test bases belonging to the CR group these elements can additionally express commonality and variability.

Table 7. Test coverage criteria used in studies

TB-ID | Test coverage criteria
Sel-Pro | feature combination (F-Comb), feature interaction (F-Int), feature (F), coverage criterion based on model (MB-Cov), code (Code)
CR-SB | requirements (Req), MB-Cov
CR-DB | MB-Cov (all-states, all-transitions, all-actions, extended branch)
CR-PB | Code
SA-SB | Req
SA-DB | all-transition, full-fault (Fault), changes/differences between product variants (Delta)
SA-PB | Code

* ( ) contains the abbreviation for the test coverage criterion.

Table 8 shows a classification of studies by test basis. As a result of our review, most of the studies were mapped to the test bases defined in Section 2.3, although some studies were not. Studies classified as "Others" involved the generation of valid products through test case assessment, feature model validation, and feature model mutation.

Table 8. Classification of studies by test basis

Test basis | Studies | Total
Sel-Pro | ACM-P1, ACM-P2, ACM-P5, ACM-P7, ACM-P8, ACM-P12, ACM-P13, ACM-P16, IEEE-P2, IEEE-P3, IEEE-P4, 21 papers of [Machado 14], SD-P1, SD-P2, SD-P3, SD-P5, SP-P4, SP-P5, SP-P7, SP-P8, SP-P10, SP-P11, SP-P12 | 43
CR-SB | IEEE-P5, M-P7, M-P9, M-P11, M-P12, M-P13 | 6
CR-DB | ACM-P3, ACM-P9, IEEE-P6, IEEE-P7, SP-P1, SP-P2, M-P1, M-P14, M-P17, M-P32, M-P35, EN-1 | 12
CR-PB | IEEE-P8, SP-P3, IEEE-P9 | 3
SA-SB | ACM-P11, M-P48, M-P49 | 3
SA-DB | ACM-P10, ACM-P14, SD-P4, SP-P6, M-P34, M-P39, M-P41 | 7
SA-PB | SP-P9 | 1
Others | ACM-P4, ACM-P15, IEEE-P1 | 3

Table 9 shows the distributions of test coverage criteria by test basis. Of the 43 studies that use the Sel-Pro test basis, 32 applied either F-Comb or F-Int. For the SA group, none of the SPL test coverage criteria of the three test bases (SB, DB, and PB) differ from those for single system testing. That is, if the SB test basis uses the requirements specification as a test basis, it applies requirements coverage exactly as in single system testing. The meaning of a coverage criterion varies depending on which test basis is selected, i.e., that of the CR group or that of the SA group, as explained at the beginning of Section 4.1. If the CR group is selected, the coverage criterion is defined to include variability; if the SA group is chosen, it is defined as focusing on the portion not covered by the testing of the previous product.

Table 9. Distribution of test coverage criteria by test basis

Test basis | F-Comb | MB-Cov | F-Int | F | Req | Code | Delta | DF-Cov | Err | Fault | Mut | FW | Total
Sel-Pro | 26 | 3 | 6 | 2 | 2 | 2 | - | - | 2 | - | - | - | 43
CR-SB | - | 6 | - | - | - | - | - | - | - | - | - | - | 6
CR-DB | - | 8 | - | - | 1 | - | - | 1 | - | 1 | - | 1 | 12
CR-PB | - | - | - | - | - | 1 | - | - | - | - | 1 | - | 2
SA-SB | - | - | - | - | 2 | - | 1 | - | - | - | - | - | 3
SA-DB | - | 5 | - | - | - | - | 1 | - | - | - | 1 | - | 7
SA-PB | - | - | - | - | - | 2 | - | - | - | - | - | - | 2
Other | - | 1 | - | - | - | - | - | - | - | - | 2 | - | 3
Total | 26 | 23 | 6 | 2 | 5 | 5 | 2 | 1 | 2 | 1 | 4 | 1 | 78

* DF-Cov: data flow, Err: error, Mut: mutation, FW: framework
* '-': there are no studies

The results of the analysis of the relationship between test bases and test coverage criteria show that MB-Cov is used in all groups of test bases, namely Sel-Pro, CR, and SA. For product selection, this involves finding, from their test bases, the state machines, activity diagrams, or design models for the product that cover as many states, activities, or design model elements as possible. The models are created at a high level of abstraction and have trace links to the feature model.

4.2 Distribution of test coverage criteria by SPL aspects (RQ2)

In this section, we describe the review results along the different dimensions that were identified as possibly affecting test coverage criteria in SPLT. Sections 4.2.1 through 4.2.3 give answers to RQ2-1 through RQ2-3. Section 4.2.4 presents the additional dimensions selected and the review results of their effects on test coverage criteria in SPLT.

4.2.1 Distribution of test coverage criteria by SPLT methods (RQ2-1)

As defined in Section 2.1, a test coverage criterion is a rule or a collection of rules that imposes test requirements on a test set. As Table 1 indicates, in single system testing, test methods affect test coverage criteria. This section is concerned with whether there is a relationship between SPL test methods and test coverage criteria in SPLT. It begins by extracting SPLT methods through a preliminary investigation. The following methods are used to categorize the selected studies:
- Combinatorial interaction testing: systematically selects sample configurations and tests only these configurations.
- Functional testing: designs test cases using the requirements specification of a software product line or software products.
- Model-based testing: designs test cases using structural or behavioral models of software products.
- Control flow and data flow based testing: uses a tailored test case design method using control flow or data flow information based on the structures of a software product line or software products.
- Risk-based testing: uses the risk-based testing of single system testing for SPLT (e.g., test case selection and prioritization using features based on risks).
- Search-based testing: uses a tailored search-based testing method (e.g., test case selection for application testing).
- Regression testing: uses the regression testing methods of single system testing for SPLT.

The relationship between SPLT method and test basis was analyzed to investigate whether the SPLT method affects the test basis. Table 10 shows that, of the 43 studies that used the Sel-Pro test basis, 27 (63%) adopted the CIT method and 10 (23.3%) complemented the CIT method with the search-based testing (SBT) method. Almost all SBT methods used in SPLT are used to compensate for the selection of products to test. The model-based testing (MBT) method is related to several test coverage criterion classifications; however, it is mainly used in categories related to the commonality and reuse strategy. Regression testing (RT) is used in the CR and SA groups, primarily in the design-based approach. From the results in the last column of Table 9, it can be seen that CIT and SBT are mapped to the Sel-Pro test basis, while MBT is mapped to one of the test bases of the CR group.

Table 10. Distribution of test bases by SPLT method

SPLT method | Sel-Pro | CR-SB | CR-DB | CR-PB | SA-SB | SA-DB | SA-PB | Other | Total
Combinatorial interaction testing (CIT) | 27 | - | - | - | - | - | - | - | 27
Model-based testing (MBT) | 5 | 6 | 8 | - | - | 3 | - | 1 | 23
Control flow or data flow-based testing (CDFT) | - | - | 1 | - | - | - | 1 | - | 2
Risk-based testing (RBT) | - | - | 1 | - | - | 1 | - | - | 2
Search-based testing (SBT) | 10 | - | - | - | 1 | - | 1 | - | 12
Regression testing (RT) | - | - | 1 | 1 | 1 | 2 | - | - | 5
Delta-oriented testing (DOT) | - | - | - | - | 1 | - | - | - | 1
Others (OT) | 1 | - | 1 | 1 | - | 1 | - | 2 | 6
Total | 43 | 6 | 12 | 2 | 3 | 7 | 2 | 3 | 78

* '-': there are no studies

In addition to the test coverage criteria presented in Table 6, some studies applied the Err, FW, and DF-Cov test coverage criteria. Table 11 classifies the studies by SPLT method and test coverage criterion. The CIT method mainly uses the F-Comb coverage criterion, and the SBT method uses a variety of test coverage criteria. Almost all MBT methods apply MB-Cov, and many SPLT methods use MB-Cov. In the case of the CIT, MBT and SBT methods, test coverage criteria are dependent on the SPLT methods; for the other SPLT methods, however, it is difficult to conclude whether there are dependencies between test coverage criteria and SPLT methods.

Table 11. Classification of studies by SPLT method and SPL test coverage criterion

SPLT method | SPL test coverage criterion | Studies | Total
CIT | F-Comb | ACM-P1, ACM-P2, ACM-P5, ACM-P7, IEEE-P2, IEEE-P3, SD-P2, SD-P3, SD-P5, SP-P4, SP-P8, SP-P10, M-P23~M-P26, M-P29, M-P31, M-P45, M-P47 | 20
CIT | F-Int | ACM-P13, M-P33, ACM-P16, SP-P12 | 4
CIT | F | SP-P7 | 1
CIT | Code | IEEE-P4, M-P36 | 2
MBT | MB-Cov | ACM-P3, ACM-P14, ACM-P15, IEEE-P5, IEEE-P6, SD-P4, SP-P1, SP-P2, SP-P5, SP-P11, M-P1, M-P7, M-P9, M-P11, M-P12, M-P13, M-P14, M-P17, M-P20, M-P32, M-P41 | 21
MBT | F-Int | M-P28, M-P30 | 2
CDFT | DF-Cov | M-P35 | 1
CDFT | Code | IEEE-P9 | 1
RBT | Req | ACM-P9 | 1
RBT | Delta | ACM-P10 | 1
SBT | F-Comb | ACM-P8, SD-P1, M-P40, M-P42~M-P44 | 6
SBT | F | M-P05 | 1
SBT | Code | SP-P9 | 1
SBT | Err | M-P38 | 1
SBT | Req | ACM-P11, ACM-P12, M-P18 | 3
RT | Fault | IEEE-P7 | 1
RT | Req | M-P49 | 1
RT | MB-Cov | SP-P6, M-P34 | 2
RT | Code | IEEE-P8 | 1
DOT | Delta | M-P48 | 1
OT | Mut | ACM-P4, IEEE-P1, SP-P3, M-P39 | 4
OT | Err | M-P21 | 1
OT | FW | EN-1 | 1

Table 12 presents our remarks on the test coverage criteria of the SPLT methods. The names of the test coverage criteria are in many cases the same as those used in single system software testing methods.

Table 12. Test coverage criteria used by SPLT methods

SPLT method | Test coverage criteria used | Remarks
CIT | F-Comb, F-Int, F | The basis used for product selection is the feature model; the coverage criteria indicate how much of the feature combinations, feature interactions, or individual features the selected products cover.
CIT | Code | The basis used for product selection is annotated code; the coverage criterion indicates how much of the source code lines the selected products cover. This coverage name, the same as in single system testing, does not convey the original meaning.
MBT | MB-Cov | The models used by MBT are extended for SPLT, but MBT uses the same coverage criterion name as single system testing, which does not convey the added meaning of the extension.
MBT | F-Int | Studies in this classification automatically generate corresponding test cases that fulfil the feature interaction coverage criterion on the basis of model-based coverage criteria. They use MBT and its coverage criteria to suit SPLT.
CDFT | DF-Cov, Code | A single system testing method was extended for SPLT; however, its coverage criteria did not consider the different variants included in control or data flow.
SBT | F-Comb, F, Code, Err, Req | Studies in this category use SBT for product selection that fulfils the F-Comb or F coverage criteria. Code, Err, and Req are coverage criteria that do not convey this meaning.
RT | Fault, Req, MB-Cov, Code | RT is used for SPLT or RT methods for SPLT are proposed; however, their coverage criteria do not convey the fundamental characteristics of SPLT.
DOT | Delta | Although the concept of delta is similar to SPL's commonality and variability concepts, delta cannot be considered a coverage criterion that reflects the fundamental differences between SPLT and delta-oriented programming.
OT | Mut, Err, FW | These have the same names as coverage criteria for single system testing, but the original meanings of Mut and Err in SPLT are not conveyed. The FW coverage criterion is too broad in meaning to indicate which test requirements the criterion yields.

4.2.2 Distributions of variability implementation mechanisms by test coverage criterion (RQ2-2)

Variability can be implemented in a variety of ways, such as annotative approaches (e.g., preprocessors) or compositional approaches (e.g., feature-oriented programming), depending on the programming language or the mechanism provided by the tool being used. The method for implementing variability is called the variability implementation mechanism (VIM). [Svahnberg 05], [Capilla 13] and [Apel 13] discuss VIMs for software product lines. [Capilla 13] added VIMs for the variability of runtime modes in addition to the VIMs provided by [Svahnberg 05]. [Capilla 13] distinguishes VIMs by the three major product derivation stages at which the values of variability are selected in accordance with a given binding time. [Apel 13] distinguishes VIMs in a different way, by dividing them into the language-based (LB) approach and the tool-driven (TD) approach, unlike [Svahnberg 05] and [Capilla 13], who distinguish VIMs by product derivation stages. [Apel 13] includes all of the VIMs of [Svahnberg 05] and [Capilla 13]. However, we added a VIM that extends modeling notations or languages for single systems to the classification of [Apel 13]. In order to answer RQ2-2, we use the following classification to categorize the selected studies:
- The language-based approach (LB): this approach uses mechanisms provided by a host programming language to implement variabilities and to derive products (see the sketch after this list). Examples of such mechanisms include model extensions for describing variabilities in design models, parameters that use conditional statements (such as "if" and "switch") to alter the control flow of a program at run time, design patterns, frameworks, components in component-based implementations, services in service-oriented architecture as a special form of software component, feature-oriented programming, aspect-oriented programming, and delta-oriented programming.
- The tool-driven approach (TD): this approach uses one or more external tools to implement or represent variabilities in code and to control the product-derivation process. Examples of such external tools include version-control systems that build product lines by using their branching and merging functions; build systems that encode variabilities in build scripts; preprocessors that provide facilities for conditional compilation; exploiting a feature tracing link that connects a feature with its implementation artifacts; and integrated product derivation, which develops a product line by integrating both the language-based and tool-driven approaches and combining them with feature models.
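As a minimal sketch of the language-based approach (hypothetical features, not drawn from any selected study), the parameter mechanism listed above binds variability at run time with ordinary conditional statements, so a single code base realizes different products depending on the feature configuration.

# Hypothetical language-based VIM: variability implemented with run-time parameters.
def export_report(data, feature_encryption=False, feature_compression=False):
    payload = ",".join(data)
    if feature_compression:        # variation point: is the Compression variant bound?
        payload = payload.replace(",", ";")   # stand-in for a real compression step
    if feature_encryption:         # variation point: is the Encryption variant bound?
        payload = payload[::-1]               # stand-in for a real encryption step
    return payload

# Two "products" derived from the same code base by binding the variability differently.
print(export_report(["a", "b"]))                                  # base product
print(export_report(["a", "b"], feature_encryption=True))         # product with the Encryption feature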

Some of the studies analyzed did not specify which VIM is used, or the VIM was irrelevant to the study; for these, we assigned the value "unspecified (U)". Figure 3(a) shows the classification of VIMs by test coverage criterion. 50 studies, which account for 64% of the total of 78 studies, did not specify which VIM is used in the implementation under test. Among the 26 studies that applied F-Comb, 20 studies were assigned a U. We then cross-checked the VIM against the SPLT method. Figure 3(b) shows that the MBT and SBT methods are classified as LB or U, while CIT is classified as all three; however, most methods are classified as U. This is because most SPLT studies are biased toward "selection of products to test" or toward system-level testing; it is not that SPLT does not need to consider the VIM, but rather that the VIM is concerned with how variability is described in the model and how it is implemented in code. The SPLT studies that use the MBT method but are not classified as U use feature annotation rather than feature tracing.

(a) Distribution of VIMs by test coverage criterion (F-Comb: feature combination, MB-Cov: coverage criterion based on model, F-Int: feature interaction, F: feature, Req: requirements, Delta: changes/differences between product variants, DF-Cov: data flow, Err: error, Mut: mutation, FW: framework)

(b) Distribution of VIMs by SPLT method (CIT: combinatorial interaction testing, MBT: model-based testing, CDFT: control flow or data flow-based testing, RBT: risk-based testing, SBT: search-based testing, RT: regression testing, DOT: delta-oriented testing, OT: others)

Fig. 3. Distribution of VIMs by test coverage criterion and SPLT method

An analysis of the 28 studies whose VIM value is not U shows that their detailed distribution includes model extension (8), component (5), feature-oriented programming (3), framework (2), parameters (2), and delta-oriented programming (2); the remaining studies correspond to aspect-oriented programming, pre-processors, annotation, and the domain-specific-language-dependent method. Eight studies (see the blue bar of MBT in Figure 3(b)), which account for 35% of the 23 MBT studies, implemented variability in the model through model extension. However, as the gray bar of MBT in Figure 3(b) indicates, the remaining 15 studies were classified as U. All of their evidence levels were lower than L3 (cf. Section 4.3), and these studies generate test cases using the mapping between the feature model and the elements of the models used in MBT.

4.2.3 Distribution of test coverage criteria by variability management (RQ2-3)
Since the main challenge of SPLT is how testing deals with variability in product line artifacts, managing variability is an important aspect of SPLT. Therefore, we analyzed the relationship between the variability management method and the test coverage criterion. As Figure 4 shows, 63 of the 78 studies (84%) use the feature model (FM). Seven studies used the orthogonal variability model (OVM), which models only variability in terms of variation points, variants, variability dependencies, and variability constraints, while one study used the common variability language (CVL). Finally, four studies used methods for which it was hard to evaluate how they managed variability (V). As Figure 4 shows, of the 26 studies that used F-Comb, 25 used FM; that is, F-Comb is associated with FM. However, according to the FM line in Figure 4, FM is used with almost all test coverage criteria, making it difficult to determine whether the variability management method affects the choice of test coverage criterion. Five of the seven studies using OVM apply MB-Cov; however, in the case of MB-Cov, 17 studies used FM, so the test coverage criterion cannot be judged to depend on the variability management method.

Fig. 4. Distribution of variability models by test coverage criterion

Analyzing the relationship between test basis and variability model, we found that 40 of the 41 Sel-Pro studies used FM as their variability model. Integrating this result with those of Section 4.2, 31 of the studies belonging to the Sel-Pro test basis use features to measure test coverage, whereas the remaining nine studies that use FM as their variability model in the Sel-Pro test basis measure test coverage without using features.
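Since the feature model dominates as the variability model in the selected studies, the following minimal sketch may help fix the underlying idea; the features and cross-tree constraints are hypothetical and not drawn from any particular study. It represents a feature model as a set of features plus requires/excludes constraints and checks whether a product configuration is valid, which is the basic operation underlying both product sampling and feature-based coverage measurement.

# Toy feature model (assumed example): one mandatory feature, optional features,
# and cross-tree constraints.
FEATURES = {"base", "logging", "audit", "encryption"}
MANDATORY = {"base"}                   # present in every product
REQUIRES = {("audit", "logging")}      # audit requires logging
EXCLUDES = {("audit", "encryption")}   # audit and encryption are mutually exclusive

def is_valid(config):
    """Check a product configuration (a set of features) against the toy feature model."""
    if not config <= FEATURES or not MANDATORY <= config:
        return False
    if any(a in config and b not in config for a, b in REQUIRES):
        return False
    if any(a in config and b in config for a, b in EXCLUDES):
        return False
    return True

if __name__ == "__main__":
    print(is_valid({"base", "logging", "audit"}))                # True
    print(is_valid({"base", "audit"}))                           # False: requires violated
    print(is_valid({"base", "audit", "logging", "encryption"}))  # False: excludes violated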

4.2.4 Other dimensions that affect test coverage criterion in SPLT (RQ2-4)
We checked the research questions of the major SPLT review papers [Engstrom 11, Neto 11, Machado 14] and examined which dimensions could affect SPLT but had not yet been examined in connection with test coverage criteria. The resulting dimensions were the type of SPL approach, the binding time of variability, and the variability representation in artifacts. The selected studies were analyzed to see whether these additional dimensions affect the test coverage criterion in SPLT. Types of SPL approach can be classified as feature-oriented product line engineering, model-driven product line engineering, or component-based product line engineering [Pohl 05]. Among the 78 studies reviewed, nine studies use the SPL approach as follows:
• Feature-oriented product line engineering: ACM-P08, SD-P1, IEEE-P6, M-P33, M-P49;
• Component-based product line engineering: ACM-P09, M-P09, M-P34; and
• Model-driven product line engineering: M-P27.
SPLT studies classified as delta-oriented product line engineering, including ACM-P10, SP-P6, M-P41, and M-P48, use different SPLT methods, test bases, and test coverage criteria. However, in most cases other than these studies, it is difficult to classify the type of SPL approach used. Studies that use feature models as their test basis or as their variability model might be classified as feature-oriented product line engineering; however, it is difficult to ascertain whether these studies "organize and structure the whole product-line process as well as all software artifacts involved in terms of features" [Apel 13]. Even among the studies that could be classified as described above, we did not find clear regularities regarding test bases or test coverage criteria.
Next, we analyzed the binding-time aspect of the implementation under test. The binding time of variability is divided into "before compilation", "at compile time", "at link time", "at load time", and "at run time" according to the classification of [Pohl 05]. In SPLT, the test generation method varies with the binding time of variability, because the implementation under test has different executable steps due to the variability. However, no study has previously considered binding time for testing variability in the implementation under test. In the case of the Sel-Pro test basis, it is difficult to discuss the binding time of variability because its objective is to select products to test. However, if the Sel-Pro test basis is used and the VIM is a preprocessor, we can judge the binding time to be "compile time"; if the Sel-Pro test basis is used with MBT, we can judge the binding time to be "design time". Since studies in which the SPLT method is classified as MBT generally use as their test basis a model that includes the variability, it is not possible to determine the binding time from the model alone. As described in [Pohl 05] and [Apel 13], the SPL approach and the VIM are likely to affect the binding time of variability. However, since we determined above (including in RQ2-2) that the type of SPL approach does not affect the test coverage criterion, the same holds for binding time. This is also related to the fact that it was difficult to judge the VIM in many studies in RQ2-2.
Finally, we analyzed aspects of variability representation in test bases, such as specifications, models, and code.
We began the analysis based on the classification of [Apel 13], i.e., the annotation-based approach (e.g., preprocessors, parameters) and the composition-based approach (e.g., design patterns, frameworks, components, feature-oriented programming, aspects). In the annotation-based approach, a common code base is annotated such that code belonging to a certain feature is marked accordingly; during product derivation, all code that belongs to deselected features or invalid feature combinations is removed (at compile time) or ignored (at run time) to form the final product. In the composition-based approach, features are implemented in the form of composable units, ideally one unit per feature; during product derivation, all units of all selected features and valid feature combinations are composed to form the final product. As a result, the variability representation in the test basis was classified as annotation-based (AB), composition-based (CB), features (F), variability-annotation (VA), variation point (VP), and vague (V). F is the case in which the variability is described only as a feature model in the study. VA refers to a variability representation method that annotates variability on the elements of the test basis (e.g., featured transition systems), while VP means a variability representation in which the test basis contains a variation point and its possible variants.
Figure 5 shows that F-Comb is applied only when the AB, CB, and F variability representation approaches are used. MB-Cov, on the other hand, is mostly used when variability is represented with the VA and VP approaches, although there are some exceptions. From these results, it can be concluded that, for the F-Comb and MB-Cov test coverage criteria, the variability representation approach is significantly related to the choice of test coverage criterion. In the case of the F-Comb coverage criterion, however, although we expected the variability representation to drive the choice of this criterion, we found only that the variabilities are represented in the test basis in an easily combinable form. In the case of the MB-Cov coverage criterion, both the variability representation in the model and the test coverage criteria used for that model in single system testing affect the test coverage criterion decision. For example, if an activity diagram with VPs is used as the test basis and variability representation, then the control flow of the activity diagram and the VPs affect the decision on the test coverage criterion; in this case, the variation point coverage criterion or the variant coverage criterion should be considered. If the test basis is a state machine and VA is used, then how the states and/or transitions of the state machine are annotated with variability affects the decision on the test coverage criterion; in this case, the state variability coverage criterion or the transition variability coverage criterion can be used.
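As an illustration of what the F-Comb criterion measures, the following sketch computes pairwise feature-combination coverage for a sampled set of products: the fraction of all feature-value pairs (each feature selected or deselected) that is exercised by at least one sampled product. The feature names and sampled configurations are hypothetical, and a real approach would additionally restrict the pairs to those that are realizable under the feature model's constraints.

from itertools import combinations, product

# Hypothetical features and sampled products (sets of selected features).
FEATURES = ["logging", "audit", "encryption", "compression"]
SAMPLE = [
    {"logging", "audit"},
    {"encryption"},
    {"logging", "encryption", "compression"},
]

def covered_pairs(config):
    """All (feature, value) pairs exercised by one product, a value being selected or not."""
    assignment = {f: (f in config) for f in FEATURES}
    return {((f1, assignment[f1]), (f2, assignment[f2]))
            for f1, f2 in combinations(FEATURES, 2)}

def pairwise_coverage(sample):
    """Fraction of all feature-value pairs covered by at least one sampled product."""
    all_pairs = {((f1, v1), (f2, v2))
                 for f1, f2 in combinations(FEATURES, 2)
                 for v1, v2 in product([True, False], repeat=2)}
    covered = set().union(*(covered_pairs(c) for c in sample))
    return len(covered) / len(all_pairs)

if __name__ == "__main__":
    print(f"pairwise feature-combination coverage: {pairwise_coverage(SAMPLE):.0%}")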

Fig. 5. Distribution of variability representations by test coverage criterion

4.3 Evidence levels of selected studies (RQ3)
Evidence levels can be used as a basis for judging whether a proposed test coverage criterion is applicable in practice, while confirming the reliability of the results presented by an SPLT study. In addition, we can check whether a test coverage criterion that is widely used in SPLT studies is practical. We classified a study as an academic study if its evaluation was conducted in controlled lab experiments and as an industrial case if its evaluation was conducted in an industry setting. The rating "industrial practice" indicates that the method in question has already been approved and adopted by some SPL organization [Alves 10]. After analyzing the 78 selected studies by evidence level, as shown in Table 13, we found that one study did not provide evaluation results (L1), the studies that provided running examples as evidence (L2) amounted to 24% (19), and the studies that provided evidence using subjects from academic studies (L3) amounted to 60% (47). Studies that provided evidence through industrial cases (L4) or industrial practices (L5) comprised only 10% and 4%, respectively.

Table 13. Classification of studies by evidence level
L1: SP-P2
L2: ACM-P14, IEEE-P6, IEEE-P8, SD-P2, SD-P3, SD-P4, SP-P1, SP-P4, SP-P5, M-P07, M-P12, M-P14, M-P32, M-P18, M-P20, M-P21, M-P26, M-P38, EN-1
L3: ACM-P1, ACM-P2, ACM-P3, ACM-P4, ACM-P5, ACM-P7, ACM-P11, ACM-P12, ACM-P13, ACM-P15, ACM-P16, IEEE-P1, IEEE-P2, IEEE-P3, IEEE-P5, IEEE-P7, IEEE-P9, SD-P5, SP-P3, SP-P6, SP-P7, SP-P8, SP-P9, SP-P10, SP-P11, M-P01, M-P11, M-P13, M-P34, M-P35, M-P39, M-P41, M-P05, M-P23, M-P24, M-P25, M-P28, M-P29, M-P30, M-P31, M-P33, M-P36, M-P40, M-P42, M-P43, M-P45, M-P47
L4: ACM-P8, ACM-P10, IEEE-P4, SP-P12, M-P17, M-P48, M-P49, M-P44
L5: ACM-P9, SD-P1, M-P09

Figure 6 shows the results of analyzing the evidence levels. Figure 6(a) shows that most of the studies with the Sel-Pro test basis belong to L2 and L3 (see TB-ID in Table 2 for the test bases on the x-axis of Figure 6(a)). Figure 6(b) shows that the evidence level of the SPLT studies that apply the feature combination and model-based coverage criteria corresponds to L2 or L3. Detailed analysis of the evidence-level aspect reveals that 43 of the 78 studies used Sel-Pro, of which 25 used F-Comb. Figure 6(b) also shows that both the F-Comb and MB-Cov coverage criteria have been studied and validated at L3. The delta coverage criterion has only a small number of studies, but all of them use industrial cases for evaluation.

(a) Evidence levels by test basis

(b) Evidence levels by test coverage criterion

Fig. 6. Evidence levels of selected studies

4.4 Analysis of review results and discussions
Among the many results obtained from this review, the main findings can be summarized as follows:
(1) It was confirmed that all selected studies could be classified using the test bases for SPLT in Table 8 (Section 4.1).
(2) Among the test bases identified, the Sel-Pro (selection of products-to-test) test basis mainly selects the F-Comb (feature combination) test coverage criterion, while the CR-SB (commonality and reuse-specification-based) and CR-DB (commonality and reuse-design-based) test bases mainly select the MB-Cov (coverage criteria based on model) test coverage criterion (Table 9 of Section 4.1).
(3) The CIT and SBT (search-based testing) methods use the Sel-Pro test basis, while the MBT (model-based testing) method is mainly selected in the CR group, even though it is not limited to a specific test basis (Table 10 of Section 4.2.1).
(4) The CIT method selects the F-Comb test coverage criterion, while the MBT method selects the MB-Cov test coverage criterion. On the other hand, the SBT and RT methods are not limited to a specific test coverage criterion (Table 11 of Section 4.2.1), because they are used for prioritization or selection of test cases, which are actually products.
(5) The choice of test coverage criterion was independent of the variability implementation mechanism, variability management, SPL approach, and binding time (Sections 4.2.2 through 4.2.4).
(6) The selection of a specific test coverage criterion depended on the variability representation used in development artifacts. When variability was expressed using the AB (annotation-based), CB (composition-based), and F (feature) approaches, the F-Comb test coverage criterion was used; when variability was expressed with VA (variability-annotation) and VP (variation point), the MB-Cov test coverage criterion was used (Section 4.2.4).
(7) Most of the studies presented evidence based on software systems used in academic studies (Section 4.3).

The implications of our systematic review of the existing research on SPL test coverage criteria can be discussed from three perspectives, as follows:
(1) The test coverage criterion decision and subsequent test case design and execution. Among all the selected studies, 55% used the Sel-Pro test basis, of which 60% also used the F-Comb test coverage criterion. The F-Comb criterion focuses on optimized product selection based on feature combinations rather than on test case derivation and thus can only be used to reduce the number of products to test. The studies on the F-Comb test coverage criterion for SPLT, while providing useful ways of selecting and prioritizing products to test, did not consider how to test these products themselves, and were concerned only with which type of coverage criterion to use for selecting and prioritizing products. The studies belonging to the CR-SB and CR-DB test bases address subsequent test case generation and execution based on the selected test coverage criterion. However, the other test bases are ambiguously defined or irrelevant to the problem of how the test coverage criterion guides test case generation and execution. This implies that the current status of the studies on SPL test coverage criteria is not mature enough to guide testing processes ranging from test design to test execution.
(2) Adapting single system testing methods to SPLT. Single system methods such as model-based testing, search-based testing, risk-based testing, and others have also been used in SPLT. For example, model-based testing was tailored to accommodate variability and to define test models with variability through feature annotation or model extension. In such cases, the notation for SPLT models must be interpreted differently from that for single software testing models. Likewise, test coverage criteria must be redefined to reflect such adaptation and the adapted interpretation; in this way, it will be possible to determine whether a test coverage criterion satisfies the test requirements and test goals in SPLT. As an example of an SPL test coverage criterion obtained by such an adaptation, the ScenTED method employs an extended branch coverage criterion. The name "extended branch coverage criterion", however, is misleading. In ScenTED, a decision point in an activity diagram represents a variation point, while the branches from the decision point represent the selection of variants. However, the branch coverage criterion does not convey that testing should cover the possible selections of variants. Thus, the activity diagram extended by ScenTED is insufficient as an SPL test basis, because it does not prescribe the exact scope that testing should cover. In this case, coverage criteria such as "all variants" or "all variant-pairs" can be used as practical coverage criteria, even if an insufficient SPL test basis must be used. Adaptation is more than just extension. Adaptation may involve making fundamental changes to a single system testing method in order to suit SPLT, whereas extension maintains the single system testing method while adding new aspects to it. Thus the term "extended" does not adequately convey the existence of the fundamental differences that should emerge when a method of single system testing is promoted to a method of SPLT.
(3) Elements exercised by execution of test cases. From a test coverage criterion, we know what elements or aspects of a software system (e.g.,
function, feature, quality attribute, or structural element) will be exercised by executing the test cases generated from it. As can be seen from Section 4.2, many SPLT test coverage criteria share the same names as single system test coverage criteria. However, the elements exercised in a single system are quite different from those exercised by the test cases for product lines that are generated based on the same-named test coverage criteria. For example, in single system model-based testing, the branch coverage criterion means that testing covers all branches of each decision point. In the selected SPLT studies, however, the branches that testing must cover can be the variants of a variation point as well as the decisions of a decision point. Also, in cases where featured transition systems are used as the test model, the state coverage criterion, the transition coverage criterion, etc., are used. Even when an SPL test coverage criterion that employs a state machine as the test model has the same name as in single system testing, the state and transition coverage criteria of a featured transition system for SPLT do not mean that only the states or transitions of one state machine must be tested; they also mean that all possible combinations of states and transitions, depending on the features linked to states and transitions, must be tested. In this way, unlike the test coverage criteria for single system testing, the names of test coverage criteria for SPLT do not deliver precise information about the structure to be exercised as required by the given criterion. Thus, achieving an SPL test coverage level may not mean achieving the execution results that would be implied by the same test coverage criterion in single system testing (i.e., covering each of all possible combinations of states rather than covering each state at least once), even in cases where the test coverage criterion provides a quantified measure.
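To illustrate this difference, the sketch below uses a toy featured transition system in which each transition carries an optional feature guard; an "all transitions" criterion interpreted at the product-line level then requires that every transition exist, and be exercised, in at least one tested product, rather than that a single state machine be traversed. The states, transitions, and feature guards are hypothetical.

from itertools import chain

# Toy featured transition system (assumed example): (source, action, target, feature guard).
# A guard of None means the transition is common to every product.
TRANSITIONS = [
    ("idle",    "start",   "running", None),
    ("running", "log",     "running", "logging"),
    ("running", "encrypt", "secured", "encryption"),
    ("running", "stop",    "idle",    None),
    ("secured", "stop",    "idle",    None),
]

def transitions_of(product):
    """Transitions that exist in the variant derived for a given product (set of features)."""
    return [t for t in TRANSITIONS if t[3] is None or t[3] in product]

def all_transitions_coverage(tested_products):
    """Fraction of FTS transitions present in at least one tested product."""
    covered = set(chain.from_iterable(transitions_of(p) for p in tested_products))
    return len(covered) / len(TRANSITIONS)

if __name__ == "__main__":
    # Testing only the product with no optional features leaves the feature-guarded
    # transitions uncovered, even though every transition of that single variant is taken.
    print(all_transitions_coverage([set()]))                               # 0.6
    print(all_transitions_coverage([set(), {"logging", "encryption"}]))    # 1.0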

5. Threats to validity
One underlying threat to the preceding literature review is data omission. When searches were conducted by entering the search strings according to the rules of each digital library, many irrelevant works were listed, because the same terms are also used in other academic fields. After extracting works relevant to this field, the final works for review were selected by perusing the abstracts and tables of contents according to the selection criteria defined in the paper. In this process, relevant works in the literature may have been overlooked. In addition, bias can occur in the process of analyzing and classifying the selected literature. The validity of the results presented in the paper may be challenged from both internal and external perspectives, as follows:
Construct validity: This paper used two methods to select studies for review. The first method was to select studies prior to (but not including) the year 2014; the second was to select studies from January 2014 to July 2017. In the first case, we reviewed the selected studies of the existing review papers [Engstrom 11, Neto 11, Machado 14] and then reselected the studies for review according to the selection criteria from the viewpoint of test coverage criteria. In the second case, we used the same search strings and selection criteria for study selection as the existing review papers. However, the choice of these two methods may have introduced bias, because the review objectives of the previous review papers differ from those of our review. To reduce this bias, we repeated the search for pre-2014 studies using the same search strings and found three additional studies.
External validity: We analyzed test coverage criteria from several SPL aspects, such as test basis, SPL approach, SPLT method, variability implementation mechanism, variability management, and binding time, in order to analyze whether they affect the selection of test coverage criteria for SPLT. As can be seen in Tables 8, 9, and 11, there are rows in which the data mapping results are biased toward a specific row or column, as well as rows with many "-" cells and fewer sample data than the other rows. This limits the generality of the findings in Section 4. To address this issue, the analyses were conducted only for the coverage criteria for which sufficient supporting data were available.
Reliability: The first threat to reliability is selection bias. To ensure the quality of the study selection, we defined detailed and explicit inclusion/exclusion criteria. Then, the first author performed the study selection by applying these criteria. The third author checked whether any studies had been missed by the review papers that were used for selecting pre-2014 studies. Within this process, the first author also double-checked the results. Classification is another source of threat to reliability. The classification of this review was conducted by the first author. To prevent possible bias and to avoid the threat of errors arising because only one author conducted the classification, we followed the objective statements of each study for each keyword of the research questions. The classification of the selected studies was finalized in consultation with the second author whenever issues related to classification were raised. The third threat to the reliability of this review is the overall quality of the individual studies. Although the quality criteria or scoring used in [Kitchenham 07-1] or [Kitchenham 07-2] were not applied in our review, we believe that the inclusion and exclusion criteria of Section 3.2.3 partially ensure the quality of the individual studies.

6. Conclusion
In this paper, we selected studies published from 2003 until the first half of 2017 that satisfy the quality criteria defined in Section 3.2. With the aim of identifying the test coverage criteria currently used in SPLT and analyzing the limitations that should be resolved in terms of test coverage criteria, three research questions were formulated and investigated. Before we began this review study, we expected that SPL test coverage criteria would be significantly different from those of single system testing, because software product line development consists of two distinct processes, domain engineering and application engineering, and the artifacts resulting from these two processes are quite different from those in single system development. However, with the exception of the test coverage criteria used with the Sel-Pro (selection of products-to-test) test basis, all of the existing SPL test coverage criteria have the same names as those used in single system testing; moreover, they are assumed to correspond to their single system testing counterparts, with no essential differences from them. This is because, when test coverage criteria were imported from single system testing, they were not properly adapted to SPLT and thus do not thoroughly prescribe what a test method employing them should cover, as discussed in Section 4.4 with the examples of the state transition coverage criteria and the branch coverage criteria. Therefore, SPL test coverage criteria should be defined or redefined to address these deficiencies so that they can clearly deliver the target scope of SPLT. For further study, we will investigate what aspects of SPL test methods should be reflected in SPL test coverage criteria, so that the test coverage criteria of SPLT, while guiding the testing process from test design to test execution, are unambiguously defined and deliver what should be covered by a given SPL test generation method with a given test coverage criterion, as in single system testing. Further, we will work on a more fine-grained SPL test coverage taxonomy based on the test coverage criterion classification of this review.

Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2017R1D1A3B03028609), and by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF2017M3C4A7066210).

Appendix A. Primary studies See Table A.13 Table A.13 Selected primary studies ID ACM-P1 ACM-P2 ACM-P3 ACM-P4 ACM-P5 ACM-P7

Title A Parallel Evolutionary Algorithm for Prioritized Pairwise Testing of Software Product Lines A Preliminary Empirical Assessment of Similarity for Combinatorial Interaction Testing of Software Product Lines Abstract Test Case Generation for Behavioural Testing of Software Product Lines Faster Bug Detection for Software Product Lines with Incomplete Feature Models Fault-based Product-Line Testing: Effective Sample Generation based on Feature-Diagram Mutation IncLing: Efficient Product-Line Testing using Incremental Pairwise Sampling

Author(s) R.E. Lopez-Herrejon et al. S. Fischer et al. X. Devroey et al. S. Souto et al.

Venue GECCO‘14 SBST‘16 SPLC‘14 SPLC‘15

D. Reuling et al.

SPLC‘15

M. Al-Hajjaji et al.

GPCE‘16

ACM-P8 ACM-P9 ACM-P10 ACM-P11 ACM-P12 ACM-P13 ACM-P14 ACM-P15 ACM-P16 IEEE-P1 IEEE-P2 IEEE-P3 IEEE-P4 IEEE-P5 IEEE-P6 IEEE-P7 IEEE-P8 IEEE-P9 SD-P1 SD-P2 SD-P3 SD-P4 SD-P5 SP-P1 SP-P2 SP-P3 SP-P4 SP-P5 SP-P6 SP-P7 SP-P8 SP-P9 SP-P10 SP-P11 SP-P12 [Machado14]

Multi-Objective Test Prioritization in Software Product Line Testing: An Industrial Case Study Risk Based Testing for Software Product Line Engineering Risk-Based Integration Testing of Software Product Lines Search-based Similarity-driven Behavioural SPL Testing Search-Based Test Case Selection of Cyber-Physical System Product Lines for Simulation-Based Validation Similarity-Based Prioritization in Software Product-Line Testing Towards Incremental Test Suite Optimization for Software Product Lines Towards the Assessment of Software Product Line Tests: A Mutation System for Variable Systems Grammar-based Test Generation for Software Product Line Feature Models A Mutation and Multi-objective Test Data Generation Approach for Feature Testing Efficient Product-Line Testing Using Cluster-Based Product Prioritization Product Selection based on Upper Confidence Bound MOEAD-DRA for Testing Software Product Lines Supporting Software Product Line Testing by Optimizing Code Configuration Coverage Model-Based Software Product Line Testing by Coupling Feature Models with Hierarchical Markov Chain Usage Models Model-Based Test Design of Product Lines Raising Test Design to the Product Line Level Reducing the Concretization Effort in FSM-Based Testing of Software Product Lines Test Logic Reuse through Unit Test Patterns a Test Automation Framework for Software Product Lines Analyzing Structure-based Techniques for Test Coverage on a J2ME Software Product Line Cost-effective Test Suite Minimization in Product Lines using Search Techniques PROW: A Pairwise algorithm with constRaints, Order and Weight Deriving Products for Variability Test of Feature Models with a Hyper-Heuristic Approach Input–Output Conformance Testing for Software Product Lines Practical Minimization of Pairwise-Covering Test Configurations using Constraint Programming An Approach to Derive Usage Models Variants for ModelBased Testing Coverage Criteria for Behavioural Testing of Software Product Lines Facilitating Reuse in Multi-goal Test-Suite Generation for Software Product Lines Generating Configurations for System Testing with Common Variability Language Model-Based Product Line Testing: Sampling Configurations for Optimal Fault Detection Applying Incremental Model Slicing to Product-Line Regression Testing Testing variability-intensive systems using automated analysis: an application to Android Validation of Constraints Among Configuration Parameters Using Search-Based Combinatorial Interaction Testing Genetic Algorithm-based Test Generation for Software Product Line with the Integration of Fault Localization Techniques Hybrid Algorithms Based on Integer Programming for the Search of Prioritized Test Data in Software Product Lines Statistical Prioritization for Software Product Line Testing: an Experience Report A Technique for Agile and Automatic Interaction Testing for Product Lines

S. Wang et al. H. Hartmann et al. R. Rachmann X. Devroey et al. A. Arrieta et al. M. Al-Hajjaji et al. H. Baller et al. H. Lackner et al. E. Bagheri et al. R.A. Matnei Filho M. Al-Hajjaji et al. T. do N. Ferreira et al. L. Vidács et al.

SPLC‘14 SPLC‘14 VaMoS‘17 VaMoS‘16 SPLC‘16 SPLC‘14 FOSD‘14 SPLC‘14 CASCON'12 SBES‘15 AST‘17 CEC‘16 ICSTW‘15

C.S. Gebizli et al. QRS‘16 H. Lackner et al. V.H. Fragal et al. G.S. Neves L. SIlva S. Wang et al. B.P. Lamancha et al. A. Strickler et al. H. Beohar et al. A. Hervieu et al.

ICST‘14 ICSTW‘17 IRI‘14 LATW'09 Journal of Systems and Software, 2015 Journal of Systems and Software, 2015 Applied Soft Computing, 2016 Journal of Logical and Algebraic Methods in Programming, 2016 Information and Software Technology, 2016

H. Samih et al.

ICTSS‘14, LNCS 8763

X. Devroey et al.

ISoLA‘14, LNCS 8802

J. Bürdek et al.

FASE‘15, LNCS 9033

D. Shimbara et al.

SDL‘15, LNCS 9369

H. Lackner

SDL‘15, LNCS 9369

S. Lity et al. J.A. Galindo et al. A. Gargantini et al.

ICSR16, LNCS 9679 Software Quality Journal, 2016 SSBSE‘16, LNCS 9962

X. Li et al. Empirical Software Eng., 2017 J. Ferrer et al. X. Devroey et al. M. Johansen et al.

EvoApplications‘17, LNCS 10200 Software System Model, 2017 ICTSS'12

P1, P5, P7, P9, P11, P12, P13, P14, P17, P18, P20, P21, P23, P24, P25, P26, P28, P29, P30, P31, P32, P33, P34, P35, P36, P38, P39, P40, P41, P42, P43, P44, P45, P47, P48, P49

Appendix B. Venues searched
Journals
JSW – Journal of Software
TSE – IEEE Transactions on Software Engineering
SQJ – Software Quality Journal
Conferences
AOSD – Aspect-Oriented Software Development
CAiSE – Advanced Information Systems Engineering
CASCON – Computer Science and Software Engineering
EvoApplications – Applications of Evolutionary Computation
FASE – Fundamental Approaches to Software Engineering
GECCO – Genetic and Evolutionary Computation Conference
GPCE – Generative Programming: Concepts & Experience
ICSR – Software and Systems Reuse
ICST – Software Testing, Verification and Validation
ICTSS – Testing Software and Systems
IRI – Information Reuse and Integration
ISSRE – Software Reliability Engineering
ITNG – Information Technology
MODELS – Model-Driven Engineering and Software Development
QRS – Software Quality, Reliability, and Security
SBCARS – Software Components, Architectures, and Reuse
SPLC – Software Product Line Conference
TAP – Tests and Proofs
Workshops
A-MOST – Advances in Model Based Testing
AST – Automation of Software Test
FOSD – Feature-Oriented Software Development
ICSTW – Software Testing, Verification and Validation Workshops
IWCT – Combinatorial Testing
LATW – Latin-American Test Workshop
PFE – Product Family Engineering
PLEASE – Product Line Approaches in Software Engineering
SBST – Search-Based Software Testing
SPLiT – Software Product Line Testing
VaMoS – Variability Modelling of Software-Intensive Systems
VariComp – Variability and Composition
Others
CEC – Evolutionary Computation
ISoLA – Leveraging Applications of Formal Methods, Verification and Validation
SBES – Software Engineering
SDL – System Design Language
SSBSE – Search-Based Software Engineering

Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
[ISO 2013] ISO/IEC/IEEE 29119-1:2013, Software and systems engineering -- Software testing -- Part 1: Concepts and definitions.
[Jorgensen 14] P.C. Jorgensen, Software Testing: A Craftsman's Approach, 4th ed., CRC Press, 2014.
[Amman 16] P. Ammann, J. Offutt, Introduction to Software Testing, 2nd ed., Cambridge University Press, 2016.
[Mouchawrab 11] S. Mouchawrab, L.C. Briand, Assessing, comparing, and combining state machine-based testing and structural testing: a series of experiments, IEEE Transactions on Software Engineering 37 (2) (2011) 161-187.
[Myers 11] G.J. Myers, C. Sandler, T. Badgett, The Art of Software Testing, 3rd ed., Wiley Publishing, 2011.
[Naik 08] K. Naik, P. Tripathy, Software Testing and Quality Assurance: Theory and Practice, John Wiley & Sons, 2008.
[Rapps 85] S. Rapps, E.J. Weyuker, Selecting software test data using data flow information, IEEE Transactions on Software Engineering SE-11 (4) (1985) 367-375.
E. Weyuker, The evaluation of program-based software test adequacy criteria, Communications of the ACM 31 (6) (1988) 668-675.
[Utting 12] M. Utting, A. Pretschner, B. Legeard, A taxonomy of model-based testing approaches, Software Testing, Verification and Reliability 22 (2012) 297-312.
[Cohen 06] M.B. Cohen, M.B. Dwyer, J. Shi, Coverage and adequacy in software product line testing, in: Proceedings of the Workshop on the Role of Software Architecture for Testing and Analysis (ROSATEA), 2006, pp. 53-63.
[Pohl 05] K. Pohl, G. Böckle, F. van der Linden, Software Product Line Engineering: Foundations, Principles, and Techniques, Springer, 2005.
[Nguyen 14] H.V. Nguyen, C. Kästner, T.N. Nguyen, Exploring variability-aware execution for testing plugin-based web applications, in: Proceedings of the 36th International Conference on Software Engineering (ICSE 2014), ACM, New York, NY, USA, 2014, pp. 907-918. https://doi.org/10.1145/2568225.2568300
[Meinicke 16] J. Meinicke, C. Wong, C. Kästner, T. Thüm, G. Saake, On essential configuration complexity: measuring interactions in highly-configurable systems, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), Singapore, 2016, pp. 483-494.
[Lochau 14] M. Lochau, S. Lity, R. Lachmann, I. Schaefer, U. Goltz, Delta-oriented model-based integration testing of large-scale systems, Journal of Systems and Software 91 (2014) 63-84.
[Lity 18] S. Lity, M. Nieke, T. Thüm, I. Schaefer, Retest test selection for product-line regression testing of variants and versions of variants, Journal of Systems and Software 147 (2018) 46-63.
[Reuys 10] A. Reuys, S. Reis, E. Kamsties, K. Pohl, The ScenTED method for testing software product lines, in: T. Käkölä, J.C. Dueñas (Eds.), Software Product Lines: Research Issues in Engineering and Management, Springer, 2010.
[Reis 07] S. Reis, A. Metzger, K. Pohl, Integration testing in software product line engineering: a model-based technique, in: Proceedings of the 10th International Conference on Fundamental Approaches to Software Engineering (FASE 2007), LNCS 4422, 2007, pp. 321-335.
[Nebut 10] C. Nebut, Y. Le Traon, J.-M. Jézéquel, System testing of product lines: from requirements to test cases, in: T. Käkölä, J.C. Dueñas (Eds.), Software Product Lines: Research Issues in Engineering and Management, Springer, 2010.
[Engström 11] E. Engström, P. Runeson, Software product line testing – a systematic mapping study, Information and Software Technology 53 (1) (2011) 2-13.
[Lamancha 09] B.P. Lamancha, M.P. Usaola, M.P. Velthius, Software product line testing – a systematic review, in: Proceedings of the 4th International Conference on Software and Data Technologies (ICSOFT), INSTICC Press, Sofia, Bulgaria, 2009, pp. 23-30.
[Lee 12] J. Lee, S. Kang, D. Lee, A survey on software product line testing, in: Proceedings of the 16th International Software Product Line Conference, ACM, 2012, pp. 52-61.
[Machado 14] I. Machado, J.D. McGregor, Y.C. Cavalcanti, E. Almeida, On strategies for testing software product lines: a systematic literature review, Information and Software Technology 56 (10) (2014) 1183-1199.
[Neto 11] P.A.M.S. Neto, I.C. Machado, J.D. McGregor, E.S. Almeida, S.R.L. Meira, A systematic mapping study of software product lines testing, Information and Software Technology 53 (5) (2011) 407-423.
[Svahnberg 05] M. Svahnberg, J. van Gurp, J. Bosch, A taxonomy of variability realization techniques, Software: Practice and Experience 35 (2005) 705-754. doi:10.1002/spe.652
[Capilla 13] R. Capilla, Variability realization techniques and product derivation, in: R. Capilla, J. Bosch, K.C. Kang (Eds.), Systems and Software Variability Management, Springer, Berlin, Heidelberg, 2013, pp. 87-99.
[Apel 13] S. Apel, D. Batory, C. Kästner, G. Saake, Feature-Oriented Software Product Lines: Concepts and Implementation, Springer, 2013.
[Kauppinen 04] R. Kauppinen, J. Taina, A. Tevanlinna, Hook and template coverage criteria for testing framework-based software product families, in: Proceedings of the International Workshop on Testing Software Product Lines (SPLiT), 2004, pp. 7-12.
[Mcgregor 01] J.D. McGregor, Testing a Software Product Line, Technical Report CMU/SEI-2001-TR-022, 2001.
[Mcgregor 10] J.D. McGregor, Testing a software product line, in: Testing Techniques in Software Engineering, Second Pernambuco Summer School on Software Engineering (PSSE 2007), Recife, Brazil, December 3-7, 2007, Revised Lectures, LNCS 6153, pp. 104-140.
[Machado 12] I. Machado, J.D. McGregor, Y. Cavalcanti, Strategies for testing products in software product lines, ACM SIGSOFT Software Engineering Notes 37 (6) (2012) 1-8.
[Cohen 08] M.B. Cohen, M.B. Dwyer, J. Shi, Constructing interaction test suites for highly-configurable systems in the presence of constraints: a greedy approach, IEEE Transactions on Software Engineering 34 (5) (2008) 633-650.
[Kamsties 03] E. Kamsties, K. Pohl, A. Reuys, Supporting test case derivation in domain engineering, in: Proceedings of the 7th World Conference on Integrated Design and Process Technology (IDPT-2003), December 2003.
[PFE 03] E. Kamsties, K. Pohl, S. Reis, A. Reuys, Testing variabilities in use case models, in: Proceedings of the 5th International Workshop on Product Family Engineering (PFE-5), LNCS 3014, Springer, November 2003, pp. 5-18.
[Lochau 12] M. Lochau, S. Oster, U. Goltz, A. Schürr, Model-based pairwise testing for feature interaction coverage in software product line engineering, Software Quality Journal 20 (2012) 567-604.
[McMinn 11] P. McMinn, Search-based software testing: past, present and future, in: Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (ICSTW '11), 2011, pp. 153-163.
[Alves 10] V. Alves, N. Niu, C. Alves, G. Valença, Requirements engineering for software product lines: a systematic literature review, Information and Software Technology 52 (8) (2010) 806-820.
[Kitchenham 07-1] B.A. Kitchenham, S.L. Pfleeger, L.M. Pickard, P.W. Jones, D.C. Hoaglin, K. El Emam, J. Rosenberg, Preliminary guidelines for empirical research in software engineering, IEEE Transactions on Software Engineering 28 (8) (2002) 721-734.
[Kitchenham 07-2] B. Kitchenham, S. Charters, Guidelines for performing systematic literature reviews in software engineering, Keele University and Durham University Joint Report, Tech. Rep. EBSE 2007-001, 2007.