Experiences of using an evaluation framework

Barbara Kitchenham, Stephen Linkman, Susan Linkman
Software Engineering Group, Department of Computer Science, Keele University, Keele Village, Stoke-on-Trent, Staffordshire ST5 5BG, UK

Information and Software Technology 47 (2005) 761–774

Received 19 June 2003; revised 20 December 2004; accepted 10 January 2005; available online 11 April 2005

Abstract

This paper reports two trials of an evaluation framework intended to evaluate novel software applications. The evaluation framework was originally developed to evaluate a risk-based software bidding model, and our first trial of using the framework was our evaluation of that bidding model. We found that the framework worked well as a validation framework but needed to be extended before it would be appropriate for evaluation. Subsequently, we compared our framework with a recently completed evaluation of a software tool undertaken as part of the Framework V CLARiFi project. In this case, we did not use the framework to guide the evaluation; we used the framework to see whether it would identify any weaknesses in the actual evaluation process. Activities recommended by the framework were not undertaken in the order suggested by the evaluation process, and problems relating to that oversight surfaced during the tool evaluation activities. Our experiences suggest that the framework has some benefits, but it also requires further practical testing.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Software bidding model; Model evaluation framework; Evaluation process; Evaluation results; Prototype tool evaluation

1. Introduction

In a previous paper [3], we proposed a framework for evaluating a software bidding model, and suggested that such a framework might be useful for any expert-opinion-based model of a software process where there was little or no data available for model validation. In this paper, we present the results of two trials aimed at evaluating the framework. The first trial was based on our evaluation of the software bidding model for which the framework was originally developed. The second trial was an indirect evaluation of the framework based on a comparison of the framework with an independent evaluation exercise.

The framework was developed to evaluate a risk-based software bidding model [4]. After developing the model, we realised we had a problem evaluating it. Evaluation was difficult because the model does not have a single-value output that can be compared with an actual outcome. The output from the model is a distribution of output values obtained from a Monte Carlo simulation, and the inputs are expert-opinion-based estimates of the distributions of a variety of input variables. Even more importantly, the model is meant to assist management decision-making and, if managers use the model, we cannot tell what would have happened if they had not used the model, and vice versa. Furthermore, there is no guarantee that a company using the model in a specific bidding situation would win the contract even if the model were perfect. Our evaluation framework was developed to address this evaluation problem.

Our first trial describes the results of the evaluation exercise we undertook using the evaluation framework. The goal of this paper is not to present the results of the evaluation exercise but to assess the value of the evaluation framework itself. Thus, we concentrate on explaining how we operationalised the evaluation concepts in the framework and discussing the benefits of using the framework and its limitations.

In order to further assess the potential of the framework, we compared the framework with the evaluation of a prototype software engineering tool. The tool was built and evaluated as part of the CLARiFi Framework V project. The goal of the CLARiFi project was to support software-based component development. It developed a method to support
software component classification and a prototype tool based on the method. The project was part funded by the EEC, and was required by the EEC to provide an evaluation of the research tool that the project developed. Therefore, as a part of the project, the tool and its underlying concepts were evaluated throughout the project. Each software engineering tool includes a conceptual model that defines the concepts underlying the tool. The model usually relates to the solution to a problem and/or the means to perform a specific task (or set of related tasks). Thus, a complete evaluation of a tool ought to include an evaluation of the embedded conceptual model. Since our evaluation framework was based on a method for evaluating the quality of conceptual models, we believed that it might also be applicable to evaluating prototype tools with a novel underlying paradigm. We recognise that the framework would not be of any assistance evaluating fully commercial tools where techniques such as feature analysis can be used to compare similar tools, and usability laboratories can be used to assess user satisfaction. We describe the evaluation framework in Section 2. In Section 3, we discuss the procedures we used to conduct our trials of the evaluation framework. In Section 4 we present the first trial, and in Section 5 we discuss the second. In Section 6, we discuss the limitations of the trials. We present our conclusions in Section 7.
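To make the evaluation problem described above concrete, the kind of output the bidding model produces can be sketched with a few lines of Monte Carlo simulation over expert-estimated input distributions. The sketch below is our own illustration with invented input factors and triangular estimates; it is not the authors' actual model, which is described in [4].

```python
# Illustrative only: a toy risk-based bid simulation with expert-opinion inputs.
# The factor names and distributions are invented; the real model is in [4].
import numpy as np

rng = np.random.default_rng(1)
N = 10_000  # number of Monte Carlo samples

# Expert estimates expressed as triangular distributions (low, most likely, high).
effort_pm   = rng.triangular(80, 100, 150, N)      # effort in person-months
rate_per_pm = rng.triangular(8, 10, 13, N)          # cost per person-month (k GBP)
risk_factor = rng.triangular(1.0, 1.1, 1.4, N)      # multiplier for project risk
margin      = rng.triangular(0.05, 0.10, 0.20, N)   # desired profit margin

cost = effort_pm * rate_per_pm * risk_factor
bid  = cost * (1 + margin)

# The 'result' is a distribution, not a single value that can be compared
# with a single actual outcome.
print("median bid:", round(float(np.median(bid)), 1), "k GBP")
print("80% interval:", np.percentile(bid, [10, 90]).round(1))
```

Because every run yields a distribution of plausible bids rather than a point prediction, conventional predicted-versus-actual validation does not apply, which is the problem the evaluation framework is intended to address.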

2. The evaluation framework

The evaluation framework is shown in Table 1. It consists of evaluating five different qualities: Semantic, Syntactic, Pragmatic, Test and Value (see [3]). The framework is an extension of Lindland et al.'s framework for evaluating the quality of conceptual models [6]. We developed the framework to evaluate a risk-based software bidding model.

The evaluation framework identifies five quality dimensions for evaluation:

† Syntactic Quality evaluation aims to check that all the statements in the conceptual model are syntactically correct.
† Semantic Quality evaluation aims to check that the conceptual model is feasibly complete and valid. Completeness means that the model contains all statements about the domain that are relevant to the model. Validity means that all the statements made in the model are correct. Feasible completeness means that there may be relevant statements missing from the model but that it is not worthwhile trying to find them. Feasible validity means that there may be invalid statements in the model but that it is not worthwhile eliminating them.
† Pragmatic Quality evaluation aims to check that the solution is both understandable and understood adequately by its target audience. The quality requirement is stated in terms of feasible understandability and feasible comprehension, where the term feasible means that there may be misunderstandings but that the misunderstandings are not worth correcting.
† Test Quality evaluation aims to check that the conceptual model has been adequately tested in terms of feasible test coverage. Feasible test coverage means that there may be other relevant tests but that it is not worthwhile identifying and performing them.
† The Value of the conceptual model aims to check the Practical Utility of the model. For a specialized model (i.e. a model tailored for use in a particular company), practical utility is the extent to which the use of the model improves the processes used in a user organisation.

We used this evaluation framework in our first trial. However, we needed to adapt the framework in order to use it in the second trial.

Table 1
Model evaluation framework

Syntactic quality
  Goal: Syntactic correctness
  Model properties: Defined syntax
  Means: Manual checking of the model implementation

Semantic quality
  Goal: Feasible validity; feasible completeness
  Model properties: Traceability to domain
  Means: Inspection; sensitivity analysis (to identify unnecessary features); consistency checking (aimed at ensuring the model is internally consistent)

Pragmatic quality
  Goal: Feasible comprehension; feasible understandability
  Model properties: Structuredness; expressive economy
  Means: Means to enable comprehension, including visualization, explanation, filtering; means to assess comprehension, for example an empirical study of understanding achieved by the audience group (interviews or self-administered questionnaire); documentation guidelines and standards covering format and content

Test quality
  Goal: Feasible test coverage
  Model properties: Executability
  Means: Simulation studies based on pre-defined scenarios (conformance with reference mode behaviour); simulation studies related to input value manipulation (i.e. sensitivity and stability analyses)

Value
  Goal: Practical utility
  Means: Means to enable model use, including appropriate user interface design, user manuals and training; means to evaluate model value, for example an empirical study of model users' view of using the model (experiment, interviews or self-administered questionnaire)
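One way to operationalise a framework like Table 1 in practice is to encode each row as data and generate an evaluation checklist from it. The sketch below is our own illustration of that idea; the structure and wording are paraphrased from Table 1, and nothing here is prescribed by the published framework.

```python
# Illustrative sketch: Table 1 encoded as data so an evaluator can generate
# a checklist and record which means have actually been applied.
from dataclasses import dataclass, field

@dataclass
class QualityAspect:
    name: str
    goal: str
    means: list[str]
    done: dict[str, bool] = field(default_factory=dict)

    def checklist(self) -> list[str]:
        return [f"[{'x' if self.done.get(m) else ' '}] {self.name}: {m}"
                for m in self.means]

framework = [
    QualityAspect("Syntactic quality", "Syntactic correctness",
                  ["Manual checking of the model implementation"]),
    QualityAspect("Semantic quality", "Feasible validity and completeness",
                  ["Inspection", "Sensitivity analysis", "Consistency checking"]),
    QualityAspect("Pragmatic quality", "Feasible comprehension/understandability",
                  ["Visualization/explanation/filtering", "Study of audience understanding"]),
    QualityAspect("Test quality", "Feasible test coverage",
                  ["Scenario simulations", "Sensitivity and stability analyses"]),
    QualityAspect("Value", "Practical utility",
                  ["User interface, manuals and training", "Study of users' views"]),
]

for aspect in framework:
    print("\n".join(aspect.checklist()))
```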

The adaptation was required for two reasons:

1. To address weaknesses in the framework identified in the first trial.
2. To adapt the framework to the requirements of evaluating a tool rather than a model.

3. Conduct of the framework trials

The framework trials were not designed as formal case studies as described by Yin [11]. This limits their methodological rigour. In particular, we did not produce a study protocol with pre-defined data collection procedures. Neither did we use any selection process to identify the evaluation exercises we studied. However, we did use some of the basic principles of case study design to guide our trials, and the description of our methodology follows Yin's guidelines.

3.1. The trial scope

3.1.1. Goal of the trials
The goal of the trials was to assess whether other researchers facing similar evaluation problems to the ones we faced would find the framework helpful when defining their evaluation process.

3.1.2. Trial propositions
The specific aspects that we investigated were:

† Usability, which we defined to be how much work was necessary to make the framework usable for a specific evaluation.
† Validity, which we defined to be the extent to which the framework identified all necessary evaluation activities.
† Value of the framework, which we defined to be the benefits of using the framework.

Our proposition was that a framework that was usable, valid and offered value to its users would be regarded as helpful by its potential users. We also identified other questions that could be addressed in the second trial because of the nature of the information available (see Section 5.2).

3.2. Trial design

By analogy with case studies [11], the design of these trials constitutes a multiple-trial study with an embedded unit of analysis. Unlike a well-planned case study, however, these trials were not subject to any selection process. They represent the trials we were able to perform rather than the trials that we would prefer to perform. For example, it would be preferable to have the evaluation framework tested by independent researchers on a model different from the one that it was explicitly designed to evaluate.

3.3. Data collection

The evidence used to assess our proposition was obtained primarily from publicly available documentation. All the authors were participants in at least one of the trials. One author (Kitchenham) was responsible for extracting information from the reports. The other authors were responsible for checking the accuracy and completeness of the data collected. Note that, in this context, 'data' refers to issues (problems, events, observations) mentioned in the reports that related to the evaluation framework (trial 1) or the effectiveness of the evaluation exercise (trial 2).

For the first trial, data collection was based on the report of the evaluation exercise [9]. The evaluation was designed using the evaluation framework and performed primarily by Dr Lesley Pickard. Dr Kitchenham performed some of the syntactic quality checking for the model implementation. Dr Kitchenham and Mr Linkman both contributed to the construction of the report. Dr Kitchenham reviewed the evaluation report to identify issues connected with the evaluation framework and its impact on the evaluation exercise. Mr Linkman reviewed Dr Kitchenham's set of issues.

Our second trial was based on back-fitting the evaluation framework to the CLARiFi evaluation. Susan Linkman managed and documented the CLARiFi evaluation exercise, which was initially planned by other researchers [1,2]. Stephen Linkman worked on the CLARiFi project and acted as a subject in one of the evaluation activities; however, he did not plan the evaluation activities, nor did he contribute to writing the evaluation report. He did, however, undertake a feature analysis of the CLARiFi tool at the request of the EEC. Barbara Kitchenham did not work on the CLARiFi project. Thus, the development of the evaluation framework and the CLARiFi evaluation exercise took place independently of one another and can properly be used as a means of cross-checking each other. Note that, although we have avoided direct bias, there are other forms of bias inherent in our study; these are discussed in Section 6.

For the second trial, several different documents were used. Dr Kitchenham checked documents related to the planning of the CLARiFi evaluation [1,2] against the revised evaluation framework to investigate similarities and dissimilarities between the CLARiFi evaluation plan and the evaluation framework and its supporting evaluation process. In addition, Dr Kitchenham reviewed the final report of the CLARiFi evaluation and categorised issues that arose during the CLARiFi evaluation activities [7]. This was intended to investigate whether the CLARiFi evaluation suffered from difficulties that could be attributed to failure to use the evaluation framework. Mrs Susan Linkman reviewed Dr Kitchenham's analysis. As a result, several issues were re-categorised.

4. Results of trial 1

In this section, we discuss the elements of the evaluation exercise Dr Pickard undertook to evaluate our generic bidding model. We identify those issues that relate to the usability, validity and value of the evaluation framework used to plan the evaluation exercise.

4.1. The evaluation process

In order to relate model building activities to model evaluation activities, we developed an evaluation process model, shown in Fig. 1 and discussed in [3].

When we attempted to use the evaluation framework and the evaluation process to plan the actual evaluation, both the framework and the process needed to be tailored to fit the specific evaluation requirements. This was necessary because we had created a generic bidding model, but some aspects of the framework were more appropriate for evaluating a specialised version of our model adapted to a particular software development organisation.

Fig. 1. Model construction, development and evaluation process (lines with single arrows represent sequential links between process steps, dashed lines with single arrows represent backwards links between processes, lines with double arrows are used to identify linked processes, lines with a lozenge link a process to its owner).

From the viewpoint of the evaluation process, the initial steps in Fig. 1 appear the same for generic and specialized models, but this conceals one major difference. For our generic bidding model, the domain was the literature on bidding in software and other industries and our personal experience of teaching software practitioners cost estimation and risk management. For a specialized model, the domain would be the specific bidding process of a particular organisation plus the generic model. This meant that the nature of the Semantic Quality evaluation was changed. In addition, since the model did not have a group of actual users and was not intended for deployment, it was not possible to assess Model Value and Pragmatic Quality as proposed by the framework.

We developed the evaluation framework and process after building the bidding model. This meant that we performed the semantic quality evaluation after the syntactic quality evaluation and after the start of model testing. The semantic evaluation highlighted a problem with contingency calculation, which caused us to develop a new contingency model. If the semantic quality evaluation had been performed at the appropriate stage in the model development process, it is possible that the defect in the contingency calculation would have been detected earlier. As it is, the stability and sensitivity tests were performed on the wrong model, and the impact of the contingency element and contingency adjustments was not properly evaluated. From the viewpoint of the framework, this result supports the need to sequence each evaluation activity appropriately.

4.2. Semantic and syntactic quality evaluation

4.2.1. Semantic quality evaluation
We identified a means of semantic quality evaluation suitable for a generic model. The method relied on developing cross-reference tables explicitly linking the results of the literature survey to the elements of the model. The tabular format worked effectively and identified a major weakness in our bidding model: we had not utilised knowledge from the insurance industry to construct the contingency element of a bid. Other elements of the evaluation exercise confirmed problems with our handling of contingency and led to a significant change to the bidding model.

4.2.2. Syntactic quality evaluation
The Syntactic Quality evaluation was supposed to check the syntactic correctness of the model. However, we found it insufficient just to check the syntax of the diagrams and mathematical formulae in the specification. It was also necessary to confirm that the model specification was correctly implemented in the program that performed the Monte Carlo simulation. For this reason, we also reviewed the model specification against the model implementation documentation [8]. The implementation documentation consists of descriptions of each of the Price and Profit Model programs, including equations and input/output factors, plus flowcharts of all the programs and how they are linked together. This activity led to the detection of eight model defects that were corrected before the model test plan was executed. Six of the defects were semantic quality defects and two were Pragmatic Quality defects. The defects are itemised in [9].

4.3. Test quality

Test quality was evaluated by executing a test plan comprising three main components:

† Pre-defined bidding scenarios. These test the fidelity of the model, where the term fidelity is used to mean that the model outputs are consistent with experience.
† The impact of input value changes on model outputs. This tests the sensitivity and stability of the model outputs to changes in the input values.
† The relative importance of the different factors in the model. This tests the sensitivity of the model to the different factors in the model.

The sensitivity and stability testing and the investigation of the relative importance of factors confirmed the problem with contingency handling suggested by the Semantic Quality evaluation. The activity of developing the scenarios identified various limitations in our original view of the scope of a bidding model and led to the development of a linked profit model and a contingency model. In addition, we found some unexpected results when performing the scenario tests that we felt were related to Value rather than Test Quality.

4.4. Lessons learnt from applying the evaluation framework

4.4.1. The value of the framework
Factors related to the value of the framework are:

† It caused us to review the objectives of the model and specify what results we would expect of the model if it was working as envisaged, i.e. if the model properly reflected the software bidding process.
† It allowed us to delimit the scope of our evaluation and to identify the specific evaluation activities that we needed to perform based on the issues relevant to a generic model: Semantic Quality, Syntactic Quality, and Test Quality.
† The evaluation process was successful in helping us to identify deficiencies in the model, in particular: incorrect handling of contingency, the need for a profit model, and the existence of implementation errors.
† Failure to follow the order of activities suggested by the evaluation process meant we were late in detecting the problem with contingency handling.

4.4.1.1. Implication of value issues. The first two bullet points can be viewed as supporting the proposition that any framework is of value when undertaking an evaluation. However, the last two points support the view that the specific elements of this evaluation framework are of value.

4.4.2. The validity of the framework

4.4.2.1. Missing elements. The framework suggests syntactic correctness is a simple matter of checking that the model statements are syntactically correct. However, we also found it necessary to check the translation from the initial model to the implemented model. Whether this counts formally as syntactic checking or testing is not clear, but it was necessary because we detected a number of model defects as a result of the exercise. The framework also suggests that the Practical Utility of a model can only be assessed when it is deployed, ruling out evaluation of the value of a generic model. However, the fidelity tests raised some important issues concerned with risk and portfolio management and could therefore be seen as related to the Value of the generic model.

4.4.2.2. Overlaps between model elements. In general, we found that the techniques identified as means to achieve a specific quality goal often contributed to other goals. That contribution was of two types:

† The technique sometimes contributed to the achievement of a different goal.
† The technique sometimes contributed to the confidence we had that a different goal had been achieved.

For example, the manual inspection of the model implementation document revealed eight defects. Although the inspection was intended to address syntactic correctness, six of the defects were related to inadequate model specification and were, therefore, semantic errors. The remaining two defects were related to understandability of the model and were therefore pragmatic understandability defects. In addition, the stability and sensitivity tests gave us confidence that, with the exception of the impact of contingency, the model was internally consistent, which is an indicator of semantic validity.

4.4.2.3. Implication of validity issues. The framework benefits from synergy among evaluation tasks, since individual tasks can contribute to several different evaluation goals. However, currently the framework does not make this synergy visible. This validity issue does not suggest that the framework will fail to deliver value to users.

4.4.3. Usability of the framework

4.4.3.1. Tailoring the framework. We needed to complement the framework with an evaluation process. We also needed to tailor the framework to the specific type of model we were attempting to evaluate (i.e. a generic model as opposed to a specialized model).

4.4.3.2. Semantic quality. In order to use the concept of 'Traceability to domain' to check semantic accuracy, we needed to have a means of representing the domain and its relationship to the bidding model. We found cross-reference tables were a practical means of documenting the relationship and making it available for inspection.

4.4.3.3. Pragmatic quality. The framework was not very helpful for achieving and assessing Pragmatic Quality. In our view, this is because we were evaluating a generic model. However, it will be important to establish whether the goals and mechanisms suggested for Pragmatic Quality are suitable for a specialized model.

4.4.3.4. Assessing goal achievement. The major weakness with the evaluation framework is the problem of assessing the extent to which the quality goals have been achieved. For example, following Lindland et al. [6], we define the goal of semantic quality evaluation to be feasible completeness and feasible validity. However, it is not clear how to determine what is meant by 'feasible'. This is a particular problem when the developers of a model are the people evaluating it. This was summed up in [9]: "As it stands, the evaluation framework is more of a validation than an evaluation framework". In order to address this problem, the framework needs to separate the means to achieve the quality goals from the means to assess achievement of the goals. Furthermore, if there is no alternative to a subjective evaluation of goal achievement, the evaluation process ought to include the role of an independent Model Evaluator.

4.4.3.5. Implication of usability issues. Anyone wanting to use the framework should expect to tailor it. This can be regarded as a benefit from the viewpoint of the generality of the framework, or a disadvantage since the framework cannot be used 'off-the-shelf'. A limitation of this trial is that it has not been possible to assess issues related to Pragmatic Quality. This is important because Pragmatic Quality represents an area of disagreement between our framework (Fig. 1) and Lindland et al.'s model [6]. A major limitation of the framework is that it does not address goal achievement. In order to be of value, the framework must be extended to include procedures for evaluating the extent to which goals have been met.
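The cross-reference tables mentioned in Sections 4.2.1 and 4.4.3.2 lend themselves to a simple mechanical completeness check. The sketch below is our own illustration with invented model elements and literature sources; the authors' actual tables are in [9].

```python
# Illustrative sketch: a cross-reference table from model elements to the
# domain sources (literature findings) that justify them, plus a simple
# two-way completeness check. All names are invented.
model_elements = {"effort estimate", "risk multiplier", "profit margin", "contingency"}
literature = {"L1: cost estimation surveys", "L2: risk management texts",
              "L3: insurance industry pricing"}

cross_ref = {
    "effort estimate": {"L1: cost estimation surveys"},
    "risk multiplier": {"L2: risk management texts"},
    "profit margin":   {"L1: cost estimation surveys"},
    # 'contingency' has no supporting source: the kind of gap that revealed
    # the missing insurance-industry knowledge in the real evaluation.
}

unsupported = model_elements - cross_ref.keys()
unused = literature - set().union(*cross_ref.values())

print("Model elements with no traceability to the domain:", unsupported)
print("Domain sources not reflected in the model:", unused)
```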

4.5. Framework re-formulation

As a result of the comments in Section 4.4, we revised the framework as shown in Table 2. In this formulation, we have:

1. Separated the means to achieve the quality goals from the means to assess achievement.
2. Identified a method of assessing achievement for each quality goal and made assessment of goal achievement the responsibility of an independent model evaluator.
3. Explicitly noted the need to check the model against its implementation.
4. Explicitly noted cross-reference tables as a means to achieve traceability to domain.
5. Added a definition of Value that relates to a generic model.

Table 2
Adapted framework for evaluating a model
(Achieving each goal is the responsibility of the tool specifiers and developers; assessing achievement of each goal is the responsibility of the evaluator.)

Syntactic quality
  Goal: Syntactic correctness
  Definition: All the statements in the underlying conceptual model and its implementation are syntactically correct
  Model properties: Defined syntax
  Means to achieve goal: Manual inspection of the model specification against the model implementation
  Means to assess achievement: Review of each identified deficiency and each proposed correction

Semantic quality
  Goal: Feasible completeness; feasible validity
  Definition: The model is feasibly complete and feasibly valid
  Model properties: Traceability to domain
  Means to achieve goal: Construction of cross-reference tables linking model concepts to the model domain; consistency checking
  Means to assess achievement: Empirical study of domain experts' views (e.g. experiments, interviews, self-administered questionnaire)

Pragmatic quality (feasible comprehension)
  Definition: The model is adequately understood by its target audience
  Model properties: Not applicable
  Means to achieve goal: Presentation of the underlying conceptual model, including visualization, explanation, filtering
  Means to assess achievement: Empirical study of understanding achieved by the audience group (interviews or self-administered questionnaire)

Pragmatic quality (feasible understandability)
  Definition: As far as feasibly possible, the model is presented in an understandable format
  Model properties: Structuredness; expressive economy
  Means to achieve goal: Documentation guidelines for model presentation
  Means to assess achievement: Empirical study of understanding achieved by the audience group (interviews or self-administered questionnaire)

Test quality
  Goal: Feasible test coverage
  Definition: The model has been adequately tested in terms of feasible test coverage
  Model properties: Executability
  Means to achieve goal: Test cases aimed at assessing the sensitivity and stability of the model
  Means to assess achievement: Evaluation of test cases against predefined test coverage criteria

Value (practical utility; generic model)
  Definition: The value of a generic model is the extent to which it provides non-trivial insights into the phenomenon being modelled
  Model properties: Executability
  Means to achieve goal: Scenarios (fidelity tests) related to the phenomenon
  Means to assess achievement: Assessment of academic and industrial interest in the model insights

Value (practical utility; specialized model)
  Definition: The value of a specialized model is the extent to which it improves the processes used in a user organization
  Model properties: Not applicable
  Means to achieve goal: Appropriate user interface design, user manuals and training
  Means to assess achievement: Empirical study of users' view of the tool (experiment, interviews or self-administered questionnaire)

However, it is important to remember that the framework is itself generic and would probably need to be tailored to the specific circumstances of any particular evaluation. In particular, the specific evaluation object is likely to have an influence on the framework.

5. Trial 2

5.1. Further framework adaptation

The second trial compared the evaluation process used to evaluate a Framework V project tool with the evaluation process suggested by the evaluation framework. Before making the comparison, it was necessary to adapt the evaluation framework to the new type of evaluation object, i.e. a novel software engineering tool prototype. The adapted framework is shown in Table 3. The changes to the framework are:

† A column labelled 'Object' has been included. This indicates whether the underlying model (i.e. the solution to the problem addressed by the tool) is the main object of evaluation, or the tool itself, or both.
† The definitions have been amended to distinguish between the model and the tool.

Overall there are few changes to the framework. Thus, initially it seemed that the framework was quite general and we were encouraged to believe that applying the framework to a different evaluation object would be a valid method of evaluating the framework.

Table 3
Framework adapted for prototype tools
(Achieving each goal is the responsibility of the tool specifiers and developers; assessing achievement of each goal is the responsibility of the evaluator.)

Syntactic quality
  Object: Model and tool
  Goal: Syntactic correctness
  Definition: All the statements in the underlying conceptual model and its implementation are syntactically correct
  Model properties: Defined syntax
  Means to achieve goal: Manual inspection of the tool specification against the underlying conceptual model
  Means to assess achievement: Review of each identified deficiency and each proposed correction

Semantic quality
  Object: Model
  Goal: Feasible completeness; feasible validity
  Definition: The model is feasibly complete and valid
  Model properties: Traceability to domain
  Means to achieve goal: Cross-reference tables linking model concepts to the model domain; consistency checking
  Means to assess achievement: Empirical study of domain experts' views (e.g. experiments, interviews, etc.)

Pragmatic quality (feasible comprehension)
  Object: Model
  Definition: As far as possible, the solution is understood by its target audience
  Model properties: Not applicable
  Means to achieve goal: Presentation of the underlying conceptual model, including visualization, explanation, filtering
  Means to assess achievement: Empirical study of understanding achieved by the audience group (interviews or self-administered questionnaire)

Pragmatic quality (feasible understandability)
  Object: Model
  Definition: As far as possible, the solution is presented in a format that is understandable to its target audience
  Model properties: Structuredness; expressive economy
  Means to achieve goal: Documentation guidelines describing the underlying conceptual model using standards for format and content
  Means to assess achievement: Empirical study of understanding achieved by the audience group (interviews or self-administered questionnaire)

Test quality
  Object: Tool
  Goal: Feasible test coverage
  Definition: The tool has been adequately tested in terms of feasible test coverage
  Model properties: Executability
  Means to achieve goal: Test cases aimed at demonstrating all the tool requirements and showing how the tool supports each type of user
  Means to assess achievement: Evaluation of test cases against test coverage criteria

Value
  Object: Tool
  Goal: Practical utility
  Definition: The extent to which the tool improves the efficiency and effectiveness of the user organisation
  Model properties: Not applicable
  Means to achieve goal: Appropriate user interface design, user manuals and training
  Means to assess achievement: Empirical study of users' view of the tool (experiment, interviews, etc.)

5.2. Additional research questions

We evaluated our framework by comparing it with the evaluation framework developed and used by the CLARiFi project. In addition to assessing its usefulness, we wanted to investigate:

† Whether the CLARiFi evaluation included activities that were not recommended by the framework (i.e. had we missed any important evaluation activities?).
† Whether the CLARiFi evaluation failed to undertake activities recommended by the framework and, if so, whether there was any evidence that the evaluation exercise suffered as a result (i.e. are all the activities recommended by the framework useful?).

† Whether the CLARiFi evaluation activities were very similar to those recommended in the framework (i.e. is the framework relevant but perhaps obvious?).

5.3. The scope of the CLARiFi evaluation

The CLARiFi evaluation was planned in parallel with the CLARiFi research activities [1,2]. The stated objectives of the CLARiFi evaluation were:

1. To evaluate the ability of the final demonstrator to satisfy the whole set of user needs.
2. To perform a holistic test (end-to-end) by active consortium participants, others from participating consortium partners, and (chosen) external Suppliers and Integrators.

Table 4
CLARiFi evaluation activities

Supplier task demonstration (Objective 1)
  Description: Inputting supplier and component information into the database via the supplier interface; storing this information accurately; retrieving the data
  Mapping to evaluation framework: Value (Pragmatic Quality)

Integrator task demonstration (Objective 1)
  Description: Retrieving component and supplier information in such a way that it is meaningful and useful to the user within the framework of designing a system, based on compatible components
  Mapping to evaluation framework: Value (Pragmatic Quality)

Certifier task demonstration (Objective 1)
  Description: Incorporating the ability to certify some or all characteristics of a component within the database and to provide the necessary assurance that this has been provided by an authorised certifier
  Mapping to evaluation framework: Value (Pragmatic Quality)

Concept verification (Objective 2)
  Description: The concepts from the original project proposal are traced through the appropriate deliverables to the demonstrator
  Mapping to evaluation framework: Syntactic and semantic quality

Independent verification (Objective 2)
  Description: Independent opinions were solicited about certain, possibly contentious, characteristics incorporated into the tool; opinions on the tool were solicited from the same independent people, after they attended a detailed presentation of the CLARiFi project and a demonstration of the tool
  Mapping to evaluation framework: Semantic quality; Value

Comparison with existing tool (Additional)
  Description: A feature analysis technique was used to compare the CLARiFi tool with the Componentsource tool
  Mapping to evaluation framework: Value

These objectives were refined into the five evaluation activities shown in Table 4, together with an additional task required by the project reviewers appointed by the European Commission. Table 4 presents our mapping of the CLARiFi evaluation activities to the evaluation framework concepts.

It was not a simple matter to construct this mapping because the CLARiFi plan did not describe activities in terms of the evaluation framework concepts. The major problem with the mapping was to assess the relationship between the CLARiFi demonstration tasks and the evaluation framework concepts. The CLARiFi plan identified four demonstration activities where subjects with experience of four different aspects of the tool were required to perform a set of specified tasks using the tool. Initially, the demonstration tasks looked like test scenarios with defined test plans, which would mean they related to Test Quality; however, a more detailed examination of the CLARiFi reports suggested another interpretation. In each demonstration, subjects with experience of a particular aspect of component-based development were required to use the tool and evaluate several aspects of the activity on seven-point ordinal scales. The aspects and their mapping to the framework are shown in Table 5 (note that Integrators had an additional criterion, Accuracy, which was related to Value).

Thus, it became clear that the activity was best viewed as a small-scale empirical study aimed primarily at assessing Value. The fact that the subjects were not the tool developers ruled out classifying the activity as related to Test Quality. The demonstration activity also had the possibility of addressing Pragmatic Quality, since the underlying theory of the tool could have been presented to potential users. This was consistent with the original plan that the tool specifiers introduce the tool to the subjects. In practice, the tool builders undertook the training of the subjects and concentrated on the use of the tool, not the theory underlying the tool. In the end, we decided to categorise the demonstration activities as primarily related to the Value concept, with Pragmatic Quality as a secondary aspect that was intended but not fully addressed in practice.

Table 4 shows that the CLARiFi evaluation plan addressed all issues except Test Quality, but performed them in an order that was different from that suggested by the evaluation process model. In order to assess whether the differences between the CLARiFi plan and the evaluation framework and process had any detrimental effect on the CLARiFi evaluation exercise, reports of the exercise were reviewed in more detail (see Section 5.4).

Table 5
Supplier task evaluation criteria

Aspect | Mapping to framework
Richness of information in terms of completeness (i.e. all relevant information could be supplied) and conciseness (i.e. no extraneous information was required) | Semantic quality
Completeness in terms of the extent to which the tool guides and supports the user in the performance of his/her tasks | Value
Usability in terms of ease of use, learning time, skill required, confidence of user | Value
Breadth of applicability | Value
Ability to be implemented | Value

5.4. Demonstration results

5.4.1. Supplier task test
The subjects' comments (i.e. defect and incident reports plus other issues) are shown in Table 6, mapped to the quality aspect to which they refer. Ten comments referred to pragmatic quality (i.e. understandability of the underlying conceptual model), three referred to semantic quality and six referred to value (it should be noted that some of the comments relating to missing functionality were due to problems with the tool interface rather than missing functions).

Table 6
Supplier subject comments

Comment | Mapping
The architectural element (is it an object, process, etc.) is not explicit enough | Pragmatic quality
Technical things are well catered for, I also think that a lot can be lost in free text unless all text is entered | Value
There was no section in which to put compatible development environments or did I miss it | Semantic quality
I believe all the info is necessary, but better ways to extract it should be possible, e.g. where information is buried in a complex structure as in the quality attributes it should be possible to ask more directly for information and have it placed correctly rather than navigating the structure | Value
The functional abilities need an example visible and need space to allow a freer expression | Pragmatic quality
Needs an example of functional abilities visible | Pragmatic quality
Maybe overkill. Good at getting supplier to think about all relevant details. Could prove tedious for complex components | Value
It should have worked example or should implement pre-defined blocks of functional descriptions based on specific jobs done by components | Pragmatic quality
Need to improve the functional, better visibility of what definitions mean is needed | Pragmatic quality
Not all schemes have strengths and weaknesses | Semantic quality
The characterisation is fine and thorough. It is just a little difficult to grasp initially | Pragmatic quality
Working through an example in detail would aid the understanding | Pragmatic quality
Adequate for me, but there is a trade off between giving people enough info and boring them. Tough to judge | Value
Examples to be worked through under guidance could be useful to solidify terminology and syntax | Pragmatic quality
It needs to structure/guide user so that they behave in the same way | Value
Will suppliers put in the effort to fill in this level of detail without guaranteed benefit? | Semantic quality
Implementation could be enhanced by the use of more intelligent interfaces | Value
Use of a paper pro-forma before tool use is useful | Pragmatic quality
Perhaps the pro-forma, although useful, is a little too complex | Pragmatic quality

5.4.2. Integrator task test
The comments from the four integrator subjects and their mapping to the evaluation framework are shown in Table 7. Three comments referred to pragmatic quality, eight to semantic quality and two to value. One comment referred to the experimental process, not the CLARiFi tool.

Table 7
Integrator's comments

Comment | Mapping
I would like to check the meaning of the selection criteria. I think an integrator should be able to look words up in a common dictionary | Pragmatic quality
It is rather difficult to find out the words (i.e. actions and objects) that can identify the requested functionality. It would be better to start from the general concept, e.g. report, and then move on through refinements | Pragmatic quality
Perhaps there are too many criteria and most of them are too general | Semantic quality
I would like to have an estimate of the time needed to execute an inquiry, or at least an overview of the number of components retrieved | Value
The selection approach is a little bit confusing, above all in the choice between the 'system search' and the 'component search' | Semantic quality
Due to the lack of time I am unable to say whether this process could be better supported | No mapping
In the parallel coordinates view it is possible to use sliders to restrict the range of property values. When using sliders, it is not possible to view the range extreme values | Semantic quality
The queries executed by the integrator in the component search tab can be saved in a context. It should be possible to save in a context the FAB's used in the system search tab, when the integrator searches for components by functions | Semantic quality
It is difficult to identify the FAB's to use for the search on the basis of requirements | Semantic quality
The search by FAB object is missing. Thus the integrator has to browse all the FABs tree to find the FAB he/she is interested in | Semantic quality
An estimate of the duration of the query would be useful | Value
There are some FAB's whose actions or objects are synonyms of other FAB's actions or objects, therefore it is not clear which FAB to use to perform the system search | Semantic quality
There is a strong need for a User Manual | Pragmatic quality
It is rather difficult to make this tool work with all the functionality needed by a company | Semantic quality
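The per-group counts quoted in Sections 5.4.1 and 5.4.2 follow mechanically from the comment mappings in Tables 6 and 7. The sketch below is our own illustration of that bookkeeping, using the integrator mappings from Table 7; it is not part of the CLARiFi evaluation itself.

```python
# Illustrative tally of the comment mappings listed in Table 7 (Section 5.4.2).
from collections import Counter

integrator_mappings = [
    "Pragmatic quality", "Pragmatic quality", "Semantic quality", "Value",
    "Semantic quality", "No mapping", "Semantic quality", "Semantic quality",
    "Semantic quality", "Semantic quality", "Value", "Semantic quality",
    "Pragmatic quality", "Semantic quality",
]

print(Counter(integrator_mappings))
# Counter({'Semantic quality': 8, 'Pragmatic quality': 3, 'Value': 2, 'No mapping': 1})
```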

5.4.3. Certifier task test
The comments from the single certifier subject and their mapping are shown in Table 8. One comment referred to semantic quality, one to pragmatic quality and two to value. One comment referred to the questionnaire rating, not the CLARiFi method or tool, and another referred to the evaluation process.

Table 8
Certifier's comments

Comment | Mapping
The confidence rating refers to the setting of the certification mark, not to the real certification process, which is outside Clarifi | No mapping
It is well done (referring to whether it is possible to characterize the components in a better way for certification) | Semantic quality
No presentations were given | No mapping
The tool would be more suitable for use by a wide community of internet users, if the user could run it via a simple web browser. We had a lot of difficulties installing the Clarifi software and the correct versions of the additional required software (Java JDK, Oracle). A typical user would probably not have enough motivation to go through a complicated, error-prone procedure | Value
The tool is quite tough to use and I had to find out about the functionality of the tool | Pragmatic quality
The tool itself worked slowly, but this could be the machine we used | Value

5.4.4. Overall demonstration results
The demonstration activities were directed at each major tool user group with the aim of assessing the Value of the tool. The subjects made a total of 39 comments. Three related to the demonstration process itself. Of the remaining comments, we classified 12 as relating to Semantic Quality, 14 as relating to Pragmatic Quality and 10 to Value. These results provide some support for the evaluation framework since they imply that:

1. Pragmatic Quality is an important aspect of quality and, if it is not explicitly addressed, subjects will experience difficulties understanding the model/tool concepts.
2. Semantic Quality should be addressed prior to Value evaluation and, if it is not, subjects will experience difficulty evaluating the tool.

One result that contradicts the framework is that the demonstration activities were aimed at addressing Value, and the subjects were nevertheless able to assess the value of the tool although it was a prototype and was not being assessed in a 'real-life' situation. Many of the comments about Value involved the subjects identifying what people with their role in component-based software engineering would want or expect to see in a support tool. Thus, the subjects were able to assess potential Value, whereas our framework has only considered the issue of achieved Value.

The CLARiFi project developed a good procedure for assessing potential Value. They explicitly addressed all tool user groups and ensured that the experimental subjects in each group performed a number of representative tasks. However, if tool value is to be adequately assessed, the subjects must be representative of the potential tool user population (although it might be impossible in practice to obtain a random sample of potential users) and there should be sufficient subjects to ensure that the results are not dominated by the possibly aberrant viewpoint of a single person. In fact, the CLARiFi evaluation involved rather a small number of subjects. (Note that the initial plans called for more subjects, but delays with the toolset meant that some subjects dropped out and could not be replaced.)

Finally, it is clear that the CLARiFi evaluation did not consider the Test Quality of the tool. Comments in the evaluation report [7] indicate that the poor quality of the tool had a detrimental effect on the demonstration activities. For example, the tool was difficult to install and certain functions could not be used. This can be interpreted as supporting the need for assessing Test Quality prior to assessing potential Value. Alternatively, it may be that, for a conventional software application such as a software tool, Test Quality is a matter of standard quality assurance, internal to the development process, and should not be considered part of the evaluation process. This was the viewpoint taken by the CLARiFi evaluation team.

5.5. Other evaluation activities

5.5.1. Concept verification
The researchers concerned with specifying the tool functions were asked to assess whether their requirements had been met. The tool developers were asked to assess the extent to which specific requirements were met in the tool. This is basically concerned with the link between the conceptual model and the tool. It concerns the semantic quality of the tool but does not address the semantic quality of the underlying conceptual model. Checking that the tool has implemented all its requirements is an essential part of tool development. It is a form of testing activity that should take place prior to any evaluation of Value.

5.5.2. Independent verification
Members of the European Component Users Association attended a 3-day workshop. They were introduced to the CLARiFi tool and concepts.

Then they were asked to assess whether specific functions should or should not be included in the tool. They identified five facilities that should have been included in the tool, but were not, and four facilities that should be removed from the tool. This is primarily an assessment of Semantic Quality, but it affects Value as well. This activity is another method of assessing potential Value. It is less focused than the demonstrator studies but allowed access to a larger number of potential tool users.

5.5.3. Tool comparison
At the explicit request of the EEC project manager, the CLARiFi tool was compared with Componentsource, the current market-leading tool for component-based software engineering. The comparison was a feature analysis on which CLARiFi scored well. However, the features were based on the CLARiFi conceptual model, so there is a danger that the results are biased. ISO guidelines for CASE tool evaluation and selection are based on feature analysis, which gives the impression that the method is generally applicable for comparing software tools [5]. However, we believe feature analysis is inappropriate for:

† Comparisons of prototypes and commercial tools, where usability and performance comparisons will usually favour commercial tools.
† Comparisons of tools with different underlying conceptual models, where it is too easy to identify features related to the underlying model of one tool against which another tool will appear to perform poorly.

5.6. Lessons learnt from the trial

5.6.1. Trial 2 research questions

5.6.1.1. Have we missed any important evaluation activities? The CLARiFi evaluation was able to assess potential Value by means of evaluation activities where potential users undertook controlled trials of the tool. The framework is missing the concept of early assessment of Value (or potential Value). This suggests an area where the framework needs to be extended.

5.6.1.2. Are all the activities recommended by the framework useful? The Value-based evaluation activities were plagued by problems with tool quality and issues connected with Pragmatic and Semantic Quality. This implies that omitting some of the framework activities, or doing them late in the evaluation, reduces the effectiveness of an evaluation exercise.

5.6.1.3. Is the framework relevant but perhaps obvious? The CLARiFi evaluation did not consider Test Quality, nor did it separate Pragmatic Quality from Value. Furthermore, Value was addressed prior to attempts to assess Semantic Quality.

This seems to have caused problems in the evaluation. In particular, Value-based assessments were undertaken before the tool had addressed issues relating to the underlying tool concepts and tool quality. These results imply that the framework is relevant and not obvious.

5.6.2. Usefulness of the framework
In this trial, the framework was not used to plan the evaluation. The actual evaluation was compared with the framework to identify similarities and differences that might explain problems that occurred during the evaluation exercise. For this reason, the usefulness of the framework cannot be assessed directly; it can only be inferred from a discussion of the evaluation problems and suggestions as to how the framework might have helped avoid the problems.

5.6.2.1. Value of the framework. The CLARiFi plan was quite different from the evaluation framework and process:

† It used the concept of demonstrators and verification activities to describe the planned tasks.
† It did not consider the order in which activities needed to take place.

Comparing the CLARiFi evaluation plan with our evaluation framework suggests that the Value-oriented demonstrations took place without prior Semantic and Pragmatic Quality evaluation and without appropriate Test Quality. The CLARiFi report notes that the quality of the tools was a problem. The defect reports from the demonstrators suggest that issues associated with Semantic and Pragmatic Quality were making it difficult for the demonstration subjects to assess Value.

Implication of value issues. We take this to imply that failure to use the evaluation framework might lead to problems during the evaluation. This supports the proposition that the framework has the potential to improve an evaluation exercise both by identifying a well-defined set of activities that is complete (in the sense of covering a predetermined set of goals), and by indicating how to order those activities with respect to the development process.

5.6.2.2. Validity of the framework. Missing elements. The evaluation framework assumed that Value could not be addressed unless the tool was deployed. However, the CLARiFi results suggest that it is possible to distinguish between potential Value and achieved Value. Demonstrator subjects, who were independent of the tool builders, were clearly able to comment on potential Value.

Possible unnecessary element. The CLARiFi evaluation did not consider Test Quality. This may be an area where the framework is inappropriate for software tools. It can be argued that testing should be a standard part of software development and that test quality should be assessed as a standard quality assurance activity.

A counter argument, with which we agree, is that prototype tools may not have been built and tested to normal commercial development standards. Therefore, Test Quality may need to be assessed in order to ensure that any evaluation of potential Value is not prejudiced.

Implication of validity issues. The framework needs to be corrected to allow for pre-deployment assessment of potential Value. In addition, Test Quality remains a potential validity problem. We suggest it should be included, but there are counter arguments. With respect to the trial propositions, apart from potential Value, we did not detect the need for any evaluation concept incompatible with the evaluation framework. Furthermore, the current trial emphasised again the importance of an evaluation process to support the evaluation framework.

5.6.2.3. Framework usability. Tailoring the evaluation. This trial confirms the need to tailor the evaluation framework to the specific evaluation object. However, it was relatively simple to make the desired changes. In the case of software tools, we believe that tailoring must consider the source of novelty in the tool. In this case, the novelty in the prototype CLARiFi tool resided primarily in a new method of classifying software components (i.e. the conceptual model underlying the classification system), not in the fact that the method was capable of being supported by a software tool. There are cases where the novelty lies in the ability to automate something in a particular way (e.g. accessing the Web via a mobile phone). In other cases, both the problem solution and its implementation may contribute equally to the novelty in a tool (e.g. the first generation of spreadsheets). In each of these cases, framework adaptation would be somewhat different. For prototype tools, it is also important that the framework adaptation is sensitive to the fact that the tool is not a commercial offering.

Selection of goal achievement techniques. The evaluation framework was not detailed enough to indicate that a feature analysis of tools with different underlying models was likely to be biased. This suggests a basic limitation of the framework. The framework suggests specific techniques to achieve quality goals and is not intended to prevent the use of other techniques. It is the responsibility of the evaluators to assess the suitability of additional quality-achievement or quality-evaluation techniques.

Implication of usability issues. Evaluators wanting to use the evaluation framework will (almost certainly) need to tailor the evaluation framework to their specific evaluation problem. In particular, they will need to take care with their choice of individual evaluation techniques, since the framework offers little detailed guidance.
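Section 5.5.3 and the usability discussion above argue that a feature analysis whose feature list is derived from one tool's conceptual model is biased by construction. A minimal sketch of that effect is shown below; the feature names and scores are invented for illustration and are not taken from the actual CLARiFi versus Componentsource comparison.

```python
# Illustrative only: why a feature list derived from tool A's own conceptual
# model tends to favour tool A. Features and scores are invented.
features_from_tool_a_model = {
    # feature: (score for tool A, score for tool B), on a 0-5 scale
    "classification by functional abilities": (5, 1),
    "certification marks for components":     (4, 0),
    "supplier/integrator/certifier roles":    (5, 2),
    "mature search interface":                (2, 5),  # the one feature not tied to A's model
}

score_a = sum(a for a, _ in features_from_tool_a_model.values())
score_b = sum(b for _, b in features_from_tool_a_model.values())
print(f"Tool A: {score_a}, Tool B: {score_b}")
# Tool A 'wins' largely because the feature list mirrors its own conceptual model.
```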

6. Limitations of the trials

The main limitation of these trials is that we ourselves have undertaken the evaluation of our own evaluation framework. This can be considered an extreme version of participant-observer based research and casts some doubt on the validity of our findings. It is always difficult for researchers to be objective about their own work, and the impact that experimenter expectations can have on the results of empirical studies is well documented [10]. A particular problem is the tendency for experimenters to ignore evidence that contradicts their preconceptions and to resolve any ambiguous evidence in favour of their preconceptions. To address this issue in the first trial, we have described the results of our evaluation in order to explain the conclusions we have drawn about the evaluation framework. In the second trial, we have presented the results of the CLARiFi evaluation as they were reported in the CLARiFi project report [7], allowing the reader the opportunity to assess whether or not our interpretation of the evidence (particularly the classification of the demonstrator subjects' comments) was correct.

Another limitation associated with the second trial is that it depends on the quality of the CLARiFi evaluation. If the evaluation was poorly planned and executed, it may not have been of good enough quality to act as a fair test of the evaluation framework. That is, if the CLARiFi evaluation was very poor, anything else might look good by comparison. The CLARiFi evaluation exercise did encounter problems:

† CLARiFi was a multi-company, multi-country project. The people responsible for the evaluation worked in different countries from the people responsible for tool development. For evaluation exercises needing tool support, the tool builders ran the evaluation exercises rather than the evaluation team.
† Tool development was delayed, causing knock-on effects to planned evaluation exercises. This resulted in subjects being unavailable to take part in experiments and the evaluation team being unable to attend the evaluation.
† The poor quality of the tool interface impacted the evaluation exercises. Tool problems prevented subjects from performing some of the evaluation tasks.

However, CLARiFi devoted considerable effort to the initial planning of their evaluation exercise, and all planned evaluation actions were performed. Furthermore, the evaluation report was reviewed and agreed by the CLARiFi reviewers and the EEC project officer. Thus, we believe that the evaluation exercise was reasonable for a pre-commercial, multi-company research project.

7. Conclusions

We have presented two trials aimed at evaluating an evaluation framework. In both cases the evaluation framework appeared to have some benefit.

In the first trial, it provided a means of defining the scope of the evaluation as a whole and identifying the individual evaluation activities. The second trial did not use the evaluation framework, but a review of the evaluation results confirmed the importance of Semantic and Pragmatic Quality evaluation and the need to undertake Semantic and Pragmatic Quality evaluation prior to evaluating tool Value. In both trials, it was clear that an evaluation process is essential for the evaluation framework to be effective.

In both trials we found some aspects of the framework were invalid. For example, initially we believed that it was not possible to evaluate the Value of the bidding model or the prototype tool because they were not deployed. However, the trials indicate that:

† The value of a generic simulation model is related to the extent to which it reveals theoretically interesting insights into the domain.
† The potential value of a prototype tool may be assessed by appropriate empirical studies. Such studies should address all user groups and should attempt to obtain a representative sample of subjects.

In each trial we identified the need to tailor the framework to the evaluation object. However, our experience of tailoring the framework to a prototype tool for the second trial was that the tailoring exercise did not change a great deal of the framework. Thus, we believe the framework is quite general and may have wider applicability than we first imagined.

We hope that researchers developing expert-opinion-based (rather than data-based) simulation models and software engineering tools will find the framework useful and will contribute to its further evaluation and evolution.

Acknowledgements

This paper is based on research results from the EPSRC project 'Managing Risks across a Portfolio of Projects' (GR/M33709) and from the ESPRIT CLARiFi project (IST-1999-11631).

References

[1] K. Gleen, S. Moor, A. Di Lorenzo, C. Pandolfo, A. Pasquini, Evaluation plan, CLARiFi Project Deliverable D21, December 2001.
[2] K. Gleen, S. Moor, C. Pandolfo, A. Pasquini, Evaluation prototype 1, CLARiFi Project Deliverable D25, March 2001.
[3] B. Kitchenham, L. Pickard, S.G. Linkman, P. Jones, A framework for evaluating a software bidding model, Information and Software Technology, accepted for publication.
[4] B. Kitchenham, L. Pickard, S.G. Linkman, P. Jones, Modelling software bidding risks, IEEE Transactions on Software Engineering 29 (6) (2003) 542–554.
[5] International Standards Organisation, Information Technology - Guidelines for the Evaluation and Selection of CASE Tools, ISO/IEC 14102, 1995.
[6] O.V. Lindland, G. Sindre, A. Solvberg, Understanding quality in conceptual modeling, IEEE Software, March 1994, pp. 42–49.
[7] S. Linkman, Experimental Evaluation E5, CLARiFi Project Deliverable D36, October 2002.
[8] L. Pickard, B. Kitchenham, S.G. Linkman, P. Jones, Software Bidding Model Implementation, Keele Technical Report TR/SE-0202, 2002 (www.keele.ac.uk/depts/cs/se/e&m/tr0202.pdf).
[9] L. Pickard, B. Kitchenham, S.G. Linkman, P. Jones, Evaluation of the Software Bidding Model, Keele Technical Report TR/SE-0204, Keele University, Dept. of Computer Science (www.keele.ac.uk/depts/cs/se/e&m/tr0204.pdf).
[10] R.L. Rosnow, R. Rosenthal, People Studying People: Artifacts and Ethics in Behavioral Research, W.H. Freeman, New York, 1997.
[11] R.K. Yin, Case Study Research: Design and Methods, third ed., Applied Social Science Research Methods Series, vol. 5, Sage, Beverly Hills, CA, 2003.