Information and Software Technology 53 (2011) 682–691
A controlled experiment in assessing and estimating software maintenance tasks

Vu Nguyen a,*, Barry Boehm a, Phongphan Danphitsanuphan b

a Computer Science Department, University of Southern California, Los Angeles, USA
b Computer Science Department, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
Article info

Article history: Available online 20 November 2010

Keywords: Software maintenance; Software estimation; Maintenance experiment; COCOMO; Maintenance size
Abstract

Context: Software maintenance is an important software engineering activity that has been reported to account for the majority of the total cost of software. Thus, understanding the factors that influence the cost of software maintenance tasks helps maintainers make informed decisions about their work.

Objective: This paper describes a controlled experiment of student programmers performing maintenance tasks on a C++ program. The objective of the study is to assess the maintenance size, effort, and effort distributions of three different maintenance types and to describe estimation models that predict the programmer's effort spent on maintenance tasks.

Method: Twenty-three graduate students and one senior majoring in computer science participated in the experiment. Each student was asked to perform the maintenance tasks required for one of the three task groups. The impact of different LOC metrics on maintenance effort was also evaluated by fitting the data collected into four estimation models.

Results: The results indicate that corrective maintenance is much less productive than enhancive and reductive maintenance, and that program comprehension activities require as much as 50% of the total effort in corrective maintenance. Moreover, the best software effort model can estimate the time of 79% of the programmers with an error of 30% or less.

Conclusion: Our study suggests that the LOC added, modified, and deleted metrics are good predictors for estimating the cost of software maintenance. Effort estimation models for maintenance work may use the LOC added, modified, and deleted metrics as independent parameters instead of the simple sum of the three. Another implication is that reducing the business rules of the software requires a sizable proportion of the software maintenance effort. Finally, the differences in effort distribution among the maintenance types suggest that assigning maintenance tasks properly is important to effectively and efficiently utilize human resources.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Software maintenance is crucial to ensuring the useful lifetime of software systems. According to previous studies [1,4,29], the majority of software-related work in organizations is devoted to maintaining existing software systems rather than building new ones. Despite advances in programming languages and software tools that have changed the nature of software maintenance, programmers still spend a significant amount of effort working with source code directly and manually. Thus, it remains an important challenge for the software engineering community to assess maintenance cost factors and to develop techniques that allow programmers to accurately estimate their maintenance work.

A typical approach to building estimation models is to determine which factors affect the effort, and by how much, at different
[email protected] (V. Nguyen),
[email protected] (B. Boehm),
[email protected] (P. Danphitsanuphan). 0950-5849/$ - see front matter Ó 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2010.11.003
levels, and then use these factors as input parameters in the models. For software maintenance, the modeling process is even more challenging. The maintenance effort is affected by a large number of factors such as the size and type of maintenance work, personnel capabilities, the level of the programmer's familiarity with the system being maintained, the processes and standards in use, complexity, technologies, and the quality of the existing source code and its supporting documentation [5,18].

There has been tremendous effort in the software engineering community to study cost-driving factors and the amount of impact they have on maintenance effort [6,20]. A number of models have been proposed and applied in practice, such as [2,5,12]. Although maintenance size measured in source lines of code (LOC) is the most widely used factor in these models, there is a lack of agreement on what to include in the LOC metric. While some models determine the metric by summing the number of LOC added, modified, and deleted [2,21], others such as [5] use only the LOC added and modified. Obviously, the latter assumes that the deleted LOC is not significantly correlated with maintenance effort. This
inconsistency in using the size measure results in discrepancies in strategies proposed to improve software productivity and in problems when comparing and converting estimates among estimation models.

In this paper, we describe a controlled experiment of student programmers performing maintenance tasks on a small C++ program. The purpose of the study was to assess the size, effort, and labor distribution implications of three different maintenance types and to describe estimation models that predict the programmer's effort on maintenance tasks. We focus the study on the enhancive, corrective, and reductive maintenance types according to the maintenance typology proposed by Chapin et al. [9]. We chose to study these maintenance types because they are the ones that change the business rules of the system by adding, modifying, and deleting the source code. They are typically the most common activities of software maintenance.

The results of our study suggest that corrective maintenance is less productive than enhancive and reductive maintenance. These results are largely consistent with the conclusions of previous studies [2,17]. The results further provide evidence about the effort distribution of maintenance tasks, in which program comprehension requires as much as 50% of the maintainer's total effort. In addition, our results on effort estimation models show that using the three separate LOC added, modified, and deleted metrics as independent variables in the model will likely result in higher estimation accuracies.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 provides a method for calculating the equivalent LOC of maintenance programs. The experiment design and results are discussed in Sections 4 and 5. Section 6 describes models to estimate programmers' effort on maintenance tasks. Section 7 discusses the results. Section 8 discusses various threats to the validity of the research results, and the conclusions are given in Section 9.
2. Related work

Many studies have been published that address different size- and effort-related issues of software maintenance and propose approaches to estimating the cost of software maintenance work.

To help better understand and assess software maintenance work, Swanson [31] proposes a typology that classifies software maintenance into adaptive, corrective, and perfective maintenance types. This typology has become popular among researchers, and the IEEE has adopted these types in its Standard for Software Maintenance [19], along with an additional preventive maintenance type. In their proposed ontology of software maintenance, Kitchenham et al. [22] define two maintenance activity types, corrections and enhancements. The former is equivalent to the corrective maintenance type, while the latter can be generally equated to the adaptive, perfective, and preventive maintenance types defined by Swanson and the IEEE. Chapin et al. [9] proposed a fine-grained classification of twelve types of software maintenance and evolution. These types are grouped into four clusters, support interface, documentation, software properties, and business rules, listed in the order of their impact on the software. The last cluster, which consists of the reductive, corrective, and enhancive types, includes all activities that alter the business rules of the software. Chapin et al.'s classification does not have a clear analogy with the types defined by Swanson. As an exhaustive typology, however, it includes not only Swanson's and the IEEE's maintenance types but also other maintenance-related activities such as training and consulting.

Empirical evidence on the distribution of effort among maintenance activities helps estimate maintenance effort more accurately through the use of appropriate parameters for each type of maintenance activity, and it helps better allocate maintenance resources.
It is also useful for determining effort estimates for maintenance activities that are performed by different maintenance providers.

Basili et al. [2] report an empirical study to characterize the effort distribution among maintenance activities and provide a model to estimate the effort of software releases. Among the findings, isolation activities were found to consume a higher proportion of effort in error correction than in enhancement changes, but a much smaller proportion of effort was spent on inspection, certification, and consulting in error correction. The other activities, which include analysis, design, and code/unit test, were found to take virtually the same proportions of effort in both types of maintenance. Mattsson [25] describes a study on data collected from four consecutive versions of a 6-year object-oriented application framework project. The study provides evolutionary trends in the relative effort distribution of four technical phases (analysis, design, implementation, and test) across the four versions of the project, showing that the proportion of implementation effort tends to decrease from the first version to the fourth, while the proportion of analysis effort follows the reverse trend. Similarly, Yang et al. [32] present results from an empirical study on the effort distribution of a series of nine projects delivering nine successive versions of a software product. All projects were maintenance projects except the first, which delivered the first version of the series. The coding activity was found to account for the largest proportion of effort (42.8%), while the requirements and design activities consumed only 10.2% and 14.5%, respectively. In addition to analyzing the correlation between maintenance size and productivity metrics and deriving effort estimation models for maintenance projects, De Lucia et al. [13] describe an empirical study on the effort distribution among five phases, namely inventory, analysis, design, implementation, and testing. The analyses were based on data obtained from a large Y2K project following the maintenance processes of a software organization. Their results show that the design phase is the most expensive, consuming about 38% of total effort, while the analysis and implementation phases account for small proportions, about 11% each. These results are somewhat contrary to the results reported by Yang et al. [32]. A more recent study by the same authors (De Lucia et al.) presents estimation models and the distribution of effort from a different project in the same organization [14].

A number of studies have been reported that address issues related to characterizing size metrics and building cost estimation models for software maintenance. In his COCOMO model for software cost estimation, Boehm presents an approach to estimating the annual effort required to maintain a software product. The approach uses a factor named Annual Change Traffic (ACT) to adjust the maintenance effort based on the effort estimated or actually spent for developing the software [7]. ACT specifies the estimated fraction of LOC that undergoes change during a typical year. It includes source code addition and modification but excludes deletion. If sufficient information is available, the annual maintenance effort can be further adjusted by a maintenance effort adjustment factor computed as the product of predetermined effort multipliers.
In a major extension, COCOMO II, the model introduces new formulas and additional parameters to compute the size of maintenance work and the size of reused and adapted modules [5]. The additional parameters take into account effects such as the complexity of the legacy code and the familiarity of programmers with the system. In a more recent extension for estimating maintenance cost, Nguyen proposes a set of formulas that unifies the two COCOMO II reuse and maintenance sizing methods [28]. The extension also takes into account the size of source code deletions and calibrates new rating scales of the cost drivers specific to software maintenance.

Basili et al. [2], together with characterizing the effort distribution of maintenance releases, describe a simple regression model
to estimate the effort needed to maintain and deliver a release. The model uses a single variable, LOC, measured as the sum of added, modified, and deleted LOC, including comments and blanks. The prediction accuracy was not reported, although the coefficient of determination was relatively high (R2 = 0.75), indicating that LOC is an important predictor of the maintenance effort. Jørgensen evaluated eleven different models for estimating the effort of individual maintenance tasks using regression, neural network, and pattern recognition approaches [21]. The models use the size of maintenance tasks, also measured as the sum of added, updated, and deleted LOC, as the main size input. The best model could generate effort estimates within 25% of the actuals 26% of the time, and the mean magnitude of relative error (MMRE) was 100%.

Several previous studies have proposed and evaluated models exclusively for estimating the effort required to implement corrective maintenance tasks. De Lucia et al. used multiple linear regression to build effort estimation models for corrective maintenance projects [12]. Three models were built using coarse-grained metrics, namely the number of tasks requiring source code modification (NA), the number of tasks requiring fixing of data misalignment (NB), the number of other tasks (NC), the total number of tasks, and the LOC of the system to be maintained. They evaluated the models on 144 observations, each corresponding to a 1-month period, collected from five corrective maintenance projects in the same software services company. The best model, which includes all metrics, achieved effort estimates within 25% of the actuals 49.31% of the time and an MMRE of 32.25%. When comparing with the non-linear model previously used by the same company, they suggested that a linear model using the same variables produces higher estimation accuracies. They also showed that taking into account the differences in types of corrective maintenance tasks can improve the performance of the estimation model.

3. Calculating equivalent LOC

In software maintenance, the programmer works on the source code of the existing system. The delivered maintained software includes source lines of code reused, modified, added, and deleted from the existing system. Moreover, the maintenance work is constrained by the existing architecture, design, implementation, and technologies used. These characteristics require extra time from maintainers to comprehend, test, and integrate the maintained pieces of code. Thus, an acceptable estimation model should take these characteristics of software maintenance into account through its estimation of either size or effort.

In this experiment, we adapt the COCOMO II reuse model to determine the equivalent LOC of the maintenance tasks. The model involves determining the amount of software to be adapted, the percentage of design modified (DM), the percentage of code modified (CM), the percentage of integration and testing (IM), the degree of Assessment and Assimilation (AA), the understandability of the existing software (SU), and the programmer's unfamiliarity with the software (UNFM). The last two parameters directly account for the programmer's effort to comprehend the existing system. The equivalent LOC formula is defined as
\[
\text{Equivalent LOC} = \mathrm{TRCF} \times \mathrm{AAM}, \qquad \mathrm{AAF} = \frac{S}{\mathrm{TRCF}}
\]
\[
\mathrm{AAM} =
\begin{cases}
\mathrm{AAF}\left(1 + \left[1 - (1 - \mathrm{AAF})^{2}\right] \times \mathrm{SU} \times \mathrm{UNFM}/100\right), & \text{for } \mathrm{AAF} \le 1 \\
\mathrm{AAF} + \mathrm{SU} \times \mathrm{UNFM}/100, & \text{for } \mathrm{AAF} > 1
\end{cases}
\tag{1}
\]
where TRCF is the total LOC of the task-relevant code fragments, i.e., the portion of the program that the maintainers have to understand to perform their maintenance tasks.
S is size in LOC. SU is software understandability, measured as a percentage ranging from 10% to 50%. UNFM is the level of the programmer's unfamiliarity with the program; its rating scale ranges from 0.0 to 1.0, or from "Completely familiar" to "Completely unfamiliar". Numeric values of SU and UNFM are given in Table 2 in Appendix A. LOC is the measure of logical source statements (i.e., logical LOC) according to COCOMO II's LOC definition checklist given in [5] and further detailed in [26]. LOC does not include comments and blanks, and, more importantly, it counts the number of source statements regardless of how many lines a statement spans.

TRCF is not a size measure of the whole program to be maintained. Instead, it reflects only the portions of the program's source code that are touched by the programmer. Ko et al. studied maintenance activities performed by students, finding that the programmers collected working sets of task-relevant code fragments, navigated dependencies, and edited the code within these fragments to complete the required tasks [23]. This as-needed strategy [24] does not require the maintainer to understand code segments that are not relevant to the task. Eq. (1) reflects this strategy by including only the task-relevant code fragments rather than the whole adapted program. The task-relevant code fragments are the functions and blocks of code that are affected by the changes.

4. Description of the experiment

4.1. Hypotheses

According to Boehm [7], the programmer's maintenance activities consist of understanding maintenance task requirements, code comprehension, code modification, and unit testing. Although only the last two activities deal with source code directly, empirical studies have shown high correlations between the overall maintenance effort and the total LOC added, modified, and deleted (e.g., [2,21]). We hypothesize that these activities have comparable distributions of programmer's effort regardless of what types of changes are made. Indeed, with the same cost factors [5] such as program complexity, project and personnel attributes, the productivity of enhancive tasks is expected to be no different from that of corrective and reductive maintenance. Thus, we have the following hypotheses:

Hypothesis 1. There is no difference in productivity among enhancive, corrective, and reductive maintenance.

Hypothesis 2. There is no difference in the division of effort across maintenance activities.

4.2. The participants and groups

We recruited one senior and 23 graduate computer-science students who were participating in our directed research projects. Participation in the experiment was voluntary, although we gave participants a small incentive by exempting them from the final assignment. By the time the experiment was carried out, all participants had been asked to compile and test the program as part of their directed research work. However, according to our pre-experiment survey, their level of unfamiliarity with the program code (UNFM) varied from "Completely unfamiliar" to "Completely familiar". We rated UNFM as "Completely unfamiliar" if the participant had not read the code and as "Completely familiar" if the participant had read and understood the source code and had modified some parts of the program prior to the experiment.
The performance of participants is affected by many factors such as programming skills, programming experience, and application knowledge [5,8]. We assessed the expected performance of participants through pre-experiment surveys and a review of participants' resumes. All participants claimed to have programming experience in C/C++ or Java or both, and 22 participants already had working experience in the software industry. On average, participants claimed to have 3.7 (±2) years of programming experience and 1.9 (±1.7) years of working experience in the software industry. We ranked participants by their expected performance based on their C/C++ programming experience, industry experience, and level of familiarity with the program. We then carefully assigned participants to each group in such a manner that performance capability was balanced among the groups as much as possible. As a result, we had seven participants in the enhancive group, eight in the reductive group, and nine in the corrective group. We further discuss in Section 8 potential threats related to the group assignments, which may raise validity concerns about the results.

4.3. Procedure and environment

Participants performed the maintenance tasks individually in two sessions in a software engineering lab. The two sessions had a total time limit of 7 h, and participants were allowed to schedule their time to complete these sessions. If participants did not complete all tasks in the first session, they continued the second session on the same or a different day. Prior to the first session, participants were asked to complete a pre-experiment questionnaire on their understanding of the program and were then told how the experiment would be performed. Participants were given the original source code, a list of maintenance activities, and a timesheet form. Participants were required to record time on paper for every activity performed to complete the maintenance tasks. The time information includes start clock time, stop clock time, and interruption time measured in minutes. Participants used Visual Studio 2005 on Windows XP. The program's user manual was provided to help participants set up the working environment and compile the program. Prior to completing the assignments, participants were given prepared acceptance test cases and were told to run these test cases to certify their updated program. These test cases covered the added, affected, and deleted capabilities of the program. Participants were also told to record all defects found during the acceptance test and not to fix or investigate these defects.

4.4. The UCC program

UCC was a program that allowed users to count LOC-related metrics such as statements, comments, directive statements, and data declarations of a source program. It also allowed users to compare the differentials between two versions of a source program and determine the number of LOC added, modified, and deleted. The program was developed and distributed by the USC Center for Systems and Software Engineering. The UCC program had three main modules: (1) read input parameters and parse source code, (2) analyze, compare, and count source code, and (3) produce results to output files. The UCC program had 5188 logical LOC and consisted of 20 C++ classes. The program was well structured and well commented, but parts of the program had relatively high coupling. Thus, the SU parameter was estimated to be Nominal, or a numeric value of 30%.
4.5. Maintenance tasks

The maintenance tasks were divided into three groups, enhancive, reductive, and corrective, each being assigned to one participant group. These maintenance types fall into the business rules cluster of the typology proposed by Chapin et al. [9]. There were five maintenance tasks for the enhancive group and six for each of the other groups.

The enhancive tasks require participants to add five new capabilities that allow the program to take an extra input parameter, check the validity of the input and notify users, count for and while statements, and display a progress indicator. Since these capabilities are located in multiple classes and methods, participants had to locate the appropriate code to add and possibly modify or delete the existing code. We expected that the majority of code would be added for the enhancive tasks unless participants had enough time to replace the existing code with a better version of their own.

The reductive tasks ask for deleting six capabilities from the program. These capabilities involve handling an input parameter, counting blank lines, and generating a count summary for the output files. The reductive tasks emulate possible needs from customers who do not want to include certain capabilities in the program because of redundancy, performance issues, platform adaptation, etc. Similar to the enhancive tasks, participants need to locate the appropriate code and delete lines of code, or possibly modify and add new code to meet the requirements.

The corrective tasks call for fixing six capabilities that were not working as expected. Each task is equivalent to a user request to fix a defect of the program. Similar to the enhancive and reductive tasks, the corrective tasks handle input parameters, counting functionality, and output files. We designed these tasks in such a way that they required participants to mainly modify the existing lines of code.

4.6. Metrics

The independent variable was the type of maintenance, consisting of the enhancive, corrective, and reductive types. The dependent variables were the programmer's effort and the size of change. The programmer's effort was defined as the total time the programmer spent working on the maintenance task, excluding interruption time; the size of change was measured by three LOC metrics: LOC added, modified, and deleted.

4.7. Maintenance activities

We focus on the context of software maintenance in which the programmer performs quick fixes according to the customer's maintenance requests [3]. Upon receiving the maintenance request, the programmers validate the request and contact the submitter for clarifications if needed. They then investigate the program code to identify relevant code fragments, edit, and perform unit tests on the changes [2,23]. In the experiment, we grouped this scenario into four maintenance activities: Task comprehension includes reading and understanding task requirements and asking for further clarification. Isolation involves locating and understanding the code segments to be adapted. Editing code includes programming and debugging the affected code. Unit test involves performing tests on the affected code.

These activities do not include design modifications because small changes and enhancements hardly affect the system design. Indeed, since we focus on the maintenance quick-fix, the maintenance request often does not affect the existing design. Integration test activities are also not included because the program is by itself the only component, and we performed acceptance testing independently to certify the completion of tasks.
5. Results

In this section, we provide the results of our experiment and the analysis and interpretation of the results. We use the one-sided Mann–Whitney U test with the typical 0.05 level of significance to test for statistically significant differences between two sets of values. We also perform the Kruskal–Wallis test to validate the differences among the groups.

5.1. Data collection

Data was collected from three different sources: surveys, timesheets, and the source code changes. From the surveys, we determined participants' programming skills; programming language, platform, and industry experience; and level of unfamiliarity with the program. Maintenance time was calculated as the duration between finish and start time, excluding interruption time if any. The resulting timesheet had a total of 490 records totaling 4621 min. On average, each participant recorded 19.6 activities with a total of 192.5 min, or 3.2 h. We did not include the acceptance test effort because it was done independently after the participants completed and submitted their work. Indeed, in a real-world situation the acceptance test is usually performed by customers or an independent team, and their effort is often not recorded as effort spent by the maintenance team.

The sizes of changes were collected in terms of the number of LOC added, modified, and deleted by comparing the original with the modified version. These LOC values were then adjusted using the sizing method described in Section 3 to obtain equivalent LOC. We measured the LOC of task-relevant code fragments (TRCF) by summing the size of all affected methods. As a LOC corresponds to one logical source statement, one LOC modified can easily be distinguished from a combination of one added and one deleted.
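To make the size adjustment concrete, the following minimal sketch shows how raw LOC counts could be converted to equivalent LOC, assuming the reconstruction of Eq. (1) given in Section 3. The task values below are hypothetical; SU and UNFM use the numeric scales of Table 2 in Appendix A.

```python
# Minimal sketch of the equivalent LOC adjustment of Eq. (1).
# The sample task below is hypothetical; SU and UNFM follow Table 2 (Appendix A).

def equivalent_loc(added, modified, deleted, trcf, su=30, unfm=0.4):
    """Adjust raw LOC counts by the AAM factor of Eq. (1).

    added, modified, deleted: LOC changed by the task
    trcf: total LOC of the task-relevant code fragments
    su:   software understandability (10-50, in percent)
    unfm: programmer unfamiliarity (0.0 completely familiar .. 1.0 completely unfamiliar)
    """
    s = added + modified + deleted            # size of the change, S
    aaf = s / trcf                            # adaptation adjustment factor
    if aaf <= 1:
        aam = aaf * (1 + (1 - (1 - aaf) ** 2) * su * unfm / 100)
    else:
        aam = aaf + su * unfm / 100
    return trcf * aam

# Hypothetical task: 20 LOC added, 5 modified, 0 deleted within 150 LOC of
# task-relevant code, SU = 30 (Nominal) and UNFM = 0.4.
print(round(equivalent_loc(20, 5, 0, trcf=150), 1))   # ~25.9 equivalent LOC
```

The adjusted value is slightly larger than the raw 25 LOC of change, reflecting the comprehension overhead of working in partially unfamiliar code.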
5.2. Task completion

We did not expect participants to complete all the tasks, given their capability and availability. Because the tasks were independent, we were able to identify which source code changes are associated with each task. Six participants spent time on incomplete tasks, with a total of 849 min or 18% of total time. On average, the enhancive group spent the most time, 26%, on incomplete tasks, compared with 16% and 12% by the corrective and reductive groups, respectively. The number of tasks completed by participants in the enhancive group was the lowest, 69%, while higher task completion rates were achieved by participants in the other groups, 96% in the reductive and 98% in the corrective group. Hereafter, we exclude the time and size associated with incomplete tasks because the time spent on these tasks did not actually produce any result that meets the task requirements.

5.3. Distribution of effort

The first three charts in Fig. 1 show the distribution of effort over the four activities by participants in each group. The fourth chart shows the overall distribution of effort obtained by combining all three groups. Participants spent the largest proportion of time on coding, and they spent much more time on the isolation activity than on testing. By comparing the distribution of effort among the groups, we can see that the proportions of effort spent on the maintenance activities vary vastly among the three groups. The task comprehension activity required the smallest proportions of effort. The corrective group spent the largest share of time on code isolation, twice as much as that of the enhancive group, while the reductive group spent much more time on unit test as compared with the other groups. That is, updating or deleting existing program capabilities requires a high proportion of effort for isolating the code, while adding new program capabilities needs a large majority of effort for editing code. The enhancive group spent 53% of total time on editing, twice as much as that spent by the other groups. At the same time, the corrective group needed 51% of total time for program comprehension related activities, including task comprehension and code isolation. Participants in the enhancive group spent less time on the isolation activity but more time on writing code, while participants in the corrective group did the opposite. Moreover, the sums of the percentages of the coding and code isolation activities of these groups are almost the same (72% and 73%). The Kruskal–Wallis rank-sum tests confirm the differences in the percentage distributions of editing code (p = 0.0095) and code isolation (p = 0.0013) among these groups. Based on these test results, we can therefore reject Hypothesis 2.
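The group comparisons reported in this section rely on standard nonparametric tests. The sketch below, using hypothetical per-participant values rather than the experiment data, shows how such one-sided Mann–Whitney U and Kruskal–Wallis tests can be run.

```python
# Sketch of the nonparametric group comparisons used in Section 5.
# The per-participant values below are hypothetical, not the experiment data.
from scipy.stats import mannwhitneyu, kruskal

corrective = [7.1, 8.3, 6.9, 8.0, 9.2, 7.7, 8.8, 6.5, 8.4]    # e.g., LOC/person-hour
reductive  = [14.0, 25.3, 18.8, 22.1, 16.4, 30.2, 19.5, 17.0]
enhancive  = [12.5, 35.0, 20.1, 27.4, 15.8, 19.9, 16.2]

# One-sided test at the 0.05 level: is the corrective group's value lower?
u_stat, p_value = mannwhitneyu(corrective, reductive, alternative='less')
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")

# Omnibus test of a difference among the three groups
h_stat, p_value = kruskal(enhancive, reductive, corrective)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```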
Fig. 1. Effort distribution (task comprehension, isolation, editing code, unit test) for the reductive, enhancive, and corrective groups, and overall.
5.4. Productivity

Fig. 2 shows that, on average, the enhancive group produced almost 1.5 times as many LOC as the reductive group and almost four times as many LOC as the corrective group. Participants in the enhancive group focused on adding new LOC, the reductive group on deleting existing LOC, and the corrective group on modifying LOC. As a result, the enhancive group has the highest number of LOC added while no LOC was deleted; the reductive group has the highest number of LOC deleted, while no LOC was added; and the corrective group has few LOC added and deleted. This pattern was dictated by our task design: the enhancive tasks require participants mainly to add new capabilities, which results in new code; the corrective tasks require modifying existing code; and the reductive tasks require deleting existing code. For example, participants in the reductive group modified 20% and deleted the remaining 80% of the total affected LOC.

The box plots shown in Fig. 3 provide the 1st quartile, median, 3rd quartile, and the outliers of the productivity for the three groups. The productivity is defined as the sum of equivalent LOC added, modified, and deleted divided by the total effort measured in person-hours. According to the sizing method defined in Eq. (1), this productivity measure accounts for the effects of software understanding. One participant had a much higher productivity than any other participant. A closer look at this data point reveals that this participant had 8 years of industry experience and was working as a full-time software engineer at the time of the experiment. As indicated in the box plots, the productivity of the corrective group is much lower than that of the other groups.
Fig. 2. Average equivalent LOC added, modified, and deleted in the three groups (enhancive, reductive, corrective).
On average, for each hour, participants in the corrective group produced 8 (±1.7) LOC, which is about 0.4 times as many as the reductive and enhancive groups produced, 20 (±8.4) and 21 (±11.3), respectively. One-sided Mann–Whitney U tests confirm these productivity differences (p = 0.001 for the difference between the enhancive and corrective groups and p = 0.0002 between the reductive and corrective groups); there is a lack of statistical evidence to indicate a productivity difference between the enhancive and reductive groups (p = 0.45). The Kruskal–Wallis rank-sum test also indicates a statistically significant difference in productivity among these groups (p = 0.0004). Hypothesis 1 is therefore rejected.

Fig. 3. Participants' productivity (box plots of equivalent LOC per person-hour for the three groups).

6. Explanatory maintenance effort models

Understanding the factors that influence maintenance cost and predicting future cost is of great interest to software engineering practitioners. Reliable estimates enable practitioners to make informed decisions and ensure the success of software maintenance projects. With the data obtained from the experiment, we are interested in deriving models to explain and predict the time spent by each participant on the maintenance tasks.

6.1. Models

Previous studies have identified numerous factors that affect the cost of maintenance work. These factors reflect the characteristics of the platform, program, product, and personnel of maintenance work [5,8]. In the context of this experiment, personnel factors are most relevant. Other factors are relatively invariant, hence irrelevant, because participants performed the maintenance tasks in the same environment, on the same product, and with the same working set. Therefore, in this section we examine models that use only factors relevant to the context of this experiment.

The Effort Adjustment Factor (EAF) is the product of the effort multipliers defined in the COCOMO II model, representing the overall effects of the model's multiplicative factors on effort. In this experiment, we define EAF as the product of programmer capability (PCAP), language and tools experience (LTEX), and platform experience (PLEX). We used the same rating values for these cost drivers as defined in the COCOMO II Post-Architecture model. We rated the PCAP, LTEX, and PLEX values based on the participant's GPA, experience, pre-test, and post-test scores. The numeric values of these parameters are given in Appendix A. If a rating fell in between two defined rating levels, we divided the scale into finer intervals by using a linear interpolation between the defined values of the two adjacent rating levels. This technique allowed specifying more precise ratings for the cost drivers. We will investigate the following models:
\[
\begin{aligned}
M_1:\;& E = b_0 + b_1\, S_1 \times EAF\\
M_2:\;& E = b_0 + (b_1\, Add + b_2\, Mod + b_3\, Del) \times EAF\\
M_3:\;& E = b_0 + b_1\, S_2 \times EAF\\
M_4:\;& E = b_0 + (b_1\, Add + b_2\, Mod) \times EAF
\end{aligned}
\]
where E is the total number of minutes that the participant spent on completed maintenance tasks. Add, Mod, and Del represent the number of LOC added, modified, and deleted by the participant for all completed maintenance tasks, respectively. S1 is the total equivalent LOC that was added, modified, and deleted by the participant, that is, S1 = Add + Mod + Del. S2 is the total equivalent LOC that was added and modified, or S2 = Add + Mod. EAF is the effort adjustment factor described above. As we can see in the models' equations, the LOC metrics Add, Mod, and Del are all adjusted by EAF, taking into account the capability and experience of the participant.
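Because EAF multiplies each size term, every model above is linear in its coefficients once the predictors are pre-multiplied by EAF, so ordinary least squares applies directly. The sketch below illustrates this for an M2-style model; for brevity it uses only five of the 24 data points of Table 3 in Appendix B, whereas the analysis in Section 6.3 fits all 24.

```python
# Sketch of fitting an M2-style model, E = b0 + (b1*Add + b2*Mod + b3*Del) * EAF,
# by ordinary least squares on EAF-adjusted predictors.
import numpy as np

# Five of the 24 data points from Table 3 (participants 1, 2, 13, 17, 20), for illustration.
add    = np.array([87.0, 0.0, 20.0, 72.0, 6.0])
mod    = np.array([8.0, 3.0, 0.0, 9.0, 12.0])
dele   = np.array([0.0, 43.0, 0.0, 0.0, 0.0])
eaf    = np.array([0.87, 0.74, 1.19, 0.92, 1.37])     # EAF = PCAP * LTEX * PLEX
effort = np.array([240.0, 97.0, 57.0, 341.0, 207.0])  # minutes on completed tasks

# Design matrix: intercept plus the three size metrics, each scaled by EAF.
X = np.column_stack([np.ones_like(add), add * eaf, mod * eaf, dele * eaf])
coef, *_ = np.linalg.lstsq(X, effort, rcond=None)
b0, b1, b2, b3 = coef
print(f"E = {b0:.1f} + ({b1:.2f}*Add + {b2:.2f}*Mod + {b3:.2f}*Del) * EAF")
```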
Models M3 and M4 differ from models M1 and M2 in that they do not include the variable Del. Thus, differences in the performance of models M3 and M4 versus models M1 and M2 will reflect the effect of the deleted LOC metric. The estimates of the coefficients in model M2 determine how this model differs from model M1. This difference is subtle but significant because M2 accounts for the impact of each type of LOC metric on maintenance effort. In the next two subsections, we estimate the coefficients of the models using the experiment data and evaluate their performance as well as the structural differences among them.

6.2. Model performance measures

We use the two most prevalent model performance measures, MMRE and PRED, as criteria to evaluate the accuracy of the models [30]. These metrics are derived from the basic magnitude of relative error (MRE), which is defined as
\[
MRE_i = \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}
\]
where y_i and ŷ_i are the actual value and the estimate of the ith observation, respectively. The mean MRE of N estimates is defined as
\[
MMRE = \frac{1}{N} \sum_{i=1}^{N} MRE_i
\]
Clearly, according to this formula, extreme relative errors can have a significant impact on MMRE, affecting the overall conclusion about the performance of the model under evaluation. To overcome this problem, PRED is often used as an important complementary measure. PRED(l) is defined as the percentage of estimates whose MRE is not greater than l, that is, PRED(l) = k/N, where k is the number of estimates with MRE values falling between 0 and l. PRED values range from 0 to 1. High-performance models are expected to have high PRED and small MMRE. Conte et al. [11] proposed PRED(0.25) ≥ 0.75 and MMRE ≤ 0.25 as standard acceptance levels for effort estimation models. In this study, we chose to report MMRE, PRED(0.25), and PRED(0.3) values because they have been the most widely used measures for evaluating the performance of software estimation models [15,27,30]. In addition, we use the coefficient of determination (R2) as a criterion to evaluate the explanatory power of the variables used in the models.
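A minimal sketch of these two accuracy measures, using hypothetical actual and estimated effort values in minutes:

```python
# Sketch of the MMRE and PRED(l) measures of Section 6.2.
# The actual and estimated efforts below are hypothetical values in minutes.
import numpy as np

def mre(actual, estimate):
    return np.abs(actual - estimate) / actual

def mmre(actual, estimate):
    return mre(actual, estimate).mean()

def pred(actual, estimate, level=0.25):
    # Fraction of estimates whose MRE does not exceed the given level.
    return np.mean(mre(actual, estimate) <= level)

actual   = np.array([240.0, 97.0, 158.0, 141.0, 144.0])
estimate = np.array([293.0, 105.0, 120.0, 150.0, 190.0])
print(f"MMRE       = {mmre(actual, estimate):.2f}")
print(f"PRED(0.25) = {pred(actual, estimate, 0.25):.2f}")
print(f"PRED(0.30) = {pred(actual, estimate, 0.30):.2f}")
```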
6.3. Results

We collected a total of 24 data points, each having LOC added (Add), modified (Mod), deleted (Del), actual effort (E), and effort adjustment factor (EAF). Fitting the 24 data points (see Table 3 in Appendix B) to models M1, M2, M3, and M4 using least squares regression, we obtained
\[
\begin{aligned}
M_1:\;& E = 78.1 + 2.2\, S_1 \times EAF\\
M_2:\;& E = 43.9 + (2.8\, Add + 5.3\, Mod + 1.3\, Del) \times EAF\\
M_3:\;& E = 110.0 + 2.2\, S_2 \times EAF\\
M_4:\;& E = 79.1 + (2.3\, Add + 4.6\, Mod) \times EAF
\end{aligned}
\]

Table 1 shows the statistics obtained from the four models. The p-values are shown next to the estimates of the coefficients. In all models, the estimates of all coefficients but b0 in M2 are statistically significant (p ≤ 0.05). It is important to note that b1, b2, and b3 in model M2 are the estimates of the coefficients of the Add, Mod, and Del variables, respectively. They reflect variances in the productivity of the three maintenance types that we discussed above. These estimates show that the Add, Mod, and Del variables have significantly different impacts on the effort estimate of M2. One modified LOC affects effort as much as two added or four deleted LOC. That is, modifying one LOC is much more expensive than adding or deleting it. As shown in Table 1, although Del has the least impact on effort as compared to Add and Mod, it is statistically correlated with effort (p = 0.02). Thus, it is implausible to ignore the effect of the deleted LOC on maintenance effort.

Models M1 and M3, which both use a single combined size parameter, have the same slope (b1 = 2.2), indicating that the size parameters S1 and S2 have the same impact on the effort. The estimates of the intercept (b0) in the models indicate the average overhead of the participant's maintenance tasks. The overhead seems to come from non-coding activities such as task comprehension and unit test; these activities do not result in any changes to the source code. Model M3 has the highest overhead (110 min), which seems to compensate for the absence of the deleted LOC in the model.

The coefficient of determination (R2) values suggest that 75% of the variability in the effort is explained by the variables in M2, while only 50%, 55%, and 64% is explained by the variables in M1, M3, and M4, respectively. It is interesting to note that both models M3 and M4, which do not include the deleted LOC, generated higher R2 values than did model M1. Moreover, the R2 values obtained by models M2 and M4 are higher than those of models M1 and M3, which use a single combined size metric.

The MMRE, PRED(0.3), and PRED(0.25) values indicate that M2 is the best performer, and it outperforms M1, the worst performer, by a wide margin. Model M2 produced estimates with a lower average error (MMRE = 20%) than did M1 (MMRE = 33%). For model M2, 79% of the estimates (19 out of 24) have MRE values of less than or equal to 30%. In other words, the model produced effort estimates that are within 30% of the actuals 79% of the time.

Both models M3 and M1 use a single size parameter, but M3 outperforms M1 on all of the model performance measures. PRED(0.25) of M3 is much higher than that of M1, as we can see in Fig. 4. The MRE values of M3 are less than those of M1 in most of the estimates.
Table 1
Summary of results obtained from fitting the models.

Metric        M1                  M2                  M3                  M4
R2            0.50                0.75                0.55                0.64
b0            78.1 (p = 10^-3)    43.9 (p = 0.06)     110.1 (p = 10^-7)   79.1 (p = 4.8 × 10^-4)
b1            2.2 (p = 10^-4)     2.8 (p = 10^-7)     2.2 (p = 10^-5)     2.3 (p = 10^-6)
b2            –                   5.3 (p = 10^-5)     –                   4.6 (p = 2.7 × 10^-4)
b3            –                   1.3 (p = 0.02)      –                   –
MMRE          33%                 20%                 28%                 27%
PRED(0.3)     58%                 79%                 75%                 79%
PRED(0.25)    46%                 71%                 75%                 71%
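As an illustration of the fitted coefficients (ours, not an analysis from the original study), applying model M2 to participant 1 in Table 3 of Appendix B (Add = 87, Mod = 8, Del = 0, EAF = 0.87) gives

\[
\hat{E} = 43.9 + (2.8 \times 87 + 5.3 \times 8 + 1.3 \times 0) \times 0.87 = 43.9 + 286.0 \times 0.87 \approx 292.7 \text{ min},
\]

against an actual effort of 240 min, i.e., an MRE of |240 − 292.7|/240 ≈ 0.22, which falls within the 30% band counted by PRED(0.3).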
Fig. 4. The MRE values obtained by the models (M1–M4) in estimating the effort spent by the 24 participants (x-axis: participant #; y-axis: MRE).
It is clear that the Del component in M1 negatively affects the performance of the model. On the contrary, the Del component in M2 contributes to improving the performance. As indicated in Table 1, model M2 produced more accurate estimates than did model M4; note that the size parameters in models M2 and M4 are kept separate. These models outperform the combined size parameter models M1 and M3, indicating that the improvement in performance results from using the size metrics as independent variables.

As shown in Fig. 4, all models produced low estimation accuracies when estimating the time of participant 13, with MRE values ranging from 0.93 up to 1.84. A closer examination of this data point reveals that all of the models overestimated the participant's time and that the participant's capability and experience were rated low, but his productivity was high. In fact, the participant, who worked in the enhancive group, spent a total of 200 min on the tasks but completed only two tasks, which took him 57 min. We suspect that he may have had experience in resolving these specific tasks while struggling to resolve the others.

7. Discussion

Although participants in the enhancive group completed the smallest proportion of tasks, only 69%, they were the most productive. Looking at the distribution of effort shown in Fig. 1, we see that they spent the majority of their time on coding. As a result, more code was produced and higher productivity was achieved. This observation suggests that one must be cautious about the interpretation of the productivity measure and what size metric is used to derive it. Moreover, our results imply that productivity should be interpreted in the context of what types of maintenance activities are performed. For example, the productivity of a programmer who performs functional enhancements should not be used as an indicator to evaluate the performance of another programmer who fixes software faults.

Understanding the distribution of effort across different types of maintenance tasks allows managers to allocate appropriate resources and make reasonable plans for maintenance work. Our results show that the distributions of effort differ across maintenance types. These results are not fully consistent with the results of Basili et al.'s study, in which the proportions of effort spent on design and code/unit test were found to be almost the
same [2]. In our study, the proportions of unit test effort in the enhancive and corrective maintenance types are almost identical, while the coding activity consumed a much larger proportion of effort in the enhancive maintenance type. However, it is worth noting that we used a different set of maintenance activities that cannot be fully mapped onto their maintenance activities. In the context of our experiment, participants did not perform the analysis, design, and inspection activities that were reported in their study. Nonetheless, we found that the program comprehension activities, which include the task comprehension and code isolation activities, of the corrective and reductive groups require as much as 50% of effort, which is largely consistent with the results reported by Basili et al. [2] and Fjelstad and Hamlen [16].

As discussed above, some studies use the sum of LOC added, modified, and deleted as a single size measure while others use the sum of only the LOC added and modified. In this study, we evaluated both methods. Surprisingly, our results suggest that excluding the deleted LOC from the sum likely gives a better size metric for predicting maintenance effort (see Table 1). However, both of these methods were shown to be inferior to those that use the LOC metrics as independent variables. As shown in Table 1, models M2 and M4, which use the LOC metrics as independent variables, outperform both M1 and M3. It can be inferred that each LOC metric has a different impact on maintenance effort. Thus, each LOC metric should be adjusted by a factor to derive a better size metric.

The deleted LOC metric was found to be a statistically significant parameter for estimating maintenance effort. Including this metric in the model that uses the independent LOC metrics likely improves the performance of the model. In the case where the simple sum is used, however, excluding the deleted LOC from the sum seems to generate more favorable models. This seemingly contradictory result needs further investigation and validation. In our study, the deleted LOC is the total number of LOC deleted only from modified modules. Deleting source code in a modified module requires detailed understanding of the task-relevant code fragments. On the other hand, if a whole module is deleted, the programmers may not need to acquire a detailed understanding of the module. They may instead understand the code fragments that reference the module to be deleted and modify these code fragments appropriately. As a result, deleting a whole
module would require much less effort than deleting the same number of LOC in the modified module.
8. Threats to validity

8.1. Threats to internal validity

There are several threats to internal validity. The capabilities of the groups may differ significantly. We used a matched within-subjects design [10] to assign participants to groups, which helped to reduce differences. In addition, we scored participants' answers on our pre- and post-experiment questionnaires about their C++ experience and understanding of the program. We performed t-tests to test for differences in scores among the groups. The results indicated no statistically significant difference between any two groups (p > 0.23).

Another threat is the accuracy of the time logs recorded by participants. We told participants to record start, end, and interruption time for each maintenance activity. This required participants to record their time consistently from one activity to another. In addition, we used a hardcopy timesheet instead of a softcopy one, as we believe that it is more difficult to manipulate the time records in hardcopy, and if manipulations were made, we could identify them easily. The time data was found to be highly reliable.

A third threat concerns possible differences in the complexity of the maintenance tasks. As complexity is one of the key factors that significantly affect the productivity of maintenance tasks, such differences may cause the productivity to be incomparable between the groups. However, we designed the tasks to involve the same set of methods and classes, so that productivity can be compared fairly across the groups. The submitted source code was found to be consistent with our design.

Finally, one can reasonably argue that once participants complete a maintenance task, their knowledge of the program can be carried over to the next task, effectively reducing the time spent on understanding the program for that task. This reuse of knowledge can affect the effort distribution. In our experiment, however, the maintenance tasks are independent of each other, and thus the knowledge obtained in previously completed tasks is not likely to be relevant to the current task. The general knowledge about the program's structure and how it works was acquired by participants prior to participating in the experiment.
8.2. Threats to external validity

Differences in environment settings between the experiment and real software maintenance may limit the generalizability of the conclusions of this experiment. First, professional maintainers may have more experience than our participants. As all of the participants except the senior were graduate students, and most of the participants, including the senior, had industry experience, we do not believe the difference in experience is a major concern. Second, professional maintainers may be thoroughly familiar with the program, e.g., they are the original programmers. The experiment may not generalize to this case, although many of our participants were generally familiar with the program. Third, a real maintenance process may differ in several ways, such as involving more maintenance activities (e.g., design change and code inspection) and collaboration among programmers. In this respect, the experiment generalizes only to the four investigated maintenance activities performed by an individual programmer with no interaction or collaboration with other programmers.
9. Conclusions

This paper has described a controlled experiment to assess the productivity and effort distribution of three different maintenance types: enhancive, corrective, and reductive. Our study suggests that the productivity of corrective maintenance is significantly lower than that of the other types of maintenance and that a large proportion of effort is devoted to program comprehension. These results are mostly consistent with previous studies on the productivity of maintenance types [17] and the effort distribution of corrective maintenance [2,16], in which program comprehension activities, which include task comprehension and isolation, consume the majority of total effort. Indeed, our results suggest that it is more expensive to modify than to add or delete a LOC.

We also described effort models for explaining and estimating the individual programmer's time spent on the maintenance tasks. The best software effort model produced effort estimates within 30% of the actuals 79% of the time. The results also show that the model with three independent LOC metrics (added, modified, and deleted) is more favorable than the other models that use the sum of the three LOC metrics. This model better handles the variability in the productivity of the three different maintenance types. In addition, we show that the deleted LOC is a statistically significant predictor of maintenance effort.

Despite the validity concerns that we have discussed in Section 8, our study suggests several implications for maintenance effort estimation and staffing. The LOC added, modified, and deleted metrics are good predictors for estimating the cost of software maintenance. Effort estimation models for maintenance work that are based on LOC may use the LOC added, modified, and deleted metrics as three independent parameters instead of the simple sum of the three. That is, each of these LOC metrics has a different level of influence on maintenance effort, and thus each should be weighted accordingly. Additionally, they are good predictors for estimating the cost of software maintenance even if the sizes of the changes are small. Another implication is that reducing the business rules of the program does not come for free; rather, it requires a sizable proportion of the software maintenance effort. Finally, the differences in effort distribution among the maintenance types that our results have shown suggest that assigning maintenance tasks properly is important to effectively and efficiently utilize human resources. For example, assigning highly application-experienced programmers to fixing faults may save more effort in code comprehension than assigning them to enhancement tasks, because of the significant difference in the distribution of effort between these maintenance types that we have shown.

Appendix A

See Table 2.

Appendix B

See Table 3.
Table 2
The COCOMO II parameters used in the models.

Parameter   Very low   Low    Nominal   High   Very high   Extra high
PCAP        1.34       1.15   1.00      0.88   0.76        –
PLEX        1.19       1.09   1.00      0.91   0.85        –
LTEX        1.20       1.09   1.00      0.91   0.84        –
SU          50         40     30        20     10          –
UNFM        0.0        0.20   0.40      0.60   0.80        1.00
Table 3
Experiment data for the models.

Participant No.   Group   Effort (min)   PCAP   LTEX   PLEX   EAF    Equivalent LOC added   Modified   Deleted   Total
1                 E       240            0.88   0.91   1.09   0.87   87                     8          0         95
2                 R       97             0.76   0.89   1.09   0.74   0                      3          43        46
3                 C       158            1.15   0.91   1.00   1.05   0                      21         0         21
4                 R       141            0.76   1.00   1.09   0.83   1                      6          57        64
5                 C       144            1.00   0.96   1.05   1.01   0                      18         0         18
6                 E       181            0.76   0.91   0.87   0.60   122                    5          0         127
7                 R       144            0.88   1.09   1.19   1.14   1                      0          47        48
8                 C       167            1.00   1.09   1.19   1.30   0                      20         0         20
9                 R       121            0.88   1.00   0.91   0.80   0                      0          44        44
10                R       124            1.00   0.91   0.96   0.87   0                      17         43        60
11                E       351            1.00   1.00   1.00   1.00   87                     4          0         91
12                R       275            1.08   1.00   1.19   1.29   0                      27         29        56
13                E       57             1.00   1.00   1.19   1.19   20                     0          0         20
14                R       107            1.00   0.96   1.00   0.96   0                      11         41        52
15                C       128            0.88   0.96   1.19   1.01   0                      19         0         19
16                C       196            1.00   1.00   1.19   1.19   2                      19         2         23
17                E       341            0.88   0.96   1.09   0.92   72                     9          0         81
18                E       139            0.88   1.20   1.09   1.15   27                     2          0         29
19                R       247            1.00   0.96   1.00   0.96   0                      27         50        77
20                C       207            1.15   1.00   1.19   1.37   6                      12         0         18
21                C       175            0.88   0.91   1.00   0.80   2                      33         0         35
22                C       65             0.88   1.00   0.85   0.75   0                      12         1         13
23                E       237            0.88   0.89   0.96   0.75   70                     3          0         73
24                C       169            0.88   0.91   0.93   0.74   4                      19         0         23

Group: C = corrective, E = enhancive, R = reductive.
References

[1] A. Abran, H. Nguyenkim, Analysis of maintenance work categories through measurement, in: Proc. Conf. on Software Maintenance, Sorrento, Italy, 1991, pp. 104–113.
[2] V.R. Basili, L. Briand, S. Condon, Y.M. Kim, W.L. Melo, J.D. Valett, Understanding and predicting the process of software maintenance releases, in: Proceedings of the International Conference on Software Engineering, Berlin, Germany, 1996, pp. 464–474.
[3] V.R. Basili, Viewing maintenance as reuse-oriented software development, IEEE Softw. 7 (1) (1990) 19–25.
[4] B.W. Boehm, Understanding and controlling software costs, IEEE Trans. Softw. Eng. (1988).
[5] B.W. Boehm, E. Horowitz, R. Madachy, D. Reifer, B.K. Clark, B. Steece, A.W. Brown, S. Chulani, C. Abts, Software Cost Estimation with COCOMO II, Prentice Hall, 2000.
[6] B.W. Boehm, C. Abts, S. Chulani, Software development cost estimation approaches: a survey, Ann. Softw. Eng., 2000.
[7] B.W. Boehm, Software Engineering Economics, Prentice Hall, 1981.
[8] T. Chan, Impact of programming and application-specific knowledge on maintenance effort: a hazard rate model, in: Proceedings of the IEEE Intl. Conference on Software Maintenance, 2008, pp. 47–56.
[9] N. Chapin, J.E. Hale, K.Md. Kham, J.F. Ramil, W. Tan, Types of software evolution and software maintenance, J. Softw. Maint. Res. Pract. 13 (1) (2001) 3–30.
[10] L.B. Christensen, Experimental Methodology, eighth ed., Allyn and Bacon, 2000.
[11] S.D. Conte, H.E. Dunsmore, V.Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, Calif., 1986.
[12] A. De Lucia, E. Pompella, S. Stefanucci, Assessing effort estimation models for corrective maintenance through empirical studies, Inform. Softw. Technol. 47 (2005) 3–15.
[13] A. De Lucia, E. Pompella, S. Stefanucci, Assessing the maintenance processes of a software organization: an empirical analysis of a large industrial project, J. Syst. Softw. 65 (2) (2003) 87–103.
[14] A. De Lucia, E. Pompella, S. Stefanucci, Assessing effort prediction models for corrective software maintenance, Enterprise Inform. Syst. VI (2006) 55–62.
[15] J.J. Dolado, On the problem of the software cost function, Inform. Softw. Technol. 43 (2001) 61–72.
[16] R.K. Fjelstad, W.T. Hamlen, Application program maintenance study: report to our respondents, in: G. Parikh, N. Zvegintzov (Eds.), Tutorial on Software Maintenance, IEEE Computer Society Press, Los Angeles, CA, 1983, pp. 11–27.
[17] T.L. Graves, A. Mockus, Inferring change effort from configuration management databases, in: Proceedings of the International Symposium on Software Metrics, IEEE, 1998, pp. 267–273.
[18] M. Hariza, J.F. Voidrot, E. Minor, L. Pofelski, S. Blazy, Software maintenance: an analysis of industrial needs and constraints, in: Proceedings of the Int. Conf. on Softw. Maint. (ICSM), Orlando, Florida, 1992.
[19] IEEE Std. 1219-1998, Standard for Software Maintenance, IEEE Computer Society Press, Los Alamitos, CA, 1998.
[20] M. Jorgensen, M. Shepperd, A systematic review of software development cost estimation studies, IEEE Trans. Softw. Eng. 33 (1) (2007) 33–53.
[21] M. Jørgensen, Experience with the accuracy of software maintenance task effort prediction models, IEEE Trans. Softw. Eng. 21 (8) (1995) 674–681.
[22] B.A. Kitchenham, G.H. Travassos, A. von Mayrhauser, F. Niessink, N.F. Schneidewind, J. Singer, S. Takada, R. Vehvilainen, H. Yang, Toward an ontology of software maintenance, J. Softw. Maint. 11 (6) (1999) 365–389.
[23] A.J. Ko, H.H. Aung, B.A. Myers, Eliciting design requirements for maintenance-oriented IDEs: a detailed study of corrective and perfective maintenance, in: Proceedings of the International Conference on Software Engineering ICSE'2005, IEEE Computer Society, 2005, pp. 126–135.
[24] D.C. Littman, J. Pinto, S. Letovsky, E. Soloway, Mental models and software maintenance, J. Syst. Softw. 7 (1987) 341–355.
[25] M. Mattsson, Effort distribution in a six year industrial application framework project, in: Proceedings of the International Conference on Software Maintenance, Oxford, UK, IEEE CS Press, 1999, pp. 326–333.
[26] V. Nguyen, S. Deeds-Rubin, T. Tan, B.W. Boehm, A SLOC Counting Standard, COCOMO II Int'l Forum, 2007.
[27] V. Nguyen, B. Steece, B.W. Boehm, A constrained regression technique for COCOMO calibration, in: Proceedings of the 2nd ACM–IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 213–222.
[28] V. Nguyen, Improved Size and Effort Estimation Models for Software Maintenance, Qualifying Exam Report, University of Southern California, 2010.
[29] T.M. Pigoski, Practical Software Maintenance: Best Practices for Managing Your Software Investment, John Wiley and Sons, Inc., New York, NY, 1996.
[30] D. Port, M. Korte, Comparative studies of the model evaluation criterions MMRE and PRED in software cost estimation research, in: Proceedings of the 2nd ACM–IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2008, pp. 51–60.
[31] E.B. Swanson, The dimensions of maintenance, in: Proceedings of the 2nd International Conference on Software Engineering, IEEE Computer Society, Long Beach, CA, 1976, pp. 492–497.
[32] Y. Yang, Q. Li, M. Li, Q. Wang, An empirical analysis on distribution patterns of software maintenance effort, in: International Conference on Software Maintenance, 2008, pp. 456–459.