Evaluating the ordering of the SPICE capability levels: an empirical study

Ho-Won Jung
Department of Business Administration, Korea University, Anam-dong 5Ka, Sungbuk-gu, Seoul 136-701, South Korea

Received 7 October 2003; revised 26 April 2004; accepted 21 May 2004
Abstract

The Standard ISO/IEC PDTR 15504 (Software Process Assessment) defines the process attributes (PAs) and associated practices that must be implemented at each process capability level. This definition implies that PA practices at lower capability levels must be implemented before moving to higher capability levels. The purpose of this study is to evaluate empirically whether the ordering of the set of PAs, as measures of capability, is consistent with the Standard. For this purpose, the study estimates the Coefficient of Reproducibility (CR), a statistic that measures the extent to which the observed ratings are identical to the pattern inferred by the Standard. Our analyses, based on the ratings of 689 process instances, show that the PA order of the capability levels is generally consistent with that inferred by the Standard. However, our results also show that the definition of PA3.2 (Process resource) could be improved. This evaluation provides a substantiated basis for using the notion of capability, as well as information for necessary improvements to the Standard.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Coefficient of reproducibility; Guttman scaling; Standard pattern; Process capability level; SPICE
1. Introduction

Defining a capability or maturity level is common in process assessment models, for example in the models associated with ISO/IEC 15504 (Process Assessment) [19,20], the Capability Maturity Model (CMM) for Software (SW-CMM) [27,28], CMM Integration (CMMI) [31], a process capability model specifically for requirements engineering [32], and the Information Technology (IT) service CMM for software maintenance [26]. The capability dimension of ISO/IEC 15504 is depicted as a series of process attributes (PAs), applicable to any process, which represent measurable characteristics necessary to manage a process and to improve its performance capability (see Fig. 1) [19,20]. The capability dimension comprises six capability levels ranging from 0 to 5, i.e. Level 0 (Incomplete), Level 1 (Performed), Level 2 (Managed), Level 3 (Established), Level 4 (Predictable), and Level 5 (Optimizing). The greater the level, the greater the process capability achieved.
A basic premise of defining capability levels is that the PAs and their associated 'good' practices¹ are well ordered by the definition of capability level. In other words, the practices of a PA at lower capability levels must be implemented before moving to a higher capability level. The purpose of this study is to investigate empirically whether the PAs, as capability measures, are consistently ordered within the definition of capability level. For this purpose, the study performed a pairwise comparison of two PA ratings at different capability levels for each process instance. If a comparison does not meet the pattern implied by ISO/IEC 15504, it is counted as an error. This study estimates the Coefficient of Reproducibility (CR), a statistic that measures the extent to which the observed ratings are identical to the expected pattern. This paper calls the pattern defined in ISO/IEC 15504 the standard pattern. The dataset of our analysis consists of 689 process instances assessed during the Phase 2 SPICE Trials² [21,33,34]. Our approach can be seen as a reliability study of capability measures using a variant of the Guttman scaling method (scalogram analysis) [24]. To our knowledge, there have been no studies of the reliability of the PAs as capability measures based on Guttman scaling.

The importance of this study can be explained in two ways. First, process assessment results have been used as a basis for many important decisions, including actions to improve internal software processes, large-scale acquisitions, and contract monitoring. Given the importance of the decisions influenced by process assessments and the resources required to implement them, both contractors and acquirers must be confident in the assessment results [17]. Increased confidence in assessment results can be achieved by evaluating the 'goodness' of the capability level definition. Such evaluations would provide a substantiated basis for using capability levels, as well as providing information to the developers of ISO/IEC 15504 for necessary improvements. Secondly, this study can be seen as a response to a statement by Pfleeger et al. [29] that 'Standards have codified approaches whose effectiveness has not been rigorously and scientifically demonstrated. Rather, we have too often relied on anecdote, "gut feeling," the opinions of experts, or even flawed research rather than on careful, rigorous software engineering experimentation.'

The remainder of this paper is organized as follows: Section 2 provides a brief overview of ISO/IEC 15504 within the scope of this study and of previous reliability studies in software process assessment. Section 3 addresses the data collection and analysis methods. The results of the analysis are presented in Section 4. Section 5 discusses the implications of the study results. Limitations of this study are described in Section 6. Final remarks are given in Section 7. In this paper, 'ISO/IEC 15504' refers to ISO/IEC PDTR³ 15504 [19] unless there is reason to distinguish versions of ISO/IEC 15504.

¹ Practices defined in PAs are called management practices. Management practices, with their associated characteristics, are indicators of process capability and the means of achieving the capabilities addressed by the process attributes. Evidence of management practice performance supports the judgment of the degree of achievement of the process attribute. The set of management practices is intended to be applicable to all processes in the process dimension of the model. However, PA1.1 (Process performance) at capability level 1 is based on process performance indicators (base practices, work products, and their characteristics).

² The SPICE (Software Process Improvement and Capability dEtermination) Trials empirically evaluate successive versions of the document sets of ISO/IEC 15504. The Trials were performed in three broad phases: Phase 1 took place in 1995, Phase 2 between September 1996 and June 1998, and Phase 3 began in July 1998. Summarized results of the Phase 2 SPICE Trials can be found in [21].

2. Background

2.1. ISO/IEC 15504 rating scheme

Fig. 1. Two-dimensional architecture of ISO/IEC 15504, where PA denotes Process attribute.
The SPICE Project developed a reference model (ISO/IEC 15504: Part 2) for software process capability determination. The reference model consists of a process dimension and a capability dimension, as seen in Fig. 1. In the process dimension, the processes associated with software are defined and classified into five categories: Customer-Supplier, Engineering, Support, Management, and Organization (see Table 5 for the processes). The capability dimension is depicted as a series of PAs, applicable to any process, which represent measurable characteristics necessary to manage a process and to improve its performance capability. The capability dimension comprises six capability levels ranging from 0 to 5.

An ISO/IEC 15504 assessment is applied to an organizational unit (OU) (ISO/IEC 15504: Part 9). An OU is the whole or part of an organization that owns and supports the software process. During an assessment, an organization can cover only the subset of processes that are relevant to its business objectives, and in most cases it is not necessary to assess all processes in the process dimension. The object that is rated is a process instance, defined as a singular instantiation of a process that is uniquely identifiable and about which information can be gathered in a repeatable manner.

The capability level of each process instance is determined by rating PAs. As shown in Table 1, capability level 1 has only one PA and each of capability levels 2–5 consists of two associated PAs. A more detailed description of the attributes can be found in ISO/IEC 15504: Parts 2 and 5 [19,20]. As seen in Table 2, each PA is measured on an ordinal rating scale, 'F' (Fully), 'L' (Largely), 'P' (Partially), or 'N' (Not achieved), that represents the extent of achievement of the attribute as defined in ISO/IEC 15504: Part 2.
³ ISO/IEC JTC1 has a variety of paths for developing International Standards [18]. One of them is through a published technical report (TR). A TR follows a series of stages: NP (New Proposal), WD (Working Draft), PDTR (Proposed Draft Technical Report), DTR (Draft Technical Report), and TR (Technical Report). Assessments in the Phase 2 SPICE Trials were based on the PDTR version. At the time of publishing, the nine documents of ISO/IEC TR 15504 had been restructured into five parts; Parts 2 (Performing an assessment) and 3 (Guidance on performing an assessment) were published as International Standards.
Table 1
Process attributes at each capability level

Level 0 (Incomplete process): no process attributes are defined at this level.
Level 1 (Performed process):
  PA1.1 Process performance attribute: the extent to which the process achieves the process outcomes by transforming identifiable input work products to produce identifiable output work products.
Level 2 (Managed process):
  PA2.1 Performance management attribute: the extent to which the performance of the process is managed to produce work products that meet the defined objectives.
  PA2.2 Work product management attribute: the extent to which the performance of the process is managed to produce work products that are appropriately documented, controlled, and verified.
Level 3 (Established process):
  PA3.1 Process definition attribute: the extent to which the performance of the process uses a process definition based upon a standard process to achieve the process outcomes.
  PA3.2 Process resource attribute: the extent to which the process draws upon suitable resources (for example, human resources and process infrastructure) that are appropriately allocated to deploy the defined process.
Level 4 (Predictable process):
  PA4.1 Measurement attribute: the extent to which product and process goals and measures are used to ensure that the performance of the process supports the achievement of the defined goals in support of the relevant business goals.
  PA4.2 Process control attribute: the extent to which the process is controlled through the collection, analysis, and use of product and process measures to correct, where necessary, the performance of the process so as to achieve the defined product and process goals.
Level 5 (Optimizing process):
  PA5.1 Process change attribute: the extent to which changes to the definition, management, and performance of the process are controlled to achieve the relevant business goals of the organization.
  PA5.2 Continuous improvement attribute: the extent to which changes to the process are identified and implemented to ensure continuous improvement in the fulfillment of the relevant business goals of the organization.
Note: see ISO/IEC 15504 for the 'good' (base) practices defined for each process attribute.

Table 2
The rating scale of the process attributes [19,20]

N (Not achieved), 0 to 15%: there is little or no evidence of achievement of the defined attribute in the assessed process.
P (Partially achieved), 16 to 50%: there is evidence of a sound systematic approach to and achievement of the defined attribute in the assessed process; some aspects of achievement may be unpredictable.
L (Largely achieved), 51 to 85%: there is evidence of a sound systematic approach to and significant achievement of the defined attribute in the assessed process; performance of the process may vary in some areas or work units.
F (Fully achieved), 86 to 100%: there is evidence of a complete and systematic approach to and full achievement of the defined attribute in the assessed process; no significant weaknesses exist across the defined organizational unit.

For example, to determine whether a process has achieved capability level 1, it is necessary to determine the rating achieved by PA1.1 (Process performance attribute). A process that fails to achieve capability level 1 is at capability level 0. A process instance is at capability level k if all PAs below level k are rated 'F' and the level-k attribute(s) are rated 'F' or 'L'. As an example, for a process instance to be at capability level 3, 'F' ratings are required for PA1.1 (Process performance), PA2.1 (Performance management), and PA2.2 (Work product management), and an 'F' or 'L' rating for PA3.1 (Process definition) and PA3.2 (Process resource).
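To make the level-determination rule concrete, the following minimal sketch expresses it in Python. It is an illustration only, not part of the Standard or of the SPICE Trials tooling; the function name, the grouping of PAs per level, and the dictionary-based input format are assumptions made for this sketch.

```python
# Hypothetical sketch of the level-determination rule described above.
# The PA grouping follows Table 1; names and data layout are illustrative only.

PAS_PER_LEVEL = {
    1: ["PA1.1"],
    2: ["PA2.1", "PA2.2"],
    3: ["PA3.1", "PA3.2"],
    4: ["PA4.1", "PA4.2"],
    5: ["PA5.1", "PA5.2"],
}

def capability_level(ratings):
    """Return the capability level (0-5) of a process instance.

    `ratings` maps PA identifiers to 'F', 'L', 'P', or 'N'.
    A process instance is at level k if every PA below level k is rated 'F'
    and every level-k attribute is rated 'F' or 'L'.
    """
    level = 0
    for k in range(1, 6):
        # All attributes of the candidate level must be rated 'F' or 'L' ...
        if not all(ratings.get(pa) in ("F", "L") for pa in PAS_PER_LEVEL[k]):
            break
        # ... and all attributes of the levels below must be rated 'F'.
        below = [pa for j in range(1, k) for pa in PAS_PER_LEVEL[j]]
        if not all(ratings.get(pa) == "F" for pa in below):
            break
        level = k
    return level

# Example from the text: level 3 requires 'F' for PA1.1, PA2.1, PA2.2 and
# 'F' or 'L' for PA3.1 and PA3.2.
print(capability_level({"PA1.1": "F", "PA2.1": "F", "PA2.2": "F",
                        "PA3.1": "L", "PA3.2": "F"}))  # -> 3
```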
2.2. Reliability studies in ISO/IEC 15504 assessments

Concerns about the reliability of process assessments arose early in the history of process assessment models. Skepticism remains in our field, both about the value of process improvement in general and about the credibility of assessment results. Indeed, some critics have argued that little or no evidence exists of the value of process improvement [15], while others have expressed concerns about the reliability of assessment results [2,16]. The critics are correct that credible evidence is vital; however, such evidence does exist. Most studies of reliability in software process assessments have been conducted as part of the SPICE Trials, in which two types of reliability have been investigated during the last ten years. The first type is internal consistency (reliability), which measures the consistency of the PAs as indicators of process capability; this work treats the PA rating scale as a Likert scale. Internal consistency, estimated by Cronbach's alpha [7], is affected by ambiguities in the wording of standard documents and by inconsistencies in their interpretation by assessors [14]. The internal consistency of the PAs was estimated to be 0.90 in the Phase 2 SPICE Trials by El-Emam and Goldenson [9], Jung et al. [21], and Jung and Hunter [23]. The second type is interrater agreement, also called external reliability, which shows the extent to which two assessors or teams of assessors agree when making independent judgments about the same software engineering processes. Cohen's Kappa [8] is the most popular method for describing the strength of agreement with a single summary index.
A series of interrater agreement studies conducted as part of the international SPICE Trials shows reasonably high levels of interrater agreement [10,13]. More recent work by Jung [22] provides further discussion of paradoxes in the interpretation of the Kappa coefficient used in the SPICE studies.

Regarding the ordering in the definition of capability levels, Brodman and Johnson [3] asserted that achievement of the lower levels sustains the implementation of higher-level practices in the SW-CMM; that is, the skills acquired at higher levels build on the skills developed at lower levels. In the same context, Bilotta and McGrew [1] used Guttman scaling to provide an ordered dimension of 121 practices for SW-CMM level 2, which lays out a step-by-step path upward from level 1. Those studies assume that measures of maturity level form a Guttman scale. A basic premise of ISO/IEC 15504 is that higher capability is associated with better project performance and product quality; furthermore, improving to a higher capability is expected subsequently to improve both performance and quality. Testing this premise is an evaluation of the predictive validity of the assessment measurement procedure [6,9,11,12]. Thus, an appropriate ordering of PA practices is essential for evaluating predictive validity.
3. Research methods

3.1. Data collection

Phase 2 of the SPICE Trials used the regional structure defined for the project as a whole, which divides the world into five Regional Trials Centers (RTCs): Canada (including Latin America), Europe (including South Africa), North Asia Pacific (centered on Japan and including Korea), South Asia Pacific (centered on Australia and including Singapore), and the USA. At an earlier stage of the Project there were only four RTCs, with North Asia Pacific and South Asia Pacific forming a single RTC. At the country or state level, Local Trials Coordinators (LTCs) liaised with the assessors and OUs to ensure assessors' qualifications, to answer queries about the questionnaires, and to ensure the timely collection of data. There were 26 such coordinators worldwide during the second phase of the SPICE Trials. The dataset submitted to the International Trials Coordinator (ITC) for each trial included the ratings data from each assessment and the answers to a set of questionnaires that followed each assessment. Lead assessors and OUs completed the questionnaires related to the assessment, the OU, and the project. During the Phase 2 SPICE Trials, there were 70 assessments of 44 organizations from the five regions, as shown in Fig. 2: Europe (24 trials), South Asia Pacific (34 trials), North Asia Pacific (10 trials), USA (1 trial), and Canada/Mexico (1 trial) [21,33,34].
Fig. 2. Assessments and OUs in the regions.
Since more than one assessment occurred in some OUs, the number of OUs was less than the number of assessments. Those assessments covered 169 projects and 691 process instances.

3.2. Analysis methods

3.2.1. Coefficient of reproducibility

The definition of the ISO/IEC 15504 capability levels implies the following:

• There is an ordered level of capability ranging from 0 to 5.
• Higher-level achievement builds on the achievement of the basic and management practices defined at lower capability levels. In other words, ratings of lower-level PAs are higher than or equal to those of higher-level PAs.
• There is no order between ratings of PAs at the same capability level.
• Ratings of lower-level PAs can be partially predicted from ratings of higher-level PAs. The term 'partially' is used because ISO/IEC 15504 employs a four-point ordered rating scale.

These four implications can be represented by Table 3, with the function R (called the rating function) denoting the rating of a PA. The function R of a PA takes the value 'F', 'L', 'P', or 'N'. In this study, sets of ratings consistent with the implications of ISO/IEC 15504 are said to conform to the standard pattern.

Table 3
The definition of a standard pattern of PA ratings

R(PA1.1) ≥ {R(PA2.1), R(PA2.2)} ≥ {R(PA3.1), R(PA3.2)} ≥ {R(PA4.1), R(PA4.2)} ≥ {R(PA5.1), R(PA5.2)}
(The two attributes within a level are not ordered with respect to each other.)

The standard pattern defined in Table 3 requires the 32 cases of pairwise comparison up to capability level 5 to be of the following form:

• R(PA1.1) ≥ all of {R(PA2.1), R(PA2.2), R(PA3.1), R(PA3.2), R(PA4.1), R(PA4.2), R(PA5.1), R(PA5.2)}
• R(PA2.1) ≥ all of {R(PA3.1), R(PA3.2), R(PA4.1), R(PA4.2), R(PA5.1), R(PA5.2)}
• R(PA2.2) ≥ all of {R(PA3.1), R(PA3.2), R(PA4.1), R(PA4.2), R(PA5.1), R(PA5.2)}
• R(PA3.1) ≥ all of {R(PA4.1), R(PA4.2), R(PA5.1), R(PA5.2)}
• R(PA3.2) ≥ all of {R(PA4.1), R(PA4.2), R(PA5.1), R(PA5.2)}
• R(PA4.1) ≥ all of {R(PA5.1), R(PA5.2)}
• R(PA4.2) ≥ all of {R(PA5.1), R(PA5.2)}

If the PA ratings of a process instance satisfy the requirements of the ISO/IEC 15504 framework, the ratings will meet the standard pattern. In practice, however, a standard pattern of ratings is unlikely to be found in all cases. In this study, an error is defined as a discrepancy from the standard pattern. As an example, suppose the ratings of the PAs up to capability level 5 are given as

('F'), ('F', 'L'), ('F', 'L'), ('F', 'F'), ('L', 'L').

Then there are five errors: R(PA2.2) < R(PA3.1), R(PA2.2) < R(PA4.1), R(PA2.2) < R(PA4.2), R(PA3.2) < R(PA4.1), and R(PA3.2) < R(PA4.2). Identifying errors is simply a matter of noting deviations of the observed rating pattern from the standard pattern defined by ISO/IEC 15504. In order to evaluate the goodness of the capability level definition, the observed rating pattern is evaluated against the standard pattern defined in Table 3 using the Coefficient of Reproducibility (CR). CR, denoting 'the extent to which the observed ratings are identical to the standard pattern', is defined as follows [24]:

CR = (1 − Number of errors / Total number of pairwise comparisons) × 100%.

In the above example, the value of CR is 84.38%, i.e. (1 − 5/32) × 100%. The range of CR is 0 to 100%. If all ratings follow the standard pattern, the value of CR is 100%. To determine whether a CR value is good or bad, a satisfactory level of CR needs to be considered. Although there is no absolute threshold, this study uses the CR value to find relative weaknesses in the PA definitions.
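The error counting and CR computation described above can be sketched as follows. This is an illustration only, not the tooling used in the study; the ordinal encoding of the ratings and the function names are assumptions of the sketch.

```python
# Illustrative sketch of the error counting and CR computation described above.
# The ordinal encoding and the names are assumptions, not taken from ISO/IEC 15504.

RANK = {"N": 0, "P": 1, "L": 2, "F": 3}   # ordinal rating scale of Table 2

# The 32 pairwise comparisons implied by the standard pattern (Table 3):
# every lower-level PA must be rated at least as high as every higher-level PA.
LEVELS = [["PA1.1"], ["PA2.1", "PA2.2"], ["PA3.1", "PA3.2"],
          ["PA4.1", "PA4.2"], ["PA5.1", "PA5.2"]]
COMPARISONS = [(low, high)
               for i, lower in enumerate(LEVELS)
               for higher in LEVELS[i + 1:]
               for low in lower for high in higher]   # 32 pairs in total

def coefficient_of_reproducibility(ratings):
    """Count errors against the standard pattern and return (errors, CR in %)."""
    pairs = [(lo, hi) for lo, hi in COMPARISONS
             if lo in ratings and hi in ratings]       # only rated attributes
    errors = sum(RANK[ratings[lo]] < RANK[ratings[hi]] for lo, hi in pairs)
    cr = (1 - errors / len(pairs)) * 100
    return errors, cr

# Worked example from the text: ('F'), ('F','L'), ('F','L'), ('F','F'), ('L','L')
ratings = {"PA1.1": "F", "PA2.1": "F", "PA2.2": "L", "PA3.1": "F", "PA3.2": "L",
           "PA4.1": "F", "PA4.2": "F", "PA5.1": "L", "PA5.2": "L"}
print(coefficient_of_reproducibility(ratings))  # -> (5, 84.375), i.e. CR = 84.38%
```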
It is worthwhile to mention Guttman scaling here. Guttman scaling [24], also known as cumulative scaling or scalogram analysis, arranges a set of items or statements so that a respondent who agrees with any specific question in the list will also agree with all previous questions. Scalogram analysis examines how closely a set of items matches this pattern. The definition of an error in scalogram analysis is the same as our definition, but its error counting method is different from ours. To clarify the difference, assume a dichotomous observed response pattern (+ − + +), where + and − denote positive and negative responses, respectively [24], and suppose that the standard pattern is defined in the scalogram analysis as (+ + + +). In the scalogram analysis, if the second observed response is changed to +, the observed pattern perfectly matches the standard pattern (all positive responses), so the error count is one in Guttman scaling. According to the definition used in this study, on the other hand, there are two errors. We can therefore say that the error counting in this study is stricter than that in scalogram analysis. As a reference, a CR value over 0.9 (i.e. 90%) is recommended in scalogram analysis [24]; this study uses that value as a guideline for the interpretation of our results.

3.2.2. Confidence intervals

Suppose that a random sample of size n has been drawn from a large population and that X observations of this sample show the standard pattern. The proportion of standard patterns (the CR value), p, is then estimated by p = X/n, where p and n are the parameters of a binomial distribution [25]. Since different samples will produce different estimated values of the proportion of standard patterns, we use a confidence interval (CI) to test whether the CR value is lower than a threshold CR value of 90%. Since the developers of ISO/IEC 15504 will likely only take action if the CR value is less than 90%, we are interested in a one-sided CI. If the upper limit of the CI includes 90%, then the null hypothesis of a 90% CR value cannot be rejected at the one-sided α level of 0.05. Since our sample sizes did not satisfy np ≥ 5 and n(1 − p) ≥ 5, as required for the normal approximation in inference about a population proportion [25], we computed an 'exact' CI using the statistical software package StatXact [35]. 'Exact' means that the result is not an approximation based on the normal distribution but an exact CI, the Blyth-Still-Casella interval [5].
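The study computed its exact intervals with StatXact's Blyth-Still-Casella method, which is not available in common open-source libraries. As a rough, purely illustrative stand-in (an assumption of this sketch, not the method used in the paper), the one-sided Clopper-Pearson exact upper limit can be obtained from the beta distribution:

```python
# Illustrative sketch only: the study used StatXact's Blyth-Still-Casella exact
# interval; here a one-sided Clopper-Pearson exact upper limit (a different and
# generally more conservative exact method) is used as a stand-in.
from scipy.stats import beta

def one_sided_upper_limit(matches, n, confidence=0.95):
    """Upper confidence limit for the proportion of standard-pattern matches."""
    if matches == n:
        return 1.0
    return beta.ppf(confidence, matches + 1, n - matches)

# Example: R(PA2.1) >= R(PA3.2) had 634 comparisons with 66 errors (CR = 89.59%).
n, errors = 634, 66
upper = one_sided_upper_limit(n - errors, n)
# The resulting upper limit is close to the 91.52% reported in the paper; because
# it covers the 90% threshold, the hypothesis of a 90% CR cannot be rejected.
print(f"CR = {(n - errors) / n:.2%}, one-sided 95% upper limit = {upper:.2%}")
```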
4. Results

4.1. Descriptive summary

The Phase 2 Trials collected data from 70 assessments involving 44 different organizations, 169 projects, and 691 process instances. Of the 691 individual process instances assessed, ratings were recorded for 689 (2 missing). Sixteen of the 44 OUs were concerned with the production of software or other IT products or services. Using the definition of a small software organization also used in the European SPIRE project (50 or fewer IT staff) [30], we find that 52% (23/44) of the participating OUs were small. There was adequate variation in the sizes (both small and large) of the OUs that participated in the Trials. The median number of process instances per assessment was 6.5. The assessors involved had a broad range of experience, with many having participated in Phase 1 of the Trials, used the SW-CMM, and/or been involved in ISO 9001 audits. Almost all of the competent assessors (93%) had received assessment training in the context of ISO/IEC 15504.
Fig. 3. Frequency of process instance ratings for each process attribute.
The total numbers of process instances, over all the trial assessments, that were rated at each capability level are shown in Fig. 3. During an assessment, it is not always the case that all of the attributes up to capability level 5 are rated, as seen in Fig. 3. As expected, the PAs corresponding to the higher capability levels receive high ratings less often than those corresponding to the lower levels. Of the two attributes at level 2, PA2.1 (Performance management) is more often highly rated than PA2.2 (Work product management), and of the two attributes at level 3, PA3.2 (Process resource) is more often highly rated than PA3.1 (Process definition). At level 4, PA4.1 (Process measurement) seems to be more often highly rated than PA4.2 (Process control); however, the difference is small.

4.2. Analysis results

A total of 28 process instances were omitted from the comparisons because they had ratings only for PA1.1 (Process performance attribute), i.e. they provided no cases for comparison. Table 4 shows the results of the 13,746 pairwise comparisons over the remaining 661 process instances; the only relations with a CR value at or below 90% are R(PA2.1) ≥ R(PA3.2) and R(PA2.2) ≥ R(PA3.2). There are 661 comparisons between the ratings of process instances rated on PA1.1 (Process performance attribute) and PA2.1 (Performance management attribute), as shown in Table 4. Among them, 17 comparisons show a rating of PA1.1 that is lower than that of PA2.1, giving a CR value
of 97.43%. The same interpretation applies to the remaining rows. Note that, since not all of the attributes up to capability level 5 are rated, the number of comparisons is not the same in each case. Overall, out of a total of 13,746 comparisons, 355 do not match the standard pattern, i.e. CR = 97.42%. This CR value is higher than the threshold of 'over 90%' recommended in this study. For R(PA2.1) ≥ R(PA3.2), the upper limit of the CI, 91.52%, covers the threshold CR value of 90%; thus, the hypothesis of a 90% CR value cannot be rejected at a one-sided α level of 0.05. However, since the upper limit of the CI for R(PA2.2) ≥ R(PA3.2), 84.59%, does not cover the threshold CR value of 90%, the hypothesis of a 90% CR value is rejected at the one-sided α level of 0.05.

The two ordering relations R(PA2.1) ≥ R(PA3.2) and R(PA2.2) ≥ R(PA3.2) were investigated further to examine whether the CR value depends on any specific processes defined in ISO/IEC 15504. The results are presented in Table 5. For R(PA2.1) ≥ R(PA3.2), 13 of the 29 processes (44.8%) have a CR value not greater than 90%. For R(PA2.2) ≥ R(PA3.2), 22 processes (75.86%) show a CR value less than or equal to 90%. Since in some cases the CR value may differ from 90% merely as a consequence of sampling variability, we present a one-sided 95% upper limit of the CI in parentheses in Table 5.
Table 4
Coefficient of reproducibility

Relation between process attribute ratings   Total comparisons   Errors   CR (%)
R(PA1.1) ≥ R(PA2.1)    661     17    97.43
R(PA1.1) ≥ R(PA2.2)    661      8    98.79
R(PA1.1) ≥ R(PA3.1)    637      4    99.37
R(PA1.1) ≥ R(PA3.2)    634     10    98.42
R(PA1.1) ≥ R(PA4.1)    393      1    99.75
R(PA1.1) ≥ R(PA4.2)    392      0    100
R(PA1.1) ≥ R(PA5.1)    391      0    100
R(PA1.1) ≥ R(PA5.2)    390      0    100
R(PA2.1) ≥ R(PA3.1)    637     41    93.56
R(PA2.1) ≥ R(PA3.2)    634     66    89.59
R(PA2.1) ≥ R(PA4.1)    393      1    99.75
R(PA2.1) ≥ R(PA4.2)    392      1    99.74
R(PA2.1) ≥ R(PA5.1)    391      1    99.74
R(PA2.1) ≥ R(PA5.2)    390      1    99.74
R(PA2.2) ≥ R(PA3.1)    637     58    90.89
R(PA2.2) ≥ R(PA3.2)    634    113    82.18
R(PA2.2) ≥ R(PA4.1)    393      3    99.24
R(PA2.2) ≥ R(PA4.2)    392      3    99.23
R(PA2.2) ≥ R(PA5.1)    391      0    100
R(PA2.2) ≥ R(PA5.2)    390      1    99.74
R(PA3.1) ≥ R(PA4.1)    393     10    97.46
R(PA3.1) ≥ R(PA4.2)    392      6    98.47
R(PA3.1) ≥ R(PA5.1)    391      1    99.74
R(PA3.1) ≥ R(PA5.2)    390      2    99.49
R(PA3.2) ≥ R(PA4.1)    393      3    99.24
R(PA3.2) ≥ R(PA4.2)    392      1    99.74
R(PA3.2) ≥ R(PA5.1)    391      0    100
R(PA3.2) ≥ R(PA5.2)    390      1    99.74
R(PA4.1) ≥ R(PA5.1)    391     13    96.68
R(PA4.1) ≥ R(PA5.2)    390      9    97.69
Total                13746    355    97.42
For R(PA2.1) ≥ R(PA3.2), the upper limit of the one-sided CI does not cover the 90% threshold in two processes (CUS.4 and CUS.5). For R(PA2.2) ≥ R(PA3.2), the upper limit of the one-sided CI does not meet the recommended CR value in 11 processes (37.93%). The fact that our sample sizes were small means that the CIs are quite large [25], so there is considerable uncertainty in the estimates of CR. Caution should therefore be exercised when interpreting CR values; the interval should be taken into account when determining the confidence that can be placed in the calculated proportion (the larger the interval, the less precise the calculated proportion).
5. Discussion

When the ratings of PA2.1 and PA2.2 are compared against the ratings of PA3.2, the CR value for R(PA2.1) ≥ R(PA3.2) is 89.59% and that for R(PA2.2) ≥ R(PA3.2) is 82.18%, as shown in Table 4. This implies that some PA3.2 practices may be implemented before PA2.2 practices. PA3.2 defines the resources (human resources and process infrastructure) that support
the implementation of the defined process, where human resources are strongly tied to skill-improvement training. ISO/IEC 15504 is a descriptive, not a prescriptive, model; the point can be made with slightly modified wording from [4]: 'Often, the most effective way of moving from one level to the next is to adopt some practices from higher levels. For example, an organization moving to a higher level usually starts by establishing a training program and an infrastructure.' Thus, it is possible to implement practices defined in PA3.2 earlier than those defined in PA2.2.

At the time of publishing, ISO/IEC 15504: Parts 2, 3 and 4 had become International Standards. ISO/IEC 15504 still uses the four-category rating scale and the capability levels of the PDTR version, but the revision of the PDTR into the Standard renamed PA3.2 to 'Process deployment' and included additional practices. One of them is 'Collect and analyse data about performance of the process to demonstrate its suitability and effectiveness.' This can be considered an improvement to the Standard, because such data collection is more compatible with the implementation of the defined process (defined in PA3.1). This change in the new version is expected to increase the CR value for R(PA2.2) ≥ R(PA3.2). This study provides evidence for the validity of these changes.

Our analysis also has an implication for determining ratings in process assessments. When making their own ratings, individual assessors probably check whether the standard pattern is fulfilled. If a rating does not meet the standard pattern, individual assessors may not be sure which category on the four-point rating scale to use; this will most likely happen between adjacent categories on the rating scale. In such a case, individual assessors invest more effort in re-examining the evidence already collected or in collecting further evidence. If a non-standard pattern occurs among independent assessors when determining the rating of a process instance before consolidation, the assessors will expend effort to achieve consensus during consolidation.

Finally, the PAs as measures of capability level should be evaluated under both the Likert-scale and the Guttman-scale assumptions. The previous studies described in Section 2.2 provide a high value of internal consistency under the assumption of Likert scaling. This study adds evidence for the appropriateness of the definition of the PAs under the assumption of Guttman scaling. Results from both internal consistency and CR give confidence in the PAs as measures of ISO/IEC 15504 capability level.
6. Limitations

Our analyses have a number of limitations that should be made clear in the interpretation of our results. These limitations are not unique to our study but are characteristic of most of the process assessment literature; however, they are worth explaining here.
Table 5
CR value (%) with the one-sided 95% upper confidence limit in parentheses, for R(PA2.1) ≥ R(PA3.2) and R(PA2.2) ≥ R(PA3.2), by process

Process | No. of comparisons | R(PA2.1) ≥ R(PA3.2) | R(PA2.2) ≥ R(PA3.2)
CUS.1: Acquire software | 4 | 75.00 (97.40) | 75.00 (97.40)
CUS.2: Manage customer needs | 28 | 100 | 85.71 (93.62)
CUS.3: Supply software | 18 | 94.44 (99.42) | 61.11 (78.39)
CUS.4: Operate software | 12 | 66.67 (84.58) | 41.67 (68.48)
CUS.5: Provide customer service | 16 | 62.50 (81.07) | 50.00 (70.06)
ENG.1: Develop system requirements and design | 17 | 94.12 (99.38) | 94.12 (99.38)
ENG.2: Develop software requirements | 53 | 86.79 (92.90) | 81.13 (89.39)
ENG.3: Develop software design | 43 | 83.72 (91.16) | 76.74 (86.80)
ENG.4: Implement software design | 31 | 87.10 (94.25) | 77.42 (87.50)
ENG.5: Integrate and test software | 31 | 90.32 (96.39) | 87.10 (94.25)
ENG.6: Integrate and test system | 14 | 85.71 (96.13) | 85.71 (96.13)
ENG.7: Maintain system and software | 23 | 82.61 (92.19) | 86.96 (95.11)
SUP.1: Develop documentation | 32 | 84.38 (92.20) | 75.00 (86.91)
SUP.2: Perform configuration management | 43 | 93.02 (97.41) | 88.37 (94.53)
SUP.3: Perform quality assurance | 15 | 100 | 100
SUP.4: Perform work product verification | 14 | 100 | 100
SUP.5: Perform work product validation | 10 | 100 | 100
SUP.6: Perform joint review | 17 | 88.24 (96.83) | 76.47 (89.32)
SUP.7: Perform audits | 8 | 100 | 75.00 (93.14)
SUP.8: Perform problem resolution | 22 | 81.82 (91.83) | 86.36 (94.88)
MAN.1: Manage the project | 60 | 88.33 (93.76) | 88.33 (93.76)
MAN.2: Manage quality | 22 | 100 | 100
MAN.3: Manage risks | 28 | 100 | 96.43 (99.62)
MAN.4: Manage subcontractors | 5 | 100 | 80.00 (97.91)
ORG.1: Engineer the business | 13 | 92.31 (99.19) | 53.85 (75.46)
ORG.2: Define the process | 12 | 91.67 (99.13) | 100
ORG.3: Improve the process | 10 | 90.00 (98.95) | 90.00 (98.95)
ORG.4: Provide skilled human resources | 18 | 94.44 (99.42) | 77.78 (89.94)
ORG.5: Provide S/W engineering infrastructure | 15 | 93.33 (99.30) | 53.33 (75.29)

The total population of assessments and its size cannot be clearly identified, and the assessed organizations were not selected on a random basis. Rather, the assessments in the Phase 2 SPICE Trials are a self-selected sample (i.e. the assessed organizations voluntarily participated in the Phase 2 SPICE Trials to improve their software processes). As a result, the SPICE Trials team could not select the OUs or control the processes assessed. As seen in Fig. 2, the distribution of assessments and OUs over the regions is highly skewed, with only one organization participating in the Phase 2 SPICE Trials in the USA and one in Canada. Nearly 50% of the assessments were from the South Asia Pacific region. Since the dataset we analyzed is not a random sample, a threat to external validity exists with respect to providing evidence for generalizations [36].

In counting errors by pairwise comparison, this study did not consider the severity of a violation of the standard pattern, which can be illustrated by two examples. First, a rating that violates R(PA2.1) ≥ R(PA5.1) is more serious than one that violates R(PA2.1) ≥ R(PA3.1). Second, among violations of R(PA2.1) ≥ R(PA5.1), R(PA2.1) = 'P' with R(PA5.1) = 'F' is more serious than R(PA2.1) = 'L' with R(PA5.1) = 'F'.
Further study including those two factors would improve the evaluation results.
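As a purely illustrative sketch of how those two factors might be incorporated (nothing of the kind was done in this study, and the weighting scheme below is an arbitrary assumption), a severity-weighted error count could combine the number of capability levels spanned by the violated comparison with the size of the rating gap:

```python
# Purely illustrative: a severity-weighted error count reflecting the two factors
# discussed above (distance between capability levels and size of the rating gap).
# The specific weighting scheme is an arbitrary assumption, not part of this study.

RANK = {"N": 0, "P": 1, "L": 2, "F": 3}
LEVEL = {"PA1.1": 1, "PA2.1": 2, "PA2.2": 2, "PA3.1": 3, "PA3.2": 3,
         "PA4.1": 4, "PA4.2": 4, "PA5.1": 5, "PA5.2": 5}

def weighted_error(lower_pa, lower_rating, higher_pa, higher_rating):
    """Return 0 if the pair meets the standard pattern, otherwise a severity weight."""
    gap = RANK[higher_rating] - RANK[lower_rating]
    if gap <= 0:
        return 0  # no violation of the standard pattern
    span = LEVEL[higher_pa] - LEVEL[lower_pa]  # factor 1: capability levels spanned
    return span * gap                          # factor 2: size of the rating gap

# Examples from the text: violating R(PA2.1) >= R(PA5.1) with 'P' vs 'F' is weighted
# more heavily than violating it with 'L' vs 'F'.
print(weighted_error("PA2.1", "P", "PA5.1", "F"))  # -> 6
print(weighted_error("PA2.1", "L", "PA5.1", "F"))  # -> 3
```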
7. Final remarks

PA ratings were analyzed to examine the consistency of the ordering of the PAs. Generally, the results show that the PA ordering is consistent with the definition in ISO/IEC 15504, and the CR values are larger than the 90% guideline adopted in this study; note that the error counting in this study is stricter than Guttman error counting. However, since some organizations implemented PA3.2 practices at a lower capability level, the comparisons between PA2.2 and PA3.2 show the smallest CR value. Assessments in the Phase 2 SPICE Trials were based on the PDTR version. The revision of the PDTR for the International Standard includes better defined practices in PA3.2, which is expected to act as a remedy. This study supports the validity of the ISO/IEC 15504 revisions in the International Standard. Replication studies are necessary to raise confidence in the International Standard.
Acknowledgements

The author wishes to acknowledge the contributions of other past and present members of the Trials team, in particular those of Iñigo Garro, Peter Krauth, Bob Smith, Kyungwhan Lee, Angela Tuffley, and Alastair Walker. The author also wishes to thank all of the LTCs, RTCs, assessors, and sponsors of assessments who made a special effort to ensure that the required data was collected, and who promoted the SPICE Trials in their regions. The author also thanks the two referees and Robin Hunter (University of Strathclyde) for their valuable comments on an earlier version of this manuscript. The research was supported by a Korea University Grant (2003). This support is gratefully acknowledged.
References

[1] J. Bilotta, J. McGrew, A Guttman scaling of CMM level 2 practices: investigating the implementation sequences underlying software engineering maturity, Empirical Software Engineering 3 (2) (1998) 159–177.
[2] T. Bollinger, C. McGowan, A critical look at software capability evaluation, IEEE Software 8 (4) (1991) 25–41.
[3] J.G. Brodman, D.I. Johnson, Return on investment from software process improvement as measured by U.S. industry, Crosstalk 9 (4) (1996).
[4] D.N. Card, Learning from our mistakes with defect causal analysis, IEEE Software 15 (1) (1998) 56–63.
[5] G. Casella, Refining binomial confidence intervals, Canadian Journal of Statistics 14 (1987) 113–129.
[6] E.G. Carmines, R.A. Zeller, Reliability and Validity Assessment, Sage University Paper Series on Quantitative Applications in the Social Sciences, Sage, Newbury, 1979.
[7] L. Cronbach, Coefficient alpha and the internal structure of tests, Psychometrika 16 (3) (1951) 297–334.
[8] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1) (1960) 37–46.
[9] K. El-Emam, D. Goldenson, SPICE: an empiricist's perspective, Proceedings of the Second IEEE International Software Engineering Standards Symposium, IEEE Computer Society Press, Los Alamitos, CA, 1995, pp. 84–97.
[10] K. El-Emam, Benchmarking Kappa: interrater agreement in software process assessments, Empirical Software Engineering: An International Journal 4 (2) (1999) 113–133.
[11] K. El-Emam, A. Birk, Validating the ISO/IEC measure of software development process capability, Journal of Systems and Software 51 (2) (2000) 119–149.
[12] K. El-Emam, A. Birk, Validating the ISO/IEC measure of software requirement analysis process capability, IEEE Transactions on Software Engineering 26 (6) (2000) 541–566.
[13] K. El-Emam, D. Goldenson, An empirical review of software process assessments, Advances in Computers 53 (2000) 319–423.
[14] P. Fusaro, K. El-Emam, B. Smith, The internal consistencies of the 1987 SEI maturity questionnaire and the SPICE capability dimension, Empirical Software Engineering: An International Journal 3 (2) (1997) 179–201.
[15] M. Fayad, M. Laitinen, Process assessment considered wasteful, Communications of the ACM 40 (11) (1997) 125–128.
[16] E. Gray, W. Smith, On the limitations of software process assessment and the recognition of a required reorientation for global process improvement, Software Quality Journal 7 (1) (1998) 21–34.
[17] J. Herbsleb, D. Zubrow, D. Goldenson, W. Hayes, M. Paulk, Software quality and the capability maturity model, Communications of the ACM 40 (6) (1997) 30–40.
[18] ISO/IEC JTC1 Directives, Procedures for the Technical Work of ISO/IEC JTC1, ISO, Geneva, Switzerland, 1999. http://www.jtc1.org/directives/toc.htm
[19] ISO/IEC PDTR 15504, Information Technology—Software Process Assessment: Part 1–Part 9, ISO/IEC JTC1/SC7/WG10, 1996.
[20] ISO/IEC FDIS 15504, Information Technology—Software Process Assessment: Part 1–Part 5, ISO, Geneva, Switzerland, 2002 (Parts 2 and 3 were published as International Standards in 2003 and 2004, respectively).
[21] H.-W. Jung, R. Hunter, D. Goldenson, K. El-Emam, Findings from Phase 2 of the SPICE Trials, Software Process Improvement and Practice 6 (2) (2001) 205–242.
[22] H.-W. Jung, Evaluating the interrater agreement in SPICE-based software process assessments, Computer Standards and Interfaces 25 (2003) 477–499.
[23] H.-W. Jung, R. Hunter, Evaluating the SPICE rating scale with regard to the internal consistency of capability measures, Software Process Improvement and Practice, to appear.
[24] J.P. McIver, G.E. Carmines, Unidimensional Scaling, Sage University Paper Series on Quantitative Applications in the Social Sciences, Sage, Newbury, 1981.
[25] D.C. Montgomery, G.C. Runger, N.F. Hubele, Engineering Statistics, Wiley, New York, 1998.
[26] F. Niessink, H. van Vliet, Software maintenance from a service perspective, Journal of Software Maintenance: Research and Practice 12 (2) (2000) 103–120.
[27] M. Paulk, B. Curtis, M. Chrissis, C. Weber, Capability Maturity Model for Software, version 1.1, Technical Report CMU/SEI-93-TR-024, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 1993.
[28] M. Paulk, C. Weber, S. Garcia, M. Chrissis, M. Bush, Key Practices of the Capability Maturity Model, version 1.1, Technical Report CMU/SEI-93-TR-025, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 1993.
[29] S.-L. Pfleeger, N. Fenton, S. Page, Evaluating software engineering standards, Computer 27 (9) (1994) 71–79.
[30] M. Sanders (Ed.), The SPIRE Handbook: Better, Faster, Cheaper Software Development in Small Organizations, The SPIRE Project Team (ESSI Project 23873), 1998.
[31] SEI, CMMI for Systems Engineering/Software Engineering/Integrated Product and Process Development/Supplier Sourcing, version 1.1 (CMMI-SE/SW/IPPD/SS, V1.1), Continuous Representation (CMU/SEI-2002-TR-011) and Staged Representation (CMU/SEI-2002-TR-012), Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 2002.
[32] I. Sommerville, P. Sawyer, Requirements Engineering: A Good Practice Guide, Wiley, New York, 1997.
[33] SPICE Trials, Phase 2 SPICE Trials Interim Report, Version 1.00, ISO/IEC JTC1/SC7/WG10, 1998. http://www.cis.strath.ac.uk/research/papers/SPICE/p2rp100pub.pdf
[34] SPICE Trials, Phase 2 SPICE Trials Final Report, Vol. 1, ISO/IEC JTC1/SC7/WG10, 2003. http://www.cis.strath.ac.uk/research/papers/SPICE/p2v2rp100.pdf
[35] StatXact-4 for Windows, Software for Exact Nonparametric Inference, Cytel Software Corporation, Cambridge, MA, 1998.
[36] W. Trochim, The Research Methods Knowledge Base, 2nd ed., atomicdogpublishing.com, 2001.