The Journal of Systems and Software 63 (2002) 173–186 www.elsevier.com/locate/jss
Analyzing software science data with partial repeatability
Kai-Yuan Cai *, Lei Chen
Department of Automatic Control, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
Received 1 July 2000; received in revised form 1 September 2000; accepted 18 October 2000
Abstract

Halstead's software science postulates that there exist physics-like laws that govern each piece of software. In this paper we reexamine this postulate using two datasets collected from real programs, and argue that software science data exhibit partial repeatability. Conventional sciences embody the nature of full repeatability in the sense that their claims can either be proved repeatably in mathematics or be validated repeatably to a high accuracy in physics (experimentally). By partial repeatability we mean that complex phenomena may demonstrate an invariant property that can neither be proved in mathematics nor validated to a high accuracy in physics, but still (partially) governs the behavior of the phenomena. We propose a new kind of mathematical model, namely the parepeatic model, to characterize partial repeatability quantitatively. A parepeatic model defines the relationship between a central function and a fluctuation zone, and identifies the degree of correctness of the relationship without making any statistical assumption. We develop parepeatic models for the relationships among several program complexity measures, including the number of distinct operators, the number of distinct operands and the program length, among others, and present some new findings about these relationships. An illustrative case study shows that the developed parepeatic models can genuinely help software engineering practice. © 2002 Elsevier Science Inc. All rights reserved.

Keywords: Software measurement; Software complexity; Halstead's software science; Full repeatability; Partial repeatability; Parepeatic model; Type III laws; Genetic algorithm
1. Introduction

Halstead argued that there existed physics-like laws governing each piece of software and proposed a theory of so-called software science (Halstead, 1977). Software science suggested a set of software metrics, such as the numbers of operators and operands, and relations among them. Halstead's theory later aroused many controversies. Both the program length equation and the software defect formula were examined. Some researchers followed Halstead's line and tried to validate his theory. Fitzsimmons and Love summarized the major convincing results up to 1978 and argued that the software science measures might be inaccurate when applied to individual programs, but become more accurate when applied to large numbers of programs,

q Supported by the National Outstanding Youth Foundation of China, the Key Project of China and the "863" Programme of China.
* Corresponding author. Tel./fax: +86-10-8231-7328. E-mail address: [email protected] (K.-Y. Cai).
such as were found in large software development projects (Fitzsimmons and Love, 1978). Ottenstein presented a derivation of how the coefficient in Halstead's defect formula should be 300 (Ottenstein, 1979). More positive results can be found in the March 1979 issue of IEEE Transactions on Software Engineering (IEEE, 1979). Some researchers opposed Halstead's theory and presented negative evidence. Johnston and Lister showed that the program length equation was not as well established as often claimed, and that it might not hold for programs written in 'structured' languages (such as Pascal) unless a counter-intuitive counting scheme were adopted (Johnston and Lister, 1981). Coulter criticized Halstead's theory from the perspective of cognitive psychology (Coulter, 1983). Shen et al. presented a critical analysis of Halstead's theory and the corresponding empirical evidence (Shen et al., 1983). Oldehoeft and Bass argued that Halstead's software science did not give any indication of the execution performance of algorithms as written in a programming language or
0164-1212/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved. PII: S0164-1212(02)00013-4
provide a means for comparing the translation of a program in one language into machine language (Oldehoeft and Bass, 1979). Triantafyllos et al. reported that the percentage of modules whose errors, as predicted by Halstead's theory, fell within 10% of the actual values was less than 10% for all modules and all subunits of concern (Triantafyllos et al., 1995). There were also researchers who partially disproved Halstead's theory and suggested modifications, extensions or alternative theories. Felician and Zalateu argued that Pascal programs seemed to fit only partially with Halstead's theory (Felician and Zalateu, 1989). Levitin proposed two estimators for the Halstead program length that use the number of distinct operands only (Levitin, 1985). Shooman presented alternative estimates of the Halstead program length and related Halstead's theory to information theory (Shooman, 1983, Chapter 3). Oldehoeft and Bass proposed "dynamic software science", which related the work performed by a program during execution to its expression in a programming language (Oldehoeft and Bass, 1979). Cai proposed a probabilistic model for estimating the Halstead program length (Cai, 1998a, Section 4.6, Remarks 23). He also proposed statistical versions of the Halstead formulae (Cai, 1998b). Keller-McNulty et al. developed a Markov chain model for estimating the Halstead program length (Keller-McNulty et al., 1991). Alternative defect formulae and extensions of Halstead's theory can also be found in Schneider (1981), Ottenstein (1979), Ottenstein (1981), Lipow (1982), Lipow (1986), Hatton (1997) and Compton and Withrow (1990). Following the observation that Halstead's theory might provide more qualitative than quantitative information (Cai, 1998b), in this paper we neither try to validate Halstead's theory nor oppose it. Instead, we take a position lying between the two sides.
In conjunction with the recent growing attention to software measurement for software process improvement in the software engineering community, we reexamine Halstead's theory in particular, and software data in general, from a partial-repeatability perspective. Unlike the original work of Halstead, which made various assumptions about program properties (Halstead, 1977), we make no assumptions beforehand and let the software data explain themselves. While conventional sciences treat full repeatability as their top criterion, we argue that software data exhibit partial repeatability in nature as a result of the high complexity of software. Full repeatability means that a property or an argument can either be proved repeatably in mathematics or be validated to a high accuracy in physics. By partial repeatability we mean an invariant property of complex phenomena that can neither be proved in mathematics nor validated to a high accuracy in physics, but still (partially) governs the behavior of the phenomena.
The importance of investigating partial repeatability comes from the fact that it is a fundamental phenomenon in software engineering that has seldom been examined in the literature. The concept of partial repeatability may lead to a new scientific treatment of software engineering. With this potential in mind, in this paper we aim to achieve several objectives. First, the concept of partial repeatability is formulated and the rationale behind it is addressed. Second, empirical evidence for partial repeatability in software science data is presented. Third, a modeling approach to partial repeatability, namely the parepeatic approach, 1 is introduced. This approach is essentially different from those commonly adopted in analyzing software measurement data, such as box plots, scatter plots, control charts, regression analysis, principal component analysis, etc. 2

Section 2 introduces the concept of partial repeatability and the underlying rationale. Section 3 presents two datasets collected from real programs; these datasets are used throughout the rest of this paper to justify the partial-repeatability perspective. Section 4 examines a major formula in Halstead's theory using one of the two datasets. Section 5 proposes a mathematical framework, the parepeatic model, for characterizing partial-repeatability phenomena. Section 6 develops the parepeatic version of the Halstead program length equation. Section 7 revisits the topic of Section 4 from a partial-repeatability perspective. Section 8 presents an illustrative case study to demonstrate the usability of the proposed framework. Concluding remarks are given in Section 9.
2. Full repeatability versus partial repeatability

Full repeatability can be observed widely in mathematics and physics. An argument with full repeatability should have a proof or verification procedure that can be carried out by different human individuals independently; thus the correctness of the argument should be independent of human intention or intervention. Fermat's Last Theorem, that no positive integer solutions exist for the equation $x^n + y^n = z^n$ when n is an integer greater than 2, embodies the nature of full repeatability, since the theorem can be rigorously proved in mathematics by different mathematicians. 3 Newton's second law

1 Parepeatics is an abbreviated term for partial repeatability; accordingly, 'parepeatic' is the adjective form of 'parepeatics'.
2 Refer to, e.g., Fenton and Pfleeger (1997, Chapter 6) for a brief introduction to conventional mathematical analysis techniques for software measurement data.
3 Different mathematicians can repeat a single proof or present different proofs but repeatedly draw a single conclusion. In the field of automated machine reasoning, one may expect that a mathematical theorem can be proved (repeatedly) mechanically.
$F = ma$ embodies the nature of full repeatability, since the law can be validated to a high accuracy by various independent physical experiments in the macroscopic physical world. Full repeatability usually leads to some quantitative equality that holds to a high accuracy among the variables of concern, e.g., $y = g(x)$. It also implies that substitution can take place without impairing the related equalities: $y = g(x)$ and $z = f(y)$ imply $z = f(g(x))$. It is full repeatability, or high quantitative accuracy, that enables physics and science at large to gain enduring respect and reputation. Full repeatability is an essential and fundamental feature of contemporary sciences and scientific methods in general, as observed by Spielberg and Anderson (Spielberg and Anderson, 1995, p. 14): "Scientific knowledge must be verifiable or testable, preferably in quantitative terms. Miracles, on the other hand, cannot be explained scientifically; otherwise they would not be miracles. In general, miracles are not repeatable, whereas scientific phenomena are. The reason that science has been so successful in its endeavors is that it has limited itself to those phenomena or concepts that can be tested repeatably, if desired."

Unfortunately, we can hardly find a software science formula (Halstead, 1977) that embodies the nature of full repeatability. Any software science formula, e.g., the program length equation, has exceptions, in the sense that some software demonstrates a scenario significantly different from the equation. On the other hand, although exceptions and variations are always present, there is still some almost invariant feature that can be captured from software science data. Software science data are not miracles; otherwise one would not be able to justify the great success of software engineering practice. There should be something lying between full-repeatability phenomena and miracles.
To distinguish it from both full repeatability and miracles, we coin a new term: partial repeatability (Cai and Liao, 1999). Several reasons explain the observation of partial repeatability. First, things may be extremely complicated and beyond the limits of human comprehension (at least for the time being). In fluid mechanics, turbulent flows have long been recognized as a hard problem, far beyond being fully comprehended, although the Navier-Stokes equations have long existed (Chevray and Mathieu, 1993). In software engineering, it is common sense that software is extremely complicated in some sense. Even for a small program, there may be an astronomically large number of program paths. The existence of various kinds of software complexity measures (Zuse, 1990), or the lack of a unified software
complexity measure, introduces an extra dimension of complexity. Second, there are various kinds of uncertainty associated with things. In software engineering, Lehman argued that there are various kinds of uncertainty, including Gödel-like uncertainty, Heisenberg-like uncertainty and pragmatic uncertainty, and that consequently one might speak of an uncertainty principle of computer application (Lehman, 1991). With so many sources of uncertainty, how can one assure that everything is fully repeatable, even in a statistical sense? Third, things tend to evolve. Consider a human face. The bones may evolve slowly; the skin somewhat faster. The face pattern continues to evolve with age. One cannot assure that the face at two different time instants will be fully identical, although it may be essentially similar. In software engineering, Lehman argued for a law of continuing change: an E-program that is used must be continually adapted, else it becomes progressively less satisfactory (Lehman, 1996). Fourth, repeatable things may become only partially repeatable because of the granularity of description and the level of abstraction. A fish course may be repeatable if only the species of fish is considered. It may be only partially repeatable if more dimensions (e.g., cuisine) are involved: no chef can prepare a fish course twice with exactly identical sweetness or saltiness. Last, but not least, the failure of full repeatability does not mean that there is no repeatability at all. There are still features that often recur in things, although we may not be sure how often a feature will emerge. In summary, complex phenomena or things may comprise the repeatable as well as the unrepeatable. When the repeatable (even in the statistical sense) dominates and the unrepeatable can be neglected, the nature of full repeatability is observed and scientific laws in the conventional sense emerge.
When the unrepeatable dominates and the repeatable can be neglected, miracles emerge. When both the repeatable and the unrepeatable play a significant role, the nature of partial repeatability emerges. If we treat the deterministic physical laws as type I laws (where causality dominates) and the statistical physical laws as type II laws (where randomness dominates), then we can treat the laws governing phenomena of partial repeatability as type III laws. In quantitative terms, a relation or formula describing a type III law is not valid to a high accuracy, and substitution operations on variables may lead to significant errors.
3. Software science datasets

We developed a software science data collection tool and applied it to two sets of C language program files,
one comprising 286 files and the other comprising 199 files (Chen, 1999). These files came mainly from three sources: files written by undergraduate students (stored in a computer experiment laboratory), standard function libraries accessible to the public, and several files from a commercial software package. Each of the two datasets of C language program files was collected independently, and no stringent constraints were imposed beforehand on the type or the number of files. The driving force behind collecting data from C language program files was to examine whether there exist any relationships among various software complexity measures from a partial-repeatability perspective, rather than to examine how the type of a C language program file would affect the possible relationships. 4 The measures of concern include:

c1: lines of code,
c2: lines of comment,
c3: lines of data declaration,
c4: number of decisions/jumps,
c5: number of distinct operands (i.e., $\eta_2$ in the subsequent sections),
c6: number of distinct operators (i.e., $\eta_1$ in the subsequent sections),
c7: number of total usages of operands (i.e., $N_2$ in the subsequent sections),
c8: number of total usages of operators (i.e., $N_1$ in the subsequent sections).
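The four Halstead counts (c5-c8) reduce to tallying classified tokens. A minimal sketch of this bookkeeping follows; the tokenizer itself is assumed to exist already, since classifying C tokens into operators and operands is precisely where the counting controversies cited in Section 1 arise:

```python
from collections import Counter

def halstead_counts(operators, operands):
    """Compute the four basic Halstead counts (c5-c8) from
    pre-classified token streams.

    The classification of tokens into operators and operands is
    assumed to have been done by an external tokenizer.
    """
    op_counts = Counter(operators)
    od_counts = Counter(operands)
    eta1 = len(op_counts)          # c6: distinct operators
    eta2 = len(od_counts)          # c5: distinct operands
    N1 = sum(op_counts.values())   # c8: total operator usages
    N2 = sum(od_counts.values())   # c7: total operand usages
    return eta1, eta2, N1, N2

# Toy token streams for `x = x + 1; y = x * 2;`
operators = ['=', '+', ';', '=', '*', ';']
operands = ['x', 'x', '1', 'y', 'x', '2']
print(halstead_counts(operators, operands))  # (4, 4, 6, 6)
```

The program length is then N = N1 + N2, matching the c7 + c8 sum used in the histograms below.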
Fig. 1a and b depict the histograms of c1 to c8 for the first set of C language program files (denoted DS1 hereafter) and the second set (denoted DS2 hereafter). From the histograms we note:
1. With respect to lines of code (c1), the McCabe complexity measure (c4), or the Halstead program length (c7 + c8), the C language program files we handle are small ones.
2. From the histograms of c5 and c6, we see that small programs may mean small numbers of distinct operands, but things may be different for the numbers of distinct operators. Of course, this observation needs further justification by examining the possible relationships between c5 and other complexity measures, or between c6 and other complexity measures, program by program.
3. All the histograms look similar except that of c6. It seems that there do exist some essential relationships among different program complexity measures.
4. DS1 and DS2 show a similar pattern of histogram for each of the complexity measures, even for c6. This suggests that the possible relationships among different program complexity measures should be testable across different sets of programs.
5. No histogram is close to a Normal distribution. This suggests that statistical analysis of software science data would face some difficulty.

4 We by no means suggest that the latter question is unimportant. Actually, this question must be examined if we want the possible relationships to be useful for software design. However, we prefer to leave it to future research.

4. Partial repeatability in software science data

Halstead's program length equation has been examined extensively in the literature (refer to Section 1). In this section we take a look at another major formula in Halstead's software science, concerned with the relation between operators and operands (Halstead, 1977). For a given piece of software, let $\eta_1$ be the number of distinct operators and $\eta_2$ the number of distinct operands. Halstead suggested (Halstead, 1977, Chapter 4)

$\eta_2 = A\eta_1 + B \quad (4.1)$

with

$A = \frac{\eta_2^*}{\eta_2^* + 2}, \qquad B = \frac{\eta_2^*}{2}\log_2\frac{\eta_2^*}{2} - 2A$

where $\eta_2^*$ is evaluated as the number of conceptually distinct operands involved. 5 Since $\eta_2^*$ is not precisely defined or counted in DS1 and DS2, we are unable to validate or invalidate relation (4.1) directly. Instead, supposing relation (4.1) is correct, we use DS1 to determine the values of $\eta_2^*$. That is, for each pair $\langle \eta_1, \eta_2 \rangle$, we solve the equation

$\eta_2 = \frac{\eta_2^*}{\eta_2^* + 2}(\eta_1 - 2) + \frac{\eta_2^*}{2}\log_2\frac{\eta_2^*}{2} \quad (4.2)$

and obtain $\eta_2^*$. Accordingly, we can depict the corresponding relation between $\eta_2$ and $\eta_2^*$ in Fig. 2. In turn, by using the least squares method, we can obtain the following relation to fit the dataset of $\langle \eta_2, \eta_2^* \rangle$:

$\hat{\eta}_2^* = 0.0006(\eta_2)^2 + 0.1583\eta_2 + 2.7897 \quad (4.3)$

The dotted curve shown in Fig. 3 depicts the fitted relation between $\eta_2$ and $\eta_2^*$.

5 Actually, $\eta_2^*$ was not precisely defined in Halstead's work (Halstead, 1977).
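Eq. (4.2) has no closed-form solution for $\eta_2^*$, but it can be solved numerically for each observed pair $\langle \eta_1, \eta_2 \rangle$. A sketch using simple bisection follows; the bracket width and the uniqueness of the root are assumptions of the sketch, not claims from the paper:

```python
import math

def eq42_residual(s, eta1, eta2):
    # Residual of Eq. (4.2): eta2 = (s/(s+2))*(eta1-2) + (s/2)*log2(s/2),
    # where s stands for eta2* (number of conceptually distinct operands).
    return (s / (s + 2.0)) * (eta1 - 2.0) + (s / 2.0) * math.log2(s / 2.0) - eta2

def solve_eta2_star(eta1, eta2, lo=1e-9, hi=1e6, tol=1e-10):
    """Solve Eq. (4.2) for eta2* by bisection.

    The residual is roughly -eta2 near zero and grows without bound,
    so a sign change exists on a wide enough bracket; a single root is
    assumed here for the sake of the sketch.
    """
    flo = eq42_residual(lo, eta1, eta2)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        fmid = eq42_residual(mid, eta1, eta2)
        if abs(fmid) < tol:
            return mid
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. a file with 20 distinct operators and 50 distinct operands
s = solve_eta2_star(20, 50)
print(round(s, 3))
```

Repeating this for every file in DS1 yields the $\langle \eta_2, \eta_2^* \rangle$ scatter of Fig. 2, to which the quadratic (4.3) is then fitted.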
Fig. 1. Histograms of software complexity data of (a) DS1 and (b) DS2.
Fig. 2. $\eta_2$ versus $\eta_2^*$ for dataset DS1, with $\eta_2^*$ determined by Eq. (4.2). $\eta_2$: horizontal axis; $\eta_2^*$: vertical axis.
Fig. 3. $\eta_2$ versus $\eta_2^*$ for dataset DS1, with fitted curve $\hat{\eta}_2^* = 0.0006(\eta_2)^2 + 0.1583\eta_2 + 2.7897$.
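Substituting the fitted curve of Eq. (4.3) back into Eq. (4.2), solved for $\eta_1$, gives a predictor of the number of distinct operators from $\eta_2$ alone. A sketch of this substitution chain follows; the sample $\eta_2$ values are purely illustrative:

```python
import math

def eta2_star_hat(eta2):
    # Fitted curve of Eq. (4.3): eta2* estimated from eta2 alone.
    return 0.0006 * eta2**2 + 0.1583 * eta2 + 2.7897

def F(eta2):
    """Substitute the fitted eta2* into Eq. (4.2) solved for eta1.

    If Eqs. (4.2) and (4.3) were both accurate, F(eta2) would predict
    the number of distinct operators eta1.
    """
    s = eta2_star_hat(eta2)
    return ((s + 2.0) / s) * (eta2 - (s / 2.0) * math.log2(s / 2.0)) + 2.0

for eta2 in (20, 50, 100):
    print(eta2, round(F(eta2), 2))
```

Comparing F(eta2) against the measured eta1 across a dataset is exactly the test carried out below with DS1.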
The corresponding correlation coefficient is $\rho(\eta_2^*, \hat{\eta}_2^*) = 0.9705$.

According to Halstead's software science, $\eta_2^*$ can be determined in terms of $\eta_1$ and $\eta_2$. On the other hand, DS1 suggests that $\eta_2^*$ could be determined by $\eta_2$ alone. This implies that there is a coherent relation between $\eta_1$ and $\eta_2$. Specifically,

$\eta_1 = \frac{\eta_2^* + 2}{\eta_2^*}\left(\eta_2 - \frac{\eta_2^*}{2}\log_2\frac{\eta_2^*}{2}\right) + 2$

or

$F(\eta_2) = \eta_1 \quad (4.4)$

with

$F(\eta_2) = \frac{\hat{\eta}_2^* + 2}{\hat{\eta}_2^*}\left(\eta_2 - \frac{\hat{\eta}_2^*}{2}\log_2\frac{\hat{\eta}_2^*}{2}\right) + 2, \qquad \hat{\eta}_2^* = 0.0006(\eta_2)^2 + 0.1583\eta_2 + 2.7897$

In other words, if Eq. (4.1) or (4.2) holds, then Eq. (4.4) should hold as well, at least with respect to DS1. To examine this observation, we use DS1 to test Eq. (4.4). Fig. 4 depicts the distribution of $\langle \eta_1, F(\eta_2) \rangle$ for DS1. Obviously, the distribution deviates significantly from the relation $F(\eta_2) = \eta_1$. The correlation coefficient between $\eta_1$ and $F(\eta_2)$ is $\rho(\eta_1, F(\eta_2)) = 0.87924$, which is smaller than $\rho(\eta_2^*, \hat{\eta}_2^*)$.

From the preceding results we can see that Eq. (4.1) is not valid with respect to DS1; there must be significant errors associated with Eq. (4.1). Further, the substitution of one equation into another may lead to unacceptable results if the equations are not accurate. This can be justified by the behavior of Eq. (4.4), which comes from the substitution of Eq. (4.3) into Eq. (4.2). In general, if $z \approx f(y)$ and $y \approx g(x)$, there is no guarantee that $z \approx f(g(x))$, although the relation $\approx$ is not accurately defined here. How, then, do we explain the failure of Eq. (4.1) or (4.2)? Can a better equation be worked out? We will revisit the relation between distinct operators and distinct operands from a partial-repeatability perspective in Section 7.

5. Parepeatic models
The results presented in the last section speak for themselves. Software science data exhibit partial repeatability, and the substitution operation in software science formulae may lead to significant errors. In this section we propose a new kind of mathematical model, namely the parepeatic model, to describe partial-repeatability phenomena. We will see that parepeatic models are not statistical models.

5.1. Parepeatic models

By a parepeatic model we mean a mathematical model or scheme describing phenomena of partial repeatability. Obviously, there can be no universally applicable parepeatic model; specific parepeatic models must be developed for specific applications. Here we present a class of parepeatic models for software science data.

Let $U = \{u\}$ be a set of objects of concern constituting the universe of discourse. For each object a number of variables or attributes is defined. Specifically, $x_i : U \to D_i$; that is, variable $x_i$ takes values in $D_i$, $i = 1, 2, \ldots, m$. Further, a central function or relation f is defined such that $f : D_1 \times D_2 \times \cdots \times D_m \to R$, or simply $f(x_1, x_2, \ldots, x_m) = f(X) \in R$, where $X = (x_1, x_2, \ldots, x_m)^T$ and R denotes the real line. Let $E : D_1 \times D_2 \times \cdots \times D_m \to P(R)$, where $P(R)$ denotes the power set of R. In other words, given $x_1, x_2, \ldots, x_m$, $E(x_1, x_2, \ldots, x_m)$ or $E(X)$ determines a subset of R. Then we can say that the 8-tuple $\langle U, D, R, X, f, E, \in, c \rangle$ makes up a parepeatic model, where $D = D_1 \times D_2 \times \cdots \times D_m$, $E(X)$ is referred to as the fluctuation zone, $\in$ defines the relation $f(X) \in E(X)$, and $c \in [0, 1]$ is interpreted as the correctness factor or confidence factor of the parepeatic model.

Fig. 4. $\eta_1$ versus $F(\eta_2)$. $\eta_1$: horizontal axis; $F(\eta_2)$: vertical axis. Fitted straight line: $F(\eta_2) = 0.7390\eta_1 + 4.0425$.
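For computational purposes the 8-tuple reduces to a function f, a zone map E, and a correctness factor measured over observations. A minimal sketch follows; the linear predictor and the data points in the usage example are hypothetical, not taken from DS1:

```python
class ParepeaticModel:
    """A parepeatic model <f(X) in E(X), c>.

    f maps an observation X to a real number; E maps X to an interval
    (lo, hi), the fluctuation zone.  The correctness factor c is not
    assumed in advance: it is measured against a set of observations,
    with no statistical assumptions involved.
    """

    def __init__(self, f, E):
        self.f = f
        self.E = E

    def fits(self, X):
        lo, hi = self.E(X)
        return lo <= self.f(X) <= hi

    def correctness_factor(self, observations):
        # Fraction of observations satisfying f(X) in E(X).
        hits = sum(1 for X in observations if self.fits(X))
        return hits / len(observations)

# Shaped like the example in the text: f(eta1, eta2) = eta1 - F(eta2)
# with a constant fluctuation zone [-10, 10].  The linear stand-in for
# F and the (eta1, eta2) pairs below are hypothetical.
model = ParepeaticModel(
    f=lambda X: X[0] - (0.7 * X[1] + 4.0),   # hypothetical predictor
    E=lambda X: (-10.0, 10.0),
)
data = [(20, 30), (25, 28), (60, 30)]        # hypothetical observations
print(model.correctness_factor(data))
```

Observations that fail `fits` are exactly the 'outliers' of the remarks in Section 5.3; widening E raises c at the price of a less informative model.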
Consider an example parepeatic model. Suppose U comprises all the software programs of concern and $D = R^2 = R \times R$. Let $x_1 = \eta_1$ and $x_2 = \eta_2$; that is, $x_1$ denotes the number of distinct operators in a program, and $x_2$ the number of distinct operands. Referring to Eq. (4.4), we define

$f(\eta_1, \eta_2) = \eta_1 - F(\eta_2)$

and

$E(\eta_1, \eta_2) \equiv [-10, 10]$

Further, suppose c = 0.7; then we arrive at a parepeatic model, with $f(\eta_1, \eta_2) \in E(\eta_1, \eta_2)$ or $-10 \le f(\eta_1, \eta_2) \le 10$. We can say that the parepeatic model, or the relation $f(\eta_1, \eta_2) \in E(\eta_1, \eta_2)$, is correct to an extent of 0.7. There are always programs that fail the relation $f(\eta_1, \eta_2) \in E(\eta_1, \eta_2)$. The relation cannot be proved in mathematics. Neither can it be validated by reasonable or repeatable physical experiments to a high accuracy: it is rather difficult to define or design an accurate physical experiment for the validation purpose, since human intervention is an integral part of the experiment.

From the above formulation we see that the core of a parepeatic model is the relation $f(X) \in E(X)$ and the confidence factor c. So, if no confusion can arise, the parepeatic model can simply be denoted $\langle f(X) \in E(X), c \rangle$. Here we do not confine c to a particular interpretation; it should be interpreted with respect to the application context. Note that even for the degree of software correctness, different interpretations apply: the Bayesian perspective, the program variant perspective, and the input domain perspective (Cai, 1998a, Section 9.5).

5.2. Identification of parepeatic models

Suppose n observations of X, i.e., $X^{(1)}, X^{(2)}, \ldots, X^{(n)}$, are available. In order to work out a parepeatic model $\langle f(X) \in E(X), c \rangle$ describing the observations, we need to determine the central function f(X) first. This can be accomplished by using conventional techniques, e.g., the least squares method. By referring to Fig.
4, we can treat the central function as the fitted straight line

$f(\eta_1, \eta_2) = F(\eta_2) - a\eta_1 - b$

Of course, we may also express $f(\eta_1, \eta_2)$ using a polynomial, e.g.,

$f(\eta_1, \eta_2) = F(\eta_2) - a_0 - a_1\eta_1 - a_2(\eta_1)^2 - \cdots - a_k(\eta_1)^k$

The next step is to determine the fluctuation zone $E(X)$ or the relation $f(X) \in E(X)$. If we let $E(X) = R$, then the relation $f(X) \in E(X)$ always holds and c = 1, but the resulting parepeatic model would be of no use in practice. If we make the fluctuation zone 'smaller', then more observations of X will invalidate the relation $f(X) \in E(X)$, and consequently the correctness factor c is reduced. Ideally, we hope that the fluctuation zone is as 'small' as possible and the correctness factor of the resulting parepeatic model is as high as possible. Unfortunately, these are two conflicting goals and we must balance them.

Consider the last example again (refer to Fig. 4). Suppose the fluctuation zone is specified as

$E(\eta_1, \eta_2) = \left[-a(\eta_1)^b, a(\eta_1)^b\right]$

Two parameters, a and b, are to be determined. Let $A(a, b)$ be the area of $E(\eta_1, \eta_2)$ within the region bounded by the horizontal axis, the vertical axis, the horizontal line $y = \max F(\eta_2)$, and the vertical line $x = \max \eta_1$. Obviously, we hope that $A(a, b)$ is small. On the other hand, let

$\delta(\eta_1, \eta_2) = 1$ if $f(\eta_1, \eta_2) \in E(\eta_1, \eta_2)$, and 0 otherwise;

$c = \frac{1}{n}\sum_{i=1}^{n} \delta(\eta_1^{(i)}, \eta_2^{(i)}), \qquad q = 1 - c$

where superscript (i) corresponds to the ith observation of $(\eta_1, \eta_2)$. In order to balance the conflicting requirements of $E(\eta_1, \eta_2)$ and c, we choose the cost function as 6

$J(a, b) = A(a, b) + Qq$

where Q is a weighting coefficient. The desired values of a and b should minimize the cost function. Since q is a discontinuous function of a and b, we employ genetic algorithms to do the optimization. Here we must stress that the structure of $E(\eta_1, \eta_2)$ and the definition of q proposed above make up only one of many possible choices. For example, we may define $q = 1 - c^2$.

In summary, a parepeatic model can be identified as follows:
1. Determine a central function f(X) by using conventional methods. 7
2. Tentatively choose a structure of fluctuation zone E(X).
3. Define an index q.
4. Choose a cost function J.
5. Employ a genetic algorithm to optimize J.
6. Redo the preceding steps if the resulting parepeatic model is not satisfactory.

6 The cost function will be slightly modified in the subsequent sections.
7 This step is not essential. Steps (1) and (2) can be combined. Refer to remark (9) of Section 5.3.

5.3. Remarks

(1) Two sources of uncertainty are taken into account in a parepeatic model: fluctuation zone and correctness
factor. The fluctuation zone $E(X)$ asserts that an equation $f(X) = 0$ cannot fit the phenomena of partial repeatability; an inequation $f(X) \in E(X)$ should take its place. Further, due to the complexity of the phenomena, or the nature of partial repeatability, even the inequation $f(X) \in E(X)$ cannot hold strictly: there are always exceptions. The correctness factor represents the degree of correctness of the inequation $f(X) \in E(X)$. In other words, a parepeatic model measures partial repeatability in two dimensions: the fluctuation zone $E(X)$ and the correctness factor c.

(2) A parepeatic model explicitly divides observations into two parts: those fitting $f(X) \in E(X)$ and those not fitting it. The observations fitting $f(X) \in E(X)$ can be treated as the repeatable, and the observations not fitting it as the unrepeatable, or 'outliers'. Parepeatic models try to capture the most important observations (those fitting $f(X) \in E(X)$); they are not aimed at characterizing every observation.

(3) Conventional approaches to (system) modeling assume that there exists a 'true' model that can fully explain the corresponding phenomenon in a deterministic or an uncertain way, and that although we cannot obtain the accurate 'true' model, we can estimate it or approach it asymptotically. The model is in the form of equations. In comparison, the parepeatic approach to (system) modeling follows a different philosophy. There exists no single 'true' model in the form of equations for characterizing the phenomena of concern. There is always something in the phenomena that cannot be modeled. What we can obtain is an inequation and a correctness factor. Phenomena may be more complex than we can model; we can only partially model them.

(4) Although few disagree with the argument that a mathematical model is only an approximation of the modeled object, few mathematical models consider the degree of approximation to the modeled object explicitly.
Ideally, a mathematical model should consist of two parts: the idealized mathematical model and the (approximation) relation between the idealized mathematical model and the modeled object. In this light, the correctness factor c in a parepeatic model plays a crucial role in several ways:
• c tries to quantify the approximation relation between the idealized mathematical model ($f(X) \in E(X)$) and the modeled object. It is a measure of goodness-of-fit.
• c acknowledges the existence of outliers among the observations.
• c acknowledges the limits to full repeatability as well as the nature of partial repeatability.
(5) A parepeatic model $\langle f(X) \in E(X), c \rangle$ reduces to a conventional model when $E(X) \equiv [0, 0]$ and c = 1. Such
a conventional model can be deterministic, fuzzy or statistical: f(X) may represent a differential model, an integral model, a membership function, a statistical distribution, etc. From this viewpoint we can see that there are essential differences between parepeatic models and conventional models describing uncertainty, such as fuzzy models, statistical models, and the like. On the other hand, we can treat the fluctuation zone $E(X)$ as a generalization of the confidence interval in statistics; however, no statistical assumptions are made in a parepeatic model.

(6) A data point should not be treated as absolutely a 'normal' data point or absolutely an 'outlier'. A 'normal' data point can become an 'outlier', or vice versa, as the corresponding fluctuation zone moves. The correctness factor c identifies to what extent a data point is 'normal' and to what extent it is an 'outlier'.

(7) The substitution operation of variables is problematic if the relationships among the variables are parepeatic in nature. That is, $y \approx x$ and $z \approx y$ do not necessarily lead to $z \approx x$.

(8) A parepeatic model $\langle f(X) \in E(X), c \rangle$ essentially gives a confidence interval $E(X)$ for f(X) to lie in at each X. This interpretation of a parepeatic model may be independent of the choice of Q or the cost function, although Q or the cost function does affect the identification of f(X) and E(X).

(9) The identification procedure presented in Section 5.2 is subject to change. Instead of determining the central function f(X) by conventional methods such as the least squares method before the fluctuation zone E(X) is identified, we can specify the structures of f(X) and E(X) (e.g., a linear structure) first and then use a genetic algorithm to determine the parameters of f(X) and E(X) simultaneously. There is no problem in doing so in theory, although it may be computationally intensive.
(10) Since f ðXÞ and EðXÞ can be determined simultaneously, it seems that parepeatic modeling provides a new scheme of parameter estimations in general. In the least squares method, each data point is taken into account in the cost function directly. In the parepeatic modeling method, distinction is made between Ônormal dataÕ and ÔoutliersÕ, and data points are taken into account in the cost function in an indirect manner. (11) In conventional regression analysis, various statistical assumptions must be made (Freund and Wilson, 1998). This is especially true when confidence intervals are required. The parepeatic models suggest that we might do regression analysis without making any statistical assumptions. (12) In order to distinguish a parepeatic model from a statistical model, we only need to note that in a statistical model statistical assumptions are made and substitution operations are applicable. However this is not true for a parepeatic model. While statistical methods
tackle the problem of non-Normality or 'outliers' by using non-parametric statistics (Daniel, 1990; Sprent, 1989) or robust statistics (Hampel et al., 1986; Huber, 1981; Rousseeuw and Leroy, 1987), parepeatic methods identify and distinguish between 'normal data' and 'outliers' and balance their contributions in the cost function. Parepeatic methods thus employ a new kind of cost function.
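To make the notion concrete, the following Python sketch (ours, not the authors' code; all names are illustrative) evaluates a parepeatic model ⟨f(X) ∈ E(X), c⟩ on observed data: it classifies each point as 'normal' or 'outlier' according to the fluctuation zone and estimates the correctness factor c as the fraction of normal points. The linear central function and power-law zone mirror the forms fitted later in the paper.

```python
# A minimal sketch (ours) of evaluating a parepeatic model <f(X) in E(X), c>.
# The coefficients below are the Q = 1 model of Table 1; the sample points
# are illustrative (x, y) pairs, not dataset DS1 itself.

def central(x):
    # central function: predicted y for a given x (linear form y = a*x + b)
    return 0.97483 * x + 143.6879

def zone_halfwidth(x, a=493.0, b=0.007659):
    # fluctuation zone E(x) = [-a*x**b, +a*x**b] around the central function
    return a * x ** b

def evaluate(points):
    """Classify each (x, y) pair and estimate the correctness factor c
    as the fraction of points lying inside the fluctuation zone."""
    normal, outliers = [], []
    for x, y in points:
        if abs(y - central(x)) <= zone_halfwidth(x):
            normal.append((x, y))
        else:
            outliers.append((x, y))
    c = len(normal) / len(points)
    return normal, outliers, c

data = [(207, 175.5), (323, 107.0), (621, 123.7)]
normal, outliers, c = evaluate(data)
print(len(normal), len(outliers), round(c, 2))
```

No statistical assumption enters anywhere: the zone is a deterministic band, and c simply records how much of the data the band captures.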
6. Parepeatic models for N and η1 log2 η1 + η2 log2 η2

In order to reexamine Halstead's program length equation, in this section we develop parepeatic models for N (= N1 + N2) and η1 log2 η1 + η2 log2 η2. Let

x = N
y = η1 log2 η1 + η2 log2 η2

Halstead's program length equation asserts

y = x

In order to obtain the central function of a parepeatic model, here we assert

y = ax + b

Applying the least squares method to dataset DS1, we have

y = 0.97483x + 143.6879        (6.1)

and ρ(x, y) = 0.91157. Eq. (6.1) is slightly different from Halstead's program length equation; however, it avoids the inconsistency between the physical interpretations of the two sides of the equation. Fig. 5 shows the difference between Halstead's program length equation and Eq. (6.1).

Fig. 5. Halstead's program length equation versus fitted straight line for dataset DS1.

Following the notation of Sections 5.1 and 5.2, we define the central function as

f(x, y) = y − 0.97483x − 143.6879        (6.2)

and the fluctuation zone as

E(x, y) = [−ax^b, ax^b]        (6.3)

In other words, the parepeatic model asserts

0.97483x + 143.6879 − ax^b ≤ y ≤ 0.97483x + 143.6879 + ax^b

In order to determine the parameters a and b, we let the cost function be

J(a, b) = As + Qρ        (6.4)

where

As = A / (x_max · y_max)

with

x_max = max over all data points in DS1 of {N}
y_max = max over all data points in DS1 of {η1 log2 η1 + η2 log2 η2}

and A, Q and ρ are measures as defined in Section 5.2. By choosing different values of the weighting coefficient Q and applying a genetic algorithm (we employ the genetic algorithm toolbox in Matlab) for the optimization, we can obtain different parepeatic models for DS1. Table 1 tabulates the various results. We note that as Q increases, ρ decreases whereas As is enlarged. Q is a parameter subject to human preference. Fig. 6 shows the parepeatic model with Q = 1 for DS1.

Table 1
Parepeatic models for DS1

Weighting coefficient Q | Central function f(x, y) | Fluctuation zone E(x, y) | As      | ρ
0.1                     | y − 0.97483x − 143.6879  | 149.79x^0.000244         | 0.0440  | 0.4441
0.5                     | y − 0.97483x − 143.6879  | 347.03x^0.00333          | 0.12430 | 0.1538
1                       | y − 0.97483x − 143.6879  | 493.0x^0.007659          | 0.1705  | 0.0944
2                       | y − 0.97483x − 143.6879  | 670.74x^0.002725         | 0.2226  | 0.0664
5                       | y − 0.97483x − 143.6879  | 492.69x^0.114027         | 0.3751  | 0.0105

Fig. 6. Parepeatic model with Q = 1 for N and η1 log2 η1 + η2 log2 η2 of dataset DS1.

Compared to Halstead's program length equation, a parepeatic model provides more information. Suppose in the software design phase we have η1 = 40, η2 = 127. According to Halstead's program length equation, N̂ = η1 log2 η1 + η2 log2 η2 ≈ 1100. Suppose in the software-coding phase we find N = 650; is the programming practice then normal or abnormal? Halstead's program length equation cannot answer this question directly. On the other hand, we can consider the parepeatic model with Q = 1. Although the actual value N = 650 deviates significantly from the predicted value, the data point still lies within the fluctuation zone: |1100 − (0.97483 × 650 + 143.6879)| ≈ 323 < 493.0 × 650^0.007659 ≈ 518. We may reasonably believe that the programming practice is normal.

In order to confirm the validity of the parepeatic models summarized in Table 1, we apply the relation f(X) ∈ E(X) to dataset DS2 and calculate the corresponding ρ (with respect to the given fluctuation zones). Table 2 tabulates the resulting values of ρ for DS2 in comparison with those of DS1.

Table 2
Applying f(X) ∈ E(X) of DS1 to DS2

Weighting coefficient Q | 0.1      | 0.5      | 1        | 2        | 5
ρ of DS1                | 0.444056 | 0.153846 | 0.094406 | 0.066434 | 0.01049
ρ of DS2                | 0.356784 | 0.075377 | 0.035176 | 0.035176 | 0.015075

We see that the two sets of values of ρ are rather close. This in one respect suggests that the parepeatic models we have developed are valid.

7. The relation between η1 and η2 revisited

The relation between η1 and η2 was discussed in Section 4. Fig. 4 shows that Eq. (4.4) is not valid and that the substitution of one approximate equation into another may lead to unacceptable errors. A natural attempt is to seek a better relation between η1 and η2 in place of Eq. (4.4). Applying the least squares method to dataset DS1, we have

η1 log2 η1 + η2 log2 η2 = 0.97483N + 143.6879        (7.1)

and

N = 7.6881η2 − 73.0977        (7.2)

Thus there should hold

η1 log2 η1 + η2 log2 η2 = 0.97483(7.6881η2 − 73.0977) + 143.6879

Inspired by this potential relation, we tentatively let

x = η1 log2 η1
y = aη2 − η2 log2 η2

If Eqs. (7.1) and (7.2) really hold, then there should exist a strong linear relation between x and y. On the vertical axis y the parameter a is left to be determined instead of taking the value 0.97483 × 7.6881 directly, because the coordinates ⟨x, y⟩ are chosen only tentatively and are not yet finalized. Given a value of a, dataset DS1 generates a corresponding dataset of ⟨x, y⟩ pairs, and the correlation coefficient between x and y, denoted ρ(x, y), is accordingly determined. We apply the genetic algorithm to optimize the parameter a such that ρ(x, y) is maximized. Consequently,

a = 10.1887
ρ(x, y) = 0.8736

or

x = η1 log2 η1
y = 10.1887η2 − η2 log2 η2

Fig. 7 shows the distribution of ⟨x, y⟩ for dataset DS1. We see that the data pairs are largely scattered. The straight line fitted by the least squares method is

y = 2.3199x + 34.8292        (7.3)

Fig. 7. Fitted straight line of η1 log2 η1 versus 10.1887η2 − η2 log2 η2 for dataset DS1.

We are not able to say that the straight line fits the dataset well since the correlation coefficient is only
0.8701. It is smaller than the correlation coefficient (0.91157) of Eq. (7.1) and the correlation coefficient (0.90351) of Eq. (7.2). It is even smaller than the correlation coefficient (0.87924) of Eq. (4.4). Up to this point we can conclude that Eq. (7.3) does not hold to a high accuracy, and that there can hardly exist an explicit or simple relation between η1 and η2. Further, the substitution of one approximate equation into another can hardly work well. In order to take account of the complexity of the relation between η1 and η2, we have to develop parepeatic models. Let the central function be

f(x, y) = y − 2.3199x − 34.8292

and the fluctuation zone be

E(x, y) = [−ax^b, ax^b]

The resulting parepeatic models are tabulated in Table 3 for different assignments of the weighting coefficient Q. Fig. 8 shows the scenario of the parepeatic model with Q = 0.7.

Table 3
Parepeatic models for DS1

Weighting coefficient Q | Central function f(x, y) | Fluctuation zone E(x, y) | As     | ρ
0.5                     | y − 2.3199x − 34.8292    | 27.1158x^0.2043          | 0.2090 | 0.4336
0.7                     | y − 2.3199x − 34.8292    | 29.8317x^0.3144          | 0.3722 | 0.1573
1                       | y − 2.3199x − 34.8292    | 20.2624x^0.4257          | 0.4223 | 0.1014
2                       | y − 2.3199x − 34.8292    | 29.2647x^0.3969          | 0.5168 | 0.0315

Fig. 8. Parepeatic model with Q = 0.7 for η1 log2 η1 and 10.1887η2 − η2 log2 η2 of dataset DS1.

8. Illustrative case study

The implications of parepeatic models for software engineering practice should be observed. In order to confirm the applicability of parepeatic models to quality control, we wrote four programs in the Turbo C language: test1, test2, test3 and test4. Test1 and test2 each calculate the following definite integrals one by one:

1. ∫_0^1 (1 + x^2) dx
2. ∫_0^2 (1 + x + x^2 + x^3) dx
3. ∫_0^3.5 x/(1 + x^2) dx

Test3 and test4 each calculate the following definite integrals one by one:

4. ∫_0^2 (1 + x^2) dx
5. ∫_0^3 (1 + x + x^2 + x^3) dx
6. ∫_0^4.5 x/(1 + x^2) dx

The difference between test1 and test2 lies in their programming styles. Test1 invokes functional procedures, while test2 avoids functional procedures. Similarly, test3 invokes functional procedures and test4 avoids them. Obviously, functional procedures may simplify programming. This is particularly true for test3 since the integrands in integrals (1), (2) and (3) are identical to those in integrals (4), (5) and (6), respectively. It is reasonable to believe that test1 and test3 follow a good programming style, whereas test2 and test4 should be improved. If a parepeatic model works, we may expect some attribute values of test2 and test4 to be treated as 'outliers'. Even if they lie within the corresponding fluctuation zone, they should be closer to the boundaries of the fluctuation zone than those of test1 and test3.

Table 4
Attribute values of example programs

Program | η1 | η2 | N1  | N2
Test1   | 14 | 26 | 107 | 100
Test2   | 12 | 16 | 155 | 168
Test3   | 14 | 29 | 125 | 121
Test4   | 12 | 19 | 299 | 322
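The stylistic contrast behind the case study can be illustrated outside Turbo C. The following Python sketch is ours, not the authors' code: the 'good' style shared by test1 and test3 factors each integrand into a named procedure, while the 'bad' style of test2 and test4 repeats the integrand expression at every use, which inflates the operator and operand counts N1, N2 without changing what is computed. The integration routine and step count are our own choices.

```python
# Sketch (ours) of the two programming styles compared in the case study.
# Integration uses the composite trapezoidal rule; n = 1000 is arbitrary.

def trapezoid(f, lo, hi, n=1000):
    h = (hi - lo) / n
    s = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return s * h

# Style of test1/test3: shared functional procedures for the integrands.
def f1(x): return 1 + x * x
def f2(x): return 1 + x + x * x + x ** 3
def f3(x): return x / (1 + x * x)

good_style = [trapezoid(f1, 0, 1), trapezoid(f2, 0, 2), trapezoid(f3, 0, 3.5)]

# Style of test2/test4: the integrand expression is duplicated inline at
# every use, so the software science counts grow with each duplication.
bad_style = [
    trapezoid(lambda x: 1 + x * x, 0, 1),
    trapezoid(lambda x: 1 + x + x * x + x ** 3, 0, 2),
    trapezoid(lambda x: x / (1 + x * x), 0, 3.5),
]
```

Both styles return the same numerical values; they differ only in the Halstead measures their source texts would yield.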
Table 4 tabulates the attribute values of η1, η2, N1 and N2 for test1, test2, test3 and test4. In order to check whether or not a parepeatic model can help, we employ the parepeatic models developed in Section 6 for the relationship between N and η1 log2 η1 + η2 log2 η2. Figs. 9–13 show the data points of test1, test2, test3 and test4 against the patterns of the five parepeatic models tabulated in Table 1, respectively. First, we note that Figs. 9 and 13 present misleading information. In Fig. 9 all four data points lie outside the fluctuation zone. We cannot trust this observation since the degree of correctness of this parepeatic model is very low (1 − ρ = 0.56). Fig. 13 shows a different pattern. This should also be ignored since the fluctuation zone is overly large (As = 0.38). Consider Fig. 12; the corresponding parepeatic model is seemingly trustworthy since the degree of correctness
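The screening logic applied to Figs. 9–13 can be condensed into a small sketch (ours; the numerical thresholds 0.8 and 0.2 are our reading of the discussion, not values stated in the paper): a model from Table 1 is treated as trustworthy only when its degree of correctness 1 − ρ is high and its normalized fluctuation area As is small.

```python
# Sketch (ours) of the trust criteria used to screen the Table 1 models:
# reject a model when its degree of correctness 1 - rho is low, or when
# its normalized fluctuation area As is too large. Thresholds illustrative.
models = {  # Q: (As, rho), values taken from Table 1
    0.1: (0.0440, 0.4441),
    0.5: (0.12430, 0.1538),
    1:   (0.1705, 0.0944),
    2:   (0.2226, 0.0664),
    5:   (0.3751, 0.0105),
}

def trustworthy(As, rho, min_correctness=0.8, max_area=0.2):
    return (1 - rho) >= min_correctness and As <= max_area

trusted = sorted(Q for Q, (As, rho) in models.items() if trustworthy(As, rho))
print(trusted)
```

With these thresholds the screening singles out the models of Figs. 10 and 11, matching the conclusion drawn in the text.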
Fig. 9. Data points of test1, test2, test3 and test4 versus the parepeatic model of Table 1 with Q ¼ 0:1.
Fig. 10. Data points of test1, test2, test3 and test4 versus the parepeatic model of Table 1 with Q = 0.5.
Fig. 11. Data points of test1, test2, test3 and test4 versus the parepeatic model of Table 1 with Q ¼ 1.
Fig. 12. Data points of test1, test2, test3 and test4 versus the parepeatic model of Table 1 with Q ¼ 2.
Fig. 13. Data points of test1, test2, test3 and test4 versus the parepeatic model of Table 1 with Q ¼ 5.
is very high. However, the fluctuation zone is relatively large, and this reduces our confidence in the usability of the model. Even so, the data point of test4 is much closer to the boundary of the fluctuation zone, and thus the quality of test4 should be suspected. Figs. 10 and 11 provide convincing information. The corresponding parepeatic models should be trusted since their degrees of correctness are rather high and their fluctuation zones are rather small. The data point of test4 is now treated as an 'outlier', and the data point of test2 is closer to the boundaries of the fluctuation zone than that of test1. In other words, the parepeatic models suggest that test1 and test3 are superior to test2 and test4, respectively. The usability of the parepeatic models is thus verified.
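As a cross-check of the discussion above, this sketch (ours, not the authors' code) recomputes the case-study classification: it derives x = N and y = η1 log2 η1 + η2 log2 η2 from the Table 4 counts and tests each program against the Q = 1 fluctuation zone of Table 1.

```python
# Sketch (ours) reproducing the case-study check against the Q = 1
# parepeatic model of Table 1:
#   |y - (0.97483x + 143.6879)| <= 493.0 * x**0.007659
# with x = N = N1 + N2 and y = eta1*log2(eta1) + eta2*log2(eta2).
from math import log2

programs = {  # name: (eta1, eta2, N1, N2), values from Table 4
    "test1": (14, 26, 107, 100),
    "test2": (12, 16, 155, 168),
    "test3": (14, 29, 125, 121),
    "test4": (12, 19, 299, 322),
}

outliers = []
for name, (eta1, eta2, n1, n2) in programs.items():
    x = n1 + n2
    y = eta1 * log2(eta1) + eta2 * log2(eta2)
    deviation = abs(y - (0.97483 * x + 143.6879))
    if deviation > 493.0 * x ** 0.007659:  # outside the fluctuation zone?
        outliers.append(name)

print(outliers)
```

Consistent with Fig. 11, only test4 falls outside the zone; test2 stays inside but closer to the boundary than test1.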
9. Concluding remarks

Following the observation that Halstead's software science provides more qualitative than quantitative information, in this paper we have reexamined Halstead's software science from a partial-repeatability perspective. Partial repeatability distinguishes itself from full repeatability in that it can neither be proved (or disproved) mathematically nor validated (or invalidated) to a high accuracy physically. Partial-repeatability phenomena can widely be observed in the field of software engineering, where human intervention or intention is an essential and integral component. In the preceding sections we formulated the concept of partial repeatability and its underlying rationale. We presented two software science datasets to justify the concept of partial repeatability in software engineering. A new kind of mathematical model, namely the parepeatic model, was proposed to characterize the relationships among several software science measures such as the number of distinct operands and the number of distinct operators. Parepeatic models are essentially different from statistical models in that no statistical assumptions are made in parepeatic models and substitution of variables may lead to significant errors. An illustrative case study shows that the concept of partial repeatability and parepeatic models do help software engineering practice.

However, we are still at an early stage of this new line of research. Subsequent efforts may include further formulating the concept of partial repeatability; collecting more software data and developing multivariate parepeatic models; and working out the parepeatic laws that each piece of software may really follow. Subsequent efforts may also include further investigating the parepeatic modeling scheme as a new modeling scheme, and conducting large-scale case studies of parepeatic models in software engineering. Much is left to explore in the future.
Acknowledgements

The comments of one reviewer and of Robert L. Glass greatly helped the authors improve the readability of the paper.
References

Cai, K.Y., 1998a. Software Defect and Operational Profile Modeling. Kluwer Academic Publishers, Amsterdam.
Cai, K.Y., 1998b. On estimating the number of defects remaining in software. Journal of Systems and Software 40, 93–114.
Cai, K.Y., Liao, J.H., 1999. Software pattern laws and partial repeatability. In: Chen, G.Q., Ying, M.S., Cai, K.Y. (Eds.), Fuzzy Logic and Soft Computing. Kluwer Academic Publishers, Amsterdam, pp. 89–120.
Chen, L., 1999. Software data collection and analysis: appending part. BS Thesis, Beijing University of Aeronautics and Astronautics, Beijing.
Chevray, R., Mathieu, J., 1993. Topics in Fluid Mechanics. Cambridge University Press, Cambridge.
Compton, B.T., Withrow, C., 1990. Prediction and control of Ada software defects. Journal of Systems and Software 12, 199–207.
Coulter, N.S., 1983. Software science and cognitive psychology. IEEE Transactions on Software Engineering SE-9, 166–171.
Daniel, W.W., 1990. Applied Nonparametric Statistics, second ed. PWS-KENT Publishing Company.
Felician, L., Zalateu, G., 1989. Validating Halstead's theory of Pascal programs. IEEE Transactions on Software Engineering SE-15 (12), 1630–1632.
Fenton, N.E., Pfleeger, S.L., 1997. Software Metrics: A Rigorous and Practical Approach, second ed. International Thomson Computer Press.
Fitzsimmons, A., Love, T., 1978. A review and evaluation of software science. Computing Surveys 10, 3–18.
Freund, R.J., Wilson, W.J., 1998. Regression Analysis: Statistical Modeling of a Response Variable. Academic Press, New York.
Halstead, M.H., 1977. Elements of Software Science. Elsevier, Amsterdam.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 1986. Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York.
Hatton, L., 1997. Reexamining the fault density–component size connection. IEEE Software.
Huber, P.J., 1981. Robust Statistics. John Wiley, New York.
IEEE, 1979. IEEE Transactions on Software Engineering SE-5 (2).
Johnston, D.B., Lister, A.M., 1981. A note on the software science length equation. Software Practice and Experience 11, 875–879.
Keller-McNulty, S., McNulty, M.S., Gustafson, D.A., 1991. Stochastic models for software science. Journal of Systems and Software 16, 59–68.
Lehman, M.M., 1991. Software engineering, the software process and their support. Software Engineering Journal 6 (5), 243–258.
Lehman, M.M., 1996. Laws of software evolution revisited. In: Montangero, C. (Ed.), Software Process Technology. Springer-Verlag, Berlin, pp. 108–124.
Levitin, A.V., 1985. On predicting program size by program vocabulary. In: Proceedings of Computer Software and Applications Conference, pp. 98–103.
Lipow, M., 1982. Number of faults per line of code. IEEE Transactions on Software Engineering SE-8 (4), 437–439.
Lipow, M., 1986. Comments on estimating the number of faults in code and two corrections to published data. IEEE Transactions on Software Engineering SE-12 (4), 584–585.
Oldehoeft, R.R., Bass, L.J., 1979. Dynamic software science with applications. IEEE Transactions on Software Engineering SE-5 (5), 497–504.
Ottenstein, L.M., 1979. Quantitative estimates of debugging requirements. IEEE Transactions on Software Engineering SE-5 (5), 504–514.
Ottenstein, L., 1981. Predicting numbers of errors using software science. Performance Evaluation Review 10, 157–167.
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. John Wiley, New York.
Schneider, V., 1981. Some experimental estimators for developmental and delivered errors in software development projects. Performance Evaluation Review 10, 169–172.
Shen, V.Y., Conte, S.D., Dunsmore, H.E., 1983. Software science revisited: a critical analysis of the theory and its empirical support. IEEE Transactions on Software Engineering SE-9 (2), 155–165.
Shooman, M.L., 1983. Software Engineering: Design, Reliability, and Management. McGraw-Hill, New York.
Spielberg, N., Anderson, B.D., 1995. Seven Ideas That Shook the Universe, second ed. John Wiley, New York.
Sprent, P., 1989. Applied Nonparametric Statistical Methods. Chapman and Hall, London.
Triantafyllos, G., Vassiliadis, S., Kobrosly, W., 1995. On the prediction of computer implementation faults via static error prediction models. Journal of Systems and Software 28, 129–142.
Zuse, H., 1990. Software Complexity: Measures and Methods. de Gruyter.

Kai-Yuan Cai is a Cheung Kong Scholar (Chair Professor), jointly appointed by the Ministry of Education of China and the Li Ka Shing Foundation, Hong Kong. He has been a professor at Beijing University of Aeronautics and Astronautics (BUAA) since 1995. He was born in April 1965 and entered BUAA as an undergraduate student in 1980. He received his BS degree in 1984, MS degree in 1987, and Ph.D. degree in 1991, all from BUAA. He was a research fellow at the Centre for Software Reliability, City University, London, and a visiting scholar at the Department of Computer Science, City University of Hong Kong. Dr. Cai has published over 40 research papers. He is the author of three books: Software Defect and Operational Profile Modeling (Kluwer, Boston, 1998); Introduction to Fuzzy Reliability (Kluwer, Boston, 1996); Elements of Software Reliability Engineering (Tsinghua University Press, Beijing, 1995, in Chinese). He serves on the editorial board of the international journal Fuzzy Sets and Systems and is the editor of the Kluwer International Series on Asian Studies in Computer and Information Science (http://www.wkap.nl/prod/s/ASIS). He is one of the 1998 recipients of the Outstanding Youth Investigator Award of China. His main research interests include computer systems reliability and intelligent control.

Lei Chen is currently a Master's degree student at BUAA. He was born in September 1976 and entered Beijing University of Aeronautics and Astronautics (BUAA) as an undergraduate student in 1995. He received his BS degree from BUAA in 1999.