A Predictive Model for Programming Time: A Factor Analytic Approach

Copyright © IFAC Experience with the Management of Software Projects, Indiana, USA, 1989

A PREDICTIVE MODEL FOR PROGRAMMING TIME: A FACTOR ANALYTIC APPROACH

J. C. Munson* and T. M. Khoshgoftaar**

*Division of Computer Science, University of West Florida, Pensacola, Florida, USA
**Department of Computer Science, Florida Atlantic University, Boca Raton, Florida, USA

Abstract. The relationship between software complexity metrics and programming effort is explored. As a measure of programming effort, the development time of a program was systematically associated with software complexity metrics in two distinct modeling scenarios. Predictive models were developed with a set of raw complexity metrics and a set of metrics on a reduced complexity space. The metrics were mapped onto the reduced space through the use of factor analysis. This technique was used to reveal the underlying conceptual domains of the complexity space, which were then associated with programming time through regression analysis. A significant relationship between programming effort and program complexity was found. In a direct comparison of two alternate modeling techniques, the reduced factor model was found to have better predictive quality than an association with raw complexity metrics.

Keywords. Computer software; correlation methods; eigenvalues; modeling; programming; regression models; software development; software engineering.

INTRODUCTION

Because of the increasing cost of generating computer programs, a major task in any software development project is to estimate the time it will take to develop a program, which in turn directly determines the cost of that development. The objective of recent research in this area is to develop predictive models for new projects based on historical data from past project development. Several studies have been conducted in an attempt to associate complexity metrics directly with measures of programming effort (Albrecht, 1983; Daly, 1985; Woodfield, 1981). A major problem of the research to date is the failure to investigate the predictive quality of these models; they have only examined the relative association of effort and complexity metrics. These metrics are relatively stable measures developed on existing programs. They are very tractable variables and are well defined in the literature. Consequently, there is some interest in the development of predictive models based on these measures. Typically, regression models have been developed with these software complexity metrics, with some measure of development time as a dependent variable (Itakura, 1982; Basili, 1981). The relative success of the use of complexity metrics is predicated on the ability of these metrics to describe all aspects of program variability. That is, the metrics employed in this predictive role must, in fact, measure all dimensions of the complexity problem space. Some attempts to develop these predictive models have not succeeded because many of the metrics employed in these studies are simple linear (or non-linear) compounds of measures on only one or two complexity dimensions.

In this paper we will develop a regression methodology that first establishes the precise dimensionality of the complexity space measured by the selected complexity metrics and then uses unitary measures on each of the underlying complexity dimensions as a basis for a regression model to predict programming time. The essential advantage of this approach is that these models may be used with increasing precision as a historical database relating program specifications to complexity is established. Through the use of the regression models, this in turn will allow increased accuracy in the prediction of programming time. A basic statistical tool that will be used to determine the relationship between programming time (effort) and complexity is regression analysis. As will be shown, there are many methods of developing a regression model for the same data. The choice of a particular model should be viewed as the selection of a best model from a pool of candidate models.

A major problem in the development of regression models centers around multicollinearity. The basic regression model is based on the assumption that the independent variables of the analysis are not linear compounds of each other and do not share an element of common variance. Two variables sharing a common element of variance are said to be collinear. To meet this assumption of non-multicollinearity, another statistical procedure called factor analysis may be used. The specific value of factor analysis is that the technique will reduce a data matrix to a set of orthogonal variables, or factors, that are by construction non-collinear. Factor analysis is also useful in the reduction of the complexity metric space to a set of orthogonal complexity dimensions. These orthogonal dimensions are the basis for a conceptual model of software complexity. The specific contribution of the variance of each of these elements of the complexity model to programming effort may then be studied with regression analysis. The next two sections of this paper will summarize the basic statistical issues of regression analysis and factor analysis. Subsequently, we will show the appropriate application of these techniques to the problem of predicting software development time using software complexity metrics.
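To make the collinearity diagnosis concrete, a common check (not part of the original study) is the variance inflation factor: each metric is regressed on the remaining metrics, and VIF_j = 1/(1 − R_j²). The sketch below, in Python with hypothetical metric data, illustrates the idea.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the design matrix X.

    Each column is regressed (with intercept) on the remaining columns;
    VIF_j = 1 / (1 - R_j^2). Values well above ~10 signal severe
    multicollinearity, as is typical of raw complexity metrics."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical module-level metrics: LOC, V(G), and Halstead N1, all of
# which tend to grow together and hence are collinear.
rng = np.random.default_rng(0)
loc = rng.normal(200, 50, size=60)
metrics = np.column_stack([loc,
                           0.05 * loc + rng.normal(0, 2, 60),   # V(G) tracks LOC
                           3.0 * loc + rng.normal(0, 60, 60)])  # N1 tracks LOC
print(vif(metrics))  # large values => collinear metrics
```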

FACTOR ANALYSIS

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among variables in terms of a few underlying, but understandable, random quantities called factors. Basically, the factor model is motivated by the following argument. Suppose variables can be grouped by their correlations. That is, all variables within a particular group are highly correlated among themselves but have relatively small correlations with variables in a different group. It is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations. Factor analysis can be considered as an extension of principal component analysis (Dillon, 1984). Both can be viewed as attempts to approximate the covariance matrix Σ. However, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a prescribed structure. For our investigative purposes, this matrix Σ can be reconstructed from a correlation matrix. Assume that we are given p random variables X = (X_1, ..., X_p) having a multivariate distribution with mean μ = (μ_1, ..., μ_p) and a covariance matrix Σ. The factor model postulates that X is linearly dependent on a few unobservable random variables F_1, ..., F_m,


called common factors, and p additional sources of variation, ε_1, ..., ε_p, called errors or, sometimes, specific factors. In particular, the factor analytic model is

$$X_i = \sum_{j=1}^{m} \alpha_{ij} F_j + \varepsilon_i, \qquad i = 1, 2, \ldots, p$$

The coefficient α_ij is called the loading of the i-th variable on the j-th factor. The variables F_1, ..., F_m are assumed to be uncorrelated with unit variances.
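Though not written out in the paper at this point, the covariance structure implied by this model is the standard factor-analysis identity, which makes precise the earlier remark that factor analysis approximates the covariance matrix Σ:

$$\Sigma = \operatorname{Cov}(X) = A A^{T} + \Psi$$

where A = (α_ij) is the p × m matrix of loadings and Ψ is the diagonal matrix of the specific (error) variances Var(ε_i).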

The technique of factor analysis concerns itself with estimating the factor loadings α_ij. Once the factor loadings have been obtained, the major task is to make the best interpretation of the common factors. It is important to note that the burden of interpretation lies on the observer and is not intrinsic in the factor analysis. Usually, it is relatively simple to observe the relationships of variables grouped by their association with a common factor and attach a name to this set. To aid in the interpretation of the extracted factor loadings, we exploit the indeterminacy of the factor solution, whereby we can find new common factors F_1^(R), ..., F_m^(R) that are linear combinations of the old factors and which are uncorrelated with unit variances. Thus, the new set of factors also satisfies the factor model. Furthermore, there are an infinite number of such sets. The process of obtaining a set of new factors is called an orthogonal factor rotation. The objective of the rotation is to obtain some theoretically meaningful factors and to simplify the factor structure. There are many different techniques available for these orthogonal rotations. The varimax rotation, which will be used in this study, attempts to simplify the columns of a factor matrix; a simple column is one whose entries are close to either zero or one. Only a subset of columns from the original factor pattern may be chosen for this rotation. The selection criterion for the incorporation of a column in the varimax rotation is generally based on the column's eigenvalue. Typically, columns whose eigenvalues are greater than one are selected for rotation.

An example of the application of factor analysis to the field of software complexity metrics may be seen in the authors' recent study (Munson, 1989). The most important conclusion that can be drawn from this investigation is that the domain of complexity measures does not appear to be unrestricted. There are many software complexity metrics in the literature, but there are relatively few dimensions in the complexity measure space. It would appear perfectly reasonable to characterize the complexity of a program with a simple function of a small number of variables. In the case of program complexity, there are many measures of complexity, but factor analysis has shown that all of these metrics map onto a small number of complexity domains.

A major problem, in the past, with the application of complexity metrics in the task of prediction or for comparative purposes among program modules is that different program modules would have substantially different values of these metrics. There were no direct means of comparing these program constituents. For example, program module A might have many lines of code (LOC) and low cyclomatic complexity (V(G)), whereas program module B might have a low LOC and a high V(G). These modules are clearly different in their complexity but are not comparable in some relative sense. For cases where it is desirable to compare program modules directly in terms of their relative complexity, we have developed a realistic measure of relative complexity that reflects the contribution of each program module to the total complexity of a programming system.
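As an illustration of the extraction-and-rotation procedure just described, the following sketch fits a three-factor model with a varimax rotation to a matrix of standardized metrics. It is a sketch only: the data are synthetic, and the use of scikit-learn's FactorAnalysis with rotation='varimax' is an assumption of this example, not the computation used in the paper.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical data matrix: rows are program modules, columns are raw
# complexity metrics (e.g., LOC, V(G), eta1, N1, eta2, N2, module count).
rng = np.random.default_rng(1)
raw_metrics = rng.lognormal(mean=3.0, sigma=0.5, size=(63, 7))

# Standardize the metrics, then extract three varimax-rotated factors
# (mirroring the eigenvalue-greater-than-one rule mentioned in the text).
z = StandardScaler().fit_transform(raw_metrics)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
factor_scores = fa.fit_transform(z)   # one score per module per factor

# fa.components_ is the (factors x metrics) loading pattern; interpreting
# and naming each factor (Control, Volume, Modularity, ...) is up to the
# analyst, as the text notes.
print(np.round(fa.components_.T, 3))
```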

One of the products of a factor analysis is a factor score coefficient matrix, F. This matrix is constructed to send a matrix of standardized complexity metrics, z, onto the underlying orthogonal factor dimensions. Thus the relative complexity, C_r, of the factored program modules may be represented as follows:

$$C_r = z F \Lambda^{T}$$

where Λ is a vector of eigenvalues associated with the specific factor dimensions. This relative complexity measure has proven quite useful both for classifying program modules into categories of varying complexity and for comparing them directly. None of the individual complexity metrics can be used for this classification. In many applications, it is desirable to use the complexity metrics for their value as predictors of some aspect of the software development process, such as effort or perhaps the number of errors in the developed program modules. The appropriate tool for prediction is regression analysis.
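A minimal numeric sketch of the relative complexity computation, assuming (per the formula above) a standardized metric matrix z, a factor score coefficient matrix F, and a vector Λ of eigenvalues for the retained factors; all values below are invented for illustration.

```python
import numpy as np

# z: 4 modules x 5 standardized metrics (hypothetical values)
z = np.array([[ 1.2,  0.8,  1.5,  0.9,  1.1],
              [-0.5,  0.2, -0.7, -0.3, -0.4],
              [ 0.1,  1.4,  0.3,  1.2,  0.2],
              [-0.8, -1.1, -0.9, -1.0, -0.7]])

F = np.array([[0.40, 0.05, 0.10],   # factor score coefficients:
              [0.05, 0.45, 0.08],   # 5 metrics x 3 factors
              [0.38, 0.10, 0.05],
              [0.06, 0.42, 0.07],
              [0.08, 0.09, 0.50]])

lam = np.array([3.1, 1.6, 1.1])     # eigenvalues of the retained factors

# C_r = z F Lambda^T: factor scores weighted by the variance each factor
# explains, yielding one relative-complexity value per module.
C_r = z @ F @ lam
print(C_r)  # larger values = modules contributing more total complexity
```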

REGRESSION ANALYSIS

The general notion of linear regression is to select, from a set of independent variables, a subset of these variables that will explain the most variance in a dependent variable. The key to regression model development is to choose the subset of independent variables in such a manner as to not introduce more variance (or noise) into the model than is explained by the variables themselves. It is important to understand the statistics that will allow us to compare the resulting models with regard to quality of fit and predictive quality. When there is a complete absence of linear relationships among the independent variables, they are said to be orthogonal. Usually, the lack of orthogonality is not serious enough to affect the analysis. However, in software development the independent variables, software complexity metrics, are so strongly interrelated that the regression results are ambiguous. It is not always possible to estimate the unique effect of an individual software complexity metric in the regression equation. The estimated values of the coefficients are very sensitive to slight changes in the data and to the addition or deletion of variables in the regression equation. The condition of severe non-orthogonality is also referred to as the problem of collinear data, or multicollinearity. It is important to know when multicollinearity is present and to be aware of its possible consequences.

Regression and Factor Analysis

Principal components analysis may be used to detect and analyze collinearity in the explanatory variables, which are in this case software complexity metrics. When confronted with a large number of variables measuring a single construct, it may be desirable to represent the set by some smaller number of variables that convey all, or most, of the information in the original set. Principal components are linear transformations of a set of random variables that summarize the information contained in the variables. The transformations are chosen so that the first component accounts for the maximal amount of variation of the measures of any possible linear transform; the second component accounts for the maximal amount of residual variation; and so on. The principal components are constructed so that they represent transformed scores on dimensions that are mutually orthogonal. Through the use of factor analysis, a set of highly related variables, such as complexity metrics, may be reduced to a relatively small number of complexity dimensions. When this mapping is accomplished by factor analysis, the transformed and reduced complexity dimensions are in fact orthogonal. This definitively solves the problem of multicollinearity in subsequent regression analysis.
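For reference, the principal components described above can be computed directly from the eigendecomposition of the correlation matrix. The sketch below, with hypothetical collinear data, shows the defining property that each successive component accounts for the maximal remaining variance on mutually orthogonal dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(63, 5))
X[:, 1] += 0.9 * X[:, 0]          # induce collinearity between metrics
X[:, 3] += 0.8 * X[:, 2]

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the metrics
R = np.corrcoef(Z, rowvar=False)           # correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]          # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                       # mutually orthogonal components
print(np.round(eigvals / eigvals.sum(), 3))  # proportion of total variance
                                             # each component accounts for
```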

Regression Model Selection

The most obvious technique to use in the identification of the appropriate subset of independent variables is to perform all possible regressions (the combinatorial solution). In this case, regression models are developed for all possible sets of independent variables. An evaluation standard must be applied to select the best model. A model may fit the data well but not be very good in terms of predictive ability. The selection of one model out of the set of all models will depend on the selection criterion. This particular technique lends itself to the selection of an inappropriate model due to spurious random variation of independent variables not related in any way to the systematic variation of the dependent variable. The next alternative for the development of a regression model is to use one of two stepwise regression procedures that involve the systematic incorporation of variables into the regression model in an iterative manner. First, there is stepwise regression analysis. In this procedure, an initial model is formed by selecting the independent variable with the highest simple correlation with the dependent variable. In subsequent iterations, new variables are selected for inclusion based on their partial correlation with variables already in the regression equation. A second selection technique is that of backward elimination, which forms a regression equation with all variables and then systematically eliminates variables that do not contribute significantly to the model.
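The forward stepwise procedure just described can be sketched compactly; a partial F (F-to-enter) test is one conventional way to implement the partial-correlation criterion. The threshold and data below are placeholders, not values from the paper.

```python
import numpy as np

def forward_stepwise(X, y, f_enter=4.0):
    """Forward stepwise selection: start from the best single predictor,
    then repeatedly add the variable giving the largest significant drop
    in residual sum of squares (partial F test)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def sse(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return r @ r

    current_sse = sse([])
    while remaining:
        trials = [(sse(selected + [c]), c) for c in remaining]
        best_sse, best_c = min(trials)
        df_resid = n - len(selected) - 2          # intercept + new term
        f_stat = (current_sse - best_sse) / (best_sse / df_resid)
        if f_stat < f_enter:                      # no significant gain
            break
        selected.append(best_c)
        remaining.remove(best_c)
        current_sse = best_sse
    return selected

# Hypothetical use: columns of X are complexity metrics, y is time.
rng = np.random.default_rng(3)
X = rng.normal(size=(63, 5))
y = 2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 63)
print(forward_stepwise(X, y))   # expected to pick up columns 0 and 2
```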

In conjunction with factor analysis, the regression analysis may be performed on a reduced complexity metric space. With this procedure, the set of independent variables is mapped onto a smaller number of orthogonal dimensions through the use of factor analysis. The factor scores from the factor analysis are then used in a stepwise procedure to form the final model. A significant value in this process is the fact that multicollinearity among the variables is first eliminated by the factor analysis. Also, the factor analysis serves to reduce the apparent dimensionality of the set of independent variables. From a regression analysis of variance perspective, this will also have the net effect of reducing the degrees of freedom due to regression, in that fewer total variables are presented as independent variables to the regression model.
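Putting the two stages together, the factor-score regression procedure described in this subsection can be sketched as follows. The pipeline shape follows the text; the data, the three-factor choice, and the scikit-learn calls are assumptions of this illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
metrics = rng.lognormal(3.0, 0.5, size=(63, 7))   # hypothetical raw metrics
time = rng.lognormal(2.0, 0.4, size=63)           # hypothetical dev. time

# Stage 1: map the correlated metrics onto orthogonal factor dimensions.
z = StandardScaler().fit_transform(metrics)
scores = FactorAnalysis(n_components=3, rotation="varimax",
                        random_state=0).fit_transform(z)

# Stage 2: ordinary least squares of time on the factor scores. Because
# the scores are (near-)orthogonal, each slope isolates the contribution
# of one conceptual complexity dimension.
Z = np.column_stack([np.ones(len(time)), scores])
beta, *_ = np.linalg.lstsq(Z, time, rcond=None)
print("intercept and factor slopes:", np.round(beta, 3))
```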

Conceptually, perhaps the most important consideration in the use of the reduced factor dimensions in the predictive model is that each of the underlying factor dimensions represents a fundamental concept in the complexity problem space. For example, through the use of factor analytic tools, we have found that there is consistently present a factor which relates to the Control complexity of a program. Many metrics have this Control component in them. Should this Control factor enter a predictive model with time as a dependent variable, we can unequivocally relate the complexity concept of control to the dependent measure, in this case, time.

The Evaluation of Regression Models

The net result of the previous discussion is that most regression studies on raw data will produce more than one possible model. An excellent discussion of the model evaluation process is available from Myers (1986). The objective now is to be able to evaluate the several models in terms of their predictive value. In our particular case, we are interested in predicting the programming time of a program module based on its software complexity. There are several statistical measures of the performance of a regression model. In general, there are two distinct classes of these evaluation criteria. The first of these classes contains statistics developed from the regression analysis of variance. Two of these, which will be employed in this study, are the coefficient of determination, R², and the Cp statistic. Another approach, using the PRESS statistic, is based on residual analysis. The PRESS statistic is derived by a systematic reformulation of the regression model. One by one, each of the data vectors is eliminated from consideration and a new regression model is constructed. This new regression model is used to predict the value of the dependent variable for the deleted observation. This statistic is also a valuable tool in the detection of outliers. Outliers are data values which represent extreme values that may overbias a least squares model. The PRESS statistic offers the ability to examine models that control for the effects of the outliers and to assess the predictive quality of a regression model. Traditionally, the R² statistic is used almost exclusively in empirical studies in software engineering. There are some distinct problems associated with the use of R², which is simply the ratio of the regression sum of squares to the total sum of squares. In that the total sum of squares is constant for all regression models on the same set of observations, R² can only increase as independent variables are added to a regression equation, whether or not they account for a significant amount of variance in the dependent variable. Also, the R² statistic does not assess the quality of future prediction, only the quality of fit on the sample data. The case for the Cp statistic is very different. This statistic is a measure of the total squared error in a regression. Thus, a researcher should choose a model with the smallest value of Cp. This statistic is to be preferred to R² because a penalty is introduced for overfitting the model with excess independent variables that bring with them an additional noise component.
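All three criteria have closed forms: R² from the sums of squares; PRESS from the leave-one-out identity e_i/(1 − h_ii), where h_ii are the diagonal elements of the hat matrix (so the n refits are not actually needed); and Mallows' Cp from the residual mean square of the full model. A sketch with hypothetical data:

```python
import numpy as np

def fit_stats(X_sub, y, sigma2_full):
    """R^2, average PRESS, and Mallows' Cp for a candidate design X_sub.

    sigma2_full is the residual mean square of the largest model under
    consideration, as the Cp definition requires."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X_sub])
    H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T       # hat matrix
    resid = y - H @ y
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - sse / sst
    press = ((resid / (1.0 - np.diag(H))) ** 2).sum()  # leave-one-out
    p = Z.shape[1]                             # parameters incl. intercept
    cp = sse / sigma2_full - (n - 2 * p)       # want Cp close to p
    return r2, press / n, cp                   # report average PRESS

rng = np.random.default_rng(5)
X = rng.normal(size=(63, 4))
y = 1.8 + 0.7 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, 63)

# sigma^2 estimated from the full four-variable model:
Zf = np.column_stack([np.ones(63), X])
rf = y - Zf @ np.linalg.lstsq(Zf, y, rcond=None)[0]
sigma2_full = (rf @ rf) / (63 - Zf.shape[1])

print([np.round(v, 3) for v in fit_stats(X[:, :2], y, sigma2_full)])
```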

THE PREDICTION OF PROGRAMMING TIME FROM COMPLEXITY DATA

We would now like to explore some of these regression models in an attempt to form a predictive relationship between a subset of complexity metrics and a dependent measure of effort such as programming time. The time metric is generally a simple numerical tally of the number of hours needed to construct a program. To understand the relationship between the complexity metric domains and programming time, the technique of factor analysis is very useful. Factor analysis will first be employed to show the relationship of the concept of effort (here represented by programming time) to the underlying complexity domains. Next, several predictive models will be developed to compare and contrast the use of raw complexity metrics versus factor dimensions in terms of their predictive value.

The Basili Study

An early study by Basili (1981) sought to study this relationship. In this study, selected complexity metric data were obtained from FORTRAN programs, together with the time it took programmers to prepare the programs. These data were reported in the form of correlation coefficients, shown in Table 1 below. From a statistical perspective, it is somewhat disturbing that the correlation coefficients are uniformly relatively large. This consistency reflects a high degree of multicollinearity among these metrics.

TABLE 1 Correlation Coefficients from Basili Study

          Error  Halstead  XQT   Source  V(G)  Calls
Time      .623   .672      .509  .602    .326  .667
Error            .503      .429  .489    .304  .643
Halstead                   .830  .756    .654  .804
XQT                              .806    .912  .770
Source                                   .653  .776
V(G)                                           .599

The particular metrics used in this study and shown in Table 1 are as follows. The Time variable is the number of hours (E) that a programmer took to complete the programming task. Error represents a numerical count of programming errors reported by programmers as the program was debugged during final testing. Halstead's (1977) effort metric is represented by Halstead, and McCabe's (1976) cyclomatic complexity by V(G). There are two metrics that represent the length of a program: Source, the total lines of code including comments and data statements, and XQT, which tallies only the number of executable FORTRAN statements in a program. Calls is a tally of the total number of call statements in the program. When these data were factor analyzed, three usable factors emerged that accounted for 90% of the observed variance. These factors were then subjected to an orthogonal rotation by the varimax procedure. The resulting factor structure is shown in Table 2 below. Based on our earlier work in this area, we have chosen to label these factor dimensions as follows. Factor 1 has associated with it the metrics V(G), XQT, and Source; this relationship will be called the Volume/Control dimension. Factor 2 is most closely related to Halstead, Time, and Calls; this factor we will call Effort. The final factor, 3, contains only the metric used to measure the number of programming errors; thus, this factor dimension will also be called Error.

TABLE 2 Varimax Factor Pattern for Basili Data

          Factor 1        Factor 2  Factor 3
          Volume/Control  Effort    Error
V(G)      .953            .081      .124
XQT       .905            .347      .174
Source    .664            .566      .202
Time      .141            .869      .352
Halstead  .634            .654      .185
Calls     .570            .572      .432
Error     .173            .313      .928

This factor structure is very similar to patterns we have observed and patterns that will be established later in the paper. The important consideration here is the fact that the metric Time is seen to be closely associated with Halstead's measure of effort, Halstead, and with the number of call statements in a program. Intuitively, this is indeed reasonable. The metric Halstead was, in fact, intended to represent the effort, hence the time, to construct a program. The association of the Time metric with the Calls metric is also not surprising.


The degree of modularity of a FORTRAN program is roughly assessed by the number of call statements, and this increased level of modularity will consume more programmer time. A regression analysis was subsequently performed that appears to show a relationship of executable statements (XQT), errors, effort, and cyclomatic complexity to the time to construct a program. This, we feel, is a clear indication of the problem in the use of data with a high degree of multicollinearity. The factor analysis shows a totally different relationship: in orthogonal dimensions, time is related only to the metrics of effort and number of calls. Cyclomatic complexity and program length are found in a separate orthogonal dimension. Further analysis of the Basili data was not possible due to the lack of availability of the original data on which the correlations were developed. However, the raw data were available from a study by Woodfield (1981) that did permit further study of similar metrics in association with time to develop programs. We will now examine these data with an attempt to control for problems of multicollinearity.


The Woodfield Data

The use of complexity metrics as predictors of and contributing factors in the determination of programming effort is subject to a number of difficulties from a statistical point of view because of the high degree of multicollinearity among the various complexity metrics. One such study, which will be used as an example, was performed by Woodfield. Herein, several selected complexity metrics were determined for programs written in FORTRAN. A total of 63 programs were developed in two separate project phases. The first of these project phases was a development phase, where data were collected from programming practice sessions. The second project phase consisted of a confirmatory set of data that was collected five months later. The time metric data were established from a period beginning when the programmer started working on a problem and ending with the first successful execution of the program. Correlation coefficients were computed for these data for the combined first and second project phases; they are shown in Table 3 below. The complexity metrics used in this study consisted of the time to develop the program, Time; a count of the number of program modules; a count of program lines of code, LOC; McCabe's cyclomatic complexity, V(G); and Halstead's metrics of η1, the number of unique operators in the program, η2, the number of unique operands in the program, N1, the total number of operators in the program, and N2, the total number of operands in the program.

TABLE 3 Correlation Coefficients from Woodfield Study with Time

              Time   V(G)   LOC    η1     N1     η2     N2
V(G)          .649
LOC           .777   .760
η1            .736   .569   .882
N1            .798   .736   .925   .780
η2            .506   .392   .561   .419   .721
N2            .743   .675   .868   .689   .963   .815
# of Modules  .392   .464   .704   .667   .581   .488   .549

An inspection of Table 3 shows that there is a high correlation among all of the selected complexity metrics. Hence, it may be assumed that there is a high degree of multicollinearity among these measures. There is also a relatively high correlation between the individual metrics and the measure of time. If a regression model were to be formed that incorporated the software metrics as independent variables and time as a dependent variable, it is clear that several ambiguous but plausible models might result. To show the underlying structure of the variation in these data, they were factor analyzed with Time as a variable in the analysis. As before, this factor structure was then subjected to an orthogonal rotation (varimax). The results of the analysis are shown in Table 4. In keeping with our earlier work in this area (Munson, 1989), we are able to identify the three resulting factors. Factor 1 from this table has associated with it those metrics from a Control or Effort dimension. Factor 2 consists of metrics apparently from a Volume dimension. Finally, there is a Modularity dimension that is represented here by the # of Modules metric. From this preliminary analysis, we would conclude that the variability in the Time measure is closely related to the variability in the metrics associated with Factor 1, the Control/Effort dimension. From this association, it would appear that the time to construct a program is closely related to the number of edges in the control graph necessary to represent the program and also to the operator complexity of the program.

TABLE 4 Varimax Factor Pattern for Woodfield Data with Time

              Factor 1        Factor 2  Factor 3
              Control/Effort  Volume    Modularity
Time          .847            .309      .132
V(G)          .814            .175      .205
LOC           .753            .335      .528
N1            .727            .567      .327
η1            .680            .146      .612
η2            .199            .940      .214
N2            .624            .705      .271
# of Modules  .201            .264      .917

To return to the problem of the multicollinearity among the complexity metrics, these metrics were subsequently factor analyzed without the Time variable. The correlation coefficients for the metrics alone are presented in Table 5 below. This new correlation matrix was factor analyzed, again with the intention of reducing the dimensionality of the metric space represented by the set of complexity metrics studied to a small number of orthogonal dimensions. In Table 6, the same factor structure emerges as was shown in Table 4. With the Time variance component no longer present, we see essentially the same factor structure as in Table 4. From the factor structure presented in Table 6, factor scores were computed for each observation vector; these represent the individual mappings of the metric values for each of the subprograms onto this new orthogonal set of three metric dimensions, thus eliminating problems of multicollinearity in subsequent regression modeling.

TABLE 5 Correlation Coefficients from Woodfield Study

              V(G)   LOC    η1     N1     η2     N2
LOC           .760
η1            .569   .882
N1            .736   .925   .780
η2            .392   .561   .419   .721
N2            .675   .868   .689   .963   .815
# of Modules  .464   .704   .667   .581   .488   .549

TABLE 6 Varimax Factor Pattern for Woodfield Data

              Factor 1        Factor 2  Factor 3
              Control/Effort  Volume    Modularity
V(G)          .894            .178      .186
LOC           .706            .361      .577
N1            .685            .591      .376
η2            .155            .946      .219
N2            .591            .724      .305
# of Modules  .166            .267      .896
η1            .561            .185      .710

There is one distinctive difference between Tables 4 and 6: the number of unique program operators (η1) appears to switch from the Control/Effort factor to the Modularity factor with the removal of the Time metric. This result is not surprising, in that the factor loading for this metric is relatively high for both factors in both tables. Clearly, the variance attributable to this metric is associated with both of these factors.

The Regression Analysis

Our objective now is to explore the potential use of the new factor dimensions for their predictive value in terms of the prediction of programming time. To this end, a stepwise regression was run with these new factors as independent variables and programming time as a dependent variable. For purposes of comparison, regression models were also developed for these same data using the raw complexity metrics as independent variables.

A Predictive Model for Programming Time ty metrics as independent variables. The regression analysis of variance (ANOV A) for the four resulting models is presented in Table 7. For the F statistics in this table and subsequent tables. an a priori level of significance of p < 0.05 was chosen for this analysis. TABLE 7 Regression Model ANOVA

Raw Variable Regression Analysis

Model  Source           d.f.  SS     MS     F
1      Regression        4    42.11  10.53  36.2292
       Error            58    16.86   0.29
       Corrected Total  62    58.97
2      Regression        3    41.43  13.81  46.4775
       Error            59    17.53   0.29
       Corrected Total  62    58.97

Factor Variable Regression Analysis

Model  Source           d.f.  SS     MS     F
3      Regression        3    38.79  12.93  37.82
       Error            59    20.17   0.34
       Corrected Total  62    58.97
4      Regression        3    39.19  13.06  47.83
       Error            58    15.84   0.27
       Corrected Total  61    55.02

TABLE 8 Model Description

Raw Variable Regression Models

Model  Parameter     Estimate  Std Error  F      Prob > F
1      Intercept      1.83
       V(G)           0.15     0.10        2.33  .1327
       η1             0.42     0.11       12.14  .0009
       N1             0.48     0.13       13.03  .0006
       # of Modules  -0.25     0.09        7.23  .0093
2      Intercept      1.83
       η1             0.41     0.12       11.43  .0013
       N1             0.60     0.11       28.42  .0001
       # of Modules  -0.24     0.09        6.50  .0134

Factor Variable Regression Models

Model  Parameter  Estimate  Std Error  F       Prob > F
3      Intercept   1.83
       Factor 1    0.65     0.07        76.86  .0001
       Factor 2    0.35     0.07        22.33  .0001
       Factor 3    0.28     0.07        14.27  .0004
4      Intercept   1.80
       Factor 1    0.70     0.07       108.40  .0001
       Factor 2    0.35     0.07        14.70  .0003
       Factor 3    0.28     0.07        14.41  .0004

There are two major classes of these models, based on the nature of the independent variables: one set of regression models used the raw complexity metrics and the other the factor variables. Two models emerged from the regression analysis of the raw complexity metrics. Model 1 was the best model from the set of all possible regressions. Model 2 was the best regression model of the stepwise process and of the backward elimination process, which in this case recommended the same model. Two models, shown as Models 3 and 4, were also developed for the factor variables. These two models both contained all three factor variables; this was the model selected by all possible regressions, by the stepwise procedure, and also by backward elimination. The reason that two apparently similar models are presented here is that Model 3 was found to contain a significant (p < 0.05) outlier in a subsequent examination of the residuals with the PRESS statistic. This outlier was a data value that represented a significant departure from the set of data values of the dependent variable. When the outlier was removed, a new model, Model 4, was created. The criteria used to determine the best regression model were, in the case of all possible regression models, the R² and Cp statistics. This means that the regression models were selected on the basis of their quality of fit, as opposed to their predictive quality, for the sample data.

The actual regression models with the selected sets of independent variables are presented in Table 8. From this table it can be seen that Models 1 and 2 are the same except for the cyclomatic complexity metric, V(G), in Model 1. Model 1 was the best of all models selected by the all possible regressions analysis. There is a problem with this model, however: though the cyclomatic complexity metric is present in the model, it does not contribute significantly to it (p > .13). Therefore, we cannot use this model. Thus, the better of the two raw variable regression models is Model 2. This model has as independent variables the metrics η1, N1, and # of Modules. The slopes of the operator metrics are positive, which would indicate a direct positive linear relationship between the number of operators in a program and the time to construct a program. Interestingly, the slope of the # of Modules metric is negative and different from zero. From this result we can conclude that total programming time (effort) actually decreased as the number of modules increased. The factor variable regression models both incorporated all three of the factors. In both cases, the parameter estimates for the slopes are all positive, which would indicate a direct positive linear relationship between each of the factor variables and the dependent variable, Time.

There are now four basic models, all of which are seen to predict programming time. The problem now is to identify which of the four models is best. To this end, we will look at the predictive quality of each model. The associated statistics for model predictive quality are presented in Table 9. Based on the usual statistic of R², we can establish that, with this criterion, Models 1 and 4 are the best of the models in terms of quality of fit. However, we are much more interested in the predictive value of these models than in their quality of fit. To this end, we will now examine the regression models based on the Average PRESS statistic and the Cp statistic, which measure the predictive value of the several models.

TABLE 9 Model Predictive Quality

Raw Variable Models

Model  PRESS  Ave PRESS  Cp    R²
1      21.70  0.34       2.79  .71
2      20.59  0.33       3.02  .70

Factor Variable Models

Model  PRESS  Ave PRESS  Cp    R²
3      25.00  0.39       4.00  .66
4      19.31  0.31       4.00  .71

To establish the predictive value of each model, the Average PRESS and Cp were computed for each. In terms of the Average PRESS statistic, the model with the lowest value is the one with the best overall predictive quality. Based on this statistic, we can see that the factor variable Model 4 has the best overall predictive quality. In terms of the Cp statistic, the model whose value of Cp is as close as possible to the number of parameters in the model will be the one with the best predictive quality. In this case, we can see that the predictive quality of Models 3 and 4 is better than that of either of the raw variable regression models.


Overall, then, we observe that the reduced factor regression model is the one which will yield the best predictive quality for programming effort. The overall objective of this research, however, has been twofold. First, the concept of predicting the programming effort involved in a program based on its underlying complexity has been shown to be a viable one; there is a reasonable linear relationship between these two concepts. Second, the best way to model this relationship between complexity and effort is the reduced factor model.

CONCLUSION

For a regression model to be useful in the prediction of programming time, a dominant concern is that a stable model has been developed. Further, in the case of software metrics as independent variables, it is important that the variables chosen to develop the model serve to measure all possible aspects of program variability. The technique we have presented has two distinct virtues in this regard. First, the complexity metrics are mapped, through the use of factor analysis, onto a reduced set of orthogonal dimensions. Thus, the actual nature of the complexity space on which the model is being developed is clearly exposed. Second, regression modeling aberrations related to problems of multicollinearity are removed. In general, this technique will lead to models of superior predictive capability. It is important to note that it has been the intention of this paper to explore the modeling technique, not to present a definitive model. The models we have examined herein probably will not extrapolate to very large programming projects, in that the sample size was quite small.

The basic technique we have developed in this study relates to some major problems we have observed in the effort to develop reliable and meaningful predictors of program effort. Software complexity metrics certainly would be useful in this regard, in that they are numerical measures which may be obtained prior to the test and validation of a program. As has been shown, these metrics are quite interrelated. They are also, for the most part, highly correlated with measures of program effort. This high correlation by itself is an unreliable indicator of the predictive quality of models used in the study of software development effort. In fact, the large correlations are certainly indicators of multicollinearity, which can only confound attempts at developing predictive models. Through the use of factor analysis, predictors may be mapped onto orthogonal dimensions. Our experience in this area indicates that there are relatively few such complexity dimensions in the existing set of complexity metrics. Perhaps the most important notion behind the use of factor analysis in regression modeling is that the underlying complexity dimensions shown by factor analysis have meaning in terms of a conceptual model. For example, within a set of program modules, a distinctive characteristic of these modules might be their differences on some underlying Control dimension. A subsequent predictive model which incorporated this factor dimension would, in fact, demonstrate the relationship between the dependent variable, effort in our present case, and the conceptual factor dimension of Control complexity. Measures developed from the reduced complexity metric space may then be used to develop predictive models of effort of relatively great predictive quality, and also of some real utility because of the exposed relationship to an underlying conceptual model.

The subject of the predictive quality and appropriateness of models has also failed to receive the attention that it deserves in many reliability models that we have studied. In general, there are many statistics that aid in the determination of the predictive quality of regression models; we have examined several in this paper. Typically, research in this area is driven only by the R² measure of predictive quality. As we have seen, this statistic can only increase as new predictors are incorporated into a regression model, whether or not they contribute to the overall predictive capability of the model. Clearly, a researcher using these predictive tools must articulate, a priori, the need for a model to fit the data or to predict future outcomes.

REFERENCES

Albrecht, A. J. and J. E. Gaffney, Jr. (1983). Software function, source lines of code, and development effort prediction: A software science validation. IEEE Transactions on Software Engineering, SE-9, 639-648.

Basili, V. R. and T. Phillips (1981). Evaluating and comparing software metrics in the software engineering laboratory. Performance Evaluation Review, 10, 95-106.

Daly, E. B. (1985). Estimating software development.

Dillon, W. R. and M. Goldstein (1984). Multivariate Analysis: Methods and Applications. John Wiley & Sons, New York.

Halstead, M. H. (1977). Elements of Software Science. Elsevier, New York.

Itakura, M. and A. Takayanagi (1982). A model for estimating program size and its evaluation. Proceedings of the 6th International Conference on Software Engineering, 104-109.

McCabe, T. (1976). A complexity measure. IEEE Transactions on Software Engineering, SE-2(4), 308-320.

Munson, J. C. and T. M. Khoshgoftaar (1989). The dimensionality of program complexity. Proceedings of the 11th Annual International Conference on Software Engineering, 245-253.

Myers, R. H. (1986). Classical and Modern Regression with Applications. Duxbury Press, Boston.

Woodfield, S. N., V. Y. Shen, and H. E. Dunsmore (1981). A study of several metrics for programming effort. Journal of Systems and Software, 2, 97-103.