Does software reliability growth behavior follow a non-homogeneous Poisson process

Does software reliability growth behavior follow a non-homogeneous Poisson process

Available online at www.sciencedirect.com Information and Software Technology 50 (2008) 1232–1247 www.elsevier.com/locate/infsof Does software relia...

385KB Sizes 0 Downloads 8 Views

Available online at www.sciencedirect.com

Information and Software Technology 50 (2008) 1232–1247 www.elsevier.com/locate/infsof

Does software reliability growth behavior follow a non-homogeneous Poisson process q Kai-Yuan Cai *, De-Bin Hu, Cheng-Gang Bai, Hai Hu, Tao Jing Department of Automatic Control, Beijing University of Aeronautics and Astronautics, Beijing 100083, China Received 12 April 2006; received in revised form 19 December 2007; accepted 21 December 2007 Available online 10 January 2008

Abstract It is widely believed in software reliability community that software reliability growth behavior follows a non-homogeneous Poisson process (NHPP) based on analyzing the behavior of the mean of the cumulative number of observed software failures. In this paper we present two controlled software experiments to examine this belief. The behavior of the mean of the cumulative number of observed software failures and that of the corresponding variance are examined simultaneously. Both empirical observations and statistical hypothesis testing suggest that software reliability behavior does not follow a non-homogeneous Poisson process in general, and does not fit the Goel–Okumoto NHPP model in particular. Although this new finding should be further tested on other software experiments, it is reasonable to cast doubt on the validity of the NHPP framework for software reliability modeling. The importance of the work presented in this paper is not only for the new finding which is distinctly different from existing popular belief of software reliability modeling, but also for the adopted research approach which is to examine the behavior of the mean and that of the corresponding variance simultaneously on basis of controlled software experiments. Ó 2008 Elsevier B.V. All rights reserved. Keywords: Software reliability modeling; Controlled software experiment; Non-homogeneous Poisson process; Goel–Okumoto model; Software testing

1. Introduction Software reliability modeling is a relatively ‘old’ topic in software engineering. It can be traced back to the work of Hudson [1] and thus emerged as early as the birth of software engineering. Numerous software reliability models have been proposed [2–6], and the earliest models include the Jelinski–Moranda model [7], the Nelson model [8], the Shooman model [9], the Littlewood–Verrall model [10], the Halstead model [11], the Musa model [12], and the Schneidewind model [13]. Most of software reliability models focus on software testing phase, where software q Supported by the National Science Foundation of China and MicroSoft Research Asia (Grant No. 60633010). Cai is also with the State Key Laboratory of Virtual Reality Technology and Systems, Beijing, China. * Corresponding author. E-mail address: [email protected] (K.-Y. Cai).

0950-5849/$ - see front matter Ó 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2007.12.001

defects are detected and removed and thus software reliability tends to grow. A major class of software reliability models is NHPP (Non-homogeneous Poisson Process) models. Let M(t) be the number of software failures observed during the time interval (0, t]. A NHPP software reliability model assumes that {M(t), t P 0} follows a non-homogeneous Poisson process. It was revealed that many existing software reliability growth models can be formulated in the NHPP framework [2,14]. A number of empirical reports is available to validate the NHPP models [15–18]. It is reasonable to say the NHPP framework has played an influential role in software reliability modeling. Actually, The NHPP framework still draws considerable amount of research attention in software reliability community [19,20]. A common way to validate a (NHPP) software reliability model is to apply it to one or several software failure datasets collected from real software projects and examine

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

whether it fits the datasets and delivers useful predictions. In particular, the fit criterion adopted is usually based on the discrepancy between the true value and the expected value of the cumulative number of observed failures, whereas the expected value is determined by the expectation of M(t). To the best knowledge of the authors, however, no controlled software experiments have been carried out to validate or invalidate the NHPP framework statistically. The variance of the software failure process is seldom analyzed. Then does software reliability growth behavior really follow a non-homogeneous Poisson process in a statistically rigorous sense? This question defines the topic of this paper. The rest of this paper is organized as follows. Section 2 reviews the Goel–Okumoto model. Section 3 describes the two controlled software experiments. Section 4 presents empirical observations from the results of the two controlled software experiments. Statistical hypothesis testing for the results of the two controlled software experiments is conducted in Sections 5 and 6, respectively. Discussion on the results presented in the previous sections is contained in Section 7. Concluding remarks are contained in Section 8.

KðtÞ ¼ 0; KðtÞ ¼ a;

1233

t¼0 t!1

where a is the expected total number of software failures eventually observed. (5) Upon a software failure being observed, one and only one failure-causing defect is removed from the software under test without new defects being introduced. Following these assumptions, we arrive at Kðt þ DtÞ  KðtÞ ¼ b½a  KðtÞDt where b is the proportionality constant. Then dKðtÞ ¼ ab  bKðtÞ dt or KðtÞ ¼ að1  ebt Þ The number of software failures observed up to time t is determined by the distribution PrfMðtÞ ¼ kg ¼

½að1  ebt Þk expfað1  ebt Þg; k!

kP0

2. The Goel–Okumoto NHPP model We see The fundamental assumption of a NHPP software reliability model is that, M(t), the number of software failures observed during the time interval (0, t], is a Poisson distribution and {M(t), t P 0} follows a non-homogeneous Poisson process. Let K(t) = E[M(t)], i.e. K(t) is the expectation of M(t), or the mean function of the Poisson process. Different NHPP software reliability models differ in the form of K(t). The Goel–Okumoto NHPP model was proposed in 1978 [21]. It assumes that KðtÞ ¼ að1  ebt Þ where a, b are two constant parameters to be estimated. More specifically, the above form of the mean function is derived on the basis of the following assumptions.

PrfMð1Þ ¼ kg ¼

Given f1, f2, . . . , fk, observed in time intervals (0, t1], (t1, t2], . . . , (tk1, tk], we can employ the maximum likelihood method or the least-squares method to estimate parameters a, b. For the maximum likelihood method, since PrfMðti Þ  Mðti1 Þ ¼ fi g ¼

½Kðti Þ  Kðti1 Þfi expfKðti1 Þ  Kðti Þg fi !

The likelihood function is Lða; bÞ ¼

(1) For any set of finite time instants t1 < t2 <    < tk, the numbers of software failures (defects), f1, f2, . . . , fk, observed in the time intervals (0, t1], (t1, t2], . . . , (tk1, tk], are independent. (2) Every software defect has an equal chance of being detected. (3) The cumulative number of software failures observed up to time t, i.e., M(t), follows a Poisson distribution with the parameter K(t) such that the mean number of software failures (defects) observed in the microtime interval (t, t + Dt] is proportional to the interval length and to the mean number of remaining software defects at time t. (4) K(t) is a bounded, non-decreasing function with

ak a e k!

f k Y ½Kðti Þ  Kðti1 Þ i expfKðti1 Þ  Kðti Þg fi ! i¼1

Let o ln L ¼0 oa

o ln L ¼0 ob

Then the estimates of a, b, denoted by a^; ^b, respectively, are determined by the following equations Pk i¼1 fi ^a ¼ ^k 1  ebt Pk ^ ^ k ^ btk X tk e fi ðti ebti  ti1 ebti1 Þ i¼1 fi ¼ ^k ^ i1 1  ebt  e^bti ebt i¼1 ^ k Þ, is Immediately, the estimate of K(t) at tk, denoted Kðt

1234

^ kÞ ¼ Kðt

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247 k X

fi

i¼1

This estimate is identical to that we can obtain by using the maximum likelihood method if M(t) at time t = tk is simply treated as a random variable obeying a Poisson distribuP ^ k Þ ¼ k fi ¼ Mðtk Þ. tion with parameter K(tk); that is, Kðt i¼1 Note that the Goel–Okumoto NHPP model was originally proposed to describe the software reliability behavior in the continuous-time domain. However it can apply to the discrete-time domain directly by confining the various time instants (e.g., t1, t2, . . . , tk) to positive integers [22]. More specifically, the time instants can represent the number of tests (test cases) applied in the course of software testing. 3. The software experiments In order to test the Goel–Okumoto NHPP model statistically, we carried out two controlled software experiments. In each of the experiments, a software program with a number of known defects is subjected to testing. Following the assumptions of the Goel–Okumoto NHPP model, we confine ourselves to removing one and only one failurecausing defect from the software under test for each observed failure despite the fact that a test case may trigger more than one defect. 3.1. Experiment I 3.1.1. The subject program The subject program is called Space program and was already used in the study of software testing [23,24].1 Rothermel et al described it as follows [24]: ‘‘Space consist of 9564 lines of C code (6218 executable), and functions as an interpreter for an array definition language (ADL). The program reads a file that contains several ADL statements, and checks the contents of the file for adherence to the ADL grammar and to specific consistency rules. If the ADL file is correct, Space outputs an array data file containing a list of array elements, positions, and excitations; otherwise, the program outputs error message”. The Space program was also used in our previous case study for adaptive testing [25]. In the experiment presented in this paper we injected 38 defects into the Space program which was then subjected to testing.2 The test pool (or the input domain) used in our case study comprised 13,498 distinct test cases, with each being an ADL file. The original Space program without injected defects served as the test oracle. That is, a test case was applied to both the original 1 The Space program, the injected defects, and the test pool were obtained from Gregg Rothermel. 2 These defects were exactly those employed in the case study reported in Ref. [24].

Space program and the Space program with injected defects. Any discrepancy between the corresponding outputs implied that a failure was observed and at least one injected defect was detected. Among the 38 injected defects, two were non-detectable by any test case of the test pool. 3.1.2. The test results The Space program was subjected to random testing in a software toolkit, SRATE (Software Reliability Analysis, Testing, and Evaluation),3 automatically. At the beginning the Space program contained 38 injected defects. Upon a failure being observed, one and only one failure-causing defect was removed. A test process was stopped at the 1200th test. We defined such a test process as a trial. That is, in each trial exactly 1200 tests were performed, no matter how many failures were observed. This was to minic the often observed scenario that software testing terminates upon a given amount of testing resources being used up. In order to test the Goel–Okumoto NHPP model statistically, the software testing was conducted for 40 trails. Since different users might have different operational profile, we generated a different test profile {p1, p2, . . . , p13,498} at random for each trial as follows4: Step 1: Let n = 13,498. Step 2: Generated n  1 random numbers, x1, x2, . . . , xn1, in accordance with the uniform probability distribution; these random numbers took value in the unity interval (0,1). Step 3: Rearranged these random numbers to obtain x(1), x(2), . . . , x(n1) such that x(1) 6 x(2) 6    6 x(n1). Step 4: Let x(0) = 0, x(n) = 1. Step 5: Let pi = x(i)  x(i1); i = 1, 2, . . . , n. Obviously, the test profiles generated above for two trials could rarely be identical. In each trial the test cases were selected one by one in accordance with the corresponding test profile which may or may not be a uniform probability distribution. Table 1 tabulates the test results for the 40 trials. Since exactly 1200 tests were performed in each trial and the two injected defects were not detectable by the given test pool, the number of defects detected in each trial was less than or equal to 36. 3.2. Experiment II 3.2.1. The subject program The subject program is called SESD (Software Environment for Software Data collection) program. It is a grammar analyzer originally used in a software environment for software data collection. The software environment was developed in our research group and written in VC6. It 3

The software toolkit was developed in our research group. The reasons for generating operational profiles in this way will be clarified in Section 7.3. 4

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

1235

Table 1 Test results for Experiment I j

i = 1, 2, 3, . . .

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2,

3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3,

4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,

5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 5, 5, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 8, 7, 8, 7, 8, 7, 7, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 7, 7, 7, 7, 8, 8,

8, 8, 8, 8, 8, 9, 8, 9, 9, 8, 9, 8, 9, 8, 8, 9, 8, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 8, 8, 8, 8, 9, 9,

9, 10, 11, 12, 13, 14, 15, 20, 22, 27, 28, 31, 33, 35, 37, 51, 52, 59, 66, 96, 104, 180, 233, 237, 269, 304, 497 10, 13, 14, 15, 17, 19, 20, 22, 23, 27, 29, 30, 32, 35, 38, 60, 63, 71, 78, 86, 97, 156, 187, 299, 478, 512, 925 9, 10, 12, 13, 16, 18, 21, 24, 26, 35, 39, 40, 41, 51, 62, 63, 68, 77, 78, 92, 97, 100, 120, 151, 166, 248 9, 10, 12, 13, 14, 15, 16, 19, 20, 21, 24, 33, 40, 46, 52, 65, 82, 89, 102, 113, 128, 139, 141, 183, 198, 220, 762 9, 10, 12, 13, 15, 16, 19, 20, 22, 24, 29, 3, 33, 35, 42, 55, 64, 70, 86, 89, 104, 159, 167, 178, 210, 521 10, 12, 13, 14, 15, 17, 18, 21, 23, 25, 27, 28, 29, 30, 32, 35, 36, 37, 42, 45, 74, 84, 85, 88, 92, 115, 500 9, 11, 12, 13, 14, 17, 18, 19, 20, 21, 23, 24, 26, 34, 47, 50, 53, 65, 80, 83, 108, 110, 112, 132, 169, 219, 394, 1196 10, 11, 12, 13, 14, 15, 17, 19, 22, 23, 24, 26, 27, 36, 37, 50, 69, 74, 94, 132, 151, 191, 249, 297, 504, 572, 638, 647 10, 11, 13, 17, 18, 20, 24, 26, 28, 30, 32, 33, 39, 42, 50, 51, 53, 54, 62, 72, 79, 85, 86, 90, 99, 118, 296, 994 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 27, 28, 34, 37, 54, 56, 57, 58, 73, 107, 157, 618, 680, 1081 12, 13, 14, 15, 17, 18, 19, 20, 21, 25, 26, 28, 29, 31, 33, 34, 42, 43, 48, 71, 98, 104, 115, 135, 148, 703, 1134 9, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 30, 37, 38, 44, 66, 68, 72, 76, 90, 102, 115, 119, 214, 253, 320, 350 10, 11, 12, 14, 15, 16, 25, 28, 29, 30, 32, 33, 34, 43, 47, 60, 72, 90, 99, 117, 172, 211, 291, 325, 332, 667, 1187 10, 11, 14, 17, 19, 23, 25, 26, 27, 28, 29, 31, 37, 40, 49, 50, 51, 62, 102, 105, 109, 231, 241, 448, 773, 1147 9, 10, 12, 13, 15, 17, 18, 24, 26, 28, 30, 31, 40, 54, 59, 62, 68, 71, 74, 76, 132, 139, 155, 228, 301, 353, 457 10, 11, 12, 14, 17, 18, 24, 26, 29, 39, 45, 52, 55, 59, 66, 82, 98, 108, 125, 133, 141, 156, 186, 242, 253, 312, 577, 710 9, 10, 11, 13, 14, 15, 17, 19, 22, 23, 24, 27, 28, 29, 30, 35, 47, 48, 55, 70, 100, 122, 133, 248, 312, 586, 1046 10, 11, 12, 13, 14, 15, 17, 19, 20, 21, 23, 31, 34, 37, 38, 41, 43, 50, 51, 76, 96, 108, 131, 155, 387, 614, 633, 686 9, 10, 11, 12, 13, 16, 17, 19, 21, 28, 30, 36, 45, 47, 56, 59, 67, 84, 98, 113, 156, 158, 168, 197, 439, 479, 532, 535 9, 10, 11, 12, 13, 14, 15, 18, 19, 22, 25, 29, 32, 34, 37, 39, 46, 70, 77, 89, 91, 101, 119, 123, 164, 306, 888 9, 12, 14, 15, 16, 17, 18, 20, 21, 23, 32, 35, 36, 41, 43, 44, 47, 50, 78, 88, 91, 97, 107, 117, 132, 136 9, 11, 13, 16, 19, 21, 26, 29, 32, 34, 37, 47, 54, 60, 79, 111, 164, 169, 196, 241, 281, 282, 359, 442, 448, 485, 727 9, 11, 12, 13, 14, 16, 17, 18, 21, 22, 23, 24, 27, 32, 33, 36, 64, 66, 76, 86, 119, 155, 165, 174, 193, 196, 202, 840 10, 13, 14, 15, 16, 18, 20, 21, 22, 27, 30, 39, 41, 64, 70, 76, 77, 104, 105, 113, 124, 168, 180, 208, 269, 399 9, 10, 11, 12, 13, 15, 20, 21, 22, 28, 32, 33, 34, 42, 44, 56, 65, 75, 101, 178, 185, 239, 253, 272, 292, 513 10, 11, 12, 14, 16, 17, 18, 19, 20, 22, 23, 25, 28, 29, 30, 31, 34, 45, 48, 60, 77, 171, 255, 330, 507, 635 10, 12, 15, 16, 17, 18, 20, 21, 23, 27, 28, 31, 33, 36, 39, 40, 88, 91, 121, 130, 143, 200, 251, 258, 295, 314, 325, 660 9, 10, 11, 12, 13, 14, 18, 21, 22, 23, 25, 26, 30, 32, 33, 39, 61, 62, 66, 71, 72, 82, 87, 129, 175, 178 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 23, 25, 27, 32, 37, 51, 66, 68, 83, 92, 99, 121, 154, 174, 225, 266, 957, 971 9, 10, 12, 14, 16, 19, 21, 25, 28, 30, 31, 33, 34, 40, 44, 51, 73, 103, 117, 122, 158, 190, 253, 275, 318 9, 10, 11, 12, 13, 14, 15, 16, 19, 20, 22, 23, 24, 29, 34, 35, 38, 44, 47, 61, 73, 74, 85, 144, 216 9, 11, 13, 14, 15, 16, 18, 22, 23, 26, 27, 28, 
45, 47, 50, 57, 66, 67, 82, 107, 112, 121, 124, 165, 233, 382, 512, 999 10, 11, 12, 13, 14, 16, 17, 18, 19, 22, 23, 28, 29, 32, 42, 45, 51, 56, 60, 63, 72, 79, 81, 86, 88, 135, 142 10, 11, 12, 13, 15, 16, 17, 18, 19, 22, 26, 28, 31, 32, 38, 57, 61, 67, 68, 82, 106, 127, 130, 134, 164, 202, 305, 308 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 24, 28, 33, 39, 41, 42, 46, 48, 50, 51, 72, 92, 94, 97, 99, 106, 366 9, 10, 11, 12, 13, 14, 15, 16, 20, 21, 22, 23, 26, 40, 42, 55, 61, 76, 112, 129, 131, 135, 138, 142, 146, 299 9, 10, 11, 12, 14, 15, 16, 18, 19, 24, 26, 27, 28, 31, 38, 40, 41, 53, 66, 73, 107, 108, 129, 151, 171, 776 10, 12, 15, 16, 17, 19, 21, 22, 24, 29, 30, 35, 37, 40, 44, 46, 64, 69, 86, 100, 146, 147, 161, 190, 226, 360, 373 10, 11, 12, 14, 15, 16, 18, 20, 21, 29, 37, 50, 52, 60, 64, 69, 72, 82, 83, 87, 92, 97, 122, 129, 279, 360 10, 11, 12, 13, 14, 17, 19, 26, 28, 30, 32, 35, 49, 52, 54, 57, 59, 71, 95, 114, 144, 148, 216, 230, 236, 375, 732

i: the number of detected and removed defects, i = 1, 2, . . . j: the number of trials of software testing, j = 1, 2, . . . , 40. Entry (i, j): the cumulative number of tests used to detect and remove the ith defect (inclusive) in the jth trial of software testing (being stopped at 1200th test); the last value in each row denotes the time instant of the last failure being observed in the corresponding trial.

comprised 17,807 lines of code in 106 files. The SESD program comprised 3559 lines of C++ code with 3179 lines being executable. The SESD program was first implemented by one programmer and then subjected to independent testing. Consequently, 28 defects were found and documented. The input to the SESD program was any C language program. The SESD program then generated five outputs for the input: the number of lines of code, the number of total usages of operators, the number of total usages of operands, the number of distinct usages of operators, and the number of distinct usages of operands.

was subjected to random testing in the SRATE platform automatically. The SESD program without the 28 defects served as the test oracle. Any discrepancy between the outputs of the subject program under test and those of the test oracle was treated as a failure. One and only one failurecausing defect was removed upon a failure being observed. The test pool comprised 5477 test cases, where a test case was simply a C language program. These C language programs were downloaded from the Internet. It was found that 2 of the 28 defects were not detectable by the test pool.5 Since in reality the tester might stop testing before

3.2.2. The test results In the experiment the 28 defects were returned to the SESD program. As in Experiment I, the SESD program

5 It was a co-incident that both experiments had 2 non-detectable defects each.

1236

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

Table 2 Test results for Experiment II j

i = 1, 2, . . . , 25

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2,

2, 3, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 3, 6, 3, 2, 3, 5, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 4, 2, 3,

4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 4, 9, 4, 3, 4, 6, 3, 3, 5, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 5, 3, 4,

6, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 34, 126, 172, 588 6, 8, 9, 11, 12, 13, 14, 16, 17, 19, 23, 24, 26, 28, 30, 31, 33, 36, 43, 135, 148, 163 5, 6, 7, 9, 10, 11, 13, 16, 17, 18, 19, 21, 22, 23, 24, 37, 62, 66, 75, 85, 104, 124 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 22, 27, 53, 66, 82, 85, 137, 168, 237 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 33, 38, 47, 48, 63, 67, 87 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 21, 22, 24, 25, 29, 58, 109, 115 4, 5, 7, 8, 10, 12, 14, 16, 17, 19, 24, 25, 27, 29, 33, 38, 40, 44, 50, 137, 150, 157 6, 7, 8, 9, 10, 11, 12, 13, 16, 17, 20, 22, 26, 30, 34, 49, 86, 98, 117, 132, 181, 683 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22, 23, 24, 27, 29, 39, 88, 375 4, 5, 6, 7, 9, 12, 13, 14, 16, 17, 18, 19, 22, 24, 26, 47, 81, 91, 101, 137, 294, 304 5, 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 21, 22, 28, 36, 37, 40, 42, 88, 113, 139, 164 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 18, 24, 41, 49, 69, 73, 75, 81, 98, 664 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 18, 22, 29, 36, 61, 80, 115, 142, 165, 186 5, 6, 7, 10, 11, 12, 13, 14, 16, 17, 18, 21, 23, 25, 26, 28, 39, 49, 54, 58, 84, 191 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19, 20, 21, 26, 36, 41, 81, 95, 227, 304, 469 12, 13, 14, 15, 16, 18, 20, 22, 24, 25, 26, 27, 30, 33, 34, 35, 49, 55, 118, 120, 176, 248 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 24, 25, 26, 27, 29, 81, 188, 417, 903 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 19, 30, 41, 42, 56, 64, 135, 156, 237, 306 5, 6, 8, 9, 10, 11, 14, 15, 19, 20, 21, 24, 25, 29, 30, 38, 39, 41, 50, 61, 82, 88 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 20, 22, 23, 24, 29, 30, 70, 74, 118, 152, 597 5, 6, 7, 10, 11, 12, 13, 15, 16, 19, 20, 21, 23, 24, 25, 34, 58, 60, 78, 95, 101, 209 4, 5, 6, 7, 8, 12, 13, 14, 15, 17, 19, 20, 28, 35, 45, 69, 73, 86, 116, 157, 214, 276 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 25, 26, 37, 66, 75, 130, 168 4, 5, 6, 7, 10, 11, 12, 13, 15, 16, 17, 18, 19, 22, 32, 41, 43, 45, 69, 125, 140, 236 6, 7, 9, 10, 11, 13, 14, 15, 17, 19, 20, 21, 22, 23, 24, 26, 28, 30, 31, 52, 63, 376 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 17, 18, 19, 23, 26, 27, 29, 34, 38, 76, 145, 153 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 21, 22, 33, 45, 57, 67, 105, 147 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 16, 20, 21, 22, 23, 24, 50, 66, 70, 71, 343, 1126 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 19, 24, 29, 32, 36, 38, 58, 83, 102, 427 4, 5, 6, 7, 9, 10, 11, 12, 14, 17, 18, 19, 22, 27, 36, 41, 59, 72, 117, 120, 144, 364 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 22, 41, 66, 83, 90, 138, 810 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 21, 22, 24, 28, 37, 47, 69, 71, 75, 80 6, 7, 9, 10, 11, 12, 13, 15, 16, 17, 19, 20, 22, 23, 37, 43, 44, 64, 81, 127, 344, 484 4, 5, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 24, 45, 47, 79, 89, 143, 404, 410 5, 6, 9, 11, 12, 13, 14, 15, 17, 18, 19, 21, 23, 24, 25, 27, 38, 43, 107, 109, 129, 132 5, 6, 7, 9, 11, 15, 16, 17, 19, 21, 22, 24, 25, 26, 29, 32, 38, 49, 157, 334, 453, 816 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 20, 21, 22, 25, 26, 29, 36, 42, 57, 367 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 26, 30, 37, 44, 80, 108, 130 4, 5, 6, 7, 8, 9, 10, 12, 13, 16, 17, 19, 21, 26, 37, 49, 51, 63, 68, 301, 334, 435 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 29, 33, 56, 120, 266

i: the number of detected and removed defects, i = 1, 2, . . . , 25. j: the number of trials of software testing, j = 1, 2, . . . , 40. Entry (i, j): the cumulative number of test cases used to detect and remove the ith defect (inclusive) in the jth trial of software testing.

all detectable defects were found, in this experiment a trial of software testing was finished upon 25 defects being detected. This was a scenario different from that considered in Experiment I. We conducted software testing for 40 trials. In each trial the corresponding test profile was generated in the way as shown in Experiment I except n = 5477. Table 2 tabulates the test results for the 40 trials.

4. Empirical observations Obviously, in each trial of software testing, the corresponding software reliability tended to grow as software testing proceeds no matter whatever operational profile was adopted. This was because the number of remaining defects monotonically decreased in the course of software

testing as failure-causing defects were removed one by one, and the operational profile for each trial, once generated, kept invariant in the course of software testing. Suppose {M(t), t P 0} follows a non-homogeneous Poisson process with the mean function K(t). Then at any time t, M(t) can be simply treated as a Poisson distributed random variable with parameter K(t). An important property of a Poisson distributed random variable is that its mathematical expectation (mean) coincides with its variance. Let D(t) be the variance of M(t). It holds KðtÞ ¼ DðtÞ So, an informal method for testing whether a random variable follows a Poisson distribution is to compare its mean with its variance. This informal method can be applied to Tables 1 and 2.

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

ðlÞ

I ðtÞ ¼

8 1 > > > > > > < > > > > > > :

0

if the lth trial of software testing is still ongoing at time t; i:e:; the number of defects detected and removed by the first ðt  1Þ test cases is less than 25 otherwise

Then K(t) is estimated as ^ ¼P KðtÞ 40 Fig. 1. Estimates of the mean and the variance for Experiment I.

4.1. Empirical observations for Experiment I

40 X

1

l¼1 I

ðlÞ

ðtÞ

40 X ^ ¼ 1 KðtÞ M ðlÞ ðtÞ 40 l¼1

!

!

I ðlÞ ðtÞM ðlÞ ðtÞ

l¼1

D(t) is estimated as ^ ¼P DðtÞ 40

1

ðlÞ l¼1 I ðtÞ  1

Since for each of the 40 trials the software test process terminated at the time instant6 of 1200, we can estimate K(t) and D(t) as follows, respectively,

1237

40 X

2 ^ I ðlÞ ðtÞðM ðlÞ ðtÞ  KðtÞÞ

l¼1

^ ^ Fig. 2 shows the behavior of KðtÞ and DðtÞ for Table 2. For the second manner, we take account of all the 40 trials of software testing. We have ! 40 X 1 ðlÞ ^ ¼ KðtÞ M ðtÞ 40 l¼1

and

where M(l)(t) still denotes the number of failures observed in the first t tests (inclusive) in the lth trial of software testing. However we note that

40 X 2 ^ ^ ¼ 1 DðtÞ ðM ðlÞ ðtÞ  KðtÞÞ 39 l¼1

M ðlÞ ðtÞ  25 for I ðlÞ ðtÞ ¼ 0

where M(l)(t) denotes the number of failures observed in the first t tests (inclusive) in the lth trial of software testing. ^ ^ Fig. 1 shows the behavior of KðtÞ and DðtÞ for Table 1. From Fig. 1 we can see that as software testing proceeds, ^ tends to monotonically increase, whereas DðtÞ ^ tends to KðtÞ ^ increase first. After some time, DðtÞ decreases and eventually diminishes to zero. The relation K(t) = D(t) can hardly hold. There is a huge gap between K(t) and D(t). This implies the argument that software reliability growth behavior follows a non-homogeneous Poisson process is highly questionable.

4.2. Empirical observations for Experiment II

D(t) is estimated as 40 X 2 ^ ^ ¼ 1 ðM ðlÞ ðtÞ  KðtÞÞ DðtÞ 39 l¼1

^ Fig. 3 shows the corresponding behavior of KðtÞ and ^ DðtÞ for Table 2. Figs. 2 and 3 do not justify that software reliability growth behavior follows a non-homogeneous Poisson process. 5. Statistical hypothesis testing for Experiment I In order to examine the software reliability growth behavior more rigorously, in this section we apply the Cramer–von Mises test, the CPIT (conditional probability integral transformation) test, and the Chi-square test to the dataset obtained in Experiment I.

Among the 40 trials, the software test processes terminate at various time instants upon 25 defects being detected. So, for any given instant of time t, the number of trials with ongoing software testing may or may not be 40. In this way, there are two different (informal) manners to estimate K(t) and D(t). The first manner is to ignore the trials whose software test processes have terminated. The second manner is to take account of all the trials, but the trials whose software test processes have terminated are supposed to observe no new failures. For the first manner, let 6

Note the time instant here just denotes the numbers of tests applied.

Fig. 2. Estimates of the mean and the variance for Experiment II via the first manner.

1238

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

W 2 ¼ n1

g X

Z 2j qj ;

j¼n

W 2m ¼ n1

g X

A2 ¼ n1

g X j¼n

Z 2j qj ; H j ð1  H j Þ

Z 2j

j¼n

Fig. 3. Estimates of the mean and the variance for Experiment II via the second manner.

5.1. Cramer–von Mises test In order to test whether a set of observations (such as those tabulated in Table 1) fits a Poisson distribution (which a discrete) with an unknown mean that has been estimated by the sample mean, we can apply the Cramer– von Mises test.7 However Henze shows that the asymptotic distribution of the Cramer–von Mises statistic under the null hypothesis (Poisson distribution) depends on the parameter of the tested distribution and thus it is not possible to obtain a parameter-irrelevant table of critical values [27]. Rather, we follow the bootstrap method, i.e., estimating the underlying parameter from the sample, proposed by Spinelli and Stephens to generate the required table of critical values [28]. As in Section 4, let M(l)(t) denote the number of failures observed in the first t tests (inclusive) in the lth trial of software testing. Then for any given t, {M(1)(t), M(2)(t), . . . , M(40)(t)} forms a sample. We want to test whether this sample fits a Poisson distribution with mean (parameter)

Pj where n ¼ 40; H j ¼ k¼0 qk ; g is chosen so that Nk = 0 and qk < 103/n, for j > g; g is chosen so that Nk = 0 and qk < 103/n, for j < n. Table 3 tabulates the corresponding values of W 2 ; A2 ; W 2m at various time instants of concern. Let a = 0.05 be the significance level. Then for any time ^ instant t of concern, we can estimate the parameter as KðtÞ and obtain the corresponding critical values for the three statistics W 2 ; A2 ; W 2m as tabulated in Table 3. If any of W 2 ; A2 ; W 2m is greater than its corresponding critical value, then the null hypothesis that {M(1)(t), M(2)(t), . . . , M(40)(t)} is a sample of Poisson distribution is rejected. From Table 4 we can see that for various time instants of concern, W 2 ; A2 ; W 2m are all much greater than their corresponding critical values. We reject the hypothesis that {M(t), t P 0} follows a non-homogeneous Poisson process. 5.2. CPIT test In order to validate the Goel–Okumoto software reliability model, Goel developed a test procedure to test whether a software reliability dataset fits a given mean function under the assumption that the dataset fits a nonhomogeneous Poisson process already [29]. However the outcomes of the Goel test depend on how to estimate the required parameters. Here we apply the CPIT test which is more recent [30] to Table 1. For the CPIT test, suppose the ith failure is observed at instant of time ti, i = 1, 2, . . . , n, the statistic of concern is

 P ini1 n1 sup 0; j¼i W j  k  ðn  i  kÞW i Vi ¼1P h  P ini1 ni k k n1 C ð1Þ sup 0; W  k  ðn  i  kÞW j i1 k¼0 ni j¼i Pni

k k k¼0 C ni ð1Þ

h

40 X ^ ¼ 1 KðtÞ M ðlÞ ðtÞ 40 l¼1

^

k

^

Let8 qk ¼ ðKðtÞÞ eKðtÞ ; N k ¼ #ðM ðlÞ ¼ kjl ¼ 1;2; ...; 40Þ; k ¼ 0; k! P j 1; 2; ... ; Z j ¼ k¼0 ðN k  nqk Þ. The method proposed by Spinelli and Stephens calculates the following statistics

7

The standard Kolmogorov–Smirnov test can also be adapted to test discrete distributions by generating parameter-dependent table of critical values [26,27]. However the Cramer–von Mises test is more recent. 8 #(M(l)(t) = kjl = 1, 2, . . . , 40) denotes the number of occurrences (M(l)(t) = k) in the sample {M(1)(t), M(2)(t), . . . , M(40)(t)}.

where i = 1, 2, . . . , n  2; Wi = ti/tn. Then testing whether the Goel–Okumoto model is acceptable is equivalent to testing whether Vi follows the uniform distribution over [0, 1]. The outcomes are also summarized in Table 4. We see that in the 40 trials of software testing, the CPIT test rejects the Goel–Okumoto NHPP model for 27 of them. This implies that the Goel–Okumoto NHPP model cannot hold in general. 5.3. Chi-square test An important property of a NHPP is that the numbers of event occurrences observed in two disjoint time intervals

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

1239

Table 3 Cramer–von Mises goodness-of-fit test of Poisson distribution for Table 1 at significance level 0.05 t

^ KðtÞ

W2

CV(W2)

A2

CV(A2)

W 2m

CVðW 2m Þ

Hypothesis

10 50 100 200 300 400 500 800 1200

9.400000 23.625000 28.325000 31.800000 33.000000 33.650000 33.900000 34.600000 34.925000

1.749736 0.743545 0.718347 1.084632 1.507963 1.517581 1.618067 1.689333 1.564728

0.170474 0.163708 0.165875 0.171371 0.168504 0.166156 0.165313 0.165249 0.164978

8.153997 4.073593 3.922953 5.506104 7.169412 7.254180 7.651154 7.852884 7.314326

1.095806 1.051986 1.064866 1.113657 1.112252 1.084893 1.083592 1.076686 1.075792

15.250516 11.592440 12.252877 18.508681 24.940190 25.437852 27.004309 28.137323 26.316063

1.846075 2.806448 3.094513 3.477869 3.494325 3.427137 3.447773 3.483760 3.491579

Reject Reject Reject Reject Reject Reject Reject Reject Reject

CV(.): critical value of a statistic.

Table 4 CPIT test of the Goel–Okumoto NHPP model for Table 1 at significance level 0.05 j

The value of the statistic of the CPIT test for the jth trial

The critical value of the CPIT test for the jth trial

The outcome of the CPIT test

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

0.297 0.388 0.139 0.139 0.209 0.179 0.288 0.448 0.158 0.539 0.348 0.288 0.309 0.379 0.318 0.218 0.409 0.509 0.379 0.339 0.218 0.306 0.188 0.248 0.267 0.409 0.318 0.279 0.448 0.309 0.170 0.318 0.179 0.288 0.239 0.118 0.188 0.248 0.197 0.297

0.224 0. 224 0.227 0.224 0.227 0.224 0.221 0.221 0.221 0.224 0.224 0.221 0.224 0.227 0.224 0.221 0.224 0.221 0.221 0.224 0.227 0.224 0.221 0.227 0.227 0.227 0.221 0.227 0.221 0.234 0.234 0.221 0.224 0.221 0.224 0.227 0.227 0.224 0..227 0.224

  p p p p   p       p     p

are independent. More specifically, let M(s1, s2] and M(s3, s4] be the numbers of software failures observed in time intervals (s1, s2] and (s3, s4], respectively, where s1 < s2 < s3 < s4, then M(s1, s2] and M(s3, s4] define two independent random variables. Here we apply the Chisquare test to Table 1 to test whether the independence property holds. Let M(l)(s1, s2] be the number of software failures observed from the s1th test (exclusive) to the s2  th test (inclusive) in the lth trial of software testing, l = 1, 2, . . . , 40. Then M(l)(s1, s2] takes values in the range {0, 1, 2, . . . , 38}. 5.3.1. Testing the hypothesis that M(0, 10] and M(10, 50] are independent Divide the range {0, 1, 2, . . . , 38} into 3 subsets for M(0, 10] as follows9 I 1 ¼ f0; . . . ; 8g;

I 2 ¼ f9g;

I 3 ¼ f10; . . . ; 38g

Divide the range {0, 1, 2, . . . , 38} into 10 subsets for M(10, 50] as follows J 1 ¼ f0; . . . ; 10g;

J 2 ¼ f11g;

J 3 ¼ f12g;

J 4 ¼ f13g;

J 5 ¼ f14g;

J 6 ¼ f15g;

 p

J 8 ¼ f17g;

J 9 ¼ f18g;

J 10 ¼ f19; . . . ; 38g

       p

Let

 p   p p

ðlÞ nij

: reject the hypothesis that the Goel–Okumoto model holds (27 trials in total). p : accept the hypothesis that the Goel–Okumoto model holds (13 trials in total).

¼

nij ¼

1 if M ðlÞ ð0;10 2 I i and M ðlÞ ð10;50 2 J j 0 otherwise

40 X l¼1

ðlÞ

nij ; ni ¼

10 X

nij ; nj ¼

j¼1

3 X i¼1

nij ; n ¼

3 X 10 X

nij

i¼1 j¼1

Table 5 tabulates the resultant values obtained from Table 1. The Chi-square statistic is defined as v2 ¼

 p 

(

J 7 ¼ f16g;

3 X 10 X ½ni;j  ðni nj Þ=n2 ðni nj Þ=n i¼1 j¼1

From Table 5 we have v2 = 34.001. However the critical value at the 0.05 significance level for the Chi-square with 9

2

The divisions of the ranges should make the used Chi-square statistic P3 P10 ½ni;j ðni nj Þ=n2 defined, that is, ni > 0,nj > 0. i¼1 j¼1 ðni nj Þ=n

v ¼

1240

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

Table 5 Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(10, 50] obtained from Table 1: Choice 1

Table 7 Defined measures (ni,j) for independence hypothesis testing of M(10, 50] and M(50, 500] obtained from Table 1: Choice 1

i

i

1 2 3 nj

j 1

2

3

4

5

6

7

8

9

10

ni

0 1 0 1

0 2 2 4

0 2 2 4

0 0 6 6

0 5 1 6

0 8 3 11

0 0 1 1

0 2 2 4

0 1 0 1

1 1 0 2

1 22 17 40

(3  1)  (10  1) = 18 degrees of freedom is 28.869. Since 34.001 > 28.869, we reject the hypothesis that M(0, 10] and M(10, 50] are independent. Since the choice of division for the Chi-sqaure test might have significant influence on the results, we try another division. Divide the range {0, 1, 2, . . . , 38} into 2 subsets for M(0, 10] as follows I 1 ¼ f0; . . . ; 9g;

I 2 ¼ f10; . . . ; 38g

J 5 ¼ f15g;

J 2 ¼ f12g;

J 6 ¼ f16; 17g;

J 3 ¼ f13g;

J 4 ¼ f14g;

J 7 ¼ f18; . . . ; 38g

We obtain Table 6 and v2 = 13.984. However the critical value at the 0.05 significance level for the Chi-square with (2  1)  (7  1) = 6 degrees of freedom is 12.592, we reject the hypothesis that M(0, 10] and M(10, 50] are independent. 5.3.2. Testing the hypothesis that M(10, 50] and M(50, 500] are independent As in Section 5.3.1, we divide the ranges of M(10, 50] and M(50, 500] into several subsets. First, the range of M(10, 50] is divided into 7 subsets as follows I 1 ¼ f0; . . . ; 11g; I 5 ¼ f15g;

I 2 ¼ f12g;

I 6 ¼ f16; 17g;

1

2

3

4

5

6

7

8

ni

0 0 0 0 0 1 2 3

0 0 0 0 1 2 1 4

0 0 0 0 4 2 0 6

0 0 3 3 0 0 0 6

0 0 2 1 5 0 0 8

0 2 1 1 1 0 0 5

1 2 0 1 0 0 0 4

4 0 0 0 0 0 0 4

5 4 6 6 11 5 3 40

is 58.124, we reject the hypothesis that M(10, 50] and M(50, 500] are independent. To further confirm the above result, we divide the range of M(10, 50] into 6 subsets as follows I 1 ¼ f0; . . . ; 11g;

Divide the range {0, 1, 2, . . . , 38} into 7 subsets for M(10, 50] as follows J 1 ¼ f0; . . . ; 11g;

1 2 3 4 5 6 7 nj

j

I 3 ¼ f13g;

I 4 ¼ f14g;

I 5 ¼ f15g;

I 2 ¼ f12g;

I 3 ¼ f13g;

I 4 ¼ f14g;

I 6 ¼ f16; . . . ; 38g

The range of M(50, 500] is divided into 6 subsets as follows J 1 ¼ f0; . . . ; 7g; J 2 ¼ f8; 9g; J 3 ¼ f10g; J 5 ¼ f12g; J 6 ¼ f13; . . . ; 38g

J 4 ¼ f11g;

We obtain Table 8 from Table 1 and v2 = 77.022. Note that the critical value at the 0.05 significance level for the Chisquare with (6  1)  (6  1) = 25 degrees of freedom is 37.625, we reject the hypothesis that M(10, 50] and M(50, 500] are independent. 5.3.3. Testing the hypothesis that M(50, 500] and M(500, 1200] are independent Divide the ranges M(50, 500] into 11 subsets as follows I 1 ¼ f0; . . . ; 5g;

I 2 ¼ f6g;

I 3 ¼ f7g;

I 4 ¼ f8g;

I 5 ¼ f9g; I 6 ¼ f10g; I 7 ¼ f11g; J 8 ¼ f12g; J 9 ¼ f13g; J 10 ¼ f14g; J 11 ¼ f15; . . . ; 38g

I 7 ¼ f18; . . . ; 38g

The range of M(50, 500] is divided into 8 subsets as follows

Divide the range of M(500, 1200] into 5 subsets as follows

J 1 ¼ f0; . . . ; 6g;

J 1 ¼ f0g;

J 5 ¼ f11g;

J 2 ¼ f7g;

J 6 ¼ f12g;

J 3 ¼ f8; 9g;

J 7 ¼ f13g;

J 4 ¼ f10g;

J 8 ¼ f14; . . . ; 38g

J 2 ¼ f1g;

I 3 ¼ f2g;

J 4 ¼ f3; 4g;

J 5 ¼ f5; . . . ; 38g

2

We obtain Table 7 from Table 1 and v = 100.308. Note that the critical value at the 0.05 significance level for the Chi-square with (7  1)  (8  1) = 42 degrees of freedom Table 6 Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(10, 50] obtained from Table 1: Choice 2 i

1 2 nj

j 1

2

3

4

5

6

7

ni

3 2 5

2 2 4

0 6 6

5 1 6

8 3 11

2 1 5

3 0 3

23 17 40

Table 8 Defined measures (ni,j) for independence hypothesis testing of M(10, 50] and M(50, 500] obtained from Table 1: Choice 2 i

j 1

2

3

4

5

6

ni

1 2 3 4 5 6 nj

0 0 0 0 1 6 7

0 0 0 0 4 2 6

0 0 3 3 0 0 6

0 0 2 1 5 0 8

0 2 1 1 1 0 5

5 2 0 1 0 0 8

5 4 6 6 11 8 40

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

1241

Table 9 Defined measures (ni,j) for independence hypothesis testing of M(50, 500] and M(500, 1200] obtained from Table 1: Choice 1

Table 11 Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(500, 1200] obtained from Table 1: Choice 1

i

i

1 2 3 4 5 6 7 8 9 10 11 nj

j 1

2

3

4

5

ni

0 1 1 2 0 2 3 2 3 2 0 16

0 0 0 1 1 2 3 2 1 1 0 11

1 1 1 1 0 2 2 1 0 0 1 10

0 0 2 0 0 0 0 0 0 0 0 2

0 0 0 1 0 0 0 0 0 0 0 1

1 2 4 5 1 6 8 5 4 3 1 40

We obtain Table 9 and v2 = 40.488. Since the critical value at the 0.05 significance level for the Chi-square with (11  1)  (5  1) = 40 degrees of freedom is 55.758, we accept the hypothesis that M(50, 500] and M(500, 1200] are independent. Divide the ranges M(50, 500] into 8 subsets as follows

1 2 3 nj

j 1

2

3

4

5

ni

0 8 8 16

0 6 5 11

1 6 3 10

0 1 1 2

0 1 0 1

1 22 17 40

We obtain Table 11 and v2 = 4.549. Since the critical value at the 0.05 significance level for the Chi-square with (3  1)  (5  1) = 8 degrees of freedom is 15.507, we accept the hypothesis that M(0, 10] and M(500, 1200] are independent. Divide the ranges M(0, 10] into 2 subsets as follows I 1 ¼ f0; . . . ; 9g;

I 2 ¼ f10; . . . ; 38g

Divide the range of M(500, 1200] into 4 subsets as follows J 1 ¼ f0g;

J 2 ¼ f1g;

J 3 ¼ f2g;

J 4 ¼ f3; . . . ; 38g

2

Divide the range of M(500, 1200] into 4 subsets as follows

We obtain Table 12 and v = 1.218. Since the critical value at the 0.05 significance level for the Chi-square with (2  1)  (4  1) = 3 degrees of freedom is 7.815, we accept the hypothesis that M(0, 10] and M(500, 1200] are independent.

J 1 ¼ f0g;

6. Statistical hypothesis testing for Experiment II

I 1 ¼ f0; . . . ; 6g; I 5 ¼ f11g;

I 2 ¼ f7g;

J 6 ¼ f12g;

J 2 ¼ f1g;

I 3 ¼ f8g;

J 7 ¼ f13g;

I 3 ¼ f2g;

I 4 ¼ f9; 10g; J 8 ¼ f14; . . . ; 38g

J 4 ¼ f3; . . . ; 38g

2

We obtain Table 10 and v = 20.418. Since the critical value at the 0.05 significance level for the Chi-square with (8  1)  (4  1) = 21 degrees of freedom is 32.671, we accept the hypothesis that M(50, 500] and M(500, 1200] are independent.

In this section we perform statistical hypothesis testing for Experiment II or Table 2. The methods used are exactly those used in Section 5. 6.1. Cramer–von Mises test

5.3.4. Testing the hypothesis that M(0, 10] and M(500, 1200] are independent Divide the ranges M(0, 10] into 3 subsets as follows I 1 ¼ f0; . . . ; 8g;

I 2 ¼ f9g;

I 3 ¼ f10; . . . ; 38g

Divide the range of M(500, 1200] into 5 subsets as follows J 1 ¼ f0g; J 2 ¼ f1g; J 5 ¼ f4; . . . ; 38g

J 3 ¼ f2g;

J 4 ¼ f3g;

6.2. CPIT test

Table 10 Defined measures (ni,j) for independence hypothesis testing of M(50, 500] and M(500, 1200] obtained from Table 1: Choice 2 i

1 2 3 4 5 6 7 8 nj

We apply the Cramer–von Mises goodness-of-fit test to Table 2 for various instants of time, t = 10, 20, 30, 40, 50, 60, 70, 80. Table A.1 (in the Appendix) summarizes the testing results. For all the instants of time, the Cramer– von Mises test rejects the hypothesis that {M(t), t P 0} follows a non-homogeneous Poisson process.

j 1

2

3

4

ni

1 1 2 2 3 2 3 2 16

0 0 2 2 3 2 1 1 11

2 1 1 2 2 1 0 1 10

0 2 1 0 0 0 0 0 3

3 4 6 6 8 5 4 4 40

Table A.2 summarizes the hypothesis testing results of the CPIT test at the significance level 0.05 for Table 2. In the 40 trials of software testing, CPIT test rejects the Table 12 Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(500, 1200] obtained from Table 1: Choice 2 i

1 2 nj

j 1

2

3

4

ni

8 8 16

6 5 11

7 3 10

2 1 3

22 17 40

1242

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

Goel–Okumoto NHPP model for 11 of them. The Goel– Okumoto NHPP model cannot hold in general (refer to Section 7.1). 6.3. Chi-square test The method and notion used here are identical to those used in Section 5.3. However for the SESD program, the range of M(t) or M(s1, s2] becomes {0, 1, 2, . . . , 25}, and the time intervals considered in the hypothesis testing are different from those considered in Section 5.3. 6.3.1. Testing the hypothesis that M(0, 10] and M(10, 40] are independent The range of M(0, 10] is divided into 6 subsets as follows I 1 ¼ f0; 1; 2; 3g; I 2 ¼ f4; 5; 6g; I 5 ¼ f9g; I 6 ¼ f10; . . . ; 25g

I 3 ¼ f7g;

I 4 ¼ f8g;

The range of M(10, 40] is divided into 10 subsets as follows J 1 ¼ f0; . . . ; 7g;

J 2 ¼ f8g;

J 3 ¼ f9g;

J 5 ¼ f11g;

J 6 ¼ f12g;

J 7 ¼ f13g;

J 9 ¼ f15g;

J 10 ¼ f16; . . . ; 25g

J 8 ¼ f14g;

I 3 ¼ f8g;

I 4 ¼ f9g;

and divide the range of M(10, 40] into 7 subsets as follows J 1 ¼ f0; . . . ; 9g; J 5 ¼ f13g;

J 2 ¼ f10g;

J 6 ¼ f14g;

I 1 ¼ f0; . . . ; 9g; I 2 ¼ f10g; I 3 ¼ f11g; I 4 ¼ f12g; I 5 ¼ f13g; I 6 ¼ f14g; I 7 ¼ f15; . . . ; 25g Divide the range of M(40, 80] is divided into 5 subsets as follows J 1 ¼ f0; g; J 2 ¼ f1g; J 5 ¼ f4; . . . ; 25g

J 3 ¼ f2g;

J 4 ¼ f3g;

Then we obtain Table A.6 and v2 = 31.855. Note that the critical value at the 0.05 significance level for the Chisquare with (7  1)  (5  1) = 24 degrees of freedom is 36.415, we accept the hypothesis that M(10, 40] and M(40, 80] are independent.

J 4 ¼ f10g;

Then we obtain Table A.3 from Table 2, and v2 = 80.028. Note that the critical value at the 0.05 significance level for the Chi-square with (6  1)  (10  1) = 45 degrees of freedom is 61.656, we reject the hypothesis that M(0, 10] and M(10, 40] are independent. Similarly, we can divide the range of M(0, 10] into 5 subsets as follows I 1 ¼ f0; . . . 6g; I 2 ¼ f7g; I 5 ¼ f10; . . . ; 25g

Then we obtain Table A.5 and v2 = 53.824. Note that the critical value at the 0.05 significance level for the Chisquare with (10  1)  (6  1) = 45 degrees of freedom is 61.656, we accept the hypothesis that M(10, 40] and M(40, 80] are independent. Divide the range of M(10, 40] into 7 subsets as follows

J 3 ¼ f11g;

J 4 ¼ f12g;

J 7 ¼ f15; . . . ; 25g

Then we obtain Table A.4 and v2 = 44.644. The critical value at the 0.05 significance level for the Chi-square with (5  1)  (7  1) = 24 degrees of freedom is 36.415, and therefore we reject the hypothesis that M(0, 10] and M(10, 40] are independent.

6.3.3. Testing the hypothesis that M(0, 10] and M(40, 80] are independent Divide the range of M(0, 10] into 6 subsets as follows I 1 ¼ f0; . . . ; 3g; I 5 ¼ f9g;

I 2 ¼ f4; 5; 6g;

I 3 ¼ f7g;

I 4 ¼ f8g;

I 6 ¼ f10; . . . ; 25g

Divide the range of M(40, 80] into 6 subsets as follows J 1 ¼ f0; g; J 2 ¼ f1g; J 6 ¼ f5; . . . ; 25g

J 3 ¼ f2g;

J 4 ¼ f3g;

J 5 ¼ f4g;

Then we obtain Table A.7 and v2 = 24.695. Note the critical value at the 0.05 significance level for the Chi-square with (6  1)  (6  1) = 25 degrees of freedom is 37.652, we accept the hypothesis that M(0, 10] and M(40, 80] are independent. Divide the range of M(0, 10] into 5 subsets as follows I 1 ¼ f0; . . . ; 6g; I 2 ¼ f7g; I 5 ¼ f10; . . . ; 25g

I 3 ¼ f8g;

I 4 ¼ f9g;

Divide the range of M(40, 80] into 5 subsets as follows J 1 ¼ f0; g;

J 2 ¼ f1g;

J 3 ¼ f2g;

J 4 ¼ f3g;

J 5 ¼ f4; . . . ; 25g 6.3.2. Testing the hypothesis that M(10, 40] and M(40, 80] are independent Divide the range of M(10, 40] into 10 subsets as follows I 1 ¼ f0; . . . ; 7g; I 5 ¼ f11g; I 9 ¼ f15g;

I 2 ¼ f8g;

I 3 ¼ f9g;

I 6 ¼ f12g; I 7 ¼ f13g; I 10 ¼ f16; . . . ; 25g

I 4 ¼ f10g;

I 8 ¼ f14g

7. Discussion

Divide the range of M(40, 80] is divided into 6 subsets as follows J 1 ¼ f0; g;

J 2 ¼ f1g;

J 6 ¼ f5; . . . ; 25g

J 3 ¼ f2g;

Then we obtain Table A.8 and v2 = 15.715. Note the critical value at the 0.05 significance level for the Chi-square with (5  1)  (5  1) = 16 degrees of freedom is 26.296, we accept the hypothesis that M(0, 10] and M(40, 80] are independent.

J 4 ¼ f3g;

J 5 ¼ f4g;

7.1. Immediate observations and implications Based on the experimental data and statistical hypothesis testing presented in the preceding sections, we should be

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

able to have several immediate observations and examine their implications. (1) All Figs. 1–3 demonstrate a similar pattern. The mean of M(t) increases monotonically and approaches a specific level, whereas the variance of M(t) increases first and diminishes eventually. At any time there is huge discrepancy between the mean and the variance of M(t), and the discrepancy tends to increase as software testing proceeds. (2) None of the Space program and the SESD program passes the Cramer–von Mises goodness-of-fit test for Poisson distribution at a given time. That is, the number of software failures observed in a given number of tests can hardly follow a Poisson distribution. Roughly speaking, {M(t), t P 0} does not follow a non-homogeneous Poisson process. (3) Tables 3 and A.1 show that there are huge discrepancies between the true values of various statistics and the corresponding critical values. This coincides with the huge discrepancies between the mean and the variance of M(t) as shown in Figs. 1–3. (4) Even if we could not reject the hypothesis that {M(t), t P 0} follow a non-homogeneous Poisson process in a statistically rigorous manner, the CPIT test shows that for the Space program, only 13 of the 40 trials of software testing can fit the Goel–Okumoto NHPP model. This number becomes as 29 for the SESD program. The rejection ratio is 67.5% and 27.5% for the Space program and the SESD program, respectively. If a sample really comes from the Goel–Okumoto NHPP model, however, the simulation results presented by Gaudoin show that the normal rejection ration of the CPIT test is around 5% [30, Table 2.4]. (5) Therefore, it is reasonable to say that that the Goel– Okumoto NHPP model does not holds in general.10 (6) The acceptance of some of the 40 trials of software testing by the CPIT test is not in conflict with the results of the Cramer–von Mises test. Actually, the Cramer–von Mises test is aimed to test whether various subsets of a given dataset corresponding to various particular instants of time fit a Poisson distribution and thus to test whether the given dataset fits a non-homogeneous Poisson process in an informal manner. On the other hand, the CPIT test adopted in this paper is aimed to test whether the given dataset fits the Goel–Okumoto model overall. The Cramer–von Mises test and the CPIT test examine different aspects of the NHPP assumption. It is natural for the CPIT test to accept some trials since it is not very powerful as pointed by Gaudoin [30]. The CPIT test cannot distinguish between the Goel–Okumoto model and the Jelinski– Moranda model. (7) For the Space program, M(0, 10] and M(10, 50] are not independent, neither are M(10, 50] and M(50, 500]. However M(0, 10] and M(500, 1200] are independent, and 10

We also applied the Goel test [29] to Tables 1 and 2. At the significance level 0.05, the Goel test rejects 23 for the Space program and 25 of 40 trials for the SESD program, respectively.

1243

so are M(50, 500] and M(500, 1200]. Note that M(0, 10] and M(10, 50] correspond to two adjacent time intervals at the early stage of software testing, and M(50, 500] and M(500, 1200] corresponds to two time intervals that are away from the beginning of software testing. We can argue that as software testing proceeds, the software failures tend to be less dependent. Software failures revealed in the late stage of software testing may be independent of those revealed in the early stage. (8) Similar scenario emerges for the SESD program. Among M(0, 10], M(10, 40] and M(40, 80], only M(0, 10] and M(10, 40] are not independent. In the early stages of software testing, software failures are unlikely to be independent for a piece of software under test. (9) The similarity in the outcomes of the Chi-square test for the Space program and the SESD program implies that it is mainly the software testing stage that makes the observed failures more or less correlated, though the software under test may be different. There is potential for the correlation between observed software failures to serve as an indicator of software testing stage or progress. (10) Although the operational profiles of different users may be different (as being simulated in Experiments I and II), the variance of the cumulative number of observed failures does not seem to catch up the mean of the cumulative number of observed failures and still tends to approach zero eventually. 7.2. General discussion The results presented in the preceding sections should convince us that the NHPP framework does not work well for software reliability modeling, at least one may be skeptical about the validity of various NHPP software reliability models. However the NHPP framework has been discussed and adopted extensively in software reliability community and one may be puzzled with the new findings. Why may the NHPP framework fail? Why could the invalidity of the NHPP framework escape the various tests presented in existing literature? And what can be done if the NHPP framework is discarded? In order to explain why the NHPP framework fails, we should note that there are two major assumptions in a NHPP software reliability model: (1) The numbers of failures observed in disjoint time intervals are independent. (2) The mean and the variance of the number of software failures observed up to a given instant of time (in a given number of tests) coincide with each other. The first assumption fails at least in the early software testing stage. Software failures may be highly correlated. The second assumption fails in the late software testing stage as the variance of the cumulative number of observed software failures approaches zero. Although the mean and the variance of the times between successive software

1244

K.-Y. Cai et al. / Information and Software Technology 50 (2008) 1232–1247

failures may tend to grow simultaneously, the cumulative number of observed software failures demonstrates a distinct scenario. The above two assumptions can hardly hold simultaneously in the course of software testing. We can go further with the property of Poisson probability distributions to explain why the NHPP framework may fail. A Poisson distribution can be treated as the limiting distribution of a binomial distribution as population approaches infinity. This implies that the validity of the NHPP framework requires a huge number of remaining software defects. As software testing proceeds, software defects are detected and removed and thus the number of remaining software failures that would eventually be observed tends to decrease. Consequently, the discrepancy between the software failure behavior and a non-homogeneous Poisson process tends to grow. It is reasonable to argue that a NHPP software reliability model may be more appropriate for large-scale software than for small-scale software, more appropriate for the early software testing stage than for the late software testing stage. Alternatively, it is more appropriate for software with numerous remaining defects than for software with few remaining defects. Unfortunately, this conjecture may fail since the existence of more defects implies that there is more chance for the defects to correlate with each other and thus reduce the chance for the required independence property to hold. The observation that the NHPP framework may be inappropriate for software reliability modeling benefits from the new research approach adopted in the work presented in this paper. The new research approach is based on controlled software experiments as well as on examining the behaviors of the mean and the variance of M(t) simultaneously. Here we note in the previous research approach adopted in existing literature, only the behavior of the mean of M(t) was examined by using a single dataset for each real project, and the behavior of the variance of M(t) was rarely examined, if not completely ignored. No controlled software experiments were conducted for real subject programs for examining the behaviors of the mean and the variance of M(t) simultaneously. The behavior of the mean of M(t) is not sufficient for validating or invalidating the NHPP framework for software reliability modeling. This may explain why the invalidity of the NHPP framework was not systematically addressed in existing literature. It is the new research approach that offers us a chance to have new observations and insights. Then what can we do if we discard the NHPP framework? It seems that there are at least three approaches for us to adopt. (1) We can try to develop a more general stochastic framework for software reliability modeling. Since software failures are correlated and software failure rates tend to be time dependent, we may imagine that non-homogeneous Markov chains offers a better modeling framework. Another possible framework for software reliability modeling is that of self-exciting point processes proposed by Chen and Singpurwalla [31]. Besides the cumulative num-

The observation that the NHPP framework may be inappropriate for software reliability modeling benefits from the new research approach adopted in the work presented in this paper. The new research approach is based on controlled software experiments as well as on examining the behaviors of the mean and the variance of M(t) simultaneously. We note that in the research approach adopted in the existing literature, only the behavior of the mean of M(t) was examined, by using a single dataset for each real project; the behavior of the variance of M(t) was rarely examined, if not completely ignored. No controlled software experiments were conducted on real subject programs to examine the behaviors of the mean and the variance of M(t) simultaneously. The behavior of the mean of M(t) alone is not sufficient for validating or invalidating the NHPP framework for software reliability modeling. This may explain why the invalidity of the NHPP framework was not systematically addressed in the existing literature. It is the new research approach that offers us a chance to gain new observations and insights.

Then what can we do if we discard the NHPP framework? There are at least three approaches to adopt.

(1) We can try to develop a more general stochastic framework for software reliability modeling. Since software failures are correlated and software failure rates tend to be time dependent, non-homogeneous Markov chains may offer a better modeling framework. Another possible framework for software reliability modeling is that of self-exciting point processes proposed by Chen and Singpurwalla [31]. Besides the cumulative number of observed software failures, the times between successive software failures should be examined.

(2) We can set other concerns aside and focus on the behavior of the mean of M(t) only. We may follow the philosophy of regression analysis to characterize the behavior of the mean of M(t). Although the maximum likelihood method may be difficult to apply without the NHPP assumptions, the least-squares method can still work for parameter estimation of the mean of M(t) (see the sketch after this list). The invalidity of the NHPP framework does not mean that the previous work on NHPP software reliability modeling is useless; rather, that work provides useful insights for examining the behavior of the mean of M(t).

(3) Software failure processes may be more than what a purely probabilistic or stochastic approach can handle. As software testing proceeds, the cumulative number of observed software failures becomes less and less random (in terms of the corresponding variance), and the underlying causal relations of the software under test may play a key role. Also, besides randomness as a form of uncertainty underlying software reliability behavior, there are other forms of uncertainty such as fuzziness [32] and partial repeatability11 [33]. An integrated approach that combines deterministic factors, randomness, fuzziness and partial repeatability may be better than a purely probabilistic or stochastic approach for software reliability modeling.

11 Partial repeatability is a kind of uncertainty (other than randomness and fuzziness) observed in complex processes as well as software systems [33]. By partial repeatability it is meant that complex phenomena may demonstrate an invariant property that can neither be proved in mathematics nor validated to a high accuracy in physics, but that still (partially) governs the behavior of the phenomena.
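As a minimal sketch of approach (2), assuming scipy is available, the mean value function of the Goel–Okumoto model, m(t) = a(1 - e^{-bt}), can be fitted to mean cumulative failure counts by ordinary least squares; the data points below are the K̂(t) values of Table A.1 in the Appendix:

```python
import numpy as np
from scipy.optimize import curve_fit

def go_mean(t, a, b):
    # Goel-Okumoto mean value function: m(t) = a * (1 - exp(-b * t)).
    return a * (1.0 - np.exp(-b * t))

# Checkpoints and mean cumulative failure counts (from Table A.1).
t = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
m = np.array([8.075, 15.175, 18.600, 19.700, 20.575, 20.950, 21.500, 21.900])

# Least-squares estimation of (a, b); p0 is a rough starting guess.
(a_hat, b_hat), _ = curve_fit(go_mean, t, m, p0=(25.0, 0.05))
print(f"a = {a_hat:.3f}, b = {b_hat:.4f}")
```

Note that such a fit characterizes only the mean of M(t); it says nothing about the variance, which is precisely the information that the NHPP framework constrains and that our experiments contradict.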


7.3. The operational profile problem

Since software reliability is a function of the underlying operational profile, one may wonder why different operational profiles were generated at random for the two controlled experiments. To justify this manner of generating operational profiles, we offer the following observations.

(1) In a NHPP software reliability model such as the Goel–Okumoto model, no assumptions are made about what the underlying operational profile may be. The effects of the operational profile on software reliability behavior are implicitly incorporated into the parameters of the model, which are to be calibrated by using observed software failure data. Suppose the NHPP models are valid in general; then their validity should not depend on the choice of the underlying operational profile. So, no matter what the operational profile of a real-world software system may be, an arbitrary operational profile can be assigned in the controlled experiments.

(2) There are at least two ways of assigning an arbitrary operational profile to the software under study in a controlled experiment; both are illustrated in the sketch after this list. The first way is to generate an operational profile at random and use it to select test cases in all trials of software testing. The second way is to generate an operational profile at random for each trial of software testing. The rationale of the second way lies in the understanding that different users may have different operational profiles. Further, in reality it is not easy to determine an accurate operational profile, and thus it may be more reasonable to treat each probability of the required operational profile as a random number. Both ways of generating operational profiles were adopted in the two controlled experiments. However, due to space limitations, this paper presents only the test results generated by the second way; the various statistical hypothesis tests conducted on the test results generated by the first way also invalidated the Goel–Okumoto model and rendered conclusions similar to those for the second way.

(3) It is important to note that the aim of this paper is to examine whether the NHPP models are valid in general. The paper is not concerned with how an operational profile should be determined in practice or how many operational profiles should be determined for a particular software system.
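A minimal sketch of the two profile-generation schemes (in Python; the numbers of input subdomains, trials, and tests per trial are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_CLASSES, TRIALS, TESTS = 10, 40, 80  # illustrative sizes only

def random_profile():
    # A random operational profile: one selection probability per input
    # subdomain, obtained by normalizing uniform random numbers.
    w = rng.random(NUM_CLASSES)
    return w / w.sum()

# Way 1: a single randomly generated profile shared by all trials.
shared = random_profile()
way1 = [rng.choice(NUM_CLASSES, size=TESTS, p=shared) for _ in range(TRIALS)]

# Way 2: a fresh randomly generated profile for each trial, mimicking
# different users with different usage patterns.
way2 = [rng.choice(NUM_CLASSES, size=TESTS, p=random_profile())
        for _ in range(TRIALS)]
```

Under the first scheme all trials share one usage distribution; under the second, each trial mimics a different user.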


7.4. Threats to the validity of the observations

Of course, there are threats to the validity of the major observation that the NHPP framework does not fit the software reliability behavior, and to the related observations.

(1) The various observations are based on the two controlled experiments with the Space program and the SESD program, and the subject programs have several constraints. First, the initial numbers of defects remaining in them, and their sizes, are not very large; they are not large-scale software. Second, the Space program and the SESD program are mainly transformational software, whereas there are other kinds of software such as reactive software, concurrent software, embedded software and so on. Different kinds of software may demonstrate different reliability behavior.

(2) A test case may trigger multiple software defects, and an observed failure may lead to more than one defect being detected and removed. However, in the automatic testing of the software experiments, we assume that there is a one-to-one correspondence between failure and defect and that only one defect is removed each time a failure is observed.

(3) The statistical hypothesis testing is not theoretically rigorous. We perform the Cramer–von Mises test only at selected instants of time rather than over the whole process of software testing. The CPIT test may depend heavily on the selection of the significance level. The Chi-square test of independence requires dividing the ranges of the random variables into several subsets.

(4) The software experiments and the corresponding analyses are conducted for software reliability behavior in the discrete-time domain. This does not guarantee that the resultant observations apply directly to software reliability behavior in the continuous-time domain.

(5) Effects of software testing strategies are not considered systematically. In the software experiments presented in this paper, the test profiles were generated at random. In software testing reality, however, various techniques may be employed, such as data flow testing, branch testing, boundary value testing, state based testing, and so on. The true test profile or operational profile may be more complicated than those adopted in our experiments. How testing strategies or techniques affect the validity of the NHPP modeling framework is an open problem.

8. Concluding remarks

Up to this point we can conclude this paper. We have presented two controlled software experiments to examine whether the NHPP framework is appropriate for software reliability modeling. Both empirical observations and statistical hypothesis testing suggest that software reliability behavior does not follow a non-homogeneous Poisson process in general, and does not fit the Goel–Okumoto NHPP model in particular. Although this new finding should be further tested on other software experiments, it is reasonable to cast doubt on the validity of the NHPP framework for software reliability modeling.

The previous research for justifying the NHPP framework is mainly based on analyzing the behavior of the mean of the cumulative number of observed software failures; the behavior of the corresponding variance is mostly ignored. The research approach adopted in this paper, which examines the behavior of the mean and that of the corresponding variance simultaneously on the basis of controlled software experiments, leads us to the new finding. The importance of the work presented in this paper thus lies not only in the new finding, which is distinctly different from the existing popular belief of software reliability modeling, but also in the corresponding research approach.

There is a lot of further work to be conducted in the future. First, the new finding should be tested on large-scale commercial software systems and on different kinds of software systems. Second, a better modeling framework than the NHPP one should be developed for software reliability. This modeling framework should not only consider randomness as a form of uncertainty, but also consider causal factors as well as other forms of uncertainty. Third, the new findings and the various related observations should help to improve existing software testing strategies or to develop better ones. Future work on software reliability modeling should give useful feedback to software project managers and help to improve software reliability practice.

Appendix A

See Tables A.1–A.8.

Table A.1
Cramer–von Mises goodness-of-fit test of Poisson distribution for Table 2 at significance level 0.05

t    K̂(t)       W^2       CV(W^2)   A^2       CV(A^2)   W_m^2      CV(W_m^2)  Hypothesis
10    8.075000  0.757383  0.172924  4.033344  1.104759   6.754687  1.718917   Reject
20   15.175000  0.876857  0.164025  4.564069  1.046464  10.513960  2.248677   Reject
30   18.600000  0.918479  0.164584  4.944227  1.053817  12.589420  2.494762   Reject
40   19.700000  0.898647  0.169067  4.872288  1.092022  12.725472  2.669601   Reject
50   20.575000  1.107766  0.168911  5.667872  1.098288  15.327171  2.736616   Reject
60   20.950000  1.100470  0.167351  5.720892  1.072030  15.539527  2.712770   Reject
70   21.500000  1.179644  0.166259  5.940506  1.073022  16.475892  2.747737   Reject
80   21.900000  1.260455  0.168204  6.382881  1.083049  17.807850  2.783833   Reject

CV(·): critical value of a statistic.



Table A.2
Goel test and CPIT test of the Goel–Okumoto NHPP model for Table 2 at significance level 0.05

j    CPIT statistic   Critical value   Outcome
1    0.183            0.275            √
2    0.157            0.275            √
3    0.170            0.275            √
4    0.239            0.275            √
5    0.183            0.275            √
6    0.200            0.275            √
7    0.257            0.275            √
8    0.257            0.275            √
9    0.196            0.275            √
10   0.370            0.275            ×
11   0.213            0.275            √
12   0.170            0.275            √
13   0.326            0.275            ×
14   0.226            0.275            √
15   0.383            0.275            ×
16   0.183            0.275            √
17   0.457            0.275            ×
18   0.370            0.275            ×
19   0.170            0.275            √
20   0.226            0.275            √
21   0.113            0.275            √
22   0.313            0.275            ×
23   0.170            0.275            √
24   0.226            0.275            √
25   0.270            0.275            √
26   0.257            0.275            √
27   0.200            0.275            √
28   0.457            0.275            ×
29   0.213            0.275            √
30   0.270            0.275            √
31   0.226            0.275            √
32   0.257            0.275            √
33   0.326            0.275            ×
34   0.357            0.275            ×
35   0.226            0.275            √
36   0.383            0.275            ×
37   0.313            0.275            ×
38   0.213            0.275            √
39   0.457            0.275            ×
40   0.213            0.275            √

×: reject the hypothesis that the Goel–Okumoto model holds (12 trials in total).
√: accept the hypothesis that the Goel–Okumoto model holds (28 trials in total).

Table A.3
Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(10, 40] obtained from Table 2: Choice 1

i\j    1   2   3   4   5   6   7   8   9  10    ni
1      0   0   0   0   0   0   0   0   0   1     1
2      0   0   0   0   0   0   0   1   1   0     2
3      0   0   0   0   2   1   3   1   2   0     9
4      0   0   1   4   1   3   2   1   0   0    12
5      0   1   1   3   3   1   1   1   0   0    11
6      1   1   1   0   0   2   0   0   0   0     5
nj     1   2   3   7   6   7   6   4   3   1    40

Table A.4
Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(10, 40] obtained from Table 2: Choice 2

i\j    1   2   3   4   5   6   7    ni
1      0   0   0   0   0   1   2     3
2      0   0   2   1   3   1   2     9
3      1   4   1   3   2   1   0    12
4      2   3   3   1   1   1   0    11
5      3   0   0   2   0   0   0     5
nj     6   7   6   7   6   4   4    40

Table A.5
Defined measures (ni,j) for independence hypothesis testing of M(10, 40] and M(40, 80] obtained from Table 2: Choice 1

i\j    1   2   3   4   5   6    ni
1      0   0   0   0   1   0     1
2      0   0   0   0   1   1     2
3      0   0   1   2   0   0     3
4      0   3   2   0   2   0     7
5      0   0   0   4   1   1     6
6      0   3   2   2   0   0     7
7      1   2   2   1   0   0     6
8      1   2   1   0   0   0     4
9      1   1   1   0   0   0     3
10     0   0   1   0   0   0     1
nj     3  11  10   9   5   2    40

Table A.6
Defined measures (ni,j) for independence hypothesis testing of M(10, 40] and M(40, 80] obtained from Table 2: Choice 2

i\j    1   2   3   4   5    ni
1      0   0   1   2   3     6
2      0   3   2   0   2     7
3      0   0   0   4   2     6
4      0   3   2   2   0     7
5      1   2   2   1   0     6
6      1   2   1   0   0     4
7      1   1   2   0   0     4
nj     3  11  10   9   7    40

Table A.7
Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(40, 80] obtained from Table 2: Choice 1

i\j    1   2   3   4   5   6    ni
1      0   0   1   0   0   0     1
2      0   2   0   0   0   0     2
3      1   1   3   4   0   0     9
4      1   4   3   3   1   0    12
5      1   3   1   2   2   2    11
6      0   1   2   0   2   0     5
nj     3  11  10   9   5   2    40

Table A.8
Defined measures (ni,j) for independence hypothesis testing of M(0, 10] and M(40, 80] obtained from Table 2: Choice 2

i\j    1   2   3   4   5    ni
1      0   2   1   0   0     3
2      1   1   3   4   0     9
3      1   4   3   3   1    12
4      1   3   1   2   4    11
5      0   1   2   0   2     5
nj     3  11  10   9   7    40
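For readers who wish to reproduce the independence tests, a minimal sketch (in Python, assuming scipy; the exact grouping procedure and statistic used in our analysis may differ in detail) applied to Table A.4:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of Table A.4: rows are merged levels of M(0, 10],
# columns are merged levels of M(10, 40]; cells count the 40 trials.
table_a4 = np.array([
    [0, 0, 0, 0, 0, 1, 2],
    [0, 0, 2, 1, 3, 1, 2],
    [1, 4, 1, 3, 2, 1, 0],
    [2, 3, 3, 1, 1, 1, 0],
    [3, 0, 0, 2, 0, 0, 0],
])

chi2, p, dof, expected = chi2_contingency(table_a4)
print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p:.4g}")
# A small p-value rejects the hypothesis that the failure counts in the
# two time intervals are independent.
```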

References

[1] G.R. Hudson, Program errors as a birth and death process, System Development Corporation, Report SP-3011, Santa Monica, CA, 1967.
[2] J.D. Musa, A. Iannino, K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
[3] M. Xie, Software Reliability Modeling, World Scientific, 1991.
[4] M.R. Lyu (Ed.), Handbook of Software Reliability Engineering, McGraw-Hill, 1996.
[5] K.Y. Cai, Software Defect and Operational Profile Modeling, Kluwer Academic Publishers, 1998.
[6] H. Pham, Software Reliability, Springer, 2000.
[7] Z. Jelinski, P.B. Moranda, Software reliability research, in: W. Greiberger (Ed.), Statistical Computer Performance Evaluation, Academic Press, 1972, pp. 465–484.
[8] E.C. Nelson, A Statistical Basis for Software Reliability Assessment, TRW-SS-73-03, 1973.
[9] M.L. Shooman, Probabilistic models for software reliability prediction, in: Proc. the Fault-Tolerant Computing Symposium, 1972, pp. 211–215.


[10] B. Littlewood, J. Verrall, A Bayesian reliability growth model for computer software, Applied Statistics 22 (3) (1973) 332–346.
[11] M.H. Halstead, Elements of Software Science, North-Holland, 1977.
[12] J.D. Musa, A theory of software reliability and its application, IEEE Transactions on Software Engineering SE-1 (3) (1975) 312–327.
[13] N.F. Schneidewind, Analysis of error processes in computer software, Sigplan Notes 10 (6) (1975) 337–346.
[14] M. Xie, M. Zhao, The Schneidewind software reliability model revisited, in: Proc. IEEE International Symposium on Software Reliability Engineering, 1992, pp. 184–192.
[15] T. Nara, M. Nakata, A. Ooishi, Software reliability growth analysis – application of NHPP models and its evaluation, in: Proc. IEEE International Symposium on Software Reliability Engineering, 1995, pp. 251–255.
[16] A. Wood, Predicting software reliability, Computer (1996) 69–77.
[17] T. Keller, N.F. Schneidwind, Successful application of software reliability engineering for the NASA Space Shuttle, in: Proc. IEEE International Symposium on Software Reliability Engineering (Case Studies), 1997, pp. 71–82.
[18] K.C. Gross, Software reliability and system availability at Sun, in: Proc. 11th International Symposium on Software Reliability Engineering, 2000.
[19] H. Pham, Software reliability and cost models: perspectives, comparison, and practice, European Journal of Operational Research 149 (2003) 475–489.
[20] H. Okamura, M. Grottke, T. Dohi, K.S. Trivedi, Variational Bayesian approach for interval estimation of NHPP-based software reliability models, in: Proc. 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007, pp. 698–707.
[21] A.L. Goel, K. Okumoto, A time dependent error detection rate for a large scale software system, in: Proc. the 3rd USA–Japan Computer Conference, 1978, pp. 35–40.
[22] K.Y. Cai, Towards a conceptual framework of software run reliability modeling, Information Sciences 126 (2000) 137–163.
[23] F.I. Vokolos, P.G. Frankl, Empirical evaluation of the textual differencing regression testing technique, in: Proc. the International Conference on Software Maintenance, November 1998, pp. 44–53.
[24] G. Rothermel, R.H. Untch, C. Chu, M.J. Harrold, Prioritizing test cases for regression testing, IEEE Transactions on Software Engineering 27 (10) (2001) 929–948.
[25] K.Y. Cai, B. Gu, H. Hu, Y.C. Li, Adaptive software testing with fixed-memory feedback, Journal of Systems and Software 80 (2007) 1328–1348.
[26] D.B. Campbell, C.A. Oprian, On the Kolmogorov–Smirnov test for the Poisson distribution with unknown mean, Biometrics Journal 21 (1) (1979) 17–24.
[27] N. Henze, Empirical-distribution-function goodness-of-fit tests for discrete models, The Canadian Journal of Statistics 24 (1) (1996) 81–93.
[28] J.J. Spinelli, M.A. Stephens, Cramer–von Mises tests of fit for the Poisson distribution, The Canadian Journal of Statistics 25 (2) (1997) 257–268.
[29] A.L. Goel, Software Reliability Modeling and Estimation Techniques, RADC-TR-82-263, 1982.
[30] O. Gaudoin, CPIT goodness-of-fit tests for reliability growth models, in: D. Ionescu, N. Limnios (Eds.), Statistical and Probabilistic Models in Reliability, Birkhauser, 1999, pp. 27–37.
[31] Y. Chen, N.D. Singpurwalla, Unification of software reliability models by self-exciting point processes, Advances in Applied Probability 29 (1997) 337–352.
[32] K.Y. Cai, C.Y. Wen, M.L. Zhang, A critical review on software reliability modeling, Reliability Engineering and System Safety 32 (1991) 357–371.
[33] K.Y. Cai, L. Chen, Analyzing software science data with partial repeatability, Journal of Systems and Software 63 (2002) 173–186.