Information Processing & Management, Vol. 20, No. 1-2, pp. 199-208, 1984
Printed in Great Britain.
0306-4573/84 $3.00 + .00
Pergamon Press Ltd.
A PRELIMINARY INVESTIGATION OF THE USE OF THE CLOZE PROCEDURE AS A MEASURE OF PROGRAM UNDERSTANDING

CURTIS COOK, WILLIAM BREGAR and DAVID FOOTE
Computer Science Department, Oregon State University, Corvallis, OR 97331, U.S.A.

Abstract. Program understanding is an integral part of the testing and maintenance phases of the software life cycle. There have been numerous investigations of the influence of various aspects of a program and the programming process on program comprehension. However, the many different measures of understanding used in these studies make any comparison or analysis difficult. Some of the different measures include time to find a bug, a comprehension quiz, ability to reproduce a functionally equivalent program without notes, time to perform a modification and Halstead's E. All of these have limitations such as inability to measure both low- and high-level understanding, difficulty of administering and objectively grading, or impracticality for large or non-trivial programs. Of the measures, the comprehension quiz is probably the most commonly used and accepted measure of program understanding. This paper reports on a controlled experiment that compared the "cloze" procedure and the comprehension quiz as measures of program understanding. In a cloze procedure, the subjects are presented a program listing with some of the program tokens (operands, operators, reserved words, single parentheses or brackets, etc.) replaced with blanks and are required to fill in the blanks. Our experiment tested students in sophomore, junior and senior level computer science courses. These were assumed to correspond to three levels of programming experience. Each subject was given one of two versions of a sorting program and either a cloze version of the program or a comprehension quiz over it. Results for the cloze procedure closely approximated those of the comprehension quizzes for both programs and each level of experience. These results and the ease of administration and grading make the cloze procedure a potentially attractive means for measuring program understanding.

INTRODUCTION
Program understanding is a crucial part of the testing and maintenance phases of the software life cycle. Numerous studies have attempted to relate the contribution of various aspects of programs and the programming process to a programmer's ability to comprehend a program. A partial list of these includes modularization [26], comments [26, 16], indenting [25, 22, 3], structured coding [10, 25], mnemonic variable names [4, 25], program length [4], flowcharts [23], documentation [17], control flow [25, 21], and data flow [25]. Almost all of these investigations involved controlled experimentation. However, the actual measurement of the degree of program understanding was a serious problem. A wide variety of measures were used in these studies: "question-answer" (comprehension quiz), time to perform a modification, subject's ability to reproduce a functionally equivalent program without notes, etc. The many different measures make any comparison or analysis difficult. Fortunately, the comprehension test is probably the most widely used and commonly accepted of these measures.

In this paper we investigate the use of the "cloze" procedure as a measure of program understanding. The word "cloze" refers to the human tendency to complete a familiar but not quite finished pattern. In a cloze procedure the subjects are presented a program listing with some of the tokens (operators, operands, reserved words, single parentheses or brackets) replaced with blanks and are required to fill in the blanks. The cloze procedure was originally used to measure comprehension in English text readability studies [24, 8]. SHNEIDERMAN [20] mentioned the cloze as a measure of program composition tasks as well as comprehension, but cited no studies. NORCIO [14, 15] has also proposed using the cloze procedure to measure program understanding, but he used a different approach. In his
approach entire statements were replaced by blank lines and the subject was required to supply the correct statements.

It is interesting to note that researchers in the English readability area also admit difficulty in measuring understanding and must resort to comprehension quizzes or cloze procedures ([9], p. 131):

    Comprehension tends to be something of a rubber criterion, since there is no agreed-upon definition. Research workers have most often used either multiple choice or cloze score measures.

PROGRAM COMPREHENSION MEASURES
Several recent papers [2, 12] have chided software psychology studies for their experimental methodology. This is probably a reflection of the infancy of the area. We were amazed by both the variety of measures and the lack of consensus among investigators as we sought to compare the cloze procedure with other measures of program understanding. This makes comparison and interpretation of experimental results difficult to say the least. In the following we list most of the measures that have been used and describe their limitations.

Program understanding measures:
(1) Comprehension quiz or "question-answer" [17, 21, 25, 26].
(2) Time to perform a modification [4].
(3) Time to locate a bug [3], [16].
(4) Halstead's E (programming effort) [6].
(5) Reproduction of functionally equivalent program without notes [4, 18, 19].
(6) Speed of hand execution of program [25].
(7) Subjective responses [25].
Obviously, program understanding must include both low-level (each statement of the program) and high-level (module or overall program) comprehension. It is not clear that time to perform a modification, time to locate a bug, and speed of hand execution require high-level program comprehension. Also, in time measures it is difficult to exclude time periods spent on tasks not related to the hypothesis being tested. Studies [3, 4] have shown varying correlation between Halstead's E measure and understanding. In addition, these studies admit the influence of other factors not assessed by either measure. Subjective responses are highly suspect for a variety of reasons.

Originally SHNEIDERMAN [19] required a subject to reconstruct a program, verbatim, after studying it for a period of time. This was later relaxed to reproducing a functionally equivalent program. He assumed a strong relation between the ability to memorize program statements and the ability to understand their intended function. His measure is limited to toy or small programs.

Of these measures of program understanding, the comprehension quiz seems to be the most generally used and accepted. A well constructed comprehension quiz can test both low- and high-level program understanding. Also it is not limited by program length. But comprehension quizzes also have limitations. Constructing a set of questions to test both low- and high-level understanding is not easy. For example, one must guard against one question providing an answer to another question. It is unclear what number, type (completely open-ended or multiple choice) and categories of questions are necessary to measure understanding. However, for other than multiple choice questions, constructing questions is easy compared to grading them. The subjectivity of the grader and the scoring for each question are difficult problems. Also, the longer the program the more difficult it is to construct the quiz.
The cloze procedure has all of the advantages and very few of the disadvantages of the comprehension quiz. It is trivial to construct and grade; in fact, both the construction and grading can be done by a computer. Cloze procedures are not limited by program length. A convenient variation of the cloze procedure for large programs is to apply it selectively or randomly to certain subprograms or parts of subprograms.
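As a minimal sketch of how the construction step could be automated, the Python fragment below blanks every n-th token of a source listing and records the deleted tokens as an answer key. The token pattern (which ignores colons, semicolons and commas), the every-n-th deletion rule, the blank width and the name `make_cloze` are all illustrative assumptions, not the procedure used by the authors.

```python
import re

# Assumed, simplified token pattern for Pascal-like source: identifiers and
# reserved words, integer constants, multi-character operators, then
# single-character operators, parentheses and brackets.
# Colons, semicolons and commas are deliberately not matched.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|:=|<=|>=|<>|[-+*/=<>()\[\]]")

BLANK = "________"  # every blank has the same width, so its size gives no clue


def make_cloze(source, every=5, offset=0):
    """Replace every `every`-th token with a blank, starting at token `offset`.

    Returns (cloze_text, answers); `answers` lists the deleted tokens in
    order of appearance and serves as the grading key.
    """
    answers = []
    pieces = []
    last = 0
    for index, match in enumerate(TOKEN_RE.finditer(source)):
        pieces.append(source[last:match.start()])  # untouched text between tokens
        if index % every == offset:
            answers.append(match.group())
            pieces.append(BLANK)
        else:
            pieces.append(match.group())
        last = match.end()
    pieces.append(source[last:])
    return "".join(pieces), answers


if __name__ == "__main__":
    text, key = make_cloze("D := (D - 1) DIV 2 ;", every=5, offset=0)
    print(text)
    print(key)
```

Varying `offset` from 0 to `every - 1` yields `every` different cloze versions of the same listing.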
In the remaining sections we describe the experiment we conducted to compare the cloze procedure and comprehension quiz. The results suggest that the cloze procedure holds promise as a measure of program understanding.

EXPERIMENT: COMPREHENSION QUIZ VS CLOZE PROCEDURE
The participants in the experiment were students in a sophomore Pascal programming course (CS 212), a junior data structures course (CS 318), and a senior operating systems course (CS 415). CS 212 is a prerequisite for CS 318 and CS 318 is a prerequisite for CS 415. We assumed the courses correspond to three levels of programming experience.

MATERIALS
There were two forms of the experimental materials.
(1) A Pascal program listing and a comprehension quiz over the program.
(2) A description of the cloze procedure (definition of tokens, short example of a cloze version of a program and its answer) and a cloze version of a Pascal program.

We used two versions of a Shell sort for our Pascal program. One version was from MILLER ([11], pp. 178-79) and the other was a Pascal version of a Fortran program from BERZTISS ([1], p. 344). Several variable names in both programs were changed so that they did not provide too many clues for the comprehension quiz. Also no comments were included.

Both comprehension quizzes were carefully constructed to be as similar as possible. Both had the same number (9) and type of questions. Furthermore, for each question on one quiz there was a corresponding question on the other quiz.

The same two programs used for the comprehension part were used for the cloze procedure. A token was defined to be a variable identifier name, a constant, an operator (arithmetic or logical), a single parenthesis, or a single bracket. A single colon (not part of an assignment operator), a semicolon and a comma were not counted as tokens. The short example program illustrating the cloze procedure was Euclid's Algorithm. The tokens in it that were replaced by blanks were chosen to illustrate all of the possible types of tokens. It was followed by the same program with the blanks filled in.

For the cloze versions of the Pascal Shell sort programs we replaced every fifth token with a blank. All blanks were the same size. Thus, depending on the position of the first token replaced by a blank, there were five cloze versions of each program. Samples of the materials for each program version are given in the Appendix.

ADMINISTRATION
The experiment was conducted during class periods on the same day. All students had as much time as they needed. Each student was randomly assigned a cloze procedure or comprehension quiz for one of the two versions of the program. Generally, students with the cloze versions finished earlier. Students in the earlier classes were urged not to discuss the program or testing process with students in later classes.

RESULTS
For the cloze procedure each filled in blank was either right or wrong. One point was awarded for each correct answer. For the comprehension quiz, each question was worth 10 points (90 points total) and all questions were graded by the same person. For questions with a set of answers, partial credit was awarded for correct parts. We discovered that very few errors were made in filling in the blanks when the tokens were reserved words, parentheses or brackets. We called these “giveaways”. See Table 1.
Table 1. Giveaways and errors on giveaways

              % giveaways    % errors on giveaways
Program 1        55.3                5.3
Program 2        42.6                5.0
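The cloze scoring rule described above (each blank is either right or wrong, one point per correct answer) can be sketched in Python as follows. The case-insensitive exact match, the contents of the `GIVEAWAYS` set, the option to skip giveaway blanks, and the rescaling to a 90 point total (the maximum of the comprehension quiz) are assumptions made for illustration; the name `score_cloze` is hypothetical.

```python
# Assumed set of "giveaway" tokens: reserved words, parentheses and brackets.
GIVEAWAYS = {"BEGIN", "END", "WHILE", "DO", "FOR", "TO", "IF", "THEN",
             "REPEAT", "UNTIL", "VAR", "PROCEDURE",
             "(", ")", "[", "]"}


def score_cloze(answers, responses, skip_giveaways=False, scale=90.0):
    """Score one cloze sheet: one point per exactly matching blank.

    `answers` is the key, `responses` the subject's entries, blank by blank.
    The raw fraction correct is rescaled to `scale` points.
    """
    if len(answers) != len(responses):
        raise ValueError("one response is required per blank")
    # Optionally drop blanks whose correct answer is a giveaway token.
    pairs = [(a, r) for a, r in zip(answers, responses)
             if not (skip_giveaways and a.upper() in GIVEAWAYS)]
    if not pairs:
        return 0.0
    correct = sum(a.strip().upper() == r.strip().upper() for a, r in pairs)
    return scale * correct / len(pairs)


if __name__ == "__main__":
    key = ["D", "(", "1", "WHILE"]
    sheet = ["D", "(", "2", "while"]
    print(score_cloze(key, sheet))                       # all blanks counted
    print(score_cloze(key, sheet, skip_giveaways=True))  # giveaways ignored
```

Because each blank has a single right answer, no grader judgment is involved once the key exists.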
Table 2. Means and standard deviations for each class, program version and method

                                      CS 212    CS 318    CS 415
Program 1
  Comprehension quiz    Mean           32.0      48.0      57.8
                        Std dev        17.8      17.6      13.3
                        N              36        18        16
  Cloze score           Mean           53.0      67.5      71.4
                        Std dev        17.5      15.3       9.4
                        N              37        18        15
Program 2
  Comprehension quiz    Mean           46.7      61.1      73.5
                        Std dev        19.8      14.9      12.6
                        N              37        16        17
  Cloze score           Mean           64.8      76.8      79.1
                        Std dev        17.6       9.8      10.6
                        N              37        17        15
Hence for the analysis of the results we only counted the non-giveaway tokens, e.g. the usual operators and operands. Table 2 presents the means and standard deviations for both cloze and comprehension quiz scores for each class and each version of the program. Note that we converted the cloze scores to a 90 point scale in order to be comparable with the comprehension scores. Figures 1 and 2 are graphs of the means for the cloze and comprehension quiz scores for each program. Both graphs show a nearly constant difference between the means for each class for each program.

Fig. 1. Graph of class (CS 212, CS 318, CS 415) vs scores for Program 1.

Fig. 2. Graph of class (CS 212, CS 318, CS 415) vs scores for Program 2.

Several analysis of variance tests were performed using SPSS [13]. A two-way analysis of variance comparing each of the three main effects (class, test method, and program version) indicated that the joint effect and each main effect were all significant at the 0.001 level, and the two-way interactions were not significant. Interestingly, in the latter case the significance of the class-program interaction was 0.838, which indicates practically no interaction.

We converted the cloze scores to a 90 point scale for relative comparison with the quiz scores. The graphs in Figs. 1 and 2 show that the scores for each method have the same relative behavior. It would be unreasonable to expect the scores to indicate more than this. To actually compare these scores we need to convert them to a common basis. To do so we applied a Z-score transformation ([13], p. 187) to the comprehension quiz scores and to the cloze scores. A Z-score transformation converts the scores to have a mean of 0 and a standard deviation of 1. Then we applied a two-way analysis of variance to the transformed scores. The results indicated significance at the 0.001 level for the joint effect and for the class and program main effects, but no significance for the method main effect. In fact, the significance value for the method main effect was 0.862. None of the two-way interactions was significant either. In particular, the significance of the class-program version interaction was 0.883, which is a very strong indication of the lack of interaction between them.

These results strongly suggest that for each program version and each class the cloze scores give a close relative approximation of the comprehension quiz scores. Thus the cloze appears to hold promise as a measure of program understanding.

CONCLUSIONS
Our experiment represents an initial investigation of the use of the cloze procedure as a measure of program understanding. We compared the cloze procedure with the comprehension quiz, the most commonly used and accepted measure of program understanding. Our comprehension quiz was carefully constructed to test both low- and high-level program understanding. Our experimental results showed that the cloze scores were a close relative approximation of the comprehension quiz scores. Thus the results have been encouraging, but further investigation is needed. If our conclusion is substantiated by further experimentation, then the cloze procedure should have a considerable impact on future program understanding investigations. It provides a much needed, simple, and objective standard for comparison. Since it can be selectively applied to random routines in a large program, it will provide a means of studying more real world programs.
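For readers who wish to repeat the comparison, the Z-score transformation used to put the two kinds of scores on a common basis can be sketched in Python as below. The use of the population standard deviation, the function name `z_scores` and the sample values are assumptions for illustration only; they are not taken from the SPSS runs reported here.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Transform scores to have mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used as the divisor.
    """
    centre = mean(scores)
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("all scores are equal; Z-scores are undefined")
    return [(score - centre) / spread for score in scores]


if __name__ == "__main__":
    # Hypothetical scores, transformed separately for each method.
    quiz = [30.0, 50.0, 70.0]
    cloze = [55.0, 65.0, 75.0]
    print(z_scores(quiz))
    print(z_scores(cloze))
```

After the transformation both sets of scores share the same location and spread, so any remaining difference reflects relative standing rather than the scale of the test.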
REFERENCES

[1] A. T. BERZTISS, Data Structures Theory and Practice. Academic Press, New York (1971).
[2] R. E. BROOKS, Studying programmer behavior experimentally: the problem of proper methodology. Commun. ACM 1980, 23, 207-213.
[3] B. CURTIS, S. B. SHEPPARD and P. MILLIMAN, Third time charm: stronger replication of the ability of software complexity metrics to predict programmer performance. Proc. 4th Int. Conf. on Software Engineering, 356-360. Munich, Germany, September (1979).
[4] B. CURTIS, S. B. SHEPPARD, P. MILLIMAN, M. A. BORST and T. LOVE, Measuring the psychological complexity of software maintenance tasks with Halstead and McCabe metrics. IEEE Trans. Software Engng 1979, SE-5, 96-104.
[5] D. W. EMBLEY, Empirical and formal language decision applied to a unified control construct for interactive computing. Int. J. Man-Machine Studies 1978, 10, 197-216.
[6] R. GORDON, Measuring improvements in program clarity. IEEE Trans. Software Engng 1978, 10, 197-216.
[7] M. HALSTEAD, Elements of Software Science. Operating and Programming Systems Series, Elsevier Computer Science Library, New York (1977).
[8] G. R. KLARE, Assessing readability. Reading Res. Quart. 1974-75, 10, 63-102.
[9] G. R. KLARE, A second look at the validity of readability formulas. J. Reading Behavior 1976, 8, 129-152.
[10] T. LOVE, An experimental investigation of the effects of program structure on program understanding. ACM SIGPLAN Notices 1977, 10, 105-113.
[11] A. R. MILLER, Pascal Programs for Scientists and Engineers. Sybex, Berkeley, California (1981).
[12] T. MOHER and G. M. SCHNEIDER, Methods for controlled experimentation in software engineering. Proc. Fifth Int. Conf. on Software Engineering, 224-233. San Diego, California, March (1981).
[13] N. H. NIE, G. H. HULL, J. G. JENKINS, K. STEINBRENNER and D. H. BEST, SPSS: Statistical Package for the Social Sciences, 2nd Edn. McGraw-Hill, New York (1975).
[14] A. F. NORCIO, The cloze procedure: A methodology for analyzing computer program comprehension. Paper presented at the Ann. ACM Computer Sci. Conf., Dayton (1979).
[15] A. F. NORCIO, Factors affecting the comprehension of computer programs. Paper presented at the National Computer Conf., New York (1979).
[16] S. B. SHEPPARD, M. A. BORST and B. CURTIS, Predicting programmer ability to understand and modify software. Proceedings of Symposium on Human Factors and Computer Science, Washington, D.C. (June 1978), 115-135.
[17] S. B. SHEPPARD, E. KRUESI and B. CURTIS, The effects of symbology and spatial arrangement on the comprehension of software specifications, 207-214. Proc. 5th Int. Conf. on Software Engineering, San Diego, California, March (1981).
[18] B. SHNEIDERMAN, Exploratory experiments in program behavior. Int. J. Comput. Inform. Sci. 1976, 5, 123-143.
[19] B. SHNEIDERMAN, Measuring computer program quality and comprehension. Int. J. Man-Machine Studies 1977, 9, 465-478.
[20] B. SHNEIDERMAN, Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers Inc., Cambridge, Massachusetts (1980).
[21] B. SHNEIDERMAN, Control flow and data structures documentation: two experiments. Commun. ACM 1982, 25, 56-63.
[22] B. SHNEIDERMAN and D. MCKAY, Experimental investigation of computer program debugging and modification. Proc. Sixth Int. Cong. of the International Ergonomics Association, July (1976).
[23] B. SHNEIDERMAN, R. MAYER, D. MCKAY and P. HELLER, Experimental investigation of the utility of detailed flowcharts in programming. Commun. ACM 1977, 20, 373-381.
[24] W. L. TAYLOR, Cloze procedure: a new tool for measuring readability. Journalism Quart. 1953, 30, 415-433.
[25] L. WEISSMAN, Psychological complexity of computer programs: an experimental methodology. ACM SIGPLAN Notices 1977, 10, 105-113.
[26] S. N. WOODFIELD and H. E. DUNSMORE, The effect of modularization and comments on program comprehension. Proc. 5th Int. Conf. on Software Engineering, pp. 215-223. San Diego, California (1981).

APPENDIX

PROCEDURE
BERZTISS(VAR A : ARY ; N : INTEGER ) ;
VAR D, I, J, L, TEMP : INTEGER ;
BEGIN
    D := 1 ;
    WHILE D <= N DO
        D := D * 2 ;
    D := ( D - 1 ) DIV 2 ;
    WHILE D <> 0 DO
    BEGIN
        FOR I := 1 TO N - D DO
        BEGIN
            J := I ;
            L := J + D ;
            WHILE A[L] < A[J] DO
            BEGIN
                TEMP := A[J] ;
                A[J] := A[L] ;
                A[L] := TEMP ;
                IF J - D > 0 THEN
                BEGIN
                    L := J ;
                    J := J - D ;
                END ;
            END ;
        END ;
        D := ( D - 1 ) DIV 2 ;
    END ;
END ;

Program 1 (Berztiss [1], p. 344).

Questions for the Berztiss Procedure

1) How many times will the statement "D := (D-1) DIV 2" be executed if N = 10 ?
2) Describe in a single sentence what the variable TEMP is used for.
3) For N = 16, what will be the values of N - D ?
   a) 15,7,3,1,0
   b) 1,8,12,14,16
   c) 1,9,13,15
   d) 1,2,3,...,16
   e) None of the above
4) Let N = 6 and
   A[1] = 5   A[2] = 3   A[3] = 8   A[4] = 6   A[5] = 4   A[6] = 9
   What will be the values of A[5] and A[6] when D first becomes 1 ?
   A[5] = ____   A[6] = ____
5) In a single sentence describe how the results of the procedure BERZTISS would change if "WHILE A[L] < A[J] ..." was changed to "WHILE A[L] > A[J] ...".
6) Given that N = 7, D = 3, and
   A[1] = ____   A[2] = 7   A[3] = ____   A[4] = 4   A[5] = ____   A[6] = 5   A[7] = ____
   Give a set of values for A[1], A[3], A[5], and A[7] so that the body of the "WHILE A[L] < A[J] DO" loop will never be executed.
7) Give one word or a single sentence that best describes what the procedure BERZTISS does.
8) If within the FOR loop "N - D" was changed to "N - 1", "L := J + D" was changed to "L := J + 1", and the two occurrences of "J - D" were changed to "J - 1", would the results of the procedure BERZTISS be changed ? ( Yes or No )
   Explain why or why not in two sentences or less.
9) If the two occurrences of "D := (D-1) DIV 2" were changed to "D := (D-1) DIV 3", would the results of the procedure BERZTISS be changed if
   a) N was 11 ? ( Yes or No )
   b) N was 22 ? ( Yes or No )

Comprehension Quiz for Program 1.

PROCEDURE GROGANO( VAR A : ARY ; N : INTEGER ) ;
VAR DONE : BOOLEAN ;
    JUMP, I, J : INTEGER ;
    PROCEDURE MILLER( VAR P, Q : REAL ) ;
    VAR HOLD : REAL ;
    BEGIN
        HOLD := P ;
        P := Q ;
        Q := HOLD ;
    END ;
BEGIN
    JUMP := N ;
    WHILE JUMP > 1 DO
    BEGIN
        JUMP := JUMP DIV 2 ;
        REPEAT
            DONE := TRUE ;
            FOR J := 1 TO N - JUMP DO
            BEGIN
                I := J + JUMP ;
                IF A[J] > A[I] THEN
                BEGIN
                    MILLER( A[J], A[I] ) ;
                    DONE := FALSE ;
                END ;
            END ;
        UNTIL DONE ;
    END ;
END ;

Questions for the GROGANO procedure

1) How many times will the statement "JUMP := JUMP DIV 2" be executed if N = 10 ?
2) Give one sentence that best describes what the procedure MILLER does.
3) For N = 16, what will be the values of N - JUMP ?
   a) 16,8,4,2,1
   b) 8,4,2,1
   c) 8,12,14,15
   d) 8,9,10,...,15,16
   e) None of the above
4) Let N = 6 and
   A[1] = 5   A[2] = 3   A[3] = 8   A[4] = 6   A[5] = 4   A[6] = 9
   What will be the values of A[5] and A[6] when JUMP first becomes 1 ?
   A[5] = ____   A[6] = ____
5) In a single sentence describe how the results of the procedure GROGANO would change if the statement "IF A[J] > A[I] ..." was changed to "IF A[J] < A[I] ...".
6) Given that N = 7, JUMP = 3, and
   A[1] = ____   A[2] = 7   A[3] = ____   A[4] = 4   A[5] = ____   A[6] = 5   A[7] = ____
   Give a set of values for A[1], A[3], A[5], and A[7] so that DONE will always be TRUE in the REPEAT loop.
7) Give one word or a single sentence that best describes what the procedure GROGANO does.
8) If within the REPEAT-UNTIL loop "N - JUMP" was changed to "N - 1", and "I := J + JUMP" was changed to "I := J + 1", would the results of the procedure GROGANO be changed ? ( Yes or No )
   Explain why or why not in two sentences or less.
9) If "JUMP := JUMP DIV 2" was changed to "JUMP := JUMP DIV 3", would the results of the procedure GROGANO be changed if
   a) N was 11 ? ( Yes or No )
   b) N was 22 ? ( Yes or No )

Comprehension Quiz for Program 2.

Instructions

For this experiment there is a practice problem followed by the actual problem. The practice problem is to acquaint you with the testing procedure being used in this experiment. After completing the practice problem do the actual problem.

Practice Problem

A token is a variable identifier, constant, a single parenthesis or bracket, operator (arithmetic or logical), or a reserved word. Certain of the tokens in the following problem have been deleted and replaced with an underline. Your task is to fill in the deleted tokens.

________ GCD( INPUT, OUTPUT ________ ;
VAR R, M, N : ________ ;
BEGIN
    READ(M, ________ ) ;
    REPEAT
        R := ________ MOD N ;
        M := ________ ;
        N := R ;
    UNTIL ________ = 0 ;
    WRITELN( ________ ) ;
END.

So you can compare your answers, the program with the correct answers is given on the next page.

Directions for cloze procedure.

PROGRAM GCD( INPUT, OUTPUT ) ;
VAR R, M, N : INTEGER ;
BEGIN
    READ(M, N ) ;
    REPEAT
        R := M MOD N ;
        M := N ;
        N := R ;
    UNTIL R = 0 ;
    WRITELN(M) ;
END.

On the following page is the actual problem. You are to fill in the missing tokens the best that you can. You will have a maximum of 25 minutes to complete this part of the experiment.

Directions for cloze procedure (continued).

________ BERZTISS(VAR A : ________ ; N : INTEGER ) ;
VAR ________, I, J, L, TEMP : ________ ;
BEGIN
    D := 1 ;
    ________ D <= N DO
        ________ := D * 2 ;
    ________ := ( D - ________ ) DIV 2 ;
    WHILE ________ <> 0 DO
    BEGIN
        ________ I := 1 TO ________ - D DO
        BEGIN
            ________ := I ;
            L := ________ + D ;
            WHILE A ________ L ] < A ________ J ] DO
            BEGIN
                ________ := A [ J ________ ;
                A [ J ] ________ A [ L ] ;
                ________ [ L ] := ________ ;
                IF J - D ________ 0 THEN
                BEGIN
                    L ________ J ;
                    J := J ________ D ;
                END ;
            END ;
        END ;
        ________ := ( D - ________ ) DIV 2 ;
    END ;
________ ;

One of the five cloze versions of Program 1.

________ GROGANO(VAR A : ________ ; N : INTEGER ) ;
VAR ________ : BOOLEAN ;
    JUMP, I, J : ________ ;
    PROCEDURE MILLER( VAR ________ , Q : REAL ) ;
    VAR ________ : REAL ;
    BEGIN
        HOLD := ________ ;
        P := Q ;
        Q ________ HOLD ;
    END ;
BEGIN
    JUMP ________ N ;
    WHILE JUMP > ________ DO
    BEGIN
        JUMP := ________ DIV 2 ;
        REPEAT
            DONE ________ TRUE ;
            FOR J := ________ TO N - JUMP ________
            BEGIN
                I := J ________ JUMP ;
                IF A [ ________ ] > A [ ________ ] THEN
                BEGIN
                    MILLER ________ A [ J ], ________ [ I ] ) ;
                    ________ := FALSE ;
                END ;
            END ;
        ________ DONE ;
    END ;
END ;

One of the five cloze versions of Program 2.