System 27 (1999) 79–89
Can the C-test be improved with classical item analysis?
A. Jafarpur
Department of Foreign Languages and Linguistics, Shiraz University, PO Box 71345-1354, Shiraz, Iran
Received 10 June 1998; revised 5 November 1998; accepted 6 November 1998
Abstract
The application of the rule of 'two' for constructing C-tests produces two sorts of test items. Many items show acceptable facility and discrimination values, but a sizeable number are either extremely easy or extremely difficult to fill in. To investigate whether this defect can be avoided, a C-test with 5 texts and 126 items was constructed and tried with 146 Iranian English majors. On the basis of an item analysis, a tailored C-test with 100 items was developed and tried with 60 other subjects. The results of the study showed that no gains were made with the classical item analysis. © 1999 Published by Elsevier Science Ltd. All rights reserved.
Keywords: C-Test; C-Principle; Reduced redundancy test
0346-251X/99/$ - see front matter © 1999 Published by Elsevier Science Ltd. All rights reserved. PII: S0346-251X(98)00043-8
1. Introduction
A C-test consists of 4–6 thematically distinct segments of connected discourse in which half of the letters of every second word (rule of 'two') are deleted. The test should contain at least 100 items. The testees receive credit for exact restoration.
Despite the ease with which a C-test is constructed, Klein-Braley (1997) cautions that the test be used only after careful consideration of its suitability for the specific target examinees. She recommends that one should begin with more texts than one intends to include in the C-test. After pretesting, the results should be analyzed before the test is used for the purposes for which it has been prepared. Test analysis should involve calculating facility and discrimination values as well as reliability and validity coefficients. Klein-Braley further states that "If the empirical performance of the test is satisfactory, then it can be used in future for equivalent groups of subjects" (Klein-Braley, 1997, p. 65). However, Klein-Braley (1997) offers no elaboration on what satisfactory performance is and what one should do with
malfunctioning items. She merely insists that the rule of 'two' for test construction works.
In discrete-point tests, each item is locally independent. Accordingly, it is possible to control the characteristics of every item, and subsequently of the total test, for facility and discrimination. This is not possible with C-test items when one follows the rule of 'two' for test construction. A look at various texts shows that some lexical items are very easy to predict, whereas others are extremely hard to guess.
Grotjahn (1987) questions the fixed deletion rate and recommends flexibility in the application of the rule of 'two'. That is to say, Grotjahn suggests abandoning the systematic deletion ratio and using a rational selection of deletions instead. Another way of improving on the C-test would be to select better functioning items after the test has been tried with subjects in its standard format. In addition to these two ways of improving the C-test at the word level, it is also possible to improve the statistical properties of the test at the text level. By selecting bundles of items that exhibit specific qualities, one can improve on the C-test and, at the same time, retain some of the basic assumptions of the procedure as a measure of reduced redundancy.
Since improvement at the word level is more feasible for EFL teachers, particularly novices and those untrained in language testing theory, this study was designed to investigate the effect of selecting C-test items on the basis of the statistical properties of individual items. Assuming that the C-principle is "merely a technique for producing tests like any other technique" (Alderson, 1983, p. 211), this investigation, after Brown (1988), attempts to study the extent to which the C-principle can be improved through classical item analysis.
2. Method
2.1. Instrumentation
Three tests were utilized in this investigation: two C-tests and the English Placement Test as the criterion.
2.2. C-tests
A C-test was constructed according to the instructions put forward by Klein-Braley (1997). Twelve texts were carefully selected that were authentic and self-contained, and that varied in subject matter. In addition, the texts varied in difficulty as judged by the Flesch Reading Ease readability scale (Microsoft Word 1983–94). The texts were of different lengths because of semantic completeness. In each text, the first and last sentences were left intact. Beginning from word two of sentence two, the second half of every other word was deleted. One-letter words were not mutilated. To facilitate test administration, the texts were randomly divided into two C-tests and, then, were randomly given to 100 Iranian English majors and 16 EFL instructors. On the basis of the results, five texts with superior discrimination and facility values were chosen. They contained 60 to 87 words and varied in difficulty
with Flesch Reading Ease readability values of 89, 75, 71, 70, and 56, respectively. The C-test thus prepared comprised 126 mutilations, with the five selections containing 24, 29, 20, 29, and 24 items, respectively (see Appendix A for the test). The results of this test will be presented under natural C-test.
Subsequent to trying the natural C-test with 146 examinees, the results were item analyzed and, irrespective of any other consideration, 100 items with superior discrimination and facility values were selected. Items with a discrimination index of 0.20 and higher and with a facility index of between 0.30 and 0.92 were considered valuable. Although one could set more stringent criteria to select better (but of course fewer) items, it was decided to select the best 100 items in order to fulfil the recommended minimum number of mutilations (Klein-Braley, 1997, p. 63). The distribution of the items across the five texts was 21, 22, 15, 25, and 17, respectively. The results obtained from this test will be reported under tailored C-test.
2.3. The English Placement Test
The English Placement Test (EPT) Form B (Corrigan et al., 1978) was used as the criterion for determining validity coefficients. This test contains 20 listening comprehension items, 30 grammar items, 30 vocabulary items, and 20 reading comprehension items. The manual reports internal consistency (KR-21) as well as parallel-forms reliability estimates in the 0.90s for college-age students in intensive English courses. The total score is a composite of the subtest scores and the test is not course specific. The EPT has also been used experimentally with Iranian English majors and has been found to be quite suitable (see, for example, Jafarpur and Yamini, 1993; Yamini, 1997). Accordingly, the EPT was used in the present study as a criterion to measure overall language proficiency.
3. Subjects
The subjects were 206 Iranian EFL learners. The subjects taking the natural C-test and the criterion were 146 English majors in the English Departments of Shiraz University (n = 59) and Azad Islamic University at Lar (n = 87). Those taking the tailored C-test and the criterion measure were 60 English majors in the In-Service Teacher Education Center in Bander Abbas (n = 13) and in the English Department of Shiraz University (n = 47). The examinees in the two groups were of both sexes and had studied the same subjects with the same instructors. Moreover, they were at different levels of proficiency.
4. Procedures
Each of the two C-tests and the EPT was administered to the subjects in groups over a 2-week period. To remove the effect of practice, the distribution of the two measures was counterbalanced. The natural C-test as well as the criterion measure
were given in the Spring of 1997. The tailored C-test along with the EPT was administered in the Spring of 1998. Except for the listening comprehension subtest, there was no time limit for completing the tests.
5. Analyses
Classical item analysis based on sample separation was carried out on the results of the two C-tests. The top and bottom 27% were considered high and low groups, respectively. Items with a discrimination index of 0.20 and higher and with a facility index of between 0.30 and 0.92 were considered valuable. Reliability coefficients were obtained through the Kuder–Richardson formulae 20 and 21. Validity coefficients (rxy) were calculated through the Pearson product-moment correlation coefficient between the scores of the subjects on the C-test and the criterion measure. The obtained coefficients were corrected for attenuation (rCA). In addition, the differences between the means of the two tests in each group were subjected to the paired t-test, and the differences between groups were subjected to the group t-test.
6. Results and discussion
Table 1 presents descriptive statistics for the scores obtained from the natural and tailored C-tests as well as from the EPT.

Table 1
Descriptive statistics

Statistic       Group One                    Group Two                     Difference
                Natural C-test   EPT         Tailored C-test   EPT         between C-tests
k               126              100         100               100
n               146              146         60                60
Mean (%)        70.0             63.9        70.70             75.73       −0.7 (NS)
SD              15.49            16.36       16.04             12.75
Range           22–118           31–95       32–93             34–97
Skewness        −0.73            −0.22       −0.90             −0.84
Kurtosis        +0.32            −1.0        +0.15             +0.90
K-R21 (K-R20)   0.94 (0.96)      0.92        0.93 (0.95)       0.90
rxy             0.81                         0.77
rCA             0.87                         0.84

The results from the two C-tests appear very similar. The scores from the natural C-test have a percent mean of 70.0 and a standard deviation of 15.49. Those of the tailored version have a mean of 70.7 and a standard deviation of 16.04. The difference between these two means (0.7) is not statistically significant. Moreover, the distribution of the scores from both C-tests is lopsided. Both are negatively skewed (−0.73 and −0.90) and peaked (0.32 and 0.15). Furthermore, the reliability coefficients of the scores from both groups
are about the same (K-R20 estimates of 0.96 and 0.95, respectively). However, the minimum obtained score from the natural C-test (22, or 17% of the total score) is lower than that from the tailored version (32). In addition, the two C-tests differ in terms of their shared variance with the criterion. The natural C-test shows a shared variance of 0.66 (= 0.81²) with the EPT and the tailored version 0.59 (= 0.77²). Accordingly, these results indicate that the tailored C-test does not show any improvement over the natural version.
Appendix B shows detailed item statistics for both C-tests, and Table 2 provides summary statistics from Appendix B. Table 2 indicates that the two C-tests do not differ. Mean item facility for the natural version (with 126 items) is 0.700 and that of the tailored version (with 100 items) is 0.707. The difference between the two means is not statistically significant. Neither is the difference between their mean item discrimination indexes (0.356 and 0.387, respectively). There is a strong relationship between the two C-tests in terms of mean facility for their shared items (0.724), but the correlation coefficient between the two for discrimination (0.512) is not that strong. Taken together, the results obtained from this study indicate that tailoring does not improve the statistical characteristics of the C-test.

Table 2
Summary item statistics for natural and tailored C-tests

Item statistic        Natural mean   Natural mean   Tailored mean   Difference   Correlation,
                      (126 items)    (100 items)                    in means     natural and tailored
Item facility         0.706          0.688          0.707           NS           0.724
Item discrimination   0.356          0.419          0.387           NS           0.512

In designing the study, it was thought appropriate to administer the tailored C-test to subjects different from those taking the natural version in order to avoid contaminating the obtained results with a practice effect. Given the results reported earlier, nonetheless, it is surmised that they might have been confounded by another factor, namely the inequality of the subjects taking the two C-tests. In order to verify this supposition, recourse was made to the scores of those examinees in the two groups who were comparable. An introductory course in language testing is offered for senior English majors at Shiraz University every spring. The students who participate in the course every year are approximately the same in terms of age, sex, motivation, number of credits completed, and level of language ability. Hence, the scores of these subjects on the C-tests and the EPT were carefully scrutinized. There were 58 subjects in Group 1 and 47 in Group 2. Table 3 reports the characteristics of their scores.

Table 3
Descriptive statistics for two comparable groups

Statistic    Group 1                     Group 2
             Natural C-test   EPT        Tailored C-test   EPT
k            100              100        100               100
n            58               58         47                47
Mean         79.02            76.66      74.81             80.06
SD           11.01            9.78       13.11             8.88
Range        39–96            46–95      31–93             61–91
Skewness     +0.61            +0.54      −1.58             −0.73
Kurtosis     −1.95            −1.98      +1.24             −1.15
K-R21        0.87             0.82       0.90              0.81
rxy          0.68                        0.63
rCA          0.81                        0.74
Difference between C-tests: NS. Difference between C-tests and EPT: NS.

As Table 3 indicates, there is no significant difference between the means of the two sub-groups on the EPT (76.66 and 80.06). This supports the contention that the two sub-groups are indeed comparable in terms of language proficiency. The means from their two C-tests (79.02 and 74.81) are not significantly different either. However, the scores from the natural and tailored C-tests are not exactly analogous. The two distributions differ in both symmetry and peakedness. The distribution of the scores from the natural version is positively skewed (+0.61), but that from the tailored version is negatively skewed (−1.58). Furthermore, the distribution of the natural version is flat (−1.95), but that of the tailored version is peaked (+1.24). These results indicate that the natural C-test can adequately discriminate among the upper ability range, but the tailored version cannot. Accordingly, there is no indication that the tailored C-test is any better than the natural version.
All in all, the results of the present study indicate that classical item analysis does not improve the psychometric and statistical characteristics of the C-test under investigation. As such, these results are in contradiction with those of Brown (1988) for the cloze procedure. Since the C-test and the cloze procedure are founded on the same theoretical construct, improved results were also expected with the tailored C-test. Indeed, tailoring interferes with the basic idea of producing tests that provide a representative sample of the written language. The case of the C-test may indeed be very different from that of the cloze procedure and discrete-point items. The stringent application of the rule of 'two' results in a C-test with locally dependent items, as compared with discrete-point items. The items of a C-test are in fact more dependent than those of a cloze test simply because the blanks are closer together in a C-test. Obviously, more definite statements must await further studies with other C-tests, and with other and more subjects. Until then, these results must be regarded as tentative.
7. Summary and conclusion
The application of the rule of 'two' for constructing C-tests results in a test with both items that function well and items that malfunction. In order to investigate whether the characteristics of C-tests can be improved with classical item analysis, this study compared the performance of two groups of subjects on a natural C-test and its tailored version against a single criterion of language proficiency. The results showed that classical item analysis does not improve the statistical characteristics of the C-test.
Acknowledgements
The researcher would like to express his deep appreciation to Dr Mortaza Yamini for administering the tests to his students in Azad Islamic University at Lar. The author is also grateful to an anonymous SYSTEM reviewer for his comments on the monograph.
References
Alderson, C., 1983. The cloze procedure and proficiency in English as a foreign language. In: Oller, J.W. Jr. (Ed.), Issues in Language Testing Research. Newbury House Publishers, Rowley, MA, pp. 205–17.
Brown, J.D., 1988. Tailored cloze: improved with classical item analysis techniques. Language Testing 5, 19–31.
Corrigan, A., Dobson, B., Spaan, M., Tyma, S., 1978. English Placement Test. Testing and Certification Division, University of Michigan, MI.
Grotjahn, R., 1987. How to construct and evaluate a C-test: a discussion of some problems and some statistical analyses. In: Grotjahn, R., Klein-Braley, C., Stevenson, D.K. (Eds.), Taking Their Measure: The Validity and Validation of Language Tests. Brockmeyer, Bochum, Germany, pp. 219–53.
Jafarpur, A., Yamini, M., 1993. Does practice with dictation improve language skills? SYSTEM 21(3), 359–369.
Klein-Braley, C., 1997. C-tests in the context of reduced redundancy testing: an appraisal. Language Testing 14, 47–84.
Microsoft Word, 1983–94. Microsoft Word, Arabic Edition, Version 6.0. Microsoft Corp.
Yamini, M., 1997. A pedagogical approach to the pronunciation of English vowels. Unpublished PhD Dissertation, Allameh Tabataba'i University, Tehran, Iran.
Appendix A. The C-test
This is a test of how well you comprehend written English. You will read five texts. In each, half of the letters (plus one) of some words are missing. First, study each text. Then, write in the missing letters for each word. Each piece of the line stands for one letter. No negative point will be deducted for a wrong answer.
Example: My name is Tom. I'm t_ _ oldest ch_ _ _ in m_ family. I ha_ _ a sister a_ _ two brot_ _ _ _....
Your job is to complete the text as: My name is Tom. I'm the oldest child in my family. I have a sister and two brothers ....
Many men and women have a checkup every year to make sure that they are healthy. The che_ _ _ _ consists o_ a ser_ _ _ of te_ _ _ for t_ _ eyes, he_ _ _, lungs, a_ _ so o_. Many o_ these te_ _ _ are do_ _ by mac_ _ _ _. Nowadays, so_ _ people pre_ _ _ to ha_ _ their chec_ _ _ _ at a cli_ _ _, an o_ _ _ where sev_ _ _ _ doctors prac_ _ _ _. The cli_ _ _ employs a num_ _ _ of nur_ _ _ and techn_ _ _ _ _ _. In the case of a serious illness, the patient enters a hospital.
Different countries and different races have different manners. Before ente_ _ _ _ a ho_ _ _ in so_ _ Asian coun_ _ _ _ _, it i_ good man_ _ _ _ to ta_ _ off yo_ _ shoes. I_ European coun_ _ _ _ _, even tho_ _ _ shoes some_ _ _ _ _ become ve_ _ muddy, th_ _ is n_ _ done. A gu_ _ _ in a Chinese ho_ _ _ never fini_ _ _ _ a dr_ _ _. He lea_ _ _ a lit_ _ _, to sh_ _ that h_ has h_ _ enough. I_ a Malaysian ho_ _ _, too, a gu_ _ _ always lea_ _ _ a lit_ _ _ food. In England, a guest always finishes a drink to show that he has enjoyed it.
Social customs and ways of behaving change. Things wh_ _ _ were consi_ _ _ _ _ impolite ye_ _ _ ago a_ _ now accep_ _ _ _ _. Just a f_ _ years a_ _, it w_ _ thought impo_ _ _ _ behavior f_ _ a m_ _ to sm_ _ _ on t_ _ street. N_ man w_ _ thought o_ himself a_ being a gent_ _ _ _ _ would ma_ _ a fo_ _ of himself by smoking when a lady was in the room.
I once knew a man whose memory was very bad. Richard w_ _ so forg_ _ _ _ _ that h_ sometimes for_ _ _ what h_ was tal_ _ _ _ about i_ the mid_ _ _ of a sent_ _ _ _. His wi_ _ had t_ constantly rem_ _ _ him ab_ _ _ his appoin_ _ _ _ _ _, his cla_ _ _ _, even h_ _ meals! Si_ _ _ Richard w_ _ a prof_ _ _ _ _ at a well-known unive_ _ _ _ _, his forget_ _ _ _ _ _ _ was of_ _ _ an embarr_ _ _ _ _ _ _, it *was_ _ _ that h_ was uninte_ _ _ _ _ _ _, as so_ _ critical peo_ _ _ tended t_ gossip. He was just very, very absent-minded.
The smartest builders in the animal world are beavers. They li_ _ in la_ _ _ and riv_ _ _, where th_ _ _ are tr_ _ _ nearby. Th_ _ cut t_ _ trees do_ _ and e_ _ the ba_ _ and tw_ _ _. Then th_ _ cut t_ _ trees in_ _ pieces a_ _ build da_ _ and ho_ _ _ with th_ _. Beavers bu_ _ _ their ho_ _ _ beside t_ _ water. A beaver's ho_ _ is cal_ _ _ a lo_ _ _. It has a sleeping room over the water and a storeroom under that.
*Spelt as in the original source.
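The deletion rule behind the five texts above can be sketched in code. The following Python sketch is an illustration only and is not the procedure actually used to prepare the test. The function names (`mutilate`, `make_c_test`), the regular expression used to find words, the use of one underscore per deleted letter, and the choice to carry the every-second-word count across sentence boundaries are all assumptions. It further assumes that when a one-letter word falls on a deletion turn, the word is left whole and the deletion passes to the next word.

```python
import re

# Assumed word pattern: letters, optionally joined by internal hyphens or apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def mutilate(word):
    """Keep the first half of a word and blank the rest (the larger half if odd)."""
    keep = len(word) // 2
    return word[:keep] + "_" * (len(word) - keep)


def make_c_test(sentences):
    """Return (test_text, answers) for one passage given as a list of sentences.

    The first and last sentences are left intact, as described in Section 2.2.
    """
    if len(sentences) < 3:
        return " ".join(sentences), []
    answers = []
    delete_next = False  # word one of sentence two is kept; word two is deleted

    def replace(match):
        nonlocal delete_next
        word = match.group(0)
        if delete_next and len(word) > 1:
            delete_next = False
            answers.append(word)
            return mutilate(word)
        # Either this was a "keep" turn, or a one-letter word landed on a
        # deletion turn (assumption: the deletion then passes to the next word).
        delete_next = True
        return word

    middle = [WORD.sub(replace, s) for s in sentences[1:-1]]
    return " ".join([sentences[0], *middle, sentences[-1]]), answers
```

Passing the passage as a list of sentences sidesteps automatic sentence splitting, which would need its own rules for abbreviations and quotations. The `answers` list holds the intact words in order, so exact-restoration scoring reduces to comparing each response with the corresponding entry.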
Appendix B. Item statistics in natural and tailored C-tests

Text    Item   Natural                     Tailored
               Facility   Discrimination   Facility   Discrimination
One     1      0.89       0.15             –          –
        2      0.76       0.40             0.90       0.31
        3      0.59       0.65             0.72       0.63
        4      0.76       0.68             0.68       0.69
        5      0.55       0.45             0.65       0.56
        6      0.44       0.63             0.40       0.25
        7      0.78       0.40             0.92       0.25
        8      0.85       0.40             0.83       0.38
        9      0.95       0.18             –          –
        10     0.82       0.55             0.85       0.50
        11     0.82       0.33             0.72       0.44
        12     0.69       0.35             0.78       0.19
        13     0.80       0.23             0.72       0.25
        14     0.64       0.73             0.73       0.31
        15     0.62       0.60             0.85       0.56
        16     0.60       0.53             0.47       0.63
        17     0.87       0.43             0.92       0.19
        18     0.77       0.40             0.75       −0.06
        19     0.77       0.45             0.82       0.38
        20     0.83       0.35             0.95       0.19
        21     0.83       0.45             0.87       0.31
        22     0.98       0.08             –          –
        23     0.88       0.33             0.87       0.25
        24     0.63       0.68             0.57       0.56
Two     25     0.69       0.65             0.77       0.56
        26     0.73       0.38             0.48       0.38
        27     0.79       0.15             –          –
        28     0.96       0.10             –          –
        29     0.94       0.13             –          –
        30     0.66       0.30             0.10       0.06
        31     0.87       0.23             0.98       0.06
        32     0.90       0.23             0.97       0.13
        33     0.92       0.23             0.92       0.31
        34     0.97       0.13             –          –
        35     0.53       0.65             0.72       0.31
        36     0.52       0.52             0.50       0.50
        37     0.97       0.00             –          –
        38     0.59       0.33             0.55       0.19
        39     0.65       0.68             0.78       0.50
        40     0.84       0.23             0.93       0.13
        41     0.67       0.38             0.47       0.56
        42     0.61       0.73             0.58       0.56
        43     0.71       0.58             0.87       0.25
        44     0.49       0.83             0.63       0.81
        45     0.92       0.20             0.95       0.19
        46     0.88       0.35             0.95       0.19
        47     0.97       0.10             –          –
        48     0.51       0.48             0.72       0.44
        49     0.85       0.38             0.80       0.50
        50     0.72       0.25             0.47       0.39
        51     0.85       0.28             0.95       0.06
        52     0.48       0.80             0.53       0.69
        53     0.93       0.15             –          –
Three   54     0.66       0.35             0.78       0.19
        55     0.42       0.48             0.62       0.44
        56     0.96       0.15             –          –
        57     0.38       0.33             0.63       0.63
        58     0.67       0.48             0.25       0.38
        59     0.89       0.23             0.90       0.31
        60     0.87       0.18             –          –
        61     0.90       0.28             0.93       0.19
        62     0.78       0.33             0.80       0.25
        63     0.92       0.18             –          –
        64     0.89       0.35             0.97       0.13
        65     0.74       0.43             0.83       0.38
        66     0.91       0.23             0.88       0.31
        67     0.96       0.08             –          –
        68     0.22       0.10             –          –
        69     0.68       0.53             0.80       0.81
        70     0.63       0.48             0.50       0.44
        71     0.88       0.23             0.82       0.25
        72     0.66       0.48             0.73       0.56
        73     0.36       0.65             0.43       0.63
Four    74     0.90       0.20             0.93       0.19
        75     0.29       0.58             0.47       0.69
        76     0.99       0.05             –          –
        77     0.62       0.40             0.67       0.50
        78     0.96       0.03             –          –
        79     0.80       0.38             0.67       0.44
        80     0.84       0.35             0.78       0.63
        81     0.80       0.20             0.92       0.25
        82     0.72       0.40             0.88       0.31
        83     0.76       0.33             0.73       0.31
        84     0.98       0.08             –          –
        85     0.57       0.53             0.35       0.69
        86     0.79       0.48             0.92       0.25
        87     0.71       0.73             0.32       0.50
        88     0.60       0.60             0.40       0.81
        89     0.86       0.35             0.82       0.56
        90     0.42       0.38             0.40       0.44
        91     0.88       0.33             0.93       0.19
        92     0.75       0.40             0.97       0.06
        93     0.96       0.10             –          –
        94     0.32       0.68             0.40       0.75
        95     0.52       0.53             0.62       0.56
        96     0.46       0.55             0.50       0.75
        97     0.29       0.43             0.55       0.44
        98     0.90       0.30             0.88       0.38
        99     0.55       0.55             0.68       0.31
        100    0.73       0.23             0.60       0.56
        101    0.79       0.25             0.80       0.31
        102    0.91       0.28             0.77       0.44
Five    103    0.79       0.28             0.98       0.06
        104    0.29       0.25             0.35       0.44
        105    0.90       0.25             0.53       0.56
        106    0.82       0.23             0.55       0.56
        107    0.77       0.28             0.52       0.50
        108    0.92       0.18             –          –
        109    0.84       0.43             0.90       0.31
        110    0.40       0.38             0.53       0.00
        111    0.35       0.33             0.35       0.06
        112    0.16       0.20             0.18       0.25
        113    0.15       0.20             0.23       0.44
        114    0.92       0.18             –          –
        115    0.84       0.40             0.87       0.44
        116    0.75       0.48             0.75       0.56
        117    0.69       0.60             0.73       0.56
        118    0.10       0.10             –          –
        119    0.17       0.08             –          –
        120    0.76       0.35             0.92       0.31
        121    0.84       0.18             0.83       0.50
        122    0.26       0.03             –          –
        123    0.91       0.18             –          –
        124    0.72       0.48             0.82       0.06
        125    0.64       0.55             0.82       0.31
        126    0.00       0.00             –          –
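The item statistics tabulated in this appendix rest on a handful of classical formulas. The Python sketch below shows one way they might be computed from a matrix of 0/1 item scores. It is not the author's actual computation. All function names are hypothetical, and several details are assumptions: facility is taken over the whole sample, the size of the high and low groups is 27% of the sample rounded to the nearest integer, the variance of total scores is the population variance, the selection cut-offs are treated as inclusive, and the correction for attenuation is taken to be the usual rxy divided by the square root of the product of the two reliabilities.

```python
from math import sqrt


def item_analysis(responses, tail=0.27):
    """Return a (facility, discrimination) pair for every item.

    responses: one list of 0/1 item scores per examinee (hypothetical layout).
    """
    n = len(responses)
    k = len(responses[0])
    # Rank examinees by total score; the top and bottom `tail` form the groups.
    ranked = sorted(responses, key=sum, reverse=True)
    g = max(1, round(n * tail))  # group size; the rounding rule is an assumption
    high, low = ranked[:g], ranked[-g:]
    stats = []
    for i in range(k):
        facility = sum(r[i] for r in responses) / n  # assumed: whole sample
        discrimination = (sum(r[i] for r in high) - sum(r[i] for r in low)) / g
        stats.append((facility, discrimination))
    return stats


def is_retained(facility, discrimination):
    """Apply the cut-offs stated in Section 5 (bounds assumed inclusive)."""
    return discrimination >= 0.20 and 0.30 <= facility <= 0.92


def _mean_and_variance(totals):
    n = len(totals)
    mean = sum(totals) / n
    return mean, sum((t - mean) ** 2 for t in totals) / n  # population variance


def kr20(responses):
    """Kuder-Richardson formula 20 from item-level 0/1 scores."""
    n, k = len(responses), len(responses[0])
    _, var = _mean_and_variance([sum(r) for r in responses])
    pq = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n
        pq += p * (1 - p)
    return k / (k - 1) * (1 - pq / var)


def kr21(responses):
    """Kuder-Richardson formula 21, which needs only k, the mean and the variance."""
    k = len(responses[0])
    mean, var = _mean_and_variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - mean * (k - mean) / (k * var))


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Validity coefficient corrected for the unreliability of both measures."""
    return r_xy / sqrt(r_xx * r_yy)
```

As a check on the last function, the Table 1 values for the natural C-test (rxy = 0.81, with K-R21 estimates of 0.94 for the C-test and 0.92 for the EPT) give `correct_for_attenuation(0.81, 0.94, 0.92)` of about 0.87, which agrees with the rCA reported there.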