Assessing Writing 5(1), 1998, pp. 89-109. ISSN 1075-2935. © Ablex Publishing Corp. All rights of reproduction in any form reserved.

Multiple Inquiry in the Validation of Writing Tests

RICHARD H. HASWELL
Texas A & M University-Corpus Christi

In social-science research and in program evaluation, multimethod designs are well known, but few have been applied in the validation of writing tests. Yet there are good reasons to evaluate a testing program along many measures: to be of use to a variety of stakeholders, to be sensitive to the presence of conflicting perspectives, to seek convergent findings among studies with different biases, and to probe a social context that is complex, fluid, and provisional. At Washington State University, validation of the writing-placement examination system followed multiple-inquiry lines, and therefore stands as a test case. Analysis of the findings from WSU shows the value of multiple studies to different groups: to the students, both native and nonnative speakers of English, both transfer and non-transfer; to the writing-course teachers; to the other faculty on campus; to the cross-campus corps of raters; to the chairs and heads of programs; and to the board of regents. In multiple inquiry, a key role falls to the writing-program administrator, who probably should be included on any validation team. Although multiple inquiry proved sometimes problematic at Washington State, it also proved unusually productive of recommendations for improvement.

Direct all correspondence to: Richard H. Haswell, Texas A & M University-Corpus Christi, Corpus Christi, TX 78412.

About twenty-five years ago, the chemistry department at my university asked me to make a reading test to serve as part of their examination placing students into first-year chemistry classes. I thought this an interesting and even enlightened goal, and worked up some multiple-choice items based on two passages from a chemistry text. The department used my mini-test for about ten years and then replaced their entire placement instrument with a nationally distributed one. While my little test lasted, however, it seemed to work well enough. At least the department always answered my inquiries with the laconic reply that they were happy with the way their students were being placed. My reading test could serve as an illustration of the truism that most home-grown academic placement examinations, after they are put in place, remain largely unexamined. But that is not the point I wish to make, which heads in the opposite direction. It begins with the observation that when local writing-program validation studies have been run, usually they are self-limiting.

This point was well demonstrated once during the history of my reading test. I remember the incident with a certain clarity. Before the chemistry placement exam could be installed, it had to pass muster with a university committee overseeing testing standards. We met deep inside the administration building in a colorless, windowless room, narrow as a conference table. Suddenly one of the committee members turned to me and asked if I had performed any item analysis on my pilot of the reading portion. The moment stands out in my past because I had an answer ready. Yes, I said, Kuder-Richardson's Formula 20, which found an internal-consistency coefficient of .83. I was rewarded with a collective ah and nodding of heads, and the inquiry moved on.

Not that the committee was unwise to have accepted Kuder-Richardson 20 (it may in fact have been appropriate) nor that they were unwise to have been satisfied with a coefficient of .83 (it may in fact have been sufficient). The point is that it was unwise of the committee, even astonishing, to have been satisfied with only one method of validation of my reading test. Kuder-Richardson 20 estimates the efficiency of test items in achieving discrimination among test-takers. But there are plenty of other issues at stake that the calculation has absolutely no bearing on, such as whether there is a functional connection between reading chemistry and learning chemistry in first-year classes, or whether the difficulty of questions would give a satisfactory sense of challenge to the best students, or whether students would learn anything about their own reading abilities from taking the test, or whether it would place too many examinees into Chemistry 099 who could pass Chemistry 101. I name four issues out of many.

In this article I will argue first that it is unreasonable and unprofitable to validate a formal writing test along one measure, or even along a few measures, and second that it is easy and profitable to validate a writing test, even a large-scale formal one, along many. I only sketch out my first argument because although "multimethod validation" has not been much discussed among writing experts, it has long been explored and critiqued by social scientists involved in program evaluation and experimental design. I expound on my second point in detail because multimethod approaches have seen few applications in the small province of writing-test follow-up studies. Witte and Faigley (1983) have called for a "confluence" of qualitative and quantitative methods to evaluate the complex interactions of whole writing programs. More recently Scharton (1996, p. 68) has called for triangulation designs in the assessing of student writing and Hughes (1996, p. 166) for "multiple methods" in the evaluation of WAC programs, and occasionally past writing-placement studies have used multiple measures without multimethod theory (e.g., McCauley, 1987; Cohen, 1989). But actual multimethodical studies of writing tests are hard to come by. At present, our discipline could use straightforward accounts of first trials with multiple-inquiry validation, with both wrong and lucky turns openly divulged to the jury.
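Readers who want to see what that single coefficient amounts to can compute KR-20 in a few lines. The sketch below is my own illustration, not part of the original chemistry-test study; the response matrix is invented, and only the formula itself is taken from the measurement literature.

```python
# Minimal sketch of Kuder-Richardson Formula 20 (KR-20), the internal-consistency
# statistic mentioned above. Rows are examinees, columns are dichotomously scored
# items (1 = correct, 0 = incorrect). The data are invented for illustration.

def kr20(responses):
    """KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores)."""
    n = len(responses)            # number of examinees
    k = len(responses[0])         # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # proportion correct on item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical pilot data: 6 examinees, 5 items.
pilot = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
]
print(round(kr20(pilot), 2))
```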

The Need for Multiple Inquiry in the Validation of Writing Tests

My testing committee, of course, could not have known that twenty-five years later the Kuder-Richardson would be somewhat of a relic. But their choice betrays more than just the historicity of statistical methods. It reflects a change in conception of the essential nature of language tests quite relevant to the current enterprise of validating writing tests. As conceptions of tests alter, so must methods of test validation. Twenty-five years ago it was widely assumed that a good language test measured a universal capability or "aptitude" for language use. With an indirect test of writing, the standard test validation was internal, through item analysis, since the main question was how well the items on the test worked in discriminating among degrees of that assumed aptitude. As long as the reality of a universal language capability remained unquestioned, additional validating measures were felt unnecessary. For instance, not until the issue arose of direct testing, of whether there might not be two distinct aptitudes for written language, did it become necessary to compare the powers of the indirect and direct portions of the test in predicting future language course performance of the examinees.

Contrast a more current notion of a language examination or, to come closer to the focus of this paper, of a formal writing examination. Camp defines writing assessment "as a set of contextualized performances and procedures" (1996, p. 139). Let's be more specific and define an institutionalized writing test as a social apparatus that applies a nomenclature (specialized and provisional language) in order to classify and label people for certain public uses. By this definition, it is largely irrelevant whether or not a score (e.g., "87th percentile" on the SAT Verbal) actually measures some universal writing aptitude. What is important is that the score operates as a bit of language with social effects, temporarily attached to a person who has been through the complex institutional routine that determined the label ("took the SATs") and then used for various social ends (e.g., helping admit the person into a particular college or allowing the person's parents to brag at a Sunday cook-out with neighbors).

Now once we take a writing test as a social apparatus producing a nomenclature for public use, and then ask how such a process itself should be tested, major contrasts appear with validation practices of twenty-five years ago. Suddenly everything is problematic, everything is under question. If the apparatus is social, then what (fallible) humans run it for what (debatable) ends and (more or less) how well, and how do the (vulnerable) people who are labeled by it feel (they think) about the process? If the product of the apparatus is a nomenclature - that is, a language system constructed by professionals for specialized and temporary ends - then how well do its labels match up with performance of people in the world outside the test? If the nomenclature is also a classification, then what is the value system that underlies it? And if the nomenclature is provisional, then how long before the historical motives for the original construction change, or how long before the labels drift out of use, attached as they are to people who are also changing? Finally, and most crucially, what are the uses that are made of this labeling, by the people who constructed the test, by the people who run the apparatus, by the institutions that use the labels, by the people who acquire the labels? If the essential question of test validation is whether the test works for society, then the overriding issue pertains to public benefit and not to universal "aptitude" or to internal validity, both of which in a social apparatus are provisionally constructed just like everything else and serve only so long as they serve the ends of the apparatus.

For validation one upshot of all of this seems fairly evident. Once the notion of a universal or "true" competence in language is written out of our definition of "writing examination" and the notion of fluid social ends written in, then we can no longer be satisfied with a single-measure follow-up of the examination. We commit ourselves to multiple-measure validation, perhaps even commit ourselves to validation tests with no clear limit as to number, the more tests the better. All this is to repeat the current dictum, that language acts are social acts, but it adds the perhaps not so common but still commonsense assertion that writing tests themselves are language acts. So essentially a test of a writing test should be no different than a test of writing - which evaluation experts have been saying for decades requires, as a socially embedded activity, multiple samples and multiple readers.

Another way of putting this is to say that the validation of any writing test is really a validation of a writing program. The program stands as the first-level social matrix in which the test act is embedded. For models of validation, then, those of us called upon to conduct follow-up studies of writing tests might best turn not just to traditional methods of test evaluation (e.g., item analysis, inter-rater reliability, test-retest correlations) but also to program evaluation. And indeed within the discipline of program evaluation, this last quarter century has seen no approaches receiving more attention than multimethodological ones (for useful syntheses, see Fiske, 1982; Mark & Shotland, 1987; Brewer & Hunter, 1989). Program evaluation now offers writing-test validators four powerful rationales for multiple inquiry.

(1) Use. Focus on social utility pushes validation beyond just some alternate method to traditional quantitative designs. For any single validation, a variety of different methods are needed because any testing apparatus and any educational program is composed hierarchically of different people with different needs and hence with different uses for the results (compare Stake's "responsive evaluation," which "trades off some measurement precision in order to increase the usefulness of the findings to persons in and around the program," 1975, p. 14). No one study is going to satisfy all stakeholders, and no one stakeholder is going to find useful all validation studies - unless it is the chief administrator of the assessment program (a point I will return to).

(2) Perspective. Partly because different people in a program have different needs, they will work out of different perspectives. Yet any one validation measure tends to privilege the perspectives of certain people and suppress the perspectives of others. Testing the predictive power of the SAT Verbal by correlating it with college grades privileges the teachers who give the grades. (It privileges them because it does not test the validity of the grades.) Even the most carefully constructed survey of stakeholder attitudes privileges the perspective of the researcher, who has to make decisions about survey questions and target groups. Multimethod research multiplies studies to give voice to different perspectives, especially where administrators need to be aware of the complexity of the situation (Heath, Kendzierski, & Borgida, 1982) and where stakeholder biases tend to lead in divergent ways (Reichardt & Gollob, 1987).

(3) Cross-checking. Inquiry is "triangulated" in multimethod validation so that the blindness or bias of one study may be illuminated or balanced by the angle of view of other studies (Campbell & Fiske, 1959; Denzin, 1970; for a critique of multimethod triangulation, see Shotland & Mark, 1987). In a validation of the SAT Verbal with college grades, as I have said, the value-system of college teachers is uncontrolled for. Those values could be tested in a second study, a holistic appraisal of end-of-course writing with program administrators as evaluators. And since teachers and administrators may share the same biases, an exit survey of student self-assessment of their writing might help balance information. I am unaware of any writing-test validation designed to reap the advantages of a cross-checked or "multiplistic" design. Brewer and Hunter (1989), for instance, cite the cross-checking function of surveys, archival research, fieldwork, and experimentation: the weakness of surveys, that they are contaminated by reactive effects, is complemented with the greater objectivity of a historian's paper chase, whose weakness of incompleteness is complemented with local fieldwork; controlled experiments may add predictive power lacking in all three. Smith and Kleine (1986) cite the triangulating power of ethnographic, historical, and biographical investigations in illuminating the role of educational administrators. When quite different studies in a multiple design "converge" or support the same outcome, then research can offer answers considerably more powerful than a single study could provide, especially in fluid social contexts.

(4) Probing. Indeed, multiple studies are needed in writing-test validation because no one answer is ever complete. Validation studies of something as empirically squirmy as a direct writing test cannot be expected to prove anything. Rather they probe something. They bring back partial returns from partial samples that need to be synthesized with samples from parallel but not exactly replicating probes. Two of the most direct values of multimethod validation are the ways it augments the interpretability of findings and deconstructs the certainty of superficial interpretations (Mark & Shotland, 1987). Writing-exam validation, often prompted because the examination program is innovative, especially demands the kind of "illuminative evaluation" that interprets more than predicts (Parlett & Hamilton, 1976).

With its emphasis upon pragmatic use, participant perspectives, cross-checking, and probing, multiple inquiry has a potential for writing test and writing program validation that can be highlighted with a glance back at typical one-shot validations of a decade or so ago. They all beg for further studies. Modu and Wimmers (1981), for instance, correlated writing placement with the grade the student received in the course. But how happy was the student or the teacher with the grade? Did a B mean that the student was correctly placed and progressed well in the course or underplaced and merely coasted? Hackman and Johnson (1981) compared placement with the end-of-course assessment by the teacher as to whether the student belonged in the course. But what constitutes "belonged" - a student whose writing was passing from the beginning or a student whose writing improved with much hard work and even trauma to a point of barely passing? Gorrell (1975) had outside holistic raters judge the quality of writing at the end of the course. But how does that judgment of quality compare with the judgments of the student and the teacher? Obviously, had all three validation methods been used together at any one of these sites, the findings would have been much more persuasive.
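To make the poverty of such one-shot designs concrete, here is a minimal sketch of the Modu-and-Wimmers sort of validation: placement correlated with course grade, one coefficient, no further questions asked. The placement coding and the grade data are invented for illustration; nothing here reproduces those studies' actual procedures.

```python
# A one-shot validation of the Modu-and-Wimmers sort: correlate placement level
# with course grade and report a single coefficient. Data are invented.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

placement = [1, 1, 2, 2, 3, 3, 4, 4]                  # e.g., 1 = basic ... 4 = exempt
grade = [2.0, 2.7, 2.3, 3.0, 3.3, 2.7, 3.7, 4.0]      # course grade points

print(round(pearson(placement, grade), 2))
# A respectable coefficient here still answers none of the other questions:
# was the B earned by a well-placed student or by an underplaced one coasting?
```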


An Instance of Multiple Inquiry

I do not present the following validation history as a model of multiple-inquiry methods. It probably illustrates as many missteps as others, and certainly errs in conducting too many studies without careful planning (Shotland & Mark, 1987). But it does demonstrate the potential for multiple inquiry into a large-scale writing examination. It also shows that the logistics of such designs are not insurmountable at a moderate-sized university with a complex composition program.

Complex is the word to describe the formal undergraduate writing assessment at Washington State University. A local writing-placement test is required of all of its 2,800 entering students, and a local midcareer writing placement test of its 2,500 juniors. The two examinations are interconnected - they use the same timed-writing format, and by 1995 many of the juniors had stood for both exams - and by their nature they should be treated as one program. In the academic year 1991/2, the administrators of the matriculation test conducted validation studies of it (Haswell, Wyche-Smith, & Magnuson, 1992), and in the spring of 1995, I conducted validation studies of the junior test (Haswell, 1995). All of these studies were initial investigations of examinations that had just been implemented.

The first placement examination is fairly conventional. Students write a 90-minute essay on one of several paragraphs in accordance with one of four rhetorical tasks. They have no choice of paragraph or task, which are rotated regularly. They then write for 30 minutes on a self-reflective topic. The junior exam, more complex and innovative, requires the student to sit for the same timed-writing exercise, with the same paragraphs and rhetorical tasks (although students usually get different ones than they chanced upon two years earlier). But this timed writing is only one part of a portfolio format. Along with their two timed essays, they submit three papers previously written for courses and signed off by the teacher of the course as of passing quality at a junior level.

The rating of the writing in both exams follows an innovative two-tiered system (Haswell & Wyche-Smith, 1995). With the first-year exam, both of the timed essays are read by a trained reader (Tier I), who determines if the essays place - obviously place - the writer into regular freshman composition. If the placement is problematical in any way, the essays go to Tier II readers, administrators and long-time teachers of the courses, who decide in mutual consultation if the student will be exempted from first-year composition or otherwise placed in regular composition, in a tutorial in conjunction with the regular course, in a basic course, or somewhere in an ESL sequence. The same two-tiered rating system is used in the junior portfolio. Tier I readers read only the timed writing and decide if it represents an obvious "pass" (no further composition coursework required of the student). If so, the assessment of the portfolio is finished. Problematic essays are sent on to Tier II, where the raters evaluate the entire portfolio of five pieces, including the ratings of the course instructors, and decide whether the student should pass, should be awarded a pass with honors, or should be required to take a further composition course before graduation (a one-hour tutorial or a full three-hour course).

Information is collected from students when they take or sign up for the exams and after their examination writing has been processed. The extent of the database shows the complexity of the testing program. For the first-year exam, data collected include the student's full name, ID number, sex, ESL status, native language, transfer status, session of exam, cited paragraph in the prompt, rhetorical task of the prompt, names of Tier-I and Tier-II raters, Tier-I and Tier-II ratings, and final placement. The same information is recorded for the junior exam, to which is added the student's telephone number; academic major; advisor's name; teacher, title, date of course, and teacher's assessment for each of the three submitted course writings; academic hours earned to date; and campus (the main campus or one of several branch campuses). All this information is entered into a computer data-filing program that allows access and manipulation and transportation into a statistical package - a step which makes quick work of many of the validation studies I will describe.

As this data bank indicates, the writing program at WSU has a variety pack of functions: to place students in writing courses, to support a WAC initiative, to support a general-education reform of the undergraduate curriculum, to send a message to prospective employers of degree-holders, to help fulfill a mandate for outcomes assessment from the state higher-education coordinating board, and to serve as a form of writing instruction in itself. Stakeholders in the program are numerous. Potentially benefiting are not only the students, teachers, and administrators of the writing program, but also teachers across campus (who assign and evaluate the course submissions for the junior portfolio), faculty serving as raters (who feel committed to a procedure, the effects of which they can see everywhere on campus), central administrators (who want to see clear benefit for their majors and their faculty), higher administrators (who backed the general-education reform and the WAC initiatives), the Board of Regents (who approved the testing package under pressure from the state coordinating board), and the designers of the tests (who see implemented a large-scale testing scheme with new and risky features). Each of these people is asking of the program, is it working? They ask, of course, from their particular perspective. I will organize the following selection of validation studies and results around these stakeholders, faithful to my working definition of a writing test as an apparatus whose final end is public use.
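As an illustration only (the field names are my own guesses at a workable layout, not the actual WSU file format), one record in the kind of data bank just described might be organized along these lines before export to a statistical package.

```python
# Illustrative sketch of one examinee record in a data bank of the sort described
# above. All field names and values are hypothetical; they simply mirror the items
# the text says were collected for the junior exam.

from dataclasses import dataclass
from typing import Optional

@dataclass
class JuniorExamRecord:
    student_id: str
    name: str
    sex: str
    esl_status: bool
    native_language: str
    transfer_status: bool
    exam_session: str
    prompt_paragraph: str          # which cited paragraph appeared in the prompt
    rhetorical_task: str
    tier1_rater: str
    tier1_rating: str              # e.g., "obvious pass" or "send to Tier II"
    tier2_raters: Optional[str]    # None if the portfolio stopped at Tier I
    tier2_rating: Optional[str]
    final_outcome: str             # "Pass with Distinction", "Pass", "Needs Work"
    phone: str
    major: str
    advisor: str
    course_papers: list            # (teacher, course, date, teacher's assessment) x 3
    hours_earned: int
    campus: str

# One invented record, of the sort that can be dumped to a statistics package.
example = JuniorExamRecord(
    student_id="000000", name="A. Student", sex="F", esl_status=False,
    native_language="English", transfer_status=True, exam_session="1995-03",
    prompt_paragraph="paragraph 7", rhetorical_task="argue", tier1_rater="R1",
    tier1_rating="send to Tier II", tier2_raters="R4,R9", tier2_rating="Pass",
    final_outcome="Pass", phone="555-0100", major="Zoology", advisor="Dr. X",
    course_papers=[("T. Smith", "Zool 300", "1994-11", "Acceptable")],
    hours_earned=72, campus="Pullman",
)
```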


Students

The immediate use of placement exams is to put students into courses that prove beneficial to them. From the vantage of students who place into basic courses, the benefit may be hard to see. But that student opinion is subject to change, and it will take more than one inquiry to see it. Four weeks after being placed by the first-year placement examination into a one-hour tutorial, 39% of the students felt the exam was not accurate in their case; yet at the end of the semester, less than 10% of the students felt the course had not been beneficial to them. A look at enrollment discovered that subscription for the tutorial doubled the following semester with an influx of students who were not placed into it but who evidently had heard good news about it. Similarly with the junior portfolio, a telephone survey of examinees found that while 100% of students awarded "Pass with Distinction" felt the exam accurately reflected their performance, the numbers drop to 70% for those receiving a simple "Pass" and to 35% for those receiving "Needs Work." Yet an exit survey of students in the one-hour tutorial, into which most "Needs Work" students are placed, found nearly all of them pleased with the course and more than a few stating that it was the most beneficial writing course of their college career.

Student opinion about their own progress in writing, of course, may not be borne out by their actual writing performance. A fourth inquiry took advantage of the fact that in the first-year writing courses, students submit a midsemester folder, which is rated "pass," "questionable pass," or "fail" according to end-of-the-semester standards. Students in the basic course (English 100) have their folders mixed in with the others. The assessment of midsemester folders (Table 1) helps support the validity of the placement test. Students placed into regular English received fewer fails and questionable passes than did students placed into regular English with the tutorial, who received fewer than did students placed into English 100. Further support for the placement exam comes from the fact that the few students who enrolled in English 101 without having taken the test had a worse midsemester record than that of students in the tutorial.

Table 1. Results of Mid-semester Writing Folder with Placed Students

Placement History                                  Pass   Questionable Pass   Fail
Placed in regular English 101                      78%    9%                  13%
Placed in English 101 plus the one-hour tutorial   63%    15%                 22%
Placed in English 100                              18%    32%                 50%
Did not take the placement examination             58%    14%                 27%
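Programs that keep comparable records can produce a Table-1-style cross-tabulation with very little labor once placement and folder outcomes sit in the same file. The sketch below uses invented records and hypothetical category labels; only the counting logic is meant to be illustrative.

```python
# Minimal sketch of a Table-1-style cross-tabulation of placement history against
# midsemester folder outcome. Records and labels are invented for illustration.

from collections import Counter, defaultdict

records = [
    ("English 101", "pass"), ("English 101", "pass"), ("English 101", "fail"),
    ("101 + tutorial", "pass"), ("101 + tutorial", "questionable"),
    ("English 100", "fail"), ("English 100", "questionable"), ("English 100", "pass"),
    ("no exam", "pass"), ("no exam", "fail"),
]

by_placement = defaultdict(Counter)
for placement, folder in records:
    by_placement[placement][folder] += 1

for placement, counts in by_placement.items():
    total = sum(counts.values())
    row = ", ".join(f"{outcome} {100 * n // total}%" for outcome, n in counts.items())
    print(f"{placement}: {row}")
```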


Table 2. Results of the Junior-Level Portfolio Examination According to Language and Transfer Status

Language Status     Transfer Status   Pass   Pass with Distinction   Needs Work
Non-native writer   Non-transfer      81%    02%                     17%
Non-native writer   Transfer          62%    03%                     36%
Native writer       Non-transfer      81%    12%                     07%
Native writer       Transfer          81%    10%                     09%

Student self-assessment and exam results can sometimes be quite at odds and need to be cross-checked. For instance, in telephone interviews many international students who had earned a "Needs Work" on the junior portfolio said that they had felt doomed to failure since they knew that non-native speakers almost always fail. Exam results (Table 2) easily dispel this myth. Non-native writers who began their post-secondary education at Washington State pass at exactly the same rate as do native writers.

Combined with other probes, the figures in Table 2 enhance - revise and complicate - interpretation of the writing program in other ways. Some WSU administrators would like to believe that as a group transfer students write worse than non-transfer students, but here the evidence suggests that this is not the case, not at least with native speakers. It is NNS transfers who appear more poorly instructed in writing. Yet about 80% of the interviewed transfer students, both NS and NNS, said they felt at a disadvantage because they had been poorly informed about the exam and had had trouble finding teachers to sign off on their papers, some of which originated from other schools. Add to this another twist: 93% of NNS transfer students felt that their portfolio rating reflected their performance on the exam accurately, compared with 67% of non-transfer NNS writers. Together the various validation probes converge to show the transfer NNS students very much at risk and also very conflicted about the exam, and diverge to show that assistance for this group will not be simple.

One form of assistance would be to inform NNS students of the results of this validation before they take the test. Comparison of registration for the first-year placement exam with that for the junior exam finds a growing number of international students who place into Washington State University's ESL sequence yet who then do not enroll for it; instead they register for the junior examination with transfer credit, often from a two-year college writing course taken during the summer. It would be a productive act of advising to show these freshmen the performance distributions above, which demonstrate transfer NNS students performing much worse than NNS students who had undertaken the ESL sequence of writing courses at Washington State.


One of the less appreciated uses of validation studies, in fact, is to recycle the findings themselves back into the test program (a self-placement system at Yale provides an interesting but isolated example; Hackman & Johnson, 1981). With the WSU junior portfolio, even a simple distribution of test outcomes could be extremely useful to students trying to decide how much effort to put into preparation: Pass with Distinction 11.5%, Simple Pass 79.4%, Needs Work 9.1%. The special language, "Pass with Distinction," is translated for students by the 11.5%, a figure that very well could validate a student's striving for a goal that seems attainable. Students later preparing for the junior portfolio face a special quandary: should they wait to take the exam until they have better papers to submit? They might find informative a breakdown of performance by academic hours earned when the student took the exam. The results in Table 3 would show students that the answer is not simple (more enhancement of interpretation). Taking the exam early does not put them at risk, but waiting a semester or two beyond the recommended time (61-75 hours) is risky except for the slim minority of students who wait until their senior year (which has its own special risk). Administrators of the exam, of course, may not like all of these figures, nor like having them accessible to students.

Table 3. Credit-Hours Earned When Taking the Junior Examination and Results

Hours Earned   Percent of N   Simple Pass   Pass with Distinction   Needs Work
Up to 60       23%            75.7%         17.2%                   07.1%
61-75          42%            79.2%         12.4%                   08.4%
76-90          22%            79.8%         09.2%                   11.0%
91-105         10%            77.3%         13.3%                   09.3%
Over 106       03%            76.3%         23.7%                   00.0%

Teachers of Writing Courses

A placement instrument works for teachers when it puts into their courses only students who fall within instructional parameters imagined by the teachers. Yet there is a natural tendency for teachers to resist placement procedures per se. It is difficult to convince them that they can't bring along any student who ends up in their section. At WSU this allegiance to students surfaced when teachers of the first-year writing courses resisted a request from placement administrators to check their rolls for students who had placed in one course and registered for another. Such teacher attitudes question the results of an end-of-semester survey that otherwise appears to support the placement test. Regular-composition teachers judged only 4% of their students misplaced, basic teachers only 3%, and tutorial teachers only 6%. Other probes are needed to check for possible teacher bias. The relative success of students on the mid-semester folder reading (shown above) argues that the placements stratified the writing skill of students roughly in line with the various courses. End-of-the-semester grades add further support, since they showed withdrawals, in both the regular and the basic courses, declining from the level of years before the placement exam was installed.

Further cross-check, nonetheless, may add information. After all, since all of these probes operate within the milieu of the first-year writing program, they may share bias. What happens to this support for the exam after students leave that precinct? With the databank, it was easy to take a quick look at this issue by correlating performance on the first-year placement with performance on the junior exam. Rankings were formed from the outcomes of the two exams: first-year placements were ranked Exemption highest, English 101 second, English 101 plus tutorial third, and English 100 fourth; portfolio placements were ranked Pass with Distinction highest, Pass second, one-hour Needs Work third, and three-hour Needs Work fourth. The correlation was significant (p < .05) and positive, but very low: .24 (Spearman rank). There is not much argument here for the lasting validity of the first-year placement. Rather the argument is for the presence of the junior exam, since it seems to elicit a quite different writing performance, perhaps to tap into a writing skill that over two years has changed with students in different ways.

Two parallel probes into sub-groups, however, do show some predictive value of the first-year placement. (1) As we have already seen, non-native writers who evaded their placement into the university ESL writing sequence performed far worse on the junior exam than did any other group. (2) Students who are exempted from first-year courses by the placement exam - around one percent of exam takers - do exceptionally well on the junior portfolio. None of them earned Needs Work, and exactly 50% of them earned Pass with Distinction, compared to 11% for the population as a whole. (Upper-division teachers often aver that some of their worst writers were exempted as freshmen, a belief that these figures document as largely a myth.) All in all, the disparate findings about the longitudinal performance of groups of students may converge at one useful generalization for writing teachers, that writing development during the first two years of college is deeply non-uniform across student groups.

Teachers of Other Courses

Sixty-seven percent of the junior-portfolio course papers submitted by these exempted students were qualified as "Exceptional" by their teachers - further confidence in the predictive validity of the first-year placement category of "Exempt." Confidence in one percent of the students, however, would have been cold comfort for these teachers. They had been asked to play a difficult role in the new junior placement process. They had to re-read and evaluate papers brought in by their students, either rejecting the papers as unsuitable for submission for the portfolio or rating them as "Acceptable" or as "Exceptional" ("within the top ten percent of student performance on the particular assignment"). Their problems with this nomenclature quickly became apparent when during the first year of the exam they handed out the "top ten percent" Exceptional ratings at a 43% clip. Here validation took a non-traditional form: a workshop of teachers across campus who had been signing off on the largest number of student submissions. Many problems were voiced during the session. The term Exceptional was too loose, criteria for evaluation differed from teacher to teacher, responsibility for the evaluation was not clearly perceived (could teaching assistants sign off on papers in large classes?). On the other hand, as we have already seen with other participants in the program, the perspective of the teachers was complex, and they strongly supported the general format and goals of the junior portfolio.
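The rank correlation reported above between first-year placement and junior-portfolio outcome treats both results as ordered categories. The sketch below shows one way such a coefficient might be computed; the rankings are invented, and the use of SciPy's spearmanr is my own choice of tool, not a record of the procedure actually used at WSU.

```python
# Sketch of a Spearman rank correlation between first-year placement and
# junior-portfolio outcome, both coded as ordinal categories (1 = strongest
# category, 4 = weakest). The data are invented; assumes SciPy is available.

from scipy.stats import spearmanr

# 1 = Exemption / Pass with Distinction ... 4 = English 100 / three-hour Needs Work
first_year_rank = [2, 2, 3, 1, 4, 2, 3, 2, 4, 1]
junior_rank     = [2, 1, 2, 2, 3, 2, 2, 4, 2, 1]

rho, p = spearmanr(first_year_rank, junior_rank)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```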

Evaluating Corps

In an institutional testing program, raters need motives to continue the work of rating, to return to a modestly paying job time after time, even year after year. They need to resist disillusionment with the value and legitimacy of their function, and they need to resist boredom with their task, an activity that is repetitious almost by definition and by its nature resistant to change even when change might lead to betterment. In one aspect, the rating system for the two exams especially risked both disillusionment and boredom. This risk lay with Tier I readings, where one rater alone could determine the fate of a student. As university faculty, readers were aware of the novelty of this procedure, which forgoes the time-honored check of a second independent reading. Since readers would not be much impressed with the argument that getting by with a single reading on a large portion of the exams (64% of them, as it turned out) saved the program considerable cost, they needed validation for it.

This validation was provided by several studies. With the first-year exam, 27 exams that had placed their authors into regular composition through only one reader were recirculated. All but one received the same placement. Course performance was checked of another 100 students who had been placed into the regular course on only one reading: 95 of them received credit for the course, 4 dropped the course with no evidence that they were unable to pass it, and 1 failed because she did not turn in required work at the end. With the junior exam, 40 portfolios judged "obviously pass" by the first reader and hence not read again were recirculated: 38 received a final rating of "Pass," and 2 a "Needs Work" (these two, clearly borderline portfolios, were recirculated a third time and received "Pass"). These probes into rater reliability can be used, of course, to validate the exams, that is, to defend the outcome of the exams, but they also served to validate the exam for the raters, when the results of the studies were passed on to them.

On the other hand, a series of similar studies, which would take too long to detail, made it clear that the more problematical decisions that reached Tier II readers were not as reliable by far. Think-aloud protocols of raters reading full portfolios found them following unique and sometimes eccentric methods of evaluation, not so much deviating from the criteria and method that raters were trained for as circumnavigating parts of it. Yet subsequent training sessions found them eager to work toward a more cohesive rating community.
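The recirculation checks described above reduce to a simple agreement rate between the original single reading and a second, independent one. The sketch below reuses the counts reported in the text (27 essays with 26 agreements; 40 portfolios with 38); the small function wrapping them is my own framing, not the program's actual procedure.

```python
# Sketch of the single-reading agreement check described above: outcomes from the
# original Tier I reading are compared with outcomes from a recirculated reading.

def agreement_rate(first_readings, second_readings):
    same = sum(1 for a, b in zip(first_readings, second_readings) if a == b)
    return same / len(first_readings)

# 27 recirculated first-year essays: 26 received the same placement the second time.
first = ["regular"] * 27
second = ["regular"] * 26 + ["tutorial"]
print(f"first-year exam: {agreement_rate(first, second):.0%} agreement")

# 40 recirculated junior portfolios: 38 re-rated "Pass", 2 initially "Needs Work".
first = ["pass"] * 40
second = ["pass"] * 38 + ["needs work"] * 2
print(f"junior portfolio: {agreement_rate(first, second):.0%} agreement")
```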

Central Administrators

Heads of departments and programs, of course, would not support the examination if they felt that it was taking up too much of their faculty's time. They would support it if it provided them with useful information about their programs. Here validation could well have taken the form of report-and-response, though little of this was done. Chairs were sent a list of the teachers most active in signing off on portfolios, largely so that credit for the work would return to the teachers. There was no response to this from the chairs. More response could have been expected had they been sent Table 4, which orders departments by the percent of their declared majors who earned Pass with Distinction. Granted that cell numbers are sometimes small, the figures still are provocative. The range in performance by major is quite broad, even within units. The College of Business, for instance, might well take notice that all their majors score below the universal mean, and that only 2 to 4 percent of their marketing majors earned Needs Work, compared to 20 percent of management majors and 35 percent of finance majors. Similarly chairs could also use a ranking by department on average time-to-exam, since they are concerned about time-to-graduation of their majors. Such a ranking, which appeared in the validation report, shows substantial differences among departments, with Nursing and Zoology students, for example, taking the junior examination at an average of 68 hours credit and Communications and Finance students at an average of 80 hours.

Table 4. Performance on the Junior Portfolio by Declared Academic Major

Academic Major            N      Pass with Distinction   Simple Pass      Needs Work
                                 (Percent of N)          (Percent of N)   (Percent of N)
History                   20     30%                     65%              05%
Electrical Engineering    21     29%                     62%              10%
Biology                   50     28%                     70%              02%
Pharmacy                  22     21%                     68%              05%
English/General           35     23%                     69%              09%
Political Science         46     22%                     74%              04%
Animal Science            21     19%                     81%              00%
Communications/Gen.       65     18%                     75%              06%
Civil Engineering         36     17%                     77%              06%
Teaching & Learning       124    14%                     82%              04%
All Examinees (Mean)      1493   14%                     80%              06%
Criminal Justice          41     13%                     79%              09%
General Studies           24     13%                     75%              13%
Nursing                   32     13%                     81%              06%
Speech & Hearing          23     13%                     70%              17%
Zoology                   24     13%                     88%              00%
BA/Marketing              26     12%                     85%              04%
Environmental Science     20     10%                     80%              10%
Foreign Languages         20     10%                     90%              00%
Hotel & Restaurant        40     10%                     80%              10%
Business/General          61     09%                     78%              13%
Communications/PR         22     09%                     91%              00%
Sociology                 32     09%                     84%              06%
BA/Accounting             51     08%                     90%              02%
BA/Finance                26     08%                     58%              35%
Mechanical Engineering    26     08%                     92%              00%
Human Development         31     06%                     77%              16%
Psychology                83     06%                     90%              04%
Architecture              22     05%                     91%              05%
Undecided                 44     05%                     84%              12%
BA/Management             25     04%                     76%              20%

Note: Majors with fewer than 20 exam takers are excluded.

The reality is that for validation such data sets are not useful until they are interpreted by the parties involved. The chair of the Finance department, whose students seem to come out poorly on both of these measurements, might argue that the kind of writing tested by the junior portfolio is not relevant to finance majors, or that the department has large numbers of international and transfer students who have good reasons to delay taking the exam. Then again, the chair's dean might not agree. An illustrative case in point occurred when I presented some of the portfolio results of Education majors to a teacher professional advisory board. I showed a list of all courses in the School of Education that had been the origin of papers submitted to the examination. It was a very unevenly distributed list: 26 courses were represented by only one submission, 17 more by only two to four submissions, 8 more by five to nine submissions; then one course had 19 and another course had 54. Several members of the Advisory Board were disturbed, seeing that the bulk of submissions came from only ten courses or a fifth of the School's offerings. The Dean of Education, who was more aware of faculty workloads and assignment of faculty support, pointed out how many different courses apparently required extended writing. The Chair of Teaching and Learning expressed pleasure that most of the courses listed were in her department. I consider a meeting such as this as test validation, because it elicited perceived value of the test from several different perspectives.

Higher Administration

By way of validation for the testing program, higher administration will be looking at several distinctive outcomes, among them costs of maintaining a new system and compliance with any new general-education requirements for graduation. Since exam administrators could show the Provost that student fees for the two exams were steadily reducing the initial start-up debts, and since they could also show that student subscription for the exams was growing pretty much as expected, both verbal and financial support for the program from the central administration continued. That support, of course, is a kind of validation in itself. Seeking validation from higher administration, however, is a risky business. Therein appears most blatantly the nature of testing programs as provisional apparatuses whose raison d'être is public utility. (For a political analysis of the history of this test administration and higher administrators, see Haswell & McLeod, forthcoming.)

Board of Regents

At the end of the second year of implementation of the junior exam, I reported on its progress to the Board of Regents at their semi-annual campus meeting. Again, I construe the encounter as another validation probe. As might be expected, the Regents, who function sometimes as a public relations board, were favorably impressed by the moderate but not insignificant rate at which upper-division students were being required to take more writing (9%) and by the list of departments whose students had performed exceptionally well (animal science, biology, English, electrical engineering, history, pharmacy, political science), perhaps because it contained some entries from the sciences and professional schools. If this response sounds typical, even stereotypical, this particular board of regents resisted easy categorization. They approved most openly two outcomes of the testing program that had gone largely unappreciated by other constituents. One was the fact that to date the corps of raters had come from 29 different university units, and the other was that after two years of implementation students had submitted papers from 616 different courses around campus, taught by 880 different teachers. It may be that better than anyone the Regents had an understanding of the program as an institutional apparatus, with active participation across the university as a sign of health. Their sense of validation was certainly more complex than that shown by their university's testing committee a quarter of a century before.

THE ROLE OF THE VALIDATOR

Ironically enough, safekeeping of that old testing committee's single concern - internal validity - ended up largely in the hands of the current validator. The old committee had modulated over time into a human rights review board, which had little interest in internal test validity. As had other stakeholders on campus, apparently. I ran formal equivalency tests on paragraphs and rhetorical tasks and duly incorporated the results into the validation reports of both exams. Since this testing resulted in the retiring of a number of gender- and ESL-biased paragraphs and in the rewriting of one of the rhetorical tasks, no one would argue that it was not worth doing. But then again, the major participants in the testing program - students, teachers, administrators, even raters - commented little on the problem or the changes.

Enforcing internal test standards is, of course, the traditional role for the test validator. Of necessity that role will become much more complex in a multiple-inquiry investigation. As probes multiply, opportunities to maintain that old image of a distant observer standing beyond the fray dwindle. Instead the validator must strive toward a different sort of distance. It will be a kind of oversight, but not in the governmental sense of the term. While various stakeholders are validating the testing program largely through their perceptions of its use to them individually, the validator is listening to everyone and imagining the larger picture and eventually recommending changes that will enhance the utility of the program as a whole. That is not an easy role, one that entails holding two sometimes contradictory conditions together, intimate knowledge of the program (Patton, 1978) and proactivity (Leviton & Hughes, 1981). For instance, when the validator is also the chief administrator of the testing program - as was my case at WSU - the validator has simultaneously strong insights into problem areas, strong motives to find pay-off measures, and strong temptations to table certain recommendations for change. Multiple inquiry begs for multiple validators. A validation team simply offers more chance of cross-checking within the validation process itself (I am now talking about testing the test of the test).

Nevertheless, I feel that with multiple inquiry into most writing programs it would be a mistake to exclude the administrator of the testing system from the validation team. There are good reasons for this. In the first place, if there is any one person on campus with a complex understanding of the existing test and hence with motivation to seek a complex enhancement of the test, it is the test administrator. Have no doubts, an honest multiple inquiry will end with a mixed judgment on any writing test. The more probes sent out, the more perspectives heard from, the less chance for a unified assessment. It is stakeholders with peripheral connections to the test who have reason to wish for a single-strand validation and who may very well argue to put all of the eggs in one statistical basket, say, inter-rater reliability. The test administrator may be the stakeholder with the most desire to take a full look.

A test administrator is also in the best position to imagine and effect some of the more innovative and powerful forms of validation, those going beyond the traditional one of formal, hypothetico-deductive experiments (Patton, 1978). In the history of the Washington State University exams, the inquiries that led to the most change were administrative initiatives: the workshop with faculty over the problem of portfolio course-paper submissions (which resulted in a handbook for faculty at large) and information sessions with dorm groups (which changed the way the examinations were advertised). Some of the most interesting validation issues were hardly accessible except through inside, day-to-day experience with the program. Has the junior portfolio been a learning experience for students? I believe it has been for some but not for the majority, yet I can validate that belief only in two roughshod ways. One is through some fifty or sixty office conferences I have had with students who were pleading to be allowed to re-take the exam, during which twenty-minute sessions I often saw a rapid change in some cavalier notions about writing, for instance about the need to convince readers with a balanced argument. The other way is through my experience, and the reported experience of others, performing the mundane task of taking in portfolio submissions, when large numbers of students could be seen occupying a space somewhere between utter indifference to the test and frustrated misconceptions about the test. It is unlikely that formal, structured interviews with students would allow such an appraisal of the test-takers.

Finally, the test administrator is in the best position to convert validation recommendations into realities. An obvious point, but what may not be so obvious is that the very ownership of that position endows the administrator, from the beginning, with unique motives and insights into ways the validation inquiry might proceed. Awareness of ends sharpens visualization of means. Who but a test administrator would be likely, for instance, to think of one of the finest uses of validation inquiry, when its findings re-enter the test itself, as where test takers are informed of the success rate of past test takers? It takes an administrative participation in a test to truly believe that the most rewarding validation results are those that future students can use to validate their own participation in the test.

The larger contention is that the best validators are people most committed to making use of the validation. This returns to the central argument for multiple inquiry in validation. In the end it seems a very simple point. Multiple inquiry multiplies recommendations for change. At Washington State, the recommendations for the junior portfolio alone were numerous: that ways needed to be found to reduce student procrastination in taking the exam, that transfer students should be better informed about the exam, that the NNS writer's plight needs immediate attention, that standards and procedures for signing off on student papers should be made more available to the faculty, that information about comparative performance according to college and major should be used to improve writing instruction across campus, that certain paragraph prompts used on the exam should be discarded, that the procedure of rater consultation during Tier II readings should be studied and improved, and that a forum should be developed through which students can voice their feelings about the exam and gain a better understanding of its uses. What could be categorized as a shotgun approach to validation may have its problems, but clearly no single inquiry and arguably no more controlled inquiry would have led to so fruitful a course for future action.

Acknowledgments: Support for the validation studies at Washington State University was provided by the Office of Vice-Provost for Instruction, the Office of the Dean of Humanities and Social Sciences, the General Education Office, and the Department of English. Especially supportive in this were Donald Bushaw, Richard Law, and Susan McLeod. Collaborators on these studies were Lisa Johnson-Shull, Robin Magnuson, and Susan Wyche. Some of the probes drawn on in the present paper were executed by members of a graduate seminar in Diagnosis and Evaluation of Writing during the spring of 1995: Carolyn Calhoon-Dillahunt, Yme Gerard Dolmans, Joel Norris, Glenn Putyrae, John Shindler, Dale Smith, and Steve Smith.


REFERENCES

Brewer, J., & Hunter, A. (1989). Multimethod research: A synthesis of styles. Newbury Park, CA: Sage.

Camp, R. (1996). New views of measurement and new models for writing assessment. In E. M. White, W. D. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 135-147). New York: Modern Language Association of America.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cohen, E. (1989). Approaches to predicting student success: Findings and recommendations from a study of California community colleges. Educational Data Retrieval Service (ERIC ED 310 808).

Denzin, N. K. (1970). The research act: A theoretical introduction to sociological methods. Chicago: Aldine.

Fiske, D. W. (1982). Convergent-discriminant validation in measurements and research strategies. In D. Brinberg & L. H. Kidder (Eds.), Forms of validity in research. New Directions for Methodology of Social and Behavioral Science no. 12. San Francisco: Jossey-Bass.

Gorrell, D. (1975). Toward determining a minimal competency entrance examination for freshman composition. Research in the Teaching of English, 17, 263-274.

Hackman, J. D., & Johnson, P. (1981). Using standardized test scores for placement in college English courses. Research in the Teaching of English, 15, 275-279.

Haswell, R. H. (1995). The Washington State University writing portfolio: First findings (February 1993-May 1995). Pullman, WA: Washington State University Office of Writing Assessment.

Haswell, R. H., & McLeod, S. (in press). WAC assessment and internal audiences: A dialogue. In K. Yancey & B. Huot (Eds.), WAC and program assessment: Diverse methods of evaluating writing across the curriculum program. Norwood, NJ: Ablex.

Haswell, R. H., & Wyche-Smith, S. (1995). A two-tiered rating procedure for placement essays. In T. Banta (Ed.), Assessment in practice: Putting principles to work on college campuses (pp. 204-207). San Francisco: Jossey-Bass.

Haswell, R. H., Wyche-Smith, S., & Magnuson, R. (1992). Follow-up study of the Washington State University writing placement examination: Academic year 1991/1992. Pullman, WA: Washington State University Office of Writing Assessment.

Heath, L., Kendzierski, D., & Borgida, E. (1982). Evaluation of social programs: A multimethodological approach combining a delayed treatment true experiment and multiple time series. Evaluation Review, 6, 233-246.

Hughes, G. F. (1996). The need for clear purposes and new approaches to the evaluation of writing-across-the-curriculum programs. In E. M. White, W. D. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 158-173). New York: Modern Language Association of America.

Leviton, L. C., & Hughes, E. F. X. (1981). Research on the utilization of evaluations: A review and synthesis. Evaluation Review, 5, 525-548.

Mark, M. M., & Shotland, R. L. (Eds.). (1987). Multiple methods in program evaluation. New Directions for Program Evaluation no. 35. San Francisco: Jossey-Bass.


McCauley, L. (1987). Intellectual skills development program: Annual report, 1986-1987. Educational Data Retrieval Service (ERIC ED 288 478).

Modu, C. C., & Wimmers, E. (1981). The validity of the Advanced Placement English language and composition examination. College English, 43, 609-620.

Parlett, M., & Hamilton, D. (1976). Evaluation as illumination: A new approach to the study of innovatory programs. In G. V. Glass (Ed.), Evaluation studies: Review annual (vol. 1, pp. 140-157). Beverly Hills, CA: Sage.

Patton, M. Q. (1978). Utilization-focused evaluation. Beverly Hills, CA: Sage.

Reichardt, C. S., & Gollob, H. F. (1987). Taking uncertainty into account when estimating effects. In M. M. Mark & R. L. Shotland (Eds.), Multiple methods in program evaluation. New Directions for Program Evaluation no. 35 (pp. 7-22). San Francisco: Jossey-Bass.

Scharton, M. (1996). The politics of validity. In E. M. White, W. D. Lutz, & S. Kamusikiri (Eds.), Assessment of writing: Politics, policies, practices (pp. 53-75). New York: Modern Language Association of America.

Shotland, R. L., & Mark, M. M. (1987). Improving inferences from multiple methods. In M. M. Mark & R. L. Shotland (Eds.), Multiple methods in program evaluation. New Directions for Program Evaluation no. 35 (pp. 77-94). San Francisco: Jossey-Bass.

Shotland, R. L., & Mark, M. M. (1987). Multiple methods in program evaluation. New Directions for Program Evaluation no. 35. San Francisco: Jossey-Bass.

Smith, L. M., & Kleine, P. F. (1986). Qualitative research and evaluation: Triangulation and multimethods reconsidered. In D. D. Williams (Ed.), Naturalistic evaluation. New Directions for Program Evaluation no. 30 (pp. 55-71). San Francisco: Jossey-Bass.

Stake, R. E. (1975). Evaluating the arts in education: A responsive approach. Columbus, OH: Merrill.

Witte, S. P., & Faigley, L. (1983). Evaluating college writing programs. Carbondale, IL: Southern Illinois University Press.