Studies in Educational Evaluation 26 (2000) 211-229
THE INSIDE STORY: THE REALITY OF DEVELOPING AN ASSESSMENT INSTRUMENT

Alison Wolf* and J. Joy Cumming**

*Mathematical Sciences Group, Institute of Education, University of London, UK
**School of Cognition, Language & Special Education, Griffith University, Brisbane, Australia

Introduction England’s Basic Skills Agency (BSA) is responsible for advising government on policy and resource issues. To that end, it commissioned a large-scale survey of the language and literacy skills of linguistic minorities in England and Wales (defined as adults who had been born and educated outside the UK, and for whom English was not a first language). The results have been reported elsewhere (&-r-Hill, Wolf, & Passingham, 1996). This article deals, instead, with the lessons to be learned horn the experience of creating a special assessment instrument. Assessment and measurement are activities that consume a large amount of time in education. Formal assessment procedures are increasingly important in programme evaluation and in public monitoring of education systems. Assessment and measurement have also become academic and professional disciplines in themselves. Psychometric theories were in the ascendant during the period from 1950 to 1975, but have since been subject to a barrage of criticism corn proponents of “performance”, “authentic” and approaches, all designed to measure demonstrated achievement, “competence-based” rather than underlying traits (see, for example, Archbald & Newmann, 1988; Broadfoot, 1996; Gifford & O’Connor, 1992; Gipps, 1994; Jessup, 1991; Wiggins, 1989, 1993). What both psychometrics and “performance assessment” share, however, is a tendency to treat actual test construction as unproblematic. There is remarkably little discussion in the academic literature of how an instrument actually gets developed, or of how levels of performance are set. Yet without such knowledge, evaluators and their clients may easily misinterpret or over-interpret the apparently objective results of a test. It is the very messy grass roots of test development, which are the subject of this article.


The Context

Assessment of adult language competencies has recently become an area of major government activity, especially in developed countries. This is partly because of the perceived links between work force skills and economic productivity, and partly because of a general commitment to lifelong learning. More specifically, governments are concerned to promote the integration and economic independence of adult immigrants. In most Western countries, literacy education attracts substantial government funding, and for minority language speakers special tuition is provided.

At the same time, the concept of adult literacy has developed common features across the industrialised world. The UK government uses the following working definition: "The ability to read, write and speak in English, and use mathematics at a level necessary to function at work and in society in general" (italics ours). Other countries use equivalent terms (see e.g., Kirsch, Jungeblut, Jenkins, & Kolstad, 1993; Wolf, 1994).

As in all public-spending areas, funding is also circumscribed and subject to scrutiny. Funding and administrative bodies aim to identify how much provision is really needed, to promote effective placement of students in appropriate programmes and to monitor their effectiveness. Accountability has become an increasing concern: so, for example, in the United States, government-funded programmes are routinely required to administer achievement tests to participants.

Most measures used for accountability purposes in adult literacy programs have been designed primarily for first language speakers, even though they may be used with minority language groups. However, even for their designated population, they have evoked serious concerns. Many are of dubious validity as measures of any adults' "real-life" literacy (Cumming, Gal, & Ginsburg, 1998; Lytle & Schultz, 1990). For example, in the USA, the Test of Adult Basic Education (TABE) (CTB/McGraw-Hill, 1987) is very commonly used. However, most items reflect school content, mainly early primary content, and probably underestimate the real-life skills of adults. Typical TABE items are not only quite removed from real-life demands on adults, but also use a very school-oriented format. Here is an example:

What is another name for 54,600?
F  500 + 90 + 6
G  5000 + 900 + 60
H  50,000 + 9000 + 60
J  50,000 + 9000 + 600

Assessment in the US context is usually high-stakes, because results affect allocation. Tests such as TABE provide apparent objectivity and reliability. The tests can be administered readily and allow state and federal evaluators and administrators to aggregate and compare results easily. This does not mean that they are necessarily appropriate in content. For example, Venezky, Bristow and Sabatini argue that

    comparative information on student performance, based on national standards, is important for policy makers and should be useful to both instructors and students. Without such data, the ability of adult literacy programs to prepare adults for work, citizenship, and home management may be difficult to evaluate. It is critical to develop evaluation policies that are based on valid and reliable indicators for adults and that attend to what programs actually teach. (1994, p. 129)

However, their evaluation of available procedures is highly critical of many current instruments and argues the need to "construct multiple indicators for evaluating adult literacy programs ... free of elementary-level and secondary-level conventions such as grade equivalents" (Venezky, et al., 1994, p. 101).

While in the United States commercially produced multiple-choice formats dominate, in other countries - Australia, the UK, and, more recently, New Zealand and South Africa - a very different approach has emerged and driven government policy. In the years immediately preceding this study, several extensive projects focused primarily on validity issues, and defined literacy standards for assessment purposes. In Australia this effort included development of standards and assessment specifically for people for whom English is a second language: for example, the Australian Second Language Proficiency Rating Scales (ASLPR) (Wylie & Ingram, 1995). This standards-based approach reflects a more general attempt to relate assessment to substantive outcomes and is by no means confined to adult education. Teachers of adult literacy in these countries - whether dealing with native speakers or minority language groups - are expected to base their teaching and their assessment around the relevant standards, which define, in substantive terms, what different levels of competence consist of. No common assessment instruments are provided or published; indeed they are disapproved of. Instead, teachers are encouraged to use their own materials and to judge achievement levels by reference to the standards.

Reading 3.3: Interprets and extrapolates from texts containing data which is unambiguously presented in graphic, diagrammatic, formatted or visual form.

Conditions of Performance
These statements outline the maximum degree of support allowable in interpreting performance on the indicators for this level:
- Performs where advice/modelling is available if required.
- Incorporates communication supports as required.
- Demonstrates competence in a number of contexts which may be interrelated.

Workplace & Social Contexts
- Adapts the skills of one cultural context to another.
- Understands texts which include meanings which are predominantly explicit.
- Performs without reliance on interaction with sympathetic participants/interlocutors.
- Uses a narrow range of skills and knowledge for employment-related skills, preparatory courses, broad-based training and specific workplace skills.

Figure 1: Sample Australian National Reporting System Standard


For example, in Australia, community and workplace language programs for English language learners are funded by the Commonwealth Government for the first 510 hours of instruction. Commonwealth initiatives currently require use of the National Reporting System for adult English language, literacy and numeracy competencies (NRS) (Coates, Fitzpatrick, McKenna, & Makin, 1995). An example of an NRS standard and some of the accompanying information on performance contexts appear in Figure 1. In England and Wales, adult literacy teachers are encouraged to use standards defined for communication and numeracy by the Adult Literacy and Basic Skills Unit (BSA’s precursor). These standards may be used by teachers for informal assessment, but are also embodied in national qualifications. Standards for these Wordpower certificates are expressed in terms of the performance criteria which must be met for success. A typical example follows:

Unit 305: Communicate in writing
Element 1: Complete a form

Performance criteria:
a) Identify relevant information for inclusion in the form
b) Write legibly and clearly
c) Follow any instructions or requirements of the form, e.g. follow standard conventions, appropriate use of upper case letters
d) Check and correct grammar, spelling and punctuation appropriate to the task
e) Provide all relevant information clearly and concisely

Range:
Forms: routine and simple forms in common use (e.g. insurance claim form, accident report form, job application form, holiday booking form, mail order form, travel expenses claim form, application for benefit, paying in slip, passport application, child's details for school, post record)
Information/context: name, address and at least three other items of information
Text: in handwriting, with a typewriter, or using word-processing software, as appropriate to task
Punctuation: conventions (use of capital letters, sentences, commas, full stop)

Figure 2: Sample English Literacy Standards (Wordpower)

The approach raised immediate practical problems for our BSA-commissioned national survey of English language needs among minority language groups, namely that no published assessment materials were available. Apart from standardised batteries produced by American testing companies (e.g., the TABE referred to earlier), only two major adult literacy instruments could be found, both also largely American in conception. One, used for a major study of literacy in the United States itself, is entirely multiple choice in its question format (Kirsch, et al., 1993), and only has a limited number of questions in the public domain, while the other, closely related to this and used for a major OECD comparative study of adult literacy, was and is unreleased (OECD, 1995, 1997). Both, moreover, were designed predominantly for native speakers across the whole attainment spectrum: a very different population.

It thus quickly became clear that a purpose-built instrument was needed. The study was intended to inform decisions about provision, and the instrument must reflect this. Its focus was not on fine discrimination among individuals, nor on individual change in English language proficiency. Rather, it must provide a good indication of group performance level. Moreover, in the context of the UK's standards-based approach, it must do so by relating performance to a number of levels defined in terms of everyday competence and behaviour.

Defining the Instrument

As noted earlier, the official English definition of literacy is: the ability to read, write and speak in English and use mathematics at a level necessary to function and progress at work and in society in general. To move from that broad definition to an actual measurement instrument required an enormous number of further decisions - decisions of a type which are usually concealed from the buyer or user of a finished test instrument or from the candidate in an examination. Of these, the first two were:

i) how to elaborate this definition in order to guide item design; and
ii) whether to use a psychometric model, assuming an underlying unified construct of literacy, or to work at all times with reference to concrete and substantive standards.

The Definition

The research team decided from the start that any measure of literacy must include the ability to read and comprehend texts, including the basic documents required in contacts with official agencies, and understanding written instructions in a variety of domestic and community contexts. It must also include the ability to understand verbal instructions and information given in English. Each of these areas must be incorporated directly into each and every one of the broad achievement bands by which the commissioning agency wanted the results reported.

There was nothing hugely radical about this definition. As noted earlier, similar definitions of functional literacy are used in other countries and form the basis of studies of adult literacy that provide national and (supposedly) internationally-comparative data on population performance (Kirsch, et al., 1993; OECD, 1995, 1997; Statistics Canada, 1991). However, for minority language speakers oral skills are likely to be problematic in a way they are not for native speakers (Jones, 1992). We also concluded that grammatical awareness, which is not always tested directly in literacy contexts, may be an important factor in functioning successfully, especially in work and educational contexts. Explicit assessment of both oral proficiency and grammatical awareness was therefore included in the definition.

None of these decisions may seem very controversial in itself. But they meant that one particular "basket" of skills had been adopted and others rejected. The contents of this basket also defined, to a considerable extent, the balance of items in the final instrument. A different combination of skills would, in all probability, have meant that many individuals performed quite differently in terms of their absolute and relative scores. Thus, at this very early stage, decisions were made which, while based on an articulated rationale and a knowledge of the research literature, were nonetheless in some sense arbitrary - and significantly different from those underlying other tests which were apparently measuring the "same" thing.

Traits or Standards?

The choice between adopting a psychometric approach to test construction or working from standards was effectively pre-empted by the literacy definition we adopted. Test construction in the psychometric tradition generally assumes that one is measuring a single underlying trait, with different items providing (partial) measurements of this trait. It is on this basis that, for example, item response theories provide rules for accepting or rejecting particular possible items as good or bad measures. Our definition included a number of quite distinct types of task: oral as well as written, formal grammar as well as everyday use. More importantly, we had good reason to suppose that different linguistic groups would vary considerably and systematically in which types of item they found hard or easy, and that to posit a single underlying trait would, for the test population, be correspondingly and seriously misleading.

Item response theory, and psychometric testing more generally, tend to see such differences as problematic (and have been criticised for so doing; see e.g., Goldstein & Wood, 1989). Differences in response patterns for groups are viewed as an indication of possible bias: items which display highly variable levels of relative difficulty are suspect. For this study, such a perspective was quite inappropriate. It was therefore decided that test construction would be based directly on standards: specifically those developed by ALBSU for the UK. However, our experiences call into question the explicit and implicit attitude that these are simple to use. The next section therefore describes in some detail the nature of such standards.
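
To see why the psychometric tradition treats such group-by-item interactions as a problem, it helps to recall the simplest item response formulation. In the Rasch model sketched below (a standard textbook model used here purely for illustration, not one fitted in this study; the item names and the ability and difficulty values are invented), the probability of success depends only on one ability parameter per person and one difficulty parameter per item, so there is no term through which membership of a particular linguistic group could legitimately alter an item's relative difficulty.

    import math

    def rasch_p_correct(ability: float, difficulty: float) -> float:
        """Rasch model: P(correct) depends only on (ability - difficulty), in logits."""
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # Invented logit-scale difficulties, purely for illustration.
    items = {"form filling": -1.0, "TV listings": 0.0, "customs form": 1.5}
    respondent_ability = 0.5

    for item, difficulty in items.items():
        print(f"{item:13s} P(correct) = {rasch_p_correct(respondent_ability, difficulty):.2f}")

    # Because every respondent is described by a single ability value, the model
    # implies the same ordering of items for everyone; systematic re-orderings
    # between groups therefore show up as item misfit or bias (differential item
    # functioning) rather than as legitimate variation.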

Measuring Attainment Against Standards: The Problem of Item Writing

The idea of standards or standards of competence has been extremely influential in recent years (Jessup, 1991). Standards are developed by analysing and defining the component parts of an applied skill as it is used in daily life. Most standards are developed for vocational awards but the approach has also been adopted for academic subjects in some cases, notably in New Zealand.

The standards-based approach to assessment involves comparing someone's performance with a list of clearly defined outcomes, which is what the standards effectively are. A contrast is generally drawn with conventional assessment, which supposedly fails to tell the world very much about what someone can actually do. On conventional assessments, the argument goes, people pass because they have scored 50% (or even less) on some sort of test or series of teacher-created assessments, but there is no way of saying with any confidence what they can actually do. Standards-based assessment, by contrast, defines the outcomes to be achieved very precisely (see e.g., Burke, 1995). A candidate can only be accredited with achieving an outcome when they have demonstrated their complete coverage or mastery of the topic, and only achieve an award when they have achieved every outcome.

The close relationship between this approach and older US developments in criterion referencing and mastery learning will be clear. The main difference is the emphasis of standards-based approaches on a careful analysis of "real-life" contexts (Wolf, 1995). There is also a confidence not shared by American constructors of mastery tests that assessors can deliver to standards without common, centrally written instruments. In the context of adult literacy, the standards developed in different countries have therefore provided what is in effect a national syllabus - or at least a national assessment schedule - in a way that had not been attempted in the past.

In the UK, the relevant standards follow the structure established by the country's National Council for Vocational Qualifications (NCVQ), in that they are divided into units of competence - for example, "Communicating in writing" or "Conversing with others". Each of these units is then divided into elements (see Figure 2 above) defined in terms of the precise outcomes expected and using performance criteria and descriptions of the range of situations to be covered. We expected to use the ALBSU standards as a template for the assessment instrument but, since they were designed for native English speakers rather than minority language populations, we also expected to draw extensively on the Australian standards for non-native English speakers.

Unfortunately, we very quickly discovered that there was no way in which to derive test items of a clear, given level of difficulty from the standards. This was a fundamental conclusion for the whole study. It was not simply that the standards did not map directly on to the design of individual items. They also failed to provide unambiguous guidance on how to assign respondents to levels and thus report on the overall levels of skills in the population. Basic education competencies apply necessarily to a wide range of situations, for example, the ability to complete a form (see Figure 2 above). However, written guidance cannot create a shared understanding of the level of difficulty to which a given task or example of a competency belongs. Neither teacher nor assessor (who may be the same person) can be confident that a particular manifestation of a generic skill is at the level required.

The problem is best illustrated with concrete examples. Two examples of standards, both, in this case, taken from the Australian National Reporting System (Coates, et al., 1995), are provided in Figure 3 along with, for each, two examples of possible student output (Fitzpatrick, Wignall, & McKenna, 1999). Does each, or neither, meet the standard? If the first does, then is the second at the same level, or at a different one? Nothing in the standards themselves provides an answer.

Standard: Reading 1.2 - Identifies specific information in a personally relevant text with familiar content which may include personal details, location or calendar information in simple graphic, diagrammatic, formatted or visual form.
Example 1: Locates an area that is important to him on the map of NSW or the tribal maps of Australia. (Excerpt from the Girrawaa Creative Works Centre, Bathurst, Initial Assessment Package, cited in Fitzpatrick et al., 1999, p. 35)
Example 2: Reads a Student Information Form, checking spelling of own street name and suburb on Health Card. (Needed support and verbal assistance to complete.) (Fitzpatrick et al., 1999, pp. 47-48)

Standard: Writing 2.4 - Completes forms or writes notes using factual or personal information relating to familiar contexts.
Example 1: Completes details about an article he has made on the portfolio form. (Fitzpatrick et al., 1999, p. 37)
Example 2: Fills in a simple job application where asked to write a couple of sentences about past jobs. (Fitzpatrick et al., 1999, p. 61)

Figure 3: Standards and Possible Evidence of Standard Achievement

Instead of writing items by mapping directly from the standards we therefore had, instead, to decide what one might reasonably expect the standards to mean, using independent knowledge of the population concerned and of what people usually meant by a given level. In other words, at every point, item design involved arbitrary, if informed, decisions by the research team. To this were added other problems: aggregation, criticality and manageability. We turn to these in the following sections.

The Problem of Setting Levels: Aggregation, Criticality and Context

Aggregation

Most studies of literacy attempt to describe it in terms of four to five levels of performance that indicate capacity to function in everyday life (OECD, 1995). The problem with aggregating performance into an overall level or description is that one loses information. The only way one can be sure of what specific things someone can do is by equating a level with complete mastery of all the component parts. This is exactly the approach taken for the standards-based competence awards currently in favour, and the argument is that, by insisting on total mastery, one knows exactly what an award holder can do. The Wordpower literacy certificates used in the UK take this approach and insist that candidates provide evidence on every single aspect of the relevant standards.

In the previous section we noted that this is less simple than it appears, since the meaning of a given standard or outcome is hard to interpret in concrete terms. However, over and above whether one assessor's level 1 of an outcome is the same as another's, the insistence on total mastery creates major practical problems of its own. If you apply this approach stringently then an individual's performance is defined by their weakest point: they only need to fail one item, and they cannot achieve the level (Cresswell, 1987; Wolf, 1995). Yet measurement error - the chance elements that affect assessment decisions, whether they be candidate hayfever, assessor boredom, or a myriad other factors - means that someone will often fail to achieve something when tested which, on most other occasions, they would have achieved. If rigorously applied, therefore, this approach is extremely hard on candidates. No public examination or test with centralised marking, anywhere in the world, operates on anything like a 100% success requirement. Moreover, within a standards-based system, people quickly come to feel that the 100% requirement is unfair or inoperable. Research evidence indicates that, in practice, it is never fully implemented. Instead the teachers or workplace assessors responsible for competence-based awards fudge the results (Crowley-Bainton & Wolf, 1994; Eraut, Steadman, Trill, & Porkes, 1997).

In the context of a national survey involving multiple assessors, fudging was not an option. We could have retained a standards-based approach but reported only in profile terms - x% can complete this item, y% can complete this item, and so on. But our sponsors, quite reasonably, wanted findings reported in terms of proportions at different levels of proficiency. For this purpose a "pure" standards approach based on 100% mastery was likely to be highly misleading: aggregated scores and pass-marks were effectively the only solution. Once again this meant marking and grading decisions on which the standards were little help. We (more or less) simply made them.
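
The practical force of this argument is easy to demonstrate with a small simulation. The sketch below is purely illustrative: the twenty outcomes, the 5% chance of an uncharacteristic slip on any one of them, and the 80% pass-mark are invented numbers, not figures from the survey. It simply shows how a strict 100% mastery rule interacts with ordinary measurement error, and how much more forgiving an aggregated pass-mark is of the same performances.

    import random

    random.seed(1)

    N_OUTCOMES = 20     # hypothetical number of outcomes making up one level
    P_SLIP = 0.05       # chance a genuinely competent candidate fails any one outcome
    PASS_MARK = 0.8     # compensatory rule: pass if 80% of outcomes are demonstrated
    TRIALS = 10_000

    mastery_passes = 0
    aggregate_passes = 0
    for _ in range(TRIALS):
        # One fully competent candidate attempts every outcome once.
        demonstrated = [random.random() > P_SLIP for _ in range(N_OUTCOMES)]
        if all(demonstrated):                               # 100% mastery rule
            mastery_passes += 1
        if sum(demonstrated) / N_OUTCOMES >= PASS_MARK:     # aggregated pass-mark
            aggregate_passes += 1

    print(f"Pass rate under the 100% mastery rule: {mastery_passes / TRIALS:.1%}")
    print(f"Pass rate under the 80% pass-mark:     {aggregate_passes / TRIALS:.1%}")
    # With these (invented) figures the mastery rule passes only about
    # 0.95 ** 20, i.e. roughly 36%, of genuinely competent candidates;
    # the aggregated pass-mark passes nearly all of them.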

Criticality

Literacy and other competence standards are avowedly comprehensive - they cover all aspects of an occupational or general competence (or, in an academic context, of the learning objectives associated with a curriculum). They offer no hierarchy of importance: all component parts have equal status. Nor is this peculiar to standards: definitions of traits, National Curricula, and exam syllabi generally share this characteristic. In practice, of course, some elements are more or less critical than others. If assessment involves complete satisfaction of all requirements, the relative importance of different elements never has to be addressed formally. However, the minute you only assess some of a domain, or allow candidates to compensate for weakness in one area with strengths in another (for example, by adding up marks), criticality becomes a major issue. Are there certain things that must be covered? What can compensate for what? We were very quickly faced with these issues, consulted with the funding agency, and made decisions. But what in effect we had done was substitute a sub-set of the standards for the full set - and so alter the underlying definition.

Context and Shifting Item Difficulties

Sub-sets of respondents may be quite systematically different from each other in the sorts of contexts and items which they find relatively familiar or difficult. One possibility (encouraged by an item response approach) would be simply to omit all parts of the subject matter (literacy) with such characteristics. As already noted, we rejected this option, which would have severely distorted our findings about a population explicitly made up of distinct sub-groups with different experiences. How, though, is one to accommodate such a situation when supposedly operating with national levels of competence?

A concrete example may help here. One common use of literacy skills in the British population is to understand a mail-order catalogue and fill in mail order forms. The importance of this skill makes it a strong contender for inclusion in a test of literacy; if someone cannot do this at all, it is hard to justify calling them moderately, let alone highly, literate. But it is also a context which is differentially familiar, important, and therefore easy to men and to women; to different ethnic groups (depending on female access to means of payment and closeness to a range of shops); and to different age groups (older women consistently find questions in this context much easier than do younger ones). So should it be excluded from a national survey after all?

To take another example, most native English speakers find it very easy to deal with the apparently detailed and complex TV listings in a newspaper because they spend a lot of time watching English-language TV and have since childhood. Items based on TV listings appear difficult when defined in terms of "locating information" or "cycling through a text several times"; such analyses are at odds with the actual, consistent easiness of the questions in practice. But Bengali-speaking adults may watch very little English-language TV, instead preferring Bengali-language videos. For them, such a question is hard - compared to native speakers and to other groups who have little access to own-language programming. Conversely, Bengali-speakers may be much more familiar than most native English speakers with forms involving customs declarations for imported goods or gifts sent overseas.

Facility rates for a particular type of item will thus differ enormously depending on which context one selects. Are two questions designed to test information seeking (see Figure 2) but using different contexts equivalent in any meaningful sense? Can we say that one context should be emphasised rather than another because it comes closer to what information seeking means in the context of UK life? The answer is not obvious, but the test-setter must decide.

Similar dilemmas occurred when we came to items designed to assess speaking and listening. We looked for suitable oral items among tests of English as a Foreign Language - a large and growing market. We found that many of the existing instruments were based on rather middle-class topics and notions of conversation, reflecting their current candidates. Interestingly, the examining body which was able to help us, and provide useable items, had developed these before they understood their own market. Original pilot questions, designed with a mixed population of adults in mind, had simply not worked for the largely European teenagers who took the tests but were highly suitable for our own target population. This underscores the ubiquity of contextual issues.

Once again, then, we were forced into making day-by-day practical decisions, each of which had a major effect on the nature of the final test. We had to try and devise tasks which were broadly true to the overall British context, could be seen as testing ability to operate in UK society as a whole, and which also reflected the probable past experiences and likely future experiences of the population in question - generally urban, relatively low-income, in frequent contact with public agencies, but socially often quite separate. The results were justifiable and defensible: equally, a different set of decisions, also highly justifiable, would have produced a different test, for which the results were in turn quite different. How can one judge that one, rather than another, gives the "real" picture of literacy levels?
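
One way of keeping such context effects visible during piloting is simply to tabulate facility rates (the proportion answering an item correctly) separately for each language group and flag items where the gap between groups is extreme. The sketch below shows the idea only: the group labels, the response records and the 30-percentage-point threshold are invented for the example and are not the figures or the procedure used in the study.

    from collections import defaultdict

    # One record per respondent per item: (language group, item, 1 if correct else 0).
    # These few records are invented, purely to show the shape of the calculation.
    responses = [
        ("Bengali", "TV listings", 0), ("Bengali", "customs form", 1),
        ("Punjabi", "TV listings", 1), ("Punjabi", "customs form", 0),
        ("Gujarati", "TV listings", 1), ("Gujarati", "customs form", 1),
    ]

    attempts = defaultdict(int)
    corrects = defaultdict(int)
    for group, item, correct in responses:
        attempts[(group, item)] += 1
        corrects[(group, item)] += correct

    # Facility rate for each (group, item): share of that group answering correctly.
    facility = {key: corrects[key] / attempts[key] for key in attempts}

    for item in sorted({i for _, i in facility}):
        rates = {g: r for (g, i), r in facility.items() if i == item}
        spread = max(rates.values()) - min(rates.values())
        flag = " <- review: extreme between-group difference" if spread > 0.30 else ""
        print(item, {g: f"{r:.0%}" for g, r in rates.items()}, flag)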

Selecting, Constructing and Trialling Items

As the previous sections will have made clear, the literacy standards which were our ultimate referent were of very limited help in actually constructing test items. Conscious of a tight time-frame and very limited resources, we looked for other existing tests which could be adapted for our purposes. We were largely disappointed, although some examples from related US-based surveys were incorporated. These served as anchor items which contributed to later judgements about level cut-offs, and provided some initial face and construct validity. We also adapted tasks from teaching packs; informal teacher tests developed by experienced adult literacy teachers; US surveys; and previous focused research studies. Items were generated to have face validity in relation to the standards.

Item administration also needed to take account of the very limited English of many respondents, and the use of native language background speakers as interviewers was necessary. Given the variety of minority language speakers in England and Wales, this meant that the instrument must be administered by non-specialist interviewers, and this also affected item design. So did the constraints imposed by the overall budget and the decision that an hour's contact time with each subject was the maximum we could aim at or achieve.

This item-writing process is described in standard measurement reference texts as "finding or inventing a set of operations that will isolate the attribute of interest and display it" (Thorndike, Cunningham, Thorndike, & Hagen, 1991, p. 11), or, in plainer terms (Brown, 1983, p. 27), "as writing, editing, tryout, and revision...", repeated until a satisfactory item is developed. Once your framework and objectives are in place, developing an item bank from which to select the best items is essentially a process of trial and error: you write a large number of items and you try them out. In the process, you refer to your own tacit knowledge of subject matter, context, how target candidates behave, and what you can reasonably expect. But there are always surprises. You only find out how hard an item is, or what sort of response it evokes, by trying it. Item-writing remains as much an art as a science, and attempts to reduce it to algorithmic processes have repeatedly and demonstrably failed (Pollitt, Hutchinson, Entwistle, & DeLuca, 1985; Wolf, 1993). But the result, again, is that the final test could, with different writers, have been a very different beast.

In all, over thirty complete tasks were trialled, each involving multiple questions. Figure 4 describes a sample of these and shows the match of the tasks to the original standards, their derivation, and comments based on the trialling outcomes and modifications. Nineteen tasks were retained or modified for the final instrument.

Trialling of the tasks involved several phases. Subsets were trialled with different groups of adult students in English language classes, and, in addition, individual students were withdrawn from the classrooms to trial and discuss individual items. A final stage of trialling, with near-complete instruments, was conducted with individuals in their own houses. Time and financial constraints confined this trialling to metropolitan London. General results for success rates for different tasks were kept, providing a rough ordering of the difficulty of the items. However, time and finance (again) meant that the trials took place on a rolling basis, using broadly overlapping sets of items with tasks being modified as trialling continued. A comprehensive ordering of all items (derived from trials in which all possible pairs of items were given to sizeable numbers of respondents) was beyond our means, as in practice is the case for most instruments. What is interesting (and consoling for other test developers) is that the relative item difficulties in the final survey pretty much replicated the orderings we developed on the move.
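
That claim about the stability of the ordering is easy to check once the survey data are in: compare the pilot and survey difficulty orderings with a rank correlation. The sketch below is illustrative only; the task names echo those in Figure 4 but the facility values are invented, and Spearman's coefficient is computed by hand rather than taken from a statistics library.

    # Facility rates (proportion correct) for the same tasks in the rolling pilot
    # and in the final survey. The numbers are invented for illustration.
    pilot_facility = {"library card": 0.92, "Yellow Pages": 0.71, "job application": 0.55,
                      "walrus passage": 0.38, "sentence completion": 0.30}
    survey_facility = {"library card": 0.89, "Yellow Pages": 0.74, "job application": 0.52,
                       "walrus passage": 0.41, "sentence completion": 0.27}

    def ranks(facility):
        # Rank from easiest (1) to hardest (n); ties are ignored in this sketch.
        ordered = sorted(facility, key=facility.get, reverse=True)
        return {task: position + 1 for position, task in enumerate(ordered)}

    r_pilot, r_survey = ranks(pilot_facility), ranks(survey_facility)
    n = len(pilot_facility)
    d_squared = sum((r_pilot[t] - r_survey[t]) ** 2 for t in pilot_facility)
    rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))    # Spearman's rho, no ties
    print(f"Spearman rank correlation between pilot and survey orderings: {rho:.2f}")
    # A value near 1 indicates that the rough ordering built up during the
    # rolling trials was preserved in the main survey.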

What Made Items Good or Bad

The initial trialling in classrooms led immediately to some alterations in the instrument. As noted earlier, we had included some items from US adult literacy tests as a way of connecting our results to those of other large-scale studies. But it quickly became apparent that the multiple-choice format was not familiar to students and caused many problems. The first modification was to turn all items into tasks requiring active or substantive response. This was disappointing in one sense, as it meant that our anchoring questions had to be modified. In the process they ceased to be the same items (since we know that the format of a question significantly affects its difficulty (Beaton & Zwick, 1990; Pollitt, et al., 1985)), and this meant that clear comparisons were lost. But since the "same" questions would in fact not have been the same at all, because so much less familiar to our respondents than to the original US population, a genuine set of anchor questions was not attainable.

[Figure 4: Development and Trialling of Test Items, Selected Items and Sources. The table matches each trialled task (library card form, Yellow Pages, school timetable, flight label, job application, sentence completion, DSS advice forms, a visit-to-Stratford passage, the walrus passage, among others) to what it was designed to assess, the relevant standards, its derivation (informal UK and US tests, Basic Skills Agency tasks, US surveys, University of London ULEAC tests, UK standardised tests, authentic texts), pilot responses and comments, and the modifications made or the decision to retain or drop it.]


Overall, tasks and parts of tasks were discarded on several grounds: format problems; cultural familiarity; time taken relative to information gained; ambiguity; redundancy; and excessive difficulty. (If an item was too difficult for all or almost all of the pilot population it was unlikely to be very useful in the main survey.) Several points emerged:

- Tasks must not have a factual content or vocabulary with which only specific sub-groups were familiar. This would cause systematic response bias.
- We could not simply assume that tasks which were apparently valid in functional literacy terms were necessarily appropriate. A task using a train timetable is a staple of functional literacy assessment across the world, but a train timetable was, in fact, unfamiliar to most of the sample. It is likely that most travelled only locally or internationally, not within England. Similarly, success in completing details on a cheque (another standard item) was found to be overwhelmingly related to how much an individual was already familiar with the task - very easy for some, almost impossible for others. As noted above, we expected relative item difficulties to vary between respondent groups. However, as far as possible, we confined the final item set to questions where there were no extreme differences.
- The use of authentic texts as stimulus materials, particularly for the extraction of information from more complex texts, created problems. Many notices or official documents providing information on, for example, health care are not just complex, but also badly written, ambiguous and difficult for even highly literate first language speakers to follow. Without careful trialling, use of authentic texts may produce unduly low estimates of people's literacy levels.
- Assumptions made by others on behalf of minority groups can be patronising as well as wrong. Figure 5 illustrates this point. The abstract text passage and questions on the walrus were enjoyed by all adults with sufficient literacy skills to approach the task. Using a productive format rather than the original multiple-choice format meant that this task required high-level interpretation of complex text. This item was not only popular with those who reached that level, it was found to be a good and reliable indicator which had a stable relationship to performance on other items. However, if we had listened to the advice of native English experts we would have omitted this item as irrelevant and culturally inappropriate. It is dangerous to make assumptions about what is suitable or unsuitable for a group without first trying it out.
- Question format and stimulus format were very important determinants of difficulty. Items 15 and 19 of the final assessment instrument both addressed the same functional literacy goals and their text stimulus was of apparently similar difficulty, but they turned out, during piloting, to be at very different standards. This shows once more that a priori assumptions about the difficulty of an item are not always borne out and cannot be arrived at by using formal analysis of content and syntax: a finding made by previous studies (e.g. Pollitt, et al., 1985), but not reflected in the standards literature. Levels cannot really be established in this context until outcomes are known.


Read the passage and answer the questions on the following page.

THE WALRUS

The walrus is easy to recognise because it has two large teeth sticking out of its mouth. These teeth are called eye teeth. The walrus lives in cold seas. If the water freezes over, the walrus keeps a hole free of ice either by swimming round and round in the water, or by hacking off the edge of the ice with its eye teeth. The walrus can also use its skull to knock a hole in the ice.

The walrus depends on its eye teeth for many things. For example, when looking for food a walrus dives to the bottom of the sea and uses its eye teeth to scrape off clams. The walrus also uses its eye teeth to pull itself on the ice. It needs its eye teeth to attack or kill a seal and eat it, or to defend itself if attacked by a polar bear.

The walrus may grow very big and very old. A full-grown male is almost 13 feet long and weighs more than 2200 pounds. It may reach an age of 30 years. The walrus sleeps on the ice or on a piece of rock sticking out of the water, but it is also able to sleep in the water.

1. Where does the walrus live?
2. How long can a walrus live?
3. What does a walrus eat?
4. What may attack a walrus, according to the passage above?
5. What does a walrus do when it wants to get up on the ice?

Figure 5: The Walrus


Before finalising the test, we looked at the overall performance of every individual in the sample on all those tasks which they had completed and which were candidates for retention. Their scores were then compared, on a class-by-class basis, with rank orderings of English language ability provided by their teachers. Results were very consistent overall. If groups were split into two, consisting of lower and higher skills, only one student was classified differently by the teacher from the performance indicated by the test. This student was seen by the teacher to be more able than performance on the assessment tasks indicated. This was perhaps related to the fact that she was the only European student in the group, or, more specifically, spoke more readily in class than other students from Middle-Eastern or Asian backgrounds.
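
The check described above amounts to a very small piece of arithmetic: divide each class into lower and higher skills on the teacher's judgement and on the test, and count the students who land in different halves. A minimal sketch follows; the student labels, scores and cut-off are invented for illustration and are not the pilot data.

    # One pilot class: the teacher's lower/higher grouping and each student's
    # total test score. All values, including the cut-off, are invented.
    teacher_group = {"A": "higher", "B": "higher", "C": "higher",
                     "D": "lower", "E": "lower", "F": "lower"}
    test_score = {"A": 46, "B": 44, "C": 23, "D": 28, "E": 22, "F": 18}
    CUT_OFF = 30    # hypothetical score dividing lower from higher skills

    test_group = {s: ("higher" if score >= CUT_OFF else "lower")
                  for s, score in test_score.items()}

    disagreements = [s for s in teacher_group if teacher_group[s] != test_group[s]]
    print("Classified differently by teacher and test:", disagreements or "none")
    # In this invented class only student C disagrees: rated 'higher' by the
    # teacher but falling below the cut-off on the test.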

The overall process of trialling is similar to that followed (of necessity) in many policy contexts. However, it is worth highlighting how many of the textbook recommendations and procedures go by the board in the process. For example, it is not clear what the textbook recommendation to trial with a "random" or "stratified" sample would even mean in this context. The actual survey tested a population which was highly disparate in terms of both background characteristics and linguistic ability, but we did not know anything in advance about what the relationship between the background variables and linguistic levels would be. Indeed, our (and our sponsors') ignorance was the major justification for the whole exercise.

Conclusions

What emerges is the essentially empirical nature of any test or level-setting exercise - the need to make frequent judgements based on instinct and experience (Wolf, 1998), and the impossibility of knowing how things would work without simply trying them out. Our actual process of test construction was thus a highly empirical and inductive affair based on trial and error, not deduction from first principles. Thus we made decisions which were essentially judgemental when we:

- defined the scope of the instrument
- rejected the single trait model
- selected a sub-set of standards on grounds of criticality
- interpreted the standards in order to create items
- allowed for aggregation via a mark-scheme
- rejected certain conventional literacy test items as inappropriate to our population
- adjusted the test to fit time and administrative constraints
- opted for a rolling programme of selective piloting and redrafting of items.

In one sense of the word, none of these decisions was arbitrary, but in another, all of them were. None simply emerged from, or was entirely dictated by, the item writing or piloting process. Instead, as we developed the test, we effectively redefined our construct. We did so because, at the end of the day, we had to provide some sort of summary description of levels, and of numbers attaining them, and do it on the basis of a test with a limited time span. This article does not deal in any detail with the procedure by which levels of performance were defined. This process is discussed in some detail elsewhere (Carr-Hill, et al., 1996). The main point to be made here is that it too, in exactly the same way, required a whole series of decisions based, necessarily, on researcher and assessor judgement.

The field in which the assessment instrument was to be developed was, admittedly, especially difficult in some respects. There was no curriculum to assess. It is a contentious theoretical area. Functional literacy in itself is a broad concept, and the task of measuring literacy for linguistic minorities and establishing standards and levels was innovative. Development was not, however, uniquely difficult. The processes we had to use will need to be re-enacted in many different situations. Along the way, the process felt worryingly arbitrary, but in retrospect the end results were not arbitrary at all. Each decision point can be justified in terms of the project goals, the type of information that was being sought, and the underlying endeavour to do justice to the functional literacy skills of those who were to be assessed. In the end, we felt we had a fair test.

Perhaps it is this retrospective reflection that makes most authors of measurement texts describe test construction as they do - as unproblematic, scientific, informed only by the construct in hand and the decontextualised principles of measurement theory. Nonetheless, we feel that they do the constructors and users of tests a disservice by misrepresenting the messiness of reality. Our description will, we hope, provide a useful counterweight.

References

Archbald, D.A., & Newmann, F.M. (1988). Assessing authentic academic achievement in the secondary school. Reston, Virginia: National Association of Secondary School Principals.

Beaton, A.E., & Zwick, R. (1990). Disentangling the NAEP 1985-1986 Reading Anomaly. Princeton, NJ: Educational Testing Service.

Broadfoot, P. (1996). Assessment and learning: Power or partnership? In H. Goldstein & T. Lewis (Eds.), Assessment: Problems, developments and statistical issues. Chichester: Wiley.

Brown, F.G. (1983). Principles of educational and psychological testing (3rd edition). New York: CBS College Publishing.

Burke, J. (Ed.) (1995). Outcomes, learning and the curriculum: Implications for NVQs, GNVQs and other qualifications. Brighton: Falmer.

Carr-Hill, R., Wolf, A., & Passingham, S. (1996). Lost opportunities: The language skills of linguistic minorities in England and Wales. London: Basic Skills Agency.

Coates, S., Fitzpatrick, L., McKenna, A., & Makin, A. (1995). National reporting system: A mechanism for reporting outcomes of adult English language, literacy and numeracy programs. Melbourne: Language Australia.

Cresswell, M. (1987). Describing examination performance: Grade criteria in public examinations. Educational Studies, 13(3), 247-265.

Crowley-Bainton, T., & Wolf, A. (1994). Access to assessment initiative. Sheffield: Employment Department (Research Strategy Branch).

CTB/McGraw-Hill. (1987). Tests of Adult Basic Education (TABE). Monterey, CA: CTB/McGraw-Hill.

Cumming, J., Gal, I., & Ginsburg, L. (1998). Addressing mathematical knowledge of adult learners: Are we looking at what counts? Philadelphia: NCAL Technical Report Series, University of Pennsylvania.

Eraut, M., Steadman, S., Trill, J., & Porkes, J. (1996). The assessment of NVQs. Research Report No. 4. University of Sussex Institute of Education.

Fitzpatrick, L., Wignall, L., & McKenna, R. (1999). Assessment and placement resource for the literacy and numeracy programme. Melbourne: Commet/DETYA.

Gifford, B.R., & O'Connor, C. (1992). Changing assessments: Alternative views of aptitude, achievement and instruction. Boston & Dordrecht: Kluwer.

Gipps, C. (1994). Beyond testing. Lewes: Falmer.

Goldstein, H., & Lewis, T. (Eds.). (1996). Assessment: Problems, developments and statistical issues. Chichester: Wiley.

Goldstein, H., & Wood, R. (1989). Five decades of item response modelling. British Journal of Mathematical and Statistical Psychology, 42, 139-167.

Jessup, G. (1991). Outcomes: NVQs and the emerging model of education and training. London: Falmer.

Jones, S. (1992). Literacy in a second language: Results from a Canadian survey of everyday literacy. In B. Burnaby & A. Cumming (Eds.), Socio-political aspects of ESL (pp. 203-220). Toronto: OISE Press.

Kirsch, I.S., Jungeblut, A., Jenkins, L., & Kolstad, A. (1993). Adult literacy in America: A first look at the results of the National Adult Literacy Survey. Washington, DC: National Center for Education Statistics.

Lytle, S.L., & Schultz, K. (1990). Assessing literacy learning with adults: An ideological approach. In R. Beach & S. Hynde, Developing discourse practices in adolescence and adulthood. Norwood, NJ: Ablex.

OECD & Human Resources Development Canada (1997). Literacy skills for the knowledge society: Further results from the International Adult Literacy Survey. Paris: OECD.

OECD & Statistics Canada (1995). Literacy, economy and society. Paris: OECD.

Pollitt, A., Hutchinson, C., Entwistle, N., & DeLuca, C. (1985). What makes exam questions difficult? Edinburgh: Scottish Academic Press.

Statistics Canada. (1991). Adult literacy in Canada: Results of a national survey. Ottawa: Statistics Canada.

Thorndike, R.M., Cunningham, G.K., Thorndike, R.L., & Hagen, E.P. (1991). Measurement and evaluation in psychology and education (5th edition). New York: Macmillan.

Venezky, R.L., Bristow, P.S., & Sabatini, J.P. (1994). Measuring change in adult literacy programs: Enduring issues and a few answers. Educational Assessment, 2(2), 101-131.

Wiggins, G.P. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713.

Wiggins, G.P. (1993). Assessing student performance. San Francisco, USA: Jossey-Bass.

Wolf, A. (1993). Assessment issues and problems in a criterion-based system. Occasional Paper No. 2. London: Further Education Unit.

Wolf, A. (1994). Basic skills research: Bibliography of research in adult literacy and basic skills 1972-1992. London: Adult Literacy and Basic Skills Unit.

Wolf, A. (1995). Competence-based assessment. Buckingham: Open University Press.

Wolf, A. (1998). Portfolio assessment as national policy: The National Council for Vocational Qualifications and its quest for a pedagogical revolution. Assessment in Education, 5(3), 413-445.

Wylie, E., & Ingram, D. (1995). The Australian Second Language Proficiency Rating Scales (ASLPR): General proficiency version for English. Brisbane, Australia: CALL, Griffith University.

The Authors

ALISON WOLF is Professor of Education at the Institute of Education, University of London and Executive Director of its International Centre for Research on Assessment. Her major research interests are vocational, technical and professional assessment and mathematics for post-compulsory non-specialists, in the UK and in comparative contexts.

JOY CUMMING is Associate Professor and Head of School in the Faculty of Education, Griffith University, Australia. Her research focuses on assessment, measurement, literacy and numeracy. She has been involved in several Australian national assessment and policy projects.