EVALUATING THE ASSESSMENT: SOURCES OF EVIDENCE FOR QUALITY ASSURANCE1
Menucha Birenbaum
School of Education, Tel Aviv University, Israel
This article is dedicated to the memory of Professor Arieh Lewy – a great scientific thinker, a devoted mentor, and a dear colleague whose expertise in curriculum and program evaluation was acknowledged worldwide. In developing the framework presented here for evaluating assessment practices, I was inspired by Arieh's creative approach to evaluation and by his writings about assessment alternatives (e.g., Lewy, 1996a,b; Lewy, Liebman, & Selikovits, 1996; Lewy & Shavit, 1974).
Abstract

High quality assessment practice is expected to yield valid and useful score-based interpretations about what the examinees know and are able to do with respect to a defined target domain. Given this assertion, the article presents a framework based on the "unified view of validity," advanced by Cronbach and Messick over two decades ago, to assist in generating an evidence-based argument regarding the quality of a given assessment practice. The framework encompasses ten sources of evidence pertaining to six aspects: content, structure, sampling, contextual influences, score production, and utility. Each source is addressed with respect to the kinds of evidence that can be accumulated to help support the quality argument and refute rival hypotheses regarding systematic and unsystematic errors that can bias the score-based interpretations. Methods and tools for obtaining the evidence are described, and a sample of guiding questions for planning an assessment evaluation is presented in the concluding section.
Introduction

Educational assessment faces many challenges in the knowledge society. New principles and modes of assessment that meet those challenges were established over a decade ago (Birenbaum, 1996; Perkins & Blythe, 1994; Resnick & Resnick, 1992; Stiggins, 1994; Wolf, Bixby, Glenn, & Gardner, 1991). However, experience in implementing them has taught us that the road from theory to practice is a rocky one. Consequently, an increasing number of concerns are currently being raised regarding the quality of assessment practices at various levels of the education system (Brookhart, 2001; Mertler, 2005; Segers, Dochy, & Cascallar, 2003).

In the current article I propose a framework for evidence-based evaluation of the quality of a given assessment practice. The framework refers to the full assessment cycle, comprising five phases: planning the assessment; designing the assessment device; administering the tool to collect evidence regarding performance; interpreting the performance (scoring and reporting), and utilizing the results for decision-making. The framework applies to a wide range of assessment purposes and contexts as well as to a variety of assessment methods.2

The Proposed Framework

Rationale and Theoretical Underpinning

Contemporary approaches to assessment view it as "evidentiary reasoning" (Mislevy, Wilson, Ercikan, & Chudowsky, 2002). In assessment we draw inferences and make interpretations about what the test-taker knows and is able to do in a defined target domain from his/her observed performance on tasks designed to represent that domain. It can therefore be asserted that the quality of a given assessment practice can be judged by the appropriateness, meaningfulness, and usefulness of these inferences/interpretations. This assertion guided the development of the proposed framework, which purports to assist in forming an evidence-based argument regarding the quality of a given assessment practice.

The framework is rooted in the "unified view of validity" advanced by Cronbach and Messick over two decades ago, and in its elaboration by other measurement experts since then. In 1988 Cronbach wrote: "Validation of a test or a test use is evaluation…[and therefore] the logic of evaluation argument applies, and I invite you to think of 'validity argument' rather than 'validation research'" (p. 4). He further clarified that "[v]alidation speaks to a diverse and potentially critical audience; therefore, the argument must link concepts, evidence, social, and personal consequences, and values" (p. 4 [emphasis in original]). Regarding the process of developing the argument, Cronbach noted that "[t]he first talent needed in developing a persuasive argument is that of devil's advocate". Hence, contemporary approaches to assessment view validation as a process of constructing and evaluating arguments for and against proposed score-based interpretations and uses. Any such argument involves a number of propositions, each of which requires support.

Kane, who elaborated the argument-based approach to validation, introduced the interpretive argument (Kane, 1992; Kane, Crooks, & Cohen, 1999) and later (Kane, 2004) distinguished it from a validity argument. The interpretive argument, according to Kane (2004), "spells out the proposed interpretations and uses in some detail and then evaluates the coherence of this argument and the plausibility of its inferences and assumptions" (p. 167), whereas the validity argument "evaluates the interpretive argument [and] will generally involve extended analysis and may require a number of empirical studies" (p. 167).
Mislevy and his associates (Mislevy et al., 2002) extended the scope of the assessment argument and adopted Toulmin's (1958) argument structure to introduce what they term the evidentiary argument. Recently, Bachman (2005) also adopted Toulmin's argument structure to introduce the utilization argument as complementary to the validity argument; together they comprise the two-fold argument that he terms the assessment use argument, which links observation through interpretation to decisions.

Overview of the Framework

The proposed framework encompasses ten sources of evidence for quality assurance that can be classified into six categories: content, structure, sampling, contextual influences, score production, and utility. Table 1 lists the sources by category. Messick (1989) introduced six of the sources (marked by asterisks in the table) as aspects of construct validity in his seminal chapter on test validity in the 2nd edition of Educational Measurement. Each of the ten sources refers to a certain aspect of assessment and can provide evidence either to support a proposition regarding the quality of the assessment with regard to that aspect or to discount counter claims that are specific to the purpose and context of the given assessment practice. Building a coherent and persuasive argument regarding the quality of an assessment practice entails specifying the particular network of reasoning and ascertaining its links by integrating and articulating the relevant evidence.
Table 1: The Ten Sources of Evidence for Quality Assurance Classified into Six Categories

Category                  Source of Evidence
Content                   1) Content fidelity*
                          2) Response processes*
Structure                 3) Internal structure*
                          4) External structure* (relations with other variables)
Sampling                  5) Generalizability*
Contextual influences     6) Equality of opportunities
                          7) Assessment perceptions and dispositions
Score production          8) Scoring and scaling
Utility                   9) Reporting and feedback
                          10) Consequences of the assessment*

* Sources advanced by Messick (1989, 1994) as aspects of construct validity.
In the next section I will describe each of the ten sources of evidence. First I will state one or more propositions representing desirable qualities of assessment practice with regard to that source and describe methods for collecting evidence to support the proposition(s). Next, I will state plausible rival hypotheses (counter claims) and mention methods for collecting evidence to refute them.

The rival hypotheses pertain to components of score variance that are sources of either systematic bias or random error. Several such sources appear in Figure 1, which displays the components of score variance. As seen in the figure, score variance can be partitioned into two main components: construct-relevant variance and construct-irrelevant variance. Another partitioning is between systematic variance and unsystematic (error) variance. The two-way partitioning of variance results in three components of score variance: a) systematic variance that is relevant to the construct being assessed, i.e., variance that reflects true individual differences in the construct being measured; b) unsystematic variance that results from inconsistencies among raters, tasks, and occasions of testing, and c) irrelevant systematic variance that is due to biasing factors such as gender, ethnicity/culture, etc. According to Messick (1989), this third component constitutes one of the two main threats to validity. In order to guard against this threat and thereby strengthen the argument regarding the quality of a given assessment practice, relevant plausible rival hypotheses pertaining to effects of variables subsumed under this component need to be formulated, and evidence to refute them needs to be collected (Kane, 1992).
Figure 1: Components of Score Variance
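In notation (the symbols are mine, added here for clarity; they are not used in the article), the partitioning depicted in Figure 1 amounts to

\sigma^{2}_{\text{observed}} \;=\; \sigma^{2}_{\text{construct}} \;+\; \sigma^{2}_{\text{bias}} \;+\; \sigma^{2}_{\text{error}},

where the first term reflects true individual differences on the construct, the second reflects systematic but construct-irrelevant influences (e.g., gender or cultural factors), and the third reflects unsystematic inconsistencies across raters, tasks, and occasions.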
The Ten Sources of Evidence

In this section I describe each source of evidence with regard to the kinds of evidence that can be accumulated to help support the relevant proposition(s) and discount plausible rival hypotheses pertaining to sources of bias and error.

Content Fidelity

An argument about content fidelity includes several propositions, each of which requires support. The first proposition refers to the relevance of the target domain to the purpose(s) of the assessment. Supporting this proposition entails ascertaining the fit between the definition of the target domain and the intended test interpretations. The second proposition refers to the broadness and authenticity of the test domain. Supporting it entails determining the fit between the target domain and the test domain. The third proposition refers to the representativeness of the actual set of tasks included in the test vis-à-vis the universe of tasks in the test domain. The fourth proposition concerns performance of cognitively complex tasks and refers to the alignment among the task objectives (derived from the standards), the task content, and the rubric for scoring the performances.

It should be noted that the adequacy and usefulness of the score-based inferences depend on the rigor with which the target domain and the test domain have been defined. As stated in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), "[b]oth tested and target domains should be described in sufficient detail so that their relationships can be evaluated. The analyses should make explicit those aspects of the target domain that the test represents as well as those aspects that it fails to represent" (p. 145). The test specifications should clearly specify the test domain and define each task by the skill it measures. The test specifications should also include the instructions to the test-taker, the test administrator, and the scorers, because, as Cronbach (1988) pointed out, "any change in them could alter what is measured" (p. 8).

Threats to content fidelity. Messick (1989, 1994) pointed out two such threats – construct under-representation and construct-irrelevant variance. The former, which is the more salient one, refers to the extent to which a test fails to capture important aspects of what it is intended to measure, so that the meaning of the test scores is narrower than the proposed interpretation implies. The latter, as mentioned above, refers to variables that are irrelevant to what the test intends to measure and that cause test scores to be either unduly high or unduly low.
Evidence about content fidelity. Evidence to support the above-mentioned propositions regarding content fidelity is commonly based on judgments made by panels of domain experts who review the test tasks, the rubrics, the test specifications, and the description of the target domain. To discount plausible rival hypotheses regarding effects of irrelevant factors, such as language and culture, on the scores, the experts are also asked to evaluate the appropriateness of the task content, format, and wording for the intended test-takers, as well as the clarity of the instructions. Such evaluations contribute to the fairness of the assessment and help protect against scores that are either unduly low or unduly high (Messick, 1989).

Obstacles and limitations in evaluations of content fidelity.

1. Limited item specifications. Nichols and Sugrue (1999) have pointed out the limitations of the commonly used item-specification tables, which pair each test item with a single discrete skill, for effectively specifying cognitively complex tasks. They quote research indicating that such specifications result in a large portion of items that experts were not able to classify. Instead, they advocate linking the task characteristics to the cognitive processes and content knowledge described in the construct definition. Examples of such linkage can be found in applications of the Rule Space Methodology developed by Tatsuoka (1983, 1995), which makes use of a two-way matrix (referred to as a Q matrix) to specify test items by the pattern of cognitive and content attributes required for solving each item.

2. Confirmatory bias. Judgment-based evidence has often been criticized for having a strong confirmatory bias (Haertel, 1999; Kane, 2001). Such bias is, for instance, evident when experts are required to classify items according to the specifications provided by the test developer. A procedure suggested by Sireci (1998), which requires the experts to rate the similarities among all pairs of test items and subjects the ratings to a multidimensional scaling analysis that yields a visual display of the experts' perceptions of item similarity, seems a viable solution for reducing the confirmatory bias (a minimal sketch of such an analysis follows below). Other solutions include indices such as the ones developed by Porter (2002) for measuring the alignment between assessment instruments and content standards, and between the content of instruction and the content of achievement tests.
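To illustrate the kind of analysis Sireci (1998) proposes, the following is a minimal sketch in Python with invented ratings and hypothetical variable names (it is not code from Sireci's study): experts rate the similarity of every pair of items, the averaged ratings are converted to dissimilarities, and multidimensional scaling yields a map of the items as the experts perceive them.

import numpy as np
from sklearn.manifold import MDS

# Invented data: 3 experts rate the 6 pairs formed by 4 items,
# from 1 (very dissimilar) to 5 (very similar).
n_items = 4
pairs = [(i, j) for i in range(n_items) for j in range(i + 1, n_items)]
similarity_ratings = np.array([
    [5, 2, 1, 2, 1, 4],
    [4, 2, 2, 1, 1, 5],
    [5, 1, 1, 2, 2, 4],
])

# Average over experts and convert similarities to dissimilarities.
mean_sim = similarity_ratings.mean(axis=0)
dissim = np.zeros((n_items, n_items))
for (i, j), s in zip(pairs, mean_sim):
    dissim[i, j] = dissim[j, i] = 5 - s

# Two-dimensional map of the experts' perceived item structure.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
for item, (x, y) in enumerate(coords, start=1):
    print(f"item {item}: ({x:.2f}, {y:.2f})")

Item clusters that contradict the intended specifications would then be flagged for expert discussion rather than forced into the developer's categories.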
Response Processes

The proposition regarding response processes is that the tasks elicit the intended cognitive processes. As pointed out by the Committee on the Foundations of Assessment, response process validation is often lacking in assessment development (NRC, 2001). Evidence to support the proposition can be collected using various techniques. The most common one is verbal reports, which can be either concurrent (having test-takers think aloud as they work on the task) or retrospective (interviewing test-takers after they have completed the task). Taylor and Dionne (2000) suggested collecting concurrent reports to identify the knowledge and skills students use to solve a task, and retrospective reports to elaborate or clarify the content and processes mentioned in the concurrent reports. Leighton (2004) addressed common concerns about the trustworthiness and accuracy of verbal reports and pointed to conditions that support successful use of such reports. She maintained that verbal reports should ideally be collected on tasks of moderate difficulty for the target population. Such tasks, she clarified, require the student to engage in controlled cognitive processing, of which he/she can be aware, whereas easy tasks elicit automatic cognitive processing and leave the student unaware of how the problem was solved. Difficult tasks, on the other hand, overload working memory, leaving no resources to articulate the process. Other techniques for collecting evidence regarding response processes include error analysis, measuring response latencies, recording eye movements, and correlating the task scores with other relevant variables (NRC, 2001; AERA, APA, & NCME, 1999).

Internal Structure

An argument about internal-structure fidelity includes two propositions, each of which requires support. The first proposition refers to the fit between the structure of the target domain (construct) and that of the response data. Supporting this proposition entails ascertaining that the statistical pattern of relationships among item scores conforms to the construct on which the proposed test score interpretations are based. The second proposition refers to congruence among the measurement model, the structure of the target domain, and the internal structure of the data. Evidence to confirm the underlying structure of the response data can be obtained by means of statistical analyses such as confirmatory factor analysis, cluster analysis, and multidimensional scaling analysis. Such analyses can also be used for justifying the assignment of multiple scores (a score profile) rather than a single total score, and the choice of the measurement model for scaling.

To ensure just assessment, evidence should be obtained to refute rival hypotheses regarding test bias for identifiable groups that could result from either content irrelevancies, such as language, context, and format, or construct under-representation. Relevant evidence includes demonstration of factorial invariance and congruency in item functioning across demographic groups of interest. Techniques for identifying differential item functioning (DIF) (Holland & Wainer, 1993) are employed to detect items on which the scores of subgroups of test-takers, matched on the attribute being measured, differ (a minimal sketch of one such technique follows below).

Another factor that can affect the internal structure of an assessment is the provision of testing accommodations. As stated in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), "the purpose of accommodations or modifications is to minimize the impact of test-taker attributes that are not relevant to the construct that is the primary focus of the assessment" (p. 10). In order to ensure fairness in assessment and refute rival hypotheses that accommodations impede construct validity, congruence with regard to the internal structure of the data should be checked between two groups – a group of test-takers with disabilities who were provided with testing accommodations and a group of regular test-takers who took the test with no accommodations.
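To make the DIF check mentioned above concrete, here is a minimal sketch in Python with simulated responses and hypothetical names (one of several possible techniques; the article does not prescribe a specific one). Test-takers are matched on total score, a 2 x 2 table of group by item correctness is formed at each score level, and a Mantel-Haenszel odds ratio is pooled across levels.

import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Pooled Mantel-Haenszel odds ratio for one dichotomous item.

    item  : 0/1 scores on the studied item
    total : matching variable (e.g., total test score)
    group : 0 = reference group, 1 = focal group
    """
    num, den = 0.0, 0.0
    for k in np.unique(total):
        at_k = total == k
        a = np.sum((group == 0) & (item == 1) & at_k)  # reference, correct
        b = np.sum((group == 0) & (item == 0) & at_k)  # reference, incorrect
        c = np.sum((group == 1) & (item == 1) & at_k)  # focal, correct
        d = np.sum((group == 1) & (item == 0) & at_k)  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else np.nan

# Invented data: 200 simulated test-takers, one studied item,
# total score as the matching criterion (no DIF built into this simulation).
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)
total = rng.integers(0, 11, size=200)
item = rng.binomial(1, 0.3 + 0.05 * total)

alpha_mh = mantel_haenszel_dif(item, total, group)
delta_mh = -2.35 * np.log(alpha_mh)  # ETS delta scale
print(f"MH odds ratio = {alpha_mh:.2f}, MH delta = {delta_mh:.2f}")

Pooled odds ratios near 1 (delta near 0) suggest that, after matching on the measured attribute, the two groups do not differ on the item.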
Relationships with Other Variables (External Structure)

The proposition regarding the relationships of the assessment scores with other variables is that these relationships are consistent with the construct underlying the proposed interpretations. To support this proposition, convergent and discriminant evidence can be obtained using a multitrait-multimethod (MTMM) design (Campbell & Fiske, 1959) to compare the assessment scores with scores on other measures of the same construct. Higher correlations are expected among similar attributes (convergent evidence) and lower correlations between different attributes of the construct (discriminant evidence). In contexts where evidence regarding the relationship between the test scores and a criterion (either predictive or concurrent) is of primary interest, such as employment testing, high correlations between scores on the test and performance on the criterion measure are the kind of evidence sought. In classroom assessment, variables such as teacher grades, teacher ranks, and scores on other assessments in the same subject can serve as criteria. To refute rival hypotheses concerning effects of biasing factors (construct irrelevancies) such as test anxiety, test wiseness, etc., evidence indicating negligible correlations between the assessment scores and scores on measures of such factors is sought.

Generalizability

Generalizability is an issue of sampling, or sufficiency of information. The tasks of a given test are considered a sample from some universe of tasks that constitutes the test domain. Yet valuable inferences are about the domain, not the sample. Therefore, task scores are useful only if one can generalize the interpretation beyond the tasks that appear in the test to the universe of tasks in the domain. The assumption supporting such an inference is that the results are invariant with respect to changes in the conditions of observation (Kane, 1992). Thus, the proposition regarding generalizability is that the interpretation based on the scores of a given task can be generalized to other tasks of the same type. Evidence to support the proposition can be obtained from reliability studies (Feldt & Brennan, 1989) or generalizability studies (Brennan, 1983; Cronbach, Gleser, Nanda, & Rajaratnam, 1972), which indicate how consistent scores are across samples of scorers, tasks, and occasions.
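As an illustration of the kind of evidence such studies yield, the following Python sketch (invented scores and hypothetical names; a deliberately simplified persons-by-tasks design, not a full generalizability analysis) estimates variance components from an 8-by-4 score matrix and combines them into a generalizability coefficient.

import numpy as np

# Invented scores: 8 persons (rows) by 4 tasks (columns), fully crossed design.
rng = np.random.default_rng(1)
person_effect = rng.normal(0, 1.0, size=(8, 1))   # construct-relevant differences
task_effect = rng.normal(0, 0.5, size=(1, 4))     # task difficulty differences
scores = 5 + person_effect + task_effect + rng.normal(0, 0.7, size=(8, 4))

n_p, n_t = scores.shape
grand = scores.mean()
ss_p = n_t * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_t = n_p * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_t
ms_p = ss_p / (n_p - 1)
ms_res = ss_res / ((n_p - 1) * (n_t - 1))

# Expected-mean-squares estimates of the variance components.
var_res = ms_res                          # person-by-task interaction plus error
var_p = max((ms_p - ms_res) / n_t, 0.0)   # universe-score (person) variance

# Relative generalizability coefficient for a test built from n_t tasks.
g_coef = var_p / (var_p + var_res / n_t)
print(f"person variance = {var_p:.2f}, residual = {var_res:.2f}, G = {g_coef:.2f}")

A low coefficient would indicate that too few tasks (or raters, in a design that also crosses raters) are being sampled to support generalization to the test domain.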
Problems with regard to generalizability.

1. Task specificity. Generalizability studies of performance tasks have revealed that task-to-task variation in the quality of test-taker performance is substantially larger than rater-to-rater inconsistencies in scoring (Dunbar, Koretz, & Hoover, 1991; Shavelson, Gao, & Baxter, 1995). This problem of task specificity makes generalization to the universe score (test domain) more difficult as the tasks become less structured (Kane, 1992). As pointed out in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), because a performance assessment typically consists of a small number of tasks, establishing the extent to which results can be generalized to the test domain is particularly important.

2. Construct-relevant inconsistencies. Nichols and Sugrue (1999) raised a concern regarding the applicability of psychometric measures of reliability to performance tasks, especially the cognitively complex ones. They contend that conventional models of reliability rely on a trait-based theory of learning and performance, according to which inconsistency is regarded as error. These models, they argue, also treat as error the construct-relevant inconsistencies that tend to occur in the performance of cognitively complex tasks as a result of the interaction between the test-taker and the task.

3. Scorer reliability. Another concern pertaining to cognitively complex tasks and other new modes of assessment refers to scorer reliability (i.e., the consistency in rating performances according to scoring criteria specified in rubrics; one common agreement check is sketched at the end of this list). As noted by Kane and his associates, "[t]he more complex and open-ended the task, the more difficult it becomes to anticipate the range of possible responses and develop fair explicit scoring criteria that can be applied to all responses" (Kane et al., 1999, p. 9). To help control any resulting lack of consistency among scorers, Kane and his associates suggested having each task evaluated by a different set of scorers. Other solutions include providing more detailed rubrics, better trained scorers, and a smaller number of points on the score scale. Yet harsh theoretical criticisms are still being voiced against the measurement of scorer reliability. Pamela Moss, a leading advocate of the hermeneutic approach to assessment, argued that the nature of the new modes of assessment, especially portfolios, does not lend itself to psychometric measures of scorer reliability (Moss, 1994; Moss & Schutz, 2001). She claimed that the psychometric approach limits human judgment and that "dissensus [should be] nurtured as a counter weight against too-efficient pursuit of consensus" (Moss & Schutz, 2001, p. 66). She further stressed that the most important question is not how agreement can be reached but rather how the interpretations vary and why. Elsewhere, Moss (1994) suggested having experts prepare independent evaluations and convene as a committee to discuss their reviews. However, employing assessment committees, appealing as it may seem, is quite impractical for large-scale assessment. A more practical solution was suggested by Delandshere and Petrosky (1994): here the scorer is required to prepare a review according to leading questions referring to relevant dimensions of the subject matter and to accepted standards for performance. A second scorer is then asked to indicate the extent to which, in his/her opinion, the first review is consistent with the existing evidence.

4. Contextual influences. Since performance is a product of a three-way interaction of task, test-taker, and context, it cannot be evaluated without taking contextual influences into consideration (Bachman, 2005; Black & Wiliam, 1998). The next two sources of evidence are concerned with the context of the assessment.
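Before turning to those contextual sources, a brief note on item 3 above: when scorer consistency is examined psychometrically, agreement between raters is often summarized with an index such as weighted Cohen's kappa. The sketch below (Python, invented ratings; one of many possible indices, not one prescribed by the article) shows the kind of check involved.

from sklearn.metrics import cohen_kappa_score

# Invented rubric scores (0-4 scale) assigned by two raters to the same ten performances.
rater_a = [3, 2, 4, 1, 0, 3, 2, 4, 3, 1]
rater_b = [3, 3, 4, 1, 1, 2, 2, 4, 3, 0]

# Quadratic weights penalize large disagreements more heavily than adjacent-category ones.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")

From Moss's hermeneutic perspective, of course, such an index is precisely what is being contested.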
Equality of Opportunities

The opportunities discussed in this section refer to three prerequisites for just assessment: the opportunity to learn the material covered by the assessment; the opportunity to prepare for the assessment, and the opportunity to be tested under proper testing conditions. Failure to afford any of these opportunities to test-takers seriously undermines the validity of the score-based inferences.

Evidence to support the proposition regarding opportunity to learn should indicate that prior to the assessment all test-takers were provided with curriculum and instruction that afforded them equal opportunity to learn the content and the skills that are tested. According to the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), "reasonable efforts should be made to document the provision of instruction on tested content and skills" (p. 146); evidence to support this proposition should be sought in such documents. Evidence to refute rival hypotheses regarding inaccessibility of instruction would include information pertaining to the nature of the discrepancies between the intended and the actual curricula, and to the available instructional resources, including teacher quality. This information is especially essential for interpreting low scores on the assessment in light of research findings, dating from the "Coleman Report" (Coleman et al., 1966), that teacher quality has a larger impact on low-achieving students than on high-achieving ones (Goldhaber & Anthony, 2005; Nye, Konstantopoulos, & Hedges, 2004).

Evidence to support the proposition regarding opportunity to prepare for the assessment should indicate that prior to the assessment all test-takers were given the same preparation materials and information regarding the test. To refute rival hypotheses regarding insufficient, or lack of, test preparation, which may result in unduly low scores, evidence should be obtained as to whether test-takers were informed in advance about the following: the content coverage (knowledge and skills); the type and format of tasks; the policy regarding the use of any material and equipment during the test; the criteria for judging the desired outcomes and the standards for mastery, and the appropriate test-taking strategies (Working Group of the Joint Committee on Testing Practices, 2004). To refute rival hypotheses that the test preparation caused inflated scores, evidence should be collected about the exact nature of the test preparation activities and materials to determine the extent to which they reflect specific test items.

Evidence to support the proposition regarding opportunity to be tested under appropriate testing conditions should indicate that all test-takers received equitable treatment during testing and that the test administration strictly followed the instructions specified by the test developer. More specifically, the evidence should indicate that: a) the directions to the test-takers were clearly stated and exactly as specified by the test developer, including directions regarding the use of calculators/dictionaries, guessing, etc.; b) the time allowed was precisely as specified; c) the physical conditions at the testing site were appropriate, and d) appropriate accommodations and modifications were offered when needed and were precisely as specified. To refute rival hypotheses regarding misconduct during test administration, detailed evidence should be obtained regarding indications of fraud, such as cheating during the test or breaches of test security. As noted by Franklyn-Stokes and Newstead (1995), who researched cheating in undergraduate studies, this phenomenon seems to occur more frequently than staff realizes. Such misconduct, if it occurred, would result in inflated scores, thus undermining the validity of the score interpretation.

Assessment Perceptions and Dispositions

The proposition regarding assessment perceptions and dispositions is that the test-takers understand the purpose of the assessment, anticipate an assessment of the type given, and are motivated to perform at their best.
Research has shown that students' anticipations of the assessment format influence the way they study for it; expectations of a conventional test evoke surface learning, whereas new modes of assessment evoke deep learning (Sambell & McDowell, 1998; Sambell, McDowell, & Brown, 1997; Struyven, Dochy, & Janssen, 2005). In situations where students' anticipations fail to materialize, they tend to claim that the assessment was unfair. However, as noted by Crocker (2003), this may be indicative of an inherent conflict in most assessments, where the instructor is interested in assessing application of knowledge to real-life problems, while the students expect an obvious fit between test and instruction. At any rate, if students are taken by surprise by an unexpected type of question, their performance could be affected, thus jeopardizing the validity of the inferences based on the test scores.

The role of assessment in developing motivation to learn is well acknowledged (Crooks, 1988; Harlen, 2006). It is also acknowledged that assessment tasks can either enhance or decrease students' motivation to perform, hence the recommendation to design interesting tasks (Stiggins, 2001). Another factor that can affect performance motivation is the assessment context. Research on the impact of context on students' performance has indicated that scores on a high-stakes test tend to be higher than on the same test when it is given as a low-stakes test, such as in a pilot setting (DeMars, 2000). These findings have implications for equating test forms and for other psychometric procedures for high-stakes tests that are based on pilot data (Lane, 2004). Similarly, when students perceive a large-scale assessment, such as the international testing programs (e.g., TIMSS, PISA), as having little or no personal consequences, they may not attempt to perform at their best, thus rendering the score-based inferences invalid.

It is therefore essential to obtain evidence regarding test-takers' dispositions, perceptions, and anticipations regarding the assessment in order to refute rival hypotheses which, if supported, would undermine the validity of the score-based interpretations. To collect such evidence, small groups of students might be debriefed on test completion, asking for their responses to questions such as: what they thought the tasks tested; how they evaluate the test (easy/difficult, interesting…), and whether they tried their best to succeed (Haertel, 1999). In general, giving students the opportunity to speak, and listening to what they think about the assessment, is an indication of respect and openness to dialogue and as such is consistent with the origin of the word assessment, which in the original Latin means (the teacher) sitting next to (the examinee).

Scoring and Scaling

The proposition regarding test scoring procedures is that they strictly followed the instructions specified by the test developer and, if cut scores are to be used, that sound decisions were made about the definition of the performance categories. To refute rival hypotheses regarding effects of inappropriate scoring procedures, evidence should be obtained that the scorers were provided with the required materials and with clear guidelines for scoring the tests and monitoring the accuracy of the scoring process. In particular, with respect to choice-response tasks, evidence should be obtained that there were no scoring errors such as those resulting from miscoding the test format on the answer sheet.
With respect to constructed-response tasks, evidence should be obtained indicating that the rubrics contained explicit benchmarks (performance criteria); that the scorers were adequately trained; that they interpreted the benchmarks as intended by the test developer, and that they rated the performances consistently.

The proposition regarding scaling is that the psychometric model used to scale items and test-takers is useful in facilitating meaningful interpretation. Commonly used scaling procedures have been criticized for being inappropriate for scaling cognitively complex tasks. Nichols and Sugrue (1999) argued that the unidimensional IRT models commonly used for scaling large-scale assessments assume a simple cognitive construct (a trait theory of learning and performance) and are therefore inappropriate for scaling cognitively complex constructs that consist of multiple processes that often cut across domains. In such tasks, they argued, "[d]ifferent students may solve problems in different ways, reflecting different conceptual and procedural knowledge, instructional experiences, and practical experiences" (p. 22). Obviously, lack of fit between the data and the psychometric model used for scaling can obscure meaningful interpretation of assessment performance.
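For reference, the kind of unidimensional model at issue can be written in its two-parameter logistic form (the notation is generic and mine, not taken from the article or from Nichols and Sugrue):

P(X_{ij} = 1 \mid \theta_i) \;=\; \frac{1}{1 + \exp\left[-a_j(\theta_i - b_j)\right]},

where \theta_i is a single latent proficiency for test-taker i, and a_j and b_j are the discrimination and difficulty of item j. The criticism above is precisely that a single \theta cannot carry the multiple, possibly qualitatively different, processes that cognitively complex tasks draw on.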
Reporting and Feedback

The proposition regarding reporting is that the score reports are accurate, include sufficient information, and are clear to the stakeholders. This holds both for reports generated by test developers/publishers to help test users interpret test results correctly and for teacher-made tests, where feedback to the learner is at issue.

Evidence to support the proposition in the case of high-stakes assessment should refer to the sufficiency of the information for making decisions and to the clarity with which the information is communicated to the intended stakeholders. Such evidence could be obtained through expert judgment regarding the quality of the test report forms and the interpretive guides, as well as through interviews with users of the reports. Suggestions for how to enhance score reports have recently been offered by Goodman and Hambleton (2004), based on their comprehensive review of current practices of reporting test scores. To refute rival hypotheses that a given reporting procedure causes confusion and misunderstanding, which according to Hambleton (2002) is a prevalent phenomenon among stakeholders of large-scale assessment reports, evidence should be obtained regarding the understanding of the reports and the interpretation of their results. Such information can be gathered by presenting small groups of stakeholders with score reports and asking them to interpret the results and draw conclusions. Observations and interviews should be conducted in these settings to identify possible misunderstandings and to find out whether the drawn inferences are warranted (Haertel, 1999).

In the case of classroom assessment, evidence regarding feedback quality should include examples of feedback notes supplied by teachers as well as interviews with students regarding the clarity and usefulness of the feedback. A review of research on the effect of feedback on learning, offered by Black and Wiliam (1998), indicated that instructive feedback enhances student learning. Positive effects on achievement were found when students received feedback about particular qualities of their work in addition to suggestions on how to improve their learning. Moreover, larger effects were detected when students were provided with effective instruction on how to assess their own learning and determine what they need from instruction. Evidence should also be collected regarding the quality of feedback generated by teachers to guide them in improving further instruction and assessment.

Consequences of the Assessment

The proposition regarding consequences is that the assessment serves its intended purpose(s) and does not have adverse consequences. Intended consequences of assessment include improvements in the instruction, learning, and assessment (ILA) culture, in student and teacher self-efficacy, and in other motivational attributes, as well as curricular changes that reflect aspects of academic competencies considered important by society (NRC, 2001). Unintended consequences include narrowing the curriculum, increasing the dropout rate, and other instructional and administrative practices aimed at increasing scores rather than improving the quality of education (AERA, APA, & NCME, 1999). Koretz (in press) points out three types of teacher behavior that can cause inflated test scores: reallocation of instructional resources to "align" instruction with the test; coaching to the test (shifting resources to focus on details of the test), and cheating ("teaching the test" – providing students with the answers to the test). A search for evidence regarding adverse consequences of accountability tests should consider these types of behavior as primary candidates.

Many studies of assessment consequences have been criticized for lack of rigor (Mehrens, 1998). In a recent presidential address, Linda Crocker (2003) referred to the credibility of consequential validation studies of accountability systems and called for developing a set of explicit guidelines for evaluating the consequences. It should be noted that test developers and test users share the responsibility of providing evidence regarding the assessment consequences; however, the latter are usually in the best position to evaluate the likely consequences of the assessment in their particular context.

The Quality Argument

To recap: to evaluate the quality of a given assessment practice, relevant and sufficient evidence should be collected regarding the validity of the score-based interpretations/inferences, as well as the usefulness of those inferences for decision making. The evidence should be integrated and articulated into a coherent and persuasive quality argument that ascertains the links in a network spanning from the purpose of the assessment to the impact of the assessment (as depicted in Figure 2).
Figure 2: The Quality Argument Network
As can be seen in the figure, the argument consists of three connected parts: content, interpretation, and utility. Each contains a sequence of links to be ascertained.

The content argument should ascertain the sequence of inferences that starts by linking the assessment purpose(s) to the target domain (the construct), in order to assess its relevance for the intended inferences. The target domain should in turn be linked to the specified test domain, to assess the broadness of that domain. The test domain should further be linked to the test tasks, to ascertain the representativeness of the test. Finally, the alignment among the task, the rubric, and the relevant standard (derived from the target domain) should be ascertained. Empirical evidence should further be obtained to support the claims that the task indeed elicits the intended processes and that the structure of the response data matches that of the construct.

The interpretive argument, as contended by Kane and his associates (Kane et al., 1999), should ascertain a sequence of inferences from the observed performance to the inferred performance in the target domain. This sequence, these authors maintained, starts with linking the observed performance to an observed score, which is then generalized to a universe score, which is further extrapolated to the target score.

The utility argument should ascertain the sequence of links from the score-based inferences (the target scores) to the interpretation of the score reports by the stakeholders and the conclusions they draw, which in turn should be linked to their decision(s); these, in turn, should be linked to the impact of the assessment.
Ascertaining the argument links requires supporting evidence as well as evidence to refute plausible rival hypotheses regarding effects of the context and of other irrelevant variables.

Further Remarks about Quality Arguments

The type of evidence to be collected and the salience of the various sources of evidence depend on the particular context and purpose(s) of a given assessment. The argument is as strong as its weakest part; therefore, the evidence is most effective when it addresses this part. For instance, evidence regarding generalizability to tasks in the same domain is more important when evaluating performance tasks than when evaluating multiple-choice tasks, for which evidence regarding extrapolation to the target domain is more important (Kane, 1992; Kane et al., 1999). The higher the stakes associated with a given assessment, the more important it is to build a coherent and persuasive quality argument. Articulating a quality argument is not a post hoc act performed after the assessment is completed, but a process that should start at the planning stage of the assessment cycle and guide its design and development.

Conclusion

I conclude this article by offering a set of guiding questions for planning an assessment evaluation:

A. Questions regarding the validity of the score-based inferences

What is the purpose of the assessment?
What are the potential consequences of the assessment?
What are the attributes of the assessment context?
What inferences are to be drawn from the test scores about what the test-taker knows and is able to do?
How can these inferences be justified? (relevant propositions)
What evidence is needed to justify these inferences / to support the propositions?
How can this evidence be obtained (tools, research design)?
What potential sources can undermine the validity of the inferences (plausible rival hypotheses)?
What evidence is needed to refute these rival hypotheses?
How can this evidence be obtained?
After the evidence has been obtained: Is the evidence sufficient for making the inferences/interpretations?
B. Questions regarding the utility of the assessment

What decisions/actions are to be made based on the assessment results?
What evidence will justify making these decisions? (intended consequences)
How can such evidence be obtained (tools, research design)?
What evidence will indicate that the assessment results do not justify making those decisions? (unintended/adverse consequences)
How can such evidence be obtained (tools, research design)?
After the evidence has been obtained: Is the evidence sufficient and useful for making the decisions?

One final remark: conducting a high quality assessment evaluation is not just an obligation that assessment practitioners have to the assessment stakeholders, but also an obligation they should have to themselves, given the potential that the evaluation process holds for their development as reflective practitioners.

Notes
1. This is an expanded version of a keynote lecture given at the Third Northumbria EARLI SIG Assessment Conference, Durham, UK, August 30, 2006.

2. For ease of communication, the following terminology will be used throughout the article: test will refer to an assessment device; task will refer to any type of item along the continuum from choice to constructed response; test-taker will refer to the person who performs the tasks (not necessarily in a testing session), and target domain will refer to the construct, the cognitive model of learning and cognition, or the standard to which the intended score-based interpretation applies.
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bachman, L.F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2 (1), 1-34.

Birenbaum, M. (1996). Assessment 2000: Toward a pluralistic approach to assessment. In M. Birenbaum & F.J.R.C. Dochy (Eds.), Alternatives in assessment of achievement, learning processes and prior knowledge (pp. 3-29). Boston, MA: Kluwer.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5 (1), 7-73.
Brennan, R.L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing.

Brookhart, S.M. (2001). The standards and classroom assessment research. Paper presented at the annual meeting of the American Association of Colleges for Teacher Education, Dallas, TX (ERIC Document Reproduction Service).

Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Coleman, J.S., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfeld, F., & York, R. (1966). Equality of educational opportunity. Washington, DC: U.S. Government Printing Office.

Crocker, L. (2003). Teaching for the test: Validity, fairness, and moral action. Educational Measurement: Issues and Practice, 23 (3), 5-11.

Cronbach, L.J. (1988). Five perspectives on validity argument. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.

Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

Crooks, T.J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58, 438-481.

Delandshere, G., & Petrosky, A.R. (1994). Capturing teachers' knowledge: Performance assessment. Educational Researcher, 23 (5), 11-18.

DeMars, C.E. (2000). Test stakes and item format interactions. Applied Measurement in Education, 13 (1), 55-78.

Dunbar, S.B., Koretz, D.M., & Hoover, H.D. (1991). Quality control in the development and use of performance assessment. Applied Measurement in Education, 4, 289-303.

Feldt, L.S., & Brennan, R.L. (1989). Reliability. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education/Macmillan.

Franklyn-Stokes, A., & Newstead, S.E. (1995). Undergraduate cheating: Who does what and why? Studies in Higher Education, 20 (2), 159-172.

Goldhaber, D., & Anthony, E. (2005). Can teacher quality be effectively assessed? National Board certification as a signal of effective teaching. Urban Institute. Available at: http://www.urban.org/UploadedPDF/411271_teacher_quality.pdf
Goodman, D.P., & Hambleton, R.K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17 (2), 145-220.

Haertel, E.H. (1999). Validity argument for high-stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18 (1), 5-9.

Hambleton, R.K. (2002). How can we make NAEP and state test score reporting scales and reports more understandable? In R.W. Lissitz & W.D. Schafer (Eds.), Assessment in educational reform (pp. 192-205). Boston, MA: Allyn & Bacon.

Harlen, W. (2006). The role of assessment in developing motivation for learning. In J. Gardner (Ed.), Assessment and learning (pp. 61-80). London: Sage.

Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.

Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38 (4), 319-342.

Kane, M. (2002). Validating high-stakes testing programs. Educational Measurement: Issues and Practice, 21 (1), 31-41.

Kane, M. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2, 135-170.

Kane, M.T., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18 (2), 5-17.

Koretz, D. (in press). Limitations in the use of achievement tests as measures of educators' productivity. The Journal of Human Resources.

Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking? Educational Measurement: Issues and Practice, 24 (3), 6-14.

Leighton, J.P. (2004). Avoiding misconception, misuse and missed opportunities: The collection of verbal reports in educational achievement testing. Educational Measurement: Issues and Practice, 24 (4), 6-15.

Lewy, A. (1996a). Postmodernism in achievement testing. In A. Lewy (Ed.), Alternative assessment: Theory and practice (pp. 11-37). Tel Aviv: The Mofet Institute. (Hebrew)

Lewy, A. (1996b). Examining achievement in writing and scoring essay-type tests. In A. Lewy (Ed.), Alternative assessment: Theory and practice (pp. 11-37). Tel Aviv: The Mofet Institute. (Hebrew)

Lewy, A., Liebman, Z., & Selikovits, G. (1996). From item banks to kits of assessment tasks (KAT): Reflections about assessing achievement in Bible studies. Halacha leMaase betichnum Limudim, 11, 77-98. (Hebrew)
Lewy, A., & Shavit, S. (1974). Types of examinations in history studies. Journal of Educational Measurement, 11 (1), 35-42.

Lukin, L.E., Bandalos, D., Eckhout, T.J., & Mickelson, K. (2004). Facilitating the development of assessment literacy. Educational Measurement: Issues and Practice, 24 (2), 26-32.

Mehrens, W.A. (1998). Consequences of assessment: What is the evidence? Education Policy Analysis Archives, 6 (13), 1-16.

Mertler, C.A. (2005). Secondary teachers' assessment literacy: Does classroom experience make a difference? American Secondary Education, 33 (2), 76-92.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23 (2), 13-23.

Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14 (4), 5-8.

Mislevy, R.J., Wilson, M.R., Ercikan, K., & Chudowsky, N. (2002). Psychometric principles in student assessment. CSE Technical Report 583. Center for the Study of Evaluation; National Center for Research on Evaluation, Standards, and Student Testing, UCLA.

Moss, P.A. (1994). Can there be validity without reliability? Educational Researcher, 23 (2), 5-12.

Moss, P.A., & Schutz, A. (2001). Educational standards, assessment, and the search for consensus. American Educational Research Journal, 38 (1), 37-70.

National Research Council. (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. J.W. Pellegrino, N. Chudowsky, & R. Glaser (Eds.). Division of Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

Nichols, P., & Sugrue, B. (1999). The lack of fidelity between cognitively complex constructs and conventional test development practice. Educational Measurement: Issues and Practice, 3, 18-29.

Nye, B., Konstantopoulos, S., & Hedges, L. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26 (3), 237-257.

Perkins, D.N., & Blythe, T. (1994). Putting understanding up front. Educational Leadership, 51 (5), 4-7.

Porter, A.C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31 (7), 3-14.
Resnick, L.B., & Resnick, D.P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B.R. Gifford & C. O'Connor (Eds.), Changing assessment: Alternative views of aptitude, achievement and instruction (pp. 37-75). Boston, MA: Kluwer.

Ross, C.C. (1946). Measurement in today's schools. New York: Prentice-Hall.

Sambell, K., & McDowell, L. (1998). The construction of the hidden curriculum: Messages and meanings in the assessment of student learning. Assessment in Higher Education, 23 (4), 391-402.

Sambell, K., McDowell, L., & Brown, S. (1997). "But is it fair?": An exploratory study of student perceptions of the consequential validity of assessment. Studies in Educational Evaluation, 23 (4), 349-371.

Segers, M., & Dochy, F. (2001). New assessment forms in problem-based learning: The value-added of students' perspective. Studies in Higher Education, 26 (3), 327-343.

Segers, M., Dochy, F., & Cascallar, E. (2003). Optimizing new modes of assessment: In search of qualities and standards. Dordrecht, The Netherlands: Kluwer.

Shavelson, R.J., Gao, X., & Baxter, G.P. (1995). On the content validity of performance assessment: Centrality of domain specification. In M. Birenbaum & F.J.R.C. Dochy (Eds.), Alternatives in assessment of achievement, learning processes and prior knowledge (pp. 131-141). Boston: Kluwer.

Shepard, L.A. (1995). Using assessment to improve learning. Educational Leadership, 53 (5), 38-43.

Sireci, S.G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5 (4), 299-321.

Stiggins, R.J. (1994). Student centered classroom assessment. New York: Merrill/Macmillan.

Stiggins, R.J. (2001). Student-involved classroom assessment (3rd ed.). Upper Saddle River, NJ: Merrill Prentice Hall.

Struyven, K., Dochy, F., & Janssen, S. (2005). Students' perceptions about evaluation and assessment in higher education: A review. Assessment and Evaluation in Higher Education, 30 (4), 331-347.

Tatsuoka, K.K. (1983). Rule-space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20 (4), 345-354.

Tatsuoka, K.K. (1995). Architecture of knowledge structures and cognitive diagnosis: A statistical pattern recognition and classification approach. In P.D. Nichols, S.F. Chipman, & R.L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 327-359). Hillsdale, NJ: Erlbaum.

Taylor, K.L., & Dionne, J.P. (2000). Accessing problem solving strategy knowledge: The complementary use of concurrent verbal protocols and retrospective debriefing. Journal of Educational Psychology, 92, 413-425.
Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.

Wolf, D., Bixby, J., Glenn III, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. Review of Research in Education, 17, 31-73.

Working Group of the Joint Committee on Testing Practices. (2004). Code of fair testing practices in education (revised). Educational Measurement: Issues and Practice, 24 (1), 2-9.
The Author

MENUCHA BIRENBAUM is a professor of education at Tel Aviv University, School of Education. Her scholarly interests include classroom assessment, large-scale diagnostic assessment, and the assessment and promotion of self-regulated learning.

Correspondence: