Fairness
C Gipps, University of Wolverhampton, Wolverhampton, UK
G Stobart, Institute of Education, University of London, London, UK
© 2010 Elsevier Ltd. All rights reserved.
Glossary

Bias – A technical term used in test construction: individuals with equal ability in the subject being tested, but from different groups, do not have the same probability of success.
DIF (Differential Item Functioning) – DIF occurs when people in different groups perform in substantially different ways on a test question even though they have very similar scores on the test.
Equality – An essentially quantitative approach to differences between groups.
Equity – The 'spirit of justice', which looks at the fairness of a given state of affairs. It is essentially a qualitative judgment.
Introduction

Fairness is a concept for which definitions are important, since it is often interpreted in narrow and technical terms. We set fairness in a social context and look at what this means in relation to different groups and cultures. Similarly, we use educational assessment in a more inclusive way than is often the case: we include tests, examinations, and teachers' informal judgments, or evaluation ('assessment' in the United Kingdom), of student performance. We then explore bias in measurement as well as the broader concept of equity, which goes beyond the more quantitative concerns of equality. Practical approaches to ensuring fairness are discussed in the subsequent section.
Fairness

How would we tell whether a test is fair for different groups (male/female; socially advantaged/disadvantaged; ethnic groupings)? The dilemma is that different groups will have different qualities and experiences, so fairness in assessment cannot assume equal scores or outcomes. Indeed, equal outcomes could represent bias in favor of a lower-achieving group. Differences in performance may be due to differing access to learning, or there may appear to be differences because the test is biased in favor of one group. Wood (1987) describes these different factors as the opportunity to acquire talent (access issues) and the opportunity to show talent to good effect (fairness in the assessment).

In our view, fairness in assessment cannot be considered in isolation from access issues in the curriculum and the educational opportunities offered to the students: fairness in access opportunities and in what the curriculum offers provides the level playing field which must precede a genuinely fair assessment situation. The approach we take includes these broader issues and, therefore, owes more to sociocultural theory than to measurement theory. Assessment is, after all, a socially embedded activity which can only be understood by taking account of the cultural, economic, and political contexts within which it operates (Sutherland, 1996; Gipps, 1999).

There is a historical legacy here. Examinations were introduced in order to encourage advancement through talent rather than patronage, an example being the Civil Service examinations in England in the 1850s (although the examination was only open to men, and furthermore only to those men who had had a particular type of schooling). Of course, one can argue that public tests are important as a means of equalizing opportunities and as a necessary corrective to patronage, while at the same time understanding that tests may be biased in favor of one particular gender, social, or ethnic group. Educational measurement has traditionally presented itself as scientific and objective – providing a true picture of an individual's performance (unless he or she cheats). It is, in this frame, seen essentially as a fair activity.
If access to higher levels of education or to certain professions is determined by the open competition of examination, then this allows society to subscribe to promotion through merit, and unsuccessful individuals to accept their lot, since all have been allowed to compete on an equal basis, in that they took the same examination under similar conditions, to demonstrate their competence. More recent analyses, however, show a rather different reality. Evidence in the 1950s began to emerge that nonverbal intelligence quotient (IQ) tests, which were viewed as culture free, were not independent of culture. A picture was built up of tests being biased toward the culture of the people who designed them. In the United States and the United Kingdom this meant, at the time, white, male, and middle class. Such tests are therefore unfair for people from other backgrounds, as these individuals do not have
the same chance to do well on the tasks. This was particularly important in relation to IQ tests because of the very powerful role of these tests in determining life chances. In the post-1965 Civil Rights legislation era in the United States, critics of advancement through testing pointed out that opportunities to acquire talent, through access to good schooling, were not equally distributed (Wood, 1987; Orfield and Kornhaber, 2001). In other words, these tests were biased in favor of the dominant social group, partly from the design of the tests, and partly from access issues.

In the case of school examinations and tests, the critique has essentially similar underpinnings. Social class is a significant determinant of school performance. Children from lower social groups are not less intelligent or less academically capable, but children from middle-class homes are better able to do well at school because of the correspondence of cultural factors between home and school.

We are now well aware that the form of assessment can differentially affect results for different groups. In England there has been far more analysis of this in relation to gender than to ethnicity. We know that during compulsory schooling (up to 16 years) girls are likely to outperform boys on tasks which involve open-ended writing, particularly when this involves personal response. Even within multiple-choice tests, traditionally seen as favoring boys, there are differential response patterns. Carlton (2000) has shown that in such tests females perform better than males, matched for ability, on questions in which the content is a narrative or is in a humanities field, and when the content deals with human relationships. As the context of an item grows longer, the relative performance of females also improves. Males outperform females on questions relating to science, technical matters, sports, war, or diplomacy.
We also know that where examinations have a coursework (or essay) element, the performance of girls is likely to be more consistent, although the effect this has on final grades in English school-leaving examinations has often been overstated (Elwood, 1995). We know less about other aspects of the form of assessment, particularly in relation to ethnicity. For example, oral assessment plays little part in the examination system in England outside examining languages. Does the emphasis on written response disadvantage groups who place more emphasis on oral communication in their culture?

Bias

Bias is a technical term used in test construction. In layman's terms, a test is biased if "two individuals with equal ability (in the subject being tested) but from different groups do not have the same probability of success" (Shepard et al., 1981). A main cause of bias in a test would be that it was designed by one cultural group to reflect their own experience, and thus disadvantages test-takers from other cultural groups, as in the example of IQ tests given above. Thus, bias may be due to the content matter in a test. However, it may also be due to lack of clarity in instructions, which leads to differential responses from different groups, or to scoring systems that do not credit appropriate or correct responses that are more typical of one group than another.

The existence of group differences in performance on tests is often taken to imply that the tests are biased, the assumption being that one group is not inherently less able than the other. However, as we have argued, the two groups may well have been subject to different environmental experiences or unequal access to the curriculum. This difference will be reflected in group test scores, but the test is not, strictly speaking, biased. There are technical approaches to evaluating bias, which we discuss below.
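Shepard et al.'s definition can be made concrete with a small sketch. Under a simple Rasch-style item response model (an illustrative assumption on our part, not a method discussed in this article), a biased item carries an extra difficulty shift for one group, so two test-takers with identical ability have different probabilities of success:

```python
import math

def p_correct(ability, difficulty, group_shift=0.0):
    # Rasch-style model: the probability of success should depend only on
    # ability minus item difficulty. A nonzero group_shift makes the item
    # effectively harder for one group -- the definition of a biased item.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty - group_shift)))

theta = 1.0   # identical ability for both test-takers (hypothetical value)
b = 0.0       # item difficulty (hypothetical value)

p_group_a = p_correct(theta, b)                    # unbiased: ~0.73
p_group_b = p_correct(theta, b, group_shift=1.0)   # biased item: 0.50
```

Despite equal ability, the second test-taker's chance of success drops from about 0.73 to 0.50, which is precisely the situation the Shepard et al. definition labels as bias.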
Equity

Equity is defined in the dictionary as moral justice. Equity does not imply equality of outcome and does not presume identical experiences for all – both are seen as unrealistic – but it asserts that assessment practice and the interpretation of results need to be fair and just for all groups. For example, it is possible to have similar outcomes for two groups and yet to see this as unfair to one of them, which may have been disadvantaged in terms of access to the curriculum. Conversely, it is possible to have unequal group outcomes that may be seen as fair. An example would be where there are group differences in application to learning and preparation, although each group had similar resources and opportunities.

Equity is also a quasi-legal term. The legal meaning of equity is the spirit of justice and, building on the work of Secada (1989), we see it as a qualitative concern for what is just: "Equity attempts to look at the justice of a given state of affairs, a justice that goes beyond acting in agreed upon ways and seeks to look at the justice of the arrangements leading up to and resulting from those actions" (p. 81). The implication is that equity is not the same as equality. Equity represents the judgment about whether equality, be it in the form of opportunity and/or of outcomes, achieves just (fair) results. Looking for equality requires an essentially quantitative approach to differences between groups, while equity goes beyond this and looks at the justice of the arrangements prior to the testing/examination.

Legal Challenges

Sitting behind these sociocultural and technical approaches to fairness are legal challenges to assessment programs and results. Such legal challenges are more common in the United States and Australia than in the United Kingdom. According to Cumming and Dickson:

Sources of legal action around the world have different bases. For example, in the USA many education cases have been fought and won on constitutional grounds and individual rights. In countries such as Australia, where individual rights do not have a constitutional basis, legal challenges in this area are usually brought through tort, or negligence, law, or on a statutory basis, under anti-discrimination laws. (Cumming and Dickson, 2006: 4)

Simple examples illustrate how discrimination might arise in an assessment context:

Direct discrimination in the administering of a test might happen, for example, if a marker marks a student of a particular race, sex, religion 'harder' than students not of that race, sex or religion. Indirect discrimination might arise in the administering of a test if there were, for example, a requirement that students complete the test in a set time. Students with a disability may not be able to comply with this requirement – they may not be able to write quickly, or they may have a processing disorder. (Cumming and Dickson, 2006: 5)
Legal challenges have been made on the grounds of: opportunity to learn – not having been taught the syllabus being assessed; certification decisions made on the basis of a single instrument (which violates the fairness principle of using a variety of assessment approaches); lack of funding, leading to low performance on national performance measures; incorrect scoring of tests; and incorrect (or absent) assessment of students with learning difficulties, resulting in inappropriate placement and/or teaching. A review of legal cases in the United States indicates the high standard required to establish lack of equity in assessment. These authors similarly identify lack of opportunity to learn and lack of opportunity to demonstrate it to good effect – for various reasons, including insufficient preparation time for a new examination or test – as major themes in legal challenge, and conclude:

In most cases, in hindsight, the matters challenged are preventable. Equitable access to the curriculum to be assessed, provision of adequate resources and implementation of appropriate assessment regimes are surely public expectations of education.
Approaches to Ensuring Fairness

What we have outlined so far is a complex and contested picture, difficult to engage and move forward with. Yet much work has been done to address this, both sociocultural and technical.

An Equity Approach

Apple (1989) argued that attention in the equity and education debate must be refocused on important curricular questions, to which Gipps and Murphy (1994) linked assessment questions (see Table 1). Based on accounts of the difficulties of providing effective education to all in countries such as South Africa (Meier, 2000) and Kenya (Mwachihi and Mbithi, 2000), Stobart has added access and resource questions (Stobart, 2005). Some aspects of this approach are now reflected in the guidelines for test developers.

Test development approach

The most fully articulated approach comes from test developers. The Standards for Educational and Psychological Testing, which has guided later documents, was published jointly by AERA, APA, and NCME (1999) to guide test development, scoring, and administration. It has a whole section on fairness in testing and test use, dealing with many of the issues in this article. Educational Testing Service (ETS), the major American test development agency, has its own standards for quality and fairness (ETS, 2000) to guide test developers. There are eight standards for fairness aimed at ensuring that "construct-irrelevant (i.e., irrelevant to the skills or concepts being addressed) personal characteristics of test takers have no appreciable effect on test results or their interpretation" (p. 17).
Table 1 Access, curriculum, and assessment questions in relation to equity

Access questions (Stobart, 2005):
Who gets taught and by whom?
Are there differences in the resources available for different groups?
What is incorporated from the cultures of those attending?

Curricular questions (Apple, 1989):
Whose knowledge is taught?
Why is it taught in a particular way to this particular group?
How do we enable the histories and cultures of people of color, and of women, to be taught in responsible and responsive ways?

Assessment questions (Gipps and Murphy, 1994):
What knowledge is assessed and equated with achievement?
Are the form, content, and mode of assessment appropriate for different groups and individuals?
Is this range of cultural knowledge reflected in definitions of achievement?
How does cultural knowledge mediate individuals' responses to assessment in ways which alter the construct being assessed?
Table 2 The ETS International Principles for Fairness Review of Assessments (2004)
Principle 1: Treat people with respect in test materials
Principle 2: Minimize the effects of construct-irrelevant knowledge or skills
Principle 3: Avoid material that is unnecessarily controversial, inflammatory, offensive, or upsetting
Principle 4: Use appropriate terminology to refer to people
Principle 5: Avoid stereotypes
Principle 6: Represent diversity in depictions of people
From ETS (2004). ETS International Principles for Fairness Review of Assessments. Princeton, NJ: Educational Testing Service.
These have been extended to the ETS International Principles for Fairness Review of Assessments (ETS, 2004). The six principles are presented in Table 2. The requirement is to select assessment content that accurately reflects the construct, even if it produces gender/ethnic group differences, and to avoid content that is not relevant to the construct and could affect such differences. The involvement of those with a minority background is crucial here. The document outlines the procedures for fairness review, and other fairness actions, one of which is the use of a statistical measure:

A statistical measure related to fairness should be used, whenever sample sizes permit, as an empirical check on the fairness of questions. Statistical measures based on the way matched people in different groups perform on each test question, called differential item functioning or DIF, are preferred. DIF occurs when people in different groups perform in substantially different ways on a test question, even though they have very similar scores on the test. If DIF data are available, tests should be assembled following rules that keep DIF low. (ETS, 2004: 11)
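The ETS passage describes DIF in terms of matched groups; the most widely used statistic of this kind is the Mantel-Haenszel procedure, reported on the ETS delta scale. The sketch below is an illustrative implementation under simplifying assumptions (test-takers matched on raw total score, no score smoothing, two groups labeled 'ref' and 'foc'), not ETS's operational code:

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(scores, groups, item_correct):
    """Mantel-Haenszel D-DIF for one item, on the ETS delta scale.

    scores       -- total test scores, used to match test-takers
    groups       -- 'ref' (reference) or 'foc' (focal) per test-taker
    item_correct -- 1 if the test-taker answered the studied item correctly
    """
    # Stratify test-takers by total score: the "matched" part of DIF.
    strata = defaultdict(lambda: {'ref': [0, 0], 'foc': [0, 0]})
    for s, g, c in zip(scores, groups, item_correct):
        strata[s][g][c] += 1            # index 0 = wrong, 1 = right

    num = den = 0.0
    for cells in strata.values():
        a, b = cells['ref'][1], cells['ref'][0]   # reference right/wrong
        c, d = cells['foc'][1], cells['foc'][0]   # focal right/wrong
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    if num == 0 or den == 0:
        return None                     # odds ratio undefined for these data
    alpha = num / den                   # common odds ratio across strata
    return -2.35 * math.log(alpha)      # ETS delta scale (MH D-DIF)
```

A value near zero indicates no DIF; negative values indicate that the item disadvantages the focal group. In ETS practice, items are broadly classed by the magnitude of MH D-DIF (roughly: below 1.0 negligible, above 1.5 large), though the operational rules also involve significance tests not shown here.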
Fifteen years ago, the emphasis was solely on a statistical approach to eliminating bias in test design. The shift in emphasis from statistical manipulation as a major approach to it being just one element of test review is notable and, we would argue, to be welcomed. An important approach to offering fairness, it is now argued, is to use, within any assessment program, a range of assessment tasks involving a variety of contexts, a range of modes within the assessment, and a range of response formats and styles. This broadening of approach is most likely to offer pupils alternative opportunities to demonstrate achievement if they are disadvantaged by any one particular assessment in the program (Linn, 1992). Also, if we wish pupils to do well in tests/exams, we need to think about assessment which elicits an individual's best performance (after Nuttall, 1987). This involves tasks that are: concrete and within the experience of the pupil (an equal access issue); presented clearly (the pupil must understand what is required of her if she is to perform well); relevant to the current concerns of the pupil (to engender motivation and engagement); and in conditions that are not threatening (to reduce stress and enhance performance). (Gipps, 1994)
Large-Scale Assessment Programs

There are four key areas within large-scale testing/examination systems in which to raise issues of fairness, particularly in relation to multicultural societies. These are:

The nature and requirements of the assessment system itself; for example, how are cultural and linguistic diversity approached? (For example, in which languages are examinations conducted?)
How do the assessment methods meet the cultural diversity of the candidates?
How does the content of the assessment reflect the experiences of different groups? (For example, whose version of history is tested?)
How effectively is the performance of different groups monitored, and how is this fed back into the system? (Stobart, 2005)
Teachers' Informal Evaluations/Assessments

Fairness in assessment in the informal setting of the classroom can be both more difficult – because there are many complex issues for the teacher to consider – and more possible – since a range of assessment approaches is possible. It is more feasible for the teacher to offer, in the informal assessment setting, a range of assessment tasks and modes, and an approach which supports fairness, as we argued above. It is also more feasible to provide the situation that can elicit an individual's best performance, since it is under the teacher's control. However, teachers' informal assessment is, to a certain extent justifiably, perceived as being unreliable and biased (Harlen, 2004a). This is usually to do with lack of clarity, and variability, in standards or criteria. It is possible to improve the consistency of teachers' assessments through: providing clear criteria, training teachers to assess against these, and supporting the process with moderation of judgments via discussion (ARG, 2006).

In relation to the curriculum offered and opportunity to learn, there is another inconvenient fact: teacher expectation can affect the curriculum and learning experiences offered to children. There is clear evidence that teachers offer a different curriculum to children for whom they hold low and high expectations (Tizard et al., 1988; Troman, 1988; Harlen, 2004b). This is pertinent to the equal access issue.
Conclusion

Fairness is both essential and elusive. It is the appeal to fairness that has made educational measurement a pivotal part of most cultures. We have argued that the view that fairness consists simply of different groups being allowed to sit, and be judged by, the same test is simplistic. Equality of opportunity includes access to similar resources and curricular opportunities. The more familiar, and narrower, discussion of bias is only a small part of this. We will never achieve fair assessment, but we can make it fairer. The best defense against inequitable assessment is openness. Openness about design, constructs, and scoring will bring out into the open the values and biases of the test design process, offer an opportunity for debate about cultural and social influences, and open up the relationship between assessor and learner. These developments are possible, but they do require political will.
Bibliography

AERA/APA/NCME (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Apple, M. W. (1989). How equality has been redefined in the conservative restoration. In Secada, W. (ed.) Equity and Education, pp 7–35. New York: Falmer.
ARG (2006). The role of teachers in the assessment of learning. Assessment Reform Group pamphlet. www.assessment-reform-group.org (accessed August 2009).
Carlton, S. T. (2000). Contextual factors in group differences in assessment. Paper presented at the 26th IAEA Conference, Jerusalem, 14–19 May.
Cumming, J. and Dickson, E. A. (2006). Equity in Australia: The legal perspective. Paper presented at the 32nd IAEA Conference, Singapore, 21–25 May.
Elwood, J. (1995). Undermining gender stereotypes: Examination and coursework performance in the UK at 16. Assessment in Education 2(3), 283–303.
ETS (2000). ETS Standards for Quality and Fairness. Princeton, NJ: Educational Testing Service.
ETS (2004). ETS International Principles for Fairness Review of Assessments. Princeton, NJ: Educational Testing Service.
Gipps, C. (1994). Beyond Testing: Towards a Theory of Educational Assessment. London: Falmer.
Gipps, C. (1999). Sociocultural aspects of assessment. Review of Research in Education 24, 357–392.
Gipps, C. and Murphy, P. (1994). A Fair Test? Assessment, Achievement and Equity. Buckingham: Open University Press.
Harlen, W. (2004a). A systematic review of the evidence of reliability and validity of assessment teachers use for summative purposes (EPPI-Centre review). In Research Evidence in Education Library, Issue 3. London: EPPI-Centre, Social Science Research Unit, Institute of Education. http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_three.htm (accessed August 2009).
Harlen, W. (2004b). A systematic review of the evidence of the impact on students, teachers and the curriculum of the process of using assessment by teachers for summative purposes (EPPI-Centre review). In Research Evidence in Education Library, Issue 4. London: EPPI-Centre, Social Science Research Unit, Institute of Education. http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_four.htm (accessed August 2009).
Linn, M. C. (1992). Gender differences in educational achievement. In Pfleiderer, J. (ed.) Sex Equity in Educational Opportunity, Achievement and Testing, pp 11–50. Princeton, NJ: Educational Testing Service.
Meier, C. (2000). The influence of educational opportunities on assessment results in a multicultural South Africa. Paper presented at the 26th IAEA Conference, Jerusalem, 14–19 May.
Mwachihi, J. M. and Mbithi, M. J. (2000). Assessment and equity assurance in the Kenyan multicultural background. Paper presented at the 26th IAEA Conference, Jerusalem, 14–19 May.
Nuttall, D. (1987). The validity of assessments. European Journal of Psychology of Education 11(2), 109–118.
Orfield, G. and Kornhaber, M. L. (2001). Raising Standards or Raising Barriers? New York: Century Foundation.
Secada, W. G. (1989). Educational equity versus equality of education: An alternative conception. In Secada, W. G. (ed.) Equity and Education, pp 68–88. New York: Falmer.
Shepard, L., Camilli, G., and Averill, M. (1981). Comparison of procedures for detecting test item bias with both internal and external ability criteria. Journal of Educational Statistics 6, 317–375.
Stobart, G. (2005). Fairness in multicultural assessment systems. Assessment in Education 12(3), 275–287.
Sutherland, G. (1996). Assessment: Some historical perspectives. In Goldstein, H. and Lewis, T. (eds.) Assessment: Problems, Developments and Statistical Issues, pp 9–20. Chichester: Wiley.
Tizard, B., Blatchford, P., Burke, J., Farquhar, C., and Plewis, I. (1988). Young Children at School in the Inner City. Hove: Erlbaum.
Troman, G. (1988). Getting it right: Selection and setting in a 9–13 middle school. British Journal of Sociology of Education 9(4).
Wood, R. (1987). Measurement and Assessment in Education and Psychology. London: Falmer.
Relevant Website

www.assessment-reform-group.org – The Assessment Reform Group (ARG): The role of teachers in the assessment of learning.