Int. J. Educ. Res. Vol. 11, pp. 1-143, 1987. Printed in Great Britain. All rights reserved. Copyright © 1986 Pergamon Journals Ltd.
EDUCATIONAL EVALUATION: THE STATE OF THE FIELD

RICHARD M. WOLF (GUEST EDITOR)

Teachers College, Columbia University, U.S.A.
CONTENTS

INTRODUCTION - Richard M. Wolf
    References

CHAPTER 1. THE NATURE OF EDUCATIONAL EVALUATION - Richard M. Wolf
    Abstract
    Introduction
    Toward a Definition of Evaluation
    Differences between Evaluation, Measurement, Research and Learner Appraisal
    The Role of Evaluation in the Educational Process
    The Limitations of Evaluation
    References

CHAPTER 2. A FRAMEWORK FOR EVALUATION - Richard M. Wolf
    Abstract
    Introduction
    A Framework
    References

CHAPTER 3. TWO-PLUS DECADES OF EDUCATIONAL OBJECTIVES - W. James Popham
    Abstract
    An Alternative Ancestry
    The Behavioral Objectives Brouhaha
    Taxonomic Travails
    Measurement and Objectives
    Performance Standards as Separable
    Guideline Time
    After All - Only Two Decades
    References

CHAPTER 4. DESIGNING EVALUATION STUDIES: A TWENTY-YEAR PERSPECTIVE - Jeri Benson and William B. Michael
    Abstract
    Purposes of Evaluation Design
    Four Basic Types of Evaluation Design
    Sources of Invalidity for Experimental and Quasi-Experimental Designs
    Conclusion
    References

CHAPTER 5. SAMPLE DESIGN - Kenneth N. Ross
    Abstract
    Introduction
    Populations: Desired, Defined, and Excluded
    Sampling Frames
    Representativeness
    Probability Samples and Non-Probability Samples
    Sample Design and Experimental Approaches to Evaluation
    Evaluation Designs when Sampling is Not Possible
    Conclusion
    References

CHAPTER 6. THE INFORMATION SIDE OF EVALUATION FOR LOCAL SCHOOL IMPROVEMENT - Kenneth A. Sirotnik
    Abstract
    Collecting Information: What, from Whom and How?
    Multilevel Issues
    Collecting Information: Problems and Possibilities
    Closing Remarks
    References

CHAPTER 7. THE DEFINITION AND INTERPRETATION OF EFFECTS IN DECISION ORIENTED EVALUATION STUDIES - Sorel Cahan, Daniel Davis and Nora Cohen
    Abstract
    Introduction
    A Framework for the Definition of a Policy Effect
    The Definition of a Policy Effect
    Comparability of Effects
    The Effect of a 'Policy Variable'
    Interpretation of a Policy Effect
    References

CHAPTER 8. BEYOND DECISION-ORIENTED EVALUATION - Jaap Scheerens
    Abstract
    Introduction
    The Rational Model
    The Incremental Model
    Approaches that Seek to Enhance the Relevance of Evaluations
    Conclusion: Beyond Decision Oriented Evaluation
    References

CHAPTER 9. REPORTING THE RESULTS OF EVALUATION STUDIES - A. Harry Passow
    Abstract
    Reporting/Communicating in an Evaluation Study
    Reports and the Reporting Process
    References

CHAPTER 10. PROFESSIONAL STANDARDS FOR ASSURING THE QUALITY OF EDUCATIONAL PROGRAM AND PERSONNEL EVALUATIONS - Daniel L. Stufflebeam
    Abstract
    Introduction to the Program Evaluation Standards
    Standards for Evaluations of Educational Personnel
    Closing
    References
INTRODUCTION

RICHARD M. WOLF
Educational evaluation has long been regarded as a good thing. Writers in the areas of curriculum and school administration have called for evaluation of educational programs from the earliest years of this century. Undoubtedly, some educators heeded this call and engaged in efforts to evaluate curricula, programs and other aspects of the educational enterprise. Unfortunately, there was no technology for conducting evaluation studies. Experimental procedures that were being developed were well suited to the conduct of laboratory studies but could not be used in school settings. The absence of any technology for the conduct of evaluation studies probably reduced such efforts to discussions among educational personnel about the merits and problems surrounding particular programs or other features of the educational system. How much of a role evidence played in such discussions is not known.

There were other problems in those early days. Educational personnel were usually so heavily involved in the organization, management and conduct of educational programs that there was little time left for reflective thought, let alone systematic study. Thus, while educational leaders were urging that evaluation be a regular part of the educational enterprise, the people they were writing for had neither the time nor the technology to do what was being asked. This is not an unusual state of affairs. Educational thought is frequently ahead of practice. What is noteworthy is that exhortations that evaluation be a regular part of the educational enterprise led to the development of a view that evaluation had an important role to play in education. By the end of the second decade of this century, this view was widely held.

In the 1930s, largely as a result of work by Ralph Tyler, evaluation was being formally conceptualized and a fledgling technology was developed. Tyler, in a series of articles in Ohio State's Educational Research Bulletin, presented his conception of what evaluation was and, equally important, how it might be done. Beginning in 1933, Tyler assumed the role of director of the evaluation staff of the Eight Year Study, a pioneering study in which a group of 30 schools were freed from traditional college entrance requirements and allowed to develop new curricular patterns to meet the needs of their students. The evaluation staff were able to develop, refine and put into use many of the notions that had been
developed by Tyler and others. The report of that project (Smith & Tyler, 1942) is not only a fascinating chronicle of the methods, instruments and results of that study, but also did much to advance thinking about evaluation as well as supply the beginnings of a technology for educational evaluation.

The publication of Smith and Tyler's Appraising and Recording Student Progress (1942) could not have been more badly timed. The United States, along with virtually every other nation in the world, was deeply involved in World War II. There was neither the time, the personnel, nor the resources to devote to evaluation. With the close of World War II came the postwar period, lasting until the late 1950s, when countries had to adjust to a new set of national and world conditions. The 'baby boom' of the postwar period kept educators frantically busy just building schools and finding teachers to deal with a mushrooming school population.

In the late 1950s, the Russians launched Sputnik 1. The effect of this event on thinking about education is difficult to overestimate. Briefly stated, there was an immediate reaction in the U.S. that its educational system was seriously behind those in other countries, notably the Soviet Union. Whatever the facts, this view was strongly and widely held. Government action, it was believed, was needed to improve the state of education in the United States. Accordingly, legislation was passed providing funds primarily for the strengthening of mathematics and science education. Support was made available for the development of new curricula as well as for teacher training, among other things. It was a crash effort. It was felt that prompt action was required to boost the quality of science and mathematics education. Interestingly, little provision was made to evaluate these efforts. The thinking at that time was that by allocating resources to solve a perceived problem, the problem would get solved. This naïve faith was severely tested over the next decade.

While the decade of the 1950s was one of expansion of the physical facilities, teaching force and student population, the 1960s was a decade in which attention was focused on qualitative improvements in education. Landmark legislation was passed in the United States in 1965, at the instigation of President Lyndon Johnson, that was intended to improve the quality of education, particularly for students from disadvantaged backgrounds. While this legislation was being debated in the U.S. Senate, Robert Kennedy, then Senator from the state of New York, proposed an amendment requiring annual systematic evaluation of programs funded under the legislation. The amendment was quickly accepted on the reasoning that if the government was going to spend a great deal of money on education, it had a right to know what effects such expenditures were having. The political popularity of the evaluation requirement quickly spread to other social legislation so that, by the end of the 1960s, it was commonplace to require systematic annual evaluations of social programs.

In the meantime, the education community was caught virtually totally unprepared. School districts did not have staff trained in evaluation to do the work that was required. Personnel were rapidly shifted into hastily formed offices or bureaus of evaluation and directed to conduct evaluations of federally funded programs. Most people had little idea of what was involved in the evaluation of a program or how to go about doing it.
Where it was not possible to assemble a staff, contracts were let with external groups to conduct evaluations. Often, the level of competence in such groups was not very high. The period from 1965 to the early 1970s was one of considerable confusion. A great deal of activity occurred under the heading of evaluation. Much of the work was highly questionable. Workers in the field had little to guide them in their efforts and many studies were
entirely ad hoc in their design and conduct. What fueled all this activity was a large flow of governmental money for various educational programs. Workers charged with evaluating these funded programs lived with the constant fear that if programs were not evaluated, or if the evaluations were deemed unacceptable, the flow of funds could stop. Local school districts felt that they could not afford such a loss. Consequently, evaluation activity proceeded even if the quality of much work was suspect.

Several interesting developments occurred in those early days. First, a view developed that future funding depended on the success of programs. The view may or may not have been correct. The important thing was that a number of people believed it. Thus, evaluation, regardless of how good or bad it might be, was seen to be important, if not critical, in determining future funding. A second interesting development was that individuals from a variety of disciplines and applied fields began to address the topic of evaluation, defining what it was and how it should be done. Each individual, by virtue of training or background, often addressed those aspects of evaluation that drew on his or her special expertise, often at the neglect of other disciplines or fields. Thus, individuals from an economics background often stressed aspects of costs, administrators stressed a management perspective and laboratory oriented educational psychologists emphasized the methods of true experiments. The list could be extended considerably. What was noteworthy was that individuals tended to stress those concerns they regarded as central and tended to pay less attention to areas in which they were not knowledgeable.

When the excessive narrowness of some writers was detected, some decided it was important to assemble a variety of points of view so that those actually responsible for the conduct of evaluation studies could begin to see the totality of evaluation. Accordingly, various individuals produced edited books, drawing on the works of various writers representing differing views about evaluation or, at least, emphasizing different aspects of evaluation. Such edited books served some useful functions. They informed readers of the diversity of views and approaches to evaluation, offered insights into important questions that needed to be asked when contemplating undertaking an evaluation of an educational program, and provided a good deal of detailed technical assistance. The edited books also exposed the basic lack of coherence in the field of evaluation. It was not uncommon to find one chapter in fundamental conflict with another, for example, or to find that one writer emphasized one aspect of evaluation almost to the exclusion of everything else while another writer did the same thing with his or her pet concern.

It is hard to overestimate the amount of ferment in the field of evaluation during the period from 1965 to 1975. There was also considerable confusion. Since 1975, the field has settled down somewhat. There is general agreement as to the important questions that should be addressed in the comprehensive evaluation of educational enterprises and an acceptance, albeit reluctant in some cases, that a variety of approaches are admissible in the conduct of studies. Sole reliance on formal true experimental procedures has been soundly rejected and more emphasis is now placed on the logic of design than on adherence to a narrow set of procedures.
Evaluation is now seen to be a more open and ongoing process intended to yield information that will lead to the improvement of educational programs. This latter view is in marked contrast to the view of 10-20 years ago when it was felt that the results of evaluation studies would be the most important determinant of continued support.

If the view of evaluation today is more relaxed and comprehensive than just 10 years ago, it is equally clear that no single prescription can be laid down for the organization and
conduct of an evaluation study. Many of the views highlighted by writers over the past 10 to 20 years are recognized as valid. Different kinds of studies are required to answer different kinds of questions about different kinds of programs. Recognition of diversity has resulted in greater acceptance of the need to use a variety of approaches. The tendency to seek single paradigms has been abandoned in the face of strong pressures to acknowledge the complexity of educational treatments and the likelihood of multiple factors of causation and outcomes. This acknowledgement of complexity has come to be accepted as evaluation workers struggle to understand educational enterprises.

The articles in this special issue are designed to inform readers of the diversity and complexity of educational evaluation. While each writer prepared his or her article with full knowledge of the outline of the issue, each writes about an aspect of evaluation in which he or she has specific expertise. It is possible that some statements made by one writer may conflict with those of another. One hopes the reader will not be unsettled by this. Differences continue to exist about the particulars of educational evaluation. The writers for this special issue, however, were chosen not only for their special expertise in a particular area but also for their overall competence in the field. The fact that each writer may be emphasizing a different aspect of evaluation is more a testament to the fidelity with which they have carried out their assignment than to any basic differences between them.
References

Smith, E. R., Tyler, R., & Evaluation Staff (1942). Appraising and recording student progress. New York: Harper and Brothers.
CHAPTER 1

THE NATURE OF EDUCATIONAL EVALUATION*

RICHARD M. WOLF

Teachers College, Columbia University, U.S.A.
Abstract

This article sets out to define educational evaluation and describe its role in education. A definition of educational evaluation is presented, discussed and elaborated. Equally important, educational evaluation is compared to measurement, research and learner appraisal, and similarities and differences are noted. Next, the role of evaluation in an educational system is presented and discussed. Evaluation is shown to have an important role to play in education, notably in relation to program improvement. Finally, limitations of educational evaluation are noted.
Introduction

Any work that sets out to deal with a relatively new aspect of education is obliged to furnish the reader with a definition, description, and discussion of that aspect. This is particularly true of the burgeoning field of educational evaluation, where there is considerable confusion. This confusion stems partly from the fact that many of the techniques and procedures used in evaluating educational enterprises are rather technical, and educators are often not knowledgeable about such matters. A more basic reason for the confusion, however, is that different authors have different notions of what educational evaluation is or should be. These dissimilar views often stem from the training and background of the writers, their particular professional concerns with different aspects of the educational process, specific subject-matter concerns, and even from differences in temperament. One result is that a reader unfamiliar with the field is all too often exposed to written works that not only differ but are even contradictory. Such writings are not just expressions of honest differences about what evaluation is about and how it should be carried out. They are often reflections of a deeper confusion which attends the development of a relatively new field of inquiry.

One goal of this special issue is to reduce the confusion about what evaluation is and is not, how it should be organized and carried out, how the results of evaluation studies should be reported, and how they can be used. There is no intent, however, to shield the reader from honest differences that exist within the field. These can and should be exposed and discussed. However, it is not considered necessary or even desirable to deal with a number of highly idiosyncratic viewpoints regarding educational evaluation. Rather, the emphasis here is on the presentation of a conceptualization of educational evaluation that attempts to be comprehensive, coherent, sensible, and practical. It combines features emphasized by a number of writers in the field but attempts to weld them into a unified view of educational evaluation. It sometimes sacrifices the private concerns of writers when it is felt that these interfere with basic ideas of evaluation. The critical reader can deal with the subtleties, complexities, and differences that exist at the frontiers of evaluation once the basic ideas are learned.

*This article is adapted from material presented in Wolf, R. (1984). Evaluation in education (2nd ed.). New York: Praeger.
Toward a Definition of Evaluation
There are several definitions of educational evaluation. They differ in level of abstraction and often reflect the specific concerns of the person who formulated them. Perhaps the most extended definition of evaluation has been supplied by C. E. Beeby, who described evaluation as "the systematic collection and interpretation of evidence, leading, as part of the process, to a judgment of value with a view to action" (Beeby, 1977). There are four key elements to this definition.

First, the use of the term 'systematic' implies that what information is needed will be defined with some degree of precision and that efforts to secure such information will be planful. This does not mean that only information which can be gathered through the use of standardized tests and other related measures will be obtained. Information gathered by means of observational procedures, questionnaires, and interviews can also contribute to an evaluation enterprise. The important point is that whatever kind of information is gathered should be acquired in a systematic way. This does not exclude, a priori, any kind of information.

The second element in Beeby's definition, "interpretation of evidence", introduces a critical consideration sometimes overlooked in evaluation. The mere collection of evidence does not, by itself, constitute evaluation work. Yet uninterpreted evidence is often presented to indicate the presence (or absence) of quality in an educational venture. High dropout rates, for example, are frequently cited as indications of the failure of educational programs. Doubtless, high dropout rates are indicators of failure in some cases, but not all. There may be very good reasons why people drop out of educational programs. Personal problems, acceptance into higher level educational programs, and landing a good job are reasons for dropping out which may in no way reflect on the program. In some cases, dropping out of an educational program may indicate that a program has been successful. For example, a few years ago, the director of a program that was engaged in training people for positions in the computer field observed that almost two-thirds of each entering class failed to complete the two-year program. On closer examination it was found that the great majority of 'drop outs' had left the program at the end of the first year to take well-paying jobs in the computer departments of various companies (usually ones they had worked with while receiving their training). The personnel officers and supervisors of these companies felt that the one year of training was not only more than adequate for entry- and second-level positions but provided the foundation on which to acquire the additional specialized knowledge and skill required for further advancement. Under such circumstances, a two-thirds dropout rate before program completion was no indication of a program failure or deficiency.
Clearly, information gathered in connection with the evaluation of an educational program must be interpreted with great care. If the evaluation worker cannot make such interpretations himself, he must enlist the aid of others who can; otherwise, his information can seriously mislead. In the above example, the problem of interpretation was rather simple. Dropout statistics are easily gathered, and one can usually have confidence in the numbers. More complex situations arise when one uses various tests, scales, or observational and self-report devices such as questionnaires and opinionnaires. In these situations the interpretation of evaluation information can be extremely difficult. Unfortunately, the interpretation of information has too often been neglected. Specific mention of it in a definition is welcome since it focuses attention on this critical aspect of the evaluation process.

The third element of Beeby's definition - "judgment of value" - takes evaluation far beyond the level of mere description of what is happening in an educational enterprise. It casts the evaluation worker, or the group of persons responsible for conducting the evaluation, in a role that not only permits but requires that judgments about the worth of an educational endeavor be made. Evaluation not only involves gathering and interpreting information about how well an educational program is succeeding in reaching its goals, but also judgments about the goals themselves. It involves questions about how well a program is helping to meet larger educational and social goals. Given Beeby's definition, an evaluation worker who does not make - or, if for political or other reasons an openly harsh judgment is inexpedient, does not strongly imply - a judgment of value is not, in the full sense of the term, an evaluation worker. Whoever does make such a judgment, after the systematic groundwork has been laid, is completing an evaluation.

Lest the reader get the wrong impression that the evaluation worker has some kind of special power in education, a distinction needs to be made between two types of judgments. The first is the judgment of the value of the program, curriculum, or institution being evaluated. This is the type described above, which is clearly within the scope of the evaluation worker's professional function. The second type of judgment is taken in light of the first and, along with other relevant factors, is the decision on future policy and action. This is clearly in the domain of administrators, governing boards, and other policy makers. If these decision makers make both kinds of judgments, they are taking over an essential part of the professional evaluation function. This is to be avoided. It is quite possible that a decision will be made to retain a marginally effective program. It may be that the political or public-relations value of a program is deemed important enough to continue it despite low effectiveness. It is also possible that funds are available to operate a program of marginal quality which would not be available for other more worthwhile endeavours. It is the decision maker's job to determine whether to fund it or not. The point remains: the evaluation workers, or those charged with the evaluation of a program, will render a judgment of value; it is the responsibility of decision makers to decide on future policy and action. Each has their area of responsibility, and each must be respected within their domain. This point must be understood at the outset.
If it is not, there is danger that evaluation workers may become frustrated or cynical when they learn that policy decisions have been made contrary to what the results of their evaluation would suggest.

The last element of Beeby's definition - "with a view to action" - introduces the distinction between an undertaking that results in a judgment of value with no specific reference to action and one that is deliberately undertaken for the sake of future action. The
same distinction is made by Cronbach and Suppes, although the terms "conclusion-oriented" and "decision-oriented" are used (1969). Educational evaluation is clearly decision-oriented. It is intended to lead to better policies and practices in education. If this intention is in any way lacking, evaluation probably should be dispensed with. Evaluation workers can use their time to better advantage.

So far no mention has been made about what kinds of actions might be undertaken as the result of an evaluation study. The range is considerable. A conscious decision to make no changes at all could result from a carefully conducted evaluation study; or a decision to abolish a program altogether could be made, although the latter case is not very likely. In fact, this writer has not heard of a single instance where a decision to terminate a program was based solely on the results of an evaluation study. Between these extremes, modifications in content, organization and time allocation could occur, as well as decisions about additions, deletions, and revisions in instructional materials, learning activities, and criteria for staff selection. Such decisions come under the general heading of course improvement and are discussed in some detail by Cronbach (1963). M. Scriven uses the term "formative evaluation" to characterize many of these kinds of decisions (1967). In contrast, decisions about which of several alternative programs to select for adoption or whether to retain or eliminate a particular program are "summative" in nature, to use Scriven's terminology. Scriven's distinction between formative and summative evaluation has achieved a fair measure of popular acceptance, although the number of clearly summative studies is small. The basic point is that evaluation studies are undertaken with the intention that some action will take place as a result.
Differences between Evaluation, Measurement, Research and Learner Appraisal
Beeby’s definition of evaluation goes some distance towards specifying what evaluation is. However, in order to function effectively, a definition must not only say what something is; it should also say what it is not. This is particularly important with regard to evaluation. Three activities that are related to evaluation are measurement, research, and learner appraisal. Evaluation shares some similarities with each. The differences, however, are considerable and need to be examined so that evaluation can be brought more sharply into focus.
Evaluation and Measurement
Measurement is the act or process of measuring. It is essentially an amoral process in that there is no value placed on what is being measured. Measurements of physical properties of objects such as length and mass do not imply that they have value; they are simply attributes that are being studied. Similarly, in the behavioral sciences, measurement of psychological characteristics such as neuroticism, attitudes toward various phenomena, problem-solving, and mechanical reasoning does not, in itself, confer value on these characteristics. In evaluation, quite the opposite is the case. The major attributes studied are chosen precisely because they represent educational values. Objectives are educational values. They define what we seek to develop in learners as a result of exposing them to a set of edu-
cational experiences. They can include achievements, attitudes toward what is learned, self-esteem, and a host of other prized outcomes. Such outcomes are not merely of interest; they are educational values. Thus, while evaluation and measurement specialists often engage in similar acts, such as systematically gathering information about learner performance, there is a fundamental difference between the two in the value that is placed on what is being measured.

A second important distinction between evaluation and measurement inheres in the object of attention of each. By tradition and history, measurement in education is undertaken for the purpose of making comparisons between individuals with regard to some characteristic. For example, two learners may be compared with regard to their reading comprehension. This is accomplished by administering the same reading comprehension test to the two learners and seeing how many questions each has answered correctly. Since they have been given the same test, a basis for comparison exists. This is the traditional measurement approach. In evaluation, on the other hand, it is often neither necessary nor even desirable to make such comparisons between individual learners. What is of interest is the effectiveness of a program. In such a situation, there is no requirement that the learners be presented with the same test. In fact, under some circumstances, it may be prudent to have them answer entirely different sets of questions. The resulting information can then be combined with that obtained from other learners and summarized in order to describe the performance of an entire group. Such a procedure introduces efficiencies into the process of information gathering. The point to be made here is that evaluation and measurement are typically directed towards different ends: evaluation toward describing the effects of treatments; measurement toward the description and comparison of individuals. In evaluation, it is not necessary that different learners respond to the same questions or tasks.
Learner Appraisal

Closely related to the notion of measurement is learner appraisal. Appraising the proficiencies of learners for purposes of diagnosis, classification, marking and grading is usually considered the prerogative of those charged with the instructional function; typically, teachers. The introduction of systematic evaluation procedures has been viewed in some cases as an intrusion on this traditional teacher function. Nothing could be further from the truth. Evaluation is directed toward judging the worth of a total program and, sometimes, toward judging the effectiveness of a program for particular groups of learners. Evaluation is not an external testing program that is intended to supplant teacher responsibility for learner appraisal. In fact, it is - more often than not - simply not able to do so. For example, if in the course of evaluating a program, it is decided to have different groups of learners answer different sets of questions, the resulting evaluative information will contribute nothing to the process of learner appraisal. Measurements of individual learner proficiencies will still have to be made to fulfill the appraisal function. Thus, teachers need not fear that systematic evaluation of educational programs will intrude on the appraisal role of their professional function. Quite the opposite may occur. Teachers wishing to use evaluative information to assist them in appraising learner performance may find themselves frustrated when they learn that evaluative information does not help them in this regard.
Evaluation and Research
Evaluation and research share a number of common characteristics. There are some notable differences, however. Research, typically, aims at producing new knowledge which may have no specific reference to any practical decision, while evaluation is deliberately undertaken as a guide to action. This distinction is highlighted in the last phrase of Beeby's definition of evaluation - "with a view to action". Any distinction based on motivation is obviously fragile, and one operation can shade into another; but in practice there is usually a marked difference in content, presentation, and often method between research, inspired by scholarly interest or an academic requirement, and an investigation undertaken with a definite practical problem in mind. To be sure, scholarly research has often led to highly practical payoffs - the work of atomic physicists in the 1930s is a dramatic case in point. A basic difference in motivation, however, remains.

A more basic distinction between evaluation and research lies in the generalizability of the results that each type of activity produces. Research is concerned with the production of knowledge that seeks to be generalizable. For example, a research worker may undertake an investigation to determine the relationship between student aspiration and achievement. The study will be designed and carried out in such a way as to insure results that are as generalizable as possible. They will obtain over a wide geographic area, apply to a broad range of ages, and be as true in several years as now. Generalizability of results is critical in research. Little or no interest may attach to knowledge that is specific to a particular sample of individuals, in a single location, studied at a particular point in time. In fact, if a researcher's results cannot be duplicated elsewhere, they are apt to be dismissed. In their now famous chapter on designs for research in teaching, D. T. Campbell and J. Stanley drew attention to the notion of generalizability when they discussed threats to the integrity of various designs under two broad headings - internal validity and external validity (1963). External validity was their term for generalizability.

Evaluation, in contrast, seeks to produce knowledge specific to a particular setting. Evaluators, concerned with the evaluation of a reading improvement program for third graders in a single school or school district, will direct their efforts toward ascertaining the effectiveness of the program in that locality. The resulting evaluative information should have high local relevance for teachers and administrators in that school district. The results may have no relevance for any other school in any other locality; well-intentioned educators, interested in such a program, will have to determine its effectiveness elsewhere in a separate enterprise.

Another important distinction between evaluation and research lies in the area of method. In research there are fairly well developed canons, principles, procedures, and techniques for the conduct of studies, which have been explicated in various works (Kerlinger, 1975; Kaplan, 1964; Campbell & Stanley, 1966). These methods serve to insure the production of dependable and generalizable knowledge. While the methods of research frequently serve as a guide to evaluation endeavors, there are a number of occasions when such methods are neither necessary nor practicable. Evaluation is not research, and the methods of the latter do not need to dictate the activities of the former.
Some writers assert that any evaluative effort must rigorously employ the methods of experimental research and that anything less is apt to be a waste of time and money. This is an extreme position. While research methods are often useful in planning evaluation studies, they should not be a straitjacket. Meaningful evaluative activity can be carried on
that does not follow a research model. For example, a program intended to train people in a particular set of skills, e.g. welding, could be undertaken with a single select group of learners and their proficiency ascertained at the conclusion of the training program. Such an enterprise would violate most, if not all, of the precepts of scientific research, e.g. lack of randomization, absence of a control group, etc. However, it could yield highly pertinent evaluative information that could be used for a variety of purposes. An inability to follow research prescriptions for the design and conduct of studies need not be fatal for evaluation work. There are occasions when departures from a strict research orientation are necessary and appropriate. The important point is that substantial and important work can be done in evaluation that does not require the use of formal research methods. One must, of course, be extremely careful.

The above discussion was intended to introduce the reader to the concept of educational evaluation. A definition of educational evaluation was presented and discussed, and evaluation was briefly contrasted with measurement, learner appraisal, and research. But definition and contrast can only go so far. In a way, the remainder of this special issue and the references provided are an attempt to convey to the reader the nature of educational evaluation and what it entails. Before proceeding further, some background material needs to be given so that the reader may see educational evaluation in a larger context.

The Role of Evaluation in the Educational Process
The prominence given to educational evaluation in the United States can be traced to the mid-60s. The United States Congress passed the Elementary and Secondary Education Act of 1965 (ESEA), which provided massive amounts of funds for the improvement of educational programs. The ESEA also contained a requirement that programs funded under Titles I and III of the Act be evaluated annually and that reports of the evaluation be forwarded to the federal government. Failure to comply with this requirement could result in a loss of funding for programs. It was unfortunate that what came to be known as the 'evaluation requirement' was introduced in the way that it was. Still, it seems unlikely that the need to evaluate the worth of educational programs would have been taken so seriously or so quickly without such a spur.

One unfortunate aspect of the requirement to evaluate arose from the fact that evaluation, initially, was viewed by many as an activity engaged in to satisfy an external funding agency, i.e. the federal government, rather than as an integral part of the educational enterprise. It was also unfortunate that only externally funded programs were, in fact, evaluated. Resources were often not available for evaluating conventional programs. The view that developed in the mid- and late-60s regarding the evaluation requirement was, to say the least, lamentable. It was also in marked contrast to the view that evaluation was an integral part of the educational process, which had been developing since the late 1920s and early 1930s, principally under the advocacy of Ralph Tyler. Briefly, Tyler's rationale postulated three major elements in the educational process: objectives, learning experiences, and appraisal procedures.* Since objectives will be treated in another article, it is only necessary to mention here that objectives refer to one's intentions for an educational endeavor. They represent the desired, or valued, performances or behaviors that individuals in a program are supposed to acquire. An educational program's purposes may range from having learners (whatever their age or other characteristics) acquire a narrowly specified set of skills to the reorganization of an entire life style. The nature of the objectives is not significant at this point. What is important is that an educational program is undertaken with some intentions in mind and that these intentions refer to desired changes in the learners served by the program.

The term learning experiences refers to those activities and experiences that learners undergo in order to acquire the desired behaviors. For example, if a program in nutrition education is concerned with learners acquiring information about the importance of including various food groups in a diet, the learning experiences designed to help students acquire this information might include reading, lecture, audio-visual presentations, and the like. Learning experiences is a broad term that includes both individual and group activities carried on in and out of class at the instigation of educators for the sake of attaining the objectives of the program. Thus, if a teacher requires that learners visit a museum to view a particular exhibit, this would be classified as a learning experience, provided it is intended to help attain a particular objective. Correspondingly, homework assignments, individual projects, and term papers completed outside of school would also be classed as learning experiences.

According to Tyler, learner appraisal in the educational process is critical because it is concerned with ascertaining the extent to which the objectives of the program have been met. For a representation of the educational process as formulated by Tyler, see Figure 1.1.

* Tyler originally used the term evaluation to denote those procedures used to appraise learner progress toward the attainment of objectives. He saw such procedures as being critical for furnishing information about the extent to which objectives were being attained and about the appropriateness and efficacy of the learning experiences. In this article, the term evaluation is used in an even broader way. It includes not only Tyler's views but also procedures used in making judgments about the overall merit of an educational enterprise. Judgments of overall merit involve considerations of the worth of the objectives being pursued, their costs, and some measure of their acceptance by learners, teachers, and community.
Figure 1.1. Representation of the educational process: objectives, learning experiences, and learner appraisal, each linked to the others by two-directional arrows. Adapted from Tyler, R. (Ed.). (1969). Educational evaluation: New roles, new means. Yearbook of the National Society for the Study of Education. Chicago: University of Chicago Press.
The representation is a dynamic one, as signified by the two-directional arrows linking each element with each of the others. Beginning with objectives, the arrow pointing to learning experiences indicates that objectives serve as a guide for the selection or creation of learning experiences. For example, if a geometry course is supposed to develop deduc-
tive thinking abilities in learners, then learning experiences that require work with other than geometry content, such as newspaper editorials, advertisements, and the like, will have to be included in the program. Other examples could easily be cited. The central point is that the nature of one's objectives will be an important determiner of the learning experiences that constitute the operational program.

The arrow pointing from objectives to learner appraisal indicates that the primary (some would maintain the exclusive) focus of appraisal is on gathering evidence on the extent to which the objectives of a program have been attained. Just as the objectives provide specifications for the establishment of learning experiences, they also furnish specifications for learner appraisal. To return to the previous example, a program that seeks to have students develop and use deductive thinking in life-situations might require, in its appraisal of learning, evidence regarding student proficiency in applying deductive principles to the analysis of a variety of material outside the realm of geometry.

The two arrows stemming from objectives in Figure 1.1 are easily explained. The meaning of the other arrows is less apparent but no less important. The arrow pointing from learning experiences to learner appraisal is indicative of the fact that learning experiences can provide exemplars for the development of appraisal tasks. The activities that students engage in during the learning phase of a program should furnish ideas for appraisal situations. In fact, there should be a fundamental consistency between learning experiences and appraisal tasks for learners. If there is not, something is amiss in the program. This is not to say learning experiences and appraisal tasks must be identical. Appraisal tasks may contain an element of novelty for the learner. This novelty may appear in the content of the evaluation task, the form of it, or both. If there is no element of novelty, one does not have an educational program but, rather, a training program where learning experiences are designed to develop relatively narrow behaviors in the learner, and appraisal procedures are expected to ascertain only whether the narrow set of behaviors has been acquired. Education, on the other hand, involves the acquisition of fairly broad classes of behaviors. Thus, the arrow pointing from learning experiences to learner appraisal indicates that learning experiences furnish ideas and suggestions for learner appraisal; but there should not be an overspecification of appraisal procedures.

The two arrows pointing from learner appraisal to objectives and to learning experiences are especially important. In the case of the former, the arrow signifies that appraisal procedures should furnish information about the extent to which the objectives are being attained. This is an important function of learner appraisal. In addition, appraisal activities can furnish valuable information that may result in the modification of some objectives and the elimination of others. Particular objectives may have been included as a result of noble intentions on the part of a group of educators, but appraisal activities may yield information that indicates the goals were not attained. This should cause the educators to reconsider the objectives. Should the objectives be modified or perhaps eliminated? Are the objectives realistic for the group of learners served by the program? Are the resources necessary for achieving the objectives available?
Such questions will, of course, have to be answered within the context of a particular situation. In raising these questions here, we seek to illustrate how the results of appraisal activities can provide information pertinent to the review of objectives.

The arrow pointing from learner appraisal to learning experiences is suggestive of two important notions. First, just as appraisal activities can furnish information as to which objectives are being successfully attained and which are not, learner appraisal can also pro-
vide information bearing on which learning experiences appear to be working well and which ones are not. In any educational enterprise there will be a variety of learning experiences. It is unreasonable to expect that all will be equally effective. Appraisal procedures can furnish information as to which learning experiences are succeeding, which ones may be in need of modification, and which ones should, perhaps, be eliminated. This is the notion of formative evaluation described by M. Scriven (1967) and discussed in some detail by Cronbach (1963). A second important idea suggested by the arrow pointing from appraisal to learning experiences is that tasks, exercises, and problems developed by evaluation specialists may be suggestive of new learning experiences. The incorporation of novel and imaginative appraisal materials into the learning phase of a program has, on occasion, contributed significantly to the improvement of learning. Of course, the appropriation of such materials for the improvement of the quality of the learning experiences renders the materials unusable for appraisal purposes. However, this is usually considered a small price to pay for the improvement of the quality of the learning experiences.

The last arrow, which points from learning experiences to objectives, denotes that learning activities can result in encounters involving teachers, learners, and learning materials which can suggest new objectives. Alert and sensitive teachers can identify potentially new objectives. A teacher, for example, may be conducting a discussion with a group of learners and, as happens, the discussion will take a turn in an unexpected direction. The teacher may allow the discussion to follow this new course with considerable benefit to all. Such a development may lead the teacher to ask that specific provisions be made to insure such benefits by incorporating one or more new objectives into the program. If one does not make formal provision for such activities, then they may not occur in the future - the basic limitation of incidental learning being that if it is not formally provided for, it may not take place. For this reason, the arrow pointing from learning experiences to objectives has been included.

The above characterization of the educational process is attributed to Tyler (1950); its roots can be seen in his earlier work (1934). The important point to note is that Tyler saw evaluation as central to the educational process and not as an appendage - to be carried out merely to satisfy the demands of an outside funding agency. All serious writers about evaluation share Tyler's view about the critical role evaluation has to play in the educational process, although there is considerable variance as to how this role should be fulfilled. Unfortunately, the view about the centrality of evaluation in the educational process is neither universally shared by educational practitioners nor, when it is held, necessarily applied. For example, a number of writers in the field of curriculum development exhort practitioners to engage in the systematic evaluation of educational programs but furnish little guidance as to how this function should be carried out. Even practitioners will often give lip service to the importance of evaluation but will do little or nothing about it.
While Tyler’s view of education and the role of evaluation in the educational process has been of enormous value to persons in curriculum development as well as to those in educational evaluation, it provides only a foundation for current evaluation thought and practice. Technical developments in the methods of evaluation, measurement, research, decision theory, information sciences, and other related areas, as well as new demands for educational planning, have resulted in additions and modifications to Tyler’s original formulation. There has been a notable shift in thinking about the role of evaluation in the educational process since Tyler’s original work. Tyler viewed evaluation primarily as the assessment of
learner performance in terms of program objectives. For Tyler, evaluation was virtually synonymous with what was previously defined as learner appraisal. There was good reason for this. At the time that Tyler formulated his rationale, evaluation work was not only quite spotty but largely haphazard. Tyler sought to make evaluation a more systematic and rational process. Accordingly, Tyler urged that clear objectives be formulated and that they serve as the basis for the development of evaluation instruments. The results from the use of such instruments would permit people to determine how well program objectives were being attained and thus enable them to judge program success. Given the level of educational thought and practice at the time that Tyler formulated his rationale, it was clearly a great leap forward.

A number of recent writers have argued for an expanded role for evaluation. Their reasoning is that strict fidelity to a program's objectives can place an evaluation worker in a very difficult position. What if a program is pursuing worthless or unrealistic objectives? Must the evaluation worker restrict his activities to assessing the extent of attainment of those objectives, or is he or she to be allowed to question or even challenge the objectives themselves? Opinion and practice are divided on this issue. The emerging consensus is that evaluation workers should be free to question and challenge dubious objectives when there is a real basis for doing so. This begs the issue, however, since one needs to know what constitutes a 'real basis' for challenging or even questioning a program's objectives, especially if the evaluation worker had no part in developing the program. There appear to be two bases for doing so. The first is more obvious than the second. If an evaluation worker has had considerable experience in evaluating the types of programs that he or she has been called on to study, then the evaluation worker might be able to question or challenge a program's objectives in light of this experience. For example, an evaluation worker who is also a specialist in elementary school mathematics might be in a strong position to question the appropriateness and even the worth of a particular set of objectives for an elementary school mathematics program.

The second basis for questioning or challenging a program's objectives is the need which a program was designed to meet. Programs are established to meet some need, as Tyler clearly pointed out in his classic monograph (Tyler, 1950). It is one thing to determine whether a program is achieving its objectives; it is another to say whether the objectives, even if they were achieved, meet the need that gave rise to the program. Thus, the need furnishes a basis for reviewing a program's objectives. More important, it frees an evaluation worker from having to simply accept a program's stated objectives. This does not mean that one can freely criticize program objectives. One should question or challenge program objectives only after careful study of the relationship between a program's objectives and the need the program was designed to meet, or on the basis of sufficient expertise about the nature of the program and the learners to be served. The inclusion of objectives as part of an educational enterprise to be evaluated, rather than as an external set of specifications that are beyond question, is part of the contemporary view of educational evaluation.
An attempt has been made to represent this view of the role of evaluation in Figure 1.2. Evaluation has as its province objectives, learning experiences, learner appraisal and the relationships between the three. Note, however, that the need on which the program is based is not included. The reason for this is that evaluation workers, because of their background, training, experience, and limited view of an educational enterprise, are usually not in a good position to say whether particular needs are valid, or which of several needs should be addressed. Such matters are usually left to a
group of professional workers called policy analysts working closely with decision makers.

Figure 1.2. Representation of the role of evaluation in the educational process. Evaluation has objectives, learning experiences, and learner appraisal within its province; the need on which the program is based lies outside it.

The Limitations of Evaluation
It is common in education to make strong claims for one thing or another, whether it be a teaching method, a form of organization, or the like. In fact, one function of evaluation is to test such claims. It seems fitting, therefore, to identify limitations of evaluation lest the reader develop the mistaken notion that there are no limitations in educational evaluation.

One limitation of educational evaluation was suggested earlier. Educational evaluations, typically, do not produce generalizable results. The evaluation of a particular program in a given location can provide useful information about that program in that place. Such information may not apply to any other locale, however. Separate studies would have to be conducted in different sites to estimate the effects of programs in those sites.

A second limitation of educational evaluation stems from the fact that educational programs are rarely, if ever, static. Programs are continuously changing. Thus, any evaluation is at least partially out of date by the time data are gathered and analyzed. The report of an evaluation study is probably best seen as a historical document, since the program that was evaluated has undoubtedly undergone some change in the period from when information was gathered until the report was produced. Some of these changes are quite natural and represent normal processes of growth and change in an educational enterprise. An evaluation study, in contrast, is a snapshot or a series of snapshots at a particular time. Readers of evaluation reports need to keep this point in mind as they seek to understand and improve educational programs.

A third limitation of evaluation studies centers on the distinction between diagnosis and prescription. Educational evaluations are often similar to medical diagnoses. That is, they can indicate the presence or absence of a fever and what particular set of symptoms is exhibited. Such information can be quite useful. However, such information may not provide a prescription as to how to remedy identified deficiencies. For example, an evaluation study may indicate that students are failing to make an expected amount of progress in a particular school subject and even furnish some reasons for the lack of progress. Deciding what to do to improve the program is a different matter. An evaluation study may not be able to provide recommendations on how to improve a program that is found to be defi-
Educational
Evaluation
19
cient . The limitations of educational evaluation are important to recognize. Just as strong claims are made for various educational ventures, evaluation has been seen, by some, to be a panacea. It is not. Thoughtful evaluation workers recognize this. Other educators and policy makers should recognize this also.
References

Beeby, C. E. (1977). The meaning of evaluation. In Current issues in education: No. 4, Evaluation (pp. 68-78). Wellington: Department of Education.
Campbell, D. T., & Stanley, J. (1966). Experimental and quasi-experimental designs for research in teaching. Chicago: Rand McNally.
Cook, T., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1963). Course improvements through evaluation. Teachers College Record, 64, 672-683.
Cronbach, L. J., & Snow, R. (1977). Aptitudes and instructional methods. New York: Irvington Publishers.
Cronbach, L. J., & Suppes, P. (Eds.). (1969). Research for tomorrow's schools: Disciplined inquiry for education. New York: Macmillan.
Cronbach, L. J., et al. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
House, E. R. (1980). Evaluating with validity. Beverly Hills: Sage Publications.
Kaplan, A. (1964). The conduct of inquiry. San Francisco: Chandler.
Kerlinger, F. N. (1973). Foundations of behavioral research (3rd ed.). New York: Holt, Rinehart & Winston.
Popham, W. J. (1975). Educational evaluation. Englewood Cliffs, NJ: Prentice-Hall.
Scriven, M. (1967). The methodology of evaluation. In R. E. Stake (Ed.), Curriculum evaluation (AERA Monograph Series in Education, No. 1). Chicago: Rand McNally.
Scriven, M. (1973). Goal free evaluation. In E. R. House (Ed.), School evaluation: The politics and the process. Beverly Hills: Sage Publications.
Tyler, R. (1934). Constructing achievement tests. Columbus, OH: Ohio State University.
Tyler, R. (1950). Basic principles of curriculum and instruction. Chicago: University of Chicago Press.
Wolf, R. M. (1984). Evaluation in education (2nd ed.). New York: Praeger.
CHAPTER 2

A FRAMEWORK FOR EVALUATION

RICHARD M. WOLF
Teachers College, Columbia University, U.S.A.
Abstract

While a number of writers have proposed various approaches or even formal models of evaluation, these are seen as being unduly restrictive in thinking about educational evaluation, or in planning or conducting evaluation studies. In contrast, it is proposed that attention be directed towards the kinds of information required in evaluating an educational enterprise. Five major classes of information are identified as being needed in a comprehensive evaluation. These are: (1) initial status of learners; (2) learner performance after a period of treatment; (3) program implementation; (4) costs; and (5) supplemental information. Each class of information is presented and discussed in detail, and its role in a comprehensive evaluation is justified.
Introduction

The formulation of a view of evaluation that is comprehensive is no easy task. It is not that there are no guides to help one. On the contrary, there are at least nine classes of models and, counting conservatively, twenty specific models of educational evaluation (Stake, 1975, p. 33). It is not within the purview of this article to attempt to describe and compare all or even some of the models here; that would probably be more confusing than enlightening. Fortunately, descriptions and comparisons of a number of these models already exist, and the interested reader is referred to those sources (Worthen & Sanders, 1973; Popham, 1975; Stake, 1975).

While discussion of various evaluation models is beyond the scope of this article, some of the ways in which they differ can be noted in order to assist in formulating a view of evaluation that attempts to transcend, take account of, and, in some way, accommodate these differences. One way in which evaluation models differ is in terms of what are considered to be the major purposes of educational evaluation. One group of writers, perhaps best exemplified by Cronbach (1963), sees improvement of instructional programs as the chief aim of evaluation. Educational programs are developed, tried out, modified, tried out again, and, eventually, accepted for adoption on a broad scale. Whether one holds a linear view of the process of research, development, and dissemination, characteristic of much of the curriculum development work of the early and mid-60s, or some alternate view of the process of improvement of education, is not important here. Evaluation in any case
is seen as supplying the information that will lead to improvement of the instructional endeavor. Evaluative information provides feedback to curriculum workers, teachers, and administrators so that intelligent decisions regarding program improvement can be made. Such a feedback and quality control role for evaluation, if carefully planned and carried out, is supposed to doom every curriculum development enterprise to success.

In contrast to this view is the position that evaluation efforts should be directed toward formal, experimental comparative studies. In such instances evaluation efforts are directed at producing educational analogues of environmental-impact reports. Just as an environmental-impact report estimates the likely consequences of a proposed action, educational evaluation in such cases should be able to inform educators about the likely effects of adopting a particular educational program. The methods used to arrive at such estimates center around planned, comparative studies. In well-conducted research studies, it should be possible to conclude that one program out-performs another (or fails to) in terms of various criteria, usually some collection of measures of student performance. Advocates of such a role for educational evaluation are often concerned with overall judgments about educational programs based on a comparison with one or more alternative programs.

The difference in views of the major purposes to be served by evaluation, presented above, has been drawn as sharply as possible. In reality, writers holding different views generally acknowledge the validity of alternative purposes to be served by educational evaluation. Unfortunately, the acknowledgement is often not translated into a modification of one's model so as to accommodate alternative viewpoints. This is not to say that a particular purpose for educational evaluation is necessarily wrong. In many instances it may be correct and highly appropriate. The point is that most models of evaluation are generally limited in terms of the purposes envisaged for educational evaluation, each tending to emphasize a limited set of purposes.
A Framework

In formulating a framework for evaluation studies, a conscious effort has been made to accommodate a variety of viewpoints about educational evaluation. Obviously the framework cannot be all-inclusive; some approaches are mutually exclusive and contradictory. When the purposes of evaluation differ, different parts of the framework will receive different emphasis. The framework presented here should not be regarded as a model. While various writers have proposed different models of evaluation, what is presented here does not claim to be one. It is hoped that it will be helpful to the reader in thinking about educational evaluation and will be useful in planning and conducting evaluation studies.

Comprehensive evaluation of educational treatments, whether they be units of instruction, courses, programs, or entire institutions, requires the collection of five major classes of information. Each class of information is necessary although not sufficient for a comprehensive evaluation of an educational enterprise. In setting forth these five major divisions, it is recognized that some may be of little or no interest in a particular evaluation undertaking. What is to be avoided is the possibility of omitting any important class of information for determining the worth of a program. Thus, the framework allows for possible errors of commission - for example, collecting information that will have little bearing on the determination of the merit of a program - while avoiding errors of omission: failing
to gather information that may be important. There are two reasons for this position. First, if a particular class of information turns out to be unimportant, it can simply be disregarded; if it is known in advance that a particular class of information will have little bearing on the outcome of the evaluation, it simply need not be gathered. This does not reflect on the framework per se but only on the inappropriateness of a part of it in a particular situation. Second, failure to gather information about particular relevant aspects of an educational venture indicates a faulty evaluation effort; this should be avoided.

Initial Status of Learners
The first class of information relates to the initial status of learners. It is important to know two things about the learners at the time they enter the program: who they are, and how proficient they are with regard to what they are supposed to learn. The first subclass of information - who are the learners? - is descriptive in nature and is usually easily obtained. Routinely one wants to know the age, sex, previous educational background, and other status and experiential variables that might be useful in describing or characterizing the learners. Strictly speaking, such information is not evaluative. It should be useful in interpreting the results of an evaluation study and, more important, serve as a baseline description of the learner population. If it is found that subsequent cohorts of learners differ from the one that received the program when it was evaluated, then it may be necessary to modify the program to accommodate the new groups.

The second subclass of information - how proficient the learners are with regard to what they are supposed to learn - is more central to the evaluation. Learning is generally defined as a change in behavior or proficiency. If learning is to be demonstrated, it is necessary to gather evidence of performance at, at least, two points in time: (1) at the beginning of a set of learning experiences; and (2) at some later time. Gathering evidence about the initial proficiencies of learners furnishes the necessary baseline information for estimating, however crudely, the extent to which learning occurs during the treatment period. A related reason for determining the initial proficiency level of learners stems from the fact that some educational enterprises may seriously underestimate initial learner status with regard to what is to be learned. Consequently, considerable resources may be wasted in teaching learners who are already proficient. Mere end-of-program evidence gathering could lead one to the erroneous conclusion of program effectiveness when what had actually happened was that already developed proficiencies had been maintained.

It is important that a determination of the initial level of proficiencies of learners be undertaken before an educational enterprise gets seriously underway. Only then can one be sure that the learners are assessed independently of the effects of the program. Studying learners after they have had some period of instruction makes it impossible to determine what the learners were like before instruction began. While the point seems self-evident, it has been violated so often in recent years that several methodologists have published articles in professional journals about this to sensitize workers in the field.

The one instance in which gathering data about the initial proficiencies of learners can safely be omitted is when what is to be learned is so specialized in nature that one can reasonably presume the initial status of learners is virtually nil. Examples where such a presumption could reasonably be made include entry-level educational programs in computer programming, welding, and cytotechnology. Outside of such specialized fields, however, it is worth the relatively small investment of time, energy, and expense to ascertain
the initial status of learner proficiency.

Learner Performance After a Period of Instruction
The second major class of information required in evaluation studies relates to learner proficiency and status after a period of instruction. The basic notion here is that educational ventures are intended to bring about changes in learners. Hence, it is critical to determine whether the learners have changed in the desired ways. Changes could include increased knowledge, ability to solve various classes of problems, ability to deal with various kinds of issues in a field, proficiencies in certain kinds of skills, changes in attitudes, interests and preferences, and so on. The changes sought depend on the nature of the program, the age and ability levels of the learners, and a host of other factors. Whatever changes a program, curriculum, or institution seeks to effect in learners must be studied to determine whether they have occurred and to what extent. The only way this can be done is through a study of learner performance. Whether the particular changes in learner behavior can be attributed to the effects of the educational experiences is, however, another matter. Before such a determination can be made, it should be ascertained whether learning has occurred and, if so, to what extent.

The notion that information about what has been learned should be obtained after a period of instruction has often been interpreted to mean that learners must be examined at the end of a program or course. This is not quite correct. When information should be gathered is a function of the purposes of those who will be the major consumers of the evaluation. The developer of a program, for example, might be keenly interested in finding out how effective particular units of instruction are in bringing about particular changes in learners, and also in the effectiveness of specific lessons. Such information may enable the developer to detect flaws in the program and make appropriate modifications. The same program developer might be relatively uninterested in learner performance at the end of the program. Moreover, summative information may be at such a general level as to be virtually useless in helping him or her detect where the program is working and where it is not. Someone else, on the other hand, who is considering the adoption of a program, may have little interest in how learners are performing at various points in the program; his or her interest lies chiefly in the final status of the learners. That is, have they learned what was expected by the conclusion of the program? A positive answer could lead to a decision to adopt the program; a negative one, to possible rejection. Different persons will approach an evaluation enterprise with different questions, and such differences should be reflected in decisions concerning what information should be gathered and when it should be gathered.

The above distinction closely parallels the one between formative and summative evaluation noted earlier. It is not, however, the only distinction that can be made. What is important is that a schedule of information-gathering with regard to learner performance should be consistent with the purposes for undertaking the evaluation and that the phrase "after a period of instruction" not be restricted to end-of-course information gathering.
Execution of Treatment

The third class of information to be collected in an evaluation study centers on the educational
treatment being dispensed. At the very least, one needs to know whether the treatment was carried out. If so, to what extent? Did the treatment get started on time? Were the personnel and materials necessary for the program available from the outset or, as has been the case in a number of externally funded programs, did materials and supplies arrive shortly before the termination of the program? Questions regarding the implementation of the intended program may seem trivial but are, in fact, critical. Often it is simply assumed that an intended program was carried out on schedule and in the way it was intended. This assumption is open to question and, more important, open to study. Any responsible evaluation enterprise must determine whether and how an educational program was carried out.

Information about the execution of a program should not only meet the minimal requirement of determining whether a program has been carried out as intended; it should also furnish some descriptive information about the program in operation. Such information can often be used to identify deficiencies in the program as well as possible explanations for success. The collection of information about the program in operation will rely heavily on the regular use of observational procedures and - in some cases - on the use of narrative descriptive material. Who actually gathers such information is a matter that can be decided locally. Evaluation workers, supervisors, or other administrative personnel can share in the performance of this critical function. Maintenance of logs or diaries by teachers in the program can also contribute to meeting informational needs in this area. Even participant-observer instruments, along the lines developed by Pace and Stern (1958), where individuals are asked to report on particular practices and features of an environment, could be useful.

The study of program implementation is not undertaken just to determine how faithfully a program was carried out. Rather, the program that is evaluated is the implemented program. The implemented program can differ markedly from the designed or intended program. Further, there may be very good reasons for such differences. One of the evaluation worker's responsibilities is to be able to describe and compare the intended program, the implemented or actual program, and the achieved program, that is, learner performance in terms of program objectives. In order to fulfill this task, the evaluation worker will need to know not only the intended program or learning experiences and the achieved program or learner performance, but also the implemented or actual program, in order to make the necessary comparisons.
Costs

The fourth major class of information is costs. Unfortunately, costs have not received adequate attention in evaluation work. The reason for this is not clear. Perhaps early educational evaluation efforts were directed toward ascertaining the efficacy of competing instructional treatments that had equal price tags. In such cases, cost considerations would not be a major concern. Today, however, available treatments - in the form of units of instruction, courses, programs, curricula, and instructional systems - have widely varying costs. These need to be reckoned so that administrators and educational planners, as well as evaluation workers, can make intelligent judgements about educational treatments. Not only must direct costs be reckoned - for example, the cost of adoption - but indirect costs as well. Costs of in-service training for teachers who are to use a
new program, for example, must be determined if a realistic estimate of the cost of the new program is to be obtained. An evaluation specialist, whose training and experience may be in measurement, research methodology, or curriculum development, may not be able to carry out such cost estimations. If one cannot do this, then one needs to find someone who can. An evaluation that makes no reference to costs is rarely of practical value, however interesting it may be academically. The educational administrator usually has a fair idea of what could be accomplished if money were no object. The real problem for the administrator is to make wise decisions when cost is a factor.
Supplemental Information

The fifth class consists of supplemental information about the effects of a program, curriculum, or institution and is composed of three subclasses. The first includes the reactions, opinions, and views of learners, teachers, and others associated with the enterprise being evaluated. These could be administrators, parents, other community members, and even prospective employers. The purpose of gathering such information is to find out how an educational treatment is viewed by various groups. Such information is no substitute for more direct information about what is actually being learned, but it can play a critical role in evaluating the overall worth of a program in a larger institutional context. There have been occasions when programs, instituted by well-intentioned educators, have succeeded fairly well in achieving their objectives, and at a reasonable cost. However, controversy about such programs - inside or outside the institution - has led to their termination. One can cite as examples the installation of programs of sex education in schools in highly conservative communities or the adoption of textbooks that were considered to contain offensive material by a sizable segment of a community.

Supplemental information - in the form of views and reactions of groups connected with an educational venture - can be highly instructive in a number of ways. It can: (1) provide information about how a program is being perceived by various groups; (2) help in the formulation of information campaigns, if there is a serious discontinuity between what is actually taking place in a program and what is perceived to be taking place; and (3) alert evaluation workers and administrators to the need for additional information as to why a particular program is being viewed in a certain way by one or more groups. Such information may also prevent evaluation workers from developing what Scriven (1972) has described as a kind of tunnel vision, which sometimes develops from overly restricting evaluation efforts to determining how well program objectives have been achieved.

Information about the views and reactions of various groups connected with an educational enterprise can be gathered fairly easily through the use of questionnaires and interview techniques. The value of gathering such information should not be underestimated. Neither should it be overestimated. There have been a number of evaluation efforts that have relied solely on the collection of the views and reactions of individuals and groups having some connection to a program. Educational evaluation should not be confused with opinion polling. It is important to find out how well an educational venture is succeeding in terms of what it set out to accomplish. This requires the collection of information about learner performance outlined in the first two classes of information. In addition, supplemental information about how a program is perceived by various groups is important to an overall evaluation of its worth.
The instances cited above, in which supplemental information in the form of views and reactions of various groups was rather extreme, do occur. The frequency of such occurrences, however, is low. Generally, reactions to educational programs tend to be on the mild side, with a tendency on the part of the public to view new programs in a somewhat favorable light - especially if they have been fairly well thought through and reasonably well presented. If reactions to educational programs are fairly mild, ranging, for example, from acquiescence to some positive support, little further attention need be accorded such information. However, one is not apt to know in advance what the views and reactions of various groups are likely to be. Accordingly, it is necessary to consciously find them out.

The second subclass of supplemental information involves learner performances not specified in the objectives of the program. Developers of educational programs, courses, and curricula are improving in their ability to specify what should be learned as a result of exposure to instruction. Furthermore, the responsibility for assessing learner performance with regard to specified objectives is now generally accepted. However, it is also reasonable to inquire how well broader goals of education are being served by a particular program. That is, how well is the need which the program was developed to meet being met? "An ideal evaluation", Cronbach points out (1963, pp. 679-680),

"would include measures of all the types of proficiency that might reasonably be desired in the areas in question, not just the selected outcomes to which this curriculum directs substantial attention. If you wish only to know how well a curriculum is achieving its objectives, you fit the test to the curriculum; but if you wish to know how well the curriculum is serving the national interest, you measure all outcomes that might be worth striving for. One of the new mathematics courses may disavow any attempt to teach numerical trigonometry, and indeed, might discard nearly all computational work. It is still perfectly reasonable to ask how well graduates of the course can compute and can solve right triangles. Even if the course developers went so far as to contend that computational skill is no proper objective of secondary instruction, they will encounter educators and laymen who do not share this view. If it can be shown that students who come through the new course are fairly proficient in computation despite the lack of direct teaching, the doubters will be reassured. If not, the evidence makes clear how much is being sacrificed. Similarly, when the biologists offer alternative courses emphasizing microbiology and ecology, it is fair to ask how well the graduate of one course can understand issues treated in the other. Ideal evaluation in mathematics will collect evidence on all the abilities toward which a mathematics course might reasonably aim; likewise in biology, English or any other subject."
Cronbach's view is not universally accepted. Some writers assert that any attempt to test for outcomes not intended by the program developers imposes inappropriate and unfair criteria on the program. There is always a danger of being unfair. There is also the highly practical problem of deciding what learner-performance information, not specified in the objectives, should be obtained. It would seem that some information along these lines should be gathered. The examples cited by Cronbach above furnish some useful guides about the kinds of additional information program developers should obtain.

It is important that supplemental information about learner performance, when obtained, be analyzed and reported separately from information bearing directly on the intended outcomes of a program. Any evaluation effort must not only be done fairly but be seen to be done fairly. The inclusion of information about learners' competencies not intended as part of an instructional program must be handled delicately. Separate treatment and reporting of such information is a minimal prerequisite of fairness. While Cronbach maintains that an ideal evaluation effort would gather evidence about learner performance on all outcomes an educational enterprise might reasonably aim at, in practice the amount of supplemental performance information will, of necessity, be
limited. Unless a program is heavily funded and has the requisite staff to develop evidence-gathering measures for the whole range of supplemental performance information desired, it is unrealistic to expect that very much can be done. Some efforts, albeit modest ones, can be made. It would be best to use the bulk of available resources to obtain learner-performance information with regard to intended outcomes and to supplement this with some additional measures. One should not dilute an evaluation effort by trying to measure everything and end up doing a mediocre or even poor job. This is a matter of planning and strategy. One can always expect that the resources available for evaluation will be limited. This is not a tragedy. Failure to use available resources effectively, however, can lead to poor results. It is recommended that the major use of available resources be devoted to obtaining the most relevant information about learner performance. This would entail examining how much has been learned with regard to the intended outcomes; some provision can (and should) be made with regard to other learner-performance outcomes that might result from the kind of program being evaluated.

The third subclass of supplemental information has to do with the side effects of educational programs, courses, and curricula. Admittedly, this is not an easy matter. Just as pharmaceutical researchers have long known that drugs can have effects on patients other than the ones intended, educators are realizing that their undertakings can have side effects too. Sometimes such unintended effects can be beneficial, e.g. when a program designed to improve reading skills of learners not only improves reading proficiency but increases self-esteem. Negative side effects may also occur: in a rigorous academic high-school physics course students may learn a great deal of physics, but their interest in learning more physics in college may be markedly reduced and, in some cases, extinguished. While one can cite side effects of educational programs at length, prescriptions about what to look for and how to detect them are hard to give. Evaluation workers must first recognize that side effects can occur in any educational program. They then must strive to be as alert and as sensitive as possible to what is happening to learners as they move through a set of educational experiences. This means attending not only to learner performance in relation to stated objectives, but to any general class of behaviors, including interests and attitudes, that may be developing. This is not an easy task, but if evaluation workers can maintain a level of receptivity to effects not directly specified by the objectives of a program, there is a fair chance that such behaviors will be discerned when they appear.

Another way of detecting the side effects of educational programs is through the use of follow-up procedures. Following up learners into the next level of education or into their first year of employment can furnish clues. In the example cited above about the rigorous high-school physics course, the critical information about the negative effect of the program on interest in learning more physics came from a follow-up study. (It was found that students who had taken the innovative physics course were taking additional physics courses in college at less than one-half the rate of comparable students who had taken a conventional high school physics course.) However,
it is not likely that formal follow-up studies will yield clear-cut results. Moreover, formal follow-up studies often require more resources than are available at most institutions. Even when such resources are plentiful, it is not usually clear how they should be put to use. The detection of unintended effects is an elusive business. Rather than attempt to conduct formal follow-up studies, it is usually better to use loosely structured procedures. Open-ended questionnaires and relatively unstructured interviews with teachers at the next level in the educational ladder, with
employers of graduates, as well as with graduates themselves, could furnish clues about side effects of programs that could then be studied more systematically. It is also possible that the first subclass of supplemental information - views and reactions of various groups connected with a program - could provide clues about program side effects. Whatever strategy is used to detect them, it should probably be somewhat loosely structured and informal.

The framework for evaluation presented above sets forth the major classes of information required for a comprehensive evaluation of an educational enterprise. Suggested procedures for the collection of each class of information are presented in subsequent articles, along with discussions about the analysis and interpretation of evaluative information and the synthesis of results into judgments of worth. Before such complex undertakings can be initiated, however, it is critical that the necessary information be obtained. The framework presented in this chapter is an agenda for information gathering.
References

Beberman, M. (1958). An emerging program of secondary school mathematics. Cambridge, MA: Harvard University Press.
Begle, E. G. (1963). The reform of mathematics education in the United States. In H. Fehr (Ed.), Mathematical education in the Americas. New York: Bureau of Publications, Teachers College, Columbia University.
Cronbach, L. J. (1963). Course improvements through evaluation. Teachers College Record, 64, 672-683.
Eisner, E. (1977). On the uses of educational connoisseurship and criticism for evaluating classroom life. Teachers College Record, 3, 345-358.
Pace, C. R., & Stern, G. (1958). An approach to the measurement of psychological characteristics of college environments. Journal of Educational Psychology, 49, 269-277.
Popham, W. J. (1975). Educational evaluation. Englewood Cliffs, NJ: Prentice-Hall.
Scriven, M. (1972). Pros and cons about goal-free evaluation. Evaluation Comment, 3.
Stake, R. E. (1975). Program evaluation, particularly responsive evaluation. Occasional Paper Series 5. Kalamazoo, MI: The Evaluation Center, Western Michigan University.
Worthen, B. R., & Sanders, J. R. (1973). Educational evaluation: Theory and practice. Worthington, OH: Charles A. Jones.
CHAPTER 3

TWO-PLUS DECADES OF EDUCATIONAL OBJECTIVES

W. JAMES POPHAM
University of California, Los Angeles and IOX Assessment Associates, U.S.A.
Abstract

The primary impetus for educational objectives as tools for U.S. educational evaluators is seen as the programmed instruction movement. Behavioral objectives, widely recommended in the 1960s, are thought to be of limited utility to educational evaluators because of their hyperspecificity and hence their overwhelming numbers. Bloom's taxonomies of educational objectives are viewed as being useful only when considered as broad-brush heuristics, not fine-grained analytic tools. Because objectives sans assessment are little more than rhetoric, criterion-referenced tests are advocated for assessing objective-attainment. Performance standards are viewed as appropriately separable from objectives. Five guidelines regarding objectives are, in conclusion, proffered for educational evaluators.
Educational evaluation, by most people's reckoning, was spawned as a formal educational specialty in the mid-1960s. In the United States, the emergence of educational evaluation was linked directly to the 1965 enactment by the U.S. Congress of the Elementary and Secondary Education Act (ESEA). Focused on educational improvement, this precedent-setting legislation provided substantial federal financial support to local school districts each year, but only if officials in those districts had evaluated the previous year's federally supported programs. Given the potent motivational power of dollars aplenty, U.S. school officials were soon scurrying about, first, to discover what educational evaluations actually were and, having done so, to carry them out. There were, of course, ample instances wherein school officials reversed the order of those two steps.

During the early years of educational evaluation, considerable attention was given to the role of educational objectives. Indeed, in view of the mid-60s preoccupation of U.S. educators with educational objectives and educational evaluation, one might reasonably assume that they had been whelped in the same litter. Such, however, was not the case. In the following analysis an effort will be made to isolate the origins of educational objectives, describe the role of educational objectives during the early and later years of educational evaluation, then identify a set of experience-derived guidelines dealing with the uses of educational objectives by educational evaluators.
An Alternative Ancestry
American educators' attention to educational objectives was not triggered by mid-60s federal education legislation. Quite apart from such federal initiatives, a series of developments in the field of instructional psychology resulted in the need for heightened attention to statements of instructional intent. It was the activity of instructional psychologists, not educational evaluators, that first focused the attention of American educators on the way in which statements of educational objectives were formulated.

More specifically, in the late 50s Skinner (1958) captured the attention of numerous educators as he proffered laboratory-derived principles for teaching children. Skinner's notions of carefully sequencing instructional materials in small steps, providing frequent positive reinforcement for learners, and allowing learners to move through such instructional materials at their own pace were, to many educators, both revolutionary and exciting. A key tenet of Skinner's approach involved the tryout and revision of instructional materials until they were demonstrably effective. Because Skinner believed that such "programmed instruction" could be effectively presented to learners via mechanical means (Skinner, 1958), the prospect of "teaching machines" both captured the fancy of many lay people and, predictably, aroused the apprehension of many educators.

Central to the strategy embodied in all of the early approaches to programmed instruction, including the small-step scheme espoused by Skinner, was the necessity to explicate, in terms as unambiguous as possible, the objective(s) of an instructional sequence. More precisely, early programmed instruction enthusiasts recognized that if an instructional sequence was to be tried out and revised until successful, it was necessary to have a solid criterion against which to judge the program's success. Hence, programmed instruction specialists universally urged that the effectiveness of instructional programs be judged according to their ability to achieve preset instructional objectives.

Without question, Robert Mager's 1962 primer on how to write instructional objectives served as the single most important force in familiarizing educators with measurable instructional objectives. Consistent with its roots, Mager's introduction to the topic of objectives was originally entitled Preparing Objectives for Programmed Instruction (1962). Later, as interest in the book burgeoned, it was retitled more generally as Preparing Instructional Objectives. Organized as a branching program which allowed readers to scurry through its contents rather rapidly, Mager's slender volume constituted a 45-minute trip from ignorance to expertise. During the mid-60s copies of Mager's cleverly written booklet found their way into the hands of numerous educators and, even more importantly, influentially placed federal education officials.

When, in the aftermath of 1965's ESEA enactment, federal officials attempted to guide U.S. educators toward defensible ESEA evaluation paradigms, they found an on-the-shelf evaluation approach best articulated by Ralph Tyler (1950). The Tyler strategy hinged on an evaluator's determining the extent to which the instructional program being evaluated had promoted learner attainment of prespecified educational objectives. Such an objectives-attainment conception of educational evaluation, while destined to be replaced in future years by a number of alternative paradigms, seemed eminently sensible to many highly placed officials in the U.S.
Office of Education. Both implicitly and explicitly, therefore, an objectives-attainment model of educational evaluation was soon being fostered by U.S. federal officials at the precise time that American educators were becoming conversant with the sorts of measurable instructional objectives being touted by
programmed instruction proponents. In retrospect, it is far from surprising that a Tylerian conception of objectives-based educational evaluation became wedded to a Magerian approach to objectives formulation. That marriage occurred in such a way that many neophyte educational evaluators assumed the only bona fide way to evaluate an educational program was to see if its measurably stated educational objectives had been achieved.

The unthinking adoption of an objectives-attainment approach to educational evaluation led, in many instances, to the advocacy of evaluation models in which positive appraisals of an educational program were rendered if its objectives had been achieved - irrespective of the defensibility of those objectives. Not that Tyler had been oblivious to the quality of educational objectives, for in his writings (Tyler, 1950) he stressed the importance of selecting one's educational objectives only after systematic scrutiny of a range of potential objectives. In his classic 1967 analytic essay on educational evaluation, Michael Scriven, having witnessed cavalier applications of objectives-attainment evaluations in the numerous national curriculum-development projects then underway, attempted to distinguish between what he viewed as genuine evaluation and mere estimations of goal-achievement (Scriven, 1967). In his subsequent observations regarding the role of educational objectives in evaluating educational programs, however, Scriven still came down solidly on the side of measurably stated objectives (Scriven, 1970).

As can be seen, then, although educational objectives in the U.S. trace their lineage more directly from instructional psychology than educational evaluation, such objectives were widely accepted during the early years as an integral component of evaluation methodologies - particularly those based on objectives-attainment.
The Behavioral Objectives Brouhaha
It was in the late 60s and early 70s that instructional objectives per se captured the attention of many educational evaluators. Many evaluators subscribed, at least rhetorically, to the form of measurable objectives set forth in the 1962 Mager booklet. Such instructional objectives had become known as behavioral objectives because, at bottom, they revolved around the postinstruction behavior of learners. Yet, although behavioral objectives were espoused by many (e.g. Glaser, 1965; Popham, 1964), a number of writers put forth heated criticisms of behaviorally stated objectives (e.g. Arnstine, 1964; Eisner, 1967).

Proponents of behavioral objectives argued that such objectives embodied a rational approach to evaluation because they enhanced clarity regarding the nature of one's instructional aspirations. Critics countered that because the most important goals of education did not lend themselves readily to a behavioral formulation, the preoccupation with behavioral objectives would lead to instructional reductionism wherein the trivial was sought merely because it was measurable. Disagreements regarding the virtues of behavioral objectives were frequent at professional meetings of that era, some of those disputes finding their way into print (e.g. Popham et al., 1969). Indeed, the arguments against behavioral objectives became so numerous as to be cataloged (Popham, 1969).

Although the academic dialogue regarding the virtues of behavioral objectives lingered until the early 70s, most U.S. educational evaluators who made use of objectives tended to frame those objectives behaviorally. The bulk of professional opinion, whether or not
warranted, seemed to support the merits of behaviorally stated objectives. And, because many educational evaluators believed it important to take cognizance of an educational program's instructional goals, behaviorally stated instructional objectives were commonly encountered in evaluation reports - and are to this day.

Perhaps the most serious shortcoming of behavioral objectives, however, was not widely recognized during the first decade of educational evaluation. That shortcoming stems from the common tendency to frame behavioral objectives so that they focus on increasingly smaller and more specific segments of learner postinstruction behavior. The net effect of such hyperspecificity is that the objectives formulator ends up with a plethora of picayune outcomes. Although early critics of behavioral objectives were wary of what they believed to be a tendency toward triviality in such objectives, no critic predicted what turned out to be the most profound problem with small-scope behavioral objectives. Putting it pragmatically, the typical set of narrow-scope behavioral objectives turned out to be so numerous that decision-makers would not attend to evidence of objective-attainment. After all, if decision-makers were quite literally overwhelmed with lengthy lists of behavioral objectives, how could they meaningfully focus on whether such objectives were achieved?

Simply stated, the most important lesson we have learned about the use of behaviorally stated objectives for purposes of educational evaluation is that less is most definitely more. Too many objectives benumb the decision-maker's mind. Too many objectives, because decision-makers will not attend to them, are completely dysfunctional. The discovery that too many behavioral objectives did not, based on experience, result in improved decision-making need not, of course, force us to retreat to an era when we fashioned our educational objectives in the form of broad, vacuous generalities. There is a decisively preferable alternative, namely, to coalesce small-scope behaviors under larger, albeit still measurable, behavioral rubrics. Thus, instead of focusing on 40 small-scope objectives, evaluators would present decision-makers with only five or six broad-scope, measurable objectives. To illustrate, if an objective describes a "student's ability to solve mathematical word problems requiring two of the four basic operations, that is, addition, subtraction, multiplication, and division," that objective covers a good deal of mathematical territory. A modest number of such broad-scope objectives will typically capture the bulk of a program's important intentions.
Taxonomic Travails
In 1956 Benjamin Bloom and his colleagues brought forth a taxonomy of educational objectives in which they drew distinctions among objectives focusing on cognitive, affective, and psychomotor outcomes (Bloom et al., 1956). In their analysis, Bloom and his coauthors attended chiefly to the cognitive taxonomy, laying out six levels of what they argued were discernibly different types of hierarchically arranged cognitive operations. These cognitive operations, they argued, were needed by learners to satisfy different types of educational objectives. Eight years later, in 1964, David Krathwohl and his coworkers provided us with a second taxonomy focused on five levels of affective-domain objectives (Krathwohl et al., 1964). Although several taxonomies of psychomotor objectives were published shortly thereafter, none attracted the support enjoyed by the initial two taxonomies dealing with cognitive and affective objectives.
The 1956 Taxonomy of Educational Objectives: Handbook I, The Cognitive Domain initially attracted scant attention. Sales of the book were modest for its first several years of existence. However, at the time that U.S. educators turned their attention to instructional objectives in the early 1960s, they found in their libraries an objectives-analysis scheme of considerable sophistication. Sales of the cognitive taxonomy became substantial, and the six levels of the taxonomy, ranging from 'knowledge' at the lowest level to 'evaluation' at the highest level, became part of the lexicon employed by those who worked with educational objectives. Although the affective taxonomy never achieved the substantial popularity of the cognitive taxonomy, it too attracted its share of devotees.

Now, what did educational evaluators actually do with these objectives-classification systems? Well, quite naturally, evaluators classified objectives. Much attention was given, for example, to the appropriate allocation of objectives to various taxonomic categories. Some evaluators would classify each of a program's objectives according to its proper taxonomic niche in the hope of bringing greater clarity to the objectives being sought. Several authors even went to the trouble of identifying the action verbs in instructional objectives which would be indicative of particular levels of the taxonomies (e.g. Gronlund, 1971; Sanders, 1966). Thus, if an objective called for the student to "select from alternatives", this was thought to represent a specific taxonomic level whereas, if the learner were asked to "compose an essay", then a different taxonomic level was reflected.

It is saddening to recall how much time educational evaluators and other instructional personnel devoted to teasing out the taxonomic distinctions between different objectives. For, in the main, this activity made no practical difference to anyone. One supposes, of course, that if certain educators derive personal satisfaction from doing taxonomic analyses, akin to the joys that some derive from doing crossword puzzles, then such behavior should not be chided. However, extensive preoccupation with the classification-potential of the cognitive and affective taxonomies seems to reflect time ill-spent. In fact, if there were a taxonomy of time-wasting activities, one might speculate that taxonomic analyses of educational objectives would be classified toward the top of the time-waster hierarchy.

For one thing, the taxonomies focus on covert processes of individuals, processes whose nature must be inferred from the overt behaviors we can witness. For the cognitive taxonomy, to illustrate, we present an assignment calling for students to write an essay in which discrete information is coalesced, then infer that the 'synthesis' level of the taxonomy has been achieved because the student whips out a requested essay. But what if the student is merely parroting an analysis heard at the dinner table earlier in the week? In that instance, memory rather than synthesizing ability has been displayed and a different level of the cognitive taxonomy represented. Unless we have a solid fix on the prior history of the learner, it is difficult, if not impossible, to know whether a given type of learner response represents a higher or lower order process. Answering a challenging multiple-choice test item may represent the 'application' level of the taxonomy unless, of course, the correct answer to the question was discussed and practiced during a previous class.
There is, obviously, peril in attempting to inferentially tie down the unobservable. An even more substantial shortcoming of taxonomic analyses of educational objectives can be summed up with a succinct "So what?". Putting it another way, even if we could do so with accuracy, what is the meaningful yield from isolating the taxonomic levels of educational objectives? Are educational decision-makers truly advantaged as a consequence of such classification machinations? Is it, in fact, the case that higher level objectives are
more laudable than lower level objectives? Or, more sensibly, must we not really determine the defensibility of an educational objective on its intrinsic merits rather than its taxonomic pigeonhole?

The taxonomies of educational objectives brought to educational evaluators a helpful heuristic when used as a broad-brush way of viewing a program's educational aspirations. It is useful to recognize, for example, that no affective objectives are sought by a program and that its cognitive objectives deal predominantly with rote-recall knowledge. But, beyond such general appraisals, fine-grained taxonomic analyses yield dividends of debatable utility. During the first decade of serious attention to educational objectives, the taxonomies of educational objectives became new hammers for many educational evaluators who, consistent with the law of the hammer, discovered numerous things in need of hammering. For the taxonomies of educational objectives, hammering-time has ended.
Measurement and Objectives
Any educational evaluator who seriously believes that objectives-achievement ought to be a key element in the evaluation of educational programs must reckon with a major task, namely, determining whether objectives have, in fact, been achieved. It is because of this requirement that we must consider the evolving manner in which educational evaluators have employed measuring devices to discern whether educational objectives have been achieved.

There are some educational objectives, of course, that require little or nothing in the way of assessment devices to indicate whether they have been accomplished. For instance, if the chief objective of a "make-school-interesting" campaign is to reduce absenteeism, then the verification of that objective's attainment hinges on a clerk's counting absence records. If the nature of the educational objective being considered does not require the use of formal assessment devices such as tests or inventories, then objectives can be employed without much attention to assessment considerations. Yet, because program objectives in the field of education often focus on improving the status of students' knowledge, attitudes, or skills, determination that most educational objectives have been achieved hinges on the use of some type of formal assessment device.

During the past two decades, educational evaluators have learned some important lessons about how to use such assessment devices in order to establish whether educational objectives have been achieved. For one thing, it is now generally accepted that, for purposes of educational evaluation, criterion-referenced tests are to be strongly preferred over their norm-referenced counterparts. The distinction between traditional norm-referenced tests and the newer criterion-referenced tests had been initially drawn by Robert Glaser in 1963 (Glaser, 1963). Glaser characterized norm-referenced measures as instruments yielding relative interpretations, such as the percentage of examinees in a normative group whose performance had been exceeded by a given examinee's performance. In contrast, criterion-referenced measures yielded absolute interpretations, such as whether or not an examinee could master a well-defined set of criterion behaviors. Although the utility of this distinction was generally accepted, precious little attention was given to criterion-referenced testing by educational measurement specialists until the 70s.
It was during the 70s, indeed, that increasing numbers of educational evaluators began to recognize a continuing pattern of "no significant differences" whenever norm-referenced achievement tests were employed as indicators of an educational objective's achievement. Evaluators began to recognize that the generality with which norm-referenced test publishers described what those tests measured made it difficult to inform decision-makers of what a program's effects actually were. Of equal seriousness was the tendency of norm-referenced test-makers to delete test items on which students performed well because such items contributed insufficiently to detecting the variance among examinees so crucial for effective norm-referenced interpretations. Such deleted items, however, often tapped the very content being stressed by teachers; hence norm-referenced tests frequently turned out to be remarkably insensitive to detecting the effects of even outstanding instructional programs. Although almost all educational evaluators recognized that norm-referenced tests were better than nothing, the 1970s were marked by growing dissatisfaction on the part of evaluators with norm-referenced achievement tests.

Yet, it was soon discovered that not every test paraded by its creators as a "criterion-referenced" assessment device was, in fact, a meaningful improvement over its norm-referenced predecessors. Many of the so-called criterion-referenced tests of the 70s were so shoddy that they were better suited for the paper-shredder than for use in educational evaluations. The chief dividend of a well constructed criterion-referenced test, at least from the perspective of educational evaluators, is that it yields an explicit description of what is being measured. This, in turn, permits more accurate inferences to be made regarding the meaning of test scores. And, finally, because of such heightened clarity, educational evaluators can provide decision-makers with more meaningfully interpretable data. This clarification dividend, of course, flows only from properly constructed criterion-referenced tests, not merely any old test masquerading in criterion-referenced costuming. Educational evaluators, burned too often by poorly constructed criterion-referenced tests, have learned to be far more suspicious of such tests. Although preferring tests of the criterion-referenced genre, evaluators are now forced to scrutinize such assessment contenders with care. It is impossible to tell whether an educational objective has been achieved if the assessment device used to assess its attainment is tawdry.

The previous discussion was focused on tests used to determine the attainment of educational objectives in the cognitive domain. We have also learned a trick or two regarding assessment of affectively oriented educational objectives. The major lesson learned is that there are precious few assessment devices at hand suitable for the determination of students' affective status such as, for example, their attitudes toward self, school, or society. There was a fair amount of pro-affect talk by U.S. educators in the 60s and 70s. It was believed by many evaluators that this attention to affect would be followed by the creation of assessment instruments appropriate for tapping affective outcomes of interest. Regrettably, we are still awaiting the long overdue flurry of activity in the construction of affective instruments suitable for determining whether affectively oriented objectives have been achieved.
Most of the home-grown affective measures created by evaluators during the past decade clearly required far more cultivation. Too often an evaluator with modest measurement acumen has attempted to churn out quickly an affective inventory (modally, a "modified Likert-type" scale), then toss it, unimproved, into the evaluation fray. Not surprisingly, such quickly contrived assessment devices usually end up serving no one well.
Slapdash affective measures yield information only as good as the effort that went into obtaining it. Evaluators now recognize that: (1) there is not a resplendent array of on-the-shelf affective measures to be employed in evaluations; and (2) the development of acceptable affective assessment instruments is a task more formidable than formerly believed.

This discussion of the relationship between educational objectives and measures leads naturally into a choice-point for evaluators. Should educational evaluators focus on program objectives initially, then move toward measurement, or leap directly toward measurement and dispense with objectives altogether? Although it is true that objectives without assessment often represent only a planner's rhetoric, it is difficult to deny that framing one's intentions clearly can aid in both program design and the subsequent formative and/or summative evaluation of the program. There seems to be virtue, however, in first determining what the range of available assessment options actually is, then framing one's objectives in such a way that those objectives are linked to available instrumentation. To formulate educational objectives while remaining oblivious of assessment possibilities is folly.
Performance Standards as Separable
When Mager (1962) proffered his conception of an acceptable instructional objective, he argued that such an objective would: (1) identify and name the student behavior sought; (2) define any important conditions under which the behavior was to occur; and (3) define a level of acceptable performance. Thus, an example of an acceptable Magerian objective might be something like this:

The student must be able, in 10 minutes or less, to name correctly the items depicted in 18 of 20 previously unencountered blueprints.
Note that the student behavior, "to name", is depicted, as are two conditions, that is, the necessity to carry out the naming "in 10 minutes or less" plus the fact that the blueprints were "previously unencountered". In addition, the objective establishes the performance level as "18 of 20" correct. Yes, in 1962 Bob Mager would have smiled a happy smile when viewing such an instructional objective. But that was 1962, and we have learned by now that the behavior identified in the objective and the performance level sought from that behavior are decisively separable. Indeed, there is substantial virtue in keeping an objective's behavior and performance standards distinct. If evaluators are forced to attach a performance standard to educational objectives in advance of a program's implementation, it is almost always the case that the standard will be arbitrary and indefensible. In most settings it makes far more sense to await the program's impact, then render an experience-informed judgment regarding acceptable levels of performance. During the most recent decade we have seen considerable attention directed toward the establishment of defensible performance standards (e.g. Zieky & Livingston, 1977). Most of this work has been linked to competency testing programs and the need to establish acceptable performance on competency tests for, say, the awarding of high school diplomas. There is no reason that educational evaluators cannot profit from this standard-setting literature
and apply it to judgements regarding hoped-for program effectiveness. The isolation of the student behaviors to be sought is, however, a different task than the determination of how well those behaviors must be displayed. Mixing behavioral aspirations and performance standards in educational objectives adds not clarity but confusion.
Guideline Time
In review, a number of experience-derived observations have been offered regarding the evolution of educational objectives as used by evaluation personnel during the preceding 20-plus years. These observations lead naturally to a series of recommended guidelines for today's educational evaluators, that is, those evaluators who would employ educational objectives in their work. Perhaps, before turning to the five guidelines to be recommended, it would be fair to report that in current educational evaluations the role of educational objectives is typically modest. While early educational evaluators were frequently caught up in the importance of objectives, today's educational evaluators typically focus on evidence of program effects irrespective of whether those effects were ensconced in prespecified objectives. Although objectives are regarded as useful mechanisms for inducing clarity of intent, they are not considered the sine qua non of sensible evaluations.

Guideline No. 1: Educational evaluators should formulate or recommend educational objectives so that the degree to which an objective has been achieved can be objectively determined.
Comment: Hindsight informed us that the early advocates of behaviorally stated instructional objectives chose the wrong label, namely, "behavioral" objectives. Not only did that descriptor result in acrimonious repercussions in the 1960s, it still arouses the ire of critics (e.g. Wesson, 1983). But, whether characterized as "behavioral", "measurable", or "performance" objectives, an objective that does not permit us to judge reliably when it has been attained is of limited utility to educational evaluators. Elsewhere it has been argued that, for purposes of curricular design, a modest proportion of objectives targeted at the ineffable may be of value (Popham, 1972). Yet, for educational evaluators who must communicate with decision-makers, the ineffable has its limitations. Thus, if an objective is stated nonbehaviorally, it should be accompanied by an indication of how the objective's attainment will be established.

Guideline No. 2: Educational evaluators should eschew numerous narrow-scope educational objectives and, instead, focus on a manageable number of broad-scope objectives.
Comment: Educational evaluation is a decision-oriented endeavor. Decision-makers need to have access to information that will increase the wisdom of their decisions. Decision-makers who are inundated with oodles of itsy-bitsy objectives will pay heed to none. Too much information befuddles the prospective user of that information. Educational evaluators must encourage program personnel to coalesce narrow-scope objectives under broader rubrics so that decision-makers need contemplate only a comprehensible amount of data. Evaluators who countenance a gaggle of narrow-scope objectives do a disservice to both program personnel and decision-makers. Less is truly more.
Guideline No. 3: Educational evaluators should employ the Taxonomies of Educational Objectives only as gross heuristics, not as fine-grained analytic tools.
Comment: When evaluators use the taxonomies to remind themselves and program personnel that there are, in truth, affective, psychomotor, and cognitive objectives, and that most cognitive objectives articulated by today's educators demand only the recall of information, they have discovered the bulk of taxonomic treasure. More detailed use of the taxonomies typically results in analyzing with a microscope objectives fashioned with a sledgehammer.

Guideline No. 4: If measurement devices are required to ascertain an educational objective's attainment, educational evaluators should employ criterion-referenced rather than norm-referenced measures.
Comment: This guideline assumes the existence of properly constructed criterion-referenced tests. Such an assumption requires sufficient sophistication on the part of the evaluator to distinguish between criterion-referenced tests that are dowdy and those that are dandy. It was also suggested in this analysis that, whenever possible, educational evaluators should encourage program personnel to formulate their objectives so that they are operationally linked to acceptable measuring devices.

Guideline No. 5: Educational evaluators should keep separate the behavioral focus of educational objectives from the performance levels expected of students.
Comment: Above all, educational objectives should bring with them a degree of clarification which elevates the rigor of rational decision-making. Embodiment of standard setting in the objective itself typically adds confusion, not clarity, to statements of instructional intent. Moreover, the premature setting of performance standards often leads to indefensible and arbitrary performance expectations.
After All - Only Two Decades
As one looks back over two-plus decades of educational evaluation and the role that educational objectives have played during that period, it is clear that interest in objectives per se has abated. For a while, behavioral objectives were like new toys. And, it seemed, everyone wanted to play. Experience, more than 20 years worth of it, has shown us that there are other evaluation games worth playing. Test results, these days, capture far more of the decision-maker's attention than assertions about how many objectives were or were not achieved. As has been stressed throughout this analysis, statements of instructional intent can accomplish much good. When we clarify our educational aspirations in precise form, it is both easier to fashion programs to achieve those aspirations and to mold evaluation endeavors that help us see to what extent our objectives were achieved. Clearly stated educational objectives can help an evaluator if their import is not overemphasized. Skillfully employed, educational objectives can contribute to more defensible evaluations.
References

Arnstine, D. G. (1964). The language and values of programmed instruction: Part 2. The Educational Forum, 28, 337-345.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: Handbook I, the cognitive domain. New York: Longmans, Green (David McKay).
Eisner, E. W. (1967). Educational objectives: Help or hindrance? School Review, 75(3), 250-266.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18, 519-522.
Glaser, R. (Ed.). (1965). Teaching machines and programmed learning, II: Data and directions. Washington: Department of Audio Visual Instruction, N.E.A.
Gronlund, N. E. (1971). Measurement and evaluation in teaching. New York: Macmillan.
Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives: The classification of educational goals. Handbook II: Affective domain. New York: David McKay.
Mager, R. F. (1962). Preparing objectives for programmed instruction. San Francisco: Fearon Press.
Popham, W. J. (1964). The teacher empiricist. Los Angeles: Aegeus Press.
Popham, W. J. (1972). Must all objectives be behavioral? Educational Leadership, 29(7), 605.
Popham, W. J., Eisner, E. W., Sullivan, H. J., & Tyler, L. L. (1969). Instructional objectives (AERA monograph series on curriculum evaluation). Chicago: Rand McNally.
Sanders, N. M. (1966). Classroom questions: What kinds? New York: Harper & Row.
Scriven, M. (1967). The methodology of evaluation. In R. Tyler, R. Gagne & M. Scriven (Eds.), Perspectives of curriculum evaluation (AERA monograph series on curriculum evaluation). Chicago: Rand McNally.
Scriven, M. (1971). Evaluation skills. An audiotape distributed by the American Educational Research Association, Washington, DC.
Skinner, B. F. (1958). Teaching machines. Science, 128, 969-977.
Tyler, R. W. (1950). Basic principles of curriculum and instruction. Chicago: University of Chicago Press.
Wesson, A. J. (1983). Behaviourally defined objectives: A critique. Part one. The Vocational Aspect of Education, XXXV(91), 51-58.
Zieky, M. J., & Livingston, S. A. (1977). Manual for setting standards on the basic skills assessment tests. Princeton, NJ: Educational Testing Service.
CHAPTER 4
DESIGNING EVALUATION STUDIES: A TWENTY-YEAR PERSPECTIVE

JERI BENSON* and WILLIAM B. MICHAEL†
*University of Maryland, U.S.A. and †University of Southern California, U.S.A.
Abstract

Subsequent to a statement of the purposes of evaluation designs in judging the effectiveness of educational programs carried out in field settings, a description of experimental, quasi-experimental, survey, and naturalistic designs is presented. Emphasis is placed upon the identification of sources of invalidity in these designs that may compromise the accuracy of inferences regarding program effectiveness. It was concluded that use of quasi-experimental designs in combination with surveys and naturalistic inquiry affords an opportunity to replicate in simple studies selected aspects of a complex program. An information base can be acquired to permit a generalized inference of promising causal connections between treatments and observable changes in program outcomes for participating groups in diverse settings.
It has been more than 20 years since the Elementary and Secondary Education Act (ESEA) was enacted by Congress. The Act provided massive amounts of federal funds for the improvement of education. For the field of evaluation, two important aspects of the Act were that all programs receiving ESEA funds under Titles I and III be evaluated and that a report be forwarded to an agency of the federal government. Many of the personnel in local education agencies were not prepared to evaluate the programs because of a lack of training in program evaluation. Indeed, as the discipline of program evaluation was in its infancy, few educators were trained in its techniques. In this circumstance, many of the early evaluation designs drew from the experimental designs employed in psychology. As Stufflebeam (1971) pointed out, experimental designs have had substantial utility for evaluating the attainment of goals or objectives of comprehensive ongoing programs involving measurable products (summative or terminal evaluation), but only minor usefulness for evaluating the process components underlying the development and implementation of programs (formative evaluation). It is clear why early evaluation studies failed to detect program effects when using traditional experimental designs that assume a fixed treatment and that focus more on program impact than on program strategies and implementation. Evaluators have learned over the past 20 years that many contextual factors such as social and political forces do influence programs and that true experimental designs are not so useful in detecting and studying these important
extraneous factors as are other types of designs (Cronbach, 1982, pp. 26-30). The three major objectives of this paper were: (1) to indicate the purposes of evaluation design; (2) to describe four types of designs used most often in program evaluation (experimental, quasi-experimental, survey, and naturalistic); and (3) to identify sources of invalidity in designs that may compromise the accuracy of inferences regarding program effectiveness. Special emphasis is placed upon the use of quasi-experimental designs that are especially appropriate in evaluating the effectiveness of educational programs in field settings.

Purposes of Evaluation Design
The twofold purpose of design in evaluation or research is to provide a framework for planning and conducting a study. In the context of program evaluation, two major components of design include: (1) specification of exactly what information is needed to answer substantive questions regarding the effectiveness of the program; and (2) determination of an optimal strategy or plan through which to obtain descriptive, exploratory, or explanatory information that will permit accurate inferences concerning possible causal relationships between the treatment or treatments employed and observable outcomes. In addition, Cronbach (1982, Chapter 1) has suggested that the design should anticipate the primary audience of the final report concerning program effectiveness such that methods and analyses conform to their level of understanding and informational needs.
Four Basic Types of Evaluation Design
For the purposes of this paper the procedures employed to obtain evaluation information have been classified into four design types: (1) experimental; (2) quasi-experimental; (3) survey; and (4) naturalistic. As mentioned previously, early evaluation designs were essentially experimental. It was soon discovered, however, that experimental designs were frequently not flexible enough to encompass all aspects of a particular program operating in a field setting. Thus, adaptations of these designs were developed and termed quasi-experimental. In addition, educational programs, which were often new or innovative, were implemented at many sites involving a large number of participants. In these situations, survey designs, which were borrowed from sociology, were used to obtain descriptions of how programs operated as well as information regarding who was affected. More recently, naturalistic studies have been introduced to provide detailed information to decision-makers. A variation on the previously cited designs that has received attention in the past (Reichardt & Cook, 1979) and that is being revitalized (Cronbach, 1982) is to combine quasi-experimental or experimental designs with naturalistic designs in order to evaluate more effectively both the implementation and the impact of the program.
Experimental Designs

Used to study cause and effect relationships, the true experimental design is considered
the most useful one to demonstrate program impact if conditions of randomization in selection of participating units and in the assignment of treatments can be met (Boruch & Wortman, 1979). This design is differentiated from others by the fact that the evaluation units have been assigned to the treatment and control conditions at random. Evaluation units can be individuals, groups of individuals, institutions, regions, states, or even nations. The program under evaluation is usually defined as the treatment condition. The control condition may be a traditional, neutral, or placebo treatment or no treatment at all. The key element is that the units to be evaluated have been either randomly selected or randomly assigned to at least one treatment and one control condition. The study is then implemented, and one or more criterion measures are administered after the treatment (and sometimes before the treatment). Finally, differences between the treated and control groups are compared to determine the relative effectiveness of the competing conditions.

Campbell and Stanley's Three True Experimental Designs
Campbell and Stanley (1966) identified three true experimental designs which they termed: (1) the pretest-posttest control group design (Design 4); (2) the Solomon four-group design (Design 5); and (3) the posttest-only control group design (Design 6). Campbell and Stanley presented a list of threats to the validity of inferences regarding cause and effect that could be formulated from the outcomes arising from use of these three designs as well as of several quasi-experimental ones (the threats being considered in a subsequent section). As Design 5 is actually a combination of Design 4 and Design 6, these two designs are detailed first. If E and C, respectively, represent experimental and control units, R stands for random selection of units and assignment of treatment (or lack of treatment) to the units, O1 and O3 constitute pretests administered to the E and C units, respectively, X portrays the treatment, and O2 and O4 indicate the posttests given to the E and C units, respectively, the paradigm describing Design 4 is as follows:

E   R   O1   X   O2
C   R   O3        O4
In the instance of Design 6, the pretest is absent so that it will not react with the treatment. If O5 and O6 stand for the posttests taken by participants in the E and C units, respectively, and if all other letters remain the same as for Design 4, the paradigm representing Design 6 is as follows:

E   R   X   O5
C   R        O6
The composite of Design 4 and Design 6 yields Design 5, one of the most powerful designs available for controlling threats to the validity of a cause and effect inference by ruling out alternative hypotheses of potential causation. Design 5 is diagrammed as follows:

E   R   O1   X   O2
C   R   O3        O4
E   R         X   O5
C   R              O6
It is interesting to note that a 2 x 2 table with margins of No X and X and of No Pretest and Pretest can be established, as in a factorial design, to ascertain whether an interaction effect exists between taking or not taking a pretest and participation or lack of participation in the treatment. The absence of any significant interaction effect between the pretest and treatment in a pilot study would indicate that Design 4 probably could be used quite effectively. These three experimental designs are characterized by control of most extraneous variables thought to affect the outcomes of a program. In addition to an experimental unit that receives the treatment or intervention experience, a control unit provides a baseline from which to judge the impact of the treatment.
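The interaction check described above can be carried out as a two-way analysis of variance on the posttest scores. The following is a minimal sketch in Python, assuming simulated data and illustrative variable names; it is offered only to make the logic concrete, not as a procedure prescribed in this chapter.

# Minimal sketch (not from the chapter): checking the pretest-by-treatment
# interaction in a Solomon four-group layout with a 2 x 2 ANOVA.
# All variable names and simulated effect sizes are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50  # units per cell (four cells: pretested/not x treated/not)

rows = []
for pretested in (0, 1):
    for treated in (0, 1):
        # Assumed data-generating model: a treatment effect of 5 points,
        # no true pretest sensitization effect, noise SD of 10.
        post = 50 + 5 * treated + rng.normal(0, 10, n)
        rows.append(pd.DataFrame({"pretested": pretested,
                                  "treated": treated,
                                  "post": post}))
data = pd.concat(rows, ignore_index=True)

# Two-way ANOVA on the posttest; the pretested:treated term is the
# interaction that Design 5 allows one to examine.
model = smf.ols("post ~ C(pretested) * C(treated)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
# A non-significant interaction suggests the pretest does not react with
# the treatment, so the simpler Design 4 could be used with more confidence.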
Limitations to the Three True Experimental Designs
Although experimental designs are very powerful, they are often difficult to apply in field studies, as programs (treatments) frequently are implemented differentially at each locale, where socio-political factors may interact with the treatment in a way not controllable by the design. When these factors are identified, they are often hard to measure. An experimental design is best employed to evaluate the impact of a program after the program has been fully implemented and pilot tested. Even then, employing an experimental design in which nearly all extraneous variables except the treatment have been controlled may result in the program being so sterile that it is ungeneralizable. Even with randomization there may be compensatory efforts on the part of members in the control unit, or resentment or defeatism that can lead to reduced motivation and effort.
Advantages Inherent in Replication with Small Experimental Studies
Small experimental investigations run on separate aspects of a complex program constitute a practical alternative for determining the effectiveness of selected aspects of a treatment in diverse settings. Each aspect of the program, if it is multifaceted, can comprise an experimental study. For example, a matrix approach to smaller experimental studies is possible where each layer of the program is evaluated by type of unit or participant. This approach allows one to determine for which type of unit or participant which aspect or aspects of the program are most effective. Saxe and Fine (1979) have illustrated how multiple studies within an evaluation can be designed.
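The matrix approach can be made concrete with a small sketch. The program aspects, participant types, and data below are invented for illustration; each cell of the matrix is analyzed as its own small two-group study.

# Minimal sketch (illustrative only): the "matrix" approach of running small
# two-group comparisons for each program aspect crossed with each type of
# participant. Data, effect sizes, and labels are simulated assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
aspects = ["tutoring", "materials", "parent_outreach"]   # hypothetical program aspects
participant_types = ["primary", "secondary"]             # hypothetical unit types

for aspect in aspects:
    for ptype in participant_types:
        # Pretend only "tutoring" helps, and only for primary participants.
        effect = 4.0 if (aspect == "tutoring" and ptype == "primary") else 0.0
        treated = rng.normal(50 + effect, 10, 40)
        control = rng.normal(50, 10, 40)
        t, p = stats.ttest_ind(treated, control)
        print(f"{aspect:15s} x {ptype:9s}  t = {t:5.2f}  p = {p:.3f}")
# Each cell is a small, separately interpretable study; together the cells
# indicate which aspect works for which kind of participant.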
Quasi-Experimental Designs
Because experimental designs have been well documented in the psychological literature, the focus of this presentation is directed toward more recent quasi-experimental designs and naturalistic forms of inquiry. The purpose of a quasi-experimental design is to approximate a true experimental design, typically in field settings where control or manipulation of only some of the relevant variables is possible (Isaac & Michael, 1981). The distinguishing feature of the quasi-experimental design is that the evaluation units have not been randomly selected and often have not been randomly assigned to treatment conditions. This situation can occur, for example, in compensatory education programs where all eligible evaluation units are mandated to receive the innovative program. In this case,
it is not possible to use randomization, as such a procedure would be highly disruptive to existing units or intact groups. In this situation, many threats to the validity of a causal inference between one or more treatments and one or more outcomes can be expected to be present. Thus, somewhat less confidence can be placed in the evaluation findings from the use of quasi-experimental designs than from the employment of true experimental designs. In comparison with the true experimental designs, quasi-experimental ones, however, offer greater flexibility for field settings and sometimes afford greater potential for generalizability of results in realistic day-to-day environments. It is often possible to 'patch up' quasi-experimental designs when difficulties begin to occur. For example, a proxy pretest or multiple pretest indicators can be introduced to reduce a potential selection bias or to correct for it.

Two Major Types of Quasi-Experimental Design
Campbell and Stanley (1966) and Cook and Campbell (1979) have reported two major types of quasi-experimental procedures: (1) nonequivalent control-group designs; and (2) interrupted time-series designs. Each of these two major methods is broken down into specific sets or subclasses of designs. For the nonequivalent control-group situation, Cook and Campbell (1979) have proposed 11 subdesigns, and for the interrupted time-series paradigm they have offered six subdesigns. Mention should be made of the fact that in a quasi-experimental design the terms nonequivalent control group and comparison group have been substituted for the term control group employed in true experimental designs.

An Illustrative Paradigm for Each of the Two Major Types of Quasi-Experimental Design
By far, the most common quasi-experimental design has been what Campbell and Stanley (1966) have termed the nonequivalent control-group design (Design 10) and which Cook and Campbell (1979) have identified as the untreated control-group design with pretest and posttest. This design, which does not involve random selection of participating units, can be represented by the following paradigm, the symbols of which have been previously defined:

E   O1   X   O2
- - - - - - - -
C   O3        O4

The broken line in the diagram separating the two groups indicates that no formal means such as randomization has been employed to assure the equivalence of the units; in other words, a potential selection bias could exist. In this design, it is assumed, although not always correctly, that the treatment is assigned randomly to one or the other of the two units. The greatest threats to the validity of this design are sources of error to be described in the next (third) major division of the paper, namely, differential selection, differential statistical regression, instrumentation, the interaction of selection with maturation (the presence of differential growth rates in the two units during the treatment period), the interaction of selection and history, the interaction of pretesting with the treatment, and the interaction of selection (group differences) with the treatment.
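One common way to analyze data from this design, sketched below under the assumption of simulated scores and illustrative variable names, is to adjust the posttest comparison for the pretest; the sketch is illustrative only and does not remove the threats just listed.

# Minimal sketch (simulated data, illustrative names): one common way to
# analyze the nonequivalent control-group design is to adjust the posttest
# (O2, O4) for the pretest (O1, O3), since the groups were not randomized.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 120

# Assume a pre-existing difference between the groups (selection bias)
# plus a true treatment effect of 3 points on the posttest.
group = np.repeat([1, 0], n)                      # 1 = E (treated), 0 = C
pre = 48 + 4 * group + rng.normal(0, 10, 2 * n)   # E starts higher: nonequivalence
post = 5 + 0.9 * pre + 3 * group + rng.normal(0, 6, 2 * n)

data = pd.DataFrame({"group": group, "pre": pre, "post": post})

# Covariance-adjusted comparison: the 'group' coefficient estimates the
# treatment effect after conditioning on the pretest.
model = smf.ols("post ~ pre + group", data=data).fit()
print(model.params["group"], model.pvalues["group"])
# Note: this adjustment does not remove every threat listed in the text
# (e.g., selection-maturation interaction or differential regression).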
Within the context of the interrupted time-series design, one of the most promising is the multiple time-series design described by Campbell and Stanley (1966) as Design 14 and by Cook and Campbell (1979) as the interrupted time-series with a nonequivalent no-treatment control group time-series. This design is quite similar to the previous one except for the fact that several pretest and several posttest measures are present. This design may be diagrammed as follows:

E   O  O  O  O   X   O  O  O  O
- - - - - - - - - - - - - - - -
C   O  O  O  O        O  O  O  O

This particular design affords the advantage of establishing fairly reliable baseline data in the pretest observations and of indicating through several posttests sustained effects of the treatment. Such a design can be quite useful in evaluating the long-term effects of psychotherapy or various forms of educational intervention. A plot of the regression lines (often broken) of the O measures for E and C units against a sequence of time points corresponding to each of the O symbols may reveal that, after the application of the treatment, a substantial separation between regression lines exists if the treatment has been effective. The main threats to the validity of the design would be history, instrumentation if it immediately follows the treatment, the interaction of the testing at various periods with the treatment, and the interaction of selection (dissimilarity in the groups) with the treatment.

Ex Post Facto Design

A form of quasi-experimental design that is frequently employed but not recommended has been the ex post facto one assigned by Campbell and Stanley (1966) the name of static-group comparison and classified by them as Design 3. In this design, data have been obtained for a treatment and comparison group after the treatment has occurred. For example, an evaluator might wish to determine the benefits of a college education by comparing competencies of a group of college graduates with a group of noncollege graduates in the same community. Frequently, efforts are made to match the groups after the treatment variable has occurred. This design may be diagrammed as follows:

E   X   O1
C        O2
The static-group comparison procedure contains virtually every possible threat to the validity of any inference regarding a causal relationship between the treatment and an outcome. At best, the method might hold some promise for exploratory research or initial hypothesis formulation.
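Returning to the multiple time-series design presented above, the separation of regression lines before and after the treatment can be examined with a simple segmented regression. The sketch below uses a simulated series and invented names; it is one possible analysis, not one mandated by the designs themselves.

# Minimal sketch referring back to the multiple time-series design above
# (several pretest and posttest observations around an intervention X).
# The series is simulated; segmented regression is one common way to look
# for a post-intervention shift, not a procedure prescribed by the chapter.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_pre, n_post = 8, 8
time = np.arange(n_pre + n_post)
after = (time >= n_pre).astype(float)          # 1 after the intervention
time_after = after * (time - n_pre)            # post-intervention slope change

# Experimental series: level jumps by 6 after X; control series does not.
y_e = 40 + 0.5 * time + 6 * after + rng.normal(0, 2, time.size)
y_c = 40 + 0.5 * time + rng.normal(0, 2, time.size)

def segmented_fit(y):
    X = sm.add_constant(np.column_stack([time, after, time_after]))
    return sm.OLS(y, X).fit().params  # [intercept, trend, level shift, slope shift]

print("E series:", segmented_fit(y_e))
print("C series:", segmented_fit(y_c))
# A level shift in E but not in C is the pattern the multiple time-series
# design is intended to reveal; history remains a threat if some other event
# coincided with X for the E unit only.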
Changing Conditions in a Field Setting
Research in field settings typically involves a changing or evolving program (treatment). Thus, quasi-experimental designs need to be flexible and adaptable to the situation. The program may have been developed with the view that modifications will be introduced progressively as feedback from intermediate evaluations becomes available. In another situation the program may change because of contextual forces such as social or political pressures arising in conjunction with preliminary outcomes. In any event the design needs to be fluid, not static, to provide an appropriate framework for conducting an ongoing evaluation during the formative stages of program development as well as a summative evaluation of program impact.
Suggestions for Selecting a Comparison Group in the Absence of Randomization

Wolf (1984) has provided several useful suggestions for selecting a comparison group when randomization is not possible. First, one might be able to use a cohort from a previous or later year that was not affected by a change in the population or environment from one year to the next. A limitation is that data necessary for the evaluation may not have been recorded for the cohort group of interest. A second approach is to utilize a neighboring institution as a comparison group. Two advantages are that program imitation is not so likely to take place as in the instance of an institution located in a foreign setting and that data gathering can occur concurrently in both groups. A disadvantage is that the groups may not be comparable on all dimensions. Moreover, the comparison unit may choose to withdraw at a critical data gathering time if appropriate incentives are not built into the study. A third strategy is that, when a standardized instrument is used, norm group data reported in the technical manual can be utilized for comparison purposes. This procedure can be quite risky, however, as the norm group may not be similar to the one being evaluated. In many evaluation situations, the purpose of a program is to aid some extreme group, either advantaged or disadvantaged. The subjects in these extreme groups often are not well represented (if they are represented at all) in most norm samples.

Replication of Evaluation Studies

Wolf (1984, p. 147) has pointed out that replication studies can improve the validity of quasi-experimental designs. Using different cohorts to study program impact would eliminate the difficulty imposed by carry-over effects (positive or negative transfer effects) often encountered in counter-balanced experimental designs involving cognitive processes. Building replications into designs allows programs to be studied over longer periods of time and thus enhances the probability that any potential effect of a program can be detected. The concept of using replications in quasi-experimental studies is similar to that found in carrying out multiple small-scale experimental investigations discussed earlier, in that multiple studies conducted at various levels or layers of a program represent separate replications. Thus, both approaches can strengthen the extent to which the effectiveness of treatments can be generalized (Saxe & Fine, 1979).

Survey Designs

Survey designs were developed to afford an efficient method of collecting descriptive data regarding: (1) the characteristics of existing populations; (2) current practices, conditions, or needs; and (3) preliminary information for generating research questions (Isaac & Michael, 1981, p. 46). Surveys are frequently used in sociology, political science, business, and education to gather systematically factual information necessary for policy decisions. Survey designs proceed first by identifying the population of interest. Next, the objectives are clarified and a questionnaire (structured or unstructured) is developed and field tested. A relevant sample is selected, and the questionnaire is administered to its members by mail or telephone or in person. The results are then tabulated in a descriptive fashion (i.e. as in reporting means, frequencies, percentages, or cross-tabulations). Because the nature of the survey is basically descriptive, inferential statistics are usually not appropriate for summarizing the data obtained from the survey design. In addition, survey data often are used in subsequent ex post facto designs discussed previously. Documentation regarding the response rate also is an important consideration in survey studies. Although survey designs afford an efficient and relatively inexpensive method of gathering data among large populations, the response rate is considered a severe limitation. If the rate is low, the data may not be representative and accurate indicators of the perceptions of a population. Formulas have been developed to determine the minimum proportion of respondents required to be confident that the sample provides a relevant and accurate representation of the population (Aiken, 1981). Questionnaires used in survey designs can be developed to measure status variables (what exists) as well as constructs (hypothesized concepts) related to attitudes, beliefs, or opinions. When constructs are measured by a survey, some evidence of their validity must be addressed. A questionnaire may be highly structured or loosely structured as in a personal interview. Details on constructing questionnaires along with the advantages and disadvantages can be found in Ary, Jacobs, and Razavieh (1985) and in Isaac and Michael (1981). Survey designs have been criticized on the grounds that an attempt to standardize the questionnaire for all respondents may result in items that often are too superficial in the coverage of complex issues. In addition, survey designs frequently miss contextual issues that lead to a respondent's marking a particular alternative. To overcome these limitations, case studies or naturalistic designs have been adopted to be more responsive to the varied informational needs of decision makers.
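The descriptive tabulation and response-rate documentation described above might look as follows in practice. The data, variable names, and figures in this sketch are invented for illustration.

# Minimal sketch (invented data and variable names): the descriptive
# tabulation and response-rate documentation described above for survey
# designs. Nothing here is specific to any instrument cited in the chapter.
import pandas as pd

surveyed = 400          # questionnaires sent to the selected sample
returns = pd.DataFrame({
    "region":       ["north", "north", "south", "south", "south", "west"],
    "uses_program": ["yes", "no", "yes", "yes", "no", "yes"],
    "satisfaction": [4, 2, 5, 4, 3, 5],   # 1-5 rating
})

response_rate = len(returns) / surveyed
print(f"Response rate: {response_rate:.1%}")   # report this alongside results

# Frequencies and percentages for a categorical item
print(returns["uses_program"].value_counts(normalize=True) * 100)

# Mean of a rating item, overall and by region
print(returns["satisfaction"].mean())
print(returns.groupby("region")["satisfaction"].mean())

# Cross-tabulation of two categorical items
print(pd.crosstab(returns["region"], returns["uses_program"]))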
Naturalistic Designs
A major criticism of the three previous design strategies discussed up to this point is their failure to capture the context in which programs operate. In field settings, it is crucial to understand the contextual situation surrounding the program to be evaluated. The context, which includes types of participants, locales, and different occasions, can interact with the program in unique ways. Thus, a thorough understanding and documentation of the context in which the program is to function is usually as necessary as the product or outcome information obtained at the conclusion of the program. During the last 20 years, evaluators have become aware that social and political forces can do as much to shape and to alter program effects as can the program itself.

Several educators and evaluators have advocated that ethnographic or naturalistic designs be adopted for use in educational research and evaluations (Lincoln & Guba, 1985; Patton, 1980). These designs grew out of the need to study phenomena as they naturally occur in the field. Naturalistic designs draw heavily from the ethnographic techniques of anthropologists. The goal of a naturalistic/ethnographic study is to understand the phenomena being observed. Naturalistic designs, like ethnographies, imply a methodological approach with specific procedures, techniques, and methods of analysis. Lincoln and Guba (1985) have described naturalistic inquiry as a major paradigm shift in research orientations.
They have illustrated how the naturalistic paradigm differs from the positivist paradigm in five areas (axioms): (1) there are multiple realities; (2) it is impossible to separate the researcher from that being researched; (3) only hypotheses about individual realities are possible; (4) it is impossible to separate cause and effect relationships because of their simultaneous interaction; and (5) inquiry is value-bound. Thus, naturalistic designs differ on several basic points from positivist designs (experimental, quasi-experimental, or survey). The subjectivity that the researcher/evaluator brings to a study is openly confronted in naturalistic designs. Naturalistic designs provide in-depth investigations of individuals, groups, or institutions as they naturally occur. A major feature of this design has been the use of a human instrument (the observer) to collect, filter, and organize the incoming data. Naturalistic inquiry differs from surveys and experimental/quasi-experimental designs in that usually a relatively small number of units is studied over a relatively large number of variables and conditions (Isaac & Michael, 1981). In the past, naturalistic approaches were thought to be useful as background information in planning an evaluation, in monitoring program implementation, or in giving meaning to statistical data. Lincoln and Guba (1985) and Skrtic (1985) have suggested that naturalistic methodologies are more than supportive designs for the more quantitatively oriented evaluation and research investigations. In fact, these writers have maintained that naturalistic inquiry affords a sufficient methodology to be the only one used in an evaluation study.

In establishing naturalistic inquiry as a distinct methodological approach, Lincoln and Guba (1985, Chapter 9) have developed guidelines for conducting a naturalistic study. The basic design differs from the positivist perspective in that it evolves during the course of the study; it is not established prior to the study. Some elements of the design, however, can and must be prespecified, such as: (1) establishing the focus of the study; (2) determining the site(s) of data collection and instrumentation; (3) planning successive phases of the study; and (4) establishing the trustworthiness of the data. The concept of trustworthiness is similar to what Campbell and Stanley (1966) have termed validity. Guba (1981) proposed a different terminology under the general heading trustworthiness that he has perceived is more nearly appropriate for naturalistic designs. A meaningful naturalistic study should have credibility (internal validity), transferability (external validity), dependability (reliability), and confirmability (objectivity). Lincoln and Guba (1985) have suggested several techniques for implementing (Chapter 10) and for establishing the trustworthiness of the study (Chapter 11).

To date, there have been relatively few published investigations employing true ethnographic or naturalistic designs in educational evaluations. In one such study, Skrtic (1985) reported on the implementation of The Education for All Handicapped Children Act in rural school districts. This study is important for three reasons. First, it represents the first national, multi-site evaluation using the naturalistic inquiry methods espoused by Lincoln and Guba (1985).
Second, it takes a design strategy thought to be useful in studying a few evaluation units in depth and extends it to a national study where many evaluation units are investigated exhaustively. Third, some of the difficulties in employing naturalistic methods are discussed from the standpoint of both practical field problems and the more theoretical issues of establishing the trustworthiness of the data. This latter aspect is critical for making naturalistic designs more useful and acceptable to the practicing evaluator than are alternative design strategies. Until this methodology is tested further, it should be employed only very cautiously as a descriptive or explanatory technique.
Sources of Invalidity for Experimental and Quasi-Experimental Designs

Contributions of Campbell and Stanley
The landmark work by Campbell and Stanley (1966) in identifying threats to the accuracy of causal inferences between treatments and outcomes in experimental and quasi-experimental designs alerted researchers in the social and behavioral sciences to two major areas of concern regarding the validity of their designs. The focus of their work was on the internal and external validity of experimental and quasi-experimental designs. Internal validity refers to the confidence one has that the findings in a given study or experiment are attributable to the treatment alone. Internal validity is strengthened when rival hypotheses that might account for an observed relationship between treatment and outcome can be eliminated from consideration. External validity refers to the generalizability of the findings to other populations, settings, and occasions. Thus internal validity issues are concerned with whether a cause and effect relationship exists between the independent and dependent variables studied, whereas external validity issues are concerned with the extent to which observed cause and effect relationships can be generalized from one study to another reflecting different types of persons in varied settings across numerous occasions. Designs are sought that will minimize the influence of extraneous factors that might confound the effect of the treatment.
Extensions by Campbell and Cook
More recently, Cook and Campbell (1979) have extended the work on the validity of designs to include two additional areas of concern as well as to refine and extend the original internal and external validity categories. In dealing with internal validity issues, a researcher is first interested in determining whether certain independent and dependent variables demonstrate a quantitative relationship. Because the basis of a relationship depends upon statistical evidence, the term statistical conclusion validity was introduced. Once a relationship is established between certain variables, it is then of interest to determine whether the relationship is causal. Ascertaining whether a causal relationship exists between a measure of a treatment (the independent variable) and an outcome (the dependent variable) is the central issue of internal validity. The researcher needs to be able to demonstrate that the causality between the variables is a function of some manipulation, not a function of chance or of one or more extraneous variables. Once a relationship has been identified between the independent and dependent variables in association with a possible causal hypothesis on the basis of eliminating several rival alternative hypotheses, the researcher needs to be fairly certain that the theoretical (latent) constructs that represent the observed (manifest) variables are correct. Cook and Campbell (1979) have called this step the establishment of the construct validity of causes or effects. They have pointed out that the mere labeling of operationally defined observed variables is not sufficient to establish the link to a theoretical construct. Verifying construct validity is a highly complex process. Once the observed variables representing a cause and effect relationship have been determined not to be limited to a particular or specific operational definition but can be extended to a more generalized abstract term, construct validity has been established.
Then, it is of interest to ascertain to what other persons, settings, and occasions the causal relationship can be extended. This latter issue is one of external validity. These four areas of validity are not grounded in theory so much as they are centered around the practical issues an evaluator or researcher must address in eliminating rival hypotheses and factors that may interfere with the particular hypotheses and variables of interest. The separate areas addressed under each of the four forms of validity were developed from the practical experiences of Cook and Campbell (1979) and their colleagues in developing and implementing various designs in field settings. The four forms of validity and the separate areas covered under each have been summarized in Table 4.1. For the three true experimental designs, the two quasi-experimental designs, and the ex post facto design described earlier, threats to their validity were cited. Additional information regarding what is meant by the kinds of threats to internal and external validity previously mentioned for these designs is set forth in the second column of Table 4.1.
Table 4.1
Threats to the Validity of Designs*

(1) Statistical conclusion validity - Was the study sensitive enough to detect whether the variables covary?
(a) Low statistical power - Type II error increases when alpha is set low and sample is small; also refers to power of some statistical tests.
(b) Violated statistical assumptions - All assumptions must be known and tested when necessary.
(c) Error rate - Increases, unless adjustments are made, with the number of mean differences possible to test on multiple dependent variables.
(d) Reliability of measures - Low reliability indicates high standard errors, which can be a problem with various inferential statistics.
(e) Reliability of treatment implementation - Treatments need to be implemented in the same way from person to person, site to site, and across time.
(f) Random irrelevancies in setting - Environmental effects which may cause or interact with treatment effects.
(g) Random heterogeneity of respondents - Certain characteristics in subjects may be correlated with dependent variables.

(2) Internal validity - Was the study sensitive enough to detect a causal relationship?
(a) History - Event external to treatment which may affect dependent variable.
(b) Maturation - Biological and psychological changes in subjects which will affect their responses.
(c) Testing - Effects of pretest may alter responses on posttest regardless of treatment.
(d) Instrumentation - Changes in instrumentation, raters, or observers (calibration difficulties).
(e) Statistical regression - Extreme scores tend to move to middle on posttesting regardless of treatment.
(f) Selection - Differences in subjects prior to treatment.
(g) Mortality - Differential loss of subjects during study.
(h) Interaction of selection with maturation, history and testing - Some other characteristic of subjects is mistaken for treatment effect on posttesting; differential effects in selection factors.
(i) Ambiguity about direction of causality - In studies conducted at one point in time, the problem of inferring direction of causality.
(j) Diffusion/imitation of treatments - Treatment group members share the conditions of their treatment with each other or attempt to copy the treatment.
(k) Compensatory equalization of treatments - It is decided that everyone in the experimental or comparison group receive the treatment that provides desirable goods and services.
(l) Demoralization of respondents - Members of the group not receiving treatment perceive they are inferior and give up.

(3) Construct validity of cause and effects - Which theoretical or latent variables are actually being studied?
(a) Inadequate explication of constructs - Poor definition of constructs.
(b) Mono-operation bias - Measurement of single dependent variable.
(c) Mono-method bias - Measurement of dependent variable in only one way.
(d) Hypothesis-guessing - Subjects try to guess researchers' hypothesis and act in a way that they think the researcher wants them to act.
(e) Evaluation apprehension - Faking well to make results look good.
(f) Experimenter expectancies - Experimenters may bias study by their expectations during study.
(g) Confounding constructs and levels of constructs - All levels of a construct are not fully implemented along a continuum, so they may appear to be weak or nonexistent.
(h) Interaction of different treatments - Subjects are entering into, and are a part of, other treatments rather than of an intended one.
(i) Interaction of testing and treatment - Testing may facilitate or inhibit treatment influences.
(j) Restricted generalizability - The extent to which a construct can be generalized from one study to another.

(4) External validity - Can the cause and effect noted in the study be generalized across individuals, settings, and occasions?
(a) Interaction of selection and treatment - Ability to generalize the treatment to persons beyond the group studied.
(b) Interaction of setting and treatment - Ability to generalize the treatment to settings beyond the one studied.
(c) Interaction of history and treatment - Ability to generalize the treatment to other times (past and future) beyond the one studied.

*Adapted from Cook and Campbell, 1979.
Need for Smaller Studies in Diverse Settings

It is obvious that all possible threats under each of the four categories of validity cannot be controlled in any one study and that a single study would not be sufficient to determine cause and effect relationships associated with the impact of an educational program. This argument suggests the need for smaller and more carefully controlled studies to be conducted in the evaluation of instructional programs. A second argument for smaller, more controlled studies has been the general finding that programs instituted in the social and behavioral sciences tend to have only a small to moderate impact. This situation stems from the fact that programs, which are often complex, are implemented differentially in various settings and are influenced by a host of political and social contexts.
Given these constraints, smaller studies aimed at eliminating bias (internal validity concerns) and random error (statistical conclusion validity concerns) appear imperative, especially with new or innovative educational programs. Once a program has been field tested and has been found to have a probable effect, controlled studies are needed to determine what constructs are operating in the cause and effect network (construct validity regarding cause and effect concerns) and to ascertain in what settings, for which populations, and over what occasions (external validity concerns) the observed and theoretical relationships apply.
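The link between small-to-moderate program effects and statistical conclusion validity can be illustrated with a standard sample-size approximation for comparing two group means. The sketch below uses the usual normal-approximation formula; the effect sizes shown are conventional benchmarks, not values drawn from this chapter.

# Minimal sketch (standard normal-approximation formula, not taken from the
# chapter): approximate per-group sample size needed to detect a standardized
# mean difference d with a two-sided test at significance level alpha and
# power 1 - beta. Illustrates why small-to-moderate program effects make
# statistical conclusion validity a real constraint.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.2, 0.5, 0.8):   # conventional small, moderate, large effects
    print(f"d = {d}: about {n_per_group(d):.0f} participants per group")
# A "small" program effect (d = 0.2) needs roughly 400 participants per group,
# which is one reason underpowered single studies so often report no effect.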
Objections to Validity Distinctions

Objections to the validity distinctions offered by Cook and Campbell (1979, p. 85) have been raised. For example, why should issues of internal validity have superiority over those of external validity? For that matter, why should any one form of validity have priority over another? These questions can be answered only by the researcher for a given research-evaluation situation. For the case in which information is needed to aid a policy decision, perhaps issues of external validity should supersede those of internal validity. Cronbach (1982) has made this point very convincingly. However, in the case in which a new innovative program is developed, the outcome of interest is whether the program is effective. In this situation, internal validity should take priority over external validity concerns. Finally, there could be a situation in which information on both the effectiveness of an educational program and assistance with a policy decision is the expected outcome. In this case, multiple studies aimed at controlling as many threats as possible in all four validity areas should be designed and initiated.
Conclusion

It is apparent that the meaningful conduct of evaluation studies concerning the effectiveness of educational programs in field settings requires substantial modifications in the true experimental designs employed in psychology. Use of quasi-experimental designs in combination with surveys and naturalistic inquiry affords an opportunity to replicate in simple studies selected aspects of a complex program. An information base can be acquired to permit a generalized inference of promising causal connections between treatments and observable changes in program outcomes for participating units in diverse settings.
References

Aiken, L. (1981). Proportion of returns in survey research. Educational and Psychological Measurement, 41, 1033-1038.
Ary, D., Jacobs, L., & Razavieh, A. (1985). Introduction to research in education (3rd ed.). New York: Holt.
Boruch, R. F., & Wortman, P. M. (1979). Implications of education evaluation for evaluation policy. In D. C. Berliner (Ed.), Review of Educational Research (Vol. 7). Washington: American Educational Research Association.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Guba, E. (1981). Criteria for assessing the trustworthiness of naturalistic inquiries. Educational Communication and Technology Journal, 29, 75-92.
Isaac, S., & Michael, W. B. (1981). Handbook in research and evaluation (2nd ed.). San Diego: EdITS.
Lincoln, Y., & Guba, E. (1985). Naturalistic inquiry. Beverly Hills: Sage.
Patton, M. (1980). Qualitative evaluation methods. Beverly Hills: Sage.
Reichardt, C., & Cook, T. (1979). Beyond qualitative versus quantitative methods. In T. Cook & C. Reichardt (Eds.), Qualitative and quantitative methods in evaluation research (Research Progress Series in Evaluation, Vol. 1). Beverly Hills: Sage.
Saxe, L., & Fine, M. (1979). Expanding our view of control groups in evaluations. In L. Datta & R. Perloff (Eds.), Improving evaluations. Beverly Hills: Sage.
Skrtic, T. (1985). Doing naturalistic research into educational organizations. In Y. Lincoln (Ed.), Organizational theory and inquiry: The paradigm revolution. Beverly Hills: Sage.
Stufflebeam, D. L. (1971). The use of experimental design in educational evaluation. Journal of Educational Measurement, 8, 267-273.
Wolf, R. M. (1984). Evaluation in education: Foundations of competency assessment and program review (2nd ed.). New York: Praeger.
CHAPTER 5
SAMPLE DESIGN

KENNETH N. ROSS
Deakin University, Australia
Abstract

Sampling is undertaken in evaluation studies because resources often do not permit one to study everyone who is exposed to a particular treatment. From a detailed study of a part of a group, namely a sample, one endeavors to say something about the group as a whole and, often, the likely effects on subsequent groups. Samples can be selected in a number of different ways. Some are clearly better than others. This article considers various kinds of probability and non-probability samples in both experimental and survey studies. Throughout, how a sample is chosen is stressed. Size alone is not the determining consideration in sample selection. Good samples do not occur by accident. Rather, they are the result of a careful design.
Introduction

Sampling in evaluation research is generally conducted in order to permit the detailed study of part, rather than the whole, of a population. The information derived from the resulting sample is customarily employed to develop useful generalizations about the population. These generalizations may be in the form of estimates of one or more characteristics associated with the population, or they may be concerned with estimates of the strength of relationships between characteristics within the population. Provided that scientific sampling procedures are used, the selection of a sample often provides many advantages compared with a complete coverage of the population: reduced costs associated with gathering and analyzing the data, reduced requirements for trained personnel to conduct the fieldwork, improved speed in most aspects of data manipulation and summarization, and greater accuracy due to the possibility of more intense supervision of fieldwork and data preparation operations. The evaluation research situations in which sampling is used may be divided into three broad categories: (1) experiments - in which the introduction of treatment variables occurs according to a prearranged experimental design and all extraneous variables are either controlled or randomized; (2) surveys - in which all members of a defined target population have a known non-zero probability of selection into the sample; and (3) investigations - in which data are collected without either the randomization of experiments or the probability sampling of surveys.
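The defining property of a survey sample, that every element of the defined target population has a known non-zero probability of selection, can be illustrated with a small sketch. The sampling frame, strata, and sampling fraction below are invented for illustration.

# Minimal sketch (hypothetical sampling frame): drawing a probability sample
# in which every element of the defined target population has a known,
# non-zero chance of selection. The frame, strata, and sizes are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical sampling frame: 5,000 students spread over three states.
frame = pd.DataFrame({
    "student_id": np.arange(5000),
    "state": rng.choice(["NSW", "VIC", "QLD"], size=5000, p=[0.4, 0.35, 0.25]),
})

# Proportionate stratified random sampling: 2% of each stratum, so every
# student's selection probability is known (0.02) and non-zero.
sample = frame.groupby("state").sample(frac=0.02, random_state=7)

print(sample["state"].value_counts())
# Realized sampling fraction (close to the 0.02 design probability)
print("Selection probability per element:", len(sample) / len(frame))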
Experiments are strong with respect to internal validity because they are concerned with the question of whether a true measure of the effect of a treatment variable has been obtained for the subjects in the experiment. Surveys, on the other hand, are strong with respect to external validity because they are concerned with the question of whether the findings obtained for the subjects in the survey may be generalized to a wider population. Investigations are weak with respect to both internal and external validity and their use is due mainly to convenience or low cost.
Populations: Desired, Defined, and Excluded
In any evaluation study it is important to have a precise description of the population of elements (persons, organizations, objects, etc.) that is to form the focus of the study. In most evaluation studies this population will be a finite one that consists of elements which conform to some designated set of specifications. These specifications provide clear guidance as to which elements are to be included in the population and which are to be excluded. In order to prepare a suitable description of a population it is essential to distinguish between the population for which the results are ideally required, the desired target population, and the population which is actually studied, the defined target population. An ideal situation, in which the researcher had complete control over the research environment, would lead to both of these populations containing the same elements. However, in most evaluation studies some differences arise due, for example, to: (1) non-coverage: the population description may accidentally omit some elements because the researcher has no knowledge of their existence; (2) lack of resources: the researcher may intentionally exclude some elements from the population description because the costs of their inclusion in data gathering operations would be prohibitive; or (3) an ageing population description: the population description may have been prepared at an earlier date and therefore it includes some elements which have ceased to exist. The defined target population provides an operational definition which may be used to guide the construction of a list of population elements, or sampling frame, from which the sample may be drawn. The elements that are excluded from the desired target population in order to form the defined target population are referred to as the excluded population. For example, during a cross-national study of science achievement carried out in 1970 by the International Association for the Evaluation of Educational Achievement (IEA), one of the desired target populations was described in the following fashion: "All students aged 14.00-14.11 years at the time of testing. This was the last point in most of the school systems in IEA where 100 percent of an age group were still in compulsory schooling." (Comber and Keeves, 1973, p. 10).
In Australia it was decided that, for certain administrative reasons, the study would be conducted only within the six states of Australia and not within the smaller Australian territories. It was also decided that only students in those school grade levels which contained the majority of 14-year-old students would be included in the study. The desired IEA target population was therefore reformulated in order to obtain the following defined Australian target population.
"All students aged 14.00-14.11 on 1 August 1970 in the following Australian states and secondary school grades:
New South Wales: Forms I, II and III
Victoria: Forms I, II, III and IV
Queensland: Grades 8, 9 and 10
South Australia: 1st, 2nd and 3rd Years
West Australia: Years 1, 2 and 3
Tasmania: Years I, II, III and IV."

The numbers of students in the desired IEA target population, the defined Australian target population, and the excluded population have been presented in Table 5.1. For Australia overall, the excluded population represented less than four percent of the desired target population.
Table 5.1
The Numbers of Australian Students in the Desired IEA Target Population, the Defined Australian Target Population, and the Excluded Population

Location              Desired IEA          Defined Australian    Excluded
                      target population    target population     population
New South Wales       78,163               76,317                1,846
Victoria              62,573               62,030                543
Queensland            33,046               31,839                1,207
South Australia       22,381               21,632                749
West Australia        19,128               18,708                420
Tasmania              7,868                7,789                 79
Other Territories     3,427                0                     3,427
Total                 226,586              218,315               8,271

Sampling Frames
The selection of a sample from a defined target population requires the construction of a sampling frame. The sampling frame is commonly prepared in the form of a physical list of population elements, although it may also consist of rather unusual listings, such as directories or maps, which display less obvious linkages between individual list entries and population elements. A well constructed sampling frame allows the researcher to 'take hold' of the defined target population without the need to worry about contamination of the listing with incorrect entries or entries which represent elements associated with the excluded population. In practical evaluation studies the sampling frame incorporates a great deal more structure than one would expect to find in a simple list of elements. For example, in a series of large-scale evaluation studies carried out in 21 countries during the 1970s (Peaker, 1975), sampling frames were constructed which listed schools according to a number of stratification variables: size (number of students), program (for example, comprehensive or selective), region (for example, urban or rural), and sex composition (single sex or coeducational). The use of these stratification variables in the construction of the sampling frames was due, in part, to the need to present research results for sample data that had been drawn from particular strata within the sampling frame.
Representativeness

Representativeness is a frequently used and often misunderstood notion in evaluation research. A sample is often described as being 'representative' if certain known percentage frequency distributions of element characteristics within the sample data are similar to the corresponding distributions within the whole population. The population characteristics selected for these comparisons are referred to as marker variables. These variables are usually selected from among important demographic variables that are not related to the conduct of the evaluation. Unfortunately there are no objective rules for deciding either: (1) which variables should be nominated as marker variables; or (2) the degree of similarity required between percentage frequency distributions for a sample to be judged as 'representative' of the population. It is important to note that a high degree of 'representativeness' in a set of sample data refers specifically to the marker variables selected for analysis. It does not refer to other variables assessed by the sample data and therefore does not necessarily guarantee that the sample data will provide accurate estimates for all element characteristics. The assessment of the accuracy of sample data can only be discussed meaningfully with reference to the value of the mean square error, calculated separately, for particular sample estimates (Ross, 1978). The most popular marker variables in the field of education have commonly been demographic factors associated with students (sex, age, socio-economic status, etc.) and schools (type of school, school location, school size, etc.). For example, in a series of evaluation studies carried out in the U.S.A. during the early 1970s, Wolf (1977) selected the following marker variables: sex of student, father's occupation, father's education, and mother's education. These variables were selected because their percentage frequency distributions could be obtained for the population from tabulations of census data prepared by the United States Bureau of the Census.
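In practice, checking a marker variable amounts to comparing the percentage frequency distribution in the achieved sample with the corresponding population (for example, census) distribution. The short sketch below is an illustration only; the sample counts and the census percentages for sex of student are hypothetical values, not figures from the chapter.

from collections import Counter

def percentage_distribution(values):
    """Percentage frequency distribution of a marker variable."""
    counts = Counter(values)
    total = len(values)
    return {category: round(100 * count / total, 1) for category, count in counts.items()}

# Hypothetical marker variable: sex of student in the achieved sample,
# compared with assumed census percentages for the population.
sample_sex = ["F"] * 260 + ["M"] * 240
census_percentages = {"F": 49.2, "M": 50.8}

print(percentage_distribution(sample_sex))   # {'F': 52.0, 'M': 48.0}
print(census_percentages)

As the chapter cautions, close agreement on such marker variables does not by itself guarantee accurate estimates for the characteristics the evaluation actually sets out to measure.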
Probability Samples and Non-Probability Samples
The use of samples in evaluation research is usually followed by the calculation of sample estimates with the aim of either: (1) estimating the values of population parameters from sample statistics; or (2) testing statistical hypotheses about population parameters. These two aims require that the researcher has some knowledge of the accuracy of the values of sample statistics as estimates of the relevant population parameters. The accuracy of these estimates may generally be derived from statistical theory provided that probability sampling has been employed. Probability sampling requires that each member of the defined target population has a known, and non-zero, chance of being selected into the sample. In contrast, the stability of sample estimates based on non-probability sampling cannot be discovered from the internal evidence of a single sample. That is, it is not possible to determine whether a non-probability sample is likely to provide very accurate or very inaccurate estimates of population parameters. Consequently, these types of samples are not appropriate for dealing objectively with issues concerning either the estimation of population parameters, or the testing of hypotheses in evaluation research. The use of non-probability samples in evaluation research is sometimes carried out with
the (usually implied) justification that estimates derived from the sample may be linked to some hypothetical universe of elements rather than to a real population. This justification may lead to research results which are not meaningful if the gap between the hypothetical universe and any relevant real population is too large. In some circumstances, a well-planned probability sample design can be turned accidentally into a non-probability sample design if some degree of subjective judgement is exercised at any stage during the execution of the sample design. Researchers may fall into this trap through a lack of control of field operations at the final stage of a multi-stage sample design. The most common example of this in educational settings occurs when the researcher goes to great lengths in drawing a probability sample of schools, and then leaves it to the initiative of teaching staff in the sampled schools to select a ‘random’ sample of students or classes.
Types of Non-Probability Samples

There are three main types of non-probability samples: judgement, convenience and quota samples. These approaches to sampling have in common the characteristic that the elements in the target population have an unknown chance of being selected into the sample. It is always wise to treat the research results arising from these types of sample design as suggesting statistical characteristics about the population, rather than as providing population estimates with specifiable confidence limits.

Judgement Sampling

The process of judgement, or purposive, sampling is based on the assumption that the researcher is able to select elements which represent a 'typical' sample from the appropriate target population. The quality of samples selected by using this approach depends on the accuracy of subjective interpretations of what constitutes a typical sample. It is extremely difficult to obtain meaningful results from a judgement sample because, generally speaking, no two experts will agree upon the exact composition of a typical sample. Therefore, in the absence of an external criterion, there is no way in which the research results obtained from one judgement sample can be judged as being more accurate than the research results obtained from another.

Convenience Sampling

A sample of convenience is the terminology used to describe a sample in which elements have been selected from the target population on the basis of their accessibility or convenience to the researcher. Convenience samples are sometimes referred to as 'accidental' samples because elements may be drawn into the sample simply because they just happen to be situated, spatially or administratively, near to where the researcher is conducting the data collection. The main assumption associated with convenience sampling is that the members of the target population do not vary according to accessibility or convenience. That is, that there would be no difference in the research results obtained from a random sample, a nearby sample, a co-operative sample, or a sample gathered in some inaccessible part of the population.
As for judgement sampling, there is no way in which the researcher may check the precision of one sample of convenience against another. Indeed the critics of this approach argue that, for many research situations, readily accessible elements within the target population will differ significantly from less accessible elements. They therefore conclude that the use of convenience sampling may introduce a substantial degree of bias into sample estimates of population parameters.

Quota Sampling

Quota sampling is a frequently used type of non-probability sampling. It is sometimes misleadingly assumed to be accurate because the numbers of elements are drawn from various target population strata in proportion to the size of these strata. While quota sampling places fairly tight restrictions on the number of sample elements per stratum, there is often little or no control exercised over the procedures used to select elements within these strata. For example, either judgement or convenience sampling may be used in any or all of the strata. Therefore, the superficial appearance of accuracy associated with proportionate representation of strata should be considered in the light of the fact that there is no way of checking either the accuracy of estimates obtained for any one stratum, or the accuracy of overall estimates of population characteristics obtained by combining individual stratum estimates.
Types of Probability Samples

There are many ways in which a probability sample may be drawn from a population. The method that is most commonly described in textbooks is simple random sampling. This method is rarely used in practical evaluation research situations because: (1) the selection and measurement of individual population elements is often too expensive; and (2) certain complexities may be introduced intentionally into the sample design in order to address more appropriately the objectives and administrative constraints associated with the evaluation. The complexities most often employed in evaluation research include the use of stratification techniques, cluster sampling, and multiple stages of selection.

Simple Random Sampling

The selection of a simple random sample is usually carried out according to a set of mechanical instructions which guarantees the random nature of the selection procedure. For example, Kish (1965) provides the following operational definition in order to describe procedures for the selection of a simple random sample of elements without replacement from a finite population of elements. "From a table of random digits select with equal probability n different selection numbers, corresponding to n of the N listing numbers of the population elements. The n listings selected from the list, on which each of the N population elements is represented separately by exactly one listing, must identify uniquely n different elements." (Kish, 1965, pp. 36-37).
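A minimal sketch of this selection procedure is given below, with Python's standard library random number generator standing in for Kish's table of random digits; the listed population and the sample size are hypothetical.

import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of n elements without replacement.

    random.sample gives every listed element an equal chance of selection.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)

# Example: a sample of 5 students from a listed population of 200.
students = [f"student_{i:03d}" for i in range(1, 201)]
print(simple_random_sample(students, 5, seed=1))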
Simple random sampling, as described in this definition, results in an equal probability of selection for all elements in the population. This characteristic, called ‘epsem’ sampling
(equal probability of selection method), is not restricted solely to this type of sample design. For example, equal probability of selection can result from either the use of equal probabilities of selection throughout the various stages of a multi-stage sample design, or from the employment of varying probabilities that compensate for each other through the several stages of a multi-stage sample design. Epsem sampling is widely applied in evaluation research because it usually leads to self-weighting samples in which the simple arithmetic mean obtained from the sample data is an unbiased estimate of the population mean.

Stratified Sampling

The technique of stratification is often employed in the preparation of sample designs for evaluation research because it generally provides increased accuracy in sample estimates without leading to substantial increases in costs. Stratification does not imply any departure from probability sampling; it simply requires that the population be divided into subpopulations called strata and that the probability sampling be conducted independently within each stratum. The sample estimates of population parameters are then obtained by combining the information from each stratum. In some evaluation studies stratification is used for reasons other than obtaining gains in sampling accuracy. For example, strata may be formed in order to employ different sampling procedures within strata, or because the subpopulations defined by the strata are designated as separate domains of study. Variables used to stratify populations in education generally describe demographic aspects concerning schools (for example, location, size and program) and students (for example, age, sex, grade level and socio-economic status). Stratified sampling may result in either proportionate or disproportionate sample designs. In a proportionate sample design the number of observations in the total sample is allocated among the strata of the population in proportion to the relative number of elements in each stratum of the population. That is, a stratum containing a given percentage of the elements in the population would be represented by the same percentage of the total number of sample elements. In situations where the elements are selected with equal probability within strata, this type of sample design results in epsem sampling and self-weighting sample estimates of population parameters. In contrast, a disproportionate stratified sample design is associated with the use of different probabilities of selection, or sampling fractions, within the various population strata. This can sometimes occur when the sample is designed to achieve greater overall accuracy than proportionate stratification by using 'optimum allocation' (Kish, 1965, p. 92). More commonly, in educational settings, disproportionate sampling is used in order to ensure that the accuracy of sample estimates obtained for stratum parameters is sufficiently high to be able to make meaningful comparisons between strata. The sample estimates derived from a disproportionate sample design are generally prepared with the assistance of 'weighting factors'. These factors, represented either by the reciprocals of the sampling fractions or by a set of numbers proportional to them, are employed in order to prevent inequalities in selection probabilities from causing the introduction of bias into sample estimates of population parameters.
The reciprocals of the sampling fractions, called ‘raising factors’, refer to the number of elements in the population represented by a sample element (Ross, 1978).
The weighting factors are usually calculated so as to ensure that the sum of the weighting factors over all elements in the sample is equal to the sample size. This ensures that the readers of evaluation reports are not confused by differences between actual and weighted sample sizes.

Cluster Sampling

A population of elements can usually be thought of as a hierarchy of different sized groups or 'clusters' of sampling elements. These groups may vary in size and nature. For example, a population of school students may be grouped into a number of classrooms, or it may be grouped into a number of schools. A cluster sample of students may then be selected from this population by selecting clusters of students as classroom or school groups rather than individually as would occur when using a simple random sample design. The use of cluster sampling in evaluation research is sometimes undertaken as an alternative to simple random sampling in order to reduce research costs for a given sample size. For example, a cluster sample consisting of the selection of 10 classes, each containing around 20 students, would generally lead to smaller data collection costs compared with a simple random sample of 200 students. The reduced costs occur because the simple random sample may require the researcher to collect data from as many as 200 schools. Cluster sampling does not prevent the application of probability sampling techniques. This may be demonstrated by examining several ways in which cluster samples may be drawn from a population of students. To illustrate, consider an hypothetical population, described in Figure 5.1, of 24 students distributed among six classrooms (with four students per class) and three schools (with two classes per school).
Figure 5.1. Hypothetical population of 24 students (three schools, each containing two classes of four students).
A simple random sample of four students drawn without replacement from this population would result in an epsem sample with each element having a probability of selection, p, equal to 1/6 (Kish, 1965, p. 40). A range of cluster samples, listed below with their associated p values, may also be drawn in a manner which results in epsem samples.
(i) Randomly select one class, then include all students in this class in the sample. (p = 1/6 × 1/1 = 1/6.)
(ii) Randomly select two classes, then select a random sample of two students from within these classes. (p = 2/6 × 2/4 = 1/6.)
(iii) Randomly select two schools, then select a random sample of one class from within these schools, then select a random sample of two students from within these classes. (p = 2/3 × 1/2 × 2/4 = 1/6.)
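The stage-by-stage selection fractions for these designs can be multiplied exactly; the short sketch below (an illustration, not part of the chapter) confirms that each design gives every student the same 1/6 chance of selection.

from fractions import Fraction

# Selection probability for each student under the three epsem cluster
# designs described above (3 schools, 2 classes per school, 4 students per class).
design_i   = Fraction(1, 6) * Fraction(1, 1)                   # one class, all of its students
design_ii  = Fraction(2, 6) * Fraction(2, 4)                   # two classes, two students from each
design_iii = Fraction(2, 3) * Fraction(1, 2) * Fraction(2, 4)  # two schools, one class, two students

print(design_i, design_ii, design_iii)   # each prints 1/6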
The Accuracy of Estimates Obtained from Probability Samples

The degree of accuracy associated with a sample estimate derived from any one probability sample may be judged by the difference between the estimate and the value of the population parameter which is being estimated. In most situations the value of the population parameter is not known and therefore the actual accuracy of an individual sample estimate cannot be calculated in absolute terms. Instead, through a knowledge of the behavior of estimates derived from all possible samples which can be drawn from the population by using the same sample design, it is possible to estimate the probable accuracy of the obtained sample estimate.
Mean Square Error

Consider a probability sample of n elements which is used to calculate the sample mean, x̄, as an estimate of the population mean, X̄. If an infinite set of samples of size n were drawn independently from this population and the sample mean calculated for each sample, then the average of the resulting sampling distribution of sample means, the expected value of x̄, could be denoted by E(x̄). The accuracy of the sample statistic, x̄, as an estimator of the population parameter, X̄, may be summarized in terms of the mean square error (MSE). The MSE is defined as the average of the square of the deviations of all possible sample estimates from the value being estimated (Hansen et al., 1953):

MSE(x̄) = E(x̄ - X̄)² = E(x̄ - E(x̄))² + (E(x̄) - X̄)² = Variance of x̄ + (Bias of x̄)²
A sample design is unbiased if E(x̄) = X̄. It is important to remember that 'bias' is not a property of a single sample, but of the entire sampling distribution, and that it belongs neither to the selection nor the estimation procedure alone, but to both jointly. For most well designed samples in education the sampling bias is usually very small, tending towards zero with increasing sample size. The accuracy of sample estimates is therefore generally assessed in terms of the variance of x̄, denoted var(x̄), which quantifies the sampling stability of the values of x̄ around their expected value E(x̄).

The Accuracy of Individual Sample Estimates

In educational settings the researcher is usually dealing with a single sample of data and not with all possible samples from a population. The variance of sample estimates as a measure of sampling accuracy cannot therefore be calculated exactly. Fortunately, for many probability sample designs, statistical theory may be used to derive formulae which provide estimates of the variance based on the internal evidence of a single sample of data. For a simple random sample of n elements drawn without replacement from a population of N elements, the variance of the sample mean may be estimated from a single sample of data by using the following formula:

var(x̄) = ((N - n)/N) × (s²/n)
where s² is the usual sample estimate of the variance of the element values in the population (Kish, 1965, p. 41). For sufficiently large values of N, the value of the finite population correction, (N - n)/N, tends toward unity. The variance of the sample mean in this situation may be estimated to be equal to s²/n. The sampling distribution of the sample mean is approximately normally distributed for many educational sampling situations. This approximation improves with increased sample size even though the distribution of elements in the parent population may be far from normal. This characteristic of sampling distributions is known as the Central Limit Theorem and it occurs not only for the sample mean but also for most estimators commonly used to describe survey research results (Kish, 1965). From a knowledge of the properties of the normal distribution we know that we can be '68 percent confident' that the range x̄ ± SE(x̄) includes the population mean, where x̄ is the sample mean obtained from a single sample and SE(x̄), often called the standard error, is the square root of var(x̄). Similarly the range x̄ ± 1.96 SE(x̄) will include the population mean with 95 percent confidence. While the above discussion has concentrated mostly on sample means derived from simple random samples, the same approach may be used to set up confidence limits for many other population values derived from various types of sample designs. For example, confidence limits may be calculated for complex statistics such as correlation coefficients, regression coefficients, multiple correlation coefficients, etc. (Ross, 1978).
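The variance formula and the 95 percent confidence limits can be combined in a few lines of code. The sketch below is illustrative only; the scores and the population size N are hypothetical values.

import math
import statistics

def srs_confidence_interval(sample, N, z=1.96):
    """95 percent confidence interval for the population mean from a
    simple random sample drawn without replacement from N elements."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s2 = statistics.variance(sample)        # sample estimate of the element variance
    var_mean = ((N - n) / N) * s2 / n       # finite population correction applied
    se = math.sqrt(var_mean)
    return x_bar - z * se, x_bar + z * se

# Hypothetical scores from a simple random sample of 50 students
# drawn from a defined target population of 2,000 students.
scores = [52, 61, 47, 58, 55, 49, 63, 51, 57, 60] * 5
print(srs_confidence_interval(scores, N=2000))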
Comparison of the Accuracy of Probability Samples

The accuracy of probability samples is usually compared by considering the variances associated with a particular sample estimate for a given sample size. This comparison has, in recent years, been based on the recommendation put forward by Kish (1965) that the simple random sample design should be used as a standard for quantifying the accuracy of a variety of probability sample designs which incorporate such complexities as stratification and clustering. Kish (1965, p. 162) introduced the term 'deff' (design effect) to describe the ratio of the variance of the sample mean for a complex sample design, denoted c, to the variance of a simple random sample, denoted srs, of the same size. That is,

deff = var(x̄_c) / var(x̄_srs)

For many complex sample designs that are commonly used in evaluation research the values of deff for sample means, and for multivariate statistics such as correlation coefficients and regression coefficients, are greater than unity (Ross, 1978). Consequently, the accuracy of sample estimates in these studies may be grossly over-estimated if formulae based on simple random sampling assumptions are used to calculate sampling errors. The potential for arriving at false conclusions in evaluation research by using incorrect sampling error calculations has been demonstrated in a study carried out by Ross (1976). This study showed that it was highly misleading to assume that sample size was, in itself, an adequate indicator of the sampling accuracy associated with complex sample designs. For example, Ross (1976, p. 40) demonstrated that a two-stage cluster sample of 150 students (that was selected by randomly selecting six classes followed by the random selection
of 25 students within these classes) had the same sampling accuracy for sample means as would a simple random sample of 20 students.
Error Estimation Procedures for Complex Probability Samples

The computational formulae required to estimate the variance of descriptive statistics, such as sample means, are widely available for a wide range of probability sample designs which incorporate such complexities as stratification and cluster sampling (Kish, 1965). However, in the case of more complex analytical statistics, such as correlation coefficients and regression coefficients, the required formulae are not readily available for sample designs which depart from the model of simple random sampling. These formulae are either enormously complicated or, ultimately, they prove resistant to mathematical analysis (Frankel, 1971). Due to the lack of suitable sampling error formulae for analytical statistics, researchers have sometimes tended to accept estimates based on formulae which assume that data have been gathered by using simple random sampling. This course of action may lead to erroneous evaluation conclusions because results described as 'significant' may in reality be well within the bounds of random error (Ross, 1978). In the absence of suitable formulae, a variety of empirical techniques have emerged in recent years which provide 'approximate variances that appear satisfactory for practical purposes' (Kish, 1978, p. 20). These techniques may be divided into two broad categories: Random Subsample Replication and Taylor's Series Approximations.

In Random Subsample Replication a total sample of data is divided into two or more independent subsamples, each subsample following the overall sample design but being smaller in size. Then "a distribution of outcomes for a parameter being estimated is generated by each subsample. The differences observed among the subsample results are then analysed to obtain an improved estimate of the parameter, as well as a confidence assessment for that estimate" (Finifter, 1972, p. 114). The main approaches in using this technique have been Independent Replication (Deming, 1960), Jackknifing (Tukey, 1958), and Balanced Repeated Replication (McCarthy, 1966).

The use of Taylor's Series Approximations is often described as a more 'direct' method of variance estimation than the three approaches described above. In the absence of an exact formula for the variance, the Taylor's Series is used to approximate a numerical value from the first few terms of a series expansion of the variance formula. A number of computer programs have been prepared in order to carry out the extensive numerical calculations required for this approach (Wilson, 1983).
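As a small illustration of the replication idea (a sketch only, not a full implementation of any of the cited procedures), the delete-one-cluster jackknife below estimates the standard error of an overall mean by omitting each sampled cluster in turn; it assumes equal-sized, equally weighted clusters and uses hypothetical scores.

import statistics

def jackknife_se_of_mean(clusters):
    """Delete-one-cluster jackknife standard error for an overall mean,
    assuming equal-sized, equally weighted clusters."""
    cluster_means = [statistics.mean(c) for c in clusters]
    J = len(cluster_means)
    # Recompute the estimate with each cluster omitted in turn.
    leave_one_out = [statistics.mean(cluster_means[:j] + cluster_means[j + 1:]) for j in range(J)]
    centre = statistics.mean(leave_one_out)
    var_jack = (J - 1) / J * sum((est - centre) ** 2 for est in leave_one_out)
    return var_jack ** 0.5

# Hypothetical test scores for five sampled classes of four students each.
classes = [[48, 52, 50, 51], [60, 58, 62, 61], [45, 47, 44, 46], [55, 53, 54, 56], [50, 49, 51, 52]]
print(round(jackknife_se_of_mean(classes), 2))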
Sample Design for Two-Stage Cluster Samples

The two-stage cluster sample is probably the most commonly used sample design in educational research. This design is generally employed by selecting either schools or classes at the first stage of sampling, followed by the selection of either clusters of students within schools or clusters of students within classes at the second stage. In many evaluation studies the two-stage cluster design is preferred because this design offers an opportunity for the researcher to conduct analyses at 'higher' levels of data aggregation. For example, the selection of students within classes at the second stage of sampling would, provided
there were sufficient numbers of classes and numbers of students selected within classes, permit analyses to be carried out at the 'between-student' level (by using data describing individual students) and also the 'between-class' level (by using data based on class mean scores).

The Design Effect for Two-Stage Cluster Samples

The value of the 'design effect' (Kish, 1965, p. 257) for the two-stage cluster sample design depends, for a given number of clusters and a given cluster size, on the value of the coefficient of intraclass correlation. That is,

deff = var(x̄_c) / var(x̄_srs) = 1 + (b - 1) roh

where var(x̄_c) is the variance of the sample means for the two-stage cluster sample design, b is the size of the selected clusters, and roh is the coefficient of intraclass correlation. The coefficient of intraclass correlation, often referred to as roh, provides a measure of the degree of homogeneity within clusters. In educational settings the notion of homogeneity within clusters may be observed in the tendency of student characteristics to be more homogeneous within schools, or classes, than would be the case if students were assigned to schools, or classes, at random. This homogeneity may be due to common selective factors (for example, residential zoning of schools), or to joint exposure to the same external influences (for example, teachers and school programs), or to mutual interaction (for example, peer group pressure), or to some combination of these.
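The chapter does not give a procedure for estimating roh from sample data. One common approach, shown here purely as an illustrative sketch under the assumption of equal-sized clusters (an assumption of this sketch, not a procedure stated in the chapter), is the one-way analysis-of-variance estimator of the intraclass correlation; the class scores below are hypothetical.

import statistics

def estimate_roh(clusters):
    """ANOVA estimator of the coefficient of intraclass correlation for
    equal-sized clusters (illustrative assumption, not from the chapter)."""
    b = len(clusters[0])                   # common cluster size
    k = len(clusters)                      # number of clusters
    grand_mean = statistics.mean(x for c in clusters for x in c)
    msb = b * sum((statistics.mean(c) - grand_mean) ** 2 for c in clusters) / (k - 1)
    msw = sum((x - statistics.mean(c)) ** 2 for c in clusters for x in c) / (k * (b - 1))
    return (msb - msw) / (msb + (b - 1) * msw)

classes = [[48, 52, 50, 51], [60, 58, 62, 61], [45, 47, 44, 46], [55, 53, 54, 56]]
print(round(estimate_roh(classes), 2))     # a high roh: these scores cluster strongly within classes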
The Effective Sample Size for Two-Stage Cluster Samples

The 'effective sample size' (Kish, 1965, p. 259) for a given two-stage cluster sample is equal to the size of the simple random sample which has a level of sampling accuracy, as measured by the variance of the sample mean, which is equal to the sampling accuracy of the given two-stage cluster sample. A little algebra may be used to demonstrate that the actual size, n_c, and the effective sample size, n*, for a two-stage cluster sample are related to the design effect associated with that sample in the following manner (Ross, 1978, pp. 137-138):

n_c = n* × deff
From previous discussion, we may replace deff in this formula by an expression which is a function of the cluster size and the coefficient of intraclass correlation. That is,

n_c = n* × (1 + (b - 1) roh)
For example, consider a two-stage cluster sample based on a sample of 10 schools followed by the selection of 20 students per school. In addition, consider a student characteristic (for example, a test score or attitude scale score) for which the value of the coefficient of intraclass correlation is equal to 0.1. This value of roh would be typical for clusters of students selected randomly from within secondary schools in Australia (Ross, 1983). In this situation, the above formula simplifies to the following expression:
200 = n* × (1 + (20 - 1) × 0.1)

Solving this equation for n* gives a value of 69 for the effective sample size. That is, given the value of 0.1 for roh, a two-stage cluster sample of size 200 that is selected by sampling 10 schools followed by sampling 20 students per school would have sampling accuracy equivalent to a simple random sample of 69 students. For a given population of students, the value of roh tends to be higher for clusters based on classes rather than clusters based on schools. Ross (1978) has obtained values of roh as high as 0.5 for mathematics test scores based on classes within Australian secondary schools. Using this value of roh in the above example provides an effective sample size of only 19 students!
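These calculations are easily mechanized. The short sketch below simply rearranges the formula above and reproduces the two examples just given.

def effective_sample_size(n_c, b, roh):
    """Effective sample size of a two-stage cluster sample of n_c students
    drawn as clusters of size b, for intraclass correlation roh."""
    deff = 1 + (b - 1) * roh
    return n_c / deff

# The chapter's example: 10 schools of 20 students, roh = 0.1.
print(round(effective_sample_size(200, 20, 0.1)))   # about 69
# Class-based clusters with roh = 0.5 shrink the effective size much further.
print(round(effective_sample_size(200, 20, 0.5)))   # about 19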
Sample Design Tables

Sample design tables are often prepared for well-designed evaluation studies in which it is intended to employ two-stage cluster sampling. These tables present a range of sample design options, each designed to have a pre-specified level of sampling accuracy. A hypothetical example has been presented in the following discussion in order to illustrate the sequence of calculations and decisions involved in the preparation of these tables. Consider an evaluation study in which test items are administered to a two-stage cluster sample of students with the aim of estimating the percentage of students in the population that are able to obtain correct answers. In addition, assume that a sampling accuracy constraint has been placed on the design of the study so as to ensure that the sample estimate of the percentage of students providing the correct answer, p, will provide p ± 5 percent as 95 percent confidence limits for the value of the percentage in the population. For reasonably large samples it is possible to assume normality of the distribution of sample estimates (Kish, 1965, pp. 13-14) and therefore confidence limits of p ± 5 percent are approximately equivalent to an error range of plus or minus two standard errors of p. Consequently, the error constraint placed on the study means that one standard error of p needs to be less than or equal to 2.5 percent. Consider a simple random sample of n* students selected from this population in order to calculate values of p. Statistical theory may be employed to show that, for large populations, the variance of the sample estimate of p as an estimate of the population value may be calculated by using the following formula (Kish, 1965, p. 46):

var(p) = p(100 - p) / (n* - 1)
The maximum value of p(100 - p) occurs for p = 50. Therefore, in order to ensure that we could satisfy the error constraints described above, the following inequality would need to be valid:

(2.5)² ≥ 50(100 - 50) / (n* - 1)
That is, the size of the simple random sample, n*, would have to be greater than or equal to about 400 students in order to obtain 95 percent confidence limits of p ± 5 percent. Now consider the size of a two-stage cluster sample design which would provide
equivalent sampling accuracy to a simple random sample of 400 students. The design of this cluster sample would require knowledge of the numbers of primary sampling units (for example, schools or classes) and the numbers of secondary sampling units (students) which would be required. From previous discussion, the relationship between the size of the cluster sample, n_c, which has the same accuracy as a simple random sample of size n* = 400, may be written in the following fashion:

n_c = 400 × (1 + (b - 1) roh)

This expression is often described as a 'planning equation' because it may be used to explore sample design options for two-stage cluster samples.

The value of n_c is equal to the product of the number of primary sampling units, a, and the number of secondary sampling units selected from each primary sampling unit, b. Substituting for n_c in this formula, and then transposing, provides an expression for a in terms of b and roh. That is,

a = 400 × (1 + (b - 1) roh) / b

As an example, consider roh = 0.1 and b = 20. Then,

a = 400 × (1 + (20 - 1) × 0.1) / 20 = 58.
That is, for roh = 0.1, a two-stage cluster sample of 1,160 students (consisting of the selection of 58 primary sampling units followed by the selection of clusters of 20 students) would have sampling accuracy equivalent to a simple random sample of 400 students. In Table 5.2 the planning equation has been employed to list sets of values for a, b, deff, and n_c which describe a group of two-stage cluster sample designs that have sampling accuracy equivalent to a simple random sample of 400 students. Three sets of sample designs have been listed in the table, corresponding to roh values of 0.1, 0.2, and 0.4. In a study of school systems in 10 developed countries, Ross (1983, p. 54) has shown that values of roh in this range are typical for achievement test scores obtained from clusters of students within schools.
Table 5.2
Sample Design Table for Two-Stage Cluster Samples With Sampling Accuracy Equal to a Simple Random Sample of 400

Cluster        roh = 0.1                 roh = 0.2                 roh = 0.4
size b     deff    n_c      a        deff    n_c      a        deff    n_c      a
 1          1.0     400    400        1.0     400    400        1.0     400    400
 2          1.1     440    220        1.2     480    240        1.4     560    280
 5          1.4     560    112        1.8     720    144        2.6   1,040    208
10          1.9     760     76        2.8   1,120    112        4.6   1,840    184
15          2.4     960     64        3.8   1,530    102        6.6   2,640    176
20          2.9   1,160     58        4.8   1,920     96        8.6   3,440    172
30          3.9   1,560     52        6.8   2,730     91       12.6   5,040    168
40          4.9   1,960     49        8.8   3,520     88       16.6   6,640    166
50          5.9   2,400     48       10.8   4,350     87       20.6   8,250    165
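Table 5.2 can be reproduced directly from the planning equation. The sketch below is an illustration only; it prints the roh = 0.1 rows and rounds the number of clusters up to the next whole number, one reasonable convention that matches the figures shown above.

import math

def planning_row(b, roh, n_star=400):
    """One row of a sample design table: for cluster size b and intraclass
    correlation roh, the number of clusters a and total sample size n_c
    needed to match a simple random sample of n_star elements."""
    deff = 1 + (b - 1) * roh
    a = math.ceil(n_star * deff / b)     # round the number of clusters up
    return deff, a, a * b

for b in (1, 2, 5, 10, 15, 20, 30, 40, 50):
    deff, a, n_c = planning_row(b, roh=0.1)
    print(f"b={b:2d}  deff={deff:4.1f}  a={a:3d}  n_c={n_c:5d}")
# For roh = 0.1 and b = 20 this reproduces a = 58 and n_c = 1,160.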
The most striking feature of Table 5.2 is the rapidly diminishing effect that increasing b, the cluster size, has on a, the number of clusters that are to be selected. This is particularly noticeable when the cluster size reaches 10 to 15 students. Consider, for example, two sample designs applied in a situation where a value of roh = 0.4 may be assumed: (i) a total sample of 2,640 students obtained by selecting 15 students per cluster from 176 clusters; and (ii) a total sample of 8,250 students obtained by selecting 50 students per cluster from 165 clusters. From Table 5.2, it may be observed that both of these sample designs have sampling accuracy equivalent to a simple random sample of 400 students. While these two sample designs have equivalent sampling accuracy, there is a striking difference between them in terms of total sample size. However, the magnitude of this difference is not reflected proportionately in the difference between the number of clusters selected.

This result illustrates an important point for the planning of evaluation studies that seek to make stable estimates of population characteristics: the sampling accuracy of two-stage cluster sample designs, for cluster sizes of 10 or more, tends to be greatly influenced by small changes in the number of clusters that are selected at the first stage of sampling, and relatively less influenced by small changes in the size of the selected clusters.

The main use of sample design tables like the one presented in Table 5.2 is to permit the evaluator to choose, for a given value of roh, one sample design from among a list of equally accurate sample design options. The final choice between equally accurate options is usually guided by cost factors, or data analysis strategies, or a combination of both of these. For example, the cost of collecting data by using 'group administration' of tests, questionnaires, etc. usually depends more on the number of selected schools than on the number of students surveyed within each selected school. This occurs because the use of this methodology usually leads to higher marginal costs associated with travel to many schools, compared with the marginal costs of increasing the number of students surveyed within each selected school. In contrast, the cost of collecting data by using 'individual administration' of one-to-one tests, interviews, etc., may often depend more on the total sample size than on the number of selected schools.

The choice of a sample design option may also depend upon the data analysis strategies that are being employed in the evaluation. For example, analyses may be planned at both the 'between-student' and 'between-school' levels of analysis. In order to conduct analyses at the between-school level, data obtained from individual students may need to be aggregated to obtain data files consisting of school records based on student mean scores. This type of analysis generally requires that sufficient students be selected per school to ensure that stable estimates are able to be made for individual schools. In addition, it generally requires that sufficient students and schools are available at each level of analysis so as to ensure that meaningful results may be obtained.
Sample Design and Experimental Approaches to Evaluation

One of the simplest, and most popular, experimental designs employed in evaluation studies is concerned with the use of 'treatment' and 'control' groups followed by a comparison of mean scores for the criterion variable. For example, the researcher may wish to
examine the difference in the mean mathematics test scores for a treatment group of students exposed to an innovative program of mathematics instruction compared with a control group of students exposed to a traditional program of instruction. In conducting the experiment it would be likely that the researcher would note some difference between the treatment group mean (x̄₁) and the control group mean (x̄₂). The important question, however, is whether this difference is sufficiently large to justify the conclusion that the means associated with the relevant parent populations are different. A statistical procedure that is often used for checking this conclusion is the well known t-test for the significance of the difference between two means obtained for independent samples (Ferguson, 1966, p. 167). The t-test employs the ratio of the difference in sample means (x̄₁ - x̄₂) to the standard error of the difference in sample means (s_d). This ratio has, assuming equal parent population variances, a distribution of t with n₁ + n₂ - 2 degrees of freedom. The method used to estimate the value of s_d in this ratio depends on the nature of the sampling and management of the experiment. For example, consider a study in which the students selected to participate in the study were: (i) assigned randomly and individually to the two experimental groups; and (ii) responded to the experimental conditions independently of each other for the duration of the experiment. In this case s_d would be estimated as the square root of the sum of the variances of the group means. The appropriate estimate of the t ratio for this experimental design would then be (Ferguson, 1966, p. 167):

t = (x̄₁ - x̄₂) / √(s²/n₁ + s²/n₂)
In order to simplify calculations, we may assume that the sample sizes for treatment and control groups are the same (n₁ = n₂ = n), and also that the between-student standard deviations of student scores are the same (s₁ = s₂ = s). Under these simplifying assumptions the value of s_d would be s√(2/n), and the corresponding estimate of t would be:

t_srs = (x̄₁ - x̄₂) / (s√(2/n))
The 'srs' subscript associated with the t ratio in the formula given above refers to the use of simple random sampling in the selection and allocation of the students to the experimental conditions. This estimate is quite different from the value which would be obtained if cluster sampling of intact groups had been employed in the selection and allocation of students. For example, consider a study in which the students selected to participate in the study were: (i) assigned randomly as intact class groups to the two experimental conditions; and (ii) responded to the experimental conditions as intact class groups. In this case s_d should be estimated by using a design effect factor in order to account for the degree of clustering associated with the use of intact class groups. The value of s_d in the estimate of the t ratio would be s√(2 × deff/n), and the corresponding estimate of the t ratio would be:

t_cluster = (x̄₁ - x̄₂) / (s√(2 × deff/n))
The vast majority of program evaluations in education are administered according to the 'cluster' design described above. However, the data analyses for these studies often proceed by using the 'srs' formula to estimate the t ratio. In these situations the value of the t ratio may be over-estimated by a factor of √deff. The impact of this erroneous approach to the estimation of the t ratio leads to the possibility that the differences in mean scores between the experimental groups could be mistakenly interpreted as being statistically significant.

It is a sobering exercise to examine the magnitude of the influence of a design effect factor on the calculation of the t ratio. For example, Ross (1978) has shown that the intraclass correlation coefficient for mathematics test scores obtained from students in intact class groups in Australian secondary schools can be as high as 0.57. Using this value of roh, the value of deff, assuming a class size of 20, may be estimated from the formula relating the design effect to the intraclass correlation coefficient. That is,

deff = 1 + (20 - 1) × 0.57 = 11.83.
The adjustment factor for the estimated t ratio in this example would be the square root of 11.83. The t ratio would therefore need to be 3.43 times larger than the estimate of the t ratio obtained without taking account of the clustering effects associated with intact class groups. The implications of this example for applications of Analysis of Variance techniques may be considered by noting that the F ratio is equal to the square of the t ratio in situations where only two experimental groups are employed (Ferguson, 1966, p. 293). In this situation the adjustment factor for the F ratio would be equal to deff. That is, the value of the F ratio which would emerge by applying statistical procedures appropriate for 'srs' designs to a 'cluster' design would need to be divided by 11.83. The application of this kind of adjustment to many published evaluation studies would undoubtedly lead to observed differences between control and treatment groups being within the boundaries of chance fluctuation.

A number of authors (for example, Glass & Stanley, 1970; Page, 1975; Cronbach, 1976; Hopkins, 1982) have suggested that significance tests based on individual-level analyses may be unacceptable in studies which employ intact class groups as the units of sampling and experimentation. In particular, Glass and Stanley (1970) put forward a strong argument for using class means as the units of analysis in these types of studies; whereas Hopkins (1982) demonstrates skillfully that choice of the correct linear model will, under some assumptions, permit analyses to be carried out using students as the units of analysis. The Glass and Stanley argument represents a worst case view of the example presented above concerning the use of the design effect to adjust t and F values. It is a worst case view because it would be similar to assuming a value of 1.0 for the intraclass correlation coefficient, which in turn would give a value of deff equal to the size of the class groups. That is, the observed values of t and F would need to be adjusted by 4.47 and 20, respectively.
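The adjustment described above is easy to apply after the fact. The sketch below is illustrative only: the unadjusted t value of 3.0 is a hypothetical figure, while the class size and roh are taken from the chapter's example.

import math

def adjust_for_clustering(t_srs, b, roh):
    """Rescale a t ratio computed under simple random sampling assumptions
    to allow for intact-group (cluster) sampling with cluster size b."""
    deff = 1 + (b - 1) * roh
    return t_srs / math.sqrt(deff), (t_srs ** 2) / deff   # adjusted t, adjusted F (two groups)

# The chapter's example: intact classes of 20 students with roh = 0.57.
t_adj, f_adj = adjust_for_clustering(t_srs=3.0, b=20, roh=0.57)
print(round(t_adj, 2), round(f_adj, 2))   # an apparently 'significant' srs t of 3.0 shrinks to about 0.87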
Evaluation Designs When Sampling is Not Possible

The key difficulty in dealing with the findings of an evaluation study that is not based on
a probability sample drawn from a defined target population is that the researcher may be unable to assess accurately the confidence which can be attributed to generalizations that extend beyond the sample to some wide-ranging population of interest. This situation can occur in evaluation studies where sampling is not possible due to administrative, political, or ethical constraints. For example, consider an evaluation of a government program designed to assist all students attending schools below a certain percentile cut-off point on an objective indicator of poverty. The withholding of the benefits of the program from an eligible student may, in some settings, be prohibited by law. In this situation the whole of the target population for whom the program was intended would form the treatment group for the program and there would be no possibility of selecting a comparable control group. Many of the initial Title I evaluation designs prepared for the United States Office of Education faced this problem (Tallmadge & Wood, 1976). The approach taken in these evaluations was generally either to use a benchmark population in order to provide a 'norm-referenced' metric for gains made by the treatment group, or to form an approximate control group by sampling from a non-equivalent population and then employing 'regression adjustment' in order to make comparisons (Gay, 1980).

The norm-referenced design has involved the pretesting and posttesting of students participating in the program, followed by a comparison of the gains made by these students in terms of percentile equivalents on standardized tests. In this situation a kind of surrogate control group is established through comparison of student gains with a set of norms obtained for an appropriate age-related or grade-related reference group.

The regression-adjustment design has involved the comparison of predicted posttest scores with actual posttest scores. The prediction of posttest scores in this model has generally been based on correlations between pretest and posttest scores and has been applied to results obtained from the treatment group of students participating in the program and an approximate control group of students sampled from just above the cut-off point set for the program. The effectiveness of the program has then been considered by investigating the degree to which the program participants were doing better than expected.
Conclusion

The discussion of sampling presented in this chapter has reviewed the characteristics of the types of probability and non-probability samples that are commonly used in evaluation studies in education. In addition, it has examined some of the implications of employing complex sample designs which depart markedly from the traditional model of simple random sampling. This discussion incorporates two key messages which need to be remembered during the construction of sampling plans for evaluation studies.

The first message is that 'sample size' is only one component of sample design, and therefore the degree of accuracy associated with the results obtained from a sample should not be judged solely by the size of the sample. Consequently, small well-designed samples may provide more accurate information than large poorly-designed samples. For example: (1) a large sample selected by using non-probability sampling methods could result in large levels of bias because only readily accessible elements have been selected while important but less accessible elements have been overlooked; or (2) a large sample based on the
technique of cluster sampling, for a population in which the size of the intraclass correlation coefficient was substantial, could lead to the effective sample size being extremely small compared with the actual sample size. The second message is that good samples do not occur by accident, nor are they readily available in the sampling sections of standard textbooks on evaluation methodology. Rather, good samples are ‘designed’ in the sense that they are planned to provide an optimal degree of sampling precision within the limitations posed by the administrative, cost, and analysis constraints that are placed on the particular evaluation studies in which they are to be used. Therefore the sample design adopted for an evaluation study should represent the conclusion of an iterative process in which the final design is chosen from among the options described in a set of ‘sample design tables’ that have been prepared, discussed, modified, and fine-tuned to the needs of the study.
References

Comber, L. C., & Keeves, J. P. (1973). Science education in nineteen countries. New York: Wiley.
Cronbach, L. J. (1976). Research on classrooms and schools: Formulation of questions, design, and analysis. Stanford, CA: Stanford University.
Deming, W. E. (1960). Sample design in business research. New York: Wiley.
Ferguson, G. A. (1966). Statistical analysis in psychology and education (2nd ed.). New York: McGraw-Hill.
Finifter, B. M. (1972). The generation of confidence: Evaluating research findings by random subsample replication. In H. L. Costner (Ed.), Sociological methodology. San Francisco: Jossey-Bass.
Frankel, M. R. (1971). Inference from survey samples. Ann Arbor, MI: Institute for Social Research.
Gay, L. R. (1980). Educational evaluation and measurement: Competencies for analysis and application. Columbus, OH: Merrill.
Glass, G. V., & Stanley, J. C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
Hopkins, K. D. (1982). The unit of analysis: Group means versus individual observations. American Educational Research Journal, 19(1), 5-18.
Kish, L. (1965). Survey sampling. New York: Wiley.
Kish, L. (1978). On the future of survey sampling. In N. K. Namboodiri (Ed.), Survey sampling and measurement. New York: Academic Press.
McCarthy, P. J. (1966). Replication: An approach to the analysis of data from complex surveys. Washington: United States National Centre for Health Statistics.
Page, E. B. (1975). Statistically recapturing the richness within the classroom. Psychology in the Schools, 12, 339-344.
Peaker, G. F. (1975). An empirical study of education in twenty-one countries: A technical report. New York: Wiley.
Ross, K. N. (1976). Searching for uncertainty: An empirical investigation of sampling errors in educational survey research (ACER Occasional Paper No. 9). Hawthorn, Victoria: Australian Council for Educational Research.
Ross, K. N. (1978). Sample design for educational survey research. Evaluation in Education, 2, 105-195.
Ross, K. N. (1983). Social area indicators of educational need. Hawthorn, Victoria: Australian Council for Educational Research.
Tallmadge, G. K. (1976). User's guide: ESEA Title I evaluation and reporting system. Mountain View, CA: RMC Research Corporation.
Tukey, J. W. (1958). Bias and confidence in not-quite large samples (Abstract). Annals of Mathematical Statistics, 29, 614.
Wilson, M. (1983). Adventures in uncertainty (ACER Occasional Paper No. 17). Hawthorn, Victoria: Australian Council for Educational Research.
Wolf, R. M. (1977). Achievement in America. New York: Teachers College Press.
CHAPTER 6

THE INFORMATION SIDE OF EVALUATION FOR LOCAL SCHOOL IMPROVEMENT

KENNETH A. SIROTNIK

College of Education, University of Washington, U.S.A.
Abstract

In reconceptualizing ‘program evaluation’ for local school improvement, it is argued that the ‘program’ being evaluated is really the ongoing schooling process itself - the daily circumstances, activities, and human orientations constituting the program of the local school. With this perspective in mind, a broad conception of evaluative information is offered in terms of multiple domains (personal, instructional, institutional, societal), multiple sources (teachers, students, administrators, parents, etc.), and multiple methods (survey, interview, participant observation, historical, archival, etc.). Additional issues related to data aggregation and the problems of collecting quality information are also discussed. Finally, it is emphasized that the idea of comprehensive information collection must be part of a larger commitment to, and actualization of, a school-based process of ongoing staff inquiry.
The theory and practice of ‘evaluation’ for educational settings - particularly public schools - are undergoing a number of interesting transformations in our rapidly accelerating ‘information society’. Added to this is a sociopolitical context in which: (i) more accountability pressures and demands from federal and state levels are being laid at the doorsteps of districts and schools; and (ii) fewer large-scale school improvement programs (as in the 60s and 70s) are being supported and funded federally. One transformation resulting from the combination of these forces, in my view, is that the object of traditional evaluation designs is changing from a ‘programmatic intervention’ focus to a focus on ‘schooling’ itself. In other words, the ‘program’ for evaluative focus is now the ongoing constellation of daily activities and outcomes constituting the program of the local school. This suggests the need for information - actually a considerable variety of information - designed to facilitate any number of evaluative purposes, from appraising the impact of specific programmatic interventions, to informing organizational and instructional planning and development activities, to, perhaps most importantly, monitoring periodically the health of the school work and learning environment. Since the processes of collecting, storing, retrieving, analysing, and reporting multiple forms of information are no longer particularly problematic (at least from a technological point of view), district- and school-based comprehensive information systems are becoming one
significant vehicle whereby evaluation can be reconceptualized specifically for local school improvement. (See, for example, the work by: Bank & Williams, 1986; Burstein, 1984; Cooley & Bickel, 1986; Hathaway, 1984; Idstein, 1985; McNamara, 1980; Sirotnik, 1984a; Sirotnik & Burstein, 1986.) My original assignment for contributing to this special issue was to write about ‘data gathering’ in educational evaluation. Given my opening remarks, I have probably departed somewhat from this assignment in two ways. First, I prefer to talk about information rather than data, if only because of the connotation of ‘data’ as codifications (usually quantitative) of information. Although for practical intents and purposes we could (and I will) use the terms data and information interchangeably, I want to emphasize my concern not only with what can be represented electronically in a computer, but also with the substantive content ostensibly represented in the data. Second, rather than addressing more traditional concerns in ‘program evaluation’ or ‘evaluation research’, I am more inclined to treat general evaluative concerns based on formative, school improvement processes. In fact, it is my belief that ongoing evaluation practice conceived for local school improvement will, by definition, serve (or be easily augmented to serve) the more particularistic needs of specific program evaluations. In addressing, therefore, information collection as part of evaluation for local school improvement, I will try to cover three main issues: (1) information domains (the ‘what’), information sources (the ‘who’), and methods for collecting information (the ‘how’); (2) the aggregation of information and concerns regarding the appropriate unit of analysis; and (3) practical details regarding the quality of data collection procedures. I will conclude with some views regarding how the process of collecting information must, in my view, fit into a larger commitment to inquiry and school renewal.
Collecting Information: What, from Whom and How?
The key ‘word’ here is multi - multicontent, multisource, and multimethod - when conceptualizing, defining, and collecting information about what goes on in schools. In A Study of Schooling (Goodlad, 1984), we developed a heuristic that helped to map out the schooling terrain (see Figure 6.1). Although more could be invented, the four domains - personal (or individual), instructional (or classroom), institutional (or school), and societal (or schooling in general) - proved adequate in encompassing most of the information schools and districts could possibly collect. The data sources listed are, of course, only illustrative of the many that could be relevant, e.g. administrators, district staff, and other community constituencies might be important additional sources of information.

[Figure 6.1 The schooling terrain: map one - a matrix of data sources (teachers, students, parents, observation*) by data domains (personal, instructional/class, institutional/school, societal/schooling), with illustrative examples in each cell. *Data were collected on this data source through observation. For the purposes of this conceptualization, observers are being treated not as a data source, but as part of the data collection method, just as questionnaire and/or interview methods were used in collecting data from teachers, students and parents.]

But Figure 6.1 underrepresents the complexity of the whole. More recently, we have augmented and refined this map in several ways (see Sirotnik, 1984a; Sirotnik & Burstein, 1983). First (see Figure 6.2), a substantive facet has been added that makes explicit the potential information inherent in the circumstances, activities (processes), and meanings in and of the school setting. The circumstances of schooling constitute the array of structures, situations, and physical features in the school setting - the ‘givens’ at any point in time. Age and condition of the school facility, community demography, size of the school (e.g. number of students), teacher-student ratio, teacher turnover, student transiency, duration of current principalship, daily schedule (e.g. period structure at the secondary level),
student tracking policies, materials and resources, and so on - these are just a few of the circumstances that vary from school to school. The activities are the ongoing and dynamic behaviors and processes that constitute the practice of schooling. These are the activity components of organizational and curricular commonplaces (Goodlad et al., 1979) such as instructional practices, learning activities, decision-making, communications, evaluation, and so forth, at all levels (see below) of the schooling process (see Figure 6.3). These commonplaces cut across the potential data sources and data domains in Figure 6.2.

[Figure 6.2 The schooling terrain: map two - a grid crossing the data domains (personal/individual, instructional/classroom, institutional/school, societal/schooling) with data sources (students, teachers, administrators, parents, classrooms, schools, district) at the classroom, school, and district levels, each cell subdivided into C = Circumstances, A = Activities, and M = Meanings.]

[Figure 6.3 The schooling terrain: map three - an information grid crossing the schooling commonplaces (physical environment, human resources, material resources, curriculum, organization, communication, problem-solving/decision-making, leadership, issues/problems, controls/restraints, expectations, climate, evaluation) with the circumstances-activities-meanings dimension and with data collection methods (survey questionnaire, interview, observation, case study, document/archive review). Curriculum is to be interpreted broadly and should include at least these additional commonplaces (see Goodlad, Klein & Tye, 1979): goals/objectives, content, instructional materials, classroom activities, teaching strategies, assessment, time, space, grouping.]

Not captured by just the circumstances and activities of the educational setting are the meanings that people infer from and bring to bear upon the setting. One sizeable chunk of these meanings is the constellation of orientations - sentiments (feelings), opinions, attitudes, beliefs, and values - that interact with the circumstances and activities of schooling. For example, the effectiveness of administrative-staff communication mechanisms may interact with staff perceptions of the school work environment and staff attitudes and beliefs regarding the exercise of authority. Classroom management techniques may depend upon teacher beliefs like “The student should be seen and not heard” versus a more egalitarian stance on student participation. The allocation of teaching resources to different content areas will depend on opinions regarding the most important function of schooling (e.g. academic versus vocational). Another sizeable chunk of meanings derives from one way we attach meaning to the teaching-learning act. We sample a domain of tasks that we believe will define learning objectives, and then we appraise students’ performance on this sample of tasks. We call this an achievement test. The point is that such performance measures are just one more class of indicators by which educational meaning is construed. Thus, the setting can be characterized, things happen in it, and people attempt to make sense out of it. Using the terms loosely, we might refer to the circumstances as the ‘factual’ information, information that, if systematically recorded, could be determined through document and archival review; to the activities as ‘observational’ information, although we would admit to this category the perceptions not only of ‘trained observers’ but of all participants in the setting; and to the meanings as ‘phenomenological’ information, with the understanding that methods for gathering it are not restricted to those ordinarily used in the phenomenological tradition.

A second necessary refinement of Figure 6.1 is achieved through the addition of an aggregation facet. This is not meant to be an analytical gimmick; rather, it is to suggest the fact that data collected at, or aggregated to, different levels may mean different things (Burstein, 1980; Cronbach et al., 1975; Sirotnik, 1980). In other words, information gathered at one level of the educational enterprise (e.g. student perceptions of classroom learning environments) can be aggregated to create information (not necessarily the same) at other levels of the system (e.g. a classroom measure of discipline and control or, perhaps at the school level, an indicator of policy regarding order and discipline). A third complication of the foregoing maps of schooling terrain is the addition of the necessary time facet (Figure 6.4), denoting that much of the information in Figures 6.1-6.3 is not static. Even in Figure 6.4, it is necessary to chop out some time segment; for example, I have chosen to represent the usual K-12 elementary and secondary educational time frame, with the potential for preschool and postsecondary information. Different inquiry purposes will, of course, suggest different points of entry and departure in this continuum of schooling. The point, however, is that a comprehensive information system must be capable of the longitudinal study of schooling.

Finally, in keeping with this multicontent, multisource, multilevel conception of information is a multimethod perspective on the appropriate forms of information collection. This necessarily implies a multiparadigmatic perspective on what constitutes appropriate knowledge. The concepts of convergent validity (Campbell & Fiske, 1959) from the more quantitative tradition and of triangulation (Denzin, 1978) from the more qualitative tradition both demand that much of the information suggested in Figures 6.1-6.4 can and should be collected in different ways. Various methods include, but are not limited to, survey questionnaire, interview, observation, case study, and historical analysis and document/archival review.
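To make the multidomain, multisource, multimethod, multilevel grid concrete, here is a minimal sketch of how one entry in such an information catalogue might be represented in software. The field names, category labels, and example values are mine, offered purely as an illustration and not as part of the framework described above.

```python
from dataclasses import dataclass

# Hypothetical catalogue entry for one piece of information in the grid of
# Figures 6.2-6.4: domain x source x facet x method x level, plus occasion.
@dataclass
class InformationItem:
    domain: str      # 'personal', 'instructional', 'institutional', 'societal'
    source: str      # 'student', 'teacher', 'administrator', 'parent', ...
    facet: str       # 'circumstances', 'activities', 'meanings'
    method: str      # 'survey', 'interview', 'observation', 'case study', 'archive'
    level: str       # 'individual', 'classroom', 'school', 'district'
    occasion: str    # e.g. '1986-fall' (the time facet of Figure 6.4)

# Example: student perceptions of classroom climate gathered by survey.
item = InformationItem(
    domain="instructional",
    source="student",
    facet="meanings",
    method="survey",
    level="classroom",
    occasion="1986-fall",
)
print(item)
```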
The choice of methods, as usual, depends on such matters as how well understood the constructs in question are, the time and resources available for data collection, analysis, and interpretation, the volume of information desired, and so forth. (See the usual research methods textbooks for guidelines, e.g. Babbie, 1983.) Moreover, various compendiums of constructs, items, and data collection devices exist (e.g. Nowakowski et al., 1984; Sirotnik, 1984b). But I must warn the reader that there is little to recommend the collection of information simply because it is there. Information
is a key ingredient in making an inquiry rigorous and systematic, i.e. in using relevant data to inform staff dialogue, facilitate decision-making, guide actions, and provide a descriptive context for evaluations. But information, in my view, does not drive an inquiry any more than tails wag dogs. Rather, a viable inquiry process continually suggests the kind of information likely to be useful to augment, stimulate and sustain the effort. Information fuels the engine of inquiry but does not automatically determine the direction of travel. I will return to this point in my closing remarks. In the next two sections, I will treat two of the more salient sets of concerns that arise in collecting, analysing, and interpreting information: multilevel issues and issues pertaining to data gathering procedures. Although these sets of concerns have their counterparts in both the quantitative and qualitative methodological traditions, I will locate my remarks more in the former tradition for three reasons: I do not have the space to deal with both; I have had more experience with quantitative methods; and, most importantly, it is my belief that many of the issues to follow automatically become part of the interpretive process in good, contextualized, qualitative inquiry. The quantitative analyst is far more likely to proceed routinely through a mass of information, forgetting about the quality of the data or the profoundly different ways in which covariance can be accounted for.
Multilevel Issues
Even a cursory glance at Figures 6.1-6.4 will reveal a latent, multilevel morass of information likely to frighten the bravest of data analysts. Indeed, the quantitative treatment of information gathered at and/or aggregated to two or more levels (e.g. individual, class, grade, school, district, etc.), with the intent of exploring multivariate relationships in the data, is only recently being taken seriously by those methodologists unwilling to put aside the conceptual and analytical issues buried in multilevel information. These issues, broadly conceived, fall into either measurement or statistical categories of concerns. I can only briefly outline matters here using a rather simple, illustrative example. Suppose it was of interest to assess the classroom learning environment construct ‘teacher concern’ in a secondary school; students in each class would respond to items like “This teacher cares about me as a person” on, say, a five-point Likert scale of agreement. Following typical internal consistency analyses - factor analyses and/or item-total correlations and alpha analyses - each student would receive a total score on a culled set of ‘teacher concern’ items. Suppose, further, that the relationship between ‘teacher concern’ and a measure of class achievement was the object of statistical investigation at this school. Now, several routes might be taken by data analysts, depending upon how naive or sophisticated (and/or daring) they might be. At the naive end, correlational analyses will be conducted across all students regardless of class membership - a total analysis. Sharper analysts will remember the intact-class warnings from their design courses; they might decide that the class is really the unit of analysis, give the classes ‘scores’ equal to the means of the student scores, and perform the correlational analysis on these class measures - a between (class) analysis. Towards the more sophisticated end, the analyst will be concerned about the variance within classes on the measures as well and might conduct a number of correlational analyses for each class separately - within (class) analyses; additionally, if the results warrant it, these analyses might be ‘averaged’ for interpretive purposes - a pooled within analysis. Finally, the sophisticated analyst will recognize the conceptual and
empirical value in exploring both within and between analyses simultaneously - a multilevel analysis.
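The sketch below works through these alternatives on invented student-level data; the variable names follow the ‘teacher concern’ example above, but the code is only an illustrative computation of the total, between-class, and pooled within-class correlations, not a procedure prescribed in this chapter.

```python
import numpy as np

def unit_of_analysis_correlations(x, y, cls):
    """Correlate two student-level variables three ways: across all students
    (total), across class means (between classes), and after removing each
    student's class mean (pooled within classes)."""
    x, y, cls = map(np.asarray, (x, y, cls))
    labels = np.unique(cls)

    total = np.corrcoef(x, y)[0, 1]

    x_means = np.array([x[cls == c].mean() for c in labels])
    y_means = np.array([y[cls == c].mean() for c in labels])
    between = np.corrcoef(x_means, y_means)[0, 1]

    # Deviations of each student from his or her own class mean.
    x_dev = np.concatenate([x[cls == c] - x[cls == c].mean() for c in labels])
    y_dev = np.concatenate([y[cls == c] - y[cls == c].mean() for c in labels])
    pooled_within = np.corrcoef(x_dev, y_dev)[0, 1]

    return total, between, pooled_within

# Invented data: 'teacher concern' scores, achievement scores, class labels.
concern     = [4.2, 3.8, 4.5, 2.1, 2.6, 1.9, 3.3, 3.0, 3.6]
achievement = [72, 68, 80, 55, 61, 50, 66, 60, 70]
classes     = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
print(unit_of_analysis_correlations(concern, achievement, classes))
```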
The basic dilemma is that each of these analytical approaches can yield quite different results (see Knapp, 1977, for the basic bivariate case); so what is the appropriate unit of analysis? The basic resolution, of course, is that it depends on the question(s) being investigated and on what you think you are measuring; moreover: (1) given the multilevel nature of schooling phenomena, it is unlikely that any single-level analysis will ever be adequate; and (2) given the lack of working theoretical models of schooling, it is likely that the empirical consequences of multilevel analyses will need to be investigated. (Although beyond the scope of discussion here, it should be noted that the statistical consequences of multilevel analyses can be quite complex, especially when these analyses are conducted longitudinally.) Thus far in this example, I have alluded to matters arising during the analyses of data, assuming that the constructs have been appropriately measured. However, measurement issues - which I deliberately glossed over above - are equally if not more important during psychometric analyses. Three highly interrelated subcategories of concerns arise during psychometric analyses: construct-indicator match, alternative indices of group level constructs, and analytic decisions in scale construction. First, following the example above, what is being measured by the item “My teacher cares about me as a person” - something about the individual student respondent or, when responses are aggregated (e.g. using the mean), something about the students as a group, something about the teacher, something about the class, all of the above, some of the above, or none of the above? We can either make a decision by fiat or, preferably, we can look at the empirical consequences of several possibilities. For example, thinking about the aggregate response as an indicator of a teacher or class construct, we would want to investigate the variance (and correlates thereof) of this aggregate across classes - a between analysis. If, rather, we viewed the data as representing individual perceptions contextualized by classroom experience, then we would be apt to investigate item response variance within classes. It would be hard, I would argue, to decontextualize these data and justify a total analysis. Parenthetically, the grammatical form of the item in our example is conceptually important but tends to make little empirical difference for items of this nature; that is, even an item like “This teacher cares about us as people” exhibits far more within-class than between-class variation (Sirotnik, 1980; Sirotnik et al., 1980). It should be added that construct-indicator issues pertain not just to the scaling of items but to ‘stand alone’ questions such as age, SES, teacher experience, etc., as well (see Burstein, 1980). The SES of students within a class, for example, may measure their families’ learning resources; aggregated to the class level, however, SES may be a proxy for ethnicity or school tracking policies. Second, aggregating from the individual to the group level usually takes the form of computing the arithmetic average. This makes sense if we are looking for some kind of group level indicator that is conceived as an additive constant in each individual response. But if we are seeking to measure, say, group consensus, asymmetry, an independent-dependent variable relationship, or the like, then aggregates such as the variance, the percentage above a given cutoff point, or a within group regression coefficient may be more useful. There is no redeeming value in using the mean just because it is easy analytically.
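A brief sketch of this point follows, using invented scores for a single class; which of these aggregates is appropriate depends on the group-level construct intended, and a within-group regression coefficient could be added in the same spirit.

```python
import statistics

def class_aggregates(scores, cutoff):
    """A few alternative class-level aggregates of the same student scores."""
    return {
        "mean": statistics.mean(scores),            # additive 'level' construct
        "variance": statistics.variance(scores),    # consensus / heterogeneity
        "prop_above_cutoff": sum(s >= cutoff for s in scores) / len(scores),
    }

# Invented scores for one class on a five-point agreement scale.
print(class_aggregates([4, 5, 3, 4, 2, 5, 4], cutoff=4))
```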
Third, and intimately tied to the above concerns, is the question of how to go about analyses at the psychometric stage of scale development. In the above example, I implied that factor
analyses or alpha coefficients were computed across individuals regardless of class membership. This is how most people do them. It is also ill-advised, given the above argument, which suggests that the between-class and within-class correlation matrices should be the ones that are analysed. And the possibility, of course, exists (quite likely, in fact) that the factor structure, item-total correlations, etc., will vary depending upon which matrix is analysed; that is, the very same set of items may behave quite differently and measure different constructs depending upon the level of analysis. In sum, if we are to contextualize our evaluative inquiries into schooling with information of the types discussed above, and if we are not to ignore the multilevel nature of schooling, then we are forced into coping with the often sticky measurement and statistical issues in multilevel analyses. The interested reader will want to first read a few overview articles (e.g. Knapp, 1982; Sirotnik & Burstein, 1985) and then consult some of the more technical references in these reports.
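As a rough illustration of this level-dependence, the following sketch computes coefficient alpha from the total, between-class, and pooled within-class item covariance matrices. The decomposition, function names, and data are mine; the sketch is only meant to show that the three matrices can yield different internal-consistency estimates, not to settle the analytic decisions discussed above.

```python
import numpy as np

def alpha_from_cov(cov):
    """Cronbach's alpha computed directly from an item covariance matrix."""
    k = cov.shape[0]
    return (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

def multilevel_alphas(items, cls):
    """items: (n_students, n_items) array of item responses; cls: class label
    for each student. Returns alpha based on the total, between-class (class
    means), and pooled within-class covariance matrices."""
    items, cls = np.asarray(items, dtype=float), np.asarray(cls)
    labels = np.unique(cls)
    class_means = np.vstack([items[cls == c].mean(axis=0) for c in labels])
    within_dev = np.vstack([items[cls == c] - items[cls == c].mean(axis=0)
                            for c in labels])
    return (alpha_from_cov(np.cov(items, rowvar=False)),
            alpha_from_cov(np.cov(class_means, rowvar=False)),
            alpha_from_cov(np.cov(within_dev, rowvar=False)))

# Invented example: three items answered by students in two classes.
items = [[4, 5, 4], [3, 4, 4], [5, 5, 5], [2, 2, 3], [1, 2, 2], [2, 3, 2]]
classes = ["A", "A", "A", "B", "B", "B"]
print(multilevel_alphas(items, classes))
```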
Collecting Information: Problems and Possibilities
We have all been appropriately warned by the traditional experimental design and evaluation research literature of the problems to be encountered when attempting to model laboratory research in field settings. Threats to interpreting independent-dependent variable relationships, to generalizability, to construct validity, and to statistical sensitivity - due largely to the inability to randomize subjects to treatment conditions, the systematic loss of subjects, the passage of time, imperfect treatment interventions, and the like - have given rise to a body of work attempting to compensate for these problems through better design and analysis. (See, for example, Bentler, 1980; Cook & Campbell, 1979; Muthen & Joreskog, 1983, among others.) Better design and analysis, however, cannot compensate for bad data arising out of faulty information collection procedures. Recently, this issue has been highlighted by Burstein et al. (1985a,b) in the specific context of evaluating programmatic interventions. However, data collection faults - such as burdening respondents with time-consuming paper and pencil surveys or interviews, asking overly sensitive questions, personality clashes between data collectors and respondents, inappropriate and/or inconsistent timing of data collection, inconsistent changes in instrumentation, data sources, data collection settings, and/or data collector approaches in gathering ostensibly the same information, and so forth - are equally disastrous for inquiry using comprehensive information systems of the type being described here. This is especially the case if a primary focus of the school-based inquiry has to do with monitoring the total school program over time. I cannot get specific about all the potential imperfections in collecting information in the space remaining, and I highly recommend the documents cited above. Three issues, however, deserve special emphasis in the evaluative context I have suggested: information overload, sensitivity of information, and the timing of data collection.
Information Overload

In A Study of Schooling, we collected over 800 survey responses from teachers, 350 from students, and 275 from parents at the secondary level. Similar amounts of information
were collected at the elementary level. Additionally, principals and teachers were interviewed for approximately one hour, and classes were observed for entire periods (or days) on three separate occasions. For typical schools in and around major cities, collecting this amount of information required approximately four highly trained data collectors working 18 days at senior high schools, three working 18 days at junior highs, and four working 13 days at elementary schools. Surveying of students required two (sometimes three) class periods at the secondary level and three (sometimes four) sessions (about 35-40 minutes each) at the elementary level. The teacher survey took anywhere from one and a half to three hours to complete. I could go on with this, but the point should be obvious: this volume of information served us well for research purposes, but it would be overkill for schools trying to develop ongoing, multipurpose (but purposeful) appraisal systems. There is a growing literature on the effect of ‘respondent burden’ on the quality of information gathered (e.g. Frankel & Sharp, 1980). Large enough increases in the volume, frequency, or duration of data collection impact unfavorably upon the reliability and validity of the information due to missing data, respondent fatigue and loss of attention, uncooperativeness, and so on. My colleagues and I at the Center for the Study of Evaluation, while working with a high school staff developing an information system, noticed a significant increase in missing data after 100-150 items in a student survey (Sirotnik & Burstein, 1986). We also noticed that the burden is not only on respondents, but also on the ‘gatekeepers’ of those respondents, viz., teachers, principals, and district administrators, upset about the time being drained from the school program. To be sure, there are some useful designs and analytical tools for alleviating some of the burden. For example, everything does not have to be collected from everyone; for certain purposes, information can be gathered from a sample of respondents, a sample of questions, or samples of both (as in matrix sampling). The virtues of matrix sampling for estimating score distribution parameters of scaled sets of items for sufficiently large respondent samples (e.g. district level, school level, and perhaps grade levels in large enough schools) are now well known (see Sirotnik & Wellington, 1977). For single survey items where the information is primarily useful at aggregated levels (e.g. for schoolwide evaluation and planning activities), ordinary respondent sampling is adequate. Burden can be further decreased over time by sampling different respondents on each occasion, although some overlap will be necessary for correlational studies. If careful longitudinal work is to be carried out, a small but representative respondent cohort would be useful. And do not forget technology. Although I am not aware of any place doing this yet, there is more than enough power in desk-top microcomputers to automate the paper/pencil part of information collection for ongoing systems of the type described here. Software could be developed that would contain the entire set of surveys and survey questions and would record and store the responses of students, teachers, etc. Respondents would sit down, enter their name (or pre-assigned ID code), respond to questions as prompted, be branched as necessary to different course contents, and be referenced to specific classes or periods.
Questionnaire administration would not need to be done in one sitting (assuming situational effects can be minimized); respondents could return another time and pick up where they left off. Moreover, in the event that some items were omitted, respondents could be prompted to complete them (or to indicate their wish not to answer them). Ordinarily cumbersome data management problems then become trivial: completed response protocols are stored and ready for analysis automatically; the multiple samplings of the same students that occur at the secondary level in different periods can be easily managed by prompting them only
once for demographic and schoolwide information while prompting them repeatedly for information pertaining to each class in which they were sampled; and so forth. Technology holds some promise not only for relieving much of the burden in data collection, but also for data analysis. In fact, students are an excellent resource for performing the data analysis tasks, and those tasks become an excellent ‘hands-on’ learning experience for students in mathematics, statistics, and computer science classes. But methodology and technology are only tools in the hands of, one hopes, sensible and sensitive people. The real key to relieving the burden of information-based appraisal systems lies in a viable inquiry process at district and building levels that allows for participatory deliberation and judicious selection of relevant information for mutually endorsed purposes.
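A minimal sketch of the matrix sampling idea mentioned above: each respondent answers only the items on his or her assigned form, and group-level parameters are estimated from the resulting incomplete matrix. The data, item names, and simple item-mean estimator are invented for the example, not taken from the chapter.

```python
import statistics

# Hypothetical incomplete response matrix from matrix sampling: each student
# answers only the items on a randomly assigned form (None elsewhere).
responses = {
    "student_1": {"item_1": 4,    "item_2": 5,    "item_3": None, "item_4": None},
    "student_2": {"item_1": None, "item_2": 3,    "item_3": 4,    "item_4": None},
    "student_3": {"item_1": 5,    "item_2": None, "item_3": None, "item_4": 2},
    "student_4": {"item_1": None, "item_2": None, "item_3": 3,    "item_4": 4},
}
items = ["item_1", "item_2", "item_3", "item_4"]

# Estimate each item mean from whoever happened to receive that item, then sum
# the item means to estimate the mean total score for the group as a whole.
item_means = {i: statistics.mean(r[i] for r in responses.values() if r[i] is not None)
              for i in items}
estimated_mean_total = sum(item_means.values())
print(item_means, round(estimated_mean_total, 2))
```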
Sensitivity of Information
In this ‘age of information’, the issue of sensitive information goes well beyond the usual ones, e.g. personal items (age, religion, ethnicity, etc.), upsetting items (nuclear war, pending personnel decisions, etc.), embarrassing items (sexual behavior, number of hours spent watching TV versus reading the latest in educational research and practice), and so forth. Certainly, this aspect of information sensitivity is important and must be contended with when constructing, testing, and using information gathering devices. Use and abuse issues in using comprehensive information systems, however, will emerge as even more important concerns (Sirotnik, 1984c). The Orwellian reality of computer technology significantly exacerbates the ever-present problems of information security and respondent confidentiality. Anonymity and confidentiality have traditionally been handled by eliminating identification codes or establishing trust. Computerizing the entire process makes it easy to keep track of respondents. Linking teacher responses to those of their students in their classrooms, or linking students’ responses one year with their responses the next, are necessary data management tasks if certain correlational or longitudinal analyses are to be done. These tasks, of course, require a ‘dictionary’ that links names to ID numbers. It may well be that the future holds a climate of increasing distrust and that analyses requiring respondent confidentiality will be a thing of the past. Again, the most promising strategy, in my view, is to involve people in significant ways in the conceptualization, development, and use of their own information system.
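As one hedged illustration of the ‘dictionary’ idea noted above, the sketch below keeps only an opaque respondent ID with the analysis data and holds the name-to-ID link in a separate store. The hashing scheme, names, and storage structures are my own invention for the example, not a procedure endorsed in this chapter.

```python
import hashlib

# Illustrative only: the salt and all names/values below are invented.
SECRET_SALT = "keep-this-out-of-the-analysis-files"

def make_id(name: str) -> str:
    """Derive a stable, non-reversible respondent ID from a name."""
    return hashlib.sha256((SECRET_SALT + name).encode()).hexdigest()[:10]

linking_dictionary = {}   # name -> ID, kept apart from the analysis data

def record_response(name, survey_data, data_store):
    rid = make_id(name)
    linking_dictionary[name] = rid
    data_store[rid] = survey_data   # only the opaque ID travels with the responses

data_store = {}
record_response("Pat Example", {"item_1": 4, "item_2": 2}, data_store)
print(data_store)   # keyed by ID; names stay in the restricted dictionary
```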
Timing the Collection of Information
Scheduling is a major consideration in collecting information in schools, not only because of potential confounding and error effects due to missing or tardy baseline data, too few observations over time, unsystematic time sampling, and so forth, but also because no time ever seems to be the right time to schedule a major data collection. I am sure that the experience of readers who have ever tried to do this sort of thing will confirm that there is precious little time left in schools between the regularly scheduled events such as holidays, semester breaks, assemblies, shortened days, etc., and unanticipated events like mass illnesses, snow storms, computer foul-ups, teacher strikes, etc.
I do not envision an information-based, comprehensive appraisal system collecting large amounts of data more than once per academic year. However, many data collection activities are ongoing by definition: accumulating attendance and dropout rates, achievement assessment that is referenced to instructional continuums, and so on. Additionally, specialized surveys or interviews for special circumstances will occur on an as-needed basis (e.g. a drug abuse survey, a parent survey on a pending school closure, etc.). And specific evaluations of programmatic interventions will probably require additional assessments in accord with the evaluation design. General ‘audits’ of the school’s circumstances, activities and meanings, however, might be done, say, between the 10th and 20th week of the first or second semester, depending on how current the need is for the information.
Closing Remarks
I have noted at the beginning and throughout this report my view of the importance of seeing the collection and use of information as part of a larger commitment to self-examination - individually and collaboratively - on the part of those seriously concerned with schools (administrators, teachers, students and parents, for example). Many of the problems above can be largely overcome as the idea, value, and use of information become ‘cultural regularities’ (Sarason, 1971) of schooling. I hope it is clear that this view of collaborative, informed inquiry is ‘methodological’ every bit as much as the quantitative and qualitative methods endorsed above. Information-based appraisal systems will not become a cultural regularity of schooling until critical discourse becomes a way of organizational life in schools (Sirotnik & Oakes, 1986). By this I mean a commitment to rigorous and sustained inquiry by the relevant stakeholders around such generic questions as: What goes on here in the name of educational goals, curricular emphasis, principal leadership, assessment of effectiveness, etc.? How did it come to be that way? Whose interests are being served by the way things are? Is this the way we want it? What other information do we have, or need to get, to help inform our deliberations? What are we going to do about all this, and how are we going to monitor it? I am thinking of this kind of organizational and professional context when I suggest that the collection, analysis, interpretation, and use of comprehensive information can play a major role in evaluation for local school improvement.
References

Babbie, E. (1983). The practice of social research. Belmont, CA: Wadsworth.
Bank, A., & Williams, R. C. (1986). Creating an ISS. In A. Bank & R. C. Williams (Eds.), Inventing the future: The development of educational information systems. New York: Teachers College Press. In press.
Bentler, P. M. (1980). Multivariate analysis with latent variables: Causal modeling. Annual Review of Psychology, 31, 419-456.
Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. In D. C. Berliner (Ed.), Review of research in education (Vol. 8). Washington, DC: American Educational Research Association.
Burstein, L. (1984). The use of existing data bases in program evaluation and school improvement. Educational Evaluation and Policy Analysis, 6, 307-318.
Burstein, L., Freeman, H. E., & Rossi, P. H. (Eds.) (1985a). Collecting evaluation data. Beverly Hills, CA: Sage.
Burstein, L., Freeman, H. E., Sirotnik, K. A., Delandshere, G., & Hollis, M. (1985b). Data collection: The Achilles heel of evaluation research. Sociological Methods and Research, 14, 65-80.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cooley, W. W., & Bickel, W. E. (1986). Decision oriented educational research. Boston: Kluwer-Nijhoff. In press.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J., Deken, J. E., & Webb, N. (1975). Research on classrooms and schools: Formulation of questions, design and analysis. Occasional Paper of the Stanford Evaluation Consortium, Stanford University.
Denzin, N. K. (1978). The research act: A theoretical introduction to sociological methods. New York: McGraw-Hill.
Frankel, J., & Sharp, L. (1980). Measurement of respondent burden. Washington, DC: Bureau of Social Science Research.
Goodlad, J. I. (1984). A place called school: Prospects for the future. New York: McGraw-Hill.
Goodlad, J. I., Klein, M. F., & Tye, K. A. (1979). The domains of curriculum and their study. In J. I. Goodlad et al. (Eds.), Curriculum inquiry. New York: McGraw-Hill.
Hathaway, W. E. (1984). Evolution, nature, uses and issues in the creation of local school district comprehensive information systems. Paper presented at the conference of the American Educational Research Association.
Idstein, P. (1986). We'll create the future, but may keep it a secret. In A. Bank & R. C. Williams (Eds.), Inventing the future: The development of educational information systems. New York: Teachers College Press. In press.
Knapp, T. R. (1977). The unit-of-analysis problem in applications of simple correlation analysis to educational research. Journal of Educational Statistics, 2, 171-186.
Knapp, T. R. (1982). The unit and the context of the analysis for research in educational administration. Educational Administration Quarterly, 18, 1-13.
McNamara, T. C. (1980). Ongoing documentation systems. Paper presented at the conference of the American Educational Research Association.
Muthen, B., & Joreskog, K. G. (1983). Selectivity problems in quasi-experimental studies. Evaluation Review, 7, 139-174.
Nowakowski, J., Bunda, M. A., Working, R., Bernacki, G., & Harrington, P. (1984). A handbook of educational variables: A guide to evaluation. Boston: Kluwer-Nijhoff.
Sarason, S. B. (1971). The culture of the school and the problem of change. Boston: Allyn & Bacon.
Sirotnik, K. A. (1980). Psychometric implications of the unit-of-analysis problem. Journal of Educational Measurement, 17, 245-282.
Sirotnik, K. A. (1984a). An outcome-free conception of schooling: Implications for school-based inquiry and information systems. Educational Evaluation and Policy Analysis, 6, 227-239.
Sirotnik, K. A. (1984b). Principles and practice of contextual appraisal for schools. Los Angeles: Laboratory in School and Community Education, UCLA.
Sirotnik, K. A. (1984c). Using vs. being used by school information systems. Paper presented at the conference of the American Educational Research Association.
Sirotnik, K. A., & Burstein, L. (1983). Methodological issues in studying the effectiveness of schooling: Recent developments and lingering concerns. Paper presented at the conference of the American Educational Research Association.
Sirotnik, K. A., & Burstein, L. (1985). Measurement and statistical issues in multilevel research on schooling. Educational Administration Quarterly, 21, 169-185.
Sirotnik, K. A., & Burstein, L. (1986). Making sense out of comprehensive school-based information systems: An exploratory study. In A. Bank & R. C. Williams (Eds.), Inventing the future: The development of educational information systems. New York: Teachers College Press. In press.
Sirotnik, K. A., & Oakes, J. (1986). Critical inquiry for school renewal: Liberating theory and practice. In K. A. Sirotnik & J. Oakes (Eds.), Critical perspectives on the organization and improvement of schooling. Boston: Kluwer-Nijhoff. In press.
Sirotnik, K. A., & Wellington, R. (1977). Incidence sampling: An integrated theory for "matrix sampling." Journal of Educational Measurement, 14, 343-399.
Sirotnik, K. A., Nides, M. A., & Engstrom, G. A. (1980). Some methodological issues in developing measures of classroom learning environment. Studies in Educational Evaluation, 6, 279-289.
CHAPTER 7

THE DEFINITION AND INTERPRETATION OF EFFECTS IN DECISION ORIENTED EVALUATION STUDIES*

SOREL CAHAN†, DANIEL DAVIS‡ and NORA COHEN‡

†The Hebrew University of Jerusalem, Israel and ‡The Chief Scientist’s Office, Israeli Ministry of Education and Culture, Israel
Abstract

The estimation and evaluation of the effects of alternative educational or social policies is a major purpose of decision oriented, comparative evaluation studies. This paper reviews the rationale underlying the definition and interpretation of various measures of effect magnitude and examines their relevance to evaluation research.
Introduction

Underlying the initiation of most summative evaluation studies is a need to decide between alternative educational or social policies. In the simplest case there are only two alternative policies, usually the incumbent policy and a challenging one. In more complex situations there are several candidate policies. The decision then involves choosing the ‘best’ policy in terms of the decision maker’s utility function, one component of which is the relative effectiveness of the various policies. That is, other things being equal (e.g. budgetary and logistical considerations), the policy decided upon should be the most effective one. Therefore, quantitative information concerning the effect of each policy is of major relevance to the choice between policies, and the estimation of these effects should be the major purpose of any summative evaluation study. Indeed, the last decade has witnessed an ever increasing emphasis on estimating the magnitude of effects in the analysis and interpretation of experimental results, in both basic and evaluation research, and various measures of effect magnitude have been suggested in the literature (e.g. Glass, 1976; Cohen, 1977; Rosenthal & Rosnow, 1984; Wolf, 1984; Hedges & Olkin, 1985).
* We wish to thank Terri Finkelstein and Itamar Gati for their valuable comments and suggestions on previous drafts of this paper.
In this paper we review the rationale underlying the definition and interpretation of measures of effect magnitude and examine their relevance to decision oriented, comparative evaluation studies. The forthcoming discussion starts with a conceptual framework for the definition of a ‘policy effect.’ The term ‘policy’ is introduced here, instead of the commonly used term ‘treatment’, in order to include situations where different treatments are administered to different units in the same system. For example, educational programs such as school desegregation, ability grouping or individualized instruction actually involve different treatments for different students or groups of students. Nevertheless, from the decision-maker’s point of view, adopted by this paper, each of these programs is a single policy. Thus, the term policy is used here as a generic term for a set of treatments. A policy is a plan for the administration of treatments to units of a given system, either as a function of their characteristics or contingent upon their behavior. Alternative policies may consist of different plans. For example, one policy may consist of the adoption of ability grouping by all subsystems (e.g. districts, communities or schools) in the target system, while another policy may leave the decision concerning ability grouping to individual subsystems. In either case, the treatment administered to each unit (i.e. each student) is a particular level of ability grouping. In the simplest case, a policy consists of one treatment to be administered to all units. However, since this need not be the general rule, the conceptual distinction between a treatment policy and a specific treatment is a necessary one. In the following discussion we shall be concerned with the definition of the effect of implementing an educational or social policy. The estimation of this effect will not be dealt with in this paper. For a comprehensive review of this issue see Hedges and Olkin (1985).
A Framework for the Definition of a Policy Effect
The definition of a policy effect requires the specification of three elements:

(1) The relevant system. Educational and social policies refer to a system. Hence, the definition of a policy effect should refer to a specified system. The specification of the relevant system is closely related to the nature of the decision to be made. In the simplest case, the policy is to be implemented in the entire national (or state-level) system. In this case, the policy effect is defined for the entire system. In other instances, the decision is to be made separately for each subsystem (e.g. districts, communities or schools), requiring that the effect of each policy be defined and estimated for each relevant subsystem. The definition of a policy effect in the forthcoming discussion applies to any system or subsystem.

(2) The relevant outcome. Since the same policy may affect different outcomes (i.e. dependent variables) in the same system, the definition of the policy effect necessarily requires specification of the relevant outcome. Therefore, associated with each policy there is a set of effects, each defined with respect to a different outcome. The intended outcomes of a social or educational policy at the system level are usually the parameters of the univariate or multivariate distribution of variables defined at a lower level of aggregation. For example, the outcome of an educational policy implemented at the district level may be the district mean or the district variance of a school or individual level variable such as scholastic achievement. However, the definition of the outcome may be much more complex, involving two or more lower level variables. For example, the purpose of a compensatory education program at the school level may be to decrease the within school correlation between parents’ education and students’ achievement. Here, the outcome is the within-school correlation. The definition of a policy effect in this paper applies to any particular definition of the outcome, provided that the outcome is measured on an interval scale at least.

(3) A reference point. The conceptualization of a policy effect requires the specification of a reference point. Usually this is provided by the outcome value for the same system under a meaningful control condition. The control condition is another treatment plan. For example, if a challenging policy involves the administration of the same treatment to all units in the relevant system, the control condition may consist of no treatment, administration of a placebo or the administration of another treatment, usually the incumbent one. On the other hand, if the challenging policy involves administration of different treatments to different units according to some rule, then the control condition may consist of the administration of one treatment to all units. It should be emphasized that there is no unique natural control condition for any particular policy and that the same policy may have different effects with respect to different controls. If there is no natural control condition, the effects of the various alternative policies can be defined with respect to one of them, arbitrarily defined as the common reference point. For example, Ragosta et al. (1982) defined a variety of control conditions for a series of treatments. In each case, the control condition was the one least exposed to the treatment.
The Definition of a Policy Effect
Given a set of J alternative policies (which includes a control condition), the effect Δ(P_j) of policy P_j (j = 1, 2, ..., J−1) on the outcome O in the system, with respect to the control condition C, is defined as the difference between two conditional outcome values: O|P_j, the value of O following implementation of P_j, and O|C, the value of O in the same system following the implementation of C:

Δ(P_j) = O|P_j − O|C.    (1)
Note that since only one policy can be implemented in a given system at any particular point in time, the effect Δ(P_j) refers to an empirically impossible situation. Therefore, it is by definition a hypothetical system parameter; it reflects the hypothetical gain (or loss) in outcome O associated with the implementation of policy P_j, as compared to the implementation of the control condition C, in the target system. Note also that the effect Δ(P_j) is specific to a particular combination of outcome, control condition and system; hence, it is not an absolute characteristic of policy P_j.
Comparability of Effects
The end product of an evaluation study may be represented by a matrix. The rows of the matrix are the various policies and the columns are the various outcomes. For each combination of policy and outcome there is an effect. Since effects are expressed on the outcome metric, direct comparison will be possible only between the effects of the various policies with respect to the same outcome, whereas the effects of the same policy on different outcomes will not be directly comparable. One solution to this problem is the expression of each effect as a percentage of the O|C value, denoted by π_j, where

π_j = [Δ(P_j) / (O|C)] × 100.    (2)
π provides a common, unit-free metric for the measurement of effects and ensures direct interpretability of the magnitude of each effect as well as comparability between effects defined with respect to different outcomes. Moreover, the π values of the same policy for different outcomes can be combined (e.g. added or averaged), and differences between the π values of the various policies can be considered by the policy maker, together with the corresponding differences between other policy characteristics, such as cost, logistical feasibility, etc. However, π is a meaningful expression only if O is measured on a ratio scale. When O is measured on an interval scale only (which has no absolute zero point), the division of Δ(P) by O|C results in a meaningless value (in the sense of not being invariant under admissible transformations). This is most likely to be the case if O is the system mean μ_Y of an interval variable Y, such as students’ scores on an achievement test. In the remainder of the paper we shall be concerned with this case. Unfortunately, most scales used in psychology and education have no natural zero point, a reflection of the fact that the attributes represented by these scales are not quantities. Note that if O is the system variance of Y, the O scale will still be a ratio one, thus allowing for the meaningful interpretation of π.
is usually referred to as the effect size. The rationale underlying the definition of A is identical to the rationale underlying the standardization of the Y scores themselves. It reflects the fact that these scores are meaningful only in a relative sense, as rank orders, percentiles or Z scores. Indeed, if the raw Y scores under both conditions were standardized using the mean and standard deviation under the control condition, Ajwould directly result from the original definition of the policy effect A(P,) in terms of the difference between the two conditional means )_~rlP, and prlC, where py/C is by definition zero (Hedges & Olkin, 19X5). Alternatively, one can conceive of A, as the Z score corresponding to ).L,,P, in the hypothetical distribution of Y scores under the control condition (Hustn & Postlethwaite, 198.5). When each outcome is defined as the mean of an interval variable, the computation of
Δ solves the problem of the incomparability between effects defined with respect to different outcomes by providing a common, unit-free scale for the measurement of effects. When there is no meaningful control condition and the policy effect is defined with respect to an arbitrarily chosen policy P*, the appropriate standard deviation is σ_Y|P*. Note that, unlike Δ(P) and π, Δ is affected by measurement error in Y, reflected in σ_Y|C. The larger the unreliability of Y, the larger the underestimation of the true effect size by Δ. The disattenuation of Δ is discussed by Hedges and Olkin (1985).
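A minimal numerical sketch of the three measures defined so far appears below. The figures are invented; π is computed only for a ratio-scaled outcome (here a hypothetical dropout rate), in line with the restriction just discussed, while the standardized effect size is computed for interval-scaled achievement means.

```python
def policy_effect(o_p, o_c):
    """Delta(P): the outcome under policy P minus the outcome under the
    control condition C (equation 1)."""
    return o_p - o_c

def pi_percent(o_p, o_c):
    """pi: the effect expressed as a percentage of O|C (equation 2);
    meaningful only when the outcome is measured on a ratio scale."""
    return 100.0 * policy_effect(o_p, o_c) / o_c

def effect_size(mu_p, mu_c, sd_c):
    """Delta: the effect standardized by the control-condition standard
    deviation of Y (equation 3)."""
    return (mu_p - mu_c) / sd_c

# Invented figures: a dropout rate (ratio scale) and achievement means
# (interval scale) with a control-condition SD of 10.
print(pi_percent(0.08, 0.10))         # -> -20.0 (a 20% reduction)
print(policy_effect(54.0, 50.0))      # -> 4.0 raw score units
print(effect_size(54.0, 50.0, 10.0))  # -> 0.4 control-group SD units
```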
The Effect of a ‘Policy Variable’

The question underlying the definition of Δ(P) and of its derived measures π and Δ in the previous section referred to the effect of one policy on the system outcome O. Given a set of J policies (one of which is, or serves as, the control condition), an evaluation study has to define and estimate J−1 effects (the Jth effect being, by definition, zero) for each outcome. Two indices are frequently mentioned in the literature as alternative measures of effect magnitude: f² (Cohen, 1977) and ω² (Hays, 1963; Wolf, 1984; Hedges & Olkin, 1985), also referred to by some authors as η² (e.g. Cohen, 1977) or ρ² (when J = 2, e.g. Rosenthal & Rubin, 1982). Basic to the definition of these indices is the conceptualization of the effect α(X) of a hypothetical ‘policy variable’ or ‘policy factor’ X, where X stands for each of the J policies (X = P_1, P_2, ..., P_J). In order to interpret α(X), one can conceive of the hypothetical implementation of each of the J policies in the same system. Associated with the implementation of each policy there is a conditional outcome, O|X (O|X = O|P_1, O|P_2, ..., O|P_J). The definition of α(X), which underlies the definition of the derived indices f² and ω², is in terms of the variance σ²(O|X) of the hypothetical set of J conditional O|X values:*

α(X) ≡ σ²(O|X).    (4)

*Note that since O|X is not a random variable, conceptually σ²(O|X) is not a variance (Hedges & Olkin, 1985). Nevertheless, it is referred to as such (e.g. Cohen, 1977). Note also that in samples, the definition of α(X) in terms of σ²(O|X) requires that the J research groups be equally sized.
The larger the differences between the conditional O|X values, i.e. the larger the differences between the effectiveness of the J policies, the larger α(X) will be. Indeed, σ²(O|X) equals the variance σ²[Δ(P_j)] of the effects Δ(P_j) of the J policies, each defined with respect to the control condition C. Unlike Δ(P_j), which provides policy-specific information, α(X) refers to the entire set of mutually exclusive alternative policies, i.e. to the hypothetical ‘policy variable’ X, and does not provide information concerning the individual policies. Hence, unless J = 2, α(X) is entirely irrelevant to the motivation underlying decision oriented, comparative evaluation studies: the need to choose between policies. Note that α(X) would be appropriate if X were a treatment variable and different values of this variable (i.e. different treatments) were to be randomly administered to equally
sized subsystems, perfectly matched, before the administration of the treatment, with respect to some subsystem dependent variable V. Then α(X) ≡ σ²(V) would be the answer to an entirely different question: “What is the X-induced variance of the subsystems’ V values?” For example, this question is meaningful when X stands for a population of teachers. Given that each class would be taught by a different teacher, an estimate of the magnitude of the teacher-induced variance of some class level variable (e.g. mean achievement scores) would be informative. Since the allocation of different teachers to different classes has no feasible alternative, the question underlying the estimation of α(X) in this case does not refer to the choice between different policies. Clearly, this is not the case in decision oriented, comparative evaluation studies. Here, the variable X stands for J mutually exclusive alternative policies, only one of which will eventually be implemented in the target system. Unlike the set of Δ(P_j) values, α(X) ≡ σ²(O|X) is irrelevant to this choice and, therefore, practically useless. However, when there are only two policies, P and C, α(X) is a function of the effect Δ(P) of policy P, defined in the previous section:

α(X) = [½Δ(P)]².    (5)
Note that, since α(X) is a nondirectional expression, it is duotonically related to Δ(P) and monotonically related to the absolute value of the policy effect, |Δ(P)|. Therefore, α(X) is an alternative to |Δ(P)| only in terms of rank order. This functional relation holds true also when there are more than two alternative policies if, instead of defining X as the set of J alternative policies, one defines J−1 dichotomous variables, one for each policy (except the control one), such that for each variable X = P or X = C. Note that, unlike the overall α(X), the set of J−1 policy-specific α(X)s is potentially relevant to the choice between policies. The following discussion, which is restricted to J = 2, applies equally, therefore, to each of the J−1 policy-specific α(X)s when J > 2.
Three features of α(X) are worth mentioning:
(1) Unlike Δ(P), which attributes the entire difference (O|P − O|C) to the effect of policy P, α(X) equally divides this difference between P and C, thus awarding them equal status. Indeed, √α(X) = ½|Δ(P)| can be interpreted as the mean of the absolute values of the effects of P and C, both defined with respect to the mean of O|P and O|C. Therefore, unlike Δ(P), α(X) is symmetric with respect to P and C.
(2) Consequently, unlike Δ(P), α(X) is nondirectional. Therefore, it can be meaningfully interpreted only as a definition of the magnitude of the effect of a policy variable.
(3) Unlike Δ(P), α(X) is expressed in squared O units of measurement, which is conceptually questionable and affects its interpretability. Note that, unlike (1) and (2) above, this is not necessary for defining the effect of a policy variable.
α(X) underlies the definition of two indices, f² and ω², frequently interpreted as measures of effect magnitude. In the remainder of this paper we present the definition of these indices, examine their relation to one another and to the previously defined measure of effect size, and discuss their interpretation as measures of effect magnitude. Even though the original definition of these indices applies to X, where X stands for a set of J alternative policies (see, for example, Cohen, 1977), we shall be concerned only with their definition when there are two alternative policies, P and C.
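To make the distinction concrete, the short Python sketch below (with made-up outcome values, not data from this chapter) computes the policy-specific effects Δ(Pj) against the control and the overall α(X) as the variance of the conditional outcomes; the former identify the better policy, the latter does not.

```python
# Hypothetical conditional outcomes O|X for a control C and two policies P1, P2.
conditional_outcomes = {"C": 50.0, "P1": 54.0, "P2": 58.0}  # assumed values

# Policy-specific effects Delta(Pj) = O|Pj - O|C, each defined against the control.
effects = {p: o - conditional_outcomes["C"]
           for p, o in conditional_outcomes.items() if p != "C"}

# alpha(X): variance of the J conditional O|X values (equal weights, divisor J).
values = list(conditional_outcomes.values())
mean_o = sum(values) / len(values)
alpha_x = sum((v - mean_o) ** 2 for v in values) / len(values)

print(effects)   # {'P1': 4.0, 'P2': 8.0} -- informative for choosing between policies
print(alpha_x)   # ~10.67 -- one overall number, silent about which policy is better
```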
The Indices f² and ω²

Two solutions to the problem of the incomparability of the α(X) values between different outcomes, as well as to the difficulties associated with the interpretation of these values, are f² and ω². Both solutions involve some kind of standardization of α(X) and only apply when O is the system mean of a variable Y, defined at a lower level of aggregation, i.e. in the case in which O ≡ μ_Y. Accordingly, O|P ≡ μ_Y|P, O|C ≡ μ_Y|C, Δ(P) = μ_Y|P − μ_Y|C and α(X) ≡ σ²(O|X) = σ²(μ_Y|X) = [½(μ_Y|P − μ_Y|C)]². The definition of f² is conceptually analogous to the definition of the effect size Δ in the previous section. It involves standardization of α(X) by the hypothetical average σ²_Y|X of the conditional variances σ²_Y|P and σ²_Y|C. The resultant f²,

f² = α(X) / σ²_Y|X = [½(μ_Y|P − μ_Y|C)]² / [½(σ²_Y|P + σ²_Y|C)],
provides a common, metric free scale for the measurement of α(X), thus allowing for the direct comparison between α(X)s defined for different outcomes. A major deficiency of f² is that it has no maximum. The index ω² corrects for this deficiency (Hays, 1963, 1973; see also Cohen, 1977; Hedges & Olkin, 1985). When applied to a set of two policies, P and C, the definition of ω² involves the division of α(X) by the sum of two terms: (a) σ²_Y|X, the average of the conditional variances, and (b) α(X) itself:
ω² = α(X) / [σ²_Y|X + α(X)] = [½(μ_Y|P − μ_Y|C)]² / {½(σ²_Y|P + σ²_Y|C) + [½(μ_Y|P − μ_Y|C)]²}.    (6)
Like f², ω² is a pure number, which allows for the measurement of α(X) on a (different) metric free scale. However, its definition is conceptually different from that of f². While f² results from the standardization of α(X) by σ²_Y|X, which is independent of α(X) itself, ω² obtains from the division of α(X) by σ²_Y|X + α(X). The inclusion of the numerator in the denominator in the definition of ω² gives an index that ranges between 0 and 1. However, unlike f², ω² is not proportional to α(X). In fact ω² is only monotonically related to α(X). Therefore, magnitude relations between the ω² values cannot be meaningfully interpreted in terms of the corresponding relations between the α(X) values. Nevertheless, ω² is by far the most frequently used metric free measure of the magnitude of the effect of a policy variable. This is probably due to the apparent ease of interpretation of the ω² values in terms of proportions. This interpretation is identical to that of the correlation ratio η²_Y·X, thus explaining the alternative use of ω² and η². In order to clarify this point we focus, for the moment, on η², the descriptive index of strength of relationship used in correlational studies. When X is dichotomous, η²_Y·X equals ρ²_pb (the squared point biserial correlation between X and Y). Given a population of units and two variables Y and X, where X is dichotomous and uniformly distributed, the definition of η²_Y·X = ρ²_pb involves the partitioning of the total variance σ²_Y of Y, and its expression as the sum of two components: σ²_Y = σ²_W + σ²_B, where σ²_W is the pooled within-groups variance of Y and σ²_B is the between-groups variance of Y.
Since σ²_B ≤ σ²_Y, the total variance provides a meaningful upper bound for the between-groups variance. Accordingly, the expression

η²_Y·X = σ²_B / σ²_Y = σ²_B / (σ²_W + σ²_B)

can be meaningfully interpreted in terms of the proportion of σ²_Y lying between the groups defined by X. Note that this interpretation rests on the fact that σ²_Y, the total variance of Y, is the maximal possible value of σ²_B in a given population. In order to make sense of the apparently analogous interpretation of ω², one has to conceive of the hypothetical super-distribution of Y values, resulting from the pooling of the conditional distributions of Y|P and Y|C (i.e. the pooling of the two distributions which would be obtained if P and C were each implemented in the entire system at the same time). Then, σ²_Y|X + α(X) can be thought of as the 'total' variance of this super-distribution, denoted by some authors by σ²_T (e.g. Cohen, 1977).
The ratio ω² = α(X) / [σ²_Y|X + α(X)] = α(X)/σ²_T can then be interpreted as the proportion of the super-distribution variance σ²_T made up by the variance α(X) of the conditional means, i.e. the proportion of the 'total' variance in Y 'accounted for' by the policy variable X (Cohen, 1977; Hedges & Olkin, 1985). Underlying this interpretation of ω² is the assumption that σ²_Y|X, α(X) and their sum are equivalent to σ²_W, σ²_B and σ²_Y in the definition of η², respectively. However, here X is not a within-system variable. Rather, it is a hypothetical policy variable, which stands for two mutually exclusive policies, P and C, only one of which may be implemented in the target system at any point in time. Depending on which was implemented, P or C, the variance of Y would be given by either σ²_Y|P or σ²_Y|C. Therefore, σ²_Y|P or σ²_Y|C are the only empirically meaningful total variances of Y in the target system, analogous to σ²_Y in the definition of η². However, unlike σ²_W and σ²_B in the context of η², σ²_Y|X and α(X) do not result from the partitioning of this total variance. Hence, neither σ²_Y|P nor σ²_Y|C are upper bounds for α(X). Moreover, the very notion of an upper bound for α(X) is conceptually problematic. Nevertheless, this notion is central to the definition of ω² and its interpretation in terms of proportions. It explains the need for the definition of the artificial 'super-variance' σ²_T as the sum of σ²_Y|X and α(X). While the division of α(X) by this sum results in an index varying between 0 and 1, the interpretation of the ω² values in terms of the proportion of the total variance of Y accounted for by X is misleading, seeing that σ²_Y|X + α(X) is not the total variance of Y in the target system. Since σ²_Y|X + α(X) ≥ σ²_Y|X, ω² is systematically smaller than f² and erroneously creates the impression that the magnitude of α(X) relative to the total variance of Y (approximated by σ²_Y|X) is smaller than it actually is. Indeed, ω² = f²/(1 + f²) < f².
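As a numerical illustration (a minimal sketch with assumed conditional means and variances, not data from the chapter), the fragment below computes α(X), f² and ω² for two policies P and C and checks the identity ω² = f²/(1 + f²) noted above.

```python
# Assumed conditional means and variances of Y under P and C (illustrative only).
mu_p, mu_c = 54.0, 50.0
var_p, var_c = 100.0, 100.0

alpha_x = (0.5 * (mu_p - mu_c)) ** 2          # alpha(X) = [1/2(mu_Y|P - mu_Y|C)]^2
var_within = 0.5 * (var_p + var_c)            # sigma^2_Y|X: average conditional variance

f2 = alpha_x / var_within                     # f^2: alpha(X) standardized by sigma^2_Y|X
w2 = alpha_x / (var_within + alpha_x)         # omega^2: numerator included in denominator

print(f2, w2)                                 # 0.04  0.03846...
print(abs(w2 - f2 / (1 + f2)) < 1e-12)        # True: omega^2 = f^2 / (1 + f^2)
```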
The Relations Between ω², f² and Δ

If the conditional variances σ²_Y|P and σ²_Y|C are approximately equal, then the relations between f² and ω² on the one hand, and Δ (the effect size of policy P) on the other hand, are a reflection of the relation between the underlying expressions α(X) and Δ(P). That is, both f² and ω² are duotonically related to Δ and monotonically related to |Δ|:
f² = α(X) / [½(σ²_Y|P + σ²_Y|C)] = Δ²/4

and

ω² = α(X) / [½(σ²_Y|P + σ²_Y|C) + α(X)] = Δ²/(4 + Δ²).
These functions are presented graphically in Figure 7.1. If there are J policies, then these functional relations hold true also for the J−1 values of f², ω² and Δ, provided that σ²_Y|Pj does not vary considerably between policies. One can conclude that if Δ(P) is the accepted definition of the effect of policy P and the effect size Δ is the accepted unit free scale for the measurement of this effect, then f² and ω² can be legitimately interpreted only as monotonic measures of the absolute value of the effect size. Therefore, the f² and ω² values themselves and the magnitude relations between them cannot be meaningfully interpreted as representative of the corresponding |Δ| values and of the magnitude relations between these values, respectively. Hence, the indices f² and ω² cannot be considered as alternative quantitative measures of the magnitude of a policy effect. When interpreted as such, they would be misleading. Take, for example, |Δ₁| = 0.4 and |Δ₂| = 0.8.
Figure 7.1. f² and ω² as a function of the absolute value of the effect size Δ.
The ratios between the corresponding values of f² and ω² are 4 and approximately 3.6 respectively, whereas |Δ₂|/|Δ₁| = 2.
Assuming that σ²_Y|C = σ²_Y|P, the lack of proportionality between f² and |Δ| is due only to f² being a square function of |Δ|; f ≡ √f² = ½|Δ| is proportional to |Δ| and, therefore, can be meaningfully interpreted as a quantitative measure of the absolute magnitude of a policy effect. This is not true for ω ≡ η ≡ ρ_pb ≡ √ω², unless |Δ| ≪ 1. For this range of |Δ| values, which is likely to include the effect sizes of most educational policies, ω ≈ f is approximately proportional to |Δ|. However, while in this case both f and ω can be legitimately considered as possible alternatives to |Δ|, they are not alternatives to the signed value of the effect size Δ. When there are J alternative policies which differ in terms of the sign of their effects, the rank order of the policies in terms of f and ω will not reflect their rank order in terms of Δ. The same argument holds true for the comparison between the effects of the same policy on different outcomes. Hence, f and ω cannot be considered as alternatives to Δ in a comparative sense, their meaningful interpretation as measures of effect magnitude being restricted to the special case in which J = 2 and there is only one relevant outcome. However, since ω ≈ f = ½|Δ|, their numerical value would be misleading if interpreted as representative of the absolute size of the policy effect.
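The square-function relation can be checked directly. The sketch below (pure arithmetic, no data) evaluates f² = Δ²/4 and ω² = Δ²/(4 + Δ²) for |Δ| = 0.4 and |Δ| = 0.8, reproducing the ratios discussed above and the identity f = ½|Δ|.

```python
# f^2 and omega^2 as functions of the effect size Delta
# (valid when the conditional variances are equal).

def f2(delta):
    return delta ** 2 / 4.0

def w2(delta):
    return delta ** 2 / (4.0 + delta ** 2)

d1, d2 = 0.4, 0.8
print(d2 / d1)                   # 2.0   -- ratio of the effect sizes
print(f2(d2) / f2(d1))           # 4.0   -- f^2 grows with the square of |Delta|
print(w2(d2) / w2(d1))           # ~3.59 -- omega^2 is monotone in |Delta| but not proportional
print(f2(d1) ** 0.5, 0.5 * d1)   # 0.2 0.2 -- f = |Delta|/2 exactly
```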
A final remark concerns the behavior of f and ω in samples where, unlike |Δ|, they are affected by unequal ns of the experimental (P) and control (C) groups. This is demonstrated in Table 7.1, in which two cases of experimental and control data are presented. The only difference between the cases is that in case 2 the experimental group has 10 times as many subjects as in case 1.
Table 7.1
Data from Two Cases with Identical Distributions in the Experimental (P) and Control (C) Groups but with Different n

                    Case 1                             Case 2
         Control (C)   Experimental (P)     Control (C)   Experimental (P)
         Y(N)          Y(N)                 Y(N)          Y(N)
         2(1)          4(1)                 2(1)          4(10)
         4(1)          6(1)                 4(1)          6(10)

Δ                 2                                  2
f                 1                                  0.57
ω = r             0.70                               0.50
It is clear that the two cases should result in identical effect magnitudes. In fact, the Δs are identical, but f and ω = r are quite different. This sensitivity to sample size is yet another reason to avoid using f and ω = r as measures of effect magnitude in evaluation research.
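The sample-size sensitivity shown in Table 7.1 can be reproduced with the sketch below (same made-up scores as the table; population-style variances and n-weighted group means are my assumptions about how the tabled values were computed): enlarging the experimental group leaves Δ unchanged but shrinks the weighted between-groups variance, and with it f and ω = r.

```python
# Reproducing Table 7.1: identical score distributions, different group sizes.

def stats(control, experimental):
    groups = [control, experimental]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-groups variance, each group weighted by its n.
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / n_total
    # Pooled within-groups variance (population-style divisor).
    within = sum(sum((y - m) ** 2 for y in g)
                 for g, m in zip(groups, means)) / n_total
    sd_c = (sum((y - means[0]) ** 2 for y in control) / len(control)) ** 0.5
    delta = (means[1] - means[0]) / sd_c          # effect size, unaffected by n
    f = (between / within) ** 0.5
    w = (between / (between + within)) ** 0.5     # omega = r
    return delta, f, w

print(stats([2, 4], [4, 6]))                  # ~ (2.0, 1.0, 0.71)  -- case 1
print(stats([2, 4], [4] * 10 + [6] * 10))     # ~ (2.0, 0.57, 0.50) -- case 2
```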
Interpretation of a Policy Effect
We now turn to the question of interpreting effect magnitudes. Evaluation of a single effect, or of the difference between two effects, as 'important' or 'worth considering' is a major issue from the decision-maker's point of view. When the outcome is measured on a ratio scale, the effect Δ(Pj) can be expressed as a percentage of the outcome under the control condition, O|C. The resulting π provides, therefore, a meaningful and directly interpretable measure of the effect magnitude. However, most relevant outcomes in education are not of this type. Frequently, the outcome is defined in terms of some psycho-educational variable Y, usually measured on an interval scale. Consequently, π cannot be used to interpret a policy effect defined in relation to such a variable. Moreover, since the scales used for the measurement of psycho-educational variables have an arbitrary and usually idiosyncratic unit, which varies from scale to scale for the same attribute, the scale scores, and consequently the differences between scores, have no inherent meaning (Angoff, 1984; Hedges & Olkin, 1985). Referring to this issue, Cook and Campbell (1979) write: "When the scale on which the outcome variable is measured has some commonsense referent, effects can be expressed in terms of the treatment causing, say, an average increase in income of $1,000 per annum per person, or a reduction in prison recidivism of 20% over two years. With other scales, magnitude estimates are more difficult to interpret, and one wonders what an average treatment effect of five points on some academic achievement test means." (p. 40).
One solution to this problem is the standardization of the effect Δ(P) by the standard deviation σ_Y|C of the Y scores under the control condition, i.e. the computation of the effect size Δ. However, while Δ solves the problem of the comparability of effects between outcomes expressed on different metrics, it does not solve the problem of the interpretability of a single effect or of the difference between two effects. Unfortunately, standard deviations and the associated Z scores (i.e. Δ values) are not intuitively meaningful, and both the evaluator and the decision-maker may wonder what an effect size, or a difference between two effect sizes, of, say, 0.5 standard deviation means. Moreover, since |Δ| has no natural maximum, one cannot evaluate it as a proportion of |Δ|max. Consequently, the question concerning the absolute magnitude of the effect size, its 'importance' or 'worth consideringness', has no objective answer. Indeed, the evaluation of |Δ| is usually in terms of completely arbitrary criteria. For example, according to Wolf (1984), educational methodologists generally regard an effect of 0.3 standard deviations or greater as indicative of a meaningful difference. According to Cohen (1977), on the other hand, effect sizes between 0.2 and 0.5 standard deviations should be considered as small, effect sizes between 0.5 and 0.8 as medium, and only effect sizes greater than 0.8 may be regarded as large. These differences in opinion are rightly related by Wolf (1984) to Cohen's background in psychology, where, unlike in education, the control condition involves no treatment rather than a different treatment (usually the incumbent one). However, it should be emphasized that even if the criteria for the evaluation of the magnitude of the |Δ| values were consensual, they would nevertheless always be arbitrary. Two other solutions to the problem of the evaluation of the magnitude of a policy effect can be conceived of, each involving a different transformation of the original raw Y scores.
Therefore, these solutions are conceptually analogous to the definition of the effect size (which implies a Z score transformation). The first approach (Husén & Postlethwaite, 1985; see also Kraemer & Andrews, 1982; Kraemer, 1984) consists of two steps: (1) the transformation of the Y scores under the control condition into percentiles; and (2) the definition of the policy effect in terms of the difference between two percentiles: (a) the percentile corresponding to the median of the Y scores under Pj in the distribution of Y under the control condition; and (b) the percentile corresponding to the median of the Y scores under the control condition, which by definition equals 50 (see Figure 7.2).
Figure 7.2. The definition of the policy effect D(Pj) in terms of the difference between two percentiles. (The figure marks Med(Y|C) and Med(Y|P) on the distribution of Y under the control condition.)
Therefore, according to this approach, the policy effect, denoted by D(P), is expressed in terms of the percentage of Y|C values lying between the median of Y|C and that of Y|P. Obviously, percentages are more intuitively meaningful than standard deviation units. However, unless both conditional distributions are symmetrical, this implies a different definition of a policy effect. On the other hand, if the distribution of Y|Pj is symmetrical and the distribution of Y|C is normal, then Med(Y|C) = μ_Y|C, Med(Y|Pj) = μ_Y|Pj and the corresponding percentiles equal the normal probabilities. In this case, the difference D = Φ_C(μ_Y|Pj) − Φ_C(μ_Y|C) = Φ_C(μ_Y|Pj) − 0.50, where Φ_C is the cumulative (standard) normal distribution function of the Y|C distribution, provides an intuitively meaningful interpretation of the effect size Δ (Glass, 1976; Wolf, 1984; Hedges & Olkin, 1985). Note, however, that the relation between D and Δ is not linear. Therefore, whereas D provides (under certain assumptions) a meaningful interpretation for a single effect size, differences between effect sizes cannot be meaningfully interpreted in terms of the differences between the corresponding D values. A conceptually similar, though technically different, approach to the interpretation of a policy effect involves the dichotomous transformation of both the Y|C and Y|Pj scores with respect to the median of the Y|C scores. The subsequent definition of the treatment effect is in terms of the difference between the conditional means of the resulting binary scores. Denoting the dichotomous Y scores by Y*, where Y* = 1 (when Y > Med(Y|C)) or Y* = 0 (when Y ≤ Med(Y|C)), one
can interpret D* as the difference between the conditional proportions of 'high' scores. The main shortcoming of both D and D* is their upper bound (50 and 0.50, respectively), which limits their sensitivity to possible differences between conditional Y distributions that are equal in terms of D and D*. This is demonstrated in Table 7.2, where the three alternative policies have equal D and D* values, even though the conditional Y distributions are different.
Table 7.2
The Conditional Distributions of Y Values for Three Alternative Policies, Each Yielding the Same D and D* Values: A Numerical Example

        Y|C    Y|P1    Y|P2    Y|P3
        1      4       5       6
        2      5       6       7
        3      6       7       8
        4      7       8       9
        5      8       9       10
        6      9       10      11

D              50      50      50
D*             0.50    0.50    0.50
This deficiency of D* is apparently corrected by Rosenthal and Rubin's (1982) binomial effect size display (BESD), where scores are dichotomized using the median of the super-distribution of Y scores, resulting from the pooling of the conditional distributions of Y|C and Y|P. Thus, for example, the BESD values corresponding to the three alternative policies in Table 7.2 are 0.50, 0.67 and 0.83, respectively. Note, however, that these values refer to different medians (5, 5.5 and 6, for P1, P2 and P3, respectively). The conceptual difference between the approaches underlying the definition of D* and that of BESD parallels the distinction between Δ and ω. While BESD can be meaningfully interpreted if there are only two policies, P and C, it cannot be used in comparative evaluation studies, where there are several alternative policies. For these kinds of studies, D and D* provide the only meaningful nonparametric definitions of effect magnitude. In addition to their ease of interpretation, these definitions are preferable to Δ also because they are not affected by measurement error.
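To show how the three measures behave on the made-up scores of Table 7.2, the sketch below computes D, D* and a BESD-style difference in proportions after dichotomizing at the pooled median (my reading of the procedure described above; the helper names are mine).

```python
# Nonparametric effect measures for the Table 7.2 example.
# D  : percentage of Y|C values lying between Med(Y|C) and Med(Y|P)
# D* : difference in proportions of scores above Med(Y|C)
# BESD-style value: difference in proportions above the pooled (super-distribution) median

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

def d(control, treated):
    lo, hi = sorted([median(control), median(treated)])
    return 100.0 * sum(lo < y < hi for y in control) / len(control)

def d_star(control, treated):
    cut = median(control)
    return (sum(y > cut for y in treated) / len(treated)
            - sum(y > cut for y in control) / len(control))

def besd_like(control, treated):
    cut = median(control + treated)   # each comparison uses its own pooled median
    return (sum(y > cut for y in treated) / len(treated)
            - sum(y > cut for y in control) / len(control))

C = [1, 2, 3, 4, 5, 6]
for P in ([4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10], [6, 7, 8, 9, 10, 11]):
    print(d(C, P), d_star(C, P), round(besd_like(C, P), 2))
# D and D* equal 50 and 0.50 for all three policies, while the pooled-median
# differences are 0.50, 0.67 and 0.83, each referring to a different pooled
# median (5, 5.5 and 6).
```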
References

Angoff, W. H. (1984). Scales, norms and equivalent scores. Princeton, NJ: Educational Testing Service.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Glass, G. V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5, 3-8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage Publications.
Hays, W. L. (1963). Statistics for psychologists. New York: Holt, Rinehart & Winston.
Hays, W. L. (1973). Statistics for the social sciences (2nd ed.). New York: Holt, Rinehart & Winston.
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators of effect size in meta-analysis. Psychological Bulletin, 96(3), 573-580.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando: Academic Press.
Husén, T., & Postlethwaite, T. N. (1985). Synthesis of teaching effectiveness research. In T. Husén & T. N. Postlethwaite (Eds.), The international encyclopedia of education (Vol. 9, pp. 5101-5119). Oxford: Pergamon Press.
Kraemer, H. C. (1984). Nonparametric effect size estimation: a reply. Psychological Bulletin, 96(3), 569-572.
Kraemer, H. C., & Andrews, G. (1982). A nonparametric technique for meta-analysis effect size calculation. Psychological Bulletin, 91(2), 404-412.
Ragosta, M., Holland, P., & Jamison, D. (1982). Computer-assisted instruction and compensatory education: The ETS/LAUSD study (Final report to the National Institute of Education). Princeton, NJ: Educational Testing Service.
Rosenthal, R., & Rosnow, R. L. (1984). Essentials of behavioral research: Methods and data analysis. New York: McGraw-Hill.
Rosenthal, R., & Rubin, D. B. (1982). A simple general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
Wolf, R. M. (1984). Evaluation in education: Foundations of competency assessment and program review (2nd ed.). New York: Praeger.
CHAPTER 8

BEYOND DECISION-ORIENTED EVALUATION

JAAP SCHEERENS

Foundation for Educational Research in the Netherlands, P.O. Box 19050, 2500 CB The Hague, The Netherlands
Abstract

This article discusses the value and limitations of conceptualizing evaluation as a decision-oriented research and appraisal activity. After 'decision-making' has been related to the evaluation rationale presented in Chapter 2 of this issue, various models and views on the relation between evaluation and decision-making are used to illustrate the theme. The procedural and structural intricacies of evaluation research, in the context of political and administrative decision-making, blur the clear logic of the rational model. Yet 'rational reconstruction' is seen as the evaluation worker's basic tool for increasing the relevance of his work. 'Beyond decision-oriented evaluation' thus expresses the view that, although the issue of the relevance of evaluations should not be neglected, researchers cannot sit back and wait for decision-makers to formulate their questions before designing and reporting on evaluations. Even if the decision-making context is seen as an important basis for designing evaluations, the evaluator will still be required to play an active part in structuring this context.
Introduction

The idea that evaluations should eventually support administrative or political decision-making can almost be considered as a defining characteristic of the evaluation discipline. This is apparent from classical texts such as Suchman's (Suchman, 1967), Stufflebeam's well-known book on educational evaluation and decision-making (Stufflebeam et al., 1971) and Campbell's notion of "reforms as experiments" (Campbell, 1969). But it also applies to the more recent literature on utilization-focused evaluation (cf. Alkin et al., 1979; Patton, 1978). To express the centrality of the concept of "evaluation for decision-making" we could relate it to the evaluation rationale presented in Chapter 2 of this issue. In doing so we might simply add a phase called "deciding on future policy and action" to the diagram in Figure 1.2, and place this label outside the ellipse that contains the program to be evaluated (see Figure 8.1). The implication of this visualization of the relation between the basic need to which a program responds, evaluation and decision-making would be that decision-making should only take place after a program had been evaluated in terms of
Figure 8.1. Adaptation of Figure 1.2. 'Decision-making' has been added and the term 'learning experiences' replaced by the more general term 'program as implemented'. (The diagram links 'Need', 'Program as implemented' and 'Deciding on future policy and action'.)
whether it had provided an effective response to the need. Notwithstanding the inherent logic of this point of view, in reality decision-making tends to creep in at earlier stages of program-formation and implementation. I have therefore included 'formative decision-making' within the ellipse and indicated, with arrows, that this kind of decision-making may lead to modification of objectives and/or program implementation. The distinction between formative and 'ultimate' decision-making is related to other issues basic to different viewpoints on the relation between evaluation and decision-making. One of these issues concerns the strength of the evidence evaluations can offer to decision-makers. Another has to do with preconceptions about the nature of organizational decision-making processes, e.g. whether we can expect decision-makers to hold their hand until a program has been evaluated in terms of how far it has met the basic needs for which it was designed. In this chapter the principal ways of looking at the relation between evaluation and decision-making will be discussed. This will be done by examining the 'decisional contexts' of evaluation projects in terms of a comparison between two models of organizational decision-making, the rational and the incremental model, and by analyzing certain remedies that have been prescribed for the problems arising when an attempt is made to enhance the 'decisional relevance' of evaluation. In the process, the concept of decision-oriented evaluation will be widened while at the same time its prescriptive usefulness will be regarded in an increasingly relative light.
The Rational Model

When organizational decision-making is seen as a deliberate choice between clear-cut alternatives that have to be assessed in terms of their effect on pre-established goals,
evaluation has a 'natural place' in the process. The core task of evaluation will then be to discover, by empirical testing, which alternative is most effective. The sequence of activities corresponds to the problem solving model: statement of the problem; development of feasible courses of action (e.g. program variants); evaluation of the alternatives; and choice of the optimum or 'most satisfying' solution. The most likely research designs in such a sequence of activities would be experimental or quasi-experimental (Campbell, 1969). Although this model of organizational decision-making is often referred to as the rational model, it should be noted that in reality it is only a weaker modification of the 'pure rationality' model (Dror, 1968). The pure rationality model assumes full information on all available alternatives, all desirable end-states and on the functions that characterize the relation between actions and states. In the problem solving model of organizational decision-making probably only a few feasible alternatives will be considered, so that in actual practice the decision-making process will be more like Simon's 'bounded rationality' model (Simon, 1946), but here we shall follow common usage and continue to speak of the rational model. In this section the assumptions underlying evaluation within the context of rational decision-making will be examined; first by summarizing the characteristics of the decision model itself and then by looking at how evaluation functions within this model.
Clear statement of goals. The assumption is that goals are stated in advance, preferably specified as operational objectives.
Distinction of goals and means. The means are supposed to be devised after the goals have been declared.
Causal hypotheses. There is supposed to be some kind of causal theory about means-to-end relations. The idea is that this theory is comprehensive, and that every relevant factor can be taken into account.
A long-term perspective. There is a strong belief in long-term overall planning.
Relatively stable programs. The causal theory must be empirically tested. The test of a good program or policy is that it can be shown to comprise the best means of reaching the desired ends. One implication of this assumption is that program implementation is not seen as a major problem. An aspect of the rational model is its optimistic view of the ability and the willingness of the various parties concerned (i.e. teachers) to carry out programs as they are planned. Another implication is that a certain stability is presupposed during the 'try-out' of program variants; goals and means are assumed not to change during the course of the program.
Identifiability of decision-makers. The rational model usually assumes a single group of people who are authorized to make decisions, in spite of the fact that there are multi-actor variants of the rational model, e.g. game theory.
The well-structured and coherent nature of the rational decision-making context, as well as its long-term perspective, can be seen to have the following consequences for evaluations:
(i) Evaluation as a means-to-end analysis. The core evaluation question is whether the means (program variants) are effective. This means that there will usually be an emphasis on performance measures (product evaluation) and that experimental, quasi-experimental or non-experimental causal designs will predominate.
(ii) A 'safe place' for program evaluators.
The assumption is that evaluation researchers will be able to carry out their tasks relatively independently, according to professional norms. Evaluation itself is seen as a rational and impartial activity and not as 'political'.
(iii) Large-scale evaluation studies. The most likely responses to comprehensive, long-term plans are large-scale evaluations.
(iv) Use follows evaluation in a linear sequence. The idea is that decision-making takes place after the evaluation results have been made public. According to the rational ideal, the evaluation results will speak for themselves and will be the major basis for administrative decision-making.
The inevitable danger with characterizations such as these is that they tend to be overstylized and to emphasize extremes. This means that they make ideal material for setting up straw men who can then be destroyed with great ease as soon as a better alternative model is presented. Therefore, it should be emphasized here that there are all kinds of modifications of the model depicted above. In fact one could even see rationalism and incrementalism (the model to be discussed in the next section) as discrete points on several continua (cf. Scheerens, 1985a). The major criticism of the rational model has arisen out of meta-evaluations of large-scale evaluation projects and empirical and analytical studies on the use of evaluation results (cf. Caplan, 1982; Weiss, 1982, 1985; Cronbach et al., 1980; Cronbach, 1982; Lindblom & Cohen, 1979; Patton, 1978; Alkin et al., 1979; Berke, 1983; DeYoung & Conner, 1982). Discussion of these criticisms will be postponed until the alternative model that has emerged from the work of these authors has been presented in the next section. However, an important distinction for the ensuing discussion should be mentioned here. There are two ways of looking at the rational model: as a model appropriate to describing organizational decision-making or as an ideal, i.e. a model of mainly prescriptive value. It will be argued that even if there is justifiable criticism of its descriptive validity, the rational model is nevertheless of great importance for prescriptive purposes.
The Incremental Model
In many practical situations the assumptions underlying the rational model do not hold. Organizational decision-making frequently fails to follow the clear-cut logic of articulated means-to-end analysis. The alternative model, in which Lindblom tried to capture 'real' organizational decision-making, emphasizes a different formal structure for the decision-making process, summarized in terms like "successive limited comparisons" or "muddling through". Other studies of political and administrative decision-making have focused on the context of decision-making and the organizational background of decision-makers (cf. Allison, 1971). In many cases the official program goals - if they are clear at all - are mixed with other motives, like the pursuit of political careers and the self-maintenance or 'imperialism' of organizational units. Since the recognition that there are often more interests at play, and that they are related to the organizational affiliation of the parties concerned, fits in well with the formal characteristics of pluralistic, 'small step' decision-making, these two aspects will be dealt with together in the interpretation of 'incrementalism' used here. As in the preceding section, the main characteristics of the incremental decision-making model will be summarized, followed by an overview of the way in which this kind of organizational decision-making may affect evaluation. To facilitate comparison of the two models, the same categories will be used in the same order.
Ambiguous goal statements. 'Overall' goals, if stated at all, are formulated in vague terms. It should also be asked whose goals these are, since it is recognized that there are
usually several parties with disparate interests in a program. Vagueness of overall goals is often seen as a political advantage, since it allows politicians to dissemble differences among their own adherents and because noncommitment leaves them more freedom of action.
Goals and means cannot be distinguished. Means may function as ends in themselves. According to the incremental model, selection of means may precede the attribution of goals.
Interpolation of the past. Since overall means-to-end planning is seen as unrealistic, the incremental model assumes that progress consists of 'small steps', in which the history of an organization plays an important role. It is also recognized that proceedings will vary across sites. The consequence is that, even when large-scale programs are envisaged, they will present a fragmented pattern of local divergences. This will have obvious drawbacks for overall causal analysis.
A short-time perspective. Incrementalism is strongly associated with a conservative outlook and a distrust of detailed overall planning. Programs are seen as fuzzy and fleeting. Incrementalism places no reliance on the idea that a program might be carried out according to plan. In the unlikely event of a general overall plan, it will be recognized that there are still many different ways of implementing it. In terms of a well-known distinction culled from the educational innovation literature, an incremental decisional context is associated more with a "mutual adjustment perspective" than with a "fidelity perspective" (Fullan & Pomfret, 1977). It is also recognized that the goals and political priority of a program may change while it is being carried out.
Diffuse decision-making. It is seen as uncertain which decisions will be taken and even by whom (cf. Weiss, 1982); different stakeholders are confidently expected to draw their own conclusions from the evaluation results.
Although the question of how evaluation can function within a context of incremental decision-making will be dealt with in more detail in subsequent sections of this article, some of the main observations that have been made in the literature will be summarized below:
(i) Evaluation as descriptive research. Because of the lack of articulation of means-to-end relations, the changing character of programs, local variations and the short-time perspective, evaluation as overall means-to-end analysis will usually seem impossible (Cronbach et al., 1980). Instead, the emphasis will tend to be on descriptive case-studies of local implementation variants. The 'qualitative' movement in the American evaluation literature coincides with this view of the nature of decision-making and innovation processes (Weiss & Rein, 1971; Berman & McLaughlin, 1976).
(ii) Evaluators as pawns in a political game. From the incremental perspective, the differences in interest between evaluators and practitioners or site workers are recognized (cf. Caro, 1971; DeYoung & Conner, 1982; Cohen, 1983). Sometimes evaluators only obtain access to data after lengthy negotiations with practitioners (Scheerens, 1985b). Other stakeholders too may try to influence the way evaluations are carried out in order to get their view of the program accepted, or to protect their interests. Some authors even state that current evaluation approaches are bound to be politically biased (cf. Ross & Cronbach, 1976; Berk & Rossi, 1976; Campbell & Erlebacher, 1972).
(iii) Several small-scale studies. The lack of coherence in programs makes overall designs
risky affairs. Cronbach and coworkers (1980) in this respect strongly condemn "blockbuster studies" and advocate a number of smaller studies done by different research teams.
(iv) Use of evaluation results as a gradual process of enlightenment. According to the incremental model, organizational decision-making is not a matter of making all-or-none decisions, but rather a slow process of "accretion" (Weiss, 1980). Nor can evaluation results be expected to have the dramatic effect of forcing such important comprehensive decisions. According to Patton, "research impacts in ripples, not in waves" (Patton, 1978). Weiss and Caplan conclude from their studies of knowledge use that evaluation results may gradually change the way societal problems are conceptualized in decision-making communities (Caplan, 1982; Weiss, 1982). They refer to this view on the use of policy-research results as the 'enlightenment' model. It is recognized that evaluation results will simply number among the information sources used by decision-makers. This accounts for the many instances of program continuation after negative evaluation findings (and the opposite).
After comparing these two models of organizational decision-making, and their most likely consequences for the 'position' of evaluation, the most interesting question seems to be to find realistic ways of saving as much as possible of the rational ideal, while recognizing at the same time that the political and organizational context will be more or less in line with the incremental model. This question will be addressed in the next section, by comparing several solutions, two of which can be seen as more or less 'succumbing' to incrementalism, while the other two in a sense try to resurrect the rational model.
Approaches that Seek to Enhance the Relevance of Evaluations
The four approaches to be discussed in this section are: utilization-focused evaluation, stakeholder-based evaluation, evaluation according to a betting model, and the idea of rational reconstruction of diffuse decision-making contexts.
Utilization-Focused Evaluation
The term "utilization-focused evaluation" was coined by Patton (Patton, 1978). On the basis of a study on the actual use of evaluation results, Patton formulated a set of practical recommendations for evaluators. Alkin et al. (1979) took a similar approach. According to Patton, the main factors enhancing utilization are as follows:
(i) Evaluators should try to identify decision-makers. It is considered of great importance for the evaluator to know the decision-makers personally.
(ii) The evaluator should try actively to identify and focus on the relevant evaluation question. In doing so he may try to commit decision-makers to evaluation questions, for instance by means of public statements in the press.
(iii) Evaluators should have an 'active/reactive/adaptive' attitude; they should be as flexible as possible in the face of changes in program goals and political priorities.
(iv) Evaluations should not be preoccupied with quantitative outcome measures and experimental designs. Instead, a lot of attention should be given to the description of program variants and implementation processes. Qualitative or 'naturalistic'
methods can play an important role.
(v) Evaluators should try to construct the program's theory of action in terms of means-to-end relationships.
(vi) Evaluators should stimulate the participation of decision-makers and practitioners in methodological choices so that they "understand the strengths and weaknesses of the data - and believe in the data" (Patton, 1978, p. 202). Decision-makers should also be actively involved in the way the evaluation findings are reported.
Both Patton and Alkin concluded from their research that decision-makers do not see research quality as an important factor when they decide whether or not to use evaluation findings. Some of the findings and recommendations of Patton and Alkin are refuted by other investigations of knowledge use. Weiss's (1980) conclusion that it was hardly ever possible to identify decision-makers stands in strong contrast to Patton's focus on the evaluator's getting to know the decision-makers personally. And Patton's and Alkin's finding that decision-makers do not pay much attention to research quality is contradicted by Weiss and Bucuvalas (1980), who found that decision-makers do scrutinize research quality when they do not like the evaluation outcomes.
Stakeholder-Based Evaluation
The stakeholder approach to evaluation not only recognizes the fact that there are different parties with an interest in research outcomes but also tries to make active use of this phenomenon. The idea is that giving the various parties more of a proprietary feeling for the evaluation process and its outcomes will increase the chances of the evaluation results being used. Two groups of stakeholders are recognized: those involved in decision-making and those otherwise affected by the evaluation (e.g. teachers, pupils, parents). The assumptions of the stakeholder approach are critically examined by Weiss (1983) and Cohen (1983). Weiss questions the assumption that stakeholders have specific information needs and that evaluators have the skills to respond adequately to these (possibly divergent) needs. According to Cohen, the orientations of contractors and evaluators on the one hand and site-workers on the other cannot be reconciled and integrated in one evaluation study. He also observes that power is unequally divided between stakeholders (to the disadvantage of site-workers). Cohen concludes that it would be better if each stakeholder had his own evaluation study. I think that the relativity inherent in the way these approaches (both utilization-focused and stakeholder-based evaluation) regard research quality and technique is a serious point for discussion. Both put professional evaluators, as far as research-technical know-how and objectivity are concerned, on the same level as directly interested parties with no formal training in social science research methodology. Although this is no plea for putting evaluators on pedestals, I think these approaches overemphasize the degree to which evaluation can be 'participatory'. Besides, there remains the obvious point that invalid and unreliable results would be useless even if all the stakeholders were happy.
Evaluation According to a Betting Model

Assuming that different stakeholders tend to have conflicting expectations of the out-
comes of a program, Hofstee (1985) applies an approach to evaluation which he calls a betting model. According to this model, stakeholders indicate in advance which outcomes they expect and also the probability of the occurrence of the expected outcomes. After thus having specified predictive probability distributions (the ‘bets’ of the stakeholders) the outcomes are measured and finally the scores are compared with the expectations. This is done by applying a logistic scoring-rule which results in the establishment of the degree to which each of the betting parties has been right in his or her predictions. An important assumption relied on by the betting model is that stakeholders can reach agreement on an outcome operationalization. Although this may be seen as the Achilles-heel of the approach, the model has been successfully applied in the field of curriculum evaluation, with commercial educational publishers as the betting parties (Van den Berg, 1985).
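Purely as an illustration (the particular logistic scoring rule Hofstee uses is not specified here, so a standard logarithmic proper scoring rule stands in for it, and the stakeholder names and probabilities are invented), a scoring step might look like this:

```python
import math

# Hypothetical bets: each stakeholder states the probability that an agreed,
# dichotomized outcome criterion will be met.
bets = {"publisher_A": 0.8, "publisher_B": 0.3}

outcome_met = True  # the measured result after the program has run

def log_score(p, outcome):
    # Logarithmic proper scoring rule: higher (closer to 0) is better.
    return math.log(p if outcome else 1.0 - p)

scores = {party: log_score(p, outcome_met) for party, p in bets.items()}
print(scores)  # publisher_A placed more probability on the realized outcome,
               # so its score is higher (less negative)
```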
Active 'Rational Reconstruction' by Evaluators
A basic stipulation in decision-oriented and utilization-focused evaluation is that evaluators should appeal to others for guidance in establishing the structure of an evaluation project. The obvious reason for doing so is to increase the relevance of evaluations, by committing the parties concerned to the evaluation as much as possible. Yet studies of evaluation use seem to show that evaluators can only expect limited help from external sources. It is often impossible to identify an authoritative decision-maker at all. Instead, for every program there are usually several parties whose interests in the program and its evaluation often conflict. If there is no-one 'out there' waiting for evaluation results in order to make 'important' decisions, and evaluation use consists of the gradual percolation of knowledge and the gradual shaping of conceptual outlooks, then evaluators would seem all the more bound to choose a more active approach to designing program evaluation. Hofstee's betting model is a good example, because the external stakeholders participate according to a set of rather strict rules already established by the evaluator. In a more general approach evaluators would use whatever information was available - including stakeholders' viewpoints - to make rational reconstructions of fuzzy programs and diffuse decision-making contexts (e.g. see Rossi & McLaughlin, 1979). The focus should be on the causal structure relating program variants to desired outcomes. In choosing dimensions for outcome measures, evaluators should look for those criteria on which the expectations of stakeholders differ most. If evaluations succeed in providing empirical evidence on controversial issues, it is hard to imagine politicians failing to make use of it in ensuing debate.
Conclusion: Beyond Decision-Oriented Evaluation
Decision-oriented evaluation is to be seen as an approach that attempts to increase the relevance of evaluations by taking into account the context and structure of the administrative or political decision-making to which the evaluation results are assumed to contribute. Utilization-focused and ‘stakeholder-based’ evaluation go a long way to attaching evaluation design more closely to the decision-making context. Since decision-making contexts are often diffuse and it is hard to reconcile the demands of all stakeholders, this road reaches its natural end when evaluators find themselves forced to make their own final design-choices anyway. Also, letting stakeholders participate in every aspect of the structur-
ing of evaluations, including the technical aspects, may well lead to the loss of a vital prerequisite of evaluation use: research quality. This realization implies that decision-oriented evaluation is to be looked upon with a certain amount of relativity. In approaches like the 'betting model', and in the idea that evaluators themselves should rationally structure diffuse decision-making contexts, we can discern a counter-movement, emphasizing a more active role for evaluators. But such a movement does not simply take us back to square one, with scientific evaluators in ivory towers neglecting the external relevance of their work, for both approaches start by gathering information on the decisional context. There are two more reasons for thinking beyond decision-oriented evaluation. The first reason rests on the assumption that if we succeed in providing information on the ultimate effectiveness of a program, i.e. when it is established whether the program has succeeded in fulfilling the basic needs for which it was designed and implemented, it will be difficult for anyone to ignore the evaluation data. Although we can never be absolutely certain that decision-makers use evaluators' recommendations, we may at least entertain a more optimistic view if the ultimate effectiveness of a program has been assessed. The implication is that it should not be overlooked that there are other 'internal' criteria for increasing evaluation relevance besides adaptation to the decision-making context. The second reason for going beyond the concept of decision-oriented evaluation is of a different nature. Rather than expressing the relativity of the concept, it has to do with enlarging it to a more comprehensive conceptualization of the contextual analysis of evaluation projects (Scheerens, 1985b). According to this enlarged concept of contextual analysis, organizational and institutional arrangements should be taken into account too. For instance, the degree to which evaluators can cooperate with decision-makers will depend on the structure of the interorganizational network in which the various parties are located. The influence of practitioners and site-workers on evaluations depends on the autonomy of schools within a district or within a national educational system. And the independence of researchers willing to play an active role in structuring evaluations depends on their institutional location, the resources available and the management philosophy of the research contractor. In the author's opinion such organizational issues should be included in the very first assessment of feasibility for any evaluation program, in order to find the optimum balance between evaluation aspirations and the contextual intricacies of a particular setting.
References

Alkin, M. C., Daillak, R., & White, P. (1979). Using evaluation. Beverly Hills: Sage.
Allison, G. T. (1971). Essence of decision. Boston: Little, Brown & Co.
Berg, G. v. d. (1985). Curriculum evaluation as comparative product evaluation. In B. P. M. Creemers (Ed.), Evaluation research in education: Reflections and studies. The Hague: SVO.
Berk, R. A., & Rossi, P. H. (1976). Doing good or worse: evaluation research politically reexamined. Social Problems, 23, 337-349.
Berke, I. P. (1983). Evaluation and incrementalism: the AIR report and ESEA Title VII. Educational Evaluation and Policy Analysis, 5, 249-256.
Berman, P., & McLaughlin, M. W. (1976). Implementation of educational innovation. Educational Forum, 40(3).
Campbell, D. T. (1969). Reforms as experiments. American Psychologist, 24(4). (Reprinted in Struening, E. L., & Guttentag, M. (1975). Handbook of evaluation research. Beverly Hills: Sage.)
Campbell, D. T., & Erlebacher, A. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Compensatory education: a national debate (Vol. III of The disadvantaged child). New York: Brunner/Mazel.
Caplan, N. (1982). Social research and public policy at the national level. In D. B. P. Kallen et al. (Eds.), Social science research and public policy-making: A reappraisal. Windsor, UK: NFER-Nelson.
Caro, F. G. (1971). Readings in evaluation research. New York: Russell Sage Foundation.
Cohen, D. K. (1983). Evaluation and reform. In Stakeholder-based evaluation (New Directions for Program Evaluation no. 17) (pp. 73-81). San Francisco: Jossey-Bass.
Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass.
Cronbach, L. J. et al. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
DeYoung, D. J., & Conner, R. F. (1982). Evaluator preconceptions about organizational decision-making. Evaluation Review, 6, 431-440.
Dror, Y. (1968). Public policy-making reexamined. Scranton, Pennsylvania: Chandler.
Fullan, M., & Pomfret, A. (1977). Research on curriculum and instruction implementation. Review of Educational Research, 47, 335-397.
Hofstee, W. K. B. (1985). A betting model of evaluation research. In B. P. M. Creemers (Ed.), Evaluation research in education: Reflections and studies. The Hague: SVO.
Lindblom, C. E., & Cohen, D. K. (1979). Usable knowledge. New Haven, London: Yale University Press.
Patton, M. Q. (1978). Utilization-focused evaluation. Beverly Hills: Sage.
Ross, L., & Cronbach, L. J. (1976). Handbook of evaluation research review. Educational Researcher, 5, 9-19.
Rossi, R. J., & McLaughlin, D. H. (1979). Establishing evaluation objectives. Evaluation Quarterly, 3, 331-346.
Scheerens, J. (1985a). A systems approach to the analysis and management of large-scale evaluations. Studies in Educational Evaluation, 11, 83-93.
Scheerens, J. (1985b). Contextual influences on evaluations: the case of innovatory programs in Dutch education. Educational Evaluation and Policy Analysis, 7(3).
Simon, H. A. (1945). Administrative behaviour. New York: Macmillan (2nd ed., 1964).
Stufflebeam, D. L., Foley, W. J., Gephart, W. J., Guba, E. G., Hammond, R. L., Merriman, H. O., & Provus, M. M. (1971). Educational evaluation and decision-making in education. Itasca, Ill.: Peacock.
Suchman, E. (1967). Evaluative research. New York: Russell Sage Foundation.
Weiss, C. H. (1975). Improving the linkage between social research and public policy. In L. E. Lynn (Ed.), Knowledge and policy: The uncertain connection (Study Project on Social Research and Development, Vol. 5).
Weiss, C. H. (1980). Knowledge creep and decision accretion. Knowledge: Creation, Diffusion, Utilization, 1(3).
Weiss, C. H. (1982). Policy research in the context of diffuse decision-making. In D. B. P. Kallen et al. (Eds.), Social science research and public policy-making: A reappraisal. Windsor, UK: NFER-Nelson.
Weiss, C. H. (1983). Toward the future of stakeholder approaches in evaluation. In Stakeholder-based evaluation (New Directions for Program Evaluation no. 17) (pp. 83-96). San Francisco: Jossey-Bass.
Weiss, C. H., & Bucuvalas, M. J. (1980). Truth tests and utility tests: decision-makers' frames of reference for social science research. American Sociological Review, 45, 302-313.
Weiss, R. S., & Rein, M. (1971). The evaluation of broad-aim programs: experimental design, its difficulties and an alternative. In F. G. Caro (Ed.), Readings in evaluation research. New York: Russell Sage Foundation.
CHAPTER 9

REPORTING THE RESULTS OF EVALUATION STUDIES

A. HARRY PASSOW

Jacob H. Schiff Professor of Education, Teachers College, Columbia University, New York, New York 10027, U.S.A.
Abstract

One of the most important stages in the conduct of evaluation studies is the reporting process, one which involves communication to various groups and individuals throughout an evaluation study, using a variety of communication means including written reports. Some reporting takes the form of formal, written reports, while other reporting may be quite informal, depending on the nature and intent of the communication. A variety of reports may be prepared, each serving a different function. Among the types of reports are: progress report, final report, technical report, executive summary, and media report. Evaluators must keep in mind that it is a reporting process to which they must attend throughout the conduct of an evaluation study, not simply preparation of a final report.
One of the most important stages in the conduct of evaluation studies is the reporting process. The major purposes of evaluation, purposes which are not mutually exclusive, include: contributing to decisions about program installation, program continuation, and/or program modification; obtaining evidence to rally support either for or in opposition to a program; and contributing to the understanding of basic psychological, social or other processes (Anderson & Ball, 1978). The fulfillment of these purposes depends to a large extent on the nature and quality of the reporting process. Adequate and appropriate communication of findings and recommendations to the various relevant individuals and groups will determine how effective the decision-making processes will be. The operative phrase is 'the reporting process', which is intended to indicate that more than the preparation of a final report is involved. Rather, there must be communication to various groups and individuals throughout the evaluation process, using a variety of communication means. Some reporting is in the form of formal reports; other reporting may be quite informal, depending on the nature and intent of the communication. The organization or agency which initiates and pays for the study will want a formal report. The individuals and groups who are involved in the decisions regarding the program will require various kinds of communications, including a report on the findings and the recommendations for future action. Individuals and groups who are supplying information will want feedback. Thus, as
116
RICHARD
M. WOLF
Anderson and Ball (1978) observe: "It is rather obvious that program evaluation cannot be carried out without some communication among the agents in the process - funding organization, program director, program participants, evaluation staff, and the communities and institutions within which the program is being developed or assessed" (p. 93). The evaluator(s) needs to determine at the outset what the elements of the communication network are in a particular setting: who needs to be informed about what, who needs to be involved at what points and how, and who needs what information in order to participate in decisions about the program as well as to implement and effect these decisions. Communication, of course, is not simply a one-way avenue from evaluator(s) to client(s); unless there is two-way communication, the reporting and the outcomes are likely to be limited. It is essential that the evaluators keep the program staff and others involved in the program informed, providing sufficient information so that there will be an adequate understanding of what the evaluation is all about without invalidating the findings by biasing the participants. Sufficient information is that which will reduce the anxieties of those involved ("The program is being evaluated, not individual teachers," for example), secure the needed cooperation in the data gathering process without biasing the data, and establish trust and confidence in the findings and recommendations. Adequate information is essential to build a sound relationship between the program participants and the evaluators so that all aspects of the evaluation process can be facilitated. Taking Scriven's concept of formative and summative evaluation as a guide, the reporting can also be thought of as related to in-process (formative) and final (summative) communication.
Reporting/Communicating in an Evaluation Study
A look at an evaluation study conducted over a two-year period will illustrate the reporting/communicating process. The description focuses on the reporting and communicating and is not a complete account of the study. With declining enrollments a school system found it necessary to close one of its two junior high schools (grades 7 and 8 - ages 12 and 13) and consolidate the students into one building. The recommendation to take this action was made by the District's Superintendent of Schools who coupled it with the proposal that the new school should be a middle school, rather than a junior high school, with a philosophy, program, structure, staffing and functioning appropriate to a middle school. Other details of the proposal included the following recommendations:

(1) A middle school be organized to include all of the District's seventh and eighth graders, with only the seventh graders to be involved the first year and both seventh and eighth graders to be enrolled beginning with the second year.

(2) Instruction in the basic academic subjects - English, social studies, mathematics, and science - be provided four days per week rather than the traditional five days a week. In addition, a wide variety of so-called 'mods' or ten-week minicourses be provided the fifth day. The four-day academic and one-day mod program was to be scheduled over a six-day cycle.

(3) The students be organized into four teams of approximately 100 pupils in which the basic academic subjects were to be taught by a team of four teachers.

(4) More attention be paid to the range of characteristics and needs of this age group, including additional guidance.

As the Superintendent's proposal was discussed at the formal Board of Education meetings, in the community and in the news media, support for and opposition to the plan began to develop. Those opposed to the new design appeared to focus on features which they considered to represent a move away from concern for high academic standards and for provisions for students of high ability and performance. Those in favor of the new middle school plan saw it as being more responsive to the entire student body through provisions for teams, electives, and new program offerings such as the computer laboratory. After considerable discussion in a variety of forums, a decision was made to proceed with the new program but the vote by the Board of Education was not a unanimous one. Even though there appeared to be general support for the plan as perceived by the Superintendent and the Board of Education members, many issues continued to arouse concern among parents, teachers, other members of the school community, and the Board members themselves. Among the most controversial issues were the four-day academic schedule, the mod program, heterogeneous grouping, scheduling and provisions for gifted students. As approved by the Board of Education, the Superintendent's plan was accepted with two provisions: that there be an evaluation conducted by an external agency or group and that accelerated classes in English and mathematics be organized after the first 10 weeks of school.

Four evaluation teams submitted competitive proposals for undertaking the evaluation. The 'winning' team's proposal was for a two-year evaluation study, focusing on the seventh grade the first year and the seventh and eighth grades during the second year after both grades had been admitted and the two-year middle school program was functioning fully. The evaluation team consisted of two university professors and seven doctoral students the first year and 10 the second year. Using the mandate of the Board of Education for a comprehensive evaluation of the new Middle School for guidance, the evaluation team developed its procedures for gathering data. The major sources of information were teacher and administrative staff (both central office and building) interviews and questionnaires; student interviews, questionnaires, and test data; parent interviews and questionnaires; documents from the Board of Education and the central office related to the Middle School; and observations in the classrooms and the school.

During initial meetings with the members of the staff, the evaluators described the purpose of the study and how it was to be conducted and assured the staff that the study was concentrated on the school as a unit and that individual staff members were not being evaluated. At these initial meetings, it was also pointed out that teacher concerns, reactions, and recommendations would be an important input and would be reflected in the instruments used and the procedures followed in gathering and analyzing the data. Aside from formal faculty meetings at which the evaluators reported on the purpose of the study and how it was to be conducted, the evaluators always spent time communicating informally with staff members in the Faculty Room each time they were at the school.
These informal meetings provided opportunities for the staff members to inquire about what was going on, to make suggestions about the procedures and the data being collected, and to discuss matters raised by their colleagues or the evaluators. These informal sessions proved invaluable in establishing relationships and mutual trust and in helping the evaluators understand the data being collected. They were an important part of the informal reporting process.
The basic focus of the study was on the instructional program, the staff, operational policies and their impact on pupil performance and perceptions. Each of the classrooms in the school was observed on at least one occasion by different members of the evaluation team to obtain a broad picture of the instructional processes. In addition to the observations, interviews were conducted with teachers, students, parents, administrators (including department chairpersons and team leaders), board members, and other members of the community. The interviews provided the basic data from which the questionnaires were constructed. Standard procedures were used in developing, pretesting, revising, and conducting the surveys for all of the groups studied.

During the first year, two sets of interviews were conducted with a small sample of individuals who were selected to represent a broad spectrum of opinions of people who were most knowledgeable about the range of issues involved in the policy decision. These persons were identified on the basis of documents, with an effort made to include individuals who had been outspoken in favor of, and others who had been equally outspoken in opposition to, the decision for a middle school. These individuals were interviewed in the fall and again in the late spring. The second interview schedule was designed to elicit the reactions of the interviewees once the program was well underway and focused on issues which had been identified in the first interviews.

In October, most of the teachers responded to a form with questions dealing with perceived positive aspects of the new middle school program, concerns related to the program, and suggestions for change. After the responses were analyzed, meetings were conducted with each of the four teacher teams to report the findings and to discuss and confirm them. An in-depth comprehensive questionnaire was then designed, using the preliminary survey and the team meeting discussions as its foundation. This questionnaire included items dealing with middle school goals, instructional programs, instructional strategies, teaching styles, administration, school and district policies, ability grouping, decision-making input, scheduling, the academic program, the mod/minicourse program, pupil personnel services, testing, etc. The questionnaire consisted of three parts: (i) 149 items dealing with attitudes and perceptions concerning various aspects of the program; (ii) 30 items dealing with level of satisfaction for each program area; and (iii) 10 demographic items. The responses were analyzed for the staff as a whole, for each of the four teams, and for teachers not assigned to a team or assigned to more than one team. These findings were reported to and discussed by the teachers.

Several questionnaires were administered to the students during the year. In order to assess the transition from the elementary schools to the middle school, a questionnaire was administered to all seventh graders after two months in the school. A standard articulation survey form was modified with the assistance of the guidance counselors and the principal. Students were asked to respond to questions about adjustment to school routines, the curriculum, the teachers and teaching, peer relationships, and feelings about the program generally. These data were analyzed for the seventh grade group as a whole as well as by sex, by team, and by sending elementary school.
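The subgroup breakdowns described above might be tabulated along the following lines; this is only an illustrative sketch in Python, with invented field names and ratings, and is not the evaluation team's actual procedure.

```python
# Illustrative only: hypothetical survey records and a simple subgroup tabulation.
from collections import defaultdict
from statistics import mean

# Each record: (sex, team, sending_school, mean_adjustment_rating) -- all values invented.
responses = [
    ("F", "Team 1", "School A", 3.4),
    ("M", "Team 1", "School B", 2.9),
    ("F", "Team 2", "School A", 3.8),
    ("M", "Team 3", "School C", 3.1),
]

def mean_by(records, key_index):
    """Average the rating within each subgroup defined by one background variable."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[3])
    return {group: round(mean(values), 2) for group, values in groups.items()}

print("By sex:           ", mean_by(responses, 0))
print("By team:          ", mean_by(responses, 1))
print("By sending school:", mean_by(responses, 2))
```

The same kind of tabulation applies to any of the surveys that were broken down by background variables such as sex, team assignment, or sending school.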
When the data were compiled, the findings were reported to the faculty, with special attention given to aspects which were perceived as problematic by the students.

At the beginning of each 10-week period, students selected the mod/minicourse electives in which they wished to participate. A questionnaire was designed to elicit student reactions to the mod/minicourse program in general as well as evaluative judgments of each elective in which the student participated during the 10-week period. Students who participated in more than one mod elective - actually most of the students - completed a separate evaluation for each mod. These questionnaires were analyzed at the end of each 10-week period and a computer printout was provided for the teacher of each mod. This report on the student responses to a mod/minicourse was provided for each teacher at the end of each 10-week quarter. In addition, there were a number of items on the Middle School Student Questionnaire administered toward the end of the school year which elicited student responses to the mod/minicourse program overall.

A Middle School Student Questionnaire was designed to study the perceptions of the students regarding their academic and social self-concepts as well as their opinions regarding the middle school, its teachers and teaching, and the mod/minicourse program. The 103-item questionnaire also contained a section with 48 'curricular areas' aimed at securing student judgments about the middle school experience in general and specific aspects in particular. These questionnaires were analyzed for the total group as well as by sex, by team, and by sending elementary school.

Two tests, the Metropolitan Achievement Test and the Otis-Lennon Mental Ability Test, were administered as part of the District's regular testing program. For this study, the reading and mathematics scores of the Metropolitan Achievement Test were examined together with the mental ability or IQ scores from the Otis-Lennon. The seventh grade students' sixth and seventh grade test scores as well as the eighth grade students' (i.e. those a year ahead and still in the junior high school setting) seventh grade scores on the same achievement and mental ability tests were presented in the form of stanine scores. In addition, these same scores were presented as Stanine Bivariate Distributions, with the achievement scores examined in relation to the measured ability. Thirdly, the seventh and eighth graders' reading and mathematics scores were presented in total population mean stanine scores for comparison. These test data were analyzed to answer the questions as to how the seventh grade middle school students were doing compared to the eighth graders when they were seventh graders and how students were performing in relation to their ability.

A study was also undertaken to determine the amount of student involvement in the learning activities, i.e. the time-on-task. This study aimed at determining whether students were more 'on task' in some subjects than in others and whether 'on task' behavior occurred more frequently at the beginning or at the end of the class period, if either. 'On task' behavior was compared in regular classes and mod/minicourses. Educational activities (lecture, discussion, small group work, individual work, etc.) and organizational settings (working independently, in small groups, or in whole class) were examined as well in relation to time-on-task. All of the regular classes and a sample of the mod/minicourse classes were observed over a two-day span.

Finally, a questionnaire was designed to obtain the opinions and perceptions of the parents most closely associated with the middle school program. All seventh grade parents were included, together with a sample of sixth grade parents from each of the five elementary schools. In addition, all members of the Middle School Committee, an advisory group, were sent the Parent Study Questionnaire.
The 87-item instrument sought parents’ opinions about all aspects of the middle school program and the school’s functioning. The questionnaire also included a demographic section with questions about family background and history and contact with the school. Space was provided for any additional comments. These data were analyzed for the parent group as a whole, by whether they had children in seventh and/or sixth grade, by sending elementary school, and by the team to which their seventh grade child had been assigned.
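The stanine presentation of the achievement and ability scores described earlier can be illustrated with a short sketch. The raw scores below are invented, and the percentile-band conversion shown is the conventional 4-7-12-17-20-17-12-7-4 percent scheme, which may differ in detail from the test publishers' own norms tables.

```python
# A minimal sketch, not the District's actual procedure: converting raw test scores to
# stanines by percentile rank within the group.
from bisect import bisect_right

STANINE_CUMULATIVE_PCT = [4, 11, 23, 40, 60, 77, 89, 96]  # upper percentile bounds of stanines 1..8

def percentile_rank(score, all_scores):
    """Percent of scores strictly below `score`, plus half of the ties (0-100)."""
    below = sum(s < score for s in all_scores)
    ties = sum(s == score for s in all_scores)
    return 100.0 * (below + 0.5 * ties) / len(all_scores)

def stanine(score, all_scores):
    """Stanine 1-9 for `score` relative to the group `all_scores`."""
    return bisect_right(STANINE_CUMULATIVE_PCT, percentile_rank(score, all_scores)) + 1

reading_raw = [12, 15, 18, 22, 22, 25, 27, 30, 31, 35]   # hypothetical raw scores
print([stanine(s, reading_raw) for s in reading_raw])
```

A bivariate stanine distribution of the kind mentioned above would simply cross-tabulate each pupil's achievement stanine against the ability stanine computed the same way.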
Two final reports were prepared: one, an Executive Summary, which summarized all of the findings in relatively non-technical terms and provided a set of eleven recommendations which were aimed at building on the strengths of the program during its first year and at addressing the observed weaknesses and problems. The second report was really a combination final report and technical report in the sense that it contained all of the data analyses for all of the surveys, more than would usually go into a final report. The Executive Summary and Final Report were the basis for a meeting with the Board of Education members and the chief administrators which was scheduled some three weeks after they had received them and had time to study them. A large number of copies of the Executive Summary were then made available to the public with copies of the Final Report placed in the public libraries and school offices as well as the central office. A public meeting was then held at which time the evaluators summarized the findings and recommendations briefly and then responded to questions from those parents in attendance. Finally, the evaluators met with the central administrators, the building administrator, and the middle school staff on a number of occasions to discuss the findings and recommendations and help plan the second year of the school's operation during which the school would have its full student complement of seventh and eighth graders.

In conducting the second year evaluation of the implementation, the focus was once again on the question of how well the school was working in terms of student outcomes as well as the quality of life at the school. In appraising the latter, the evaluators were concerned with whether a climate for learning had been created "which nurtured both cognitive and affective growth for a very special group of young adolescents". The school had become a single unit housing all of the District's seventh and eighth graders with a complete middle school staff and a single administration. One of the factors which influenced data collection and the interpretation of the results was the knowledge of impending program changes required by the state education department, changes which were to be implemented in the fall of the following year. Some features of the original middle school program would have to be modified to meet the state's new mandates. Consequently, the evaluators paid particular attention to perceptions concerning how the mandated changes would influence the philosophy on which the middle school program was based as well as elements of the program's operations.

Almost all of the same studies were repeated - student adjustment to the middle school, student perceptions of the middle school experience, student assessment of the mod/minicourse program (which had been modified considerably based on evaluator recommendations), reading and mathematics achievement, time-on-task observations, teacher perceptions and opinions, and parent and community survey. While the same studies were conducted, the same instruments were not used. The questionnaires were revised to take into account the fact that there were now seventh and eighth graders in the school, that the number of teachers had increased because of the two-grade complement, that program changes had occurred, etc. Individual interviews were conducted with all of the teachers and administrators and these provided the basis for the teacher questionnaire.
Frequent visits to the school by evaluation team members provided many informal contacts with school and District personnel and made possible observations of the day-to-day activities in the school. Analyses of the student responses on the questionnaire were done for the total group, by seventh and eighth grade team assignments, by sex and by sending school. The teacher interviews provided the basis for developing a questionnaire which was analyzed in terms of total staff response, grades taught (i.e. seventh grade, eighth grade, or both) and by years of service at the Middle School (i.e. one or two years). The parent questionnaire was designed to obtain the opinions of all parents whose children were enrolled in the school. Except for five items, the same parent survey was used for parents of both grades. Where anchor items were used from the previous year, comparisons were presented. The data were presented by means for the first year, means for seventh grade parents for the second year, means for eighth grade parents for the second year, and totals for both grades for the second year. Interim reports were presented to the staff and administrative personnel as findings became available. In addition, an Executive Summary and Final Report were prepared and presentations made to the Board of Education and school administrators at a closed meeting, to the public at an open Board meeting, and to the staff and administrators after the Executive Summary and full Final Report had been made available earlier, as in the first year. Ten recommendations were included in the second year final report.
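The anchor-item comparisons just described, with means reported by parent group and year, might be computed along these lines; the ratings and group labels are invented for illustration and do not reproduce the evaluators' actual analysis.

```python
# Hypothetical sketch: the same survey item is asked in both years, and means are
# reported for each parent group and year so that anchor-item comparisons can be made.
from statistics import mean

# (year, parent_group, rating) -- all values invented.
anchor_item_ratings = [
    (1, "grade 7 parents", 3.1), (1, "grade 7 parents", 3.5),
    (2, "grade 7 parents", 3.6), (2, "grade 7 parents", 3.8),
    (2, "grade 8 parents", 3.2), (2, "grade 8 parents", 3.4),
]

def group_means(data):
    """Mean rating for each (year, parent group) combination."""
    keys = sorted({(y, g) for y, g, _ in data})
    return {k: round(mean(r for y, g, r in data if (y, g) == k), 2) for k in keys}

for (year, group), m in group_means(anchor_item_ratings).items():
    print(f"Year {year}, {group}: mean = {m}")
# Second-year totals across both grades could be added in the same way.
```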
Reports and the Reporting Process
As can be seen from the above illustration of a two-year evaluation study, reporting is an ongoing process, not simply a final report which is issued at the end of a study. Since the major purpose of evaluation is to assist decision-makers to make decisions based on better information, communication/reporting is an integral part of the evaluation process throughout. A good deal of reporting, particularly that which occurs early on, is fairly informal. An evaluator must ascertain what the information needs are of the different audiences who are involved and use various means of communicating to those audiences. Progress, or in-process, reporting will be quite different from summative or final reporting.

Progress reports, as Wolf (1984) points out, "usually include a relatively short summary of activities engaged in during a particular time period, a preview of upcoming activities, a statement of problems encountered and/or resolved and, possibly a brief statement of preliminary findings" (pp. 200-201). Progress reports are used to keep those involved in the program informed about the way the evaluation is proceeding, to provide a check on the accuracy of data being collected and its interpretation, and to provide an indication of additional information needed and how it can be collected. Interim findings can sometimes be used by decision-makers to initiate program changes, as part of the formative evaluation process.

The final report includes the results of the study, the conclusions drawn from the findings about the program, and recommendations regarding the future of the program. The final report may include an executive summary and a technical report or these may be separate documents. The executive summary is usually a relatively short, concise document which summarizes the major findings, the conclusions and the recommendations which emerge. The executive summary is written in non-technical language for wide dissemination to individuals who need to be informed but want only the essence of the findings and recommendations, not the details on which the conclusions and recommendations are based.

The technical report may be issued as a separate report or as an appendix in the final report. The technical report contains the tables, the statistical analyses, and even the data-collecting instruments and procedures. As the name implies, this report discusses technical aspects and problems of the evaluation which have been encountered which are usually not of great interest to a wide audience even though they may be fascinating to the evaluator. There will be some audiences or members of some audiences who will want to see the data and statistical analyses and these should be provided in the technical report.

The full evaluation report or final report is the document which fleshes out the executive summary, providing more details regarding the findings, conclusions, and recommendations for future action without providing the tables and statistical analyses or discussions about problems encountered. The full final report includes descriptions of the program and its implementation, the evaluation procedures and processes, the concerns and issues identified and studied, the findings, the conclusions and the recommendations. The final report should contain sufficient information to respond to the concerns and issues of the audiences and provide enough information so that the judgments on which the recommendations are based are clear.

Wolf (1984) draws a distinction "between judgments about the worth of an educational course, program, or curriculum and judgments about future action" (p. 186). Both kinds of judgments need to be dealt with in the final report and sufficient information needs to be presented for the audiences to understand both. Judgments about the value of an educational program should be based on the data collected and its analysis. Judgments about future action, on the other hand, may be more subjective, more political, and may take into account forces and factors which go beyond the data. Thus, as Wolf (1984) cautions: "Since the evaluation worker will be going beyond the bounds of the kinds of conclusions that are made in conventional studies, it is important that judgments of worth be clearly separated from the rest of the material in an evaluation report" (p. 187). Guba and Lincoln (1981) argue that "the reaching of judgments and recommendations is a matter for interaction between the evaluator and the several audiences; the report should, however, highlight the judgments and recommendations that need to be made and provide the basic information from which they can jointly be fashioned by the evaluator and the audience" (p. 365).

A draft of the executive summary and the final report should be made available for discussion with some of the key individuals involved in the program being evaluated prior to the preparation and presentation of the final version. Making available the report draft enables the evaluator to find out whether the team "has it right": Are the data accurate and complete? Have the data been correctly interpreted? Are there conditions or factors which are important which might mitigate or alter the findings or the recommendations? Making a draft of a report available also avoids the element of surprise when the final report is presented so that policy makers and those responsible for the program are not faced with the unexpected when confronted with a final document in public. It must be stressed that providing a draft of a report is not for the purpose of negotiations and alterations but rather for discussions about the completeness and accuracy of the data, the correctness of the interpretation, and consideration of factors which might affect implementation.
Some recommendations emerge not directly from the data and findings but rather from the experience and values of the evaluators who presumably bring an expertise and competence which extends beyond the concerns and issues raised by those responsible for the program.

Finally, there is another kind of report, one prepared for the media. While this may be a news release, it is an important document. An evaluation report may be a lengthy, complicated document which a news reporter may skim for its attention-catching phrases. If the evaluation report is one to which the public should have or would want to have access, it is better to prepare a public information document which correctly summarizes the findings, conclusions and recommendations - briefly, succinctly, and accurately - for use by those who write or report the news.

The life-span value of evaluation reports varies with the nature of the program being studied. In a sense, if evaluation is for the purpose of helping decision-makers make better or better-informed decisions about a program - its installation, continuation, or modification - then whether and when a report becomes obsolescent will depend on whether and how the recommendations for future action have been implemented, whether and how the conditions affecting the program change, whether and how the values of the audience change, etc. Evaluation reports do become obsolescent and this is an aspect of a study which must be recognized by both the evaluators who prepare reports and those responsible for programs who use those reports. It is important that evaluators keep in mind that it is a reporting process to which they must attend throughout the conduct of an evaluation study, not simply the preparation of a final report.
References

Anderson, S. B., & Ball, S. (1978). The profession and practice of program evaluation. San Francisco: Jossey-Bass.
Guba, E. G., & Lincoln, Y. S. (1981). Effective evaluation: Improving the usefulness of evaluation results through responsive and naturalistic approaches. San Francisco: Jossey-Bass.
Wolf, R. M. (1984). Evaluation in education: Foundations of competency assessment and program review (2nd ed.). New York: Praeger Publishers.
CHAPTER 10

STANDARDS FOR ASSURING THE PROFESSIONAL QUALITY OF EDUCATIONAL PROGRAM AND PERSONNEL EVALUATIONS*

DANIEL L. STUFFLEBEAM

The Evaluation Center, Western Michigan University, U.S.A.

* James Sanders provided a valuable critique of a prior draft of this article.
Abstract Issues concerning standards for evaluations of education are of worldwide interest. This article reviews experiences in the United States over the past 15 years in developing standards by which to guide and assess the work of professional evaluators. This work has stressed that all evaluators should strive to make their evaluations useful, feasible, proper, and accurate. Conducted by a Joint Committee representing 14 diverse professional societies concerned with American education, the American effort has included two major projects. The first one resulted in 1981 in the publication of Standards for Evaluations of Educational Programs, Projects, and Materials. The second project is currently under way to develop standards for evaluations of teachers and other educators. The standards are seen to have general applicability to evaluation work in other countries but to require adaptation to take account of the values and realities in other national settings. It is suggested that the consensus-based process used by the Joint Committee to develop the standards provides a useful exemplar for consideration by standard-setting groups in other countries.
Professional educators, throughout the world, must evaluate their work in order to: (1) obtain direction for improving it; and (2) document their effectiveness. They must evaluate the performance of students, programs, personnel, and institutions. Within various countries, such evaluations have occurred at many levels: classroom, school, school district, state or province, and national system. And there have been international comparisons of the quality of education as well. The evaluations have varied enormously: in the objects assessed, the questions addressed, the methods used, the audiences served, the funds expended, the values invoked, and, to the point of this article, their quality. In evaluations, as in any professional endeavor, many things can and often do go wrong: they are subject to bias, misinterpretation, and misapplication; and they may address the wrong questions and/or provide erroneous information. Indeed, there have been strong
charges that evaluations, in general, have failed to render worthy services (Guba, 1969), and often, findings from individual studies have been disputed (e.g. the "Coleman, 1966 Equal Opportunity Study"). Clearly, evaluation itself is subject to evaluation and quality assurance efforts. During the past 30 years, there have been substantial efforts in the United States to assure and control the quality of educational evaluation. In addition to creating professional evaluation societies and developing preparation programs and a substantial professional literature, there have been concerted efforts to develop and enforce professional standards for educational evaluation. In the middle 1950s, the American Psychological Association joined with the American Educational Research Association and the National Council on Measurements Used in Education to develop standards for educational and psychological tests (APA, 1954; AERA/NCMUE, 1955); updated versions of the 'Test Standards' have been published by APA in 1966, 1974, and 1985, and they have been widely used, in the courts as well as in professional settings, to evaluate tests and the uses of test scores. In 1981, the Joint Committee on Standards for Educational Evaluation, whose 17 members were appointed by 12 organizations, issued the Standards for Evaluations of Educational Programs, Projects, and Materials (which originally was commissioned to serve as a companion volume to the 'Test Standards'); in 1982, the Evaluation Research Society (Rossi, 1982) issued a parallel set of program evaluation standards (intended to deal with program evaluations both outside and inside education). Currently, the Joint Committee on Standards for Educational Evaluation is developing standards for evaluations of educational personnel (which will be a companion volume to their program evaluation standards). The different sets of standards are noteworthy because they provide: (1) operational definitions of student evaluation and program evaluation (soon to include personnel evaluation); (2) evidence about the extent of agreement concerning the meaning and appropriate methods of educational evaluation; (3) general principles for dealing with a variety of evaluation problems; (4) practical guidelines for planning evaluations; (5) widely accepted criteria for judging evaluation plans and reports; (6) conceptual frameworks by which to study evaluation; (7) evidence of progress, in the United States, toward professionalizing evaluation; (8) content for evaluation training; and (9) a basis for synthesizing an overall view of the different types of evaluation. We expect that many evaluators, psychologists, and others concerned with the evaluation of education are aware of the 'Test Standards', but not the program evaluation standards, and obviously not the personnel evaluation standards, which are still under development. The purpose of this article is to inform an international audience - including evaluators, psychologists, and others involved in the evaluation of education - about the Standards for Evaluations of Educational Programs, Projects, and Materials (hereafter called the 'Program Evaluation Standards') and the more recent work of its authors, the Joint Committee on Standards for Educational Evaluation, toward developing educational personnel evaluation standards.
While it is hoped this report will provide a useful reference to groups in other countries that may desire standards to guide their evaluation work, the Joint Committee’s standards are distinctly American and may not reflect the values, experiences, political realities, and practical constraints in some other countries. The article is divided into two parts: (1) an introduction to the Joint Committee’s ‘Program Evaluation Standards’; and (2) an overview of the Committee’s project to develop ‘Educational Personnel Evaluation Standards’.
Introduction to the Program Evaluation Standards
In general, the Joint Committee devised 30 standards that pertain to four attributes of an evaluation: Utility, Feasibility, Propriety, and Accuracy. The Utility standards reflect a general consensus that emerged in the educational evaluation literature during the late 1960s requiring program evaluations to respond to the information needs of their clients, and not merely to address the interests of the evaluators. The Feasibility standards are consistent with the growing realization that evaluation procedures must be cost-effective and workable in real-world, politically-charged settings; in a sense, these standards are a countermeasure to the penchant for applying the procedures of laboratory research to real-world settings regardless of the fit. The Propriety standards, which are particularly American, reflect ethical issues, constitutional concerns, and litigation concerning such matters as rights of human subjects, freedom of information, contracting, and conflict of interest. The Accuracy standards build on those that have long been accepted for judging the technical merit of information, especially validity, reliability, and objectivity. Overall, then, the 'Program Evaluation Standards' promote evaluations that are useful, feasible, ethical, and technically sound, ones that will contribute significantly to the betterment of education.
Key Definitions

The 'Program Evaluation Standards' reflect certain definitions of key concepts. Evaluation means the systematic investigation of the worth or merit of some object. The object of an evaluation is what one is examining (or studying) in an evaluation: a program, a project, instructional materials, personnel qualifications and performance, or student needs and performance. Standards are principles commonly accepted for determining the value or the quality of an evaluation.
Development of the Program Evaluation Standards

To ensure that the 'Program Evaluation Standards' would reflect the best current knowledge and practice, the Joint Committee sought contributions from many sources. They collected and reviewed a wide range of literature. They devised a list of possible topics for standards, lists of guidelines and pitfalls thought to be associated with each standard, and illustrative cases showing an application of each standard. They engaged a group of 30 experts independently to expand the topics and write alternative versions for each standard. With the help of consultants, the Committee rated the alternative standards, devised their preferred set, and compiled the first draft of the 'Program Evaluation Standards'. They then had their first draft criticized by a nationwide panel of 50 experts who were nominated by the 12 sponsoring organizations. Based on those critiques, the Committee debated the identified issues and prepared a version which was subjected to national hearings and field tests. The results of this five-year period of development and assessments led, in 1981, to the published version of the 'Program Evaluation Standards'. Presently, that version is being applied and reviewed, and the Joint Committee is collecting feedback for use in preparing the next edition.
Developers of the Program Evaluation Standards
An important feature of the standards-setting process is the breadth of perspectives that have been represented in their development. The 12 organizations that originally sponsored the Joint Committee included the perspectives of the consumers as well as those who conduct program evaluations. The groups represented on the Joint Committee and among the approximately 200 other persons who contributed include, among others, those of statistician and administrator; psychologist and teacher; researcher and counselor; psychometrician and curriculum developer, and evaluator and school board member. There is perhaps no feature about the Joint Committee that is as important as its representative nature, since by definition a standard is a widely shared principle. Just as the breadth of perspectives involved in developing the ‘Program Evaluation Standards’ enhances their credibility in the United States, the low level of involvement of groups from outside the United States limits the credibility and usefulness of the ‘Program Evaluation Standards’ in other countries.
Format

The depth to which the Joint Committee developed each standard is apparent in the format common to all of the standards. This format starts with a descriptor - for instance, 'Formal Obligation'. The descriptor is followed by a statement of the standard, e.g. "Obligations of the formal parties to an evaluation (what is to be done, how, by whom, when) should be agreed to in writing, so that these parties are obligated to adhere to all conditions of the agreement or formally to renegotiate it", and an overview, that includes a rationale for the standard and definitions of its key terms. Also included, for each standard, are lists of pertinent guidelines, pitfalls, and caveats. The guidelines are procedures that often would prove useful in meeting the standard; the pitfalls are common mistakes to be avoided; and the caveats are warnings about being over zealous in applying the given standards, lest such effort detract from meeting other standards. The presentation of each standard is concluded with an illustration of how it might be applied. The illustration includes a situation in which the standard is violated, and a discussion of corrective actions that would result in better adherence to the standard. Usually, the illustrations are based on real situations, and they encompass a wide range of different types of evaluations. One easy step to extending the applicability of the 'Program Evaluation Standards' to evaluations in fields outside education would be to develop new illustrative cases drawn directly from experiences in evaluating programs outside education. Such a step might also be useful in efforts to adapt the 'Program Evaluation Standards' for use in countries outside the United States.
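As a rough illustration of the common format just described, each standard's entry could be represented as a simple record; this sketch is mine, not the Joint Committee's, and the sample overview text is invented.

```python
# A sketch of the format described above: descriptor, statement, overview, guidelines,
# pitfalls, caveats, and an illustrative case for each standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StandardEntry:
    descriptor: str                                        # e.g. "Formal Obligation"
    statement: str                                         # the standard itself
    overview: str                                          # rationale and definitions of key terms
    guidelines: List[str] = field(default_factory=list)    # procedures useful in meeting it
    pitfalls: List[str] = field(default_factory=list)      # common mistakes to avoid
    caveats: List[str] = field(default_factory=list)       # warnings against overzealous application
    illustration: str = ""                                  # a violation case with corrective actions

formal_obligation = StandardEntry(
    descriptor="Formal Obligation",
    statement=("Obligations of the formal parties to an evaluation (what is to be done, how, "
               "by whom, when) should be agreed to in writing, so that these parties are obligated "
               "to adhere to all conditions of the agreement or formally to renegotiate it."),
    overview="(Invented sample text) Written agreements reduce misunderstandings between evaluator and client.",
)
print(formal_obligation.descriptor)
```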
Content of the Standards

Utility Standards

In general, the Utility Standards are intended to guide evaluations so that they will be informative, timely, and influential. These standards require evaluators to acquaint themselves with their audiences, earn their confidence, ascertain the audiences' information needs, gear evaluations to respond to these needs, and report the relevant information clearly and when it is needed. The topics of the standards included in this category are Audience Identification, Evaluator Credibility, Information Scope and Selection, Valuational Interpretation, Report Clarity, Report Dissemination, Report Timeliness, and Evaluation Impact. Overall, the standards of Utility are concerned with whether an evaluation serves the practical information needs of a given audience.
Feasibility Standards

The Feasibility Standards recognize that an evaluation usually must be conducted in a 'natural', as opposed to a 'laboratory', setting, and require that no more materials and personnel time than necessary be consumed. The three topics of the Feasibility Standards are Practical Procedures, Political Viability, and Cost Effectiveness. Overall, the Feasibility Standards call for evaluations to be realistic, prudent, diplomatic, and frugal.

Propriety Standards

The Propriety Standards reflect the fact that evaluations affect many people in different ways. These standards are aimed at ensuring that the rights of persons affected by an evaluation will be protected. The topics covered by the Propriety Standards are Formal Obligation, Conflict of Interest, Full and Frank Disclosure, Public's Right to Know, Rights of Human Subjects, Human Interactions, Balanced Reporting, and Fiscal Responsibility. These standards require that those conducting evaluations learn about and abide by laws concerning such matters as privacy, freedom of information, and protection of human subjects. The standards charge those who conduct evaluations to respect the rights of others and to live up to the highest principles and ideals of their professional reference groups. Taken as a group, the Propriety Standards require that evaluations be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation, as well as those affected by the results.

Accuracy Standards

Accuracy, the fourth group, includes those standards that determine whether an evaluation has produced sound information. These standards require that the obtained information be technically adequate and that conclusions be linked logically to the data. The topics developed in this group are Object Identification, Context Analysis, Defensible Information Sources, Described Purposes and Procedures, Valid Measurement, Reliable Measurement, Systematic Data Control, Analysis of Quantitative Information, Analysis of Qualitative Information, Justified Conclusions, and Objective Reporting. The overall rating of an evaluation against the Accuracy Standards gives a good idea of the evaluation's overall truth value. The 30 standards are summarized in Table 10.1.
Table 10.1
Summary of the Standards

(A) Utility Standards
The utility standards are intended to ensure that an evaluation will serve the practical information needs of given audiences. These standards are:

(A1) Audience Identification. Audiences involved in or affected by the evaluation should be identified, so that their needs can be addressed.

(A2) Evaluator Credibility. The persons conducting the evaluation should be both trustworthy and competent to perform the evaluation, so that their findings achieve maximum credibility and acceptance.

(A3) Information Scope and Selection. Information collected should be of such scope and selected in such ways as to address pertinent questions about the object of the evaluation and be responsive to the needs and interests of specified audiences.

(A4) Valuational Interpretation. The perspectives, procedures, and rationale used to interpret the findings should be carefully described, so that the bases for value judgments are clear.

(A5) Report Clarity. The evaluation report should describe the object being evaluated and its context, and the purposes, procedures, and findings of the evaluation, so that the audiences will readily understand what was done, why it was done, what information was obtained, what conclusions were drawn, and what recommendations were made.

(A6) Report Dissemination. Evaluation findings should be disseminated to clients and other right-to-know audiences, so that they can assess and use the findings.

(A7) Report Timeliness. Release of reports should be timely, so that audiences can best use the reported information.

(A8) Evaluation Impact. Evaluations should be planned and conducted in ways that encourage follow-through by members of the audiences.

(B) Feasibility Standards
The feasibility standards are intended to ensure that an evaluation will be realistic, prudent, diplomatic, and frugal. These standards are:

(B1) Practical Procedures. The evaluation procedures should be practical, so that disruption is kept to a minimum, and that needed information can be obtained.

(B2) Political Viability. The evaluation should be planned and conducted with anticipation of the different positions of various interest groups, so that their cooperation may be obtained, and so that possible attempts by any of these groups to curtail evaluation operations or to bias or misapply the results can be averted or counteracted.

(B3) Cost Effectiveness. The evaluation should produce information of sufficient value to justify the resources expended.

(C) Propriety Standards
The propriety standards are intended to ensure that an evaluation will be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation, as well as those affected by its results. These standards are:

(C1) Formal Obligation. Obligations of the formal parties to an evaluation (what is to be done, how, by whom, when) should be agreed to in writing, so that these parties are obligated to adhere to all conditions of the agreement or formally to renegotiate it.

(C2) Conflict of Interest. Conflict of interest, frequently unavoidable, should be dealt with openly and honestly, so that it does not compromise the evaluation processes and results.

(C3) Full and Frank Disclosure. Oral and written evaluation reports should be open, direct, and honest in their disclosure of pertinent findings, including the limitations of the evaluation.

(C4) Public's Right to Know. The formal parties to an evaluation should respect and assure the public's right to know, within the limits of other related principles and statutes, such as those dealing with public safety and the right to privacy.

(C5) Rights of Human Subjects. Evaluations should be designed and conducted, so that the rights and welfare of the human subjects are respected and protected.

(C6) Human Interactions. Evaluators should respect human dignity and worth in their interactions with other persons associated with an evaluation.

(C7) Balanced Reporting. The evaluation should be complete and fair in its presentation of strengths and weaknesses of the object under investigation, so that strengths can be built upon and problem areas addressed.

(C8) Fiscal Responsibility. The evaluator's allocation and expenditure of resources should reflect sound accountability procedures and otherwise be prudent and ethically responsible.

(D) Accuracy Standards
The accuracy standards are intended to ensure that an evaluation will reveal and convey technically adequate information about the features of the object being studied that determine its worth or merit. These standards are:

(D1) Object Identification. The object of the evaluation (program, project, material) should be sufficiently examined, so that the form(s) of the object being considered in the evaluation can be clearly identified.

(D2) Context Analysis. The context in which the program, project, or material exists should be examined in enough detail, so that its likely influences on the object can be identified.

(D3) Described Purposes and Procedures. The purposes and procedures of the evaluation should be monitored and described in enough detail, so that they can be identified and assessed.

(D4) Defensible Information Sources. The sources of information should be described in enough detail, so that the adequacy of the information can be assessed.

(D5) Valid Measurement. The information-gathering instruments and procedures should be chosen or developed and then implemented in ways that will assure that the interpretation arrived at is valid for the given use.

(D6) Reliable Measurement. The information-gathering instruments and procedures should be chosen or developed and then implemented in ways that will assure that the information obtained is sufficiently reliable for the intended use.

(D7) Systematic Data Control. The data collected, processed, and reported in an evaluation should be reviewed and corrected, so that the results of the evaluation will not be flawed.

(D8) Analysis of Quantitative Information. Quantitative information in an evaluation should be appropriately and systematically analyzed to ensure supportable interpretations.

(D9) Analysis of Qualitative Information. Qualitative information in an evaluation should be appropriately and systematically analyzed to ensure supportable interpretations.

(D10) Justified Conclusions. The conclusions reached in an evaluation should be explicitly justified, so that the audiences can assess them.

(D11) Objective Reporting. The evaluation procedures should provide safeguards to protect the evaluation findings and reports against distortion by the personal feelings and biases of any party to the evaluation.
Eclectic Orientation

The 'Program Evaluation Standards' do not exclusively endorse any one approach to evaluation. Instead, the Joint Committee has written standards that encourage the sound use of a variety of evaluation methods. These include surveys, observations, document reviews, jury trials for projects, case studies, advocacy teams to generate and assess competing plans, adversary and advocacy teams to expose the strengths and weaknesses of projects, testing programs, simulation studies, time-series studies, checklists, goal-free evaluations, secondary data analysis, and quasi-experimental design. In essence, evaluators are advised to use whatever methods are best suited for gathering information that is relevant to the questions posed by clients and other audiences, yet sufficient for assessing a program's effectiveness, costs, responses to societal needs, feasibility, and worth. It is desirable to employ multiple methods, qualitative as well as quantitative, and the methods should be feasible to use in the given setting.
Nature of the Evaluations to be Guided by the 'Program Evaluation Standards'

The Joint Committee deliberately chose to limit the 'Program Evaluation Standards' to evaluations of educational programs, projects, and materials. They chose not to deal with evaluations of educational institutions and personnel or with evaluations outside education. They set these boundaries for reasons of feasibility and political viability of the project. Given these constraints, the Joint Committee attempted to provide principles that apply to the full range of different types of studies that might legitimately be conducted in the name of evaluation. These include, for example, small-scale, informal studies that a school committee might employ to assist in planning and operating one or more workshops; as another example, they include large-scale, formal studies that might be conducted by a special evaluation team in order to assess and report publicly on the worth and merit of a statewide or national instructional program. Other types of evaluations to which the 'Program Evaluation Standards' apply include pilot studies, needs assessments, process evaluations, outcome studies, cost/effectiveness studies, and meta-analyses. In general, the Joint Committee says the 'Program Evaluation Standards' are intended for use with studies that are internal and external, small and large, informal and formal, and for those that are formative (designed to improve a program while it is still being developed) and summative (designed to support conclusions about the worth or merit of an object and to provide recommendations about whether it should be retained, revised, or eliminated).

It would be a mistake to assume that the 'Program Evaluation Standards' are intended for application only to heavily funded and well-staffed evaluations. In fact, the Committee doubts whether any evaluation could simultaneously meet all of the standards. The Committee encouraged evaluators and their clients to consult the 'Program Evaluation Standards' to consider systematically how their investigations can make the best use of available resources in informing and guiding practice. The 'Program Evaluation Standards' must not be viewed as an academic exercise of use only to well-funded developers but as a code by which to help improve evaluation practice. This message is as applicable to those educators who must evaluate their own work as it is to those who can call on the services of evaluation specialists. For both groups, consideration of the 'Program Evaluation Standards' may sometimes indicate that a proposed evaluation is not worthy of further consideration, or it may help to justify and then to guide and assess the study.
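To illustrate how the standards might be consulted systematically as a working code of practice, here is a minimal sketch of a checklist built from the 30 descriptors listed in Table 10.1; the checklist idea is an illustration only, not a procedure prescribed by the Joint Committee.

```python
# A simple working checklist over the four categories of the 1981 Program Evaluation Standards.
CATEGORIES = {
    "Utility": ["Audience Identification", "Evaluator Credibility", "Information Scope and Selection",
                "Valuational Interpretation", "Report Clarity", "Report Dissemination",
                "Report Timeliness", "Evaluation Impact"],
    "Feasibility": ["Practical Procedures", "Political Viability", "Cost Effectiveness"],
    "Propriety": ["Formal Obligation", "Conflict of Interest", "Full and Frank Disclosure",
                  "Public's Right to Know", "Rights of Human Subjects", "Human Interactions",
                  "Balanced Reporting", "Fiscal Responsibility"],
    "Accuracy": ["Object Identification", "Context Analysis", "Described Purposes and Procedures",
                 "Defensible Information Sources", "Valid Measurement", "Reliable Measurement",
                 "Systematic Data Control", "Analysis of Quantitative Information",
                 "Analysis of Qualitative Information", "Justified Conclusions", "Objective Reporting"],
}

def unaddressed(notes):
    """List the standards for which the evaluation plan has no recorded note yet."""
    return [s for standards in CATEGORIES.values() for s in standards if s not in notes]

# Hypothetical planning notes keyed by standard descriptor.
plan_notes = {"Audience Identification": "School board, staff, and parents identified in the proposal."}
print(f"{len(unaddressed(plan_notes))} of 30 standards not yet addressed in the plan.")
```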
Tradeoffs Among the Standards The preceding discussion points up a particular difficulty in applying the ‘Program Evaluation Standards’. Inevitably, efforts to meet certain standards will detract from efforts to meet others, and tradeoff decisions will be required. For example, efforts to produce valid and reliable information and to generate ‘ironclad’ conclusions may make it difficult to produce needed reports in time to have an impact on crucial program decisions, or the attempt to keep an evaluation within cost limits may conflict with meeting such standards as Information Scope and Selection and Report Dissemination. Such conflicts will vary across different types and sizes of studies, and within a given study the tradeoffs will probably be different depending on the stage of the study (e.g. deciding whether to evaluate, designing the evaluation, collecting the data, reporting the results, or assessing the results of the study). Evaluators need to recognize and deal as judiciously as they can with such conflicts. Some general advice for dealingwith these tradeoff problems can be offered. At a macro level, the Joint Committee decided to present the four groups of standards in a particular order: Utility, Feasibility, Propriety, and Accuracy. The rationale for this sequence might be stated as “an evaluation not worth doing isn’t worth doing well”. In deciding whether to evaluate, it is therefore more important to begin with assurances that the findings, if obtained, would be useful, than to start with assurances only that the information would be technically sound. If there is no prospect for utility, then of course there is no need to work out an elegant design that would produce sound information. Given a determination that the findings from a projected study would be useful, then the evaluator and client might next consider whether it is feasible to move ahead. Are sufficient resources available to obtain and report the needed information in time for its use? Can the needed cooperation and political support be mustered? And, would the projected information gains, in the judgment of the client, be worth the required investment of time and resources? If such questions cannot be answered affirmatively, then the evaluation planning effort might best be discontinued with no further consideration of the other standards. Otherwise, the evaluator would next consider whether there is any reason that the evaluation could not be
Once it is ascertained that a proposed evaluation could meet conditions of utility, feasibility, and propriety, then the evaluator and client would attend carefully to the accuracy standards. By following the sequence described above, it is believed that evaluation resources would be allocated to those studies that are worth doing and that the studies would then proceed on sound bases.
There are also problems with tradeoffs among the individual standards. The Committee decided against assigning a priority rating to each standard because the tradeoff issues vary from study to study and within a given study at different stages. Instead, the Committee provided a Functional Table of Contents that is summarized in Table 10.2. This matrix summarizes the Committee’s judgments about which standards are most applicable to each of a range of common evaluation tasks. The standards are identified down the side of the matrix. Across the top are 10 tasks that are commonly involved in any evaluation. The checkmarks in the cells denote which standards should be heeded most carefully in addressing a given evaluation task. All of the standards are applicable in all evaluations. However, the Functional Table of Contents allows evaluators to identify quickly those standards that are most relevant to certain tasks in given evaluations.
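To make the macro-level ordering concrete, the gating logic described above can be sketched in code. The Python fragment below is purely illustrative: the function names and the simple boolean checks are assumptions introduced here for exposition, not part of the Joint Committee’s materials, and real applications of the standards rest on professional judgment rather than mechanical tests.

```python
# A minimal, purely illustrative sketch of the ordering Utility -> Feasibility ->
# Propriety -> Accuracy described in the text. The check functions are hypothetical
# stand-ins for the judgments an evaluator and client would actually make.

def would_findings_be_useful(proposal: dict) -> bool:
    # Utility: would the findings, if obtained, inform identified audiences?
    return proposal.get("audiences_identified", False) and \
        proposal.get("findings_would_inform_decisions", False)

def is_study_feasible(proposal: dict) -> bool:
    # Feasibility: resources, timing, cooperation, and political support.
    return proposal.get("resources_sufficient", False) and \
        proposal.get("cooperation_secured", False)

def meets_propriety_conditions(proposal: dict) -> bool:
    # Propriety: legal and ethical bounds (agreements, rights of subjects, disclosure).
    return proposal.get("rights_protected", False)

def plan_evaluation(proposal: dict) -> str:
    """Apply the four groups of standards in the order suggested by the Committee."""
    if not would_findings_be_useful(proposal):
        return "discontinue: no prospect of utility"
    if not is_study_feasible(proposal):
        return "discontinue: not feasible"
    if not meets_propriety_conditions(proposal):
        return "discontinue: propriety conditions not met"
    # Only now invest in the accuracy standards (design, measurement, analysis).
    return "proceed: attend to the accuracy standards"

if __name__ == "__main__":
    print(plan_evaluation({
        "audiences_identified": True,
        "findings_would_inform_decisions": True,
        "resources_sufficient": True,
        "cooperation_secured": True,
        "rights_protected": True,
    }))
```

The only point of the sketch is the ordering: no effort is spent on the accuracy standards until utility, feasibility, and propriety have each been affirmed.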
Attestation
To assist evaluators and their clients to record their decisions about applying given standards and their judgments about the extent to which each one was taken into account, the Committee provided a citation form (see Table 10.3). This form is to be completed, signed, and appended to evaluation plans and reports. Like an auditor’s statement, the signed citation form should assist audiences to assess the merits of given evaluations. Of course, the completed citation form should often be backed up by more extensive documentation, especially with regard to the judgments given about the extent that each standard was taken into account. In the absence of such documentation, the completed citation form can be used as an agenda for discussions between evaluators and their audiences about the adequacy of evaluation plans or reports.
Validity of the Standards
In the short time since the ‘Program Evaluation Standards’ were published, a considerable amount of information that bears on the validity of the standards has been presented. In general, this evidence supports the position that the ‘Program Evaluation Standards’ are needed, have been carefully developed, have good credibility in the United States, and have been put to practical use. However, the assessments also point out some limitations and areas for improvement. Bunda (1982), Impara (1982), Merwin (1982), and Wardrop (1982) examined the congruence between the ‘Program Evaluation Standards’ and the principles of measurement that are embodied in the Standards for Educational and Psychological Tests (APA, 1974); they independently concluded that great consistency exists between these two sets of standards with regard to measurement. Ridings (1980) closely studied standard setting in the accounting and auditing fields and developed a checklist by which to assess the Joint Committee effort against key checkpoints in the more mature standard-setting programs in accounting and auditing.
Table 10.2
Summary of the Functional Table of Contents. The rows list the thirty standards by code and descriptor: Utility (A1 Audience identification, A2 Evaluator credibility, A3 Information scope and selection, A4 Valuational interpretation, A5 Report clarity, A6 Report dissemination, A7 Report timeliness, A8 Evaluation impact); Feasibility (B1 Practical procedures, B2 Political viability, B3 Cost effectiveness); Propriety (C1 Formal obligation, C2 Conflict of interest, C3 Full and frank disclosure, C4 Public’s right to know, C5 Rights of human subjects, C6 Human interactions, C7 Balanced reporting, C8 Fiscal responsibility); Accuracy (D1 Object identification, D2 Context analysis, D3 Described purposes and procedures, D4 Defensible information sources, D5 Valid measurement, D6 Reliable measurement, D7 Systematic data control, D8 Quantitative analysis, D9 Qualitative analysis, D10 Justified conclusions, D11 Objective reporting). The columns list ten common evaluation tasks: (1) Decide whether to do a study; (2) Clarify and assess purpose; (3) Ensure political viability; (4) Contract the study; (5) Staff the study; (6) Manage the study; (7) Collect data; (8) Analyze data; (9) Report findings; (10) Apply results. Checkmarks in the cells, not reproduced here, indicate the standards judged most applicable to each task.
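Read this way, the Functional Table of Contents is essentially a lookup from an evaluation task to the standards most worth heeding for it. The short Python sketch below shows one way such a lookup might be represented; the task-to-standard assignments in it are invented purely for illustration and do not reproduce the Committee’s actual checkmark pattern in Table 10.2.

```python
# Hypothetical sketch of a Functional Table of Contents as a task -> standards lookup.
# The assignments below are invented for illustration; they do NOT reproduce the
# Joint Committee's actual checkmarks in Table 10.2.

FUNCTIONAL_TABLE = {
    "decide whether to do a study": ["A1", "A8", "B2", "C1"],
    "collect data": ["C5", "D4", "D5", "D6", "D7"],
    "report findings": ["A5", "A6", "A7", "C3", "D11"],
}

def most_relevant_standards(task: str) -> list:
    """Return the standards flagged as most applicable to a given evaluation task."""
    return FUNCTIONAL_TABLE.get(task.strip().lower(), [])

if __name__ == "__main__":
    print(most_relevant_standards("Report findings"))  # ['A5', 'A6', 'A7', 'C3', 'D11']
```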
Table 10.3
Citation Form. The Standards for Evaluations of Educational Programs, Projects, and Materials guided the development of this (check one): request for evaluation plan/design/proposal; evaluation plan/design/proposal; evaluation contract; evaluation report; other. To interpret the information provided on this form, the reader needs to refer to the full text of the standards as they appear in: Joint Committee on Standards for Educational Evaluation. (1981). Standards for Evaluations of Educational Programs, Projects, and Materials. New York: McGraw-Hill. The Standards were consulted and used as indicated in the table below (check as appropriate). The form is completed with the name (typed), signature, and date of the signer; position or title; agency; address; and relation to the document (e.g. author of document, evaluation team leader, external auditor, internal auditor).
In general, she concluded that the Joint Committee had adequately dealt with four key issues: rationale, the standard-setting structure, content, and uses. Wildemuth (1981) issued an annotated bibliography with about five sources identified for each standard; these references help to confirm the theoretical validity of the ‘Program Evaluation Standards’, and they provide a convenient guide to users for pursuing in-depth study of the involved principles. Linn (1981) reported the results of about 25 field trials that were conducted during the development of the ‘Program Evaluation Standards’; these confirmed that the ‘Program Evaluation Standards’ were useful, but not sufficient, guides in such applications as designing evaluations, assessing evaluation proposals, judging evaluation reports, and training evaluators. Additionally, they provided direction for revising the ‘Program Evaluation Standards’ prior to publication. Stake (1981) observed that the Joint Committee had made a strong case in favor of evaluation standards, but he urged a careful look at the case against standards. He offered analysis in this vein and questioned whether the evaluation field has matured sufficiently to warrant the development and use of standards.
A number of writers have examined the applicability of the ‘Program Evaluation Standards’ to specialized situations. Wargo (1981) concluded that the ‘Program Evaluation Standards’ represent a sound consensus of good evaluation practice, but he called for more specificity regarding large-scale, government-sponsored studies and for more representation from this sector on the Committee. Marcia Linn (1981) concluded that the ‘Program Evaluation Standards’ contain sound advice for evaluators in out-of-school learning environments, but she observed that the ‘Program Evaluation Standards’ are not suitable for dealing with tradeoffs between standards or settling disputes between and among stakeholders. While the ‘Program Evaluation Standards’ explicitly are not intended for personnel evaluations, Carey (1979) examined the extent to which they are congruent with state policies for evaluating teachers; she found that only one standard (D11, Objective Reporting) was deemed inappropriate for judging teacher evaluations. Burkett and Denson (1985) surveyed participants at a conference on evaluation in the health professions to obtain their judgments of the ‘Program Evaluation Standards’. While the respondents generally agreed “. . . that the Standards represent a useful framework for designing evaluations and offer substantial potential for application to the evaluation of continuing education (CE) for programs for the health professions”, they also issued the following criticisms: (1) crucial elements of certain standards lie outside the evaluator’s professional area of control; (2) the Standards assume more flexibility, e.g. in the choice of methods of assessment, than sometimes may exist in institutional settings; (3) the Standards deal better with external evaluations than with internal, self-evaluations; and (4) the Standards need to be made more useful by ordering them in the same sequence as an evaluation typically unfolds, providing more specific guidelines and examples, and adding bibliographic references.
Marsh et al. (1981) used the ‘Program Evaluation Standards’ to study the practice of educational evaluation in California and concluded the following: “(1) the standards were perceived as important ideals for the orientation of the process and practice of evaluation; (2) the current practice of evaluation in California was perceived by professional evaluators as being, at most, of average quality; and (3) the practice of low quality evaluation was attributed to a combination of restriction of time, of political and bureaucratic coercions, and of incompetence of the evaluator”.
Several evaluators from other countries have examined the ‘Program Evaluation Standards’ for their applicability outside the United States. Nevo (1982) and Straton (1982), respectively from Israel and Australia, both concluded that while the ‘Program Evaluation Standards’ embody sound advice, they assume an American situation (regarding level of effort and citizens’ rights, for example) that is different from their own national contexts. Oliveira et al. (1982) published, in Portuguese, a summary and critique of the ‘Program Evaluation Standards’ in the hope that their contribution would “. . . positively influence the quality of the evaluations conducted in Brazil, help in the training of educational evaluators, and help those who recommend evaluations to improve their value”. Lewy, from Israel, concluded that the ‘Program Evaluation Standards’ “. . . provide useful guidelines for evaluators in Israel as well as the USA”, but raised questions about the adequacy of their theoretical rationale and criticized their lack of specificity. Lewy, like Dockrell (1983), saw great possibilities for unhealthy collusion between evaluators and sponsors and disagreed with the position reflected in the ‘Program Evaluation Standards’ that evaluators should communicate continuously with their clients and report interim findings. Dockrell also observed that evaluation in Scotland and other European countries is much more qualitatively oriented than is evaluation practice in the United States and that the ‘Program Evaluation Standards’ do not and probably could not provide much guidance for the perceptiveness and originality required of excellent qualitative research. Scheerens and van Seventer (1983) saw in the ‘Program Evaluation Standards’ a useful contribution to the important need in the Netherlands to upgrade and professionalize evaluation practice; but, to promote utility in their country, they said the standards would need to be translated and illustrated at the national research policy level, as opposed to their present concentration on the individual evaluation project. Even so, they questioned whether such standards could be enforced in Holland, given the susceptibility of national research policy there to frequently changing political forces and priorities. Marklund (1983) concluded that the ‘Program Evaluation Standards’ provide a “. . . good checklist of prerequisites for a reliable and valid evaluation”, but that “. . . due to differences in values of program outcomes, such standards do not guarantee that the result of the evaluation will be indisputable”. Overall, the main value of the ‘Program Evaluation Standards’ outside the United States appears to be as a useful reference for stimulating discussion of the need for professionalizing evaluation and the range of issues to be considered.
Six studies were conducted to examine the extent to which the ‘Program Evaluation Standards’ are congruent with the set of program evaluation standards that was recently issued by the Evaluation Research Society (Rossi, 1982). Cordray (1982), Braskamp and Mayberry (1982), Stufflebeam (1982), McKillip (1983), and Stockdill (1984) found that the two sets of standards are largely overlapping.
Overall, the literature on the ‘Program Evaluation Standards’ indicates considerable support for these standards. They are seen to fill a need. They are judged to contain sound content. They have been shown to be applicable in a wide range of American settings. They have been applied successfully.
They are consistent with the principles in other sets of standards. And they are subject to an appropriate process of review and revision. But they are by no means a panacea. Their utility is limited, especially outside the United States. And several issues have been raised for consideration in subsequent revision cycles.
Standards for Evaluations of Educational Personnel
An initial decision in developing the ‘Program Evaluation Standards’ was to exclude the area of personnel evaluation. One reason was that developing a whole new set of standards for program evaluation presented a sufficiently large challenge; another reason was that members of the Committee believed that teachers’ organizations would not support development of standards for evaluations of personnel. Also, in 1975 when the Joint Committee was formed, there was little concern for increasing or improving the evaluation of educational personnel.
The Decision to Develop Educational Personnel Evaluation Standards
In 1984, a number of factors led to the Joint Committee’s decision to develop standards for evaluations of educational personnel. The Committee had successfully developed the ‘Program Evaluation Standards’ and felt capable of tackling the personnel evaluation standards issue. They were also convinced that personnel evaluation in education was greatly in need of improvement. Moreover, they saw this need as urgent, because of the great increase in the development of systems for evaluating teachers and because of the great turmoil and litigation that accompanied the expansion of educational personnel evaluation activity. Finally, they believed that the major teachers’ organizations would support the development of professional standards that could be used to expose unsound plans and programs of personnel evaluation.
Expansion of the Joint Committee
In the course of deciding to develop the educational personnel evaluation standards, the Committee also decided to expand its membership to ensure that its members reflected relevant perspectives on evaluations of educational personnel as well as evaluations of educational programs. Additions to the Committee included representatives from the American Association of School Personnel Administrators, the American Federation of Teachers, and the National Association of Secondary School Principals, as well as individual members-at-large with expertise in litigation in personnel evaluation and research on teacher evaluation. New appointments by sponsoring organizations also included the perspectives of industrial/organizational psychology and traditionally underrepresented groups. The 18-member Committee continues to include a balance between the perspectives of educational practitioners and evaluation specialists. The Joint Committee’s sponsoring organizations are now the American Association of School Administrators, American Association of School Personnel Administrators, American Educational Research Association, American Evaluation Association, American Federation of Teachers, American Psychological Association, Association for Measurement and Evaluation in Counseling and Development, Association for Supervision and Curriculum Development, Education Commission of the States, National Association of Elementary School Principals, National Association of Secondary School Principals, National Council on Measurement in Education, National Education Association, and National School Boards Association.
The Guiding Rationale
It is appropriate for the Joint Committee to deal with personnel evaluation as well as program evaluation. Both types of evaluation are prevalent in education, and both are vitally important for assuring the quality of educational services. Practically and politically it is usually necessary to conduct these two types of evaluation separately. But logically, they are inseparable. Practice and literature have lodged responsibility for personnel evaluation with supervisors and administrators and have created expectations that program evaluators will not evaluate the performance of individuals as such. Program evaluators might provide some technical advice for developing a sound system of personnel evaluation and might even evaluate the personnel evaluation system itself; but they have preferred, and often have insisted on, staying out of the role of directly evaluating individual personnel. To do otherwise would stimulate fear about the power and motives of evaluators, and would undoubtedly generate much resistance on the part of principals and teachers, leading in turn to lack of cooperation in efforts to evaluate programs. Thus, program evaluators typically have avoided any association with personnel evaluation. They have emphasized instead the constructive contributions of program evaluation, and they have promised as much anonymity and confidentiality as they could to teachers and administrators in the programs being evaluated. On the whole, efforts to separate personnel and program evaluation in school districts have remained in vogue. But a basic problem remains: namely, it is fundamentally impossible to remove personnel evaluation from sound program evaluation. A useful program evaluation must determine whether a program shows a desirable impact on the rightful target population. If the data reveal otherwise, the assessment must discern those aspects of a program that require change to yield the desired results. Inescapably, then, program evaluators must check the adequacy of all relevant instrumental variables, including the personnel. The rights of teachers and administrators must be respected, but evaluators must also protect the rights of students to be taught well and of communities to have their schools effectively administered. However, personnel evaluation is too important and difficult a task to be left exclusively to the program evaluators. Many personnel evaluations are conducted by supervisors who rarely conduct formal program evaluations. Also, state education departments and school districts are heavily involved, apart from their program evaluation efforts, in evaluating teachers and other educators for certification, selection, placement, promotion, tenure, merit, staff development, and termination. Undesirably, the literatures and methodologies of program evaluation and personnel evaluation are distinct. The work of the Joint Committee in both areas affords a significant opportunity to bring a concerted effort to bear on synthesizing these fields and coordinating the efforts of program evaluators and personnel evaluators for the betterment of educational service.
The Developmental Process
To achieve its goals for developing standards for personnel evaluations, the Joint Committee is employing the approach it found successful in the development of the ‘Program Evaluation Standards’. They have collected and studied an enormous amount of information about educational personnel evaluation and have developed a tentative set of topics for personnel evaluation standards. A panel of writers, nominated by the 14 sponsoring organizations, wrote multiple versions of each proposed standard. The Joint Committee evaluated the alternative versions and decided which aspects of each standard would be included in the initial review version of the Educational Personnel Evaluation Standards book. The first draft of the book will be critiqued by a national review panel and an international review panel. The Joint Committee will use the critiques to develop a semifinal version of the book. That version will be field tested and subjected to hearings conducted throughout the United States. The results will be used to develop the final publication version of the Educational Personnel Evaluation Standards. Publication is expected in 1988. The entire process will be monitored and evaluated by a Validation Panel, whose members represent the perspectives of philosophy, international education, law, personnel psychology, school district administration, educational research, psychometrics, teaching, and the school principalship.
Contents of the Standards
After reviewing a great deal of material on personnel evaluation, the Joint Committee decided that the four basic concerns of Utility, Feasibility, Propriety, and Accuracy are as relevant to personnel evaluation as they are to program evaluation. Some of the topics for individual standards are likewise the same, e.g. Valid Measurement and Reliable Measurement. However, there are important differences in the two sets of topics. For example, Full and Frank Disclosure, a program evaluation standard, has not surfaced in the personnel evaluation standards; and Service Orientation, a key entry in the personnel evaluation standards (requiring that evaluators show concern for the rights of students to be taught well), was not among the ‘Program Evaluation Standards’. In general, much work remains to be done before the contents of the first edition of the Educational Personnel Evaluation Standards will be determined.
International Involvements and Implications
The Committee desires to stay in touch with international groups that are involved in evaluations of educational personnel so that it can benefit from the experiences in other countries and share what it learns from this project with interested groups in those countries. Accordingly, an Irish psychologist will serve on the Validation Panel to add an international perspective, and the Committee will engage an International Review Panel to evaluate the first draft of the standards. We will also report our progress to international audiences, as in this article, and through a periodic newsletter. We realize, however, that the standards must concentrate on the relevant U.S. laws and personnel evaluation systems; and it is quite possible that the personnel evaluation standards will not transfer well to other cultures.
Closing
The purpose of this article has been to communicate with international colleagues about standards for evaluations of education. Included are a review of the Standards for Evaluations of Educational Programs, Projects, and Materials and a description of the Joint Committee’s plans for developing standards for evaluations of educational personnel. The pervasive message is that all evaluators should strive to make their evaluations useful, feasible, proper, and accurate. However, the Joint Committee has not finalized, and undoubtedly never will finalize, a set of universal standards. In particular, their standards lack direct applicability to evaluation work in other countries. Moreover, professional standards are most useful when developed by the professionals whose work is to be assessed. Evaluation groups in other countries that might desire to develop their own standards could profit from studying the Joint Committee’s work. They have had 10 years’ experience in organizing a systematic and ongoing process of setting and applying standards for improving evaluation services. As Chairman of the Joint Committee, I have been glad to share our experiences with colleagues abroad, and we would welcome their feedback and cooperation regarding evaluation standards and standard-setting processes.
References
American Educational Research Association & National Council on Measurements Used in Education. (1955). Technical recommendations for achievement tests. Washington, DC: National Education Association.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: American Psychological Association.
American Psychological Association. (1966). Standards for educational and psychological tests and manuals. Washington, DC: American Psychological Association.
American Psychological Association. (1973). Ethical principles in the conduct of research with human participants. Washington, DC: American Psychological Association.
American Psychological Association. (1974). Standards for educational and psychological tests (rev. ed.). Washington, DC: American Psychological Association.
Braskamp, L. A., & Mayberry, P. W. (1982). A comparison of two sets of standards. Paper presented at the joint annual meeting of the Evaluation Network and Evaluation Research Society, Baltimore, MD.
Bunda, M. (1982). Concerns and techniques in feasibility. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
Burkett, D., & Denson, T. (1985). Another view of the standards. In S. Abrahamson (Ed.), Evaluation of continuing education in the health professions. Boston: Kluwer-Nijhoff Publishing.
Carey, L. (1979). State-level teacher performance evaluation policies. In Inservice centerfold. New York: National Council on State and Inservice Education.
Coleman, J. S., Campbell, E. Q., Hobson, C. J., et al. (1966). Equality of educational opportunity. Washington, DC: Office of Education, U.S. Department of Health, Education, and Welfare.
Cordray, D. (1982). An assessment of the utility of the ERS standards. In P. H. Rossi (Ed.), Standards for evaluation practice. New directions for program evaluation, No. 15. San Francisco: Jossey-Bass.
Division of Industrial-Organizational Psychology, American Psychological Association. (1980). Principles for the validation and use of personnel selection procedures (2nd ed.). Berkeley, CA: American Psychological Association, Division of Industrial-Organizational Psychology.
Dockrell, W. B. (1983). Applicability of standards for evaluations of educational programs, projects, and materials. Presentation at the annual meeting of the American Educational Research Association, Boston.
Impara, J. C. (1982). Measurement and the utility standards. Paper presented at the meeting of the National Council on Measurement in Education, New York.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of educational programs, projects, and materials. New York: McGraw-Hill.
Lewy, A. (1983). Evaluation standards: Comments from Israel. Presentation at the annual meeting of the American Educational Research Association, Boston.
Linn, M. (1981). Standards for evaluating out-of-school learning. Evaluation News, 2(2), 171-176.
Linn, R. L. (1981). A preliminary look at the applicability of the educational evaluation standards. Educational Evaluation and Policy Analysis, 3, 87-91.
Marklund, S. (1983). Applicability of standards for evaluations of educational programs, projects, and materials in an international setting. Presentation at the annual meeting of the American Educational Research Association, Boston.
Marsh, D. D., Newman, W. B., & Boyer, W. F. (1981). Comparing ideal and real: A study of evaluation practice in California using the Joint Committee’s evaluation standards. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles.
McKillip, J., & Garberg, R. A further examination of the overlap between ERS and Joint Committee evaluation standards. Carbondale, IL: Southern Illinois University, Department of Psychology. Unpublished manuscript.
Merwin, J. C. (1982). Measurement and propriety standards. Paper presented at the meeting of the National Council on Measurement in Education, New York.
Nevo, D. (1982). Applying the evaluation standards in a different social context. Paper presented at the 20th Congress of the International Association of Applied Psychology, Edinburgh, Scotland.
Ridings, J. M. (1980). Standard setting in accounting and auditing: Considerations for educational evaluation. Kalamazoo: Western Michigan University. Unpublished dissertation.
Rodrigues de Oliveira, T., Hoffman, J. M. L., Barros, R. F., Arruda, N. F. C., & Santos, R. R. (1981). Standards for evaluation of educational programs, projects, and materials. Department of Education of the Federal University of Rio de Janeiro (UFRJ). Unpublished manuscript.
Rossi, P. H. (Ed.). (1982). Standards for evaluation practice. San Francisco: Jossey-Bass.
Scheerens, J., & van Seventer. (1983). Political and organizational preconditions for application of the standards for educational evaluation. Presentation at the annual meeting of the American Educational Research Association, Boston.
Stake, R. (1981). Setting standards for educational evaluators. Evaluation News, 2(2), 148-152.
Stockdill, S. H. (1984, October). The appropriateness of the evaluation standards for business evaluations. Presentation at the Evaluation Network/Evaluation Research Society joint meeting, San Francisco.
Straton, R. B. (1982). Appropriateness and potential impact of programme evaluation standards in Australia. Paper presented at the 20th International Congress of Applied Psychology, Edinburgh, Scotland.
Stufflebeam, D. L. (1982). An examination of the overlap between ERS and Joint Committee standards. Paper presented at the annual meeting of the Evaluation Network, Baltimore, MD.
Wardrop, J. C. (1982). Measurement and accuracy standards. Paper presented at the meeting of the National Council on Measurement in Education, New York.