Studies in Educational Evaluation, Vol. 20, pp. 147-166, 1994
© 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
0191-491X/94 $24.00

0191-491X(94)E0009-5

EVALUATION AS A DISCIPLINE

Michael Scriven

Western Michigan University, Kalamazoo, MI, U.S.A.1

Introduction

This essay provides some background, explanation, and justification of a theoretical framework for all types of evaluation. This framework is useful for connecting and underpinning the approaches of the other contributors in this special issue. Some of the discussion provided to support the framework argues for benefits from using the perspective generated by the framework for improving best practice in existing applied fields within evaluation--for example, by looking for similarities between personnel evaluation and institutional evaluation that may point the way towards solving a problem or avoiding a mistake. But there is more to the framework than a perspective: there is also the logic of evaluation itself--the 'core discipline' of evaluation--the fine structure of the theory that ties the assortment of applied evaluation fields into a discipline.

The core subject pays off for practice in the applied fields not only by substantiating the proposed overall perspective, but also by solving hitherto unsolved problems in the fields, and by breaking new ground which uncovers fertile new areas for further work. The core subject is a kind of central management system that knits the fields together with an overview and a map of connections (some newly discovered and some long recognized), develops concepts and language to deal with shared problems, addresses threats to validity, converts solutions from one field for use by others, and moves the frontiers of foundations research forward. The emergence of that core subject is essential to the creation of a discipline because without it there is no entity, no entirety of which the parts are part, no self-concept, no identity. The emergence of candidates for a core discipline, and hence of the whole of evaluation as a discipline, is a very recent affair. The treatment here begins with a review of the background leading to this event, and then goes on to review the stages of early development that followed.
There are two ways to approach a historical review like the brief one here. One might try to describe it in pluralistic, multi-perspectival terms, so that the reader can see how the issues presented themselves to--and were resolved or abandoned by--various parties to the process. That makes for better, fairer, history, but it also requires a longer perspective than we can have at this point--and it calls for an author less committed to one of the contrasting views of the nature of evaluation. The treatment here is partisan, and its main intent is to convey the way that the historical and ideological issues now appear from



the point of view of an advocate--someone convinced that there is a true discipline of evaluation, and that it has a certain shape, although it is still in its infancy. From that point of view, the most interesting issue about the history of evaluation is why it took so long to emerge.

To understand that delay, we must understand that the birth of a discipline of evaluation was disadvantaged from well before the time of conception. It was an overdue child of allegedly incompatible parents whose families opposed the marriage. And its difficult birth began a confused childhood exhibiting multiple personality disorder. The discipline of evaluation was fathered by evaluation practice and born of science. The father was of humble birth but ancient lineage. The mother's family was nouveau riche but regarded itself as aristocratic: the mother herself had little interest in meeting the father and denied she could bear such a child. The father had for much longer acted as if such a union was of little interest to him. But in the end, with considerable help from marriage brokers and midwives, and notwithstanding the refusal of most of the mother's family to accept the union, the match was made and the child was born. Its difficult childhood, which we will shortly review, resulted from the mother's relapse to the values of her prenuptial caste and the father's lack of interest in the welfare of the child, whose problem then became that of growing up with a coherent self-concept, and developing enough autonomy to justify self-respect.

What this meant in practice was that most of the early evaluation theories turned out to be attempts to replace evaluation by a more acceptable substitute, some persona already made respectable by science. And most of these theories, including some not subject to the first type of subversion, had severe constraints imposed on them from the beginning, constraints as to the turf on which they were to be allowed to play.
These constraints served to keep them as far away from the matrilineal turf as possible.

The Nature of Evaluation

What we are talking about here is the general discipline of evaluation. It has many specific and autonomously developed applied areas, of which it's convenient to refer to half a dozen important ones as 'the Big Six'. These are program evaluation, personnel evaluation, performance evaluation, product evaluation, proposal evaluation, and policy evaluation. There are two other applied areas of the greatest importance. One is discussed only implicitly in this paper: it is meta-evaluation, the evaluation of evaluations.2 The other is discipline-specific evaluation, the kind of evaluation that goes on inside a discipline, sometimes with and often without any assistance from trained evaluators, but always requiring substantial knowledge of the discipline. It obviously includes the evaluation of hypotheses, instruments, experimental designs, methods, etc. within a discipline, but here are some more examples, listed in roughly increasing order of the amount of outside help they should employ--although they frequently employ less: the evaluation of (i) a new theory in surface physics (topic-specific); (ii) a review of recent progress and promising directions in chaos theory (discipline-specific meta-analysis); (iii) a new program in emergency health care or instruction (application-specific, i.e., it's specific to program evaluation); (iv) proposals for research support in short-term psychotherapy; (v) several candidates for a job or for promotion within the department of mathematics; and (vi) literary criticism, a discipline which is by definition a branch of applied evaluation, but one with severely limited objectivity and, as usual, in serious need of external review.



There are many other fields of evaluation besides these eight, including: curriculum evaluation; technology assessment; medical ethics; industrial quality control; appellate court jurisprudence (the legal assessment of legal opinions); and some from our avocational interests, such as wine tasting, art criticism, and movie and restaurant reviewing.

Since the philosophy of science deals, amongst other things, with the question of the nature and logic of scientific propositions, one of the issues it has to evaluate is whether evaluative claims have a legitimate place in science. The most powerful view about this issue, throughout the twentieth century, has been the doctrine of value-free science, the denial that evaluative claims have any legitimate place in science. This position of course entailed that there could not be a science of evaluation. Similar arguments were raised in the humanities, leading to the general conclusion that there could be no legitimate discipline of evaluation, whether considered as a science or under some other heading. We will focus here on the issue with respect to science, since that is the hardest nut to crack.

Many scientists have an interest in the philosophy of science as well as their own science--as indeed they should--and they often make claims about the value-free doctrine. They commonly make the mistake of thinking that their familiarity with scientific claims means they are in an expert's position with respect to claims concerning the nature of scientific propositions. In fact, as is suggested by the radical disagreement between scientists about such claims, they are in possession of at most half of the requisite expertise, the other half being an understanding of the concepts involved in epistemological and logical classification schemes.
Their relatively amateur status in this area,3 combined with their anxiety about the contamination of science by the intrusion of matters which many of them saw as essentially undecidable--i.e., value judgments--and hence essentially improper for inclusion in science, led most of them to embrace and continue to support the doctrine of value-free science. Once that was in place, and widely supported by the power elite in science, the stage was set for the suppression of any nascent efforts at developing a general discipline of evaluation. No one wanted to be associated with a politically incorrect movement.

Philosophers of science, who should have known better, were for too long influenced by the distinguished group of neo-positivists in their own discipline, descendants of the group of logical positivists--scientists and philosophers--who first established the value-free doctrine. Eventually some of them came to abandon the value-free doctrine, but just as they became willing to consider this possibility, they were hypnotized by the constructionist/constructivist revolution. So they jumped ship, but into equally bad company.4

Constructivism is a currently popular derivative of philosophical scepticism, relativism, or phenomenology (depending on which version of it one considers). It offered another kind of reason from the ones considered here for thinking that science was not value-free. Its reasons were centrally flawed--in particular, they were self-refuting5--but its extensive acceptance has led to the present unusual situation in which there is widespread agreement that the value-free doctrine is false, based on completely invalid reasons for supposing it to be false. Since the constructionist's reasons lead to the abandonment of the notion of factuality or objectivity even for descriptive science, their rejection of the value-free doctrine comes at the price of a simultaneous abandonment of most of what science stands for.
It was in a sense incidental, although important for our topic here, that constructionism renders impossible the construction of any discipline of



evaluation worthy of the name. Ironically, then, the most widely-accepted revolt against the doctrine of value-free science in fact generated another argument which made a discipline of evaluation impossible.

The stance here is that a discipline of evaluation is entirely possible and strictly analogous to the disciplines of statistics, measurement, and logic. That is, evaluation is a tool discipline, one whose main aim is to develop tools for other disciplines to use, rather than one whose primary task is the investigation of certain areas or aspects or constituents of the world. Such disciplines are here called 'transdisciplines' for two reasons. The first is that they serve many other disciplines--and not just academic ones. Much of the work that falls under the purview of a transdiscipline is discipline-specific (but not topic-specific), e.g., biostatistics, statistical mechanics. The second reason for calling them transdisciplines refers to the "discipline" part of the term. Each of them has a core component--an academic core--which is concerned with the more general issues of their organizing theories or classifications, their methodology, nature, concepts, boundaries, relationships, and logic. In conventional terms, this is often referred to as the pure subject, by contrast with the applied subject. Thus there are pure subjects of logic, of measurement, and of statistics. The field of evaluation, alone amongst the transdisciplines, has always had the applied areas--because practical problems demanded it--but never a core area. Without that, a field cannot be a discipline, for it cannot have a self-concept, a definition, integrating concepts, plausible accounts of its limits and basic assumptions, etc.

So the birth of the discipline of evaluation was delayed by these squabbles amongst the families of the potential parents. Meanwhile, the applied disciplines suffered severely, both from unnecessary limitations and from the use of invalid procedures.
The Uses and Abuses of the Concept of Evaluation

In saying that a general discipline of evaluation has only very recently emerged, it should not be supposed that there have been no publications which appear to deal with such a discipline. There are, for example, many books with the unqualified term "evaluation" in the title. These would, one might suppose, refer to the general discipline. Pathetically, however, for six decades they were simply books about student assessment. That is, they referred to only one part of one applied area in evaluation in one academic field (performance evaluation in education). More recently, the occurrence of the unqualified term in the title turns out to be simply referring to program evaluation. In other cases, a title that referred to "educational evaluation" might lead one to think that the additional term would entail some inclusion--or at least some mention--of the evaluation of teaching, administrators, teachers, curriculum, equipment, schools, etc. But while it used to simply mean 'tests and measurement', more recently, it just means 'program evaluation in education'.

What explains this phenomenon of exaggeration of coverage? It can be seen as a case of academic nature abhorring a vacuum. In the absence of any truly general discipline of evaluation, each applied field can think of itself as covering the general subject. And in a microcosmic way, they do; that is, books on program evaluation often provide a model of 'evaluation', or at least some remarks about proper evaluation methodology, which is far more than a mere listing of techniques. But it's far less than a general model for evaluation, both in breadth and depth. Most of the vacuum was still there, and its existence was officially endorsed by the value-free doctrine. Low-level generalizations from the applied fields were no great threat to its legitimacy, although



if you added them all up, the situation was somewhat bizarre--six healthy bastards all said to have no parents. What was forbidden, as a logical or scientific impropriety--arguments were given for both claims--was a general account. Nevertheless, it is a bizarre situation when the whole of science and the teaching of science involves--and cannot continue without--evaluation, yet the high priests of science still maintain its impropriety.

This was a typical example of the way in which a paradigm can paralyze perception. One of the classic cases comes from particle physics. Given that electrons have a negative charge, experimenters supposed that a track of a lightweight particle which curves the wrong way in a cloud chamber photograph "must have been" due to someone getting careless with reversing the photographic plate. As we now know, many such photographs were disregarded--instead of checked, which was easy enough--before someone challenged the fundamental precept by suggesting that the positron was a real possibility.

In the present case, scientists believed--indeed, most of them wanted to believe--the power elite's quasi-religious dogma of 'value-free (social) science'. It followed that there could not be a science of evaluation, indeed any discipline of evaluation. Of course, everyone knew there was a practice of evaluation, since every one of them as a student had received grades on their school work and virtually every one of them had given grades to students--presumably well-justified, factually based grades. People working in testing or program evaluation realized there were plenty of tricks of the trade, enough to justify a text on the subject--but it never occurred to them to see that subject as part of a general discipline, or to use less than the general term to describe their own work, although their common sense was perfectly well aware that there were half a dozen other applied fields of evaluation.
Despite the prima facie absurdity of thinking that many fields could be engaged in easily justified practices which obviously shared many common concepts--ranking, grading, bias, evaluative validity, etc.--if in fact value judgments were completely unscientific, the paradigm persisted. It prevented scientists from trying to generalize their evaluative results to other parts of their own domain, let alone considering the possibility of a common logic, methodology, and theory that transcended domains. In fact the paradigm prevented them from trying to study the other fields to see if there were some practices there from which they could learn. As a result the wheel was reinvented many times, or, worse, not reinvented. Instead, workers in each field made a point of decrying any suggestions of similarity. People in personnel evaluation often rejected the idea that they could learn from the quite sophisticated and much older field of product evaluation, often with some great insight like: "We can hardly learn from product evaluation, since people aren't products." One might as well say that cognitive psychology can't learn from computer science since people aren't computers. The difference in subject matter is undeniable but irrelevant to the existence of useful analogies and some common logic and methodology. Had the thought of a general discipline occurred to these writers, they would of course have made some mention of it in the introduction to their books, or used a less misleading title. But such a thought was not acceptable and such mentions never occurred. That doesn't mean they thought it but didn't say it; it means they didn't think it. Their perceptions and thinking were controlled by the paradigm.

We've talked about what it takes to constitute a discipline. Now, what is this subject of evaluation that we are talking of making into a discipline? The term "evaluation" is not used here in any technical sense: we follow common sense and the dictionary.



Evaluation is simply the process of determining the merit or worth of entities, and evaluations are the product of that process. Evaluation is an essential ingredient in every practical activity--where it is used to distinguish between the best or better things to make, get, or do, and less good alternatives--and in every discipline, where it distinguishes between good practice and bad, good investigatory designs and less good ones, good interpretations or theories and weaker ones, and so on. It can be done arbitrarily, as by most wine 'experts' and art critics, or it can be done conscientiously, objectively, and accurately, as (sometimes) by trained graders of English essays in the state-wide testing programs. If done arbitrarily in fields where it can be done seriously, then the field suffers, and the work of all those in the field suffers. For if we cannot distinguish between good and bad practice, we can never improve practice. We would never have moved out of the Stone Age, or even within the Stone Age from Paleolithic to Neolithic.

The Emergence of a Discipline From Considered Practice

Evaluation is an ancient practice, indeed as ancient as any practice, since it is an integral part of every practice. The flint chippers and the bone carvers left mute testimony of their increasing evaluative sophistication by consigning to the middens many points and fish hooks that their own ancestors would have accepted, and by steadily increasing the functionality of their designs. Craft workers developed increasing skill and became increasingly sophisticated in their communicable knowledge about procedures that led to improvement, and about indicators that predicted poor performance. Much of this went on before there was anything correctly described as a language.6 There is no reason why evaluation cannot proceed without language, since gestures or actions can clearly indicate dissatisfaction, which--in certain contexts--represents an evaluative judgment.
However, language is the great lever of progress. Once we have a language for talking about what we make or do, we can then more quickly focus on the aspects which we approve or disapprove, and avoid misguided efforts. Now there are some contexts where the evaluator's disapproval is merely aesthetic, a matter of personal preference, important to others only as a reflection of power or tastes. In others, coming from the shaman or priest, it's intended as an indication of divine disapproval. But in yet others, it is the judgment of an expert whose expertise is demonstrable, and that kind of evaluation--and acceptance of it--is a survival characteristic. Better fish hooks catch more fish. Using the master hook maker as an evaluator leads to making better fish hooks. Or so it appears, with good reason--but with some traps as well.

That situation is no different today. We still call on disciplinary professionals to review programs and proposals in the same field, we still get valuable commentary from them--and there are still some traps. Identifying the traps, and ways to avoid them or get out of them, is something we have not got very far with, because doing that requires a discipline of evaluation. An important point to realize is that using experts in a field to evaluate beginners can and often does involve several major sources of error. But we must recognize that doing some discipline-specific evaluations--the topic-specific ones--is simply part of competence in doing the usual thoughtful disciplinary work of research or teaching. As we move on through the sequence of intradisciplinary evaluation examples given earlier, to disciplinary overviews, to proposal evaluation, and to the evaluation of programs which teach or apply the discipline, we move further away from the skills of the outstanding researchers within a discipline. The qualities required for



the latter evaluation tasks are, interestingly enough, quite often lacking in the best researchers. Here are just a few: open-mindedness with respect to several competing viewpoints; an ability to conceptualize the functions of the program under evaluation (which is often education or service rather than research); extensive comparative experience with similar programs; a strong historical perspective; the ability to do or at least critique needs assessments; good empathy or role-playing skills (to see the program or proposal from the point of view of its author or manager as well as its customers/clients); skill in applying codes of ethical conduct; an understanding of the main traps that untrained program or proposal evaluators fall into (such as superficial or zero cost analysis--especially opportunity cost analysis); and broad experience in other disciplines of activities at the same level of generality.

With this list we are beginning the task of listing evaluation competencies, something which is only possible if we have a concept of good evaluation by task. While professionals in general have a good sense of how to do topic-specific intradisciplinary evaluation, and some of them have or acquire good talents for application-specific evaluation tasks (e.g., performance evaluation), one needs more than an intuited set of standards if one is to improve the latter process further. It's all too easy to give many examples of it being done impressionistically at the moment. No doubt there are places within the literature of a number of professions where an attempt has been made to list the qualities that should be sought in identifying evaluators for programs or proposals or personnel within the profession. But this approach is reinventing the wheel, and not the way to get the best result after 50 years of effort.
A discipline of evaluation is the central agency where such attempts can be assembled, compared, strained for dross, squeezed for common elements, and conceptualized. The disciplines are not well served by using evaluators of their own programs and proposals who are picked for their prestige amongst the few with that quality who can take time off.

The Need for a Discipline of Evaluation

Related cases of great importance concern the evaluation of research achievements in the course of personnel selection or promotion at a university, or when refereeing for a journal, or when making a selection amongst proposals for support by a fund or foundation. Since these selections are what shape the whole future of the discipline, their importance is clear enough; and since the criteria used vary between journals and departments to an extent which lacks any justification in terms of the differences between the journals and departments, we are clearly dealing with idiosyncrasy, most of it quite damaging to the validity of the process. Much, although not all, of that idiosyncrasy is removable by the use of available procedures. But the situation is far worse; the underlying logic of the process is usually flawed, and guarantees that even with perfectly uniform standards, the results will be incorrect. We address this problem briefly below.

Overlapping with these examples of performance evaluation is the neighboring area of personnel evaluation. Much of it involves the integration of a number of performance evaluations, but it is not reducible to that--e.g., because there is a need in personnel evaluation, not present in performance evaluation, to predict future performance. In one task faced by the personnel evaluator, the selection task, careless thinking about evaluation has led to some major disasters. Few people outside industrial/organizational psychology realize the extent of the research that has been done on the interview process. Simply



reading a good collection of that research--beginning with the research demonstrating the near-zero validity of the usual approaches--changes the whole way one looks at the interview, changes the way one does it, the way one manages it, and the value of the results from it.7 Here we have an example of an applied field of evaluation moving from considered practice to major pay-off territory.

But there are deeper flaws in the selection part of personnel evaluation, still hardly broached. One of these is the common use of indicators that are merely empirically correlated with later job performance, rather than simulations of it. This is the foundation of much of the use of professionally constructed testing of applicants.8 Now we know that if the indicator were skin color, then, even if it had been empirically established as correlated with later good (or bad) performance, we couldn't use it. But that ban is generally regarded as a political or perhaps an ethical override on the most-effective procedures. In fact, the ban is based on sound scientific principles. The underlying facts are that the indicators are very weak predictors, that we can normally do better by using or adding past performance on related tasks (which can always be obtained, absent emergencies), and that once we use that data, the indicator is invalidated (because it only applies to random samples from the population with that skin color, and someone with known relevant past performance is not a random sample). If it is provably impossible to obtain 'related performance' data, or to use a simulation (including a trial period), the indicators can be justified, e.g., the use of the Army Alpha test when there was no other way to process the number of recruits. Otherwise, the use of indicators is scientifically improper--as well as ethically improper, since they involve the 'guilt by association' and 'self-fulfilling prophecy' errors.
Given the attention that has been paid to test validity in recent years, it is interesting that this key fallacy has not received more attention.9 Once there is a core discipline of evaluation in place, the kind of foundational analysis we are doing here can be called on to reexamine many of these general procedures. These examples are just a few of a dozen that could be given to illustrate the weakness of assuming that one does not need a discipline of evaluation, which includes its applied fields and makes some of the needed improvements in them. Evaluation training has a zero or near-zero contribution to make to the physicist evaluating topic-specific hypotheses,10 but it can transform the evaluation of personnel, proposals, and training programs in physics from laughable to highly valuable. Of course, as we move towards a hybrid area like the teaching of science, so another expertise--in this case, research on teaching, including computer-assisted instruction--may also earn its place beside the subject matter specialist and the evaluator.

A core discipline of evaluation starts by developing a language in terms of which we can describe types of evaluation--and parts or aspects of them--and begin to study, classify, analyze, and generalize about them. That much enables us to focus our evaluative commentary, and is the first step towards the discipline. The next step consists in developing a theory about the nature and limits of these different aspects of evaluation. This includes, for example, a theory about the relation of grading to ranking, scoring, and apportioning. Another topic it must deal with is the differences between, and the extent of proper use of, criteria vs. indicators (an example mentioned above), standards vs. dimensions, effectiveness vs. efficiency, objectivity vs. bias, formative vs. summative, etc. Here we begin to see something that transcends topic- and area-specific evaluation, and indeed transcends fields within evaluation.
At this step we clearly divorce determining merit from approving: we can, for example, determine that a hand-gun is accurate and



well-made without approving its use, manufacture, or possession. That divorce is notable, since for the first half of this century the most widely accepted theory about evaluative claims was that they simply expressed approval or disapproval, and had no propositional content.

Most applied fields in evaluation, and many other applied fields that are not part of evaluation (e.g., survey research), have made some steps towards clarifying evaluation predicates, but very few seem to have done it well. Pick up an opinion questionnaire (or an example of one from a text), or pick up a personnel rating scale, and you are likely to find that the anchors are a clumsy mix of norm-referenced (ranking) descriptors and criterion-referenced (grading or rating) descriptors,11 a botch-up that violates the simplest principles of the logic of evaluation. The treatment of Likert scales by experts is equally flawed: it is a definitional requirement on such a scale that there be no right answer to any question, yet clearly the correct answer by someone from the ghetto today to "Getting the job you want is mostly a matter of luck" is Disagree Very Much (Spector, 1992). It's right up there with Disagree Very Much with "2+2=5". Thus we can see that the fundamental premises of most thinking about attitudes and valuing are shaky.12 Essentially nobody teaches and few write about these matters, although the elements of scaling with descriptive (measurement) predicates are probably known to everyone who graduates with a major in the social sciences. We need better treatment of the basics of evaluation, and setting them out is one of the tasks that falls to a foundational or core discipline of evaluation.13 Sorting out these considerations--about discipline-specific evaluation and scaling--is an early step in the development of a discipline, but the pay-off from getting them in place is considerable because of their widespread use.
To take an example from one of the applied areas that is most in need of development, some of the most distinguished scientists in the country think that the procedure used by the National Science Foundation to allocate research funds is seriously flawed in that it undervalues originality. They say that it excessively rewards applicants who are working within a paradigm by comparison with those striving for a new one. Is this concern justified? It's quite easy to find out, and undertaking the study would be the scientifically appropriate response. But that would be to treat evaluation as if it were (i) legitimate, and/or (ii) something over and above topic-specific evaluation, which would undermine the axiom that scientists from discipline X are the only ones with the expertise to do discipline-specific evaluation in X. This is the worst kind of turf-protection at the expense of the taxpayer and of science, and it occurs because the above simple distinctions between topic-specific and application-specific evaluation have not passed into the first-year graduate courses and texts. We can conclude with one last example that comes from a slightly more difficult level in the development of a general discipline of evaluation. It is not subject-matter specific, and is widely used in everyday life as well as in virtually all kinds of evaluation within the disciplines and the professions. It concerns the synthesis step that must be made in complex evaluations, in order to bring together the sub-evaluations that have been done, which yield ratings on each of the dimensions or criteria of merit. The usual way to do this, other than impressionistically (as is too often the case in selection evaluation in the personnel field), is to give a numerical weight to the importance of each criterion (e.g., on a scale of 1-5), convert the performances on each of these into a standardized numerical score (e.g., on a scale of 1-10 or 1-100), multiply the two together, and add up each



candidate's total score in order to find the winner. Many of us have used something like this for selecting a home, a job, a graduate school, etc., as well as in evaluating a program. This approach is fundamentally invalid, and will give completely incorrect results in many cases. Nor can it be adjusted; nor can one say in advance when it will work or not work, i.e., on what type of case it will work reasonably well. There are some such cases. For example, the GPA (grade point average) works reasonably well for some purposes; for others, it fails. In fact, no algorithm will work, because the fundamental flaw is an assumption about comparability of utility distributions across three logically independent scales, an assumption which is essentially always false. There is a valid process (a set of heuristics) for synthesizing sub-scores, referred to as the Qualitative Weight and Sum approach, by contrast with the invalid algorithm proposed by the Numerical Weight and Sum approach (Evaluation Thesaurus 4, 1991). Now, it is remarkable that we should not have discovered this flaw in such a widespread practice until this late date. The explanation of course is that it was nobody's business to look at it, since it is obviously an evaluation process and there was no legitimate discipline of evaluation. As an ironic note, the invalidity applies to all the standard procedures for evaluating proposals, the system by which essentially all research funds are allocated in this country; and yet none of the researchers on the peer review panels, which do most of these evaluations, ever raised their eyes from the task long enough to notice the crude errors in the process.14 So much for the paradigm of the scientific method when it runs up against the (virtually self-refuting) paradigm of value-free science.
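The numerical weight-and-sum algorithm, and one way its invalidity can surface, can be sketched in a few lines. All of the criteria, weights, scores, and the minimum "bar" on safety below are hypothetical illustrations, not the author's own data; the bar check is offered only in the spirit of the Qualitative Weight and Sum heuristics, not as a full statement of them.

```python
# Hypothetical sketch: Numerical Weight and Sum (weight each criterion,
# score each candidate, multiply and add) can crown a candidate that
# fails a mandatory minimum ("bar") on one criterion of merit.

weights = {"safety": 5, "performance": 3, "price": 2}  # importance, 1-5
bars = {"safety": 5}                                   # minimum acceptable score

candidates = {
    "X": {"safety": 3, "performance": 10, "price": 10},  # fails the safety bar
    "Y": {"safety": 7, "performance": 6, "price": 5},
}

def weight_and_sum(scores):
    """Total = sum over criteria of weight x score (scores on 1-10)."""
    return sum(weights[c] * scores[c] for c in weights)

def passes_bars(scores):
    """Qualitative check: every barred criterion meets its minimum."""
    return all(scores[c] >= bars[c] for c in bars)

totals = {name: weight_and_sum(s) for name, s in candidates.items()}
numeric_winner = max(totals, key=totals.get)  # "X" wins, 65 to 63

eligible = {n: t for n, t in totals.items() if passes_bars(candidates[n])}
checked_winner = max(eligible, key=eligible.get)  # only "Y" clears the bar
```

The point, following the text, is that no tuning of the numerical weights repairs this: the comparability assumption behind the summation is the problem, which is why the valid alternative is a set of heuristics rather than another algorithm.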
It can perhaps be seen from this brief summary that the list of errors in evaluation as it is currently done, even in the autonomously originated practical fields, but especially within the disciplines, and even more with respect to transdisciplinary practice, involves serious and costly mistakes. We need something better than the vacuum where there should be a core theory of evaluation, something corresponding to the core theories in statistics and measurement. In the search for that elusive entity, we now turn to a brief review of what came out of the woodwork in the late 1960s and thereafter to provide us with various theories which were referred to as theories about the nature of evaluation. Perhaps one of them, if necessary somewhat modified, can provide us with the missing core theory.

Early Models of Evaluation15

It was only because these views were filling a perceived vacuum that they were generally put forward as theories of evaluation. In fact, they were only theories of program evaluation. Indeed, they had an even narrower purview. For "program evaluation" has become a label for only part of what is actually required to do program evaluation, just as "needs assessment" has in some quarters become a name for a formalized approach that covers only part of what is required in order to determine needs. In the real world, program evaluation always involves some personnel evaluation, should nearly always involve some evaluation of management systems and some ethical evaluation, and should usually involve some product evaluation. It will also often benefit from some consideration of proposal evaluation and the evaluation of evaluations. But we'll leave out all these refinements in this brief overview, and focus on what is conventionally called program evaluation.



The following simplified classification16 begins by identifying six views or approaches that are alternatives to and predecessors of the one advocated here, the transdisciplinary view. They are listed below in the order of their emergence into a position of power in the field of program evaluation since the mid-sixties, when the explosive phase in that field began. In addition to those discussed here there is a range of exotica--fascinating and illuminating models ranging from the jurisprudential model to the connoisseurship model--which we pass over for reasons of space.

A. The 'strong decision support' view was an explication of the use of program evaluation as part of the process of rational program management. This process, implicit in management practice for millennia, has two versions. The strong version described in this paragraph conceived of evaluators as doing investigations aimed to arrive at evaluative conclusions designed to assist the decision-maker. Supporters of this approach pay considerable attention to whether programs reach their goals, but go beyond that into questions about whether the goals match the needs they are supposedly addressing, thereby differentiating themselves from the much narrower relativistic approach listed here as approach C. Position A was exemplified in, but not made explicit by, the work of Ralph Tyler,17 and extensively elaborated in the CIPP model of evaluation (Context, Input, Process, and Product) (Stufflebeam et al., 1971). The CIPP model goes beyond the rhetoric of decision support into spelling out a useful systematic approach covering most of what is involved in program evaluation, and uses this to infer evaluative conclusions. Dan Stufflebeam, who co-authored the CIPP model, has continued to play a leading role in evaluation, still representing--and further developing--this perspective. By contrast, Egon Guba, one of his co-authors in the early CIPP work, has now gone in a quite different direction--see F below.
This approach, although this particular conclusion was more implicit than explicit, clearly rejected the ban on evaluation as a systematic and scientific process. It was not long, however, before recidivism set in, as we see in the next four accounts.

B. The 'weak decision support' view. The preceding approach has often been described as the 'decision support' approach, but there is another approach which also claims that title. It holds that decision support provides decision-relevant data but stops short of drawing evaluative conclusions or critiquing program goals. This point of view is represented by evaluation theorists such as Marv Alkin, who define evaluation as factual data gathering in the service of a decision-maker who is to draw all evaluative conclusions.18 This position is obviously popular amongst those who think that true science cannot or should not make value judgments, and it is just the first of several that found a way to do what they called program evaluation while managing to avoid actually drawing evaluative conclusions. The next position is somewhat more like evaluation as we normally think of it, although it still manages to avoid drawing evaluative conclusions. This is:

C. The 'relativistic' view. This was the view that evaluation should be done by using the client's values as a framework, without any judgment by the evaluator about those values or any reference to other values. The most widely used text in evaluation is written by two social scientists and essentially represents this approach (Rossi & Freeman, 1989). B and C were the vehicles that allowed social scientists to join the program evaluation bandwagon.19 The simplest form of this approach was developed into the 'discrepancy model' of program evaluation by Malcolm Provus (the discrepancies being divergences from the projected task sequence and timeline for the project). Program



monitoring as it is often done comes very close to the discrepancy model. This is a long way from true program evaluation, for reasons summarized below. It is best thought of as a kind of simulation of an evaluation: as in a simulation of a political crisis, the person staging the simulation is not, in that role, drawing any evaluative conclusions. Of course, it's a little more quaint for someone who is not drawing any evaluative conclusions to refer to themselves as an evaluator.

D. The 'rich description' approach. This is the view that evaluation can be done as a kind of ethnographic or journalistic enterprise, in which the evaluators report what they see without trying to make evaluative statements or infer to evaluative conclusions--not even in terms of the client's values (as the relativist can). This view has been very widely supported--by Bob Stake, the North Dakota School, many of the UK theorists, and others. It's a kind of naturalistic version of B; it usually has a flavor of relativism about it, reminiscent of C--in that it eschews any evaluative position; and it sometimes looks like a precursor of the constructivist approach described under F below, in that it focuses on the observable rather than the inferrable. More recently, it has been referred to as the 'thick description' approach--perhaps because "rich" sounds evaluative?

E. The 'social process' school. This was crystallized about 12 years ago, approximately half way to the present moment in the history of the emerging discipline, by a group of Stanford academics led by Lee Cronbach, referred to here as C&C (for Cronbach and Colleagues; Cronbach et al., 1980). It is notable for its denial of the importance of summative evaluation, i.e., evaluation (i) as providing support for external decisions about programs, or (ii) to ensure accountability.
The substitute they proposed for evaluating programs in anything like the ordinary sense was understanding social programs,20 flavored with a dash of helping them to improve. Their position was encapsulated in a set of 95 theses. This paper may perhaps represent an implementation of the 87th in their list, which states: "There is need for exchanges [about evaluation] more energetic than the typical academic discussion and more responsible than debate among partisans"--if indeed there is any such middle ground. Ernie House, a highly independent thinker about evaluation as well as an experienced evaluator, also stressed the importance of the social ambience, but was quite distinctive in his stress on the ethical and argumentation dimensions of evaluation. In fact his stress on the ethical dimension was partly intended as a counterpoint to the absence of this concern in C&C (House, 1989).

F. The 'constructivist' or 'fourth generation' approach, representing the most recent riders on the front of the wave, notably Egon Guba and Yvonna Lincoln (1989), but with many other supporters, including a strong following in the USA and amongst UK evaluators. This point of view rejects evaluation as a search for quality, merit, worth, etc., in favor of the idea that it--and all truth, such as it is in their terms--is the result of construction by individuals and negotiation by groups. This means that scientific knowledge of all kinds is suspect, entirely challengeable, in no way objective. So, too, is all analytic work such as philosophical analysis, including their own position. Out goes the baby with the bathwater. Guba has always been aware of the potential for self-contradiction in this position; in fact, there is no way around its suicidal bent.



Comments

Now, the commonsensical view of program evaluation is probably the view that it consists in "working out whether the program is any good". It's the counterpart, people might say, of the sort of thing doctors, road-testers, engineers, and personnel interviewers do, but with the subject matter being programs instead of patients, cars, structures, or applicants. The results of this kind of investigation are of course direct evaluative conclusions--"The patient/program has improved slightly under the new therapeutic/managerial regime", etc. Of the views listed above, the strong decision support view, of which CIPP is the best known elaboration, comes closest to this. The CIPP model was originally a little overgeneralized in that it claimed all (program) evaluation was oriented to decision support. It seems implausible to insist that a historian's evaluation of the "bread and circuses" programs of Roman emperors, or even of the WPA, is or should be designed to serve some contemporary decision maker rather than the professional interest of historians and others concerned with the truth about the past. One must surely recognize the 'research role' of evaluation, the search for truth about merit and worth, whose only payoffs are insights. Much of the decision support kind of evaluation, and all of the research type, exemplify what is sometimes called summative evaluation--evaluation of a whole program of the kind that is often essential for someone outside the program. One might also argue, contra the original version of CIPP, that formative evaluation--evaluation aimed at improving a program or performance, reported back to the program staff--deserves recognition as having a significantly different role than decision support, and its importance slightly weakens the claim that evaluation is for decision support.
(Of course, it supports decisions about how to improve the program, but that's not the kind of decision that decision support is normally supposed to support.) Over the years, however, CIPP has developed so that it accepts these points and is a fully-fledged account of program evaluation; and its senior author has gone on to lead research in the field of personnel evaluation. While CIPP remains an approach to program evaluation, it comes to conclusions about program evaluation that are very like those entailed by the transdisciplinary model. The differences are like those between two experienced navigators, each of them with their own distinctive way of doing things, but each finishing up--or else how would they live to be experienced?--with very similar conclusions. Of matters beyond program evaluation, and in particular, of the logic and core theory of evaluation, CIPP does not speak, and those are the matters on which the transdisciplinary view focuses above all others. The other entries in the list above--that is, almost all schools of thought in evaluation--can be seen as a series of attempts to avoid direct statements about the merit or worth of things. Position B avoids all evaluative conclusions; C avoids direct evaluative claims in favor of relativistic ones;21 D avoids them in favor of non-evaluative description; E avoids them in favor of insights about or understanding of social phenomena; and F rejects their legitimacy along with that of all other claims.22 This resistance to the commonsense view of program evaluation--even amongst those working in the field--has its philosophical roots in the value-free conception of the social sciences, discussed above, but it also gathered support from another argument, which appears at first sight to be well-based in common sense. This was the argument that



the decision whether a program is desirable or not should be made by policy-makers, not by evaluators. On this view it would be presumptuous for program evaluators to act as if it were their job to decide whether the program they were called in to evaluate should exist. That argument confuses evaluations--which evaluators should produce--with recommendations, which they are less often in a good position to produce (although they often do produce them), and which are frequently best left to executives close to the political realities of the decision ambience. That such a confusion exists is further evidence of the lack of clarity about fundamental concepts in the general evaluation vocabulary. Evaluators have all too often overstepped the boundaries of their expertise and walked on to the turf of the decision maker, who rightly objects. But it is not necessary to react to the extent of the weak decision support position and others that draw the line too early, cutting the evaluator off even from drawing evaluative conclusions. We must now address how the view supported in this paper, referred to as the 'transdisciplinary' view, compares with the above. The transdisciplinary view extends the commonsense view but is significantly different from A, and radically different from all the rest.

The Transdisciplinary Model

On this view, the discipline of evaluation has two components: the set of applied evaluation fields, and the core discipline, just as statistics and measurement have these two components. The applied fields are like other applied fields in their goals, namely to solve practical problems. This means finding out something about what they study, and what they study is the merit and worth of various entities--personnel, products, etc. The core discipline is aimed to find out something about the concepts, methodologies, models, tools, etc. used in the applied fields of evaluation, and in other fields which use evaluation.
This, as we have suggested, includes all other disciplines--craft and physical as well as academic. Hence the transdiscipline of evaluation is concerned with the analysis and improvement of a process that extends across the disciplines, giving rise to the term. Consider statistics more closely. There is a core discipline, studied in the department of mathematics or in its own academic department. This is connected to the applied fields of, for example, biostatistics, statistical mechanics, and demographics. The applied fields' main tasks are the study and description of certain quantitative aspects of the phenomena in those fields, and the study and development of field-specific quantitative tools for describing that data and solving problems on which it can be brought to bear. The more general results coming from the core discipline apply across all the disciplines that are using--or should be using--statistics, hence the term "transdiscipline"; but it also helps develop field-specific techniques, attending in particular to the soundness of their fundamental assumptions and hence the limits of their proper use. Both evaluation and statistics are of course widely used outside their recognized applied fields, i.e., the ones with "evaluation" or "statistics" in their title. That wider use is part of the subject matter of the core discipline in both cases. Statistics must consider the use of statistics wherever it is used, not just in areas that have that word in their title. Looking at other transdisciplines, logic has its own applied fields--the logic of the social sciences, etc.--and is of course widely used outside those named fields. So it is an extremely general transdiscipline. But evaluation is probably the most general--unlike



logic, it precedes language--and both are much more general than measurement or statistics. The transdisciplinary view of evaluation has four characteristics that distinguish it from B-F on the previous list: one epistemological, one political, one concerning disciplinary scope, and one methodological.

(I) It is an objectivist view of evaluation, like A. It argues for the idea that the evaluator is determining the merit or worth of, for example, programs, personnel or products; that these are real although logically complex properties of everyday things embedded in a complex relevant context; and that an acceptable degree of objectivity and comprehensiveness in the quest to determine these properties is possible, frequently attained, and a goal which can be more frequently attained if we study the transdiscipline. This contrasts with B-F for obvious reasons. (There is some contrast with the early form of A, in the shift of the primary role from decision-serving to truth-seeking.) Since an objectivist position implies that it is part of the evaluator's job to draw direct evaluative conclusions about whatever is being evaluated (e.g., programs), the position requires a head-on attack on the two grounds for avoiding such conclusions. So the transdisciplinary position: (i) explicitly states and defends a logic of inferring evaluative conclusions from factual and definitional premises; and (ii) spells out the fallacies in the arguments for the value-free doctrine.23

(II) Second, the approach here is a consumer-oriented view rather than a management-oriented (or mediator-oriented, or therapist-oriented) approach to program evaluation--and correspondingly to personnel and product evaluation, etc. This does not mean it is a consumer-advocacy approach in the sense that 'consumerism' sometimes represents--that is, an approach which only fights for one side in an ancient struggle.
It simply regards the consumer's welfare as the primary justification for having a program, and accords that welfare the same primacy in the evaluation. That means it rejects 'decision support'--which is support of management decisions--as the main function of evaluation (by contrast with B), although it aims to provide (management-)decision support as a byproduct. Instead, it regards the main function of an applied evaluation field to be the determination of the merit and worth of programs (etc.) in terms of how effectively and efficiently they are serving those they impact, particularly those receiving--or who should be receiving--the services the programs provide, and those who pay for the program--typically, taxpayers or their representatives. While it is perfectly appropriate for the welfare of program staff to also receive some weighting, schools--for example--do not exist primarily as employment centers for teachers, so staff welfare (within the constraints of justice) cannot be treated as of comparable importance to the educational welfare of the students. To the extent that managers take service to the consumer to be their primary goal--as they normally should if managing programs in the public or philanthropic sector--information about the merit or worth of programs will be valuable information for management decision making (the interest of the two views that stress decision support); and to the extent that the goals of a program reflect the actual needs of consumers, this information will approximate feedback about how well the program is meeting its goals (the relativist's concern). But neither of these conditions is treated as a presupposition of an evaluation; they must be investigated and are often violated.



The consumer orientation of this approach moves us one step beyond establishing the legitimacy of drawing evaluative conclusions--Point I above--in that it argues for the necessity of doing so--in most cases. That is, it categorizes any approach as incomplete (fragmentary, unconsummated) if it stops short of drawing evaluative conclusions. The practical demonstration of the feasibility and utility of going the extra step lies in every issue of Consumer Reports: the things being evaluated are ranked and graded in a systematic way, so one can see which are the best of the bunch (ranking) and whether the best are safe, a good buy, etc. (grading), the two crucial requirements for decision-making.

(III) Third, the approach here is a generalized view. It is not just a general view, it involves generalizing the concepts of evaluation across the whole range of human knowledge and practice. So, unlike any of the views A-F, it treats program evaluation as merely one of many applied areas within an overarching discipline of evaluation. (These applied areas may also be part of the subject matter of a primary discipline: personnel evaluation, for example, is part of (industrial/organizational) psychology; biostatistics is, in a sense, part of biology.) This perspective leads to substantial changes in the range of considerations to which, e.g., program evaluation must pay attention (for instance, it must look at other applied evaluation areas for parallels, and to a core discipline for theoretical analyses), but helps with the added labors by greatly enhancing the methodological repertoire of program evaluation. Spelling out the directions of generalization in a little more detail, the transdisciplinary view stresses:

(a) the large range of distinctive applied evaluation fields. The leading entries are the Big Six plus meta-evaluation (the evaluation of evaluations). There are at least a dozen more major entries, ranging from technology assessment to ethical analysis.
(b) the large range of evaluative processes in fields other than applied evaluation fields, including all the disciplines (the intradisciplinary evaluation process--the evaluation of methodologies, data, instruments, research, theories, etc.) and the practical and performing arts (the evaluation of craft skills, compositions, competitors, regimens, instructions, etc.).

(c) the large range of types of evaluative investigation, from practical levels of evaluation (e.g., judging the utility of products or the quality of high dives in the Olympic aquatic competition) through program evaluation in the field to conceptual analysis (e.g., the evaluation of conceptual and theoretical solutions to problems in the core discipline of evaluation).

(d) the overlap between the applied fields, something that is rarely recognized. For example, methods from one field often solve problems in other fields, yet 'program evaluation' as usually conceived does not include any reference to personnel evaluation, proposal evaluation, or ethical evaluation, each of which must be taken into account in a good proportion of program evaluations.

(IV) The transdisciplinary view is a technical view. This has to be stated rather carefully, because we need to distinguish between the fact that many evaluations, for example large program evaluations, require considerable technical knowledge of methodologies from other disciplines; and the fact that is being stressed here, that evaluation itself, over and



above these 'auxiliary' methodologies, has its own technical methodology. That methodology needs to be understood by anyone doing non-trivial evaluation in any field at all. It involves matters such as the logic of synthesis and the differences between evaluation functions like grading, scoring, ranking, and apportioning. Not all evaluators need to know anything about social science methodologies such as survey techniques; all must understand the core logic or risk serious errors. It has been common for those working in and teaching others program evaluation to stress the need for skills in instrument design, cost analysis, etc. But they have commonly supposed that such matters exhausted the range of technical skills required of the evaluator. On the contrary, they are the less important of two groups of those skills. Stressing this does not minimize the fact that across the whole range of evaluation fields, an immense number of 'auxiliary' methodologies are needed, far more than with any of the other transdisciplines. There are more than a dozen auxiliary methodologies involved in even the one applied field of program evaluation, more than half of them not covered in any normal doctoral training program in any single discipline such as sociology, psychology, law, or accounting.

Conclusion

Program evaluation treated in isolation can be seen in the ways all six positions advocate. But program evaluation treated as just one more application of the logical core which leads us to solid evaluative results in product evaluation, performance assessment, and half a dozen other applied fields of evaluation can hardly be seen as consistent with the flight from direct evaluative conclusions that five of those positions embody. While there are special features of program evaluation which often make it less straightforward than the simpler kind of product evaluation, the reverse is often the case.
The view that it is different from all product evaluation is only popular amongst those who know little about product evaluation. For example, the idea that program evaluation is of its nature much more political than product evaluation is common but wrong; the history of the interstate highway system and the superconducting supercollider are counter-examples, and it was after all 'only' a product evaluation--one commissioned by Congress and done flawlessly--that led to the dismissal of the Director of the National Bureau of Standards. One must conclude that the non-evaluative models of program evaluation discussed here (B-F) are completely implausible as models for all kinds of evaluation. And it is extremely implausible to suppose that program evaluation is essentially different from every other kind of evaluation. The transdisciplinary view, on the other hand, applies equally to all kinds of evaluation, and that consistency must surely be seen as one of its appeals. For the various evaluative purposes addressed by the authors of the other papers in this issue, it may also be of some value to see what they are doing as part of a single, larger enterprise, and hence as parallel to what workers in other applied fields of evaluation are doing. In that perception there is a prospect of many valuable results which should serve to revitalize several areas and sub-areas of evaluation. And the second edge of the transdisciplinary sword cannot be ignored: the demonstration of fundamental errors in applied evaluation fields such as personnel evaluation and program evaluation due to the neglect of the core discipline.



Notes

1. The author welcomes comments and criticisms of all kinds; they should be addressed to him at P.O. Box 69, Point Reyes, CA 94956 ([email protected] on the Internet), or faxed to (415) 663-1511. The reflections reported here were produced while working part-time on the CREATE staff, although mostly on my own time, since my work for CREATE is primarily concerned with the specifics of teacher evaluation. However, even when working on the specific topic, there is a need to examine foundations in order to deal with questions of validity, and some remarks about the connection are included here. CREATE works mainly on personnel policy and program evaluation, and on institutional evaluation, which combines program and personnel evaluation. It also does considerable meta-evaluation.


3. Closely analogous to the amateur status of applied mathematicians about matters in the foundations of mathematics, and not unlike the status of a bookmaker with respect to probability theory. A high degree of skill in an applied field does not automatically generate any skill in the theory of the field, let alone meta-fields such as the sociology or history of the subject, or the logical analysis of propositions in the field.

4. The discussion here is only intended to provide a brisk overview of this technical area. Further details and references will be found in the relevant articles in the Evaluation Thesaurus (1991).

5. Since if constructionism were true, its arguments would prove that the claim that it is true has no validity for those who do not construct reality in the same way as those who think it true, i.e., those who disagree with it. That is, it is no more true than its denial, which means it is not true in the only sense of that term that ever made it into the dictionaries or into logical or scientific usage.

6. By contrast with a vocabulary of standard signs, lacking grammar and hence recombination capability.

7. The definitive reference is The employment interview: Theory, research, and practice, Eder and Ferris (1989).

8. The term "indicator" is here used to refer to a factor that is not a criterion (i.e., not one of the defining components of the job). Hence good simulations--e.g., the classical typing test for selecting typists--are exempt from these remarks, which apply only to 'empirically validated' indicators such as performance on proprietary tests or demographic variables.

9. It is discussed at greater length in the writer's contribution to Research-Based Teacher Evaluation (1990). It is still denied by leading specialists in personnel evaluation, many of whose standard procedures are threatened by it (the use of 'empirically validated' tests).

10. There are some cases where the contribution has been, and may be, significantly different from zero. For example, when the theory violates a paradigm, a specialist in the evaluation of paradigms--someone with a background in both history and philosophy of science--may be able to contribute a useful perspective or analogy.

11. E.g., they will include Excellent, Above Average, Average, Below Average, Unacceptable. Since the average performance may be excellent, or unacceptable, this fails to meet the minimum requirement for a scale (exclusive categories). The converse error is a scale like this: Outstanding, Good, Acceptable, Weak, Weakest. 'Grading on the curve' is another good example of total category confusion, for well-known reasons.

12. Many other examples are given in ET4.

13. By early next year, the present author hopes to do this in a monograph for the Sage Social Science Methodology Series called General Evaluation Methodology.



14. The most obvious is that the standard procedure, which allocates say 100 points across half a dozen dimensions of merit, ignores the existence of absolute minima on some of the scales. This means--to give an extreme example--that a proposal which happens to be on the wrong topic, but which is staffed by great staff from a great institution at a modest price, could in principle win a competitive grant by picking up enough points on several of the dimensions to overcome its zero on relevance.

15. Some of this section is an improved version of parts of a much longer article, "Hard-Won Lessons in Program Evaluation", in the June, 1993, issue of New Directions in Program Evaluation (Jossey-Bass).
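The arithmetic behind note 14 can be made concrete with a minimal sketch. The dimension names, weights, and scores below are entirely hypothetical; the point is only that a purely additive 100-point allocation lets a proposal with a zero on relevance outscore an on-topic rival, whereas a rule with absolute minima disqualifies it.

```python
def additive_score(scores):
    """The standard procedure: simply sum points across dimensions."""
    return sum(scores.values())

def score_with_minima(scores, minima):
    """Apply absolute minima first; a proposal below any floor is disqualified."""
    for dimension, floor in minima.items():
        if scores.get(dimension, 0) < floor:
            return None  # fails a hard bar, regardless of other strengths
    return additive_score(scores)

# Hypothetical proposals: one off-topic but otherwise excellent, one on-topic.
wrong_topic = {"relevance": 0, "staff": 25, "institution": 20, "price": 20}
on_topic    = {"relevance": 15, "staff": 15, "institution": 10, "price": 10}

# Additive scoring lets the off-topic proposal win, 65 points to 50.
print(additive_score(wrong_topic), additive_score(on_topic))

# With an absolute minimum on relevance, the off-topic proposal is
# disqualified (None) while the on-topic one keeps its 50 points.
print(score_with_minima(wrong_topic, {"relevance": 5}))
print(score_with_minima(on_topic, {"relevance": 5}))
```

The design point is simply that minima must be checked before, not folded into, the summation.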

16. This is an improved version of a classification that appeared in Scriven (1993).

17. Although he is often wrongly thought of as never questioning program goals.

18. Alkin recently reviewed his original definition of evaluation after 21 years, and still could not bring himself to include any reference to merit, worth, or value. He defines it as the collection and presentation of data summaries for decision-makers, which is of course the definition of MIS (management information systems). See pp. 93-96 in Alkin (1991).

19. By contrast, Position A was put forward by educational researchers, who were less committed to the paradigm of value-free social science, possibly because their discipline includes history and philosophy of education, comparative education, and educational administration, which have quite different paradigms.


20. This attempt to replace evaluation with explanation is reminiscent of the last stand of psychotherapists faced with the put-up-or-shut-up attitude of those doing outcome studies in the 1950s and 1960s. The therapists, notably the psychoanalysts, tried to replace remediation with explanation, arguing that the payoff from psychotherapy was improved understanding by the patients of their conditions, rather than symptom reduction. This was not a popular view amongst patients who were in pain and paying heavily to reduce it--they thought.

21. A relativistic evaluative statement is something like: "If you value so-and-so, then this will be a good program for you", or "The program was very successful in meeting its goals", or "If technology education should be accessible to girls as easily as boys, then this program will help bring that about". These claims--of course, these are simple examples--express an evaluative conclusion only relative to the client's or the consumer's values. A direct evaluative claim, by contrast, while it can be 'relativistic' in another sense--that is, comparative or conditional--will contain an evaluative claim by its author about the program under evaluation. For example: "This program is not cost-effective compared to the use of traditional methods", or "This is the best of the options", or "These side-effects are extremely unfortunate".

22. The connoisseurship model also weakens the evaluative component in evaluation, reducing it to the largely subjective model of a connoisseur's judgments. The connoisseur is highly knowledgeable, but the knowledge is in a domain where it only changes, and does not validate, its owner's evaluations.

23. These points are covered in some detail in the Evaluation Thesaurus entries on the logic of evaluation and are not repeated here, since the arguments are of rather specialized interest, although the issue is of crucial importance.

24. In the famous Astin case, the Director was asked to do a study of the effect of the battery additive AD-X2 prior to governmental purchase of it for the vehicle fleet. The additive had no effect, as was apparent from a simple control-group study of government vehicles, and reporting that result cost Astin his job (although media pressure eventually got him reinstated). A look at the process of evaluation for textbooks, and its political ambience, provides what may be an even clearer example of product evaluation as involving the same political dimensions as program evaluation.



References

Alkin, M. (1991). Evaluation theory development: II. In M. McLaughlin & D. Phillips (Eds.), Evaluation and education at quarter century (pp. 91-112). Chicago: NSSE/University of Chicago Press.

Cronbach, L.J., Robinson Ambron, S., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C., Walker, D.F., & Weiner, S.S. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.

Eder, R.W., & Ferris, G.R. (Eds.). (1989). The employment interview: Theory, research, and practice. Newbury Park, California: Sage.

Evaluation thesaurus (4th ed.). (1991). Newbury Park, California: Sage.

Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Newbury Park, California: Sage.

House, E. (1989). Evaluating with validity. Newbury Park, California: Sage.

Rossi, P.H., & Freeman, H.E. (1989). Evaluation: A systematic approach. Newbury Park, California: Sage.

Scriven, M. (1990). Can research-based teacher evaluation be saved? In Richard L. Schwab (Ed.), Research-based teacher evaluation (pp. 12-32). Boston: Kluwer.

Scriven, M. (1993). Hard-won lessons in program evaluation. New Directions in Program Evaluation (June).

Spector, P. (1992). Summated rating scale construction. Newbury Park, California: Sage.

Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., & Provus, M.M. (1971). Educational evaluation and decision making. Itasca, IL: Peacock.

The Author

MICHAEL SCRIVEN has degrees in mathematics and philosophy from Melbourne and Oxford, and has taught and published in those areas and in psychology, education, computer science, jurisprudence, and technology studies. He was on the faculty of UC/Berkeley for twelve years. He was the first president of what is now the American Evaluation Association, founding editor of its journal (now called Evaluation Practice), and recipient of its Lazarsfeld Medal for contributions to evaluation theory.