Evaluation and Program Planning, Vol. 5, pp. 99-110, 1982. Printed in the U.S.A.
0149-7189/82/020099-12$03.00/0. Copyright © 1982 Pergamon Press Ltd. All rights reserved.

AN INDUSTRIAL APPLICATION OF META-EVALUATION

MARTIN E. SMITH

New England Telephone

ABSTRACT

This article describes the efforts of an industrial training organization to increase the number and to improve the quality of "follow-up" evaluations (i.e., evaluations in terms of post-training job performance). Efforts included an analysis of the problems encountered in evaluating training, development and implementation of corrective actions, and an evaluation of the corrective actions. The project illustrates the process of "meta-evaluation."

Over the last 20-30 years, commentators have lamented the sorry state of evaluation in business and industry. The litany of faults is lengthy. Few training programs are evaluated (e.g., Campbell, 1971; Chabotar, 1977; Hinrichs, 1976; Snyder, Raben & Farr, 1980; Wallace & Twichell, 1953). Data collection is often limited to polling participant reactions to the program (e.g., Ball & Anderson, 1975; Catalano & Kirkpatrick, 1968; Digman, 1980; Goldstein, 1974), derisively referred to as the "smiles test." Few efforts seek to determine the transfer of newly acquired skills to the job setting (Campbell & Dunnette, 1968; Catalano & Kirkpatrick, 1968; Kirkpatrick, 1960). Evaluations do not support decision-making (Alden, 1978).

I contend that the true state of training evaluation is healthier than popularly believed. There are useful models of training evaluation (e.g., Brethower & Rummler, 1976; Goldstein, 1978; Snyder et al., 1980). Exemplary evaluations can be found in the professional literature (e.g., Chaney & Teel, 1967; Duncan & Gray, 1975; Mikesell, Wilson, & Lawther, 1975; Smith, 1978, 1979). There is at least a modest amount of research on the methodological problems of training evaluation (e.g., Blaiwes, Puig, & Regan, 1973; Gustafson, 1976; Rothe, 1978). Finally, extended critiques of training evaluation and applied research (e.g., Ball & Anderson, 1975) suggest a growing maturity.

This paper describes a study that further exemplifies the maturity of training evaluation as a professional discipline. The study represents one organization's efforts to improve the frequency and quality of "follow-up" evaluations of its training courses. The term follow-up denotes an evaluation of training in terms of the post-training job performance of course graduates. In effect, this study counters the criticisms cited in the first paragraph.

This study is interesting for two reasons. First, it describes a number of unpublished evaluations in terms of scope, methodological problems and impact. Consequently, this article may present a truer picture of the "state of the art" than evaluations reported in the professional literature. Secondly, this study illustrates the process of meta-evaluation and the translation of meta-evaluation findings into practical actions.

The study may be divided into three phases. The first was a meta-evaluation, or critique, of the training evaluations completed up to that point in time. This meta-evaluation produced recommendations for improving the follow-up evaluation process. The second phase involved implementing corrective actions based upon the phase 1 recommendations. Phase 3 evaluated the success of the phase 2 interventions by the same methods used in phase 1.

Some follow-up evaluations go beyond determining the training's success or failure to analyze the reasons for post-training performance deficiencies. Factors considered include omission of key job requirements from the training program, supervisory support, task interference, etc. Figure 1 illustrates this analysis. A follow-up study may also be extended to translate observed performance improvements into revenue gain or cost reduction for the company. This extension may be called a benefit, value or worth analysis and has obvious design implications (see, e.g., Hahn, in press; Smith & Corbett, 1977).

A preliminary report of this research was presented by Smith, O'Callaghan, Corbett, Morley, and Kamradt (1976). Requests for reprints should be sent to Martin E. Smith, New England Telephone, P.O. Box P, Marlboro, MA 01752.


[Figure 1. Flowchart for Analyzing Problems in the Transition from Training to the Job. The flowchart traces a series of yes/no questions about a graduate: whether post-training performance is satisfactory, whether the graduate had a chance to apply the training, whether local equipment and methods vary from the course, and whether the employee has moved off the job. Depending on the answers, the outcome is taken as evidence of the effectiveness of training, as an implied problem with training or with the supervisor, as a difference in adhering to company practice or a problem with the course, as a case where direct application of training may not be reasonable to expect, or as a move that is either unrelated to training or implies a failure of training if the employee left because (s)he was unable to do the job.]
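Since the flowchart itself does not survive in this copy, the following Python sketch approximates the kind of decision sequence the recoverable box labels describe. The function name, argument names and return strings are illustrative assumptions, not the author's exact logic.

```python
# Hypothetical sketch of the Figure 1 decision logic; the ordering and
# wording of the checks are approximations based on the surviving box labels.

def analyze_transition_problem(performance_satisfactory: bool,
                               had_chance_to_apply: bool,
                               local_methods_match_course: bool,
                               still_on_job: bool,
                               left_because_unable: bool = False) -> str:
    """Classify why a graduate's post-training job performance is (or is not) adequate."""
    if not still_on_job:
        # A move may be unrelated to training, or may imply a training failure
        # if the employee left because (s)he was unable to do the job.
        return ("move may imply failure of training"
                if left_because_unable else "assume move unrelated to training")
    if not had_chance_to_apply:
        # Some reasons are acceptable; others may imply a problem with
        # training or with the supervisor.
        return "no chance to apply training; check supervisor and work assignments"
    if not local_methods_match_course:
        # Local equipment and methods may vary from the course; this may indicate
        # a problem with adhering to company practice or a problem with the course.
        return "local methods differ from course; check course versus company practice"
    if performance_satisfactory:
        return "evidence of the effectiveness of training"
    return "implies problem with training"


if __name__ == "__main__":
    print(analyze_transition_problem(True, True, True, True))
    print(analyze_transition_problem(False, True, True, True))
```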

Meta-Evaluation
"Meta-evaluation" refers to an evaluation of an evaluation study (Scriven, 1980, p. 83). Stufflebeam (1974) presented the case for meta-evaluation in these terms:

Good evaluation requires that evaluation efforts themselves be evaluated. Many things can and often do go wrong in evaluation work. Accordingly, it is necessary to check evaluations for problems such as bias, technical error, administrative difficulties, and misuse. Such checks are needed both to improve on-going evaluation activities and to assess the merits of completed evaluation efforts. (p. 1)

PHASE 1: META-EVALUATION

To put the study in context, the organizational setting is described first. Next, the phase 1 objectives and data collection plan are outlined. The findings are presented under two headings representing the frequency and quality of follow-up evaluations. Each dimension is discussed in terms of problems detected and factors suspected of contributing to the observed problems. Finally, recommendations are presented for correcting the problems.

Organizational Setting
This study began in 1974. The project was sponsored by the Bell System's Plant Training Advisory Board. This committee directed an association of training groups across the Bell System. This association, its products and procedures, were collectively known as the "Fair Share" program (Shoemaker, 1979). The Fair Share program involved approximately 170 course developers and project managers in 20 telephone companies. These groups developed technical, clerical and supervisory training for approximately 280,000 employees responsible for installing and maintaining telephone equipment. Course populations ranged from a few hundred people over several years to three or four thousand in one year. A course could be offered in a single training center or in as many as 50 centers, or it could be administered on the job by the student's supervisor. Between its inception in 1970 and mid-1974, the Fair Share program produced 95 courses, with over 50 more in various stages of development.

The nature of the Fair Share program affected the meta-evaluation in several ways. First, the project was sponsored by the Training Advisory Board, which controlled the administrative machinery through which corrective actions would be implemented. Besides this high-level support, the study's feasibility was also favored by the course development documentation required by the Training Advisory Board. Each course was supposed to be designed and documented according to the AT&T Training Development Standards (Shoemaker, 1979; Crilly, 1980). The documentation requirements included follow-up evaluation, although the AT&T Standards did not provide guidance in how to evaluate a course. Unless specifically requested by the Training Advisory Board, follow-up evaluations were undertaken at the discretion of the company that developed the course. The policy of requiring follow-up evaluations was not enforced.

The follow-up evaluation process itself was a difficult undertaking. Evaluation requirements could be expected to vary by such course characteristics as subject matter, target population, and course administration procedures. Evaluations were usually conducted by the course developers, who had little or no experience in evaluation. Course developers were generally telephone supervisors on a 2 to 3 year rotational assignment in training.

The "stakes" for this study were high. A large number of people designing a large number of courses for large trainee populations represented a huge investment. Criticisms about the cost, efficiency and contribution of the Fair Share program prompted the Training Advisory Board in 1974 to initiate studies of several critical issues, such as course maintenance and instructor training. The board gave the highest priority to an investigation of the follow-up evaluation process. This process was not only the final quality control check on course development but also produced evidence for training's contribution to corporate operations.

Objectives and Plan
An inter-company project team was directed by the board to investigate the extent and manner in which courses were being evaluated and to recommend any improvements that might be indicated. The mission was translated into four questions: (1) What proportion of courses have been evaluated? (2) How do the evaluation reports compare to "quality" standards? In other words, what are the process deficiencies? (3) What factors account for observed deficiencies? (4) What remedial actions should be taken?

The data collection plan involved: obtaining all the available follow-up evaluation reports; devising a checklist for rating the reports; and interviewing project managers about follow-up evaluations. The checklist was created by the project team based upon a literature review. It covered project goals, procedures, interpretation and recommendations. The checklist items are represented in Table 1. The reports were first rated individually by four project team members, and then consensus ratings were derived through group discussion.

Interviews were conducted by telephone. Thirteen project managers were interviewed about the reasons why specific courses had not been evaluated. These 13 constituted over half of the available project managers. All 12 managers who had supervised follow-up evaluations were interviewed about: (1) problems in planning the study, collecting and analyzing data, and writing the report; and (2) reactions to the evaluation report and implementation of recommendations.

Proportion of Courses Evaluated
As of June 30, 1974, only 20 of the 95 published courses had been evaluated in terms of the post-training job performance of course graduates.¹ Although this record might compare favorably to the industry in general (Ball & Anderson, 1975), the number fell below the standard that the Training Advisory Board had set for the Fair Share program. Furthermore, the rate of producing courses was far outstripping the rate of evaluating courses.

The interviews revealed several factors responsible for the low proportion of evaluated courses. Foremost was the competition among projects for limited manpower. Development of new courses was clearly perceived as more important than evaluation by the project managers. Secondly, managers were reluctant to undertake evaluation because they had no experience or training in evaluation. Managers also disliked collecting data from field managers because of prior difficulties in obtaining data from the field. Finally, volatile course content was cited as a reason for not doing a follow-up evaluation. Changes in work procedures or equipment could make evaluation findings obsolete before the recommendations were implemented.

Technical Quality
"Technical quality" is used here to denote all the desired features of an evaluation which increase the probability of reliable measurements, valid conclusions and cost-effective recommendations. As mentioned earlier, the instrument for judging quality was a checklist. Twenty-three reports were rated.² The maximum score on the checklist was 23 points. The mean rating was 9.13, or 40% of the maximum score. Table 1 presents a classification of the observed deficiencies. The deficiencies are also referenced to four types of suspected causes, which are defined as follows:

Knowledge: the report deficiency is attributed to a skill/knowledge deficiency of the evaluator;

Standards: the report deficiency is attributed to an omission of a critical design or report requirement from the AT&T Training Development Standards;

Consequences: the report deficiency is attributed to punishing consequences or no reward for evaluation activities;

Interference: the report deficiency is attributed to some factor beyond the evaluator's³ control.

¹All 95 courses had been evaluated in terms of testing trainees' end-of-course achievement of course objectives.
²Not all evaluation reports could be obtained in time for this analysis.

TABLE 1
REPORT DEFICIENCIES AND THEIR SUSPECTED CAUSES (Based on 23 Studies)

Each deficiency was referenced to one or more of four suspected causes: Knowledge, Standards, Consequences, and Interference (see text).

1. Evaluation goals not stated (16 studies).
2. Sample not representative of trainee population.
   a. Biased selection procedure (5 studies).
   b. Sample limited to one company (7 studies).
   c. Small sample size (17 studies).
   d. No comparison of sample to population (16 studies).
3. Poor data return.
   a. Poor questionnaire returns (7 studies).
   b. Failure to arrange interviews (2 studies).
   c. Data from field records not supplied (1 study).
4. Questionable reliability of data.
   a. Known (9 studies) or suspected (13 studies) biases.
   b. Unclear relation of data to course objectives (12 studies).
   c. Data collection limited to unrepresentative sample of graduates' work activities (suspected in 18 studies).
   d. Vague description of data collection (21 studies).
5. Job performance deficiencies not identified.
   a. Not all data analyzed (10 studies).
   b. Indications of performance problems ignored (3 studies).
   c. Data not analyzed by sub-group within trainee population (21 studies).
   d. Misinterpretation of data (3 studies).
6. Evaluation findings not compared to pre-course findings (23 studies).
7. Administration and implementation problems not studied (21 studies).
8. Inappropriate recommendations.
   a. Recommendations not supported by data (4 studies).
   b. Failure to recommend action for performance problem (3 studies).
9. Inappropriate detail throughout report (23 studies).
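As a worked illustration of the checklist arithmetic reported above (a 23-point maximum, individual ratings by four team members reduced to a consensus score, and a mean of 9.13, or 40% of the maximum), here is a minimal Python sketch. The consensus rule and the example ratings are hypothetical; the project team derived consensus through group discussion rather than by formula.

```python
# Minimal sketch of the checklist scoring arithmetic described in the text.
# The example ratings below are invented for illustration only.
from statistics import mean, median

MAX_SCORE = 23  # one point per checklist item


def consensus_score(individual_ratings: list[int]) -> float:
    """Stand-in for the team's group discussion: here, simply the median of
    the four raters' scores (the actual consensus process was judgmental)."""
    return median(individual_ratings)


def summarize(report_scores: list[float]) -> tuple[float, float]:
    """Return the mean rating and its percentage of the maximum score."""
    avg = mean(report_scores)
    return avg, 100.0 * avg / MAX_SCORE


if __name__ == "__main__":
    # Four raters score one report, then a consensus score is derived.
    one_report = consensus_score([8, 10, 9, 9])
    # A batch of consensus scores for several reports (hypothetical values).
    scores = [one_report, 6, 12, 11, 7]
    avg, pct = summarize(scores)
    print(f"mean rating = {avg:.2f} of {MAX_SCORE} ({pct:.0f}% of maximum)")
```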

The following paragraphs describe each of the nine numbered categories of report deficiencies.

³The "evaluator" is anyone who wrote an evaluation report reviewed in our study. The evaluator could be any of the following: (1) the person who designed the course that was evaluated, (2) the manager of this course designer, (3) another designer or manager from the same company, (4) a designer or manager from another company, or (5) an evaluation specialist.


1. Evaluation Goals Not Stated. More often than not, project goals had to be inferred from the data collection procedures. The absence of stated goals made it difficult to judge the appropriateness of data collection procedures and the relevance of the findings. The main reason for this deficiency was that the directions for writing evaluation reports, contained in the AT&T Standards, did not require a statement of evaluation goals.

2. Sample Not Representative of Trainee Population. In five studies, the sample was selected in a manner which clearly seemed biased, for example, asking the instructor or line supervisors to nominate people to be included in the study. In another seven studies, the sample was selected from only one company even though the findings were generalized to other companies (as many as 16) using the course. Sample sizes were generally quite small, as few as three graduates in one study, with a median of 15 graduates. Another common fault was the failure to compare the sample to the target population on critical characteristics, such as time in the job before and after training, prior job experience, etc.

Several factors accounted for these deficiencies. First, evaluators were not required to describe their sampling procedure. Secondly, they were largely unaware of the importance of sampling, not to mention concepts such as randomization and stratification. Finally, incomplete or obsolete training records hampered identification of all course graduates.

3. Poor Data Return. Seven studies had a low rate of returning checklists and questionnaires from supervisors of course graduates. Two other studies experienced difficulty in arranging all planned interviews. In another study, field supervisors refused to search through records for specific data. From the supervisor's viewpoint, the data collection task, e.g., filling out a questionnaire, required too much time, provided no reward for responding, and required rating job performance which was not easily observed or which the supervisor was not technically qualified to evaluate. In some cases, these reactions were warranted by flaws in the data collection procedures. For example, several open-ended questionnaires involved much writing. In other cases, the evaluator did not take advantage of data routinely collected by field supervisors, e.g., productivity data. Returns also suffered from attrition due to personnel changes (e.g., transfers) and vacations. Finally, no evaluator sent "reminders" to tardy respondents.

4. Questionable Reliability of Data. Virtually every study had some suspected or demonstrated bias in its data. Biases were classified as "suspected" when only one data collection technique was used, when questionnaires were used to measure the transfer of skills to job performance, and when observations were limited to one or two work assignments per graduate. Biases were considered to be "demonstrated" when less than 50% of the data were returned, when the evaluation report noted errors or omissions in the completed performance checklists, and when a substantial period of time had elapsed between training and assignment of relevant work. Unreliability was primarily attributed to design errors caused by knowledge deficiencies on the part of the evaluators. Knowledge deficiencies involved the selection of data collection strategies, the design of specific instruments, e.g., questionnaires and observation checklists, the timeliness of data collection, and the testing of procedures and materials before actual data collection.

5. Job Performance Deficiencies Not Identified. This category concerns errors in summarizing and interpreting data. In 10 studies, not all the data were analyzed. Indications of performance problems were ignored in three studies. In another three reports, our project team doubted the conclusions. Finally, most studies did not consider the possibility that the course might not be effective for all sub-groups within the trainee population, for example, trainees from nontraditional backgrounds. Most of these deficiencies can be ascribed to lack of knowledge. However, it may also be punishing to the evaluator, who also wrote the course, to uncover major faults in course design. It is punishing not only to admit mistakes to one's supervisors but also to divert manpower from other projects to the revision of courses.

6. Evaluation Findings Not Compared to Pre-Course Findings. If training is designed to prevent or correct specific performance deficiencies, as documented in a work problem analysis report⁴ or job study report, then one approach to evaluation is to see if those job performance deficiencies still exist after training. No evaluation study looked at this question. There were two reasons why evaluation data were not compared to pre-course data. First, the AT&T Standards did not call for this analysis. Secondly, some courses had been developed in the absence of research.

7. Administration and Implementation Problems Not Studied. "Administration" refers to how the course is taught in the training center, whether or not it is taught according to the plan outlined in the instructor's guide. "Implementation" refers to the use of the course by other companies, whether or not there are problems in adapting the course for local conditions. Eighteen studies did not consider these issues because there was no requirement.

8. Inappropriate Recommendations. Recommendations were not supported by data in four reports and were not provided in three studies when problems were clearly indicated. There were only three reports that recommended any substantial remedial action. Twelve reports recommended no action of any kind. The factors proposed as causes of deficiency no. 5 also apply to no. 8, viz., lack of knowledge in data analysis and punishing consequences to the evaluators.

⁴In the Bell System's approach to training development, the first step is often an investigation of a reported job performance problem to determine its causes. The purpose of this investigation is to limit training development to problems caused by skill/knowledge deficiencies and to avoid writing training that will have no impact upon job performance. See Gilbert (1967) or Mager and Pipe (1970) for a discussion of this topic.


9. Inappropriate Detail Throughout the Report. The project team felt that the level of detail was inappropriate, either too much or too little, throughout the reports, especially in describing project goals, data collection procedures, and in relating data to the course objectives. This problem could be overcome by more specific directions for writing evaluation reports.

Summary and Recommendations
Two problems were found: (1) not enough courses had been evaluated and (2) there were numerous technical faults in the completed evaluations. The technical faults were grouped into nine categories. The first problem was explained in terms of competition among projects for limited personnel resources, the sometimes punishing consequences of evaluation and the volatility of course content. The problem of technical quality was attributed to lack of knowledge, inadequate standards for reporting and evaluating reports, punishing aspects of evaluation, and task interference variables. In January, 1975, the following actions were recommended to the Training Advisory Board:

(1) Enforce the requirement to evaluate courses as part of the development process;
(2) Assign the evaluation to a neutral party, when practical to do so;
(3) Hire a full-time evaluation specialist to train, monitor, coach and evaluate the evaluation efforts of course developers;
(4) Revise the AT&T Training Development Standards to provide better definition of design and report requirements;
(5) Develop a "how to" manual on follow-up evaluation;
(6) Supplement the manual with adjunct instruction;
(7) Use commercially available workshops on performance problem analysis to prepare course developers for evaluation until recommendations (5) and (6) could be completed;
(8) Sponsor research on performance measurement; and
(9) Include performance evaluation in supervisory courses.

PHASE 2: REMEDIAL ACTIONS

Work began immediately on several remedial actions. The Training Advisory Board decided on the actions listed in Table 2. The first two actions were directed at increasing the proportion of courses evaluated. The other three were aimed at improving the technical quality of evaluations. These actions essentially represent recommendations (1), (2), (4), (5) and (6). Recommendations (8) and (9) eventually took place but for reasons beyond the meta-evaluation. The remedial actions are briefly described below.

Chairman's Efforts
Through informal discussions, the chairman of the Training Advisory Board persuaded several companies to undertake follow-up evaluations. Admittedly, this was a short-term action, but it did set a favorable context for other remedial actions.

Administrative Guidelines
These guidelines specified when a follow-up evaluation should be done and who should do it. More precisely, the guidelines set forth the organizational responsibilities for assuring that evaluations are done and done according to agreed-upon standards of quality. The first requirement was that each company would be responsible for evaluating each course that it developed for the Bell System. Secondly, a company was prohibited from undertaking a new development project until an evaluation report had been submitted for each of its courses. This restriction could be waived under three conditions: (1) an evaluation was in progress; (2) another company assumed responsibility for the evaluation; and (3) the board waived the restriction for a "good" reason, such as an insufficient number of recent graduates. Finally, AT&T was assigned the task of monitoring evaluation projects. (A sketch of this eligibility rule appears below.)

Follow-up Evaluation Guidelines
This document was a step-by-step guide for planning, conducting and reporting follow-up evaluation studies. It was designed for experienced course developers with no experience in evaluation.
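The project-eligibility rule described under Administrative Guidelines above amounts to a simple decision procedure. The following Python sketch is a hypothetical rendering of that rule; the class and field names are invented for illustration and are not part of the Bell System guidelines.

```python
# Hypothetical sketch of the Administrative Guidelines eligibility rule:
# a company may not start a new development project until an evaluation
# report has been submitted for each of its courses, unless a waiver applies.
from dataclasses import dataclass, field


@dataclass
class Course:
    name: str
    report_submitted: bool = False
    evaluation_in_progress: bool = False       # waiver condition (1)
    reassigned_to_other_company: bool = False  # waiver condition (2)
    board_waiver: bool = False                 # waiver condition (3), e.g., too few recent graduates


@dataclass
class Company:
    name: str
    courses: list[Course] = field(default_factory=list)


def may_start_new_project(company: Company) -> bool:
    """True if every course is evaluated or covered by one of the three waivers."""
    return all(
        c.report_submitted
        or c.evaluation_in_progress
        or c.reassigned_to_other_company
        or c.board_waiver
        for c in company.courses
    )


if __name__ == "__main__":
    example_co = Company("Example Telephone Co.", [
        Course("Cable Splicing", report_submitted=True),
        Course("Switch Maintenance", evaluation_in_progress=True),
        Course("Station Installation"),  # no report, no waiver
    ])
    print(may_start_new_project(example_co))  # False until the third course is covered
```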

TABLE 2
IMPLEMENTATION DATES OF REMEDIAL ACTIONS

1. Informal efforts of Plant Training Advisory Board chairman: May-June, 1975
2. Administrative guidelines adopted (when, who, exceptions; dates to initiate evaluations): September, 1975
3. Manual published: October, 1975
4. Workshop taught for first time: December, 1975
5. Standards for evaluations published: October, 1976


TABLE 3
FOLLOW-UP EVALUATION GUIDELINES

Chapter 1. Introduction
  Inputs: Overview of process; Benefits; Assumptions; Limitations
  Outputs: Go/No Go decision on using guidelines

Chapter 2. Goals and constraints
  Inputs: Possible goals; Possible constraints
  Outputs: Tentative goals; Tentative constraints

Chapter 3. Data collection strategy
  Inputs: Tentative goals; Tentative constraints; List of possible strategies; Considerations for selecting strategies; Requirements of each strategy
  Outputs: Goal(s) chosen; Constraints identified; Strategies selected; Outline of requirements for the selected strategies

Chapter 4. Evaluation plan
  Inputs: Outline of requirements; Form for each requirement
  Outputs: Detailed plan for the study

Chapter 5. Materials development
  Inputs: Evaluation Plan; Samples of instruments; "Rules of thumb"
  Outputs: Data collection instruments and instructions

Chapter 6. Tryout and revision
  Inputs: Data collection instruments and instructions; Directions for conducting tryout
  Outputs: Revised data collection instruments and procedures; Observer/Data Collectors selected and trained

Chapter 7. Final arrangements for data collection
  Outputs: Sample selected; Materials reproduced; Facilities scheduled; Field cooperation obtained; Distribution arranged; Plan for monitoring; Other assistance obtained

Chapter 8. Data collection
  Inputs: Revised data collection instruments and procedures; "Reminder checklist"; Outputs of Chapter 7
  Outputs: Data; Record of deviations from plan

Chapter 9. Data analysis
  Inputs: Data; Deviations from plan; Worksheets; Directions for summarizing data; Interpretation guidelines
  Outputs: Data summaries; Conclusions; Recommendations

Chapter 10. Evaluation Report
  Inputs: Data summaries; Conclusions; Recommendations; Deviations from plan; Evaluation Plan; Phase 7 Standards; Report form
  Outputs: Evaluation Report

Table 3 outlines the manual's content. Several features of the guidelines are worth noting. The guidelines contained a 150-page appendix of sample data collection instruments. Secondly, the guidelines provided users with checklists and worksheets for committing design decisions to writing. These documentation requirements not only led the evaluators through a logical design process but also constituted an "audit trail" for later review.
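To make the idea of committing design decisions to writing concrete, here is a minimal Python sketch of the kind of worksheet record such an audit trail might contain. The field names and example entries are hypothetical and are not taken from the actual guidelines.

```python
# Hypothetical sketch of an "audit trail" worksheet entry: each design
# decision is recorded with its rationale so it can be reviewed later.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DesignDecision:
    step: str          # e.g., "Data collection strategy"
    decision: str      # what was decided
    rationale: str     # why it was decided
    decided_on: date = field(default_factory=date.today)


@dataclass
class EvaluationWorksheet:
    course: str
    evaluator: str
    decisions: list[DesignDecision] = field(default_factory=list)

    def record(self, step: str, decision: str, rationale: str) -> None:
        self.decisions.append(DesignDecision(step, decision, rationale))

    def audit_trail(self) -> str:
        """A reviewable, chronological summary of the design decisions."""
        return "\n".join(
            f"{d.decided_on} | {d.step}: {d.decision} (because {d.rationale})"
            for d in self.decisions
        )


if __name__ == "__main__":
    ws = EvaluationWorksheet("Cable Splicing", "J. Doe")
    ws.record("Goals", "Assess transfer of splicing skills to the job",
              "course objectives emphasize on-the-job application")
    ws.record("Data collection strategy", "Supervisor checklist plus field observation",
              "questionnaires alone were judged too unreliable")
    print(ws.audit_trail())
```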


Evaluation Workshops
Potential evaluators were trained in the use of the Follow-up Evaluation Guidelines by means of a three-day workshop. Participants were expected to come to the workshop with a course to evaluate. They left the workshop with a tentative evaluation plan. The workshop is still offered today.

Revised Training Development Standards
As mentioned earlier, the AT&T Standards specified documentation requirements⁵ for each phase of course development, including follow-up evaluation. The old version listed 10 questions to be addressed in the follow-up evaluation report. The revised version proposed 77 questions and provided explanations of each question. The revised standards were directed at two audiences: writers and readers of evaluation reports. Table 4 contains excerpts from the revised standards.

PHASE 3: EVALUATION OF THE REMEDIAL ACTIONS

In late 1976, a second meta-evaluation was undertaken to assess the combined impact of the various remedial actions upon follow-up evaluation activity. This study used the same data collection methods as the original study. Recently, historical records were examined for the frequency of evaluations during 1977 and 1978. Findings are again presented in terms of the number and quality of the evaluations.

Proportion of Courses Evaluated
Figure 2 shows the trends in publishing and evaluating courses. Of the 86 courses evaluated since the inception of the Fair Share program,⁶ 32 were reported in the five years prior to the interventions (1971-75), 34 in the year following the interventions, and 20 in the subsequent two years. Since most of the remedial actions were implemented in 1975, it is tempting to take credit for the sudden burst in follow-up evaluations. Interviews with project managers did identify "project assignment directly from the board" as the most frequent reason for undertaking a follow-up evaluation. However, one project accounted for the evaluation of 18 interrelated courses. Furthermore, the 1976 total may reflect a backlog from the previous year or two. Evaluation activity was down in 1975 and possibly in 1974. The depressed economy during this period reduced work loads across the Bell System, which, in turn, led to no hiring, lay-offs and few promotions. Consequently, training volume was down. Without training, there were no graduates to follow up. The sharp decrease in follow-up evaluations in 1977 and 1978 suggests that the impact of the remedial actions was relatively short-lived.

Technical Quality
Based on the project team's ratings, the quality of the follow-up evaluation reports did improve. As Table 5 shows, the rated quality increased from a low of 6.4 in 1973 to the 1976 average of 13.3 points (out of a possible 23 points).

TABLE 4
EXCERPTS FROM EVALUATION REPORT STANDARDS

1. Goals
   a. Relevance. Has the goal of relevance been selected for the follow-up evaluation study? Relevance would be indicated by such questions as: Are the graduates assigned the task they were trained to do? Are the skills taught needed or used on the job? Are there important skills or tasks NOT taught in the course?

4. Findings
   a. Relevance. Is there an adequate statement of analysis and summary of the data collected concerning the goal of relevance?
      (1) SUMMARY. Does the report summarize the relevance of the course to the graduates and their jobs in terms of:
         (a) The number and proportion of objectives judged to be relevant. (Acceptable alternatives to this summary include: the average number of graduates for whom an objective was relevant, or the average number of relevant objectives per graduate.)
         (b) The proportion of graduates assigned relevant work following training.
         (c) The number of new skills and concepts and/or tasks needed but not taught.

5. Recommendations
   a. Specific actions. For each recommendation, does the report state a specific action to be taken? Recommendations could include revisions of content, teaching methods within the course, revision of the assignment-to-training process, changes in the post-training process, etc.

⁵The "standards" were later revised to provide more detail on the step-by-step procedures of course development. This later revision was unrelated to this project.

⁶Several evaluations from prior years were discovered after the first meta-evaluation and added into Figures 2 and 3.


[Figure 2. Trends in Publishing and Evaluating Fair Share Courses. Vertical axis: number of courses.]

There were other signs of improvement. The average sample size of data respondents increased dramatically in 1976. The number of data collection methods per study increased slightly. The proportion of evaluations recommending major course revisions increased. A "major" revision is defined as a complete re-write of at least one lesson or the writing of a new lesson.

The causes of the improved quality are not entirely clear. During the study period of meta-evaluation II, five studies were conducted without benefit of the Follow-up Evaluation Guidelines, the workshop or the revised standards. Yet Figure 3 shows that the quality ratings of these five projects equaled the ratings for projects aided by the guidelines and the workshop and exceeded the ratings of the "guidelines only" projects. Only the combination of the guidelines, workshop and revised standards was associated with dramatic improvement in the rated quality of the evaluation reports.

It is possible that the improvement of the "unaided" evaluations was due to the renewed interest of the Training Advisory Board in evaluation, in which case the improvement might be short-lived. Alternatively, the improvement might represent the growing experience among course developers with evaluation or with data collection in general. This explanation suggests gradual improvement over a long period. It is also possible that the improvement was limited to the few evaluators and, hence, no prediction of future evaluations is possible.

TABLE 5
TRENDS IN FIVE CHARACTERISTICS OF FOLLOW-UP EVALUATION REPORTS

Year | Number of studies | Quality rating | Sample size | Number of methods | Major revision
1971-2 | 4 | 9.8 | 33.2 | 4.0 | 0%
1973 | 17 | 6.4 | 15.9 | 3.4 | 6%
1974 | 6 | 7.0 | 19.0 | 3.5 | 33%
1975 | 2 | 9.0 | 22.0 | 3.0 | 50%
1976 | 14 | 13.3 | 98.0 | 4.1 | 50%

Note: Figure 2 shows 34 courses for 1976 versus the Table 5 total of 14 studies for 1976. The difference is due to two factors: one study evaluated 18 separate courses that constituted a curriculum, and three reports were not available when the ratings were done.
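For readers who want to reproduce this kind of year-by-year summary from individual study records, a small Python sketch follows. The record layout and the sample values are hypothetical, not the study's data.

```python
# Hypothetical sketch: aggregating per-study records into yearly summary rows
# like those in Table 5 (number of studies, mean quality rating, mean sample
# size, mean number of methods, percent recommending a major revision).
from collections import defaultdict
from statistics import mean

# Each record: (year, quality_rating, sample_size, n_methods, major_revision)
studies = [
    ("1976", 13, 120, 4, True),
    ("1976", 14, 80, 5, False),
    ("1975", 9, 22, 3, True),
]

by_year: dict[str, list[tuple]] = defaultdict(list)
for rec in studies:
    by_year[rec[0]].append(rec)

for year in sorted(by_year):
    rows = by_year[year]
    print(
        f"{year}: n={len(rows)}, "
        f"quality={mean(r[1] for r in rows):.1f}, "
        f"sample={mean(r[2] for r in rows):.1f}, "
        f"methods={mean(r[3] for r in rows):.1f}, "
        f"major revision={100 * sum(r[4] for r in rows) / len(rows):.0f}%"
    )
```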


[Figure 3. Rated Quality of Evaluation Reports as a Function of Using the Follow-up Evaluation Guidelines, Attending a Workshop, and Using the Revised Standards for Evaluation.]

Needless to say, the guidelines and workshop proved to be less successful than expected. Recommendations were made for revising the guidelines.

An unanticipated finding was that quality varied as a function of whether or not the evaluator worked for the company that developed the course. Table 6 shows that evaluators who did not work for the developing company produced better studies. Furthermore, they were more likely to recommend major course revisions than evaluators from the developing companies, as Table 7 indicates. These findings were independent of using the Follow-up Evaluation Guidelines.

Summary
Meta-evaluation II found a dramatic improvement in the number of follow-up evaluations, although the improvement was short-lived. The quality of follow-up evaluations also improved, but the improvement fell short of expectations. The causes of the quality improvement were obscured. The Follow-up Evaluation Guidelines and the workshop produced no better results than evaluations without these aids. Only the combination of the guidelines, workshop and revised report standards was associated with dramatic improvement in quality. Since there was no independent use of the revised standards, it is unclear whether the improvement was due to the new standards alone or to an interaction effect.

TABLE 6
RATED QUALITY AS A FUNCTION OF THE EVALUATOR'S AFFILIATION WITH THE DEVELOPING COMPANY

 | F.U.E.G.* used | F.U.E.G. not used
Evaluator from the developing company | X̄ = 12.1, SD = 4.0, N = 9 | X̄ = 6.8, SD = 2.9, N = 24
Evaluator not from the developing company | X̄ = 18.0, SD = 0.9, N = 3 | X̄ = 10.1, SD = 3.0, N = 7

* F.U.E.G. = Follow-up Evaluation Guidelines
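A short Python sketch of how cell statistics like those in Tables 6 and 7 can be tabulated from individual report ratings is given below; the sample records are invented for illustration and are not the study's ratings.

```python
# Hypothetical sketch: computing 2 x 2 cell statistics (mean, SD, N) of
# quality ratings by evaluator affiliation and use of the guidelines,
# as in Table 6. The sample records below are invented.
from collections import defaultdict
from statistics import mean, stdev

# Each record: (from_developing_company, used_guidelines, quality_rating)
reports = [
    (True, True, 12), (True, True, 13), (True, False, 7),
    (False, True, 18), (False, False, 10), (False, False, 11),
]

cells: dict[tuple[bool, bool], list[int]] = defaultdict(list)
for from_dev, used_fueg, rating in reports:
    cells[(from_dev, used_fueg)].append(rating)

for (from_dev, used_fueg), ratings in sorted(cells.items()):
    label = (
        ("from" if from_dev else "not from") + " developing company, "
        + ("F.U.E.G. used" if used_fueg else "F.U.E.G. not used")
    )
    sd = stdev(ratings) if len(ratings) > 1 else 0.0
    print(f"{label}: X-bar = {mean(ratings):.1f}, SD = {sd:.1f}, N = {len(ratings)}")
```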

DISCUSSION

What conclusions can be drawn about follow-up evaluation? First, establishing the follow-up evaluation process as a regularly occurring organizational function is a complex problem. The phase 2 interventions undoubtedly promoted evaluation activity but did not sustain a high level beyond the first year. Similarly, the technical quality data indicated that follow-up evaluation involves skills which are difficult to master, although several non-professionals did produce exemplary evaluations. It seems a trite thing to say, but research is needed on the organizational factors which support evaluation activity and on appropriate strategies for training evaluation. Perhaps the issue is how to optimize evaluative feedback to training managers considering: (1) criticality of the feedback, which in turn depends upon such factors as criticality of the skills taught, size of the target population, opportunity to modify the training, etc.; (2) available personnel; and (3) other staffing priorities of the training organization. Markle (1976) raised this issue several years ago.

This study also demonstrated some of the interpretation problems inherent in field research. Piecing together cause-and-effect relationships was virtually impossible. For example, speculating about the efficacy of the Follow-up Evaluation Guidelines was tenuous at best because its use was left to the discretion of individual evaluators.

Our meta-evaluation plan seemed to be an efficient and effective means of analyzing the evaluations. At the time of this study, little had been written about meta-evaluation. What had been written was oriented toward public education and United States Office of Education sponsored projects. Therefore, our project team developed its own set of meta-evaluation criteria for industrial training. Our methodology was borrowed more from Gilbert (1967, 1978), Harless (1974) and Mager and Pipe (1970) than from Scriven (1969) and Stufflebeam (1974). Meta-evaluation was viewed as an application of a broader analytic technique, known by such terms as "performance analysis" and "front-end analysis." This approach involves specifying an organization's output and the desired characteristics of the output (e.g., quantity, quality, timeliness, cost, etc.), measuring actual output against the desired output characteristics, and analyzing discrepancies between actual and desired output in terms of such causal factors as skill deficiencies, feedback, consequences and task interference.
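The performance-analysis approach just described, specifying desired output characteristics, measuring actual output against them, and analyzing discrepancies in terms of causal factors, can be illustrated with a short Python sketch. The characteristic names, target values and causal labels are hypothetical examples, not the project team's criteria.

```python
# Hypothetical sketch of a front-end / performance analysis pass:
# compare desired output characteristics with actual measurements and
# attach a suspected causal factor to each discrepancy.
from dataclasses import dataclass


@dataclass
class Characteristic:
    name: str             # e.g., "quality rating", "reports per year"
    desired: float
    actual: float
    suspected_cause: str  # e.g., "skill deficiency", "consequences", "task interference"

    @property
    def discrepancy(self) -> float:
        return self.desired - self.actual


def analyze(characteristics: list[Characteristic]) -> list[str]:
    """Report only the characteristics that fall short of the desired level."""
    return [
        f"{c.name}: actual {c.actual} vs desired {c.desired} "
        f"(gap {c.discrepancy:.1f}); suspected cause: {c.suspected_cause}"
        for c in characteristics
        if c.actual < c.desired
    ]


if __name__ == "__main__":
    findings = analyze([
        Characteristic("follow-up evaluations per year", desired=20, actual=8,
                       suspected_cause="consequences (no reward for evaluating)"),
        Characteristic("mean report quality (of 23)", desired=18, actual=9.1,
                       suspected_cause="skill deficiency (no evaluation training)"),
        Characteristic("data return rate (%)", desired=80, actual=85,
                       suspected_cause="n/a"),
    ])
    print("\n".join(findings))
```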

TABLE 7
RECOMMENDATION OF MAJOR COURSE REVISION AS A FUNCTION OF THE EVALUATOR'S AFFILIATION WITH THE DEVELOPING COMPANY

 | F.U.E.G.* used | F.U.E.G. not used
Evaluator from the developing company | of 9 studies, 33% recommended major course revision | of 24 studies, 4% recommended major course revision
Evaluator not from the developing company | of 3 studies, 100% recommended major course revision | of 7 studies, 43% recommended major course revision

* F.U.E.G. = Follow-up Evaluation Guidelines


REFERENCES

ALDEN, J. Evaluation in focus. Training and Development Journal, 1978, 32(10), 46-50.

BALL, S., & ANDERSON, S. B. Practices in program evaluation: A survey and some case studies. Princeton: Educational Testing Service, 1975.

BLAIWES, A. S., PUIG, J. A., & REGAN, J. J. Transfer of training and the measurement of training effectiveness. Human Factors, 1973, 15(6), 523-533.

BRETHOWER, K. S., & RUMMLER, G. A. Evaluating training. Improving Human Performance Quarterly, 1976, 5(3-4), 103-120. Reprinted in Training and Development Journal, 1979, 33(5), 14-22.

CAMPBELL, J. P. Personnel training and development. In P. H. Mussen & M. R. Rosensweig (Eds.), Annual review of psychology. Palo Alto, CA: Annual Reviews, 1971.

CAMPBELL, J. P., & DUNNETTE, M. D. Effectiveness of T-group experiences in managerial training and development. Psychological Bulletin, 1968, 70(2), 73-104.

CATALANO, R. E., & KIRKPATRICK, D. L. Evaluating training programs: The state of the art. Training and Development Journal, 1968, 22(5), 2-9.

CHABOTAR, K. J. The logic of training evaluation. Personnel, 1977, 54(4), 23-27.

CHANEY, F. B., & TEEL, K. S. Improving inspector performance through training and visual aids. Journal of Applied Psychology, 1967, 51(4), 311-315.

CRILLY, L. ID model for the Bell System Center for Technical Education. Journal of Instructional Development, 1980, 4(2), 23-26.

DIGMAN, L. A. How companies evaluate management development programs. Human Resources Management, 1980, 19(3), 9-13.

DUNCAN, K. D., & GRAY, M. J. An evaluation of a fault-finding training course for refinery process operators. Journal of Occupational Psychology, 1975, 48(4), 199-218.

GILBERT, T. Praxeonomy: A systematic approach to identifying training needs. Management of Personnel Quarterly, 1967, 6(3), 20-33.

GILBERT, T. Human competence. New York: McGraw-Hill, 1978.

GOLDSTEIN, I. L. Training: Program development and evaluation. Monterey, CA: Brooks/Cole, 1974.

GOLDSTEIN, I. L. The pursuit of validity in the evaluation of training programs. Human Factors, 1978, 20(2), 131-144.

GUSTAFSON, H. W. Job performance evaluation as a tool to evaluate training. Improving Human Performance Quarterly, 1976, 5(3-4), 133-152.

HAHN, R. Justifying HRD: The cost/benefit issue in context. In J. Alden (Ed.), Critical research issues in human resource development. Madison, WI: American Society for Training and Development, in press.

HARLESS, J. An ounce of analysis. McLean, VA: Harless Performance Guild, 1974.

HINRICHS, J. R. Personnel training. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.

KIRKPATRICK, D. L. Techniques for evaluating training programs: Part 3, Behavior. Journal of the ASTD (now Training and Development Journal), 1960, 14(1), 13-18.

MAGER, R., & PIPE, P. Analyzing performance problems. Belmont, CA: Fearon Press, 1970.

MARKLE, S. M. Evaluating instructional programs: How much is enough? NSPI Journal, 1976, 15(2), 1 ff.

MIKESELL, J. L., WILSON, J. A., & LAWTHER, W. Training program and evaluation model. Public Personnel Management, 1975, 4(6), 405-411.

ROTHE, H. F. Output rates among industrial employees. Journal of Applied Psychology, 1978, 63(1), 40-46.

SCRIVEN, M. An introduction to meta-evaluation. Educational Product Report, 1969, 2(5), 36-38.

SCRIVEN, M. Evaluation thesaurus. Inverness, CA: Edgepress, 1980.

SHOEMAKER, H. A. The evolution of management systems for producing cost-effective training: A Bell System experience. NSPI Journal, 1979, 18(8), 337 ff.

SMITH, M. E. An illustration of evaluation processes, design features, and problems. Improving Human Performance Quarterly, 1978, 7(4), 337-350.

SMITH, M. E. An illustration of evaluating post-training job performance. Improving Human Performance Quarterly, 1979, 8(3), 181-201.

SMITH, M. E., & CORBETT, A. J. Exchanging ideas on evaluation: 5. Pricing the benefits of training. NSPI Journal, 1977, 16(3), 16-18.

SMITH, M. E., O'CALLAGHAN, M., CORBETT, A. J., MORLEY, B., & KAMRADT, I. L. An example of meta-evaluation from industry. Improving Human Performance Quarterly, 1976, 5(3-4), 168-182.

SNYDER, R. A., RABEN, C. S., & FARR, J. L. A model for the systematic evaluation of human resource development programs. Academy of Management Review, 1980, 5(3), 431-444.

STUFFLEBEAM, D. L. Toward a technology for evaluating evaluation. Paper presented at the annual meeting of the American Educational Research Association, Chicago, April 1974. (ERIC Document Reproduction Service No. ED 090 319)

WALLACE, S. R., JR., & TWICHELL, C. M. An evaluation of a training course for life insurance agents. Personnel Psychology, 1953, 6(1), 25-43.