ORGANIZATIONAL BEHAVIOR AND HUMAN PERFORMANCE 5, 366-377 (1970)

The Effect of the Correctness of the Behavior Observed on the Accuracy of Ratings1

MICHAEL E. GORDON2

University of Tennessee

The present investigation was designed to determine the effects of two variables on the accuracy of ratings: (a) the correctness of the behavior rated; and (b) the amount of the rater's experience with the particular rating device. 118 managers rated nine specific dimensions of the performance of an "insurance agent's" prospecting technique. The results indicated that the ratings were 88.0% accurate on behavior which was actually performed correctly, but only 73.8% accurate on behavior which was actually performed incorrectly (p < .001). This finding was called the Differential Accuracy Phenomenon (DAP). Although experience with the ratings improved the overall accuracy, it had no effect on the DAP. The implications of the DAP for criterion development were discussed.

1 This investigation was supported by the Life Insurance Agency Management Association, Hartford, Conn. This article is based upon research undertaken as a partial requirement for a Ph.D. in psychology at The University of California. The author wishes to express his gratitude to Prof. Edwin E. Ghiselli for his generous contributions of personal time and guidance in the preparation of the dissertation upon which this report is based.

2 The author would like to thank Professors Cabot Jaffee, John Larsen, and Gerald Whitlock for their comments on earlier drafts of this manuscript.

It is well known that ratings are the most pervasive as well as the most ridiculed class of criteria used in psychological research. This seems to hold true across most areas of research within the field of industrial psychology, in which ratings often serve as indices of behavior. The vocabulary of the psychologist is replete with terms which describe various sources of error or bias in ratings that have been documented in past research: halo effect, errors of leniency, logical errors, and errors of central tendency. Despite the enormous amount of literature that has accumulated on ratings, little work has been performed on the accuracy of ratings. Most of the work described as pertinent to accuracy has been mislabeled and is actually concerned with the reliability or validity of ratings.

Because accuracy is a separate matter, the consequences of inaccuracy on the decisions for which the ratings were gathered will be different from the effects stemming from unreliability and other types of rating errors. This paper will take up the issue of the accuracy of ratings by (a) describing a source of inaccuracy in ratings; (b) documenting the procedure used to demonstrate its existence; and (c) delving into the consequences of its presence in a criterion used in psychological research.

The concept of accuracy. Accuracy is a term used to describe the relationship between a set of measurements obtained with a fallible scale of some sort and a corresponding set of measures derived from an accepted standard or less fallible scale. The term "accepted standard" is meant to signify a commonly accepted criterion which serves as a basis of comparison, calibration, or standardization for other instruments which purport to measure the same dimension as the standard. So, for example, an ordinary ruler is said to be accurate if the measurements made with it closely correspond to the results obtained from a comparable set of measuring operations performed with the standard foot rule housed in the Bureau of Weights and Standards in Washington, D.C. Similarly, ratings are accurate to the extent that they represent behavior which has actually occurred, as measured by a recording of the activity which transpired during the period of observation.

The key to the determination of accuracy is the availability of an accepted standard against which to compare the measures obtained with a particular instrument. Most studies purported to be concerned with the accuracy of ratings have actually turned out to be studies of reliability or validity. And, in the few studies which employ an accepted standard as a criterion, the analytical treatment of the data does not meet the operational requirements of accuracy. That is, only the degree of association between the fallible and less fallible measures was investigated, without regard to the degree of overlap in the two distributions of measurements. Accuracy, according to Guion (1965, p. 33), must be measured in terms of both the "strength and kind" of relationship between an index and an indisputable criterion.3

3 Accuracy is a function of the total amount of error inherent in an instrument. This includes both variable error, which is measured by an index of dispersion, and constant error, which is a function of the difference in location of the distributions of measurements obtained with the fallible and less fallible instruments. Reliability and validity, two distinct concepts which are confused with accuracy, are functions only of variable error. The index of dispersion in terms of which reliability is measured is the standard error of measurement, while the standard error of estimate is the index of dispersion used to measure validity. Reliability and validity are both necessary but insufficient conditions for accuracy to obtain, whereas accuracy is a sufficient condition for both reliability and validity. For a more detailed treatment of the statistical definition of accuracy, see Naylor (1967).
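To make footnote 3 concrete, the decomposition of total error into constant and variable components can be illustrated numerically. The following sketch is not from the article; the instruments, offsets, and sample sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
standard = rng.uniform(10.0, 20.0, size=200)   # readings from the accepted standard

# Instrument A: small variable error but a constant error (bias) of +2 units.
inst_a = standard + 2.0 + rng.normal(0.0, 0.2, size=200)
# Instrument B: no constant error but a larger variable error.
inst_b = standard + rng.normal(0.0, 1.0, size=200)

for name, x in (("A", inst_a), ("B", inst_b)):
    err = x - standard
    bias = err.mean()                    # constant error: difference in location
    variable = err.var()                 # variable error: dispersion
    total = (err ** 2).mean()            # total error = bias**2 + variable error
    r = np.corrcoef(x, standard)[0, 1]   # association alone ignores the bias
    print(f"{name}: r = {r:.3f}, bias = {bias:+.2f}, "
          f"variable = {variable:.2f}, total = {total:.2f}")
```

Instrument A correlates almost perfectly with the standard, so an index of association alone (the reliability and validity route) would favor it, yet its constant error makes its total error the larger of the two; only the total-error view captures accuracy in Guion's "strength and kind" sense.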

A review of the literature uncovered little information pertinent to the conditions that affect the accuracy with which behavior is perceived. One relevant study, conducted by Maier and Thurber (1968), investigated the accuracy of ratings of honesty as a function of the type of information received.

A model of perceptual accuracy. The accuracy of ratings of behavior can be portrayed graphically, as in Fig. 1, as the relationship of two dichotomous variables: the perceived performance and the actual performance. The two levels of these variables are the same in both cases: Did do, which means that a certain desirable behavior actually occurred or was perceived to occur; Did not do, which means that a certain desirable behavior actually was, or was perceived to be, omitted or performed incorrectly.

                            Perceived behavior

                          Did do        Did not do
Actual      Did do          I               II
behavior    Did not do      III             IV

FIG. 1. A model of perceptual accuracy.

The rater's judgment is correct in only two of the four situations described in Fig. 1. Cells I and IV of the diagram contain accurate judgments of correct and incorrect behavior, respectively. In both these situations the rater's perception of the performance agreed with the record of the actual performance. Cells II and III represent situations in which the rater's perceptions disagree with the facts. In Cell II the rater has falsely accused the ratee of doing something incorrectly when in fact he performed up to standard. This kind of mistake will be called Type A error. Cell III depicts a different type of error in ratings. Here the ratee is given credit for something which he has not in fact done correctly. This kind of rating error will be called Type B.

From the model it is evident that two unique accuracy scores may be computed on any given set of data: accuracy of observations of correct behavior, and accuracy of observations of incorrect behavior. Note that the manifestations of correct and incorrect behavior will be rated with different degrees of accuracy if the relative frequencies of Type A and Type B errors are different.
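In code, the two accuracy scores are straightforward tallies over the four cells. This is an illustrative sketch only; the observation list and its values are invented, not data from the study:

```python
# Each observation pairs the accepted standard ("actual") with the rater's
# judgment ("perceived"); True stands for "did do."
observations = [
    (True,  True),   # Cell I:   correct behavior, accurately rated
    (True,  False),  # Cell II:  Type A error (falsely accused)
    (False, True),   # Cell III: Type B error (undeserved credit)
    (False, False),  # Cell IV:  incorrect behavior, accurately rated
    (True,  True),
    (False, True),
]

on_correct = [p for a, p in observations if a]        # ratings of correct behavior
on_incorrect = [p for a, p in observations if not a]  # ratings of incorrect behavior

acc_correct = sum(p for p in on_correct) / len(on_correct)            # I / (I + II)
acc_incorrect = sum(not p for p in on_incorrect) / len(on_incorrect)  # IV / (III + IV)

print(f"accuracy on correct behavior:   {acc_correct:.1%}")    # 1 - Type A rate
print(f"accuracy on incorrect behavior: {acc_incorrect:.1%}")  # 1 - Type B rate
# A DAP appears whenever acc_correct exceeds acc_incorrect.
```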

The author has found evidence in two unpublished studies which indicates there is good reason to believe that Type B errors are far more predominant than Type A errors. Raters were asked to identify both the correct and incorrect aspects of an insurance agent's prospecting performance which had been recorded previously. The judges were provided with a checklist of behaviors which they were to observe and rate as being correct or incorrect. The results indicated that the accuracy of the judgments was far greater for those behaviors which were actually performed correctly than for those behaviors which were actually performed incorrectly. This finding was called the Differential Accuracy Phenomenon (DAP).

This result in the area of interpersonal perceptions is similar to other scattered findings about the relationship between perceptual accuracy and the correctness of the input information. Jacobs and Vandeventer (1968) found that first- and third-grade children were able to identify correctly completed samples of Raven's Coloured Progressive Matrices Test more accurately than incorrectly completed samples. Harris (1968) investigated the accuracy of product inspection operations. By reanalyzing the data in Harris's study it was determined that the relative frequency of Type B errors far exceeded that of Type A errors. This, again, seems to be a manifestation of the DAP.

In the three studies described above, the finding that accuracy appeared to be related to the correctness of the input was a by-product of the experiment. That is, the experimental design in each of these studies was not specifically contrived to test the hypothesis that identification of correct stimuli is easier than identification of incorrect stimuli. Consequently, one may cite a number of limitations of the previous work which should give one pause before accepting the DAP as a reliable phenomenon. For example, the sample size in each of the experiments conducted by the author was probably too small (N = 9) to establish the existence of a new rating phenomenon. In both the Jacobs and Vandeventer (1968) and the Harris (1968) studies, the number of incorrect samples viewed differed from the number of correct samples. Therefore, the opportunity to commit the two types of errors differed, as did the corresponding reliability of the two accuracy scores. Obviously, a more systematic and sophisticated study of the DAP is desirable which would not be limited by sample size or experimental design.

METHOD

Subjects. The Ss were experienced managers from field and home offices of more than 30 different insurance companies who attended two Schools in Agency Management (SAM). Data were collected from 56 managers attending a SAM in Philadelphia and 62 managers attending a Dallas SAM. The Ss completed an extensive biographical information form which provided data on their backgrounds and experience in the insurance business. Personal history data relating to each S's experience with standardized rating procedures also were collected.

Procedure. The Ss were asked to evaluate the performance of a "life insurance agent" in 19 simulated "agent"-"prospect" conversations presented on video tape. The setting for each of the 19 conversations was an "agent's" office. The "agent" was viewed making a series of telephone approaches to 19 different "prospects," none of whom was ever seen on the screen. Two professional actors played the roles of "agent" and "prospects." The "agent" was always played by the same actor. All Ss in both SAMs judged the same set of 19 conversations, which were witnessed in the same order.

The Ss were asked to judge how well the "agent's" approach conformed to a standard telephone approach. The Ss were given a set of scripts which contained the lines the "agent" ought to say exactly if he made his approach perfectly on each of the 19 calls. These written lines comprised the standard against which the "agent's" actual performance was compared. The lines the "prospect" actually would say on each call were also included in the scripts. The "prospect's" spoken dialogue on each call did not deviate from what was written in the scripts. The "agent's" spoken dialogue did differ significantly on occasion from what was prepared in the scripts. It was the Ss' job to judge how and when the "agent's" performance departed from the prepared text. No feedback was provided to the S after each trial regarding the accuracy of his observations. The experiment was conducted as if it were an integral part of the SAM curriculum. The Ss were told that their ratings would be scored like a test and that they would receive individual feedback on their performance the next day.

The first three calls in the series were practice calls. These were included to enable the Ss to get acquainted with the requirements of the rating task and to become familiar with the rating material. Only the last 16 calls were scored.

Rating form. The rating form used was a checklist which contained nine items. Each of the items, all of which were answered with a "Yes" or "No," concerned the presence of certain desirable types of behavior in the "agent's" performance. The items all were phrased so that a "Yes" response represented the perception of a desirable form of behavior. The nine specific aspects of the "agent's" performance which were rated were:

1. Presence of a smile before speaking to the prospect.
2. Use of the proper remarks to introduce himself to the prospect and request an interview.
3. Use of the correct response to the different objections offered by the prospect.
4. Promptness of response to the prospect's objections.
5. Perseverance; hanging on until the interview was granted or three objections received.
6. If granted, repeating the time of the appointment before hanging up.
7. Addressing the prospect by his correct name.
8. Thanking the prospect regardless of the outcome of the conversation.
9. Concentration on selling the interview, not insurance.

The Ss were required to complete the checklist during and immediately after each call.

Call content. The dialogue for each call was written so that different proportions of what the "agent" said were correct in terms of the task standards. The number of correct behaviors on any given call was either 9, 7, 5, 3, or 1. The particular behaviors keyed correct were chosen independently of the number of behaviors keyed correct. Table 1 contains a summary of the structure of the 16 calls.

TABLE 1
SUMMARY OF THE STRUCTURE OF THE 16 SCORED CALLS

                        Block I    Block II    Total
Correct behaviors          34         36         70
Incorrect behaviors        38         36         74

From column 3 of this table it can be seen that the total numbers of correct and incorrect behaviors over the 16 calls were approximately equal; there were 70 correct behaviors and 74 incorrect behaviors. The sequence of calls was designed so that the first eight calls were nearly equivalent to the last eight calls in terms of the frequency of correct and incorrect behaviors. From Table 1 it can be seen that these frequencies were approximately equal in the first and second blocks of eight calls. Because of these similarities, Block II was considered a complete replication of Block I. Finally, by equalizing the occurrence of right and wrong behavior for each dimension, the probability was reduced that the Ss might learn to identify any given dimension of the "agent's" performance as generally correct or incorrect. Table 2 contains a breakdown of the number of actual correct and incorrect occurrences of each of the nine rated aspects of the "agent's" performance. With the exception of questions 5 and 6, the number of times the "agent's" actual performance was correct on a particular dimension was very nearly matched by the number of times that particular aspect of his performance was incorrect.

TABLE 2
DISTRIBUTION OF CORRECT AND INCORRECT OCCURRENCES FOR EACH DIMENSION OF BEHAVIOR

                      Occurrence
Dimension       Correct      Incorrect
    1              8             8
    2              8             8
    3              7             9
    4              8             8
    5             12             4
    6              5            11
    7              7             9
    8              7             9
    9              8             8

RESULTS

Comparability of the two samples. The biographical data for the Ss in each SAM school were averaged in order to examine the comparability of the two samples. Most of the observed differences between the samples were too small to be of practical significance and, consequently, tests of statistical significance were not performed on the background data. Because of the marked similarity of the two groups of managers, the decision was made to pool the data from the two schools and disregard the sample as a variable in subsequent statistical analyses.

Calculation of accuracy scores. In order to determine the accuracy of each S's observations, the ratings were scored by comparing them with the key used to plan and develop the actual dialogue of the stimulus material. This key contained the outline of the "agent's" actual conduct in terms of the correctness of his behavior on each of the nine rated dimensions. If a S's rating agreed with the corresponding entry in the plan, it was scored correct; if it disagreed, it was scored incorrect. This scoring procedure was used to determine the accuracy of all the ratings for each S. However, two elaborations of this basic scoring scheme were employed to develop special subscores on each S's data. First, the ratings on each call were subdivided on the basis of the correctness of the actual behavior. This procedure allowed computation of two accuracy scores: one score for the accuracy of the ratings of behavior which, according to the key, was actually performed correctly, and one score for the accuracy of ratings of behavior which was actually performed incorrectly. Then, these two accuracy scores were averaged separately across blocks of eight calls. The mean percentage accurate for ratings of correct behavior, and the mean percentage accurate for ratings of incorrect behavior, were computed for calls 1 through 8 and calls 9 through 16 for each S.

Analysis of variance. Before the data were analyzed, certain preliminary checks and normalizing procedures were conducted. (a) Because their data were incomplete, seven Ss were discarded from the sample. This brought the total N to 111. (b) Due to violations of both the assumption of normality and of homogeneity of variance, it was considered necessary to transform the data. An inverse sine transformation was used.4 The transformed scores were not significantly skewed and were well within the expanded limits of homogeneity of variance proposed by Norton (1952) and Boneau (1960). An analysis of variance of a 111 × 2 × 2 factorial design was conducted on both the original and the transformed scores. McNemar's (1962, p. 333) case XII mixed model design was used as the basis for the analysis. The results of these two analyses were identical. Because the outcome of the statistical analysis was not affected by violations of the assumptions underlying the F test, and because the original scores are more readily interpretable than the transformed scores, only the results based upon the original scores will be reported. A summary of the results of the analysis is contained in Table 3.

4 The transformation used was of the following form: x' = arcsin √x.
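A minimal sketch of this normalizing step, assuming the standard arcsine square-root form given in footnote 4 (the proportions below are invented for illustration):

```python
import numpy as np

# Per-S accuracy scores are proportions, so their variance depends on their
# mean; the inverse sine transformation x' = arcsin(sqrt(x)) stabilizes the
# variance before the analysis of variance.
accuracy_scores = np.array([0.880, 0.738, 0.794, 0.824])  # illustrative values

transformed = np.arcsin(np.sqrt(accuracy_scores))  # result in radians
print(np.round(transformed, 3))
```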

TABLE 3
SUMMARY OF ANALYSIS OF VARIANCE

Source of variance             SS        df        MS          F
DAP                        22398.27       1    22398.27    203.42*
DAP × Ss                   12111.98     110      110.11
Experience                  1340.54       1     1340.54     37.09*
Experience × Ss             3975.21     110       36.14
DAP × Experience              37.48       1       37.48      0.96
DAP × Experience × Ss       4300.77     110       39.10
Ss                         19016.41     110      172.88
Total                      63180.66     443

* Significant at α = .001.
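The F column of Table 3 follows directly from the mixed-model error terms: each within-Ss effect is tested against its own interaction with subjects. A quick arithmetic check using the MS entries above:

```python
# MS pairs taken from Table 3: (MS for the effect, MS for effect x Ss).
effects = {
    "DAP":              (22398.27, 110.11),
    "Experience":       (1340.54, 36.14),
    "DAP x Experience": (37.48, 39.10),
}
for name, (ms_effect, ms_error) in effects.items():
    print(f"{name}: F(1, 110) = {ms_effect / ms_error:.2f}")
# Prints 203.42, 37.09, and 0.96, matching the F column of the table.
```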

The difference in accuracy between the observations of correct and incorrect behavior was statistically significant and in the direction indicating the existence of a DAP. The mean percentage accurate, averaged across blocks, for ratings of behavior which was actually correct was 88.0%, while the same statistic based upon ratings of incorrect behavior was 73.8%. The magnitude of the DAP in this particular study was therefore 14.2%.

There was a statistically significant increase in overall accuracy from the first block of eight calls to the second block of eight calls. The mean accuracy on calls 1 through 8 was 79.4%, whereas the mean accuracy on calls 9 through 16 was 82.4%. Accuracy was improved significantly by unguided, unreinforced experience with the rating form.

[Figure omitted: mean percentage accuracy (y-axis, 70-100) plotted across Blocks I and II.]

FIG. 2. The DAP × Experience interaction.

The fact that the DAP × Experience interaction was not significant indicates that the magnitude of the DAP did not differ in Blocks I and II. Ratings of both correct and incorrect behavior were improved about equally by experience with the rating form. Mere use of the rating form did not reduce the magnitude of the DAP, although it did improve overall accuracy.

DISCUSSION

The results of the present investigation clearly point to the existence of a sizeable DAP in the rating situation described. It is apparent that manifestations of correct and incorrect behavior were not rated with the same degree of accuracy with the particular checklist used in this study. These results support the earlier unpublished work of the author with a similar set of ratings, the work of Jacobs and Vandeventer (1968) with Raven's Coloured Progressive Matrices Test, and the work of Harris (1968) on product inspection. However, unlike those studies, the design of the present investigation was specifically contrived to measure the DAP. Hence, the results are not subject to the methodological limitations of the previous research.

Nevertheless, one must be careful about generalizing these results to other rating situations.

Without additional research it will be difficult to determine whether the DAP is a generalizable phenomenon, or whether it is peculiar to the checklist, the type of rater, the behavior rated, or the rating situation employed in this study. Furthermore, additional research should be conducted to determine the conditions which affect the magnitude of the DAP. For example, it is not known whether providing the rater with feedback concerning the accuracy of his ratings after each call would have changed the results of this study.

DAP and leniency error. At first blush the DAP would seem to be quite similar to the error of leniency. That is, in both situations the rater appears to be giving the ratee the "benefit of the doubt." This tendency toward beneficence could be reflected operationally in the mean and skew of the favorability ratings assigned by a rater in a number of rating situations, or in the predominance of errors of observation characterized by the raters giving the ratee credit for something which he did not, in fact, do correctly. Operationally, however, the DAP and the error of leniency are distinct concepts.5 Nevertheless, despite these operational differences, the question of whether any behavioral link exists between the two concepts remains largely unanswered. It is not known whether lenient raters are also the most susceptible to the DAP. If the DAP and leniency error are both manifestations of a mild or tolerant disposition, those individuals whose ratings are generally most favorable should also display the largest DAP. This proposition might be tested in the future by comparing the DAP measured in two groups of raters who have been selected on the basis of their record as either lenient or severe judges in past rating situations.

Implications of the DAP. As previously mentioned, ratings are frequently used as criteria in all areas of psychological research. Hence, any source of inaccuracy in them is likely to affect the measurement of the relationships under study. The purpose for which ratings are collected will determine the extent of their usefulness given their susceptibility to the DAP.

5 The first and most obvious difference between the DAP and leniency error pertains to the dependent variable chosen to measure each of the phenomena. Whereas the DAP is defined in terms of accuracy, i.e., the average percentage correct, leniency errors are usually measured in terms of the mean or skew of a rater's judgments of favorability. Based upon this difference in the metric used to define each of the concepts, the DAP and leniency error also differ in terms of the norms used to make each score meaningful. Leniency error is defined by comparing one rater versus the average of many raters, whereas the DAP is defined by comparing one rater versus the values derived from an accepted standard.

One clear example of the costly and detrimental consequences of the DAP is in the area of training evaluation. In the infrequent instances when training programs are subjected to validation, ratings of some sort are the commonest form of criteria employed. Properly designed validation studies incorporate at least one control group composed of individuals who do not receive the training program being studied. The research hypothesis generally tested is that the performance of the trained group will be better than that of the control group. It is assumed that the difference between these groups will be evident on the criterion. Typically, the trained group is expected to receive many more favorable ratings on the various dimensions of performance than the control group.

If the DAP is a common source of inaccuracy in ratings, the performance of both trained and control groups is likely to receive higher ratings than it deserves due to the predominance of Type B errors. However, the probability that Type B errors will occur in the ratings used to compare trained and control subjects generally will not be equal in the two groups. Behavior which is actually incorrect will be less prevalent in the trained group if the training is really effective. Hence, the opportunity to commit a Type B error, i.e., to overrate performance, will be greater for the controls than for the treatment group. Therefore, the effect of the DAP will be in the direction of drawing the control group performance closer to that of the trained group, thereby tending to obscure any significant training effect, as the sketch below illustrates.

It is important to recognize that, in terms of their usefulness as criteria, it is not a major concern that ratings contain some reasonably small amount of error. As long as the error is not systematic, i.e., on the average Type A errors occur as frequently as Type B errors, and as long as the magnitude of the errors is within acceptable limits, the researcher will probably not be drawn consistently into wrong conclusions about training's effectiveness. The trainer's measures will be less reliable and valid as they become more inaccurate, and, consequently, his statistical tests will be less precise. However, the criterion will not consistently favor the acceptance of the null hypothesis of no difference between the control and trained groups. It is only if raters tend to commit more Type B errors than Type A errors, the DAP, that researchers must be concerned about the usefulness of ratings.

It is important to learn something about the DAP before trainers continue to rely so heavily upon ratings as criteria in validation research. In order to conduct a study of the effectiveness of a training program, it is necessary first to know the properties of the criteria employed to test the impact of the program. Naturally, any criterion susceptible to the DAP is a poor choice of measure with which to validate the training program.
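To illustrate the attenuation argued above, here is a small simulation. It is a sketch under invented assumptions; the Type A and Type B error rates and the groups' true rates of correct behavior are chosen for illustration, not estimated from the study:

```python
import random

random.seed(1)

def perceived_did_do(actual_correct, p_type_a=0.12, p_type_b=0.26):
    """Rater's judgment given the true behavior; the Type B rate
    (undeserved credit) exceeds the Type A rate, i.e., a DAP."""
    if actual_correct:
        return random.random() >= p_type_a  # Type A: correct act marked wrong
    return random.random() < p_type_b       # Type B: incorrect act marked right

def mean_favorable_rating(p_correct, n=10_000):
    """Mean proportion of behaviors rated 'did do' for a group whose
    true rate of correct behavior is p_correct."""
    hits = sum(perceived_did_do(random.random() < p_correct) for _ in range(n))
    return hits / n

trained, control = 0.85, 0.55          # assumed true rates of correct behavior
true_gap = trained - control
observed_gap = mean_favorable_rating(trained) - mean_favorable_rating(control)
print(f"true gap: {true_gap:.2f}, observed gap: {observed_gap:.2f}")
# The control group emits more incorrect behavior and therefore profits more
# from Type B errors, pulling its ratings toward the trained group's.
```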

The DAP also has important implications for supervision, especially in the area of joint field work. Typically, joint field work is undertaken to allow the supervisor to acquire first-hand information about the characteristics of a new employee's on-the-job behavior. It is the supervisor's job to pick out the strong and weak points of the new man's work and bring them to the attention of the worker. If the DAP is a widespread phenomenon, it is likely that the supervisor doing joint field work will overlook many examples of poor or inadequate work. If such is the case, the supervisor will not provide feedback to the worker which will cause him to change his method of doing the job. Instead, it seems more likely that by not saying anything about the poor aspects of performance which went unnoticed, the supervisor will be implicitly reinforcing these behaviors. That is, the worker probably will interpret the supervisor's silence as tacit approval of or satisfaction with that particular aspect of his work. Obviously, this is an undesirable consequence of the DAP.

There seems to be no question that consideration should be given to the conditions and consequences of inaccuracy in ratings. More research on this topic seems imperative because, at present, very little information is available about the sources and effects of inaccuracy in interpersonal perceptions. Consequently, it is impossible to judge the actual direction or true magnitude of an untold number of observed relationships involving ratings. This is why the generality and causes of the DAP ought to be studied more thoroughly. Without an understanding of the accuracy of ratings, the actual meaning and usefulness of much research will remain unknown.

REFERENCES

Boneau, C. A. The effects of violations of assumptions underlying the t test. Psychological Bulletin, 1960, 57, 49-64.
Guion, R. M. Personnel testing. New York: McGraw-Hill, 1965.
Harris, D. H. Effect of defect rate on inspection accuracy. Journal of Applied Psychology, 1968, 52, 377-379.
Jacobs, P. I., & Vandeventer, M. Progressive matrices: an experimental, developmental, nonfactorial analysis. Educational Testing Service, Princeton, N.J. Research Bulletin RB-68-3, Jan. 1968.
Maier, N. R. F., & Thurber, J. A. Accuracy of judgments of deception when an interview is watched, heard, and read. Personnel Psychology, 1968, 21, 23-30.
McNemar, Q. Psychological statistics. New York: Wiley, 1962.
Naylor, J. C. Some comments on the accuracy and validity of a cue variable. Journal of Mathematical Psychology, 1967, 4, 154-161.
Norton, D. W. An empirical investigation of some effects of non-normality and heterogeneity on the F-distribution. Unpublished doctoral dissertation, State University of Iowa, 1952.

RECEIVED: June 30, 1969