Observer bias


Applied Ergonomics, March 1975, 6.1, 3-8

E.C. Poulton
Assistant Director, Medical Research Council, Applied Psychology Unit, Cambridge.

Subjective assessments are quick and easy to obtain. They give answers to questions which are difficult to answer otherwise. Unfortunately subjective assessments suffer from two kinds of observer bias. First, the assessment selected tends to be closer than it should be to the middle of the range of available assessments. This is called a range effect. Secondly, a general subjective assessment not tied to a specific task may be determined by common knowledge instead of by the details of the question under consideration. If so, the assessment may give a wrong answer. Clearly, subjective assessments should be used only when it is impossible to obtain valid objective measures of performance. They need to be interpreted with extreme caution.

Engineers and physicists define comfort zones for the temperature of offices and factories. They specify the limits of acceptability for glare, for noise and for vibration. How well can they do this?

The Consumer Association's magazine Which? gives ratings for the quality of different brands of goods bought in the shops. It uses a code of from 1 to 5 blobs, the more blobs the better. Managers have to rate the worth of their subordinates. Scientific managers have to rate the worth of their individual scientists. Scientists are asked to rate the merits of other scientists' applications for research grants. Is it possible to make unbiassed assessments of this kind?

The simple answer is that no method of measurement is perfect when difficult judgements have to be made. In this article some of the problems of obtaining valid subjective assessments are discussed, and subjective assessments are compared with objective measures of performance. It becomes clear that objective measures should be used whenever possible.

Range effects in assessing noisiness

Table 1 (Robinson, Copeland and Rennie, 1961) shows how the just acceptable noise level depends upon the range of noises heard. Unpractised observers sat beside the London to Brighton road. They estimated the noisiness of vehicles climbing a hill, using the 6 category rating scale (Fig 1). The greatest noise made by each vehicle was measured, but the observers were not told the noise levels. The middle row of the table shows that the peak noise levels ranged from 66 to 97dB(A). The average transition point between the two middle ratings "acceptable" and "noisy" was found to lie at 82dB. This is exactly halfway between 66 and 97.

Table 1: The just acceptable noise made by road vehicles and the range of noises heard. (Data from Robinson, Copeland and Rennie, 1961.)

                        Intensity in dB(A)
Investigator    Lowest    Acceptable/Noisy transition    Highest
Weber             52                 73                     84
Robinson          66                 82                     97
Andrews           86                 90                     97

Fig 1. Six-category rating scale for vehicle noise. [Categories A to F. The two middle categories, C and D, are labelled "Acceptable" and "Noisy". The scale also carries the labels "Quiet" and "Excessively noisy".]

Robinson et al (1961) quote two previous investigations in which the middle of the rating scale is found to lie near the middle of the range of noise intensities. A Swiss experiment by Weber and Lauber has less intense noise levels. The top row of the table shows that they range from 52 to 84dB. The middle rating lies at 73dB. This is a little above the middle of the range, which lies at 68. But 73 is a good deal less than the 82 found by Robinson with more intense noise levels. The other investigation quoted by Robinson is an American experiment by Andrews and Finch. The bottom row of the table shows that Andrews has the most intense noise levels, 86 to 97dB. Here the middle rating lies at 90dB. This is a little below the middle of the range, which lies at


92. But 90 is a good deal more than the values found in the other two investigations.

Table 1 shows that the mid-point of the subjective estimates depends upon the physical range of noises used in the experiment. Intense noises raise the intensity of the just acceptable noise level. Noises of low intensity lower the intensity of the just acceptable noise level. The observer centres his range of responses nearer to the middle of the range of noise intensities than he should do. The greatest just acceptable noise level given by Andrews, 90dB, is 17dB greater than the 73dB given by Weber. A difference of 17dB represents about seven times the amplitude of the noise, or about 50 times the power. Subjectively, the 90dB noise is about four times as loud as the 73dB noise. This is a large discrepancy.

A similar range effect is reported by Bowsher, Johnson and Robinson (1966) at the 1964 Farnborough Air Show. One group of observers judged the noisiness of aircraft from an assembly hall which was 500 m from the landing end of the runway, and only 100 m from the glidepath. Another group made similar judgements at the same time from a church hall which was 1000 m from the landing end of the runway, and 900 m from the glidepath. In this part of the investigation, the two middle categories of the 6 point rating scale were labelled "moderate" and "noisy". The bottom row of Table 2 shows that near the glidepath the noises ranged from 55 to 101dB(A). The transition point between the two middle ratings "moderate" and "noisy" lies at 79dB. This is close to the middle of the range of intensities, 78dB. The top row of the table shows that further from the glidepath the noises ranged from 45 to 83dB. Here the transition point between the two middle ratings lies at 69dB. Like the Weber data of Table 1, this is a little above the middle of the range of intensities, which lies at 64. But 69 is 10dB less than the 79dB of the group of observers near the glidepath.
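The arithmetic behind these comparisons can be checked with a short script. What follows is a minimal sketch, not part of the original analysis: the helper names (`midpoint`, `amplitude_ratio`, `power_ratio`) are illustrative assumptions, and the figures are those quoted in the text for the road vehicle and aircraft studies. It relies only on the standard decibel relations, amplitude ratio = 10^(dB/20) and power ratio = 10^(dB/10).

```python
# Illustrative sketch: helper names are assumptions, data are quoted from the text.

# (lowest, transition between the two middle ratings, highest), all in dB(A)
STUDIES = {
    "Weber": (52, 73, 84),
    "Robinson": (66, 82, 97),
    "Andrews": (86, 90, 97),
    "Bowsher, far from glidepath": (45, 69, 83),
    "Bowsher, near glidepath": (55, 79, 101),
}


def midpoint(lowest, highest):
    """Middle of the physical range of noise intensities."""
    return (lowest + highest) / 2


def amplitude_ratio(db):
    """Sound pressure (amplitude) ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20)


def power_ratio(db):
    """Sound power ratio corresponding to a level difference in dB."""
    return 10 ** (db / 10)


for name, (lowest, transition, highest) in STUDIES.items():
    mid = midpoint(lowest, highest)
    print(f"{name}: midpoint {mid:.1f} dB, transition {transition} dB, "
          f"transition minus midpoint {transition - mid:+.1f} dB")

# Difference between the Andrews and Weber transition points
diff = STUDIES["Andrews"][1] - STUDIES["Weber"][1]
print(f"{diff} dB: amplitude x {amplitude_ratio(diff):.1f}, power x {power_ratio(diff):.0f}")
```

The midpoints for the Robinson and Andrews ranges come out at 81.5 and 91.5 dB, which the text rounds to 82 and 92. The 17dB difference corresponds to an amplitude ratio of about 7.1 and a power ratio of about 50, in line with the figures quoted above.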
Table 2: The transition between aircraft noises judged moderate and noisy and the range of noises heard. (Data from Bowsher, Johnson and Robinson, 1966, Figure 5.)

Distance from            Intensity in dB(A)
glidepath       Lowest    Moderate/Noisy transition    Highest
Far               45                 69                    83
Near              55                 79                   101

As with the estimates of the noisiness of road vehicles, the middle of the range of ratings lies too close to the middle of the range of noise intensities. There are numerous laboratory examples of range effects found in estimating sensory magnitudes. They come from many different sensory dimensions (Poulton, 1968). This is most depressing for applied scientists, who would like to obtain direct measures of just acceptable levels of noise and of other environmental stressors.

Range effects are not restricted to subjective assessments. They are found also in experiments measuring performance when everyone performs all the experimental conditions (Poulton, 1973a,b). But the range effects of Tables 1 and 2 are a good deal larger than the range effects which are usually found with measures of performance. Also by using separate groups of people for each experimental condition, it is usually possible to exclude range effects from measures of performance, whereas it is virtually impossible to get rid of range effects in subjective assessments. This is because in order to respond, the observer has to know the range of responses which he is to use. The knowledge is bound to affect the responses which he makes. Even the very first response lies a little closer than it should to the middle of the available range of responses (Poulton, 1974).

To avoid range effects it is necessary to restrict each observer to a single assessment, and to only one of two possible answers. For example, he can be asked to judge whether one particular noise intensity is acceptable or noisy. Restricting the observer to a single noise intensity releases him from the influence of the range of noise intensities. Restricting him to one of two possible answers releases him from the influence of the range of possible answers.

A comparison between general subjective assessments and objective measures of wind disturbance

In the experiment of Table 3 (Poulton, Hunt, Mumford and Poulton, 1967) 10 housewives were filmed while walking into a wind tunnel with a wind speed of 4 m/s coming from their right. They wore inked pads on the soles of their shoes, and walked along a paper path about 3.5 m long and 1 m wide. From the film it was possible to detect any momentary loss of balance. From their footmarks it was possible to measure the extent to which they were blown off course by the wind. The housewives then had to walk obliquely across the wind tunnel and back again five times, trying to step on the footmarks made in their previous attempts. From the extra width of their footmarks it was possible to assess the extent to which the wind made them walk inconsistently.

After a number of other tests in the wind the housewives had to answer the following question: "Which did the wind interfere with most? The filmed walk towards the camera, or the walk afterwards in my own footsteps." This was followed by a number of general assessments of the wind made by marking a 100 mm line. The assessment most closely related to walking in the wind was that shown in Fig 2.

Another group of 10 housewives followed exactly the same procedure but with a windspeed of 8.5 m/s. Each group of housewives performed two sets of tests, one set in turbulent wind, 12% turbulence, the other set in wind without added turbulence, 0.5%. The order of the two conditions was balanced for each group of housewives. The effect of the order of the two conditions is not shown in the table. It is discussed later. The results in Parts 1 and 4 of the table are also discussed later.

The second and third parts of Table 3 show the two best quantitative measures of the effect of the wind upon walking. Measure No 2 is the average deflection off course after walking two steps into the wind tunnel with the wind coming from the right. The average deflection with the faster more turbulent wind is 7.1 cm. This is reliably greater than any of the other three values for a slower wind

and less turbulence. The average deflection off course after walking two steps into the wind tunnel is therefore a sensitive measure of both windspeed and turbulence. After a third step into the wind, the effect of turbulence is no longer reliable; there is only a reliable effect of windspeed.

Table 3: Subjective and objective measures of walking in wind. (Data from Poulton et al, 1967.)

                                                        Added         Windspeed (m/s)
No   Measure                                            turbulence    4 (Group 1)    8.5 (Group 2)
1    Observed momentary loss of balance on              No              0 a            30 a
     walking into the wind, %                           Yes             0 a            30 a
2    Measured mean deflection in cm after               No              0.5            4.2
     two steps in wind                                  Yes             1.7            7.1 b
3    Measured mean inconsistency in walking in          No              2.6            2.6
     wind. Extra width of sole prints (cm)              Yes             2.6            3.5 b
4    Wind stated to interfere most with walking         No             35             35
     into the wind (not with walking in own             Yes             5             20
     footmarks), %
5    Assessed walking in wind,                          No             54.6           56.0
     Sure (0) - Unbalanced (100)                        Yes            56.1           69.1 b

a Reliable effect of windspeed at two levels of turbulence combined
b Reliable effect of both windspeed and turbulence

Measure No 3 shows the inconsistency of the housewives in walking on their own footmarks. The average extra width of sole prints of 3.5 cm for the faster more turbulent wind is reliably greater than any of the other three values for a slower wind and less turbulence. There is a similar reliable effect using the corresponding measure of the extra width of heel prints. But the measure of the extra length of heel prints is not reliable.

The bottom part of Table 3 shows the general subjective assessment of walking, sure - unbalanced. The value of 69.1 for the faster more turbulent wind is again reliably greater than any of the other three values. Thus this general subjective assessment also is a sensitive quantitative measure of the effect of windspeed and turbulence combined. The table shows that it is as sensitive as the best of the measures of performance.

Out of the 13 general subjective assessments made on the wind, two not shown in Table 3 are more sensitive than any of the objective measures. They show reliable differences between all four wind conditions (Poulton et al, 1967). To an applied psychologist concerned principally with the difficulties of obtaining sensitive objective measures of performance, these two general subjective assessments appear to be almost too good to be true.

Subjective assessments are also considerably quicker and easier to obtain than are objective measures of performance. Clearly, subjective assessments have many advantages. This is why they are used so extensively.

The association between the general assessment of unbalance and measures of unbalance in walking

The subjectively assessed unbalance at the bottom of Table 3 corresponds well with the measured deflection on walking into the wind in the second part of the table. The subjectively assessed unbalance also corresponds well to the measured inconsistency of the housewives in walking on their own footmarks in the third part of the table. By all three criteria, the faster more turbulent wind is reliably worse than the slower and less turbulent winds. The agreement is gratifying.

However there is no reliable association between the extent of an individual housewife's assessed unbalance and the measures of her walking in the wind. Nor is there a reliable association between the size of the difference in her two subjective assessments, and the size of the difference in the objective measures for the two wind conditions she received, one more turbulent than the other. The distribution of the sizes of the coefficients of rank correlation (tau), between the individual subjective assessments of unbalance and the individual objective measures of unbalance, is not different from that to be expected by chance (Poulton et al, 1967, Table 8).
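A rank correlation of the kind referred to here can be computed as in the following minimal sketch. The individual scores are not given in this article, so the data below are hypothetical and invented purely for illustration. The function implements Kendall's tau-a, which assumes no tied ranks; it is not necessarily the exact variant used by Poulton et al.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both measures order this pair the same way
        elif product < 0:
            discordant += 1   # the two measures order this pair oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five observers (NOT data from the experiment).
assessed_unbalance = [62, 71, 55, 80, 68]          # mark on the 100 mm line
measured_deflection = [7.9, 5.2, 6.8, 7.0, 8.1]    # cm after two steps

tau = kendall_tau_a(assessed_unbalance, measured_deflection)
print(f"tau = {tau:.2f}")
```

With these made-up values the concordant and discordant pairs balance exactly, so tau is zero: the group means could still differ between wind conditions even though individual assessments tell nothing about individual performance.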

Fig 2. Method for obtaining subjective assessments. The observer marks the 100 mm line at a position which corresponds to his assessment. [The line is headed "Walking" and runs from "Sure" at one end to "Unbalanced" at the other.]


A perfect association between the individual subjective assessments and the individual objective measures of unbalance is not to be expected. Each housewife walks into the wind tunnel, does her test of consistency in walking, and stands and walks during some of her other tests. Her subjective assessments of unbalance must be affected by all these experiences. Thus they will not correlate perfectly with her measures of walking derived from a single test. However, there should be some association between the individual subjectively assessed unbalance and the individual measures of walking in the wind, if the subjective assessments are based upon the housewives' experiences during the experiment.

The complete lack of association suggests that the subjective assessments of unbalance and the objective measures of unbalance may reflect different aspects of the wind. Perhaps the subjective assessments reflect common knowledge about wind and its effects upon people, not simply the housewives' experience of the wind during the experiment. The subjective assessments may reflect what the housewives know ought to happen, not simply what does actually happen.

A similar lack of association is apparent in the data of Sadoff, McFadden and Heinle (1961) from an experiment on simulated aircraft with various degrees of instability. The 6 test pilots have to fly the simulated aircraft while their performance is being measured. They then assess the handling characteristics of the aircraft on a 10 point scale. There is no reliable correlation between the individual subjective assessments and the individual measures of tracking performance.

The validity of subjective reports

In the wind experiment, the lack of association between the individual subjective and objective measures is restricted to the general subjective assessments of unbalance in walking. It does not hold for the answers to a specific question. Part 1 of Table 3 shows that 30% of the housewives who entered the faster windspeed of 8.5 m/s appeared to lose their balance for an instant at both levels of turbulence. Momentary loss of balance is defined as a sudden quick postural movement, which is superimposed upon the slower rhythmical movements of walking. Part 4 of the table gives the housewives' answers to the question "Which did the wind interfere with most? The filmed walk towards the camera, or the walk afterwards in my own footsteps." The filmed walk into the wind is stated to be interfered with most on 35% or less of the trials.

Table 4 shows the relationship between the observed momentary loss of balance with the faster windspeed and the judged interference. The six instances of momentary loss of balance on the left of the table came from four housewives. Two housewives appeared to lose their balance momentarily in both conditions of turbulence. The middle section of the table shows that of the 19 reports of most interference, in 14 the wind is reported as interfering most with the test of consistency. Only five times is the filmed entry into the wind tunnel reported as interfered with most. But in four of these cases the housewife shows a momentary loss of balance in the filmed entry. This compares with only two instances of momentary loss of balance in the filmed entry when consistency of walking is said to be interfered with most. The difference is reliable, provided all the 19 instances in the table are taken as comparable to each other. Thus when a housewife momentarily loses her balance on entering the wind, she is likely to report that the wind interferes more with her entry, than with her subsequent consistency in walking in her own footsteps.

Table 4: Momentary loss of balance on entering a wind tunnel compared with reported most interference and assessment of unbalance. (Data from Poulton et al, 1967.)

Filmed momentary                Reported most interference    Assessment of walking
loss of balance on entry        Entry      Consistency        Sure (0) - Unbalanced (100)
Yes        6                      4            2                       63
No        14                      1*          12*                      62

* One housewife once reported no interference in either task

However, the right side of Table 4 shows that there is no association between momentary loss of balance on first entering the wind and the subsequent individual assessment of walking, sure - unbalanced. The individual assessment of walking, sure - unbalanced, is not associated reliably with any of the individual measures of walking in the wind: consistency of walking in the wind, deflection by the wind blowing from the right on entering the wind tunnel, nor the filmed momentary loss of balance.

Clearly the correlation between the individual subjective reports and the measures of performance depends upon the question which the housewives are asked. When asked the specific question of which task is interfered with most by the wind, the housewives give individual answers which correlate with their individual performances. But when they are asked simply to assess in general the amount of unbalance in walking in the wind, the housewives' individual answers do not correlate with their individual performances.

To obtain an appropriate answer it is necessary to ask a specific question, not a general question. The answer to a general question may draw upon too much past experience, both from other parts of the experiment and from outside the experiment.
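The article does not say which statistical test was applied to the counts in Table 4. As one possible check, the sketch below applies a one-sided Fisher exact test to the 2 x 2 counts in the middle section of the table. The function name and the choice of test are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """P(top-left count >= a) for the 2 x 2 table [[a, b], [c, d]],
    with all row and column totals held fixed (hypergeometric null)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    k_max = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, k_max + 1))
    return favourable / total


# Rows: filmed momentary loss of balance on entry (yes, no)
# Columns: task reported as interfered with most (entry, consistency)
p = fisher_exact_one_sided(4, 2, 1, 12)
print(f"one-sided p = {p:.3f}")
```

For these counts the one-sided probability is roughly 0.017, which is consistent with the statement that the difference is reliable. The calculation treats the 19 reports as independent, the same proviso the text makes when it requires the instances to be taken as comparable to each other.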

Subjective reports without objective checks

In the wind experiment it does not matter about the lack of association between the housewives' individual subjective assessments of unbalance in walking, and their individual objective measures of walking in the wind. All the measures agree that the faster and more turbulent wind is worse than the slower and less turbulent winds. In Table 3 there are no reliable differences which contradict this. But the lack of association between the individual subjective and objective measures should be taken as a warning. It means that it is unwise to rely exclusively upon general subjective assessments, unrelated to the performance of a particular task or tasks. Engineers and physicists are always being tempted to do this, as subjective measures are so much easier to obtain than objective measures of performance. But if they do, they will find wrong answers when common knowledge turns out to be incorrect, and does not reflect measured performance.

A short period in heat

An example of this is a short period of work in heat. The flight deck of a transport aircraft standing in the mid-day sun at an airport in the Middle East may reach a temperature of 45°C (113°F). In the early 1960s, air conditioning equipment was available only in the first class passenger compartment, not on the flight deck. So the crew of the aircraft had to work at this temperature until the aircraft became airborne. They could not even have a fan on until they had checked the electrical equipment. Pilots were complaining that the heat adversely affected their efficiency while they were doing their pre-flight checks and routines.

The climate was simulated in an experiment in the hot room at the Institute of Aviation Medicine at Farnborough. The experimental task involved listening with headphones, while checking the readings of a number of instruments arranged in a semicircle. Performance was found to be reliably better during the first half hour in the hot climate, than in a cooler climate with a temperature of 25°C (77°F). It was concluded that first entering a hot environment may have a beneficial effect. The experiment did not support the pilots' claim that the heat adversely affected their efficiency (Poulton and Kerslake, 1965).

Prolonged low frequency noise

Another example is prolonged low frequency noise. Environmentalists campaign against noise of any kind anywhere, basing their case on subjective disturbance. Low frequency noise is judged to be less disturbing than high frequency noise. But there do not appear to be any published reports stating that low frequency noise feels bracing, or is the opposite of disturbing. Yet low frequency noise of an intensity of about 102dB(C) or 92dB(A) has been found to improve performance on all three of the experimental tasks used (Poulton and Edwards, 1974).

The low frequency noise has a rumbling quality. It shakes the man, without completely preventing communication by shouting. Also it does not completely mask the clicks and taps from the equipment in use, which in quiet conditions provide cues about performance. The low frequency noise keeps the man aroused, without isolating him from his normal auditory environment. In prolonged routine tasks where it is difficult to remain alert, low frequency noise helps to maintain alertness just as broadband noise does. But the low frequency noise has the added advantage that it does not isolate the man from his normal auditory environment as much as broadband noise does. Low frequency noise provides the advantage of broadband noise, without the full disadvantage.

Prolonged vibration in heavy goods vehicles

A third example is prolonged vibration in heavy goods vehicles. If the driver's seat is not adequately isolated from the vibration, its effect may be unpleasant. Drivers of heavy goods vehicles complain that they become fatigued when they are exposed to the vibration hour after hour. The question was investigated in an experiment at the Royal Aircraft Establishment at Farnborough. Vertical vibration was used with a frequency of 5 Hz and an amplitude of 3.5 cm, or a peak G of 0.17. The Wilkinson auditory vigilance task was given twice, once towards the start of a 3 h spell of work, and once towards the end. Days working with vibration for the 3 h alternated with days working without vibration. Days when the volunteers knew they would be told after the spell of work how well they had performed, alternated with days when they knew they would not be told how they had performed.

On experimental days when knowledge of results was expected, vibration was found to degrade vigilance, especially towards the end of the 3 h spell. But on days without knowledge of results, the drop in performance was smaller towards the end of the 3 h spell with vibration. The interaction between the vibration and the knowledge of results was reliable (Wilkinson and Gray, 1974).

The experimental days without knowledge of results are probably more representative of the work of the driver of a heavy goods vehicle than the days with knowledge of results. After driving for a number of hours at night on a motorway with little traffic, the lack of stimulation may make the driver feel sleepy. Here a certain amount of noise and vibration may help to keep him alert. If the cab and seat shelter him from the noise and vibration of the vehicle, he might be well advised to open a window and let in a little noise and wind.

It is clear that findings such as these would not have been made by a research worker who stuck exclusively to general subjective assessments, unrelated to the performance of a particular task or tasks.

Excluding observer bias

Quantitative subjective assessments are inevitably contaminated by observer bias. Tables 1 and 2 illustrate the kind of range effect which is bound to occur if each observer receives a number of conditions. In the wind experiment, two of the 13 subjective assessments show reliable asymmetrical transfer. This means that the average assessments of the housewives who are presented with the turbulent and non-turbulent winds in one order, are reliably different from the average assessments of the housewives who are presented with the same two conditions in the reverse order (Poulton and Freeman, 1966). None of the objective measures of performance show reliable asymmetrical transfer.

Two restrictions upon subjective assessments are required in order to exclude transfer effects and range effects. First, each observer has to be presented with only a single condition. Secondly, his answer has to be restricted to two alternatives. These two restrictions make it virtually impossible to carry out subjective assessments in the currently accepted way. Restricting each observer to a single condition excludes the series of judgments of noisiness which provide the results in Tables 1 and 2. Restricting the observer's answer to one of two possible alternatives excludes the rating scales used in the experiments of Tables 1 and 2. It also excludes the method using the 100 mm line of Fig 2. Here the observer tends to avoid putting his mark exactly in the middle of the line. This is the reverse of the tendency with rating scales to select a response too close to the middle of the range (Poulton, Simmonds and Warren, 1968).
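The two restrictions can be illustrated with a small sketch. Everything in it is hypothetical: the noise levels, the group sizes and the function names are invented for illustration, and linear interpolation of the 50% point is only one simple way of summarising such data, not a procedure described in this article. Each observer hears a single noise level and gives one of two answers, so no observer ever meets a range of stimuli or a range of responses.

```python
def proportion_noisy(answers):
    """Fraction of observers at one noise level who answered 'noisy'."""
    return sum(1 for a in answers if a == "noisy") / len(answers)


def transition_level(results):
    """Interpolate the level at which half the observers answer 'noisy'.

    `results` maps a noise level in dB to the list of answers given by the
    separate group of observers who heard that level and no other.
    Returns None if the tested levels do not bracket the 50% point.
    """
    points = sorted((level, proportion_noisy(answers))
                    for level, answers in results.items())
    for (level0, p0), (level1, p1) in zip(points, points[1:]):
        if p0 <= 0.5 <= p1 and p1 > p0:
            return level0 + (0.5 - p0) * (level1 - level0) / (p1 - p0)
    return None


# Hypothetical data: four separate groups of four observers each.
results = {
    70: ["acceptable"] * 4,
    76: ["acceptable"] * 3 + ["noisy"],
    82: ["acceptable"] + ["noisy"] * 3,
    88: ["noisy"] * 4,
}
print(transition_level(results))
```

The price of this design is the number of observers needed, since every judgement requires a fresh person. That is the practical difficulty referred to above.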


In the wind experiment, the general subjective assessments of the individual housewives are not associated with the appropriate individual measures of performance. This suggests that the general subjective assessments may be based upon what the housewives know ought to happen, not simply upon what does actually happen. When general subjective assessments are used in an experiment as the only measure, they can therefore provide misleading results. This occurs when common knowledge is at variance with measures of performance. Three examples of this have just been given.

Yet Table 4 shows that subjective reports can be associated with observable events. But to achieve this, a specific question has to be asked about the observable event.

The necessary conditions for excluding observer bias can be summarized as follows:
1. Ask a specific question.
2. Get each observer to make only a single assessment.
3. Allow him only one of two possible alternative answers.

Even with all these precautions it is advisable to obtain objective measures of performance as a check, whenever this is possible.

Implications for ergonomists

Recommendations based upon subjective assessments should be checked against objective measures of performance. This is because general subjective assessments may be based upon common knowledge, which can be wrong. Even after checking, the recommendations should be treated with great caution because the subjective assessments are almost certainly biassed by range effects.

References

Bowsher, J.M., Johnson, D.R., and Robinson, D.W. 1966 Acustica, 17, 245-267. A further experiment on judging the noisiness of aircraft in flight.

Poulton, E.C. 1968 Psychological Bulletin, 69, 1-19. The new psychophysics: six models for magnitude estimation.

Poulton, E.C. 1973a Applied Ergonomics, 4, 17-18. Bias in ergonomic experiments.

Poulton, E.C. 1973b Psychological Bulletin, 80, 113-121. Unwanted range effects from using within-subject experimental designs.

Poulton, E.C. 1974 American Journal of Psychology, 87, in press. Range effects in experiments on people.

Poulton, E.C., and Edwards, R.S. 1974 Journal of Experimental Psychology, 102, 621-628. Interactions and range effects in experiments on pairs of stresses: mild heat and low-frequency noise.

Poulton, E.C., and Freeman, P.R. 1966 Psychological Bulletin, 66, 1-8. Unwanted asymmetrical transfer effects with balanced experimental designs.

Poulton, E.C., Hunt, J.C.R., Mumford, I.D., and Poulton, J. 1976 in preparation. The mechanical disturbance produced by steady and gusty winds of moderate strength: skilled performance and semantic assessments.

Poulton, E.C., and Kerslake, D. McK. 1965 Aerospace Medicine, 36, 29-32. Initial stimulating effect of warmth upon perceptual efficiency.

Poulton, E.C., Simmonds, D.C.V., and Warren, R.M. 1968 Perception and Psychophysics, 3, 112-114. Response bias in very first judgments of the reflectance of grays: numerical versus linear estimates.

Robinson, D.W., Copeland, W.C., and Rennie, A.J. 1961 The Engineer, 211, 493-497. Motor vehicle noise measurement.

Sadoff, M., McFadden, N.M., and Heinle, D.R. 1961 US National Aeronautics and Space Administration. Technical note D-348, Washington DC. A study of longitudinal control problems at low and negative damping and stability with emphasis on effects of motion cues.

Wilkinson, R.T., and Gray, R. 1974 Royal Aircraft Establishment report. Effects of duration of vertical vibration, beyond the proposed ISO 'fatigue-decreased proficiency' time, on the performance of various tasks.