Applied Ergonomics 30 (1999) 121–135
Evaluation of interrater reliability for posture observations in a field study

Susan Burt*, Laura Punnett

US Department of Health and Human Services, Centers for Disease Control and Prevention, National Institute for Occupational Safety and Health, 4676 Columbia Parkway, MS R-10, Cincinnati, Ohio 45226, USA
University of Massachusetts Lowell, Department of Work Environment, One University Avenue, Lowell, Massachusetts 01854, USA

Received 13 December 1996; accepted 17 December 1997
Abstract

This paper examines the interrater reliability of a quantitative observational method of assessing non-neutral postures required by work tasks. Two observers independently evaluated 70 jobs in an automotive manufacturing facility, using a procedure that included observations of 18 postures of the upper extremities and back. Interrater reliability was evaluated using percent agreement, kappa, intraclass correlation coefficients, and generalized linear mixed modeling. Interrater agreement ranged from 26% for right shoulder elevation to 99% for left wrist flexion, but agreement was at best moderate when evaluated using kappa. Percent agreement is an inadequate measure because it does not account for chance, and can lead to inflated estimates of reliability. The use of more appropriate statistical methods may lead to greater insight into sources of variability in reliability and validity studies and may help to develop more effective ergonomic exposure assessment methods. Interrater reliability was acceptable for some of the postural observations in this study. © 1999 Elsevier Science Ltd. All rights reserved.

Keywords: Posture observations; Field studies; Reliability
1. Introduction The incidence of work-related musculoskeletal disorders has tripled in the US during the past decade. According to the Bureau of Labor Statistics data, ‘Disorders Associated with Repeated Trauma’ comprised 65% of all reported occupational illnesses in 1994 (Bureau of National Affairs, 1995). They result in lost work time, high workers’ compensation costs, and sometimes job loss (NIOSH publication, 1992). They also interfere with activities of daily living outside of work and limit future job prospects (Schneider et al., 1995). Examples of work-related musculoskeletal disorders include tendinitis, tenosynovitis, epicondylitis, rotator cuff syndrome, carpal tunnel syndrome, Guyon’s canal syndrome, thoracic outlet syndrome, and hand-arm vibration syndrome (Hadler, 1987; Putz-Anderson, 1988). High risk industries and occupations for these disorders
* Corresponding author
include meatpacking, automobile manufacturing, garment sewing, and computer operation (Jensen et al., 1983). Laboratory and epidemiologic studies have identified associations between various work-related musculoskeletal disorders and exposure to highly repetitious or static work, work requiring high force, awkward postures, localized contact pressure, vibration, and cold (Keyserling et al., 1991; Silverstein et al., 1986; Armstrong, 1983; Stock, 1991). Posture can be a major determinant of the amount of force required to perform a task, the friction of tendons against bones or other unyielding tissues (such as the carpal tunnel), and the displacement of anatomic structures (Armstrong, 1983). Low back pain has been linked to non-neutral postures that result in a biomechanical disadvantage and increased spinal disc compression (Punnett et al., 1991). Non-neutral postures have been linked to fatigue as well as specific soft-tissue disorders (Habes et al., 1985). For example, working with the arms overhead has been linked to rotator cuff tendinitis, and pinch gripping has been associated with carpal tunnel syndrome (Jonsson et al., 1988).
0003-6870/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0003-6870(98)00007-6
Awkward postures may be quantified by recording their frequency and/or duration during a work cycle or work sampling period. Joint angles can also be measured directly by electrogoniometers, inclinometers, or other motion analysis systems. Although direct-reading instruments are often used as a gold standard, their expense and intrusiveness limit their use in field studies. Despite their objectivity, they may not yield a reliable or valid representation of the exposure, because of detachment of the instrument, interference, artifacts, and inaccuracies due to a lack of sensitivity or specificity (Burdorf et al., 1992; Baty et al., 1986; Selin and Winkel, 1994; Karlqvist et al., 1994). Awkward postures and other ergonomic exposures have also been assessed by questionnaires, but reports on the reliability of self-reported physical exposures have been mixed (Wiktorin et al., 1993, 1996; Kilbom, 1994; Armstrong et al., 1989; Hjelm et al., 1995; Burdorf and Loan, 1991). Thus, observational methods continue to be used commonly, especially to assess postural stresses. Observational postural analysis of videotaped job tasks has been described by Armstrong and Keyserling, among others (Armstrong et al., 1982; Keyserling, 1986). The method described by Armstrong in 1982 involves coding 35 upper extremity postures from each frame of film or video stopped at regular intervals (e.g. every 0.2 s), and requires several hours to analyze one minute of work (Punnett and Keyserling, 1987). The method described by Keyserling in 1986 uses direct computer entry of a more limited number of observed postures while the analyst watches a running videotape of a task. The tape is replayed once for each joint of interest.
Direct observation of a task may be preferable to videotaping, since the camera angle may affect the accuracy of judgements about the plane of motion and the degree of deviation from neutral (Punnett and Keyserling, 1987; Douwes and Dul, 1991), although some researchers have found no differences in accuracy between direct observations and videotapes (Ericson et al., 1991). Direct observational methods use checklists, data collection sheets, or direct data entry on portable computers, but direct computer entry in the workplace may be more practical when the categories of observations are limited (Wiktorin et al., 1995; Fransson-Hall et al., 1995; Van der Beek, 1992; Buchholz et al., 1996). This paper addresses the interrater reliability of a quantitative observational ergonomic exposure assessment method. It is part of a larger, prospective epidemiologic study of work-related upper extremity soft tissue disorders among automobile manufacturing workers (Punnett, 1995). Questionnaires and physical examinations (baseline and 1-year follow-up) and ergonomic evaluations were conducted in an engine plant (primarily machining and sub-assembly of engine components) and a stamping plant (where presses ‘stamp’ out auto parts from steel blanks).
2. Statistical methods to evaluate interrater agreement

Several statistical methods are available to measure agreement; each has advantages and disadvantages. The simplest calculation to evaluate agreement on ratings is the proportion of agreement; it is the sum of the frequencies along the main diagonal of a contingency table (Fleiss, 1973). Proportion of agreement has been criticized as an inadequate measure of agreement because it does not account for chance agreement (Fleiss, 1973). Kappa is a measure of agreement that does account for chance. The formula for kappa subtracts the proportion of agreement that could be expected by chance alone from the observed proportion of agreement, and can therefore avoid erroneous conclusions that agreement is good when in fact it may simply be due to chance (Fleiss, 1973). If there are three or more possible ratings that observers can make, and the ratings are ordinal, weighted kappa can be calculated (Fleiss, 1973). It measures approximate as well as exact agreement, by assigning greater statistical weight to neighbouring categories than to distant categories. The choice of weights can affect kappa, but this problem can be avoided by using standard weights (Maclure and Willett, 1987). Maclure and Willett (1987) criticized the application of kappa to ordinal data, because the value of kappa changes according to the choice of categories used to represent collapsed data. A kappa value for an ordinal scale with three categories is not comparable to a kappa value for an ordinal scale with four categories, for example. However, this problem is not unique to kappa. Kappa is also affected by unbalanced marginal totals in contingency tables, which can occur with a rarely observed event or if observers vary widely in their sensitivity to the factor (Feinstein and Cicchetti, 1990; Graham and Jackson, 1993).
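To make the kappa computation concrete, the sketch below (in Python, with hypothetical counts, not data from this study) computes kappa from a square contingency table of two raters' categorical codes.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table.

    table[i][j] = number of subjects coded category i by rater A
    and category j by rater B.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of subjects on the main diagonal
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Expected (chance) agreement: sum of products of marginal proportions
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 2x2 example: posture coded absent (row/column 0) or present (1)
print(round(cohens_kappa([[40, 5], [10, 15]]), 2))  # 0.51
```

With perfect agreement (all counts on the diagonal) the function returns 1; a value near 0 indicates agreement no better than chance.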
Proportion of agreement and kappa are both suited to data that can be represented in contingency tables, that is, categorical or ordinal data. If data are continuous measurements or counts, then analysis of variance may be more appropriate (Kleinbaum et al., 1988). Analysis of variance compares the amount of variance in the measurement that can be attributed to one factor versus another (e.g. job differences vs. observer differences). Intraclass correlation coefficients are measures of reliability calculated using variance components from analysis of variance models (Shrout and Fleiss, 1979; Bartko, 1994); the selection of the appropriate version depends on the reliability study design (Armstrong et al., 1992; Muller and Buttner, 1994). The intraclass correlation coefficient represents the correlation of, for example, measurements within the same job classification. It has been
demonstrated to be equivalent to weighted kappa (Fleiss and Cohen, 1973). When data are not normally distributed, it may be more appropriate to choose an analysis method that allows the investigator to specify the distribution, such as generalized linear modeling (Wolfinger and O’Connell, 1993). Statistical modeling can account for multiple factors simultaneously, thus offering an advantage over kappa (Graham and Jackson, 1993). Non-parametric measures of correlation are sometimes presented as estimates of reliability when data are not normally distributed. These methods assign ranks to data; they may not be appropriate to use when many ties occur in ranks (Sokal and Rohlf, 1981).
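As an illustration of how an intraclass correlation coefficient is assembled from analysis-of-variance components, the sketch below (Python, with hypothetical ratings, not data from this study) computes the Shrout-Fleiss ICC(3,1) for a two-way, fixed-effects design, the version used later in this paper (Table 4); the between-job and residual mean squares come from the usual two-way decomposition of the sums of squares.

```python
def icc_3_1(ratings):
    """Shrout-Fleiss ICC(3,1): two-way model, raters fixed, single rating.

    ratings: one row per target (e.g. job), one column per rater.
    """
    n, k = len(ratings), len(ratings[0])        # targets, raters
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_jobs = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_jobs - ss_raters   # residual sum of squares
    bms = ss_jobs / (n - 1)                     # between-targets mean square
    ems = ss_error / ((n - 1) * (k - 1))        # error mean square
    return (bms - ems) / (bms + (k - 1) * ems)

# Hypothetical counts for 4 jobs rated by 2 observers
print(round(icc_3_1([[0, 1], [1, 0], [2, 3], [3, 2]]), 2))  # 0.6
```

When the raters agree exactly on every job, the error mean square is zero and the ICC is 1.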
3. Methods

Seventy-five jobs were evaluated by two observers. These jobs were chosen from a sample of 648 jobs observed in the stamping plant. Included were several sub-assembly areas (hoods, doors, side panels, lift-gates, and fenders), large press lines (hood outer panels, hood inner panels), and a small press line. All workers on the selected lines and shifts were invited to participate in the ergonomic evaluation. The first observer had been evaluating jobs in the larger study for a year at the time of data collection for the interrater reliability study, and he had additional experience using a similar method in another large study. The second observer, who had experience evaluating jobs in field studies by other methods, was trained by the first observer for one day in the plant, during which they evaluated five jobs together, and discussed and resolved
differences in observations. Following the training period, 75 job analyses were performed independently, without comparison or discussion. Whenever feasible, the two observers stood side-by-side or alternated sides to view the motions of the same worker from the same perspective and during the same time period. The ergonomic exposure data were recorded on a data collection sheet that included information to identify each job observed (plant, department, operation number and description), information on parts and tools used, and general workstation factors (height, reach, padded surfaces, space limitations, etc.). Monthly production reports were used as the source for the number of cycles per hour on each production line. The remainder of the data form was used to record the frequencies of each of eighteen movements or postural changes of the hands, wrists, arms, shoulders, and back during a typical cycle of the task (Table 1). Trunk postures recorded were similar to those described by Punnett et al. (1991). The hand/wrist postures recorded were similar to those described by Stetson et al. (1991) except that in this study wrist flexion and extension were recorded separately, and frequency but not duration of exertions was recorded. Shoulder postures recorded were similar to those described by Keyserling (1986); elbow and forearm postures recorded were among those described by Armstrong et al. (1982). Each job was observed for several cycles before beginning to count and record specific postures. Job analysts then focused on one body area at a time during each of several work cycles to count and record posture changes. Observing the worker and recording data on the recording sheet typically required fifteen to twenty minutes for each worker. A task cycle was defined as beginning when
Table 1
Agreement on raw posture counts (2 observers, 70 jobs)

Posture                           Range of   % Exact     Paired t-test:* mean
                                  counts     agreement   difference [p-value]
Wrist Flexion-Left                0–1        99%         −0.01 [0.32]
Wrist Flexion-Right               0–1        97%         −0.03 [0.16]
Wrist Extension-Left              0–1        96%         −0.04 [0.08]
Wrist Extension-Right             0–1        91%         −0.03 [0.42]
Ulnar Deviation-Left              0–1        84%         −0.01 [0.77]
Ulnar Deviation-Right             0–5        81%         −0.03 [0.75]
Pinch Grip-Left                   0–12       44%         −0.40 [0.08]
Pinch Grip-Right                  0–12       39%         −0.28 [0.25]
Elbow Extension-Left              0–2        76%         −0.13 [0.06]
Elbow Extension-Right             0–10       66%         −0.25 [0.13]
Shoulder Elevation <45°-Left      0–18       46%         0.60 [0.03]
Shoulder Elevation <45°-Right     0–18       44%         0.56 [0.04]
Shoulder Elevation >45°-Left      0–50       39%         1.70 [0.02]
Shoulder Elevation >45°-Right     0–50       26%         1.72 [0.05]
Back Flexion <45°                 0–5        63%         0.11 [0.33]
Back Flexion >45°                 0–12       83%         0.24 [0.13]
Back Extension                    0–1        99%         −0.01 [0.32]
Back Twist/Lateral Bend           0–6        61%         0.08 [0.49]

*Mean of paired differences for each of 70 jobs observed by 2 raters.
the worker pushed the activation button and ended when he reached to push the activation button for the next piece. For tasks that did not entail activation buttons, the cycle was defined as beginning when the worker reached for a blank sheet of steel or part, and ended when he reached for the next blank or part. Task cycles typically lasted less than a minute, and often less than thirty seconds.
4. Data analysis methods

Interrater agreement on the frequency per cycle of each of the 18 postures was evaluated using several statistical approaches: proportion of agreement, kappa, paired t-tests of mean differences, and intraclass correlation coefficients based on two-way analysis of variance (using analyst and job as the two factors in the model). Because of concern that the assumptions of normality and equal variance required for analysis of variance were violated, generalized linear modeling was also performed, using repeated measures mixed models and specifying a Poisson distribution with extra dispersion (Wolfinger and O'Connell, 1993; Wolfinger, 1995). All statistical analyses were performed using SAS 6.10 or 6.11 for personal computers (SAS Institute, 1996). To calculate kappa, raw counts of postural observations were collapsed into dichotomous categories (posture observed or not observed). If there was a sufficient distribution of counts to warrant it, raw counts were also divided into three categories (0, 1, >1) to calculate weighted kappa. Standard weights assigned by SAS version 6.10 were used to calculate weighted kappa [weights of 1 for exact agreement by the two raters, and 0.5 for ratings in neighboring categories] (Muller and Buttner, 1994). Because of concern that data were not distributed normally, non-parametric tests were considered. These methods (which assign ranks to data) were rejected because of the many zero values for postures that were not observed in particular jobs, which would have resulted in too many tied ranks. Agreement was also determined separately for postures in atypical jobs (inspection and repair jobs that did not involve repeated cycles of identical motions for each part) and jobs in which different workers were observed by each rater.
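The standard linear agreement weights described above (often called Cicchetti-Allison weights), which for three ordered categories give 1 for exact agreement, 0.5 for neighboring categories, and 0 otherwise, can be sketched as follows (Python, with hypothetical counts, not data from this study):

```python
def weighted_kappa(table):
    """Weighted kappa with linear agreement weights w = 1 - |i-j|/(k-1).

    For three ordered categories (e.g. counts collapsed to 0, 1, >1) this
    gives weights of 1 (exact), 0.5 (adjacent), and 0 (two apart).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected agreement
    p_obs = sum(w[i][j] * table[i][j]
                for i in range(k) for j in range(k)) / n
    p_exp = sum(w[i][j] * row_tot[i] * col_tot[j]
                for i in range(k) for j in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 3x3 table of counts categorized as 0, 1, or >1 by two raters
print(round(weighted_kappa([[30, 5, 1], [6, 10, 4], [2, 3, 9]]), 2))  # 0.58
```

Unlike unweighted kappa, disagreements in adjacent categories receive half credit, so near-misses are penalized less than gross disagreements.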
5. Results

One hundred and forty-nine data sheets were completed (75 by one rater and 74 by the other). Two jobs were eliminated from the interrater reliability analysis because the workers declined to be observed (only workstation factors were recorded). Three other jobs that involved pairs of workers were eliminated because one rater combined two workers on one form
while the other completed one form for each worker. Data from the 70 remaining jobs (140 observations) were analyzed. In general, the distributions of the two observers' counts of the 18 different postures were similar (Appendix A). Wrist flexion and extension and back extension were observed by either analyst in fewer than five of the 70 jobs. Ulnar deviation and elbow extension were observed by either analyst in 12 or fewer jobs, and back flexion was observed in 16 or fewer jobs. In contrast, pinch grips were observed by at least one analyst in 51 of the 70 jobs, and shoulder elevations were observed in at least 60 jobs. Analyst A was more ‘sensitive’ in observing pinch grips; however, the range of frequencies of left and right pinch grips per cycle recorded by analyst A was narrower (0–6 and 0–7) than the range recorded by analyst B (0–12). Analyst B observed more of both slight (<45°) and severe (>45°) shoulder elevations. The greatest difference in the range of frequencies of postures observed was for left shoulder elevations greater than 45°: analyst A's counts ranged from 0–3, while B's counts ranged from 0–50 per cycle. Back twist/lateral bend was observed at least once in 26 of the 70 jobs. The observers' counts of postures per work cycle were not distributed normally.

5.1. Agreement on a job-by-job basis

Exact agreement in the number of postures observed per work cycle, matched on job, ranged from 26% (right shoulder elevations >45°) to 99% (left wrist flexions). Postures with a greater range of counts (e.g. shoulder elevations and pinch grips) generally had lower agreement than those with narrower ranges (e.g. wrist postures). Except for shoulder postures, the mean paired differences in posture counts by the two raters did not differ significantly (p<0.05) from zero (Table 1). When the original posture counts were dichotomized, proportion of agreement increased for all of the posture variables that initially had wide ranges of counts.
Agreement, as evaluated by kappa, was ‘moderate’ for left wrist extension and left pinch grip; ‘fair’ for right ulnar deviation, right pinch grip, left and right shoulder elevation >45°, back flexion >45°, and back twist/lateral bend; ‘slight’ for right elbow extension, left and right shoulder elevation <45°, and back flexion <45°; and ‘poor’ for right wrist extension, left ulnar deviation, wrist flexion, and back extension (Table 2). Only six postures had a sufficient distribution of counts (Appendix A) to warrant dividing the counts into three categories (left and right pinch grips and left and right shoulder elevations <45° and >45°). For these six postures, the proportion of agreement was lower than for the dichotomized variables and higher than for the original counts. The weighted kappa values (for three
Table 2
Agreement on presence or absence (0 or ≥1) of postures (2 observers, 70 jobs)

Posture                           % Agreement   Kappa (95% CI)         Rating of agreement*
Wrist Flexion-Left                99%           0.00 (−0.47, 0.47)†    Poor
Wrist Flexion-Right               97%           0.00 (−0.47, 0.47)†    Poor
Wrist Extension-Left              96%           0.55 (0.11, 0.99)      Moderate
Wrist Extension-Right             91%           −0.04 (−0.08, 0.00)    Poor
Ulnar Deviation-Left              84%           −0.08 (−0.14, −0.03)   Poor
Ulnar Deviation-Right             81%           0.22 (−0.07, 0.51)     Fair
Pinch Grip-Left                   74%           0.47 (0.28, 0.67)      Moderate
Pinch Grip-Right                  63%           0.24 (0.03, 0.45)      Fair
Elbow Extension-Left              79%           0.23 (−0.05, 0.50)     Fair
Elbow Extension-Right             67%           0.12 (−0.11, 0.35)     Slight
Shoulder Elevation <45°-Left      56%           0.13 (−0.09, 0.34)     Slight
Shoulder Elevation <45°-Right     53%           0.05 (−0.18, 0.28)     Slight
Shoulder Elevation >45°-Left      69%           0.37 (0.19, 0.55)      Fair
Shoulder Elevation >45°-Right     70%           0.25 (0.03, 0.46)      Fair
Back Flexion <45°                 66%           0.03 (−0.21, 0.27)     Slight
Back Flexion >45°                 84%           0.20 (−0.09, 0.50)     Fair
Back Extension                    99%           0.00 (−0.47, 0.47)†    Poor
Back Twist/Lateral Bend           71%           0.37 (0.14, 0.59)      Fair

*As defined by Landis and Koch (1977).
†95% confidence intervals calculated by hand using a variance formula from Fleiss et al. (1969), because SAS does not calculate kappa for tables in which one row consists only of zeros.
Table 3
Agreement on categorized (0, 1, or >1) posture counts (2 observers, 70 jobs)

Posture                           % Agreement   Weighted kappa (95% CI)   Rating of agreement*
Pinch Grip-Left                   53%           0.40 (0.24, 0.56)         Moderate
Pinch Grip-Right                  54%           0.43 (0.27, 0.59)         Moderate
Shoulder Elevation <45°-Left      47%           0.16 (−0.03, 0.34)        Slight
Shoulder Elevation <45°-Right     44%           0.07 (−0.12, 0.25)        Slight
Shoulder Elevation >45°-Left      53%           0.34 (0.19, 0.48)         Fair
Shoulder Elevation >45°-Right     39%           0.18 (0.03, 0.32)         Slight

*As defined by Landis and Koch (1977).
categories) were higher than the kappa values for the corresponding dichotomized variables, except for left shoulder elevation >45° (Table 3). However, kappa ratings did not change substantially (right pinch grips increased from ‘fair’ to ‘moderate’, while right shoulder elevations >45° decreased from ‘fair’ to ‘slight’; the others were unchanged). Two-way analysis of variance revealed that variance in posture counts attributable to job differences reached statistical significance (p<0.05) for six of the seven postures (right ulnar deviation, left and right pinch grip, left and right elbow extension, and back flexion, but not back twist/lateral bend) for which the overall models were statistically significant. For these same seven postures, variance in posture counts that could be attributed to the observer did not reach statistical significance, although for two postures (left pinch grip and left elbow extension) the differences approached the significance level. Results
from generalized linear modeling for repeated measures were similar. Intraclass correlation coefficients based on 2-way, fixed effects analysis of variance ranged from −0.05 to 0.51 (Table 4). Twenty-five jobs were determined to be less comparable because: (a) they did not require stereotypical motions [12 inspection and repair jobs]; (b) they were two-worker jobs and it was not possible to determine whether the data referred to the worker on the left or right [2 jobs]; or (c) the raters were unable to observe the same worker due to job rotation [11 jobs]. For the group of 25 less comparable jobs (Table 5), analysis of variance showed a significant observer effect for left and right pinch grips, controlling for job. Identical analyses for the same postures in the group of 45 more comparable jobs showed no significant observer effect, again controlling for job. Therefore, observers agreed on counts of left and right pinch grips more often when they were observing
Table 4
Analysis of variance vs. generalized linear modeling results (2 observers, 70 jobs)*

Posture                          ANOVA: F-tests [p] for       GLIM: chi-square [p] for O;   ICC†
                                 O (observer) and J (job)     Z [p] for J
Ulnar Deviation-Right            O: 0.10 [0.75]               O: 0.09 [0.76]                0.29
                                 J: 1.82 [0.01]               J: 2.27 [0.02]
Pinch Grip-Left                  O: 3.15 [0.08]               O: 2.82 [0.09]                0.40
                                 J: 2.33 [0.00]               J: 2.95 [0.00]
Pinch Grip-Right                 O: 1.37 [0.25]               O: 1.28 [0.26]                0.51
                                 J: 3.06 [0.00]               J: 3.68 [0.00]
Elbow Extension-Left             O: 3.66 [0.06]               O: 3.76 [0.05]                0.25
                                 J: 1.67 [0.02]               J: 2.25 [0.02]
Elbow Extension-Right            O: 2.39 [0.13]               O: 1.90 [0.17]                0.26
                                 J: 1.69 [0.02]               J: 1.89 [0.06]
Shoulder Elevation <45°-Left     O: 4.90 [0.03]               O: 5.98 [0.01]                0.09
                                 J: 1.20 [0.22]               J: 1.05 [0.29]
Shoulder Elevation <45°-Right    O: 4.28 [0.04]               O: 5.19 [0.02]                0.05
                                 J: 1.11 [0.33]               J: 0.59 [0.55]
Shoulder Elevation >45°-Left     O: 5.52 [0.02]               O: 7.16 [0.00]                0.00
                                 J: 0.99 [0.52]               J: −0.07 [0.95]
Shoulder Elevation >45°-Right    O: 3.69 [0.06]               O: 3.62 [0.06]                −0.05
                                 J: 0.90 [0.67]               J: −0.50 [0.62]
Back Flexion <45°                O: 0.75 [0.39]               O: 0.78 [0.38]                −0.01
                                 J: 0.97 [0.55]               J: −0.12 [0.90]
Back Flexion >45°                O: 2.13 [0.15]               O: 2.72 [0.10]                0.28
                                 J: 1.77 [0.01]               J: 3.21 [0.00]
Back Twist/Lateral Bend          O: 0.48 [0.49]               O: 0.47 [0.49]                0.19
                                 J: 1.47 [0.06]               J: 1.51 [0.13]

*These models were not necessarily significant at the p<0.05 level. Only p-values can be directly compared.
†Shrout-Fleiss intraclass correlation coefficients from 2-way fixed effects ANOVA.
Table 5
Analysis of variance in posture observations: comparable vs. non-comparable jobs*

Posture                          Comparable jobs (45 jobs):   Non-comparable jobs (25 jobs):
                                 F-test (p-value)             F-test (p-value)
Pinch Grips-Left                 O: F=0.60 (0.44)             O: F=5.10 (0.03)
                                 J: F=2.18 (0.01)             J: F=2.78 (0.01)
Pinch Grips-Right                O: F=0.05 (0.83)             O: F=5.00 (0.03)
                                 J: F=2.84 (0.00)             J: F=4.12 (0.00)
Shoulder Elevation <45°-Left     O: F=3.22 (0.08)             –†
                                 J: F=1.63 (0.05)
Shoulder Elevation <45°-Right    –†                           –†
Shoulder Elevation >45°-Left     O: F=65.78 (0.00)            –†
                                 J: F=4.13 (0.00)
Shoulder Elevation >45°-Right    O: F=18.90 (0.00)            –†
                                 J: F=1.65 (0.05)

O = observer; J = job.
*Non-comparable jobs were not cyclical or different workers were observed.
†2-way ANOVA model not significant at the p<0.05 level.
cyclical jobs and the same workers. Observer differences for shoulder postures did not show this pattern. Analyst was a statistically significant predictor for left and right shoulder elevation >45° among the more comparable
jobs, controlling for job effect; but neither analyst nor job was statistically significant in any of the shoulder models limited to non-comparable jobs, possibly because of too few observations in the smaller group of jobs.
Table 6
Unbalanced margin example: left wrist extension counts

                 Analyst A: 0   Analyst A: 1   Total
Analyst B: 0     65 (93%)       3 (4%)         68 (97%)
Analyst B: 1     0              2 (3%)         2 (3%)
Total            65 (93%)       5 (7%)         70 (100%)

% Agreement = 96%
Kappa = (% observed − % expected) / (1 − % expected)
Kappa = 0.55 (moderate)
Range: −1 to +1; 0 = chance agreement
6. Discussion

Substantial differences between observers were found in this assessment of postural stress in 70 manufacturing jobs. If proportion of agreement were used as the sole measure, the conclusions would have been much different. For example, percent agreement was high for left wrist extensions, while the value of kappa was relatively low (Table 6). Both raters agreed that there were no wrist extensions in 65 of the 70 jobs, and that one wrist extension was observed in each of two other jobs. Although the raters agreed in 96% of their observations, the value of kappa was only 0.55, or ‘moderate’. This apparent paradox of high percent agreement but low kappa occurred because percent agreement does not account for chance agreement, and chance agreement is high with a rarely observed event. Kappa is calculated by subtracting the expected agreement (the sum of the marginal products) from the observed agreement (the sum of the main diagonal), and dividing this number by the highest possible value for observed minus expected agreement (1 minus the expected agreement). Because the posture was rarely observed, the marginal totals are very unbalanced in this example, and the value of kappa is mostly driven by the expected value of 0.90: the highest possible value for observed minus expected agreement is only 0.10. If the expected value were 0.50, then observed minus expected would be 0.96 − 0.50, or 0.46, and the value of kappa would be 0.46/0.50, or 0.92 (‘almost perfect’ agreement). This example demonstrates how much the value of kappa can vary with the prevalence of the characteristic being observed. For a rarely observed characteristic, expected agreement increases and observed agreement is increasingly discounted (Kraemer and Bloch, 1988). In this case, although kappa may seem to err on the low side in evaluating interrater reliability, percent agreement seems to err on the high side.
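The arithmetic of this example can be reproduced directly; the following Python sketch applies the calculation described above to the counts in Table 6.

```python
# Counts from Table 6 (left wrist extension, 70 jobs):
# rows = Analyst B (0, 1), columns = Analyst A (0, 1)
table = [[65, 3],
         [0, 2]]
n = 70

p_obs = (table[0][0] + table[1][1]) / n                  # main diagonal: 67/70
row_tot = [sum(row) for row in table]                    # Analyst B margins: 68, 2
col_tot = [table[0][j] + table[1][j] for j in range(2)]  # Analyst A margins: 65, 5
p_exp = sum(row_tot[i] * col_tot[i] for i in range(2)) / n ** 2

kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(p_obs, 2), round(p_exp, 2), round(kappa, 2))  # 0.96 0.9 0.55
```

The expected agreement of 0.90 leaves only 0.10 of headroom, which is why an observed agreement of 0.96 yields a kappa of just 0.55.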
If any conclusions can be drawn about rarely observed postures in this study (including wrist flexion and extension and back extension), it is probably better to rely on kappa than on percent
agreement. A high percent agreement may simply mean that both observers had almost the same difficulty identifying these particular types of postures in the workplace. If the rarely observed postures truly did not occur in these jobs, then the interpretation of the reliability of the observations should probably be limited to similar work settings in which the range of frequencies of the posture is also limited to the low end of the scale. A low kappa value in these situations may be useful; it may deter researchers from using an exposure assessment tool that has not been sufficiently evaluated over the full range of possible exposures. Unbalanced marginal totals would also occur if one rater were very sensitive and the other very specific, that is, if expected values could be largely predicted by which rater made the observation. This was not the case in this study, as can be seen by comparing the distributions of posture counts by the two observers (Appendix A), and by noting that the mean differences in paired observations were sometimes negative and sometimes positive (Table 1). Statistical modeling may be preferable to kappa for measuring agreement in some cases, because it accounts for marginal differences as well as chance and allows for the simultaneous analysis of the effect of more than one factor (Graham and Jackson, 1993). Applications of statistical methods that have recently become available in computer software packages allow for the analysis of repeated measures and non-normal distributions of data. Statistical testing from models provides a means to compare the effect of job differences versus observer differences or other factors in determining estimates of postural stresses or other exposures. Identifying such sources of variability can direct efforts to improve the precision of ergonomic exposure assessment methods.
For these data, the results from generalized linear modeling for repeated measures (specifying a Poisson distribution with extra dispersion) were similar to those obtained from two-way analysis of variance. Meaningful separate analysis of the 25 atypical jobs could only be conducted for pinch grips and shoulder elevations, since the other postures were observed infrequently. Agreement between observers on pinch grip was much better in the group of comparable jobs than in the atypical jobs. Observer was a significant predictor of left and right shoulder elevation >45° in the group of comparable jobs, suggesting that the observers were using different decision criteria for these shoulder postures (Table 5).

6.1. Sources of disagreement or variation in posture observations

Unclear definition of postures can be a source of disagreement between observers. The large discrepancy in counts of left shoulder elevations greater than 45° (Table 1) may be due to the observers using different definitions of cycle times for at least one of the atypical
jobs. Some investigators hold multiple training sessions for observers, in part to refine operational definitions and to verify standardization of coding (Van der Beek et al., 1992; Buchholz et al., 1996). In the study reported here, the observers only briefly reviewed the definitions before beginning job observations. Estimating degrees of deviation from neutral, as in this study, is more difficult than using landmarks, such as ‘hands are below the hips’ or ‘hands are above the shoulders’, as described in other exposure assessment methods (Wiktorin et al., 1995). Some researchers have noted that gross body motions, such as back flexion, are easier to observe and therefore result in better agreement than smaller motions, like wrist deviation (Keyserling, 1986). It has also been noted by others that agreement is better for extreme postures than for slight deviations from neutral (Stetson et al., 1991). Wrist flexion and extension were observed very infrequently in these jobs, while shoulder elevations were frequently observed. This may be at least partially due to the greater difficulty, noted by other researchers, of observing smaller motions. The work observed in this study was fast-paced, and the workers combined several postures into one continuous, fluid movement. It was challenging for the observers to separate these rapid, dynamic movements into component postures, a problem also noted by other researchers (Fransson-Hall et al., 1995; Van der Beek, 1992). When observations are made from videotape, it is possible to slow down or stop the action to note postures of interest. The data recording sheet used for this study included 18 different postures involving the hands, arms, shoulders, and back, as well as observations on tools used, parts handled, and workstation factors. Observers recorded actual counts of the number of times a worker used each posture during a work cycle.
It has been noted in other studies that as the complexity of observations increases, agreement between observers can be expected to decrease (Kilbom, 1994; Suen and Ary, 1989). Although we attempted to avoid interworker differences in work postures by observing each worker on a production line, workers rotated jobs and the analysts could not always synchronize their observations, so the analysts sometimes observed different workers for a given job. When the analysts observed the same worker doing a given job but at different times of the day, changes in the postures used by the worker during the course of the day may also have been a source of variability, since different muscle groups may be used from time to time in order to rest fatigued muscles. In this study, workers were sometimes observed to change the way they performed a task (e.g. switching hands); when one worker was asked, he explained that he was resting the other hand. Synchronizing observations with an audible cue, as reported by other researchers (Van der Beek et al., 1992), should improve interrater reliability. Our observations raise a concern that observing only very short periods of work, even
in apparently stereotypical work, may compromise the representativeness of exposure data.

The jobs that were analyzed entailed similar activities. In most cases, workers stood alongside semi-automated assembly lines and placed parts on the line or removed assembled parts from the line. When comparing jobs that require dramatically different activities, differences in analysts' observations may appear smaller relative to the larger actual differences between jobs. It is possible that jobs in the larger study group were more varied, so posture observations may have been more successful in identifying differences among those jobs. The subgroup of 70 jobs for which interrater reliability was evaluated was a convenience sample, not selected to represent a broad range of postural exposures.

While there is debate over the best ergonomic exposure assessment method, desirable characteristics of methods have been proposed. These include appropriateness for use in large populations at reasonable cost, the versatility to estimate a variety of exposure factors, and the ability to represent the exposures of a job over an appropriate length of time, not just the specific day or hour that exposures are being measured or observed (Winkel et al., 1991). Reliability and validity of a method should be demonstrated. In addition to these criteria, there is growing demand for more detailed, quantitative measures to better define dose-response relationships and to evaluate the impact of interventions. Unfortunately, no single ergonomic exposure assessment method currently meets all of these criteria. Many widely used observational methods have not been adequately validated (Kilbom, 1994). Critical evaluation and systematic improvement of ergonomic exposure assessment methods will enable more accurate classification of exposure, which in turn will facilitate determining etiology through epidemiologic studies.
This study used a relatively detailed, quantitative method in real workplace circumstances and with minimal training of one of the observers. Interrater reliability was moderate for observations of pinch grips, but only fair at best for shoulder elevations. The remaining postures were generally infrequently observed, so it is difficult to draw conclusions about the reliability of those observations. Findings from this study, as well as others discussed above, suggest that interrater reliability of postural observations can be optimized when: operational definitions are simple and unambiguous (ideally pre-tested); longer and multiple training sessions precede data collection; the number of postures observed is limited (and observations are prioritized); the level of detail is limited; and real-time observations are limited to jobs that do not involve rapid, dynamic movements. Longer observation periods and repeated observations may also improve the accuracy and precision of observational assessments (Burdorf, 1995). Detailed, quantitative ergonomic exposure assessment methods that are
appropriate for large field studies are needed, but such methods will only advance our knowledge of the causes of musculoskeletal disorders to the extent that they are reliable and valid. In some settings, it may be difficult to achieve an accurate quantitative estimate of postural stress using an observational method.

Although some studies have reported high reliability of ergonomic assessment methods, that reliability has not always been thoroughly analyzed. The choice of statistical methods to evaluate interrater reliability can substantially change the conclusions reached. It is important to recognize that an exposure measure with low reliability limits the power of a study to detect associations between exposures and health outcomes (Thompson and Walter, 1988). When choosing an appropriate method, it is important to consider the type of data (e.g. dichotomous, ordinal, count, or continuous), the distribution of the data (e.g. binomial, Poisson, or normal), and whether the assumptions of the method are violated. Simply reporting the proportion of agreement between observers can greatly overestimate reliability, particularly when the characteristic is rarely observed. In order to judge the reliability of a job assessment method, it should be applied to jobs with a wide distribution of exposures across the range of possible values.
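The inflation of percent agreement for rarely observed characteristics is easy to demonstrate. The sketch below uses invented ratings for a rare posture: raw agreement is high even though the two raters never agree on the posture being present, and Cohen's kappa, by correcting for chance agreement, exposes this.

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of observations on which two raters give the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa for two raters coding the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
    expected by chance from each rater's marginal frequencies.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Invented data: a rarely observed posture (coded 1) in 50 jobs.
# Both raters code "absent" almost everywhere, but they never flag
# the same jobs as having the posture present.
rater1 = [0] * 47 + [1, 1, 1]
rater2 = [0] * 44 + [1, 1, 1] + [0, 0, 0]
print(round(percent_agreement(rater1, rater2), 2))  # 0.88
print(round(cohen_kappa(rater1, rater2), 2))        # -0.06
```

Despite 88% raw agreement, kappa is slightly negative: the raters agree no better than chance on the posture of interest, which is exactly the paradox discussed in the text.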
Acknowledgements

The authors thank Andrew Bigelow for data collection, Lei Peng for data management, James Deddens for statistical advice and assistance, David Wall and Charles Mueller for programming assistance, and Allison Tepper, Thomas Hales, Richard Monson, and David Christiani for their periodic reviews, guidance and support.
Appendix A

Distributions of posture counts by the two observers (not matched on job) are shown in Figs. 1—18.
Appendix B

Kappa ranges from −1 to +1, with a value of 0 representing the level of agreement that would be expected solely by chance. A classification of kappa values has been proposed by Landis and Koch (1977):

Kappa      Strength of agreement
<0         Poor
0—0.2      Slight
0.2—0.4    Fair
0.4—0.6    Moderate
0.6—0.8    Substantial
0.8—1.0    Almost perfect
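The classification above translates directly into a small helper. This is a sketch (the function name is ours), with band upper bounds treated as inclusive, which the original table leaves ambiguous.

```python
def landis_koch_label(kappa):
    """Verbal strength-of-agreement label for a kappa value,
    per the Landis and Koch (1977) classification.  Band upper
    bounds are treated as inclusive."""
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.2, "Slight"), (0.4, "Fair"),
                         (0.6, "Moderate"), (0.8, "Substantial"),
                         (1.0, "Almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa must lie in [-1, 1]")

print(landis_koch_label(0.47))  # Moderate
```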
References

Armstrong, B.K., White, E., Saracci, R., 1992. Principles of Exposure Measurement in Epidemiology. Oxford University Press, Oxford.
Armstrong, T.J., 1983. An Ergonomics Guide to Carpal Tunnel Syndrome. American Industrial Hygiene Association, Fairfax, Virginia.
Armstrong, T.J., Foulke, J.A., Joseph, B.S., Goldstein, S.A., 1982. Investigation of cumulative trauma disorders in a poultry processing plant. Amer. Ind. Hyg. Assoc. J. 43, 103—116.
Bartko, J., 1994. Measures of agreement. Statist. Med. 13, 737—745.
Baty, D., Buckle, P.W., Stubbs, D.A., 1986. Posture recording by direct observation, questionnaire assessment and instrumentation: a comparison based on a recent field study. In: The Ergonomics of Working Postures: Proc. 1st Int. Occupational Ergonomics Symposium. Taylor & Francis, Philadelphia, pp. 283—293.
Buchholz, B., Paquet, V., Punnett, L., Lee, D., Moir, S., 1996. PATH — a work sampling-based approach to ergonomic job analysis for construction and other non-repetitive work. Applied Ergonomics 27(3), 177—187.
Burdorf, A., Derksen, J., Naaktgeboren, B., Van Riel, M., 1992. Measurement of trunk bending during work by direct observation and continuous measurement. Applied Ergonomics 2(4), 263—267.
Burdorf, A., 1995. Reducing random measurement error in assessing postural load on the back in epidemiologic surveys. Scand. J. Work Environ. Health 21(1), 15—23.
Bureau of Labor Statistics, 1995. Excerpts from annual BLS survey; occupational injuries and illnesses in 1994. Occupational Safety and Health Reporter. Bureau of National Affairs, Inc., Washington, DC, 1015.
Cicchetti, D.V., Feinstein, A.R., 1990. High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 43(6), 551—558.
Douwes, M., Dul, J., 1991. Validity and reliability of estimating body angles by direct and indirect observations. In: Queinnec, Y., Daniellou, F. (Eds.), Designing for Everyone and Everybody: Proc. 11th Congress of the Int. Ergonomics Association. Taylor & Francis, London, pp. 885—887.
Ericson, M., Kilbom, A., Wiktorin, C., et al., 1991. Validity and reliability in the estimation of trunk, arm and neck inclination by observation. In: Designing for Everyone and Everybody: Proc. 11th Congress of the International Ergonomics Association, Vol. 1. Taylor & Francis, London, pp. 245—247.
Feinstein, A.R., Cicchetti, D.V., 1990. High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 43(6), 543—549.
Fleiss, J., 1973. Statistical Methods for Rates and Proportions. Wiley, New York.
Fleiss, J., Cohen, J., 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Meas. 33, 613—619.
Fleiss, J.L., Cohen, J., Everitt, B.S., 1969. Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 72, 323—327.
Fransson-Hall, C., Gloria, R., Kilbom, A., Winkel, J., 1995. A portable ergonomic observation method (PEO) for computerized on-line recording of postures and manual handling. Applied Ergonomics 20(2), 93—100.
Graham, P., Jackson, R., 1993. The analysis of ordinal agreement data: beyond weighted kappa. J. Clin. Epidemiol. 46(9), 1055—1062.
Habes, D., Carlson, W., Badger, D., 1985. Muscle fatigue associated with repetitive arm lifts: effects of height, weight and reach. Ergonomics 28(2), 471—488.
Hadler, N.M. (Ed.), 1987. Clinical Concepts in Regional Musculoskeletal Illness. Grune & Stratton, Inc., New York.
Hjelm, E.W., Winkel, J., Nygard, C.H., Wiktorin, C., Karlqvist, L., 1995. Can cardiovascular load in ergonomic epidemiology be estimated by self-report? J. Occup. Environ. Med. 37(10), 1210—1217.
Jelles, F., Van Benneckom, C., Lankhorst, G., Sibbel, C., Bouter, L., 1995. Inter- and intra-rater agreement of the rehabilitation activities profile. J. Clin. Epidemiol. 48(3), 407—416.
Jensen, R.C., Klein, B.P., Sanderson, L.M., 1983. Motion-related wrist disorders traced to industries, occupational groups. Monthly Labor Rev.
Jonsson, G., Persson, J., Kilbom, A., 1988. Disorders of the cervicobrachial region among female workers in the electronics industry. Int. J. Ind. Ergonomics 3, 1—12.
Karlqvist, L., Winkel, J., Wiktorin, C., 1994. Validity of questions regarding physical activity and perceived exertion in occupational work. Applied Ergonomics 25(5), 319—326.
Keyserling, W.M., 1986. Postural analysis of the trunk and shoulders in simulated real time. Ergonomics 29(4), 569—583.
Keyserling, W.M., Armstrong, T.J., Punnett, L., 1991. Ergonomic job analysis: a structured approach for identifying risk factors associated with overexertion injuries and disorders. Applied Occup. Environ. Hygiene 6(5), 353—363.
Kilbom, A., 1994. Assessment of physical exposure in relation to work-related musculoskeletal disorders — what information can be obtained from systematic observation? Scand. J. Work Environ. Health 20, 30—45.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., 1988. Applied Regression Analysis and Other Multivariable Methods, 2nd ed. Duxbury Press, Belmont, California.
Kraemer, H.C., Bloch, D.A., 1988. Kappa coefficients in epidemiology: an appraisal of a reappraisal. J. Clin. Epidemiol. 41, 959—968.
Landis, R., Koch, G.G., 1977. The measurement of observer agreement for categorical data. Biometrics 33, 159—174.
Maclure, M., Willett, W., 1987. Misinterpretation and misuse of the kappa statistic. Am. J. Epidemiol. 126(2), 161—169.
Muller, R., Buttner, P., 1994. A critical discussion of intraclass correlation coefficients. Statist. Med. 13, 2465—2476.
National Institute for Occupational Safety and Health, 1992. A national strategy for occupational musculoskeletal injuries — implementation issues and research needs — 1991 conference summary. Department of Health and Human Services, Cincinnati, Ohio, NIOSH publication, pp. 93—101.
Punnett, L., 1995. Ergonomic stressors and upper extremity disorders in automotive manufacturing. In: Proc. 2nd Int. Scientific Conf. on Prevention of Work-related Musculoskeletal Disorders (PREMUS 95). Montreal, Canada, pp. 66—68.
Punnett, L., Keyserling, W.M., 1987. Exposure to ergonomic stressors in the garment industry: application and critique of job-site work analysis methods. Ergonomics 30, 1099—1116.
Punnett, L., Fine, L.J., Keyserling, W.M., Herrin, G.D., 1991. Back disorders and nonneutral trunk postures of automobile assembly workers. Scand. J. Work Environ. Health 17, 337—346.
Putz-Anderson, V., 1988. Cumulative Trauma Disorders: A Manual for Musculoskeletal Diseases of the Upper Limbs. Taylor & Francis, Philadelphia.
SAS Institute Inc., 1996. SAS/STAT Software: Changes and Enhancements through Release 6.11. SAS Institute Inc., Cary, North Carolina.
Schneider, S., Punnett, L., Cook, T.M., 1995. Ergonomics: applying what we know. Occup. Med. State Art Rev. 10(2), 385—394.
Selin, K., Winkel, J., 1994. Evaluation of two simple methods for estimation of variation in physical activity in epidemiologic studies. Applied Ergonomics 25(1), 41—46.
Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86(2), 420—428.
Silverstein, B.A., Fine, L.J., Armstrong, T.J., 1987. Occupational factors and carpal tunnel syndrome. Amer. J. Ind. Med. 11, 343—358.
Silverstein, B.A., Fine, L.J., Armstrong, T.J., 1986. Hand wrist cumulative trauma disorders in industry. British J. Ind. Med. 43, 779—784.
Sokal, R.R., Rohlf, F.J., 1981. Biometry: The Principles and Practice of Statistics in Biological Research, 2nd ed. W.H. Freeman and Co., San Francisco, California.
Statistical Analysis Software versions 6.10 and 6.11. SAS Institute, Inc., Cary, North Carolina.
Stetson, D.S., Keyserling, W.M., Silverstein, B.A., Leonard, J.A., 1991. Observational analysis of the hand and wrist: a pilot study. Applied Occup. Environ. Hyg. 6(11), 927—937.
Stock, S.R., 1991. Workplace ergonomic factors and the development of musculoskeletal disorders of the neck and upper limbs: a meta-analysis. Amer. J. Ind. Med. 19, 87—107.
Suen, H.K., Ary, D., 1989. Analyzing Quantitative Behavioral Observation Data. Lawrence Erlbaum Associates, London.
Thompson, W.D., Walter, S.D., 1988. A reappraisal of the kappa coefficient. J. Clin. Epidemiol. 41, 949—958.
Van der Beek, A.J., Kuiper, J.I., Dawson, M., Burdorf, A., Bongers, P.M., Frings-Dresen, M.H.W., 1995. Sources of variance in exposure to nonneutral trunk postures in varying work situations. Scand. J. Work Environ. Health 21(3), 215—222.
Van der Beek, A.J., Van Gaalen, L.C., Frings-Dresen, M.H.W., 1992. Working postures and activities of lorry drivers: a reliability study of on-site observation and recording on a pocket computer. Applied Ergonomics 23(5), 331—336.
Wiktorin, C., Hjelm, E.W., Winkel, J., Koster, M., 1996. Reproducibility of a questionnaire for assessment of physical load during work and leisure time. J. Occup. Environ. Med. 38(2), 190—201.
Wiktorin, C., Karlqvist, L., Winkel, J., 1993. Validity of self-reported exposures to work postures and manual materials handling. Scand. J. Work Environ. Health 19(3), 208—214.
Wiktorin, C., Mortimer, M., Ekenvall, L., Kilbom, A., Hjelm, E.W., 1995. HARBO, a simple computer-aided observation method for recording work postures. Scand. J. Work Environ. Health 21, 440—449.
Winkel, J., Mathiassen, S.E., 1994. Assessment of physical workload in epidemiologic studies: concepts, issues and operational considerations. Ergonomics 37(6), 979—988.
Wolfinger, R., 1995. GLIMMIX, a SAS Macro. SAS Institute, Inc., Cary, North Carolina.
Wolfinger, R., O'Connell, M., 1993. Generalized linear mixed models: a pseudo-likelihood approach. J. Statist. Comput. Simulation 48, 233—243.