International Journal of Pediatric Otorhinolaryngology 50 (1999) 125 – 131 www.elsevier.com/locate/ijporl
Gender and intra-observer agreement about laryngoscopy of papilloma
Philomena Mufalli Behar, N. Wendell Todd *
Department of Otolaryngology, Emory University School of Medicine, 1365 Clifton Road, N.E., Atlanta, GA 30322, USA
Received 10 January 1999; received in revised form 23 June 1999; accepted 30 June 1999
Abstract
Background: The gender of the observer may bias data. Objective: Compare the intra-observer agreements of male and female pediatric otolaryngologists about videotaped images of laryngeal papilloma. Design: Five male and six female pediatric otolaryngologists independently viewed videotapes of ten children undergoing treatment for laryngeal papilloma. Each of 12 anatomic sites was categorized as disease present, absent, or indeterminate. Each observer estimated the percent overall airway obstruction. 5–24 weeks later, each observer repeated his/her assessments. Results: The mean intra-observer agreements for both male and female pediatric otolaryngologists were good, and identical (κ 0.63; proportion of positive agreement 0.82; proportion of negative agreement 0.72). Females more frequently categorized a site as indeterminate. Males more frequently categorized a site oppositely on repeat assessment. The males’ indeterminate/opposite ratio was less than that of the females (P = 0.03). Intra-observer estimates of overall airway obstruction have wide variability: for male pediatric otolaryngologists, differences exceeding 30% are significant; for females, 40%. Conclusions: Male and female pediatric otolaryngologists had equally good and identical intra-observer κ scores and proportions of positive and negative agreement. However, males used the indeterminate category less than did the females, and males more often gave an opposite categorization at the second viewing session. Estimates of overall airway obstruction have much intra-observer variability. © 1999 Elsevier Science Ireland Ltd. All rights reserved.
Keywords: Gender; Laryngeal papilloma; Observer variation; Recurrent respiratory papilloma; Reproducibility of results
1. Introduction
Presented at the meeting of the Society for Ear, Nose, & Throat Advances in Children, St. Petersburg Beach, Florida, December 4–7, 1997. * Corresponding author. Tel.: +1-404-778-5717; fax: +1-404-778-4295. E-mail address:
[email protected] (N.W. Todd)
0165-5876/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved. PII: S0165-5876(99)00226-8

Concern about biased clinical and research data is on-going. Observer-introduced bias, generally termed observer bias, is a distortion, deliberate or not, in the discernment or reporting of the observer’s call. The observer’s gender has not been considered a contributor to observer bias in medical research or clinical care. Observer agreement differences may explain some of the effects of ‘treatment’ for laryngeal papilloma.

Following a lecture about observer agreement about laryngeal papilloma imaged on videotape [9], a question was raised as to whether the study was biased, since all observers were male. Though the question may be considered chauvinistic, some rigorous data support the idea that females and males have different visual perception capabilities [1,7,8]. For example, in comparing water levels, and in the rod and frame task, males tend to score higher [6,7]. In a study using a motor vehicle driving simulator, Caird and Hancock checked perception of arrival time for oncoming vehicles at an intersection: men were more accurate in estimating arrival times, while females were generally more conservative in their judgments [2].

We address the hypothesis that female pediatric otolaryngologists have different intra-observer agreements than do their male colleagues, in assessing children’s laryngeal papilloma imaged on videotape.
2. Methods
The edited intra-operative video-taped images of 10 children with recurrent respiratory papillomatosis were used, as in a previous report [9]. The video excerpts were from recordings obtained using a Storz-Hopkins bronchoscope during operative procedures. Each near-20-second segment was labeled, at its beginning and ending, with the patient’s six-digit birthdate.

Table 1
The 12 laryngeal sites which each pediatric otolaryngologist categorized as ‘papilloma present’, ‘absent’, or ‘indeterminate’ (meaning, cannot determine from the videotape whether disease is present or absent)
Epiglottis: rim, lingual and laryngeal surfaces
Aryepiglottic folds, right and left
False vocal cords, right and left
True vocal cords, right and left
Anterior commissure of the glottis
Posterior glottis
Subglottis
Five male pediatric otolaryngologists independently viewed the tapes on two separate occasions, and observer agreements were determined [9]. A year later, six female pediatric otolaryngologists likewise independently viewed the tape on two separate occasions. Each observer viewed the tape at his/her own pace. Typically, an observer viewed the tape both forward and backward, and at both regular and slow motion. Each observer completed two tasks for each of the ten video segments:
1. Categorized the involvement with papilloma at each of twelve laryngeal sites (Table 1) as ‘present’, ‘absent’, or ‘indeterminate’ (meaning, cannot determine from the video whether disease is present or absent); and,
2. Estimated the overall percent of airway obstruction.

The time interval between the two video-tape viewings/categorizations ranged from 5 to 20 weeks (mean 11.4) for the male pediatric otolaryngologists, and from 8 to 24 weeks (mean 13.1) for the female pediatric otolaryngologists. The years of clinical experience of the two groups of pediatric otolaryngologists were similar: the males were in post-graduate years (PGY) 6–21 (mean 10.2), the females were PGY 6–20 (mean 11.5). Each clinician had experience in treating children with recurrent laryngeal papillomatosis.

Intra-observer agreements were calculated using the κ score, which corrects for chance agreement (Table 2). A κ of 0 indicates agreement no better than chance, < 0.4 indicates poor agreement, and > 0.75 means excellent agreement. The proportion of positive agreement (Ppos) is analogous to sensitivity. The proportion of negative agreement (Pneg) is analogous to specificity. The proportion of intra-observer opposite categorizations, Popposite, was calculated to determine the degree of inconsistent responses for each male and female observer, and for each anatomic site. The indeterminate/opposite ratio, Pindeterminate/Popposite, was calculated for each observer, and for each anatomic site.
Roundings of numbers were done conservatively downward; for example, a κ value of 0.698 was rounded to 0.69. The non-parametric rank sum two sample test (two-tailed) was used to compare the differences of the indeterminate/opposite ratios of the male and the female pediatric otolaryngologists. The null hypothesis was that the distributions for the two groups are the same.

Table 2
Comparison of the first and second categorizations, done 15 weeks apart, by female pediatric otolaryngologist pseudonymed Bigwig (a)

                               At second categorization
At the first categorization    No papilloma   Indeterminate   Papilloma present   Totals
No papilloma                   34 (b)         4               6                   44
Indeterminate                  7              30              1                   38
Papilloma present              5              1               32 (c)              38
Totals                         46             35              39                  120

(a) Agreement is good, κ = 0.69 with the 95% confidence limits 0.58–0.80. The Ppos and Pneg are 0.83 and 0.75, respectively (these are analogous to sensitivity and specificity).
(b), (c) Data cells top-left (b) and bottom-right (c) can be considered, respectively, as ‘at neither assessment of the videotape did she identify papilloma’ and ‘at each assessment she saw papilloma’.
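The agreement measures defined in the Methods can be illustrated with the counts of Table 2. The following is a minimal sketch, not the authors' own computation: it assumes the usual Cohen form of κ and the Cicchetti and Feinstein form of the category-specific proportions [3], and it assumes that Pindeterminate is computed the same way for the middle category and that Popposite is the share of present/absent reversals. The function and variable names are hypothetical. Under these assumptions the sketch reproduces the values reported for observer Bigwig (κ 0.69, Ppos 0.83, Pneg 0.75, Pindeterminate 0.82, Popposite 11/120).

```python
from math import floor

# Counts from Table 2: rows = first categorization, columns = second
# categorization, ordered (no papilloma, indeterminate, papilloma present).
TABLE_2 = [
    [34, 4, 6],
    [7, 30, 1],
    [5, 1, 32],
]


def round_down(x, places=2):
    """Round conservatively downward, as described in the Methods."""
    scale = 10 ** places
    return floor(x * scale) / scale


def agreement_stats(table):
    """Return kappa and category-specific agreement for a 3 x 3 table.

    Assumed formulas (not spelled out in the paper):
      kappa       = (p_obs - p_exp) / (1 - p_exp)
      P_category  = 2 * diagonal cell / (row total + column total)
      P_opposite  = (absent->present + present->absent) / n
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]

    p_obs = sum(table[i][i] for i in range(size)) / n
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(size)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)

    def specific(i):
        return 2 * table[i][i] / (row_tot[i] + col_tot[i])

    p_opp = (table[0][2] + table[2][0]) / n
    return {
        "kappa": kappa,
        "p_neg": specific(0),
        "p_indeterminate": specific(1),
        "p_pos": specific(2),
        "p_opposite": p_opp,
        "ratio": specific(1) / p_opp,
    }


stats = agreement_stats(TABLE_2)
for name, value in stats.items():
    print(name, round_down(value, 3))
```

Keeping the indeterminate and opposite proportions next to κ makes visible what a single κ value hides: two observers with the same κ can differ in how often they hedge and how often they reverse a call.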
3. Results
The mean intra-observer agreements, as expressed by the κ scores, were identical for the male and female groups of pediatric otolaryngologists: κ = 0.63 ± 0.11 (95% confidence interval) (Table 3). The mean Ppos, which is analogous to sensitivity, was 0.82 for both the male and female groups of pediatric otolaryngologists. Similarly, the mean Pneg, which is analogous to specificity, was identical for the males and females: 0.72.

The indeterminate and opposite proportions, Pindeterminate and Popposite, varied by anatomic site (Table 4). Sites most consistently categorized by both the male and female pediatric otolaryngologists were the posterior glottis, the left true vocal cord, and the anterior commissure. In contrast, sites more often categorized oppositely were the right and left false vocal cords. The epiglottis and aryepiglottic folds were usually indeterminately categorized by all participants.

The indeterminate and opposite proportions, Pindeterminate and Popposite, also varied by gender of the observer. Exactly opposite categorizations (that is, at one session a site was labeled as having disease present, but at the other session was labeled as not having disease) for a particular anatomic site were made more frequently by male than by female pediatric otolaryngologists: mean 10.2% (range 8–12) versus mean 6.5% (range 2–12), respectively. Indeterminate categorizations were more frequently made by female pediatric otolaryngologists (Table 4). When Pindeterminate and Popposite were combined into indeterminate/opposite ratios, the male pediatric otolaryngologists differed significantly from the female pediatric otolaryngologists: P = 0.03. That is, the males were relatively more assured though more inconsistent with their categorizations of disease. In contrast, the females were relatively more indeterminate, but were more consistent.

Estimates of overall airway obstruction varied greatly for both the female and the male pediatric otolaryngologists (Figs. 1 and 2). The frequency plot of difference in percent obstruction (first ‘call’ minus second ‘call’), for both the male and female observers, approximates a bell-shape. Accord was again apparent for the male and female otolaryngologists: one standard deviation in percent airway obstruction was 15.1% for males, 19.9% for females.
4. Discussion
In research and clinical care, the gender of the observer or task-performer is rarely mentioned and rarely considered. In a study about gender bias in the Journal of the American Medical Association’s peer review process, gender differences in editor and reviewer characteristics were found, but the final outcome was that of no effect on peer review or acceptance for publication [5]. Succinctly, we have also noted gender differences, but the ‘clinical bottom line’ is essentially the same for male and female pediatric otolaryngologists.

Table 3
Intra-observer agreements as related to gender of observer (a)

Observer           κ ± 2 SD       Ppos   Pneg   Pindeterminate   Popp         Pindeterminate/Popposite
Male Comet         0.50 ± 0.13    0.55   0.55   0.55             14/120       4.7
Male Dancer        0.79 ± 0.11    0.88   0.92   0                11/120       0
Male Dasher        0.57 ± 0.12    0.72   0.64   0.62             14/120       5.3
Male Prancer       0.65 ± 0.11    0.77   0.70   0.70             10/120       8.4
Male Vixen         0.65 ± 0.12    0.81   0.79   0.69             12/120       6.9
Mean of males      0.63 ± 0.11    0.82   0.72   0.51             0.102        5.06
Female Bigwig      0.69 ± 0.11    0.83   0.75   0.82             11/120       8.9
Female Fiver       0.48 ± 0.13    0.87   0.55   0.53             5/99 (b)     10.5
Female Hazel       0.62 ± 0.11    0.81   0.72   0.75             2/120        45.0
Female Holly       0.78 ± 0.09    0.91   0.82   0.82             6/120        16.4
Female Kehaar      0.64 ± 0.12    0.76   0.76   0.78             14/120       6.7
Female Vervain     0.58 ± 0.12    0.76   0.74   0.63             10/120       7.8
Mean of females    0.63 ± 0.11    0.82   0.72   0.72             0.068        10.59

(a) For anonymity, arbitrary names have been substituted.
(b) This observer did not complete the second viewing session.
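The two-tailed rank sum comparison described in the Methods can be sketched with the Pindeterminate/Popposite ratios of Table 3. The paper does not state whether an exact or an approximate version of the test was used; the sketch below assumes the exact permutation form, which is feasible for groups of five and six, and handles untied data only. The function name is hypothetical. Under these assumptions the result is about 0.03, in line with the reported P value.

```python
from itertools import combinations

# Pindeterminate/Popposite ratios for the individual observers (Table 3).
MALE_RATIOS = [4.7, 0.0, 5.3, 8.4, 6.9]
FEMALE_RATIOS = [8.9, 10.5, 45.0, 16.4, 6.7, 7.8]


def exact_rank_sum_p(group_a, group_b):
    """Return (rank sum of group_a, exact two-tailed p-value).

    Every way of assigning len(group_a) of the pooled ranks to group_a is
    treated as equally likely under the null hypothesis that the two
    distributions are the same. Ties are not handled in this sketch.
    """
    pooled = sorted(group_a + group_b)
    if len(set(pooled)) != len(pooled):
        raise ValueError("ties present; this sketch handles untied data only")

    rank = {value: position + 1 for position, value in enumerate(pooled)}
    observed = sum(rank[value] for value in group_a)

    all_ranks = range(1, len(pooled) + 1)
    sums = [sum(combo) for combo in combinations(all_ranks, len(group_a))]
    lower_tail = sum(s <= observed for s in sums) / len(sums)
    upper_tail = sum(s >= observed for s in sums) / len(sums)
    return observed, min(1.0, 2 * min(lower_tail, upper_tail))


rank_sum, p_value = exact_rank_sum_p(MALE_RATIOS, FEMALE_RATIOS)
print(rank_sum, round(p_value, 3))
```

With only 11 observers the test has 462 possible rank assignments, so the attainable P values are coarse; a single observer changing rank would move the result noticeably.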
Table 4
Intra-observer opposite categorizations, and consistently indeterminate categorizations, as related to anatomic site, and gender of observer (a)

                                 Opposite categorizations           Consistently indeterminate
Site                             Males (N = 5)   Females (N = 6) (b)   Males (N = 5)   Females (N = 6)
Epiglottis’ lingual aspect       0               1                     28              49
Epiglottis’ rim                  0               0                     17              29
Epiglottis’ laryngeal aspect     14              6                     14              24
Aryepiglottic fold, right        3               4                     5               15
Aryepiglottic fold, left         2               3                     5               16
False vocal cord, right          14              4                     0               0
False vocal cord, left           2               5                     0               0
True vocal cord, right           5               5                     0               5
True vocal cord, left            6               2                     0               4
Anterior commissure              6               5                     0               3
Posterior glottis                5               4                     0               3
Subglottis                       4               7                     0               1

(a) Poorly depicted images likely explain some of these categorizations.
(b) One female observer did not complete the second viewing session for two patients.
Fig. 1. Frequency distribution of the males’ intra-observer agreements as to extent of overall airway obstruction. Five male pediatric otolaryngologists each independently twice categorized the videotapes of ten children’s laryngeal papilloma. As 1 standard deviation for these data is 15.1, a change in percent overall obstruction would have to exceed 30% to be considered statistically significant. Two data points were rounded ( − 3 was rounded to − 5, and −2 to zero) in constructing this graph. For anonymity of the tape reviewers, arbitrary names have been substituted. From [9] with permission.
Fig. 2. Frequency distribution of the females’ intra-observer agreements as to extent of overall airway obstruction. Six female pediatric otolaryngologists each independently twice categorized the videotapes of ten children’s laryngeal papilloma. As 1 SD for these data is 19.9, a change in percent overall obstruction would have to exceed 39.8% to be considered statistically significant. There are only eight data points for observer Fiver on this graph, as she did not complete categorizing the images of the last two endoscopies in the second viewing session. And, there are only nine data points for Hazel, as she considered the ‘% overall airway obstruction’ inestimable at the second viewing session of one patient’s endoscopy. For anonymity of the tape reviewers, arbitrary names have been substituted. Note the similarity of this Figure to Fig. 1.
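The thresholds quoted in the captions of Figs. 1 and 2 follow from taking two standard deviations of the intra-observer differences as the limit of measurement noise. A minimal sketch of that arithmetic is given below; the function names and the example estimates are hypothetical and are not data from the study.

```python
# One standard deviation of (first call minus second call), in percentage
# points of overall airway obstruction, as reported for each group.
SD_MALES = 15.1
SD_FEMALES = 19.9


def significance_threshold(sd, multiplier=2.0):
    """Smallest change treated as exceeding intra-observer noise (2 SD rule)."""
    return multiplier * sd


def exceeds_noise(first_estimate, second_estimate, sd):
    """True if the change between two estimates is larger than 2 SD."""
    return abs(first_estimate - second_estimate) > significance_threshold(sd)


print(significance_threshold(SD_MALES))    # about 30 percentage points
print(significance_threshold(SD_FEMALES))  # about 40 percentage points

# Hypothetical example: an estimate falling from 60% to 35% obstruction
# stays within the noise band for either group.
print(exceeds_noise(60, 35, SD_MALES), exceeds_noise(60, 35, SD_FEMALES))
```

On this reading, a reported drop from 60% to 35% obstruction after treatment could not be distinguished from the same observer simply viewing the same larynx twice.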
Various studies of visual perception [7,8] and dynamic spatial processing [2,10] report that men perform these tasks better than do women. We suspect that the selection processes to be an otolaryngologist, analogous to the selection processes to be an aircraft pilot, exclude both male and female candidates having poor visual perception and dynamic spatial processing abilities. Nevertheless, gender-associated observer bias is seemingly present in the data of the present report. The clinical significance of the gender-associated observer categorizations is unknown, though probably small.

As expressed by the κ statistic, intra-observer agreement about papillomatous involvement of various anatomic sites was good, and identical, for both the male and the female groups of pediatric otolaryngologists. Nevertheless, the κ statistic has limitations [3]. The present report is an example of κ, even in conjunction with Ppos and Pneg, not conveying interesting information (Table 2). The indeterminate/opposite ratios for the male and female groups of pediatric otolaryngologists are statistically significantly different. As Cicchetti & Feinstein emphasized for data in 2 × 2 contingency tables, Ppos and Pneg should be considered along with κ [3]. For 3 × 3 contingency tables as we constructed, the intermediate category (Pindeterminate of our data) and opposite categorizations (Popposite) should be considered.

Ethical and monetary considerations suggest that videotape images have a role in validating observers of laryngeal papillomatosis. The disadvantages of utilizing video-tape images to assess laryngeal papilloma are numerous, including: “(1) the lack of depth perception, and (2) the impossibility of the observer additionally checking a site by viewing from a different angle” [9].
In the present report, sites often categorized inconsistently (right and left false vocal cords), and sites that were usually indeterminately categorized (epiglottis and aryepiglottic folds) were comparatively poorly depicted on the videotape (Table 4). If the videotape had better depicted the endoscopic findings, then the oppositely categorized anatomic sites, and the indeterminately categorized sites, may have been categorized consistently.
The practice of describing an ‘estimate of overall airway obstruction’ is questionable. Intra-observer agreement is lax, and inter-observer agreement even more lax [9]. Intra-observer estimates of overall airway obstruction varied similarly for both female and male pediatric otolaryngologists (Figs. 1 and 2). A significant difference in ‘overall airway obstruction’ is 2 SDs, which is about 30% for the male observers and about 40% for the female observers. Such widely varying interpretations of airway obstruction are reminiscent of the ‘grossly erroneous’ visual interpretations of arteriograms of coronary stenoses [4]. Of course, any convincing change in a patient’s disease must exceed the imprecision (noise) of the measuring system.

Because the estimate of ‘overall airway obstruction’ was considered by both the male and female observers to be the most nearly clinically realistic part of the videotape viewing, the recognized variability is probably similar to that in ‘live surgery’. The variability of ‘live surgery’ vs. videotape assessment of obstruction [9] was similar to the intra-observer variability in assessing obstruction on viewing the videotape. Hence, poor estimates of airway obstruction seem most attributable to observer factors, rather than to video-recording factors.

The question of inter-observer agreement about laryngeal papilloma, when one observer is male and the other female, was not addressed in this report. Intra-observer variability is nearly always less than is inter-observer variability. We think this question of inter-observer variability, along with the question of accuracy, is more likely answerable when two concerns are met. These concerns are:
1. The limited quality of the videotape images used for this report. The video images were not taken for the purpose of illustrating the various laryngeal anatomic sites. Rather, the images used for this report were simply those obtained while the bronchoscope was transiting the larynx.
2. The lack of consensus of a ‘gold standard’ for whether disease is present or absent at an anatomic site. Should the ‘gold standard’ be what the operating surgeon described (based on the additional information from viewing from surgeon-determined angles, from binocular microscopic inspection, and from palpation), what an ‘expert’ otolaryngologist determined from reviewing a videotape, or what a pathologist found in biopsy specimens?
5. Conclusion
Intra-observer agreements as to whether videoendoscopic images demonstrated papillomatous involvement of specific laryngeal sites, as expressed by the κ statistic and by the proportions of positive agreement (Ppos) and of negative agreement (Pneg), were identically good for both male and female pediatric otolaryngologists. However, the male pediatric otolaryngologists were more commonly inconsistent and insistent than were the females. We suggest that observational data include a statement of the strength of intra-observer agreement. For agreements that are less than excellent (as expressed, for example, by κ, Ppos, Pneg, Pindeterminate, and Popposite), explanations should be sought. One explanation may be the gender of the observer. Estimates of overall airway obstruction have so much intra-observer variability as to be useless. A statistically significant difference in overall airway obstruction was at least 30%, too wide a range to be clinically useful.

Acknowledgements
Robin Fivush PhD guided us to the psychology literature on gender differences in visual perception. Pediatric otolaryngologists who twice reviewed the videotape were Linda Brodsky, Craig Derkay, Verlia Gower, Ian Jacobs, Carol MacArthur, Johnna MacCormick, Dimitry Rabkin, Brian Wiatrak, Ann White, and Benjamin White.

References
[1] D. Bibawi, B. Cherry, J.B. Hellige, Fluctuations of perceptual asymmetry across time in women and men: effects related to the menstrual cycle, Neuropsychologia 33 (1995) 131–138.
[2] J.K. Caird, P.A. Hancock, The perception of arrival time for different oncoming vehicles at an intersection, Ecol. Psychol. 6 (1994) 83–109.
[3] D.V. Cicchetti, A.R. Feinstein, High agreement but low kappa: II. Resolving the paradoxes, J. Clin. Epidemiol. 43 (1990) 551–558.
[4] N. Danchin, Y. Juilliere, D. Foley, P.W. Serruys, Visual versus quantitative assessment of the severity of coronary artery stenoses: Can the angiographer’s eye be reeducated?, Am. Heart J. 126 (1993) 594–600.
[5] J.R. Gilbert, E.S. Williams, G.D. Lundberg, Is there gender bias in JAMA’s peer review process?, J. Am. Med. Assoc. 272 (1994) 139–142.
[6] M.C. Linn, A.C. Petersen, Emergence and characterization of sex differences in spatial ability: a meta analysis, Child Dev. 56 (1985) 1479–1498.
[7] M. Robert, T. Ohlmann, Water-level representation by men and women as a function of rod-and-frame test proficiency and visual and postural information, Perception 23 (1994) 1321–1333.
[8] M. Robert, J. Pelletier, Women’s deficiency in water-level representation: present in visual conditions yet absent in haptic contexts, Acta Psychol. (Amsterdam) 87 (1994) 19–32.
[9] N.W. Todd, Observer agreement about laryngoscopic assessment of papilloma, Int. J. Pediatr. Otorhinolaryngol. 41 (1997) 37–46.
[10] I. Viaud-Delmon, Y.P. Ivanenko, A. Berthoz, R. Jouvent, Sex, lies and virtual reality, Nat. Neurosci. 1 (1998) 15–16.