Intraobserver agreement in interpretation of digital epiluminescence microscopy
Ignazio Stanganelli, MD,a Marco Burroni, PhD,b Silvia Rafanelli, MD,a and Laura Bucchi, MDc Ravenna, Siena, and Forlì, Italy
Background: Although a major problem with the classification of epiluminescence microscopy (ELM) findings is the lack of standard definitions, reproducibility of the criteria proposed has never been investigated.
Objectives: Our purposes were (1) to perform a review of four major published classifications to obtain a set of apparently well-defined ELM variables and descriptors and (2) to evaluate the ability of one of us to report ELM findings in melanocytic lesions consistently according to these criteria.
Methods: Intraobserver agreement (with a set of 44 selected descriptors) between two readings of 150 digital ELM images was evaluated with the kappa (K) statistic. Subgroups of descriptors were compared for K value distribution.
Results: The median K value for the whole series of descriptors was 0.66. Median K did not vary significantly among the four classification systems (K = 0.61 to 0.67). Agreement was significantly better as to the presence or absence of ELM findings (K range, 0.39 to 1.00; median K, 0.77) compared with agreement as to their distribution (K range, 0.10 to 0.79; median K, 0.47; p = 0.0007) and their width, thickness, and size (K range, 0.06 to 0.83; median K, 0.39; p = 0.0075).
Conclusion: Although nothing can be inferred from a single study, descriptors associated with low intraobserver agreement are likely to be inadequately defined.
(J AM ACAD DERMATOL 1995;33:584-9.)
From the Department of Dermatology and the Centre for Cancer Prevention, Santa Maria delle Croci Hospital, Ravenna,a the Dell'Eva/Burroni Studio, Florence/Siena,b and the Romagna Cancer Registry, Forlì.c
Supported by the Istituto Oncologico Romagnolo, Forlì.
Accepted for publication Feb. 17, 1995.
Reprint requests: Ignazio Stanganelli, MD, Divisione di Dermatologia, Ospedale Santa Maria delle Croci, via Missiroli 10, 48100 Ravenna, Italy.
Copyright © 1995 by the American Academy of Dermatology, Inc.
0190-9622/95 $5.00+0 16/1/64310

Epiluminescence microscopy (ELM) is a noninvasive technique that renders the stratum corneum translucent and subsurface structures of skin accessible to in vivo examination.1 Some pilot studies2-6 have suggested that ELM facilitates the diagnosis of pigmented skin lesions and by implication reduces the need for biopsy in those with an unequivocal appearance. However, the accuracy and usefulness of ELM have been questioned.7 In fact, available studies on the concordance of ELM findings with subsequent histologic findings are based on series of pigmented skin lesions with a relatively low prevalence of melanoma,4-6, 8-10 and only Soyer et al6 and
Yadav et al.10 have correlated the ELM findings with histologic step sections. Another major problem with ELM is the lack of standardized classification criteria. A reference terminology based on an essential set of variables and descriptors was developed by the consensus conference in 1989.11 However, these standards have undergone numerous modifications in subsequent years. For most structures discernible on ELM, a variety of appearances have been reported, leading to a plethora of descriptive characterizations. Despite this, the reproducibility of clinical interpretation according to these multifaceted criteria has never been the object of a formal analysis.
Currently, novel approaches to this problem are being offered by advances in digital imaging.12-14 In particular, digital ELM (D-ELM) files can provide a substrate for studies of intraobserver and interobserver variability on the basis of available classifications. In 1993 D-ELM equipment was made available to us. This study is part of a larger protocol for validation and quality assurance of the technique and focuses on the intraobserver variability in interpretation of D-ELM pictures of melanocytic lesions according to proposed major classification systems. Our aims were (1) to review these classifications systematically to identify a comprehensive set of definitions for ELM variables and descriptors and (2) to determine the ability of one of us to classify ELM findings consistently according to these variables and descriptors.

MATERIAL AND METHODS
Study design
Phase I. We performed a comprehensive search of the
literature for articles about ELM of pigmented skin lesions. We found that the work of three research groups (Table I) accounted for most articles. Major relevant publications4-6, 8, 15-17 and the report of the consensus conference11 were reviewed to obtain a systematic list of all ELM variables (often referred to as patterns, features, or criteria) and descriptors (modalities). Original definitions were recorded and analyzed. Those showing only negligible differences in terminology were unified. A reporting form was drawn from this systematic list and tested during a 6-month period. We thought several previously included variables were unacceptably ambiguous for practical application and deleted them. In this way we obtained the final version of the report form to be used in phase II. We also excluded from the form some variables of well-defined significance that were not associated as a rule with melanocytic lesions (e.g., telangiectasia,4, 5, 8, 11 red-blue areas,6, 8, 11 and maple leaf-like areas8, 11). For reasons of brevity the original descriptive definitions of variables and descriptors are not reported in this article.
Phase II. From the unselected consecutive series of D-ELM images previously recorded, a set of 150 cases (all magnified at ×16 or ×25) was abstracted by one of us (S. R.) according to three essential criteria: (1) good image quality, (2) presence of ELM features consistent with a melanocytic lesion,4-6, 8, 11, 15-17 and (3) stratification of the sample to obtain a statistically adequate prevalence (~50%) of cases expressing pigment network (PN). As a first reading, the images selected were classified by one of us (I. S.) in May 1994. Two months later the same material was retrieved by one investigator (S. R.) and evaluated a second time by the same reader (I. S., without access to any information except the image on the screen). Data were managed and analyzed for intraobserver agreement by a third investigator (L. B.).

D-ELM equipment
Our equipment included a Leica Wild M-650 microscope (Leica AG, Heerbrugg, Switzerland), a Sony 3CCD DXC-930P color video camera, an AT-Vista Videographics adapter (used with 768 × 576 pixels and 16 bits/pixel, with 32,768 colors displayed), an IBM 6384 personal computer, a Sony Trinitron Analog PVM2043MD monitor, and DBDERMO MIPS software (Dell'Eva/Burroni Studio, Florence/Siena, Italy). Spatial resolution was 44 pixels/mm at ×16 magnification and 70 pixels/mm at ×25 magnification. To obtain unbiased, mutually comparable kappa (K) values associated with ELM variables, images were evaluated under standard conditions. Therefore no color or contrast enhancement (which would have selectively improved recognition and evaluation of subtle ELM structures)8 was used.

Data analysis
The percentage of crude (or perfect) agreement and the K index18 for agreement over that expected by chance alone were calculated for each descriptor considered. Subgroups of descriptors were compared for distribution of the associated K values by means of two-tailed Wilcoxon rank-sum tests. A p value less than 0.05 was considered statistically significant. Among dichotomous descriptors, we excluded from analysis those with one of the two modalities reported with a frequency less than 5 on both readings. A K value of 1.0 indicates full agreement beyond chance. Values greater than 0.75 are generally considered excellent, values less than 0.40 poor, and values between 0.40 and 0.75 fair to good.
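As an illustration of the agreement measures described under Data analysis, the following Python sketch computes the crude agreement and the kappa index, K = (p_o - p_e) / (1 - p_e), for two readings of one descriptor, where p_o is the observed proportion of agreement and p_e the proportion expected by chance from the marginal frequencies of each reading. It is a minimal sketch and not the software used in the study: the function names, the `is_analyzable` helper (one possible reading of the exclusion rule for rarely reported modalities), and the ten example readings are all hypothetical. The interpretation bands are those stated in the text.

```python
from collections import Counter


def crude_agreement(first, second):
    """Proportion of cases given the same category on both readings."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohen_kappa(first, second):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    n = len(first)
    p_o = crude_agreement(first, second)
    c1, c2 = Counter(first), Counter(second)
    # Chance agreement from the marginal frequencies of the two readings.
    p_e = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_o - p_e) / (1.0 - p_e)


def is_analyzable(first, second, categories=None, minimum=5):
    """Assumed exclusion rule: drop a descriptor when some modality is
    reported fewer than `minimum` times on BOTH readings."""
    c1, c2 = Counter(first), Counter(second)
    if categories is None:
        categories = set(c1) | set(c2)
    return not any(c1[c] < minimum and c2[c] < minimum for c in categories)


def interpret_kappa(k):
    """Bands quoted in the text: >0.75 excellent, <0.40 poor, else fair to good."""
    if k > 0.75:
        return "excellent"
    if k < 0.40:
        return "poor"
    return "fair to good"


# Hypothetical readings of one present/absent descriptor on 10 images.
first_reading = ["present"] * 6 + ["absent"] * 4
second_reading = ["present"] * 5 + ["absent"] * 5

if is_analyzable(first_reading, second_reading, ("present", "absent")):
    k = cohen_kappa(first_reading, second_reading)
    print(f"crude agreement = {crude_agreement(first_reading, second_reading):.2f}")
    print(f"K = {k:.2f} ({interpret_kappa(k)})")
```

In this hypothetical example nine of ten images are classified identically (p_o = 0.90), chance agreement is p_e = 0.50, and K is therefore 0.80. The same function applies unchanged to descriptors with three modalities (e.g., central/peripheral/both), because it only compares category labels.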
RESULTS
The left column in Table I shows variables and descriptors selected for the study. These were obtained from the works referenced in the second column. Some variables and descriptors were proposed by a single classification system (e.g., PN patches and their characteristics); others were proposed with similar definitions by two (e.g., pseudopods), by three (e.g., PN regularity), and by four (e.g., brown globules) classifications. For certain features (e.g., depigmentation) significantly different definitions were proposed. For each variable and descriptor the right columns in Table I show the crude agreement and the K value between the two readings.
The median of the 40 K values listed in Table I is 0.66. Compared with each other (for a total of six pairwise comparisons), the four classification systems showed no significant difference in the distribution of the K values for agreement on the descriptors included. The median value was 0.67 for the B classification (15 descriptors); 0.66 for the S classification (10 descriptors); 0.61 for the K classification (22 descriptors); and 0.66 for the PS classification (22 descriptors). When the descriptors included in single classifications (n = 25) were compared with those shared by at least two systems (n = 15), no significant difference in K value distribution was found (median K = 0.67 for both groups).
In Table II descriptors are (arbitrarily) grouped into four major types. Presence/absence was the type of descriptor associated with the highest reproducibility; substantially lower scores were observed for descriptors of distribution and for those addressing metric (width, thickness, and size) and chromatic (pigmentation) characteristics of skin structures discernible with D-ELM.

Table I. ELM variables and descriptors for melanocytic lesions according to published classifications and intraobserver agreement between two readings of a set of 150 digital ELM images

ELM classification                                          Agreement
Descriptors                              Authors            Perfect agreement (%)   K
PN
   1. Present/absent                     B, S, K, PS        150/150 (100)           1.00
   2. Discrete/prominent/both            B, S, K, PS        48/97 (49)              0.23
   3. Regular/irregular                  B, S, PS           53/97 (55)              0.16
   4. Wide/narrow                        B, PS              74/97 (76)              0.42
   5. Delicate/broad                     B, S, K            74/97 (76)              0.50
   6. Focally absent (yes/no)            K                  86/97 (89)              0.10
PN line thickness
   7. Homogeneous/varied                 K                  65/97 (67)              0.36
PN hole size
   8. Homogeneous/varied                 K                  72/97 (74)              0.06
PN patches
   9. Present/absent                     K                  94/97 (97)              0.39
  10. Dark/light/both                    K                  42/93 (45)              0.13
  11. Central/peripheral/both            K                  59/93 (63)              0.41
PN sharp margins
  12. Present/absent                     B, K, PS           90/97 (93)              0.78
PN branching
  13. High order/low order               K                  94/97 (97)              0.47
Diffuse pigmentation
  14. Present/absent                     PS                 145/150 (97)            0.72
  15. Regular/irregular                  PS                 113/138 (82)            0.53
  16. Homogeneous/varied                 PS                 104/138 (75)            0.21
Irregular extensions
  17. Present/absent                     S                  140/150 (93)            0.69
Irregular extensions according to K
  18. Present/absent                     K                  143/150 (95)            0.69
Pseudopods
  19. Present/absent                     K, PS              150/150 (100)           1.00
Radial streaming
  20. Present/absent                     B, K, PS           148/150 (99)            0.91
Brown globules
  21. Present/absent                     B, S, K, PS        128/150 (85)            0.67
  22. Regular/irregular                  B, PS              34/40 (85)              0.41
  23. Homogeneous/varied                 K, PS              23/40 (57)              0.17
Pigment dots
  24. Present/absent                     K                  125/150 (83)            0.67
Black dots
  25. Present/absent                     B, S, PS           124/150 (83)            0.65
  26. Regular/irregular                  PS                 53/58 (91)              0.79
  27. Central/peripheral/both            PS                 45/58 (78)              0.64
  28. Homogeneous/varied                 PS                 54/58 (93)              0.83
Whitish veil
  29. Present/absent                     B, S, K            120/150 (80)            0.55
  30. Uniform/focally irregular          K                  30/35 (86)              0.71
Gray-blue areas
  31. Present/absent                     B, S, PS           140/150 (93)            0.77
Gray-blue areas according to K
  32. Present/absent                     K                  Not calculated
Depigmented areas
  33. Present/absent                     B*                 148/150 (99)            0.88
Depigmentation according to K
  34. Present/absent                     K†                 Not calculated
Depigmentation according to PS
  35. Present/absent                     PS‡                137/150 (91)            0.77
  36. Regular/irregular                  PS                 79/106 (75)             0.45
  37. Central/peripheral/both            PS                 73/106 (69)             0.51
Hypopigmented areas
  38. Present/absent                     B§                 138/150 (92)            0.80
Hypopigmentation
  39. Present/absent                     K‖                 137/150 (91)            0.78
Reticular depigmentation
  40. Present/absent                     B, PS              Not calculated
Milia-like cysts
  41. Present/absent                     B, S,¶ K           142/150 (95)            0.78
Comedo-like openings
  42. Present/absent                     B, S, K            Not calculated
Horny plugs
  43. Present/absent                     PS                 148/150 (99)            0.92
Global pattern
  44. Six possible patterns#             K                  141/150 (94)            0.90

B, Bahmer et al.11; K, Kenet et al.8; not calculated, descriptor with one of the two modalities being reported in fewer than five cases on both readings; PS, Pehamberger et al.4, 15 and Steiner et al.5, 16; S, Soyer et al.6, 17
*Originally termed "white scarlike depigmented areas."
†Originally defined as region that has less pigmentation than background skin.
‡Originally defined as absence or diminution of pigment within the lesion.
§Originally defined as areas with reduced amount of melanin.
‖Originally defined as region with less pigmentation than the overall degree of lesion pigmentation.
¶Structures originally defined as yellowish white dots are included.
#Globular, multicomponent, nodular, saccular, homogeneous, or network.

DISCUSSION
High levels of intraobserver variability may depend on inadequate expertise of the operator6, 9, 15, 19 and on inadequate definition of the classification criteria. In determining whether ELM diagnostic variables are poorly standardized, there is no substitute for agreement studies. In other branches of pathology, evaluation of reproducibility has played a key role in the development of reliable classifications.20
Intraobserver variability studies have been conducted in many diagnostic areas.21-23 However, they differ substantially in rationale and implications from those based on multiple external comparisons. In brief, validity of intraobserver reproducibility data is inversely related to the values observed. Good levels of intraobserver agreement may result as easily from systematic misinterpretations of ill-defined criteria as from correct application of standardized criteria. In contrast, the most plausible explanation for low levels of intraobserver agreement is the lack of standardized definitions. As a consequence, the lowest K scores are the most reliable results of the present study.
Although our observations cannot be generalized, ELM variables found to be associated with unsatisfactory reproducibility are probably ill defined. Therefore it should be taken into account (1) that the introduction of ELM in non-research-oriented settings may be affected by poor reliability of certain observations and/or (2) that the histologic counterparts of misinterpreted ELM findings may differ from those expected on the basis of the literature. As a practical consequence, our data suggest that all clinicians involved should undertake a preliminary evaluation (followed by periodic checks) of their ability to use the criteria adopted consistently.
Most ELM descriptors other than present/absent were associated with moderate to poor agreement, with the lowest K scores being observed on descriptors of width, thickness, and size and on descriptors of chromatic characteristics of ELM features. This is particularly the case for PN descriptors. Other authors24-26 have raised doubts about standardization of PN features, especially with regard to symmetry and pigmentation. A possible explanation is that diffuse pigmentation4, 5, 6, 16 may be so dark as to obscure other structures and to prevent accurate recognition. In accordance with the original studies, however, we have not considered the grading of diffuse pigmentation. On the basis of our experience, the only PN feature apparently discernible with high reliability is the presence of sharp margins,4, 5, 8, 11, 15, 16 an important criterion for the diagnosis of early melanoma.5, 6
Good levels of agreement on presence/absence with a decrease on subsequent characterizations were also observed for diffuse pigmentation, brown globules, and depigmentation as described by Pehamberger et al.4, 15 and Steiner et al.5, 16 For some variables such as black dots4-6, 11, 15-17 and the whitish veil,8, 11, 17 the observed agreement was greater on descriptors other than present/absent. This suggests that reliability of evaluation of these characteristics depends on the degree of the variable's expression. In our opinion the possibility of characterizing the whitish veil accurately is related to size, intensity of underlying pigmentation, and thickness of cornified layer,10 whereas evaluation of black dots depends on number per surface unit and size. Puppin et al.27 suggested that with the usual magnifications (from ×10 to ×40) the boundaries between the black dots and the so-called in-the-hole brown globules (i.e., those located in the upper papillary dermis)11 may be difficult to assess. In our experience difference in image quality between D-ELM and epiluminescence stereomicroscopy is not significant except for black dots. Accurate recognition of their morphology on D-ELM requires image enhancement systems.8
Two apparently well-defined variables proposed for melanocytic lesions in the study of Kenet et al.8 have been observed in a negligible number of cases. We refer to gray-blue areas and depigmentation. As defined by Kenet et al., depigmentation (a region that has less pigmentation than background skin) differs from hypopigmentation (a region with less pigmentation than the overall degree of lesion pigmentation). Although we have classified depigmentation among presence/absence descriptors, our problems with the identification of this feature are probably related to the observed unreliability of subjective evaluation of chromatic characteristics of lesions.

The digital ELM equipment was donated by the Fondazione Cassa di Risparmio, Ravenna.

Table II. Distribution of K values by type of descriptor

                                                    K
Type of descriptor          No. of descriptors      Range        Median     p Value*
Presence/absence†           19                      0.39-1.00    0.77       Referent
Distribution‡               11                      0.10-0.79    0.47       0.0007
Width, thickness, size§     6                       0.06-0.83    0.39       0.0075
Pigmentation‖               3                       0.13-0.23    0.21       0.47

Descriptor Nos. 32, 34, 40, 42, and 44 in Table I were excluded from this table.
*Two-tailed Wilcoxon rank-sum test for the difference in distribution of the K values between the reference type and each of the others.
†All present/absent descriptors listed in Table I.
‡Descriptor Nos. 3, 6, 11, 13, 15, 22, 26, 27, 30, 36, and 37 in Table I (uniformity, symmetry, regularity, and location).
§Descriptor Nos. 4, 5, 7, 8, 23, and 28 in Table I.
‖Descriptor Nos. 2, 10, and 16 in Table I (degree and homogeneity of pigmentation).
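The subgroup comparison summarized in Table II (a two-tailed Wilcoxon rank-sum test on the K values of two types of descriptor) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure actually run for the study: it uses the large-sample normal approximation with a correction for ties and no continuity correction, whereas the article does not say whether an exact or an approximate test was used. The function names are hypothetical, and the two lists of K values are read from Table I as extracted, so they should be treated as illustrative input rather than as verified data.

```python
import math


def average_ranks(values):
    """Return mid-ranks (ties share the average rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def rank_sum_test(x, y):
    """Two-tailed Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (z, p); z > 0 means the values in x tend to be larger than in y.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, ties = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean_w = n1 * (n + 1) / 2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail area
    return z, p


# K values as read from Table I (illustrative; see the caveat above).
presence_absence = [1.00, 0.39, 0.78, 0.72, 0.69, 0.69, 1.00, 0.91, 0.67, 0.67,
                    0.65, 0.55, 0.77, 0.88, 0.77, 0.80, 0.78, 0.78, 0.92]
distribution = [0.16, 0.10, 0.41, 0.47, 0.53, 0.41, 0.79, 0.64, 0.71, 0.45, 0.51]

z, p = rank_sum_test(presence_absence, distribution)
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

Because of the approximation, the p value printed by this sketch need not coincide with the one reported in Table II; the point is only to show how K values from two types of descriptor are pooled, ranked, and compared.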
REFERENCES
1. Pehamberger H, Binder M, Steiner A, et al. Early recognition and prognostic markers of melanoma. Melanoma Res 1993;3:279-84.
2. MacKie RM. An aid to the preoperative assessment of pigmented lesions of the skin. Br J Dermatol 1971;85:232-8.
3. Fritsch P, Pechlaner R. Differentiation of benign from malignant melanocytic lesions using incident light microscopy. In: Ackerman AB, ed. Pathology of malignant melanoma. New York: Masson, 1981:301-12.
4. Pehamberger H, Steiner A, Wolff K. In vivo epiluminescence microscopy of pigmented skin lesions. I. Pattern analysis of pigmented skin lesions. J AM ACAD DERMATOL 1987;17:571-83.
5. Steiner A, Pehamberger H, Wolff K. In vivo epiluminescence microscopy of pigmented skin lesions. II. Diagnosis of small pigmented skin lesions and early detection of malignant melanoma. J AM ACAD DERMATOL 1987;17:584-91.
6. Soyer HP, Smolle J, Hodl S, et al. Surface microscopy: a new approach to the diagnosis of cutaneous pigmented tumors. Am J Dermatopathol 1989;11:1-10.
7. Skin surface microscopy: Anything new under the sun? [Editorial]. Lancet 1989;1:1239.
8. Kenet RO, Kang S, Kenet BJ, et al. Clinical diagnosis of pigmented lesions using digital epiluminescence microscopy. Arch Dermatol 1993;129:157-74.
9. Nachbar F, Stolz W, Merkle T, et al. The ABCD rule of dermatoscopy. J AM ACAD DERMATOL 1994;30:551-9.
10. Yadav S, Vossaert KA, Kopf AW, et al. Histopathologic correlates of structures seen on dermatoscopy (epiluminescence microscopy). Am J Dermatopathol 1993;15:297-305.
11. Bahmer FA, Fritsch P, Kreusch J, et al. Terminology in surface microscopy. J AM ACAD DERMATOL 1990;23:1159-62.
12. Perednia DA. What dermatologists should know about digital imaging. J AM ACAD DERMATOL 1991;25:89-108.
13. Stone JL, Peterson RL, Wolf JE Jr. Digital imaging techniques in dermatology. J AM ACAD DERMATOL 1990;23:913-7.
14. Sober AJ. Digital epiluminescence microscopy in the evaluation of pigmented lesions: a brief review. Semin Surg Oncol 1993;9:198-201.
15. Pehamberger H, Binder M, Steiner A, et al. In vivo epiluminescence microscopy: improvement of early diagnosis of melanoma. J Invest Dermatol 1993;100(suppl):356S-62S.
16. Steiner A, Binder M, Schemper M, et al. Statistical evaluation of epiluminescence microscopy criteria for melanocytic pigmented skin lesions. J AM ACAD DERMATOL 1993;29:581-8.
17. Soyer HP, Kerl H. Microscopie de surface des tumeurs cutanées pigmentées. Ann Dermatol Venereol 1993;120:15-20.
18. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley & Sons, 1981:212-36.
19. Hall P. Clinical diagnosis of melanoma. In: Kirkham N, Cotton DWK, Lallemand RC, et al, eds. Diagnosis and management of melanoma in clinical practice. London: Springer-Verlag, 1992:35-51.
20. National Cancer Institute Workshop: the 1988 Bethesda System for reporting cervical/vaginal cytological diagnoses. JAMA 1989;262:931-4.
21. Measuring melanomas [Editorial]. Lancet 1991;338:351-2.
22. Robertson AJ, Anderson JM, Swanson Beck J, et al. Observer variability in histopathological reporting of cervical biopsy specimens. J Clin Pathol 1989;42:231-8.
23. Saftlas AF, Szklo M. Mammographic parenchymal patterns and breast cancer risk. Epidemiol Rev 1987;9:146-74.
24. Kreusch J, Rassner G. Strukturanalyse melanozytischer Pigmentmale durch Auflichtmikroskopie: Uebersicht und eigene Erfahrungen. Hautarzt 1990;41:27-33.
25. Kreusch J, Rassner G. Standardisierte Auflichtmikroskopie: Unterscheidung melanozytischer und nichtmelanozytischer Pigmentmale. Hautarzt 1991;42:77-83.
26. Pabish S. Lesioni pigmentate della cute: valutazione dermatoscopica. Milan, Italy: University of Milan, 1992. Thesis.
27. Puppin D Jr, Salomon D, Saurat J-H. Amplified surface microscopy. J AM ACAD DERMATOL 1993;28:923-7.