Interobserver and intraobserver variability in the histological dating of the endometrium in fertile and infertile women


FERTILITY AND STERILITY®, VOL. 82, NO. 5, NOVEMBER 2004. Copyright ©2004 American Society for Reproductive Medicine. Published by Elsevier Inc. Printed on acid-free paper in U.S.A.

Evan R. Myers, M.D., M.P.H.,a,b Susan Silva, Ph.D.,b Kurt Barnhart, M.D., M.Sc.E.,c Pamela A. Groben, M.D.,d Mary S. Richardson, M.D.,e Stanley J. Robboy, M.D.,a,f Phyllis Leppert, M.D., Ph.D.,g and Christos Coutifaris, M.D., Ph.D.,c for the NICHD National Cooperative Reproductive Medicine Network

Received December 3, 2003; revised and accepted April 6, 2004.
Supported by National Institutes of Health/National Institute of Child Health and Human Development (NIH/NICHD) grants U01 HD38997 and HD27049.
Reprint requests: Evan R. Myers, M.D., M.P.H., Department of Obstetrics and Gynecology, DUMC 3279, Duke University Medical Center, Durham, North Carolina 27710 (FAX: 919-668-0295; E-mail: [email protected]).
a Department of Obstetrics and Gynecology, Duke University Medical Center.
b Duke Clinical Research Institute, Duke University Medical Center.
c Department of Obstetrics and Gynecology, University of Pennsylvania School of Medicine.
d Department of Pathology, University of North Carolina School of Medicine.
e Department of Pathology, Medical University of South Carolina.
f Department of Pathology, Duke University Medical Center.
g Reproductive Sciences Branch, National Institute of Child Health and Human Development.
0015-0282/04/$30.00 doi:10.1016/j.fertnstert.2004.04.058


NICHD National Cooperative Reproductive Medicine Network: University of Medicine and Dentistry of New Jersey, Newark, New Jersey; Duke University Medical Center, Durham, North Carolina; Duke Clinical Research Institute, Duke University Medical Center, Durham, North Carolina; University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania; Baylor College of Medicine, Houston, Texas; Pennsylvania State University, Hershey, Pennsylvania; University of Colorado, Denver, Colorado; University of Texas-Southwestern Medical Center, Dallas, Texas; University of Alabama, Birmingham, Alabama; Stanford University, Palo Alto, California; Wayne State University, Detroit, Michigan; University of Rochester, Rochester, New York; University of Pittsburgh, Pittsburgh, Pennsylvania; Brigham and Women’s Hospital, Boston, Massachusetts; University of California, Davis, Sacramento, California; University of Louisville, Louisville, Kentucky; The Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania; and the Reproductive Sciences Branch, National Institute of Child Health and Human Development, Bethesda, Maryland.

Objective: To assess effects of biopsy timing and fertility status on inter- and intraobserver variability in dating of the endometrium.
Design: Endometrial biopsy slides randomly selected from a multicenter study testing the utility of biopsy in the diagnosis of infertility were distributed to three gynecologic pathologists, who estimated cycle day using standard criteria. Readers were blinded to the purpose of the study, patient age, fertility status, and timing of biopsy relative to LH surge or next menses.
Setting: Multicenter academic research programs in reproductive medicine.
Patient(s): Eighty-two women with proven fertility, 83 infertile patients.
Intervention(s): Endometrial biopsy during the midluteal (days 21–22) or late luteal (days 26–27) phase.
Main Outcome Measure(s): Intraclass correlation coefficient (ICC), kappa.
Result(s): Overall agreement was excellent (ICC 0.88); addition of readings by local pathologists decreased ICC only slightly. In subgroup analyses, ICCs were lowest for infertile women during the midluteal phase (0.65 vs. 0.71 for fertile women in the midluteal phase, and 0.88–0.90 for both groups in the late luteal phase). Intraobserver reliability was excellent (0.90–0.99). Agreement for diagnoses of “out-of-phase” was only moderate, with kappa values between 0.4 and 0.6.
Conclusion(s): Observer variability in dating the endometrium was greatest in infertile women during the window of implantation. (Fertil Steril® 2004;82:1278–82. ©2004 by American Society for Reproductive Medicine.)
Key Words: Luteal phase defect, infertility, observer variability, endometrial biopsy, histopathology

For more than 50 years, predictable changes in the histological appearance of the endometrium during the postovulatory phase of the menstrual cycle have been used to assess endometrial function (1). Discrepancies between the estimated luteal phase date determined by the histological appearance and the luteal phase date estimated based on either some marker of ovulation or the onset of menses have been used to diagnose abnormalities of luteal function in infertile women (2). Histological specimens for this purpose are usually obtained by endometrial biopsy.

The National Institutes of Health/National Institute of Child Health and Human Development (NIH/NICHD) Reproductive Medicine Network, a cooperative multicenter research network, has been evaluating the usefulness of the endometrial biopsy in the evaluation of infertile patients (3). Because previous studies have suggested a high degree of observer variability in the dating of the endometrium (4–6), we evaluated intra- and interobserver variability on specimens collected as part of this larger study. In addition to measuring overall variability, we evaluated the effect of fertility status and timing of the biopsy on measures of observer variability.

MATERIALS AND METHODS

This study was approved by the Data Safety and Monitoring Committee of the NIH/NICHD Reproductive Medicine Network, and the Duke University Health System Institutional Review Board.

Selection of Slides

Slides were drawn from those collected as part of the NIH/NICHD Reproductive Medicine Network study of the utility of the endometrial biopsy in the evaluation of infertility; details of the protocol are provided elsewhere (3). In this study, women between the ages of 23 and 39 years were classified as “fertile” (defined as having had a spontaneous conception resulting in a live birth within 2 years of the date of biopsy, and not having a history of recurrent pregnancy loss) or “infertile” (defined as 1 year of intercourse without contraception without a conception). Subjects refrained from use of any hormonal contraceptive for at least 1 month before biopsy. Women in each group were stratified into two age groups, <35 years old and ≥35 years old, to ensure that there would not be an imbalance in ages between the fertile and infertile groups. Within each stratum of age and fertility status, women were randomized to endometrial biopsy on either days 21–22 (midluteal) or days 26–27 (late luteal), counting the date of the urinary LH surge as day 14.

Slides were selected from each of these eight groups (stratified by fertility status, age, and timing of biopsy) as follows. At the time this substudy began, 511 slides had been logged into the main study database. Slides were excluded from selection for luteal phase dating if any of the following criteria were met: the slide was damaged; there was no local reading available; there was evidence of proliferative endometrium; or the local reading from the site was outside of cycle days 14–28. Proliferative endometrium was initially excluded because our primary objective was to measure agreement in the quantitative dating of the luteal phase. A total of 437 slides were eligible; these were then sorted into the eight subgroups. Within each midluteal subgroup, eligible slides were ordered by true cycle day, based on the date of the urinary LH surge.
Any cycle day outside the day 21–22 target period was automatically selected to add variation in the possible reading range. Remaining slides for the subgroup were randomly selected from the biopsies obtained in the 21–22 day period. Similarly, for each late luteal subgroup, slides with readings outside of the 26–27 day window were selected, followed by random selection of biopsies obtained on days 26–27. Within each subgroup, adjustments were made to ensure that a maximal number of sites were represented within each subgroup. Nineteen luteal slides were selected for each subgroup, for a total of 152 slides. An additional 16 randomly selected proliferative phase slides from the excluded group (two per subgroup) were also added, for a total of 168 slides.
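The within-subgroup selection rule described above (take every slide whose actual cycle day falls outside the target window, then fill the remainder at random from slides inside the window) can be sketched as follows. This is a minimal illustration only: the function name `select_slides`, the tuple layout, and the fixed seed are assumptions, and the adjustment for site representation mentioned in the text is not modeled.

```python
import random

def select_slides(slides, window, n_target=19, seed=0):
    """Pick n_target slides for one subgroup.

    slides: list of (slide_id, actual_cycle_day) tuples (hypothetical layout).
    window: (low, high) inclusive target window, e.g. (21, 22) or (26, 27).
    """
    low, high = window
    outside = [s for s in slides if not low <= s[1] <= high]
    inside = [s for s in slides if low <= s[1] <= high]

    # Slides outside the target window are taken automatically.
    # Assumption: if there were more than n_target of them, keep the first n_target.
    chosen = outside[:n_target]

    # Fill the remaining places by simple random sampling within the window.
    rng = random.Random(seed)
    n_fill = n_target - len(chosen)
    chosen += rng.sample(inside, min(n_fill, len(inside)))
    return chosen

# Hypothetical subgroup: 30 in-window slides plus 3 out-of-window slides.
example = [(i, 21 if i % 2 else 22) for i in range(30)]
example += [(100, 19), (101, 24), (102, 23)]
picked = select_slides(example, (21, 22))
print(len(picked))
```

Applying the same call to each of the eight subgroups would give the 8 × 19 = 152 luteal slides reported in the text.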

Distribution to Pathologists

The slides were randomly divided into three sets of 56 slides each. A simple Latin square design was used to establish the set distribution. The randomization was performed to generate a balanced representation of luteal and proliferative slides, from within each subgroup, in each of the three sets. A random number generator was used to generate the reading order of slides within each set. After each pathologist had read all three sets of slides, 24 slides, balanced with respect to fertility status, age, and timing of biopsy assignment, were randomly selected to estimate intrarater reliability. These were sent as a fourth slide set to each pathologist.
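One way to picture the distribution scheme is sketched below. The text does not spell out how the Latin square was applied, so this sketch assumes it governed the order in which each pathologist received the three sets; the helper names `cyclic_latin_square` and `split_balanced`, the slide identifiers, and the seed are all hypothetical.

```python
import random

def cyclic_latin_square(labels):
    """Return an n x n Latin square: each label occurs once per row and once per column."""
    n = len(labels)
    return [[labels[(row + col) % n] for col in range(n)] for row in range(n)]

def split_balanced(slides_by_subgroup, n_sets=3, seed=0):
    """Shuffle each subgroup, then deal its slides round-robin into n_sets,
    so every set gets an equal share of every subgroup (when divisible)."""
    rng = random.Random(seed)
    sets = [[] for _ in range(n_sets)]
    for subgroup in sorted(slides_by_subgroup):
        ids = list(slides_by_subgroup[subgroup])
        rng.shuffle(ids)
        for i, slide_id in enumerate(ids):
            sets[i % n_sets].append(slide_id)
    for slide_set in sets:
        rng.shuffle(slide_set)  # random reading order within each set
    return sets

# Hypothetical data: 8 subgroups x 21 slides (19 luteal + 2 proliferative) = 168 slides.
slides_by_subgroup = {g: [f"g{g}-s{i}" for i in range(21)] for g in range(8)}
set_a, set_b, set_c = split_balanced(slides_by_subgroup)

# Rows: pathologists; columns: successive reading rounds (an assumed interpretation).
schedule = cyclic_latin_square(["A", "B", "C"])
print(len(set_a), len(set_b), len(set_c))
print(schedule)
```

With 21 slides per subgroup, the round-robin deal places 7 slides from each subgroup in each set, which reproduces the three sets of 56 slides described above.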

Pathologist Reading

Three university-based pathologists with special interest and expertise in gynecological pathology reviewed the slides. Before receiving the slides, the pathologists met by conference call to review criteria for assigning a date to each slide, based on those of Noyes, Hertig, and Rock (1, 7). Each pathologist was instructed to assign a single cycle day reading to each slide, as well as a range, based only on the histological appearance of the endometrial tissue. The primary analysis was based on the single day reading. The pathologists were unaware of the study design or hypothesis of the primary study, subject age or fertility status, or the date of the subject’s urinary LH surge or next menstrual period. They were also asked to comment on the adequacy of tissue for dating, synchrony between glands and stroma, and presence of endometritis. Readings were sent to the Duke Clinical Research Institute for data entry and analysis.

Data Analysis

The primary outcome measure was the cycle day reading for each pathologist. Intraclass correlation coefficient (ICC) methods were used to evaluate both inter- and intrarater readings for the entire set of slides. The ICC provides a measure of reliability or agreement for continuous measurements and is appropriate when assessing agreement between two or more raters. It is calculated using a procedure similar to, but modified from, that used to calculate Pearson’s correlation coefficient. The ICC procedures generate a coefficient that has a range of 0–1, with 1 representing perfect agreement. For this reliability analysis, we used the INTRACC macro developed for SAS by Hamer (8) to perform the ICC procedures described by Shrout and Fleiss (9). The Shrout–Fleiss reliability random set method, which assumes that the same set of pathologists rated all of the slides and that the study raters represent a random subset of all possible raters, was used to calculate the ICC values presented in this article.

A priori sample size calculations estimated that a total of 167 readings by each of 3 pathologists (a total of 501 readings) would allow detection of an ICC of 0.7, with a minimum value of 0.6, at an alpha of 0.05 and a beta of 0.2. An ICC less than 0.6 would represent substantial disagreement, and suggest that interobserver variability in interpretation of cycle day substantially affected the clinical utility of endometrial dating.

Secondary analyses included subgroup analyses of inter- and intrarater ICC based on fertility status and timing in the luteal phase, and an estimation of the effect of including readings from “nonexpert” pathologists (i.e., the original readings at each site) on the ICC, performed by adding the original readings for each slide to the readings by the “expert” pathologists, and recalculating ICCs. In addition, kappa statistics using the SAS MAGREE macro were calculated for the diagnosis of “out-of-phase,” based on the standard criterion of a histological reading more than 2 days earlier than the actual cycle day. Unlike ICC methods, the kappa statistic is most appropriate for measuring categorical responses (10, 11).

For the purposes of the primary analysis, the date of the urinary LH surge in each subject was assumed to be cycle day 14. The actual cycle day was calculated by subtracting the date of the LH surge from the date the biopsy was obtained, and adding 14. Analyses were repeated using alternate dating methods, including assigning the date of the urinary LH surge as day 13 (assuming ovulation occurs 24 hours after the LH surge), assigning the date of the next menstrual period as day 28 (usual clinical practice), and assigning the date of the next menstrual period as day 29.
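The two calculations described in this section can be illustrated with a short sketch. It assumes that the Shrout–Fleiss "random set" method corresponds to the two-way random-effects, single-rater coefficient usually written ICC(2,1), and that every slide has a reading from every rater; it is not the INTRACC macro itself. The function names and the example dates are hypothetical. The rating matrix is the six-target, four-judge example from Shrout and Fleiss (9).

```python
from datetime import date

def icc_2_1(ratings):
    """ICC(2,1) from a complete n x k table (rows: slides, columns: raters)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-slides mean square
    jms = ss_cols / (k - 1)               # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

def actual_cycle_day(biopsy_date, lh_surge_date, surge_day=14):
    """Days from LH surge to biopsy, with the surge counted as cycle day 14."""
    return (biopsy_date - lh_surge_date).days + surge_day

shrout_fleiss_example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(shrout_fleiss_example), 2))

# Hypothetical dates: a biopsy taken 7 days after the LH surge.
print(actual_cycle_day(date(2001, 3, 8), date(2001, 3, 1)))
```

Passing `surge_day=13` reproduces the alternate dating convention in which the LH surge is assigned to day 13.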

RESULTS

All three pathologists provided single day readings for 145 of the slides, two provided readings for 18 slides, one for 2 slides, and none were able to provide a date or diagnosis for 3 slides. Notably, there was no disagreement on the diagnosis of proliferative endometrium. For the cycle day reading, the ICC for the three pathologists was 0.88, indicating excellent agreement in assigning a cycle day. The addition of the reading from the local pathologist at each site led to a slight decrease in ICC to 0.82.

TABLE 1. Interrater reliability as measured by intraclass correlation coefficient (ICC) by group.

                                  Central pathologists (k = 3)a    Central + local pathologists (k = 4)a
Analysis group                    No. of slides    ICC             No. of slides    ICC
Fertile women                     82               0.89            82               0.82
Fertile - luteal slides only      77               0.87            77               0.80
Infertile women                   83               0.88            82               0.82
Infertile - luteal slides only    77               0.86            78               0.80
Mid biopsy (days 21–22)           79               0.68            80               0.59
Late biopsy (days 26–27)          75               0.89            75               0.79
Mid - fertile                     38               0.71            38               0.65
Mid - infertile                   41               0.65            42               0.55
Late - fertile                    39               0.88            39               0.77
Late - infertile                  36               0.90            36               0.83

a k = number of observers per slide.

Myers. Observer variability in endometrial dating.

Overall ICCs did not substantially differ between fertile and infertile women (Table 1). However, ICCs were lower for biopsies obtained in the midluteal phase compared to those obtained in the late luteal phase, with ICCs lowest for biopsies obtained in the midluteal phase from infertile women. In all of these subgroups, addition of the readings from the site pathologists did not substantially lower the ICC.

Intrarater reliability among the three pathologists was excellent, with ICCs of 0.90, 0.91, and 0.99. Table 2 illustrates the kappa values for the traditional definition of “out-of-phase,” using both the three main pathologists and adding the readings of the site pathologists. Kappa values are between 0.4 and 0.6 in most cases, representing “moderate” agreement (11), with the best agreement seen in late luteal fertile women. Substantial decreases in measured agreement are not seen when the local pathologist’s readings are added. Use of alternate methods for calculating luteal phase day (counting the day of the urinary LH surge as day 13, or dating based on the onset of subsequent menses), or including proliferative endometrium as “out-of-phase,” did not substantially alter the results.

TABLE 2. Kappa values by subgroup for the diagnosis of out-of-phase.a

                                             Kappa
Analysis group              No. of slides    Central pathologists (k = 3)b    Central + local pathologists (k = 4)b
All slides                  145              0.54                             0.51
Fertile women               73               0.56                             0.53
Infertile women             72               0.54                             0.50
Mid biopsy (days 21–22)     67               0.40                             0.39
Late biopsy (days 26–27)    67               0.56                             0.50
Mid - fertile               34               0.37                             0.39
Mid - infertile             33               0.41                             0.39
Late - fertile              34               0.67                             0.59
Late - infertile            33               0.45                             0.41

a Defined as histological reading more than 2 days earlier than actual calculated cycle day.
b Number of observers per slide.
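For readers unfamiliar with multi-rater kappa, the sketch below shows a Fleiss-type kappa for a fixed number of raters per slide. It assumes that this is the kind of statistic reported in Table 2; the exact estimator and options used by the MAGREE macro are not described in the text. The function name and the example counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss-type kappa for a fixed number of raters per slide.

    counts: one row per slide; each row holds the number of raters who chose
    each category, e.g. [in_phase_count, out_of_phase_count].
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_slides = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Overall proportion of ratings falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_slides * n_raters)
        for j in range(n_categories)
    ]
    # Per-slide agreement: proportion of rater pairs that agree.
    p_slide = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_slide) / n_slides
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: 4 slides, 3 raters, categories [in-phase, out-of-phase].
example_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(round(fleiss_kappa(example_counts), 2))
```

In this toy example two of the four slides have a 2-to-1 split among raters, which is enough to pull kappa well below 1 even though the raters are unanimous on the other two slides.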

DISCUSSION

We found that overall intra- and interobserver reliability for pathologists assigning a date to a histological specimen was quite good when measured by the ICC. These findings were consistent even when readings by multiple local pathologists were included.

Previous studies have reported agreement based on the absolute difference between readings (4, 6, 12, 13). The advantage of the use of ICC and kappa as measures of variability in this setting is that these methods measure chance-corrected proportional agreement.

Although reliability in assignment of a date as measured by ICC was good, agreement about the diagnosis of “out-of-phase” based on these readings as measured by the kappa statistic was only moderate. Although this may seem counterintuitive, the findings are not inconsistent. Disagreement of 1 or 2 days in assigning dates represents good agreement in the rating of a continuous variable; however, a disagreement of 1 or 2 days could easily result in differences in a dichotomous classification of “in-phase” or “out-of-phase” when the classification is based on a 2-day difference between the histological date and the calculated luteal phase date. Therefore, 1- or 2-day differences between pathologists represent excellent agreement statistically, but, when translated into a dichotomous clinical diagnosis, this relatively small disagreement results in substantial diagnostic variability. This confirms previous studies, which, based on alternate methods of assessing variability, estimated that 20%–40% of biopsies would have a different classification of “in-phase” or “out-of-phase” if read by a different pathologist (6, 12). This study verifies these previous findings in a much larger sample characterized by age, timing of the biopsy, and fertility status.

We did find that variability in dating was greater during the midluteal phase than during the late luteal phase, and that the degree of variability was greatest in infertile women.
Given that postovulatory days 21–22 represent the “window of implantation,” this increased variability in interpretation of histological appearance has interesting implications. Previous studies, which have not demonstrated an association between alternative markers of endometrial maturation and histological dating during the midluteal phase, may have failed to do so because of uncertainty in the histological dating during this period (13, 14). Variation in the histological appearance of the endometrium during the window of implantation may point to functional differences related to fertility, which may be more readily identified by more precise molecular tests.

In a large sample of endometrial biopsy slides, we found good statistical intra- and interobserver reliability in dating of the endometrium. This overall good reliability did not translate into high reliability for the diagnosis of “out-of-phase.” When taken with the results of the main study (3), these results provide further evidence that current histological methods for examining the endometrium in women presenting for infertility evaluation are not clinically useful. We found that overall consistency between observers did not differ between fertile and infertile women, but was worse for biopsies scheduled during the midluteal phase, especially for infertile women, compared to the late luteal phase. This finding suggests that further investigation of the endometrium during the window of implantation may ultimately yield insights into the mechanism of implantation, which will prove clinically useful.
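The Discussion's argument that a 1-day disagreement between readers can flip the dichotomous diagnosis can be made concrete with a small worked example. The function name and the readings below are hypothetical; the criterion is the one stated in the Methods (a histological reading more than 2 days earlier than the actual cycle day).

```python
def is_out_of_phase(histologic_day, actual_day, lag_days=2):
    """True when the histological reading lags the actual cycle day by more than lag_days."""
    return (actual_day - histologic_day) > lag_days

# Hypothetical slide biopsied on actual cycle day 22, read by two pathologists.
actual_day = 22
reader_a = 20  # lags by 2 days: still "in-phase"
reader_b = 19  # lags by 3 days: "out-of-phase"

print(abs(reader_a - reader_b))               # 1
print(is_out_of_phase(reader_a, actual_day))  # False
print(is_out_of_phase(reader_b, actual_day))  # True
```

The two readings differ by a single day, which barely affects a continuous agreement measure such as the ICC, yet they fall on opposite sides of the diagnostic threshold.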

Acknowledgments: Other investigators for this study of the NIH/NICHD Reproductive Medicine Network were: Baylor College of Medicine, Houston, TX: Sandra A. Carson, M.D., John Buster, M.D.; Brigham & Women’s Hospital, Boston, MA: Joseph A. Hill, M.D., Ph.D., Elizabeth Ginsburg, M.D.; Pennsylvania State University School of Medicine, Hershey, PA: Richard Legro, M.D.; Stanford University, Palo Alto, CA: Linda Guidice, M.D., Ph.D.; University of Alabama, Birmingham School of Medicine, Birmingham, AL: Michael P. Steinkampf, M.D.; University of Colorado School of Medicine, Denver, CO: Marcelle Cedars, M.D., William Schlaff, M.D.; University of Louisville School of Medicine, Louisville, KY: Steven T. Nakajima, M.D.; University of Medicine and Dentistry of New Jersey, Newark, NJ: Peter McGovern, M.D., Gershon Weiss, M.D.; University of Pennsylvania School of Medicine, Philadelphia, PA: Luigi Mastroianni, M.D.; University of Pittsburgh/Magee Women’s Hospital, Pittsburgh, PA: Judith Albert, M.D.; University of Rochester School of Medicine, Rochester, NY: David S. Guzick, M.D., Ph.D.; University of Texas-Southwestern School of Medicine, Dallas, TX: Bruce Carr, M.D.; Wayne State University, Detroit, MI: Michael Diamond, M.D., Richard Leach, M.D.; National Institute of Child Health and Human Development, Bethesda, MD: Donna Vogel, M.D., Ph.D.

References

1. Noyes RW, Hertig AT, Rock J. Dating the endometrial biopsy. Fertil Steril 1950;1:3–25.
2. Ginsburg KA. Luteal phase defect. Etiology, diagnosis, and management. Endocrinol Metab Clin North Am 1992;21:85–104.
3. Coutifaris C, Myers ER, Guzick DS, Diamond MP, Carson SA, Legro RS, et al., for the National Cooperative Reproductive Medicine Network. Histological dating of timed endometrial biopsy tissue is not related to fertility status. Fertil Steril 2004;82:1264–72.
4. Scott RT, Snyder RR, Bagnall JW, Reed KD, Adair CF, Hensley SD. Evaluation of the impact of intraobserver variability on endometrial dating and the diagnosis of luteal phase defects. Fertil Steril 1993;60:652–7.
5. Duggan MA, Brashert P, Ostor A, Scurry J, Billson V, Kneafsey P, et al. The accuracy and interobserver reproducibility of endometrial dating. Pathology 2001;33:292–7.


6. Smith S, Hosid S, Scott L. Endometrial biopsy dating. Interobserver variation and its impact on clinical practice. J Reprod Med 1995;40:1–3.
7. Anderson MC, Robboy SJ. The normal endometrium. In: Robboy SJ, Anderson M, Russell P, eds. Pathology of the female genital tract. London: Churchill Livingstone, 2002:241–66.
8. Hamer RM. INTRACC.SAS. 90. Cary, NC: SAS Institute, 2002.
9. Shrout PE, Fleiss JL. Intra-class correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8.
10. Fleiss JL. Statistical methods for rates and proportions. New York: John Wiley & Sons, 1981.


11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
12. Gibson M, Badger GJ, Byrn F, Lee KR, Korson R, Trainer TD. Error in histologic dating of secretory endometrium: variance component analysis. Fertil Steril 1991;56:242–7.
13. Acosta AA, Elberger L, Borghi M, Calamera JC, Chemes H, Doncel GF, et al. Endometrial dating and determination of the window of implantation in healthy fertile women. Fertil Steril 2000;73:788–98.
14. Lessey BA, Castelbaum AJ, Wolf L, Greene W, Paulson M, Meyer WR, et al. Use of integrins to date the endometrium. Fertil Steril 2000;73:779–87.
