FERTILITY AND STERILITY
Vol. 60, No. 4, October 1993
Copyright © 1993 The American Fertility Society
Printed on acid-free paper in U.S.A.
Evaluation of the impact of intraobserver variability on endometrial dating and the diagnosis of luteal phase defects*
Richard T. Scott, M.D.,†‡ Russell R. Snyder, M.D.,‡§ James W. Bagnall, M.D.,§‖
Kurt D. Reed, M.D.,§¶ Carol F. Adair, M.D.,§** Samuel D. Hensley, M.D.§††
Wilford Hall Medical Center, Lackland Air Force Base, Texas
Objective: To determine the magnitude of intraobserver variation in dating endometrial biopsies and its impact on clinical management.
Design: Blinded histopathologic interpretation of endometrial biopsy specimens 1 year apart by five pathologists.
Setting: Large military tertiary care center.
Patients: Endometrial biopsy specimens from 51 patients undergoing evaluation for potential luteal phase defects.
Interventions: None.
Main Outcome Measures: Calculation of the magnitude of the individual and overall intraobserver variation in endometrial dating for the five pathologists and estimation of its potential impact on clinical management.
Results: The intraobserver variation was 0.69 ± 0.05 days (mean ± SE). There was no significant difference in the magnitude of the variation for 1-day or 2-day dating ranges. The theoretical probability of altering clinical management by having the same pathologist redate a given specimen ranged from 15% to 28%.
Conclusion: Histologic dating of endometrial biopsies is subject to a small but highly clinically significant intraobserver variability that may have a major impact on clinical management. Fertil Steril 1993;60:652-7.
Key Words: Luteal phase defect, endometrium, histology
Luteal phase defects (LPDs) are recognized causes of infertility and are found in approximately
Received March 23, 1993; revised and accepted June 4, 1993.
* The views expressed in this article are those of the authors and do not reflect the official policy of the Department of Defense or other Departments of the U.S. Government.
† Division of Reproductive Endocrinology, Department of Obstetrics and Gynecology.
‡ Reprint requests: Richard T. Scott, M.D., Department of Obstetrics and Gynecology, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, Maryland 20814.
§ Department of Pathology.
‖ Deceased.
¶ Present address: Marshfield Laboratories, Marshfield, Wisconsin.
** Present address: Department of Pathology, Walter Reed Medical Center, Washington, D.C.
†† Present address: Department of Pathology, Mississippi Baptist Medical Center, Jackson, Mississippi.
Scott et al.
Variability in endometrial dating
10% of infertile couples (1). Luteal phase defects have been attributed to deficiencies in either progesterone (P) production by the corpus luteum or the ensuing endometrial secretory response (2, 3). Almost since the time of Jones' (4) original description of the disorder in 1949, controversy has existed regarding the appropriate means of diagnosis. Although proposed diagnostic techniques have ranged from simple methods such as basal body temperature charts at one extreme to the evaluation of subtle and complex cellular development at the other, there is no consensus regarding the optimal diagnostic evaluation (1-3, 5-8). Timed endometrial biopsies with histologic dating using the criteria of Noyes et al. (9) remain the most widely described technique and as such are the gold standard to which other techniques are usually compared. In fact, this reference is the most widely quoted in the infertility
literature (10). It is also the standard used in a substantial amount of the literature evaluating the efficacy of various treatment regimens (1, 11, 12).
Recently, the validity of this gold standard has come under scrutiny. In an elegant study of the natural variability in endometrial secretory development over multiple cycles, Davis et al. (13) found that a substantial number of normal women with proven fertility had delayed secretory transformation of a magnitude that would have caused the biopsy to be considered out of phase. Interestingly, the incidence of out of phase biopsies was equivalent in this normal population and in the patients undergoing an infertility evaluation in the same center. These data raise difficult questions regarding the validity of endometrial dating as a diagnostic tool.
The original evaluations of the accuracy and consistency of endometrial dating were published in a series of articles by Noyes and his colleagues in the 1950s (14-16). They evaluated a number of parameters such as the uniformity of secretory transformation throughout a single biopsy specimen, uniformity over the entire endometrial surface, and the magnitude of the interobserver variability. Although these articles clearly quantified the variation attributable to these parameters, no evaluation of their potential clinical impact was published.
One common means of characterizing the reproducibility of various histologic diagnoses is to calculate the interobserver and intraobserver variation among those individuals actually reviewing the histopathologic specimens. These calculations may then be used to estimate the impact of observer variability on clinical management. This study, which looks specifically at the impact of intraobserver variability, is a follow-up to our previously published evaluation of the clinical impact of interobserver variation in dating endometrial biopsies in infertility patients.
That study revealed that even though the actual magnitude of the variation among the pathologists was acceptably small, the impact on clinical management attributable to interobserver variability was quite large (17). In fact, interobserver variation alone was shown to have the potential to affect the diagnosis of LPD in as many as 39% of the patients. To date, there have been no published evaluations of the intraobserver variability in endometrial dating. The specific purpose of this study was to determine the magnitude of the intraobserver variability in histologic dating of the
endometrium and its potential clinical impact on the diagnosis and management of LPDs.
MATERIALS AND METHODS
Experimental Design
Histologic slides from 51 separate endometrial biopsy specimens obtained over a 1-year period were selected for use in this study. All 51 specimens had been obtained as part of a routine infertility evaluation and had been assigned a histologic date as part of a routine histopathologic assessment by different members of the pathology department. The original assessments were taken as evidence that the specimen would be datable but were in no way used for any of the calculations made during the study. Based on this original dating and the patients' next menses, 27 of the endometrial biopsies had been characterized as out of phase, whereas the remaining 24 were in phase.
To avoid any artifact introduced by aging of the original slides, new sections were taken from the original paraffin-embedded blocks and stained with hematoxylin and eosin. The slides were assigned a study number, randomized in order, and then given in the same order to each of five pathologists. No clinical information (including the timing of the patients' subsequent menses) was provided to the pathologists. Each pathologist dated the specimens according to the criteria of Noyes et al. (9) and recorded both a single-day reading and a 2-day reading.
After a minimum time interval of ~1 year, the specimens were randomized to a new order and then given in turn again to each of the five pathologists. The pathologists redated each of the specimens using the same criteria and recorded the results in the same fashion as for the first interpretation. The pathologists were blinded to the results of their first readings.
Statistical Evaluation
The intraobserver variation in dating was calculated for each slide individually. This variation was defined as the difference in days between the first and second interpretation. For example, ifthe originaIl-day reading was day 25 and the second reading was day 27, then the variation was 2 days. Similarly, ifthe original reading was day 24 to 25 and the second reading was day 25 to 26, then the variation was 1 day. The actual intraobserver variation was calculated as the mean of these individually calculated variances. Scott et al.
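The per-slide calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code: the function names are hypothetical, the paired readings are invented, and representing a 2-day reading by its midpoint is an assumption chosen because it reproduces both worked examples in the text.

```python
from math import sqrt
from statistics import mean, stdev


def midpoint(reading):
    """Central day of a reading: 25 -> 25.0, (24, 25) -> 24.5 (assumed convention)."""
    if isinstance(reading, tuple):
        return sum(reading) / len(reading)
    return float(reading)


def variation(first, second):
    """Absolute difference in days between two readings of the same slide."""
    return abs(midpoint(first) - midpoint(second))


def mean_and_se(values):
    """Mean and standard error (sample SD divided by the square root of n)."""
    return mean(values), stdev(values) / sqrt(len(values))


# The two worked examples from the text.
print(variation(25, 27))              # single-day readings
print(variation((24, 25), (25, 26)))  # 2-day readings

# Hypothetical first/second readings for one pathologist (not study data).
pairs = [(22, 22), (25, 27), (24, 25), (20, 19)]
per_slide = [variation(a, b) for a, b in pairs]
print(mean_and_se(per_slide))
```

Applied to all 51 slides for one pathologist, the same two functions would yield a mean ± SE of the kind reported for each observer.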
The mean intraobserver variation was calculated for each of the five pathologists individually and for the group as a whole for both the 1- and 2-day readings. An analysis of variance (ANOVA) was used to determine if any of the pathologists had significantly greater or lesser variability than any of the other pathologists. A paired t-test was used to determine if the magnitude of the variation was different for the 1-day or the 2-day readings.
The clinical impact of the intraobserver variability in dating was calculated by determining whether each specimen would have been diagnosed as being in or out of phase for each of the readings. Biopsies were considered out of phase if the difference between the histologically assigned date and the cycle date, based on the timing of the patient's next menses, was >2 days. After each of the histopathologic assessments had been classified as either in phase or out of phase, the consistency in the clinical diagnosis for each specimen for each pathologist was determined. In brief, each pathologist's two readings of the same specimen were compared to determine whether they agreed (both in phase or both out of phase) or disagreed (1 in phase and 1 out of phase). The subsequent predicted alteration in clinical management was then calculated based on the probability of a disagreement in the calculation of in or out of phase. The incidence of altered clinical management was calculated for the group as a whole, as well as for each of the five pathologists individually. Contingency table analysis was used to determine if the readings of any single pathologist were more or less likely to alter management than those of the other four pathologists.
The final analysis involved comparison of the magnitudes of the interobserver variation determined in our prior publication and the intraobserver variation found in this study. The same five pathologists participated in both studies.
This analysis was accomplished using a paired t-test. Each pathologist's mean intraobserver variation during this study was compared with the mean interobserver variation that he or she had in the prior study. All data are expressed as means with associated SEs, and an α error of <0.05 was considered significant. Corrections for multiple comparisons were used where appropriate.
RESULTS
All five pathologists dated all 51 of the endometrial biopsy specimens and recorded single and 2-day readings. The magnitude of the intraobserver variation ranged from 0 to 3 days. Evaluation of the distribution of these variations revealed that 0- and 1-day variations were most common and were equivalent. The incidence of 2 days of variation was significantly less common than either 0- or 1-day variation, and 3 days of variation was significantly less common than 0-, 1-, or 2-day variations (P < 0.04) (Fig. 1).
Figure 1. The magnitude of intraobserver variation in dating endometrial biopsies as defined by the difference in assigned histologic dates by a pathologist who was reading the same slide on two occasions separated by ~1 year. The magnitude of the variation was between 0 and 3 days, with 0 and 1 days of variation being most common, 2 days being less common, and 3 days of variation being least common. [Bar chart not reproduced. Horizontal axis: Intra-Observer Variation (days), 0 to 3; vertical axis: count, scaled 0 to 140. Annotation: 0 and 1 > 2 > 3 days; P < 0.04.]
The mean intraobserver variations for the 1-day readings for the five individual pathologists ranged from 0.63 ± 0.08 to 0.75 ± 0.10 days (Table 1). The overall intraobserver variation was 0.69 ± 0.05 days. An ANOVA of the differences in the individual readings revealed that there was no difference in the magnitude of the variation among the five pathologists. Assigned dates for the specimens ranged from day 18 to day 27. There was no difference in the magnitude of the intraobserver variation for any of the assigned dates. This is consistent with there being no difference in the reproducibility of dating for any portion of the luteal phase as compared with any other.
A similar evaluation of the 2-day readings revealed that the mean intraobserver variation was 0.64 ± 0.04 days. The intraobserver variation for the five individual pathologists for the 2-day readings ranged from 0.59 ± 0.10 days to 0.70 ± 0.12 days. Similar to the single-day readings, there were no significant differences in the magnitude of the intraobserver variation among the five pathologists. A paired t-test comparing the variation for the 1- and 2-day readings revealed no differences. Thus, there was no improvement in the reproducibility of the dating by using a 2-day range.
Table 1. Magnitude and Clinical Impact of Intraobserver Variation in Dating Endometrial Biopsies
Pathologist    Variation (d)*    Altered phase calculation
1              0.71 ± 0.12       9/51
2              0.67 ± 0.10       6/51
3              0.75 ± 0.10       4/51
4              0.69 ± 0.11       12/51
5              0.63 ± 0.08       8/51
Overall        0.69 ± 0.05       39/255
* Values are means ± SE.
The 1-day reading of each biopsy was used to calculate whether the biopsy would have been interpreted as either in or out of phase. For each individual pathologist's two readings of each individual biopsy, it was determined if the dating assignment would result in a clinical diagnosis of an LPD. The results of both readings on the same patient were then compared to see if they agreed or disagreed as far as the diagnosis of an LPD. Of the 255 total interpretations, a discrepancy in the calculation of in or out of phase existed in 39 (15.2%). Twenty-three of the biopsies had originally been read as being in phase, whereas 16 had been interpreted as being out of phase. The number of discrepancies for each of the individual pathologists ranged from 4 to 12. Contingency table analysis revealed no significant difference in the number of disagreements in calculating the phase of the biopsy for any of the individual pathologists. This indicates that the readings of any single pathologist were no more or less likely to alter management than any of the others.
The next evaluation involved the probability that an alteration in clinical management would have occurred based on having the specimen redated. Because the indications for a second biopsy are based on the results of the first biopsy, there is more than one possible scenario that must be considered when estimating the clinical impact of the intraobserver variation in dating. The first scenario is that the original biopsy was read as in phase. This means that no further biopsies would have been done and that no treatment for LPD would have been initiated. In this scenario, there would be a 15.2% chance that having the same pathologist redate the specimen would change the phase calculation with the biopsy now being interpreted as out of phase.
The patient would then undergo another biopsy to confirm the diagnosis of LPD, which is an alteration in clinical management. The second scenario is that of having the original
assigned date resulting in the biopsy being called out of phase. Again, there would be a 15.2% chance that merely having the specimen redated would alter the calculation of in or out of phase. This would eliminate the need for a second biopsy in these patients, thus altering their management. In the remaining 84.8% of cases, a second biopsy would be performed to confirm the diagnosis and to direct management. Here again, the difference in dating attributable to intraobserver variation could alter clinical management. Irrespective of the initial reading of the second biopsy as in or out of phase, a 15.2% chance exists that having the same pathologist redate the same specimen would result in a change in the phase calculation and thus clinical management. Therefore, management could be altered in 15.2% of the patients having a second biopsy, which represents 12.7% (15.2% of 84.8%) of the patients. Thus, in this scenario, 15.2% of the patients would have their management altered after the first biopsy and an additional 12.7% after the second biopsy for an overall chance of a change in management of 27.9%. It should be emphasized that these calculations are based on the biopsies of the patients in this series. The exact clinical impact would depend on the distribution of initial biopsy results of the population being evaluated.
The final analysis involved comparison of the intraobserver and interobserver variations. A paired t-test found that the magnitude of intraobserver variation was smaller than that of interobserver variation (P < 0.01). This indicates that there was greater consistency in assigning a date to a given histologic specimen when a pathologist redated the specimen than when the slide was evaluated by the other four pathologists.
DISCUSSION
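The management-alteration arithmetic reported in Results can be retraced with a brief Python sketch. It is offered as an illustration under stated assumptions rather than as the authors' calculation: the function name is hypothetical, the discrepancy counts are those listed in Table 1, and the two redatings are treated as independent events with the same probability. Carried through without intermediate rounding, the arithmetic gives values slightly different from the rounded percentages printed in the text (roughly 15.3%, 13.0%, and 28.2%), all of which remain consistent with the stated range of 15% to 28%.

```python
# Discrepancies in phase calculation per pathologist, from Table 1.
DISCREPANCIES = [9, 6, 4, 12, 8]
SPECIMENS_PER_PATHOLOGIST = 51


def management_change_probabilities(discrepancies, n_specimens):
    """Return (p_first, p_second, p_overall) for a two-biopsy evaluation.

    p_first   chance that redating the first biopsy changes its phase
    p_second  chance that the first reading stands but redating the
              second biopsy changes its phase
    p_overall sum of the two, as in the out-of-phase scenario in the text
    """
    readings = n_specimens * len(discrepancies)
    p_first = sum(discrepancies) / readings
    p_second = (1 - p_first) * p_first
    return p_first, p_second, p_first + p_second


p_first, p_second, p_overall = management_change_probabilities(
    DISCREPANCIES, SPECIMENS_PER_PATHOLOGIST
)
print(f"after first biopsy:  {p_first:.1%}")
print(f"after second biopsy: {p_second:.1%}")
print(f"overall:             {p_overall:.1%}")
```

As the text notes, the exact impact in another population would depend on its distribution of initial biopsy results, so the per-reading probability should be treated as specific to this series.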
The magnitude of the intraobserver variation in dating endometrial biopsies was small (0.69 days) and very consistent (range of 0.63 to 0.75 days)
among the five pathologists participating in this study. Although the intraobserver variation was small, as evidenced by the pathologists agreeing within 1 day of the originally assigned date in 86% of the cases, the actual clinical impact in terms of the calculation of in or out of phase was quite large.
There are a number of reasons why such a small intraobserver variation could have such a significant impact on clinical decision making and management. First, dating of endometrial specimens is discrete, i.e., individual days are specified with nothing in between (e.g., there is no day 20.25). This information must also be interpreted in view of the fact that others have reported that up to 30% of all endometrial biopsy specimens obtained from infertile patients are dated as having a delay of either 2 or 3 days (16). The obvious significance of this is that a single-day change in the dating of a specimen (the smallest unit possible) may lead to an absolute change in the calculation of phase.
The second factor explaining the high impact of the intraobserver variability in dating is that the clinical diagnosis of LPDs requires an out of phase biopsy in each of two successive cycles. A change in the calculation of phase for either of the two biopsies will always change management by either altering the indication for a second biopsy (after the first biopsy) or changing the actual diagnosis (after the second biopsy). Thus, a single patient's evaluation may provide two opportunities for this variation in dating to affect management.
This study and our previous data allow a comparison of the clinical impact that interobserver and intraobserver variability have on the diagnosis of LPD when dating endometrial biopsies. That clinical management could be altered in up to 39% of patients solely because of interobserver variation but only in up to 28% of the patients because of intraobserver variation has clinical relevance.
This information suggests that there may be some advantage to having the same individual responsible for dating subsequent endometrial biopsies or all endometrial biopsies performed solely for dating purposes.
Knowledge of how significantly intraobserver and interobserver variation can affect clinical decision making may be very important. Patients with discrepancies between their actual cycle day and the assigned histologic date of 1 to 3 days could easily have their calculation of in or out of phase altered by simply having the identical slide reevaluated. This is true even if the same pathologist dates the slide at different times. This clearly calls into question
our ability to define with reasonable precision the line between in and out of phase biopsies.
That many patients with out of phase biopsies may not actually have clinically significant LPDs is not surprising. Some prior studies have shown that treatment success is most common in the patients with the largest defects (5 or more days) (12). This represents a clinical paradox, with the more severely affected patients experiencing better responses to therapy than patients with small defects. This might be explained if the group of patients with small delays contains a higher proportion of patients without a true LPD, whose apparent defect is a histopathologic one that reflects the inherent variability in histologic dating of the endometrium. Perhaps their defect would have resolved by simply having the histologic specimens reevaluated by the same or another pathologist.
The information in this study raises a number of other questions. If histologic dating of the secretory transformation of the endometrium appears to lack precision and reproducibility, then what would its impact be on the clinical definition of LPDs? This question requires a review of where the definition of a deficiency of 2 or more days in two consecutive biopsies was derived. Unfortunately, this long-accepted gold standard was not defined in terms of clinical endpoints, i.e., no controlled study was done to determine whether patients with defects of >2 days truly have poorer pregnancy rates and pregnancy outcomes than patients with discrepancies of 2 or fewer days (1). Additionally, a recent publication of the results of serial timed biopsies in normal controls revealed that fertile women had out of phase biopsies at the same rate as the infertile population (13). Furthermore, some of these patients conceived during the study interval in cycles in which their biopsies were clearly out of phase.
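The sensitivity of a fixed cutoff to a single-day redating, discussed above, can be illustrated with a short Python sketch. It is a hypothetical illustration, not part of the study analysis: the function names are invented, the lag is taken as the whole-day difference between cycle date and histologic date, and the cutoff of more than 2 days is an assumption taken from the definition used in Materials and Methods.

```python
def is_out_of_phase(lag_days, cutoff=2):
    """A biopsy is called out of phase when the lag exceeds the cutoff (assumed >2 days)."""
    return lag_days > cutoff


def flips_with_one_day_shift(lag_days, cutoff=2):
    """True if redating by one day in either direction changes the phase call."""
    original = is_out_of_phase(lag_days, cutoff)
    return any(
        is_out_of_phase(lag_days + shift, cutoff) != original
        for shift in (-1, 1)
    )


# Lags of 0 to 6 days: which are vulnerable to the smallest possible redating?
vulnerable = [lag for lag in range(7) if flips_with_one_day_shift(lag)]
print(vulnerable)
```

Under these assumptions only lags of 2 and 3 days change classification, which are the delays that reference 16 reports in up to 30% of specimens from infertile patients.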
In summary, the results of this study raise a number of difficult questions regarding the impact of interobserver and intraobserver variability in dating endometrial biopsies on the clinical management of patients undergoing evaluations for possible LPDs. Although the magnitude of the variation in dating was small, the very nature of the definition of LPDs is such that even the smallest possible variation (1 day) may have a dramatic impact on clinical management. This study in no way questions the existence of LPDs as an entity. It indicates a need for controlled studies to evaluate the validity of histologic dating of the endometrium, as well as the other techniques used to define LPDs, to
reproducibly distinguish individual patients with and without clinically significant luteal phase inadequacy.
Acknowledgment. We thank Daniel M. Strickland, M.D., Wilford Hall Medical Center, Lackland AFB, Texas, for his assistance with multiple aspects of this study.
REFERENCES
1. McNeely MJ, Soules MR. The diagnoses of luteal phase deficiency: a critical review. Fertil Steril 1988;50:1-15.
2. Jones GS. The physiology of menstruation and the corpus luteum function. Int J Fertil 1986;31:143-7.
3. Jones GS. The luteal phase defect. Fertil Steril 1976;27:351-6.
4. Jones GES. Some newer aspects of the management of infertility. JAMA 1949;141:1123-9.
5. Tredway DR, Mishell D, Moyer D. Correlation of endometrial dating with luteinizing hormone peak. Am J Obstet Gynecol 1973;117:1030-3.
6. Lundy KE, Lee SG, Levy W. The ovulatory cycle: a histologic, thermal, steroid, and gonadotropin correlation. Obstet Gynecol 1974;44:14-25.
7. Johannisson E, Landgren B, Rohr HP, Diczfalusy E. Endometrial morphology and peripheral hormone levels in women with regular menstrual cycles. Fertil Steril 1987;48:401-8.
8. Olive DL, Thomford PJ, Torres SE, Lambert TS, Rosen GF. Twenty-four-hour progesterone and luteinizing hormone profiles in the midluteal phase of the infertile patient: correlation with other indicators of luteal phase insufficiency. Fertil Steril 1989;51:587-92.
9. Noyes RW, Hertig AT, Rock J. Dating the endometrial biopsy. Fertil Steril 1950;1:3-25.
10. Key JD, Kempers RD. Citation classics: most cited articles from Fertility and Sterility. Fertil Steril 1987;47:910-5.
11. Wentz AC, Herbert CM, Maxson WS, Garner CH. Outcome of progesterone treatment of luteal phase inadequacy. Fertil Steril 1984;41:856-62.
12. Downs KA, Gibson M. Clomiphene citrate therapy for luteal phase defects. Fertil Steril 1983;39:34-8.
13. Davis OK, Berkeley AS, Naus GJ, Cholst IN, Freedman KS. The incidence of luteal phase defect in normal, fertile women, determined by serial endometrial biopsies. Fertil Steril 1989;51:582-6.
14. Noyes RW. Uniformity of secretory endometrium. Obstet Gynecol 1956;7:221-8.
15. Noyes RW. Uniformity of secretory endometrium, study of multiple sections from 100 uteri removed at operation. Fertil Steril 1956;7:103-9.
16. Noyes RW, Haman JO. Accuracy of endometrial dating: correlation of endometrial dating with basal body temperature and menses. Fertil Steril 1953;4:504-17.
17. Scott RT, Snyder RR, Strickland DM, Tyburski CC, Bagnall JA, Reed KD, et al. The effect of interobserver variation in dating endometrial histology on the diagnosis of luteal phase defects. Fertil Steril 1988;50:888-92.