The Reproducibility of the Postcoital Test: A Prospective Study

ISAAC Z. GLATSTEIN, MD, CRAIG L. BEST, MD, MPH, ANGELA PALUMBO, MD, LYNN A. SLEEPER, ScD, ANDREW J. FRIEDMAN, MD, AND MARK D. HORNSTEIN, MD

Objective: To determine the reproducibility of the postcoital test among trained observers.

Methods: Twenty-eight infertile patients presenting to the Brigham and Women's Hospital over a 1-year period were recruited for the study. After a standardized collection of specimens for the postcoital test, four fellowship-trained reproductive endocrinologists evaluated six postcoital test characteristics and gave their overall impression of the test. Each observer was blinded to the patients' identities and clinical histories as well as to the ratings of the other observers. The six characteristics included an assessment of the cervical mucus by ferning, cellularity, spinnbarkeit, and consistency, and of sperm by total count per high power field and percent motility. Scoring was adapted from World Health Organization (WHO) criteria for semen-cervical mucus interaction. Statistical analysis included the kappa statistic to determine agreement among observers for postcoital test characteristics and the Mantel-Haenszel test to determine the association between overall impression and the other test characteristics.

Results: Agreement among the four observers was best for sperm number and motility (39% of cases) and worst for cellularity, spinnbarkeit, and overall test impression (11, 14, and 14% of cases, respectively). The kappa statistic ranged from a low of 0.13 for cellularity, demonstrating poor reliability (95% confidence interval [CI] 0.03-0.23), to a high of 0.51 for sperm number, demonstrating fair reliability (95% CI 0.41-0.60). Only sperm number and percent motility were significantly associated with the overall impression (P < .001).

Conclusions: In a blinded study, the characteristics of the postcoital test were found to have poor to fair reproducibility among trained observers using a standardized WHO scoring system. The observers' overall impressions of test quality correlated with sperm number and motility only. We question the validity of the postcoital test as a diagnostic tool in the evaluation of infertility. (Obstet Gynecol 1995;85:396-400)

From the Division of Reproductive Endocrinology, Department of Obstetrics and Gynecology, Brigham and Women's Hospital, Harvard Medical School, Boston; and The New England Research Institute, Watertown, Massachusetts.
Since its original description in 1866 by J. Marion Sims, 1 the use of the postcoital test has become widespread in the evaluation of infertile couples. Many authorities consider it a cornerstone of the infertility evaluation. 2-5 Although this test is commonly used, it is not without controversy. Its detractors claim that the test lacks diagnostic validity for the prediction of pregnancy. 6-9 In addition, the large body of literature concerning the postcoital test is difficult to interpret because of a lack of standardization pertaining to the test's methodology, timing, and interpretation. 10

Inter-observer reproducibility of a diagnostic test is considered to be a critical element in determining diagnostic usefulness. 11,12 Although other infertility tests (eg, the semen analysis or pelvic ultrasound tests) have been studied for interassay variability, 13,14 to our knowledge, no reports have evaluated the inter-observer variability of the postcoital test. The purpose of this study was to develop a standardized scoring system for this test and to evaluate its reproducibility in a group of four fellowship-trained reproductive endocrinologists in a prospective, blinded fashion. We also sought to determine which of the elements of the postcoital test, if any, were best correlated with the observers' overall impression.
Materials and Methods Patients undergoing a postcoital test ordered by their attending physician for the evaluation of infertility in the fertility and endocrine unit of Brigham and Women's Hospital between March 1993 and March 1994 were eligible for inclusion in the study. The study protocol was approved by the hospital's Institutional Review Board. Patients were given standardized instructions to engage in intercourse 2-12 hours before presenting for the examination, after a 2-3 day interval of abstinence. Patients were also instructed to refrain from bathing or douching after intercourse. Timing of the test was determined by the attending
Obstetrics & Gynecology
physician, based on a review of previous basal body temperature charts or urine or blood LH surge monitoring.

The postcoital test and slide preparation for each subject were performed in an identical, standardized fashion by a study nurse according to World Health Organization (WHO) guidelines. 15 The patient was placed in the lithotomy position, and a water-moistened, nonlubricated speculum was inserted vaginally. With the needle removed, an 18-gauge, flexible Teflon-coated vascular catheter was placed into the posterior fornix, and mucus was aspirated. From this vaginal pool, a slide and coverslip were prepared and labeled appropriately. Using a new, second catheter, we aspirated mucus from within the endocervical canal, and a second slide was prepared and covered with a glass coverslip. A trail of mucus was left uncovered and allowed to air dry for a determination of ferning. Finally, additional mucus was collected from the endocervix and drawn into a 3-mL syringe for subsequent evaluation of the mucus spinnbarkeit and consistency.

After collection of the test specimens, the study nurse examined vaginal pool specimens microscopically for the presence of spermatozoa. If sperm were present, the patient was entered into the study. A postcoital test was excluded from the study if: 1) there were no sperm on inspection of the posterior fornix slide, 2) the patient had intercourse more than 12 hours before the examination, 3) more than 30 minutes had elapsed between the slide preparation and evaluation by the four observers, or 4) the slide exhibited drying artifact.

Each physician was blinded to the patients' identities and clinical histories, as well as to the rating of the other physicians. To evaluate the cervical mucus categories of spinnbarkeit and consistency, the mucus contained in the syringe was placed on a fresh slide and gently lifted with a glass coverslip while all observers were present.
For the remainder of the evaluation, each observer examined the study slide under a standard binocular microscope. Each was allotted up to 5 minutes to view the slide and record an evaluation on a standardized scoring sheet. This form consisted of a determination of four mucus characteristics, sperm density and motility, and the observer's overall impression of the test (Table 1). Each of the individual characteristics was scored on a scale of 0-3, with 0 representing the poorest possible score and 3 the highest. The scoring system developed for this study was adapted from the WHO recommendations for analysis of sperm-cervical mucus interaction. 15 To enable standardization of the scoring among observers and permit data analysis, we modified this WHO system. Specifically, we gave the total sperm per high power field category a score range of 0-3, similar to the scoring of mucus characteristics. A score of 0 was
VOL. 85, NO. 3, MARCH 1995
Table 1. Characteristics of the Postcoital Test Evaluated by Four Study Observers

Characteristic            Kappa statistic*   95% CI
Mucus characteristics
  Ferning                 0.24               0.14-0.34
  Cellularity             0.13               0.03-0.23
  Spinnbarkeit            0.34               0.25-0.42
  Consistency             0.25               0.16-0.34
Sperm characteristics
  Sperm no.               0.51               0.41-0.60
  Percent motility        0.38               0.29-0.48
Overall impression        0.24               0.15-0.33

CI = confidence interval.
* Zero indicates no agreement above that expected by chance; 1 indicates perfect agreement.
assigned if no sperm were seen in a high power field after scanning at least five fields, a score of 1 for one to five sperm per high power field, a score of 2 for six to ten sperm per high power field, and a maximum score of 3 if more than ten sperm were seen per high power field. For the evaluation of sperm motility, each observer was asked to assign a percent score for each of four possible motility categories: "immotile," "shaking in place," "sluggish," and "rapid." A weighted sum was calculated from these raw percentages (which totaled 100%), yielding a final score of 0-3. Finally, a score for overall impression was added as part of the postcoital test evaluation. Aside from these minor modifications to increase standardization and allow data analysis, our system adhered to the WHO guidelines.

To further decrease potential inter-observer variability and ensure that all study physicians were equally familiar with each characteristic of the test, we prominently displayed an instruction set adjacent to the microscope used to evaluate the specimen. These guidelines contained photographs of various cervical mucus ferning patterns and a brief capsule of the standard WHO instructions for each characteristic of the postcoital evaluation.

Statistical analysis included the kappa statistic (κ) to determine inter-observer agreement for individual postcoital test characteristics (κ is a measure of agreement above or below what is expected by chance alone). A commonly accepted interpretation of κ is that a value of 0 indicates no agreement above that expected by chance alone, a value between 0 and 0.4 indicates marginal agreement, 0.4-0.75 indicates fair agreement, and a value greater than 0.75 to a maximum of 1 indicates good agreement. 16 To determine associations between overall impression and other postcoital test characteristics, the Mantel-Haenszel test for linear trend was used. This test was conducted separately for each of the four sets of physician ratings.
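As a minimal sketch of the agreement measure described above, the following Python computes a multi-rater κ from a table of scores, one row per specimen and one score per observer. The source does not state which multi-rater form of κ was used, so Fleiss' formulation is an assumption here, as are the function names and the treatment of values that fall exactly on a boundary of the interpretation bands.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of rows, each row holding one score per rater.

    Assumption: every specimen is scored by the same number of raters
    (four in the study) and scores are categorical labels such as 0-3.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each specimen needs the same number (>= 2) of raters")

    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)

    # Observed agreement: proportion of agreeing rater pairs, averaged over specimens.
    per_subject = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_chance = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_chance) / (1.0 - p_chance)


def interpret_kappa(kappa):
    """Interpretation bands quoted in the text; boundary handling is an assumption."""
    if kappa <= 0:
        return "no agreement above chance"
    if kappa < 0.4:
        return "marginal"
    if kappa <= 0.75:
        return "fair"
    return "good"
```

For example, `fleiss_kappa([[0, 0, 0, 0], [3, 3, 3, 3]])` describes two specimens on which all four observers agree, and `interpret_kappa` maps a computed value onto the bands quoted in the text. The sketch does not compute confidence intervals.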
Statistical significance was
Glatstein et al
Postcoital Test
defined as P < .05. With 28 patients, four raters, and a 0.05 type I error rate, there was a 90% power to detect agreement of at least 70% or less than 30%, where 50% agreement by chance was expected.

Results Twenty-eight patients were eligible for participation in the study. The age range was 22-40 years (mean 32.4). Inter-observer agreement for each of the seven categories of the postcoital test as determined by κ is presented in Table 1. The κ values ranged from a low of 0.13 (95% confidence interval [CI] 0.03-0.23) for cellularity, to a high of 0.51 (95% CI 0.41-0.60) for sperm number. These results demonstrate that except for the determination of sperm number, which had fair agreement, all other test elements had marginal reproducibility among observers.

To further examine the origin of this variability, we analyzed the percentage of cases in which there was perfect inter-observer agreement (ie, all four observers rated the characteristic with the same score). Table 2 presents a summary of these findings. The greatest amount of agreement was in the two sperm categories, total number and motility, with all four observers giving identical ratings in 39% of tests. In contrast, the poorest agreement was seen in the cellularity component, with perfect agreement in only 11% of cases. For all characteristics in the cases with perfect agreement, at least 75% were scored as 0 (worst possible score) or 3 (highest possible score).

The individual characteristics of the postcoital test were also examined to determine which correlated with the observers' score for overall impression. For each of the four physicians, sperm motility and sperm number were significantly associated with overall impression (P < .001, Mantel-Haenszel test). That is, a high rating for overall impression corresponded to high ratings for sperm motility and sperm number, and a low rating for overall impression corresponded to low ratings for sperm number and motility. This relation between the six scored characteristics and the observers' overall impressions is demonstrated in Figure 1. Among all physicians, the mean ratings for sperm number (out of a possible high score of 3) were 0.4, 1.8, 2.2, and 2.9 for samples classified with an overall impression of poor, fair, good, and excellent, respectively. For sperm motility, the mean ratings were 0.0, 0.6, 1.7, and 2.5. The remaining four characteristics of ferning, cellularity, spinnbarkeit, and mucus consistency were not significantly associated with overall impression for any of the physicians. This lack of correlation between test characteristics and overall impression is shown in Figure 1 by the fact that the mean rating score does not increase significantly as the overall impression improves and, in some cases, decreases as the overall impression score increases.

Table 2. Characterization of the Inter-Observer Variability in 28 Postcoital Tests by Individual Characteristic

Characteristic       Perfect agreement    Proportion   4 scores in 2          4 scores across
                     among 4 observers*   in 0 or 3†   adjacent categories‡   3 or 4 categories
Ferning              8 (28.6%)            7/8          15 (53.6%)             5 (17.9%)
Cellularity          3 (10.7%)            3/3          14 (50.0%)             11 (39.3%)
Spinnbarkeit         4 (14.3%)            3/4          21 (75.0%)             3 (10.7%)
Consistency          7 (25.0%)            7/7          8 (28.6%)              13 (46.4%)
Sperm no.            11 (39.3%)           9/11         12 (42.9%)             5 (17.9%)
Motility             11 (39.3%)           10/11        10 (35.7%)             7 (25.0%)
Overall impression   4 (14.3%)            3/4          15 (53.6%)             9 (32.1%)

* Number in parentheses indicates % of total postcoital test examinations.
† Proportion of those scores with perfect agreement that fall into either one of two extreme categories, yielding a score of either 0 or 3.
‡ Number (percentage in parentheses) of postcoital test examinations where all four scores were in two adjacent scoring categories (eg, 0 and 1, 1 and 2, or 2 and 3).

[Figure 1: six panels (Ferning, Cellularity, Spinnbarkeit, Mucus Consistency, Sperm Number, Sperm Motility), each plotting the mean rating (0.0-3.0) against Overall Impression (Poor, Fair, Good, Excellent).]

Figure 1. Relation between the six scored postcoital test characteristics and the observers' overall impressions. For sperm number and sperm motility, there was an association with the overall impression (P < .001, Mantel-Haenszel test). This finding was not seen in the four mucus characteristics as demonstrated by the lack of an increasing mean rating as the overall impression increases.
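The association between each characteristic and the overall impression was tested with the Mantel-Haenszel test for linear trend. A minimal Python sketch of one common form of that test follows, under stated assumptions: the source does not give the formula, so the statistic M² = (n - 1)r² with 1 degree of freedom is assumed, the overall impression is assumed to be coded 0-3 (poor to excellent), and the function name is hypothetical.

```python
import math


def linear_trend_test(characteristic_scores, impression_scores):
    """Linear-by-linear association test for two sets of ordinal scores.

    Returns (r, m2, p), where r is the Pearson correlation of the scores,
    m2 = (n - 1) * r**2 is referred to a chi-square distribution with 1 df,
    and p is the corresponding upper-tail probability.
    Assumption: this is the form of the trend test meant in the text.
    """
    n = len(characteristic_scores)
    if n != len(impression_scores) or n < 3:
        raise ValueError("need two score lists of equal length (>= 3)")

    mean_x = sum(characteristic_scores) / n
    mean_y = sum(impression_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in characteristic_scores)
    syy = sum((y - mean_y) ** 2 for y in impression_scores)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(characteristic_scores, impression_scores)
    )
    if sxx == 0 or syy == 0:
        raise ValueError("scores show no variation; the test is undefined")

    r = sxy / math.sqrt(sxx * syy)
    m2 = (n - 1) * r * r
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(m2 / 2.0))
    return r, m2, p
```

In the study this comparison was run separately for each physician, so one call would take that physician's 28 scores for a single characteristic together with the same physician's 28 overall impression scores.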
Discussion The postcoital test, still considered a fundamental part of an infertile couple's work-up, has come under increasing scrutiny in recent years. Concerns that have been raised include a lack of standardization in the test's performance and timing, an absence of agreement in defining the "normal" range of motile sperm for an adequate test, and its poor sensitivity, specificity, and predictive value for forecasting pregnancy. 10 In addition, test reproducibility has not been defined. This fundamental issue was the focus of our investigation. Although nearly 130 years have passed since the original description by Sims, this study represents the first attempt to determine the inter-observer variability of the postcoital test.

In this study, we demonstrated that there is poor to fair reproducibility within the individual categories of the postcoital test among trained observers rating identical slides. When we used the κ statistic to measure inter-observer agreement, only the sperm number assessment had a κ value greater than 0.4, indicating fair reproducibility. The remainder of the elements had κ values ranging from 0.13 for cellularity to 0.38 for percent motility, indicating poor reproducibility. We also found that of six characteristics rated by the observers, only sperm number and motility correlated with the overall impression. In fact, this is what one would expect in clinical practice. While the mucus characteristics may assist in determining the proper timing and conditions for the test, clinicians rely heavily on sperm number and motility to guide their final impression.

For all the test characteristics, perfect agreement was limited in at least 75% of cases to scoring categories of 0 or 3 (ie, postcoital examinations that were either very poor or very good). These cases often occurred in situations in which timing of the test was excellent or poor. For patients falling between these extremes, a situation which likely represents most cases in clinical practice, variability remained high.

Our study was conducted in a prospective manner, with each of the four observers blinded to both the patients' identities and clinical histories as well as to the other observers' ratings. The study was designed to maximize, as much as possible, the standardization of the collection, preparation, and interpretation of the test. This was accomplished by having a trained study nurse perform each mucus collection in an identical fashion. To standardize the interpretation, a modification of the WHO guidelines for evaluation of the test
was devised; this enabled the observers to take advantage of the well-accepted WHO criteria for postcoital evaluation and allowed statistical analysis of the results. Despite these precautions, however, our study does have some limitations. Although the blinding of the observers was necessary to prevent bias, the physician in the office may use patient information to better analyze the postcoital test and focus on a specific feature. For example, if a patient has a history of a previous cervical conization or has a partner with an abnormal semen analysis, this may influence the manner in which the test is interpreted. Clinicians involved in evaluating infertile couples, either by performing the postcoital test themselves or obtaining results performed by a referring physician, should be cognizant of the poor to fair reproducibility of this test.
References
1. Sims JM. Clinical notes on uterine surgery (with special reference to the sterile condition). London: Robert Hardwicke, 1866.
2. Mishell DR, Davajan V. Evaluation of the infertile couple. In: Mishell DR, Davajan V, Lobo RA, eds. Infertility, contraception, and reproductive endocrinology. 3rd ed. Boston: Blackwell Scientific Publications, 1991:557-70.
3. Speroff L, Glass RH, Kase NG. Clinical gynecologic endocrinology and infertility. 4th ed. Baltimore: Williams and Wilkins, 1989.
4. Glass RH. Infertility. In: Yen SSC, Jaffe RB, eds. Reproductive endocrinology. 3rd ed. Philadelphia: WB Saunders, 1991:1-21.
5. Seibel MM. Workup of the infertile couple. In: Seibel MM, ed. Infertility: A comprehensive text. Norwalk, Connecticut: Appleton and Lange, 1990:1-21.
6. Collins JA, So Y, Wilson EH, Wrixon W, Casper RF. The post-coital test as a predictor of pregnancy among 355 infertile couples. Fertil Steril 1984;41:703-8.
7. Harrison RF. The diagnostic and therapeutic potential of the postcoital test. Fertil Steril 1981;36:71-5.
8. Giner J, Merino G, Luna J, Aznar R. Evaluation of the Sims-Huhner postcoital test in fertile couples. Fertil Steril 1974;25:145-8.
9. Samberg I, Martin-Du-Pan R, Bourrit B. The value of the postcoital test according to etiology and outcome on fertility. Acta Eur Fertil 1985;16:147-9.
10. Griffith CS, Grimes DA. The validity of the postcoital test. Am J Obstet Gynecol 1990;162:615-20.
11. Department of Clinical Epidemiology and Biostatistics, McMaster University. How to read clinical journals: II. To learn about a diagnostic test. Can Med Assoc J 1981;124:703-10.
12. Fineberg HV, Hiatt HH. Evaluation of medical practices. The case for technology assessment. N Engl J Med 1981;301:1086-91.
13. Levine RJ, Mathew RM, Brown MH, et al. Computer-assisted semen analysis: Results vary across technicians who prepare videotapes. Fertil Steril 1989;52:673-7.
14. Forman RG, Robinson I, Yudkin P, Egan D, Reynolds K, Barlow DH. What is the true reproducibility of transvaginal ultrasound monitoring in stimulated cycles? Fertil Steril 1991;56:989-92.
15. World Health Organization laboratory manual for the examination of human semen and semen-cervical mucus interaction. 2nd ed. Cambridge, England: Cambridge University Press, 1988.
16. Rosner B. Fundamentals of biostatistics. 3rd ed. Boston: PWS-Kent Publishing, 1990.
Address reprint requests to: Mark D. Hornstein, MD, Department of Obstetrics and Gynecology, Brigham and Women's Hospital, 75 Francis Street, Boston, MA 02115
Received June 6, 1994. Received in revised form September 21, 1994. Accepted September 23, 1994.
Copyright © 1995 by The American College of Obstetricians and Gynecologists.