ADULT UROLOGY
RELIABILITY OF REMEMBERED INTERNATIONAL INDEX OF ERECTILE FUNCTION DOMAIN SCORES IN MEN WITH LOCALIZED PROSTATE CANCER
PIERRE KARAKIEWICZ, SHAHROKH F. SHARIAT, AMIR NADERI, DOV KADMON, AND KEVIN M. SLAWIN
ABSTRACT
Objectives. To test the reliability of recollected International Index of Erectile Function (IIEF) domain scores before and after radical prostatectomy. Recall reliability can be affected by several biases. In men with localized prostate cancer (PCa), conflicting results have been reported.
Methods. Thirty-nine men, aged 44 to 69 years, were invited to participate in a prospectively administered IIEF questionnaire. The survey was administered before and 6 and 12 months after radical prostatectomy. Several months later, a recall IIEF survey targeted the prospectively gathered IIEF data. The independent sample t test, Pearson correlation coefficient, partial correlation, and intraclass correlation coefficient tested the reliability of the recalled IIEF scores versus the prospective ratings.
Results. All 39 men completed the prospective and recalled IIEF surveys addressing preoperative erectile function. Surveys targeting function at 6 and 12 months after surgery were completed by 85% and 51% of the participants, respectively. The erectile function domain demonstrated the greatest recall reliability (intraclass correlation coefficient 0.65 to 0.73). Erectile function and sexual desire scale recall reliability was greatest for pretreatment function or function 12 months after surgery. The orgasmic function domain had the lowest recall reliability (intraclass correlation coefficient 0.37 to 0.54).
Conclusions. When restricted to before surgery and 12 months after surgery, most IIEF domains may be reliably used in a retrospective fashion. The erectile function and sexual desire domains appear to be most reliable, possibly because they address more objective areas of men’s sexual function.
UROLOGY 65: 131–135, 2005. © 2005 Elsevier Inc.
Treatment of localized prostate cancer (PCa) is challenging, because a balance between cancer control and preservation of health-related quality of life (HRQOL) is required.1,2 Except for urinary incontinence, erectile dysfunction is the most common, long-lasting detriment to HRQOL in men treated with radical prostatectomy (RP).2 Assessment of RP-associated erectile dysfunction is crucial for outcome assessment, because appreciable erectile dysfunction may offset some of the benefits of durable cancer control.2
The accuracy with which HRQOL is recalled has been questioned.3–7 Tests of preoperatively recalled HRQOL focused on men with PCa have yielded divergent results.3,4 Because of the importance of recalled HRQOL in men who have undergone RP for PCa, we examined the value of recalling erectile function. Specifically, we tested the consistency of recalled information gathered with a widely used tool, the five multi-item domains of the International Index of Erectile Function (IIEF).8,9

This study was supported in part by the American Foundation for Urologic Diseases, Austrian Program for Advanced Research and Technology, and Fonds de la Recherche en Santé du Québec.
From the Department of Urology and Outcomes Research Unit, University of Montreal, Montreal, Quebec, Canada; Scott Department of Urology, Baylor College of Medicine, Houston, Texas; and Department of Urology, University of Texas Southwestern Medical Center, Dallas, Texas
Reprint requests: Kevin M. Slawin, M.D., Scott Department of Urology, Baylor College of Medicine, 6535 Fannin Street, Houston, TX 77030. E-mail: [email protected]
Submitted: July 16, 2004, accepted (with revisions): August 27, 2004
© 2005 ELSEVIER INC. ALL RIGHTS RESERVED 0090-4295/05/$30.00 doi:10.1016/j.urology.2004.08.054
UROLOGY 65 (1), 2005

MATERIAL AND METHODS
The study targeted a convenience sample of 39 men, who had been diagnosed and treated with radical retropubic prostatectomy for clinically localized PCa at Baylor College of Medicine (Houston, Tex). Erectile function represented the targeted outcome and was measured with the IIEF United States version. At each assessment, all participants were invited to complete the IIEF, which targeted the preceding 4 weeks. All men were assessed preoperatively. Subsequent assessments took place at 6 and 12 months postoperatively. After the real-time assessment, each participant was invited to complete a recall IIEF survey. Within the recall survey, the IIEF items were preceded by the phrase “before surgery,” “6 months after surgery,” or “12 months after surgery,” instead of the original “during the past 4 weeks.” All surveys were administered in a clinic setting. Each patient completed six IIEF questionnaires: three addressed recalled erectile function and three addressed real-time erectile function.
Statistical analyses addressed the reliability of the recalled IIEF domain scores relative to the prospective assessments. The IIEF domain scores were used in all tests. These were calculated using IIEF items 1, 2, 3, 4, 5, and 15 for erectile function; 9 and 10 for orgasmic function; 11 and 12 for sexual desire; 6, 7, and 8 for intercourse satisfaction; and 13 and 14 for overall satisfaction. Differences in the mean IIEF domain score between the prospective and retrospective assessments were tested using the independent sample t test. The strength of the linear relation between the prospective and retrospective assessments was quantified using the Pearson correlation coefficient. Partial correlation coefficients were used to control for the effect of differences in recall time. Finally, intraclass correlation coefficients (ICCs) with one-way analysis of variance models with random effect for subjects were used in all reliability tests addressing the strength of agreement after accounting for the level of agreement on an individual basis. In all tests, statistical significance was set at 0.05, and all tests were two-sided.

RESULTS
A convenience sample of 39 men was invited to participate in a study addressing the reliability of recalled erectile function. Their age range was 44 to 66 years (mean and median 55). All participants were diagnosed with clinically localized PCa, and all underwent RP after completing a preoperative IIEF survey. At diagnosis, the serum total prostate-specific antigen (PSA) level ranged from 1.1 to 31.8 ng/mL (mean 9.0, median 6.4). Biopsy Gleason sums ranged from 6 to 8 (mean and median 7). All 39 participants completed prospective and recalled IIEF surveys addressing their erectile function before surgery. Of the 39, 33 (85%) and 20 (51%) participants completed surveys at 6 and 12 months after RP, respectively.
Table I shows the descriptive statistics of prospectively and retrospectively recorded IIEF domain scores for each of three visits (before surgery, 6 months after surgery, and 12 months after surgery). The recall time is indicated for each visit type. It defines the amount of time, in months, elapsed between the prospective and retrospective assessments. The difference in the mean value shows the change between the prospective and recalled score. A positive mean difference indicates that the recalled domain score was greater than the prospective domain score (recall overestimated prospective score). Conversely, a negative mean difference indicates that the recalled score was lower than the prospective score (recall underestimated prospective score).
A comparison of the mean domain scores found very close agreement between the prospective and recalled data. Regardless of the time of the assessment relative to the RP, the differences between the IIEF mean scores were neither clinically meaningful nor statistically significant. Furthermore, in virtually all mean comparisons, the differences consisted of a fraction of a point on a scale ranging from 10 to 35 possible points. The differences in the mean recall times were statistically significant in all comparisons between visit types (P <0.05, Bonferroni test), except for the recall times recorded for the 6-month and 12-month visits.

TABLE I. Mean prospective and recalled International Index of Erectile Function domain scores
                                              Before Surgery (n = 39)                          6 mo After Surgery (n = 33)                      12 mo After Surgery (n = 20)
IIEF Domain                                   Prospective  Recalled     Difference (P Value*)  Prospective  Recalled     Difference (P Value*)  Prospective  Recalled     Difference (P Value*)
Erectile function (score range 0–35)          27.4 (5.8)   28.9 (2.9)   1.5 (0.16)             6.2 (5.2)    6.7 (5.6)    0.5 (0.74)             9.1 (7.3)    8.6 (5.2)    −0.5 (0.82)
Orgasmic function (score range 0–10)          9.7 (1.1)    9.9 (0.5)    0.2 (0.28)             4.0 (2.8)    4.1 (2.9)    0.1 (0.87)             5.0 (2.8)    4.9 (2.0)    −0.1 (0.90)
Sexual desire (score range 0–10)              8.4 (1.6)    8.7 (1.5)    0.3 (0.41)             6.8 (2.0)    6.7 (2.2)    −0.1 (0.82)            7.2 (2.1)    7.4 (1.6)    0.2 (0.80)
Intercourse satisfaction (score range 0–15)   12.2 (2.9)   12.4 (2.0)   0.2 (0.67)             4.4 (4.4)    4.8 (4.3)    0.4 (0.67)             5.7 (5.0)    5.3 (3.8)    −0.4 (0.78)
Overall satisfaction (score range 0–10)       8.8 (1.5)    9.1 (1.6)    0.3 (0.40)             6.0 (2.7)    5.3 (2.7)    −0.7 (0.26)            6.4 (2.4)    5.6 (2.6)    −0.8 (0.39)
Mean recall time (mo; range)                  18.4 (8.9–40.4)                                  12.4 (2.3–33.5)                                  11.5 (3.5–27.3)
KEY: IIEF = International Index of Erectile Function. Data in parentheses are standard deviation, unless otherwise noted.
* Independent sample t test.

Figure 1 shows a composite of the scatterplots of the association between the prospective and recalled IIEF domain scores for each visit type (preoperative, 6 months postoperatively, and 12 months postoperatively). The correlation coefficients indicate the strength of the linear relation of each plot. The coefficients suggest that the time of the assessment and the specific IIEF domain type may represent important determinants of recall reliability. Overall, the IIEF domain scores addressing erectile status before surgery had the strongest linear relationship. Except for orgasmic function, all Pearson correlation coefficients equaled or exceeded the desired cutoff value of 0.7. IIEF surveys addressing erectile status 6 months postoperatively had the weakest linear correlation between evaluations, with most coefficients appreciably less than 0.7. The IIEF domain type affected the strength of the linear relation between the prospective and recalled data.
Table II shows the potential confounding effect of the length of recall time and of averaging the individual observations on the linear relation between the prospective and recalled data.
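The three reliability indexes used in this comparison (Pearson correlation, partial correlation controlling for recall time, and a one-way random-effects ICC) can be illustrated with a minimal Python sketch. The data below are hypothetical and are not study data. The function names are illustrative, and the ICC shown is the standard one-way form for two ratings per subject, which is assumed here to correspond to the model described in Material and Methods.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def icc_oneway(x, y):
    """One-way random-effects ICC for two ratings (x, y) per subject."""
    n, k = len(x), 2
    subj_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand = sum(subj_means) / n
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, subj_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical domain scores: recall overestimates every score by 5 points.
prospective = [10, 20, 30, 15, 25]
recalled = [p + 5 for p in prospective]
recall_months = [12, 18, 9, 30, 6]  # hypothetical recall times

r = pearson_r(prospective, recalled)
r_partial = partial_r(prospective, recalled, recall_months)
icc = icc_oneway(prospective, recalled)
print(round(r, 2), round(r_partial, 2), round(icc, 2))
```

With these hypothetical values, a constant 5-point overestimate leaves both correlation coefficients at 1.0 but lowers the ICC to about 0.82, which illustrates why an agreement index is a more stringent test of recall reliability than a linear correlation.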
For each domain score, the data were tabulated in columns according to the type of visit. For each visit type, the relation between the prospective and recalled domain scores was tested using Pearson’s correlation coefficient. Partial correlation coefficients assessed the strength of the linear relation after controlling for recall time. Finally, ICCs were determined to test the relation between the prospective and recalled domain scores after accounting for the individual level of agreement. The recall time may be a confounding factor because recall times differed significantly between visit types (Table I). Averaging may either exaggerate or underestimate the true relation between the prospective and recalled IIEF scores. Averaging of individual observations may be accounted for with the use of the ICC. The ICC assesses the strength of the linear relationship between the prospective and recalled scores on the basis of the contribution of the individual x-axis and y-axis data pairs. The ICC differs from the Pearson correlation coefficient in that it removes the effect of group averaging. Therefore, partial correlations and ICCs represent more stringent ways of testing for the strength of a given linear relation.
Partial correlation coefficients, which controlled for the effect of recall time, demonstrated that the recall time differences had a negligible effect on the strength of the relation between the recalled and prospective IIEF domain scores. For most IIEF domain scores, the ICCs demonstrated a minor decrease in the recorded strength of the linear relationship. The strongest effect of averaging was observed for data addressing the preoperative visit, especially data within the erectile function domain of the IIEF. All Pearson correlation coefficients, partial correlation coefficients, and ICCs were statistically significant at the 0.05 level.

COMMENT
HRQOL recollection error may represent an important source of bias.3–7 Recall bias affects reliability and may consequently compromise the validity of the HRQOL data. Statistically significant recollection errors have been reported in several disease processes.4–7 Sources of distortion may relate to the difficulty of the question posed, or to the recall process required to access the desired information from memory. Moreover, flaws in study design may introduce biases related to data encoding or subject motivation. Furthermore, the type of treatment and/or the focus of the assessment may affect the quality of recollected information. In urologic reports, recalled impressions of urinary and sexual function in men treated for PCa exceeded prospectively assessed function when the recall time ranged from 7 to 37 months.4 When the recall time was 6 months, strong to reasonable agreement was found.3 The most noticeable differences between these studies were related to the questionnaire type and the length of recall time.
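To make the direction of a recall difference concrete, the following Python sketch scores the five IIEF domains from item responses, using the item groupings given in Material and Methods, and computes the recalled-minus-prospective difference for each domain. The item responses are hypothetical, the function names are illustrative, and simple summation of item scores within a domain is assumed.

```python
# Item groupings as listed in Material and Methods.
DOMAIN_ITEMS = {
    "erectile function": (1, 2, 3, 4, 5, 15),
    "orgasmic function": (9, 10),
    "sexual desire": (11, 12),
    "intercourse satisfaction": (6, 7, 8),
    "overall satisfaction": (13, 14),
}

def domain_scores(responses):
    """Sum item scores per domain; responses maps item number (1-15) to score."""
    return {domain: sum(responses[item] for item in items)
            for domain, items in DOMAIN_ITEMS.items()}

def recall_differences(prospective, recalled):
    """Recalled minus prospective domain score (positive = overestimation)."""
    p, r = domain_scores(prospective), domain_scores(recalled)
    return {domain: r[domain] - p[domain] for domain in DOMAIN_ITEMS}

# Hypothetical responses for one participant: every item scored 4 in real time;
# on recall, item 1 is remembered as 5 and item 9 as 3.
prospective = {item: 4 for item in range(1, 16)}
recalled = dict(prospective)
recalled[1] = 5
recalled[9] = 3

diffs = recall_differences(prospective, recalled)
for domain, diff in diffs.items():
    print(f"{domain}: {diff:+d}")
```

In this hypothetical case the erectile function domain is overestimated by 1 point, the orgasmic function domain is underestimated by 1 point, and the remaining domains show no difference, which mirrors the sign convention used for the mean differences in Table I.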
We attempted to test the contribution of recollection error using the reliable and validated multi-item domains of the IIEF instrument. The accuracy of recalled IIEF information varied from good to poor and was affected by the IIEF domain type. Conversely, the reliability indexes seemed only marginally affected by the length of recall time. The average difference between the actual and recollected scores was less than 4%. Its direction depended on the timing of the erectile function assessment relative to treatment. Before treatment, recalled scores invariably tended to overestimate the actual scores. Twelve months after treatment, the recalled IIEF tended to underestimate the actual IIEF. Six months after treatment, two IIEF scales underestimated the actual score, and three others overestimated the actual score, suggesting that recall bias is variable in its direction.

FIGURE 1. Relation between prospective (x-axis) and recalled (y-axis) IIEF domain scores. Columns demonstrate data according to visit type (before surgery versus 6 months after surgery versus 12 months after surgery). Each row of panels illustrates the relation between prospective and recalled data for a given IIEF domain: erectile function (EF), orgasmic function (OF), sexual desire (SD), intercourse satisfaction (IS), and overall satisfaction (OS). Within each panel, plain circles and sunflower petals each represent one observation. Best-fit linear regression lines, accompanied by 95% confidence intervals, show the relation between prospective and recalled data. Within each panel, Pearson’s correlation coefficient (r) describes the strength of the linear relation.

TABLE II. Correlation between prospective and recalled International Index of Erectile Function domain scores
                            Before Surgery         6 mo After Surgery     12 mo After Surgery
IIEF Domain                 r      rc     ICC      r      rc     ICC      r      rc     ICC
Erectile function           0.91   0.91   0.69     0.64   0.65   0.65     0.76   0.75   0.73
Orgasmic function           0.49   0.49   0.37     0.44   0.44   0.45     0.55   0.57   0.54
Sexual desire               0.68   0.75   0.67     0.64   0.64   0.64     0.72   0.72   0.70
Intercourse satisfaction    0.76   0.75   0.68     0.59   0.63   0.60     0.74   0.73   0.72
Overall satisfaction        0.87   0.87   0.83     0.39   0.39   0.38     0.70   0.70   0.68
KEY: IIEF = International Index of Erectile Function; r = Pearson correlation coefficient; rc = partial correlation coefficient; ICC = intraclass correlation coefficient.

The IIEF domain types were the most important determinant of recall accuracy. The erectile function and sexual desire domains demonstrated the greatest overall reliabilities. These findings suggest that more concrete and quantitative HRQOL domains, such as erectile function, may be less affected than conceptually more complex domains, such as orgasmic function.
The timing of IIEF administration relative to RP was the second most important determinant of recollection error. The reliability of recalled preoperative IIEF domain scores was adequate. This may have been due to the constancy over time of erectile function before surgery. Similarly, only a small fraction of patients recalled better baseline function in a previous study.4
We noted a wide range in the lengths of recall time for each visit type. Longer recall time may increase the magnitude of recollection error. Therefore, we used partial correlations to control for differences in recall time length. Our results suggested that recall time did not appreciably affect recollection error.
Several other sources of distortion may affect the accuracy of recalled HRQOL, such as re-coding and motivational and practice biases.5–7 Re-coding of erectile function may occur after RP, because subjects are warned about the effects of RP on erections. Motivational bias may result in subjects trying harder to notice and report symptoms related to treatment. Practice bias may be introduced when subjects are invited to participate in several assessments, such that understanding and skills related to questionnaire participation may increase in proportion to the number of assessments. These biases affect prospective and retrospective assessments.
Our study had several limitations. Primarily, the small sample size may have limited the statistical power of our study. Moreover, the loss to follow-up further decreased the number of observations at 12 months postoperatively. The relatively limited length of follow-up also precluded assessment of recall reliability when HRQOL has reached a truly stable state after RP. This would correspond to 24 or 36 months after surgery. The variability of recall time length may also be interpreted as a limitation.
However, restriction to a specific recall time length would also have limited the potential for testing the effect of recall time variability on the magnitude of recall bias. Therefore, depending on the interest of the investigator, recall time length variability may be viewed as either a strength or a weakness. Finally, the use of multi-tool surveys would allow comparisons of recall reliability and of its magnitude between different surveys and their domains.

CONCLUSIONS
Recalled HRQOL scores approximated prospective HRQOL scores within a reasonable margin. To minimize distortion of recalled HRQOL, domain items should be of limited complexity and should target quantifiable items. In addition, the assessments should be performed at a time when the measured function has reached a stable state.

REFERENCES
1. Lu-Yao GL, and Yao SL: Population-based study of long-term survival in patients with clinically localised prostate cancer. Lancet 349: 892–893, 1997.
2. Fleming C, Wasson JH, Albertsen PC, et al, for the Prostate Patient Outcomes Research Team: A decision analysis of alternative treatment strategies for clinically localized prostate cancer. JAMA 269: 2650–2658, 1993.
3. Legler J, Potosky AL, Gilliland FD, et al: Validation study of retrospective recall of disease targeted function: results from the Prostate Cancer Outcomes study. Med Care 38: 847–857, 2000.
4. Litwin MS, and McGuigan K: Accuracy of recall in health-related quality-of-life assessment among men treated for prostate cancer. J Clin Oncol 17: 2882–2888, 2000.
5. Mancuso CA, and Charlson ME: Does recollection error threaten the validity of cross-sectional studies of effectiveness? Med Care 33(suppl): AS77–AS88, 1995.
6. Aseltine RH, Carlson KJ, Fowler FJ, et al: Comparing prospective and retrospective measures of treatment outcomes. Med Care 33(suppl): AS67–AS76, 1995.
7. Herrmann D: Reporting current, past and changed health status: what we know about distortion. Med Care 33(suppl): AS89–AS94, 1995.
8. DeVellis RF: Scale Development: Theory and Applications. Applied Social Research Methods Series, vol 26. Newbury Park, California, Sage Publications, 1991.
9. Rosen RC, Riley A, Wagner G, et al: The International Index of Erectile Function (IIEF): a multidimensional scale for assessment of erectile dysfunction. Urology 49: 822–830, 1997.