Psychiatry Research 135 (2005) 229 – 235 www.elsevier.com/locate/psychres
Gradations of clinical severity and sensitivity to change assessed with the Beck Depression Inventory-II in Japanese patients with depression
Takahiro Hiroe^a, Masayo Kojima^a,b, Ikuyo Yamamoto^a, Suguru Nojima^a, Yoshihiro Kinoshita^a, Nobuhiko Hashimoto^a, Norio Watanabe^a, Takao Maeda^a, Toshi A. Furukawa^a,*
^a Department of Psychiatry and Cognitive-Behavioral Medicine, Nagoya City University Graduate School of Medical Sciences, Mizuho-cho, Mizuho-ku, Nagoya 467-8601, Japan
^b Department of Health Promotion and Disease Prevention, Nagoya City University Graduate School of Medical Sciences, Mizuho-cho, Mizuho-ku, Nagoya 467-8601, Japan
Received 5 January 2004; accepted 17 March 2004
Abstract
Knowledge of what constitutes a minimal clinically important difference and change on a psychiatric rating scale is essential in interpreting its scores. The present study examines the Beck Depression Inventory-II (BDI-II), a recently revised successor to the world’s most popular self-rating instrument for depression. The BDI-II was administered to 85 patients with major depression, diagnosed according to DSM-IV along with its severity specifiers. It was administered again to 40 first-visit patients from the original sample when they returned 14 or more days later. The Clinical Global Impression-Change Scale was rated at the same time. All the ratings were made independently of each other. The BDI-II was able to distinguish between all grades of depression severity. An approximate 10-point difference existed between each severity specifier. The BDI-II was also sensitive to change in depression: a 5-point difference corresponded to a minimally important clinical difference, 10–19 points to a moderate difference, and 20 or more points to a large difference. Given the already established high reliability, content validity, construct validity and factorial validity, and the high sensitivity to between-subject differences and within-subject changes demonstrated in the present study, the BDI-II promises to continue to be a leading self-rating instrument to assess depression severity worldwide.
© 2005 Elsevier Ireland Ltd. All rights reserved.
Keywords: Depression; Psychiatric status rating scales; Psychometrics; Sensitivity to change
* Corresponding author. Tel.: +81 52 853 8271; fax: +81 52 852 0837. E-mail address: [email protected] (T.A. Furukawa).
0165-1781/$ - see front matter © 2005 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.psychres.2004.03.014
T. Hiroe et al. / Psychiatry Research 135 (2005) 229–235
1. Introduction
The increasing use of psychometric scales within psychiatry and related disciplines necessitates a clear understanding of how to appraise the importance of differences in severity (assessed cross-sectionally and between subjects) and changes over time (assessed longitudinally and within subjects) recorded with these scales. Knowledge of what constitutes a minimal clinically important difference and change is, however, often lacking for many widely used psychiatric instruments. The Beck Depression Inventory-Second Edition (BDI-II) appears to be one such instrument.
The Beck Depression Inventory (BDI) has been one of the most widely used self-report instruments for measuring the severity of depression. Since its original development in 1961 (Beck et al., 1961), the BDI was revised once, with minimal modification in wording, in 1979 (Beck et al., 1979). With the advent of operationalized diagnostic criteria for depression, the original developers of the scale deemed it necessary and appropriate to modernize the instrument and developed the second edition of the BDI in 1996 (Beck et al., 1996). All but three of the 21 items were reworded: four old items (weight loss, body image change, somatic preoccupation, and work difficulty) were replaced by new items (agitation, worthlessness, concentration difficulty, and loss of energy) to harmonize with DSM-IV diagnostic criteria for major depression (American Psychiatric Association, 1994); two items were changed to allow for increases as well as decreases in appetite and sleep; and many of the statements used in the ratings were revised. The BDI-II has found rapid and widespread acceptance on the international scene, with Spanish, Portuguese (Coelho et al., 2002), Arabic (Al-Musawi, 2001) and Japanese (Kojima et al., 2002) versions now available.
Although the BDI has been shown to be an excellent screener for depression in the general and medical populations, the manual for the BDI-II states that it is primarily an assessment tool to rate the severity of depression in patients whose diagnosis has already been established, and proposes guidelines to interpret total scores (Beck et al., 1996). Whether the guidelines apply to patients outside Western cultures is not known. Moreover, no study to date has examined the scale’s sensitivity to change and set out clear guidelines to interpret changes in scores. After developing the Japanese version of the BDI-II and ascertaining its excellent internal consistency reliability, criterion validity and factor validity (Kojima et al., 2002), we therefore realized that there was an urgent need to standardize this instrument by establishing minimal clinically important difference and change scores. Without such knowledge, clinicians and patients would find it extremely difficult, and probably impossible, to interpret scores obtained with the BDI-II.
2. Methods
2.1. Subjects and procedures
Eighty-five patients with major depressive disorder, single episode or recurrent, according to DSM-IV participated in this study at the Departments of Psychiatry of Nagoya City University Hospital and Toyokawa Municipal Hospital, Japan. The patients were asked to complete the BDI-II. The treating psychiatrist rated the severity of their depressive episode according to DSM-IV as severe, moderate, mild, partially remitted or remitted without knowledge of the patients’ (subsequently obtained) BDI-II scores.
Forty subjects of the original cohort, all of whom consented to the baseline evaluation at the time of their first visit to our hospitals, were re-administered the BDI-II when they returned 14 or more days later. This time period was chosen because the time frame of evaluation for the BDI-II is 2 weeks, unlike the BDI-I, which set the time frame at 1 week. Using the Clinical Global Impression-Change (CGI-Change) Scale (Guy, 1976), the treating physician rated the patient as much worse, moderately worse, slightly worse, no change, slightly improved, moderately improved or much improved, again before having access to the patients’ BDI-II scores.
The patients consented to allow their data to be used for research purposes after receiving an explanation of the study.
2.2. Statistical analyses
We used SPSS 11.0 (SPSS Inc., 2001) to perform statistical analyses. To elucidate what constitutes a minimal clinically important difference in BDI-II scores with regard to depression severity, we first depicted the distribution of the baseline BDI-II scores according to the severity of the subjects’ depression. Between-group differences in BDI-II scores were compared by one-way analysis of variance, with post hoc Tukey’s honestly significant difference test. Linear regression of depression severity based on the BDI-II scores was performed, and the minimal difference corresponding to one degree of difference in DSM-IV severity was estimated.
The validity of the cutoff scores for depression severity proposed for the original English version was examined by calculating the likelihood ratios (LR) to distinguish between remitted vs. mild depression, mild vs. moderate depression, and moderate vs. severe depression. A likelihood ratio is the ratio of two likelihoods: that of observing the test result in question among those with the condition (e.g. a given disease severity) and that of observing the same test result among those without the condition. According to Bayes’ theorem, the relationship between the pretest probability of the target condition, the LR, and the posttest probability can be described as follows:

\[ \text{Pretest odds} \times \text{LR} = \text{Posttest odds}, \qquad \text{where } \text{Odds} = \frac{\text{Probability}}{1 - \text{Probability}}. \]

Thus, the LR indicates by how much a given test result will increase or decrease the pretest probability into the posttest probability. The posttest probability is greater than the pretest probability if the LR is > 1.0, and smaller if the LR is < 1.0 (Furukawa et al., 2001). If the original cutoff scores are valid, we would expect LRs to change from < 1.0 to > 1.0 as we cross the boundary between the less severe group and the more severe group.
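The odds form of Bayes’ theorem described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the analysis code used in the study (the analyses were run in SPSS). The function names `probability_to_odds`, `odds_to_probability` and `posttest_probability` are hypothetical, and the pretest probability and LR values in the usage lines are arbitrary illustrative numbers rather than study data.

```python
def probability_to_odds(probability: float) -> float:
    """Convert a probability in [0, 1) to odds: p / (1 - p)."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability: odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


def posttest_probability(pretest_probability: float, lr: float) -> float:
    """Apply: pretest odds x LR = posttest odds, then convert to a probability."""
    if lr < 0.0:
        raise ValueError("a likelihood ratio cannot be negative")
    posttest_odds = probability_to_odds(pretest_probability) * lr
    return odds_to_probability(posttest_odds)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the study).
    for lr in (0.5, 1.0, 2.0):
        print(lr, round(posttest_probability(0.5, lr), 3))
```

With a pretest probability of 0.5, an LR below 1.0 lowers the posttest probability, an LR of exactly 1.0 leaves it unchanged, and an LR above 1.0 raises it, which is the pattern the cutoff validation looks for on either side of a boundary.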
To examine the minimal clinically important change in BDI-II scores over time, we again depicted the distribution of the change scores according to the degree of change as rated by the CGI-Change Scale. The change scores were calculated as BDI-II score at baseline minus BDI-II score after treatment. Because the significance of the change scores likely depended on the baseline BDI-II score (e.g., the greater the baseline score, the greater the change that would be required for a patient to be rated improved), we plotted the distribution of the change scores for each category of the baseline BDI-II scores. The linear regression of the CGI-Change score on the BDI-II change scores was performed to estimate the change score corresponding to a 1-point increase in the CGI-Change score. We compared the BDI-II change scores between CGI-Change categories by analysis of covariance adjusting for the baseline score, with Tukey’s post hoc comparisons.
We also examined the percentage change scores, instead of raw change scores, to assess the scale’s sensitivity to change. Because the results were essentially identical and raw scores are far easier to interpret, we only report the results with regard to raw change scores.
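The change-score definition above (baseline minus follow-up, so that a positive value means improvement) can be illustrated with a short Python sketch. The function names are hypothetical. The percentage-change formula, raw change divided by the baseline score, is an assumption, because the text does not spell out how percentage change was computed. The 0 to 63 score range follows from the 21 items, each scored 0 to 3.

```python
from typing import Optional

BDI_II_MIN_SCORE = 0
BDI_II_MAX_SCORE = 63  # 21 items, each scored 0 to 3


def _validate_score(score: int) -> None:
    """Reject totals outside the possible BDI-II range."""
    if not BDI_II_MIN_SCORE <= score <= BDI_II_MAX_SCORE:
        raise ValueError(f"BDI-II total score out of range: {score}")


def change_score(baseline: int, follow_up: int) -> int:
    """Raw change: baseline minus follow-up (positive = improvement)."""
    _validate_score(baseline)
    _validate_score(follow_up)
    return baseline - follow_up


def percent_change(baseline: int, follow_up: int) -> Optional[float]:
    """Change as a percentage of the baseline score.

    Assumed definition (not stated in the paper). Returns None when the
    baseline score is 0, because the ratio is then undefined.
    """
    if baseline == 0:
        _validate_score(follow_up)
        return None
    return 100.0 * change_score(baseline, follow_up) / baseline
```

For example, a hypothetical patient scoring 30 at baseline and 15 at follow-up would have a raw change score of 15 and, under the assumed definition, a percentage change of 50.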
3. Results
Table 1 summarizes the clinical characteristics of the original cohort of 85 patients and the subsample of these who underwent the second evaluation. Forty (47%) of the original cohort were repeat-visit patients, but all patients in the subsample who underwent the second evaluation were first-visit patients, as demanded by our research protocol. Fourteen (16%) of the cohort were inpatients when they first completed the BDI-II. When disease severity or CGI-Change scores were controlled, BDI-II baseline scores and change scores did not significantly differ depending on these demographic and clinical variables, and therefore the results for the total sample are reported in the following.
Table 1
Characteristics of the sample

Characteristic                            Original cohort (n = 85)   Subgroup who underwent second evaluation (n = 40)
Sex (women : men)                         50 : 35                    27 : 13
Age (mean, SD)                            45.7 (SD = 16.5)           44.9 (SD = 17.2)
Diagnosis (single episode : recurrent)    51 : 34                    29 : 11
First visit : repeat visit                45 : 40                    40 : 0
Inpatient : outpatient                    14 : 71                    0 : 40
Fig. 1 shows the scattergram of the baseline BDI-II scores according to the severity of the subjects’ depression. Because there were only three subjects in full remission, these were amalgamated with those in partial remission. The BDI-II means (95% confidence interval, n) for each level of depression severity were 11.2 (5.6 to 16.8, n = 14) for partially remitted patients, 20.8 (16.5 to 25.0, n = 16) for people with mild depression, 30.5 (27.6 to 33.4, n = 33) for people with moderate depression, and 42.2 (37.8 to 46.7, n = 22) for people with severe depression. The analysis of variance showed a highly significant difference in BDI-II scores depending on depression severity (F(3, 81) = 39.4, P < 0.001). Tukey’s post hoc comparisons of each adjacent severity pair were all statistically significant (Table 2). The Pearson product–moment correlation between depression severity according to DSM-IV and the BDI-II score was 0.77 (0.67 to 0.84, P < 0.001). The regression of depression severity on the baseline BDI-II score produced an unstandardized coefficient of 10.3 (8.5 to 12.2), indicating that there was an approximate 10-point increase in the BDI-II score for a 1-point increase in depression severity according to DSM-IV. This observation is consistent with Table 2 and Fig. 1.

Table 2
Comparison of BDI-II scores for each adjacent severity category

Adjacent severity categories      Mean difference (SD)   P^a
Remitted vs. mild depression      9.5 (9.0)              0.02
Mild vs. moderate depression      9.7 (8.9)              0.003
Moderate vs. severe depression    11.7 (9.1)             <0.001

^a The mean difference is significant at the 0.05 level according to the analysis of covariance with post hoc Tukey’s honestly significant difference test.

Fig. 1. Distribution of BDI-II scores according to the severity of the major depressive episode (y-axis: BDI-II score, 0 to 60; x-axis: depression severity according to DSM-IV: remitted, mild, moderate, severe). The dotted line represents the regression prediction line (r = 0.77).

Table 3 shows LRs to distinguish between severity groups according to the original cutoff scores proposed by Beck et al. The LRs to distinguish between remitted vs. mild depression showed a transition from 0.22 to 8.80 at the cutoff point of 13/14, those to distinguish between mild vs. moderate depression changed from 0.16 to 3.24 at the cutoff point of 19/20, and those to differentiate moderate from severe depression changed from 0.28 to 1.67 at the cutoff of 28/29; none of these estimates’ 95% confidence intervals crossed 1.0, as would be expected if the original cutoff scores were to apply to the Japanese patients.

Table 3
Likelihood ratios (LRs) of the original scoring guide to distinguish between grades of severity

DSM-IV specifier categories       BDI-II cut scores      LR−                    LR+
Remitted vs. mild depression      At the cutoff 13/14    0.22 (0.07 to 0.66)    8.80 (1.92 to 40.3)
Mild vs. moderate depression      At the cutoff 19/20    0.16 (0.05 to 0.54)    3.24 (1.33 to 7.88)
Moderate vs. severe depression    At the cutoff 28/29    0.28 (0.10 to 0.79)    1.67 (1.16 to 2.42)

LR−: likelihood ratio for the scores below the cutoff. LR+: likelihood ratio for the scores above the cutoff. Figures in parentheses show 95% confidence intervals.

We now examine the minimal clinically important change in BDI-II scores over time. No patient was evaluated as much, moderately or slightly worse on the CGI-Change Scale on the repeat visit. Fig. 2 shows the scattergram of the change scores of the BDI-II from baseline to the repeat visit at least 14 days later (18.1 days on average, range = 14 to 42), according to the CGI-Change evaluation. Because there were only three patients who showed “much improvement” according to the CGI, we combined the “moderately improved” and “much improved” subjects in the following analyses.
Table 4 presents the change scores for each CGI-Change category, subdivided by and adjusted for the baseline BDI-II scores. The analysis of covariance adjusted for the baseline BDI-II scores demonstrated a highly significant difference in the BDI-II change scores according to the CGI-Change categories (F(3, 47) = 13.8, P < 0.001). Tukey’s post hoc comparison between CGI “No change” and CGI “Slight improvement” was not significant (P = 0.060). There was a mean difference of 11.2 (5.5 to 16.9, P < 0.001) in the BDI-II change scores between CGI ratings of “Slight improvement” and “Moderate to much improvement.”
Fig. 2. Distribution of BDI-II change scores by CGI-Change score and by baseline BDI-II score (y-axis: change in BDI-II scores, −10 to 50; x-axis: CGI-Change: no change, slight, moderate, much; points grouped by baseline severity: 0 to 13, 14 to 19, 20 to 28, 29 to 63). The dotted line represents the regression prediction line (r = 0.72).
Table 4 BDI-II change scores for each CGI-Change category by baseline score Baseline score on BDI-II
0–13 14–19 20–28 29–63 Adjusteda
CGI-Change No change
Slight improvement
Moderate to much improvement
n
n
Mean
n
1 3 5 8 17
1.00 3.3 1.6 8.6 5.8 (1.8 to 9.9)
0 1 6 9 16
0 2 1 4 7
Mean 4.0 6.0 0.5 0.01 (−6.2 to 6.2)
Mean 9.0 11.0 23.3 17.0 (12.9 to 21.2)
a Mean BDI-II change scores for each CGI-Change category adjusted for the baseline BDI-II score using analysis of covariance (95% confidence intervals).
The Pearson product–moment correlations between CGI-Change and change scores or percent changes were 0.72 (0.53 to 0.84, P < 0.001). The regression of CGI-Change on the BDI-II change score produced an unstandardized coefficient of 9.7 (6.7 to 12.6).
4. Discussion
First of all, our study clearly showed that the BDI-II is highly sensitive to a wide spectrum of depression severity and can very efficiently differentiate between different grades of depression. An approximate 10-point difference existed between each category of depression severity, as defined by DSM-IV. The original developers of the scale suggested the following cut-score guideline for total scores of patients diagnosed with major depression (Beck et al., 1996):

• 0–13: minimal
• 14–19: mild
• 20–28: moderate
• 29–63: severe

Examination of LRs to distinguish between several grades of depression severity demonstrated that these cut scores were valid for the Japanese version as well (Table 3).
In the original sample of patients with major depression (n = 111) recruited at the University of Pennsylvania, the BDI-II means for each depression severity were 7.7 for non-depressed patients, 19.1 for people with mild depression, 27.4 for people with moderate depression, and 33.0 for people with severe depression (Beck et al., 1996). The same group of researchers reported the following means for a larger group of patients with major depression (n = 260): 18 for mild depression, 27 for moderate depression and 34 for severe depression (Steer et al., 2001). These figures are highly concordant with those obtained with our Japanese patients, because the average scores were 20.8, 30.5 and 42.2 for mild, moderate and severe depression, respectively, in our sample.
Secondly, the present study demonstrated that the BDI-II is highly sensitive to change in depression severity. To our knowledge, this is the first study to establish the scale’s sensitivity to change. An average patient needed to score 5.8 points less for the clinician to stop rating him/her as “no change”, and 17.0 points less for the clinician to rate him/her as “moderately or much improved.” When the baseline BDI-II score was low, these differences could be smaller. The linear regression analysis revealed that an approximate 10-point change corresponded to each 1-point increase in the CGI-Change scores. We therefore propose the following guideline to interpret changes in the BDI-II.

• 0–9: no or slight change, with 5 indicating a minimally important clinical difference (a smaller difference is enough when the baseline depression is mild, and a larger difference is required when the baseline depression is severe)
• 10–19: moderate change
• ≥20: large change

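The two guidelines discussed in this section, the cut scores of Beck et al. (1996) for total scores and the change-score bands proposed above, can be expressed as simple lookups. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical; applying the bands to the absolute value of the change score is an assumption, since no patient in this sample was rated as worse; and the fixed 5-point threshold ignores the dependence on baseline severity noted in the guideline.

```python
BDI_II_MAX_SCORE = 63  # 21 items, each scored 0 to 3


def severity_category(total_score: int) -> str:
    """Map a BDI-II total score to the cut-score categories of Beck et al. (1996)."""
    if not 0 <= total_score <= BDI_II_MAX_SCORE:
        raise ValueError(f"BDI-II total score out of range: {total_score}")
    if total_score <= 13:
        return "minimal"
    if total_score <= 19:
        return "mild"
    if total_score <= 28:
        return "moderate"
    return "severe"


def change_category(change_score: int) -> str:
    """Map a BDI-II change score to the change bands proposed in the text.

    Assumption: the bands are applied to the magnitude of the change,
    although the study only observed improvement, not worsening.
    """
    magnitude = abs(change_score)
    if magnitude >= 20:
        return "large change"
    if magnitude >= 10:
        return "moderate change"
    return "no or slight change"


def is_minimally_important(change_score: int, threshold: int = 5) -> bool:
    """True if the change reaches the 5-point minimally important difference.

    Simplification: a single threshold is used, whereas the guideline notes
    that less is needed at mild baselines and more at severe baselines.
    """
    return abs(change_score) >= threshold
```

For instance, a hypothetical fall from 30 to 18 points would move a patient from the "severe" to the "mild" band and count as a "moderate change" of 12 points.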
The issue of ascertaining the sensitivity to change of a scale and defining its clinically important change scores continues to pose theoretical and empirical challenges (Norman, 1989; Guyatt et al., 2002). Some studies claim to have established a scale’s sensitivity to change by demonstrating a statistically significant paired within-individual comparison of scale scores. This approach is valid only when the study design assures that there is a change in severity of the disease in question over time, for example, in the case of Alzheimer’s disease (Salmon et al., 1990). A more appropriate method to ascertain sensitivity to change requires concurrent validity with an external
measure of change, such as the CGI-Change score (Guyatt et al., 2002). We adopted this method and ascertained a high correlation between changes in the BDI-II scores and the CGI-Change scores. A more detailed analysis of the correspondence between these two scales suggested a minimally important clinical difference of 5, and a moderately important clinical difference of 10. The difference of 10 points in the BDI-II corresponded well with the between-individual difference for one severity difference according to DSM-IV.
Some possible shortcomings of the present study include the following. First, the mean difference in change scores between the no change group and the slight improvement group did not reach statistical significance, probably due to the small number of subjects with repeated measures. However, we should note that the point estimate was in accordance with the overall very high correlation between BDI-II change scores and CGI-Change scores. In addition, the findings with regard to change (within-individual difference) were in close correspondence with those regarding severity (between-individual difference). Second, strictly speaking, the current findings apply only to Japanese patients’ depression measured with the Japanese BDI-II. However, the fact that the guidelines for severity interpretation of the original BDI-II were confirmed (Table 3) and that average scores for mild, moderate and severe depression were largely comparable between U.S. samples and our Japanese sample supports the cultural independence of the BDI-II. It should be examined empirically whether the same conclusion holds for other countries. Lastly, some may question the lack of non-depressed samples in the study. This aspect of the design reflected the aim of the study to examine the BDI-II’s ability to measure depression severity and change among those with an established diagnosis of depression.
A different study design would be needed to assess the scale’s ability to screen for depression. In conclusion, besides the high reliability, content validity, construct validity and factorial validity already ascertained with the BDI-II both in the original English (Beck et al., 1996) and several other language versions (Al-Musawi, 2001; Kojima et al., 2002;
Coelho et al., 2002), the present study established the scale’s sensitivity to change and provided guidelines to interpret scores both cross-sectionally and longitudinally. We expect the BDI-II to become one of the standard second generation self-report instruments for depression worldwide.
References
Al-Musawi, N.M., 2001. Psychometric properties of the Beck Depression Inventory-II with university students in Bahrain. Journal of Personality Assessment 77, 568–579.
American Psychiatric Association, 1994. DSM-IV: Diagnostic and Statistical Manual of Mental Disorders. APA, Washington, DC.
Beck, A.T., Rush, A.J., Shaw, B.F., Emery, G., 1979. Cognitive Therapy of Depression. Guilford Press, New York.
Beck, A.T., Steer, R.A., Brown, G.K., 1996. BDI-II Manual. The Psychological Corporation, San Antonio, TX.
Beck, A.T., Ward, C.H., Mendelson, M., Mock, J., Erbaugh, J., 1961. An inventory for measuring depression. Archives of General Psychiatry 4, 561–571.
Coelho, R., Martins, A., Barros, H., 2002. Clinical profiles relating gender and depressive symptoms among adolescents ascertained by the Beck Depression Inventory II. European Psychiatry 17, 222–226.
Furukawa, T.A., Goldberg, D.P., Rabe-Hesketh, S., Ustun, T.B., 2001. Stratum-specific likelihood ratios of two versions of the General Health Questionnaire. Psychological Medicine 31, 519–529.
Guy, W., 1976. ECDEU Assessment Manual for Psychopharmacology. National Institute of Mental Health, U.S. Department of Health and Human Services, Rockville, MD.
Guyatt, G.H., Osoba, D., Wu, A.W., Wyrwich, K.W., Norman, G.R., 2002. Methods to explain the clinical significance of health status measures. Mayo Clinic Proceedings 77, 371–383.
Kojima, M., Furukawa, T.A., Takahashi, H., Kawai, M., Nagaya, T., Tokudome, S., 2002. Cross-cultural validation of the Beck Depression Inventory-II in Japan. Psychiatry Research 110, 291–299.
Norman, G.R., 1989. Issues in the use of change scores in randomized trials. Journal of Clinical Epidemiology 42, 1097–1105.
Salmon, D.P., Thal, L.J., Butters, N., Heindel, W.C., 1990. Longitudinal evaluation of dementia of the Alzheimer type: a comparison of 3 standardized mental status examinations. Neurology 40, 1225–1230.
SPSS Inc., 2001. SPSS for Windows Version 11.0. SPSS Inc., Chicago.
Steer, R.A., Brown, G.K., Beck, A.T., Sanderson, W.C., 2001. Mean Beck Depression Inventory-II scores by severity of major depressive episode. Psychological Reports 88 (3 Pt 2), 1075–1076.