BEHAVIOR THERAPY
33, 253-269, 2002
Assessing Clinical Significance: Application to the Beck Depression Inventory

LESLIE B. SEGGAR
MICHAEL J. LAMBERT
Brigham Young University

NATHAN B. HANSEN
Yale University School of Medicine

Traditionally, psychotherapy outcome research has been analyzed using statistical tests of significance. Inherent limitations in this approach, however, have contributed to the assessment of clinical significance being advocated as a method by which to evaluate change. In this study, Tingey, Lambert, Burlingame, and Hansen's (1996a, 1996b) extensions and clarifications of Jacobson, Follette, and Revenstorf's (1984) method for evaluating clinically significant change were applied to the Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961; Beck, Rush, Shaw, & Emery, 1979). A three-sample normative continuum (asymptomatic community, community, and clinically symptomatic) was formulated from the community and the existing literature. The distinctness of the normative samples was assessed by using three statistical procedures, including a t test, a d test, and a test of bioequivalence; and reliable change indices and cutoff points were calculated. The cutoff scores that were developed may prove useful in research and clinical practice.
While establishing statistically significant change continues to form the foundation of psychotherapy outcome research, inherent limitations in this approach have contributed to increased attention to the concept of clinical significance (Kendall, 1999). Clinical significance refers to the practical meaning or value of change resulting from treatment (Jacobson, Follette, & Revenstorf, 1984; Kazdin, 1977; Kendall & Grove, 1988; Wolf, 1978). An analysis of this dimension provides a needed additional perspective in therapy outcome research because, although tests of statistical significance provide necessary information regarding whether change has occurred, they do not address its meaning and relevance.

The authors wish to acknowledge the collaborative work of Kellie M. Condon, Curtis T. Grundy, and Elizabeth M. Grundy in the preparation of this manuscript. Address correspondence to Michael J. Lambert, Department of Psychology, 272 TLRB, Brigham Young University, Provo, UT 84602; e-mail: [email protected].

Copyright 2002 by Association for Advancement of Behavior Therapy. All rights for reproduction in any form reserved.
A variety of strategies for assessing the clinical significance of change have been put forth (Jacobson et al., 1984; Kazdin, 1977; Kendall, Marrs-Garcia, Nath, & Sheldrick, 1999; Speer, 1992; Wolf, 1978). The method proposed by Jacobson et al. is probably the best developed and most widely utilized, and is based on the development of norms for social comparisons by using traditional measures of symptomatology. More specifically, the assumption is held that a social standard (defined as the measured performance of a selected sample of society that can be based upon scores obtained from selected instruments) can provide a meaningful method by which to evaluate client change through comparing a client's performance to such a standard. To this foundation is added the notion of both a functional and dysfunctional sample, which allows for the comparison of a subject's performance pre- and posttreatment. Clinical significance methodology has had a major impact on the field of psychotherapy outcome research, as it allows for analysis in large-N, between-group designs, as well as with individual subjects. More and more studies of both psychotherapy efficacy and effectiveness are reporting outcomes using clinical significance (see Hansen, Lambert, & Forman, in press). While the work of Jacobson and colleagues has greatly contributed to development in the use of clinical significance in therapy outcome studies, as a recent special section in the Journal of Consulting and Clinical Psychology (Kendall, 1999) illustrates, there is still debate over how to operationalize the concept of clinical significance. In a summary of the special section, Kazdin (1999) suggested that there is still need for clarification in methodology determining "the meaning and interpretation of measures of clinical significance, the importance of relating assessment of clinical significance to the goals of therapy, and the construct(s) that clinical significance reflects" (p. 332).
For instance, while the focus of clinical significance has been primarily on symptom reduction (Jacobson, Roberts, Berns, & McGlinchey, 1999; Kendall et al., 1999), there are certainly other areas of change that can and should be considered, such as quality of life (Gladis, Gosch, Dishuk, & Crits-Christoph, 1999) and the ability to cope (Kazdin, 1999). Also, the question of whose perspective should define meaningful change, such as that of clients or of those close to clients (Foster & Mash, 1999; Gladis et al., 1999), should be considered. Further, practical limitations exist in the actual application of Jacobson et al.'s (1984) method of assessing the clinical significance of change. Tingey, Lambert, Burlingame, and Hansen (1996a) specifically identified three problems relating to Jacobson's operationalization of the social standard. Namely, the vague definition provided for the functional group creates difficulties, the methodology is restricted by utilizing only two samples, and no method is provided for determining the distinctness of samples. In an attempt to address the first of these limitations, Tingey (1989) operationalized and stringently defined the concept of a functional group, suggesting that this be comprised of individuals who are relatively free from psychopathology, that is, "asymptomatic." With respect to the second shortcoming, Tingey followed the work of Kendall and Grove (1988) by proposing an expanded continuum of dysfunction, with four normative samples (asymptomatic, mildly distressed, moderately distressed, and severely distressed). With this delineation, additional information is provided, the criteria upon which clinical change is measured are altered, and the possibility of quantifying differing degrees of change is created. In sum, a continuum would seem to provide an increasingly specific and meaningful means by which to assess clinical change (Tingey et al., 1996a). Regarding the final limitation relating to the lack of a clear operationalization of the construct of "distinctness," Tingey proposed a two-part method to assess this, which includes a t test for independence of sample means and a d test (effect size estimate) for nonoverlap of sample distributions. It would appear that Tingey et al.'s (1996a, 1996b) extensions and clarifications of Jacobson et al.'s (1984) method for evaluating clinical significance offer several benefits. However, at present, there are few measures that have identifiable distributions along a continuum of severity (Lambert & Hill, 1994). The present study was conceived as an effort to contribute to development in this area of clinical significance by selecting a widely used psychotherapy outcome measure, applying this extended method, and formulating normative data along a continuum of disturbance. The overarching aim of this study was to provide researchers and clinicians with a more specific standard upon which to evaluate change.
The results of this study might also assist clinical research and practice by (a) making therapy outcome research more applicable to clinical practice (Barlow, 1981; Bergin & Strupp, 1972; Lambert & Hill, 1994; Nathan, Stuart, & Dolan, 2000), (b) providing a more consistent form for reporting psychotherapy outcome results across studies (Lambert & Hill), and (c) increasing accountability and documentation of treatment effects for health care providers, policymakers, and consumers (Ottenbacher, Johnson, & Hojem, 1988). The Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961; Beck, Rush, Shaw, & Emery, 1979) was selected and considered appropriately suited to the intent of this project for a variety of reasons. First, it possesses respectable psychometric properties that suggest its value as an instrument aimed at the assessment of depressive symptomatology (Beck, Steer, & Garbin, 1988). In addition, its status as the most frequently used self-report method of assessing the severity of depression (Froyd, Lambert, & Froyd, 1996; Shaw, Vallis, & McCabe, 1985) indicates that work in the domain of clinical significance involving this instrument holds much relevance and value. In fact, preliminary work in this area has previously been conducted (Nietzel, Russell, Hemmings, & Gretter, 1987; Robinson, Berman, & Neimeyer, 1990), and it was the intent of this project to further contribute by extending these past efforts.
Method
Subjects

The normative continuum for the BDI was comprised of three groups: asymptomatic community, community, and clinically symptomatic.1

Asymptomatic community sample. Individuals who were relatively free from symptomatology or psychopathology formed the asymptomatic community sample. Although the literature reveals the use of nonpatient populations in research using the BDI, no studies have utilized a substantially screened asymptomatic sample. Saunders, Howard, and Newman (1988) estimate that approximately 20% of a "functional" population have clinically notable levels of psychopathology. Due to these factors, a well-defined asymptomatic group was collected. In operationalizing the asymptomatic community group, it was necessary to delineate what symptoms should be absent. Additionally, it was necessary to define the screening procedure that would be used to assure the absence of these variables. Since the BDI focuses on depression, it might seem reasonable for the asymptomatic sample to consist of individuals who are free from this cluster of symptoms. However, a more conservative approach was taken by selecting a sample that was relatively free from a broad range of psychopathology and severe stress. The intent of the inclusion criteria for the asymptomatic sample was to provide a broad assessment of an individual's psychological functioning. This screening was achieved by a three-stage process that included nomination by a licensed psychologist (Stage 1), a telephone interview (Stage 2), and a screening battery of psychological tests (Stage 3). Thirty licensed psychologists participated in the first stage of this process. They were given a standardized description of the study and requirements for nominating individuals for the asymptomatic community group.
Specifically, they were asked to anonymously nominate 5 local individuals over the age of 18 whom they considered to be relatively free from psychological and physical disorders, and who had not attended therapy, used psychotropic medications, or experienced a major life crisis (e.g., death in family, divorce, major job change) within the past year. Additionally, it was requested that the individuals not be pregnant, have no history of psychiatric hospitalizations, or be a mental health professional familiar with psychological tests. This resulted in the nomination of 199 individuals. Overall, this stage of the selection process was based on the clinical judgment of the referring psychologist.

1 Based on the work of Tingey (1989), it was first hypothesized that the continuum would be comprised of four samples of increasing symptomatology: asymptomatic, mildly symptomatic, moderately symptomatic, and severely symptomatic. Studies that used outpatient and inpatient populations would form the latter two groups, respectively. However, the differentiation of the moderately symptomatic and severely symptomatic groups was not successful according to the statistical criteria for distinctness. Due to these results, these two were collapsed into a single group, termed clinically symptomatic, and thus, a three-sample normative continuum was formed. It was also decided to rename the other two groups, as the terms asymptomatic and mildly symptomatic do not seem to adequately describe the sample compositions. The asymptomatic group was changed to the asymptomatic community sample, and the mildly symptomatic group was changed to the community sample.

The second stage of the screening process involved contacting the nominated individuals by phone and performing a brief screening interview, with the intent of obtaining specific information regarding the history and functioning of the individual in an effort to assure consistency with the aforementioned nomination criteria. Out of the initial group, there were 49 who declined to participate, 28 who failed the screening, and 11 who could not be contacted. One hundred and eleven individuals were willing to participate in the study and passed the phone screening. These 111 individuals were administered a selected screening battery of tests and the BDI. The screening battery consisted of the following objective, self-report questionnaires: the Symptom Checklist 90-Revised (SCL-90-R; Derogatis, 1983) and the Inventory of Interpersonal Problems (IIP; Horowitz, Rosenberg, Baer, Ureno, & Villasenor, 1988). These particular instruments were chosen because of their ability to provide, with limited subject time investment, indices of psychological symptomatology and interpersonal functioning. This contributed to an overall assessment of the functioning of the individual in both intra- and interpersonal domains. Those individuals who scored beyond the defined cutoff point on one of these measures were excluded from the asymptomatic group. The cutoff points used were formulated based on normative data in the literature: SCL-90-R, General Severity Index < 0.40 (Tingey, 1989); and IIP, Total Score < 1.189 (Horowitz, Rosenberg, & Bartholomew, 1988). Eighty-one subjects passed the screening battery and were included in the asymptomatic community sample.
This group consisted of 40 males and 41 females who were predominantly Caucasian (93.8%), married (87.7%), and had an average of approximately 4 children. The ages of the subjects ranged from 18 to 89 years old, with an average age of 42.33. In general, they were highly educated (71.6% reported having a college degree or higher) and of moderate to high socioeconomic status. Community and clinically symptomatic samples. Data for the community and clinically symptomatic group norms were gathered from the existing literature on the BDI. References were obtained by performing an extensive computer search utilizing PsychLit (1/74-3/93), as well as examining bibliographical citations from review articles for relevant studies using the BDI. This resulted in over 9,000 articles being reviewed. These articles formed the database from which these norms were obtained. In terms of the inclusion or exclusion of studies from the normative samples, several criteria were related and applied to both normative groups, with additional criteria being specific to each. A delineation of each normative group and the criteria specific to it will be presented first, followed by the criteria common to them. The community sample was defined as those subjects who were tested
from the general community (and were therefore considered to be "nonpatients") and who were unscreened for symptomatology or psychopathology. Nonpatient status refers to the subject not currently participating in either outpatient or inpatient therapy. Since the subjects were not in treatment, it was inferred that their psychological state did not interfere with their functioning. However, as noted above, and in contrast to the asymptomatic community group, these subjects were unscreened for difficulties and thus this statement is made with caution, due to Saunders et al.'s (1988) estimate that approximately 20% of the "functional" population have clinically notable levels of pathology. Nevertheless, within this normative group, if such symptomatology existed, it would likely only be manifest at a mild level. There were 147 studies published between 1974 and 1992 that met the requirements and comprised the community sample. The clinically symptomatic sample was defined as subjects who were experiencing enough psychopathological distress to seek or to be required to participate in treatment, whether outpatient or inpatient. In addition, it was required that these subjects were currently manifesting a major depressive illness, and that the studies used a specifically stated formal/standardized diagnostic system to establish this. This diagnostic grouping was chosen as it was viewed as consisting of subjects who were symptomatic in terms of the variable measured by the BDI, that is, depression. While other diagnoses might manifest depressive symptomatology, this group seemed to best represent this area of dysfunction. Also, different diagnostic groups manifest different scores on the BDI (Beck et al., 1988; Schnurr, Hoaken, & Jarrett, 1976; Steer, Beck, Brown, & Berchick, 1987; Steer, Beck, & Garrison, 1986; Steer, Beck, Riskind, & Brown, 1986).
It was thought to be important to limit the sample to the type of subject delineated in an effort to provide normative data that were more pure, as opposed to a mixing of a variety of different diagnoses/disorders. There were 60 studies published between 1979 and 1992 that met the requirements and comprised the clinically symptomatic sample.2

The following exclusion criteria were utilized in selecting studies for both the community and the clinically symptomatic samples. First, many studies used BDI scores in some manner to select subjects, resulting in truncated distributions of scores. Thus, studies using scores on the BDI as inclusion criteria were excluded from the present sample. Second, means and standard deviations were required to derive the data necessary for the normative continuum, and studies that did not present this information were therefore excluded. Third, studies whose sample consisted of fewer than 10 individuals were excluded. Fourth, it was required that the BDI administered was the English-language version, and that the results were reported in this language. Fifth, a problem encountered in the creation of both normative samples was the issue of multiple publications with the same or part of the same subject pool. In these instances, a determination was made to use the study that contained the greatest number of the subjects that appeared in the various articles and exclude those based on subgroups of this. If there was more than one article with the same number of subjects, the article that was first published was utilized here, and the other(s) excluded.3

2 References for the community and clinically symptomatic studies are available upon request from the second author.
Distinctness of the BDI Normative Continuum

The distinctness of the three samples that comprise the normative continuum was assessed by using three statistical procedures, two of which are suggested by Tingey (1989), including a t test and a d test, or test of effect size. A third test, equivalence testing, was added to address distributional issues in the current data samples. Each of these procedures served to analyze a different aspect of the "distinctness" of the three samples. The t test and d test are alike in that they each assess whether the sample means differ reliably. They are distinct, however, in that the t test takes into account the sample size, while the d test utilizes only the standard deviation to evaluate the difference between means. The d test standardizes the raw effect size by expressing the measurement unit of the dependent variable in standard deviation units. By examining the effect size in standard units, the d test can assess overlap between distributions. Cohen (1977) calculated the percent of nonoverlap, U, for all levels of d. His calculations were based on two assumptions: equal variances and normal distributions. Even if these two assumptions are not perfectly met, the d test is a quantifiable guideline for determining if samples are separate enough so as to be considered distinct. Cohen (1977) discussed the commonly accepted idea of ds of 0.2, 0.5, and 0.8 being conceptualized as small, medium, and large effect sizes. For this study, a d of 0.5 was selected, since the associated area of nonoverlap, U, equals 33.0 percent.
By definition, a d of 0.5 also indicates that the means of the two samples differed by one half a standard deviation. Therefore, for this study, if two sample distributions had a d score of 0.5 or greater, they were considered distinct. A further test was utilized as a safeguard because of unequal sample sizes, nonequivalent variances, and skewed distributions (particularly with the asymptomatic community sample). Using bioequivalence methodology (Hauck & Anderson, 1992; Stegner, Bostrom, & Greenfield, 1996), it can be determined whether two sample means are in fact similar enough to represent the same population. This is done by computing a confidence interval, with the accepted standard being a range from 20% below to 20% above the population mean, and observing whether sample means fall within this range. For the current study, the community sample mean will serve as a representation of the population mean, and the confidence interval for tests of equivalence will be based on this value. If the means of the asymptomatic community and the clinically symptomatic sample fall outside of this confidence interval, the samples can be considered nonequivalent. It was hypothesized that the groups would fall along a continuum of severity of distress; therefore, comparisons were only made between adjacent samples. One-tailed t tests were performed due to the implied directionality of the continuum, and an alpha of 0.01 was selected for the t tests to correct for increased probability of Type I error associated with unequal variance and sample size. Also, where appropriate, calculations took into account unequal variances and unequal numbers of subjects. Normative groups were regarded as "distinct" if the following criteria were met: the t test was significant at the alpha level of 0.01 (Tingey et al., 1996a), the d test revealed a moderate effect size of at least 0.5 (Cohen, 1977; Tingey et al.), and equivalence testing revealed nonequivalent sample means.

3 Since the community and clinically symptomatic samples were each comprised of multiple studies, some of which had multiple means and standard deviations, the means and standard deviations for these samples were derived according to the following procedure: If the individual study reported more than one mean and standard deviation, then an averaged mean and standard deviation for the study was calculated. The means were averaged while taking into account the sample size. The standard deviations were averaged by using the following formula:

\[ SD = \sqrt{\frac{CSS}{N - 1}} \]

where SD = the standard deviation for the article, CSS = the corrected sum of squares, and N = the sample size for the article. Then, after each study had a single mean and standard deviation, a single mean and standard deviation was calculated for the sample being examined. The means of the studies were averaged, while taking into account each study's sample size. The standard deviations were then averaged using the above formula, but where SD = the standard deviation for the group of studies, CSS = the corrected sum of squares, and N = the sample size for the group of studies.
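The averaging procedure described in Footnote 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the footnote gives only SD = sqrt(CSS / (N - 1)), so the way CSS is rebuilt from subgroup summaries here (within-group sums of squares plus each group's squared deviation from the pooled mean) is an assumption, and the function name `pool_mean_sd` and the example subgroups are hypothetical.

```python
import math


def pool_mean_sd(groups):
    """Combine (n, mean, sd) summaries into a single mean and SD.

    Assumption (not spelled out in the source): the corrected sum of
    squares (CSS) is rebuilt as the within-group sum of squares plus
    each group's weighted squared deviation from the pooled mean.
    """
    total_n = sum(n for n, _, _ in groups)
    # Sample-size-weighted mean, as the footnote describes.
    pooled_mean = sum(n * m for n, m, _ in groups) / total_n
    css = sum((n - 1) * sd ** 2 + n * (m - pooled_mean) ** 2
              for n, m, sd in groups)
    # SD = sqrt(CSS / (N - 1)), the formula given in Footnote 3.
    return pooled_mean, math.sqrt(css / (total_n - 1))


# Hypothetical subgroup summaries: the scores [1, 2, 3] and [4, 5, 6, 7].
subgroups = [(3, 2.0, 1.0), (4, 5.5, math.sqrt(5 / 3))]
pooled_mean, pooled_sd = pool_mean_sd(subgroups)
print(pooled_mean, pooled_sd)
```

With these hypothetical subgroups, the pooled values equal the mean and standard deviation of the seven raw scores taken together, which is the property the procedure relies on.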
Analyzing Clinically Significant Change

In order for a subject to experience clinically significant change, the change that the subject makes from pretreatment to posttreatment must cross into the adjacent sample along the continuum; further, the exhibited change, as based on the Reliable Change Index (RCI), must be reliable (Jacobson & Truax, 1991). Cutoff points and reliable change indices are required to demonstrate whether or not a subject has met the above two requirements and therefore experienced a clinically significant change.

Cutoff points. When there is considerable overlap between two adjacent samples, as is possible with a three-sample normative continuum, a cutoff point determines if a subject's posttreatment score has changed enough to move it out of the original pretreatment distribution and into another distribution. In this study, cutoffs were calculated between adjacent samples along the continuum (e.g., between the asymptomatic community sample and the community sample). The formula for the cutoff for unequal variances is as follows:

\[ c = \frac{(SD_1)(\text{mean}_2) + (SD_2)(\text{mean}_1)}{SD_1 + SD_2} \tag{1} \]
where SD1 = the standard deviation of the first sample, SD2 = the standard deviation of the second sample, mean1 = the mean of the first sample, and mean2 = the mean of the second sample.

Reliable Change Index (RCI). The RCI was developed to assess the statistical reliability of the change a subject exhibits when he or she has crossed a cutoff point. If the RCI is larger than 1.96, then there is a 95% probability level that reliable change has occurred (Jacobson & Truax, 1991). In order to determine whether a test score change is due to chance variation or to an actual change in the underlying characteristic, the following formula is used:
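\[ \mathrm{RCI} = \frac{(\text{pretreatment}) - (\text{posttreatment})}{S_{\text{diff}}}, \qquad \text{with reliable change when } (\text{pretreatment}) - (\text{posttreatment}) > S_{\text{diff}} \times 1.96 \tag{2} \]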
where pretreatment = a subject's pretest score, posttreatment = the same subject's posttest score, and Sdiff = the standard error of difference between the two test scores. The Sdiff describes the spread of the distribution of change scores (posttreatment minus pretreatment) that would be expected if no actual change had occurred. Sdiff is defined as follows:
\[ S_{\text{diff}} = \sqrt{2(SE)^2} \tag{3} \]
where SE = the standard error of measurement. The SE is defined as the following:
\[ SE = SD\sqrt{1 - r_{xx}} \tag{4} \]
where SD = the standard deviation of the pretreatment sample and rxx = the internal consistency of the measure (Martinovich, Saunders, & Howard, 1996; Tingey et al., 1996b). For this study, an internal consistency value of 0.86 was used (see below).
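The chain from the standard error of measurement to the RCI can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' procedure: the function names are hypothetical, the client scores (28 at pretreatment, 12 at posttreatment) are invented, and the use of the clinically symptomatic standard deviation (9.99) as the pretreatment SD is an assumption made only for the example.

```python
import math


def standard_error(sd, rxx):
    """SE = SD * sqrt(1 - rxx), the standard error of measurement."""
    return sd * math.sqrt(1.0 - rxx)


def s_diff(sd, rxx):
    """Sdiff = sqrt(2 * SE^2), the standard error of the difference."""
    return math.sqrt(2.0 * standard_error(sd, rxx) ** 2)


def rci(pre, post, sd, rxx=0.86):
    """RCI = (pre - post) / Sdiff; 0.86 is the internal consistency used in the study."""
    return (pre - post) / s_diff(sd, rxx)


# Hypothetical client; the SD of 9.99 is assumed to be the pretreatment SD.
value = rci(pre=28, post=12, sd=9.99)
is_reliable = value > 1.96  # 95% criterion stated in the text
print(round(value, 2), is_reliable)
```

Expressing the criterion as an RCI above 1.96 is equivalent to requiring a raw drop of more than 1.96 times Sdiff BDI points, which may be the more convenient form in clinical use.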
Internal Consistency for the BDI

An internal consistency estimate for the BDI was necessary for the calculation of Sdiff. From the meta-analytic study performed by Beck et al. (1988), an internal consistency figure (coefficient alpha) of 0.86 was computed. Rather than selecting a single coefficient from those listed, the decision was made to use a mean figure (weighted by sample size), as this would likely provide a representative estimate. This figure is the mean value of 21 studies and includes all of the studies listed by Beck et al. (1988) that include a coefficient alpha figure, are based on the long form of the BDI only, and were computed with adult subjects. The combined N of these studies included 3,043 subjects.4

4 The internal consistency (coefficient alpha) estimates upon which this median was based, and the references of the studies from which these were gathered, are available upon request from the second author.
TABLE 1
DESCRIPTIVE DATA FOR THE THREE NORMATIVE SAMPLES

Sample                         N     Mean     SD
Asymptomatic community        81     2.88   2.44
Community                 28,905     7.22   6.33
Clinically symptomatic     3,339    25.45   9.99
Results

Analysis of the Normative Continuum

The descriptive data for the three normative samples are presented in Table 1. The proposed three-sample continuum was analyzed to assess the distinctness between adjacent groups by using a t test, d test, and equivalence testing. Table 2 presents the results of the t test and the d test analyses. The results indicate that these groups met the required criteria for distinctness. The results of the t tests show that the groups are all distinct, since all of the resulting p values were less than 0.01. The results of the d test analyses indicate that this criterion for distinctness was met for every comparison, since all comparisons are greater than the required level of 0.50 for concluding distinctness. Finally, using equivalency testing, a confidence interval can be computed that extends 20% above and below the community sample mean. This is done by multiplying the mean, 7.22, by 0.2, which equals 1.44. Adding 1.44 to and subtracting it from the mean of 7.22 results in a confidence interval ranging from 5.78 to 8.66. The means of both the asymptomatic community sample and the clinically symptomatic sample fall outside of this range, indicating that these samples cannot be considered to be equivalent and are unlikely to reflect the same population. Even if the criterion is changed to a range of 20% of the distribution rather than 20% of the mean (a z score of 0.53), the resulting range is 3.67 to 10.57, which excludes both the asymptomatic community and the clinically symptomatic samples. Therefore, the asymptomatic community, community, and clinically symptomatic samples form a continuum consisting of statistically distinct normative groups.
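Two of the distinctness checks lend themselves to a short numerical sketch in Python: the percent of nonoverlap associated with a given d, and the 20% equivalence range around the community mean. The sketch assumes normal distributions with equal variances (the assumptions attributed to Cohen, 1977) and uses Cohen's U1 expressed through the normal cumulative distribution; the function names are hypothetical, and only the sample means reported in Table 1 are taken from the study.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cohen_u1(d):
    """Proportion of nonoverlap (Cohen's U1) for effect size d.

    Assumes two normal distributions with equal variances.
    """
    p = normal_cdf(abs(d) / 2.0)
    return (2.0 * p - 1.0) / p


def equivalence_range(reference_mean, fraction=0.2):
    """Range from `fraction` below to `fraction` above the reference mean."""
    margin = fraction * reference_mean
    return reference_mean - margin, reference_mean + margin


nonoverlap_pct = round(cohen_u1(0.5) * 100, 1)  # 33.0, as cited for d = 0.5
low, high = equivalence_range(7.22)             # community sample mean
# Means of the asymptomatic community and clinically symptomatic samples.
outside = [m for m in (2.88, 25.45) if not (low <= m <= high)]
print(nonoverlap_pct, round(low, 2), round(high, 2), outside)
```

Both adjacent sample means fall outside the 5.78 to 8.66 range, matching the nonequivalence conclusion reported above.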
TABLE 2
DISTINCTNESS OF THE THREE-SAMPLE NORMATIVE CONTINUUM

Sample                                            df         t        d
Asymptomatic community vs. community           82.18     6.17*    .99**
Community vs. clinically symptomatic        3,653.90   146.65*   2.18**

* p < 0.01. ** Significance for d is greater than a cutoff value of 0.5.
TABLE 3
RELIABLE CHANGE INDICES AND CUTOFF POINTS FOR THE THREE-SAMPLE NORMATIVE CONTINUUM

Sample                                        RCI   Cutoff
Asymptomatic community vs. community         4.55     4.09
Community vs. clinically symptomatic         8.46    14.29
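The cutoff points in Table 3 can be checked against the cutoff formula for unequal variances (Equation 1) using the means and standard deviations in Table 1. The following Python sketch does this; the function name `cutoff` is hypothetical, and the numeric inputs are the values reported in Table 1.

```python
def cutoff(mean1, sd1, mean2, sd2):
    """Cutoff between two adjacent samples with unequal variances (Equation 1)."""
    return (sd1 * mean2 + sd2 * mean1) / (sd1 + sd2)


# Means and standard deviations as reported in Table 1.
asymptomatic = (2.88, 2.44)
community = (7.22, 6.33)
clinical = (25.45, 9.99)

c_low = cutoff(*asymptomatic, *community)
c_high = cutoff(*community, *clinical)
print(round(c_low, 2), round(c_high, 2))  # 4.09 14.29
```

Because each mean is weighted by the other sample's standard deviation, the cutoff sits closer to the mean of the sample with the smaller spread.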
Analysis of Clinically Significant Change

Cutoff points between each of the three normative sample distributions were used to determine where a score is statistically more likely to belong to one distribution than to the adjacent distribution. Reliable versus random change on the BDI was determined by calculating an RCI (Christensen & Mendoza, 1986; Jacobson & Revenstorf, 1988) for each adjacent pair of normative samples in the continuum. The resulting cutoff points and RCIs are displayed in Table 3. The calculations for each RCI and cutoff point are based on the data for each of the adjacent samples. Deriving these values for each pair of adjacent samples provides a more precise and conservative estimate of reliable change than computing a single RCI to be used for the entire continuum.
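As an illustration of how the two requirements might be applied to an individual client, the following Python sketch combines a cutoff point and an RCI from Table 3. It rests on two assumptions that the text does not state outright: the tabled RCI is read as the minimum number of BDI points of change needed for the change to count as reliable, and a posttreatment score at or below the cutoff is treated as having crossed it. The function name, the labels it returns, and the client scores are all hypothetical.

```python
def classify_change(pre, post, cutoff, rci_points):
    """Classify pre-to-post BDI change against a cutoff and an RCI.

    Assumptions: `rci_points` is the minimum raw-score drop treated as
    reliable, and a posttreatment score at or below `cutoff` counts as
    having moved into the adjacent, less symptomatic sample.
    """
    reliable = (pre - post) >= rci_points
    crossed = pre > cutoff and post <= cutoff
    if reliable and crossed:
        return "clinically significant change"
    if reliable:
        return "reliable change only"
    return "no reliable change"


# Table 3 values for the community vs. clinically symptomatic comparison.
CUTOFF, RCI_POINTS = 14.29, 8.46

# Hypothetical clients.
print(classify_change(28, 12, CUTOFF, RCI_POINTS))
print(classify_change(30, 20, CUTOFF, RCI_POINTS))
print(classify_change(20, 15, CUTOFF, RCI_POINTS))
```

Keeping the two tests separate makes it possible to report clients who improved reliably without yet leaving the clinically symptomatic range.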
Discussion

The purpose of the present research was to identify cutoff scores and RCIs for the BDI. It is hoped that these values will become a standard across psychotherapy outcome studies for the purpose of calculating the proportion of patients who achieve significant, meaningful change following treatment. The scores developed in the present study can also be used by individual clinicians and health systems to assist in treatment decisions such as when to terminate treatment, step down to less intensive treatment, or modify treatment plans (cf. Clement, 1996; Haaga, 2000; Lambert & Brown, 1996). A significant issue that relates to the results obtained is the effect of having a limited range of scores, which was found in this project on the BDI. That is, while its possible range is from zero to 63, this is not reflected in the means, standard deviations, and cutoff points for the normative groups. Although the three normative samples met the proposed criteria of distinctness, the reduction in differences between means for adjacent normative samples, combined with a reduction in variance across samples, provides evidence for a floor effect on the BDI. This floor effect would appear to have had an impact on the normative continuum derived. More specifically, without it there would likely have been a larger range manifested by the asymptomatic group, and ultimately a more extended continuum through the range of BDI scores. Raising a related concern, it is unclear what effect nonnormal distributions and unequal variances have on the computation of clinical significance values (Martinovich et al., 1996), as computing cutoff points using standard methods
may miss the actual midpoint in skewed samples. This problem is likely magnified at the tails of a distribution. Therefore, the more skewed samples are, and the further apart adjacent means are, the less reliable computed cutoffs likely become. This is a concern in the present study, particularly in the computation of the cutoff between the asymptomatic community and community samples. While these samples meet criteria for distinctness, there is a large amount of overlap between them, suggesting that the positive skew in the asymptomatic community sample has less impact on the computed cutoff. Still, there is potential for error in this value given the distribution properties of the current samples.
Several benefits of using a normative continuum have been delineated. It is unfortunate, however, that in this study, while the final normative continuum is composed of three normative groups, a continuum composed of further groups could not be formed. A major problem in this was the initial failure to find distinctness between the moderately and severely symptomatic normative groups (see Footnote 1). Although a three-sample normative continuum was formed and would seem to provide a valuable method to assess change, in terms of practical application, this project was not able to improve significantly on Jacobson et al.'s (1984) functional/dysfunctional dichotomy. Given the initial necessity of combining the moderately and severely symptomatic groups, and the characteristics and placement of the asymptomatic normative group, the effective yield is a single cutoff point between the clinically symptomatic and community groups. Nevertheless, this simple cutoff, as well as the RCI, may prove useful in research and in clinical decision making.
In this context, the asymptomatic community group deserves further discussion.
The cutoff point of the group would appear to make moving into this group rather difficult; overall, it represents a rather stringent criterion for improvement. However, despite the inherent difficulty in achieving such status, this is a useful and valuable addition of data. An appropriate way to view it might be that it is not a reasonable criterion for all clients, nor a requirement that one become asymptomatic as a result of treatment, but rather that, with the data provided here, such classification becomes possible. In conceiving of the asymptomatic group in this manner, one is left with an addition to the functional/dysfunctional dichotomy, albeit a stringent and not always practical one.
The methodology utilized in this study allowed for the development of a sample of subjects who are relatively asymptomatic and who report experiencing a high degree of life satisfaction/well-being. Although the asymptomatic community sample was developed according to strict criteria, it was based on a very select cross-section of society, and this homogeneity is a limitation. It may not be similar in makeup to the patients who often form norm groups and seek treatment. Users of these data should be aware of this demographic issue when generalizing the results of the asymptomatic community sample. Gathering additional asymptomatic samples, with diverse demographic backgrounds, and comparing them to the asymptomatic sample in this study would prove beneficial in providing more complete and accurate norms for the BDI.
One problem inherent in the way the data for the community and clinically symptomatic groups were gathered (i.e., review of the literature) is that much of the information about the subjects and related variables was not always clearly described in the articles from which the normative data were compiled. Such variables are important, however, as they might relate to the BDI scores obtained for normative groups, the subsequent distinctness found between them, and ultimately the cutoff points and RCIs. With additional information provided in the articles, there might have been an opportunity to exercise greater selectivity, and additional inclusion/exclusion criteria might have been formed if these were seen as appropriate. This, in turn, may have contributed to increased standardization and confidence in the cutoff scores. Nevertheless, the cutoff between the community samples and the patient samples is based on large and diverse samples and is likely to be far more reliable than cutoffs based on single samples, thus providing a better choice for researchers than those developed from a single study (e.g., Lacks & Powlishta, 1989).
This lack of consistent knowledge with regard to subjects extends to the area of basic demographic variables as well. It has been stated that in terms of the BDI, the pattern with regard to demographic effects is inconsistent and equivocal, and therefore it is important for a researcher to examine such relationships within his or her sample (Beck et al., 1988). This was done as much as was possible with the information available in this project. However, the nature of such examination was significantly limited by the information available or the manner in which this was presented by the studies used.
Specifically, the majority of the studies used did not present means and standard deviations separately on the basis of demographic variables. From the analyses that were conducted, it appeared most appropriate to construct a normative continuum without differentiation on demographic variables, such as gender or ethnicity (see Footnote 5).
Footnote 5. In an effort to understand the possible effects of demographic variables on the scores of the asymptomatic subjects, the potential for covariation was assessed. A general linear analysis was utilized to test the possibility that a subject's score would be related to one or more of their demographic variables, and the following demographic variables were examined: gender, age, marital status, educational status, and income. Ethnic group was not analyzed, since the asymptomatic subjects were predominantly Caucasian. The results indicated that the demographic variables analyzed do not significantly vary in relation to the BDI score, F(8, 47) = 0.64, p > 0.05. As noted, the nature of the data provided by the articles used in the mildly and symptomatic samples precluded the performance of most types of demographic analyses. That which was conducted consisted solely of an exploration of the effect of gender in the community sample, by means of a t test between the mean BDI scores of males and females. The data utilized for this came from those articles that had provided means and standard deviations for males and females separately. This procedure was based on 5,269 males and 9,419 females. The mean BDI score of males (M = 6.54) was found to be significantly lower than that of the females (M = 7.35), t(14,686) = 7.67, p < 0.05. However, this difference in means equals a score of 0.81, which, while significant, does not appear to represent a marked or necessarily meaningful difference in actual scores obtained by males and females.
The normative continuum has ramifications for other cutting scores that have been proposed and are used in the determination of the severity of depression. A number of guidelines have been offered for such cutting points (Beck, 1967; Beck & Beamesderfer, 1974; Beck & Steer, 1987; Beck et al., 1988; Burns & Beck, 1978; Oliver & Simmons, 1984; Schwab, Bialow, Clemmons, Martin, & Holzer, 1967; Shaw et al., 1985), and while there is notable similarity among these, there is also some variation. Although some empirical data have been provided for the derivation of one set of cutting scores (Beck, 1967), and many of the guidelines appear to be slight variations of these, with some coming from a similar source (i.e., Beck & Beamesderfer; Burns & Beck), there does not appear to be a consistently articulated basis by which such guidelines were formulated. This raises the possibility that these scores have, to some extent, been based on theoretical reasoning and are used for clinical convenience. The results of the three-sample normative continuum derived here do not verify the suggested cutoff guidelines in the literature. The methodology employed in the current study seems to be an improvement over past practices.
Two final issues relate to the clinical significance methodology utilized in this study. First, an internal consistency value was used in the calculations for RCI instead of a test-retest reliability coefficient (see Martinovich et al., 1996) because of the likelihood that test-retest values based on patient populations underestimate the actual stability of measurement due to actual change being considered error variance. Martinovich et al. suggest that it is possible that when patients decide to seek treatment, the change process is already set in motion, and that a test-retest period prior to treatment actually beginning may still reflect actual change rather than only temporal instability on the measurement instrument. This results in an inflated RCI. A more reasonable estimate of RCI is based on the internal consistency of the measure. Second, confidence bands were not computed around the cutoff scores between adjacent samples, although this practice has been advocated by Jacobson and Revenstorf (1988). This is because using a confidence band is redundant with using the RCI (the formula for computing a confidence band is 2RCI + the cutoff point). Essentially, using a confidence band means that a patient's score must change by at least the magnitude of the RCI, and it cannot start within the range of the RCI using the cutoff point as a midpoint. Hansen and Lambert (1996) have suggested that assessing clinical significance by requiring the patient to cross the cutoff point between adjacent samples and to have a change score of greater magnitude than the RCI is sufficiently stringent, and that using confidence bands does not add significantly to the methodology and results in data falling within the confidence bands being unusable.
One of the most attractive features of clinical significance methodology is that it is relatively simple and straightforward to apply. This makes it practical for routine practice and implementation across a wide variety of settings without great technical requirement. Thus, the tables presented are recommended for use by clinicians and researchers as a helpful and efficient way to assess clinically significant change on the BDI resulting from treatment. This project has been based on one model that has been proposed as a means by which such assessment can be done. The assertion has been put forth that "clinical significance has clearly arrived, but the optimal methods for deriving it remain to be determined" (Jacobson & Truax, 1991, p. 18), and that, as such, it is too early to invariantly adhere to a particular method or set of conventions: Clearly, further experimentation and efforts to improve methodology are important (Jacobson et al., 1984; Jacobson & Truax, 1991; Kazdin, 1999; Lambert & Hill, 1994). These suggested attempts at clarification, refinement, and betterment of techniques should be ongoing in an effort to contribute to an increased ability to meaningfully measure and describe client change, thus impacting both clinical research and clinical practice.
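To show how simply the two-part criterion attributed above to Hansen and Lambert (1996) can be applied, the following Python sketch classifies a single patient's pre-to-post change given a cutoff and an RCI. It is an illustration under stated assumptions, not the authors' procedure: the function name, the category labels, the use of "at least the RCI" as the threshold, and the example scores are all hypothetical, and the cutoff and RCI passed in are placeholders rather than the values reported in this study.

```python
def classify_change(pre, post, cutoff, rci):
    """Classify change on a scale where lower scores mean fewer symptoms.

    Assumptions (hypothetical, for illustration only):
    - change is reliable when its magnitude is at least the RCI;
    - the cutoff is crossed when the pre score lies above it and the
      post score lies at or below it.
    """
    change = pre - post  # positive values indicate improvement
    reliable = abs(change) >= rci
    crossed = pre > cutoff >= post

    if reliable and change > 0 and crossed:
        return "clinically significant change"
    if reliable and change > 0:
        return "reliable improvement"
    if reliable and change < 0:
        return "reliable deterioration"
    return "no reliable change"


if __name__ == "__main__":
    # Hypothetical cutoff and RCI; substitute the tabled values in practice.
    for pre, post in [(25, 9), (30, 20), (12, 9), (10, 20)]:
        print(pre, post, classify_change(pre, post, cutoff=10.0, rci=8.3))
```

No confidence band around the cutoff is used here, in keeping with the argument above that the RCI requirement already makes such a band redundant.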
References
Barlow, D. H. (1981). On the relation of clinical research to clinical practice: Current issues, new directions. Journal of Consulting and Clinical Psychology, 49, 147-155.
Beck, A. T. (1967). Depression: Causes and treatment. Philadelphia: University of Pennsylvania Press.
Beck, A. T., & Beamesderfer, A. (1974). Assessment of depression: The Depression Inventory. In P. Pichot (Ed.), Psychological measurements in psychopharmacology: Modern problems in pharmacopsychiatry (Vol. 7, pp. 151-169). Basel, Switzerland: Karger.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of depression. New York: Guilford Press.
Beck, A. T., & Steer, R. A. (1987). Beck Depression Inventory--manual. San Antonio, TX: The Psychological Corporation.
Beck, A. T., Steer, R. A., & Garbin, M. G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8, 77-100.
Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561-571.
Bergin, A. E., & Strupp, H. (1972). Changing frontiers in the science of psychotherapy. New York: Aldine-Atherton.
Burns, D., & Beck, A. T. (1978). Cognitive behavior modification of mood disorders. In J. P. Foreyt & D. Rathjen (Eds.), Cognitive behavior therapy. New York: Plenum Press.
Christensen, L., & Mendoza, J. L. (1986). A method of assessing change in a single subject: An alteration of the RC index. Behavior Therapy, 17, 305-308.
Clement, P. W. (1996). Evaluation in private practice. Clinical Psychology: Science and Practice, 3, 146-159.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic Press.
Derogatis, L. R. (1983). SCL-90-R: Administration, scoring, and procedures manual (2nd ed.). Towson, MD: Clinical Psychometric Research.
Foster, S. L., & Mash, E. J. (1999). Assessing social validity in clinical treatment research: Issues and procedures. Journal of Consulting and Clinical Psychology, 67, 308-319.
Froyd, J. E., Lambert, M. J., & Froyd, J. D. (1996). A review of practices of psychotherapy outcome measurement. Journal of Mental Health, 5, 11-15.
Gladis, M. M., Gosch, E. A., Dishuk, N. M., & Crits-Christoph, P. (1999). Quality of life: Expanding the scope of clinical significance. Journal of Consulting and Clinical Psychology, 67, 320-331.
Haaga, D. A. F. (2000). Introduction to the special section on stepped care models in psychotherapy. Journal of Consulting and Clinical Psychology, 68, 547-548.
Hansen, N. B., & Lambert, M. J. (1996). Clinical significance: An overview of methods. Journal of Mental Health, 5, 17-24.
Hansen, N. B., Lambert, M. J., & Forman, E. M. (in press). The psychotherapy dose-response effect and its implications for treatment delivery services. Clinical Psychology: Science and Practice.
Hauck, W., & Anderson, S. (1992). Type of bioequivalence and related statistical considerations. Journal of Clinical Pharmacology, Therapy and Toxicology, 30, 181-187.
Horowitz, L. M., Rosenberg, S. E., Baer, B. A., Ureno, G., & Villasenor, V. S. (1988). Inventory of interpersonal problems: Psychometric properties and clinical applications. Journal of Consulting and Clinical Psychology, 56, 885-892.
Horowitz, L. M., Rosenberg, S. E., & Bartholomew, K. (1988). Interpersonal problems in brief dynamic psychotherapy. Unpublished manuscript.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 336-352.
Jacobson, N. S., & Revenstorf, D. (1988). Statistics for assessing the clinical significance of psychotherapy techniques: Issues, problems, and new developments. Behavioral Assessment, 10, 133-145.
Jacobson, N. S., Roberts, L. J., Berns, S. B., & McGlinchey, J. B. (1999). Method for defining and determining the clinical significance of treatment effects: Description, application, and alternatives. Journal of Consulting and Clinical Psychology, 67, 300-307.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19.
Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1, 427-452.
Kazdin, A. E. (1999). The meanings and measurement of clinical significance. Journal of Consulting and Clinical Psychology, 67, 332-339.
Kendall, P. C. (Ed.). (1999). Clinical significance [Special section]. Journal of Consulting and Clinical Psychology, 67(3).
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285-299.
Lacks, P., & Powlishta, K. (1989). Improvement following behavioral treatment for insomnia: Clinical significance, long-term maintenance, and predictors of outcome. Behavior Therapy, 20, 117-134.
Lambert, M. J., & Brown, G. S. (1996). Data-based management for tracking outcome in private practice. Clinical Psychology: Science and Practice, 3, 172-178.
Lambert, M. J., & Hill, C. E. (1994). Assessing psychotherapy outcome and process. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change (4th ed., pp. 72-113). New York: Wiley.
Martinovich, Z., Saunders, S., & Howard, K. I. (1996). Some comments on assessing clinical significance. Psychotherapy Research, 6, 124-132.
Nathan, P. E., Stuart, S. P., & Dolan, S. L. (2000). Research on psychotherapy efficacy and effectiveness: Between Scylla and Charybdis? Psychological Bulletin, 126, 964-981.
Nietzel, M. T., Russell, R. L., Hemmings, K. A., & Gretter, M. L. (1987). Clinical significance of psychotherapy for unipolar depression: A meta-analytic approach to social comparison. Journal of Consulting and Clinical Psychology, 55, 156-161.
Oliver, J. M., & Simmons, M. E. (1984). Depression as measured by the DSM-III and the Beck Depression Inventory in an unselected adult population. Journal of Consulting and Clinical Psychology, 52, 892-898.
Ottenbacher, K. J., Johnson, M. B., & Hojem, M. (1988). The significance of clinical change and clinical change of significance: Issues and methods. American Journal of Occupational Therapy, 42, 156-163.
Robinson, L. A., Berman, J. S., & Neimeyer, R. A. (1990). Psychotherapy for the treatment of depression: A comprehensive review of controlled outcome research. Psychological Bulletin, 108, 30-49.
Saunders, S. M., Howard, K. I., & Newman, F. L. (1988). Evaluating the clinical significance of treatment effects: Norms and normality. Behavioral Assessment, 10, 207-218.
Schnurr, R., Hoaken, P. C. S., & Jarrett, F. J. (1976). Comparison of depression inventories in a clinical population. Canadian Psychiatric Association Journal, 21, 473-476.
Schwab, J. J., Bialow, M. R., Clemmons, R. S., Martin, P., & Holzer, C. E. (1967). The Beck Depression Inventory with medical inpatients. Acta Psychiatrica Scandinavica, 43, 255-266.
Shaw, B. F., Vallis, T. M., & McCabe, S. B. (1985). The assessment of the severity and symptom patterns in depression. In E. E. Beckham & W. R. Leber (Eds.), Handbook of depression: Treatment, assessment, and research (pp. 372-407). Homewood, IL: Dorsey Press.
Speer, D. C. (1992). Clinically significant change: Jacobson and Truax (1991) revisited. Journal of Consulting and Clinical Psychology, 60, 402-408.
Steer, R. A., Beck, A. T., Brown, G., & Berchick, R. J. (1987). Self-reported depressive symptoms differentiating major depression from dysthymic disorders. Journal of Clinical Psychology, 43, 246-250.
Steer, R. A., Beck, A. T., & Garrison, B. (1986). Applications of the Beck Depression Inventory. In N. Sartorius & T. A. Ban (Eds.), Assessment of depression (pp. 123-142). Geneva, Switzerland: World Health Organization.
Steer, R. A., Beck, A. T., Riskind, J., & Brown, G. (1986). Differentiation of depressive disorders from generalized anxiety by the Beck Depression Inventory. Journal of Clinical Psychology, 40, 475-478.
Stegner, B. E., Bostrom, A. G., & Greenfield, T. K. (1996). Equivalence testing for use in psychological and services research: An introduction with examples. Education and Program Planning, 19, 193-198.
Tingey, R. C. (1989). Assessing clinical significance: Extensions in method and application to the SCL-90-R (Doctoral dissertation, Brigham Young University, 1989). Dissertation Abstracts International, 50, 04B.
Tingey, R. C., Lambert, M. J., Burlingame, G. M., & Hansen, N. B. (1996a). Assessing clinical significance: Proposed extensions in method. Psychotherapy Research, 6, 109-123.
Tingey, R. C., Lambert, M. J., Burlingame, G. M., & Hansen, N. B. (1996b). Clinically significant change: Practical indicators for evaluating psychotherapy outcome. Psychotherapy Research, 6, 144-153.
Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. Journal of Applied Behavior Analysis, 11, 203-214.
RECEIVED: March 29, 2001
ACCEPTED: December 12, 2001