J. psychiat. Res., Vol. 22, No. 3, pp. 207-220, 1988.
0022-3956/88 $3.00 + .00 © 1988 Pergamon Press plc
Printed in Great Britain.
AGREEMENT BETWEEN FACE-TO-FACE AND TELEPHONE-ADMINISTERED VERSIONS OF THE DEPRESSION SECTION OF THE NIMH DIAGNOSTIC INTERVIEW SCHEDULE
KENNETH B. WELLS*, M. AUDREY BURNAM†, BARBARA LEAKE‡ and LEE N. ROBINS§
*Department of Psychiatry and Biobehavioral Sciences, UCLA Neuropsychiatric Institute and School of Medicine, Los Angeles, CA 90024, U.S.A.; †The RAND Corporation; ‡UCLA Department of Medicine, School of Medicine; and §Department of Psychiatry, Washington University School of Medicine

(Received 9 February 1987; revised 10 September 1987; in final form 11 January 1988)

Summary. To increase the feasibility of identifying persons with depressive disorders in a large-scale health policy study, we tested the concordance between face-to-face and telephone-administered versions of the depression section of the NIMH Diagnostic Interview Schedule (DIS). This section was administered over the telephone to 230 English-speaking participants of the Los Angeles site of the NIMH Epidemiologic Catchment Area Program (ECA) after their completion of a face-to-face interview (Wave II) with the full DIS. Time lag between interviews was 3 months, on the average. Persons with depressive symptoms were oversampled. Using the face-to-face version as the criterion measure, the sensitivity, specificity, and positive predictive value of the telephone version for identifying the presence or absence of any lifetime unipolar depressive disorder were 71, 89, and 63 percent, respectively; the kappa statistic was 0.57, and agreement was unbiased. The comparable figures for concordance between two face-to-face interviews administered one year apart to the same subjects were 54, 89, and 60 percent and 0.45 (kappa), respectively. Thus, disagreement was due primarily to test-retest unreliability of the DIS rather than the method of administration.
INTRODUCTION
THE DEVELOPMENT of structured interviews to identify psychiatric disorders according to precisely defined criteria has been one of the most important methodologic advances in psychiatric epidemiology and health services research in the last two decades. As noted by ENDICOTT and SPITZER (1978), structured interviews standardize the diagnostic process by reducing both information variance (differences in information available for assigning a diagnosis) and criterion variance (differences in criteria for aggregating data into a diagnosis). In the United States, the two most widely used structured psychiatric interviews are the Schedule for Affective Disorders and Schizophrenia (SADS) (ENDICOTT and SPITZER, 1978), which assigns diagnoses according to the Research Diagnostic Criteria (RDC), and the NIMH Diagnostic Interview Schedule (DIS), which assigns diagnoses according to DSM-III, RDC, or Feighner Criteria (ROBINS et al., 1981). The DIS is unusual among structured interviews in the degree of specification of its questions, probes, and scoring system. As a result, it can be administered by specially trained lay interviewers. Other existing structured interviews, by contrast, require trained clinicians as interviewers, which makes the DIS less costly to use. Still greater cost savings could be realized if the DIS could be administered by telephone rather than face-to-face.

This research was supported by grants from the Robert Wood Johnson Foundation, the Henry J. Kaiser Family Foundation, The Pew Memorial Trust, and the National Institute of Mental Health. The conclusions are those of the authors and do not necessarily represent the views of the Foundations or NIMH.

Several investigators have recently developed new assessment instruments or modified the DIS to enhance the feasibility of obtaining psychiatric diagnoses for research purposes. For example, ZIMMERMAN et al. (1986) developed a self-report instrument that assigns a diagnosis of major depression. GREIST et al. (1984) developed a computer-assisted version of the DIS. ROBINS et al. (1986) have developed a short screening version of the DIS. In this paper, we report the results of an attempt to decrease the cost of obtaining a psychiatric diagnosis from a community sample by administering the depression section of the DIS over the telephone.

The telephone-administered version of the depression section of the DIS reported here was developed for the National Study of Medical Care Outcomes (MOS), an observational study of the processes and outcomes of care delivered to patients with one of four specified disorders in fee-for-service and prepaid practices. One of the four disorders studied in the MOS is the group of unipolar depressive disorders (WELLS, 1985). Because the study is a large-sample, multi-site study, it was not feasible, given the budget, to administer a face-to-face DIS to potential participants to identify those with depressive disorders. Instead, we developed a two-stage screening process. The first stage is a self-administered 8-item measure of depressive symptoms (BURNAM et al., in preparation). Those positive for depression on the first-stage screener are administered the depression section of the DIS by telephone.
To measure the equivalence of administering this second-stage DIS by telephone versus face-to-face, we conducted a telephone follow-up of participants in the Los Angeles ECA Program.

Previous studies have examined the comparability of face-to-face and telephone-administered interviews for obtaining data on general health status, use of health services, disability days, or psychiatric symptoms (SIMON et al., 1974; HOCHSTIM, 1967; YAFFE et al., 1978; SIEMIATYCKI, 1979; KULKA et al., 1984; JOSEPHSON, 1965; JORDAN et al., 1980; ANESHENSEL et al., 1982a, 1982b; HENSON et al., 1977; COLOMBOTOS, 1969; ANESHENSEL and YOKOPENIC, 1985). Generally, the results of these studies indicate that data obtained from telephone-administered interviews are at least as accurate as data obtained from face-to-face interviews, and that similar estimates of the prevalence of depressive symptoms are obtained using either method. A few studies suggest that telephone-administered surveys result in more missing data on items considered to be “sensitive”, such as income (HOCHSTIM, 1967; JORDAN et al., 1980; ANESHENSEL et al., 1982), and at least one study of psychiatric symptoms suggests that respondents are more likely to give a “cheerful” response over the telephone than in a face-to-face interview (HENSON et al., 1977). Otherwise, few differences in responses as a function of interview method have been observed.

These previous studies are not sufficient for determining whether the DIS can be administered over the telephone, for two reasons. First, in most of the previous comparisons of the two interview methods, random subsamples of participants were administered either a face-to-face or a telephone interview. Conclusions about comparability were not based on repeat interviews of the same individuals. Thus, even when the two methods yielded similar estimates of the prevalence of depressive symptoms, they might have identified
different individuals in the population as having depressive symptoms. For many research purposes, including studies such as the MOS which seek to evaluate the care delivered to depressed persons over time, it is necessary to be able to select a sample of affected persons for study. Second, previous studies examined comparability of responses to single items or of scale scores, but not of more complicated combinations of items, such as an algorithm for assigning a psychiatric diagnosis. Such algorithms require integrating information on the presence or absence of multiple symptoms and on the timing of symptoms (e.g. whether they cluster together). Even if responses to the two interview methods were similar on most individual items, there could be less agreement on the presence or absence of disorder.

METHODS
The current study was designed as a follow-up to the Los Angeles project of the NIMH Epidemiologic Catchment Area Program (ECA). The ECA is a multisite general population survey of the epidemiology of psychiatric disorders and use of health services by adults (REGIER and MYERS, 1984; HOUGH et al., 1984; EATON and KESSLER, 1985). The Los Angeles ECA site has a household probability sample of two mental health catchment areas (East Los Angeles and Venice/Culver City). One adult was randomly sampled from each household. Data were obtained from the same respondents in two waves of face-to-face interviews, one in 1983-1984 and one in 1984-1985. The response rate for Wave I was 68%; of these, 76.8% completed the Wave II interview. The face-to-face interviews at each wave included the complete DIS, administered by trained lay interviewers.

After completing a Wave II interview, a sample of ECA respondents was asked by ECA staff for permission to be contacted by staff for the telephone validation study.
For the first 52 respondents, such permission was obtained by the ECA interviewer immediately after completing the Wave II interview. For the remainder of respondents, permission was requested by a telephone call by an ECA staff member one to two weeks after the Wave II interview. The telephone validation study staff contacted consenting respondents by telephone to request participation and schedule a telephone interview. Respondents were told that the interview would include some of the same items that were asked in the full DIS, and they were offered $10 for their participation. On average, the telephone interviews were conducted 3 months after the Wave II interview. Those eligible for the validation study included all ECA participants selected from the general population who completed a face-to-face Wave II interview in English between 15 August 1984 and 1 September 1985. We selected a random sample of these respondents, stratified by the presence or absence of either of two indicators of depression: having two or more weeks of feeling depressed or sad in the 12 months prior to the Wave II ECA interview; or lifetime history of three or more of the eight symptom groups included in the DSM-III Criteria B for major depressive disorder. We included all respondents who met one of these criteria and a 30% random sample of the remainder of participants, to yield a study sample with approximately 50% of respondents having at least one of these indicators of depression. We oversampled patients with depressive symptoms to allow us to determine agreement on the presence of depressive symptoms, which are relatively
uncommon in the general population, and to simulate the use of the DIS as a second-stage screener for depressive disorder (i.e. for use in the MOS).

Because we wanted a relatively brief interval between the Wave II ECA interview and the telephone interview, to reduce the amount of clinical change in the subjects, we restricted our sample to the 1488 ECA respondents who completed their Wave II ECA interview between 15 August 1984 and 1 September 1985. Of these, 588 were ineligible for the current study (258 were from a separate institutionalized sample, 167 completed the interview in Spanish, 163 had their Wave II interview over the telephone). Of the remaining 900 cases, 194 were selected because they met the depressive symptoms inclusion criteria and 207 were randomly selected from the remaining cases. Of the 401 respondents selected for telephone interviews, 58 were subsequently determined to be ineligible (30 had their Wave II ECA interview by telephone; 28 did not have a telephone). Of the 343 eligible selected participants, 60 refused to participate. In addition, 34 individuals could not be contacted, 15 interviews were terminated before completion, and 4 respondents had incomplete Wave II data at the time of analysis. The relatively high rate of refusals was explained in part by respondent fatigue (they had just completed their third ECA interview) and in part by the fact that we allowed all initial refusals to the ECA interviewer to stand. We did not attempt to “convert” them, a practice that usually increases response rates by 5-10%.

Both the face-to-face (ECA) and telephone-administered interviews included the depression section of the DIS (ROBINS et al., 1981), which assesses the presence or absence of 20 symptoms of depression representing DSM-III criteria for major depression and dysthymia.
The depression section also elicits data on the clustering of symptoms, allowing a determination of lifetime prevalence of depressive disorders, and on the recency of depressive symptoms and episodes. In this paper, we focused on lifetime, rather than current, symptoms and disorder. We would have had low precision for analyses of agreement on current depression because of its low prevalence in a community sample, and such agreement would have been more affected by the interval of several weeks between interviews. Our definition of depressive disorders excluded grief reaction, consistent with the DSM-III definition, but did not apply the other DSM-III exclusion criteria (e.g., preexisting schizophrenia).

STATISTICAL METHODS
Our study raises three statistical design issues: (1) whether a comparison of a telephone and face-to-face administered DIS is a test of reliability or validity; (2) distinguishing disagreement in response due to differing methods of administration (i.e. telephone versus face-to-face) from disagreement due to imperfect test-retest reliability; and (3) selecting the appropriate statistical tests of equivalence. We briefly discuss each issue below.

Reliability versus validity
As noted by SHROUT et al. (1987), “Reliability in the psychometric sense is the reproducibility of distinctions made between some aspects of persons” (p. 172). Reliability can refer to agreement in responses to different items or measures that are thought to assess the same construct (i.e. internal consistency) or to agreement in responses to the same measure administered at different points in time (i.e. stability or test-retest reliability). Validity, in contrast, refers to the extent to which a measure actually assesses the construct
it is thought to assess. In tests of diagnostic measures, two types of validity are of interest: construct validity (including discriminant and convergent validity), or the extent to which an instrument is related, in the appropriate direction, to other constructs known to be positively or negatively associated with the dimension (or diagnosis) of interest; and criterion validity, or the degree of association with an independent criterion measure, or “gold standard”.

In the current study, we assess the equivalence of a telephone and face-to-face administered version of the depression section of the DIS. The items are identical, but the route of administration differs. HELZER et al. (1985) describe such a study as a test of “procedural validity”, having properties of a study of both reliability (because the underlying measure is identical) and of validity (because one of the methods of administration may elicit a more valid response). Given the relative lack of data on the reliability and validity of the DIS, however, it may be inappropriate to consider the face-to-face version the gold standard. It is possible, for example, that more accurate information is obtained on the telephone because respondents may feel more comfortable revealing personal information when they are not seen by the interviewer. Consistent with HELZER et al. (1985), we regard this study as having properties of a test of both reliability and validity and present measures of agreement consistent with both approaches (see below).

Controlling for competing explanations of disagreement
Even if we administered a second face-to-face version of the DIS instead of the telephone version, we would not expect perfect agreement. This is because there is measurement error in the face-to-face version and because true clinical status could change in the interval between administrations, i.e. the respondent could experience a new symptom or disorder during the interval.
Thus, to assess the equivalence of the telephone and face-to-face versions, we must control for these competing explanations of disagreement. To control for test-retest reliability, we compared the agreement between the telephone and face-to-face administrations with the agreement between the two face-to-face administrations in the ECA, Wave I and Wave II, to the same respondents 12 months apart. For testing agreement between Wave I and Wave II, Wave I was selected as the criterion (for validity statistics). To control for incidence of new depressive symptoms and disorders in the interval between administrations (and for differences in duration of the interval between administrations), we performed sensitivity analyses in which we corrected the measures of agreement for incidence of lifetime depressive symptoms and/or major depression*. Because the incidence of both lifetime depressive symptoms and major depression was very low over the short intervals between administrations, these corrections had a negligible effect on the results. Thus, we present the uncorrected statistics in this paper.

*Specifically, in these analyses, if a lifetime symptom of depression was absent on the first interview but present on the second, we did not count it as disagreement if the new symptom was reported as occurring during the interval between administrations of the DIS (as determined from items assessing recency of each symptom). Such a correction was possible for each lifetime symptom of depression for the comparisons of the telephone and Wave II DIS and the Wave I and Wave II DIS. Because information on recency of dysthymia is not elicited by the DIS, a correction for incidence of dysthymia during the specific interval between administrations was not possible. It is possible to correct for incidence of major depression within two weeks, one month, six months, and one year prior to a DIS interview. These time intervals were too restrictive to allow correction of measures of agreement on major depression for the comparison of telephone and face-to-face versions, which were, on average, three months apart. However, we were able to correct agreement statistics for major depression for the Wave I versus Wave II comparison because of the greater duration between administrations (12 months).

Selection of agreement statistics
Our measures of validity are sensitivity (percent of true positives correctly identified), specificity (percent of true negatives correctly identified), and positive predictive value (percent of those identified as positive by the follow-up DIS that are also positive by the initial or criterion administration). Our measure of reliability is the kappa statistic (κ), the most commonly used measure of agreement. Recently, there has been considerable controversy concerning the interpretation of the kappa statistic, especially whether the fall in kappa with very low or very high base-rates is appropriate or inappropriate [see SHROUT et al. (1987), ROBINS (1985), and HELZER et al. (1985) for a fuller presentation of the issues]. As SHROUT et al. (1987) and HELZER et al. (1985) point out, base-rates around 50% will have the highest values of kappa, other factors being equal. The kappa statistic tends to fall as the base-rate becomes very low or very high. According to SHROUT et al. (1987), this fall in kappa is appropriate and indicates that a measure that demonstrates acceptable agreement at moderately high base-rates will demonstrate lower, and perhaps unacceptable, agreement at extreme base-rates. HELZER et al. (1985) argue, in contrast, that because of the “sensitivity of kappa” to base-rates, other measures of agreement that show less sensitivity to base-rate, such as Yule’s Y, should also be used as indicators of agreement. We found, however, that our conclusions were similar across a wide variety of measures of reliability. Thus, the only measure of reliability reported here is the kappa statistic. We also determined McNemar’s χ², which estimates degree of bias between measures.
Bias is the tendency for persons to have more positive responses, i.e. to report having more depressive symptoms, with one interview method than another. The measures of validity and reliability are defined in Fig. 1, with reference to a simple 2-way cross-tabulation of responses to a telephone and (criterion) face-to-face interview.

                                  Telephone DIS
                                Present    Absent
Face-to-face DIS    Present        A          B
                    Absent         C          D

Sensitivity = $[A/(A+B)] \times 100$
Specificity = $[D/(C+D)] \times 100$
Positive predictive value = $[A/(A+C)] \times 100$
Kappa = $2(AD-BC)/[(A+B)(B+D)+(A+C)(C+D)]$
McNemar’s $\chi^2 = (|B-C|-1)^2/(B+C)$

FIG. 1
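As a worked illustration of the Fig. 1 definitions, the following Python sketch computes each statistic from the four cell counts. The class and method names are hypothetical (they do not come from the paper), and the McNemar statistic is written in its usual continuity-corrected form with an absolute difference, which is an assumption about the intended formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts laid out as in Fig. 1 (criterion = face-to-face DIS).

    Hypothetical helper for illustration only.
    """

    a: int  # criterion present, telephone present
    b: int  # criterion present, telephone absent
    c: int  # criterion absent, telephone present
    d: int  # criterion absent, telephone absent

    def sensitivity(self) -> float:
        # Percent of criterion-positive respondents also positive by telephone.
        return 100.0 * self.a / (self.a + self.b)

    def specificity(self) -> float:
        # Percent of criterion-negative respondents also negative by telephone.
        return 100.0 * self.d / (self.c + self.d)

    def positive_predictive_value(self) -> float:
        # Percent of telephone-positive respondents positive on the criterion.
        return 100.0 * self.a / (self.a + self.c)

    def kappa(self) -> float:
        # Chance-corrected agreement, using the closed form given in Fig. 1.
        numerator = 2.0 * (self.a * self.d - self.b * self.c)
        denominator = (self.a + self.b) * (self.b + self.d) + (
            self.a + self.c
        ) * (self.c + self.d)
        return numerator / denominator

    def mcnemar_chi2(self) -> float:
        # Continuity-corrected McNemar statistic on the discordant cells
        # (assumed form; undefined when there are no discordant pairs).
        discordant = self.b + self.c
        if discordant == 0:
            raise ValueError("McNemar's statistic needs at least one discordant pair")
        return (abs(self.b - self.c) - 1) ** 2 / discordant


# Counts reported for the first Table 1 item ("Sad or depressed").
item = TwoByTwo(a=95, b=19, c=17, d=99)
print(round(item.sensitivity()), round(item.specificity()),
      round(item.positive_predictive_value()), round(item.kappa(), 2))
```

For these counts the sketch gives values that round to the tabulated sensitivity, specificity, positive predictive value and kappa (83%, 85%, 85% and 0.69).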
RESULTS
Of the 339 ECA Wave II respondents selected to participate in the telephone interview study, 230 (68%) completed the telephone interview. The age range of this sample is 18-81 yr, with a mean of 39 yr. The sample is 54.3% male; 59.6% are non-Hispanic white, 26.4% are Hispanic, and 14% have some other ethnic background. As shown in Table 1, the sample has a relatively high lifetime prevalence of depressive symptoms (consistent with the sampling design). Half have experienced two weeks or more of depression; the prevalence of other depressive symptoms varies from 40% with two weeks or more of crying spells to 5% with two weeks or more of pacing or moving all the time.

Agreement at the symptom level
Table 1 provides measures of agreement and bias for lifetime depression items that were included in both the Wave II ECA interview and the telephone interview. The items represent DSM-III Criteria A for major depression and for dysthymia, and DSM-III Criteria B for major depression (8 symptom groups) and 9 of the symptoms for dysthymia. As shown in Table 1, when the face-to-face interview items are used as the criterion measure, the telephone items have high specificity: the mean for the 20 items is 91%. Sensitivity is lower and more variable, ranging across items from 33% for “talking or moving slowly” to 83% for “depressed two weeks or more”. The mean sensitivity for the 20 items is 62%. Positive predictive value ranges from 22% for “moving all the time” to 89% for “attempted suicide”. The mean positive predictive value is 62%. As shown in Table 1, kappa statistics range from 0.22 for “moving all the time” to 0.83 for “attempted suicide”. The mean kappa is 0.50. Only three of the twenty items demonstrate a significant bias, in the direction of fewer symptoms reported on the telephone version in two cases (crying spells and talking or moving slowly) and more symptoms in the third (life hopeless).
The difference is especially striking for crying spells, with 64% higher reporting of this symptom on Wave II, relative to the telephone interview. The lowest agreement is for symptoms that are not specific for depression (“felt tired”) or are abstract (“thoughts come slow”), as opposed to items that are more specific to depression (“suicidal ideation”) or concrete (“lost weight”).

When we correct measures of agreement for incidence of symptoms in the interval between administrations (see Methods), the results are similar; positive predictive value increases the most (in one case by 12 percentage points). With the correction, the mean sensitivity is 59%, the mean specificity is 92%, the mean positive predictive value is 67%, and the mean kappa is 0.52. For specific lifetime symptoms of depression, measures of agreement between Waves I and II (not shown) are similar to measures of concordance between the telephone and face-to-face interviews reported in Table 1.

TABLE 1. AGREEMENT OF DIS LIFETIME DEPRESSIVE SYMPTOMS, WAVE II FACE-TO-FACE VERSUS TELEPHONE ADMINISTERED

2 x 2 table cells: A = Wave II DIS +, telephone DIS +; B = Wave II DIS +, telephone DIS -; C = Wave II DIS -, telephone DIS +; D = Wave II DIS -, telephone DIS -.

                                                     Sensitivity  Specificity  Positive predictive
Abbreviated DIS item          A     B     C     D        (%)          (%)          value (%)         κ     Bias
Sad or depressed             95    19    17    99        83           85              85           0.69    NS
Depressed 2 yr               11    11    13   194        50           94              46           0.42    NS
Loss of appetite             27    13    14   173        66           93              68           0.59    NS
Lost weight                  26    14    13   177        65           95              90           0.63    NS
Gained weight                24    22    15   169        52           92              62           0.47    NS
Trouble falling asleep       43    34    21   132        56           86              67           0.44    NS
Sleeping too much            18    19    10   182        49           95              64           0.48    NS
Moving all the time           4     8    14   202        33           94              22           0.22    NS
Talking or moving slower     11    16     5   198        41           98              69           0.47    *
Interest in sex low          16    26    17   167        38           91              49           0.32    NS
Tired out                    29    25    38   138        54           78              43           0.30    NS
Worthless or guilty          39    16    13   162        71           93              75           0.65    NS
Trouble concentrating        31    24    20   155        56           89              61           0.46    NS
Thoughts slow                14    19    11   179        42           91              45           0.35    NS
Thoughts of death            40    24    34   132        63           80              54           0.40    NS
Wanted to die                22    12     7   189        65           96              76           0.65    NS
Thoughts of suicide          40    11    15   164        78           92              73           0.68    NS
Attempted suicide            16     4     2   207        80           99              89           0.83    NS
Life hopeless                29    10    31   159        74           84              48           0.48    *
Crying spells                47    43     8   130        52           94              86           0.50    *

*P < 0.05; NS: P > 0.05; sample size varies slightly by item due to missing data.

Agreement on unipolar depressive disorders
Table 2 presents the measures of concordance on the presence or absence of specific affective disorders for the contrasts of face-to-face and telephone administration and two face-to-face administrations (Wave I and Wave II). With Wave II response as the criterion measure, specificity of the telephone version for identifying a particular affective disorder is high (89-95%), sensitivity is moderately low (55-56%), and positive predictive value is 52-55%. For the presence or absence of major depression and/or dysthymia, sensitivity is 71%, specificity 89% and positive predictive value 63%. There is no significant bias in agreement on disorder status between the telephone and Wave II interviews. As shown in Table 2, the comparable results for the comparison of two face-to-face interviews (Wave
I and Wave II) are quite similar. However, sensitivity of the Wave II interview for identifying any major depression and/or dysthymia from Wave I is only 54%.

TABLE 2. AGREEMENT ON DIS LIFETIME UNIPOLAR DEPRESSIVE DISORDERS, WAVE II FACE-TO-FACE VERSUS TELEPHONE ADMINISTERED AND WAVE I VERSUS WAVE II

2 x 2 table cells: A = first DIS +, second DIS +; B = first DIS +, second DIS -; C = first DIS -, second DIS +; D = first DIS -, second DIS -.

DIS comparison and                                          Sensitivity  Specificity  Positive predictive
DSM-III depressive disorder            A     B     C     D      (%)          (%)          value (%)         κ     Bias
Wave II versus telephone*
  Major depression                    24    19    20   167      56           89              55           0.45    NS
  Dysthymia                           12    10    11   197      55           95              52           0.48    NS
  Major depression and/or dysthymia   34    14    20   162      71           89              63           0.57    NS
Wave I versus Wave II†
  Major depression                    25    23    18   165      54           90              58           0.44    NS
  Dysthymia                           11    10    11   199      52           95              50           0.46    NS
  Major depression and/or dysthymia   29    25    19   158      54           89              60           0.45    NS

NS: P > 0.10; depressive disorders exclude grief reaction but not other psychiatric disorders.
*Sample size = 230. Wave II DIS is the criterion measure.
†Sample size = 231; includes one individual with incomplete data on the telephone interview. Wave I DIS is the criterion measure.

Interval between administrations
On average, the interval between the face-to-face and telephone interviews was 104 days, or three and one-half months. The median was 84 days. There was a slight but statistically nonsignificant tendency for persons with current depression to have a shorter interval. Twenty-five percent of those reinterviewed within 90 days, compared to 17 percent of those reinterviewed after more than 90 days, had had a major depressive episode in the previous 12 months [for a two-way contingency table, χ²(1) = 2.14, P > 0.10]. We determined the relationship between the duration of the interval and agreement on two indicators of depression: having experienced 2 weeks or more of feeling depressed; and the presence or absence of three or more lifetime symptom groups for major depression. For each indicator of depression, the percent of respondents agreeing on the presence or absence of the indicator is similar for those with an interval of less than 90 days and for those with an interval of 90 days or greater (see Table 3). Thus the interval does not seem to affect agreement.

DISCUSSION
Recently, several authors have published data on the validity of the lay-administered DIS, as compared to independent clinical assessment (HELZER et al., 1985; SCHULBERG et al., 1985; ANTHONY et al., 1985). These studies suggest that the validity of the depression section is relatively high when the clinical judgement is based on criteria very similar to those used in the DIS (HELZER et al., 1985), but lower when the clinical judgement is either based on more independent criteria (ANTHONY et al., 1985) or derived from the medical record (SCHULBERG et al., 1985). Here, we take the criterion validity of the depression section of the DIS for granted and focus on the concordance between responses to two methods of administration: face-to-face and telephone-administered.

TABLE 3. PERCENT AGREEING ON DEPRESSION CRITERIA BY INTERVAL BETWEEN WAVE II FACE-TO-FACE AND TELEPHONE-ADMINISTERED INTERVIEWS (N = 226)

                                                              Interval
Depression indicator                            Less than 90 days   90 days or more   Significance
Depressed 2 weeks in prior 12 months                   51                  49              NS
Three or more lifetime symptom groups
  of major depression                                  52                  48              NS

NS: P > 0.10.

Our results clearly indicate that agreement between an initial face-to-face version of the depression section of the DIS and a second DIS is equivalent, whether the second DIS is face-to-face or telephone-administered, given an interval between administrations of 3-12 months. That is, measures of agreement for both lifetime depressive symptoms and specific major affective disorders were comparable for the two types of comparisons (i.e. face-to-face versus telephone; two face-to-face administrations).

Our conclusion that telephone and face-to-face versions of the depression section of the DIS are equivalent is based on data from a subsample of the English-speaking participants of the Los Angeles ECA Project. This conclusion does not appear to be very sensitive to the base-rate of depressive symptoms and/or major affective disorder: for a wide range of hypothetical base-rates (with sensitivity and specificity fixed at the values reported in this paper), we calculate that the kappa statistics and positive predictive value would be comparable. Because of the higher sensitivity of the follow-up telephone version, however, we estimate that at very low or high base-rates, the positive predictive value would be clearly superior for the telephone version.
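The hypothetical base-rate calculation described above can be sketched in Python as follows. This is a minimal illustration, assuming that the expected cell proportions of the 2 x 2 table follow directly from sensitivity, specificity and base-rate; the function names are hypothetical, and this standard relationship is not necessarily the exact computation used to obtain the figures quoted in this paper.

```python
def expected_cells(sensitivity: float, specificity: float, base_rate: float):
    """Expected proportions (a, b, c, d) of a 2 x 2 agreement table.

    a: criterion +, follow-up +    b: criterion +, follow-up -
    c: criterion -, follow-up +    d: criterion -, follow-up -
    All three arguments are proportions between 0 and 1.
    """
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must lie strictly between 0 and 1")
    a = base_rate * sensitivity
    b = base_rate * (1.0 - sensitivity)
    c = (1.0 - base_rate) * (1.0 - specificity)
    d = (1.0 - base_rate) * specificity
    return a, b, c, d


def ppv_and_kappa(sensitivity: float, specificity: float, base_rate: float):
    """Positive predictive value and kappa implied by fixed sensitivity and
    specificity at a hypothetical base-rate (illustrative helper)."""
    a, b, c, d = expected_cells(sensitivity, specificity, base_rate)
    if a + c == 0.0:
        raise ValueError("no expected positives on the follow-up measure")
    ppv = a / (a + c)
    # Same closed form for kappa as for counts; it also holds for proportions.
    kappa = 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
    return ppv, kappa


# Sensitivity and specificity for any depressive disorder, as reported in the
# Summary for the telephone and repeat face-to-face comparisons.
for base_rate in (0.07, 0.20, 0.40):
    telephone = ppv_and_kappa(0.71, 0.89, base_rate)
    face_to_face = ppv_and_kappa(0.54, 0.89, base_rate)
    print(base_rate, telephone, face_to_face)
```

Because the inputs here are the rounded percentages, values printed by this sketch need not reproduce published figures exactly; it is meant only to show how positive predictive value and kappa move with the base-rate when sensitivity and specificity are held fixed.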
For example, at a base-rate of 7% for any lifetime major depression and/or dysthymia, with sensitivity and specificity fixed at the values reported in Table 2, positive predictive value would be 37% for the follow-up telephone version and 25% for the repeat face-to-face DIS. The greater sensitivity of the telephone DIS could be due to the shorter interval between comparisons, relative to the one-year interval between Waves I and II.

ROBINS (1985) has described the desirable characteristics of studies of agreement between diagnostic measures: (1) the order of administration should be reversed for a random sample of participants to compensate for any sequence effects; (2) time interval between administrations should be minimized (same day, if possible) and recency effects should be determined; and (3) the measures should be administered to the same sample rather
than each measure administered to a different random subsample. Our study design addressed some, but not all, of these recommendations. Our study, in contrast to previous studies of the comparability of telephone and personal interviews, examined agreement based on reinterviews of the same sample. Because of the need to protect the integrity of the data collection for the ECA, we could not reverse the order of administration by giving some subjects the telephone-administered DIS first. While we attempted to minimize the interval between administrations, the number of phone contacts required to complete data collection resulted in a delay averaging three months. Such a delay, however, would be expected to have little effect on agreement on lifetime prevalence of disorders. Further, the delay indicates that the level of agreement we observed was not due to simple memory of responses to the first interview, as might be the case for a study based on reinterviews of respondents on the same day. We used two strategies to evaluate the effects of the time interval. First, we determined that duration of the interval was unrelated to level of agreement. Second, we corrected some analyses of agreement for incidence of symptoms/disorder during the interval between administrations, with negligible changes in agreement.

Our results are consistent with the conclusion of previous authors that data on sensitive or personal topics can be obtained reliably on the telephone (HOCHSTIM, 1967; MARCUS and CRANE, 1986). Specifically, we found that among the 20 symptoms included in the depression section of the DIS, agreement between interviews was highest for suicidal ideation, perhaps the most socially sensitive symptom. Further, agreement increased with the seriousness of the suicidal ideation.
Although the telephone and face-to-face versions of the depression section of the DIS appear to be equivalent, concordance between an initial face-to-face DIS and a follow-up DIS, using either method of administration, was far from perfect. What are the implications of the extent of discordance we observed? Psychiatric diagnostic measures are most commonly used either to estimate prevalence or incidence of disorder in general or clinical (treated) populations or to identify a homogeneous population, e.g. those with depressive disorders, for further study. For the first purpose, we are concerned about the effect of imperfect measurement on prevalence estimates; for the second purpose, we are concerned about the effect of imperfect measurement on the positive predictive value of the DIS, i.e. on the yield of true cases in an identified sample.
The specific reason for our study of the equivalence of telephone and face-to-face administration of the DIS was to determine the suitability of the telephone version for inclusion in the MOS as a second-stage screener for major depression and/or dysthymia. Our results indicate that in a sample with a high prevalence of depressive symptoms such as that studied in the MOS, the test-retest reliability of the DIS may not present a serious problem for estimating prevalence of depressive disorder. With a base-rate of about 20%, there was nearly perfect agreement between the first and second administrations of the DIS on the prevalence of each specific depressive disorder. The majority (60-63%) of those identified as having either major depression or dysthymia by the second DIS had one of these two disorders according to the first DIS. Based on our actual field experience with the MOS, we estimate that the base-rate of any depressive disorder among persons who are positive on our first stage screener is about 40%.
If we assume the same specificity and sensitivity for the telephone DIS as reported in Table 2, then at a base-rate of 40%,
positive predictive value would increase to about 85% and prevalence rates would agree within about 13% (35 vs 40%). We estimate that the kappa statistic would be about 0.61. Assuming fixed specificity and sensitivity, at a base-rate of 7%, such as might occur in a general population, we estimate that the kappa statistic for agreement between a telephone and face-to-face administration on the presence or absence of major depression and/or dysthymia would be about 0.37. Positive predictive value would be about 31%.
We caution the reader, however, that we do not know how the sensitivity and specificity of the repeat DIS would vary as a function of base-rates. As noted by ROBINS (1985), it has been commonly assumed that sensitivity and specificity are constant across base-rates. Robins has pointed out, however, that both sensitivity and specificity are affected by the mix of severe and nonsevere cases among the true positives and the level of pathology (e.g. conditions similar to depression) in the true negatives. As a result of these factors, clinical populations, with a high base-rate, commonly have relatively lower specificity (because there is similar pathology in the true negatives) and higher sensitivity (because the true positives are more severe, on average) while general populations have relatively higher specificity (because the true negatives are healthy) and lower sensitivity (because the true positives are less severely ill, on average). Thus, the kappa statistic and positive predictive value would probably be lower than estimated above for a base-rate of 40% and higher than estimated above for a base-rate of 7%. Even with the most optimistic estimates of the positive predictive value of a repeat DIS, at least 15-37% of patients identified as having lifetime depressive disorder on the second DIS will not have lifetime depressive disorder on the first (criterion) measure.
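The base-rate extrapolations above follow from simple arithmetic on an expected two-by-two table (criterion measure by repeat measure). The following is a minimal sketch in Python, assuming that the sensitivity (0.71) and specificity (0.89) quoted in the Summary for the telephone version can stand in for the Table 2 values, which are not reproduced in this section. The function and key names are illustrative only and do not come from the original study.

```python
def expected_cells(sensitivity, specificity, base_rate):
    """Expected proportions in the 2x2 table of criterion vs. repeat measure."""
    true_pos = base_rate * sensitivity
    false_neg = base_rate * (1.0 - sensitivity)
    false_pos = (1.0 - base_rate) * (1.0 - specificity)
    true_neg = (1.0 - base_rate) * specificity
    return true_pos, false_pos, false_neg, true_neg


def agreement_summary(sensitivity, specificity, base_rate):
    """Positive predictive value, test-positive rate, and Cohen's kappa."""
    tp, fp, fn, tn = expected_cells(sensitivity, specificity, base_rate)
    test_positive_rate = tp + fp          # prevalence according to the repeat measure
    ppv = tp / test_positive_rate         # yield of criterion cases among test positives
    observed = tp + tn                    # observed proportion of agreement
    # Chance agreement from the two marginal prevalence rates.
    chance = (test_positive_rate * base_rate
              + (1.0 - test_positive_rate) * (1.0 - base_rate))
    kappa = (observed - chance) / (1.0 - chance)
    return {"test_positive_rate": test_positive_rate, "ppv": ppv, "kappa": kappa}


if __name__ == "__main__":
    # Assumed inputs: Summary values for the telephone version (not Table 2 itself).
    for base_rate in (0.07, 0.20, 0.40):
        result = agreement_summary(0.71, 0.89, base_rate)
        print(base_rate, {k: round(v, 2) for k, v in result.items()})
```

Under these assumed inputs, a 40% base-rate gives a test-positive prevalence of 0.35 and a kappa of about 0.61, in line with the estimates given above. Other figures may differ somewhat from those in the text if the Table 2 values differ from the Summary values.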
The public health consequences of this disagreement depend in part on the policy and clinical importance of the distinction between those with disorder (on both administrations) and those falsely identified as having disorder on the second. The false positives are most likely to be at the margin in terms of meeting diagnostic criteria. On the one hand, such persons may have clinically important depressive symptoms, while inconsistently meeting diagnostic criteria. From a public health perspective, it may be a less serious problem that such persons are “falsely” identified as having a disorder. Alternatively, such persons may have self-limited or mild illnesses of little clinical significance. We are currently evaluating these two alternatives using data from the MOS.
Our data allow us to evaluate two possible sources of disagreement on disorder status between a first and second administration of the DIS: change in diagnostic status among diagnoses with similar criteria and underreporting of symptoms on a follow-up DIS interview. We found that measures of agreement tended to be higher for identification of major depression and/or dysthymia than for either disorder alone. Thus, one source of disagreement on any one disorder is converting from meeting criteria for one of these disorders to the other in the interval between interviews. ROBINS (1985) has noted that persons tend to report fewer psychiatric symptoms on a readministration of the DIS, possibly due to the respondent believing that the interviewer only wants to know data not already reported on the first interview. This phenomenon could be responsible for the bias toward lower reporting of some of the depressive symptoms on the second interview (whether telephone or face-to-face administered). Bias was not significant, however, for the diagnostic variables for either method of administration.
Finally, our conclusions about equivalence of telephone and face-to-face administrations and about reliability of the DIS for identifying depressive disorders are based on a modification of DSM III criteria. We did not apply exclusion criteria other than grief reactions and mania. Thus, our true positives are more heterogeneous than those identified by strict DSM III criteria. Such heterogeneity could affect the sensitivity and specificity of a repeat DIS (telephone or face-to-face), e.g. by making it more difficult to distinguish cases of depression from cases of other psychiatric disorders.
In sum, the telephone version of the depression section of the DIS appears to be equivalent to the face-to-face version, particularly when the prevalence of depressive symptoms is relatively high. Under these conditions, test-retest reliability of the DIS depression section appears to be acceptable for estimating prevalence or for identifying samples consisting mostly of persons with depressive disorders. Agreement on specific lifetime depressive symptoms varies greatly across symptoms, and investigators should be cautious in using a repeat DIS (whether telephone or face-to-face administered) to follow the course of specific depressive symptoms over time; exceptions are suicidal ideation and depressed mood. Our findings generally support the view that data on sensitive or personal matters can be reliably obtained over the telephone. Future studies should focus on the appropriateness of administering other sections of the DIS over the telephone.
Acknowledgements-The authors wish to acknowledge Rosalie Dominik and Sandra Berry for management of field operations and John Ware, Jr, Ph.D., Principal Investigator of the National Study of Medical Care Outcomes.
REFERENCES
ANESHENSEL, C. S. and YOKOPENIC, P. A. (1985) Tests for the comparability of a causal model of depression under two conditions of interviewing. J. Personal. Soc. Psychol. 49, 1337-1348.
ANESHENSEL, C., FRERICHS, R., CLARK, V. and YOKOPENIC, P. (1982a) Measuring depression in the community: a comparison of telephone and personal interviews. Pub. Opinion Q. 46, 110-112.
ANESHENSEL, C., FRERICHS, R., CLARK, V. and YOKOPENIC, P. (1982b) Telephone versus in-person surveys of community health status. Am. J. Pub. Hlth 72, 1017-1021.
ANTHONY, J. C., FOLSTEIN, M., ROMANOSKI, A. J., KORFF, M. R., NESTADT, G. R., CHAHAL, R., MERCHANT, A., BROWN, C. H., SHAPIRO, S., KRAMER, M. and GRUENBERG, E. M. (1985) Comparison of the lay diagnostic interview schedule and a standardized psychiatric diagnosis: experience in eastern Baltimore. Archs gen. Psychiat. 42, 667-675.
COLOMBOTOS, J. (1969) Personal versus telephone interviews: effect on responses. Pub. Hlth Rep. 84, 773-782.
EATON, W. W. and KESSLER, L. G. (Eds) (1985) Epidemiologic Field Methods in Psychiatry: The NIMH Epidemiologic Catchment Area Program. Academic Press, Orlando.
ENDICOTT, J. and SPITZER, R. L. (1978) A diagnostic interview-the schedule for affective disorders and schizophrenia. Archs gen. Psychiat. 35, 837-844.
FLEISS, J. L. (1981) Statistical Methods for Rates and Proportions (2nd Edn). John Wiley, New York.
GREIST, J. H., MATHISEN, K. S., KLEIN, M. H., BENJAMIN, L. S., ERDMAN, H. P. and EVANS, F. J. (1984) Psychiatric diagnosis: what role for the computer? Hosp. Com. Psychiat. 35, 1089-1090, 1093.
HELZER, J. E., ROBINS, L. N., MCEVOY, L. T., SPITZNAGEL, E. L., STOLTZMAN, R. K., FARMER, A. and BROCKINGTON, I. F. (1985) A comparison of clinical and diagnostic interview schedule diagnoses-physician reexamination of lay-interviewed cases in the general population. Archs gen. Psychiat. 42, 657-666.
HENSON, R., ROTH, A. and CANNELL, C. F. (1977) Personal versus telephone interviews: the effects of telephone reinterviews on reporting of psychiatric symptomatology. In Experiments in Health Reporting (Edited by CANNELL, C., OKSENBERG, L. and CONVERSE, J.), pp. 1971-1977. (Also in DHEW Publication No. (HRA) 78-3204, pp. 205-212. Survey Research Center, University of Michigan, Ann Arbor).
HOCHSTIM, J. R. (1967) A critical comparison of the strategies of collecting data from households. J. Am. Statist. Ass. 62, 976-989.
HOUGH, R. L., KARNO, M., BURNAM, M. A., ESCOBAR, J. and TIMBERS, D. M. (1983) The Los Angeles epidemiologic catchment area research program and the epidemiology of psychiatric disorders among Mexican-Americans. J. operat. Psychiat. 14, 42-51.
JORDAN, L. A., MARCUS, A. C. and REEDER, L. G. (1980) Response styles in telephone and household interviewing: a field experiment. Pub. Opinion Q. 44, 210.
JOSEPHSON, E. (1965) Screening for visual impairment. Pub. Hlth Rep. 80, 47.
KULKA, R. A., WEEKS, M. F., LESSLER, J. T. and WHITMORE, R. W. (1984) A comparison of the telephone and personal interview modes for conducting local household health surveys. In Health Survey Research Methods Fourth Conference, pp. 116-127. National Center for Health Services Research, Rockville, MD.
MARCUS, A. C. and CRANE, L. A. (1986) Telephone surveys in public health research. Med. Care 24, 97-112.
REGIER, D. A. and MYERS, J. K. (1984) The NIMH epidemiologic catchment area program. Archs gen. Psychiat. 41, 934-941.
ROBINS, L. N. (1985) Epidemiology: reflections on testing the validity of psychiatric interviews. Archs gen. Psychiat. 42, 918-924.
ROBINS, L. N., HELZER, J. E., CROUGHAN, J. and RATCLIFF, K. S. (1981) National Institute of Mental Health diagnostic interview schedule: its history, characteristics, and validity. Archs gen. Psychiat. 38, 381-389.
ROBINS, L. N., HELZER, J. E. and MARCUS, S. (1986) The Screening DIS. Washington University, St Louis.
SCHULBERG, H. C., SAUL, M., MCCLELLAND, M., GANGULI and CHRISTY, F. (1985) Assessing depression in primary medical and psychiatric practices. Archs gen. Psychiat. 42, 1164-1170.
SHROUT, P. E., SPITZER, R. L. and FLEISS, J. L. (1987) Quantification of agreement in psychiatric diagnosis revisited. Archs gen. Psychiat. 44, 172-177.
SIEMIATYCKI, J. A. (1979) Comparison of mail, telephone and home interview strategies for household health survey. Am. J. Pub. Hlth 69, 238-245.
SIMON, R. J., FLEISS, J. L., FISHER, B. and GARLAND, B. J. (1974) Two methods of psychiatric interviewing: telephone and face-to-face. J. Psychol. 88, 141-146.
WELLS, K. B. (1985) Depression as a Tracer Condition for the National Study of Medical Care Outcomes. The RAND Corporation publication No. R-3293-RWJ/HJK.
YAFFE, R., SHAPIRO, S., FUCHSBERG, R. R., ROHDE, A. and CORPENO, C. (1978) Medical economics survey-methods study: cost effectiveness of alternative survey strategies. Med. Care 16, 641-659.
ZIMMERMAN, M., CORYELL, W., CORENTHAL, C. and WILSON, S. (1986) A self-report scale to diagnose major depressive disorder. Archs gen. Psychiat. 43, 1076-1081.