Inpatient diagnostic assessments: 2. Interrater reliability and outcomes of structured vs. unstructured interviews


Psychiatry Research 105 (2001) 265–271

Inpatient diagnostic assessments: 2. Interrater reliability and outcomes of structured vs. unstructured interviews ☆

Paul R. Miller *

Department of Psychiatry, School of Medical Sciences, University of California, Los Angeles, 2406 Astral Drive, Los Angeles, CA 90046, USA

Received 26 February 2001; received in revised form 5 September 2001; accepted 20 September 2001

Abstract

A preceding study found that structured interviews (SCID-CV, Computer Assisted Diagnostic Interview [CADI]) were significantly more accurate than the unstructured Traditional Diagnostic Assessment (TDA) for making inpatient diagnoses, using Consensus Diagnosis as the standard. This study measured interrater reliability for diagnoses between the Emergency Room (ER) and the Inpatient Unit (IU), as achieved by TDA vs. CADI. It selected subjects from consecutive admissions to the ER who were transferred to the IU. Group 1 had 33 subjects evaluated with TDA, leading to interrater agreement = 45.5% (15/33) and kappa = 0.24 (‘poor’). Group 2 had 39 subjects evaluated with CADI, leading to interrater agreement = 79.5% (31/39) and kappa = 0.75 (‘excellent’). Group 3 had 33 subjects, again evaluated with TDA, leading to interrater agreement = 54.5% (18/33) and kappa = 0.43 (‘fair’). The test–retest–test (TDA–CADI–TDA) format demonstrated that CADI had better interrater reliability than TDA. How diagnostic reliability might correlate with parameters such as timing of treatment choices and length of stay is also measured and discussed. © 2001 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: DSM-IV; Accuracy; Validity; Length of stay

☆ Similar findings with an overlapping sample were presented at the Annual Meeting, American Psychiatric Association, 6 May 2000, in Chicago, Illinois.

* Tel.: +1-323-876-2831; fax: +1-323-876-2831. E-mail address: [email protected] (P.R. Miller).

0165-1781/01/$ - see front matter © 2001 Elsevier Science Ireland Ltd. All rights reserved. PII: S0165-1781(01)00318-3

P.R. Miller / Psychiatry Research 105 (2001) 265–271


1. Introduction

1.1. Structured and unstructured diagnostic interviews

A previous study (Miller et al., 2001) compared three methods of inpatient diagnosis: (1) unstructured Traditional Diagnostic Assessment (TDA), the standard of practice in clinical psychiatry for that task; (2) Structured Clinical Interview for DSM-Clinical Version (SCID-CV), the standard of practice for structured interviews; and (3) the structured Computer Assisted Diagnostic Interview (CADI). The diagnoses from the three methods were then compared with a Consensus Diagnosis, using Spitzer’s ‘Lead Standard’ (Spitzer, 1983). TDA had ‘fair’ agreement with Consensus, using Fleiss’s (1973) standards for kappa; SCID-CV and CADI had ‘excellent’ agreements. Both SCID-CV and CADI were significantly better than TDA for diagnostic agreement with Consensus. CADI also had ‘excellent’ agreement with SCID-CV and ‘excellent’ interrater reliability.

A major difference between CADI and TDA is how much data each directs the clinician to collect. CADI uses questions and answers that appear sequentially on the computer screen; they follow a decision-tree format to cover the 72 DSM-IV diagnostic criteria necessary to assess 10 diagnostic groups (see Table 3 for a list). The TDA write-up formats used by 20 psychiatric clinics and hospitals in Los Angeles County (eight public, eight private, and four medical school) that had referred patients used in this study were analyzed and found to direct the clinician to assess on average 24, or 33% (24/72; range = 8–41), of these 72 criteria. The study was conducted at a major university-affiliated, publicly funded hospital; its TDA write-up coincidentally directed assessment of 24 criteria, like the ‘average’ TDA write-up.

1.2. Hypothesis

This study hypothesizes that CADI, compared with TDA, will achieve better interrater reliability for diagnosis between the psychiatric Emergency Room (ER) and Inpatient Unit (IU).

2. Methods

2.1. Patient-subject assignment

Table 1 shows how the 105 patient-subjects were selected from 300 consecutive admissions to the ER. Patients whom the ER discharged or transferred elsewhere were not in the study. Each subject was assessed in both the ER and IU with the same method, either TDA or CADI. Because the study was naturalistic and involved standard patient care, and because the study’s major independent variable was the interview process, the Committee for the Protection of Human Subjects determined that no informed consent was necessary.

Table 1. Study groups

Group     A: ER admission        B: Used in study:      C: Evaluation method    D: Not used in study: discharged,
          consecutive number     transferred to IU      in both ER and IU       or transferred outside
Group 1   1–100 (a)              33                     TDA                     67
Group 2   101–200                39                     CADI                    61
Group 3   201–300                33                     TDA                     67
Total     300                    105                                            195

(a) These are consecutive admissions, numbered in cohorts of 100.


2.2. Diagnosticians

Twenty-two different clinicians (eight faculty psychiatrists, two PhD faculty psychologists, and 12 2nd–5th year resident psychiatrists whose workups were reviewed by faculty) did the TDA and CADI workups. Different clinicians always did the ER and IU workups.

2.3. Training

No TDA training was conducted, since all clinicians were experienced in using it; the least experienced, the 2nd year residents, had done at least 100. For CADI training, clinicians reviewed its documentation, then evaluated simulated patients in a computer lab. When they began with patient-subjects, the corresponding author reviewed their work with them individually. Because CADI is constructed with the same sequences and topics as TDA, clinicians adapted to CADI without major difficulties.

2.4. Scheduling

The study took 35 days to complete. Assignment of clinicians to both ER and IU was on a regular schedule for days and a rotational schedule for nights and weekends; the 22 clinicians were all involved in evaluating each of the three groups. The 22 clinicians did a total of 139 CADIs (100 in the ER and 39 in the IU), an average of 6.3 each (range 2–10). They performed a total of 266 TDAs (200 in the ER and 66 in the IU), an average of 12.1 each (range 4–18).

2.5. Data available to clinicians

Because this was an open study, clinicians were free to use all available data, including all previous diagnoses. The Los Angeles County Department of Mental Health (LACDMH) keeps a computerized Medical Information System (MIS) that lists all previous diagnoses made in LACDMH-affiliated facilities (hospitals, clinics, contract agencies). ER clinicians had the MIS available, plus hospital records of previous admissions. IU clinicians had all the above plus the ER diagnostic assessment.

2.6. Primary data analysis: interrater reliability

Clinicians were instructed to specify a primary diagnosis, the Axis I disorder that best accounted for this hospital contact. For interrater reliability, the ER primary diagnosis and IU primary diagnosis (always done by different clinicians) were compared. Agreement counted if both diagnoses were in the same diagnostic group (see Table 3 for groups; see Miller et al., 2001, for a detailed listing). The combinations of clinicians carrying out the two diagnostic interviews in the ER and IU were as follows: attending and resident 47.3%; two attendings 37.9%; two residents 14.8%. No combination did significantly better or worse than any other combination.

2.7. Additional data analysis

Other dimensions were measured to assess whether they might correlate with different rates of interrater reliability:

• Length of stay (LOS), measured from day of ER admission to day of IU discharge.

• Day the discharge medication started (DDMS): this was the day of the hospital stay (D1, D2, D3 ...) on which the primary psychotropic medication used at discharge was first prescribed. The rationale was that the medication the patient was taking at time of discharge was the medication that was finally found to be most effective and therefore worth continuing at discharge; hence the day that medication was first ordered was the day that the final effective treatment began.

3. Results

3.1. Demographics

These are summarized in Table 2. There were no significant differences among the groups in gender, age, and race.

Table 2. Demographics

Measure               A: Group 1      B: Group 2      C: Group 3      D: Total
                      (N = 33)        (N = 39)        (N = 33)        (N = 105)
1. Gender: % males    52% (17/33)     62% (24/39)     61% (20/33)     57% (61/105)
2. Age: mean years    36.3            38.5            33.9            36.4
3. Race
   White              42% (14/33)     51% (20/39)     33% (11/33)     43% (45/105)
   Latino             27% (9/33)      31% (12/39)     42% (14/33)     33% (35/105)
   Black              0% (0/33)       10% (4/39)      6% (2/33)       6% (6/105)
   Asian              9% (3/33)       3% (1/39)       6% (2/33)       6% (6/105)
   Other              21% (7/33)      5% (2/39)       12% (4/33)      12% (13/105)

3.2. Interrater reliability

Table 3 (bottom two rows) shows interrater agreements. Initial use of TDA (column A) had ‘poor’ reliability (κ = 0.24; kappa standards by Fleiss, 1973). CADI (column B) had ‘excellent’ reliability (κ = 0.75). When TDA use resumed (column C), reliability declined to ‘fair’ (κ = 0.43).

3.3. Additional data analysis

First was length of stay (LOS) (see Table 4). Subjects evaluated with CADI (column B) had shorter LOS than subjects evaluated with TDA (columns A and C), and the differences were possibly significant (P = 0.0141; P = 0.0636). Second was DDMS (see Table 5). Subjects evaluated with CADI had earlier DDMS than subjects evaluated with TDA, and the differences were possibly significant (P = 0.043; P = 0.0725).

Table 3. Incidence of diagnoses by method (each cell: ER/IU/agree)

Diagnostic group                         A: Group 1 (TDA)    B: Group 2 (CADI)    C: Group 3 (TDA)
1. Cognitive impairment                  0/0/0               0/0/0                0/0/0
2. General medical condition-induced     0/0/0               2/2/2                0/0/0
3. Alcohol-induced                       2/0/0               9/8/5                3/4/3
4. Drug-induced                          2/0/0               3/4/3                3/1/0
5. Mania                                 2/3/1               4/4/4                4/2/1
6. Depression                            8/9/6               8/7/7                11/10/8
7. Schizophrenia                         3/4/0               9/11/8               0/4/0
8. Schizoaffective                       1/5/1               2/2/2                2/2/2
9. Psychosis NOS                         14/11/6             2/1/0                9/9/3
10. Anxiety (and PTSD)                   1/1/1               0/0/0                1/1/1
Totals                                   33/33/15            39/39/31             33/33/18
Percent agree                            45.5% (15/33)       79.5% (31/39)        54.5% (18/33)
Kappa                                    0.24 (poor)         0.75 (excellent)     0.43 (fair)

How to read the columns, using column A, line 6 (depression) as an example: ‘8/9/6’ means that eight patients were diagnosed with depression in the ER, nine patients were diagnosed with depression in the IU, and six of those diagnoses agreed. Kappa evaluations (Fleiss, 1973): < 0.40 = poor; 0.40–0.59 = fair; 0.60–0.74 = good; ≥ 0.75 = excellent.
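The agreement statistics in Table 3 can be illustrated with a short calculation. The sketch below assumes the unweighted Cohen's kappa, with chance agreement taken from the ER and IU marginal counts; the paper does not state which kappa variant was used, so this is a check of the arithmetic rather than a reconstruction of the author's analysis. The function names are hypothetical. It uses the counts from columns B and C and the Fleiss bands quoted in the table note.

```python
def cohen_kappa_from_marginals(er_counts, iu_counts, n_agree):
    """Unweighted Cohen's kappa from per-group ER and IU counts
    plus the number of patients on whom both raters agreed."""
    n = sum(er_counts)
    if n != sum(iu_counts):
        raise ValueError("ER and IU totals must match")
    p_observed = n_agree / n
    # Chance agreement: sum over groups of the product of the two
    # raters' marginal proportions for that group.
    p_expected = sum(e * i for e, i in zip(er_counts, iu_counts)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_label(kappa):
    """Map kappa to the bands quoted under Table 3 (Fleiss, 1973).
    Kappa is rounded to two decimals first, because the bands
    are stated at that precision."""
    k = round(kappa, 2)
    if k < 0.40:
        return "poor"
    if k < 0.60:
        return "fair"
    if k < 0.75:
        return "good"
    return "excellent"


# Table 3, column B (Group 2, CADI): ER and IU counts for the 10 groups.
cadi_er = [0, 2, 9, 3, 4, 8, 9, 2, 2, 0]
cadi_iu = [0, 2, 8, 4, 4, 7, 11, 2, 1, 0]
kappa_cadi = cohen_kappa_from_marginals(cadi_er, cadi_iu, n_agree=31)

# Table 3, column C (Group 3, TDA).
tda3_er = [0, 0, 3, 3, 4, 11, 0, 2, 9, 1]
tda3_iu = [0, 0, 4, 1, 2, 10, 4, 2, 9, 1]
kappa_tda3 = cohen_kappa_from_marginals(tda3_er, tda3_iu, n_agree=18)

print(round(kappa_cadi, 2), fleiss_label(kappa_cadi))
print(round(kappa_tda3, 2), fleiss_label(kappa_tda3))
```

Percent agreement alone (31/39 and 18/33) ignores how often two raters would agree by chance; kappa rescales the observed agreement by the agreement expected from the marginals.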


Table 4. Length of stay

Measure                   A: Group 1 (TDA)    B: Group 2 (CADI)    C: Group 3 (TDA)
1. N                      33                  39                   33
2. Mean days              12.6                7.7                  12.3
3. Difference in days     4.9 (12.6 − 7.7)                         4.6 (12.3 − 7.7)
4. Group vs. group        A vs. B                                  C vs. B
   t (t-test)             2.52                                     1.88
   Degrees of freedom     70                                       70
   Significance           P = 0.0141                               P = 0.0636
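The significance levels in Table 4 can be approximately recovered from the reported t values and degrees of freedom. The sketch below assumes an independent-samples Student t-test (d.f. = 33 + 39 − 2 = 70, which matches the table) and a two-tailed P; the paper does not state the number of tails, so that is an assumption. It integrates the t density numerically with Simpson's rule so that only the standard library is needed; the function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P for a t statistic, using Simpson's rule on [0, |t|].
    steps must be even."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area_0_to_t = total * h / 3          # probability mass between 0 and |t|
    return 1.0 - 2.0 * area_0_to_t       # mass in both tails beyond |t|


df = 33 + 39 - 2                          # pooled-variance degrees of freedom
p_a_vs_b = two_tailed_p(2.52, df)         # Group 1 (TDA) vs. Group 2 (CADI)
p_c_vs_b = two_tailed_p(1.88, df)         # Group 3 (TDA) vs. Group 2 (CADI)
print(f"A vs. B: P = {p_a_vs_b:.4f}")
print(f"C vs. B: P = {p_c_vs_b:.4f}")
```

Because the inputs are t values rounded to two decimals, the computed P values can differ from the published ones in the third or fourth decimal place.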

4. Discussion

4.1. Limitations of the study

The three groups were similar in age, gender, and race (Table 2), so demographic differences were unlikely to have affected the results. Certain limitations could have affected the study: the study was regional, because the extent of multiethnicity and the complete urbanism of Los Angeles County are not typical of the United States. The subject-inpatients were non-representative of psychiatric patients in general, and the psychiatrists were also non-representative, given that both were from a publicly funded university-affiliated hospital. Hence, these results are provisional, as discussed previously (Miller et al., 2001).

4.2. Influences on the study

Because the study was open, knowledge about the project might have influenced the clinicians’ attitudes and performance. All clinicians were informed that we had low diagnostic reliability between the ER and IU and were studying how to improve it. If the clinicians had feelings and opinions about that, then it probably influenced their attitudes similarly. Most clinicians had negative attitudes about replacing the old familiar TDA, because the new, unfamiliar CADI required time and effort to learn and to use. Both these conditions could have influenced the conduct of the study and the results.

Clinicians had previous diagnoses available from the MIS and from old hospital records that could have influenced them. However, a random sampling of 15 patients from each of the three groups found low correlations (κ values < 0.40) between previous MIS diagnoses and current ER and IU diagnoses. Regarding diagnoses made for this study, it appears that the TDA diagnosis from the ER had relatively little influence on the IU diagnosis, given the low interrater reliabilities (see Table 3, columns A and C). However, the ER CADI diagnosis possibly influenced the IU CADI diagnosis, given the high interrater reliability (see Table 3, column B). Since the research design never mixed sequences for the diagnostic method used by the ER and IU (they were always TDA → TDA or CADI → CADI, never TDA → CADI or CADI → TDA), we could not compare how much the ER diagnosis (either TDA or CADI) may have affected the later IU diagnosis.

Table 5. Day discharge medication was started

Measure                   A: Group 1 (TDA)    B: Group 2 (CADI)    C: Group 3 (TDA)
1. DDMS                   5.4                 2.9                  5.0
2. Group vs. group        A vs. B                                  C vs. B
   t (t-test)             2.06                                     1.82
   Degrees of freedom     70                                       70
   Significance           P = 0.0430                               P = 0.0725

DDMS, day of hospitalization (D1, D2, D3 ...) on which the primary psychotropic discharge medication was started.

4.3. Differences in interrater reliability

An explanatory hypothesis for the low interrater reliability among TDA users (Table 3, bottom rows, columns A and C) is that incomplete data collection increases the likelihood of diagnostic imprecision. The TDA write-up format solicited only 33% (24/72) of the necessary criteria, while CADI solicited 100%. Others also found less than optimal levels of data collection by TDA users (Skodol et al., 1984; Garb, 1998; Shear et al., 2000). Miller et al. (2001) found that TDA users evaluated on average only 53% (9.5/18) of Key Criteria (those criteria in diagnostic algorithms that must be assessed in order to rule in or rule out that diagnosis).

4.4. Differences in LOS and DDMS

LOS is a multidetermined outcome, shaped by variables such as diagnoses (which ones, how many), treatment choices (medications, psychotherapy, occupational and social therapy), rate of response to treatments, ward milieu, family attitudes and pressures (early discharge, long-term stay), legal status (holds, court-ordered release), HMO and insurance regulations, medication side effects, medical complications, patient’s insight and judgment (cooperation, or resistance to treatment), policies and practices (hospital administration, psychiatry department, medical and nursing staff and individuals), demographics, discharge plans and placement availability, etc.

The multiple effects of some medications can alter LOS. Clozapine, olanzapine, risperidone, and haloperidol can affect mood as well as psychosis. SSRIs can affect depression and also anxiety, panic, obsession–compulsion, eating disorders, negative symptoms of schizophrenia, etc. Thus, an incorrect or incomplete diagnosis can lead to a medication choice that is coincidentally effective for an undiagnosed disorder, and so the patient gets better even though misdiagnosed.
Attempts to measure the effect of a single variable on LOS are difficult, given the potential effects of other variables. Nonetheless, the attempt is often made: clinical trials measure the effects of one medication on outcomes, including rates and LOS of hospitalization. Clinical trials also link tests of drug effectiveness to diagnosis, and others agree that this is rational: Shear et al. (2000, p. 581) stated, ‘good medical treatment rests on a foundation of accurate diagnosis... Diagnosis-specific, proven efficacious treatments are a major recent advance in psychiatry. Appropriate use of such treatments presupposes patients who meet the diagnostic criteria’. Basco et al. (2000) noted, ‘treatment plans are often based on ... diagnostic type ... the advent of disease-specific treatment protocols has heightened the necessity for accurate diagnostic procedures’.

To summarize, the hypothesis is that early and precise diagnosis → early and precise medication treatment → more rapid recovery → shorter LOS. This has been the rationale for using day the discharge medication started (DDMS) as a measuring rod for that hypothesis. Table 5 shows that patients diagnosed with CADI did have an earlier DDMS, which may have been a contributor to shorter LOS. These findings are suggestive, not conclusive. A further study that controls as many variables as possible is necessary.

4.5. Cognitive behaviors

CADI uses the computer to prompt specific behaviors: collect complete data, match data exactly with DSM algorithms, and explore all diagnoses. Do clinicians who use CADI then internalize these behaviors, so that when the computer is removed, the behaviors persist (i.e. are internalized cognitively)? If so, then the ‘excellent’ interrater reliability that clinicians achieved using CADI should probably continue when the computer is removed and TDA is resumed. The data show that this did not happen; rather, interrater reliability dropped when TDA replaced CADI (Table 3, columns B–C, bottom rows).
This same clinical sequence (traditional method, computer-supported method, traditional method) occurred in an internal medicine study (McDonald, 1976). Its results were similar to ours: clinical/diagnostic improvement occurred with computer support and disappeared when the computer was removed.

4.6. Conclusion

These results justify more study on how and why structured and computer-based methods can improve interrater reliability and possibly contribute to shorter LOS. Such studies need to consider as many variables as possible.

References

Basco, M.R., Bostic, J.Q., Davies, D., Rush, A.J., Witte, B., Hendrickse, W., Barnett, V., 2000. Methods to improve diagnostic accuracy in a community mental health setting. American Journal of Psychiatry 157, 1599–1605.

Fleiss, J.L., 1973. Statistical Methods for Rates and Proportions. John Wiley and Sons, New York.


Garb, H.N., 1998. Studying the Clinician: Judgment Research and Psychological Assessment. American Psychological Association, Washington, DC.

McDonald, C.J., 1976. Protocol-based computer reminders, the quality of care and the non-perfectibility of man. New England Journal of Medicine 295, 1351–1355.

Miller, P.R., Dasher, R., Collins, R., Griffiths, P., Brown, F., 2001. Inpatient diagnostic assessments: 1. Accuracy of traditional versus structured methods (submitted).

Shear, M.K., Breeno, C., Kang, J., Ludewig, D., Frank, E., Swartz, H., Hanekamp, M., 2000. Diagnosis of nonpsychotic patients in community clinics. American Journal of Psychiatry 157, 581–587.

Skodol, A.E., Williams, J.B.W., Spitzer, R.L., Gibbon, M., Kass, F., 1984. Identifying common errors in the use of DSM-III through supervision. Hospital and Community Psychiatry 35, 251–255.

Spitzer, R.L., 1983. Psychiatric diagnosis: Are clinicians still necessary? Comprehensive Psychiatry 24, 399–411.