RELIABILITY OF DIAGNOSIS IN MOOD DISORDERS*


Charles E. Holzer III, PhD, Hoang Thanh Nguyen, BA, and Robert M. A. Hirschfeld, MD

The current version of the US classification system of psychiatric disorders, DSM-IV, represents an ongoing evolution in a classification system that has been influenced by scientific developments as well as the changing needs of American psychiatric practice. This article reviews the issues affecting the reliability and validity of the assessment of mood disorders within the DSM. Particular attention is given to the special difficulties in the assessment of the chronic forms of mood disorders, chronic major depression, double depression, and dysthymia.

RELIABILITY AND VALIDITY AS THEORETICAL CONCEPTS

Reliability and validity of measurement are closely interrelated in psychiatry. Generally, the concept of validity requires that a measurement gets at the true value of the quantity being assessed. Reliability requires that a rater arrives at the same score every time with little random error or that the ratings of different observers are in agreement. In these general terms, reliability is a prerequisite to validity, because if two measurements disagree, then both cannot be right.

When these concepts are translated in terms of psychiatric diagnosis, validity is getting the diagnosis right and reliability is agreement on a diagnosis by two or more clinicians, or one clinician making the same diagnosis at different times. It usually is assumed that if enough clinicians agree, then the diagnosis must be right. Thus, many practitioners in psychiatry use the concepts of reliability and validity interchangeably. Doing so, however, is erroneous. For purposes of clarity, we examine the differences between reliability and validity, and then examine their application to mood disorders, beginning with the concept of the validity of the diagnostic system itself.

*The overall director of the DSM-IV Mood Disorders Field Trial was Martin B. Keller, MD, Brown University, Providence, Rhode Island. Principal investigators at the five sites were Martin B. Keller, MD, Brown University; Daniel Klein, PhD, State University of New York at Stony Brook, Stony Brook, New York; James Kocsis, MD, Cornell University Medical College, New York, New York; James P. McCullough, PhD, Virginia Commonwealth University, Richmond, Virginia; and Robert M. A. Hirschfeld, MD, The University of Texas Medical Branch, Galveston, Texas.

From the Department of Psychiatry and Behavioral Sciences, The University of Texas Medical Branch, Galveston, Texas

THE PSYCHIATRIC CLINICS OF NORTH AMERICA, VOLUME 19, NUMBER 1, MARCH 1996

Validation of Diagnostic Criteria

Traditionally, one would establish the validity of measurement by reference to a gold standard, or through content validation. In psychiatry, as in psychology, however, there is usually no observable gold standard to use as a reference. Therefore, there has been substantial use of face validation, which is defined as a close correspondence between the definition and the proposed criteria. If there is no means of evaluating the definition, the process becomes circular, resulting in “measurement by fiat.”

To avoid this circularity, Robins and Guze,¹⁰ in 1970, proposed a set of rules for establishing the validity of psychiatric diagnosis. They argued that it was not sufficient for a committee to agree on a set of rules in order to have a valid diagnostic system. Instead, they proposed a five-phase approach requiring that the criteria (1) provide a good clinical picture of the disorder, (2) be consistent with laboratory studies, (3) differentiate the disorder in question from other disorders in patients (the disorder ideally being homogeneous within itself), (4) provide reasonable prediction of outcomes in follow-up studies, and (5) identify a familial pattern when there is a genetic component to the disorder. This approach has moved us toward more rigorous classification by requiring data-driven support for each disorder.

Although the Robins and Guze recommendations have had a positive effect on diagnostic systems, there remains a degree of arbitrariness to diagnosis, which is a consequence of several factors. Not all of the conventionally accepted disorders meet the standard proposed by Robins and Guze. The field has not yet identified homogeneous subtypes for each disorder. Diagnoses are not fully predictive of natural course or of treatment outcomes. Research studies have not been able to identify genetic or pathophysiologic bases for most psychiatric disorders.
Thus, many psychiatric disorders, including the affective disorders, continue to be syndromes without well-identified cause or course.

Cloninger⁶ argues that a major limitation of the Robins and Guze approach is its failure to incorporate the method of construct validation. Construct validation is a method proposed by Lee Cronbach⁸ that examines the measurement process in the context of a theoretical model. If the measurement of a diagnostic category behaves in relation to a set of related variables as would be predicted by a theoretical model, then the workings of the model lend support to the specific construct. The relative lack of construct validation reflects the phenomenologic state of psychiatry as a developing science. In the meantime, the nomenclature remains a useful tool for description, communication, and treatment planning, and is otherwise a set of working hypotheses that continue to develop through empirical research.

Current definitions have evolved through different versions of the DSM and may be somewhat different from those used in other countries, for example, those of the International Classification of Diseases. At present, most defined disorders reflect an accumulation of knowledge over many generations of psychiatrists, as well as other mental health professionals, and have demonstrated utility. The process of revision of DSM reflects an attempt to further refine and develop those categories, and occasionally to create new ones to take into account the accumulated findings of recent studies.

In an attempt to evaluate newly proposed disorders and criteria for DSM-IV, the American Psychiatric Association, with support from the National Institute of Mental Health, sponsored empirical evaluation of the proposed DSM-IV criteria. Two strategies were used. One of these was a reanalysis of existing databases to consider the implications of alternative criterion sets. Because many of the proposed changes could not be evaluated in existing data sets, it was also necessary to collect new data through extensive field trials.

This report examines some of the field trial data on the reliability of assessment for DSM-IV mood disorders, in relation to earlier reliability studies based on criteria given in the DSM-II, DSM-III, and DSM-III-R.

Reliability of Psychiatric Diagnosis

Reliability of diagnosis is the extent to which diagnoses made by two or more raters agree. In the absence of a means to directly assess validity, assessment of the reliability of diagnosis becomes a minimal criterion for validity. Thus, assessment of reliability has been an important component of the field trials for DSM-IV, as it was for DSM-III and DSM-III-R.

Reliability can be assessed through a number of different procedures. The simplest procedure, that for test-retest reliability, involves having two or more different psychiatrists conduct independent assessments of the same patients. Ideally, each psychiatrist would use the diagnostic criteria in the same manner as he or she would in individual practice. Assuming that the psychiatrists were reasonably experienced and had ample opportunity to learn the new criterion set, this could provide as much a test of the criterion set as of the abilities of the psychiatrists.
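As an illustration of how agreement between two raters is commonly quantified, the following minimal sketch computes a chance-corrected agreement coefficient. It assumes that the kappa values reported in this article are Cohen's kappa for two raters, which the text does not state explicitly. The function name and the ratings are hypothetical and are not field trial data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one label per patient."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of patients given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over labels of the product of the raters' marginal rates.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical diagnoses for ten patients (illustrative only).
rater_1 = ["MDD", "MDD", "none", "dysthymia", "none",
           "MDD", "none", "none", "dysthymia", "MDD"]
rater_2 = ["MDD", "none", "none", "dysthymia", "none",
           "MDD", "none", "MDD", "dysthymia", "MDD"]

kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 4))  # 0.6875: raw agreement is 0.80, chance agreement is 0.36
```

In this example the raters agree on 8 of 10 patients, yet kappa is lower than 0.80 because some of that agreement would be expected by chance alone given how often each rater uses each label.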


The proposed test-retest scenario, however, has a number of problems that keep it from being ideal. The first of these has to do with the effects of the assessment process on the patient. Only the first psychiatrist’s assessment would approximate conventional psychiatric practice. The second psychiatrist’s assessment is likely to be affected by the influence of the first assessment on the patient and by the passage of time between interviews. If the interviews are close together in time, the patient may become tired or bored from the first interview, or even feel that he or she should not have to repeat what was said in the first assessment. These factors may make reliability appear worse than would be expected from the validity of the criteria. In addition, the patient may remember what the first psychiatrist asked, and use the same language, perhaps learned in the first interview, to respond, which may artificially inflate estimates of reliability, or possibly lower them. Moving the interviews further apart in time may reduce some of those effects, but it also may allow for greater change in the state of the patient and, therefore, make the second assessment different from the first.

Another standard procedure for assessing reliability, called joint rating, is to have one psychiatrist conduct a single interview with the patient while another psychiatrist or rater is present in the room. Depending on the specific procedure, the second rater may be permitted to ask additional questions, perhaps at the end of the interview. Joint rating permits both psychiatrists to have access to the patient at the same time. The interviewing psychiatrist, however, may telegraph critical information, and even the diagnosis, to the second, observing psychiatrist, thus providing an artificial increase in agreement. If both ask questions, this problem is compounded. Estimates of reliability from joint ratings are generally higher than those from consecutive interviews.

In standard psychometric procedures, a third method for estimating reliability is to examine the internal consistency of a test. In tests where the items are sampled from a single domain of content, and one can assume no systematic bias, reliability can be estimated from the internal consistency of the measure. Various procedures such as split-half and Cronbach’s alpha are used for that purpose. Psychometric theory indicates that reliability increases with the length of the test and with the internal consistency of a test. The criterion-based approach used in DSM-IV is not well suited for analyses of internal consistency, however, because the diagnostic criteria are not intended to have equivalent or redundant items as are found in many psychological tests. The diagnostic process, especially as defined by DSM-IV, is based on a set of rules, rather than on a summation or average of scores from related items.

This does not exhaust the possibilities for studying reliability. In the DSM-IV Field Trial for Mood Disorders, as well as a number of previous studies, alternative procedures were attempted. These included (1) videotaping the initial interview, with video reratings by other clinicians, and (2) reinterviewing 6 months after the initial interview, but asking about the time of the initial interview. Those alternatives are discussed in a later section.
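The internal-consistency estimates mentioned above can be sketched briefly. The block below assumes the standard textbook formula for Cronbach's alpha and uses the Spearman-Brown formula as one common way to express the effect of test length on reliability; the article names neither formula in detail, and the function names and item scores are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("every respondent needs the same number (>= 2) of item scores")
    # Variance of each item across respondents.
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the total (summed) score across respondents.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor with comparable items."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical scores: four respondents, three items each (illustrative only).
scores = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [5, 4, 4]]
alpha = cronbach_alpha(scores)
doubled = spearman_brown(0.5, 2)  # a test with reliability 0.5, doubled in length
```

Both calculations presuppose items that are interchangeable samples from one content domain, which is exactly the assumption the text says does not hold for rule-based diagnostic criteria.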


Sources of Unreliability

The process of conducting a diagnostic interview can be considered in two parts: (1) asking questions and recording answers and (2) combining the answers into a diagnosis. Both parts contribute variability and error to the diagnostic process. The first source of error is identified as information variance. It includes differences in how an interviewer asks questions, how the patient responds to those questions, and how the interviewer interprets the responses given. For structured interviews this also can include simple clerical errors in recording answers. The second source of variability, called criterion variance, comes from differences in the way the rater combines the symptoms into a diagnosis. Prior to DSM-III, this process was more one of pattern matching than one of applying formal rules. Since the formulation of DSM-III, formal rules are provided for combining symptom criteria into diagnoses.

The development of formal diagnostic criteria as seen in DSM-III, and later, has been focused on the reduction of criterion variance. This approach represents a compromise between the pattern matching and clinical intuition, which were dominant elements in earlier diagnosis, and the more structured criterion sets developed for research use. Many of the features of the Research Diagnostic Criteria and the Feighner Criteria were incorporated into DSM-III and continue to influence the diagnostic process. Although often criticized as limiting the application of clinician skill, the formal criteria have improved diagnostic agreement a great deal.

Use of Structured Interviews

Structured interviews were initially the domain of researchers who collected lots of data about symptoms and then struggled to figure out what it really meant. Even with the increasing use of symptom-based scales, clinicians and psychometricians could not always agree on diagnosis. The diagnostic process began to change after the US-UK studies of the 1970s identified differences between the United States and England in the relative prevalence of schizophrenia and affective disorders.⁷ At first it was thought that these two populations might be truly different, but the use of structured interviews to describe symptom patterns in the United States and the United Kingdom demonstrated that the major differences in diagnosis were in the clinicians’ use of diagnostic criteria and not in the specific symptoms being reported.¹⁶ Subsequent research led to the development of the Schedule for Affective Disorders and Schizophrenia, which used the Research Diagnostic Criteria, and eventually to the development of the Structured Clinical Interview for DSM-III-R (SCID).¹⁴ Parallel developments in the United Kingdom included development of the Present State Examination and, later, the Schedules for Clinical Assessment in Neuropsychiatry (SCAN), which were both based on the International Classification of Diseases.


These and related structured diagnostic instruments have become a “gold standard” for making diagnoses in clinical trials, treatment outcome studies, and even epidemiologic studies. Structured diagnostic assessments such as the SCID have been found to be far more reliable than unstructured clinical assessments. It is not surprising, therefore, that the SCID, a structured clinical interview, was adopted as a basis for examining the alternative criterion sets proposed for DSM-IV. For use in the DSM-IV field trials of affective disorders, the SCID was modified to include alternative proposed criterion sets and their corresponding symptoms.

REVIEW OF RELIABILITY STUDIES FOR MOOD DISORDERS

Historically, the reliability for mood disorders has improved as psychiatric diagnosis has moved to more specific criteria and to more structured interviews. Earlier versions of DSM maintained somewhat different categories of mood disorders than does DSM-IV.⁵ DSM-I¹ differentiated psychotic affective disorders such as manic depressive reactions, manic and depressed types, and psychotic depressive reaction, from psychoneurotic depressive reaction. These were described in dynamic terms that required inference about internal dynamic processes, and that lacked specific inclusion or exclusion rules. DSM-II² maintained similar categories for affective disorders, while adding the circular type of manic depression and characterizing disorders as “illnesses” rather than “reactions.” DSM-II provided descriptions primarily in terms of observable symptoms and episodes, but generally emphasized the descriptive pattern of the phenomenon rather than specific rules.

Spitzer and Wilson¹² conducted a review of nine studies of diagnostic agreement between psychiatrists using DSM-II criteria. They found relatively low agreement for the overall classification of affective disorders, with a mean κ of 0.37, as compared with a mean κ of 0.54 for psychosis/schizophrenia and 0.71 for alcoholism. Agreement was even lower within the specific categories of neurotic depression with a mean κ of 0.21, psychotic depression with a mean κ of 0.19, and manic depression with a mean κ of 0.33.

DSM-III³ represented a major change in the process for developing the nomenclature, as well as introducing formal rules for the diagnostic process. DSM-III introduced field trials as a means of evaluating the new classifications before their adoption. DSM-III also was intended to be “atheoretical with regard to etiology” and it specifically recognized heterogeneity among individuals having specific disorders. DSM-III introduced specific classification rules for each disorder and was multiaxial.
Although the disorders were still referred to as “affective disorders,” DSM-III notes that the proper descriptive term should be “mood disorders.” A major change for the diagnosis of these disorders was the elimination of the primary distinction between psychotic and neurotic depression, relegating that distinction to the subclassification “with psychotic features.” Within the affective disorders section, the descriptive categories “major depressive episode” and “manic episode” were retained as intermediate components of the diagnosis of major depression and/or bipolar disorder. Major depressive episode required dysphoria or loss of interest, in addition to four of eight specific symptoms, and a duration of at least 2 weeks. It also excluded the schizophrenias, delusional disorder, which was mood incongruent, organic disorder, and uncomplicated bereavement. The diagnosis of major depression required having a major depressive episode, but never having had a manic episode. Manic episode had correspondingly specific criteria and exclusions, but permitted prior depressive episodes in the diagnosis of bipolar disorder, depending on the subtype. Dysthymia was defined as requiring at least 3 of 13 symptoms and a 2-year duration.

The Field Trials for DSM-III report reliabilities from two phases of field trials. In phase one, approximately 60% of adult (n = 339) as well as child diagnostic assessments (n = 71) were done in separate evaluations. In phase two, about two thirds of evaluations were done separately for adults (n = 331) and children (n = 55). For adults, the overall reliability for affective disorders combined produced a κ of 0.69 in phase one and a κ of 0.83 in phase two. Reliabilities were lower for specific subgroups of disorders. For “major affective disorders,” reliabilities resulted in κ values of 0.68 and 0.80, respectively, for phase one and phase two. For “other specific affective disorders,” the κ values were 0.49 and 0.69 for the two phases. For “atypical affective disorders,” the κ values were 0.29 and 0.49, respectively.

Spitzer et al¹³ reported reliabilities from the field trials by differentiating ratings by clinicians using the joint rating method (n = 150) and the test-retest method (n = 131).
For affective disorders as a group, the joint agreement κ was 0.77, and for test-retest agreement the κ was 0.59. Clearly, the joint rating method provides higher reliabilities. For major affective disorders as a group, the joint rating κ was 0.70 and for test-retest the κ was 0.59. For the assessment of chronic minor depression (e.g., dysthymia and cyclothymic disorders), the joint rating κ was 0.64, but for test-retest the κ was only 0.29.

Riskind and colleagues⁹ have reported reliabilities for DSM-III based on the rerating of videotapes of SCID interviews for 75 psychiatric outpatients. They found overall agreement for all disorders in the study to have a κ of 0.74, and major depression to have a κ of 0.72.

Skre and colleagues¹¹ in Norway have reported reliabilities for audiotape reratings of participants in a twin study of mental disorder. A total of 54 interviews were audiotaped and, subsequently, rerated by two raters. They found a κ of 0.93 for overall mood disorder (n = 31), a κ of 0.93 for major depression (n = 25), and a κ of 0.88 for dysthymia (n = 7).

Williams and colleagues¹⁵ have presented an examination of the reliability of the SCID for DSM-III-R in a multisite study with 592 subjects of whom 390 were patients. Reliabilities were assessed using stringent test-retest methods. Within the patient group they found an overall weighted agreement having a κ of 0.61 for all current disorders and a κ of 0.68 for lifetime disorders. Agreement for current bipolar disorder had a κ of 0.84, and for lifetime disorder, a κ of 0.84. Agreement for major depression was somewhat lower with a κ for current disorder of 0.64 and 0.69 for lifetime disorder. Dysthymia had a low reliability with a κ of 0.40. The recency of last occurrence for dysthymia was apparently not defined. Within the nonpatient groups, current major depression produced a κ of 0.42, and lifetime major depression produced a κ of 0.49. Reliability for dysthymia in this nonpatient group was low, with a κ of 0.53, which is nonetheless higher than for the patient group.

Design and Procedures of DSM-IV Mood Disorders Field Trial

During the formulation of DSM-IV, an effort was made to include field trials for the major disorders undergoing change. For the section on mood disorders, a field trial was required by revisions to the set of symptoms to be used for dysthymia, and by the introduction of longitudinal course modifiers when both depression and dysthymia are present. The Field Trial was designed to examine lifetime course descriptors for depression and dysthymia, alternative criterion sets for dysthymia, and definitions for depressive personality.

The initial wave of interviews in the Field Trial for Mood Disorders was a five-site survey of 524 adults with depressed mood and at least two additional symptoms of depression or dysthymia. Respondents were recruited from inpatient, outpatient, and community settings. SCID interviews were conducted by trained research interviewers with master’s degrees or higher. The version of the SCID used in the interview was modified to include criteria for dysthymia from DSM-III, DSM-III-R, and the proposed DSM-IV. The section for major depression was unmodified except for inclusion of additional information about onset and duration. At the end of the mood disorders section of the SCID, the set of proposed longitudinal course modifiers was added. These specifiers identify whether dysthymia and major depression occurred separately during the lifetime course of disorder, and when they both occurred, their sequence, and the degree of recovery between episodes.

As part of this trial, a videotape-based rerating study was conducted, including reratings at the same site where the interview was conducted and reratings at one other site. For the rerating study, 20 interviews were videotaped at each of four field study sites, with sets of five systematically selected copies being sent to each of the other four sites. At the fifth site, five interviews were videotaped, with copies of all five being distributed to each of the four other sites. Each of the 25 original videotapes was rerated at the originating site by a randomly selected interviewer other than the one who had made the tape. The copies received at each remote site were distributed randomly among the interviewers at that site and were rerated.
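The local rerating rule described above (a randomly selected interviewer other than the one who made the tape) can be sketched as follows. This is only an illustration of the selection constraint; the function name, data layout, site labels, and interviewer names are all hypothetical and do not come from the field trial.

```python
import random


def assign_local_reraters(tapes, interviewers_by_site, seed=None):
    """Pick, for each tape, a random rerater at the same site who is not the original interviewer.

    tapes: list of (tape_id, site, original_interviewer) tuples.
    interviewers_by_site: dict mapping site -> list of interviewer names.
    """
    rng = random.Random(seed)  # a seed makes the random draw reproducible
    assignments = {}
    for tape_id, site, original in tapes:
        # Exclude the interviewer who conducted (and so already rated) the interview.
        candidates = [name for name in interviewers_by_site[site] if name != original]
        if not candidates:
            raise ValueError(f"no eligible rerater at site {site!r} for tape {tape_id!r}")
        assignments[tape_id] = rng.choice(candidates)
    return assignments


# Hypothetical sites, interviewers, and tapes (illustrative only).
interviewers_by_site = {"site_A": ["ann", "bo", "cy"], "site_B": ["dee", "eli"]}
tapes = [("t1", "site_A", "ann"), ("t2", "site_A", "cy"), ("t3", "site_B", "eli")]
local_reraters = assign_local_reraters(tapes, interviewers_by_site, seed=1)
```

Excluding the original interviewer keeps the rerating independent of the first rating, so that agreement reflects the information on the tape rather than one person's repeated judgment.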


RESULTS OF THE DSM-IV FIELD TRIAL RELIABILITY STUDY

Three sets of comparisons among raters were examined, corresponding to (1) the agreement between the original interviewer and the rerater at the original site, (2) the original interviewer and the rerater at the remote site, and (3) the two different videotape reraters, one local and one remote. These were examined for major depression, dysthymia, and for the proposed longitudinal course modifiers.

Major Depression

For lifetime and current major depression, the comparison between the initial interview and rerating at the same site provided the highest κ values (0.77 and 0.72). This indicates fairly good agreement, but not exceptional agreement, given that these are videotape reratings rather than test-retest interviews. The agreements of remote site reratings with the original interview and local site reratings are only moderately good, having κ values ranging from 0.52 to 0.68.

Dysthymia

For lifetime and current dysthymia, the agreement between initial interview and local site rerating had κ values of 0.81 and 0.82, respectively. The cross-site reratings for lifetime dysthymia have somewhat lower κ values (0.67 and 0.65), but are better than ratings of dysthymia for the past month (0.51 and 0.59). This may reflect difficulties with a nonstandard item added to the interview for the Field Trial, which asks for the number of months since the respondent was last dysthymic. Examination of the types of disagreements indicates a tendency for remote ratings to diagnose dysthymia more frequently than the initial interview or the local rerating.

Longitudinal Course Modifiers for Mood Disorders

Within the rerating subsample of 78 respondents, there were 43 respondents who had both major depression and dysthymia at some time in their lives. Within this group, cases were classified by the presence of single or repeat episode depression, whether the dysthymia occurred at least 6 months prior to the major depression, and whether the respondent had fully or only partially recovered between episodes. This broke out into the following six categories:

1. Single episode major depressive disorder, with antecedent dysthymia
2. Single episode major depressive disorder, without antecedent dysthymia
3. Repeat episode major depressive disorder, with antecedent dysthymia, and full recovery
4. Repeat episode major depressive disorder, with antecedent dysthymia, and not full recovery
5. Repeat episode major depressive disorder, without antecedent dysthymia, and full recovery
6. Repeat episode major depressive disorder, without antecedent dysthymia, and not full recovery.
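The six categories above follow from three yes/no determinations, which can be sketched as a small classification function. This is an illustrative reading of the list, not the scoring procedure used in the Field Trial; the function names, argument names, and the use of onset times expressed in months are assumptions.

```python
def is_antecedent(dysthymia_onset_month, depression_onset_month):
    """True if dysthymia began at least 6 months before the major depression."""
    return depression_onset_month - dysthymia_onset_month >= 6


def longitudinal_course(repeat_episode, antecedent_dysthymia, full_recovery=None):
    """Return the course category number (1 to 6) from the list above."""
    if not repeat_episode:
        # Single-episode categories do not use the recovery dimension.
        return 1 if antecedent_dysthymia else 2
    if full_recovery is None:
        raise ValueError("full_recovery is required for repeat-episode courses")
    if antecedent_dysthymia:
        return 3 if full_recovery else 4
    return 5 if full_recovery else 6


# Hypothetical respondent: dysthymia onset at month 0, first major depression at
# month 14, repeat episodes, incomplete recovery between episodes.
course = longitudinal_course(
    repeat_episode=True,
    antecedent_dysthymia=is_antecedent(0, 14),
    full_recovery=False,
)
```

Because two raters must agree on all three determinations to land in the same category, a disagreement on any single one (for example, the timing of dysthymia onset) is enough to produce a different course rating.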

The overall κ, including six categories, was 0.59 for the comparison between the original interviewer and local rerater (n = 43). The κ comparing the initial interviewer with the remote rerater was 0.53 (n = 40), and the κ comparing the local with remote reraters was 0.59. These represent an acceptable level of agreement among raters for a rather complex set of specific categories. This suggests that, in addition to being able to rate the presence or absence of the major depression and dysthymia, it is reasonable to classify mood disorders by their overall course.

Within this examination of longitudinal course modifiers, it is possible to approximate two specific chronic forms of depression, although not exactly. Course 4, which has antecedent dysthymia followed by repeat episode depression, without full recovery, corresponds most closely to a diagnosis of “double depression.” Kappas for agreement on course 4, “double depression,” among the three rerating comparisons were 0.54, 0.66, and 0.57, respectively. Course 6, having recurrent major depression, without antecedent dysthymia, and without full recovery, corresponds most closely to a diagnosis of “chronic major depression.” Kappas for course 6 are 0.37, 0.41, and 0.65 for the three rerating comparisons. Although somewhat lower than for course 4, these also indicate that rating this specific combination of factors is feasible.

DISCUSSION

In examining the reliability of diagnosis for mood disorders, several trends are clear. The first is that the progression toward formal criteria results in improved reliability. Second, the use of structured diagnostic instruments leads to improved reliability as compared with unstructured interviews. The third trend is a progression from examining broad diagnostic categories toward examination of specific subtypes, especially those dealing with longitudinal course. This process places greater demands on the criteria, on the structured instruments, and on the clinician implementing those criteria. As the categories become narrower, finer distinctions are made, and reliability is less than when broad categories are used. This is particularly evident for the longitudinal course modifiers where consistency in diagnosis requires matching not only the diagnoses of major depression and dysthymia, but also onset, duration, severity, and the degree of recovery taking place between episodes. This is a substantial change from most other diagnostic processes, which are ascertained at the time of interview, with a requirement of duration going back weeks or months, or in the case of dysthymia, 2 years.

It seems that the identification of course is not only feasible and adequately reliable, but very important for both clinical and scientific purposes. For clinical work, determining the course influences expectations about the usual state of the individual, their personality, and the degree of recovery expected in treatment. For investigations into the epidemiology of these disorders, especially their causes, tracking course back to an apparent onset permits identification of the relevant risk factors and not just consequences of the disorder.

To increase the reliability and validity of these assessments of chronic depression, it appears that more work is needed in several areas. The first area in need of improvement is the training of clinicians in life course methods, with greater emphasis on the onset of disorders. Second, the nomenclature itself needs to address the rules for differentiating double and chronic major depressions. The rules surrounding the degree of recovery permitted between episodes and the various course patterns remain open to substantial reinterpretation by different clinicians. Other issues include the differentiation of dysthymia and major depression based on timing and symptom pattern.

Finally, the structured instruments need further refinement with regard to these chronic depressions. The SCID, for example, offers two tracks of ascertainment, one for current disorder, and in the absence of current disorder, one to ascertain lifetime disorder. To differentiate alternative courses, both tracks need to deal with the timing of onset and duration and the severity of multiple episodes.
There is no formal ascertainment of the degree of recovery between episodes. Perhaps it would be possible to identify these parameters on an episode by episode basis. Versions of questions to facilitate this process have been tried in the DSM-IV Mood Disorders Field Trial and in subsequent clinical trials, but they are not ready for incorporation into the standard version of the SCID. Thus, there remains ample opportunity for development of instrumentation for ascertainment of the chronic disorders. Another issue to be added to that consideration is the role of depressive personality versus dysthymia in the life course of depression.

This review of reliability studies demonstrates continuing improvement in psychiatric diagnosis, but there is an ongoing need to review modifications as they are made.

References

1. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, ed 1. Washington, American Psychiatric Association, 1952
2. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, ed 2. Washington, American Psychiatric Association, 1968
3. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, ed 3. Washington, American Psychiatric Association, 1980, pp 4-7, 205
4. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, ed 3, revised. Washington, American Psychiatric Association, 1987, p 388
5. American Psychiatric Association: The Diagnostic and Statistical Manual of Mental Disorders, ed 4. Washington, American Psychiatric Association, 1994
6. Cloninger CR: Establishment of diagnostic validity in psychiatric illness: Robins and Guze’s method revisited. In Robins LN, Barrett JE (eds): The Validity of Psychiatric Diagnosis. New York, Raven Press, 1989, pp 9-18
7. Cooper JE, Kendell RE, Gurland BJ, et al: Psychiatric Diagnosis in New York and London. Maudsley Monograph No. 20. London, Oxford University Press, 1972
8. Cronbach LJ: Essentials of Psychological Testing, ed 3. New York, Harper & Row, 1970, pp 142-145
9. Riskind JH, Beck AT, Berchick RJ, et al: Reliability of DSM-III diagnoses for major depression and generalized anxiety disorder using the structured clinical interview for DSM-III. Arch Gen Psychiatry 44:817-820, 1987
10. Robins E, Guze SB: Establishment of diagnostic validity in psychiatric illness: Its application to schizophrenia. Am J Psychiatry 126:983-987, 1970
11. Skre I, Onstad S, Torgersen S, et al: High interrater reliability for the Structured Clinical Interview for DSM-III-R Axis I (SCID-I). Acta Psychiatr Scand 84:167-173, 1991
12. Spitzer RL, Wilson PT: Nosology and the official psychiatric nomenclature. In Freedman AM, Kaplan HI, Sadock BJ (eds): Comprehensive Textbook of Psychiatry. Baltimore, Williams & Wilkins, 1975, pp 826-845
13. Spitzer RL, Forman JBW, Nee J: DSM-III field trials: I. Initial interrater reliability. Am J Psychiatry 136:815-817, 1979
14. Spitzer RL, Williams JBW, Gibbon M, First MB: Structured Clinical Interview for DSM-III-R. Washington, American Psychiatric Press, 1990
15. Williams JBW, Gibbon M, First MB, et al: The structured clinical interview for DSM-III-R (SCID). Arch Gen Psychiatry 49:630-636, 1992
16. Wing JK: Reasoning about Madness. London, Oxford University Press, 1978, p 103

Address reprint requests to
Charles E. Holzer III, PhD
Room 1.213 Graves
Department of Psychiatry and Behavioral Sciences
The University of Texas Medical Branch
301 University Boulevard
Galveston, TX 7755543429