The logistic regression analysis of psychiatric data

J. psychiat. Res., Vol. 20, No. 3, pp. 195-209, 1986. Printed in Great Britain.

0022-3956/86 $3.00 + .00 Pergamon Journals Ltd.

THE LOGISTIC REGRESSION ANALYSIS OF PSYCHIATRIC DATA

JOSEPH L. FLEISS*†, JANET B. W. WILLIAMS*‡ and ALAN F. DUBRO*§

*Biometrics Research Department, New York State Psychiatric Institute; †School of Public Health, Columbia University; ‡Department of Psychiatry, Columbia University; and §Department of Psychology, University of Arizona, U.S.A.

(Received 7 August 1985; revised 6 January 1986)

Summary-Logistic regression is presented as the statistical method of choice for analyzing the effects of independent variables on a binary dependent variable in terms of the probability of being in one of its two categories vs the other. The method, which must be applied by computer, is illustrated on data from the DSM-III field trials. The dependent variable is treatment with behaviourally-oriented psychotherapy vs treatment with psychoanalytically-oriented psychotherapy, and the independent variables are several patient and clinician characteristics. Like ordinary multiple regression, the method is shown capable of analyzing categorical as well as continuous independent variables. Unlike ordinary multiple regression when applied to binary data, logistic regression analysis necessarily yields estimated probabilities that lie between 0 and 1. The measure of association derived from logistic regression analysis, the odds ratio, is defined. Methods for making inferences about it are presented and illustrated.

Address for correspondence: Joseph L. Fleiss, Ph.D., Division of Biostatistics, School of Public Health, Columbia University, 600 W. 168th Street, New York, NY 10032.

INTRODUCTION

MANY studies of mental disorders call for each of a sample of subjects to be measured on a response or dependent variable $Y$ as well as predictor or independent variables $X_1, \ldots, X_K$. $Y$’s dependence on the $K$ independent variables is typically studied by means of multiple regression analysis in which a linear model for the mean value of $Y$,

$$\text{mean } Y = \alpha + \sum_{i} \beta_i X_i,$$

is assumed and in which the regression coefficients are estimated, tested for significance, and bounded by confidence intervals (COHEN and COHEN, 1975; DRAPER and SMITH, 1981).

When the response variable is binary ($Y = 1$ for one of two outcomes and $Y = 0$ for the other), the above model in effect postulates that the probability of observing a particular outcome is a linear function of the $K$ independent variables. Provided that the probabilities are close to 0.50 for all values of $X_1, \ldots, X_K$ (some probabilities may be as low as 0.20 or as high as 0.80), the assumption of linearity may be reasonable and the results of the traditional regression analysis will likely make sense. When, however, either relatively rare events (whose probabilities are less than 0.20) or relatively prevalent events (whose probabilities exceed 0.80) are being studied, a traditional multiple regression analysis may produce nonsensical results such as predicted probabilities less than 0 or greater than 1.

Logistic regression analysis is a variant of traditional regression analysis for binary dependent variables which does not suffer from the weaknesses of the traditional approach to such data but which nevertheless shares many of its powerful features such as the ability to study quantitative as well as categorical independent variables and to include in the model interactions among the independent variables (Cox, 1977). The method has become widely adopted in such specialties as cardiology (Moss et al., 1981) and epidemiology (KLEINBAUM et al., 1982; TRUETT et al., 1967), but apparently not in psychiatry. In this paper we illustrate the use of logistic regression analysis with data from the DSM-III field trials to identify those characteristics of patients and clinicians that are associated with the choice between behaviourally-oriented psychotherapy and psychoanalytically-oriented psychotherapy.

METHODS

Clinical methods

The data set that will be used to illustrate logistic regression analysis was collected during a national field trial testing the reliability and feasibility of the third edition of the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders, commonly called DSM-III (AMERICAN PSYCHIATRIC ASSOCIATION, 1980). This field trial, sponsored by the NIMH, has been described elsewhere (SPITZER et al., 1979). Briefly, clinicians were invited to participate in the project through notices appearing in Psychiatric News (the monthly newsletter of the American Psychiatric Association) and other mental health publications and by individual letters sent to community mental health centers and the membership of the American Academy of Child Psychiatry. All who agreed to complete the required work were accepted as participants, either as private practitioners or as groups of clinicians working within facilities. The clinicians were from all parts of the country and worked in both urban and rural settings.

The field trial itself was divided into two phases. In each phase clinicians used a draft of DSM-III to evaluate at least 20 patients selected from their practice as either consecutive admissions or “catch-as-catch-can.” The results of each evaluation were recorded on a diagnostic report form (DIRE) that provided space for recording basic demographic data (age, sex, ethnic-racial background) and information on each of the five axes of the DSM-III multiaxial system (mental disorder diagnoses, personality disorders, physical disorders and conditions, severity of psychosocial stressors, and highest level of adaptive functioning in the past year) (WILLIAMS, 1981).

In the second phase of the field trial the DIREs also included an additional section in which the clinicians were asked to note which of thirteen treatment categories was their first choice of treatment for each patient (there were four categories of psychotherapy or special education, seven of somatic therapy, a category for custodial care, and one for no treatment). Data were also collected about several demographic and professional characteristics of the clinicians.

In selecting the variables to be included in the present analysis, we chose as the dependent variable one or the other of a pair of treatments that one would expect to be differentially associated with various patient and clinician variables. Of the thirteen treatment categories that had been recorded, two of the psychotherapeutic modalities, behaviourally-oriented psychotherapy (abbreviated as behavior modification) and psychoanalytically-oriented psychotherapy (abbreviated simply as psychotherapy), seemed to be ideal candidates. Both of these treatment approaches have a highly developed conceptual framework that guides the treatment and each has a large number of specific disorders for which it is considered (by its practitioners) to be the treatment of choice. Furthermore, clinicians tend to specialize in providing one or the other of the treatments but rarely both.

Of the over two hundred specific DSM-III diagnoses, 16 diagnostic groups were chosen to maximize the likelihood of a differential relationship with the two treatment categories. Diagnostic categories such as Schizophrenia and Organic Mental Disorders, for which neither treatment would generally be considered the treatment of choice, were not included. The data analysis was restricted to those patients for whom one of the 16 chosen diagnostic groups was the principal diagnosis, that is, the diagnosis that occasioned the admission to clinical care. The chosen diagnostic groups included conduct, anxiety, substance use, depressive, dysthymic, adjustment and personality disorders.

Patient variables that were included in the analysis as independent variables, in addition to principal diagnosis, were age, sex, race (white or non-white), setting in which the patient was evaluated (in- or out-patient), whether or not there was a secondary associated Personality Disorder diagnosis, severity of psychosocial stressors (Axis IV of the DSM-III multiaxial system), and highest level of adaptive functioning during the past year (Axis V). Clinician variables included in the analysis as independent variables were professional background (MD vs non-MD), race (white or non-white), number of years of professional experience, major type of work setting (clinical vs academic), and treatment orientation (psychoanalytically-oriented psychotherapy, behaviorally-oriented psychotherapy, or other).

The statistical model

The linear model postulated by logistic regression analysis is like the one underlying traditional multiple regression analysis, except that it is the log odds for one outcome versus the other that is assumed to vary linearly with a set of predictors. Let $P$ denote the proportion of individuals in a population assigned to one of the two levels of the dependent variable. Exactly the same information as contained in $P$ is contained in the associated odds for being assigned to the first category rather than the second,

$$\text{odds} = \frac{P}{1-P}.$$

Exactly the same information as in $P$ and the odds, finally, is contained in the associated log odds,

$$\text{log odds} = \ln\frac{P}{1-P},$$

where ln denotes the natural logarithm. The log odds is especially suitable for representation by linear models (Cox, 1977; EVERITT, 1977; FLEISS, 1981). Logistic regression analysis, therefore, is a set of procedures for making inferences about the factors affecting a probability in the context of the model

$$\ln\frac{P}{1-P} = \alpha + \sum_{i} \beta_i X_i.$$
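The equivalence of a probability, its odds, and its log odds can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative choices made here and do not come from any of the packages mentioned in this paper.

```python
import math

def odds_from_prob(p: float) -> float:
    """Odds for the first category rather than the second: P / (1 - P)."""
    return p / (1.0 - p)

def log_odds_from_prob(p: float) -> float:
    """Natural logarithm of the odds."""
    return math.log(odds_from_prob(p))

def prob_from_log_odds(z: float) -> float:
    """Inverse transformation: exp(z) / (1 + exp(z)), always between 0 and 1."""
    return math.exp(z) / (1.0 + math.exp(z))

def linear_log_odds(alpha: float, betas: list, xs: list) -> float:
    """Log odds under the linear model: alpha + sum of beta_i * X_i."""
    return alpha + sum(b * x for b, x in zip(betas, xs))
```

Because the log odds can take any real value while the back-transformed probability stays strictly inside (0, 1), a linear model on this scale cannot produce an impossible probability.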

There are several well-documented packages of computer programs for carrying out a logistic regression analysis, including BMDP and SAS.

With the sample estimates of $\alpha$ and $\beta_1, \beta_2, \ldots$ denoted by $a$ and $b_1, b_2, \ldots$, the predicted log odds for subjects with values $X_1, \ldots, X_K$ on the independent variables is

$$\ln\frac{p}{1-p} = a + \sum_{i} b_i X_i,$$

and the predicted probability is

$$p = \frac{\exp\left(a + \sum_{i} b_i X_i\right)}{1 + \exp\left(a + \sum_{i} b_i X_i\right)}.$$

Note that $p$ always lies between 0 and 1. In ordinary multiple regression analysis, the coefficient $b_i$ is the estimated average change in the dependent variable per unit change in $X_i$, with the other variables held fixed. In logistic regression analysis, $b_i$ is the estimated average change in the log odds per unit change in $X_i$ and $\exp(b_i)$ is the estimated odds ratio (OR) associating $X_i$ with the dependent variable, the other variables being held fixed. In general, if $p_U$ and $p_V$ are the estimated probabilities for subjects with different values on an independent variable, the odds ratio is

$$OR = \frac{p_U(1-p_V)}{p_V(1-p_U)}.$$

For example, the proportion of patients prescribed psychotherapy (rather than behavior modification) by MDs will be seen in the Results section to be, say, $p_U = 0.8053$; the proportion prescribed psychotherapy by non-MDs was, say, $p_V = 0.5339$. The resulting odds ratio associating professional background and treatment assignment was therefore

$$OR = \frac{0.8053 \times 0.4661}{0.5339 \times 0.1947} = 3.61.$$
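The arithmetic of this example is easy to reproduce. Below is a small Python sketch; the function name `odds_ratio` is a hypothetical helper introduced only for illustration.

```python
def odds_ratio(p_u: float, p_v: float) -> float:
    """Odds ratio p_U(1 - p_V) / [p_V(1 - p_U)] for two probabilities."""
    return (p_u * (1.0 - p_v)) / (p_v * (1.0 - p_u))

# Proportions prescribed psychotherapy by MDs and by non-MDs, as quoted in the text.
or_md_vs_non_md = odds_ratio(0.8053, 0.5339)
print(round(or_md_vs_non_md, 2))  # 3.61
```

An odds ratio of 1 corresponds to equal probabilities in the two groups; values further from 1 in either direction indicate stronger association.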

The odds ratio has not yet become a familiar descriptor in psychiatry of association involving a dichotomous variable, certainly not as familiar as it is in epidemiology (FLEISS, 1981, pp. 61-67; SCHLESSELMAN, 1982, pp. 33-34). In that discipline, values for an odds ratio greater than or equal to 2.5 or 3.0, when both the dependent and independent variables are binary, are generally taken to represent the lower limits of strong association. Table 1 gives the values of the odds ratio for several pairs of probabilities, so readers may decide for themselves what values represent clinically minor differences and what values represent clinically important differences. Inferences about and interpretations of the odds ratio are highlighted throughout the Results section.

TABLE 1. ODDS RATIOS $[P_U(1 - P_V)/P_V(1 - P_U)]$ FOR PAIRS OF PROBABILITIES $P_V$ AND $P_U > P_V$.*

| $P_V$† | $P_U$ = 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 2.25 | 3.86 | 6.00 | 9.00 | 13.50 | 21.00 | 36.00 | 81.00 |
| 0.2 | | 1.71 | 2.67 | 4.00 | 6.00 | 9.33 | 16.00 | 36.00 |
| 0.3 | | | 1.56 | 2.33 | 3.50 | 5.44 | 9.33 | 21.00 |
| 0.4 | | | | 1.50 | 2.25 | 3.50 | 6.00 | 13.50 |

*If $P_V > P_U$, interchange subscripts.
†If $P_V > 0.5$, enter the table with $1 - P_U$ and $1 - P_V$.
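The entries of Table 1 follow directly from the definition of the odds ratio, so the table can be regenerated as a check. The Python sketch below assumes the row and column layout described in the table title (rows indexed by $P_V$, columns by $P_U > P_V$); the function names are illustrative.

```python
def odds_ratio(p_u: float, p_v: float) -> float:
    """Odds ratio P_U(1 - P_V) / [P_V(1 - P_U)]."""
    return (p_u * (1.0 - p_v)) / (p_v * (1.0 - p_u))

def odds_ratio_table(p_v_values, p_u_values):
    """Map each (P_V, P_U) pair with P_U > P_V to its odds ratio."""
    return {
        (p_v, p_u): odds_ratio(p_u, p_v)
        for p_v in p_v_values
        for p_u in p_u_values
        if p_u > p_v
    }

P_V = [0.1, 0.2, 0.3, 0.4]
P_U = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
table = odds_ratio_table(P_V, P_U)

for p_v in P_V:
    row = [f"{table[(p_v, p_u)]:6.2f}" if p_u > p_v else "      " for p_u in P_U]
    print(f"{p_v:.1f} " + " ".join(row))
```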

Computational methods

As in traditional regression analysis, a rich variety of independent variables may be included in a logistic regression analysis. If a categorical independent variable has C levels, C-1 dummy coded variables are required to represent all levels. For example, one coded variable was sufficient to represent the clinician’s professional background (MD or non-MD) but two were required to represent his or her treatment orientation (primarily psychotherapy, primarily behavior modification, primarily other). The codes actually used will be illustrated in the Results section. If an independent variable is quantitative, such as the patient’s score on Axis V of DSM-III, it may be analyzed as measured. If there is reason to suspect a nonlinear association between the log odds for the response variable and a quantitative independent variable, the square, logarithm, reciprocal, etc. of the independent variable may be included in the equation in order to produce a more nearly linear association (DRAPER and SMITH, 1981, pp. 220-225). Finally, the interaction between two or more independent variables (the effect of one depending on the level of the other) may be modeled by including the products of these variables in the equation. The statistical method of maximum likelihood (Cox, 1977) was applied to the data from the field trial described earlier. Program PLR of the BMDP package was used in this study. It provides, for a logistic regression analysis, most of the features of programs for a conventional multiple regression analysis. In the present application, a stepwise addition of variables to the equation was requested: at each step, the independent variable most strongly associated with the dependent variable, controlling for the variables already in the equation, was entered provided it was significantly associated with the dependent variable at the 0.05 level.
The BMDP program also permits the removal of a variable from the model if, on the basis of the variables added subsequently, the given variable loses its statistical significance. The built-in criterion removes a variable if it is no longer significant at the 0.15 level. The SAS package’s CATMOD procedure (for CATegorical data MODeling) shares many of the features of BMDP’s PLR routine.
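The dummy coding of a categorical variable with C levels into C-1 indicator variables, described earlier in this section, can be sketched as follows. This is a generic Python illustration, not BMDP or SAS code; treating the last listed level as the reference category, and the ordering of the levels, are assumptions made here for the example.

```python
def dummy_code(level, levels):
    """Return C-1 indicators (0 or 1); the last entry of `levels` is the reference category."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == candidate else 0 for candidate in levels[:-1]]

# Hypothetical ordering of the three treatment orientations, with "other" as reference.
ORIENTATIONS = ["behavior modification", "psychotherapy", "other"]

for orientation in ORIENTATIONS:
    print(orientation, dummy_code(orientation, ORIENTATIONS))
```

A two-level variable such as professional background needs only one indicator under this scheme, and an interaction term would be formed as the product of the relevant coded variables.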

RESULTS

Table 2 presents the demographic, clinical and clinician characteristics for the 709 patients in the selected diagnostic categories prescribed either psychotherapy or behavior modification. Five of the eight patient characteristics were not sufficiently associated with treatment to enter the final model: sex, race, age, a secondary diagnosis of personality disorder and severity of psychosocial stressors as coded on Axis IV of DSM-III. Only one of the five clinician characteristics, race, failed to enter the model. The development of the model containing the remaining seven significant independent variables will now be described.

The first three steps of the stepwise multiple logistic regression analysis involved the data in Table 3. The column headed Logistic will be described later. For now, only the numbers of patients evaluated by clinicians with the indicated treatment orientations and the indicated professional backgrounds, and the observed proportions of those patients prescribed psychotherapy rather than behavior modification, will be discussed.

TABLE 2. PATIENT AND CLINICIAN CHARACTERISTICS FOR 709 PATIENTS PRESCRIBED EITHER PSYCHOTHERAPY OR BEHAVIOR MODIFICATION

Treatment prescribed: psychotherapy, 72%; behavior modification, 28%

Patient characteristics
Sex: female, 56%; male, 44%
Race: white, 87%; nonwhite, 13%
Setting: inpatient, 18%; outpatient, 82%
Age: mean = 25.4 yr, SD = 12.0 yr
Primary diagnosis:
  Conduct disorder, aggressive: 4%
  Conduct disorder, nonaggressive: 4%
  Childhood anxiety: 5%
  Oppositional disorder: 2%
  Identity disorder: 2%
  Eating disorders: 2%
  Substance use disorders: 6%
  Major depression: 6%
  Dysthymic disorders: 7%
  Phobic disorder: 2%
  Anxiety states: 6%
  Somatoform disorders: 3%
  Adjustment disorders: 23%
  Psychosomatic disorders: 1%
  Conditions not attributed to a mental disorder: 3%
  Personality disorders: 25%
Secondary personality disorder: yes, 21%; no, 79%
Axis IV: mean = 3.7, SD = 1.12
Axis V: mean = 3.7, SD = 1.1

Clinician characteristics
Professional background: MD, 69%; non-MD, 31%
Race: white, 94%; nonwhite, 6%
Years of experience: mean = 11.6, SD = 8.4
Major work setting: clinical, 91%; academic, 9%
Treatment orientation: psychotherapy, 61%; behavior modification, 14%; other, 24%

TABLE 3. TREATMENT ASSIGNMENT TO PATIENTS BY CLINICIAN’S TREATMENT ORIENTATION AND PROFESSIONAL BACKGROUND

| Treatment orientation | Profession | No. of patients | Proportion given psychotherapy: Observed | Proportion given psychotherapy: Logistic |
|---|---|---|---|---|
| Psychotherapy | MD | 340 | 0.8971 | 0.9081 |
| Psychotherapy | Non-MD | 95 | 0.9053 | 0.8659 |
| Behavior Modification | MD | 25 | 0.1200 | 0.1550 |
| Behavior Modification | Non-MD | 76 | 0.1184 | 0.1070 |
| Other | MD | 123 | 0.6911 | 0.6536 |
| Other | Non-MD | 50 | 0.4600 | 0.5521 |
| All MDs | | 488 | 0.8053 | |
| All non-MDs | | 221 | 0.5339 | |
| Total | | 709 | 0.7207 | |

Constant term

At the first step, only the constant term in the equation is estimated; no independent variables are entered yet. The important values in the printout are shown in Table 4. The estimated constant term, $a = 0.948$, has the following interpretation. In the sample as a whole, the estimated log odds for psychotherapy rather than behavior modification is 0.948, which corresponds to an estimated odds of $\exp(0.948) = 2.581$. Over 2.5 times as many patients in the sample are prescribed psychotherapy as are prescribed behavior modification. Finally, the corresponding estimated probability of a typical patient’s being prescribed psychotherapy is $2.581/(1 + 2.581) = 0.721$, precisely the observed proportion of all 709 patients prescribed psychotherapy.

TABLE 4. SUMMARY OF BMDP PRINTOUT AT INITIAL STEP IN THE ANALYSIS

STEP NUMBER 0

| TERM | COEFFICIENT | STANDARD ERROR | COEFF/S.E. |
|---|---|---|---|
| CONSTANT | 0.948 | 0.084 | 11.326 |

Working backwards, therefore, the value of $a$ in the equation in which no independent variables yet appear is simply $a = \ln(p/q)$, where $p$ is the proportion of patients prescribed psychotherapy and $q = 1 - p$ is the proportion prescribed behavior modification. The formula for the standard error of $a$ is

$$SE(a) = \frac{1}{\sqrt{npq}},$$

where $n$ is the total sample size. In the present analysis,

$$SE(a) = \frac{1}{\sqrt{709 \times 0.721 \times 0.279}} = 0.084.$$

The BMDP program automatically calculates the ratio of the estimated constant term to its standard error. This ratio may be referred to the standard normal distribution to test the hypothesis that the constant term in the population, $\alpha$, is zero. A constant term equal to zero corresponds to an odds of 1, which in turn corresponds to a population probability of 0.5. In the present study, the value of this ratio is $0.948/0.084 = 11.3$. The constant term is thus highly significant: the sample proportion of 0.721 is significantly different from 0.5. There was no reason in the current study to hypothesize an underlying probability of 0.5, but there may be other kinds of studies (e.g. in genetics) in which such a hypothesis is sensible.
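The step 0 quantities can be checked by hand or with a few lines of code. The Python sketch below applies the formulas for $a$ and $SE(a)$; the count of 511 patients prescribed psychotherapy is not printed in the paper but is inferred here from the overall proportion 0.7207 of 709 patients, so it should be read as an assumption.

```python
import math

def constant_term(n_first: int, n_total: int):
    """Return (a, SE(a), a / SE(a)) for a model with no independent variables."""
    p = n_first / n_total          # proportion in the first category
    q = 1.0 - p                    # proportion in the second category
    a = math.log(p / q)            # a = ln(p / q)
    se = 1.0 / math.sqrt(n_total * p * q)
    return a, se, a / se

# 511 of 709 patients (inferred count, see lead-in) prescribed psychotherapy.
a, se_a, ratio = constant_term(511, 709)
print(round(a, 3), round(se_a, 3), round(ratio, 2))
```

Agreement with the printed coefficient, standard error and ratio in Table 4 is the kind of simple check on computer output that these explicit formulas make possible.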

TABLE 5. SUMMARY OF BMDP PRINTOUT AFTER THE FIRST VARIABLE ENTERED THE MODEL

STEP NUMBER 1: PROFBACK IS ENTERED

| TERM | COEFFICIENT | STANDARD ERROR | COEFF/S.E. |
|---|---|---|---|
| PROFBACK | 1.284 | 0.176 | 7.263 |
| CONSTANT | 0.136 | | |

Professional background

Table 5 summarizes the results of selecting the first variable to enter the model, the clinician’s professional background. The dummy variable representing profession was coded as $X_1 = +1$ for MDs and $X_1 = 0$ for non-MDs, and the resulting equation was

$$\ln\frac{p}{1-p} = 0.136 + 1.284X_1.$$

For MDs, the estimated log odds for prescribing psychotherapy rather than behavior modification is $0.136 + 1.284(1) = 1.420$, the estimated odds are $\exp(1.420) = 4.137$, and the estimated probability is $4.137/(1 + 4.137) = 0.805$, precisely the observed proportion of all patients evaluated by MDs who were prescribed psychotherapy. For non-MDs, the estimated log odds is $0.136 + 1.284(0) = 0.136$, the estimated odds are $\exp(0.136) = 1.146$, and the estimated probability is $1.146/(1 + 1.146) = 0.534$, precisely the value observed for patients evaluated by non-MDs.

If $p_1$ denotes the observed proportion of patients evaluated by MDs who were prescribed psychotherapy, if $p_2$ denotes the corresponding proportion for patients evaluated by non-MDs, and if $q_1$ and $q_2$ denote the complementary proportions, the formula for the estimated coefficient of $X_1$ is

$$b_1 = \ln\frac{p_1}{q_1} - \ln\frac{p_2}{q_2},$$

with an estimated standard error of

$$SE(b_1) = \left(\frac{1}{n_1 p_1 q_1} + \frac{1}{n_2 p_2 q_2}\right)^{1/2},$$

where $n_1$ and $n_2$ are the total numbers of patients evaluated by MDs and non-MDs. For the current data,

$$b_1 = \ln\frac{0.8053}{0.1947} - \ln\frac{0.5339}{0.4661} = 1.28$$

and

$$SE(b_1) = \left(\frac{1}{488 \times 0.8053 \times 0.1947} + \frac{1}{221 \times 0.5339 \times 0.4661}\right)^{1/2} = 0.18.$$
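The coefficient and standard error for a single binary independent variable can be verified from the two group proportions alone. The following Python sketch does so for professional background; the counts 393 of 488 and 118 of 221 are inferred here from the observed proportions 0.8053 and 0.5339 and are therefore assumptions rather than figures printed in the paper.

```python
import math

def binary_predictor_fit(k1: int, n1: int, k2: int, n2: int):
    """Return (b1, SE(b1), b1 / SE(b1)) from counts in the two groups.

    k1 of n1 subjects in group 1 (coded X = 1) and k2 of n2 subjects in
    group 2 (coded X = 0) fall in the first outcome category.
    """
    p1, p2 = k1 / n1, k2 / n2
    q1, q2 = 1.0 - p1, 1.0 - p2
    b1 = math.log(p1 / q1) - math.log(p2 / q2)
    se = math.sqrt(1.0 / (n1 * p1 * q1) + 1.0 / (n2 * p2 * q2))
    return b1, se, b1 / se

# MDs: 393 of 488; non-MDs: 118 of 221 (inferred counts, see lead-in).
b1, se_b1, ratio_b1 = binary_predictor_fit(393, 488, 118, 221)
print(round(b1, 3), round(se_b1, 3), round(ratio_b1, 2))
```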

The significance of the difference between MDs and non-MDs in their log odds for prescribing psychotherapy may be tested by referring the ratio $b_1/SE(b_1)$ to the standard normal distribution. The value of the ratio for the current data is $1.28/0.18 = 7$, so the difference between the log odds for psychotherapy for the two groups of clinicians is highly significant.

It is important to note that, at this step and at subsequent ones, the value of the constant term changes appreciably from what it had been at the preceding step. It is only at step 0 that the constant term is directly related to the overall proportion $p$. At subsequent steps, the value of the constant term depends on the way in which the independent variables are coded. The constant term thus loses its substantive meaning after step 0.

The odds ratio

The log odds for psychotherapy rather than behavior modification for MDs is $a + b_1$, and the log odds for non-MDs is $a$. The difference between the two log odds, which is the same as the log odds ratio associating the clinician’s professional background with his or her prescribed treatment, is therefore simply $b_1$. The value of the log odds ratio for these data is equal to 1.284, and the corresponding odds ratio is $\exp(1.284) = 3.61$. Therefore, the odds that a patient evaluated by an MD will be prescribed psychotherapy rather than behavior modification are over three and a half times the odds for a patient evaluated by a non-MD.

A 95% confidence interval for the underlying odds ratio is obtained by first finding a 95% confidence interval for the underlying log odds ratio, $b_1 \pm 1.96 \times SE(b_1)$, and then by taking antilogarithms of these limits. The lower limit is $\exp(1.284 - 1.96 \times 0.176) = \exp(0.939) = 2.56$, and the upper limit is $\exp(1.284 + 1.96 \times 0.176) = \exp(1.629) = 5.10$. This interval, $2.56 < OR < 5.10$, suggests a fairly strong association between the clinician’s professional background and treatment.

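The confidence interval calculation amounts to building the interval on the log odds scale and exponentiating its limits. A minimal Python sketch, with a hypothetical helper name, is:

```python
import math

def odds_ratio_interval(b: float, se: float, z: float = 1.96):
    """Return (OR, lower limit, upper limit) from a coefficient and its standard error."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Coefficient and standard error for professional background at step 1.
or_hat, lower, upper = odds_ratio_interval(1.284, 0.176)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))  # 3.61 2.56 5.1
```

Because the limits are obtained by exponentiation, the interval is not symmetric about the estimated odds ratio, and it excludes 1 exactly when the ratio of the coefficient to its standard error exceeds 1.96 in absolute value.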
Effect of treatment orientation

The clinician’s treatment orientation, a categorical variable with three levels, was the next independent variable entered into the model (see Table 6). Two dummy coded variables are required to identify membership in one of three groups. The codes used for the clinician’s treatment orientation were $X_2 = 1$, $X_3 = 0$ for behavior modification, $X_2 = 0$, $X_3 = 1$ for psychotherapy, and $X_2 = 0$, $X_3 = 0$ for other. With $X_1$ coded as before for professional background, the estimated equation was

$$\ln\frac{p}{1-p} = 0.209 + 0.426X_1 - 2.331X_2 + 1.656X_3.$$

TABLE 6. SUMMARY OF BMDP PRINTOUT AFTER THE SECOND VARIABLE ENTERED THE MODEL

STEP NUMBER 2: RXORIENT IS ENTERED

| TERM | COEFFICIENT | STANDARD ERROR | COEFF/S.E. |
|---|---|---|---|
| PROFBACK ($b_1$) | 0.426 | 0.230 | 1.841 |
| RXORIENT(1) ($b_2$) | -2.331 | 0.358 | -6.508 |
| RXORIENT(2) ($b_3$) | 1.656 | 0.224 | 7.402 |
| CONSTANT ($a$) | 0.209 | | |

COVARIANCE MATRIX OF COEFFICIENTS

| | PROFBACK ($b_1$) | RXORIENT(1) ($b_2$) | RXORIENT(2) ($b_3$) |
|---|---|---|---|
| PROFBACK ($b_1$) | 0.0529 | 0.0203 | -0.0014 |
| RXORIENT(1) ($b_2$) | 0.0203 | 0.1282 | 0.0245 |
| RXORIENT(2) ($b_3$) | -0.0014 | 0.0245 | 0.0502 |

The values in the column headed Logistic in Table 3 are the predicted proportions of patients prescribed psychotherapy using this equation. For example, the predicted log odds for patients evaluated by non-MDs ($X_1 = 0$) primarily oriented to psychotherapy ($X_2 = 0$, $X_3 = 1$) is $0.209 + 0.426(0) - 2.331(0) + 1.656(1) = 1.865$, the corresponding predicted odds are $\exp(1.865) = 6.456$, and the corresponding predicted proportion is $6.456/(1 + 6.456) = 0.866$. Explicit formulas no longer exist for the estimated coefficients and their standard errors, nor do explicit formulas exist for the covariances between coefficients. These summary statistics are calculated as part of the maximum likelihood analysis, and may be used for making a variety of inferences, as follows.
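The predicted proportions in the Logistic column of Table 3 can be recomputed from the step 2 coefficients. The Python sketch below hard-codes those coefficients and the dummy codes given above; the function and label names are illustrative only.

```python
import math

def predicted_proportion(is_md: bool, orientation: str) -> float:
    """Predicted probability of psychotherapy from the step 2 equation."""
    x1 = 1 if is_md else 0
    x2 = 1 if orientation == "behavior modification" else 0
    x3 = 1 if orientation == "psychotherapy" else 0
    log_odds = 0.209 + 0.426 * x1 - 2.331 * x2 + 1.656 * x3
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

for orientation in ("psychotherapy", "behavior modification", "other"):
    for is_md in (True, False):
        label = "MD" if is_md else "Non-MD"
        print(f"{orientation:22s} {label:7s} {predicted_proportion(is_md, orientation):.4f}")
```

Small discrepancies in the fourth decimal place relative to the printed table are to be expected, because the coefficients used here are rounded to three decimals.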

Controlling for treatment orientation, what is the effect of professional background?

The ratio of $b_1$, the coefficient describing the MD-non-MD difference, to its standard error is $0.426/0.230 = 1.8$, which fails to attain significance at the 0.05 level. This finding is in apparent contradiction to the highly significant effect of profession found at the preceding step. There, however, treatment orientation was not taken into account. We see from Table 3 that the majority of MDs (340/488) were oriented toward psychotherapy whereas only a minority of non-MDs (95/221) were. Most of the apparent difference in log odds between MDs and non-MDs is due to their differing treatment orientations.

Controlling for treatment orientation, the estimated odds ratio associating professional background and treatment is $\exp(0.426) = 1.53$, and a 95% confidence interval for the underlying odds ratio extends from a lower limit of $\exp(0.426 - 1.96 \times 0.230) = \exp(-0.025) = 0.98$ to an upper limit of $\exp(0.426 + 1.96 \times 0.230) = \exp(0.877) = 2.40$. Not only are these limits suggestive of a weaker association than found at the previous step, but the interval includes the value 1.0 (the value indicating no association), which agrees with the failure to find the coefficient for professional background, $b_1$, significantly different from zero. The lower limit, 0.98, is only slightly less than 1.0, however, suggesting that it might be a mistake to infer that professional background has no effect at all on treatment, controlling for treatment orientation.

Controlling for profession, what are the effects of treatment orientation?

The coding system that was adopted for representing treatment orientation is such that the coefficient $b_2$ is the log of the odds ratio contrasting clinicians oriented to behavior modification with those oriented to other therapies and the coefficient $b_3$ is the log of the odds ratio contrasting clinicians oriented to psychotherapy with those oriented to other therapies. In general, when a nominal scale has C categories and when dummy-coded variable c is equal to 1 for individuals in category c and is equal to 0 for all others (c = 1, ..., C-1), then the coefficient for variable c is equal to the log of the odds ratio comparing category c with category C (FLEISS, 1985).
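The two coded variables compare each orientation with the reference category only. A comparison between the two non-reference categories (behavior modification vs psychotherapy) can be formed as the difference of their coefficients, with a standard error that uses the covariance between them. The Python sketch below applies the standard variance formula for a difference to the step 2 coefficients and covariance matrix; it is offered as an illustration under that assumption, and its numerical output is not a result quoted from this paper.

```python
import math

def coefficient_contrast(b_i, b_j, var_i, var_j, cov_ij, z=1.96):
    """Difference of two coefficients with its SE, critical ratio, OR and 95% limits.

    Var(b_i - b_j) = Var(b_i) + Var(b_j) - 2 Cov(b_i, b_j).
    """
    diff = b_i - b_j
    se = math.sqrt(var_i + var_j - 2.0 * cov_ij)
    return {
        "log_odds_ratio": diff,
        "se": se,
        "critical_ratio": diff / se,
        "odds_ratio": math.exp(diff),
        "lower": math.exp(diff - z * se),
        "upper": math.exp(diff + z * se),
    }

# Step 2 values: b2 = -2.331 (variance 0.1282), b3 = 1.656 (variance 0.0502),
# covariance between b2 and b3 = 0.0245.
bm_vs_psychotherapy = coefficient_contrast(-2.331, 1.656, 0.1282, 0.0502, 0.0245)
print({name: round(value, 4) for name, value in bm_vs_psychotherapy.items()})
```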

The difference in log odds between clinicians oriented to behavior modification and those oriented to other therapies is statistically highly significant (critical ratio = - 6.5, p
V

The next variable that was entered into the model was a characteristic of the patient, his or her highest level of adaptive functioning during the past year as scored on Axis V (see Table 7). This was also the first quantitative variable included in the model, the kind of variable for which a graphic display is especially informative. Note that the coefficients, standard errors and covariances of the other variables changed little or not at all when Axis V was added to the model. None of the effects of the clinician’s treatment orientation or professional background, therefore, was due to any association between these independent variables and Axis V.

TABLE 7. SUMMARY OF BMDP PRINTOUT AFTER THE THIRD VARIABLE ENTERED THE MODEL

STEP NUMBER 3: AXIS V IS ENTERED

| TERM | COEFFICIENT | STANDARD ERROR | COEFF/S.E. |
|---|---|---|---|
| PROFBACK ($b_1$) | 0.426 | 0.238 | 1.790 |
| RXORIENT(1) ($b_2$) | -2.499 | 0.371 | -6.742 |
| RXORIENT(2) ($b_3$) | 1.539 | 0.231 | 6.661 |
| AXIS V ($b_4$) | -0.550 | 0.106 | -5.205 |
| CONSTANT ($a$) | 2.388 | | |

COVARIANCE MATRIX OF COEFFICIENTS

| | PROFBACK ($b_1$) | RXORIENT(1) ($b_2$) | RXORIENT(2) ($b_3$) | AXIS V ($b_4$) |
|---|---|---|---|---|
| PROFBACK ($b_1$) | 0.0566 | 0.0225 | -0.0020 | -0.0006 |
| RXORIENT(1) ($b_2$) | 0.0225 | 0.1376 | 0.0265 | 0.0061 |
| RXORIENT(2) ($b_3$) | -0.0020 | 0.0265 | 0.0534 | 0.0006 |
| AXIS V ($b_4$) | -0.0006 | 0.0061 | 0.0006 | 0.0112 |

The coefficient of -0.550 for Axis V means that, on the average, the log odds for psychotherapy decreases by 0.55 for each unit increase in Axis V. For patients at the two extremes of Axis V, those with a score of 1 (for superior functioning) vs those with a score of 7 (for grossly impaired), the change in log odds is $-0.550(7 - 1) = -3.30$ and the associated odds ratio is $\exp(-3.30) = 0.04$. Patients with an Axis V score of 7 have an odds of receiving psychotherapy that is about 4% of the odds for patients with a score of 1. The strong association between treatment and Axis V is displayed in Fig. 1, which graphs the estimated probability of receiving psychotherapy as a function of the three variables currently in the model.

[Figure 1: estimated probability curves for clinicians oriented to psychotherapy, other therapies, and behavior modification; horizontal axis: Axis V score.]

FIG. 1. Estimated probability of a patient’s receiving psychotherapy rather than behavior modification as a function of the clinician’s treatment orientation and professional background and the patient’s score on Axis V (based on step 3 of the logistic regression analysis).
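Values like those plotted in Fig. 1 can be tabulated from the step 3 coefficients in Table 7. The Python sketch below is one way to do so; the function name and the printed layout are choices made here for illustration, and the code only evaluates the fitted equation rather than refitting the model.

```python
import math

def prob_psychotherapy(is_md: bool, orientation: str, axis_v: float) -> float:
    """Estimated probability of psychotherapy from the step 3 coefficients."""
    x1 = 1 if is_md else 0
    x2 = 1 if orientation == "behavior modification" else 0
    x3 = 1 if orientation == "psychotherapy" else 0
    log_odds = 2.388 + 0.426 * x1 - 2.499 * x2 + 1.539 * x3 - 0.550 * axis_v
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

# One row per orientation and profession, one column per Axis V score (1 to 7).
for orientation in ("psychotherapy", "other", "behavior modification"):
    for is_md in (True, False):
        curve = [prob_psychotherapy(is_md, orientation, score) for score in range(1, 8)]
        label = "MD" if is_md else "non-MD"
        print(f"{orientation:22s} {label:7s} " + " ".join(f"{p:.2f}" for p in curve))
```

Each curve falls as the Axis V score rises, reflecting the negative coefficient, and the curves for MDs and non-MDs within an orientation differ only by the professional background term.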

Each estimated probability was calculated by first finding the value of

$$\text{log odds} = 2.388 + 0.426X_1 - 2.499X_2 + 1.539X_3 - 0.550X_4$$

for a particular professional background ($X_1 = +1$ for MDs, $X_1 = 0$ for non-MDs), treatment orientation ($X_2 = 1$, $X_3 = 0$ for behavior modification; $X_2 = 0$, $X_3 = 1$ for psychotherapy; $X_2 = X_3 = 0$ for other), and patient score on Axis V ($X_4$ = actual score), and then using the formula

$$\text{estimated probability} = \frac{\exp(\text{log odds})}{1 + \exp(\text{log odds})}.$$

The figure shows clearly the large differences among clinicians oriented to different treatment modalities, the strong effect of the patient’s score on Axis V, and the slight difference between MDs and non-MDs when other factors are controlled.

Summary of results

Two additional patient characteristics, status as inpatient or outpatient and diagnosis, and two additional clinician characteristics, years of experience and work primarily in an academic or clinical setting, entered the model as significant correlates of treatment. In the final model, the clinician’s professional background once again became a significant correlate of treatment (p < 0.01). The estimated odds ratio for professional background controlling for all other variables was 2.14, with a 95% confidence interval extending from 1.22 to 3.75. The independent effect of professional background on treatment, though significant, was modest at best. The effect of treatment orientation remained essentially the same as it was above, but the effect of Axis V diminished somewhat (although it remained significant at the 0.05 level). The reason is that a patient’s score on Axis V was associated with inpatient or outpatient status and with diagnosis, both of which exerted strong and independent effects on treatment. With all other significant variables in the model, the average change in the log odds for psychotherapy became -0.314 per unit change on Axis V. Outpatients were significantly more likely than inpatients to receive psychotherapy

DISCUSSION

It is only relatively recently that computer software has become available for performing a logistic regression analysis (the computer storage requirements for logistic regression are orders of magnitude greater than for ordinary multiple regression). For example, it was not until 1983 that the SPSSX package included a program for logistic regression analysis. Investigators who intend to apply this method to their data, or their programmers, should check that the computer package they intend to use has at least the following features. (Because software is changing so rapidly, an enumeration of features of packages as of December 1985 will likely be out of date one year later.)

(1) Membership in one of C groups should be codable using whichever one of the several standard dummy-coding systems the investigator chooses (COHEN and COHEN, 1975, chapter 5). The resulting C-1 coded variables must be analyzable together as a set: if one coded variable is in the model, all must be in. (The reason is that the meanings attached to the coefficients are valid only in the context of an analysis in which all C-1 variables are present.)

(2) The option should be available to specify, on a priori grounds, the order in which the variables are to be entered into the model. In the present study there was no logical or theoretical basis for ordering the independent variables from most to least important, so a stepwise procedure was applied in which the data determined what was and what wasn’t important in predicting treatment. When the independent variables can be ordered on, for example, a temporal basis, forcing the variables to enter in the prescribed order will produce a more coherent and credible set of results (COHEN and COHEN, 1975, pp. 98-102).

(3) The option should be available to analyze the data using the method of maximum likelihood.
Other statistical methods, which are computationally less complicated and result in measurably reduced computer time, exist for making inferences about a logistic regression equation. The method of maximum likelihood is theoretically superior to them (COX, 1977), and is proposed as the method of choice.

(4) The option should be available to have printed the covariances between pairs of estimated coefficients. As shown in the analysis of treatment orientation, these covariances permit the investigator to make more comparisons among C groups than are provided by the C-1 coded variables.

There were other than pedagogic reasons for discussing in such detail the results of steps 0 and 1 of the analysis. One may expect that programming or formatting errors will be made in the initial applications of a new statistical method, but it will not be obvious whether errors actually occurred. The explicit and simple formulas given for the constant term and its standard error at step 0, and for the coefficient and standard error for a binary independent variable if such a variable is entered at step 1, permit the investigator to apply some checks on the correctness of the computer output. Agreement of results does not prove that all subsequent computer results are correct, but disagreement proves that an error has been made.

For comparative purposes, ordinary multiple regression analyses were applied to the data in parallel with the logistic regression analysis. The response variable was set equal to 1 if psychotherapy was prescribed and to 0 if behavior modification was prescribed. The following are two of the major differences noted.

The estimated probabilities based on a logistic regression model are necessarily constrained to fall in the interval from 0 to 1. Not so the estimated probabilities based on the traditional multiple regression model applied to the same data. The estimated probabilities from the ordinary multiple regression equation applied to the variables entered by step 3 exceeded 1 for patients scoring 1, 2 or 3 on Axis V and reported on by MDs oriented to psychotherapy, and were less than zero for patients scoring 7 on Axis V and reported on by non-MDs oriented to behavior modification. Such estimates are nonsensical.

It was possible to calculate a quantity like the squared multiple correlation coefficient for the final logistic regression equation, even though it was not produced by any of the standard computer programs. Advantage was taken of the fact that the multiple correlation coefficient is the ordinary product-moment correlation between a subject’s observed value and the value predicted from the regression equation (COHEN and COHEN, 1975). The BMDP program printed, for each patient, the treatment actually assigned (1 for psychotherapy, 0 for behavior modification) and the estimated probability of his or her being prescribed psychotherapy using the logistic regression equation with the seven significant correlates of treatment. The square of the correlation between the actual treatment prescribed and the estimated probability of receiving psychotherapy was 0.596. When an ordinary multiple regression analysis was applied to the same seven independent variables, the squared multiple correlation coefficient was 0.482. By fitting a model that was linear in the log odds rather than linear in the probability itself, a much better fit to the data was obtained.

It is ironic that, as numerical measurements have continued to replace qualitative classifications in psychiatry and in other disciplines, the computer software for optimally analyzing qualitative data has become more widely available.
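The boundedness of the logistic estimates noted earlier in this discussion can be illustrated with a short Python sketch. It assumes that the log-odds equation quoted for the figure (constant 2.388; coefficients 0.426, -2.499, 1.539 and -0.550) is the step 3 equation, and that the coefficients attach to MD status, behavior modification orientation, psychotherapy orientation and Axis V score in that order. The function and argument names are hypothetical.

```python
import math

# Coefficients as quoted in the text; the variable labels are assumptions
# made for this sketch, not the authors' notation.
INTERCEPT = 2.388
B_MD = 0.426             # 1 for MDs, 0 for non-MDs
B_BEHAVIOR = -2.499      # 1 if oriented to behavior modification, else 0
B_PSYCHOTHERAPY = 1.539  # 1 if oriented to psychotherapy, else 0
B_AXIS_V = -0.550        # per unit of the patient's Axis V score


def log_odds(md, behavior, psychotherapy, axis_v):
    """Linear predictor: the log odds of being prescribed psychotherapy."""
    return (INTERCEPT + B_MD * md + B_BEHAVIOR * behavior
            + B_PSYCHOTHERAPY * psychotherapy + B_AXIS_V * axis_v)


def estimated_probability(z):
    """Logistic transform exp(z) / (1 + exp(z)); always strictly in (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))


# The two extremes for which the linear model gave estimates outside [0, 1].
p_high = estimated_probability(
    log_odds(md=1, behavior=0, psychotherapy=1, axis_v=1))
p_low = estimated_probability(
    log_odds(md=0, behavior=1, psychotherapy=0, axis_v=7))

print(f"MD, psychotherapy orientation, Axis V = 1: {p_high:.3f}")
print(f"non-MD, behavior modification orientation, Axis V = 7: {p_low:.3f}")
```

Both extremes named in the text yield logistic estimates strictly between 0 and 1, whereas the linear model's estimates for the same patients fell outside that range.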
Some characteristics will continue to be measurable only on dichotomous scales, and the investigator who relies on such data should seriously consider logistic regression models for their analysis.

REFERENCES

AMERICAN PSYCHIATRIC ASSOCIATION (1980) Diagnostic and Statistical Manual of Mental Disorders, 3rd Edn. American Psychiatric Association, Washington.
COHEN, J. and COHEN, P. (1975) Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum, Hillsdale, NJ.
COX, D. R. (1977) The Analysis of Binary Data. Chapman & Hall, London.
DRAPER, N. and SMITH, H. (1981) Applied Regression Analysis, 2nd Edn. John Wiley, New York.
EVERITT, B. S. (1977) The Analysis of Contingency Tables. Chapman & Hall, London.
FLEISS, J. L. (1981) Statistical Methods for Rates and Proportions, 2nd Edn. John Wiley, New York.
FLEISS, J. L. (1985) Re: “Estimating odds ratios with categorically scaled covariates in multiple logistic regression analysis.” Am. J. Epidem. 121, 476-477.
KLEINBAUM, D. G., KUPPER, L. L. and MORGENSTERN, H. (1982) Epidemiologic Research: Principles and Quantitative Methods. Lifetime Learning Publications, Belmont, CA.
MOSS, A. J., DAVIS, H. T., CONRAD, D. L., DECAMILLA, J. J. and ODOROFF, C. L. (1981) Digitalis associated cardiac mortality after myocardial infarction. Circulation 64, 1150-1156.
SCHLESSELMAN, J. J. (1982) Case-Control Studies. Oxford University Press, New York.
SPITZER, R. L., FORMAN, J. B. W. and NEE, J. (1979) DSM-III field trials--I. Initial interrater diagnostic reliability. Am. J. Psychiat. 136, 818-820.
TRUETT, J., CORNFIELD, J. and KANNEL, W. (1967) A multivariate analysis of the risk of coronary heart disease in Framingham. J. chron. Dis. 20, 511-524.
WILLIAMS, J. B. W. (1981) DSM-III: a comprehensive approach to diagnosis. Social Work 26, 101-106.