J. Psychiat. Res. Vol. 20, No. 3, pp. 217-235, 1986
Printed in Great Britain.
0022-3956/86 $3.00+.00
Pergamon Journals Ltd.

LATENT TRAIT ANALYSIS OF THE EYSENCK PERSONALITY QUESTIONNAIRE

DAVID A. GRAYSON

NH & MRC Social Psychiatry Research Unit, The Australian National University, GPO Box 4, Canberra ACT 2601, Australia

(Received 17 July 1985; in revised form 10 February 1986)

Summary-This paper exhibits contemporary psychometric models of questionnaires with dichotomous items. Such an approach allows assessment of individual items in terms of precision of measurement in ways not previously available. This Latent Trait Model approach is used to analyse the responses of 3806 subjects to the Eysenck Personality Questionnaire (EPQ). Short forms, of 10 items length, are recommended; they are the best possible such short forms, in that they provide the most accurate possible measurement overall.

INTRODUCTION

THE PURPOSE of psychiatric inventories or questionnaires is to measure illness severity. Good measurement is accurate measurement. Latent trait models (LTMs) of dichotomous questionnaire data have, in recent years, been the focus of much psychometric interest. Only now are applications beginning to appear in the psychiatric literature (GIBBONS et al., 1985; DUNCAN-JONES et al., 1986).

The present paper has two aims. The first is didactic, the intention being to exhibit the model as non-technically as possible, with particular emphasis on precision of measurement. The model provides objective criteria for assessing how accurately an inventory measures underlying illness severity for subjects with given degrees of severity. Moreover, it allows assessment of individual symptoms or items by the same criteria. This opens the possibility of selecting sub-inventories aimed at measuring best those located at particular levels of illness severity, and more generally, of constructing questionnaires with particular applications in mind.

The second aim is more substantive. An LTM analysis is performed on the responses of 3806 subjects to the 90-item Eysenck Personality Questionnaire (EPQ) (EYSENCK and EYSENCK, 1975). The model allows the dynamics of the four subscales of the EPQ to be exhibited at the level of individual items. The items are evaluated objectively in terms of their individual contributions to precision of measurement, and short forms are recommended which provide the best possible measurement. Such analysis is of indisputable value in psychiatric research, as it allows symptoms to be evaluated simultaneously in terms of how well they are discriminating and at which level of the illness they are best discriminating. Hitherto, this has not been possible. The approach allows, for example, the construction of screening instruments which are "best" in an objective sense for, say, the most ill 10% of the population of interest.
Such an application is presented in DUNCAN-JONES et al. (1986). The present paper recommends 10-item short forms which are "best" overall, and exhibits the conceptual framework underlying such decisions. The paper is divided into four main sections, a statement of the model, a description of "information" curves, analysis of the EPQ data (including suggested short forms), and finally, a discussion of both the EPQ results and LTMs in general.

THE MODEL

The model attempts to explain the data observed on a questionnaire in terms of a single underlying dimension (latent trait) and aspects of the individual items. It assumes that each item relates to this underlying dimension in a probabilistic fashion. For subjects with a given trait value, say z, there corresponds a particular probability of endorsing the item. This probability is a function of only their z-value. Subjects higher on this single z dimension have greater probability of endorsing each item. That is, the function relating the latent trait to the probability of endorsement of a given item is increasing as z increases. This function, for item j, is called the Item Characteristic Curve (ICCj), and Fig. 1 provides an example.

More explicit assumptions are required, however, to relate the latent trait to the actual observed response patterns in a particular application. The first of these is that the underlying trait is normally distributed. In fact, this is not a real assumption, but rather a means of removing arbitrariness in the model. Whatever distribution is postulated for the latent trait, it can always be monotonically transformed to yield a unit normal distribution, while still leaving the ICCs as increasing functions of z. In other words, if for each item the probability of endorsement is greater for those higher on z, then this property is unaltered when z is rescaled (to yield normality) in such a way that the ordering of subjects on the new scale is the same as that on z.
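The rescaling argument above can be sketched numerically. The following is a minimal illustration (a sketch, not part of the original analysis; it assumes Python's statistics.NormalDist for the normal quantile function): any latent scale can be mapped through its empirical distribution to a unit normal scale without disturbing the ordering of subjects.

```python
from statistics import NormalDist

def to_unit_normal(raw_scores):
    """Monotone (rank-based) transform of an arbitrary latent scale to
    a unit normal scale: each value is replaced by the normal quantile
    of its empirical cumulative proportion, so the ordering of subjects
    is unchanged."""
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return z

# The subject ranked highest on the original scale is still ranked
# highest on the rescaled (normal) scale, so every increasing ICC
# remains increasing after the transformation.
```

Because the transform is strictly increasing, any ICC that was an increasing function of the original scale remains increasing on the new one, which is the point made in the text.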

FIG. 1. Item characteristic curve (t = 1, s = 2).


Finally, it is assumed that, on this normal latent scale, the forms of the ICCs are well approximated by 2-parameter logistic curves (as in Fig. 1). Such curves are described in detail elsewhere (BIRNBAUM, 1968; LORD, 1980; DUNCAN-JONES et al., 1986). Essentially, this allows items to be characterized by location and discrimination parameters (the threshold, tj, and slope, sj). The threshold is the latent trait value which yields a 50% chance of endorsing the item; the slope represents how sharply the item discriminates people with a trait value near this threshold. The lower and higher flat regions of these curves reflect the belief that an item is "inactive" for subjects too low or too high on the trait.

With these assumptions the model becomes explicit enough to predict, or fail to predict, the data. Such data consist, for each subject, of a pattern of responses across the questionnaire or symptom inventory. A statistical theory has been developed (BOCK and AITKIN, 1981) for estimating the item parameters. A computer program is available (MISLEVY and BOCK, 1981) for performing this estimation. The estimation procedure is complicated and iterative in nature. Thus, there are no simple formulae for obtaining these estimates from the data. However, estimates reported elsewhere (based on large normative samples, such as in the present paper) may be used to perform the ancillary computations necessary for selecting short forms. Details are provided in the Appendix.

UTILIZING INFORMATION

If such a model is found to fit the data well enough, then it allows valuable insights into the structure of the questionnaire at the level of the constituent items. It allows us to assess which items are good, which are bad, and where, on the normally distributed latent trait, items are performing better or worse. Such item evaluation is extremely useful in the construction of good instruments, and in the refinement of instruments already commonly in use.
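The 2-parameter logistic ICC described above can be sketched as follows. The paper does not print the formula, so this is a hedged reconstruction: it assumes the standard logistic form Pj(z) = 1/(1 + exp(-sj(z - tj))), without the 1.7 scaling constant sometimes used to approximate the normal ogive, which is consistent with the description of the threshold as the 50% point.

```python
import math

def icc(z, slope, threshold):
    """2-parameter logistic item characteristic curve: the probability
    of endorsing the item for a subject with latent trait value z."""
    return 1.0 / (1.0 + math.exp(-slope * (z - threshold)))

# At z equal to the threshold the endorsement probability is exactly
# 50%, matching the definition of the threshold parameter; larger
# slopes make the curve rise more sharply around that point.
p_at_threshold = icc(0.755, slope=1.933, threshold=0.755)  # N-scale item 31
```

The curve flattens towards 0 and 1 at the extremes of z, which is the "inactive" behaviour for very low or very high subjects mentioned in the text.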
In the next section such an evaluation will lead to the recommendation of short-forms for the EPQ questionnaires. A concept central to such item assessment is that of the "information" provided by an item at various levels of the latent trait. For a given subject, our ultimate aim is to estimate the value on the latent trait for that subject, given the response pattern observed. One way of obtaining an estimate is Maximum Likelihood estimation. It is an optimal procedure, in that asymptotically (for questionnaires of enough items) it yields estimates that have the least variance (see BIRNBAUM, 1968 and LORD, 1980, for detailed discussion). The "information" at point z on the latent scale provided by a questionnaire is a measure of the precision with which the test measures someone with trait value z, and is the sum of the information components provided by each item at that point z.

Although, in theory, using the Maximum Likelihood estimate as the score for a subject on a questionnaire is optimal, in practice it seems satisfactory to use the number-right score provided the item slopes are not extremely heterogeneous. This would lead to substantial savings in computer expenditure, as the maximum likelihood procedure is iterative, hence time consuming. Recent work (DUNCAN-JONES et al., 1986) indicated that these two scoring methods correlate well in excess of 0.9. Given such linearity, decisions about the usefulness of individual items which are based on aspects of the maximum likelihood scores (such as "information", discussed below) should be no less useful in discussing the number-right score.


Once we know (or have accurate estimates of) the item parameters, we are able to compute the item information for each value of z and, by adding them, the test information function. In this way we can evaluate the performance of the instrument at each value of z and, by averaging over the distribution of z, the individual contribution to overall performance of individual items. The details of these computations appear in the Appendix. Figure 2 shows the item information function of an item with ICC as in Fig. 1. The key features to note are that maximum information is provided at the threshold of the item, and that item information is also related to the slope of the item. Items with bigger slope (better discrimination) provide more information, and all items provide most information at their threshold.

FIG. 2. Item information curve.

Thus, we have a mathematical model of fair generality, and techniques for fitting it to observed data (i.e. for estimating parameters). We can compare observed and expected data to see how well they agree and, if the fit is acceptable, features of the model can be exploited to exhibit the architecture of the test in ways that enable us to evaluate the behaviour of the component items. Such an evaluation can readily lead to short-form recommendations, and provide other insights into the measurement of the trait.

ANALYSIS OF THE EPQ DATA

The data
The EPQ consists of four scales:
N-scale for neuroticism, of 23 items;
E-scale for extraversion, of 21 items;
P-scale for psychoticism, of 25 items; and
L-scale for lie-scale, of 21 items.

This questionnaire was administered in the course of other research (reported in JARDINE et al., 1984) to 3810 twin pairs. After excluding those with missing responses, there were complete records on 3806 pairs of twins. The data used below were from the first twin only of each pair, yielding a sample of 3806 individuals. This was not a random population sample. The twins were volunteers drawn from the Australian NH & MRC Twin Register. To examine, and perhaps rule out, the possibility of gross abnormality in this sample, number-right N, E, P, L-scores from the twin sample were compared with those reported as normative data in the EPQ manual (EYSENCK and EYSENCK, 1975), in a sex by age breakdown. Also compared were the scale intercorrelations among men and women separately. The twin sample yielded systematically lower N-scores and systematically higher L-scores, although the differences are slight. The intercorrelations among the scales were more pronounced in the twin sample, but relative sizes and directions of signs changed little. On this basis, it was concluded that the twin sample was in no way grossly abnormal in reference to the EPQ normative sample. Writing of their standardization data, EYSENCK and EYSENCK (1975) state, "We cannot claim that this was a random sample even of the urban population, but it may be said that it approaches such a random sample more closely than is usual in questionnaire standardization." In fact the current sample may be more appropriate as a norm.

The fit of the model
LTMs were fitted to the data from the four scales, and item parameter estimates obtained. These are shown in Table 1. For each scale, the items are listed in descending order of "average information", which is the mean information provided by the item over all values of z (see below). Discussion of this statistic is deferred until the following subsection.

Having fitted the model (using an appropriate program, e.g. MISLEVY and BOCK, 1981), the next task is to evaluate the adequacy of that fit. Regarding the parameter estimates from such a large sample as effectively the true values, theoretical response pattern probabilities (and frequencies) can be derived. From these one can derive theoretical total score (number-right score) distributions and item-item correlations (phi-coefficients). These can then be compared to the observed total score distributions and item-item correlation matrices for each scale. Figure 3 shows the observed and expected total score distributions. Chi-square goodness of fit statistics were computed for the N, E, P, L distributions. Their values were 62.66, 32.24, 13.96, 33.04 on 24, 22, 12, 22 categories respectively (on the P-scale the top 13 categories were combined with the 13th category because of low frequencies). If the parameter estimates were true values, Chi-square tests could be performed on 23, 21, 12 and 21 degrees of freedom respectively, but these are estimates from the same data. Even then, with such a large sample statistical significance is of secondary importance. No model will be an exact reflection of reality, and the emphasis shifts to assessing whether the model accounts for the data well rather than exactly. In this sense, the goodness of fit indices and Fig. 3 indicate that the models recapture the total score distributions well.
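One standard way to derive the theoretical number-right score distribution from the item parameters is a Lord-Wingersky-style recursion, sketched below. This is an illustrative reconstruction, not the author's code; it again assumes the 2PL form and a standard normal latent trait.

```python
import math

def icc(z, slope, threshold):
    # 2-parameter logistic ICC (see THE MODEL section).
    return 1.0 / (1.0 + math.exp(-slope * (z - threshold)))

def score_dist_given_z(z, items):
    # Recursion over items: dist[k] is the probability of a
    # number-right score of k given trait value z, using the model's
    # assumption that responses are independent given z.
    dist = [1.0]
    for slope, threshold in items:
        p = icc(z, slope, threshold)
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - p)    # item not endorsed
            new[k + 1] += prob * p        # item endorsed
        dist = new
    return dist

def marginal_score_dist(items, half_width=6.0, n=601):
    # Average the conditional score distribution over the N(0,1)
    # latent trait distribution to obtain the expected (theoretical)
    # total score distribution, as plotted in Fig. 3.
    h = 2.0 * half_width / (n - 1)
    out = [0.0] * (len(items) + 1)
    for i in range(n):
        z = -half_width + i * h
        w = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h
        for k, p in enumerate(score_dist_given_z(z, items)):
            out[k] += p * w
    return out
```

Multiplying the resulting probabilities by the sample size gives expected frequencies, which can then be compared with the observed score distribution by a Chi-square statistic as in the text.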


TABLE 1. LATENT TRAIT MODEL PARAMETERS AND INDICES

(a) N-SCALE
EPQ item No.   Slope    Threshold   Average info.
34             1.999    -0.171      0.60
23             1.780     0.162      0.51
31             1.933     0.755      0.49
75             1.710     0.680      0.43
3              1.521    -0.195      0.40
41             1.651     1.125      0.33
7              1.318     0.290      0.32
38             1.298     0.247      0.31
19             1.396    -0.786      0.31
72             1.288    -0.358      0.31
(Best-10 subtotal)                  4.01
58             1.259    -0.040      0.30
27             1.313     0.554      0.30
77             1.298     0.858      0.27
12             1.382    -1.236      0.25
62             1.234     1.135      0.23
80             1.190    -0.949      0.23
15             1.221     1.267      0.21
88             1.143    -1.490      0.18
66             0.849     0.197      0.15
68             0.860     1.390      0.13
54             0.958     1.800      0.12
47             0.724     0.504      0.11
84             0.825    -1.499      0.11
(Scale total)                       6.60

(b) E-SCALE
EPQ item No.   Slope    Threshold   Average info.
21             2.162     0.098      0.67
45             2.305     0.601      0.65
70             1.910     0.224      0.56
42             1.710    -0.179      0.48
10             1.765    -0.742      0.44
14             1.727    -0.735      0.43
52             2.434    -1.344      0.43
86             1.645    -0.422      0.43
40             1.503    -0.048      0.40
5              1.360    -0.251      0.34
(Best-10 subtotal)                  4.43
17             1.864    -1.368      0.32
29             1.345    -1.208      0.25
82             1.126    -0.322      0.25
25             1.050    -0.504      0.22
32             1.242    -1.412      0.20
36             0.807    -0.019      0.14
60             0.751    -0.700      0.12
49             0.767    -1.280      0.11
56             0.692     0.137      0.11
64             0.385    -0.322      0.04
1              0.352    -0.166      0.03
(Scale total)                       6.62

(c) P-SCALE
EPQ item No.   Slope    Threshold   Average info.
67             0.821     0.836      0.14
79             0.899     2.020      0.10
83             1.497     2.734      0.08
87             0.978     2.513      0.08
33             1.455     2.741      0.08
30             1.340     2.175      0.07
37             1.252     2.792      0.07
46             0.753     2.389      0.07
65             1.101     2.758      0.07
57             0.773     2.910      0.06
(Best-10 subtotal)                  0.82
22             1.066     3.070      0.06
6              0.723     2.747      0.06
74             0.562     1.908      0.06
71             0.882     3.254      0.05
18             0.527     2.324      0.05
9              0.474     1.464      0.05
76             0.870     3.261      0.05
90             0.830     3.441      0.04
43             1.012     3.480      0.04
53             0.637     3.705      0.03
2              0.432     3.798      0.03
11             1.280     3.618      0.03
50             0.523     5.280      0.02
61             0.549     7.500      0.01
26             1.075     4.645      0.01
(Scale total)                       1.41

(d) L-SCALE
EPQ item No.   Slope    Threshold   Average info.
59             1.628    -0.076      0.45
8              1.636    -0.194      0.45
63             1.650    -0.430      0.44
24             1.415     0.832      0.31
48             1.794     1.462      0.29
4              1.500    -1.460      0.24
16             1.179    -0.856      0.24
78             1.105     0.361      0.24
44             1.114     0.718      0.23
39             1.083     0.770      0.21
(Best-10 subtotal)                  3.10
28             0.996    -0.509      0.20
20             0.973     0.588      0.19
13             1.057    -1.204      0.18
51             0.930     0.496      0.18
89             0.952    -1.273      0.15
35             0.865     0.760      0.15
81             0.764     1.062      0.12
73             0.911     1.964      0.11
69             0.648     0.247      0.10
55             0.650     0.095      0.10
85             0.902     2.187      0.09
(Scale total)                       4.67


FIG. 3. Number-right score distributions (observed, OBS, and expected, EXP, for each scale).

Table 2 summarizes the discrepancies between observed and model-predicted item-item correlations. As overall indices of fit, the root mean square (RMS) observed minus expected correlation was computed for each scale. For the N, E, P, L scales these RMS values were 0.056, 0.054, 0.043 and 0.037 respectively. The entries in Table 2 are, for each item pair, the signed number of complete RMS increments in the actual observed minus expected difference in correlation for that item pair. Thus, an entry of zero indicates an O-E correlation difference less than one RMS in absolute magnitude. The preponderance of zeros in this table again supports the interpretation that the data are reasonably well accounted for by the models.
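The model-predicted phi-coefficients can be sketched in the same spirit. Under the model's local independence assumption, the probability of endorsing both items at trait value z is the product of the two ICCs; the following is an illustrative sketch (assuming the 2PL form used throughout), not the original program.

```python
import math

def icc(z, slope, threshold):
    # 2-parameter logistic ICC (see THE MODEL section).
    return 1.0 / (1.0 + math.exp(-slope * (z - threshold)))

def _normal_expect(f, half_width=6.0, n=1201):
    # Quadrature of f(z) against the standard normal density.
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        z = -half_width + i * h
        total += f(z) * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h
    return total

def expected_phi(item_j, item_k):
    # Model-implied phi coefficient for a pair of items: by local
    # independence, P(both endorsed | z) = Pj(z) * Pk(z).
    sj, tj = item_j
    sk, tk = item_k
    pj = _normal_expect(lambda z: icc(z, sj, tj))
    pk = _normal_expect(lambda z: icc(z, sk, tk))
    pjk = _normal_expect(lambda z: icc(z, sj, tj) * icc(z, sk, tk))
    return (pjk - pj * pk) / math.sqrt(pj * (1 - pj) * pk * (1 - pk))

def rms_difference(observed, expected):
    # Root mean square of observed-minus-expected correlations: the
    # overall fit index quoted in the text (0.056, 0.054, 0.043, 0.037).
    diffs = [o - e for o, e in zip(observed, expected)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Dividing each observed-minus-expected difference by the scale's RMS and truncating towards zero yields the signed whole-RMS increments tabulated in Table 2.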

Analysis of residuals
A fair summary of the fit would be that the models appear as reasonable, though not perfect, accounts of the data. Given this conclusion, it is of interest to examine in finer detail where the model does not seem to be fitting the data, and attempt to say why this is so. With this in mind, the O-E correlation increments in Table 2 can be particularly informative. If such an increment is large and positive, then the two items correlate more highly than the single latent trait of the model can account for. This would indicate that they share some specific feature over and above that which they measure in common with the rest of the items in the scale.

TABLE 2. OBSERVED-EXPECTED CORRELATION INCREMENTS IN UNITS OF WHOLE RMSs

(a) N-SCALE: RMS = 0.056
[Triangular matrix of signed whole-RMS increments for every N-scale item pair; the entries are predominantly zero, with scattered entries of magnitude 1 to 4. Panels (b)-(d) give the corresponding matrices for the E-, P- and L-scales (RMS = 0.054, 0.043 and 0.037).]


Similarly, three or more items which have large pairwise positive increments indicate the existence of some secondary trait in the set of items. In all the correlations below, the observed correlation was greater than that predicted by the model.

N-scale

Item pair    Residual index
19,80        4
62,77        3
31,75        4
31,41        3
41,75        3

The items are:
19: Are your feelings easily hurt?
80: Are you easily hurt when people find fault with you or the work you do?
62: Do you often feel life is very dull?
77: Do you often feel lonely?
31: Would you call yourself a nervous person?
41: Would you call yourself a tense or "highly-strung" person?
75: Do you suffer from "nerves"?

It seems clear that the excess observed correlation is explained by highly specific factors ("vulnerability of feeling", "excitement due to others", "nerviness"). The two pairs (and the triad) did not intercorrelate excessively, as perusal of Table 2 will demonstrate. This supports the view that the three specific factors are unrelated to each other.

E-scale

Item pair    Residual index
10,86        4
45,70        4
17,52        4
1,64         3
25,82        3

The items are:
10: Are you rather lively?
86: Do other people think of you as being lively?
45: Can you easily get some life into a rather dull party?
70: Can you get a party going?
17: Do you enjoy meeting new people?
52: Do you like mixing with people?
1: Do you have many different hobbies?
64: Do you often take on more activities than you have time to do?
25: Do you feel like going out a lot?
82: Do you like plenty of bustle and excitement around you?

Here the specific factors might be called "liveliness", "party stirring ability", "new/other people orientation", "hobby activity", "crowd sociability". The five pairs are uncorrelated


with each other (their residuals are predominantly zero), except for 10,86 and 17,52, where the four residuals are all one, but the model over-estimates all four correlations. This is interesting in view of the debate about the unidimensionality of the E-scale (GUILFORD, 1975, 1977; EYSENCK, 1977). Although the EPQ E-scale may now be different from the earlier EPI E-scale, there seems little doubt that it is essentially unidimensional and no longer a "shotgun wedding" of subdimensions of sociability and impulsiveness. The present analysis favours the interpretation of one major dimension with a number of small specific factors, each relating to no more than two items.

P-scale

Item pair    Residual index
57,74        7
18,67        5
11,90        4
30,65        3

The items are:
57: Do you like to arrive at appointments in plenty of time?
74: When you catch a train do you often arrive at the last moment?
18: Do you believe insurance schemes are a good idea?
67: Do you think people spend too much time safe-guarding their future with savings and insurances?
11: Would it upset you a lot to see a child or animal suffer?
90: Would you feel very sorry for an animal caught in a trap?
30: Do you have enemies who want to harm you?
65: Are there several people who keep trying to avoid you?

These four specific constructs might be labelled "punctuality", "financial security", "vulnerability to suffering in others", "paranoia". Again, these four pairs of items do not intercorrelate more than expected.

L-scale

Item pair    Residual index
35,51        7

The items are:
35: As a child did you do as you were told immediately and without grumbling?
51: As a child were you ever cheeky to your parents?

This one specific factor might be called "childhood naughtiness".

This analysis of residuals leads to the following explanation of the EPQ scales. A given item consists of words, and has very specific connotations. While it may predominantly tap a single latent construct, each item must also tap some other aspects, not related to this construct, but specific to the words used. In item analysis, this specific variance is absorbed into the binomial error, except in such cases as the above, where two (or three) items, all tapping the same major dimension, have obvious specific components in common. This is the likely explanation of the systematic residuals observed in the correlation matrices above.


This explanation is confirmed when the model is fitted to reduced scales, with all but the best (highest average information) item omitted from such residual clusters. The RMS indices for the N, E, P scales dropped from 0.056, 0.054 and 0.043 to 0.042, 0.040 and 0.030 respectively. For the L-scale, however, this index rose from 0.037 to 0.049. When item 51 was omitted instead of item 35, the RMS changed from 0.037 to 0.049. In both cases, no dramatic residuals occurred, but the fitted models seemed systematically to undercorrelate the items. This anomaly with the L-scale may be due to peculiarities of the program in relation to this data set.

Such enhanced correlations due to surface similarities in the items lead to misleadingly high measures of internal consistency and reliability. The extreme case is where the same item is asked a number of times, yielding item-test correlations of unity, even if it is a poor item with respect to the main factor. The EPQ scales could undoubtedly be purified by omitting all but one item from the above subsets. On the other hand, if the items in a pair are performing well at different places on the latent trait, it may be worth keeping them both for the sake of greater precision of measurement. So far, then, the analysis reveals that a unidimensional model is a good fit to the data, with the minor perturbations indicated by the analysis of the residuals.

The information structure
We can utilize the model to see how well the questionnaires (and individual items) measure this unidimensional trait. To do this we make use of the information curves. Figure 4 plots the information curves for the full scales. The information (or "precision of measurement") is the ordinate, and the latent trait value the abscissa [recall that these latent traits are constrained to have N(0,1) distributions].

FIG. 4. Test information functions.

Notice that information is supplied about a particular value of the latent trait, without reference to the frequency of subjects in the population who take that value. In other words, there exist possibilities such as: the information is large (the test is measuring well) at a region on the latent trait where very few subjects lie. Finally, recall how a test information function arises. It is the simple addition (at each value of z) of the individual item information functions. Thus, we can construct information curves for the whole test or selected subsets of items. Curves in Fig. 4 other than the FULL TEST are constructed for such subtests and will be referred to later.

We see that the N and E scales provide 8 to 9 units of information at z values around the mean. This means that if we use the Maximum Likelihood estimate of z as a score, then the error variance of measurement is about 0.12 (the reciprocal of information at z = 0), with a 95% confidence interval of about 0.7 units on the latent trait (± 1.96 times the square root of 0.12). Similar statements about potential test performance can be made for values on the latent trait other than z = 0. The L-scale also provides maximal information at about the mean of a unit normal latent trait. But this maximum is only about 6 units, indicating that the items in the L-scale do not perform as well as those in the N- and E-scales. Finally, the P-scale provides maximum information of about 5 units at 3 standard deviations above the mean of the latent trait, and only about 1 unit at the mean. Thus, if one wishes accurately to measure the P-dimension all over, this is an inadequate scale.

An interesting insight into the dynamics of test performance stems from the following (somewhat informal) reasoning about information. Where a test is performing well on the latent trait (at a particular value of z), the information will be relatively high. Moreover, the test should be discriminating subjects well (or rank ordering them well) on this region of the latent trait.
Conversely, where little information exists, subjects who are in fact different on the latent trait should be indistinguishable or "mixed up" in their test results. Thus, for the P-scale, the bulk of subjects have scores between zero and six (over 90%, see Fig. 3), while the remainder (those high on the latent trait) are well measured because of the high information (see Fig. 4), and appear correspondingly dispersed on the number-right distribution. The same trend, but in the opposite direction, is present to a lower degree with the E-scale. The mass of the information is below the mean, and this manifests in a negatively skewed distribution in Fig. 3.
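The link between information and precision used above can be made concrete. Since information is the reciprocal of the error variance of the Maximum Likelihood trait estimate, a minimal sketch (an illustration, not part of the original analysis) is:

```python
import math

def standard_error(information):
    # Information = 1 / error variance, so the standard error of the
    # ML trait estimate is the reciprocal square root.
    return 1.0 / math.sqrt(information)

def ci_halfwidth(information, z_crit=1.96):
    # Half-width of an approximate 95% confidence interval on z.
    return z_crit * standard_error(information)

def ci_inflation(full_info, short_info):
    # Relative widening of the confidence interval when a test with
    # average information full_info is replaced by a shorter form
    # with average information short_info.
    return math.sqrt(full_info / short_info) - 1.0

# With 8.5 units of information the error variance is about 0.12 and
# the 95% interval about +/-0.67 trait units; dropping from 6.60 to
# 4.01 units of average information (as for the N-scale short form
# discussed below) widens intervals by about 28%.
```

The same arithmetic underlies the comparison of full scales with their short forms in the next subsection.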

Selecting short forms for the EPQ
The LTM technique also allows a quantitative approach to individual item evaluation. One example of this is in the construction of short forms of the questionnaires of, say, 10 items' length. The right-hand column of Table 1 shows for each item the information averaged over the unit normal latent trait distribution. The sum of these item average information indices is the average information provided by the test (e.g. 6.60 for the N-scale), and provides a single index of how the test performs over the whole population. The "best" 10-item subtest, in this sense, is that with the 10 items of highest item average information. Information curves corresponding to such BEST-10 short forms are also plotted in Fig. 4. Comparison of these short forms with the FULL TEST information curves reveals that they perform in relatively the same way, although understandably less efficiently. In other words, using the BEST-10 short forms results in more than 50% reduction in


test administration and expense, while the average loss over the population in precision of measurement is only of the order of 40% (e.g. from 6.60 to 4.01 for the N-scale). Bearing in mind that information is dimensionally equivalent to 1/variance, the average confidence interval for measurement of the latent trait would only be increased by 28%, from a factor of √(1/6.60) to one of √(1/4.01). Whether or not the full scales are performing ideally (discriminating in desired regions of the trait), the top 10 items listed in Table 1 perform similarly, while providing the maximum average information of any 10-item subtests. Item selection is dependent to a large extent on particular applications, but there are enough data presented in this paper for users to select specific subsets of items aimed at high discrimination among subjects at specific regions on the latent trait (see Appendix for details).

DISCUSSION

There have been two aims to this paper: to use LTMs as vehicles to exhibit the structure of the four EPQ scales (and recommend criterion-based short forms), and secondly, to acquaint the psychiatric readership with LTMs, a powerful new technique for questionnaire analysis. The present data support the view that all four scales are predominantly unidimensional, with occasional pairs of items intercorrelating more than expected. Analysis of such item pairs clearly shows that this is due to surface similarity in their content beyond that determined by the basic dimension.

The model provides a valuable criterion for assessing the worth of a particular item, namely, the average information (or precision of measurement) on a unit normal latent trait. This same item statistic enables one to select (10-item) short forms which minimize the average information loss. For each scale, this led to short forms of less than half the length of the full scales, which maintain roughly the same relative precision of measurement over the latent trait and lead to only a 20-30% increase in confidence intervals for estimation of the trait value of a given subject. The information curves for selected subsets of items could be used in particular applications for targeting tests to measure given regions on the trait (see LORD, 1980). An example of this application is provided in DUNCAN-JONES et al. (1986). Enough material is presented in the current paper (the item slopes and thresholds) for readers to carry out such specific applications.

But there is more to LTM than just analysis of questionnaires. It is a theory of questionnaire structure, which provides insight into test score behaviour (previously unavailable). In the older psychometric perspective, items that are only rarely endorsed or rarely not endorsed are excluded from a final questionnaire on the grounds that, because of low variance, they provide little discrimination.
The LTM approach refines the concept of discrimination, treating it as discrimination at a particular place on the latent trait. For example, most of the items in the P-scale have a very low overall probability of endorsement, but some of them perform good discriminatory functions in a region of the latent trait where few subjects reside. For many applications, this may not be undesirable. The practice of excluding extreme items has other side effects that only become clear within the LTM framework. Such a procedure concentrates test information around the


median of the latent trait. In other words (assuming no items have flat ICCs over the whole latent trait), a substantial proportion of those low on the trait have little probability of obtaining high number-right scores. Similarly, a substantial proportion of those high on the trait have negligible probability of obtaining low scores. Therefore a test with information concentrated around the median will exhibit a total score distribution which tends more towards bimodality than a test with information more evenly spread over the whole trait. For example, Fig. 4 shows that the N-, E- and L-scales have progressively less concentrated information; thus we would expect bimodality to be most apparent in the N-scale, least in the L-scale and intermediate in the E-scale. This expectation is confirmed by the expected total score distributions in Fig. 3. Insofar as the model fits the observed data, the observed total-score distributions are correspondingly less unimodal, small expected bimodalities appearing as plateaux when "smeared" by error. Inspection of Table 1 shows, as we had hoped, that the L-scale location parameters are more dispersed and more extreme than those of the E- and N-scales. In this account, LTMs have provided a framework enabling us to explain aspects of the total score distribution solely in terms of the architecture of the test, rather than having to speculate substantively about the latent-trait distribution. A recent paper examines bimodality in symptom inventories in more detail (GRAYSON, 1986). Finally, if one accepts the sample generating the present data as normative, then the estimates presented in Table 1, together with the computations in the Appendix, provide a means for researchers to select their own subsets of the EPQ for particular applications.

REFERENCES

BIRNBAUM, A. (1968) Some latent trait models. In Statistical Theories of Mental Test Scores (Edited by LORD, F. M. and NOVICK, M. R.). Addison-Wesley, Reading, MA.
BOCK, R. D. and AITKIN, M.
(1981) Marginal maximum likelihood estimation of item parameters: an application of the EM algorithm. Psychometrika 46, 443-459.
DUNCAN-JONES, P., GRAYSON, D. A. and MORAN, P. A. P. (1986) The utility of latent trait models in psychiatric epidemiology. Psychol. Med. in press.
EYSENCK, H. J. (1977) Personality and factor analysis: a reply to Guilford. Psychol. Bull. 84, 405-411.
EYSENCK, H. J. and EYSENCK, S. B. J. (1975) Manual of the Eysenck Personality Questionnaire (Junior and Adult). Hodder & Stoughton, Sevenoaks, Kent.
GIBBONS, R. D., CLARK, D. C., VON AMMON CAVANAUGH, S. and DAVIES, J. M. (1985) Application of modern psychometric theory in psychiatric research. J. Psychiat. Res. 19, 43-55.
GRAYSON, D. A. (1986) Can categorical and dimensional views of psychiatric disorder be distinguished? Br. J. Psychiat. in press.
GUILFORD, J. P. (1975) Factors and factors of personality. Psychol. Bull. 82, 802-814.
GUILFORD, J. P. (1977) Will the real factor of extraversion-introversion please stand up? A reply to Eysenck. Psychol. Bull. 84, 412-416.
JARDINE, R., MARTIN, N. G. and HENDERSON, A. S. (1984) Genetic covariation between neuroticism and the symptoms of anxiety and depression. Genet. Epidemiol. 1, 89-107.
LORD, F. M. (1980) Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum, Hillsdale, NJ.
MISLEVY, R. J. and BOCK, R. D. (1981) BILOG 1: maximum likelihood item analysis and test scoring: logistic model. Distributed by International Educational Services, U.S.A.


APPENDIX

For a given symptom, the jth of n symptoms, say, let P_j(z) be the (binomial) probability of a subject with trait value z endorsing the symptom. The model assumes that z takes a standard normal distribution (mean zero, unit variance) in the population, and that

    P_j(z) = exp[s_j(z - t_j)] / (1 + exp[s_j(z - t_j)])

Let Q_j(z) = 1 - P_j(z) be the probability of not endorsing the item. Then the information supplied by item j at trait value z is given by:

    i_j(z) = s_j² P_j(z) Q_j(z)                                  (1)

Details are given in BIRNBAUM (1968).

The information at z supplied by the whole test is:

    i(z) = Σ_{j=1}^{n} i_j(z),

the sum of all the item contributions at z. The information at z supplied by any subset, A, of items is:

    i_A(z) = Σ_{j∈A} i_j(z),

the sum of the contributions at z from the items in the subset A.

Knowing the slope and threshold values, we can easily compute these information functions for any, and thus every, value of z. To ascertain the average information supplied by item j, ī_j say, over the z distribution, we must add together the information at each value of z, weighted by the relative frequency with which that value of z occurs:

    ī_j = ∫_{-∞}^{∞} i_j(z) n(z) dz                              (2)

where n(z) is the relative frequency function of a standard normal variate:

    n(z) = (1/√(2π)) exp(-z²/2)

The average information supplied by the whole test, ī, and by sub-test A, ī_A, are then:

    ī = Σ_{j=1}^{n} ī_j    and    ī_A = Σ_{j∈A} ī_j

To obtain similar quantities for distributions other than the complete norming population, the same calculations can be performed using the desired relative frequency function instead of n(z). For example, DUNCAN-JONES et al. (1986) discusses an application where interest centres on optimizing measurement in the region 1 < z < 2. Here n(z) is replaced by

    m(z) = 0,                                z < 1
         = (1/√(2π)) exp(-z²/2) / 0.272,     1 ≤ z < 2
         = 0,                                z ≥ 2.

Although no simple formulae exist for computing average information, the integration (2) required is easily performed on a computer. The information function itself (1) is directly computable, and for the EPQ scales, the slope and threshold estimates in Table 1 can be used in performing the numerical integrations entailed by (2). As stated in the text, no simple formulae exist for obtaining estimates like those in Table 1 from given data; a theory-based program such as BILOG (MISLEVY and BOCK, 1981) must be used. However, for the EPQ this task has already been carried out, and the estimates in Table 1 can be used by any researcher to select subsets from these scales for particular applications.