
Int. J. Man-Machine Studies (1982) 17, 247-263

Automated psychological testing: some principles and practice

ALICK ELITHORN, SUE MORNINGTON AND ANDREAS STAVROU

The Institute of Neurology and The Royal Free Hospital, London, U.K.

(Received 17 July 1982)

The present paper reviews briefly the principles which should guide the development of automated psychological test systems. These are illustrated by reference to research and development studies undertaken with the Perceptual Maze Test. The possible role of these tests in assessing treatment effects in neurology and psychiatry is discussed and an example of this type of application is presented.

Introduction

In psychiatry and clinical psychology, research on the automation of clinical techniques began in the mid-1960s. Examples of such work include the automated Minnesota Multiphasic Personality Inventory (MMPI) interpretative system developed by Pearson, Swenson, Rome, Mataya & Brannick (1965) at the Mayo Clinic, Piotrowski's (1964) automated Rorschach interpretive system, Elwood & Griffin's (1972) automated Wechsler Adult Intelligence Scale administration system and Spitzer & Endicott's (1969) development of the DIAGNO programs for determining a DSM-II diagnosis from structured interview protocols.

In the field of automated psychological testing the most impressive American contribution was probably the computerized adaptive ability testing system developed by David Weiss and his co-workers in collaboration with the Office of Naval Research on an interactive time-sharing system, based on a Control Data 6400 computer capable of handling up to 70 terminals. This had a central memory of 650k bytes arranged as 65k 60-bit words. The system was written in Fortran for the Control Data 6000 series compiler, which provides almost the full range of Fortran IV conventions. This system was developed primarily as a research tool controlling the administration of a range of tests using a number of different process control strategies. It made many valuable contributions to the methodologies of automated testing. It also illustrates very well the disadvantage of using a main-frame system with slave, as opposed to intelligent, terminals (Weiss, 1974; DeWitt & Weiss, 1974).

In Great Britain the early studies were those reported by Gedye & Miller (1970) and Elithorn & Telford (1969). In the U.S.A. some of this work was widely accepted, the Roche automated MMPI interpretation program being the best known example.
© 1982 Academic Press Inc. (London) Limited

In Great Britain, however, the attempt by Roche to market this system proved unsuccessful, not so much because it was automated but rather because clinicians in this country have only rarely found the MMPI useful.

Early workers with automated test systems tended to see the main advantages of automated psychological testing as arising from two areas: convenience and economy, and objectivity. As Gedye & Miller often emphasized, the majority of the tests in clinical use involve highly trained administrators in the expenditure of considerable amounts of time and effort for relatively low returns in terms of relevant useful information. Automated testing in its simplest form can be justified in terms of manpower saving alone.

Automating psychological testing, and hence removing from the psychologist the chores of psychometry, has been seen by some workers as freeing the psychologist for closer clinical interaction with the patient. Others have feared that mechanizing testing would tend to mechanize the psychologist and that he would become increasingly distant from the patient. Indeed some, perhaps many, psychologists see their function as intuitive and clinical rather than as clinical and scientific. Goldberg and co-workers, for example, reject almost entirely the concept of the objective standard test score.

Certainly those who regularly administer psychological tests will have found that it is indeed very difficult to achieve standard conditions of presentation and scoring. Indeed the experimenter effect is at least as pervasive in the clinical field as it is in the experimental laboratory (Kintz, Delprato, Mettee, Persons & Schappe, 1965). Even with automated testing, the experimenter effect cannot be completely excluded, since the subject at the very least has to be introduced to the test situation. Here the sex and attitude of the test supervisor, whether he or she be a psychologist, a receptionist, a nurse or a technician, will inevitably affect the test results. This effect can, of course, be minimized by making the test system as self-explanatory as possible.

To produce a system which is self-explanatory to the extent that a literate person of normal intelligence can take a series of tests without any human intervention is a very considerable challenge. With some tests it is much more practicable than with others.
In the present state of developing technology, it would be unwise to do more than tackle this problem where it seems appropriate. In the not too distant future, it will be possible to provide spoken instructions and perhaps, in the more distant future, to look forward to systems which will be able to understand the subjects' replies in a variety of dialects. In the authors' view, inexpensive systems capable of providing spoken instructions will be an essential part of the next generation of automated psychological testing.

Psychological tests are widely used in educational and vocational guidance and personnel selection. Within the health services they are used mainly for assessment in neurology, neurosurgery and psychiatry. Psychological testing is labour-intensive and automation is justified in terms of manpower saving. Much greater advantages, however, will accrue from the full exploitation of process control and other computer facilities. It is this aspect of interactive computer technology which will not only make psychological testing less expensive but also very much more effective.

Standard psychological tests of intellectual functions are effective in assessing gross differences between individuals and the relatively large intra-individual changes which accompany maturation and education. In general they are both insensitive to small changes in individual competence and unsuited to the repeated testing of the same individual. The introduction of computer techniques means, however, that it is possible to use computer item generation and randomized item selection together with process control techniques to develop criterion-referenced tests which are not only more sensitive but also much more suitable for repeated testing. There is much more potential, therefore, in computer-based testing than simple automation.
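The randomized item selection mentioned above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the item bank, stratum labels and function name are hypothetical:

```python
import random

def stratified_selection(item_bank, per_stratum, seed=None):
    """Randomized stratified item selection: item_bank maps a
    difficulty stratum label to a list of item identifiers, and a
    fresh random draw is made within each stratum.  Overall test
    difficulty is therefore held roughly constant across sessions
    while the actual items vary, which is what makes a test
    suitable for repeated administration to the same subject."""
    rng = random.Random(seed)
    test = []
    for stratum in sorted(item_bank):          # fixed stratum order
        test.extend(rng.sample(item_bank[stratum], per_stratum))
    return test
```

Seeding the generator differently for each session yields a fresh but comparably difficult test form.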


Psychological medicine today has available a wide range of powerful treatments, which aim either to depress or ablate mental functions which are overactive, or alternatively to stimulate functions that are pathologically depressed. Many of these (e.g. psychosurgery, electroconvulsive treatment and the antidepressant and psychotropic drugs) carry unwanted, and sometimes dangerous, side effects. No physician would dream of treating hypertension (high blood pressure) without measuring the effect of the treatment on the patient's blood pressure. Clearly it is equally desirable that in treating mental illness the psychiatrist should measure and record the effect that treatment has on his patient's mental competence. The need for the potential that computer-based assessments can bring to clinical care is a very real one (Elithorn, 1974).

Automated psychological testing

The minicomputer has rightly been called an amorphous laboratory instrument (Mundie, Oestreicher & Von Gierke, 1966). In our own field we have found it invaluable for increasing the power of a research-oriented neuropsychological laboratory. Psychological test procedures require a moderately complex graphic display in which the timing between stimuli and the duration of each stimulus can be accurately controlled. They also require that the subject's responses be timed accurately, in some instances to the nearest millisecond. These are facilities which a small laboratory computer can easily provide. More important are the techniques such as process control, computer item generation and randomized stratified item selection which it would be impractical to use without computer assistance. A minicomputer can provide all these facilities together with the facilities for physiological studies.

Distributed computing has been around for some time and Figs 1 and 2 show such a laboratory system as it was in 1970, together with its links to a variety of central computing facilities. Such a system designed around a small laboratory computer forms a powerful tool for collecting behavioural data of great value for research in psychology and psychiatry, and the links with large main-frame computers fulfil the need for complex analyses outside the capability of a small laboratory computer (Elithorn, Singh & Telford, 1970; Elithorn, Powell & Telford, 1976). For an automated psychological test system, the minicomputer is too expensive and a microprocessor-based intelligent terminal linked to a central computing facility provides a practical alternative (Elithorn, Powell, Telford & Cooper, 1980).
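The millisecond-level response timing described above can be sketched as follows. This is a hypothetical illustration using a modern monotonic clock API, not the laboratory's PDP-8 implementation; the function name and callback structure are our assumptions:

```python
import time

def timed_response(present_stimulus, await_response):
    """Present a stimulus, then time the subject's response on a
    monotonic high-resolution clock.  Returns the response together
    with its latency in milliseconds; a monotonic clock is used so
    that the measurement is immune to wall-clock adjustments."""
    present_stimulus()
    t0 = time.perf_counter()
    response = await_response()
    latency_ms = (time.perf_counter() - t0) * 1000.0
    return response, latency_ms
```

In a real test system `await_response` would block on a key or button press; here it is simply whatever callable the caller supplies.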
In arguing that in computerizing psychological test systems one should aim to achieve more than mere automation, we are not maintaining that existing tests should not be automated but rather that those tests which are chosen for automation should be chosen on their merits as tests and their suitability for automation. It is also desirable that in automating any test that test should be improved. At the very least, most tests can be improved by the facility which automation offers of recording individual item response times. To take account of these response times the tests would need to be restandardized. Moreover, it can and certainly will be argued that it is invalid to use existing norms for tests that are automated even at the simplest level. To take an example at one extreme, adding item response times to the scores for Raven's Matrices would be of little benefit. It might be found that one or two items are relatively more difficult than had previously been thought. From the variance


[Fig. 1 shows the laboratory configuration: stimulus devices (visual: flash tubes, CR oscilloscope, storage oscilloscope, film/slide projector; non-visual: tactile, auditory and electrical stimulators), monitoring devices (X-Y plotter, pen recorder, monitor and storage oscilloscopes), a process control unit with console, paper tape and teletype, and channels for behavioural and analogue responses.]

FIG. 1. Laboratory computer system for neurophysiological and psychological research. The central configuration is a 4k PDP-8 with A-D, a 34D oscilloscope control, a DECtape unit and a 32k fixed disc (Elithorn, Singh & Telford, 1970).

[Fig. 2 is a network diagram: the laboratory PDP-8/PDP-9 system at the Royal Free Hospital linked over the G.P.O. network to University of London computing facilities, including the CDC 6600 at ULCC, the 1905E at Queen Mary College, machines at King's College, Imperial College and Cambridge, and the 360/75 at the Rutherford Laboratory.]

FIG. 2. Diagram to illustrate the linkage of the laboratory system in Fig. 1 to the University of London computing facility.


it might also be possible to get a day-dream measure but would this and the facility of automatically recording the subject's responses justify the extensive standardization studies which would be demanded? Certainly from the subject's point of view, the pencil-and-paper version is more convenient and less expensive. Since composing and presenting such complex displays with a small microprocessor is time-consuming, the pencil-and-paper version would probably take considerably less time to complete. It is, of course, reasonable to argue that if a testing session is to be partly automated, then it should be completely automated. Whether or not a version of Raven's Matrices should be one component in automated test batteries is clearly one of many topics which urgently calls for both debate and experiment.

If, for good reasons, we believe that the automation of Raven's Matrices and, for that matter, the automation of the Wechsler battery are challenges which should be left until the technology is more advanced, what then are the criteria which should decide the characteristics of the tests to which those of us working in this field should direct our current efforts? In our research on automated psychological testing our aim has been to develop two interrelated batteries: one primarily for diagnostic use, the other designed for repeated measures studies. In designing these batteries our selection of "tests" has been guided by the following principles.

1. The parameters which control the complexity of the subject's task should be as few as possible and capable of systematic control.
2. It should be possible to vary the difficulty, i.e. the level of complexity, of the task over a wide range without fundamentally altering its character.
3. The task should be criterion-referenced.
4. Item characteristics should facilitate the use of randomized stratified item selection and, where possible and appropriate, computer item generation techniques.
5. The factors determining item difficulty should be explicable in mathematical terms and suitable for systematic manipulation.
6. For each subject it should be possible, using feedback information, to maintain the level of task difficulty or complexity at an optimal level. (To obtain optimum motivation and obtain maximum information about performance it seems likely that the subject should succeed on about half the items.)
7. The task should be designed to externalize as much as possible of the problem-solving activity and hence to give as much information as possible about: (i) the way the subject achieves the level of performance he does and (ii) the causes of his failures.
8. The characteristics of the display stimuli and the response required should be an optimal compromise between the constraints of the technology and the ergonomics of psychological tests (e.g. stimulus-response compatibility should be as high as possible).
9. The measures or indices derived from the test should be as explicable as possible, i.e. the measure should not be confounded with irrelevant aspects of the subject's competence for which it is impossible to account. For example, in measuring the subject's response latency to complex stimuli it is practicable to evaluate a simple motor response component but much more difficult to evaluate the component of the response time which should be related to the subject's familiarity with QWERTY or other complex keyboards.

10. For repeatedly measured observations, the task should be sensitive to small changes in mental status.
11. The task should be attractive and capable of interesting the subject for several sessions.
12. The subject should never be kept waiting, except as part of the test procedure.
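The feedback rule of principle 6, keeping difficulty at a level where the subject succeeds on about half the items, is essentially what an adaptive staircase does. A minimal sketch, assuming a one-up/one-down rule and an idealized subject who succeeds whenever difficulty does not exceed his ability (both assumptions ours, not the authors'):

```python
def next_difficulty(level, succeeded, lo=1, hi=16):
    """One-up/one-down staircase: raise difficulty after a success,
    lower it after a failure, clamped to the available range.  The
    rule converges on the level where success and failure are about
    equally likely, i.e. roughly 50% success."""
    return max(lo, min(hi, level + 1 if succeeded else level - 1))

def run_session(ability, n_items, start=8):
    """Simulate a session with an idealized subject who succeeds
    whenever the current level does not exceed his ability.
    Returns the (level, success) history."""
    level, history = start, []
    for _ in range(n_items):
        success = level <= ability
        history.append((level, success))
        level = next_difficulty(level, success)
    return history
```

With the idealized subject the staircase homes in on the ability level within a few items and then oscillates about it, delivering the half-success regime the principle asks for.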

These principles are, of course, guidelines rather than constraints. With many tests many of them will be inappropriate. It is, however, our belief that the most onerous components in test development are standardization and validation. If a test is to attract the effort that this requires, then it must have clear advantages not only over existing tests but also over other competing tests. The temptation with automation, as with any new technology, is to adopt the "look no hands" approach. We believe that there is much more to automation than this and that full use should be made of the wide range of computer techniques which relatively inexpensive microprocessors now make available.

Our experience is based on the automation of a number of tests which include: Digit Span, a Digit Symbol Coding Test, the Perceptual Maze Test, a Tracking Test, Memory Tests for words and nonsense syllables, Self-Recording Analogue Scales, a Tapping Test, Visual and Sound Reaction Time Tests, an Adjective Check List, a Stress Questionnaire, Tests of Reading Speed, a Vigilance Test, and a Three-Letter Word Recognition Test (Elithorn, 1982).

In the present paper we illustrate the application of the above principles in relation to a single test. The test we present is one which was redesigned with computer technology in mind early in its career and whose value in neuropsychological research in a pencil-and-paper form has been clearly established. It is the test with which we have done most work and for which we can produce results which we believe fully justify the effort involved in full automation.

The Perceptual Maze Test

The Perceptual Maze Test was designed initially to study the effects of psychosurgical operations, e.g. leucotomy and lobotomy, on intellectual functions. At that time the Porteus maze test was considered to be the single test most sensitive to frontal lobe damage, so it seemed appropriate to develop a maze-type task. Because the orientation of the project was "experimental" rather than psychometric, a criterion-referenced approach rather than a norm-referenced approach was adopted, and a test in which it was practicable to vary task difficulty systematically was designed.

The first version of the Perceptual Maze Test, therefore, was a tracking task in which the subject was presented with a choice of pathways containing a varying number of target dots. On a paper strip driven at a variable speed the subject attempted to track through an interlocking network a path which gave him the maximum score, i.e. the maximum number of target dots. The speed with which the subject had to take decisions was controlled by varying the speed at which the paper moved, while the amount of look ahead or foresight which the subject could use was varied by using Perspex windows of different sizes. The task was essentially culture-free. It could, for example, be presented as choosing a route between office and home which passed by the largest number of pubs. Equally it could be the choice of a route between two villages which


allowed one to visit the largest number of water-holes. Unfortunately the time pressure produced by the paper drive was too disconcerting for the majority of patients and a pencil-and-paper version of the test was produced in which the length and breadth of the pattern for each item were systematically varied.

Following a successful pilot study, the test was included in a project undertaken by Professor Arthur Benton at the University of Iowa designed to assess the relative value of some 28 psychological tests for clinical neuropsychological research. Using three different criteria, the maze ranked fifth, second and first in efficiency in detecting brain-damaged subjects. Moreover, when paired with each of the other 27 tests in the battery, the perceptual maze gave an average discrimination better than that of any other test. One finding from Professor Benton's study which subsequently proved to be of considerable significance was the observation that the pattern of failure shown on the test varied with the site of the lesion (Benton, Elithorn, Fogel & Kerr, 1963). In the light of these very encouraging results obtained in an independent validation study it was decided to develop the test further.

TEST STRUCTURE

Research in psychological testing has contributed very considerably to, and has also derived a great deal of benefit from, the development of statistical techniques. Rather surprisingly, in the face of important contributions by Penrose (1944), Attneave & Arnoult (1956), Kolers (1960) and Moore & Anderson (1954), it has failed singularly to develop logical and objective methods for the creation of the psychological tests themselves. In test design and item analysis too much emphasis has been placed on the correlative approach and too little on systematic design.
Indeed, most methods of psychological test construction introduce a large element of subjective judgment on the part of the constructor in selection of items, or else items are chosen arbitrarily on grounds of empirical observations without reference to the mechanisms which subserve the functions they are supposed to test (Anstey, 1966). In the majority of intelligence tests no attempt is made to explain the difference in difficulty between test items, and few workers other than those mentioned above have tried to find any objective measure of increasing complexity to correlate with the empirical observations on item difficulty. To achieve this it is necessary to develop families of test items which can be ordered theoretically in a systematic way. If the individual items in a test can be ordered logically, then the relationship between the items can be given a mathematical form. This can be achieved more elegantly and fruitfully if the nature of the test material is such that its physical or perceptual characteristics can themselves be expressed simply in mathematical terms.

The Perceptual Maze Test lends itself readily to this approach and the decision was taken to produce a version of the test which would maximize its potential for computer manipulation and for item analysis in mathematical terms. The test was therefore redesigned into the form of an interlacing binary net for which a mathematical notation can be developed to describe any parameters which might be related to the subjective difficulty of different items (see Fig. 3). Apart from the physical dimensions of the lattice and the target dots, there are three variables which determine item complexity and hence, we assume, task difficulty: (a) the number of horizontal rows in the lattice, excluding the vertex,


[Fig. 3 shows a specimen maze pattern: an interlacing binary lattice of tracks with target dots at some of the intersections.]

FIG. 3. The subject's task in the Perceptual Maze Test is to find a pathway along the background lattice which passes through the greatest number of target dots. He must keep to the lattice or tracks and must not cut across from one path to another. At each intersection the path must continue forward, i.e. the subject may fork right or left but must not "double back". In general, dependent on the arrangement of the target dots, there is more than one "best" pathway and the subject is said to have succeeded if he finds any one of these. There are two main conditions under which the Perceptual Maze Test is presented. A subject is either told the maximum number of dots which can be obtained, or this information is withheld and he is then left to decide whether he has found a "best" solution. Conventionally, these two methods of presentation are called the "with information condition" and the "without information condition".

(b) the ratio of the number of target dots to the total number of lattice points, the vertex again being omitted, and (c) a "pattern" factor, namely the arrangement of the dots on the lattice. The first two parameters are clearly amenable to exact treatment. The notation developed for the study of the patterning parameter is given in Fig. 4. Some of the

[Fig. 4 shows a worked maze pattern annotated with the notation below.]

FIG. 4. Notation for the analysis of the pattern parameters of the Perceptual Maze Test and for the analysis of subjects' responses. Parameter notation in the figure: n = order of maze = number of rows excluding vertex = 16; x = number of intersections excluding vertex = n(n + 3)/2 = 152; r = number of target dots = 61; S = saturation or density of dots = (r/x) × 100% = 40%; s = pathway score (0, 1, 2, ..., m); m = maximum pathway score = 11; v_s = number of pathways through the maze with score s; v′_s = number of different sets of s dots which lie on pathways of score s; v = v_m = 66; v′ = v′_m = 26 (v_m and v′_m are usually referred to as v and v′); T_s = the set of dots through which a pathway of maximum score s passes; t_s = the number of dots in set T_s. Maze co-ordinates: row number p (= 0, 1, 2, ..., n) runs from the vertex to the top row; position on row p, q (= 0, 1, 2, ..., p), runs from left to right. Thus in the diagram the co-ordinates of the t_{m-4} dots are (8, 8), (10, 10), (13, 13), (14, 0) and (16, 0).
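The structural parameters of this notation follow directly from the triangular lattice: row p carries p + 1 intersections, so excluding the vertex an order-n maze has x = n(n + 3)/2 of them. A small sketch (our illustration; the function name is hypothetical):

```python
def lattice_parameters(n, r):
    """Structural parameters of an order-n maze in the Fig. 4
    notation.  Row p of the triangular lattice carries p + 1
    intersections, so excluding the vertex there are
    x = n(n + 3)/2 of them; the saturation S is the percentage of
    intersections carrying one of the r target dots."""
    x = n * (n + 3) // 2          # intersections, vertex excluded
    S = 100.0 * r / x             # dot saturation, per cent
    return x, S
```

For the worked example of Fig. 4 (n = 16, r = 61) this reproduces x = 152 and S ≈ 40%.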


studies undertaken are discussed briefly here and are reviewed at greater length in Smith, Jones & Elithorn (1978).

At the time, this decision was criticized on the grounds that we were making the human subject conform to the requirements of the machine, rather than the reverse, and that given a little ingenuity any test structure could be represented in a computer. In the event this relatively small change in format proved to have been a most important step. It has facilitated computer item generation and the analysis of item structure. Random patterns generated by the computer can be solved almost instantaneously using a simple algorithm for routing telephone traffic developed by Edward Moore of the Bell Laboratories (Moore, 1959). It enabled David Lee to develop a matrix terminology to describe the target dot pattern and relate this to item difficulty (Buckingham, Elithorn, Lee & Nixon, 1963; Lee, 1965, 1967). Lee was also able to develop a number of methods for analysing the strategies adopted by different subjects and found, for example, with a pencil-and-paper version of the test that there were marked differences in the strategies which men and women used to obtain solutions (Lee, 1965). Again the simplicity of the structure of the test has made it easier to write programs modelling techniques which human subjects may use in solving the tests.
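The machine solution reduces to a simple dynamic program on the lattice: the best score obtainable from a node depends only on the best scores of its two forward neighbours. The sketch below is our reconstruction in Python, not the original Bell Laboratories routine, and the names are hypothetical; it generates a random item and solves it in O(n²) node visits rather than enumerating all 2^n pathways:

```python
import random

def generate_item(n, n_dots, seed=None):
    """Place n_dots target dots at random intersections of an
    order-n maze.  Intersections are (row p, position q) with
    q = 0..p; the vertex row p = 0 is left empty, as in the
    Fig. 4 notation."""
    rng = random.Random(seed)
    cells = [(p, q) for p in range(1, n + 1) for q in range(p + 1)]
    return set(rng.sample(cells, n_dots))

def max_score(n, dots):
    """Best obtainable pathway score, by dynamic programming from
    the top row down to the vertex.  A forward path at (p, q)
    continues to (p + 1, q) or (p + 1, q + 1), so the best score
    from any node is its own dot plus the better of its two
    successors."""
    best = [1 if (n, q) in dots else 0 for q in range(n + 1)]
    for p in range(n - 1, -1, -1):
        best = [(1 if (p, q) in dots else 0) + max(best[q], best[q + 1])
                for q in range(p + 1)]
    return best[0]   # score of a "best" pathway from the vertex
```

A 16-row maze has 2^16 = 65,536 pathways, yet the dynamic program visits each of its 153 nodes once; this is the asymmetry between machine and human solvers that the next section exploits.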

Simulation studies

Building computer models of the way in which subjects might tackle problems can be a fruitful approach to research on human skills. The Perceptual Maze Test is particularly suited to the development of such simulation studies. As already mentioned, the problem of writing computer programs which can solve a Perceptual Maze pattern of any feasible size is relatively trivial. In addition to the algorithm mentioned above, simple programs using either systematic or random searches will readily solve maze items of any reasonable size.

These programs, of course, bear little relationship to the way in which human subjects solve test items, but the two contrasting techniques, algorithmic and heuristic, are prototypes for strategies which human subjects may well combine, namely, a sequential search and an encoding of the dot relationships. Clearly humans do employ search strategies, but a systematic sequential search of a 16-row maze would mean the examination of 65,536 pathways: trivial for a machine, but for man a time-consuming, and for many people an impossible, task.

How then do subjects go about solving this problem? The subject's task is to link together a maximum number of target dots. Most subjects say that they first scan the maze for areas with a high target dot density and then attempt to link these areas together. In other words, the subject makes an analysis of the groupings of target dots and the relationships between these groups. For most subjects this initial scanning is perceptually controlled rather than systematic, although recent evidence on sex differences suggests that reading habits may to a greater or lesser extent determine a tendency for subjects to scan from left to right. Although ultimately constrained by the lattice background, the possibility of linking target dots might be assessed on the basis of the angular relationships between the groupings of target dots.
One possible strategy might therefore include a transformation of the maze graph (defined by the lattice lines) to a graph defined by the target dots and the links between these.


Such a hypothesis would certainly fit well with current research on the way the mammalian visual system encodes data. The number of target dots which can be linked together as a single unit will be limited, and the size of the unit which any individual can handle will be one variable which underlies the correlation of the test with tests of general intelligence. To the human subject some linkages between target dots will be more obvious than others. The same applies to the linkages between groups and, indeed, some groups may be perceived as linked when there is no valid path connecting them, and some relationships, such as horizontal groupings, may be seen as more attractive than their problem-solving value warrants.

Grouping the target dots into groups and "runs" is an encoding technique which provides the basis for a whole class of strategies. For example, a subject might scan the maze, pick out a large group of dots, and use this as a focal point for a pathway. Alternatively he might try to link up runs of dots working systematically from the vertex to the top. In this latter case we may visualize the subject as starting at the vertex and making a series of decisions to fork left or right based on the configuration of dots in the rows immediately ahead. If this decision is made on the basis of a more-or-less "complete" knowledge of this configuration, then the number of rows ahead that the subject will be able to include will be very limited. Clearly most subjects will not search only from below upwards, nor will their decisions be taken only on the basis of complete knowledge of a limited area. Most subjects will use a probability function. "Look ahead" is likely to be complete only for the immediate neighbourhood and to spread forward on an increasingly impressionistic basis. These assumptions are simplified approximations from which subjects will depart to a greater or lesser extent.
Analysis, however, suggests that scanning from below upwards is common and that the effective look ahead of some subjects is quite short. Thus subjects tended to "adhere" to, or continue to the end of, a run in preference to breaking away to a new group, even though a breakaway gives a better score. There is also a tendency for subjects to continue in a straight line even though this is to their disadvantage.

In one simulation developed by D. Lee, a maze solution is considered as a series of runs of dots from the vertex up. The best run in the region of the apex is chosen, then the best run accessible from the top of the first run, and so on up to the top of the maze. Lee's program combines a local encoding component with an elementary search procedure but does not take account of the effect of distant configurations on local decisions. It also relates to a reduced transformation which ignores possible alternative paths through the open lattice between groupings of target dots.

Both these points are taken into account in a model developed by R. Jagoe, which adopts a total encoding strategy allowing for perceptual approximations. In this model the subject is seen as taking sequential binary decisions from the vertex to the top. At each decision point he is faced with a choice between two opportunity cones. The relative attractiveness of these two cones is a function of the number of target dots each contains and the arrangement of these dots. Each target dot thus influences the decision process related to the vertex node of every opportunity cone. The algorithm used in generating and evaluating items for the automated version of the test gives to each node a value representing the value of the opportunity cone calculated algorithmically. The human approach,


however, is heuristic, and to simulate this it is necessary to derive a function which represents for each cone not its true value but its attraction as perceived by the subject. These derived values are therefore a function of the subject as well as of the pattern of the test item. Once these values have been calculated, a decision strategy can start at the vertex of the pattern, at each bifurcation choose the most attractive cone, and so on up to the top. Subjects tend to take a path with a maximum score, though not all of them succeed, just as in a resistance network current tends to take the path of least resistance, though not all of it does so. In this program, therefore, the maze pattern is transformed into a "resistance" network in which the nodes with target dots have resistances of lower value than the nodes without target dots. Assuming that current can only flow one way along the pathways, the resistance from every node to the top is computed. The value at any given node is a function of the total pattern of the cone with that node as vertex. If there are many target dots in the cone then the value will in general be low; if these dots are arranged in a vertical rather than a horizontal pattern the value will tend to be low; and if there are dots close to the node, the value will again be low. In other words, the value of the resistance from a node to the top reflects perceptual components which are likely to attract a subject to the node. In the first of the two programs written by R. Jagoe the simplest resistance values were taken: unit resistance for an empty node and zero resistance for a node with a target dot. The probability function determining the decision at each node was also simple, namely, choose the next node of lowest resistance or, if these are equal, go to the "right"-hand one. In this form the program gives only a single pathway.
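The resistance transformation just described can be sketched directly. The code below, a minimal reconstruction rather than the original program, assigns unit resistance to empty nodes and zero to dotted nodes, combines the two branches above each node in parallel, and then descends greedily from the vertex, breaking ties to the right as in the first program.

```python
# Sketch of the "resistance network" model: each node gets unit
# resistance if empty, zero if it holds a target dot; resistance
# from a node to the top combines the two branches above it in
# parallel. The decision rule descends greedily, choosing the
# lower-resistance branch (right on ties).

def parallel(a, b):
    """Parallel combination of two resistances."""
    return 0.0 if a + b == 0 else a * b / (a + b)

def solve_resistance(rows):
    """rows[i][j] == 1 marks a target dot; node (i, j) connects
    upward to (i+1, j) and (i+1, j+1)."""
    n = len(rows)
    r = [[0.0 if dot else 1.0 for dot in row] for row in rows]
    # Resistance-to-top, computed from the top row downwards.
    R = [row[:] for row in r]
    for i in range(n - 2, -1, -1):
        for j in range(i + 1):
            R[i][j] = r[i][j] + parallel(R[i + 1][j], R[i + 1][j + 1])
    # Greedy descent from the vertex: lower resistance wins, ties go right.
    i, j = 0, 0
    path, dots = [(0, 0)], rows[0][0]
    while i + 1 < n:
        j = j if R[i + 1][j] < R[i + 1][j + 1] else j + 1
        i += 1
        path.append((i, j))
        dots += rows[i][j]
    return path, dots
```

Because every dot in a cone lowers the resistance at the cone's vertex, distant configurations influence local decisions, which is precisely what the run-chaining model lacked.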
In the second program the same resistance values were used, but the decision probabilities were calculated from a smoothed function derived from the decision behaviour of a subset of 12 men and 12 women from the control subjects. How realistic are these models, and how well do they model segments of human performance? One way to assess their appropriateness is to compare their performance with that of a group of human subjects. In Fig. 5 we present some results from such a comparison. The first two of the three programs each gave a unique solution pathway, and each program's success or failure on 72 items is related to the relative difficulty of each pattern for an experimental group of 144 subjects. Of the 72 items, the first model failed to find a solution in 22, and the pattern of success and failure was similar to that of the R.A.F. subjects in that the model failed seven of the 10 most difficult items and only one of the 10 easiest items. The second program failed 24 items, eight being from the 10 most difficult and none from the 10 easiest. For these two programs the correlation of success and failure with the rank order of the patterns was 0.38 and 0.41, respectively. The third program, with the derived probability function, was run 144 times. It failed to find a solution path in only two of the mazes. The correlation between the performance of the program and that of the 288 control subjects was significantly higher, at 0.60. Higher correlations would almost certainly be obtained by manipulating the two program parameters. Recent studies on human laterality bias suggest that a default left bias rather than a right bias would have been more realistic. One of the characteristics of the maze test which makes it particularly suitable for analysis by model-building or simulation techniques is that there is usually a large

FIG. 5. Comparison between human subjects and simulation programs on the Perceptual Maze Test. The scores are those of 144 men (solid line) and 144 women (broken line) and the three programs described in the text for 72 maze items. The apparent greater variability of the female subjects reflects the fact that the items are arranged in rank order of difficulty for men. Above the graph the two rows of symbols indicate which items were failed by programs 1 and 2. The histogram below the graph shows the number of passes achieved by program 3 in 144 "attempts" at each item.

number of different failure solutions and in general more than one correct solution. It is thus possible to assess the model not only in terms of its success and failure but also in terms of the actual routes which it has chosen. For example, in the case of the simulations which produce single pathway solutions we can ask whether the model, when it fails, tends to make the same or similar mistakes as the subjects with whom the model is being compared. Since all solution paths have a common origin, a key characteristic of an incorrect pathway is the point at which it first leaves or breaks away from a correct pathway. For both of the first two models there was only one item on which the model made an error not made by any subject. With only one pattern did a majority of subjects make the same error as either model. In this instance 149 subjects broke away at the same point as the second model and only 76 stayed with the correct path (Elithorn, Jagoe & Lee, 1966).
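The breakaway-point comparison can be expressed as a short routine. The sketch below is an illustration of the idea, not the 1966 analysis itself: it finds the first node at which an attempted path has left every correct path, by taking the longest prefix the path shares with any correct solution.

```python
# Sketch of the "breakaway point" analysis: the point at which an
# incorrect pathway first leaves every correct pathway. Paths are
# node sequences sharing a common origin (the vertex).

def breakaway_point(path, correct_paths):
    """Return the index of the first node at which `path` has
    departed from all correct paths, or None if it never does."""
    best = 0
    for correct in correct_paths:
        k = 0
        while k < min(len(path), len(correct)) and path[k] == correct[k]:
            k += 1
        best = max(best, k)  # longest shared prefix over all correct paths
    return best if best < len(path) else None
```

Tallying these indices over a group of subjects shows whether a model, when it fails, tends to break away at the same points as the subjects with whom it is being compared.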

Automated testing with the Perceptual Maze Test

Giving the maze a logically-simple structure has clearly made it much easier to examine systematically, and in greater depth, the results obtained with the pencil-and-paper versions of the test. Even more important, perhaps, has been the fact that this simple structure has facilitated the automation of the test, first on a minicomputer and more recently on a range of microprocessor systems. Mini- or microprocessors capable of producing complex graphics are still expensive, and with the less expensive systems the response times are slow and the graphics still relatively restricted. Because the Perceptual Maze Test structure lends itself to highly efficient display algorithms, it was possible even in the early days to produce complex yet relatively flicker-free


patterns with point plot refresh oscilloscopes. These have the disadvantage that the image of a dynamic display must be continually updated, and with computer displays flicker inevitably becomes a problem. Storage oscilloscopes are unsuitable for dynamic displays, and plasma displays are expensive and provide poor contrast. Eventually we discovered the DEC VK8E logic module, which provides a direct read-out from the computer core to a standard television monitor. This means that the graphic display is represented in core by a bit-per-point map, so the processor does not have to maintain the display. The system is essentially that used by current microcomputers.

Any complex performance is achieved with a mixture of skills, the composition of which will vary between subjects. For example, on average men possess a rather specific spatial skill which plays a major role in determining their performance on the Perceptual Maze Test; women on average have this skill to a much lesser degree (Beard, 1965; Carter-Saltzman, 1979). In fully evaluating test results, therefore, it is essential to apportion these components, either by a component analysis of test scores from a battery of tests or by the direct analysis of the subject's performance on individual tests. We believe that these techniques should be continued, and that current computer technology makes this not only possible but, even with the first generation of automated tests, practicable. We have already described how a detailed computer analysis of a subject's performance on the pencil-and-paper version of the Perceptual Maze Test can contribute to a better understanding of the problem-solving strategies which subjects adopt. With the automated version of the test it becomes practical to analyse the subject's performance in much greater detail than would otherwise be possible.
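The bit-per-point arrangement mentioned above can be illustrated schematically. The code below is a modern analogue of a bit-mapped display held in ordinary memory, not the original VK8E interface: drawing is just setting bits, and the "display hardware" (here, a print loop) reads the map out without the processor maintaining a refresh cycle.

```python
# Schematic analogue of a bit-per-point display map: the image lives
# in memory as one bit per pixel, so drawing is setting bits, and
# the display reads the map directly. Dimensions are arbitrary.

WIDTH, HEIGHT = 32, 8
framebuffer = bytearray(WIDTH * HEIGHT // 8)  # one bit per point

def set_pixel(x, y):
    index = y * WIDTH + x
    framebuffer[index // 8] |= 1 << (index % 8)

def render():
    """Read the bit map out, as a raster display would."""
    lines = []
    for y in range(HEIGHT):
        row = ""
        for x in range(WIDTH):
            index = y * WIDTH + x
            on = framebuffer[index // 8] & (1 << (index % 8))
            row += "#" if on else "."
        lines.append(row)
    return "\n".join(lines)

# Plot a row of target dots, as a maze display might.
for x in range(0, WIDTH, 4):
    set_pixel(x, 3)
```

Because the map is just memory, updating a dot costs one write, which is what made flicker-free dynamic maze displays practical on small machines.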
With the automated version, not only does it become much more practical to record and analyse each individual solution and to time each item, but individual tracking responses can also be recorded. Hence we can analyse how subjects distribute their effort between different aspects of the task. Thus from a subject's performance on the automated version of the Perceptual Maze Test it is possible to extract the following indices.

1. Search Time: the time until the first motor response.
2. Track Time: the time from the first motor response until the completion of the task.
3. Check Time: the time between a subject completing his tracking and his signifying that he is satisfied with his solution.
4. Non-fatal Errors: the number of corrections per item.
5. Fatal Errors: the number of items incorrect.
6. Motor Index: the average of the fastest 10% of key responses.
7. Refresh Index: the number of pauses >1 s during the tracking phase.
8. Laterality Index: the percentage of right preferences.
9. Processing Speed: the number of vertices processed per second.

Evaluating the value of this type of "behavioural fragmentation" is in itself a daunting challenge, but we have, we believe, been able to demonstrate that the result justifies the effort. Analysing performance in this way shows that a subject who spends relatively little time on his initial search will tend, during the tracking phase, to spend relatively more time conducting additional searches and, perhaps, making and correcting more errors. Extroverts, as opposed to introverts, tend to behave in this way. Again, not unexpectedly, this interaction between cognitive style and personality is affected by the information condition under which the test is presented. Extroverts are less constrained when the test is presented under the "without-information" condition and find it easier to start the tracking part of the task while still uncertain as to whether they have found a solution. These aspects of the subject's performance on the Perceptual Maze Test, his readiness to "have a go" and his response to the "without-information" condition, have been shown to be affected not only by personality factors (Elithorn, 1982) but also by drugs which produce little change in the subject's overall competence at the task (Elithorn, Cooper & Lennox, 1979). In psychiatry a clinical crisis currently centres around the use of physical treatments such as psychosurgery, electro-convulsive therapy and powerful antipsychotic and antidepressant drugs which may have many undesirable side effects. In practice an increasing number of patients are being denied effective treatment through the fear that the treatment is worse than the disease. In our own work with automated tests we have become increasingly impressed with the potential that these tests have for evaluating the balance of effects of treatments which are designed to affect mental functions therapeutically but which also have undesirable mental side effects. We have, we believe, demonstrated their value using classical statistical techniques. We have also been fortunate in having access to a Bayesian statistical program developed by Professor Adrian Smith. This program makes it possible to identify intervention effects in time-dependent data even though these contain autoregressive and learning components.
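Several of the timing indices listed above can be derived mechanically from a per-item event log. The sketch below assumes a simple timestamped record of key events; this format, and the function name, are illustrative rather than the original system's.

```python
# Sketch of deriving timing indices from a logged item record.
# The event format (timestamped key events) is an assumption for
# illustration, not the original system's recording format.

def item_indices(presented_at, key_events, confirmed_at):
    """key_events: chronological list of (timestamp, kind) pairs,
    kind in {"track", "correction"}."""
    times = [t for t, kind in key_events]
    search_time = times[0] - presented_at          # until first motor response
    track_time = times[-1] - times[0]              # first response to completion
    check_time = confirmed_at - times[-1]          # completion to confirmation
    non_fatal_errors = sum(1 for _, kind in key_events if kind == "correction")
    # Refresh index: pauses longer than 1 s during the tracking phase.
    gaps = [b - a for a, b in zip(times, times[1:])]
    refresh_index = sum(1 for g in gaps if g > 1.0)
    return {
        "search_time": search_time,
        "track_time": track_time,
        "check_time": check_time,
        "non_fatal_errors": non_fatal_errors,
        "refresh_index": refresh_index,
    }
```

Aggregating such records across items would then yield the fatal-error count, motor index and processing speed for the session.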
Professor Smith's technique first fits a mathematical model to the series and then sets up the hypothesis that there is a discontinuity in the data, so that the series would be better represented by two models. Next, a likelihood analysis calculates for each point the probability that the discontinuity occurs at that point. If there is little likelihood that there is a discontinuity, then these probabilities, which total one, are distributed evenly between the observations. If there is a relatively high probability of a discontinuity at one point, the data are then split at this point and the analysis repeated on each section. Some results from such an intervention study are presented in Fig. 6. The patient was a young graduate suffering from a severe schizo-affective illness. While on treatment with chlorpromazine, he was found one night unconscious in the lavatory. Subsequently, on five occasions, his E.E.G. showed fluctuating activity, generally predominantly on the left but on one occasion maximally on the right. His medication was changed to Haloperidol. He found this depressing and threatened to discontinue medication. However, he agreed to the systematic withdrawal of his medication while undergoing daily automated testing. Following the withdrawal of his medication he became progressively more overactive and excitable. With the demonstration that his performance on the tests was deteriorating, he agreed to restart Haloperidol at a lower dosage, which in the event proved quite adequate. When his psychiatric condition stabilized he again agreed, on the basis of his nocturnal collapse and the E.E.G. findings, to a trial of Phenytoin. In Fig. 6(a) the median solution times on the Perceptual Maze Test are plotted and, below these, the results of three runs of Professor Smith's program. In run 1 there is evidence for a discontinuity in the period during which the patient's Haloperidol was being withdrawn.
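The single-discontinuity step of this procedure can be sketched in simplified form. The code below is an illustration only, not Professor Smith's program: it takes the simplest possible model (a shift in mean with known noise scale, no autoregressive or learning terms), computes a likelihood for a discontinuity at every candidate point, and normalizes so that the probabilities total one. A flat series yields an even distribution; a clear step concentrates the probability at one point.

```python
# Sketch of the discontinuity-likelihood step: for each candidate
# point k, fit separate means to y[:k] and y[k:], score the split
# by its residual sum of squares, and normalize the likelihoods so
# the probabilities total one. A simplification of the Bayesian
# program described in the text, for illustration only.
import math

def changepoint_probabilities(y, sigma=1.0):
    n = len(y)
    logliks = []
    for k in range(1, n):  # split into y[:k] and y[k:]
        left, right = y[:k], y[k:]
        m1 = sum(left) / len(left)
        m2 = sum(right) / len(right)
        ss = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
        logliks.append(-ss / (2 * sigma ** 2))
    peak = max(logliks)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(l - peak) for l in logliks]
    total = sum(weights)
    # probability that the discontinuity lies just before index k
    return {k: w / total for k, w in zip(range(1, n), weights)}
```

If one point carries a high probability, the series is split there and the same analysis is repeated on each section, exactly as described above.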
In run 2 an additional discontinuity occurs at the point where treatment with Haloperidol was re-introduced. In run 3 there is further

[Figure 6: three panels, each plotting a daily test index against Day No. (days 1-60), with the Haloperidol dosage levels (15 mg, 5 mg) marked and the discontinuity probabilities from three runs of the analysis shown below each plot: (a) solution time (without information); (b) search time (without information); (c) fastest 10% of key responses (without information).]
FIG. 6. Bayesian Time Series Analyses of sequential test data (for explanation see text).

evidence for the disruption in performance caused by the withdrawal of the Haloperidol. In Fig. 6(b) the data analysed are the search indices for the same test. There is again evidence in all the runs of a disruption of performance during the period of Haloperidol withdrawal, and in runs 2 and 3 evidence that this perceptual component of his performance was also affected by the reinstatement of Haloperidol but not by the addition of Phenytoin.


In Fig. 6(c) the motor components of the subject's performance are analysed. This time the initial disruption is more clearly related to the final withdrawal of Haloperidol, but there is no detectable effect when this drug is re-introduced. There is, however, evidence for a speeding of the motor component at the time that treatment with Phenytoin is started. In general, the literature on Phenytoin suggests that this anticonvulsant has little effect on mental and motor skills. In this case, however, the E.E.G. showed that the patient had a largish area of abnormal electrical activity just anterior to his left motor cortex. The finding that Phenytoin affected his motor performance can therefore be explained, and this is strong confirming evidence both for the validity of the statistical analysis and for the technique of computer fragmentation of test performance. Further development of these techniques will, we believe, meet many of the criticisms which currently prevent single-subject intervention trials from being used effectively in monitoring treatment effects in psychiatry (Kratochwill, 1978).

Summary

The present paper reviews briefly the principles which should guide the development of automated psychological test systems. These are illustrated by reference to research and development studies undertaken with the Perceptual Maze Test. The possible role of these tests in assessing treatment effects in neurology and psychiatry is discussed and an example of this type of application is presented.

It is a pleasure to acknowledge the support and encouragement of our colleagues at the Royal Free Hospital and the National Hospital for Nervous Diseases. To Professor Adrian Smith, Dr C. D. Litton and Dr J. A. Heady, we are particularly grateful for statistical advice and assistance with the analyses.

References

ANSTEY, E. (1966). Psychological Testing. New York: Macmillan.
ATTNEAVE, F. & ARNOULT, M. D. (1956). A quantitative study of shape and pattern perception. Psychological Bulletin, 53 (6), 452-471.
BEARD, R. M. (1965). The structure of perception: a factorial study. British Journal of Educational Psychology, XXXV, 210-220 (June).
BENTON, A. L., ELITHORN, A., FOGEL, M. L. & KERR, M. (1963). A perceptual maze test sensitive to brain damage. Journal of Neurology, Neurosurgery and Psychiatry, 26 (6), 540-544.
BUCKINGHAM, R. A., ELITHORN, A., LEE, D. N. & NIXON, W. L. B. (1963). A mathematical model of a perceptual maze test. Nature, 199 (4894), 676-678.
CARTER-SALTZMAN, L. (1979). Patterns of cognitive functioning in relation to handedness and sex-related differences. In WITTIG, M. A. & PETERSON, A. C., Eds, Cognition and Perception. London: Academic Press.
DEWITT, L. J. & WEISS, D. J. (1974). A computer software system for adaptive ability measurement. Research Report 74-1, Psychometric Methods Program, Department of Psychology, University of Minnesota.
ELITHORN, A. (1974). Combining inferences for single sequential trials. In LEESE, A. L., Ed., Proceedings of NATO Conference on Statistical Design and Experimental Field Trials at Rapallo, Ottawa.
ELITHORN, A. (1982). Psychological testing: the way ahead. In KARAS, E., Ed., Current Issues in Clinical Psychology. New York: Plenum Press.
ELITHORN, A. & TELFORD, A. (1969). Computer analysis of intellectual skills. International Journal of Man-Machine Studies, 1 (2), 189-209.
ELITHORN, A., JAGOE, J. R. & LEE, D. N. (1966). Simulation of perceptual problem solving skill. Nature, 211 (5053), 1029-1031.


ELITHORN, A., SINGH, K. & TELFORD, A. (1970). Small computer or intelligent terminal. Proceedings of the 6th DECUS European Seminar, Munich, pp. 1-7.
ELITHORN, A., POWELL, J. & TELFORD, A. (1976). Proceedings of Electronic Display, vol. 3, p. 18. London, Newport Pagnell: Network.
ELITHORN, A., COOPER, R. & LENNOX, R. (1979). Assessment of psychotropic drug effects. In CROOKS, J. M. & STEVENSON, J. H., Eds, Drugs and the Elderly. London: Macmillan.
ELITHORN, A., POWELL, J., TELFORD, A. & COOPER, R. (1980). An intelligent terminal for automated psychological testing and remedial practice. In GRIMSDALE, R. L. & HANKINS, H. C. A., Eds, Human Factors and Interactive Displays. Buckingham: Network.
ELWOOD, D. L. & GRIFFIN, H. R. (1972). Individual intelligence testing without the examiner: reliability of an automated method. Journal of Consulting and Clinical Psychology, 28, 9-14.
GEDYE, J. L. & MILLER, E. (1970). Developments in automated testing. In MITTLER, P., Ed., The Psychological Assessment of Mental and Physical Handicaps. London: Methuen.
KINTZ, B. L., DELPRATO, D. J., METTEE, D. R., PERSONS, C. E. & SCHAPPE, R. H. (1965). The experimenter effect. Psychological Bulletin, 63, 223-232.
KOLERS, P. A. (1960). Some aspects of problem-solving: 1. Methods and materials. Wright Air Development Division Technical Report 60-2.
KRATOCHWILL, T. R. (1978). Foundations of time-series research. In KRATOCHWILL, T. R., Ed., Single Subject Research. London: Academic Press.
LEE, D. N. (1965). A psychological and mathematical study of task complexity in relation to human problem-solving using a perceptual maze test. Ph.D. thesis, University of London.
LEE, D. N. (1967). Graph-theoretical properties of Elithorn's maze. Journal of Mathematical Psychology, 10, 341 (June).
MOORE, O. K. (1959). The shortest path through a maze. Proceedings of an International Symposium on the Theory of Switching. Harvard University Computational Laboratory Annals, pp. 29-30. Harvard University Press.
MOORE, O. K. & ANDERSON, S. B. (1954). Modern logic and tasks for experiments on problem-solving behaviour. Journal of Psychology, 38, 151.
MUNDIE, J. R., OESTREICHER, H. L. & VON GIERKE, H. E. (1966). Real time digital analysis for biological data. IEEE Spectrum, 3, 116 (October).
PEARSON, J. S., SWENSON, H. P., ROME, H. P., MATAYA, P. & BRANNICK, T. L. (1965). Development of a computer system for scoring and interpretation of Minnesota Multiphasic Personality Inventories in a medical clinic. Annals of the New York Academy of Sciences, 126, 682-692.
PENROSE, L. S. (1944). An economical method of presenting matrix intelligence tests. British Journal of Medical Psychology, 20 (2), 144-146.
PIOTROWSKI, Z. A. (1964). Digital-computer interpretation of inkblot test data. Psychiatric Quarterly, 28, 1-26.
SMITH, J., JONES, D. & ELITHORN, A. (1978). The Perceptual Maze Test. Medical Research Council. [Reprinted with addendum, 1978.]
SPITZER, R. L. & ENDICOTT, J. (1969). DIAGNO II: further developments in a computer program for psychiatric diagnosis. American Journal of Psychiatry (Suppl.), 125, 12-21.
WEISS, D. J. (1974). Strategies of computerized ability testing. Research Report 74-x, Psychometric Methods Program, Department of Psychology, University of Minnesota.