CHAPTER 8
Making Decisions in Ophthalmology
ADRIAN R. HILL
The Eye Hospital, Oxford and Nuffield Laboratory of Ophthalmology, University of Oxford, Oxford, UK

CONTENTS
1. Uncertainties, Intuition and Rationality
2. The Nature of Judgement
3. Measurement and Probability
4. Some Limits to Clinical Measurement
4.1. Intra and Inter-observer Sources of Variation
4.2. Methods of Measurement as a Source of Error
5. Principles of Decision Analysis
6. The Decision Matrix
6.1. Single Test Evidence
6.2. Multiple Test Evidence
7. The Bayesian Viewpoint
8. Combining Test Information
9. Clinical Information Theory
10. Selecting Decision Criteria
10.1. The Effect of Attitudes
10.2. The Optimal Criterion
10.3. Single or Double Criterion
11. Measuring Belief and Value
11.1. Subjective Probabilities
11.2. Assessing Subjective Probability
11.3. Utilities
11.4. Measuring Value as Utility
12. Structuring the Problem
12.1. Decision Trees
12.2. The General Decision Model
12.3. A Clinical Example
13. Concluding Remarks
Acknowledgements
References
Appendix
1. UNCERTAINTIES, INTUITION AND RATIONALITY
In a collection of essays published over 50 years ago on the principles of clinical education, Sir William Osler gave a clear description of the attitude which clinicians should adopt in making medical diagnoses. One should "start out with the conviction that absolute truth is hard to reach in matters relating to our fellow creatures, healthy or diseased, that slips in observation are inevitable even with the best trained faculties, that errors in judgement must occur in the practice of an art which consists largely of balancing probabilities" (Osler, 1930). While this principle still lies at the foundation of all clinical reasoning, there is one essential difference which has occurred in clinical practice during the past half-century. It is a consequence of the enormous scientific productivity which has occurred since the turn of the century. In Osler's day, a principal limitation in health care was lack of information. Today, the principal problem has become one of management of information and, in spite of our best efforts, more and more existing knowledge goes unapplied. To counteract that dilemma for the clinician, strategies have now evolved for the rational handling of information obtained during a clinical examination and methods have been developed for quantifying all the sources of error in judgement. Decision analysis is the approach to rationality in information management, and probabilities are the means of quantifying uncertainty. Medical practice, therefore, need no longer be viewed as an art of balancing probabilities, but rather as a science of uncertainty based on the rational use of probabilities. This paper seeks to describe, by means of examples from ophthalmology, the principles of clinical decision analysis and to show how its application may be used to benefit patient management. The approach is normative rather than descriptive. 
In other words, the principles of decision analysis describe how the rational man ought to behave and not how he actually behaves. In medical practice, the disparity between these two approaches is frequently quite large because
many clinicians take a highly personalized approach to diagnosis. Despite Osler's criticism of those who practice medicine by intuition rather than by using a probabilistic approach, it was another 30 years before the application of symbolic logic, probability and value theory was shown to be of practical relevance in diagnosis (Ledley and Lusted, 1959). While no one can doubt that complex reasoning processes are involved in making a medical diagnosis, these three basic concepts are essential to any clinical diagnostic and patient management procedure, even when the action is at an intuitive level. Decision analysis is a prescriptive model which uses particular structures for analyzing complex problems and evaluating the worth or value of alternative actions. By incorporating the principles of information theory and utility theory, it aims to strike a balance between what is desirable and what is possible. It is not to be confused, therefore, with psychological decision theories (Kozielecki, 1981) which comprise a number of general propositions describing the actual behaviour of individuals or groups of people making decisions. In clinical decision analysis there are procedures, through the application of probability theory, for quantifying the predictive or diagnostic value of signs, symptoms and test evidence, as well as for handling uncertainties. But in medical practice it is important to distinguish between the statistical randomness of data and our perceived uncertainty associated with an event. Both are sources of error in judgement. The former is a property of the measurement procedure, whereas the latter is a property of ourselves. Uncertainty may be defined as the degree of belief we hold about something. Where there is disparity between our beliefs and reality, or our beliefs and actions we behave irrationally. Firmly held inappropriate beliefs can lead to errors of judgement or incorrect diagnoses. 
It is instructive, therefore, to consider how inappropriate beliefs can arise and whether they are governed by the particular ways in which we view the world. The benefits of using a normative or prescriptive approach to clinical decision making can be appreciated more easily against this background.
2. THE NATURE OF JUDGEMENT
Most scientists and clinicians act under the misapprehension that they are unbiased observers (Abercrombie, 1960). Unfortunately, this is rarely so, because the way we deal with problems and make choices depends on the way we perceive those problems. The judging of clinical evidence involves interpretation, and that judgement is influenced both by the context and manner in which the problem is presented (Bieri et al., 1975) as well as by our attitudes towards the perceived risks and probable benefits associated with alternative choices (Slovic et al., 1981). While the assimilation and careful consideration of favourable and unfavourable attributes or consequences is fundamental to forming attitudes, these may be governed largely by our beliefs which, in turn, may be based on highly selected information (Ajzen and Fishbein, 1980). Clearly, the holding of specific beliefs does not mean that the information we select is accurate, valid or even discriminable; yet much of the behaviour of clinicians would seem to suggest otherwise. For example, in a large study of the relative efficiency of different screening measures for detecting glaucomatous field loss, Daubs and Crick (1980) showed that simply estimating the cup-to-disc ratio of the optic nerve head allowed one to discriminate better between those patients with and those without the pathology than a single measure of intra-ocular pressure. Nevertheless, clinical practice still shows a preference for tonometry in screening for glaucoma because of the long established belief that elevated intra-ocular pressure is the cause of optic nerve fibre damage. It is more likely that the reason for this disease is the susceptibility of nerve fibres to damage (Spaeth, 1977). The Daubs and Crick data simply show that the estimation of cup-to-disc is a better single predictor of the consequences of this susceptibility than intra-ocular pressure alone. 
While other procedures have been shown to be better screening criteria for glaucomatous nerve fibre damage (Quigley, 1985), this example serves to illustrate how firmly held beliefs and attitudes can introduce bias and can therefore be a potential source of errors in judgement.
It seems that, in making judgements from the evidence and material presented by a patient, we only take in such information as is consistent with our attempt to find patterns in the immediate context of that evidence from our store of knowledge. This selection of information is influenced by our concepts of the meaning of 'normal' or with our choice of criteria for classification. Our behaviour is affected by a kind of aesthetic appreciation of what is most appropriate to the particular circumstances. In a directed search after meaning, the data are not assessed in isolation but in relation to other things which may or may not be relevant. There is a tendency to disregard conflicting evidence by giving less weight to information that does not yield to a consistent profile or hypothesis. Indeed, the characteristic of selective perception in human judgement causes people to seek information which is consistent with their own views or hypotheses. For example, in the interview situation, such as the interrogation of a patient for history and symptoms, doctors will frequently seek information consistent with first impressions rather than information that could refute those impressions (Hogarth, 1980). It is also important to realize that first impressions are usually little more than intuitive guesses. This is clearly illustrated in a study of 304 patients with acute abdominal pain reported by de Dombal and coworkers (1972) which showed that the admitting diagnosis was the same as the final operative diagnosis in less than 50% of the cases. It is necessary, therefore, to keep alternative hypotheses in mind in order to obtain the maximum possible valid information about a patient. We need to consider alternative inferences explicitly and apply deliberate weights to the evidence. This should increase the likelihood of making the right choice and taking the best course of action. 
The quest for meaning in information was argued by Bartlett (1932) to be an essential feature of all considered judgements, and has been shown to be present in much of clinical diagnosis. For instance, in order to reduce a sense of uncertainty in their judgements most clinicians typically behave according to the maxim "When in doubt make a diagnosis" (Meador, 1965) and this has the inevitable consequence of a tendency to 'over-diagnose' (Meador, 1969). This quest for meaningful patterns in information has the effect of causing frequent confusion between descriptive and inferential statements when dealing with problems, whether clinical or non-clinical. The latter involve extrapolation, and can be neither confirmed nor rejected without additional information not present in the existing data or evidence. Unfortunately, in decision making, it is not irrefutable 'facts' that we find interesting and that we act on, but the inferences we make from them, often without recognizing the limits of their validity and without questioning how safe (or meaningful) they are. Understanding the differences between descriptive and inferential statements will help to distinguish between true and false judgements and can thus be seen as a step towards minimizing error in clinical decisions. Clearly, the inferences and decisions we make are based not only on the available knowledge about the problem in question, but also on the patterns we perceive in the data and the relative emphasis we place on alternative hypotheses. Although the nature of human judgement is complex, a number of inferential rules which people use have been identified. All are potential sources of bias either in the acquisition or processing of information and are clearly discussed in Hogarth (1980). The most important sources of bias in clinical decision making are: the principle of 'conservatism' in human judgement (Edwards, 1968), the 'availability heuristic' (Tversky and Kahneman, 1973) and the 'law of small numbers' (Tversky and Kahneman, 1971). 'Conservatism' means the failure to revise opinion on receipt of new information to the same extent as the 'true' posterior conditional probabilities¹. This results in the decision maker feeling more uncertain than he need be. The consequence may be an inappropriate decision which, in medical practice, frequently involves the request of additional information about a patient's clinical state. 
Conservatism in human judgement therefore encourages the acquisition of redundant information from additional tests in an attempt to minimize perceived rather than actual uncertainties. It is this aspect of human judgemental behaviour which the American Medicare programme hopes to modify by basing clinical funding on the principle of 'diagnosis related groups' (Anon, 1983). By this principle, clinicians are strongly encouraged to limit themselves to tests, investigations and medicines that are really needed and cost effective. It remains to be seen whether such incentives are sufficient to modify human behaviour away from conservatism in information processing towards more veridical judgements (Hodes, 1985). 'Availability heuristic' was the term introduced by Tversky and Kahneman (1973) to describe the way in which specific instances, either recalled from memory or imagined, have an irrational effect on judgements of frequency. For a variety of reasons, some pieces of information are more salient in memory and are weighted more heavily in judgement than others. Typically, the frequency of well-publicized events is over-estimated (e.g. blindness resulting from trauma in sport), while the frequency of less well-publicized events is under-estimated (e.g. blindness from diabetes). These distortions of our subjective base-rate likelihoods² or prior probabilities are also evidenced in our misperception of the frequencies associated with rare events. In discussing the 'law of small numbers', Tversky and Kahneman (1971) describe a number of examples which illustrate the tendency in human judgement to over-estimate rare events and under-estimate common occurrences. These sources of bias in human information processing emphasize the problems which are likely to arise from faulty judgements. In relatively simple clinical tasks, such as the analysis of radiographs, Abercrombie (1960) has shown that the accuracy of decisions can be improved if a person is made aware of the potential sources of error. Unfortunately, it appears that in more complex clinical decisions, doctors are not only reluctant to revise their subjective probabilities of symptom-disease relationships, but they also have little idea of what the 'true' probabilities are (Leaper et al., 1972). Furthermore, there is evidence that clinicians frequently lack the ability to manipulate all the available information at once and are still uncertain at the end of their deliberations. Indeed, it is the desire to overcome the diagnostic errors arising from inferential bias that has been the driving force behind the development of computer-assisted decision making (e.g. Clancey and Shortliffe, 1984; Reggia and Tuhrim, 1985). The application of decision analysis is an important part of that development because it provides the framework for structuring a problem from which rational inferences may be made.
For notes 1 and 2 see Appendix.
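The idea of conservatism in revising opinion can be made concrete with a small numerical sketch. The Python fragment below computes the posterior probability of disease after a positive test by Bayes' theorem and compares it with a hypothetical 'conservative' judge who moves only part of the way from the prior towards that posterior. The prior, sensitivity, false-positive rate and the `shrink` factor are illustrative assumptions and are not taken from any of the studies cited here.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Posterior probability of disease after a positive test (Bayes' theorem)."""
    true_pos = prior * sensitivity
    false_pos = (1.0 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def conservative_posterior(prior, sensitivity, false_positive_rate, shrink=0.5):
    """Hypothetical conservative judge: revises opinion only a fraction
    `shrink` of the way from the prior towards the Bayesian posterior."""
    bayes_value = posterior(prior, sensitivity, false_positive_rate)
    return prior + shrink * (bayes_value - prior)


# Assumed figures for illustration: 10% prior, 80% sensitivity,
# 10% false-positive rate.
prior = 0.10
bayes = posterior(prior, 0.80, 0.10)
judged = conservative_posterior(prior, 0.80, 0.10)
print(f"prior={prior:.2f}  Bayesian posterior={bayes:.3f}  conservative={judged:.3f}")
```

Under these assumptions the conservative judge ends with a probability between the prior and the Bayesian posterior, which is one way of picturing why such a judge feels more uncertain than the evidence warrants and may request further tests.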
3. MEASUREMENT AND PROBABILITY
To concede that we often have a sense of uncertainty about our judgements is to imply that we have degrees of belief and that they are quantifiable. Although probabilities are the principal means of quantifying uncertainty, it is important to realize that often clinical judgements may be based upon information derived by quite different measurement principles. Furthermore, since the diagnostic process can be said to involve two objectives, namely to classify the classifiable and to measure the measurable (Murphy, 1976), understanding the principles of measurement is essential to rational clinical decision making. According to Stevens (1946), measurement is the assignment of numbers to objects or events according to rules. For him, any rules will suffice for this basic definition. Ellis (1968), however, has pointed out that for the measurement to be informative the rule adopted must be both determinative (i.e. requiring that the same numerals should always be assigned to the same things under the same conditions, provided sufficient care is exercised) and non-degenerate (i.e. allowing for the possibility of assigning different numerals to different conditions). The qualifications made by Ellis are important because what distinguishes scales of measurement are the mathematical properties of these rules. The practical value of Stevens' classification of scales of measurement is that, from a knowledge of the kind of scale upon which a set of measurements is based, it is possible to say what sort of statistics are relevant to these measurements. Unfortunately, in medical science the widespread misuse of statistics suggests a poor appreciation of the
principles of measurement (Schor and Karten, 1966; Forrest and Anderson, 1986). Since inappropriate statistical inference can lead to distortions in judgement, a decision maker should be able to improve the quality of his decisions by a knowledge of measurement scale properties. In Stevens' system of classification, four basic levels of measurement are identified. These have been defined as nominal, ordinal, interval and ratio scales. Each type of scale uses different features of the number system: identity, order, difference and ratio respectively. These four features are fundamental to the mathematical manipulations which it is possible to perform upon a scale and may be defined as follows:
Identity: numbers may serve as names or labels to identify or classify items.
Order: numbers may serve to reflect the rank order of items.
Intervals: numbers may serve to reflect differences or distances between items.
Ratios: numbers may serve to reflect ratios among items.
An interesting property of these four features of the number system is that they are hierarchical. For example, an ordinal scale also possesses the properties of identity, but not of intervals or ratio; numbers which express intervals must, of necessity, possess properties of order and identity but do not permit ratio operations. Where numbers are used to express ratios they also possess the other three features of number relations. In this classification system, ratio scales have the greatest number of constraining rules and are capable of conveying the maximum empirical information. On the other hand, the principle of identity is the simplest form of number usage in which there is no property of magnitude but merely classification. The type of scale corresponding to each of the four uses of number is listed in Table 1 with examples.
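The practical consequence of this hierarchy can be illustrated with a short Python sketch. It takes two sets of ordinal severity grades, replaces the grade numbers with a different set that preserves their rank order, and then compares a rank-based summary (the median) with a summary that presupposes interval properties (the mean). The grades and the relabelling are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical ordinal severity grades (1 = mild ... 4 = severe) for two groups.
group_a = [1, 1, 4]
group_b = [2, 2, 3]

# An order-preserving relabelling of the grades; the rank order is unchanged,
# only the spacing between the numbers differs.
relabel = {1: 1, 2: 2, 3: 3, 4: 10}
a_new = [relabel[g] for g in group_a]
b_new = [relabel[g] for g in group_b]

# Rank-based summary: the median comparison is the same before and after.
median_same = (median(group_a) < median(group_b)) == (median(a_new) < median(b_new))

# Mean comparison: presupposes interval properties, and here it reverses.
mean_before = mean(group_a) < mean(group_b)  # group A lower than group B
mean_after = mean(a_new) < mean(b_new)       # group A now higher than group B

print("median comparison preserved:", median_same)
print("mean comparison before/after:", mean_before, mean_after)
```

In this invented example the conclusion drawn from the means depends on an arbitrary choice of numbers, whereas the conclusion drawn from the medians does not.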
TABLE 1. The four main types of measurement scale according to the classification of Stevens (1946). The permissible transformations, namely the way in which the scale numbers can be altered and still represent all the empirical information, distinguish the scale types by the principle of invariance. Non-parametric statistics should be used for nominal and ordinal scales, whereas parametric statistics are applicable to interval and ratio scales.

| Scale | Principles of number system | Permissible transformations | Clinical examples |
| Nominal | Identification and classification | Substitution of any number for any other number | Classification of acquired colour vision deficiencies according to Type I, II and III* |
| Ordinal | Rank order | Any change which preserves order | Grading of retinopathy severity†; grading of lens scatter and brunescence in cataract‡ |
| Interval | Express distances or differences | Multiplication by or addition of a constant | Measures of Snellen visual acuity§ |
| Ratio | Express ratios, fractions or multiples | Multiplication by a constant only (assumes an absolute zero) | Estimation of optic cup/disc ratio or neuro-retinal rim area‖ |

*The classification systems for both acquired (Type I, II, III) and congenital (protan, deutan, tritan) colour vision defects are nominal (Pokorny et al., 1979). Many of the features of anatomical systems of classification are also nominal (e.g. Hoskins et al., 1984).
†The classification systems both for diabetic retinopathy and for retinopathy of prematurity are based upon the evolution of the lesion but also allow for variations in the course of the retinopathy (Scott, 1951; Garner, 1984).
‡Several examples are given in Bron and Brown (1982) and in Sparrow et al. (1986).
§In commenting on the limitations of the Snellen scale of visual acuity, Donders (1864) wrote: "That the relative values are not comparable, Snellen has already observed. If an image has double the magnitude, it has not at the same time double distinctness" (p. 194).
‖Estimations of the cup-to-disc ratio and the area of the neuro-retinal rim are both indirect ratio measures of the amount of neural tissue present at the optic nerve head (Balazsi et al., 1984).

What defines each scale is the permissible number of mathematical transformations which may be performed without losing any of the empirical information represented by the numbers. For example, in an ordinal scale it is permissible to change the numbers which represent items or events, providing the rank order remains invariant and undisturbed, without loss of empirical information. But simply changing the numbers to alter the intervals between them does not convert an ordinal scale into one with interval or ratio properties. It is impossible to add information to any scale of measurement merely by manipulating the numbers. The methodological principles employed in the act of measurement determine the type of scale which is used. It is this fundamental property of measurement which, judging by the misuse of statistics in medical research, seems to be so little understood in medical science. For example, a survey of the 1982 editions of 12 medical journals showed that in at least 70% of 175 papers employing ordinal measurement scales, statistical methods were used which assumed a more refined level of measurement (Forrest and Anderson, 1986). It seems, therefore, that not only do clinicians have a limited understanding about the types of measurement they employ but also about the nature of errors present in measurement. Both these characteristics will affect the confidence associated with inferential judgements involved in clinical decision making. It has already been mentioned that medical practice may be regarded as a science of uncertainty in which probabilities are the measure of uncertainty. It has also been pointed out that probability judgements depend on our background knowledge, our intellectual capacity and our experience. But where do probabilities fit within Stevens' system for the classification of measurement scales? Such probability scales are scales on which the order of our numerical assignments is highly correlated with our ordering of crude subjective probabilities, and which we are prepared to regard as a standard against which to assess the accuracy of our subjective ordinal probability judgements. Of the possible
probability scales, those in practical use are based on certain mathematical properties* and may be called normal probability scales. According to Stevens' system, a normal probability scale is a ratio scale, but it has special properties, namely two fixed points, p = 0 and p = 1. These points correspond with the contradiction (p = 0) and confirmation (p = 1) of any analytic proposition. Probability relationships are relationships between propositions. Every probability statement is therefore a statement about propositions. As already stated, in medicine very few diagnostic propositions are causal or deductive, because rarely is the aetiology and/or the progression of a disease fully known and understood. Clinical medicine is not simply a matter of applied physiology and biochemistry. Most diagnostic propositions are either probabilistic in nature or involve a form of pattern recognition from a set of particular signs and symptoms. In both these latter approaches to diagnosis, the alternate propositions of health or disease (or of two diseases) are unlikely to be completely true or false. Consider, for example the statement that "eyes with an intra-ocular pressure exceeding 30 mm Hg have glaucoma". To say that this statement is true is to imply that all eyes with an intra-ocular pressure above 30 mm Hg have glaucoma. To say that it is false is to imply that no eye with an intra-ocular pressure above 30 mm Hg has glaucoma. The statement is, in fact, neither wholly true nor wholly false and we can say only that it is probable that a patient with this intra-ocular pressure has glaucoma. Propositions such as the above, which cannot be confirmed or contradicted with certainty, are known as indefinite propositions and are expressed in probabilistic terms. Decision analysis is based upon the logic of probability measurement and embodies procedures which provide for estimating the degree of confirmation or falsification of indefinite propositions. But it is important to remember that the discriminability of probability judgements and the appropriateness of diagnostic decisions depend largely upon the precision and accuracy of the information upon which they are based. That precision and accuracy is, in turn, largely influenced both by the type of measurement scale adopted and by the method of measurement employed.

*There are four properties of normal probability scales from which all the rules and theorems of probability theory may be derived. These are: (1) For any event A, $0 \le p(A) \le 1$. (2) The sum of the probabilities of all possible events equals unity, i.e. $\sum_i p(A_i) = 1$. (3) If A and B are mutually exclusive events, then $p(A \text{ or } B) = p(A) + p(B)$. (4) The conditional probability of event B given event C is $p(B \mid C) = \dfrac{p(B \text{ and } C)}{p(C)}$.
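As a minimal sketch of how an indefinite proposition is given a probability, the Python fragment below applies the definition of conditional probability to the statement about intra-ocular pressure and glaucoma. Here B stands for "the eye has glaucoma" and C for "the intra-ocular pressure exceeds 30 mm Hg". The counts are entirely hypothetical and serve only to show the arithmetic.

```python
# Hypothetical screening counts (not taken from any cited study).
n_eyes = 1000                 # eyes examined
n_high_iop = 40               # eyes with IOP > 30 mm Hg (event C)
n_high_iop_and_glaucoma = 30  # eyes with IOP > 30 mm Hg and glaucoma (B and C)

p_c = n_high_iop / n_eyes                     # p(C)
p_b_and_c = n_high_iop_and_glaucoma / n_eyes  # p(B and C)

# Definition of conditional probability: p(B|C) = p(B and C) / p(C)
p_b_given_c = p_b_and_c / p_c

# Given C, "glaucoma" and "no glaucoma" are mutually exclusive and exhaustive,
# so their conditional probabilities sum to one.
p_not_b_given_c = 1.0 - p_b_given_c

print(f"p(glaucoma | IOP > 30 mm Hg) = {p_b_given_c:.2f}")
print(f"p(no glaucoma | IOP > 30 mm Hg) = {p_not_b_given_c:.2f}")
```

With these invented counts the proposition is neither confirmed (p = 1) nor contradicted (p = 0); it simply carries a probability between the two fixed points of the scale.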
4. SOME LIMITS TO CLINICAL MEASUREMENT
4.1. Intra and Inter-observer Sources of Variation
In addition to the phenomenon of perceptual bias discussed earlier, one of the major factors contributing to uncertainty in judgement is the error inherent in all measurement. The vagaries of measurement are of particular importance in clinical practice because clinicians often attach undue certainty to test results that are based upon single measures or observations. Despite the obvious shortcomings of single measurements, clinicians frequently believe implicitly in their first readings or observations and it is these results which largely determine their choice of treatment (Koran, 1975). Such a practice would be unacceptable in scientific enquiry where repeated measures are the hallmark of statistical inference. In clinical practice, however, various pressures demand that a degree of accuracy in measurement be sacrificed for economy of time. Since both time and accuracy have cost implications in medicine, acting in opposition, they are amenable to rational cost-benefit analysis. But a simple approach to this problem is only possible if the source and nature of errors in clinical data are known and their extent appreciated. Not all observations need repeated measures. The likelihood of inaccuracy will depend on the discriminability required in the measurements and the method by which the observations are made. In the physical sciences the desired discriminability can be achieved by selecting an
instrument with the appropriate precision. In clinical science, and particularly in ophthalmology, the measuring instrument is a human observer; it may be the clinician classifying or quantifying his own observations, or the patient reporting on some aspect of visual function. Where information is obtained in this way, the major factors limiting its precision are intra-observer and inter-observer variations. Intra-observer variation is the error in measurement arising from differences between repeated observations of the same feature by the same observer. Although clinicians often believe they are consistent in their observations, there are many instances in which intra-observer variation is surprisingly high. Consider, for example, the relatively simple measure of estimating optic cup-to-disc ratios. A masked study by Sommer et al. (1979) has shown that the mean difference of repeat estimations by a single trained observer may be as much as 0.3 (i.e. about one third the total range of the scale) with a corresponding standard deviation also of about 0.3. If repeatability of relatively simple measures of this kind is so poor, then the errors associated with more complex in vivo estimates of nerve fibre layer loss and neuro-retinal rim area are unlikely to be better. The relatively high precision quoted for estimates of retinal nerve fibre layer abnormalities (Sommer et al., 1984) is based entirely on observations of photographs and is undoubtedly a conservative estimate of the intra-observer variation which occurs in observations on the living eye. What these latter studies do show, however, is that variability of this sort can be reduced by changing to more objective criteria such as, for example, a measuring ophthalmoscope graticule (Hitchings et al., 1983). Inter-observer variation in measurement is usually greater than intra-observer variation. 
It is compounded by biases in judgement peculiar to each observer and there are many examples of its considerable consequences for diagnosis. For example, Kahn et al. (1975) showed that when five clinicians were asked to examine a group of patients for the presence of macular pigment disturbance, the percentage of patients diagnosed as positive (i.e. having the condition) by each observer varied between 5% and 41%. This high
inter-observer variation of categorical classification for a relatively simple ophthalmoscopic feature occurred despite the fact that the observers worked from a common reference manual in which the diagnostic criteria were claimed to be precisely defined. In order to minimize errors of this kind, solutions to systematic bias in observation have to be found. Clearly defined criteria and frequent reference to standards for measurement can help to minimize inter-observer variation. For example, photographs and line drawings can be used to try to standardize some measurement procedures such as grading of diabetic retinopathy (Oakley et al., 1967) or cataract classification (Sparrow et al., 1986). But these procedures often have little practical relevance in a busy clinic. Where biases in observation do exist, they are likely to extend well beyond the measurement of a single trait and become manifest as systematic differences between clinicians in their threshold criteria for diagnoses or second opinion referrals (Cummins et al., 1981).
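The trade-off between time and accuracy raised at the start of this section can be made concrete with a minimal sketch. It assumes, purely for illustration, that repeated readings by one observer carry independent random errors with a common standard deviation, so that the standard error of the mean of n readings is the standard deviation divided by the square root of n. The value 0.3 is a hypothetical figure on the cup-to-disc scale, and the function names are not taken from any cited study.

```python
import math


def standard_error_of_mean(sd, n_readings):
    """Standard error of the mean of n independent readings (assumed model)."""
    return sd / math.sqrt(n_readings)


def readings_needed(sd, target_se):
    """Smallest number of readings whose mean reaches the target standard error."""
    return math.ceil((sd / target_se) ** 2)


if __name__ == "__main__":
    HYPOTHETICAL_SD = 0.3  # illustrative single-reading SD, not a measured value
    for n in (1, 2, 4, 9):
        se = standard_error_of_mean(HYPOTHETICAL_SD, n)
        print(f"{n} reading(s): standard error of the mean = {se:.3f}")
```

Under this assumed model, halving the error of a single reading costs four readings, which is the kind of figure a cost-benefit comparison of time against accuracy would need. Systematic observer bias is not reduced by repetition and is outside the scope of the sketch.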
4.2. Methods of Measurement as a Source of Error
These examples illustrate how intra and interobserver variability affect precision and accuracy in measurement and diagnosis. But similar sources of error are found when the patient becomes the measuring instrument in psychophysical tests of visual function. It is under these circumstances, where clinical tests are directed towards determining visual thresholds, that the choice of psychometric method of measurement has a considerable influence on the values obtained. Indeed, in a comparison of performance on a number of clinical colour vision tests, Aspinall (1974) has shown by factor analysis that results are largely test-specific. There are, in general, three classes of psychometric method. These are known as the method of adjustment (or method of average error), the method of serial exploration (or method of limits) and the method of constant stimuli. All three methods have been used as the basis for different clinical tests of visual function. However, their correct application requires that repeated
observations should be employed, a feature rarely adopted in clinical practice because of the pressure of time. Typically, methods of adjustment and serial exploration are quick to use and methods of constant stimuli take the longest because more observations are required to determine an end point. It is not surprising, therefore, that the former two procedures are most commonly used for clinical tests of visual function. What is not always realized, however, is that the precision associated with the three psychometric methods differs considerably. In a detailed study of this problem, Blackwell (1952) showed that error variance in the measurement of visual thresholds is greatest with methods of adjustment and least with a constant stimulus approach. This is also demonstrated by the smaller normalized ranges which exist for static perimetric thresholds compared with kinetic visual fields. The former uses a constant stimulus method, the latter serial exploration (Parrish et al., 1984). Although current clinical practice favours tests which are quicker to use, it has to be appreciated that this choice is made at the expense of accuracy. Deficits of visual function shown by one method of testing may not be found using another more accurate procedure. Recent papers concerned with the measurement of spatial contrast sensitivity demonstrate this (Ginsberg and Cannon, 1983; Higgins et al., 1984). Another feature of the variability inherent in clinical measures which has received little attention concerns normalized sample ranges (i.e. variance). These ranges, determined from data on healthy volunteers, may not be the appropriate estimate of precision against which to assess measures of visual dysfunction. For example, Ross et al. 
(1984) have shown that the test/re-test error variance for visual fields of normally sighted subjects, plotted kinetically using a Goldmann perimeter at an interval of several weeks, is very much smaller than the comparable error variance of a group of patients with stable pathology (retinitis pigmentosa). On this evidence, inferences about the clinical significance of a pathological field change will be over-estimated if based on the error variance associated with healthy subjects. Hence, where measures of error variance exist, as confidence limits, care must be exercised to ensure
that they are valid for the comparisons being made. While most clinicians will accept that they can often influence the visual thresholds reported by their patients, it is not widely realized that inter-examiner effects on many tests of visual function are large. This places considerable limits on the practical value of assessing the significance of change in visual function over time, when different examiners have been involved in testing a particular patient. It is a common experience, for example, for some clinicians to be able to coax their patients to read an extra line on the Snellen chart. But less overt examiner-effects may also be present and these are likely to be greatest when the psychometric procedures of adjustment or serial exploration are used. Although Ross et al. (1984) did not consider alternative measurement techniques in their study of the reliability of visual fields, the effect of compounding the increased error variance associated with a patient group and with a poor psychometric method (i.e. the method of limits for kinetic perimetry) was demonstrated. They showed that the mean percent inter-examiner variability of perimetric area (for a single eye) was about 5% in the normal control population and 13%, ranging from 0% to 48%, in patients with retinitis pigmentosa. This high inter-examiner effect is not unique to visual field assessments. In a recent study on the reliability of the Arden Grating Test of spatial contrast sensitivity (Arden and Jacobson, 1978), which also uses the method of limits, Reeves and Hill (1986) have shown that the 95% confidence limits associated with inter-examiner variability cover approximately one-quarter of the total dynamic range of the test. 
Incidentally, although the underlying principle of this test is the method of limits, only one estimate is used to determine the contrast threshold for each spatial frequency, demonstrating the undue weight attached to single observations by clinicians. These few examples are not atypical of the high error variance associated with clinical measurement. The effects are far reaching but poorly understood. All too frequently clinicians tend to over-estimate the reliability of their observations and measurements with respect both to their precision and diagnostic accuracy (Wulff,
1981). If diagnostic accuracy is to improve, there is a clear need for more attention to be given to the sources of error limiting the precision of clinical measurements. By requiring the decision maker to identify explicitly all the components relevant to a decision, clinical decision analysis can take a rational account of most errors in measurement. But, while it provides a framework for handling inherently unreliable data, it cannot take account of invalid or heavily biased information.
5. PRINCIPLES OF DECISION ANALYSIS
The application of decision analysis to clinical problems developed from use of the principles of information theory in the analysis of perceptual signal detection tasks (Green and Swets, 1966). This clinical application was first explored in detail by Ledley and Lusted (1959). But the task of making appropriate clinical decisions is not simply one of distinguishing pathology from normal, or signal from noise. The complete decision process involves forming an adequate description of the problem, assessing the uncertainties and values associated with the alternative courses of action and evaluating which is the most appropriate action to take. Decision analysis is a set of principles and procedures by which it is possible to maximize all the information about a particular problem in the process of determining the most prudent outcome or course of action. It is, therefore, a normative model of decision making and describes how a rational person ought to behave, rather than how they actually behave. In the general decision making model five stages are involved:

(i) Structuring the decision problem: here the decision maker must specify the realistic alternatives open to him, together with the areas of risk and uncertainty.
(ii) Assessing probabilities for uncertain events: uncertainties can either be quantified by objective probability data or, where this is not available, by subjective probability estimates.
(iii) Assessing utility values for consequences or outcomes: these represent the worth or payoff value we attach to the consequence of our actions.
(iv) Deciding the most prudent action: each possible outcome has an associated probability and utility which may be combined to produce a measure of expected benefit; the largest expected benefit indicates the most prudent action.
(v) Sensitivity analysis: the sensitivity of the solution is assessed by varying subjective estimates of probabilities and utilities. This allows one to observe the effect of such variations on the expected benefits associated with different outcomes, thereby highlighting the crucial factors in the decision problem.

The decision model is just as applicable to the problems of medical screening as it is to that of differential diagnosis or that of assessing the best course of action when alternative surgical or medical treatments are available. Likewise, it may be used to assist in deciding whether it is worth introducing a new screening test (Aspinall and Hill, 1984c) or expanding an existing ophthalmic service for a community (Aspinall and Hill, 1984b). In short, it is appropriate for all situations in which decisions have to be made under uncertainty. As such, clinical decision analysis should be viewed as the primary framework for the practice of medicine and it is unfortunate that it does not feature more prominently in medical education. In this paper, the several stages in the decision-making model are illustrated by reference to examples from ophthalmology.
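Stages (iv) and (v) of the model can be sketched in a few lines of code. The example below is an illustration only: the two actions, the utility values on a 0 to 1 scale and the function names are all hypothetical and are not drawn from any study cited here. It combines probabilities and utilities into an expected benefit for each action, then repeats the calculation over a range of disease probabilities as a simple sensitivity analysis.

```python
def expected_benefit(outcomes):
    """Sum of probability x utility over the possible outcomes of one action."""
    return sum(probability * utility for probability, utility in outcomes)


def best_action(p_disease, utilities):
    """Return the action with the largest expected benefit, plus all benefits.

    `utilities` maps each action to (utility if diseased, utility if healthy).
    """
    benefits = {}
    for action, (u_diseased, u_healthy) in utilities.items():
        benefits[action] = expected_benefit(
            [(p_disease, u_diseased), (1.0 - p_disease, u_healthy)]
        )
    return max(benefits, key=benefits.get), benefits


# Hypothetical utilities, chosen only to show a change of decision.
UTILITIES = {"treat": (0.80, 0.90), "observe": (0.30, 1.00)}

if __name__ == "__main__":
    # Stage (v): vary the subjective probability and watch the decision.
    for p in (0.05, 0.10, 0.20, 0.40):
        action, benefits = best_action(p, UTILITIES)
        summary = ", ".join(f"{a} = {b:.3f}" for a, b in benefits.items())
        print(f"p(disease) = {p:.2f}: {summary} -> {action}")
```

With these assumed utilities the preferred action switches from observation to treatment once the probability of disease exceeds one in six, so the probability estimate is the crucial factor in this toy problem.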
6. THE DECISION MATRIX
6.1. Single Test Evidence
A disease entity may be regarded as the vehicle of clinical knowledge and experience. Consequently, it is rarely possible to establish the truth of a diagnosis by independent means. The presence of disease is therefore almost always defined with respect to 'some other test' and how we define the 'truth' of this other test has been the subject of much debate (Galen and Gambino, 1975). Because health and disease know no sharp
data in Table 4 in Bayes' theorem, the probability of glaucoma given ocular hypertension at an IOP > 26 mm Hg is no better than 0.02. The complement of this, p = 0.98 (i.e. 1 - 0.02), is the probability that a patient does not have glaucoma given that they have ocular hypertension. Similarly, it can be shown that the probability that a patient does not have glaucoma given that they are normotensive (i.e. IOP < 26 mm Hg) is 0.99 and its complement, p = 0.01, is the probability of glaucomatous field loss in a normotensive. These results are not unique to the King's College Hospital data. Similar findings have previously been reported by Armaly (1969). While the results may surprise clinicians, the extremely low diagnostic value of tonometry as a screening method for glaucoma in an unselected population is a consequence of both the low incidence of glaucoma and the relatively low sensitivity of tonometry alone for its detection (i.e. sensitivity = 0.26). On this evidence, tonometry is a very inefficient means of mass screening for glaucoma, yet despite this knowledge it is still widely used as such. However, when tonometry is used in situations where the prior probability of glaucoma is much higher, it can be shown to have greater predictive value. Table 5 shows the increase in probability of glaucoma given ocular hypertension (IOP > 26 mm Hg), i.e. p(G+|T+), for a general eye clinic and for a glaucoma clinic in the outpatient department of an eye hospital. It will be seen from these Bayesian probabilities that simply changing the prior probabilities or initial expectations of a disease in a population group significantly affects its detection. Figure 2 illustrates this principle for a hypothetical test having equally high sensitivity and specificity at 0.95. What is evident from this figure is that the likelihood of a disease being detected given
positive test evidence, i.e. p(D+|T+), is always low when the prior probability is low. And since the prevalence of most diseases is less than 10 in 100 (except in speciality clinics) then it is safer to base decisions on the probability of a disease being absent given negative test evidence, i.e. p(D-|T-). In other words, an essential feature of drawing inferences from clinical tests (particularly screening tests) is that we are not testing the null-hypothesis as is assumed in scientific enquiry. Rather we are testing an alternative hypothesis which invariably has a low prior probability. In these circumstances, considerable evidence is needed to overturn the alternative hypothesis in which the existing expectation (prior probability) is heavily weighted against a decision for disease rather than to substantiate it. It is for this reason that the principles of medical screening should be to distinguish those who are normal from those who may have a disease (Hart, 1975). In this respect, the function of screening is not to identify disease but to distinguish those individuals who would benefit from further clinical investigation from those who can be said confidently to need no further examination (Aspinall and Hill, 1984c). Consider, for example, the case of screening for macular degeneration in an unselected population over the age of 65 years. In this age group the mean prevalence of macular degeneration has been estimated at around 2% (Medical Research Council, 1983). If a clinical test or procedure were to be used with both a sensitivity and specificity of 0.99 (which is highly unlikely), then we find from Bayes' theorem that the probability of a patient having macular degeneration given a positive test result is only 0.17. This means that four out of five patients with positive test results would not have the disease. Such a result is not only surprising but
TABLE 4. Frequencies of initial presenting cases of ocular hypertension and glaucomatous field loss*

                                 Normal tension        Ocular hypertension
                                 (IOP < 26 mm Hg)      (IOP > 26 mm Hg)
                                 (T-)                  (T+)
Normal visual fields (D-)        243                   30
Glaucomatous field loss (D+)     419                   147

*From Daubs and Crick, 1980.
TABLE 5. Probabilities of the presence (G+) or absence (G-) of glaucoma given ocular hypertension above (T+) or below (T-) 26 mm Hg*

Population group                                 p(G+|T+)   p(G-|T+)   p(G-|T-)   p(G+|T-)
Mass screening (prior probability = 0.01)        0.02       0.98       0.99       0.01
General eye clinic (prior probability = 0.10)    0.21       0.79       0.92       0.08
Glaucoma clinic (prior probability = 0.25)       0.44       0.56       0.78       0.22

*Derived from Daubs and Crick, 1980.
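The derivation of Table 5 from the frequencies of Table 4 can be followed with a short sketch. It is an illustrative reconstruction, not the authors' own calculation: the function and variable names are invented for the example, the cell labels a to d follow Table 2, and the three prior probabilities are those quoted for each population group.

```python
# Cell counts from Table 4 (Daubs and Crick, 1980), labelled as in Table 2.
A_TRUE_NEG, B_FALSE_POS = 243, 30    # normal visual fields (D-)
C_FALSE_NEG, D_TRUE_POS = 419, 147   # glaucomatous field loss (D+)

SENSITIVITY = D_TRUE_POS / (C_FALSE_NEG + D_TRUE_POS)   # p(T+|D+), about 0.26
SPECIFICITY = A_TRUE_NEG / (A_TRUE_NEG + B_FALSE_POS)   # p(T-|D-), about 0.89


def p_disease_given_positive(prior, sensitivity, specificity):
    """Bayes' theorem for p(D+|T+)."""
    numerator = prior * sensitivity
    return numerator / (numerator + (1.0 - prior) * (1.0 - specificity))


def p_no_disease_given_negative(prior, sensitivity, specificity):
    """Bayes' theorem for p(D-|T-)."""
    numerator = (1.0 - prior) * specificity
    return numerator / (numerator + prior * (1.0 - sensitivity))


def table5_row(prior):
    """Return (p(G+|T+), p(G-|T+), p(G-|T-), p(G+|T-)) rounded to 2 places."""
    pos = p_disease_given_positive(prior, SENSITIVITY, SPECIFICITY)
    neg = p_no_disease_given_negative(prior, SENSITIVITY, SPECIFICITY)
    return tuple(round(value, 2) for value in (pos, 1.0 - pos, neg, 1.0 - neg))


if __name__ == "__main__":
    for group, prior in (("Mass screening", 0.01),
                         ("General eye clinic", 0.10),
                         ("Glaucoma clinic", 0.25)):
        print(group, prior, table5_row(prior))
```

Only the prior changes between the three rows; sensitivity and specificity are fixed by Table 4, which is why the same test is nearly uninformative in mass screening yet useful in a glaucoma clinic.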
(1983). For example, using the Daubs and Crick (1980) data of Table 4, when the prior probability of glaucoma is 0.1 (as in the general eye clinic), the standard error for the Bayesian estimate, SE of p(D+|T+), is 0.046. Since the Daubs and Crick data show that p(D+|T+) = 0.21, the 95% confidence interval for p(D+|T+) therefore ranges from 0.12 to 0.30, i.e. 0.21 ± 0.09. These SEs are only of relevance within the limits 0 < p < 1.
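The step from standard error to confidence interval in this example is simple arithmetic and can be checked with a minimal sketch. It assumes the usual normal approximation, with a multiplier of 1.96 for a 95% interval, and clips the result to the range 0 to 1 in keeping with the limits stated above. The function name is illustrative, and the derivation of the SE itself (Aspinall and Hill, 1983) is not reproduced.

```python
def confidence_interval(estimate, standard_error, z=1.96):
    """Normal-approximation interval for a probability, clipped to [0, 1]."""
    half_width = z * standard_error
    lower = max(0.0, estimate - half_width)
    upper = min(1.0, estimate + half_width)
    return lower, upper


if __name__ == "__main__":
    # Figures quoted in the text: p(D+|T+) = 0.21 with SE = 0.046.
    low, high = confidence_interval(0.21, 0.046)
    print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```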
[Figure 2: p(D+|T+) and p(D-|T-) plotted against prior probability p(D+).]

FIG. 2. The effect of prior probability of a disease on the probability of its detection when the test sensitivity and specificity are equal at 99%. When the prior probability is 0.5, decisions for "disease present given positive test evidence", p(D+|T+), match those for decisions of "disease absent given negative test evidence" with a Bayesian probability of 0.99.
also raises important questions about the relevance and costs of the information derived from a test. More will be said about this later. The Bayesian estimate of conditional probability has such widespread application to the diagnostic problem that it is important to assess the range within which the probability estimate can vary. The assessment of this variability is conventionally expressed in terms of the standard error (SE) from which can be derived a confidence interval for the Bayesian estimate. As with all statistical measures, the SE will depend upon sample size and the inherent variability in the data. The SE of p(D+|T+) can be derived from Taylor's theorem as shown by Aspinall and Hill
6.2. Multiple Test Evidence
The above examples are based on the assumption that a diagnosis is made on the grounds of only one item of evidence. While this may be true for many mass screening situations because of economy of time, in clinical practice multiple signs are used. Consider, for example, the value of additional information about subtle changes in visual function for improving the prediction of glaucomatous field loss developing from ocular hypertension. There is recent evidence to suggest that both acquired colour vision loss (Lakowski and Drance, 1979) and spatial contrast sensitivity loss (Ross et al., 1985) are amongst the earliest functional losses in glaucoma. From the data of a five year prospective study on colour vision changes in ocular hypertension by Drance and Lakowski (1983), it is possible to show how this additional information improves diagnostic accuracy for the likelihood of an ocular hypertensive developing glaucomatous field loss. One of the colour vision tests used by Drance and Lakowski was the Farnsworth-Munsell 100 hue. The frequency distributions of hue discrimination performance (100 hue error score) at the time of presentation in a group of ocular hypertensives
TABLE 2. The 2 × 2 contingency matrix for a simple single test screening problem (a to d are the numbers of patients in each cell)

                         Test result
                         Negative (T-)    Positive (T+)
Disease absent (D-)      a                b
Disease present (D+)     c                d
boundaries, there are considerable difficulties attached to the concept of what is normal. Thus it is not possible at present to use a causal system of disease classification because the complex causality of most diseases is not well known. Instead a predictive system is used where diagnosis is viewed as a prerequisite for choosing the treatment which ensures best prognosis. Therefore, the user of any clinical test should be aware of the limitations in diagnosis due to difficulties associated with finding an acceptable valid definition of normality or disease. Since the true diagnosis may never be revealed, it will be evident that in many instances the truth of a diagnosis remains a matter of opinion and belief. It becomes necessary, therefore, to express the value of a diagnostic test in probabilistic terms. The probabilities associated with particular test results depend upon three factors: (i) the relevant signs and/or symptoms; (ii) the error rates associated with data collection; and (iii) the probability of any patient selected at random from the population under study having the disease*. These statistical principles for assessing diagnostic tests can be represented in their simplest form by a 2 × 2 decision matrix in which the test results can be logically related to the clinical or pathological outcome. If we consider whether a disease is present (D+) or absent (D-) when a screening test has given positive (T+) or negative (T-) results, the decision matrix will be as represented in Table 2. The idealized screening test is one in which there

*In some cases this probability will be the prevalence of the disease (e.g. mass screening), in other cases the disease incidence (e.g. neonatal screening). Sometimes, where a previous test has been carried out, the prior probability of disease in the study population will be neither incidence nor prevalence but some function of the information already gathered.
[Figure 1: overlapping frequency distributions of test scores for the normal (D-) and diseased (D+) populations; vertical axis, frequency; horizontal axis, test scores.]

FIG. 1. Frequency distribution of test scores in the practical screening situation showing misclassifications about a test criterion score Tc. The areas under each of the curves represent the number of individuals found in each category giving respectively a true-negative (TN), false-negative (FN), false-positive (FP) and true-positive (TP) result.
is perfect agreement between the test result and the presence or absence of the disease. In this state of certainty, where there are no misclassifications, the decision problem before us is relatively trivial. But such certainty rarely occurs in practice because there are inevitable misclassifications. Assuming test scores are measured on a continuous scale, the frequency distributions from populations of normals and diseased will overlap as illustrated in Fig. 1. The area of overlap under the two curves in Fig. 1 on either side of a test criterion score, Tc, represents the misclassifications. They are known as false-positives and false-negatives, and are denoted by cells b and c respectively in Table 2. The four possible outcomes of a screening test may therefore be summarized as in Table 3. By reference to Tables 2 and 3 we can define two measures of test performance.

\[
\text{Test sensitivity} = \frac{\text{true positives}}{\text{total diseased tested}} = \frac{d}{c + d}
\]

\[
\text{Test specificity} = \frac{\text{true negatives}}{\text{total non-diseased tested}} = \frac{a}{a + b}
\]

Both these measures are conditional probabilities and may be expressed in the following manner:

Test sensitivity = p(T+|D+). This is the conditional probability of a positive test result (T+) given that the patient has the disease (D+).

Test specificity = p(T-|D-). This is the conditional probability of a negative test result (T-) given that the patient does not have the disease.
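The way in which the criterion score Tc governs the two kinds of misclassification can be illustrated numerically. The sketch below makes assumptions that are not in the text: test scores are taken to be normally distributed in both populations, with hypothetical means of 0 (normal) and 2 (diseased) and a common standard deviation of 1, and scores above the criterion are counted as positive. The function names are illustrative.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def performance_at(criterion, normal=(0.0, 1.0), diseased=(2.0, 1.0)):
    """Return (sensitivity, specificity) when scores above `criterion` are positive.

    `normal` and `diseased` are hypothetical (mean, sd) pairs.
    """
    specificity = normal_cdf(criterion, *normal)            # p(T-|D-)
    sensitivity = 1.0 - normal_cdf(criterion, *diseased)    # p(T+|D+)
    return sensitivity, specificity


if __name__ == "__main__":
    for tc in (0.0, 0.5, 1.0, 1.5, 2.0):
        sens, spec = performance_at(tc)
        print(f"Tc = {tc:.1f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Raising the criterion removes false-positives at the price of more false-negatives, so sensitivity falls as specificity rises. With overlapping distributions no single criterion can make both equal to 1.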
TABLE 3. The four possible outcomes of a clinical test expressed as a decision matrix

                         Test result
                         Negative (T-)                  Positive (T+)                  Totals
Disease absent (D-)      True negative                  False positive                 Total patients without disease
Disease present (D+)     False negative                 True positive                  Total patients with disease
Totals                   Total negative test results    Total positive test results

Clearly a good clinical test is one where both sensitivity and specificity are high. A knowledge of both these measures of test performance is essential to deriving the diagnostic value of a test, yet surprisingly few clinical or laboratory tests have been assessed in this way. By combining measures of sensitivity and specificity with a prior knowledge of the frequency of the disease in the population being examined we may derive two probabilities of particular interest to the clinician facing a diagnostic decision. Again, both are conditional probabilities. Firstly, there is the probability that the disease is present given a positive test result [p(D+|T+)], and secondly the probability that the disease is absent given a negative test result [p(D-|T-)]. Using Bayes' theorem (Phillips, 1973) the first of these may be defined as follows:

\[
\left(\begin{array}{c}\text{Probability of disease present}\\ \text{given observed signs/symptoms}\end{array}\right)
= \frac{(\text{Prior probability of disease}) \times (\text{Probability of observed signs/symptoms in that disease})}{(\text{Probability of signs/symptoms in the population examined})}
\]

This expression is a logical consequence of the multiplication and addition rules of normal probability theory and may be rewritten as:

\[
p(D^{+} \mid T^{+}) = \frac{p(D^{+})\, p(T^{+} \mid D^{+})}{p(D^{+})\, p(T^{+} \mid D^{+}) + p(D^{-})\, p(T^{+} \mid D^{-})}
\]

where p(D+|T+) = the conditional probability that a patient has the disease (D+) given a positive test result (T+); p(T+|D+) = the conditional probability of a positive test result (T+) given that the patient has the disease (D+) (i.e. test sensitivity); p(D+) = the prior probability of
disease D; p(T+|D-) = the conditional probability of a positive test result given that the patient does not have the disease D (i.e. 1 - specificity), and p(D-) = the prior probability of no disease (absence of D) [i.e. 1 - p(D+)]. Similarly:

\[
p(D^{-} \mid T^{-}) = \frac{p(D^{-})\, p(T^{-} \mid D^{-})}{p(D^{-})\, p(T^{-} \mid D^{-}) + p(D^{+})\, p(T^{-} \mid D^{+})}
\]

The structure of the formula shows that the prior probability or likelihood of a disease, e.g. p(D+), is modified by evidence or test data [e.g. p(T+|D+)] to yield posterior or new probabilities of the disease occurrence [p(D+|T+)]. The following examples illustrate this noncontroversial use of Bayes' theorem. All examples emphasize the importance of combining test data with a knowledge of the incidence of the possible alternative states of disease and clinical normality. In a study of open-angle glaucoma Daubs and Crick (1980) investigated the relationship between the presence or absence of glaucoma (defined by the presence or absence of a glaucomatous field loss) and ocular hypertension (defined as a raised intra-ocular pressure (IOP), greater than 26 mm Hg, in the absence of other signs). Their data from King's College Hospital are given in Table 4. In order to determine the diagnostic value of tonometry as a screening measure, it is necessary initially to estimate the prior incidence of glaucoma in an unselected population. According to the Medical Research Council (1983) this has been assessed as 0.01 in the United Kingdom (giving a prior probability of not having glaucoma of 0.99). When this prior incidence is combined with the
who developed glaucomatous field loss and ocular hypertensives who still had full visual fields after five years are shown in Fig. 3. Applying Bayes' theorem to these data, and assuming that the Farnsworth-Munsell 100 hue test is used in a specialist glaucoma clinic at an eye hospital outpatients' department where the prior incidence of glaucoma is taken to be 0.25, we obtain the conditional probabilities shown in Table 6. From these figures it can be seen that, as one might expect, the probability associated with a positive diagnosis of glaucoma (G+) given colour discrimination loss (C+), i.e. p(G+|C+), increases for greater Farnsworth-Munsell 100 hue error scores. However, the test sensitivity decreases with increasing error score. This is because a large number of false-negatives is obtained when a high cut-off criterion score is used (Fig. 3). In calculating the figures for Table 6, a prior probability of p = 0.25 was used. But when the evidence of acquired colour vision loss is used to revise opinion about the likelihood of glaucoma developing from a state of ocular hypertension,
[Figure 3: frequency distributions for the ocular hypertension and glaucoma groups; horizontal axis, FM 100 Hue total error score.]

FIG. 3. Frequency distribution of Farnsworth-Munsell 100 hue error scores in 80 ocular hypertensive eyes of which 24 developed glaucomatous field loss during the following five years. From Drance and Lakowski, 1983.
different priors should be used. The Bayesian approach to diagnosis is to use posterior probabilities as new priors awaiting modification from subsequent test evidence (Barnoon and Wolfe, 1972). The posterior probabilities of glaucoma given raised intra-ocular pressure, p(G+|T+), in Table 5 should therefore be used as the new prior probability estimates from which to generate a second series of posterior probabilities based upon the colour vision evidence. The revised posterior probabilities for the combined test evidence of raised intra-ocular pressures and colour discrimination loss, i.e. p(G+|T+,C+), are shown in Table 7 for three different population groups. It will be seen that the evidence from colour vision tests in this pathology assists in a positive diagnosis and that benefit is greatest where the prior probability is high. On the other hand, in some applications, the combined information has resulted in a lower probability for the prediction that an eye does not progress to glaucomatous field loss given negative test evidence, i.e. p(G-|T-,C-). Such a finding is the consequence of poor test sensitivity (and high false-negative error rate). This is not a common feature of all tests but illustrates that in some instances a single cut-off test criterion may be inadequate to provide satisfactory decisions both for disease and no-disease. Methods for handling this are discussed later. Another important implication of results such as those shown in Table 7 is that a test may have useful predictive value for diagnosis in one population group but not in another. The determining factors are the sensitivity and specificity of the test and the probability of disease in the sample population to which a patient belongs. This prior probability is a highly locally dependent value and may be either prevalence or incidence depending upon the clinical application. 
Furthermore, not only will these prior probability values differ for different population groups but they may also change over time either rapidly (e.g. epidemic) or slowly (e.g. as a consequence of medical practice providing for earlier detection of some diseases and more effective prophylaxis or management of others). Unfortunately, most clinicians have very little knowledge of the sensitivity, specificity and predictive value of the
TABLE 6. Conditional probabilities associated with predicting glaucomatous field loss (G+) developing over five years in ocular hypertension for differing levels of hue discrimination loss (C+) at presentation (assessed using the Farnsworth-Munsell 100 hue test)*

F-M 100 hue
error score    p(G+|C+)   Sensitivity   p(G-|C-)   Specificity
     80          0.29        0.67         0.80        0.45
    120          0.39        0.41         0.80        0.79
    200          0.66        0.42         0.83        0.93
    280          0.80        0.21         0.79        0.98
    360          0.99        0.08         0.77        0.99

*Derived from Drance and Lakowski (1983). A prior probability of p = 0.25 has been used as the estimate of the incidence of glaucoma amongst patients attending a hospital eye department 'glaucoma clinic.'
TABLE 7. Bayesian probabilities for predicting glaucomatous field loss (G+) developing over five years on the basis of the combined evidence of ocular hypertension (T+), i.e. IOP > 26 mm Hg, and acquired colour discrimination loss (C+), assessed using the Farnsworth-Munsell 100 hue test

                                                                      IOP > 26 mm Hg and
Population group*                      IOP > 26 mm Hg                 100 hue error score > 160
                                       p(G+|T+)    p(G-|T-)           p(G+|T+,C+)   p(G-|T-,C-)
Mass screening                           0.02        0.99                0.06          0.99
  (prior probability = 0.01)
General eye clinic                       0.21        0.92                0.46          0.86
  (prior probability = 0.10)
Glaucoma clinic                          0.44        0.78                0.72          0.67
  (prior probability = 0.25)

*The prior probabilities represent the initial expectations of the presence of glaucoma in each population group. The figures in columns 1 and 2 of the table are derived from the data of Daubs and Crick (1980) and these posterior probabilities, i.e. p(G+|T+) and p(G-|T-), have been used as new priors combined with the data of Drance and Lakowski (1983) to determine the figures in columns 3 and 4 respectively.
tests they use. Moreover, they frequently have little understanding of how the probability of the disease for which they are testing influences the predictive value of the test result. Since these misconceptions influence diagnostic thinking and therapeutic decisions, it would be useful for clinicians occasionally to evaluate their own decisions against the optimal course of action, calculated using a Bayesian approach.
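The dependence of predictive value on prior probability, and the reuse of a posterior as a new prior, can be sketched numerically. The following Python fragment is a minimal illustration only: the function name `predictive_values` is hypothetical, and the sequential step borrows the sensitivity and specificity of the 200 error-score row of Table 6 purely for demonstration (Table 7 itself uses a cut-off of 160, for which these values are not tabulated here).

```python
def predictive_values(prior, sensitivity, specificity):
    """Return (p(disease | positive test), p(no disease | negative test))
    from Bayes' theorem for a single dichotomous test."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    true_neg = (1 - prior) * specificity
    false_neg = prior * (1 - sensitivity)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)

# Table 6, error score 80: prior 0.25, sensitivity 0.67, specificity 0.45.
ppv_80, npv_80 = predictive_values(0.25, 0.67, 0.45)

# The same test applied to the three prior probabilities used in Table 7.
by_population = {
    label: predictive_values(prior, 0.67, 0.45)
    for label, prior in [("mass screening", 0.01),
                         ("general eye clinic", 0.10),
                         ("glaucoma clinic", 0.25)]
}

# Sequential use: the posterior p(G+|T+) = 0.44 (glaucoma clinic, Table 7)
# becomes the prior for the colour vision evidence. Sensitivity 0.42 and
# specificity 0.93 are the 200 error-score row of Table 6 (illustration only).
ppv_sequential, _ = predictive_values(0.44, 0.42, 0.93)

print(round(ppv_80, 2), round(npv_80, 2))
for label, (ppv, npv) in by_population.items():
    print(f"{label}: p(G+|C+) = {ppv:.2f}, p(G-|C-) = {npv:.2f}")
```

With the values of the first row of Table 6 the function returns the tabulated 0.29 and 0.80; running it across the three priors shows how the same sensitivity and specificity yield very different posterior probabilities.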
7. THE BAYESIAN VIEWPOINT
Statisticians of all persuasions would accept the above statistical use of Bayes' theorem. They would also agree that human judgement is a necessary part of statistical practice. However, many proponents of Bayes' theorem also believe that probability can be interpreted subjectively as an individual's rational degree of belief, and that the degrees of belief of an ideally rational person conform to the mathematical principles of probability. Thus, if my degree of belief that a patient has a disease is p, then my degree of belief that the patient does not have the disease should be (1 - p). For a Bayesian proponent, prior opinion expressed in the form of subjective probabilities should be incorporated into formal procedures so that judgement can be publicly displayed. In the absence of further evidence, initial subjective probabilities might well be substituted for the incidence or prevalence probabilities illustrated earlier. Thus, beliefs (prior probabilities) are modified by evidence to
yield new beliefs (posterior probabilities) which become, in turn, prior probabilities for subsequent new evidence. Hence, a scientist or clinician should quantify his opinions as probabilities either by calculation or, more practically in the case of a clinician, by subjective estimation before collecting data. After collecting the data, he should subsequently use Bayes' theorem to revise those opinions formally. For a Bayesian, therefore, learning progresses by these revised probability values. From the Bayesian viewpoint, the history of science is seen as the generation of hypotheses and the accumulation of data until one particular theory is believed because it is the most plausible of all those suggested. A high posterior probability associated with a hypothesis does not guarantee its truth but only indicates that it is more likely than others considered which have lower posterior probabilities. By this cyclical or iterative process, data is collected which bears on the relative truth of alternative hypotheses (or diagnoses). The Bayesian argues that the traditional null hypothesis is rarely of interest and represents only one of several alternative hypotheses. Moreover, one relies on fairly arbitrary conventions when deciding whether to reject or accept particular hypotheses. A Bayesian equivalent of one such arbitrary convention could be, for example, to accept a hypothesis when the probability associated with it reaches 0.95 or 0.99, thus giving the probability of a false-positive of 0.05 or 0.01 respectively. The probability of a false-negative can be considered in the same way. (The two traditional statistical errors are to reject the null hypothesis when it is true at a probability of <0.05 or 0.01 and to accept the null hypothesis when it is false at a similar probability, often unspecified.)
If two hypotheses are thought to be equally likely then one should assign priors of 0.5 to each, rather than leave these priors as an implicit assumption in the testing. The logical approach to hypothesis testing is to seek evidence of falsification and the best information will be that which can be used in this manner (Popper, 1983). In most clinical situations the prior probabilities will be heavily weighted in favour of the hypothesis for no-disease. It is therefore easier to seek evidence to reach a
probability criterion of a certain confidence for acceptance of a hypothesis of no disease than of disease. In mass screening, for example, there will always be more clinically normal than diseased people and under these circumstances it will be easier to falsify than to substantiate the hypothesis of disease. The choice of appropriate tests to exclude the likelihood of disease should be guided by this principle. This approach to clinical hypothesis testing invariably presents a dilemma to the clinician when making the diagnosis of a rare disease. As in the above screening example, from a Bayesian viewpoint considerable evidence is required to overturn the heavily weighted alternative hypothesis, i.e. no disease, in favour of the decision for a rare disease. Statistically, it is frequently impossible to prove the existence of a rare disease. However, if there are high utilities associated with the consequences of not taking action in a suspected rare disease, then the appropriate weightings for this would be included in the full decision model to determine the outcome with maximum expected utility. Watchfulness would therefore be appropriate in such instances until there is sufficient evidence to increase the plausibility of a rare diagnosis. While there will always be some degree of uncertainty about a diagnostic hypothesis there is a constant need for utility consideration in clinical practice. "Our main concern is to increase utility and decrease risk to the individual patient, no matter how rare his disease may be" (Balla, 1985, p. 69). In summary, for a Bayesian a course of action is rational when consistent with a possessed body of information (beliefs and desires). Bayesian decision theory is about how one's actions, preferences, values and beliefs should be rationally related to each other. It is as applicable therefore to the deliberations of the inexperienced as it is to those of the expert.
How reasonable or objectively sound the judgements are depends upon how well the agent is informed. Only when an action is both rational and well informed can it be said to be prudent (Eells, 1982).

8. COMBINING TEST INFORMATION

Except in situations involving mass screening, medical practice rarely involves making decisions
TABLE 8. Double decision matrices can be used to avoid the problems of correlated measures from the performance on two tests when combined in a Bayesian expression of multiple evidence

Diseased population (D+)                 Normal population (D-)
                  Test 1                                   Test 1
               T1-      T1+                             T1-      T1+
Test 2  T2-     a        c               Test 2  T2-     e        g
        T2+     b        d                       T2+     f        h

Where the cell entries a to h are proportions and i = disease incidence (i.e. prior probability), then the Bayesian probability for the presence of the disease (D+) given positive test evidence on both tests (T1+ and T2+) is:

$$p(D^+ \mid T_1^+, T_2^+) = \frac{i \cdot d}{i \cdot d + (1 - i) \cdot h}$$
based on the results of a single test. What characterizes most diagnoses is that they are deduced from multiple evidence. That multiple evidence may either be used in a sequential application of Bayes' theorem for the revision of prior probabilities as proposed by Barnoon and Wolfe (1972), or it may be used in the general Bayesian expression which can be used to combine information about any number of diseases and any number of symptoms or items of data. The general form of Bayes' theorem for two or more tests (Phillips, 1973) is:

$$p(D_i \mid x_1, x_2, \ldots, x_n) = \frac{p(D_i)\, p(x_1 \mid D_i) \cdots p(x_n \mid D_i)}{\sum_k \left[ p(D_k)\, p(x_1 \mid D_k) \cdots p(x_n \mid D_k) \right]}$$

where D_1 ... D_k are the set of diseases and x_1 ... x_n the set of tests. Ideally, the set of diseases should be mutually exclusive and the set of tests (or items of data) should be mutually independent. However, on the latter point the search for independence may discard considerable information (Dombal and Gremy, 1976). Fryback (1978) reached a similar conclusion, namely that Bayesian performance will be maximized by using a few of the most diagnostic of the available variables even if they are highly redundant. Nevertheless, if independence between tests cannot be assumed, it is possible to incorporate conditional dependencies into the Bayesian model. For example, assuming dichotomous decisions on information from two tests, the data could be set out as in Table 8. The cell entries can then be used in the normal Bayesian formula as shown. By classifying data in this way, it is possible to determine the likelihoods associated with any combination of test results for the diseased or normal population groups. Other likelihoods
from combined test results have similar form and are conditional upon the appropriate combinations of the probabilities derived from test performance. This procedure clearly becomes tedious when many tests are employed. Also, if only a few of the most diagnostic variables are used, then incorporating conditional dependencies into the Bayesian model in this way may not be worth while. What is needed are methods for determining the most informative tests. There are many approaches to determining the relative weight of data from combined test information. Fryback (1978) used a procedure of ordering variables based on the magnitudes of their conditional intercorrelations. Hill et al. (1985) have shown how factor theory may be applied to selecting the most informative group of tests to be incorporated in a test battery. A recent popular approach in clinical studies has been to use weighted multiple linear regression analysis. These analyses give equations which represent the relative importance of different items of test evidence. The solution of such an equation may then be used in Bayes' theorem to determine its diagnostic value. In this way, the multiple test evidence is represented as a single 'composite' test performance. Bayes' theorem is only used once and the difficulties associated with nonindependence of test data are largely overcome. Furthermore, if one or more variables (i.e. tests) are removed from the test battery, the resultant loss in diagnostic accuracy may also be estimated using Bayes' theorem. The clinical importance of individual tests relative to others can be determined in this way. A recent application of this technique combined with Bayesian statistics can be found in Aspinall et
TABLE 9. Bayesian probabilities for predicting the development of diabetic retinopathy over seven years from combined scores on a 12-variable multiple regression model for patients attending a diabetic clinic (prior probability, p(R+) = 0.30)

Combined
test score    p(R+|E+)   Sensitivity   p(R-|E-)   Specificity
   0.1          0.33        0.98         0.95        0.17
   0.3          0.54        0.62         0.83        0.78
   0.5          0.70        0.30         0.77        0.95
   0.9          0.99        0.10         0.73        0.99

The combined test is the derived measure (i.e. cut-off criterion) used for predicting the onset of retinopathy, p(R+|E+) is the probability of retinopathy given positive test evidence, p(R-|E-) is the probability of a normal fundus given negative test evidence. Derived from Aspinall et al. (1983).
al. (1983). In a prospective longitudinal study over
seven years, they obtained predictions about the onset of retinopathy in 295 diabetic patients. Performance on 12 variables was obtained at the start of the study. These included seven clinical variables and five relating to colour vision performance. The latter were included because acquired colour vision changes have been reported as one of the earliest functional losses in diabetes (Kinnear et al., 1972). The data were analyzed in the form of categories rather than on a continuous scale. The model selects those variables that best discriminate between two populations, in this case the retinopathy group (R+) and the normal fundus group (R-) as defined at the end of the study. For different confidence levels associated with the development of retinopathy, one finds that the number of patients who are misclassified also changes. It is therefore possible to calculate the prognostic value of the model from the single derived measure of the combined test information. Bayesian probabilities of developing retinopathy over seven years given that the combined test evidence (E+) predicts retinopathy, i.e. p(R+|E+), and of having no sign of retinopathy given that a normal fundus (E-) was predicted by the model, i.e. p(R-|E-), have been recalculated from the data of Aspinall et al. (1983) and are given in Table 9. One value of using a composite derived score from multiple test evidence is the possibility of determining the trade-off in sensitivity and specificity for different cut-off criteria of a 'combined test score'. A further value of the test battery approach is the ability to determine the
single variable which has the best diagnostic or prognostic power. In the Aspinall and coworkers study the single variable with the greatest discrimination power between those diabetics who developed retinopathy and those who retained a normal fundus over seven years was colour discrimination along the yellow-blue axis of the anomaloscope. (The next two most important factors were good blood glucose control and duration of diabetes.) If the colour vision evidence is excluded from the test data, the remaining test evidence gives new sensitivity and specificity values of 0.6 and 0.69, respectively, for the same optimal (i.e. minimum number of total misclassifications) cut-off criterion of 0.3. The corresponding revised Bayesian probabilities are p(R+|E+) = 0.45 and p(R-|E-) = 0.80. While the conditional probability for predicting no fundal change over seven years remains essentially unchanged, that for predicting the onset of retinopathy in the absence of information about yellow-blue colour discrimination has dropped from 0.54 to 0.45. A change of this amount in other clinical situations may have significance both to the cost and type of clinical management. Hence, this procedure of statistical pattern classification can be used not only to provide information for decisions on disease states about individual patients, but also to provide a means of assessing the relative value of single tests or test batteries. When used as a means of test evaluation, the expected information can be calculated for each test by reference to the difference between the two posterior probabilities. The test giving the largest gain in information may
then be selected as the best test. Similarly, the particular combination of a group of tests which provides for the greatest informational gain may be selected as the best test battery from that group. This concept of informational gain has been extended further and quantified by Renyi (1970) for the revision of conditional probabilities in the light of the presence or absence of supporting information.
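The two ways of combining evidence described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names `general_bayes` and `joint_cell_bayes` are hypothetical, the first assumes mutually independent tests, and the cell proportions passed to the second are invented for demonstration. The retinopathy figures (prior 0.30; sensitivity 0.62 and specificity 0.78 with colour vision, 0.60 and 0.69 without) are those quoted in the text for a cut-off criterion of 0.3.

```python
from math import prod

def general_bayes(priors, likelihoods):
    """General form of Bayes' theorem for several diseases and several tests.

    priors:      {disease: p(D)}
    likelihoods: {disease: [p(x_1|D), ..., p(x_n|D)]}
    Assumes the diseases are mutually exclusive and the tests independent.
    """
    joint = {d: priors[d] * prod(likelihoods[d]) for d in priors}
    total = sum(joint.values())
    return {d: joint[d] / total for d in joint}

def joint_cell_bayes(i, d, h):
    """Posterior p(D+ | T1+, T2+) from double decision matrices (Table 8):
    i = prior, d and h = proportions positive on both tests in the
    diseased and normal populations respectively."""
    return i * d / (i * d + (1 - i) * h)

# Composite battery score treated as a single positive test (cut-off 0.3).
# For R- the likelihood of a positive result is 1 - specificity.
with_colour = general_bayes({"R+": 0.30, "R-": 0.70},
                            {"R+": [0.62], "R-": [0.22]})
without_colour = general_bayes({"R+": 0.30, "R-": 0.70},
                               {"R+": [0.60], "R-": [0.31]})
loss = with_colour["R+"] - without_colour["R+"]

# Hypothetical cell proportions for two correlated tests (not from the source).
both_positive = joint_cell_bayes(i=0.30, d=0.50, h=0.05)

print(f"{with_colour['R+']:.3f} {without_colour['R+']:.3f} {loss:.3f}")
print(f"{both_positive:.3f}")
```

The difference between the two posterior probabilities is the kind of quantity by which the contribution of a single test to the battery can be judged; the joint-cell form avoids any independence assumption because the proportions d and h are read directly from the cross-classified data.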
9. CLINICAL INFORMATION THEORY

In the previous sections it has been shown how variations in test sensitivity and specificity affect the diagnostic accuracy when decisions are based on the outcome of a test. Both the sensitivity and specificity of a test are determined by the false-positive and false-negative misclassifications which occur. In turn, these misclassifications are governed by the choice of the cut-off criterion of a test, on the basis of which one distinguishes those patients who are deemed to have a disease from those who are deemed not to have it. Reference to Fig. 1 shows that by adopting different criteria, Tc1 or Tc2 (i.e. a lower or higher test score), we vary the trade-off between false-positives (FP) and false-negatives (FN). Consequently, selecting the appropriate cut-off criterion on a test is as essential for maximizing diagnostic accuracy as it is for optimizing test efficiency. In the past, the cut-off criterion for clinical decisions has been chosen on the basis of clinical impressions or expert clinical opinion. An alternative has been to choose arbitrary points on statistical distribution curves to justify selection. More recently, both these former methods have been replaced by a rational process which makes use of probability, utility and decision analysis. The Receiver Operator Characteristics (ROC) curve is a term developed in the electronic engineering field and widely used in psychology as a means of representing the detection of signal against a noisy background (Green and Swets, 1966). It is a useful diagrammatic way of representing the problem of defining an appropriate cut-off criterion for a clinical test involving dichotomous decisions (e.g. screening).
An ROC curve is simply a plot of test sensitivity p(T+|D+) against test specificity p(T-|D-). Specificity is either plotted on a reverse scale or as (1 - specificity) on the abscissa. A typical form of ROC curve is shown in Fig. 4(a). Here, the decision curves for two tests A and B are plotted. Points along any curve (e.g. Tc1, Tc2) represent different test score values which can be used as cut-off criteria for the test. A point in the top left hand corner of the diagram indicates good sensitivity and specificity; the diagonal broken line represents points of zero discrimination. Thus, a test which discriminates well between the two populations will give an ROC curve which lies close to the upper left hand corner of the figure. In this example, Test B is more effective at distinguishing between those people who have the disease (D+) and those who do not (D-). The ROC curve is therefore a means of providing two separate measures for a detection or discrimination problem. One is an index of the power of the test to discriminate between the two populations* and the other, the decision point selected as a cut-off criterion score, Tc1 or Tc2 (see Fig. 4(b) and (c)). The following examples illustrate the way in which tests differ in their powers of discrimination between normal and pathological. Consider first the data from Daubs and Crick (1980) where intraocular pressure was used as a predictor of glaucoma. The ROC curve for tonometry as a screening test for the detection of glaucomatous field loss is shown in Fig. 5. The numbers on the curve represent different intra-ocular pressure values which can be chosen as cut-off criteria for separating those considered to be at risk from glaucoma from those not at risk. The value chosen by Daubs and Crick was 26 mm Hg. This corresponds with the point on the ROC curve which is furthest from the diagonal broken line, where the total number of misclassifications (i.e.
false-positives and false-negatives) is at a minimum. It is not always the best cut-off
*The index of discrimination for the difference between two means ($d_A$) is given by $d_A = [M_{(D+)} - M_{(D-)}]/s$, where $M_{(D+)}$ is the mean of the pathological distribution, $M_{(D-)}$ the mean of the normal distribution and $s$ the mean standard deviation of the two distributions.
[Figure 4: panel (a) plots sensitivity, p(T+|D+), against 1 - specificity, p(T+|D-), with specificity, p(T-|D-), on a reversed scale; panels (b) and (c) plot frequency against test scores, marking the criteria Tc1 and Tc2 and the T- and T+ regions.]
FIG. 4. (a) Receiver operator characteristic (ROC) curves showing two levels of discrimination (A, B) and two values of the test criterion score (Tc1, Tc2). (b) and (c) show the relationships between levels of discrimination dA, dB and test criterion scores Tc1, Tc2 for the two ROC curves of (a).
[Figure 5: 'Intra-ocular pressure'; sensitivity plotted against 1 - specificity.]
FIG. 5. ROC curve showing intra-ocular pressure (mm Hg) as a detector of glaucomatous visual field loss in mass screening. The values on the ROC curve indicate the tonometric readings in mm Hg. From Daubs and Crick, 1980.
criterion to use because that will depend on the relative premiums attached to false-positive and false-negative errors. Nevertheless, it will be evident from Fig. 5 that whatever intra-ocular pressure is chosen as a cut-off value, the test discriminates glaucomatous field loss very poorly.
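The construction of an ROC curve, the point furthest from the diagonal and the index of discrimination can be sketched in a few lines of Python. The scores below are invented for illustration and do not come from any of the studies cited; the function names are likewise hypothetical. The distance of an ROC point from the diagonal is proportional to sensitivity + specificity - 1 (often called Youden's index), so maximizing that quantity locates the furthest point.

```python
from statistics import mean, stdev

def roc_points(diseased, normal, cutoffs):
    """For each cut-off, treat a score >= cut-off as a positive test and
    return (cut-off, sensitivity, specificity)."""
    points = []
    for c in cutoffs:
        sensitivity = sum(x >= c for x in diseased) / len(diseased)
        specificity = sum(x < c for x in normal) / len(normal)
        points.append((c, sensitivity, specificity))
    return points

def furthest_from_diagonal(points):
    """ROC point maximizing sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)

def discrimination_index(diseased, normal):
    """d_A: difference between the two means divided by the mean of the
    two standard deviations."""
    s = (stdev(diseased) + stdev(normal)) / 2
    return (mean(diseased) - mean(normal)) / s

# Hypothetical test scores (illustration only).
diseased_scores = [20, 24, 26, 28, 30]
normal_scores = [12, 14, 16, 18, 20, 22]

points = roc_points(diseased_scores, normal_scores, cutoffs=[15, 19, 23, 27])
best = furthest_from_diagonal(points)
d_a = discrimination_index(diseased_scores, normal_scores)

for cutoff, sens, spec in points:
    print(f"cut-off {cutoff}: sensitivity {sens:.2f}, 1 - specificity {1 - spec:.2f}")
print("furthest from diagonal:", best[0])
```

As the text notes, the point selected in this way weights the two kinds of error equally; it is not necessarily the best criterion once different premiums are attached to false-positives and false-negatives.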
Where a test is this poor, alternative screening procedures need to be considered. For example, Daubs and Crick (1980) showed a superior ROC curve for the Arden Grating Test (Arden and Jacobson, 1978) as a method for screening for glaucoma, although recent investigators have shown that results on this test are markedly influenced by the examiner (Reeves and Hill, 1986). If a subpopulation consisting only of ocular hypertensives is considered, there are many tests of visual function which are relatively successful at predicting the likelihood of visual field loss developing (e.g. Motolko et al., 1982). One of these is colour vision testing; an ROC curve for Farnsworth-Munsell 100 hue error score as a predictor of glaucomatous field loss given an ocular hypertensive population has been derived from the data of Drance and Lakowski (1983) and is shown in Fig. 6. The ROC curve for the test battery of Aspinall et al. (1983), used to predict the onset of retinopathy in diabetes, is shown in Fig. 7. The numbers on the curve represent the likelihood of a patient developing retinopathy on the basis of different criterion values, i.e. combined test scores. The effect of removing the most diagnostic test item (i.e. yellow-blue colour discrimination
on the anomaloscope) from the battery is also illustrated in this figure. With the removal of this one test item the discriminating power of the remaining battery is considerably impaired and the space between the two curves represents the information lost. The ROC curve, therefore, provides a simple and direct means of comparing the efficiency of alternative tests or test batteries for a particular population. An example of a test with good discriminating power is shown in Fig. 8. The upper ROC curve is based upon data from a screening survey of 1000 adults for defective colour vision using the Ishihara test (Hill and Aspinall, 1980). The numbers on the curve represent the number of plate errors on the test and indicate how a change in the pass/fail criterion would affect the sensitivity and specificity of the test. The lower ROC curve in Fig. 8 also represents the screening performance of the Ishihara test but for a population of children (age 5-11 years). It is derived from the data of Hill et al. (1982). From the two curves in Fig. 8 it can be seen that it is possible for the same test to have different powers of discrimination when used on different populations. This is an important feature of many visual function tests where, for example,
[Figure 6: 'FM100 hue in ocular hypertension'; sensitivity plotted against 1 - specificity.]
FIG. 6. ROC curve showing Farnsworth-Munsell 100 hue error score for predicting glaucomatous field loss occurring over five years in patients with ocular hypertension. Numbers on the ROC curve indicate Farnsworth-Munsell 100 hue total error scores. Derived from Drance and Lakowski, 1983 (see also Fig. 3).

[Figure 7: 'Diabetic retinopathy test battery'; sensitivity plotted against 1 - specificity.]
FIG. 7. ROC curve for the test battery of 12 clinical variables as a predictor of diabetic retinopathy occurring over seven years in known diabetics. Numbers on the ROC curve indicate the cut-off criterion from the combined test score (i.e. the probability that a patient develops retinopathy). The cross 'X' indicates the loss in sensitivity and specificity for a cut-off criterion of 0.3 in the absence of colour vision data. The broken line ROC curve shows the corresponding loss of discriminability of the test battery when the single most significant test (yellow-blue colour discrimination tested on an anomaloscope) is removed. Derived from Aspinall et al., 1983 (see also Table 9).

[Figure 8: 'Ishihara colour vision test'; curves for adults (20-30 years) and children (5-11 years); sensitivity plotted against 1 - specificity.]
FIG. 8. ROC curves showing the relative performance of the Ishihara colour vision test when used on adults (age 20-30 years) and on children (age 5-11 years). The numbers on the ROC curves represent different decision criteria in terms of the number of plate errors made on the test. Data for adults from Hill and Aspinall, 1980; data for children from Hill et al., 1982.
performance may be influenced by inter and/or intra-observer variations. There are, unfortunately, some extremely poor visual function tests which still attract clinical interest. The interferometric test designed for predicting post-operative Snellen visual acuity in a patient with cataract is one such test. Halliday and Ross (1983) kindly made their data available for the ROC curve of Fig. 9 to be determined. It is perhaps unjustified to call this data a 'curve' because it is indistinguishable from the indecision line. (The fact that some data points fall below the indecision line is a function of sampling variance.) Thus the ROC curve for this instrument (i.e. the Haag-Streit Visometer) shows that it is totally unable to discriminate between those cataract patients who will have good post-operative Snellen acuity and those whose acuity will be poor due to obscured macular dysfunction. A similarly poor curve was obtained for the Retinometer, a laser interferometer. The practical implication of this is that, if we assume complete uncertainty about the state of the retina before surgery (i.e. a prior probability for p(D+) = 0.5), as in the case of a clinician without an interferometer, the Bayesian probabilities for predicting post-operative Snellen acuity from this data are no better than chance, the conclusion also drawn by Halliday and Ross (1983). From these examples, it will be evident that ROC curves provide an easy way of evaluating test efficiency, i.e. the ability to discriminate between two populations. But there are several other derived measures of diagnostic quality which may be obtained from an ROC curve. These include the choice of an optimal cut-off criterion for diagnostic decisions and the average net benefit derived from performing the test (see Metz et al., 1976).

[Figure 9: '"Visometer" acuity prediction in cataract'; sensitivity plotted against 1 - specificity.]
FIG. 9. ROC curve for 'Visometer' interferometer readings in eyes with cataract as a predictor of post-operative Snellen visual acuity (best refraction). Values on the ROC curve indicate 'Visometer' scale units, i.e. decimal acuity. Although measurement variance produced data points below the diagonal, this line is the theoretical limit for a case of zero prediction. Derived from Halliday and Ross, 1983 and from additional data made available by the authors.

10. SELECTING DECISION CRITERIA
10.1. The Effect of Attitudes
The test performance score we choose for discriminating between two population groups (e.g. disease and normal) depends not only on the discriminability of the test but also on our attitudes to the resultant misclassifications. The relative weightings we give to false-positive and false-negative errors will reflect the severity or triviality of the consequences likely to arise from an inappropriate decision. If the suspected disease is sight or life threatening, then greater emphasis will be placed on the need to minimize false-negatives. The test criterion will therefore be adjusted to reflect this, so that one has a high level of confidence when making decisions of normality given that a patient's performance on the test is less than the chosen criterion score [assuming a low test score is more indicative of normality, p(D-|T-)]. This is what happens, for example, when using tonometry for screening for glaucoma. Decisions of normality, p(D-|T-), are made with a high degree of confidence when a patient has normal IOP. On the other hand, decisions of pathology, p(D+|T+), have a very low probability and hence low confidence. In medical practice, most clinicians would accept the view that false-negatives should be minimized. On the other hand, there are some fields in which emphasis is placed on minimizing false-positive errors. In the legal system, for example, evidence is sought to falsify the hypothesis that an accused is innocent. If, after all the evidence has been presented, there is a
'reasonable doubt' that the prosecution has failed to falsify the hypothesis of innocence, then a decision of 'not guilty' will be made. Clearly, the amount of evidence that is required to go beyond the criterion of 'reasonable doubt' depends on the severity of the case and the weightings given to an inappropriate decision. It is not surprising, therefore, that there are many different legal definitions of the term 'reasonable doubt' (Cohen and Christensen, 1970). For a detailed discussion of the application of decision analysis and Bayesian principles in the practice of law, the reader is referred to Nagel and Neef (1979). While in medical practice it is unacceptable to decide wrongly that a person with pathology is clinically normal (a false-negative), it is equally unacceptable in law to convict an innocent person (a false-positive). It is evident, therefore, that the weights we apply to false-positives and false-negatives are influenced both by personal and public attitudes, a fact which should be more widely recognized by both decision maker and public.
10.2. The Optimal Criterion
By reference to the ROC curve of any test it is possible to determine an optimal criterion or cut-off value. In selecting this criterion we must consider all the costs and benefits associated with the outcome of a decision. This includes the costs of a false-positive (FP) and false-negative (FN) decision and the benefits associated with decisions of true-positive (TP) and true-negative (TN). (Methods by which costs and benefits can be quantified as utilities are shown in Section 12.) These factors can be arranged to determine the slope β of the tangent to the ROC curve at the optimal criterion point. This may be expressed as:
$$\beta = \frac{p(D-)\,(\mathrm{Benefit}_{TN} + \mathrm{Cost}_{FP})}{p(D+)\,(\mathrm{Benefit}_{TP} + \mathrm{Cost}_{FN})}$$
where p(D+) and p(D-) are the probabilities of the disease being present and absent respectively in the population studied. Beta is also equal to the ratio of the ordinate of the (D+) distribution to the ordinate of the (D-) distribution for specific test scores [see Fig. 4(b)]. If β ≈ 1, which corresponds to the point on the ROC curve
furthest from the diagonal indecision line, then the optimal criterion value is the point where the frequency distribution curves for (D+) and (D-) cross. On the other hand, if the costs and benefits of FP and TN balance those of FN and TP then the optimal cut-off criterion becomes the point on the ROC curve where the slope of the curve is equal to the ratio of the probabilities of (D-) to (D+). This is the criterion selected by Daubs and Crick (1980) to define the optimal intra-ocular pressure for distinguishing between those not at risk and those at risk of developing glaucomatous field loss. In a glaucoma clinic, where p(D+) = 0.25 (see Table 5), the intra-ocular pressure cut-off criterion was determined to be 25 mm Hg. For mass population screening, where p(D+) = 0.01, the cut-off criterion was calculated to be 40 mm Hg. Whereas the former value is not too discordant with clinical practice, the latter mass screening criterion would be deemed unacceptable by most clinicians. Current practice, therefore, implies that much greater weight is given to the cost of a false-negative than a false-positive error when using this screening test, resulting in a lower value for β and consequently a lower IOP cut-off criterion (see Fig. 5).
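The dependence of β on prevalence and on the cost weightings can be illustrated with a minimal Python sketch. The function name and the unit cost and benefit values are assumptions made for illustration only; they are not taken from Daubs and Crick (1980). With balanced costs and benefits the slope reduces to p(D-)/p(D+), and giving extra weight to the cost of a false-negative lowers β.

```python
def optimal_roc_slope(p_disease, benefit_tn=1.0, cost_fp=1.0,
                      benefit_tp=1.0, cost_fn=1.0):
    """Slope (beta) of the tangent to the ROC curve at the optimal criterion.

    p_disease is p(D+); p(D-) is taken as 1 - p(D+).
    The default unit costs and benefits are illustrative assumptions.
    """
    if not 0.0 < p_disease < 1.0:
        raise ValueError("p_disease must lie strictly between 0 and 1")
    numerator = (1.0 - p_disease) * (benefit_tn + cost_fp)
    denominator = p_disease * (benefit_tp + cost_fn)
    return numerator / denominator


# Balanced costs and benefits: beta equals p(D-) / p(D+).
clinic_beta = optimal_roc_slope(0.25)       # glaucoma clinic prevalence
screening_beta = optimal_roc_slope(0.01)    # mass screening prevalence

# Hypothetical weighting: a false-negative counted 19 times as costly.
weighted_beta = optimal_roc_slope(0.01, cost_fn=19.0)

print(clinic_beta, screening_beta, weighted_beta)
```

Under these assumptions the screening slope is far steeper than the clinic slope, and the heavier false-negative weighting pulls it back down, which is the direction of adjustment the text attributes to current practice.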
10.3. Single or Double Criterion
Selecting a single criterion assumes that the decision maker is happy with the direct trade-off between test sensitivity and specificity. On the other hand, for the clinician, the conditional probabilities p(D-|T-) and p(D+|T+) may not both be acceptable. Typically, when using a single criterion on a clinical test, even if it is 'optimal', decisions for p(D-|T-) will be acceptable while those for p(D+|T+) will not. If it is desired that both these conditional probabilities should give satisfactory levels of confidence for decisions of normal and diseased, it may be necessary to use two decision criteria. The aim would be to select one criterion which minimized false-negative errors for decisions of p(D-|T-) and another which minimized false-positive errors for decisions of p(D+|T+). These would be represented as test criterion scores Xn and Xd respectively in Fig. 10.
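A short Python sketch may help to show why a single criterion can leave p(D+|T+) unsatisfactory while p(D-|T-) is high, and how a double criterion yields three decisions. The prevalence, sensitivity, specificity and criterion scores below are hypothetical values chosen for illustration, and the function names are not from the source. As in the text, a low test score is assumed to indicate normality.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (p(D+|T+), p(D-|T-)) by Bayes' theorem."""
    p_test_pos = (sensitivity * prevalence
                  + (1.0 - specificity) * (1.0 - prevalence))
    p_test_neg = ((1.0 - sensitivity) * prevalence
                  + specificity * (1.0 - prevalence))
    p_d_given_pos = sensitivity * prevalence / p_test_pos
    p_n_given_neg = specificity * (1.0 - prevalence) / p_test_neg
    return p_d_given_pos, p_n_given_neg


def classify(score, x_n, x_d):
    """Three-way decision using a double criterion (x_n < x_d)."""
    if x_n >= x_d:
        raise ValueError("x_n must be lower than x_d")
    if score <= x_n:
        return "normal"
    if score >= x_d:
        return "diseased"
    return "uncertain"   # further evidence required


# Hypothetical test: 95% sensitive, 90% specific, 8% prevalence.
ppv, npv = predictive_values(0.08, 0.95, 0.90)
print(round(ppv, 3), round(npv, 3))

# Hypothetical criterion scores x_n = 2 and x_d = 7.
print([classify(s, 2, 7) for s in (1, 4, 9)])
```

Even with a fairly discriminating test, the low prior probability keeps the confidence in a decision of 'diseased' well below that in a decision of 'normal', which is the motivation for adding a second, stricter criterion.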
[Graphic for Fig. 10 not reproduced. Vertical axis: frequency; horizontal axis: test scores, with the criteria Xn and Xd marked and the region between them labelled 'Uncertainty (continue)'.]
FIG. 10. A representation of overlapping frequency distributions according to test performance for two populations: normal (D-) and diseased (D+). Here the decision maker has selected two cut-off criteria. Setting one test criterion at Xn gives minimal false-negatives; this criterion therefore provides a satisfactory level of confidence for decisions of normality. On the other hand, criterion Xd gives minimal false-positives and provides satisfactory confidence for decisions of pathology. Intermediate test performance values between Xn and Xd indicate states of uncertainty in classification, requiring additional investigation (and therefore sequential decisions) for a confident diagnosis.
Test score values which occur between Xn and Xd would indicate uncertainty for decisions of either normal or diseased, implying that additional evidence from other sources should be sought (Lusted, 1968). A simple example of the application of a double criterion is illustrated by colour vision testing. In a survey of defective colour vision amongst 1000 healthy adults, Hill and Aspinall (1980) found that most tests give a high probability of being normal given that a test is passed, whilst the probability of being defective given that a test is failed is unsatisfactory. The ROC curve for one of these tests (Ishihara) is shown in Fig. 8 (Hill and Aspinall, 1982). Bayesian probabilities associated with the different pass/fail criteria are shown in Fig. 11. It will be seen that, when the test is used on males, where p(D+) = 0.08, the probability of being normal given the test is passed, p(N|pass), remains high for any 'pass' criterion, whereas decision confidence for the probability of being defective given the test is failed, p(D|fail), is markedly influenced by the choice of 'fail' criterion. If the user of this test wished to have 99% confidence in decisions of both 'normal' and 'defective', he should make a decision of 'normal' if two or fewer mistakes on the test were made and 'defective' if seven or more mistakes were made. If between three and six mistakes were made, one should conclude that 'further evidence is required'. Similar double criteria could be established for other confidence limits.
FIG. 11. Bayesian probabilities for the probability of a male being colour defective given a 'fail' on the Ishihara test, i.e. p(D|fail), and the probability of being normal given a 'pass' on the test, i.e. p(N|pass), for different pass and fail criteria. TN and TD represent respectively the pass and fail criteria for 99% confidence in each of these decisions. Derived from Hill and Aspinall, 1982 (see also Fig. 8). [Graphic not reproduced. Vertical axis: probability, 0 to 1.0; horizontal axis: criterion (number of permissible plate errors); the zone of uncertainty for p = 0.99 is marked.]
It is important to realize that establishing different cut-off criteria in this way is simply a means of changing the conditional probability associated with decisions of 'normal' and 'defective'. It is a means of adjusting test sensitivity and specificity to reflect the importance one places on these errors and does not necessarily imply different consequences of false-positive and false-negative misclassifications. ROC curves enable the test user to see at a glance the trade-off between test sensitivity and specificity, thereby making it possible to make a more informed choice when selecting a test cut-off criterion. In this way, the test user is required to make explicit the weightings he assigns to the false-positive and false-negative error terms.
11. MEASURING BELIEF AND VALUE
11.1. Subjective Probabilities
Definitive empirical evidence is not always readily available about the performance of a
clinical test or medical or surgical treatment. In these circumstances, clinicians must rely on personal experience or the experience of others for assessing probabilities associated with alternative outcomes. Although most will do this in a rather intuitive manner, it is beneficial to make such estimates explicit by means of subjective probabilities. Unlike the traditional notion where probability is thought of as the relative frequency of an event, subjective probabilities are thought of as representing an individual's degree of belief about an event. In order to carry out decision analysis, the uncertainty in any problem must be quantified explicitly in probabilistic terms. Two types of uncertainty frequently occur: firstly, where the uncertain events are of a discrete type (e.g. a treatment may be considered either to succeed or fail), and secondly, where the uncertainty is a continuous variable represented by a probability distribution. Several methods of quantification are available (Kozielecki, 1981) of which only one is presented here. While it will be appreciated from the earlier discussion in Section 2 that subjectively held beliefs are invariably distorted, the validity of a subjective probability is not in question here. We are simply concerned with quantifying belief, in so far as it affects the decision problem. The only prerequisite for any method of quantification is that a decision maker is rational in his use of probabilities. In other words, if he believes that an event occurs with probability p, then he should also believe that the same event fails to occur with probability (1 - p).
11.2. Assessing Subjective Probability
It is possible to use a lottery procedure to assess a clinician's belief about, for example, the outcome of treatment. The steps involved in such a procedure are:
(i) The decision maker is asked to choose a value for the unknown quantity (e.g. visual acuity) above and below which he considers the outcome to be equally likely. Let this value be x50.
(ii) The subject is then asked to consider only those values above x50 and to subdivide the range above x50 into two equally likely parts. Let the dividing value be x75.
(iii) The procedure in (ii) is repeated for those values below x50, yielding a value x25.
(iv) The four intervals created by the above steps can each be subdivided if desired.
(v) Finally, values are chosen above and below which the decision maker is almost certain the value will not fall, i.e. x99 and x1.
A graph of these values can then be drawn. This is known as a 'cumulative density function'. [It is sometimes easier to establish the limits of x99 and x1 before beginning step (i).] Where appropriate, a range of uncertainty can be incorporated in the estimates as an indicator of the variance of the measurement, with the mid-point of the range as the best index of uncertainty. The following examples illustrate different applications of this procedure.
(a) One clinician's uncertainty was assessed regarding the expected outcome in the following case. A 50-year-old asymptomatic male with visual acuities R and L 6/6 presented following routine examination with an anterior superior choroidal melanoma, measuring 5 mm deep and 10 mm diameter at the base on ultrasound assessment. Estimates by this clinician for the uncertainty in post-operative visual acuity following two alternative forms of treatment, cobalt plaque or local resection, are shown in Fig. 12. The difference between the two curves illustrates the different beliefs held by the clinician for these two procedures and almost certainly reflects aspects of personal experience. The clinician was also asked to assess uncertainty concerning the life expectancy of the same hypothetical patient. This is shown in Fig. 13.
(b) A second clinician was asked to assess uncertainty of visual acuity outcome in the following case. A 50-year-old architect presented complaining of metamorphopsia in one eye for which the visual acuity was 6/6. The other eye was amblyopic with an acuity of 6/36.
There were early retinal signs of age related macular degeneration nasal to the fovea. Fluorescein angiography showed a small subretinal neovascular membrane about 700 μm from the centre of the foveal avascular zone. The clinician was asked what the expected visual acuity outcome would be after treatment by argon laser photocoagulation: (i) if there were successful closure of the leaking vessels; and (ii) if complications were to arise during or following treatment. The cumulative distribution functions for the uncertainties associated with these two events are shown in Figs 14 and 15. Because of the variety of possible complications which vary in their severity (complications for this form of treatment could include foveal burn, macular pucker, failed closure of leaking vessel with further extension of neovascularization, additional neovascularization from excessive laser application, macular oedema, retinal pigment epithelial detachment), the clinician felt more uncertain about the expected outcome if treatment was unsuccessful. Subjective estimates were therefore expressed in ranges.
FIG. 12. Uncertainty expressed as subjective probability regarding post-operative visual acuity for two different methods of treating an anterior superior choroidal melanoma. (Visual acuity is represented on an interval scale in equal intervals of minimum angle of resolution from 6/6 = 1 min to 6/60 = 10 min of arc. A nominal category scale is used for visual acuities less than 6/60.) Subjective probability estimates are not dependent upon the choice of scale transformation. [Graphic not reproduced. Curves: cobalt plaque and local resection. Vertical axis: x1, x25, x50, x75, x99; horizontal axis: visual acuity, 6/6 to NPL.]
FIG. 13. Uncertainty expressed as subjective probability regarding life expectancy following treatment of a 50-year-old patient with an anterior superior choroidal melanoma. [Graphic not reproduced. Vertical axis: x1 to x99; horizontal axis: life expectancy (years).]
FIG. 14. Subjective probabilities for the visual acuity outcome of an eye with age-related maculopathy following successful laser treatment of subretinal neovascularization. (See note to Fig. 12 for visual acuity scale.) [Graphic not reproduced. Panel title: Successful treatment. Vertical axis: x1 to x99; horizontal axis: visual acuity.]
FIG. 15. Subjective probabilities for the visual acuity outcome of an eye with age-related maculopathy having complications following laser treatment of subretinal neovascularization. The horizontal lines represent ranges of uncertainty. (These estimates were produced by the lottery principle without reference to objective probability values. Their close approximation to the published values of Fine, 1982, demonstrates that subjectively determined probabilities are an acceptable first approximation in the absence of empirical data.) [Graphic not reproduced. Panel title: Treatment with complications. Vertical axis: x1 to x99; horizontal axis: visual acuity.]
These estimates of subjective probability contain considerable information on the form and extent of an individual's uncertainty about an event. The actual shape of the curve is characteristic of an individual's beliefs and, where it represents uncertainties associated with certain forms of treatment, it is likely to reflect the clinician's own skills rather than averaged published figures. Steeper slopes in the graph indicate less uncertainty than shallow slopes (for further examples see Aspinall and Hill, 1984a). An advantage of generating such a graph is that discrete probability values can be read from any part of it. However, perhaps the major benefit resides in the fact that the very act of quantifying uncertainty in this way can be helpful to the participant and can also provide a basis for professional discussion related to the efficacy of alternative treatments. It can become, therefore, a basis for revising opinion.
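Reading discrete probabilities from such a graph can be sketched in Python as follows. This is only one possible approach: it assumes straight-line interpolation between the elicited points, and the elicited life-expectancy values are hypothetical figures, not those of Fig. 13. The function name is likewise an illustrative choice.

```python
from bisect import bisect_right


def make_subjective_cdf(quantiles):
    """Build a piecewise-linear cumulative function from elicited points.

    quantiles maps a cumulative probability (e.g. 0.50 for x50) to the
    value chosen by the decision maker. Linear interpolation between
    points is an assumption of this sketch.
    """
    points = sorted((value, prob) for prob, value in quantiles.items())
    values = [v for v, _ in points]
    probs = [p for _, p in points]
    if any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError("elicited values must be distinct")
    if any(b <= a for a, b in zip(probs, probs[1:])):
        raise ValueError("values must increase with cumulative probability")

    def cdf(x):
        # Outside the elicited range the curve is held at its end points.
        if x <= values[0]:
            return probs[0]
        if x >= values[-1]:
            return probs[-1]
        i = bisect_right(values, x)
        x0, x1 = values[i - 1], values[i]
        p0, p1 = probs[i - 1], probs[i]
        return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

    return cdf


# Hypothetical elicited life expectancy (years): x1, x25, x50, x75, x99.
cdf = make_subjective_cdf({0.01: 2, 0.25: 8, 0.50: 12, 0.75: 18, 0.99: 30})

print(cdf(10))            # belief that survival is 10 years or less
print(cdf(18) - cdf(8))   # belief that survival lies between 8 and 18 years
```

A steep segment of the interpolated curve corresponds to a narrow range of values carrying a large share of belief, which matches the remark above that steeper slopes indicate less uncertainty.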
11.3. Utilities
The numerical value assigned to the consequence or worth of an action or event is known as its utility (Lindley, 1971). It is the means by which we express how much something is worth to us. Utility, therefore, is always contingent upon particular conditions and its dimensions are relative. It has its origins in the 18th century when two mathematicians, Gabriel Cramer and Daniel Bernoulli, independently derived functions to describe the subjective value of money (Stevens, 1959). In their original approach to utility, both Cramer and Bernoulli assumed that the subjective value of money grows less rapidly than the actual numerical amount of money. Expressed in this way, a typical utility function would take the form shown by the continuous line in Fig. 16. A person whose utility function has this shape is said to be a risk avoider, whereas the dashed curve in Fig. 16 is that for a risk seeker. Most people behave as risk avoiders where gains are concerned and as risk seekers in choices that involve losses (Kahneman and Tversky, 1982). In a clinical application of utility analysis, Card et al. (1976) found a risk avoidance function for the utilities assigned to post-operative visual acuity estimates.
FIG. 16. Utility function for a 'risk avoider' (continuous line) and a 'risk seeker' (broken line). [Graphic not reproduced. Vertical axis: utility, 0 to 1.0; horizontal axis: money, with £0, £30 and £100 marked.]
More recently, the basic ideas of utility have been extended well beyond the field of economics and have been shown to have a particular practical relevance to clinical decision making (Dombal and Gremy, 1976). For instance, a patient may lose an eye as the result of surgery, but otherwise remain fit. Utility is the numerical measure of the worth of this state of health expressed on a particular relative scale. If decisions are to be relevant then they must reflect the utilities associated with alternate outcomes.
11.4. Measuring Value as Utility
The assessment of value begins by assigning a utility of 1 to the outcome which is considered the best of those available and a utility of 0 to the outcome which is considered the worst of those available. Lotteries are then presented to the decision maker to determine the utilities of intermediate outcomes. Several methods are available for assessing utilities (Kozielecki, 1981) but only one will be presented here to illustrate its application. Consider, for example, the utility a patient assigns to different expected levels of visual acuity
following cataract surgery. The best visual acuity would be at least 6/6 and the worst would be 'no perception of light'. (In clinical practice it will be appreciated that the best outcome can never be guaranteed.) In assessing the utility for the intermediate value of, say, 6/12, the decision maker is presented with two options as lotteries:
Lottery 1: Receive a visual acuity of 6/12 for certain.
Lottery 2: Receive the best outcome (VA = 6/6) with a probability p and the worst outcome (no perception of light) with a probability of (1 - p).
The value of p for which the decision maker finds the two lotteries equivalent represents the utility for the intermediate outcome of a visual acuity of 6/12, because Lottery 2 then has an expected utility of p × 1 + (1 - p) × 0 = p. Only a small number of intermediate outcomes is needed to generate a utility function, from which other utilities can be interpolated. Clearly, the utility assigned to each intermediate outcome will depend upon many factors, not least of which are the patient's own aspirations about his post-operative state of health. The procedure is a means whereby the decision maker is encouraged, systematically, to be explicit about the values he associates with an outcome or consequence. It is important to realize that there is not a right or wrong utility. Furthermore, in most aspects of medical treatment the individual patient should be encouraged, wherever possible, to assess his own utilities along several dimensions. For instance, one variable may be state of health, another earning capacity and yet another the discomfort associated with the form of treatment (including side effects). The skilled clinician will then incorporate these personal utilities of his patient into the overall decision analysis for clinical management. Most decision problems require utility functions of more than one variable such as indicated above.
Where this occurs, the simplest practical way of dealing with the problem of determining the overall utility Ux is to use a weighted additive utility model which has the form:
$$U_x = \sum_{i=1}^{n} b_i x_i$$
where n is the number of dimensions, xi denotes the value of dimension i and bi denotes its weight. Weights can be assigned to different dimensions by simply letting numbers (e.g. percentages) reflect the relative importance of each to the patient. The principal assumption behind this multidimensional utility model is that the attributes should be independent. Fortunately, Dawes and Corrigan (1974) have found the model robust even when this condition is not strictly fulfilled; deviations from true independence make little difference to the ultimate U-values or, more particularly, their rank order. Finally, the global expected utility of a consequence (EU) can be determined as a function of the probability of the consequence (pi) and the value or utility of the consequence (ui). Alternate courses of action whose outcome probabilities and utilities differ can be compared only through the concept of expected utility. A simple additive model is used for determining the expected utility (EU) as follows:
$$EU = \sum_{i=1}^{n} p_i u_i$$
for i = 1 to n, where n is the number of possible consequences. Both variables pi and ui can be assessed either objectively, as probability and value, or subjectively, as subjective probability and utility, or as a combination of objective and subjective terms. A practical example is given in Section 12.3 of determining the expected value of alternative courses of patient management where the outcomes have multidimensional utilities. But before that can be undertaken, it is necessary for the decision maker to identify all the major parameters likely to influence the decision. This may be achieved by means of a decision tree.
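The two additive models can be sketched in a few lines of Python. The dimension names, weights, dimension utilities and outcome probabilities below are all hypothetical values used only to show the arithmetic, and the function names are illustrative choices rather than anything defined in the source.

```python
def weighted_utility(utilities, weights):
    """Weighted additive utility: sum of weight x utility over dimensions."""
    if utilities.keys() != weights.keys():
        raise ValueError("utilities and weights must cover the same dimensions")
    return sum(weights[name] * utilities[name] for name in weights)


def expected_utility(outcomes):
    """Expected utility of one course of action.

    outcomes is a list of (probability, utility) pairs, one per consequence.
    """
    total = sum(p for p, _ in outcomes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)


# Hypothetical weights (percentages) for three dimensions of utility.
weights = {"visual_ability": 35, "earning_capacity": 55, "anxiety": 10}

# Hypothetical dimension utilities for a good and a poorer consequence.
u_good = weighted_utility(
    {"visual_ability": 1.0, "earning_capacity": 0.9, "anxiety": 1.0}, weights)
u_poor = weighted_utility(
    {"visual_ability": 0.6, "earning_capacity": 0.7, "anxiety": 1.0}, weights)

# Hypothetical probabilities of the two consequences for one action.
eu_action = expected_utility([(0.7, u_good), (0.3, u_poor)])
print(u_good, u_poor, eu_action)
```

Because the weights are expressed as percentages, each overall utility falls on a 0 to 100 scale, and the expected utility of an action is on the same scale, so that alternative actions can be ranked directly.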
12. STRUCTURING THE PROBLEM
12.1. Decision Trees
In patient management, clinical decisions should follow a logical sequence which can be expressed formally as a decision tree in which the uncertainties associated with each alternative outcome may be made explicit using Bayes' theorem. A decision tree is a graphic representation of the structure of a problem. By dividing a complex problem into its component parts, the decision maker is forced to define the problem clearly, assess the alternatives and clarify the nature of the risks and uncertainties at each step. The tree contains all possible actions and all possible consequences in the order in which they occur in time. Thus a decision tree is a flow chart in which alternative courses of action are indicated. A typical example of a decision tree is shown in Fig. 17. This represents a patient diagnosis flow chart in a glaucoma clinic (Aspinall and Hill, 1984a).
FIG. 17. Diagnosis and management decision tree for a new patient attending a glaucoma clinic. The flow chart is read from left to right, according to the order in which events occur in time. Confidence in diagnosis increases as the patient progresses from left to right in the decision tree. However, the course of action indicated by the terminal branches is a matter of clinical opinion and depends on the weight given by the clinician to the information present at the different decision nodes. Decision analysis is aimed at rationalizing that clinical opinion by requiring the clinician's beliefs and values to be made explicit when combined with information about test efficiency and treatment efficacy. [Graphic not reproduced. Legible node labels: new patient (referred from screening); IOP < 22 and normal discs; IOP 22-35; IOP > 35; family history; no family history; minimal field loss; extensive field loss; progressive field loss; stable fields; poor IOP control; reduced and stable IOP.]
When used in conjunction with decision analysis, objective or subjective probabilities are assigned to each branch of the tree. The probabilities at each node are conditional (i.e. Bayesian) and therefore add up to 1.0. The probabilities from node to node, however, are unconditional and therefore may be 'averaged' by multiplication in a step-wise manner along a given path. This procedure allows the determination of the total probability estimate for any particular complete branch of the decision tree. Utility values are attached to the terminal branches thereby permitting an easy calculation of the 'expected value' associated with each outcome. These simple flow charts are not the only way of structuring a problem but their structure helps to clarify the logical steps required in decision analysis. They should not be confused with more descriptive attempts at structuring decision problems. The most common descriptive approaches are the causal-associational network models which are now gaining prominence in medical artificial intelligence studies (Clancy and Shortliffe, 1984). For example, Weiss et al. (1978) have used such an approach to develop a knowledge-based computer system for medical consultation in glaucoma which is modelled around the pathophysiological processes of the disease and its management.
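The step-wise multiplication of branch probabilities and the calculation of expected value at the terminal branches can be sketched in Python. The nested-dictionary representation, the function names and every probability and utility in the example tree are assumptions made for illustration; they do not reproduce the tree of Fig. 17.

```python
def path_probability(branch_probabilities):
    """Multiply the branch probabilities along one complete path."""
    result = 1.0
    for p in branch_probabilities:
        result *= p
    return result


def expected_value(node):
    """Expected value of a node in a tree of nested dicts.

    Terminal node: {"utility": u}
    Chance node:   {"chance": [(probability, child), ...]}
    Decision node: {"decision": {action_name: child, ...}}
    """
    if "utility" in node:
        return node["utility"]
    if "chance" in node:
        branches = node["chance"]
        if abs(sum(p for p, _ in branches) - 1.0) > 1e-9:
            raise ValueError("probabilities at a chance node must sum to 1")
        return sum(p * expected_value(child) for p, child in branches)
    if "decision" in node:
        # A decision node takes the value of its best available action.
        return max(expected_value(c) for c in node["decision"].values())
    raise ValueError("unknown node type")


def best_action(decision_node):
    """Name of the action with the highest expected value."""
    options = decision_node["decision"]
    return max(options, key=lambda name: expected_value(options[name]))


# Hypothetical two-action tree with utilities on a 0 to 1 scale.
tree = {"decision": {
    "treat": {"chance": [(0.8, {"utility": 0.9}), (0.2, {"utility": 0.2})]},
    "leave": {"chance": [(0.5, {"utility": 0.7}), (0.5, {"utility": 0.4})]},
}}

print(path_probability([0.8, 0.5]))
print(best_action(tree), expected_value(tree))
```

Working from the terminal branches back towards the root in this way keeps each node's probabilities and utilities explicit, which is the discipline the decision tree is intended to impose.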
12.2. The General Decision Model
The general framework for decision analysis outlined in this paper is essentially Bayesian. It will be recalled that the Bayesian approach encourages a decision maker to bring all the knowledge he can to a problem, including his subjective estimates of unknown events. For those unhappy with the use of subjective probabilities it is suggested that they should only be used when there is no alternative way of estimating the probability of the unknown event, until such time as more objective probabilities become available. However, the fact that the decision maker is forced to make an assessment of unknown events explicitly is considered a particular strength of a Bayesian approach because it often clarifies the relevant factors in a complex decision. The approach is, therefore, an ongoing learning process where prior beliefs can be continually updated as new evidence comes to light. The previous sections have contained the elements needed to illustrate a general framework for decision making. The approach is flexible in that the techniques illustrated in any section can be used on their own depending on the nature of the problem. For instance, the decision matrix and Bayes' theorem may be sufficient for a diagnostic problem, while the ROC curve might be used to clarify the choice of a suitable criterion for a detection or recognition problem. In some cases the very act of quantifying uncertainties as subjective probabilities may provide the necessary insight required to make a decision without further formal aids. The other feature of the proposed approach is related to the concept of the expected value of some course of action. The aim of the decision maker is always to choose the action which has the highest expected utility.
FIG. 18. 'Cumulative density functions' for visual acuity in age related macular degeneration with choroidal neovascularization at a distance of 200-2500 μm from the centre of the foveal avascular zone. (a) shows results following treatment with argon laser photocoagulation and (b) shows results from the no-treatment group in a randomized clinical trial. Horizontal lines indicate the range of acuity values. Derived from data of the Macular Photocoagulation Study Group, Fine, 1982. [Graphic not reproduced. Panel titles: (a) SMD treatment group; (b) SMD no treatment group. Vertical axis: 0 to 1.0; horizontal axis: visual acuity, 6/6 to HM.]
FIG. 19. Decision tree for the case vignette described in Section 12.3 showing three dimensions of utility (see text). [Graphic not reproduced; its content is tabulated below.]
Option / event         Acuity outcome    Visual ability   Earning capacity   Anxiety
Treat: success         6/12 or better    Best             £12000             Best
                       6/18 - 6/36       Intermediate     £8500              Best
                       6/48 or worse     Worst            £5000              Best
Treat: complications   6/12 or better    Best             £10000             Intermediate
                       6/18 - 6/36       Intermediate     £7500              Intermediate
                       6/48 or worse     Worst            £5000              Worst
Leave                  6/12 or better    Best             £15000             Intermediate
                       6/18 - 6/36       Intermediate     £10000             Intermediate
                       6/48 or worse     Worst            £5000              Intermediate
FIG. 20. Utility function for earning capacity associated with the case vignette of a self-employed architect described in Section 12.3 and illustrated by the decision tree in Fig. 21. [Graphic not reproduced. Vertical axis: utility; horizontal axis: earning capacity, £5000 to £15000.]
12.3. A Clinical Example
In Section 5, the general decision model was identified in terms of five stages. These are:
(i) Structuring the decision problem.
(ii) Assessing probabilities for uncertain events.
(iii) Assessing utility values for outcomes.
(iv) Selecting the outcome with maximum expected utility.
(v) Assessing the sensitivity to variations in judgemental input.
The following clinical example demonstrates the practical relevance of each of these stages. It is not, however, intended that such a detailed analysis be undertaken in all clinical decision making but rather that the approach should be invoked more widely where there is genuine uncertainty about the most appropriate course of action. More particularly, it is believed that structuring decision problems in this way will provide a rational basis both for the teaching and learning of the diagnostic process.
Consider the following case vignette. A healthy 50-year-old professional architect, self-employed, married and with two children, with no previous ocular history complains of metamorphopsia in one eye which has a visual acuity of 6/6. The other eye is amblyopic with a best refracted visual acuity of 6/36. Ophthalmoscopy shows age related macular degeneration and fluorescein angiography demonstrates a small subretinal neovascular membrane presenting as a disciform lesion, 700 μm nasal to the centre of the foveal avascular zone. The clinician is faced with the decision of whether or not to treat this eye using argon laser photocoagulation. There is recent evidence from the Macular Photocoagulation Study Group (Fine, 1982) that treatment of such cases can be successful. However, uncertainty remains because the probabilities associated with varying levels of visual acuity following treatment are, on average, only about twice those for an eye which is untreated [see Fig. 18(a) and (b)]. Since the success rate for treatment in an individual eye is somewhat indeterminate, the problem here is to decide what minimum success rate the patient should accept before proceeding with treatment.
12.3.1. STAGE 1
The options for this patient and their consequences are illustrated in the decision tree in Fig. 19. The visual acuity outcomes are considered in three groups according to their functional value for the patient. With acuities of 6/12 or better he would still be able to perform his job as an architect. Acuity levels of 6/18 to 6/36 would place some constraints on his work capabilities because, in the United Kingdom, he would no longer meet the legal standard for holding a driving licence. If his visual acuity became 6/48 or worse he would be forced to resign from his present occupation on grounds of disability. The three acuity levels therefore reflect different likely earning capacities.
12.3.2. STAGE 2
The probabilities associated with different visual acuity levels are obtained for the branches of the decision tree in Fig. 19. These have been derived by interpolation (Fig. 21), as objective probabilities from the data of Fine (1982), shown in Fig. 18, for the 'successful treatment' group and the 'no treatment' group. Subjective probabilities associated with different acuity outcomes in the case of 'treatment with complications' were estimated by a clinician (Fig. 15). The major factors likely to contribute to uncertainty in the outcomes associated with treatment are the clinician's skill at using the laser and the responsiveness of the lesion to photocoagulation. If no treatment is given, uncertainty arises from the possibility of further extension of the neovascular membranes. We consider the situation, therefore, in which the clinician is uncertain about the probability of success after treatment. Let us designate this probability value p.
FIG. 21. A complete decision tree with utilities and associated probabilities for the case vignette of a self-employed architect described in Section 12 (see text). [Graphic not reproduced; the utilities are tabulated below. VA = visual ability (weight 35), EC = earning capacity (weight 55), Anx = anxiety (weight 10). The treatment branches are labelled p (success) and (1 - p) (complications). The outcome-branch probabilities legible in the figure are 0.60, 0.12, 0.20, 0.70, 0.22 and 0.46.]
Option / event         Acuity outcome    Individual utilities    Weighted utilities    Total %
                                         VA     EC     Anx       VA     EC     Anx
Treat: success         6/12 or better    1.0    0.9    1.0       35     50     10      95
                       6/18 - 6/36       0.6    0.7    1.0       21     39     10      70
                       6/48 or worse     0      0      1.0       0      0      10      10
Treat: complications   6/12 or better    1.0    0.8    0.6       35     44     6       85
                       6/18 - 6/36       0.6    0.6    0.3       21     33     3       57
                       6/48 or worse     0      0      0         0      0      0       0
Leave                  6/12 or better    1.0    1.0    0.7       35     55     7       97
                       6/18 - 6/36       0.6    0.8    0.4       21     44     4       69
                       6/48 or worse     0      0      0.1       0      0      1       1
12.3.3. STAGE 3
For this problem we consider three possible dimensions to utility. These are: (i) visual ability; (ii) earning capacity; and (iii) long-term anxiety after the operation or the decision not to treat. Suppose the decision maker ranks the dimensions in order of importance as earning capacity, visual ability, anxiety and assigns numbers to reflect the relative importance of these three dimensions in the ratio 55:35:10. The utilities of the financial consequences are assessed by the patient using a lottery procedure. The best outcome is given a utility of 1, the worst outcome 0. A curve relating earning capacity to utility (see Fig. 20) can then be generated by the lottery procedure described in Section 11.4. When there are several intermediate financial values they can be transformed into utilities from the same utility curve. Utilities for visual ability and long-term anxiety are assessed in a similar manner. Again, a utility of 1 represents the best outcome and a utility of 0 the worst outcome.
In assessing anxiety, it is assumed that there is no anxiety (utility 1) after successful laser treatment since it is expected that the most stable outcome has been attained, whatever the visual acuity achieved. On the other hand, both 'no treatment', where vision may deteriorate unexpectedly, and 'post-operative complications', which might also give rise to a sudden loss of vision, are considered likely to cause significant long-term anxiety. In this case, it is assumed that anxiety is worst (i.e. utility 0) in the case of the uncertainty resulting from complications where the visual acuity outcome is poorest. The patient's anxiety is assumed here to be worse for a given visual acuity when complications arise following treatment than when the eye is untreated, since a further session of laser treatment is less likely to give rise to a stable state than a first session of treatment on an untreated eye.
Single utilities can now be given for all combinations of outcomes. These are determined as the sum of the products of a utility and its percentage weighting across the three utility dimensions for each outcome. For instance, for the combination of successful treatment (and therefore no anxiety) with a visual acuity outcome of 6/18 to 6/36, the single resultant utility would be (35 × 0.6) + (55 × 0.7) + (10 × 1.0) = 69.5, which is rounded to 70 (see Section 11.4). The complete decision tree with utilities and associated probabilities is shown in Fig. 21.
12.3.4. STAGE 4
The expected utilities for each course of action can now be determined according to the model in Section 11.4. These are calculated as follows:
EU_A = (0.60 × 95) + (0.22 × 70) + (0.12 × 10) ≈ 74
EU_B = (0.10 × 85) + (0.20 × 57) + (0.70 × 0) ≈ 20
EU_C = (0.22 × 97) + (0.32 × 69) + (0.46 × 1) ≈ 44
and it follows that:
EU_D = 74p + 20(1 - p) = 54p + 20
where EU_A, EU_B, EU_C and EU_D are the expected utilities at nodes A, B, C and D in the decision tree of Fig. 21. Treatment should proceed if EU_D > EU_C. From the above expected utilities, therefore, it follows that the decision to proceed with treatment should be taken if (54p + 20) > 44, i.e. if p > 0.44. In other words, laser treatment is the most favourable course of action if the decision maker's estimate of the probability of successful photocoagulation is 0.44 or better. If the estimated success of laser treatment is less than this value, then the best action is not to treat and to keep the patient under surveillance until circumstances change.
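The Stage 4 arithmetic can be retraced with a short Python sketch. This is an illustration rather than part of the original analysis: the names (`expected_utility`, `threshold_success_probability`, `NODE_A` and so on) are hypothetical, and the branch probabilities and single utilities are simply those quoted in the calculation above. Expected utilities are rounded to whole numbers, as in the text, before solving for the value of p at which the expected utilities at nodes D and C are equal.

```python
# Illustrative sketch: all names here are hypothetical, not from the chapter.

def expected_utility(branches):
    """Sum of probability x utility over the branches of one chance node."""
    return sum(p * u for p, u in branches)


def threshold_success_probability(eu_a, eu_b, eu_c):
    """Value of p at which eu_a*p + eu_b*(1 - p) equals eu_c."""
    return (eu_c - eu_b) / (eu_a - eu_b)


# (probability, single utility) pairs as quoted in the Stage 4 calculation.
NODE_A = [(0.60, 95), (0.22, 70), (0.12, 10)]  # successful treatment
NODE_B = [(0.10, 85), (0.20, 57), (0.70, 0)]   # treatment with complications
NODE_C = [(0.22, 97), (0.32, 69), (0.46, 1)]   # no treatment

# Round to whole numbers, as the text does, before comparing nodes.
eu_a = round(expected_utility(NODE_A))
eu_b = round(expected_utility(NODE_B))
eu_c = round(expected_utility(NODE_C))

p_min = threshold_success_probability(eu_a, eu_b, eu_c)
print(f"EU_A={eu_a}, EU_B={eu_b}, EU_C={eu_c}, treat if p > {p_min:.2f}")
```

With these inputs the threshold agrees with the value of 0.44 derived in the text. If the expected utilities are left unrounded, the threshold comes out slightly higher, at about 0.45.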
12.3.5. STAGE 5
This example highlights two points in clinical decision analysis. Firstly, the decision to follow a particular course of action does not simply depend on the presence of high probability estimates. In rational action both utilities and probabilities must be taken into account. Secondly, it is clear that the methodology could allow for the patient to generate his own utilities for alternative outcomes. Furthermore, if the patient is unhappy about the relative weightings given to the three dimensions of utility, these can be changed. In so doing, the sensitivity of the maximum expected utility can be assessed and its consequence on the final decision can be demonstrated. (Other complete decision analysis examples are given in Aspinall and Hill, 1984b.) The model also allows one to study the effect on p of varying the subjective probability estimate of different visual acuity outcomes if complications arise with treatment. This makes it possible to find out whether this subjective probability estimate is crucial to the decision.
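A sensitivity check of this kind can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the individual utilities are those read from Fig. 21, the branch probabilities are those used in Stage 4, and the alternative weightings and complication probabilities tried at the end are invented purely to show the mechanics. Utilities are not rounded here, so the baseline threshold differs marginally from the 0.44 quoted in the text.

```python
# Illustrative sketch: names and the alternative inputs tried are hypothetical.

# Individual utilities (visual ability, earning capacity, anxiety) for the
# three acuity outcomes at each node, as read from Fig. 21.
UTILITIES = {
    "A": [(1.0, 0.9, 1.0), (0.6, 0.7, 1.0), (0.0, 0.0, 1.0)],  # success
    "B": [(1.0, 0.8, 0.6), (0.6, 0.6, 0.3), (0.0, 0.0, 0.0)],  # complications
    "C": [(1.0, 1.0, 0.7), (0.6, 0.8, 0.4), (0.0, 0.0, 0.1)],  # no treatment
}

# Branch probabilities as used in the Stage 4 calculation.
PROBABILITIES = {
    "A": (0.60, 0.22, 0.12),
    "B": (0.10, 0.20, 0.70),
    "C": (0.22, 0.32, 0.46),
}


def node_expected_utility(node, weights, probabilities):
    """Expected utility at one node for a given set of dimension weights."""
    total = 0.0
    for p, utils in zip(probabilities[node], UTILITIES[node]):
        single = sum(w * u for w, u in zip(weights, utils))
        total += p * single
    return total


def threshold_p(weights, probabilities=PROBABILITIES):
    """Success probability above which treatment has the higher expected utility."""
    eu_a, eu_b, eu_c = (
        node_expected_utility(n, weights, probabilities) for n in "ABC"
    )
    return (eu_c - eu_b) / (eu_a - eu_b)


# Weights are ordered (visual ability, earning capacity, anxiety).
baseline = threshold_p((35, 55, 10))
# Hypothetical re-weighting: visual ability ranked above earning capacity.
reweighted = threshold_p((55, 35, 10))
# Hypothetical, more optimistic estimate of outcomes after complications.
optimistic = threshold_p(
    (35, 55, 10), dict(PROBABILITIES, B=(0.30, 0.30, 0.40))
)

for label, value in [("baseline", baseline),
                     ("reweighted", reweighted),
                     ("optimistic complications", optimistic)]:
    print(f"{label}: treat if p > {value:.2f}")
```

Comparing the printed thresholds shows how far the decision boundary moves when a judgemental input is changed; an input whose variation barely shifts the threshold is not crucial to the decision.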
13. CONCLUDING REMARKS
Many of our judgements and actions, when based on intuition, are not only heavily biased by selective perception and reasoning but can also be shown to be irrational. The judgements made in the practice of medicine are no exception to that fact. And those who practise medicine by intuition will continue to make many avoidable decision errors, perhaps without even recognizing them as errors. Unfortunately, these problems are likely to remain as long as the teaching of medicine continues to place emphasis on the acquisition of an ever increasing knowledge base without providing guidance on the principles of strategic decision making. This plea is not new (Fineberg, 1981) and neither are the principles of decision analysis. But the nature of bias in human judgement is such that those who need to be convinced by scientific evidence seem to be the least capable of responding to it. To adapt a
quotation from Patrick O'Donovan, "All the writing on the wall of science is invisible to those who most need to read it" (1976). It is not sufficient that experience and intuition should be accepted as a basis for knowledge. It is also necessary for there to be a conscious organization of that knowledge. One way in which knowledge can benefit through experience is by repeated encounters with events which have perceived similarities. When that knowledge is rationalized to provide a corpus of understanding it can then be fed back into our next experience. However, it is too hazardous to expect the growth of application of knowledge in medicine to proceed in this chance way without serious consequences.
More than ever before, medical education and clinical practice are in need of a rational framework for handling information, for helping to bring about effective clinical judgements and for assessing the most efficient deployment of health service resources. Decision analysis meets these aims. Indeed, it can be argued that a Bayesian approach to decision analysis should be seen as the theory of medicine. Clinical decisions are based on inferences derived from patient symptoms and test evidence when data is collected as a means of hypothesis testing. In medicine the initial expectation of a clinical state is rarely expressed in the classical scientific language of a null hypothesis where alternative outcomes are assumed to be equally probable, because clinical experience reinforces the Oslerian maxim that common events occur commonly. The collection of test evidence, therefore, is directed towards overturning prior likelihoods for a clinically pathological state which are far from mere chance expectations. The principle of hypothesis falsification should therefore be encouraged in clinical practice. The extent to which any test evidence can modify such prior expectations depends not only on its relevance to the pathology (i.e.
the hypothesis) but also on the inherent sources of error in the test itself. Bayesian decision analysis is a model which reflects these facts. The use of decision trees also permits a problem to be expressed formally so that decisions may follow a logical sequence. When hard data is not available to express the uncertainties associated with relevant
information, the clinician must rely on his own past experience or that of a colleague. He must learn, therefore, to assess both subjective probabilities and also the utility of outcomes of alternative actions in numerical terms. This is not an excuse for woolly thinking. If doctors act consistently, then their actions can be viewed as if there were probabilities and utilities associated with them. Clinical practice is the act of assigning probabilities to sets of signs and symptoms, which are then used rationally to assess the maximum expected utility of an outcome.
To use a normative strategy for decision making on the basis of Bayesian principles is not only logical, but also ethical. It provides a satisfactory answer to the question, "Is the autonomy of the patient being respected?". A decision analysis approach allows us to express explicitly whether the pros and cons of a particular treatment or course of action negate the patient's basic rights. How well that aim is achieved will depend upon how effectively the clinician is able to help the patient become aware of the differential weights or utilities he associates with the acclaimed benefits and harms. Simply to say that it is possible to effect a cure is inadequate because it does not permit a rational decision to be made. It is necessary for the patient to be given an estimate of the probability of a cure. If exact data is not available for deriving that probability then a subjective estimate must be made. Only by this means can a clinician claim to have discharged his full responsibilities in a prudent, caring and understanding manner.
The growth in medical knowledge continues unabated, requiring medical students and practising doctors not only to cope with a larger data base but also to make reasonable clinical judgements on the basis of a larger number of factors. If medical educators and clinical practitioners fail to recognize the need for adopting principles of strategic decision making, they rightly deserve the condemnation of future members of their profession for failing to lay the foundation of a medical science.
Acknowledgements: I am most grateful for the many helpful discussions with Dr Peter Aspinall and Dr Barnaby Reeves, and for Dr Reeves' critical review of the manuscript.
REFERENCES
ABERCROMBIE, M. J. L. (1960) The Anatomy of Judgement. Hutchinson, London.
AJZEN, I. and FISHBEIN, M. (1980) Understanding Attitudes and Predicting Social Behaviour. Prentice-Hall, Englewood Cliffs, NJ.
ANON (1983) Diagnosis related groups (DRG's) and the Medicare program: implications for medical technology. Technical Memorandum. Congress of the United States, Office of Technology Assessment, Washington, DC.
ARDEN, G. B. and JACOBSON, J. J. (1978) A simple grating test for contrast sensitivity: Preliminary results indicate value in screening for glaucoma. Invest. Ophthalmol. vis. Sci. 17: 23-32.
ARMALY, M. F. (1969) Ocular pressure and visual fields: a ten year follow-up study. Archs Ophthal. 81: 25-40.
ASPINALL, P. A. (1974) Some methodological problems in testing visual function. In: Modern Problems in Ophthalmology, Vol. 11. Karger, Basel.
ASPINALL, P. A. and HILL, A. R. (1983) Clinical inferences and decisions: I Diagnosis and Bayes' theorem. Ophthal. Physiol. Opt. 3: 295-304.
ASPINALL, P. A. and HILL, A. R. (1984a) Clinical inferences and decisions: II Decision trees, receiver operator curves and subjective probability. Ophthal. Physiol. Opt. 4: 31-38.
ASPINALL, P. A. and HILL, A. R. (1984b) Clinical inferences and decisions: III Utility assessment and the Bayesian decision model. Ophthal. Physiol. Opt. 4: 251-263.
ASPINALL, P. A. and HILL, A. R. (1984c) Is screening worthwhile? In: Progress in Child Health (A. C. MacFarlane, ed.) Vol. 1, pp. 243-259. Churchill Livingstone, Edinburgh.
ASPINALL, P. A., KINNEAR, P. A., DUNCAN, L. J. P. and CLARKE, B. (1983) Prediction of diabetic retinopathy from clinical variables and colour vision data. Diabetes Care 6: 144-148.
BALAZSI, A. G., DRANCE, S. M., SCHULZER, M. and DOUGLAS, G. R. (1984) Neuro-retinal rim area in suspected glaucoma and early chronic open-angle glaucoma. Archs Ophthal. 102: 1011-1014.
BALLA, J. I. (1985) The Diagnostic Process: a Model for Clinical Teachers. Cambridge University Press, Cambridge.
BARNOON, S. and WOLFE, H. (1972) Measuring the Effectiveness of Medical Decisions. C. Thomas, Springfield, Ill.
BARTLETT, F. C. (1932) Remembering. Cambridge University Press, Cambridge.
BIERI, J., ATKINS, A. L., BRIAR, S., LEAMAN, R. L., MILLER, H. and TRIPODI, T. (1975) Clinical and Social Judgment: the Discrimination of Behavioral Information. R. E. Krieger, New York.
BLACKWELL, H. R. (1952) Studies of psychophysical methods for measuring visual thresholds. J. opt. Soc. Am. 42: 606-616.
BRON, A. J. B. and BROWN, N. A. P. (1982) Classification, grading and prevention of cataract. Res. clin. Forums 4: 101-127.
CARD, W., RUSINKIEWICZ, M. and PHILLIPS, C. I. (1976) Estimation of the utilities of states of health with different visual acuities using a wagering technique. In: Decision Making and Medical Care: Can Information Science Help? (F. T. de Dombal and F. Gremy, eds) North-Holland, Amsterdam.
CLANCEY, W. S. and SHORTLIFFE, E. H. (eds) (1984) Readings in Medical Artificial Intelligence: the First Decade. Addison-Wesley, Reading, MA.
COHEN, J. and CHRISTENSEN, I. (1970) Information and Choice. Oliver and Boyde, Edinburgh.
CUMMINS, R., JARMAN, B. and WHITE, P. M. (1981) Do general practitioners have different "referral thresholds"? Br. med. J. 282: 1037-1039.
DAUBS, J. and CRICK, R. P. (1980) Epidemiological analysis of King's College Hospital glaucoma data. Res. clin. Forums 2: 41-59.
DAWES, R. and CORRIGAN, B. (1974) Linear models in decision making. Psychol. Bull. 81: 95-106.
DE DOMBAL, F. T. and GREMY, F. (eds) (1976) Decision Making and Medical Care: Can Information Science Help? North-Holland, Amsterdam.
DE DOMBAL, F. T., LEAPER, D. J., STANILAND, J. R., MCCANN, A. P. and HORROCKS, J. C. (1972) Computer-aided diagnosis of acute abdominal pain. Br. med. J. 2: 9-13.
DONDERS, F. C. (1864) Accommodation and Refraction of the Eye. New Sydenham Society, London.
DRANCE, S. M. and LAKOWSKI, R. (1983) Colour vision in glaucoma. In: Glaucoma Update II (G. K. Krieglstein and W. Leydhecker, eds) pp. 117-121. Springer-Verlag, New York.
EDWARDS, W. (1968) Conservation in human information processing. In: Formal Representation of Human Judgement (B. Kleinmuntz, ed.) Wiley, New York.
EELLS, E. (1982) Rational Decision and Causality. Cambridge University Press, Cambridge.
ELLIS, B. (1968) Basic Concepts of Measurement. Cambridge University Press, Cambridge.
FINE, S. L. (1982) (Chairman: Macular Photocoagulation Study Group) Argon laser photocoagulation for senile macular degeneration: results of a randomised clinical trial. Archs Ophthal. 100: 912-918.
FINEBERG, E. (1981) Editorial. J. Med. Decision Making 1: 3.
FORREST, M. and ANDERSON, B. (1986) Ordinal scale and statistics in medical research. Br. med. J. 292: 537-538.
FRYBACK, D. G. (1978) Bayes' theorem and conditional nonindependence of data in medical diagnosis. Comput. Biomed. Res. 11: 423-434.
GALEN, R. S. and GAMBINO, S. R. (1975) Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. Wiley, New York.
GARNER, A. (1984) (Committee chairman) An international classification of retinopathy of prematurity. Archs Ophthal. 102: 1130-1134.
GINSBERG, A. P. and CANNON, M. W. (1983) Comparison of three methods for rapid determination of threshold contrast sensitivity. Invest. Ophthalmol. vis. Sci. 24: 1626-1629.
GREEN, D. M. and SWETS, J. A. (1966) Signal Detection Theory and Psychophysics. Wiley, New York.
HALLIDAY, B. and ROSS, J. E. (1983) Comparison of 2 interferometers for predicting visual acuity in patients with cataract. Br. J. Ophthal. 67: 273-277.
HART, C. R. (1975) Screening in General Practice. Churchill Livingstone, Edinburgh.
HIGGINS, K. E., JAFFE, M. J., COLETTA, N. J., CARUSO, R. C. and DE MONASTERIO, F. M. (1984) Spatial contrast sensitivity: importance of controlling the patient's visibility criterion. Archs Ophthal. 102: 1035-1041.
HILL, A. R. and ASPINALL, P. A. (1980) An application of decision theory to colour vision testing. In: Colour Vision Deficiencies V (G. Verriest, ed.) pp. 164-171. Adam Hilger, London.
HILL, A. R. and ASPINALL, P. A. (1982) Pass/fail criteria in colour vision tests and their effect on decision confidence. Docum. Ophthalmol. Proc. Series 33: 157-161.
HILL, A. R., HERON, G., LLOYD, M. and LOWTHER, T. (1982) An evaluation of some colour vision tests for children. Docum. Ophthalmol. Proc. Series 33: 183-187.
HILL, A. R., ASPINALL, P. A. and VERRIEST, G. (1985) Principles of colour vision test battery selection. In: Colour Vision Deficiencies VII (G. Verriest, ed.) pp. 181-187. Junk, The Hague.
HITCHINGS, R. A., BROWN, D. B. and ANDERTON, S. A. (1983) Glaucoma screening by means of an optic disc grid. Br. J. Ophthal. 67: 352-355.
HODES, B. L. (1985) The prospective patient system, diagnosis-related groups, and the new world of health care. Archs Ophthal. 103: 185-186.
HOGARTH, R. (1980) Judgement and Choice. Wiley, New York.
HOSKINS, H. D., SHAFFER, R. N. and HETHERINGTON, J. (1984) Anatomical classification of the developmental glaucomas. Archs Ophthal. 102: 1331-1336.
KAHN, H. A., LEIBOWITZ, H., GANLEY, J. P., KINI, M., COLTON, T., NICKERSON, R. and DAWBER, T. R. (1975) Standardizing diagnostic procedures. Am. J. Ophthal. 79: 768-775.
KAHNEMAN, D. and TVERSKY, A. (1982) The psychology of preferences. Scient. Am. 274: 136-142.
KINNEAR, P., ASPINALL, P. A. and LAKOWSKI, R. (1972) The diabetic eye and colour vision. Trans. ophthal. Soc. U.K. 92: 69-78.
KORAN, L. M. (1975) The reliability of clinical methods, data and judgments. New Engl. J. Med. 293: 642-646, 695-701.
KOZIELECKI, J. (1981) Psychological Decision Theory. D. Reidel, London.
LAKOWSKI, R. and DRANCE, S. M. (1979) Acquired dyschromatopsias: the earliest functional losses in glaucoma. Docum. Ophthalmol. Proc. Series 19: 159-165.
LEAPER, D. J., HORROCKS, J. C., STANILAND, J. R. and DE DOMBAL, F. T. (1972) Computer-assisted diagnosis of abdominal pain using "estimates" provided by clinicians. Br. med. J. 4: 350-354.
LEDLEY, R. S. and LUSTED, L. B. (1959) Reasoning foundations of medical diagnosis. Science 130: 9-21.
LINDLEY, D. (1971) Making Decisions. Wiley, New York.
LUSTED, L. B. (1968) Introduction to Medical Decision Making. C. Thomas, Springfield, Ill.
MEADOR, C. K. (1965) The art and science of non-disease. New Engl. J. Med. 272: 92-95.
MEADOR, C. K. (1969) Non-disease: a problem of overdiagnosis. Diagnostica (Ames Comp.) 3: 10-11.
MEDICAL RESEARCH COUNCIL (1983) Diseases of the eye. Working party report submitted to Neurobiology and Mental Health Board, London, UK.
METZ, C., STARR, S. and LUSTED, L. B. (1976) Quantitative evaluation of visual detection performance in medicine. 7th L. H. Gray Conference, Leeds. Public Institute of Physics, Bristol.
MOTOLKO, M., DRANCE, S. M. and DOUGLAS, G. R. (1982) The early psychophysical disturbances in chronic open-angle glaucoma. Archs Ophthal. 100: 1632-1634.
MURPHY, E. A. (1976) The Logic of Medicine. John Hopkins University Press, Baltimore.
NAGEL, S. S. and NEEF, M. G. (1979) Decision Theory and the Legal Process. D. C. Heath, Lexington, MA.
O'DONOVAN, P. (1976) I watched the old China die. The Observer, Sept. 12th, p. 11.
OAKLEY, N., HILL, D. W., JOPLIN, G. F., KOHNER, E. M. and FRASER, T. R. (1967) Diabetic retinopathy 1. The assessment of severity and progress by comparison with a set of standard fundus photographs. Diabetologia 3: 402-405.
OSLER, W. (1930) Teacher and Student. In: Aequanimitas with Other Addresses to Medical Students, Nurses and Practitioners of Medicine, 2nd Ed, p. 38. Blakiston, Philadelphia.
PARRISH, R. K., SCHIFFMAN, J. and ANDERSON, D. R. (1984) Static and kinetic field testing: reproducibility in normal volunteers. Archs Ophthal. 102: 1497-1502.
PHILLIPS, D. (1973) Bayesian Statistics for Social Scientists. Nelson, London.
POKORNY, J., SMITH, V. C., VERRIEST, G. and PINCKERS, A. J. L. G. (1979) Congenital and Acquired Colour Vision Defects. Grune & Stratton, New York.
POPPER, K. R. (1983) Realism and the Aim of Science. Hutchinson, London.
QUIGLEY, H. A. (1985) Better Methods in Glaucoma Diagnosis. Archs Ophthal. 103: 186-189.
REEVES, B. C. and HILL, A. R. (1986) Test-retest reliability of the Arden Grating Test: inter-clinician variability. Br. J. Ophthal. (submitted).
REGGIA, J. A. and TUHRIM, S. (eds) (1985) Computer-assisted Medical Decision Making, Vols 1 and 2. Springer-Verlag, New York.
RENYI, A. (1970) Probability Theory. North-Holland, Amsterdam.
ROSS, D. F., FISHMAN, G. A., GILBERT, L. D. and ANDERSON, R. J. (1984) Variability of visual field measurements in normal subjects and patients with retinitis pigmentosa. Archs Ophthal. 102: 1004-1010.
ROSS, J. E., BRON, A. J., REEVES, B. C. and EMMERSON, P. G. (1985) Detection of optic nerve damage in ocular hypertension. Br. J. Ophthal. 69: 897-903.
SCHOR, S. and KARTEN, I. (1966) Statistical evaluation of medical journal manuscripts. J. Am. med. Ass. 195: 145-150.
SCOTT, G. I. (1951) Diabetic retinopathy. Proc. R. Soc. Med. 44: 743-747.
SLOVIC, P., FISCHHOFF, B. and LICHTENSTEIN, S. (1981) Perceived risk: psychological factors and social implications. Proc. R. Soc. Lond. (A) 376: 17-34.
SOMMER, A., POLLACK, I. and MAUMENEE, A. E. (1979) Optic disc parameters and onset of glaucomatous field loss, I Methods and progressive changes in disc morphology. Archs Ophthal. 97: 1444-1448.
SOMMER, A., QUIGLEY, H. A., ROBIN, A. L., MILLER, N. R., KATZ, J. and ARKELL, S. (1984) Evaluation of nerve fibre layer assessment. Archs Ophthal. 102: 1766-1771.
SPAETH, G. L. (1977) The Pathogenesis of Nerve Damage in Glaucoma. Grune & Stratton, New York.
SPARROW, J. M., BRON, A. J., BROWN, N. A. P., AYLIFFE, W. and HILL, A. R. (1986) The Oxford clinical cataract classification and grading system. Int. Ophthalmol. (in press).
STEVENS, S. S. (1946) On the theory of scales of measurement. Science 103: 677-680.
STEVENS, S. S. (1959) Measurement, psychophysics and utility. In: Measurement: Definitions and Theories (C. W. Churchman and P. Ratoosh, eds) pp. 18-64. Wiley, New York.
TVERSKY, A. and KAHNEMAN, D. (1971) The belief in the "law of small numbers". Psychol. Bull. 76: 105-110.
TVERSKY, A. and KAHNEMAN, D. (1973) Availability: A heuristic for judging frequency and probability. Cognitive Psychol. 5: 207-232.
WEISS, S. M., KULIKOWSKI, C. S., AMAREL, S. and SAHR, A. (1978) A model-based method for computer medical decision making. Artificial Intelligence 11: 145-172.
WULFF, H. R. (1981) Rational Diagnosis and Treatment: an Introduction to Clinical Decision Making (2nd Edn) Blackwell Scientific, Oxford.
APPENDIX
A conditional probability is the likely occurrence of an event which is dependent upon a defined set of conditions. In the language of hypothesis testing, a conditional probability is known as a prior probability when referring to a state of knowledge before the collection of additional information, and as a posterior probability once the former state of knowledge has been modified by new information. In sequential problem solving, a posterior probability becomes the new prior probability awaiting transformation by the acquisition of further modifying evidence into a new posterior probability.
The likelihood of a hypothesis (disease) is the probability, given that hypothesis (disease), of the actual result of the experiment (e.g. symptoms or clinical tests).
The subjective base-rate likelihood is the subjectively estimated probability assigned to an hypothesized event (or disease). It is simply a number between 0 and 1 which represents the extent to which a person believes the hypothesis to be true.
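The sequential use of prior and posterior probabilities described in this appendix can be sketched in Python. This is a minimal illustration, assuming a two-state hypothesis (disease present or absent) and test results that are conditionally independent given the disease state; the function names and all numerical values (a base rate of 0.02 and the two pairs of likelihoods) are hypothetical and chosen only to show the mechanics.

```python
# Illustrative sketch: function names and all numbers are hypothetical.

def posterior_probability(prior, p_result_given_disease, p_result_given_no_disease):
    """Bayes' theorem for a two-state hypothesis (disease / no disease)."""
    numerator = prior * p_result_given_disease
    return numerator / (numerator + (1.0 - prior) * p_result_given_no_disease)


def sequential_update(prior, likelihoods):
    """Apply each test result in turn; each posterior becomes the next prior."""
    probability = prior
    for given_disease, given_no_disease in likelihoods:
        probability = posterior_probability(
            probability, given_disease, given_no_disease
        )
    return probability


# Subjective base rate, then the likelihoods of two observed (positive) results:
# each pair is (P(result | disease), P(result | no disease)).
base_rate = 0.02
results = [(0.90, 0.10), (0.80, 0.20)]

after_first = sequential_update(base_rate, results[:1])
after_both = sequential_update(base_rate, results)
print(f"after first test: {after_first:.3f}, after both tests: {after_both:.3f}")
```

Under the independence assumption the order in which the results are applied does not affect the final posterior; where test results are correlated, this simple chaining would overstate the evidence.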