Current Obstetrics & Gynaecology (1997) 7, 145-148 © 1997 Pearson Professional Ltd

Mini-symposium: Teaching

'De facto' assessment: issues of validity and reliability

H. Mulholland

Dr Helen Mulholland, 36b Pettycur Bay, Burntisland, Fife KY3 9SB, UK

The need for large numbers of observations and many hours of testing time makes it impossible to create a truly reliable examination of clinical competence. This paper argues that the traditional methods of assessment used by supervisors in 'apprenticeship' situations can be made reliable by the use of a number of observers making continuous observations during normal service contacts. The validity of the scale introduced in the college's log books is established by long usage in the real world. By providing a structure for trainees to keep track of these 'de facto' assessments, the new system can ensure that all practitioners reach the required standards of competence.

The structured training system of the Royal College of Obstetricians and Gynaecologists (RCOG) was designed to embody the strengths of the traditional apprenticeship system and to overcome the weaknesses that have been aggravated by recent changes in models of staffing the National Health Service. The reduction in junior doctors' hours, the break-up of teams and the transition to a consultant-led service have combined to create a situation in which familiarity between individual consultants and juniors is less certain to develop, and in which trainees have fewer opportunities to observe consultants' clinical practice, or to seek advice from any one consultant about their own dilemmas. This creates problems for all aspects of education, but has particular implications for assessment. In the undergraduate curriculum, the introduction of the Objective Structured Clinical Examination (OSCE) and of skills training laboratories in many medical schools has broadened the range of types of assessment, so that long and short cases are no longer the only methods of assessing 'competence'. The use of checklists in OSCEs and in skills laboratories has introduced an appearance of objectivity. Has it improved the quality of assessment?

The combined qualities for 'good' assessment are validity, reliability and fidelity. Validity ensures that the assessor is actually giving marks for the performance of actions that are 'correct' and not for those that are 'incorrect' or irrelevant. By the creation of agreed checklists, a number of experts devise a protocol that they consider valid. Experience in the DRCOG OSCE and other examinations suggests that such agreement is difficult and time-consuming to reach, even within a small group, and that the result is not always agreeable to the other examiners required to use the resulting checklist. For this reason there has been a move away from detailed checklists to global ratings, in which the examiner is required to make judgements based on agreed criteria. Reliability ensures that candidates faced with an individual examiner, under specific circumstances, obtain the same mark as they would have done if faced with any other examiner under any other set of circumstances on the same question. If all the questions in the test are supposed to be testing the same characteristic, for example 'clinical competence', it is to be expected that scores on all items should agree. Unfortunately (at the time of writing), there is no test of competence that reaches an acceptable level of reliability. To create a reliable test, a large number of items and a large amount of examinee and examiner time is required.
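
For readers who want a concrete sense of what such a reliability figure means, the sketch below computes Cronbach's alpha, the internal-consistency coefficient usually quoted for written tests, from a small candidates-by-items score matrix. The scores and the function name are illustrative assumptions, not data from the paper; the point is only that the coefficient rises with the number of items and with how consistently they rank candidates, which is why a conventional target such as 0.90 demands so much examinee and examiner time.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a candidates-by-items score matrix."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of candidates' total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical scores for 6 candidates (rows) on 4 stations (columns).
osce_scores = np.array([
    [7, 6, 8, 7],
    [5, 6, 5, 6],
    [9, 8, 9, 8],
    [4, 5, 4, 5],
    [6, 7, 6, 6],
    [8, 7, 8, 9],
])
print(f"alpha = {cronbach_alpha(osce_scores):.2f}")
```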

MCQ tests, which generally test low-level knowledge acquisition, do reach the required 0.90 level of reliability in college exams, and may do so in some undergraduate courses. Even the most carefully constructed OSCE fails to reach this level. This apparent unreliability may, however, be spurious. The exam is generally designed to separate the pass candidates from those who fail. Individual scores may not need to be accurate to exact percentages, or fractions of percentages, if the pass/fail decision is based on a logical approach to scores. It is not the individual score that needs to be exactly replicable but the pass/fail decision.

Fidelity, the third characteristic, is one that was, for many years, ignored in medical examinations but is now seen to be crucial to the predictive validity of our examinations. This characteristic is closely related to the distinction between competence and performance. Students in OSCEs may show that they can correctly carry out the actions required (competence), but when faced with the situation in the real world (performance) may fail to do so. Fidelity of test conditions is the extent to which they reflect the real-world circumstances in which the actions must be taken. There is a spectrum from low-fidelity pencil-and-paper tests, in which the candidate simply says what would be done, to simulated situations, in which the candidate shows what would be done by using a model or simulated patient. The only truly high-fidelity situations involve the use of real patients in real wards and clinics. Even the introduction of an external assessor to the real-world situation may change the candidate's behaviour and reduce fidelity.

The challenge for the test developer is to use high-fidelity situations to arrive at valid and reliable pass/fail decisions. There are two main problems to be overcome. Because each candidate works in different conditions and sees different patients, scores cannot be compared between candidates. There may be a suggestion that the cases seen by one candidate are 'harder' or 'easier' than those seen by other candidates. This is an issue of validity. Because each candidate is observed by different assessors, who may have different views of how cases should be approached, and on different cases, there can be no single detailed checklist. It is, therefore, impossible to prove statistically that scores are reliable.

ESTABLISHED PRACTICE

Apart from the clinical components of college exams, there has been no formal assessment of clinical competence in postgraduate medical education in the UK. Within the established system there were two methods of assessment that were widely used. When a trainee applied for a new post, the consultant could use the reference to present his assessment of the trainee. While the trainee was in post, the consultant could organize

theatre and clinic lists to match the consultant's assessment of the trainee's abilities. It was not, however, generally acknowledged that in carrying out these functions the consultant was, in fact, assessing the trainee. Where the element of assessment was recognized, it was often denigrated as being 'subjective'. The fidelity and validity of the assessment were unquestionable, though its reliability could never be statistically established. Research was required to investigate the reliability of such 'subjective' judgements.

ANNUAL SUMMATIVE ASSESSMENT

With the implementation of the Calman Report, the assessment by reference is replaced by the annual summative assessment of progress, to which a number of consultants and other members of the team may contribute. Collection of the reports that will form the basis of this assessment will be the responsibility of the educational supervisor: a consultant who has day-to-day contact with the trainee. The reports will be collated and discussed with the trainee, who will be able to appeal to the Specialty Training Committee against any perceived inaccuracies. It is expected that every consultant will act as supervisor to one or two trainees.

DAY-TO-DAY CONTINUOUS ASSESSMENT

Consultants will still be required to organize service commitments to match their assessment of the trainee's abilities. They will be assisted in this process by the trainee's logbook, which will show the level of supervision required in each aspect of service provision. The 'scale' used for this assessment reflects the decisions that consultants have always made about trainees:

1. The trainee has observed.
2. The trainee has assisted.
3. The trainee has provided care under constant direct supervision.
4. The trainee is able to provide care under indirect supervision, with a senior colleague within call.
5. The trainee is able to act without supervision (independent competence).

The change from the use of the verb 'has' to 'is able to' at level 4 reflects the implications for patient care that become significant at this level. At levels 1, 2 and 3 the trainee is acting under constant supervision and can be helped, advised or replaced if the senior colleague decides it is necessary. At these levels the trainee simply records, by ticking or dating the logbook, that the experience has occurred. For a task such as history taking, most units have already developed systems for monitoring the performance of new senior house officers (SHOs),

from sitting in with new SHOs or checking over every set of case notes, to checking only a sample of cases. The SHO quickly passes through levels 1 to 3 and is functioning at level 4 within a few days of joining the unit. For some practical procedures a much longer period will be required to reach level 3. Some rarely occurring or emergency aspects of service provision may never be experienced in the unit, and it is these 'targets' that should form the focus for formal teaching sessions, independent learning and short courses.
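
Purely as an illustration, the scale above can be read as a small data model: each logbook 'target' carries the highest level so far recorded, levels 1 to 3 are simply ticked or dated by the trainee, and the step to levels 4 and 5 requires an assessor's signature. The sketch below uses hypothetical names throughout and is not part of the RCOG logbook itself.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum
from typing import List, Optional, Tuple


class SupervisionLevel(IntEnum):
    """The five-point supervision scale."""
    OBSERVED = 1              # the trainee has observed
    ASSISTED = 2              # the trainee has assisted
    DIRECT_SUPERVISION = 3    # care under constant direct supervision
    INDIRECT_SUPERVISION = 4  # able to provide care with a senior colleague within call
    INDEPENDENT = 5           # able to act without supervision


@dataclass
class LogbookTarget:
    """One aspect of service provision tracked for a trainee (hypothetical model)."""
    name: str
    level: SupervisionLevel = SupervisionLevel.OBSERVED
    entries: List[Tuple[date, SupervisionLevel, Optional[str]]] = field(default_factory=list)

    def record(self, new_level: SupervisionLevel, assessor: Optional[str] = None) -> None:
        # Levels 1-3 are simply ticked or dated by the trainee; levels 4 and 5
        # are the 'de facto' assessments and need a named assessor's signature.
        if new_level >= SupervisionLevel.INDIRECT_SUPERVISION and assessor is None:
            raise ValueError("levels 4 and 5 require an assessor's sign-off")
        self.entries.append((date.today(), new_level, assessor))
        self.level = max(self.level, new_level)


# Example: a new SHO progressing on a single target.
history_taking = LogbookTarget("history taking")
history_taking.record(SupervisionLevel.ASSISTED)
history_taking.record(SupervisionLevel.INDIRECT_SUPERVISION, assessor="Consultant A")
```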

ASSESSMENT FOR LEVEL 4: 'I'LL LEAVE YOU TO IT'

The decision to leave a trainee unsupervised has significant implications for patient care and must be based on objective assessment. This does not mean that separate tests or opportunities for observation need to be set up, but the assessor must observe the trainee's performance and compare it with a standard. In the past, such standards have been set by norm, criterion or limen referencing.

In norm referencing the trainee is compared to a group of peers. 'Top 10%' and 'average' are not, however, appropriate ways of describing a performance in the context of clinical medicine; the top 10% may still be incompetent. The average may, or may not, be performing at an adequate standard. Norm referencing is, therefore, inappropriate for assessments of competence.

Criterion referencing may be appropriate to those aspects of performance for which clear protocols are available. The assessor then needs only to be satisfied that the trainee can and will apply the protocol correctly. There are some tasks, however, for which protocols cannot easily be created. It is now recognized that expert clinicians faced with the same cases do not always follow the same steps in the same order in the diagnostic process. Management decisions too are, to a greater or lesser extent, a matter of professional judgement.¹

Assessments have often been tarnished by the label 'subjective'. There has been an assumption that such judgements are, of necessity, less reliable and, therefore, less valid than 'objective' assessments based on numerical scores and checklists. Research now indicates that this is not necessarily so. Limen referencing is the term used to describe an assessment made by an expert observer comparing the performance of a trainee with an internal representation of competent performance.²,³ The technique has been widely applied in professional and vocational education and found to be reliable. In medical education, for example, it has been established that 'global rating' of performance by experts gives rise to scores that are just as reliable as those based on detailed 'objective' tests.⁴

The basis of assessment in structured training is, therefore, limen referencing.

The assessor compares the performance of the trainee with the assessor's previous observations and experience, and decides whether or not the trainee's standard is satisfactory. Because assessments are based on many cases observed over long periods of time, in high-fidelity conditions, and because a number of colleagues are asked to contribute to the final assessment, such assessments are likely to be both valid and reliable.
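
One way to make the final claim concrete is the Spearman-Brown prophecy formula from classical test theory: if a single observation of performance has reliability r, then an assessment averaged over k comparable observations has predicted reliability kr / (1 + (k - 1)r). The figures below are illustrative assumptions, not data from the paper; they simply show how repeated judgements made during routine service contacts can push a modestly reliable single rating towards the conventional 0.90 threshold.

```python
def spearman_brown(single_reliability: float, k: int) -> float:
    """Predicted reliability of an assessment averaged over k comparable observations."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)

# A single global rating of assumed reliability 0.45, repeated over many
# routine service contacts (values chosen purely for illustration).
for k in (1, 3, 6, 12, 24):
    print(f"{k:>2} observations -> reliability {spearman_brown(0.45, k):.2f}")
```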

ASSESSMENT AT LEVEL 5

Having attained level 4, the trainee proceeds to acquire experience and confidence in circumstances that allow senior colleagues to check up on performance and allow the trainee to ask for help and advice when required. When both assessor and trainee are happy that level 5 has, in fact, been reached, that the supervisor does not need to interfere and the trainee does not need to ask for help, a signature is entered in the logbook. It is anticipated that the period between reaching level 4 and attaining level 5 will be a considerable one. For most targets the consultant will continue to keep a watchful eye on the trainee's service provision throughout the greater part of the training period.

DE FACTO ASSESSMENT

From this argument it can be seen that the assessment at level 4 is the one that is critical and requires direct observation by the assessor. It is anticipated that this will happen during the normal course of service provision. The responsibility for asking for assessment will be the trainee's. The system already exists. Trainees are left to handle cases with only indirect supervision. Structured training simply provides a system for recording progress on specific targets. The two log books will take at least 6 years to complete. There is, therefore, no need to rush through targets. The speed of progress through the targets should reflect the natural speed of acquisition of clinical competence. This is 'de facto' assessment.

IMPLICATIONS FOR THE UNDERGRADUATE CURRICULUM

At undergraduate level only the 'observe' and 'assist' steps of the scale of competence can be achieved in high-fidelity situations for most practical procedures. Such procedures can be tested on models and in simulations in order to ensure that any relevant protocols have been established. For history taking and other aspects of communication it may, however, be possible to reach level 4: indirect supervision.

It has already been established that learning is more effective under high-fidelity conditions when the student is treated as a working member of the team.⁵ The conclusion of this paper is that valid assessments of clinical competence are also best made under these conditions.

ADDITIONAL READING

A practical handbook of assessment methods is The Good Assessment Guide, edited by B. Jolly and J. Grant, published by the Joint Centre for Education in Medicine, ISBN 1 873207 76 X.

REFERENCES

1. Jennett PA, Tambay JM, Atkinson MA et al. Chart stimulated recall: a method for assessing factors which influence physicians' practice performance. In: Harden RM, Hart IR, Mulholland H (eds.) Approaches to the assessment of clinical competence. Dundee: Centre for Medical Education, 1992: 511-517
2. Christie T, Forrest GM. Defining public examination standards. Schools Council Research Studies. London: Macmillan, 1981
3. French S, Slater JB, Vassiloglou M, Willmott AS. The role of descriptive and normative techniques in examination assessment. In: Black HD, Dockrell WB (eds.) New developments in educational assessment. Edinburgh: Scottish Academic Press, 1988: 15-32
4. Norman GR, Davis DA, Painvin A, Lindsay E, Rath D. Comprehensive assessment of clinical competence of family/general physicians using multiple measures. In: Bender W, Hiemstra RJ, Scherpbier AJA (eds.) Teaching and assessing clinical competence. Groningen: Boekwerk Publications, 1990: 357-363
5. Mountford B. Teaching and learning medicine: a study of teachers and learners in a young medical school. Unpublished PhD thesis, University of Southampton, 1991