MINI-SYMPOSIUM: STAGING AND GRADING IN HISTOPATHOLOGY
Diagnostic categorization in EQA schemes
diagnoses and often there will be almost as many diagnoses as participants. Many of the differences in those diagnoses will be differences in terminology. The first step for a scheme organizer, having collected all of the participants’ diagnoses, is to group them into broad diagnostic categories to facilitate the choice by the participants of the correct diagnosis. The organizer has, at this stage, to decide which differences are merely terminological and which are of diagnostic importance. A classification error at this stage, such as grouping together diagnoses which later turn out to be significantly different, can cause massive duplication of work, because all participants’ responses have to be rescored. A computer program written by Professor Peter Furness2 facilitates this compilation process and allows a total of 10 diagnostic categories for each case. It gives a weighting to each of those 10 categories depending upon how many participants made the diagnosis, and how confident of it they were (Table 1). The scheme organizer has to exercise some other judgements about the case. If two diagnoses for one case have been offered by many participants, how should the case be categorized? For example, an appendix might show inflammation and worms. Does the organizer group everyone by the mention of inflammation? Or should there be two groups: ‘inflammation and worms’ versus ‘inflammation and no worms mentioned’? Some participants may have considered the worms, or spirochaetes or whatever, as being not worth mentioning. With two potential diagnoses in a case there are four possible diagnostic groups. With three potential diagnoses (faecolith, appendicitis, worms) the number doubles to eight, and it continues to grow exponentially with each further finding. Many participants will (not unreasonably) describe everything that they can see on a slide.
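The growth in possible groups for the appendix example can be checked with a short enumeration. This is only an illustrative sketch: the function name `possible_groups` is hypothetical, and it assumes that each finding is simply either mentioned or not mentioned by a participant.

```python
from itertools import combinations


def possible_groups(findings):
    """List every combination of findings a participant might mention.

    Assumption: each finding is either mentioned or not, so n findings
    give 2**n combinations, including the empty one (nothing mentioned).
    """
    groups = []
    for size in range(len(findings) + 1):
        for combo in combinations(findings, size):
            groups.append(combo)
    return groups


two = possible_groups(["inflammation", "worms"])
three = possible_groups(["faecolith", "appendicitis", "worms"])
print(len(two))    # 4 possible groups
print(len(three))  # 8 possible groups
```

With ten independent findings the same enumeration would give 1024 combinations, far beyond the 10 categories the compilation program allows, which is why the organizer has to decide which findings matter.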
Nicholas P Mapstone
Abstract The different ways in which diagnoses may be categorized in EQA schemes are considered. They may be grouped in very large diagnostic categories, or in narrow specific diagnoses. They may be categorized as dangerous or non-dangerous diagnoses. They may be categorized by the participants or a panel of experts. They may be categorized according to organ system, but rarely by underlying pathological process. The ways in which these categories are reached are considered, and the effect this has on the educational role of EQA schemes. This categorization will also influence the participant’s strategy to ensure that they are not categorized as a poor performer.
Keywords diagnostic errors/classification; healthcare; pathology/standards; quality assurance; quality control
Introduction External Quality Assurance (EQA) is now widely accepted by histopathologists in the UK.1 They use it to demonstrate that their diagnostic powers are in line with the majority of their colleagues. It is likely that pathologists will use EQA as one way to fulfil the requirements of revalidation. Much of the process of EQA is in the categorization of cases. This short article will consider that categorization process, and how it affects EQA scoring.
In this example, fewer than half of participants made the most popular diagnosis of sessile serrated polyp or sessile serrated lesion (diagnosis 1). There is no consensus as to the main diagnosis. A few participants wrote both hyperplastic polyp and sessile serrated polyp as their main diagnosis (diagnosis 4). Only one participant suggested serrated carcinoma as a possible diagnosis, and they only scored it as 1/10, giving 9/10 to some other diagnosis.
Determination of the ‘correct’ diagnosis All cases which are used in EQA will have a ‘correct diagnosis’ against which participants’ diagnoses are compared. Most cases will be submitted by the original reporting pathologist with their original diagnosis. That is one person’s opinion and may not be correct. So the ‘correct diagnosis’ will be determined either by an expert panel or by the most popular diagnosis amongst all participants.
Popular diagnosis schemes Most EQA schemes have more than 50 participants. Obviously the more participants in a scheme, the more possibility for variety in diagnoses. Occasionally there will be only one diagnosis. Some would consider such a case to be too ‘easy’, but most schemes will have one or two cases per circulation with such excellent consensus. It is much more common to get multiple
Nicholas P Mapstone FRCPath is a Consultant in the Department of Histopathology at Royal Lancaster Infirmary, Lancaster, UK. Conflicts of interest: the author is chair of the ‘NQAAP for Histopathology incorporating the Steering Committee for Interpretive EQA’, and the organiser of a gastrointestinal pathology EQA scheme, and the Bowel Cancer Screening Program EQA scheme.
DIAGNOSTIC HISTOPATHOLOGY 17:6
Circulation: R    Case number: 804    Number of responses: 214    Date: 3 Nov 10

Diagnostic categories                              Score
1  Sessile serrated polyp/lesion                   4.41
2  Serrated adenoma                                1.70
3  Sessile serrated adenoma                        2.10
4  Hyperplastic polyp/sessile serrated polyp       0.21
5  Hyperplastic polyp                              1.57
6  Normal                                          0.01
7  Serrated carcinoma                              0.00

Highest scoring diagnosis was 1 with 4.41.
Asterisks (if any) indicate dangerous diagnoses.
Table 1
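The seven scores in Table 1 sum to 10. One reading consistent with that, offered here as an assumption and not as a description of how the RESPONSE program actually works, is that each category score is the mean number of confidence points (out of 10) that participants allocated to that category. The sketch below illustrates that reading; the function name `category_scores` and the four example responses are hypothetical.

```python
def category_scores(responses, n_categories):
    """Average the confidence points given to each diagnostic category.

    responses: one dict per participant, mapping category number (1-based)
               to the points allocated; each participant's points sum to 10.
    Assumption: a category's score is its mean allocation over all
    participants, so the scores for a case also sum to 10.
    """
    totals = [0.0] * n_categories
    for allocation in responses:
        for category, points in allocation.items():
            totals[category - 1] += points
    return [round(total / len(responses), 2) for total in totals]


# Hypothetical responses from four participants for a seven-category case.
responses = [
    {1: 10},        # fully committed to diagnosis 1
    {1: 9, 7: 1},   # 9/10 to diagnosis 1, 1/10 to diagnosis 7
    {5: 10},        # fully committed to diagnosis 5
    {1: 5, 5: 5},   # split evenly between diagnoses 1 and 5
]
print(category_scores(responses, 7))
# [6.0, 0.0, 0.0, 0.0, 3.75, 0.0, 0.25]
```

Under this reading, a single participant giving 1/10 to a diagnosis among 214 responses would contribute about 0.005 to that category, which is in keeping with the very low score for serrated carcinoma.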
© 2011 Published by Elsevier Ltd.
The scheme organizer has to decide which diagnostic categories need to be considered by all the participants, and which can be safely ignored to simplify categorization. What is the diagnostic crux of the case? This difficulty means that it is rare to require grading or staging diagnoses in EQA schemes. A case of gastrointestinal stromal tumour may result in a range of differential diagnoses, and prognosis for the GIST may complicate the diagnostic categories beyond what is practical. Once the draft diagnostic categories have been compiled by the organizer, they are presented to a group of the participants, either at a meeting or in a discussion facilitated by, for example, email. Participants are presented with the (maximum of 10) diagnostic categories for each case and asked to decide which categories should be lumped together. There is no facility for splitting at this stage, unless the scheme organizer goes back to all the responses and reclassifies them. The participant in the back of the room who sees their diagnosis alone in a separate category now has a definite interest in getting it included with the most popular diagnosis, because their score will depend upon how many of their colleagues made the same diagnosis. This is where some of the most heated and prolonged discussions can occur around EQA. If the most popular diagnosis is ‘collagenous colitis’, then participants who have diagnosed microscopic colitis will argue that their less specific diagnosis should be included with the main diagnostic group, which will perhaps change its name to ‘microscopic colitis, incorporating collagenous colitis’. The participant who has diagnosed ‘inflammation’ or ‘colitis’ might make the same argument, with less likelihood of success (Table 2). Following this diagnostic lumping process, if fewer than 80% of participants have been grouped together in the main diagnostic category, the case is usually excluded as not having sufficient consensus.
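The lumping step and the consensus check can be sketched using the scores from Table 1. This is a minimal illustration only: the function names `lump` and `has_consensus` are hypothetical, and treating a category score of 8.0 out of 10 as equivalent to ‘80% of participants’ is an assumption, since the score also reflects how participants split their confidence.

```python
def lump(scores, groups):
    """Fold diagnostic categories together after the participants' discussion.

    scores: {category number: score out of 10}
    groups: {category to keep: [categories folded into it, itself included]}
    """
    absorbed = {c for members in groups.values() for c in members}
    merged = {c: s for c, s in scores.items() if c not in absorbed}
    for kept, members in groups.items():
        merged[kept] = round(sum(scores[c] for c in members), 2)
    return dict(sorted(merged.items()))


def has_consensus(scores, threshold=8.0):
    """Assumed rule: the top category must reach 8.0 of the 10 points."""
    return max(scores.values()) >= threshold


# Category scores as listed in Table 1.
table_1 = {1: 4.41, 2: 1.70, 3: 2.10, 4: 0.21, 5: 1.57, 6: 0.01, 7: 0.00}

# Combine the 'serrated' diagnoses (except carcinoma) into category 1.
after_meeting = lump(table_1, {1: [1, 2, 3, 4]})
print(after_meeting)                  # {1: 8.42, 5: 1.57, 6: 0.01, 7: 0.0}
print(has_consensus(table_1))         # False
print(has_consensus(after_meeting))   # True
```

The merged score of 8.42 matches the figure reported for the combined category in Table 2, and illustrates how a case with no initial consensus can pass the threshold once terminological variants are grouped.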
categorization will be much reduced. Participants will usually receive a score of one if they get the case ‘correct’ and none if ‘incorrect’. Whilst there are arguments for both ways of identifying the ‘correct’ diagnosis, it is rare that participants as a group will come to an obviously incorrect consensus diagnosis. It is much more common for such a case to result in a wide spread of diagnoses, and thus to be excluded from scoring due to the absence of consensus.
Dangerous diagnoses Sometimes participants will lose marks for diagnoses whose difference from the most popular diagnosis is minimal or even terminological. Conversely, occasional incorrect diagnoses are so drastically incorrect that it seems unfair that the penalty is the same as for a terminological error. There are advocates for the separate category of ‘dangerous diagnosis’. This category has fallen into disfavour at least in part because of the difficulty of defining a ‘dangerous diagnosis’. Overdiagnosing carcinoma would likely be a dangerous diagnosis. Missing amoebae in colitis might also be considered dangerous. Missing giardiasis might have major consequences for a patient, but would probably not be called ‘dangerous’.
Diagnostic granularity Most schemes require participants to produce a classical histological diagnosis for each case. The participants, or a panel, then combine those diagnoses into diagnostic categories as we have seen. The breast3 and bowel cancer screening programme EQA schemes, however, ask participants to classify cases into broad diagnostic categories, e.g. non-neoplastic, low-grade dysplasia, high-grade dysplasia, malignant. This has many advantages, which are mainly administrative. It makes it much easier to compile large numbers of responses from hundreds of participants. It makes the process of identifying the correct diagnosis, usually by a panel of experts, much easier. It also allows a participant a semiquantitative way of seeing how far away they might be from the ‘correct’ diagnosis. This difference, between the non-granular diagnoses of the screening EQA schemes and the much more granular diagnoses in the other schemes, reflects a major philosophical difference in the rationale of EQA schemes. It is often said that the main role of EQA is educational, although many consider its quality assurance role as predominant. The educational function is best served by having classical, specific diagnoses. This allows diagnostic variants to be included and assessed. Missing a rare diagnosis in an EQA scheme means it is less likely to be missed in routine diagnosis. The assessment function is best served by having broad diagnostic categories. The spread of diagnoses will necessarily be smaller, but when aberrant diagnoses occur they are more obvious and objective. These schemes still have an educational role, but it is less pronounced. In the bowel cancer screening programme scheme, participants have a single box to tick if they think a specimen is not dysplastic; they do not have to identify Peutz-Jeghers, hyperplastic or juvenile polyps. From the bowel cancer screening programme’s point of view that is not a crucial distinction.
Expert panel schemes The process of identifying the most popular diagnosis and thrashing it out at a participants’ meeting is cumbersome. Identifying a panel of one or more experts who make the definitive diagnosis is much simpler but perhaps less popular with the body of participants. Problems of diagnostic
In the example seen in Table 1, following the participants’ meeting, those diagnoses which used the term serrated (except for carcinoma) have been combined. There is now consensus as to the main diagnosis.
Circulation: R    Case number: 804    Number of responses: 214    Date: 9 Nov 10

Diagnostic categories                              Score
1  Sessile serrated lesion                         8.42
5  Hyperplastic polyp                              1.57
6  Normal                                          0.01
7  Serrated carcinoma                              0.00

Highest scoring diagnosis was 1 with 8.42.
Asterisks (if any) indicate dangerous diagnoses.
Table 2
Identifying poor performance
Categorization by organ system is fairly straightforward. Categorization by diagnosis is less easy. Lymphomas are common in many organ systems, and correspondingly common in specialist EQAs. Many organ specialists will say that they do not report lymphomas and request exclusion on that basis. Of course that begs the question of the original diagnosis. This is a disincentive to including lymphomas in specialist EQA schemes. Some specialists will join regional general schemes and ask for exclusion from the organ systems they do not report. This may result in them producing responses from only a very small proportion of the cases in a circulation. This is potentially dangerous. A participant reporting only two cases in a circulation only has to make a mistake in one for their score to drop to 50%. They would almost certainly end up in the bottom 2.5% of participants.
Participants’ scores are ranked from highest to lowest. The group in the bottom 2.5th centile are identified as poor performers. A single poor performance is not a matter for investigation. Schemes differ in their ways of defining what constitutes ‘persistent poor performance’ but a common way of doing so is to identify those who perform poorly in two out of three consecutive circulations of the scheme. This usually triggers the ‘first action point’ which entails a letter from the scheme organizer and the offer of assistance. At this point the slate is (almost) wiped clean: participants have to show substandard performance in another two of three circulations in order to pass the second action point. The only difference from the first three circulations is that, after a first action point, non-participation in a round is taken to be an episode of ‘poor performance’. At this point the ‘granularity’ of scoring can become important. If there are 10 cases, for which there is only a right or wrong score, with a maximum of 10 and other scores of 9, 8, 7 etc., there is the potential to have no poor performers at all. Thus if there are 100 participants, of whom 70 score 10, 20 score 9 and 10 score 8, then the 10 lowest scorers will not be in the bottom 2.5th centile, because they are tied and together make up 10% of participants. If one participant then scores 7, there will be only one in the bottom 2.5th centile. Conversely, with the computer software used by many of the schemes, participants are very unlikely to have whole-number scores and thus unlikely to have the same score as any of their colleagues. Consequently, although the spread of scores may be the same, from a maximum of 10 down to, say, 7.16, the software will be able to identify two or three low scoring participants. Of course it could be argued that this is artificially identifying a difference where none exists. Conversely it could be said that a scheme which never identifies anybody as a poor performer is undiscriminating.
If the poor performances noted by the former system are truly not significant, then they are not likely to be persistent and a second episode of poor performance is unlikely to be incurred.
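The effect of scoring granularity on the bottom 2.5th centile can be illustrated with a short sketch. The tie rule used here is an assumption made for illustration (a tied group is flagged only if the whole group, together with everyone below it, fits within 2.5% of participants); individual schemes may handle ties differently. The function name `poor_performers` and the fractional scores are hypothetical.

```python
from collections import Counter


def poor_performers(scores, fraction=0.025):
    """Return the positions of participants in the bottom `fraction`.

    Assumed tie rule: walk up from the lowest score and flag a tied group
    only while the running count stays within the cut-off.
    """
    cutoff = fraction * len(scores)
    counts = Counter(scores)
    flagged_scores = set()
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running > cutoff:
            break
        flagged_scores.add(value)
    return [i for i, s in enumerate(scores) if s in flagged_scores]


# Whole-number scoring: 70 score 10, 20 score 9 and 10 score 8.
whole = [10] * 70 + [9] * 20 + [8] * 10
print(poor_performers(whole))        # [] : the ten tied at 8 exceed 2.5%

# One of the lowest scorers drops to 7.
with_seven = [10] * 70 + [9] * 20 + [8] * 9 + [7]
print(poor_performers(with_seven))   # [99]

# Hypothetical fractional scores spread evenly from 7.16 up to 10.
fractional = [7.16 + 2.84 * i / 99 for i in range(100)]
print(poor_performers(fractional))   # [0, 1]
```

With 100 participants the cut-off is 2.5 people, so untied fractional scores flag the two lowest scorers, whereas tied whole-number scores may flag nobody.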
The scheme organizer’s strategy The scheme organizer wants to have cases which are not too ‘easy’ and not too ‘hard’. Thus they don’t want too many cases where every participant gets exactly the same correct answer (although there is no problem with having one or two cases like that in each circulation). There would be less benefit in a scheme where everybody scored 100% all the time. Conversely there is no advantage in having cases so difficult that there is no consensus as to the correct diagnosis, only a small proportion of participants get it ‘right’ and in which there are multiple different diagnoses proffered by participants. In that example enormous amounts of work are done by participants to little discriminant effect, although it may be educational. In fact a circulation with many such cases might be the most educational of all schemes, although scheme organizers are unlikely to be thanked for this by their scheme members. Probably the ideal case for scheme organizers is one in which 95% of participants get the popular diagnosis, and another 5% of participants produce different diagnoses, with the difference lying on a benign-malignant spectrum rather than being based purely on terminology. The educational value of such a case is probably greatest for those 5% of participants who have the less popular diagnoses, but also for those participants who have had to reach the correct diagnosis by using reference books before submitting their responses.
General and specialist schemes EQA schemes are split, fairly evenly, between specialist schemes based on organ systems and general schemes, covering all organs, and based in one geographic area. As pathologists become more specialized, the specialist schemes are expanding and the general schemes remaining stable. There is a tendency for cancer peer review to require pathologists to join multiple specialist schemes rather than one general scheme. This is often driven by local cancer managers, even when the guidelines for that cancer type do not require membership of a specialist EQA scheme. A general scheme that efficiently categorizes its cases according to organ system is one way to counteract this tendency. General schemes should explicitly cover specific organs, and allow participants who do not report that organ to be excluded from scoring for such cases. A pathologist should be able to show that their EQA covers the material that will be encountered in local MDTs, and avoid the need to join multiple specialist EQA schemes. Many of the cases in general schemes are ‘harder’ than those seen in specialist schemes.
The participant’s strategy Plainly the participant’s strategy is to avoid falling within the bottom 2.5th centile in the scores. Above that there is no advantage to a participant in scoring 100% or 10%. Many schemes allow participants to split diagnoses, so they can allocate 9/10 of their ‘diagnostic commitment’ to diagnosis A and 1/10 to diagnosis B. The lowest scoring participants have usually scored low in four out of 10 cases in a circulation. Having 1/10 gone towards the correct diagnosis in one case may make little difference in that situation, but may put someone just above a colleague in a close ranking situation. There is often an advantage to a participant in being as ‘non-specific’ as possible, certainly if attending the participants’ meeting. A more general, overarching diagnosis is more likely
REFERENCES
1 Macartney JC. An update on EQA in histopathology and cytology. RCPath Bulletin 2003; 124: 9-12.
2 Furness PN. ‘The RESPONSE EQA Analysis Program’. http://www.le.ac.uk/users/pnf1/eqa/paeresp.html (accessed 4/01/2011).
3 NHS Breast Pathology EQA Scheme. http://www.icr.ac.uk/research/research_sections/epidemiology/epidemiology_teams/cancer_screening_evaluation_unit/nhs_screening_programme/5965.shtml (accessed 4/01/2011).
to be within the popular group than a more specific diagnosis. For example, a diagnosis of ‘carcinoma’ will often be included in the most popular group, but adding more specific features, such as ‘showing neuroendocrine differentiation’ or ‘adenosquamous carcinoma’, may put the diagnosis at risk of being excluded from the popular group. Getting the balance right between specificity and broad diagnosis is the basis of successful EQA submission, but then it is obviously also important in ‘real life’. In most EQA schemes the attempt by a participant to make a brilliant obscure diagnosis will usually cause a low score. Cases will usually reflect routine working practice and not require a detailed literature search to be correctly diagnosed.
Conclusion
Categorization is at the heart of assessing EQA responses, and the participant who understands how their diagnoses are likely to be categorized is less likely to be themselves categorized as a poor performer.

Practice points
• External quality assurance schemes are intended to monitor the quality of reporting by histopathologists in their everyday work, and the design of these schemes should reflect that
• Scoring the results in EQA schemes is difficult if the material cannot be categorized into a small number of discrete groups
• A balance has to be struck between broad and narrow diagnostic groupings when scoring EQA schemes