Artificial Intelligence in Medicine 6 (1994) 161-173
Methodological issues in validating decision-support systems for insulin dosage adjustment

H.J. Leicester a, A.V. Roudsari a,*, E.D. Lehmann b, E.R. Carson a

a Department of Systems Science, Centre for Measurement and Information in Medicine, City University, London EC1V 0HB, UK
b Department of Endocrinology, Chemical Pathology and Medicine, U.M.D.S. (University of London), St. Thomas' Hospital, London SE1 7EH, UK

* Corresponding author.

(Received April 1992; revised September 1993)
Abstract

Safety and reliability of advice from new computer systems should be confirmed before embarking on prospective hospital trials. This process of preliminary testing is termed 'validation'. Though it forms a fundamental stage in system development, few standards exist for choosing and implementing tests. In the present paper, a validation methodology is developed in the domain of diabetes and intended for general use in chronic health management. It is based on a peer review protocol and incorporates empirical measures indicating: applicability of results to the real environment; variation among doctors; comparisons between doctors' and computer advice; and relative merits of different computer algorithms.

Key words: Validation; Decision-support system; Chronic health management; Diabetes; Evaluation
1. Introduction
Validation, a process for selecting computer decision-support systems for hospital trials, is an important stage in a system's development. It marks a first formal, independent and comprehensive test of performance, and should emphasise the medical accuracy and safety of the underlying algorithms.
Prospective hospital trials, themselves, are an inappropriate means for initial assessment. Trial conditions obscure details of performance [14,8], and place patients at risk before system capabilities are established. It is appropriate, therefore, to place considerable importance on the validation stage.

There are, however, no standards for the validation process. The deficiency is most noticeable in chronic health management, where there are few 'gold standards' for medical practice itself. Peer review techniques using recorded data are usually recommended. But, again, technical standards are rarely discussed. With such limitations, it is uncertain whether any current validation technique will produce consistent results which may be related directly to performance in the clinic.

Computer advisory systems in diabetes pose many of these validation problems. Management of insulin-dependent diabetes mellitus, therefore, provides a focus for investigating the issues. Diabetes is an abnormality of carbohydrate metabolism, requiring clinical advice to balance diet, exercise and insulin dosage. Because of its worldwide prevalence, and the success of specialised treatment clinics, diabetes became a focus for Medical Informatics research [5,7]. Several systems have been developed exploiting a range of computing techniques [1-4,8,9,13]. Despite this academic and medical interest, few systems have been tested across a comprehensive range of patient cases [10], and comparisons between systems, even over a small range, are rarely reported ([2,6,10] are notable exceptions).

Preliminary methodological studies are reported here to illustrate the problems. The peer review protocol, based on recorded data, is subsequently developed to address the difficulties. The resulting protocol relies on three features. Firstly, insulin dosage adjustment, which lies at the core of most systems, is a broadly quantitative discipline. Number, timing and dosage of injections may be measured; and the consequences may be monitored through effects on blood glucose levels. Such features offer a basis for statistical analyses. Secondly, diabetic patients show varying degrees of severity of disease and lifestyle. Together with other factors, these features may be used to categorise patients or events and to generate a graded scale of tests for any system. Finally, the methodology is enhanced by including doctors who are treating the test cases, as well as those who see only recorded data. This allows direct comparisons between practice in the clinic and performance under test conditions.
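Because dosage advice is quantitative, each advised injection can be captured as a small structured record, which is all the statistical analyses discussed below require. A minimal sketch in Python; the field names are our own illustrative assumptions, not part of any published protocol:

    from dataclasses import dataclass

    @dataclass
    class InjectionAdvice:
        # One advised injection for one patient on one day (illustrative fields).
        patient_id: str
        day: int            # day within the recorded period
        time: str           # e.g. "pre-breakfast", "pre-dinner"
        formulation: str    # e.g. "Actrapid" (short-acting) or "NPH"
        units: float        # advised dose, in units of insulin
        source: str         # "clinician-1", "computer-A", "diabetologist", ...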
2. Preliminary validation methodology
2.1. Preliminary methodology

Roudsari et al. have described early studies testing computer systems simultaneously [11,12]. Their studies were based on an extended peer review approach, summarised in Fig. 1.
Fig. 1. Schematic representation of the preliminary methodology, emphasising the controlling role of the Consensus Panel (Level 1: advice generation, in which clinicians and computer systems receive the clinical data; Level 2: peer review).
Initially a Consensus Panel of three consultants is established. Acting collectively, they select other participating staff and appropriate sets of patient data. The data is presented to three clinicians and to the computer systems, composing a first level of equal and independent clinical advisors. Provided with the recorded patient data, they give insulin dosage advice for the next day. The advice, and the original data, pass to the second level.
Fig. 2. Summarised dosage advice from preliminary methodology trial (n values indicate total number of injections advised for 12 patients over 5 days, for two common insulin formulations). (a) Distribution of advice for 1st clinician; (b) distribution of advice for 2nd clinician; (c) difference in advice between the 2 clinicians (calculated for each insulin formulation and each injection); (d) difference between one clinician and selected computer system (calculated as for Fig. 2(c)).
Here three further independent clinicians assess each set of advice for accuracy and safety, recording their views in tailored questionnaires. Finally the data, advice and questionnaires return to the Consensus Panel to resolve any conflicts of interpretation.

Data and information passing between levels are transcribed and standardised to conceal sources. Statistical tests and domain-specific rules are also incorporated between levels to ensure that anomalous data or information are recorded and filtered from the process. This ensures that time and expertise are concentrated on important features. In addition, statistical advice and recommendations from the Consensus Panel are used to confirm that a sufficient range and number of patients are considered.
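The domain-specific filtering rules can be very simple while still protecting reviewers' time. A minimal sketch, assuming blood glucose is recorded in mmol/l; the plausibility thresholds are illustrative choices of the kind a Consensus Panel would set, not values from the study:

    def filter_glucose_readings(readings, low=1.0, high=33.0):
        # Split blood glucose readings (mmol/l) into plausible and anomalous.
        # Anomalous values are recorded rather than silently discarded, so the
        # Consensus Panel can review them later (illustrative thresholds).
        plausible, anomalous = [], []
        for r in readings:
            (plausible if low <= r <= high else anomalous).append(r)
        return plausible, anomalous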
2.2. Results and limitations

Fig. 2 summarises advice from two clinicians and one computer system in a recent study. Participants received blood glucose data from 12 patients over five days. With additional details of patients' diet and general state of health, participating clinicians were instructed to advise daily adjustments to insulin dosages within patients' existing regimes. As a simplification, injections were limited to two common insulin formulations: NPH and Actrapid. Fig. 2(a) and (b) show the distribution of advice from each clinician; Fig. 2(c) compares clinicians; and Fig. 2(d) compares one clinician with the computer.

Illustrative peer review questionnaire responses from this preliminary study are presented in Table 1. There are clear distinctions between the two clinicians which are more pronounced for one of the insulin formulations. The questionnaire design, however, did not indicate whether the differences were clinically important. Indeed the differences may well have resulted from the fact that one of the clinicians had been treating the test cases in the clinic.

Table 1
Example questionnaire and results from preliminary methodology. Shows recorded assessments, by a clinician, of advice from a fellow clinician and a computer over 2 patient cases. Questions are answered by marking a value between 1, for a negative answer, and 10, for a positive answer. ('Clinician' denotes a clinician from Level 1.)

                                                             Patient 1             Patient 2
                                                          Clinician  Computer   Clinician  Computer
(1) Was there enough patient data, information
    available, in order for you to make an
    appropriate decision?                                     7         8           8         8
(2) Do you feel that the advice given will be
    effective in terms of blood glucose control?              7         2           8         6
(3) Would you be satisfied if the same advice
    was given in a real clinical situation?                   6         2           8         6
(4) Would your advice be similar?                             7         2           8         5
(5) Are you able to identify the advice source?
    If yes, which?                                            1         1           1         1
In addition to omissions in peer review question selection, results show some inconsistency in interpretation. It is unclear, for example, whether questions about the source of advice were interpreted as attempts to 'pin-point' origins or to make broad distinctions between clinical and computer advice (Table 1, question 5). Similarly, it is uncertain why some clinicians appear to make subtle distinctions between their own recommendations and judgements on the performance of others (Table 1, questions 2, 3 and 4).

The preliminary methodology, as it was implemented, was limited in scope and performance. Based on our experience, it is clear that improvements must consider:
(a) Distinctions between doctors' performance in the clinic and under test conditions.
(b) How many clinicians to include, and their levels of experience and knowledge of computer systems on trial.
(c) Quality, quantity and presentation of data.
(d) Number and character of patient cases.
(e) Selection and phrasing of questions at the peer review level.
3. Enhancements to the methodology
3.1. Clinician-diabetologist distinction

The distinction between clinical and test performance is key. Accordingly, the term 'clinician' is subsequently used to indicate a doctor who sees only recorded data, while 'diabetologist' is reserved for those treating the test patients in the clinic.

3.2. Measures of difference and their interpretation
Fig. 3 shows hypothetical insulin dosage advice from clinicians, computers and a diabetologist. The information (for a single injection or a 24 h regime) is arranged in order along a scale of insulin units. The Figure highlights four prime measures of difference which may be calculated from dosage advice:

d1: Discrepancy between clinical and test performance. Specifically, this measures the difference between the diabetologist and the nearest clinician. It is considered to reflect how well tests match the clinical situation; and it is likely to vary with the type, amount and presentation of case data. Experiments should be performed to minimise d1 before trials continue.
Fig. 3. Idealised set of advice from clinicians (C1 to Cn), computers (CP1 to CPn) and a diabetologist (D), arranged along a scale of insulin dosage (units of insulin). Difference measures: d1 = diabetologist - closest clinician; d2 = highest clinician - lowest clinician; d3 = diabetologist - closest computer system; d4 = highest computer system - lowest computer system. The measures are explained further in the text.
d2: Variation between doctors. This records the range of advice among clinicians, and measures the level of consensus among medical staff. It is anticipated to vary with the number and experience of clinicians, and with the difficulty of the cases.

d3: Best computer performance. This measures the smallest difference between computer and medical advice. Good computer performance would be indicated by a small d3. Strict interpretation, however, depends upon d2 (the degree of consensus among doctors).

d4: Comparison between computer systems. This is a comparative measure among competing systems or particular parameter settings for individual algorithms. Over several cases, this measure highlights relative merits and deficiencies between computer systems.

Difference measures may be calculated for any of the variables listed in Table 2. Collectively, they provide a detailed breakdown of insulin dosage management. Examples for selected patients are summarised in Table 3.
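For a single performance variable, the four measures reduce to simple arithmetic over the advised values. A minimal sketch, assuming each advisor's advice has already been summarised as one number (e.g. total units injected/24 h):

    def difference_measures(diabetologist, clinicians, computers):
        # d1-d4 for one performance variable; inputs are one number per advisor.
        d1 = min(abs(diabetologist - c) for c in clinicians)    # clinic vs. test
        d2 = max(clinicians) - min(clinicians)                  # consensus among doctors
        d3 = min(abs(diabetologist - cp) for cp in computers)   # best computer performance
        d4 = max(computers) - min(computers)                    # spread between systems
        return d1, d2, d3, d4

    # e.g. difference_measures(26, [24, 25, 27], [25, 30]) returns (1, 3, 1, 5)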
Table 2
Sample performance measures in insulin therapy

Regime:
  Number/timing of injections
  Number/type of insulin formulations
Dosage adjustment:
  Total insulin units injected/24 h
  Insulin units/injection
  Total insulin units/formulation
  Ratio of insulin formulations (e.g. for number, timing, or units injected)
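Given per-injection records such as those sketched in the Introduction, the Table 2 dosage-adjustment variables follow directly. A hedged sketch, with illustrative names:

    from collections import defaultdict

    def regime_summary(injections):
        # Derive Table 2 dosage-adjustment variables for one advisor's 24 h
        # regime, from (formulation, units) pairs.
        totals = defaultdict(float)
        for formulation, units in injections:
            totals[formulation] += units
        total_units = sum(totals.values())
        return {
            "number_of_injections": len(injections),
            "total_units_per_24h": total_units,
            "units_per_formulation": dict(totals),
            # ratio of insulin formulations, here by units injected
            "formulation_ratio": ({f: u / total_units for f, u in totals.items()}
                                  if total_units else {}),
        }

    # e.g. regime_summary([("Actrapid", 8), ("NPH", 14), ("Actrapid", 6)])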
Table 3. Example difference measure results, using data from Patient 1 in Table 1. Advice from clinicians (C1 to C3), the computer (CP) and the diabetologist (D) is plotted along scales of insulin dosage, with d1-d4 reported for four selected performance variables: total number of injections; total Actrapid units injected; total NPH units injected; and total units of Actrapid and NPH.
3.3. Statistical considerations
Full interpretation of changes in d1, d2, d3 and d4 requires not only a general increase in patient numbers but also consideration of patient type. Table 4 lists some criteria which may be used to categorise patients. Broadly, they distinguish aspects of medical complexity, such as diagnostic complications, from practical issues, such as patient compliance and data availability.
Table 4
Example patient categorisation variables

  Population/demographic features
  Diabetic complexity
  Other medical complications
  Number of days' recorded data required for interpretation
  Severity of insulin dosage adjustment
  Value ranges for selected medical variables
  Patient compliance
  In-patient/out-patient
  Number of previous patient visits
  Regularity of visits
  Length of hospitalisation
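Even with simple categories, patients can be grouped automatically into strata from which a graded range of test cases is drawn. A minimal sketch, assuming patient records are held as dictionaries keyed by Table 4 variables (the field names are illustrative):

    def stratify(patients, key_fields=("diabetic_complexity", "patient_compliance")):
        # Group patient records into categories for a graded range of test cases.
        strata = {}
        for p in patients:
            category = tuple(p.get(f) for f in key_fields)
            strata.setdefault(category, []).append(p)
        return strata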
Table 5
Sample questions and response scales for peer review

  Are there transcription errors?                          y/n
  Is there sufficient data for you to provide advice?      y/n
  What additional information would you request?           ...
  Can you recognise who gave this advice?                  y/n
  Is this advice safe/accurate/effective?                  1...10 *
  Is the short/medium/long-acting insulin appropriate?     1...10 *
  Would you change the regime?                             y/n *

* Indicates several questions combined
Patient categorisation is problematic and may require several trials. However, even simple categories are useful. They provide the basis for a graded range of test cases.

In practice, statistical considerations are often compromised by resource limitations. It is worth noting, however, that the determination of patterns in any complex area of medicine requires large amounts of data. Fundamentally, studies should consider statistical and control features to:
(i) ensure effective use of human and material resources;
(ii) eliminate experimental bias;
(iii) assist in the determination of patient sample size and of the number and experience of medical staff involved (see the sample-size sketch below).
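For the sample-size point, a standard two-sample normal approximation gives a first estimate of how many patients per category are needed to detect a given standardised difference in advice. A sketch under illustrative assumptions; the effect size and error rates below are conventional choices, not values from the paper:

    from math import ceil
    from statistics import NormalDist

    def patients_per_group(effect_size, alpha=0.05, power=0.80):
        # Patients needed per group to detect a standardised difference
        # (Cohen's d) with a two-sided test, using the normal approximation.
        z = NormalDist().inv_cdf
        return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

    # e.g. patients_per_group(0.5) -> 63 per group for a medium effect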
Fig. 4. Summary of the process for refining and calibrating peer review questionnaires (clinicians: C1 to Cn; computers: CP1 to CPn; diabetologist: D). Static advice from clinicians and computers is assessed on peer review questionnaire scales; dynamic advice from the diabetologist is assessed through its medical consequences (Table 6), which are used to calibrate the questionnaires and interpret the static advice.
Table 6
Example measures of the consequences of applied insulin dosage advice

Measures of patient or data accuracy/compliance:
  in-patient/out-patient
  HbA1, M value, relative to blood glucose measurements
Consequences:
  headaches, physical discomfort
  number/severity/length of hypoglycaemic periods
  number/severity/length of hyperglycaemic periods
3.4. Developing the peer review approach
A questionnaire at the peer review level (Level 2 of the validation methodology, Fig. 1) may be used to record peer assessments of advice. Table 5 provides questions and response scales developed from the preliminary results (Section 2.2). Inclusion of the diabetologists' advice in the methodology provides a means for refining the questionnaire design and interpreting the responses. Fig. 4 summarises this process, which is developed below.

By definition, a retrospective methodology cannot produce advice which patients can follow. Suggestions from clinicians and computers may therefore be termed 'static' advice. In contrast, advice from diabetologists is given directly to the patient, and the consequences of this 'dynamic' advice may be measured using the variables listed in Table 6.

The distinction between static and dynamic advice allows the peer review stage of the methodology to be refined. Questions and responses may be calibrated against the dynamic advice, providing a tool for interpreting static advice. Hence questionnaires which accurately reflect salient features of dynamic advice may be applied to static advice with more confidence.
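One concrete way to perform the calibration is to correlate the scores reviewers give to dynamic advice with the measured consequences of that advice (Table 6): questions whose scores track outcomes can then be applied to static advice with more confidence. A minimal sketch using a plain Pearson correlation; the variable names and retention threshold are illustrative assumptions:

    from statistics import mean

    def pearson(xs, ys):
        # Plain Pearson correlation coefficient (assumes non-zero variance).
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def calibrated_questions(score_table, outcomes, threshold=0.6):
        # score_table: question -> scores given to dynamic advice;
        # outcomes: matching measured consequences (Table 6 variables).
        # Retain only questions whose scores track the measured outcomes.
        return [q for q, scores in score_table.items()
                if abs(pearson(scores, outcomes)) >= threshold]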
4. Summary of the methodology

In essence, the methodology combines a series of studies which should be performed to ensure rigorous, comprehensive and informative testing of any new system. The principal steps are summarised below:
(a) Establish a Consensus Panel of domain experts. Determine, empirically or by expert judgement, statistical and domain criteria for: (i) number and experience of clinicians; (ii) number and categories of patients.
(b) Comprehensive collection of patient data, covering several days and including measurements of medical consequences.
(c) Run pilots of the methodology to establish appropriate test conditions by: (i) varying the type, amount and presentation of data; (ii) designing and calibrating peer review questionnaires.
(d) Run the methodology in full.
(e) Present results, indicating: (i) all selection and categorisation criteria covering patients, clinicians, data and questionnaires; (ii) d1, d2, d3 and d4 (if applicable) for all selected attributes of advice, within and between patient categories.

5. Conclusions
Applied computing in medical decision making, as a relatively young branch of science, has no strict performance criteria. A system's progress into hospital trials and general use, however, relies on robust test results that satisfy medical staff, administrators and ethical committees.

In chronic health management, defining evaluation criteria is particularly problematic. It is an area with, by definition, no 'gold standard' against which to judge a decision. It is also a process of continual decision making rather than a single judgement or classification. And it is complicated by aspects of patient compliance, data accuracy and availability, and by doctors' preferences and experience. Chronic health management seems most appropriately analysed through 'quality control surveys'. Yet such surveys, without considerable statistical controls, cannot distinguish the details of computer performance.

Diabetes poses many of these problems. But it has the advantage that insulin regimes, dosage adjustment and medical consequences of therapy can be measured. As a result, it provides opportunities to assess medical and computer performance in quantitative as well as qualitative terms.

The methodology described here emphasises the role of measurement and analysis. Though it concentrates on aspects of insulin dosage adjustment, it might similarly be extended to regime changes, to diet modifications and perhaps to exercise. Necessarily, the methodology relies heavily on comprehensive data collection and promotes the arguments for categorising patients. It incorporates a means of comparing validation test results with the clinical environment. It also allows relative merits of doctors and various computer algorithms to be compared in detail, statistically and through peer review, over empirically defined patient ranges.

Whilst this methodology is still evolving, it offers potential to all parties in diabetes management. Without directly affecting patients, medical staff are given an analytical review of their own performance; computer system designers receive comparative information about different algorithms. Above all, the methodology aims to generate results which are both medically interpretable and statistically sound.

Acknowledgements
The authors would like to thank the clinical staff in the Department of Endocrinology, St. Thomas’s Hospital, London. In particular, we should like to thank Professor Peter Sonksen for his positive criticism.
References

[1] S. Andreassen, J.J. Benn, E.R. Carson, R. Hovorka, U. Kjaerulff and K. Olesen, A causal probabilistic network model of carbohydrate metabolism for insulin therapy adjustment, in: P.C. Pedersen and B. Onaral, eds., Proc. Twelfth Annual Int. Conf. IEEE Engineering in Medicine and Biology Soc. (IEEE, New York, 1990) 1011.
[2] A.M. Albisser, A. Schiffrin, M. Schulz, J. Tiran and B.S. Leibel, Insulin dosage adjustment using manual methods and computer algorithms: A comparative study, Med. Biol. Eng. Comput. 24 (1986) 577-584.
[3] E.R. Carson, S. Carey, F. Harvey, P.H. Sonksen, S. Till and C.D. Williams, Information technology and computer-based decision support in diabetic management, Comput. Meth. Prog. Biomed. 32 (1990) 179-188.
[4] T. Deutsch, E.R. Carson, F.E. Harvey, E.D. Lehmann, P.H. Sonksen, G. Tamas, G. Whitney and C.D. Williams, Computer-assisted diabetic management: a complex approach, Comput. Meth. Prog. Biomed. 32 (1990) 195-214.
[5] EURODIABETA, Information technology for diabetes care in Europe: The EURODIABETA initiative, Diabetic Med. 7 (1990) 639-650.
[6] J. Gaschnig, P.H. Klahr, H. Pople, T. Shortliffe and A. Terry, Evaluation of expert systems: issues and case studies, in: F. Hayes-Roth, D.A. Waterman and D.B. Lenat, eds., Building Expert Systems (Addison-Wesley, Reading, MA, 1983) 241-280.
[7] J. Holland, EURODIABETA: Modelling and implementation of information systems for chronic health care - example: diabetes, in: R. O'Moore, S. Bengtsson, J.R. Bryant and J.S. Bryden, eds., Lecture Notes in Medical Informatics (Springer, Berlin, 1990) 48-53.
[8] R. Hovorka, S. Svacina, E.R. Carson, C.D. Williams and P.H. Sonksen, A consultation system for insulin therapy, Comput. Meth. Prog. Biomed. 32 (1990) 303-310.
[9] E.D. Lehmann, T. Deutsch, A.V. Roudsari, E.R. Carson, J.J. Benn and P.H. Sonksen, A metabolic prototype to aid in the management of insulin-treated diabetic patients, Diab. Nutr. Metab. 4 (Suppl. 1) (1991) 163-167.
[10] A. Peter, M. Rubsamen, U. Jacob, D. Look and P.C. Scriba, Clinical evaluation of decision support system for insulin-dose adjustment in IDDM, Diabetes Care 14 (1991) 875-880.
[11] A.V. Roudsari, H.J. Leicester, E.D. Lehmann, R. Hovorka, S. Andreassen, T. Deutsch, E.R. Carson and P.H. Sonksen, A validation methodology for testing decision-support systems for insulin dosage adjustment, in: K.-P. Adlassnig, G. Grabner, S. Bengtsson and R. Hansen, eds., Lecture Notes in Medical Informatics (Springer, Berlin, 1991) 382-388.
[12] A.V. Roudsari, H.J. Leicester, P.A. Lawrence, R. Hovorka, E.D. Lehmann, E.R. Carson and P.H. Sonksen, A practical application of a retrospective validation methodology, in: Int. Conf. on System Science in Health Care (Omnipress, Prague, 1992) 1296-1299.
[13] E. Salzsieder, G. Albrecht, U. Fischer, A. Rutscher and U. Thierbach, Computer-aided systems in the management of type I diabetes: The application of a model-based strategy, Comput. Meth. Prog. Biomed. 32 (1990) 215-224.
[14] J. Wyatt and D. Spiegelhalter, Evaluating medical expert systems: what to test and how?, Med. Inform. 15 (3) (1990) 205-217.