Computer Programs in Biomedicine 7 (1977) 163-170 © Elsevier/North-Holland Biomedical Press
MEDICAL DATA BASE SYSTEM WITH AN ABILITY OF AUTOMATED
DIAGNOSIS
M. OKADA, N. MARUYAMA *, T. KANDA, K. SHIRAKAWA and T. KATAGIRI **

* Department of Neurophysiology and ** Department of Neurology, Brain Research Institute, Niigata University, Niigata, Japan

We carried out an experiment on a medical information system in which a clinical data base is combined organically with computer programs for automated diagnosis. In this system, the parameters for automated diagnosis are renewed as the contents of the data base (the patients' information) increase; the system can therefore be regarded as a data base possessing a kind of diagnosing ability that grows with time, and we have named it the "Intelligent Data Base". The algorithm for computer diagnosis used in this study is based on the maximum likelihood method, with each likelihood weighted by the prior probability of the corresponding disease; the discrimination efficiency of this method is logically equal to that of the Bayes rule. After the first 27 cases had been learnt by the system, a correct diagnosis was obtained in 78% of the test cases; when the learnt cases increased to 82, the percentage of correct diagnoses improved to 95%.

Data base; Automated diagnosis; Learning
1. Introduction

A number of methods and systems for automated diagnosis by computer have been reported [1-6]. Recently, studies of automated diagnosis have focused on methods based upon probabilistic assessment [3-6]. In most of these studies, however, the criteria of diagnosis are derived from a fixed number of samples. A doctor learns diagnostics from many cases over the years and so develops into a specialist; a comparable learning method should also be useful in computer diagnostic systems [7-9]. To make a learning system efficient, the system is best embedded in a medical data base system (patient information system [10-19]), because the contents of the data base increase day by day. From this point of view, we experimented with a system in which a computer program for automated diagnosis works organically with a medical data base. In this system the so-called "maximum likelihood method" [20] is employed for diagnosis. To begin with, we record the information of patients who have already been diagnosed, and from it the parameters for automated diagnosis are computed. When a patient's information is newly recorded in the data base, an automated diagnostic procedure is carried out on the basis of these parameters and a "possible diagnosis" is typed out. If desired, the parameters for automated diagnosis can be improved as the patients' information in the data base increases. The system can be regarded as a data base possessing a kind of diagnosing ability that grows with time; we call it the "Intelligent Data Base". This report describes the construction of the system and the results of its clinical application.
2. Methods

2.1. Materials

The diseases dealt with in this experiment are confined to multiple sclerosis and those that need to be differentiated from it, such as myelitis, brain stem encephalitis and myeloradiculoneuritis. Table 1 shows the chart used in this study.

2.2. Computer programs

The whole program consists of five parts, which offer the following services.
1. Accept a new patient's information into the data base.
2. Renew or correct information which has already been recorded.
3. Carry out retrieval of information for several purposes [21].
M. Okada et al., Data base and automated diagnosis
Table 1
List of items in a patient's file.

Patient identification
(1) Chart No. (in), (out)
(2) Name
(3) Sex
(4) Birthday
(5) Address
(6) Occupation
(7) Age at onset

1. Ingravescence
2. Familial incidence
3. Remission & recurrence
4. Multifocal lesions
5. Cerebral signs
6. Optic signs
7. Spinal cord signs
8. Brain stem signs
9. Cerebellar signs
10. Laterality of signs

Past history
11. Measles
12. Allergic diseases
13. Trauma
14. Appendicitis
15. Operation
16. Tuberculosis
17. Gastro-intestinal diseases
18. Liver diseases
19. Nephritis
20. Miscellaneous

Precipitating factor
21. Overwork
22. Trauma
23. Vaccination
24. Operation
25. Pregnancy
26. Delivery
27. Miscellaneous

Prodromal symptoms
28. Fever
29. Headache
30. Common cold
31. Nausea & vomiting
32. Exanthema
33. Dizziness & vertigo
34. Pain
35. Miscellaneous

36. Mode of onset

Initial symptoms
37. Impairment of visual acuity
38. Double vision
39. Paralysis
40. Speech disturbance
41. Gait disturbance
42. Numbness
43. Hypesthesia
44. Ophthalmic pain
45. Miscellaneous

Main neurological signs
46. Mental disorder
47. Impairment of visual acuity
48. Optic nerve atrophy
49. Ophthalmoplegia
50. Nystagmus
51. Dysarthria
52. Dysphagia
53. Paralysis
54. One side of the body
55. Lower half of the body
56. Hyperreflexia
57. Hyporeflexia
58. Pathological reflex
59. Ataxia or intention tremor
60. Sensory disturbance
61. One side of the body
62. Lower half of the body
63. Glove & stocking
64. Bladder & rectal disturbance
65. Convulsion
66. Painful cramp
67. Wassermann reaction
68. C.S.F.
69. Cell count
70. Protein
71. Miscellaneous

Therapy
72. Effect of steroid therapy
4. Compute the parameters required for automated diagnosis from the stored patient information.
5. Diagnose a patient whose information is input, on the basis of the parameters already computed and stored.

Any of the services above can be ordered arbitrarily through the typewriter keys. The following sections explain the system and the directions for its use.

2.3. Patients' files
Every input is made by a procedure in which a doctor, through the typewriter keys, answers questions displayed on a CRT. After all questions have been answered, illegal inputs are checked automatically on the CRT display. The inputs for one patient make up one unit of file in core memory, which is then transferred to magnetic tape. One patient's information forms a fixed-length file, that is, a file whose maximum capacity is limited. A unit of file is divided into two records. The first record holds the data concerning the patient's identification, such as personal identification number, chart number, name, sex, address and occupation. The second record holds the data concerning the patient's disease, such as past history, familial history, prodromal symptoms, symptoms and signs, and laboratory findings. As for the input format, there are two kinds of items: specified-format items and arbitrary-format items. In the first record, the input format is specified for items such as diagnosis, chart number, sex and age at onset, and arbitrary for items such as name, birthday, address and occupation. The specified-format items are usable for statistical computation and other purposes [22]. The personal identification number is given consecutively by the computer program; this number is utilized for calling up or retrieving stored data.

CASE NO. 105
(1) DIAGNOSIS        1-MULTIPLE SCLEROSIS
(2) CHART NO. (OUT)  4348
(3) NAME             ???? ?????
(4) SEX              MALE
(5) BIRTHDAY         1950.10.25
(6) ADDRESS          NIIGATA, NIIGATA
(7) OCCUPATION       STUDENT *
(8) AGE AT ONSET     17

Fig. 1. Example of computer display.
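The two-record file structure described above can be sketched roughly in modern terms. Everything below is our own illustration: the field names, the capacity constant and the truncation helper are assumptions, since the original system was an assembly-language program whose fixed file size is not stated in the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_FILE_CHARS = 2000  # hypothetical capacity; the paper fixes a limit but does not state its size


@dataclass
class IdentificationRecord:
    """First record: patient identification. Diagnosis, chart number, sex
    and age at onset are specified-format items; name, birthday, address
    and occupation are arbitrary-format items."""
    patient_id: int                    # consecutive number assigned by the program
    chart_no: str
    sex: str
    diagnosis: Optional[str] = None    # left blank until the final diagnosis is decided
    age_at_onset: Optional[int] = None
    name: str = ""
    birthday: str = ""
    address: str = ""
    occupation: str = ""


@dataclass
class DiseaseRecord:
    """Second record: the 72 YES/NO/UNKNOWN items used for diagnosis.
    Items answered UNKNOWN are simply absent from the mapping."""
    answers: dict[int, bool] = field(default_factory=dict)  # item number -> True (YES) / False (NO)
    comments: dict[int, str] = field(default_factory=dict)  # free text appended after a specified item


def truncate_to_capacity(text: str, chars_used: int) -> str:
    """Variable-length items: characters beyond the file's fixed capacity
    are silently ignored, as in the original system."""
    return text[: max(0, MAX_FILE_CHARS - chars_used)]
```

The per-item lengths stay variable while the file as a whole stays fixed-length, which is the trade-off the original design made for tape storage.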
CASE NO. 105
41. GAIT DISTURB.      YES
42. NUMBNESS           NO
43. HYPESTHESIA        NO
44. OPHTHALMIC PAIN    NO
45. OTHERS             NO
46. MENTAL DISORDER    NO
47. IMP. OF VISUAL AC. YES
48. OPTIC N. ATROPHY   YES
49. OPHTHALMOPLEGIA    NO *
50. NYSTAGMUS          NO

Fig. 2. Example of computer display. Data on display are employed to compute the parameters for automated diagnosis.

The format of the second record of the file is specified, since most of its items are used directly to compute the parameters for automated diagnosis. Even though the format is specified, comments can be written in any format immediately after the specified-format descriptions. As stated before, one unit of file is a fixed-length file, so the number of characters per patient is limited. The number of characters per item, however, is not fixed (variable length), so entries can be written at length until all the inputs for a patient exceed the capacity of the file; any excess characters are ignored and are not displayed on the CRT. The contents of the data stored on a magnetic tape can be corrected by calling them up on the CRT display (figs. 1 and 2).

2.4. Algorithm of automated diagnosis

A method similar in ideas and formulae to the "maximum likelihood method" (R.A. Fisher) was applied to the automated diagnosis. The likelihood was weighted by the prior probability of each disease (see Appendix); the discrimination efficiency of this method is logically equal to that of the Bayes rule. Clinical data include continuous, discrete and two-valued quantities. Setting up many grades for each symptom would rather decrease the accuracy of the probability estimates, because a detailed description cannot always be obtained. Hence we regard every datum as a two-valued quantity: either occurrence or non-occurrence of a symptom, or either the normal or the abnormal range for a laboratory finding. In compliance with this, each specified-format item used directly for automated diagnosis takes the style of an answer chosen from three alternatives: YES, NO and UNKNOWN. For item 1 in table 1, YES was input when the clinical course had shown ingravescence; for item 36, when the mode of onset had been acute; for items 68-70, when the result of the examination had been in the abnormal range; for item 72, when therapy with corticosteroids had been clinically effective; and for items 20, 27, 35, 45 and 71, when other positive findings of the same category had been found. This style is convenient for our diagnostic method.

Let S_i (i = 1, 2, ..., m) be each symptom and D_α (α = 1, 2, ..., n) be each disease. We compute the conditional probability that symptom S_i occurs in disease D_α for each α ∈ {1, 2, ..., n} and i ∈ {1, 2, ..., m}; in our experiment, m = 72 and n = 4. We also compute the probability of each disease; these probabilities are equal to the prior probabilities in the Bayes rule. The symptoms mentioned here and in the following descriptions include not only symptoms themselves but also family history, past history, prodromal symptoms, mode of onset, physical signs, laboratory findings and therapeutic effects. In the practical procedure, the ratio of occurrence of each symptom in each disease and the ratio of each disease to all cases are computed; we regard the former as an estimate of the probability of occurrence of a symptom and the latter as that of a disease. These values are the parameters mentioned above. In the maximum likelihood method, a "likelihood" of every disease is computed using these parameters.
We can thus find the disease to which the combination of symptoms of a given patient is closest. Hereafter we call the most likely diagnosis the computer diagnosis.

2.5. Learning and automated diagnosis

Initial learning consists of computing the parameters from the cases whose final diagnosis has been decided by doctors, with anatomical findings etc. The probabilities of occurrence can be typed out at any time by a specified typewriter-key operation. Computed parameters are always recorded on a magnetic tape. When the cases with a final diagnosis have increased in the data base, relearning can be ordered through a typewriter key and, consequently, the parameters are improved. In cases whose final diagnosis has not yet been determined or is uncertain, the item of diagnosis in the file should be left blank; such patients' information is then ignored in computing the parameters.

An automated diagnosis is carried out on newly input patient information and the result is typed out, if desired; if not, the automated diagnosis can be omitted by a specified key operation. By either procedure, the new patient's information is recorded on a magnetic tape, but the diagnosis obtained here (the computer diagnosis) is not recorded. Fig. 3 exemplifies the typed output of a case that has been diagnosed automatically: it shows the highest-ranking diagnosis together with its likelihood, and the likelihoods of the other diagnoses for reference. The second line in the figure is the chart number. The flow of the program is shown in fig. 4.

**** COMPUTER DIAGNOSIS ****

(3098)
MOST PROBABLE DIAGNOSIS:
*MULTIPLE SCLEROSIS        (L= -22.105)
OTHERS:
*MYELITIS                  (L= -89.889)
*BRAIN STEM ENCEPHALITIS   (L= -65.663)
*MYELORADICULONEURITIS     (L= -74.317)

Fig. 3. Sample of diagnostic report.
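The learning step, estimating occurrence ratios per symptom and disease plus disease frequency ratios as priors, can be sketched as follows. This is our reconstruction in Python, not the original assembly program; the data layout (a diagnosis label plus an item-to-0/1 mapping, with UNKNOWN items omitted) is an assumption.

```python
from collections import defaultdict


def learn_parameters(cases):
    """Estimate the diagnostic parameters from already-diagnosed cases.

    `cases` is a list of (diagnosis, findings) pairs, where `findings`
    maps item number -> 1 (YES) or 0 (NO); UNKNOWN items are absent.
    Returns (priors, cond): priors[d] estimates p(D) as the ratio of
    disease d to all cases, and cond[d][i] estimates p(S_i | D) as the
    occurrence ratio of item i among the answered cases of disease d.
    """
    disease_counts = defaultdict(int)
    yes_counts = defaultdict(lambda: defaultdict(int))       # disease -> item -> YES answers
    answered_counts = defaultdict(lambda: defaultdict(int))  # disease -> item -> YES+NO answers

    for diagnosis, findings in cases:
        disease_counts[diagnosis] += 1
        for item, value in findings.items():
            answered_counts[diagnosis][item] += 1
            yes_counts[diagnosis][item] += value

    total = sum(disease_counts.values())
    priors = {d: n / total for d, n in disease_counts.items()}
    cond = {d: {i: yes_counts[d][i] / answered_counts[d][i]
                for i in answered_counts[d]}
            for d in disease_counts}
    return priors, cond
```

Relearning is then simply a matter of re-running `learn_parameters` over the enlarged set of finally-diagnosed cases, which is in essence what the relearning key command orders.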
3. Results
By means of the system described above, learning was executed and its result was tested in terms of the percentage of correct diagnoses. This section describes how that percentage changed as the number of learnt cases increased. Cases to which a final diagnosis had already been given were used both for learning and for testing.
Fig. 4. Flow chart of the program.
Table 2
Improvement of the percentage of correct diagnosis by relearning.

Number of      Number of      Correct      Mis-        Correct/
learnt cases   tested cases   diagnoses    diagnoses   tested
27             14             11           3           0.78
41             15             11           4           0.73
58             16             15           1           0.94
82             82             78           4           0.95
We first recorded 27 cases into the data base: 10 cases of multiple sclerosis, 10 of myelitis and 7 of brain stem encephalitis. Initial learning was made with these 27 cases, and the first test was performed with 14 external samples (cases outside the data base); the percentage of correct diagnoses was 78%. These 14 cases were then included in the data base, bringing the number of cases to 41. Relearning was executed with the 41 samples and tested with another 15 external samples (the second test); the percentage of correct diagnoses was 73%. These 15 samples and another 4 samples (2 of which had not yet been given a final diagnosis) were added to the data base. Relearning was executed with 58 of the 60 cases and tested with 16 new external samples (the third test); the percentage of correct diagnoses rose to 94%. These 16 cases, 8 cases of myeloradiculoneuritis and 2 cases without a final diagnosis were then added, so the data base came to contain four diseases: multiple sclerosis, myelitis, brain stem encephalitis and myeloradiculoneuritis. Relearning was executed for all diseases with 82 of these 86 cases. The samples used in the last (fourth) test were internal samples (cases within the data base), because no further external samples were available; the percentage of correct diagnoses was 95% at this final step.

Table 2 shows the transition of the percentage of correct diagnoses as the number of cases in the data base increases; the leftmost column shows the number of cases learnt, and each row gives the result of the corresponding test described above. The percentage varies considerably from test to test, probably because of the small number of cases in each test and large case-to-case variation. Nevertheless, the results suggest that the percentage of correct diagnoses improves as the cases in the data base increase.
4. Discussion

A diagnostic system with learning ability should be very useful in the field of computer diagnosis and medical information systems. As theories of computer diagnosis there are several useful methods, such as the Bayes rule, the maximum likelihood method, discriminant functions and principal component analysis. When many diseases must be differentiated, the Bayes rule and the maximum likelihood method are convenient and powerful, and both are also suitable for frequent relearning. The most notable difference between the Bayes rule and the maximum likelihood method is that the former considers the probability of occurrence of each disease, i.e., the prior probability, while the latter does not. We multiplied each likelihood by the probability of occurrence of the corresponding disease, so the discrimination efficiency of our method equals that of the Bayes rule (see Appendix); our method also takes less computation time than the Bayes rule.

Our method, the maximum likelihood method, requires that the items (symptoms) be stochastically independent of one another. In clinical data, however, the items correlate with each other; in other words, a factor (a piece of information) is contained redundantly in many items. For such data the value computed by the likelihood formula does not express a true probability: when a factor is contained in more than one item, the factor is overestimated, so factors end up with unequal weights. We examined whether such redundant items must be excluded. We picked up the
items which scarcely correlated with each other according to a χ²-test with a critical value of p = 0.5; the test was done for every pairwise combination of items, and one member of each correlated pair was omitted. The number of selected items was 45 out of 72. Using these 45 items, the percentage of correct diagnoses was 92%. This decrease may be due to the information loss brought about by the data selection. If data selection is done correctly, a true probability of each diagnosis will be obtained; on the other hand, a good deal of diagnostic information will be lost. Because of these merits and demerits, whether data selection should be done must be confirmed for various combinations of diseases and symptoms using a sufficient number of cases, and the result may differ from case to case. It can be said, however, that selection of items is not always needed. From our experiment we found that the combination of items we settled on was suitable enough for diagnosis, because the percentage of correct diagnoses was satisfactorily high. This is probably because the variation in the weights of the factors is not wide enough to cause trouble. The deduction is also supported by the following observation: when a case had an atypical combination of symptoms, the likelihood of the correct diagnosis became low, but the likelihoods of the other three diseases became much lower at the same time, so a correct differential diagnosis could still be made.

Owing to the redundancy, we can expect to obtain sufficient information for a diagnosis even from data lacking descriptions of some items. In many practical cases some items are left blank, and even for such cases we can diagnose successfully. This was confirmed by the following examination: we tested the percentage of correct diagnoses for cases in which information on several items was excluded. Even when we ignored several items believed to be very important (for example, remission and recurrence, or multifocal lesions in cases of multiple sclerosis), the accuracy of the differential diagnosis was not much affected. From the present experiments we believe our method is applicable to practical use, though it is somewhat lacking in mathematical strictness.
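The pairwise item screening discussed above can be sketched as follows. The 2x2 chi-square statistic is standard; the numeric critical value 0.455 (the chi-square value at p = 0.5 with one degree of freedom) and the greedy rule for which member of a correlated pair to drop are our assumptions, since the paper does not state them.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_items(data, items, crit=0.455):
    """Keep items that are pairwise 'scarcely correlated'.

    `data` is a list of dicts mapping item -> 0/1 (UNKNOWN omitted).
    An item is dropped when its chi-square statistic against an
    already-kept item exceeds `crit` (about the chi-square value at
    p = 0.5 with 1 degree of freedom); the greedy drop order is our
    assumption - the paper only says one member of each correlated
    pair was omitted.
    """
    kept = []
    for item in items:
        correlated = False
        for other in kept:
            table = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
            for case in data:
                if item in case and other in case:
                    table[(case[item], case[other])] += 1
            if chi2_2x2(table[(0, 0)], table[(0, 1)],
                        table[(1, 0)], table[(1, 1)]) > crit:
                correlated = True
                break
        if not correlated:
            kept.append(item)
    return kept
```

With so permissive a critical value (p = 0.5), even mild associations count as correlation, which is consistent with only 45 of the 72 items surviving the screen.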
5. Hardware and software specifications

The program is written in assembly language and runs on a PDP-12 computer, requiring 8K of core memory. The peripherals used are two magnetic tape units, a CRT display unit and a teletype unit. For arithmetic data handling, the program uses the floating-point package constructed by Professor N. Maruyama.

6. Mode of availability

Persons interested in obtaining a listing of the program are invited to contact the authors.
Appendix

Maximum likelihood method

Let f(x; \alpha) be the frequency function of the random variable x, subject to the condition D_\alpha. Suppose that k observations are to be made of the variable x, and let x_1, x_2, \ldots, x_k denote the k random variables corresponding to those observations. Then the function

L(x_1, x_2, \ldots, x_k; \alpha) = f(x_1; \alpha) \, f(x_2; \alpha) \cdots f(x_k; \alpha)

defines a function of the random variables x_1, x_2, \ldots, x_k and the parameter \alpha, which is known as the likelihood function. When x is a discrete random variable with x \in \{\xi_1, \xi_2\}, let the observed frequencies of (x = \xi_1) and (x = \xi_2) be f_1 and f_2. The likelihood function is then written

L(x_1, x_2, \ldots, x_k; \alpha) = \{p_1(\alpha)\}^{f_1} \cdot \{p_2(\alpha)\}^{f_2} = \{p_1(\alpha)\}^{f_1} \cdot \{1 - p_1(\alpha)\}^{k - f_1}, \qquad f_1 + f_2 = k,    (1)

where p_1(\alpha), p_2(\alpha) are the conditional probabilities of (x = \xi_1), (x = \xi_2), subject to the condition D_\alpha. For any particular set of observational values, eq. (1) gives the probability of obtaining that set of values. If k = 1,

L(x_1; \alpha) = \{p(\alpha)\}^{f} \cdot \{1 - p(\alpha)\}^{1 - f}, \qquad f = 1 \text{ when } x_1 = \xi_1, \quad f = 0 \text{ when } x_1 = \xi_2,

and L(x_1; \alpha) represents the conditional probability that x_1 = \xi_1 or x_1 = \xi_2, subject to the event D_\alpha. Therefore, if there are m variables \{S_1, S_2, \ldots, S_m\} and these are stochastically independent of one another,

L(X; \alpha) = \prod_{i=1}^{m} \{p_i(\alpha)\}^{f_i} \, \{1 - p_i(\alpha)\}^{1 - f_i}, \qquad X = \{S_1, S_2, \ldots, S_m\},    (2)

where p_i(\alpha) is the conditional probability of the event S_i, subject to the event D_\alpha. To apply this function to medical diagnosis, let D_\alpha be a disease, S_i a symptom, X a set of symptoms, p_i(\alpha) the probability that symptom S_i occurs given that disease D_\alpha has been observed, and 1 - p_i(\alpha) the probability that symptom S_i does not occur given D_\alpha. When we compute L(X; \alpha) for all diseases D_\alpha (\alpha = 1, 2, \ldots, n) from the set of symptoms \{f_1, f_2, \ldots, f_m\} (f_i = 0 or 1) of a patient whose diagnosis is unknown, the D_\alpha whose L(X; \alpha) is maximum is the most probable diagnosis. If p(D_\alpha) (\alpha \in \{1, 2, \ldots, n\}) is the probability of each disease, i.e., the frequency ratio of disease D_\alpha in the data base, then

L^{*}(X; \alpha) = p(D_\alpha) \prod_{i=1}^{m} \{p_i(\alpha)\}^{f_i} \, \{1 - p_i(\alpha)\}^{1 - f_i}.    (3)

L^{*}(X; \alpha) is a likelihood weighted by the probability of each disease; p(D_\alpha) is equal to a prior probability in the Bayes rule. Equation (3) is equal to the numerator of the expression used in the Bayes rule, and since the denominator in the Bayes rule is the same for every diagnosis, our method is equal to the Bayes rule in the relative values among diagnoses. Note that although in eq. (3) we write the probability that symptom S_i will not occur as 1 - p_i(\alpha), in practical clinical data the relation p(x = \xi_2) = 1 - p(x = \xi_1) does not necessarily hold because of defects in the data; we therefore computed each probability separately in this article.

When there are many symptoms, the value of L^{*}(X; \alpha) in eq. (3) becomes very small. This does not matter as long as the computation is carried out in floating point, but output in floating-point notation is inconvenient. Taking the logarithm of both sides of eq. (3), we can convert L^{*}(X; \alpha) into an order of magnitude expressible in fixed-point format:

\ln L^{*}(X; \alpha) = \sum_{i=1}^{m} \{ f_i \ln p_i(\alpha) + (1 - f_i) \ln(1 - p_i(\alpha)) \} + \ln p(D_\alpha).    (4)

In the example of typed output shown in fig. 3, the "likelihood" was evaluated by eq. (4).
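Equation (4) translates directly into code. The sketch below is ours, not the original PDP-12 program, and it adds two assumptions the paper does not spell out: estimated probabilities are clipped away from 0 and 1 so the logarithms stay finite, and items answered UNKNOWN are simply skipped.

```python
import math


def log_likelihood(findings, prior, cond_prob, eps=1e-3):
    """Weighted log-likelihood of one disease, eq. (4):
    ln L*(X; a) = sum_i [f_i ln p_i(a) + (1 - f_i) ln(1 - p_i(a))] + ln p(D_a).

    `findings` maps item -> 1 (YES) or 0 (NO); UNKNOWN items are omitted
    and contribute nothing. `cond_prob` maps item -> estimated p_i(a);
    items never seen during learning default to 0.5, and `eps` clips the
    estimates away from 0 and 1 (both are our assumptions).
    """
    total = math.log(prior)
    for item, f in findings.items():
        p = min(max(cond_prob.get(item, 0.5), eps), 1.0 - eps)
        total += f * math.log(p) + (1 - f) * math.log(1.0 - p)
    return total


def computer_diagnosis(findings, priors, cond):
    """Return (disease, ln L*) for the disease with the maximum
    weighted likelihood, as in the typed report of fig. 3."""
    best = max(priors, key=lambda d: log_likelihood(findings, priors[d], cond[d]))
    return best, log_likelihood(findings, priors[best], cond[best])
```

The negative fixed-point values printed in fig. 3 (e.g. L= -22.105) are exactly such log-likelihoods: products of probabilities are always below 1, so their logarithms are negative, and the least negative one wins.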
Acknowledgements The authors wish to thank Professor T. Tsubaki for a critical reading of the manuscript. We also thank Dr. Y. Horikawa for having permitted us to include the data of multiple sclerosis.
References
[1] J. Wartak, IEEE Trans. Biomed. Eng. BME-17 (1970) 37.
[2] O.L. Peterson, E.M. Barsamian and M. Eden, J. Med. Educ. 41 (1966) 797.
[3] R.S. Ledley and L.B. Lusted, Science 130 (1959) 9.
[4] R. Gamboa, J.D. Klingeman and H.V. Pipberger, Circulation 39 (1969) 72.
[5] J.D. Klingeman and H.V. Pipberger, Comput. Biomed. Res. 1 (1967) 1.
[6] D.H. Gustafson, J.J. Kestly, R.L. Rudke and F. Larson, Comput. Biomed. Res. 6 (1973) 355.
[7] H.R. Warner, MEDIS '73, Symposium E-2 (1973).
[8] W.M. Lively, S.A. Szygenda and C.E. Mize, Comput. Biomed. Res. 6 (1973) 393.
[9] M. Okajima, L. Stark, G. Whipple and S. Yasui, IEEE Trans. Bio-Med. Electron. BME-10 (1963) 106.
[10] W.V. Slack, G.P. Hicks, C.E. Reed and L.J.V. Cura, N. Engl. J. Med. 274 (1966) 194.
[11] L.L. Weed, N. Engl. J. Med. 278 (1968) 593.
[12] I.F. Kanner, JAMA 215 (1971) 1281.
[13] W.V. Slack, B.M. Peckham, L.J.V. Cura and W.F. Carr, JAMA 200 (1967) 224.
[14] C. Vallbona, MEDIS '73, Symposium A-4 (1973).
[15] R.O. Foul, MEDIS '73, Symposium D-7 (1973).
[16] J.H. Greist, L.J.V. Cura and N.P. Kneppreth, Comput. Biomed. Res. 6 (1973) 257.
[17] L.A. Kuehn and D.M.C. Sweeney, Comput. Biomed. Res. 6 (1973) 226.
[18] C.L. Horton and D.A. Cooley, Comput. Biomed. Res. 6 (1973) 286.
[19] P.L. Reichertz, MEDIS '73, Symposium E-4 (1973). [20] P.G. Hoel, Introduction to Mathematical Statistics (Wiley, New York, 1961) 39. [21] J.D. Buckley, V.X. Gledhill, J.D. Mathews and I.R. Mackay, Comput. Biomed. Res. 6 (1973) 235. [22] F.L. Meijler, A New Coding System for Electrocardiography (Excerpta Medica, Amsterdam, 1972).