Fuzzy Sets and Systems 44 (1991) 431-448 North-Holland
Fuzzy specification in data capture and knowledge representation for an expert system in oncology
J. Mira,* A. Yáñez, A. Barreiro, A.E. Delgado*
Departamento de Electrónica, Facultad de Física, Universidad de Santiago, Santiago, Spain
J.M. Couselo
Departamento de Pediatría, Hospital General de Galicia, Santiago, Spain
Received February 1988
Revised July 1988
Abstract: This paper is concerned with the design of a uniform data capture strategy that includes the underlying reasoning. Because clinical information is uncertain and imprecise, the fuzzy paradigm is used in building the database and in knowledge representation. Patient information has been segmented into antecedents, symptoms, physical examination findings, hematology, chemical analysis and complementary data. For knowledge representation we use a tool that mixes prototypes and production rules. This methodology has been used in the expert system ONCOGAL, which is mainly concerned with diagnosis and protocol follow-up in pediatric oncology.
Keywords: Fuzzy knowledge representation; fuzzy reasoning; artificial intelligence; expert systems; medicine.
* Departamento de Informática y Automática, Facultad de Ciencias, Universidad Nacional de Educación a Distancia (UNED), 28040 Madrid, Spain.
0165-0114/91/$03.50 © 1991 Elsevier Science Publishers B.V. All rights reserved
1. Motivation
This paper is concerned with the development of a knowledge based system to advise in diagnosis and follow-up of chemotherapy protocols of cancer. The modern management of the treatment of cancer requires the physician to follow complex therapy protocols, to understand the causal processes underlying directly observed symptoms and to respond with precision and expertise to changes in the patient's condition. This requires [10]:
(1) Close co-operation on a multicenter trial framework starting with the standardization of patient data and laboratory measurements as well as in the knowledge representation tools.
(2) Formal descriptions of data in which 'what, where, when, how and why should be stored' is defined.
(3) Acceptance that data can no longer be separated from procedural information and causal knowledge. This suggests to us the usefulness of considering a double structure in our database: a surface structure (where 'what,
when and where' is stored) and a deep structure, where the underlying knowledge related to reasoning and imbricated with the datum is stored ('when, how and why').
(4) Acceptance that some degree of uncertainty and imprecision is always present in symptoms, signs, test results and findings. Quantitative data enter into decision-making through the elicitation of linguistic labels of diagnostic meaning (normal, reduced, elevated, ...). Qualitative data, e.g. pain, also require a fuzzy approach, and diagnosis is usually related to opinions and preferences. A fuzzy type of data characterization is therefore proposed, with the aim of developing an effective common coding set for the transfer and management of imprecise facts and imbricated knowledge.
(5) Once the database has been defined, the knowledge representation architecture is established.
In this paper we are concerned with the conceptual model used for "data-and-imbricated-knowledge" capture and the build-up of the corresponding surface and deep structure of a database for cancer in the young. The patient information has been segmented into Filiation, Antecedents, Symptoms, Physical examination, Hematological findings, Chemical analysis and Complementary data (Image processing, Biopotentials, ...).
The oncologist's knowledge of his field comprises two clearly separate components: knowledge of what clinical variables are relevant, and of how to measure them; and a set of pathology models representing the expectations acquired by individual or collective experience. Both components are represented in ONCOGAL by prototype structures in which each prototype contains both substantive knowledge (the valid values of a clinical parameter, the laboratory data relevant to a particular pathology, etc.) and metadiagnostic knowledge concerning the way in which the substantive data are to be obtained or handled.
Cancer is one of the more frequent causes of death in young people in developed countries. 
The kinds of cancer found in pediatric age groups differ markedly from adult malignancies. The more frequent cancers in children are leukemias, embryonal tumors and sarcomas. In contrast, the more frequent tumors in adults, adenocarcinoma and carcinoma, are scarcely present in children. The anatomical layout is also different. Excluding skin, the more frequent sites of cancer in adults are breast, lungs, colon/rectum, pancreas, stomach and ovary. Child cancer, however, is usually located in blood and bone marrow (leukemia), sympathetic nervous system (neuroblastoma), soft tissue and kidney [9, 14]. In our hospital a mean of 30 new pediatric tumors per year are treated, with an estimated survival rate of 60% during the sampling interval 1980-1983. The relative frequency of tumors has been leukemia (21.9%), lymphomas (14.9%), CNS tumors (14.9%), neuroblastoma (11.1%), nephroblastoma (9.3%), soft tissue tumors (9.1%), bone tumors (5.6%), germ cell tumors (3%), retinoblastoma (2.7%), hepatoblastoma (1.1%) and other tumors (3.5%) [12].
From the knowledge representation and expert systems point of view, oncology in children differs little from other branches of oncology and complex medicine in general. The life and death situation and the fastidious detail needed make
pediatric oncology an important vehicle for the development of advanced and sensible knowledge-oriented medical databases and expert systems. Some of the concepts presented in this paper, such as (a) deep versus surface structure, (b) implicit knowledge imbricated in the data, (c) fuzzy representations of findings and (d) fuzzy relations, are domain-independent. Consequently the results of this work can potentially be used in other contexts.
2. Data capture methodology
The methodology of data acquisition can be formalized according to the scheme of Figure 1. On one side, the knowledge engineer states a soft-frame of data capture and knowledge representation tools. In our case this soft-frame includes: (a) segmentation of entities (data, diagnosis, questions, knowledge related); (b) the concept of fuzzy representation of entities (symptoms as fuzzy sets and datum as a set of membership functions); (c) a minimum dialogue protocol including the distinction between deep and surface structures and a taxonomy in both levels; (d) the concept of fuzzy relations and some elements of an algebra of relations.
On the other side, from the clinician team we have factual information, implicit procedural knowledge, conventional non-formal reasoning strategies and implicit models of pathologies and physiological mechanisms related to symptoms by declarative 'if-then-else' statements.
The procedure in data and imbricated knowledge elicitation has been to compel clinicians to make explicit decision and procedural knowledge using the knowledge engineering soft-frame. As a result, a knowledge oriented soft-fuzzy database has been developed, which includes classification of data and explicit models of medical reasoning.
Fig. 1. Data capture methodology. The knowledge engineer (KE) contributes a conceptual model of data capture, soft-frames (segments, fuzzy sets, deep and surface structures, fuzzy relationships) and a dialogue protocol; the clinician team (CT) contributes information on findings, implicit knowledge, implicit models of pathology, non-formal reasoning strategies and plans, and inductive/deductive decision making. Both meet in a dialogue frame (KE view of CT, CT view of KE), which yields the KE formal representation of CT data and knowledge using KE operational tools and the CT structure of CT knowledge using the KE soft-frame.
3. Entities
As an example of the results produced by formalization of data acquisition, we show the prototypes [5, 6, 8, 11, 13] for representation of clinical variables and pathologies in ONCOGAL.
3.1. Clinical variable prototypes
A clinical variable is a particular aspect of the patient's condition; it may be a parameter such as 'platelet count', or a yes/no fact such as 'stomach ache'. The prototype frame of a clinical variable contains all the information needed in handling it. Figure 2 lists the generic structure of this kind of prototype and its realization for the variables fever and thrombopenia. The structure comprises the following seven components:
(1) Values specifies the variable's possible values, and is used to check the validity of the data keyed in by the user. Clinical variables may be numerical, such as fetal_hemoglobin, whose range of values is 0-100 (%); non-numerical, such as pallor, whose values (in this case yes and no) cannot be calculated from numerical data, and are instead determined qualitatively by the doctor; or semiquantitative, such as fetal_hemoglobin_level, whose values (in this case low, normal, high and very high) are natural language labels associated with fuzzy subsets of the range of a numerical variable (in this case fetal_hemoglobin).
(2) Interaction stores a natural language message used to solicit a value for the variable.
(3) Preference specifies whether the preferred means by which the system obtains the value of the variable is direct (by asking the user, as in the case of fever) or indirect (by deduction from other data, as when the presence or absence of thrombopenia is deduced from the patient's platelet count).
(DEFSTRUCT VARIABLE
  "Structure of clinical variable prototypes"
  (VALUES NIL)        ; valid values
  (INTERACTION "")    ; message to request value input
  (PREF 'DIRECT)      ; preferred means of obtaining values
  (CLIN_ACT NIL)      ; clinical action for getting value
  (IND_SOURCE NIL)    ; rules used in indirect acquisition
  (ATTRIBUTES NIL)    ; attributes (see text)
  (OCCURRENCE NIL))   ; frequencies of occurrence

(DEFSTRUCT ATTRIBUTE
  "Structure of attribute frames"
  (VALUES NIL)        ; valid values
  (INTERACTION ""))   ; message to request value input

Variable:    FEVER
Values:      (MEMBER YES NO)
Interaction: "~&Fever? "
Pref:        DIRECT
Clin_act:    SYMPTOMS
Ind_source:  NIL
Attributes:  (KIND DURATION_IN_DAYS)
Occurrence:  ((ALL 61) (ANLL 33) (CML 76))

Variable:    THROMBOPAENIA
Values:      (MEMBER YES NO)
Interaction: "~&Thrombopaenia? "
Pref:        INDIRECT
Clin_act:    BLOOD_ANALYSIS
Ind_source:  (RI4)
Attributes:  NIL
Occurrence:  ((ALL 90) (ANLL 90))

Rule:  RI4
If:    T
Then:  (FUNCALL #'FUZZY_INTERPRETER THROMBOPAENIA PLATELET_COUNT)

Fig. 2. Clinical variable prototype and example.
(4) Clinical_action specifies the clinical source of the variable's value: physical examination, blood analysis, reported symptoms, etc.
(5) Indirect_source lists the local production rules to be used for indirect deduction of the value of the variable. Backward chaining is used in their application.
(6) Attributes is a list of subordinate parameters whose knowledge may or must supplement that of the main variable. These subordinate parameters are each endowed with a reduced prototype structure comprising a list of values and an interaction message. Fever, for example, has the attributes kind (continuous, intermittent, evening peaked) and duration in days.
(7) Occurrence describes the frequencies of occurrence of the variable in the various diseases.
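The seven slots described above can be mirrored in a few lines of Python. The following is only an illustrative sketch, not ONCOGAL's Lisp implementation: the class name `ClinicalVariable`, the methods `is_valid` and `needs_deduction`, and the reading of the occurrence figures as percentages are all assumptions made for the example; the slot contents are those listed in Figure 2.

```python
from dataclasses import dataclass, field


@dataclass
class ClinicalVariable:
    """Hypothetical Python counterpart of the VARIABLE prototype of Figure 2."""
    name: str
    values: tuple                 # valid values, used to validate keyed-in data
    interaction: str              # message shown to request a value
    pref: str = "DIRECT"          # "DIRECT" (ask the user) or "INDIRECT" (deduce)
    clin_act: str = ""            # clinical action that yields the value
    ind_source: tuple = ()        # names of rules used for indirect deduction
    attributes: tuple = ()        # subordinate parameters
    occurrence: dict = field(default_factory=dict)  # disease -> frequency (assumed %)

    def is_valid(self, value) -> bool:
        """Check a keyed-in value against the Values slot."""
        return value in self.values

    def needs_deduction(self) -> bool:
        """True when the preferred means of acquisition is indirect."""
        return self.pref == "INDIRECT"


# The two realizations shown in Figure 2.
fever = ClinicalVariable(
    name="FEVER", values=("YES", "NO"), interaction="Fever? ",
    pref="DIRECT", clin_act="SYMPTOMS",
    attributes=("KIND", "DURATION_IN_DAYS"),
    occurrence={"ALL": 61, "ANLL": 33, "CML": 76},
)
thrombopaenia = ClinicalVariable(
    name="THROMBOPAENIA", values=("YES", "NO"), interaction="Thrombopaenia? ",
    pref="INDIRECT", clin_act="BLOOD_ANALYSIS", ind_source=("RI4",),
    occurrence={"ALL": 90, "ANLL": 90},
)
```

With these definitions, `fever.is_valid("YES")` holds, while `thrombopaenia.needs_deduction()` signals that rule RI4 should be applied instead of asking the user.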
(DEFSTRUCT PATHOLOGY
  "Structure of pathology prototypes"
  (HI_ECHELON NIL)    ; pathology with this one as a variety
  (LO_ECHELON NIL)    ; varieties of this pathology
  (VARIABLES NIL)     ; relevant clinical variables
  (PATTERNS NIL)      ; significant clinical patterns
  (WEIGHTING_R NIL)   ; diagnostic hypothesis evaluation rules
  (IF_CONFIRMED NIL)  ; actions if pathology confirmed
  (CONTROL NIL))      ; defines local control automaton

Pathology:    ALL
Hi_echelon:   (LEUKOSIS)
Lo_echelon:   NIL
Variables:    ((SYMPTOMS_AND_SIGNS ((ANOREXIA 0) (FEVER 1) ...))
               (ANALYTICAL_DATA ((ANAEMIA 1) (THROMBOPAENIA 1) ...))
               ...)
Patterns:     (CP10 CP11 CP12 CP13 CP14 CP15 CP16 CP17 CP18 CP19 CP20)
              ; plus patterns inherited from LEUKOSIS
Weighting_R:  (RE7 RE8 RE9 RE10 RE11 RE12)
If_confirmed: (PROGN (FUNCALL #'PROGNOSIS ALL)
                     (FUNCALL #'PROTOCOLS ALL))
Control:      ((SYMPTOMS_AND_SIGNS 1 ((0.06 2)))
               (SYMPTOMS_AND_SIGNS+ANALYTICAL_DATA 2 ((.60 3)))
               (SYMPTOMS_AND_SIGNS+ANALYTICAL_DATA+MARROW_DATA 3 ((1. FIN))))

(DEFSTRUCT PATTERN
  "Structure of clinical pattern frames"
  (PHASE NIL)         ; automaton state in which the pattern is used
  (CONDITION NIL)     ; condition defining the pattern
  (CONFIRMED NIL))    ; significance of the pattern for diagnosis of the various pathologies

Pattern:   CP13
Phase:     1
Condition: (RAND (HAEMORRHAGE)
                 (ROR (PALLOR) (ANOREXIA) (FATIGUE) (STOMACH_ACHE)
                      (LOSS_OF_WEIGHT) (ACHING_BONES) (ACHING_JOINTS)))
Confirmed: ((ALL 22))

Pattern:   CP17
Phase:     2
Condition: (RAND (HEPATOMEGALIA) (SPLENOMEGALIA) (ADENOPATHY)
                 (COM 2 5 ANAEMIA THROMBOPAENIA LEUKOCYTOSIS LEUKOPAENIA HAEMORRHAGE))
Confirmed: ((ALL 90))

Fig. 3. Pathology prototype and example.
3.2. Pathology prototypes
Figure 3 lists the generic structure of a pathology prototype and its realization in the case of acute lymphoblastic leukemia (ALL). The structure comprises the following seven components:
(1) Higher_echelon specifies the pathology in which the current pathology is subsumed as a subclass. This allows inheritance of the wider entity's properties.
(2) Lower_echelon lists the varieties of the present pathology.
(3) Variables lists the clinical variables that are relevant to the pathology, grouped by the type of clinical procedure by which they are determined.
(4) Clinical_patterns lists a series of combinations of clinical variables, each of which is of diagnostic significance for the pathology. Not all the patterns associated with a pathology are explicitly listed, since those of the pathology's ancestors in higher echelon are inherited. The combination of variables at the core of the pattern is accompanied by numbers indicating its diagnostic weight and the phase(s) of the diagnostic process to which it is relevant.
(5) Weighting_rules lists a series of production rules used in evaluating the relative importance of competing diagnostic hypotheses.
(6) If_confirmed specifies the actions to be taken if the pathology is confirmed (warning of possible complications, evaluation of factors affecting prognosis, prescription of therapeutic regimens, etc.).
(7) Control dictates the strategy of the diagnostic process as far as the present pathology is concerned. This strategy is defined in terms of partial goals associated with the states of an automaton structure whose transition from one state to the next depends on the clinical data.
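The inheritance of clinical patterns through the Higher_echelon slot can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the system's own code: the class `Pathology`, the method `all_patterns` and the pattern name `CP1` attached to LEUKOSIS are hypothetical; only the names ALL, LEUKOSIS, CP13 and CP17 come from Figure 3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pathology:
    """Hypothetical, reduced counterpart of the PATHOLOGY prototype."""
    name: str
    hi_echelon: Optional["Pathology"] = None      # wider pathology, if any
    patterns: list = field(default_factory=list)  # patterns listed explicitly

    def all_patterns(self) -> list:
        """Explicit patterns plus those inherited from every ancestor."""
        inherited = self.hi_echelon.all_patterns() if self.hi_echelon else []
        return inherited + self.patterns


# "CP1" is an invented pattern name standing for whatever LEUKOSIS defines.
leukosis = Pathology("LEUKOSIS", patterns=["CP1"])
all_leukemia = Pathology("ALL", hi_echelon=leukosis, patterns=["CP13", "CP17"])
```

Here `all_leukemia.all_patterns()` yields the inherited pattern first and then the two listed explicitly, which is one simple way of realizing "plus patterns inherited from LEUKOSIS".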
4. Fuzzy knowledge representation
The clinician's diagnostic, prognostic and therapeutic judgements and decisions are all based on imperfect knowledge. Uncertainty affects both the clinical variables employed and the empirical relationships among them; and in the former case, the source of the uncertainty may be either imprecise measurement (as for a statement such as 'the value of fetal_hemoglobin is 15') or the fact that the values of the variable are essentially imprecise (as for a statement such as 'the value of fetal_hemoglobin_level is high'). The theory of fuzzy sets [15, 16] allows such uncertainty to be handled by computer programs. Fuzzy sets associated with the natural language clinical terms used by the doctor and with natural language expressions of degrees of belief can be fuzzily combined by means of operators for the conjoining of evidence and the global evaluation of all the evidence in favor of a given hypothesis [1-4].
4.1. Clinical predicates
In order to allow both numerical and non-numerical data to be handled in the same way, clinical variables are always used in ONCOGAL in clinical predicates of the form:
(variable_name evaluation)
When such predicates form part of the system's static knowledge (in the antecedent clause of various kinds of production rules, in the consequent of rules for calculating one clinical variable from others, and in clinical patterns), the evaluation is a value or a range of values, as in (fetal_hemoglobin 15) or (fetal_hemoglobin > 15). When they occur as part of the dynamic knowledge concerning a particular patient, the evaluation is a list of values together with the degree of certainty $\mu$ with which each is attributed to the patient, so that in this case the predicate takes the form:
(variable_name (value_1 $\mu$(value_1)) ... (value_N $\mu$(value_N)))
The degree of certainty is in fact, in fuzzy terms, the membership of the patient in the fuzzy set defined by the corresponding value, except that, for convenience, $\mu$ is allowed to take not only values in [0,1] but also the non-numerical value unknown. The dynamic predicate form shown above will be interpreted in what follows as a compact way of storing an array of clinical predicates, each with an associated degree of certainty:
((variable_name      value_1 $\mu$(value_1))
 (same_variable_name value_2 $\mu$(value_2))
 ...
 (same_variable_name value_N $\mu$(value_N)))
When entering data, the user may specify the degree of certainty of the data in the form of natural language labels that are themselves associated with fuzzy sets $F_i: [0,1] \to [0,1]$ and which the system translates into numerical $\mu$ values.
4.2. Fuzzy combination of clinical predicates
In the condition slots of clinical patterns and the antecedents of production rules, clinical predicates are combined by the fuzzy logical operators NOT, AND, OR, MIN and COM. Evaluation of whether a particular patient satisfies such compound conditions consists of using the functions defining these operators to calculate a combined degree of certainty from those of the individual predicates for that patient. As usual, $\mu(\mathrm{NOT}\ P)$ is defined as $1 - \mu(P)$, $\mu(P_1\ \mathrm{AND} \cdots \mathrm{AND}\ P_n)$ as $\min_i \mu(P_i)$ and $\mu(P_1\ \mathrm{OR} \cdots \mathrm{OR}\ P_n)$ as $\max_i \mu(P_i)$. $\mathrm{MIN}(n_1, n_2: P_1, \ldots, P_{n_2})$, which in non-fuzzy terms is true iff at least $n_1$ of the $n_2$ $P_i$'s are true, is defined, following Zadeh, by
\[ \mu(\mathrm{MIN}(n_1, n_2: P_1, \ldots, P_{n_2})) = \min U(n_1, n_2: P_1, \ldots, P_{n_2}), \]
where $U(n_1, n_2: P_1, \ldots, P_{n_2})$ is the set comprising the greatest $n_1$ values among $\mu(P_1), \ldots, \mu(P_{n_2})$. $\mathrm{COM}(n_1, n_2: P_1, \ldots, P_{n_2})$, which in non-fuzzy terms is true iff exactly $n_1$ of the $n_2$ $P_i$'s are true, is defined by
\[ \mu(\mathrm{COM}(n_1, n_2: P_1, \ldots, P_{n_2})) = \min\bigl(\mu(\mathrm{MIN}(n_1, n_2: P_1, \ldots, P_{n_2})),\ 1 - \mu(\mathrm{MIN}(n_1 + 1, n_2: P_1, \ldots, P_{n_2}))\bigr). \]
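The five operators can be written down directly from these definitions. The Python sketch below is illustrative only; the function names are invented, the handling of the value unknown is left out, and the boundary conventions for `f_min` (certainty 1 when no predicate is required, 0 when more predicates are required than exist) are assumptions needed to make COM well defined when $n_1 = n_2$.

```python
def f_not(mu):
    """mu(NOT P) = 1 - mu(P)."""
    return 1.0 - mu


def f_and(*mus):
    """mu(P1 AND ... AND Pn) = min of the individual degrees."""
    return min(mus)


def f_or(*mus):
    """mu(P1 OR ... OR Pn) = max of the individual degrees."""
    return max(mus)


def f_min(n1, mus):
    """'At least n1 of the predicates hold': smallest of the n1 greatest degrees."""
    if n1 <= 0:
        return 1.0          # assumption: requiring nothing is always satisfied
    if n1 > len(mus):
        return 0.0          # assumption: cannot have more hits than predicates
    return sorted(mus, reverse=True)[n1 - 1]


def f_com(n1, mus):
    """'Exactly n1 hold': at least n1, and not at least n1 + 1."""
    return min(f_min(n1, mus), 1.0 - f_min(n1 + 1, mus))


# Illustrative degrees of certainty for five predicates.
degrees = [0.9, 0.2, 0.7, 0.0, 0.4]
at_least_two = f_min(2, degrees)    # second greatest value
exactly_two = f_com(2, degrees)     # min(at_least_two, 1 - third greatest)
```

For crisp inputs the operators reduce to ordinary counting: with degrees `[1, 1, 0]`, `f_com(2, ...)` is 1 and `f_com(1, ...)` is 0.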
4.3. Fuzzy inferences and deductions
Even if the values of primary clinical variables were known with absolute certainty, uncertainty would still exist as to the reliability of the empirical rules applied to them to obtain diagnostic conclusions. ONCOGAL uses two kinds of inference rules, both of which allow for uncertainty in the transition from premises to conclusions. The values of non-immediate clinical variables are deduced by backward chaining in both cases.
4.3.1. Deduction using a fuzzy interpreter
Basic numerical data are generally assumed to be known for certain; uncertainty appears in the deductive process when qualitative data are deduced from numerical facts. An example is the deduction of the existence of thrombopenia from a platelet count; the platelet count is assumed to be reliable, but whether a given count amounts to thrombopenia depends in an ill-defined fashion on the characteristics of the patient. ONCOGAL handles this kind of situation by associating the values of the qualitative variable with fuzzy sets. Thrombopenia, for example, may be defined by either of the fuzzy sets shown in Figure 4. The function shown in Figure 4a is $1 - S(x; 20, 55, 90)$, where $x$ is the platelet count and $S$ is one of Zadeh's standard functions [15], defined by
\[
S(x; a, b, g) =
\begin{cases}
0, & x \le a, \\
2\left[(x - a)/(g - a)\right]^2, & a < x \le b, \\
1 - 2\left[(x - g)/(g - a)\right]^2, & b < x \le g, \\
1, & x > g.
\end{cases}
\]
Such a smooth function is generally not necessary, however, and it is sufficient to use a function such as that of Figure 4b, defined by a cutoff value $x_c = 20$ and a slope $s = -1/70$. Any available knowledge as to how the patient's characteristics affect the relationship between the quantitative and qualitative variables can be included in the above deductive mechanism by varying the parameters defining the fuzzy sets ($a$, $b$ and $g$, or $x_c$ and $s$). In particular, in pediatric oncology it is common for age and sex to have a bearing on the interpretation of data. The thrombopenia functions shown above, for example, are valid for patients older than 16 years.
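Both membership functions for thrombopenia can be sketched in Python as follows. The function names are invented for the illustration, and the reading of the linear variant (membership 1 up to the cutoff $x_c$, then decreasing with slope $s$ until it reaches 0) is an assumption consistent with the parameters quoted above.

```python
def zadeh_s(x, a, b, g):
    """Zadeh's standard S function with parameters a <= b <= g."""
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (g - a)) ** 2
    if x <= g:
        return 1.0 - 2.0 * ((x - g) / (g - a)) ** 2
    return 1.0


def thrombopenia_smooth(platelet_count):
    """Membership 1 - S(x; 20, 55, 90) quoted in the text."""
    return 1.0 - zadeh_s(platelet_count, 20.0, 55.0, 90.0)


def thrombopenia_linear(platelet_count, xc=20.0, s=-1.0 / 70.0):
    """Assumed piecewise-linear form: 1 below xc, then slope s, clipped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 + s * (platelet_count - xc)))


# Both variants agree at the cutoff, the midpoint and the upper end.
samples = [(x, thrombopenia_smooth(x), thrombopenia_linear(x)) for x in (20, 55, 90)]
```

Adapting the interpretation to a different age group would then amount to passing other values for `a`, `b`, `g` or for `xc` and `s`.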
4.3.2. Deduction using production rules
ONCOGAL uses production rules of the conventional form
$R_{Ik}$ = 'IF condition THEN clinical_predicate'
whenever the deduction of a clinical predicate can be concisely expressed in this way, as for the rule 'if erythrocyte count is low or the hemoglobin is low, then the patient has anemia'. The reliability of each such rule is expressed by an associated degree of certainty $\mu_{R_{Ik}}$, and the degree of certainty of the clinical predicate deduced is calculated as $\min(\mu_{R_{Ik}}, \mu(\mathrm{condition}))$.
Fig. 4. Fuzzy sets defining thrombopenia. (a) with linear function. (b) with Zadeh's S function. PLT-COUNT stands for platelet count. In both panels the vertical axis is the membership $\mu$ (0.0 to 1.0) and the horizontal axis is PLT-COUNT, with ticks at 20, 90 and 200.
All the rules concluding values of a given clinical variable are listed in the indirect_source slot of that variable's prototype frame. When more than one rule affects the same clinical predicate, the joint degree of certainty of the latter is calculated as the greatest of the degrees of certainty derived from the individual rules.
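The min/max scheme for production rules is small enough to show in full. In this Python sketch the function names, the rule reliabilities (0.9 and 0.5) and the input degrees are hypothetical values chosen for illustration; returning 0 when no rule applies is likewise an assumption.

```python
def rule_certainty(mu_rule, mu_condition):
    """Certainty of the conclusion of one rule: min(rule reliability, condition)."""
    return min(mu_rule, mu_condition)


def joint_certainty(certainties):
    """Several rules concluding the same predicate: keep the greatest certainty."""
    return max(certainties, default=0.0)   # assumption: 0 when no rule applies


# 'If erythrocyte count is low or the hemoglobin is low, then anemia.'
mu_erythrocytes_low = 0.4                  # hypothetical patient data
mu_hemoglobin_low = 0.7
mu_condition = max(mu_erythrocytes_low, mu_hemoglobin_low)   # fuzzy OR

from_first_rule = rule_certainty(0.9, mu_condition)   # hypothetical reliability 0.9
from_second_rule = rule_certainty(0.5, 1.0)           # a second, weaker hypothetical rule
mu_anemia = joint_certainty([from_first_rule, from_second_rule])
```

In this example the first rule dominates, so the predicate (anemia yes) is stored with certainty 0.7.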
4.4. Fuzzy relation data-diseases
Fuzzy relations data-disease are handled using frequencies of occurrence clinical_variable-disease and degrees of confirmation clinical_pattern-disease [1-4].
5. Medical evidence combination and inference mechanism
The basic clinical problem in the diagnosis of many pathologies is that determination of the definitive pathognomic data (in the case of leukosis, the presence of lymphoblast in bone marrow) is either expensive or puts the patient at risk. As a result, as much diagnostic information as possible must be squeezed from more readily available data so as to avoid unnecessary costs in money or discomfort. The diagnosis of acute lymphoblastic leukemia, for example, involves in practice not only the pathognomic variables lymphoblast_in_peripheral_blood and lymphoblasts_in_bone_marrow_over_25%, but also some 18 other YES/NO variables representing different possible features of the disease. In what follows we shall term such features diagnosons.
5.1. Evaluation of diagnostic hypotheses using clinical patterns
Since not all diagnosons are invariably present in a given case, since not all can be considered in the early stages of diagnosis, and since many, if not most, may be individually relevant to other pathologies, the practising clinician may work by deciding whether the available information for a given patient fits any of a number of clinical patterns that the suspected pathologies are known to induce among the diagnosons relevant to the current stage of the diagnostic process. Since a complete set of clinical patterns covering all possible eventualities would be prohibitively large, the problem is to find a sufficiently small set that nevertheless reduces errors to within the limits of current medical knowledge. Essentially, this is attempted by grouping together patterns which have similar diagnostic weights for the pathologies considered and whose component diagnosons are mostly the same. In particular, patterns with small diagnostic weight may be eliminated altogether. Other variations of this reduction strategy can be formalized as follows by considering the clinical patterns as initially expressed as strings of diagnosons linked by AND and NOT operators:
(a) If a diagnoson d1 is considered to have but slight importance given d2 and d3, then the diagnostic weight of pattern1 = (d1 AND d2 AND d3) will be close to that of pattern2 = (NOT d1 AND d2 AND d3). Patterns 1 and 2 may therefore be eliminated and replaced by pattern3 = (d2 AND d3).
(b) Patterns with similar weights may be combined by using additional logical operators such as OR, MIN or COM. Thus the patterns (d1 AND d2) and (d1 AND d3) can be merged as (d1 AND (d2 OR d3)).
Once the set of clinical patterns has been established, the evidence in favor of the various pathologies in a given case can be quantified using their diagnostic weights. 
In ONCOGAL this is done by adapting the conventional use of fuzzy weighting functions [2] to ONCOGAL's prototype-based knowledge representation scheme. Specifically, for each of the competing diagnostic hypotheses (pathologies) $D_j$ that occupy terminal positions in the current pathology tree (defined by the higher_echelon and lower_echelon slots), a degree_of_confirmation $\mu_Y$ and a degree_of_exclusion $\mu_N$ are calculated using the formulae
\[ \mu_Y = \max_i \bigl( \min( \mu_p(p_t, \mathrm{pattern}_i),\ \mu_c(\mathrm{pattern}_i, D_j) ) \bigr) \]
and
\[ \mu_N = \max_i \bigl( \min( 1 - \mu_p(p_t, \mathrm{diagnoson}_i),\ \mu_o(\mathrm{diagnoson}_i, D_j) ) \bigr), \]
where $p_t$ indicates the patient, $\mu_p$ is the degree of certainty of the patient's fitting a given diagnoson or pattern, and $\mu_o$ and $\mu_c$ are respectively the degree of occurrence of a diagnoson in pathology $D_j$ and the diagnostic weight of a pattern for pathology $D_j$. For a non-terminal pathology, $\mu_Y$ is the greatest of the $\mu_Y$ of its subtypes, and $\mu_N$ the least of the $\mu_N$ of its subtypes. A pathology is deleted from the tree for the current patient if $\mu_N = 1$ or if $\mu_Y$ is less than the threshold for the current phase of the diagnostic process; it is regarded as confirmed if $\mu_Y = 1$; and it is maintained as an open hypothesis if $\mu_Y$ is less than 1 but greater than or equal to the current threshold. Thresholds are defined in the control slot of each pathology and correspond to a particular state of knowledge as defined by the clinical variables that have been determined. They are progressively higher for successive phases of the diagnostic process. Confirmed hypotheses are announced as such; in their absence, the open hypothesis with the greatest $\mu_Y$ is suggested as being currently the best.
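The max-min evaluation of a terminal hypothesis and the three-way decision can be sketched as follows. This Python fragment is illustrative only: the function names are invented, patterns that have not been evaluated are assumed to fit with degree 0, diagnosons whose certainty is unknown are assumed to be skipped when computing exclusion, the figures 22/90 and 61/90 from Figures 2 and 3 are read as weights 0.22/0.90 and occurrences 0.61/0.90, and the patient degrees and thresholds are made up.

```python
def degree_of_confirmation(pattern_fit, pattern_weights):
    """max over patterns of min(patient's fit to the pattern, diagnostic weight)."""
    return max((min(pattern_fit.get(p, 0.0), w)
                for p, w in pattern_weights.items()), default=0.0)


def degree_of_exclusion(diagnoson_fit, occurrence):
    """max over known diagnosons of min(1 - presence, degree of occurrence)."""
    return max((min(1.0 - diagnoson_fit[d], o)
                for d, o in occurrence.items() if d in diagnoson_fit), default=0.0)


def hypothesis_status(mu_y, mu_n, threshold):
    """Deleted, confirmed or open, following the decision rule in the text."""
    if mu_n == 1.0 or mu_y < threshold:
        return "deleted"
    if mu_y == 1.0:
        return "confirmed"
    return "open"


weights_all = {"CP13": 0.22, "CP17": 0.90}               # assumed scaling of Fig. 3
occurrence_all = {"FEVER": 0.61, "THROMBOPAENIA": 0.90}  # assumed scaling of Fig. 2
patient_patterns = {"CP13": 0.8, "CP17": 0.5}            # hypothetical patient
patient_diagnosons = {"FEVER": 1.0, "THROMBOPAENIA": 0.7}

mu_y = degree_of_confirmation(patient_patterns, weights_all)
mu_n = degree_of_exclusion(patient_diagnosons, occurrence_all)
status = hypothesis_status(mu_y, mu_n, threshold=0.4)    # hypothetical threshold
```

With these numbers the hypothesis ALL stays open; raising the phase threshold above 0.5 would delete it from the tree.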
5.2. Evaluation of hypotheses using evidence weighting rules
The main drawback of the clinical pattern approach described above is the danger of oversimplification in reducing the set of patterns to a manageable size. The replacement of several patterns with different diagnostic weights by a single wider pattern limits the sensitivity of the method, especially when the pattern's fuzzy nature is not taken into account during the reduction process, and great demands are put upon the medical staff collaborating in the installation of the knowledge base. Misunderstanding between the doctor and the knowledge engineer may lead, for example, to 'd1 and d2' being installed as (d1 AND d2) instead of (d1 AND d2 AND NOT d3) when the NOT d3 part of the latter is not insignificant. As a result of these difficulties, important patterns may be left out of the knowledge base and, conversely, sub-patterns ORed into larger patterns may be given undue weight (in which case the max-min rule for $\mu_Y$ may lead to undue weighting of one of the competing hypotheses). It is in fact found that during the intermediate stages of diagnosis, when pathognomic data are still not available, the clinical pattern method of Section 5.1 often announces a 'best hypothesis' that differs significantly from the human expert's opinion; as a result, the system may give priority to obtaining further data that are irrelevant to the pathology regarded as most likely by the clinician.
The problems associated with the use of clinical patterns may be obviated by using a hypothesis evaluation function defined, like $\mu_N$ above, in terms of individual diagnosons rather than patterns, but which takes into account the generally non-linear importance of combinations of diagnosons. Ad hoc functions of this kind have been used in the expert systems PIP [11], CENTAUR [5, 6], CAEMF [8] and CADIAG-2 [1, 2, 3, 4]. 
CADIAG-2, for example, uses the evaluation function
\[ PN = \sum_i \bigl[ a \cdot \min(\mu_p(p_t, d_i), \mu_o(d_i, D_j)) + b \cdot \min(\mu_p(p_t, d_i), \mu_c(d_i, D_j)) \bigr] \]
where $\mu_c$ is now the diagnostic weight of the 'pattern' consisting of a single diagnoson, and $a$ and $b$ can be varied (subject to $a + b = 1$) so as to adjust the relative importance of degrees of occurrence and diagnostic weights; the best hypothesis is chosen to be that with the greatest value of $PN$. Unfortunately, the use of such functions makes it virtually impossible for the user to find out just why (in terms of diagnosons satisfied) one hypothesis is preferred to others. This both hampers the development of the system and leads to users treating the system with extreme wariness.
The alternative method used in ONCOGAL is based upon the grouping of diagnosons, not in clinical patterns, but in levels of diagnostic significance referred to by a set of evaluation rules of the form
$RE_k$: IF $q_{i1}$(level 1) AND $q_{i2}$(level 2) AND ... AND $q_{iN}$(level N) THEN accept hypothesis $D_j$ with confidence $\mu_{RE_k}$
where each $q_i$ is one of the fuzzy quantifiers listed in Table 1 (the fuzzy quantifier 'almost all' is shown in Figure 5) and $\mu_{RE_k}$ is the support for $D_j$ provided by $E_k$, the set of diagnosons considered in the rule. A typical evaluation rule, in natural language form, is
$RE_k$: IF almost all Level 0 diagnosons are present AND some Level 1 diagnosons are present THEN diagnose acute lymphoblastic leukaemia with confidence $\mu_{RE_k}$
The fuzzy quantifiers applied to the diagnosons $d_i$ of a given level $N = (d_1, d_2, \ldots, d_{n(N)})$ make reference to the proportion of data included in that level that are present in a given patient. We use the concept of fuzzy set cardinality as a measure of the number of elements [15] to normalize the quantifiers, in order to make them independent of the number of elements of each level. Given a level $N = (d_1, d_2, \ldots, d_{n(N)})$ characterized by $n(N)$ elements and a fuzzy set $M = \mu_M(d_1)/d_1 + \cdots + \mu_M(d_{n(N)})/d_{n(N)}$ defined by the degrees of certainty associated with each diagnoson in the present patient, $\mu_M(d_i) = \mu_p(d_i)$, the cardinality of $M$ is
\[ \|M\| = \sum_{i=1}^{n(N)} \mu_M(d_i), \]
being a measure of the number of elements in level $N$ that appear in patient $p$. The normalization into the interval [0, 1] is made by dividing the cardinality of $M$ by the number of elements $n(N)$ that constitute the level.

Table 1. Fuzzy quantifiers used in hypothesis evaluation rules
Quantifier    Fuzzy set
All           $\forall x \ne 1,\ \mu(x) = 0;\ \mu(1) = 1$
Almost all    $S(x; 0.50, 0.70, 0.90)$
Many          $\min[S(x; 0.40, 0.55, 0.70),\ 1 - S(x; 0.70, 0.80, 0.90)]$
Some          $\min[S(x; 0.20, 0.35, 0.50),\ 1 - S(x; 0.50, 0.65, 0.80)]$
Few           $\min[S(x; 0.00, 0.15, 0.30),\ 1 - S(x; 0.30, 0.45, 0.60)]$
Hardly any    $1 - S(x; 0.10, 0.30, 0.50)$
None          $\forall x \ne 0,\ \mu(x) = 0;\ \mu(0) = 1$

Fig. 5. The fuzzy quantifier 'almost all'. The vertical axis is the membership $\mu$ (0.0 to 1.0); the horizontal axis is the normalized cardinality of the level, $\|\mathrm{level}\|/n$, with ticks at 0.5, 0.9 and 1.

The degree of certainty of the patient's complying with the antecedent as a whole, $\mu_p(p_t, E_k)$, is calculated from those of the clauses for each level using standard fuzzy logic. The final support for $D_j$ is defined as
\[ \mu_Y(p_t, D_j) = \max_k \bigl( \min( \mu_p(p_t, E_k),\ \mu_{RE_k} ) \bigr). \]
With this method, the system's choice of a particular hypothesis as the best can readily be explained to the user by displaying the natural language form of the evaluation rules employed together with the list of diagnosons present or absent and their significance levels.
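The evaluation of one such rule can be traced numerically with a short Python sketch. It uses the 'almost all' and 'some' entries of Table 1 and combines the level clauses with min, taken here as the standard fuzzy AND; the function names, the patient degrees and the rule confidence 0.8 are hypothetical.

```python
def zadeh_s(x, a, b, g):
    """Zadeh's standard S function."""
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (g - a)) ** 2
    if x <= g:
        return 1.0 - 2.0 * ((x - g) / (g - a)) ** 2
    return 1.0


def almost_all(x):
    return zadeh_s(x, 0.50, 0.70, 0.90)


def some(x):
    return min(zadeh_s(x, 0.20, 0.35, 0.50), 1.0 - zadeh_s(x, 0.50, 0.65, 0.80))


def normalized_cardinality(degrees):
    """||M|| / n(N): sum of the degrees of certainty over the level size."""
    return sum(degrees) / len(degrees)


def rule_support(clauses, mu_rule):
    """min over the quantified level clauses, then min with the rule confidence."""
    antecedent = min(q(normalized_cardinality(d)) for q, d in clauses)
    return min(antecedent, mu_rule)


def final_support(supports):
    """mu_Y for a hypothesis: greatest support among its evaluation rules."""
    return max(supports, default=0.0)


level_0 = [1.0, 0.9, 0.8]          # hypothetical certainties of Level 0 diagnosons
level_1 = [0.5, 0.2, 0.0, 0.7]     # hypothetical certainties of Level 1 diagnosons
support = rule_support([(almost_all, level_0), (some, level_1)], mu_rule=0.8)
mu_y = final_support([support])
```

Here the Level 0 clause is fully satisfied (normalized cardinality 0.9), the Level 1 clause is satisfied to degree 0.5 (normalized cardinality 0.35), and the rule therefore supports the hypothesis with 0.5. Because each intermediate value is tied to a named level and quantifier, the same trace can be shown to the user as an explanation.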
6. The database
All the data are stored in a database whose information is presented to the user structured in masks. Each mask includes a set of information items which are somehow related and which appear to the user to be treated as a block. For example we have Patient's antecedents, Symptoms (pain, tumor, epistaxis, hemorrhage disorders, fever, vomiting, intestinal disorders, ...) and Physical examination findings, as well as Hematological data and Chemical analysis results. In the last case we focus on the target organs, looking for estimators of functional degree and residual function in liver and kidney, as well as on Homeostasis descriptors (potassium, glucose, calcium and acid-base equilibrium).
Not all of these data need a fuzzy description. The masks therefore contain non-fuzzy entities, directly coded fuzzy data and symptoms, and fuzzy representations of quantitative data obtained via a fuzzy interpreter that generates the set of membership functions associated with each datum value, in the manner used by the Vienna team [1-4, 16].

The Hospital General de Galicia database for oncology in the young distinguishes between the file structure and the software structure [7]. The file structure comprises several files. HOSPITALNUMBERFILE (HNF), NAMEFILE (NF) and PATIENTDATAFILE (PDF) constitute the surface structure and contain the patient data in the usual way. Information about each patient has been split at this level into three files in order to speed up information retrieval; that is, HOSPITALNUMBERFILE and NAMEFILE function as index files. A patient's record therefore consists of three records, one per file, each with its own fields. The character arrays in the files encode numeric items and flags for the fuzzy interpreters.

As for the software structure, the database is modular, with two main programs, each of which can call a chain of functional modules, some shared and some specific to it. The IRD (Insertion/Retrieval/Deletion) program handles the interface between the user and the data (Figure 6). In this program the information is structured in masks, each presenting a menu of patient data according to the segments previously defined; each segment gathers information on patient history, patient examination, requested laboratory investigations, etc. Figure 7 shows some of these masks, corresponding to Physical examination III, Hematology and Chemical analysis. Where the information is coded, the codes are shown in the masks. The program also includes complete keyboard and display control to make operations fully interactive and clear to the clinician. Data are fully validated at input time.
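The role of HOSPITALNUMBERFILE and NAMEFILE as index files over PATIENTDATAFILE can be sketched as follows. This is only an in-memory illustration in Python, not the Pascal/Fortran implementation of the system: the class, its method names and the sample record are hypothetical, and the sketch assumes for simplicity that patient names are unique.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SurfaceStructure:
    """In-memory stand-in for the three surface files (hypothetical layout)."""
    # HNF: hospital number -> position of the record in the patient data file.
    hospital_number_file: Dict[str, int] = field(default_factory=dict)
    # NF: patient name -> position of the record in the patient data file.
    name_file: Dict[str, int] = field(default_factory=dict)
    # PDF: the patient records themselves, stored as coded character fields.
    patient_data_file: List[Dict[str, str]] = field(default_factory=list)

    def insert(self, hospital_number: str, name: str, data: Dict[str, str]) -> int:
        """Append a record and register it in both index files."""
        if hospital_number in self.hospital_number_file:
            raise ValueError(f"hospital number {hospital_number!r} already exists")
        position = len(self.patient_data_file)
        self.patient_data_file.append(dict(data))
        self.hospital_number_file[hospital_number] = position
        self.name_file[name] = position
        return position

    def by_hospital_number(self, hospital_number: str) -> Optional[Dict[str, str]]:
        """Look a record up through the hospital number index."""
        position = self.hospital_number_file.get(hospital_number)
        return None if position is None else self.patient_data_file[position]

    def by_name(self, name: str) -> Optional[Dict[str, str]]:
        """Look a record up through the name index."""
        position = self.name_file.get(name)
        return None if position is None else self.patient_data_file[position]


# Placeholder record, for illustration only.
db = SurfaceStructure()
db.insert("0001", "EXAMPLE PATIENT", {"HGB(g/dl)": "12.0"})
print(db.by_hospital_number("0001"))
print(db.by_name("EXAMPLE PATIENT"))
```

Either index resolves to a position in the data file without scanning the records, which is the retrieval speed-up that motivates splitting the patient information across three files.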
Fig. 6. IRD program. (Block diagram: MAIN program; INSERTION, RETRIEVAL and DELETION modules; MASK modules; HOSPITAL NUMBER FILE, NAME FILE and PATIENT DATA FILE.)
PHYSICAL EXAMINATION III

*TENDON REFLEXES
  SUPERIOR RIGHT, SUPERIOR LEFT, INFERIOR RIGHT, INFERIOR LEFT
  Codes: N-Normal, I-Increased, D-Decreased, - No present
RIGHT SENSITIVITY, LEFT SENSITIVITY
  Codes: N-Normal, H-Hypersensitive, P-Hyposensitive, - No sensitive
RIGHT BABINSKI, LEFT BABINSKI, ROMBERG, ATAXIA, DYSMETRIA, HYPOTONIA
*CRANIAL NERVES DISORDERS
  NUM 1, NUM 2, NUM 3, NUM 4, NUM 5, NUM 6, NUM 7, NUM 8, NUM 9, NUM 10, NUM 11, NUM 12
  Codes: U-Unilateral, B-Bilateral, - No Disorders
*PUPIL
  SIZE RIGHT, SIZE LEFT
  Codes: M-Midriatic, O-Miotic, N-Normal
  LEUKOCORIA, ISOCORIA, LIGHT RESPONSE(R), LIGHT RESPONSE(L), CONSENSUAL REACT

HAEMATOLOGY

BLOOD GROUP, RH, RBC(x10^6), PCV(%), HGB(g/dl), MCV(u^3), MCH(x10^9g), MCHC(g/dl), RETICS(%), WBC(x10^3), NEUTROPHILS(%), LYMPHOCYTES(%), MONOCYTES(%), EOSINOPHILS(%), BASOPHILS(%), PLT(x10^3), E.S.R. 1st HOUR(mm/hour), E.S.R. 2nd HOUR(mm/hour)
*ELECTROPHORETIC SPECTRUM
  TOTAL PROTEIN(gr/l), ALBUMIN(gr/l), ALPHA GLOBULIN, BETA1 GLOBULIN, BETA2 GLOBULIN, GAMMA GLOBULIN
IGG(mg/dl), IGA(mg/dl), IGM(mg/dl), ALPHA1 ANTYTRIPSIN(mg/dl), PROTEIN C-REACT
COAGULATION
  Codes: T-Thrombocytopenia, C-Factors consumption, B-Both, O-Other disorders, - Normal

CHEMICAL ANALYSIS

*KIDNEY FUNCTION
  BLOOD: Sodium(meq/l), Urea(mg/dl), Creatine(mg/dl), Osmolality(mosm/l), Packed cell volume(%), Cloride(meq/l), Uric acid(mg/dl)
  URINE: Urea(mg/dl), Osmolality(mosm/l), Sodium(meq/l), Albumin(gr/l), Volume(ml)
*POTASIUM HOMEOSTASIS(meq/l)
  Blood K, Urine K
*GLUCOSE HOMEOSTASIS
  Glucose(blood)(mg/dl), Glucose(urine)(mg/dl), Acetone(urine)
*LIVER FUNCTION
  Albumin(gr/l), GOT-AST(i.u./l), GPT-ALT(i.u./l), GGPT(i.u./l), Bilirubin(mg/dl), Total protein(gr/l)
*ACID-BASE HOMEOSTASIS
  Bicarbonate(meq/l), pH, pCO2(mmHg), pO2(mmHg), Base excess
*CALCIUM HOMEOSTASIS
  BLOOD: Calcium(mg/dl), Phosphate(mg/dl), Alk. Phos(i.u./l), Magnesium(mg/dl)
  URINE: Calcium, Phosphate, Magnesium

Fig. 7. Some examples of masks.
Fig. 8. APPLICATION program. (Block diagram: APPLICATION main program with slave modules EDIT, CALC, SEARCH, LIST and DIAG; HOSPITAL NUMBER FILE and DATA FILE; deep structure data handling.)

The other main program, APPLICATION, is used to extract information from the database. It also has a modular structure (Figure 8), including relations with the deep structure and a logical zoom for analyzing target sets of patients in order to establish the incidence of certain data items on other data items. A special natural-like language has been developed for communication with the database. This language ensures that every possible set of patients can be selected. The module TRANSFORM first translates the natural-like language into a code the program can interpret, and the module SEARCH then searches for these patients through all the data. Some statistical facilities are also included (CALC). All the software has been implemented in AOS/VS Pascal and Fortran 77 running on a Data General MV 4000 computer, with links from the LISP used in the expert system.

References
[1] K.P. Adlassnig et al., Fuzzy medical diagnosis in a hospital, in: H. Prade and C.V. Negoita, Eds., Fuzzy Logic in Knowledge Engineering (Verlag TÜV Rheinland, 1986) 275-294.
[2] K.P. Adlassnig et al., CADIAG: an approach to computer assisted medical diagnosis, Comput. Biol. Med. 15 (1985) 315-335.
[3] K.P. Adlassnig and G. Kolarz, CADIAG-2: computer assisted medical diagnosis using fuzzy subsets, in: Approximate Reasoning in Decision Analysis (North-Holland, Amsterdam, 1982) 219-247.
[4] K.P. Adlassnig et al., Present state of the medical expert system CADIAG-2, Methods Inform. Med. 24 (1985) 13-20.
[5] J. Aikins, Prototypes and production rules: an approach to knowledge representation for hypothesis formation, in: Proc. IJCAI-79 (1979) 1-3.
[6] J. Aikins, A representation scheme using both frames and rules, in: B. Buchanan and E. Shortliffe, Eds., Rule-Based Expert Systems (Addison-Wesley, Reading, MA, 1984) 424-440.
[7] J. Bradley, File and Database Techniques (Holt-Saunders, New York, 1982).
[8] R. Matin, Un sistema experto para el diagnostico y tratamiento anteparto del estado materno-fetal, Thesis, Universidad de Santiago (1987).
[9] L.P. Miller and R.D. Miller, The pediatrician's role in caring of the child with cancer, Pediatric Clin. N. Am. 31 (1) (1984) 119-131.
[10] J. Mira et al., Formal specifications in data capture for an expert system in oncology, in: Proc. 2nd IFSA Congress, Tokyo (1987) 392-395.
[11] S. Pauker et al., Towards the simulation of clinical cognition: taking a present illness by computer, Am. J. Med. 60 (1976) 981-986.
[12] Registro nacional de tumores infantiles: estadisticas básicas del RNTI.2 (1980-1984), Monografias sanitarias, Conselleria de Sanitat i Consum, Generalitat Valenciana.
[13] D. Smith and J. Clayton, Another look at frames, in: B. Buchanan and E. Shortliffe, Eds., Rule-Based Expert Systems (Addison-Wesley, Reading, MA, 1984) 441-452.
[14] W.W. Sutow, General aspects of childhood cancer, in: W.W. Sutow, T.J. Vietti and D.J. Fernbach, Eds., Clinical Pediatric Oncology, 2nd edition (C.V. Mosby, St. Louis, MO, 1977) 1-15.
[15] L.A. Zadeh, A theory of approximate reasoning, in: J.E. Hayes, D. Michie and L.I. Mikulich, Eds., Machine Intelligence 9 (Wiley, New York, 1979) 149-194.
[16] L.A. Zadeh, Outline of a new approach to the analysis of complex systems and processes, IEEE Trans. Systems, Man Cybernet. (1973) 28-44.