Expert Systems with Applications 39 (2012) 11782–11791
Contents lists available at SciVerse ScienceDirect
Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa
Improving medical decision trees by combining relevant health-care criteria Joan Albert López-Vallverdú ⇑, David Riaño, John A. Bohada Research Group on Artificial Intelligence (BANZAI), Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007 Tarragona, Spain
Article info
Keywords: Medical decision making; Decision trees; Background knowledge
Abstract
Through the years, decision trees have been widely used both to represent and to conduct decision processes. They can be automatically induced from databases using supervised learning algorithms which usually aim at minimizing the size of the tree. When inducing decision trees in a medical setting, the induction process should consider the background knowledge used by health-care professionals to make decisions, in order to produce decision trees that are medically and clinically comprehensible and correct. Comprehensibility measures the medical coherence of the sequence of questions represented in the tree, and correctness rates how irrelevant the errors of the decision tree are from a medical or clinical point of view. Some algorithms partially solve these problems by pursuing alternative objectives such as reducing the economic cost or improving the adherence of the decision process to medical standards. However, from a clinical point of view, none of these criteria is valid when considered alone, because real medical decisions are taken by attending simultaneously to a combination of them and of other health-care criteria. Moreover, this combination of criteria is not static and may vary if the decision tree is made for different purposes such as screening, diagnosing, prognosing or drug and therapy prescription. In this paper, a decision tree induction algorithm that uses combinations of health-care criteria is presented and used to generate decision trees for screening and diagnosing in four medical domains. The mechanisms to formalize and to combine these criteria are also presented. The results have been analyzed from both a statistical and a medical point of view, and they suggest that our algorithm obtains decision trees that physicians evaluated as more comprehensible and correct than the decision trees obtained by previous approaches, while keeping an equivalent accuracy. © 2012 Elsevier Ltd. All rights reserved.
⇑ Corresponding author. Tel.: +34 977558516; fax: +34 977559710. E-mail addresses: [email protected] (J.A. López-Vallverdú), david.riano@urv.net (D. Riaño), [email protected] (J.A. Bohada).
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2012.04.073

1. Introduction
In medicine, decision processes may be of several kinds and serve different purposes (Fauci et al., 2009): screening, diagnosing, prognosing, drug and therapy prescription, etc. Through the years, multiple computer-based structures have been proposed to formalize these decision processes. They range from statistical approaches such as Bayesian Networks (Arsene, Dumitrache, & Mihu, 2011; Lucas, van der Gaag, & Abu-Hanna, 2004; Velikova, de Carvalho Ferreira, & Lucas, 2007) or probabilistic models (Husmeier et al., 2004) to symbolic approaches such as decision trees (Chapman & Sonnenberg, 2003; Podgorelec, Kokol, Stiglic, & Rozman, 2002), decision tables (Shiffman, 1997) or decision rules (Clark & Niblett, 1989; Yeh, Cheng, & Chen, 2011). Among them, decision trees have been particularly successful and widely used both to represent and to conduct decision processes. Medical decision trees can be provided by experts (Candell Riera, 2003; Fauci et al., 2009) or automatically induced from medical databases (Ling, Yang, Wang, & Zhang, 2004; López-Vallverdú, Riaño, & Collado, 2007; Quinlan, 1986). In
computer science, three of the most cited algorithms to induce decision trees are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and C5.0 (Quinlan, 2003). They aim at minimizing the size of the tree and therefore shortening the decision process while maintaining the quality of the final decision. The main drawback of these algorithms is that the final trees only consider the information that can be extracted from the medical databases, so they do not necessarily satisfy medical and clinical comprehensibility and correctness. Comprehensibility is a measure of the medical coherence of the sequence of questions of the decision processes represented in the tree according to the health-care experts (e.g., asking for the age of the patient before obtaining the thyroid-stimulating hormone value can be accepted in a patient screening process but not in diagnosing thyroid malfunctions). Correctness rates how irrelevant the errors of the decision tree are from a medical or clinical point of view (e.g., the medical error of sending a patient to the Intensive Care Unit rather than to a general hospital floor is lower than that of sending him home by mistake). Providing decision mechanisms that are efficient but also comprehensible and correct is a priority in medical decision making. In the past, some approaches to the induction of medical decision trees have pursued alternative objectives such as reducing the
economic cost (Chai, Deng, Yang, & Ling, 2004; Ling et al., 2004; Ling, Sheng, & Yang, 2006) or improving the adherence (Horning, Hoehns, & Doucette, 2007) of the decision process to medical standards (López-Vallverdú et al., 2007). However, these approaches do not guarantee medical comprehensibility and correctness. On the one hand, none of the previous criteria on the length, the economic cost and the adherence to clinical standards is useful when it is considered alone, because real medical decisions are taken attending not only to these criteria but also to many others that are combined, simultaneously. On the other hand, the induction of decision trees with those criteria cannot differentiate among the different possible application purposes. This differentiation is important because, for example, a comprehensible and correct decision tree for diagnosis can be completely wrong for screening purposes. In this context, in Section 2 we formalize the concept of medical decision process. In Section 3 we propose the mechanisms to formalize medical criteria in order to include them in a decision tree induction algorithm, and in Section 4 we propose a methodology to combine them. In Section 5 we present a general algorithm to induce decision trees, identifying the points where medical decision criteria can be introduced as background knowledge. These are called choice points. In Section 6, the measures of accuracy, comprehensibility and correctness for the evaluation of the induced decision trees are formalized. The inductive algorithm is used in Section 7 to generate decision trees for the purposes of screening and diagnosing in four medical domains. The results have been analyzed from a statistical and a medical point of view, and the conclusions reported in Section 8.
2. Formalizing a decision process
In medicine there are many different descriptions of what a decision process is (Fauci et al., 2009), so the concept of medical decision process must be defined for this paper. Here, a decision process is a sequence of medical questions or observations that leads to a concrete medical decision. In a particular domain, if $Q = \{q_1, q_2, \dots, q_m\}$ is the set of valid questions, $D = \{d_1, d_2, \dots, d_n\}$ the set of possible decisions and $q_i(p)$ the answer to the question $q_i \in Q$ for a certain patient $p$, then the finite sequence $(q_{i_1}(p), q_{i_2}(p), \dots, q_{i_k}(p), d_p)$ represents a medical decision process for patient $p$ in which a health-care professional takes decision $d_p \in D$ after having asked the questions $q_{i_1}, q_{i_2}, \dots, q_{i_k} \in Q$ in this exact order. Observe that questions cover not only patient signs and symptoms but also consultations of the patient record or of an expert.
Individual decision processes can be generalized and structured as decision mechanisms that not only capture the medical knowledge supporting each individual decision process, but also provide the way of conducting new decisions under other circumstances or for other patients. Among the existing decision mechanisms (see, for example, Arsene et al., 2011; Clark & Niblett, 1989; Chapman & Sonnenberg, 2003; Husmeier et al., 2004; Podgorelec et al., 2002; Shiffman, 1997) here we choose decision trees because they are structured, explicit, and easy to understand and to interpret, which are compelling requirements of a medical process. A decision tree (DT) is a decision mechanism that describes decision processes that always start with the same question and concatenate questions in such a way that each possible answer to a question is followed by a new question or by a final decision.
Decision mechanisms such as DTs can be automatically obtained by applying induction algorithms. These algorithms start from a set of data represented in the form $(q_1(p), q_2(p), \dots, q_m(p); d_p)$ where $p$ are the different patients, the $q$'s are questions whose answers may or may not be known for each patient $p$, and $d_p$ is the decision taken for patient $p$. Observe that the order of the questions is the same for all the patients, since it defines the description of the case rather than a medical decision process in which the questions to the different patients can be asked in a different order.
Fig. 1 shows a DT induced using an information gain based algorithm to identify patients with heart disease (questions are represented as ellipses and decisions as boxes). It does not consider any medical background knowledge, so medical comprehensibility and correctness are not guaranteed. For example, the question about the number of major vessels in the root requires an invasive test for all the patients. The systematic application of this test as the first step of the decision process makes no medical sense, and it is completely wrong for certain medical decision processes such as screening. The algorithms to induce comprehensible and correct decision mechanisms from a set of data must then be based on one or more medical decision criteria that extend the statistical sense of asking one or another question with a clinical sense. These medical decision criteria and their formalization are introduced in the next section.

3. Decision criteria in health-care and their formalization
In medicine, the list of criteria which may be combined to make decisions is very large and diverse. A systematic approach to the organization of such criteria and their representation using cost functions and layered partial orders (LPOs) is proposed in López-Vallverdú and Riaño (2012). In this section, we explain how these criteria can be formalized in order to decide about the appropriate questions and decisions in a decision process. This appropriateness is used to determine the best order of questions and decisions in medical decision processes.

3.1. Criteria on the questions
The order in which questions are asked in a decision process is decided according to the criteria on the questions. They are used to determine whether a question is more or less adequate than another one in a given context. For example, in diabetes screening, we may use the decision time criterion to decide to perform an oral glucose tolerance test rather than obtaining the longer 2-h serum insulin value. When formalizing the criteria on the questions, cost functions or LPOs are defined over the set of questions $Q$. For example, the expert may choose to represent the economic cost criterion as a cost function $f_e: Q \to [0, 1]$ and the script criterion, which measures the adherence of the procedure to the sequence specified by medical standards (López-Vallverdú & Riaño, 2012), as an LPO $\le_s$ over $Q$.
Criteria on the questions can be contextual or context-free. A criterion is said to be contextual when it depends on the context (related disease, medical purpose, etc.) of the medical decision process. In a certain context, the answer to a question may be important in order to make a decision, but in another context, this question may be totally unnecessary. For example, the script value of answering the question stability_of_blood_pressure is greater if we are deciding where a post-operative patient must be sent than if we are determining whether the patient is hypothyroid or not. Script and granularity (López-Vallverdú & Riaño, 2012) are examples of contextual criteria.
Context-free criteria do not change when they are used in different contexts because they depend on the health-care test needed to obtain the answer to the question. For example, economic cost is a context-free criterion. The question sodium_on_blood has no economic cost in itself; its economic cost is that of the blood test that provides the answer to this question. The economic cost of a regular blood test is always the same regardless of the context.
Moreover, a health-care test can provide simultaneous answers to several questions of the decision process. For example, a regular blood test informs about the levels of sodium, urea, creatinine, etc., providing an answer to the question sodium_on_blood, but also to urea_on_blood, creatinine_on_blood, etc. Decision time, economic cost, health risk and physical comfortability (López-Vallverdú & Riaño, 2012) are examples of context-free criteria. Notice that once a health-care test is performed to answer one of the questions, it does not have to be performed again to answer the rest of the questions related to that test. Let $t$ be a health-care test that provides the answers to a set of questions $Q' \subseteq Q$. When a question $q' \in Q'$ is asked in the decision process, the values of the context-free criteria for the questions in $Q'$ change, so that $\forall q' \in Q'$, $f(q') = 0$ (for cost functions), and each question $q' \in Q'$ is moved to the first layer of $\le$ (for LPOs).

Fig. 1. Decision tree to identify patients with heart disease.

3.2. Criteria on the decisions
A decision process concludes with a final decision that can be right or wrong. The relevance of the error in wrong decisions is evaluated with the criteria on the decisions. For example, according to the health risk criterion, it is safer to wrongly send a post-operative patient to the Intensive Care Unit (ICU) than to send him home by mistake. Some previous works have evaluated the possible wrong decisions made in a decision process (Ling et al., 2004, 2006; Turney, 2000). In these approaches, an expert has to provide a cost function $error(d_1, d_2)$ which returns the error of performing $d_2$ when the correct decision is $d_1$, for each pair of decisions $d_1, d_2$ in the set of possible decisions $D$. The drawback of this approach is that the expert is required to provide a value for each one of the $\#D \cdot (\#D - 1)$ possible errors in the decision process (where $\#D$ is the cardinality of $D$). For medium and large sets of decisions this is a large amount of information for experts to provide.
In order to reduce this effort, here we use a different approach that divides the error into type I and type II medical errors which are concepts that medical doctors are familiar with. Type I error represents the relevance of taking a wrong decision (e.g., the economic cost if we send a patient to ICU when this is a wrong decision).
Type II error represents the relevance of not taking a correct decision (e.g., the risk on the health of a patient who is not sent to ICU when this is the correct decision). When formalizing the criteria on the decisions, the cost functions or the LPOs are defined over the set of decisions $D$. This means that two cost functions $f: D \to [0, 1]$ (or two LPOs $\le$ over $D$) are needed for each criterion considered: one for type I error and another one for type II error. For example, the expert may choose to represent the type I error of the health risk criterion ($h$) as an LPO $\le_h$ and the type II error as a cost function $f_h$. This approach requires the expert to provide only $2 \cdot \#D$ values. For each decision $d \in D$, this value is $f_c(d)$ or $\ell_c(d)$ when the criterion $c$ is represented with a cost function or an LPO, respectively. Compared with previous approaches (Ling et al., 2004, 2006; Turney, 2000), our proposal requires much less information, which is also easier for experts to provide. This was confirmed by the health-care professionals who evaluated the results of this work. According to them, our approach is much closer to the way they objectively measure medical errors.

4. Combination of criteria
In a decision process, questions and final decisions are not chosen based on a unique criterion but on the simultaneous application of a set of medical criteria. This combination can be very complex and it may involve criteria with different levels of priority and relevance. In this section we present a means to include a combination of the formalized criteria in the induction of DTs. We first explain how the inductive algorithm selects criteria according to their priority, and then we present a method to combine them, considering their relevance.

4.1. Selection of criteria considering their priority
In a decision process, medical and clinical criteria are arranged in different levels of priority.
The priority of a criterion is defined as the relative position of this criterion in the set of criteria when it is used in medical decision making. This priority is represented by a positive number, 1 being the highest priority. Health-care professionals may use priorities to rank the relevance of the criteria in the decision problem that they are trying to solve. For example, in the selection of questions for screening patients with diabetes, the expert may consider script, economic cost and physical comfortability criteria of higher priority than health risk or decision time. The expert can also avoid the use of priorities just by stating that all the criteria have the same level of priority. The criteria in the first level of priority are those which are used to guide the sequence of questions or to make the final decisions in the decision process. Only if these criteria are not able to identify the best question or decision in the process are the criteria of the second level of priority considered. If these also fail, then the criteria in the third level are used, and so on. If none of the levels is useful to choose the best question or decision, then any remaining question $q_i \in Q$ or decision $d_i \in D$ is appropriate and the one with the lowest index $i$ is selected.
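The level-by-level selection described above can be sketched as a lexicographic tie-break. This is only an illustrative sketch: the function name `select_by_priority`, the dictionary-based cost representation and the example values are assumptions, not part of the original method's implementation, and ties are detected by exact equality of costs.

```python
def select_by_priority(candidates, level_costs):
    """Pick one candidate using cost functions ordered by priority level.

    candidates: questions or decisions, listed in index order.
    level_costs: one dict per priority level (priority 1 first), mapping
                 each candidate to its global cost at that level.
    """
    remaining = list(candidates)
    for cost in level_costs:
        best = min(cost[c] for c in remaining)
        # Keep only the candidates tied at the minimum cost of this level.
        remaining = [c for c in remaining if cost[c] == best]
        if len(remaining) == 1:
            return remaining[0]
    # No level separated the candidates: fall back to the lowest index.
    return remaining[0]


# Hypothetical costs: q1 and q2 tie at priority 1, priority 2 breaks the tie.
level_1 = {"q1": 0.2, "q2": 0.2, "q3": 0.5}
level_2 = {"q1": 0.7, "q2": 0.1, "q3": 0.0}
chosen = select_by_priority(["q1", "q2", "q3"], [level_1, level_2])
```

Lower-priority levels are consulted only for candidates that are still tied, so a cheap question at level 2 (here `q3`) cannot override a worse result at level 1.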
4.2. Combination of criteria considering their relevance
After having considered priorities, the criteria of the same priority are combined according to their relevances. The relevance of a criterion is defined as its weight within the combination of criteria used in medical decision making and it is represented by a value $\alpha \in [0, 1]$ such that $\sum_{c \in C_i} \alpha_c = 1$, where $C_i$ contains those criteria with priority $i$. Given $\alpha_c > \alpha_{c'}$ we say that criterion $c$ is more relevant than criterion $c'$. Health-care professionals must provide the relevance of the decision criteria as a means of weighting the relative importance of each criterion in the decision problem that they are trying to solve. When combining $n$ criteria represented as cost functions or LPOs we deal with three cases:
Case 1: Combination of $n$ cost functions ($f_{c_1}, \dots, f_{c_n}$): We apply a linear combination $g = \alpha_1 f_{c_1} + \dots + \alpha_n f_{c_n}$, with $\alpha_i$ the relevance of criterion $c_i$.
Case 2: Combination of $n$ LPOs ($\le_{c_1}, \dots, \le_{c_n}$): We apply the procedure of combination of LPOs described in López-Vallverdú and Riaño (2011a).
Case 3: Combination of $m$ cost functions and $n - m$ LPOs ($f_{c_1}, \dots, f_{c_m}, \le_{c_{m+1}}, \dots, \le_{c_n}$): We apply the procedure of combination of LPOs in López-Vallverdú and Riaño (2011a) to the $n - m$ LPOs, obtaining a single LPO $\le'$. Then we transform $\le'$ into a cost function $f'$ (López-Vallverdú & Riaño, 2011a) and finally combine the $m + 1$ cost functions $f_{c_1}, \dots, f_{c_m}, f'$ as in Case 1, with the relevance of $f'$ calculated as $\alpha' = \sum_{i=m+1}^{n} \alpha_i$.

5. Induction of decision trees based on medical criteria
The three most successful and widely applied algorithms to induce DTs are ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and C5.0 (Quinlan, 2003), with more than 1800 publications in medical informatics since 2000.¹ These are greedy algorithms that produce DTs as a result of a top-down partitioning process that starts with a dataset which contains descriptions of past decision processes. In medical informatics (Podgorelec et al., 2002), these cases represent decisions on patients that are expressed as $(q_1(p), q_2(p), \dots, q_m(p); d_p)$ where $p$ are the different patients considered, the $q$'s are questions on particular conditions of the patients whose answers may or may not be known, and $d_p$ is the decision taken for patient $p$. In spite of significant differences, the baseline of ID3, C4.5 and C5.0 is equivalent: partition the dataset into subsets using the best possible question, until the decisions of the remaining cases can be considered equivalent, then take the most appropriate decision. This behavior is described in Algorithm 1, where three choice points have been identified. These are points in which background knowledge can be considered in order to improve the medical and clinical comprehensibility and correctness of the DT induced.
¹ Bibliographic search in ScienceDirect with keywords medicine AND (id3 OR c4.5 OR c5.0).
Choice point one, in line 2, sets a condition for placing a decision node (or not). For the current dataset this condition determines whether the situation $(q_i(p) \Rightarrow \cdots)$ is better represented with a decision $(q_i(p) \Rightarrow d_p)$ or if more questions have to be asked $(q_i(p) \Rightarrow q_j(p) \Rightarrow \cdots)$.
Choice point two, in line 3, is the condition to select the best decision $d_p \in D$ for the current decision process.
Choice point three, in line 7, is the condition to select the best question $q_j \in Q$ for the current decision process.
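To make the role of the three choice points concrete, the following is a minimal sketch of a generic top-down induction loop in which each choice point is a pluggable function. It is not a transcription of Algorithm 1: the function names, the dictionary-based tree, and the placeholder policies (stop when all cases agree, majority decision, first available question) are assumptions made for illustration, and all answers are assumed to be known.

```python
from collections import Counter


def induce_tree(cases, questions, place_decision, select_decision, select_question):
    """cases: list of (answers, decision) pairs; answers maps question -> answer."""
    # Choice point one: is a decision node placed here?
    if not questions or place_decision(cases):
        # Choice point two: which decision labels the node?
        return {"decision": select_decision(cases)}
    # Choice point three: which question is asked next?
    question = select_question(cases, questions)
    remaining = [q for q in questions if q != question]
    children = {}
    for answer in sorted({answers[question] for answers, _ in cases}):
        subset = [(a, d) for a, d in cases if a[question] == answer]
        children[answer] = induce_tree(
            subset, remaining, place_decision, select_decision, select_question
        )
    return {"question": question, "children": children}


# Placeholder policies (hypothetical, not the medical criteria of the paper).
def all_same_decision(cases):
    return len({d for _, d in cases}) == 1


def majority_decision(cases):
    return Counter(d for _, d in cases).most_common(1)[0][0]


def first_question(cases, questions):
    return questions[0]


cases = [
    ({"age": "old", "tsh": "high"}, "hypothyroid"),
    ({"age": "old", "tsh": "normal"}, "negative"),
]
tree = induce_tree(cases, ["age", "tsh"], all_same_decision,
                   majority_decision, first_question)
```

Swapping the three placeholder functions for implementations driven by medical cost functions or LPOs is where background knowledge would enter such a loop.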
5.1. Introducing background knowledge in the induction of DTs
In order to improve the medical comprehensibility and correctness of the trees induced by ID3, C4.5 or C5.0, and also to be able to produce trees with a concrete medical orientation (e.g., screening, diagnosis, treatment, etc.), the medical background knowledge is included in Algorithm 1 (see Fig. 2). This knowledge is represented by cost functions and LPOs related to each one of the criteria taking part in the decision process. For each criterion, three cost functions (or LPOs) are defined: one for questions and two more for type I and type II errors on the decisions. These cost functions and LPOs, together with the priorities and relevances of the criteria, define the background knowledge required to produce decision trees with a medical sense. A representation of all the background knowledge required is shown in Table 1, where $c_1, \dots, c_k$ are the criteria selected. For each criterion $c_i$ (i.e., table row), the background knowledge provides the priority and relevance ($p_{qi}$ and $\alpha_{qi}$) when the criterion is used to select the questions, and the priority and relevance ($p_{Ii}$ and $\alpha_{Ii}$ for type I error, and $p_{IIi}$ and $\alpha_{IIi}$ for type II error) when it is used to select the proper decision. Each criterion $c_i$ may be represented as a cost function or an LPO, for questions, and for type I and type II errors. Table 1 is a central component of the process described in Fig. 2. The criteria in Table 1 are combined using the methodology described in Section 4, obtaining three global cost functions or LPOs for each level of priority $j$: one for criteria on the questions ($g_{qj}$ or $\le_{qj}$), another one for criteria on the decisions related to type I
errors ($g_{Ij}$ or $\le_{Ij}$) and a third one for criteria on the decisions related to type II errors ($g_{IIj}$ or $\le_{IIj}$). With the aim of inducing DTs that are medically and clinically comprehensible and correct and, at the same time, adapted to the health-care purpose the DT must serve, we propose an implementation for each one of the choice points of Algorithm 1 that uses the different global cost functions and LPOs.

Fig. 2. Introducing background knowledge in Algorithm 1.

Table 1
Representation of the input background knowledge.
Criteria | Questions                   | Decisions: type I error     | Decisions: type II error
         | p    | α    | Formalization | p    | α    | Formalization | p     | α     | Formalization
c1       | pq1  | αq1  | fq1 or ≤q1    | pI1  | αI1  | fI1 or ≤I1    | pII1  | αII1  | fII1 or ≤II1
c2       | pq2  | αq2  | fq2 or ≤q2    | pI2  | αI2  | fI2 or ≤I2    | pII2  | αII2  | fII2 or ≤II2
...      | ...  | ...  | ...           | ...  | ...  | ...           | ...   | ...   | ...
ck       | pqk  | αqk  | fqk or ≤qk    | pIk  | αIk  | fIk or ≤Ik    | pIIk  | αIIk  | fIIk or ≤IIk
5.2. Condition for placing a decision node
In medicine, deciding whether a decision process has reached a final decision or if new questions are recommended is a trade-off between type I and type II errors. Here, these errors are respectively represented with the cost functions $g_{Ij}$ and $g_{IIj}$ obtained for each level of priority $j$ (see Fig. 2). If we have global LPOs, they are transformed into the cost functions $g_{Ij}$ and $g_{IIj}$ (López-Vallverdú & Riaño, 2011a). Therefore, for each priority level $j$, $g_{Ij}$ and $g_{IIj}$ provide the global cost of accepting a wrong decision and the global cost of rejecting a correct decision over a decision process $(q_{i_1}(p), q_{i_2}(p), \dots, q_{i_k}(p), d_p)$ on a patient $p$. Given a set of patients $P'$, if $P_{P'}(d)$ is the proportion of patients in $P'$ on which the final decision was $d$, then, considering criteria with priority $j$, the cost of placing a decision node $\mathrm{Dec}_j(d, P')$ is calculated using Eq. (1). The condition for placing a decision node is reached if one of the total costs for making a decision $d$ over the current dataset, considering criteria with priority 1, is lower than a threshold (i.e., $\min_{d \in D} \mathrm{Dec}_1(d, P')$ is below that threshold).

$$\mathrm{Dec}_j(d, P') = \left(1 - P_{P'}(d)\right) \cdot g_{Ij}(d) + \sum_{d' \in D,\, d' \neq d} P_{P'}(d') \cdot g_{IIj}(d') \qquad (1)$$

We compare the costs for making a decision with a threshold rather than with the costs of asking a question because questions and decisions depend on different criteria and thus they are not comparable. If a decision is correct enough for the current dataset (its cost is lower than the threshold) it can be placed in the DT with no need to calculate the cost of asking a question. This procedure considers both the information in the database (proportion of patients for each decision) and the medical background knowledge (type I and II error cost for each wrong decision).

5.3. Select the best decision: correctness
From a medical point of view, the most correct decision to be made over a certain set of patients must be determined considering type I and type II errors (see $g_{Ij}$ and $g_{IIj}$ in Fig. 2). Therefore the selection of the best decision is done using Eq. (1). The best decision to be selected is the one which minimizes $\mathrm{Dec}_1$. If several decisions minimize $\mathrm{Dec}_1$ then we select the one of them which minimizes $\mathrm{Dec}_2$. The procedure is repeated for each level of priority until there is only one optimal decision. If the lowest priority level is reached and there is not a single optimal decision selected, then the remaining decision $d_i$ with the lowest index $i$ is taken.

5.4. Select the best question: comprehensibility
A decision process is medically comprehensible if the questions are asked in an order that agrees with the criteria of the health-care experts. Therefore, criteria on the questions are involved in the selection of the best question for a certain patient (see $g_{qj}$ and $\le_{qj}$ in Fig. 2). Nevertheless, from a medical point of view the most comprehensible question is not necessarily the question that leads to the best situation to make a final decision. In order to select comprehensible questions which are also useful to make a final decision, we use the concept of expected cost (EC). For each question $q_i$, the EC represents the cost of making a decision in the next step of the decision process after asking the question $q_i$. This is the average of the costs of placing decision nodes for each of the subsets
obtained when a certain set of patients $P' \subseteq P$ is partitioned using $q_i$. EC is calculated with Eq. (2), where $P'_a = \{p \in P' : q_i(p) = a\}$ and $A_i = \{a : q_i(p) = a,\ p \in P'\}$.

$$EC(q_i, P') = \frac{1}{\#A_i} \sum_{a \in A_i} \min_{d \in D} \mathrm{Dec}_1(d, P'_a) \qquad (2)$$
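Eqs. (1) and (2) can be illustrated with a short sketch. The function names, the dictionary encoding of the type I and type II global cost functions, and the numeric costs below are hypothetical; the sketch assumes a single priority level and that every answer is known.

```python
from collections import Counter


def dec_cost(d, subset_decisions, all_decisions, g_type1, g_type2):
    """Cost of placing decision d on a subset of patients, following Eq. (1)."""
    n = len(subset_decisions)
    counts = Counter(subset_decisions)
    prop = {x: counts[x] / n for x in all_decisions}
    # Type I part: patients for whom d is wrong, weighted by the cost of d.
    wrong = (1 - prop[d]) * g_type1[d]
    # Type II part: every other decision that is missed by choosing d.
    missed = sum(prop[o] * g_type2[o] for o in all_decisions if o != d)
    return wrong + missed


def expected_cost(question, cases, all_decisions, g_type1, g_type2):
    """Average over the answers of the cheapest decision cost, following Eq. (2)."""
    groups = {}
    for answers, decision in cases:
        groups.setdefault(answers[question], []).append(decision)
    best_costs = [
        min(dec_cost(d, group, all_decisions, g_type1, g_type2) for d in all_decisions)
        for group in groups.values()
    ]
    return sum(best_costs) / len(groups)


# Hypothetical costs for a two-decision post-operative example.
decisions = ["home", "icu"]
g1 = {"home": 0.5, "icu": 0.25}   # type I: cost of wrongly taking the decision
g2 = {"home": 0.25, "icu": 1.0}   # type II: cost of wrongly not taking it
cases = [
    ({"bp": "stable", "age": "old"}, "home"),
    ({"bp": "stable", "age": "young"}, "home"),
    ({"bp": "unstable", "age": "old"}, "icu"),
    ({"bp": "unstable", "age": "young"}, "icu"),
]
mixed = [d for _, d in cases]
cost_icu = dec_cost("icu", mixed, decisions, g1, g2)
cost_home = dec_cost("home", mixed, decisions, g1, g2)
ec_bp = expected_cost("bp", cases, decisions, g1, g2)
ec_age = expected_cost("age", cases, decisions, g1, g2)
```

With these illustrative costs, sending everybody to the ICU is cheaper than sending everybody home, and the question `bp` separates the decisions perfectly, so its expected cost is lower than that of `age`.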
We compute EC for each question and we select those questions whose EC is lower than a threshold $\delta$. Among them, the best question is the one which minimizes the global cost function $g_{q1}$ (or which is in the lowest layer of the LPO $\le_{q1}$) for criteria on the questions of level of priority 1. If several questions minimize $g_{q1}$ (or are in the lowest layer of $\le_{q1}$) then we select the one of them which minimizes $g_{q2}$ (or which is in the lowest layer of $\le_{q2}$). The procedure is repeated for each level of priority until there is only one optimal question. If none of the levels is useful to select one of these questions, then the remaining question $q_i$ with the lowest index $i$ is selected. The use of the expected cost together with the criteria on the questions guarantees a trade-off between the information in the database and the medical background knowledge when selecting the best question.

6. Evaluation of medical decision trees
The accuracy of a DT is defined as the percentage of correct decisions over the total number of decisions made. Accuracy is a statistical measure like sensitivity, specificity and positive and negative predictive values (Lang & Secic, 2006), which numerically compares the decisions represented in the DT with the cases in the training dataset. These measures are not based on any kind of medical background knowledge, so they are not a valid way to assess the medical comprehensibility and correctness of the DTs. Let $path(p, DT) = \{q_{p_1}, q_{p_2}, \dots, q_{p_k}\}$ be the sequence of questions asked to patient $p$ if we follow the decision tree DT. Comprehensibility is calculated with Eq. (3) and evaluates the sequence of questions in $path(p, DT)$ for all the patients $p \in P$ following the indications of the decision tree DT. Comprehensibility takes into account the global cost function $g_{q1}$ of the criteria on questions with priority 1.
If the medical background knowledge is represented with a global LPO $\le_{q1}$, this has to be transformed into a cost function $g_{q1}$ (López-Vallverdú & Riaño, 2011a) before Eq. (3) is applied.
$$\mathrm{comprehensibility}(P, DT) = \frac{1}{\#P} \left( \#P - \sum_{p \in P} \frac{\sum_{q \in path(p, DT)} g_{q1}(q)}{\#path(p, DT)} \right) \qquad (3)$$
Let DN be the set of decision nodes in a decision tree DT (i.e., the terminal nodes of the DT), and let dn and Pn be the decision made and the set of patients in a decision node n 2 DN, respectively. Correctness is calculated with Eq. (4) and it evaluates all the final decisions made in a DT with the function Dec1 which returns the cost of placing a decision node considering criteria with priority 1.
$$\mathrm{correctness}(P, DT) = \frac{1}{\#DN} \left( \#DN - \sum_{n \in DN} \mathrm{Dec}_1(d_n, P_n) \right) \qquad (4)$$
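A small sketch of the two evaluation measures follows. It assumes that Eqs. (3) and (4) are read as one minus an average cost, i.e., $(\#P - \sum \dots)/\#P$ and $(\#DN - \sum \dots)/\#DN$, so that higher values are better; this reading, the function names and the example costs are assumptions made for illustration. Every path is assumed to contain at least one question.

```python
def comprehensibility(paths, g_q1):
    """paths: one list of asked questions per patient; g_q1: question -> cost.

    Assumed reading of Eq. (3): one minus the mean, over patients, of the
    average question cost along each patient's path.
    """
    total = sum(sum(g_q1[q] for q in path) / len(path) for path in paths)
    return (len(paths) - total) / len(paths)


def correctness(node_costs):
    """node_costs: Dec_1(d_n, P_n) for every decision node n of the tree.

    Assumed reading of Eq. (4): one minus the mean decision-node cost.
    """
    return (len(node_costs) - sum(node_costs)) / len(node_costs)


# Hypothetical question costs and paths for two patients.
g_q1 = {"age": 0.5, "tsh": 0.0}
paths = [["age", "tsh"], ["age"]]
comp = comprehensibility(paths, g_q1)
corr = correctness([0.25, 0.75])
```

Under this reading both measures stay in [0, 1] whenever the underlying costs do, which makes them directly comparable with accuracy.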
7. Tests and results In this section, we detail the tests carried out on the induction of medically comprehensible and correct DTs and the results obtained with our algorithm on four medical domains from the UCI Repository of Machine Learning (Frank & Asuncion, 2010). The domains are diabetes with 768 patients, 8 questions and 2 decisions; heart disease with 303 patients, 13 questions and 2 decisions;
post-operative with 90 patients, 8 questions and 3 decisions, and thyroid with 3772 patients, 20 questions and 3 decisions. The background knowledge about the different decision criteria in all four domains has been provided by physicians of the Clinical Hospital in Barcelona (CHB) (Spain) and the SAGESSA Health Care Group (Spain). For each domain, these professionals selected some medical criteria and provided the background knowledge according to Table 1 for the purposes of patient screening and patient diagnosis.

7.1. The tests
With the aim of finding evidence that our approach (MEDBK) provides comprehensible and correct DTs which are useful to represent medical decision processes and, at the same time, showing the limitations of the information gain based algorithms (IG) such as ID3, C4.5 or C5.0 in the induction of medical DTs,² we have performed the following two types of test on the previous four medical domains.
Test type 1, to show evidence that MEDBK generates comprehensible and correct medical DTs, with no loss of accuracy with respect to IG.
Test type 2, to show evidence about the suitability of MEDBK to produce decision mechanisms for different purposes (screening and diagnosis) for the same datasets.
The first type of test has been performed by generating DTs to screen patients in the four domains. MEDBK required the professionals of the two health-care institutions to agree on the criteria to be used and also on the priorities and relevances of such criteria for a screening decision process. Table 2 summarizes the selected criteria extracted from the list in López-Vallverdú and Riaño (2012) (column 1), their respective priorities (columns p), relevances (columns α) and their formalization as cost functions or LPOs, for questions, and for type I and type II errors on the decisions. The cost functions and LPOs are not provided here because each medical domain tested has its own.
These are 25 cost functions and 15 LPOs in total, which are provided in López-Vallverdú and Riaño (2011b). According to the physicians, some of the criteria in Table 2 are not appropriate for selecting questions or for considering type I or type II errors. These appear as ‘–’ in the table, meaning that they are not part of the background knowledge.

All these tests were performed with and without cross-validation, and with and without pruning. Cross-validation is used to analyze the robustness of the DTs and, in our case, it consisted of repeating the following procedure 10 times: we randomly selected 90% of the patients of the initial dataset and used them to generate the DT, which was then tested on the remaining 10% of the patients. Pruning is used to reduce the overfitting of DTs and to remove sections of a DT that may be based on noisy or erroneous data. Pruning is based on a prefixed percentage of DT node representativity: during the induction process, if a node of the DT represents less than this percentage of patients, it becomes a decision node. We used a representativity ratio of 2%. We compared the results of these tests with the DTs obtained with IG.

The second type of test was centered on the thyroid domain and consisted of generating DTs with both the IG and the MEDBK algorithms for the decision processes of patient screening and patient diagnosis.

² In the following tests we used as IG the Weka J48 implementation of the C4.5 algorithm (Witten & Frank, 2005).

The results of the two types of test were analyzed by physicians
J.A. López-Vallverdú et al. / Expert Systems with Applications 39 (2012) 11782–11791
Table 2
Priorities, relevances and formalization of the medical criteria to perform screening decision processes.

                  Criteria on the questions     Criteria on the decisions
                                                Type I error                  Type II error
Criteria          p    a     Formalization      p    a     Formalization      p    a    Formalization
Script            1    1     ≤_qs               –    –     –                  –    –    –
Health risk       2    1     ≤_qh               1    0.9   f_Ihᵃ              1    1    f_IIhᵃ
Physical comf.    3    0.4   ≤_qc               1    0.1   f_Icᵃ              –    –    –
Economic cost     3    0.4   f_qe               2    0.5   f_Ie               –    –    –
Decision time     3    0.2   f_qt               2    0.5   f_It               –    –    –

ᵃ For the post-operative domain, it was formalized with an LPO.
Fig. 3. DT induced for the screening of heart disease using MEDBK.
of the two previously mentioned health-care institutions, and their main conclusions are summarized in Section 7.2. We also compared the accuracy, comprehensibility and correctness of the DTs induced by MEDBK with those of the DTs generated with IG. This comparison is detailed in Section 7.3.

7.2. Decision trees obtained and medical analysis

With MEDBK, we induced DTs to screen patients in the medical domains of diabetes, heart disease, post-operative, and thyroid. Several physicians proposed the criteria, priorities and relevances so as to avoid, as far as possible, questions based on risky, uncomfortable or expensive medical tests (see Table 2). Fig. 3 shows one of the DTs induced with MEDBK. In contrast to the DT obtained with IG (see Fig. 1), this one is based on less invasive questions such as age, sex, chest pain type, resting blood pressure, resting electrocardiogram and maximum heart rate, rather than on questions based on invasive tests such as the number of major vessels. Observe that the DT induced with
MEDBK uses the questions age and sex (highest priority according to the criterion script; López-Vallverdú & Riaño, 2011b) before asking other questions. However, because our method trades off the information in the database against the medical background knowledge, the latter does not always determine the sequence of questions. For example, in one branch the question maximum heart rate is used to make a final decision without having asked other questions with a higher priority, such as resting blood pressure, fasting blood sugar and serum cholesterol. The physicians judged the behavior of this DT to be in accordance with normal practice, whereas the one depicted in Fig. 1 was rejected as inappropriate for decision making in the screening of patients with heart disease. This interpretation holds for all the DTs obtained in the four medical domains tested, and it is corroborated by the numerical results discussed in Section 7.3. All the DTs obtained with IG represent medical decision processes that are riskier, more uncomfortable or more expensive than the ones obtained with MEDBK.
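The priorities and relevances of Table 2 can be written down as plain data. One regularity visible in the table is that, within each group (questions, type I errors, type II errors), the relevances of the criteria sharing a priority add up to 1. The following Python sketch only checks that regularity; whether MEDBK requires such a normalization is not stated in this section, and the names TABLE_2 and relevance_by_priority are illustrative, not part of the original work.

```python
from collections import defaultdict

# (criterion, priority, relevance) triples transcribed from Table 2.
# Entries shown as '–' in the table are simply left out.
TABLE_2 = {
    "questions": [
        ("script", 1, 1.0),
        ("health risk", 2, 1.0),
        ("physical comfort", 3, 0.4),
        ("economic cost", 3, 0.4),
        ("decision time", 3, 0.2),
    ],
    "type I error": [
        ("health risk", 1, 0.9),
        ("physical comfort", 1, 0.1),
        ("economic cost", 2, 0.5),
        ("decision time", 2, 0.5),
    ],
    "type II error": [
        ("health risk", 1, 1.0),
    ],
}


def relevance_by_priority(entries):
    """Sum the relevances of all criteria that share the same priority."""
    totals = defaultdict(float)
    for _criterion, priority, relevance in entries:
        totals[priority] += relevance
    return dict(totals)


if __name__ == "__main__":
    for group, entries in TABLE_2.items():
        print(group, relevance_by_priority(entries))
```

Running the script prints, for each group, the total relevance found at every priority level, which makes it easy to spot a level whose relevances do not add up to 1.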
Fig. 4. LPO over the questions to diagnose thyroid malfunctioning.
MEDBK was also used to induce different DTs for the same input data. This was possible by adjusting the set of selected criteria and their priorities and relevances to the kind of medical decision desired (i.e., screening or diagnosis). Focusing on the thyroid problem, MEDBK was used to generate DTs to screen and to diagnose patients. The criteria were again the ones in Table 2 for the screening process, and script for the diagnosis process. The script criterion was represented with the LPO in Fig. 4.³ MEDBK proposed one DT to screen patients with thyroid problems and another DT to diagnose thyroid malfunctioning (see Fig. 5). Both DTs were accepted as correct by the team of physicians supporting this work. The DT obtained with IG was not accepted for screening purposes, but was considered acceptable for diagnosis. However, although the DT proposed by IG was very similar to the one in Fig. 3 (and therefore appropriate for diagnosis⁴), the physicians concluded that, even in a diagnosis, there is always a set of medical criteria guiding the selection of questions. Since these criteria cannot be incorporated into IG, this algorithm cannot guarantee DTs that represent good diagnosis processes. This was observed in several of the domains studied, such as diabetes, whose DTs incorporated questions related to blood pressure or pregnancy, which are irrelevant for making final diagnostic decisions.

7.3. The quality of the results

The quality of medical DTs is measured in terms of their accuracy and their medical comprehensibility and correctness. Table 3 shows these values for the MEDBK DTs when they are used to screen patients in the domains of diabetes, heart disease, post-operative, and thyroid. The average of the IG DTs is also provided for the sake of comparison.
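The comparison between the two average rows of Table 3 reduces to simple arithmetic. The Python sketch below uses only the average accuracies transcribed from Table 3 and recomputes two quantities: the accuracy gap between IG and MEDBK without cross-validation, and the accuracy lost by each method under cross-validation. The names ACCURACY, accuracy_gap_without_cv and cv_loss are illustrative and not part of the original work; because the inputs are the published (already rounded) averages, the results can differ from the figures quoted in Sections 7.3.1 and 7.3.4 in the last digit.

```python
from statistics import mean

# Average accuracies (%) from the "Average" (MEDBK) and "Average IG" rows of Table 3.
# Keys are (cross_validation, pruning).
ACCURACY = {
    "MEDBK": {
        (True, True): 77.2, (True, False): 75.5,
        (False, True): 83.0, (False, False): 91.1,
    },
    "IG": {
        (True, True): 75.5, (True, False): 75.3,
        (False, True): 87.3, (False, False): 94.6,
    },
}


def accuracy_gap_without_cv(pruning: bool) -> float:
    """IG accuracy minus MEDBK accuracy for DTs evaluated without cross-validation."""
    return ACCURACY["IG"][(False, pruning)] - ACCURACY["MEDBK"][(False, pruning)]


def cv_loss(method: str) -> float:
    """Accuracy lost when moving from no cross-validation to cross-validation,
    averaged over the pruned and unpruned settings."""
    table = ACCURACY[method]
    without_cv = mean(table[(False, pruning)] for pruning in (True, False))
    with_cv = mean(table[(True, pruning)] for pruning in (True, False))
    return without_cv - with_cv


if __name__ == "__main__":
    gaps = [accuracy_gap_without_cv(pruning) for pruning in (True, False)]
    print("gap with pruning, without pruning, mean:", gaps[0], gaps[1], mean(gaps))
    print("cross-validation loss IG:", cv_loss("IG"))
    print("cross-validation loss MEDBK:", cv_loss("MEDBK"))
    print("difference in loss:", cv_loss("IG") - cv_loss("MEDBK"))
```

Only accuracy is recomputed here; the same two functions could be applied to the comprehensibility and correctness columns by swapping in the corresponding averages.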
The quality of a medical DT is also related to the capability of the tree to remain unchanged and still represent good medical decisions (i.e., DT robustness) and to its ability not to represent chance decisions (i.e., DT overfitting). In Table 3 we provide the results with and without cross-validation in order to analyze the robustness of the DTs obtained, and the results with and without pruning in order to analyze overfitting.

³ The 16 other questions that do not appear in the LPO are in layer 4, but they were omitted for space reasons (see López-Vallverdú & Riaño, 2011b).
⁴ The physicians argued that some cases of thyroid problems could not be diagnosed with the IG and MEDBK DTs because there were no instances of such cases in the input database.

Fig. 5. DT induced by MEDBK for the diagnosis of thyroid.

Table 3
Results obtained for DTs to screen patients in four medical domains with MEDBK.

                  With pruning                      Without pruning
                  Acc. (%)   Com. (%)   Cor. (%)    Acc. (%)   Com. (%)   Cor. (%)
With cross-validation
Diabetes          71.4       78.0       77.9        74.0       78.5       79.6
Heart disease     77.7       92.7       87.7        74.2       90.8       85.5
Postoperative     64.4       90.9       90.0        57.8       82.5       84.2
Thyroid           95.4       85.4       95.5        95.9       88.4       95.9
Average           77.2       86.8       87.8        75.5       85.1       86.3
Average IG        75.5       76.2       85.4        75.3       44.9       85.4
Without cross-validation
Diabetes          78.5       81.8       84.0        83.1       81.0       86.9
Heart disease     82.5       92.0       90.5        91.7       88.6       95.4
Postoperative     75.6       83.6       94.7        92.2       83.2       98.3
Thyroid           95.5       83.9       95.5        97.5       81.0       97.5
Average           83.0       85.3       91.2        91.1       83.5       94.5
Average IG        87.3       39.0       91.6        94.6       42.4       95.8

7.3.1. Accuracy of DTs

We observe that the mean difference between the average accuracies of the DTs without cross-validation obtained with IG and MEDBK is 3.9% (4.3% with pruning and 3.5% without pruning). This difference can be explained by the fact that MEDBK is not designed to maximize accuracy but to maximize comprehensibility and correctness. In contrast, IG is oriented to accuracy maximization, yet it obtains DTs whose accuracies are not significantly better than the ones obtained with MEDBK. At the same time, cross-validation shows that the accuracy of IG DTs decreases more than that of MEDBK DTs (15.5% and 10.7%, respectively). Therefore, IG obtains slightly more accurate but less robust DTs.

7.3.2. Comprehensibility of DTs

The comprehensibility results clearly favor MEDBK, whose average comprehensibility is 43.7% better. Thyroid is a clear example, in which comprehensibility is more than 60% better than for the IG trees in all the tests performed. In all four domains, the results show that the order of the questions in the DTs produced with MEDBK is more coherent from a medical point of view.

7.3.3. Correctness of DTs

Because accuracy (i.e., percentage of correct decisions) and correctness (i.e., quality of the decisions) are strongly related, the results obtained by IG in terms of mean correctness are often good. Nevertheless, when comparing IG and MEDBK DTs we find cases where IG DTs are better in accuracy but worse in correctness. This means that MEDBK makes more mistakes than IG (1.4% on average) but these mistakes are less important. This happens in
several cases, for example in the DTs for screening of post-operative patients with pruning. In terms of accuracy, IG obtains a better DT than MEDBK (with respective accuracies of 82.2% and 75.6%), but medical correctness indicates that the errors of the DTs induced with MEDBK are less critical from a medical or clinical point of view (as shown by the respective correctness values of 89.7% and 94.7%).

7.3.4. Robustness of DTs

The results in Table 3 suggest that MEDBK DTs are better at making decisions over new patients. With cross-validation, the average loss of accuracy with respect to the DTs generated without cross-validation is 4.9% lower with MEDBK than with IG. The differences in the loss of comprehensibility and correctness are smaller but also favorable to MEDBK (1.6% and 2.5%, respectively). This means that the DTs generated with MEDBK are more robust than the trees generated with IG.

7.3.5. Overfitting of DTs

Pruning is a satisfactory procedure because it yields smaller DTs, which reduces overfitting, without a significant loss of accuracy, correctness or comprehensibility. Both MEDBK and IG obtain DTs with a similar average loss of accuracy and correctness when applying pruning (always below 3.5%). As far as comprehensibility is concerned, the MEDBK DTs are medically better after pruning (1.8% on average), while those of IG are significantly worse (6.1% on average).

8. Conclusions

Information-gain-based algorithms for inducing decision trees in complex domains cannot always guarantee acceptable results from an expert point of view. Specifically, in the medical domain, these algorithms do not consider health-care criteria, and therefore important aspects such as the risks of clinical procedures or patient discomfort can be left out of their decision processes. Moreover, medical errors in the final decisions can be critical, and therefore their recommendations cannot be taken as medically correct.
For the same dataset, these algorithms always produce the same DT regardless of its final medical purpose. This is not appropriate because, for example, a good DT for diagnosing is not necessarily a good DT for other medical decision processes such as screening or disease treatment.

Here, we have proposed an algorithm to induce medical DTs that uses a combination of relevant health-care criteria. The chosen criteria and their respective priorities and relevances allow the algorithm to produce DTs oriented to different medical purposes. The tests performed in the medical domains of diabetes, heart disease, post-operative and thyroid malfunctioning for the purposes of screening and diagnosing indicate that the medical DTs generated with the new algorithm are medically comprehensible and correct, while their accuracy is not significantly worse than that obtained with information-gain-based algorithms and is more robust to new data. The sequences of questions of the trees in these domains are medically comprehensible and do not involve unnecessarily risky, uncomfortable or expensive medical tests. With respect to correctness, the presence of critically wrong decisions is avoided. Cross-validation and pruning tests indicate that the DTs obtained by our algorithm are robust and resistant to overfitting.

In the future, this work will be continued along three lines. The first line is the exploitation of health-care databases about different medical decision processes such as prevention, screening, diagnosing and patient treatment, in order to automatically adjust the relevances that produce the most accurate, comprehensible and
correct DTs with respect to the medical decisions contained in the data. Our aim is to consider all the criteria and let the optimization algorithm determine the relevances, which will approach zero for those criteria that are not used in each concrete decision process. In the end, we expect to have a family of criterion-relevance pairs describing each medical process, and we will use them to compare the way different medical centres work.

The second line will adapt the current induction of DTs to the induction of clinical algorithms (Bohada, Riaño, & López-Vallverdú, 2012; Riaño, López-Vallverdú, & Tu, 2008). A clinical algorithm (CA) is a flow diagram consisting of branching-logic pathways that represent sequences of clinical decisions and are used for teaching clinical decision making and for guiding patient care. These branching-logic pathways can be represented with DTs; therefore, they can be induced with the algorithm in Section 5. Considering this, we will aim to induce medically comprehensible and correct CAs from hospital databases by including medical background knowledge.

The third line will address the induction of medical DTs following a different approach. We can assume that medical criteria are implicit in the data available about medical decisions. Starting with databases containing decision processes of the form $(q_{i_1}(p), q_{i_2}(p), \ldots, q_{i_k}(p), d_p)$, we will study the possibilities of generating accurate, comprehensible and correct DTs without considering an explicit representation of medical criteria (Torres, López-Vallverdú, & Riaño, 2011a).

Acknowledgements

We would like to thank Dr. Collado and Dr. Alonso for their continuous support leading the groups of health-care professionals from the SAGESSA Health Care Group (Spain) and the Clinical Hospital in Barcelona (Spain), respectively.

References

Arsene, O., Dumitrache, I., & Mihu, I. (2011). Medicine expert system dynamic Bayesian network and ontology based. Expert Systems with Applications, 38, 15253–15261.
Bohada, J. A., Riaño, D., & López-Vallverdú, J. A. (2012). Automatic generation of clinical algorithms within the state-decision-action model. Expert Systems with Applications.
Candell Riera, J. (2003). Estratificación pronóstica tras infarto agudo de miocardio. Revista Española de Cardiología, 56(3), 303–313.
Chai, X., Deng, L., Yang, Q., & Ling, C. X. (2004). Test-cost sensitive Naïve Bayesian classification. In Proceedings 4th IEEE international conference on data mining.
Chapman, G. B., & Sonnenberg, F. A. (Eds.). (2003). Decision making in health care: Theory, psychology and applications. Cambridge series on judgement and decision making. Cambridge University Press.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283.
Fauci, A. S., Braunwald, E., Kasper, D. L., Hauser, S. L., Longo, D. L., Jameson, J. L., et al. (Eds.). (2009). Featuring the complete contents of Harrison’s principles of internal medicine (17th ed.). McGraw Hill. Harrison’s Online.
Horning, K. K., Hoehns, J. D., & Doucette, W. R. (2007). Adherence to clinical practice guidelines for 7 chronic conditions in long-term-care patients who received pharmacist disease management services versus traditional drug regimen review. Journal of Managed Care Pharmacy, 13(1), 28–36.
Husmeier, D., Dybowski, R., & Roberts, S. (Eds.). (2004). Probabilistic modelling in bioinformatics and medical informatics. Springer.
Lang, T. A., & Secic, M. (2006). How to report statistics in medicine (2nd ed.). American College of Physicians.
Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (2004). Decision trees with minimal costs. In Proceedings 21st international conference on machine learning.
Ling, C. X., Sheng, V. S., & Yang, Q. (2006). Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1055–1067.
López-Vallverdú, J. A., & Riaño, D. (2011a). Cost functions and partial orders as medical background knowledge: formalization and operations. Research report DEIM-RR11-003. Spain: Universitat Rovira i Virgili. Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2011b). Repository of background knowledge. Accessed March 2012.
López-Vallverdú, J. A., & Riaño, D. (2012a). Decision criteria in health-care and their representation. Research report DEIM-RR-12-001. Spain: Universitat Rovira i Virgili. Accessed March 2012.
López-Vallverdú, J. A., Riaño, D., & Collado, A. (2007). Increasing acceptability of decision trees with domain attributes partial orders. In Proceedings of the 20th IEEE international symposium on computer-based medical systems, Maribor, Slovenia.
Lucas, P., van der Gaag, L., & Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine, 30(3), 201–214.
Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Podgorelec, V., Kokol, P., Stiglic, B., & Rozman, I. (2002). Decision trees: An overview and their use in medicine. Journal of Medical Systems, 26(5), 445–463.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA, USA: Morgan Kaufmann.
Quinlan, J. R. (2003). C5.0 Online tutorial. Accessed March 2012.
Riaño, D., López-Vallverdú, J. A., & Tu, S. (2008). Mining hospital data to learn SDA* clinical algorithms. LNAI (Vol. 4924, pp. 46–61).
Shiffman, R. N. (1997). Representation of clinical practice guidelines in conventional and augmented decision tables. Journal of the American Medical Informatics Association, 4, 382–393.
Torres, P., López-Vallverdú, J. A., & Riaño, D. (2011). Inducing decision trees from medical decision processes. LNAI (Vol. 6512, pp. 40–55).
Turney, P. D. (2000). Types of cost in inductive concept learning. In Workshop on cost-sensitive learning at the 7th international conference on machine learning. California: Stanford University.
Velikova, M., de Carvalho Ferreira, N., & Lucas, P. (2007). Bayesian network decomposition for modeling breast cancer detection. In Artificial intelligence in medicine, AIME 2007, Amsterdam, The Netherlands. LNAI (Vol. 4594, pp. 346–350). Springer.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.
Yeh, D., Cheng, C., & Chen, Y. (2011). A predictive model for cerebrovascular disease using data mining. Expert Systems with Applications, 38(7), 8970–8977.