Knowledge-based approach to septic shock patient data using a neural network with trapezoidal activation functions


Artificial Intelligence in Medicine 28 (2003) 207–230

Jürgen Paetz*

J.W. Goethe-Universität Frankfurt am Main, Fachbereich Biologie und Informatik, Institut für Informatik, Robert-Mayer-Straße 11-15, D-60054 Frankfurt am Main, Germany

Received 20 August 2001; received in revised form 20 January 2003; accepted 14 April 2003

Abstract

In this contribution we present an application of a knowledge-based neural network technique in the domain of medical research. We consider the crucial problem of intensive care patients developing a septic shock during their stay at the intensive care unit. Septic shock is of prime importance in intensive care medicine due to its high mortality rate. Our analysis of the patient data is embedded in a medical data analysis cycle, including preprocessing, classification, rule generation and interpretation. For classification and rule generation we chose an improved architecture based on a growing trapezoidal basis function network for our metric variables. Our results extend those of a black box classification and give a deeper insight into our patient data. We evaluate our results with classification and rule performance measures. For feature selection we introduce a new importance measure.

© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Rule generation; Rule performance; Neural network; Neuro-fuzzy; Feature selection; Septic shock

1. Introduction

In recent years many scientists have published medical applications of neural networks for classification analysis, e.g. [2,22,34,43]. We have learned that supervised neural networks usually adapt better to data with highly overlapping or nonlinear class borders than statistical regression does [29,37].

* Tel.: +49-69-798-25119; fax: +49-69-798-28353. E-mail address: [email protected] (J. Paetz).

0933-3657/$ – see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0933-3657(03)00057-5


Standard statistical regression and standard neural network techniques like backpropagation do not explain their classification results by rules. Particularly physicians, as a main interest group, are interested in such rules to gain insight into the classification process, e.g. to draw conclusions for therapy. Thus, scientists have developed methods that allow the generation of rules within the classification process or the extraction of rules after the completed classification process. In Section 2.1, we give a short overview of such alternatives.

The subsequent layout of the paper is described below. Since our main goal is the application of a knowledge-based method to septic shock patient data, we choose one of the algorithms introduced in Section 2.1 for this task, discussing our choice. The main ideas of the chosen algorithm are described in Section 2.2. During the experimental phase we realized that the algorithm could be improved regarding the overlapping behavior of the neurons and the shrink mechanism [28], see Section 2.3. We repeat all the experiments with random partitions of the data into training and test data to get meaningful, statistically reasonable results. We evaluate our classification results with standard performance measures, e.g. the classification error on training and test datasets. Each rule is evaluated with a frequency and a confidence measure (Section 3.1). Another question concerning rules is the global importance of each variable. We propose an importance measure for this feature selection task in Section 3.2. Before applying the improved network to our septic shock patient data, we present some results on well-known benchmark datasets in Section 4 to point to our improvements and to clarify the general usefulness of our new measure ''importance.''

Septic shock is one of the most common causes of death in intensive care units (ICUs). Of course, this is reason enough to explore the causes in detail.
Our analysis is restricted to abdominal intensive care patients who developed a septic shock during their stay at the ICU. Abdominal septic shock has a high mortality rate in the ICU of up to 50%. Some more details are described in Section 5.1. Our analysis is based retrospectively on a medical database. Thus, preprocessing steps for data quality improvement are necessary (Section 5.2). We present all the results concerning our septic shock patient data in Section 5.3. Some interesting insights are found compared to a mere classification procedure. Finally, we discuss our results in Section 6. We find that it is not possible to clearly reduce the same dimensions for all the rules if we use septic shock data sampled from the entire time series. But we can generate several performant rules with fewer dimensions that give insight into the data.

2. Rule generation for metric data

In principle there are two kinds of medical data: metric (numerical) data (blood pressure, heart frequency, doses of medicaments, etc.) and categorical (symbolic) data (operations, diagnoses, therapies), including binary data (yes/no) and medical codes such as ICD-10 or OPS-301 [42]. Metric data can further be divided into biosignal data (e.g. EEG, MEG), sampled at an adequate sampling rate, and measurement data from patient records, recorded irregularly by physicians whenever they considered it necessary.


Originally our retrospectively analyzed database contained patient records with metric and categorical data, but no biosignal data. Although categorical data were available in general, their quality was limited due to text entries (non-standardized abbreviations) that could not be assigned unambiguously to database attributes. Also, the categorical data entries are affected by missing values, whose origin was often inexplicable. We therefore decided to exclude the categorical data in the database from the analysis. We will analyze categorical data in the near future with more carefully collected data, see Section 6. Thus, we concentrate on the topic of rule generation for metric data, i.e. metric rule generation.

2.1. Overview

Metric rule generation based on neural network techniques has been done in the past, considering different theoretical and practical viewpoints. Meanwhile the quantity of algorithms and variants is enormous. Every scientist whose objective is application-oriented data analysis or data mining has to choose algorithms that seem promising for the specific problem, without wasting too much time trying out all the algorithms and while keeping down the risk of overlooking a new interesting method. This is really the first challenge in application-oriented data mining. We list some of the relevant publications that have principally raised interest in our working group with regard to our problem. Only reference [41] is an unsupervised method; the others are supervised. The contributions [6–8,10,17–19,25,26] consider facets of neuro-fuzzy methodologies.

- An adaptive geometric algorithm [6,17].
- NEFCLASS based on a fuzzy perceptron [25], applied to medical benchmark datasets [26].
- ANFIS based on gradient descent and least-squares methods [18].
- Metric rule generation with RBF networks is equivalent to fuzzy rule generation [19].
- User interface for generated rules [7].
- Classification-rule generation dilemma [8].
- A-posteriori rule generation from backpropagation networks [39].
- Growing neural gas for rule generation [10].
- Self-organizing distance matrix approach with Kohonen maps [41].

Based on the fundamental theoretic result [19], rules can be generated directly from radial basis activation functions [7,10]. The drawback of this approach is the generation of too many rules, so that a physician may become confused. Additionally, in [7] the more didactic but important problem of building a widely accepted user interface for physicians is discussed. Without acceptance by physicians even a performant system could fail in practice. Backpropagation algorithms are designed for optimizing a classification result. To achieve rules using trained backpropagation networks, an additional a-posteriori rule generation process is necessary. Such a process may be long-winded and not very satisfying because it is difficult to optimize rule generation after classification [8], although it is in principle possible to obtain reasonable rule sets, e.g. [39]. The only unsupervised algorithm in the list above is the Kohonen map approach [41], which did not produce reasonable results for our data because our data contained no significant clusters.


Initially, we chose the supervised algorithm [17], also referred to as (ALG) in the following, as a starting point for metric rule generation due to its simple adaptive geometrical rule generation process. It is well designed for rule generation and classification. With our additional improvements it becomes a good method to derive results in benchmark and in real-world applications. In the next section, we will discuss the advantages of the method [17]. Then, we will motivate our improvements.

2.2. The neuro-fuzzy algorithm

The supervised algorithm (ALG) uses the class information of the data within its adaptation process. Here, we use the labels {survived, deceased} for the classes of survived respectively deceased patients. Principal advantages of the algorithm are:

- A simple heuristic geometric adaptation process that softens combinatorial rule explosion during the rule generation process. If one generates rules (geometrically interpreted as hyper-rectangles) for a d-dimensional dataset, d ∈ N, in a deterministic combinatorial manner, one has to consider 2d directions. Each of the 2d directions could be considered in combination. Using the binomial theorem (with C(2d, i) denoting the binomial coefficient), this leads to

  Σ_{i=1}^{2d} C(2d, i) = 2^{2d} − 1 = 4^d − 1    (1)

  combinations for changing the side expansions of d-dimensional rules (exponential growth).
- Irrelevant attributes for single rules are detected. This is the case if a part of a rule R has the format ''. . . and if var_j in (−∞, +∞) and . . . then class . . .''. Then, the value of variable j is not relevant. This variable can be omitted, leading to a shorter rule R with fewer attributes.
- Adaptive learning without stating membership functions a-priori.
- No aggregation of generated rules after rule learning is necessary.
- Starting the training with a-priori known rules after fuzzification is possible.
- Both extraction of crisp rules and extraction of fuzzy rules is possible, see Section 3.1.
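The irrelevant-attribute detection described above can be illustrated with a small sketch. The rule representation (a list of per-dimension intervals) and the function name are hypothetical, not from the paper; a side with bounds (−∞, +∞) is simply omitted from the rule text:

```python
import math

def rule_to_text(bounds, cls, names):
    """Render a hyper-rectangle rule, omitting irrelevant variables,
    i.e. those whose interval is unbounded: (-inf, +inf)."""
    terms = [
        f"{names[j]} in ({lo}, {hi})"
        for j, (lo, hi) in enumerate(bounds)
        if not (lo == -math.inf and hi == math.inf)  # irrelevant -> omit
    ]
    cond = " and ".join(terms) if terms else "TRUE"
    return f"if {cond} then class {cls}"

# Rule over three variables; 'var2' is irrelevant and is dropped.
r = rule_to_text([(0.5, 2.0), (-math.inf, math.inf), (1.0, 3.5)],
                 "deceased", ["var1", "var2", "var3"])
# r == "if var1 in (0.5, 2.0) and var3 in (1.0, 3.5) then class deceased"
```

Dropping the unbounded dimension shortens the rule without changing the set of samples it covers.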

Disadvantages of (ALG) are listed together with the improvements in Section 2.3. First let us describe the ideas of the algorithm in more detail. The rule generation algorithm (ALG) is based on radial basis functions with dynamic decay adjustment [4,5]. The two-layer network in Fig. 1 has neurons in the hidden layer with n-dimensional asymmetrical trapezoidal fuzzy activation functions (see Fig. 2). Every neuron in the first layer belongs to only one class and represents a fuzzy rule. During the learning phase these neurons p are adapted, i.e. the sides of the upper, smaller rectangles (core rules) and the sides of the lower, larger rectangles (support rules) of the trapezoids are adapted to the data. For every new training data point x of class c this is executed in four phases (cover, commit, shrink committed neuron, shrink conflict neurons). The learning procedure is initialized by the first training sample, for which


Fig. 1. The neural network structure of the rule generation algorithm (ALG) for two classes. In the first layer the network consists of neurons, separately for every class, i.e. the first layer is not fully connected to the second layer. Notations are explained in Algorithm 1.

one neuron is committed with infinite support side expansions and zero core side expansions in every dimension:

(1) Cover: if x is located in the region of a support rule of the same class c as x, expand one side of the corresponding core rule to cover x and increment the weight of the neuron.

(2) Commit: if no such support rule covers x, insert a new neuron p at point x of the same class, set its weight to one and its center z := x. The expansions of the sides are set to infinite.

Fig. 2. A two-dimensional asymmetric trapezoidal membership function, interpreted as a core rule (small rectangle at the top of the trapezoid) and a support rule (larger rectangle at the bottom of the trapezoid) in the algorithm (ALG).


(3) Shrink committed neuron: for a committed neuron, shrink the volume of the support and the core rectangle within one dimension of the neuron in relation to the neurons belonging to other classes.

(4) Shrink conflict neurons: for all the neurons belonging to another class ≠ c, shrink the volume of both rectangles within one dimension in relation to x.

At the beginning of each entire training cycle all the weights are set to zero. Classification is done by a winner-takes-all mechanism, i.e. calculate the activity s_i(x, c_i) as the sum of the weights multiplied by the fuzzy activation for every class c_i, and choose the class c_max as the classification result, where c_max := class(max_{c_i}(s_i(x, c_i))). Details about the learning algorithm and the shrink procedure can be found in [6,17]. In Section 2.3, we will describe the improved algorithm as we used it, including an improved shrink procedure.

2.3. Improvements of the neuro-fuzzy algorithm

Negative properties of the algorithm (ALG) that affect the rule performance, especially in high-dimensional spaces, are [28]:

- the dependence on the presentation order of the training samples, with an unfavourable expansion of core rules (P1);
- the immediate creation of new rules for outlier data (P2);
- the large overlapping of support rules (P3);
- the extensive overlapping of core rules with different class labels, which may cause semantic confusion of overlapping rules with different class labels (P4).

The overlapping of rule antecedents (hyper-rectangles) is in principle reasonable and also desired to achieve fuzzy rules. But the rules in algorithm [17] tend to overlap too intensely due to the heuristic cover-commit-shrink procedure. Our aim is to soften the problems (P1), (P3) and (P4) before applying the algorithm to our data. We approach the problem (P2) directly to avoid the location and sorting of outliers in an additional costly selection step and a succeeding new training phase as proposed in [3].
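The trapezoidal activation and the winner-takes-all step can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the per-dimension membership ramps linearly from the core border to the support border, and the combination over dimensions is assumed here to be the minimum (fuzzy AND); an infinite support side is treated as fully covering:

```python
import math

def trapezoid_1d(x, s_lo, c_lo, c_hi, s_hi):
    """Asymmetric trapezoidal membership: 1 on the core [c_lo, c_hi],
    linear ramp down to 0 at the support borders s_lo/s_hi, 0 outside."""
    if c_lo <= x <= c_hi:
        return 1.0
    if x < c_lo:
        if s_lo == -math.inf:
            return 1.0  # infinite support side: fully covering
        if x <= s_lo:
            return 0.0
        return (x - s_lo) / (c_lo - s_lo)
    if s_hi == math.inf:
        return 1.0
    if x >= s_hi:
        return 0.0
    return (s_hi - x) / (s_hi - c_hi)

def activation(x, core, support):
    """n-dimensional fuzzy activation: minimum (fuzzy AND) over dimensions."""
    return min(trapezoid_1d(xi, s[0], c[0], c[1], s[1])
               for xi, c, s in zip(x, core, support))

def classify(x, neurons):
    """Winner-takes-all: sum weight * activation per class, pick the maximum.
    A neuron is a (weight, class, core, support) tuple (hypothetical layout)."""
    sums = {}
    for w, cls, core, support in neurons:
        sums[cls] = sums.get(cls, 0.0) + w * activation(x, core, support)
    return max(sums, key=sums.get)
```

For example, a one-dimensional neuron with core (1, 2) and support (0, 4) yields activation 0.5 at x = 0.5 and at x = 3.0.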
We improve the algorithm (ALG), approaching the problem:

- (P1) by using a heuristic nearby parameter, defined in (2) below;
- (P2) by a counter (weight) for every neuron and every class and an additional insertion criterion;
- (P3) by a weak shrink;
- (P4) by an additional sub-procedure that detects and avoids the core conflict, so that no core rules will overlap.

The modified growing trapezoidal basis function network algorithm (MGT) is described below in pseudocode. Comments are delimited by %. As heuristically determined [17], mostly three to seven training epochs already provide a reliable classification error that settles down below a reasonable predefined threshold.


Algorithm 1 (MGT).
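As an illustration of the cover and commit phases underlying the algorithm, the following much-simplified sketch uses hypothetical data structures; the shrink phases (3) and (4) and the additional (MGT) insertion criterion are reduced to a pluggable stub, so this is not the published Algorithm 1 itself:

```python
import math

class Neuron:
    """One fuzzy rule: a core and a support hyper-rectangle of one class."""
    def __init__(self, center, cls):
        self.cls = cls
        self.weight = 1.0
        self.z = list(center)  # commit point (rule center)
        # support starts with infinite sides, core with zero expansion
        self.support = [[-math.inf, math.inf] for _ in center]
        self.core = [[c, c] for c in center]

    def in_support(self, x):
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, self.support))

def train_epoch(samples, neurons, shrink):
    """One training epoch: weights are reset, then each sample is either
    covered (phase 1) or committed as a new neuron (phase 2); the shrink
    phases (3) and (4) are delegated to the supplied callback."""
    for n in neurons:
        n.weight = 0.0
    for x, c in samples:
        covering = [n for n in neurons if n.cls == c and n.in_support(x)]
        if covering:
            p = covering[0]
            for d, xi in enumerate(x):  # expand the core to cover x (simplified)
                p.core[d][0] = min(p.core[d][0], xi)
                p.core[d][1] = max(p.core[d][1], xi)
            p.weight += 1.0
        else:
            p = Neuron(x, c)  # commit a new neuron at x
            neurons.append(p)
        shrink(p, x, neurons)  # resolve conflicts with neurons of other classes
    return neurons
```

With a no-op shrink callback, two same-class samples yield one neuron of weight 2 whose core has expanded to cover both points.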


A heuristic parameter used in (MGT) is the nearby parameter, defined as

nearby := (1/m) Σ_{k=1}^{c} Σ_{l=k+1}^{c} Σ_{i=1}^{K_k} Σ_{j=1}^{L_l} d_ij    (2)

where d_ij is the distance between two samples of different classes, m the number of all the distances, K_k resp. L_l the number of samples of class k resp. l, and c the number of classes. The nearby parameter prevents too large core rules by limiting their expansion, see (MGT) step 2. We can interpret the nearby parameter as the average distance between samples of different classes.

The shrink procedure that we use in (MGT) is slightly modified in comparison to the original procedure used in (ALG). In Algorithm 2, we distinguish between the new parameter L_{d,best≠∞} and the old parameter L_{d,max}, because L_{d,min} does not always exist, and the algorithm (ALG) may crash if L_{d,min} does not exist for any d = 1, ..., n, since then a comparison between L_{d,max} and L_{d,min} is not possible (for improvements see steps 4 and 6 in Algorithm 2). The mutual basis of (ALG) and Algorithm 2 is shrinking a rule in one dimension so that the volume of the shrunken rule remains as large as possible (''principle of minimal volume loss'') without getting too narrow expansions in some directions.

Algorithm 2 (Modified shrink mechanism (p, x)).


With the modifications presented in Algorithms 1 and 2 we have a performant tool for classification and rule generation. To examine performance characteristics we discuss well-known performance measures in the next section, and we introduce a feature selection method that is directly based on rule performance and rule features. We will apply all the measures in Sections 4 and 5.3.

3. Performance measures

To evaluate the performance of our classification and rule generation results, we reproduce relevant performance measures that are commonly used. A new method to rate the global importance of variables for feature selection (dimension reduction) is introduced. Our aim is to take into account the specific rule structure of the neuro-fuzzy rule set. We do not consider more general or arbitrary, ad hoc defined rule interestingness measures [15,46].

An important aspect concerning neural network training is ''overfitting'', so we calculated all the measures on test data after dividing the original dataset randomly into training and test data (50%/50%). Of course, other well-known test strategies such as k-fold cross-validation [14] could be used. Using k-fold cross-validation, the data is divided into k partitions; training is then done k times, each time leaving one partition out for testing. An additional division of the training data into estimation data and validation data can be used if one wants to validate the model before testing it on test data [14], e.g. with a division of the data into one third estimation, one third validation and one third test data. In applied statistics confidence intervals are often used to rate the significance of the results. Because we have no confidence intervals for the rule generation algorithm, we repeat entire experiments with different random partitions, calculating mean values and standard deviations. This is an important standard strategy to avoid overly optimistic or overly pessimistic results.
In summary, two rules that do not depend on the concrete test strategy are, see e.g. [14,24]:

- Do not calculate performance on training data.
- Calculate performance on independent test data that was not used as training data before, and repeat experiments with different partitions of the data.
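The repeated random 50%/50% partitioning described above can be sketched generically. The `fit`/`predict` callables are placeholders for any classifier; only test accuracy is recorded, never the training score:

```python
import random
import statistics

def repeated_holdout(data, labels, fit, predict, runs=10, seed=0):
    """Repeat a random 50%/50% train/test split 'runs' times and report
    mean and standard deviation of the test accuracy."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, data[i]) == labels[i] for i in test)
        accs.append(hits / len(test))
    return statistics.mean(accs), statistics.stdev(accs)
```

Reporting mean and standard deviation over the runs corresponds to the strategy used throughout the experiments (e.g. the ''Mean'' and ''S.D.'' columns of Tables 2-4).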


In Section 5.3, we will use training and test data where additionally the training set contains no patients that are in the test set and vice versa, because we intend to have fully unknown patient data in the test set. Patient data that are contained in both sets may simulate too good predictions [11].

Standard measures for the classification performance are the classification error, the confusion matrix (the (i, j)th entry of the confusion matrix is the percentage of samples of class i predicted as class j), specificity (''correctly classified deceased/number of deceased'') and sensitivity (''correctly classified survived/number of survived''). Simple measures for rule generation are the number of rules and the length of rules (number of variables used in the rules). A smaller number of rules and shorter rules are usually preferred for a better interpretation by physicians, although rules should be long enough to take into account all the relevant input variables. Each measure can be reported with mean and standard deviation.

3.1. Frequency and confidence

Now, we introduce the common rule performance measures frequency and confidence, similar to [1], where the measures are used for association rules. Because these measures are seldom used in the context of neuro-fuzzy systems, and because they play a central role in our rule generation process, we define them in Definition 2. Before that, we give the definition of a rectangular rule extracted from a fuzzy rule.

Definition 1 (Rectangular rule). Let F be a fuzzy rule generated by algorithm (ALG) or (MGT), and HT the corresponding hyper-trapezoid. A rectangular rule R is defined by a hyper-rectangle HR that is cut from HT at a given fuzzy level (degree of membership), cf. Fig. 3. HR is allowed to have infinite side expansions in some dimensions. If we consider a certain fuzzy level ν ∈ [0, 1], we call the rectangular rule a ν-rule. The corresponding hyper-rectangle is called a ν-region.
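The classification measures above (confusion matrix, sensitivity, specificity) can be sketched as follows; note that the sensitivity/specificity assignment follows the paper's own wording for the septic shock classes, which differs from some textbook conventions:

```python
def confusion_matrix(y_true, y_pred, classes):
    """Entry (i, j): fraction of samples of class i predicted as class j."""
    m = [[0.0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[classes.index(t)][classes.index(p)] += 1.0
    for i, c in enumerate(classes):
        n_i = sum(1 for t in y_true if t == c)
        if n_i:
            m[i] = [v / n_i for v in m[i]]
    return m

def sens_spec(y_true, y_pred):
    """Per the paper's wording: sensitivity = correctly classified survived /
    number of survived; specificity = correctly classified deceased /
    number of deceased (i.e. the two diagonal entries)."""
    cm = confusion_matrix(y_true, y_pred, ['survived', 'deceased'])
    return cm[0][0], cm[1][1]
```

Multiplied by 100, the matrix entries are the percentages referred to in the text.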

Definition 2 (Frequency and confidence). Let R be a rectangular rule and HR the corresponding hyper-rectangle. The rule R is associated with one class k. Such rules are generated by algorithm (ALG) resp. (MGT).

Fig. 3. A one-dimensional membership function, interpreted as a trapezoid HT, and a one-dimensional hyper-rectangle (interval) HR, cut from HT at fuzzy level ν.


(a) The class-s frequency freq_s(R) of the rule R of class k is defined as the number of samples of class s that are elements of HR, divided by the number of all the samples in the whole dataset. If s = k we say shortly frequency freq(R) instead of class-k frequency.

(b) The class-s confidence conf_s(R) of the rule R of class k is defined as the number of samples of class s that are elements of HR, divided by the number of all the samples that are elements of HR. If s = k we say shortly confidence conf(R) instead of class-k confidence.¹

Only rules R that are sufficiently frequent and confident, i.e. those with freq(R) ≥ minfreq and conf(R) ≥ minconf for a-priori defined thresholds minfreq and minconf, are interesting for a presentation to physicians. These thresholds must be chosen high enough to provide interesting, significant rules, and they must be chosen low enough to generate a sufficient number of rules. It is best to involve an expert of the application area, in medical applications a physician, in order to design proper thresholds for useful results. Then, sufficiently frequent and confident rules may be of great benefit for physicians.

3.2. Importance

A fundamental problem in data analysis is the detection of meaningful features to enhance the understandability of rule sets: shorter rules with fewer variables are easier to understand. How should we select the most relevant variables? As we have already seen, algorithms (ALG) and (MGT) can extract relevant local variables, i.e. variables that are relevant for a single rule. Of course, a variable that is relevant for one rule may not be relevant for another rule. It would be interesting to know whether it is possible to perform the global task of feature selection, i.e. reducing the number of variables equally for all the rules. Thus, we will derive an a-posteriori measure of importance for this task. Several well-known unsupervised methods for feature selection exist, e.g.
classical principal component analysis (PCA) [20] or self-organizing maps [21]. Supervised methods like stepwise regression (backward selection) [23] or genetic algorithms [47] could also be used. The main disadvantage of these supervised methods is the long-winded process of variable selection and repetition of the regression or evolution steps. Unsupervised models like PCA do not consider the class information: usually in PCA the transformed variables with small absolute eigenvalues are dropped, but such a variable with a small absolute eigenvalue may be the cause for class separability, while a (PCA-transformed) variable with a high absolute eigenvalue may not be the reason for class separability. Another a-posteriori method for dimension reduction for the neuro-fuzzy system [17], recently proposed in [38], can be used as an alternative. This method, which we have not used for our experiments, is based on information theory. It uses information gain calculations, similar to the ones used for the splitting criteria of various decision trees [24].

¹ Multiplied by 100 the measures can be interpreted as percentages.
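Definitions 1 and 2 above can be sketched together: cut a ν-rule from a trapezoid, then compute its frequency and confidence on a sample set. The interval-list representation and function names are hypothetical:

```python
import math

def cut_rule(core, support, nu):
    """Def. 1: cut a rectangular nu-rule from a trapezoid, given per-dimension
    core [c_lo, c_hi] and support [s_lo, s_hi] intervals. At fuzzy level nu
    the border lies nu of the way from the support border to the core border;
    infinite support sides stay infinite."""
    rect = []
    for (c_lo, c_hi), (s_lo, s_hi) in zip(core, support):
        lo = -math.inf if s_lo == -math.inf else s_lo + nu * (c_lo - s_lo)
        hi = math.inf if s_hi == math.inf else s_hi - nu * (s_hi - c_hi)
        rect.append((lo, hi))
    return rect

def freq_conf(rect, rule_class, samples):
    """Def. 2 with s = k: frequency = covered samples of the rule's own class
    over all samples; confidence = covered samples of the rule's own class
    over all covered samples."""
    covered = [c for x, c in samples
               if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, rect))]
    same = sum(1 for c in covered if c == rule_class)
    freq = same / len(samples)
    conf = same / len(covered) if covered else 0.0
    return freq, conf
```

The 0.9-rules discussed for the spiral data in Section 4.1 correspond to `cut_rule(core, support, 0.9)`; nu = 1 yields the core rule and nu = 0 the support rule.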


To use the rule structure of the neuro-fuzzy system more efficiently for feature selection, we introduce a new importance measure for algorithms (ALG) and (MGT) that is based directly on the generated rules and their performance, so that no additional methods like PCA, stepwise regression or information gain calculations are needed. A variable V should be more important if it is used in more frequent and more confident rules that are composed of only a small number of variables. This consideration leads to the following definition. It is an extended variation of the separability measure [16] that was calculated on relevant training variables only.

Definition 3 (Importance).

(a) Let R be the set of all generated rules. A variable W is called irrelevant (abbr.: irr.) for a rule R ∈ R if the interval borders of the rule for this variable are not limited, i.e. (−∞, +∞). For a given variable V, let {R_1^V, ..., R_{r_s}^V} ⊆ R be the subset of the r_s generated rule prototypes of class c_s where variable V is not irrelevant. Let p_s be the a-priori probability of the data of class s. Let conf_s be the confidence of a support rule. The importance for class s of a variable V, imp_s(V), considering the rule set R, is defined as

  imp_s(V) := (1/p_s) Σ_{i=1}^{r_s} freq_s(R_i^V) · conf_s(R_i^V) · 1/|{W | W not irr. in R_i^V}|    (3)

(b) The importance without respect to a specific class is defined as (with c the number of all the classes)

  imp(V) := Σ_{i=1}^{c} imp_i(V)    (4)

Originally we tried to compute an importance separately for every class as in Definition 3a. But as we realized, the importance values in the case of two different classes were highly correlated, so it is of little use to consider Definition 3a for a two-class problem: it is better to use Definition 3b without respect to different classes in this case. We believe that the reason for this effect is the shrink procedure: every time a rule of class k is shrunken, other rules of classes l ≠ k are shrunken, too. The geometrical ''symmetry'' of (ALG) resp. (MGT) leads to these high correlation values. In the following we will use imp(V) instead of imp_s(V), because we will only consider two-class problems. Of course, one can use Definition 3a for k-class problems, k > 2.
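Eqs. (3) and (4) can be sketched as one function that sums the class-wise importances. Each rule is represented as a (bounds, class) pair with per-dimension intervals (a hypothetical layout); freq_s and conf_s are recomputed inline according to Definition 2:

```python
import math

def importance(v, rules, samples, classes):
    """Def. 3b: imp(V) as the sum over classes s of imp_s(V) from Eq. (3).
    Variable j is irrelevant in a rule if its bounds are (-inf, +inf);
    only rules of class s in which v is relevant contribute to imp_s."""
    n = len(samples)
    imp = 0.0
    for s in classes:
        p_s = sum(1 for _, c in samples if c == s) / n  # a-priori probability
        for bounds, cls in rules:
            if cls != s:
                continue  # Eq. (3) sums over rules of class s only
            lo, hi = bounds[v]
            if lo == -math.inf and hi == math.inf:
                continue  # variable v is irrelevant in this rule
            covered = [c for x, c in samples
                       if all(l <= xi <= h for xi, (l, h) in zip(x, bounds))]
            same = sum(1 for c in covered if c == s)
            freq_s = same / n
            conf_s = same / len(covered) if covered else 0.0
            relevant = sum(1 for l, h in bounds
                           if not (l == -math.inf and h == math.inf))
            imp += freq_s * conf_s / (p_s * relevant)
    return imp
```

Frequent, confident rules with few relevant variables thus push the importance of the variables they use upward, matching the motivation given above.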

4. Application to benchmark datasets

Before applying the algorithm (MGT) to our septic shock patient data, we test it on benchmark datasets. The spiral data serve as a good (two-dimensional) example of nonlinear data to visualize the rules generated by (ALG) and (MGT). ''Cancer1'' [36] serves as a medical benchmark dataset to evaluate our importance measure.


Fig. 4. Spiral data, () class 1, (+) class 2.

4.1. Spiral

As a first benchmark we use the well-known two-dimensional spiral dataset (SD), which is hard to handle for all classifiers due to its nonlinear class boundaries, see Fig. 4. As a slight modification to the standard spiral data we omitted a few samples in the class 2 spiral to demonstrate the overlapping effects of (ALG) and (MGT). In Fig. 5 we see the (two-dimensionally projected) rules generated for (SD) by (ALG) without the nearby parameter (high overlap), and in Fig. 6 we see the result generated by (MGT) with nearby = 0.14.

Fig. 5. Generated rules with spiral data (light: class 1, dark: class 2), nearby = ∞, rules generated by (ALG).


Fig. 6. Generated rules with spiral data (light: class 1, dark: class 2), nearby = 0.14, rules generated by (MGT).

The average classification performance is calculated over 10 repetitions, each with a different partition of 50% randomly selected test data of (SD). The classification improvement with the nearby parameter is 3.6% (68.4% instead of 64.8%), with obviously smaller but more reasonable rules. To evaluate the performance of the rules we calculate the average confidence of the rules of (SD) for different fuzzy levels (degrees of membership). The confidence is always calculated on the test data samples, not on the training data samples, to prevent a training bias, as already stated in Section 3. A core rule (1-rule) has fuzzy level 1 and a support rule (0-rule) has fuzzy level 0. We also calculated all the rules for the intermediate fuzzy levels 0.1, 0.2, ..., 0.9. For the dataset (SD) we present the confidence results in Fig. 7. The frequency for the fuzzy levels is not so interesting because it is monotonically decreasing as the fuzzy level increases. If a rectangular rule is small due to the training samples, it may happen that no test data sample is an element of this rule. Then, the rule's frequency and confidence are set to zero. Due to the fact that the core rules (cut at fuzzy level 1.0) are the smallest rectangular rules, the best confidence of a rectangular rule is usually not achieved for the core rules but for rectangular rules cut at a fuzzy level < 1.0 [28]. Here, the choice of rectangular rules of fuzzy level approximately equal to 0.9 (0.9-rules) is the best choice for (SD). The mean confidence at this fuzzy level is higher than at fuzzy level 1.0 (core rules). Of course, both dimensions have equal importance because the geometric form of the spirals of different classes induces similar rules for both classes.
If we want to design fuzzy rules like ''if heart frequency is high and systolic blood pressure is low then class deceased'', it is not necessary to extract the rectangular rules of different fuzzy levels, because the whole trapezoid is interpreted as a fuzzy rule in the defuzzification step. However, we have noticed that fuzzy variables like ''high'', ''middle'' or ''low'' are not precise enough for describing our septic shock patient data. Also, it seemed very


Fig. 7. Spiral data (nearby = 0.14): mean confidence of the rules per class in relation to fuzzy level (in 0.1-steps).

difficult to describe fuzzy variables generated by the rule generation process at a higher resolution than "high", "middle" or "low". Additionally, due to the individual behavior of the patients, it is not reasonable to define in our situation what a high blood pressure is. Concerning the septic shock problem, a physician could not draw a good conclusion from these fuzzy variables without knowing the relevant values more precisely. The septic shock data is an example where a "simple" fuzzy approach is not satisfying. This was the reason for analyzing the underlying rectangular rules of the trapezoidal fuzzy activation functions with the frequency and confidence measures.

4.2. Cancer1

We will demonstrate how we used the importance measure for the benchmark "Cancer1" dataset (9 metric variables, 2 classes, 699 measurements: 458 class 1 = "benign", 241 class 2 = "malignant"). The importance of the nine variables is shown in Table 1.

Table 1
Importance of the variables in "Cancer1"

No.  Variable                      Importance
1    Clump thickness               0.99
2    Uniformity of cell size       1.26 (h)
3    Uniformity of cell shape      0.43
4    Marginal adhesion             0.72
5    Single epithelial cell size   0.18
6    Bare nuclei                   1.31 (h)
7    Bland chromatin               0.17 (l)
8    Normal nucleoli               1.12
9    Mitoses                       0.03 (l)

The two highest values are marked with (h), the two lowest with (l). Mean of 10 repetitions.
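Selecting a variable subset from such an importance ranking can be sketched as follows (a hypothetical illustration using the Table 1 values; the helper names are ours, not the paper's):

```python
# Hypothetical sketch: rank variables by the importance measure
# (Definition 3) and keep the k highest-ranked ones; values from Table 1.

importance = {1: 0.99, 2: 1.26, 3: 0.43, 4: 0.72, 5: 0.18,
              6: 1.31, 7: 0.17, 8: 1.12, 9: 0.03}

def top_k_variables(imp, k):
    """Variable numbers of the k highest importance values, ascending."""
    return sorted(sorted(imp, key=imp.get, reverse=True)[:k])

def bottom_k_variables(imp, k):
    """Variable numbers of the k lowest importance values, ascending."""
    return sorted(sorted(imp, key=imp.get)[:k])
```

For k = 2 this yields variables 2 and 6 (highest) and variables 7 and 9 (lowest), the two subsets compared in the experiments below.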


Table 2
Correct classifications on training and test data (nine variables, %)

               Mean    S.D.   Minimum   Maximum
Train correct  96.10   0.49   94.84     96.56
Test correct   95.94   0.88   94.57     97.14

Ten repetitions.

Table 3
Correct classifications on training and test data (variables 2 and 6, %)

               Mean    S.D.   Minimum   Maximum
Train correct  93.90   1.47   91.12     95.70
Test correct   94.43   1.66   90.86     96.86

Ten repetitions.

The corresponding classification result is shown in Table 2. Next, we repeat the classification with the two variables that have the highest importance values (variables 2 and 6), cf. Table 3. We notice that the classification result becomes slightly worse (by about 2%) compared to the result using all the variables. The mean rule length (number of relevant variables within a rule) using nine variables is 1.8 for class 1 and 3.4 for class 2. Using the two variables 2 and 6, the mean rule length is 1.7 for class 1 and 2.0 for class 2. The rules for class 2 are shorter, improving understandability. To show that variables 2 and 6 are a good choice, we repeat the experiment with the two variables that have the lowest importance values (variables 7 and 9), cf. Table 4.

Table 4
Correct classifications on training and test data (variables 7 and 9, %)

               Mean    S.D.   Minimum   Maximum
Train correct  89.74   3.61   80.23     92.26
Test correct   89.14   5.32   74.86     92.86

Ten repetitions.

The results are on average another 4% worse than those in Table 3, with a higher standard deviation. Thus, we conclude that these variables would not be a good choice for dimension reduction. Of course, in the example "Cancer1" the classification result using variables 7 and 9 is not poor (89%) in absolute terms, but we found no pair of variables whose classification result was significantly below 89%: some pairs, such as variables 5 and 9, lead to a similarly poor result (89%). Furthermore, we checked all possible combinations of two variables and found no better results than with variables 2 and 6. A classification using variables with smaller importance values tends to yield a worse classification result than one using variables with higher importance values. How bad the worst result is and how good the best result is certainly depends on the dataset. As additional information, we give the average number of generated rules: 4.9 for class "benign" and 5.6 for class "malignant" (nine variables), 4.8 for class "benign" and


5.0 for class "malignant" (variables 2 and 6), and 3.5 for class "benign" and 3.8 for class "malignant" (variables 7 and 9). Here, the lower number of rules for variables 7 and 9 comes with a lower classification performance.

5. Application to septic shock patient data

In this section, we briefly review the septic shock problem in intensive care medicine. Then, we describe our preprocessing steps for preparing the data for analysis. In fact, preprocessing of multivariate time series with missing values (the usual case in medical databases) is very time consuming, although very important [40]. Finally, we present our results.

5.1. Medical background

In abdominal intensive care medicine, patients are in very critical health conditions: often, ICU patients develop a septic shock [9,12,13,27], a phenomenon that is related to mechanisms of the immune system and is still an important research subject for medical experts and data analysts. No ultimately satisfying results have been published to date. Septic shock is associated with a high mortality of about 50%. It is always related to measurements leaving the normal range (e.g. blood pressure, temperature, respiratory frequency, number of leukocytes), and it is often related to multiorgan failure. The epidemiology of 656 intensive care unit patients was examined in a study undertaken from 1995 to 1997 at the Klinikum der J.W. Goethe-Universität Frankfurt am Main [45]. The data of this study and of another study undertaken in the same clinic from 1993 to 1995 form the database for this contribution. We set up a list of variables including the relevant readings (temperature, blood pressure, etc.). The database consists of 874 patients, of whom 70 developed a septic shock during their stay at the ICU; 38.6% of these septic shock patients died.

5.2. Data mining and data preprocessing

With respect to our analysis, medical data mining is a composition of the following steps:

- knowledge extraction: interviewing experts, reading the literature, browsing the World Wide Web in order to obtain assured facts about septic shock;
- data collection and problem formulation: collecting patient records, setting up the variables, evaluation of interesting questions and hypotheses (What are the medical indicators of very critical septic shock patients?);
- database operations: (relational) database design including input and output user interfaces, (SQL) queries and basic visualization programs;
- preprocessing (cf. [33,40]): adaptation of different units, discussion of unusual values, correcting typing errors with the help of limit values, selection of patients, variables and periods of time from multivariate time series, sampling (here: 24 h), consideration of missing values and outliers;


- methods: development, selection, testing and application (here: classification and rule generation, see Sections 4 and 5.3);
- interpretation: discussion of the results (new knowledge from generated rules).

Preprocessing is a very important step that can heavily influence the results. In particular, missing value handling must be done carefully. Our missing value strategy is as follows: since we absolutely do not want to learn anything erroneously from missing values, or from additionally calculated values such as class-dependent means, we insert randomly chosen data from a suitable normal distribution (noise), so that our algorithm cannot learn from the added noise data. In particular, we want to prevent situations where data with missing value replacements might be better classified than without [3]. In reference [3], missing values are replaced by centers of gravity (of the fuzzy rules). Since this procedure unjustly improves the statistical properties of the original data by replacing the missing values in the best possible manner for the system, we have not used this strategy. Some other well-known strategies for missing value replacement (e.g. means, regression, nearest neighbor, random values) in one dimension are compared in [35]. Since our data samples have missing values in more than one dimension, the latter results cannot be adapted directly to our situation.

5.3. Results

We present the results using a subset of our database, containing the measurements of the 12 most frequently measured variables: creatinin (mg/dl), calcium (mmol/l), arterial pCO2 (mmHg), pH (–), haematocrit (%), sodium (mmol/l), leukocytes (1000/µl), haemoglobin (g/dl), central venous pressure (CVP) (cmH2O), temperature (°C), heart frequency (1/min) and systolic blood pressure (mmHg).
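The "noise" replacement strategy of Section 5.2 might be sketched as follows (a sketch under assumptions: the paper does not specify how the distribution parameters are chosen; fitting them to the observed values of each variable is our choice):

```python
# Sketch of the "noise" missing-value strategy (Section 5.2): replace
# each missing entry by a random draw from a normal distribution fitted
# to the observed values of that variable, so the classifier cannot
# learn a systematic pattern from the replacements.
import random
import statistics

def fill_missing_with_noise(column, rng=random):
    """column: list of floats with None marking missing values."""
    observed = [v for v in column if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed) if len(observed) > 1 else 0.0
    return [v if v is not None else rng.gauss(mu, sigma) for v in column]
```

By construction, the observed values are left untouched and only the gaps are filled.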
To limit the influence of missing values, we required a minimum of ten out of the 12 variables to be present for each sample, so that 1698 samples remained out of 2068 (1177 survived, 521 deceased). We point out that we classify the samples of the patients in order to generate warnings for very critical states and all-clears for very uncritical states. We labeled all the samples of a deceased patient as "deceased". For technical reasons the samples were transformed to the unit hypercube [0, 1]^12 for use with (MGT). The results were achieved using (MGT) with the nearby-parameter (2) set to nearby = 0.54, with our missing value strategy "noise" (cf. the previous section), a minimum of three and a maximum of seven training epochs, mindp = 3 and sweak = 0.3. Training was done with 50% of the data, testing with the remaining 50%. The classification results of seven repetitions of the entire training procedure, each one with a different random data partition, are presented in Table 5.

Table 5
Correct classifications on training and test data (%)

               Mean    S.D.   Minimum   Maximum
Train correct  74.62   3.82   68.85     78.66
Test correct   69.00   4.37   61.20     72.66

Seven repetitions.

The test data specificity is 92.26%, the sensitivity is 15.01%, cf. Section 3. The confusion matrix is shown in Table 6.

Table 6
Confusion matrix of the test data (%)

                Cl. survived   Cl. deceased
Real survived   64.94          5.63
Real deceased   25.37          4.06

Mean values of seven repetitions. Cl.: "classified as".

These results using (MGT) are similar to the classification results published in [33] for a dataset similar to ours. The poor sensitivity could be interpreted in the sense that deceased patients are not in a very critical state all the time during their stay at the ICU, which makes it difficult to predict the outcome with high confidence for all the patients in our dataset. Consequently, a warning should be generated only from the most confident rules. The mean frequencies and the mean confidences are presented in Table 7.

Table 7
Frequencies and confidences for the classes survived and deceased (%)

                             Survived   Deceased
Frequency (core rules)       2.22       0.17
Frequency (support rules)    14.82      11.61
Confidence (core rules)      56.94      9.69
Confidence (support rules)   68.66      31.83

Mean values of seven repetitions.

As we see, the mean confidences are not very high: the mean confidence of the core rules is even lower than that of the support rules, since test data samples often were not elements of the core regions. Remember that in this case the confidence was set to zero (cf. Section 4.1). Unfortunately, no more data samples were available. In fact, the best rectangular rules for the class deceased were achieved at fuzzy levels η ∈ [0.3, 0.4], see Fig. 8. Fig. 9 shows the corresponding mean frequencies. The number of generated rules averaged 91.8 (survived: 43.4 rules, deceased: 48.4 rules). There are no global rules, i.e. rules valid for the whole dataset, but there are some local rules covering smaller areas of the dataset with a reliable performance, many with a dimensionality of less than 12. The mean rule length (number of relevant attributes) for the support rules is 5.2 (equal for both classes). These rules (for class deceased) are good candidates as generators for warnings to physicians.
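The specificity and sensitivity used here (class "deceased" as the positive class) follow directly from the confusion matrix; a sketch with the Table 6 percentages:

```python
# Sketch: sensitivity and specificity from the percentage confusion
# matrix of Table 6, with class "deceased" as the positive class.

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Table 6 entries (mean percentages over seven repetitions)
sens, spec = sensitivity_specificity(tp=4.06, fn=25.37, tn=64.94, fp=5.63)
```

This gives roughly 13.8% and 92.0%; the small deviation from the reported 15.01% and 92.26% presumably stems from averaging the per-repetition ratios rather than taking ratios of the averaged matrix.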
We give two examples of rules (one for class survived, one for class deceased) that have a high confidence and frequency:

(I) "if heart frequency ≤ 115.00 and systolic blood pressure ≥ 125 and CVP ≥ 7.00 and leukocytes ≤ 26.40 and haematocrit ≥ 25.00 and sodium ≤ 149.00 then class survived", with frequency 16.4% and confidence 86.3%;

(II) "if heart frequency ≤ 110.00 and systolic blood pressure ≤ 120.00 and CVP ≥ 11.00 and temperature ≥ 36.50 and leukocytes in (10.93, 37.00) and sodium in (138.00, 146.00) then class deceased", with frequency 3.6% and confidence 64.7%.
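A rule such as (II) could serve as a warning generator as follows (a hypothetical sketch; the inequality directions are reconstructed from the text above, and the variable names are ours):

```python
# Hypothetical sketch: rule (II) as an antecedent check on a sample
# (a dict of measurements); True would trigger a "deceased-class" warning.
# Thresholds are taken from rule (II); the inequality directions are
# reconstructed and not guaranteed to match the original publication.

def rule_II_fires(s):
    return (s["heart_frequency"] <= 110.0 and
            s["systolic_bp"] <= 120.0 and
            s["cvp"] >= 11.0 and
            s["temperature"] >= 36.5 and
            10.93 < s["leukocytes"] < 37.0 and
            138.0 < s["sodium"] < 146.0)
```

In practice only the most confident rules of this kind should trigger warnings, as argued above.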


Fig. 8. Mean confidence of the rules per class in relation to the fuzzy level (in 0.1-steps), nearby = 0.54.

As a human contribution, a medical expert proposed the rule "if pH ≤ 7.2 (–) and arterial pO2 ≤ 60 [mmHg] and inspiratory O2 concentration ≥ 80 [%] and base excess ≤ −5 [mval/l] then class deceased". Interestingly, no data sample of our database is an element of the defined region: there is no data support for this opinion. Thus, a rational data-driven

Fig. 9. Mean frequency of the rules per class in relation to the fuzzy level (in 0.1-steps), nearby = 0.54.


Table 8
Importance of the variables in the dataset, in descending order

No.  Variable                  Importance
1    Central venous pressure   1.93
2    Temperature               1.82
3    pH                        1.33
4    Sodium                    1.32
5    Systolic blood pressure   1.29
6    Heart frequency           1.22
7    Leukocytes                1.06
8    Haematocrit               0.84
9    Calcium                   0.76
10   Arterial pCO2             0.44
11   Creatinin                 0.43
12   Haemoglobin               0.10
machine learning approach to metric rule generation is in any case a benefit compared with subjectively induced, experience-based rules for the problem of septic shock. Nevertheless, with our data we cannot generate rules with almost 100% confidence, i.e. almost causally determined rules. Such rules would be very important as a basis for the development of therapy strategies. We calculated the importance (introduced in Definition 3) and present the results in Table 8. Only the haemoglobin measurements are clearly less important for the rule generation process. Central venous pressure and temperature are the most important variables in this experiment. Here, it is not possible to find a noticeably smaller set of variables, as was possible for the "Cancer1" data. As preliminary experiments showed, a PCA produced similar results for the eigenvalues: the eigenvalues (of the correlation matrix of the 12 variables) are 1.82, 1.63, 1.31, 1.21, 1.06, 1.02, 0.97, 0.85, 0.78, 0.65, 0.42, 0.27. The difference between the highest and the lowest eigenvalue is not large, so a PCA could not reduce the dataset noticeably either. Recently, we found that the classification performance achieved with data samples from the first 3 days of the ICU stay is much lower than when using all the samples; it becomes much higher if we use only samples from the last 3 days. The classification performance is thus time dependent, and future research will attempt to find better results by utilizing this insight. It is of interest to physicians that a combination of the three variables systolic blood pressure, diastolic blood pressure and thrombocytes leads to better classification results, as additional experiments have recently shown.
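The PCA comparison above amounts to an eigen-decomposition of the correlation matrix; a sketch with numpy (the data array is a placeholder, not our patient data):

```python
# Sketch: eigenvalues of the correlation matrix, as used for the PCA
# comparison above. X is a placeholder (n_samples, n_variables) array.
import numpy as np

def correlation_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, in descending order;
    their sum equals the number of variables (the matrix trace)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]
```

A nearly flat eigenvalue spectrum, as observed for the 12 variables, indicates that no small subset of principal components captures most of the variance.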

6. Discussion and conclusion

We have presented our data analysis approach for the important medical problem of septic shock, with an emphasis on rule generation for metric data. The results are a major extension of preliminary work (preprocessing, classification) [11,33], now providing us with understandable knowledge for classification.


We have reviewed and improved the algorithm of [17] with regard to the overlapping behavior and the shrinking mechanism, in order to generate more performant rules. Our results on benchmark data motivated the application of (MGT) to our patient data. The generation of fuzzy rules with fuzzy antecedents seemed promising, but it turned out that there are no suitable fuzzy variables. Thus, we used rectangular rules. We found that the frequency and confidence of rectangular rules (η-rules) with higher fuzzy levels η may become too low. The understandability by physicians could suffer if η-rules with too high an η are used. Several interesting local rules for human interaction have been generated, although it was not possible to obtain an excellent global result covering all the patient data. The rules, derived from several repetitions of our experiments, were carefully examined with the performance measures frequency and confidence, calculated on test data. Thus, our results are statistically representative and not only descriptive. Due to the individual behavior of the septic shock patients, a more satisfying sample state classification remains a challenging problem for future research. The detection of very confident states is especially difficult. We introduced a new measure to analyze the global importance of variables with respect to the rule set. Although the importance measure led to an evident dimension reduction on the medical benchmark data, only little success in dimension reduction was achieved for our septic shock patient data. As an interesting result we obtained an ordering of the variables by importance. Further work will be the application of rule generation with additional categorical data, extending our previous approaches [30–32]. A better inclusion of time series dynamics and the inclusion of more variables are desired extensions of our analysis. A comparison to medical scores, e.g.
the SOFA score [44], is in the pipeline. We also expect improved results from analyses of the datasets that have been collected in clinics all over Germany during the last 3 years. Another interesting approach for the future is the prognosis of specific therapy options.

Acknowledgements

The work was done within the project MEDAN (http://www.medan.de, Ref. no. HA 1456/7-2), supported by the German Research Foundation (DFG). The author thanks all the participants of the MEDAN working group, especially Dr. Brause and Prof. Hanisch, for supporting my work.

References

[1] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Bocca J, Jarke M, Zaniolo C, editors. Proceedings of the 20th International Conference on Very Large Databases (VLDB), Santiago de Chile, Chile. San Mateo: Morgan Kaufmann; 1994. p. 487–99.
[2] Baxt W. Application of artificial neural networks to clinical medicine. Lancet 1995;346:1135–8.
[3] Berthold MR. Fuzzy-models and potential outliers. In: Dave RN, Sudkamp TA, editors. Proceedings of the 18th International Conference of the North America Fuzzy Information Processing Society (NAFIPS), New York, USA. Los Alamitos: IEEE Press; 1999. p. 532–5.


[4] Berthold MR, Diamond J. Boosting the performance of RBF networks with dynamic decay adjustment. In: Tesauro G, Touretzky DS, Leen TK, editors. Proceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 7, Denver, USA. Cambridge: MIT Press; 1995. p. 521–8.
[5] Berthold MR, Diamond J. Constructive training of probabilistic neural networks. Neurocomputing 1998;19:167–83.
[6] Berthold MR, Huber K-P. From radial to rectangular basis functions: a new approach for rule learning from large datasets. Internal Report 15–95. University of Karlsruhe, Germany; 1995.
[7] Brause R, Friedrich F. A neuro-fuzzy approach as medical diagnostic interface. In: Verleysen M, editor. Proceedings of the 9th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium. Brussels, Belgium: D-Facto publications; 2000. p. 201–6.
[8] Castellano G, Fanelli AM. Modeling fuzzy classification systems with compact rule base. In: Mohammadian M, editor. Proceedings of the 1st International Conference on Computational Intelligence for Modeling, Control and Automation (CIMCA), Vienna, Austria. Amsterdam: IOS Press; 1999.
[9] Fein AM, et al., editors. Sepsis and multiorgan failure. Baltimore: Lippincott Williams & Wilkins; 1997.
[10] Fritzke B. Incremental neuro-fuzzy systems. In: Bosacchi B, Bezdek JC, Fogel DB, editors. Proceedings of the Applications of Fuzzy Logic Technology IV, vol. 3165, San Diego, USA. Bellingham: Society of Photo-optical Instrumentation Engineers (SPIE); 1997. p. 86–97.
[11] Hamker F, Paetz J, Thöne S, Brause R, Hanisch E. Erkennung kritischer Zustände von Patienten mit der Diagnose "Septischer Schock" mit einem RBF-Netz [Detection of critical states of patients with the diagnosis "septic shock" with an RBF network]. Interner Bericht 04/00, Fachbereich Informatik, J.W. Goethe-Univ. Frankfurt am Main, Germany, ISSN 1432–9611, 2000.
[12] Hanisch E, Encke A. Intensive care management in abdominal surgical patients with septic complications. In: Faist E, editor. Immunological screening and immunotherapy in critically ill patients with abdominal infections. Berlin: Springer; 2001. p. 71–138.
[13] Hardaway RM. A review of septic shock. Am Surg 2000;66(1):22–9.
[14] Haykin S. Neural networks: a comprehensive foundation. 2nd ed. Upper Saddle River: Prentice-Hall; 1999.
[15] Hilderman RJ, Hamilton H. Knowledge discovery and interestingness measures: a survey. Technical Report CS 99-04. Department of Computer Science, University of Regina, Canada; 1999.
[16] Huber K-P. Datenbasierte Metamodellierung mit automatisch erzeugten Fuzzy-Regeln [Data-based metamodeling with automatically generated fuzzy rules]. VDI-Verlag, Doctoral thesis, University of Karlsruhe, Germany; 1998.
[17] Huber K-P, Berthold MR. Building precise classifiers with automatic rule extraction. In: Proceedings of the IEEE International Conference on Neural Networks (ICNN), Perth, Western Australia. Piscataway: University of Western Australia; 1995. p. 1263–8.
[18] Jang J-SR. ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans Syst Man Cybern 1993;23:665–85.
[19] Jang J-SR, Sun C-T. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Trans Neural Networks 1993;4(1):156–9.
[20] Jolliffe IT. Principal component analysis. New York: Springer; 1986.
[21] Kohonen T. Self-organizing maps. 3rd ed. Berlin: Springer; 2001.
[22] Lisboa PJG, Ifeachor EC, Szczepaniak PS, editors. Artificial neural networks in biomedicine. London: Springer; 2000.
[23] Miller AJ. Subset selection in regression. London: Chapman-Hall; 1990.
[24] Mitchell TM. Machine learning. New York: McGraw-Hill; 1997.
[25] Nauck D, Klawonn F, Kruse R. Foundations of neuro-fuzzy systems. Chichester: Wiley; 1997.
[26] Nauck D, Kruse R. Obtaining interpretable fuzzy classification rules from medical data. Artif Intell Med 1999;16(2):149–69.
[27] Neugebauer E, Rixen D, Raum M, Schäfer U. Thirty years of anti-mediator treatment in sepsis and septic shock—what have we learned? Arch Surg 1998;383:26–34.
[28] Paetz J. Metric rule generation with septic shock patient data. In: Cercone N, Lin TY, Wu X, editors. Proceedings of the 1st IEEE International Conference on Data Mining (ICDM), San Jose, USA. Los Alamitos: IEEE Computer Society Press; 2001. p. 637–8.
[29] Paetz J. Some remarks on choosing a method for outcome prediction. Letter to the Editor. Crit Care Med 2002;30(3):724.


[30] Paetz J. Durchschnittsbasierte Generalisierungsregeln Teil I: Grundlagen [Average-based generalization rules, part I: foundations]. Frankfurter Informatik-Berichte Nr. 1/02, Institut für Informatik, Fachbereich Biologie und Informatik, ISSN 1616–9107, 2002.
[31] Paetz J, Brause R. Durchschnittsbasierte Generalisierungsregeln Teil II: Analyse von Daten septischer Schock-Patienten [Average-based generalization rules, part II: analysis of septic shock patient data]. Frankfurter Informatik-Berichte Nr. 2/02, Institut für Informatik, Fachbereich Biologie und Informatik, ISSN 1616–9107, 2002.
[32] Paetz J, Brause R. A frequent patterns tree approach for rule generation with categorical septic shock patient data. In: Crespo J, Maojo V, Martin F, editors. Proceedings of the 2nd International Symposium on Medical Data Analysis (ISMDA), LNCS vol. 2199, Madrid, Spain. Berlin: Springer; 2001. p. 207–12.
[33] Paetz J, Hamker F, Thöne S. About the analysis of septic shock patient data. In: Brause R, Hanisch E, editors. Proceedings of the 1st International Symposium on Medical Data Analysis (ISMDA), LNCS vol. 1933, Frankfurt am Main, Germany. Berlin: Springer; 2000. p. 130–7.
[34] Penny W, Frost D. Neural networks in clinical medicine. Med Decis Making 1996;16:386–98.
[35] Pesonen E, Eskelinen M, Juhola M. Treatment of missing data values in a neural network based decision support system for acute abdominal pain. Artif Intell Med 1998;13:139–46.
[36] Prechelt L. Proben1—a set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Computer Science Department, University of Karlsruhe, Germany, 1994. ftp://ftp.informatik.uni-freiburg.de/documents/papers/neuro/proben/proben1/.
[37] Schumacher M, Roßner R, Vach W. Neural networks and logistic regression: part I. Comput Stat Data Anal 1996;21:661–82.
[38] Silipo R, Berthold MR. Discriminative power of input features in a fuzzy model. In: Hand D, Kok J, Berthold MR, editors. Proceedings of the 3rd International Symposium on Advances in Intelligent Data Analysis (IDA), LNCS vol. 1642, Amsterdam, The Netherlands. Berlin: Springer; 1999. p. 87–98.
[39] Tsukimoto H. Extracting rules from trained neural networks. IEEE Trans Neural Networks 2000;11(2):377–89.
[40] Tsumoto S. Clinical knowledge discovery in hospital information systems: two case studies. In: Zighed DA, Komorowski HJ, Zytkow JM, editors. Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), LNCS vol. 1910, Lyon, France. Berlin: Springer; 2000. p. 652–6.
[41] Ultsch A, Korus D, Kleine TO. Integration of neural networks and knowledge-based systems in medicine. In: Barahona P, Stefanelli M, Wyatt JC, editors. Proceedings of the 5th Conference on Artificial Intelligence in Medicine in Europe (AIME), LNAI vol. 934. New York: Springer; 1995. p. 425–6.
[42] van Bemmel JH, Musen MA. Handbook of medical informatics. Heidelberg: Springer; 1997.
[43] Villmann T. Neural networks approaches in medicine—a review of actual developments. In: Verleysen M, editor. Proceedings of the 9th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium. Brussels, Belgium: D-Facto publications; 2000. p. 165–76.
[44] Vincent J-L, Moreno R, Takala J, et al. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive Care Med 1996;22:707–10.
[45] Wade S, Büssow M, Hanisch E. Epidemiology of SIRS, sepsis and septic shock in surgical intensive care patients. Chirurg 1998;69:648–55.
[46] Wittmann T, Ruhland J, Eichholz M. Enhancing rule interestingness for neuro-fuzzy systems. In: Zytkow JM, Rauch J, editors. Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Prague, Czech Republic. Berlin: Springer; 1999. p. 242–50.
[47] Yang J, Honavar V. Feature subset selection using a genetic algorithm. In: Motoda H, Liu H, editors. Feature extraction, construction, and subset selection: a data mining perspective. Boston: Kluwer Academic Publishers; 1998. p. 117–36.