An interactive framework for an analysis of ECG signals




Artificial Intelligence in Medicine 24 (2002) 109–132

Giovanni Bortolan (a), Witold Pedrycz (b, c, *)

(a) LADSEB-CNR, Corso Stati Uniti 4, 35020 Padova, Italy
(b) Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alta., Canada T6G 2G7
(c) Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland

Received 13 June 2001; received in revised form 23 September 2001; accepted 13 October 2001

Abstract

In this study, we introduce and discuss the development of a highly interactive and user-friendly environment for ECG signal analysis. The underlying neural architecture, the crux of this environment, comes in the form of a self-organizing map. This map helps discover the structure in a set of ECG patterns and visualize the topology of the data. The role of the designer is to choose, from the already visualized regions of the self-organizing map, those characterized by a significant level of data homogeneity and a substantial difference from other regions. In the sequel, the regions are described by means of information granules, fuzzy sets that are essential to the characterization of the main relationships existing in the ECG data. The study introduces an original method of constructing membership functions that incorporates class membership as an important factor affecting changes in membership grades. The study includes a comprehensive descriptive modeling of highly dimensional ECG data. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Information granulation; Fuzzy sets; User-interactive models of data; Self-organizing maps; Computerized ECG signal analysis and classification; Data mining; Noninvasive data analysis

1. Introductory comments and problem statement

Data analysis, along with its current trends manifested in the form of data mining (cf. [5,18]), is aimed at providing the user with a transparent and meaningful description of significant dependencies in experimental data. This research agenda is particularly vital for biomedical data, where the transparency of findings is of interest to the end user. The data mining trend is augmented by the general methodology of computational intelligence

* Corresponding author. E-mail address: [email protected] (W. Pedrycz).

0933-3657/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0933-3657(01)00096-3


G. Bortolan, W. Pedrycz / Artificial Intelligence in Medicine 24 (2002) 109–132

(CI) and in this way is linked to fuzzy modeling, cf. [7,8,16,17,19,21]. Neural networks, being an integral component of CI, contribute to the flexibility (adaptability) of the systems. It is instructive to identify several key features of data analysis that we consider essential to any pursuit of data mining (it is anticipated that most of the methods used there share these properties to some extent).

• User-friendliness of the process: By this, we mean processes of data analysis in which the human (designer or end user) is provided with results that are easily interpretable. Moreover, the designer is furnished with the ability to affect the way in which the data analysis takes place. In particular, it is the user who can actively change the perspective from which he or she looks at the data.

• Noninvasive nature of the processing: This is concerned with a minimal level of intervention by the designer and an environment that assumes a minimal level of constraints imposed on the data. To be specific, a class of linear regression models is more specific and constrained than a class of fuzzy rule-based models. By keeping the level of such intervention low, we allow the data set to 'speak' its own language without being confined to a structure imposed up front.

The two features listed above are inherently linked with the issue of granular computing and information granules. Information granules are central to the way in which data are perceived and interpreted by humans [23–25]. Instead of being buried in huge piles of numeric data, the user summarizes the data and forms a very limited vocabulary of information granules, say fuzzy sets, that are easily understood, come with well-defined semantics, and help express relationships existing in the data at this specific level of information granularity.
Granularity implements information abstraction: by moving to a certain level of abstraction (granulation), we hide details and concentrate on the essentials (main dependencies) in the data. Quite frequently, information granules are constructed through clustering, cf. [1,2]. ECG signal analysis has been a vigorous research pursuit in which the above paradigm of noninvasive data interpretation is of interest. There has been a vast body of detailed models (especially classifiers, including neural networks, see [4,9,14,20]); nevertheless, an approach of a more descriptive nature has not been investigated in depth. The intent of this study is to narrow the gap in this realm. We introduce and study a noninvasive model of ECG data analysis developed in the setting of self-organizing maps (SOMs). From the application point of view, this mode of data analysis supported by the maps is very much in the realm of exploratory data analysis or descriptive modeling. As such, the exploratory nature of the approach is a first essential step toward a subsequent diagnostic classification in which one looks into more details of the classifier, its topology, and the way of training and testing. Furthermore, the approach presented in this study can be regarded as a way of bringing in some fundamental ideas of data mining, treated as a general attempt at understanding medical data. Exploratory data analysis has to be contrasted with predictive modeling, which aims at classifying new patterns, that is, predicting their class membership on the basis of the location of the pattern in the feature space. The material is arranged into sections, each of them focusing on a separate facet of the descriptive modeling. We start with a discussion of the experimental data


(Section 2). In Section 3, we revisit the concept of SOMs, regarded as the backbone of the user-oriented development environment. The basic idea of SOMs is augmented here by the introduction of some auxiliary maps that help visualize the structure in the data and describe its characteristics (Section 4). In Section 5, we discuss a way of constructing membership functions of fuzzy sets that are a direct product of the homogeneous data regions defined by the designer in the SOM. The experimental part of the study, a comprehensive development of the descriptive model for the ECG data, is covered in Section 6. Section 7 moves the ECG data analysis further by elaborating on the links between the descriptive model and the ensuing predictive models. Conclusions are covered in Section 8.

2. Material—a description of the experimental ECG data

The experimental environment exploits the CORDA database, developed by Jos Willems at the Medical Informatics Department of the University of Leuven [22], which consists of 3253 standard 12-lead ECGs (2140 men and 1113 women, with a mean age of 49 ± 12 years). The 12 standard leads are I, II, III, AVR, AVL, AVF, V1, V2, V3, V4, V5, and V6. The database contains only single-disease (clinically validated) cases with normal QRS duration and no conduction abnormality. Seven diagnostic classes have been considered: normal (N), left ventricular hypertrophy (LVH), right ventricular hypertrophy (RVH), biventricular hypertrophy (BVH), inferior (IMI), anterior (AMI), and combined (MIX) myocardial infarction. In the study, we will also be using the numbers of the classes, namely:

N: class 1
LVH: class 2
RVH: class 3
BVH: class 4
IMI: class 5
AMI: class 6
MIX: class 7

From the original ECG signal (12 standard leads acquired at 500 Hz for a period of about 10 s), a set of 540 primary measurements (45 for each lead) was computed with a computerized system, obtaining a first consistent data reduction. A second data reduction, according to a clinical selection and a statistical selection, was then performed, obtaining a set of 39 ECG features. They include the amplitudes and durations of the QRS and T waves, the QRS and T axes, ST-segment elevation or depression, and the areas under the QRS and T waves. A complete list of these features is included in Table 1. The same data set has been used to establish the performance of statistical classification models and to validate the performance of different architectures of neural networks [3,4]. In the previous works on supervised learning, a set of 2446 records was randomly selected as the learning set, and the remaining 807 records served as the testing set.


Table 1
A list of features (variables) of the ECG signal used in the study

1. Age
2. Gender
3. Angle of initial 40 ms QRS axis in frontal plane (sine part)
4. Angle of initial 40 ms QRS axis in frontal plane (cosine part)
5. Angle of QRS axis in frontal plane (sine part)
6. Area under QRS complex in lead I
7. ST segment elevation (end point) in lead I
8. Area under T wave in lead I
9. Peak-to-peak amplitude of QRS complex in lead II
10. ST segment elevation (J point) in lead II
11. Area under QRS complex in lead III
12. Peak-to-peak amplitude of QRS complex in lead AVR
13. ST segment elevation in lead AVR
14. Area under T wave in lead AVR
15. Peak-to-peak amplitude of QRS complex in lead AVL
16. ST segment elevation (end point) in lead AVL
17. Q duration in lead AVF
18. Area under T wave in lead AVF
19. Q duration in lead V1
20. Slope of ST segment in lead V2
21. Q amplitude in lead V3
22. Q duration in lead V3
23. Peak-to-peak amplitude of QRS complex in lead V3
24. T amplitude in lead V3
25. Area under T wave in lead V4
26. Peak-to-peak amplitude of QRS complex in lead V5
27. ST segment elevation (end point) in lead V6
28. Slope of ST segment in lead V6
29. Max R amplitude in V3, V5, V5 + max |S amplitude| in V1, V2, V3, V4
30. Max R amplitude in V2 / max |S amplitude| in V3
31. Progression of S amplitude from V1 to V3
32. R amplitude in lead AVL
33. R amplitude in lead V1
34. R amplitude in lead V5
35. R amplitude in lead V6
36. R duration in lead AVL
37. R duration in lead V2
38. R duration in lead V6
39. T amplitude in lead V6

3. Self-organizing maps—an insight into the structure of data

The concept of a SOM was originally coined by Kohonen [11–13] and is currently used as one of the generic neural tools for structure discovery and visualization in many areas of application, including ECG signal analysis [15]. As usually reported in the literature, SOMs are regarded as regular neural structures (neural networks) composed of a rectangular (square) grid of artificial neurons. The intent of SOMs is to visualize highly


dimensional data in a low-dimensional structure, usually emerging in the form of a two- or three-dimensional map. To make this visualization meaningful, the ultimate requirement is that such a low-dimensional representation of the originally high-dimensional data has to preserve the topological properties of the data set. In a nutshell, this means that two data points (patterns) that are close to each other in the original highly dimensional feature space should retain this similarity (or resemblance) when it comes to their representation (mapping) in the reduced, low-dimensional space in which they are visualized; reciprocally, two distant patterns in the original feature space should retain their distant location in the low-dimensional space. Being more descriptive, a SOM performs as a computer eye that helps us gain insight into the structure of the data set and observe relationships occurring between the patterns originally located in a highly dimensional space. In this way, we can confine ourselves to the two-dimensional map that helps us witness all the essential relationships between the data as well as the dependencies between the features themselves. In spite of the existing variations, the generic SOM architecture (as well as the learning algorithm) remains basically the same. Below, we summarize the essence of the underlying self-organization algorithm, which realizes a certain form of unsupervised learning. Before proceeding with the detailed computations, we introduce all the necessary notation. The 'n' features of the patterns (data) are organized in a vector x of real numbers located in the n-dimensional space of real numbers, Rn. The SOM comes as a collection of linear neurons organized in the form of a regular two-dimensional grid (array), Fig. 1. In general, the grid may include 'p' rows and 'r' columns; quite commonly, we confine ourselves to a square array of p × p elements (neurons).
Each neuron is equipped with modifiable connections w(i, j), where the successive coordinates are arranged into an n-dimensional vector, namely w(i, j) = (w1(i, j), w2(i, j), ..., wn(i, j)). The terms connections and weights are used interchangeably in the literature, and we will be following this style in the ensuing discussion. The two indexes (i and j) identify the location of the neuron on the grid, see Fig. 1. The neuron calculates a distance (d) between its connections and a given input x:

y(i, j) = d(w(i, j), x)    (1)

Fig. 1. A basic topology of the self-organizing map constructed as a grid of identical processing units (neurons): connections of the (i, j)-th node.

The same input x affects all the neurons. The neuron with the shortest distance between the input and its connections becomes activated to the highest extent and is declared the winning neuron; we also say that it has matched the given input (x). Let us denote its coordinates by (i0, j0). More precisely, we have

(i0, j0) = arg min(i, j) d(w(i, j), x)    (2)

The above expression captures what has been said earlier: we calculate the minimum of the distances, min(i, j) d(w(i, j), x), and then determine the arguments (the coordinates of the neuron); this operation is denoted by 'arg'. As the winner of this competition, we reward the neuron and allow it to modify its connections so that they get even closer to the input data. The update mechanism is governed by the expression

w_new(i0, j0) = w(i0, j0) + α(x − w(i0, j0))    (3)

where α denotes a positive learning rate, α > 0. The higher the learning rate, the more intensive the updates of the connections. In addition to changing the connections of the winning node (neuron), we allow this neuron to affect its neighbors (viz. the neurons located at similar coordinates of the map). The way in which this influence is quantified is expressed via a neighbor function F(i, j, i0, j0). In general, this function satisfies two intuitively appealing conditions: (1) it attains a maximum equal to one for the winning node, i = i0, j = j0; and (2) when the node is apart from the winning node, the value of the function gets lower (in other words, the updates are less vigorous). Evidently, there are also nodes where the neighbor function zeroes. Considering the above, we rewrite Eq. (3) in the following form:

w_new(i, j) = w(i, j) + α F(i, j, i0, j0)(x − w(i, j))    (4)

Commonly, we use the neighbor function in the form

F(i, j, i0, j0) = exp(−β((i − i0)^2 + (j − j0)^2))    (5)

where the parameter β (equal to 0.1 or 0.05, depending upon the series of experiments) models the spread of the neighbor function. The above update expression (Eq. (4)) applies to all the nodes (i, j) of the map. As we iterate (update) the connections, the neighbor function shrinks: at the beginning of the updates we start with a large region of updates, and as the learning settles down, we reduce the size of the neighborhood. For instance, one may think of a linear decrease of its size. The number of iterations is either specified in advance, or the learning terminates once there are no significant changes in the connections of the neurons. The distance d(x, w) can be defined in many different ways. A general class worth considering is the Minkowski distance. Practically, three representatives of this family are commonly used, that is, the Hamming, Euclidean, and Tchebyschev distances. From the experimental end, the choice


between these three distances is not critical, and the Euclidean distance is a legitimate choice. Dealing with raw measures poses the risk that one feature may become predominant simply because its domain includes larger numbers (that is, the range of the feature is high). Therefore, the distance function is computed for normalized rather than raw data. In the sequel, the SOM exploits these transformed features. Two common ways of normalization are usually pursued: linear and statistical normalization. In linear normalization, the original variable is normalized to the unit interval [0, 1] via a simple linear transformation

x_norm = (x_ori − x_min) / (x_max − x_min)    (6)

where x_min and x_max are the minimal and maximal values of the variable encountered in the data. When observing the activity of the individual neurons in the grid, some of them may be excessively 'active' and win most of the time, while the other neurons tend to become 'idle'. This uneven activity pattern is undesired and should be avoided. In order to promote more even activity across the network, we make the learning frequency-sensitive by penalizing the frequently winning nodes, increasing the distance function between the patterns (inputs) and the connections of the winning node. For instance, instead of the original distance d(x, w), we use the expression (1 + ε)·d(x, w), where ε is a positive constant modeling the effect of the intentionally increased distance between x and w. The higher the value of ε, the more substantial the increase in the effective distance between the pattern and the neuron. When designing a SOM, the essential design parameters are: the size of the map, the initial learning rate and its temporal decay, the type of distance function, and the form of data normalization. Overall, the developed SOM is fully characterized by the matrix of connections of its neurons, that is, W = [w(i, j)], i = 1, 2, ..., p; j = 1, 2, ..., p (note that we are now dealing with a square p × p grid of neurons). The simplest visualization scenario one can envision is to map the original data onto the map; in this manner, we get a certain insight into the structure of the data in the highly dimensional space. For instance, we can state that x(1) and x(6) are similar because they 'activate' two neighboring neurons on the map. The visualization of the relative positions of the patterns is a main advantage of the SOM. Moreover, by a careful arrangement of the weight matrix into several planes (arrays), we can produce a variety of important views of the data. We introduce such concepts as the weight, region (cluster), and data density maps. While, in general, these notions are not new, as some similar concepts have been used in the literature, we introduce them here as a comprehensive design vehicle.
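To make the mechanics of Eqs. (2)-(6) and the frequency-sensitive penalty concrete, the following is a minimal NumPy sketch. The grid size, parameter values, decay schedule, and the per-win compounding of the (1 + ε) penalty are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def linear_normalize(X):
    """Eq. (6): rescale every feature (column) of X to [0, 1]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)

def train_som(data, p=10, epochs=100, alpha=0.5, beta=0.1, eps=0.05, seed=0):
    """Minimal SOM: Euclidean winner selection (Eq. (2)) with a
    frequency-sensitive penalty, a Gaussian neighbor function (Eq. (5)),
    and the update rule of Eq. (4) with a linearly decaying rate."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    w = rng.random((p, p, n))                 # p x p grid of vectors w(i, j)
    wins = np.zeros((p, p))                   # win counts per neuron
    ii, jj = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    for epoch in range(epochs):
        a = alpha * (1.0 - epoch / epochs)    # decaying learning rate
        for x in data:
            d = np.linalg.norm(w - x, axis=2)
            # penalize frequent winners: effective distance grows with wins
            d_eff = d * (1.0 + eps) ** wins
            i0, j0 = np.unravel_index(np.argmin(d_eff), d_eff.shape)
            wins[i0, j0] += 1
            # Eq. (5): neighbor function F(i, j, i0, j0)
            F = np.exp(-beta * ((ii - i0) ** 2 + (jj - j0) ** 2))
            # Eq. (4): move all nodes toward x, scaled by F
            w += a * F[:, :, None] * (x - w)
    return w
```

The data would be passed through `linear_normalize` first, so that no single feature dominates the Euclidean distance.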

4. Associated self-organizing maps

The associated maps come as a result of a different organization of W or some slight modifications of the original connections.


Fig. 2. A concept of associated SOM maps: region and weight maps.

4.1. Weight maps

Obviously, the weight matrix W can be viewed as a pile of p × p layers (maps) indexed by the variables, see Fig. 2. That is, we regard W as a collection of two-dimensional matrices, each corresponding to a certain feature of the patterns, say

[w1(i, j)], [w2(i, j)], ..., [wn(i, j)]    (7)

Each of these matrices contains information about the weights (or the features of the patterns, as the weights tend to follow the features once the self-organization has been completed). The most useful information we can get from these weight maps concerns the identification of possible associations between the features. If two weight maps are very similar, this implies that the two features they represent are highly related. If two weight maps are very dissimilar, the two variables they represent are not closely interrelated. In addition, we can also determine feature associations for a data subset. For example, two weight maps can be very similar in the upper-right corner but very dissimilar in other areas of the map; this means that only the data located in the upper-right corner are highly related. One should note that the identification of relationships is carried out in a visual mode, and we do not allude to any measure of association such as a correlation coefficient. Nevertheless, this aspect is highly supportive in descriptive data analysis and helps the designer understand the essence of the relationships between the features. In the sequel, it may lead to the identification of possible redundancies among the features (e.g. we state that two features are redundant if their weight maps are very close to each other).

4.2. Region (clustering) map

A slight transformation (summarization) of the original map W allows us to visualize homogeneous regions in the map, viz. the regions in which the data are very similar.


Furthermore, we should be able to form boundaries between such homogeneous regions of the map. Owing to the character of this transformation, we will refer to the resulting areas as clusters and call the map a region (or clustering) map. The calculations leading to the region map are straightforward: for each location of the map, say (i, j), we compute the distances between its weight vector and those of its closest neighbors, such as (i − 1, j), (i, j − 1), (i, j + 1), etc., that is

d(w(i, j), w(i − 1, j)), d(w(i, j), w(i, j + 1)), ...    (8)

and take the median of these distances, med_I[D(i, j)], where I is the neighborhood of this particular location of the map. We regard this as a measure of homogeneity of the nearest neighborhood of the (i, j) location of the map. At the visual end, we assign a certain level of brightness to each of these results, which gives us a useful vehicle for identifying regions of the map that are highly homogeneous. Likewise, the entries with a dark color form a boundary between the homogeneous regions. Clusters can be easily identified by finding bright areas surrounded by these dark boundaries. For some data sets there are distinct clusters, so in the clustering map the dark boundaries are clearly visible. There could be cases where the data are inherently scattered, so in the clustering map we may not see clear dark boundaries. In Fig. 3, the region (clustering) map reveals two clusters: one quite extensive that covers the upper part of the map, and a smaller one. The clustering map is an important vehicle for a visual inspection of the structure of the data. It delivers strong support for descriptive modeling: the designer can easily understand what the structure looks like in terms of clusters. In particular, one can analyze the size of the clusters and their location in the map (which tells about closeness and possible linkages between the clusters). By looking at the boundaries between the clusters, the region map tells us how strongly these clusters are identified as separate entities distinct from each other. Overall, we can look at the region map as a granular signature of the data. These visualization aspects of SOMs underline their character as a user-friendly vehicle of descriptive data analysis. In this context, it also points to the essential differences between SOMs and other popular clustering techniques driven by objective function

Fig. 3. An example data density map. The darker neurons identify groups of higher data density. These regions should be analyzed in conjunction with the region (clustering) map.


minimization (say, FCM and the like). Note that while FCM solves an interesting and well-defined optimization problem, it does not provide the same interactive environment for data analysis.

4.3. Data distribution map

The previous maps were formed directly from the general map W produced through self-organization. It is advantageous to supplement all these maps with a data distribution (density) map. This map shows (again on a certain brightness scale) the distribution of the data as they are allocated to the individual neurons of the map. Following the assumed visualization scheme, the darker the color of the neuron, the more patterns invoked the neuron as the winning one, see Fig. 3. The data density map can be used in conjunction with the region map, as it helps us indicate how many patterns are behind a given cluster. In this sense, we may eventually abandon a certain cluster in our descriptive analysis as not carrying enough experimental evidence. Moreover, the data density map helps reveal some evident learning problems related to a few frequently winning nodes (neurons); to alleviate this discrepancy, we introduce a frequency-sensitivity component into the learning process. Overall, the sequence realized so far can be described as follows: highly dimensional data → SOM → region, data density, and feature maps → interactive user-driven descriptive analysis. To highlight and exemplify the way in which the self-organizing map is used in data analysis, below we show a few snapshots illustrating how the software operates and what type of user interaction is involved. The developed software (C++, running on a PC platform) is user-friendly, with significant interaction facilitated by the graphical user interface. Not only does this interface support a visualization environment, but it also helps the developer make decisions as to finding the structure in the data set and interact with the data in this process.
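The region (clustering) map of Eq. (8) and the data density map can be sketched as follows. This is a minimal illustration assuming a 4-connected neighborhood and omitting the brightness scaling; it is not the authors' implementation.

```python
import numpy as np

def region_map(w):
    """Eq. (8): for each node, the median distance between its weight
    vector and those of its 4-connected neighbors. Low values indicate
    homogeneous (bright) regions; high values indicate cluster
    boundaries (dark)."""
    p = w.shape[0]
    out = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ds = [np.linalg.norm(w[i, j] - w[a, b])
                  for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                  if 0 <= a < p and 0 <= b < p]
            out[i, j] = np.median(ds)
    return out

def density_map(w, data):
    """Data distribution map: count how many patterns pick each neuron
    as their winner."""
    p = w.shape[0]
    counts = np.zeros((p, p), dtype=int)
    for x in data:
        d = np.linalg.norm(w - x, axis=2)
        counts[np.unravel_index(np.argmin(d), d.shape)] += 1
    return counts
```

In practice, both arrays would be rendered on a brightness scale, with bright homogeneous areas bounded by dark entries in the region map and darker cells marking frequently winning neurons in the density map.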
The development environment provides the user with some generic statistical characterization of the data, Fig. 4(a), and guides the user through a detailed setup of the self-organization process (involving all the necessary details, such as the size of the map, normalization schemes, the type of distance function, and the number of learning epochs), Fig. 4(b). During learning, the map being formed is continuously updated, Fig. 5. Finally, the analyst can highlight (select) any region on the map and look at the corresponding subset of experimental data. The regions on the map are identified through a visual inspection. As one can immediately look at the corresponding data, the overall process becomes highly interactive. Obviously, such regions could be formed in an automatic fashion (as a matter of fact, SOMs are just images, and there are many image processing tools for edge detection and contour forming that could be found useful here). This option has not been pursued, as it would be too restrictive and biased toward a specific algorithm of edge detection and edge tracking. It is worth stressing that when running the same data set through a self-organizing map, we may end up with a different configuration of the regions. This is not essential, as the regions themselves are quite repeatable in terms of their data content as well as their mutual distribution on the map.


Fig. 4. Snapshots of the software supporting the development of SOM: (a) initial data analysis involving histograms and correlation between features; and (b) setting up a map and specifying details of the training process.


Fig. 5. A format of some results of SOM: clustering map showing homogeneous regions of the map and a data distribution map visualizing a distribution of data across the neurons of the map.

In the sequel, we discuss how the regions (clustering areas) identified in the map can be described in terms of information granules—fuzzy sets.

5. An estimation of membership functions

There are a number of existing methods of membership function estimation whose origins stem from different ways of interpreting fuzzy sets, cf. [10,16,22]. In what follows, we are concerned with the notion of membership cast in the framework of pattern recognition. We also share the opinion that membership grades should relate in some way to experimental data. We start with experimental data X = {x(k)}, k = 1, 2, ..., N, belonging to several classes, where the degree of belonging of x(k) to any class is binary (that is, we adopt a yes-no class assignment). The dominant class is determined, say ω0, and the intent is to compute a membership function A of the concept (feature) describing ω0. More specifically, A is a fuzzy set (linguistic term) with which we intend to capture a collection of numeric data. As we have classes as a part of the problem, the emergence of A is cast in the setting of the classification problem; namely, A is built as a specific entity capturing the semantics of patterns belonging to a given class. Without any loss of generality, we consider a two-class problem, meaning that we distinguish between the data points


belonging to the most frequent class we are interested in (ω0) and another class ω1 that may represent all the remaining classes. The underlying idea is to assign the high membership grade (1) to the regions where there are only patterns belonging to ω0. When moving to the regions where we encounter some patterns belonging to ω1, we gradually start reducing the corresponding membership grades. Furthermore, we consider the membership function to be unimodal and distributed around a prototypical value of the concept governing the dominant class ω0. The procedure outlined can be formalized as a two-step algorithm:

• determination of the prototypical value of the fuzzy set;
• determination of the membership grades assumed by the fuzzy set around the prototype, based on the experimental data.

The prototypical value can be found in many different ways (say, as a median or a mean). The simplest one is to take the mean value of the elements of X that belong to ω0, that is

m = (Σ_{k: x(k) ∈ ω0} x(k)) / card{x(k) | x(k) ∈ ω0}    (9)

The determination of the membership grades at the other points is guided by the following rationale; refer also to Fig. 6. The calculations are carried out separately for the data points to the left and to the right of the mean value. Furthermore, we order the elements larger than the prototype in increasing order; the elements lower than the prototypical value are ordered in decreasing fashion. We start moving from the prototype up towards the higher values of the data with the initial membership grade equal to 1, A(m) = 1. In this move, two cases may occur: (1) either the next data point belongs to ω0, or (2) it belongs to ω1. In the first case, we maintain the previous membership grade. If the point belongs to ω1, we assign a lower membership grade whose value is computed as

A(x(k)) = max(0, A(x(k − 1)) − δ)    (10)

Fig. 6. Computing the membership function of A with the use of the overshadowing principle; note that a point belonging to ω1 overshadows other elements belonging to ω0 that are located more distantly from m.


G. Bortolan, W. Pedrycz / Artificial Intelligence in Medicine 24 (2002) 109–132

where
$$\delta = 1 - \exp\!\left(-\frac{x(k) - m}{x_{\max} - m}\right) \qquad (11)$$

and x_max is the largest element in the data set. This means that if this new point x(k) is close to the prototype, the reduction in membership value (δ) becomes substantial. Interestingly, the existence of elements that belong to the other class leads to an irreversible drop in the membership values of the fuzzy set. Note that for the next data points, say x(k+1), x(k+2), that may belong to the dominant class, we end up having the lower membership grade. In other words, the data point belonging to ω1 overshadows the remaining larger data points in the sequence. Overall, the calculations of the membership values can be succinctly described by the expression
$$A(x(k+1)) = \begin{cases} A(x(k)), & \text{if } x(k+1)\in\omega_0 \\ A(x(k)) - \delta, & \text{if } x(k+1)\in\omega_1 \end{cases} \qquad (12)$$
The computation of the membership grades for the data points lower than the prototype (m) is carried out in an analogous manner. The only difference is that, now, we order these data points in a decreasing order. The membership function determined in this way exhibits a step-wise effect, meaning that the changes of membership grades are confined to the discrete data points. One can approximate it by a continuous membership function (linear, parabolic, Gaussian, etc.) fitted to these specific values.
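The two-step construction of Eqs. (9)–(12) can be sketched compactly as follows. The function name is ours, and the normalization of δ on the left-hand side of the prototype (by m − x_min, mirroring Eq. (11)) is our assumption, since the text only says the left sweep is analogous.

```python
import math

def overshadow_membership(points, dominant=0):
    """points: list of (value, class_label) pairs; returns {value: grade} for A."""
    # Step 1: prototype m = mean of the values of the dominant class (Eq. 9)
    xs0 = [x for x, c in points if c == dominant]
    m = sum(xs0) / len(xs0)
    x_max = max(x for x, _ in points)
    x_min = min(x for x, _ in points)
    A = {m: 1.0}
    # Step 2a: sweep to the right of the prototype in increasing order
    grade = 1.0
    for x, c in sorted(p for p in points if p[0] > m):
        if c != dominant:
            delta = 1.0 - math.exp(-(x - m) / (x_max - m))  # Eq. (11)
            grade = max(0.0, grade - delta)                 # Eqs. (10), (12)
        A[x] = grade
    # Step 2b: sweep to the left in decreasing order (mirrored normalization)
    grade = 1.0
    for x, c in sorted((p for p in points if p[0] < m), reverse=True):
        if c != dominant:
            delta = 1.0 - math.exp(-(m - x) / (m - x_min))
            grade = max(0.0, grade - delta)
        A[x] = grade
    return A
```

On a toy set such as `[(0.0, 0), (0.2, 0), (0.5, 1), (0.7, 0)]`, the foreign point at 0.5 lowers the grade there, and the dominant-class point at 0.7 inherits that lowered grade, which is exactly the overshadowing effect of Fig. 6.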

6. The granular analysis of the ECG data

Prior to any analysis, it is worth mentioning a certain hierarchy of the classes of the signals that could be helpful in understanding the results of self-organization. The diagnostic class of BVH includes both LVH and RVH, and consequently the three classes BVH, LVH and RVH are not completely independent. This means that a classification of LVH + RVH is equivalent to BVH, and that a BVH patient classified as only LVH or only RVH represents a partial discrepancy. Analogous considerations are valid for the diagnostic class of combined myocardial infarction MIX with respect to AMI and IMI. We consider a data set consisting of 2446 ECG patterns with the following composition: 402 normal, 417 LVH, 243 RVH, 324 BVH, 295 AMI, 489 IMI and 276 MIX (refer to the detailed description in Section 2). The patterns were linearly normalized, i.e. all features were transformed by a linear mapping. Several sizes of SOMs were tried. We found that a map of size 35 × 35 was detailed enough to reveal the essential dependencies between the classes. The training was carried out for 12,000 learning epochs with the initial radius equal to seven nodes. The resulting map is shown in Fig. 7. Furthermore, the data distribution map was found to be populated fairly uniformly, as clearly shown in Fig. 8. Subsequently, the maps of the individual features are contained in Fig. 9. From these maps, we immediately gain a qualitative picture as to the relationships between individual features. For instance, we note that the features 32 (R amplitude in aVL) and 33 (R amplitude in V1) are very similar (as the corresponding maps look very


Fig. 7. A result of self-organization of data shown in the form of the SOM with several clearly delineated regions.

similar) as well as 34 (R amplitude in V5) and 35 (R amplitude in V6). Similarly, features 21 (Q amplitude in V3) and 22 (Q duration in V3) are very similar in the extreme values. Some other features, such as 2 (gender), 19 (Q duration in V1), 22 (Q duration in V3), and 30 (ratio of max(R, R′) amplitude and max(S, S′) amplitude in V2), are very different (we say that they are very unrelated). Next, we identify homogeneous regions in the map, Fig. 10. This is done by a designer in an interactive mode, namely the designer identifies the entries of the map that are homogeneous and then checks if most of them belong to the same class. By monitoring

Fig. 8. Data distribution map (note a homogeneous density of the patterns across the map).


Fig. 9. SOMs of the individual features existing in the problem.

the distribution of the patterns, the region is either expanded (if most patterns belong to the same class) or reduced (when there are patterns belonging to several classes and we need to make the region more homogeneous). The distribution of classes in each region is summarized in Table 2.
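The SOM mechanics assumed in this section, competitive selection of a best-matching node and a neighborhood whose radius shrinks over the training run, can be sketched in miniature. Everything below (grid size, linear decay schedules, function name) is our own illustrative choice, not the 35 × 35 configuration used in the study.

```python
import math, random

def train_som(data, size=5, epochs=200, radius0=2.0, lr0=0.5, seed=0):
    """Train a size x size map of weight vectors on a list of feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[[rng.random() for _ in range(dim)] for _ in range(size)]
         for _ in range(size)]
    for t in range(epochs):
        frac = t / epochs
        radius = radius0 * (1.0 - frac) + 0.5   # linearly shrinking radius
        lr = lr0 * (1.0 - frac) + 0.01          # decaying learning rate
        x = rng.choice(data)
        # best-matching unit = node with minimal squared distance to x
        bi, bj = min(((i, j) for i in range(size) for j in range(size)),
                     key=lambda ij: sum((a - b) ** 2
                                        for a, b in zip(w[ij[0]][ij[1]], x)))
        # pull the BMU and (more weakly) its grid neighbors towards x
        for i in range(size):
            for j in range(size):
                d2 = (i - bi) ** 2 + (j - bj) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))  # Gaussian neighborhood
                w[i][j] = [a + lr * h * (b - a) for a, b in zip(w[i][j], x)]
    return w
```

Training on two toy 2-D clusters pulls some node close to each of them; this topology-preserving placement is what the interactive region-identification step relies on.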


Fig. 10. Identification of homogeneous regions in the SOM.

The total number of the patterns occurring in these regions is equal to 930, which amounts to 38% of all data.

Table 2
A description of homogeneous regions identified in the SOM

Region   Class 1     2     3     4     5     6     7   Comments
1              1                       7   245    49   Class 6 with some elements from class 7
2                                     43     8    61   A mixture of classes 7 and 5
3            117    38     4     2                     Class 1
4             84     6    12     5     1               Class 1
5              6    10    47    21     1               Class 3 with elements from class 4
6                    2           1    41    34    18   Classes 5 and 6
7                    7     8    22     1               Class 4
8                   26                       1     1   Class 2

There are immediate and important observations one can make on the basis of a visual inspection of the self-organizing map (especially the region map and the maps of the individual features).
• The homogeneous regions in the SOM, their size and location vis-à-vis other regions help identify relationships between the classes of ECG signals. In particular, a region (cluster) capturing normal signals (region R3 in Fig. 11) is quite compact and shows a high level of homogeneity. The region denoted by R1 (that involves AMI) is quite extended and quite distant from other regions. Similarly, region R2 (that captures a mixture of IMI and MIX) is apart from the other regions and occupies the upper corner of the map. A very different behavior can be observed for the three other regions, that is R5, R8, and R7. These are close neighbors and all of them


Fig. 11. Discriminatory properties of the features of the ECG signal: (a) the use of minimum and maximum operations; and (b) the use of the product operation and probabilistic sum.


capture the hypertrophy classes (LVH, RVH and BVH), but in a different mix. When moving along the map and starting from the first one (E), there is an evident mix of RVH and BVH. In the sequel, the next group (H) is dominated by LVH, while the group identified as G is dominated by BVH with some RVH.
• The identification of the groups in the map can be viewed as a descriptive data analysis with an ultimate goal to capture the essence of the data. In this case we are interested in building concise and homogeneous descriptors of the ECG classes. The map tells us what is most likely as to the occurrence of 'plain' or mixed classes of patterns. Obviously, it is easy to describe (and discriminate) the class of normal signals (N) from the others, while discriminating between classes IMI and MIX (as shown in region B) will be a difficult task (no matter what classifiers we are interested in). It is easy to discriminate class LVH when dealing with region H; however, doing the same for region E (where there is an evident mix of RVH and BVH) will be a significant classification challenge.

It is possible to make some considerations on the basis of accuracy, interpreting the SOM from the point of view of a classifier. The eight regions, as already observed, account for 38% of all the records in the database. In addition to the fact that the seven diagnostic classes in the database are not equidistributed, the classes involved in this selection are not uniform in relative terms either, with a predominance of normal patients or patients with myocardial infarction. In fact, there are 208 records out of 402 of class 1 (51.7%), 211 records out of 984 of classes 2-4 (21.4%), and 511 records out of 1060 of classes 5-7 (48.2%). This fact may influence the accuracy of the eight regions. At this point, labeling each group with the class that has the maximum number of elements, it is possible to evaluate the accuracy of the eight different regions, obtaining Table 3.
This analysis is shown here as a prerequisite useful for further refinement of the classifier. In Table 2, some regions show a low accuracy, but it is possible to observe (Table 1) that such regions contain a mixture of patients with 'related' diagnoses. In order to point out this aspect, we reclassify the eight regions of Table 2 considering groups of 'related' diagnostic classes, and groups of homogeneous regions.

Table 3
Accuracy of the regions identified by the SOM (each region is labeled with the class that exhibits the maximum number of elements)

Region   Class with maximum number of elements   Accuracy (%)
1-A      6                                       81.1
2-B      7                                       54.5
3-C      1                                       72.7
4-D      1                                       77.8
5-E      3                                       55.3
6-F      5                                       42.7
7-G      4                                       57.9
8-H      2                                       92.9
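The labeling rule behind Table 3 amounts to a one-liner: pick the class with the maximum count in a region and report its share of the region's patterns. The helper name below is ours; the counts in the example are those of region 1-A.

```python
def label_region(counts):
    """counts: dict mapping class label -> number of patterns in the region.

    Returns the majority class and its share of the region as a percentage.
    """
    cls = max(counts, key=counts.get)          # class with maximum count
    acc = 100.0 * counts[cls] / sum(counts.values())
    return cls, acc
```

For region 1-A (counts 1, 7, 245 and 49 for classes 1, 5, 6 and 7) this yields class 6 with 81.1%, the first row of Table 3.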


Table 4
Description of groups of homogeneous regions

Group of regions   Class 1   Classes 2+3+4   Classes 5+6+7
A+B+F              1         3               506
C+D                201       67              1
E+G+H              6         141             4
For example, the regions A, B and F are connected mainly with patients with myocardial infarction and can therefore be considered together. Similarly, regions E, G and H are characterized mainly by patients with ventricular hypertrophy, and regions C and D by normal patients, suggesting a common arrangement. It is possible to derive the same arrangement directly from Table 3 by considering the three enlarged classes. In addition, there is also a spatial relationship among these three groups. In fact, it is possible to observe from Fig. 8 that the regions C + D are close and spatially connected, as well as regions

Fig. 12. Feature number 39 (T amplitude in aVF) with the highest discriminative properties (x coordinate is normalized, originally the variable is expressed in millivolts).


E + G + H, whereas regions A + B + F have a different spatial connection, being on the border of the map. With this rearrangement, Table 4 describes the groups of homogeneous regions. This means that group A + B + F has a partial accuracy of 99.2% with respect to myocardial infarction, group E + G + H has a partial accuracy of 93.4% with respect to ventricular hypertrophy, and group C + D has an accuracy of 74.7% with respect to normal subjects.
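The partial accuracies quoted above can be recomputed directly from the Table 4 counts; the dominant enlarged class of each group is simply its maximum entry. The dictionary layout and function name are ours.

```python
# Counts per enlarged class, as in Table 4: (class 1, classes 2+3+4, classes 5+6+7)
groups = {
    "A+B+F": (1, 3, 506),     # dominated by myocardial infarction (5+6+7)
    "C+D":   (201, 67, 1),    # dominated by normal subjects (class 1)
    "E+G+H": (6, 141, 4),     # dominated by ventricular hypertrophy (2+3+4)
}

def partial_accuracy(counts):
    """Share of the dominant enlarged class in the group, in percent."""
    return 100.0 * max(counts) / sum(counts)
```

Evaluating `partial_accuracy` on the three groups reproduces the 99.2%, 74.7% and 93.4% figures, respectively.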

7. Feature selection

Once a collection of information granules has been formed, their location, and especially their level of overlap, serves as a measure of discrimination of a given feature. Consider c fuzzy sets A1, A2, ..., Ac defined for a given feature F assuming values in (a, b). For any two fuzzy sets, Ai and Aj, we compute the expression
$$\varphi_{ij} = \frac{1}{p}\sum_{k=1}^{p}\left(1 - \frac{\min(A_i(x_k), A_j(x_k))}{\max(A_i(x_k), A_j(x_k))}\right) \qquad (13)$$
where p denotes the number of discrete points in (a, b). Note that if these two fuzzy sets are disjoint, the above expression produces 1. If they start overlapping, meaning that their

Fig. 13. Membership functions of the feature number 22 (Q duration in V3) (x coordinate is normalized, originally the variable is expressed in milliseconds).


discriminative power gets lower, the expression attains lower values. In the limit, for two fuzzy sets being almost equal, we end up with the value of the expression equal to 0. The above expression is the same as the one introduced in [6]. Next, we take all the distinct pairs of fuzzy sets and compute the expression
$$\mathrm{discr}(F) = \frac{1}{c}\sum_{i=1}^{c}\sum_{j>i}\varphi_{ij} \qquad (14)$$

which is regarded as an overall measure of the discriminative power of the feature, discr(F). Based on the values of the above index, we say that F is more discriminant than G if discr(F) > discr(G). It is worth stressing that the min and max operations can be generalized; instead of them we may use a general class of t- and s-norms, cf. [10,16]. It becomes evident that there exists a set of features whose discriminatory properties are very much limited (e.g. feature no. 2 (gender) and feature no. 19 (Q duration in V1); see Figs. 9 and 11). An identification of these features could be useful in the formation of a new, reduced feature space. On the other hand, there are several strongly discriminant features. Feature no. 39 (that is, T amplitude in aVF) is one among them. Subsequently, the membership functions of the fuzzy sets defined in this space are shown in Fig. 12. Most of the fuzzy sets arising there are quite disjoint.
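Eqs. (13) and (14) translate directly into code. One convention below is ours: sampling points where both memberships vanish are skipped, so the 0/0 ratio is never formed and two disjoint sets still score exactly 1.

```python
def phi(Ai, Aj):
    """Pairwise disjointness of two sampled fuzzy sets (Eq. 13).

    Ai, Aj: membership values sampled at the same p points of the feature
    range; returns 1 for disjoint sets, 0 for identical ones.
    """
    terms = [1.0 - min(a, b) / max(a, b)
             for a, b in zip(Ai, Aj)
             if max(a, b) > 0]        # skip points where both memberships are 0
    return sum(terms) / len(terms) if terms else 0.0

def discr(fuzzy_sets):
    """Overall discriminative power of a feature (Eq. 14)."""
    c = len(fuzzy_sets)
    return sum(phi(fuzzy_sets[i], fuzzy_sets[j])
               for i in range(c) for j in range(i + 1, c)) / c
```

As the text notes, min and max here could be replaced by any t-norm and s-norm; the functions above use the plain min/max pair of Eq. (13).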

Fig. 14. Membership functions of one of the weakest features (from the standpoint of discriminatory properties): note a significant overlap occurring between fuzzy sets (x coordinate is a normalized ratio).


The other feature (feature no. 22, that is, the Q duration in V3) is less discriminative (as seen in Fig. 11) and the resulting fuzzy sets are shown in Fig. 13. An example of one of the weakest features, feature no. 30 (ratio of max(R, R′) amplitude and max(S, S′) amplitude in V2), is shown in Fig. 14. Apparently, the membership functions overlap quite substantially, which is a qualitative manifestation of the lack of discriminative properties.

8. Concluding remarks

In this study, we have developed a user-centered environment for the analysis of ECG data (patterns). This environment arises around the well-known concept of SOMs. The interpretation of the maps carried out by a designer leads to a discovery of the structure in the collection of the ECG patterns. We have shown that SOMs equipped with the interpretation mechanisms introduced here form a versatile framework of data analysis and as such can be quite universal and highly useful when it comes to a better understanding of the relationships between variables (features). The size of the regions of the map as well as their homogeneity can serve as useful indicators for a further, more refined design of the classifiers. In this sense, we can project a level of difficulty when developing pattern classifiers for specific classes or even subclasses of the patterns. This information can guide the designer with regard to the type of the classifier (e.g. linear classifier, nearest neighbor, neural network). In the context of this study, it is worth emphasizing that we are concerned with the descriptive fuzzy modeling of data. The objective is to translate findings into terms (that is, information granules expressed as fuzzy sets) that are meaningful to the designer/user. We are after the transparency of the constructs (and their relevance quantified through sufficient experimental evidence). Some possible ways of making the process of describing self-organizing maps (their regions) automatic are worth pursuing in aid of the human-centered way of interpretation discussed in this study.

References

[1] Anderberg MR. Cluster analysis for applications. New York: Academic Press, 1973.
[2] Bezdek JC. Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press, 1981.
[3] Bortolan G, Willems JL. Diagnostic ECG classification based on neural networks. J Electrocardiol 1994;26:75-9.
[4] Bortolan G, Brohet C, Fusaro S. Possibilities of using neural networks for diagnostic ECG classification. J Electrocardiol 1996;29:10-6.
[5] Cios K, Pedrycz W, Swiniarski R. Data mining techniques. Boston: Kluwer Academic Publishers, 1998.
[6] Dubois D, Prade H. A unifying view of comparison indices in a fuzzy set theoretic framework. In: Yager RR, editor. Fuzzy set and possibility theory: recent developments. New York: Pergamon, 1982.
[7] Delgado M, Gomez-Skarmeta F, Martin F. A fuzzy clustering-based prototyping for fuzzy rule-based modeling. IEEE Trans Fuzzy Systems 1997;5(2):223-33.
[8] Gomez-Skarmeta AF, Delgado M, Vila MA. About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy Sets Systems 1999;106:179-88.
[9] Hu YH, Palreddy S, Tompkins WJ. A patient-adaptable ECG beat classifier using a mixture of experts approach. IEEE Trans Biomed Eng 1997;44(4):891-900.


[10] Klir GJ, Folger TA. Fuzzy sets, uncertainty, and information. Englewood Cliffs, NJ: Prentice Hall, 1988.
[11] Kohonen T. Self-organized formation of topologically correct feature maps. Biol Cybernet 1982;43:59-69.
[12] Kohonen T. Self-organizing maps. Berlin: Springer, 1995.
[13] Kohonen T, Kaski S, Lagus K, Honkela T. Very large two-level SOM for the browsing of news groups. In: Proceedings of ICANN96, Lecture Notes in Computer Science, vol. 1112. Berlin: Springer, 1996. p. 269-74.
[14] Lagerholm M, Peterson C, Braccini G, Edenbrandt L, Sornmo L. Clustering ECG complexes using Hermite functions and self-organizing maps. IEEE Trans Biomed Eng 2000;47(7):838-48.
[15] Leung TS, White PR, Collis WB, Brown E, Salmon AP. Characterisation of paediatric heart murmurs using self-organising map. In: Proceedings of the 1st Joint BMES/EMBS Conference, vol. 2. New York: IEEE, 1999. p. 926.
[16] Pedrycz W, Gomide F. An introduction to fuzzy sets: analysis and design. Cambridge, MA: MIT Press, 1998.
[17] Pedrycz W, Reformat M. Rule-based modelling of nonlinear relationships. IEEE Trans Fuzzy Systems 1997;5(2):256-69.
[18] Piatetsky-Shapiro G, Frawley WJ, editors. Knowledge discovery in databases. Menlo Park, CA: AAAI Press, 1991.
[19] Setnes M. Supervised fuzzy clustering for rule extraction. IEEE Trans Fuzzy Systems 2000;8(4):416-24.
[20] Silipo R, Bortolan G, Marchesi C. Design of hybrid architectures based on neural classifier and RBF preprocessing for ECG analysis. Int J Approx Reason 1999;21:177-96.
[21] Sugeno M, Yasukawa T. A fuzzy logic-based approach to qualitative modeling. IEEE Trans Fuzzy Systems 1993;1(1):7-31.
[22] Willems JL, Lesaffre E, Pardaens J. Comparison of the classification ability of the electrocardiogram and vectorcardiogram. Am J Cardiol 1987;59:119-24.
[23] Zadeh LA. Fuzzy logic = computing with words. IEEE Trans Fuzzy Systems 1996;4:103-11.
[24] Zadeh LA. Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets Systems 1997;90:111-7.
[25] Zadeh LA, Kacprzyk J, editors. Computing with words in information/intelligent systems, vols. I and II. Heidelberg: Physica-Verlag, 1999.