Pattern Recognition, Vol. 30, No. 7, pp. 1121-1135, 1997 Pattern Recognition Society © 1997 Published by Elsevier Science Ltd Printed in Great Britain. All rights reserved 0031--3203/97 $17.00+.00
Pergamun
PII: S0031-3203(96)00133-1
AN INVESTIGATION OF MOUNTAIN METHOD CLUSTERING FOR LARGE DATA SETS ROBERT E VELTHUIZEN, t'* LAWRENCE O. HALL, t LAURENCE E CLARKEt and MARTIN L. SILBIGER t tDepartment of Radiology, College of Medicine, University of South Florida and the H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, U.S.A. $College of Engineering, University of South Florida, Tampa, Florida, U.S.A. (Received 3 May 1995; in revised form 12 August 1996; received for publication 26 August 1996) Abstract--The Mountain Method of clustering was introduced by Yager and Filev and reirmed for practical use by Chiu. The approach is based on density estimation in feature space with the highest peak extracted as a cluster center and a new density estimation created for extraction of the next cluster center. The process is repeated until a stopping condition is met. The Chiu version of this approach has been implemented in the Matlab Fuzzy Logic Toolbox. In this paper, we develop an alternate implementation that allows large data sets to be processed effectively. Methods to set the parameters required by the algorithm are also given. Magnetic resonance images of the human brain are used as a test domain. Comparisons with the Matlab implementation show that our new approach is considerably more practical in terms of the time required to cluster, as well as better at producing partitions of the data that correspond to those expected. Comparisons are also made to the fuzzy c-means clustering algorithm, which show that our improved mountain method is a viable competitor, producing excellent partitions of large data sets. Published by Elsevier Science Ltd. Clustering
MRI
Segmentation
Brain tumor
I. INTRODUCTION Clustering plays an important role in pattern recognition both in exploratory data analysis and in circumstances where training data are hard to obtain. °) The objective of clustering is to partition the feature space so that similar objects are grouped into the same cluster while dissimilar objects are in different clusters. (1'2) Many clustering algorithms search for a partitioning that minimizes some objective function, (2) which often is related to a cluster validity function such as the compactness of the resulting clusters. O) Iterative clustering techniques often use a hillclimbing approach for the search. These algorithms result in partitionings that depend upon initialization, and may place considerable demands on computational resources. This is especially true for large data sets such as those encountered in multi-spectral image processing. Other algorithms use mode-separation of the density in feature space to define clusters. (2'4-9) Koontz et al. (4) describe a technique to build directed trees where each node represents a single datum, such that the parent of each node is associated with a node with a higher local density in feature space. It uses a Parzen density estimation, but develops the trees on the individual data points and is not suitable for large data sets. The method was applied in a parallel implementation in reference (6). Kittler (9) also developed an algorithm that operated on individual data points for which the Parzen density estimation was calculated. It separates modes by building a sequence * Address for correspondence: 12901 Bruce B. Down Blvd, MDC 17 Tampa, FL 33612, U.S.A. E-mail:
[email protected].
Histogram
Mode separation
of data points such that the density is piece-wise monotone. The cluster boundaries are found as minima in the sequence. Some clustering algorithms do not operate directly on the data points but instead solely use the density in feature space. Goldberg and Shlien (5) developed an algorithm specifically to address large data sets and used the multi-dimensional histogram to estimate the feature density. Their scheme finds a threshold on the height of the multi-dimensional histogram to separate modes followed by a connectivity algorithm to identify connected cells and hence clusters. Drawbacks are that for high-dimensional data sets the histogram may be noisy and only clusters with approximately the same peak density can be detected. A less noisy density estimate can be obtained using a Parzen estimator. (l°) Gerig et a/. (7) developed a clustering algorithm based on the feature vector density specifically for dual-echo MR data. That method uses a Parzen density estimation to find cluster centers represented by peaks, but then relies on a database of cluster centers and associated tissues to determine the decision boundaries for the clusters. Even in a specialized domain such as MRI segmentation this is problematic since the position, shape and size of the clusters will depend on the scanner, the tuning of the scanner and the particular acquisition sequences used, as well as on how the patient loads the coil and what pathology is present. The mountain method (s'll) also operates on a density estimation, and introduced the concept of subtractive clustering. (12) Three important problems are associated with the mountain method:
1121
R.P. VELTHUIZENet al.
1122
• it requires a grid on which to define the density function, limiting the precision of the clustering; • for large data sets the computation times are unpractical; • the values for the parameters of the algorithm need to be selected. Chiu~12)modified the mountain method to address the first of these three problems. In this paper, we describe the original mountain method, the modifications Chiu made and the modifications we propose to address all three problems. Magnetic Resonance Images (MRI) are used as the test domain. Segmentation of MRI is used for visualization purposes or quantitation of, for example, lesion volumes.~13'14) Supervised segmentation of MR images is associated with a loss of reproducibility and can be labor intensive. At our institution we have evaluated several supervised techniques (maximum likelihood, k nearest neighbors and artificial neural nets), ~15'16)and developed a technique that is less sensitive to operator variability called semi-supervised fuzzy c-means (ssFCM). (3'14) Of the unsupervised approaches, fuzzy c-means (FCM) is very promising.~3'16) FCM usually finds good tissue boundaries, especially for brain images of normal volunteers. ~6~ For patients with brain tumors however, a sub-optimal solution is often found, depending partly on the initialization. The commonly accepted reason for this problem is that many iterative clustering methods tend to find clusters of equal size. (1"3A7)There are two problems with FCM: the execution time and the reliable detection of anatomically relevant tissues. We have approached these problems in several ways, using a validity measure, (3A7) a knowledge base<~8) and, in this paper, using the multi-dimensional density estimation of the feature values as the basis of clustering. The results of our approach could also be used as a starting point to find an initialization that can enhance the performance of FCM. Comparisons are made with Chiu's implementation and FCM.
2. METHODS
2.1. The mountain method The mountain method was originally described by Yager and Filev.~8'11) It is described briefly here to illuminate its limitations and to provide the basis for the proposed modifications. The "mountain function" is similar to a Parzen window estimation of the probability density distribution of the feature values x: N
M(Git,...,is) = Z exp[-c~d(xk, Gi,,...,i,)],
(1)
k=l
where
Gil ,i2,...,is is a grid point on a regular, rectangular grid with grid indices ij in the array of grid points G, d a distance measure (commonly the Euclidean distance or city-block
distance) between x and a grid point G, s the dlmensionality of the data set and N the number of data points. It only differs from a density estimation in that it is not normalized. However, only the shape of the density function is of interest for the cluster extraction. An example of the mountain function of a two-feature data set is shown in Fig. 1. The two images on top are MR images of the same location, forming a multi-spectral data set. The pixel intensities in these images are the features used to calculate the mountain function. Figure 1 also includes the two-dimensional histogram of the shown pair of MR images. The first cluster center is found at the grid point where the maximum of the mountain function occurs. The mountain function is then broken down to eliminate the cluster, by subtracting from the mountain function, a function describing the identified cluster with size parameter/3:
M2 ( Gih...,is ) : M 1 ( Gil ,...,i, ) N - M~ Zexp[-/3d(Gjl,..4s, Gi,,...,i~)], (2) k:l where M2 is the mountain function after elimination of the first cluster, M 1 the original mountain function and M~ the maximum value of the original mountain function as found at the first cluster center on grid p o i n t j l , . . . ,js. The algorithm is summarized as follows:
Algorithm 1. The Mountain Method 1. Select values for ~ and ft. Lay out or determine grid G. 2. Calculate the mountain function M(G;~). 3. Select the gridpoint in G where M is maximum. 4. Eliminate the maximum cluster from Mby subtracting a function f(G;/~). 5. Return to step 3 or stop. Three problems were found in directly using the mountain method for clustering as in reference (8). First, the method requires a grid on which to build the Mountain function. The number of gridpoints required for a given level of precision grows exponentially with the dimension of the data. Irregular distributions of the clusters will necessitate finer grids. Calculation of the Mountain function using equation (1) and the successive mountains using equation (2) then leads to long computation times. This part of the problem was also recognized by Chiu, ~12't9)who proposed to work with the data points themselves as gridpoints. His solution is available in the MATLAB Fuzzy Logic Toolbox, <2°) and is discussed later. Second, large data sets lead to high computational demands for the calculation of equation (1). For example, a typical MR image data set consists of 256 × 256 three-valued data points. Even when using a grid with only 10 points along each dimension, this leads to 103 x 2562 > 60 million evaluations of the exponential in the mountain function. Such computational burden is considerably higher than that required for traditional clustering techniques. Finally, a major pro-
An investigation of mountain method clustering for large data sets
1123
Fig. 1. Example MR data set with histogram and mountain function. MR images with (a) proton-density (PD) weighted contrast; and (b) T2 (spin-spin relaxation) weighted contrast; (c) histogram of this data set in PD/T2 space; and (d) the mountain function on the same space.
blem with the mountain method is that no guidelines are available for the values of the parameters a and /3. Application of the mountain method to MR data showed that the results of segmentation were very sensitive to the choice of the values of the parameters in equations (1 and 2). This is illustrated in Fig. 2. The data shown in Fig. 1 were clustered using the mountain method with different values of the extraction-parameter/3. Figure 2 shows the position of cluster centers in a two-dimensional feature space for the different values of/3. Only/3=0.06 allowed detection of a tumor, which is separately marked in the corresponding plot, while variations as small as 0.02 in the value of/3 did not allow detection of a tumor. These two problems were addressed by the approaches described in Section 2.3.
2.2. Subtractive clustering The computational complexity of the original mountain method for high-dimensional data was also recognized by Chiu. (12A9) He proposed an improvement by calculating the values of the mountain function not on a
regular grid, but instead on the data vectors themselves, referring to the values as "potentials". The implementation is available in the MATLAB Fuzzy Logic Toolbox t2°) as a routine called "subclust" (subtractive clustering). Please note that the number of grid points on the regular grid G as defined in the original Mountain Method may be less than the number samples; then the computational demands actually increase using Chiu's approach. Some guidelines for parameters are given as ranges for "good values". The required parameter for a call to subclust is a "radius of influence", which plays a roll in constructing both the mountain and subtracting clusters. Two additional parameters control the number of cluster centers that the method will return. These two parameters are thresholds on the potential (value of the mountain function) which define three regions: the lowest in which a peak is rejected, the higher region in which peaks are accepted, and an intermediate region in which the peak is accepted as a cluster center if it is far enough from already defined cluster centers. Comparative experiments with "subclust" were performed and will be described in the proceeding.
R.P. VELTHUIZEN et al.
1124
250 . . . . . . . . . . .
250~ . . . . . . . . .
250
200
200f
200
150
150f
150
100
100[
100
50~
50
50 0
~
0[,
Beta=l.00 .
.
.
.
i
100
,
,
,
150
-
,
.
.
.
.
,
i
Beta=O..20.•
,
150
100
250
200
,
250 . . . . . . . . . . . . .
250 . . . . .
200
200
150
150
0
250
200
*~ . . . . . .
1O0
Beta=O..1.O.
150
200
250
250
ru.,
200 150
y~
100
100
100
50 0
50
50
,
100
,
,
,
L
,
150
•
Beta=O.08 •
•
I
.
200
.
.
.
0 250 100
250
250
200
200
150
150
100
100
50
50 0
Beta=O'02
Beta=O.06 ,
,
,
I
,
150
,
•
,
,
.
.
.
0 250 100
.
200
.
Beta=O.04 .
.
.
i
.
150
.
.
.
t
.
.
.
.
200
250
T2
Beta=O'01
l
Proton density
,
100
150
200
250
100
150
200
250
Fig. 2. Extracted cluster centers in the Proton Density/T2 space as a function of extraction kernel width/3. Tumor tissue is bright on both PD and T2 weighted images, represented by high feature values (245,150). The class is detected only for/3=0.06.
2.3. The modified mountain method (M3) 2.3.1. Identifying candidate cluster centers. In order to apply the mountain method to large and/or highdimensional data sets, it is crucial to reduce the number of grid points on which the mountain function is calculated. In the modified mountain method this reduction is achieved by selecting candidate cluster centers from the m u l t i - d i m e n s i o n a l histogram. Calculation of the histogram can be done very fast, so that the number of grid-points for the histogram is not limited by computation time. This allows for many candidate cluster centers to be calculated on a higher resolution. Candidate cluster centers are identified using a method based upon a graph-theoretical approach as described by Khotanzad and Bouarfa. (6) The method is graphically represented in Fig. 3. For each cell in the histogram, a pointer is directed to the neighbor with the highest counts. Candidate cluster centers are then found
as those cells with a number of pointers directed at them above a threshold. The threshold is found allowing a pre-set number of candidate cluster centers, for example 100. If a large number of clusters is expected, the number of candidate clusters can be increased via the chosen threshold. This approach results in a significantly reduced number of vertices for which the mountain function is calculated, and at the same time allows an increase in resolution in each of the features, since the vertices are selected from the finer grid used to generate the histogram (typically 50 bins for each feature).
2.3.2. Determination of the kernel width. The constant a corresponds to the kernel width for the Parzen density estimation given by equation (l). Initial testing on a variety of different data sets with different values of c~ showed that there was some sensitivity to the choice of the parameter. Fukunaga (2) calculates the
An investigation of mountain method clustering for large data sets
: 36
41
-68
69
:
/
37
/
47 143 '
II
0 3 3
67-.,,~53 ~ 5 9 " *
--
62
.-
"63
73...~ 66 i 54 60
i
51
'
62 4 9 , ! I
: 52
"
i
48:39
210 i
I
i
3
3'
0
1
1 1
0
45 !
3
0
2
0
3
1:
I
]28
i
0
i
48
~
1125
1 0
/
Fig. 3. Identifying candidate cluster centers: each cell directs a pointer at the neighbor with the highest counts. The number of incoming pointers is counted. Candidate cluster centers are identified as those cells with high counts.
optimal metric A and size r for the Parzen window in the case of a Gaussian kernel as
[2'+2r[(s + 2)/2]] 1/(s+4)
A : Y],, r :
L ~ 2~2-~i- J
"N-1/(s+4), (3)
where r2A is the covariance matrix of the kernel density; the covariance matrix of the underlying distribution of the data; N represents the number of samples; and s the dimensionality of the data. £(.) is the Gamma function w h i c h can be f o u n d in s t a n d a r d b o o k s on Mathematics. ~21) However, the underlying distributions are what is searched for, where each distribution will have a different size as well as a different shape. Since the shape of the individual distributions is not known, a symmetric shape with a fall-off related to the average standard deviation of the full data set is used. The following estimate for the parameter c~ cannot be derived rigorously, but provides values that intuitively relate to equation (3). An estimate for the average standard deviation is the square-root of the sum of the variances, which is the trace of the covariance matrix of the whole data set, divided by the number of features n. The size parameter r is modified to be calculated for the average number of feature vectors per class, substituting N/c for the number of samples in each class. Noting that the exponent in the Mountain function is linear rather than quadratic in d, in the modified mountain method the following value of c~ is used: 1 ce = [ ~ / n ] r ( N
/c)'
(4)
where c is now a pre-set number of classes. 2.3.3. Extraction of identified clusters. In the original mountain method, cluster centers are identified successively as the current highest peak in the mountain function, and the effect of the cluster is removed using a fixed kernel using equation (2). The
successive extraction of clusters is important for clusters with overlapping distributions. In Fig. 4(a), for example, it is shown how in the one-dimensional case a cluster may be hidden by a neighboring cluster. The shape of the distribution of the data is not necessarily identical for different clusters. However, the original mountain method uses a fixed kernel, making it sensitive to the choice of extraction parameter/3- Therefore, an estimate of the actual shape of the distribution of a cluster, based on the local neighborhood in the mountain function, is used. This approach removes the sensitivity of the method to the parameter r, since the actual shape of the distribution is used rather than a fixed kernel. After the candidate cluster center with the current highest value of the mountain function is identified, the local neighborhood of the peak is identified using the distance to the nearest candidate cluster centers; for the results described in this paper, the neighborhood was a hypercube with a side of three times the average distance to the five nearest gridpoints. This choice aims to capture the extent of the cluster under consideration; if the hypercube is too small, only the top of the distribution would be used to estimate the shape of the cluster; if the side of the hypercube was chosen too long, it would include other clusters. This is illustrated in Fig. 4. The extraction of the cluster with the highest peak is shown using gridpoints spread over the cluster [Fig. 4(b)], sampling only the top [Fig. 4(c)] and using widely spread gridpoints [Fig. 4(d)]. The mountain function that remains after the cluster is subtracted is shown as a dashed line. It is clear from Fig. 4 that accurate sampling of the mountain function is crucial to identifying subsequent clusters. The mountain function is then evaluated on a fine grid with a limited number of grid points in this hypercube around the chosen candidate cluster center. Some gridpoints may have contributions from other clusters, and need to be excluded from the curve fitting procedure. As depicted in Fig. 5, the points belonging to the peak can be found by tracing the valley around it.
1126
R. E VELTHUIZEN et al.
500
500
I
b
000000000
400
400
.,, /
300
300
200
200
100
100
0
'i
0 0
100
200
300
500 I-
0
1O0
200
300
50C O,,
C 400
400I 300 [~
300
200
200
d
0 0 © © 0 © 0 ~ 0
.'
,"
,.
L
100 0
A
0
I O0
200
300
0
I O0
II
200
---k
___
300
Fig. 4. One-dimensional example of the histogram of two overlapping clusters. (a) The histogram with noise (solid) is the sum of two Gaussian distributions (over-plotted). (b) Fitting a Gaussian on well chosen gridpoints (diamonds) extracts the dominant mode well (dotted line); extracting the cluster leaves the solid distribution. (c) Fitting on gridpoints only close to the top may fail to estimate the cluster parameters. (d) Similarly, too wide a grid may over-estimate the variance.
Grid points belonging to a cluster are identified from the density estimation in the neighborhood around the cluster center by casting rays to the edges, and the position of the valley is determined. Only grid points on the down slope are considered for cluster density estimation. Figure 5 shows an example of the result, where the selected region is outlined in white. An obvious choice for the shape of the density distributions of the clusters is a multivariate Gaussian. This may not be an optimal choice for many domains, but often will be the only reasonable choice in the absence of knowledge of the tree distributions. After all, if true distributions were known, parametric methods should give better classification schemes than clustering techniques. There is an important advantage of the described technique over a direct fit of multiple multivariate Gaussians. With the proposed technique, the density estima-
tion function needs be evaluated only on a very limited number of grid points, but the resolution in important areas is still high. In the process of extracting clusters, the effect of overlapping distributions is corrected, and the areas where high resolution is required can be identified. The algorithm is summarized as follows:
Algorithm 2. The Modified Mountain Method (M3) 1. Determine global grid Gg from peaks in the histogram. 2. Calculate c~ using equation (4). 3. Calculate global Mg(Gg;c0 using equation (1). 4. Select grid point where Mg is maximum. 5. Determine local grid Gl in the neighborhood of Mg's max. and calculate Ml(GhC0. 6. Search valleys in M1 and select points in Gl belonging to the cluster. 7. Fit a multivariate Gaussian to selected points.
An investigation of mountain method clustering for large data sets
1127
Fig. 5. Grid points belonging to a cluster are identified from the density estimation in the neighborhood around the cluster center. Rays are cast to the edges, and the position of the valley is determined. Only grid points on the down going slope are considered.
8. Subtract the found Gaussian from Mg 9. Continue with 4, until all c clusters are found.
2.4. Segmentation and labeling The cluster centers that are found by any of the mountain clusterings can be used to group pixels and therefore segment the MR image data. Grouping is done using a nearest-centroid classifier, taking the cluster centers as the centroids, which is consistent with the assumption that the clusters are Gaussian. The nearest prototype classifier groups a feature vector to the cluster center at the smallest Euclidean distance. Labeling of the clusters, i.e. assigning a tissue class to each cluster, was done by the following technique. For each data set, class labels were available, either as data sets with manually labeled ground-truth, or as data sets with kNN segmentation considered acceptable by a
radiologist. The clusters were matched with the anatomical classes of the acceptable segmentation in such a way that the number of pixels that did not match was minimized. This method was preferred to manually assigning tissues to the classes since some data sets are difficult to label, and operator judgement may vary. (18) The maximum-match criterion allows objective (operator independent) evaluation of error rates and disagreement rates respectively. Segmentations were also performed using Fuzzy cMeans, (16) using two different initializations. FCM takes a finite data set X - - { X l , X 2 , . . . ,Xn} as an input, and each xi C X is a feature vector; x i - - ( x n , x i 2 , . . . ,xis), where xij is the jth feature of the ith object, and s the dimensionality of xi. Functions uk : X ~ [0, 1] are defined; they assign to each xi in X its grade of membership in the fuzzy set muk. Together, the functions Uk yield a fuzzy c-partition U, which can be represented as a c x n
R. E VELTHUIZEN et al.
1128
matrix. The fuzzy c-means algorithm consists of an iterative optimization of an objective function Jm: n
c m
Jm(U,v) : E E ( U l i )
2
(dli) ,
(5)
i=1 /=1
where d~ = Ilxi - vt II2, v = (va, v2,..., vc) with vl being the cluster center of class 1, 1 <_l<_c, and c the number of clusters. The parameter m determines the "fuzziness" of the result; for computational reasons, we used m=2 in this work. Given a partition, the cluster centers are calculated using ~n Vl--
~m x
/
i=l~Uli) n
~i=l (uli)
m
i
1 < 1 < C.
(6)
An iteration is then completed by calculating the new partition:
[~-~:[Ixi--vllla'~2/(m-l)] -1
u,,= Lk=l llxi:v A// l
'
l
(7)
The steps described by equations (6) and (7) are repeated until the Euclidean distance between membership matrices at successive iterations falls below a threshold e: IlUk+l -- ukll < ~.
(8)
In this work, the value used for the threshold was ~=0.001. Two different initializations were used. The first initializes the membership matrix of the desired dimensions using the general initialization scheme used for the MRI segmentation: I~
100 0 1
1
001 0 0
0
U0 =
,
(9)
2.5. Magnetic resonance data The modified mountain method was applied to magnetic resonance images. One phantom image data set and 13 brain scans were studied. The phantom consists of three rows of tubes, each row containing a material with distinct MR parameters. This phantom allows sampling of MR data over the full field of view, while keeping the size of structures comparable to those found in clinical MR images. An MR image obtained with this phantom can be seen in Fig. 6. Also studied were brain scans of three normal volunteers and 10 brain image data sets of patients with brain tumors. For one of the three normal data sets, and one of the patient cases, an experienced radiologist manually labeled the intracranial tissues (white matter, gray matter, cerebro-spinal fluid (csf), as well as tumor, edema and necrosis for the patient case). All images were acquired using the same GE Signa 1.5 Tesla MR scanner and head coil (General Electric, Milwaukee), using a field of view of 240 mm, acquisition matrix of 256 × 192 and a slice thickness of 5 mm. Each data set consisted of three spin-echo images with different contrasts: T1 weighted, proton density weighted and T2 weighted. Examples of a proton density and a T2 weighted image can be seen in Fig. l(a) and (b) respectively. The data sets that were used to evaluate the modified mountain method included only the pixels in the intra-cranial cavity (inside the skull), since those pixels correspond to the tissues of interest. This reduced the data sets from their original size of 2562 pixels to about 20,000 pixels. This allows faster convergence, and reduces the number of cluster centers sought to the number of tissue types in the intracranial cavity.
3. R E S U L T S
3.1. Accuracy of segmentations 0
0
0
1
1 0
with U0 a 5 x n matrix (five classes). The pattern that is used to initialize FCM guarantees convergence of the algorithm to a local minimum, but does not necessarily provide a good starting point to look for the global minimum.°6) In the second initialization scheme, the cluster centers v in FCM were initialized with the cluster centers found by M3. The classes found by either application of FCM were matched to tissues as described for M3.
A well known problem with MR segmentation validation is that, generally, no ground truth is available. We studied three data sets with known ground truth: a phantom with homogeneous volumes, a volunteer data set and a patient data set that were labeled manually by an experienced radiologist, so that the labels for the tissues, white matter, gray matter and cerebro-spinal fluid (csf), and, for the patient, tumor and edema, are known. However, manual labeling of MR data is very time
Fig. 6. Phantom data: (a) MR Tl-weighted image; (b) MR proton density weighted image; (c) segmented MR image using the cluster centers found using the modified mountain method; and (d) segmentation by FCM.
An investigation of mountain method clustering for large data sets consuming and cannot feasibly be done for large numbers of data sets. Therefore, the data sets for which ground truth is not available are compared to a supervised segmentation using k nearest neighbors (kNN), since this method often finds good segmentations. (15) Figure 6 shows the results for the phantom. The different methods yield results that are visually indistinguishable. It can be seen that the errors occur on the edges of the circular regions, and towards the edges of the image. This might be explained by some of the MR artifacts: Gibb's effect (ringing at sharp edges), partial volume effects (voxels on the edges are only partly filled with MR signal producing material) and MR non-uniformity (there is a signal fall-off towards the edges of the image). In absolute numbers, the error count for the mountain method with the nearest centroid classifier, fuzzy c-means initialized with M3, and fuzzy c-means initialized with a standard initialization scheme are 368, 389 and 389 voxel classification errors respectively. These error counts constitute about 3% of the number of signal producing voxels. Figure 7 shows the segmentation results on the volunteer data set for which manual labeling was done for the intracranial tissues. The result obtained using M3 as well as the subclust program from MATLAB show a large increase in the number of pixels classified as csf. Compared to the manually labeled image, white matter
1129
(the dark gray) is somewhat more prevalent in the central region, while it is missing at the perimeter. FCM initialized with M3 converged to the same result as FCM initialized with a standard pattern. Compared to the result of M3, the FCM segmentations show a more accurate classification of csf, but a smaller class of gray matter, with overall a higher number of errors than the result of M3. The subclust program from MATLAB requires at least one parameter to be set by the operator. The manual suggests a range for this parameter: between 0.2 and 0.5. The experiments were done for values of 0.1-0.6 in increments of 0.1. This range of values was investigated for each of five different levels of sub-sampling the data set. The subclust program was run on sub-sampled data due to time constraints, as discussed later. The subclust segmentation in Fig. 7 is the best obtained from the 30 experiments with this data set, using a parameter value of 0.3 and a sub-sampling of every 100th pixel. It shows the same errors in csf, but a further shift in the white/gray matter boundary, giving rise to a slightly higher error count than M3 or FCM. For the three studies where ground truth was available, error rates were determined. The other studies were compared to the results of the k nearest neighbors rule (kNN) to produce disagreement rates. The kNN segmentations were evaluated by radiologists who decided that
Fig. 7. Results of segmentation of a normal volunteer. (a) T2-weighted MR image; (b) ground-truth image of white matter, gray matter and cerebrospinal fluid; (c) result of nearest centroid classifier using the cluster centers found through M3; (d) result of FCM with either of the initializations; and (e) result of nearest centroid classifier using the cluster centers found through subclust. The gray levels for (b)-(e): white: cerebro-spinal fluid; light gray: gray matter; dark gray: white matter; black: background.
1130
R.P. VELTHUIZEN
rates occur in this and other cases where the clusters found by the automatic methods do not correspond to the anatomical tissues. In the example of Fig. 9(c), the second gray-matter class and all the pixels that should have been classified as tumor contribute to the dissimilarity rate. The four patient cases that show lower dissimilarity rates after application of the modified mountain method have in common that the segmentations identified the pathological tissues better than FCM initialized with the pattern. In most other cases of high dissimilarity rates, tumor was identified by both runs of FCM, but often a different gray/white matter boundary was found compared to kNN. The MATLAB implementation was also evaluated. Initial testing of the subclust routine in the MATLAB fuzzy logic toolbox quickly made clear its application to large data sets is not practical. A run using one of the data sets used for the evaluation (24,944 data points) was executed in 64.95 h (2.71 days). Therefore, tests were done on sub-sampled data sets, using every 100th, 20th, 10th and 5th feature vector respectively. The results of experiments on data sets of a normal volunteer and a brain tumor patient are shown in Fig. 10. The number of cluster centers that are returned by subclust are not under direct control of the user; several (optional) parameters are involved as mentioned in Section 2.2. To allow a fair comparison with M3, only the first extracted cluster centers corresponding to the highest peaks were used to determine the errors; three classes for the volunteer, five for the patient. If subclust produced less than that number of classes, errors were established using only the
they reflect the true tissue distributions reasonably well. These disagreement measures are not absolute error measurements since ground truth is not available, but the error rate is expected to be close to the disagreement counts. A summary of the error/disagreement rates can be found in Fig. 8. Occasionally, the modified mountain method performed poorer than FCM. In most of those cases, M3 would not detect white matter and gray matter as separate classes. This is not unexpected because white matter and gray matter have very similar MRI characteristics, and their distributions overlap considerably. This is worsened by the effects of radiation treatment that has been administered to these brain tumor patients. (13) Because these tissues represent the largest clusters of the images, M3 finds high error rates. FCM may still separate the white matter and gray matter modes because they are large classes, and separating them will reduce the value of the objective function, d7) It should be noted that M3 is not sensitive to initialization since it uses a global density estimation, and that FCM may perform similarly poorly under different initializations. For most cases, the two segmentations using FCM came to virtually the same result, which is reflected in the almost identical curves in Fig. 8. For four cases, patients 2, 3, 8 and 9, FCM initialized with the cluster centers found by the modified mountain method resulted in significantly lower dissimilarity rates than FCM with a standard initialization. One of these cases, patient 2, is shown in Fig. 9. It can be seen that FCM with the standard initialization did not result in the detection of tumor, but found two gray-matter classes. High error/dissimilarity
60
I
I
I
I
I
I
et aL
I
I
I
I
- - ~ - - mountain method _ .+__ fcm init on mm A
I
I
I
I
I
I
I
/ \ / /
fcm init o n pattern
',
,
'
'
40 o
o~ 20 '
[
'.
_.-_~
2
'/
(.9
I
I
I
I
I
I
I
I
I
)hantom .voluntee, r 2 . p ~ i e n t l . .patient3. p.atient5. .patient7. .patient9. VOlunteer] volunmerJ patientz patient4 patientt~ paiient~ patient10 DATA SET Fig. 8. Comparison of pixel classes with ground truth (phantom, volunteer 1 and patient 7) or with the classification found by k-nearest neighbors (all other data sets).
An investigation of mountain method clustering for large data sets
1131
Fig. 9. Results for a brain tumor patient. (a) T1 weighted MR image; (b) segmentation by k-nearest neighbors; (c) FCM initialized with a pattern; (d) modified mountain method; and (e) FCM initialized with M3. Legend in the upper right, from top to bottom: csf, white matter, gray matter, gray matter II (see text), tumor and edema.
Subclust errors volunteer 1 60
[
I
I
I
I
Subclust errors patient 1 T--
I
I
I
I
I .......
k
40
+
09 nO nc nc U.l
'
\ "~ "~
I t:" -= .
._
°f
,)k~,-
4O
,'~,,-/~
"' ~\
'".~
'-\\J~//
o
\ \
,,=,
i' /"
/,"
'"'~r~-- -'~''
M3-Errors
o~ 20
[] sample= 1/100 + sample= 1/20 e - sample = 1/10 - - & sample = 1/5 - ~ - sample = 1/5. I TOrclng ~ClUsTers
20 ~: "'~//
M3-Errors
-
I
I
I
I
I
b.lO 0.20 0.30 0.40 0.50 Subclust parameter
I
I
0.60
0.10
I
1
I
I
0.20 0.30 0.40 0.50 Subclust parameter
I
0.60
Fig. 10. Per cent error for results obtained using "subclust" on a volunteer and a patient case. Four sampling rates were used, and each sub-sampled data set was analyzed with varying parameters. In addition, the subclust cluster accept and reject parameters were varied to generate at least the expected number of classes (lines marked with "*"). The error rate obtained with M3 is shown as horizontal lines (the user does not need to set parameters for M3).
1132
R. E VELTHUIZEN et al. positive and false positive rates:
available classes. The full data set was then classified using the nearest centroid classifier. The experiment was done for a range of values of the subclust "range of influence" parameter. It can be seen that the error rate varies dramatically with this parameter, making it hard to use the method in exploratory data analysis. Increasing the number of feature vectors for clustering does not necessarily improve the error rate; for the volunteer, the lowest error rate is achieved using the coarsest sampling, using only 138 pixels for the call to subclust. The result of this clustering was also shown in Fig. 7(e). For comparison with M3, the cluster accept and reject parameters were varied so that subclust would return at least three classes for the volunteer and five for the patient data set. This was done on large data sets, sampling one in five feature vectors. The results are included in Fig. 10, as well as the error rates found using M3. M3 produced less errors than virtually all experiments with subclust.
TP(class) = # pixels correctly classified in class # pixels in class (10) FP(class) = # pixels incorrectly classified in class pixels not in class (11) Whether a pixel was correctly or incorrectly classified was determined by the kNN result, except for the manually labeled cases. Hence, these are not true TP/FP rates, but approximations based on the kNN result. For the pathological tissues (tumor and edema) the TP/FP rates were calculated and averaged. The results can be seen in Table l, which shows that application of the mountain method to these data sets has increased the true positive detection rate of the pathological tissues while reducing the false positive rate. The performance of M3 itself is slightly better than FCM for pathological tissues, but not for normal tissues (see Fig. 8). When FCM was initialized with the cluster centers found by
3.1.1. D e t e c t i o n o f p a t h o l o g y . Insight into the performance of the methods with regard to the pathological tissues can be gained from the true
2.0
I
I
I
I
I
~
I
I
I
I
I
I
I
I
/k
//
\
- • - modified mountain method (M3)
/
\
---+-- FOM ,nit on M3
/ +.,.,_ ~ iii I--Z
O
IO I.IJ 1.0 x 1.1.1 i.u _> I-
,
~\
",,, !,,,,,,,,.
:; :
x
.~,
5u.I
rr 0.5
""
/
x
'".
.."'\
\ ..'+ .
..A-
.
.
',
.
',
/Iv / 4~"
",
,
..."
~'"
',,,
.+
", /
."
+
",,
."
\
\YK- - - . ~ k /
0.13
i I I I J I J I I I i J I I ~hantom vo unteer2 pat entl patient3 .patient5 patient7, patient9 volunteer1 volunteerkJ patient2 patient4 patient6 pal entt~ pat e n t l 0 DATA SET
Fig. 11. Execution times of the different approaches relative to the time that was needed to execute FCM initialized with a standard pattern. The time FCM took is represented by the horizontal line at relative execution time = 1.0. *: M3; +: FCM initialized on the result of M3; and 0 the total execution time of FCM and the M3 initialization.
Table 1. Average true-positive and false-positive rates (%) for pathological tissues M3
Method:
FCM init M3
FCM standard init
Tissue
False pos
True pos
False pos
True pos
False pos
True pos
Tumor Edema
5.6 5.9
66.1 77.9
5.2 8.7
75.5 81.2
10.3 5.9
59.4 75.9
An investigation of mountain method clustering for large data sets
1133
TISSUE DISTRIBUTIONS I
I
I
'
'
'
I
1200
! C
1100
t"
C
looo
,~ 9oo 800
Tissue median FCM [] M3
29
+
G.
700 600
,
I
0
,
,
,
I
200
,
,
~
I
t
,
,
I
L
,
400 600 T1 weighted intensity
,
I
800
,
,
,
1000
Fig. 12. Contour plot of the histogram in T1/Proton density space for the manually labeled patient data set. Contour lines are drawn at 25%, 50% and 75% of the maximum, csf = cerebro-spinal fluid; wm = white matter; gm = gray matter; edm = edema; ncr = necrosis; tm = tumor. Also included are the medians of the manually labeled distributions and of the segmentations. For this data set, FCM initialized with the result of M3 converged to the same partitioning as FCM initialized with the pattern.
M3, it provided the best tumor recognition and an improved edema recognition with a slightly higher edema FP rate. 3.2. Execution times
One of the objectives for the application and optimization of the mountain method was reduction of execution time. From the experience on the data sets described here, the modified mountain method will not provide a significant improvement over FCM in this area. Figure 11 shows the execution times of the different approaches relative to the time that was needed to execute FCM with the standard initialization. The time FCM took is represented by the horizontal line at relative execution t i m e - - 1.0. FCM converges relatively fast on data sets of normal brains; the lower number of classes as well as the absence of MR signal distortions due to pathological tissues makes partitioning relatively easy. Partitions are always arrived at quickly by FCM for a phantom of homogeneous materials. For MR data of patients with pathology, FCM convergence takes 3-6 times more iterations than for normals. Long execution times for FCM tend to correspond to poor segmentations. The MATLAB implementation "subclust" required greater than 1000 times more time than the equivalent run with M3. Using subsampled images, using 1/10 of the data set for subclust, the MATLAB implementation required 10 times as much time as M3 needed to cluster the full data set. This was expected as the subclust algorithm has a complexity quadratic in the number of samples.
4.
DISCUSSION
In this paper we have described modifications to the mountain method that allow analysis of large, real-world data sets. We applied the modified mountain method to MRI segmentation, and evaluated the ability to detect anatomical classes and its potential to speed up convergence of fuzzy c-means clustering. Clinical magnetic resonance data are difficult to segment since the distributions of the MR signals for the different tissues overlap, and the tissue clusters have different shapes and different sizes. The segmentation of the phantom data described above results in very low error rates for any method since the distributions are well defined in feature space. The segmentation of the ground-truthed data set of a normal volunteer and for a patient show much larger error rates than the results for the phantom. The distributions for the manually labeled patient data set are plotted in T 1/proton density space in Fig. 12. The graph shows a few of the characteristics of the pixel intensity distributions for the different tissues. Most distributions are fairly compact, but necrosis exhibits a wide, irregular distribution. The distributions for white matter and gray matter overlap considerably, as well as the edema and necrosis distributions. What cannot be seen in this contour plot, is that the csf distribution is essentially bimodal. Many of the voxels that the radiologist classifies as csf are partly filled with gray matter, since the csf around the brain surrounds the gray matter in the many thin folds of the cortex. Pixels around the ventricles are partly
1134
R.P. VELTHUIZEN et al.
filled with gray matter. This causes partial volume effects, which spread out the distribution of the tissues. The mean T1 weighted pixel intensity value for csf is about 20% higher than the median value, indicating that the distribution is positively skewed. Another indication of the difficulties with the distributions of pixel intensities in MR images of human subjects is the following. If the data of the manually labeled volunteer were classified with the nearest centroid classifier, using the means over the three classes, an error count of 2040 is obtained. Taking that into account, the results of the M3 (1998 errors) and FCM (2106 errors) are excellent. A significant improvement was seen in the detection of pathological tissues by M3, as well as by FCM after initialization with the mountain method (Table 1 and Figs 8 and 9). The level of agreement between the segmentations obtained through kNN and the unsupervised approaches is very high in those cases where the pathological tissues are detected, especially considering the inter- and intra-operator variations seen in kNN segmentations. The number of cases in which the pathological tissues are detected was increased in our experiments with the mountain method, as is reflected in Table 1. The application of the mountain method as an initialization scheme was pursued because FCM initialized with a pattern or randomly often did not result in adequate segmentations of MRI brain tumor images. An initialization based on peaks in the multi-dimensional histogram was assumed to be able to improve FCM segmentations. The original mountain method proved to be impractical due to the size of the data sets and the distribution of class centroids. To allow application to large data sets in general and MRI data in particular, three modifications were made: rather than using a regular grid, as in the original mountain method, or the data points themselves, as in the MATLAB implementation, a small number of candidate cluster centers are identified; the value for the parameter in the mountain kernel a is prescribed; and an adaptive subtraction method is implemented. These modifications allow application to MRI data. The modified mountain method results in improvements in segmentation over FCM initialized with a pattern or randomly both when used by itself to identify cluster centers in combination with the nearest centroid classifier, as well as when used as an initialization for FCM. In summary, in this paper modifications to the mountain method and choices for its parameters are presented to enable its practical use on large data sets. This approach is shown to be a significant improvement over the MATLAB implementation of a differently modified mountain method which requires the user to search for good values of the parameters and may take considerable execution time. The experiments described here indicate that the modified mountain method is both practical for large data sets and may allow an improvement in the error rates as compared to clustering by an iterative method such as fuzzy c-means.
REFERENCES
1. R. C. Dubes, Cluster analysis and related issues, Handbook of Pattern Recognition and Computer Vision, C. H. Chen, L. E Pan and P. S. P. Wang, eds. World Scientific, Singapore (1992). 2. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edn, pp. 254-268. Academic Press, Boston (1990). 3. A. M. Bensaid, Improved fuzzy clustering for pattern recognition with applications to image segmentation, Ph.D. Dissertation, University of South Florida (1994). 4. W. Koontz, P. M. Narendra and K. Fukunaga, A graph theoretic approach to non-parametric cluster analysis, IEEE Trans. Comput. C-25, 936-944 (1976). 5. M. Goldberg and S. Shlien, A clustering scheme for multispectral images, IEEE Trans. Systems Man Cybernet. 8(2), 86-92 (1978). 6. A. Khotanzad and A. Bouarfa, Image segmentation by a parallel, non-parametric histogram based clustering algorithm, Pattern Recognition 23(9), 961-973 (1990). 7. G. Gerig, J. Martin, R. Kikinis, O. Kiibler, M. Shenton and EA. Jolesz, Automatic segmentation of dual-echo MR head data, Proc. 12th Int. Conf. on Information Processing in Medical Imaging, Wye U.K., July 1991, pp. 175-187. Springer, Berlin (1991). 8. R. R.Yager and D. P. Filev, Approximate clustering via the mountain method, Technical Report # MII-1305, IONA College, Machine Intelligence Institute, New Rochelle, NY10801 (1992). 9. J. Kittler, A locally sensitive method for cluster analysis, Pattern Recognition 8, 23-33 (1976). 10. E. Parzen, An estimation of probability density function and mode, Ann. Math. Statist. 33, 1065-1076 (1962). 11. R. Yager and D. Filev, Generation of fuzzy rules by mountain clustering, J. lntell. Fuzzy Systems 2(3), 209-219 (1994). 12. S. Chiu, Fuzzy model identification based on cluster estimation, J. lntell. Fuzzy Systems 2(3), 267-278 (1994). 13. L. P. Clarke, R. P. Velthuizen, M. A. Camacho, J. J. Heine, M. Vaidyanathan, L. O. Hall, R. W. Thatcher and M. L. Silbiger, MRI segmentation: methods and applications, Magnetic Resonance Imaging 13(3), 343368 (1995). 14. M. Vaidyanathan, L. P. Clarke, R. P. Velthuizen, S. Phuphanich, A. M. Bensaid, L. O. Hall, J. C. Bezdek and M. Silbiger, Comparison of supervised MRI segmentation methods for tumor volume determination during therapy, Magnetic Resonance Imaging 13(4), 719-728 (1995). 15. L. P. Clarke, R. P. Velthnizen, S. Phuphanich, J. D. Schellenberg, J. Arrington and M. Silbiger, MRI: Stability of three supervised segmentation techniques, Magnetic Resonance Imaging 11(1), 95-106 (1993). 16. L. O. Hall, A. M. Bensaid, L. P. Clarke, R. P. Velthuizen, M. S. Silbiger and J. C. Bezdek, A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain, IEEE Trans. Neural Networks 3(5), 672-682 (1992). 17. A. M. Bensald, L. O. Hall, J. C. Bezdek and L. P. Clarke, Fuzzy cluster validity in magnetic resonance images, in Medical Imaging 1994: Image Processing. Proceedings SPIE 2167, Murray H. Loew, ed., pp. 454-464 (1994). 18. M. C. Clark, L. O. Hall, D. B. Goldgof, L. P. Clarke, R. P. Velthuizen and M. S. Silbiger, MRI segmentation using fuzzy clustering techniques, IEEE Eng. Medicine Biology Magazine 13(5), 730-742 (1994). 19. S. Chiu and J. J. Cheng, Automatic generation of fuzzy rulebase for robot arm posture selection, Proc. NAF1PS, pp. 436--440 (1994).
An investigation of mountain method clustering for large data sets 20. J. S. Roger Jang and N. Gulley, Fuzzy Logic Toolbox User's Guide. The Math Works, Natick, Massachusetts; 357-3-59 (1995).
1135
21. L. Rhde and B. Westergren, Beta /3 Mathematics Handbook, 2nd Edn. CRC Press, Boca Raton, Florida (1990).
About the Author--ROBERT P. VELTHUIZEN received the M.Sc. degree in Physics from the University of Utrecht, Netherlands, on a human vision subject. He has worked in medical image processing since 1990, mainly in MRI segmentation using pattern recognition techniques and genetic algorithms. His CV now includes 14 refereed journal papers and more than 25 abstracts.
About the Author-- LAWRENCE O. HALL is a Professor of Computer Science and Engineering at University of South Florida, and received his Ph.D. in Engineering Science from the University of South Florida in Tampa, with a dissertation on feature of extraction using genetic algorithms. He received his Ph.D. in Computer Science from the Florida State University in 1986 and a B.S. in Applied Mathematics from the Florida Institute of Technology in 1980. His current research in Artificial Intelligence is in hybrid connectionist, symbolic learning models, medical image understanding through integrated pattern recognition and AI, parallel expert systems, the use of Fuzzy sets and logic in imprecise domains and learning imprecise concepts such as tall or fast. He is an Associate Editor for IEEE Transactions on Systems, Man and Cybernetics, and IEEE Transactions on Fuz~ Systems. He is the President of the North American Fuzzy Information Processing Society. Dr Hall has written over 100 research papers and co-authored one book.
About the Author--LAURENCE E CLARKE is a Professor of Radiology and Physics and Senior Member in Residence at the H. Lee Moffitt Cancer Center and Research Institute. He is Program Leader of the Moffitt Digital Medical Imaging Program at the University of South Florida. Dr Clarke is Fellow of the American Association of Physicists in Medicine and the Society of Magnetic Resonance Imaging and member of the IEEE, EMB and SNM Society. Research interests include computer assisted diagnosis (CAD) in medical imaging and remote diagnosis.
About the Author-- MARTIN L. SILBIGER is Vice-President for the Health Sciences, Dean of the School of Medicine and Professor of Radiology at the University of South Florida, Tampa. His research interests include clinical evaluation of MRI techniques, magnetic resonance angiography and blood flow, and the application of computers for 3-D imaging analysis and diagnosis.