Chemometrics and Intelligent Laboratory Systems 64 (2002) 45 – 54 www.elsevier.com/locate/chemometrics
Hierarchical clustering extended with visual complements of environmental data set
A. Smoliński a, B. Walczak a,*, J.W. Einax b
a Institute of Chemistry, Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland
b Institute of Inorganic and Analytical Chemistry, Friedrich Schiller University of Jena, Lessingstrasse 8, D-07743, Jena, Germany
Accepted 25 May 2002
Abstract
Hierarchical clustering techniques complemented with a visual display of the data set allow direct interpretation of the clustering results in terms of the original variables. The proposed method of data ordering and display is simple and informative, and it fulfils the fundamental objectives of data visualisation techniques. In our study, it is applied to the exploratory analysis of an environmental data set.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Data exploration; Data visualisation
* Corresponding author.

1. Introduction
Exploratory analysis of a data set often starts with hierarchical clustering, which reveals the internal structure of the data (i.e., its clustering tendency). Because of its hierarchical nature, hierarchical clustering usually leads to a suboptimal partition of the objects, and the result depends strongly on the linkage method used. Very often, different linkage methods are applied to the same data set, and their performance is judged mainly by the interpretability of the results. However, interpreting a clustering is not an easy task, especially when the clustering is performed in a high-dimensional parameter space. Efficient and useful as they are, hierarchical clustering methods become much more powerful when complemented with a visual display of the data set, which allows direct interpretation of the clustering in terms of the original variables. The proposed graphic display is informative, simple to construct and easy to interpret. The performance of the proposed approach is demonstrated on an environmental data set representing physical and chemical parameters measured for water samples of the Saale River, which flows through a very densely populated and highly industrialized region of Germany.
0169-7439/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0169-7439(02)00049-7

2. Theory
2.1. Hierarchical clustering
Hierarchical clustering can be applied to multidimensional data sets in order to study similarities (or dissimilarities) of objects in the variables space, or similarities of variables in the objects space. A detailed description of hierarchical clustering methods and their applications can be found in Refs. [1–4]. Any agglomerative hierarchical clustering method is characterized by:
• the similarity measure used, and
• the way the resulting subclusters are merged (linked).
For continuous variables, popular similarity measures are the Euclidean and Manhattan distances. They are special cases of the Minkowski distance between objects i and j, defined as:

\[ d_{ij} = \left[ \sum_{k=1}^{n} \left| x_{ik} - x_{jk} \right|^{q} \right]^{1/q} \]

where the sum runs over the n variables, indexed by k. If q = 2, then d_{ij} represents the Euclidean distance, whereas for q = 1, it represents the Manhattan distance. Among the linkage methods, the most popular ones are:
• single linkage, often called the nearest neighbor method, which defines the distance between two clusters A and B as the smallest dissimilarity between an object from cluster A and an object from cluster B,
• complete linkage, which defines the distance between clusters A and B as the largest distance between an object from cluster A and an object from cluster B,
• average linkage, which defines the distance between the clusters as the average of all pairwise distances between an object from cluster A and an object from cluster B,
• centroid linkage, which is based on the distance between the centers of mass of the clusters, and
• Ward linkage, which is based on the within-cluster sums of squared distances: at each stage, the two clusters merged are those whose merger gives the minimum increase in the total within-group error sum of squares.
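The distance and the first three linkage rules above can be illustrated with a minimal sketch in plain Python. The function names `minkowski` and `linkage_distance` and the toy clusters are assumptions introduced here for illustration only; they do not come from the paper. Average linkage is taken here as the mean of all pairwise distances between the two clusters, and the absolute value inside the sum is what makes q = 1 give the Manhattan distance.

```python
from itertools import product


def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length vectors.

    q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def linkage_distance(cluster_a, cluster_b, method="single", q=2):
    """Distance between two clusters, each given as a list of vectors."""
    # All distances between one object of A and one object of B.
    pairwise = [minkowski(a, b, q) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # nearest neighbor
        return min(pairwise)
    if method == "complete":    # furthest neighbor
        return max(pairwise)
    if method == "average":     # mean of all pairwise distances
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical toy clusters in a two-variable space (illustration only).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]
single = linkage_distance(cluster_a, cluster_b, "single")
complete = linkage_distance(cluster_a, cluster_b, "complete")
average = linkage_distance(cluster_a, cluster_b, "average")
```

For these toy clusters the single-linkage distance comes from the pair (1, 0) and (4, 0), and the complete-linkage distance from the pair (0, 0) and (4, 3); the average-linkage value lies between the two.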
The choice of a clustering method depends on the data studied and on the purpose of the analysis. For exploratory purposes, the same data set can be studied with different clustering methods and the resulting classifications compared. The final results of hierarchical clustering are presented in the form of a dendrogram. The x-axis of the dendrogram displays the indices of the clustered objects (or variables), whereas the y-axis represents the linkage distance (or an adequate measure of similarity) at which two objects or clusters are merged.

The dendrogram reveals the data structure (i.e., the subgroups of objects), but it allows no interpretation of the observed patterns in terms of the original variables (parameters). For this purpose, we propose a simple visualization method, the principle of which is as follows. Let us assume that the data set is organized as a matrix X (m × n) containing m objects and n variables. If hierarchical clustering is applied to the objects, then the m objects appear along the x-axis of the resulting dendrogram in a specific order. Let us denote this order of objects as the ‘objorder’. One simple way to interpret the resulting clustering tree is to display the data set, with objects sorted according to the ‘objorder’, as an image whose pixels represent the matrix elements. However, if the measured parameters are independent variables, then their arbitrary order in the data matrix introduces abrupt discontinuities in the image. To overcome this problem, we propose to sort the data matrix in the variables’ direction as well. The order of the variables can be obtained from their clustering, in the same way as the ‘objorder’ was obtained. In the remainder of this text, this order is denoted as ‘varorder’. The resulting image of the data set then attains a smoother appearance, because neighboring objects and variables are ordered according to their similarity.

Table 1
Twenty-four physical and chemical parameters measured at 29 sampling sites along the Saale River

No.   Parameter
1     Na
2     K
3     Mg
4     Ca
5     NH₄⁺
6     As
7     Cd
8     Cr
9     Cu
10    Fe
11    Mn
12    Ni
13    Pb
14    Zn
15    Cl⁻
16    NO₃⁻
17    PO₄³⁻
18    SO₄²⁻
19    Temperature
20    pH
21    DOC
22    Redox potential
23    Conductivity
24    Suspended matter
Fig. 1. Map of sampling sites along the Saale River.
Summing up, the proposed method of two-way clustering consists of the following steps:
1. select the similarity measure and the linkage method,
2. perform clustering of the objects; output: dendrogram and ‘objorder’,
3. perform clustering of the variables; output: dendrogram and ‘varorder’,
4. sort the objects in data matrix X according to ‘objorder’ and the variables according to ‘varorder’,
5. display the image of the sorted data matrix.

Depending on the software used, the matrix can be displayed in many ways. In Matlab, data sets can be visualized with functions such as pcolor(X) or imagesc(X). The function pcolor maps the minimal element of matrix X to the first colormap index and the maximal element to the last one. These indices are then used with the colormap function to determine the color associated with each element of X. It ought to be stressed that pcolor does not display the last row and the last column of matrix X. To obtain a map of all elements of X, it is necessary to augment X with an additional row and column (i.e., to repeat the last row and the last column). The function imagesc scales the data to use the full colormap. Because the vertical direction of the displayed image depends on the orientation of the axes, the rows may appear in reverse order; if we want to preserve the proper order of objects, then the function flipud ought to be applied.
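Steps 1 to 4 above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the names `ward_leaf_order`, `transpose` and `two_way_sort` and the toy matrix `X_demo` are hypothetical, the merging criterion is the standard Ward cost (the increase in the within-cluster sum of squares, computed from cluster sizes and centroids), and the leaf order is obtained simply by concatenating the member lists of merged clusters. This yields one valid dendrogram order, which need not coincide with the order produced by any particular software package. Step 5 (the display) is left to a plotting tool.

```python
def ward_leaf_order(rows):
    """Agglomerative clustering with Ward's criterion; returns a leaf order."""
    # Each cluster: (member indices in dendrogram order, centroid, size).
    clusters = [([i], [float(v) for v in r], 1) for i, r in enumerate(rows)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                _, ca, na = clusters[a]
                _, cb, nb = clusters[b]
                sq = sum((u - v) ** 2 for u, v in zip(ca, cb))
                # Increase in total within-cluster sum of squares on merging.
                cost = na * nb / (na + nb) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ma, ca, na = clusters[a]
        mb, cb, nb = clusters[b]
        merged = (
            ma + mb,  # concatenation keeps merged members adjacent
            [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)],
            na + nb,
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


def transpose(matrix):
    """Swap rows and columns, so that variables can be clustered as objects."""
    return [list(col) for col in zip(*matrix)]


def two_way_sort(X):
    """Return 'objorder', 'varorder' and the matrix sorted in both directions."""
    objorder = ward_leaf_order(X)
    varorder = ward_leaf_order(transpose(X))
    sorted_X = [[X[i][j] for j in varorder] for i in objorder]
    return objorder, varorder, sorted_X


# Hypothetical toy matrix: rows 0 and 2 are alike, as are rows 1 and 3;
# columns 0 and 2 are alike, as are columns 1 and 3.
X_demo = [
    [0, 9, 0, 9],
    [5, 1, 5, 1],
    [0, 8, 1, 9],
    [5, 0, 6, 1],
]
objorder, varorder, sorted_X = two_way_sort(X_demo)
```

In the sorted toy matrix, similar rows and similar columns end up next to each other, so large and small values form contiguous blocks instead of alternating. This is the smoothing effect that the sorted image relies on.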
Fig. 2. Dendrograms of objects (a) in the space of the 24 measured parameters, and of variables (b) in the space of the 29 sampling sites.

Fig. 3. Dendrogram of objects with visual complement in the space of 24 parameters.

3. Data
The data set studied contains measurements of 24 physical and chemical parameters of water samples taken at 29 sampling sites along the Saale River, one of the largest tributaries of the Elbe. The measured parameters are listed in Table 1, and the location of the sampling sites along the Saale River is presented in Fig. 1. The Saale River was sampled monthly from September 1993 to August 1994. Parameters such as pH, oxygen content, temperature, salinity, conductivity, turbidity, and redox potential were measured directly at the sampling site. The contents of suspended matter and dissolved organic carbon (DOC) were determined in the laboratory. The contents of different metals (As, Ca, Cd, Cr, Cu, Fe, Hg, K, Mg, Mn, Na, Ni, Pb, Se, and Zn) were analyzed in the river water filtrate by means of inductively coupled plasma atomic emission spectroscopy (ICP-AES), graphite furnace atomic absorption spectrometry (AAS) with Zeeman background compensation, and hydride AAS. The NH₄⁺ concentration was measured by photometry. The concentrations of anions (Cl⁻, NO₃⁻, NO₂⁻, PO₄³⁻ and SO₄²⁻) were determined by ion chromatography and photometry. Details about sampling and analytical conditions are described in Refs. [5,6]. Some of the measured parameters had to be deleted before the computations, because they were either nearly constant or contained many non-detectable entries. The data are organized in matrix X (29 × 24), i.e., each row of matrix X represents one sampling site described by 24 parameters. Each element of the data matrix, x_{ij}, is the median of the measurements performed over the sampling period of 12 months. As the measured parameters differ significantly in their ranges, the data set is standardized:

\[ x_{ij} = \frac{x_{ij} - \bar{x}_{j}}{s_{j}} \]

where \bar{x}_{j} and s_{j} denote the mean of the jth column and its standard deviation, respectively.
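The column-wise standardization can be sketched in a few lines of plain Python. The function name `standardize_columns` and the toy matrix are assumptions made for illustration. The text does not state whether the standard deviation is computed with n or n − 1 in the denominator; the sketch assumes the sample standard deviation (n − 1).

```python
from statistics import mean, stdev


def standardize_columns(X):
    """Center each column of X on its mean and scale it by its standard deviation."""
    columns = list(zip(*X))
    means = [mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [stdev(col) for col in columns]
    if any(s == 0 for s in stds):
        # A constant column cannot be scaled; it should be removed beforehand.
        raise ValueError("constant column: remove it before standardization")
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]


# Hypothetical toy matrix: two variables with very different ranges.
X_raw = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
Z = standardize_columns(X_raw)
```

After standardization both toy columns take the same values, although their original ranges differ by a factor of ten. This is why variables with large numerical ranges no longer dominate the distances used in the clustering.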
4. Results and discussion
To explore the data set and to examine the similarities among the sampling sites, hierarchical clustering was used. The results presented below are based on the Euclidean distance and the Ward linkage algorithm.
The dendrogram in Fig. 2a reveals two distinct clusters of the sampling sites:
• cluster A, containing site nos. 1–16, and
• cluster B, containing site nos. 17–29.

The character of the river water changes markedly between sampling sites 16 and 17. This change is caused by the confluence of the Saale River with its tributary, the river Unstrut, which is polluted by anthropogenic activities. The main clusters have additional substructures. In cluster A, the following three subgroups of objects are observed: site nos. 9 and 11–16, site nos. 1, 5–8 and 10, and site nos. 2–4. In cluster B there are two subclasses, the first containing object nos. 17–22 and the second containing object nos. 23–29. These results suggest that the traditional division of the sampling sites into those belonging to the lower, middle and upper stream of the river is unjustified in terms of the considered parameters. Rather, two sections of the Saale River appear, corresponding to groups A and B. The dendrogram constructed for the variables (see Fig. 2b) reveals three main classes (A, B, and C):
• class A contains variable nos. 1–6, 13, 15, 18–20 and 24, which represent the concentrations of Na, K, Mg, Ca, NH₄⁺, As, Pb, Cl⁻ and SO₄²⁻, temperature, pH and the suspended matter, respectively;
• class B consists of variable nos. 7, 16, 22, 17 and 21 (Cd, NO₃⁻, redox potential, PO₄³⁻ and DOC); and
• class C comprises variable nos. 8–12 and 14 (Cr, Cu, Fe, Mn, Ni and Zn).

Using ‘objorder’ and ‘varorder’ to sort the data set, we can augment the resulting dendrogram of objects with an image of the standardized data (see Fig. 3). Simultaneous interpretation of the dendrogram and the image leads to the conclusion that group A contains objects characterized by low values of parameter nos. 1–6, 13, 15, 18 and 23 (the concentrations of Na, K, Mg, Ca, NH₄⁺, As, Pb, Cl⁻ and SO₄²⁻, and conductivity), whereas for the objects belonging to group B, the same parameters attain high values. It is easy to notice that the uniqueness of sampling sites 2–4 is associated with high values of parameter nos. 8–10 (concentrations of Cr, Cu and Fe), whereas the subgroup of site nos. 23–29 differs from the other sites within group B because of high values of parameter nos. 11 and 12 (concentrations of Mn and Ni).

Fig. 4. Dendrogram of objects with visual complement in the space of anthropogenic parameters (As, Cd, Pb, and Zn).
Fig. 5. Dendrogram of objects with visual complement in the space of geogenic parameters (Ca, Fe, Mn, and Ni).
Fig. 6. Score plots and loading plots on the planes defined by the main principal components.
Extremely polluted sites, represented by the intense brown pixels, are also easy to notice. For instance, sampling site no. 19 is characterized by a very high concentration of PO₄³⁻, sampling site no. 4 has the highest concentrations of parameter nos. 8, 10 and 11 (Cr, Fe and Mn), and at sampling site no. 6, a very high concentration of parameter no. 16 (NO₃⁻) is observed.

Clustering of the sampling sites in the space of the so-called anthropogenic parameters (i.e., parameter nos. 6, 7, 13 and 14) is presented in Fig. 4. The two main clusters observed contain sampling site nos. 1–16 and nos. 17–29. Although the main clusters are the same in the space of the anthropogenic parameters and in the whole space of 24 measured parameters, their substructures differ in the two spaces. Sampling site nos. 1–16 have relatively lower concentrations of As and Pb (parameter nos. 6 and 13) than sampling site nos. 17–29, but within the two main clusters the sites differ in their concentrations of Cd and Zn (parameter nos. 7 and 14). Site nos. 9 and 16 are outlying in the first cluster because of a relatively low concentration of Cd (no. 7). The uniqueness of site no. 20 is likewise associated with a very low concentration of Cd and a medium concentration of Zn (nos. 7 and 14).

Sampling sites clustered in the space defined by important geogenic parameters such as Ca, Fe, Mn and Ni (parameter nos. 4, 10, 11 and 12, respectively) reveal four main groups (see Fig. 5):
• group A, containing object nos. 1, 5–16;
• group B, containing object nos. 17–22;
• group C, with sampling site nos. 2–4; and
• group D, with site nos. 23–29.
Group A of the sampling sites is characterized by relatively low values of all considered geogenic parameters, and group B differs from group A in its higher values of parameter no. 4 (concentration of Ca). Group C reveals very high values of parameter no. 10 (concentration of Fe), low values of parameter no. 4 (Ca), medium values of parameter no. 12 (Ni), and medium values of parameter no. 11 (Mn) for site nos. 2 and 3, with a very high value of this parameter for site no. 4. Group D is characterized by high values of parameter nos. 11, 12 and 4 (Mn, Ni and Ca) and by low or medium values of parameter no. 10 (Fe).

The detailed conclusions presented above, formulated on the basis of two-way hierarchical clustering complemented with a visual display of the data, are especially valuable when the standard method of data exploration, namely Principal Component Analysis (PCA) (e.g., Ref. [7]), cannot compress the data efficiently. For the data set studied, the first five factors describe only 85.5% of the data variance, and interpretation of the results requires inspection of several two-dimensional plots (see Fig. 6). Of course, nonlinear projection methods are also available, which allow efficient compression of the data and its presentation on a plane (e.g., the Kohonen network [8], the Sammon projection [9], the Generative Topographic Map [10], etc.), but this group of methods does not allow interpretation of the observed pattern of objects (in our case, of the sampling sites) in terms of the measured parameters.

Fig. 7. Visualisation of data sorted randomly (a), sorted according to ‘objorder’ (b), and sorted according to both the ‘objorder’ and ‘varorder’ (c).
5. Conclusions
The visual approach complements hierarchical clustering methods and should be an essential component of exploratory studies. It ought to be stressed that the clustering results do not depend on the order of objects or the order of variables in the input data matrix. However, proper ordering of the data set to be displayed has an enormous influence on the interpretability of the image. Fig. 7 presents, as outputs of the Matlab function imagesc, visual displays of the data set with a random order of objects and variables, of the data set sorted according to ‘objorder’, and of the data set sorted according to both ‘objorder’ and ‘varorder’. These displays clearly reveal the influence of data sorting on the interpretability of the final image.

If hierarchical clustering is applied to a large data set and the user declares that only a limited number of groups, g, is to be displayed, then the data image can be constructed as an image of a data matrix containing g objects and n variables, where the ith row (object) represents the mean vector of all objects belonging to the ith group. Although hierarchical clustering is usually applied at the first stage of data exploration, it can lead to many valuable observations and conclusions concerning data structure and variability. Applied to the Saale River data, it makes it possible to separate the polluted river sections, identify the sources of pollutants and define the uniqueness of some sampling sites.
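The reduced g × n matrix of group means mentioned in the conclusions can be sketched as follows. The function name `group_mean_matrix`, the label encoding and the toy data are assumptions made for illustration; in practice the group labels would come from cutting the dendrogram at the level that yields g groups.

```python
def group_mean_matrix(X, labels):
    """Replace each group of objects by its mean vector.

    X is a list of rows (objects); labels[i] is the group of row i.
    Returns the sorted group labels and one mean row per group.
    """
    if len(X) != len(labels):
        raise ValueError("one label per object is required")
    groups = sorted(set(labels))
    means = []
    for g in groups:
        members = [row for row, lab in zip(X, labels) if lab == g]
        # Column-wise mean over all objects assigned to group g.
        means.append([sum(col) / len(members) for col in zip(*members)])
    return groups, means


# Hypothetical toy data: three objects, two variables, two groups.
X_small = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels_small = [0, 0, 1]
groups, mean_rows = group_mean_matrix(X_small, labels_small)
```

The resulting matrix has one row per group and can be displayed in the same way as the full sorted matrix, with the columns still arranged according to ‘varorder’.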
References
[1] D.L. Massart, L. Kaufman, The Interpretation of Analytical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
[2] W. Vogt, D. Nagel, H. Sator, Cluster Analysis in Clinical Chemistry; A Model, Wiley, New York, 1987.
[3] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data; An Introduction to Cluster Analysis, Wiley, New York, 1990.
[4] H.C. Romesburg, Cluster Analysis for Researchers, Lifetime Learning Publications, Belmont, CA, 1984.
[5] D. Truckenbrodt, O. Kampe, J.W. Einax, Analytik und Bewertung des Belastungszustandes der Saale, Ilm und Unstrut, Vom Wasser 87 (1996) 29–38.
[6] D. Truckenbrodt, O. Kampe, J.W. Einax, Zur aktuellen Belastungssituation der Saale, Ilm und Unstrut, in: F. Karlruhe (Ed.), Die Belastung der Elbe: Teil I. Elbenebenflüsse, BMBF, Karlsruhe, 1995, pp. 57–68.
[7] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part B, Elsevier, Amsterdam, 1998.
[8] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, 2nd ed., Wiley-VCH, Weinheim, 1999.
[9] J.W. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18 (1969) 401–409.
[10] C.M. Bishop, M. Svensen, C.K.I. Williams, GTM: the generative topographic mapping, Neural Computation 10 (1) (1998) 215–234.