Physics and Chemistry of the Earth 35 (2010) 309–315

Some recent developments in cluster analysis

Ian T. Jolliffe a,*, Andreas Philipp b

a University of Exeter, 30 Woodvale Road, Gurnard, Cowes, Isle of Wight PO31 8EG, UK
b University of Augsburg, Universitätsstrasse 10, 86135 Augsburg, Germany

Article history: Received 20 February 2009; received in revised form 14 July 2009; accepted 23 July 2009; available online 26 July 2009.

Keywords: Agglomerative clustering; Divisive clustering; k-Means; Model-based clustering; Time-constrained clustering; Unsupervised classification

Abstract

Cluster analysis has been used for many years in weather and climate research but most applications have concentrated on a handful of techniques. However, the subject is vast with many methods scattered across a number of literatures, and research on new variants continues. This paper describes a few of these new developments, concentrating on those that have appeared in the statistical and atmospheric science literatures. Their relevance to applications in weather and climate is discussed.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Cluster analysis has been used for many years in weather and climate research, but most applications have concentrated on a handful of techniques. For example, the hierarchical Ward's method is frequently encountered, and the favourite non-hierarchical method is k-means. There are many other methods scattered across a number of literatures, and research on new varieties continues. This paper describes a few of these new developments, concentrating on those that have appeared in the statistics and atmospheric science literatures. Section 2 gives an introduction to cluster analysis, distinguishing four main types of methodology. Section 3 describes a few new methods from the recent literature in some detail, together with a new technique developed by the current authors. The relevance of the new developments to applications in weather and climate is discussed. Some concluding remarks are made in Section 4.

2. Types of cluster analysis

Before describing the main types of cluster analysis we first give some indication of the size of the subject. Cluster analysis is a subset of classification. It is distinguished from some other types of classification by being objective, in the sense that the data are input to a clustering algorithm and a set of groups or clusters is produced automatically. Although there may be 'tuning parameters' to choose for some methods, which introduce a degree of subjectivity, it is the algorithm that produces the groups, not some individual who looks at and interprets the data. In some disciplines, cluster analysis is known as unsupervised classification.

Cluster analysis has been developed and is used in many disciplines, and there is much ongoing research. The International Federation of Classification Societies (http://www.classification-society.org/) links 14 national or regional classification societies, organises biennial conferences, and supports two journals. A substantial proportion of the activities of these societies is concerned with cluster analysis. There is also plenty of freely available software. For example, the R project (http://www.r-project.org/) has cluster and stats packages, as well as more specialised packages. In the Web of Knowledge (http://www.isiwebofknowledge.com/), cluster analysis has 2673 hits in 2008; unsupervised classification has 86 and classification has 28,520. Of course, many of these 'hits' correspond to straightforward applications of existing methods of cluster analysis, but there are new developments in a number of literatures. Space dictates that only a small fraction of this material can be covered in this article.

We suppose in what follows that we have n objects, perhaps days, and for each object p variables are measured, perhaps sea level pressure at p gridpoints. We wish to partition the n objects into a number of clusters, such that objects in the same cluster are similar to each other, whilst objects from different clusters are dissimilar. Cluster analysis can be divided into four main types.


2.1. Agglomerative hierarchical techniques

These start with all n objects to be clustered in separate clusters, which are then successively merged. At each stage the two clusters that are closest, according to some measure of similarity or dissimilarity, are merged, until eventually all objects are in the same cluster. Well-known agglomerative hierarchical methods include Ward's method, single-link (nearest neighbour), complete-link (furthest neighbour) and various sorts of average link. The results are usually displayed in a dendrogram or tree diagram, from which a cluster solution with any number of clusters can be found. Agglomerative hierarchical clustering methods have been fairly widely used in weather and climate applications; one of a number of recent examples is Bednorz (2008). The techniques are easy to use but have the disadvantages that different methods can give very different results, and that once two objects have been merged they can never be separated. Thus an early borderline decision on which clusters to merge can never be reversed.

2.2. Divisive hierarchical techniques

These adopt the opposite strategy to agglomerative techniques. They start with one cluster containing all n objects to be clustered and successively split an existing cluster in such a way that the two resulting clusters are as different as possible according to some dissimilarity measure. Eventually, there are n clusters, each containing a single object. As with agglomerative methods, the results can be represented by a dendrogram. The disadvantages of divisive methods are similar to those of agglomerative methods, and they are computationally more demanding if the whole dendrogram is required. Divisive methods had a following during the first major burst of clustering activity in the 1960s. Since then, there has been less on divisive methods in the literature than for other types of clustering, possibly because of the greater computational requirements when a full dendrogram is required. An exception (Li, 2006), which develops a new variant of divisive clustering, will be discussed in Section 3.
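To make the agglomerative approach of Section 2.1 concrete, here is a minimal sketch in Python using SciPy; the language and the toy data are assumptions for illustration only (the paper itself points to R software such as the cluster and stats packages). It builds a Ward's-method tree and cuts it to a three-cluster solution.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: n = 60 'days', p = 4 variables (e.g. pressure at 4 gridpoints),
# drawn from three artificial regimes so that some cluster structure exists.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(20, 4)) for m in (-3.0, 0.0, 3.0)])

# Agglomerative clustering with Ward's method; 'single', 'complete' or
# 'average' give the other linkages mentioned in Section 2.1.
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain a three-cluster solution.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree if matplotlib is available.

Because the full tree is retained, solutions for any other number of clusters can be read off from Z without re-running the algorithm.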

2.3. Global optimisation techniques

In these techniques a criterion is defined which can be calculated for any partition of a data set into k groups or clusters, and which measures how good the partition is at producing homogeneous and well-separated clusters. Most such criteria are based on within-cluster and between-cluster sums of squared dissimilarities between objects. A good partition will have large values for the between-group sum and small values for the within-group sum. The aim then is to find, for any given k, the partition that optimises the chosen criterion. Techniques based on such sums of squares are often referred to as k-means methods, and are possibly the most popular type of clustering method. The Web of Knowledge has 510 hits for k-means in 2008. Recent examples from the climate literature include Moron et al. (2008) and Philipp et al. (2007).

The main disadvantage of global optimisation methods is that the optimisation is seldom genuinely global. The goal is that the best k-cluster partition, according to the criterion, should be found, but unless the number of objects to be clustered is small, it is computationally impossible to search more than a tiny fraction of the possible partitions. k-means algorithms are typically iterative, starting with an initial k-cluster partition and moving objects between groups according to certain rules, in order to improve the chosen criterion. Different k-means algorithms allow different types of moves. An algorithm stops when there are no allowable moves that lead to an improvement in the criterion. The criterion typically has many local optima and, unless very complicated moves are allowed, the algorithms often end up in one of these local optima. k-means algorithms need a starting partition, and different starting partitions can lead to different solutions. Multiple starts for the algorithm increase the chance that one or more of them will lead to the global optimum, but there is no guarantee.
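A minimal sketch of the k-means idea, again in Python with scikit-learn and invented data (an assumption; the climate studies cited used their own implementations). The n_init argument requests multiple random starting partitions; the best run is kept, which, as noted above, raises the chance of approaching the global optimum without guaranteeing it.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Toy data: n = 90 objects, p = 5 variables, with three artificial regimes.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 5)) for m in (-2.0, 0.0, 2.0)])

# Standard k-means with k = 3. n_init is the number of random starting
# partitions; the run with the smallest within-cluster sum of squares
# (reported as inertia_) is retained.
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(round(km.inertia_, 1), np.bincount(km.labels_))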

2.4. Model-based techniques

In these techniques the distributional form of the p variables for each cluster is assumed known, often a multivariate Gaussian distribution. Unknown parameters of these distributions are then estimated, along with the optimal allocation of objects to clusters, using statistical techniques such as maximum likelihood. There has been a fair amount of activity on these techniques recently in the statistical literature – see, for example, Raftery and Dean (2006) – and they have also been used in atmospheric science (Smyth et al., 1999). To use this approach it is necessary to specify, for each cluster, the distributional form of the p measured variables, in other words the 'shapes' of the clusters. This can be viewed as both an advantage and a disadvantage compared to other types of cluster analysis. It is a disadvantage because the assumptions about 'shape' will usually be at best approximately true. Conversely, those assumptions are explicit, whereas other methods will often preferentially find certain shapes of cluster but this behaviour is hidden.
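A hedged sketch of model-based clustering, using a Gaussian mixture fitted by maximum likelihood (EM) in scikit-learn; this is a generic illustration, not the software used by Raftery and Dean (2006) or Smyth et al. (1999), and the data are invented. The covariance_type argument controls the assumed cluster 'shapes' discussed above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy data: one elongated and one spherical Gaussian cluster.
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=100),
    rng.multivariate_normal([6.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=100),
])

# Fit a two-component Gaussian mixture by maximum likelihood.
# covariance_type="full" lets each cluster have its own shape;
# "spherical" or "tied" impose the kinds of restrictions mentioned above.
gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gm.predict(X)              # hard allocation of objects to clusters
print(round(gm.bic(X), 1), np.bincount(labels))   # BIC is often used to compare numbers of clusters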

3. Some recent developments

As noted earlier, the number of developments discussed here is small compared to the amount of research on cluster analysis that is being conducted. We look at one divisive hierarchical method, two modifications of k-means clustering and three approaches that allow different variables to be treated differently from each other in a cluster analysis.

3.1. A new divisive hierarchical method

Li (2006) describes a new method for divisive hierarchical clustering. The method has links both to the modelling approach to cluster analysis and to linear discriminant analysis. The latter is a well-known statistical technique (McLachlan, 1992) used to analyse data that contain known groups, unlike cluster analysis where the groups are unknown. At each division Li (2006) finds a split into two groups based on a least squares criterion predicting group membership from a linear function of the measured variables. Group membership and the coefficients in the linear function are optimised simultaneously. In some cases only a subset of the measured variables differs between clusters; Li's technique can be extended to select a subset of variables and to use nonlinear functions.

Li (2006) argues that a divisive method may be better than an agglomerative one when the number of desired clusters is much smaller than the number of objects to be clustered. This is often the case in weather and climate applications, for instance in classifying a large number of synoptic maps into a small number of clusters, so a divisive technique (not necessarily Li's) could be attractive. The argument is that because a smaller number of steps is needed to get to the desired level of the dendrogram via division than via agglomeration, there are fewer opportunities for 'bad' irreversible steps. There will also be computational advantages for divisive, compared to agglomerative, techniques when the number of levels of the dendrogram that need to be explored is much larger for the latter.
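For illustration of the divisive strategy only, the sketch below repeatedly bisects the largest current cluster with 2-means (so-called bisecting k-means). This is not Li's (2006) clustering-function-based method; it merely shows, under invented data, that a small number of splits reaches a small number of clusters, in line with the argument above.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, random_state=0):
    # Generic divisive clustering by repeated 2-means splits of the
    # largest current cluster. NOT Li's (2006) method; illustration only.
    labels = np.zeros(len(X), dtype=int)
    for next_label in range(1, n_clusters):
        sizes = np.bincount(labels)
        target = int(np.argmax(sizes))            # split the largest cluster
        idx = np.where(labels == target)[0]
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        labels[idx[km.labels_ == 1]] = next_label
    return labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (-4.0, 0.0, 4.0)])
print(np.bincount(bisecting_kmeans(X, 3)))        # sizes of the three clusters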


3.2. Variations of k-means

As already mentioned, k-means has 510 hits in the Web of Knowledge in 2008 alone. Many of these papers are simply applications of k-means in different disciplines, but a non-trivial number suggest modifications to the basic algorithm. In the climate literature, Strauss et al. (2007) say in their abstract that they use a modification of k-means, but the modification is in the pre-processing of the data, not in the basic k-means methodology. Philipp et al. (2007) modify the algorithm used in k-means to enhance the chances of getting close to the global optimum. Their technique, which incorporates the numerical techniques of simulated annealing and diversified randomisation, is known as SANDRA.

A recent modification from the statistics literature (Cuesta-Albertos et al., 2008) uses the concept of trimmed k-means. Although the technical details are complex, the idea itself is simple – the clustering is done only on a proportion (1 − α) of observations that lie in the 'core' of the k clusters. We thus discard observations that are peripheral and might obscure the cluster structure. This could certainly be useful for weather and climate data, where boundaries between clusters are often ill-defined. To implement this idea, a set of k spherical regions that contain at least a proportion (1 − α) of the observations is found such that the within-cluster sum of squared distances between observations is as small as possible for such regions. Optimal trimming and cluster membership are found simultaneously, resulting in so-called trimmed k-means clusters. There is an R package (trimcluster) that finds these clusters. The value of k must be specified, and α can also be chosen, as can the number of different starting partitions and the maximum number of iterations associated with the iterative procedure. The default value for α is 0.1.

In fact, trimmed k-means is only the starting point in Cuesta-Albertos et al.'s algorithm. They use the clusters from this stage in a mixture model, with the trimmed points treated as censored (i.e. their values are unknown, but it is known that they are outside the core area). The parameters in the model are estimated by maximum likelihood, assuming multivariate Gaussian or multivariate t distributions. The ellipsoids produced by this model are then used to come up with a new trimming, and the process is iterated. Hence, although the method starts with trimmed k-means, its ultimate objective is a trimmed mixture model, treating peripheral data as 'censored'.
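The following sketch is a crude approximation of the trimming idea: it alternates a standard k-means fit on the 'core' points with discarding the fraction α of points furthest from their centroids. The genuine trimmed k-means of Cuesta-Albertos et al. (2008) optimises trimming and cluster membership jointly and is available in the R package trimcluster; the Python code and toy data here are assumptions for illustration only.

import numpy as np
from sklearn.cluster import KMeans

def crude_trimmed_kmeans(X, k, alpha=0.1, n_iter=5, random_state=0):
    # Alternate: fit k-means on the current core, then re-define the core as
    # the (1 - alpha) fraction of points closest to their nearest centroid.
    core = np.ones(len(X), dtype=bool)
    for _ in range(n_iter):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[core])
        d = km.transform(X).min(axis=1)              # distance to nearest centroid
        core = d <= np.quantile(d, 1.0 - alpha)      # keep the (1 - alpha) core
    return np.where(core, km.predict(X), -1)         # -1 marks trimmed (peripheral) points

rng = np.random.default_rng(6)
# Two tight clusters plus a handful of scattered 'peripheral' points.
X = np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in (-2.0, 2.0)] + [rng.uniform(-6, 6, size=(8, 2))])
labels = crude_trimmed_kmeans(X, k=2)
print(int(np.sum(labels == -1)), np.bincount(labels[labels >= 0]))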
3.3. Varying the importance of different variables

If large numbers of variables are measured, it is likely that some will be more important in determining clusters than others. Three types of method that allow the importances of different variables to vary will be described next. In the first type, importances can vary continuously, whereas in the second the clustering is based on only a subset of variables, so that the omitted variables have zero importance. In the third type, an additional variable (time) is taken into account in order to constrain the clustering, and the importance of this variable can be varied.

3.3.1. Varying importances – COSA

Friedman and Meulman (2004) introduced a technique that allows the importances of different variables in a cluster analysis to vary continuously. They call it COSA (Clustering Objects on Subsets of Attributes). Software is available in R at http://www-stat.stanford.edu/~jhf/COSA.html. COSA finds clusters that optimise a criterion based on minimising 'distances' within clusters, where distances are defined by allowing different attributes (variables, gridpoints) to have different weights in different clusters. It is therefore rather like a k-means method, but with different variables having different weights when calculating within-cluster distances.


Minimisation is with respect to both cluster structure and the weights within each cluster – a complex optimisation problem. There are a number of tuning parameters and options that may be chosen, giving the method a large degree of flexibility and subjectivity. In the classification of synoptic maps, the idea that different variables (geographical areas) might have different importances in determining different clusters is appealing. For example, a cluster corresponding to a blocking anticyclone in one geographical area will be largely determined by different variables than a cluster dominated by strong zonal flow. However, personal experience with the method suggests that its flexibility may be a drawback. Different choices of tuning parameters and options lead to different clusters, and there is no clear guidance on how to make those choices. Although similar choices need to be made for other clustering methods, the problems associated with making them seem especially difficult for COSA.

3.3.2. Variable selection

COSA allows the importances of variables to vary continuously. An alternative is the all-or-nothing approach, where a variable is either used or not used; in other words, a subset of variables is selected with which to do the clustering. A motivation for this strategy is that variables which are irrelevant to the clustering introduce noise and make it more difficult to see the structure. Raftery and Dean (2006) incorporate variable selection into a model-based clustering technique. Variables are selected step-wise, added one at a time if they improve a Bayesian Information Criterion (BIC) and deleted one at a time if deletion does not degrade BIC. At each stage BIC is optimised with respect to the number of clusters and the parameters in the model, giving another complex optimisation problem. There is an R package, clustvarsel, that implements Raftery and Dean's technique. Because it is a model-based technique, an assumption must be made about the probability distribution within clusters (usually multivariate Gaussian). In clustvarsel it is necessary to specify the maximum number of clusters, which is straightforward, but also the model structure, including any relationships between parameters in the different clusters such as equal covariance matrices, which is restrictive. Results from simulations, and for a data set on crabs which has been widely used as a test data set for clustering methods, are impressive. However, Fraiman et al. (2008) provide examples where Raftery and Dean's method does not work well.

Fraiman et al. (2008) suggest two procedures for variable selection in cluster analysis. The first, like that of Raftery and Dean (2006), concentrates on identifying and eliminating noisy, uninformative variables, whereas the second also deals with multicollinearity and dependence between variables. Results from simulation studies and some real data examples are again very promising.
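As an illustration of the all-or-nothing idea (not Raftery and Dean's (2006) actual procedure, which compares clustering and non-clustering models more carefully, nor the clustvarsel package), the following sketch greedily adds variables to a Gaussian-mixture clustering whenever doing so improves BIC; the data, with two informative and three noise variables, are invented.

import numpy as np
from sklearn.mixture import GaussianMixture

def greedy_bic_selection(X, max_clusters=4, random_state=0):
    # Greedy forward selection of variables for model-based clustering,
    # scored by BIC (lower is better in scikit-learn). Simplified illustration only.
    selected, remaining = [], list(range(X.shape[1]))
    best_bic = np.inf
    while remaining:
        scores = []
        for v in remaining:
            cols = selected + [v]
            # Best BIC over candidate numbers of clusters for this subset.
            bic = min(
                GaussianMixture(n_components=g, random_state=random_state)
                .fit(X[:, cols]).bic(X[:, cols])
                for g in range(1, max_clusters + 1)
            )
            scores.append((bic, v))
        bic, v = min(scores)
        if bic >= best_bic:            # stop when no variable improves BIC
            break
        best_bic = bic
        selected.append(v)
        remaining.remove(v)
    return selected

rng = np.random.default_rng(4)
informative = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in (-3.0, 3.0)])
noise = rng.normal(0.0, 1.0, size=(100, 3))
X = np.hstack([informative, noise])
print(greedy_bic_selection(X))         # expected to pick (a subset of) columns 0 and 1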
3.3.3. Time-constrained clustering

Often in weather and climate applications the objects to be clustered are related spatially or temporally, but this information is not directly taken into account. Intuitively, if we are classifying daily synoptic maps, for example, we might wish to make it more likely that two days with the same level of dissimilarity are put into the same cluster if they are consecutive in time than if they are well separated. This implies that the clustering should be constrained in some way to take account of temporal (or spatial) proximity.

The general idea of imposing constraints in clustering is not new (Gordon, 1981, Section 4.3), but it does not seem to have been much developed or used. What follows describes a simple way to implement one form of constraint, illustrated with an example.


Fig. 1. Mean sea level pressure patterns of seven clusters of reconstructed daily maps between 1850 and 2003 for January and February, derived by unconstrained clustering (left column) and by time-constrained clustering using the value 2 × 10⁵ for the weighting factor λ (right column). Dashed contour lines indicate pressure above, solid lines below, the spatial average. Contour interval is 5 hPa.


Suppose that Dv(ij) is the dissimilarity between objects i and j based on the original variables, and let Dt(ij) be a dissimilarity between objects i and j based on their separation in time. One possibility, which will be used in the example below, is Dt(ij) = 1 − exp(−|ti − tj|). A new dissimilarity measure combining these two can then be constructed as Dλ(ij) = Dv(ij) + λDt(ij), where λ is a tuning parameter, and some chosen form of cluster analysis is then performed using this measure. When λ = 0 time is ignored, and as λ → ∞ clusters become completely contiguous in time. By varying λ a suitable compromise can be found.

As an example, cluster analysis is carried out on daily sea level pressure maps for January/February 1850–2003. Values are available at 250 gridpoints with 5° spacing, 70°W to 50°E and 25°N to 70°N. The data were compiled as part of the EMULATE project by Ansell et al. (2006). The clustering method used was SANDRA, a version of k-means due to Philipp et al. (2007) mentioned above. In the case of k-means algorithms the original distance is the Euclidean distance Dv(ij) between a single object i and a cluster centroid j, while the time-related distance is calculated from the average time difference between object i and all objects in cluster j, weighted and added to Dv(ij). The example was carried out on dissimilarities Dv(ij) based on sea level pressure alone (λ = 0) and on Dλ(ij) for varying values of λ.

Fig. 1 shows the mean pressure maps for a seven-cluster solution with unconstrained clustering (left column). The assignment of the first 50 days in the time series of 9123 days to these clusters is shown on the left of Fig. 2. Thus, day 1 is assigned to cluster 1, day 2 to cluster 4, day 3 to cluster 5, and so on. Although there are a few periods when the same cluster is assigned for more than a few days, there are many instances, as at the beginning of the sequence, when there is no continuity of cluster assignment for several days in a row.
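A minimal sketch of the combined dissimilarity Dλ(ij) = Dv(ij) + λDt(ij) defined above. For simplicity the combined matrix is fed to average-link hierarchical clustering rather than to SANDRA, and the data, the value of λ and the seven-cluster cut are invented; only the form of the dissimilarity follows the text.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
n_days, n_grid = 200, 25                      # toy stand-in for 9123 days x 250 gridpoints
X = rng.normal(size=(n_days, n_grid))         # 'pressure' values
days = np.arange(n_days, dtype=float)         # day index t_i

# Dissimilarity from the variables (Euclidean) and from time separation.
Dv = squareform(pdist(X, metric="euclidean"))
Dt = 1.0 - np.exp(-np.abs(days[:, None] - days[None, :]))

lam = 5.0                                     # tuning parameter lambda
Dlam = Dv + lam * Dt                          # D_lambda(ij) = Dv(ij) + lambda * Dt(ij)

# The paper applies the idea within a k-means variant (SANDRA); here the combined
# dissimilarity is simply passed to average-link hierarchical clustering.
Z = linkage(squareform(Dlam, checks=False), method="average")
labels = fcluster(Z, t=7, criterion="maxclust")
print(np.bincount(labels)[1:])                # sizes of the seven clusters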

Fig. 2. Assignment of the first 50 days of the input dataset (starting at the top) to 7 clusters for unconstrained clustering (left column, λ = 0) and constrained clustering (right column, λ = 2 × 10⁵). Each day is represented by a box (date given on the right), while greyscale and the numbers in the boxes indicate the cluster for that day.

Fig. 3. Mean sea level pressure maps of the first 6 days of the input dataset. Dashed contour lines indicate pressure above, solid lines below the spatial average. Contour interval is 5 hPa.


Thus, for example, the first six days are all in different clusters, even though they are only changing gradually from one day to the next and therefore show some similarity, as demonstrated in Fig. 3, suggesting that some neighbouring days should be placed together in a common cluster.

The results for time-constrained clustering with λ = 2 × 10⁵ are given on the right-hand side of Figs. 1 and 2. The actual value of λ here is unimportant; the reason it is so large is simply that 'times' are much smaller numbers than 'pressures' in the units chosen, so that λ needs to be large in order for λDt(ij) to be comparable to Dv(ij). The centroid patterns of the seven time-constrained clusters in Fig. 1 (right column) are different from the unconstrained clusters (left column), but close inspection shows that, although they are in a different order, some similar clusters can be seen in both cases. The right-hand bar in Fig. 2 shows that clusters tend to consist of longer periods of consecutive days than in the unconstrained case. Thus, to revisit the above example of the first six days, they are now allocated consecutively to three clusters, instead of six.

The necessity of choosing the tuning parameter is a disadvantage of time-constrained clustering, though most other clustering methods involve subjective choices too, albeit sometimes hidden.

Various diagnostics are available to aid the choice of λ, and some of these are shown in Fig. 4. At the top is the within-cluster sum of squares, which increases with λ; in the middle is the average length of period for which days stay in the same cluster, which also increases with λ; at the bottom is the number of occasions on which a cluster is visited for only one day, which decreases with λ. Ideally, we would like some sort of 'elbow' in these plots to indicate a good value for λ. In practice, as here, no such elbow may be evident, and we need to try two or three values of λ and see which set of clusters gives the most appealing interpretation. In our example it might be thought that the mean length of time spent in a cluster is too large for the illustrated value of λ, so that a smaller value might be preferred. Notwithstanding the difficulties associated with choosing λ, the jumping around between clusters evident in the left bar of Fig. 2 is undesirable, and a form of time-constrained clustering is a promising way of reducing this phenomenon.
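Two of the diagnostics described above (the average run length and the number of one-day visits) are easy to compute from a label sequence; the sketch below uses invented label sequences purely to show the calculation. The within-cluster sum of squares would additionally require the original data.

import numpy as np

def constraint_diagnostics(labels):
    # Split the label sequence into runs of identical consecutive values,
    # then return the mean run length and the number of one-day visits.
    labels = np.asarray(labels)
    change = np.flatnonzero(np.diff(labels)) + 1
    run_lengths = np.array([len(r) for r in np.split(labels, change)])
    return run_lengths.mean(), int(np.sum(run_lengths == 1))

# Hypothetical label sequences for two values of lambda (illustration only).
unconstrained = [1, 4, 5, 2, 7, 3, 3, 1, 1, 6, 2, 2]
constrained   = [1, 1, 1, 4, 4, 4, 4, 7, 7, 7, 2, 2]
print(constraint_diagnostics(unconstrained))   # short runs, many one-day visits
print(constraint_diagnostics(constrained))     # longer runs, fewer one-day visits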

Fig. 4. Indicators of the effect of increasing the weighting factor λ (abscissa of the diagrams on a logarithmic scale) on the sum of squared within-cluster differences (top), the average period length of consecutive occurrences of the clusters (middle) and the number of days with different clusters occurring the day before and after (bottom).


4. Concluding remarks

This paper has indicated the breadth of the subject of cluster analysis, summarised the main types of analysis, and outlined a small number of recent developments in the subject. A key question is which version of cluster analysis to use. All have some disadvantages, and before using a method an attempt should be made to find out something about its assumptions (explicit or implicit) and properties, and then decide whether they are appropriate for the data at hand. It must be stressed that clustering techniques will find clusters even when there are no clusters in the data – see Christiansen (2007), for example. Partitioning large data sets into more homogeneous subsets may be useful even in such cases, so cluster analysis may still have a role, but it must be remembered that the clusters found are not necessarily 'real' separate groups. Which dissimilarity measure to use is another decision that needs careful thought, as does the choice of how many clusters. In the latter case, some methods, such as Raftery and Dean's, decide automatically for you, and there is no shortage of suggestions for how to do it – see, for example, Sugar and James (2003).

Acknowledgements

This paper is based on a keynote talk presented at the COST 733 mid-term conference in Krakow in October 2008. We are grateful to the organisers, especially Radan Huth, Zbigniew Ustrnul and Agnieszka Wypych, for the invitation to give the talk and for encouragement to write it up. Comments from a reviewer helped to improve the final version.

References

Ansell, T.J., Jones, P.D., Allan, R.J., Lister, D., Parker, D.E., Brunet, M., Moberg, A., Jacobeit, J., Brohan, P., Rayner, N.A., Aguilar, E., Alexandersson, H., Barriendos, M., Brandsma, T., Cox, N.J., Della-Marta, P.M., Drebs, A., Founda, D., Gerstengarbe, F., Hickey, K., Jonsson, T., Luterbacher, J., Nordli, O., Oesterle, H., Petrakis, M., Philipp, A., Rodwell, M.J., Saladie, O., Sigro, J., Slonosky, V., Srnec, L., Swail, V., Garcia-Suarez, A.M., Tuomenvirta, H., Wang, X., Wanner, H., Werner, P., Wheeler, D., Xoplaki, E., 2006. Daily mean sea level pressure reconstructions for the European-North Atlantic region for the period 1850–2003. J. Climate 19, 2717–2742.

Bednorz, E., 2008. Synoptic reasons for heavy snowfall in the Polish-German lowlands. Theor. Appl. Climatol. 92, 133–140.

Christiansen, B., 2007. Atmospheric circulation regimes: can cluster analysis provide the number? J. Climate 20, 2229–2250.

Cuesta-Albertos, J.A., Matrán, C., Mayo-Iscar, A., 2008. Robust estimation in the normal mixture model based on robust clustering. J. Roy. Stat. Soc. B 70, 779–802.

Fraiman, R., Justel, A., Svarc, M., 2008. Selection of variables for cluster analysis and classification rules. J. Am. Stat. Assoc. 103, 1294–1303.

Friedman, J.H., Meulman, J.J., 2004. Clustering objects on subsets of attributes (with discussion). J. Roy. Stat. Soc. B 66, 815–850.

Gordon, A.D., 1981. Classification. Chapman & Hall, London.

Li, B., 2006. A new approach to cluster analysis: the clustering-function-based method. J. Roy. Stat. Soc. B 68, 457–476.

McLachlan, G.J., 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Moron, V., Robertson, A.W., Ward, M.N., Ndiaye, O., 2008. Weather types and rainfall over Senegal. Part 1: observational analysis. J. Climate 21, 266–287.

Philipp, A., Della-Marta, P.M., Jacobeit, J., Fereday, D.R., Jones, P.D., Moberg, A., Wanner, H., 2007. Long-term variability of daily North Atlantic-European pressure patterns since 1850 classified by simulated annealing clustering. J. Climate 20, 4065–4095.

Raftery, A.E., Dean, N., 2006. Variable selection for model-based clustering. J. Am. Stat. Assoc. 101, 168–178.

Smyth, P., Ide, K., Ghil, M., 1999. Multiple regimes in Northern Hemisphere height fields via mixture model clustering. J. Atmos. Sci. 56, 3703–3723.

Strauss, D.M., Corti, S., Molteni, F., 2007. Circulation regimes: chaotic variability versus SST-forced predictability. J. Climate 20, 2251–2272.

Sugar, C.A., James, G.M., 2003. Finding the number of clusters in a dataset. J. Am. Stat. Assoc. 98, 750–763.